4/13/2016

Why Should You Do Microservices (or maybe you shouldn’t)

Filed under: — Aviran Mordo

Microservices architecture is really hyped these days (I should know, I have been talking about it at many conferences); however, not much has been written about the actual reason for doing microservices in the first place.

In the stories I tell in my public talks I try to explain that microservices architecture comes to solve a problem, and the main issue it comes to solve is SCALE, but not the scale that you think. Microservices mainly solve engineering scale.

We all know that small teams work faster and better than large teams. The bigger your team and your project get, the more you end up with a huge monolith where many people work on the same code base. It becomes very hard to release a version, as you need to synchronize the work of many people and package it into a releasable version.

By breaking your monolith into small microservices you allow the creation of small engineering teams that can release and deploy on their own time with loose coupling between different artifacts and other teams.

Another great benefit you gain is the ability to roll back small changes without affecting other areas of your system. If you have a monolith it is almost impossible to roll back a bad version, because it bundles many features; if one feature is bad you cannot roll it back without essentially rolling back all the other new features. If you break it into microservices you decouple these parts, and each can deploy and roll back without affecting the entire system.

With microservices you basically increase your development velocity and can scale your engineering teams by giving each team a set of microservices which they own and are responsible for.

Another scalability problem is different SLAs for different parts of your system. You may have parts of your system that need to be highly performant and highly available, running in multiple data centers or zones, while other parts have lower requirements for performance and availability, for instance off-line batch processing.
If you have one monolith you have to scale the entire system based on your highest SLA, which can be costly.
With microservices you can split these services and assign different SLAs to different parts of your system, thus reducing your operational cost. You can also use different middleware for different parts of your system, choosing the best solution for each problem.

The third reason for doing microservices is risk management. If you have a monolith and you have issues with it in production, whether it is a production issue, a bad deployment, or simply a bug, it can bring your whole system down. With microservices being independent and decoupled, you only have partial downtime for the affected microservice, and you get a degradation of service instead of complete downtime.

Now don't get me wrong, microservices are a great solution, but they come to solve a problem and bring many other issues and complexities of their own. If you are totally fine with a simple monolith, stay with it. When you feel the (scalability) pains of having a monolith, then microservices can help you solve some of them, but be prepared for the different pains of running a distributed system ;-)

1/28/2016

Best practices for scaling with microservices and DevOps

Filed under: — Aviran Mordo

Wix.com is a highly successful cloud-based web development platform that has scaled rapidly. We now support around 80 million website builders in more than 190 countries. This translates into approximately 2 petabytes of user media files and adds about 1.5 terabytes per day. So how did we get there? The 400-strong Wix engineering team used a microservices architecture and MySQL, MongoDB, and Cassandra. We host our platform in three data centers, as well as on the cloud using Amazon Web Services (AWS) and the Google Cloud Platform.

I’ve been working with Wix since 2010 and oversaw the engineering team’s transition from a traditional waterfall development-based approach to agile methodologies and helped introduce DevOps and Continuous Delivery. Here’s what I learned about using microservices and MySQL to effectively support a fast-scaling environment.

How Wix defines microservices

Wix currently has around 200 microservices, but we didn’t start out on this path. During our days supporting 1 million sites back in 2008, we used a single-monolith approach with Java, Hibernate, Ehcache, Tomcat, and MySQL. This typical scale-up approach was useful in some aspects, but ultimately, we couldn’t tolerate the downtime caused by poor code quality and interdependencies inside the monolith. So, we gradually moved to a service-level-driven architecture and broke down our monolith.

By our definition, a microservice must be small enough that a single team can manage a few microservices, and the team must be able to describe each microservice's responsibility in one clear sentence.

Specifically, a microservice is a single application deployed as a process, with one clear responsibility. It does not have to be a single function or even a single class. Each microservice writes only to its own database, to keep things clean and simple. The microservice itself has to be stateless to support frequent deployments and multiple instances, and all persistent states are stored in the database.

Wix’s four sets of microservices

Our architecture involves four main groups of services:

Wix Editor Segment: This set of microservices supports creating a website. The editor is written in JavaScript and runs in a browser. It saves a JSON representation of the site to one of the editor services, which in turn stores the JSON in MySQL and then into the Wix Media Platform (WixMP) file system. The editor back-end services also use the Jetty/Spring/Scala stack.

Wix Public Segment: This set of microservices is responsible for hosting and serving published Wix sites. It uses mostly MySQL and Jetty/Spring/Scala applications to serve the HTML of a site from the data that the Editor has created. Wix sites are rendered on a browser from JSON using JavaScript (React), or on the Wix Public server for bots.

Wix Media Platform (WixMP): This is an Internet media file system that was built and optimized for hosting and delivering images, video, music, and plain files, integrated with CDNs, SSL, etc. The platform runs on AWS and the Google Cloud Platform, using cloud compute instances and storage for on-the-fly image manipulation and video transcoding. We developed the compute instances software using Python, Go, and C, where applicable.

Verticals: This is a set of applications that adds value to a Wix site, such as eCommerce, Shoutout, or Hotels. The verticals are built using an Angular front end and the Jetty/Spring/Scala stack for the back end. We selected Angular over React for verticals because Angular provides a more complete application framework, including dependency injection and service abstraction.

Why MySQL is a great NoSQL

Our microservices use MySQL, so scaling them involves scaling MySQL. We don’t subscribe to the opinion, prevalent in our industry, that a relational database can’t perform as well as a NoSQL database. In our experience, engineers who make that assumption often ignore the operational costs, and don’t always think through production scenarios, uptimes, existing support options, knowledge base maintenance, and more.

We’ve found that, in most cases, we don’t need a NoSQL database, and that MySQL is a great NoSQL database if used appropriately. Relational databases have been around for more than 40 years, and there is a vast and easily accessible body of knowledge on how to use and maintain them. We usually default to using a MySQL database, and use NoSQL only in the cases where there’s a significantly better solution to the problem, such as if we need a document store or a solution for a high data volume that MySQL cannot handle.

Scaling MySQL to support explosive growth

Using MySQL in a large-scale system can present performance challenges. Here is a top 5 list of things we do to get great performance from MySQL:

Whenever possible, we avoid database-level transactions, since they require databases to maintain locks, which in turn have an adverse effect on performance. Instead, we use logical, application-level transactions, which reduce loads and extract better performance from the databases.

We do not use sequential primary keys because they introduce locks. Instead, we prefer client-generated keys, such as UUIDs. Also, when you have master-master replication, auto-increment causes conflicts, so you have to create key ranges for each instance.

We do not have queries with joins, and only look up or query by primary key or index. Any field that is not indexed has no right to exist. Instead, we fold such fields into a single text field (JSON is a good choice).

We often use MySQL simply as a key-value store. We store a JSON object in one of the columns, which allows us to extend the schema without making database schema changes. Accessing MySQL by primary key is extremely fast, and we get sub-millisecond read time by primary key, which is excellent for most uses. MySQL is a great NoSQL that's ACID compliant (see the sketch after this list).

We are not big fans of sharding because it creates operational overhead in maintaining and replicating clusters inside and across data centers. In terms of database size, we’ve found that a single MySQL instance can work perfectly well with hundreds of millions of records. Having a microservices architecture helps, as it naturally splits the data into multiple databases for each microservice. When the data grows beyond a single instance capacity, we either choose to switch to a NoSQL database that can scale out (Cassandra is our default choice), or try to scale up the database and have no more than two shards.
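
To make two of the patterns above concrete, here is a minimal sketch (not Wix's actual code) that combines client-generated UUID primary keys with a JSON column used as a key-value store; the sites table, its columns, and the connection details are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.UUID;

// Sketch: MySQL used as a key-value store with client-generated keys.
public class SiteStore {

    private final Connection connection;

    public SiteStore(Connection connection) {
        this.connection = connection;
    }

    // The key is generated by the client, not by an AUTO_INCREMENT column,
    // so there are no sequence locks and no master-master conflicts.
    public String insertSite(String siteJson) throws Exception {
        String id = UUID.randomUUID().toString();
        try (PreparedStatement ps = connection.prepareStatement(
                "INSERT INTO sites (site_id, site_json) VALUES (?, ?)")) {
            ps.setString(1, id);
            ps.setString(2, siteJson);   // the whole document lives in one JSON column
            ps.executeUpdate();
        }
        return id;
    }

    // Reads always go by primary key (or another index); no joins.
    public String findSite(String id) throws Exception {
        try (PreparedStatement ps = connection.prepareStatement(
                "SELECT site_json FROM sites WHERE site_id = ?")) {
            ps.setString(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString(1) : null;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        try (Connection c = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/example", "user", "password")) {
            SiteStore store = new SiteStore(c);
            String id = store.insertSite("{\"pages\":[]}");
            System.out.println(store.findSite(id));
        }
    }
}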

Takeaways

It’s entirely possible to manage a fast-growing, scale-out architecture without being a cloud-native, two-year-old startup. It’s also possible to do this while combining microservices with relational databases. Taking a long, hard look at both the development and operational pros and cons of tooling options has served us well in creating our own story and in managing a best-in-class, SLA-oriented architecture that drives our business growth.

Original post: http://techbeacon.com/how-wix-scaled-devops-microservices

7/18/2015

Building a Scalable and Resilient Architecture

Filed under: — Aviran Mordo

This article is a summary of my DevoxxUK talk about microservices:

Like many startups before us, Wix.com started as a monolith application, which was the best architectural solution when we had no scalability and availability concerns. But as time went by and our small startup grew and gained success, it was time to change the architecture from a monolith—which experienced many scalability and stability issues—to a more resilient and scalable architecture.

However, every time you build a scalable system you have to make some tradeoffs between availability, performance, complexity, development velocity, and many more, and you really need to understand your system in order to make the right tradeoffs.

Defining System Architecture and Service Level

These days, microservices are a hot topic. But it is not enough to simply build microservices, you also need to understand the boundaries of each microservice. There are many vague claims about how to determine the boundary and size of a microservice, from “you should be able to describe what your microservice does in one line,” to “it should be the size of the team that supports it.” But there is no correct answer. We find that a good rule of thumb (for services that have databases) is that a service should directly access only a couple of database tables to operate.

One very important guideline we set, which helps us determine the boundaries, is based on the service level (SL) needed for each microservice. When we analyzed our system to see how users interact with Wix, we saw two main patterns. Based on these patterns, we defined two different service levels: one for editing sites (Editor segment) and the other for viewing sites (Public segment).

The Public segment supports viewing websites (it is mostly read-only). We defined the Public segment to have higher service-level requirements, because it is more important that a website be fast and available. The Editor segment is where we have all the microservices responsible for website authoring and management. The Editor segment, while important, does not share the same service-level requirements, because its impact is limited to the site owner editing his site.

Every microservice we build belongs to one of these two segments. Having defined these two different SLs, we also architected our microservice boundaries according to them. We decided that the Editor segment should work in two data centers in an active-standby configuration, where only one data center gets the data writes. However, for the Public segment, we insist on having at least two active data centers (we actually have three), in which all data centers get traffic all the time.

Given this definition, and because the Public segment is mostly read-only data, it became easier to scale the microservices in the Public segment. When a user publishes his site on the web, we copy the data we need from the microservices in the Editor segment to the microservices in the Public segment, denormalizing it to be read-optimized.

As for the Editor segment, because we have a lower requirement of availability, writing to just one location is a much simpler problem to solve than writing to multiple locations and replicating the data (which would require us to resolve all kinds of failure-causing conflicts and to handle replication lags). In theory we designed most of our system to be able to write concurrently to two data centers; however, we’ve currently decided not to activate it, as it requires a lot of operational overhead.

Working with Multiple Cloud Vendors

As part of our Public SL, which requires working in at least two data centers, we also set a requirement for ourselves to be able to work with at least two cloud providers. The two dominant providers that are capable of working at the scale we need are Google and Amazon (we have some services running on Microsoft Azure too, but this is out of scope for this post).

The important lesson we learned by moving to the cloud is that the first thing to do is to invest in the write path, i.e., writing data to the cloud service. Just by writing the data, we discovered many problems and limitations of the cloud providers; for instance throttling, data consistency, and eventually consistent systems, which may take a long time to regain consistency on some occasions.

Eventually consistent storage for uploaded files presented a big challenge for us, because when a user uploads a file, he expects the file to be downloadable immediately. So we had to put caching mechanisms in place to bridge the lag from the moment the data is written to the point it is available to read. We also had to use a cache to overcome throttlers that limited the write rate, and we had to use batch writes as well. The read path is relatively easy; we just needed adapters for each underlying storage.
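
As an illustration only (not Wix's actual implementation), the following sketch shows one way a cache can bridge read-after-write lag: recently uploaded files are served from a short-lived local cache until the eventually consistent store catches up. The CloudStorage interface and the one-minute window are assumptions.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: mask read-after-write lag of an eventually consistent store.
public class UploadCache {

    interface CloudStorage {
        void write(String key, byte[] data);
        byte[] read(String key); // may return stale/no data until the store is consistent
    }

    private static final long TTL_MILLIS = 60_000; // assumed consistency window: one minute

    private static class Entry {
        final byte[] data;
        final long writtenAt;
        Entry(byte[] data, long writtenAt) { this.data = data; this.writtenAt = writtenAt; }
    }

    private final CloudStorage storage;
    private final Map<String, Entry> recentWrites = new ConcurrentHashMap<>();

    public UploadCache(CloudStorage storage) {
        this.storage = storage;
    }

    public void upload(String key, byte[] data) {
        recentWrites.put(key, new Entry(data, System.currentTimeMillis())); // serve reads locally
        storage.write(key, data);                                           // propagate to the cloud
    }

    public byte[] download(String key) {
        Entry cached = recentWrites.get(key);
        if (cached != null) {
            if (System.currentTimeMillis() - cached.writtenAt < TTL_MILLIS) {
                return cached.data;            // within the consistency window: serve from cache
            }
            recentWrites.remove(key);          // old enough that the store should be consistent
        }
        return storage.read(key);
    }
}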

We started with Google Cloud Storage. Once we overcame all the problems with Google’s platform, we began the same process on Amazon by developing a data distribution system that copied data from one cloud provider to another. This way the data is constantly replicated between two different vendors, and we avoid a vendor lock. Another benefit is that in cases where we have issues with the availability or performance of one cloud, we can easily shift traffic to the other, thus providing our customers with the best service possible—even when the infrastructure is out of our control.

Building Redundancy

With this approach of multiple vendors and data centers, we also build a lot of redundancy and fallbacks into our Public segment to reach a high level of availability. For the critical parts of our service, we always employ fallbacks in case there is a problem.

Databases are replicated in and across data centers, and as mentioned previously, our services are running in multiple data centers simultaneously. In case a service is not available for any reason, we can always fall back to a different data center and operate from there (in most cases this happens automatically by customizing the load balancers).

Creating Guidelines for Microservices

To build a fast, resilient, and scalable system without compromising development productivity, we created a small set of guidelines for our engineers to follow when building a microservice. Using these guidelines, engineers consider the segment the microservice belongs to (Public or Editor) and assess the gains versus the tradeoffs.

• Each service has its own schema (if one is needed)
    Gain: Easy to scale microservices based on SL concerns
    Tradeoff: System complexity; performance
• Only one service should write to a specific DB table (or set of tables)
    Gain: Decoupled architecture; faster development
    Tradeoff: System complexity; performance
• Additional read-only services may access the DB if performance is an issue
    Gain: Performance
    Tradeoff: Coupling
• Microservice processes are stateless
    Gain: Easy to scale out (just add more servers)
    Tradeoff: Performance; consistency
• Microservices should be independently deployable
• Cache is not a building block of a service, but an optimization for a real production performance problem

Scaling with Simplicity with MySQL

When building a scalable system, we found that an important factor is using proven technology so that we know how to recover fast if there’s a failure.

One good example is using databases. You can use the latest and greatest NoSQL database, which works well in theory, but when you have production problems, you need to resume activity as fast as possible. Already having the knowledge of how the system works, or being able to find answers on Google quickly, is very important. This is one reason we usually default to using a MySQL database instead of opting for NoSQL databases, unless NoSQL is a better solution to the problem.

However, using MySQL in a large-scale system may have performance challenges. To get great performance from MySQL, we employ a few usage patterns, one of which is avoiding database-level transactions. Transactions require that the database maintain locks, which has an adverse effect on performance.

Instead, we use logical application-level transactions and avoid any database transactions, thus extracting high performance from the database. For example, let’s think about an invoicing schema. If there’s an invoice with multiple line items, instead of writing all the line items in a single transaction, we simply write line by line without any transaction. Once all the lines are written to the database, we write a header record, which has pointers to the line items IDs. This way, if something fails while writing the individual lines to the database, and the header record was not written—as it marks the finalization of the transaction—then the whole transaction fails. The one tradeoff is that you may get orphan rows in the database, which isn’t a significant issue because storage is cheap and you can clean these rows later if you care about the space.
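
Here is a minimal sketch of that invoicing pattern, assuming hypothetical invoice_lines and invoice_headers tables: each row is written with auto-commit and no database transaction, and the header row, written last, is what marks the logical transaction as complete.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Sketch: a logical, application-level transaction without DB transactions.
public class InvoiceWriter {

    private final Connection connection; // assumed to be in auto-commit mode

    public InvoiceWriter(Connection connection) {
        this.connection = connection;
    }

    public String writeInvoice(List<String> lineItemsJson) throws Exception {
        String invoiceId = UUID.randomUUID().toString();
        List<String> lineIds = new ArrayList<>();

        // 1. Write each line item on its own, with no database transaction.
        //    If we crash here, the rows are orphans that can be cleaned up later.
        try (PreparedStatement ps = connection.prepareStatement(
                "INSERT INTO invoice_lines (line_id, invoice_id, line_json) VALUES (?, ?, ?)")) {
            for (String lineJson : lineItemsJson) {
                String lineId = UUID.randomUUID().toString();
                ps.setString(1, lineId);
                ps.setString(2, invoiceId);
                ps.setString(3, lineJson);
                ps.executeUpdate();
                lineIds.add(lineId);
            }
        }

        // 2. Write the header last, pointing at the line item IDs. Only invoices
        //    that have a header row are considered committed; readers start from it.
        String lineIdsJson = lineIds.isEmpty()
                ? "[]"
                : "[\"" + String.join("\",\"", lineIds) + "\"]";
        try (PreparedStatement ps = connection.prepareStatement(
                "INSERT INTO invoice_headers (invoice_id, line_ids_json) VALUES (?, ?)")) {
            ps.setString(1, invoiceId);
            ps.setString(2, lineIdsJson);
            ps.executeUpdate();
        }
        return invoiceId;
    }
}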

We also use MySQL as a NoSQL database, simply as a key-value store. We store a JSON object in one of the columns, which allows us to extend the schema without making database schema changes. Accessing MySQL by primary key is extremely fast, and we found that MySQL is a great NoSQL when you also need consistent writes.

Summary

When developing a large-scale system, everything is a tradeoff. You need to consciously decide which tradeoffs you are willing to make. But in order to do that, you must first understand your system and set the business service level and requirements. This will affect your decisions and architecture.

You can find out more on Yoav Abrahami’s Post here, and on slide-share.

Also, here is a link to the Original Post on Voxxed.

12/28/2013

Kill The Deadlines

Filed under: — Aviran Mordo

I have been building software for over 20 years and have participated in many projects. Usually when you come to write a new feature or start a new project, one of the first things your manager asks you for is a time estimate (if you are lucky), and then a deadline is set for the project's completion.

Once a deadline is set, everybody starts working to meet that date. While setting a deadline helps management plan and gain visibility into what is coming out of the development pipeline, setting a deadline, especially an aggressive one, is a destructive act that in most cases hurts the company in the long run.

Development work consists of much more than writing and testing code; it also consists of research and design. The problem is that while an (experienced) developer can estimate the coding effort, there is no real way to estimate the research phase, the problems you may encounter, or how long the design is going to take (if you want to produce a good design). How can you really estimate how long it will take you to learn something that you don't know?

If deadlines are aggressive, meeting them usually means that developers will start cutting corners: doing a bad design just because you don't have the time to do the right one. If developers are pressed for time, they may stick with a bad design choice just because they don't have time to switch to a better one after realizing their initial design has flaws.

Another thing developers do in order to meet a deadline is cut down on testing, which hurts the quality of their work. Cutting down on automated testing may let you believe the work is progressing at a higher rate, but you will usually find the problems in production and spend much more time stabilizing the system after it is deployed. You might think you met the deadline by shipping the product, but the quality of the product is low and you are going to pay for that in maintenance and reputation.

In addition to all that, working to meet deadlines creates stress on the developers, which is not a healthy environment to be in for long, especially if they move from one deadline to another.

Now don't get me wrong: not setting a deadline does not give a project a free hand to stretch beyond reason. Developers should be aware that their performance is measured, but they should also know that they can take the time to produce a high-quality product by not cutting corners. In most cases a project can be delayed by a few days or weeks without any real consequences to the company, and developers should know that they have the time they need to produce a good product.

In the exceptional case where you do have a deadline and cannot postpone the delivery, you should take into consideration that there will be quality issues and design flaws. After the delivery you should give developers time to complete the missing tests, do the necessary refactoring, and bring the product up to the desired quality.

10/28/2013

Dev Centric Culture - Breaking Down The Walls

Filed under: — Aviran Mordo

We have been doing Continuous Delivery and DevOps at Wix for several years now, and as part of that I have been training my team in the DevOps methodology and changing the company's culture. Last week I was trying to explain to some colleagues how we work at Wix and about our dev culture. While doing so I realized that we are doing much more than just DevOps. What we actually do is something I call "developer centric culture", or in short, "Dev Centric" culture.

What is Dev Centric culture?

Dev Centric culture is about putting the developer in the middle and creating a support circle around them, enabling the developer to create the product you need and operate it successfully.

Dev Centric Circle

Before I go into detail about Dev Centric culture and why it is good, let's look at the problems we have, what we are trying to solve, and why it is more than just DevOps.

DevOps is about breaking down the wall between developers and operations, but after a few years of doing DevOps and continuous delivery we realized that DevOps is not enough. The wall between developers and operations is just one wall out of many. If you really want to be agile (and we do), you need to break down ALL the walls.

Let’s take a look at a traditional development pipeline.

  • Product "tells" engineering what they need to do.
  • A software architect designs the system and "tells" the developers how to build it.
  • Developers write the code (and tests, if they work in TDD).
  • QA checks the developers' product and covers it with more tests.
  • Operations deploys the artifact to production.

So what we have here is a series of walls: a wall between product and engineering, a wall between engineering and QA, and of course a wall between engineering and operations. (Sometimes there is even a wall between the architecture team and the developers.)
(more…)

6/24/2013

Continuous Delivery - Part 7 - Cultural Change

Filed under: — Aviran Mordo

Previous chapter: Backward and forward compatibility

In order for continuous delivery to work, the organization has to go through a cultural change and switch to a Dev-Centric culture.

Continuous delivery puts a lot of responsibility in the hands of the developer, and as such the developer needs to have a sense of ownership. At Wix we say that it is the developer's responsibility to bring a product or feature to production; they are responsible from the feature's inception to the actual delivery of the product and its maintenance in the production environment.

In order to do that several things have to happen.

Know the business:
The developer has to know the business and be connected to the product they are developing. By understanding the product, developers make better decisions and build better products. Developers are pretty smart people and they don't need to have product specs shoved in their face (nobody actually reads them). Our developers work together with the product managers to determine what the feature should be. Remember, while the actual product may have several features bundled together for it to be valuable to the customer, we develop per feature and deploy each one as soon as it is ready. This way both the product manager and the developer get to test and experience each feature in production (it is only exposed to them via a feature toggle) and determine whether it is good enough. This may cause the direction of the product to change from the original plan, so that the next feature developed is different from the one planned.

Take ownership:
Developers take ownership of the whole process, which means that they are the ones who eventually deploy to production. This statement actually changes the roles of several departments. What the company needs to do is remove every obstacle in the developers' way so they can quickly deploy to production on their own.

The operations department will no longer be doing the deployments. What they will do from now on is create the automated tooling that allows developers to deploy on their own.
Operations, together with dev, will create the tooling to make the production environment visible and easy for developers to understand. Developers should not have to open a shell console (ssh) in order to see what is going on with the servers. We created web views for monitored system and application metrics and exceptions.
(more…)

4/21/2013

Continuous Delivery - Part 6 - Backward & Forward Compatibility

Filed under: — Aviran Mordo

Previous Chapter: Startup - Self Test

One very important mindset developers will have to adopt and practice is backward and forward compatibility.

Most production systems do not consist of just one server, but a cluster of servers. When deploying a new piece of code, you do not deploy it to all the servers at once, because part of a continuous deployment strategy is zero downtime during deployment. If you deploy to all the servers at once and the deployment requires a server restart, then all the servers will restart at the same time, causing downtime.

Now think of the following scenario: you write new code that requires a new field in a DTO, and that DTO is written to a database. If you deploy your servers gradually, there will be a period of time when some servers have the new code that uses the new field and some do not. The servers that have the new code will send the new field in the DTO, while the servers that have not yet been deployed will not have the new field and will not recognize it.

Continuous Delivery - Backward & Forward Compatibility

One more important concept is to avoid deployment dependencies, where you have to deploy one set of services before you deploy the other set. Using our previous example, this makes things even worse. Let's say you work with an SOA architecture: you now have some clients that send the new field and some clients that do not. Or you deploy the clients that now send the new field, but you have not yet deployed the servers that can read it, and they might break. You might say, "Well, I will not do that; I will first deploy the server that can read the new field, and only after that will I deploy the client that sends it." However, in continuous deployment, just as easily as you can deploy new code you can also roll back your code. So even if you deploy the server first and then the client, you might later roll back the server without rolling back the client, again creating the situation where clients send unknown fields to the server.
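
To illustrate how a DTO can stay both backward and forward compatible during a rolling deployment, here is a minimal sketch using Jackson; the DTO and its field names are hypothetical. Old code ignores the field it does not know about, and new code tolerates its absence by falling back to a safe default.

import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.annotation.JsonInclude;
import com.fasterxml.jackson.databind.ObjectMapper;

// Sketch: a DTO that survives mixed old/new server versions during rollout.
@JsonIgnoreProperties(ignoreUnknown = true)   // old code won't break on fields it doesn't know
@JsonInclude(JsonInclude.Include.NON_NULL)    // new code doesn't force the field on old readers
public class SiteDto {

    private String siteId;
    private String name;
    private String themeColor; // the newly added field; must be optional with a safe default

    public String getSiteId() { return siteId; }
    public void setSiteId(String siteId) { this.siteId = siteId; }

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    // Callers must handle the default: data written by old code has no themeColor.
    public String getThemeColor() { return themeColor != null ? themeColor : "default"; }
    public void setThemeColor(String themeColor) { this.themeColor = themeColor; }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // JSON written by a new server, read by this class: unknown extra fields are ignored,
        // and a missing themeColor falls back to "default".
        SiteDto dto = mapper.readValue(
                "{\"siteId\":\"123\",\"name\":\"my site\",\"themeColor\":\"blue\"}", SiteDto.class);
        System.out.println(dto.getName() + " / " + dto.getThemeColor());
    }
}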
(more…)

4/14/2013

Continuous Delivery - Part 5 - Startup - Self Test

Filed under: — Aviran Mordo

Previous Chapter: A/B Testing

So far we have discussed feature toggles and A/B testing. These two methods provide safeguards so that your code does not harm your system. Feature toggles let you gradually enable new features and gradually expose them to users, while monitoring that the system behaves as expected. A/B testing, on the other hand, lets you test how your users react to new features. There is one more crucial test you as a developer should write that will protect your system from bad deployments (and it is also a key to implementing gradual deployment).

Self-test, sometimes called a startup test or post-deployment test, is a test where your system checks that it can actually work properly. A working program does not consist only of the code that the developer writes. In order for an application to work, it needs configuration values, external resources, and dependencies such as the databases and external services it relies on.

When an application loads and starts up, its first operation should be the self-test. Your application should refuse to process any operation if the self-test did not pass successfully.
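
Here is a minimal sketch of what such a self-test might look like, with two hypothetical checks (required configuration present, database answering a trivial query); a real self-test would verify every resource and dependency the application needs, and the deployment tooling would stop the rollout if it fails.

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch: a startup self-test gating whether the application may serve traffic.
public class SelfTest {

    private final AtomicBoolean healthy = new AtomicBoolean(false);

    public boolean isHealthy() {
        return healthy.get();
    }

    // Run once on startup (and typically re-run by the deployment tooling after deploy).
    public void run(String jdbcUrl, String dbUser, String dbPassword) {
        try {
            checkConfiguration(jdbcUrl, dbUser, dbPassword);
            checkDatabase(jdbcUrl, dbUser, dbPassword);
            healthy.set(true);                 // only now may the app accept requests
        } catch (Exception e) {
            healthy.set(false);                // deployment tooling should fail / roll back
            System.err.println("Self-test failed: " + e.getMessage());
        }
    }

    private void checkConfiguration(String... values) {
        for (String value : values) {
            if (value == null || value.isEmpty()) {
                throw new IllegalStateException("missing configuration value");
            }
        }
    }

    private void checkDatabase(String jdbcUrl, String user, String password) throws Exception {
        try (Connection c = DriverManager.getConnection(jdbcUrl, user, password)) {
            if (!c.createStatement().executeQuery("SELECT 1").next()) {
                throw new IllegalStateException("database did not answer");
            }
        }
    }
}

The request-handling layer, or a health endpoint polled by the deployment tooling, would call isHealthy() and refuse to serve traffic until it returns true.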
(more…)

4/4/2013

Continuous Delivery - Part 4 - A/B Testing

Filed under: — Aviran Mordo

Previous chapter: Continuous Delivery - Part 3 - Feature Toggles

UPDATE: We released PETRI, our 3rd-generation experiment system, as an open source project available on GitHub.

From Wikipedia: In web development and marketing, as well as in more traditional forms of advertising, A/B testing or split testing is an experimental approach to web design (especially user experience design), which aims to identify changes to web pages that increase or maximize an outcome of interest (e.g., click-through rate for a banner advertisement). As the name implies, two versions (A and B) are compared, which are identical except for one variation that might impact a user’s behavior. Version A might be the currently used version, while Version B is modified in some respect. For instance, on an e-commerce website the purchase funnel is typically a good candidate for A/B testing, as even marginal improvements in drop-off rates can represent a significant gain in sales. Significant improvements can be seen through testing elements like copy text, layouts, images and colors.

Although it sounds similar to feature toggles, there is a conceptual difference between A/B testing and feature toggles. With an A/B test you measure an outcome of a completed feature or flow, which hopefully does not have bugs. A/B testing is a mechanism for exposing a finished feature to your users and testing their reaction to it, while with a feature toggle you want to test that the code behaves properly, as expected and without bugs. In many cases feature toggles are used on the back end, where users don't really experience changes in flow, while A/B tests are used on the front end, which exposes the new flow or UI to users.

Consistent user experience
One important point to note in A/B testing is consistent user experience. For instance, you cannot display a new menu option one time and then not show the same option when the user returns to your site or refreshes the browser. So whatever strategy your A/B test uses to determine whether a user is in group A or group B, it should be consistent: if a user comes back to your application, they should always "fall" into the same test group.
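
A common way to get that consistency is to derive the group deterministically from a stable user identifier, so a returning user always lands in the same bucket. The sketch below is only an illustration; the experiment name, the split percentage, and the hashing details are assumptions, not how PETRI works.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Sketch: deterministic A/B group assignment from a stable user identifier.
public class AbTest {

    public enum Group { A, B }

    public static Group groupFor(String userId, String experimentName, int percentInB) {
        int bucket = stableBucket(userId + ":" + experimentName); // 0..99, stable across visits
        return bucket < percentInB ? Group.B : Group.A;
    }

    // Hash the key into a stable bucket; any stable hash works, MD5 is used here for brevity.
    private static int stableBucket(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
            int value = ((digest[0] & 0xFF) << 8) | (digest[1] & 0xFF);
            return value % 100;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The same user gets the same group on every call, refresh, or return visit.
        System.out.println(groupFor("user-42", "new-menu", 50));
        System.out.println(groupFor("user-42", "new-menu", 50));
    }
}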
(more…)

3/27/2013

Continuous Delivery - Part 3 - Feature Toggles

Filed under: — Aviran Mordo

Previous chapter:The Road To Continuous Delivery - Part 2 - Visibility

UPDATE: We released PETRI, our 3rd-generation experiment system, as an open source project available on GitHub.

One of the key elements in Continuous Delivery is the fact that you stop working with feature branches in your VCS repository; everybody works on the MASTER branch. During our transition to Continuous Deployment we switched from SVN to Git, which handles code merges much better, and has some other advantages over SVN; however SVN and basically every other VCS will work just fine.

For people who are just getting to know this methodology it sounds a bit crazy, because they think developers cannot check in their code until it is completed and all the tests pass. But this is definitely not the case. Working in continuous deployment, we tell developers to check in their code as often as possible, at least once a day. So how can this work if developers cannot finish their task in one day? Well, there are a few strategies to support this mode of development.

Feature toggles
Telling your developers they must check in their code at least once a day will get you a reaction along the lines of "But my code is not finished yet, I cannot check it in". The way to overcome this "problem" is with feature toggles.

Feature Toggle is a technique in software development that attempts to provide an alternative to maintaining multiple source code branches, called feature branches.
Continuous release and continuous deployment enable you to get quick feedback about your code. This requires you to integrate your changes as early as possible. Feature branches introduce a bypass to this process. Feature toggles bring you back on track, but the execution paths of your feature are still "dead" and "untested" while a toggle is "off". However, the effort to enable the new execution paths is low: just set the toggle to "on".

So what is really a feature toggle?
A feature toggle is basically an "if" statement in your code that is part of your standard code flow. If the toggle is "on" (the "if" statement == true) then the code is executed, and if the toggle is "off" then the code is not executed.
Every new feature you add to your system has to be wrapped in a feature toggle. This way developers can check in unfinished code, as long as it compiles, and it will never get executed until you change the toggle to "on". If you design your code correctly you will see that in most cases you will only have ONE spot in your code for a specific feature toggle "if" statement.
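
For illustration, here is a minimal sketch of a feature toggle: a named flag looked up in one place and checked by a single "if" in the normal code flow. The toggle store, the toggle name, and the checkout example are all hypothetical; in a real system the values would come from configuration or an experiment system.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: a named boolean flag; unfinished features default to "off".
public class FeatureToggles {

    private static final Map<String, Boolean> TOGGLES = new ConcurrentHashMap<>();

    public static void set(String name, boolean on) {
        TOGGLES.put(name, on);
    }

    public static boolean isOn(String name) {
        return TOGGLES.getOrDefault(name, false); // unknown toggles are treated as "off"
    }
}

class CheckoutService {

    public void checkout(String orderId) {
        if (FeatureToggles.isOn("new-checkout-flow")) {
            newCheckoutFlow(orderId);   // unfinished code path, safe to ship while the toggle is "off"
        } else {
            existingCheckoutFlow(orderId);
        }
    }

    private void newCheckoutFlow(String orderId) { /* new, still-in-progress code */ }

    private void existingCheckoutFlow(String orderId) { /* current production code */ }
}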
(more…)

3/23/2013

The Road To Continuous Delivery - Part 2 - Visibility

Filed under: — Aviran Mordo

Previous chapter: The road to continuous delivery - Part 1

Production visibility

A key point for successful continuous delivery is to make the production metrics available to the developers. At the heart of the continuous delivery methodology is empowering the developer and making developers responsible for deployment and the successful operation of the production environment. In order for the developers to do that, you should make all the information about the applications running in production easily available.

Although we give our developers root (sudo) access to the production servers, we do not want our developers to dig through logs in order to understand how the application behaves in production and to solve problems when they occur. Instead, we developed a framework that every application at Wix is built on, which takes care of this concern.

Every application built with our framework automatically exposes a web dashboard that shows the application state and statistics. The dashboard shows the following (partial list):
• Server configuration
• All the RPC endpoints
• Resource Pools statistics
• Self test status (will be explained in future post)
• The list of artifacts (dependencies) and their version deployed with this version
• Feature toggles and their values
• Recent log entries (can be filtered by severity)
• A/B tests
• And most importantly we collect statistics about methods (timings, exceptions, number of calls and historical graphs).

Wix Dashboard

We use code instrumentation to automatically expose statistics on every controller and service end-point. Developers can also annotate methods they feel are important to monitor. For every method we can see the historical performance data, exception counters, and the last 10 exceptions for that method.
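
As a rough illustration of that kind of instrumentation (not the actual Wix framework), the sketch below wraps a monitored call and records call counts, total time, and exception counts: the raw numbers a dashboard could then expose per method.

import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicLong;

// Sketch: per-method statistics collected by wrapping the monitored call.
public class MethodStats {

    private final AtomicLong calls = new AtomicLong();
    private final AtomicLong exceptions = new AtomicLong();
    private final AtomicLong totalNanos = new AtomicLong();

    public <T> T record(Callable<T> body) throws Exception {
        long start = System.nanoTime();
        calls.incrementAndGet();
        try {
            return body.call();
        } catch (Exception e) {
            exceptions.incrementAndGet();   // counted and surfaced on the dashboard
            throw e;
        } finally {
            totalNanos.addAndGet(System.nanoTime() - start);
        }
    }

    public String summary() {
        long n = calls.get();
        long avgMicros = n == 0 ? 0 : totalNanos.get() / n / 1_000;
        return "calls=" + n + " exceptions=" + exceptions.get() + " avgMicros=" + avgMicros;
    }

    public static void main(String[] args) throws Exception {
        MethodStats getSiteStats = new MethodStats();
        String site = getSiteStats.record(() -> "site-123"); // wrap the monitored call
        System.out.println(site + " -> " + getSiteStats.summary());
    }
}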

We have two categories of exceptions: business exceptions and system exceptions.
A business exception is anything that has to do with the application's business logic. You will always have these kinds of exceptions, such as validation exceptions. The important thing to monitor with this kind of exception is a sudden increase, especially after a deployment.

The other type of exception is the system exception. A system exception is something like "Cannot get JDBC connection" or "HTTP connection timeout". A perfect system should have zero system exceptions.
(more…)

3/16/2013

The Road To Continuous Delivery - Part 1

Filed under: — Aviran Mordo

The following series of posts are coming from my experience as the head of back-end engineering at Wix.com. I will try to tell the story of Wix and how we see and practice continuous delivery, hoping it will help you make the switch too.

So you have decided that your development process is too slow and are thinking of moving to a continuous delivery methodology instead of the "not so agile" Scrum. I assume you did some research, talked to a couple of companies, attended some lectures on the subject, and want to practice continuous deployment too. But many companies ask me: how do we start, and what do we do?
In this series of articles I will try to describe some strategies to make the switch to Continuous delivery (CD).

Continuous Delivery is the last step in a long process. If you are just starting, you should not expect to get there within a few weeks or even a few months. It might take you almost a year to actually make several deployments a day.
One important thing to know: it takes full commitment from management. Real CD is going to change the whole development methodology and affect everyone in R&D.

Phase 1 – Test Driven Development
In order to do CD successfully you need to change the development methodology to Test Driven Development. There are many books and online resources about how to do TDD. I will not write about it here, but I will share our experience and the things we did in order to do TDD. One of the best books I can recommend is "Growing Object-Oriented Software, Guided by Tests".

A key concept of CD is that everything should be tested automatically. Like most companies we had a manual QA department, which was one of the reasons the release process took so long. With every new version of the product, regression testing takes longer.

Usually when you suggest moving to TDD and eventually to CI/CD, the QA department will start having concerns that they are going to be expendable and fired, but we did no such thing. What we did was send our entire QA department to learn Java. Up to that point our QA personnel were not developers and did not know how to write code. Our initial thought was that the QA department was going to write tests, but not unit tests; they were going to write integration and end-to-end tests.

Since we had a lot of legacy code that was not tested at all, the best way to test it was with integration tests, because integration testing is similar to what manual QA does: testing the system from the outside. We needed the manpower to help the developers, so training the QA personnel was a good choice.

Now, as for the development department, we started to teach the developers how to write tests. Of course the first tests we wrote were pretty bad ones, but writing good tests, like any skill, improves with time.
In order to succeed in moving to CD it is critical to get support from management, because before you see results there is a lot of investment to be made, and development velocity is going to sink even further as you start training and building the infrastructure to support CD.
(more…)
