Sustainable Software Deliverability with Timelines

Filed under: — Aviran Mordo

In my previous post “Kill the Deadlines” I rant about how (fake) deadlines are demotivating, reduce quality and burn development teams. But if you think about why people use deadlines, is to deliver a software project in a timely manner. While I’m not a proponent of using fake deadlines as a way to push developers, in a product development process it is extremely important to have a time frame where you want to launch the product.

Agile software development has good tools and methodologies that can help you ship products fast, as long as you do it right and not follow the process blindly, but you understand the tools in your arsenal and pick the ones that work for you and adjust them to your needs and culture.

While it is extremely hard (if not impossible) to accurately predict how long a software project will last, setting a timeframe for a product launch is necessary to keep the team in focus and decide on priorities.

Release content
A project consists of many features: some take a long time to implement, and some take a short time to implement. While a product manager envisions a whole set of features, implementing all of them before the release will take a very long time and will probably never happen, or the product will never launch as there are always more features you would like to add, or more polishing to the product you would like to do.

Set a timeline
In order to keep everyone focused and be able to launch a product in a timely manner we need to set a timeline when we want to launch the product. The timeline consists of estimations of how long each feature should take. Based on these estimations we can build a release content list with all the features we would like to launch in this version. When we set a timeline we force ourselves to think about what are the features we would like to get into this version. The list of features in a version should be determined by the business value they provide. For instance, we can have one big feature with a lot of value or maybe two small features what each one has less value, but releasing two of these at the same time have a bigger value to the product that the one big feature.

Timeline should have checkpoints or milestones where you can evaluate your progress. In the axis of Time/Quality/Content I tend to keep quality as the constant (you should always strive to produce high quality software) and now you will have to play with time and content. Given a timeline, these milestones are good points to understand if you are going to be able to ship with your initial list of features in time, or you would have to take a good look and see what you can leave out for the next version. You will be surprised how “critical” features can be “less critical” and be cut down when you are forced to deal with a decision of extending the timeline or cut features, which helps you ship the product faster.

Synchronizing the company
Timelines help you synchronizing different departments inside the company. Since releasing a product requires also involvements from other department in the company, such as marketing, support, content writing, QA, etc’, having a public timeline that is visible and communicated to the whole company, helps synchronizing all the departments and help them plan accordingly to be ready for the release. You can communicate the timelines via periodic email updates, public boards, posters on the walls, publicly displayed on monitors in the office or any other method that will keep the timeline transparent and accessible to all.

So what are timelines?
Timeline is a rough time frame that you would like to ship a product. Timeline is NOT a deadline and is flexible to changes and delays (unfortunately most of the times it will be postponed, but you should try to keep it to a minimum).
Depending on the amount of the work timelines should be in a resolution of a week, month or a quarter.

Sometimes due to business constraints the timeline becomes a hard deadline. This is not an arbitrary date that you have set, but it is a direct result of a business constraints. For instance, a special feature that needs to be released before a holiday or a regulatory hard limit on something you need to change or add to your system. In this instance the real reason needs to be clearly communicated to the team.

When a timeline is delayed it should be as part of a discussion to understand what is the business impact of the delay as you may come to a conclusion that maybe instead of delaying the timeline you would make the hard choice of reducing the scope (content) of this version and keep the time unchanged.

So if a timeline is a rough estimate, when is the release date?
The release date, AKA deadline should be set towards the end of the timeline at the point you understand all the unknowns and you feel comfortable that you will be ready to launch (and so all the other dependencies). Setting the deadline late in the process will make the deadline based on REALITY and not a fake one. Yes, you will probably need to push the teams to get to this deadline but it will be something that they will be able to relate, understand and not sacrifice quality and release a bad product.

Continuous improvement
Agile software development is talking about retrospective as a tool to improve development process, in many organizations the retrospective is being done after a sprint and unfortunately geared towards improving the estimations which is causing the side effect of developers taking buffers just to be on the safe side. This should NOT be the point of retrospectives.
Retrospectives should be treated as a tool for continuous improvement of development velocity and not as a tool to improve estimations. In order to improve velocity, you should have a continuous feedback about what would have help you to deliver software faster.
Here are some examples of issues developers should point that would have made them finish their tasks faster: less context switches, faster build time, too many production alerts, I was waiting for QA, provisioning staging machines is taking a lots of time, I had too many meetings, etc’.
As opposed to I estimated a task to take 3 days and I realized that the initial database schema I chose was wrong and I had to redo it which caused delays, or I thought this was an easy task and realized I had to first do a big refactoring before I could implement it.
Once you get these feedbacks you will start to identify patterns and identify what are the bottlenecks in your development process and you could then tackle these bottlenecks and improve your development velocity. (Just between us, as a side effect of this process you will also get better estimations but that is accidental and not the real purpose of this)

In order to release a product efficiently, you can use agile software delivery practices, set a rough timeline and checkpoints along the way to see if you are on schedule. In case you are late you should re-evaluate the version content you may want to cut or switch some features.
Communicate and make the timeline visible to all the people in the company so everyone can be in sync, and when you feel that you can have a release date that you can actually make and it is based on real progress, only then set the actual date, and lastly have a process in place for continuous improvement of your development velocity.


Scaling Engineering by Hacking Conway’s Law

Filed under: — Aviran Mordo

So Wix.com has a very unique company structure. While we are over 1200 people strong, we keep behaving like 400 people company. One of our goals is to have and keep the startup feeling and velocity. In order to do that Wix has evolved the company structure form functional teams, to Gangs and Guilds and now the latest incarnation of the company’s structure to Guilds and Companies.

I recently gave a talk at DevoxxUK about the evolution of the company structure and how we managed to scale our engineering teams in a fast growing company while still keeping a top of the line engineering group without losing quality on the way.

Watch the video from the conference (Download slides)


Why Should You Do Microservices (or maybe you shouldn’t)

Filed under: — Aviran Mordo

Microservices architecture is really hyped these days (I should know, I have been talking about it in many conferences), however not many have been written about the actual reason for doing microservies in the first place.

In the stories I tell in my public talks I try to explain that microserves architecture comes to solve a problem, and the main issue it comes to solve is SCALE, but not the scale that you think. Microservices mainly solve engineering scale.

We all know that small teams work faster and better than large teams. The bigger your team is the larger the project is and you end up with a huge monolith that many people are working on the same code base. It becomes very hard to to release a version as you need to synchronize the work of many people and package it into a releasable version.

By breaking your monolith into small microservices you allow the creation of small engineering teams that can release and deploy on their own time with loose coupling between different artifacts and other teams.

Another great benefit you gain is the ability to rollback small changes without affecting other areas in your system. If you have a monolith it is almost impossible to roll back a bad version because it bundled many features and if one feature is bad you cannot roll it back since you will essentially roll back all the other new features. If you breake it to microservices you decouple these parts and each can deploy and roll back without affecting the entire system.

With microservices you basically increase your development velocity and can scale your engineering teams by giving each team a set of microservices which they own and responsible.

Another scalability problem is different SLA for different parts of your system. You may have parts of your system that need to be highly performant and highly available running in multiple data centers or zones, while other parts that have a lesser requirement for performance and availability, for instance off-line batch processing.
If you have one monolith you have to scale the entire system based on your highest SLA which can be costly.
With microservices you can split these services and have different SLA to different parts of your system, thus reducing your operational cost. You can also have different middleware for different parts of your system, thus choosing the best solution to solve different issues.

The 3rd reason of doing microservices is risk management. If you have a monolith and you have issues with it in production, whether it is a production issue, bad deployment or simply a bug, you can bring your whole system done. With microservices being independent and decoupled you only have partial downtime for the affected microservice and have a degragation of service instead of a complete downtime.

Now don’t get me wrong microserices is a great solution but it comes to solve a problem, it has many other issues and complexities. If you are totally fine with a simple monolith, stay with it, when you feel the (scalability) pains in having a monolith, then microservices can help you solve some of your pains, but be prepared for different pains of running a distributed system ;-)


Best practices for scaling with microservices and DevOps

Filed under: — Aviran Mordo

Wix.com is a highly successful cloud-based web development platform that has scaled rapidly. We now support around 80 130 million website builders in more than 190 countries. This translates into approximately 2 petabytes of user media files and adds about 1.5 terabytes per day. So how did we get there? The 400-strong Wix engineering team used a microservices architecture and MySQL, MongoDB, and Cassandra. We host our platform in three data centers, as well as on the cloud using Amazon Web Services (AWS) and the Google Cloud Platform.

I’ve been working with Wix since 2010 and oversaw the engineering team’s transition from a traditional waterfall development-based approach to agile methodologies and helped introduce DevOps and Continuous Delivery. Here’s what I learned about using microservices and MySQL to effectively support a fast-scaling environment.

How Wix defines microservices

Wix currently has around 200 microservices, but we didn’t start out on this path. During our days supporting 1 million sites back in 2008, we used a single-monolith approach with Java, Hibernate, Ehcache, Tomcat, and MySQL. This typical scale-up approach was useful in some aspects, but ultimately, we couldn’t tolerate the downtime caused by poor code quality and interdependencies inside the monolith. So, we gradually moved to a service-level-driven architecture and broke down our monolith.

By our definition, —(a single team can manage a few microservices), and the team must be able to describe each microservice’s responsibility in one clear sentence.

Specifically, a microservice is a single application deployed as a process, with one clear responsibility. It does not have to be a single function or even a single class. Each microservice writes only to its own database, to keep things clean and simple. The microservice itself has to be stateless to support frequent deployments and multiple instances, and all persistent states are stored in the database.

Wix’s four sets of microservices

Our architecture involves four main groups of services:

Wix Editor Segment: This set of microservices supports creating a website. The editor is written in JavaScript and runs in a browser. It saves a JSON representation of the site to one of the editor services, which in turn stores the JSON in MySQL and then into the Wix Media Platform (WixMP) file system. The editor back-end services also use the Jetty/Spring/Scala stack.

Wix Public Segment: This set of microservices is responsible for hosting and serving published Wix sites. It uses mostly MySQL and Jetty/Spring/Scala applications to serve the HTML of a site from the data that the Editor has created. Wix sites are rendered on a browser from JSON using JavaScript (React), or on the Wix Public server for bots.

Wix Media Platform (WixMP): This is an Internet media file system that was built and optimized for hosting and delivering images, video, music, and plain files, integrated with CDNs, SSL, etc. The platform runs on AWS and the Google Cloud Platform, using cloud compute instances and storage for on-the-fly image manipulation and video transcoding. We developed the compute instances software using Python, Go, and C, where applicable.

Verticals: This is a set of applications that adds value to a Wix site, such as eCommerce, Shoutout, or Hotels. The verticals are built using an Angular front end and the Jetty/Spring/Scala stack for the back end. We selected Angular over React for verticals because Angular provides a more complete application framework, including dependency injection and service abstraction.

Why MySQL is a great NoSQL

Our microservices use MySQL, so scaling them involves scaling MySQL. We don’t subscribe to the opinion, prevalent in our industry, that a relational database can’t perform as well as a NoSQL database. In our experience, engineers who make that assumption often ignore the operational costs, and don’t always think through production scenarios, uptimes, existing support options, knowledge base maintenance, and more.

We’ve found that, in most cases, we don’t need a NoSQL database, and that MySQL is a great NoSQL database if used appropriately. Relational databases have been around for more than 40 years, and there is a vast and easily accessible body of knowledge on how to use and maintain them. We usually default to using a MySQL database, and use NoSQL only in the cases where there’s a significantly better solution to the problem, such as if we need a document store or a solution for a high data volume that MySQL cannot handle.

Scaling MySQL to support explosive growth

Using MySQL in a large-scale system can present performance challenges. Here is a top 5 list of things we do to get great performance from MySQL:

Whenever possible, we avoid database-level transactions, since they require databases to maintain locks, which in turn have an adverse effect on performance. Instead, we use logical, application-level transactions, which reduce loads and extract better performance from the databases.

We do not use sequential primary keys because they introduce locks. Instead, we prefer client-generated keys, such as UUIDs. Also, when you have master-master replication, auto-increment causes conflicts, so you have to create key ranges for each instance.

We do not have queries with joins, and only look up or query by primary key or index. Any field that is not indexed has no right to exist. Instead, we fold such fields into a single text field (JSON is a good choice).

We often use MySQL simply as a key-value store. We store a JSON object in one of the columns, which allows us to extend the schema without making database schema changes. Accessing MySQL by primary key is extremely fast, and we get sub-millisecond read time by primary key, which is excellent for most uses. MySQL is a great NoSQL that’s ACID compliant.

We are not big fans of sharding because it creates operational overhead in maintaining and replicating clusters inside and across data centers. In terms of database size, we’ve found that a single MySQL instance can work perfectly well with hundreds of millions of records. Having a microservices architecture helps, as it naturally splits the data into multiple databases for each microservice. When the data grows beyond a single instance capacity, we either choose to switch to a NoSQL database that can scale out (Cassandra is our default choice), or try to scale up the database and have no more than two shards.


It’s entirely possible to manage a fast-growing, scale-out architecture without being a cloud-native, two-year-old startup. It’s also possible to do this while combining microservices with relational databases. Taking a long, hard look at both the development and operational pros and cons of tooling options has served us well in creating our own story and in managing a best-in-class, SLA-oriented architecture that drives our business growth.

Original post: http://techbeacon.com/how-wix-scaled-devops-microservices


Safe Database Migration Pattern Without Downtime

Filed under: — Aviran Mordo

I’ve been doing a continuous delivery talk for a while now and during my talk I describe a pattern of how to safely migrating one database to another database without downtime. Since many people contacted me and asked for more details about it, I will describe it here in more details as promised.

You can use this pattern to migrate between two different databases, for instance between MySql and MongoDB or between two schemas in the same database.

The idea of this pattern is to do a lazy database migration using feature toggles to control the behaviour of your application and progressing through the phases of the migration.

Let’s assume two databases you want to migrate from the “old” database to the “new” database.

Step 1
Build and deploy the “new” database schema onto production.
In this phase your system stays the same, nothing changes other than the fact that you have deployed a new database which you can start using when ready.

Step 2
Add a new DAO to your app that writes to the “new” database.
You may need to refactor your application to have a single (or very few) point(s) in which you access the database.
At the points you access the database or DAO you add a multi-state feature toggle that will control the flow of writing to the database.
The first state of this feature toggle is “use old database”. In this state your code ignores the “new” database and simply uses the “old” one as always.

Step 3
Start writing to the “new” database but use the “old” one as primary.
We are now getting into the distributed transaction world because you can never be 100% sure that writing to 2 databases can succeed of fail at the same time.
When your code performs a write operation it first writes to the “old” database and if it succeeds it writes to the “new” database as well. Notice that in this step the “old” database is in a consistent state while the “new” database can potentially be inconsistent since the writes to it can fail while the “old” database write succeeded.

It is important to let this step run for a while (several days or even weeks) before moving to the next step. This will give you the confidence that the write path of your new code works as expected and that the “new” database is configured correctly with all the replications in place.

At any time you decide that something is not working you can simply change the feature toggle back to the previous state and stop writing to the “new” database. You can make modification to the new schema or even drop it if you need as all the data is still in the “old” database and in a consistent state.

Safe database migration pattern

Step 4
Enable the read path. Change the feature toggle to enable reading from both databases.
In this step the it is important to remember that “old” database is the consistent one and should still be treated as the authoritative data.

Since there are many read patterns I’ll describe just a couple here but you can adjust it to your own use case.

In case you have immutable data and you know the record id you first read from the “new” database and in case you did not find the record you need to fall back to the “old” database and look for the record there. Only if both databases don’t have the record you return a “not found” to the client. Otherwise if the record is found you return the result preferring the “new” database.

If your data is mutable you’ll need to perform the read operation from both databases and prefer the “new” one only if the timestamp is equal to the record in the “old” database. Remember in this phase only the “old” database is considered consistent.

If you don’t know the record id and need to fetch unknown number of records you basically need to query both databases and merge the results coming from both DBs.

Whatever your read pattern is, remember that in this case the consistent database is the “old” one, but in this phase you need to read and use the “new” database read path as much as you can, in order to test your application and your new DAO on a real production environment. In this phase you may find out that you are missing some indices or need more read replicas.

Let this phase run for a while before moving to the next phase. Like in the previous phase you can always turn the feature toggle back to the previous states without a fear of data loss.

Another thing to note that since you are reading data from two schemas you will probably need to maintain backward and forward compatibility for the two data sets.

Step 5
Making the “new” database the primary one. Change the feature toggle to first write to the new database (you still read from both but now prefer the new DB).
This is a very important step. In this step you already run the write and read path of your code for a while now and when you feel comfortable you now switch roles and making the “new” database the consistent one and the “old” as a not consistent.
Instead of first writing to the “old” database first you now write to the “new” database first and do a “best effort” writing to the old database.
This phase also requires you to change the read priority. Up until now we considered the “old” database as having the authoritative data but now you would prefer the data in the “new” database (of course you need to consider the record timestamp).

This is also the point where you should try as much as you can to avoid switching back the feature toggle to the previous state as you’ll need to run a manual migration script to compare the two databases as writes to the “old” one may not have succeeded (remember distributed transaction). I call this “the point of no return“.

Step 6
Stop writing to the “old” database (read from both).
Change the feature toggle again to now stop writing to the “old” database having only a write path with the “new” database. Since the “new” database still does not have all the records you will still need to read from the “old” database as well as from the new and merge the data coming from both.

This is an important step as it basically transforms the “old” database to a “read-only” database.

If you feel comfortable enough you can do step 5 and 6 in one go.

Step 7
Eagerly migrate data from the “old” database to the “new” one.
Now that the “old” database is in a “read-only” mode it is very easy to write a migration script to migrate all the records from the “old” database that are not present in the “new” database.

Step 8
Delete the “old” DAO.
This is the last step where all the data is migrated to the “new” database you can now safely remove the old DAO from your code and leave only the new DAO that uses the new database. You now of course stop reading from the “old” DB and remove the data merging code that handle merging data from both DAOs.

This is it you are done and safely migrated the data between two databases without downtime.


Side note:
At Wix we usually run steps 3 and 4 for at least 2 weeks each and sometimes even a month before moving on to the next step. Examples for issues we had encounter during these steps were:
On the write path we were holding large objects in memory which caused GC storms during peak traffic.
Replications were not configured/working properly.
Missing proper monitoring.

On the read path we had issues like missing index.
Inefficient data model that caused poor performance which let us to rethink our data model for better read performance.


MySQL Is a Great NoSQL

Filed under: — Aviran Mordo

NoSQL is a set of database technologies built to handle massive amounts of data or specific data structures foreign to relational databases. However, the choice to use a NoSQL database is often based on hype, or a wrong assumption that relational databases cannot perform as well as a NoSQL database. Operational cost is often overlooked by engineers when it comes to selecting a database. At Wix engineering, we’ve found that in most cases we don’t need a NoSQL database, and that MySQL is a great NoSQL database if it’s used appropriately.

When building a scalable system, we found that an important factor is using proven technology so that we know how to recover fast if there’s a failure. For example, you can use the latest and greatest NoSQL database, which works well in theory, but when you have production problems, how long does it take to resume normal activity? Pre-existing knowledge and experience with the system and its workings—as well as being able to Google for answers—is critical for swift mitigation. Relational databases have been around for over 40 years, and there is a vast industry knowledge of how to use and maintain them. This is one reason we usually default to using a MySQL database instead of a NoSQL database, unless NoSQL is a significantly better solution to the problem—for example, if we need a document store, or to handle high data volume that MySQL cannot handle.

However, using MySQL in a large-scale system may have performance challenges. To get great performance from MySQL, we employ a few usage patterns. One of these is avoiding database-level transactions. Transactions require that the database maintains locks, which has an adverse effect on performance.

Instead, we use logical application-level transactions, thus reducing the load and extracting high performance from the database. For example, let’s think about an invoicing schema. If there’s an invoice with multiple line items, instead of writing all the line items in a single transaction, we simply write line by line without any transaction. Once all the lines are written to the database, we write a header record, which has pointers to the line items’ IDs. This way, if something fails while writing the individual lines to the database, and the header record was not written, then the whole transaction fails. A possible downside is that there may be orphan rows in the database. We don’t see it as a significant issue though, as storage is cheap and these rows can be purged later if more space is needed.

Here are some of our other usage patterns to get great performance from MySQL:
Do not have queries with joins; only query by primary key or index.
Do not use sequential primary keys (auto-increment) because they introduce locks. Instead, use client-generated keys, such as GUIDs. Also, when you have master-master replication, auto-increment causes conflicts, so you will have to create key ranges for each instance.
Any field that is not indexed has no right to exist. Instead, we fold such fields into a single text field (JSON is a good choice).

We often use MySQL simply as a key-value store. We store a JSON object in one of the columns, which allows us to extend the schema without making database schema changes. Accessing MySQL by primary key is extremely fast, and we get submillisecond read time by primary key, which is excellent for most use cases. So we found that MySQL is a great NoSQL that’s ACID compliant.

In terms of database size, we found that a single MySQL instance can work perfectly well with hundreds of millions of records. Most of our use cases do not have more than several hundred million records in a single instance.

One big advantage to using relational databases as opposed to NoSQL is that you don’t need to deal with the eventually consistent nature displayed by most NoSQL databases. Our developers all know relational databases very well, and it makes their lives easy.

Don’t get me wrong, there is a place for NoSQL; relational databases have their limits—single host size and strict data structures. Operational cost is often overlooked by engineers in favor of the cool new thing. If the two options are viable, we believe we need to really consider what it takes to maintain it in production and decide accordingly.

This article is published on JAX Magazine.

I will be speaking at JAX London and would be happy if you join my sessions. Get 10% off if you use promo code: SPKR_JLAM
Aviran Mordo - JAX London

Powered by WordPress