1/18/2017

Investing in Engineering Culture Attracts Engineers

Filed under: — Aviran Mordo

I have been a software engineer for most of my life, but until I joined Wix I had not worked in a place that put engineering as a core value. At Wix I had the opportunity to change that, and one of my main goals was to create a great engineering team and a company with a great engineering culture.
One of my reasons for taking on this task was selfish: I wanted to work at a place I would be happy to come to every day, a place where I would be challenged, and a place where software engineers could be proud of their profession.

Another task I have had ever since I joined Wix was to grow the engineering teams and bring on more developers. This was in perfect alignment with my own agenda, since I believe that in order to attract the best software engineers we had to create a great engineering culture, promote it, and share our knowledge with the community, and by doing so become more attractive and competitive in the market for talented engineers.

Operating in a highly competitive area with many startups and deep-pocketed companies like Google, Facebook, and Microsoft, the options for good software engineers are endless. The best way to attract talent is not by offering an extra-high salary, since we would always lose to larger and more established companies like Google or Microsoft, which can outbid anything we offer. Having a great engineering culture, giving back to the community, and taking care of our employees is a great way to compete and attract professionals.

In any case, this has been my theory, and now, for the second year, I can say it is working pretty well for us.

[Screenshot: Google Trends comparison of job-search keywords, January 18, 2017]

Looking at Google Trends, I compared search keywords used by people looking for a job at the leading companies here in Israel. Comparing “wix jobs” to “google jobs” and “facebook jobs” (and others) showed Wix at #1 for this search term. Of course this is not scientific data, but the assumption is that software engineers are more likely to search these terms in English, while non-technical employees will probably search in Hebrew.

Investing in people, engineering, and culture pays back!

9/26/2016

Sustainable Software Deliverability with Timelines

Filed under: — Aviran Mordo

In my previous post, “Kill the Deadlines”, I ranted about how (fake) deadlines are demotivating, reduce quality, and burn out development teams. But if you think about it, the reason people use deadlines is to deliver a software project in a timely manner. While I’m not a proponent of using fake deadlines as a way to push developers, in a product development process it is extremely important to have a time frame in which you want to launch the product.

Agile software development has good tools and methodologies that can help you ship products fast, as long as you do it right: don’t follow the process blindly, but understand the tools in your arsenal, pick the ones that work for you, and adjust them to your needs and culture.

While it is extremely hard (if not impossible) to accurately predict how long a software project will take, setting a time frame for a product launch is necessary to keep the team focused and to decide on priorities.

Release content
A project consists of many features: some take a long time to implement, and some take a short time. While a product manager envisions a whole set of features, implementing all of them before the release would take a very long time and will probably never happen; the product would never launch, as there are always more features you would like to add and more polish you would like to apply.

Set a timeline
In order to keep everyone focused and be able to launch a product in a timely manner, we need to set a timeline for when we want to launch the product. The timeline consists of estimations of how long each feature should take. Based on these estimations we can build a release content list with all the features we would like to launch in this version. When we set a timeline we force ourselves to think about which features we would like to get into this version. The list of features in a version should be determined by the business value they provide. For instance, we could have one big feature with a lot of value, or two small features that each have less value individually but together deliver more value to the product than the one big feature.

A timeline should have checkpoints or milestones where you can evaluate your progress. On the axes of Time/Quality/Content, I tend to keep quality constant (you should always strive to produce high-quality software), which leaves you to play with time and content. Given a timeline, these milestones are good points to understand whether you are going to be able to ship your initial list of features in time, or whether you will have to take a hard look and see what you can leave out for the next version. You will be surprised how “critical” features become “less critical” and get cut when you are forced to choose between extending the timeline and cutting features, which helps you ship the product faster.

Synchronizing the company
Timelines help you synchronize different departments inside the company. Since releasing a product also requires involvement from other departments, such as marketing, support, content writing, QA, etc., having a public timeline that is visible and communicated to the whole company helps synchronize all the departments and lets them plan accordingly to be ready for the release. You can communicate the timelines via periodic email updates, public boards, posters on the walls, monitors in the office, or any other method that keeps the timeline transparent and accessible to all.

So what are timelines?
A timeline is a rough time frame in which you would like to ship a product. A timeline is NOT a deadline, and it is flexible to changes and delays (unfortunately, most of the time it will be postponed, but you should try to keep that to a minimum).
Depending on the amount of work, timelines should have a resolution of a week, a month, or a quarter.

Sometimes, due to business constraints, the timeline becomes a hard deadline. This is not an arbitrary date that you have set; it is a direct result of a business constraint. For instance, a special feature that needs to be released before a holiday, or a regulatory hard limit on something you need to change or add to your system. In this case the real reason needs to be clearly communicated to the team.

When a timeline slips, it should be as part of a discussion about the business impact of the delay, as you may come to the conclusion that instead of delaying the timeline you would rather make the hard choice of reducing the scope (content) of this version and keeping the time unchanged.

So if a timeline is a rough estimate, when is the release date?
The release date, a.k.a. the deadline, should be set towards the end of the timeline, at the point where you understand all the unknowns and feel comfortable that you will be ready to launch (and so are all the other dependencies). Setting the deadline late in the process makes it a deadline based on REALITY and not a fake one. Yes, you will probably still need to push the teams to meet this deadline, but it will be something they can relate to and understand, without sacrificing quality and releasing a bad product.

Continuous improvement
Agile software development offers the retrospective as a tool to improve the development process. In many organizations the retrospective is done after a sprint and is unfortunately geared towards improving estimations, which has the side effect of developers padding their estimates with buffers just to be on the safe side. This should NOT be the point of retrospectives.
Retrospectives should be treated as a tool for continuous improvement of development velocity, not as a tool to improve estimations. In order to improve velocity, you should gather continuous feedback about what would have helped you deliver software faster.
Here are some examples of issues developers should raise that would have helped them finish their tasks faster: fewer context switches, faster build times, too many production alerts, waiting for QA, provisioning staging machines taking a lot of time, too many meetings, etc.
Contrast these with feedback like “I estimated a task at 3 days and then realized the initial database schema I chose was wrong and had to redo it, which caused delays,” or “I thought this was an easy task and realized I had to do a big refactoring before I could implement it.”
Once you get this kind of feedback you will start to identify patterns and see where the bottlenecks in your development process are; you can then tackle those bottlenecks and improve your development velocity. (Just between us, as a side effect of this process you will also get better estimations, but that is incidental and not the real purpose.)

Summary
In order to release a product efficiently, you can use agile software delivery practices: set a rough timeline and checkpoints along the way to see whether you are on schedule. In case you are late, re-evaluate the version content; you may want to cut or swap some features.
Communicate the timeline and make it visible to everyone in the company so everyone stays in sync. When you feel you can commit to a release date that you can actually make, one based on real progress, only then set the actual date. Lastly, have a process in place for continuous improvement of your development velocity.

6/18/2016

Scaling Engineering by Hacking Conway’s Law

Filed under: — Aviran Mordo

Wix.com has a very unique company structure. While we are over 1,200 people strong, we keep behaving like a 400-person company. One of our goals is to keep the startup feeling and velocity. In order to do that, Wix has evolved the company structure from functional teams, to Gangs and Guilds, and now to the latest incarnation of the company’s structure: Guilds and Companies.

I recently gave a talk at DevoxxUK about the evolution of the company structure and how we managed to scale our engineering teams in a fast-growing company while still keeping a top-of-the-line engineering group, without losing quality along the way.

Watch the video from the conference (Download slides)

4/13/2016

Why Should You Do Microservices (or maybe you shouldn’t)

Filed under: — Aviran Mordo

Microservices architecture is really hyped these days (I should know, I have been talking about it at many conferences); however, not much has been written about the actual reasons for doing microservices in the first place.

In the stories I tell in my public talks, I try to explain that microservices architecture comes to solve a problem, and the main issue it solves is SCALE, but not the scale that you think. Microservices mainly solve engineering scale.

We all know that small teams work faster and better than large teams. The bigger your team, the larger the project, and you end up with a huge monolith where many people are working on the same code base. It becomes very hard to release a version, as you need to synchronize the work of many people and package it into a releasable version.

By breaking your monolith into small microservices, you allow the creation of small engineering teams that can release and deploy on their own schedule, with loose coupling between different artifacts and other teams.

Another great benefit you gain is the ability to roll back small changes without affecting other areas of your system. If you have a monolith, it is almost impossible to roll back a bad version, because it bundles many features; if one feature is bad, you cannot roll it back without rolling back all the other new features as well. If you break the monolith into microservices, you decouple these parts, and each can be deployed and rolled back without affecting the entire system.

With microservices you basically increase your development velocity and can scale your engineering teams by giving each team a set of microservices which they own and are responsible for.

Another scalability problem is different SLAs for different parts of your system. You may have parts of your system that need to be highly performant and highly available, running in multiple data centers or zones, while other parts have lesser requirements for performance and availability, for instance offline batch processing.
If you have one monolith, you have to scale the entire system based on your highest SLA, which can be costly.
With microservices you can split these services and assign different SLAs to different parts of your system, thus reducing your operational cost. You can also use different middleware for different parts of your system, choosing the best solution for each problem.

The third reason for doing microservices is risk management. If you have a monolith and you have issues with it in production, whether it is a production incident, a bad deployment, or simply a bug, you can bring your whole system down. With microservices being independent and decoupled, you only have partial downtime for the affected microservice, and you get a degradation of service instead of a complete outage.

Now don’t get me wrong: microservices are a great solution, but they come to solve a problem, and they bring many other issues and complexities of their own. If you are totally fine with a simple monolith, stay with it. When you feel the (scalability) pains of having a monolith, then microservices can help you solve some of them, but be prepared for the different pains of running a distributed system ;-)

1/28/2016

Best practices for scaling with microservices and DevOps

Filed under: — Aviran Mordo

Wix.com is a highly successful cloud-based web development platform that has scaled rapidly. We now support around 80 million website builders in more than 190 countries. This translates into approximately 2 petabytes of user media files and adds about 1.5 terabytes per day. So how did we get there? The 400-strong Wix engineering team used a microservices architecture and MySQL, MongoDB, and Cassandra. We host our platform in three data centers, as well as on the cloud using Amazon Web Services (AWS) and the Google Cloud Platform.

I’ve been working with Wix since 2010 and oversaw the engineering team’s transition from a traditional waterfall development-based approach to agile methodologies and helped introduce DevOps and Continuous Delivery. Here’s what I learned about using microservices and MySQL to effectively support a fast-scaling environment.

How Wix defines microservices

Wix currently has around 200 microservices, but we didn’t start out on this path. During our days supporting 1 million sites back in 2008, we used a single-monolith approach with Java, Hibernate, Ehcache, Tomcat, and MySQL. This typical scale-up approach was useful in some aspects, but ultimately, we couldn’t tolerate the downtime caused by poor code quality and interdependencies inside the monolith. So, we gradually moved to a service-level-driven architecture and broke down our monolith.

By our definition, a microservice must be small enough that a single team can manage a few microservices, and the team must be able to describe each microservice’s responsibility in one clear sentence.

Specifically, a microservice is a single application deployed as a process, with one clear responsibility. It does not have to be a single function or even a single class. Each microservice writes only to its own database, to keep things clean and simple. The microservice itself has to be stateless to support frequent deployments and multiple instances, and all persistent states are stored in the database.

Wix’s four sets of microservices

Our architecture involves four main groups of services:

Wix Editor Segment: This set of microservices supports creating a website. The editor is written in JavaScript and runs in a browser. It saves a JSON representation of the site to one of the editor services, which in turn stores the JSON in MySQL and then into the Wix Media Platform (WixMP) file system. The editor back-end services also use the Jetty/Spring/Scala stack.

Wix Public Segment: This set of microservices is responsible for hosting and serving published Wix sites. It uses mostly MySQL and Jetty/Spring/Scala applications to serve the HTML of a site from the data that the Editor has created. Wix sites are rendered on a browser from JSON using JavaScript (React), or on the Wix Public server for bots.

Wix Media Platform (WixMP): This is an Internet media file system that was built and optimized for hosting and delivering images, video, music, and plain files, integrated with CDNs, SSL, etc. The platform runs on AWS and the Google Cloud Platform, using cloud compute instances and storage for on-the-fly image manipulation and video transcoding. We developed the compute instances software using Python, Go, and C, where applicable.

Verticals: This is a set of applications that adds value to a Wix site, such as eCommerce, Shoutout, or Hotels. The verticals are built using an Angular front end and the Jetty/Spring/Scala stack for the back end. We selected Angular over React for verticals because Angular provides a more complete application framework, including dependency injection and service abstraction.

Why MySQL is a great NoSQL

Our microservices use MySQL, so scaling them involves scaling MySQL. We don’t subscribe to the opinion, prevalent in our industry, that a relational database can’t perform as well as a NoSQL database. In our experience, engineers who make that assumption often ignore the operational costs, and don’t always think through production scenarios, uptimes, existing support options, knowledge base maintenance, and more.

We’ve found that, in most cases, we don’t need a NoSQL database, and that MySQL is a great NoSQL database if used appropriately. Relational databases have been around for more than 40 years, and there is a vast and easily accessible body of knowledge on how to use and maintain them. We usually default to using a MySQL database, and use NoSQL only in the cases where there’s a significantly better solution to the problem, such as if we need a document store or a solution for a high data volume that MySQL cannot handle.

Scaling MySQL to support explosive growth

Using MySQL in a large-scale system can present performance challenges. Here is a top 5 list of things we do to get great performance from MySQL:

Whenever possible, we avoid database-level transactions, since they require databases to maintain locks, which in turn have an adverse effect on performance. Instead, we use logical, application-level transactions, which reduce loads and extract better performance from the databases.

We do not use sequential primary keys because they introduce locks. Instead, we prefer client-generated keys, such as UUIDs. Also, when you have master-master replication, auto-increment causes conflicts, so you have to create key ranges for each instance.

We do not have queries with joins, and only look up or query by primary key or index. Any field that is not indexed has no right to exist. Instead, we fold such fields into a single text field (JSON is a good choice).

We often use MySQL simply as a key-value store. We store a JSON object in one of the columns, which allows us to extend the schema without making database schema changes. Accessing MySQL by primary key is extremely fast, and we get sub-millisecond read time by primary key, which is excellent for most uses. MySQL is a great NoSQL that’s ACID compliant (see the sketch after this list).

We are not big fans of sharding because it creates operational overhead in maintaining and replicating clusters inside and across data centers. In terms of database size, we’ve found that a single MySQL instance can work perfectly well with hundreds of millions of records. Having a microservices architecture helps, as it naturally splits the data into multiple databases for each microservice. When the data grows beyond a single instance capacity, we either choose to switch to a NoSQL database that can scale out (Cassandra is our default choice), or try to scale up the database and have no more than two shards.
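
To make the key-value pattern from this list concrete, here is a minimal sketch in Scala over plain JDBC. The table name, columns, and DDL are hypothetical illustrations rather than Wix’s actual schema; the point is that every access goes through the primary key and the value is an opaque JSON blob:

```scala
import java.sql.Connection

// Minimal sketch of the "MySQL as a key-value store" pattern.
// Hypothetical table:
//   CREATE TABLE site_kv (
//     id      CHAR(36) PRIMARY KEY,   -- client-generated UUID
//     payload MEDIUMTEXT NOT NULL     -- opaque JSON blob
//   );
object KeyValueDao {
  // Upsert a JSON payload by primary key.
  def put(conn: Connection, id: String, json: String): Unit = {
    val ps = conn.prepareStatement(
      "INSERT INTO site_kv (id, payload) VALUES (?, ?) " +
        "ON DUPLICATE KEY UPDATE payload = VALUES(payload)")
    try {
      ps.setString(1, id)
      ps.setString(2, json)
      ps.executeUpdate()
    } finally ps.close()
  }

  // Primary-key lookup only: no joins, no secondary predicates.
  def get(conn: Connection, id: String): Option[String] = {
    val ps = conn.prepareStatement("SELECT payload FROM site_kv WHERE id = ?")
    try {
      ps.setString(1, id)
      val rs = ps.executeQuery()
      try { if (rs.next()) Some(rs.getString(1)) else None }
      finally rs.close()
    } finally ps.close()
  }
}
```

Because the payload is opaque to MySQL, adding a field to the JSON requires no ALTER TABLE; only fields you actually query need to be promoted to indexed columns.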

Takeaways

It’s entirely possible to manage a fast-growing, scale-out architecture without being a cloud-native, two-year-old startup. It’s also possible to do this while combining microservices with relational databases. Taking a long, hard look at both the development and operational pros and cons of tooling options has served us well in creating our own story and in managing a best-in-class, SLA-oriented architecture that drives our business growth.

Original post: http://techbeacon.com/how-wix-scaled-devops-microservices

12/15/2015

Safe Database Migration Pattern Without Downtime

Filed under: — Aviran Mordo

I’ve been giving a continuous delivery talk for a while now, and during the talk I describe a pattern for safely migrating from one database to another without downtime. Since many people have contacted me and asked for more details about it, I will describe it here in more detail, as promised.

You can use this pattern to migrate between two different databases, for instance between MySQL and MongoDB, or between two schemas in the same database.

The idea of this pattern is to do a lazy database migration, using feature toggles to control the behaviour of your application as you progress through the phases of the migration.

Let’s assume you have two databases: you want to migrate from the “old” database to the “new” one.

Step 1
Build and deploy the “new” database schema onto production.
In this phase your system stays the same; nothing changes other than the fact that you have deployed a new database, which you can start using when ready.

Step 2
Add a new DAO to your app that writes to the “new” database.
You may need to refactor your application to have a single point (or very few points) at which you access the database.
At the points where you access the database or DAO, add a multi-state feature toggle that controls the flow of writes to the database.
The first state of this feature toggle is “use old database”. In this state your code ignores the “new” database and simply uses the “old” one as always.
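
As a rough sketch, such a toggle can be modeled as an explicit set of states, one per phase of the migration. The names here are hypothetical, and in practice the current state would come from a feature-toggle system rather than being hard-coded:

```scala
// Hypothetical states for the migration feature toggle. Each state
// corresponds to one of the steps described in this post.
sealed trait MigrationState
object MigrationState {
  case object UseOldOnly        extends MigrationState // step 2: ignore the "new" DB
  case object WriteBothOldFirst extends MigrationState // step 3: "old" DB is primary
  case object ReadBoth          extends MigrationState // step 4: read path enabled
  case object WriteBothNewFirst extends MigrationState // step 5: "new" DB is primary
  case object WriteNewReadBoth  extends MigrationState // step 6: "old" DB is read-only
  case object UseNewOnly        extends MigrationState // step 8: old DAO deleted
}
```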

Step 3
Start writing to the “new” database but use the “old” one as primary.
We are now getting into distributed-transaction territory, because you can never be 100% sure that writes to two databases will succeed or fail together.
When your code performs a write operation, it first writes to the “old” database, and if that succeeds it writes to the “new” database as well. Notice that in this step the “old” database is in a consistent state, while the “new” database can potentially be inconsistent, since writes to it can fail after the “old” database write has succeeded.

It is important to let this step run for a while (several days or even weeks) before moving to the next step. This will give you the confidence that the write path of your new code works as expected and that the “new” database is configured correctly with all the replications in place.

If at any time you decide that something is not working, you can simply change the feature toggle back to the previous state and stop writing to the “new” database. You can make modifications to the new schema, or even drop it if you need to, as all the data is still in the “old” database and in a consistent state.
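
Here is a minimal sketch of the write path in this state. The `Record` and `Dao` types are hypothetical stand-ins for your own model and data-access layer; the essential part is the asymmetric error handling:

```scala
import scala.util.{Failure, Try}

// Sketch of the step 3 write path. `Record` and `Dao` are hypothetical
// stand-ins for your own model and data-access layer.
final case class Record(id: String, payload: String)

trait Dao {
  def save(record: Record): Unit
}

final class DualWriteDao(oldDao: Dao, newDao: Dao) extends Dao {
  override def save(record: Record): Unit = {
    // Primary write: failures propagate and fail the whole operation.
    oldDao.save(record)
    // Secondary write: best effort. A failure here only leaves the "new"
    // DB inconsistent, which is acceptable in this phase.
    Try(newDao.save(record)) match {
      case Failure(e) =>
        println(s"WARN: write to new DB failed for ${record.id}: ${e.getMessage}")
      case _ => ()
    }
  }
}
```

A failed write to the “new” database is logged rather than propagated, so in the worst case it only widens the gap that the eager migration in step 7 will later close.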

[Diagram: Safe database migration pattern]

Step 4
Enable the read path. Change the feature toggle to enable reading from both databases.
In this step it is important to remember that the “old” database is the consistent one and should still be treated as the authoritative data source.

Since there are many read patterns, I’ll describe just a couple here; you can adjust them to your own use case.

If you have immutable data and you know the record id, you first read from the “new” database; if you do not find the record, you fall back to the “old” database and look for it there. Only if neither database has the record do you return “not found” to the client. Otherwise, you return the result, preferring the “new” database.

If your data is mutable, you’ll need to perform the read operation from both databases and prefer the “new” one only if its timestamp is equal to that of the record in the “old” database. Remember, in this phase only the “old” database is considered consistent.

If you don’t know the record id and need to fetch an unknown number of records, you basically need to query both databases and merge the results coming from both DBs.
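
For illustration, here is a rough sketch of the two single-record read patterns just described, with hypothetical types (each record is assumed to carry a last-modified timestamp for the mutable case):

```scala
// Sketch of the step 4 read paths; types are hypothetical.
final case class StoredRecord(id: String, payload: String, updatedAt: Long)

trait ReadDao {
  def find(id: String): Option[StoredRecord]
}

final class MigratingReader(oldDao: ReadDao, newDao: ReadDao) {
  // Immutable data: prefer the "new" DB, fall back to the "old" one,
  // and return None only when neither database has the record.
  def findImmutable(id: String): Option[StoredRecord] =
    newDao.find(id).orElse(oldDao.find(id))

  // Mutable data: the "old" DB is still authoritative. Only trust the "new"
  // copy when its timestamp matches the "old" record's timestamp.
  def findMutable(id: String): Option[StoredRecord] =
    (oldDao.find(id), newDao.find(id)) match {
      case (Some(oldRec), Some(newRec)) if newRec.updatedAt == oldRec.updatedAt =>
        Some(newRec)
      case (oldResult, newResult) =>
        oldResult.orElse(newResult)
    }
}
```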

Whatever your read pattern is, remember that in this phase the consistent database is the “old” one, but you should exercise the “new” database’s read path as much as you can, in order to test your application and your new DAO in a real production environment. In this phase you may find out that you are missing some indices or need more read replicas.

Let this phase run for a while before moving to the next one. As in the previous phase, you can always turn the feature toggle back to a previous state without fear of data loss.

Another thing to note: since you are reading data from two schemas, you will probably need to maintain backward and forward compatibility between the two data sets.

Step 5
Make the “new” database the primary one. Change the feature toggle to first write to the new database (you still read from both, but now prefer the new DB).
This is a very important step. By now you have been running the write and read paths of your code for a while, and when you feel comfortable, you switch roles: the “new” database becomes the consistent one, and the “old” one is treated as potentially inconsistent.
Instead of writing to the “old” database first, you now write to the “new” database first and make a “best effort” write to the old one.
This phase also requires you to change the read priority. Up until now we considered the “old” database as having the authoritative data, but now you prefer the data in the “new” database (of course, you still need to consider the record timestamps).

This is also the point at which you should try as hard as you can to avoid switching the feature toggle back to a previous state, since you would need to run a manual migration script to compare the two databases, as writes to the “old” one may not have succeeded (remember, distributed transactions). I call this “the point of no return”.

Step 6
Stop writing to the “old” database (read from both).
Change the feature toggle again to stop writing to the “old” database, leaving a write path only to the “new” database. Since the “new” database still does not have all the records, you will still need to read from the “old” database as well as the new one and merge the data coming from both.

This is an important step, as it basically transforms the “old” database into a “read-only” database.

If you feel comfortable enough, you can do steps 5 and 6 in one go.

Step 7
Eagerly migrate data from the “old” database to the “new” one.
Now that the “old” database is in “read-only” mode, it is very easy to write a migration script that copies over all the records from the “old” database that are not present in the “new” one.
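
A sketch of such a backfill, again with hypothetical DAO shapes; because the “old” database is read-only at this point, the scan-and-copy is safe to run, and re-run, at any pace:

```scala
// Sketch of the step 7 backfill. `scanAll` and the DAO shapes are
// hypothetical; the "old" DB is read-only, so this can be re-run safely.
final case class Row(id: String, payload: String)

trait OldDao {
  def scanAll(): Iterator[Row] // e.g., a streaming cursor over the old table
}

trait NewDao {
  def exists(id: String): Boolean
  def save(row: Row): Unit
}

object Backfill {
  def run(oldDao: OldDao, newDao: NewDao): Unit =
    oldDao.scanAll()
      .filterNot(row => newDao.exists(row.id)) // skip records migrated lazily
      .foreach(newDao.save)
}
```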

Step 8
Delete the “old” DAO.
This is the last step. With all the data migrated to the “new” database, you can now safely remove the old DAO from your code and leave only the new DAO that uses the new database. You also, of course, stop reading from the “old” DB and remove the code that merges data from both DAOs.

That’s it: you are done, having safely migrated the data between two databases without downtime.

Side note:
At Wix we usually run steps 3 and 4 for at least two weeks each, and sometimes even a month, before moving on to the next step. Examples of issues we encountered during these steps:

On the write path:
  • We were holding large objects in memory, which caused GC storms during peak traffic.
  • Replications were not configured or working properly.
  • We were missing proper monitoring.

On the read path:
  • Missing indices.
  • An inefficient data model that caused poor performance, which led us to rethink our data model for better read performance.

9/9/2015

Why I don’t like to hire team leads

Filed under: — Aviran Mordo

Every company has its own culture that it wants to preserve. As a company grows, it becomes harder to preserve that culture, because as a manager you need to (and should) give up control to team leads and to the people you manage.

A good company has a set of values, best practices, and culture. When you hire a new person, it takes a while until he or she learns and assimilates into the company’s culture.

When you put someone in a managerial position where they need to lead other people, a mini-culture is created for each team. If you did your job right, this mini-culture is more or less aligned with the overall culture of the company.

For a fast-growing company, building new teams is necessary. The team lead plays a major role in building the team and setting its mini-culture. Hiring a person from outside for a team-lead position is a huge gamble, since they come with a different set of values, methodologies, and culture from their previous jobs.

Also, the fact that a person was a team lead at a different company does not necessarily mean they have what it takes to be a team lead at another company with a different culture and methodologies. This is especially true for people who want to change positions and have not previously led a team but are looking to do so. Since you don’t know them, you don’t know if they are the kind of people you want leading your teams.

So what can you do to make your new team lead successful? With people who may be hired as team leads, I like to tell them straight out that we do not promise a team-lead position. They will start working as engineers with team-lead potential and learn the culture, methodologies, and best practices for about six months, during which we will have the chance to evaluate whether they are good enough to be a team lead. After that period, when the need for a new team lead arises, we will consider them for the position if they fit it.

I also like this method because it promotes people from within the organization and allows people to grow inside the company, rather than having to leave in order to get a promotion. Also, by the time a person becomes a team lead, they have already had a chance to gain the respect of their peers and be more accepted as a team lead.

What do you think?

8/14/2015

Games of Gangs

Filed under: — Aviran Mordo

Working in a product company, you are always torn between the product’s short-term and long-term goals and the tasks that engineering wants and needs to do but that have nothing to do with the product itself: improving the testing framework, building IDE plugins that improve day-to-day work, or creating back-office tools that solve other people’s day-to-day problems.
There are also tasks that engineers want to do to pay down technical debt on the product, improving the long-term maintainability of the code.
Getting this quality time for tasks not directly related to product development is hard, as there is always pressure to release the next feature.

This is where the Guild steps in, making this possible and creating a balance between working on features and taking care of other engineering and company-wide concerns.

As I described in the previous post, 20% of the time (one day a week) is dedicated to Guild activities. We wanted this day to be not only about talking and learning, but also about doing. So we created a game called “Games of Gangs”.

“Games of Gangs” is a gamification of the Guild tasks, which at its core expresses our main values of building an engineering culture and sharing knowledge.
While the first half of the day is mostly dedicated to retrospectives and training, the second half is dedicated to doing Guild-related tasks.

A “Games of Gangs” task can be anything that is not directly related to the product the engineer is working on. We also want to enhance our engineering culture and knowledge sharing by using these tasks as a tool for learning and improving. So here are the guidelines we put in place for the game’s tasks (these are guidelines, not rules):
Tasks should be done in pair programming with an engineer from a different team.
Tasks should conform to at least one criterion:

  • Enhance quality
  • Improve velocity
  • Enhance our framework
  • Help another company with their own tasks
  • Share knowledge

Examples of good tasks we have had: creating a Maven archetype for new projects; reducing build time; creating a CMS for our studio to manage templates; enhancing our monitoring capabilities.

To kick-start this activity and encourage people’s participation, we assigned points to tasks based on the task’s value and its knowledge-sharing value. For instance, if you do a solo task you get only 1 point, but if you do it in a pair, each of you gets 2 points. If you pair up with someone not from your “Company” you get 3 points, and if you do it with someone from an offshore office you get 4 points.
You also get points for giving lectures, writing posts on our engineering blog, and other knowledge-sharing activities.

A “Games of Gangs” day can sometimes be dedicated to a specific topic we want to push, for instance cleaning up warnings in the code or upgrading to a new Scala version.

So “Games of Gangs” has become a great way to balance engineering needs and product needs while putting our engineering culture into play. It also creates much-needed personal relationships between Guild members who do not meet on any other day, as they work for different “Companies” at different physical locations.

8/12/2015

MySQL Is a Great NoSQL

Filed under: — Aviran Mordo

NoSQL is a set of database technologies built to handle massive amounts of data or specific data structures foreign to relational databases. However, the choice to use a NoSQL database is often based on hype, or a wrong assumption that relational databases cannot perform as well as a NoSQL database. Operational cost is often overlooked by engineers when it comes to selecting a database. At Wix engineering, we’ve found that in most cases we don’t need a NoSQL database, and that MySQL is a great NoSQL database if it’s used appropriately.

When building a scalable system, we found that an important factor is using proven technology so that we know how to recover fast if there’s a failure. For example, you can use the latest and greatest NoSQL database, which works well in theory, but when you have production problems, how long does it take to resume normal activity? Pre-existing knowledge and experience with the system and its workings—as well as being able to Google for answers—is critical for swift mitigation. Relational databases have been around for over 40 years, and there is a vast industry knowledge of how to use and maintain them. This is one reason we usually default to using a MySQL database instead of a NoSQL database, unless NoSQL is a significantly better solution to the problem—for example, if we need a document store, or to handle high data volume that MySQL cannot handle.

However, using MySQL in a large-scale system may have performance challenges. To get great performance from MySQL, we employ a few usage patterns. One of these is avoiding database-level transactions. Transactions require that the database maintains locks, which has an adverse effect on performance.

Instead, we use logical application-level transactions, thus reducing the load and extracting high performance from the database. For example, let’s think about an invoicing schema. If there’s an invoice with multiple line items, instead of writing all the line items in a single transaction, we simply write line by line without any transaction. Once all the lines are written to the database, we write a header record, which has pointers to the line items’ IDs. This way, if something fails while writing the individual lines to the database, and the header record was not written, then the whole transaction fails. A possible downside is that there may be orphan rows in the database. We don’t see it as a significant issue though, as storage is cheap and these rows can be purged later if more space is needed.
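
As a concrete illustration, here is a minimal sketch of this invoicing pattern in Scala over plain JDBC. The table and column names are made up for the example; the essential property is that the header write acts as the “commit” and happens only after every line item is safely stored:

```scala
import java.sql.Connection
import java.util.UUID

// Sketch of a logical (application-level) transaction for an invoice.
// Hypothetical tables:
//   invoice_line(id CHAR(36) PK, amount_cents BIGINT, description TEXT)
//   invoice_header(id CHAR(36) PK, line_ids TEXT)  -- comma-separated line ids
object InvoiceWriter {
  def writeInvoice(conn: Connection, lines: Seq[(Long, String)]): String = {
    // 1. Write each line independently, with no database transaction.
    val lineIds = lines.map { case (amountCents, description) =>
      val id = UUID.randomUUID().toString // client-generated key, no auto-increment
      val ps = conn.prepareStatement(
        "INSERT INTO invoice_line (id, amount_cents, description) VALUES (?, ?, ?)")
      try {
        ps.setString(1, id)
        ps.setLong(2, amountCents)
        ps.setString(3, description)
        ps.executeUpdate()
      } finally ps.close()
      id
    }
    // 2. The header is the "commit": it is written only after every line
    //    succeeded. A crash before this point leaves only orphan lines.
    val headerId = UUID.randomUUID().toString
    val ps = conn.prepareStatement(
      "INSERT INTO invoice_header (id, line_ids) VALUES (?, ?)")
    try {
      ps.setString(1, headerId)
      ps.setString(2, lineIds.mkString(","))
      ps.executeUpdate()
    } finally ps.close()
    headerId
  }
}
```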

Here are some of our other usage patterns to get great performance from MySQL:
  • Do not have queries with joins; only query by primary key or index.
  • Do not use sequential primary keys (auto-increment) because they introduce locks. Instead, use client-generated keys, such as GUIDs. Also, when you have master-master replication, auto-increment causes conflicts, so you will have to create key ranges for each instance.
  • Any field that is not indexed has no right to exist. Instead, we fold such fields into a single text field (JSON is a good choice).

We often use MySQL simply as a key-value store. We store a JSON object in one of the columns, which allows us to extend the schema without making database schema changes. Accessing MySQL by primary key is extremely fast, and we get submillisecond read time by primary key, which is excellent for most use cases. So we found that MySQL is a great NoSQL that’s ACID compliant.

In terms of database size, we found that a single MySQL instance can work perfectly well with hundreds of millions of records. Most of our use cases do not have more than several hundred million records in a single instance.

One big advantage to using relational databases as opposed to NoSQL is that you don’t need to deal with the eventually consistent nature displayed by most NoSQL databases. Our developers all know relational databases very well, and it makes their lives easy.

Don’t get me wrong, there is a place for NoSQL; relational databases have their limits—single host size and strict data structures. Operational cost is often overlooked by engineers in favor of the cool new thing. If both options are viable, we believe you need to really consider what it takes to maintain each in production and decide accordingly.

This article is published on JAX Magazine.

I will be speaking at JAX London and would be happy to have you join my sessions. Get 10% off if you use promo code: SPKR_JLAM
Aviran Mordo - JAX London

8/10/2015

Building a Guild

Filed under: — Aviran Mordo

A lot of people have heard about Spotify’s company structure of Guilds and Tribes. We at Wix.com have a similar structure that has evolved over time and was influenced by their ideas; however, we have our own interpretation of the structure and of the role of the Guild.

In this article I will try to walk down memory lane and describe my experience building the first Guild at Wix (the back-end JVM Guild), which is now the role model for all the other Guilds at the company, and how it has evolved from the time I joined Wix, when we had one back-end team of 4 developers, to about 100 back-end engineers in the back-end Guild today.

The Guild model did not start right away. When you are a relatively small startup, all you have is teams, and this is exactly what we had: one server team (4 engineers) that was basically responsible for all the back-end development at Wix. Even as demand for more back-end engineers grew, the team grew very slowly. As befits a small startup, the recruitment process was very picky, and we were only looking for the best engineers. Over the course of a year I recruited only 4 senior engineers. While this is very slow, at this stage of the company it was very important to pick only the best engineers we could find, as this core engineering team would help build and shape the Guild and the company’s engineering culture in the future.

At this point, with around 10 engineers, we were pretty much a functional team, where everybody knew almost everything, and I could move people from project to project according to the company’s priorities.

As we continued to grow (doubling the number of people every year), we saw that we were very good at focusing our efforts on the areas the company had decided to invest in at that point, but we were neglecting other existing products, which had to compete for shared engineering resources without any priority.

At this point we realized that we needed dedicated engineers for each product group (at least for the big ones). We still didn’t have a name for it, but I had essentially assigned some developers to be dedicated to certain products while the others remained shared resources.

As Wix continued its growth, we had different groups of people working on different projects who were less engaged with each other. So we started to formalize our engineering culture. While we always had a strong ownership and DevOps culture, we became more and more involved in knowledge-sharing activities in order to keep our engineering teams on the cutting edge and to learn from each team’s experience.

At this point we started to have discussions about how to structure the company. We looked around and found the Spotify paper. We realized that while we didn’t have a name for our current structure, it resembled what Spotify had. So we adopted some of the naming and agreed that we should work in a form of Guilds, which are defined by a profession, and Gangs, which are the product teams.
Initially we only had the engineering Gangs dedicated to a product, with all the others as shared resources across products.

This was the point where the role of the Guild had started to form.

The Guild is responsible for each person’s profession; thus the Guild has the following responsibilities:

  • Recruitment (hiring and firing)
  • Assignment to product teams according to the company’s priorities
  • Setting the professional guidelines
  • Training
  • Setting the engineers’ compensation (salary, bonuses, etc.)
  • Creating an engineering brand for the company
  • Being responsible for the professional development / careers of the engineers

As Wix continued to grow, we had more and more projects and product teams. What we realized then is that having dedicated engineering teams (Gangs) is not enough, because there was a bottleneck on the other shared resources. Also, we had multiple products with a common domain. We wanted to give as much independence as possible to each product domain / vertical.

So once more we had to evolve, and we created what we now call a “Company”. A Company is like a startup within Wix: it has all the resources it needs (developers, product managers, analysts, UX engineers, marketing, etc.) in order to progress fast and create the best product it can, independently of the other products at Wix.

At this point the Guild also had to take on more responsibilities. While we want the “Companies” to progress as fast as they can, they also have to keep alignment with Wix as a whole. Another issue is that we expect these “Companies” to create products that compete in the free market with other startups and big companies, but with limited resources.

The Guild now plays a big role in enabling the success of the “Companies” within Wix. If each “Company” had to develop everything on its own (frameworks, deployment, infrastructure, monitoring, etc.), it would not stand a chance competing with entire companies that build the same product with more resources. So the Guild took on another responsibility: taking care of all the infrastructure, the deployment pipeline, and the core services that all the “Companies” share. For instance, if we see a service that is needed by more than two “Companies” (for example, a mailing service), we develop it in the Guild (which has its own core services teams), and all the other “Companies” can use this service, thus focusing only on the product itself and not having to worry about the infrastructure.

In order to keep alignment across the “Companies”, and to make it easier for engineers to move between “Companies” and to share knowledge and best practices, all the “Companies” share the same infrastructure and methodologies. This is a tradeoff between freedom and velocity: you lose some freedoms but gain a lot of velocity, as many of the things you need for your service are already there for you.

Now, a “Company” may decide (in coordination with the Guilds) that the existing infrastructure is the wrong solution for the product it owns and that it wants to develop on a different stack. It can do that; however, it will need to take full responsibility for the whole application lifecycle, deployment, monitoring, and integration with the rest of the Wix ecosystem. This is a lot of work, and time to market will usually be very long when you have to develop all the infrastructure on your own, so almost every “Company” opts in to the current infrastructure, although we have several cases where it was the right decision to develop a product on a different stack.

If I were to describe the line of responsibility between a “Company” and a Guild: the “Company” decides what to do, and the Guild says how to do it.

So now that we have “Companies” and Guilds, the Guild assumes more responsibilities in addition to the above:

  • Aligning the “Companies” (the Guilds are horizontal, while the “Companies” are vertical)
  • Supporting the engineers working in the “Companies”
  • Review and guidance
  • Developing shared infrastructure
  • Improving development velocity
  • Temporarily helping “Companies” in need with additional resources from the Guild

Guild masters:
Guild masters are senior engineers whose responsibilities include supporting engineers in different “Companies”. Guild masters conduct reviews, training, and mentoring. Since they are horizontal and work with many “Companies”, they identify common issues and code duplication between “Companies”, understand the development bottlenecks, and try to solve them. Because of this they also cross-pollinate the “Companies”, bringing in best practices and lessons learned from other “Companies”.

Guild activities:
In order for the Guild to be able to take on these responsibilities, it needs developers’ time, so at Wix 20% of engineering time is dedicated to Guild activities.

Every Thursday we have a Guild day, in which the Guild conducts training activities and Guild tasks. All the engineers from all the “Companies” assemble in one place for the Guild day.

Here is the back-end guild day schedule:
10:00-11:00 – Guild retrospective, in which we discuss engineering dilemmas and lessons learned from across the “Companies”.
11:00-11:15 – Break.
11:15-11:30 – Project spotlight, where someone presents a new project being worked on, lessons learned, and challenges they faced.
11:30-13:00 (usually not the whole 1.5 hours is needed) – Tech talk, which, if it does not contain any sensitive information, is also open to the public as a meetup.
13:00–EOD – Lunch and Guild tasks. (The Guild tasks are called “Games of Gangs”, which we’ll discuss in another post.)

7/18/2015

Building a Scalable and Resilient Architecture

Filed under: — Aviran Mordo

This article is a summary of my DevoxxUK talk about microservices:

Like many startups before us, Wix.com started as a monolith application, which was the best architectural solution when we had no scalability and availability concerns. But as time went by and our small startup grew and gained success, it was time to change the architecture from a monolith—which experienced many scalability and stability issues—to a more resilient and scalable architecture.

However, every time you build a scalable system you have to make some tradeoffs between availability, performance, complexity, development velocity, and many more, and you really need to understand your system in order to make the right tradeoffs.
Defining System Architecture and Service Level

These days, microservices are a hot topic. But it is not enough to simply build microservices, you also need to understand the boundaries of each microservice. There are many vague claims about how to determine the boundary and size of a microservice, from “you should be able to describe what your microservice does in one line,” to “it should be the size of the team that supports it.” But there is no correct answer. We find that a good rule of thumb (for services that have databases) is that a service should directly access only a couple of database tables to operate.

One very important guideline we set, which helps us determine the boundaries, is based on the service level (SL) needed for each microservice. When we analyzed our system to see how users interact with Wix, we saw two main patterns. Based on these patterns, we defined two different service levels: one for editing sites (Editor segment) and the other for viewing sites (Public segment).

The Public segment supports viewing websites (it is mostly read-only). We defined the Public segment to have higher service-level requirements, because it is more important that a website be fast and available. The Editor segment is where we have all the microservices responsible for website authoring and management. The Editor segment, while important, does not share the same service-level requirements, because its impact is limited to the site owner editing his site.

Every microservice we build belongs to one of these two segments. Having defined these two different SLs, we also architected our microservices’ boundaries according to them. We decided that the Editor segment should work across two data centers in an active-standby configuration, where only one data center receives the data writes. For the Public segment, however, we insist on having at least two active data centers (we actually have three), in which all data centers receive traffic all the time.

Once we set this definition, and because the Public segment serves mostly read-only data, it became easier to scale the microservices in the Public segment. When a user publishes his site on the web, we copy the data we need from the microservices in the Editor segment to the microservices in the Public segment, denormalizing it to be read-optimized.
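
As a deliberately simplified sketch of that publish flow (all type and method names here are hypothetical, not Wix’s actual APIs), the site is read from the Editor-segment store and rewritten, one read-optimized row per page, into the Public-segment store:

```scala
// Hypothetical sketch of publish-time denormalization between segments.
final case class PublicPage(siteId: String, pageId: String, renderJson: String)

trait EditorStore {
  def loadPages(siteId: String): Map[String, String] // pageId -> editor JSON
}

trait PublicStore {
  def put(page: PublicPage): Unit // read-optimized, keyed by (siteId, pageId)
}

object Publisher {
  def publish(siteId: String, editorStore: EditorStore, publicStore: PublicStore): Unit = {
    // Denormalize: one read-optimized row per page, so the Public segment
    // can serve any page with a single primary-key lookup and no joins.
    editorStore.loadPages(siteId).foreach { case (pageId, json) =>
      publicStore.put(PublicPage(siteId, pageId, json))
    }
  }
}
```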

As for the Editor segment, because it has a lower availability requirement, writing to just one location is a much simpler problem to solve than writing to multiple locations and replicating the data (which would require us to resolve all kinds of failure-inducing conflicts and handle replication lag). In theory we designed most of our system to be able to write concurrently to two data centers; however, we have currently decided not to activate this, as it requires a lot of operational overhead.

Working with Multiple Cloud Vendors

As part of our Public SL, which requires working in at least two data centers, we also set a requirement for ourselves to be able to work with at least two cloud providers. The two dominant providers that are capable of working at the scale we need are Google and Amazon (we have some services running on Microsoft Azure too, but this is out of scope for this post).

The important lesson we learned by moving to the cloud is that the first thing to do is to invest in the write path—i.e., writing data to the cloud service. Just by writing the data, we discovered many problems and limitations of the cloud providers: for instance, throttlers, data consistency issues, and eventually consistent systems, which may take a long time to regain consistency on some occasions.

Eventually consistent storage for uploaded files presented a big challenge for us, because when a user uploads a file, he expects the file to be downloadable immediately. So we had to put caching mechanisms in place to bridge the lag between the moment the data is written and the point at which it is available to read. We also had to use a cache to overcome throttlers that limited the write rate, and we had to use batch writes as well. The read path is relatively easy; we just needed adapters for each underlying storage.
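
As an illustration of the caching workaround (all names hypothetical), a freshly uploaded file can be kept in a short-lived local cache so reads succeed even while the eventually consistent store has not caught up yet:

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical sketch: serve recent uploads from a local cache while the
// eventually consistent cloud storage catches up.
trait CloudStorage {
  def write(key: String, bytes: Array[Byte]): Unit // may lag before readable
  def read(key: String): Option[Array[Byte]]
}

final class RecentUploadCache(storage: CloudStorage) {
  private val recent = TrieMap.empty[String, Array[Byte]]

  def upload(key: String, bytes: Array[Byte]): Unit = {
    storage.write(key, bytes)
    // Keep the bytes locally so the file is downloadable immediately.
    // (TTL-based eviction is omitted from this sketch.)
    recent.put(key, bytes)
  }

  // Prefer the durable store; fall back to the cache during the lag window.
  def download(key: String): Option[Array[Byte]] =
    storage.read(key).orElse(recent.get(key))
}
```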

We started with Google Cloud Storage. Once we overcame all the problems with Google’s platform, we began the same process on Amazon by developing a data distribution system that copied data from one cloud provider to another. This way the data is constantly replicated between two different vendors, and we avoid a vendor lock. Another benefit is that in cases where we have issues with the availability or performance of one cloud, we can easily shift traffic to the other, thus providing our customers with the best service possible—even when the infrastructure is out of our control.

Building Redundancy

With this approach of multiple vendors and data centers, we also build a lot of redundancy and fallbacks into our Public segment to reach a high level of availability. For the critical parts of our service, we always employ fallbacks in case there is a problem.

Databases are replicated in and across data centers, and as mentioned previously, our services are running in multiple data centers simultaneously. In case a service is not available for any reason, we can always fall back to a different data center and operate from there (in most cases this happens automatically by customizing the load balancers).

Creating Guidelines for Microservices

To build a fast, resilient, and scalable system without compromising development productivity, we created a small set of guidelines for our engineers to follow when building a microservice. Using these guidelines, engineers consider the segment the microservice belongs to (Public or Editor) and assess the gains versus the tradeoffs.

Each service has its own schema (if one is needed)
Gain: Easy to scale microservices based on SL concerns
Tradeoff: System complexity; performance
Only one service should write to a specific DB table(s)
Gain: Decoupling architecture; faster development
Tradeoff: System complexity; performance
May have additional read-only services that access the DB if performance is an issue
Gain: Performance
Tradeoff: Coupling
Microservice processes are stateless
Gain: Easy to scale out (just add more servers)
Tradeoff: Performance; consistency
Microservice should be independently deployable
Cache is not a building block of a service, but an optimization to a real production performance problem.

Scaling with Simplicity Using MySQL

When building a scalable system, we found that an important factor is using proven technology so that we know how to recover fast if there’s a failure.

One good example is using databases. You can use the latest and greatest NoSQL database, which works well in theory, but when you have production problems, you need to resume activity as fast as possible. Already having the knowledge of how the system works, or being able to find answers on Google quickly, is very important. This is one reason we usually default to using a MySQL database instead of opting for NoSQL databases, unless NoSQL is a better solution to the problem.

However, using MySQL in a large-scale system may have performance challenges. To get great performance from MySQL, we employ a few usage patterns, one of which is avoiding database-level transactions. Transactions require that the database maintain locks, which has an adverse effect on performance.

Instead, we use logical application-level transactions and avoid any database transactions, thus extracting high performance from the database. For example, let’s think about an invoicing schema. If there’s an invoice with multiple line items, instead of writing all the line items in a single transaction, we simply write line by line without any transaction. Once all the lines are written to the database, we write a header record, which has pointers to the line items’ IDs. This way, if something fails while writing the individual lines to the database, and the header record was not written—as it marks the finalization of the transaction—then the whole transaction fails. The one tradeoff is that you may get orphan rows in the database, which isn’t a significant issue because storage is cheap and you can clean these rows later if you care about the space.

We also use MySQL as a NoSQL database, simply as a key-value store. We store a JSON object in one of the columns, which allows us to extend the schema without doing database schema changes. Accessing MySQL by primary key is extremely fast, and we found that MySQL is a great NoSQL when you also have consistent writes.

Summary

When developing a large-scale system, everything is a tradeoff. You need to consciously decide which tradeoffs you are willing to make. But in order to do that, you must first understand your system and set the business service level and requirements. This will affect your decisions and architecture.

You can find out more in Yoav Abrahami’s post here, and on SlideShare.

Also, here is a link to the Original Post on Voxxed.

6/24/2014

Wix.com Surpasses 50 Million Users Worldwide!

Filed under: — Aviran Mordo

Wix.com Ltd. (Nasdaq:WIX), a leading global web development platform, announced today that its worldwide user base had surpassed 50 million registered users. The milestone followed a record first quarter of 2014 and was largely driven by the company’s continued focus on product development, which resulted in the release of over 150 new features, advanced design capabilities, and apps since the beginning of the year.

“Back in 2006, my co-founders and I tried to build a website for another business venture. There wasn’t a solution out there that could meet our needs, so we founded Wix. Today 50 million users have proven that our need was also theirs,” said Avishai Abrahami, Wix Co-Founder and CEO. “From the get go everything we did was shaped and guided by our users’ needs. Providing the best product in the market and listening to our users has brought us this far. Continuing to do so will take us to new heights.”

Wix’s mission is to bring technologically advanced and function rich solutions to all users, regardless of their technical ability or budget. With a powerful drag-and-drop website editor at its platform’s core, the company has continued to expand its offering by introducing cutting edge mobile solutions, a vibrant App Market enabling 3rd party app integration, eCommerce capabilities, a host of business management tools and more.

Wix users are rapidly adopting products as fast as they’re being rolled out, as demonstrated by the 12 million apps installed on users’ websites since the Wix App Market’s launch and the over 3 million mobile websites built with Wix to date. In line with users’ needs, and in keeping with the company’s mission to provide comprehensive solutions to its users, Wix recently launched two platform advancements: the WixHive API and Mobile Sonic Technology.

The WixHive API will allow formerly standalone applications to share gathered data giving site owners powerful new capabilities. The Mobile Sonic Technology will ensure that mobile sites created with Wix load quickly, catering to the growing market demand for on-the-go accessibility.

“The key to reaching 50 million users is developing innovative tools that would typically require having expert coders or designers on deck,” said Nir Zohar, Wix President and COO. “We’re bringing enterprise level capabilities to every business no matter how small, which makes Wix the go-to destination to build, manage and grow a business online.”

Full disclosure: Aviran Mordo is the head of back-end engineering at Wix
