Continuous delivery is the next step after continuous integration. When you practice continuous delivery, application deployments should be easy and automated. Rollbacks should be just as easy to perform, and automated too. However, automated deployments and rollbacks can cause unexpected problems if you are not careful. One such problem brought our production system down. Here is what happened.
At Wix.com we built an automated deployment system based on Artifactory and Chef. It works like this: every few minutes a Chef script checks whether the latest version in Artifactory is the one deployed in production. If the artifact version is different from the one deployed in production, Chef gets the WAR from Artifactory and deploys it to all the appropriate machines.
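To make that check concrete, here is a minimal sketch in Python. It is not our actual Chef script; the function name `reconcile` and the `deploy` callback are hypothetical and only illustrate the logic described above. Note that the comparison is a plain "is it different?" test, so it has no notion of newer or older.

```python
from typing import Callable, Iterable, List


def reconcile(
    latest_version: str,
    deployed_version: str,
    hosts: Iterable[str],
    deploy: Callable[[str, str], None],
) -> List[str]:
    """Deploy latest_version to every host if it differs from deployed_version.

    Returns the list of hosts that were updated (hypothetical helper).
    """
    # Nothing to do when the repository and production already agree.
    if latest_version == deployed_version:
        return []

    updated = []
    for host in hosts:
        # Any difference triggers a deploy, even if "latest" is actually older.
        deploy(host, latest_version)
        updated.append(host)
    return updated
```

With a check like this, a repository that suddenly reports an older version looks exactly like a new release.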
We decided that Artifactory should have a replica in case one instance goes down, so we installed an old backup of Artifactory at a secondary location and created a script to replicate the master to the slave. Now you can probably guess what happened.
We had a bug in the script: it treated the master as the slave and the slave as the master. The backed-up Artifactory was from the previous month. As a result, the replication ran the wrong way and both repositories rolled back to a state that was a month old.
Since Chef monitors the repository and found that the artifact versions were different from the ones in production, it deployed ALL the artifacts, causing our entire production system to go back in time (yay, we invented a time machine).
You can guess how much fun it was to bring the whole production system back to the future.
Now of course we are in the process of putting some safeguards in place so it won’t happen again.
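As an illustration only, here is one kind of safeguard such a system could use: refuse to deploy a version that is older than the one in production unless a rollback was explicitly requested. This is a hypothetical sketch in Python, not a description of what we actually put in place. The names `parse_version` and `should_deploy` are made up, and the sketch assumes versions are dotted numbers such as 1.4.0.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted numeric version like '1.10.0' into (1, 10, 0)."""
    return tuple(int(part) for part in version.split("."))


def should_deploy(latest: str, deployed: str, allow_rollback: bool = False) -> bool:
    """Decide whether the repository version should replace the deployed one."""
    if latest == deployed:
        return False  # already up to date
    if parse_version(latest) < parse_version(deployed):
        # The repository went backwards: deploy only on an explicit rollback.
        return allow_rollback
    return True
```

Comparing parsed tuples rather than raw strings matters here, because as strings "1.10.0" sorts before "1.9.0".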
On the other hand, we learned one very positive thing from this experience. Even after our system rolled back a whole month, everything continued to work properly. That means we did things right by ensuring everything is forward and backward compatible, so no data was lost. The only loss was some features our users missed for a few hours.
You might also be interested in: The guide to continuous delivery