Updated: May 2
By: Aviran Mordo - VP Engineering, Wix.com
Platform Engineering is something I’m personally very excited about, and at Wix we have been practicing it for over a decade, since before the name “platform engineering” was even coined. In this article I’ll explain some of the things our teams at Wix Engineering have deployed and implemented over the past couple of years to take platform engineering to the next level.
We’ll dive into some new platform projects we’ve been working on to give some insights on our vision for the future.
To provide some context, Wix is a leading website building platform with over 240 million registered users (website builders) from 190 countries. Our platform is deployed in 3 regions and more than 20 points of presence globally. At Wix we employ over 5,000 people, about half of whom are in R&D. So it is safe to say that engineering is at the core of what we do and the core driver of our business value.
The need for speed
Successful software businesses deliver high-quality code fast. However, this comes with many challenges related to size. As companies grow older and larger, the software development process slows down due to increased dependencies, legacy code and complexity.
Ever since computer science and engineering have existed, there have been multiple approaches to organizing work and finding ways to achieve better velocity and quality; the latest trends are Continuous Delivery, DevOps and serverless.
As the popularity of methodologies changed, the tools we used changed, as well. These evolved to help companies deliver software faster - a push influenced much by the sheer number of tech startups and their growth. To support that growth, we saw the emergence of cloud hosting / computing providers enabling quick access to servers in vast data centers, thus reducing operational overhead and cost.
Then came the philosophy of microservices, allowing companies to scale up even faster and maintain their software much “easier”. Finally, we have serverless, which takes away even more of the overhead and the need to maintain production servers. As the name implies, with serverless, you don’t really need to deal with maintaining servers anymore. It takes away the need to care about scaling, configuring servers, dealing with K8s and more, thus leaving you with even less operational overhead.
How microservices affect dev velocity
If you’re just starting and you are building a single service (Monolith), things are fairly easy. You can use any framework of your choice, and utilize all the tools at your disposal. You are able to go very quickly.
But as you scale, things tend to get complicated quickly, since you aren’t actually building just one service. You are building something that integrates with other services and systems - and that thing now needs to connect to other services via RPC/REST, databases, message brokers, and so on. Now you have dependencies that you need to operate, test and maintain. You now need to build some kind of framework around all of this.
And yet, with a handful of microservices, this is still a fairly simplistic architecture. As you continue to grow and succeed, you add more and more microservices, and you get to something like the image below. This is Wix’s microservices map; each rectangle is a cluster of a microservice and the horizontal and vertical lines are the communication between them.
Microservices communication map
As you can see, having many microservices also implies that many teams are working on them. Since all your microservices need to work in harmony, you need to think about the concerns they all share - you need them to speak the same “language”, to have the same protocols and contracts. For example, how do I handle cookies and security? And what about RPC calls, HTTP headers and logs?
As a matter of fact, here’s a quick non-exhaustive list of the top things you may now have to worry about:
What do we do then? We create standards via a common framework, a common language, to make sense of it all, otherwise every developer will need to take care of all these cross cutting concerns.
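One common shape for such a shared framework is a middleware pipeline that every service gets for free. Below is a minimal TypeScript sketch of the idea; all names (`makeService`, `withRequestId`) are illustrative assumptions, not Wix’s actual framework.

```typescript
// Minimal sketch of a shared framework handling cross-cutting concerns.
// All names here are illustrative assumptions, not Wix's actual API.
interface RpcRequest { path: string; headers: Record<string, string>; body?: unknown }
interface RpcResponse { status: number; body?: unknown }
type Handler = (req: RpcRequest) => RpcResponse;

// Each middleware encapsulates one cross-cutting concern.
type Middleware = (next: Handler) => Handler;

let reqCounter = 0;
const withRequestId: Middleware = (next) => (req) => {
  // Ensure every request carries a correlation id for distributed tracing.
  req.headers["x-request-id"] ??= `req-${++reqCounter}`;
  return next(req);
};

const withLogging: Middleware = (next) => (req) => {
  const res = next(req);
  console.log(`${req.path} -> ${res.status}`); // uniform access logging
  return res;
};

// The "common framework" composes the shared concerns once, for every
// service, so individual teams never reimplement them.
function makeService(handler: Handler): Handler {
  return [withLogging, withRequestId].reduce((h, mw) => mw(h), handler);
}

const echo = makeService((req) => ({ status: 200, body: req.body }));
```

With this shape, adding a new shared concern (say, auth headers) means adding one middleware in one place, and every service picks it up on its next deploy.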
To make things worse, in a DevOps culture you also expect developers to be proficient in maintaining the production environment: provisioning machines, looking after databases and Kafka clusters, firewalls and load balancers, making sure autoscaling is properly configured, building monitoring dashboards, understanding distributed systems and request flows, replicating data across regions, and so on.
The iceberg of microservice architecture
Let’s look at one example service and the layers it is built upon. At its base there’s the computing power - a virtual machine. It runs inside a container, and sitting on top of it is the microservice / application framework - for instance, Spring in the JVM world, Express for Node.js, whatever it is you work with. And then, on top of that, you build your trusted environment framework - a layer that makes it possible for all of your underlying services to communicate in the same way: via the same protocols, using the same HTTP headers, working with the same encryption / decryption algorithms, and so on. That way they can be used, and trusted, by all of the other services on the network.
Now after having all these layers, next comes the actual work required from software developers. On the very top of this pyramid, you build the service that brings business value that your company actually sells to customers.
For us engineers, this is just the tip of the iceberg. There’s the whole bottom part of it which we need to take care of before we can even deal with the top part, which is the business logic code. Unfortunately, the overhead does not end here. In addition to the actual product features you need to develop, developers also need to take care of regulations, business and legal concerns like GDPR compliance, for instance. This is not something your company actually sells, but nevertheless, you are required to develop it with each new service you build.
That was a long intro, but it explains the magnitude of developing large systems. Now let’s take a breath and talk about…
Platform as a Runtime (PaaR)
Usually when building a microservice, you use your internal frameworks and libraries that are compiled with your code, giving you most of the common functionality like gRPC, Kafka client, connection pools, A/B testing, etc. However, with this approach, you end up with a distributed common framework, where you often need to update ALL the microservices in order to have (close) version parity across all your services.
It is a real challenge to constantly update all your services to the latest framework version, or to adopt semantic versioning, which brings its own set of issues (out of scope for this article).
So what can we do about this dependency overhead?
One solution is to change the framework dependency from a build-time to a runtime dependency. But at Wix we have taken it a step further. There are other concerns when building an internal framework besides the communication protocols and contracts we mentioned before: common business flows, common legal concerns (GDPR, PII, etc.), a common tenancy model, permissions, identity management and so on.
So we took all these business concerns and flows and added them as a runtime dependency, ending up with a Platform as a Runtime (PaaR).
Let me explain what adding them to the PaaR means. Every microservice running in the PaaR now gets these concerns handled automatically, without any development needed in each service. For instance, with GDPR, all services automatically handle “get my data” and “forget me” requests, saving precious development time in each service.
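As a rough sketch of how a platform can take GDPR off each team’s plate: services only declare where their user data lives, and the platform fans out “export my data” and “forget me” to every registered store. The registry API below is a hypothetical illustration, not Wix’s implementation.

```typescript
// Hypothetical sketch (not Wix's actual API): the platform keeps a registry
// of per-service user-data stores; a single GDPR flow fans out "export" and
// "forget me" so no individual service implements them.
interface UserDataStore {
  exportData(userId: string): unknown;
  forget(userId: string): void;
}

const registry = new Map<string, UserDataStore>();

// A service only declares where its user data lives...
function registerService(name: string, store: UserDataStore) {
  registry.set(name, store);
}

// ...and the platform handles the GDPR flows for every registered service.
function gdprExport(userId: string): Record<string, unknown> {
  const result: Record<string, unknown> = {};
  for (const [name, store] of registry) result[name] = store.exportData(userId);
  return result;
}

function gdprForget(userId: string) {
  for (const store of registry.values()) store.forget(userId);
}
```

The key point is that the flow lives in the platform once; a new service gets GDPR handling by registering, not by writing code.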
How did we do it?
We started with our very own serverless platform. At its core lies Node.js, supporting the entirety of the application framework. Why Node.js? It is lightweight, supports dynamic code loading, and is easy to learn.
Step one: We took that and coded our entire framework into the Node.js server, together with an integration layer for gRPC / REST / Kafka (including a discovery service), and made it the “runtime server”.
So now we have a cluster of identical “runtime” servers that have the same functionality (the frameworks).
Step two: We added another layer for data services, allowing developers to easily and quickly connect with databases (the runtime handles all the connection strings, connection pools, JDBC, etc.)
What we ended up with is a “smart” container that can handle ingress traffic and have all the common concerns embedded into it but without having much business logic.
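The data-services idea can be sketched like this: guest code asks the runtime for a collection by name, and the runtime owns connection strings and pooling behind that call. In this toy version (all names assumed), an in-memory map stands in for the real database.

```typescript
// Toy sketch of a data-services facade. In the real platform, collection()
// would hand back a pooled database client resolved from configuration;
// here an in-memory map stands in for the database.
interface Collection<T> {
  insert(id: string, doc: T): void;
  get(id: string): T | undefined;
}

class Runtime {
  private stores = new Map<string, Map<string, unknown>>();

  // Guest code only names the collection; connection strings, pools and
  // drivers are the runtime's problem.
  collection<T>(name: string): Collection<T> {
    if (!this.stores.has(name)) this.stores.set(name, new Map());
    const store = this.stores.get(name)! as Map<string, T>;
    return {
      insert: (id, doc) => void store.set(id, doc),
      get: (id) => store.get(id),
    };
  }
}
```

Because the facade is resolved at runtime, swapping the backing database is a platform change, invisible to every guest.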
What happens now is that developers can build the business logic using the runtime dependency (serverless) without the framework being compiled and packaged with their code, since it is provided as the code is deployed into the running process.
As developers work on the code locally, a time comes when it is ready to be deployed. In the serverless ecosystem, you don’t need to worry about Node.js or containers. You just deploy your TypeScript file (if it is a single function) or bundle, and that file gets dynamically loaded into the platform’s runtime!
Since we already have the service integration layer in the runtime - when you need to make a call to another service, you don’t need to do lookup/discovery. You simply declare what you need and the platform provides you with the right client to connect to your dependent service.
Since all the integrations are done via the “runtime”, you are also saved from writing integration tests for glue code: the integration itself lives in the “runtime”, and you just get a client library to use in your code.
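Declarative dependencies might look roughly like the sketch below: business code names the services it needs and receives ready-made clients, with discovery hidden inside the runtime. `useServices` and the in-process service table are assumptions for illustration only.

```typescript
// Sketch of declarative service dependencies: instead of doing lookup or
// discovery yourself, you declare what you need and the runtime injects a
// ready-made client. All names here are hypothetical.
type Client = (method: string, payload: unknown) => unknown;

// Stand-in for the runtime's discovery + client factory.
const serviceImpls: Record<string, Client> = {
  "orders-service": (method, payload) =>
    method === "getOrder" ? { id: payload, status: "shipped" } : null,
};

function useServices(names: string[]): Record<string, Client> {
  const clients: Record<string, Client> = {};
  for (const n of names) {
    const impl = serviceImpls[n];
    if (!impl) throw new Error(`unknown service: ${n}`); // fail fast at deploy time
    clients[n] = impl;
  }
  return clients;
}

// Business logic only declares what it needs:
const { "orders-service": orders } = useServices(["orders-service"]);
```

Since the runtime hands back a working client, there is no discovery code, and no glue code, in the service itself to integration-test.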
A major benefit is that you deploy just the code you need, without bundling it with common libraries and the Node.js runtime. This makes deployables extremely small, because they contain only the actual business logic you write (usually less than 100 KB in size).
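To make the “only business logic” point concrete, an entire deployable in such a platform can be a single small file like the one below. The export signature is an assumption for illustration, not Wix’s actual contract.

```typescript
// An entire deployable "function": no framework imports, no bootstrap code,
// because the runtime host provides all of that. Signature is hypothetical.
export function calculateDiscount(order: { total: number; coupon?: string }) {
  // Pure business logic: apply a 10% discount for the VIP10 coupon.
  const rate = order.coupon === "VIP10" ? 0.10 : 0;
  return { total: order.total * (1 - rate) };
}
```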
Separating project structure from deployable topology
Up until now what I have described resembles standard lambda functions (although with Lambda functions you do need to bundle your framework with your deployable). But what happens when you want to add another function? With Lambda you would need another instance of the whole setup, or to package the two together. But since we are running in a trusted environment, we are able to run multiple functions inside the same process. Basically, we can develop independent microservices or even nano-services (functions), but deploy all the functions to run as a monolith in the same process. Alternatively, we can split the function deployments across different PaaR (Platform as a Runtime) hosts, mimicking microservices.
We can also do it the other way around. Develop the software as a microservice or even a monolith, but deploy parts of the service to different PaaR hosts.
A classic example is when you have a service that has two endpoints. One is an RPC endpoint that needs to react to a user’s request, and another endpoint that listens to Kafka topics. We can develop the two functions in the same project so that the developer thinks about these two endpoints together and as if they are part of the same “logical” service, but deploy the RPC function to one host and the Kafka listener to a different host.
This enables different scaling strategies. We could autoscale the customer-facing RPC function differently than the Kafka listener, without the developer having to carry the cognitive load of separating these into two different services, with all the framework overhead that comes with it.
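A sketch of what that separation could look like in code: both functions live in one project, and a hypothetical deployment manifest (not Wix’s real format) pins each to its own PaaR host with its own autoscaling policy.

```typescript
// Two functions of one "logical" service, developed together in one project.
// The deployment manifest below is a hypothetical format for illustration.
export function getCartRpc(req: { userId: string }) {
  // User-facing RPC endpoint: latency-sensitive.
  return { userId: req.userId, items: [] as string[] };
}

export function onOrderEvent(event: { orderId: string }) {
  // Kafka listener: throughput-sensitive, tolerant of latency.
  console.log(`processing order ${event.orderId}`);
}

// Deployment topology is declared separately from the code structure,
// so each function can land on a different PaaR host and scale on its own.
export const deployment = {
  getCartRpc:   { host: "paar-web",   autoscaleOn: "latency" },
  onOrderEvent: { host: "paar-async", autoscaleOn: "consumer-lag" },
};
```

The developer still reasons about one service; only the manifest decides whether the functions are co-located or split.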
At Wix we built the system in a way that deploying a function is very fast and easy. After a developer pushes their function to a GitHub repository, it takes about a minute for that code to deploy and run on production, since only a small portion of the code is shipped (only the business logic, without boilerplate and bundled common libraries).
Platform Engineering Mindset
What did we achieve by doing this?
Integrations became very easy - a developer can simply declare which functions/microservices they want to call and they “automagically” get a client library to work with it.
There is less code to test - no integration tests for glue code is needed.
Developers can write and deploy very small functions focused on just the essence of the business logic they are writing.
The deployment is fast - the code is small and no framework/common libraries packaging is required.
There is zero boilerplate - all integrations are preconfigured and provided for a developer to use by declaration only.
Vision for the future
Next, how can we use what we have so far and build on top of that? Say we want to give separate teams the responsibility of managing their own runtimes. We can do that by cloning the setup I described above so that, for example, ecom and blog teams could each have their own cluster of runtimes to work with. That way, if needed, code from different teams isn’t pushed to the same runtime. All the while, you are still able to call functions between runtimes.
In this context, one thing we will want to have at some point is to be able to optimize the affinity of a given function based on its runtime. What do I mean by that? Let’s say we have 2 runtimes, each containing several functions:
As you can see from the diagram above, Function 2 and Function 5 frequently interact. But since they are deployed on different runtimes the network calls are slowing things down. What if we could have the system automatically deploy Function 5 to the first runtime and Function 2 to the second one, so that in the end we would get a picture like this:
This way, by organizing and optimizing the functions at runtime instead of at build time, we can have a highly optimized runtime environment with minimal network calls between microservices. This is in contrast to today, where we couple the domain design and the development environment with the runtime topology.
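A toy version of such an affinity optimizer: given cross-function call counts and a current placement, move each function to the runtime it talks to most. A production system would also need capacity and stability constraints; this greedy sketch only shows the core idea.

```typescript
// Greedy affinity sketch: co-locate functions that call each other a lot.
type Placement = Record<string, string>;       // function -> runtime
type Calls = Array<[string, string, number]>;  // [from, to, call count]

function optimize(placement: Placement, calls: Calls): Placement {
  const next = { ...placement };
  for (const fn of Object.keys(placement)) {
    // Tally this function's traffic toward each runtime under the
    // current (partially updated) placement.
    const traffic: Record<string, number> = {};
    for (const [a, b, n] of calls) {
      if (a === fn) traffic[next[b]] = (traffic[next[b]] ?? 0) + n;
      if (b === fn) traffic[next[a]] = (traffic[next[a]] ?? 0) + n;
    }
    // Move the function to the runtime it exchanges the most calls with.
    const best = Object.entries(traffic).sort((x, y) => y[1] - x[1])[0];
    if (best) next[fn] = best[0];
  }
  return next;
}
```

Run on the example from the diagram (a hot pair split across runtimes), the optimizer ends up placing the two chatty functions on the same runtime, turning their network calls into in-process calls.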
The future is here with “Single Runtime”
Another future development we are excited about is expanding into other programming languages. In my opinion, this requires a paradigm shift: instead of building a polyglot system and rebuilding the same framework once per tech stack your company supports, use a single runtime that supports any language.
We are working on achieving that by splitting the “Iceberg” in two: the single runtime framework, the service integration layer and the data services layer become the “Host”, while the integration layer to the host plus the business logic become the “Guest”. The idea is to develop the application framework only once, instead of constantly trying to achieve feature parity between frameworks in different programming languages. An obvious advantage of this approach is that framework updates run only on the host; you don’t have to duplicate the framework / runtime for every language / dev stack you support.
The disadvantage is that the actual business logic (guest) does not run in-process; host and guest communicate out of process (on the same localhost). We are still experimenting with GraalVM to see if we can run multiple guests (different languages) in-process with the host, but for now we had an easier time, and an actually working system, with two separate processes.
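The host/guest boundary can be sketched as a language-agnostic serialization protocol. In this simplified version the “localhost hop” is just a function call, but the point stands: host and guest exchange only serialized messages and share no in-process state. All names are illustrative.

```typescript
// Sketch of the host/guest split, assuming a simple JSON protocol over
// localhost. The transport here is a direct call for simplicity; in the
// real design it would be an out-of-process hop on the same machine.
type GuestHandler = (payload: string) => string; // language-agnostic boundary

// Guest side: pure business logic behind a serialization boundary.
// It could be written in any language that speaks the wire format.
const guest: GuestHandler = (payload) => {
  const req = JSON.parse(payload) as { fn: string; args: number[] };
  const result = req.fn === "sum" ? req.args.reduce((a, b) => a + b, 0) : null;
  return JSON.stringify({ result });
};

// Host side: owns all integrations, and forwards invocations to the guest.
function hostInvoke(fn: string, args: number[]): unknown {
  const reply = guest(JSON.stringify({ fn, args })); // would be a localhost hop
  return (JSON.parse(reply) as { result: unknown }).result;
}
```

Because only serialized messages cross the boundary, the host framework can be updated, or the guest rewritten in another language, without touching the other side.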
The road so far
The success of this approach is evident in how much our developers love it. At the core is the Platform Engineering mindset: we took a lot of complex concerns and made them easy. Wix developers are now able to build in a matter of hours things that used to take them days or even weeks.
Essentially, we took Platform Engineering from a low-level developer portal to a full-blown PaaR, taking away a lot of the cognitive load of low-level production maintenance. Stay tuned for more as we detail the Wix “Single Runtime” project.
To learn even more, watch “Beyond Serverless and DevOps”: