SEKOIA.IO processes almost a billion client events per day. That’s tens of thousands of log entries per second. Every single event has to be analyzed quickly and reliably by our detection pipeline, to detect cyber threats and react as soon as possible. You can imagine that breaking this pipeline, even for a few seconds, is out of the question.
Safely deploying changes to this high-throughput, low-latency workflow is a major challenge that we continuously try to solve.
Of course, we embrace the microservice architecture: production is handled by a large set of loosely coupled components. This approach allows us to easily scale and update parts of our infrastructure independently, but it also means that we have to handle a large number of deployments, frequently.
All our microservices run on Kubernetes, which is great for auto-healing and progressive rollouts. Kubernetes is smart, but Kubernetes is not that smart. Even combined with all the CI/CD and staging in the world, we can still deploy broken services. This has multiple implications:
- Deployments need to be triggered and overseen by a real person (the Site Reliability Engineering, or SRE, team in our case), to make sure that everything is going smoothly.
- Developers have to rely on the SRE team for deploying their code.
Both issues are harmful to productivity and velocity and can put stress on the people responsible for deployments.
Even with these measures, we sometimes faced issues when deploying updates: services that worked in the test environment broke in production when they were put under pressure, or simply didn’t work because of an asymmetric configuration.
So how can we avoid that? We needed a way to safely roll out updates to microservices, without having to fear that they could trigger an incident.
Fortunately, automation exists and can be applied to just about anything.
Canary deployments 🐦
Canary birds were used in coal mines to alert workers of carbon monoxide leaks, the “silent killer”. As canaries died of carbon monoxide way before mine workers did, they were, at the time, the only way of detecting a dangerous gas leak.
Nowadays, the term is used to describe any test subject, especially an inadvertent or unwilling one.
In the Ops field, canary deployments are exactly this: by transparently deploying the latest version of a service to a small percentage of users, you are able to see how the new version performs without breaking everything if the release turns out to be buggy. If that version performs well, you can gradually roll it out to more users. If it keeps on performing well, you get a pretty strong indication that it is suitable to replace the old one.
As Kubernetes provides an API to interact with deployments, it is well suited for the automation of this kind of operation.
As one can expect, quite a lot of different solutions trying to solve this problem already exist. However, we had some hard requirements before adopting any solution:
- We didn’t want to have to rewrite any of our Kubernetes stacks, because we have around 80 microservices.
- We didn’t want to install heavy dependencies into the Kubernetes cluster, as our CI/CD and production environments are already complex enough.
- The solution must work for Kafka-based workers, not only HTTP-based services. We don’t need HTTP-traffic balancing, session pinning and all that fancy stuff.
- Canary rollouts should happen only when deploying new Docker images, not changing RAM/CPU requirements or metadata.
With these requirements in mind, we started hunting for a solution, without much success. Solutions like Gloo or Flagger are very much HTTP-based, and Argo is a whole ecosystem that requires changes to the deployment stacks.
We could not identify a simple solution for Kafka-based workers. After a few days of reading documentation and trying out stuff on our test environment, it became clear that none of the existing solutions would fit our needs.
As any frustrated engineer would do, we decided to create our own solution, named Aviary (because it handles canaries all day long!).
What we were trying to achieve could simply be implemented this way:
- On init, Aviary duplicates the service’s deployment by adding the suffix -primary to its name, and scales the original deployment to 0.
- Then, it watches changes on the original deployment.
- When a change is detected, Aviary registers it and produces a diff between the old and the new deployment.
- The diff is processed to decide if the changes should be directly or progressively deployed. We don’t want to spread a deployment over several hours if we are just updating CPU requirements, for example.
- In the event of a direct deployment, the -primary deployment is instantly replaced with the new version and it stops there. In the case of a progressive deployment, a copy of the new deployment is created with the -canary suffix.
- The -canary deployment is progressively scaled up, while the -primary is proportionally scaled down. At each step, key metrics are fetched from Prometheus and Kubernetes to decide if the new version is performing well.
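The direct-versus-progressive decision in the steps above can be sketched in a few lines. This is only an illustration, not Aviary's actual code: the function names, the `make_manifest` helper and the `registry.example` image names are hypothetical, and the sketch assumes that a change to any container image is the only trigger for a progressive rollout.

```python
from typing import Any

Manifest = dict[str, Any]


def container_images(deployment: Manifest) -> dict[str, str]:
    """Map each container name to its image in a Deployment manifest."""
    pod_spec = deployment["spec"]["template"]["spec"]
    return {c["name"]: c["image"] for c in pod_spec.get("containers", [])}


def rollout_mode(old: Manifest, new: Manifest) -> str:
    """Return "progressive" when any container image changed, else "direct"."""
    if container_images(old) != container_images(new):
        return "progressive"
    return "direct"


def make_manifest(image: str, cpu: str) -> Manifest:
    """Build a minimal, hypothetical single-container Deployment manifest."""
    return {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": "worker",
                            "image": image,
                            "resources": {"requests": {"cpu": cpu}},
                        }
                    ]
                }
            }
        }
    }


current = make_manifest("registry.example/worker:1.4.0", "500m")
cpu_bump = make_manifest("registry.example/worker:1.4.0", "750m")
new_image = make_manifest("registry.example/worker:1.5.0", "500m")

print(rollout_mode(current, cpu_bump))   # direct
print(rollout_mode(current, new_image))  # progressive
```

Comparing only the image fields means that resource or metadata updates are applied immediately, which matches the requirement that canary rollouts happen only for new Docker images.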
A progressive rollout can end in one of two ways:
- If we reach a break-point where enough -canary instances were successful over a period of time, the -canary deployment is promoted to -primary and the original deployment stays untouched.
- If a number of instances fail, or report less than ideal metrics, the -canary deployment is abandoned and the original deployment is rolled back to what it was before the changes.
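As a rough sketch of the progressive phase and its two outcomes, the loop below scales a canary up step by step and decides between promotion and rollback. It is a simplification under stated assumptions, not Aviary's implementation: `scaling_steps`, `run_rollout` and the `is_healthy` callback are hypothetical names, and the callback stands in for the Prometheus and Kubernetes checks, which are not modeled here.

```python
from typing import Callable, Iterator


def scaling_steps(total: int, step: int) -> Iterator[tuple[int, int]]:
    """Yield (canary, primary) replica counts whose sum stays equal to total."""
    if total < 1 or step < 1:
        raise ValueError("total and step must be positive")
    canary = 0
    while canary < total:
        canary = min(canary + step, total)
        yield canary, total - canary


def run_rollout(
    total: int,
    step: int,
    break_point_pct: float,
    max_failures: int,
    is_healthy: Callable[[int], bool],
) -> str:
    """Return "promoted" or "rolled_back" for one simulated rollout."""
    failures = 0
    for canary, _primary in scaling_steps(total, step):
        # A real controller would scale both deployments here, wait for the
        # warm-up delay, and only then evaluate its metrics.
        if not is_healthy(canary):
            failures += 1
            if failures > max_failures:
                return "rolled_back"
        elif 100 * canary / total >= break_point_pct:
            return "promoted"
    # Fully scaled without a healthy check past the break-point: stay safe.
    return "rolled_back"


print(list(scaling_steps(10, 4)))                   # [(4, 6), (8, 2), (10, 0)]
print(run_rollout(10, 2, 50, 0, lambda c: True))    # promoted
print(run_rollout(10, 2, 50, 0, lambda c: c < 4))   # rolled_back
```

With 10 replicas, a step of 2 and a 50% break-point, a healthy canary is promoted once 6 canary instances are running; a single failed check with `max_failures=0` ends the rollout immediately.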
This logic enables us to have progressive canary rollouts and automatic rollback in production while making absolutely no change to our existing codebase. Our CI/CD pipeline also stays untouched, as it is interacting with the same object as before, the original deployment. Aviary handles the rest.
The configuration stays minimal, with the ability to define, per service:
- Break-point percentage (the percentage of successful canary instances after which the new version is fully deployed).
- Number of canary instances added at each iteration of the rollout.
- Maximum tolerated failures in metrics.
- A warm-up delay.
- A list of metrics and thresholds to validate the new deployments.
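To make these options concrete, here is one way such a per-service configuration could be modeled, together with a threshold check. The article does not show Aviary's real configuration format, so every class, field and metric name below (`RolloutConfig`, `MetricCheck`, `error_ratio`, `consumer_lag`) is a hypothetical stand-in, and the sketch assumes each threshold is an upper bound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricCheck:
    name: str          # hypothetical metric identifier
    max_value: float   # assumed to be an upper bound on the sampled value


@dataclass(frozen=True)
class RolloutConfig:
    break_point_pct: float      # canary share after which we fully deploy
    step_instances: int         # canary instances added per iteration
    max_metric_failures: int    # tolerated failed metric evaluations
    warm_up_seconds: float      # delay before the first evaluation
    checks: tuple[MetricCheck, ...] = ()


def failed_checks(config: RolloutConfig, samples: dict[str, float]) -> list[str]:
    """Return the names of checks that are missing or above their threshold."""
    failed = []
    for check in config.checks:
        value = samples.get(check.name)
        if value is None or value > check.max_value:
            failed.append(check.name)
    return failed


config = RolloutConfig(
    break_point_pct=50.0,
    step_instances=2,
    max_metric_failures=3,
    warm_up_seconds=120.0,
    checks=(
        MetricCheck("error_ratio", 0.01),
        MetricCheck("consumer_lag", 10_000),
    ),
)

print(failed_checks(config, {"error_ratio": 0.002, "consumer_lag": 25_000}))
# ['consumer_lag']
```

Treating a missing sample as a failure is a deliberately conservative choice in this sketch: a canary that reports no metrics at all should not count as healthy.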
Aviary is also highly interactive: an operator can cancel an in-progress deployment, or bypass Aviary entirely for the next deployment of a service (for pushing hotfixes).
Our solution has been deployed in production for a few months now, and it has already prevented numerous bad rollouts and regressions. The lack of codebase changes and its very small footprint in the Kubernetes cluster make it a very satisfying solution.
It doesn’t even change anything about normal SRE operations and automation, as it transparently handles scaling deployments up and down, as well as service restarts.
We are very happy to have made this step forward, as developers are one step closer to being able to deploy their changes themselves, without having to worry about undetected issues that would arise in production.
Of course, Aviary was made in the most business-agnostic way and is freely available on our GitHub repository.