From a two-month to a two-week release cycle
At GoHealth, we help people find and enroll in the health insurance plans that fit them best. Part of the process is filling in an enrollment application using our web interface, which is then submitted to the insurance company that provides the plan. That’s where my team comes in: we own the enrollment user flow and integrate with major US health insurance carriers such as Humana and Anthem.
When I joined the enrollment team, we were doing ad-hoc releases. We had no regular release schedule, so we released whenever the time felt right: typically ahead of critical fixes or business-imposed deadlines. This often left a gap of four to six weeks between releases. Big releases mean big problems: each one contains too many changes, it is hard to reason about what actually goes out, and testing is a nightmare. Infrequent releases also mean that people have to remind themselves how to perform a release, both process-wise (e.g. who to notify) and execution-wise (e.g. which buttons to push to deploy the application). As a result, there was a lot of anxiety and apprehension around every release.
All these issues snowballed into increasingly infrequent and painful releases. After one particularly painful release carrying six weeks’ worth of changes, it was clear to us we had to do something significant.
We set out on a path to improve our release cycle, and while there is always room for improvement, we have reached the point where we stick to a regular two-week release cycle, occasionally shipping a release (or three) in a single week. Here is what we did and learned along the way.
Stick to schedule
Planning the next release two short weeks after the latest troubled one is a daunting proposition, but you need to start somewhere. So pick your schedule and stick to it. It might even look like there are not enough changes for a release, and that’s a good thing! Few changes mean a simple release. A small release has a much higher chance of success and encourages everyone to keep releasing frequently. A less overwhelming release also gives you room to fine-tune your processes. So we actually did the release after two weeks, then another two weeks later, and we have kept releasing regularly ever since.
When you do something often, you need to be extra efficient about it. We kept evaluating how we were doing, both by performing the release mindfully and by collecting feedback in our retrospective meetings.
One of the first improvements was to capture all the steps in a simple one-page document to make the release process repeatable and faster. This page included everything you needed to perform a release:
- Documentation on the release process
- List of deployment jobs
- Monitoring dashboards and logging
Once we had a decent grasp on the process, we created a checklist to outline all the steps in a release:
- Which systems to release
- People to notify
- Human checks needed (e.g. backwards compatibility)
- Testing areas
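As a sketch of how a checklist like this can be kept honest, here is a minimal Python example that encodes the steps as data so a script can flag anything that was skipped. The step keys and descriptions are my own hypothetical placeholders, not our actual checklist:

```python
# Hypothetical release checklist encoded as data (keys and wording are
# illustrative, not GoHealth's real checklist).
RELEASE_CHECKLIST = [
    ("systems", "Confirm which systems are part of the release"),
    ("notify", "Notify stakeholders of the release window"),
    ("compat", "Check backwards compatibility of DB/API changes"),
    ("testing", "Confirm regression testing areas are covered"),
]

def missing_steps(completed: set[str]) -> list[str]:
    """Return descriptions of checklist steps not yet marked complete."""
    return [desc for key, desc in RELEASE_CHECKLIST if key not in completed]
```

Even a one-page document works, but machine-readable steps make it easy to refuse to proceed while anything is unchecked.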
Creating documentation and checklists helped tremendously in getting more people involved in performing releases.
Our technology landscape also worked very well for us. The actual release is no more than the click of a button because we package all our applications as Docker containers and deploy to AWS via Jenkins. In particular, AWS Elastic Container Service (ECS) works great for our use case: the applications and Dockerfiles are simple, and the environment itself is easy to monitor. It also helps to have a wonderful Site Reliability Engineering team and tooling teams that keep the building blocks and the infrastructure in great shape.
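For illustration, here is a minimal sketch of what that button-click amounts to with ECS. The cluster and service names are made up, and our real pipeline runs through Jenkins rather than a standalone script:

```python
def deploy_request(cluster: str, service: str) -> dict:
    """Build the parameters for forcing a new ECS deployment of the
    service's registered task definition (i.e. redeploying the container)."""
    return {
        "cluster": cluster,
        "service": service,
        "forceNewDeployment": True,
    }

# With real AWS credentials, the "button" boils down to something like:
#   import boto3
#   ecs = boto3.client("ecs")
#   ecs.update_service(**deploy_request("enrollment-cluster", "enrollment-app"))
```

Keeping the deploy this boring is a large part of why releasing often stays cheap.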
Automate the release preparation
An important part of increasing release cadence is making release preparation faster and easier. We have a few internal tools that show us the deployed versions, the source code diff between the deployed version and the next one, and the testing state. We track requirements via specification scenarios and Jira tickets, and the tools show us which scenarios will ship in the release.
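As a rough sketch of the idea behind such tooling (the Jira project key and commit messages below are hypothetical, not our real ones), the core of "what is going out?" can be as small as scanning the commit messages between the deployed version and the release candidate for ticket references:

```python
import re

# Matches Jira-style ticket keys such as "ENR-123" (project key assumed).
TICKET_RE = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")

def tickets_in_release(commit_messages: list[str]) -> set[str]:
    """Collect every Jira ticket key mentioned in the given commits."""
    return {t for msg in commit_messages for t in TICKET_RE.findall(msg)}
```

The real tools do more (linking scenarios and test state), but this is the kernel that turns a git diff into a human-readable release scope.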
Frequent releases put a lot of strain on manual testing. During the transition to frequent releases it felt as if testing for one release started just as testing for the previous one ended, which left very little time for the real value-added part of QA: test analysis. Our answer? Automate it! (You may start to see a pattern here.) We already have a dedicated QA Automation team that implements our test specifications. Seeing the strain on manual testers, we focused more on automating regression testing so that manual testers could concentrate on testing new features and contributing to the specifications.
With the whole team cooperating, we can run regression suites automatically and even verify deployed versions without human intervention; that’s the peace of mind you need for frequent releases.
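A minimal sketch of what automated post-deploy verification can look like, assuming the application exposes its version somewhere a `fetch` callable can read it (the names and the polling approach are assumptions for illustration, not our exact tooling):

```python
import time

def wait_for_version(fetch, expected: str,
                     attempts: int = 30, delay: float = 1.0) -> bool:
    """Poll fetch() -- e.g. an HTTP call to the app's version endpoint --
    until it reports the expected version; give up after `attempts` tries."""
    for attempt in range(attempts):
        if fetch() == expected:
            return True
        if attempt + 1 < attempts:
            time.sleep(delay)
    return False
```

Injecting `fetch` (rather than hard-coding an HTTP client) keeps the polling logic trivially testable; in practice it would wrap a request to the deployed service.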
Make each change releasable
If you want to release often, you can’t leave your main branch in an undeployable state for long: any merged PR should be immediately releasable. If there is unfinished functionality, guard it behind a feature flag.
Feature flags don’t have to be anything fancy — simple property settings work well for us.
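To illustrate just how simple such a flag can be, here is a minimal sketch backed by environment variables; the flag name and functions are hypothetical, and our real flags live in application property settings:

```python
import os

def flag_enabled(name: str, default: bool = False) -> bool:
    """Read a boolean feature flag from the environment (illustrative)."""
    raw = os.environ.get(f"FEATURE_{name.upper()}", str(default))
    return raw.strip().lower() in ("1", "true", "yes", "on")

def submit_enrollment(application: dict) -> str:
    """Guard unfinished functionality so main stays releasable."""
    if flag_enabled("NEW_CARRIER_FLOW"):
        return "new-flow"     # unfinished code path, dark until the flag flips
    return "legacy-flow"      # current production behavior
```

The merged-but-unfinished code ships dark; flipping the flag later is a configuration change, not a release.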
Unfortunately, it is not always practical to keep the new code and the old code side by side. In those moments we deploy from hotfix branches, but it is an exception that raises more than one eyebrow: releasing from a hotfix branch means we are doing something wrong and need to get back to main quickly.
Tag all the versions
One of the big questions we always need to answer is: “What is in the release?” For each release, we identify which code changes go in, and for each line of code we can see why it is there thanks to the associated Jira ticket.
This is complicated by the fact that the applications we deploy consist of multiple dependent libraries containing the business logic for different integrations. Again, we have tooling that lets us see changes down through the multiple layers of the dependency graph.
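To illustrate the idea (this is not our actual tooling, and the component names are invented), here is a small sketch of walking a dependency graph to find every deployable that picks up a changed library:

```python
def affected_components(graph: dict[str, list[str]], changed: set[str]) -> set[str]:
    """graph maps each component to its direct dependencies; a component is
    affected if a changed library is reachable from it (or it changed itself)."""
    def reaches_change(node: str, seen: set[str]) -> bool:
        if node in changed:
            return True
        seen.add(node)
        return any(dep not in seen and reaches_change(dep, seen)
                   for dep in graph.get(node, []))
    return {node for node in graph if reaches_change(node, set())}
```

The transitive closure is the important part: a change two layers down in a shared library still lands in the application’s release notes.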
The critical enabler is publishing libraries and containers to Nexus. We tag each git commit from which we publish a library. It took a little work to become consistent about it, but it is a lifesaver when you go investigating what could have gone wrong.
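A tagging convention like this can be as simple as a pair of helper functions; the exact tag format below is hypothetical, but the point is that tag names must round-trip cleanly back to an artifact and a version:

```python
def release_tag(library: str, version: str) -> str:
    """Build the git tag for a published library version (format assumed)."""
    return f"{library}/v{version}"

def parse_release_tag(tag: str) -> tuple[str, str]:
    """Inverse of release_tag: recover (library, version) from a tag."""
    library, _, version = tag.rpartition("/v")
    return library, version
```

With every published artifact anchored to a tagged commit, "which commit produced the jar in Nexus?" stops being an investigation and becomes a lookup.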
Have a rollback plan
Each release has a rollback plan: we note down the current production versions of the applications in case we need to roll the release back. The rollback uses the same process as deploying a new version, so it is well understood and a safe default choice if we encounter issues during a release.
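A minimal sketch of the idea, with hypothetical application names; the key property is that rolling back reuses the exact same deploy mechanism as going forward:

```python
def snapshot_versions(deployed: dict[str, str]) -> dict[str, str]:
    """Freeze the current production versions as the rollback target."""
    return dict(deployed)  # copy, so later deploys don't mutate the plan

def rollback_actions(plan: dict[str, str]) -> list[str]:
    """Produce the deploy steps (same mechanism as a forward deploy)."""
    return [f"deploy {app} {version}" for app, version in sorted(plan.items())]
```

Because a rollback is "just another deploy" of the recorded versions, nobody has to improvise under pressure.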
It sometimes happens that a release is hard to roll back due to environmental changes or other external dependencies. In such cases, we look for mitigations at the code level (e.g. extra code to allow for backward compatibility) and proceed extra carefully.
The improvements that got us to faster releases are not rocket science, and they are kind of obvious in hindsight. As with many such improvements, the essence lies in prioritization and team culture, not in technological challenges.
At the end of the day, keeping good release practices is like going to the gym — everyone knows it’s useful but it takes determination to stick to it.
Are you interested to find out more? Come work with us! Get in touch at firstname.lastname@example.org and join our team.
Author: Michal Kostic
Follow us on our website and social media!