[This post was originally published in Q1 2020]
A few weeks ago I killed one of my own experimental attempts to fix how we work, and replaced it with a different one, adjusting for the problems found in the first iteration. It took less than 24hs to involve the correct people, get their buy-in, and even iterate on my experiment definition to enable someone to be a lot more empowered so that our team can move faster.
It had been less than 24hs since we started the new experiment when we got the first big improvement. From a pull request being approved to it being in production, in 24hs. It used to take several days, because someone would approve, then it would be merged by the person sending the pull request (if it was a teammate), and then the merger would send it to QA. That would often add 24hs to the QA process just in tickets sitting there waiting.
In those 24hs, more tickets would get merged, adding to the workload of our QA and to the amount of things that could go wrong. Even with a plan to do weekly releases (our last iteration was doing some changes that could bring down release cycles from monthly to weekly, and it was not super successful even if we did improve our cycle time)
Recognizing when something isn’t working.
Our first attempt at improving our release cycle was a compromise. Because of historical reasons, we never made releases a very big priority until this year, so for months now we have been doing small improvements that bring us closer to our goal: when things are ready, they go out. In our last iteration, I decided that we would do weekly releases to production as a “baby step” towards releasing just as soon as something is ready.
The idea was that we’d get better at the release process, then shorter the cycles. So what improved
- Releases went from monthly or longer to weekly or every two weeks.
- Someone who had usually not done releases before started doing them.
- Releases started getting more documentation on what was going live, we committed to always tagging and thanking contributors in release packages, and we started working towards ensuring docs get started on time. Release tags were going out, finally, in a timely fashion.
- Releases got everyone collaborating to get them out. It used to be something that happened very much on the background, and they became a more active part of our weekly check-ins.
With 4 or more major improvements, you could say we succeeded. The reality is that we got a lot of good “side effects” of making releases a priority, but we didn’t get as close to “just in time” releases as we need to be.
The iteration of this experiment with weekly releases failed because the goal wasn’t ” improve a lot of important things”, the main goal was “a ticket should never wait more than a week to be in production once it’s sent for testing”.
We knew this, because we documented the goal from the start. It was clear that was where we wanted to be for this to be considered a success.
After doing this recap of “what is going well” and a very frustrating week when my release was not getting… well, released, I decided to kill this experiment, and start fresh with the “just in time” improvement.
There was some push back to doing releases weekly, with a general feeling that it’d take too much time out of the week, so I expected the same for “just in time”. What I found instead was that everyone accepted it pretty fast. To be honest, this hasn’t been something I’ve opened up for negotiation all that much. I’ve said time and time again that being able to release often is non-negotiable.
Does it feel shitty to do that as a leader who tries to bring everyone into big decisions? yea, totally. I feel shitty writing it here, too. But I also feel it’s the only correct decision. When long release cycles are your biggest engineering team cultural debt, and there’s a lot of people in the company who haven’t seen a release cycle that makes sense in many many years, you have this internet stranger’s permission to be the person who says “No, this actually will be done this way”, even if it costs you points in the short term. Explain until it gets boring, follow up, hear the team’s concerns, but make a decision.
As long as you reserve your right to be stubborn about 1 or 2 things and not all the things, I think this is a reasonable approach. If you are making all the decisions yourself, then you’re in trouble and need to go fix that, first.
Without releasing, the impact of the code you write is nil, or negative even, for a long time; the time you spend writing code consumes valuable resources (money, other people’s time, you time) without producing any value for long periods of time.
It may sound counter intuitive that if you can’t achieve weekly releases you should go for daily/just-in-time, but I think it’s actually the only way to go. If you can’t achieve weekly releases because something always goes wrong early in the process, the only possible solution is to minimize chances of “things going wrong” being a problem. Things will always go wrong, so it’s minimizing the impact of each thing going wrong and our ability to respond that matters the most.
We minimized this risk by making the person in charge of QA the one responsible for merging pull requests. Rather than make it so that developers merge their PR once they are ready, we are putting this responsibility in our QA so that he’s the one making the decision on “can we add this ticket to our testing environment, or not?”, and then deciding if a ticket makes it to production based on the results from testing.
Long term, I want us to have enough automated test coverage, observability and safety nets to do more releases, more often, faster, with less risk. For the moment, this is providing speed that we never had until now, and just enough safety to keep us sane.