Service development is far from all sunshine and rainbows, but for some reason it has become the norm to only talk about the goals you hit and the things you deliver. However, I think it’s important to also recognise the times when you’ve made an error, both as an individual and as a team, which is why I’m writing this blog. Learning from failure pushes teams to reduce future failures, speeds up the resolution of future problems and builds the overall maturity and experience of everyone involved.
The main thing I hope people take from this blogpost is that it’s okay to not be perfect. It’s how you react to any issues that you come across that’s important. My advice would be to turn any missteps into a learning opportunity, and create or improve processes to fix the issues and prevent them recurring. Most importantly, take the time to fix the actual root cause of an issue, rather than just putting tactical fixes in place. I feel that we should be shouting about how we fixed errors and are preventing them from happening again as much as we do about our successes, as it may help other teams and departments, not just our own.
Discovering the issue with our release
With that in mind, I wanted to share my experiences of working on a project as part of the New Style Employment and Support Allowance (New Style ESA) team. In late June 2020 we created a new release process for our project, designed to provide extra assurance and improve the success rate of our releases. We introduced it because we’d previously had a few unsuccessful releases we weren’t happy about, caused by gaps in the amount of testing and assurance we did.
Once the new process was introduced, it seemed to solve the issues we’d been experiencing, and we had no failed releases for a period of 4 to 5 months. That was until December, when we had to regress one of our releases due to issues with the application not processing customer claims through to our backend services.
It’s important to note at this stage that no customers were impacted by this issue, which is the main thing. Their claims were still submitted successfully, however there was a temporary issue with those claims being picked up and processed by the backend of the service. This meant they could be left in a queue for longer than we would wish. Luckily this issue was caught by the monitoring systems we have in place and the change we released was regressed straightaway.
The underlying issue was that we had updated the version of a dependency in one of our application components, to ensure we were as up to date as possible. This particular update was badged as a patch version, which in semantic versioning is meant to contain only non-breaking changes, so it should have had no impact on the functionality of the application.
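For readers unfamiliar with semantic versioning, the contract behind a `major.minor.patch` version number can be sketched in a few lines. This is purely illustrative (it is not the dependency tooling our service actually uses), but it shows why a patch bump is expected to be safe to take:

```python
def classify_bump(old: str, new: str) -> str:
    """Classify the change between two 'major.minor.patch' versions."""
    old_parts = [int(p) for p in old.split(".")]
    new_parts = [int(p) for p in new.split(".")]
    major, minor, patch = (n - o for n, o in zip(new_parts, old_parts))
    if major:
        return "major"  # breaking changes are allowed
    if minor:
        return "minor"  # new features, but backwards compatible
    if patch:
        return "patch"  # bug fixes only; no behaviour changes expected
    return "none"

print(classify_bump("2.4.1", "2.4.2"))  # patch
```

The promise, of course, only holds if the library author gets it right. Our incident is a reminder that a version number is a statement of intent, not a guarantee, which is exactly why the testing in a release process still matters for patch updates.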
The issue hadn’t been caught by our release process when it should have been, which made us reflect on how it had managed to make its way through all the testing we do and land in production. There are a few things we have learned from this issue that we feel are important to share and that we hope will help other teams deal with any problems they come across.
1. Failing to prepare is preparing to fail
Runbooks, escalation procedures, incident logs and procedures for handling failures in production are extremely important. Incident logs help you look back to see whether there have been issues like this before and how they were resolved, which helps you identify similar issues quickly. Escalation procedures let those doing production support of a service know who to contact when issues arise. Runbooks tell people how to deal with any production issues that may occur. Every application should have measures such as these in place so the team can respond to any production issue as quickly and completely as possible, minimising its impact.
On our project we had some of this documentation in place, which helped us to identify the issue quickly and come up with remediation plans, so that we could fix it and ensure it won’t happen again. We also made sure to update our documentation after the post-mortem of the incident, so we are better equipped to handle any future production incidents.
2. Communication is key
It is vital to talk with people when an issue occurs, as other services may have dealt with similar issues and be able to help. In our Health and Working Age team we have a number of communication methods, one being Slack where we have a designated channel for any incidents that happen on the platform. As soon as we came across this issue in production, we set up a call in the Slack channel and posted a message, which resulted in a big group of people helping to diagnose the issue and come up with a solution. If any of those people who offered advice and suggestions are reading this blog, thank you so much for jumping on and helping us fix the issue! It was a great collaborative effort by a lot of different people.
3. Technical debt should be addressed, because it can come back to bite you
Technical debt describes what happens when development teams take actions to expedite the delivery of a piece of functionality which later needs to be refactored. Or, in other words, it’s the result of prioritising speedy delivery over perfect code. For example, we’d had some technical debt in our backlog for a while to improve our integration test suite. If this had been completed, it would have caught the issue mentioned above at the testing stage of our release process, in the non-production environments, before it went anywhere near the production service.
4. Just because a process works, it doesn’t mean it’s perfect
We had used the same release process for a period of 4 to 5 months and never had a failed release until this issue cropped up. This goes to show the importance of iteration and identifying improvements even when you think something is perfect or is working just fine as it is.
Luckily for us, the fix for this issue was relatively quick and easy. But the biggest thing we have taken away from the experience is how we reacted to the production issue as a team, and the gaps it revealed in some of our processes, which we now review and iterate on frequently.
The week after the error, we were able to re-release the components with the fixes we applied. I’m happy to say it was a success, and our processes and assurance for releases are even stronger for it. This was down to the hard work of the New Style ESA team, as well as some very helpful colleagues from across Health and Working Age who helped us identify and fix the issue, and have helped to stop similar issues cropping up in the future. Everyone involved, myself included, has learnt from this ‘failure’, and we’re a better team because of it. I hope that by sharing my own experience I can help others to recognise that something can only be classed as a failure if you fail to learn anything from it.
We have a range of opportunities for engineers open now! Take a look at our Careers site for the latest vacancies.