This post outlines a recent production issue on GOV.UK and how it was resolved. We’ve blogged in the past about what happens when things go wrong on GOV.UK, and also how we categorise and prioritise incidents.
We have a format called Local Transactions on GOV.UK, which allows users to find their local authority’s webpage for a specific service by postcode (for example https://www.gov.uk/pay-council-tax). We manage the links to services and local authorities within our Local Links Manager application.
From 1.50 pm to 2.30 pm on Wednesday 12 July Local Links Manager was returning an unusually high number of HTTP 504 gateway timeout errors, which was affecting users’ ability to find local services via GOV.UK.
From our logs we could see that 3,670 errors in total were returned to users within this time period. This was classified as a severity 1 incident.
The root technical cause was that another GOV.UK application, Link Checker API, was flooding Local Links Manager with requests. Local Links Manager sends batches of service and local authority URLs to Link Checker API, which checks for problems with those URLs. It does this via a scheduled ‘rake’ task.
Once it’s checked a batch of URLs, Link Checker API sends the results back to Local Links Manager, which then prepares a report for the batch.
A recent code change to Local Links Manager (which was deployed 30 minutes before the incident) increased the number of URLs in a batch by roughly 25%. As a result, preparing the report for each batch took longer, and Link Checker API was processing batches of URLs and sending results back faster than Local Links Manager could keep up with.
When we deployed that code change, we also manually triggered the ‘rake’ task to start the link checking process. This resulted in a cascading flood of results from Link Checker API to Local Links Manager, which effectively stopped Local Links Manager from fulfilling its main purpose of serving links to the public.
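To illustrate why a modest increase in processing time can cause this kind of pile-up, here is a minimal sketch in Python. It is a toy model, not code from either application: the function name and the rates used are hypothetical, and it only assumes that results arrive at a steady rate and are processed at a steady rate.

```python
def backlog_over_time(arrivals_per_minute, processed_per_minute, minutes):
    """Return the number of unprocessed results at the end of each minute.

    Toy model: both rates are constant, and the backlog can never go below zero.
    """
    backlog = 0
    history = []
    for _ in range(minutes):
        backlog = max(0, backlog + arrivals_per_minute - processed_per_minute)
        history.append(backlog)
    return history


# Hypothetical rates: the receiver keeps up while it can process 12 results a
# minute, but falls further behind every minute once it can only process 10.
keeping_up = backlog_over_time(arrivals_per_minute=10, processed_per_minute=12, minutes=3)
falling_behind = backlog_over_time(arrivals_per_minute=12, processed_per_minute=10, minutes=3)
```

The point of the sketch is that once results arrive faster than they can be processed, the backlog grows without limit rather than settling at a slightly higher level.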
What users saw
Users entering a postcode for a Local Transaction saw an error message, informing them of technical problems and that they should try again later.
Also, admin users were unable to edit links for services and local authorities via Local Links Manager.
How we responded
Once we identified that Link Checker API was flooding Local Links Manager with results, we cleared the queues in Link Checker API. This stopped the link checking process, and Local Links Manager recovered immediately. Checking links was not crucial at that point and could be restarted later. We then reduced the rate at which Link Checker API sends results to Local Links Manager, so that Local Links Manager has more time to process the results while still serving URLs to the public.
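As a rough illustration of what slowing down a sender can look like, the following Python sketch enforces a minimum gap between sends. It is not how Link Checker API implements its rate limiting: the `Throttle` class, the `send_results` function and the interval are all hypothetical names and values chosen for the example.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive sends (hypothetical helper)."""

    def __init__(self, min_interval_seconds, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval_seconds
        self._clock = clock      # injectable so the behaviour can be tested
        self._sleep = sleep
        self._last_sent = None

    def wait(self):
        """Block until at least min_interval has passed since the previous send."""
        now = self._clock()
        if self._last_sent is not None:
            remaining = self.min_interval - (now - self._last_sent)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_sent = now


def send_results(batches, send, throttle):
    """Send each batch result in turn, pausing between sends as the throttle requires."""
    for batch in batches:
        throttle.wait()
        send(batch)
```

In use, `send` would be whatever function delivers a result to the receiving application, for example `send_results(results, post_to_receiver, Throttle(2.0))`, where `post_to_receiver` is again a hypothetical name.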
What we’re doing to prevent this from happening again
In the mid to long term we are looking into how we can make the communication between these applications more resilient. This could involve adding a queuing mechanism to Local Links Manager to process Link Checker API requests. We are also emphasising that before we run any critical code change in Production, we should run it in our Staging environment first so that we can catch errors before they go live.
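One possible shape for such a queuing mechanism is sketched below in Python. This is an assumption about the general approach, not a description of what we will build: the function names, the queue size and the choice of HTTP status codes (202 Accepted, 429 Too Many Requests) are all illustrative. The idea is that the web request only places the result on a bounded queue and returns straight away, while a background worker prepares reports at its own pace.

```python
import queue
import threading


def receive_callback(results_queue, batch_result):
    """Handle an incoming result: enqueue it and return an HTTP-style status code."""
    try:
        results_queue.put_nowait(batch_result)
        return 202  # accepted, will be processed later
    except queue.Full:
        return 429  # queue is full, the sender should back off and retry


def worker(results_queue, prepare_report):
    """Process queued results one at a time until a None sentinel arrives."""
    while True:
        item = results_queue.get()
        if item is None:
            results_queue.task_done()
            break
        prepare_report(item)
        results_queue.task_done()


def start_worker(results_queue, prepare_report):
    """Start the background worker thread and return it."""
    thread = threading.Thread(target=worker, args=(results_queue, prepare_report))
    thread.start()
    return thread
```

Because the queue is bounded, a burst of results is rejected quickly instead of tying up the processes that serve links to the public.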
David Basalla is a developer at GOV.UK.