This post outlines a recent production issue on GOV.UK and how it was resolved. We’ve blogged in the past about what happens when things go wrong on GOV.UK, and also how we classify and prioritise incidents.
From around 9:30am until midday on Thursday 16 March, publishers using GOV.UK’s publishing applications experienced intermittent errors when trying to publish content.
The root cause was two separate errors that occurred during scheduled overnight maintenance by our hosting provider, which left performance significantly lower than usual. As a result, our publishing applications were unable to serve as many users as normal in a timely manner. This was a severity 1 incident.
What users saw
Publishers using GOV.UK’s publishing applications saw error messages and long page loading times. Additionally, the applications were sometimes not available at all for a few minutes at a time.
Most people using GOV.UK saw content as usual. However, some people may have seen an error message – we are currently investigating the exact number of error messages we displayed.
How we responded
We contacted our hosting provider to let them know about the problems and to investigate ways of fixing them. We also let publishers know about the problems so they could contact us if they needed to publish anything urgently.
Once the root cause had been determined, we worked with our hosting provider to fix the problem and ensure all the publishing applications were available.
What we’re doing to prevent this from happening again
We’re investigating better ways of telling both publishers and people using GOV.UK about problems more quickly, so that they know we are investigating and what they can do to work around the problem in the meantime.
We’re also going to make sure that people who are on call for GOV.UK are aware of any scheduled maintenance by our providers, and that people on second line support have the access they need to contact our hosting provider.
Finally, we are asking our hosting provider to give us more notice and reminders of any scheduled maintenance so that we can make sure we are fully prepared for it and have contingency plans in case something goes wrong.
Ruben Arakelyan is a tech lead at GOV.UK.