We thought it would be helpful to explain what happens at dxw when something goes wrong with a service or website that we manage.
Our technology team use incident reviews to understand why an issue happened and the best way to stop it happening again. Sometimes it’s not possible to prevent a repeat incident, but we can make changes to reduce the impact. For example, we might make sure we’re notified sooner so we can react more quickly, limit an incident so it affects fewer users, or fall back to partial functionality.
Here’s how it works.
Working out what happened and why
Our incident review process is similar to a retrospective, where everyone talks openly about what happened and how to make it better in future. Having a frank, no-blame discussion about why something went wrong is important for spotting vulnerabilities in our technology or processes. It also gives clients confidence that we’ll deal with things quickly.
Everyone involved gets together after an issue has been identified and resolved. That normally includes people like developers, operations engineers, product managers, and delivery managers. The team discuss and then write up a timeline of what happened, an explanation of the incident, potential recommendations, and agreed actions.
Our recommendations take a broad look at possible ways of stopping the problem happening again. They include things like training or communication, as well as technical solutions. Some won’t be realistic or sensible because of the time and effort involved, or the work needed relative to the value. But looking at all the options helps us identify the most efficient and effective way forward.
Our agreed actions are based on what will make the most impact. Each action is then assigned to a team member, who documents the work they do in case the issue arises again.
Using this process for our own website
We used this process for an issue we had with our website earlier this year.
Our jobs page had accidentally been made private. Not ideal when you have a number of positions you’re hoping to fill. The people with admin access didn’t realise the page wasn’t publicly available, as they could still view it. So the issue went unreported, and we had a broken link on our site without knowing about it.
In the review meeting, we discussed all the possible ways this could have happened and what we could do to stop it happening again. We talked through options like removing the ability to make pages private and using a monitoring service to tell us when there’s an issue with a page. But we needed to balance these options against the time we had available.
We decided to investigate using a prompt when changing a page’s status and how much developer time this would take. We also thought we should make it clearer when a page is in private mode, test options to monitor for broken links, and review the permissions levels of people with access to the website.
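As an illustration of the kind of monitoring we discussed, here’s a minimal sketch of a broken-link check. It’s not our production setup, and the list of pages is a hypothetical example; it simply requests each URL and flags anything that doesn’t respond with a 2xx status (a page made private typically returns a 404 or a redirect to a login screen).

```python
# Minimal broken-link check sketch, using only the Python standard library.
# The URLs below are hypothetical examples, not a real monitoring config.
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def link_status(url, timeout=10):
    """Return the HTTP status code for a URL, or None if unreachable."""
    try:
        request = Request(url, method="HEAD")
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        return error.code  # e.g. 404 when a page has been made private
    except URLError:
        return None  # DNS failure, timeout, connection refused, etc.


def is_broken(status):
    """Treat anything other than a 2xx response as broken."""
    return status is None or not 200 <= status < 300


def broken_links(urls):
    """Return the subset of URLs that look broken."""
    return [url for url in urls if is_broken(link_status(url))]


if __name__ == "__main__":
    # Hypothetical pages to monitor on a schedule (e.g. via cron).
    pages = [
        "https://example.com/",
        "https://example.com/jobs",
    ]
    for url in broken_links(pages):
        print(f"Broken: {url}")
```

A scheduled check like this, paired with an alert, is one way to learn that a page has quietly disappeared before a visitor reports it.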
Make things open, it makes them better
Although this incident only affected one page on our website, we often use the same systems and technologies for our clients. So getting this stuff right for us means we also get it right for them.
Reviewing issues like this openly helps us to be proactive and prevent things going wrong or breaking.
You can read more about our approach to technical support in our playbook.