Colleagues from BPDTS and DWP Digital are enabling the modernisation of DWP’s portfolio of services. Some people might think we only develop new services. In fact, we are also applying the latest digital thinking to the large heritage services that still handle huge volumes of our customers’ pension and benefit payments. The department is investing heavily in site reliability engineering (SRE) to maintain these services.
In short, SRE is about applying software engineering principles and practices to the world of service delivery and operations. It’s about prioritising reliability over new features, making sure services are stable, secure and performant when our users need them most.
For our new services, this is very much part of the launch process. Teams must prove that services are reliable before they are made available to the public. Our blog post, Gearing up for Site Reliability Engineering, explains more about our move to SRE.
What does SRE mean for our heritage services?
Firstly, we need to ensure that our existing services can stand the test of time. Some of these services are over 20 years old, and we still need to operate them for a few more years. So we’re using DevOps and continuous delivery practices to make our legacy services run on modern platforms and, where necessary, rebuilding applications using the latest tools. We’re also using the public cloud to host and run our services at optimal cost and efficiency.
We have proven that modern practices like agile and DevOps can be applied to heritage services. Once we’ve remediated a service, we can then focus on its reliability. We coined the term ‘heritage reliability engineering’ to add focus to the effort involved in making these existing services more reliable.
How we’re making heritage services more reliable
We established service level objectives that are more relevant to heritage services. Examples include the time it takes to process batch transactions, the payment success rate and data quality.
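To illustrate how objectives like these can be measured, here is a minimal Python sketch. It is not code from our services: the BatchRun record, the function names, the sample figures and the target values are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchRun:
    """One run of a hypothetical payment batch (illustrative record only)."""
    duration_minutes: float
    payments_attempted: int
    payments_succeeded: int


def payment_success_rate(runs):
    """Fraction of attempted payments that succeeded across all runs."""
    attempted = sum(r.payments_attempted for r in runs)
    succeeded = sum(r.payments_succeeded for r in runs)
    if attempted == 0:
        return 1.0  # nothing attempted, so nothing failed
    return succeeded / attempted


def batch_duration_compliance(runs, max_minutes):
    """Fraction of runs that finished within the allowed processing time."""
    if not runs:
        return 1.0
    on_time = sum(1 for r in runs if r.duration_minutes <= max_minutes)
    return on_time / len(runs)


def meets_slo(measured, target):
    """An objective is met when the measured value reaches the target."""
    return measured >= target


# Made-up sample data and targets, purely for illustration.
runs = [
    BatchRun(duration_minutes=90, payments_attempted=1000, payments_succeeded=999),
    BatchRun(duration_minutes=130, payments_attempted=1000, payments_succeeded=995),
    BatchRun(duration_minutes=100, payments_attempted=2000, payments_succeeded=2000),
]

success_ok = meets_slo(payment_success_rate(runs), target=0.995)
duration_ok = meets_slo(batch_duration_compliance(runs, max_minutes=120), target=0.95)
```

In this made-up data the payment success objective is met, but the batch duration objective is not, because one of the three runs took longer than the assumed 120-minute limit.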
We’re looking to eliminate toil (repetitive work that adds no lasting value) by investing time in engineering. That means increasing capability, training our colleagues in modern engineering approaches and allowing them to apply those approaches to heritage services. A good example of this is the way we use infrastructure-as-code, which is covered in more detail in our blog post, How we’re using Ansible to improve our digital infrastructure. It talks about how we deploy and configure services and manage the underlying environments, including scaling those environments up or down when necessary.
We’ve also improved our telemetry (how we measure remotely) to make more service operation data available. By making these heritage services more observable, we can spot potential problems before they cause disruptions. This, coupled with the benefits of our cloud platform, is putting our heritage services on a path to being more reliable than ever.
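As a simple illustration of spotting a problem early, the Python sketch below raises a warning when recent batch run times drift close to an agreed limit, before the limit is actually breached. The function name, the three-run window and the 80% warning threshold are assumptions chosen for the example, not a description of our monitoring tools.

```python
from statistics import mean


def early_warning(durations_minutes, limit_minutes, window=3, warn_fraction=0.8):
    """Return True when the average of the most recent runs is near the limit.

    All parameters are hypothetical: `window` is how many recent runs to
    average, and `warn_fraction` is how close to the limit counts as 'near'.
    """
    if len(durations_minutes) < window:
        return False  # not enough history to judge a trend
    recent = durations_minutes[-window:]
    return mean(recent) >= warn_fraction * limit_minutes


# Made-up run times in minutes, oldest first, against an assumed 120-minute limit.
history = [60, 62, 61, 90, 100, 110]
should_warn = early_warning(history, limit_minutes=120)
```

Here no single run has exceeded the assumed limit, but the last three runs average 100 minutes, which is above the 96-minute warning level, so the check flags the trend while there is still time to act.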