We recently held our second Site Reliability Engineering (SRE) team builder where SREs from across all of our digital hubs came together to network, share ideas, and gain insight.
At the event I spoke about the power of collaboration. I talked about the journey I’d been on since our last team event and how it had started in that first breakout session. I explained how that session on Service Level Objectives (SLOs) led to a number of conversations afterwards about how we define our SLOs.
Later conversations with Gartner led us to look at some approaches for setting the direction of travel for SRE work. The one we settled on plots applications on a chart by both business criticality and relative stability. The applications with the highest importance to the business, but which are less stable, define the starting point. The direction of travel is then downwards towards the more stable area of the chart.
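To make the charting idea concrete, here is a minimal sketch of how that ordering might be expressed in code. It is an illustration only, not our actual model: the 1 to 5 scales, the `Application` class, the `direction_of_travel` function and the application names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Application:
    name: str
    criticality: int  # assumed scale: 1 (low) to 5 (most business critical)
    stability: int    # assumed scale: 1 (least stable) to 5 (most stable)


def direction_of_travel(apps):
    """Order applications for SRE attention.

    Most business-critical first; within the same criticality,
    least stable first, then moving towards the more stable ones.
    """
    return sorted(apps, key=lambda a: (-a.criticality, a.stability))


# Hypothetical applications, purely for illustration.
apps = [
    Application("payments", criticality=5, stability=2),
    Application("intranet", criticality=2, stability=1),
    Application("claims", criticality=5, stability=4),
    Application("reporting", criticality=3, stability=3),
]

order = [a.name for a in direction_of_travel(apps)]
print(order)
```

Sorting on criticality first is one simple reading of the chart. A weighted score that combines both axes would be an equally reasonable alternative.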
A simple approach but one that threw up some challenges. How do we establish the relative business criticality for every application? I talked about how I’d drafted a straw man model and shared it with my colleagues in service management. From that we discovered a number of initiatives that could benefit from the insight such a model would provide.
We’re now building a collaborative project to develop the model and ask the business itself to calibrate it. Objective testing will be achieved by using audiences in multiple parts of DWP to apply the model to a representative set of applications. If the results show that business criticality can be modelled consistently by anyone, regardless of where they sit, then we know that we have an objective way of defining relative business criticality for all applications.
At our event I reminded my audience that this all stemmed from the 20-minute SLO breakout session at our previous event, emphasising the importance of getting together and sharing ideas.
The implications of the model are much wider: a ready-made IT risk profile for the business unit, automation of error budget setting, calibration of scales and ranges to reflect the relative priority of each criterion, and automatic recalibration every month to reflect the natural curve of the business unit calendar.
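As a rough illustration of what automating error budget setting could mean, the sketch below derives an error budget from a criticality score. The mapping from criticality to SLO target, the 30-day window and the function name are assumptions made for the example, not values from our model. The only fixed idea is the standard one that an error budget is the unreliability an SLO allows: 1 minus the SLO target.

```python
# Assumed mapping from criticality (1 to 5) to an availability SLO target.
# These targets are illustrative placeholders, not agreed figures.
SLO_TARGET_BY_CRITICALITY = {
    5: 0.999,
    4: 0.995,
    3: 0.99,
    2: 0.98,
    1: 0.95,
}


def error_budget_minutes(criticality, window_days=30):
    """Return the allowed minutes of unavailability in the window.

    The error budget is (1 - SLO target) applied to the window length.
    """
    target = SLO_TARGET_BY_CRITICALITY[criticality]
    window_minutes = window_days * 24 * 60
    return (1 - target) * window_minutes


# A criticality 5 application gets about 43.2 minutes over 30 days.
budget = error_budget_minutes(5)
print(round(budget, 1))
```

A monthly recalibration could then be as simple as regenerating the mapping from the latest criticality scores and recomputing each budget.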
I went on to share my conceptual blueprint for our reliability engineering hubs which shows the full context of SRE in relation to the business units we support.
A unique approach
We also had 2 representatives from Gartner and our UXCC with us. They delivered great pieces: one on our approach to SRE, reminding us how unique it really is, and one on how the UXCC operates, giving us great insight into the strategy.
Juan Villamil, Director of Technology Service, joined us for our panel discussion at the end of the day and we had great questions from the team. The messages from both Juan and Gartner were eerily coincident with the themes that we’d discussed throughout the day. Don’t you just love it when that happens?
We’re constantly experimenting with the format for these events. We went for a little more structure this time by providing a breakout board that offered spaces to organise things a bit better. We retained the breakout session format for the simple reason that it’s a great opportunity to generate new ideas and different perspectives on what we’re developing. We’ll try other formats for future events though, as I can think of nothing worse than turning up to an event where I already know what to expect.
We’re recruiting SREs into our team now. There’s never been a more exciting time to join us, so visit our Careers website to view the latest vacancies.