Site Reliability Engineering
Site reliability engineering (SRE) is something that sysadmins have been doing for years: incorporating aspects of software engineering and applying them to operations when creating ultra-scalable and highly reliable software systems. Google turned it into a specific function, defining it back in 2003.
Put simply, SRE is what happens when a software engineer is in charge of an operations function. Site reliability engineers don’t typically build features into your platform or product. They focus instead on ensuring that the product or platform can scale, is reliable, and achieves both with as little stress as possible.
Site reliability engineers deal with a huge amount of code and aim to never have the same problem twice
When a site reliability engineer comes across an issue, part of their job is to figure out exactly what happened and why, then take whatever steps are needed to make sure it never happens again. Part of the excitement(?!) of being a site reliability engineer is that, because known problems get fixed for good, whenever an issue arises it’s usually something they’ve not seen before.
Things are going to go wrong — site reliability engineers solve problems and make changes that ensure everything runs as smoothly as possible
A major part of a site reliability engineer’s job is to write code and do the engineering work necessary so that, as the platform scales, minor issues don’t grow into major ones. If there’s an issue at 1,000 users, it’ll be even worse at 100,000.
If a company takes SRE seriously, there will be a process in place that pulls back updates to the platform or service if a certain number of incidents occur in a short space of time. When this happens, you need to shift your focus from launching new features to making the service more sustainable.
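As a rough illustration of that kind of process, here is a minimal Python sketch of a release gate that blocks feature launches once too many incidents land inside a rolling time window. The class name ReleaseGate, the threshold of three incidents, and the 24-hour window are all assumptions made for the example, not a description of any particular company’s policy, and the sketch assumes incidents are recorded in chronological order.

```python
from collections import deque
from datetime import datetime, timedelta


class ReleaseGate:
    """Hypothetical gate: pause feature releases when incidents cluster."""

    def __init__(self, max_incidents: int = 3,
                 window: timedelta = timedelta(hours=24)):
        # Both defaults are illustrative assumptions, not recommended values.
        self.max_incidents = max_incidents
        self.window = window
        self._incidents = deque()  # incident timestamps, oldest first

    def record_incident(self, when: datetime) -> None:
        """Log an incident (assumed to arrive in chronological order)."""
        self._incidents.append(when)

    def _recent_count(self, now: datetime) -> int:
        """Drop incidents older than the window and count what remains."""
        cutoff = now - self.window
        while self._incidents and self._incidents[0] < cutoff:
            self._incidents.popleft()
        return len(self._incidents)

    def releases_allowed(self, now: datetime) -> bool:
        """True while the recent incident count is below the threshold."""
        return self._recent_count(now) < self.max_incidents
```

A deployment pipeline could call releases_allowed() before shipping a feature and, when it returns False, route the team’s effort towards reliability work until the window clears.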
What makes a good site reliability engineer?
There’s no “one-size-fits-all” profile for site reliability engineers. They can come from a pure software engineering background, or be sysadmins or network architects.
Not being fazed by anything is a skill that many great SREs have. Our current SREs, Chris, Paul, Adrian, and Jamie, all have different backgrounds—you can see more about their specific skill sets on their individual pages.
Site reliability engineering is a core principle of the DevOps-as-a-service offering that we provide to customers.