SRE vs DevOps: What they are outside the Big Tech bubble
- devops
- sre
- devex
- technical
- business
Earlier this week, I saw a post asking about the differences between Site Reliability Engineering (SRE) and DevOps. I got a bit carried away with my response and hit the LinkedIn comment character limit, so I decided to turn it into a long-term post. This way, it can be a lasting resource, just like my Helm and Windows articles.
What is DevOps?
I’ve spent most of my time consulting from a DevOps point of view. In practice, that has meant providing modern-day operations services rather than the “cultural change” that you’ll find promoted across conference stages and even on Wikipedia. DevOps is also often portrayed as empowering development teams to manage all aspects of their services so they can self-serve their own releases.
I’ve never seen either the cultural change or the empowerment of development teams. DevOps has always translated to bringing development methodologies and tools into Operations workloads. This is mostly:
- Managing infrastructure as you would code, by committing infrastructure definitions to a code base and managing them through tools such as Terraform, Pulumi and Ansible
- Automating code delivery and maintenance via code pipelines and CI/CD tools
- Automating maintenance, or provisioning and managing services that remove the maintenance burden (serverless, DaaS, etc.)
- Developing operations-specific solutions such as custom dashboards, monitoring, self-service internal tools, etc.
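To make the first bullet concrete, here is a minimal sketch of the idea behind infrastructure as code: the committed definition is the desired state, and a tool compares it with what is actually running to work out the changes. This is not how Terraform, Pulumi or Ansible are implemented; `Resource`, `plan` and the example resources are hypothetical names made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """A hypothetical infrastructure resource; the fields are illustrative only."""
    name: str
    kind: str
    size: str


def plan(desired: list[Resource], actual: list[Resource]) -> dict[str, list[str]]:
    """Compare the committed definition with what is running and list the changes."""
    want = {r.name: r for r in desired}
    have = {r.name: r for r in actual}
    return {
        # In the definition but not running yet.
        "create": sorted(want.keys() - have.keys()),
        # Running, but no longer matching the definition.
        "update": sorted(n for n in want.keys() & have.keys() if want[n] != have[n]),
        # Running, but removed from the definition.
        "delete": sorted(have.keys() - want.keys()),
    }


# Stands in for the definitions committed to the code base.
desired = [
    Resource("web", "vm", "medium"),
    Resource("db", "database", "large"),
]
# Stands in for what is deployed today.
actual = [
    Resource("web", "vm", "small"),
    Resource("legacy-cron", "vm", "small"),
]

changes = plan(desired, actual)
print(changes)
```

The point of the exercise is that the change is reviewed as a diff before anything is touched, the same way a code change would be.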
Development teams have never been empowered by DevOps, because simply getting access to the stack doesn’t make them effective at managing it. The stack consists of everything: frontend, backend, infrastructure, networking, PKI, etc. Some developers can handle bits of everything at a hobbyist level, but few have the architectural expertise to manage every part of an enterprise stack. That’s perfectly fair. Whilst some engineers can manage everything, they’re not the norm, and you can’t build that expectation into an organisation’s hierarchy, in the same way that you can’t expect a Finance Director to do the bookkeeping, file the VAT submissions and direct finances.
Likewise, expecting some infrastructure engineers to dispense cultural change across major organisations by automating code delivery and managing operational infrastructure is a leap. Whilst making these changes can enable an organisation to improve their efficiency, the benefits are not likely to venture beyond a single product team.
If you’re looking at a DevOps role, you’re looking at infrastructure management, CI/CD wrangling and getting things shipped.
What is SRE?
Site Reliability Engineering (SRE) is a role that’s more common in name than in practice. Many are titled SRE, but fewer are actually performing SRE duties.
SRE originated at Google, which wrote a book on the subject that is available online. The core idea is that Google created a team of engineers dedicated to making services more reliable, hence the name. Google, being a Big Tech organisation, had untold multitudes of services requiring hosting, monitoring and incident management. Google could therefore realise substantial benefits by determining and enforcing reliability standards across all those services. This removed the responsibility for reliability governance from development teams and allowed them to focus on building their services to a set of centralised and defined standards.
This SRE heritage is reflected in Google’s tooling. For example, Kubernetes has health checks, deployment rollbacks, service failover and other reliability functionality built in.
Not every organization has the budget or the scale of applications to justify an SRE team. Typically, only the largest companies or the most technically focused organizations need a dedicated team to enforce and create centralised policies around reliability. In my experience, SRE functions in most organisations act as third-line support. They ensure services are online but do so reactively rather than through systemic, governance-based, and policy-managed approaches. It’s rare to see an SRE team acting as a governance check, implementing critical release gating for services or updates.
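As a rough illustration of what release gating against a centralised policy could look like, here is a small sketch. Everything in it is an assumption made for the example: the `ReleaseRequest` fields, the `MIN_AVAILABILITY` threshold and the individual checks are hypothetical, not a standard taken from Google or from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class ReleaseRequest:
    """A hypothetical release submitted to the SRE team for approval."""
    service: str
    has_health_check: bool
    has_rollback_plan: bool
    availability_last_30d: float  # fraction of time the service was up, e.g. 0.9995


# Hypothetical centralised reliability standard, owned by SRE rather than by each team.
MIN_AVAILABILITY = 0.999


def gate(request: ReleaseRequest) -> list[str]:
    """Return the reasons to block the release; an empty list means it may proceed."""
    reasons = []
    if not request.has_health_check:
        reasons.append("no health check defined")
    if not request.has_rollback_plan:
        reasons.append("no rollback plan")
    if request.availability_last_30d < MIN_AVAILABILITY:
        reasons.append("availability below the agreed standard")
    return reasons


ok = ReleaseRequest("billing", True, True, 0.9995)
risky = ReleaseRequest("reports", True, False, 0.9985)

print(gate(ok))     # nothing to object to, so the release goes ahead
print(gate(risky))  # blocked, with the reasons handed back to the team
```

The difference from third-line support is where the check sits: the gate runs before the change reaches production, rather than after something has broken.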
How are they related?
In smaller organizations, DevOps and SRE get muddled together. This happens because checks, monitoring, and reliability governance come under an operations perspective. As operations take on more DevOps responsibilities, SRE concerns typically become part of the DevOps team’s duties. The monitoring and visibility provided by operations are then configured and managed by the DevOps team.
So what’s the takeaway?
Whenever I hear about SRE, I gauge the organization’s size. For large organizations with extensive change management and governance — like banks or big tech companies that are not focused on breaking things — I expect the SRE team to be major stakeholders in the change board. If SRE are on the change board and have the ability to block changes then they’re probably practicing true SRE.
In smaller organizations, I’d expect SRE to provide centralized support and to tend toward platform support as the organization’s size decreases. If they’re not able to gatekeep production from botched changes, then they’re a third-line support team providing break-fix services.
DevOps is primarily about automation and infrastructure. The cultural change aspect often discussed with DevOps rarely manifests in organizations. Engineers can’t be magically empowered to own things they don’t understand, so self-service releasing isn’t happening. Similarly, a CI/CD pipeline won’t suddenly change an organization’s culture or approach.
In small startups, the ability to deploy multiple times a day can be critical for rapidly changing sales or marketing strategies. However, in larger organizations, most departments can’t move fast enough to take advantage of this agility. B2B transactions and engagements can take quarters, not days or weeks. While you might be able to deploy multiple times a day, it won’t make client procurement move any faster.
The only way DevOps changes an organisation is if the whole organisation is a single product team, i.e. you’re a startup with a single product. In that case, the effects a seasoned DevOps engineer brings can be capitalised on throughout the organisation. Outside of that agility, it’s mostly about getting a smaller function to deliver better and more quickly with practical hands-on improvements.
DevOps is about getting things out the door. It might sound great and have a bunch of fanfare but it’s really about getting stuff out the door.
SRE is about preventing fires. If you’re likely to have lots of fires or the potential for large fires, then you’ll have a budget for sprinklers and smoke alarms and be called a Fire Prevention Officer. Most of the time though, you’re just a fireman.