Site reliability engineering

{{Short description|Use of software engineering practices for IT}}


Site reliability engineering (SRE) is a discipline in the field of software engineering and IT infrastructure support that monitors and improves the availability and performance of deployed software systems and large software services, which are expected to deliver reliable response times across events such as new software deployments, hardware failures, and cybersecurity attacks.{{Cite web |title=What is SRE? - Site Reliability Engineering Explained - AWS |url=https://aws.amazon.com/what-is/sre/ |access-date=2024-12-26 |website=Amazon Web Services, Inc. |language=en-US}} SRE typically emphasizes automation and an infrastructure as code methodology. It draws on elements of software engineering, IT infrastructure, web development, and operations{{Cite web |title=Evaluating where your team lies on the SRE spectrum |url=https://cloud.google.com/blog/products/devops-sre/evaluating-where-your-team-lies-on-the-sre-spectrum/ |access-date=2021-06-26 |website=Google Cloud Blog |language=en}} to improve reliability. It is similar to DevOps in that both aim to improve the reliability and availability of deployed software systems.

History

Site reliability engineering originated at Google with Benjamin Treynor Sloss,{{Cite web|last=Hill|first=Patrick|title=Love DevOps? Wait until you meet SRE|url=https://www.atlassian.com/incident-management/devops/sre|access-date=June 17, 2021|website=Atlassian|language=en}}{{Cite web|title=What is SRE?|url=https://www.redhat.com/en/topics/devops/what-is-sre|access-date=June 17, 2021|website=Red Hat|language=en}} who founded the SRE team in 2003.{{Cite web|last=Treynor|first=Ben|date=2014|title=Keys to SRE|url=https://www.usenix.org/conference/srecon14/technical-sessions/presentation/keys-sre|access-date=June 17, 2021|website=USENIX SREcon14}} The concept expanded within the software development industry, leading various companies to employ site reliability engineers.{{Cite web |last=Gossett |first=Stephen |date=June 1, 2020 |title=What Is a Site Reliability Engineer? What Does an SRE Do? |url=https://builtin.com/software-engineering-perspectives/site-reliability-engineer |access-date=June 17, 2021 |website=Built In |language=en}} By March 2016, Google had more than 1,000 site reliability engineers on staff.{{Cite web|last=Fischer|first=Donald|date=March 2, 2016|title=Are site reliability engineers the next data scientists?|url=https://techcrunch.com/2016/03/02/are-site-reliability-engineers-the-next-data-scientists/|access-date=June 17, 2021|website=TechCrunch|language=en-US}} Dedicated SRE teams are common at larger web development companies. In mid-sized and smaller companies, DevOps teams sometimes perform SRE work as well.
Organizations that have adopted the concept include Airbnb, Dropbox, IBM,{{Cite web|date=November 12, 2020|title=Site Reliability Engineering|url=https://www.ibm.com/cloud/learn/site-reliability-engineering|access-date=June 21, 2021|website=IBM Cloud Education|publisher=IBM|language=en}} LinkedIn,{{cite web|title=Site Reliability Engineering (SRE)|url=https://engineering.linkedin.com/india-engineering/site-reliability-engineers |access-date=March 12, 2024|website=engineering.linkedin.com}} Netflix, and Wikimedia.{{Cite web|title=SRE - Wikitech|url=https://wikitech.wikimedia.org/wiki/SRE|access-date=2021-10-17|website=wikitech.wikimedia.org|language=en}}

Definition

Site reliability engineers (SREs) are responsible for a combination of system availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.{{Cite interview|last=Treynor|first=Ben|interviewer=Niall Murphy|title=In Conversation|url=https://sre.google/in-conversation/|publisher=Google Site Reliability Engineering}} SREs often have backgrounds in software engineering, systems engineering, or system administration.{{cite magazine|last1=Jones|first1=Chris|last2=Underwood|first2=Todd|last3=Nukala|first3=Shylaja|date=June 2015|title=Hiring Site Reliability Engineers|url=https://www.usenix.org/system/files/login/articles/login_june_07_jones.pdf|magazine=;login:|volume=40|pages=35–39|access-date=June 17, 2021|number=3}} The focuses of SRE include automation, system design, and improvements to system resilience.

SRE is considered a specific implementation of DevOps,{{cite web |url=https://driftboatdave.com/2018/10/09/interview-with-betsy-beyer-stephen-thorne-of-google/ |title=Interview with Betsy Beyer, Stephen Thorne of Google |date=9 Oct 2018 |author=Dave Harrison |access-date=24 July 2024}} one that focuses specifically on building reliable systems, whereas DevOps covers a broader scope of operations.{{Cite book |url=https://sre.google/sre-book/table-of-contents/ |title=Site Reliability Engineering: How Google Runs Production Systems |publisher=O'Reilly Media |year=2016 |isbn=978-1-4919-5118-7 |editor-last=Beyer |editor-first=Betsy |location=Sebastopol, CA |oclc=945577030 |editor-last2=Jones |editor-first2=Chris |editor-last3=Petoff |editor-first3=Jennifer |editor-last4=Murphy |editor-first4=Niall}}{{Cite AV media |url=https://www.youtube.com/watch?v=uTEL8Ff1Zvk |title=What's the Difference Between DevOps and SRE? (class SRE implements DevOps) |date=March 1, 2018 |last=Vargo |first=Seth |last2=Fong-Jones |first2=Liz |author2-link=Liz Fong-Jones |type=Video |publisher=Google}}{{Cite web |title=What is SRE? - SRE Explained - AWS |url=https://aws.amazon.com/what-is/sre/ |access-date=2022-11-05 |website=Amazon Web Services, Inc. |language=en-US}} Although the two have different focuses, some companies have rebranded their operations teams as SRE teams.

Principles and practices

Common definitions of the principles include (but are not limited to):{{Cite web |title=The 7 SRE Principles [And How to Put Them Into Practice] |url=https://www.blameless.com//blog/sre-principles |access-date=2021-06-26 |website=www.blameless.com |language=en}}

  • Automation of repetitive tasks for cost-effectiveness.
  • Defining reliability goals to prevent endless effort.
  • Design of systems with a goal to reduce risks to availability, latency, and efficiency.
  • Observability, the ability to ask arbitrary questions about a system without having to know ahead of time what to ask.{{Cite web|title=Learn about observability {{!}} Honeycomb|url=https://docs.honeycomb.io/getting-started/learning-about-observability/|access-date=2021-06-26|website=docs.honeycomb.io|language=en}}

Common definitions of the practices that implement these principles include (but are not limited to):

  • [https://sre.google/sre-book/eliminating-toil/ Toil management], the implementation of the first principle outlined above.
  • Defining and measuring reliability goals—SLIs, SLOs, and error budgets.
  • Non-Abstract Large Scale Systems Design ([https://sre.google/workbook/non-abstract-design/ NALSD]) with a focus on reliability.
  • Designing for and implementing observability.
  • Defining, testing, and running an incident management process.
  • Capacity planning.
  • Change and release management, including CI/CD.
  • Chaos engineering.
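
The relationship between an SLI, an SLO, and an error budget can be illustrated with a short calculation. The following Python sketch is only an illustration under stated assumptions: it assumes a request-based availability SLI (the fraction of successful requests in a measurement window), and the function names and example figures are hypothetical rather than taken from any cited source.

```python
def availability_sli(total_requests: int, failed_requests: int) -> float:
    """Measured SLI: fraction of requests in the window that succeeded."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return (total_requests - failed_requests) / total_requests


def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates in the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo_target, total_requests)
    return 1.0 - failed_requests / budget


# Hypothetical window: a 99.9% SLO, one million requests, 250 failures.
slo = 0.999
total, failed = 1_000_000, 250

sli = availability_sli(total, failed)
budget = error_budget(slo, total)
remaining = budget_remaining(slo, total, failed)

print(f"SLI: {sli:.5f}, SLO met: {sli >= slo}")
print(f"Error budget: about {budget:.0f} failed requests")
print(f"Budget remaining: {remaining:.0%}")
```

In this hypothetical window the SLO tolerates roughly 1,000 failed requests, so 250 failures leave about three quarters of the budget unspent. A team might use the remaining budget to decide how much release risk to accept before the window ends.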

Deployment

SRE teams collaborate with other departments within organizations to guide the implementation of the principles described above. Common ways of structuring SRE teams include the following:{{Cite web |title=SRE at Google: How to structure your SRE team |url=https://cloud.google.com/blog/products/devops-sre/how-sre-teams-are-organized-and-how-to-get-started/ |access-date=2021-06-26 |website=Google Cloud Blog |language=en}}

= Kitchen Sink =

[https://cloud.google.com/blog/products/devops-sre/how-sre-teams-are-organized-and-how-to-get-started Kitchen Sink] refers to the expansive and often unbounded scope of services and workflows that SRE teams oversee. Unlike traditional roles with clearly defined boundaries, SREs are tasked with various responsibilities, including system performance optimization, incident management, and automation. This approach allows SREs to address multiple challenges, ensuring that systems run efficiently and evolve in response to changing demands and complexities.

= Infrastructure =

Infrastructure SRE teams focus on maintaining and improving the reliability of systems that support other teams' workflows. While they sometimes collaborate with platform engineering teams, their primary responsibility is ensuring uptime, performance, and efficiency. Platform teams, on the other hand, primarily develop the software and systems used across the organization. Reliability is a goal for both, but platform teams prioritize creating and maintaining the tools and services used by internal stakeholders, whereas infrastructure SRE teams are tasked with ensuring those systems run smoothly and meet reliability standards.

= Tools =

SRE teams use a variety of tools to measure, maintain, and enhance system reliability. These tools play a role in monitoring performance, identifying issues, and facilitating proactive maintenance. For instance, Nagios Core is commonly employed for system monitoring and alerting, while Prometheus is frequently used for collecting and querying metrics in cloud-native environments.

= Product or Application =

SRE teams dedicated to specific products or applications are common in large organizations.{{Cite web |title=SRE at Google: How to structure your SRE team |url=https://cloud.google.com/blog/products/devops-sre/how-sre-teams-are-organized-and-how-to-get-started |access-date=2024-11-11 |website=Google Cloud Blog |language=en-US}} These teams are responsible for ensuring the reliability, scalability, and performance of key services. Larger companies typically have multiple SRE teams, each focusing on a different product or application so that each area receives specialized attention to meet its performance and availability targets.

= Embedded =

In an embedded model, individual SREs or small SRE pairs are integrated within software engineering teams. These SREs collaborate with developers, applying core SRE principles—such as automation, monitoring, and incident response—directly to the software development lifecycle. This approach aims to enhance reliability, performance, and collaboration between SREs and developers.

= Consulting =

Consulting SRE teams specialize in advising organizations on the implementation of SRE principles and practices. Typically composed of seasoned SREs with a history across various implementations, these teams provide insights and guidance for specific organizational needs. When working directly with clients, these SREs are often referred to as '[https://cloud.google.com/blog/products/devops-sre/introducing-a-new-era-of-customer-support-google-customer-reliability-engineering Customer Reliability Engineers].'

In large organizations that have adopted SRE, a hybrid model is common.{{Citation needed|date=October 2024}} This model includes various implementations, such as multiple product or application SRE teams dedicated to addressing the specific reliability needs of different products. An infrastructure SRE team may collaborate with a platform engineering group to achieve shared reliability goals for a unified platform that supports all products and applications.

Industry

Since 2014, the USENIX organization has hosted the annual [https://www.usenix.org/srecon SREcon] conference, bringing together site reliability engineers from various industries. This conference is a platform for professionals to share knowledge, explore effective practices, and discuss trends in site reliability engineering.{{cite web |author= |date=2021 |title=Usenix SREcon |url=https://www.usenix.org/srecon |access-date=June 17, 2021 |website=USENIX}}

References

{{reflist}}

Further reading

  • {{Cite book|last=Limoncelli|first=Tom|url=https://www.worldcat.org/oclc/891786231|title=The Practice of Cloud System Administration: DevOps and SRE Practices for Web Services|volume=2|last2=Chalup|first2=Strata R.|last3=Hogan|first3=Christina J.|date=September 2014|publisher=Addison-Wesley|isbn=978-0133478549|location=Upper Saddle River, NJ|oclc=891786231}}
  • {{Cite book |editor-first1=Betsy |editor-last1=Beyer |editor-first2=Chris |editor-last2=Jones |editor-first3=Jennifer |editor-last3=Petoff |editor-first4=Niall Richard |editor-last4=Murphy |title=Site Reliability Engineering: How Google Runs Production Systems|publisher=O'Reilly|year=2016|isbn=978-1491929124 |url=https://archive.org/details/sitereliabilitye0000unse}}
  • {{Cite book|url=https://www.worldcat.org/oclc/1052565720|title=Seeking SRE: Conversations About Running Production Systems at Scale|publisher=O'Reilly|year=2018|isbn=978-1491978863|editor-last=Blank-Edelman|editor-first=David N.|edition=1|location=Sebastopol, CA|oclc=1052565720}}
  • {{Cite book |last1=Beyer |first1=Betsy |last2=Murphy |first2=Niall |last3=Kawahara |first3=Kent |last4=Rensin |first4=David|last5=Thorne |first5=Stephen |title=The Site Reliability Workbook: Practical Ways to Implement SRE|publisher=O'Reilly|year=2018|isbn=978-1492029502}}
  • {{Cite book|last=Welch|first=Nat|title=Real-World SRE: The Survival Guide for Responding to a System Outage and Maximizing Uptime|publisher=Packt|year=2018|isbn=978-1788628884}}
  • {{cite book | last=Adkins | first=Heather | last2=Beyer | first2=Betsy | last3=Blankinship | first3=Paul | last4=Lewandowski | first4=Piotr | last5=Oprea | first5=Ana | last6=Stubblefield | first6=Adam |publisher=O'Reilly| title=Building Secure and Reliable Systems: Best Practices for Designing, Implementing, and Maintaining Systems | year=2020 | isbn=978-1-4920-8312-2 | oclc=1129470292}}
  • {{Cite book|last=Rosenthal|first=Casey|last2=Jones|first2=Nora|title=Chaos Engineering: System Resiliency in Practice|publisher=O'Reilly|year=2020|isbn=978-1492043867}}