DrupalCon Vienna 2017: Building Site Reliability Engineering: A Crash Course

From Wikipedia: Site reliability engineering (SRE) is a discipline that incorporates aspects of software engineering and applies that to operations whose goals are to create ultra-scalable and highly reliable software systems.

Over the past year, Acquia has built its own SRE team to help its products and services scale with the demands of a growing number of customers. We want to share our experience so that others can do the same and reap the rewards.

This presentation covers how the SRE team came about at Acquia, what we have achieved so far, and the lessons we have learned along the way. We will then walk through the steps for introducing SRE in your workplace so you can deliver more reliable and scalable services to your customers. We will specifically cover:

SRE's basic concepts and history from Google
The management support you will need to get started
Introducing the idea of service level objectives and error budgets
Operational Responsibility Assessments as a tool to measure risk
Creating a Launch Readiness Checklist to standardize and improve product launches
Finding ideal candidates for your SRE team
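
As a rough illustration of the service level objective and error budget idea listed above, the sketch below turns an availability target into an allowance of downtime. It is a minimal example under stated assumptions, not Acquia's tooling: the function names, the 30-day window, and the choice to measure the budget in minutes are all hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for an availability SLO.

    `slo` is a fraction such as 0.999 (99.9%). The 30-day default window
    is an assumption made for this illustration.
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo, window_days)
    if budget == 0:
        raise ValueError("a 100% SLO leaves no error budget")
    return 1.0 - downtime_minutes / budget
```

With these assumptions, a 99.9% target over 30 days allows roughly 43.2 minutes of downtime, and `budget_remaining(0.999, 21.6)` reports that about half of that budget is still available.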



The intended audience is software engineers, system administrators, and managers who want to improve how they work and how their products and services perform.

References:

Site Reliability Engineering: How Google Runs Production Systems
The Practice of Cloud System Administration, Volume 2
