DrupalCon Dublin 2016: S.R.E. - create ultra-scalable and highly reliable systems

What is SRE?

SRE stands for Site reliability engineering. A word crafted by Google in 2003 when Ben Treynor was hired to lead a team of seven software engineers to run a production environment. Managing such large systems with sizes never seen and still merging new features seamlessly was an huge challenge. Today this paradigm has been adopted by several companies like Microsoft, Apple, Facebook, Dropbox, Amazon and Oracle. They all have assembled SRE teams.

“Spock. This child is about to wipe out every living thing on Earth. Now, what do you suggest we do….spank it?” — Dr. McCoy, Star Trek: The Motion Picture

Yes… Manual Work is killing us all. What should we be doing about it?

We all know how monitoring sucks, platforms fail and manual tasks burn out good Engineers. A site reliability engineer (SRE) uses less than 50% of the time doing "ops" related work such as tickets, alert response or manual fixes. It’s expected that a platform where an SRE team works should be highly automated and self-healing. More than 50% of the time an SRE is automating infrastructure and testing these new features.

Do you have to choose stability versus new features?

More automation means faster time-to-market and continual improvement. Creating self-healing mechanisms or, at the limit, smart rollback strategies to quickly deal with some instability is a strong argument for saying yes to keep the flow of new features. The ideal SRE Engineer has operational, coding and systems knowledge, able to solve complex issues.

What should I expect from this session?

In this session we are going to give examples and inspire change for individuals and companies that are under pressure of repetitive work or seeking innovation in the engineering field. Create incident response automation, self-healing, bring feedback in and break down silos that are impediments to an healthy growth and experimentation.

What about a small Drupal Shop, with few sysadmins?

This will help showing you a different path, but surely the principles of SRE are the way to go. Looking at training and hiring so that sysadmin team gets developers awareness for sensible infrastructure problems. Making sure you're doing blameless postmortems, and fixing all the action items when outages happen. We will detail these and other missed gaps during the session.

“Five card stud, nothing wild. And the sky’s the limit” — Captain Jean Luc Picard, Star Trek: The Next Generation