Site Reliability Engineering: A modern-day programming necessity
Many programmers spend their days working on the same kind of thing. Their time is filled with implementing yet another piece of application logic that stores, retrieves, or moves data from one place to another. Naturally, they also write tests here and there, but that is about it. Such work offers few opportunities to expand your horizons and try something different, new, and better. Those opportunities are the exception rather than the rule, since nobody wants to reinvent the wheel.
System administrators (sysadmins) are not that different. They have deeper knowledge of the systems and understand what can go wrong with them, because they spend their time deploying those systems and reacting to users’ requests. They monitor the systems based on that understanding, and when anything ‘goes wrong,’ they react quickly and effectively. The repetitive nature of this work may date back to the time before the cloud was “the thing,” when admins also physically managed hardware servers and got used to treating hardware and software as one piece. The cloud makes it much easier to get more servers and change infrastructure as needed.
Google Vice President Ben Treynor noticed that sysadmins and programmers complemented each other: a programmer knows how to automate things, while a sysadmin knows how systems work. He combined the two roles and called the result SRE (Site Reliability Engineer), a name that pairs the sysadmin side (reliability) with the engineering side. Thanks to this combination, we can have someone who not only deploys and monitors, but also automates.
I am one of those who started out as a Java programmer. Because I had programming experience and already had a passion for production systems, I was offered an SRE job, back then at Google. That said, I never wanted to be a sysadmin dealing with endless user requests and being available 24/7. One of the recruiters convinced me to just try it, and I enjoyed it then as much as I enjoy it now. I really like knowing that I work on something that matters, I get to learn the details of systems, and I enjoy the uniqueness of every single day, since anything repetitive is automated.
SREs typically work with large systems and have other knowledgeable SREs around to help with anything necessary. As people with wide-ranging interests, they are often tasked with designing infrastructure that works reliably and safely and gives the best value for the price paid. Because the role requires broad knowledge, there are only a few SREs. Furthermore, an SRE’s design decisions have a huge impact and can save a lot of money. That leads to never-ending interest from recruiters and to compensation that matches the impact. SRE tech leads meet with development and QA teams on a regular basis to gather feedback on what can be improved from the infrastructure perspective. The outcomes of these meetings are used to plan the next steps.
SREs are also the ones who scale systems out, which means spreading the load over more computers. That differs from scaling up (increasing the computing power of a single machine), which has its limits. With the monitoring we implement, we can find the parts that cannot handle the load well. Once we reach the ceiling and issues remain, we try to identify the exact queries causing the most significant load. Those can then be fixed or isolated to separate instances.
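To make the last step a bit more concrete, here is a minimal sketch of how the heaviest queries might be singled out from monitoring data. It assumes the monitoring can export samples as (query name, duration in milliseconds) pairs; that format, the function names, and the 50% threshold are all made up for illustration and do not come from any particular tool.

```python
from collections import defaultdict


def rank_queries_by_load(samples):
    """Return (query, total_ms, share) tuples, heaviest first.

    `samples` is an iterable of (query_name, duration_ms) pairs.
    This input format is hypothetical; adapt it to whatever your
    monitoring actually exports.
    """
    totals = defaultdict(float)
    for query, duration_ms in samples:
        totals[query] += duration_ms

    grand_total = sum(totals.values())
    if grand_total == 0:
        return []

    # Sort by total time spent, largest first.
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(query, ms, ms / grand_total) for query, ms in ranked]


def queries_to_isolate(samples, share_threshold=0.5):
    """Names of queries that account for at least `share_threshold`
    of the total time (an arbitrary, illustrative cut-off)."""
    return [
        query
        for query, _, share in rank_queries_by_load(samples)
        if share >= share_threshold
    ]


if __name__ == "__main__":
    samples = [
        ("list_orders", 120.0),
        ("get_user", 5.0),
        ("list_orders", 180.0),
        ("get_user", 7.0),
        ("search", 88.0),
    ]
    for query, total_ms, share in rank_queries_by_load(samples):
        print(f"{query}: {total_ms:.0f} ms ({share:.0%} of total)")
    print("candidates for a fix or a separate instance:",
          queries_to_isolate(samples))
```

In this sample data a single query accounts for most of the time spent, which is the kind of signal that would make it a candidate for being fixed or moved to its own instance.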
How can one become an SRE? There are two ways. Most SREs follow the same path I took: they are programmers who are interested in systems, manage some systems of their own, and tinker with them in their free time. A programmer can also start by learning Docker, a tool that abstracts away the specifics of environments, and then build on that with some knowledge of cloud services from Amazon or Google.
The other way is to be a sysadmin who knows how to code and wants to become an SRE. This path takes a bit more effort: people can pick up system knowledge on their own, but it is harder to learn to write readable and concise code without reviews from others. That said, there are many SRE/DevOps projects on GitHub that one can contribute to and get much-needed feedback.
Since I mentioned “DevOps,” you may wonder how that differs from SRE. The commonly used description of DevOps would cover almost everything described above. The term SRE predates the term DevOps. As I see it, DevOps principles try to make developers responsible for part of the original sysadmin work while they still develop their applications “full time.” There is a large overlap in the tools used and the goals to be achieved; the difference lies mostly in the holistic view and the infrastructure decisions, which SREs make and DevOps practitioners rarely do. The difference becomes more visible as more infrastructure is involved.
To sum up, being an SRE means learning a lot about everything; the work is not repetitive and can have an impact on how the company works. Furthermore, being an SRE is about making the lives of others easier through proper automation and design, which I really enjoy. With the advent of the cloud, SREs are becoming a necessity for any company that is even slightly larger.
Written by Peter Junos