Rachel Kroll

"SRE" doesn't seem to mean anything useful any more

This seems to be a thing now: someone finds out that you worked as an SRE ("site reliability engineer", something from the big G back in the day) somewhere, and now all you're good for is "devops" - that is, you're going to be the "ops bitch" for the "real" programmers. You are the consumer. They are the producer. They squeeze one out and you have to make it sing and dance. You keep things running and you shut the hell up. You wear the pager so they don't have to.

I've seen this from the hiring side of things: when we were trying to hire well-rounded people and put up a job posting with "SRE" in the title, all of a sudden we got a bunch of applications from people who basically *were* ops monkeys. They wanted to be that and do that. That was their life, and they enjoyed it. Those of us on the hiring side were taken aback by this and didn't want that kind of hire getting into the place.

Clearly, somewhere along the line, someone lost the thread, and it has completely destroyed any notion of what an SRE was supposed to be.

Just so we're operating on a level playing field here, I'll lay down my own personal definition of the term: what I expected from people in that role, and what I expected from myself.

To me, an SRE is *both* a sysadmin AND a programmer, developer, whatever you want to call it. It's a logical-and, not an XOR.

By sysadmin, I mean "runs a mean Unix box, including fixing things and diving deeply when they break", and by the programmer/whatever part of it, I mean "makes stuff come into existence that wasn't there before". In particular, I expect someone to run the *right* things on those boxes, to find the actual problems and not just reboot stuff that looks squirrelly, and that they write good, solid code that's respectful of the systems and the network. They probably write programs to make the sysadmin part of the job run itself. Automation for the win.

That, to me, is a first-order approximation of what an SRE should be.

Now, there must be some reason I'm writing about this, right? Yes, and there is. I put out some feelers to see about maybe working with a small company that's building some interesting things. They're wrangling Linux, C and C++, embedded stuff, networking, and they're almost certainly going to need a certain amount of pickiness regarding correctness and security to keep bad people from breaking into stuff. Also, they didn't rattle off the usual list of godawful clown software that most places expect you to work with.

In short, it was a place that's Actually Building Stuff and isn't just throwing their money at one of the usual clown vendors. That's rare!

So, I reached out... and heard nothing. Then, much later, I reached out a different way, and eventually heard back: they looked me over, figured I'm an SRE and they have devops people already, so, uh, no thanks.

That's it. That's the whole thing. The door is closed.

I'm obviously not happy with this situation. When I shared it with some friends, they mentioned having had to have this same conversation (that they are not just ops monkeys) a great many times in the course of looking for employment over the years.

Clearly, things have gone to hell, and unless you WANT that kind of ops-only life, you probably don't want to sell yourself this way.

Just to be annoying, I'm going to rattle off an example of something an ops monkey would never do. I wrote this C++ build tool, right? I've mentioned it a few times in various posts over the years, and there have been a few anemic web pages on my main server talking about what it is and how it works.

I won't go into the full history of it here. A quick description is: it knows how to build C++ using #includes to figure out dependencies, so you need not write a build script, a Makefile, a CMakeLists.txt, or any other config-language file. This goes both for the stuff inside the local source tree and for external system library dependencies: stuff like libcurl, a pgsql or mysql client library, GNU Radio, libmicrohttpd, jansson, or basically anything else you might think of. It knows how to use pkg-config to get the details for some well-known targets, and you can add entries to a config file to map the "#include <proj/lib.h>" onto the pkg-config name for anything else.
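To make that concrete, here's a tiny sketch of the core idea, and it's nothing like the real tool's code: scan a file for #include lines, treat quoted includes as in-tree dependencies, and map known angle-bracket headers onto pkg-config package names. The header-to-package map here is made up for illustration; the real tool reads that mapping from its config file.

    // Sketch only: find #include directives, treat quoted includes as
    // in-tree dependencies, map known system headers to pkg-config names.
    // The pkg_map entries are illustrative, not the tool's actual config.
    #include <cstdio>
    #include <fstream>
    #include <map>
    #include <regex>
    #include <string>

    int main(int argc, char** argv) {
        if (argc < 2) { fprintf(stderr, "usage: %s file.cc\n", argv[0]); return 1; }

        // Hypothetical header -> pkg-config name mapping.
        const std::map<std::string, std::string> pkg_map = {
            {"curl/curl.h",  "libcurl"},
            {"jansson.h",    "jansson"},
            {"microhttpd.h", "libmicrohttpd"},
        };

        const std::regex inc_re(R"(^\s*#\s*include\s*([<"])([^>"]+)[>"])");
        std::ifstream in(argv[1]);
        std::string line;
        while (std::getline(in, line)) {
            std::smatch m;
            if (!std::regex_search(line, m, inc_re)) continue;
            const std::string header = m[2];
            if (m[1] == "\"") {
                printf("local dep:  %s\n", header.c_str());
            } else if (auto it = pkg_map.find(header); it != pkg_map.end()) {
                // The real compile/link flags would come from something like:
                //   pkg-config --cflags --libs libcurl
                printf("system dep: %s -> pkg-config %s\n",
                       header.c_str(), it->second.c_str());
            }
        }
    }

From there, it's a matter of running pkg-config on the mapped names and recursing into the local headers to build out the full graph.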

So, again, that's all old news. I first wrote that over a decade ago and have been using it all this time, with small improvements over the years. What's new? Well, a couple of months back, I decided it was finally time to make the thing run its operations in parallel to take advantage of a multi-processor system.

It now scans the targets in parallel to determine the dependency graph, then kicks off the compiles, then the linker. Everything that used to be serialized has been made as parallel as possible.

Obviously, you can't just do this as a rash decision. There are any number of things which can go terribly wrong if you don't manage your data structures properly for safe cross-thread accesses. You need to be able to put things to sleep and have them wake up later without having them needlessly "spin" on the CPU. You need to do depth-first processing and then have it kick off the "parents" as the "children" finish up. You still have to catch errors and stop the process where appropriate, and you also have to make sure you don't just boil the damn machine by launching EVERYTHING at once. "Just use std::async", it is not!
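For the curious, here's a minimal sketch of that scheduling pattern. To be clear, this is assumption-ware for illustration, not the tool's actual code: each node counts its unbuilt children, workers sleep on a condition variable instead of spinning, and a finished node decrements its parents' counts, releasing any that hit zero. The pool is capped at the hardware's thread count so you don't boil the machine.

    // Sketch of the scheduling idea, not the real tool: dependency
    // counters plus a condvar-guarded ready queue and a bounded pool.
    #include <algorithm>
    #include <condition_variable>
    #include <cstdio>
    #include <deque>
    #include <mutex>
    #include <thread>
    #include <vector>

    struct Node {
        const char* name;
        std::vector<Node*> parents;   // nodes that depend on this one
        int pending_children;         // children not yet built
    };

    int main() {
        // Toy graph: a.o and b.o both feed the final link step.
        Node link{"link", {}, 2};
        Node a{"a.o", {&link}, 0};
        Node b{"b.o", {&link}, 0};

        std::mutex mu;
        std::condition_variable cv;
        std::deque<Node*> ready{&a, &b};  // leaves are ready immediately
        int outstanding = 3;              // total nodes left to build

        auto worker = [&] {
            for (;;) {
                std::unique_lock<std::mutex> lock(mu);
                cv.wait(lock, [&] { return !ready.empty() || outstanding == 0; });
                if (outstanding == 0) return;   // everything built; exit
                Node* n = ready.front();
                ready.pop_front();
                lock.unlock();

                printf("building %s\n", n->name);  // compile/link goes here

                lock.lock();
                --outstanding;
                for (Node* p : n->parents)         // child done: maybe free parent
                    if (--p->pending_children == 0) ready.push_back(p);
                cv.notify_all();
            }
        };

        // Cap the pool at the hardware's thread count, not one thread
        // per target.
        unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
        std::vector<std::thread> pool;
        for (unsigned i = 0; i < nthreads; ++i) pool.emplace_back(worker);
        for (auto& t : pool) t.join();
    }

The toy graph here is just two objects feeding a link step, but the same counters-and-condvar shape handles an arbitrary DAG: the depth-first leaves go in first, and the parents get kicked off as their children finish.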

To give some idea of the impact, touching a file that gets included into a whole bunch of stuff forces everything above it in the graph to get rebuilt. This used to take about 77 seconds on the same 2011-built 8-way workstation box I've had all along. With just the early work on parallelization, it became something more like 21 seconds.

77s -> 21s, just like that: roughly a 3.7x speedup. That's a lot of win that was just waiting to be realized by using the hardware appropriately.

Yeah, I did that. Little old me. This is not the behavior of someone who just twiddles other people's config files for a living.

Raar, I'm cranky.