Rachel Kroll

Escalating via post-it note just to get some health checks

I used to work at a place that had an internal task tracking system. Big deal, you think. Lots of places do that. Well, at this particular company, it was sometimes a pit of sorrow into which you would issue a request and never hear from it again... unless you went to some lengths.

Let's back up a little to set the stage. It's June of some year quite a while back, and it's about 9:30 at night. I guess I'm on call for the "last line of defense" debugging team, and I get pinged by the manager type who's wrangling an outage. It seems this team did some kind of code push and now they're completely down in some cluster: "0 online users" type of thing.

The incident manager asked me to find out what the load balancers were doing to healthcheck the systems in question, so I went looking. It turned out they were getting a small HTTP request on port 80, sort of like this: "GET /status HTTP/1.0".
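If you've never poked at one of these by hand: it's just a TCP connection and a one-line request, which is why it's so easy to check from a shell on the box. Here's a rough Python sketch of what the load balancer was effectively doing; the path and port are as above, but the hostname, timeout, and the "must say 200" rule are my assumptions:

    import socket

    def probe(host: str, port: int = 80, timeout: float = 2.0) -> bool:
        """Send the same sort of bare HTTP/1.0 status probe the LB used."""
        try:
            with socket.create_connection((host, port), timeout=timeout) as s:
                s.sendall(b"GET /status HTTP/1.0\r\n\r\n")
                reply = s.recv(4096)
                return reply.startswith(b"HTTP/1.") and b" 200 " in reply
        except OSError:
            return False  # refused, timed out, etc.: the LB treats you as dead

    # probe("backend-123.example.com") -> False if nothing is listening on :80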

But... the ones in the broken cluster weren't listening on port 80.

I asked if they did port takeover stuff (where one process can actively hand the listening socket to the one that's going to replace it), but then noticed they were running Java, and figured not. That kind of stuff was only really seen in some of the C++ backend processes at this gig.
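For reference, the takeover trick is really just "pass the listening file descriptor to your replacement over a Unix domain socket before you exit." Here's a minimal sketch of the idea using Python 3.9's socket.send_fds/recv_fds; the control-socket path and function names are invented, and the real C++ machinery was obviously fancier:

    import socket

    CONTROL_PATH = "/tmp/port80-handoff.sock"  # invented path for illustration

    def hand_off(listener: socket.socket) -> None:
        """Old process: ship the port-80 listener to whoever is replacing us."""
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as ctrl:
            ctrl.connect(CONTROL_PATH)
            socket.send_fds(ctrl, [b"takeover"], [listener.fileno()])

    def take_over() -> socket.socket:
        """New process: receive the fd and keep serving with no gap in listening."""
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as ctrl:
            ctrl.bind(CONTROL_PATH)
            ctrl.listen(1)
            conn, _ = ctrl.accept()
            with conn:
                _, fds, _, _ = socket.recv_fds(conn, 1024, 1)
            return socket.socket(fileno=fds[0])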

I asked if maybe they had restarted in such a way that they tried to bind port 80 with the new process before the old one had shut down. Crickets.

Anyway, lacking a response from the engineer in question, I kept going, and found that poking another one of their instances in another (healthy) location would get an "I am alive" type response. To me, that seemed like a smoking gun: no response to HC = no love from the load balancer = no users.

A few minutes had gone by with no reply from the engineer, so I just used my Magic Powers on the box to kick one of the Java instances in the head and see what would happen. A few minutes later, it restarted, and now it was listening on port 80 and answering the health checks. Unsurprisingly, it started receiving requests from the load balancer, and the number of users started creeping upward.

I suggested a follow-up task: their stuff should kill itself when it can't get all of the ports it needs. Also, the thing that runs tasks should be set to check all of the ports too, and if it can't get satisfaction, *it* should kill the task.
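Concretely, the "kill itself" half is just a bind loop with a deadline at startup, something like this sketch (the port list and timing are invented; the real service knew its own ports):

    import socket
    import sys
    import time

    REQUIRED_PORTS = [80, 8443]  # illustrative; whatever the service actually needs

    def bind_or_die(port: int, retries: int = 5) -> socket.socket:
        """Grab a port, retrying briefly in case the old process is still exiting."""
        for _ in range(retries):
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            try:
                s.bind(("0.0.0.0", port))
                s.listen(128)
                return s
            except OSError:
                s.close()
                time.sleep(1)
        # Dying loudly means the task runner restarts us, instead of leaving a
        # process running that the load balancer can never health check.
        sys.exit(f"could not bind port {port}; exiting so we get restarted")

    listeners = [bind_or_die(p) for p in REQUIRED_PORTS]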

Now, okay, granted, this is hacky. The program should be able to survive port 80 not being immediately available. It should also have some way to "hand off" from the other process. Or, you know, it could bind to a new port every time and just update its entry in the service directory. There are lots of ways to go about fixing this. However, considering the context, I went for the lowest-hanging fruit. You don't want to ask people to boil the ocean when they're having trouble making tea.

Anyway, they started restarting tasks and the service slowly returned to normal. I dropped offline and went to bed.

Two months went by. They kept having outages related to not having healthchecks. They were all preventable. They'd happen right when I was getting ready to go home, and so I'd miss my bus and have to wait an hour for the next one. That kind of crap.

I started counting the HC-related outages. Two, three, four.

At some point, I was over it, and dropped a post-it note on the desk of the head of engineering, pointing at the task to fix this thing and pleading for them to get involved. I was through being the "bad cop" for this particular one. It was time for management to deal with it.

Another month went by. Then, one day in late September, someone popped up in our IRC channel saying that they had turned on health checks and now had to test them, and could we help? I happened to be there, grabbed one of their machines, and promptly screwed up one of their Java processes so it would just hang. (I forget what I did, but SIGSTOP seems plausible.)
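If you ever need to do the same thing to a test instance, SIGSTOP is the lazy option: the process keeps its ports but stops answering anything, which is exactly the sort of failure a health check should catch. In Python terms, with a made-up pid:

    import os
    import signal

    target_pid = 12345                    # hypothetical pid of the instance under test
    os.kill(target_pid, signal.SIGSTOP)   # frozen: ports stay open, nothing answers
    # later: os.kill(target_pid, signal.SIGCONT) to let it resume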

The task runner thing noticed and set it to unhealthy. About three minutes later, it killed it, and then restarted it. Four minutes after that, it was still restarting, and maybe another four minutes after that, it was finally alive again and taking requests.

I informed this person that it did in fact work, but it took something like ten minutes to cycle back up to being useful again. They thanked me for checking and that was the last I heard about it. Apparently they were fine with this.

Considering this method of accessing things sent more people to the service than the *web site* did, you'd think it'd be kind of important. But, no, they were just rolling along with it.

Then today, I read a post about someone who found that their system had something like 1.2 GB strings full of backslashes because they were using JSON for internal state, and it kept escaping the " characters, so it turned into \\\\\\\\\\\\\\\\\\\\\\\\\" type of crap. That part was new to me, but the description of the rest of it seemed far too familiar.
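If you've never seen that failure mode: it's what happens when something JSON-encodes a string that is already JSON, over and over, instead of parsing it first. Every pass escapes the quotes (and the backslashes) from the pass before, so the junk roughly doubles each round. A quick demo in Python, purely to show the shape of it; I have no idea what their actual stack looked like:

    import json

    state = '{"user": "alice"}'        # state that is already serialized
    for i in range(4):
        state = json.dumps(state)      # re-wrapping instead of json.loads() first
        print(i, len(state), state[:60])
    # The length (and the pile of backslashes) grows every round; run this in a
    # load/modify/save loop for long enough and you get gigabyte-sized strings.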

And I went... hey, I think I know that particular circus!

Turns out - same circus, same elephants, different shit.