Rachel Kroll

The night of 1000 alerts (but only on the Linux boxes)

Here's another story from way back that somehow hasn't been told yet. It's from fairly early in my days of working tech support for a web hosting company. I had been there less than two months, and one night things got pretty interesting.

It was around midnight, and our monitoring system went crazy. It popped up hundreds of alerts. We noticed and started looking into it. The boxes were all answering pings, but stuff like ssh was really slow. FTP and SMTP and similar things would connect but wouldn't yield a banner.

Someone realized they were all in one datacenter, so we called up networking to see what was up. They said none of their alarms had gone off. So, uh, great.

Somehow during this, the question of DNS was raised. One customer's box was grabbed at random, and my usual "run w after login" check was showing the IP address of the support network's external (NAT) interface instead of the usual nat-vlanXYZ.company.domain name (which comes from a PTR record). That was weird, and it was an early hint of what was wrong. Running DNS queries from there also failed - "host" this, "dig" that. Even forcing the queries to the two recursive nameservers that customers were supposed to use from that datacenter didn't work.
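
For anyone who hasn't done that sort of poking: it amounts to aiming a query directly at each resolver and seeing whether anything comes back. Here's a rough sketch of the idea in Python with the dnspython library - the addresses and hostname are made up, and that night it was really just "dig @whatever" from a shell, but the shape is the same:

  import dns.exception
  import dns.resolver

  # Made-up addresses standing in for the datacenter's two recursive resolvers.
  RESOLVERS = ["192.0.2.53", "192.0.2.54"]

  for ip in RESOLVERS:
      res = dns.resolver.Resolver(configure=False)
      res.nameservers = [ip]
      res.timeout = 2     # per-attempt timeout
      res.lifetime = 4    # total budget before giving up
      try:
          answer = res.resolve("www.example.com", "A")
          print(ip, "answered:", [r.to_text() for r in answer])
      except dns.exception.Timeout:
          print(ip, "never answered - which is all we were getting")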

Next, it was time to see what this had to do with the daemons. With tcpdump running, I'd poke port 25 and watch as sendmail (or whatever) lobbed a bunch of queries at the usual DNS resolvers but got no reply. This would go on for a while, and if you waited long enough, it would eventually stop attempting those lookups and you'd finally get the usual SMTP banner. The same applied to other services that worked more or less the same way.
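
From the client side, that pattern looks something like the sketch below: the TCP connect finishes almost instantly, and then the banner takes ages, because the daemon is busy waiting for lookups that will never be answered. This is just an illustration - the address is invented and nobody was writing Python at the time - but it's the same measurement:

  import socket
  import time

  HOST = "203.0.113.10"   # hypothetical customer box running an SMTP daemon
  PORT = 25

  start = time.monotonic()
  with socket.create_connection((HOST, PORT), timeout=5) as sock:
      connected = time.monotonic() - start
      # Wait far longer than any sane poller would; the daemon is off doing
      # DNS lookups (PTR on the client, etc.) that have to time out first.
      sock.settimeout(120)
      banner = sock.recv(1024)
  elapsed = time.monotonic() - start

  print(f"connect took {connected:.1f}s, banner after {elapsed:.1f}s: {banner!r}")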

This made me suspect those resolver boxes, and sure enough, they couldn't be reached with traceroute or ping or really anything else for that matter. Our manager called down to someone again and told them what was happening, but somehow they didn't get on it right away.

Some time passed, and the customers started noticing - the phones started ringing and lots of tickets were being created. Eventually, someone who ran the internal infrastructure responded and things started clearing up.

This was my first time in that sort of situation, and I regret to say that I participated in a "maybe it was a transient error, since things seem fine again now" response storm. I mean, it *was* transient in that it happened and then it stopped, and things DID seem fine again then, but it just feels so wrong and dishonest.

One interesting customer had the whole thing figured out while it was still going on. They managed to do the same thing we had, and noticed that both of the recursive nameserver boxes for that datacenter were toast. I'm sorry to say they got the same form-letter response.

What's even more amazing is that this customer came back with "hey cool, it's up now, just wanted to mention it in case it helped out", and was mostly shocked by the speed of our response. I guess we "got away with it" in that sense.

I found out much later that there were just two physical boxes doing recursive DNS for that whole datacenter, and apparently they had no monitoring. Nice, right?

Now, telling the customers that? That would have been epic, and it would have started a conflagration of credit memos the likes of which hadn't been seen since the time our friend "Z" "slipped" with the cutters while cleaning up the racks.

I was new on the job. I went with it. Later on, I would try to find ways to convey information without resorting to such slimy tactics. Mostly, I tried to get out of tech support, and eventually succeeded.

So, to that specific customer out there from 18 years ago: you were right. To all of those technical customers out there who think they're being told a line, this is to say that sometimes you are in fact being told a line. They're afraid of sharing the truth.

If you're in a spot where you can tell the truth about what happened and not get in trouble for that, even to a customer, consider yourself lucky.

...

Side note: the reason the alerting blew up was that the poller would only wait so long for the daemons to send their usual banner. The daemons, meanwhile, were waiting on their DNS resolution attempts to fail. The monitoring system's poller timeout was shorter than the DNS timeout, so when DNS went down, everything went into alert status.
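
Here's the mismatch in back-of-the-envelope form. The numbers are guesses (the real poller timeout is long gone from my memory), but the glibc resolver defaults of a 5 second timeout and 2 attempts are typical, and with two dead nameservers in resolv.conf the math gets ugly fast:

  import socket

  def banner_poll(host, port, poll_timeout=10):
      """True if the service coughs up a banner within poll_timeout seconds."""
      try:
          with socket.create_connection((host, port), timeout=poll_timeout) as sock:
              sock.settimeout(poll_timeout)
              return bool(sock.recv(1024))
      except OSError:
          return False

  # Daemon side: 5s timeout x 2 attempts x 2 dead nameservers per lookup,
  # and maybe a couple of lookups before the banner goes out.
  per_lookup = 5 * 2 * 2
  lookups = 2
  print("banner delayed by roughly", per_lookup * lookups, "seconds")
  # ...which sails right past a poller that gives up after ~10 seconds: alert.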

While this was going on, the Windows techs (as in, the people who supported customers running Windows boxes) were giving us grief because "only the Linux boxes are showing up with alerts". Apparently the Windows machines didn't tend to do blocking DNS lookups as part of generating their banners, or they timed things out sooner, or who knows what, but whatever it was, their monitoring polls kept succeeding.

"Windows boxes are fine... seems like Linux can't get the job done!"

They were mostly joking about it (as we tended to do with each other), but it was an interesting difference at the time.