Rachel Kroll

Systems design and being bitten by edge-triggering

Let's try a thought experiment: we're going to design a little program that provides a service on a vaguely Unix-flavored box. It's designed to periodically source information over the Internet from hosts that may be close or far away, and then it keeps a local copy for itself and others to use.

You might have it use some kind of config file where it's told the hostnames of the servers it's going to access. Maybe you've set up a pool, such that any given attempt at resolving a name under foo.service.example yields a different IP address every time, and there are bunches of them.

server 0.foo.service.example
server 1.foo.service.example
server 2.foo.service.example
server 3.foo.service.example
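
As an aside, here's roughly what that pool behavior looks like from a program's point of view. This is just a sketch in Python using the made-up hostnames from the config above, so running it verbatim won't resolve anything, but against a real pool name each call can hand back a different set of addresses:

import socket

def resolve_pool(hostname):
    # getaddrinfo returns (family, type, proto, canonname, sockaddr)
    # tuples; sockaddr[0] is the IP. A pool can rotate these per query.
    infos = socket.getaddrinfo(hostname, 123, type=socket.SOCK_DGRAM)
    return sorted({info[4][0] for info in infos})

print(resolve_pool("0.foo.service.example"))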

When would you make it resolve the host down to an IP address? It seems like you might want it to happen when your program starts up. Given the above config, it would find four entries, would turn that into four IP addresses, and then would get busy trying to sync data from them.

But, I haven't told you the whole story. What if you designed your program in a day and age where the network was just assumed to "always be there"? There was no such thing as consumer-grade Internet and home connections. You'd probably write it to do the name-to-IP resolution stuff once and then never again.

Consider what happens when a system with that design runs into the reality of running on goofy consumer-grade hardware with goofy consumer-grade Internet connections, raw crappy power from the local utility, and all of the other entropy sources you can think of. It's probably not going to behave well.

Such a system would start when the machine started and would attempt to get its IP addresses. Then it would take the success or failure and would use whatever it happened to get. If it got nothing, then that's it. It would just sit there staring at its own shoes for eternity, or at least until the next wonky utility power situation restarted the cycle.
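
To make that concrete, here's a sketch in Python of the resolve-once design. The sync_from function is a hypothetical stand-in for whatever the real protocol exchange would be; the point is that resolution happens exactly once, and a failure at that moment leaves the daemon with nothing to do, forever:

import socket
import time

SERVERS = ["0.foo.service.example", "1.foo.service.example",
           "2.foo.service.example", "3.foo.service.example"]

def sync_from(addr):
    print("syncing from", addr)  # stand-in for the real protocol exchange

def resolve_once(names):
    addrs = []
    for name in names:
        try:
            for info in socket.getaddrinfo(name, 123, type=socket.SOCK_DGRAM):
                addrs.append(info[4][0])
        except socket.gaierror:
            pass  # resolution failed; this name is never tried again
    return addrs

addrs = resolve_once(SERVERS)  # happens exactly once, at startup

while True:
    for addr in addrs:  # if addrs came back empty, nothing ever happens again
        sync_from(addr)
    time.sleep(600)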

This failure mode is exactly what happens when you run ntpd on a dumb little consumer "router" for home Internet connections. Chances are good that the router box and the cable modem, DSL bridge (or whatever else) will both restart at the same time. It's also a good bet that the router might manage to boot and start ntpd before the actual Internet connection comes up.

That means ntpd will find itself on a network with no routing to the outside world, and then it will try to resolve things and will fail. Then it will just sit there being useless until something or someone comes along and kicks it.

This happens on Unifi gateway devices, and it will bite you *right now* if the order of things happens to line up as described above.

So, if you find yourself with a machine that's attempting to run, say, systemd-timesyncd against a local USG or something like that and it's not syncing, you probably fell into this trap: the ntpd on the gateway never resolved an upstream, so it has no time to serve. Nothing in ntpd is going to wake it up and try to rectify the situation.

The Unifi + ntpd situation is effectively edge-triggered: the "rising edge" of the box starting up sends it off to do a bunch of setup stuff. If it works, you're good, but if it fails, you're screwed.

Let's try a different approach, then. You are a server. Your job is to talk to other servers periodically. You have been given some config directives to help you find them. Until you have "enough" servers to talk to, you keep trying to add more. This means attempting DNS resolution, and then if that succeeds, trying to talk to them to see if they are sane. If they are, then you keep them around and potentially use them as a source of data. If they aren't, you evict them and start the process over to get another one.
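
Here's the earlier sketch reworked along those lines, still in Python, with hypothetical probe and sync_from stand-ins and a made-up ENOUGH threshold. Every pass through the loop checks whether we have enough healthy servers and tries to fix it if we don't, so a failed DNS lookup just means we try again next time around:

import random
import socket
import time

POOL = ["0.foo.service.example", "1.foo.service.example",
        "2.foo.service.example", "3.foo.service.example"]
ENOUGH = 3  # how many healthy sources we want at any given time

def probe(addr):
    return True  # stand-in: talk to the server and decide if it's sane

def sync_from(addr):
    return True  # stand-in: returns False when a server goes bad

active = set()

while True:
    if len(active) < ENOUGH:  # level-triggered: keep asking "am I there yet?"
        name = random.choice(POOL)
        try:
            addr = socket.getaddrinfo(name, 123, type=socket.SOCK_DGRAM)[0][4][0]
            if probe(addr):
                active.add(addr)
        except socket.gaierror:
            pass  # DNS is down right now; we'll just try again next pass
    for addr in list(active):
        if not sync_from(addr):
            active.discard(addr)  # evict and go find a replacement
    time.sleep(64)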

This approach is level-triggered. The system in question is going to keep trying to get to where it needs to be. It's able to start up in a broken environment and then eventually recover once the rest of the world starts doing its job again. It won't just go on vacation because everyone else hasn't shown up for work yet. Now, obviously it needs a little care, because retrying in a tight loop is also bad. There's an art to doing retries (backoff, jitter, that sort of thing), and that needs to be considered too.
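
One common way to do that is exponential backoff with jitter: double the delay after each failure up to some cap, and pick a random point inside that window so a fleet of clients doesn't retry in lockstep. A minimal sketch (the function name and the base/cap values here are just illustrative):

import random
import time

def backoff_delays(base=1.0, cap=300.0):
    delay = base
    while True:
        yield random.uniform(0, delay)  # "full jitter" spreads retries out
        delay = min(delay * 2, cap)     # exponential growth, capped

for attempt, delay in zip(range(5), backoff_delays()):
    print("attempt %d: waiting %.2fs before retrying" % (attempt, delay))
    time.sleep(delay)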

It's a big difference in how things work, and once you start thinking about systems this way, you'll start noticing all of the little race conditions and timing anomalies which trip up edge-triggered stuff in everyday life. Any time you've had to reset something in "the right order" or otherwise run something back through a series of other states in order to make it all "sync up", you probably were fighting with that.

Isn't it nice when systems know what they're supposed to be doing, and then keep working towards it until they succeed?

TL;DR use chrony.