Monday, July 15, 2019

My server's not having a good time

Remember that scene at the beginning of Back to the Future, when Doc Brown gets all his clocks ringing at the same time, and he's so excited? It's okay if you don't, it's pretty much a throw-away joke to set up a cool skateboarding sequence for Marty. But as a programmer, that scene really speaks to me. Getting a bunch of clocks to all be in perfect sync would be a huge accomplishment!

Time synchronization is one of those things that you don't really have to think about until you do. Part of the reason for this is that our laptops and devices are constantly syncing back to time servers, which themselves are constantly syncing to atomic clocks. Most of the time, this works really well and things stay very well in sync. But if that sync ever breaks down, the device's clock will quickly start to drift. Put your phone in airplane mode for an extended period of time and you'll be surprised how far off the time gets.

Individual devices use the electric frequency of their power source to estimate the passage of a second. This is conceptually similar to how atomic clocks work, but is far less accurate. This is why the time server synchronization is necessary.

The fun part about time synchronization is that when it fails, it usually does so silently. Windows (both desktop and server versions) has a service called the "Windows Time Service" that is responsible for this sync. If it crashes, or runs into some other sort of problem, it doesn't tell you, but clock will begin to drift. It's easy enough to fix - just restart the service (even if it's still running) and it will catch up. (Sometime the catch-up is instantaneous, but sometimes it will adjust itself more gradually. This depends on a number of settings, including domain ones.)

But how do you detect whether it's drifted at all? You can of course check the device clock and manually compare it against another, trusted clock. If you want to check it against the domain clock though, that's usually not something you can just eyeball. Fortunately, the Windows Time Service has a built-in way to do this. It's a command-line option called "stripchart" and you can invoke it like so:

w32tm /stripchart /computer:domain-controller

If you don't know know the domain controller address, you can get it like so:

w32tm /query /peers

The strip chart output looks like this:

11:20:29, d:+00.0005647s o:+00.0005469s  [                 *                 ]
11:20:31, d:+00.0008592s o:+00.0028273s  [                 *                 ]
11:20:33, d:+00.0008876s o:-00.0087090s  [                 *                 ]
11:20:35, d:+00.0008069s o:-00.0067115s  [                 *                 ]
11:20:37, d:+00.0022337s o:-00.0096679s  [                 *                 ]

This is the ideal scenario. Those offsets are small, on the order of a few milliseconds. The little graph to the right shows you that the times are lining up. If they were not, if there was significant drift, you would see a pipe character ( | ) somewhere to the left or right of the *, giving a visual representation of how far off the 'center' you are.

As stated, you can fix a device or a server that has drifted by just restarting the Windows Time Service. But what if it keeps happening? We ran into this issue a little while back. Long story short, we found that the sync intervals across our domain were set too low. Some machines only synced once a week! We made all the syncing much more aggressive, and lowered the tolerance for drift as well.

There's another gotcha here with virtualization. All our servers are VMs running on ESX hosts. What we found was that some of the servers were getting their time from the domain controllers, while others were getting their time from the hosts. The domain controllers get their values from time servers, and we knew we could trust that, so we made sure all the VMs used the domain instead of the host. Where the hosts get their time we didn't investigate. It may be that they get their time from the same (or similar) reliable external time server. But the whole point is consistency, for clocks to be in sync. To do that, you have to use the same source of truth.

The way we stumbled onto this problem most recently was when we saw that a SQL Server Agent job was running a few minutes ahead of when we expected it to. In diagnosing this problem, I wrote a PowerShell script that will do some this legwork for you. It will tell you whether your local clock is out of sync with the domain, and whether or not a specified SQL Server clock is too. You can get it from here!

Again, this is one of those problems that you don't have to think about very often, but is obnoxious when you do. It's come up every place I've ever worked, so it's something that's good to pay attention to and stay on top of.

No comments:

Post a Comment