Friday, July 26, 2019

Sneaky index corruption

A little while back one of the developers asked for my help with a bizarre SQL error they were seeing. They were executing a stored procedure, and getting this scary response:

Msg 0, Level 11, State 0, Line 17
A severe error occurred on the current command.  The results, if any, should be discarded.

That doesn't sound good. Fortunately, this was in a test environment, so we weren't too worried, but it certainly didn't bode well. I checked the server logs (which requires jumping through some hoops on AWS RDS) but there was nothing logged there.

It was a big, gnarly, stored procedure, but I eventually narrowed the error down to a single command. To my surprise, it was a SELECT statement. This got me concerned that we had corruption in the table. So I ran DBCC CHECKTABLE. It was a small table (approx. 250,000 rows) with only four columns, one clustered and one non-clustered index. CHECKTABLE reported back no errors. Suspicious, I tried it a couple more times and a couple more ways, for specific indexes and with more explicit options (like DATA_PURITY and EXTENDED_LOGICAL_CHECKS, even though the later didn't really apply to that table). Still nothing.

Puzzled, I starting trying to query the table in different ways. What I discovered was that doing a 'dump' of the table (SELECT * FROM table) didn't produce an error, but doing a row count (SELECT COUNT(1) FROM table) did! In looking at the query plans for the two queries, I realized the first was utilizing the clustered index, and the second was utilizing the non-clustered index. This suggested that the non-clustered index was corrupt, even though CHECKTABLE couldn't find any problems.

I browsed through the index properties in SQL Server Management Studio, and found that the Fragmentation screen (my favorite go-to spot to check out index stats) was completely blank. Didn't error out, didn't hang (as this screen can do on a big index or hard-hit server), just didn't show anything.


By now I was pretty sure the index had gone bad, despite what CHECKTABLE said. So how to fix it? I tried doing an index rebuild, and got this error message:

Attempting to set a non-NULL-able column's value to NULL.

Uh huh. I wasn't familiar with the history of this table, so I don't know if it changed recently or something. Two of the four columns were NOT NULL and the other two were NULLable. I ran a query to see if I could find some NULL values in the NOT NULL columns, but no dice.

If this error had been in production, I would have tried harder to figure out how the table got corrupt. Once I confirmed that the corruption only existed in this test environment, I just dropped and re-created the index. (Using CREATE INDEX ... WITH(DROP_EXISTING=ON) did not work, I had to actually DROP INDEX and then CREATE INDEX). Once I did that, everything was working fine again!

I got a little lucky - had the table been big, rebuilding the index could have taken a long time. I was also lucky that we use Enterprise Edition, so I could have done an ONLINE (re-)build even if it had been big. Had it been in production, dropping the index might have required an outage.

In talking with our release team later, it seemed something had gone wrong with a database restore. After this popped up on other tables as well, they ended up restoring the whole database from a backup, and things worked fine after that.

It is definitely one of the more exotic problems I've run into as a SQL Server DBA. Hopefully this recap will prove of use to someone else!

Monday, July 15, 2019

My server's not having a good time

Remember that scene at the beginning of Back to the Future, when Doc Brown gets all his clocks ringing at the same time, and he's so excited? It's okay if you don't, it's pretty much a throw-away joke to set up a cool skateboarding sequence for Marty. But as a programmer, that scene really speaks to me. Getting a bunch of clocks to all be in perfect sync would be a huge accomplishment!

Time synchronization is one of those things that you don't really have to think about until you do. Part of the reason for this is that our laptops and devices are constantly syncing back to time servers, which themselves are constantly syncing to atomic clocks. Most of the time, this works really well and things stay very well in sync. But if that sync ever breaks down, the device's clock will quickly start to drift. Put your phone in airplane mode for an extended period of time and you'll be surprised how far off the time gets.

Individual devices use the electric frequency of their power source to estimate the passage of a second. This is conceptually similar to how atomic clocks work, but is far less accurate. This is why the time server synchronization is necessary.

The fun part about time synchronization is that when it fails, it usually does so silently. Windows (both desktop and server versions) has a service called the "Windows Time Service" that is responsible for this sync. If it crashes, or runs into some other sort of problem, it doesn't tell you, but clock will begin to drift. It's easy enough to fix - just restart the service (even if it's still running) and it will catch up. (Sometime the catch-up is instantaneous, but sometimes it will adjust itself more gradually. This depends on a number of settings, including domain ones.)

But how do you detect whether it's drifted at all? You can of course check the device clock and manually compare it against another, trusted clock. If you want to check it against the domain clock though, that's usually not something you can just eyeball. Fortunately, the Windows Time Service has a built-in way to do this. It's a command-line option called "stripchart" and you can invoke it like so:

w32tm /stripchart /computer:domain-controller

If you don't know know the domain controller address, you can get it like so:

w32tm /query /peers

The strip chart output looks like this:

11:20:29, d:+00.0005647s o:+00.0005469s  [                 *                 ]
11:20:31, d:+00.0008592s o:+00.0028273s  [                 *                 ]
11:20:33, d:+00.0008876s o:-00.0087090s  [                 *                 ]
11:20:35, d:+00.0008069s o:-00.0067115s  [                 *                 ]
11:20:37, d:+00.0022337s o:-00.0096679s  [                 *                 ]

This is the ideal scenario. Those offsets are small, on the order of a few milliseconds. The little graph to the right shows you that the times are lining up. If they were not, if there was significant drift, you would see a pipe character ( | ) somewhere to the left or right of the *, giving a visual representation of how far off the 'center' you are.

As stated, you can fix a device or a server that has drifted by just restarting the Windows Time Service. But what if it keeps happening? We ran into this issue a little while back. Long story short, we found that the sync intervals across our domain were set too low. Some machines only synced once a week! We made all the syncing much more aggressive, and lowered the tolerance for drift as well.

There's another gotcha here with virtualization. All our servers are VMs running on ESX hosts. What we found was that some of the servers were getting their time from the domain controllers, while others were getting their time from the hosts. The domain controllers get their values from time servers, and we knew we could trust that, so we made sure all the VMs used the domain instead of the host. Where the hosts get their time we didn't investigate. It may be that they get their time from the same (or similar) reliable external time server. But the whole point is consistency, for clocks to be in sync. To do that, you have to use the same source of truth.

The way we stumbled onto this problem most recently was when we saw that a SQL Server Agent job was running a few minutes ahead of when we expected it to. In diagnosing this problem, I wrote a PowerShell script that will do some this legwork for you. It will tell you whether your local clock is out of sync with the domain, and whether or not a specified SQL Server clock is too. You can get it from here!

Again, this is one of those problems that you don't have to think about very often, but is obnoxious when you do. It's come up every place I've ever worked, so it's something that's good to pay attention to and stay on top of.