Friday, July 26, 2019

Sneaky index corruption

A little while back one of the developers asked for my help with a bizarre SQL error they were seeing. They were executing a stored procedure, and getting this scary response:

Msg 0, Level 11, State 0, Line 17
A severe error occurred on the current command.  The results, if any, should be discarded.

That doesn't sound good. Fortunately, this was in a test environment, so we weren't too worried, but it certainly didn't bode well. I checked the server logs (which requires jumping through some hoops on AWS RDS) but there was nothing logged there.

It was a big, gnarly, stored procedure, but I eventually narrowed the error down to a single command. To my surprise, it was a SELECT statement. This got me concerned that we had corruption in the table. So I ran DBCC CHECKTABLE. It was a small table (approx. 250,000 rows) with only four columns, one clustered and one non-clustered index. CHECKTABLE reported back no errors. Suspicious, I tried it a couple more times and a couple more ways, for specific indexes and with more explicit options (like DATA_PURITY and EXTENDED_LOGICAL_CHECKS, even though the later didn't really apply to that table). Still nothing.

Puzzled, I starting trying to query the table in different ways. What I discovered was that doing a 'dump' of the table (SELECT * FROM table) didn't produce an error, but doing a row count (SELECT COUNT(1) FROM table) did! In looking at the query plans for the two queries, I realized the first was utilizing the clustered index, and the second was utilizing the non-clustered index. This suggested that the non-clustered index was corrupt, even though CHECKTABLE couldn't find any problems.

I browsed through the index properties in SQL Server Management Studio, and found that the Fragmentation screen (my favorite go-to spot to check out index stats) was completely blank. Didn't error out, didn't hang (as this screen can do on a big index or hard-hit server), just didn't show anything.


By now I was pretty sure the index had gone bad, despite what CHECKTABLE said. So how to fix it? I tried doing an index rebuild, and got this error message:

Attempting to set a non-NULL-able column's value to NULL.

Uh huh. I wasn't familiar with the history of this table, so I don't know if it changed recently or something. Two of the four columns were NOT NULL and the other two were NULLable. I ran a query to see if I could find some NULL values in the NOT NULL columns, but no dice.

If this error had been in production, I would have tried harder to figure out how the table got corrupt. Once I confirmed that the corruption only existed in this test environment, I just dropped and re-created the index. (Using CREATE INDEX ... WITH(DROP_EXISTING=ON) did not work, I had to actually DROP INDEX and then CREATE INDEX). Once I did that, everything was working fine again!

I got a little lucky - had the table been big, rebuilding the index could have taken a long time. I was also lucky that we use Enterprise Edition, so I could have done an ONLINE (re-)build even if it had been big. Had it been in production, dropping the index might have required an outage.

In talking with our release team later, it seemed something had gone wrong with a database restore. After this popped up on other tables as well, they ended up restoring the whole database from a backup, and things worked fine after that.

It is definitely one of the more exotic problems I've run into as a SQL Server DBA. Hopefully this recap will prove of use to someone else!

No comments:

Post a Comment