Friday, July 15, 2016

Poking just the wrong spot, or How Two Minor Bugs Generated Some Very Scary Error Messages

So we had a stored procedure that started timing out occasionally in production. We determined the issue was parameter sniffing, so we decided to just add a OPTION RECOMPILE to the query inside it. It was tested, and released.

Upon release however, we starting get an extremely scary-looking error message that was even being entered in the SQL Server log:

SQL Server Assertion: File: <op_ppqte.cpp>, line=12258 Failed Assertion = 'llSkip >= 0'. This error may be timing-related. If the error persists after rerunning the statement, use DBCC CHECKDB to check the database for structural integrity, or restart the server to ensure in-memory data structures are not corrupted.

Naturally we rushed to figure out what was going on. Researching the issue didn't turn up much though - we found a few KB articles that seemed related (like this one and this one), but nothing that described it exactly. As it seemed to be a bug in SQL Server itself, we installed the latest Cumulative Update for SQL Server 2014, but it did not change anything.

As we dug in to try to find a workaround, we discovered there was a different error that had been occurring before we released the change that we had not noticed before:

The offset specified in a OFFSET clause may not be negative.

Now this was easy enough to understand, and we realized that there was indeed a bug in our query. As we worked to reproduce it in our test environments though, we were startled to discover that reproducing it gave us that fatal error! Upon closer inspection of the production error log, we realized that the OFFSET error had disappeared when we added OPTION RECOMPILE.

This led us to believe that the combination of an invalid OFFSET value in a query using OPTION RECOMPILE causes SQL Server to generate a fatal error instead of a specific & useful message. In other words, there's an error in the error handling code.

We fixed the stored procedure to only use valid OFFSET values, kept the OPTION RECOMPILE, tested, and re-deployed. Thankfully, this eliminated all the errors.

This issue has been reported to Microsoft.