Monday, August 2, 2021

Grafana - You get the gist

At my current company we're using a self-managed instance of Grafana to quickly and easily create flexible, good-looking dashboards for reporting on important production metrics. Over the last few years it's become our core monitoring and analysis platform for infrastructure. It's no good at alerting, but for ad-hoc monitoring and quick overviews, we've been very happy with it.

Recently we've also been working to decommission an old, on-premises server and replace it with some AWS infrastructure. A lot of our Grafana dashboards were talking to this server, so we had to go through and migrate or remove each one. Since every Grafana dashboard can have multiple panels, and every panel can use different (or even multiple) data sources, we quickly realized there were hundreds of places the server could be referenced that we were going to have to search through. Doing that by hand would be incredibly tedious and error-prone, and we would need to do it several times: the project was going to take a while, and people would almost certainly add or remove panels during the transition period. We needed something automated and reproducible.

Now there are probably scenarios where you can search Grafana's data store directly, but due to our security setup, that option was not available. The good news is that Grafana has a rich set of APIs for querying (and modifying) its dashboards (and other objects). So I ended up writing a little project that would search the JSON model of each dashboard in the instance for a specific keyword and write the results to a file. It's a pretty simple .NET Core 3.1 C# project, and you can find it here.
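The actual tool is the C# project linked above; as a rough sketch of the core idea, here's what the keyword search might look like in Python. The dashboard fragment and server name below are made up for illustration, and in the real tool each JSON model would come back from Grafana's HTTP API (list dashboards via GET /api/search?type=dash-db, then fetch each model via GET /api/dashboards/uid/<uid>) rather than being a literal:

```python
def find_keyword(node, keyword, path="$"):
    """Recursively walk a dashboard's JSON model, returning the path
    of every string value that contains the keyword."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            hits += find_keyword(value, keyword, f"{path}.{key}")
    elif isinstance(node, list):
        for i, value in enumerate(node):
            hits += find_keyword(value, keyword, f"{path}[{i}]")
    elif isinstance(node, str) and keyword.lower() in node.lower():
        hits.append(path)
    return hits

# Hypothetical dashboard fragment: one panel's datasource still points
# at the old server we were decommissioning.
dashboard = {
    "title": "Ops Overview",
    "panels": [
        {"title": "CPU", "datasource": "old-onprem-server"},
        {"title": "Memory", "datasource": "cloudwatch"},
    ],
}
print(find_keyword(dashboard, "old-onprem-server"))
# → ['$.panels[0].datasource']
```

Run that over every dashboard in the instance and write the hits to a file, and you have a reproducible search you can repeat as often as the migration needs.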

It uses Active Directory credentials to authenticate with the instance, but you could easily modify it to use an API key or some other method.

Hope someone finds that useful!

Friday, July 26, 2019

Sneaky index corruption

A little while back one of the developers asked for my help with a bizarre SQL error they were seeing. They were executing a stored procedure, and getting this scary response:

Msg 0, Level 11, State 0, Line 17
A severe error occurred on the current command.  The results, if any, should be discarded.

That doesn't sound good. Fortunately, this was in a test environment, so we weren't too worried, but it certainly didn't bode well. I checked the server logs (which requires jumping through some hoops on AWS RDS) but there was nothing logged there.

It was a big, gnarly stored procedure, but I eventually narrowed the error down to a single command. To my surprise, it was a SELECT statement. This got me concerned that we had corruption in the table, so I ran DBCC CHECKTABLE. It was a small table (approx. 250,000 rows) with only four columns, one clustered index, and one non-clustered index. CHECKTABLE reported no errors. Suspicious, I tried it a couple more times and a couple more ways, for specific indexes and with more explicit options (like DATA_PURITY and EXTENDED_LOGICAL_CHECKS, even though the latter didn't really apply to this table). Still nothing.

Puzzled, I started querying the table in different ways. What I discovered was that doing a 'dump' of the table (SELECT * FROM table) didn't produce an error, but doing a row count (SELECT COUNT(1) FROM table) did! Looking at the query plans for the two queries, I realized the first was using the clustered index, and the second was using the non-clustered index. This suggested that the non-clustered index was corrupt, even though CHECKTABLE couldn't find any problems.
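If you want to reproduce that kind of isolation directly rather than relying on what plan the optimizer happens to pick, you can force each index in turn with a table hint. Something like this (table and index names here are hypothetical):

```sql
-- Force each index in turn to see which one produces the severe error.
SELECT COUNT(1) FROM dbo.MyTable WITH (INDEX([PK_MyTable]));       -- clustered: fine
SELECT COUNT(1) FROM dbo.MyTable WITH (INDEX([IX_MyTable_Col2]));  -- non-clustered: errors
```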

I browsed through the index properties in SQL Server Management Studio, and found that the Fragmentation screen (my favorite go-to spot to check out index stats) was completely blank. Didn't error out, didn't hang (as this screen can do on a big index or hard-hit server), just didn't show anything.

By now I was pretty sure the index had gone bad, despite what CHECKTABLE said. So how to fix it? I tried doing an index rebuild, and got this error message:

Attempting to set a non-NULL-able column's value to NULL.

Uh huh. I wasn't familiar with the history of this table, so I don't know whether it had changed recently. Two of the four columns were NOT NULL and the other two were NULLable. I ran a query to see if I could find NULL values in the NOT NULL columns, but no dice.

If this error had been in production, I would have tried harder to figure out how the table got corrupt. Once I confirmed that the corruption only existed in this test environment, I just dropped and re-created the index. (Using CREATE INDEX ... WITH (DROP_EXISTING=ON) did not work; I had to actually DROP INDEX and then CREATE INDEX.) Once I did that, everything was working fine again!
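For the record, the working fix looked something like this (names are hypothetical, and ONLINE = ON requires Enterprise Edition):

```sql
-- Rebuild and DROP_EXISTING both failed, so an explicit drop-and-create it is.
DROP INDEX [IX_MyTable_Col2] ON dbo.MyTable;
CREATE NONCLUSTERED INDEX [IX_MyTable_Col2]
    ON dbo.MyTable (Col2)
    WITH (ONLINE = ON);  -- Enterprise Edition only
```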

I got a little lucky - had the table been big, rebuilding the index could have taken a long time. I was also lucky that we use Enterprise Edition, so I could have done an ONLINE (re-)build even if it had been big. Had it been in production, dropping the index might have required an outage.

In talking with our release team later, it seemed something had gone wrong with a database restore. After this popped up on other tables as well, they ended up restoring the whole database from a backup, and things worked fine after that.

It is definitely one of the more exotic problems I've run into as a SQL Server DBA. Hopefully this recap will prove of use to someone else!

Monday, July 15, 2019

My server's not having a good time

Remember that scene at the beginning of Back to the Future, when Doc Brown gets all his clocks ringing at the same time, and he's so excited? It's okay if you don't, it's pretty much a throw-away joke to set up a cool skateboarding sequence for Marty. But as a programmer, that scene really speaks to me. Getting a bunch of clocks to all be in perfect sync would be a huge accomplishment!

Time synchronization is one of those things that you don't really have to think about until you do. Part of the reason for this is that our laptops and devices are constantly syncing back to time servers, which themselves are constantly syncing to atomic clocks. Most of the time, this works really well and things stay very well in sync. But if that sync ever breaks down, the device's clock will quickly start to drift. Put your phone in airplane mode for an extended period of time and you'll be surprised how far off the time gets.

Left to themselves, individual devices keep time with a quartz crystal oscillator, counting its vibrations to estimate the passage of a second. This is conceptually similar to how atomic clocks work, but far less accurate, which is why synchronization with a time server is necessary.

The fun part about time synchronization is that when it fails, it usually does so silently. Windows (both desktop and server versions) has a service called the "Windows Time Service" that is responsible for this sync. If it crashes, or runs into some other sort of problem, it doesn't tell you, but the clock will begin to drift. It's easy enough to fix - just restart the service (even if it's still running) and it will catch up. (Sometimes the catch-up is instantaneous, but sometimes it will adjust itself more gradually. This depends on a number of settings, including domain ones.)
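From an elevated command prompt, the restart and an immediate resync look something like:

```
net stop w32time && net start w32time
w32tm /resync
```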

But how do you detect whether it's drifted at all? You can of course check the device clock and manually compare it against another, trusted clock. If you want to check it against the domain clock though, that's usually not something you can just eyeball. Fortunately, the Windows Time Service has a built-in way to do this. It's a command-line option called "stripchart" and you can invoke it like so:

w32tm /stripchart /computer:domain-controller

If you don't know the domain controller address, you can get it like so:

w32tm /query /peers

The strip chart output looks like this:

11:20:29, d:+00.0005647s o:+00.0005469s  [                 *                 ]
11:20:31, d:+00.0008592s o:+00.0028273s  [                 *                 ]
11:20:33, d:+00.0008876s o:-00.0087090s  [                 *                 ]
11:20:35, d:+00.0008069s o:-00.0067115s  [                 *                 ]
11:20:37, d:+00.0022337s o:-00.0096679s  [                 *                 ]

This is the ideal scenario. Those offsets are small, on the order of a few milliseconds. The little graph to the right shows you that the times are lining up. If they were not, if there was significant drift, you would see a pipe character ( | ) somewhere to the left or right of the *, giving a visual representation of how far off the 'center' you are.
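If you'd rather monitor this than eyeball it, those o: offsets are easy to pull out of the stripchart output programmatically. Here's a small Python sketch (the one-second threshold is an arbitrary choice for illustration, not anything w32tm defines):

```python
import re

def parse_offsets(stripchart_lines, threshold=1.0):
    """Extract the o: (offset) values, in seconds, from w32tm /stripchart
    output, and flag any sample whose drift exceeds the threshold."""
    offsets = []
    for line in stripchart_lines:
        m = re.search(r"o:([+-])(\d+\.\d+)s", line)
        if m:
            sign = -1.0 if m.group(1) == "-" else 1.0
            offsets.append(sign * float(m.group(2)))
    drifted = [o for o in offsets if abs(o) > threshold]
    return offsets, drifted

sample = [
    "11:20:29, d:+00.0005647s o:+00.0005469s  [                 *                 ]",
    "11:20:33, d:+00.0008876s o:-00.0087090s  [                 *                 ]",
]
offsets, drifted = parse_offsets(sample)
print(offsets, drifted)  # sub-millisecond offsets, nothing flagged
```

Feed it the output of `w32tm /stripchart` on a schedule and alert when `drifted` is non-empty, and the silent failure stops being silent.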

As stated, you can fix a device or a server that has drifted by just restarting the Windows Time Service. But what if it keeps happening? We ran into this issue a little while back. Long story short, we found that the sync intervals across our domain were far too long - some machines only synced once a week! We made all the syncing much more aggressive, and lowered the tolerance for drift as well.

There's another gotcha here with virtualization. All our servers are VMs running on ESX hosts. What we found was that some of the servers were getting their time from the domain controllers, while others were getting their time from the hosts. The domain controllers get their values from time servers, and we knew we could trust that, so we made sure all the VMs used the domain instead of the host. Where the hosts get their time we didn't investigate. It may be that they get their time from the same (or similar) reliable external time server. But the whole point is consistency, for clocks to be in sync. To do that, you have to use the same source of truth.

The way we stumbled onto this problem most recently was when we saw that a SQL Server Agent job was running a few minutes ahead of when we expected it to. In diagnosing this problem, I wrote a PowerShell script that will do some of this legwork for you. It will tell you whether your local clock is out of sync with the domain, and whether a specified SQL Server's clock is too. You can get it from here!

Again, this is one of those problems that you don't have to think about very often, but is obnoxious when you do. It's come up every place I've ever worked, so it's something that's good to pay attention to and stay on top of.

Saturday, June 29, 2019

You're doing it wrong

Attention software developers: you're doing it wrong.

To be more specific, you need to be doing these things, and you're not.

Don't block the UI thread!

No matter what the application is, no matter what you're doing, do not block the UI! A "Please Wait" dialog or screen is trivial to code and makes the user experience so much better. You can never assume that an operation will always execute quickly enough not to block - under some scenario, it won't.

Catch everything!

There is no excuse (at least in the .NET world) to allow an exception to bubble up so far that the application crashes. There is always a way to catch exceptions at the highest level. Even if there's no way to recover, exit gracefully.
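The point above is about .NET, but the shape of a top-level guard is the same in any language. Here's a minimal Python sketch (`main()` is a stand-in for a real entry point, which here deliberately blows up):

```python
import logging

def main():
    # Stand-in for the application's real entry point.
    raise RuntimeError("something unrecoverable happened")

def run():
    """Top-level guard: even when recovery is impossible, log the failure
    and return a failure code instead of letting the process crash."""
    try:
        main()
        return 0
    except Exception:
        logging.exception("Unhandled exception; shutting down gracefully")
        return 1
```

The user sees a clean exit (and you get a logged stack trace) instead of a crash dialog.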

And don't restart the app without warning. Visual Studio and SQL Server Management Studio are big offenders here. If recycling is the only way to recover, fine, but don't just do it without telling me!

Always format numbers.

I'm so sick of having to squint at 41095809 and mentally insert the commas (or whatever the delimiter is in your culture). Having the code format this kind of output is so trivial. Software is supposed to make our lives easier. If you don't do this you have failed.
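In Python, for example, grouping is a one-character format specifier, and honoring the user's culture takes only a couple more lines:

```python
n = 41095809
print(f"{n:,}")  # grouped with commas: 41,095,809

# Respecting the user's culture instead of hard-coding commas:
import locale
locale.setlocale(locale.LC_ALL, "")  # use the environment's locale settings
print(locale.format_string("%d", n, grouping=True))
```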

Don't confuse a pattern with a function.

Okay this one's not quite so obvious. There are times when you have to do a similar thing in lots of places. The natural inclination is to make this into a function/method/class. This is a good instinct! Don't repeat yourself! But it can be taken too far. Some things look similar but have enough differences that trying to create a generic base for them just makes the code way more complicated.

To put it another way, you wouldn't try to write one ForEach() function that handles all loop cases. It would quickly become a monster. Looping is a pattern, not a function.
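To make that concrete, here's a Python sketch (entirely made up, not from any real codebase) of where the over-generalized version ends up, next to a call site that just writes the loop it needs:

```python
# Over-generalized: one "reusable" function trying to absorb every variation.
# Every new caller adds another flag, and every flag complicates every caller.
def for_each(items, action, skip_empty=False, reverse=False,
             stop_on_error=False, transform=None):
    if reverse:
        items = list(items)[::-1]
    results = []
    for item in items:
        if skip_empty and not item:
            continue
        try:
            results.append(action(transform(item) if transform else item))
        except Exception:
            if stop_on_error:
                raise
    return results

# Direct: the loop is a pattern, so each call site just writes its own.
def shout(words):
    return [w.upper() for w in words if w]

print(shout(["hi", "", "there"]))  # → ['HI', 'THERE']
```

The second version is shorter, obvious at a glance, and doesn't grow a new parameter every time requirements shift.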

Let's be clear: I have committed all these sins and will again. Learn from my fail, and we can all stop doing it wrong.

Saturday, February 2, 2019

No junctions with SQL Server FILESTREAM columns

We're using a SQL Server table that has a FILESTREAM column, and we have some tests to verify reading & writing to this column. A little while back I started getting this error when I tried to run the tests that accessed the column:

System.ComponentModel.Win32Exception: The system cannot find the path specified.

I found a lot of information out there about this error: it can be caused by permission issues, by bugs in earlier versions of SQL Server, and sometimes by registry & setup errors. But none of the suggested remedies worked: the SQL Server process had full control of all the directories involved, the registry hacks didn't change anything, and we're using 2017 Developer Edition. I thought it might be because I'd renamed my laptop (the test SQL Server instance is local), but after fixing the server name within SQL Server the error kept happening.

Then I remembered: our test database setup script has hard-coded full paths for the database files, and I had set up a directory junction on my laptop because I'd wanted the database to live in a different folder without breaking compatibility. I wondered if that could be the culprit. I deleted the existing test database and re-created it somewhere else entirely, with no junction involved. Sure enough, things started working! (I then re-wrote the setup script to use the default directories.)
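For context, a directory junction like the one involved here is created on Windows like so (paths hypothetical):

```
mklink /J "C:\SQLData" "D:\Storage\SQLData"
```

Anything referencing C:\SQLData transparently lands in D:\Storage\SQLData - transparently, that is, for everything except (apparently) FILESTREAM.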

I don't know whether using junctions with FILESTREAM hits a bug, is simply not supported, or I just didn't have all the permissions set up correctly, but either way the two did not play nice together.

Thursday, August 16, 2018

Stop describing me!

Anyone who's ever worked in software knows how difficult it is to keep the code and the documentation in agreement. It's all too common to find errors in documentation, and not just typos, but misleading or even just false statements. This is (generally) not the result of malice, laziness, or stupidity, but is rather a natural consequence of trying to keep two unrelated things in sync with each other. Programming and natural languages are just not similar enough to make this a simple or automatic task. It's one reason experienced programmers are prone to say that all code comments are lies.

Additionally, documentation is far too often a crutch used to make up for the fact that the design is not intuitive. As Chip Camden once said, "Documentation helps, but documentation is not the solution - simplicity is." That doesn't mean that software can't be doing complex things - doing complex things is the whole point! What it means is that the interfaces to software, whether they be APIs or GUIs, have to make the complex thing simple.

Documentation then is typically a symptom of poor design and a difficult symptom to manage at that. But how then do we communicate the what and why of a software system?

The answer is self-documenting systems. "Self-documenting system" sounds like the tagline of some new tool somebody's trying to sell, but it's actually a design philosophy. It's an approach that says: make the endpoints, buttons, menu options, links, and namespace, class, method, parameter, and property names so obvious as to their purpose & function that, for the majority of them, documentation would be redundant.

The term "redundant" is used very deliberately here - it's not that documentation is never warranted, it's that we should seek to make it unnecessary, for the reasons mentioned above. There will still need to be a few bits of explanation here and there, but they'll be few, so maintaining them won't be a big burden, and they'll be very targeted, so it'll be very obvious when they need to be updated. (By targeted, I mean they'll be either high-level descriptions of the unit as a whole, or notes on some edge cases.)

Self-documenting systems lead us to the Principle of Minimum Necessary Documentation, which states:

There is an inverse relationship between how well designed something is and how much documentation it needs.

The term "inverse" is also used very deliberately. The inverse of infinity is not zero - it's darn near close to it, but it's not actually nothing. A perfect system needs no documentation, but there are no perfect systems, so there is no system that can have zero documentation, no matter how well designed it is. But for a well-designed system, a README file or some tooltips will probably suffice.

But do not assume from this that a project with no documentation except a README is well-designed, or that you can get away with just throwing together a README and taking no other thought for what's going on. Note the principle says "how much documentation it needs" not how much it has. If users have cause to say "the documentation has been skimped on", then the design is poor and no crutch was provided, which is even worse than a bad design band-aided with documentation. Intuitive, self-documenting design is the critical factor, not the documentation itself.

Now, there are a few caveats & clarifications that should be mentioned:
  • 'Self-documenting' is often used in connection with test-driven development. There is a vein of thought that says the tests should be the mechanism for documenting the code. This is a very good approach to take, but tests cannot serve as the user documentation. If the project is a public one, expecting people to read the tests is not realistic, and even with an internal project, understanding the guts of how a component works is often beyond the scope of the task at hand. Tests can and should be the developer documentation, and complete test coverage is critically important to developing robust software, but a test suite is typically too large and specialized to be the sole documentation.
  • Most systems have a fairly obvious surface. It may not be instantly obvious what a class structure or API or SDK or GUI does, but it's usually pretty easy to see what it has. There are interfaces though that don't have an easily reflectable surface, such as command-line tools. At first blush, it might seem that this is a place where more documentation is a necessary evil. Certainly there are a lot of tools that believe extensive documentation is preferable to abstraction (git, I'm looking at you). But the principle does still apply here. It is still incumbent on the developer to make the behavior intuitive (typically by following conventions), and to make the options easily discoverable. There are a myriad of ways to approach this, and it depends a lot on how flexible the tool needs to be. It requires more work, but that's our job: to make it easy to use.
It's as I've said before: deal with things directly. Don't build hedges around them, because it just takes more work and isn't as effective. Don't worry about documentation unless there truly is no other way to get the point across. Focus instead on making your code as intuitive and straightforward as possible, as this is the greatest programming good.

Thursday, May 11, 2017

Adventures in SQL Server Licensing

Over the weekend, we upgraded from SQL Server Standard Edition to Enterprise Edition. (We also upgraded from 2008 to 2016, hooray!) There were of course a lot of factors driving this, but one of the more compelling ones was the desire to utilize all the processors on the host machine. SQL Server Standard is limited to 24 logical processors (official source: Compute Capacity Limits by Edition of SQL Server - see the chart near the bottom of the article). SQL Server Enterprise Edition, however, can use as many processors as the host machine has to offer (see same source). Our host machine has 96 logical processors, meaning we were only using a quarter of its capacity, so this was an important jump forward.

Imagine our surprise and alarm to come in the morning after the upgrade to see that the server was performing very poorly. We quickly discovered that it was still not using all the processors, and even worse, it had now dropped down to only 20!

After some research, we were able to discover that we had installed the wrong kind of Enterprise Edition! Enterprise Edition comes in two kinds: Server/CAL and Core-based. CAL stands for Client Access Licensing, and is a licensing model that caps how many clients can connect to the server at a time. As you can imagine, this is hard to estimate and significantly limits scalability, so it is not a popular option. Almost everyone goes for Core-based licensing, and that's what we had purchased and needed, but it was not what we had installed. Server/CAL Enterprise Edition is limited to 20 cores (40 if using hyperthreading) [see the footnote to the afore-mentioned chart].

To confirm that this was our issue, we checked the SQL Server logs at when the server started up. We saw this line (emphasis added):
SQL Server detected 4 sockets with 24 cores per socket and 24 logical processors per socket, 96 total logical processors; using 20 logical processors based on SQL Server licensing.
We also checked the edition by running SELECT SERVERPROPERTY('edition') and got Enterprise Edition (64-bit).  This seemed okay, until we realized that we needed Enterprise Edition: Core-based Licensing (64-bit).

We were afraid that we were going to have to reinstall SQL Server, but the fix turned out to be simpler than that. We had to download the right installer, the one for SQL Server 2016 Enterprise Edition Core, and then do an Edition Upgrade from the Maintenance menu. This article has a good walkthrough: see the section "UPGRADING SQL SERVER 2008." The screenshots are from an older version of the installer, and it suggests you'll have to enter the product key manually, which you probably won't have to do - it will likely be auto-filled. (This MS Docs article gives a more up-to-date and complete procedure, but it doesn't have any accompanying visuals.)

Once we reached the end of the wizard and clicked "Upgrade", the change was very quick, and once the SQL Server service was restarted, it went up to using all 96 cores! We were able to confirm that the edition was now Enterprise Edition: Core-based Licensing (64-bit), and saw the startup message changed to:
SQL Server detected 4 sockets with 24 cores per socket and 24 logical processors per socket, 96 total logical processors; using 96 logical processors based on SQL Server licensing.
It was of course a huge relief to figure this out, but it was hardly a quick process. We were down for almost three hours. So I'm putting this out there in the hopes that other folks won't have to go through the same pain. And it seems like a lot of people have felt this pain - Microsoft clearly needs to do a much better job of labeling and explaining this quirk of SQL Server installation. (Microsoft needs to massively streamline and simplify all SQL Server licensing, but that's a topic for another time.)

So, final take-aways:
  1. When installing SQL Server Enterprise Edition, make certain you have the right installer for the license type you have purchased, whether it be Server/CAL or Core-based Licensing.
  2. If you find you have installed the wrong type of SQL Server Enterprise Edition, you can correct the issue by doing an Edition Upgrade from the correct installer.
  3. SQL Server licensing is way too complex.
(Thanks to this DBA.SE answer and this blog post that helped guide us down the right path.)