Wednesday, February 12, 2025

Cryptocurrency

The argument for the adoption of cryptocurrency goes something like this: "Traditional currencies are based on trust in the government, and you can't trust the government. Let's put our trust instead in an algorithm-based currency, which is objective and mathematical. That'd be a better trust-based currency system."

On the surface, this is a reasonable argument. The problem is that both these assumptions are false.

Government-backed, trust-based currencies are not based on trust in public officials or even really the government itself. They're based on trust in the assets, institutions, and economy the currency operates in. That trust is not based on the efficiency or affinity of those things, but rather their longevity and viability. Take federal land in the United States - there are 640 million acres, worth hundreds of billions of dollars. When you combine that with the collective value of all of the private and public land under the sovereignty of the USA, that number goes up to hundreds of trillions. And that's just the land - that doesn't count the value of the U.S.'s gold reserves ($300 billion in 2024) or any other physical resource the U.S. controls. And all that pales in comparison to the economic value the U.S. creates through its political and military influence.

This isn't a patriotic thing - the same is true of the U.K. and the British pound, the Euro, and even the economies of smaller countries like Luxembourg. Within the world of NATO at least, you have dozens of countries with their valuable assets and institutions all supporting each other. So trust in practically any first-world currency is trust in a massive, global network of economic stability, not trust in any specific government and certainly not trust in any specific leader or policy.

By contrast, an algorithm - and software in general - is the work of a relatively small number of people. Large, widely-adopted open source projects can attract tens of thousands of contributors, but that's the exception rather than the rule. More significantly, even 25,000 people is a trivial number compared to the billions of people who participate in the world economy. So putting trust in an algorithm is putting trust in a tiny group of people and a trivially tiny piece of infrastructure.

It's also not true that algorithms have no bias. Algorithms incorporate the unconscious biases of their developers by default, and every algorithm is built for a purpose, so they are going to have conscious biases towards inputs and outputs that achieve their goals. That doesn't make them suspect, but it means they aren't objective. We don't need to look any further than social media to know that algorithms can be engineered and primed to exclude certain outcomes, scenarios and groups.

Cryptocurrency is an interesting concept, and it does seem like the math behind it is sound. But that doesn't tell us anything about the health of the implementation. And the adoption of it as an actual financial vehicle is based on unsound logic.

Monday, August 2, 2021

Grafana - You get the gist

At my current company we're using a self-managed instance of Grafana to quickly and easily create flexible, good-looking dashboards for reporting on important production metrics. Over the last few years it's become our core monitoring and analysis platform for infrastructure. It's no good at alerting, but for ad-hoc monitoring and quick overviews, we've been very happy with it.

Recently we've also been working to decommission an old, on-premise server and replace it with some AWS infrastructure. Because a lot of our Grafana dashboards were talking to this server, we had to go through and migrate or remove each one. Because every Grafana dashboard can have multiple panels, and every panel can use different (or even multiple) data sources, we quickly realized there were hundreds of places the server could be referenced that we were going to have to search through. This would be incredibly tedious and error-prone, plus we were going to need to do it several times because the project was going to take a while and people would almost certainly add or remove panels during that transition period. We needed something automated and reproducible.

Now there are probably scenarios where you can search Grafana's data store directly, but due to our security setup, that option was not available. The good news is that Grafana has a rich set of APIs for querying (and modifying) its dashboards (and other objects). So I ended up writing a little project that would search the JSON model of each dashboard in the instance for a specific keyword and write the results to a file. It's a pretty simple .NET Core 3.1 C# project, and you can find it here.

It uses Active Directory credentials to authenticate with the instance, but you could easily modify it to use an API key or some other method.
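If you just want the gist of the approach, here's a minimal sketch - not the actual project - that hits Grafana's /api/search and /api/dashboards/uid endpoints. The Grafana URL, keyword, and API key are placeholders, and it uses a bearer token for brevity instead of the AD setup:

using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text.Json;
using System.Threading.Tasks;

class DashboardSearch
{
    static async Task Main()
    {
        const string grafanaUrl = "https://grafana.example.com"; // placeholder
        const string keyword    = "old-server-name";             // placeholder

        using var client = new HttpClient { BaseAddress = new Uri(grafanaUrl) };
        client.DefaultRequestHeaders.Authorization =
            new AuthenticationHeaderValue("Bearer", "YOUR_API_KEY"); // placeholder key

        // 1. List every dashboard in the instance.
        var searchJson = await client.GetStringAsync("/api/search?type=dash-db");
        using var search = JsonDocument.Parse(searchJson);

        foreach (var hit in search.RootElement.EnumerateArray())
        {
            var uid = hit.GetProperty("uid").GetString();

            // 2. Pull the dashboard's full JSON model and do a simple substring search.
            var model = await client.GetStringAsync($"/api/dashboards/uid/{uid}");
            if (model.Contains(keyword, StringComparison.OrdinalIgnoreCase))
                Console.WriteLine($"{uid}: {hit.GetProperty("title").GetString()}");
        }
    }
}

Searching the whole serialized JSON model is crude, but it means you catch the reference no matter which panel (or which of a panel's multiple data sources) it's buried in.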

Hope someone finds that useful!

Friday, July 26, 2019

Sneaky index corruption

A little while back one of the developers asked for my help with a bizarre SQL error they were seeing. They were executing a stored procedure, and getting this scary response:

Msg 0, Level 11, State 0, Line 17
A severe error occurred on the current command.  The results, if any, should be discarded.

That doesn't sound good. Fortunately, this was in a test environment, so we weren't too worried, but it certainly didn't bode well. I checked the server logs (which requires jumping through some hoops on AWS RDS) but there was nothing logged there.

It was a big, gnarly stored procedure, but I eventually narrowed the error down to a single command. To my surprise, it was a SELECT statement. This got me concerned that we had corruption in the table. So I ran DBCC CHECKTABLE. It was a small table (approx. 250,000 rows) with only four columns, one clustered index, and one non-clustered index. CHECKTABLE reported back no errors. Suspicious, I tried it a couple more times and a couple more ways, for specific indexes and with more explicit options (like DATA_PURITY and EXTENDED_LOGICAL_CHECKS, even though the latter didn't really apply to that table). Still nothing.

Puzzled, I started trying to query the table in different ways. What I discovered was that doing a 'dump' of the table (SELECT * FROM table) didn't produce an error, but doing a row count (SELECT COUNT(1) FROM table) did! In looking at the query plans for the two queries, I realized the first was utilizing the clustered index, and the second was utilizing the non-clustered index. This suggested that the non-clustered index was corrupt, even though CHECKTABLE couldn't find any problems.
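Another way to pin this down, without digging through query plans, is to force each index with a table hint and see which query blows up. Here's a rough sketch of that idea (the table name, index names, and connection string are all placeholders, not the real objects from this incident):

using System;
using Microsoft.Data.SqlClient; // or System.Data.SqlClient

class IndexProbe
{
    static void Main()
    {
        // Placeholders - substitute your real index names.
        var indexes = new[] { "PK_MyTable", "IX_MyTable_SomeColumn" };

        using var conn = new SqlConnection("Server=.;Database=TestDb;Integrated Security=true;");
        conn.Open();

        foreach (var index in indexes)
        {
            using var cmd = conn.CreateCommand();
            // The hint forces SQL Server to read through this specific index.
            cmd.CommandText = $"SELECT COUNT(1) FROM dbo.MyTable WITH (INDEX([{index}]))";
            try
            {
                Console.WriteLine($"{index}: {cmd.ExecuteScalar()} rows - OK");
            }
            catch (SqlException ex)
            {
                Console.WriteLine($"{index}: FAILED - {ex.Message}");
            }
        }
    }
}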

I browsed through the index properties in SQL Server Management Studio, and found that the Fragmentation screen (my favorite go-to spot to check out index stats) was completely blank. Didn't error out, didn't hang (as this screen can do on a big index or hard-hit server), just didn't show anything.


By now I was pretty sure the index had gone bad, despite what CHECKTABLE said. So how to fix it? I tried doing an index rebuild, and got this error message:

Attempting to set a non-NULL-able column's value to NULL.

Uh huh. I wasn't familiar with the history of this table, so I didn't know if it had changed recently or something. Two of the four columns were NOT NULL and the other two were NULLable. I ran a query to see if I could find some NULL values in the NOT NULL columns, but no dice.

If this error had been in production, I would have tried harder to figure out how the table got corrupt. Once I confirmed that the corruption only existed in this test environment, I just dropped and re-created the index. (Using CREATE INDEX ... WITH(DROP_EXISTING=ON) did not work; I had to actually DROP INDEX and then CREATE INDEX.) Once I did that, everything was working fine again!

I got a little lucky - had the table been big, rebuilding the index could have taken a long time. I was also lucky that we use Enterprise Edition, so I could have done an ONLINE (re-)build even if it had been big. Had it been in production, dropping the index might have required an outage.

In talking with our release team later, it seemed something had gone wrong with a database restore. After this popped up on other tables as well, they ended up restoring the whole database from a backup, and things worked fine after that.

It is definitely one of the more exotic problems I've run into as a SQL Server DBA. Hopefully this recap will prove of use to someone else!

Monday, July 15, 2019

My server's not having a good time

Remember that scene at the beginning of Back to the Future, when Doc Brown gets all his clocks ringing at the same time, and he's so excited? It's okay if you don't, it's pretty much a throw-away joke to set up a cool skateboarding sequence for Marty. But as a programmer, that scene really speaks to me. Getting a bunch of clocks to all be in perfect sync would be a huge accomplishment!

Time synchronization is one of those things that you don't really have to think about until you do. Part of the reason for this is that our laptops and devices are constantly syncing back to time servers, which themselves are constantly syncing to atomic clocks. Most of the time, this works really well and things stay very well in sync. But if that sync ever breaks down, the device's clock will quickly start to drift. Put your phone in airplane mode for an extended period of time and you'll be surprised how far off the time gets.

Individual devices keep time by counting the oscillations of an internal (usually quartz) oscillator to estimate the passage of a second. This is conceptually similar to how atomic clocks work, but is far less accurate. This is why the time server synchronization is necessary.

The fun part about time synchronization is that when it fails, it usually does so silently. Windows (both desktop and server versions) has a service called the "Windows Time Service" that is responsible for this sync. If it crashes, or runs into some other sort of problem, it doesn't tell you, but the clock will begin to drift. It's easy enough to fix - just restart the service (even if it's still running) and it will catch up. (Sometimes the catch-up is instantaneous, but sometimes it will adjust itself more gradually. This depends on a number of settings, including domain ones.)
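If the service is running but the clock has drifted, you can also force an immediate resync from the command line instead of restarting it:

w32tm /resync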

But how do you detect whether it's drifted at all? You can of course check the device clock and manually compare it against another, trusted clock. If you want to check it against the domain clock though, that's usually not something you can just eyeball. Fortunately, the Windows Time Service has a built-in way to do this. It's a command-line option called "stripchart" and you can invoke it like so:

w32tm /stripchart /computer:domain-controller

If you don't know the domain controller address, you can get it like so:

w32tm /query /peers

The strip chart output looks like this:

11:20:29, d:+00.0005647s o:+00.0005469s  [                 *                 ]
11:20:31, d:+00.0008592s o:+00.0028273s  [                 *                 ]
11:20:33, d:+00.0008876s o:-00.0087090s  [                 *                 ]
11:20:35, d:+00.0008069s o:-00.0067115s  [                 *                 ]
11:20:37, d:+00.0022337s o:-00.0096679s  [                 *                 ]

This is the ideal scenario. Those offsets are small, on the order of a few milliseconds. The little graph to the right shows you that the times are lining up. If they were not, if there was significant drift, you would see a pipe character ( | ) somewhere to the left or right of the *, giving a visual representation of how far off the 'center' you are.

As stated, you can fix a device or a server that has drifted by just restarting the Windows Time Service. But what if it keeps happening? We ran into this issue a little while back. Long story short, we found that the sync intervals across our domain were set far too long. Some machines only synced once a week! We made all the syncing much more aggressive, and lowered the tolerance for drift as well.

There's another gotcha here with virtualization. All our servers are VMs running on ESX hosts. What we found was that some of the servers were getting their time from the domain controllers, while others were getting their time from the hosts. The domain controllers get their values from time servers, and we knew we could trust that, so we made sure all the VMs used the domain instead of the host. We didn't investigate where the hosts get their time; it may be that they use the same (or a similar) reliable external time server. But the whole point is consistency, for clocks to be in sync. To do that, you have to use the same source of truth.

The way we stumbled onto this problem most recently was when we saw that a SQL Server Agent job was running a few minutes ahead of when we expected it to. In diagnosing this problem, I wrote a PowerShell script that will do some of this legwork for you. It will tell you whether your local clock is out of sync with the domain, and whether or not a specified SQL Server's clock is, too. You can get it from here!
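The script is the real tool, but just to illustrate the idea, the SQL Server half of the comparison boils down to something like this sketch (the server name is a placeholder, and the result is only as accurate as the network round-trip):

using System;
using Microsoft.Data.SqlClient; // or System.Data.SqlClient

class ClockCheck
{
    static void Main()
    {
        // Placeholder server name - point it at the SQL Server you want to check.
        using var conn = new SqlConnection("Server=SQLBOX01;Integrated Security=true;");
        conn.Open();

        using var cmd = conn.CreateCommand();
        cmd.CommandText = "SELECT SYSUTCDATETIME();";

        var sqlUtc   = (DateTime)cmd.ExecuteScalar();
        var localUtc = DateTime.UtcNow; // sampled immediately after the query returns

        Console.WriteLine($"SQL Server clock differs from this machine by " +
                          $"{(sqlUtc - localUtc).TotalSeconds:N1} seconds (give or take the round-trip).");
    }
}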

Again, this is one of those problems that you don't have to think about very often, but is obnoxious when you do. It's come up every place I've ever worked, so it's something that's good to pay attention to and stay on top of.

Saturday, June 29, 2019

You're doing it wrong

Attention software developers: you're doing it wrong.

To be more specific, you need to be doing these things, and you're not.

Don't block the UI thread!


No matter what the application is, no matter what you're doing, do not block the UI! A "Please Wait" dialog or screen is trivial to code and makes the user experience so much better. You can never assume that an operation will always execute quickly enough for the block to go unnoticed. Under some scenario, it won't.
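As a minimal sketch of the idea (WinForms here, with invented names - the same pattern applies to any UI framework), the slow work runs on a worker thread via async/await while the UI thread keeps painting and a "please wait" indicator stays visible:

using System;
using System.Threading.Tasks;
using System.Windows.Forms;

class WaitDemoForm : Form
{
    readonly Button runButton = new Button { Text = "Run", Dock = DockStyle.Top };
    readonly Label  status    = new Label  { Text = "Idle", Dock = DockStyle.Bottom };

    WaitDemoForm()
    {
        Controls.Add(runButton);
        Controls.Add(status);

        runButton.Click += async (s, e) =>
        {
            runButton.Enabled = false;
            status.Text = "Please wait...";

            // The slow operation happens on a worker thread; the UI keeps painting.
            // (Sleep stands in for real work here.)
            await Task.Run(() => System.Threading.Thread.Sleep(3000));

            status.Text = "Done.";
            runButton.Enabled = true;
        };
    }

    [STAThread]
    static void Main() => Application.Run(new WaitDemoForm());
}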

Catch everything!


There is no excuse (at least in the .NET world) to allow an exception to bubble up so far that the application crashes. There is always a way to catch exceptions at the highest level. Even if there's no way to recover, exit gracefully.

And don't restart the app without warning. Visual Studio and SQL Server Management Studio are big offenders here. If recycling is the only way to recover, fine, but don't just do it without telling me!
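For a .NET desktop app, "the highest level" looks roughly like the sketch below. MainForm stands in for whatever your application's main window is, and a real handler would log the details rather than just showing a message box:

using System;
using System.Threading.Tasks;
using System.Windows.Forms;

static class Program
{
    [STAThread]
    static void Main()
    {
        // Exceptions thrown on the UI thread.
        Application.ThreadException += (s, e) => FailGracefully(e.Exception);

        // Exceptions thrown on any other thread.
        AppDomain.CurrentDomain.UnhandledException +=
            (s, e) => FailGracefully(e.ExceptionObject as Exception);

        // Exceptions from faulted tasks that nobody awaited.
        TaskScheduler.UnobservedTaskException +=
            (s, e) => { e.SetObserved(); FailGracefully(e.Exception); };

        Application.Run(new MainForm()); // MainForm is a placeholder for your main window
    }

    static void FailGracefully(Exception ex)
    {
        // Tell the user and exit cleanly instead of letting Windows show a crash dialog.
        MessageBox.Show($"Something went wrong:\n\n{ex?.Message}", "Error");
        Environment.Exit(1);
    }
}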


Always format numbers.


I'm so sick of having to squint at 41095809 and mentally insert the commas (or whatever the delimiter is in your culture). Having the code format this kind of output is so trivial. Software is supposed to make our lives easier. If you don't do this you have failed.
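In .NET it really is trivial - the "N0" format string inserts the grouping separator for whichever culture the code runs under:

using System;
using System.Globalization;

class FormatDemo
{
    static void Main()
    {
        int rowCount = 41095809;

        Console.WriteLine(rowCount.ToString("N0"));                           // 41,095,809 (en-US)
        Console.WriteLine(rowCount.ToString("N0", new CultureInfo("de-DE"))); // 41.095.809
    }
}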

Don't confuse a pattern with a function.


Okay this one's not quite so obvious. There are times when you have to do a similar thing in lots of places. The natural inclination is to make this into a function/method/class. This is a good instinct! Don't repeat yourself! But it can be taken too far. Some things look similar but have enough differences that trying to create a generic base for them just makes the code way more complicated.

To put it another way, you wouldn't try to write one ForEach() function that handles all loop cases. It would quickly become a monster. Looping is a pattern, not a function.


Let's be clear: I have committed all these sins and will again. Learn from my fail, and we can all stop doing it wrong.

Saturday, February 2, 2019

No junctions with SQL Server FILESTREAM columns

We're using a SQL Server table that has a FILESTREAM column, and we have some tests to verify reading & writing to this column. A little while back I started getting this error when I tried to run the tests that accessed this column:

System.ComponentModel.Win32Exception: The system cannot find the path specified.

I found a lot of information out there about this error: it can be caused by permission issues, by bugs in earlier versions of SQL Server, and sometimes registry & setup errors. But none of the remedies they suggested worked: the SQL Server process had full control of all the directories involved, the registry hacks didn't change anything, and we're using 2017 Developer Edition. I thought it might be because I'd renamed my laptop (the test SQL Server instance is local), but after fixing the server name within SQL Server the error kept happening.

Then I remembered: our test database setup script has hard-coded full paths for the database files, and I had set up a file junction on my laptop because I'd wanted it to live in a different folder but didn't want to break compatibility. I wondered if that could be the culprit. I deleted the existing test database and re-created it in a different place altogether that didn't involve the junction. Sure enough, things started working! (I then re-wrote the setup script to use the default directories.)

I don't know if using junctions with FILESTREAM has a bug, is not supported, or I just didn't have all the permissions set up correctly, but either way the two things did not play nice together.

Thursday, August 16, 2018

Stop describing me!

Anyone who's ever worked in software knows how difficult it is to keep the code and the documentation in agreement. It's all too common to find errors in documentation, and not just typos, but misleading or even just false statements. This is (generally) not the result of malice, laziness, or stupidity, but is rather a natural consequence of trying to keep two unrelated things in sync with each other. Programming and natural languages are just not similar enough to make this a simple or automatic task. It's one reason experienced programmers are prone to say that all code comments are lies.

Additionally, documentation is far too often a crutch used to make up for the fact that the design is not intuitive. As Chip Camden once said, "Documentation helps, but documentation is not the solution - simplicity is." That doesn't mean that software can't be doing complex things - doing complex things is the whole point! What it means is that the interfaces to software, whether they be APIs or GUIs, have to make the complex thing simple.

Documentation then is typically a symptom of poor design and a difficult symptom to manage at that. But how then do we communicate the what and why of a software system?

The answer is self-documenting systems. "Self-documenting system" sounds like the tagline of some new tool somebody's trying to sell, but it's actually a design philosophy. It's an approach that says: make the endpoints, buttons, menu options, links, namespaces, and the class, method, parameter, and property names so obvious about their purpose & function that, for the majority of them, documentation would be redundant.

The term "redundant" is used very deliberately here - it's not that documentation is never warranted, it's that we should seek to make it unnecessary, for the reasons mentioned above. There will still need to be a few bits of explanation here and there, but they'll be few, so maintaining them won't be a big burden, and they'll be very targeted, so it'll be very obvious when they need to be updated. (By targeted, I mean they'll be either high-level descriptions of the unit as a whole, or notes on some edge cases.)
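To make that concrete with a small, invented example: the two methods below do the same thing, but only the first needs a comment (or a documentation page) to explain itself:

using System;
using System.Collections.Generic;
using System.Linq;

class CustomerDirectory
{
    readonly List<(string Name, bool Archived)> customers = new List<(string Name, bool Archived)>
    {
        ("Contoso", false), ("Fabrikam", true)
    };

    // Before: what is "mode"? You have to go read the body (or the docs) to find out.
    // Mode 2 means include archived customers.
    public List<string> Get(int mode) =>
        customers.Where(c => mode == 2 || !c.Archived).Select(c => c.Name).ToList();

    // After: the signature is the documentation.
    public List<string> GetCustomerNames(bool includeArchived) =>
        customers.Where(c => includeArchived || !c.Archived).Select(c => c.Name).ToList();

    static void Main()
    {
        var dir = new CustomerDirectory();
        Console.WriteLine(string.Join(", ", dir.GetCustomerNames(includeArchived: true)));
    }
}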

Self-documenting systems lead us to the Principle of Minimum Necessary Documentation, which states:

There is an inverse relationship between how well designed something is and how much documentation it needs.

The term "inverse" is also used very deliberately. The inverse of infinity is not zero - it's darn near close to it, but it's not actually nothing. A perfect system needs no documentation, but there are no perfect systems, so there is no system that can have zero documentation, no matter how well designed it is. But for a well-designed system, a README file or some tooltips will probably suffice.

But do not assume from this that a project with no documentation except a README is well-designed, or that you can get away with just throwing together a README and taking no other thought for what's going on. Note the principle says "how much documentation it needs" not how much it has. If users have cause to say "the documentation has been skimped on", then the design is poor and no crutch was provided, which is even worse than a bad design band-aided with documentation. Intuitive, self-documenting design is the critical factor, not the documentation itself.

Now, there are a few caveats & clarifications that should be mentioned:
  • 'Self-documenting' is often used in connection with test-driven development. There is a vein of thought that says the tests should be the mechanism for documenting the code. This is a very good approach to take, but tests cannot serve as the user documentation. If the project is a public one, expecting people to read the tests is not realistic, and even with an internal project, understanding the guts of how a component works is often beyond the scope of the task at hand. Tests can and should be the developer documentation, and complete test coverage is critically important to developing robust software, but a test suite is typically too large and specialized to be the sole documentation.
  • Most systems have a fairly obvious surface. It may not be instantly obvious what a class structure or API or SDK or GUI does, but it's usually pretty easy to see what it has. There are interfaces though that don't have an easily reflectable surface, such as command-line tools. At first blush, it might seem that this is a place where more documentation is a necessary evil. Certainly there are a lot of tools that believe extensive documentation is preferable to abstraction (git, I'm looking at you). But the principle does still apply here. It is still incumbent on the developer to make the behavior intuitive (typically by following conventions), and to make the options easily discoverable. There are a myriad of ways to approach this, and it depends a lot on how flexible the tool needs to be. It requires more work, but that's our job: to make it easy to use.
It's as I've said before: deal with things directly. Don't build hedges around them, because it just takes more work and isn't as effective. Don't worry about documentation unless there truly is no other way to get the point across. Focus instead on making your code as intuitive and straightforward as possible, as this is the greatest programming good.