Tuesday, June 28, 2016

The Case for Dealing Directly

Update 8/1/2016: After giving this article some further thought, I added another reason.

Suppose you started a new job, and sat down to look over the codebase. As you dug through things, you found that, instead of using public methods and/or properties on classes, the previous developers had written a whole reflection framework to reach into objects and modify the private fields. Then, when you asked why they had done something so very peculiar, they responded by saying, "Oh, this way is much better - we don't have to care about the structure of the objects, or waste time writing accessor methods and properties, and we can spend all our time just writing actual code!"

What? No. What?

We could probably spend a whole series of articles explaining why this is a bad approach, but let's just focus on these reasons:
  1. it works against the language's paradigm
  2. it leads to contrivances that don't use the objects very effectively
  3. it makes figuring out the usage and dependencies of objects very difficult
  4. it has security risks
  5. it is riskier to change
  6. it doesn't scale, and performance can get pretty bad because there aren't many ways to optimize it
This is what dynamic SQL is like, and my reaction to it is the same.

Current or former colleagues of mine might quickly note that I have written my fair share of dynamic SQL, and astute readers might note that I have published dynamic SQL on this very blog before. Dynamic SQL, like reflection, is not a bad thing, and has good uses. But it's generally not good for line-of-business applications, for all the reasons listed above.

Dynamic SQL is code generation. When code generation is employed at design (develop) time, it is a very good thing! Intellisense is the most common form of code generation, and it's so fluent & ubiquitous that we barely even think about it anymore. One of the reasons I sing the praises of ReSharper to anyone and everyone who'll listen (and I'm starting to get there with SQL Prompt) is because of the code generation features. They just make life easier, while still giving you the flexibility to write exactly what you need, because you don't have to accept everything they give you.

But code generation at run-time is a different story. Would you trust an application that generates, compiles, and executes code on the fly? That kind of sounds like the behavior of a virus, or at least the work of someone who doesn't yet realize that they'll have to debug that thing somehow. Sure, it's clever, but it's fraught with security, maintainability, reliability, and performance pitfalls.

There are pieces of code that we have to write that are trivial and repetitive, and that's what code generation is for. But the whole reason programmers haven't all been replaced by robots is that writing good code is not a deterministic activity. It requires a fair amount of specialized knowledge. SQL is no different.

Let's consider each one of these points a little more closely.


It works against the paradigm


In other words, just because we can doesn't mean we should. 

To an application, a database is an external resource, like a file system or an API. Interactions with external resources generally need to be coarse-grained, and the resources need to be well-encapsulated. This means that while tables are the basic unit of database storage, they are not the basic unit of database interactions. In an ideal application, any discrete action will incur only one database call - e.g., there's only one call when the page loads, and only one when the user clicks the button to update the page. Relational databases are built very unambiguously on the principle that interactions should be atomic so they can be consistent and isolated.

Dynamic SQL works against this paradigm because it wants to deal with the tables as individual units instead of dealing with endpoints (stored procedures) performing atomic operations. The typical workaround is to create a transaction in the application code, which is often a performance anti-pattern - it makes the database layer chatty, and holds locks open longer, which leads to blocking & degraded performance. (There are places where in-code transactions make sense, but they are the exception, not the rule.)
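
To make "endpoint" concrete, here is a minimal sketch of what such a stored procedure might look like. The table and procedure names are hypothetical and the details would vary, but the shape is the point: one call from the application, one atomic operation on the server.

    -- Hypothetical endpoint: one call performs the entire "place order" action.
    CREATE PROCEDURE dbo.PlaceOrder
        @CustomerId INT,
        @ProductId  INT,
        @Quantity   INT
    AS
    BEGIN
        SET NOCOUNT ON;
        SET XACT_ABORT ON;  -- any error rolls the whole operation back

        BEGIN TRANSACTION;

        INSERT INTO dbo.Orders (CustomerId, OrderDate)
        VALUES (@CustomerId, GETUTCDATE());

        DECLARE @OrderId INT = SCOPE_IDENTITY();

        INSERT INTO dbo.OrderItems (OrderId, ProductId, Quantity)
        VALUES (@OrderId, @ProductId, @Quantity);

        UPDATE dbo.Inventory
        SET QuantityOnHand = QuantityOnHand - @Quantity
        WHERE ProductId = @ProductId;

        COMMIT TRANSACTION;

        SELECT @OrderId AS OrderId;
    END

The transaction lives entirely on the server, so locks are held only as long as the statements take to run, not across multiple round trips from the application.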

It's true that databases are often under-abstracted and have too much repetition, but the solution isn't to go around the database, it's to employ the mechanisms it does have.


It leads to contrivances


I've never met an ORM framework I really liked (and it turns out there are a lot of other engineers who feel the same way; Google it to find out more). You're constantly having to work around them to get things done properly, and the SQL they generate is almost always a classic example of what not to do in a relational database.

Writing effective and appropriate UPDATE, JOIN, and WHERE clauses is not a trivial task. Structuring them in the correct manner requires addressing multiple considerations - the impact to/of indexes, the size of the tables involved, the native change tracking features, replication, etc. Dynamic SQL code generators do not provide any abstraction in this regard: they have to get very detailed in order to provide the level of control needed. In other words, they have to reinvent the wheel, and in the end their output is no more optimal, and typically less optimal, than SQL written directly in stored procedures.
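
As an illustration (the tables, columns, and indexing assumptions here are all hypothetical), this is the kind of targeted UPDATE a generator has a hard time producing - batched to limit lock duration, joined on indexed keys, and written to touch only the rows that actually need to change:

    -- Flag customers with recent orders, in batches of 5000,
    -- skipping rows that are already correct (fewer writes, less log).
    UPDATE TOP (5000) c
    SET c.HasRecentOrder = 1
    FROM dbo.Customers AS c
    INNER JOIN dbo.Orders AS o
        ON o.CustomerId = c.CustomerId   -- assumes an index on Orders.CustomerId
    WHERE o.OrderDate >= DATEADD(DAY, -30, GETUTCDATE())
      AND c.HasRecentOrder = 0;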


It makes usage discovery difficult


When dynamic SQL generators build statements, the table, view, column, etc. names come from diverse locations, and how they are used is often obfuscated. This makes it hard to know which database objects are actually in use. As with any type of code, when it's difficult to tell what's in use and how it's being used, it's difficult to move forward. It significantly hampers development and debugging efforts. Every database I've ever encountered has had many things in it that everyone is 'pretty sure aren't even used anymore', but because discovering dependencies is difficult even under the best of circumstances, nothing is ever done about it. Dynamic SQL's obfuscation of usage just exacerbates this problem, while direct SQL in stored procedures is much clearer and more consolidated.
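
This is one place where stored procedures pay off immediately: the database itself can tell you what references what. A quick sketch against SQL Server's catalog views (the table name is hypothetical):

    -- Which procedures reference this table?
    SELECT OBJECT_SCHEMA_NAME(d.referencing_id) AS SchemaName,
           OBJECT_NAME(d.referencing_id)        AS ProcedureName
    FROM sys.sql_expression_dependencies AS d
    WHERE d.referenced_id = OBJECT_ID(N'dbo.Customers');

    -- Or just search the module text directly.
    SELECT OBJECT_NAME(m.object_id) AS ObjectName
    FROM sys.sql_modules AS m
    WHERE m.definition LIKE N'%Customers%';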


It has security risks


It is often pointed out that the security concerns of dynamic SQL are mitigated by parameterization. This is true, but it's also true that stored procedures mitigate these risks even further. When the SQL code is being constructed on the fly, it still opens up possibilities for injection (and compile errors) that simply don't exist with stored procedures. (Naturally, stored procedures that build dynamic SQL suffer from these same flaws, and in some ways more so - I include them in the ranks of 'dynamic SQL code generators.' Traitors.)
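
A contrived sketch of the difference (table and column names hypothetical): when input is concatenated into the statement it becomes code, but when it is parameterized it stays data.

    DECLARE @Name NVARCHAR(100) = N'O''Brien';  -- imagine this came from user input

    -- Vulnerable: the input is spliced into the code itself.
    DECLARE @sql NVARCHAR(MAX) =
        N'SELECT * FROM dbo.Customers WHERE LastName = ''' + @Name + N'''';
    EXEC (@sql);  -- breaks on the apostrophe, and worse inputs inject

    -- Safer: the input is passed as a parameter and can never become code.
    EXEC sp_executesql
        N'SELECT * FROM dbo.Customers WHERE LastName = @LastName',
        N'@LastName NVARCHAR(100)',
        @LastName = @Name;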

Stored procedures are also a better way to implement the principle of least privilege. When dynamic and/or inline SQL is involved, the application must be given carte blanche to act on the tables. But with stored procedures, the interactions are much more targeted, and (typically) only one object has to have permissions associated with it. The result is access that is more maintainable and more granular - two key elements of good security.
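
In SQL Server terms, that permission model is as simple as this sketch (the names are hypothetical). Ownership chaining is what lets the procedure act on tables the caller cannot touch directly:

    -- The application's principal can call the endpoint...
    GRANT EXECUTE ON dbo.PlaceOrder TO AppUser;

    -- ...but has no direct access to the underlying table.
    DENY SELECT, INSERT, UPDATE, DELETE ON dbo.Orders TO AppUser;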

These may seem like minor concerns. They are not. Quite frankly, security in every single piece of software on this planet needs to be better - and not just better, but many orders of magnitude better. Good enough security is not good enough.


It is riskier to change


When every operation is being managed by the same code, the risk of changing that code becomes much greater. If a change is made to a dynamic SQL builder for one scenario, it runs the risk of breaking all the other scenarios - in other words, it is harder to keep changes isolated. Say, for example, you have code that is generating INSERT statements, and you make a change to accommodate one edge case. In doing so, you run the risk of breaking all the INSERT statements in the application. But with stored procedures, a change to one statement has zero effect on similar statements in other procedures. Changes can be kept isolated, the system is more flexible, and we can be more confident in our changes.
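
A sketch of that isolation (hypothetical procedures): the edge case is handled inside the one endpoint that needs it, and every other INSERT in the system is untouched.

    -- The edge case lives here, and only here: email is now optional.
    ALTER PROCEDURE dbo.AddCustomer
        @Name  NVARCHAR(100),
        @Email NVARCHAR(256) = NULL
    AS
        INSERT INTO dbo.Customers (Name, Email)
        VALUES (@Name, ISNULL(@Email, N'unknown@example.com'));

    -- dbo.AddOrder, dbo.AddProduct, and every other procedure that does an
    -- INSERT keep working exactly as they did before.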


It doesn't scale


Typically, the only way to optimize dynamic SQL is to pull it out and make it not dynamic. Yes, indexes are always a part of the optimization equation in a relational database, but they're just one tool, and quite frankly they're usually just the first layer. To really address these matters we have to get into the queries themselves, and when we can't directly control the queries, we're at a serious disadvantage.

Also, compilation is not free. Execution plans for dynamic and inline SQL can be cached, and stored procedure plans aren't guaranteed to be reused, but in practice the former typically aren't reused and the latter typically are. It's a significant advantage for DBAs and DevOps engineers to be able to look up an execution plan by the procedure name, an impossible task with dynamic and inline SQL. Reducing the number of compilations is also an important step in preventing CPU bottlenecks in SQL Server, and stored procedures make that far easier to achieve.
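
For instance, in SQL Server that lookup is a straightforward query against the standard DMVs (using the hypothetical PlaceOrder endpoint from the earlier sketch):

    -- Find the cached plan and runtime stats for a procedure by name.
    SELECT p.name,
           ps.execution_count,
           ps.total_worker_time,
           qp.query_plan
    FROM sys.procedures AS p
    INNER JOIN sys.dm_exec_procedure_stats AS ps
        ON ps.object_id = p.object_id
       AND ps.database_id = DB_ID()
    CROSS APPLY sys.dm_exec_query_plan(ps.plan_handle) AS qp
    WHERE p.name = N'PlaceOrder';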


(It is not always possible to use stored procedures - after all, some applications have to interact with databases controlled by third parties. But in those scenarios, inline SQL is still preferable to dynamic SQL: it mitigates or avoids most of the pitfalls discussed above, even though it lacks some of the advantages of stored procedures.)

In the end, this can all be boiled down to one simple idea: deal with the database on its own terms. In the reflection framework example from the first paragraph, the coders are effectively trying to treat their class structure as a set of global variables, or a key-value store. Trying to make something act like something it is not will only lead to trouble. Relational databases are fundamentally different from imperative object code, and attempts to treat them like they're not are misguided.