A Very Small Knowledge Base: March 2016

Thursday, March 24, 2016

SQL Dependency Hell: User Functions in Computed Columns

SQL Server allows you to create columns that don't contain data, but rather are calculations, hence "computed" columns. For example, say you have a table with an OpenDate and a CloseDate. You could create a OpenDays computed column which returns the value of the function DATEDIFF(DAY, OpenDate, CloseDate). It's useful, logical, and runs fast. (A computed column can just be an expression, for example a column Interest that equals Balance * 0.01, but in this post we're going to focus on columns using functions.) The value of the computed column is by default calculated on demand (e.g., whenever the column is used in a SELECT or WHERE), but can also be marked persisted so the value is actually saved to disk and then updated on recalculation.

This is all well and good for built-in functions. However, when you start using user-defined functions, things can get pretty ugly pretty fast.

Consider this pretty standard user table:

CREATE TABLE [dbo].[Users]
(
 [ID] [int] IDENTITY(1,1) NOT NULL,
 [MembershipUserID] [uniqueidentifier] NOT NULL,
 [UserDomainID] [int] NOT NULL,
 [SecurityLevelID] [int] NULL,
 [FirstName] [varchar](50) NOT NULL,
 [LastName] [varchar](50) NOT NULL,
 [EmailAddress] [varchar](100) NOT NULL,
 [CreationDate] [datetime] NOT NULL,
 [PhoneNumber] [varchar](50) NULL,
 [FaxNumber] [varchar](50) NULL,
 [AddressID] [int] NULL,
 [InvalidEmail] [bit] NULL,
 [ManagerUserID] [int] NULL,
 [HoursWorkedPerWeek] [decimal](5, 2) NULL,
 [LanguageID] [int] NULL
)

Now let's say I want to add a computed column FullName that runs my user-defined function udf_FullName. This function just glues FirstName and LastName together, so it's not doing anything crazy. Adding the computed column is easy:

ALTER TABLE dbo.Users ADD FullName AS dbo.udf_FullName(FirstName, LastName)

Now let's fast-forward three months - the client/business user has decided that it wants all names to listed in "Surname, GivenName" format, so I've got to change udf_FullName. It's a pretty trivial change, so I go in and write my ALTER FUNCTION statement. But when I hit Execute, I get this:

Msg 3729, Level 16, State 3, Procedure udf_FullName, Line 1
Cannot ALTER 'dbo.udf_Fullname' because it is being referenced by object 'Users'.

The function can't be altered because it's being used in a computed column. Hmm. It doesn't really help to consolidate logic if we can't ever change it.

This can be gotten around by dropping the column, modifying the function, and then re-adding the column. If the column value isn't being persisted, this probably won't be a particularly expensive operation. But if there are any views, stored procedures, other functions, etc. that depend on this table and use SCHEMABINDING, you'll quickly run into a cascading chain or updates you have to script. The column will be moved to the end of the table as well, which can cause heartburn for some clients. These are not insurmountable challenges, but they are hassles you just don't need to make for yourself.

Where things can really get ugly is when the function runs a sub-query. I've detailed in a previous post the performance pitfalls of scalar functions running queries, and they become even worse when put into a computed column. A sub-query function used in another query is bad enough, but it's somewhat scope-restricted - it'll only be used in certain circumstances. But with a computed column, the function and its sub-query have the potential to be re-run every single time the table is touched!

Oh, but it gets worse. You can actually get into a scenario where the table and the function have a circular dependency on each other! Consider this function:

CREATE FUNCTION [dbo].[udf_GetUserRoles]
   (@userID int)
RETURNS NVARCHAR(MAX)
WITH SCHEMABINDING
AS
BEGIN
 DECLARE @roleList NVARCHAR(MAX)

 SELECT @roleList = COALESCE(@roleList+',', ' ') + r.RoleName
 FROM dbo.Users u
 INNER JOIN dbo.aspnet_UsersInRoles ur on u.MembershipUserID = ur.UserId
 INNER JOIN dbo.aspnet_Roles r on ur.RoleId = r.RoleId
 WHERE u.ID = @userID

 RETURN @roleList
END

If we ignore the performance, it's not necessarily a bad function. But it's possible to call it from a computed column on the Users table itself!

ALTER TABLE dbo.Users ADD RoleList AS [dbo].[udf_GetUserRoles]([ID])

Fun fact: if you have a database with this setup, and you try to use Microsoft's Generate Scripts tool, it will crash as it gets into an infinite loop of attempted dependency resolution!

Computed columns are nice in theory, but they are difficult to change, especially when functions are involved. For this reason, they should be used sparingly, and they shouldn't use user-defined functions at all.

Wednesday, March 16, 2016

A SQL No-No: Sub-queries in User-Defined Scalar Functions

User-defined functions (UDFs) in SQL Server are a great way to encapsulate and re-use logic. However, they can be mis-used by writing scalar-valued functions that run queries. A thing we sometimes see is a function like this, which runs a query to return a single value:

CREATE FUNCTION [dbo].[udf_GetFeeAmount] 
(
    @AccountID INTEGER
)
RETURNS MONEY
AS
BEGIN

    DECLARE @Result MONEY
    SELECT @Result = NULL

    IF EXISTS(SELECT 1 FROM dbo.Accounts WHERE AccountID = @AccountID AND StatusID IN (1,4,8))
    BEGIN
        SELECT @Result = COALESCE(SUM(Fees.Amount), 0)
        FROM dbo.Fees
        WHERE AccountID = @AccountID
    END

    RETURN @Result

END

The problem with is not necessarily what's been written, but how it gets used. If this was a stored procedure outputting this value or setting an OUTPUT parameter, it'd be great. But say you're going to use this function in a SELECT statement that will return a million rows:

SELECT 
 AccountID,
 dbo.udf_GetFeeAmount(AccountID) AS FeeAmount
FROM dbo.Accounts

The function will be run for each of the 1,000,000 rows, thus generating an additional 1,000,000 queries. Naturally, the performance of such an operation will be miserable. The SQL Server engine can't run the function query in parallel, it has to run it again for each row emitted.

How then do we encapsulate logic like this in a way that still gets decent performance?

The answer is often a view or a table-valued function (which in many ways is a parameterized view). For example, this scenario could be re-written as this view:

CREATE VIEW [dbo].[vw_AccountFees]
AS
SELECT 
  AccountID,
  ISNULL
  (
    SUM
    (
      CASE WHEN Accounts.StatusID IN (1,4,8) THEN Fees.Amount ELSE 0 END
    )
    , 
    0
  ) AS FeeAmount
FROM dbo.Accounts
  LEFT JOIN dbo.Fees
    ON Accounts.AccountID = Fees.AccountID
GROUP BY 
  Accounts.AccountID

This view is simplistic for the purposes of this example - you could/should expose multiple Accounts columns from this view, so as to avoid the need of JOIN-ing to [dbo].[Accounts] again for other columns.

Writing the query this way allows the database engine to build an execution plan with parallel elements, thus allowing greater throughput and speed, and lower I/O cost.

Now, neither views nor table-valued functions are magic bullets. Table-valued functions don't always work so well in CROSS APPLY queries, and neither approach can save you from a query that's just badly written, or the fact that a table needs better indexing. But they open up optimization and scaling possibilities that querying in scalar functions simply doesn't give you.

The question may be asked: what are scalar-valued user-defined functions good for then? The answer is: for encapsulating non-table-based operations. Say you need to sanitize a column by doing multiple REPLACEs on it, and will need to do this sanitization in multiple places, or even on multiple columns. This would be an excellent candidate for logic that should go into a scalar function. I have written many scalar functions that do date operations that would be tedious to retype all the time, like this:

ALTER FUNCTION [dbo].[udf_GetMonthStart]
(
  @Date DATETIME
)
RETURNS DATE
AS
BEGIN 
   RETURN DATEADD(MONTH, MONTH(@Date)-1, DATEADD(YEAR,YEAR(@Date)-1900,0))
END

Scalar-valued functions are for computing scalars, not for operating on tables, and using them to do so works against the database engine and leads to poor performance.