Wednesday, March 16, 2016

A SQL No-No: Sub-queries in User-Defined Scalar Functions

User-defined functions (UDFs) in SQL Server are a great way to encapsulate and re-use logic. However, they can be mis-used by writing scalar-valued functions that run queries. A thing we sometimes see is a function like this, which runs a query to return a single value:

CREATE FUNCTION [dbo].[udf_GetFeeAmount] 
(
    @AccountID INTEGER
)
RETURNS MONEY
AS
BEGIN

    DECLARE @Result MONEY
    SELECT @Result = NULL

    IF EXISTS(SELECT 1 FROM dbo.Accounts WHERE AccountID = @AccountID AND StatusID IN (1,4,8))
    BEGIN
        SELECT @Result = COALESCE(SUM(Fees.Amount), 0)
        FROM dbo.Fees
        WHERE AccountID = @AccountID
    END

    RETURN @Result

END

The problem with is not necessarily what's been written, but how it gets used. If this was a stored procedure outputting this value or setting an OUTPUT parameter, it'd be great. But say you're going to use this function in a SELECT statement that will return a million rows:

SELECT 
 AccountID,
 dbo.udf_GetFeeAmount(AccountID) AS FeeAmount
FROM dbo.Accounts

The function will be run for each of the 1,000,000 rows, thus generating an additional 1,000,000 queries. Naturally, the performance of such an operation will be miserable. The SQL Server engine can't run the function query in parallel, it has to run it again for each row emitted.

 How then do we encapsulate logic like this in a way that still gets decent performance?

The answer is often a view or a table-valued function (which in many ways is a parameterized view). For example, this scenario could be re-written as this view:

CREATE VIEW [dbo].[vw_AccountFees]
AS
SELECT 
  AccountID,
  ISNULL
  (
    SUM
    (
      CASE WHEN Accounts.StatusID IN (1,4,8) THEN Fees.Amount ELSE 0 END
    )
    , 
    0
  ) AS FeeAmount
FROM dbo.Accounts
  LEFT JOIN dbo.Fees
    ON Accounts.AccountID = Fees.AccountID
GROUP BY 
  Accounts.AccountID

This view is simplistic for the purposes of this example - you could/should expose multiple Accounts columns from this view, so as to avoid the need of JOIN-ing to [dbo].[Accounts] again for other columns.

Writing the query this way allows the database engine to build an execution plan with parallel elements, thus allowing greater throughput and speed, and lower I/O cost.

Now, neither views nor table-valued functions are magic bullets. Table-valued functions don't always work so well in CROSS APPLY queries, and neither approach can save you from a query that's just badly written, or the fact that a table needs better indexing. But they open up optimization and scaling possibilities that querying in scalar functions simply doesn't give you.

The question may be asked: what are scalar-valued user-defined functions good for then? The answer is: for encapsulating non-table-based operations. Say you need to sanitize a column by doing multiple REPLACEs on it, and will need to do this sanitization in multiple places, or even on multiple columns. This would be an excellent candidate for logic that should go into a scalar function. I have written many scalar functions that do date operations that would be tedious to retype all the time, like this:

ALTER FUNCTION [dbo].[udf_GetMonthStart]
(
  @Date DATETIME
)
RETURNS DATE
AS
BEGIN 
   RETURN DATEADD(MONTH, MONTH(@Date)-1, DATEADD(YEAR,YEAR(@Date)-1900,0))
END

Scalar-valued functions are for computing scalars, not for operating on tables, and using them to do so works against the database engine and leads to poor performance.

No comments:

Post a Comment