## Goodness-of-Fit Testing with SQL Server Part 4.2: The Hosmer–Lemeshow Test with Logistic Regression

**By Steve Bolton**

…………The last installment of this amateur series of self-tutorials was the beginning of a short detour into using SQL Server to perform goodness-of-fit testing on regression lines, rather than on probability distributions. These are actually quite simple concepts; any college freshman ought to be able to grasp the idea of a single regression line, since all they do is graph the values of one variable as it changes in tandem with another, whereas distributions merely summarize the distinct counts for each value of a variable, as the famous bell curve does. In the first case we’re dealing with how well a regression line models the relationship between two variables – assuming one exists – and in the second, we’re assessing whether the distribution of a single variable matches one of many distributions that constantly recur in nature, like the Gaussian “normal” distribution, or other common ones like the Poisson, binomial, uniform and exponential distributions.

…………These topics can of course become quite complex rather quickly once we factor in modifications like multivariate cases, but I’ll avoid these topics for the sake of simplicity, plus the fact that there are an endless number of variants of them that could keep us busy to the end of time. I really can’t end this segment though without addressing a particular variant, since logistic regression is a widely used algorithm that addresses a distinctive set of use cases. There is likewise an endless array of alternative takes on the basic logistic regression algorithm, such as adaptations for multinomial cases, but I’ll stick to the old maxim, KISS: Keep It Simple, Stupid. The Hosmer-Lemeshow Test that is commonly used for fitness testing on logistic regression may not be applicable in some of these more advanced cases, but it is indispensable for the bare bones version of the algorithm.

**Adapting Regression Lines to Use the Logistic Function**

This regression algorithm is a topic I became somewhat familiar with while writing A Rickety Stairway to SQL Server Data Mining, Algorithm 4: Logistic Regression a while back, in a different tutorial series. Suffice it to say that the algorithm is ideal for situations in which you need to place bounds on the outputs that a regression line can generate; most commonly this is a Boolean yes-no choice that ranges between 0 and 1, but it can be adapted to other scales, such as 0 to 100. A linear regression line would produce nonsensical values in these instances, since the values would be free to go off the charts, but a logistic regression is guaranteed to stay within its assigned bounds. This is accomplished by using the logistic function, which is really not much more difficult to implement than linear regression (with one critical limitation). The formula is widely available on the Internet, so when writing the code for this article I retrieved it from the most convenient source as usual: Wikipedia.[1]

…………In many ways, the logistic function behaves like the cumulative distribution functions (CDFs) I mentioned in Goodness-of-Fit Testing with SQL Server, Part 2: Implementing Probability Plots in Reporting Services, in that the probabilities assigned to the lowest value begin at 0 and accumulate up to 1 by the time we reach the final value in an ordered dataset. It also behaves in many ways like a weighted function, in that the pressure on it to conform to the bounds increases as it nears them; I think of it in terms of the way quarks are inextricably bound together by the strong force within hadrons, which increases as they come closer to breaking free. In between the upper and lower bounds the regression takes on the appearance of an S-curve rather than the lines seen in normal regression.

…………Another easily readable and insightful commentary on logistic regression can be found at University of Toronto Prof. Saed Sayad’s website[2], in which he provides a succinct explanation of the logistic function equation and some alternative measures for the accuracy of the mining models it generates. Three of these are subspecies of the R^{2} measure we discussed last week in connection with linear regression, which are discussed together under the rubric of Pseudo R^{2}. Fitness testing of this kind is as necessary on regression lines as it is for distribution matching, because they rarely model the relationships between variables perfectly; as statistician George Box (one of the pioneers of Time Series) once put it so colorfully, “All models are wrong, but some are useful.”[3] Sayad also mentions alternative methods like Likelihood Ratio tests and the Wald Test. The measure of fit I’ve seen mentioned most often in connection with logistic regression goes by the memorable moniker of the Hosmer-Lemeshow Test. It apparently has its limitations, as we shall see, but it is not terribly difficult to implement – with some important caveats.

**Banding Issues with Coding the Hosmer-Lemeshow Test**

In fact, the first three steps in the dynamic SQL in Figure 1 are almost identical to those used to calculate regression lines in last week’s article and a tutorial from a different series, Outlier Detection with SQL Server Part 7: Cook’s Distance, which I won’t waste time recapping here.[4] The fourth step just applies the logistic function to the results, in a single, simple line of T-SQL. After this, I simply insert the logistic regression values into a table variable for later retrieval, including returning it to the user towards the end of the procedure; there’s really no reason not to return the correlation, covariance, slope, intercept and standard deviations for both variables as well, given that we’ve already calculated them. Step 5 is where the meat and potatoes of the Hosmer-Lemeshow Test can be found. Its strategy is essentially to divide the values into bands, which are often referred to as “deciles of risk” when the test is employed in one of its most common applications, risk analysis.[5] The bands are then compared to the values we’d expect for them based on probabilistic calculations and the gap between them is quantified and summarized.

…………It is now time for my usual disclaimer: I am writing this series in order to familiarize myself with these data mining tools, not because I have expertise in using them, so it would be wise to check my code thoroughly before putting it to use (or even, God forbid, a production environment). Normally I check my code against the examples provided in various sources, especially the original academic papers whenever possible. In this case, however, I couldn’t find any at a juncture where they would have come in handy, given that I am not quite certain that I am splitting the data into bands on the right axis. I am still learning how to decipher the equations that underpin algorithms of this kind and some of them differ significantly in notation and nomenclature, so it may be that I ought to be counting the observed and expected values differently, which affects how they are split; from the wording in a book inventors David W. Hosmer Jr. and Stanley Lemeshow wrote on *Applied Logistic Regression*, it seems that the counts between the bands are supposed to vary much more significantly[6] than they do in my version, which could be a side effect of incorrect banding. There are apparently many different ways of doing banding[7], however, including my method below, in which the @BandCount parameter is plugged into the NTILE windowing function in Step 1.

…………I’ve seen two general rules of thumb mentioned in the literature for setting the @BandCount to an optimal level: using groups of fewer than five members leads to incorrect results, while using fewer than six groups “almost always” leads to passing the fitness test.[8] Averages for both the X and Y axes are calculated for each band, then the regression line is derived through the usual methods in Steps 3 through 5, with one crucial difference: some of the aggregates have to be derived from the bands rather than the original table, otherwise we’d end up with an apples vs. oranges comparison. This is one of several points where the procedure can go wrong, since the banding obscures the detail of the original table and can lead to a substantially different regression line. Keep in mind though that the literature mentions several alternative methods of banding, so there may be better ways of accomplishing this.

** Figure 1: T-SQL Code for the Hosmer–Lemeshow Test Procedure**CREATE PROCEDURE [Calculations].[GoodnessOfFitLogisticRegressionHosmerLemsehowTestSP]

@DatabaseName as nvarchar(128) = NULL, @SchemaName as nvarchar(128), @TableName as nvarchar(128),@ColumnName1 AS nvarchar(128),@ColumnName2 AS nvarchar(128), @BandCount As bigint

AS

DECLARE @SchemaAndTableName nvarchar(400),@SQLString nvarchar(max)

SET @SchemaAndTableName = @DatabaseName + ‘.’ + @SchemaName + ‘.’ + @TableName

SET @SQLString = ‘DECLARE @MeanX float,@MeanY float, @StDevX float, @StDevY float,

@Count float, @Correlation float, @Covariance float, @Slope float, @Intercept float,

@ValueRange float, @HosmerLemeshowTest float

— STEP #1 — GET THE RANGE OF VALUES FOR THE Y COLUMN

SELECT @ValueRange = CASE WHEN RecordCount % 2 = 0 THEN ValueRange + 1 ELSE ValueRange END

FROM (SELECT Max(‘ + @ColumnName2 + ‘) – Min(‘ + @ColumnName2 + ‘) AS ValueRange, Count(*) AS RecordCount

FROM ‘ + @SchemaAndTableName + ‘

WHERE ‘ + @ColumnName2 + ‘ IS NOT NULL) AS T1

— STEP #2 — CREATE THE BANDS

DECLARE @LogisticRegressionTable table

(ID bigint, — ID is the decile identifier

CountForGroup bigint,

AverageX float,

AverageY float,

RescaledY float,

LogisticResult float

)

INSERT INTO @LogisticRegressionTable

(ID, CountForGroup, AverageX, AverageY)

SELECT DISTINCT DecileNumber, COUNT(*) OVER (PARTITION BY DecileNumber) AS CountForGroup , Avg(CAST(‘ + @ColumnName1 + ‘ AS float)) OVER

(PARTITION BY DecileNumber ORDER BY DecileNumber) AS AverageX, Avg(CAST(‘ + @ColumnName2 + ‘ AS float)) OVER (PARTITION BY DecileNumber

ORDER BY DecileNumber) AS AverageY

FROM (SELECT ‘ + @ColumnName1 + ‘, ‘ + @ColumnName2 + ‘, NTILE(‘ + CAST(@BandCount AS nvarchar(128)) + ‘) OVER (ORDER BY ‘ + @ColumnName1 + ‘) AS DecileNumber

FROM ‘ + @SchemaAndTableName + ‘) AS T1

UPDATE @LogisticRegressionTable — this could be done in one step, but Im leaving it this way for legibility purposes

SET RescaledY = AverageY / @ValueRange

— STEP #3 – RETRIEVE THE GLOBAL AGGREGATES NEEDED FOR

OTHER CALCULATIONS

— note that we cant operate on the original table here, otherwise the stats would be different from those of the bands

SELECT @MeanX = Avg(CAST(X AS float)), @MeanY = Avg(CAST(Y AS float)), @StDevX = StDev(CAST(X AS float)), @StDevY = StDev(CAST(Y AS float))

FROM (SELECT ‘ + @ColumnName1 + ‘ AS X, ‘ + @ColumnName2 + ‘ AS Y

FROM ‘ + @SchemaAndTableName + ‘

WHERE ‘ + @ColumnName1 + ‘ IS NOT NULL AND ‘ + @ColumnName2 + ‘ IS NOT NULL) AS T1

— STEP #4 – CALCULATE THE CORRELATION (BY FIRST GETTING THE COVARIANCE)

SELECT @Covariance = SUM((AverageX – @MeanX) * (AverageY – @MeanY)) / (‘ + CAST(@BandCount AS nvarchar(128)) + ‘ – 1)

FROM @LogisticRegressionTable

SELECT @Correlation = @Covariance / (@StDevX * @StDevY)

— STEP #5 – CALCULATE THE SLOPE AND INTERCEPT

SELECT @Slope = @Correlation * (@StDevY / @StDevX)

SELECT @Intercept = @MeanY – (@Slope * @MeanX)

— STEP #6 – CALCULATE THE LOGISTIC FUNCTION

UPDATE T1

SET LogisticResult = 1 / (1 + EXP(-1 * (@Intercept + (@Slope * AverageX))))

FROM @LogisticRegressionTable AS T1

— RETURN THE RESULTS

SELECT @HosmerLemeshowTest = SUM(Power((RescaledY – LogisticResult), 2) / (CountForGroup * LogisticResult * (1- LogisticResult)))

FROM @LogisticRegressionTable AS T1

SELECT * FROM @LogisticRegressionTable

SELECT @HosmerLemeshowTest AS HosmerLemeshowTest, @Covariance AS Covariance, @Correlation AS Correlation, @MeanX As MeanX, @MeanY As MeanY, @StDevX as StDevX, @StDevY AS StDevY, @Intercept AS Intercept, @Slope AS Slope‘

–SELECT @SQLString — uncomment this to debug dynamic SQL errors

EXEC (@SQLString)

…………Another potential problem arises when we derive the expected values of column Y for each band, using the logistic function in Step 6.[9] This is one of those maddening scaling issues that continually seem to arise whenever these older statistical tests are applied to Big Data-sized recordsets. This very simple and well-known function is implemented in the formula 1 / (1 + EXP(-1 * (@Intercept + (@Slope * AverageX)))), but when the result of the exponent operation is an infinitesimally small value, it gets truncated when adding the 1. This often occurs when the values for the @Intercept are high, particularly above 100. When 1 is divided by the resulting 1, the logistic function result is 1, which leads to division by zero errors when calculating the test statistic in the last SELECT assignment. There might be a way for a more mathematically adept programmer to rescale the @Intercept, @Slope and other variables so that this truncation doesn’t occur, but I’m not going to attempt to implement a workaround unless I’m sure it won’t lead to incorrect results in some unforeseen way.

…………Yet another issue is that my implementation allows the second column to be Continuous rather than the usual Boolean either-or choice seen in simple logistic regression. That requires rescaling to the range of permissible values, but the way I’ve implemented it through the RescaledY table variable column and @ValueRange variable may be incorrect. The SELECT that assigns the value to @HosmerLemeshowTest can also probably be done a little more concisely in fewer steps, but I want to highlight the internal logic so that it is easier to follow and debug. The rest of the code follows much the same format as usual, in which null handling, SQL injection protection, bracket handling and validation code are all omitted for the sake of legibility and simplicity. Most of the parameters are designed to allow the user to perform the regression on any two columns in the same table or view, in any database they have sufficient access to. The next-to-last line allows programmers to debug the dynamic SQL, which will probably be necessary before putting this procedure to professional use. In the last two statements I return all of the bands in the regression table plus the regression stats, since the costs of calculating them have already been incurred. It would be child’s play for us to also calculate the Mean Squared Error from these figures with virtually no computational cost, but I’m not yet sure if it enjoys the same validity and significance with logistic regression as it does with linear.

** Figure 2: Sample Results from Duchennes and Higgs Boson Datasets**EXEC Calculations.GoodnessOfFitLogisticRegressionHosmerLemsehowTestSP

@DatabaseName = N’DataMiningProjects’,

@SchemaName = N’Physics’,

@TableName = N’HiggsBosonTable’,

@ColumnName1 = N’Column1′,

@ColumnName2 = N’Column2′,

@BandCount = 12

…………I’ve tested most of the procedures for the last two tutorial series against a 9-kilobyte dataset on the Duchennes form of muscular dystrophy, which is made publicly available by Vanderbilt University’s Department of Biostatistics. For this week’s article, however, the intercepts were too high for most combinations of comparisons between the PyruvateKinase, CreatineKinase, LactateDehydrogenase and Hemopexin columns, resulting in the aforementioned divide-by-zero errors. For the few combinations that worked, the test statistic was ridiculously inflated; for other databases I’m familiar with, it returned results in the single and double digits (which is apparently permissible, since I’ve seen professionals post Hosmer-Lemeshow results online that fall in that range) but for whatever reason, this was not the case with Duchennes dataset.

…………That is why I derived the sample results in Figure 2 from the first two float columns of the Higgs Boson Dataset I downloaded from University of California at Irvine’s Machine Learning Repository, which I normally use for stress-testing because its 11 million rows occupy nearly 6 gigabytes in the SQL Server table I converted it into. Given that the first column is obviously abnormal and the second clearly follows a bell curve, I expected the results to indicate a serious lack of fit, but the test statistic was only a minuscule 1.30909988070185E-05. In fact, the values seem to shrink in tandem with the record count, which makes me wonder if another, more familiar scaling issue is operative here. As we’ve seen through the last two tutorial series, many common statistical measures were not designed with today’s Big Data table sizes in mind and thus end up distorted when we try to cram too much data into them. Given that there are so many other issues with my implementation, it is hard to tell if that is an additional problem or some inaccuracy in my code. Substituting the AverageY value for the RescaledY I used in the test statistic only seems to introduce new problems, without solving this one.

**The Case Against Hosmer-Lemeshow**

…………Regardless of the quality of my implementation, the usefulness of the Hosmer-Lemeshow Test is apparently still questionable, given that even professionals publish articles with inspiring titles like “Why I Don’t Trust the Hosmer-Lemeshow Test for Logistic Regression.”[10] University of Pennsylvania Sociology Prof. Paul Allison lists several drawbacks to the test even when it is correctly implemented, including the most common one mentioned on the Internet, the weaknesses inherent in dividing the dataset into groups. This can even lead to significantly different results when different stats software is run against the same data.[11] Hosmer and Lemeshow themselves point out that the choice of boundaries (“cut points”) for the groups can lead to significantly different results.[12] Furthermore, as Frank Harrell puts it in a thread in the CrossValidated forum, “The Hosmer-Lemeshow test is to some extent obsolete because it requires arbitrary binning of predicted probabilities and does not possess excellent power to detect lack of calibration. It also does not fully penalize for extreme overfitting of the model. …More importantly, this kind of assessment just addresses overall model calibration (agreement between predicted and observed) and does not address lack of fit such as improperly transforming a predictor.”[13] He recommends alternatives like a “generalized R^{2},” while Allison gets into “daunting” alternatives like “standardized Pearson, unweighted sum of squared residuals, Stukel’s test, and the information matrix test.”[14]

…………Nevertheless, despite these well-known shortcomings, the Hosmer-Lemeshow Test remains perhaps the best-known goodness-of-fit measure for logistic regression. Fitness tests seem to have well-defined use cases in comparison to the outlier detection methods we covered in my last tutorial series, with the Hosmer-Lemeshow among those that occupy a very distinct niche. The other methods mentioned by these authors seem to be more advanced and less well-known, so I thought it still worthwhile to post the code, even if there are certain to be problems with it. On the other hand, it is not worth it at this point to optimize the procedures much until the accuracy issues can either be fixed or debunked. It performed well in comparison to other procedures in this series anyways, with a time of 3:08 for the trial run in Figure 2. Only 1 of the 8 queries accounted for 89 percent of the cost of the whole execution plan, and that one contained two expensive Nested Loops operators, which might mean there’s room for further optimization if and when the accuracy can be verified.

…………Given the number of issues with my code as well as the inherent issues with the test itself, it might be fitting to write a rebuttal to my mistutorial titled titled like Allison article, such as “Why I Don’t Trust Steve Bolton’s Version of the Hosmer-Lemeshow Test for Logistic Regression.” It may still be useful in the absence of any other measures, but I’d assign it a lot less trust than some of the other code I’ve posted in the last two series. On the other hand, this also gives me an opportunity to jump into my usual spiel about my own lack of trust in hypothesis testing methods. I have a lack of confidence in confidence intervals and the like, at least as far as our use cases go, for multiple reasons. First and foremost, plugging test statistics into distributions just to derive a simple Boolean pass/fail measure sharply reduces the information content. Another critical problem is that most of the lookup tables for the tests and distributions they’re plugged into stop at just a few hundred values and are often full of gaps; furthermore, calculating the missing values yourself for millions of degrees of freedom can be prohibitively expensive. Once again, these measures were designed long before Big Data became a buzz word, in an era when most statistical tests were done against a few dozen or few hundred records at best. For that reason I have often omitted the hypothesis testing stage that accompanies many of the goodness-of-fit measures in this series, including the Hosmer-Lemeshow Test, which is normally plugged into a Chi-Squared distribution.[15] On the other hand, we can make use of the separate Chi-Squared goodness-of-fit measure, which as we shall see next week, is a versatile metric that can be adapted to assess the fit of a wide variety of probability distributions – with a much higher degree of confidence than we can assign to the Hosmer-Lemeshow Test results on logistic regression.

[1] See the Wikipedia article “Logistic Regression” at http://en.wikipedia.org/wiki/Logistic_regression

[2] See Sayad, Saed, 2014, “Logistic Regression,” published at the __SaedSayad.com__ web address http://www.saedsayad.com/logistic_regression.htm

[3] See the undated publication “Goodness of Fit in Linear Regression” retrieved from Lawrence Joseph’s course notes on Bayesian Statistics on Oct. 30, 2014, which are published at the website of the __McGill University Faculty of Medicine__. Available at the web address http://www.medicine.mcgill.ca/epidemiology/joseph/courses/EPIB-621/fit.pdf. No author is listed but I presume that Prof. Joseph wrote it.

[4] I derived some of the code for regular linear regression routine long ago from the Dummies.com webpage “How to Calculate a Regression Line” at http://www.dummies.com/how-to/content/how-to-calculate-a-regression-line.html

[5] See the Wikipedia article Hosmer-Lemeshow Test at the web address http://en.wikipedia.org/wiki/Hosmer%E2%80%93Lemeshow_test

[6] p. 160, Hosmer Jr., David W.; Lemeshow, Stanley and Sturdivan, Rodney X., 2013, __Applied Logistic Regression__. John Wiley & Sons: Hoboken, New Jersey.

[7] *IBID.*, pp. 160-163.

[8] *IBID.*, p. 161 for the second comment.

[9] I was initially confused about the assignment of the expected values (as well as the use of mean scores), but they are definitely derived from the logistic function, according to p. 2 of the undated manuscript, “Logistic Regression,” published at the Portland State University web address http://www.upa.pdx.edu/IOA/newsom/da2/ho_logistic.pdf . It is part of the instructional materials for one of Prof. Jason Newsom’s classes so I assume he wrote it, but cannot be sure.

[10] Allison, Paul, 2013, “Why I Don’t Trust the Hosmer-Lemeshow Test for Logistic Regression,” published March 5, 2013 at the __Statistical Horizons__ web address http://www.statisticalhorizons.com/hosmer-lemeshow

[11] Allison, Paul, 2014, untitled article published in March, 2014 at the __Statistical Horizons__ web address http://www.statisticalhorizons.com/2014/04

[12] pp. 965-966, 968, Hosmer, D.W.; T. Hosmer; Le Cessie, S. and Lemeshow, S., 1997, “A Comparison of Goodness-of-Fit Tests for the Logistic Regression Model,” pp. 965-980 in __Statistics in Medicine__. Vol. 16.

[13] Harrell, Frank, 2011, “Hosmer-Lemeshow vs AIC for Logistic Regression,” published Nov. 22, 2011 at the __CrossValidated__ web address http://stats.stackexchange.com/questions/18750/hosmer-lemeshow-vs-aic-for-logistic-regression

[14] See Allison, 2014.

[15] For more on the usual implementation involving the Chi-Squared distribution, see p. 977. Hosmer et al., 1997 and p. 158, Hosmer et al. 2013.

## Goodness-of-Fit Testing with SQL Server Part 4.1: R2, RMSE and Regression-Related Routines

**By Steve Bolton**

…………Throughout most of this series of amateur self-tutorials, the main topic has been and will continue to be in using SQL Server to perform goodness-of-testing on probability distributions. Don’t let the long syllables (or the alliteration practice in the title) fool you, because the underlying concept really isn’t all that hard; all these statistical tests tells us is whether the distinct counts of our data points approximate shapes like the famous bell curve, i.e. the Gaussian or “normal” distribution. While researching the topic, I found out that the term “goodness-of-fit” is also used to describe how much confidence we can assign to a particular regression line. Recall that in regression, we’re trying to learn something about the relationships between one or more variables, whereas in the case of probability distributions, we’re normally talking about univariate cases, so we’re really trying to learn something about the internal structure of a single variable (or in our case, a database column). Once again, don’t be intimidated by the big words though, because regression is really a very simple idea that every college freshman has been exposed to at some point.

…………As I explain in more detail in a post from an earlier mistutorial series, A Rickety Stairway to SQL Server Data Mining, Algorithm 2: Linear Regression, regression in its simplest form is just the graph of a line that depicts how much one variable increases or decreases as another changes in value. There are certainly complex variants of regression that could blow up someone’s brain like that poor guy in the horror film Scanners, but the fundamentals are not that taxing on the mind. Thankfully, coding simple regression lines in T-SQL is that difficult either. There are some moderate performance costs, as can be expected whenever we have to traverse a whole dataset, but the actual calculations aren’t terribly difficult to follow or debug (presuming, that is, that you understand set-based languages like SQL). That is especially true for the metrics calculated upon those regression lines, which tell us how well our data mining model might approximate the true relationships between the variables.

**The Foibles and Follies of Calculating R2 and RMSE**

………… Once we’ve incurred the cost of a traversing a dataset, there’s really little incentive not to squeeze all the benefit out of the trip by computing all of the relevant goodness-of-fit regression stats afterwards. For that reason, plus the fact that they’re not terribly challenging to explain, I’ll dispense with them all in a single procedure, beginning with R^{2}. In my earlier article Outlier Detection with SQL Server Part 7: Cook’s Distance we already dealt with the coefficient of determination (also known as R^{2}), which is simply the square of the correlation coefficient. This is a long name for the very simple process of quantifying the relationship between two variables, by computing the difference for each value of the first (usually labeled X) from the average of the second (usually labeled Y) and vice-versa, then multiplying them together. This gives us the covariance, which is then transformed into the correlation by comparing the result to the product of the two standard deviations. All we need to do is implement the same code from the Cook’s Distance article, beginning with the regression calculations, then add a new step: squaring the results of the correlation. That changes all negative values to positives and thus scales the result for easier interpretation. The higher the R^{2}, the more closely the two variables are related, and the closer to 0, the less linkage there is between their values.

…………One of the few pitfalls to watch out for is that the values are often below 1 but can exceed it in some circumstances. End users don’t need to know all of the implementation details and intermediate steps I just mentioned, but the must be able to read the result, which is highly intuitive and can be easily depicted in a visualization like a Reporting Services gauge; they don’t need to be burdened with the boring, technical internals of the computations any more than a commuter needs to give a dissertation on automotive engineering, but they should be able to interpret the result easily, and R^{2} is as mercifully simple as a gas gauge. The same is true of stats like covariance and correlation that it is built on, which costs us nothing to return to the users within the same queries.

Mean Square Error (MSE) is a little more difficult to calculate, but not much harder to interpret, since all end users need to know is that zero represents “perfect accuracy”[i] and values further from it less fitness; the only catch might be that the goodness-of-fit moves in the opposite direction as R^{2}, which might cause confusion unless a tooltip or some other handy reminder is given to end users. Root Mean Square Error (RMSE, a.k.a. Root-Mean-Square Deviation) is usually derived from it by squaring, which statisticians often do to rescale metrics so that they only have positive values. Keep in mind that SQL Server can easily calculate standard deviation through the T-SQL StDev function, which gives us a measure of how dispersed the values in a dataset are; practically all of the procedures I’ve posted in the last two tutorial series have made use of it. What RMSE does is take standard deviation to the next level, by measuring the dispersion between multiple variables instead of just one. I really can’t explain it any better than Will G. Hopkins does at his website A New View of Statistics, which I highly recommend to novices in the field of statistics like myself:

“The RMSE is a kind of generalized standard deviation. It pops up whenever you look for differences between subgroups or for other effects or relationships between variables. It’s the spread left over when you have accounted for any such relationships in your data, or (same thing) when you have fitted a statistical model to the data. Hence its other name, residual variation. I’ll say more about residuals for models, about fitting models in general, and about fitting them to data like these much later.”

…………“Here’s an example. Suppose you have heights for a group of females and males. If you analyze the data without regard to the sex of the subjects, the measure of spread you get will be the total variation. But stats programs can take into account the sex of each subject, work out the means for the boys and the girls, then derive a single SD that will do for the boys and the girls. That single SD is the RMSE. Yes, you can also work out the SDs for the boys and girls separately, but you may need a single one to calculate effect sizes. You can’t simply average the SDs.”[ii]

…………RMSE and R^{2} can be used for goodness-of-fit because they are intimately related to the differences between the actual and predicted values for a regression line; they essentially quantify how much of the standard deviation or variance can be ascribed to these errors, i.e. residuals.[iii] There are many complex variants of these stats, just as there are for regression models as a whole; for example, Wikipedia provides several alternate formulas for RMSE , including some for biased estimators, which is a topic we needn’t worry as much about given the whopping sizes of the datasets the SQL Server community works with.[iv] We have unique cases in which the standard methods of hypothesis testing are less applicable, which is why I’ve generally shied away from applying confidence intervals, significance levels and the like to the stats covered in my last two tutorial series. Such tests sharply reduce the information provided by our hard-won calculations, from float or decimal data types down to simple Boolean, yes-or-no answers that a particular value is an outlier, or that subsets of values do not fit a particular distribution; retaining that information allows us to gauge *how much* a value qualifies as an outlier or a set of them follows a distribution, or a set of columns follows a regression line.

…………For that reason, I won’t get into the a discussion of the F-Tests often performed on our last regression measure, Lack-of-Fit Sum-of-Squares, particularly in connection with Analysis of Variance (ANOVA). The core concepts with this measure are only slightly more advanced than with RMSE and R^{2}. Once again, we’re essentially slicing up the residuals of the regression line in a way that separates the portion that can be ascribed to the inaccuracy of the model, just through alternate means. It is important here to note that with all three measures, the terms “error” and “residual” are often used interchangeably, although there is a strictly definable difference between them: a residual quantifies the difference between actual and predicted values, while errors refer to actual values and “the (unobservable) true function value.”[v] Despite this subtle yet distinguishable difference, the two terms are often used inappropriately even by experts, to the point that novices like myself can’t always discern which of the two is under discussion. Further partitioning of the residuals and errors occurs in the internal calculations of Lack-of-Fit Sum-of-Squares, but I can’t comment at length on the differences between such constituent components as Residual Sum-of-Squares and Sum-of-Squares for Pure Error, except to recommend the explanation by Mukesh Mahadeo, a frequent contributor on statistical concepts at Yahoo! Answers:

“For certain designs with replicates at the levels of the predictor variables, the residual sum of squares can be further partitioned into meaningful parts which are relevant for testing hypotheses. Specifically, the residual sums of squares can be partitioned into lack-of-fit and pure-error components. This involves determining the part of the residual sum of squares that can be predicted by including additional terms for the predictor variables in the model (for example, higher-order polynomial or interaction terms), and the part of the residual sum of squares that cannot be predicted by any additional terms (i.e., the sum of squares for pure error). A test of lack-of-fit for the model without the additional terms can then be performed, using the mean square pure error as the error term. This provides a more sensitive test of model fit, because the effects of the additional higher-order terms is removed from the error.” It is important here to note that with all three measures, the terms “error” and “residual” are often used interchangeably, although there is a strictly definable difference between them: a residual quantifies the difference between actual and predicted values, while errors refer to actual values and “the (unobservable) true function value” chosen from a random sample.”[vi]

…………The important thing is that the code for the Lack-of-Fit Sum-of-Squares formulas[vii] gets the job done. Of course it always helps if a data mining programmer can write a dissertation on the logic and math of the equations they’re working with, but ordinarily, that’s best left to mathematicians; their assignment is analogous to that of an automotive engineers, while our role is that of a garage mechanic, whose main responsibility is to make sure that the car runs, one way or another. If the owner can drive it away without the engine stalling, then mission accomplished.

…………We only need to add two elements to make the Lack-of-Fit Sum-of-Squares code below useful to end users, one of which is to simply interpret higher numbers as greater lack of fit. The second is to define what that is, since it represents the opposite of goodness-of-fit and therefore can cause the same kind of confusion in direction that’s possible when providing RMSE and R^{2} side-by-side. The two terms are sometimes used interchangeably, but in a more specific sense they’re actually polar opposites, in which measures that rise as fit improves can be termed goodness-of-fit and those that rise as the fit of a model declines as lack-of-fit. CrossValidated forum contributor Nick Cox provided the best explanation of the difference I’ve seen yet to date: “Another example comes from linear regression. Here two among several possible figures of merit are the coefficient of determination R^{2} and the root mean square error (RMSE), or (loosely) the standard deviation of residuals. R^{2} could be described as a measure of goodness of fit in both weak and strict senses: it measures how good the fit is of the regression predictions to data for the response and the better the fit, the higher it is. The RMSE is a measure of lack of fit in so far as it increases with the badness of fit, but many would be happy with calling it a measure of goodness of fit in the weak or broad sense.”[viii] If end users need a little more detail in what the stat means, that constitutes the most succinct explanation I’ve yet seen. Ordinarily, however, they only need to that the return values for the @LackOfFitSumOfSquares variable below will rise as the accuracy of their model gets worse and vice-versa.

** Figure 1: T-SQL Code for the Regression Goodness-of-Fit Tests**CREATE PROCEDURE [Calculations].[GoodnessOfFitRegressionTestSP]

@DatabaseName as nvarchar(128) = NULL, @SchemaName as nvarchar(128), @TableName as nvarchar(128),@ColumnName1 AS nvarchar(128),@ColumnName2 AS nvarchar(128), @DecimalPrecision AS nvarchar(50)

AS

DECLARE @SchemaAndTableName nvarchar(400),@SQLString nvarchar(max)

SET @SchemaAndTableName = @DatabaseName + ‘.’ + @SchemaName + ‘.’ + @TableName

SET @SQLString = ‘DECLARE @MeanX decimal(‘ + @DecimalPrecision + ‘),@MeanY decimal(‘ + @DecimalPrecision + ‘), @StDevX decimal(‘ + @DecimalPrecision + ‘),

@StDevY decimal(‘ + @DecimalPrecision + ‘), @Count decimal(‘ + @DecimalPrecision + ‘), @Correlation decimal(‘ + @DecimalPrecision + ‘),

@Covariance decimal(‘ + @DecimalPrecision + ‘), @Slope decimal(‘ + @DecimalPrecision + ‘), @Intercept decimal(‘ + @DecimalPrecision + ‘),

@MeanSquaredError decimal(‘ + @DecimalPrecision + ‘), @LackOfFitSumOfSquares decimal(‘ + @DecimalPrecision + ‘)

— STEP #1 – RETRIEVE THE GLOBAL AGGREGATES NEEDED FOR OTHER CALCULATIONS

SELECT @Count=Count(CAST(‘ + @ColumnName1 + ‘ AS Decimal(‘ + @DecimalPrecision + ‘))), @MeanX = Avg(CAST(‘ + @ColumnName1 + ‘ AS Decimal(‘ + @DecimalPrecision + ‘))),

@MeanY = Avg(CAST(‘ + @ColumnName2 + ‘ AS Decimal(‘ + @DecimalPrecision + ‘))),

@StDevX = StDev(CAST(‘ + @ColumnName1 + ‘ AS Decimal(‘ + @DecimalPrecision + ‘))), @StDevY = StDev(CAST(‘ + @ColumnName2 + ‘ AS Decimal(‘ + @DecimalPrecision + ‘)))

FROM ‘ + @SchemaAndTableName + ‘

WHERE ‘ + @ColumnName1 + ‘ IS NOT NULL AND ‘ + @ColumnName2 + ‘ IS NOT NULL

— STEP #2 – CALCULATE THE CORRELATION (BY FIRST GETTING THE COVARIANCE)

SELECT @Covariance = SUM((‘ + @ColumnName1 + ‘ – @MeanX) * (‘ + @ColumnName2 + ‘ – @MeanY)) / (@Count – 1

FROM ‘ + @SchemaAndTableName + ‘

WHERE ‘ + @ColumnName1 + ‘ IS NOT NULL AND ‘ + @ColumnName2 + ‘ IS NOT NULL

— once wee got the covariance, its trivial to calculate the correlation

SELECT @Correlation = @Covariance / (@StDevX * @StDevY)

— STEP #3 – CALCULATE THE SLOPE AND INTERCEPT

SELECT @Slope = @Correlation * (@StDevY / @StDevX)

SELECT @Intercept = @MeanY – (@Slope * @MeanX)

— STEP #4 – CALCULATE THE MEAN SQUARED ERROR AND LACK OF FIT SUM OF SQUARES TOGETHER

SELECT @MeanSquaredError = SUM(Power((PredictedValue – ‘ + @ColumnName2 + ‘), 2)) * (1 / @Count), @LackOfFitSumOfSquares = SUM(LackofFitInput)

FROM (SELECT ‘ + @ColumnName1 + ‘, ‘ + @ColumnName2 + ‘, PredictedValue, Count(CAST(‘ + @ColumnName2 + ‘ AS Decimal(‘ + @DecimalPrecision + ‘))) OVER (PARTITION BY ‘ +

@ColumnName1 + ‘ ORDER BY ‘ + @ColumnName1 + ‘) * (Power(Avg(CAST(‘ + @ColumnName2 + ‘ AS Decimal(‘ + @DecimalPrecision + ‘))) OVER (PARTITION BY ‘ +

@ColumnName1 + ‘ ORDER BY ‘ + @ColumnName1 + ‘) – PredictedValue, 2)) AS LackofFitInput

FROM (SELECT ‘ + @ColumnName1 + ‘, ‘ + @ColumnName2 + ‘, (‘ + @ColumnName1 + ‘ * @Slope) + @Intercept AS PredictedValue

FROM ‘ + @SchemaAndTableName + ‘

WHERE ‘ + @ColumnName1 + ‘ IS NOT NULL AND ‘ + @ColumnName2 + ‘ IS NOT NULL) AS T1) AS T2

SELECT @MeanSquaredError AS MeanSquaredError, Power(@MeanSquaredError, 0.5) AS RMSE, @LackOfFitSumOfSquares AS LackOfFitSumOfSquares,

Power(@Correlation, 2) * 100 AS R2, @Covariance AS Covariance, @Correlation AS Correlation, @Slope AS Slope, @Intercept AS Intercept‘

–SELECT @SQLString — uncomment this to debug dynamic SQL errors

EXEC (@SQLString)

…………Most of the code for this procedure is identical to that of the aforementioned Cook’s Distance procedure, which requires regression, covariance and correlation computations.[ix] I won’t rehash here how to derive the slope, intercept and other such constituent calculations, for the sake of brevity. The really striking thing is how few lines of code it takes to derive all of these incredibly useful stats in one fell swoop, which we can thank the powerful T-SQL windowing functions introduced in SQL Server 2012 for. It is noteworthy though that the outer query in Step 4 is necessary because of T-SQL error 4109, “Windowed functions cannot be used in the context of another windowed function or aggregate,” which prevents us from performing the calculations in one big gulp and plug them into the SUM. Besides a few departures like that, the procedure closely follows the format used in the last two tutorial series, in which I start with a common set of parameters that allow users to perform the test on any table in any database they have sufficient access to. The first two lines of code in the procedure body help make this happen, while the rest is dynamic SQL that begins with declarations of the constants, stats and variables we need to make the procedure perform its calculations. As usual, the @DecimalPrecision parameter is provided to help users set their own precision and scale (to avoid errors like arithmetic overflows and still accommodate columns of all sizes) and the SELECT @SQLString near the end, which can be uncommented for debugging purposes.

** Figure 2: Sample Results from the Regression Goodness-of-Fit Test on the Duchennes Dataset** (click to enlarge)

EXEC [Calculations].[GoodnessOfFitRegressionTestSP]

@DatabaseName = N’DataMiningProjects‘,

@SchemaName = N’Health‘,

@TableName = N’DuchennesTable‘,

@ColumnName1 = N’Hemopexin‘,

@ColumnName2 = N’LactateDehydrogenase‘,

@DecimalPrecision = N’38,21′

** Figure 3: Sample Results from the Regression Goodness-of-Fit Test on the Higgs Boson Dataset** (click to enlarge)

EXEC [Calculations].[GoodnessOfFitRegressionTestSP]

@DatabaseName = N’DataMiningProjects‘,

@SchemaName = N’Physics‘,

@TableName = N’HiggsBosonTable‘,

@ColumnName1 = N’Column1′,

@ColumnName2 = N’Column2′,

@DecimalPrecision = N’38,29′

…………I’ve made it standard practice in the last two tutorial series to test my procedures first on the 209 rows of a dataset on the Duchennes form of muscular dystrophy provided by Vanderbilt University’s Department of Biostatistics, then on the first couple of float columns in the 11-million-row Higgs Boson Dataset, which is made publicly available by the University of California at Irvine’s Machine Learning Repository__.__ As depicted in Figure 2, the regression line for the Hemopexin protein and Lactate Dehydrogenase enzyme data in the Duchennes table fits poorly, as indicated by the MeanSquaredError, RMSE, LackOfFitSumOfSquares and R^{2} results. The graphic below it demonstrates clearly that the first two float columns in the Higgs Boson dataset don’t fit well on a regression line either. Neither is not surprising, given that the correlation coefficients for both are near zero, which indicates a lack of relationship between the variables (a strongly negative value would indicate a strongly inverse relationship, whereas positive values would do the opposite).

…………What was truly surprising is how well the latter query performed on the Higgs Boson table, which takes up nearly 6 gigabytes in the DataMiningProjects database I assembled from various practice datasets. It only took 2:04 to execute on my clunker of a development machine, which hardly qualifies as a real database server. The execution plan in Figure 4 may provide clues as to why: most of the costs come in terms of three non-clustered index seeks, which is normally what we want to see. Nor are there any expensive Sort operators. Most of the others are parallelism and Compute Scalar operators that come at next to no cost. In last week’s article, I mentioned that it really doesn’t hurt to calculate both the Jarque-Bera and D’Agostino-Pearson Omnibus Test together, since the costs are incurred almost exclusively in traversing a whole table to derive the constituent skewness and kurtosis values. In the same way, it doesn’t cost us much to calculate the MSE, RMSE and Lack-of-Fit Sum-of-Squares together in Step 4, once we’ve already gone to the trouble of traversing the whole table by calculating one of them. It also costs us just a single operation to derive the R^{2} once we’ve done the regression and have the correlation, and nothing at all to return the covariance, correlation, slope and intercept if we’re going to go to the trouble of getting the R^{2}. The execution plan essentially bears this out, since the Index Seeks perform almost all the work.

** Figure 4: Execution Plan for the Regression Goodness-of-Fit Test on the Higgs Boson Dataset** (click to enlarge)

…………There are of course limitations and drawbacks with this procedure and the formulas it is meant to reflect. It is always possible that I’m not implementing them accurately, since I’m writing this in order to learn the topic, not because I know it already; as usual, my sample code is more of a suggested means of implementation, not a well-tested workhorse ready to go into a production environment tomorrow. I still lack the level of understanding I wish I had of the internal mechanics of the equations; in fact, I’m still having trouble wrapping my head around such concepts as the difference between the coefficient of determination and variance explained, which seem to overlap quite closely.[x] Moreover, the MSE can place too much weight on outliers for some use cases, even when implemented accurately.[xi] The RMSE also can’t be used to compare regressions between different sets of columns, “as it is scale-dependent.”[xii]

…………The values for the some of the stats returned above also suffer from a different scaling issue, in that they tend to increase too quickly as the number of records accumulates. They’re not in the same league as the truly astronomical values I’ve seen with other stats I’ve surveyed in the last two tutorial series, but the fact that the Lack-of-Fit Sum-of-Squares reaches eight digits above the decimal place is bothersome. That’s about the upper limit of what end users can read before they have to start counting the decimal places by hand, which rapidly degrades the legibility of the statistic.

**Traditional Metrics and Tests in the “Big Data” Era**

…………That just adds to my growing conviction that the vastly larger datasets in use today may require new statistical measures or rescaling of tried-and-true ones in order to accommodate their sheer size. We shouldn’t have to sacrifice the main strength of Big Data[xiii], which is the fact that we can now quickly derive very detailed descriptions of very large datasets, just to use these methods. As we have seen throughout the last two tutorial series, this issue has consistently thrown a monkey wrench into many of the established statistical procedures, which were designed decades or even centuries ago with datasets of a few dozen records in mind, not several million. We’ve seen it in the exponent and factorial operations required to derive many of well-established measures, which simply can’t be performed at all on values of more than a few hundred without leading to disastrous arithmetic overflows and loss of precision. We’ve seen it again this week and the last, in which the high record counts made the final statistics a little less legible.

…………We’ve also seen it in some of the hypothesis testing methods, which require lookup tables that often only go up to record counts of a few hundred at best. That’s a problem that will rear its head again in a few weeks when I try, and fail, to implement the popular Shapiro-Wilk Test of normality, which supposedly has excellent statistical power yet is only usable up to about 50 records.[xiv] Such goodness-of-fit tests for probability distributions can also be applied to regression, to determine if the residuals are distributed in a bell curve; cases in point include the histograms discussed in Outlier Detection with SQL Server, Part 6.1: Visual Outlier Detsection with Reporting Services and the Chi-Squared Test, which I’ll cover in a few weeks.[xv] Rather than applying these tests to regression in this segment of the series, I’ll introduce the ones I haven’t covered yet separately. For the sake of simplicity, I won’t delve into complicated topics like lack-of-fit testing on variants like multiple regression at this point. It would be useful, however, to finish off this segment of the series next week by introducing the Hosmer–Lemeshow Test, which can be applied to Logistic Regression, which is one of the most popular alternative regression algorithms. As discussed in A Rickety Stairway to SQL Server Data Mining, Algorithm 4: Logistic Regression, a logistic function is applied to produce an S-shaped curve that bounds outcomes between 0 and 1, which fits many user scenarios. Thankfully, the code will be much simpler to implement now that we’ve got this week’s T-SQL and concepts out of the way, so it should make for an easier read.

[i] See the __Wikipedia__ page “Mean Squared Error” at http://en.wikipedia.org/wiki/Mean_squared_error

[ii] Hopkins, Will G., 2001, “Root Mean-Square Error (RMSE),” published at the __A New View of Statistics__ web address http://www.sportsci.org/resource/stats/rmse.html

[iii] For a more in depth explanation of the interrelationships between these stats and why they operate as they do, see Hopkins, Will G., 2001, “Models: Important Details,” published at the __A New View of Statistics__ web address http://www.sportsci.org/resource/stats/modelsdetail.html#residuals

[iv] See the __Wikipedia__ page “Root Mean Square Deviation” at http://en.wikipedia.org/wiki/Root-mean-square_deviation

[v] See the succinct explanation at the __Wikipedia__ page “Errors and Residuals in Statistics” at http://en.wikipedia.org/wiki/Errors_and_residuals_in_statistics

[vi] Mukesh Mahadeo’s reply to the thread “What is Mean by Lack of Fit in Least Square Method?” at the Yahoo! Answers web address https://in.answers.yahoo.com/question/index?qid=20100401082012AAf0yXg

[vii] Which I derived from the formulas at the __Wikipedia__ webpage “Lack-of-Fit Sum of Squares” at http://en.wikipedia.org/wiki/Lack-of-fit_sum_of_squares

[viii] See Cox, Nick, 2013, reply to the __CrossValidated__ thread “Are Goodness of Fit and Lack of Fit the Same?” on Aug. 2, 2013. Available at the web address http://stats.stackexchange.com/questions/66311/are-goodness-of-fit-and-lack-of-fit-the-same

[ix] As mentioned in that article, the originals sources for the internal calculations included Hopkins, Will G., 2001, “Correlation Coefficient,” published at the __A New View of Statistics__ web address http://www.sportsci.org/resource/stats/correl.html; the Dummies.Com webpage “How to Calculate a Regression Line” at http://www.dummies.com/how-to/content/how-to-calculate-a-regression-line.html, the __Wikipedia__ page “Mean Squared Error” at http://en.wikipedia.org/wiki/Mean_squared_error and the __Wikipedia__ page “Lack-of-Fit Sum of Squares” at http://en.wikipedia.org/wiki/Lack-of-fit_sum_of_squares.

[x] I’ve seen competing equations in the literature, one based on residual sum-of-squares calculations and the other on squaring of the correlation coefficient. The wording often leads be to believe that they arrive at the same results through different methods, but I’m not yet certain of this.

[xi] See the __Wikipedia__ page “Mean Squared Error” at http://en.wikipedia.org/wiki/Mean_squared_error

[xii] http://en.wikipedia.org/wiki/Root-mean-square_deviation “Root-Mean-Square Deviation”

[xiii] It’s a buzzword, I know, but it’s the most succinct term I can use here.

[xiv] Some sources say up to a couple hundred records, and I’m not familiar enough with the topic to discern which limit applies in which cases. It’s a moot point, however, because we need such tests to work on datasets of several hundred million rows.

[xv] See the undated publication “Goodness of Fit in Linear Regression” retrieved from Lawrence Joseph’s course notes on Bayesian Statistics on Oct. 30, 2014, which are published at the website of the __McGill University Faculty of Medicine__. Available at the web address http://www.medicine.mcgill.ca/epidemiology/joseph/courses/EPIB-621/fit.pdf. No author is listed but I presume that Prof. Joseph wrote it. This is such a good source of information for the topic of this article that I couldn’t neglect to work in a mention of it.

## Goodness-of-Fit Testing with SQL Server, part 3.2: D’Agostino’s K-Squared Test

**By Steve Bolton**

…………In the last edition of this amateur series of self-tutorials on goodness-of-fit testing with SQL Server, we discussed the Jarque-Bera Test, a measure that unfortunately doesn’t scale well on datasets of the size that DBAs are accustomed to using. The problem is not with the usefulness of the statistics that it is composed of, because skewness and kurtosis are easy to interpret and valuable in their own right as measures of shape and for purposes of outlier detection. Usually scaling problems signify performance issues, but the resource consumption and execution of the Jarque-Bera Test aren’t bad by any means; the issue is that the statistic itself increases to ungodly numbers that are difficult to interpret, precisely because it was designed with smaller datasets in mind. In this week’s installment, I’ll provide an alternative measure that also builds upon skewness and kurtosis and can be calculated in almost exactly the same amount of time as Jarque-Bera, but without the cumbersome scaling issue.

…………The improved interpretability of D’Agostino’s K-Squared Test comes at the cost of more complicated internal calculations, which turn out to be trivial in comparison to the main computational costs, which consist almost exclusively of index seeks and sorts in the execution plan issued by the SQL Server query optimizer. This added complexity is only a problem if one wants to check to see what’s going on under the hood in these calculations, which is rarely necessary in most use cases after the code has been validated. As I pointed out at every opportunity in my earlier mistutorial series A Rickety Stairway to SQL Server Data Mining, most end users have about as much need to understand how such statistics are derived as the average driver needs to know the engineering details of their car; in many cases it is a mistake to overload them with superfluous information like incomprehensible math equations. That is why I haven’t posted any such formulas in the last few tutorial series I’ve posted here. End users should understand enough to interpret the results in light of their domain knowledge, just as the average rush hour commuter needs to know how to read a gas gauge and transmission fluid stick properly. Those who write the computer code that implement these stats obviously need to grasp the inner workings at a much deeper level, but not to the point that they’re designing their own formulas; data mining programmers essentially occupy the middle zone halfway between end users and mathematicians, in the same way that garage mechanics reside in the niche between drivers and automotive engineers. It is my goal to learn the skills necessary to serve at this midpoint, but as I usually point out, I haven’t reached it yet; I hope to use blog posts of this kind to familiarize myself with these topics better, not because I already know the material well. And that is why I cannot explain in great detail *why* D’Agostino’s K-Squared Test (a.k.a. the D’Agostino-Pearson Omnibus Test) works as it does. Like a typical mechanic, I was able to get it running sufficiently well that it returns the expected results in a potentially reliable way, but I don’t have sufficient skill to comment on why it was designed as it was. Nevertheless I did pick up a few things while reading sources like D’Agostino, et al.’s 1990 paper in *The American Statistician*[1] and as usual, the Wikipedia[2] article on the topic, which may not be a professional source but qualifies as the most convenient repository for every math formula under the sun.

…………As you can gather from the T-SQL code in Figure 1, the underlying equations I found in former source are fairly complicated and involve the derivation of several intermediate statistics in between the sample skewness and kurtosis and the final metric. Although the latter source is only an introduction to the topic, it did provide some invaluable insights into the aim of these calculations, albeit without explaining why those particular calculations satisfied those aims. Apparently, the @Z1 and @Z2 measures are meant to bring the skewness and kurtosis in line with the standard normal distribution to solve their “frustratingly slow” approach to the distribution limit, which is a scaling issue of sorts.[3] The SELECT statement towards the end that assigns the final value to the @K2Test combines these internal calculations into a single result so that the skewness and kurtosis can be measured together, in what technically known as an “omnibus test.”[4] After all these esoteric calculations, that final assignment is actually quite simple. I’m sure the nitty gritty details are in the original academic articles published by statisticians Ralph D’Agostino and E.S. Pearson in the early ‘70s, but I couldn’t find any publicly accessible copies; judging from the difficulty I had in following the 1990 paper, much of it would still have been over my head anyways. The important thing to know is that I was able to follow the equations sufficiently well that the code below returns the correct results for the examples provided by D’Agostino, et al.

** Figure 1: T-SQL Code for the D’Agostino-Pearson K-Squared Test**CREATE PROCEDURE [Calculations].[NormalityTestDAgostinosKSquaredSP]

@DatabaseName as nvarchar(128) = NULL, @SchemaName as nvarchar(128), @TableName as nvarchar(128),@ColumnName AS nvarchar(128), @PrimaryKeyName as nvarchar(400)

AS

DECLARE @SchemaAndTableName nvarchar(400),@SQLString nvarchar(max)

SET @SchemaAndTableName = @DatabaseName + ‘.’ + @SchemaName + ‘.’ + @TableName

SET @SQLString = ‘DECLARE @Mean float, @StDev float, @Count as bigint, @Alpha decimal(38,37),

@One float = 1, @Two float = 2, @Three float = 3, @Four float = 4, @Five float = 5,

@Six float = 6, @Seven float = 7, @Eight float = 8, @Nine float = 9,

@TwentyFour float = 24, @TwentySeven float = 27, @Seventy float = 70,

@RecpiprocalOfNSampleSize float, @DifferenceFromSampleMeanSquared float, @DifferenceFromSampleMeanCubed

float, @DifferenceFromSampleMeanFourthPower float, @SampleSkewness float, @SampleKurtosis float,

@Y float, @B2 float, @WSquared float, @Z1 float, @Z2 float, @Sigma float, @E float, @VarianceKurtosis float,

@StandardizedKurtosis float, @ThirdStandardizedMomentOfKurtosis float, @A float, @K2Test float

SELECT @Count = Count(‘ + @ColumnName + ‘), @Mean = Avg(CAST(‘ + @ColumnName + ‘ AS float)), @StDev = StDev(‘ + @ColumnName + ‘)

FROM ‘ + @SchemaAndTableName + ‘

WHERE ‘ + @ColumnName + ‘ IS NOT NULL

SELECT @RecpiprocalOfNSampleSize = @One / @Count

DECLARE @CountPlusOne float = @Count + @One, @CountPlusThree float = @Count + @Three, @CountPlusFive float = @Count + @Five,

@CountPlusSeven float = @Count + @Seven, @CountPlusNine float = @Count + @Nine, @CountMinusTwo float = @Count – @Two,

@CountMinusThree float = @Count – @Three

DECLARE @CountPlusOneTimesCountPlusThree float = (@Count + @One) * (@Count + @Three)

SELECT @DifferenceFromSampleMeanSquared = SUM(Power(DifferenceFromSampleMean, 2)) OVER (ORDER BY ‘ + @PrimaryKeyName + ‘ ASC),

@DifferenceFromSampleMeanCubed = SUM(Power(DifferenceFromSampleMean, 3)) OVER (ORDER BY ‘ + @PrimaryKeyName + ‘ ASC),

@DifferenceFromSampleMeanFourthPower =SUM(Power(DifferenceFromSampleMean, 4)) OVER (ORDER BY ‘ + @PrimaryKeyName + ‘ ASC)

FROM (SELECT ‘ + @PrimaryKeyName + ‘, CAST(‘ + @ColumnName + ‘ AS float) – @Mean as DifferenceFromSampleMean — make a single pass across the table?

FROM ‘ + @SchemaAndTableName + ‘

WHERE ‘ + @ColumnName + ‘ IS NOT NULL) AS T1

SELECT @SampleSkewness = (@RecpiprocalOfNSampleSize * @DifferenceFromSampleMeanCubed) /(Power((@RecpiprocalOfNSampleSize * @DifferenceFromSampleMeanSquared), 1.5))

SELECT @SampleKurtosis = (@RecpiprocalOfNSampleSize * @DifferenceFromSampleMeanFourthPower) /(Power((@RecpiprocalOfNSampleSize * @DifferenceFromSampleMeanSquared), 2))

— perform operations on the Skewness

SELECT @Y = @SampleSkewness * Power(((@CountPlusOneTimesCountPlusThree) / (@CountMinusTwo * @Six)), 0.5) — do the brackets signify multiplication? ****

SELECT @B2 = (@CountPlusOneTimesCountPlusThree * (@Three * ((Power(@Count, 2) + (@TwentySeven * @Count)) -@Seventy))) / (@CountMinusTwo * @CountPlusFive * (@Count + @Seven) * (@Count + @Nine))

SELECT @WSquared = Power(@Two * (@B2 – @One), 0.5) – @One

SELECT @Alpha = Power(Abs(@Two / (@WSquared – @One)), 0.5)

SELECT @Sigma = @One / (Power(Abs((Log(Abs(Power(@WSquared, 0.5))))), 0.5))

— Im not sure if this sigma is related to StDev or not

SELECT @Z1 = @Sigma * Log((@Y / @Alpha) + Power((Power((@Y / @Alpha), 2) + @One), 0.5))

SET @SQLString = @SQLString + ‘— perform operations on the kurtosis

SELECT @E = (@Three * (@Count – @One)) / @CountPlusOne — according to the paper, this is the mean for the kurtosis

SELECT @VarianceKurtosis = (@TwentyFour * @Count * @CountMinusTwo * @CountMinusThree) / (Power(@CountPlusOne, 2) * @CountPlusThree * @CountPlusFive

SELECT @StandardizedKurtosis = (@SampleKurtosis – @E) / Power(@VarianceKurtosis, 0.5)

SELECT @ThirdStandardizedMomentOfKurtosis = ((@Six * ((Power(@Count, 2) – (@Five * @Count)) + @Two)) / (@CountPlusSeven * @CountPlusNine)) *

Power((@Six * @CountPlusThree * @CountPlusFive) / (@Count * @CountMinusTwo * @CountMinusThree), 0.5)

SELECT @A = @Six + ((@Eight / @ThirdStandardizedMomentOfKurtosis) * ((@Two / @ThirdStandardizedMomentOfKurtosis) + Power(@One + (@Four / @ThirdStandardizedMomentOfKurtosis), 0.5)))

SELECT @Z2 = ((@One – (@Two / (@Nine * @A))) – Power((@One – (@Two / @A)) / (@One + (@StandardizedKurtosis * Power((@Two / (@A – @Four)), 0.5))), (@One / @Three))) / Power((@Two / (@Nine *

@A)), 0.5)

SELECT @K2Test = Power(@Z1, 2) + Power(@Z2, 2)

— uncomment this to debug the internal calculations SELECT @Alpha, @Sigma, @Y AS T, @B2 AS B2, @WSquared AS WSquared, @E AS E, @VarianceKurtosis AS VarianceKurtosis, @StandardizedKurtosis AS StandardizedKurtosis, @ThirdStandardizedMomentOfKurtosis

AS ThirdStandardizedmMomentOfKurtosis, @A AS A, @DifferenceFromSampleMeanSquared, @RecpiprocalOfNSampleSize, @DifferenceFromSampleMeanSquared, @DifferenceFromSampleMeanCubed, @DifferenceFromSampleMeanFourthPower

SELECT @K2Test AS KSquaredTest, @SampleSkewness AS SampleSkewness, @SampleKurtosis AS SampleKurtosis, @Z1 as Z1, @Z2 as Z2, @Mean AS Mean, @StDev AS StDev‘

–SELECT @SQLString — uncomment this

to debug dynamic SQL errors

EXEC (@SQLString)

…………A few explanations of why the code was written as it was are in order. The five parameters allow users to run the test on any table or view in any database they have sufficient access to, while the first declaration assists in implementing this. The dynamic SQL differs from some of the procedures I’ve posted in the past by the sheer number of reciprocals and constants that need to be precalculated early on, in order to avoid performing the same operations over again. The length of the dynamic SQL also necessitates the use of the second SET statement on the @SQLString, since such strings can’t be assigned in one big gulp after a certain character limit, but can thankfully be added together easily; keep in mind that if this step is left out, the dynamic SQL may be unexpectedly truncated. This procedure also differs in the sense that I’ve chosen to use floats rather than the decimal data type, for the same reason I did in the article on Jarque-Bera: some of the internal calculations are performed on very small fractions, particularly the exponents and reciprocals, which SQL Server will sometimes convert to zeros in the case of the decimal data type. Secondly, I substituted named variables for many of the constants, such as @CountPlusOne, which are declared near the beginning of the dynamic SQL. This is due to the fact that SQL Server sometimes truncates the decimal points out of certain operations on integers; I haven’t yet determined precisely what causes this, although using integers as dividends seems to trigger it most often. Consider this an experiment in discerning whether using named variables is more legible than using countless CAST operations, some of which would have to be buried deep within subqueries. By all means, feel free to copy and paste the constants back in if you know the answers to those questions. As with the Jarque-Bera Test, I’m not certain whether this K^{2} Test would retain its validity if we substituted the simpler calculations for the full population for the sample skewness and sample kurtosis, but those stats would be preferable if this was case. As usual, I’ve provided a couple of lines of debugging code that can be uncommented if you need to adjust or verify the code, both near the end of the procedure. Be aware that due my difficulty in reading the original equations, @StandardizedKurtosis may need to serve as the root instead of 0.5 (the square root) in my calculation for @Z2, and also in the calculation for the third standardized moment – but I doubt it, since this would throw off the calculations quite a bit. I also added several ABS function calls to avoid Invalid Floating Point Operation errors on T-SQL functions like POWER that can’t handle imaginary roots, but this departure doesn’t seem to affect the final results.

…………The bottom line is that I tested this against the same stem-and-leaf plot cholesterol data from the Framingham Heart Study that D’Agostino, et al. assessed their equations with and get pretty much the same results.[5] They got 1.02 and 4.58 for their sample skewness and kurtosis and 14.75 for the final K2 test statistic, which was derived from Z1 and Z2 values of 3.14 and 2.21 respectively; my results were 1.0235482596477, 4.57738778764656 and 14.7958406166879 for the sample skewness, kurtosis and test statistic respectively, which were derived from values of 3.13939241925962 and 2.22262364213628 for the intermediate Z1 and Z2 stats. It is possible that the slight differences are due to undiscovered errors in my code, but some departure is expected given that I used variables and constants of much higher precision, which would lead to rounding discrepancies. I then tested it against two datasets I’ve been using throughout the last two tutorial series, one on the Duchennes form of muscular dystrophy made publicly available by the Vanderbilt University’s Department of Biostatistics and another on the Higgs Boson that can be downloaded from the University of California at Irvine’s Machine Learning Repository. I derived the first resultset in Figure 2 from the query above it and the following two from queries like it on the first two float columns in the Higgs Boson dataset. Note that test statistic is much larger for the Higgs Boson results – mainly because that table has 11 million rows, compared to just 209 for the Duchennes table – but isn’t quite as inflated as in some of the Jarque-Bera results. One of them has seven digits to the left of the decimal point, which I’d wager is near the limit of numerical legibility for most people. Once you get past that, most people have trouble comparing numbers by eye without resorting to the ungainly strategy of counting digits and mentally interpolating commas between every set of three.

** Figure 2: Sample Results from the Duchennes and Higgs Boson Datasets**EXEC [Calculations].[NormalityTestDAgostinosKSquaredSP]

@DatabaseName = N’DataMiningProjects’,

@SchemaName = N’Health’,

@TableName = N’DuchennesTable’

@ColumnName = N’LactateDehydrogenase’,

@PrimaryKeyName = N’ID’

…………The good news is that the procedure performed unexpectedly well; in fact, the first trial run took 3:43 on the first float column in the Higgs Boson table, i.e. exactly the same execution time as for the Jarque-Bera Test in the last tutorial. After all of those arcane calculations you’d expect to see a rather messy execution plan, but as Figure 3 shows, this procedure isn’t all that hard to follow. The main costs were incurred by two non-clustered index seeks and a sort. This is because almost all of the work occurs in retrieving the values and performing simple calculations for each row, not in the fancy math that occurs after they’ve been summarized, which turns out to have an inconsequential computation cost. The main burden of these calculations falls exactly where we want it: on the brains of the coders and testers, not on the end users, to whom the procedure will be a well-oiled black box after error-checking, validation and SQL injection protection code are added.

__Figure 3: Execution Plan for the D’Agostino-Pearson Omnibus Test
__

…………There’s more good news: since most of the performance cost occurs in the same seeks, sorts and initial calculations of skewness and kurtosis that the Jarque-Bera Test uses, there’s no real penalty incurred by computing it together with the D’Agostino-Pearson Omnibus Test. If we had to sacrifice one, however, it would be the former, since I have heard anecdotes about statisticians preferring the latter, but not the other way around. One of the reasons the K^{2} is favored is because of numerous studies (including some written by D’Agostino Sr.) demonstrating that it has better statistical power, which is a numerical measure of how often the actual effects of a variable are detected by a particular test.[6] This metric is applicable to large sample sizes, unlike the Shapiro-Wilk Test[7], and can be used for both one-sided and two-sided hypothesis tests.[8] As I learn more about the field I’m shying more and more away from hypothesis tests, on the grounds that the small sample sizes and narrow focus aren’t suited to typical SQL Server user scenarios, like exploratory data mining on large datasets. Nevertheless, it doesn’t hurt to know that the D’Agostino-Pearson Test is flexible enough to be used for these purposes. Moreover, it can apparently be applied to goodness-of-fit testing on datasets that don’t follow the Gaussian or “normal” distribution, i.e. the bell curve, which isn’t true of many of them. In fact, the authors of that 1990 study go so far as to say that “The extensive power studies just mentioned have also demonstrated convincingly that the old warhorses, the chi-squared test and the Kolmogorov test (1933), have poor power properties and should not be used when testing for normality.”[9] This is by no means the first time I’ve heard such sentiments expressed by statisticians about these two rival metrics, which still seem to be implemented far more frequently in practice despite such advice.

…………Later on in this series I’ll explain how to implement both the Chi-Squared Test and Kolmogorov-Smirnov Test in T-SQL, but I’m going to skip over a couple of other measures related to skewness and kurtosis, at least for the time being. One of these is Mardia’s multivariate versions of skewness and kurtosis, which I will save for some far-flung future when grappling with the complexity added by dealing with multiple columns isn’t too overwhelming; perhaps someday I’ll tack a segment onto the end of this series for multivariate goodness-of-fit tests, like the Cox-Small Test and Smith and Jain’s Test.[10] I’ve organized this series in the order of how difficult the concepts and underlying code are, which brings us to the topic of regression-related methods of goodness-of-fit testing. As explained in the last article, skewness and kurtosis really aren’t that hard to grasp intuitively, and as I dealt with in A Rickety Stairway to SQL Server Data Mining, Algorithm 2: Linear Regression, the core concepts behind regression aren’t that difficult either. The variants of regression can get quite complicated, but drawing a line on a graph based on the relationship between two variables is something every college freshman has been exposed to. The stats based on these lines can also vary in their intricacy; there is apparently even a version of Jarque-Bera for multiple regression[11], which I’ll skip over for now to avoid the added complexity of dealing with three or more variables. The code required to implement regression stats for purposes of normality testing can also require differing levels of sophistication, as we’ll see shortly after New Year’s.

[1] D’Agostino, Ralph B.; Belanger, Albert and D’Agostino Jr., Ralph B, 1990, “A Suggestion for Using Powerful and Informative Tests of Normality,” pp. 316–321 in __The American Statistician__. Vol. 44, No. 4. Available online at http://www.ohio.edu/plantbio/staff/mccarthy/quantmet/D’Agostino.pdf

[2] See the __Wikipedia__ article “D’Agostino’s K-Squared Test” at http://en.wikipedia.org/wiki/D’Agostino’s_K-squared_test

[3] *IBID.*

[4] “D’Agostino and Pearson (1973) presented a statistic that combines….to produce an omnibus test of normality. By omnibus, we mean it is able to detect deviations from normality due to either skewness or kurtosis.” See p. 318, D’Agostino, et al., 1990.

[5] *IBID.*, p. 318.

[6] For a better explanation of the term than I can give, see Hopkins, Will G., 2001, “Generalizing to a Population: ESTIMATING SAMPLE SIZE continued,” published at the __A New View of Statistics__ web address http://www.sportsci.org/resource/stats/ssdetermine.html. I highly recommend his website for those who are new to the field of stats, like me.

[7] p. 319, D’Agostino, et al., 1990.

[8] *IBID.*, p. 318.

[9] *IBID.*, p. 316.

[10] I don’t know anything about these tests, but I’ve seen them mentioned in sources like the __Wikipedia__ article “Multivariate Normal Distribution” at http://en.wikipedia.org/wiki/Multivariate_normal_distribution

[11] See the Wikipedia page “Jarque-Bera Test” at http://en.wikipedia.org/wiki/Jarque%E2%80%93Bera_test

## Goodness-of-Fit Testing with SQL Server, part 3.1: Skewness, Kurtosis and the Jarque-Bera Test

**By Steve Bolton**

…………In the last installment of this series of amateur self-tutorials on using SQL Server to identify probability distributions, we saw how devices like probability plots can provide simple visual confirmation of a dataset’s shape. I considered doing a quick detour into Q-Q plots, but decided against it because of their simplicity; instead of putting values for the distribution being tested on the horizontal axis, Q-Q plots chop them up into partitions of equal size, a task that is obviously trivial to implement with NTILE. I’m more eager to discuss skewness and kurtosis, two of the oldest, tried-and-true measures of goodness-of-fit[1] – particularly for the normal or “Gaussian” distribution, i.e. the bell curve – precisely because they are often easy to spot with the naked eye. They are numerical measures rather than visualizations, but are often self-evident within graphics like histograms. For example, the third histogram in my recent post Outlier Detection with SQL Server Part 6.1 – Visual Outlier Detection with Reporting Services is a striking examples of a highly skewed column, while the one below it obviously follows a bell curve more closely and has relatively low skewness and kurtosis; later in this article, I’ll run some sample T-SQL code against the same data to derive hard numbers for both. I’ve seen several good explanations of the meanings of skewness and kurtosis in sources at various sites on the Internet, including one of my favorites, the National Institute for Standards and Technology’s Engineering Statistics Handbook, which defines them thus: “Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point…Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution. That is, data sets with high kurtosis tend to have a distinct peak near the mean, decline rather rapidly, and have heavy tails. Data sets with low kurtosis tend to have a flat top near the mean rather than a sharp peak. A uniform distribution would be the extreme case.”[2] Another succinct explanation is given by Tompkins County Community College adjunct faculty member Stan Brown, who says that “The histogram can give you a general idea of the shape, but two numerical measures of shape give a more precise evaluation: skewness tells you the amount and direction of skew (departure from horizontal symmetry), and kurtosis tells you how tall and sharp the central peak is, relative to a standard bell curve.” [3]

…………I already had some experience with both measures way back in A Rickety Stairway to SQL Server Data Mining, Part 14.2: Writing a Bare Bones Plugin Algorithm and A Rickety Stairway to SQL Server Data Mining, Part 14.6: Custom Mining Functions, when I made crude attempts to implement skewness and kurtosis in SSDM in order to illustrate the capabilities of its custom algorithms and functions. That called for fairly simple stats which wouldn’t distract from the main mission; I didn’t really even make much of an effort to understand them, because it wasn’t germane to the lesson at hand. Since then I’ve discovered that it’s easier for me to grasp both stats by viewing them as numerical measures of lopsidedness on a histogram that is divided into imaginary stripes, in which skewness detects how uneven a distribution is from one vertical band to another, whereas kurtosis measures how squashed the distribution curve is on the horizontal axis. Either way you look at it, the measures are still simple enough to explain in layman’s terms, which is one of the strengths of the set of normality tests built from them.

…………The most well-known extension of these somewhat forgotten stats is the Jarque-Bera Test, which only dates back to the 1970s despite being one of earliest examples of normality testing. All of these measures have fallen out of favor with statisticians to some extent, for reasons that will be apparent shortly, but one of the side effects of this is that it is a little more difficult to find variations on them that are more suited to the unique needs of the SQL Server community. One of the strengths of data mining on database servers like SQL Server is that you typically have such an enormous number of records to draw from that you can actually perform calculations on the full population, or a proportion close to it. In ordinary statistics, however, you’re often limited to making inferences based on small samples of just a few dozen or a few hundred rows, out of a much larger population that is often of unknown size; the results can still be logically valid, but often only if other preconditions are met on the data (including normality tests, which are often not performed). For that reason, I usually prefer to leverage SQL Server’s fast set-based retrieval methods to quickly calculate statistics on full populations whenever possible, especially when there are simpler versions of the mathematical formulas available for the full dataset. Skewness and kurtosis are among those measures that can be computed in a simpler way when using the whole population[4], but I’ve opted here to use the more intensive formulas for sample skewness and sample kurtosis for one reason only: it might be possible to substitute population skewness and kurtosis for their sampling counterparts in the formulas for the Jarque-Bera Test, but I can’t find any online sources that mention such a swap. I suspect that it probably would be logically valid, but I took the more conservative approach in Figure 1 by employing the usual Jarque-Bera formula, which really isn’t much more difficult to compute.[5]

** Figure 1: Code for the Jarque-Bera Test Procedure**ALTER PROCEDURE [Calculations].[NormalityTestJarqueBeraSkewnessAndKurtosisSP]

@DatabaseName as nvarchar(128) = NULL, @SchemaName as nvarchar(128), @TableName as nvarchar(128),@ColumnName AS nvarchar(128), @PrimaryKeyName as nvarchar(400)

AS

DECLARE @SchemaAndTableName nvarchar(400),@SQLString nvarchar(max)

SET @SchemaAndTableName = @DatabaseName + ‘.’ + @SchemaName + ‘.’ + @TableName

SET @SQLString = ‘DECLARE

@Mean float,

@StDev float,

@Count as bigint,

@One float = 1,

@RecpiprocalOfNSampleSize float,

@DifferenceFromSampleMeanSquared float,

@DifferenceFromSampleMeanCubed float,

@DifferenceFromSampleMeanFourthPower float,

@SampleSkewness float,

@SampleKurtosis float,

@JarqueBeraTest float

SELECT @Count = Count(‘ + @ColumnName + ‘), @Mean = Avg(CAST(‘ + @ColumnName + ‘ AS float)), @StDev = StDev(‘ + @ColumnName + ‘)

FROM ‘ + @SchemaAndTableName + ‘

WHERE ‘ + @ColumnName + ‘ IS NOT NULL

SELECT @RecpiprocalOfNSampleSize = @One / @Count

SELECT @DifferenceFromSampleMeanSquared = SUM(Power(DifferenceFromSampleMean, 2)) OVER (ORDER BY ‘ + @PrimaryKeyName + ‘ ASC),

@DifferenceFromSampleMeanCubed = SUM(Power(DifferenceFromSampleMean, 3)) OVER (ORDER BY ‘ + @PrimaryKeyName + ‘ ASC),

@DifferenceFromSampleMeanFourthPower =SUM(Power(DifferenceFromSampleMean, 4)) OVER (ORDER BY ‘ + @PrimaryKeyName + ‘ ASC)

FROM (SELECT ‘ + @PrimaryKeyName + ‘, CAST(‘ + @ColumnName + ‘ AS float) – @Mean as DifferenceFromSampleMean — make a single pass across the table?

FROM ‘ + @SchemaAndTableName + ‘

WHERE ‘ + @ColumnName + ‘ IS NOT NULL) AS T1

SELECT @SampleSkewness = (@RecpiprocalOfNSampleSize * @DifferenceFromSampleMeanCubed) /(Power((@RecpiprocalOfNSampleSize * @DifferenceFromSampleMeanSquared), 1.5))

SELECT @SampleKurtosis = (@RecpiprocalOfNSampleSize * @DifferenceFromSampleMeanFourthPower) /(Power((@RecpiprocalOfNSampleSize * @DifferenceFromSampleMeanSquared), 2))

SELECT @JarqueBeraTest = CAST((CAST(@Count AS float) / CAST(6 AS float)) AS Decimal(38,12)) * CAST((Power(@SampleSkewness, 2) + (0.25 * Power((@SampleKurtosis -3), 2))) AS Decimal(38,12))

SELECT @JarqueBeraTest AS JarqueBeraTest, @SampleSkewness AS SampleSkewness, @SampleKurtosis AS SampleKurtosis, @SampleKurtosis – 3 AS ExcessKurtosis, @Mean AS Mean, @StDev AS StDev

— to debug the internal calculations, uncomment the rest of this line, @DifferenceFromSampleMeanSquared, @RecpiprocalOfNSampleSize, @DifferenceFromSampleMeanSquared, @DifferenceFromSampleMeanCubed, @DifferenceFromSampleMeanFourthPower

‘

–SELECT @SQLString — uncomment this to debug dynamic SQL errors

EXEC (@SQLString)

…………The end of the final SELECT can be uncommented to debug the internal calculations. I’ve also reserved the second-to-last line for a SELECT that can be uncommented to debug the dynamic SQL string, as is standard in most of my procedures. Much of the initial code ought to be familiar to reader of this series and the one on outliers, since I use many of the same parameters and internal variables, and apply some of the usual preliminary SET operations on them. As usual, I calculate the values of some reusable internal stats and then cache them in dynamic SQL variables so that we don’t have to recalculate them again, as in the case of the reciprocal of the count and the deviation computation in the lowest-level subquery. I’m experimenting with declaring constants like 1 to high precision data types to prevent situations where SQL Server sometimes truncates the values during calculations, which can lead to erroneous results or at best, messy code full of casts deep within subqueries to avoid such errors. One departure from the norm is the use of floats rather than the decimal data type in the dynamic SQL calculations. The square, cube and quartic operations can result in really high and low values, which may in turn cause divide by zero errors if they’re rounded down to nothing or arithmetic overflows if they’re rounded too high, so I resorted to using float data types for the first time in any of my mistutorial series. This may entail some loss of precision in the internal calculations, but shouldn’t have much an effect on the final test statistic. It is not uncommon for this result to seem outlandishly high when the underlying distribution is abnormal.[6]

** Figure 2: Sample Results from the Duchennes Table and HiggsBosonTable**EXEC [Calculations].[NormalityTestJarqueBeraSkewnessAndKurtosisSP]

@DatabaseName = N’DataMiningProjects’,

@SchemaName = N’Health’,

@TableName = N’DuchennesTable’,

@ColumnName = N’LactateDehydrogenase’,

@PrimaryKeyName = N’ID’

…………The query in Figure 2 produced the results in the graphic immediately below it, in which I tested the procedure on the LactateDehydrogenase column of a dataset on the Duchennes form of muscular dystrophy, which is made publicly available by Vanderbilt University’s Department of Biostatistics. The procedure performed surprisingly well when deriving the other two result sets, clocking in at 3:43 and 3:42 minutes on the first two float columns of the Higgs Boson Dataset, which I downloaded from the University of California at Irvine’s Machine Learning Repository and converted into a SQL Server table. It has 11 million rows and takes up about 6 gigabytes of the DataMiningProjects database I created for these tutorial series, which makes it ideal for stress-testing. Keep in mind that my clunker of a development machine hardly qualifies as a professional database server, so your results will probably be spectacularly better – especially after it’s been subjected to query tuning by one of the countless DBA who knows the ins and outs of T-SQL a lot better than I do. As evinced in Figure 3, the execution plan turned out to be a lot easier to interpret than some of the more sophisticated code I posted in the last tutorial series, with two seeks and two sorts taking up the bulk of the computational effort.

__Figure 3: Execution Plan for the Jarque-Bera Procedure
__

…………The results in Figure 2 are a powerful illustration of one of the weaknesses of the Jarque-Bera Test, i.e. its lack of scaling. The higher the values of the column accumulate in the internal calculations, the larger the test results may be; that is why the 209 rows of the LactateDehydrogenase column had much higher skewness and kurtosis scores than the results for Column1 and Column2 of the Higgs Boson table, yet had a Jarque Bera score that was several orders of magnitude smaller. I’m sure that by now some statistician has developed a scaling mechanism to get around this problem, but I question whether it is worth it for our purposes, for “…it is not without weakness. It has low power for distributions with short tails, especially for bimodal distributions. Other authors have declined to include its data in their studies because of its poor overall performance.”[7] The latter wasn’t as much of an issue as expected in this example, but another problem frequently encountered in the last couple of tutorial series reared its head again: the lack of hard-and-fast cut-off points. I couldn’t find a clear winner among the competing criteria for when the Jarque-Bera stat disqualifies a dataset from being Gaussian (although that doesn’t mean one doesn’t exist, given that I lack experience with this field). They seem to all boil down to “rules of thumb,” out of those I’m most inclined to favor M.G. Bulmer’s, that skewness values beyond -1 or +1 are highly skewed, those within 0.5 of zero are pretty much symmetric and the rest are moderately skewed.[8]

…………It may be that we are better off without such hard limits though, given that they limit us to simplistic either-or choices. Confidence intervals are another common way of forcing the same kind of choice, when there might not be a real crying need for such a limit. If we use a continuous measure, we can ask questions about *how close* a dataset comes to a particular distribution, such as the Gaussian bell curve, but we lose all of that flexibility when we resort to arbitrary cut-off criteria. This is a problem we’ll probably see again as we work our way through the whole menagerie of goodness-of-fit tests, some of which blindly affix labels like “normal” and “not normal” in an almost manic depressive, all-or-nothing way. It’s always good to keep in mind that when we assign labels and test results in this way on a simple pass/fail basis, or perform binning and banding on the values within them, we’re sacrificing a lot of information. For our purposes, we’d probably be better of preserving the skewness and kurtosis values as measures of *how* skewed or kurtic a dataset is, as well as *how* normal it might be, rather than tossing out all the insights and details the full numbers provide. Skewness and kurtosis aren’t as useful in resolving the usual chicken-and-egg dilemma that accompanies outlier detection and goodness-of-fit testing, because we can’t determine whether or not a dataset follows a distribution closely but has too many outliers, or if those outliers signify that a different distribution is a better match. Yet they do occupy a substantial niche in the matrix of use cases I hope to develop for goodness-of-fit, as I did for outlier detection methods in my last mistutorial series. They’re simple enough for a layman to understand and easy to visualize, plus they represent really effective measures of the shape of a dataset, aside from whether or not that shape is applicable to goodness-of-fit testing. This makes them useful in their own right as primitive forms of data mining, in a sense. I’m not as enthused about the Jarque-Bera Test though, because it requires extra computational effort in order to derive results that lack adequate scaling, interpretation criteria and statistical power, even when implemented flawlessly by better programmers than myself. It may very well have valid uses in ordinary statistical applications, but its range of usefulness may be more constrained in the realm of databases servers and Big Data. Perhaps D’Agostino’s K-Squared Test, an alternative goodness-of-fit measure also built upon skewness and kurtosis, will prove more useful that the Jarque-Bera Test in next week’s article.

[1] See the __Wikipedia__ page “Normality Test” at http://en.wikipedia.org/wiki/Normality_test

[2] See National Institute for Standards and Technology, 2014, “1.3.5.11 Measures of Skewness and Kurtosis,” published in the online edition of the __Engineering Statistics Handbook__. Available online at http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h1.htm

[3] Brown, Stan, 2012, “Measures of Shape: Skewness and Kurtosis,” published Dec. 27, 2012 at the __Tompkins Cortland Community College__ website. Available online at http://www.tc3.edu/instruct/sbrown/stat/shape.htm .

[4] See Brown, Stan, 2012.

[5] I derived this code from the formulas at Brown’s webpage and the __Wikipedia__ entry “Jarque-Bera Test” at http://en.wikipedia.org/wiki/Jarque%E2%80%93Bera_test

[6] For example, see the thread posted by the user named ipadawan on Oct. 13, 2011 in the __CrossValidated__ forums, titled “Appropriate Probability Threshold for Jarque-Bera Test,” which is available online at the web address http://stats.stackexchange.com/questions/16949/appropriate-probability-threshold-for-jarque-bera-test

[7] See the Wikipedia entry for “Normality Test” again, at http://en.wikipedia.org/wiki/Normality_test

[8] I’m paraphrasing Brown, 2012, who cites Bulmer, M. G., 1979, __Principles of Statistics__. Dover Publication: New York. I also agree with Brown when he says that “… GraphPad suggests a confidence interval for skewness….I would say, compute that confidence interval, but take it with several grains of salt — and the further the sample skewness is from zero, the more skeptical you should be.” I have no issue with GraphPad, which I’ve never used before, but am not inclined to much stock in hard confidence intervals anyways.

## Goodness-of-Fit Testing with SQL Server, part 2.1: Implementing Probability Plots in Reporting Services

**By Steve Bolton**

…………In the first installment of this series of amateur self-tutorials, I explained how to implement the most basic goodness-of-fit tests in SQL Server. All of those produced simple numeric results that are trivial to calculate, but in terms of interpretability, you really can’t beat the straightforwardness of visual tests like Probability-Probability (P-P) and Quantile-Quantile (Q-Q) Plots. Don’t let the fancy names fool you, because the underlying concepts aren’t that difficult to grasp once the big words are subtracted. It is true that misunderstandings may sometimes arise over the terminology, since both types of visual goodness-of-fit tests are often referred by the generic term of “probability plots” – especially when we use the Q-Q Plot for the Gaussian or “normal” distribution, i.e. the bell curve, which is often called the “normal probability plot.”[1] Nevertheless, the meaning of either one is easy to grasp at a glance, even to an untrained eye: basically, we just build a scatter plot of data points, then compare it to a line that represents the ideal distribution of points for a perfect match. If they look like they follow the same path – usually a straight line – then we can conclude that the distribution we want to assess fits well. Visual analysis of this kind is of course does not provide the kind of detail or rigor that more sophisticated goodness-of-fit tests can, but it serves as an excellent starting point, especially since it is relatively straightforward to implement scatter plots of this kind in Reporting Services.

…………As I found out the hard way, the difficult part with implementing these visual aids is not in representing the data in Reporting Services, but in calculating the deceptively short formulas in T-SQL. For P-P Plots, we need to compare two cumulative distribution functions (CDFs). That may be a mouthful, but one that is not particularly difficult to swallow once we understand how to calculate probability distribution functions. PDFs[2] are easily depicted in histograms, where we can plot the probability of the occurrence of each particular value in a distribution from left to right to derive such familiar shapes as the bell curve. Since probabilities in stochastic theory always start at 0 and sum to 1, we can plot them a different way, by summing them in succession for each associated value until we reach that ceiling. Q-Q Plots are a tad more difficult because they involve comparing the inverse of the CDFs, using what is alternately known as quantile or percent point functions[3], but not terribly so. Apparently the raison d’etre for these operations is to distill distributions like the Gaussian down to the uniform distribution, i.e. a flat line in which all outcomes are equally likely, for easier comparison.[4]

**Baptism By Fire: Why the Most Common CDF is Also the Most Trying**

Most probability distributions have their own CDF and Inverse CDF, which means it would be time-consuming to implement them all in order to encompass all of the known distributions within a single testing program. The equations involved are not always terribly difficult – except, however, when it comes to the most important distribution of all, the Gaussian. No exact solutions are available (let alone mathematically possible) for our most critical, must-have use case, so we must rely on various approximations developed by mathematicians over the years. One of my key goals in writing self-tutorials of this kind is to acquire the ability to translate equations into T-SQL, Visual Basic and Multidimensional Expression (MDX) quickly, but I got a baptism by fire when trying to decipher one of the symbols used in the error functions the normal distribution’s CDF depends upon. The assistance I received from the folks at CrossValidated (StackOverlow’s machine learning and statistics forum) was also indispensable in helping me wrap my head around the formulas, which are apparently a common stumbling block for beginners like me.[5] For the Inverse CDFs I also had to learn the concept of order statistics, i.e. rankits, which I can probably explain a lot more succinctly than some of the unnecessarily prolix resources I waded through along the way. The mathematical operation is really no more difficult than writing down all of your values in order from lowest to highest, then folding the sheet of paper in half and adding the corresponding points together. The Wikipedia discussion page “Talk:Rankit” helped tremendously; in fact, I ended up using the approximation for the R statistical package that is cited there in my implementation of the Gaussian Inverse CDF.[6]

…………While slogging through the material, it began to dawn on me that it might not be possible to implement even a crude solution in T-SQL, at least for tables of the size SQL Server users encounter daily. Indeed, if it weren’t for a couple of workarounds like the aforementioned one for R I found scattered across the Internet, I wouldn’t have finished this article at all. Resorting to the use of lookup tables for known values really doesn’t help us in the SQL Server world, because they simply don’t go high enough. I was reunited with one of the same stumbling blocks I often encountered when writing my last mistutorial series, namely that fact that the available lookup tables for known rankit values simply don’t go anywhere near high enough for the size of the tables used in SQL Server databases and cubes. For example, one compendium of statistical tables I consulted could only accommodate up to 50 values.[7]

**In the Absence of Lookup Tables, Plan on Writing Intricate SQL**

This is merely a subset of the much broader issue of scaling statistical tests that were designed generations ago for much smaller sample sizes, of a few dozen or a few hundred records, to the thousands or millions of rows routinely seen in modern databases. In this case, I was forced to calculate the missing rankit values myself, which opened up a whole new can of worms. Another critical problem with implementing the CDF and Inverse CDF in code is that many of the approximations involve factorials, but those can only be calculated up to values around 170 without reaching the limit of the T-SQL float data type; this is actually quite good compared to other languages and statistical packages, which can often handle values only up to around 20.[8] Thankfully, Peter John Acklam published a decent approximation algorithm online, which can calculate Inverse CDFs for the normal distribution without factorials. It’s only good to a precision of 1.15 x 10^{−9}, which may not be sufficient for some Big Analysis use cases, but this code ought to be enough to get a novice data miner started.[9]

…………The complexity of implemented probability plots is further increased when we factor in the need to write separate code for each distribution; most of them aren’t as difficult as the Gaussian, which has no closed-form solution, but providing code for each of them would require dozens of more articles. For that reason, I’ll stick to the bell curve for now; consequently, I also won’t get into a discussion of the lesser-known Probability Plot Correlation Coefficient Plot (PPCC), which is only applicable to distributions like the Weibull that have shape parameters, unlike the bell curve.[10] Another complication we have to deal with when using CDFs, inverse CDFs and PDFs is that different versions may be required for each, depending on whether you want to return a single value or a whole range, or whether such inputs as the mean, standard deviation and counts are already known or have to be computed on the fly. Later in this series we will probably have to make use of some of these alternate versions for more advanced fitness tests, so I’ve uploaded all 14 versions I’ve coded to date in one fell swoop to one central repository on DropBox, which are listed below:

NormalDistributionCDFSP.sql

NormalDistributionCDFSupplyMeanAndStDevSP.sql

NormalDistributionCDFSupplyMeanStDevAndRangeSP.sql

NormalDistributionCDFSupplyTableParameterSP.sql

NormalDistributionInverseCDFFunction.sql

NormalDistributionPDFAndCDFSupplyMeanStDevAndRangeSP.sql

NormalDistributionPDFSP.sql

NormalDistributionPDFSupplyMeanAndStDevSP.sql

NormalDistributionPDFSupplyMeanStDevAndRangeSP.sql

NormalDistributionRankitApproximationSP.sql

NormalDistributionSingleCDFFunction.sql

RankitApproximationFunction.sql

RankitApproximationSP.sql

RankitApproximationSupplyCountSP.sql

SimpleFloatValueTableParameter.sql

…………Keep in mind that, as usual, I’ve only done very basic testing on these stored procedures and functions, so they’ll probably require some troubleshooting before putting them into a production environment; consider them an example of how a professional solution might be engineered, not as a finished product. I did some validation of the procedures against various CDF and Inverse CDF lookup tables and calculators I found on the Web, but only for a handful of values.[11] The .sql file names are pretty much self-explanatory: for example, NormalDistributionPDFSupplyMeanAndStDevSP returns the PDF function for the normal distribution if you supply the mean and standard deviation, whereas the NormalDistributionSingleCDFFunction does just what it says by returning one value out of a set of CDF results. A few take table variables as inputs, so I’ve included the SimpleFloatValueTableParameter I defined to supply them. I’ve followed my usual coding style by appending SP and Function to the ends of the names to denote what type of object they are. The NormalDistributionRankitApproximationSP, RankitApproximationSP and RankitApproximationSupplyCountSP procedures use the aforementioned approximation from R, while my implementation of Acklam’s approximation can be found in the NormalDistributionInverseCDFFunction.sql file. Some of the objects are dependent on the others, like the RankitApproximationFunction, which utilizes the NormalDistributionInverseCDFFunction.

…………Some of the other procedures will be of use later in this tutorial series, but in this week’s installment, we’ll be feeding the output from DataMiningProjects.Distributions.NormalDistributionSingleCDFFunction listed above into a couple of SSRS line charts. As I pointed out in three previous articles from the tail end of my last tutorial series, there are plenty of better explanations of how to write reports and do other basic tasks in RS, so I won’t clutter this post with those extraneous details. Basically, the sample procedure below derives the CDF values for the horizontal axis and another set of values for the vertical axis called the Empirical Distribution Function (EDF), which is just a fancy way of saying the values actually found in the dataset. Anyone familiar with the style of sample code I’ve posted on this blog can tell that we’re just using dynamic SQL to calculate distinct counts, with the difficult computations hidden inside the CDF function; I reuse most of the same parameters, intermediate variable declarations and other code seen in past articles, like the SELECT @SQLString for debugging the procedure.

** Figure 1: Sample T-SQL to Build a Probability-Probability Plot**CREATE PROCEDURE [GoodnessOfFit].[PPPlot]

@Database1 as nvarchar(128) = NULL, @Schema1 as nvarchar(128), @Table1 as nvarchar(128),@Column1 AS nvarchar(128)

AS

DECLARE @SchemaAndTable1 nvarchar(400),@SQLString nvarchar(max)

SET @SchemaAndTable1 = @Database1 + ‘.’ + @Schema1 + ‘.’ + @Table1

SET @SQLString = ‘DECLARE @Mean as float,

@StDev as float,

@Count bigint

SELECT @Count=Count(CAST(‘ + @Column1 + ‘ as float)), @Mean = Avg(CAST(‘ + @Column1 + ‘ as float)), @StDev = StDev(CAST(‘ + @Column1 + ‘ as float))

FROM ‘ + @SchemaAndTable1 + ‘

WHERE ‘ + @Column1 + ‘ IS NOT NULL

DECLARE @EDFTable table

(ID bigint IDENTITY (1,1),

Value float,

ValueCount bigint,

EDFValue float,

CDFValue decimal(38,37),

EDFCDFDifference decimal(38,37))

INSERT INTO @EDFTable

(Value, ValueCount, EDFValue)

SELECT Value, ValueCount, CAST(SUM(ValueCount) OVER (ORDER

BY Value ASC) as float) / @Count AS EDFValue

FROM (SELECT DISTINCT ‘ + @Column1 + ‘ AS Value, Count(‘ + @Column1 + ‘) OVER (PARTITION BY ‘

+ @Column1 + ‘ ORDER BY ‘ + @Column1 + ‘) AS ValueCount

FROM ‘ + @SchemaAndTable1 + ‘

WHERE ‘ + @Column1 + ‘ IS NOT NULL) AS T1

UPDATE T1

SET CDFValue = T3.CDFValue, EDFCDFDifference = EDFValue – T3.CDFValue

FROM @EDFTable AS T1

INNER JOIN (SELECT DistinctValue, DataMiningProjects.Distributions.NormalDistributionSingleCDFFunction

(DistinctValue, @Mean, @StDev) AS CDFValue

FROM (SELECT

DISTINCT Value AS DistinctValue

FROM @EDFTable) AS T2) AS T3

ON T1.Value = T3.DistinctValue

SELECT ID, ROW_NUMBER() OVER (ORDER BY ID) AS RN, Value, ValueCount, EDFValue, CDFValue, EDFCDFDifference

FROM @EDFTable‘

–SELECT @SQLString — uncomment this to debug dynamic SQL errors

DECLARE @ResultTable table

(PrimaryKey sql_variant,

RN bigint,

Value float,

ValueCount bigint,

EDF float,

CDF float,

EDFCDFDifference float

)

INSERT INTO @ResultTable

EXEC (@SQLString)

SELECT PrimaryKey, RN, Value, ValueCount, EDF, CDF, EDFCDFDifference

FROM @ResultTable

…………If the distribution being tested by the CDF is a good match then the coordinates ought to come as close to an imaginary center line cutting across from (0,0) to (1,1), which are the boundaries of any EDF or CDF calculation. That’s obviously not the case in the first plot in Figure 2, where the coordinates are shifted far to the left and top despite the fact that the horizontal axis is skewed, with most of the values lopped off. The other three all have standard 0.1 intervals, including the second plot, which seems to be a good match. This is not surprising, given that I’ve already performed much more sophisticated goodness-of-fit tests on this data, which represents the second float column in the Higgs Boson Dataset I downloaded from University of California at Irvine’s Machine Learning Repository ages ago for practice data on this blog. The abnormal plot above it comes from the first float column in the same dataset, which I routinely fails tests for the Gaussian/normal distributon. Note how thick the lines are in both: this is because there are 11 million rows in the practice dataset, with 5,001 distinct values for the second column alone. Most of the tests I’ll survey in this series perform well in the databae engine, but trying to depict that many values in an SSRS report can obviously lead to congestion in the user interface. The first plot was particularly slow in loading on my development machine. The third plot loaded quickly because it came from the Duchennes muscular dystrophy dataset[12] I’ve also been using for demonstration purposes, which has a mere 209 rows. The Lactate Dehyrogenase enzyme data embodied in the column I plugged into my procedure is probably not normally distributed, given how erratic it is at the tails and bowed at the center. The fourth plot comes from a time dataset that may be Gaussian despite its jagged appearance, which is caused by the discrete 15-minute intervals it tracks. It is in situations like this where knowing your data is an immense help in successful interpretation, i.e. the end goal of any data mining endeavor. In many other contexts, serrated shapes are often an indicator of abnormality; in this instance, it is dictated by the fixed width of data type intervals chosen.

__Figure 2: Four Sample Probability-Probability Plots Derived from T-SQL
__

…………It should be fairly obvious just from glancing at the results that P-P can serve as outlier detection methods in and of themselves; as the National Institute for Standards and Technology’s Engineering Statistics Handbook (one of my favorite online statistical references) points out, “In addition to checking the normality assumption, the lower and upper tails of the normal probability plot can be a useful graphical technique for identifying potential outliers. In particular, the plot can help determine whether we need to check for a single outlier or whether we need to check for multiple outliers.”[13] Nevertheless, I omitted them from my last tutorial series because they’re simply too crude to be effective in this capacity. If we were going to spot aberrant data points by eye in this manner, we might be better off comparing histograms like the ones I introduced in Outlier Detection with SQL Server Part 6.1: Visual Outlier Detection with Reporting Services with the PDFs of the distributions we want to compare. Even then, we still run into the same chicken-and-egg problem that we encountered through the series on outliers: without goodness-of-fit testing, we can’t determine what the underlying distribution should be and therefore can’t tell if *any* records are outliers. If we force these fitness tests to do double-duty, we end up sitting between two stools, as the old German proverb says, because then we can’t be sure of either the distribution or the aberrance of underlying data points. Moreover, like most other outlier methods, it doesn’t provide any information whatsoever on *why* a record is aberrant. Furthermore, some of the approximations the underlying functions use also intrinsically discount outliers, as Acklam’s does.[14] In the case of P-P Plots and Q-Q Plots, we’re more often than not better off using them in their original capacity as fitness tests. No harm is done if we spot an aberrant data point in the scatter plots and flag them for further investigation, but scaling up this approach to full-fledged automatic outlier detection would become problematic once we get into the thousands or millions of data points.

…………This size issue also places a built-in limitation on the usefulness of these visual methods for fitness testing purposes. If all of the data points from a single table are crammed into one thick black line that obscures all of the underlying detail, then we can still draw a reasonable conclusion that it fits the distribution we’re comparing it against. That approach is no longer tenable once we’re talking about one thousand out of a million records being off that line, which forces us to make a thousand judgment calls. Once we try to scale up these visual methods, we run into many of the same problems we encountered with the visual outlier detection methods surveyed in the last series, such as the use of binning and banding – not to mention the annoying restriction in Reporting Services against consuming more than a single resultset from each stored procedure, which forces us to discard any summary data that really ought to be calculated in T-SQL, MDX or DAX rather than in RS. These methods also have some specific inherent limitations, such as the inapplicability of P-P plots when the two distributions don’t have roughly simple center points (as measured by means, medians, modes, etc.).[15] At a much broader level, these tests don’t provide much information on *how well* a dataset fits a particular distribution, because that would involve half-conscious visual assessments of how much each outlier counts for or against the final verdict. For example, how are we to weigh seven outliers that are two quantiles off the mark, compared to three that are a half a quantile away? These tests are conveniences that allow users to make spot assessments of the fitness of distributions at a glance, with the minimum of interpretation and computational costs, but they simply don’t have much muscle. That is the unavoidable drawback of simplistic tests of this type. They amount to brute force, unconscious assessments that “if nothing looks out of place, the fitness of the distribution is not an issue we need to be worried about” – i.e. the flip side of visual outlier detection methods, which boil down to “if it looks out of place, we’ll look at it more closely.” Once the need arises for more definite confirmation of a dataset’s fit to a particular distribution, we have to resort to tests of greater sophistication, which invariably churn out numeric results rather than eye candy. If I don’t take a quick detour into Q-Q Plots next time around, then in the next installment we’ll climb another rung up this ladder of sophistication as we discuss skewness and kurtosis, which can provide greater detail about how closely a dataset fits its target distribution.

[1] See the __Wikipedia__ articles “P-P Plot” and “Normal Probability Plot” respectively at http://en.wikipedia.org/wiki/P%E2%80%93P_plot and http://en.wikipedia.org/wiki/Normal_probability_plot for mention of these conundrums.

[2] As pointed out in the last article, for the sake of convenience I’ll be using the term “probability distriubtion function” (PDF) to denote probability density functions and the equivalent concepts for distributions on discrete scales, probability mass functions (PMFs). This is sometimes done in the literature, but not often.

[3] See the __Wikipedia__ article “Quantile Function” at http://en.wikipedia.org/wiki/Quantile_function for the terminology.

[4] See this comment at the __Wikipedia__ page “Order Statistic” at http://en.wikipedia.org/wiki/Order_statistic :”When using probability theory to analyze order statistics of random samples from a continuous distribution, the cumulative distribution function is used to reduce the analysis to the case of order statistics of the uniform distribution.”

[5] See the __CrossValidated__ thread “Cumulative Distribution Function: What Does t in \int\exp(-t^2)dt stand for?” at http://stats.stackexchange.com/questions/111868/cumulative-distribution-function-what-does-t-in-int-exp-t2dt-stand-for

[6] Another source I found useful as Holmes, Susan, 1998, “Order Statistics 10/30,” published Dec. 7, 1998 at the Stanford Univeristy web address http://statweb.stanford.edu/~susan/courses/s116/node79.html

[7] pp. 59-61, Rohlf, F. James and Sokal, Robert R., 1995, __Statistical Tables__. Freeman: New York. Retrieved from the Google Books web address http://books.google.com/books?id=1ImWLlMxEzoC&pg=PA59&lpg=PA59&dq=rankits+example&source=bl&ots=fWnT_Gfhvy&sig=bXSLnrtWqlbmT07FXVnVKd5wqbY&hl=en&sa=X&ei=gNJFVJCmNIf2OqKNgMgF&ved=0CDkQ6AEwAg#v=onepage&q=rankits%20example&f=false

[8] Some sources I used when trying to implement the factorial formula include p. 410, Teichroew, D., 1956, “Tables of Expected Values of Order Statistics and Products of Order Statistics for Samples of Size Twenty and Less from the Normal Distribution,” pp. 410-426 in __The Annals of Mathematical Statistics__, Vol. 27, No. 2. Available at the __Project Euclid__ web address http://projecteuclid.org/euclid.aoms/1177728266 as well as Weisstein, Eric W., 2014, “Order Statistic.” published t the Wolfram MathWorld web address http://mathworld.wolfram.com/OrderStatistic.html

[9] See Acklam, Peter John, 2010, “An Algorithm for Computing the Inverse Normal Cumulative Distribution Function,” published Jan. 21, 2010, at the __Peter’s Page__ website. Available online at http://home.online.no/~pjacklam/notes/invnorm/ I made some corrections to my original implementation after consulting John Herrero’s VB example at http://home.online.no/~pjacklam/notes/invnorm/impl/herrero/inversecdf.txt and discovering that I had left off several minus signs from the constants; these might have been clipped off when I imported them.

[10] See the __Wikipedia__ article “Probability Plot Correlation Coefficient Plot” at http://en.wikipedia.org/wiki/Probability_plot_correlation_coefficient_plot

[11] I checked the inverse CDF values at p. 15, University of Glasgow School of Mathematics & Statistics, 2012, “Statistical Tables,” published June 21, 2012 at the __University of Glasgow School of Mathematics & Statistics__ web address http://www.stats.gla.ac.uk/˜levers/software/tables/

[12] I downloaded this long ago from Vanderbilt University’s Department of Biostatistics__.__

[13] See National Institute for Standards and Technology, 2014, ““1.3.5.17 Detection of Outliers,” published in the online edition of the __Engineering Statistics Handbook__. Available online at http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm . Also see

“1.3.3.26.10. Scatter Plot: Outlier” at http://www.itl.nist.gov/div898/handbook/eda/section3/scattera.htm

[14] See Acklam, Peter John, 2010.

[15] See the aforementioned __Wikipedia__ article “P-P Plot” at http://en.wikipedia.org/wiki/P%E2%80%93P_plot

## Goodness-of-Fit Testing with SQL Server, part 1: The Simplest Methods

**By Steve Bolton**

…………In the last series of mistutorials I published in this amateur SQL Server blog, the outlier detection methods I explained were often of limited usefulness because of a chicken-and-egg problem: some of the tests could tell us that certain data points did not fit a particular set of expected values, but not whether those records were aberrations from the correct distribution, or if our expectations were inappropriate. The problem is akin to trying to solve an algebra problem with too many variables, which often can’t be done without further information. Our conundrum can be addressed by adding that missing information through goodness-of-fit tests, which can give us independent verification of whether or not our data ought to follow a particular distribution; only then can we apply batteries of other statistical tests that require particular distributions in order to make logically valid inferences, including many of the outlier identification methods discussed previously in this blog.

…………As I touched on frequently in that series, it is not uncommon for researchers in certain fields to fail to perform distribution testing, which thereby renders many of their published studies invalid. It is really an obvious problem that any layman can grasp: if we don’t have an expected pattern in mind, then it is difficult to define departures from it, which is essentially what outliers are. Goodness-of-fit tests also provide insights into data that are useful in and of themselves, as a sort of primitive form of data mining, which can be leveraged further to help us make informed choices about which of the more advanced (and concomitantly costly in terms of performance and interpretation effort) algorithms ought to be applied next in a data mining workflow. In fact, SSDM provides a Distribution property allowing users to specify whether a mining column follows a Log Normal, Normal or Uniform pattern, as I touched on briefly in A Rickety Stairway to SQL Server Data Mining, Part 0.0: An Introduction to an Introduction. In this series of mistutorials, I will be focusing more on the information that goodness-of-fit testing can give us about our data, rather than on the statistical tests (particularly on hypotheses) it typically serves as a prerequisite to. For all intents and purposes, it will be used as a ladder to future blog posts on more sophisticated data mining techniques that can be implemented in SQL Server, provided that we have some prior information about the distribution of the data.

**Probability Distributions vs. Regression Lines**

Goodness-of-fit tests are also sometimes applicable to regression models, which I introduced in posts like A Rickety Stairway to SQL Server Data Mining, Algorithm 2: Linear Regression and A Rickety Stairway to SQL Server Data Mining, Algorithm 4: Logistic Regression. I won’t rehash the explanations here for the sake of brevity; suffice it to say that regressions can be differentiated from probability distributions by looking at them as line charts which point towards the predicted values of one or more variables, whereas distributions are more often represented as histograms representing the full range of a variable’s actual or potential values. I will deal with methods more applicable to regression later in this series, but in this article I’ll explain some simple methods for implementing the more difficult concept of a probability distribution. One thing that sets them apart is that many common histogram shapes associated with them have been labeled, cataloged and studied intensively for generations, in a way that the lines produced by regressions have not. In fact, it may be helpful for people with programming backgrounds (like many SQL Server DBAs) to look at them as objects, in the same sense as object-oriented programming. For example, some of them are associated with Location, Scale and Shape parameters and characteristics like the mode (i.e. the peak of the histogram) and median that can be likened to properties. For an excellent explanation of location and scale parameters that any layman could understand, see the National Institute for Standards and Technology’s Engineering Statistics Handbook, which is one of the most readable sources of information on stats that I’ve found online to date. Statisticians have also done an enormous amount of work studying every conceivable geometrical subsection of distributions and devised measures for them, such as skewness and kurtosis for the left and right corners or “tails” of a histogram. Each distribution has an associated set of functions, such as the probability density function (PDF) in the case of Continuous data types (as explained in A Rickety Stairway to SQL Server Data Mining, Part 0.0: An Introduction to an Introduction) or the probability mass function (PMF) in the case of Discrete types. “Probability distribution function” (PDF) is occasionally used for either one in the literature and will be used as a catch-all term throughout this series.[i] Other common functions associated with distributions include the cumulative distribution function (CDF); inverse cumulative distribution function (also known as the quantile function, percent point function, or ppf); hazard function; cumulative hazard function; survival function; inverse survival function; empirical distribution function (EDF); moment-generating function (MGF) and characteristic function (CF)[ii]. I’ll save discussions of more advanced functions for Fisher Information and Shannon’s Entropy that are frequently used in information theory and data mining for a future series, Information Measurement with SQL Server. Furthermore, many of these functions can have orders applied to them, such as rankits, which are a concept I’ll deal with in the next article. I don’t yet know what many of them do, but some of the more common ones like the PDFs and CDFs are implemented in the goodness-of-fit tests for particular distributions, so we’ll be seeing T-SQL code for them later in this series.

…………I also don’t yet know what situations you’re liable to encounter particular data distributions in, although I aim to by the end of the series. I briefly touched on Student’s T-distribution in the last series, where it is used in some of the hypothesis-testing based outlier detection methods, but I’m not yet acquainted with some of the others frequently mentioned in the data mining literature, like the Gamma, Exponential, Hypergeometric, Poisson, Pareto, Tukey-Lambda, Laplace and Chernoff distributions. The Chi-Squared distribution is used extensively in hypothesis testing, the Cauchy is often used in physics[iii] and the Weibull “is used to model the lifetime of technical devices and is used to describe the particle size distribution of particles generated by grinding, milling and crushing operations.”[iv] What is important for our purposes, however, is that all of the above are mentioned often in the information theory and data mining literature, which means that we can probably put them to good use in data discovery on SQL Server tables.

…………If you really want to grasp the differences between them at a glance, a picture is worth a thousand words: simply check out the page “1.3.6.6 Gallery of Distributions” at the aforementioned NIST handbook for side-by-side visualizations of 19 of the most common distributions. Perhaps the simplest one to grasp is the Uniform Distribution, which has a mere straight line as a histogram; in other words, all values are equally likely, as we would see in rolls of single dice. The runner-up in terms of simplicity is the Bernoulli Distribution, which is merely the distribution associated with Boolean yes-no questions. Almost all of the explanations I’ve seen for it to date have revolved around coin tosses, which any elementary school student can understand. Dice and coin tosses are invariably used to illustrate such concepts in the literature on probabilities because they’re so intuitive, but they also have an advantage in that we can calculate exactly what the results should be, in the absence of any empirical evidence. The problem we run into in data mining is that we’re trying to discover relationships that we can’t reason out in advance, using the empirical evidence provided by the billions of rows in our cubes and tables. Once we’ve used goodness-of-fit testing to establish that the data we’ve collected indeed follows a particular distribution, then we can use all of the functions, properties, statistical tests, data mining techniques and theorems associated with it to quickly make a whole series of new inferences.

**The “Normal” Distribution (i.e. the Bell Curve)**

…………This is especially true of the Gaussian or “normal” distribution, which is by far the most thoroughly studied of them all, simply because an uncanny array of physical processes approximate it. The reasons for its omnipresence are still being debated to this day, but one of the reasons is baked right into the structure of mathematics through such laws as the Central Limit Theorem. Don’t let the imposing name scare you, because the concept is quite simple – to the point where mobsters, I’m told, used to teach themselves to instantly calculate gambling odds from it in order to run book-making operations. Once again, dice are the classic example used to explain the concept: there are obviously many paths through which one could roll a total of six from two dice, but only one combination apiece for snake eyes or boxcars. The results thus naturally form the familiar bell curve associated with the normal distribution. The most common version of it is the “standard normal distribution,” in which a mean of zero and standard deviation of one are plugged into its associated functions, which force it to form a clean bell curve centered on the zero mark in a histogram. The frequency with which the normal distribution pops up in nature is what motivates the disproportionate amount of research poured into it; even the Student’s T-distribution and the Chi-Square Distribution, for example, are used more often in tests of the normal distribution than as descriptions of a dataset in their own right.

…………Unfortunately, one side effect of this lopsided devotion to one particular distribution is that there are far fewer statistical tests associated with its competitors – which tempts researchers into foregoing adequate goodness-of-fit testing, which can also be bothersome, expensive and a bit inconvenient if it disproves their assumptions. Without it, however, there is a gap in the ladder of logic needed to prove anything with hypothesis testing, or to discover new relationships through data mining. This step is disregarded with unnerving frequency – particularly in the medical field, where it can do the most damage – but ought not be, when we can use set-based languages like T-SQL and modern data warehouse infrastructure to quickly perform the requisite goodness-of-fit tests. Perhaps some of the code I’ll provide in this series can even be used in automated testing on a weekly or monthly basis, to ensure that particular columns of interest still follow a particular distribution over time and don’t come uncoupled from it, as stocks, bonds, derivatives and other financial instruments do so frequently from other economic indicators. It is often a fair assumption that a particular dataset ought to follow a normal distribution, but it doesn’t always hold – nor can we say why in many of the cases where it actually does, since the physical processes captured in our billions of records is several orders of magnitude more complex than rolls of dice and coin tosses. Nor can we be certain that many of these complex processes will continue to follow a particular distribution over time, particularly when that most inscrutable of variables, human free will, is factored in.

…………Luckily, there are many goodness-of-fit tests available for the normal distribution, which is fitting given that so much statistical reasoning is dependent on it. Most of the articles in this series will thus be devoted to normality testing, although we may encounter other distributions from time to time, not to mention the tangential topic of regression. I considered kick-starting this series with four incredibly easy methods of normality testing, but one of them turned out to be nowhere near as popular or simple to implement as I believed. The ratio between the min-max range of a column and its standard deviation is listed among the earliest normality tests at Wikipedia[v], but I decided against implementing it fully due to the lack of available comparison values. The concept is quite simple: you subtract the minimum value from a column’s maximum value, then divide by the standard deviation and compare it to a lookup table, but the only reference I could find (in Hartley, et al.’s original paper[vi] from 1954) only went up to 1,000 records and only supplied values for 30 of them. We frequently encountered the same twin problems in the outlier detection series with methods based on hypothesis-testing: most of the lookup tables have massive gaps and are applicable to only a few hundred or thousand records at best, which means they are unsuited to the size of typical SQL Server tables or that popular buzzword, “Big Data.” In the absence of complete lookup tables ranging to very high values, the only alternative is to calculate the missing values ourselves, but I have not yet deciphered these particular formulas sufficiently well yet. Nor is there much point, given that this particular measure is apparently not in common use and might not be applicable to big tables for other reasons, such as the fact that the two bookend values in a dataset of 10 million records probably don’t have much significance. The code in Figure 1 runs fast and is easy to follow, but lacks meaning in the absence of lookup tables to judge what the resulting ratio ought to be for a Gaussian distribution.

__Figure 1: Code to Derive the Ratio of the Range to Standard Deviation__

CREATE PROCEDURE [Calculations].[NormalityTestRangeStDevSP]

@DatabaseName as nvarchar(128) = NULL, @SchemaName as nvarchar(128), @TableName as nvarchar(128),@ColumnName AS nvarchar(128), @DecimalPrecision AS nvarchar(50)

AS

DECLARE @SchemaAndTableName nvarchar(400),@SQLString nvarchar(max)

SET @SchemaAndTableName = @DatabaseName + ‘.’ + @SchemaName + ‘.’ + @TableName –I’ll change this value one time, mainly for legibility purposes

SET @SQLString = ‘DECLARE @Count bigint, @StDev decimal(‘ + @DecimalPrecision + ‘), @Range decimal(‘ + @DecimalPrecision + ‘)

SELECT @Count=Count(CAST(‘ + @ColumnName + ‘ AS Decimal(‘ + @DecimalPrecision + ‘))), @StDev = StDev(CAST(‘ + @ColumnName + ‘ AS Decimal(‘ + @DecimalPrecision + ‘))),

@Range = Max(CAST(‘ + @ColumnName + ‘ AS decimal(‘ + @DecimalPrecision + ‘))) – Min(CAST(‘ + @ColumnName + ‘ AS decimal(‘ + @DecimalPrecision + ‘)))

FROM ‘ + @SchemaAndTableName + ‘

WHERE ‘ + @ColumnName + ‘ IS NOT NULL

SELECT @Range / @StDev AS RangeStDevRatio’

–SELECT @SQLString — uncomment this to debug string errors

EXEC (@SQLString)

…………Thankfully, we have better replacements available at the same low level of complexity. One of the most rudimentary normality tests that any DBA can easily implement and interpret is the 68-95-99.7 Rule, also known as the 3-Sigma Rule. The logic is very simple: if the data follows a normal distribution, then 68 percent of the values should fall within the first standard deviation, 95 percent within the second and 99.7 percent within the third. This can be verified with a simple histogram of distinct counts, of the kind I introduced at the tail end of the last tutorial series. To implement my version, all I did was tack the code in Figure 2 onto the last Select in the HistogramBasicSP stored procedure I posted in Outlier Detection with SQL Server, part 6.1: Visual Outlier Detection with Reporting Services. I also changed the name to HistogramBasicPlusNormalPassFailSP to reflect the added capabilities; for brevity’s sake, I won’t repeat the rest of the code. A @NumberOfStDevsFromTheMean parameter can be added to this code and combined with a clause like SELECT 1 – (1 / POWER (@NumberOfStDevsFromTheMean, 1)) to calculate Chebyshev’s Rule, a less strict test that applies to almost any distribution, not just the normal. In practice, this signifies that half of all the values for any distribution will be one standard deviation from the mean, three-quarters will be within two standard deviations and 87.5 and 93.75 percent will fall within four and five standard deviations respectively. The 3-Sigma Rule is closely to the Law of Large Numbers and Chebyshev’s Rule to its poor cousin, the Weak Law of Large Numbers; if your data fails the first test there’s no reason to hit the panic button, since it might not naturally follow a normal distribution, but failing Chebyshev’s Rule is cause to raise more than one eyebrow.

__Figure 2: Code to Add to the HistogramBasicSP from the Outlier Detection Series__

WHEN @HistogramType = 4 THEN ‘SELECT *, ”FirstIntervalTest” =

CASE WHEN FirstIntervalPercentage BETWEEN 68 AND 100 THEN ”Pass”

ELSE ”Fail” END,

”SecondIntervalTest” = CASE WHEN SecondIntervalPercentage BETWEEN 95 AND 100 THEN ”Pass”

ELSE ”Fail” END,

”ThirdIntervalTest” = CASE WHEN ThirdIntervalPercentage BETWEEN 99.7 AND 100 THEN ”Pass”

ELSE ”Fail” END

FROM (SELECT TOP 1 CAST(@PercentageMultiplier *

(SELECT Sum(FrequencyCount) FROM DistributionWithIntervalsCTE WHERE

StDevInterval BETWEEN -1 AND 1) AS decimal(6,2)) AS FirstIntervalPercentage,

CAST(@PercentageMultiplier * (SELECT

Sum(FrequencyCount) FROM DistributionWithIntervalsCTE WHERE StDevInterval

BETWEEN -2 AND 2) AS decimal(6,2)) AS SecondIntervalPercentage,

CAST(@PercentageMultiplier * (SELECT

Sum(FrequencyCount) FROM DistributionWithIntervalsCTE WHERE StDevInterval

BETWEEN -3 AND 3) AS decimal(6,2)) AS ThirdIntervalPercentage

FROM DistributionWithIntervalsCTE) AS T1′

** Figure 3: Result on the HistogramBasicPlusNormalPassFailSP on the Hemopexin Column**EXEC Calculations.HistogramBasicPlusNormalPassFailSP

@DatabaseName = N’DataMiningProjects’,

@SchemaName = N’Health’,

@TableName = N’DuchennesTable’,

@ColumnName = N’Hemopexin’,

@DecimalPrecision = ‘38,21’,

@HistogramType = 4

…………The results in Figure 3 are child’s play to interpret: the Hemopexin column (in a dataset on the Duchennes form of muscular dystrophy which I downloaded from the Vanderbilt University’s Department of Biostatistics and converted to a SQL Server table) does not quite fit a normal distribution, since the count of values for the first two standard deviations falls comfortably within the 68-95-99.7 Rule, but the third does not. Whenever I needed to stress-test the code posted in the last tutorial series on something more substantial than the Duchennes dataset’s mere 209 rows, I turned to the Higgs Boson dataset made available by the University of California at Irvine’s Machine Learning Repository, which now occupies close to 6 gigabytes of the same DataMiningProjects database. Hopefully in the course of one of these tutorial series (which I plan to keep writing for years to come, till I actually *know* something about data mining) I will be able to integrate practice datasets from the Voynich Manuscript, an inscrutable medieval tome encrypted so well that no one has been able to crack it for the last half-millennium – even the National Security Agency (NSA). The first float column of the Higgs Boson dataset probably makes for a better performance test though, given that the table has 11 million rows, far more than the tens or hundreds of thousands of rows in the tables that I’ve currently compiled from the Voynich Manuscript. The good news is that this simple procedure gave us a quick and dirty normality test in just 4 minutes and 16 seconds on my six-core Sanford and Son version of a development machine – which hardly qualifies as a real server, so the results in a professional setting will probably blow that away.

** Figure 4: Code to Add to Derive the Ratio of Mean Absolute Deviation to Standard Deviation**CREATE PROCEDURE [Calculations].[NormalityTestMeanAbsoluteDeviationStDevRatioSP]

@DatabaseName as nvarchar(128) = NULL, @SchemaName as nvarchar(128), @TableName as nvarchar(128),@ColumnName AS nvarchar(128), @DecimalPrecision AS nvarchar(50)

AS

DECLARE @SchemaAndTableName nvarchar(400),@SQLString nvarchar(max)

SET @SchemaAndTableName = @DatabaseName + ‘.’ + @SchemaName + ‘.’ + @TableName –I’ll change this ‘ + @ColumnName + ‘ one time, mainly for legibility purposes

SET @SQLString = ‘DECLARE @Mean decimal(‘ + @DecimalPrecision + ‘), @StDev decimal(‘ + @DecimalPrecision + ‘)

SELECT @Mean = Avg(CAST(‘ + @ColumnName + ‘ AS Decimal(‘ + @DecimalPrecision + ‘))), @StDev = StDev(CAST(‘ + @ColumnName + ‘ AS Decimal(‘ + @DecimalPrecision + ‘)))

FROM ‘ + @SchemaAndTableName + ‘

WHERE ‘ + @ColumnName + ‘ IS NOT NULL

SELECT MeanAbsoluteDeviation / @StDev AS Ratio, 0.79788456080286535587989211986877 AS RatioTarget, MeanAbsoluteDeviation, @StDev as StandardDeviation

FROM (SELECT Avg(Abs(‘ + @ColumnName + ‘ – @Mean)) AS MeanAbsoluteDeviation

FROM ‘ + @SchemaAndTableName + ‘

WHERE ‘ + @ColumnName + ‘ IS NOT NULL) AS T1’

–SELECT @SQLString — uncomment this to debug dynamic SQL errors

EXEC (@SQLString)

** Figure 5: Results for the Mean Absolute Deviation to Standard Deviation Ratio Test**EXEC @return_value = [Calculations].[NormalityTestMeanAbsoluteDeviationStDevRatioSP]

@DatabaseName = N’DataMiningProjects’,

@SchemaName = N’Physics’,

@TableName = N’HiggsBosonTable’,

@ColumnName = N’Column1′,

@DecimalPrecision = N’33,29′

…………If HistogramBasicPlusNormalPassFailSP is still too slow for your needs, it may be relieving to know that the code in Figure 4 took only two seconds to run on Column1 on the same machine and a mere five seconds on Column 2, which wasn’t properly indexed at the time. The procedure really isn’t hard to follow, if you’ve seen some of the T-SQL code I posted in the last series of the tutorials. For consistency’s sake, I’ll be using many of the same parameters in this tutorial series as I did in the last, include @DecimalPrecision, which enables users to avoid arithmetic overflows by setting their own precision and scale for the internal calculations. As we saw in the Visual Outlier Detection with Reporting Services segment of the last series, this parameter can also be used to prevent a mystifying problem in which RS reports occasionally return blank results for some columns, if their precision and scale are set too high. The first four parameters allow users to perform the normality test on any numeric column in any database for which they have adequate access, while the next-to-last-line allows users to debug the dynamic SQL.

…………In between those lines it calculates the absolute deviation – i.e. the value for each record vs. the average of the whole column, which was encountered in Z-Scores and other outlier detection methods in the last series – for each row, then takes the average and divides it by the standard deviation. I haven’t yet found a good guide as to how far the resulting ratio should be from the target ratio (which is always the square root of two divided pi) to disqualify a distribution from being Gaussian, but I know from experience that Column1 is highly abnormal, whereas Column2 pretty much follows a bell curve. The first had a ratio of 0.921093 as depicted in Figure 1, whereas Figure 2 scored 0.823127 in a subsequent test, so the ratio converged fairly close to the target as expected.[vii] In its current form, the test lacks precision because there is no definite cut-off criteria, which may have been published somewhere I’m unaware of – especially since I’m an amateur learning as I go, which means I’m unaware of *a lot* that goes on in the fields related to data mining. It is still useful, however, because as a general rule of thumb we can judge that the abnormality of a dataset is proportional to how far the ratio is from the constant target value.’

**Climbing the Ladder of Sophistication with Goodness-of-Fit**

I’m fairly sure that the second float column in the Higgs Boson Dataset is Gaussian and certain that the first is not, given the shapes of the histograms provided for both in Outlier Detection with SQL Server, part 6.1: Visual Outlier Detection with Reporting Services. Histograms represent the easiest visual test of normality you can find; it make take someone with more statistical training than I have to interpret borderline cases, but any layman can detect at a glance when a distribution is definitely following some other shape besides a bell curve. In the next installment of the series, I hope to explain how to use a couple of other visual detection methods like probability plots and Q-Q plots which are more difficult to code and reside at the upper limit of what laymen can interpret at a glance. I had a particularly difficult time calculating the CDFs for the normal distribution, for example. After that I will most likely write something about skewness, kurtosis and the Jarque-Bera test, which are also still within the upper limit of what laymen can interpret; in essence, that group measures how lopsided a distribution is on the left or right side (or “tail”) of its histogram. I wrote code for some of those measures long ago, but after that I will be in uncharted territory with topics with imposing names like the Shapiro-Wilk, D’Agostino’s K-Squared, Hosmer–Lemeshow, Chi-Squared, G, Kolmogorov-Smirnov, Anderson-Darling, Kuiper’s and Lilliefors Tests. I have a little experience with the Likelihood Ratio Test Statistic, Coefficient of Determination (R^{2}) and Lack-of-Fit Sum of Squares, but the rest of these are still a mystery to me.

…………This brings me to my usual disclaimer: I’m publishing this series in order to learn the topic, since the act of writing helps me digest new topics a lot faster and forces me to think about them more explicitly. I mainly teach through misadventure; my posts often end up as cautionary tales that amount to, “Don’t go about this the way this guy did.” There’s still some value in that, but always take the word of a professional over mine if I say anything that contradicts them; my word may carry weight in topics I have expertise in (such as foreign policy history, which I could teach at the graduate school level at the drop of a hat) but data mining and the associated statistics are definitely not among them (yet). Hopefully by the end of the series I will have learned more about probability distributions and their associated tests and made some contributions towards coding them in T-SQL; I may post a coda at the end with a use case map that can help DBAs differentiate at a glance between the various goodness-of-fit tests and their proper applications for particular distributions. At present I plan to end the series with a group of six goodness-of-fit test with wickedly cool names like the Cramér–von Mises, the Deviance, Focused, Hannan-Quinn, Bayesian and Akaike Information Criterions. The last two of these are mentioned frequently in the information theory literature, which will help provide another springboard towards a much more interesting series I’ve been planning to write for some time, Information Measurement with SQL Server. I already built two bridges to this potentially useful but really advanced series at the tail end of Outlier Detection with SQL Server, with my posts on Cook’s Distance and Mahalanobis Distance. My earlier tutorial series on A Rickety Stairway to SQL Server Data Mining also served as a bridge of sorts, since some of the algorithms are related to the methods of information measurement we’ll touch on in that future series. Understanding probability distributions and goodness-of-fit is a prerequisite of sorts to cutting edge topics like Shannon’s Entropy, Minimum Description Length (MDL) and Kolmogorov Complexity that I’ll deal with in that series, which may be quite useful to miners of SQL Server data.

[i] For a discussion, see the Wikipedia article “Probability Density Function” at http://en.wikipedia.org/wiki/Probability_density_function. I have seen “probability distribution function” used to denote both mass and density functions in other data mining and statistical literature, albeit infrequently.

[ii] See the __Wikipedia__ article “Characteristic Function” at

http://en.wikipedia.org/wiki/Characteristic_function_(probability_theory)

[iii] See the __Wikipedia__ article “Cauchy Distribution” http://en.wikipedia.org/wiki/Cauchy_distribution

[iv] See the __Wikipedia__ article “List of Probability Distributions” at http://en.wikipedia.org/wiki/List_of_probability_distributions

[v] See the __Wikipedia__ article “Normality Test” at http://en.wikipedia.org/wiki/Normality_test

[vi] See David, H. A.; Hartley, H. O. and Pearson, E. S., 1954, “The Distribution of the Ratio, in a Single Normal Sample, of Range to Standard Deviation,” pp. 482-493 in Biometrika, December 1954. Vol. 41, No. 3/4. I found the .pdf at the web address http://webspace.ship.edu/pgmarr/Geo441/Readings/David%20et%20al%201954%20-%20The%20Distribution%20of%20the%20Ratio%20of%20Range%20to%20Standard%20Deviation.pdf but it is apparently also available online at the JSTOR web address http://www.jstor.org/stable/2332728. I consulted other sources as well, like Dixon, W.J., 1950, Analysis of Extreme Values,” pp. 488-506 in __The Annals of Mathematical Statistics__. Vol. 21, No. 4. Available online at the Project Euclid web address http://projecteuclid.org/euclid.aoms/1177729747 and p. 484, E.S. Pearson, E.S. and Stephens, M. A., 1964, “The Ratio Of Range To Standard Deviation In The Same Normal Sample,” pp. 484-487 in __Biometrika__, December 1964. Vol. 51, No. 3/4. Published online at the JSTOR web address http://www.jstor.org/discover/10.2307/2334155?uid=2129&uid=2&uid=70&uid=4&sid=21105004560473

[vii] I verified the internal calculations against the eight-value example at the __MathBits.com __page “Mean Absolute Deviation,” which is available at the web address http://mathbits.com/MathBits/TISection/Statistics1/MAD.html

## Outlier Detection with SQL Server, part 8: A T-SQL Hack for Mahalanobis Distance

**By Steve Bolton**

…………Longer code and substantial performance limitations were the prices we paid in return for greater sophistication with Cook’s Distance, the topic of the last article in this series of amateur self-tutorials on identifying outliers with SQL Server. The same tradeoff was even more conspicuous in this final installment – until I stumbled across a shortcut to coding Mahalanobis Distance that really saved my bacon out of the fire. The incredibly cool moniker sounds intimidating, but the concepts and code required to implement it are trivial, as long as we sidestep the usual matrix math that ordinarily makes it prohibitively expensive to run on “Big Data”-sized tables. It took quite a while for me to blunder into a suitable workaround, but it was worthwhile, since Mahalanobis Distance merits a special place in the pantheon of outlier detection methods, by virtue of the fact that it is uniquely suited to certain use cases. Like Cook’s Distance, it can be used to find outliers defined by more than one column, which automatically puts both in a league the others surveyed in this series can’t touch; their competitors are typically limited to detecting unusual two-column values in cases where neither column is at the extreme low or high end, Cook’s D and Mahalanobis Distance sometimes flag unusual intermediate values. The latter, however, can also be extended to more than two columns. Better yet, it also accounts for distortions introduced by variance into the distances between data points, by renormalizing them on a consistent scale that is in many cases equivalent to ordinary Euclidean Distance. Best of all, efficient approximations can be derived through a shortcut that renders all of the complex matrix math irrelevant; since the goal in outlier detection is mainly to serve as an alarm bell to draw attention to specific data points that might warrant human intervention, we can sacrifice a little accuracy in return for astronomical performance gains.

…………For both the cool name and the even cooler multidimensional outlier detection capabilities, we can thank Prasanta Chandra Mahalanobis (1893-1972), who was born in Calcutta at a time when Gandhi, Mother Teresa, distant tech support call centers, Bangladesh and the other things Westerners associate today with the region were still in the far-flung future. He and his grandfather may have acted as moderating influences in Brahmo Samaj, a 19^{th} Century offshoot of Hinduism that has since apparently died out in Bangladesh; later in life he “served as secretary to Rabindranath Tagore, particularly during the latter’s foreign travels.”[i] Some of that polymath’s brilliance must have rubbed off, because Mahalanobis responded to a dilemma he encountered while trying to compare skull sizes by inventing an entirely new measure of similarity, which can be adapted to finding outliers based on how unalike they are. It has many applications, but for our purposes it is most appropriate for finding multidimensional outliers. If you want to find out how unusual a particular value is for a particular column, any of the detection methods presented earlier in this series may suffice, if all of their particular constraints are taken into account – save for Cook’s Distance, which is a comparison between two columns. Mahalanobis Distance takes the multicolumn approach one step further and represents one of the few means available for finding out whether a particular data point is unusual when compared to several columns at same time.

**The Litmus Test: Comparing Outliers to the Shape of Your Data**

Think of it this way: instead of measuring the distance of a single data point or mean to a standard deviation or variance, as we do so often in statistics, we’re measuring several variables against an entire multidimensional matrix of multiple columns, as well as the variances, covariances and averages associated with them. These extra columns allow us to compare our data points against a shape that is more geometrically complex than the single center point defined by a simple average or median. That is why Mahalanobis Distance is intimately related to the field of Principal Components Analysis, i.e. the study of various axes that make up a multidimensional dataset. The metric also has distinctive interrelationships with the regression metric known as leverage[ii], the normal distribution (i.e. the bell curve)[iii], Fisher Information and Beta and Chi-Squared distributions[iv] that are still far above my head, but I was able to explain it to myself crudely in this way: the metric measures how many standard deviations[v] a data point is from a set of center points for all of the columns under consideration, which taken together form an ellipsoid[vi] which is transformed into a circle by the constituent mathematical operations. Don’t let the big word ellipsoid fool you, because it’s actually quite obvious that any normal scatter plot of data points will form a cloud in the shape of an irregular curve around the center of the dataset. It’s also obvious that the center will have a more complex shape than when we use a single variable, since if we have three means or medians taken from three columns we could make a triangle out of them, or a four-sided shape like a rectangle from similar measures applied to four columns, and so on; the only difference is that we have to do some internal transformations to the shape, which need not concern us. Suppose, for example, that you wanted to discover if a particular sound differs from the others in a set by its pitch; in this case, you could simply use a typical unidimensional outlier detection method that merely examines the values recorded for the pitch. You could get a more complete picture, however, by taking into account other variables like the length of the song it belongs to, the type of instrument that produced and so on.

…………The price of this more subtle and flexible means of outlier detection would be quite high in terms of both performance and code maintenance, if our implementation consisted of translating the standard matrix math notation into T-SQL. I programmed an entire series of T-SQL matrix procedures to do just that, which seemed to perform reasonably well and with greater accuracy than the method in Figure 1 – until I hit the same SQL Server internals barrier I did with Cook’s Distance in the last article. To put it bluntly, we can’t use recursive calls to table-valued parameters to implement this sort of thing, because we’ll hit the internal limit of 32 locking classes rather quickly, leading to “No more lock classes available from transaction” errors. This long-standing limitation is by design, plus there are apparently no plans to change it and no widely publicized workarounds, so that’s pretty much the end of the line (unless factorization methods and similar matrix math workarounds I’m not familiar with might do the trick).

**Bypassing the Matrix Math**

I had to put Mahalanobis Distance on the back burner for months until I stumbled across a really simple version expressed in ordinary arithmetic notation (summation operators, division symbols, that sort of thing) rather than matrix operations like transposes and inverses. Unfortunately, I can’t remember where I found this particular formula to give adequate credit, but it allowed me to cut two chop at least two lengthy paragraphs out of this article, which I originally included to explain how the inner working of the gigantic matrix math routines I wrote; otherwise, I might’ve set a SQL Server community record for the lengthiest T-SQL sample code ever crammed into a single blog post. Instead, I present Figure 1, which is short enough to be self-explanatory; the format ought to be familiar to anyone who’s been following this series, since it features similar parameter names and dynamic SQL operations. The good news is that the results derived from the Duchennes muscular dystrophy dataset I’ve been using for practice data throughout this series aren’t substantially different from those derived through the matrix math method. There are indeed discrepancies, but this approximation is good enough to get the job done without any noteworthy performance hit at all.

…………Keep in mind that the results of outlier detection methods are rarely fed into other calculations for further refinement, so perfect accuracy is not mandatory as it might be with hypothesis testing or many other statistical applications. The point of all of the T-SQL sample code in this series is to automate the detection of outliers, whereas their handling requires human intervention; all we’re doing is flagging records for further attention, so that experts with domain knowledge can cast trained eyes upon them, looking for relevant patterns or perhaps evidence of data quality problems. The goal is inspection, not perfection. A few years back I read some articles on how the quality of being “good enough” is affecting software economics of late (although I can’t rustle up the citations for those either) and this hack for Mahalanobis Distance serves as a prime example. It’s not as pretty as a complete T-SQL solution that matches the more common matrix formula exactly, but it serves the purposes of end users just as well – or perhaps even better, considering the short code is more easily maintained and the performance is stellar. This sample code runs in about 20 milliseconds on desktop computer (which could hardly be confused with a real server), compared to 19 for the Cook’s D procedure in the last tutorial. The cool thing is that it scales much better. My implementation of Cook’s D can’t be run at all on the Higgs Boson Dataset I’ve been using to stress-test my code with in this series[vii], because the regression stats would have to be recalculated for each of the 11 million rows, thereby leading to exponential running times and the need to store 121 trillion regression rows in TempDB. That’s not happening on anyone’s server, let alone my wheezing Frankenstein of a desktop. My Mahalanobis hack ran in a respectable 3:43 on the same Higgs Boson data. The lesson I learned from coding Mahalanobis and Cook’s Distances in T-SQL is that arithmetic formulas ought to be preferred to ones defined in matrix notation, whenever possible, even if that entails resorting to approximations of this kind. The difficulty consists of finding them, perhaps hidden in the back of some blog post or journal article in the dark corners of the Internet.

** Figure 1: Code for the Mahalanobis Distance Procedure**CREATE PROCEDURE Calculations.OutlierDetectionMahalanobisDistanceSP

@Database1 nvarchar(128), @Schema1 nvarchar(128), @Table1 nvarchar(128), @Column1 AS nvarchar(128), @Column2 AS nvarchar(128)

AS

DECLARE @SchemaAndTable1 nvarchar(400),@SQLString nvarchar(max)

SET @SchemaAndTable1 = @Database1 + ‘.’ + @Schema1 + ‘.’ + @Table1

SET @SQLString = ‘DECLARE @Var1 decimal(38,32)

SELECT @Var1 = Var(CAST(‘ + @Column1 + ‘ AS decimal(38,32)))

FROM ‘ + @SchemaAndTable1 + ‘

WHERE ‘ + @Column1 + ‘ IS NOT NULL AND ‘ + @Column2 + ‘ IS NOT NULL

SELECT ‘ + @PrimaryKeyName + ‘ AS PrimaryKey, ‘ + @Column1 + ‘ AS Value1, ‘ + @Column2 + ‘ AS Value2,

Power(Power(‘ + @Column1 + ‘ – ‘ + @Column2 + ‘, 2) / (@Var1), 0.5) AS MahalanobisDistance

FROM ‘ + @SchemaAndTable1 + ‘

WHERE ‘ + @Column1 + ‘ IS NOT NULL AND ‘ + @Column2 + ‘ IS NOT NULL

ORDER BY MahalanobisDistance DESC’

–SELECT @SQLString — uncomment this to debug dynamic SQL errors

EXEC (@SQLString)

__Figure 2: Results for the Mahalanobis Distance Procedure__

EXEC Calculations.OutlierDetectionMahalanobisDistanceSP

@Database1 = N’DataMiningProjects‘,

@Schema1 = N’Health‘,

@Table1 = N’DuchennesTable‘,

@PrimaryKeyName = N’ID’,

@Column1 = N’CreatineKinase‘,

@Column2 = ‘Hemopexin’

…………There are still some issues left to be worked out with this approximation of Mahalanobis Distance. I’m not yet sure under which conditions we can expect better accuracy, or conversely greater discrepancies from the matrix version. I know Mahalanobis Distance can also be extended to more than two columns, unlike Cook’s D, but I have yet to engineer a solution. Moreover, I have yet to wrap my head around all of the subtle cases where Mahalanobis is less applicable; for example, it apparently isn’t as appropriate when the relationships are nonlinear.[viii] As I’ve come to realize through reading up on statistical fallacies, these tricky situations can make all of the difference in the world between a mining model that helps end users to make informed decisions and ones that can mislead them into disastrous mistakes. Deriving the numbers certainly isn’t easy, but it is even harder to attach them to the wider scaffolding of hard logic in a meaningful way. As many statisticians themselves decry, that is precisely where a lot of science and public discourse go awry. Thankfully, these issues aren’t life-and-death matters in outlier detection, where the goal is to act as an alarm bell to alert decision-makers, rather than as a decision in and of itself; as I’ve pointed out throughout this series ad infinitum, ad nauseum, these detection methods only tell us that a data point is aberrant, but say nothing about *why*. This is why knee-jerk reactions like simply deleting outliers are not only unwarranted, but can and often are used for deceptive purposes, particularly when money, reputations and sacred cows are on the line. The frequency with which this sort of chicanery still happens is shocking, as I mentioned earlier in the series. As I’ve learned along the way, perhaps the second-most critical problem dogging outlier detection is the lack of methods capable of dealing with “Big Data”-sized databases, or even the moderately sized tables of a few thousand or millions rows as we see routinely in SQL Server. Most of them simply choke and a few are even invalid or incalculable. It might be useful to develop new ones more suited to these use cases, or track down and popularize any that might’ve already been published long ago in the math literature.

…………Despite such subtle interpretive risks, Mahalanobis Distance is the only statistic I’m aware of in my minimal experience that can be applied to the case of multidimensional outliers, beyond the two columns Cook’s D is limited to. In this capacity it acts as a dissimilarity measure, but can also be used for the converse purpose as a measure of similarity. Its scale-invariance and status as a “dimensionless quantity,” i.e. a pure number attached to no particular system of unit measurement, apparently have their advantages as well.[ix] It can be used in other capacities in data mining, such as in feature selection in Bayesian analysis.[x] I don’t necessarily understand a good share of the data mining and machine learning literature I’ve read to date, but can tell by the frequency it crops up that Mahalanobis Distance has diverse uses beyond mere outlier detection. In a future mistutorial series, I intend to demonstrate just how little I know about several dozen other metrics commonly used in the field of data mining, like Shannon’s Entropy, Bregman’s Divergence, the Aikake Information Criterion, Sørensen’s Similarity Index and Lyapunov Exponent. I’ll also include a whole segment on probabilistic applications of distance measures, such as the popular Küllback-Leibler Divergence, all of which turned out to be easier to code and explain than Cook’s D and Mahalanobis Distance. It only gets easier from here on in, at least in terms of common distance measures. I have no timetable for finishing the dozens of such metrics I intend to survey (if all goes according to plan, I will be posting data mining code on this blog for many years to come) but by the time I’m finished with the series tentatively titled Information Measurement with SQL Server, it should be easier than ever before to quantify just how much information there is in every table. We’ll also be able to measure such properties randomness among a column’s values. Before diving into it, however, I might post a quick wrap-up of this series that includes a makeshift use diagram that classifies all of the outlier detection methods covered in this series, as well as a makeshift method of detecting interstitial outliers that I cooked up to meet some specific use cases, one that allowed me to spot a data quality issue in my own databases. I’ll also take a quick detour into coding goodness-of-fit tests in SQL Server, since these seemed to have quite a bit of overlap with some of the outlier detection methods mentioned earlier in this series. Knowing what probability distribution one is dealing can sometimes tell us an awful lot about the underlying processes that produced it, so they can be indispensable tools in DIY data mining on SQL Server.

[i] I glanced at the biography at the Wikipedia page “Prasanta Chandra Mahalanobis,” at the web address http://en.wikipedia.org/wiki/Prasanta_Chandra_Mahalanobis

[ii] See the __Wikipedia__ page “Mahalanobis Distance” at http://en.wikipedia.org/wiki/Mahalanobis_distance

[iii] *IBID.*

[iv] “When an infinite training set is used, the Mahalanobis distance between a pattern measurement vector of dimensionality D and the center of the class it belongs to is distributed as a chi2 with D degrees of freedom. However, the distribution of Mahalanobis distance becomes either Fisher or Beta depending on whether cross validation or resubstitution is used for parameter estimation in finite training sets. The total variation between chi2 and Fisher, as well as between chi2 and Beta, allows us to measure the information loss in high dimensions. The information loss is exploited then to set a lower limit for the correct classification rate achieved by the Bayes classifier that is used in subset feature selection.” Ververidis, D. and Kotropoulos, C., 2009, “Information Loss of the Mahalanobis Distance in High Dimensions: Application to Feature Selection,” pp. 2275-2281 in IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 31, No. 12. See the abstract available at http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=4815271&url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel5%2F34%2F5291213%2F04815271.pdf%3Farnumber%3D4815271

[v] “One interesting feature to note from this figure is that a Mahalanobis distance of 1 unit corresponds to 1 standard deviation along both primary axes of variance.” See the __Jennes Enterprises__ webpage titled “Description” at http://www.jennessent.com/arcview/mahalanobis_description.htm.

[vi] See the post by jjepsuomi titled “Distance of a Test Point from the Center of an Ellipsoid,” published Jun 24, 2013 in the StackExchange Mathematics Forum, as well as the the reply by Avitus on the same date.Available online at http://math.stackexchange.com/questions/428064/distance-of-a-test-point-from-the-center-of-an-ellipsoid__. __Also see jjepsuomi post titled “Bottom to Top Explanation of the Mahalanobis Distance,” published June 19, 2013 in the __CrossValidated__ forums. Available online at http://stats.stackexchange.com/questions/62092/bottom-to-top-explanation-of-the-mahalanobis-distance. The folks at __CrossValidated__ gave me some help on Aug. 14, 2014 with these calculations in the thread titled “Order of Matrix Operations in Mahalanobis Calculations,” which can be found at http://stats.stackexchange.com/questions/111871/order-of-matrix-operations-in-mahalanobis-calculations

[vii] I downloaded from the University of California at Irvine’s Machine Learning Repository a long time ago and converted it into a SQL Server table of about 6 gigabytes.

[viii] See Rosenmai, Peter, 2013, “Using Mahalanobis Distance to Find Outliers,” posted Nov. 25, 2013 at the __EurekaStatistics.com __web address http://eurekastatistics.com/using-mahalanobis-distance-to-find-outliers

[ix] See the __Wikipedia__ pages “Mahalanobis Distance” and “Scale Invariance” at http://en.wikipedia.org/wiki/Mahalanobis_distance and http://en.wikipedia.org/wiki/Scale_invariance

[x] See Ververidis and Kotropoulos.

## Outlier Detection with SQL Server, part 7: Cook’s Distance

**By Steve Bolton[**

…………I originally intended to save Cook’s and Mahalanobis Distances to close out this series not only because the calculations and concepts are more difficult yet worthwhile to grasp, but also in part to serve as a bridge to a future series of tutorials on using information measures in SQL Server, including many other distance measures. The long and short of it is that since I’m learning these topics as I go, I didn’t know what I was getting myself into and ended finishing almost all of the other distance measures before Cook’s and Mahalanobis. Like the K-Means algorithm I recapped in the last tutorial and had already explained in greater depth in A Rickety Stairway to SQL Server Data Mining, Algorithm 7: Clustering, these two are intimately related to ordinary Euclidean Distance, so how hard could they be? Some other relatively common outlier detection methods are also based on K-Means relatives (like K-Nearest Neighbors) and from there to Euclidean Distance, so I won’t belabor the point by delving into them further here. There are also distance metrics in use today that are based on mind-bending alternative systems like affine, hyperbolic, elliptic and kinematic geometries in which these laws do not necessarily hold, after relaxing some of the Euclidean postulates; for example, the affine type of non-Euclidean geometry is useful in studying parallel lines, while the hyperbolic version is useful with circles.[1] Some of them are exotic, but others are quite useful in DIY data mining, as we shall see in a whole segment on probabilistic distances (like the Küllback-Leibler Divergence) in that future mistutorial series. What tripped me up in the case of Cook’s and Mahalanobis is that the most common versions of both rely on matrix math, which can present some unexpected stumbling blocks in SQL Server. In both cases I had to resort to alternative formulas, after running into performance and accuracy issues using the formulas based on standard notation. They’re entirely worthwhile to code in T-SQL, because they occupy an important niche in the spectrum of outlier detection methods. All of the methods introduced in this series allows us to automatically flag outliers for further inspection, which can be quite useful for ferreting out data quality issues, finding novel patterns and the like in large databases – where we don’t want to go around deleting or segregating thousands of records without some kind of intelligent examination first. Cook’s and Mahalanobis, however, stand out because they’re among the few standard ways of finding aberrant data points defined by more than one column. This also makes it capable of detecting unusual two-column values in cases where neither column is at the extreme low or high end, although that doesn’t happen often. These outlier detection methods are thus valuable to have on hand, despite the fact that “Cook’s D,” as it is often known, is still prohibitively costly to run on “Big Data”-sized databases, unlike my workaround for Mahalanobis Distance. The “D” may stand for “disappointing,” although it can still be quite useful on small and medium-sized datasets.

…………Cook’s Distance is suitable as the next stepping stone because we can not only draw upon the concept of distances between data points drawn from the K-Means version of the SSDM Clustering algorithm, but also make use of the lessons learned in A Rickety Stairway to SQL Server Data Mining, Algorithm 2: Linear Regression. Like so many other metrics discussed in this series, it made its debut in the American Statistical Association journal *Technometrics*, in this case in a paper published in 1977 by University of Minnesota statistician R. Dennis Cook, which I was fortunate enough to find a copy of. [2] The underlying equation[3] is not necessarily trivial, but the concepts underpinning it really shouldn’t be too difficult for anyone who can already grasp the ideas underpinning regression and Z-Scores, which have been dealt with in previous posts. I found it helpful to view some of the difference operations performed in Cook’s Distance (and the Mean Square Error (MSE) it depends upon) as a sort of twist on Z-Scores, in which we subtract data points from the data points predicted by a simple regression, rather than data points from the mean as we would in the deviation calculations that Z-Scores depend upon. After deriving each of these differences, we square them and sum them – just as we would in many other outlier detection calculations performed earlier in this tutorial series – then finally multiply by the reciprocal of the count.[4] The deviation calculation in the dividend of a Z-Scores can in fact be seen as a sort of crude distance metric in its own right, in which we are measuring how far each data point is from the center of a dataset, as defined in a mean or median; in the MSE, we are performing a similar distance comparison, except between a predicted value and actual value for a data point. To calculate Cook’s Distance we multiply the MSE by the count of parameters – i.e. which for our purposes means the number of columns we’re predicting, which is limited to just one in my code for now. The result forms the divisor in the final calculations, but the dividend is more complex. Instead of comparing a prediction to an actual value, we recalculate a new prediction for each data point in which the regression has been recalculated with that specific data point omitted, then subtract the result from the prediction made for that data point by the full regression model with no points omitted. The dividend is formed by squaring each of those results and summing them, in a process quite similar to the calculation of MSE and Z-Scores. The end result is a measuring stick that we can compare two-column data points against, rather than just one as we have with all of the other outlier detection methods in this series.

…………The difficulty in all of this is not the underlying concept, which is sound, but the execution, given that we have to recalculate an entirely new regression model for each data point. The dilemma is analogous to the one we faced in previous articles on the Modified Thompson Tau Test and Chauvenet’s Criterion, where we had to perform many of the computations recursively in order to recalculate the metrics after simulating the deletion of data points. Each of the difference operations we perform below tells us something about how important each record is within the final regression model, rather than how many outliers there might be if that record was eliminated, but it still presents a formidable performance problem. This drawback is magnified by the fact that we can’t use SQL Server’s exciting new windowing functions to solve the recursion issue with sliding windows, as we did in the articles on the Modified Thompson Tau test and Chauvenet’s Criterion. In fact, Step 5 in Figure 1 would be an ideal situation for the EXCLUDE CURRENT ROW clause that Itzik Ben-Gan, the renowned expert on SQL Server windowing functions, wants to see Microsoft add to the T-SQL specification.[5] As I discovered to my horror, you can’t use combine existing clauses like ROW UNBOUNDED AND ROWS 1 PRECEDING in conjunction with ROWS 1 FOLLOWING AND ROWS UNBOUNDED FOLLOWING to get the same effect. As a result, I had to perform the recalculations of the regressions in a series of table variables that are much less readable and efficient than an EXCLUDE CURRENT ROW clause might be, albeit more legible than the last remaining alternative, a zillion nested subqueries. I’m not yet fluent enough in T-SQL to say if these table variables cause more of a performance impact than subqueries in contexts like this, but this is one case in which they’re appropriate because readability is at a premium. It may also be worthwhile to investigate temporary tables as a replacement; so far, this method does seem to be faster than the common table expression (CTE) method I originally tried. I initially programmed an entire series of matrix math functions and stored procedures to derive both Cook’s and Mahalanobis Distances, since both are often defined in terms of matrix math notation, unlike many other distances used for data mining purposes. That method worked well, except that it ran into a brick wall: SQL Server has an internal limitation of 32 locking classes, which often leads to “No more lock classes available from transaction” error messages with recursive table-valued parameters. This is by design and I have yet to see any workarounds posted or any glimmer of hope that Microsoft intends to ameliorate it in future upgrades, which means no matrix math using table-valued parameters for the foreseeable future.

…………Yet another issue I ran into was interpreting the notation for Cook’s Distance, which can be arrived at from two different directions: the more popular method seems to be the series of calculations outlined two paragraphs above, but the same results can be had by first calculating an intermediate figure known as Leverage. This can be derived from what is known as a Hat Matrix, which can be easily derived in the course of calculating standard regression figures like MSE, predictions, residuals and the like. Unlike most other regression calculations, which are defined in terms of standard arithmetic operations like divisions, multiplication, etc. the notation for deriving Leverage is almost always given in terms of matrices, since it’s derived from a Hat Matrix. It took me a lot of digging to find an equivalent expression of Leverage in terms of arithmetic operations rather than matrix math, which I couldn’t use due to the locking issue. It was a bit like trying to climb a mountain, using a map from the other side; I was able to easily code all of the stats in the @RegressionTable in Figure 1, alongside many other common regression figures, but couldn’t tell exactly which of them could be used to derive the Hat Matrix and Leverage from the opposite direction. As usual, the folks at CrossValidated (StackExchange’s data mining forum) saved my bacon out of the fire.[6] While forcing myself to learn to code the intermediate building blocks of common mining algorithms in T-SQL, one of the most instructive lessons I’ve learned is that translating notations can be a real stumbling block, one that even professionals encounter. Just consider that a word to the wise, for anyone who tries to acquire the same skills from scratch as I’m attempting to do. Almost all of the steps in Figure 1 revolve around common regression calculations, i.e. intercepts, slopes, covariance and the like, except that fresh regression models are calculated for each row. The actual Cook’s Distance calculation isn’t performed until Step #6. At that point it was trivial to add a related stat known as DFFITS, which can be converted back and forth from Cook’s D; usually when I’ve seen DFFITS mentioned (in what little literature I’ve read), it’s in conjunction with Cook’s, which is definitely a more popular means of measuring the same quantity.[7] For the divisor, we use the difference between the prediction for each row and the prediction when that row is left out of the model and for the dividend, we use the standard deviation of the model when that row is omitted, times the square root of the leverage. I also included the StudentizedResidual and the global values for the intercept, slope and the like in the final results, since it was already necessary to calculate them along the way; it is trivial to calculate many other regression-related stats once we’ve derived these table variables, but I’ll omit them for the sake of brevity since they’re not directly germane to Cook’s Distance and DFFITS.

** Figure 1: T-SQL Sample Code for the Cook’s Distance Procedure**CREATE PROCEDURE Calculations.CooksDistanceSP

@Database1 nvarchar(128), @Schema1 nvarchar(128), @Table1 nvarchar(128), @Column1 AS nvarchar(128), @Column2 AS nvarchar(128)

AS

DECLARE @SchemaAndTable1 nvarchar(400),@SQLString1 nvarchar(max),@SQLString2 nvarchar(max)

SET @SchemaAndTable1 = @Database1 + ‘.’ + @Schema1 + ‘.’ + @Table1

SET @SQLString1 = ‘DECLARE

@MeanX decimal(38,21),@MeanY decimal(38,21), @StDevX decimal(38,21), @StDevY decimal(38,21), @Count bigint,

@Correlation decimal(38,21),

@Covariance decimal(38,21),

@Slope decimal(38,21),

@Intercept decimal(38,21),

@MeanSquaredError decimal(38,21),

@NumberOfFittedParameters bigint

SET @NumberOfFittedParameters = 2

DECLARE @RegressionTable table

(ID bigint IDENTITY (1,1),

Value1 decimal(38,21),

Value2 decimal(38,21),

LocalSum bigint,

LocalMean1 decimal(38,21),

LocalMean2 decimal(38,21),

LocalStDev1 decimal(38,21),

LocalStDev2 decimal(38,21),

LocalCovariance decimal(38,21),

LocalCorrelation decimal(38,21),

LocalSlope AS LocalCorrelation * (LocalStDev2 / LocalStDev1),

LocalIntercept decimal(38,21),

PredictedValue decimal(38,21),

Leverage decimal(38,21),

AdjustedPredictedValue decimal(38,21),

GlobalPredictionDifference AS Value2 – PredictedValue,

AdjustmentDifference AS PredictedValue – AdjustedPredictedValue

)

INSERT INTO @RegressionTable

(Value1, Value2)

SELECT ‘ + @Column1 + ‘, ‘ + @Column2 + ‘

FROM ‘ + @SchemaAndTable1 + ‘

WHERE ‘ + @Column1 + ‘ IS NOT NULL AND ‘ + @Column2 + ‘ IS NOT NULL

— STEP #1 – RETRIEVE THE GLOBAL AGGREGATES NEEDED FOR OTHER CALCULATIONS

SELECT @Count=Count(CAST(Value1 AS Decimal(38,21))),

@MeanX = Avg(CAST(Value1 AS Decimal(38,21))), @MeanY = Avg(CAST(Value2 AS Decimal(38,21))),

@StDevX = StDev(CAST(Value1 AS Decimal(38,21))), @StDevY = StDev(CAST(Value2 AS Decimal(38,21)))

FROM @RegressionTable

— STEP #2 – CALCULATE THE CORRELATION (BY FIRST GETTING THE COVARIANCE)

SELECT @Covariance = SUM((Value1 – @MeanX) * (Value2 – @MeanY)) / (@Count – 1)

FROM @RegressionTable

— once weve got the covariance, its trivial to calculate the correlation

SELECT @Correlation = @Covariance / (@StDevX * @StDevY)

— STEP #3 – CALCULATE THE SLOPE AND INTERCEPT AND MAKE PREDICTIONS

SELECT @Slope = @Correlation * (@StDevY / @StDevX)

SELECT @Intercept = @MeanY – (@Slope * @MeanX)

UPDATE @RegressionTable

SET PredictedValue = (Value1 * @Slope) + @Intercept

— STEP #4 – CALCULATE THE MEAN SQUARED ERROR

— subtract the actual values from the PredictedValues and square them; add em together; then multiple the result by the reciprocal of the count

— as defined at the Wikipedia page “Mean Squared Error” http://en.wikipedia.org/wiki/Mean_squared_error

SELECT @MeanSquaredError = SUM(Power((PredictedValue – Value2), 2)) / CAST(@Count – @NumberOfFittedParameters AS float

FROM @RegressionTable

— STEP #5 – NOW CALCULATE A SLIDING WINDOW

— recalculate alternate regression models for each row, plus the leverage from intermediate steps

— none of this is terribly complicated; theres just a lot to fi

— the outer select is needed here because aggregates arent allowed in the main UPDATE statement (silly limitation)

UPDATE T0

SET LocalMean1 = T3.LocalMean1, LocalMean2 = T3.LocalMean2, LocalStDev1 = T3.LocalStDev1, LocalStDev2 = T3.LocalStDev2

FROM @RegressionTable AS T0

INNER JOIN

(SELECT T1.ID AS ID, Avg(T2.Value1) AS LocalMean1, Avg(T2.Value2) AS LocalMean2, StDev(T2.Value1) AS LocalStDev1, StDev(T2.Value2) AS LocalStDev2

FROM @RegressionTable AS T1

INNER JOIN @RegressionTable AS T2

ON T2.ID > T1.ID OR T2.ID < T1.ID

GROUP BY T1.ID) AS T3

ON T0.ID = T3.ID

‘

SET @SQLString2 = ‘UPDATE T0

SET LocalCovariance = T3.LocalCovariance, LocalCorrelation = T3.LocalCovariance / (LocalStDev1 * LocalStDev2), LocalSum = T3.LocalSum

FROM @RegressionTable AS T0

INNER JOIN (SELECT T1.ID AS ID, SUM((T2.Value1 – T2.LocalMean1) * (T2.Value2 – T2.LocalMean2)) / (@Count – 1) AS LocalCovariance,

SUM(Power(T2.Value1 – T2.LocalMean1, 2)) AS LocalSum

FROM @RegressionTable AS T1

INNER JOIN @RegressionTable AS T2

ON T2.ID > T1.ID OR T2.ID < T1.ID

GROUP BY T1.ID) AS T3

ON T0.ID = T3.ID

UPDATE T0

SET Leverage = T3.Leverage

FROM @RegressionTable AS T0

INNER JOIN (SELECT ID, Value1, 1 / CAST(@Count AS float) + (CASE WHEN Dividend1 = 0 THEN 0 ELSE Divisor1 / Dividend1 END) AS Leverage

FROM (SELECT ID, Value1, Power(Value1 – LocalMean1, 2) AS Divisor1, LocalSum AS Dividend1, Power(Value2 – LocalMean2, 2) AS Divisor2

FROM @RegressionTable) AS T2) AS T3

ON T0.ID = T3.ID

UPDATE @RegressionTable

SET LocalIntercept = LocalMean2 – (LocalSlope * LocalMean1)

UPDATE @RegressionTable

SET AdjustedPredictedValue = (Value1 * LocalSlope) + LocalIntercept

— #6 RETURN THE RESULTS

SELECT ID, Value1, Value2, StudentizedResidual,Leverage,CooksDistance,DFFITS

FROM (SELECT ID, Value1, Value2, GlobalPredictionDifference / LocalStDev1 AS StudentizedResidual, Leverage,

(Power(GlobalPredictionDifference, 2) / (@NumberOfFittedParameters * @MeanSquaredError)) * (Leverage / Power(1 – Leverage, 2)) AS CooksDistance, AdjustmentDifference / (LocalStDev2 * Power(Leverage, 0.5)) AS DFFITS

FROM @RegressionTable) AS T1

ORDER BY CooksDistance DESC

— also return the global stats

— SELECT @MeanSquaredError AS GlobalMeanSquaredError, @Slope AS GlobalSlope, @Intercept AS GlobalIntercept, @Covariance AS GlobalCovariance, @Correlation AS GlobalCorrelation

‘

SET @SQLString1 = @SQLString1 + @SQLString2

–SELECT @SQLString1 — uncomment this to debug dynamic SQL errors

EXEC (@SQLString1)

…………Each of the procedures I’ve posted in previous articles has made use of dynamic SQL similar to that in Figure 1, but in this case there’s simply a lot more of it; in this case, it helps to a least have the operations presented sequentially in a series of updates to the @RegressionTable variable rather than bubbling up from the center of a set of nested subqueries. The first three steps in Figure 1 are fairly straightforward: we retrieve the global aggregates we need as usual, then calculate the covariance (a more expensive operation that involves another scan or seek across the table) from them, followed by the slope and intercept in succession.[8] The MSE calculation in Step 4 requires yet another scan or seek across the whole table. Step 5 accounts for most of the performance costs, since we cannot use the aggregates derived in Step 1 for the new regression models we have to build for each data point. It was necessary to break up the dynamic SQL into two chunks via the second SET @SQLString = @SQLString + ‘ statement, which prevents a bug (or “feature”) that apparently limits the size of strings that can be assigned at any one time, even with nvarchar(max).[9] Various thresholds are sometimes baked into the algorithm to flag “influential points” but I decided to allow users to add their own, in part to shorten the code and in part because there’s apparently not a consensus on what those thresholds ought to be.[10]

…………Aside from the lengthy computations, the Cook’s Distance procedure follows much the same format as other T-SQL solutions I’ve posted in this series. One of the few differences is that there is an extra Column parameter so that the user can compare two columns in any database for which they requisite access, since Cook’s Distance involves a comparison between two columns rather than a test of a single column as in previous tutorials. The @DecimalPrecision parameter is still available so that users can avoid arithmetic overflows by manually setting a precision and scale appropriate to the columns they’ve selected. To decomplicate things I omitted the usual @OrderByCode for sorting the results and set a default of 2 for @NumberOfFittedParameters. As usual, the procedure resides in a Calculations schema and there is no code to handle validation, SQL injection or spaces in object names. Uncommenting the next-to-last line allows users to debug the dynamic SQL.

** Figure 2: Results for the Cook’s Distance Query**EXEC Calculations.CooksDistanceSP

@Database1 = N’DataMiningProjects‘,

@Schema1 = N’Health‘,

@Table1 = N’DuchennesTable‘,

@Column1 = N’PyruvateKinase‘,

@Column2 = ‘Hemopexin’

…………As I have in many previous articles, I ran the first test query against a 209-row dataset on the Duchennes form of muscular dystrophy, which I downloaded from the Vanderbilt University’s Department of Biostatistics. As the results in Figure 2 show, the protein Hemopexin had the greatest influence on the Pyruvate Kinase enzyme at the 126^{th} record. Here the Cook’s Distance was 0.081531, which was about 4 times higher than the value for the sixth-highest Cook’s Distance, with a bigint primary key of 23, so we may safely conclude that this record is an outlier, unless existing domain knowledge suggests that this particular point is supposed to contain such extreme values. Be warned that for a handful of value pairs, my figures differ from those obtained in other mining tools (which believe it or not, also have discrepancies between each other) but I strongly suspect that depends on how nulls and divide-by-zeros are dealt with, for which there is no standard method in Cook’s D. These minor discrepancies are not of critical importance, however, since the outlier detection figures are rarely plugged into other calculations, nor is it wise to act on them without further inspection.

…………The procedure executed in 19 milliseconds on the table I imported the Duchennes data into, but don’t let that paltry figure deceive you: on large databases, the cost rises exponentially to the point where it becomes prohibitive. There were only a handful of operators, including two Index Seeks which accounted for practically the entire cost of the query, which means that it may be difficult to gain much performance value from optimizing the execution plans. This brings us to the bad news: the procedure simply won’t run against datasets of the same size as the Higgs Boson dataset I downloaded from the University of California at Irvine’s Machine Learning Repository and have being using to stress-test my sample T-SQL throughout this series. Since we need to recalculate a new regression model for each of the 11 million rows, we’re at least talking about 11 million squared, or 121 trillion rows of regression data in order to derive 11 million separate Cook’s Distances. I believe that puts us in the dreaded EXPTIME and EXPSPACE computation complexity classes; without an EXCLUDE CURRENT ROW windowing clause or some other efficient method of calculating intermediate regression aggregates in one pass, I know of no other way to reduce this down from an exponential running time to a polynomial. I’m weak in GROUP BY operations, so perhaps another workaround can be derived through those – but if not, we’re up the proverbial creek without a paddle. Even if you can wait the lifetime of the universe or whatever it takes to run the 11,000,000^{2} regression operations, it is unlikely that you’ll have enough spare room in TempDB for 121 trillion rows. The price to be paid for the more sophisticated insights Cook’s Distance provides is that it simply cannot be run against Big Data-sized datasets, at least in its current form.

…………As we’ve seen so many times in this series, scaling up existing outlier detection methods to Big Data sizes doesn’t merely present performance issues, but logical ones; in the case of Cook’s Distance, omitting a single observation is only going to have an infinitesimal impact on a regression involving 11 million records, no matter how aberrant the data point might be. Since it is derived from linear least squares regression, Cook’s Distance shares some of its limitations, like “the shapes that linear models can assume over long ranges, possibly poor extrapolation properties, and sensitivity to outliers.”[11] We’re trying to harness that sensitivity when performing outlier detection, but the sheer size of the regression lines generated from Big Data made render it too insensitive to justify such intensive computations. When you factor in the performance costs of recalculating a regression model for that many rows the usefulness of this outlier identification method obviously comes into question. On the other hand, the procedure did seem to identify outliers with greater accuracy when run against other tables I’m very familiar with, which consisted of a few thousand rows apiece. There may be a happy medium at work here, in which Cook’s Distance is genuinely useful for a certain class of moderately sized tables in situations where the extra precision of this particular metric is needed. When deciding whether or not the extra computational costs is worth it for a particular table, keep in mind that the performance costs are magnified in my results because I’m running them on a wimpy eight-core semblance of an AMD workstation that has more in common with Sanford and Son’s truck than a real production environment server. Furthermore, the main uses in this field for outlier detection of any kind are in exploratory data mining and data quality examinations, which don’t require constant, ongoing combing of the database for outliers; these are issues of long-term importance, not short-term emergencies like a relational query that has to be optimized perfectly because it may have to run every day, or even every second. Tests like this can be left for off-peak hours on a weekly or monthly basis, so as not to interfere with normal operations. Cook’s Distance might also be preferred when searching for a specific type of outlier, i.e. those that could throw off predictive modeling, just as Benford’s Law is often selected when identifying data quality problems is paramount, especially the intentional data quality issue we call fraud. Cook’s Distance might also prove more useful in cases where the relationship between two variables is at the heart of the questions that the tester chooses to ask. Cook’s and DFFITS can also apparently be used to convert back and forth from another common stat I haven’t yet learned to use, the Wald Statistic, which is apparently used for ferreting out the values of unknown parameters.[12]. If there’s one thing I’ve learned while writing this series, it’s that there’s a shortage of outlier detection methods appropriate to the size of the datasets that DBAs work with. Thankfully, the workaround I translated into T-SQL for my next column allows us to use Mahalanobis Distance to find outliers across columns, without the cosmic computational performance hit for calculating Cook’s D on large SQL Server databases. As with Cook’s D, there are some minor accuracy issues, but these are merely cosmetic when looking for outliers, where detection can be automated but handling ought to require human intervention.

[1] For a quick run-down, see the __Wikipedia__ page “Non-Euclidean Geometry” at http://en.wikipedia.org/wiki/Non-Euclidean_geometry

[2] Cook, R. Dennis, 1977, “Detection of Influential Observations in Linear Regression,” pp. 15-18 in __Technometrics__, February 1977. Vol. 19, No. 1. A .pdf version is available at the Universidad de São Paulo’s Instituto de Matematica Estatística web address http://www.ime.usp.br/~abe/lista/pdfWiH1zqnMHo.pdf

[3] I originally retrieved it from the __Wikipedia__ page “Cook’s Distance” at http://en.wikipedia.org/wiki/Cook%27s_distance , but there’s no difference between it and the one in Cook’s paper.

[4] I used the formula defined at the __Wikipedia__ page “Mean Squared Error,” at the web address http://en.wikipedia.org/wiki/Mean_squared_error. The same page states that there are two more competing definitions, but I used the one that the Cook’s Distance page linked to (The __Wikipedia__ page “Residual Sum of Squares” at http://en.wikipedia.org/wiki/Residual_sum_of_squares may also be of interest.):

“In regression analysis, the term mean squared error is sometimes used to refer to the unbiased estimate of error variance: the residual sum of squares divided by the number of degrees of freedom. This definition for a known, computed quantity differs from the above definition for the computed MSE of a predictor in that a different denominator is used. The denominator is the sample size reduced by the number of model parameters estimated from the same data, (n-p) for p regressors or (n-p-1) if an intercept is used.[3] For more details, see errors and residuals in statistics. Note that, although the MSE is not an unbiased estimator of the error variance, it is consistent, given the consistency of the predictor.”

“Also in regression analysis, “mean squared error”, often referred to as mean squared prediction error or “out-of-sample mean squared error”, can refer to the mean value of the squared deviations of the predictions from the true values, over an out-of-sample test space, generated by a model estimated over a particular sample space. This also is a known, computed quantity, and it varies by sample and by out-of-sample test space.”

[5] p. 47, Ben-Gan, Itzik, 2012, __Microsoft SQL Server 2012 High-Performance T-SQL Using Window Functions__ . O’Reilly Media, Inc.: Sebastopol, California.

[6] See the CrossValidated thread titled “Is It Possible to Derive Leverage Figures Without a Hat Matrix?”, posted by SQLServerSteve on June 26, 2015 at http://stats.stackexchange.com/questions/158751/is-it-possible-to-derive-leverage-figures-without-a-hat-matrix . Also see the reply by the user Glen_B to the CrossValidated thread titled “Which of these points in this plot has the highest leverage and why?” on July 9, 2014 at http://stats.stackexchange.com/questions/106191/which-of-these-points-in-this-plot-has-the-highest-leverage-and-why/106314#106314

[7] See the formula at the Wikipedia page “DFFITS” at https://en.wikipedia.org/wiki/DFFITS

[8] I retrieved this formula from the most convenient source, the Dummies.com page “How to Calculate a Regression Line” at the web address http://www.dummies.com/how-to/content/how-to-calculate-a-regression-line.html

[9] See the response by the user named kannas at the StackOverflow thread, “Nvarchar(Max) Still Being Truncated,” published Dec. 19, 2011 at the web address http://stackoverflow.com/questions/4833549/nvarcharmax-still-being-truncated

[10] See the __Wikipedia__ page “Cook’s Distance” at http://en.wikipedia.org/wiki/Cook%27s_distance

[11] See National Institute for Standards and Technology, 2014, “4.1.4.1.Linear Least Squares Regression,” published in the online edition of the__ Engineering Statistics Handbook__. Available at http://www.itl.nist.gov/div898/handbook/pmd/section1/pmd141.htm

[12] See the Wikipedia pages “Cook’s Distance,” “DFFITS” and “Wald Test” at http://en.wikipedia.org/wiki/Cook%27s_distance,

http://en.wikipedia.org/wiki/DFFITS and http://en.wikipedia.org/wiki/Wald_test respectively.