# Goodness-of-Fit Testing with SQL Server Part 7.2: The Lilliefors Test

**By Steve Bolton**

…………Since I’m teaching myself as I go in this series of self-tutorials, I often have only a vague idea of the challenges that will arise when trying to implement the next goodness-of-fit test with SQL Server. In retrospect, had I known that the Lilliefors Test was so similar to the Kolmogorov-Smirnov and Kuiper’s Tests, I probably would have combined them into single article. The code for this week’s T-SQL stored procedure is nearly the same, as is the execution plan and the performance. The results are also quite similar right to those of the Kolmogorov-Smirnov Test for some of the practice data I’ve used throughout the series, differing in some cases by just a few decimal places. The slight differences may arise from one of the characteristics of the Lilliefors Test that differentiate it from its more famous cousin, namely that “this test of normality is more powerful than others procedures for a wide range of nonnormal conditions.”[i] Otherwise, they share many mathematical properties in common, like location and scale invariance – i.e., the proportions of the test statistics aren’t altered when using a different starting point or multiplying by a common factor.

…………On the other hand, the test is apparently more restrictive than the Kolmogorov-Smirnov, in that I’ve seen it referred to specifically as a normality test and I haven’t encountered any mention of it being applied to other distributions. Furthermore, its primary use cases seems to be those in which the variance of the data is unknown[ii], which often doesn’t apply in the types of million-row tables the SQL Server community works with daily. The late Hubert Lilliefors (1928-2008), a stats professor at George Washington University, published it in a Journal of the American Statistical Association titled “On the Kolmogorov-Smirnov Test for Normality with Mean and Variance Unknown” back in 1967[iii] – so augmenting its more famous cousin in a few niche scenarios seems to have been the raison d’etre from the beginning. We can always use more statistical tests in our toolbox to meet the never-ending welter of distributions that arise from actual physical processes, but I won’t dwell on the Lilliefors Test for long because its narrower use cases are less suited to our needs than those of the broader Kolmogorov-Smirnov Test.

**Differences from the Kolmogorov-Smirnov**

Another reason for not dwelling on it for too long is that most of the code is identical to that of the stored procedure posted in last week’s article. The Lilliefors Test quantifies the difference between the empirical distribution function (EDF) and cumulative distribution function (CDF) exactly the same as the Kolmogorov-Smirnov and Kuiper’s Tests do; in plain English, it orders the actual values and ranks them on a scale of 0 to 1 and computes the difference from the theoretical probability for the Gaussian “normal” distribution, or bell curve, which is also ranked on a scale of 0 to 1. A couple of notes of caution are in order here, because some of the sources I consulted mentioned inputting Z-Scores into the formula and using the standard normal distribution rather than the actual mean and standard deviation of the dataset, but I verified that the procedure is correct as it stands now against an example at Statd.com.[iv]

…………One of the main characteristics that set it apart from the Kolmogorov-Smirnov Test is that the test statistic is compared against the Lilliefors distribution, which apparently has a better Bahadur Slope[v] (one of many measures of the efficiency of test statistics) than its competitors in certain hypothesis testing scenarios. That is a broad topic I’ve downplayed for several reasons throughout the last two tutorial series. Among the reasons I’ve brought up in the past are the fact that SQL Server users more likely to be using these tests for exploratory data mining, not proving specific points of evidence, as well as the ease of misinterpretation of p-values, critical values and confidence intervals even among professional academic researchers. What we need are continuous measures of *how* closely a dataset follows a particular distribution, not simple Boolean either-or choices of the kind used in hypothesis testing, which reduce the information content of the test statistics as sharply as casting a float data type to a bit would in T-SQL. Furthermore, many of the lookup tables and approximation used in hypothesis testing are only valid up to a few hundred values, not the several million that we would need in Big Data scenarios.

**Abde and Molin’s Approximation**

The Lilliefors distribution was originally derived from Monte Carlo simulations (a broad term encompassing many types of randomized trials) and at least one attempt has been made to approximate it through a set of constants and equations.[vi] I implemented the approximation developed by Hervé Abdi and Paul Molin, but the first couple of SELECTs and the declarations following the comment “code for Molin and Abdi’s approximation” can be safely deleted if you don’t have a need for the P-values the block generates. I verified the P-Values and @A constants used to generate it against the examples given in their undated manuscript “Lilliefors/Van Soest’s Test of Normality,” but as is commonly the case with such workarounds in hypothesis testing, the algorithm is inapplicable when Big Data-sized values and counts are plugged into it.

…………Once @A falls below about 0.74 the approximation begins to return negative P-values and when it climbs above about 5.66 it produces P-values greater than 1, which I believe are invalid under the tenets of probability theory. Most of the practice datasets I plugged into the approximation returned invalid outputs, most of them strongly negative. This is a problem I’ve seen with other approximation techniques when they’re fed values beyond the expected ranges. Nevertheless, since I already coded it, I’ll leave that section intact in case anyone runs into scenarios where they can apply it to smaller datasets.

** Figure 1: T-SQL Code for the Lilliefors Goodness-of-Fit Test**CREATE PROCEDURE [Calculations].[GoodnessOfFitLillieforsTest]

@Database1 as nvarchar(128) = NULL, @Schema1 as nvarchar(128), @Table1 as nvarchar(128),@Column1 AS nvarchar(128), @OrderByCode as tinyint

AS

DECLARE @SchemaAndTable1 nvarchar(400),@SQLString nvarchar(max)

SET @SchemaAndTable1 = @Database1 + ‘.’ + @Schema1 + ‘.’ + @Table1

DECLARE @Mean float,

@StDev float,

@Count float

DECLARE @EDFTable table

(ID bigint IDENTITY (1,1),

Value float,

ValueCount bigint,

EDFValue float,

CDFValue decimal(38,37),

EDFCDFDifference decimal(38,37))

DECLARE @ExecSQLString nvarchar(max), @MeanOUT nvarchar(200),@StDevOUT nvarchar(200),@CountOUT nvarchar(200), @ParameterDefinition nvarchar(max)

SET @ParameterDefinition = ‘@MeanOUT nvarchar(200) UTPUT,@StDevOUT nvarchar(200) OUTPUT,@CountOUT nvarchar(200) OUTPUT ‘

SET @ExecSQLString = ‘SELECT @MeanOUT = CAST(Avg(‘ + @Column1 + ‘) as float),CAST(@StDevOUT = StDev(‘ + @Column1 + ‘) as float),CAST(@CountOUT = Count(‘ + @Column1 + ‘) as float)

FROM ‘ + @SchemaAndTable1 + ‘

WHERE ‘ + @Column1 + ‘ IS NOT NULL’

EXEC sp_executesql @ExecSQLString,@ParameterDefinition, @MeanOUT = @Mean OUTPUT,@StDevOUT = @StDev OUTPUT,@CountOUT = @Count OUTPUT

SET @SQLString = ‘SELECT Value, ValueCount, SUM(ValueCount) OVER (ORDER BY Value ASC) / CAST(‘ + CAST(@Count as nvarchar(50)) + ‘AS float) AS EDFValue

FROM (SELECT DISTINCT ‘ + @Column1 + ‘ AS Value, Count(‘ + @Column1 + ‘) OVER (PARTITION BY ‘ + @Column1 + ‘ ORDER BY ‘ + @Column1 + ‘) AS ValueCount

FROM ‘ + @SchemaAndTable1 + ‘

WHERE ‘ + @Column1 + ‘ IS NOT NULL) AS T1‘

INSERT INTO @EDFTable (Value, ValueCount, EDFValue)

EXEC (@SQLString)

UPDATE T1

SET CDFValue = T3.CDFValue, EDFCDFDifference = EDFValue – T3.CDFValue

FROM @EDFTable AS T1

INNER JOIN (SELECT DistinctValue, Calculations.NormalCalculationsingleCDFFunction (DistinctValue, @Mean, @StDev) AS CDFValue

FROM (SELECT DISTINCT Value AS DistinctValue

FROM @EDFTable) AS T2) AS T3

ON T1.Value = T3.DistinctValue

DECLARE @b0 float = 0.37872256037043,

@b1 float = 1.30748185078790,

@b2 float = 0.08861783849346,

@A float,

@PValue float,

@LillieforsTestStatistic float

SELECT @LillieforsTestStatistic = Max(ABS(EDFCDFDifference))

FROM @EDFTable

— code for Molin and Abdis approximation

— =======================================

SELECT @A = ((-1 * (@b1 + @Count)) + Power(Power((@b1 + @Count), 2) – (4 * @b2 * (@b0 – Power(@LillieforsTestStatistic, –2))), 0.5)) / (2 * @b2)

SELECT @PValue = –0.37782822932809 + (1.67819837908004 * @A)

– (3.02959249450445 * Power(@A, 2))

+ (2.80015798142101 * Power(@A, 3))

– (1.39874347510845 * Power(@A, 4))

+ (0.40466213484419 * Power(@A, 5))

– (0.06353440854207 * Power(@A, 6))

+ (0.00287462087623 * Power(@A, 7))

+ (0.00069650013110 * Power(@A, 8))

– (0.00011872227037 * Power(@A, 9))

+ (0.00000575586834 * Power(@A, 10))

SELECT @LillieforsTestStatistic AS LillieforsTestStatistic, @PValue AS PValueAbdiMollinApproximation

SELECT ID, Value, ValueCount, EDFValue, CDFValue, EDFCDFDifference

FROM @EDFTable

ORDER BY CASE WHEN @OrderByCode = 1 THEN ID END ASC,

CASE WHEN @OrderByCode = 2 THEN ID END DESC,

CASE WHEN @OrderByCode = 3 THEN Value END ASC,

CASE WHEN @OrderByCode = 4 THEN Value END DESC,

CASE WHEN @OrderByCode = 5 THEN ValueCount END ASC,

CASE WHEN @OrderByCode = 6 THEN ValueCount END DESC,

CASE WHEN @OrderByCode = 7 THEN EDFValue END ASC,

CASE WHEN @OrderByCode = 8 THEN EDFValue END DESC,

CASE WHEN @OrderByCode = 9 THEN CDFValue END ASC,

CASE WHEN @OrderByCode = 10 THEN CDFValue END DESC,

CASE WHEN @OrderByCode = 11 THEN EDFCDFDifference END ASC,

CASE WHEN @OrderByCode = 12 THEN EDFCDFDifference END DESC

** Figure 2: Sample Results from the Lilliefors Goodness-of-Fit Test**EXEC Calculations.GoodnessofFitLillieforsTestSP

@DatabaseName = N’DataMiningProjects‘,

@SchemaName = N’Health‘,

@TableName = N’DuchennesTable‘,

@ColumnName = N’LactateDehydrogenase‘,

@OrderByCode = ‘1’

…………Aside from the approximation section, the code in Figure 1 is almost identical to that of last week’s procedure, so I won’t belabor the point by rehashing the explanation here. As usual, I used queries like the one in Figure 2 to test the procedure against several columns in a 209-row dataset on the Duchennes form of muscular dystrophy and an 11-million-row dataset on the Higgs Boson, which are made publicly available by the Vanderbilt University’s Department of Biostatistics and University of California at Irvine’s Machine Learning Repository respectively. It is not surprising that the results nearly matched the Kolmogorov-Smirnov test statistic for many practice columns. For example, the LactateDehydrogenase enzyme scored 0.128712871287129 here and 0.131875117324784 on the Kolmogorov-Smirnov, while the less abnormal Hemopexin protein scored 0.116783569553499 on the Lilliefors and 0.0607407215998911on the Kolmogorov-Smirnov Test. Likewise, the highly abnormal first float column and Gaussian second column in the Higgs Boson table had test statistics of 0.276267238731715 and 0.0181893798916693 respectively, which were quite close to the results of the Kolmogorov-Smirnov. I cannot say if the departure in the case of Hemopexin was the result of some property of the test itself, like its aforementioned higher statistical power for detecting non-normality, or perhaps a coding error on my part. If so, then it would probably be worth it to calculate the Lilliefors test statistic together with the Kolmogorov-Smirnov and Kuiper’s measures and return them in one batch, to give end users a sharper picture of their data at virtually no computational cost.

** Figure 3: Execution Plan for the Lilliefors Goodness-of-Fit Test** (click to enlarge)

…………There were six queries in the execution plan, just as there were for last week’s tests, but the first accounted for 19 percent and the second 82 percent of the batch total. Both of those began with non-clustered Index Seeks, which is exactly what we want to see. Only the second would provide any worthwhile opportunities for further optimization, perhaps by targeting the only three operators besides the seek that contributed anything worthwhile to the query cost: a Hash Match (Aggregate) at 14 percent, a Stream Aggregate that accounted for 10 percent and two Parallelism (Repartition Streams) that together amount to 53 percent . Optimization might not really be necessary, given that the first float column in the mammoth Higgs Boson dataset returned in just 23 seconds and the second in 27. Your marks are likely to be several orders of magnitude better, considering that the procedure was executed on an antiquated semblance of a workstation that is an adventure to start up, not a real database server. The only other fitness tests in this series this fast were the Kolmogorov-Smirnov and Kuiper’s Tests, which I would have calculated together with this test in a single procedure if I’d known there was so much overlap between them. The Anderson-Darling Test we’ll survey in the next installment of the series is also included in the same category of EDF-based fitness tests, but has less in common with the Lilliefors Test and its aforementioned cousins. Unfortunately, high performance is apparently not among the characteristics the Anderson-Darling Test shares with its fellow EDF-based methods. That’s something of a shame, since it is more widely by real statisticians than many other goodness-of-fit tests.

[i] p. 1, Abdi, Hervé and Molin, Paul, undated manuscript “Lilliefors/Van Soest’s Test of Normality,” published at the __University of Texas at Dallas School of Behavioral and Brain Sciences__ web address https://www.utdallas.edu/~herve/Abdi-Lillie2007-pretty.pdf

[ii] See the __Wikipedia__ page “Lilliefors Test” at http://en.wikipedia.org/wiki/Lilliefors_test

[iii] Lilliefors, Hubert W., 1967, “On the Kolmogorov-Smirnov Test for Normality with Mean and Variance Unknown,” pp. 399-402 in __Journal of the American Statistical Association__, Vol. 62, No. 318. June, 1967.

[iv] See the __Statd.com__ webpage “Lilliefors Normality Test” at http://statltd.com/articles/lilliefors.htm

[v] See Arcones, Miguel A., 2006, “On the Bahadur Slope of the Lilliefors and the Cramér–von Mises Tests of Normality,” pp. 196-206 in the __Institute of Mathematical Statistics Lecture Notes – Monograph Series__. No. 51. Available at the web address https://projecteuclid.org/euclid.lnms/1196284113

[vi] See p. 3, Abdi and Molin and the aforementioned __Wikipedia__ page “Lilliefors Test” at http://en.wikipedia.org/wiki/Lilliefors_test

Posted on April 14, 2016, in Goodness-oF-Fit Testing with SQL Server and tagged Analytics, Data Mining, Data Science, Distributions, Goodness-of-Fit, Knowledge Discovery, Lilliefors, Probability, SQL Server, Statistics, Steve Bolton, Stochastic. Bookmark the permalink. 4 Comments.

Hi Steve, I am enjoying your goodness-of-fit series. Any chance you are going to look at AUC? (https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve )

Coming from an R background, the code for this is trivial… replicating in SQL seems complicated.

ROC curve analysis/AUC scoring is very common in machine learning to evaluate models using prediction values vs. [binary] actual responses.

Just a suggestion for your series… and would save me the trouble of offloading this task to R!

Thanks.

Hey Bryan – Thanks for the feedback – I’ve not yet used Area Under the Curve, but I have the impression that it belongs in the set of algorithms that is best left for R, or coded from scratch in a regular programming language like VB.Net or C#. One of the major overarching points I want to get across my blog is that there are certain data mining tasks which are best implemented in set-based languages like T-SQL or MDX, which can inspect and perform math operations on tens of millions of records within seconds, plus divide the data into complex subsets. These are things that traditional languages don’t do well, especially since they don’t have the full SQL Server architecture available to optimize them, like indexing, execution plans, etc. On the other hand, set-based languages don’t handle pure math well, in situation where we’re not traversing millions of records; that’s why I wrote recently in a first draft of a future column that Maximum Likelihood Estimation (MLE) is probably best left for tools like R or .Net languages.

I’ve used R in the past and was planning to write about it in my next installment of the occasional tutorial series Integrating Other Data Mining Tools in SQL Server, which I’ll be putting off until after I can get my hands on the developer edition of SQL Server 2016. I’m going to give an opinion you’ll probably never hear anywhere else, amid all the current hype over R: I’m glad SQL Server’s implementing its own front end to R, because the regular one is by the far the worst of any mining tool I’ve tried to date. This’ll be a big improvement, but it can only be improved so far, because the platform is permanently hobbled by its open source roots. The best thing about R is the large base of existing algorithms, implemented by the user community; I’m sure I’ll be putting these to use myself. On the other hand, the open source/ad hoc/DIY nature of the platform means it does not handle large recordsets well. It is great for purely mathematical exercises like curve fitting and MLE; for performing mining tasks on large numbers of records, it is terrible. I’ve seen it choke on just a few hundred records for very simple tasks; maybe the new SQL Server version will have optimizations so it can handle tens of thousands, but that’s about it. Nor can it *ever* be upgraded to handle Big Data-sized tables as well as T-SQL and MDX, because this limitation is built right into the platform. That is why you’re never, ever going to see intrinsically set-based tasks with large numbers of records – such as Decision Trees or neural nets – ever implemented in R in any efficient way. Even expensive professional tools like Minitab don’t handle these scenarios well at all (even though I love Minitab). Without sets and indexes and execution plans and all that it will simply never be able to handle such workloads, but redesigning it from the ground up to handle these scenarios would be impossible. It’s just one more item in our toolbox, not a cure-all or replacement for Kimball-style data warehousing or SQL Server Data Mining (SSDM). What we’re going to have in the future (unless Microsoft doesn’t really grasp the market and blows it by forcing us down the wrong direction, which they just might do, given the sad state of Windows 8-10 and short shrift given to SSAS and SSDM) is a division of labor between purely mathematical and set-based tasks. Given what I’ve seen in the literature, the implementation of both types of data mining tasks lags decades between the extant research, so it’s going to be a long, long time before the opportunities for writing code for them ever dries up. 🙂

Thanks again – I’m always looking for feedback 🙂

— Steve

Keep going!

Thanks!

Will do – next month I’ll start a whole new series on fuzzy sets that could be of immediate benefit to a really wide range of SQL Server users 🙂 Thanks!