Outlier Detection with SQL Server, part 5: Interquartile Range

By Steve Bolton

…………The last seven articles in this series of mistutorials on identifying outlying values in SQL Server database were clunkers, in the sense that the methods had many properties in common that made them inapplicable to the scenarios DBAs typically need them for. Chauvenet’s Criterion, Peirce’s Criterion, the Tietjen-Moore Test, the Generalized Extreme Studentized Deviate Test (GESD), Grubbs’ Test, the Modified Thompson-Tau Test and Dixon’s Q-Test are well-suited to the uses they were designed for, like hypothesis testing, but are difficult to apply to common SQL Server use cases like finding data quality problems in tables of several millions rows or doing exploratory data mining. Most of them require prior goodness-of-fit testing to verify that the underlying data follows a Gaussian “normal” distribution, i.e. a bell curve, without which they are invalid; many of the lookup tables they depend on are widely available but stop at just a few hundred rows at best, while calculating the missing lookup values for millions of cases can be exceptionally costly. Toss in other drawbacks of hypothesis testing that are often unstated these days (like the use of arbitrary confidence levels and misconceptions about probabilistic reasoning, which statisticians themselves raise frequently in their literature) and it appears that for most scenarios, DBAs would be better off sticking with the methods we kicked off the series with, Z-Scores and Benford’s Law. I’m only writing about these topics as an amateur, but the inapplicability of so many standard outlier identification methods to larger datasets makes me wonder if it the age of “Big Data”[1] doesn’t call for the devising of new means of detection. Thankfully, however, we haven’t by any means exhausted the means already available to us in the common statistical literature, without having to delve into research papers and academic journals and that sort of thing. I haven’t yet had a chance to discuss Interquartile Range because I’m trying to group the detection methods by the properties they have in common, but this particular one has little overlap with any of the others we’ve surveyed to date. It nevertheless performs relatively well and is applicable to a much wider set of use cases than any other means we’ve discussed since finishing up Z-Score a couple of months ago.
…………Interquartile Range has apparently been in use for so long and is so pervasive in academic research that the story of its origin is difficult to find in a cursory search, unlike the colorful histories of some of the lesser-known methods discussed in recent posts. In-depth research of this kind wasn’t really necessary for this week’s article because the calculations and concepts are easier than anything we’ve discussed to date.[2] The idea is fairly simple: instead of calculating a single center point for the whole dataset, we establish two boundaries known as the lower and upper quartiles encompassing the middle half of the values, so named because they are a quarter of the way (25 percent and 75 percent) from the edges of the dataset. The Interquartile Range is just another single measure of how dispersed data is around the center of the dataset, like the more familiar standard deviation and variance, except that it is less sensitive to outlying values (i.e., it is more “robust”). Computing it is trivial once we got the lower and upper quartiles, since all we have to do is subtract the former from the latter. Interquartile Range is apparently useful for other applications such as goodness-of-fit testing, but when used to find those aberrant data points we call outliers, it is usually accompanied by calculations of upper and inner fences. These are established by simply subtracting or adding 1.5 times the Interquartile Range from the lower quartile or doing the same with the upper quartile, except with 3 times the Interquartile Range. Using this test, any values falling outside these four “fences” are defined as outliers. The math in Figure 1 looks a lot more complicated than it really is, when all we’re really doing is a few modulos and simple divisions to get the lower and upper quartiles, then some simple subtraction and multiplication to establish the fence values. The most difficult part of the T-SQL code is probably the common table expression (CTE), which is trivial compared to some of the more difficult nested subqueries, UNPIVOT operations and windowing functions used in other recent tutorials.

Figure 1: Code for the Interquartile Range Procedure
CREATE PROCEDURE [Calculations].[InterquartileRangeSP]
@DatabaseName as nvarchar(128) = NULL, @SchemaName as nvarchar(128), @TableName as nvarchar(128),@ColumnName AS nvarchar(128), @PrimaryKeyName as nvarchar(400), @OrderByCode as tinyint = 1, @DecimalPrecision AS nvarchar(50)
AS
SET @DatabaseName = @DatabaseName + ‘.’
DECLARE @SchemaAndTableName nvarchar(400)
SET @SchemaAndTableName = ISNull(@DatabaseName, ) + @SchemaName + ‘.’ + @TableName
DECLARE @SQLString nvarchar(max)

SET @SQLString = ‘DECLARE @OrderByCode tinyint,
@Count bigint,
@LowerPoint bigint,
@UpperPoint bigint,
@LowerRemainder decimal(38,37), — use the maximum precision and scale for these two variables to make the
procedure flexible enough to handle large datasets; I suppose I could use a float
@UpperRemainder decimal(38,37),
@LowerQuartile decimal( + @DecimalPrecision + ‘),
@UpperQuartile decimal( + @DecimalPrecision + ‘),
@InterquartileRange decimal( + @DecimalPrecision + ‘),
@LowerInnerFence decimal( + @DecimalPrecision + ‘),
@UpperInnerFence decimal( + @DecimalPrecision + ‘),
@LowerOuterFence decimal( + @DecimalPrecision + ‘),
@UpperOuterFence decimal( + @DecimalPrecision + ‘) 

SET @OrderByCode = ‘ + CAST(@OrderByCode AS nvarchar(50)) +  SELECT @Count=Count( + @ColumnName + ‘)
FROM ‘ + @SchemaAndTableName +
WHERE ‘ + @ColumnName + ‘ IS NOT NULL

SELECT @LowerPoint = (@Count + 1) / 4, @LowerRemainder =  ((CAST(@Count AS decimal(‘ + @DecimalPrecision + ‘)) + 1) % 4) /4,
@UpperPoint = ((@Count + 1) *3) / 4, @UpperRemainder =  (((CAST(@Count AS decimal(‘ + @DecimalPrecision + ‘)) + 1) *3) % 4) / 4; –multiply by 3 for the left s’ + @PrimaryKeyName + ‘e on the upper point to get 75 percent

WITH TempCTE
(‘ + @PrimaryKeyName + ‘, RN, ‘ + @ColumnName + ‘)
AS (SELECT ‘ + @PrimaryKeyName + ‘, ROW_NUMBER() OVER (PARTITION BY 1 ORDER BY ‘ + @ColumnName + ‘ ASC) AS RN, ‘ + @ColumnName +
FROM ‘ + @SchemaAndTableName +
W
HERE ‘ + @ColumnName + ‘ IS NOT NULL),
TempCTE2 (QuartileValue)
AS (SELECT TOP 1 ‘ + @ColumnName + ‘ + ((Lead( + @ColumnName + ‘, 1) OVER (ORDER BY ‘ + @ColumnName + ‘) – ‘ + @ColumnName + ‘) * @LowerRemainder) AS QuartileValue
FROM TempCTE
WHERE RN BETWEEN @LowerPoint AND @LowerPoint + 1

UNION

SELECT TOP 1 ‘ + @ColumnName + ‘ + ((Lead( + @ColumnName + ‘, 1) OVER (ORDER BY ‘ + @ColumnName + ‘) – ‘ + @ColumnName + ‘) * @UpperRemainder) AS QuartileValue
FROM TempCTE
WHERE RN BETWEEN @UpperPoint AND @UpperPoint + 1)

SELECT @LowerQuartile = (SELECT TOP 1 QuartileValue
FROM TempCTE2 ORDER BY QuartileValue ASC), @UpperQuartile = (SELECT TOP 1 QuartileValue
FROM TempCTE2 ORDER BY QuartileValue DESC)

SELECT @InterquartileRange = @UpperQuartile - @LowerQuartile
SELECT @LowerInnerFence = @LowerQuartile - (1.5 * @InterquartileRange), @UpperInnerFence = @UpperQuartile + (1.5 * @InterquartileRange), @LowerOuterFence = @LowerQuartile - (3 * @InterquartileRange), @UpperOuterFence = @UpperQuartile + (3 * @InterquartileRange)

–SELECT @LowerPoint AS LowerPoint, @LowerRemainder AS LowerRemainder, @UpperPoint AS UpperPoint, @UpperRemainder AS UpperRemainder
— uncomment this line to debug the inner calculations

SELECT @LowerQuartile AS LowerQuartile, @UpperQuartile AS UpperQuartile, @InterquartileRange AS InterQuartileRange,@LowerInnerFence AS LowerInnerFence, @UpperInnerFence AS UpperInnerFence,@LowerOuterFence AS LowerOuterFence, @UpperOuterFence AS UpperOuterFence

SELECT ‘ + @PrimaryKeyName + ‘, ‘ + @ColumnName + ‘, OutlierDegree
FROM  (SELECT ‘ + @PrimaryKeyName + ‘, ‘ + @ColumnName + ‘,
       OutlierDegree” =  CASE WHEN (‘ + @ColumnName + ‘ < @LowerInnerFence AND ‘ + @ColumnName + ‘ >= @LowerOuterFence) OR (‘ +
@ColumnName + ‘ > @UpperInnerFence
AND ‘
+ @ColumnName + ‘ <= @UpperOuterFence) THEN 1
       WHEN ‘ + @ColumnName + ‘ < @LowerOuterFence OR ‘ + @ColumnName + ‘ > @UpperOuterFence THEN 2
       ELSE 0 END
       FROM ‘ + @SchemaAndTableName +
       WHERE ‘ + @ColumnName + ‘ IS NOT NULL) AS T1
      ORDER BY CASE WHEN @OrderByCode = 1 THEN ‘ + @PrimaryKeyName + ‘ END ASC,
CASE WHEN @OrderByCode = 2 THEN ‘ + @PrimaryKeyName + ‘ END DESC,
CASE WHEN @OrderByCode = 3 THEN ‘ + @ColumnName + ‘ END ASC,
CASE WHEN @OrderByCode = 4 THEN ‘ + @ColumnName + ‘ END DESC,
CASE WHEN @OrderByCode = 5 THEN OutlierDegree END ASC,
CASE WHEN @OrderByCode = 6 THEN OutlierDegree END DESC

–SELECT @SQLStringuncomment this to debug string errors
EXEC (@SQLString)

…………The code in Figure 1 basically follows the same format as that of other procedures I’ve posted in this series, for simplicity’s sake. The first five parameters allow users to test any column in any database they have access to, while the @DecimalPrecision enables them to avoid arithmetic overflows by manually setting a precision and scale appropriate to the column they’ve selected. As usual, the procedure is created in a Calculations schema; there are no brackets to handle spaces in object names, nor is there any validation or SQL injection code. As with past procedures, uncommenting the next-to-last line allows users to debug the dynamic SQL; I also provided a second debugging point of the same kind midway through the procedure, for testing the inner calculations. The @OrderByCode I’ve used in previous tutorials also returns, with the same values as usual: value #1 and #2 allow users to order by the primary key ascending and descending, while #3 and #4 do the same for the ColumnName and #5 and #6 order the results by the OutlierDegree column. As depicted below, the OutlierDegree column allows for a range of values depending on how much a particular data point deviates from the norm, not merely a Boolean yes-no flag like we’ve seen in many hypothesis-testing based methods. Note that the results also include the Interquartile Range, fence values and quartiles used to test each data point.

Figure 2: Results for the Interquartile Range Procedure on the PyruvateKinase Column
EXEC [Calculations].[InterquartileRangeSP]
              @DatabaseName = N’DataMiningProjects,
             @SchemaName = N’Health’
             @TableName = N’DuchennesTable,
             @ColumnName = N’PyruvateKinase,
              @PrimaryKeyName = N’ID’,
              @OrderByCode = 6,
              @DecimalPrecision = N’38,21′

InterquartileRangeResults

…………The test in Figure 2 was performed on the Pyruvate Kinase column of a 209-row dataset on the Duchennes form of muscular dystrophy, which I downloaded from the by Vanderbilt University’s Department of Biostatistics and converted to a SQL Server table. For the sake of consistency, I’ve stress-tested outlier detection methods that might have performance issues on the first float column of the Higgs Boson Dataset, which is made publicly available by the University of California at Irvine’s Machine Learning Repository and now occupies almost 6 gigabytes of my practice database. On average, the procedure took about 18-and-a-half to 19 minutes to run against the 11 million rows of that dataset on my poor beat-up semblance of a development machine, as compared to about 3 minutes for the Z-Score procedure I posted earlier in the series. The addition of a non-clustered index on the column improved things a little – as long as I included the clustered primary key, but the execution plans were still too large to post; suffice it to say that they consisted mainly of parallel non-clustered index Seeks with some expensive Sorts, plus a lot of Spools that had inconsequential costs. Perhaps a columnstore index would help things, but I’ve been unable to upgrade yet to SQL Server 2014, where many of the restrictions that once hobbled the feature have been relaxed.
…………Since Z-Scores perform better and can be used in conjunction with any outlier threshold that end users choose, they remain enthroned as the most widely applicable detection method we’ve yet discussed. Nonetheless, Interquartile Range is more likely to be of use to DBAs than the hypothesis testing-based means that took up the middle segment of this series. The calculations are simple to code and perform, the concepts aren’t that hard to explain to database users and perhaps best of all, we’re not limited to using just a Gaussian bell curve. That also means we don’t have to do preliminary goodness-of-fit testing, which is so often omitted by careless researchers. One of the Wikipedia articles I found the formula at mentions it being used in conjunction with Cauchy and Laplace distributions, although not necessarily in its capacity as an outlier detector.[3] It can even be adapted for double duty as a goodness-of-fit test. In and of itself, it constitutes an alternate measure of dispersion that can be used in place of standard deviation and variance. The scenarios in which such a substitution would prove useful include ones where a measure less likely to be altered by outlying values is called for. The same property might make it more appropriate than Z-Scores when there is a real need for a more conservative test for outliers. Another plus in its favor is the fact that it also measures the degree of membership in the set of outliers on a scale, rather than merely flagging it as many hypothesis-testing methods do; furthermore, those methods have numerous other restrictions on the number and types of inputs, outputs and calculation methods that make them unsuitable for most SQL Server tasks, like recursive deletion with Chauvenet’s Criterion and the inability of Dixon’s Q-Test to identify more than one outlier per dataset. Moreover, the fence and quartile values are trivial to return once they’ve been calculated and constitute global measures in their own right.
…………I have yet to try Cook’s Distance and Mahalanobis Distance, but I have high hopes that they too will prove to be useful additions to the toolbelts of SQL Server data miners and DBAs. I hope to use both as a springboard into a much longer and more difficult, albeit worthwhile, series a few months down the line, Information Measurement with SQL Server. Before delving into the difficult math that underpins distance-based metrics of that kind, however, I will give a brief overview of how to use Reporting Services to find outliers the easy way: by the naked eye. Finding outliers is not always that straightforward, but in many cases all we need to do is spot them in a histogram or scatter plot, where they sometimes stand out like sore thumbs. They are sometimes also glaringly obvious in the diagrams produced by the Clustering algorithm in SSDM, which I may give a quick refresher on, based on my last tutorial series, A Rickety Stairway to SQL Server Data Mining. As we will see, scaling up visual detection methods of this kind to meet the size of SQL Server databases is the primary challenge, just as their size stretches beyond the ordinary bounds of hypothesis testing. The pervasiveness of the size issue makes me wonder, once again, if it might not be worthwhile to devise new scalable methods of outlier detection to complement the ones already in common use today.[4]

 

[1] This buzzword is a lot like the overused term “globalization.” Unlike with statistics and data mining, I have real expertise in foreign policy history, and can say definitively that globalization has been going on for millennia; the only difference is that it has accelerated in recent decades. Likewise, the amount of Data the human race has to process is always getting Bigger; it’s just getting Bigger at a faster pace these days.

[2] I retrieved the formulas from the most convenient sources, the Wikipedia pages “Outlier” and “Interquartile Range” at http://en.wikipedia.org/wiki/Outlier and http://en.wikipedia.org/wiki/Interquartile_range respectively. I also tested the procedure against some of the examples provided there. Also see National Institute of Standards and Technology, 2014, “7.1.6. What are Outliers in the Data?” published in the online edition of the Engineering Statistics Handbook. Available online at the web address http://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm

[3] IBID.

[4] While writing this series I encountered an interesting suggestion by Will G. Hopkins, the writer of one of the best plain English explanations of statistics available online today: “Here’s something challenging for the real lovers of numbers. The mean ± SD encloses 68% of the data on average for a normally distributed variable. So if you want to use a percentile range that corresponds to the mean ± SD, what should it be? Answer: 16th-84th. If I had my way, this measure would replace the interquartile range. We could call it the standard percentile range…” I could write code for this easily, but didn’t bother because the series already features too many outlier identification methods that are dependent on a normal distribution. Nor would it necessarily do anything to help with the scaling problem I mentioned above. It does illustrate, however, how we’re not limited to just using tried and true measures and can devise new ones as needed, if they are appropriate to the contexts at hand. See Hopkins, Will G.  2013, “Percentile Ranges,” published at the website A New View of Statistics and available at the web address http://www.sportsci.org/resource/stats/percentile.html

 

Outlier Detection with SQL Server, part 4: Peirce’s Criterion

By Steve Bolton

…………In the last couple of installments of this amateur series of self-tutorials on outlier identification with SQL Server, we dealt with detection methods that required recursive recomputation of the underlying aggregates. This week’s topic, Peirce’s Criterion, also flags outliers in an iterative manner, but doesn’t require the same sliding window to continually recalculate the mean and standard deviation as Chauvenet’s Criterion and the Modified Thomson Tau Test do. Like these analogous methods, Peirce’s Criterion can be made useful to DBAs by using it to merely flag potential outliers and performing new computations as if they had been removed, rather than deleting them without adequate further investigation, as sometimes occurs with the other two. While writing this series, I’ve slowly come to the realization that the statistical formulas underlying many of these methods can be swapped in and out almost like the modularized parts of a car engine, radio kit or DIY computer; for example, Chauvenet’s Criterion and the Modified Thompson Tau test leaven standard hypothesis testing methods with comparisons to Z-Scores, with the former merely substituting thresholds based on a Gaussian normal distribution (i.e. a bell curve) rather than the Student’s T-distribution used in the latter. Peirce’s Criterion is also recursive, but uses the R-values produced by Pearson Product Moment Correlation calculations as thresholds for its Z-Scores. I originally had high hopes for Peirce’s Criterion because those correlation coefficients are easy to calculate on entire databases, but it turns out that lookup tables are required for the R-Values. These are even shorter and more difficult to find on the Internet than the Gaussian and T-distribution lookup tables required for some of the outlier detection methods based on hypothesis testing, which were covered in the last six posts. For that reason, I found it more difficult than usual to validate my T-SQL samples, so be cautious when implementing the code below. Furthermore, the Criterion is burdened with the same requirement for prior goodness-of-fit testing, to prove that the underlying data follows a bell curve.
…………It is not surprising that the Criterion carries so many restrictions, given that it is one of the first outlier detection methods ever devised. The algorithm that mathematician Benjamin Peirce (1809-1880)[1] introduced in an 1852 paper in the Astronomical Journal is indeed difficult to follow and implement, even for those with far more experience than myself. Programmers have apparently had some success recently in coding the underlying math in R and Python[2], but my solution is based on the more accessible version published in 2003 in the Journal of Engineering Technology by Stephen Ross, a professor of mechanical engineering at the University of New Haven.[3] The DDL in Figure 1 can be used to import the table on page 10 to 12, which translates into 540 R-values for up to nine potential outliers and a maximum of 60 records. Denormalizing it into a single interleaved lookup table allows us to access the values in a more legible way with a single join, rather than the double join that would be required with two normalized tables. I altered the algorithm Ross provides to do all of the comparisons in a single iteration, since it is trivial in a set-based language like T-SQL to simply check the nine highest absolute deviations against the corresponding R-Values in one fell swoop. The T3 subquery in Figure 2 simply looks up all nine R-Values for the count of all the records in the dataset, then calculates the MaximumAllowableDeviation values by multiplying them by the standard deviation of the entire dataset. The T2 subquery merely calculates the nine highest absolute deviations in the dataset (using basically the same logic as that found in Z-Scores) and joins them to the NumberOfOutliers of the same rank. If the absolute deviation is higher than the maximum allowable deviation, the record is flagged as an outlier. The rest of the code follows the same format as that of other procedures posted in this series; the first five parameters allow you to select any column in any database for which you have access and @DecimalPrecision enables users to avoid arithmetic overflows. The rest is all dynamic SQL, with the customary debugging comment line above the EXEC. To avoid cluttering the code, I didn’t supply the brackets needed to accommodate spaces in object names – which I don’t allow in my own code – or validation logic, or SQL injection protection. As always, the procedure and lookup table are implemented in a Calculations schema that can be easily changed. Since we can get a bird’s-eye view of all nine rows. as depicted in Figure 3, there’s no reason to incorporate the @OrderByCode parameter used in past tutorials.

Figure 1: DDL for the R-Value Lookup Table
CREATE TABLE [Calculations].[PeirceRValueTable](
       [ID] [smallint] IDENTITY(1,1) NOT NULL,
       [N] [tinyint] NULL,
       [NumberOfOutliers] [tinyint] NOT NULL,
       [RValue] [decimal](4, 3) NULL,
 CONSTRAINT [PK_PeirceRValueTable] PRIMARY KEY CLUSTERED ( [ID] ASC )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE
= OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

Figure 2: Code for the Peirce’s Criterion Stored Procedure
CREATE PROCEDURE [Calculations].[PiercesCriterionSP]
@DatabaseName as nvarchar(128) = NULL, @SchemaName as nvarchar(128), @TableName as nvarchar(128),@ColumnName AS nvarchar(128), @PrimaryKeyName as nvarchar(400), @DecimalPrecision AS nvarchar(50)
AS
DECLARE @SchemaAndTableName nvarchar(400), @SQLString nvarchar(max)
SET @DatabaseName = @DatabaseName + ‘.’
SET @SchemaAndTableName = ISNull(@DatabaseName, ) + @SchemaName + ‘.’ + @TableName
SET @SQLString = ‘DECLARE @Mean decimal(‘ + @DecimalPrecision + ‘), @StDev decimal(‘ + @DecimalPrecision + ‘), @Count decimal(‘ + @DecimalPrecision + ‘)
SELECT @Count=Count(CAST(‘ + @ColumnName + ‘ AS Decimal(‘ + @DecimalPrecision + ‘))), @Mean = Avg(CAST(‘ + @ColumnName + ‘ AS Decimal(‘ + @DecimalPrecision + ‘))), @StDev = StDev(CAST(‘ + @ColumnName + ‘ AS Decimal(‘ + @DecimalPrecision + ‘)))
FROM ‘ + @SchemaAndTableName +
WHERE ‘ + @ColumnName + ‘ IS NOT NULL

SELECT ‘ + @PrimaryKeyName + ‘, ‘ + @ColumnName + ‘, NumberOfOutliers, AbsoluteDeviation, MaximumAllowableDeviation, ”IsOutlier” = CASE WHEN AbsoluteDeviation > MaximumAllowableDeviation THEN 1 ELSE 0 END
FROM (SELECT TOP 9 ‘ + @PrimaryKeyName + ‘, ‘ + @ColumnName + ‘, AbsoluteDeviation, ROW_NUMBER() OVER (ORDER BY AbsoluteDeviation DESC) AS RN
       FROM (SELECT ‘ + @PrimaryKeyName + ‘, ‘ + @ColumnName + ‘, ABS(+ @ColumnName + ‘ – @Mean) AS AbsoluteDeviation
       FROM ‘ + @SchemaAndTableName +
      WHERE ‘ + @ColumnName + ‘ IS NOT NULL) AS T1) AS T2
INNER JOIN   (SELECT NumberOfOutliers, @StDev * RValue AS MaximumAllowableDeviation
      FROM Calculations.PeirceRValueTable
       WHERE N = 60) AS T3@Count
ON RN = NumberOfOutliers
–SELECT @SQLString  uncomment this to debug string errors
EXEC (@SQLString)

 

Figure 3: Results for Peirce’s Criterion
EXEC [Calculations].[PiercesCriterionSP]
             @DatabaseName = N’DataMiningProjects,
             @SchemaName = N’Health,
             @TableName = N’First60RowsPyruvateKinaseView’,
             @ColumnName = N’PyruvateKinase,
             @PrimaryKeyName = N’ID’,
              @DecimalPrecision = N’10,7′

PeircesCriterionResults

…………The results above come from a view on the first 60 rows of a small dataset on the Duchennes form of muscular dystrophy, which I downloaded from the Vanderbilt University’s Department of Biostatistics. I’ve stress-tested some of the procedures I posted earlier in this series on an 11-million-row table of Higgs Boson data made publicly available by the University of California at Irvine’s Machine Learning Repository, but there’s no point in doing that (or posting client statistics and execution plans) with Peirce’s Criterion procedure if we’re limited to 60 rows. In lieu of new algorithms like those used by R and Python to compute the test in far greater detail, this is about as useful as the test can be made in a SQL Server setting. There are some definite advantages over Chauvenet’s Criterion and the Modified Thompson Tau test, in that automatic deletion of records is not encouraged to the same extent and expensive recursive calculations are not necessary. Yet like the last six standard outlier detection methods surveyed here, it’s not really suitable for usage on tables of thousands of rows, let alone the billions used in Big Data applications. As usual, the available lookup tables are simply too small, calculating the missing lookup values is not feasible at this time and the test is only applicable to a Gaussian distribution. One of the pluses is that Peirce’s Criterion does not depend on confidence levels that are typically set by custom rather than sound reasoning. Furthermore, the probabilistic reasoning it is based upon is sound, but does not represent a guarantee; probabilities only generate reasonable expectations but have no effect on outcomes. This drawback of probabilistic stats was recognized long ago by Peirce’s son, but has since been forgotten – especially after the advent of quantum mechanics. As pointed out by Theodore P. Hill and Arno Berger, the authors of a study on Benford’s Law cited earlier in this series, “The eminent logician, mathematician, and philosopher C.S. Peirce once observed [Ga, p.273] that ‘‘in no other branch of mathematics is it so easy for experts to blunder as in probability theory.’’[4] I expected Peirce’s Criterion to be more useful because it is dependent on correlation stats that are common and easy to calculate, but it turns out that it belongs in the same class of outlier detection methods as Grubbs’ Test, the Generalized Extreme Studentized Deviate (GESD) test, Dixon’s Q-Test, the Tietjen-Moore Test, the Modified Thompson Tau test and Chauvenet’s Criterion. The lookup tables may not involve comparisons to Gaussian and T-distribution values like these hypothesis testing methods do, but the drawbacks are largely the same. Work is apparently ongoing in fields that use statistics to make the R-values easier to calculate from ordinary correlation coefficients, so Peirce’s Criterion may wind up being more usable than any of these in the long run. For now, however, SQL Server users would probably be better off sticking with methods like Z-Score and Benford’s Law that are more appropriate to large databases. So far, what I’ve found most striking about my misadventures in this topic to date is just how difficult it is to apply many commonly used statistical tests for outliers to the kind of datasets the SQL Server community works with; I’m only an amateur learning my way in this field, but I wonder at times if our use cases don’t call for the invention of new classes of tests. In the meantime, we can still rely on more useful outlier detection methods like Interquartile Range, which I’ll explain to the best of my inability next week. DBAs are probably also likely to find real uses for the visual detection methods that can be easily implemented in Reporting Services, as well as Cook’s Distance and Mahalanobis Distance, which I’ve saved for the end of the series because the difficulty in coding them appears to be commensurate to their potential value.

 

[1] The name is not misspelled but is frequently mispronounced as “Pierce” rather than “purse.” The authorship is made even more confusing by the fact that Benjamin’s son, Charles Sanders Peirce (1839-1914), was also a well-known mathematician who published commentaries on his father’s Criterion. Apparently the son fits snugly in the category of mathematicians and physicists with unusual emotional and mental disturbances, given that he was “he was, at first, almost stupefied, and then aloof, cold, depressed, extremely suspicious, impatient of the slightest crossing, and subject to violent outbursts of temper” by trigeminal neuralgia that led to his pattern of “social isolation”; perhaps it also factored into the decision of Harvard’s president to ban him from employment there. He can’t have been entirely irrational though, given that he was very close to William James, one of the few sane American philosophers. For more backstory, see the Wikipedia pages “Benjamin Peirce” and “Charles Sanders Peirce” at http://en.wikipedia.org/wiki/Benjamin_Peirce and http://en.wikipedia.org/wiki/Charles_Sanders_Peirce respectively.

[2] See the Wikipedia page “Peirce’s Criterion,” available at the web address http://en.wikipedia.org/wiki/Peirce%27s_criterion

[3] pp. 3-4, Ross, Stephen M. “Peirce’s Criterion for the Elimination of Suspect Experimental Data,” pp. 1-12 in the Journal of Engineering Technology, Fall 2003. Vol. 2, No. 2. http://newton.newhaven.edu/sross/piercescriterion.pdf

[4] Berger, Arno and Hill , Theodore P., 2011, “Benford’s Law Strikes Back: No Simple Explanation in Sight for Mathematical Gem,” published by Springer Science Business Media, LLC, Vol. 33, No. 1. Available online at http://people.math.gatech.edu/~hill/publications/PAPER%20PDFS/BenfordsLawStrikesBack2011.pdf.

Outlier Detection with SQL Server, part 3.6: Chauvenet’s Criterion

By Steve Bolton

…………This is the last of six articles I’ve segregated in this middle of my mistutorial series on identifying outlying values with SQL Server, because they turned out to be difficult to apply to the typical use cases DBAs encounter. After this detour we’ll get back on track with outlier detection methods like Interquartile Range that are likely to be as useful as the ones the series started with, such as Benford’s Law and Z-Scores, but I’ll first give a brief explanation of  Chauvenet’s Criterion for the sake of completeness and the offhand chance that it might prove useful in the right circumstances. Those circumstances are normally those suitable for statistical hypothesis testing, in which researchers attempt to prove narrow, specific points of evidence using relatively small datasets – not exploratory data mining or data quality examinations on datasets of thousands or even billions of rows, as in a typical relational table. This subset is designed with different use cases in mind, so it is not surprising that they come with some common limitations that make them difficult to apply to big tables. Among these are the necessity of prior goodness-of-fit testing to ensure that the data follows a Gaussian or “normal” distribution, i.e. the bell curve, without which the outlier tests are invalid. Furthermore, the lookup tables that many of these tests require for comparisons are plentiful on the Internet and old texts, but finding ones without gaps or that extend beyond a few hundred records are difficult to find; worse still, the formulas for calculating the missing values are often performance hogs or require precisions and scales that choke T-SQL code with arithmetic overflows errors. Drawbacks like these also restrict the usefulness of Chauvenet’s Criterion, which was among the first outlier detection methods ever developed. Naval Academy mathematician William Chauvenet (1820-1870) formulated it in the Civil War era, but recognized from the beginning that there were already more trustworthy means available, like Peirce’s Criterion. Perhaps its crudest limitation is that it calls for recursive reexamination of data after carrying out automatic deletion of data points without further investigation, which as I have discussed in prior articles, is unwise and sometimes even unethical. Thankfully, we can apply the same type of interpretation-hack used in last week’s article to make the Modified Thompsons Tau test more useful and valid to Chavuenet’s formula, which is still apparently in common use today despite all of these well-known issues.
…………Repetetively deleting records until no more outliers remain to test is probably not going to fly with either users or IT managers in a SQL Server environment, but as we saw last week, it is possible to simply flag records as potential outliers and still recursively recalculate the underlying aggregates that the formula is based on, as if they had been deleted. It is easier to have our cake and eat it too thanks to new T-SQL windowing clauses like ROWS UNBOUNDED PRECEDING, which make the code for this week’s stored procedure much shorter and easier to follow. The T-SQL in Figure 2 closely resembles that of the Modified Thompson Tau procedure for another reason: the Chauvenet Criterion is also a hybrid method that marries some of the logic of Z-Scores with that of hypothesis testing. The key difference is that we need to compare the absolute deviation against the standard normal distribution rather than Student’s T-distribution; we’re really just substituting one statistical part for another in order to address different use cases, just as we would  swap out a computer component or an automotive part in a car engine. That substitution requires the use of a different lookup table than the ones we’ve used in recent articles, but we only need one of them, since we only need to input the absolute deviation rather than the degrees of freedom and an alpha value. That  in turn means we can use a single join rather than a function call, further decomplicating the procedure. The main problem I encountered when implementing this is that it is impossible to find complete lookup tables for the standard normal distribution, which typically only accept just one decimal point of precision despite the fact that possible to the calculate Z-Scores fed into them to far higher precisions. Part of the problem is that they’re continuous values, but as I’ve found out the hard way, it is surprisingly difficult to calculate them to higher precisions with the original cumulative distribution function (CDF). Until I can come up with a more precise approximation for high-precision values, the clumsy lookup table defined in Figure 1 will have to do. I designed it to host the table cited at the Wikipedia page “68–95–99.7 Rule,”[1] which includes the probabilities that values will occur within one to seven standard deviations, at intervals of 0.5. Once I overcome my difficulties with CDFs and can get more accurate measures, it will be possible to replace the clumsy BETWEEN clause and CASE in the procedure that crudely peg data points to these wide limits. Once the probability value has been retrieved, we only need to multiply it by the number of data points and flag the record as an outlier if the result is less than 0.5. The PopulationOutsideRange that the procedure in Figure 2 joins to is a calculated column (which is renamed as Probability in the stored procedure) while the RN ROW_NUMBER value acts as a running count.

Figure 1: Code for the Standard Normal Deviation Lookup Table
CREATE TABLE [Calculations].[StandardNormalDeviationTable](
       [ID] [bigint] IDENTITY(1,1) NOT NULL,
       [StandardDeviations] [decimal](2, 1) NULL,
       [PopulationInRange] [decimal](16, 15) NULL,
       [PopulationOutsideRange]  AS ((1)-[PopulationInRange]),
       [ExpectedFrequency] [bigint] NULL,
       [ApproximateFrequencyForDailyEvent] [nvarchar](400) NULL,
 CONSTRAINT [PK_StandardNormalDeviationTable]
PRIMARY KEY CLUSTERED 
( [ID] ASC)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE
= OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

Figure 2: Code for the Chauvenet Criterion Procedure
CREATE PROCEDURE [Calculations].[ChauvenetCriterionSP]
@DatabaseName as nvarchar(128) = NULL, @SchemaName as nvarchar(128), @TableName as nvarchar(128),@ColumnName AS nvarchar(128), @PrimaryKeyName as nvarchar(400), @DecimalPrecision AS nvarchar(50)
AS
DECLARE @SchemaAndTableName nvarchar(400),  @SQLString nvarchar(max)
SET @DatabaseName = @DatabaseName + ‘.’
SET @SchemaAndTableName = ISNull(@DatabaseName, ) + @SchemaName + ‘.’ + @TableName
SET @SQLString = ‘SELECT’ + @PrimaryKeyName + ‘, ‘ + @ColumnName + ‘, RN, AbsoluteDeviation, Probability, ”IsOutlier
= CASE WHEN (RN * Probability) < 0.5 THEN 1 ELSE 0 END
       FROM (SELECT ‘ + @PrimaryKeyName + ‘, ‘ + @ColumnName + ‘, RN, AbsoluteDeviation, ”Probability” = CASE WHEN PopulationOutsideRange IS NOT NULL THEN PopulationOutsideRange
WHEN PopulationOutsideRange IS NULL AND AbsoluteDeviation < 1 THEN 1 ELSE 0 END
                     FROM   (SELECT T1.’ + @PrimaryKeyName + ‘, ‘ + @ColumnName + ‘, CAST(ROW_NUMBER() OVER (ORDER BY ‘ + @ColumnName + ‘ ASC) AS bigint) AS RN,
                     Abs( + @ColumnName + ‘ – Avg(CAST(‘ + @ColumnName + ‘ AS Decimal(‘ + @DecimalPrecision + ‘)))  OVER (ORDER BY ‘ + @ColumnName + ‘ ASC ROWS UNBOUNDED PRECEDING)) /
                                 NullIf(StDev(CAST(‘ + @ColumnName + ‘ AS Decimal(‘ + @DecimalPrecision + ‘)))  OVER (ORDER BY ‘ + @ColumnName + ‘ ASC ROWS UNBOUNDED PRECEDING), 0) AS AbsoluteDeviation
                           FROM ‘ + @SchemaAndTableName +
                           WHERE ‘ + @ColumnName + ‘ IS NOT NULL) AS T1
                           LEFT JOIN Calculations.StandardNormalDeviationTable AS T2
                           ON T1.AbsoluteDeviation BETWEEN T2.StandardDeviations – 0.25 AND T2.StandardDeviations + 0.25
                     WHERE AbsoluteDeviation IS NOT NULL) AS T3
ORDER BY IsOutlier DESC, AbsoluteDeviation
DESC, Probability DESC

–SELECT @SQLStringuncomment this to debug string errors
EXEC (@SQLString)

…………If you’ve been following this series, most of the rest is self-explanatory: the first few parameters allow users to apply the test to any column in any database for which they have permissions, while the @DecimalPrecision parameter allows them to adjust the precision and scale to avoid arithmetic overflows in the calculations. The rest is just the usual dynamic SQL, with a couple of nested subqueries to allow calculations like the Z-Score to be bubbled up and handled at higher levels. As usual, I’ve omitted any code to handle spaces in object names (which I never allow when I can get away with it) or SQL injection protections.

Figure 3: Results for the Chauvenet Criterion Procedure
EXEC [Calculations].[ChauvenetCriterionSP]
              @DatabaseName = N’DataMiningProjects,
              @SchemaName = N’Health,
              @TableName = N’DuchennesTable,
              @ColumnName = N’Hemopexin,
              @PrimaryKeyName = N’ID’,
             @DecimalPrecision = N’5,1′

ChauvenetResults

…………Chauvenet’s Criterion turned out to be much more conservative than I expected it be, at least in terms of flagging outliers in columns without a lot of repeating values. Only a single value was identified as an outlier in the Hemopexin column of a 209-row dataset on the Duchennes form of muscular dystrophy, which I downloaded from Vanderbilt University’s Department of Biostatistics and have been using for practice throughout this series. On another practice dataset with 1,600 rows it found only three, far fewer than any of the other procedures tested to date. I was startled at how well the procedure performed against the first float column in the Higgs Boson dataset, which I downloaded from the University of California at Irvine’s Machine Learning Repository and turned into a nearly 6-gigabyte table. Despite the fact that the logic for both procedures is quite similar, Chauvenet’s test took only 3 minutes and 15 seconds to run on my poor beat-up six-core imitation of a workstation, compared to an hour and fifteen minutes for the Modified Thompson Tau test procedure. The execution plan in last week’s tutorial was too small to bother to post, since it consisted mostly of a single Sort operation that sucked 95 percent of the computational cost, whereas the Chauvenet execution plan was too wide to fit here and was comprised of several other operators like Compute Scalar, Nested Loops and Parallelism (Gather Streams). It also included a single Sort, but it only accounted for 38 percent of the expense.
…………It may perform surprisingly well and have its uses on columns with few repeating values when a conservative identification threshold is called for, but this century-and-a-half old test has many drawbacks that should not be understated. The requirement of a Gaussian distribution and the difficulty in getting accurate probability values for the size of the datasets DBAs work with are only the tip of the iceberg. The Central Limit Theorem on which it is based is mathematically based is much more trustworthy than other probabilistic methods extant today, but it is a fallacy to believe that probabilities represent guarantees or have any effect on causation. I’ve barely touched on this issue yet, but the aforementioned Wikipedia article on the 68–95–99.7 Rule put it a lot better than I can: “…it is important to be aware of the fact that there is actually nothing in the process of drawing with replacement that specifies the order in which the unlikely events should occur, merely their relative frequency, and one must take care when reasoning from sequential draws. It is a corollary of the gambler’s fallacy to suggest that just because a rare event has been observed, that rare event was not rare. It is the observation of a multitude of purportedly rare events that undermines the hypothesis that they are actually rare.”[2] The remedy for this uncertainty is the same as for the more serious issue of deletion: further investigation, not knee-jerk deletion of records. As geneticist David M. Glvoer and oceanographers Scott Christopher Doney and Wiliam J. Jenkins put it in their 2011 book, Modeling Methods for Marine Science:

                “Now the truly clever researcher might be tempted to perform this rejection iteratively. That is, why not compute a mean and standard deviation, Z-score the data and reject the fliers, then compute an even better mean and standard deviation and do the same thing all over again, rejecting more data. The advice of all the statistical sages and texts is do Chauvenet rejection only once in a given distribution. If the data were normally distributed, and there weren’t many fliers, you’ll probably find that the second iteration will not yield any more rejectable points. If it does, then it suggests that your data may not be normally distributed. The philosophy is that filtering once is a valid thing to do, but iterative filtering may dramatically alter the data distribution in a fundamental way, invalidating the assumptions behind your statistical calculations, and leading to erroneous results. Moreover, you may accused of being a Chauvenet Chauvinist.”[3]

This is professional confirmation of the Catch-22 I’ve always fretted about with the normal distribution: the more outliers that are found, the less likely it is that a Gaussian bell curve is active, in which case most of these hypothesis-testing based outlier detection methods are invalid. Another Catch-22 is operative when we’re recklessly deleting data in a recursive routine like Chauvenet’s Criterion and the Modified Thompson Tau test: the more we delete, the bigger the impact on the dataset will be. If we follow Glover et al.’s suggestion and limit the criterion to a single use, it’s hardly applicable to a SQL Server database where we may need to find tens of thousands of outliers, while looking for data quality issues or doing exploratory data mining. Such a wide scope also calls for degrees of precision that aren’t readily available in regular lookup tables and would probably be quite costly to compute. The criterion may have been better than nothing when Chauvenet wrote his paper back in the Civil War era, but it’s really hard to justify its use, even in many of the hypothesis testing scenarios it was designed for. Nevertheless, academia and research labs across the planet are apparently still staffed by many of those “Chauvenet Chauvinists” today. While researching this article (including reading parts of Chauvenet’s original paper in .pdf format, which I’ve since lost) I ran across many comments like this one from Stephen Ross, a professor of mechanical engineering at the University of New Haven:

                “Peirce’s criterion has been buried in the scientific literature for approximately 150 years. It is virtually unknown today in the scientific community. In its place, Chauvenet’s criterion is commonly used for rational elimination of “outlier” data by government laboratories, (e.g., Environmental Protection Agency, U.S. Army Corps of Engineers, Agency for Toxic Substances and Disease Registry, Institute for Telecommunication Sciences), industry (e.g., Boeing, Sikorsky), foreign laboratories (e.g., Laboratoire National Henri Becquerel, Joint Astronomy Centre), and universities (e.g., research and courses at University of Michigan, Texas A&M, University of California, Vanderbilt, University of Alberta, Ohio State). Methods of elimination of data “outliers” are useful for anyone working in industry or in an educational institution where statistical information concerning product runs or experimental data is of interest. In an engineering, technology or science program, laboratory courses in chemistry, physics and engineering can, and do, find use for rational spurious data elimination. In the BSME program at the University of New Haven, we have used Chauvenet’s criterion in our instrumentation and fluid/thermal laboratory courses for many years. Other universities have similarly used this criterion in their undergraduate laboratories. Typically, students take several measurements of a quantity, say pressure, at one setting (meaning the experimental conditions are maintained at the same level). Assuming the systematic errors are negligible, each measurement will vary slightly due to random errors (e.g., reading instrument values, flow rate may change slightly, etc.). Often, however, one or two datum points seem to fall “far” outside the range of the others obtained. These outliers greatly impact the mean and standard deviation of the, measurements. A data elimination method can be used to obtain a realistic average value of pressure and an “uncertainty” in the true value given by the standard deviation…Chauvenet’s criterion is in common use today for elimination of suspect data.”[4]

…………Ignorance is bliss. I started off this series with some dire warnings about how haphazardly statistics are handled today, especially in fields like medicine where they can do the most damage. The more I’ve learned while writing this series, the less reassured I’ve become. One of the clearest lessons I’ve learned from this exercise is that, if the SQL Server community and the rest of the database server field get in the habit of routinely doing outlier detection (as I suspect they will, in time), they really need to avoid simply copying the means used in other fields. Chauvenet’s Criterion and the other five hypothesis-testing based methods don’t seem to be well-suited to the Big Data buzzword at all, but it doesn’t stop there: in many cases, they’re not even applied correctly in industries where they’re used on a daily basis, such as medical research. So far in this series, only Benford’s Law and Z-Scores appear to fit our use cases well, although I have high hopes for upcoming topics like Interquartile Range, Cook’s distance and Mahalanobis distance, as well as the various visual means that can be implemented in Reporting Services. Next week’s article on Peirce’s Criterion is also likely to be more valuable to DBAs. As Ross points out in an article on that topic, even Chauvenet recommended it in place of his own test: “Chauvenet himself believed that Peirce’s work was more rigorous and could be applied more generally, and in Chauvenet’s words, ‘For the general case….. when there are several unknown quantities and several doubtful observations, the modifications which the rule (meaning his own criterion) requires renders it more troublesome than Peirce’s formula……What I have given may serve the purpose of giving the reader greater confidence in the correctness and value of Peirce’s Criterion.’”
…………What’s good enough for Chauvenet is good enough for me. Why his advice not to use his own test apparently isn’t heeded in academia and private sector research is beyond me. Perhaps it is only a matter of habit, like the completely arbitrary custom of using confidence levels like 95 percent in hypothesis testing. Hopefully it is a custom that DBAs won’t adopt without some thought; perhaps Chauvenet’s Criterion has a place in our tool belts for unusual use cases, but it ought to be a very small place, considering how many more fitting outlier detection methods we have available to us.

[1] See the Wikipedia pages “68–95–99.7 Rule” and  “Standard Deviation” at http://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule “68–95–99.7 Rule” and

http://en.wikipedia.org/wiki/Standard_deviation “Standard Deviation” respectively.

[2] IBID.

[3] p. 29, Glover,  David M.; Jenkins, William J. and Doney, Scott Christopher, 2011, Modeling Methods for Marine Science. Cambridge University Press: New York. I found this reference at the Google Books web address http://books.google.com/books?id=OYAkMs85848C&q=chauvenet#v=snippet&q=chauvenet&f=false

[4] pp. 3-4, Ross, Stephen M. “Peirce’s Criterion for the Elimination of Suspect Experimental Data,” pp. 1-12 in the Journal of Engineering Technology, Fall 2003. Vol. 2, No. 2. http://newton.newhaven.edu/sross/piercescriterion.pdf

Outlier Detection with SQL Server, part 3.5: The Modified Thompson Tau Test

By Steve Bolton

…………Based on what little experience I’ve gained from writing this series on finding outliers in SQL Server databases, I expected the Modified Thompson Tau test to be a clunker. It marries the math underpinning one of the most ubiquitous means of outlier detection, Z-Scores, with the methods taken from the field of statistical hypothesis testing, which I’ve grouped together in the middle of this segment of the series. I feared, however, that it would turn out to be a shotgun wedding that would combine the worst drawbacks of both, at least when applied to the kind of large datasets that DBAs encounter every day. As discussed in previous articles in more depth, hypothesis testing is often performed on datasets of just a few dozen or a few hundred rows, whereas SQL Server tables normally have thousands, if not millions or billions of rows; the former is meant to be applied in proving specific points of evidence, while our main use cases tend to be ferreting out data quality problems and exploratory data mining. I’m dispensing with this subset of outlier detection methods (which amount to a kind of DIY data mining) in this segment because they’re not designed for our main user scenarios – although they can be tweaked somewhat to better fit our needs, as Dixon’s Q-Test was in last week’s installment. The Modified Thompson Tau test shares many of the same limitations as the other methods in the same class, such as the requirement of prior goodness-of-fit testing to ensure that the data follows a Gaussian or “normal” distribution (i.e. the bell curve). In fact, I expected it to be worse than Dixon’s Q-Test, Grubbs’ Test, the GESD and the Tietjen-Moore Test because it involves a recursive elimination of outliers, one at a time, which requires constant recalculations of the underlying averages and standard deviations used in the formulas. Not only could this be a performance headache, but this iterative elimination smacks of the kind of logically and ethically dubious handling of outliers I’ve addressed at length in past articles. Any outlier detection workflow must include investigations of their underlying causes, before matching them to an appropriate response; some are the result of bad data collection, some may indicate a non-Gaussian distribution, others may be beneficial depending on the context, such as finding new customer bases in a marketing survey. The Modified Thompson Tau test apparently is misused routinely to simply jettison unwanted data points without further investigation, judging from the frightening number of sources I found on the Internet where professionals did exactly that. That’s not going to fly in a SQL Server database, where the users and IT directors tend to get miffed when a DBA whimsically decides to delete ten thousand records. We can, however, salvage some usefulness out of the test by simply using it to flag potential outliers, rather than deleting them. The underlying stats can be recomputed as if previously detected outliers were eliminated, without actually making any change to the record.
…………Flagging outliers in this manner solves the problem of inappropriate responses that are often associated with the test, but it takes the magic of SQL Server windowing functions to address the recursive recomputations of aggregates like standard deviation and the mean. The declaration section of the dynamic SQL in Figure 1 is shorter than in any other procedure I’ve written in this series because we can’t simply do a one-time measurement of these basic stats; Modified Thompson Tau is the first test we’ve encountered that uses a sliding window of this kind. The formula I retrieved from Wikipedia[1] really isn’t that hard to calculate, nor are the underlying concepts anything new to this series; all we have to do is compute the absolute deviation (which we’ve already seen in the articles on Z-Scores) and use the Calculations.FindCriticalRegionForTDistributionFunction coded for the Grubbs’ Test article to find the critical region. After that, it’s just a simple matter of making a quick mathematical comparison between the two. The difficulty consists in the recursion, which is computationally costly. Furthermore, it is difficult if not impossible to do these calculations in a recursive CTE, thanks to such fun messages as “TOP operator is not allowed in the recursive part of a recursive common table expression,” “Functions with side effects are not allowed in the recursive part of a recursive common table expression” and “GROUP BY, HAVING, or aggregate functions are not allowed in the recursive part of a recursive common table expression.” Prior to the introduction of new windowing function clauses in SQL Server 2012, it might have been necessary to code this with some messy recursive function calls. The code is much shorter and more legible and the performance is probably much better than any alternatives, thanks to the single ROWS UNBOUNDED PRECEDING clause, which really saved my bacon out of the fire this time. Whenever the subject comes up, I never miss an opportunity to plug Itzik Ben-Gan’s classic reference Microsoft SQL Server 2012 High-Performance T-SQL Using Window Functions,[2] which invariably turns out to be useful in unexpected binds like this. I’ve made it a point to learn as much as I can about the largely untapped potential of these new T-SQL clauses, but apparently I need to redouble my efforts, judging on how indispensable they were in this particular situation.

Figure 1: Code for the Modified Thompson Tau Procedure

CREATE PROCEDURE [Calculations].[ModifiedThompsonTauTestSP]
@DatabaseName as nvarchar(128) = NULL, @SchemaName as nvarchar(128), @TableName as nvarchar(128),@ColumnName AS nvarchar(128), @PrimaryKeyName as nvarchar(400), @DecimalPrecision AS nvarchar(50), @Alpha decimal(38,35) = 0.05
AS

DECLARE @SchemaAndTableName nvarchar(400), @SQLString nvarchar(max)
SET @DatabaseName = @DatabaseName + ‘.’
SET @SchemaAndTableName = ISNull(@DatabaseName, ) + @SchemaName + ‘.’ + @TableName –I’ll change this value one time, mainly for legibility purposes

SET @SQLString = ‘DECLARE @Alpha  decimal(5,4)
SET @Alpha = ‘ + CAST(@Alpha AS nvarchar(50)) + SELECT ‘ + @PrimaryKeyName + ‘, ‘ + @ColumnName + ‘, AbsoluteDeviation, RejectionRegion, ”IsOutlier” = CASE WHEN AbsoluteDeviation > RejectionRegion THEN 1 ELSE 0 END
FROM       (SELECT ‘ + @PrimaryKeyName + ‘, ‘ + @ColumnName + ‘, RN, AbsoluteDeviation, (RejectionRegionInput * (RN – 1)) / ((Power(RN, 0.5) * Power(RN – 2 + Power(RejectionRegionInput, 2), 0.5))) AS RejectionRegion
       FROM (SELECT ‘ + @PrimaryKeyName + ‘, ‘ + @ColumnName + ‘, RN, AbsoluteDeviation, DataMiningProjects.Calculations.FindCriticalRegionForTDistributionFunction (RN, 1, 0.05) AS RejectionRegionInput
                     FROM
                   (SELECT ‘ + @PrimaryKeyName + ‘, ‘ + @ColumnName + ‘, CAST(ROW_NUMBER() OVER (ORDER BY ‘ + @ColumnName + ‘ ASC) AS bigint) AS RN,
                    Abs( + @ColumnName + ‘ – Avg(CAST(‘ + @ColumnName + ‘ AS Decimal(‘ + @DecimalPrecision + ‘)))  OVER (ORDER BY ‘ + @ColumnName + ‘ ASC ROWS UNBOUNDED PRECEDING)) /   StDev(CAST(‘ + @ColumnName + ‘ AS Decimal(‘ + @DecimalPrecision + ‘)))  OVER (ORDER BY ‘ + @ColumnName + ‘ ASC ROWS UNBOUNDED PRECEDING) AS AbsoluteDeviation
                          FROM ‘ + @SchemaAndTableName +
                          WHERE ‘ + @ColumnName + ‘ IS NOT NULL) AS T1
                     WHERE AbsoluteDeviation IS NOT NULL) AS T2) AS T3
ORDER BY IsOutlier DESC, AbsoluteDeviation DESC, RejectionRegion, ‘ + @ColumnName + ‘, ‘ + @PrimaryKeyName +

–SELECT @SQLStringuncomment this to debug string errors
EXEC (@SQLString)

…………To avoid divide-by-zero errors when the standard deviation is zero, I borrowed the NullIf trick from a post by Henrik Staun Poulsen at StackExchange (which probably saved me a lot of time trying to devise an answer of my own, since I was clueless about NullIf).[3] Most of the rest of the code follows the same format I’ve used in previous procedures, with the usual @DecimalPrecision parameter available to avoid arithmetic overflows, the usual set of parameters and implementation code allowing users to select a column in any database for which they have permissions, etc.  The messiest part is the use of subqueries to bubble up previous computations to higher levels, which I’m considering recoding as a series of sequential CTEs, if that would be easier for people to follow. As always, I’m using a Calculations schema that users can change and haven’t included code for accommodating spaces in object names or to prevent SQL injection. My usual disclaimer is still operative: I’m posting this in order to learn, not because I know what I’m talking about, so test my code before putting it into production. This is an introduction to the topic with suggestions on how to code it, not the last word. I’m not 100 percent sure I’ve got the order of operations correct in the second subquery or that I’m not supposed to use the @Alpha value squared instead of the rejection region; it’s likewise possible I ought to be feeding the count minus one to the degrees of freedom parameter of the T-distribution function, rather than the simple count. A T-SQL expert might also be able to suggest numerous ways of improving the performance, given that the query’s still a resource hog even with the ROWS UNBOUNDED clause.

Figure 2: Results for the Modified Thompson Tau Procedure

EXEC   [Calculations].[ModifiedThompsonTauTestSP]
             @DatabaseName = N’DataMiningProjects,
             @SchemaName = N’Physics,
            @TableName = N’HiggsBosonTable,
              @ColumnName = N’Column1′,
              @PrimaryKeyName = N’ID’,
             @DecimalPrecision = N’ 33,29′,
              @Alpha = 0.05

Modified Thompson Tau Test

…………The results in Figure 2 depict outliers for the Creatine Kinase enzyme, which plays in a role in the Duchennes form of muscular dystrophy, which is the subject of a tiny 209-row dataset I downloaded from the Vanderbilt University’s Department of Biostatistics for use as practice data in this series.Whenever performance has been an issue with procedures I’ve posted previously in this series, I’ve stress-tested them against the first float column of the 11-million-row Higgs Boson dataset, which I downloaded from the he University of California at Irvine’s Machine Learning Repository and converted to a nearly 6-gigabyte SQL Server table. That test took an hour and fifteen minutes on my poor beat-up six-core development machine, which is probably several orders of magnitude slower than a real database server yet still sluggish enough to make me worry about the performance in a live environment. On the other hand, this kind of complete examination of an entire column really amounts to a crude data mining operation, of the kind that would normally be run occasionally on off-peak hours. It might thus be sufficient to get the job done as it is, although I’m sure a real T-SQL coder could advise on several ways of getting around the single Sort operation that gobbled up 95 percent of the cost of the execution plan (which I won’t bother to post, because there’s nothing more to the story except that Sort). I have the distinct impression, however, that the ROWS UNBOUNDED PRECEDING method is as close to an optimal method of recursively recalculating the aggregates as we’re going to get.
…………Either way, I feel like I made a unique contribution for the first time in this series, by adapting to SQL Server use cases a statistical test that is often misused, even when applied to its usual scenarios in hypothesis testing. The results in Figure 2 allow DBAs and data miners to make decisions based on a bird’s eye view of the potential outliers in a dataset, without wantonly deleting them in knee-jerk manner. Like the other five outlier detection methods I’ve segregated in this part of the tutorial series, it is still based on hypothesis testing methods that retard its usability, like the requirement of a Gaussian distribution and the difficulties of finding or calculating T-distribution lookup tables for degrees of freedom far in excess of 200 rows. Before proceeding with outlier identification methods that are more likely to be profitable to SQL Server DBAs, like Interquartile Range, Peirce’s Criterion, Cook’s Distance, Mahalanobis Distance and various visual means that can be displayed in Reporting Services, I’ll finish out this segment with a discussion of one of the oldest means, Chauvenet’s Criterion. Like the others in this subset, its usefulness is severely curtailed by its dependence on a normal distribution and the small size of the available lookup tables. Furthermore, it is also implemented in an inherently recursive manner, which brings with it the same performance and logical validity issues that the Modified Thompson Tau test does. I’ll attempt to code it in the next installment of this series anyways, for the sake of completeness and the possibility that a SQL Server user out there might find a use for it – as well as to gain some further experience in translating stats and math equations into code, while passing on my misadventures in T-SQL to others in the hopes that neither I nor they will repeat my cautionary tales.

[1] It is mentioned in the Wikipedia webpage “Outlier,” at the web address http://en.wikipedia.org/wiki/Outlier.

 

[2] Ben-Gan, Itzik, 2012, Microsoft SQL Server 2012 High-Performance T-SQL Using Window Functions . O’Reilly Media, Inc.: Sebastopol, California.

[3] Poulsen, Henrik Staun, 2009, reply to the thread “How to Avoid the “Divide by Zero” Error in SQL?” published May 14, 2009 on the StackOverflow.com website. Available online at http://stackoverflow.com/questions/861778/how-to-avoid-the-divide-by-zero-error-in-sql

Outlier Detection with SQL Server, part 3.4: Dixon’s Q-Test

By Steve Bolton

…………In the last three installments of this amateur series of mistutorials on finding outliers using SQL Server, we delved into a subset of standard detection methods taken from the realm of statistical hypothesis testing. These are generally more difficult to apply to tables of thousands of rows, let alone the billions or even trillions commonly associated with the buzzword “Big Data,” for a number of reasons. First, many of them are invalid if the data doesn’t follow a normal distribution, which requires goodness-of-fit testing that can be expensive on large datasets. Secondly, many of them also depend on comparisons to Student’s T and other distributions in order to define a data point as an outlier, but the lookup tables widely available in texts and on the Internet generally stop after sample sizes of a couple of hundred at best. Calculating these for the much larger datasets that DBAs work with is likely to be computationally costly, especially in the case of last week’s topic, the Tietjen-Moore test. Typically, they are used to give a more rigorous numerical definition of an outlier in small dataset of a few dozen or a few hundred data points, which is at least an improvement over simply spotting them by eye in a scatter plot or some other form of data visualization. Hypothesis testing certainly has valid uses when applied to its proper use case, which is proving a single point of evidence, not ferreting out data quality problems or the kind of exploratory data mining DBAs are likely to do. Even then, there are many pitfalls to watch out for, including common misconceptions about probabilistic reasoning and terms like “confidence” and “statistical significance.” The manner in which alpha values are selected to define confidence intervals is also somewhat capricious. I am more confident in hanging my hat on measures like Minimum Description Length and Kolmogorov Complexity which are more deeply rooted in reason, but I’ll have to defer discussion of these for a future series tentatively titled Information Measurement with SQL Server, since they’re not applicable to outlier detection. Despite these caveats, I’ll finish this detour into outlier detection methods dependent on hypothesis testing before getting back on track with topics like Interquartile Range and Cook’s Distance that will probably prove to be more useful to DBAs.
…………For the sake of completeness and finishing what I started, I’ll give a quick rundown of Dixon’s Q-Test, which suffers from many of the limitations listed above. It too is invalid when applied to a dataset that does not follow a Gaussian or “normal” distribution, i.e. a bell curve. The test statistic derived from it must also be compared to a particular distribution, which is much more difficult to find reference lookup tables for than the ubiquitous T-distribution. The DDL in Figure 1 was used hold the critical values I inserted from the only source I could find during a short search, from a webpage at the University of Göttingen’s Department of Sedimentology and Environmental Geology.[i] This particular lookup table only goes up to 25 degrees of freedom, so we can only apply it to datasets with that many rows. Yet the limitations do not end there. As discussed a few columns ago, Grubbs’ Test can only be applied to a single row at a time; Dixon’s Q-Test is even more restrictive, in that it can only be applied to a dataset once, to detect a single outlier. As its Wikipedia entry states, “This assumes normal distribution and per Dean and Dixon, and others, this test should be used sparingly and never more than once in a data set.”[ii] If the 25-record limit wasn’t a fatal blow to its usability, then the single-use criterion certainly delivers the coup de grace. Nevertheless, I’ll provide the stored procedure in Figure 2 for anyone who finds a need for it:

Figure 1: DDL for the Dixon’s Q-Test Critical Value Table
CREATE TABLE [Calculations].[DixonsQTestTable](
[ID] [bigint] IDENTITY(1,1) NOT NULL,
[N] [tinyint] NULL,
[Alpha10] [decimal](4, 3) NULL,
[Alpha05] [decimal](4, 3) NULL,
[Alpha02] [decimal](4, 3) NULL,
[Alpha01] [decimal](4, 3) NULL,
[Alpha005] [decimal](4, 3) NULL,
CONSTRAINT [PK_DixonsQTestTable] PRIMARY KEY CLUSTERED
([ID] ASC)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE
= OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

Figure 2: Code for the Dixon’s Q-Test Procedure
ALTER PROCEDURE [Calculations].[DixonsQTestSP]
@DatabaseName as nvarchar(128) = NULL, @SchemaName as nvarchar(128), @TableName as nvarchar(128),@ColumnName AS nvarchar(128), @PrimaryKeyName as nvarchar(400), @OrderByCode as tinyint = 1, @DecimalPrecision AS nvarchar(50), @Alpha decimal(38,35) = 0.05
AS

SET @DatabaseName = @DatabaseName + ‘.’
DECLARE @SchemaAndTableName nvarchar(400)
SET @SchemaAndTableName = ISNull(@DatabaseName, ) + @SchemaName + ‘.’ + @TableName

DECLARE @SQLString nvarchar(max)
SET @SQLString = DECLARE @Mean decimal( + @DecimalPrecision + ‘), @Range  decimal(‘ + @DecimalPrecision + ‘), @Count decimal( + @DecimalPrecision + ‘), @CriticalValue decimal( + @DecimalPrecision + ‘), @Alpha
decimal(
+ @DecimalPrecision + ‘), @OrderByCode tinyint
S
ET @OrderByCode  = + CAST(@orderByCode AS nvarchar(50)) +
SET @Alpha = ‘ + CAST(@Alpha AS nvarchar(50)) +

SELECT @Range = Max(CAST(‘ + @ColumnName + ‘ AS decimal(‘ + @DecimalPrecision + ‘))) – Min(CAST(‘ + @ColumnName + ‘ AS decimal(‘ + @DecimalPrecision + ‘))), @Count=Count(CAST(‘ + @ColumnName + ‘ AS Decimal(‘ + @DecimalPrecision + ‘)))
FROM ‘ + @SchemaAndTableName +
WHERE ‘ + @ColumnName + ‘ IS NOT NULL

SELECT @CriticalValue = CASE
      WHEN @Alpha = 0.1 THEN Alpha10
      WHEN @Alpha = 0.05 THEN Alpha05
      WHEN @Alpha = 0.02 THEN Alpha02
      WHEN @Alpha = 0.01 THEN Alpha01
      WHEN @Alpha = 0.005 THEN Alpha005
      ELSE NULL
      END
FROM Calculations.DixonsQTestTable
WHERE N = @Count

SELECT ‘ + @PrimaryKeyName + ‘, ‘ + @ColumnName + ‘, Gap, TestStatistic, @CriticalValue AS Critical’ + @ColumnName + ‘, @Alpha AS Alpha, IsOutlier” = CASE WHEN TestStatistic > @CriticalValue THEN 1 WHEN TestStatistic <= @CriticalValue THEN 0 ELSE NULL END
FROM (SELECT ‘ + @PrimaryKeyName + ‘, ‘ + @ColumnName + ‘, Gap, Gap / @Range AS TestStatistic
       FROM  (SELECT ‘ + @PrimaryKeyName + ‘, ‘ + @ColumnName + ‘, Lead(+ @ColumnName + ‘, 1, 0) OVER (ORDER BY ‘ + @ColumnName + ‘) – ‘ + @ColumnName + ‘ AS Gap
      FROM ‘ + @SchemaAndTableName +
      WHERE ‘ + @ColumnName + ‘ IS NOT NULL) AS T1) AS T2
ORDER BY CASE WHEN @OrderByCode = 1 THEN ‘ + @PrimaryKeyName + ‘ END ASC,
CASE WHEN @OrderByCode = 2 THEN ‘ + @PrimaryKeyName + ‘ END DESC,
CASE WHEN @OrderByCode = 3 THEN ‘ + @ColumnName + ‘ END ASC,
CASE WHEN @OrderByCode = 4 THEN ‘ + @ColumnName + ‘ END DESC,
CASE WHEN @OrderByCode = 5 THEN TestStatistic END ASC,
CASE WHEN @OrderByCode = 6 THEN TestStatistic END DESC

–SELECT @SQLStringuncomment this to debug string errors
EXEC (@SQLString)

…………There’s not much going on here in comparison to some of the more complex procedures I’ve posted recently. The first five parameters allow users to perform the test on any column in any database for which they have requisite permissions. The @DecimalPrecision is a parameter I’ve added to many of my procedures to enable users to escape from arithmetic overflows, while the @OrderByCode takes the same values as in other tutorials: the value 1 orders the results by the primary key ascending, 2 by the same descending, 3 and 4 by the column name ascending or descending respectively, and 5 and 6 order them by the TestStatistic in either direction. Most of the procedure consists of implementations of the @OrderByCode, aggregate retrievals and column selection that I’ve merely cut and pasted from past procedures. The logic of the test statistic itself is quite simple: use the T-SQL Lead windowing function to find the previous row, calculate the gap and then sort by it. The test statistic is merely the gap divided by the range.[iii] Just like with the Tietjen-Moore test in last week’s article, this is probably of greater utility for our use case scenarios than a comparison of a single test statistic to the critical value; for that reason, I’ve returned all 25 rows of a view on the Hemopexin column of the 209 rows of the Duchennes dataset, derived from research on a form of muscular dystrophy that Vanderbilt University’s Department of Biostatistics has made publicly available. These are ordered by the gap and test statistic, which tell us more about the data distribution of the view than a single outlier test would do.

Figure 3: Results for a View on the First 25 Values for the Hemopexin Column
EXEC [Calculations].[DixonsQTestSP] @DatabaseName = N’DataMiningProjects, @SchemaName = N’Practice, @TableName = N’Top25HemopexinView’, @ColumnName = N’Hemopexin, @PrimaryKeyName = N’ID’, @OrderByCode = 6, @DecimalPrecision = N’12,8′, @Alpha = 0.1

DixonsQTestResults
…………
There are no outliers in this particular view, according to the comparison made against the critical values cited in the University of Göttingen’s source. It did correctly identify an outlier when I tested it against the example data provided in the Wikipedia entry. The code also works quite quickly, as can be expected for a view of just 25 records; that is why I won’t bother to post execution plans or test it against the much larger Higgs Boson dataset as we have done in previous tutorials to look for performance bottlenecks. It may work as designed, but it is probably more useful when the gaps and test statistics for all of the rows are provided, as depicted in Figure 3. Even when adapted in this way, however, it is of still of little practical utility for datasets with thousands of records, in large part because we don’t have a means of deriving critical values for that many rows. One strategy might be to define a view on a subset of data as I have done above, quite arbitrarily. On the other hand, taking tiny samples of large datasets, even when properly randomized, doesn’t do us much good if our most common purposes are finding and fixing all rows affected by data quality issues, or doing exploratory data mining. When we’re dealing with datasets of billions of records; our main use case is to devise procedures that will ferret out as many of them as we can find, as efficiently as we can – which means getting them all in one pass if possible, not looking for one per test as we do with Grubbs and Dixon. The latter is even more restrictive, because according to the developers of the test itself, it ought not be applied more than once to any dataset. We’re not just limited to testing one outlier in a single pass, but to a single pass, forever. That is obviously not as useful as familiar tests like Z-Scores, which can be applied as often as we like to an entire database. In the next installment of this series we’ll discuss the Modified Thompson Tau test, which is more sophisticated in that it marries Z-Scores to some of the hypothesis testing logic underpinning the last few tutorials. I thought it would turn out to be a shotgun wedding, but it turns out that this test can be enhanced by returning all of the values involved, just as Dixon’s Q-Test can be made mildly more useful in the same way. Such adjustments might be called for in the cases of many of the outlier detection methods based on hypothesis testing, since we’re using them for quite different purposes than what they were designed for. The statistical tools introduced in this segment of the series might not be as useful on an everyday basis to DBAs as upcoming topics like Interquartile Range, Peirce’ Criterion, Cook’s Distance or Mahalanobis Distance, but there’s no reason to remove them from our tool belts if they can be adusted to work with rare use cases that we might eventually encounter.

[i] See the webpage titled “Out?Lier”at the website of the Geowissenschaftliches Zentrum der Universität Göttingen’s Department of Sedimentology and Environmental Geology, which is available at the web address http://www.sediment.uni-goettingen.de/staff/dunkl/software/o_l-help.html

[ii] See the Wikipedia page “Dixon’s Q-Test,” available online at http://en.wikipedia.org/wiki/Dixon%27s_Q_test

[iii] IBID.

Outlier Detection with SQL Server, part 3.3: The Limitations of the Tietjen-Moore Test

By Steve Bolton

…………The Tietjen-Moore test may have the coolest-soundest name of any of the outlier detection methods I’ll be surveying haphazardly in this amateur series of mistutorials, yet it suffers from some debilitating limitations that may render it among the least useful for SQL Server DBAs. It is among a set of six methods that I’ll be dispensing with quickly in the middle of this series because they’re designed to perform hypothesis testing on small datasets of a few dozen or a few hundred records, not the thousands commonly found in the smallest SQL Server databases, let alone the billions or even trillions associated with the Big Data buzzword. By the end of the series, I may be capable of providing a matrix in which the various means of finding aberrant data points are divided by their use cases and the questions users want to ask of the data, followed by the number and types of inputs and outputs and their mathematical properties, as well as the means of calculation in between those two steps if it is relevant to performance or adequate access. Any associated workflow ought to include steps to define outliers with rigorous logic, then assess the underlying causes of any that are found to fit those criteria before proceeding to the final step, matching them to a proper response, whether it be deletion in the case of certain data quality issues or elation if outliers happen to be a positive outcome according to the context. As discussed at length in previous articles, a failure in any one of these areas can be worse than an incorrect calculation, in terms of misleading conclusions or even unethical responses. The set of hypothesis testing methods that Tietjen-Moore belongs to would occupy a quite narrow range of such a matrix, since they have numerous constraints that make them inapplicable in many situations, such as the requirement of a Gaussian or “normal” distribution (i.e. a bell curve) and the requisite goodness-of-fit testing to prove it. Within this subset, the Tietjen-Moore test is actually more restrictive in its own way than many others – even Grubbs’ Test (which Tietjen-Moore is derived from) and Dixon’s Q-Test, which can only be used to identify a single outlier. It too returns a single result, but which is merely a Boolean yes/no answer to the question, “Does the dataset contain exactly n number of outliers?” The Generalized Extreme Studentized Deviate Test (GESD), the topic of the last article, can also be used to ascertain how many outliers a dataset has, but is more useful because it does not require the user to guess the exact number in advance.
…………The chances of guessing the correct number are much greater in the kinds of relatively small datasets used in hypothesis testing, but obviously quite difficult even in a conventional SQL Server database with millions of more rows. Furthermore, the performance penalties for repetitively retesting a dataset until the correct number is arrived at is obviously much higher with a larger database, even if we don’t factor in the obvious need to also test it many more times over. Moreover, as mentioned in the last couple of articles, many of the outlier detection methods drawn from the realm of hypothesis testing require looking up critical regions for common data patterns like Student’s T-distribution. Unfortunately, the lookup tables widely available on the Internet normally stop at a few hundred rows and are often riddled with gaps, but calculating the missing values needed for millions of records can be both computationally costly and intellectually draining. The description of the Tietjen-Moore Test in the National Institute for Standards and Technology’s Engineering Statistics Handbook (one of the most readable online sources of information on statistics) indicates that such lookup tables are not readily available in this case: “The critical region for the Tietjen-Moore test is determined by simulation. The simulation is performed by generating a standard normal random sample of size n and computing the Tietjen-Moore test statistic. Typically, 10,000 random samples are used. The value of the Tietjen-Moore statistic obtained from the data is compared to this reference distribution.”[1] To be representative of a much larger dataset like those found in SQL Server tables, these random samples would have to be much larger than those normally performed in hypothesis testing; taking 10,000 of these larger samples would add to the performance costs, which already include ferreting out the exact number of outliers in advance, plus rigorous goodness-of-fit testing in beforehand to discern whether or not the necessary bell curve is operative or not. That is why I enabled users to supply their own value through the @CriticalValueLowerTail parameter in Figure 1, which implements the test in a T-SQL stored procedure. In Figure 2 I set an arbitrary value of 0.3, and since the value of the test statistic in that case was higher, the hypothesis that the dataset contains exactly 20 outliers was rejected. The test statistic returned by the procedure may still be of use, however, since it can be interpreted thus even without the critical region: “The value of the test statistic is between zero and one. If there are no outliers in the data, the test statistic is close to 1. If there are outliers in the data, the test statistic will be closer to zero.”[2]

Figure 1: Code for the Tietjen-Moore Test
CREATE PROCEDURE [Calculations].[TietjenMooreTestSP]
@DatabaseName as nvarchar(128) = NULL, @SchemaName as nvarchar(128), @TableName as nvarchar(128),@ColumnName AS nvarchar(128), @PrimaryKeyName as nvarchar(400), @DecimalPrecision AS
nvarchar(50), @Alpha decimal(5,4) = 0.05, @K bigint, @CriticalValueLowerTail decimal(38,35), @DoTwoTailedTest bit = 1
AS
SET @DatabaseName = @DatabaseName + ‘.’
DECLARE @SchemaAndTableName nvarchar(400)
SET @SchemaAndTableName = ISNull(@DatabaseName, ) + @SchemaName + ‘.’ + @TableName

DECLARE @SQLString nvarchar(max), @CommonTestCode nvarchar(max), @MeanConditionCode nvarchar(max), @SelectConditionCode nvarchar(max), @TempCTEOrderByCode nvarchar(max)
SET @CommonTestCode =DECLARE @Mean decimal(‘ + @DecimalPrecision + ‘), @K bigint
SET @K = ‘ + CAST(@K AS nvarchar(50)) +
SELECT @Mean = Avg(CAST(‘ + @ColumnName + ‘ AS Decimal(‘ + @DecimalPrecision + ‘)))
FROM ‘ + @SchemaAndTableName +
WHERE ‘ + @ColumnName + ‘ IS NOT NULL’

SELECT @MeanConditionCode = CASE WHEN @DoTwoTailedTest = 1 THEN ‘RN < @K’ ELSE ‘RN >= @K’ END
SELECT @SelectConditionCode = CASE WHEN @DoTwoTailedTest = 1 THEN ‘RN > MaxRN – @K’ ELSE ‘RN < @K’ END
SELECT @TempCTEOrderByCode = CASE WHEN @DoTwoTailedTest = 1 THEN  ‘ORDER BY ABS(‘ + @ColumnName + ‘ – @Mean)’ ELSE ‘ORDER BY ‘ + @ColumnName + END

SET @SQLString = @CommonTestCode +
‘; WITH TempCTE
(‘ + @PrimaryKeyName + ‘, ‘ + @ColumnName + ‘, RN, MaxRN)
AS
(
SELECT ‘ + @PrimaryKeyName + ‘, ‘ + @ColumnName + ‘, RN, Max(RN) OVER (ORDER BY RN DESC) AS MaxRN
FROM (SELECT ‘ + @PrimaryKeyName + ‘, ‘ + @ColumnName + ‘,  ROW_NUMBER() OVER (‘ + @TempCTEOrderByCode + ‘) AS RN
FROM ‘ + @SchemaAndTableName +
W
HERE ‘ + @ColumnName + ‘ IS NOT NULL) AS T1),
NewMeanCTE
(NewMean)
AS
(
SELECT Avg(CAST(‘ + @ColumnName + ‘ AS Decimal(‘ + @DecimalPrecision + ‘)))
FROM TempCTE AS T1
WHERE ‘ + @MeanConditionCode + ‘ — computes a different mean based on a subset of the dataset
) 

SELECT TestStatistic, ‘ + CAST(@Alpha AS nvarchar(50)) + ‘ AS Alpha, ‘ + CAST(@CriticalValueLowerTail AS
nvarchar(50)) + ‘ AS CriticalValueLowerTail, ”HasOutliers” = CASE WHEN TestStatistic < ‘ + CAST(@CriticalValueLowerTail AS
nvarchar(50)) + ‘ THEN 1 WHEN TestStatistic >= ‘ + CAST(@CriticalValueLowerTail AS nvarchar(50)) + ‘ THEN 0 ELSE NULL END
FROM (SELECT TOP 1 TopOperand / BottomOperand AS TestStatistic
       FROM (SELECT SUM(CASE WHEN ‘ + @SelectConditionCode +
‘ THEN 0 ELSE Power((CAST(‘ + @ColumnName + ‘ AS Decimal(‘ + @DecimalPrecision + ‘)) – (SELECT NewMean FROM
NewMeanCTE)), 2) END) OVER (ORDER BY RN) AS TopOperand,  SUM(Power((‘
+ @ColumnName + ‘ – @Mean), 2)) OVER (ORDER
BY RN ASC) AS BottomOperand, RN, (SELECT NewMean FROM NewMeanCTE) AS NewMean, ‘

+ @ColumnName +
      FROM TempCTE AS T1) AS T2
       ORDER BY RN DESC) AS T2
–SELECT @SQLString — uncomment this to debug string errors
EXEC (@SQLString)

 …………Readers of other installments in this series should easily spot some commonalities between the code in Figure1 and that of other procedures I’ve posted in recent weeks. The procedure resides in a schema named Calculations, which you can change; to keep it short and simple, I left out SQL Injection and security code; I don’t allow spaces in my object names, so you’ll have to add such logic yourself; the first five parameters allow you to perform the test on any column in any database for which you have adequate access. The column must of course have a numerical data type, whose upper and lower ranges ought to be kept in mind when setting the usual @DecimalPrecision value to avoid arithmetic overflows. Most of the differences from other recent procedures revolve around the @DoTwoTailedTest parameter, which calls for separate greater-than and less-than comparisons that are implemented in five strings that are appended or prepended to the dynamic SQL. Usually I use just a single @SQLString variable for this purpose, which can be debugged by uncommenting the next to last line. The extra common table expression (CTE) is also needed to handle the differences between the two test variants. As usual, the procedure ends with a few nested subqueries that look more intimidating than they really are; they’re only needed to bubble up the results of computations so they can be operated on further, before returning the complete results to the user. Sometimes this is strictly necessary because the values can’t get calculated in one fell swoop, while in others it often makes the code more legible and either to debug, rather than cramming a dozen different arithmetic operations and confusing parentheses in one line.

Figure 2: Results for the Tietjen-Moore Procedure

EXEC   [Calculations].[TietjenMooreTestSP]
            @DatabaseName = N’DataMiningProjects’,
            @SchemaName = N’Health’,
             @TableName = N’DuchennesTable’,
             @ColumnName = N’LactateDehydrogenase’,
             @PrimaryKeyName = N’ID’,
             @DecimalPrecision = N’38,21′,
             @Alpha = 0.05,
             @K = 20,
             @CriticalValueLowerTai = 0.3,
             @DoTwoTailedTest = 1

Tietjen-Moore Results

…………After validating the two-tailed version of the procedure against the sample data at the NIST webpage, I ran it against the LactateHydrogenase column of the DuchennesTable, which is derived from a 209-row dataset on the Duchennes form of muscular dystrophy that is published online by the Vanderbilt University’s Department of Biostatistics. There was little incentive to stress test it on the other two datasets I’ll be using for the rest of this tutorial series, one of which has more than 11 million rows. I’ve omitted the client statistics and execution plans I’ve often included in previous articles because the performance hits for running the test are minimal – at least when we omit the 10,000 random samples to establish the critical region and can supply a reasonable guess for the number of outliers to test for through the @K parameter.
…………Omitting the goodness-of-fit tests needed to prove that the dataset follows a normal distribution is not a good idea, however, no matter what the computational expense is. Without that precursor, the Tietjen-Moore test can be misleading, just like many of the other outlier detection methods I’m dispensing with in this segment of the series. Even when we’re certain the data ought to follow a bell curve, the main use scenario for the Tietjen-Moore test for SQL Server DBAs and data miners is likely to be restricted to merely interpreting the test statistic. The critical regions are just too computationally expensive to calculate for such a limited use. The main problem, however, is that correctly divining the correct number of outliers in advance among tens of thousands of records is like looking for a needle in a haystack. In fact, it may be worse, because once we’ve found the needle, all the test tells us is that we have indeed found it. Like many of the other means of outlier identification I’m dispensing with in this segment of the series, Tietjen-Moore is more suited for hypothesis testing, in which researchers attempt to rigorously prove a specific point about a small data sample. As I wade through this series, learning as I go, I’m beginning to realize that the kind of much larger datasets that DBAs and Big Data miners work with these days may call for new methods of outlier identification. Some of the means we’ve already discussed can be used for our scenarios, like Z-Scores and Benford’s Law, but many of the others that statisticians and researchers routinely utilize for hypothesis testing have limited applicability for our use cases, which normally begin with ferreting out data quality problems and exploratory data mining.
…………Once I’ve gotten other methods in the same class, like the Modified Thompson Tau Test and Chauvenet’s Criterion, out of the way I’ll delve into others like Interquartile Range, Peirce’s Criterion, Cook’s Distance and Mahalanobis Distance that are more likely to be of use for our scenarios. The same holds true of the segment I’ll do later on Visual Outlier Detection with Reporting Services, in which I’ll demonstrate how to spot outliers with the naked eye. The hypothesis-test set of outlier detection methods are often quite useful in rigorously confirming that the out-of-place data points spotted using eye candy like histograms and scatter plots are indeed outliers. I have some reservations about hypothesis testing even when applied to its typical use cases, like the arbitrariness of commonly assigned alpha values, the frequency with which such concepts as “confidence” and “statistical significance” are abused and the infrequency with which many researchers apply goodness-of-fit tests. Misunderstandings about randomness and probabilistic methods also abound, even in hard sciences like physics where such math concepts are used every day. I’m hardly well-versed in these matters, but I have the impression that the proofs are not as rigorous and trustworthy as those established by other subdisciplines like computational complexity and measures like Minimum Description Length (MDL) and Kolmogorov Complexity, which I hope to one day be able to code in a future tutorial series, Information Measurement with SQL Server. That series will likely be of practical use to DBAs and data miners, just as the second half of this series on outlier detection is certain to be. For the sake of completeness, however, I’ll first finish off this detour into methods based on hypothesis testing by hurrying through Dixon’s Q-Test, Modified Thompson Tau Test and Chauvenet’s Criterion. After that, we’ll get back on track with outlier detection methods that are more suited to our use cases.

[1] See National Institute for Standards and Technology, 2014,  “1.3.5.17.2.Tietjen-Moore Test for Outliers,” published in the online edition of the Engineering Statistics Handbook. Available at http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h2.htm

[2] IBID.

Outlier Detection with SQL Server, part 3.2: GESD

By Steve Bolton

…………In the last edition of this amateur series of self-tutorials on finding outlying values in SQL Server columns, I mentioned that Grubbs’ Test has a number of limitations that sharply constrain its usefulness to DBAs. The Generalized Extreme Studentized Deviate Test (GESD) suffers from some of the same restrictions – most notably the fact that it is only applicable to datasets that have a Gaussian (a.k.a. “normal”) distribution, better known as the bell curve. Nonetheless, the GESD applies to a wider set of use cases because it can be applied to find more than one outlier, unlike the Grubbs’ Test it is derived from. It is “essentially Grubbs test applied sequentially,” with some tweaks applied to its test statistics and critical values to avoid such problems as false negatives with weak stopping criteria.[1] Grubbs’ Test is sometimes performed recursively by deleting aberrant data points one at a time, which raises the whole morass of ethical issues I’ve harped on previous articles, like the far too common tendencies among researchers to classify outliers arbitrarily, subjectively change those classifications (i.e. “moving the goalposts”) and fail to investigate the causes of aberrant values and matching them to appropriate responses. As mentioned previously, outliers can be differentiated by their use cases, numbers and types of inputs, the numbers and types of outputs and their mathematical properties and the means of calculation applied in between the input and output stage, if that would affect performance. A failure in any one of these areas can be worse than an incorrect calculation. An outlier workflow also ought to include rigorous goodness-of-fit testing to make sure that only tests appropriate to the underlying distribution are applied, plus proper classification procedures and criteria, guarding against redefinition (i.e. “changing horses in midstream”) and matching the underlying causes of outliers with appropriate responses. Not all outliers are the result of faulty data collection and storage, so simply deleting aberrant data points as recursive application of the Grubbs’ Test implies is often inappropriate, or even unethical. There is little reason to use it when a test like GESD is available to ameliorate some of these associated drawbacks and temptations The GESD test is still quite limited in comparison to other outlier detection methods we’ll survey here, since it is normally only applied to look for small numbers of aberrant data points, but at least we’re not limited to finding a single one as we are with the Grubbs’ Test. The performance costs for both are quite trivial. Another benefit of the GESD is that we can reuse much of the code from last week’s mistutorial, thereby simplifying the development and troubleshooting processes.
…………We can thank Harvard Biostatistics Prof. Bernard Rosner for this improvement to the Grubbs’ Test, which was published in an issue of the statistical journal Technometrics back in 1983.[2] I was unable to get ahold of the this paper, unlike the original publications for Grubbs’ Test, but I was able to find the formulas at the National Institute for Standards and Technology’s Engineering Statistics Handbook, which is one of the most readable online sources of information on statistics for amateurs like myself. I’d wager that most readers will find a pint of Guinness a lot less boring than these underlying equations, so it is worth noting that we can thank a brewer of Irish beer for some of these formulas. The “Studentized” part of the name is derived from the “Student’s T-distribution” it is dependent on, which I always vaguely pictured as being derived from some sort of high school or college math test; the real story is more colorful, in that it is was the product of a 1908 Biometrick article by William Sealy Gosset, who chose the pen name “Student” after his employer, the Guinness brewery in Dublin, required him to publish under a pseudonym.[3] Just something more colorful that degrees of freedom and critical values to meditate on the next time you’ve had one too many draughts of Guinness. I’m not sure why the GESD calculations, like those of Grubbs’ Test, are dependent on both the T-distribution and the bell curve, although one of my missions when starting this series was to grasp the underlying mechanics and logic of the formulas, which is something I was deficient in when writing my last series of tutorials, on SQL Server Data Mining (SSDM). I am trying to acquire the skill to translate equations into T-SQL, Multidimensional Expressions (MDX) and Visual Basic (VB) code as quickly as possible, but sometimes still struggle to get the formulas right. Some caution is in order here, because the code in the figure below only matched the first four results for Rosner’s 54 sample data points, in the example calculation at the NIST webpage. I had some issues following the order of operations in the NIST’s equations at first, but it is odd that they would only be off by a hundredth of a decimal point for the first four results, then diverge after that. This ought to refocus attention on one of the main caveats associated with any of my tutorial series: always check my code before putting it into production, because I’m writing this in hopes of learning, not because I already know what I’m doing; I’m only publishing it in order to sharpen my thinking further and in the hopes that others might gain something from my misadventures.

Figure 1: Code for the GESD Test Procedure
CREATE PROCEDURE [Calculations].[GESDTestSP]
@DatabaseName as nvarchar(128) = NULL, @SchemaName as nvarchar(128), @TableName as nvarchar(128),@ColumnName AS nvarchar(128), @PrimaryKeyName as nvarchar(400), @Alpha decimal(38,35) = 0.05, @NumberOfTests bigint = 10
AS

SET @DatabaseName = @DatabaseName + ‘.’
DECLARE @SchemaAndTableName nvarchar(400), @SQLString nvarchar(max)
SET @SchemaAndTableName = ISNull(@DatabaseName, ) + @SchemaName + ‘.’ + @TableName

SET
@SQLString =
‘DECLARE @Count bigint

SELECT @Count=Count( + @ColumnName + ‘)
FROM ‘ + @SchemaAndTableName +
WHERE ‘ + @ColumnName + ‘ IS NOT NULL

SELECT RN, ‘ + @PrimaryKeyName + ‘, ‘ + @ColumnName + GESDStatistic, P, V, CriticalRegion, IsOutlier” = CASE WHEN GESDStatistic > CriticalRegion THEN 1 ELSE 0 END
FROM (SELECT RN, ‘ + @PrimaryKeyName + ‘, ‘ + @ColumnName + ‘, GESDStatistic, P, V, (Calculations.FindCriticalRegionForTDistributionFunction (V, 1, 1-P) * (@Count – RN)) / Power(((@Count – (RN – (1 + (Power(Calculations.FindCriticalRegionForTDistributionFunction (V, 1, 1-P), 2))))) * (@Count – (RN + 1))), 0.5) AS CriticalRegion
       FROM (SELECT RN, ‘ + @PrimaryKeyName + ‘, ‘ + @ColumnName + ‘, GESDStatistic, 1 – (‘ + CAST(@Alpha AS nvarchar(50)) + ‘ / (2 * (@Count – (RN + 1)))) AS P, ((@Count – RN) – 1) AS V
      FROM (SELECT TOP ‘ + CAST(@NumberOfTests AS nvarchar(50)) + ‘ ROW_NUMBER() OVER (PARTITION BY 1 ORDER BY ‘ + @ColumnName + ‘ DESC) AS RN, ‘ + @PrimaryKeyName + ‘, ‘ + @ColumnName + ‘,
              (‘ + @ColumnName + ‘ – Avg( + @ColumnName + ‘) OVER (PARTITION BY 1 ORDER BY ‘ + @ColumnName + ‘ DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)) / StDev( + @ColumnName + ‘) OVER (PARTITION BY 1 ORDER BY ‘ + @ColumnName + ‘ DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS GESDStatistic
              FROM ‘ + @SchemaAndTableName +
              WHERE ‘
+ @ColumnName + ‘ IS NOT NULL
              ORDER BY ‘ + @ColumnName + ‘ DESC) AS T1) AS T2) AS T3′

–SELECT @SQLStringuncomment this to debug string errors
EXEC (@SQLString)

…………Anyone who has read previous articles in this series should be able to cut through all the gobbledygook in Figure 1 by noticing the commonalities between this T-SQL code and that of previous procedures. Once again, the first five parameters allow users to perform the test on any column in any database for which they have permissions and the first three lines of the procedure make some adjustments to the string names of those parameters for legibility purposes. There is no @OrderByCode as in previous procedures and @DecimalPrecision is likewise not needed. Instead, we must supply values for two new parameters, the @NumberOfTests to perform (which I’ve set to an arbitrary default of zero) and the @Alpha value, which is a core concept in hypothesis testing along with confidence levels, critical values, critical regions, statistical significance and the like. I strongly recommend Will G. Hopkins’ website A New View of Statistics for anyone interested in a readable introduction to these topics, which really aren’t as hard to grasp as they seem – as long as someone explains them in plain English. As usual, the procedure is created in a schema called Calculations, which you can change to your liking; uncommenting the next-to-last line allows you to debug the dynamic SQL; and you’ll have to add your own code to accommodate spaces in object names, which I don’t allow, or to handle SQL injection, which I haven’t included to keep my code short and to the point. This procedure is much shorter than last week’s because I’m reusing the code I already published for the Calculations.FindCriticalRegionForTDistributionFunction, which performs the comparisons against the T-distribution to see if a particular value is an outlier. The code is also much shorter because there’s only one version of the GESD, whereas Grubbs’ Test has two-tailed, lower one-tailed and upper one-tailed versions. As is customary, I retrieve the global statistics once only at the beginning of the dynamic SQL, but in this case, but all we need is a simple record count. As usual, it takes several subqueries to perform all of the calculations required, which forms the bulk of the rest of the dynamic SQL. I’m really quite proud of myself for coding it with the DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING clauses rather than a common table expression (CTE), since one of my goals is to make my T-SQL more efficient by implementing the kind of windowing functions Itzik Ben-Gan discusses in his classic reference Microsoft SQL Server 2012 High-Performance T-SQL Using Window Functions.[4] It seemed to be the most efficient solution to the problem, although performance really isn’t much of an issue with either GESD or Grubbs’. Figure 3 shows only a minimal change in the execution time when a non-clustered index was added, while the execution plans weren’t really noteworthy enough to publish.

Figure 2: Results for the GESD Test Procedure
EXEC   Calculations.GESDTestSP
             @DatabaseName = N’DataMiningProjects,
             @SchemaName = N’Health‘,
             @TableName = N’DuchennesTable,
             @ColumnName = N’PyruvateKinase,
             @PrimaryKeyName= ‘ID’,
             @Alpha = 0.05,
             @NumberOfTests = 15

GESD Results

Figure 3: Client Statistics for the GESD Test Procedure
GESDClientResults

…………If performance had been an issue I would have run the procedure against the first float column in the Higgs Boson Dataset made publicly available by the University of California at Irvine’s Machine Learning Repository, which occupies nearly 6 gigabytes of space in the DataMiningProjects database I’ve created to encompass all of the three practice datasets we’ll be using in this tutorial series. Instead, I ran it against the PyruvateKinase column of a tiny 9-kilobyte dataset on the Duchennes form of muscular dystrophy, published online by Vanderbilt University’s Department of Biostatistics. Since calculations on the same enzyme were also performed in last week’s article, we can easily contrast them. The stark, obvious difference is that we only received one row of output for Grubbs’, whereas Figure 2 flags the first 13 rows as outliers based on the fact that their GESD statistics exceed the values in the T-distribution lookup table for the associated critical region, which is in turn determined by the probability values (P) and the degrees of freedom (V, which is in this case the record count).
…………The EXEC code above it specified a common @Alpha value of 0.05 and a test of 15 values, which is why we received 15 results. This is certainly more useful than returning a single output of the kind we get from Grubbs’ Test, but still of limited utility to DBAs who normally work with tables of thousands or even billions of rows. While writing this series, one of the weaknesses I’ve discovered with applying standard outlier detection methods to SQL Server databases is that many of them simply aren’t designed with the Big Data buzzword in mind. Many of them do a fantastic job when used in hypothesis testing, the main use case scenario they were designed for, which is a much narrower and more specific task than the two most common uses cases DBAs face, exploratory data mining and checking for data quality problems. Studies of the first kind often involve just a few hundred or even just a few dozen cases, whereas in the latter we may be dealing with millions of records. It is almost impossible, however, to find lookup tables for many of the distributions and calculations associated with these hypothesis tests that go beyond more than a few hundred values. For example, I had to hunt all over the Internet to find a table of T-distribution values that went up as far as 200 values, which is still below the 209 rows of the Duchennes table and minuscule in comparison to the 11 million rows of the Physics.HiggsBosonTable. I’ve also learned that attempting to fill the gap by calculating the missing values for these tables yourself can be quite computationally expensive and surprisingly difficult to code, as the cumulative distribution function (CDF) for the Gaussian bell curve can be. I now suspect that Grubbs and GESD belong to a subset of outlier detection methods that DBAs will rarely find use cases for, along with such related means as the Tietjen-Moore Test, Dixon’s Q-Test, Chauvenet’s Criterion and the Modified Thompson Tau Test. I’ll dispense with these in the middle of the series since I’ve already written most of the code  for them and might as well not let it go to waste, just in case a reader out there discovers a need to include them in their toolbelt. After this quick stretch we’ll get into methods like Interquartile Range and Peirce’s Criterion, which I expect to be much more useful for our scenarios, although not perhaps as much as topics we’ve already covered like Benford’s Law and Z-Scores. I also have high hopes for Cook’s Distance and Mahalanobis Distance, which I’ll tackle after a recap of SSDM Clustering and an interlude into Visual Outlier Detection with Reporting Services, in which we can spot outliers with the naked eye. For now, I’ll finish on quickly getting out of the way other means of outlier detection from the same class as GESD and Grubbs. Many of these share some of the same severe limitations, such as dependence on a normal distribution. GESD may be the most flexible among them, since it allows you to specify the number of outliers you want to look for, whereas Dixon’s Q-Test and Grubbs limit you to just one. As we shall see next week, the Tietjen-Moore test appears at first glance to be more useful since it also include a parameter like @NumberofTests. Its utility is crimped, however, by the subtle difference that it only tells you whether or not the dataset contains that number of outliers. GESD will likely be more useful, in that it can actually flag the specified number of records as aberrant data points.

[1] See National Institute for Standards and Technology, 2014,  “1.3.5.17.1. Grubbs’ Test for Outliers,” published in the online edition of the Engineering Statistics Handbook. Available at http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h1.htm

[2] Rosner, Bernard, 1983,  “Percentage Points for a Generalized ESD Many-Outlier Procedure,” pp. 165-172 in Technometrics, May, 1983. Vol. 25, No. 2.  Original citation found at the NIST webpage “1.4.3. References For Chapter 1: Exploratory Data Analysis,” published in the online edition of the Engineering Statistics Handbook. Available  on the Internet at  http://www.itl.nist.gov/div898/handbook/eda/section4/eda43.htm#Rosner

[3] See the Wikipedia page on the “Student’s T-distribution,” available at the web address http://en.wikipedia.org/wiki/Student’s_t-distribution .

[4] Ben-Gan, Itzik, 2012, Microsoft SQL Server 2012 High-Performance T-SQL Using Window Functions . O’Reilly Media, Inc.: Sebastopol, California.

 

Outlier Detection with SQL Server, part 3.1: Grubbs’ Test


By Steve Bolton

…………In the last two installments of this series of amateur self-tutorials, I mentioned that the various means of detecting outliers with SQL Server might best be explained as a function of the uses cases, the context determined by the questions one chooses to ask of the data, the number and data types of the inputs and the desired mathematical properties of the outputs. The means of calculation in between the input and output stage may also be pertinent for performance reasons. Moreover, we can differentiate these aberrant data points we call “outliers” by their underlying causes, which must be matched with the correct response; it does us no good to find extreme values in our datasets if we can’t determine whether they were the product of faulty data collection, corruption during storage, natural random variations or some other root cause, then use that determination to handle them correctly. If we could build a matrix of outlier detection methods and their uses cases, then Grubbs’ Test would occupy a very narrow range. The inputs and questions the test can answer are quite constrained, since the test can only determine whether the highest and lowest values in a sample are outliers. It outputs a single test statistic, which can be used to output a single Boolean answer, rejecting or accepting the null hypothesis that there are no outliers in the dataset. The National Institute for Standards and Technology’s Engineering Statistics Handbook, one of the best online resources for explaining statistics in plain English, warns that, “If you suspect more than one outlier may be present, it is recommended that you use either the Tietjen-Moore test or the generalized extreme Studentized deviate test instead of the Grubbs’ test.” In the kinds of billion-row databases that SQL Server DBAs work with on a day-to-day basis, we can expect far more than a single aberrant data point just by random chance alone. Grubbs’ Test is more applicable to hypothesis testing on small samples in a research environment, but I’ll provide some code anyways in the chance that it might prove useful to someone in the SQL Server community on small datasets.
…………The “maximum normed residual test” originated with the a paper penned for the journal Technometrics by Frank E. Grubbs, a statistician for the U.S. Army’s Ballistics Research Laboratory (BRL), six years before his retirement in 1975. Apparently the Allies owe him some gratitude, given that “he was dispatched to England in 1944 as part of a team on a priority mission to sample and sort the artillery ammunition stockpiled for the invasion of France. After the team conducted thousands of test firings of the hundreds of different lots of artillery ammunition in the stockpiles, he analyzed the statistical variations in the data and was able to divide the ammunition into four large categories based on their ballistic characteristics. As a result, the firing batteries did not need to register each lot of ammunition before it was unpacked; they only needed to apply four sets of ballistic corrections to the firing tables to achieve their objectives.” After the war, he assisted the BRL in evaluating the reliability and ballistic characteristics of projectiles, rockets, and guided missiles; maybe he wasn’t a “rocket scientist,” as the saying goes, but close enough. The groundwork for the test that bears his name was laid in 1950, when he published a paper titled “Procedures for Detecting Outlying Observations in Samples” for Annals of Mathematical Statistics, which I also had to consult for this article. The 1950 paper is made publicly available by the Project Euclid website, while the one establishing the test itself is made available at California Institute of Technology’s Infrared Processing and Analysis Center, for anyone wise enough to double-check my calculations and code or to get more background.
…………Calculating the test statistic from the formula at the NIST webpage is really trivial; the difficulty is in finding proper tables of the T-distribution to interpret the statistic with. The equation for the two-sided test is quite similar to the familiar Z-Score, except that we take the maximum value of the absolute deviation (i.e., the data point minus the mean) before dividing by the standard deviation. The one-sided tests for determining if a minimum or maximum value in a dataset is an outlier are only slightly different; in the former we subtract the minimum value from the mean, while in the latter we subtract the mean from the maximum. Since the code is so easy that even a caveman can do it, I decided not to complicate it by adding logic to let the user select which of the three tests to use; I simply return all three in one row, along with the critical regions for each. The formulas for calculating the critical regions at the NIST’s webpage on Grubbs’ Test are more involved, which requires the use of the function in Figure 3. This in turn calls a rather sloppy but effective function to find the correct critical values for the T-distribution, from the lookup tables defined in Figure 1. I haven’t supplied any code to populate them, but this can be easily rectified by using one of the thousands of lookup tables available on the Internet for that distribution. The tricky part was finding a table that was reasonably complete, since many sources begin skipping over deciles around 40 or 50 degrees of freedom; I populated my own from the best source I could find, the “Tables of Critical values of t for Probabilities” at the StatsToDo website. In fact, you may need to tweak the DDL and retrieval code if you use a different source, since my ranges and stopping point of 200 degrees of freedom are derived from that particular source. According to another lookup table (from a Blogspot post by a writer who I’ve been unable to identify to give proper credit) that I didn’t use because it skips some deciles, the values for 200 and 250 are nearly identical except down to the hundredth of percentage points; the next value listed there is for infinity, which varies only a few hundredths of a percentage point from 250. Unlike researchers working with small samples drawn from an unknown population, SQL Server users can often instantly call up millions of records, so using the smaller values of these lookup tables may be of limited utility for our use cases. I only recently learned how to do hypothesis testing, so my standard advice to check my code before putting it into a production environment definitely holds here. The Grubbs Statistic values match the NIST sample results though and could prove useful, by providing a measurement of “the largest absolute deviation from the sample mean in units of the sample standard deviation.”

Figure 1: DDL for the T-Distribution Lookup Tables
CREATE TABLE [Calculations].[CriticalValueRangeTable](
       [ID] [smallint] IDENTITY(1,1) NOT NULL,
       [OneTail] [decimal](5, 4) NULL,
       [TwoTail] [decimal](5, 4) NULL,
 CONSTRAINT [PK_CriticalValueRangeTable] PRIMARY KEY CLUSTERED (
       [ID] ASC)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE =
OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

CREATE TABLE [Calculations].[TDistributionTable](
[ID] [smallint] IDENTITY(1,1) NOT NULL,
[ProbabilityRange1] [decimal](7, 4) NULL,
[ProbabilityRange2] [decimal](7, 4) NULL,
[ProbabilityRange3] [decimal](7, 4) NULL,
[ProbabilityRange4] [decimal](7, 4) NULL,
[ProbabilityRange5] [decimal](7, 4) NULL,
[ProbabilityRange6] [decimal](7, 4) NULL,
[ProbabilityRange7] [decimal](7, 4) NULL,
CONSTRAINT [PK_TDistributionTable] PRIMARY KEY CLUSTERED ([ID] ASC)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]) ON [PRIMARY]

Figure 2: Function to Look Up Values in the T-Distribution Tables
CREATE FUNCTION [Calculations].[FindCriticalRegionForTDistributionFunction]
(@DegreesOfFreedom bigint, @SidedTestType bit, @ProbabilityValue decimal(5,4))
RETURNS decimal(7, 4)
AS
BEGIN
DECLARE @CriticalRegion decimal(7, 4)
– this is a little more awkward than I’d like, but hey
      SELECT @CriticalRegion = CASE WHEN ProbabilityRangeColumnID = 1 THEN (SELECT ProbabilityRange1
FROM Calculations.TDistributionTable WHERE ID = (CASE WHEN @DegreesOfFreedom
>= 200 THEN 200 WHEN ID = @DegreesOfFreedom THEN @DegreesOfFreedom ELSE NULL END))
      WHEN ProbabilityRangeColumnID = 2 THEN (SELECT ProbabilityRange2
FROM Calculations.TDistributionTable WHERE ID = (CASE WHEN @DegreesOfFreedom
>= 200 THEN 200 WHEN ID = @DegreesOfFreedom THEN @DegreesOfFreedom
ELSE NULL END))
       WHEN ProbabilityRangeColumnID = 3 THEN (SELECT ProbabilityRange3
FROM Calculations.TDistributionTable WHERE ID = (CASE WHEN @DegreesOfFreedom
>= 200 THEN 200 WHEN ID = @DegreesOfFreedom THEN @DegreesOfFreedom ELSE NULL END))
       WHEN ProbabilityRangeColumnID = 4 THEN (SELECT ProbabilityRange4
FROM Calculations.TDistributionTable WHERE ID = (CASE WHEN @DegreesOfFreedom
>= 200 THEN 200 WHEN ID = @DegreesOfFreedom THEN @DegreesOfFreedom ELSE NULL END))
       WHEN ProbabilityRangeColumnID = 5 THEN (SELECT ProbabilityRange5
FROM Calculations.TDistributionTable WHERE ID = (CASE WHEN @DegreesOfFreedom
>= 200 THEN 200 WHEN ID = @DegreesOfFreedom THEN @DegreesOfFreedom ELSE NULL END))
      WHEN ProbabilityRangeColumnID = 6 THEN (SELECT ProbabilityRange6
FROM Calculations.TDistributionTable WHERE ID = (CASE WHEN @DegreesOfFreedom >= 200 THEN 200 WHEN ID = @DegreesOfFreedom THEN @DegreesOfFreedom ELSE NULL END))
       WHEN ProbabilityRangeColumnID = 7 THEN (SELECT ProbabilityRange7
FROM Calculations.TDistributionTable WHERE ID = (CASE WHEN @DegreesOfFreedom >= 200 THEN
200 WHEN ID = @DegreesOfFreedom THEN @DegreesOfFreedom ELSE NULL END))
       ELSE NULL END
      FROM   (SELECT TOP 1 ID AS ProbabilityRangeColumnID
             FROM  (SELECT ID, ValueRange, ABS(ValueRange - @ProbabilityValue) AS RangeDifference, Lead(ValueRange, 1, 0) OVER (ORDER BY ValueRange) AS Lead
                     FROM (SELECT ID, ‘ValueRange’ = CASE WHEN @SidedTestType = 0 THEN OneTail
                           WHEN @SidedTestType = 1 THEN TwoTail
                           ELSE NULL END
                     FROM [Calculations].[CriticalValueRangeTable] ) AS T1) AS T2
              ORDER BY RangeDifference ASC) AS T3
      RETURN @CriticalRegion
E
ND

Figure 3: Grubbs Hypothesis Testing Function
CREATE FUNCTION [Calculations].[GrubbsHypothesisTestSP](
@DegreesofFreedom bigint, @TestType bit = 0, @SignificanceLevel decimal(38,35))
RETURNS decimal(38,32)
AS
BEGIN
       DECLARE @CriticalValue decimal(38,32), – *** look this up in a table by the SignificanceLevel I’ve already  recalculated according to the formulas, and also the Degrees of Freedom – 2
      @ReturnValue decimal(38,32)
       SELECT @CriticalValue = [Calculations].[FindCriticalRegionForTDistributionFunction] (@DegreesofFreedom, @TestType, @SignificanceLevel)
      SELECT @ReturnValue = Power(Power(@CriticalValue, 2) / (@DegreesOfFreedom - 2
+ Power(@CriticalValue, 2)), 0.5) * ((@DegreesOfFreedom -1) / Power(@DegreesOfFreedom,
0.5))
       RETURN @ReturnValue
END

Figure 4: Code for the the Grubbs Test Procedure
CREATE PROCEDURE [Calculations].[GrubbsTestSP]
@DatabaseName as nvarchar(128) = NULL, @SchemaName as nvarchar(128), @TableName as nvarchar(128),@ColumnName AS nvarchar(128), @DecimalPrecision AS nvarchar(50), @Alpha decimal(38,35) = 0.05
AS

DECLARE @SchemaAndTableName nvarchar(400), @SQLString nvarchar(max)
S
ET @DatabaseName = @DatabaseName + ‘.’
SET @SchemaAndTableName = ISNull(@DatabaseName, ) + @SchemaName + ‘.’ + @TableName –I’ll change this value one time, mainly for legibility purposes

SET @SQLString = ‘DECLARE @Mean decimal(‘ + @DecimalPrecision + ‘),
@StDev decimal(‘ + @DecimalPrecision + ‘),
@Min decimal(‘ + @DecimalPrecision + ‘),
@Max decimal(‘ + @DecimalPrecision + ‘),
@GrubbsVersion1 decimal(‘ + @DecimalPrecision + ‘),
@GrubbsVersion2 decimal(‘ + @DecimalPrecision + ‘),
@GrubbsVersion3 decimal(‘ + @DecimalPrecision + ‘),
@DegreesofFreedom bigint,
@SignificanceLevel decimal(‘ + @DecimalPrecision + ‘),
@SignificanceLevelOneSided decimal(‘ + @DecimalPrecision + ‘)

SELECT @DegreesofFreedom=Count(‘ + @ColumnName + ‘), @Mean = Avg(CAST(‘ + @ColumnName + ‘ AS Decimal(‘ + @DecimalPrecision + ‘))), @StDev = StDev(CAST(‘ + @ColumnName + ‘ AS Decimal(‘ + @DecimalPrecision + ‘))), @Min = Min(‘ + @ColumnName + ‘), @Max = Max(‘ + @ColumnName + ‘) FROM ‘ + @SchemaAndTableName + WHERE ‘ + @ColumnName + ‘ IS NOT NULL

– the sample exercise at the NIST webpage uses a straight value of @SignificanceLevel = @Alpha, rather than the two calculations for two- and one-sided tests that are recommended elsewhere on the directions; hence Ive commented them out for now
-
-SET @SignificanceLevel  = ‘ + CAST(@Alpha AS nvarchar(50)) + ‘ / (2 * @DegreesofFreedom)
–SET @SignificanceLevelOneSided = ‘ + CAST(@Alpha AS nvarchar(50)) + ‘ / (@DegreesofFreedom)
SET @SignificanceLevel = ‘ +
CAST(@Alpha AS nvarchar(50)) +
SET @SignificanceLevelOneSided = ‘ + CAST(@Alpha AS nvarchar(50)) +

SELECT @GrubbsVersion1 = Max(‘ + @ColumnName + ‘ – @Mean) / @StDev, @GrubbsVersion2 = ((@Mean – @Min) / @StDev),
@GrubbsVersion3 = ((@Max – @Mean) / @StDev)
FROM ‘ + @SchemaAndTableName +
WHERE ‘ + @ColumnName + ‘ IS NOT NULL

SELECT @GrubbsVersion1 AS  GrubbsTwoSided, CAST([Calculations].[GrubbsHypothesisTestSP] (@DegreesOfFreedom, 1, @SignificanceLevel) AS decimal(‘ + @DecimalPrecision + ‘)) AS
CriticalRegionForGrubbsTwoSided,
@GrubbsVersion2 AS  GrubbsLowerOneSided, @GrubbsVersion3 AS  GrubbsUpperOnesided, CAST([Calculations].[GrubbsHypothesisTestSP] (@DegreesOfFreedom, 0, @SignificanceLevelOneSided)  AS decimal(‘ + @DecimalPrecision + ‘)) AS CriticalRegionForGrubbsUpperOneSided
–SELECT @SQLString — uncomment this to debug string errors
EXEC (@SQLString)

…………Thankfully, we will be able to reuse the mass of T-SQL in the first three figures in next week’s tutorial, which also requires looking up T-distribution values. The code in Figure 4 will look familiar if you’ve been following this mistutorial series. As always, you’ll have to add the brackets and program in the logic yourself if you allow spaces in your object names; you may also need to add SQL injection and other security code. Also, keep in mind that I’m still using a Calculations schema for these sample routines, so you may need to create one in your database or change the name as needed. The first three parameters allow you to run the procedure against any column in any database for which you have the requisite permissions and access. The @DecimalPrecision also allows you to manually set the precision and scale for the internal calculations of the procedure, in order to avoid arithmetic overflows. There are some slight differences between the parameters of this procedure and those discussed in the last few articles though, such as the fact that the usual @OrderByCode and @PrimaryKeyName are not needed here. The @Alpha parameter allows you to set the significance level of the test to any value you please (including incorrect ones, since I haven’t added any validation code) like the default of 0.05, which corresponds to a 95 percent confidence level. Oddly, Grubbs wrote in the 1969 paper that the confidence levels ought to be at least 99 percent for use with this test, but still used the standard 95 percent in his own example. Similarly, the NIST webpage says to use @Alpha divided by the degrees of freedom for a one-sided test and divided by twice the degrees of freedom for a two-sided test, yet uses a plain significance value of @Alpha = 0.05 in its sample data. Hence the commenting out of the code that would change the @SignificanceLevel to the alternate values. When run against the sample exercise on the NIST page, my code is accurate within about a hundredth of a percentage point, as long as this adjustment is made.

Figure 5: Results for the Grubbs Test Procedure
EXEC   Calculations.GrubbsTestSP
            @DatabaseName =N’DataMiningProjects’,
             @SchemaName= N’Health’,
             @TableName = N’DuchennesTable’,
             @ColumnName= N’PyruvateKinase’,
             @DecimalPrecision= N’12,7′,
             @Alpha = 0.05

 GrubbsTestResults

 

…………Executing a query like the one above against the Duchennes dataset we’ll be working with in this tutorial series produced the five columns above. The first, third and fourth columns represent the Grubbs Statistics for the two-sided, minimum and maximum tests respectively, while the Critical Regions are derived from the aforementioned code in Figures 1 through 3. The Grubbs Statistics are far beyond the critical regions, so yes, the maximum values in the dataset are beyond the thresholds and thus can be defined as “outliers.” Keep in mind that after testing the procedure against several different columns in different datasets, I’ve only seen slight differences between the two-sided result and the upper one-sided; due to lack of appropriate samples to work through, I cannot say whether or not that can be expected or is an error on my part. The PyruvateKinase column refers to an enzyme that is apparently involved in the devastating Duchennes form of muscular dystrophy, which is the subject of a tiny nine-kilobyte sample dataset made publicly available by Vanderbilt University’s Department of Biostatistics. In the last couple of blog posts I used the first float column of the Higgs Boson Dataset that the University of California at Irvine’s Machine Learning Repository has made available online, which occupies nearly 6 gigabytes of space in the DataMiningProjects database I’ve created to encompass all of the practice datasets I’ll be using in tutorial series to come. Traversing that much space in index scans and seeks turned out to be somewhat costly for the two versions of Z-Scores that I coded in the last two posts, but I haven’t bothered to post Client Statistics from SQL Server Management Studio (SSMS) because the Grubbs’ Test procedure takes only a matter of seconds, even for the massive float columns of the Physics.HiggsBosonTable.
…………The procedure may perform well compared to other outlier detection methods, but its usefulness is limited. I have yet to test it against a table that didn’t turn out to have outliers – which is likely to be the case for most of the large tables that DBAs might run the procedure on. Grubbs’ Test is more useful for the kind of small tables used in hypothesis testing, rather than exploratory data mining and data quality analyses, which are far more common uses cases in the SQL Server user community. Statistical testing of this kind is also prone to several real dangers I’ve touched on in the last few articles and others that I have yet to address. The confidence levels commonly associated with them are pretty much drawn out of thin air; you’ll see 95 percent used most of the time, but only because it was an arbitrary de facto standard long before the UNIVAC. There really isn’t a good justification for it, except for the fact that it has been used for so long. Secondly, probabilities are not guarantees; there is a finite chance that random fluctuations alone could produce a dataset that consisted of nothing but outliers, using any definition and detection method. Worst of all, Grubbs’ Test requires a Gaussian (i.e.”normal”) distribution, i.e. the bell curve. Without goodness-of-fit tests that clearly demonstrate that the data ought to fit the normal distribution, such tests are useless – or worse, deceptive. As Grubbs himself puts it mildly, “Until such time as criteria not sensitive to the normality assumption are developed, the experimenter is cautioned against interpreting the probabilities too literally when normality of the data is not assured.” I don’t yet know how to apply some of the best goodness-of-fit tests (mainly because I’m still having trouble wrapping my head around some of the intricacies of cumulative distribution functions) but that won’t stop me from harping on this point repeatedly: the extent to which statistics are bandied about in many academic disciplines without proper testing today is simply alarming. The one place we can least afford to see them is in medical research, where they become a matter of life and death, but at least half of all studies published today contain at least one statistical error. The most common error appears to be the lack of goodness-of-fit testing; researchers in many fields seem to be in the habit of routinely applying tests that depend on a Gaussian distribution with reckless disregard for their validity. It’s not surprising that this occurs, given that there are so few statistical tests that can be used with the scores of other distributions that data might follow. If researchers everywhere were held to a proper standard of evidence, they might not be able to back up claims for a favorite project or experimental medicine with any statistics at all.
…………This leads straight back to the logical and moral issues that I’ve tried to reiterate throughout this series: there are an awful lot of shenanigans that go on with “lies, damned lies and statistics,” as the old saying goes. Grubbs’ Test is no more vulnerable to misuse than any other measure based on the normal distribution, all of which can be invalid, misleading or worse when the data is not naturally Gaussian. It is sometimes, however, applied recursively by simply deleting outliers until some stopping criteria is reached, which raises the grim possibility of improper handling of unfavorable data points. In some situations, an iterative Grubbs Test is vulnerable to false negatives, in which actual outliers are not detected, or false positives, in which haphazard definitions of stopping criteria lead to the deletion of good data. That brings us back full circle to the confluence of subjective definitions, “moving the goalposts” and inadequate analysis of the causes of aberrance which I discussed at length in the last article. Thankfully, the Generalized Extreme Studentized Deviate Test (GESD) ameliorates some of the standard problems associated with the sequential application of Grubbs’ Test by making a few adjustments; the NIST recommends that either the GESD or the Tietjen-Moore test be applied when looking for more than one outlier, which is almost always going to be the case in a SQL Server database. The math and code for both are also relatively simple. Unfortunately, they are both dependent on a Gaussian distribution, which means they also require goodness-of-fit tests that are often dispensed with in an irresponsible manner. The same limitation applies to Dixon’s Q-Test, which is simple to code, as well as to Chauvenet’s Criterion, which is not. It may also be true of Peirce’s Criterion, which will also be featured later in this series. Interquartile Range is a much more easily coded method of outlier detection which also may be less dependent on the normal distribution. Later in the series, I’ll give a quick recap of the Clustering algorithm in SQL Server Data Mining (SSDM) and supply some eye candy that is much easier on the brain than these fancy equations in Visual Outlier Detection with SQL Server Reporting Services. Towards the end I’ll delve into more difficult methods like Cook’s Distance and the Modified Thompson Tau Test, then Mahalanobis Distance. Many of these more sophisticated methods are of course more difficult to code than GESD, Tietjen-Moore and Dixon’s Q-Test, but they may also be more applicable to distributions besides the bell curve.

Outlier Detection with SQL Server, part 2.2: Modified Z-Scores

By Steve Bolton

…………There are apparently many subtle variations on Z-Scores, a ubiquitous measure that is practically a cornerstone in the foundation of statistics. The popularity and ease of implementation of Z-Scores are what made me decide to tackle them early on in this series of amateur self-tutorials, on using various components of SQL Server to detect those aberrant data points we call “outliers.” As discussed in the last two posts, there are many different means of identifying outliers, which may be understood best by categorizing them by their use cases; the right choice of detection tools is essentially a function of the questions one wants to ask of the data, the number and types of inputs, the desired mathematical properties of the outputs and in between, the performance and other requirements used in transforming the inputs into outputs. From my scant understanding of what little literature I’ve read on the topic, statisticians and other researchers commonly encounter use cases where the sensitivity of ordinary measurements to outliers has to be toned down, often in response to fat-tailed (i.e. highly skewed) distributions. The Modified Z-Scores developed by Temple University Prof. Boris Iglewicz and University of Massachusetts Prof. David C. Hoaglin are one means of adjusting Z-Scores for such cases, but hardly the only one. I’m highlighting it merely because I was introduced to it early on, while trying to learn the subject of stats from the National Institute for Standards and Technology’s Engineering Statistics Handbook, one of the best online resources for anyone trying to wade through this notoriously dry subject.[I]
                Iglewicz and Hoaglin suggest specific cut-off criteria for their measure, that can of course be adjusted as needed by users – which raises the whole question of researchers “moving the goalposts” or setting them haphazardly when using any means of outlier detection. Correct classification is a thorny issue with every method we’ll discuss in this series; I’m merely using the hard boundary associated with Modified Z-Scores as an introduction to the all-important topic of subjectivity in category definitions. As an amateur, I can’t give any guidance on whether or not to use a specific cut-off point for classifying outliers, although it would seem to be common sense that more detailed rough and fuzzy sets ought to be more commonly used in place of hard limits than they are. It is worth reiterating that outliers also vary significantly in their causes and the responses made to them, not just the means of differentiation. The frequent mismatches between the causes and responses and the lack of attention paid to discerning them both leave the door wide open to innumerable fallacies and sometimes outright fraud, which as I discussed earlier in this series, is frighteningly common in certain fields. The definition of an outlier is subjective, depending on the kind of investigation a researcher chooses to perform, but whether or not a particular data point meets the chosen criteria is wholly objective. Fallacies and fraud arise when the distinction in the right uses and proper places of subjectivity and objectivity are blurred; the whole history of human philosophy demonstrates that when the former is loosed from such bonds, the result is always maniacal madness. For example, a person can choose to affix the name “pepperoni pizza” to anything they want; but once they’ve set tomato sauce, bread, cheese and the like as part of the criteria, then they can’t pretend that a pencil sharpener or a Planck length fits the definition, because whether or not they consist of the same ingredients set forth in the criteria is an objective matter. That’s plain common sense, which suddenly becomes uncommon when the labeler has an incentive to fudge their definitions, or worse yet, a pedantic justification for it, like solipsism (i.e., one of the major symptoms of schizophrenia). Outlier detection presents a serious temptation to simply ignore the distinctions between causes and put no effort to differentiating the correct response to others, so that data miners and others who use these tools frequently just delete records that don’t fit their models and theories, or adjust their definitions of the term to achieve the same purpose. I’ll delve into outlier deletion in more depth a few posts from now, but the issue of subjective limits can serve as a transition into my usual dire disclaimer that math formulas, including those underpinning data mining algorithms and outlier detection, resides in a Pandora’s Box. The Modified Z-Scores under discussion today do not open the box any wider than any other formula; this is merely the context in which all statistical measures naturally reside, in which the slightest logical deviation in their use may lead to erroneous, misleading or even fraudulent conclusions. Data mining tools can be used quite easily by amateurs like myself for exploratory data analysis, but need to be handled like scalpels when attempting to prove a specific point. Nevertheless, they’re often employed carelessly like sledge hammers by professionals in many different fields, particularly health care. The specter of fallacious reasoning hems us in all sides, and wielding these tools properly for this purpose requires more skill than the old board game of Operation. The difference with math and logic is that there is no buzzer to warn us when we’ve used them wrong; there may be terrible consequences down the line in the form of falling bridges and adverse medical reactions, but the intelligentsia also has the intellectual power to explain those away using the same poor reasoning. What is called for here is not intelligence, but wisdom; without it, outlier detection methods merely prove the old adage, “There are three kinds of lies: lies, damned lies and statistics.”[ii] No data mining tool or math formula is going to going to provide a quick fix for this overarching problem, which hangs like a Sword of Damocles over everything that researchers, scientists, mathematicians and data miners do; the only fix is to apply the use of reason rigorously, which requires a deep understanding of logical fallacies and in turn, painful self-examination. Neither I nor most of the DBAs who read this probably have that kind of training, so our use cases ought to be limited to exploratory analysis – which can be a highly productive exercise, even for the unqualified – rather than hypothesis testing and the like.
…………The point of using Modified Z-Scores is to address situations where it is desirable to reduce the sensitivity to outliers, so that there are for all intents and purposes fewer false positives when classifying them. Whether or not such reduced sensitivity is a wise choice to fit the problem at hand is one question; whether or not Modified Z-Scores succeed in doing so seems to be an open and shut case. In this series I’m trying to grasp the mechanisms that make these algorithms and formulas work as they do, which is something I didn’t delve into adequately in my series on SQL Server Data Mining (SSDM). The reason why Iglewicz and Hoaglin’s Z-Scores are less sensitive to outliers without being completely blind to them is that they use medians rather than means, which are an alternate measure of central tendency that is known for being less affected by unusual values. Both medians and means are basically primordial forms of clustering that identify a specific location near the center of a dataset, but the former is less affected by the most distant points. The formula given at the NIST website is not all that difficult to decipher or code in T-SQL; I was unable to get ahold of a copy of their original paper to see what the reasoning was behind the constant that appears in it, but it is child’s play to simply include it in the code and be done with it.[iii]  This was my first introduction to median absolute deviation (MAD), which is a variation of the average absolute deviation that is even less affected by extremes in the tail because the data in the tails have less influence on the calculation of the median than they do on the mean.”[iv] I initially confused it with a more common calculation, mean absolute deviation because of the similar names. The idea is basically the same though: instead of taking a mean of a mean, we compare each data point to the median of the whole dataset, then calculate a new median for the absolute value of those distances. Then we take subtract the median from each data point again and multiply that result by a constant, 0.6745, the divide the result by the MAD. The equations are actually quite easy to read; most of the T-SQL involved in implementing them is dedicated to calculating the two medians, using some subqueries and windowing functions. I’ve precalculated both in Common Table Expressions (CTEs) at the beginning of this week’s T-SQL stored procedure, because this reduces them to one-time operations (I think the technical term might an “Θ(n) operation”) and makes the complicated dynamic SQL a little more legible. The T-SQL in Figure 1 could be streamlined further to suit your needs by removing the DENSE_RANK calculation, the OutlierCandidate column and the @OrderByCode logic, which are dispensable elements I’ve added as conveniences.

Figure 1: Code for the Modified Z-Score Stored Procedure[v]
CREATE PROCEDURE [Calculations].[ModifiedZScoreSP]
@DatabaseName as nvarchar(128) = NULL, @SchemaName as nvarchar(128), @TableName as nvarchar(128),@ColumnName AS nvarchar(128), @PrimaryKeyName as nvarchar(400), @OrderByCode as tinyint = 1, @DecimalPrecision AS nvarchar(50)–1 is by PK ASC, 2 is by
PK Desc, 3  is by ColumnName ASC, 4 is by ColumnName DESC, 5  is by ZScore ASC, 6 is by ZScore DESC
AS

SET @DatabaseName = @DatabaseName + ‘.’
DECLARE @SchemaAndTableName nvarchar(400), @SQLString nvarchar(max)
SET @SchemaAndTableName = ISNull(@DatabaseName, ) + @SchemaName + ‘.’ + @TableName –I’ll change
this value one time, mainly for legibility purposes

SET @SQLString =
‘DECLARE @OrderByCode as tinyint ,– pass the outer value like a parameter of sorts
@Median AS decimal(‘ + @DecimalPrecision + ‘),
@MedianAbsoluteDeviation AS decimal(‘ + @DecimalPrecision + ‘)

– PRECALCULATED STATS
————————–
– precalculating these 3 stats not only makes the code more legible, but is more efficient because it is a one-time operation
- first get the median

WITH MedianCTE
(‘ + @ColumnName + ‘, RN, DenseRank)
AS
(
SELECT ‘ + @ColumnName + ‘, RN, DENSE_RANK() OVER (PARTITION BY 1 ORDER BY RN DESC) AS DenseRank
FROM (SELECT ‘ + @ColumnName + ‘, ROW_NUMBER() OVER (ORDER BY ‘ + @ColumnName + ‘) AS RN
     FROM ‘ + @SchemaAndTableName +
WHERE ‘ + @ColumnName + ‘ IS NOT NULL) AS T1
)

SELECT @Median = Avg(‘ + @ColumnName + ‘) FROM MedianCTE WHERE RN BETWEEN DenseRank – 1 AND DenseRank +1;

– get the MedianAbsoluteDeviation
WITH MedianAbsoluteDeviationCTE
(‘ + @ColumnName + ‘, RN, DenseRank)
AS
(
SELECT NewMedian, RN, DENSE_RANK() OVER
(PARTITION BY 1 ORDER BY RN DESC) AS DenseRank
F
ROM        (SELECT NewMedian, ROW_NUMBER() OVER (ORDER BY NewMedian) AS RN
       FROM             (SELECT ABS(‘ + @ColumnName + ‘ – @Median) AS NewMedian
              FROM ‘ + @SchemaAndTableName +
              WHERE ‘ + @ColumnName + ‘ IS NOT NULL) AS T1) AS T2)

SELECT @MedianAbsoluteDeviation = Avg(‘ + @ColumnName + ‘)
FROM MedianAbsoluteDeviationCTE
WHERE RN BETWEEN DenseRank – 1 AND DenseRank +1;

–SELECT @Median
–SELECT @MedianAbsoluteDeviation
S
ET @OrderByCode = ‘ + CAST(@OrderByCode  AS nvarchar(50) ) +

– OUTLIER COMPARISON OPERATIONS
– now check each data point
SELECT ‘ + @PrimaryKeyName + ‘, ‘ + @ColumnName + ‘, ModifiedZScore, DENSE_RANK () OVER (ORDER BY ModifiedZScore) AS GroupRank,
”OutlierCandidate” = CASE WHEN Abs(ModifiedZScore) > 3.5 THEN 1
       ELSE 0
       END
FROM (SELECT ‘ + @PrimaryKeyName + ‘, ‘ + @ColumnName + ‘, ((0.6745 * (‘ + @ColumnName + ‘ – @Median)) / @MedianAbsoluteDeviation) AS ModifiedZScore
FROM ‘ + @SchemaAndTableName + ‘) AS T1
       ORDER BY
CASE WHEN @OrderByCode = 1 THEN ‘ + @PrimaryKeyName + ‘ END ASC,
CASE WHEN @OrderByCode = 2 THEN ‘ + @PrimaryKeyName + ‘ END DESC,
CASE WHEN @OrderByCode = 3 THEN ‘ + @ColumnName + ‘ END ASC,
CASE WHEN @OrderByCode = 4 THEN ‘ + @ColumnName + ‘ END DESC,
CASE WHEN @OrderByCode = 5 THEN ModifiedZScore
END ASC,
CASE WHEN @OrderByCode = 6 THEN ModifiedZScore END DESC’
–SELECT @SQLString — uncomment this to debug string errors
EXEC (@SQLString)

…………Note that I’m using almost all of the same parameters and dynamic SQL format as in the article on regular Z-Scores. The combination of the first three parameters allows you to execute the procedure against any table in any database, assuming you have the requisite permissions. Just like in the last tutorial, the @OrderByCode parameter allows you to sort the results by 1) the primary key values, ascending; 2) the primary key value descending; 3) the column values ascending; 4) the column values descending; 5) the Modified Z-Score ascending and 6) the Modified Z-Score descending. You’ll have to supply your own @DecimalPrecision values and tweak them to avoid arithmetic overflows, which are tricky to handle when multiple calculations can change the number of decimal places repeatedly. I usually try setting these values to the original precision and scale of decimal and numeric columns if they’re high enough, but when working with integers you’ll have to decide how many decimal places are appropriate for your output.  You can debug the dynamic SQL by uncommenting the next-to-last line and two others beginning with comment marks and SELECTS. I’ve also used DENSE_RANK windowing function again to assign identical results to specific groups by their Modified Z-Score values, which comes in handy with columns that have few distinct values that are repeated many times. The OutlierCandidate is merely a bit column that reveals whether or not the ModifiedZScore falls outside the ±3.5 threshold set by Iglewicz and Hoaglin. Your requirements may be different, so feel free to change the threshold or eliminate it altogether; it wouldn’t be all that difficult either to replace hard thresholds like this with more flexible fuzzy set criteria with graded memberships. If you use @OrderByCode 5 or 6, values where OutlierCandidate = 1 will be sorted to the top and bottom of the results. As usual, you’ll have to add your own brackets and logic to handle spaces if you allow them in your object names (I have a ruthless ban on them in my own code, for legibility purposes) and program in your own security to handle risks like SQL injection.

Figure 2: Results for Column1 of the HiggsBosonTable
EXEC [Calculations].[ModifiedZScoreSP]
              @DatabaseName = N’DataMiningProjects’,
             @SchemaName = N’Physics’,
              @TableName = N’HiggsBosonTable’,
              @ColumnName = N’Column1′,
              @PrimaryKeyName = N’ID’,
             @OrderByCode = 6,
              @DecimalPrecision = N’33,29′
ModifiedZScoresValues

 

Figure 3: Client Statistics for the Modified Z-Scores Procedure
ModifiedZScoresClientStats (1)
ModifiedZScoresClientStats (2)

…………In last week’s tutorial, I tested my Z-Score stored procedure on the first float column of a nearly 6-gigabyte table from the Higgs Boson Dataset, which is made publicly available by the University of California at Irvine’s Machine Learning Repository. In future mistutorials I will use a dataset on the Duchennes form of muscular dystrophy provided by Vanderbilt University’s Department of Biostatistics, as well as transcriptions of the Voynich Manuscript, a creepy tome whose medieval author encrypted it so well that no one has been able to crack it since, including the National Security Agency (NSA). For the sake of consistency, I tested my Modified Z-Scores procedure against the same Higgs Boson column. Using the query at the top of Figure 2 returned the corresponding results, plus about 11 million more rows that I somehow couldn’t cram into the article. There were some records at the very bottom with Modified Z-Scores near -1, but none that qualified for Iglewicz and Hoaglin’s cut-off point for outliers.
…………I didn’t bother to post screenshots of the execution plans because they weren’t pretty, nor would they fit on an NFL scoreboard. The novel approach I took of comparing the middle point of two windowing functions moving in opposite directions added a lot of expensive sorts, which even the addition of a non-clustered index couldn’t fix. As depicted in Figure 3, the index improved the Client Processing Time and Total Execution Time significantly, but the procedure still consumed far too much memory on my poor beat-up development machine and took too long for my tastes. It will do just fine on columns in small tables, but expect it to take a while if you’re executing it on 11 million rows of a 6-gig database using an outdated workstation in place of a real server.
…………That drawback ought to refocus attention on one of the caveats I want to stress in this series: I’m posting these articles because I don’t know what I’m doing and want to learn, not because I have any real expertise. As with my series A Rickety Stairway to SQL Server Data Mining, I’m passing on my misadventures so that others don’t repeat them. Another error I rectified along the way was accidentally substituting a mode for the median while wool-gathering; that procedure might actually be useful in catching certain outliers and I will post it if anyone thinks they can benefit, but the bottom line is that I almost posted an article based on the wrong formula. Just keep in mind that my code samples in this series will always need further testing before going into a production environment. Consider these posts an introduction to the topic, not the last word. If all goes according to plan, I’ll be introducing both myself and my readers to Chauvenet’s Criterion, which is a means of outlier detection that is intrinsically dependent on a Gaussian distribution. I may follow these up by going on a tangent with some fairly easy means of outlier detection, like Grubbs’ Test and the Tietjen-Moore Test, the Generalized Extreme Studentized Deviate (ESD) Test, Interquartile Range and Dixon’s Q-Test. At some point I’ll also get into a discussion of Visual Outlier Detection with Reporting Services (featuring a lot of eye candy) and do a quick recap of Clustering with SSDM. Towards the end of the series I’ll tackle Cook’s Distance the Modified Thompson Tau Test, then end with the daunting task of coding Mahalanobis Distance. I hope to use that last post as a springboard towards a much longer and more difficult series months down the line, Information Measurement with SQL Server.

[i] That’s more “dry humor,” but not as bad as the hydrology joke in the last column.

[ii] Although frequently attributed to Mark Twain and Benjamin Disraeli, the quip apparently originated with British politician Leonard H. Courtney in 1895. See the TwainQuotes webpage “Statistics” at http://www.twainquotes.com/Statistics.html.

[iii] See “Detection of Outliers,” an undated article published at the National Institute for Standards and Technology’s Engineering Statistics Handbook website. Available online at http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm. The page in turn cites Iglewicz, Boris and Hoaglin, David, 1993, “Volume 16: How to Detect and Handle Outliers,” The ASQC Basic References in Quality Control: Statistical Techniques, Edward F. Mykytka, Ph.D., Editor.

[iv] See “Measures of Scale,” an undated article published at the National Institute for Standards and Technology’s Engineering Statistics Handbook website. Available online at http://www.itl.nist.gov/div898/handbook/eda/section3/eda356.htm#MAD

[v]  While coding this, I forgot how to use modulo properly and made use of See Byers, Mark, 2009, response to the thread “How Can I Determine is Digit Even Number?” published Nov. 26, 2009 at StackOverflow.com. Available online at http://stackoverflow.com/questions/1805420/how-can-i-determine-is-digit-even-number.  I also double-checked my median calculations against the MathIsFun webpage “How to Find the Median Value” at http://www.mathsisfun.com/median.html.

Outlier Detection with SQL Server, part 2.1: Z-Scores

By Steve Bolton

…………Using SQL Server to ferret out those aberrant data points we call outliers may call for some complex T-SQL, Multidimensional Expressions (MDX) or Common Language Runtime (CLR) code. Yet thankfully, the logic and math that underpin the standard means of outlier detection I’ll delve into in this series are sometimes mercifully simple. That was not the case in the introductory article in this amateur series of self-tutorials, in which I tackled the fascinating mystery of Benford’s Law, a means of finding outliers that is often used to expose fraud. I also used the occasion to bring up the topic of potential misuses of outlier detection itself, including some duplicitous practices that are nonetheless frighteningly frequent in academic studies – particularly in the health care field, where a lot of money is at stake and where poorly handled stats can do the most damage to ordinary people. I cannot say enough about how critical it is to moor statistics and data mining solidly in reason, because all it takes is a single fallacy to render the conclusions drawn from them useless, misleading or downright false; this point is worth reiterating throughout this blog series, given that it dwarfs all of the technical issues that mathematicians, statisticians, data miners and the field of computing spend much more time on. Our algorithms may go on correctly generating numbers, but if they’re not firmly embedded in reason, they may be deceptively erroneous. Some of the saving graces of this week’s method of outlier detection are that it is well-understood and simple implement, both of which mean there is there is less room for fallacies to worm their way into Z-Score calculations than there are with other methods.
…………Z-Scores may be more run-of-the-mill than cutting edge methods like Benford’s Law, but the technique is ubiquitous throughout the field of statistics precisely because it is a reliable workhorse. In fact, it is used as a building block in many other higher-order statistics, many of which are in turn used in the edifices of today’s sophisticated data mining algorithms. When I wrote my series of self-tutorials on SQL Server Data Mining (SSDM) I merely set out to prove that this neglected component could be used by inexperienced amateurs like myself for practical purposes, but I didn’t yet grasp the mechanics of why the constituent algorithms worked. In this series and the next several I hope to write, I plan to rectify that defect by looking more under the hood. I’m still hardly qualified to write about statistics, but I have found it is much easier to grasp the material by categorizing various stats by their use cases. It is easiest to sort through the confusing welter of numbers and algorithms by viewing the use cases as a function of the questions one wants to ask of the data, the number and type of inputs an algorithm requires, the number and type of outputs it returns, as well as the mathematical properties associated with the inputs and outputs. For example, if you choose to ask a particular question of a dataset but don’t have the correct data types or number of input parameters, or sample sizes, your choices will be quickly narrowed down to a few stats and algorithms. If you require the outputs to have specific mathematical properties, such as positivity or homoscedasticity, your choices will be further constrained, because the formulas are also highly specific as to what kind of data they spit back out. Will G. Hopkins, the author of an excellent plain English guide to stats[i], likewise writes that he had a difficult time sorting out the various types of statistical models until he categorized them by the types of comparisons being made, such as numerical vs. nominal data, or numeric vs. numeric, etc. Those categories are basically equivalent to the column Content types I discussed in the series on SSDM, where they represent an important consideration in the design of mining models. It might be helpful in the future to develop a matrix of use cases for all permutations of Content types, data types, numbers of inputs, properties of outputs and perhaps even the performance requirements for the inner workings of the algorithms in between the input and output stages. For now, however, we’ll just use the concept to illustrate what we’re doing with Z-Scores.
…………This particular measure is merely a comparison of a data point to the mean and standard deviation of the dataset it belongs to. The formula for Z-Scores is fairly simple: subtract the data point from the average, then divide by the deviation. What’s the purpose in doing this though? I was able to grope towards a better understanding by resorting to categorization again. Averages are merely the most popular instance of a broader classification of measures of central tendency or location, which identify the center of a dataset; the mean really represents the most primordial form of clustering we know of. If we want to tell how close a particular data point is to that center point, we also need a yardstick to measure the distance by. This is where standard deviation, the most basic metric of dispersion, comes in handy. Rather than pinpointing the single center of the dataset, the deviation is a single measure of how diffuse or spread out all the data points are. Like the mean, standard deviation is a fundamental building block of higher statistics, one which also gives us a convenient means of gauging how far a particular data point is from the center point identified by the mean. In plain English, a Z-Score essentially tells us how many standard deviations (i.e. units of dispersion) there are between a given data point and the center. Many other calculations we’ll encounter in the next few tutorial series are either derived directly from Z-Scores, or resemble them in their use of the mean and standard deviation. To someone lost in the swirl of numbers and equations surrounding statistics, it may seem that there is no rhyme or reason to any of them, but there is a method behind the madness. In each Z-Score calculation, we’re not plugging in just anything, like a whole probability distribution or a sequence or a set, but a single data point – which matches our question, “Is this an outlier?” More sophisticated calculations may require us to further limit our choices by such considerations as data types, Content types, the internal performance requirements of the algorithm, the number of inputs, the sample size and the desired mathematical properties of the output. In the case of Z-Scores, all we really have to make sure of is that we’re inputting one of SQL Server’s numeric data types. We obviously can’t plug text or dates into a Z-Score equation, although we could perform calculations on such fields and then plug them in as needed. We also need to know the mean and standard deviation for the entire dataset, rather than a mere sample; in some situations it might be impractical to calculate them due to resource constraints, but DBAs usually have one over on those engaged in scientific research, in that they usually have populations of millions of records to drawn from if they choose to. Sometimes researchers only have access to small samples taken from unknown populations, in which case it may not be possible to apply Z-Scores at all.
…………Fortunately, coding a Z-Score is also a lot less taxing on the brain than the subject of the last post, Benford’s Law. That is why the stored procedure in Figure 1 is a lot easier to follow. The first three parameters allow you to specify a table in any database for which you have privileges, while the fourth identifies the column to be sampled and the fifth is the name of the table’s primary key. Keep in mind that I don’t allow spaces in my object names, so if you’re going to be operating on objects that have them, you’re going to have to add the requisite brackets to this code yourself. Enter the @DecimalPrecision parameter with care, since an incorrect setting will result in arithmetic overflows; leaving that setting up to the end user was a lot easier to code than a workaround that would fit all use cases. The most difficult part of the code to grasp may be the @OrderByCode, which allows you to sort the results by 1) the primary key values, ascending; 2) the primary key value descending; 3) the column values ascending; 4) the column values descending; 5) the Z-Score ascending and 6) the Z-Score descending. I’m a stickler about giving credit where it is due, so I’ll point that I’ve already done ORDER BY CASES before, but double-checked the syntax at a thread by one of the greatest assets of the SQL Server community, Pinal Dave.[ii] Uncommenting the next-to-last line will allow you to debug the procedure as needed by checking the dynamic SQL. Also be aware that I haven’t taken any steps to proof this against SQL injection attacks, so be ready to program your own security requirements into it. In addition, the procedure is created in a schema called Calculations that I will be using frequently throughout this series, so be prepared to add it to your own database or change the code.

Figure 1: Code for the Z-Score Stored Procedure
CREATE PROCEDURE [Calculations].[ZScoreSP]
@DatabaseName as nvarchar(128) = NULL, @SchemaName as nvarchar(128), @TableName as nvarchar(128),@ColumnName AS nvarchar(128), @PrimaryKeyName as nvarchar(400), @OrderByCode as tinyint = 1, @DecimalPrecision AS nvarchar(50)
AS

SET @DatabaseName = @DatabaseName + ‘.’
DECLARE @SchemaAndTableName nvarchar(400)
SET @SchemaAndTableName = ISNull(@DatabaseName, ) + @SchemaName + ‘.’ + @TableName –I’ll change this value one time, mainly for legibility purposes

DECLARE @SQLString nvarchar(max)
SET @SQLString =
‘DECLARE @OrderByCode as tinyint ,– pass the outer value like a parameter of sorts
@StDev AS decimal(‘ + @DecimalPrecision + ‘),
@Mean AS decimal(‘ + @DecimalPrecision + ‘)

– precalculating these not only makes the code more legible, but is more efficient because it is a one-time operation
SELECT @StDev = StDEv(‘ + @ColumnName + ‘)   FROM ‘ + @SchemaAndTableName +
SELECT @Mean = Avg(‘ + @ColumnName + ‘)   FROM ‘ + @SchemaAndTableName +

–SELECT @StDev  — uncomment these to debug
value errors
–SELECT @Mean

SET @OrderByCode = ‘ + CAST(@OrderByCode  AS nvarchar(50) ) +

SELECT ‘ + @PrimaryKeyName + ‘, ‘ + @ColumnName + ‘, ZScore, DENSE_RANK () OVER (ORDER BY ZScore) AS GroupRank
FROM
       (SELECT ‘ + @PrimaryKeyName + ‘, ‘ + @ColumnName + ‘,  ”ZScore” = CASE WHEN @StDev = 0 THEN 0
     ELSE (‘ + @ColumnName + ‘ – @Mean) / @StDev
     END
        FROM ‘ + @SchemaAndTableName +               GROUP BY ‘ + @PrimaryKeyName + ‘, ‘ + @ColumnName + ‘) AS T1 — the purpose of the inner query is to allow us to order by the ZScore

       ORDER BY
CASE WHEN @OrderByCode = 1 THEN ‘ + @PrimaryKeyName + ‘ END ASC,
CASE WHEN @OrderByCode = 2 THEN ‘ + @PrimaryKeyName + ‘ END DESC,
CASE WHEN @OrderByCode = 3 THEN ‘ + @ColumnName + ‘ END ASC,
CASE WHEN @OrderByCode = 4 THEN ‘ + @ColumnName + ‘ END DESC,
C
ASE WHEN @OrderByCode = 5 THEN ZScore END ASC,
CASE WHEN @OrderByCode = 6 THEN ZScore END DESC

–SELECT @SQLString — uncomment this to debug string errors
EXEC (@SQLString)

Figure 2: Sample Results from Column1 of the HiggsBosonTable

EXEC [Calculations].[ZScoreSP]
             @DatabaseName = ‘DataMiningProjects’,
             @SchemaName = ‘Physics’,
             @TableName = N’HiggsBosonTable’,
             @ColumnName = N’Column1′,
             @PrimaryKeyName = N’ID’,
             @OrderByCode = 6,
             @DecimalPrecision = ‘38,28’

ZScoreResults

…………As discussed in my last few posts, I’ll be using three publicly available practice datasets for my next three or four series of tutorials, beginning in last week’s post with a data on the Duchennes form of muscular dystrophy provided by Vanderbilt University’s Department of Biostatistics. In time, I will probably also perform outlier detection and other data mining calculations on the Voynich Manuscript, an inscrutable medieval tome with an encryption scheme so obscure that no one has been able to crack it for more than five centuries, including the National Security Agency (NSA). The best of the three datasets to stress test this procedure is the data on the Higgs Boson made available by the University of California at Irvine’s Machine Learning Repository, given that its single table dwarfs the tiny 9-kilobyte Duchennes table by almost 6 gigabytes. It also consists entirely of numeric data, unlike the transcriptions of the Voynich Manuscript I’ve imported, which are mainly appropriate for text mining. To date, I have yet to find an explanation of what the 28 float columns actually measure, although there’s an outside chance I might be able to interpret an explanation if I find one, given that I understand particle physics too well for my own good back in fourth grade.[iii] Figure 2 depicts a sample query against the first float column in the HiggsBosonTable of my DataMiningProjects database, which includes all three datasets. Note that it’s in descending order by Z-Scores. The GroupRank separates identical Z-Score values into distinct groups, through the DENSE_RANK windowing function; feel free to eliminate it from the code if it drags down performance on your databases. I find it handy when running the procedure against tables with small ranges of distinct and frequently duplicated values. This logic may also be enhanced by intrepid programmers to handle bucketing and banding of contiguous but not quite identical values, perhaps using fuzzy sets with graded memberships. The interpretation is not difficult at all: the further away Z-Scores are in either direction from zero, the more likely they are to represent any outliers. It only becomes difficult once we compare the results to particular probability distributions, which often expect certain percentage of their values to occur in specific ranges and therefore makes the definition of an outlier less arbitrary in that context; for example, the Gaussian or “normal” distribution, i.e. the bell curve, expects about 68 percent of the values to be within the first standard deviation, 95 within the second and 99.7 within the third.
…………The procedure took much less time to execute on the 11 million rows (the seven at the tail end are accidental duplicates I’ve been procrastinating on removing, but they’re inconsequential for today’s article) of the HiggsBosonTable than I expected, given that my poor beat up development machine is hardly a match for the servers most DBAs use every day. Unfortunately, I was unable to test it with a columnstore index because I haven’t been able to afford to upgrade from SQL Server 2012 Developer Edition to 2014, in which Microsoft lifted many of the restrictions that made it impossible to apply them to many user scenarios. The Column1 I was testing in the HiggsBosonTable has a precision of 33 and a scale of 29, so I naturally received this error when trying to create one: “CREATE INDEX statement failed. A columnstore index cannot include a decimal or numeric data type with a precision greater than 18. Reduce the precision of column ‘Column 1′ to 18 or omit column ‘Column1′. (Microsoft SQL Server, Error: 35341).” I was, however, able to reduce the client processing time and total execution time by adding a regular nonclustered index to Column1. The total execution time was higher, but only because of an increased Wait Time on Server Replies, which was probably due to pressure on the server from other unrelated tasks. If you click on the pictures of the execution plans to enlarge them, you’ll see that the index was used when calculating the standard deviation, the average and the Z-Scores alike.

Figure 3: Client Statistics and Execution Plans for the Z-Score Procedure on the HiggsBosonTable
ZScoreClientStats ZScoreExecutionPlans (1) ZScoreExecutionPlans (2)

…………In my last post, I cited many articles by professionals who gave a long laundry list of use cases, limitations and nuances of interpretation for Benford’s Law. I searched for similar materials for Z-Scores but essentially came up empty, perhaps because they’re so well-established, well-understood and trivial to calculate. The one clear exception I saw mentioned in the literature is that fat-tailed distributions, in which the data is significantly skewed in one or more directions, may require the use of less outlier-sensitive techniques. This particularly outlier identification method is used most often in conjunction with the normal distribution but doesn’t have to be; out of all the techniques I’ll survey in this series, it is perhaps the most flexible and suited to the widest range of use cases. It is fairly easy to understand and interpret, performs well and isn’t dependent on any particular cut-off criteria for defining outliers, unlike many others that force unnecessary all-or-nothing choices. This means we can use it to ask how much of an outlier a record is, which is a much more informative question than simply asking if a record belongs in a single outlier bucket or outside it.
…………In the next installment of this series I’ll briefly touch on Iglewicz and Hoaglin’s Modified Z-Scores, which are one of many twists that can be applied to Z-Scores depending on the use cases at hand. That material should be fairly light and easy, as should some of the posts that follow on stats like Grubbs’ Test and the Tietjen-Moore Test that have deceptively scary names. Interquartile Range and Dixon’s Q-Test should also be trivial to handle. I’m not sure yet how difficult it will be to explain and code lesser-known measures like the Generalized Extreme Studentized Deviate (ESD) Test , Cook’s Distance, Peirce’s Criterion, Chauvenet’s Criterion and the Modified Thompson Tau Test. Eventually I’ll also be doing a recap of how to use SSDM Clustering for finding aberrant data and writing a post tentatively titled Visual Outlier Detection with Reporting Services, which will be full of eye candy rather than T-SQL and equations like the others. I’m not yet sure what order I’ll be tackling them all in, except for Mahalanobis Distance, which is apparently a quite sophisticated and useful method that unfortunately has math of commensurate difficulty. If all goes according to plan, climbing that final hurdle will propel me into a much more difficult but useful series, Information Measurement with SQL Server. We will see references to Z-Scores and formulas like it through that series, as well as the remainder of this one.

[i] See See Hopkins, Will G., 2001, A New View of Statistics website. The home page for the site is http://www.sportsci.org/resource/stats/index.html.

[ii] Pinal Dave, 2007, “SQL SERVER – CASE Statement in ORDER BY Clause – ORDER BY Using Variable,” published July 17, 2007 at the Journey to SQL Authority with Pinal Dave website. Available online at

http://blog.sqlauthority.com/2007/07/17/sql-server-case-statement-in-order-by-clause-order-by-using-variable/

[iii] When my family moved out of the home we had in my elementary school days, we never did find the uranium and radium samples that came with my do-it-yourself cloud chamber kit. Oh well, I suppose the new owners either sold it to some North Koreans, or their kids all have seven toes. I suppose I shouldn’t lose hope of finding them again, given that their half-lives were several million years apiece.

Follow

Get every new post delivered to your Inbox.