## Outlier Detection with SQL Server, part 6.3: Visual Outlier Detection with Reporting Services Plots and SSDM Clustering

**By Steve Bolton**

…………When the goal is to illustrate just how outlying an outlier may be, the efficiency with which scatter plots represent distances really can’t be beaten. It doesn’t take any training in mathematics to look at one and notice that a few data points are further away from the others, in proportion to how different they are from the normal values by some particular measure. Any five-year-old can grasp that idea. It isn’t necessarily easy to calculate the measures that populate many of the more advanced species of scatter plots, but you still don’t need to be a rocket scientist to interpret them. That makes them an ideal choice when our use cases call for easily interpretable visualizations. There are still some drawbacks to scatter plots and their variants – including their simplicity, which can be a weakness when our user requirements call for greater sophistication. Like every other visual method outlined in this series of amateur mistutorials on identifying outliers with SQL Server, it amounts to a brute force, unconscious assessment that “if something looks out of place, we’ll look at it more closely.” That tells us nothing about *why* a data point is an outlier, or whether or not the domain knowledge and underlying data distribution predict how many values we ought to find at that particular point. Sooner or later, every scatter plot can be saturated with so many data points that they obscure the information we’re looking for. Every outlier detection method we’ve surveyed in this series has suffered from some similar limitation imposed by the sheer scale of the tables commonly seen in SQL Server, which can number in the thousands, millions or even billions of rows.
Nevertheless, scatter plots take a lot longer than some other methods to reach the breaking point where strategies like binning and banding have to be applied; that is mainly because they can show outliers at any point in a graph of two or more dimensions, whereas other display techniques like box plots limit them to very confined spaces. Even the relatively efficient histogram/run chart hybrids discussed in a previous post retain a lot of the white space above their bars, where no outliers can be depicted; this is even before we take into consideration the binning that they are more susceptible to, which tends to hide outliers. Given how flexible and efficient they are, it is not surprising that there are so many variations on the theme. In this installment of the series, however, I’ll concentrate on just three subtypes that are frequently mentioned in connection with outlier detection in the data mining literature.

…………The logic of scatter plots is so simple, in fact, that it requires less T-SQL to populate one through a stored procedure than in any other code sample I’ve posted in this series. Readers of this series have probably grown accustomed to the familiar list of parameters, which allows users to select a count for all the distinct combinations of two columns in any single table in any database they have access to. The rest is just the usual dynamic SQL and the debugging string on the next-to-last line. The chart in Figure 2 is fairly self-explanatory: on the horizontal axis we find the value for the Hemopexin protein in the 209-row dataset on the Duchennes form of muscular dystrophy we’ve been using for sample purposes throughout this series (which I downloaded from Vanderbilt University’s Department of Biostatistics and converted to a SQL Server table). On the vertical axis we find the value of the Creatine Kinase enzyme; wherever there’s a scatter plot point, we have a combination of the values of both columns. The only thing that might require explanation here is the size of the bubbles in the 3D Bubble Chart included with Report Builder, which I tied to the count of the records for each combination. As anyone can see, the bulk of the values are centered on the bottom right, but there are two separate groups of outliers, which I’ve drawn red circles around.

**Figure 1: Code for the Simple Scatter Plot Procedure**

CREATE PROCEDURE [Calculations].[SimpleScatterPlotSP]

@DatabaseName as nvarchar(128) = NULL, @SchemaName as nvarchar(128), @TableName as nvarchar(128),@ColumnName1 AS nvarchar(128), @ColumnName2 AS nvarchar(128)

AS

DECLARE @SchemaAndTableName nvarchar(400), @SQLString nvarchar(max)

SET @SchemaAndTableName = ISNULL(@DatabaseName + '.', '') + @SchemaName + '.' + @TableName -- the database prefix is added only when a name is supplied

SET @SQLString = 'SELECT ' + @ColumnName1 + ' AS Column1, ' + @ColumnName2 + ' AS Column2,

Count(*) OVER (PARTITION BY ' + @ColumnName1 + ', ' + @ColumnName2 + ' ORDER BY ' + @ColumnName1 + ', ' + @ColumnName2 + ' ) AS CombinationCount

FROM ' + @SchemaAndTableName + '

WHERE ' + @ColumnName1 + ' IS NOT NULL AND ' + @ColumnName2 + ' IS NOT NULL'

--SELECT @SQLString -- uncomment this to debug string errors

EXEC (@SQLString)
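
The windowed count that the procedure assembles can be tried in isolation. What follows is only a minimal sketch under stated assumptions: the temp table `#ScatterDemo` and its six rows are made up for illustration, and wrapping the column names in `QUOTENAME` is an optional hardening step I have added here, not something the procedure above does.

```sql
-- Hypothetical sample data; the table and its values are invented for this sketch
CREATE TABLE #ScatterDemo (Hemopexin decimal(9,2) NULL, CreatineKinase decimal(9,2) NULL);

INSERT INTO #ScatterDemo (Hemopexin, CreatineKinase)
VALUES (80.0, 50.0), (80.0, 50.0), (82.5, 48.0), (81.0, 52.0), (30.0, 900.0), (82.5, NULL);

DECLARE @ColumnName1 nvarchar(128) = N'Hemopexin',
        @ColumnName2 nvarchar(128) = N'CreatineKinase',
        @SQLString nvarchar(max);

-- QUOTENAME brackets each identifier, so an odd character in a column name cannot alter the statement
SET @SQLString = N'SELECT ' + QUOTENAME(@ColumnName1) + N' AS Column1, ' + QUOTENAME(@ColumnName2) + N' AS Column2,
    COUNT(*) OVER (PARTITION BY ' + QUOTENAME(@ColumnName1) + N', ' + QUOTENAME(@ColumnName2) + N') AS CombinationCount
FROM #ScatterDemo
WHERE ' + QUOTENAME(@ColumnName1) + N' IS NOT NULL AND ' + QUOTENAME(@ColumnName2) + N' IS NOT NULL';

EXEC sys.sp_executesql @SQLString;

DROP TABLE #ScatterDemo;
```

The (80.0, 50.0) pair comes back twice, each time with a CombinationCount of 2, which is what would drive a larger bubble; the row with the NULL enzyme value is filtered out.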

**Figure 2: Simple 3D Scatter Plot Example with Outliers**

…………As Stan Lee might say, ‘Nuff Said. We might be able to add more useful information through such tricks as setting the bubbles for the normal values semi-transparent or the color of the outliers to an eye-catching red, or setting the ToolTip to display the value associated with each outlier (as I did here, although my screen capture utility failed to pick it up). Yet the primary means of adding sophistication is by changing the measures used on both axes. It is quite common to compare multiple columns using bubbles of different colors, but this is unsuitable for our purposes because it would quickly conceal the outliers for all columns in a mass of clutter. Adding columns to the mix might be unwise, but using more complex statistical means to calculate the data points would not cause clutter by saturating it with data points. There might be a performance cost in the calculations and the ease of interpretation would decline as the meaning of the measures grows more intricate, but in many use cases such drawbacks aren’t problematic. Our imagination is really the only limit on the kinds of measures we can put on the axes of scatter plots, but two in particular are singled out as outlier detection methods in statistical literature like the National Institute of Standards and Technology’s Engineering Statistics Handbook. One of these is the lag plot[1], which takes very little code to implement thanks to the Lag windowing function added to T-SQL in SQL Server 2012. The basic idea is to compare the value of a column on the vertical axis against a prior data point in the column, by some fixed amount that is often measured in time slices. I’ve had some issues converting the SampleDate in the DuchennesTable accurately to a SQL Server data type, but regardless of whether the results are accurate, Figure 4 still illustrates how unusual values can be exposed through such a technique.
Lag plots are a less appropriate means of outlier detection than a regular scatter plot because their primary purpose is measuring the randomness of a dataset, which is tangentially rather than directly related to finding outliers, i.e. seemingly random points within a non-random pattern. A lag value of one time slice is the most common value input into a lag plot, although this procedure allows it to be set to any arbitrary bigint value through the @LagInterval parameter. There are apparently many nuances to the interpretation of lag plots that amount to pattern recognition, such as the identification of sinusoidal patterns with cyclical activity[2], but that is really beyond our purview. In the absence of some rigorous logical or mathematical definition of what a “pattern” is, we basically have to use the old rule of thumb: if it looks out of place, it might be an outlier.
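
To make the pairing concrete, here is a minimal sketch of the value-versus-lag comparison on a tiny made-up series. The table variable `@Readings`, its column names and its six values are assumptions for illustration only; nothing here comes from the Duchennes dataset.

```sql
-- Hypothetical series with one spike at position 4
DECLARE @Readings table (SampleOrder int PRIMARY KEY, Reading decimal(9,2) NOT NULL);

INSERT INTO @Readings (SampleOrder, Reading)
VALUES (1, 10.0), (2, 11.0), (3, 10.5), (4, 42.0), (5, 11.5), (6, 10.0);

DECLARE @LagInterval int = 1;

-- Each row is paired with the value @LagInterval positions earlier in SampleOrder
SELECT SampleOrder, Reading AS ColumnValue, LagResult
FROM (SELECT SampleOrder, Reading,
             LAG(Reading, @LagInterval) OVER (ORDER BY SampleOrder) AS LagResult
      FROM @Readings) AS T1
WHERE LagResult IS NOT NULL -- the first @LagInterval rows have no earlier partner
ORDER BY SampleOrder;
```

Note that the single spike of 42.0 shows up in two of the five returned pairs, once as the ColumnValue and once as the LagResult, so one aberrant value tends to produce two stray points on a lag plot.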

**Figure 3: Code for the Lag Plot Procedure**

CREATE PROCEDURE [Calculations].[LagPlotSP]

@DatabaseName as nvarchar(128) = NULL, @SchemaName as nvarchar(128), @TableName as nvarchar(128),@ColumnName AS nvarchar(128), @LagColumnName AS nvarchar(128), @LagInterval bigint

AS

DECLARE @SchemaAndTableName nvarchar(400), @SQLString nvarchar(max)

SET @SchemaAndTableName = ISNULL(@DatabaseName + '.', '') + @SchemaName + '.' + @TableName -- the database prefix is added only when a name is supplied

SELECT @SQLString = 'SELECT DISTINCT ' + @ColumnName + ' AS ColumnValue, LagResult, Count(*) OVER (PARTITION BY ' + @ColumnName + ', LagResult

ORDER BY LagResult) AS CombinationCount

FROM (SELECT TOP 99999999999 ' + @ColumnName + ', ' + @LagColumnName + ' AS LagColumn,

Lag(' + @ColumnName + ', ' + CAST(@LagInterval AS nvarchar(50)) + ') OVER (ORDER BY ' + @LagColumnName + ') AS LagResult

FROM ' + @SchemaAndTableName + '

WHERE ' + @ColumnName + ' IS NOT NULL) AS T1

WHERE LagResult IS NOT NULL

ORDER BY ColumnValue, LagResult ASC'

--SELECT @SQLString

EXEC (@SQLString)

**Figure 4: Lag Plot Example with Outliers**

…………As can be gathered from the length of the code for the two procedures in Figure 5, computing an autocorrelation plot is somewhat trickier.[3] The basic idea is that rather than comparing the values of a single column against a single lag value, we make the comparison against a range of lag values, which go on the horizontal axis. The second procedure merely creates a list of lag values within the limits specified by the user through the @LagBegin, @LagEnd and @LagStep parameters, then calls the first procedure iteratively to return one autocorrelation value per lag. The calculations for this stat are a little more involved than for a simple lag, but not nearly as difficult and dry as some of those introduced in previous posts – let alone the monster procedures required for the next two tutorials on Cook’s Distance and Mahalanobis Distance.[4] On occasion I have received values greater than one for these autocorrelations, which is abnormal but apparently allowable when the values have been normalized (i.e. recalibrated to a different scale, rather than being “normalized” like a database schema in DBA lingo).[5] Under the formula in the NIST handbook, however, the coefficient is bounded between -1 and 1, so a result outside that range is better read as a warning sign about the calculation than as a property of the data. Alas, the point is moot anyways because autocorrelation plots are not as useful as lag plots for identifying outliers on the kind of scale DBAs operate at, just as lag plots are in turn less well-suited in many cases than ordinary scatter plots. The basic concept is that the further the values are from the baseline of zero – like the first data point in Figure 6 – the less random the process is. The more values that are closer to zero, the more random the dataset is.[6] This really amounts to shifting back to a bar chart-style type of visualization, which can’t display as many outliers as an ordinary scatter plot. It also requires more computation and more interpretation, since the meaning is not readily apparent to the untrained eye.
To make matters worse, autocorrelation is designed to depict the degree of randomness exhibited by a dataset, which requires us to identify a pattern before looking for an exception to it; with ordinary scatter plots, any pattern and its exceptions are evident in a single step. Another difficulty with this approach is that you have to take the automatic rescaling into account; I also tried this on the Hemopexin, CreatineKinase and PyruvateKinase columns and there were a couple of points that were further away from the others, but the autocorrelation values ranged between roughly 0 and 0.25, which isn’t that big of a gap; yet with the LactateDehydrogenase column in Figure 6, they’re between -0.335196 and 0.405092, which is about a third of the scale from -1 to 1.
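
For reference, the lag-k autocorrelation coefficient in the NIST handbook cited in the footnotes is usually written as shown here. The symbols are the handbook's conventions as I understand them, not identifiers from the procedures: N is the number of observations, Y-bar is the mean of the column and Y_i is the i-th value in the order set by the lag column.

```latex
r_k = \frac{\sum_{i=1}^{N-k} \left(Y_i - \bar{Y}\right)\left(Y_{i+k} - \bar{Y}\right)}{\sum_{i=1}^{N} \left(Y_i - \bar{Y}\right)^2}
```

Both factors in the numerator are mean-centered and the denominator runs over all N rows, which is what keeps the coefficient between -1 and 1.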

**Figure 5: Code for the Two Autocorrelation Plot Procedures**

CREATE PROCEDURE [Calculations].[AutocorrelationSP]

@DatabaseName as nvarchar(128) = NULL, @SchemaName as nvarchar(128), @TableName as nvarchar(128),@ColumnName AS nvarchar(128), @LagColumnName AS nvarchar(128), @LagInterval AS bigint, @DecimalPrecision AS nvarchar(50)

AS

DECLARE @SchemaAndTableName nvarchar(400), @SQLString nvarchar(max)

SET @SchemaAndTableName = ISNULL(@DatabaseName + '.', '') + @SchemaName + '.' + @TableName -- the database prefix is added only when a name is supplied

SELECT @SQLString = 'DECLARE @Mean as decimal(' + @DecimalPrecision + ')

SELECT @Mean = Avg(' + @ColumnName + ')

FROM ' + @SchemaAndTableName + '

SELECT SUM(TopOperand) / NULLIF(SUM(BottomOperand), 0) AS AutoCorrelationCoefficient -- SUM skips the NULL products of the first rows, which have no lagged partner

FROM (SELECT (' + @ColumnName + ' - @Mean) * (Lag(' + @ColumnName + ', ' + CAST(@LagInterval AS nvarchar(50)) + ') OVER (ORDER BY RN) - @Mean) AS TopOperand, -- both the value and its lag are mean-centered

Power((' + @ColumnName + ' - @Mean), 2) AS BottomOperand -- squared deviations, summed over every row for the denominator

FROM (SELECT ROW_NUMBER() OVER (ORDER BY ' + @LagColumnName + ' DESC) AS RN, ' + @ColumnName + '

FROM ' + @SchemaAndTableName + '

WHERE ' + @ColumnName + ' IS NOT NULL) AS T1) AS T2'

--SELECT @SQLString -- uncomment this to debug the dynamic SQL

EXEC (@SQLString)

GO

CREATE PROCEDURE [Calculations].[AutocorrelationPlotSP]

@DatabaseName as nvarchar(128) = NULL, @SchemaName as nvarchar(128), @TableName as nvarchar(128),@ColumnName AS nvarchar(128), @LagColumnName AS nvarchar(128), @LagBegin decimal (38,21), @LagEnd decimal(38,21), @LagStep decimal (38,21), @DecimalPrecision AS nvarchar(50)

AS

DECLARE @InputTable table

(ID bigint IDENTITY (1,1),

LagAmount decimal(38,21)

);

DECLARE @ResultTable table

(ID bigint IDENTITY (1,1),

AutoCorrelationValue decimal (38,21)

);

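-- use the standard CTE method of generating sequences to populate the lag amounts for the table

WITH RangeCTE(RangeNumber) AS

( SELECT @LagBegin as RangeNumber

UNION ALL

SELECT CAST(RangeNumber + @LagStep AS decimal(38,21)) -- the cast keeps the recursive member the same type as the anchor

FROM RangeCTE

WHERE RangeNumber + @LagStep <= @LagEnd) -- stop before stepping past @LagEnd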

INSERT INTO @InputTable

(LagAmount)

SELECT RangeNumber

FROM RangeCTE

ORDER BY RangeNumber ASC

DECLARE @SQLString nvarchar(max),

@CurrentTableVarID bigint = 0,

@MaxTableVarID bigint = 0,

@CounterCheck bigint = 0,

@LagInterval decimal(38,21)

SELECT @MaxTableVarID = Max(ID), @CurrentTableVarID = Min(ID) FROM @InputTable

WHILE @CurrentTableVarID <= @MaxTableVarID

BEGIN

SELECT @LagInterval = LagAmount

FROM @InputTable

WHERE ID = @CurrentTableVarID

-- call the first procedure directly; the parameters are passed through as-is, so a NULL database name or a long object name causes no trouble

INSERT INTO @ResultTable

(AutoCorrelationValue)

EXEC Calculations.AutocorrelationSP @DatabaseName, @SchemaName, @TableName, @ColumnName, @LagColumnName, @LagInterval, @DecimalPrecision

SET @CounterCheck = @CounterCheck + 1

SET @CurrentTableVarID = @CurrentTableVarID + 1 -- increment the loop

END

END

SELECT LagAmount, AutoCorrelationValue

FROM @InputTable AS T1

INNER JOIN @ResultTable AS T2

ON T1.ID = T2.ID
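
For checking results by hand, the same statistic can be computed set-wise on a tiny series without any dynamic SQL. This is only a sketch under stated assumptions: the table variables `@Series` and `@Lags` and their values are made up, and the calculation follows the textbook form with a mean-centered value and lag over a full-series denominator.

```sql
-- Hypothetical series: a steady upward trend in five rows
DECLARE @Series table (RN int PRIMARY KEY, Y float NOT NULL);
INSERT INTO @Series (RN, Y) VALUES (1, 1.0), (2, 2.0), (3, 3.0), (4, 4.0), (5, 5.0);

-- The lag values to evaluate
DECLARE @Lags table (LagAmount int PRIMARY KEY);
INSERT INTO @Lags (LagAmount) VALUES (1), (2), (3);

DECLARE @Mean float, @SumOfSquares float;
SELECT @Mean = AVG(Y) FROM @Series;
SELECT @SumOfSquares = SUM(SQUARE(Y - @Mean)) FROM @Series; -- denominator over every row

-- Pair each row with the row LagAmount positions later, then sum the centered products per lag
SELECT L.LagAmount,
       SUM((S1.Y - @Mean) * (S2.Y - @Mean)) / NULLIF(@SumOfSquares, 0) AS AutoCorrelationValue
FROM @Lags AS L
     CROSS JOIN @Series AS S1
     INNER JOIN @Series AS S2
         ON S2.RN = S1.RN + L.LagAmount
GROUP BY L.LagAmount
ORDER BY L.LagAmount;
```

Working the sums by hand for this series gives roughly 0.4, -0.1 and -0.4 for lags 1 through 3, all inside the -1 to 1 range. The self-join is fine for a toy table but would be a poor choice at the row counts discussed in this series.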

**Figure 6: Autocorrelation Plot Example with Outliers**

…………When all is said and done, one of the more effective visual means of identifying outliers available to us out-of-the-box is also one of the least publicized: the Microsoft Cluster Viewer, which depicts the results calculated by the two SQL Server Data Mining (SSDM, to coin an abbreviation) Clustering algorithms. The calculations are fast and don’t require any custom code of the kind I’ve posted here. Best of all, it can do what Reporting Services cannot: efficiently depict both summary stats and individual data points together, by clicking on a cluster and viewing the Drill Through window (by selecting Model Columns Only or Model and Structure Columns in the submenu) as seen in Figure 7. As discussed in the last post, SSRS is crippled by its simultaneous inabilities to consume more than one result set from a stored procedure in a single dataset, to consume more than one dataset in a single chart, or calculate statistics as efficiently as T-SQL or .Net languages like Visual Basic. Perhaps it is possible to add this functionality through custom .Net code, but if so, the workaround hasn’t been well-publicized, judging from the frequency with which the complaint is raised in coding forums. As we have seen in this segment of the series, it can still be used as an effective outlier detection tool, but much of its potential has not been realized. The potential of SSDM, in contrast, has been realized but not recognized by the SQL Server community at large. As discussed in my last tutorial series, SSDM as a whole is perhaps the most unsung and long-neglected component of SQL Server. One of its many overlooked benefits is the Cluster Viewer, which acts much like a scatter plot, except that the points represent a set of summary stats and the associated data points, rather than one or the other as would be the case in any SSRS chart.
By using a combination of the Density shading color, the Shading Variable dropdown and various values of the State dropdown (like Very Low, Low, Average, High and Very High), outliers easily become apparent. The All Links slider can also be used to show or hide links between clusters, which isn’t possible between data points in an ordinary scatter plot. As I pointed out in the last two posts, I’m trying to avoid tutorial clutter by omitting descriptions of basic tasks in Reporting Services, like adding a diagram to a report; in the same vein, I’m not going to get into an in-depth description of the Clustering algorithm, which I’ve already introduced in A Rickety Stairway to SQL Server Data Mining, Algorithm 7: Clustering. As discussed there, the Cluster Viewer can be used to display two types of results, one of which is derived from the Expectation Maximization (EM) version of the Clustering algorithm. It also displays the results of the K-Means Clustering algorithm, which, like the popular K-Nearest Neighbors formulas that are apparently used quite frequently in outlier detection, works from the distances between data points.[7] All of the methods in this family are based on the idea of calculating a relative distance between data points, which is a somewhat more sophisticated measure than the crude distances implied in scatter plots or the difference between each data point and the mean or median, as seen in Z-Scores and many other outlier detection methods we’ve already discussed. This particular flavor of SSDM Clustering is really the first of three outlier detection methods that are based on such distance calculations. In the next two posts I’ll close out the series by discussing the two most sophisticated distance-based outlier identification methods I could find in the data mining literature, Cook’s Distance and Mahalanobis Distance.
There are some costs in terms of interpretability, performance and the legibility of the lengthy code, but they address some unique use cases in the field of outlier identification.

**Figure 7: SSDM Clustering Example with a Cluster of Outliers**

[1] National Institute of Standards and Technology, 2014, “1.3.3.15 Lag Plot,” published in the online edition of the Engineering Statistics Handbook. Available online at http://www.itl.nist.gov/div898/handbook/eda/section3/lagplot.htm

[2] *IBID*.

[3] I tried to use the window frame clause ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING in this code but couldn’t get it to perform the way I wanted. I may take another shot at it though if I ever need to rewrite this procedure for some practical application.

[4] I derived this code from the formulas given at two sources: National Institute of Standards and Technology, 2014, “1.3.5.12 Autocorrelation,” published in the online edition of the Engineering Statistics Handbook. Available online at http://www.itl.nist.gov/div898/handbook/eda/section3/eda35c.htm. Also see the course notes published by McInally, Cameron, 2008, “WCSLU2850.Lo1 Web Project 11,” published April 23, 2008 at the Coparoom.com web address http://www.coparoom.com/archive/Fordham/courses/Spring2008/CS2850/web_project/project11.html

[5] See the reply by the user named thrillhouse86 to the thread “Can Autocorrelation Be Geater Than One?” published April 27, 2010 at the Physics Forums web address http://www.physicsforums.com/showthread.php?t=392277

[6] National Institute of Standards and Technology, 2014, “1.3.3.1. Autocorrelation Plot,” published in the online edition of the Engineering Statistics Handbook. Available online at http://www.itl.nist.gov/div898/handbook/eda/section3/autocopl.htm

[7] See the Wikipedia page “Outlier” at http://en.wikipedia.org/wiki/Outlier

## Outlier Detection with SQL Server, part 6.2: Finding Outliers Visually with Reporting Services Box Plots

**By Steve Bolton**

…………Throughout this series of amateur mistutorials in using SQL Server to identify outliers, we have repeatedly seen that the existing tried-and-true methods of detection long used for such purposes as hypothesis testing are actually poorly suited for finding aberrant values in large databases. The same problem of scale also affects the simple visual inspection methods we’re surveying in this segment of the series; for example, we reached a point in last week’s tutorial where histograms and run charts had to be binned to accommodate large record counts. The problem with such strategies for our purposes is that they blur outliers by concealing them in a mass of other values, rather than highlighting how much they contrast with normal values. That difficulty is even more pronounced with the box plots, which are a well-established form of outlier detection that unfortunately doesn’t seem to scale well. Because all of the outliers are represented on a single axis, they easily blur into a single undifferentiated line after the accumulation of a handful of records. This is a problem with all visual inspection methods, but it is more pronounced with box plots and their relatives because they’re limited to representing them in a single dimension. The histograms we introduced last week and the variants of scatter plots we’ll discuss in the next installment can have the advantageous capability of displaying values in two dimensions, which conveys information in the available space in a much more efficient way. Because of this insurmountable, built-in limitation, I’ll only spend a short time discussing how to implement this family of plots in Reporting Services. I’m omitting discussion of probability plots from this series altogether for basically the same reason: they certainly have their uses, as we shall discover in a future series on goodness-of-fit testing, but outlier detection only amounts to an afterthought in comparison to them.

…………Don’t get me wrong: box plots are entirely appropriate for certain use cases, including detecting a handful of outliers. Their most common use, however, is in comparing the variability of multiple columns against each other, or a single column against itself after partitioning it by some flag; this is especially useful when comparing trials of scientific experiments against each other. The technique was popularized by 20th-century statistician John Tukey, who is recognized as one of the Founding Fathers of the field of data mining[1] and was also instrumental in the development of a number of important statistical tools, particularly the Fast Fourier Transform (FFT).[2] He was also apparently quite sane, in contrast to the many famous mathematicians and physicists who have completely lost their minds and morals, as I pointed out a few times in my last tutorial series, A Rickety Stairway to SQL Server Data Mining; he had the common sense to correct the uncommon nonsense found in the infamously flawed Kinsey Report on human sex habits.[3] Edward Tufte, whose name is synonymous with data visualization, has nothing but praise for Tukey.[4] They may be ugly, but the box plots he invented are very effective in conveying some of the stats associated with Interquartile Range, if you know how to read them. Unfortunately, most of the emphasis is on comparing the variability of data over a set of columns or trials, not finding outliers, so the outlier information we’re looking for is obscured.

…………When interpreting plots of the kind depicted in Figure 1, just remember this simple rule mentioned by Kaiser Fung in an old post on data visualization: “the box contains the middle 50% of the data…the line inside the box is the median score; the dots above (or below, though nonexistent here) the vertical lines are outliers.”[5] The edges of the box represent the first and third quartiles, which we discussed earlier in this series in the post on Interquartile Range. The whiskers are the tricky part, since they can represent all kinds of different measures, like standard deviations and various percentiles of values.[6] Sometimes the min and max values of the dataset are used for the whiskers, but this is unsuited to our purposes because it would further obscure any outliers. My implementation is equivalent to a Tukey box plot, in which the whiskers represent the inner fence values of the Interquartile Range and another line is added to represent the median. The yellow points represent the maximum and minimum values where the Interquartile Range procedure returned an OutlierDegree equal to 1 and the red ones correspond to the same for OutlierDegree = 2. This isn’t standard practice, so I had to add code to the SSRS report to implement it.
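
The fences behind the whiskers can be sketched in a few lines. This assumes the common Tukey convention of inner fences at 1.5 times the Interquartile Range beyond the quartiles and outer fences at 3 times, with an OutlierDegree of 1 for values between the two and 2 for values beyond the outer fence. The Interquartile Range procedure from the earlier post may compute its quartiles differently, and the table variable `@Sample` and its values are made up.

```sql
-- Hypothetical readings; the last two are meant to land outside the fences
DECLARE @Sample table (ID int IDENTITY(1,1) PRIMARY KEY, Reading decimal(9,2) NOT NULL);
INSERT INTO @Sample (Reading)
VALUES (10), (11), (12), (12), (13), (13), (14), (15), (22), (40);

WITH QuartileCTE AS
(   -- PERCENTILE_CONT repeats the same value on every row, so DISTINCT collapses it to one row
    SELECT DISTINCT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY Reading) OVER () AS LowerQuartile,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY Reading) OVER () AS UpperQuartile
    FROM @Sample
),
FenceCTE AS
(   SELECT LowerQuartile - 1.5 * (UpperQuartile - LowerQuartile) AS LowerInnerFence,
           UpperQuartile + 1.5 * (UpperQuartile - LowerQuartile) AS UpperInnerFence,
           LowerQuartile - 3.0 * (UpperQuartile - LowerQuartile) AS LowerOuterFence,
           UpperQuartile + 3.0 * (UpperQuartile - LowerQuartile) AS UpperOuterFence
    FROM QuartileCTE
)
SELECT S.ID, S.Reading,
       CASE WHEN S.Reading < F.LowerOuterFence OR S.Reading > F.UpperOuterFence THEN 2
            WHEN S.Reading < F.LowerInnerFence OR S.Reading > F.UpperInnerFence THEN 1
            ELSE 0 END AS OutlierDegree
FROM @Sample AS S
     CROSS JOIN FenceCTE AS F
ORDER BY S.Reading;
```

With these readings the quartiles work out to 12 and 14.75, so the upper fences sit at 18.875 and 23; the value 22 is flagged with degree 1 and the value 40 with degree 2.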

…………Keep in mind when interpreting Figure 1 that there may be many more outliers than the four dots depicted here, which take up a minuscule amount of the available space and thus convey the information we need very inefficiently. As I will discuss in more depth momentarily, SSRS simply doesn’t provide an out-of-the-box way to drill down to the records that ought to accompany the kind of summary statistics that the mean, median, Interquartile Range and the like represent. We’re therefore limited to displaying just a few data points based on the minimum and maximum values associated with the class of OutlierDegree they belong to. And even if we could access all of the individual data points alongside the summary statistics, as is normally the case with other implementations of box plots, we still wouldn’t be able to display them all because of the scaling issue. Adding more than a handful of values above or below the boxes quickly blurs them into an undifferentiated line, thereby concealing the information we’re after. If our purpose is looking for aberrant values, then it is much more efficient to simply display all the data points at once in a table ordered by the OutlierDegree flag column, with the summary statistics neatly available in a single view above it. For an example, see the tables returned in Outlier Detection with SQL Server, part 5: Interquartile Range. Figure 1 doesn’t present any new facts or unearth any buried information that isn’t already freely available to us in the table format, nor does it execute any faster. The data below comes from the same 209-row dataset on the Duchennes form of muscular dystrophy I have been using throughout this series for examples.[7] Since I’m not familiar with Duchennes[8] or biochemistry, I might be unwittingly making an apples-and-oranges comparison between the values for the protein Hemopexin and the enzyme Lactate Dehydrogenase.
The purpose here was merely to demonstrate how difficult it is to spot outliers with this technique, not to expose any relationship between the columns, so it’s beside the point anyways. It is worth noting though that adding the columns for the Creatine Kinase and Pyruvate Kinase enzymes to the box plot quickly rendered it unreadable, due to a few extreme outliers stretching out the vertical scale too far – which gives you an idea of how fragile box plots can be when applied to our purposes.

**Figure 1: A Sample Box Plot Report with Outliers**

…………To feed data to these reports, I had to rejig the stored procedure used in the aforementioned article on Interquartile Range and write the T-SQL code in Figure 2 to call it multiple times for each comma-separated column included in the @ColumnString parameter. The length of the code may seem intimidating, but it’s actually quite simple; the SplitColumnNameStringCTE merely separates the @ColumnString into a table variable, which is joined to a table variable that holds the results for each column. The rest of the code merely loops over each column in the list and feeds it to the Interquartile Range procedure; the number of parameters we need to feed to that procedure and the large number of return fields accounts for much of the length of this code. The first three parameters also enable users to select columns in any single table in any database for which they have the requisite access. It is certainly possible to extend this procedure to compare columns from multiple tables, but this is just for illustration purposes, so I kept it as simple as I could. I’ve included the @DecimalPrecision parameter in most of the procedures I’ve posted in this series so that users can adjust the precision and scale of the calculations to avoid overflows, but it may be necessary to ratchet it down further to keep Reporting Services from returning blank reports, as SSRS was doing with the run sequence plots in last week’s tutorial for some unfathomable reason.
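
On newer versions of SQL Server, the string-splitting step can be shortened considerably. The sketch below is an alternative, not what the procedure in Figure 2 does: it assumes SQL Server 2016 or later with database compatibility level 130 or higher, where the built-in `STRING_SPLIT` function is available, and the sample column list is only an illustration.

```sql
-- Hypothetical comma-separated column list, with uneven spacing on purpose
DECLARE @ColumnString nvarchar(400) = N'Hemopexin, LactateDehydrogenase,CreatineKinase';

DECLARE @ColumnTable table
(ID bigint IDENTITY (1,1),
 ColumnName nvarchar(128)
);

-- STRING_SPLIT returns one row per delimited item in a column named value
INSERT INTO @ColumnTable (ColumnName)
SELECT LTRIM(RTRIM(value))
FROM STRING_SPLIT(@ColumnString, N',')
WHERE LTRIM(RTRIM(value)) <> N''; -- skip empty items such as a trailing comma

SELECT ID, ColumnName
FROM @ColumnTable
ORDER BY ID;
```

One caveat: in this two-argument form the function makes no promise about the order of the rows it returns, so the IDs identify the columns but should not be relied on to mirror their position in the original string.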

**Figure 2: Code for the Multiple IQR Stored Procedure**

CREATE PROCEDURE [Calculations].[MultipleIQRSP]

@DatabaseName as nvarchar(128) = NULL, @SchemaName as nvarchar(128), @TableName as nvarchar(128),@ColumnString AS nvarchar(128), @PrimaryKeyName as nvarchar(400), @DecimalPrecision AS nvarchar(50)

AS

DECLARE @SchemaAndTableName nvarchar(400)

SET @SchemaAndTableName = ISNULL(@DatabaseName + '.', '') + @SchemaName + '.' + @TableName -- the database prefix is added only when a name is supplied

DECLARE @ColumnTable table

(ID bigint IDENTITY (1,1),

ColumnName nvarchar(128)

);

DECLARE @ResultTable table

(ID bigint IDENTITY (1,1),

Mean decimal(38,9),

Median decimal(38,9),

LowerQuartile decimal(38,9),

UpperQuartile decimal(38,9),

InterquartileRange decimal(38,9),

LowerInnerFence decimal(38,9),

UpperInnerFence decimal(38,9),

LowerOuterFence decimal(38,9),

UpperOuterFence decimal(38,9),

OutlierDegreeMax1 decimal(38,9),

OutlierDegreeMin1 decimal(38,9),

OutlierDegreeCount1 bigint,

OutlierDegreeMax2 decimal(38,9),

OutlierDegreeMin2 decimal(38,9),

OutlierDegreeCount2 bigint

);

; WITH SplitColumnNameStringCTE

(TempPatIndex,LeftString, RemainingString, StringOrder)

AS (SELECT TempPatIndex, LEFT(@ColumnString, TempPatIndex) AS LeftString, RIGHT(@ColumnString, LEN(@ColumnString) – TempPatIndex) AS RemainingString, 1 AS StringOrder

FROM (SELECT PATINDEX(‘%,%’, @ColumnString) AS TempPatIndex) AS T1

UNION ALL /* after splitting the string, send the remainder back to the PATINDEX and LEFT/RIGHT functions in the part below */

SELECT NewPatIndex, LeftString = CASE WHEN LEFT(LastString , NewPatIndex) = ” THEN RIGHT(LastString, LEN(LastString) – NewPatIndex)

WHEN LEFT(LastString , NewPatIndex) IS NULL THEN RIGHT(LastString, LEN(LastString) – NewPatIndex)

ELSE LEFT(LastString , NewPatIndex) END,

RIGHT(LastString, LEN(LastString) – NewPatIndex) AS RemainingString, StringOrder + 1

FROM ( SELECT PATINDEX(‘%,%’, RemainingString) AS NewPatIndex, RemainingString AS LastString, StringOrder

FROM SplitColumnNameStringCTE

WHERE LeftString IS NOT NULL AND LeftString != ” AND LeftString LIKE ‘%,%’ ) AS T1

)

INSERT INTO @ColumnTable

(ColumnName)

SELECT T1.SplitString AS ColumnA

FROM (SELECT TOP 99999999999 REPLACE(REPLACE(LeftString, ',', ''), ' ', '') AS SplitString, StringOrder

FROM SplitColumnNameStringCTE

ORDER BY StringOrder) AS T1

DECLARE @CurrentID bigint = 0, @MaxID bigint = 0, @CurrentColumnName nvarchar(128)

SELECT @MaxID = Max(ID) FROM @ColumnTable GROUP BY ID ORDER BY ID ASC

WHILE @CurrentID < @MaxID

BEGIN

SET @CurrentID = @CurrentID + 1 -- increment the loop

SELECT @CurrentColumnName = ColumnName

FROM @ColumnTable

WHERE ID = @CurrentID

INSERT @ResultTable

(Mean, Median, LowerQuartile, UpperQuartile, InterquartileRange, LowerInnerFence, UpperInnerFence, LowerOuterFence, UpperOuterFence, OutlierDegreeMax1,

OutlierDegreeMin1, OutlierDegreeCount1, OutlierDegreeMax2, OutlierDegreeMin2, OutlierDegreeCount2)

EXEC Calculations.InterquartileRangeSP3 @DatabaseName, @SchemaName, @TableName, @CurrentColumnName, @PrimaryKeyName, @DecimalPrecision

END

SELECT ColumnName, Mean, Median, LowerQuartile, UpperQuartile, InterquartileRange, LowerInnerFence, UpperInnerFence, LowerOuterFence, UpperOuterFence, OutlierDegreeMax1,

OutlierDegreeMin1, OutlierDegreeCount1, OutlierDegreeMax2, OutlierDegreeMin2, OutlierDegreeCount2

FROM @ColumnTable AS T1

INNER JOIN @ResultTable AS T2

ON T1.ID = T2.ID

ORDER BY T1.ID
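For readers who want to see what the quartile and fence columns above amount to, here is a minimal sketch. It is not the code of the InterquartileRangeSP3 procedure, which isn't reproduced in this post. It assumes the conventional Tukey multipliers of 1.5 and 3 for the inner and outer fences, relies on the PERCENTILE_CONT function (SQL Server 2012 and later) and runs against a made-up table variable named @Sample.

```sql
-- Hypothetical sample data; the table variable and its values are made up
DECLARE @Sample table (Value decimal(38,9));

INSERT INTO @Sample (Value)
VALUES (2), (4), (4), (5), (6), (7), (8), (9), (11), (40);

-- PERCENTILE_CONT is a window function in T-SQL, so it returns the same
-- quartiles on every row; DISTINCT collapses them into a single row
WITH QuartileCTE (LowerQuartile, UpperQuartile)
AS (SELECT DISTINCT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY Value) OVER (),
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY Value) OVER ()
    FROM @Sample)
SELECT LowerQuartile, UpperQuartile,
    UpperQuartile - LowerQuartile AS InterquartileRange,
    -- inner fences: 1.5 times the IQR beyond each quartile (assumed multiplier)
    LowerQuartile - (1.5 * (UpperQuartile - LowerQuartile)) AS LowerInnerFence,
    UpperQuartile + (1.5 * (UpperQuartile - LowerQuartile)) AS UpperInnerFence,
    -- outer fences: 3 times the IQR beyond each quartile (assumed multiplier)
    LowerQuartile - (3 * (UpperQuartile - LowerQuartile)) AS LowerOuterFence,
    UpperQuartile + (3 * (UpperQuartile - LowerQuartile)) AS UpperOuterFence
FROM QuartileCTE;
```

Values falling between the inner and outer fences would correspond to the first outlier degree and those beyond the outer fences to the second, if the procedure follows the usual convention.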

…………As discussed in last week’s tutorial, describing how to do basic SSRS tasks like adding data sources is not part of the scope of this series; there are plenty of other tutorials available on the Web which explain them better than I can. I will mention a few critical details needed for my implementations though, like the technique discussed in the last tutorial for retrieving data from stored procedures and using it in an SSRS report. One potential “gotcha” I ought to highlight is the fact that the Mean and Median are sometimes left blank in the Series Properties window in Figure 3, even when they are correctly assigned in the Chart Data setup in Figure 4, so you may have to add the values again manually.

**Figures 3 and 4: The Series Properties Window in Report Builder and the Chart Data Setup**

**Figure 5: Types of Range Charts Available in Report Builder**

…………It may be worth noting here that box plots are grouped together in Report Builder under the Range heading with Smooth Range and Range Column, which we discussed last week, as well as the Stock, Candlestick and Error Bar charts, as seen in Figure 5. The latter three are just stripped-down variants of a box plot, so there’s no sense in discussing them further unless someone can point out a real need for more detail. I’m not aware of any means of implementing violin plots, another popular variant on the box plot, through Reporting Services out-of-the-box, although it might be possible to write custom code that achieves this end. That leaves the Range Bar, which as shown in Figure 6, can be used to conveniently compare the various fence values and quartiles returned with the Interquartile Range.

**Figure 6: Range Bar Example** (click to enlarge)

…………Please note that Report Builder quits altogether whenever I try to combine a range bar with any variant of a scatter plot, so it is apparently difficult to enhance them further for the purposes of outlier detection. That means I can’t even apply the technique for combining box plots with scatter plots, as presented by Mike Davis in his excellent tutorial, “How to Make a Box Plot Chart in SQL Reporting Services 2008 SSRS.”[9] That is how I managed to get the four measly dots into Figure 1, which might be sufficient if we were only doing exploratory data mining or hypothesis testing, but is woefully inadequate if our primary goal is finding outliers. Furthermore, as Davis points out, we’re better off calculating the quartiles and other stats associated with Interquartile Range ourselves anyways, since “Reporting services does not do a good job of calculating these numbers. The best thing to do is have analysis services calculate these for you or use a stored procedure to produce them.”[10] So the optimal way to go about it is to design a stored procedure of the kind I used here, but that presents another problem which really amounts to an unnecessary complication: Reporting Services will only recognize the first result set a stored procedure feeds to it, so you can’t return summary statistics and the records they’re calculated from in the same dataset.[11] And since you can’t include them in the same dataset, they can’t be combined in the same chart. One unworkable workaround is to simply return the summary stats together with the data in a single denormalized table, but this is grossly inefficient at best and quickly becomes impractical as the number of rows increases. Another poor solution is to just send the data and then add code to recalculate the detail rows in the SSRS reports, but this forfeits all of the power and performance advantages of computing through set-based methods. 
I haven’t had a chance yet to investigate other potential workarounds like writing Custom Data Extensions, using LINQ through a web service or operating on a local report in a WPF report control, but they all seem to be so awkward as to nullify the real selling point of using Reporting Services charts to identify outliers, which is their simplicity. At present I’m leaning towards trying to implement multiple resultsets through VB.Net code embedded in SSRS reports, but I have no idea if it’s feasible. One of the most useful things Microsoft could do to improve the performance of Reporting Services in future editions of SQL Server might be to allow SSRS reports to consume multiple result sets of this kind, since it might drastically cut down the number of round-trips to the database and recalculations performed within it. Microsoft has been in the habit of artificially limiting the usefulness of its software through such oversights and arbitrary limitations for so long that it almost seems to be part of the corporate culture; as one programmer whose name escapes me once put it, the company commonly adds great features to its software and then renders them useless, which he likened to saying, “Here’s a glass of milk – with a hole in the bottom.” This particular empty glass already forced me to truncate the histogram creation procedure in the last tutorial by stripping out useful code that tested the normality of the columns using the 3-Sigma Rule [12], by checking whether or not the first standard deviation comprised 68 percent of the values and the second and third comprised 95 and 99.7 percent respectively. It was trivial to calculate these tests on a pass-fail basis, but impossible to return the results efficiently because of this senseless limitation against multiple result sets in Reporting Services. It may be worthwhile to start a Connect request for this upgrade to SSRS, if one doesn’t already exist. 
Yet even if the next version of SQL Server allowed us to consume multiple result sets in SSRS reports, that still wouldn’t make box plots much more useful when the primary goal is exposing numerous outliers. For that objective, we would probably be much better off using the full two-dimensional space available to us in scatter plots, as we’ll delve into in the next installment.
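The pass-fail normality check mentioned above can be sketched roughly as follows. This is not the code that was stripped out of the histogram procedure; it is an illustration under my own assumptions, including the made-up @Sample table variable and the hypothetical @Tolerance cutoff of five percentage points.

```sql
-- Hypothetical sample data; the table variable and its values are made up
DECLARE @Sample table (Value decimal(38,9));

INSERT INTO @Sample (Value)
VALUES (2), (4), (4), (5), (6), (7), (8), (9), (11), (40);

DECLARE @Mean float, @StDev float, @Count float;

SELECT @Mean = Avg(CAST(Value AS float)), @StDev = StDev(CAST(Value AS float)), @Count = Count(Value)
FROM @Sample;

-- @Tolerance is an arbitrary, hypothetical cutoff in percentage points
DECLARE @Tolerance float = 5;

WITH PercentageCTE (PercentWithin1StDev, PercentWithin2StDevs, PercentWithin3StDevs)
AS (SELECT
        -- percentage of rows within one, two and three standard deviations of the mean
        100 * SUM(CASE WHEN ABS(Value - @Mean) <= @StDev THEN 1 ELSE 0 END) / @Count,
        100 * SUM(CASE WHEN ABS(Value - @Mean) <= 2 * @StDev THEN 1 ELSE 0 END) / @Count,
        100 * SUM(CASE WHEN ABS(Value - @Mean) <= 3 * @StDev THEN 1 ELSE 0 END) / @Count
    FROM @Sample)
SELECT PercentWithin1StDev, PercentWithin2StDevs, PercentWithin3StDevs,
    -- compare each percentage to the 68, 95 and 99.7 figures of the rule
    CASE WHEN ABS(PercentWithin1StDev - 68) <= @Tolerance
          AND ABS(PercentWithin2StDevs - 95) <= @Tolerance
          AND ABS(PercentWithin3StDevs - 99.7) <= @Tolerance
         THEN 'Pass' ELSE 'Fail' END AS ThreeSigmaTest
FROM PercentageCTE;
```

Because the result is a single row, it would have to be returned as its own result set, which is exactly where the SSRS limitation gets in the way.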

[1] Indiana University of Pennsylvania IT Prof. James A. Rodger says that the roots of modern data mining can be found in Tukey’s exploratory analysis in the 70s. p. 178, Rodger, James A., 2003, “Utilization of Data Mining Techniques To Detect and Predict Accounting Fraud,” pp. 174-187 in Pendharkar, Parag C. ed. Managing Data Mining Technologies in Organizations: Techniques and Applications. Idea Group Publications: Hershey, Pennsylvania.

[2] See the Wikipedia page “John Tukey” at http://en.wikipedia.org/wiki/John_Tukey

[3] *IBID.*

[4]

…………“John Tukey on data analysis and behavioral science, with a fierce attack on statistical practices for sanctification, formalism, and hiding the messiness inherent in real data.”

…………“I first saw this as an unpublished manuscript as a graduate student in statistics at Stanford. It set the way for me in data analysis. When I interviewed at Princeton University for my first teaching job, John and I discussed badmandments. After circulating underground for years, John’s essay was finally published in volume III of his collected papers…”

See Tufte, Edward, undated post titled “John Tukey ‘Badmandments’ in Statistical Work, Mainly in the Behavioral Sciences” at the EdwardTufte.com website. Available online at http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0003xA

[5] Fung, Kaiser, 2010, “Eye Heart This,” published Aug 12, 2010 at the Junk Charts website. Available online at http://junkcharts.typepad.com/junk_charts/boxplot/

[6] See the Wikipedia page “Box Plot” at http://en.wikipedia.org/wiki/Box_plot

[7] I originally downloaded it from Vanderbilt University’s Department of Biostatistics.

[8] I am acquainted with two people who have it though, which is why I selected this particular biostats dataset.

[9] Davis, Mike, 2009, “How to Make a Box Plot Chart in SQL Reporting Services 2008 SSRS,” published Nov. 12, 2009 at the BIDN.com web address http://www.bidn.com/blogs/MikeDavis/ssis/163/how-to-make-a-box-plot-chart-in-sql-reporting-services-2008-ssrs Another post of interest is user3557796’s thread “SSRS Box Plot with Scatter Outliers,” published April 21, 2014 at the StackOverflow web address http://stackoverflow.com/questions/23204901/ssrs-box-plot-with-scatter-outliers

[10] *IBID.*

[11] The Internet is awash with threads about this topic. Some examples include Jes Borland’s post titled “T-SQL Tuesday #024: Reporting Services, Stored Procedures, and Multiple Result Sets” on Nov. 8, 2011 at LessThanDot, which can be accessed at http://blogs.lessthandot.com/index.php/DataMgmt/ssrs/t-sql-tuesday-024-reporting/ ; the post by the user named mou_inn in Microsoft’s ASP.Net Forum titled “Multiple Datasets from Stored Procedure in SSRS,” on Feb. 16, 2012, which is available at http://forums.asp.net/t/1770328.aspx?Multiple+Datasets+from+Stored+Procedure+in+SSRS ; and Matt Gibson’s reply on March 11, 2011 to the StackOverflow thread titled “Query That Returns Multiple Result Sets,” which is available at the web address http://stackoverflow.com/questions/5270843/query-that-returns-multiple-result-sets

[12] See the Wikipedia pages “68–95–99.7 Rule” and “Standard Deviation” at http://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule and http://en.wikipedia.org/wiki/Standard_deviation respectively.

## Outlier Detection with SQL Server, part 6.1: Visual Outlier Detection with Reporting Services

**By Steve Bolton**

…………Most of the previous articles in this series of self-tutorials on using SQL Server to find outliers required us to implement statistical formulas, in order to derive measures that required some explanation before they could be interpreted correctly. In this segment of the series, we’ll be discussing a group of outlier detection methods that require very little interpretation, because they’re often self-evident. Many of the tests outlined in previous installments of this series can of course be trussed up with some eye candy in SQL Server Reporting Services (SSRS), but would still require some knowledge of the concepts involved in order to interpret them correctly. In this segment of the series we’ll instead highlight plots and charts whose significance can be grasped intuitively, even by amateurs like myself with few technical skills; we’ll be building reports that constitute separate outlier detection methods in and of themselves, rather than merely displaying the results of some other outlier test like Z-Scores or Chauvenet’s Criterion. The strength of data visualization methods like histograms and scatter plots is their simplicity, which allows even an untrained eye to spot outliers. On the other hand, their simplicity is also their major drawback, because they lack sophistication and rigor. All of the methods highlighted in this segment amount to a sort of brute force approach to visualizing outliers, using an unconscious assessment that “if this data point looks out of place, we’ll look at it more closely.” It’s a quick but superficial strategy. For that reason it may pay to put these visualization techniques at the beginning of any outlier identification workflow, so that more rigorous and costly methods can be applied later on to the data points singled out by them for further investigation.

…………In addition, this set of visual detection methods is not immune from the downsides that have pervaded this series. Like any other outlier identification technique, charts of this kind don’t tell you *why* a record is an outlier; they merely flag them as worthy of further analysis. Moreover, as we have seen throughout this series, the sheer number of records we’re dealing with in typical modern database tables presents many challenges; we might not be constrained to the same degree as we were with statistical measures that are designed for hypothesis testing on a mere few hundred rows at best, like with Grubbs’ Test, the Modified Thompson Tau Test, Dixon’s Q-Test, Chauvenet’s Criterion, the Tietjen-Moore Test and the Generalized Extreme Studentized Deviate Test (GESD), but we still run into issues with fitting all of the data points of interest into a single chart. Various workarounds are available for this issue, such as displaying counts of distinct data points rather than each point individually, or more advanced methods like binning and banding (perhaps by fuzzy set membership), but some of them defeat the purpose by hiding the outliers from view. Furthermore, these visual methods are not insulated completely from the chicken-and-egg arguments over whether or not the aberrant values are departures from the expected distribution, or if they represent evidence that a different distribution would actually model the data better. Like an open-ended algebraic expression with too many variables, that question can only be solved by adding more evidence, in the form of goodness-of-fit tests – which, as I’ve discussed previously in this series, are omitted with shocking frequency in many fields, as many professional statisticians lament. 
Outlier detection as a whole is essentially a crude form of pattern recognition, in which we single out records that don’t fit that pattern – which begs a question that is still debated in that field of machine learning, namely what the definition of a pattern is. That is why sooner or later, it will be necessary to apply some substantial domain knowledge to defining what type of pattern we’re looking for and whether or not particular records represent aberrations from it. Visual methods like scatter plots may give quick and cheap suggestions that a particular data point that is distant from all the rest may constitute an outlier, but they tell us nothing about why it is aberrant, or even if it is indeed a departure from our expectations. For that reason, these methods ought to be considered a starting point in an outlier detection process, not its culmination.

…………For example, histograms are easy to read because they merely represent the counts of records for each distinct value in a set as a data bar, but they tell you nothing about what the counts are supposed to be. Like any other good idea, there are many variations on the theme, such as cumulative histograms in which the totals accrue until the entire count of records is reached, at the upper right corner; I won’t present an example of this subtype, however, because it is difficult to spot outliers with them. It is more common to implement some kind of binning or banding scheme to the values, which normally appear on the horizontal axis, rather than changes to the frequencies (i.e. counts) that normally appear on the vertical axis. Figure 1 presents a stored procedure that allows users to choose between three different types of HistogramType values, in which the first is a simple frequency count (which isn’t of much practical use, so I’ve omitted any examples), the second organizes the counts of distinct values by their distances from the mean and the third bins them by the number of standard deviations from the mean, as is customary. Each HistogramType value is implemented in a different common table expression (CTE) in Figure 1, which ends with logic that selects between them. The rest of the T-SQL follows essentially the same format as the rest of the code samples provided in this series, such as the use of @DecimalPrecision to allow users to avoid arithmetic overflows by setting their own precisions and scales. The other four parameters enable users to create a histogram from any table in any database for which they have requisite access, as usual, while the rest is similar to the dynamic SQL examples provided in previous posts. The simple logic is implemented in a few math functions and windowing functions which are actually quite easy to follow, once you’ve subtracted the dynamic SQL they’re enclosed in.

**Figure 1: Code for the Basic Histogram Stored Procedure**

CREATE PROCEDURE [Calculations].[HistogramBasicSP]

@DatabaseName as nvarchar(128) = NULL, @SchemaName as nvarchar(128), @TableName as nvarchar(128),@ColumnName AS nvarchar(128), @HistogramType tinyint = 1, @DecimalPrecision AS nvarchar(50)

AS

DECLARE @SQLString AS nvarchar(max), @HistogramTypeString nvarchar(max), @SchemaAndTableName nvarchar(400), @IntervalNumber bigint = 1

SET @SchemaAndTableName = ISNull(@DatabaseName + '.', '') + @SchemaName + '.' + @TableName

SET @SQLString = 'DECLARE @Min int, @Max int, @PopulationMean decimal (' + @DecimalPrecision + '), @StDev decimal(' + @DecimalPrecision + '), @StDevBucketNumber int,

@Count bigint, @PercentageMultiplier decimal (' + @DecimalPrecision + ')

SELECT @Count = Count(' + @ColumnName + '), @Min = Min(' + @ColumnName + '), @Max = Max(' + @ColumnName + '), @PopulationMean = Avg(CAST(' + @ColumnName + ' AS decimal(' + @DecimalPrecision + '))), @StDev = StDev(' + @ColumnName + ')

FROM ' + @SchemaAndTableName + '

SELECT @StDevBucketNumber = CEILING((@Max - @Min) / @StDev) -- basically asking the question, "How many StDevs can we cram in between the min and max values?"

SELECT @PercentageMultiplier = 100 / CAST(@Count AS decimal(' + @DecimalPrecision + ')) -- for calculating percentages against the entire set

;WITH FrequencyCTE

(IntervalNumber, FrequencyCount)

AS (SELECT DISTINCT TOP 99999999 IntervalNumber * ' + CAST(@IntervalNumber AS nvarchar(50)) + ' AS IntervalNumber, Count(T3.' + @ColumnName + ') OVER (PARTITION BY IntervalNumber)

FROM (SELECT DISTINCT ' + @ColumnName + ', CEILING(CAST(' + @ColumnName + ' AS decimal(6,2)) /' + CAST(@IntervalNumber AS nvarchar(50)) + ') AS IntervalNumber

FROM (SELECT DISTINCT ' + @ColumnName + '

FROM ' + @SchemaAndTableName + '

WHERE ' + @ColumnName + ' IS NOT NULL) AS T1) AS T2

INNER JOIN ' + @SchemaAndTableName + ' AS T3

ON T2.' + @ColumnName + ' = T3.' + @ColumnName + '

ORDER BY IntervalNumber),

DividedIntoStDevIntervalsCTE

(IntervalNumber, FrequencyCount, DistanceFromTheMean, StDevInterval)

AS (SELECT *, ''StDevInterval'' =

CASE WHEN DistanceFromTheMean / @StDev <= 0 THEN FLOOR(DistanceFromTheMean / @StDev)

ELSE CEILING(DistanceFromTheMean / @StDev)

END

FROM (SELECT *, CAST((@PopulationMean - IntervalNumber)

AS decimal(6,2)) * -1 AS DistanceFromTheMean

FROM FrequencyCTE) AS T1),

DistributionWithIntervalsCTE

(StDevInterval, FrequencyCount)

AS (SELECT DISTINCT StDevInterval, SUM(FrequencyCount) OVER (PARTITION BY StDevInterval)

FROM DividedIntoStDevIntervalsCTE)'

-- DECIDE WHICH TYPE OF HISTOGRAM TO RETURN

SELECT @HistogramTypeString = CASE WHEN @HistogramType = 1 THEN

'SELECT IntervalNumber, FrequencyCount

FROM FrequencyCTE'

WHEN @HistogramType = 2 THEN

'SELECT IntervalNumber, FrequencyCount, DistanceFromTheMean, StDevInterval

FROM DividedIntoStDevIntervalsCTE'

WHEN @HistogramType = 3 THEN

'SELECT StDevInterval, FrequencyCount

FROM DistributionWithIntervalsCTE '

ELSE NULL END

-- *** I could also add to this by converting the Counts to Percentages *****

SET @SQLString = @SQLString + @HistogramTypeString

EXEC (@SQLString)
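Once the dynamic SQL is subtracted, the binning behind HistogramType 3 boils down to something like the sketch below. It is a simplification rather than a line-for-line translation of Figure 1: it bins the raw values directly instead of going through IntervalNumber, uses GROUP BY in place of the windowed counts, and runs against a made-up @Sample table variable.

```sql
-- Hypothetical sample data; the table variable and its values are made up
DECLARE @Sample table (Value decimal(38,9));

INSERT INTO @Sample (Value)
VALUES (2), (4), (4), (5), (6), (7), (8), (9), (11), (40);

DECLARE @Mean float, @StDev float;

SELECT @Mean = Avg(CAST(Value AS float)), @StDev = StDev(CAST(Value AS float))
FROM @Sample
WHERE Value IS NOT NULL;

-- Same rounding rule as Figure 1: FLOOR at or below the mean, CEILING above it.
-- NULLIF guards against a divide-by-zero when every value is identical.
SELECT StDevInterval, Count(*) AS FrequencyCount
FROM (SELECT CASE WHEN (Value - @Mean) / NULLIF(@StDev, 0) <= 0
                  THEN FLOOR((Value - @Mean) / NULLIF(@StDev, 0))
                  ELSE CEILING((Value - @Mean) / NULLIF(@StDev, 0)) END AS StDevInterval
      FROM @Sample
      WHERE Value IS NOT NULL) AS T1
GROUP BY StDevInterval
ORDER BY StDevInterval;
```

Each row of the output corresponds to one bar in a chart like Figure 3, with the bars furthest from zero being the candidates for closer inspection.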

…………The procedure above was used to supply values for the four Reporting Services bar charts depicted in Figures 2 through 5, which display data from various columns from a 209-row dataset on the Duchennes form of muscular dystrophy and an 11-million row dataset on the Higgs Boson, which I downloaded from Vanderbilt University’s Department of Biostatistics and University of California at Irvine’s Machine Learning Repository respectively and converted into a database of about 7 gigabytes. The level of detail is dramatically reduced in the first two histograms, which represent different views of the same column in the DuchennesTable. The peaks and troughs that occur with interstitial outliers in Figure 2 are completely obscured in Figure 3, which limits us to looking at how frequently values fall outside the expected range – which is quite a bit in the cases of the values I’ve circled in red, since some of them are a whopping four or five standard deviations beyond the mean in a relatively small dataset. HistogramType 2 might provide more detail, but can’t be used with the far larger Higgs Boson Dataset, since we’d essentially run out of screen space to provide the counts of all the distinct values of any column compared to its mean. Note how lopsided the values are for the middle four standard deviations of the first float column in the HiggsBosonTable, which are shifted far to the left in the diagram because there are so many values occurring at exceptionally high standard deviations. As indicated by the red lines I circled them with, the values for the first standard deviation above the mean are minuscule in comparison to those for one standard deviation below it, just as the values two standard deviations above the mean vastly outnumber those that are two below it. 
This is a perfect illustration of how outlier detection methods often represent the flip side of goodness-of-fit tests, because unless the table consists of little else except outliers, then this column clearly does not fit the bell curve shape of the Gaussian i.e. “normal” distribution. Column 2 in the same dataset may fit the normal distribution though, as is self-evident from the roughly bell-curved shape of Figure 5.

**Figures 2 to 5: Four Sample Reports with Outliers for the Histogram Stored Procedure** (click to enlarge)

…………This is not a lesson in how to do basic tasks in Reporting Services, like adding a diagram to a report, which I will omit in order to avoid the kind of extraneous detail that so often clutters tutorials on other computing subjects. There are plenty of other tutorials available that can explain such tasks more competently and cleanly than I can. Yet there are a few nuances here that require explanation, including the technique used to populate these reports from the stored procedure.[1] As depicted in the three Report Builder windows in Figure 6, you have to right-click the Datasets folder in the Report Data Explorer, then select the “Use a dataset embedded in my report” button on the Query tab, then select the Stored Procedure button under Query Type and pick the right procedure out of the drop-down list that appears under “Select or enter stored procedure name.” Typically, you’ll have to copy and paste the names of the columns returned by the stored procedure into both columns displayed on the Fields tab. You may also need to copy and paste the parameter names (including the at signs) on the left column in the Parameters tab, then again enclosed in brackets on the right column, or use the function buttons to the right to add the parameters manually. It is also sometimes necessary to add the parameters again by right-clicking on the Parameters folder in the Report Data Explorer, although SQL Server sometimes autocompletes these. I’ve highlighted the difference between the two Parameters locations in the top graphic. It is important (but not all that important) to follow these directions, since we’ll be essentially following the same script to implement other procedures in the next two tutorials.

**Figure 6: Setting Up a Reporting Services Dataset to Use a Stored Procedure**

…………One of the key principles of visual outlier detection is to highlight them in some way, as I’ve done in Figure 2 by changing the label color and box border width whenever the frequency counts are more than two standard deviations beyond the mean in either direction, or when they fall or rise by more than a count of three from one value to the next. The code for this is depicted in Figure 8. You are apparently limited to a Previous function that is essentially equivalent to the T-SQL Lag function with an offset of one, since there is no “Next” function equivalent to Lead that would allow you to go in the other direction; there may be a workaround for this, since I’m familiar with SSRS but am certainly not an expert. If you need reports of this kind for outlier detection, it may pay to do the reading I have not done yet on how the human eye reacts to contrast, shading and various other stimuli in order to highlight such values. One common sense example of this principle is the use of red for highlighting. One avenue I want to investigate further is to set the normal values semi-transparent, which might provide the needed contrast in an eye-catching way. Unfortunately, I don’t think there is any clean, straightforward way of implementing it through a Transparency or Opacity property at present, although I may be wrong. As SSRS developer Graeme Malcolm lamented in one useful tutorial, “It turns out you can only set a transparency level for gauges and charts in Reporting Services – not for shapes or data regions.”[2] One workaround is to either set the Fill color to No Color in the Properties[3] or set the color to Nothing programmatically[4] when a particular value meets your conditions. You might also achieve semi-transparency by setting the bars of the normal values to colors only slightly different from the background, such as a different shade of white. 
A little imagination could go a long way when designing reports for outlier detection; you could, for example, apply gradients to shade data points based on their Z-Scores or something similarly fancy. Keep in mind, however, that there are limitations in terms of things like screen space that eventually force tradeoffs and constrain the possibilities. Once data records get numerous enough, you can adjust the widths and heights of SSRS charts and their constituent elements to accommodate them, but eventually you may run out of room entirely and be forced into strategies like binning, as we were in the case of the 11-million-row HiggsBosonTable.
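Since the report side only seems to offer Previous, one option is to compute the comparison in both directions on the database side and hand the report a ready-made flag. The following is only a sketch under my own assumptions: the @Histogram table variable, its values and the HighlightFlag column name are all made up, and the threshold of three simply mirrors the rule described above. LAG and LEAD require SQL Server 2012 or later.

```sql
-- Hypothetical histogram output; the table variable and its values are made up
DECLARE @Histogram table (IntervalNumber int PRIMARY KEY, FrequencyCount bigint);

INSERT INTO @Histogram (IntervalNumber, FrequencyCount)
VALUES (1, 4), (2, 5), (3, 12), (4, 6), (5, 5), (6, 1);

SELECT IntervalNumber, FrequencyCount,
    -- change from the preceding bar, equivalent to what Previous offers in SSRS
    FrequencyCount - LAG(FrequencyCount, 1) OVER (ORDER BY IntervalNumber) AS ChangeFromPrevious,
    -- change to the following bar, which has no direct SSRS counterpart
    LEAD(FrequencyCount, 1) OVER (ORDER BY IntervalNumber) - FrequencyCount AS ChangeToNext,
    -- flag any bar that rises or falls by more than three in either direction;
    -- the first and last bars compare against NULL and so only test one side
    CASE WHEN ABS(FrequencyCount - LAG(FrequencyCount, 1) OVER (ORDER BY IntervalNumber)) > 3
           OR ABS(LEAD(FrequencyCount, 1) OVER (ORDER BY IntervalNumber) - FrequencyCount) > 3
         THEN 1 ELSE 0 END AS HighlightFlag
FROM @Histogram
ORDER BY IntervalNumber;
```

The report could then base its label color or border width on the flag column instead of recomputing the differences in an expression.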

**Figures 7 and 8: Setup for the Hybrid Line Chart and Associated Outlier Highlighting**

…………All four sample reports include a line chart that traces the peaks of the bar graph, thereby adding more information in an easily comprehensible and efficient manner. I followed the directions for creating a hybrid visualization of this kind in the “Column and Line Hybrid Charts” section of Robert M. Bruckner’s classic TechNet white paper, “Get More Out of SQL Server Reporting Services Charts.”[5] The Cliff’s Notes version of the implementation is to create a second data series, then right-click it in the Chart Data window and select Change Chart Type. That is why we have one series for IntervalNumber and another for FrequencyCount in the sample Chart Data window depicted in Figure 7, which was used in the four sample reports.

…………The directions for building and interpreting a line chart of this kind are trivial, so I’ll dispense with them. Once we know how to create one, however, it is child’s play to adapt it for use as an alternative form of outlier detection known as a run sequence plot or run chart. For the most part, what we’re looking for is the same kind of peaks and troughs evident in the hybrid charts we’ve already seen, except without the accompanying histograms. It is actually a more limited identification method, however, because we’re comparing values vs. time slices rather than values against their counts, which doesn’t necessarily tell us anything about how frequently a value should occur. A line chart for department store sales, for example, might see a big jump in sales before Christmas and a commensurate slump afterwards, but domain knowledge would tell us that the associated peaks and troughs represent the expected distribution, not actual outliers. More subtle relationships might be discovered through Time Series algorithms like the ones I covered in A Rickety Stairway to SQL Server Data Mining, Algorithm 9: Time Series, which are often visualized through run sequence plots of the same kind.

**Figure 9: Code for the Simple Run Sequence Plot Stored Procedure**

CREATE PROCEDURE [Calculations].[RunSequencePlotSP]

@DatabaseName as nvarchar(128) = NULL, @SchemaName as nvarchar(128), @TableName as nvarchar(128),@ColumnName AS nvarchar(128), @TimeColumnName AS nvarchar(128), @DatePeriodType as nvarchar(400), @DecimalPrecision AS nvarchar(50)

AS

DECLARE @SQLString AS nvarchar(max), @SchemaAndTableName nvarchar(400), @DatePeriodTypeString nvarchar(400)

SET @SchemaAndTableName = ISNull(@DatabaseName + '.', '') + @SchemaName + '.' + @TableName

SELECT @DatePeriodTypeString = CASE WHEN @DatePeriodType = '' THEN @TimeColumnName

WHEN @DatePeriodType = 'Year' THEN 'Year(' + @TimeColumnName + ')'

WHEN @DatePeriodType = 'Month' THEN 'Month(' + @TimeColumnName + ')'

WHEN @DatePeriodType = 'Day' THEN 'Day(' + @TimeColumnName + ')'

WHEN @DatePeriodType = 'Week' THEN 'Datepart(week, ' + @TimeColumnName + ')'

ELSE NULL END

SET @SQLString = 'SELECT DISTINCT ' + @DatePeriodTypeString + ' AS TimePeriod, Avg(CAST(' + @ColumnName + ' AS decimal(' + @DecimalPrecision + '))) OVER (PARTITION BY ' + @DatePeriodTypeString + ' ORDER BY ' + @DatePeriodTypeString + ') AS Mean,

Min(CAST(' + @ColumnName + ' AS decimal(' + @DecimalPrecision + '))) OVER (PARTITION BY ' + @DatePeriodTypeString + ' ORDER BY ' + @DatePeriodTypeString + ') AS Minimum, Max(CAST(' + @ColumnName + ' AS decimal(' + @DecimalPrecision + '))) OVER (PARTITION BY ' + @DatePeriodTypeString + ' ORDER BY ' + @DatePeriodTypeString + ') AS Maximum, StDev(CAST(' + @ColumnName + ' AS decimal(' + @DecimalPrecision + '))) OVER (PARTITION BY ' + @DatePeriodTypeString + ' ORDER BY ' + @DatePeriodTypeString + ') AS StandardDeviation, Var(CAST(' + @ColumnName

+ ' AS decimal(' + @DecimalPrecision + '))) OVER (PARTITION BY ' + @DatePeriodTypeString + ' ORDER BY ' + @DatePeriodTypeString + ') AS Variance

FROM ' + @SchemaAndTableName + '

WHERE ' + @ColumnName + ' IS NOT NULL

ORDER BY ' + @DatePeriodTypeString + ''

EXEC (@SQLString)

…………Run sequence plots are not only simple to interpret but a breeze to program. The procedure in Figure 9 may be the shortest I’ve posted to date in this series and can be followed effortlessly, if you’re familiar with T-SQL and have seen the sample code I’ve posted previously. All it does is return a handful of aggregates for a particular column (the mean, minimum, maximum, standard deviation and variance), partitioned by another column that typically represents a time slice. It can of course be easily extended to handle other DatePart values like quarter, dayofyear, weekday, hour, minute, second, millisecond, microsecond, nanosecond, etc. – whatever your use cases call for. I’ve thrown in the kitchen sink as far as built-in aggregates go, but the ones you don’t need can of course be removed with a few backspaces. The @DatePeriodType parameter accepts the values Year, Month, Day and Week to partition the time column by particular intervals, or can be left blank to return all time periods. Note that it may be necessary to use the @DecimalPrecision parameter to avoid missing values in the chart, not merely to prevent arithmetic overflows; for whatever unfathomable reason, SSRS sometimes returned blank charts with no warnings or error messages unless I ratcheted the precision and scale down a little. For example, I was using 38,29 most of the time and my means wouldn’t show up, nor my mins and maxes, but changing it to 38,27 brought them back.
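To make the output of Figure 9 easier to picture, here is a static sketch of what the dynamic SQL should produce when @DatePeriodType is set to Year. The @Sample table variable, its ReadingDate column and all of its values are made up for illustration, and I have swapped the DISTINCT plus windowed aggregates for a plain GROUP BY, which ought to return the same rows since each window spans its whole partition.

```sql
-- Hypothetical sample data; the table variable, column names and values are made up
DECLARE @Sample table (ReadingDate date, Value decimal(38,9));

INSERT INTO @Sample (ReadingDate, Value)
VALUES ('20120115', 10), ('20120620', 12), ('20121103', 11),
       ('20130210', 14), ('20130815', 95), ('20131201', 13),
       ('20140305', 12), ('20140918', 15);

-- One row per time slice, with the same aggregates the procedure returns
SELECT Year(ReadingDate) AS TimePeriod,
    Avg(Value) AS Mean,
    Min(Value) AS Minimum,
    Max(Value) AS Maximum,
    StDev(Value) AS StandardDeviation,
    Var(Value) AS Variance
FROM @Sample
WHERE Value IS NOT NULL
GROUP BY Year(ReadingDate)
ORDER BY TimePeriod;
```

In a chart built on this result, the Mean column would feed the line while Minimum and Maximum would feed a range series, so a slice with one extreme reading stands out in both.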

**Figures 10 and 11: Simple Run Sequence Plot Example with Outliers** (click to enlarge)

…………Run Sequence plots are frequently used to compare the Time Series values of multiple columns for exploratory data mining purposes, but I’ve only included a single column in the example in Figure 10 because our purpose is outlier detection. I’ve adapted it to that purpose by adding a Smooth Range Column visualization that displays the minimum and maximum values for the same column at the same time slices as the line chart, which displays the mean. That gives us a more accurate picture of the variability of the values over time without adding much difficulty to the interpretation or wasting white space that could be providing more information. For example, the sample reports above show not only a few obvious outliers on the line chart for the CreatineKinase and Hemopexin columns of the DuchennesTable, but a few extreme peaks and troughs in the min and max values.[6] This may not be a typical use for run sequence plots, but it does make them more useful for outlier detection purposes. In the next article I’ll delve into other chart types like box plots that are grouped together with the Range Column and Smooth Range Column in Report Builder, but which are unfortunately hampered by constraints that make them much less useful in detecting outliers in large datasets.

[1] Please note that I consulted several webpages a while back in order to learn how to do this, but can’t find my original citations in order to give proper credit.

[2] Malcolm, Graeme, 2010, “Using a Transparent Background in Reporting Services,” published Oct. 13, 2010 at the Content Master website. Available online at http://cm-bloggers.blogspot.com/2010/10/using-transparent-background-in.html.

[3] I got this idea from Yilmaz, Eralper, 2014, “Building Transparent Stacked Column Charts Using MS SQL Server Reporting Services 2008,” published at the Kodyaz Development Resources website and available at the web address http://www.kodyaz.com/articles/transparent-stacked-column-in-reporting-services.aspx

[4] I took this idea from the reply by the user name i_h in the thread “’Transparent’ is not a Valid BackgroundColor?” published Jan. 13, 2012 in the MSDN forums. Available at the web address https://social.msdn.microsoft.com/Forums/en-US/81ec0049-f007-4619-b8df-ee74ec2b9ec1/transparent-is-not-a-valid-backgroundcolor?forum=sqlreportingservices

[5] Bruckner, Robert M., 2006, “Get More Out of SQL Server Reporting Services Charts,” a white paper published by TechNet in September 2006. Available online at http://technet.microsoft.com/en-us/library/aa964128(v=sql.90).aspx

[6] These results should be taken with a grain of salt, however, because I have had trouble accurately converting the SampleDate column in the original Duchennes dataset to a SQL Server date data type.

## Outlier Detection with SQL Server, part 5: Interquartile Range

**By Steve Bolton**

…………The last seven articles in this series of mistutorials on identifying outlying values in SQL Server databases were clunkers, in the sense that the methods had many properties in common that made them inapplicable to the scenarios DBAs typically need them for. Chauvenet’s Criterion, Peirce’s Criterion, the Tietjen-Moore Test, the Generalized Extreme Studentized Deviate Test (GESD), Grubbs’ Test, the Modified Thompson-Tau Test and Dixon’s Q-Test are well-suited to the uses they were designed for, like hypothesis testing, but are difficult to apply to common SQL Server use cases like finding data quality problems in tables of several million rows or doing exploratory data mining. Most of them require prior goodness-of-fit testing to verify that the underlying data follows a Gaussian “normal” distribution, i.e. a bell curve, without which they are invalid; many of the lookup tables they depend on are widely available but stop at just a few hundred rows at best, while calculating the missing lookup values for millions of cases can be exceptionally costly. Toss in other drawbacks of hypothesis testing that are often unstated these days (like the use of arbitrary confidence levels and misconceptions about probabilistic reasoning, which statisticians themselves raise frequently in their literature) and it appears that for most scenarios, DBAs would be better off sticking with the methods we kicked off the series with, Z-Scores and Benford’s Law. I’m only writing about these topics as an amateur, but the inapplicability of so many standard outlier identification methods to larger datasets makes me wonder if the age of “Big Data”[1] doesn’t call for the devising of new means of detection. Thankfully, however, we haven’t by any means exhausted the means already available to us in the common statistical literature, without having to delve into research papers and academic journals and that sort of thing.
I haven’t yet had a chance to discuss Interquartile Range because I’m trying to group the detection methods by the properties they have in common, but this particular one has little overlap with any of the others we’ve surveyed to date. It nevertheless performs relatively well and is applicable to a much wider set of use cases than any other means we’ve discussed since finishing up Z-Score a couple of months ago.

…………Interquartile Range has apparently been in use for so long and is so pervasive in academic research that the story of its origin is difficult to find in a cursory search, unlike the colorful histories of some of the lesser-known methods discussed in recent posts. In-depth research of this kind wasn’t really necessary for this week’s article because the calculations and concepts are easier than anything we’ve discussed to date.[2] The idea is fairly simple: instead of calculating a single center point for the whole dataset, we establish two boundaries known as the lower and upper quartiles encompassing the middle half of the values, so named because they are a quarter of the way (25 percent and 75 percent) from the edges of the dataset. The Interquartile Range is just another single measure of how dispersed data is around the center of the dataset, like the more familiar standard deviation and variance, except that it is less sensitive to outlying values (i.e., it is more “robust”). Computing it is trivial once we’ve got the lower and upper quartiles, since all we have to do is subtract the former from the latter. Interquartile Range is apparently useful for other applications such as goodness-of-fit testing, but when used to find those aberrant data points we call outliers, it is usually accompanied by calculations of inner and outer fences. The inner fences are established by subtracting 1.5 times the Interquartile Range from the lower quartile and adding the same amount to the upper quartile; the outer fences are calculated the same way, except with 3 times the Interquartile Range. Using this test, any values falling outside these four “fences” are defined as outliers, with those beyond the outer fences being more extreme than those that merely cross the inner ones. The math in Figure 1 looks a lot more complicated than it really is, when all we’re really doing is a few modulos and simple divisions to get the lower and upper quartiles, then some simple subtraction and multiplication to establish the fence values.
The most difficult part of the T-SQL code is probably the common table expression (CTE), which is trivial compared to some of the more difficult nested subqueries, UNPIVOT operations and windowing functions used in other recent tutorials.
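For readers who just want to see the fence logic in isolation, the following is a minimal sketch on a dozen made-up values. Note the assumptions: the table variable and its contents are hypothetical, and I’ve swapped in the built-in PERCENTILE_CONT function (available from SQL Server 2012 onwards) for the quartiles, which interpolates slightly differently than the (Count + 1) / 4 positions used in Figure 1, so the two approaches can return somewhat different quartile values on the same data.

```sql
-- Hypothetical sample data; substitute your own table and column.
DECLARE @Sample TABLE (ID int IDENTITY(1,1) PRIMARY KEY, Value decimal(18,4) NOT NULL);

INSERT INTO @Sample (Value)
VALUES (2), (4), (4), (5), (6), (7), (8), (9), (10), (12), (13), (40);

WITH Quartiles AS
(   -- PERCENTILE_CONT is an analytic function, so DISTINCT collapses the
    -- repeated values down to a single row
    SELECT DISTINCT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY Value) OVER () AS LowerQuartile,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY Value) OVER () AS UpperQuartile
    FROM @Sample
),
Fences AS
(   -- inner fences sit 1.5 IQRs beyond the quartiles, outer fences 3 IQRs
    SELECT LowerQuartile, UpperQuartile,
        UpperQuartile - LowerQuartile AS InterquartileRange,
        LowerQuartile - 1.5 * (UpperQuartile - LowerQuartile) AS LowerInnerFence,
        UpperQuartile + 1.5 * (UpperQuartile - LowerQuartile) AS UpperInnerFence,
        LowerQuartile - 3 * (UpperQuartile - LowerQuartile) AS LowerOuterFence,
        UpperQuartile + 3 * (UpperQuartile - LowerQuartile) AS UpperOuterFence
    FROM Quartiles
)
SELECT S.ID, S.Value, F.InterquartileRange,
    CASE WHEN S.Value < F.LowerOuterFence OR S.Value > F.UpperOuterFence THEN 2
         WHEN S.Value < F.LowerInnerFence OR S.Value > F.UpperInnerFence THEN 1
         ELSE 0 END AS OutlierDegree
FROM @Sample AS S
CROSS JOIN Fences AS F
ORDER BY OutlierDegree DESC, S.Value DESC;
```

The CASE expression mirrors the OutlierDegree scale in the full procedure: 2 for values beyond an outer fence, 1 for values between an inner and an outer fence, and 0 for everything else.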

**Figure 1: Code for the Interquartile Range Procedure**

CREATE PROCEDURE [Calculations].[InterquartileRangeSP]

@DatabaseName as nvarchar(128) = NULL, @SchemaName as nvarchar(128), @TableName as nvarchar(128),@ColumnName AS nvarchar(128), @PrimaryKeyName as nvarchar(400), @OrderByCode as tinyint = 1, @DecimalPrecision AS nvarchar(50)

AS

SET @DatabaseName = @DatabaseName + '.'

DECLARE @SchemaAndTableName nvarchar(400)

SET @SchemaAndTableName = ISNull(@DatabaseName, '') + @SchemaName + '.' + @TableName

DECLARE @SQLString nvarchar(max)

SET @SQLString = 'DECLARE @OrderByCode tinyint,

@Count bigint,

@LowerPoint bigint,

@UpperPoint bigint,

@LowerRemainder decimal(38,37), -- use the maximum precision and scale for these two variables to make the procedure flexible enough to handle large datasets; I suppose I could use a float

@UpperRemainder decimal(38,37),

@LowerQuartile decimal(' + @DecimalPrecision + '),

@UpperQuartile decimal(' + @DecimalPrecision + '),

@InterquartileRange decimal(' + @DecimalPrecision + '),

@LowerInnerFence decimal(' + @DecimalPrecision + '),

@UpperInnerFence decimal(' + @DecimalPrecision + '),

@LowerOuterFence decimal(' + @DecimalPrecision + '),

@UpperOuterFence decimal(' + @DecimalPrecision + ')

SET @OrderByCode = ' + CAST(@OrderByCode AS nvarchar(50)) + ' SELECT @Count=Count(' + @ColumnName + ')

FROM ' + @SchemaAndTableName +

' WHERE ' + @ColumnName + ' IS NOT NULL

SELECT @LowerPoint = (@Count + 1) / 4, @LowerRemainder = ((CAST(@Count AS decimal(' + @DecimalPrecision + ')) + 1) % 4) /4,

@UpperPoint = ((@Count + 1) *3) / 4, @UpperRemainder = (((CAST(@Count AS decimal(' + @DecimalPrecision + ')) + 1) *3) % 4) / 4; -- multiply by 3 for the left side on the upper point to get 75 percent

WITH TempCTE

(' + @PrimaryKeyName + ', RN, ' + @ColumnName + ')

AS (SELECT ' + @PrimaryKeyName + ', ROW_NUMBER() OVER (PARTITION BY 1 ORDER BY ' + @ColumnName + ' ASC) AS RN, ' + @ColumnName + '

FROM ' + @SchemaAndTableName + '

WHERE ' + @ColumnName + ' IS NOT NULL),

TempCTE2 (QuartileValue)

AS (SELECT TOP 1 ' + @ColumnName + ' + ((Lead(' + @ColumnName + ', 1) OVER (ORDER BY ' + @ColumnName + ') - ' + @ColumnName + ') * @LowerRemainder) AS QuartileValue

FROM TempCTE

WHERE RN BETWEEN @LowerPoint AND @LowerPoint + 1

UNION

SELECT TOP 1 ' + @ColumnName + ' + ((Lead(' + @ColumnName + ', 1) OVER (ORDER BY ' + @ColumnName + ') - ' + @ColumnName + ') * @UpperRemainder) AS QuartileValue

FROM TempCTE

WHERE RN BETWEEN @UpperPoint AND @UpperPoint + 1)

SELECT @LowerQuartile = (SELECT TOP 1 QuartileValue

FROM TempCTE2 ORDER BY QuartileValue ASC), @UpperQuartile = (SELECT TOP 1 QuartileValue

FROM TempCTE2 ORDER BY QuartileValue DESC)

SELECT @InterquartileRange = @UpperQuartile - @LowerQuartile

SELECT @LowerInnerFence = @LowerQuartile - (1.5 * @InterquartileRange), @UpperInnerFence = @UpperQuartile + (1.5 * @InterquartileRange), @LowerOuterFence = @LowerQuartile - (3 * @InterquartileRange), @UpperOuterFence = @UpperQuartile + (3 * @InterquartileRange)

--SELECT @LowerPoint AS LowerPoint, @LowerRemainder AS LowerRemainder, @UpperPoint AS UpperPoint, @UpperRemainder AS UpperRemainder

-- uncomment this line to debug the inner calculations

SELECT @LowerQuartile AS LowerQuartile, @UpperQuartile AS UpperQuartile, @InterquartileRange AS InterQuartileRange,@LowerInnerFence AS LowerInnerFence, @UpperInnerFence AS UpperInnerFence,@LowerOuterFence AS LowerOuterFence, @UpperOuterFence AS UpperOuterFence

SELECT ' + @PrimaryKeyName + ', ' + @ColumnName + ', OutlierDegree

FROM (SELECT ' + @PrimaryKeyName + ', ' + @ColumnName + ',

''OutlierDegree'' = CASE WHEN (' + @ColumnName + ' < @LowerInnerFence AND ' + @ColumnName + ' >= @LowerOuterFence) OR (' +

@ColumnName + ' > @UpperInnerFence

AND ' + @ColumnName + ' <= @UpperOuterFence) THEN 1

WHEN ' + @ColumnName + ' < @LowerOuterFence OR ' + @ColumnName + ' > @UpperOuterFence THEN 2

ELSE 0 END

FROM ' + @SchemaAndTableName + '

WHERE ' + @ColumnName + ' IS NOT NULL) AS T1

ORDER BY CASE WHEN @OrderByCode = 1 THEN ' + @PrimaryKeyName + ' END ASC,

CASE WHEN @OrderByCode = 2 THEN ' + @PrimaryKeyName + ' END DESC,

CASE WHEN @OrderByCode = 3 THEN ' + @ColumnName + ' END ASC,

CASE WHEN @OrderByCode = 4 THEN ' + @ColumnName + ' END DESC,

CASE WHEN @OrderByCode = 5 THEN OutlierDegree END ASC,

CASE WHEN @OrderByCode = 6 THEN OutlierDegree END DESC'

--SELECT @SQLString -- uncomment this to debug string errors

EXEC (@SQLString)

…………The code in Figure 1 basically follows the same format as that of other procedures I’ve posted in this series, for simplicity’s sake. The first five parameters allow users to test any column in any database they have access to, while @DecimalPrecision enables them to avoid arithmetic overflows by manually setting a precision and scale appropriate to the column they’ve selected. As usual, the procedure is created in a Calculations schema; there are no brackets to handle spaces in object names, nor is there any validation or SQL injection protection. As with past procedures, uncommenting the next-to-last line allows users to debug the dynamic SQL; I also provided a second debugging point of the same kind midway through the procedure, for testing the inner calculations. The @OrderByCode I’ve used in previous tutorials also returns, with the same values as usual: values #1 and #2 allow users to order by the primary key ascending and descending, while #3 and #4 do the same for the ColumnName and #5 and #6 order the results by the OutlierDegree column. As depicted below, the OutlierDegree column allows for a range of values depending on how much a particular data point deviates from the norm, not merely a Boolean yes-no flag like we’ve seen in many hypothesis-testing based methods. Note that the results also include the Interquartile Range, fence values and quartiles used to test each data point.
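If you want to add the bracket handling and validation that I left out, one possible approach is sketched below. This is not part of the procedure in Figure 1; it simply illustrates how the built-in QUOTENAME function and the INFORMATION_SCHEMA.COLUMNS view might be used to harden dynamic SQL of this kind. The variable values are borrowed from the sample call in this article purely for illustration, and the check assumes the query runs in the database that holds the table.

```sql
-- Illustrative values only; in a procedure these would be parameters.
DECLARE @SchemaName sysname = N'Health',
        @TableName  sysname = N'DuchennesTable',
        @ColumnName sysname = N'PyruvateKinase',
        @SQLString  nvarchar(max);

-- Validation: refuse to build the string unless the column really exists
IF NOT EXISTS (SELECT 1
               FROM INFORMATION_SCHEMA.COLUMNS
               WHERE TABLE_SCHEMA = @SchemaName
                 AND TABLE_NAME = @TableName
                 AND COLUMN_NAME = @ColumnName)
BEGIN
    RAISERROR(N'The specified schema, table or column does not exist.', 16, 1);
    RETURN;
END

-- QUOTENAME wraps each identifier in brackets and escapes any embedded
-- closing bracket, which handles spaces in names and blunts injection attempts
SET @SQLString = N'SELECT COUNT(' + QUOTENAME(@ColumnName) + N') AS NonNullCount'
               + N' FROM ' + QUOTENAME(@SchemaName) + N'.' + QUOTENAME(@TableName)
               + N' WHERE ' + QUOTENAME(@ColumnName) + N' IS NOT NULL';

EXEC sp_executesql @SQLString;
```

The @DecimalPrecision string would need its own check, since it is concatenated into the statement as well and cannot be passed through QUOTENAME.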

**Figure 2: Results for the Interquartile Range Procedure on the PyruvateKinase Column**

EXEC [Calculations].[InterquartileRangeSP]

@DatabaseName = N'DataMiningProjects',

@SchemaName = N'Health',

@TableName = N'DuchennesTable',

@ColumnName = N'PyruvateKinase',

@PrimaryKeyName = N'ID',

@OrderByCode = 6,

@DecimalPrecision = N'38,21'

…………The test in Figure 2 was performed on the Pyruvate Kinase column of a 209-row dataset on the Duchennes form of muscular dystrophy, which I downloaded from Vanderbilt University’s Department of Biostatistics and converted to a SQL Server table. For the sake of consistency, I’ve stress-tested outlier detection methods that might have performance issues on the first float column of the Higgs Boson Dataset, which is made publicly available by the University of California at Irvine’s Machine Learning Repository and now occupies almost 6 gigabytes of my practice database. On average, the procedure took about 18-and-a-half to 19 minutes to run against the 11 million rows of that dataset on my poor beat-up semblance of a development machine, as compared to about 3 minutes for the Z-Score procedure I posted earlier in the series. The addition of a non-clustered index on the column improved things a little, as long as I included the clustered primary key, but the execution plans were still too large to post; suffice it to say that they consisted mainly of parallel non-clustered index Seeks with some expensive Sorts, plus a lot of Spools that had inconsequential costs. Perhaps a columnstore index would help things, but I’ve been unable to upgrade yet to SQL Server 2014, where many of the restrictions that once hobbled the feature have been relaxed.

…………Since Z-Scores perform better and can be used in conjunction with any outlier threshold that end users choose, they remain enthroned as the most widely applicable detection method we’ve yet discussed. Nonetheless, Interquartile Range is more likely to be of use to DBAs than the hypothesis testing-based means that took up the middle segment of this series. The calculations are simple to code and perform, the concepts aren’t that hard to explain to database users and perhaps best of all, we’re not limited to using just a Gaussian bell curve. That also means we don’t have to do preliminary goodness-of-fit testing, which is so often omitted by careless researchers. One of the Wikipedia articles I found the formula at mentions it being used in conjunction with Cauchy and Laplace distributions, although not necessarily in its capacity as an outlier detector.[3] It can even be adapted for double duty as a goodness-of-fit test. In and of itself, it constitutes an alternate measure of dispersion that can be used in place of standard deviation and variance. The scenarios in which such a substitution would prove useful include ones where a measure less likely to be altered by outlying values is called for. The same property might make it more appropriate than Z-Scores when there is a real need for a more conservative test for outliers. Another plus in its favor is the fact that it also measures the degree of membership in the set of outliers on a scale, rather than merely flagging it as many hypothesis-testing methods do; furthermore, those methods have numerous other restrictions on the number and types of inputs, outputs and calculation methods that make them unsuitable for most SQL Server tasks, like recursive deletion with Chauvenet’s Criterion and the inability of Dixon’s Q-Test to identify more than one outlier per dataset. Moreover, the fence and quartile values are trivial to return once they’ve been calculated and constitute global measures in their own right.

…………I have yet to try Cook’s Distance and Mahalanobis Distance, but I have high hopes that they too will prove to be useful additions to the toolbelts of SQL Server data miners and DBAs. I hope to use both as a springboard into a much longer and more difficult, albeit worthwhile, series a few months down the line, Information Measurement with SQL Server. Before delving into the difficult math that underpins distance-based metrics of that kind, however, I will give a brief overview of how to use Reporting Services to find outliers the easy way: by the naked eye. Finding outliers is not always that straightforward, but in many cases all we need to do is spot them in a histogram or scatter plot, where they sometimes stand out like sore thumbs. They are sometimes also glaringly obvious in the diagrams produced by the Clustering algorithm in SSDM, which I may give a quick refresher on, based on my last tutorial series, A Rickety Stairway to SQL Server Data Mining. As we will see, scaling up visual detection methods of this kind to meet the size of SQL Server databases is the primary challenge, just as their size stretches beyond the ordinary bounds of hypothesis testing. The pervasiveness of the size issue makes me wonder, once again, if it might not be worthwhile to devise new scalable methods of outlier detection to complement the ones already in common use today.[4]

[1] This buzzword is a lot like the overused term “globalization.” Unlike with statistics and data mining, I have real expertise in foreign policy history, and can say definitively that globalization has been going on for millennia; the only difference is that it has accelerated in recent decades. Likewise, the amount of Data the human race has to process is always getting Bigger; it’s just getting Bigger at a faster pace these days.

[2] I retrieved the formulas from the most convenient sources, the Wikipedia pages “Outlier” and “Interquartile Range” at http://en.wikipedia.org/wiki/Outlier and http://en.wikipedia.org/wiki/Interquartile_range respectively. I also tested the procedure against some of the examples provided there. Also see National Institute of Standards and Technology, 2014, “7.1.6. What are Outliers in the Data?” published in the online edition of the Engineering Statistics Handbook. Available online at the web address http://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm

[3] *IBID.*

[4] While writing this series I encountered an interesting suggestion by Will G. Hopkins, the writer of one of the best plain English explanations of statistics available online today: “Here’s something challenging for the real lovers of numbers. The mean ± SD encloses 68% of the data on average for a normally distributed variable. So if you want to use a percentile range that corresponds to the mean ± SD, what should it be? Answer: 16th-84th. If I had my way, this measure would replace the interquartile range. We could call it the standard percentile range…” I could write code for this easily, but didn’t bother because the series already features too many outlier identification methods that are dependent on a normal distribution. Nor would it necessarily do anything to help with the scaling problem I mentioned above. It does illustrate, however, how we’re not limited to just using tried and true measures and can devise new ones as needed, if they are appropriate to the contexts at hand. See Hopkins, Will G. 2013, “Percentile Ranges,” published at the website A New View of Statistics and available at the web address http://www.sportsci.org/resource/stats/percentile.html

## Outlier Detection with SQL Server, part 4: Peirce’s Criterion

**By Steve Bolton**

…………In the last couple of installments of this amateur series of self-tutorials on outlier identification with SQL Server, we dealt with detection methods that required recursive recomputation of the underlying aggregates. This week’s topic, Peirce’s Criterion, also flags outliers in an iterative manner, but doesn’t require the same sliding window to continually recalculate the mean and standard deviation as Chauvenet’s Criterion and the Modified Thompson Tau Test do. Like these analogous methods, Peirce’s Criterion can be made useful to DBAs by using it to merely flag potential outliers and performing new computations *as if* they had been removed, rather than deleting them without adequate further investigation, as sometimes occurs with the other two. While writing this series, I’ve slowly come to the realization that the statistical formulas underlying many of these methods can be swapped in and out almost like the modularized parts of a car engine, radio kit or DIY computer; for example, Chauvenet’s Criterion and the Modified Thompson Tau test leaven standard hypothesis testing methods with comparisons to Z-Scores, with the former merely substituting thresholds based on a Gaussian normal distribution (i.e. a bell curve) rather than the Student’s T-distribution used in the latter. Peirce’s Criterion is also recursive, but uses the R-values produced by Pearson Product Moment Correlation calculations as thresholds for its Z-Scores. I originally had high hopes for Peirce’s Criterion because those correlation coefficients are easy to calculate on entire databases, but it turns out that lookup tables are required for the R-Values. These are even shorter and more difficult to find on the Internet than the Gaussian and T-distribution lookup tables required for some of the outlier detection methods based on hypothesis testing, which were covered in the last six posts.
For that reason, I found it more difficult than usual to validate my T-SQL samples, so be cautious when implementing the code below. Furthermore, the Criterion is burdened with the same requirement for prior goodness-of-fit testing, to prove that the underlying data follows a bell curve.

…………It is not surprising that the Criterion carries so many restrictions, given that it is one of the first outlier detection methods ever devised. The algorithm that mathematician Benjamin Peirce (1809-1880)[1] introduced in an 1852 paper in the Astronomical Journal is indeed difficult to follow and implement, even for those with far more experience than myself. Programmers have apparently had some success recently in coding the underlying math in R and Python[2], but my solution is based on the more accessible version published in 2003 in the Journal of Engineering Technology by Stephen Ross, a professor of mechanical engineering at the University of New Haven.[3] The DDL in Figure 1 can be used to import the table on pages 10 to 12, which translates into 540 R-values for up to nine potential outliers and a maximum of 60 records. Denormalizing it into a single interleaved lookup table allows us to access the values in a more legible way with a single join, rather than the double join that would be required with two normalized tables. I altered the algorithm Ross provides to do all of the comparisons in a single iteration, since it is trivial in a set-based language like T-SQL to simply check the nine highest absolute deviations against the corresponding R-Values in one fell swoop. The T3 subquery in Figure 2 simply looks up all nine R-Values for the count of all the records in the dataset, then calculates the MaximumAllowableDeviation values by multiplying them by the standard deviation of the entire dataset. The T2 subquery merely calculates the nine highest absolute deviations in the dataset (using basically the same logic as that found in Z-Scores) and joins them to the NumberOfOutliers of the same rank. If the absolute deviation is higher than the maximum allowable deviation, the record is flagged as an outlier.
The rest of the code follows the same format as that of other procedures posted in this series; the first five parameters allow you to select any column in any database to which you have access and @DecimalPrecision enables users to avoid arithmetic overflows. The rest is all dynamic SQL, with the customary debugging comment line above the EXEC. To avoid cluttering the code, I didn’t supply the brackets needed to accommodate spaces in object names – which I don’t allow in my own code – or validation logic, or SQL injection protection. As always, the procedure and lookup table are implemented in a Calculations schema that can be easily changed. Since we can get a bird’s-eye view of all nine rows, as depicted in Figure 3, there’s no reason to incorporate the @OrderByCode parameter used in past tutorials.
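The single-pass comparison described above can be restated compactly. The notation here is my own shorthand rather than anything taken from Ross’s paper: $\bar{x}$ is the mean, $s$ the standard deviation, $N$ the record count, $x_{(k)}$ the value with the $k$-th largest absolute deviation and $R(N, k)$ the lookup value for $k$ suspected outliers. A record is flagged when

```latex
\left| x_{(k)} - \bar{x} \right| > R(N, k) \cdot s, \qquad k = 1, \dots, 9
```

The right-hand side corresponds to the MaximumAllowableDeviation column and the left-hand side to the AbsoluteDeviation column. Keep in mind that this reflects my one-pass adaptation, not the iterative procedure Ross describes.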

**Figure 1: DDL for the R-Value Lookup Table**

CREATE TABLE [Calculations].[PeirceRValueTable](

[ID] [smallint] IDENTITY(1,1) NOT NULL,

[N] [tinyint] NULL,

[NumberOfOutliers] [tinyint] NOT NULL,

[RValue] [decimal](4, 3) NULL,

CONSTRAINT [PK_PeirceRValueTable] PRIMARY KEY CLUSTERED ( [ID] ASC ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]) ON [PRIMARY]

**Figure 2: Code for the Peirce’s Criterion Stored Procedure**

CREATE PROCEDURE [Calculations].[PiercesCriterionSP]

@DatabaseName as nvarchar(128) = NULL, @SchemaName as nvarchar(128), @TableName as nvarchar(128),@ColumnName AS nvarchar(128), @PrimaryKeyName as nvarchar(400), @DecimalPrecision AS nvarchar(50)

AS

DECLARE @SchemaAndTableName nvarchar(400), @SQLString nvarchar(max)

SET @DatabaseName = @DatabaseName + '.'

SET @SchemaAndTableName = ISNull(@DatabaseName, '') + @SchemaName + '.' + @TableName

SET @SQLString = 'DECLARE @Mean decimal(' + @DecimalPrecision + '), @StDev decimal(' + @DecimalPrecision + '), @Count decimal(' + @DecimalPrecision + ')

SELECT @Count=Count(CAST(' + @ColumnName + ' AS Decimal(' + @DecimalPrecision + '))), @Mean = Avg(CAST(' + @ColumnName + ' AS Decimal(' + @DecimalPrecision + '))), @StDev = StDev(CAST(' + @ColumnName + ' AS Decimal(' + @DecimalPrecision + ')))

FROM ' + @SchemaAndTableName + '

WHERE ' + @ColumnName + ' IS NOT NULL

SELECT ' + @PrimaryKeyName + ', ' + @ColumnName + ', NumberOfOutliers, AbsoluteDeviation, MaximumAllowableDeviation, ''IsOutlier'' = CASE WHEN AbsoluteDeviation > MaximumAllowableDeviation THEN 1 ELSE 0 END

FROM (SELECT TOP 9 ' + @PrimaryKeyName + ', ' + @ColumnName + ', AbsoluteDeviation, ROW_NUMBER() OVER (ORDER BY AbsoluteDeviation DESC) AS RN

FROM (SELECT ' + @PrimaryKeyName + ', ' + @ColumnName + ', ABS(' + @ColumnName + ' - @Mean) AS AbsoluteDeviation

FROM ' + @SchemaAndTableName + '

WHERE ' + @ColumnName + ' IS NOT NULL) AS T1

ORDER BY AbsoluteDeviation DESC) AS T2 -- the ORDER BY makes TOP 9 return the nine highest deviations deterministically

INNER JOIN (SELECT NumberOfOutliers, @StDev * RValue AS MaximumAllowableDeviation

FROM Calculations.PeirceRValueTable

WHERE N = @Count) AS T3

ON RN = NumberOfOutliers'

--SELECT @SQLString -- uncomment this to debug string errors

EXEC (@SQLString)

**Figure 3: Results for Peirce’s Criterion**

EXEC [Calculations].[PiercesCriterionSP]

@DatabaseName = N'DataMiningProjects',

@SchemaName = N'Health',

@TableName = N'First60RowsPyruvateKinaseView',

@ColumnName = N'PyruvateKinase',

@PrimaryKeyName = N'ID',

@DecimalPrecision = N'10,7'

…………The results above come from a view on the first 60 rows of a small dataset on the Duchennes form of muscular dystrophy, which I downloaded from Vanderbilt University’s Department of Biostatistics. I’ve stress-tested some of the procedures I posted earlier in this series on an 11-million-row table of Higgs Boson data made publicly available by the University of California at Irvine’s Machine Learning Repository, but there’s no point in doing that (or posting client statistics and execution plans) with the Peirce’s Criterion procedure if we’re limited to 60 rows. In lieu of new algorithms like those used by R and Python to compute the test in far greater detail, this is about as useful as the test can be made in a SQL Server setting. There are some definite advantages over Chauvenet’s Criterion and the Modified Thompson Tau test, in that automatic deletion of records is not encouraged to the same extent and expensive recursive calculations are not necessary. Yet like the last six standard outlier detection methods surveyed here, it’s not really suitable for usage on tables of thousands of rows, let alone the billions used in Big Data applications. As usual, the available lookup tables are simply too small, calculating the missing lookup values is not feasible at this time and the test is only applicable to a Gaussian distribution. One of the pluses is that Peirce’s Criterion does not depend on confidence levels that are typically set by custom rather than sound reasoning. Furthermore, the probabilistic reasoning it is based upon is sound, but does not represent a guarantee; probabilities only generate reasonable expectations but have no effect on outcomes. This drawback of probabilistic stats was recognized long ago by Peirce’s son, but has since been forgotten – especially after the advent of quantum mechanics.
As pointed out by Theodore P. Hill and Arno Berger, the authors of a study on Benford’s Law cited earlier in this series, “The eminent logician, mathematician, and philosopher C.S. Peirce once observed [Ga, p.273] that ‘in no other branch of mathematics is it so easy for experts to blunder as in probability theory.’”[4] I expected Peirce’s Criterion to be more useful because it is dependent on correlation stats that are common and easy to calculate, but it turns out that it belongs in the same class of outlier detection methods as Grubbs’ Test, the Generalized Extreme Studentized Deviate (GESD) test, Dixon’s Q-Test, the Tietjen-Moore Test, the Modified Thompson Tau test and Chauvenet’s Criterion. The lookup tables may not involve comparisons to Gaussian and T-distribution values like these hypothesis testing methods do, but the drawbacks are largely the same. Work is apparently ongoing in fields that use statistics to make the R-values easier to calculate from ordinary correlation coefficients, so Peirce’s Criterion may wind up being more usable than any of these in the long run. For now, however, SQL Server users would probably be better off sticking with methods like Z-Score and Benford’s Law that are more appropriate to large databases. So far, what I’ve found most striking about my misadventures in this topic is just how difficult it is to apply many commonly used statistical tests for outliers to the kind of datasets the SQL Server community works with; I’m only an amateur learning my way in this field, but I wonder at times if our use cases don’t call for the invention of new classes of tests. In the meantime, we can still rely on more useful outlier detection methods like Interquartile Range, which I’ll explain to the best of my inability next week.
DBAs are also likely to find real uses for the visual detection methods that can be easily implemented in Reporting Services, as well as Cook’s Distance and Mahalanobis Distance, which I’ve saved for the end of the series because the difficulty in coding them appears to be commensurate to their potential value.

[1] The name is not misspelled but is frequently mispronounced as “Pierce” rather than “purse.” The authorship is made even more confusing by the fact that Benjamin’s son, Charles Sanders Peirce (1839-1914), was also a well-known mathematician who published commentaries on his father’s Criterion. Apparently the son fits snugly in the category of mathematicians and physicists with unusual emotional and mental disturbances, given that when afflicted by trigeminal neuralgia “he was, at first, almost stupefied, and then aloof, cold, depressed, extremely suspicious, impatient of the slightest crossing, and subject to violent outbursts of temper,” which led to his pattern of “social isolation”; perhaps it also factored into the decision of Harvard’s president to ban him from employment there. He can’t have been entirely irrational though, given that he was very close to William James, one of the few sane American philosophers. For more backstory, see the Wikipedia pages “Benjamin Peirce” and “Charles Sanders Peirce” at http://en.wikipedia.org/wiki/Benjamin_Peirce and http://en.wikipedia.org/wiki/Charles_Sanders_Peirce respectively.

[2] See the Wikipedia page “Peirce’s Criterion,” available at the web address http://en.wikipedia.org/wiki/Peirce%27s_criterion

[3] pp. 3-4, Ross, Stephen M. “Peirce’s Criterion for the Elimination of Suspect Experimental Data,” pp. 1-12 in the Journal of Engineering Technology, Fall 2003. Vol. 2, No. 2. http://newton.newhaven.edu/sross/piercescriterion.pdf

[4] Berger, Arno and Hill, Theodore P., 2011, “Benford’s Law Strikes Back: No Simple Explanation in Sight for Mathematical Gem,” published by Springer Science+Business Media, LLC, Vol. 33, No. 1. Available online at http://people.math.gatech.edu/~hill/publications/PAPER%20PDFS/BenfordsLawStrikesBack2011.pdf.

## Outlier Detection with SQL Server, part 3.6: Chauvenet’s Criterion

**By Steve Bolton**

…………This is the last of six articles I’ve segregated in the middle of my mistutorial series on identifying outlying values with SQL Server, because they turned out to be difficult to apply to the typical use cases DBAs encounter. After this detour we’ll get back on track with outlier detection methods like Interquartile Range that are likely to be as useful as the ones the series started with, such as Benford’s Law and Z-Scores, but I’ll first give a brief explanation of Chauvenet’s Criterion for the sake of completeness and the off chance that it might prove useful in the right circumstances. Those circumstances are normally those suitable for statistical hypothesis testing, in which researchers attempt to prove narrow, specific points of evidence using relatively small datasets – not exploratory data mining or data quality examinations on datasets of thousands or even billions of rows, as in a typical relational table. This subset of tests was designed with different use cases in mind, so it is not surprising that they come with some common limitations that make them difficult to apply to big tables. Among these is the necessity of prior goodness-of-fit testing to ensure that the data follows a Gaussian or “normal” distribution, i.e. the bell curve, without which the outlier tests are invalid. Furthermore, the lookup tables that many of these tests require for comparisons are plentiful on the Internet and in old texts, but ones without gaps or that extend beyond a few hundred records are difficult to find; worse still, the formulas for calculating the missing values are often performance hogs or require precisions and scales that choke T-SQL code with arithmetic overflow errors. Drawbacks like these also restrict the usefulness of Chauvenet’s Criterion, which was among the first outlier detection methods ever developed. 
Naval Academy mathematician William Chauvenet (1820-1870) formulated it in the Civil War era, but recognized from the beginning that there were already more trustworthy means available, like Peirce’s Criterion. Perhaps its crudest limitation is that it calls for recursive reexamination of data after carrying out automatic deletion of data points without further investigation, which, as I have discussed in prior articles, is unwise and sometimes even unethical. Thankfully, we can apply to Chauvenet’s formula the same type of interpretation hack that last week’s article used to make the Modified Thompson Tau test more useful and valid; the formula is still apparently in common use today despite all of these well-known issues.

…………Repetitively deleting records until no more outliers remain to test is probably not going to fly with either users or IT managers in a SQL Server environment, but as we saw last week, it is possible to simply flag records as potential outliers and still recursively recalculate the underlying aggregates that the formula is based on, as if they had been deleted. It is easier to have our cake and eat it too thanks to new T-SQL windowing clauses like ROWS UNBOUNDED PRECEDING, which make the code for this week’s stored procedure much shorter and easier to follow. The T-SQL in Figure 2 closely resembles that of the Modified Thompson Tau procedure for another reason: the Chauvenet Criterion is also a hybrid method that marries some of the logic of Z-Scores with that of hypothesis testing. The key difference is that we need to compare the absolute deviation against the standard normal distribution rather than Student’s T-distribution; we’re really just substituting one statistical part for another in order to address different use cases, just as we would swap out a computer component or a part in a car engine. That substitution requires the use of a different lookup table than the ones we’ve used in recent articles, but we only need one of them, since we only need to input the absolute deviation rather than the degrees of freedom and an alpha value. That in turn means we can use a single join rather than a function call, further simplifying the procedure. The main problem I encountered when implementing this is that it is impossible to find complete lookup tables for the standard normal distribution, which typically accept just one decimal place of precision despite the fact that it is possible to calculate the Z-Scores fed into them to far higher precisions. 
Part of the problem is that they’re continuous values, but as I’ve found out the hard way, it is surprisingly difficult to calculate them to higher precisions with the original cumulative distribution function (CDF). Until I can come up with a more precise approximation for high-precision values, the clumsy lookup table defined in Figure 1 will have to do. I designed it to host the table cited at the Wikipedia page “68–95–99.7 Rule,”[1] which includes the probabilities that values will occur within one to seven standard deviations, at intervals of 0.5. Once I overcome my difficulties with CDFs and can get more accurate measures, it will be possible to replace the clumsy BETWEEN clause and CASE in the procedure that crudely peg data points to these wide limits. Once the probability value has been retrieved, we only need to multiply it by the number of data points and flag the record as an outlier if the result is less than 0.5. The PopulationOutsideRange value that the procedure in Figure 2 retrieves through the join is a computed column (which is renamed as Probability in the stored procedure), while the RN ROW_NUMBER value acts as a running count.
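To make the threshold arithmetic concrete, here is a minimal sketch that applies the 0.5 rule by hand. It is not part of the procedure: it assumes the full row count of the 209-row practice dataset used later in this article stands in for the running count RN, and it hard-codes rounded two-tailed probabilities for 2, 3 and 3.5 standard deviations in a VALUES list instead of reading them from the lookup table. The @RowCount variable and the NormalTail alias are illustrative names only.

```sql
-- Illustrative only: @RowCount stands in for the RN running count used in Figure 2
DECLARE @RowCount bigint = 209;

SELECT StandardDeviations,
       PopulationOutsideRange,
       @RowCount * PopulationOutsideRange AS ExpectedCount,
       -- Chauvenet's rule: flag the point if fewer than half a record is expected this far out
       CASE WHEN @RowCount * PopulationOutsideRange < 0.5 THEN 1 ELSE 0 END AS IsOutlier
FROM (VALUES (2.0, 0.045500), -- rounded two-tailed probabilities of the standard normal distribution
             (3.0, 0.002700),
             (3.5, 0.000465)) AS NormalTail (StandardDeviations, PopulationOutsideRange)
ORDER BY StandardDeviations;
```

With 209 rows, a point three standard deviations out yields an expected count of roughly 0.56 and is not flagged, while one at 3.5 standard deviations yields roughly 0.10 and is. The same probability flags more or fewer points as the row count changes, which is why the coarse 0.5-wide bands of the lookup table matter so much on large tables.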

**Figure 1: Code for the Standard Normal Deviation Lookup Table**

    CREATE TABLE [Calculations].[StandardNormalDeviationTable](
        [ID] [bigint] IDENTITY(1,1) NOT NULL,
        [StandardDeviations] [decimal](2, 1) NULL,
        [PopulationInRange] [decimal](16, 15) NULL,
        -- computed column: the probability of falling outside the range
        [PopulationOutsideRange] AS ((1)-[PopulationInRange]),
        [ExpectedFrequency] [bigint] NULL,
        [ApproximateFrequencyForDailyEvent] [nvarchar](400) NULL,
        CONSTRAINT [PK_StandardNormalDeviationTable]
        PRIMARY KEY CLUSTERED ([ID] ASC) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    ) ON [PRIMARY]
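The table is empty until it is populated. As a hedged illustration of what the rows look like, the sketch below loads just three of the thirteen entries, for one, two and three standard deviations. It assumes the Figure 1 table has already been created; the PopulationInRange values are erf(k/√2) for k = 1, 2, 3 rounded to 15 decimal places, which should match the Wikipedia table but ought to be checked against it before use. The remaining rows and the two frequency columns are left out.

```sql
-- Assumes [Calculations].[StandardNormalDeviationTable] from Figure 1 exists and is empty
INSERT INTO [Calculations].[StandardNormalDeviationTable]
    ([StandardDeviations], [PopulationInRange])
VALUES (1.0, 0.682689492137086),
       (2.0, 0.954499736103642),
       (3.0, 0.997300203936740);

-- PopulationOutsideRange is computed automatically as 1 - PopulationInRange
SELECT [StandardDeviations], [PopulationInRange], [PopulationOutsideRange]
FROM [Calculations].[StandardNormalDeviationTable]
ORDER BY [StandardDeviations];
```

Because PopulationOutsideRange is a computed column, it never appears in the INSERT column list; SQL Server derives it whenever the row is read.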

**Figure 2: Code for the Chauvenet Criterion Procedure**

    CREATE PROCEDURE [Calculations].[ChauvenetCriterionSP]
    @DatabaseName as nvarchar(128) = NULL, @SchemaName as nvarchar(128), @TableName as nvarchar(128), @ColumnName AS nvarchar(128), @PrimaryKeyName as nvarchar(400), @DecimalPrecision AS nvarchar(50)
    AS

    DECLARE @SchemaAndTableName nvarchar(400), @SQLString nvarchar(max)
    SET @DatabaseName = @DatabaseName + '.' -- stays NULL if no database name was supplied
    SET @SchemaAndTableName = ISNull(@DatabaseName, '') + @SchemaName + '.' + @TableName

    -- the innermost subquery computes the running count and Z-Score, the middle one looks up
    -- the probability and the outermost applies the 0.5 threshold; note that a deviation falling
    -- exactly on a band boundary such as 1.25 matches two lookup rows because BETWEEN is inclusive
    SET @SQLString = 'SELECT ' + @PrimaryKeyName + ', ' + @ColumnName + ', RN, AbsoluteDeviation, Probability, ''IsOutlier''
    = CASE WHEN (RN * Probability) < 0.5 THEN 1 ELSE 0 END
    FROM (SELECT T1.' + @PrimaryKeyName + ', ' + @ColumnName + ', RN, AbsoluteDeviation, ''Probability'' = CASE WHEN PopulationOutsideRange IS NOT NULL THEN PopulationOutsideRange
    WHEN PopulationOutsideRange IS NULL AND AbsoluteDeviation < 1 THEN 1 ELSE 0 END
    FROM (SELECT ' + @PrimaryKeyName + ', ' + @ColumnName + ', CAST(ROW_NUMBER() OVER (ORDER BY ' + @ColumnName + ' ASC) AS bigint) AS RN,
    Abs(' + @ColumnName + ' - Avg(CAST(' + @ColumnName + ' AS Decimal(' + @DecimalPrecision + '))) OVER (ORDER BY ' + @ColumnName + ' ASC ROWS UNBOUNDED PRECEDING)) /
    NullIf(StDev(CAST(' + @ColumnName + ' AS Decimal(' + @DecimalPrecision + '))) OVER (ORDER BY ' + @ColumnName + ' ASC ROWS UNBOUNDED PRECEDING), 0) AS AbsoluteDeviation
    FROM ' + @SchemaAndTableName + '
    WHERE ' + @ColumnName + ' IS NOT NULL) AS T1
    LEFT JOIN Calculations.StandardNormalDeviationTable AS T2
    ON T1.AbsoluteDeviation BETWEEN T2.StandardDeviations - 0.25 AND T2.StandardDeviations + 0.25
    WHERE AbsoluteDeviation IS NOT NULL) AS T3
    ORDER BY IsOutlier DESC, AbsoluteDeviation DESC, Probability DESC'

    --SELECT @SQLString -- uncomment this to debug string errors
    EXEC (@SQLString)

…………If you’ve been following this series, most of the rest is self-explanatory: the first few parameters allow users to apply the test to any column in any database for which they have permissions, while the @DecimalPrecision parameter allows them to adjust the precision and scale to avoid arithmetic overflows in the calculations. The rest is just the usual dynamic SQL, with a couple of nested subqueries to allow calculations like the Z-Score to be bubbled up and handled at higher levels. As usual, I’ve omitted any code to handle spaces in object names (which I never allow when I can get away with it) or SQL injection protections.
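For readers who do need to cope with spaces in object names or untrusted parameter values, the sketch below shows one possible hardening step. It is not the approach used in the procedure above, just an illustration built on the built-in QUOTENAME function and a simple LIKE check; the variable names @SafeColumn and @SafeTable are hypothetical, and the generated statement is deliberately reduced to a bare SELECT so that it can be inspected without any tables in place.

```sql
DECLARE @SchemaName sysname = N'Health',
        @TableName sysname = N'DuchennesTable',
        @ColumnName sysname = N'Hemopexin',
        @DecimalPrecision nvarchar(50) = N'5,1';

-- Reject anything other than digits and a comma in the precision and scale string
IF @DecimalPrecision LIKE N'%[^0-9,]%'
    RAISERROR(N'@DecimalPrecision may only contain digits and a comma.', 16, 1);

-- QUOTENAME wraps each identifier in brackets and doubles any embedded ] characters
DECLARE @SafeColumn nvarchar(258) = QUOTENAME(@ColumnName),
        @SafeTable nvarchar(517) = QUOTENAME(@SchemaName) + N'.' + QUOTENAME(@TableName);

DECLARE @SQLString nvarchar(max) =
    N'SELECT ' + @SafeColumn + N' FROM ' + @SafeTable + N' WHERE ' + @SafeColumn + N' IS NOT NULL';

SELECT @SQLString AS GeneratedSQL; -- inspect the statement before passing it to EXEC
```

The bracketed names can then be concatenated into the dynamic SQL in place of the raw parameters. This only addresses identifiers and the precision string; it is a starting point rather than a complete defense.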

**Figure 3: Results for the Chauvenet Criterion Procedure**

    EXEC [Calculations].[ChauvenetCriterionSP]
        @DatabaseName = N'DataMiningProjects',
        @SchemaName = N'Health',
        @TableName = N'DuchennesTable',
        @ColumnName = N'Hemopexin',
        @PrimaryKeyName = N'ID',
        @DecimalPrecision = N'5,1'

…………Chauvenet’s Criterion turned out to be much more conservative than I expected it to be, at least in terms of flagging outliers in columns without a lot of repeating values. Only a single value was identified as an outlier in the Hemopexin column of a 209-row dataset on the Duchennes form of muscular dystrophy, which I downloaded from Vanderbilt University’s Department of Biostatistics and have been using for practice throughout this series. On another practice dataset with 1,600 rows it found only three, far fewer than any of the other procedures tested to date. I was startled at how well the procedure performed against the first float column in the Higgs Boson dataset, which I downloaded from the University of California at Irvine’s Machine Learning Repository and turned into a nearly 6-gigabyte table. Despite the fact that the logic for both procedures is quite similar, Chauvenet’s test took only 3 minutes and 15 seconds to run on my poor beat-up six-core imitation of a workstation, compared to an hour and fifteen minutes for the Modified Thompson Tau test procedure. The execution plan in last week’s tutorial was too small to bother to post, since it consisted mostly of a single Sort operation that sucked up 95 percent of the computational cost, whereas the Chauvenet execution plan was too wide to fit here and comprised several other operators like Compute Scalar, Nested Loops and Parallelism (Gather Streams). It also included a single Sort, but it only accounted for 38 percent of the expense.

…………It may perform surprisingly well and have its uses on columns with few repeating values when a conservative identification threshold is called for, but this century-and-a-half old test has many drawbacks that should not be understated. The requirement of a Gaussian distribution and the difficulty in getting accurate probability values for the size of the datasets DBAs work with are only the tip of the iceberg. The Central Limit Theorem on which it is mathematically based is much more trustworthy than other probabilistic methods extant today, but it is a fallacy to believe that probabilities represent guarantees or have any effect on causation. I’ve barely touched on this issue yet, but the aforementioned Wikipedia article on the 68–95–99.7 Rule put it a lot better than I can: “…it is important to be aware of the fact that there is actually nothing in the process of drawing with replacement that specifies the order in which the unlikely events should occur, merely their relative frequency, and one must take care when reasoning from sequential draws. It is a corollary of the gambler’s fallacy to suggest that just because a rare event has been observed, that rare event was not rare. It is the observation of a multitude of purportedly rare events that undermines the hypothesis that they are actually rare.”[2] The remedy for this uncertainty is the same as for the more serious issue of deletion: further investigation, not knee-jerk deletion of records. As geneticist David M. Glover and oceanographers Scott Christopher Doney and William J. Jenkins put it in their 2011 book, Modeling Methods for Marine Science:

“Now the truly clever researcher might be tempted to perform this rejection iteratively. That is, why not compute a mean and standard deviation, Z-score the data and reject the fliers, then compute an even better mean and standard deviation and do the same thing all over again, rejecting more data. The advice of all the statistical sages and texts is do Chauvenet rejection only once in a given distribution. If the data were normally distributed, and there weren’t many fliers, you’ll probably find that the second iteration will not yield any more rejectable points. If it does, then it suggests that your data may not be normally distributed. The philosophy is that filtering once is a valid thing to do, but iterative filtering may dramatically alter the data distribution in a fundamental way, invalidating the assumptions behind your statistical calculations, and leading to erroneous results. Moreover, you may be accused of being a Chauvenet Chauvinist.”[3]

This is professional confirmation of the Catch-22 I’ve always fretted about with the normal distribution: the more outliers that are found, the less likely it is that a Gaussian bell curve is active, in which case most of these hypothesis-testing based outlier detection methods are invalid. Another Catch-22 is operative when we’re recklessly deleting data in a recursive routine like Chauvenet’s Criterion and the Modified Thompson Tau test: the more we delete, the bigger the impact on the dataset will be. If we follow Glover et al.’s suggestion and limit the criterion to a single use, it’s hardly applicable to a SQL Server database where we may need to find tens of thousands of outliers, while looking for data quality issues or doing exploratory data mining. Such a wide scope also calls for degrees of precision that aren’t readily available in regular lookup tables and would probably be quite costly to compute. The criterion may have been better than nothing when Chauvenet wrote his paper back in the Civil War era, but it’s really hard to justify its use, even in many of the hypothesis testing scenarios it was designed for. Nevertheless, academia and research labs across the planet are apparently still staffed by many of those “Chauvenet Chauvinists” today. While researching this article (including reading parts of Chauvenet’s original paper in .pdf format, which I’ve since lost) I ran across many comments like this one from Stephen Ross, a professor of mechanical engineering at the University of New Haven:

“Peirce’s criterion has been buried in the scientific literature for approximately 150 years. It is virtually unknown today in the scientific community. In its place, Chauvenet’s criterion is commonly used for rational elimination of “outlier” data by government laboratories, (e.g., Environmental Protection Agency, U.S. Army Corps of Engineers, Agency for Toxic Substances and Disease Registry, Institute for Telecommunication Sciences), industry (e.g., Boeing, Sikorsky), foreign laboratories (e.g., Laboratoire National Henri Becquerel, Joint Astronomy Centre), and universities (e.g., research and courses at University of Michigan, Texas A&M, University of California, Vanderbilt, University of Alberta, Ohio State). Methods of elimination of data “outliers” are useful for anyone working in industry or in an educational institution where statistical information concerning product runs or experimental data is of interest. In an engineering, technology or science program, laboratory courses in chemistry, physics and engineering can, and do, find use for rational spurious data elimination. In the BSME program at the University of New Haven, we have used Chauvenet’s criterion in our instrumentation and fluid/thermal laboratory courses for many years. Other universities have similarly used this criterion in their undergraduate laboratories. Typically, students take several measurements of a quantity, say pressure, at one setting (meaning the experimental conditions are maintained at the same level). Assuming the systematic errors are negligible, each measurement will vary slightly due to random errors (e.g., reading instrument values, flow rate may change slightly, etc.). Often, however, one or two datum points seem to fall “far” outside the range of the others obtained. These outliers greatly impact the mean and standard deviation of the measurements. 
A data elimination method can be used to obtain a realistic average value of pressure and an “uncertainty” in the true value given by the standard deviation…Chauvenet’s criterion is in common use today for elimination of suspect data.”[4]

…………Ignorance is bliss. I started off this series with some dire warnings about how haphazardly statistics are handled today, especially in fields like medicine where they can do the most damage. The more I’ve learned while writing this series, the less reassured I’ve become. One of the clearest lessons I’ve learned from this exercise is that, if the SQL Server community and the rest of the database server field get in the habit of routinely doing outlier detection (as I suspect they will, in time), they really need to avoid simply copying the means used in other fields. Chauvenet’s Criterion and the other five hypothesis-testing based methods don’t seem to be well-suited to the Big Data buzzword at all, but it doesn’t stop there: in many cases, they’re not even applied correctly in industries where they’re used on a daily basis, such as medical research. So far in this series, only Benford’s Law and Z-Scores appear to fit our use cases well, although I have high hopes for upcoming topics like Interquartile Range, Cook’s distance and Mahalanobis distance, as well as the various visual means that can be implemented in Reporting Services. Next week’s article on Peirce’s Criterion is also likely to be more valuable to DBAs. As Ross points out in an article on that topic, even Chauvenet recommended it in place of his own test: “Chauvenet himself believed that Peirce’s work was more rigorous and could be applied more generally, and in Chauvenet’s words, ‘For the general case….. when there are several unknown quantities and several doubtful observations, the modifications which the rule (meaning his own criterion) requires renders it more troublesome than Peirce’s formula……What I have given may serve the purpose of giving the reader greater confidence in the correctness and value of Peirce’s Criterion.’”

…………What’s good enough for Chauvenet is good enough for me. Why his advice not to use his own test apparently isn’t heeded in academia and private sector research is beyond me. Perhaps it is only a matter of habit, like the completely arbitrary custom of using confidence levels like 95 percent in hypothesis testing. Hopefully it is a custom that DBAs won’t adopt without some thought; perhaps Chauvenet’s Criterion has a place in our tool belts for unusual use cases, but it ought to be a very small place, considering how many more fitting outlier detection methods we have available to us.

[1] See the Wikipedia pages “68–95–99.7 Rule” and “Standard Deviation” at http://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule and http://en.wikipedia.org/wiki/Standard_deviation respectively.

[2] *IBID.*

[3] p. 29, Glover, David M.; Jenkins, William J. and Doney, Scott Christopher, 2011, Modeling Methods for Marine Science. Cambridge University Press: New York. I found this reference at the Google Books web address http://books.google.com/books?id=OYAkMs85848C&q=chauvenet#v=snippet&q=chauvenet&f=false

[4] pp. 3-4, Ross, Stephen M. “Peirce’s Criterion for the Elimination of Suspect Experimental Data,” pp. 1-12 in the Journal of Engineering Technology, Fall 2003. Vol. 2, No. 2. http://newton.newhaven.edu/sross/piercescriterion.pdf