Outlier Detection with SQL Server, part 8: A T-SQL Hack for Mahalanobis Distance

By Steve Bolton

…………Longer code and substantial performance limitations were the prices we paid for greater sophistication with Cook’s Distance, the topic of the last article in this series of amateur self-tutorials on identifying outliers with SQL Server. The same tradeoff was even more conspicuous in this final installment – until I stumbled across a shortcut to coding Mahalanobis Distance that really saved my bacon. The incredibly cool moniker sounds intimidating, but the concepts and code required to implement it are trivial, as long as we sidestep the usual matrix math that ordinarily makes it prohibitively expensive to run on “Big Data”-sized tables. It took quite a while for me to blunder into a suitable workaround, but it was worthwhile, since Mahalanobis Distance merits a special place in the pantheon of outlier detection methods, by virtue of the fact that it is uniquely suited to certain use cases. Like Cook’s Distance, it can be used to find outliers defined by more than one column, which automatically puts both in a league the others surveyed in this series can’t touch; their competitors are typically limited to a single column at a time and cannot catch unusual two-column combinations in which neither value sits at the extreme low or high end, whereas Cook’s D and Mahalanobis Distance sometimes flag exactly those kinds of unusual intermediate values. The latter, however, can also be extended to more than two columns. Better yet, it also accounts for distortions introduced by variance into the distances between data points, by renormalizing them on a consistent scale that is in many cases equivalent to ordinary Euclidean Distance. Best of all, efficient approximations can be derived through a shortcut that renders all of the complex matrix math irrelevant; since the goal in outlier detection is mainly to serve as an alarm bell to draw attention to specific data points that might warrant human intervention, we can sacrifice a little accuracy in return for astronomical performance gains.
…………For both the cool name and the even cooler multidimensional outlier detection capabilities, we can thank Prasanta Chandra Mahalanobis (1893-1972), who was born in Calcutta at a time when Gandhi, Mother Teresa, distant tech support call centers, Bangladesh and the other things Westerners associate today with the region were still in the far-flung future. He and his grandfather may have acted as moderating influences in Brahmo Samaj, a 19th Century offshoot of Hinduism that has since apparently died out in Bangladesh; later in life he “served as secretary to Rabindranath Tagore, particularly during the latter’s foreign travels.”[i] Some of that polymath’s brilliance must have rubbed off, because Mahalanobis responded to a dilemma he encountered while trying to compare skull sizes by inventing an entirely new measure of similarity, which can be adapted to finding outliers based on how unalike they are. It has many applications, but for our purposes it is most appropriate for finding multidimensional outliers. If you want to find out how unusual a particular value is for a particular column, any of the detection methods presented earlier in this series may suffice, if all of their particular constraints are taken into account – save for Cook’s Distance, which is a comparison between two columns. Mahalanobis Distance takes the multicolumn approach one step further and represents one of the few means available for finding out whether a particular data point is unusual when compared to several columns at the same time.

The Litmus Test: Comparing Outliers to the Shape of Your Data

                Think of it this way: instead of measuring the distance of a single data point or mean to a standard deviation or variance, as we do so often in statistics, we’re measuring several variables against an entire multidimensional matrix of multiple columns, as well as the variances, covariances and averages associated with them. These extra columns allow us to compare our data points against a shape that is more geometrically complex than the single center point defined by a simple average or median. That is why Mahalanobis Distance is intimately related to the field of Principal Components Analysis, i.e. the study of various axes that make up a multidimensional dataset. The metric also has distinctive interrelationships with the regression metric known as leverage[ii], the normal distribution (i.e. the bell curve)[iii], Fisher Information and Beta and Chi-Squared distributions[iv] that are still far above my head, but I was able to explain it to myself crudely in this way: the metric measures how many standard deviations[v] a data point is from a set of center points for all of the columns under consideration, which taken together form an ellipsoid[vi] that is transformed into a circle by the constituent mathematical operations. Don’t let the big word ellipsoid fool you, because it’s actually quite obvious that any normal scatter plot of data points will form a cloud in the shape of an irregular curve around the center of the dataset. It’s also obvious that the center will have a more complex shape than when we use a single variable, since if we have three means or medians taken from three columns we could make a triangle out of them, or a four-sided shape like a rectangle from similar measures applied to four columns, and so on; the only difference is that we have to do some internal transformations to the shape, which need not concern us. Suppose, for example, that you wanted to discover if a particular sound differs from the others in a set by its pitch; in this case, you could simply use a typical unidimensional outlier detection method that merely examines the values recorded for the pitch. You could get a more complete picture, however, by taking into account other variables like the length of the song it belongs to, the type of instrument that produced it and so on.
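For reference, the textbook definition that the preceding description paraphrases is, for a data point x, a vector of column means μ and a covariance matrix S:

D_M(x) = \sqrt{(x - \mu)^{\mathsf{T}} \, S^{-1} \, (x - \mu)}

The inverted covariance matrix S^{-1} is the piece that rescales the ellipsoid into a circle, and it is also the piece responsible for the matrix math we will sidestep shortly.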
…………The price of this more subtle and flexible means of outlier detection would be quite high in terms of both performance and code maintenance, if our implementation consisted of translating the standard matrix math notation into T-SQL. I programmed an entire series of T-SQL matrix procedures to do just that, which seemed to perform reasonably well and with greater accuracy than the method in Figure 1 – until I hit the same SQL Server internals barrier I did with Cook’s Distance in the last article. To put it bluntly, we can’t use recursive calls to table-valued parameters to implement this sort of thing, because we’ll hit the internal limit of 32 locking classes rather quickly, leading to “No more lock classes available from transaction” errors. This long-standing limitation is by design, plus there are apparently no plans to change it and no widely publicized workarounds, so that’s pretty much the end of the line (unless factorization methods and similar matrix math workarounds I’m not familiar with might do the trick).

Bypassing the Matrix Math

                 I had to put Mahalanobis Distance on the back burner for months until I stumbled across a really simple version expressed in ordinary arithmetic notation (summation operators, division symbols, that sort of thing) rather than matrix operations like transposes and inverses. Unfortunately, I can’t remember where I found this particular formula to give adequate credit, but it allowed me to chop at least two lengthy paragraphs out of this article, which I originally included to explain the inner workings of the gigantic matrix math routines I wrote; otherwise, I might’ve set a SQL Server community record for the lengthiest T-SQL sample code ever crammed into a single blog post. Instead, I present Figure 1, which is short enough to be self-explanatory; the format ought to be familiar to anyone who’s been following this series, since it features similar parameter names and dynamic SQL operations. The good news is that the results derived from the Duchennes muscular dystrophy dataset I’ve been using for practice data throughout this series aren’t substantially different from those derived through the matrix math method. There are indeed discrepancies, but this approximation is good enough to get the job done without any noteworthy performance hit at all.
…………Keep in mind that the results of outlier detection methods are rarely fed into other calculations for further refinement, so perfect accuracy is not mandatory as it might be with hypothesis testing or many other statistical applications. The point of all of the T-SQL sample code in this series is to automate the detection of outliers, whereas their handling requires human intervention; all we’re doing is flagging records for further attention, so that experts with domain knowledge can cast trained eyes upon them, looking for relevant patterns or perhaps evidence of data quality problems. The goal is inspection, not perfection. A few years back I read some articles on how the quality of being “good enough” is affecting software economics (although I can’t rustle up the citations for those either) and this hack for Mahalanobis Distance serves as a prime example. It’s not as pretty as a complete T-SQL solution that matches the more common matrix formula exactly, but it serves the purposes of end users just as well – or perhaps even better, considering the short code is more easily maintained and the performance is stellar. This sample code runs in about 20 milliseconds on my desktop computer (which could hardly be confused with a real server), compared to 19 for the Cook’s D procedure in the last tutorial. The cool thing is that it scales much better. My implementation of Cook’s D can’t be run at all on the Higgs Boson Dataset I’ve been using to stress-test my code in this series[vii], because the regression stats would have to be recalculated for each of the 11 million rows, thereby leading to exponential running times and the need to store 121 trillion regression rows in TempDB. That’s not happening on anyone’s server, let alone my wheezing Frankenstein of a desktop. My Mahalanobis hack ran in a respectable 3:43 on the same Higgs Boson data. The lesson I learned from coding Mahalanobis and Cook’s Distances in T-SQL is that arithmetic formulas ought to be preferred to ones defined in matrix notation, whenever possible, even if that entails resorting to approximations of this kind. The difficulty consists of finding them, perhaps hidden in the back of some blog post or journal article in the dark corners of the Internet.
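To spell out exactly what the procedure in Figure 1 computes for each record, the shortcut amounts to a drastic two-column simplification of the textbook formula, in which the difference between the two columns stands in for the deviation from the vector of means and a single variance stands in for the inverted covariance matrix:

MahalanobisDistance \approx \sqrt{\dfrac{(\text{Column1} - \text{Column2})^{2}}{Var(\text{Column1})}}

That substitution is the source of both the stellar performance and the minor discrepancies from the matrix version mentioned above.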

Figure 1: Code for the Mahalanobis Distance Procedure
CREATE PROCEDURE Calculations.OutlierDetectionMahalanobisDistanceSP
@Database1 nvarchar(128), @Schema1 nvarchar(128), @Table1 nvarchar(128), @PrimaryKeyName nvarchar(128), @Column1 AS nvarchar(128), @Column2 AS nvarchar(128)
AS
DECLARE @SchemaAndTable1 nvarchar(400),@SQLString nvarchar(max)
SET @SchemaAndTable1 = @Database1 + '.' + @Schema1 + '.' + @Table1

SET @SQLString = 'DECLARE @Var1 decimal(38,32)

SELECT @Var1 = Var(CAST(' + @Column1 + ' AS decimal(38,32)))
FROM ' + @SchemaAndTable1 + '
WHERE ' + @Column1 + ' IS NOT NULL AND ' + @Column2 + ' IS NOT NULL

SELECT ' + @PrimaryKeyName + ' AS PrimaryKey, ' + @Column1 + ' AS Value1, ' + @Column2 + ' AS Value2,
Power(Power(' + @Column1 + ' - ' + @Column2 + ', 2) / (@Var1), 0.5) AS MahalanobisDistance
FROM ' + @SchemaAndTable1 + '
WHERE ' + @Column1 + ' IS NOT NULL AND ' + @Column2 + ' IS NOT NULL
ORDER BY MahalanobisDistance DESC'

--SELECT @SQLString -- uncomment this to debug dynamic SQL errors

EXEC (@SQLString)

Figure 2: Results for the Mahalanobis Distance Procedure

EXEC Calculations.OutlierDetectionMahalanobisDistanceSP
              @Database1 = N'DataMiningProjects',
              @Schema1 = N'Health',
              @Table1 = N'DuchennesTable',
              @PrimaryKeyName = N'ID',
              @Column1 = N'CreatineKinase',
              @Column2 = N'Hemopexin'

Quick Version of Mahalanobis Distance

…………There are still some issues left to be worked out with this approximation of Mahalanobis Distance. I’m not yet sure under which conditions we can expect better accuracy, or conversely greater discrepancies from the matrix version. I know Mahalanobis Distance can also be extended to more than two columns, unlike Cook’s D, but I have yet to engineer a solution. Moreover, I have yet to wrap my head around all of the subtle cases where Mahalanobis is less applicable; for example, it apparently isn’t as appropriate when the relationships are nonlinear.[viii] As I’ve come to realize through reading up on statistical fallacies, these tricky situations can make all of the difference in the world between mining models that help end users to make informed decisions and ones that can mislead them into disastrous mistakes. Deriving the numbers certainly isn’t easy, but it is even harder to attach them to the wider scaffolding of hard logic in a meaningful way. As many statisticians themselves decry, that is precisely where a lot of science and public discourse go awry. Thankfully, these issues aren’t life-and-death matters in outlier detection, where the goal is to act as an alarm bell to alert decision-makers, rather than as a decision in and of itself; as I’ve pointed out throughout this series ad infinitum, ad nauseam, these detection methods only tell us that a data point is aberrant, but say nothing about why. This is why knee-jerk reactions like simply deleting outliers are not only unwarranted, but can be and often are used for deceptive purposes, particularly when money, reputations and sacred cows are on the line. The frequency with which this sort of chicanery still happens is shocking, as I mentioned earlier in the series. As I’ve learned along the way, perhaps the second-most critical problem dogging outlier detection is the lack of methods capable of dealing with “Big Data”-sized databases, or even the moderately sized tables of a few thousand or a few million rows that we see routinely in SQL Server. Most of them simply choke and a few are even invalid or incalculable. It might be useful to develop new ones more suited to these use cases, or track down and popularize any that might’ve already been published long ago in the math literature.
…………Despite such subtle interpretive risks, Mahalanobis Distance is the only statistic I’m aware of in my minimal experience that can be applied to the case of multidimensional outliers, beyond the two columns Cook’s D is limited to. In this capacity it acts as a dissimilarity measure, but can also be used for the converse purpose as a measure of similarity. Its scale-invariance and status as a “dimensionless quantity,” i.e. a pure number attached to no particular system of unit measurement, apparently have their advantages as well.[ix] It can be used in other capacities in data mining, such as in feature selection in Bayesian analysis.[x] I don’t necessarily understand a good share of the data mining and machine learning literature I’ve read to date, but can tell by the frequency with which it crops up that Mahalanobis Distance has diverse uses beyond mere outlier detection. In a future mistutorial series, I intend to demonstrate just how little I know about several dozen other metrics commonly used in the field of data mining, like Shannon’s Entropy, Bregman’s Divergence, the Akaike Information Criterion, Sørensen’s Similarity Index and the Lyapunov Exponent. I’ll also include a whole segment on probabilistic applications of distance measures, such as the popular Kullback-Leibler Divergence, all of which turned out to be easier to code and explain than Cook’s D and Mahalanobis Distance. It only gets easier from here on in, at least in terms of common distance measures. I have no timetable for finishing the dozens of such metrics I intend to survey (if all goes according to plan, I will be posting data mining code on this blog for many years to come) but by the time I’m finished with the series tentatively titled Information Measurement with SQL Server, it should be easier than ever before to quantify just how much information there is in every table. We’ll also be able to measure such properties as randomness among a column’s values. Before diving into it, however, I might post a quick wrap-up of this series that includes a makeshift use diagram that classifies all of the outlier detection methods covered in this series, as well as a makeshift method of detecting interstitial outliers that I cooked up to meet some specific use cases, one that allowed me to spot a data quality issue in my own databases. I’ll also take a quick detour into coding goodness-of-fit tests in SQL Server, since these seemed to have quite a bit of overlap with some of the outlier detection methods mentioned earlier in this series. Knowing what probability distribution one is dealing with can sometimes tell us an awful lot about the underlying processes that produced it, so they can be indispensable tools in DIY data mining on SQL Server.

[i] I glanced at the biography at the Wikipedia page “Prasanta Chandra Mahalanobis,” at the web address http://en.wikipedia.org/wiki/Prasanta_Chandra_Mahalanobis

[ii] See the Wikipedia page “Mahalanobis Distance” at http://en.wikipedia.org/wiki/Mahalanobis_distance

[iii] IBID.

[iv] “When an infinite training set is used, the Mahalanobis distance between a pattern measurement vector of dimensionality D and the center of the class it belongs to is distributed as a chi2 with D degrees of freedom. However, the distribution of Mahalanobis distance becomes either Fisher or Beta depending on whether cross validation or resubstitution is used for parameter estimation in finite training sets. The total variation between chi2 and Fisher, as well as between chi2 and Beta, allows us to measure the information loss in high dimensions. The information loss is exploited then to set a lower limit for the correct classification rate achieved by the Bayes classifier that is used in subset feature selection.” Ververidis, D. and Kotropoulos, C., 2009, “Information Loss of the Mahalanobis Distance in High Dimensions: Application to Feature Selection,” pp. 2275-2281 in IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 31, No. 12. See the abstract available at http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=4815271&url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel5%2F34%2F5291213%2F04815271.pdf%3Farnumber%3D4815271

[v] “One interesting feature to note from this figure is that a Mahalanobis distance of 1 unit corresponds to 1 standard deviation along both primary axes of variance.” See the Jennes Enterprises webpage titled “Description” at http://www.jennessent.com/arcview/mahalanobis_description.htm.

[vi] See the post by jjepsuomi titled “Distance of a Test Point from the Center of an Ellipsoid,” published Jun 24, 2013 in the StackExchange Mathematics Forum, as well as the reply by Avitus on the same date. Available online at http://math.stackexchange.com/questions/428064/distance-of-a-test-point-from-the-center-of-an-ellipsoid. Also see jjepsuomi’s post titled “Bottom to Top Explanation of the Mahalanobis Distance,” published June 19, 2013 in the CrossValidated forums. Available online at http://stats.stackexchange.com/questions/62092/bottom-to-top-explanation-of-the-mahalanobis-distance. The folks at CrossValidated gave me some help on Aug. 14, 2014 with these calculations in the thread titled “Order of Matrix Operations in Mahalanobis Calculations,” which can be found at http://stats.stackexchange.com/questions/111871/order-of-matrix-operations-in-mahalanobis-calculations

[vii] I downloaded it from the University of California at Irvine’s Machine Learning Repository a long time ago and converted it into a SQL Server table of about 6 gigabytes.

[viii] See Rosenmai, Peter, 2013, “Using Mahalanobis Distance to Find Outliers,” posted Nov. 25, 2013 at the EurekaStatistics.com web address http://eurekastatistics.com/using-mahalanobis-distance-to-find-outliers

[ix] See the Wikipedia pages “Mahalanobis Distance” and “Scale Invariance” at http://en.wikipedia.org/wiki/Mahalanobis_distance and http://en.wikipedia.org/wiki/Scale_invariance

[x] See Ververidis and Kotropoulos.


Outlier Detection with SQL Server, part 7: Cook’s Distance

By Steve Bolton

…………I originally intended to save Cook’s and Mahalanobis Distances to close out this series not only because the calculations and concepts are more difficult yet worthwhile to grasp, but also in part to serve as a bridge to a future series of tutorials on using information measures in SQL Server, including many other distance measures. The long and short of it is that since I’m learning these topics as I go, I didn’t know what I was getting myself into and ended up finishing almost all of the other distance measures before Cook’s and Mahalanobis. Like the K-Means algorithm I recapped in the last tutorial and had already explained in greater depth in A Rickety Stairway to SQL Server Data Mining, Algorithm 7: Clustering, these two are intimately related to ordinary Euclidean Distance, so how hard could they be? Some other relatively common outlier detection methods are also based on K-Means relatives (like K-Nearest Neighbors) and from there to Euclidean Distance, so I won’t belabor the point by delving into them further here. There are also distance metrics in use today that are based on mind-bending alternative systems like affine, hyperbolic, elliptic and kinematic geometries in which these laws do not necessarily hold, after relaxing some of the Euclidean postulates; for example, the affine type of non-Euclidean geometry is useful in studying parallel lines, while the hyperbolic version is useful with circles.[1] Some of them are exotic, but others are quite useful in DIY data mining, as we shall see in a whole segment on probabilistic distances (like the Kullback-Leibler Divergence) in that future mistutorial series. What tripped me up in the case of Cook’s and Mahalanobis is that the most common versions of both rely on matrix math, which can present some unexpected stumbling blocks in SQL Server. In both cases I had to resort to alternative formulas, after running into performance and accuracy issues using the formulas based on standard notation. They’re entirely worthwhile to code in T-SQL, because they occupy an important niche in the spectrum of outlier detection methods. All of the methods introduced in this series allow us to automatically flag outliers for further inspection, which can be quite useful for ferreting out data quality issues, finding novel patterns and the like in large databases – where we don’t want to go around deleting or segregating thousands of records without some kind of intelligent examination first. Cook’s and Mahalanobis, however, stand out because they’re among the few standard ways of finding aberrant data points defined by more than one column. This also makes them capable of detecting unusual two-column values in cases where neither column is at the extreme low or high end, although that doesn’t happen often. These outlier detection methods are thus valuable to have on hand, despite the fact that “Cook’s D,” as it is often known, is still prohibitively costly to run on “Big Data”-sized databases, unlike my workaround for Mahalanobis Distance. The “D” may stand for “disappointing,” although it can still be quite useful on small and medium-sized datasets.
…………Cook’s Distance is suitable as the next stepping stone because we can not only draw upon the concept of distances between data points drawn from the K-Means version of the SSDM Clustering algorithm, but also make use of the lessons learned in A Rickety Stairway to SQL Server Data Mining, Algorithm 2: Linear Regression. Like so many other metrics discussed in this series, it made its debut in the American Statistical Association journal Technometrics, in this case in a paper published in 1977 by University of Minnesota statistician R. Dennis Cook, which I was fortunate enough to find a copy of.[2] The underlying equation[3] is not necessarily trivial, but the concepts underpinning it really shouldn’t be too difficult for anyone who can already grasp the ideas underpinning regression and Z-Scores, which have been dealt with in previous posts. I found it helpful to view some of the difference operations performed in Cook’s Distance (and the Mean Square Error (MSE) it depends upon) as a sort of twist on Z-Scores, in which we subtract data points from the data points predicted by a simple regression, rather than data points from the mean as we would in the deviation calculations that Z-Scores depend upon. After deriving each of these differences, we square them and sum them – just as we would in many other outlier detection calculations performed earlier in this tutorial series – then finally multiply by the reciprocal of the count.[4] The deviation calculation in the dividend of a Z-Score can in fact be seen as a sort of crude distance metric in its own right, in which we are measuring how far each data point is from the center of a dataset, as defined in a mean or median; in the MSE, we are performing a similar distance comparison, except between a predicted value and the actual value for a data point. To calculate Cook’s Distance we multiply the MSE by the count of parameters – which for our purposes means the number of columns we’re predicting, which is limited to just one in my code for now. The result forms the divisor in the final calculations, but the dividend is more complex. Instead of comparing a prediction to an actual value, we recalculate a new prediction for each data point in which the regression has been recalculated with that specific data point omitted, then subtract the result from the prediction made for that data point by the full regression model with no points omitted. The dividend is formed by squaring each of those results and summing them, in a process quite similar to the calculation of MSE and Z-Scores. The end result is a measuring stick that we can compare two-column data points against, rather than just one as we have with all of the other outlier detection methods in this series.
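Putting that description into the standard arithmetic notation, where \hat{y}_j is the full model’s prediction for the j-th row, \hat{y}_{j(i)} is the prediction for the same row after the regression is refit with the i-th row omitted, p is the count of fitted parameters and MSE is the mean squared error, the metric works out to:

D_i = \dfrac{\sum_{j=1}^{n} \left(\hat{y}_j - \hat{y}_{j(i)}\right)^{2}}{p \cdot MSE}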
…………The difficulty in all of this is not the underlying concept, which is sound, but the execution, given that we have to recalculate an entirely new regression model for each data point. The dilemma is analogous to the one we faced in previous articles on the Modified Thompson Tau Test and Chauvenet’s Criterion, where we had to perform many of the computations recursively in order to recalculate the metrics after simulating the deletion of data points. Each of the difference operations we perform below tells us something about how important each record is within the final regression model, rather than how many outliers there might be if that record was eliminated, but it still presents a formidable performance problem. This drawback is magnified by the fact that we can’t use SQL Server’s exciting new windowing functions to solve the recursion issue with sliding windows, as we did in the articles on the Modified Thompson Tau test and Chauvenet’s Criterion. In fact, Step 5 in Figure 1 would be an ideal situation for the EXCLUDE CURRENT ROW clause that Itzik Ben-Gan, the renowned expert on SQL Server windowing functions, wants to see Microsoft add to the T-SQL specification.[5] As I discovered to my horror, you can’t combine existing clauses like ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING with ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING to get the same effect. As a result, I had to perform the recalculations of the regressions in a series of table variables that are much less readable and efficient than an EXCLUDE CURRENT ROW clause might be, albeit more legible than the last remaining alternative, a zillion nested subqueries. I’m not yet fluent enough in T-SQL to say if these table variables cause more of a performance impact than subqueries in contexts like this, but this is one case in which they’re appropriate because readability is at a premium. It may also be worthwhile to investigate temporary tables as a replacement; so far, this method does seem to be faster than the common table expression (CTE) method I originally tried. I initially programmed an entire series of matrix math functions and stored procedures to derive both Cook’s and Mahalanobis Distances, since both are often defined in terms of matrix math notation, unlike many other distances used for data mining purposes. That method worked well, except that it ran into a brick wall: SQL Server has an internal limitation of 32 locking classes, which often leads to “No more lock classes available from transaction” error messages with recursive table-valued parameters. This is by design and I have yet to see any workarounds posted or any glimmer of hope that Microsoft intends to ameliorate it in future upgrades, which means no matrix math using table-valued parameters for the foreseeable future.
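Before wading through the full procedure in Figure 1, it may help to see the leave-one-out pattern in isolation. The minimal, self-contained sketch below (with a handful of made-up sample values) uses a self-join that excludes the current row to simulate the missing EXCLUDE CURRENT ROW clause, so that each row’s aggregates are computed over every other row; Step 5 of Figure 1 applies the same trick on a larger scale.

-- a minimal, self-contained illustration of the leave-one-out aggregate pattern; the sample values are made up purely for demonstration
DECLARE @RegressionTable table (ID bigint IDENTITY (1,1), Value1 decimal(38,21), Value2 decimal(38,21))
INSERT INTO @RegressionTable (Value1, Value2) VALUES (1, 10), (2, 19), (3, 35), (4, 41)

SELECT T1.ID, Avg(T2.Value1) AS LocalMean1, StDev(T2.Value1) AS LocalStDev1
FROM @RegressionTable AS T1
       INNER JOIN @RegressionTable AS T2
       ON T2.ID <> T1.ID -- equivalent to the ON T2.ID > T1.ID OR T2.ID < T1.ID condition in Figure 1
GROUP BY T1.ID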
…………Yet another issue I ran into was interpreting the notation for Cook’s Distance, which can be arrived at from two different directions: the more popular method seems to be the series of calculations outlined two paragraphs above, but the same results can be had by first calculating an intermediate figure known as Leverage. This can be derived from what is known as a Hat Matrix, which can be easily derived in the course of calculating standard regression figures like MSE, predictions, residuals and the like. Unlike most other regression calculations, which are defined in terms of standard arithmetic operations like division, multiplication, etc., the notation for deriving Leverage is almost always given in terms of matrices, since it’s derived from a Hat Matrix. It took me a lot of digging to find an equivalent expression of Leverage in terms of arithmetic operations rather than matrix math, which I couldn’t use due to the locking issue. It was a bit like trying to climb a mountain, using a map from the other side; I was able to easily code all of the stats in the @RegressionTable in Figure 1, alongside many other common regression figures, but couldn’t tell exactly which of them could be used to derive the Hat Matrix and Leverage from the opposite direction. As usual, the folks at CrossValidated (StackExchange’s data mining forum) pulled my bacon out of the fire.[6] While forcing myself to learn to code the intermediate building blocks of common mining algorithms in T-SQL, one of the most instructive lessons I’ve learned is that translating notations can be a real stumbling block, one that even professionals encounter. Just consider that a word to the wise, for anyone who tries to acquire the same skills from scratch as I’m attempting to do. Almost all of the steps in Figure 1 revolve around common regression calculations, i.e. intercepts, slopes, covariance and the like, except that fresh regression models are calculated for each row. The actual Cook’s Distance calculation isn’t performed until Step #6. At that point it was trivial to add a related stat known as DFFITS, which can be converted back and forth from Cook’s D; usually when I’ve seen DFFITS mentioned (in what little literature I’ve read), it’s in conjunction with Cook’s, which is definitely a more popular means of measuring the same quantity.[7] For the dividend, we use the difference between the prediction for each row and the prediction when that row is left out of the model; for the divisor, we use the standard deviation of the model when that row is omitted, times the square root of the leverage. I also included the StudentizedResidual and the global values for the intercept, slope and the like in the final results, since it was already necessary to calculate them along the way; it is trivial to calculate many other regression-related stats once we’ve derived these table variables, but I’ll omit them for the sake of brevity since they’re not directly germane to Cook’s Distance and DFFITS.
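For reference, the textbook form of DFFITS that this calculation parallels is the one below, where \hat{y}_i is the full model’s prediction for the i-th row, \hat{y}_{(i)} is the prediction for that row when it is omitted from the model, s_{(i)} is the standard deviation with that row omitted and h_{ii} is the leverage; in Figure 1 the leave-one-out standard deviation of the second column stands in for s_{(i)}:

DFFITS_i = \dfrac{\hat{y}_i - \hat{y}_{(i)}}{s_{(i)} \sqrt{h_{ii}}}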

Figure 1: T-SQL Sample Code for the Cook’s Distance Procedure
CREATE PROCEDURE Calculations.CooksDistanceSP
@Database1 nvarchar(128), @Schema1 nvarchar(128), @Table1 nvarchar(128), @Column1 AS nvarchar(128), @Column2 AS nvarchar(128)
AS
DECLARE @SchemaAndTable1 nvarchar(400),@SQLString1 nvarchar(max),@SQLString2 nvarchar(max)
SET @SchemaAndTable1 = @Database1 + '.' + @Schema1 + '.' + @Table1

SET @SQLString1 = 'DECLARE @MeanX decimal(38,21),@MeanY decimal(38,21),
@StDevX decimal(38,21), @StDevY decimal(38,21), @Count bigint,
@Correlation decimal(38,21),
@Covariance decimal(38,21),
@Slope decimal(38,21),
@Intercept decimal(38,21),
@MeanSquaredError decimal(38,21),
@NumberOfFittedParameters bigint

SET @NumberOfFittedParameters = 2
DECLARE @RegressionTable table
(ID bigint IDENTITY (1,1),
Value1 decimal(38,21),
Value2 decimal(38,21),
LocalSum bigint,
LocalMean1 decimal(38,21),
LocalMean2 decimal(38,21),
LocalStDev1 decimal(38,21),
LocalStDev2 decimal(38,21),
LocalCovariance decimal(38,21),
LocalCorrelation decimal(38,21),
LocalSlope AS LocalCorrelation * (LocalStDev2 / LocalStDev1),
LocalIntercept decimal(38,21),
PredictedValue decimal(38,21),
Leverage decimal(38,21),
AdjustedPredictedValue decimal(38,21),
GlobalPredictionDifference AS Value2 - PredictedValue,
AdjustmentDifference AS PredictedValue - AdjustedPredictedValue)

-- Step 1: load the two columns and retrieve the global aggregates
INSERT INTO @RegressionTable
(Value1, Value2)
SELECT ' + @Column1 + ', ' + @Column2 + '
FROM ' + @SchemaAndTable1 + '
WHERE ' + @Column1 + ' IS NOT NULL AND ' + @Column2 + ' IS NOT NULL

SELECT @Count=Count(CAST(Value1 AS Decimal(38,21))),
@MeanX = Avg(CAST(Value1 AS Decimal(38,21))), @MeanY = Avg(CAST(Value2 AS Decimal(38,21))),
@StDevX = StDev(CAST(Value1 AS Decimal(38,21))), @StDevY = StDev(CAST(Value2 AS Decimal(38,21)))
FROM @RegressionTable

-- Step 2: calculate the covariance, then the correlation
SELECT @Covariance = SUM((Value1 - @MeanX) * (Value2 - @MeanY)) / (@Count - 1)
FROM @RegressionTable

-- once we''ve got the covariance, it''s trivial to calculate the correlation
SELECT @Correlation = @Covariance / (@StDevX * @StDevY)

-- Step 3: calculate the slope and intercept, then the predicted values of the global regression model
SELECT @Slope = @Correlation * (@StDevY / @StDevX)
SELECT @Intercept = @MeanY - (@Slope * @MeanX)

UPDATE @RegressionTable
SET PredictedValue = (Value1 * @Slope) + @Intercept

-- Step 4: subtract the actual values from the PredictedValues and square them; add ''em together; then multiply the result by the reciprocal of the count,
-- as defined at the Wikipedia page "Mean Squared Error" http://en.wikipedia.org/wiki/Mean_squared_error
SELECT @MeanSquaredError = SUM(Power((PredictedValue - Value2), 2)) / CAST(@Count - @NumberOfFittedParameters AS float)
FROM @RegressionTable'

SET @SQLString2 = '-- Step 5: recalculate alternate regression models for each row, plus the leverage from intermediate steps;
-- none of this is terribly complicated; there''s just a lot to fit in.
-- The outer SELECT is needed here because aggregates aren''t allowed in the main UPDATE statement (silly limitation)
UPDATE T0

SET LocalMean1 = T3.LocalMean1, LocalMean2 = T3.LocalMean2, LocalStDev1 = T3.LocalStDev1, LocalStDev2 = T3.LocalStDev2
FROM @RegressionTable AS T0
       INNER JOIN (SELECT T1.ID AS ID, Avg(T2.Value1) AS LocalMean1, Avg(T2.Value2) AS LocalMean2, StDev(T2.Value1) AS LocalStDev1, StDev(T2.Value2) AS LocalStDev2
       FROM   @RegressionTable AS T1
              INNER JOIN @RegressionTable AS T2
              ON T2.ID > T1.ID OR T2.ID < T1.ID
       GROUP BY T1.ID) AS T3
       ON T0.ID = T3.ID

UPDATE T0
SET LocalCovariance = T3.LocalCovariance, LocalCorrelation = T3.LocalCovariance / (LocalStDev1 * LocalStDev2), LocalSum = T3.LocalSum
FROM @RegressionTable AS T0
       INNER JOIN (SELECT T1.ID AS ID, SUM((T2.Value1 - T2.LocalMean1) * (T2.Value2 - T2.LocalMean2)) / (@Count - 1) AS LocalCovariance,
       SUM(Power(T2.Value1 - T2.LocalMean1, 2)) AS LocalSum
       FROM @RegressionTable AS T1
              INNER JOIN @RegressionTable AS T2
              ON T2.ID > T1.ID OR T2.ID < T1.ID
       GROUP BY T1.ID) AS T3
       ON T0.ID = T3.ID

UPDATE T0
SET Leverage = T3.Leverage
FROM @RegressionTable AS T0
       INNER JOIN (SELECT ID, Value1, 1 / CAST(@Count AS float) + (CASE WHEN Dividend1 = 0 THEN 0 ELSE Divisor1 / Dividend1 END) AS Leverage
       FROM (SELECT ID, Value1, Power(Value1 - LocalMean1, 2) AS Divisor1, LocalSum AS Dividend1, Power(Value2 - LocalMean2, 2) AS Divisor2
              FROM @RegressionTable) AS T2) AS T3
              ON T0.ID = T3.ID

UPDATE @RegressionTable
SET LocalIntercept = LocalMean2 - (LocalSlope * LocalMean1)

UPDATE @RegressionTable
SET AdjustedPredictedValue = (Value1 * LocalSlope) + LocalIntercept

-- Step 6: return the Cook''s Distance, DFFITS and related stats for each row
SELECT ID, Value1, Value2, StudentizedResidual, Leverage, CooksDistance, DFFITS
FROM (SELECT ID, Value1, Value2, GlobalPredictionDifference / LocalStDev1 AS StudentizedResidual, Leverage,
(Power(GlobalPredictionDifference, 2) / (@NumberOfFittedParameters * @MeanSquaredError)) * (Leverage / Power(1 - Leverage, 2)) AS CooksDistance, AdjustmentDifference / (LocalStDev2 * Power(Leverage, 0.5)) AS DFFITS
FROM @RegressionTable) AS T1
ORDER BY CooksDistance DESC

-- also return the global stats
SELECT @MeanSquaredError AS GlobalMeanSquaredError, @Slope AS GlobalSlope, @Intercept AS GlobalIntercept, @Covariance AS GlobalCovariance, @Correlation AS GlobalCorrelation'

SET @SQLString1 = @SQLString1 + @SQLString2

--SELECT @SQLString1 -- uncomment this to debug dynamic SQL errors
EXEC (@SQLString1)

…………Each of the procedures I’ve posted in previous articles has made use of dynamic SQL similar to that in Figure 1, but in this case there’s simply a lot more of it; it helps to at least have the operations presented sequentially in a series of updates to the @RegressionTable variable rather than bubbling up from the center of a set of nested subqueries. The first three steps in Figure 1 are fairly straightforward: we retrieve the global aggregates we need as usual, then calculate the covariance (a more expensive operation that involves another scan or seek across the table) from them, followed by the slope and intercept in succession.[8] The MSE calculation in Step 4 requires yet another scan or seek across the whole table. Step 5 accounts for most of the performance costs, since we cannot use the aggregates derived in Step 1 for the new regression models we have to build for each data point. It was necessary to break up the dynamic SQL into two chunks and recombine them via the concluding SET @SQLString1 = @SQLString1 + @SQLString2 statement, which prevents a bug (or “feature”) that apparently limits the size of strings that can be assigned at any one time, even with nvarchar(max).[9] Various thresholds are sometimes baked into the algorithm to flag “influential points” but I decided to allow users to add their own, in part to shorten the code and in part because there’s apparently not a consensus on what those thresholds ought to be.[10]
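The truncation behavior is easy to work around outside of this procedure as well. The sketch below (with hypothetical string contents standing in for the real dynamic SQL) shows the two usual remedies I’m aware of: building the string in separate assignments as Figure 1 does, or casting the first operand to nvarchar(max) so that the entire concatenation is carried out at that size instead of being silently cut off at the smaller string limit.

DECLARE @SQLString1 nvarchar(max), @SQLString2 nvarchar(max)

-- remedy #1: build the string in separate assignments, as in Figure 1
SET @SQLString1 = N'...first few thousand characters of dynamic SQL...'
SET @SQLString2 = N'...the remainder...'
SET @SQLString1 = @SQLString1 + @SQLString2

-- remedy #2: force the concatenation itself to be performed as nvarchar(max)
SET @SQLString1 = CAST(N'' AS nvarchar(max)) + N'...first chunk...' + N'...second chunk...'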
…………Aside from the lengthy computations, the Cook’s Distance procedure follows much the same format as other T-SQL solutions I’ve posted in this series. One of the few differences is that there is an extra Column parameter so that the user can compare two columns in any database for which they have the requisite access, since Cook’s Distance involves a comparison between two columns rather than a test of a single column as in previous tutorials. The @DecimalPrecision parameter is still available so that users can avoid arithmetic overflows by manually setting a precision and scale appropriate to the columns they’ve selected. To decomplicate things I omitted the usual @OrderByCode for sorting the results and set a default of 2 for @NumberOfFittedParameters. As usual, the procedure resides in a Calculations schema and there is no code to handle validation, SQL injection or spaces in object names. Uncommenting the next-to-last line allows users to debug the dynamic SQL.

Figure 2: Results for the Cook’s Distance Query
EXEC Calculations.CooksDistanceSP
              @Database1 = N'DataMiningProjects',
              @Schema1 = N'Health',
              @Table1 = N'DuchennesTable',
              @Column1 = N'PyruvateKinase',
              @Column2 = N'Hemopexin'

Cook's Distance Results

…………As I have in many previous articles, I ran the first test query against a 209-row dataset on the Duchennes form of muscular dystrophy, which I downloaded from the Vanderbilt University’s Department of Biostatistics. As the results in Figure 2 show, the protein Hemopexin had the greatest influence on the Pyruvate Kinase enzyme at the 126th record. Here the Cook’s Distance was 0.081531, which was about 4 times higher than the value for the sixth-highest Cook’s Distance, with a bigint primary key of 23, so we may safely conclude that this record is an outlier, unless existing domain knowledge suggests that this particular point is supposed to contain such extreme values. Be warned that for a handful of value pairs, my figures differ from those obtained in other mining tools (which believe it or not, also have discrepancies between each other) but I strongly suspect that depends on how nulls and divide-by-zeros are dealt with, for which there is no standard method in Cook’s D. These minor discrepancies are not of critical importance, however, since the outlier detection figures are rarely plugged into other calculations, nor is it wise to act on them without further inspection.
…………The procedure executed in 19 milliseconds on the table I imported the Duchennes data into, but don’t let that paltry figure deceive you: on large databases, the cost rises exponentially to the point where it becomes prohibitive. There were only a handful of operators, including two Index Seeks which accounted for practically the entire cost of the query, which means that it may be difficult to gain much performance value from optimizing the execution plans. This brings us to the bad news: the procedure simply won’t run against datasets of the same size as the Higgs Boson dataset I downloaded from the University of California at Irvine’s Machine Learning Repository and have been using to stress-test my sample T-SQL throughout this series. Since we need to recalculate a new regression model for each of the 11 million rows, we’re at least talking about 11 million squared, or 121 trillion rows of regression data in order to derive 11 million separate Cook’s Distances. I believe that puts us in the dreaded EXPTIME and EXPSPACE computation complexity classes; without an EXCLUDE CURRENT ROW windowing clause or some other efficient method of calculating intermediate regression aggregates in one pass, I know of no other way to reduce this down from an exponential running time to a polynomial. I’m weak in GROUP BY operations, so perhaps another workaround can be derived through those – but if not, we’re up the proverbial creek without a paddle. Even if you can wait the lifetime of the universe or whatever it takes to run the 11,000,000² regression operations, it is unlikely that you’ll have enough spare room in TempDB for 121 trillion rows. The price to be paid for the more sophisticated insights Cook’s Distance provides is that it simply cannot be run against Big Data-sized datasets, at least in its current form.
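Spelled out, the arithmetic behind those intermediate row counts is simply:

11{,}000{,}000^{2} = 1.21 \times 10^{14} \approx 121 \text{ trillion rows of regression data}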
…………As we’ve seen so many times in this series, scaling up existing outlier detection methods to Big Data sizes doesn’t merely present performance issues, but logical ones; in the case of Cook’s Distance, omitting a single observation is only going to have an infinitesimal impact on a regression involving 11 million records, no matter how aberrant the data point might be. Since it is derived from linear least squares regression, Cook’s Distance shares some of its limitations, like “the shapes that linear models can assume over long ranges, possibly poor extrapolation properties, and sensitivity to outliers.”[11] We’re trying to harness that sensitivity when performing outlier detection, but the sheer size of the regression lines generated from Big Data may render it too insensitive to justify such intensive computations. When you factor in the performance costs of recalculating a regression model for that many rows, the usefulness of this outlier identification method obviously comes into question. On the other hand, the procedure did seem to identify outliers with greater accuracy when run against other tables I’m very familiar with, which consisted of a few thousand rows apiece. There may be a happy medium at work here, in which Cook’s Distance is genuinely useful for a certain class of moderately sized tables in situations where the extra precision of this particular metric is needed. When deciding whether or not the extra computational cost is worth it for a particular table, keep in mind that the performance costs are magnified in my results because I’m running them on a wimpy eight-core semblance of an AMD workstation that has more in common with Sanford and Son’s truck than a real production environment server. Furthermore, the main uses in this field for outlier detection of any kind are in exploratory data mining and data quality examinations, which don’t require constant, ongoing combing of the database for outliers; these are issues of long-term importance, not short-term emergencies like a relational query that has to be optimized perfectly because it may have to run every day, or even every second. Tests like this can be left for off-peak hours on a weekly or monthly basis, so as not to interfere with normal operations. Cook’s Distance might also be preferred when searching for a specific type of outlier, i.e. those that could throw off predictive modeling, just as Benford’s Law is often selected when identifying data quality problems is paramount, especially the intentional data quality issue we call fraud. Cook’s Distance might also prove more useful in cases where the relationship between two variables is at the heart of the questions that the tester chooses to ask. Cook’s and DFFITS can also apparently be used to convert back and forth from another common stat I haven’t yet learned to use, the Wald Statistic, which is apparently used for ferreting out the values of unknown parameters.[12] If there’s one thing I’ve learned while writing this series, it’s that there’s a shortage of outlier detection methods appropriate to the size of the datasets that DBAs work with. Thankfully, the workaround I translated into T-SQL for my next column allows us to use Mahalanobis Distance to find outliers across columns, without the cosmic computational performance hit for calculating Cook’s D on large SQL Server databases. As with Cook’s D, there are some minor accuracy issues, but these are merely cosmetic when looking for outliers, where detection can be automated but handling ought to require human intervention.


[1] For a quick run-down, see the Wikipedia page “Non-Euclidean Geometry” at http://en.wikipedia.org/wiki/Non-Euclidean_geometry

[2] Cook, R. Dennis, 1977, “Detection of Influential Observations in Linear Regression,” pp. 15-18 in Technometrics, February 1977. Vol. 19, No. 1. A .pdf version is available at the Universidad de São Paulo’s Instituto de Matematica Estatística web address http://www.ime.usp.br/~abe/lista/pdfWiH1zqnMHo.pdf

[3] I originally retrieved it from the Wikipedia page “Cook’s Distance” at http://en.wikipedia.org/wiki/Cook%27s_distance , but there’s no difference between it and the one in Cook’s paper.

[4] I used the formula defined at the Wikipedia page “Mean Squared Error,” at the web address http://en.wikipedia.org/wiki/Mean_squared_error. The same page states that there are two more competing definitions, but I used the one that the Cook’s Distance page linked to (The Wikipedia page “Residual Sum of Squares” at http://en.wikipedia.org/wiki/Residual_sum_of_squares may also be of interest.):

                “In regression analysis, the term mean squared error is sometimes used to refer to the unbiased estimate of error variance: the residual sum of squares divided by the number of degrees of freedom. This definition for a known, computed quantity differs from the above definition for the computed MSE of a predictor in that a different denominator is used. The denominator is the sample size reduced by the number of model parameters estimated from the same data, (n-p) for p regressors or (n-p-1) if an intercept is used.[3] For more details, see errors and residuals in statistics. Note that, although the MSE is not an unbiased estimator of the error variance, it is consistent, given the consistency of the predictor.”

“Also in regression analysis, “mean squared error”, often referred to as mean squared prediction error or “out-of-sample mean squared error”, can refer to the mean value of the squared deviations of the predictions from the true values, over an out-of-sample test space, generated by a model estimated over a particular sample space. This also is a known, computed quantity, and it varies by sample and by out-of-sample test space.”

[5] p. 47, Ben-Gan, Itzik, 2012, Microsoft SQL Server 2012 High-Performance T-SQL Using Window Functions . O’Reilly Media, Inc.: Sebastopol, California.

[6] See the CrossValidated thread titled “Is It Possible to Derive Leverage Figures Without a Hat Matrix?”, posted by SQLServerSteve on June 26, 2015 at http://stats.stackexchange.com/questions/158751/is-it-possible-to-derive-leverage-figures-without-a-hat-matrix . Also see the reply by the user Glen_B to the CrossValidated thread titled “Which of these points in this plot has the highest leverage and why?” on July 9, 2014 at http://stats.stackexchange.com/questions/106191/which-of-these-points-in-this-plot-has-the-highest-leverage-and-why/106314#106314

[7] See the formula at the Wikipedia page “DFFITS” at https://en.wikipedia.org/wiki/DFFITS

[8] I retrieved this formula from the most convenient source, the Dummies.com page “How to Calculate a Regression Line” at the web address http://www.dummies.com/how-to/content/how-to-calculate-a-regression-line.html

[9] See the response by the user named kannas at the StackOverflow thread, “Nvarchar(Max) Still Being Truncated,” published Dec. 19, 2011 at the web address http://stackoverflow.com/questions/4833549/nvarcharmax-still-being-truncated

[10] See the Wikipedia page “Cook’s Distance” at http://en.wikipedia.org/wiki/Cook%27s_distance

[11] See National Institute for Standards and Technology, 2014, “ Least Squares Regression,” published in the online edition of the Engineering Statistics Handbook. Available at http://www.itl.nist.gov/div898/handbook/pmd/section1/pmd141.htm

[12] See the Wikipedia pages “Cook’s Distance,” “DFFITS” and “Wald Test” at http://en.wikipedia.org/wiki/Cook%27s_distance, http://en.wikipedia.org/wiki/DFFITS and http://en.wikipedia.org/wiki/Wald_test respectively.


Integrating Other Data Mining Tools with SQL Server, Part 2.2: Minitab vs. SSDM and Reporting Services

By Steve Bolton

…………Professional statistical software like Minitab can fill some important gaps in SQL Server’s functionality, as I addressed in the last post of this occasional series of pseudo-reviews. I’m only concerned here with assessing how well a particular data mining tool might fit into a SQL Server user’s toolbox, not with its usefulness in other scenarios; that is why I made comparisons solely on the ability of various SQL Server components to compete with Minitab’s functionality, whenever the two overlapped. Most of the use cases for Minitab (and possibly its competitors, most of which I have yet to try) come under the rubric of statistics, which falls in the cracks between T-SQL aggregates and the “Big Data”-sized number-crunching power of SQL Server Analysis Services (SSAS) and SQL Server Data Mining (SSDM). For example, as I mentioned last time around, Minitab implements many statistical functions, tests and workflows that are not available in SSAS or SSDM, but which can be coded in T-SQL; whether or not it is profitable to do so varies by the simplicity of each particular stat and the skill level of the coder in translating the math formulas into T-SQL (something I’m hell-bent on acquiring). In this installment, I’ll cover some of Minitab’s implementations of more advanced algorithms that we’d normally use SSDM for, but which are sometimes simple enough to still be implemented in T-SQL. So far in this haphazard examination of Microsoft’s competitors, the general rule of thumb seems to be that SSDM is to be preferred, particularly on large datasets, except when it doesn’t offer a particular algorithm out-of-the-box. That happens quite often, given that there are literally thousands of algorithms that no single company can ever implement in full. Minitab offers a wider and more useful selection of these alternative algorithms than WEKA, an open source tool profiled in the first couple of articles. In cases when SQL Server and Minitab compete head-to-head, SSDM wins hands down in both performance and usability. As we shall see, the same is true in comparisons of Minitab’s visualizations to SQL Server Reporting Services (SSRS), where the main dividing line is between out-of-the-box functionality vs. customizable reports.
…………Minitab’s data mining capabilities differ from SQL Server’s mainly by the fact that it implements algorithms of lower sophistication, but with a wider array of really useful variations and enhancements. The further we get from ordinary statistical tasks like hypothesis testing and analysis of variance (ANOVA) towards machine learning and other examples of “soft computing,” the more the balance shifts back to SSDM. I couldn’t find any reference in Minitab’s extensive Help files to topics that are often associated with pure data mining, like neural nets, fuzzy sets, entropy, decision trees, pattern recognition or the Kullback-Leibler Divergence. Nor is there any mention of information, at least as the term was used in the professional sense of information theory, or of the many measures associated with such famous names in the field as Claude Shannon or Andrey Kolmogorov.[1] Given that, it’s not surprising that there’s no mention of information geometry, which is apparently a bleeding edge topic in data mining and knowledge discovery. On the other hand, Minitab implements four of the nine algorithms found in SSDM, as discussed in my earlier amateur tutorial series, A Rickety Stairway to SQL Server Data Mining. Out of these four, Minitab clearly has the advantage in terms of features when it comes to Linear Regression – but definitely not when it comes to performance.
…………As depicted in Figure 1, many more types of regression are available in Minitab, like nominal, ordinal, orthogonal, nonlinear, partial least squares and Poisson. Each of these has its own set of options and parameters which greatly enhance their usefulness, most of which are not available in SSDM. For example, it is easier to access the resulting regression equations and related stats in the output of ordinary regression routines, which can return additional metrics like the Durbin-Watson Test that are not available in SQL Server at all. On top of these myriad enhancements, Minitab has entire classes of algorithms that SSDM does not provide out-of-the-box. As shown in Figure 3, many different functions can be plugged into Minitab’s version of nonlinear regression, thereby making it into an entire family of related algorithms, many of which can be quite useful in analysis. There’s no reason why Microsoft could not have implemented all of these algorithms in SSDM, but as I lamented often in the Rickety series, the top brass is slowly squandering an entire market through almost a decade of pointless neglect. It is a shame that Microsoft doesn’t know how good its own product is, given that SSDM still blows away its rivals, at least in areas where the same functionality competes head-to-head.
…………As mentioned in the last article, Minitab worksheets are limited to just 10 million rows, which means that  displaying all 11 million rows in the Higgs Boson dataset[2] I’ve been using for practice data for the last couple of tutorial series is out of the question. In SQL Server Management Studio (SSMS) this is no problem, but the real issue here is not a matter of display, but of the fact that we can’t perform calculations on this many records. When I tried to run a regression on the first 10 million rows, it ran on one core for 16 minutes and ended up gobbling up 2 gigs of memory. It crashed during the loading phase before even initiating the regression calculations, with the error message: “Insufficient memory to complete operation. Minitab ran out of memory and was unable to recover. Close other applications to reduce memory and then press Retry. If this error continues you may need to exit Minitab and restart your system. If you select Abort, Minitab will terminate and you may lose work you have not saved.” In contrast, SSDM was able to run a regression on the same dataset in just 3 minutes and 54 seconds. SSDM’s version of Logistic Regression was able to process the whole table in just 3:32. Given that Minitab can’t even load that many records into a worksheet, let alone compute the regressions, the edge in performance definitely goes to SQL Server. This was accomplished without any of the myriad server options that can be used to enhance performance in SQL Server, none of which are available in Minitab; the same rule essentially holds when we compare T-SQL relational solutions to Minitab’s functionality, which doesn’t offer any indexing, tracing, tuning or other such tweaks that we take for granted. Furthermore, SSDM can better handle marking columns as inputs, outputs or both in its mining models (i.e. Predict, PredictOnly, etc.). On the other hand, SSDM lacks a good regression viewer; we’re limited to the Decision Trees and Generic Content viewers, when what we really need is a regression plot of the kind that Minitab returns out-of-the-box, like the Fitted Line Plot in Figure 4.[3] Since SSDM doesn’t implement this, I would either write a plug-in visualization of the kind I wrote about in A Rickety Stairway to SQL Server Data Mining, Part 15, The Grand Finale: Custom Data Mining Viewers, or write an SSRS report with a line graph. When mining large datasets using existing algorithms, I would first perform the calculations in SSDM, then display the regression lines in an SSRS report or custom mining viewer. I would integrate Minitab into this workflow by performing calculations on large samples of the data, in order to derive the extra regression stats it provides. In cases of small datasets, tight deadlines or algorithms that SSDM doesn’t have, I’d go with Minitab, at least in situations where T-SQL solutions would also be beyond my skill level or would take too much time to write and test.

Figures 1 and 2: The Regression and Time Series Menus
Minitab Regression Menu
Minitab Time Series Menu

Figure 3: The Many Options for Nonlinear Regression
Minitab Nonlinear Regression Options

Figure 4: An Example of a Fitted Line Plot
Minitab Fitted Line Plot

…………The same principles essentially apply to Minitab’s version of Time Series, which is also accessible through the Stat menu. Figure 2 shows that Minitab provides a lot of functionality that SSDM does not, like Trend Analysis (which includes some useful Seasonal Analysis choices), Decomposition and Winters’ Method. Some of these options return accuracy measures like Mean Absolute Percentage Error (MAPE), Mean Absolute Deviation (MAD) and Mean Squared Deviation (MSD), along with other stats that SSDM does not provide. Another advantage is that Minitab can calculate Time Series using linear, quadratic, exponential growth and “S-Curve (Pearl-Reed logistic)” models. The gap in functionality is not as wide as with regression, however, given that it is not terribly difficult to implement various types of lags, autocorrelations, differences and smoothing operations with T-SQL windowing functions that scale better, as in the sketch below. SSDM and Minitab have competing implementations of ARIMA, but I strongly prefer the Microsoft version on the strength of its user interface; the Minitab version is mainly useful for making some of the intermediate stats readily available, like the residuals and Modified Box-Pierce (Ljung-Box) results. Time Series in Minitab is hobbled, however, by the fact that it can only calculate one variable per Time Series, unlike SSDM, which can plot them all. The Minitab Time Series Plot is also bland in comparison to the Microsoft Time Series Viewer. Once again, I would use Minitab’s Time Series only to supplement SSDM with additional stats or for cases where there’s a need for alternative algorithms, like Winters’ Method. SSDM would be my go-to tool for any functionality they implement in common, especially when any serious heavy lifting is called for. For low-level stats like autocorrelation and moving averages, I would bypass Minitab altogether in favor of my homegrown T-SQL and SSRS reports.
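…………To illustrate the point about windowing functions, here is a minimal T-SQL sketch of a lag, a first difference and a simple moving average; the dbo.PhysicsTimeSeries table and its ReadingDate and Value1 columns are hypothetical stand-ins, not objects from my actual practice databases.

-- lag, first difference and 7-row moving average via windowing functions
-- (dbo.PhysicsTimeSeries, ReadingDate and Value1 are hypothetical names)
SELECT ReadingDate,
       Value1,
       LAG(Value1, 1) OVER (ORDER BY ReadingDate) AS Lag1,
       Value1 - LAG(Value1, 1) OVER (ORDER BY ReadingDate) AS FirstDifference,
       AVG(Value1) OVER (ORDER BY ReadingDate
                         ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS MovingAvg7
FROM dbo.PhysicsTimeSeries
ORDER BY ReadingDate;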

Figure 5: How to Access Minitab’s Clustering Algorithms
Minitab Multivariate and Clustering

…………One of Minitab’s main strengths is that it meets some use cases tangential to data mining, such as Principal Components Analysis, Maximum Likelihood and other Multivariate items and subitems. SSDM doesn’t do any of that, but it does Clustering and it does it well. Minitab doesn’t implement the subtype I discussed in A Rickety Stairway to SQL Server Data Mining, Algorithm 8: Sequence Clustering or the Expectation Maximization (EM) method mentioned in A Rickety Stairway to SQL Server Data Mining, Algorithm 7: Clustering, but both products implement the most common flavor, K-Means. There are literally thousands of clustering algorithms in the research literature, each of which is useful for specific use cases, so no single product is going to implement them all. Even if the top brass at Microsoft were fully committed to SSDM, they’d never be able to incorporate them all, which means that clustering software doesn’t necessarily compete head-to-head. In this case, Minitab has the advantage in terms of enhancements, such as choices of Linkage Methods like Average, Centroid, Complete, McQuitty, Median, Single and Ward, or distance metrics like Euclidean, Manhattan, Pearson, Squared Euclidean and Squared Pearson. Aside from these options and a couple of related stats, however, SSDM outclasses Minitab. In terms of performance, processing a K-Means mining model on all of the columns of the 5-gigabyte Higgs Boson table took just 1:36:42 on my wheezing old development machine. As noted in the earlier discussion on regression, Minitab can’t even load datasets of this size without choking. That’s not surprising, given that it’s intended mainly for statistical analysis on datasets of small or moderate size, not heavy number-crunching on Big Data. In terms of visualization, the SSDM Cluster Viewer is light years ahead of the simple text output and dendrograms available in Minitab. Clustering is an inherently visual task, but the graphics in Figures 6 and 7 simply don’t convey information as concisely and efficiently as the Cluster Viewer, which also has the advantage of being an interactive tool.
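…………For a sense of how simple the underlying distance math is, here is a minimal T-SQL sketch of a single K-Means assignment step using plain Euclidean distance; the dbo.DataPoints and dbo.ClusterCentroids tables and their columns are hypothetical, and a real implementation would also need to recompute the centroids and iterate until they stabilize.

-- assign each point to its nearest centroid by Euclidean distance
-- (dbo.DataPoints and dbo.ClusterCentroids are hypothetical tables)
SELECT P.ID,
       Nearest.ClusterID,
       Nearest.EuclideanDistance
FROM dbo.DataPoints AS P
CROSS APPLY (SELECT TOP (1) C.ClusterID,
                    SQRT(SQUARE(P.Column1 - C.Centroid1)
                       + SQUARE(P.Column2 - C.Centroid2)) AS EuclideanDistance
             FROM dbo.ClusterCentroids AS C
             ORDER BY SQUARE(P.Column1 - C.Centroid1)
                    + SQUARE(P.Column2 - C.Centroid2)) AS Nearest;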

Figures 6 and 7: Sample Session Output and Dendrogram for Minitab Clustering
Minitab Clustering Text Output Minitab Dendrogram

 Figure 8: The Minitab Graph Menu
Minitab Graphics Menu

…………Many of the individual statistical tests, functions, algorithms and Assistant workflows return various plots in separate windows, alongside the data returned in the worksheets and the text output in the Session window. Most of these scattered visualizations are collected in the Graph menu depicted above, or can be found in the Graphical Analysis Assistant mentioned in my last post. Other common visualizations like run charts and Pareto charts are available from the Quality Tools menu, while the Control Charts item on the Stat menu provides access to plots for some simple stats like moving averages. The advantages of all of the above can be summed up in one word: convenience. They’re all implemented out-of-the-box, thereby eliminating the need to write your own reports. On the other hand, someone with the skill to code their own SSRS reports will quickly find themselves chafing at the limitations of these canned graphics, which offer less in the way of customization. For example, the Line Plot… command implements a graphic not available out-of-the-box in SQL Server, which allows users to view associations across the variables in a dataset. It quickly becomes cluttered when there are many distinct values, an obstacle that SSRS could deal with far more efficiently by programmatically changing the colors, shapes, sizes and other properties of the graphic elements as needed. Users are basically stuck with the format Minitab provides, with some minor customizations available through such means as right-clicking the graphic elements, as in the sample histogram in Figure 9. Sometimes that’s good enough to get the job done; whether or not it suffices for a particular analyst’s needs is in part dictated by the data and problems at hand, and in part a highly individual choice dependent on their skills.
…………The Dotplot is rather ugly and the Stem-and-Leaf is output as text; coding the latter in T-SQL and hooking it up to Reporting Services isn’t terribly difficult and looks much better, as I’ve discovered first-hand. Histograms can be returned with many of the statistical functions mentioned in my last blog post, plus many of the mining algorithms mentioned here. As I demonstrated in Outlier Detection with SQL Server, part 6.1 – Visual Outlier Detection with Reporting Services, though, these can be implemented fairly quickly in SSRS with a lot more eye candy and customizability. Probability plots are also returned by many functions and tests, but only for certain distributions, like the Gaussian (i.e. “normal”), lognormal, smallest extreme value, largest extreme value and various takes on the log-logistic, exponential, gamma and Weibull. I will demonstrate how to include some of these in SSRS reports in a future article on goodness-of-fit testing with SQL Server. The concept of empirical distribution functions (EDFs) will also be introduced in articles on the Kolmogorov-Smirnov and Lilliefors Tests in that future series. I like the Matrix Plots, but there’s nothing in them that can’t be done in SSRS. The scatter, bubble, bar and pie charts are all definitely inferior to their SSRS counterparts, as are the 3D versions of the scatter plot. I prefer SSDM’s Time Series visualizations to Minitab’s, although that’s more of a judgment call. I figured that the box and interval plots would have an advantage over SQL Server reports in their ability to overcome the display issues I mentioned in Outlier Detection with SQL Server, part 6.2: Visual Outlier Detection with Box Plots in Reporting Services. Basically, SSRS only allows one resultset from each stored procedure, thereby limiting its ability to display summary statistics alongside individual records without doing client-side calculations – which just isn’t going to happen on cubes, mining models and Big Data-sized relational tables. Unfortunately, I received my first crashes on both plot types, on a practice dataset of just 1,715 records; Minitab started running on one core (no others were in use), with no discernible disk activity and no growth in memory usage, and I had to kill the process after a couple of minutes because the memory use wasn’t budging at all. There is apparently no Escape command in Minitab, which is something that really comes in handy in SQL Server for runaway queries. The area, marginal, probability distribution and individual value plots are just really simple special cases of some of the aforementioned plots, so I’ll skip over them. Perhaps the only two Minitab visualizations I’d use for any purpose other than convenience are the interval plots mentioned above, plus the contour and 3D surface plots depicted below. The surface plots have some cool features, such as wireframe display.
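…………As an aside on the single-resultset limitation mentioned above, one workaround is to return the summary statistics alongside the detail rows in the same resultset with window aggregates, so the SSRS report never needs a second query or client-side calculations. This is only a minimal sketch, and the table and column names are hypothetical.

-- summary stats returned on every detail row via empty OVER () clauses
-- (dbo.SampleTable, ID and Column1 are hypothetical names)
SELECT ID,
       Column1,
       AVG(Column1) OVER () AS OverallMean,
       STDEV(Column1) OVER () AS OverallStDev,
       MIN(Column1) OVER () AS OverallMin,
       MAX(Column1) OVER () AS OverallMax
FROM dbo.SampleTable;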

Figure 9: An Example of a Minitab Histogram
Minitab Histogram Example

Figures 10 and 11: Examples of Contour and 3D Surface Plots in Minitab
Minitab Contour Plot Example
Minitab Surface Plot Example

…………It is good to keep in mind when reading these pseudo-reviews that I’m an amateur posting my initial reactions, not an expert with years of experience in these third-party tools. In the case of Minitab, we’re talking about an expensive professional tool with many nooks and crannies I never got to explore and a lot of functionality I’m not familiar with at all, like the Six Sigma and other engineering-specific tools I mentioned in the last article. I barely scratched the surface of a very big topic. That became crystal clear to me while writing these final paragraphs, when I discovered quite late in the game that more customization was available through that context menu in Figure 9. I’ve undoubtedly short-changed Minitab somewhere along the way, as I’m sure I did with WEKA a few articles ago. These articles are intended solely to provide introductions to these tools to SQL Server users, not expert advice to a general audience. Based on this limited experience, my general verdict is that I’d use Minitab as a go-to tool for functionality that SQL Server doesn’t provide out-of-the-box, like ANOVA, discriminant analysis, hypothesis testing and some of the alternative mining algorithms mentioned in this article. This is especially true when speaking of the helpful workflow Assistants Minitab provides for such tasks, particularly hypothesis testing and the unfamiliar engineering processes.
…………The less complex the functionality is, the more I’d lean towards T-SQL solutions, while the more complicated the underlying formulas become, the more I’d lean towards SSDM. Whenever SQL Server competes with Minitab head-on, it wins hands down, except in the area of supplemental stats; if only Microsoft had updated SSDM regularly over the years instead of abandoning the market, it might have been able to extend this advantage over Minitab to additional areas. This advantage is twice as strong whenever performance, tracing, higher-precision data types and tweaks like indexing are paramount. In terms of graphical capabilities, Minitab’s edge is in convenience, whereas SSRS definitely offers more power. Because the human mind processes most of its information visually, eye candy cannot be overlooked as a key step in conveying the complex information derived from mining tasks to end users. Perhaps Excel would be a worthy competitor in Minitab’s bread-and-butter, which is performing the kinds of common statistical tests that lie somewhere between the simple aggregates of T-SQL and the sophistication of SSDM algorithms. I’m ignorant of a lot that goes on with Excel, but it seems like more of a general-purpose spreadsheet than Minitab, which is a specialized program that just happens to use a spreadsheet interface; it’s no accident that I’ve so far found Minitab easier to use for statistical testing, given that this is its raison d’être.
…………Perhaps there are other statistical packages that would perform the same tasks in a SQL Server environment much better than Minitab; maybe I will run into a competitor tomorrow that performs the same functions at half the price. Until then, however, I will leave Minitab a big space in my toolbox in comparison to WEKA, which in turn outperformed the sloppy Windows versions of DB2 and Oracle, as I discussed in Thank God I Chose SQL Server part I: The Tribulations of a DB2 Trial and Thank God I Chose SQL Server part II: How to Improperly Install Oracle 11gR2. Data mining is a taxing topic that simply doesn’t leave much time or mental energy for the hassles of unprofessional interfaces. Usability is one of the many categories I will take into consideration throughout this occasional, open-ended series, along with performance, the quality and availability of algorithms, visualizations, documentation, error handling, crashes and portability, not to mention security, extensibility, logging and tracing. I have many of Minitab’s competitors in my cross-hairs, including RapidMiner, R, Pentaho, Autobox, Clementine, SAS and Predixion Software, a company founded by SSDM developers Jamie MacLennan and Bogdan Crivat. Which one I will examine next is still up in the air, and I don’t yet know what I’ll find when I finally try them out. My misadventures with DB2 and Oracle taught me not to delve into these topics with preconceived notions, because there are surprises lurking out there in the data mining marketplace – such as the Cinderella story of WEKA, the free tool which beat DB2 and Oracle hands-down in terms of reliability. The most pleasant surprise with Minitab was how smoothly the GUI worked, making it possible to perform many advanced statistical tests effortlessly.

[1] Kolmogorov is only mentioned in connection with the Kolmogorov-Smirnov goodness-of-fit test.

[2] I downloaded this last year from the University of California at Irvine’s Machine Learning Repository and converted it to a SQL Server table of about 5 gigs, which now resides in the sham DataMiningProjects database I’ve been using for practice purposes for the last few tutorial series.

[3] This example displays data from the same Duchennes muscular dystrophy dataset I’ve been using as practice data for the last several tutorial series, which I downloaded ages ago from Vanderbilt University’s Department of Biostatistics.

Integrating Other Data Mining Tools with SQL Server, Part 2.1: The Minuscule Hassles of Minitab

By Steve Bolton

…………It may be called Minitab, but SQL Server users can derive maximum benefits from the Windows version of this professional data mining and statistics tool – provided that they use it for tasks that SQL Server doesn’t do natively. This was one of the caveats I also observed when appraising WEKA in the first installments of this occasional series, in which I’ll pass on my misadventures with various third-party data mining tools to the rest of the SQL Server community. These articles are intended less as formal reviews than as preliminary answers to the question, “How would these fit in a SQL Server data miner’s toolbox?”
…………WEKA occupies a very small place in that toolbox, due to various shortcomings, including an inability to handle datasets that many SQL Server users would consider microscopic. In a recent trial with Minitab 17.1 I encountered many of the same limitations, but at much less serious levels – which really ought to be the case, given that WEKA is a free open source tool and Minitab costs almost $1,500 for a single-user license. I didn’t know what to expect going into the trial, since I had zero experience with it to that point, but I immediately realized how analysts could recoup the costs in a matter of weeks, provided that they encountered some specific use cases often enough. Minitab is useful for a much wider range of scenarios than WEKA, but the same principles apply to both: it is best to use SQL Server for any functionality Microsoft has provided out-of-the-box, but to use these third-party tools when their functionality can’t be coded quickly and economically in T-SQL and .Net languages like Visual Basic. Like most other analysis tools, Minitab only competes with SQL Server Data Mining (SSDM) tangentially; most of its functionality is devoted to statistical analysis, which neither SSDM nor SQL Server Analysis Services (SSAS) directly addresses. If I someday had enough clients with needs for activities like Analysis of Variance (ANOVA), experiment design or dozens of specific statistics that aren’t easily calculable in SQL Server, Minitab would be at the top of my shopping list (with the proviso that I’d also evaluate their competitors, which I have yet to do). I’m not a big Excel user, so I can’t speak at length on whether or not it compares favorably, but I personally found Minitab much easier to work with for statistical tasks like these. Minitab has some nice out-of-the-box visualizations which can be done with more pizzazz in Reporting Services, provided one has the need, skills and time to code them. One of Minitab’s shortcomings is that it simply doesn’t have the same “Big Data”-level processing capabilities as SQL Server. This was also the case with WEKA, but Minitab can at least perform calculations on hundreds of thousands of rows rather than a paltry few thousand. It doesn’t provide neural nets, sequence clustering or some of my other favorite SSDM algorithms from the A Rickety Stairway to SQL Server Data Mining series, but it does deliver dozens of alternatives for lower-level data mining methods like regression and clustering which SSDM doesn’t provide. If given the opportunity and the need, I’d incorporate this into my workflows for the kind of hypothesis testing routines I spoke of in Outlier Detection with SQL Server, preliminary testing of statistical code, formula validation and certain data mining problems, when one of Minitab’s specialized algorithms is called for.
…………One pitfall to watch out for when evaluating Minitab is that there are scammers out there selling counterfeit copies on some popular, above-board online shops. They’re just pricey enough to look like legitimate second-hand software being resold from surplus corporate inventory or the like; a Minitab support specialist politely advised me that resales are violations of the license agreement, so the only way to get a copy is to shell out the $1,500 fee for a single per-user license. Another Minitab rep was kind enough to extend my trial software for another 30 days after I simply ran out of time to collect information for these reviews, but I don’t think that will color the opinions I express here, which were already positive from my first hours of tinkering with it (except, that is, in moments where I considered the hefty price tag). One obvious plus in Minitab’s favor is that the installer worked right off the bat, which wasn’t the case in Thank God I Chose SQL Server part I: The Tribulations of a DB2 Trial. In fact, I never did get Oracle’s installer to work at all in Thank God I Chose SQL Server part II: How to Improperly Install Oracle 11gR2, thanks to some Java errors that Oracle has chosen not to fix, rather than to novice inexperience. Another plus is that the .chm Help section is crisply written and easy to follow even for non-experts, which is really critical when you’re dealing with advanced statistical topics that can quickly become mind-numbing. I didn’t run into the kinds of issues I did with SSDM and WEKA in terms of insufficient documentation. I was also pleasantly surprised to find that Minitab installs more than 350 practice datasets out of the box, far more than any other analytics or database-related product I can recall seeing, although I rarely use samples of this kind.

The Minitab GUI

…………At first launch, it is immediately obvious that spreadsheets are the centerpiece of the GUI, in conjunction with a text output window that centralizes summary data for all worksheets when algorithms are run on them. That window can quickly become cluttered with data if you’re running a lot of different analyses in one session; I was relieved to discover that this text-only format is supplemented by many non-text visualizations, which I’ll cover in the next article. The user interface is obviously based on Microsoft’s old COM standard, not Windows Presentation Foundation (WPF), but it’s well-done for COM and definitely leaps and bounds ahead of the Java interfaces used in third-party mining suites like Oracle, DB2 and WEKA. Incidentally, Minitab has automation capabilities, but these are exposed through the old COM standard, which is of course a lot more difficult to work with than .Net. Greater emphasis is placed on macros, which involve learning Session Commands that have an Excel-like syntax. Although I was generally pleased with the usability of the interface, there were of course a few issues, especially when I unconsciously expected it to behave like SQL Server Management Studio (SSMS). Sorting is really cumbersome in comparison, for instance. You have to go up to a menu and choose a Sort… command, then make sure it’s manually applied to every column if you want the worksheet synchronized; the sorted data then has to be placed into a new worksheet, none of which would fly in a SQL Server environment. Most of the action takes place in dialog windows brought up through menu commands, where end users are expected to select a series of worksheet columns and enter appropriate parameters. One pitfall is that typing constants into the dialog boxes is often a non-starter; most of the time you need to select a worksheet column from a list on the left side, which can be counter-intuitive in some situations. A lesser annoyance is that sometimes the columns to the left in the selection boxes are blank until you click inside a textbox, which makes you wonder whether the list is supposed to be greyed out to indicate unavailability. Another issue is that if you forget to change worksheets during calculations, Minitab will just dump rows from the table you’re doing computations on into whatever spreadsheet is topmost; as if to rub salt in the wound, the dumped data isn’t sorted either.
…………Minitab can import data from many sources, but in this series we’re specifically concerned with integrating it into a SQL Server environment. This is done entirely through ODBC; apparently Minitab also has Dynamic Data Exchange (DDE) capabilities, but I didn’t bother to try to connect through this old Windows 2000-era medium, which I haven’t used since I got my MCSD in Visual Basic 6. From the File menu, choose Query ODBC Database… as shown in Figure 1. If you don’t have a file or machine DSN set up yet, you will have to click New… in the Select Data Source window shown in Figure 2. The graphic after that depicts six windows you’ll have to navigate through to create one, most of which are self-explanatory; you basically select a SQL Server driver type, an existing server name and the type of authentication, plus a few connection options like the default database. Later in the process, you can test the connection in a window I left out for the sake of succinctness. There isn’t much going on here that’s terribly different from what we’re already used to with SQL Server; the only stumbling block I ran into was in the SQL Server Login windows in Figure 4, where I had to leave the Server SPN blank, just as I did in the DSN definition. I’m not up on Service Principal Names (SPNs), so there’s probably a sound reason I’m not aware of for leaving them out in this case.

 Figure 1: Using the Query Database (ODBC) Menu Command
Menu Command for Connecting

Figure 2: Selecting a DSN
Select Data Source

Figure 3: Six Windows Used to Set Up a SQL Server DSN
Minitab New Data Source Windows (1)Minitab New Data Source Windows (2)

Minitab New Data Source Windows (4)Minitab New Data Source Windows (5)

Minitab New Data Source Windows (6)Minitab New Data Source Windows (7)

Figure 4: Logging in with SQL Server
Two SQL Server Login Windows (2)

…………One of my primary concerns was that Minitab wouldn’t be able to display as many rows as SSMS, especially after WEKA choked on just 5,500 records in my first two tutorials. Naturally, one of the first things I did was stress-test it using the 11-million-row Higgs Boson dataset I’ve been using for practice data for the last couple of tutorial series, which originally came from the University of California at Irvine’s Machine Learning Repository and now takes up about 5 gigs in a SQL Server table. SSMS can handle it no problem on my wheezing old development machine, but I didn’t know what to expect, given that Minitab is not designed with “Big Data”-sized relational tables and cubes in mind. I was initially happy with how fast it loaded the first two float columns, which took about a minute, during which mtb.exe ran on one core. Then I discovered that I couldn’t scroll past the 10 millionth row, although the distance remaining on the scrollbar was roughly proportional to the missing million rows, i.e. about 10 percent. I then found the following limits in Minitab’s documentation, which SQL Server users might run into frequently given the size of the datasets we’re accustomed to:

                “Each worksheet can contain up to 4000 columns, 1000 constants, and 10,000,000 rows. The total number of cells depends on the memory of your computer, up to 150,000,000. This worksheet size limit applies to each worksheet in a Minitab project. For example, you could have two worksheets in your project, each with 150 million cells of data. Minitab does not limit the number of worksheets you can have in a project file. The maximum number of worksheets depends on your computer’s memory.”[1]

…………It is often said that SSMS is not intended for displaying data, yet DBAs, developers and others often use it that way anyway; I would regard it as something of a marketing failure on Microsoft’s part not to recognize that and deliberately upgrade the interface, rather than trying to force customers into a preconceived set of use cases. Despite this inattention, SSMS still gets the job of displaying large datasets done much better than Minitab; this may be a feature I just happen to notice more because I’m used to using SQL Server for data mining purposes, rather than the more popular use case of serving high transaction volumes. Comparisons of calculation speed and resource usage under heavy load are more appropriate, and in this area Minitab did better than expected. I wouldn’t use it to mine models of the size I used in the Rickety series, let alone terabyte-sized cubes, but it performed better than I expected on datasets of moderate size. Keep in mind, however, that it lacks almost all of the tweaks and options we can apply in SQL Server, like indexing, server memory parameters, dynamic management views (DMVs) to monitor resource usage, tools like Resource Governor and Profiler – you name it. That’s because SQL Server is designed to meet a different set of problems that only overlap Minitab’s at certain points, mainly in the data mining field.
…………Comparisons of stability during mining tasks are also more appropriate, and in this respect Minitab fared better than any of SSDM’s competitors I’ve tried to date. Despite being an open source tool, WEKA turned out to be more stable than DB2 and Oracle, but I’m not surprised that Minitab outclassed them all, given that all three are written in clunky Java ports to Windows. I had some crashes while using certain computationally intensive features, particularly while performing variations of ANOVA. One error on a simple one-way ANOVA and another while using Tukey’s multiple comparison method forced me to quit Minitab. A couple of others were runtime exceptions on Balanced ANOVA and Nested ANOVA tasks that didn’t force termination of the program. I encountered a rash of errors towards the end of the trial, including a plot that seized up Minitab and a freeze that occurred while trying to select from the Regression menu. On one of these occasions, I tried to kill the process in Task Manager, only to discover that I couldn’t close any windows at all in Explorer for a couple of minutes; there was no CPU usage, disk errors or other such culprits in this period, so the hang was definitely triggered by the Minitab error. Perhaps the most troubling problem I encountered towards the end was increasingly lengthy delays in loading worksheets with a couple of hundred columns, but only about 1,500 rows; these were on the order of four or five minutes apiece, which is unacceptable. Overall, however, Minitab performed better and was more stable than any other mining tool I’ve used to date, except SSDM. The two tools are really designed for tangential use cases though, with Minitab specializing in statistical analysis and lower-level mining algorithms like regression, while SSDM is geared more towards serious number-crunching on large datasets, using higher-level mining methods like neural nets.

Weak Data Types but Unique Functions

…………That explains why Minitab doesn’t hold a candle to SQL Server in terms of the range of its data types, which may become an issue in large datasets and calculations where high precision makes sense. Worksheets can only hold positive or negative numbers up to a maximum of 10^18 in either direction, beyond which the values are tagged as missing and an error is raised.[2] It is possible to store values up to 80 decimal places long in the spreadsheet (scientific notation is not automatically invoked), but they may be treated as text, not numbers. The Fixed Decimal dialog box only allows users to select up to 30 decimal places. Worse still, only 17 digits can be entered to the left or right of the decimal point before truncation begins, whereas SQL Server’s decimal and numeric types can go as high as 38 digits of precision. Our float type can handle magnitudes up to roughly 1.79E+308 – which sounds like overkill, until you start translating common statistical functions for use on mining large datasets and quickly exhaust all of this extra slack. The existing SQL Server data types are actually inadequate for implementing useful data mining algorithms on Big Data-sized models – so where does that leave Minitab, where the permissible ranges are orders of magnitude smaller? Incidentally, another possible source of frustration with Minitab’s data type handling is its lack of an equivalent to identity columns; the same functionality can only be implemented awkwardly, through such methods as manually setting the same sort options for each column in a worksheet.
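…………As a quick sanity check on the SQL Server side of those limits, the following query should run without errors; the literals are simply chosen to sit near the documented maximums of the float and decimal types, not to represent any real dataset.

-- float accepts magnitudes up to roughly 1.7976E+308; decimal/numeric allow up to 38 digits of precision
SELECT CAST(1.79E+308 AS float) AS NearMaxFloat,
       CAST(99999999999999999999999999999999999999 AS decimal(38, 0)) AS MaxPrecisionDecimal;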
…………At present, I’m trying to acquire the math skills to translate statistical formulas into T-SQL, Visual Basic and Multidimensional Expressions (MDX), which in some cases can be done more efficiently in SQL Server. This DIY approach can take care of some of the use cases in between SQL Server’s and Minitab’s respective spheres of influence, but as the sophistication of the stats begins to surpass a developer’s skill level, the balance increasingly leans towards Minitab. One area where home-baked T-SQL solutions have the advantage is in the mathematical functions and constants that Minitab provides out-of-the-box. It has pretty much the same arithmetic, statistical, logical, trigonometric, logarithmic, text and date/time functions that SQL Server and Common Language Runtime (CLR) languages like Visual Basic and C# do, except that our versions have much higher precision. It is also trivial to use far more precise values of Pi and Euler’s Number in T-SQL than those provided in Minitab. On top of that, it is much easier to use one of these functions inside a set-based routine than it is to type it into a spreadsheet, which opens up a whole world of possibilities in SQL Server that simply cannot be matched in Minitab. There are Excel-like commands to Lag, Rank and Sort data, but they don’t hold a candle to T-SQL windowing functions and plain old ORDER BY clauses.
…………Minitab provides a few functions that aren’t available out-of-the-box with SQL Server, but even here, the advantage resides with T-SQL solutions. It is trivial to implement stats like the sum of squares and geometric mean in T-SQL, where we have fine-grained control and can leverage our knowledge of SQL Server’s internal workings for better performance and encapsulation; a DBA can do things like write queries that do a single index scan and then calculate two similar stats from it afterwards at trivial added cost, but that’s not going to happen in statistical packages like Minitab. This is true even for advanced statistical tests where Minitab’s implementation is probably the better choice; their Kolmogorov-Smirnov Test is certainly better than the crude attempt I’ll post in my next series, but you’re not going to be able to calculate Kuiper’s Test alongside it in a sort of two-for-the-price-of-one deal like I’ll do in that tutorial. In general, it is best to trust Minitab for such advanced tests unless there’s a need for tricks of that kind, but to use T-SQL solutions when they’d be easy to write and validate. Some critical cases in point include Minitab’s Combinations, Permutations and Gamma functions, which are severely restricted by the limitations of their data types. At 170 records, I was only able to get permutations and combinations results when I used a k value no higher than 8, but it only took me a couple of minutes to write T-SQL procedures that leveraged the size of SQL Server’s float data type to top out at a k value of 168, as in the sketch below. I was likewise able to write a factorial function that took inputs up to 170, but Minitab’s version only goes up to 19. In the same vein, their gamma function only accepts inputs up to 20. These limitations might not cut it for some statistical and data mining applications with high values or record counts; as I’ve found out the hard way over the last couple of tutorial series, some potentially useful algorithms and equations can’t even be implemented at all in many mainstream languages because they require permutations and other measures that are subject to combinatorial explosion. There are still a few Minitab functions I haven’t tried to implement yet, like Incomplete Gamma, Ln gamma, MOD, Partial Product, Partial Sum, Transform Count and Transform Population, in large part because they have narrower use cases I’m not familiar with, but I suspect the same observations hold there.
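…………This is not the procedure I actually wrote for that test, just a minimal sketch of the same idea: a factorial that exploits the float type’s range, which overflows past 170!, plus a combinations calculation built on top of it. The dbo.Factorial name is hypothetical.

-- a float-based factorial; 170! is about 7.3E+306, so inputs above 170 overflow the float type
CREATE FUNCTION dbo.Factorial (@N bigint)
RETURNS float
AS
BEGIN
    DECLARE @Result float = 1, @Counter bigint = 2;
    WHILE @Counter <= @N
    BEGIN
        SET @Result = @Result * @Counter;
        SET @Counter = @Counter + 1;
    END
    RETURN @Result;
END
GO

-- combinations of 170 items taken k at a time, for any k from 0 to 170
SELECT dbo.Factorial(170) / (dbo.Factorial(8) * dbo.Factorial(170 - 8)) AS Combinations170Choose8;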
…………As the sophistication of the math under the hood increases, the balance shifts to Minitab over T-SQL solutions. For example, all of the probability functions I’ll code in T-SQL for my series Goodness-of-Fit Testing with SQL Server are provided out-of-the-box in Minitab 17, including probability density functions (PDFs), cumulative distribution functions (CDFs), inverse cumulative distribution functions and empirical distribution functions (EDFs) for many more distributions besides the Gaussian normal I was limited to. These include the Normal, Lognormal, 3-parameter lognormal, Gamma, 3-parameter gamma, Exponential, 2-parameter exponential, Smallest extreme value, Weibull, 3-parameter Weibull, Largest extreme value, Logistic, Loglogistic and 3-parameter loglogistic, which are the same ones available for Minitab probability plots. There is something to be said for coding these in T-SQL if you run into situations where higher-precision data types, indexing, execution plans and the efficiency of windowing functions can make a difference, but for most use cases, you’re probably better off depending on the proven quality of the Minitab implementation. In fact, Minitab implements many of the same goodness-of-fit tests I’ll be covering in that series, like the Anderson-Darling, Kolmogorov-Smirnov, Ryan-Joiner, Chi-Squared, Poisson and Hosmer-Lemeshow, as well as the Pearson correlation coefficient. You’re probably much better off depending on the proven quality of their versions than taking the risk of coding your own – unless, of course, you have a special need for higher-precision results in Big Data scenarios, of the kind my mistutorial series will demonstrate how to implement.
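…………To give a sense of what the simplest of these look like in T-SQL, here is a minimal sketch of the Gaussian PDF computed over a column; the table and column names are hypothetical, and the CDF, inverse CDF and EDF would each require numerical approximations or windowing logic that I’m omitting here.

-- the normal probability density function, using PI(), EXP() and SQUARE()
-- (dbo.SampleTable and Column1 are hypothetical names)
DECLARE @Mean float, @StDev float;

SELECT @Mean = AVG(Column1), @StDev = STDEV(Column1)
FROM dbo.SampleTable;

SELECT Column1,
       EXP(-1 * SQUARE(Column1 - @Mean) / (2 * SQUARE(@StDev)))
           / (@StDev * SQRT(2 * PI())) AS NormalPDF
FROM dbo.SampleTable;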

Figure 5: The Stat Menu
…………That is doubly true when we’re talking about even more complex calculations, such as the ANOVA tests accessible from the Stat menu. Analysis of variance is only tangentially related to data mining per se, but its output can be useful in shedding light on the data from a different direction; to make a long story short, variance is partitioned in order to provide insight into the reasons why the mean values of multiple datasets differ. As depicted in Figure 5, Minitab includes many of the most popular tests, like Balanced, Fully Nested, General and One-Way ANOVA, plus One-Way Analysis of Means and a Test for Equal Variances; I’ve tried to code a couple of these myself (as in the sketch below) and can attest that they’re around the boundary where a professional tool begins to make more sense than DIY T-SQL solutions. Some of the tests on the Nonparametrics submenu, like Friedman, Kruskal-Wallis, Mann-Whitney and the like, are fairly easy to do in T-SQL, as are some of the Equivalence Tests. A couple of routines are available to force data into a Gaussian or “normal” distribution, like the Box-Cox and Johnson Transformations, but I don’t have any experience with using them, let alone coding them in T-SQL. Minitab also has some limited matrix math capabilities available through other menus, but I’m on the fence as to whether I’d prefer a T-SQL or .Net solution for these. The Basic Statistics menu features stats that are easy to code or that come out-of-the-box in certain SQL Server components, like variance, correlation, covariance and proportions, but it also has more advanced ones like Z and T tests, outlier detection and normality testing functions. There are also some related specifically to the Poisson distribution. The Table menu is home to the Chi-Square Test for Association and Cross-Tabulation, neither of which is particularly difficult to code in T-SQL either; the time, skills and energy required to program them all yourself begin to mount with each one you develop a need for though, until the point is eventually reached where Minitab (or perhaps one of its competitors) begins to justify its cost.
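…………For the curious, here is a minimal sketch of the kind of DIY one-way ANOVA I’m describing; it computes the F statistic but stops short of the p-value, since that requires the F distribution’s CDF. The dbo.SampleData table, with an integer GroupID and a float Value column, is hypothetical.

-- one-way ANOVA F statistic: between-group vs. within-group mean squares
-- (dbo.SampleData, GroupID and Value are hypothetical names)
WITH GroupStats AS (
    SELECT GroupID,
           COUNT(*) AS GroupCount,
           AVG(Value) AS GroupMean,
           VAR(Value) AS GroupVariance    -- sample variance within each group
    FROM dbo.SampleData
    GROUP BY GroupID
), Totals AS (
    SELECT COUNT(*) AS GroupCountK,
           SUM(GroupCount) AS TotalN,
           SUM(CAST(GroupCount AS float) * GroupMean) / SUM(GroupCount) AS GrandMean
    FROM GroupStats
), SumsOfSquares AS (
    SELECT T.GroupCountK, T.TotalN,
           SUM(G.GroupCount * SQUARE(G.GroupMean - T.GrandMean)) AS SumSquaresBetween,
           SUM((G.GroupCount - 1) * G.GroupVariance) AS SumSquaresWithin
    FROM GroupStats AS G
    CROSS JOIN Totals AS T
    GROUP BY T.GroupCountK, T.TotalN
)
SELECT SumSquaresBetween / (GroupCountK - 1) AS MeanSquareBetween,
       SumSquaresWithin / (TotalN - GroupCountK) AS MeanSquareWithin,
       (SumSquaresBetween / (GroupCountK - 1))
           / (SumSquaresWithin / (TotalN - GroupCountK)) AS FStatistic
FROM SumsOfSquares;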
…………Minitab really shines in the area of stats for specific engineering applications, like reports and templates for Six Sigma engineering, plus separate sections in Help explaining in depth how to use the Reliability and Survival Analysis and Quality Process and Improvement functionality on the Stat menu. The documentation for Design of Experiments (DOE) is excellent as well. This functionality is accessible through the DOE item on the Stat menu, which allows you to perform such helpful tasks as gauging how many records are required to meet your chosen confidence levels. Various factorial, mixture, Taguchi and response surface DOE designs are available. I’m not familiar with either DOE or these engineering applications, so I’d definitely use a third-party tool for these purposes instead of coding them in SQL Server or .Net. Some of the individual items on these menus include Distribution Analysis, Warranty Analysis, Accelerated Life Testing, Probit Analysis, Gage Study and Attribute Agreement Analysis, all of which are highly specialized. Most of the meat and potatoes in the program can be found on the Stat menu, but the Assistant menu also provides access to many prefabricated workflows that can really save a lot of time and hassle. In fact, I was able to learn something about the function of Capability Analysis and Measurement Systems Analysis just by looking at the available options on these workflows. The Regression Assistant is directly relevant to data mining, while the workflows for certain other activities like planning and interpreting experiments might prove just as useful. The hypothesis testing workflow in Figure 6 would probably come in handy for statistical tasks that are complementary to data mining.

Figure 6: The Hypothesis Testing Assistant
Hypothesis Testing Assistant

…………The Graphical Analysis Assistant also helps centralize access to many of the disparate visualizations scattered throughout the GUI, like probability plots, histogram windows, contour plots, 3D surface plots and the like. Normally, these open up in separate windows when a task from the Stat menu is run. I’ll cover these in the next installment and address the question of whether it is better to buy off-the-shelf functionality like this or to develop your own Reporting Services solutions in-house. All of these visualizations can be coded in SQL Server – with the added benefit that SSRS reports can be customized, which is not the case with their Minitab counterparts. I’ll also delve into some of the Stat menu items that overlap SSDM’s functionality, like Regression and Time Series. Minitab features a wider range of clustering algorithms than SSDM, which are accessible from the Multivariate item. This item also includes Principal Components Analysis, Factor Analysis, Item Analysis and Discriminant Analysis, none of which I’m familiar enough with to code myself; the inclusion of principal components in data mining workflows, for example, is justified by the fact that it’s useful in selecting the right variables for analysis. I have no clue as to what Minitab’s competitors are capable of yet, but after my experience with it I’d definitely use a third-party tool in this class for tasks like this, plus hypothesis testing, ANOVA and DOE. Some of the highly specific engineering uses are beyond the use cases that SQL Server data miners are likely to encounter, but should the need arise, there they are. As with WEKA, Minitab’s chief benefits in a SQL Server environment are its unique mining algorithms, which I’ll introduce in a few weeks.

[1] See the Minitab webpage “Topic Library / Interface: Worksheets” at http://support.minitab.com/en-us/minitab/17/topic-library/minitab-environment/interface/the-minitab-interface/worksheets/

[2] See the Minitab webpage “Numeric Data and Formats” at http://support.minitab.com/en-us/minitab/17/topic-library/minitab-environment/data-and-data-manipulation/numeric-data-and-formats/numeric-data-and-formats/

Integrating Other Data Mining Tools with SQL Server, Part 1.2: Finding Use Cases for WEKA

By Steve Bolton

…………As recounted in the first installment of this occasional series of amateur self-tutorials, there are some serious limitations to using Waikato Environment for Knowledge Analysis (WEKA), a popular open source data mining tool, in a SQL Server environment. The documentation is full of white space and even thinner than that available for SQL Server Data Mining (SSDM); the Windows version runs only on the unstable and insecure Java Runtime Environment (JRE); as far as I can tell, there is no way to connect it to a SQL Server Analysis Services (SSAS) cube; perhaps worst of all, the user interface simply chokes on datasets of a mere few thousand records. This is compounded by the fact that any SQL Server instance the data is pulled from will keep running on one core when the interface crashes in this way, until the connections are killed in SQL Server Management Studio (SSMS) – which has the unexpected side effect of ending the WEKA process and the JRE (which must be restarted before WEKA can be launched again). The WEKA process can’t even be stopped in Task Manager through ordinary means, since it runs through the JRE; this can be an issue, given that all of the analysis programs I’ve used to date have locked up under heavy loads at some point or another, no matter how professional or expensive they were. All this turns WEKA into a potential threat to server stability. Even when it runs correctly, a welter of other constraints further reduces its usefulness. The data types are less sophisticated than SQL Server’s by several orders of magnitude, given that it only supports nominal, string, date and relational types. The numeric type subsumes all integer and continuous types, but as the documentation notes, “While double floating point values are stored internally, only seven decimal digits are usually processed.” We also have to resort to an awkward Java filters package to perform ordinary tasks like partitioning and sampling datasets, which SQL Server users can perform instantaneously with the SSMS GUI and T-SQL. There are few performance tools to speak of, beyond a few brute-force, low-level actions like changing the memory heap size with arcane Java command-line switches such as -Xmx1024m. Even garbage collection has to be run manually in the WEKA Explorer, a function that the .Net Framework now implements quite well under the hood (finally, after several early Framework versions went by where Microsoft didn’t quite succeed). This article is not intended as a knock against WEKA though, because it has many legitimate uses, particularly in low-budget organizations that need free open source alternatives to high-priced software like SQL Server. This series is geared to an audience of SQL Server users though, most of whom will find it useful only in a much narrower set of use cases than the rest of the I.T. industry would.
…………Given that SQL Server outmatches WEKA in every case where their functionality overlaps, it is much easier to ascertain its real benefits by a process of elimination. There are some fuzzy text capabilities, including the use of such structures as stemmers and the like, but WEKA’s capabilities in this area don’t hold a candle to SQL Server’s Full-Text Search. There are many things SSDM and Analysis Services can do which WEKA cannot do at all, like processing related tables and mining sequences.[1] Nor are there any neural nets or Time Series algorithms available out of the box. WEKA provides cross-validation capabilities like text-only confusion matrices and contingency tables to display true and false positives, but as usual, SSDM is simply several orders of magnitude better in this respect. These cross-validation components might come in handy, however, if we were using some of the unique algorithms that WEKA and its user community provide. Furthermore, some of the unique functionality is pointless in a SQL Server environment. For example, using the Pattern command under Preprocess merely brings up a modal box titled Input, with the message, “Enter a Perl regular expression.” This is something SQL Server does not provide, for the sound reason that Perl scripting isn’t of much benefit in a .Net ecosystem, so we really can’t classify it as a “feature” in WEKA’s favor.

A Comparison of Visualization Capabilities

…………I had high hopes for WEKA’s visualization capabilities, given that there are so many types of infographics that no single data mining tool has yet implemented them all. Don’t underestimate the power of eye candy in analytics: yes, it is possible to misuse visualizations, especially as a marketing ploy of sorts to cover up a lack of content, but they serve a definite, no-nonsense purpose that saves time, energy and money. End users often don’t have the same level of experience in interpreting data that analysts do, nor do they have the time to acquire it, which is why they hire us; it is our job to communicate the information we’ve mined to the people who can act on it as efficiently as possible. This can sometimes be done more efficiently with graphics than with numbers and equations, for as the saying goes, “a picture is worth a thousand words.” This is not precisely true in all circumstances, because sometimes words can convey a thousand pictures, but either way, imagery plays an indispensable part in human intelligence. This is a hard-coded, neurological consideration. Even an end user who understands the problem set quite well and is proficient in math and computer science can still benefit from them, given that every human brain still takes in most of its information visually; it quietly removes an unnoticed load even from the minds of experts and frees their brains up to do other productive things, of which there is no shortage in the data mining field. I was glad to discover that WEKA indeed offers several types of plots, unlike some other open source tools, but they just don’t hold a candle to SQL Server’s capabilities. One issue with them is that they’re scattered throughout the interface, not concentrated in the Visualize Tab and Visualize menu as I expected.
…………The WEKA Explorer version of the user interface is in some ways superior to that of the Windows version of DB2, in that it at least works without constantly crashing, but it’s a little disjointed in this respect (and might possibly benefit from some mouse and hand-eye coordination testing of users in action, if such a thing is possible in the open source Java world). There are multiple types of scatter plots with slightly different capabilities, which are accessible through different components of the GUI, for example. The Preprocess Tab has a handy histogram in the bottom right corner, which is pretty, but also pretty small. As far as I can tell, the size and colors can’t be adjusted, and hovering over different points doesn’t tell you much about the class an attribute is being compared to, as ToolTips do in SSDM and Reporting Services. The Visualize All button can be used to display the histograms for every attribute in one fell swoop in a separate window, but none of this is really special in comparison to the capabilities of Reporting Services. As I demonstrated in Outlier Detection with SQL Server, Part 6.1: Visual Outlier Detection with Reporting Services, histograms are fairly simple to code right in SQL Server, with the added benefits of flexibility, customization and the capacity for displaying large datasets. In the same vein, the WekaManual.pdf contains a screenshot of a 3D Scatter Plot that can be implemented through the Perspective menu command in the Knowledge Flow GUI, but it’s only eye-catching because of its black background; there’s nothing there that an SSRS developer couldn’t whip together in a jiffy.[2]
…………The visualization methods available through the GUI menu are barely mentioned in a single bulleted list in the .pdf, which is a shame given that I ran into issues with all of them except the Plot command. Various GUI controls are available with the scatter plots to adjust the jitter, diagram size, point size, colors, scrolling speed and sampling percentages, but it’s nothing that can’t be coded in Reporting Services. I had trouble figuring out what the ROC Displays command did, given that it simply brought up the same window as the Plot command for each sample dataset I tried. I discovered that it is an abbreviation for “Receiver operating characteristic,” whatever that is. Although I complained frequently in A Rickety Stairway to SQL Server Data Mining about the quality of the SSDM documentation and wrote my first series, An Informal Compendium of SSAS Errors, in order to address the lack of published information on that topic, I was unprepared for just how thin the WEKA documentation is in spots. I had a difficult time tracking down any citations at all that might help me figure out what I was doing wrong with ROC Curves. All I turned up was a mention of ROC Curves in the separate KnowledgeFlow component; given that this is an item on the startup window’s menu bar, there ought to be at least some immediately accessible mention of it in the documentation. The TreeVisualizer is apparently used for displaying directed graphs defined in GraphViz’s .dot format, such as the output of the tree classifiers available in WEKA’s J48 and M5P packages.[3] Unfortunately, the TreeVisualizer didn’t respond when I loaded various sample .dot files I found on the Internet into it. Likewise, the GraphVisualizer is supposed to display Bayesian networks encoded in .xml, .bif and .dot files[4], but none of the .bif sample documents I downloaded from the Bayesian Network Repository produced any output. Also, the blue bar got stuck on “Removing gaps by adding vertices” about a quarter of the way through when I set the LayoutType to Naive instead of Priority and checked Edge Concentration. I’m not saying that my own naiveté isn’t the culprit here, especially since I barely gave this functionality a cursory check-over. Nevertheless, a quick search of the Internet uncovered posts by other users who have had the same problems. I’m simply not going to get into writing Java in command prompts or inspecting the sample files for EOF or Unicode characters, as the various workarounds require. It’s a shame it doesn’t work out of the box, given that visualization of Bayesian networks might address some use cases that SSDM’s Naïve Bayes algorithm does not. In fact, WEKA has some distinct advantages in terms of Bayesian data mining, in that Chapter 9 of the documentation is devoted to the topic and some good information is also available for using the Bayes Net Editor, which can be accessed from the Tools menu in WEKA Explorer. Except for the broken GraphVisualizer, the BoundaryVisualizer probably exposes the most functionality not found in SSDM, where the emphasis is on using trees to depict how classifiers discriminate between data points. The plot in Figure 1 could easily be coded in Reporting Services though. The same can also be said for the sample bar charts and scatter plots in Figures 2 through 4.

Figure 1: The WEKA Boundary Visualizer
The WEKA Boundary Visualizer

Figure 2: Sample Bar Charts from WEKA’s Preprocess Tab
Sample Bar Charts from WEKA's Preprocess Tab

Figure 3: Screenshot of the WEKA Plot Command
Screenshot of the WEKA Plot Command

Figure 4: Example of a WEKA Scatter Plot Matrix
Example of a WEKA Scatter Plot Matrix

…………The BoundaryVisualizer and Bayes net functionality bring us to the topic of the actual mining functionality that leads to these outputs. In many respects WEKA overlaps SSDM, in which case it is invariably wiser to go with the latter, but it implements a few algorithms and parameters that SQL Server does not. For example, WEKA users can perform Association Rules mining with the Apriori and FilteredAssociator algorithms, which return various metrics like Confidence, Conviction, Lift and Leverage in their output. This would only be beneficial in a low-budget, non-SQL Server environment though, given that the output in Figure 5 is far more difficult to decipher than SSDM’s output for similar algorithms, as depicted in A Rickety Stairway to SQL Server Data Mining, Algorithm 6: Association Rules. As I mentioned in that article, Association Rules is a brute-force method of data mining that can be computationally intensive, to the point where the parameters require a lot of tweaking to prevent SSDM from crashing. I seriously doubt that WEKA can handle the kind of half-gigabyte datasets I crammed into the SQL Server version of Association Rules for that article, given that it has so much trouble just displaying 5,500 records in its SQLViewer.
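…………Incidentally, the basic metrics behind rules of this kind are easy enough to compute in a set-based way once the raw itemset counts sit in a relational table; the following is only a minimal sketch of support, confidence and lift for single-item antecedents and consequents, against a hypothetical dbo.TransactionItems table, not a replacement for a real Apriori implementation.

-- support, confidence and lift for item pairs
-- (dbo.TransactionItems with TransactionID and ItemID columns is a hypothetical table)
DECLARE @TransactionCount float = (SELECT COUNT(DISTINCT TransactionID) FROM dbo.TransactionItems);

WITH ItemCounts AS (
    SELECT ItemID, COUNT(DISTINCT TransactionID) AS ItemTransactions
    FROM dbo.TransactionItems
    GROUP BY ItemID
), PairCounts AS (
    SELECT A.ItemID AS AntecedentItem, B.ItemID AS ConsequentItem,
           COUNT(DISTINCT A.TransactionID) AS PairTransactions
    FROM dbo.TransactionItems AS A
    INNER JOIN dbo.TransactionItems AS B
        ON A.TransactionID = B.TransactionID AND A.ItemID <> B.ItemID
    GROUP BY A.ItemID, B.ItemID
)
SELECT P.AntecedentItem,
       P.ConsequentItem,
       P.PairTransactions / @TransactionCount AS Support,
       P.PairTransactions / CAST(Antecedent.ItemTransactions AS float) AS Confidence,
       (P.PairTransactions / CAST(Antecedent.ItemTransactions AS float))
           / (Consequent.ItemTransactions / @TransactionCount) AS Lift
FROM PairCounts AS P
INNER JOIN ItemCounts AS Antecedent ON Antecedent.ItemID = P.AntecedentItem
INNER JOIN ItemCounts AS Consequent ON Consequent.ItemID = P.ConsequentItem;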
…………Clustering is far less taxing on server resources than Association Rules, as I noted in A Rickety Stairway to SQL Server Data Mining, Algorithm 7: Clustering, so it would probably be a safer bet to explore WEKA’s capabilities using this set of mining methods from the Cluster Tab. On the other hand, it suffers from an identical problem: as shown in Figure 6, it only outputs text, not any visualizations comparable to the Microsoft Cluster Viewer. This is a more serious drawback than with Association Rules, since Clustering is inherently visual. The good news is that besides the Expectation Maximization (EM) and K-Means brands of clustering available in SSDM, WEKA also provides others like Canopy, Cobweb, FarthestFirst, FilteredClusterer, HierarchicalClusterer and MakeDensityBasedClusterer that aren’t available out of the box in SQL Server at all. If a client came to me with a special need for one of these flavors of Clustering on a small dataset, I’d first try running it through WEKA. Another advantage is that the Cluster Tab makes a broader list of parameters available than SSDM does for its two Clustering algorithms (plus Sequence Clustering, which WEKA doesn’t offer).

Figure 5: Sample Output for Association Rules in WEKA

Figure 6: Sample Output for EM Clustering in WEKA

…………The same general principle applies to other aspects of WEKA: it’s mainly useful to SQL Server users in cases where 1) it has functionality that is not implemented in SSDM; 2) the same functionality couldn’t be quickly coded in an economical way in T-SQL, Multidimensional Expressions (MDX) or SSDM plug-ins; 3) the problems it solves aren’t encountered frequently enough to justify buying a different professional tool that does the same job; and 4) WEKA proves capable of processing the number of required rows without choking, which is a big if. Another area where these four conditions might be met is in feature selection, which is taken care of through the Select Attributes Tab. There are options available here which appear (at least from their names) to go beyond what SSDM offers in some respects, like GainRatioAttributeEval, InfoGainAttributeEval, CfsSubsetEval, CorrelationAttributeEval, OneRAttributeEval, PrincipalComponents, ReliefAttributeEval, SymmetricalUncertAttributeEval and WrapperSubsetEval. The Search Method Choose button also contains choices not available out-of-the-box in SSDM, like BestFirst, GreedyStepwise and Ranker. Once again, however, the benefits might be canceled out by WEKA’s inability to process “Big Data”-sized datasets, which necessitates far more stringent thresholds for feature selection than tools like SSDM require; assessing how many attributes a data mining tool can process becomes a whole different ballgame when we know in advance that it can only handle a few thousand rows of input. As usual, any benefit for SQL Server users is going to be found in the fine-grained control made possible by specialized parameters, but these will only prove useful in a really narrow set of use cases. In fact, WEKA’s weak processing capabilities may nullify the value of its additional feature selection choices altogether.

Figure 7: Sample Output from the WEKA Classifier

…………The WEKA Classifier Tab is likely to be of more benefit to SQL Server users than any of the aforementioned components, save perhaps for the Bayes Net Editor and other Bayesian functionality SSDM doesn’t have. SSDM includes some solid classification methods out of the box, like Decision Trees, Naïve Bayes and Microsoft’s Neural Network Algorithm. Nevertheless, end users might find a use for the six classifiers WEKA provides: ZeroR, DecisionTable, JRip, OneR, PART and MSRules (which was always greyed out during my trial), all of which have their own unique sets of parameters. I’m not sure what the use cases for all of them are – there are literally thousands of data mining algorithms out in the wild today, so I’m sure that even experts can’t match them all up – but if the need should arise for one of them, WEKA might be a good starting point. Provided, that is, that we’re speaking of small datasets, which represent an omnipresent caveat. In Figure 7, we can see that the Classifier returns only text output, like the crude Confusion Matrix. The real story, however, is the extra information provided by the Kappa statistic, Mean Absolute Error, Root Mean Squared Error and the like, none of which are returned by SSDM or SSAS. I now know how to code many of these myself in T-SQL and would probably prefer to do it that way, particularly when datasets of more than a few thousand rows are at stake.
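…………To give a sense of how little code some of those accuracy stats actually require, here is a minimal T-SQL sketch of the Mean Absolute Error and Root Mean Squared Error calculations, under the assumption of a hypothetical PredictionResults table that pairs each actual value with the value a model predicted for it; the Kappa statistic takes more work, since it requires cross-tabulating the confusion matrix first.

-- hypothetical table: PredictionResults (ID bigint, ActualValue float, PredictedValue float)
SELECT Avg(Abs(ActualValue - PredictedValue)) AS MeanAbsoluteError,
       Sqrt(Avg(Power(ActualValue - PredictedValue, 2))) AS RootMeanSquaredError
FROM PredictionResults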
…………One of the few scenarios SQL Server users might have for accessing the WEKA KnowledgeFlow Environment is to automate ETL loading and extraction for the functionality SQL Server doesn’t have, like certain aspects of the Classifier Tab, the Bayes Net Editor and some of the Clustering algorithms. For example, fine-grained control of the unique parameters and algorithms available in WEKA can be automated, but I won’t go into much depth on this topic because it applies mainly to an even rarer set of use cases, for organizations which need to use SQL Server and WEKA together on a continual basis. Furthermore, the minor-league documentation is thin, even when all of the white space isn’t taken into account. SQL Server Integration Services (SSIS) simply blows it away – as it should, considering that WEKA and its KnowledgeFlow are free. One of the few noteworthy differences is that the available data sources and destinations include Matlab, JSON, SVMLight, .xrff, serialized and LibSVM formats, in addition to the .csv and text formats included in SSIS.

The Silver Linings: WEKA’s User Community and Extensibility Architecture

…………Other automation capabilities are available through an API[5], at least for those with a taste for Java. Yet the real benefits to SQL Server data miners can be found in WEKA’s Extensibility interface, which allows us to take advantage of the hundreds of add-ons developed by the WEKA user community. This is where WEKA really shines, especially since Microsoft quit upgrading SSDM’s algorithms back in SQL Server 2008 R2, as I lamented often in the Rickety tutorial series. Last time I checked, there were 116 add-on packages available at the WEKA website, covering everything from Arabic stemmers to alternating Decision Trees to a Radial Basis Function (RBF) neural net, which might be of interest to me. Some of these were repeated in the list of 369 available at SourceForge’s page for WEKA packages. The sheer number of add-ons should be a clue that they come from many highly specialized fields and are thus beyond any one person’s powers of comprehension and description. It’s a good idea to know precisely what algorithm you need for the job at hand prior to diving into the pool of packages, especially since many of them have inscrutable names and terse descriptions which are probably only clear to the initiated and the enlightened. WEKA is not like SSDM, where you can go in blind and untrained and still grope your way towards meaningful results; you have to know exactly what these WEKA packages do and why they do it, which is something I couldn’t do with SSDM’s algorithms when I started the Rickety series. The omnipresent caveat of small dataset sizes also hangs like a pall over WEKA add-ons, like a dark cloud over a silver lining: don’t assume that any of these packages are any better at handling larger datasets than WEKA itself is, especially since they’re often written by just a single user. I strongly suspect that many of them haven’t been subjected to any stress tests at all, since I’ve never seen the subject come up in any of the discussions, documentation or descriptions of WEKA packages. Since the SQLViewer alone is capable of locking up SQL Server on one core, there is no telling what these open source add-ons might do if we run them with SQL Server connected. Unlike the JRE, I doubt that they represent security holes, but they might lead to performance issues and crashes. After all, we are talking about Java here, which is not in the same league of reliability as .Net. Just because the WEKA add-on packages might occasionally prove useful to SQL Server users doesn’t mean I’d ever waste time trying to code one in Java myself, based on the weka.classifiers.AbstractClassifier class. Chapters 18 and 19 in the WekaManual.pdf can get intrepid Java die-hards started with writing their own classifiers, associators, filters and the like, but it really only makes sense if you’re an experienced SQL Server user with an advanced Java skillset to boot, who is also faced with narrow classes of mining problems. It would otherwise be several orders of magnitude easier for someone used to the Microsoft ecosystem to write T-SQL routines, SSDM plug-ins or .Net code, perhaps with the aid of the Accord NuGet package. It may likewise be possible to upgrade the GUI and augment WEKA with graphics functionality from the Prefuse Visualization Toolkit – but why bother, when we can do so much more, at a much faster pace and in a much more reliable way, with Reporting Services and Windows Presentation Foundation (WPF) interfaces?
…………If I had the option, I would try to meet the data mining needs of clients through SSDM, SSAS and Reporting Services in cases where SQL Server provides the same functionality as WEKA. This leaves the existing add-on algorithms, the Experimenter, the Bayes Net Editor, some of the functionality on the Classifier Tab and some of the Clustering algorithms as the most useful aspects of WEKA, at least in a SQL Server environment. I originally figured that parsing .arff and .xrff files might be among these narrow use cases, but it turns out that they can be easily converted and imported into SQL Server, seeing that they’re basically just text and XML; I’m not yet certain about the .bif and .dot formats also used by other WEKA components though. Of course, these remaining use cases are further narrowed by the fact that WEKA simply can’t handle datasets of the size that DBAs are accustomed to working with every day, let alone SSAS cubes and SSDM mining models. For these scenarios, we have no choice but to turn to custom SQL Server and .Net code or other third-party mining tools. I’m presently trying to acquire the skills to translate the underlying math formulas for many other algorithms not included with SSDM, and in some specialties where I’ve made a little progress, like fuzzy sets, neural nets and various measures of information, I’d prefer concocting my own custom T-SQL and Visual Basic solutions (possibly in conjunction with Accord, which I haven’t had the time to try yet). As the saying goes, Your Mileage May Vary (YMMV): other SQL Server users may have far more experience in other Microsoft languages or other math formulas and concepts, so the scenarios where custom code is an option could vary wildly from one development team to the next. It might also pay to look at one of the other tools I plan to cover in this occasional series in coming years, like RapidMiner, R, Pentaho, Autobox, Clementine, SAS and last but not least, Predixion Software, which was started by former SSDM developers like Jamie MacLennan and Bogdan Crivat. I’ve already dispensed with two of SQL Server’s main competitors in Thank God I Chose SQL Server part I: The Tribulations of a DB2 Trial and Thank God I Chose SQL Server part II: How to Improperly Install Oracle 11gR2.
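…………For anyone who wants to try that conversion, the quickest route I know of is to bulk load the raw .arff text into a single-column staging table and discard the header lines; the sketch below is only a rough outline, with a hypothetical file path and table name, and it assumes the @data section is plain comma-separated text without embedded tabs.

-- the file path and staging table are hypothetical placeholders
CREATE TABLE #ArffStaging (RawLine nvarchar(4000))

BULK INSERT #ArffStaging
FROM 'C:\Program Files\Weka-3-7\data\supermarket.arff'
WITH (ROWTERMINATOR = '\n')

-- keep only the data section; @relation, @attribute, @data and % comment lines are header material
SELECT RawLine
FROM #ArffStaging
WHERE Len(LTrim(RawLine)) > 0 AND Left(LTrim(RawLine), 1) NOT IN ('@', '%')
-- from here, split each RawLine on commas with whatever string splitter you prefer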
…………My criticisms of WEKA may come across as harsh, but I was actually quite surprised at how well it worked for a free tool; for data miners with no budgets and no access to SSDM, it might be the perfect solution. In fact, IBM and Oracle really ought to be embarrassed by the fact that WEKA behaved better in many respects than the Windows versions of their expensive data mining suites. As I often said in my Rickety series, SSDM is still a really good tool in comparison to its competitors, despite a decade of neglect by Microsoft. If a development team were thinking of trying to save a few bucks by using WEKA instead of SSDM, I’d strongly advise against it, because the hassle, training time, debugging, inevitable crashes and other intangibles would probably wipe out the supposed savings quite quickly. These intangibles would weigh even more heavily with DB2 and Oracle, which were less professional than WEKA in many respects; no matter how expensive a data mining suite is, it isn’t worth a thing if the installers and user interfaces are so riddled with careless errors that it won’t launch or stay running. Judging from how frequently the same “gotchas” were reported by experienced DB2 and Oracle users, my misadventures with those two products were less indicative of novice errors than of corporate rot leading to real declines in software quality. For that reason, I had no idea what to expect when I tried Minitab, a moderately expensive data mining tool that can nonetheless meet certain use cases that SQL Server cannot. As we shall see in the next installment, I was pleasantly surprised by how well it worked, especially in fields that SSDM and SSAS don’t cover at all, like Analysis of Variance (ANOVA) and experiment design. That cuts further into the use cases for the WEKA Experimenter and Classifier Tab, which I’d prefer to use Minitab for, provided I had enough clients with those specific needs to justify the costs. I look at the analytics marketplace not as a hierarchy of bigger and smaller fish, but more as a toolbox in which everything has its proper place. I recently learned that the mathematical research and the pool of available algorithms far outstrip the available software, probably by a factor of several dozen to one; the theoreticians are literally decades ahead of the marketplace, which means no data mining tool is even close to incorporating them all. The analytics marketplace is going to be wide open for a long time to come. The lesson I’ve learned is that anyone hoping to succeed in the analytics field can benefit from familiarizing themselves with as many tools as possible, as well as learning to code their own on the fly as quickly as possible. For the foreseeable future, I hope to inch towards both goals on dual tracks, by writing this occasional series alongside tutorial series like Outlier Detection with SQL Server.

[1] See the Wikipedia article “Weka (Machine Learning)” at  http://en.wikipedia.org/wiki/Weka_(machine_learning)

[2] p. 117, Bouckaert, Remco R.; Frank, Eibe; Hall, Mark; Kirkby, Richard; Reutemann, Peter; Seewald, Alex; Scuse, David, 2014, WEKA Manual for Version 3-7-11. The University of Waikato: Hamilton, New Zealand.

[3] IBID., p. 237.

[4] IBID., p. 32.

[5] IBID., p. 205.

Integrating Other Data Mining Tools with SQL Server, Part 1.1: The Weaknesses of WEKA

By Steve Bolton

…………The same rules that applied to my amateur mistutorial series on A Rickety Stairway to SQL Server Data Mining and Outlier Detection with SQL Server are in play for this series of occasional articles, which will provide a brief overview of using various third-party data mining tools with SQL Server. Basically, I’m a novice writing about my experiences as I go, partly because it helps me absorb the material a lot faster and partly so others won’t repeat my inevitable mistakes. I’ve only had a cursory introduction to tools like the Waikato Environment for Knowledge Analysis (WEKA), an open source data mining suite developed at the University of Waikato in New Zealand[1], so I’m certain that aficionados can point out many more legitimate use cases that I haven’t thought of, in addition to many corrections for my desultory observations. Most of the use cases for WEKA simply don’t apply to SQL Server users, however, so it ought to occupy only a small portion of a DBA’s toolbox. My aim in this series is to assess how these tools may benefit SQL Server users, not the wider pool of data miners, so I will have to gloss over many obvious strengths they may possess. That includes many aspects of WEKA that the user community is rightly enthusiastic about, but which are difficult to integrate in a SQL Server environment. The Cliff’s Notes version of this article is that WEKA is useful to us mainly in cases where all three of the following conditions are met: 1) we need a particular algorithm that SSDM doesn’t provide and which can’t be coded easily in T-SQL or Multidimensional Expressions (MDX)[2]; 2) we can’t afford other professional tools that implement the same functionality; and 3) we only need to operate on small tables and views. It has some definite pluses, including the price tag: zero. Despite the old adage that “you get what you pay for,” it behaved much better than IBM’s supposedly professional data mining suite did when I wrote Thank God I Chose SQL Server part I: The Tribulations of a DB2 Trial. The fact that installation was even possible put it head and shoulders above the Windows version of Oracle’s data mining tools, which have regressed badly since I last installed them in the mid-2000s. These misadventures are recapped in Thank God I Chose SQL Server part II: How to Improperly Install Oracle 11gR2. This series is essentially a continuation of those two articles, hopefully with results that will do more than serve as cautionary tales about the poor software quality of SQL Server’s competitors.
…………I can’t really put WEKA into perspective without doing a little editorializing and pontificating on some broader software industry issues. Some might see the buzzword “open source” attached to the product and automatically assume that it somehow provides some benefit over proprietary products, but this isn’t necessarily the case. Any open source software is much more vulnerable to security holes, since coding in the open is akin to an army handing over access to all of its battle plans, troop movements and communications to a foreign power. The fact that it is provided free under the GNU General Public License is a plus, but the risks of open source development also have to be factored in. Since it’s been downloaded by at least 2.5 million users at Sourceforge.net, we can be sure that it has better support than certain other fly-by-night open source projects. On the other hand, popularity is not always indicative of the worth of software; platform dependence can be very difficult to maintain in many areas of the software business, especially when market forces get involved. One of the key lessons of software economics over the last few decades is that if there’s a stampede to a particular product, stragglers eventually have no choice but to use it too – regardless of whether or not it suits their uses, or works at all. This is a lesson I had learned by the mid-1990s, when my father and I had to switch from programming for the Mac to Windows, even though we preferred the other platform. Another big minus is that WEKA is dependent on the Java Runtime Environment (JRE), which simply can’t hold a candle to .Net in a Windows environment. I understand precisely why the U.S. Department of Homeland Security once warned users to disable Java entirely[3], long after China and others had already restricted its use: the only times in recent memory I’ve detected security issues on my poor, wheezing artifact of a development machine, the Java Runtime has been the backdoor that allowed entry. For that reason I disallow its use as a matter of policy, except when testing out software packages that depend on it. One of the alleged benefits of Java is its portability, but the problem is that code written for the Java Runtime just doesn’t port well to a Windows environment; in my experience, portability of code has been a holy grail among the software giants since the 1990s, but every product I can think of that emphasized it as a benefit has fallen very short. Many of the aforementioned problems I encountered in the articles on Oracle and DB2 data mining were due to the fact that many of their components are dependent on Java, which may be the cat’s meow in the Linux world, but is a productivity killer in the Windows ecosystem.
…………In fact, I ran into installation problems right off the bat with WEKA, which were fortunately resolved much more satisfactorily than with DB2 and Oracle. The culprit once again, however, was Java, not the standard Windows installer available from the WEKA homepage. At present, running WEKA is a real inconvenience, given that I had to remove the Java environment to stay within my security policy and would thus have to reinstall it all over again, should some specific data mining use case arise. If the JRE is not installed, a command prompt window will pop up and immediately vanish whenever you run WEKA, which will also disappear from Task Manager. WEKA 3.7.11 and the JRE together took up 110.1 megabytes of space, which is a pittance in these days of cheap storage. The drawback, however, is that the size is indicative of the lack of functionality; the user interface is light precisely because it doesn’t do as much as your average SQL Server component, plus most of the algorithms developed by the user community have to be downloaded separately. The installer includes tabs and other controls for the License Agreement, Choose Start Menu Folder, and Start WEKA checkbox, all of which are fairly self-explanatory. The main choices can be found under the Choose Components window depicted in Figure 1, which includes an Install JRE control and Associate Files checkboxes for .arff and .xrff files. After installation the structure of the new WEKA directory will look something like Figure 2. The Data directory includes a series of .arff files that are apparently sample datasets, such as “breast-cancer.arff” and “ReutersGrain-train.arff,” while the changelogs folder contains nothing but CHANGELOG files with different numbers appended, which apparently refer to particular versions of WEKA. The doc folder is almost exclusively composed of .html Help files. Most of the other files in the screenshot are self-explanatory, with a couple of .gif, .ico, .exe, Java .jar and .bat files, plus a readme, a .pdf for the documentation and an uninstaller. It is also worth noting that the working directory will be set to an address like “C:\Users\your user name\wekafiles.”

Figure 1: The Choose Components Window of the WEKA Installer

Figure 2: The WEKA Folder after Installation

Figure 3: The WEKA GUI Chooser

…………On a successful launch, the program starts out with the GUI Chooser in Figure 3. I’ll simply ignore the fourth choice, Simple CLI, which is merely a Java-based console application that is even more awkward to work with than ordinary Windows command prompts. As I pointed out in my SSDM series, command prompts aren’t something data miners ought to be hassled with on a regular basis, given that the topic is so broad, taxing and sophisticated that we can’t afford such unnecessary distractions. I stick to whatever GUI is available whenever possible, unless a console app has some indispensable, exclusive functionality, and I’m not aware of any that the Simple CLI has which is not found in the other three GUI options. In addition to these, the documentation says that it is also possible to install Multiple Document Interface (MDI) capabilities in the GUI, but I didn’t try this feature.[4] Of the remaining three, the Experimenter offers the most functionality not found or easily implementable in SSDM or other SQL Server components. In my Outlier Detection series I spoke frequently of how statistical hypothesis testing methods are not common use cases in a SQL Server environment, even in the case of data mining activities. If a need arose, however, we could use the WEKA Experimenter interface to calculate stats like entropy, mean absolute error, false positives, Root Mean Squared Error (RMSE) and others listed in Figure 4. In the next graphic, we see how it is possible to select standard statistical parameters like significance levels and perform two types of Paired T-tests. Some of the other controls in Figure 5 are self-explanatory, like the Sorting (Asc. By) button, Displayed Columns button and Show Std. Deviations checkbox. When I tried to set up an experiment using the Start button on the Run tab, in conjunction with Distribute experiment and no Hosts file, I received a “No hosts specified!” warning. The first time I was able to use the Stop button to end the run, but the second time around, both the Start and Stop buttons were greyed out – thereby forcing me to restart the whole experiment. The window didn’t freeze per se, so I was able to save the associated .exp definition file, but I had to close the Experiment Environment window, open it again, then reopen the same .exp. The Setup and Run tabs allow users to perform ordinary tasks like randomizing data and producing training data, creating new .exp Experimenter definition files, selecting datasets to operate on and choosing Destination types, like .csv files, JDBC databases or the default .arff files. Users can also select up to 10 Runs, with the results being reported in a Log window on the Runs tab. The Output format on the Analyse tab also includes such types as plain text, GNUPlot, HTML and LaTeX, along with various options for including or excluding certain types of data. These were initially saved to the default location of the AppData/Local/Temp folder.

Figure 4: Some Stats That WEKA Experimenter Can Calculate

Figure 5: The Analyse Tab of the WEKA Experimenter


…………Nevertheless, the WEKA Experimenter will probably only prove useful in a SQL Server user environment where several preconditions are met. First, there must be a need to perform tasks like experiment design and model comparison – but only for the basic stats available through the Cols button, some of which are depicted in Figure 4. Second, the need has to be just great enough to justify installing and learning to use the tool, which will take time – without being so pressing or frequent that proprietary tools are called for. In between trying WEKA and writing up this post, I became acquainted with Minitab, which is far beyond my budget at about $1,500 for a single user license. On the other hand, Minitab does far more than WEKA Experimenter ever could in terms of things like advanced experiment design, performing ANOVA and calculating more advanced model stats like the Akaike Information Criterion (which I hope to code in T-SQL in my next self-tutorial series). Everything WEKA can do, Minitab can do better; even the user interface is an immediate and obvious improvement. If I expected to do more than 75 hours of work of this kind for a client over the course of a couple of years, then a tool like Minitab would pay for itself, even if the labor rate was as low as $20 an hour[5]. The amount of time saved between using a professional and open source interface isn’t even comparable – although as I mentioned above, WEKA is actually more reliable than certain professional data mining packages like DB2 and Oracle, which often don’t work at all. I haven’t tried any of Minitab’s high-priced competitors yet, although I suspect that when I finally get around to it, I will find their interfaces and algorithms far superior to WEKA’s. The third caveat is that the datasets have to be much smaller than we’re accustomed to working with in SQL Server. This limitation applies to every component in WEKA, which sharply reduces the number of use cases we can utilize it for.
…………Most of the action occurs in the WEKA Explorer, where we find a series of menus including Program (containing commands to log all of the text results, display MemoryUsage and Exit) and Help, with the usual .html documentation links and the like. Some of the most valuable documentation resources include the WEKA Wiki, a guide to Data Mining: Practical Machine Learning Tools and Techniques and the WEKA forums. The 327-page WekaManual.pdf installed in the WEKA directory in Program Files is actually quite well-written and informative in comparison to other open source documentation I’ve seen before. In fact, it is indispensable for users trying to open XML Attribute Relation File Format (.xrff) and Attribute-Relation File Format (.arff) files, including the aforementioned practice datasets. The first format is used for “representing the data in a format that can store comments, attribute and instance weights,”[6] while the latter can be used to load the 23 plain text sample datasets included at installation, or others downloaded from the Internet. If you can’t get WEKA to work with SQL Server, you can still use it in conjunction with these .arff files, which are basically plain text, plus other extensions like .xrff, .bif and .dot that I’ve never seen used in a SQL Server environment. These can be inspected via the ArffViewer on the Tools menu, which also contains a SQLViewer that is really just a glorified text editor, not a worthy opponent for SQL Server Management Studio (SSMS). It is through this menu item that we begin our adventure in connecting to SQL Server. This comment in the WekaManual.pdf doesn’t exactly inspire confidence:

“A common query we get from our users is how to open a Windows database in the Weka Explorer. This page is intended as a guide to help you achieve this. It is a complicated process and we cannot guarantee that it will work for you. The process described makes use of the JDBC-ODBC bridge that is part of Sun’s JRE/JDK 1.3 (and higher). The following instructions are for Windows 2000. Under other Windows versions there may be slight differences.”[7]

…………Chapters 14 and 15 contain some information about connecting to Windows and JDBC databases, but most of it is not really relevant. I complained often in my Rickety series about the incompleteness of the documentation for SSDM, but this takes the cake: the instructions are for Microsoft SQL Server 2000 (Desktop Engine), as well as Microsoft Access. Thankfully, I was able to connect with the aid of the WEKA Wiki page on Windows Databases and Anders Spur Hansen’s excellent tutorial on connecting WEKA to SQL Server, so I won’t waste time reinventing the wheel here.[8] Basically the procedure boils down to creating a DSN and using it in conjunction with the Query command through WEKA Explorer’s OpenDB… menu item, with some extra steps thrown in for Java-related hacks like editing the DatabaseUtils.props file. In the event of an error message like “No suitable driver found” when connecting to SQL Server, be aware that there’s at least one old MSDN thread about this, which turns out to be yet another Java issue.[9]
…………To date, I’ve been unable to find a means of connecting WEKA to Analysis Services, but the point is moot because the SQLViewer simply can’t handle cube-sized data. In fact, one of its chief drawbacks is that it simply chokes on relational tables of the size SQL Server users work with all the time, which limits us to really small datasets. The Lilliputian size of the 23 sample datasets included out-of-the-box is a clue to just how limited WEKA’s processing and display capabilities are in comparison to SQL Server: the breast-cancer.arff file, for example, has a measly 286 rows, whereas the largest, supermarket.arff, has 4,627, which is several million rows short of the “Big Data” league. SSMS can effortlessly display millions of rows, even on my clunker of a development machine, but WEKA simply gives up the ghost after a few thousand. To stress test the SQLViewer, I set the maximum rows to more than 51,000, knowing I was loading data from a 41,360-record table that occupied a mere 0.883 megs in a half-gig SQL Server database. This not only locked up the SQLViewer window, but set SQLServer.exe running on one core. The situation deteriorated from there, as I immediately discovered that you can’t simply quit Java apps in the Task Manager, since they’re not listed there. Nor could I exit any of the separate windows that open with WEKA using their control boxes, once one of them froze. To extricate myself, I had to use the Kill Process command in SQL Server’s Activity Monitor, which had the unexpected secondary effect of killing all of the WEKA windows, plus the JRE.
…………Since this is a matter of critical importance for SQL Server users who need to use WEKA, I set out to find a threshold below which SQL Server tables could be safely accessed. When I limited the same dataset to the first 1,500 rows, WEKA behaved itself, but at 5,500 it took about two minutes to unfreeze – during which time SQL Server was locked on one core, which is unacceptable. Plan ahead for this limitation, because the performance of the whole server could be affected by accessing an unexpectedly small number of records. Just handling an ordinary record request that SSMS can do in a heartbeat can bring a server to its knees. After some experimentation I came to suspect that the flake point was only a little above 5,500 records, which of course may be substantially higher on a real server or workstation. Even after successfully loading that many records, I still couldn’t open the WEKA Explorer, which was simply greyed out in the GUI Chooser after the crash. The Experimenter did the same after this particular Frankenstein’s Experiment. After restarting the program and finding the same problems again, I checked the LogWindow, which was full of red error messages with alarming titles like “Exception in thread ‘AWT-Event-Queue-0’ java.lang.NoClassDefFoundError.” Given that I was merely starting the program, I concluded that this was not good. I wasn’t able to restart WEKA successfully until I noticed that there were three different javaw.exe versions running and killed them all in Task Manager, then rebooted the JRE.
…………Given that WEKA had so many problems with merely displaying a few thousand rows, I strongly suspected that it wouldn’t be able to perform high-powered data mining calculations on the kinds of datasets I used for the Rickety and Outlier Detection series. There was simply no way that WEKA would be able to swallow the half-gig of fake Monitoring database data I routinely crammed through SSDM in the first series, nor the 11-million-row table of Higgs Boson data I tested my home-baked T-SQL outlier code on in the second. As a general rule of thumb, SSDM ought to be our go-to tool, except in cases where there’s a specific need for algorithms available only in WEKA. It is here that WEKA really shines, thanks to its vibrant user community and the growing pool of open source algorithms they’ve contributed; I still believe SSDM is a far better data mining tool than anything else I’ve yet used, but as I lamented in the Rickety series, Microsoft hasn’t supported it with any new algorithms since SQL Server 2008 R2. While familiarizing myself with the field over the last few years, I’ve been awe-struck by the sheer size of the gap between the data mining algorithms available in the academic literature and those that are available in the current software. This means there’s room for a lot of variation between software packages, none of which implements the same functionality as its competitors. The realization that building a toolbox that includes them all can be beneficial in meeting a wider range of use cases is what prompted me to write this series, in which I intend to match the software packages to the right problems. The limitation on dataset sizes hobbles WEKA even in these situations where it has a clear advantage over SSDM, but as long as we stay within the limit of about 5,000 records or so, it can meet some narrow use cases that SQL Server simply can’t. Now that we’ve got the preliminary steps of installing, connecting and displaying data out of the way, I’ll explain what WEKA can do with our data in the next article.
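…………One simple way to plan ahead for that limitation is to never point WEKA’s SQLViewer at a base table at all, but at a view that caps the row count below the flake point. This is just a sketch with hypothetical object names, and the 5,000-row ceiling is merely the rough threshold I observed on my own machine:

-- all object names here are placeholders; adjust the TOP value to whatever your hardware tolerates
CREATE VIEW dbo.WekaSafeSampleView AS
SELECT TOP 5000 *
FROM dbo.YourLargeTable
ORDER BY YourKeyColumn

Then point the Query command in WEKA Explorer’s OpenDB… window at the view instead of the underlying table.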


[1] Apparently the name’s a play on words, based on “a flightless bird with an inquisitive nature” that is native to New Zealand. See Asanka, Dinesh, 2013, “Weka 3: Data Mining Software in Java,” posted on Sept. 10, 2013 at the Toad World blog address http://www.toadworld.com/platforms/sql-server/b/weblog/archive/2013/09/10/weka-3-data-mining-software-in-java.aspx

[2] I rarely mention Data Analysis Expressions (DAX) in these contexts because I have found it more difficult to cook up custom data mining routines with it, in contrast to MDX and T-SQL. Perhaps my judgment is clouded though by the fact that I just plain dislike the language itself, which I find unnecessarily awkward and inflexible for my tastes.

[3] CBS News, 2013, “U.S. Tells Computer Users to Disable Java Software,” published Jan. 11, 2013 at the CBSNews.com web address http://www.cbsnews.com/news/us-tells-computer-users-to-disable-java-software/

[4] “If one prefers a MDI (“multiple document interface”) appearance, then this is provided by an alternative launcher called ‘Main’ (class weka.gui.Main).”

[5] Of course, at that rate the labor would end up paying for the program, rather than producing any immediate profit. The gist of the argument is clear though.

[6] p. 177, Bouckaert, Remco R.; Frank, Eibe; Hall, Mark; Kirkby, Richard; Reutemann, Peter; Seewald, Alex; Scuse, David, 2014, WEKA Manual for Version 3-7-11. The University of Waikato: Hamilton, New Zealand.

[7] IBID., p. 195.

[8] Hansen, Anders Spur, 2013, “Connect WEKA to SQL Server 2012 and ‘14’” posted Oct. 11, 2013 at the My Life with Business Intelligence blog address http://andersspur.wordpress.com/2013/10/11/connect-weka-to-sql-server-2012-and-14/

[9] See the replies by the user named “M i k e” and Joris Valkonet to the MSDN thread “Need Help Implementing Weka into SQL Server 2005” on Aug. 22, 2007. Available online at


Outlier Detection with SQL Server, part 6.3: Visual Outlier Detection with Reporting Services Plots and SSDM Clustering

By Steve Bolton

…………When the goal is to illustrate just how outlying an outlier may be, the efficiency with which scatter plots represent distances really can’t be beaten. It doesn’t take any training in mathematics to look at one and notice that a few data points are further away from the others, in proportion to how different they are from the normal values by some particular measure. Any five-year-old can grasp that idea. It isn’t necessarily easy to calculate the measures that populate many of the more advanced species of scatter plots, but you still don’t need to be a rocket scientist to interpret them. That makes them an ideal choice when our use cases call for easily interpretable visualizations. There are still some drawbacks to scatter plots and their variants – including their simplicity, which can be a weakness when our user requirements call for greater sophistication. Like every other visual method outlined in this series of amateur mistutorials on identifying outliers with SQL Server, it amounts to a brute force, unconscious assessment that “if something looks out of place, we’ll look at it more closely.” That tells us nothing about why a data point is an outlier, or whether or not the domain knowledge and underlying data distribution predict how many values we ought to find at that particular point. Sooner or later, every scatter plot can be saturated with so many data points that they obscure the information we’re looking for. Every outlier detection method we’ve surveyed in this series has suffered from some similar limitation imposed by the sheer scale of the tables commonly seen in SQL Server, which can number in the thousands, millions or even billions of rows. Nevertheless, scatter plots take a lot longer than some other methods to reach the breaking point where strategies like binning and banding have to be applied; that is mainly because they can show outliers at any point in a graph of two or more dimensions, whereas other display techniques like box plots limit them to very confined spaces. Even the relatively efficient histogram/run chart hybrids discussed in a previous post retain a lot of the white space above their bars, where no outliers can be depicted; this is even before we take into consideration the binning that they are more susceptible to, which tends to hide outliers. Given how flexible and efficient they are, it is not surprising that there are so many variations on the theme. In this installment of the series, however, I’ll concentrate on just three subtypes that are frequently mentioned in connection with outlier detection in the data mining literature.
…………The logic of scatter plots is so simple, in fact, that it requires less T-SQL to populate one through a stored procedure than in any other code sample I’ve posted in this series. Readers of this series have probably grown accustomed to the familiar list of parameters, which allows users to select a count for all the distinct combinations of two columns in any single table in any database they have access to. The rest is just the usual dynamic SQL and the debugging string on the next-to-last line. The chart in Figure 2 is fairly self-explanatory: on the horizontal axis we find the value for the Hemopexin protein in the 209-row dataset on the Duchennes form of muscular dystrophy we’ve been using for sample purposes throughout this series (which I downloaded from Vanderbilt University’s Department of Biostatistics and converted to a SQL Server table). On the vertical axis we find the value of the Creatine Kinase enzyme; wherever there’s a scatter plot point, we have a combination of the values of both columns. The only thing that might require explanation here is the size of the bubbles in the 3D Bubble Chart included with Report Builder, which I tied to the count of the records for each combination. As anyone can see, the bulk of the values are centered on the bottom right, but there are two separate groups of outliers, which I’ve drawn red circles around.

Figure 1: Code for the Simple Scatter Plot Procedure
CREATE PROCEDURE [Calculations].[SimpleScatterPlotSP]
@DatabaseName AS nvarchar(128) = NULL, @SchemaName AS nvarchar(128), @TableName AS nvarchar(128), @ColumnName1 AS nvarchar(128), @ColumnName2 AS nvarchar(128)
AS
BEGIN
DECLARE @SchemaAndTableName nvarchar(400), @SQLString nvarchar(max)
SET @SchemaAndTableName = ISNULL(@DatabaseName, '') + '.' + @SchemaName + '.' + @TableName

SET @SQLString = 'SELECT ' + @ColumnName1 + ' AS Column1, ' + @ColumnName2 + ' AS Column2,
       Count(*) OVER (PARTITION BY ' + @ColumnName1 + ', ' + @ColumnName2 + ' ORDER BY ' + @ColumnName1 + ', ' + @ColumnName2 + ') AS CombinationCount
       FROM ' + @SchemaAndTableName + '
       WHERE ' + @ColumnName1 + ' IS NOT NULL AND ' + @ColumnName2 + ' IS NOT NULL'

--SELECT @SQLString -- uncomment this to debug dynamic SQL string errors
EXEC (@SQLString)
END
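…………A call like the following returns the three columns the Report Builder bubble chart consumes; the database and schema names here are just placeholders for wherever the DuchennesTable actually resides:

-- the database and schema names are hypothetical placeholders; the column names come from the DuchennesTable
EXEC Calculations.SimpleScatterPlotSP
       @DatabaseName = N'DataMiningProjects',
       @SchemaName = N'Health',
       @TableName = N'DuchennesTable',
       @ColumnName1 = N'Hemopexin',
       @ColumnName2 = N'CreatineKinase'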

Figure 2: Simple 3D Scatter Plot Example with Outliers

…………As Stan Lee might say, ‘Nuff Said. We might be able to add more useful information through such tricks as setting the bubbles for the normal values semi-transparent or the color of the outliers to an eye-catching red, or setting the ToolTip to display the value associated with each outlier (as I did here, although my screen capture utility failed to pick it up). Yet the primary means of adding sophistication is by changing the measures used on both axes. It is quite common to compare multiple columns using bubble of different colors, but this is unsuitable for our purposes because it would quickly conceal the outliers for all columns in a mass of clutter. Adding columns to the mix might be unwise, but using more complex statistical means to calculate the data points would not cause clutter by saturating it with data points. There might be a performance cost in the calculations and the ease of interpretation would decline as the meaning of the measures grows more intricate, but in many use cases such drawbacks aren’t problematic. Our imagination is really the only limit on the kinds of measures we can put on the axes of scatter plots, but two in particular are singled out as outlier detection methods in statistical literature like the National Institute for Standards and Technology’s Engineering Statistics Handbook. One of these is the lag plot[1], which takes very little code to implement thanks to the Lag windowing function added in to T-SQL in SQL Server 2012. The basic idea is to compare the value of a column on the vertical axis against a prior data point in the column, by some fixed amount that is often measured in time slices. I’ve had some issues converting the SampleDate in the DuchennesTable accurately to a SQL Server data type, but regardless of whether the results are accurate, Figure 4 still illustrates how unusual values can be exposed through such a technique. Lag plots are a less appropriate means of outlier detection than a regular scatter plot because their primary purpose is measuring the randomness of a dataset, which is tangentially rather than directly related to finding outliers, i.e. seemingly random points within a non-random pattern. A lag value of one time slice is the most common value input into a lag plot, although this procedure allows it to be set to any arbitrary bigint value through the @LagInterval parameter. There are apparently many nuances to the interpretation of lag plots that amount to pattern recognition, such as the identification of sinusoidal patterns with cyclical activity[2], but that is really beyond our purview. In the absence of some rigorous logical or mathematical definition of what a “pattern” is, we basically have to use the old rule of thumb: if it looks out of place, it might be an outlier.

Figure 3: Code for the Lag Plot Procedure
CREATE PROCEDURE [Calculations].[LagPlotSP]
@DatabaseName AS nvarchar(128) = NULL, @SchemaName AS nvarchar(128), @TableName AS nvarchar(128), @ColumnName AS nvarchar(128), @LagColumnName AS nvarchar(128), @LagInterval bigint
AS
BEGIN
DECLARE @SchemaAndTableName nvarchar(400), @SQLString nvarchar(max)
SET @SchemaAndTableName = ISNULL(@DatabaseName, '') + '.' + @SchemaName + '.' + @TableName

SELECT @SQLString = 'SELECT DISTINCT ' + @ColumnName + ' AS ColumnValue, LagResult,
       Count(*) OVER (PARTITION BY ' + @ColumnName + ', LagResult ORDER BY LagResult) AS CombinationCount
       FROM (SELECT TOP 99999999999 ' + @ColumnName + ', ' + @LagColumnName + ' AS LagColumn,
             Lag(' + @ColumnName + ', ' + CAST(@LagInterval AS nvarchar(50)) + ') OVER (ORDER BY ' + @LagColumnName + ') AS LagResult
             FROM ' + @SchemaAndTableName + '
             WHERE ' + @ColumnName + ' IS NOT NULL) AS T1
       WHERE LagResult IS NOT NULL
       ORDER BY ColumnValue, LagResult ASC'

--SELECT @SQLString -- uncomment this to debug the dynamic SQL
EXEC (@SQLString)
END
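…………Under the same hypothetical naming assumptions, the lag plot in Figure 4 can be populated with a call like this one, using the SampleDate column mentioned above as the ordering column and a lag of one time slice:

-- the database and schema names are hypothetical placeholders
EXEC Calculations.LagPlotSP
       @DatabaseName = N'DataMiningProjects',
       @SchemaName = N'Health',
       @TableName = N'DuchennesTable',
       @ColumnName = N'LactateDehydrogenase',
       @LagColumnName = N'SampleDate',
       @LagInterval = 1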

Figure 4: Lag Plot Example with Outliers

…………As can be gathered from the length of the code for the two procedures in Figure 5, computing an autocorrelation plot is somewhat trickier.[3]  The basic idea is that instead of partitioning a range of values for a single column on the horizontal axis by a single lag value, we instead make a comparison against a range of lag values. The second procedure merely creates a list of lag values within the limits specified by the user through the @LagBegin, @LagEnd and @LagStep parameters, then calls the first procedure iteratively to return an autocorrelation value for each row. The calculations for this stat are a little more involved than for a simple lag, but not nearly as difficult and dry as some of those introduced in previous posts – let alone the monster procedures required for the next two tutorials on Cook’s Distance and Mahalanobis Distance.[4] On occasion I have received values greater than one for these autocorrelations, which is abnormal but apparently allowable when the values have been normalized (i.e. recalibrated to a different scale, rather than being “normalized” like a database schema in DBA lingo).[5] Alas, the point is moot anyways because autocorrelation plots are not as useful as lag plots for identifying outliers on the kind of scale DBAs operate at, just as lag plots are in turn less well-suited in many cases than ordinary scatter plots. The basic concept is that the further the values are from the baseline of zero – like the first data point in Figure 6 – the less random the process is. The more values that are closer to zero, the more random the dataset is.[6] This really amounts to shifting back to a bar chart-style type of visualization, which can’t display as many outliers as an ordinary scatter plot. It also requires more computation and more interpretation, since the meaning is not readily apparent to the untrained eye. To make matters worse, autocorrelation is designed to depict the degree of randomness exhibited by a dataset, which requires us to identify a pattern before looking for an exception to it; with ordinary scatter plots, any pattern and its exceptions are evident in a single step. Another difficulty with this approach is that you have to take the automatic rescaling into account; I also tried this on the Hemopexin, CreatineKinase and PyruvateKinase columns and there were a couple of points that were further away from the others, but the autocorrelation values ranged between roughly 0 and 0.25, which isn’t that big of a gap; yet with the LactateDehydrogenase column in Figure 6, they’re between -0.335196 and 0.405092, which is about a third of the scale from -1 to 1.
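…………For reference, the lag-k autocorrelation coefficient that the first procedure below approximates is usually written as follows, with Ȳ standing for the column mean, N for the record count and k for the lag interval (this is the standard form given in the NIST handbook cited in the footnotes):

r_k = \frac{\sum_{i=1}^{N-k} (Y_i - \bar{Y})(Y_{i+k} - \bar{Y})}{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}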

Figure 5: Code for the Two Autocorrelation Plot Procedures
CREATE PROCEDURE [Calculations].[AutocorrelationSP]
@DatabaseName AS nvarchar(128) = NULL, @SchemaName AS nvarchar(128), @TableName AS nvarchar(128), @ColumnName AS nvarchar(128), @LagColumnName AS nvarchar(128), @LagInterval AS bigint, @DecimalPrecision AS nvarchar(50)
AS
BEGIN
DECLARE @SchemaAndTableName nvarchar(400), @SQLString nvarchar(max)
SET @SchemaAndTableName = ISNULL(@DatabaseName, '') + '.' + @SchemaName + '.' + @TableName

SET @SQLString = 'DECLARE @Mean AS decimal(' + @DecimalPrecision + '), @NCount AS bigint
SELECT @Mean = Avg(' + @ColumnName + '), @NCount = Count(' + @ColumnName + ')
FROM ' + @SchemaAndTableName + '

SELECT TOP 1 SUM(TopOperand) OVER (ORDER BY RN) / BottomOperand AS AutoCorrelationCoefficient
FROM (SELECT RN, TopOperand, SUM(BottomOperand) OVER (ORDER BY RN) AS BottomOperand -- the numerator summation runs to N minus the lag; the denominator running sum is tied to RN, not RN - @LagInterval
      FROM (SELECT RN, ' + @ColumnName + ',
                   (' + @ColumnName + ' - @Mean) * (Lag(' + @ColumnName + ', ' + CAST(@LagInterval AS nvarchar(50)) + ') OVER (ORDER BY RN) - @Mean) AS TopOperand,
                   Power((' + @ColumnName + ' - @Mean), 2) AS BottomOperand -- this middle query is necessary because we need to add RN to the LagInterval for one calculation
            FROM (SELECT ROW_NUMBER() OVER (PARTITION BY 1 ORDER BY ' + @LagColumnName + ' DESC) AS RN, ' + @ColumnName + '
                  FROM ' + @SchemaAndTableName + ') AS T1
            GROUP BY RN, ' + @ColumnName + ') AS T2) AS T3
WHERE RN <= @NCount - ' + CAST(@LagInterval AS nvarchar(50)) + '
ORDER BY RN DESC -- so that TOP 1 deterministically picks the row holding the full running sum'

--SELECT @SQLString -- uncomment this to debug the dynamic SQL
EXEC (@SQLString)
END

CREATE PROCEDURE [Calculations].[AutocorrelationPlotSP]
@DatabaseName AS nvarchar(128) = NULL, @SchemaName AS nvarchar(128), @TableName AS nvarchar(128), @ColumnName AS nvarchar(128), @LagColumnName AS nvarchar(128), @LagBegin decimal(38,21), @LagEnd decimal(38,21), @LagStep decimal(38,21), @DecimalPrecision AS nvarchar(50)
AS
BEGIN
DECLARE @InputTable table
(ID bigint IDENTITY (1,1),
 LagAmount decimal(38,21))

DECLARE @ResultTable table
(ID bigint IDENTITY (1,1),
 AutoCorrelationValue decimal(38,21))

-- use the standard CTE method of generating sequences to populate the lag amounts for the table
;WITH RangeCTE(RangeNumber) AS
(SELECT @LagBegin AS RangeNumber

       UNION ALL

       SELECT RangeNumber + @LagStep
       FROM RangeCTE
       WHERE RangeNumber < @LagEnd)

INSERT INTO @InputTable (LagAmount)
SELECT RangeNumber
FROM RangeCTE
ORDER BY RangeNumber ASC
OPTION (MAXRECURSION 0) -- allows more than the default 100 lag steps

DECLARE @SQLString nvarchar(max),
@CurrentTableVarID bigint = 0,
@MaxTableVarID bigint = 0,
@CounterCheck bigint = 0,
@LagInterval decimal(38,21)

SELECT @CurrentTableVarID = Min(ID), @MaxTableVarID = Max(ID)
FROM @InputTable

WHILE @CurrentTableVarID <= @MaxTableVarID
BEGIN
       SELECT @LagInterval = LagAmount
       FROM @InputTable
       WHERE ID = @CurrentTableVarID

       SET @SQLString = 'EXEC Calculations.AutocorrelationSP
       @DatabaseName = ' + CAST(@DatabaseName AS nvarchar(50)) + ',
       @SchemaName = ' + CAST(@SchemaName AS nvarchar(50)) + ',
       @TableName = ' + CAST(@TableName AS nvarchar(50)) + ',
       @ColumnName = ' + CAST(@ColumnName AS nvarchar(50)) + ',
       @LagColumnName = ' + CAST(@LagColumnName AS nvarchar(50)) + ',
       @LagInterval = ' + CAST(@LagInterval AS nvarchar(50)) + ',
       @DecimalPrecision = ''' + CAST(@DecimalPrecision AS nvarchar(50)) + ''''

       --SELECT @SQLString -- uncomment this to debug the dynamic SQL

       INSERT INTO @ResultTable (AutoCorrelationValue)
       EXEC (@SQLString)

       SET @CounterCheck = @CounterCheck + 1
       SET @CurrentTableVarID = @CurrentTableVarID + 1 -- increment the loop
END

SELECT LagAmount, AutoCorrelationValue
FROM @InputTable AS T1
       INNER JOIN @ResultTable AS T2
       ON T1.ID = T2.ID
END
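…………Again with placeholder database and schema names, the chart in Figure 6 can be fed by a call along these lines, which loops through lag values 1 through 20 in steps of 1:

-- the database and schema names are hypothetical placeholders; the lag range and precision are arbitrary choices
EXEC Calculations.AutocorrelationPlotSP
       @DatabaseName = N'DataMiningProjects',
       @SchemaName = N'Health',
       @TableName = N'DuchennesTable',
       @ColumnName = N'LactateDehydrogenase',
       @LagColumnName = N'SampleDate',
       @LagBegin = 1,
       @LagEnd = 20,
       @LagStep = 1,
       @DecimalPrecision = N'38,21'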

Figure 6: Autocorrelation Plot Example with Outliers

…………When all is said and done, one of the more effective visual means of identifying outliers available to us out-of-the-box is also one of the least publicized: the Microsoft Cluster Viewer, which depicts the results calculated by the two SQL Server Data Mining (SSDM, to coin an abbreviation) Clustering algorithms. The calculations are fast and don’t require any custom code of the kind I’ve posted here. Best of all, it can do what Reporting Services cannot: efficiently depict both summary stats and individual data points together, by clicking on a cluster and viewing the Drill Through window (by selecting Model Columns Only or Model and Structure Columns in the submenu), as seen in Figure 7. As discussed in the last post, SSRS is crippled by its simultaneous inabilities to consume more than one result set from a stored procedure in a single dataset, to consume more than one dataset in a single chart, or to calculate statistics as efficiently as T-SQL or .Net languages like Visual Basic. Perhaps it is possible to add this functionality through custom .Net code, but if so, the workaround hasn’t been well-publicized, judging from the frequency with which the complaint is raised in coding forums. As we have seen in this segment of the series, SSRS can still be used as an effective outlier detection tool, but much of its potential has not been realized. The potential of SSDM, in contrast, has been realized but not recognized by the SQL Server community at large. As discussed in my last tutorial series, SSDM as a whole is perhaps the most unsung and long-neglected component of SQL Server. One of its many overlooked benefits is the Cluster Viewer, which acts much like a scatter plot, except that the points represent a set of summary stats and the associated data points, rather than one or the other as would be the case in any SSRS chart. By using a combination of the Density shading color, the Shading Variable dropdown and various values of the State dropdown (like Very Low, Low, Average, High and Very High), outliers easily become apparent. The All Links slider can also be used to show or hide links between clusters, which isn’t possible between data points in an ordinary scatter plot. As I pointed out in the last two posts, I’m trying to avoid tutorial clutter by omitting descriptions of basic tasks in Reporting Services, like adding a diagram to a report; in the same vein, I’m not going to get into an in-depth description of the Clustering algorithm, which I’ve already introduced in A Rickety Stairway to SQL Server Data Mining, Algorithm 7: Clustering. As discussed there, the Cluster Viewer can be used to display two types of results, one of which is derived from the Expectation Maximization (EM) version of the Clustering algorithm. It also displays the results of the K-Means Clustering algorithm, which is closely related to the popular K-Nearest Neighbors formulas that are apparently used quite frequently in outlier detection.[7] All of the methods in this family are based on the idea of calculating a relative distance between data points, which is a somewhat more sophisticated measure than the crude distances implied in scatter plots, or the difference between each data point and the mean or median, as seen in Z-Scores and many other outlier detection methods we’ve already discussed. This particular flavor of SSDM Clustering is really the first of three outlier detection methods that are based on such distance calculations. In the next two posts I’ll close out the series by discussing the two most sophisticated distance-based outlier identification methods I could find in the data mining literature, Cook’s Distance and Mahalanobis Distance. There are some costs in terms of interpretability, performance and the legibility of the lengthy code, but they address some unique needs in the field of outlier identification.
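…………To make that contrast concrete before those two posts, here is a bare-bones sketch of the sort of crude distance mentioned above: a plain Euclidean distance of each record from the two-column mean, with no adjustment for variance or correlation, which is precisely the shortcoming Cook’s Distance and Mahalanobis Distance are designed to address. The column names come from the Duchennes examples used throughout this segment, while the schema name and ID column are placeholders:

-- crude Euclidean distance of each record from the column means; no variance or correlation adjustment
DECLARE @Mean1 float, @Mean2 float
SELECT @Mean1 = Avg(CAST(Hemopexin AS float)), @Mean2 = Avg(CAST(CreatineKinase AS float))
FROM Health.DuchennesTable

SELECT ID, Hemopexin, CreatineKinase,
       Sqrt(Power(Hemopexin - @Mean1, 2) + Power(CreatineKinase - @Mean2, 2)) AS CrudeDistance
FROM Health.DuchennesTable
ORDER BY CrudeDistance DESC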

Figure 7: SSDM Clustering Example with a Cluster of Outliers

[1] National Institute for Standards and Technology, 2014,  “ Lag Plot,” published in the online edition of the Engineering Statistics Handbook. Available online at http://www.itl.nist.gov/div898/handbook/eda/section3/lagplot.htm

[2] IBID.

[3] I tried to use the windowing frame clause ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING in this code but couldn’t get it to perform the way I wanted. I may take another shot at it though if I ever need to rewrite this procedure for some practical application.

[4] I derived this code from the formulas given at two sources: National Institute for Standards and Technology, 2014, “Autocorrelation,” published in the online edition of the Engineering Statistics Handbook, available online at http://www.itl.nist.gov/div898/handbook/eda/section3/eda35c.htm. Also see the course notes published by McInally, Cameron, 2008, “WCSLU2850.Lo1 Web Project 11,” published April 23, 2008 at the Coparoom.com web address http://www.coparoom.com/archive/Fordham/courses/Spring2008/CS2850/web_project/project11.html

[5] See the reply by the user named thrillhouse86 to the thread "Can Autocorrelation Be Greater Than One?" published April 27, 2010 at the Physics Forums web address http://www.physicsforums.com/showthread.php?t=392277

[6] National Institute for Standards and Technology, 2014, "Autocorrelation Plot," published in the online edition of the Engineering Statistics Handbook. Available online at http://www.itl.nist.gov/div898/handbook/eda/section3/autocopl.htm

[7] See the Wikipedia page “Outlier” at http://en.wikipedia.org/wiki/Outlier

Outlier Detection with SQL Server, part 6.2: Finding Outliers Visually with Reporting Services Box Plots

 By Steve Bolton

…………Throughout this series of amateur mistutorials on using SQL Server to identify outliers, we have repeatedly seen that the existing tried-and-true methods of detection long used for such purposes as hypothesis testing are actually poorly suited for finding aberrant values in large databases. The same problem of scale also affects the simple visual inspection methods we're surveying in this segment of the series; for example, we reached a point in last week's tutorial where histograms and run charts had to be binned to accommodate large record counts. The problem with such strategies for our purposes is that they blur outliers by concealing them in a mass of other values, rather than highlighting how much they contrast with normal values. That difficulty is even more pronounced with box plots, a well-established form of outlier detection that unfortunately doesn't seem to scale well. Because all of the outliers are represented on a single axis, they easily blur into a single undifferentiated line after the accumulation of a handful of records. This is a problem with all visual inspection methods, but it is more pronounced with box plots and their relatives because they're limited to representing outliers in a single dimension. The histograms we introduced last week and the variants of scatter plots we'll discuss in the next installment have the advantage of displaying values in two dimensions, which conveys information in the available space in a much more efficient way. Because of this insurmountable, built-in limitation, I'll only spend a short time discussing how to implement this family of plots in Reporting Services. I'm omitting discussion of probability plots from this series altogether for basically the same reason: they certainly have their uses, as we shall discover in a future series on goodness-of-fit testing, but for them outlier detection only amounts to an afterthought.
…………Don’t get me wrong: box plots are entirely appropriate for certain use cases, including detecting a handful of outliers. Their most common use, however, is in comparing the variability of multiple columns against each other, or a single column against itself after partitioning it by some flag; this is especially useful when comparing trials of scientific experiments against each other. The technique was popularized by 20th Century statistician John Tukey, who is recognized as one of the Founding Fathers of the field of data mining[1] and was also instrumental in the development of a number of important statistical tools, particularly the Fast Fourier Transform (FFT).[2] He was also apparently quite sane, in contrast to the many famous mathematicians and physicists who have completely lost their minds and morals, as I pointed out a few times in my last tutorial series, A Rickety Stairway to SQL Server Data Mining; he had the common sense to correct the uncommon nonsense found in the infamously flawed Kinsey Report on human sex habits.[3] Edward Tufte, whose name is synonymous with data visualization, has nothing but praise for Tukey.[4] They may be ugly, but the box plots he invented are very effective in conveying some of the stats associated with Interquartile Range, if you know how to read them. Unfortunately, most of the emphasis is on comparing the variability of data over a set of columns or trials, not finding outliers, so the outlier information we’re looking for is obscured.
…………When interpreting plots of the kind depicted in Figure 1, just remember this simple rule mentioned by Kaiser Fung in an old post on data visualization: “the box contains the middle 50% of the data…the line inside the box is the median score; the dots above (or below, though nonexistent here) the vertical lines are outliers.”[5] The edges of the box represent the first and third quartiles, which we discussed earlier in this series in the post on Interquartile Range. The whiskers are the tricky part, since they can represent all kinds of different measures, like standard deviations and various percentiles of values.[6] Sometimes the min and max values of the dataset are used for the whiskers, but this is unsuited to our purposes because it would further obscure any outliers. My implementation is equivalent to a Tukey box plot, in which the whiskers represent the inner fence values of the Interquartile Range and another line is added to represent the median. The yellow points represent the maximum and minimum values where the Interquartile Range procedure returned an OutlierDegree equal to 1 and the red ones correspond to the same for OutlierDegree = 2. This isn’t standard practice, so I had to add code to the SSRS report to implement it.
…………Keep in mind when interpreting Figure 1 that there may be many more outliers than the four dots depicted here, which take up a minuscule amount of the available space and thus convey the information we need very inefficiently. As I will discuss in more depth momentarily, SSRS simply doesn't provide an out-of-the-box way to drill down to the records that ought to accompany the kind of summary statistics that the mean, median, Interquartile Range and the like represent. We're therefore limited to displaying just a few data points based on the minimum and maximum values associated with the class of OutlierDegree they belong to. And even if we could access all of the individual data points alongside the summary statistics, as is normally the case with other implementations of box plots, we still wouldn't be able to display them all because of the scaling issue. Adding more than a handful of values above or below the boxes quickly blurs them into an undifferentiated line, thereby concealing the information we're after. If our purpose is looking for aberrant values, then it is much more efficient to simply display all the data points at once in a table ordered by the OutlierDegree flag column, with the summary statistics neatly available in a single view above it. For an example, see the tables returned in Outlier Detection with SQL Server, part 5: Interquartile Range. Figure 1 doesn't present any new facts or unearth any buried information that isn't already freely available to us in the table format, nor does it execute any faster. The data below comes from the same 209-row dataset on the Duchennes form of muscular dystrophy I have been using throughout this series for examples.[7] Since I'm not familiar with Duchennes[8] or biochemistry, I might be unwittingly making an apples-and-oranges comparison between the values for the protein Hemopexin and the enzyme Lactate Dehydrogenase. The purpose here was merely to demonstrate how difficult it is to spot outliers with this technique, not to expose any relationship between the columns, so it's beside the point anyways. It is worth noting though that adding the columns for the Creatine Kinase and Pyruvate Kinase enzymes to the box plot quickly rendered it unreadable, due to a few extreme outliers stretching out the vertical scale too far – which gives you an idea of how fragile box plots can be when applied to our purposes.

Figure 1: A Sample Box Plot Report with Outliers (click to enlarge)


…………To feed data to these reports, I had to rejig the stored procedure used in the aforementioned article on Interquartile Range and write the T-SQL code in Figure 2 to call it once for each comma-separated column included in the @ColumnString parameter. The length of the code may seem intimidating, but it's actually quite simple; the SplitColumnNameStringCTE merely splits @ColumnString into individual column names stored in a table variable, which is joined to a second table variable that holds the results for each column. The rest of the code merely loops over each column in the list and feeds it to the Interquartile Range procedure; the number of parameters we need to feed to that procedure and the large number of return fields accounts for much of the length of this code. The first three parameters also enable users to select columns in any single table in any database for which they have the requisite access. It is certainly possible to extend this procedure to compare columns from multiple tables, but this is just for illustration purposes, so I kept it as simple as I could. I've included the @DecimalPrecision parameter in most of the procedures I've posted in this series so that users can adjust the precision and scale of the calculations to avoid overflows, but it may be necessary to ratchet it down further to keep Reporting Services from returning blank reports, as SSRS was doing with the run sequence plots in last week's tutorial for some unfathomable reason.

Figure 2: Code for the Multiple IQR Stored Procedure
CREATE PROCEDURE [Calculations].[MultipleIQRSP]
@DatabaseName as nvarchar(128) = NULL, @SchemaName as nvarchar(128), @TableName as nvarchar(128), @ColumnString AS nvarchar(128), @PrimaryKeyName as nvarchar(400), @DecimalPrecision AS nvarchar(50)
AS

DECLARE @SchemaAndTableName nvarchar(400)
SET @SchemaAndTableName = ISNull(@DatabaseName, '') + @SchemaName + '.' + @TableName

DECLARE @ColumnTable table
(ID bigint IDENTITY (1,1),
ColumnName nvarchar(128))

DECLARE @ResultTable table
(ID bigint IDENTITY (1,1),
Mean decimal(38,9),
Median decimal(38,9),
LowerQuartile decimal(38,9),
UpperQuartile decimal(38,9),
InterquartileRange decimal(38,9),
LowerInnerFence decimal(38,9),
UpperInnerFence decimal(38,9),
LowerOuterFence decimal(38,9),
UpperOuterFence decimal(38,9),
OutlierDegreeMax1 decimal(38,9),
OutlierDegreeMin1 decimal(38,9),
OutlierDegreeCount1 bigint,
OutlierDegreeMax2 decimal(38,9),
OutlierDegreeMin2 decimal(38,9),
OutlierDegreeCount2 bigint)

; WITH SplitColumnNameStringCTE
(TempPatIndex, LeftString, RemainingString, StringOrder)
AS (SELECT TempPatIndex, LEFT(@ColumnString, TempPatIndex) AS LeftString, RIGHT(@ColumnString, LEN(@ColumnString) - TempPatIndex) AS RemainingString, 1 AS StringOrder
       FROM (SELECT PATINDEX('%,%', @ColumnString) AS TempPatIndex) AS T1
       UNION ALL /* after splitting the string, send the remainder back to the PATINDEX and LEFT/RIGHT functions in the part below */
       SELECT NewPatIndex, LeftString = CASE WHEN LEFT(LastString, NewPatIndex) = '' THEN RIGHT(LastString, LEN(LastString) - NewPatIndex)
              WHEN LEFT(LastString, NewPatIndex) IS NULL THEN RIGHT(LastString, LEN(LastString) - NewPatIndex)
              ELSE LEFT(LastString, NewPatIndex) END,
       RIGHT(LastString, LEN(LastString) - NewPatIndex) AS RemainingString, StringOrder + 1
       FROM (SELECT PATINDEX('%,%', RemainingString) AS NewPatIndex, RemainingString AS LastString, StringOrder
              FROM SplitColumnNameStringCTE
              WHERE LeftString IS NOT NULL AND LeftString != '' AND LeftString LIKE '%,%') AS T1)

INSERT INTO @ColumnTable
(ColumnName)
SELECT T1.SplitString AS ColumnName
FROM (SELECT TOP 99999999999 REPLACE(REPLACE(LeftString, ',', ''), ' ', '') AS SplitString, StringOrder
       FROM SplitColumnNameStringCTE
       ORDER BY StringOrder) AS T1

DECLARE @CurrentID bigint = 0, @MaxID bigint = 0, @CurrentColumnName nvarchar(128)

SELECT @MaxID = Max(ID) FROM @ColumnTable

WHILE @CurrentID < @MaxID
BEGIN
       SET @CurrentID = @CurrentID + 1 -- increment the loop

       SELECT @CurrentColumnName = ColumnName
       FROM @ColumnTable
       WHERE ID = @CurrentID

       INSERT @ResultTable
       (Mean, Median, LowerQuartile, UpperQuartile, InterquartileRange, LowerInnerFence, UpperInnerFence, LowerOuterFence, UpperOuterFence, OutlierDegreeMax1,
       OutlierDegreeMin1, OutlierDegreeCount1, OutlierDegreeMax2, OutlierDegreeMin2, OutlierDegreeCount2)
       EXEC Calculations.InterquartileRangeSP3 @DatabaseName, @SchemaName, @TableName, @CurrentColumnName, @PrimaryKeyName, @DecimalPrecision
END

SELECT ColumnName, Mean, Median, LowerQuartile, UpperQuartile, InterquartileRange, LowerInnerFence, UpperInnerFence, LowerOuterFence, UpperOuterFence, OutlierDegreeMax1,
OutlierDegreeMin1, OutlierDegreeCount1, OutlierDegreeMax2, OutlierDegreeMin2, OutlierDegreeCount2
FROM @ColumnTable AS T1
       INNER JOIN @ResultTable AS T2
       ON T1.ID = T2.ID
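…………To make the call pattern concrete, a test run of this procedure might look something like the following snippet. Treat it as a hypothetical example rather than a tested call: the database, schema and table names are merely the Duchennes practice objects used elsewhere in this series, and the two columns in @ColumnString are simply ones mentioned in the discussion above, so substitute whatever objects exist on your own server.

-- a hypothetical call, assuming the DataMiningProjects.Health.DuchennesTable practice objects
-- used throughout this series; the column list is purely illustrative
EXEC [Calculations].[MultipleIQRSP]
              @DatabaseName = N'DataMiningProjects',
              @SchemaName = N'Health',
              @TableName = N'DuchennesTable',
              @ColumnString = N'Hemopexin,CreatineKinase',
              @PrimaryKeyName = N'ID',
              @DecimalPrecision = N'38,21'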

…………As discussed in last week's tutorial, describing how to do basic SSRS tasks like adding data sources is not part of the scope of this series; there are plenty of other tutorials available on the Web which explain them better than I can. I will mention a few critical details needed for my implementations though, like the technique discussed in the last tutorial for retrieving data from stored procedures and using it in an SSRS report. One potential "gotcha" I ought to highlight is the fact that the Mean and Median are sometimes left blank in the Series Properties window in Figure 3, even when they are correctly assigned in the Chart Data setup in Figure 4, so you may have to add the values again manually.

Figures 3 and 4: The Series Properties Window in Report Builder and the Chart Data Setup

Figure 5: Types of Range Charts Available in Report Builder

…………It may be worth noting here that box plots are grouped together in Report Builder under the Range heading with Smooth Range and Range Column, which we discussed last week, as well as the Stock, Candlestick and Error Bar charts, as seen in Figure 5. The latter three are just stripped-down variants of a box plot, so there's no sense in discussing them further unless someone can point out a real need for more detail. I'm not aware of any means of implementing violin plots, another popular variant on the box plot, through Reporting Services out-of-the-box, although it might be possible to write custom code that achieves this end. That leaves the Range Bar, which as shown in Figure 6, can be used to conveniently compare the various fence values and quartiles returned with the Interquartile Range.

Figure 6: Range Bar Example (click to enlarge)

…………Please note that Report Builder quits altogether whenever I try to combine a range bar with any variant of a scatter plot, so it is apparently difficult to enhance them further for the purposes of outlier detection. That means I can't even apply the technique for combining box plots with scatter plots, as presented by Mike Davis in his excellent tutorial, "How to Make a Box Plot Chart in SQL Reporting Services 2008 SSRS."[9] That is how I managed to get the four measly dots into Figure 1, which might be sufficient if we were only doing exploratory data mining or hypothesis testing, but is woefully inadequate if our primary goal is finding outliers. Furthermore, as Davis points out, we're better off calculating the quartiles and other stats associated with Interquartile Range ourselves anyways, since "Reporting services does not do a good job of calculating these numbers. The best thing to do is have analysis services calculate these for you or use a stored procedure to produce them."[10] So the optimal way to go about it is to design a stored procedure of the kind I used here, but that presents another problem which really amounts to an unnecessary complication: Reporting Services will only recognize the first result set a stored procedure feeds to it, so you can't return summary statistics and the records they're calculated from in the same dataset.[11] And since you can't include them in the same dataset, they can't be combined in the same chart. One unworkable workaround is to simply return the summary stats together with the data in a single denormalized table, but this is grossly inefficient at best and quickly becomes impractical as the number of rows increases. Another poor solution is to just send the data and then add code to recalculate the detail rows in the SSRS reports, but this forfeits all of the power and performance advantages of computing through set-based methods. I haven't had a chance yet to investigate other potential workarounds like writing Custom Data Extensions, using LINQ through a web service or operating on a local report in a WPF report control, but they all seem to be so awkward as to nullify the real selling point of using Reporting Services charts to identify outliers, which is their simplicity. At present I'm leaning towards trying to implement multiple resultsets through VB.Net code embedded in SSRS reports, but I have no idea if it's feasible. One of the most useful things Microsoft could do to improve the performance of Reporting Services in future editions of SQL Server might be to allow SSRS reports to consume multiple result sets of this kind, since it might drastically cut down the number of round-trips to the database and recalculations performed within it.
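…………For what it's worth, the "single denormalized table" workaround mentioned above can be sketched in a few lines of T-SQL, as below; every detail row simply repeats the summary values via empty OVER () clauses, which is exactly why it bloats so quickly as the row count climbs. The schema, table and column names are just the Duchennes practice objects used elsewhere in this series, so adjust them to your own data.

-- a minimal sketch of the denormalized single-result-set workaround: the summary stats are
-- repeated on every detail row, so SSRS only has to consume one result set; workable for
-- small tables, grossly wasteful for large ones
SELECT ID, PyruvateKinase,
       Avg(CAST(PyruvateKinase AS decimal(38,9))) OVER () AS Mean,
       Min(PyruvateKinase) OVER () AS Minimum,
       Max(PyruvateKinase) OVER () AS Maximum,
       Count(*) OVER () AS RecordCount
FROM Health.DuchennesTable
WHERE PyruvateKinase IS NOT NULL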
Microsoft has been in the habit of artificially limiting the usefulness of its software through such oversights and arbitrary limitations for so long that it almost seems to be part of the corporate culture; as one programmer whose name escapes me once put it, the company commonly adds great features to its software and then renders them useless, which he likened to saying, "Here's a glass of milk – with a hole in the bottom." This particular empty glass already forced me to truncate the histogram creation procedure in the last tutorial by stripping out useful code that tested the normality of the columns using the 3-Sigma Rule[12], by checking whether or not the first standard deviation comprised 68 percent of the values and the second and third comprised 95 and 99.7 percent respectively. It was trivial to calculate these tests on a pass-fail basis, but impossible to return the results efficiently because of this senseless limitation against multiple result sets in Reporting Services. It may be worthwhile to start a Connect request for this upgrade to SSRS, if one doesn't already exist. Yet even if the next version of SQL Server allowed us to consume multiple result sets in SSRS reports, that still wouldn't make box plots much more useful when the primary goal is exposing numerous outliers. For that objective, we would probably be much better off using the full two-dimensional space available to us in scatter plots, as we'll delve into in the next installment.


[1] Indiana University of Pennsylvania IT Prof. James A. Rodger says that the roots of modern data mining can be found in Tukey's exploratory analysis in the '70s. p. 178, Rodger, James A., 2003, "Utilization of Data Mining Techniques To Detect and Predict Accounting Fraud," pp. 174-187 in Pendharkar, Parag C. ed. Managing Data Mining Technologies in Organizations: Techniques and Applications. Idea Group Publications: Hershey, Pennsylvania.

[2] See the Wikipedia page “John Tukey” at  http://en.wikipedia.org/wiki/John_Tukey

[3] IBID.


[4] Tufte writes:

…………"John Tukey on data analysis and behavioral science, with a fierce attack on statistical practices for sanctification, formalism, and hiding the messiness inherent in real data."
…………"I first saw this as an unpublished manuscript as a graduate student in statistics at Stanford. It set the way for me in data analysis. When I interviewed at Princeton University for my first teaching job, John and I discussed badmandments. After circulating underground for years, John's essay was finally published in volume III of his collected papers…"

See Tufte, Edward, undated post titled "John Tukey 'Badmandments' in Statistical Work, Mainly in the Behavioral Sciences" at the EdwardTufte.com website. Available online at http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0003xA

[5] Fung, Kaiser, 2010, "Eye Heart This," published Aug. 12, 2010 at the Junk Charts website. Available online at http://junkcharts.typepad.com/junk_charts/boxplot/

[6] See the Wikipedia page “Box Plot” at  http://en.wikipedia.org/wiki/Box_plot

[7] I originally downloaded it from Vanderbilt University’s Department of Biostatistics.

[8] I am acquainted with two people who have it though, which is why I selected this particular biostats dataset.

[9] Davis, Mike, 2009, “How to Make a Box Plot Chart in SQL Reporting Services 2008 SSRS,” published Nov. 12, 2009 at the BIDN.com web address http://www.bidn.com/blogs/MikeDavis/ssis/163/how-to-make-a-box-plot-chart-in-sql-reporting-services-2008-ssrs Another post of interest is user3557796’s thread “SSRS Box Plot with Scatter Outliers,” published April 21, 2014 at the StackOverflow web address http://stackoverflow.com/questions/23204901/ssrs-box-plot-with-scatter-outliers

[10] IBID.

[11] The Internet is awash with threads about this topic. Some examples include Jes Borland’s post titled “T-SQL Tuesday #024: Reporting Services, Stored Procedures, and Multiple Result Sets” on Nov. 8, 2011 at LessThanDot, which can be accessed at http://blogs.lessthandot.com/index.php/DataMgmt/ssrs/t-sql-tuesday-024-reporting/ ; the post by the user named mou_inn in Microsoft’s ASP.Net Forum titled “Multiple Datasets from Stored Procedure in SSRS,” on Feb. 16, 2012, which is available at http://forums.asp.net/t/1770328.aspx?Multiple+Datasets+from+Stored+Procedure+in+SSRS ; and Matt Gibson’s reply on March 11, 2011 to the StackOverflow thread titled “Query That Returns Multiple Result Sets,” which is available at the web address http://stackoverflow.com/questions/5270843/query-that-returns-multiple-result-sets

[12] See the Wikipedia pages "68–95–99.7 Rule" and "Standard Deviation" at http://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule and http://en.wikipedia.org/wiki/Standard_deviation respectively.


Outlier Detection with SQL Server, part 6.1: Visual Outlier Detection with Reporting Services

 By Steve Bolton

…………Most of the previous articles in this series of self-tutorials on using SQL Server to find outliers required us to implement statistical formulas, in order to derive measures that required some explanation before they could be interpreted correctly. In this segment of the series, we'll be discussing a group of outlier detection methods that require very little interpretation, because they're often self-evident. Many of the tests outlined in previous installments of this series can of course be trussed up with some eye candy in SQL Server Reporting Services (SSRS), but would still require some knowledge of the concepts involved in order to interpret them correctly. In this segment of the series we'll instead highlight plots and charts whose significance can be grasped intuitively, even by amateurs like myself with few technical skills; we'll be building reports that constitute separate outlier detection methods in and of themselves, rather than merely displaying the results of some other outlier test like Z-Scores or Chauvenet's Criterion. The strength of data visualization methods like histograms and scatter plots is their simplicity, which allows even an untrained eye to spot outliers. On the other hand, their simplicity is also their major drawback, because they lack sophistication and rigor. All of the methods highlighted in this segment amount to a sort of brute force approach to visualizing outliers, using an unconscious assessment that "if this data point looks out of place, we'll look at it more closely." It's a quick but superficial strategy. For that reason it may pay to put these visualization techniques at the beginning of any outlier identification workflow, so that more rigorous and costly methods can be applied later on to the data points singled out by them for further investigation.
…………In addition, this set of visual detection methods is not immune from the downsides that have pervaded this series. Like any other outlier identification technique, charts of this kind don't tell you why a record is an outlier; they merely flag them as worthy of further analysis. Moreover, as we have seen throughout this series, the sheer number of records we're dealing with in typical modern database tables presents many challenges; we might not be constrained to the same degree as we were with statistical measures that are designed for hypothesis testing on a mere few hundred rows at best, like with Grubbs' Test, the Modified Thompson Tau Test, Dixon's Q-Test, Chauvenet's Criterion, the Tietjen-Moore Test and the Generalized Extreme Studentized Deviate Test (GESD), but we still run into issues with fitting all of the data points of interest into a single chart. Various workarounds are available for this issue, such as displaying counts of distinct data points rather than each point individually, or more advanced methods like binning and banding (perhaps by fuzzy set membership), but some of them defeat the purpose by hiding the outliers from view. Furthermore, these visual methods are not insulated completely from the chicken-and-egg arguments over whether or not the aberrant values are departures from the expected distribution, or if they represent evidence that a different distribution would actually model the data better. Like an open-ended algebraic expression with too many variables, that question can only be solved by adding more evidence, in the form of goodness-of-fit tests – which, as I've discussed previously in this series, are omitted with shocking frequency in many fields, as many professional statisticians lament. Outlier detection as a whole is essentially a crude form of pattern recognition, in which we single out records that don't fit that pattern – which begs a question that is still debated in that field of machine learning, namely what the definition of a pattern is. That is why sooner or later, it will be necessary to apply some substantial domain knowledge to defining what type of pattern we're looking for and whether or not particular records represent aberrations from it. Visual methods like scatter plots may give quick and cheap suggestions that a particular data point that is distant from all the rest may constitute an outlier, but they tell us nothing about why it is aberrant, or even if it is indeed a departure from our expectations. For that reason, these methods ought to be considered a starting point for an outlier detection process, not its culmination.
…………For example, histograms are easy to read because they merely represent the counts of records for each distinct value in a set as a data bar, but they tell you nothing about what the counts are supposed to be. Like any other good idea, there are many variations on the theme, such as cumulative histograms in which the totals accrue until the entire count of records is reached, at the upper right corner; I won't present an example of this subtype, however, because it is difficult to spot outliers with them. It is more common to implement some kind of binning or banding scheme on the values, which normally appear on the horizontal axis, rather than changes to the frequencies (i.e. counts) that normally appear on the vertical axis. Figure 1 presents a stored procedure that allows users to choose between three different types of HistogramType values, in which the first is a simple frequency count (which isn't of much practical use, so I've omitted any examples), the second organizes the counts of distinct values by their distances from the mean and the third bins them by the number of standard deviations from the mean, as is customary. Each HistogramType value is implemented in a different common table expression (CTE) in Figure 1, which ends with logic that selects between them. The rest of the T-SQL follows essentially the same format as the rest of the code samples provided in this series, such as the use of @DecimalPrecision to allow users to avoid arithmetic overflows by setting their own precisions and scales. The other four parameters enable users to create a histogram from any table in any database for which they have requisite access, as usual, while the rest is similar to the dynamic SQL examples provided in previous posts. The simple logic is implemented in a few math functions and windowing functions which are actually quite easy to follow, once you've subtracted the dynamic SQL they're enclosed in.

Figure 1: Code for the Basic Histogram Stored Procedure
CREATE PROCEDURE [Calculations].[HistogramBasicSP]
@DatabaseName as nvarchar(128) = NULL, @SchemaName as nvarchar(128), @TableName as nvarchar(128), @ColumnName AS nvarchar(128), @HistogramType tinyint = 1, @DecimalPrecision AS nvarchar(50)
AS

DECLARE @SQLString AS nvarchar(max), @HistogramTypeString nvarchar(max), @SchemaAndTableName nvarchar(400), @IntervalNumber bigint = 1

SET @SchemaAndTableName = ISNull(@DatabaseName, '') + '.' + @SchemaName + '.' + @TableName

SET @SQLString = 'DECLARE @Min int, @Max int, @PopulationMean decimal(' + @DecimalPrecision + '), @StDev decimal(' + @DecimalPrecision + '), @StDevBucketNumber int,
@Count bigint, @PercentageMultiplier decimal(' + @DecimalPrecision + ')

SELECT @Count = Count(' + @ColumnName + '), @Min = Min(' + @ColumnName + '), @Max = Max(' + @ColumnName + '), @PopulationMean = Avg(CAST(' + @ColumnName + ' AS decimal(' + @DecimalPrecision + '))), @StDev = StDev(' + @ColumnName + ')
FROM ' + @SchemaAndTableName + '

SELECT @StDevBucketNumber = CEILING((@Max - @Min) / @StDev) -- basically asking the question, "How many StDevs can we cram in between the min and max values?"

SELECT @PercentageMultiplier = 100 / CAST(@Count AS decimal(' + @DecimalPrecision + ')) -- for calculating percentages against the entire set

;WITH FrequencyCTE
(IntervalNumber, FrequencyCount)
AS (SELECT DISTINCT TOP 99999999 IntervalNumber * ' + CAST(@IntervalNumber AS nvarchar(50)) + ' AS IntervalNumber, Count(T3.' + @ColumnName + ') OVER (PARTITION BY IntervalNumber)
FROM (SELECT DISTINCT ' + @ColumnName + ', CEILING(CAST(' + @ColumnName + ' AS decimal(6,2)) / ' + CAST(@IntervalNumber AS nvarchar(50)) + ') AS IntervalNumber
      FROM (SELECT DISTINCT ' + @ColumnName + '
             FROM ' + @SchemaAndTableName + '
             WHERE ' + @ColumnName + ' IS NOT NULL) AS T1) AS T2
       INNER JOIN ' + @SchemaAndTableName + ' AS T3
       ON T2.' + @ColumnName + ' = T3.' + @ColumnName + '
ORDER BY IntervalNumber),
DividedIntoStDevIntervalsCTE
(IntervalNumber, FrequencyCount, DistanceFromTheMean, StDevInterval)
AS (SELECT *, "StDevInterval" =
       CASE WHEN DistanceFromTheMean / @StDev <= 0 THEN FLOOR(DistanceFromTheMean / @StDev)
       ELSE CEILING(DistanceFromTheMean / @StDev) END
FROM (SELECT *, CAST((@PopulationMean - IntervalNumber) AS decimal(6,2)) * -1 AS DistanceFromTheMean
       FROM FrequencyCTE) AS T1),
DistributionWithIntervalsCTE
(StDevInterval, FrequencyCount)
AS (SELECT DISTINCT StDevInterval, SUM(FrequencyCount) OVER (PARTITION BY StDevInterval)
FROM DividedIntoStDevIntervalsCTE)'

SELECT @HistogramTypeString = CASE WHEN @HistogramType = 1 THEN '
SELECT IntervalNumber, FrequencyCount
FROM FrequencyCTE'
      WHEN @HistogramType = 2 THEN '
SELECT IntervalNumber, FrequencyCount, DistanceFromTheMean, StDevInterval
FROM DividedIntoStDevIntervalsCTE'
      WHEN @HistogramType = 3 THEN '
SELECT StDevInterval, FrequencyCount
FROM DistributionWithIntervalsCTE' END

-- *** I could also add to this by converting the Counts to Percentages *****
SET @SQLString = @SQLString + @HistogramTypeString
EXEC (@SQLString)
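…………If you want to feed reports like the ones below, a call of the following form should do the trick. Consider it a hypothetical example rather than the exact call behind my screenshots: the parameter values are simply the Duchennes practice objects used throughout this series, and the Hemopexin column is merely one plausible choice, so substitute your own objects and precision as needed.

-- a hypothetical call, assuming the DataMiningProjects.Health.DuchennesTable practice objects
-- used throughout this series; HistogramType 3 bins the counts by standard deviations
EXEC [Calculations].[HistogramBasicSP]
              @DatabaseName = N'DataMiningProjects',
              @SchemaName = N'Health',
              @TableName = N'DuchennesTable',
              @ColumnName = N'Hemopexin',
              @HistogramType = 3,
              @DecimalPrecision = N'38,21'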


…………The procedure above was used to supply values for the four Reporting Services bar charts depicted in Figures 2 through 5, which display data from various columns from a 209-row dataset on the Duchennes form of muscular dystrophy and an 11-million-row dataset on the Higgs Boson, which I downloaded from Vanderbilt University's Department of Biostatistics and the University of California at Irvine's Machine Learning Repository respectively and converted into a database of about 7 gigabytes. The level of detail is dramatically reduced in the first two histograms, which represent different views of the same column in the DuchennesTable. The peaks and troughs that occur with interstitial outliers in Figure 2 are completely obscured in Figure 3, which limits us to looking at how frequently values fall outside the expected range – which is quite a bit in the cases of the values I've circled in red, since some of them are a whopping four or five standard deviations beyond the mean in a relatively small dataset. HistogramType 2 might provide more detail, but can't be used with the far larger Higgs Boson Dataset, since we'd essentially run out of screen space to provide the counts of all the distinct values of any column compared to its mean. Note how lopsided the values are for the middle four standard deviations of the first float column in the HiggsBosonTable, which are shifted far to the left in the diagram because there are so many values occurring at exceptionally high standard deviations. As indicated by the red lines I circled them with, the values for the first standard deviation above the mean are minuscule in comparison to those for one standard deviation below it, just as the values two standard deviations above the mean vastly outnumber those that are two below it. This is a perfect illustration of how outlier detection methods often represent the flip side of goodness-of-fit tests, because unless the table consists of little else except outliers, then this column clearly does not fit the bell curve shape of the Gaussian i.e. "normal" distribution. Column 2 in the same dataset may fit the normal distribution though, as is self-evident from the roughly bell-curved shape of Figure 5.

Figure 2 to 5: Four Sample Reports with Outliers for the Histogram Stored Procedure (click to enlarge)
Histogram Type 2; Histogram Type 3; Histogram Type 3 (Higgs Boson); Histogram (Higgs Boson Column 2)

…………This is not a lesson in how to do basic tasks in Reporting Services, like adding a diagram to a report, which I will omit in order to avoid the kind of extraneous detail that so often clutters tutorials on other computing subjects. There are plenty of other tutorials available that can explain such tasks more competently and cleanly than I can. Yet there are a few nuances here that require explanation, including the technique used to populate these reports from the stored procedure.[1] As depicted in the three Report Builder windows in Figure 6, you have to right-click the Datasets folder in the Report Data Explorer, then select the "Use a dataset embedded in my report" button on the Query tab, then select the Stored Procedure button under Query Type and pick the right procedure out of the drop-down list that appears under "Select or enter stored procedure name." Typically, you'll have to copy and paste the names of the columns returned by the stored procedure into both columns displayed on the Fields tab. You may also need to copy and paste the parameter names (including the @ signs) on the left column in the Parameters tab, then again enclosed in brackets on the right column, or use the function buttons to the right to add the parameters manually. It is also sometimes necessary to add the parameters again by right-clicking on the Parameters folder in the Report Data Explorer, although SQL Server sometimes autocompletes these. I've highlighted the difference between the two Parameters locations in the top graphic. It is important (but not all that important) to follow these directions, since we'll be essentially following the same script to implement other procedures in the next two tutorials.

Figure 6: Setting Up a Reporting Services Dataset to Use a Stored Procedure

…………One of the key principles of visual outlier detection is to highlight them in some way, as I've done in Figure 2 by changing the label color and box border width whenever the frequency counts are more than two standard deviations beyond the mean in either direction, or when they fall or rise by more than a count of three from one value to the next. The code for this is depicted in Figure 8. There may be a workaround for this – I'm familiar with SSRS but am certainly no expert – but you're apparently limited to a Previous function that is essentially equivalent to the T-SQL Lag function with a value of one, since there is no "Next" function equivalent to Lead that would allow you to go in the other direction. If you need reports of this kind for outlier detection, it may pay to do the reading I have not done yet on how the human eye reacts to contrast, shading and various other stimuli in order to highlight such values. One common sense example of this principle is the use of red for highlighting. One avenue I want to investigate further is to set the normal values semi-transparent, which might provide the needed contrast in an eye-catching way. Unfortunately, I don't think there is any clean, straightforward way of implementing it through a Transparency or Opacity property at present, although I may be wrong. As SSRS developer Graeme Malcolm lamented in one useful tutorial, "It turns out you can only set a transparency level for gauges and charts in Reporting Services – not for shapes or data regions."[2] One workaround is to either set the Fill color to No Color in the Properties[3] or set the color to Nothing programmatically[4] when a particular value meets your conditions. You might also achieve semi-transparency by setting the bars of the normal values to colors only slightly different from the background, such as a different shade of white. A little imagination could go a long way when designing reports for outlier detection; you could, for example, apply gradients to shade data points based on their Z-Scores or something similarly fancy. Keep in mind, however, that there are limitations in terms of things like screen space that eventually force tradeoffs and constrain the possibilities. Once data records get numerous enough, you can adjust the widths and heights of SSRS charts and their constituent elements to accommodate them, but eventually you may run out of room entirely and be forced into strategies like binning, as we were in the case of the 11-million-row HiggsBosonTable.

Figures 7 and 8: Setup for the Hybrid Line Chart and Associated Outlier Highlighting
Histogram Chart Data Settings; Histogram Outlier Label Settings

…………All four sample reports include a line chart that traces the peaks of the bar graph, thereby adding more information in an easily comprehensible and efficient manner. I followed the directions for creating a hybrid visualization of this kind in the "Column and Line Hybrid Charts" section of Robert M. Bruckner's classic TechNet white paper, "Get More Out of SQL Server Reporting Services Charts."[5] The Cliff's Notes version of the implementation is to create a second data series, then right-click in the Chart Data window and select Change Chart Type. That is why we have one series for IntervalNumber and another for FrequencyCount in the sample Chart Data window depicted in Figure 7, which was used in the four sample reports.
…………The directions for building and interpreting a line chart of this kind are trivial, so I’ll dispense with them. Once we know how to create one, however, it is child’s play to adapt it for use as an alternative form of outlier detection known as a run sequence plot or run chart. For the most part, what we’re looking for is the same kind of peaks and troughs evident in the hybrid charts we’ve already seen, except without the accompanying histograms. It is actually a more limited identification method, however, because we’re comparing values vs. time slices rather than values against their counts, which doesn’t necessarily tell us anything about how frequently a value should occur. A line chart for department store sales, for example, might see a big jump in sales before Christmas and a commensurate slump afterwards, but domain knowledge would tell us that the associated peaks and troughs represent the expected distribution, not actual outliers. More subtle relationships might be discovered through Time Series algorithms like the ones I covered in A Rickety Stairway to SQL Server Data Mining, Algorithm 9: Time Series, which are often visualized through run sequence plots of the same kind.

Figure 9: Code for the Simple Run Sequence Plot Stored Procedure
CREATE PROCEDURE [Calculations].[RunSequencePlotSP]
@DatabaseName as nvarchar(128) = NULL, @SchemaName as nvarchar(128), @TableName as nvarchar(128), @ColumnName AS nvarchar(128), @TimeColumnName AS nvarchar(128), @DatePeriodType as nvarchar(400), @DecimalPrecision AS nvarchar(50)
AS
DECLARE @SQLString AS nvarchar(max), @SchemaAndTableName nvarchar(400), @DatePeriodTypeString nvarchar(400)
SET @SchemaAndTableName = ISNull(@DatabaseName, '') + '.' + @SchemaName + '.' + @TableName

SELECT @DatePeriodTypeString = CASE WHEN @DatePeriodType = '' THEN @TimeColumnName
WHEN @DatePeriodType = 'Year' THEN 'Year(' + @TimeColumnName + ')'
WHEN @DatePeriodType = 'Month' THEN 'Month(' + @TimeColumnName + ')'
WHEN @DatePeriodType = 'Day' THEN 'Day(' + @TimeColumnName + ')'
WHEN @DatePeriodType = 'Week' THEN 'Datepart(week, ' + @TimeColumnName + ')' END

SET @SQLString = 'SELECT DISTINCT ' + @DatePeriodTypeString + ' AS TimePeriod,
Avg(CAST(' + @ColumnName + ' AS decimal(' + @DecimalPrecision + '))) OVER (PARTITION BY ' + @DatePeriodTypeString + ' ORDER BY ' + @DatePeriodTypeString + ') AS Mean,
Min(CAST(' + @ColumnName + ' AS decimal(' + @DecimalPrecision + '))) OVER (PARTITION BY ' + @DatePeriodTypeString + ' ORDER BY ' + @DatePeriodTypeString + ') AS Minimum,
Max(CAST(' + @ColumnName + ' AS decimal(' + @DecimalPrecision + '))) OVER (PARTITION BY ' + @DatePeriodTypeString + ' ORDER BY ' + @DatePeriodTypeString + ') AS Maximum,
StDev(CAST(' + @ColumnName + ' AS decimal(' + @DecimalPrecision + '))) OVER (PARTITION BY ' + @DatePeriodTypeString + ' ORDER BY ' + @DatePeriodTypeString + ') AS StandardDeviation,
Var(CAST(' + @ColumnName + ' AS decimal(' + @DecimalPrecision + '))) OVER (PARTITION BY ' + @DatePeriodTypeString + ' ORDER BY ' + @DatePeriodTypeString + ') AS Variance
FROM ' + @SchemaAndTableName + '
WHERE ' + @ColumnName + ' IS NOT NULL
ORDER BY ' + @DatePeriodTypeString
EXEC (@SQLString)
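…………A typical call might look like the following; treat it as a hypothetical example, since the table, value column and time column below are merely the Duchennes practice objects referred to elsewhere in this article, partitioned here by month.

-- a hypothetical call, assuming the DataMiningProjects.Health.DuchennesTable practice objects
-- used throughout this series, with the SampleDate column partitioned by month
EXEC [Calculations].[RunSequencePlotSP]
              @DatabaseName = N'DataMiningProjects',
              @SchemaName = N'Health',
              @TableName = N'DuchennesTable',
              @ColumnName = N'CreatineKinase',
              @TimeColumnName = N'SampleDate',
              @DatePeriodType = N'Month',
              @DecimalPrecision = N'38,21'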

…………Run sequence plots are not only simple to interpret but a breeze to program. The procedure in Figure 9 may be the shortest I've posted to date in this series and can be followed effortlessly, if you're familiar with T-SQL and have seen the sample code I've posted previously. All it does is calculate a handful of simple aggregates for a particular column, partitioned by another column that typically represents a time slice. It can of course be easily extended to handle other DatePart values like quarter, month, dayofyear, weekday, hour, minute, second, millisecond, microsecond, nanosecond, etc. – whatever your use cases call for; a sketch of such an extension follows below. I've thrown in the kitchen sink as far as built-in aggregates go, but the ones you don't need can of course be removed with a few backspaces. The @DatePeriodType parameter accepts the values Year, Month, Day and Week to partition the time column by particular intervals, or can be left blank to return all time periods. Note that it may be necessary to use the @DecimalPrecision parameter to avoid missing values in the chart, not merely to prevent arithmetic overflows; for whatever unfathomable reason, SSRS sometimes returned blank charts with no warnings or error messages unless I ratcheted the precision and scale down a little. For example, I was using 38,29 most of the time and my means wouldn't show up, nor my mins and maxes, but changing it to 38,27 brought them back.
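…………The sketch below shows how the CASE expression that builds @DatePeriodTypeString inside the procedure could pick up additional branches; the Quarter and Hour values are not part of the procedure as posted, so this is illustrative rather than tested code, written as a standalone snippet so it can be run on its own.

-- a standalone sketch of extra @DatePeriodType branches; Quarter and Hour are not in the posted procedure
DECLARE @DatePeriodType nvarchar(400) = N'Quarter',
@TimeColumnName nvarchar(128) = N'SampleDate',
@DatePeriodTypeString nvarchar(400)

SELECT @DatePeriodTypeString = CASE WHEN @DatePeriodType = '' THEN @TimeColumnName
WHEN @DatePeriodType = 'Quarter' THEN 'DatePart(quarter, ' + @TimeColumnName + ')'
WHEN @DatePeriodType = 'Hour' THEN 'DatePart(hour, ' + @TimeColumnName + ')'
ELSE @TimeColumnName END

SELECT @DatePeriodTypeString AS DatePeriodTypeString -- returns DatePart(quarter, SampleDate)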

Figures 10 and 11: Simple Run Sequence Plot Example with Outliers (click to enlarge)

…………Run Sequence plots are frequently used to compare the Time Series values of multiple columns for exploratory data mining purposes, but I’ve only included a single column in the example in Figure 10 because our purpose is outlier detection. I’ve adapted it to that purpose by adding a Smooth Range Column visualization that displays the minimum and maximum values for the same column at the same time slices as the line chart, which displays the mean. That gives us a more accurate picture of the variability of the values over time without adding much difficulty to the interpretation or wasting white space that could be providing more information. For example, the sample reports above show not only a few obvious outliers on the line chart for the CreatineKinase and Hemopexin columns of the DuchennesTable, but a few extreme peaks and troughs in the min and max values.[6] This may not be a typical use for run sequence plots, but it does make them more useful for outlier detection purposes. In the next article I’ll delve into other chart types like box plots that are grouped together with the Range Column and Smooth Range Column in Report Builder, but which are unfortunately hampered by constraints that make them much less useful in detecting outliers in large datasets.


[1] Please note that I consulted several webpages awhile back in order to learn how to do this, but can’t find my original citations in order to give proper credit

[2] Malcolm, Graeme, 2010, “Using a Transparent Background in Reporting Services,” published Oct. 13, 2010 at the Content Master website. Available online at http://cm-bloggers.blogspot.com/2010/10/using-transparent-background-in.html.

[3] I got this idea from Yilmaz, Eralper, 2014, “Building Transparent Stacked Column Charts Using MS SQL Server Reporting Services 2008,” published at the Kodyaz Development Resources website and available at the web address http://www.kodyaz.com/articles/transparent-stacked-column-in-reporting-services.aspx

[4] I took this idea from the reply by the user name i_h in the thread  “’Transparent’ is not a Valid BackgroundColor?” published Jan. 13, 2012 in the MSDN forums. Available at the web address https://social.msdn.microsoft.com/Forums/en-US/81ec0049-f007-4619-b8df-ee74ec2b9ec1/transparent-is-not-a-valid-backgroundcolor?forum=sqlreportingservices

[5] Bruckner, Robert M., 2006, "Get More Out of SQL Server Reporting Services Charts," a white paper published by TechNet in September 2006. Available online at http://technet.microsoft.com/en-us/library/aa964128(v=sql.90).aspx

[6] These results should be taken with a  grain of salt, however, because I have had trouble accurately converting the SampleDate column in the original Duchennes dataset to a SQL Server date data type.

Outlier Detection with SQL Server, part 5: Interquartile Range

By Steve Bolton

…………The last seven articles in this series of mistutorials on identifying outlying values in SQL Server databases were clunkers, in the sense that the methods had many properties in common that made them inapplicable to the scenarios DBAs typically need them for. Chauvenet's Criterion, Peirce's Criterion, the Tietjen-Moore Test, the Generalized Extreme Studentized Deviate Test (GESD), Grubbs' Test, the Modified Thompson Tau Test and Dixon's Q-Test are well-suited to the uses they were designed for, like hypothesis testing, but are difficult to apply to common SQL Server use cases like finding data quality problems in tables of several million rows or doing exploratory data mining. Most of them require prior goodness-of-fit testing to verify that the underlying data follows a Gaussian "normal" distribution, i.e. a bell curve, without which they are invalid; many of the lookup tables they depend on are widely available but stop at just a few hundred rows at best, while calculating the missing lookup values for millions of cases can be exceptionally costly. Toss in other drawbacks of hypothesis testing that are often unstated these days (like the use of arbitrary confidence levels and misconceptions about probabilistic reasoning, which statisticians themselves raise frequently in their literature) and it appears that for most scenarios, DBAs would be better off sticking with the methods we kicked off the series with, Z-Scores and Benford's Law. I'm only writing about these topics as an amateur, but the inapplicability of so many standard outlier identification methods to larger datasets makes me wonder if the age of "Big Data"[1] doesn't call for the devising of new means of detection. Thankfully, however, we haven't by any means exhausted the means already available to us in the common statistical literature, without having to delve into research papers and academic journals and that sort of thing. I haven't yet had a chance to discuss Interquartile Range because I'm trying to group the detection methods by the properties they have in common, but this particular one has little overlap with any of the others we've surveyed to date. It nevertheless performs relatively well and is applicable to a much wider set of use cases than any other means we've discussed since finishing up Z-Scores a couple of months ago.
…………Interquartile Range has apparently been in use for so long and is so pervasive in academic research that the story of its origin is difficult to find in a cursory search, unlike the colorful histories of some of the lesser-known methods discussed in recent posts. In-depth research of this kind wasn't really necessary for this week's article because the calculations and concepts are easier than anything we've discussed to date.[2] The idea is fairly simple: instead of calculating a single center point for the whole dataset, we establish two boundaries known as the lower and upper quartiles encompassing the middle half of the values, so named because they are a quarter of the way (25 percent and 75 percent) from the edges of the dataset. The Interquartile Range is just another single measure of how dispersed data is around the center of the dataset, like the more familiar standard deviation and variance, except that it is less sensitive to outlying values (i.e., it is more "robust"). Computing it is trivial once we've got the lower and upper quartiles, since all we have to do is subtract the former from the latter. Interquartile Range is apparently useful for other applications such as goodness-of-fit testing, but when used to find those aberrant data points we call outliers, it is usually accompanied by calculations of inner and outer fences. The inner fences are established by subtracting 1.5 times the Interquartile Range from the lower quartile and adding the same amount to the upper quartile; the outer fences are calculated the same way, except with 3 times the Interquartile Range. Using this test, any values falling outside these four "fences" are defined as outliers; a small worked example of the arithmetic follows below. The math in Figure 1 looks a lot more complicated than it really is, when all we're really doing is a few modulos and simple divisions to get the lower and upper quartiles, then some simple subtraction and multiplication to establish the fence values. The most difficult part of the T-SQL code is probably the common table expression (CTE), which is trivial compared to some of the more difficult nested subqueries, UNPIVOT operations and windowing functions used in other recent tutorials.
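…………Since the fence arithmetic is the heart of the method, a tiny worked example may help before diving into the dynamic SQL; the quartile values below are made up purely for illustration.

-- a worked example of the fence arithmetic with made-up quartiles: Q1 = 10 and Q3 = 20,
-- so the IQR is 10, the inner fences sit at -5 and 35 and the outer fences at -20 and 50
DECLARE @LowerQuartile decimal(38,9) = 10, @UpperQuartile decimal(38,9) = 20
DECLARE @InterquartileRange decimal(38,9) = @UpperQuartile - @LowerQuartile

SELECT @LowerQuartile - (1.5 * @InterquartileRange) AS LowerInnerFence,
@UpperQuartile + (1.5 * @InterquartileRange) AS UpperInnerFence,
@LowerQuartile - (3 * @InterquartileRange) AS LowerOuterFence,
@UpperQuartile + (3 * @InterquartileRange) AS UpperOuterFence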

Figure 1: Code for the Interquartile Range Procedure
CREATE PROCEDURE [Calculations].[InterquartileRangeSP]
@DatabaseName as nvarchar(128) = NULL, @SchemaName as nvarchar(128), @TableName as nvarchar(128), @ColumnName AS nvarchar(128), @PrimaryKeyName as nvarchar(400), @OrderByCode as tinyint = 1, @DecimalPrecision AS nvarchar(50)
AS
SET @DatabaseName = @DatabaseName + '.'
DECLARE @SchemaAndTableName nvarchar(400)
SET @SchemaAndTableName = ISNull(@DatabaseName, '') + @SchemaName + '.' + @TableName
DECLARE @SQLString nvarchar(max)

SET @SQLString = 'DECLARE @OrderByCode tinyint,
@Count bigint,
@LowerPoint bigint,
@UpperPoint bigint,
@LowerRemainder decimal(38,37), -- use the maximum precision and scale for these two variables to make the
-- procedure flexible enough to handle large datasets; I suppose I could use a float
@UpperRemainder decimal(38,37),
@LowerQuartile decimal(' + @DecimalPrecision + '),
@UpperQuartile decimal(' + @DecimalPrecision + '),
@InterquartileRange decimal(' + @DecimalPrecision + '),
@LowerInnerFence decimal(' + @DecimalPrecision + '),
@UpperInnerFence decimal(' + @DecimalPrecision + '),
@LowerOuterFence decimal(' + @DecimalPrecision + '),
@UpperOuterFence decimal(' + @DecimalPrecision + ')

SET @OrderByCode = ' + CAST(@OrderByCode AS nvarchar(50)) + '

SELECT @Count = Count(' + @ColumnName + ')
FROM ' + @SchemaAndTableName + '
WHERE ' + @ColumnName + ' IS NOT NULL

SELECT @LowerPoint = (@Count + 1) / 4, @LowerRemainder = ((CAST(@Count AS decimal(' + @DecimalPrecision + ')) + 1) % 4) / 4,
@UpperPoint = ((@Count + 1) * 3) / 4, @UpperRemainder = (((CAST(@Count AS decimal(' + @DecimalPrecision + ')) + 1) * 3) % 4) / 4; -- multiply by 3 for the left side on the upper point to get 75 percent

WITH TempCTE
(' + @PrimaryKeyName + ', RN, ' + @ColumnName + ')
AS (SELECT ' + @PrimaryKeyName + ', ROW_NUMBER() OVER (PARTITION BY 1 ORDER BY ' + @ColumnName + ' ASC) AS RN, ' + @ColumnName + '
FROM ' + @SchemaAndTableName + '
WHERE ' + @ColumnName + ' IS NOT NULL),
TempCTE2 (QuartileValue)
AS (SELECT TOP 1 ' + @ColumnName + ' + ((Lead(' + @ColumnName + ', 1) OVER (ORDER BY ' + @ColumnName + ') - ' + @ColumnName + ') * @LowerRemainder) AS QuartileValue
FROM TempCTE
WHERE RN BETWEEN @LowerPoint AND @LowerPoint + 1

UNION ALL

SELECT TOP 1 ' + @ColumnName + ' + ((Lead(' + @ColumnName + ', 1) OVER (ORDER BY ' + @ColumnName + ') - ' + @ColumnName + ') * @UpperRemainder) AS QuartileValue
FROM TempCTE
WHERE RN BETWEEN @UpperPoint AND @UpperPoint + 1)

SELECT @LowerQuartile = (SELECT TOP 1 QuartileValue
FROM TempCTE2 ORDER BY QuartileValue ASC), @UpperQuartile = (SELECT TOP 1 QuartileValue
FROM TempCTE2 ORDER BY QuartileValue DESC)

SELECT @InterquartileRange = @UpperQuartile - @LowerQuartile
SELECT @LowerInnerFence = @LowerQuartile - (1.5 * @InterquartileRange), @UpperInnerFence = @UpperQuartile + (1.5 * @InterquartileRange), @LowerOuterFence = @LowerQuartile - (3 * @InterquartileRange), @UpperOuterFence = @UpperQuartile + (3 * @InterquartileRange)

--SELECT @LowerPoint AS LowerPoint, @LowerRemainder AS LowerRemainder, @UpperPoint AS UpperPoint, @UpperRemainder AS UpperRemainder
-- uncomment the line above to debug the inner calculations

SELECT @LowerQuartile AS LowerQuartile, @UpperQuartile AS UpperQuartile, @InterquartileRange AS InterQuartileRange, @LowerInnerFence AS LowerInnerFence, @UpperInnerFence AS UpperInnerFence, @LowerOuterFence AS LowerOuterFence, @UpperOuterFence AS UpperOuterFence

SELECT ' + @PrimaryKeyName + ', ' + @ColumnName + ', OutlierDegree
FROM (SELECT ' + @PrimaryKeyName + ', ' + @ColumnName + ',
       "OutlierDegree" = CASE WHEN (' + @ColumnName + ' < @LowerInnerFence AND ' + @ColumnName + ' >= @LowerOuterFence) OR (' + @ColumnName + ' > @UpperInnerFence
AND ' + @ColumnName + ' <= @UpperOuterFence) THEN 1
       WHEN ' + @ColumnName + ' < @LowerOuterFence OR ' + @ColumnName + ' > @UpperOuterFence THEN 2
       ELSE 0 END
       FROM ' + @SchemaAndTableName + '
       WHERE ' + @ColumnName + ' IS NOT NULL) AS T1
      ORDER BY CASE WHEN @OrderByCode = 1 THEN ' + @PrimaryKeyName + ' END ASC,
CASE WHEN @OrderByCode = 2 THEN ' + @PrimaryKeyName + ' END DESC,
CASE WHEN @OrderByCode = 3 THEN ' + @ColumnName + ' END ASC,
CASE WHEN @OrderByCode = 4 THEN ' + @ColumnName + ' END DESC,
CASE WHEN @OrderByCode = 5 THEN OutlierDegree END ASC,
CASE WHEN @OrderByCode = 6 THEN OutlierDegree END DESC'

--SELECT @SQLString -- uncomment this to debug string errors
EXEC (@SQLString)

…………The code in Figure 1 basically follows the same format as that of other procedures I've posted in this series, for simplicity's sake. The first five parameters allow users to test any column in any database they have access to, while the @DecimalPrecision enables them to avoid arithmetic overflows by manually setting a precision and scale appropriate to the column they've selected. As usual, the procedure is created in a Calculations schema; there are no brackets to handle spaces in object names, nor is there any validation or protection against SQL injection. As with past procedures, uncommenting the next-to-last line allows users to debug the dynamic SQL; I also provided a second debugging point of the same kind midway through the procedure, for testing the inner calculations. The @OrderByCode I've used in previous tutorials also returns, with the same values as usual: values #1 and #2 allow users to order by the primary key ascending and descending, while #3 and #4 do the same for the ColumnName and #5 and #6 order the results by the OutlierDegree column. As depicted below, the OutlierDegree column allows for a range of values depending on how much a particular data point deviates from the norm, not merely a Boolean yes-no flag like we've seen in many hypothesis-testing based methods. Note that the results also include the Interquartile Range, fence values and quartiles used to test each data point.

Figure 2: Results for the Interquartile Range Procedure on the PyruvateKinase Column
EXEC [Calculations].[InterquartileRangeSP]
              @DatabaseName = N'DataMiningProjects',
              @SchemaName = N'Health',
              @TableName = N'DuchennesTable',
              @ColumnName = N'PyruvateKinase',
              @PrimaryKeyName = N'ID',
              @OrderByCode = 6,
              @DecimalPrecision = N'38,21'


…………The test in Figure 2 was performed on the PyruvateKinase column of a 209-row dataset on the Duchennes form of muscular dystrophy, which I downloaded from Vanderbilt University's Department of Biostatistics and converted to a SQL Server table. For the sake of consistency, I've stress-tested the outlier detection methods that might have performance issues against the first float column of the Higgs Boson Dataset, which is made publicly available by the University of California at Irvine's Machine Learning Repository and now occupies almost 6 gigabytes of my practice database. On average, the procedure took about 18-and-a-half to 19 minutes to run against the 11 million rows of that dataset on my poor beat-up semblance of a development machine, compared to about 3 minutes for the Z-Score procedure I posted earlier in the series. Adding a non-clustered index on the column improved things a little, as long as I included the clustered primary key, but the execution plans were still too large to post; suffice it to say that they consisted mainly of parallel non-clustered Index Seeks with some expensive Sorts, plus a lot of Spools with inconsequential costs. Perhaps a columnstore index would help, but I've been unable to upgrade yet to SQL Server 2014, where many of the restrictions that once hobbled that feature have been relaxed.
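…………For reference, the kind of supporting index described above might look something like the sketch below. The object and column names are only placeholders for whatever the Higgs Boson table and its first float column are actually called in your own copy of the data, and since the clustering key is carried in every non-clustered index anyway, explicitly including the primary key is mostly a matter of documentation.

-- hypothetical supporting index; Physics.HiggsBosonTable, Column1 and ID are placeholder names
CREATE NONCLUSTERED INDEX IX_HiggsBosonTable_Column1
ON Physics.HiggsBosonTable (Column1)
INCLUDE (ID);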
…………Since Z-Scores perform better and can be used in conjunction with any outlier threshold that end users choose, they remain enthroned as the most widely applicable detection method we've yet discussed. Nonetheless, Interquartile Range is more likely to be of use to DBAs than the hypothesis testing-based methods that took up the middle segment of this series. The calculations are simple to code and perform, the concepts aren't hard to explain to database users and, perhaps best of all, we're not limited to a Gaussian bell curve. That also means we don't have to do the preliminary goodness-of-fit testing that is so often omitted by careless researchers. One of the Wikipedia articles where I found the formula mentions its use in conjunction with Cauchy and Laplace distributions, although not necessarily in its capacity as an outlier detector.[3] It can even be adapted for double duty as a goodness-of-fit test. In and of itself, it constitutes an alternate measure of dispersion that can be used in place of standard deviation and variance, which is useful whenever a measure less easily distorted by outlying values is called for. The same property might make it more appropriate than Z-Scores when a more conservative test for outliers is genuinely needed. Another point in its favor is that it measures the degree of membership in the set of outliers on a scale, rather than merely flagging it as many hypothesis-testing methods do; those methods also carry numerous restrictions on the number and types of inputs, outputs and calculation methods that make them unsuitable for most SQL Server tasks, like the recursive deletion required by Chauvenet's Criterion or the inability of Dixon's Q-Test to identify more than one outlier per dataset. Moreover, the fence and quartile values are trivial to return once they've been calculated and constitute global measures in their own right.
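…………To illustrate the point about dispersion, the quick sketch below puts the standard deviation and the interquartile range side by side for the same PyruvateKinase column; the PERCENTILE_CONT shortcut and the three-part table name are the same assumptions as in the earlier sketch. Adding a handful of artificial extreme values to the column will swing the standard deviation noticeably while leaving the interquartile range nearly untouched, which is exactly the robustness property described above.

-- comparing standard deviation and interquartile range as dispersion measures
SELECT TOP 1 STDEV(PyruvateKinase) OVER () AS StandardDeviation,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY PyruvateKinase) OVER ()
- PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY PyruvateKinase) OVER () AS InterquartileRange
FROM DataMiningProjects.Health.DuchennesTable
WHERE PyruvateKinase IS NOT NULL;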
…………I have yet to try Cook's Distance and Mahalanobis Distance, but I have high hopes that they too will prove to be useful additions to the toolbelts of SQL Server data miners and DBAs. I hope to use both as a springboard into a much longer and more difficult, albeit worthwhile, series a few months down the line, Information Measurement with SQL Server. Before delving into the difficult math that underpins distance-based metrics of that kind, however, I will give a brief overview of how to use Reporting Services to find outliers the easy way: by the naked eye. Finding outliers is not always that straightforward, but in many cases all we need to do is spot them in a histogram or scatter plot, where they sometimes stand out like sore thumbs. They are sometimes also glaringly obvious in the diagrams produced by the Clustering algorithm in SSDM, which I may give a quick refresher on, based on my last tutorial series, A Rickety Stairway to SQL Server Data Mining. As we will see, scaling up visual detection methods of this kind to meet the size of SQL Server databases is the primary challenge, just as that size stretches beyond the ordinary bounds of hypothesis testing. The pervasiveness of the size issue makes me wonder, once again, if it might not be worthwhile to devise new scalable methods of outlier detection to complement the ones already in common use today.[4]


[1] This buzzword is a lot like the overused term “globalization.” Unlike with statistics and data mining, I have real expertise in foreign policy history, and can say definitively that globalization has been going on for millennia; the only difference is that it has accelerated in recent decades. Likewise, the amount of Data the human race has to process is always getting Bigger; it’s just getting Bigger at a faster pace these days.

[2] I retrieved the formulas from the most convenient sources, the Wikipedia pages “Outlier” and “Interquartile Range” at http://en.wikipedia.org/wiki/Outlier and http://en.wikipedia.org/wiki/Interquartile_range respectively. I also tested the procedure against some of the examples provided there. Also see National Institute of Standards and Technology, 2014, “7.1.6. What are Outliers in the Data?” published in the online edition of the Engineering Statistics Handbook. Available online at the web address http://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm

[3] Ibid.

[4] While writing this series I encountered an interesting suggestion by Will G. Hopkins, the writer of one of the best plain English explanations of statistics available online today: “Here’s something challenging for the real lovers of numbers. The mean ± SD encloses 68% of the data on average for a normally distributed variable. So if you want to use a percentile range that corresponds to the mean ± SD, what should it be? Answer: 16th-84th. If I had my way, this measure would replace the interquartile range. We could call it the standard percentile range…” I could write code for this easily, but didn’t bother because the series already features too many outlier identification methods that are dependent on a normal distribution. Nor would it necessarily do anything to help with the scaling problem I mentioned above. It does illustrate, however, how we’re not limited to just using tried and true measures and can devise new ones as needed, if they are appropriate to the contexts at hand. See Hopkins, Will G.  2013, “Percentile Ranges,” published at the website A New View of Statistics and available at the web address http://www.sportsci.org/resource/stats/percentile.html
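…………For what it's worth, Hopkins' "standard percentile range" would only take a couple of lines of T-SQL to sketch out, again leaning on PERCENTILE_CONT and the same placeholder table and column used in the earlier examples; I present it here only as an illustration, not as something I have tested or tuned.

-- a sketch of the 16th-to-84th percentile range Hopkins suggests as a substitute for the interquartile range
SELECT TOP 1
PERCENTILE_CONT(0.84) WITHIN GROUP (ORDER BY PyruvateKinase) OVER ()
- PERCENTILE_CONT(0.16) WITHIN GROUP (ORDER BY PyruvateKinase) OVER () AS StandardPercentileRange
FROM DataMiningProjects.Health.DuchennesTable
WHERE PyruvateKinase IS NOT NULL;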


