Outlier Detection with SQL Server, part 3.2: GESD

By Steve Bolton

…………In the last edition of this amateur series of self-tutorials on finding outlying values in SQL Server columns, I mentioned that Grubbs’ Test has a number of limitations that sharply constrain its usefulness to DBAs. The Generalized Extreme Studentized Deviate Test (GESD) suffers from some of the same restrictions – most notably the fact that it is only applicable to datasets that have a Gaussian (a.k.a. “normal”) distribution, better known as the bell curve. Nonetheless, the GESD applies to a wider set of use cases because it can be applied to find more than one outlier, unlike the Grubbs’ Test it is derived from. It is “essentially Grubbs test applied sequentially,” with some tweaks applied to its test statistics and critical values to avoid such problems as false negatives with weak stopping criteria.[1] Grubbs’ Test is sometimes performed recursively by deleting aberrant data points one at a time, which raises the whole morass of ethical issues I’ve harped on in previous articles, like the far too common tendencies among researchers to classify outliers arbitrarily, subjectively change those classifications (i.e. “moving the goalposts”) and fail to investigate the causes of aberrant values and match them to appropriate responses. As mentioned previously, outlier detection methods can be differentiated by their use cases, the numbers and types of inputs, the numbers and types of outputs and their mathematical properties, and the means of calculation applied in between the input and output stages, if that would affect performance. A failure in any one of these areas can be worse than an incorrect calculation. An outlier workflow also ought to include rigorous goodness-of-fit testing to make sure that only tests appropriate to the underlying distribution are applied, plus proper classification procedures and criteria, guarding against redefinition (i.e. “changing horses in midstream”) and matching the underlying causes of outliers with appropriate responses. Not all outliers are the result of faulty data collection and storage, so simply deleting aberrant data points, as recursive application of the Grubbs’ Test implies, is often inappropriate, or even unethical. There is little reason to use it when a test like GESD is available to ameliorate some of these associated drawbacks and temptations. The GESD test is still quite limited in comparison to other outlier detection methods we’ll survey here, since it is normally only applied to look for small numbers of aberrant data points, but at least we’re not limited to finding a single one as we are with the Grubbs’ Test. The performance costs for both are quite trivial. Another benefit of the GESD is that we can reuse much of the code from last week’s mistutorial, thereby simplifying the development and troubleshooting processes.
…………We can thank Harvard Biostatistics Prof. Bernard Rosner for this improvement to the Grubbs’ Test, which was published in an issue of the statistical journal Technometrics back in 1983.[2] I was unable to get ahold of this paper, unlike the original publications for Grubbs’ Test, but I was able to find the formulas at the National Institute for Standards and Technology’s Engineering Statistics Handbook, which is one of the most readable online sources of information on statistics for amateurs like myself. I’d wager that most readers will find a pint of Guinness a lot less boring than these underlying equations, so it is worth noting that we can thank a brewer of Irish beer for some of these formulas. The “Studentized” part of the name is derived from the “Student’s T-distribution” it is dependent on, which I always vaguely pictured as being derived from some sort of high school or college math test; the real story is more colorful, in that it was the product of a 1908 Biometrika article by William Sealy Gosset, who chose the pen name “Student” after his employer, the Guinness brewery in Dublin, required him to publish under a pseudonym.[3] Just something more colorful than degrees of freedom and critical values to meditate on the next time you’ve had one too many draughts of Guinness. I’m not sure why the GESD calculations, like those of Grubbs’ Test, are dependent on both the T-distribution and the bell curve, although one of my missions when starting this series was to grasp the underlying mechanics and logic of the formulas, which is something I was deficient in when writing my last series of tutorials, on SQL Server Data Mining (SSDM). I am trying to acquire the skill to translate equations into T-SQL, Multidimensional Expressions (MDX) and Visual Basic (VB) code as quickly as possible, but sometimes still struggle to get the formulas right. Some caution is in order here, because the code in the figure below only matched the first four results for Rosner’s 54 sample data points, in the example calculation at the NIST webpage. I had some issues following the order of operations in the NIST’s equations at first, but it is odd that the results would only be off by a hundredth of a decimal point for the first four values, then diverge after that. This ought to refocus attention on one of the main caveats associated with any of my tutorial series: always check my code before putting it into production, because I’m writing this in hopes of learning, not because I already know what I’m doing; I’m only publishing it in order to sharpen my thinking further and in the hopes that others might gain something from my misadventures.
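…………For reference, here is my own rendering of the NIST formulation I was trying to implement, so treat it with the same caution as the code. At each step the observation with the largest absolute deviation is removed and the test statistic is recomputed from whatever remains, up to the maximum number of suspected outliers r:

$$R_{i} = \frac{\max_{j}\,\lvert x_{j} - \bar{x} \rvert}{s}, \qquad i = 1, \dots, r$$

where the mean and standard deviation are recalculated at each step from the remaining observations. Each $R_{i}$ is then compared against a critical value derived from the T-distribution,

$$\lambda_{i} = \frac{(n - i)\, t_{p,\,n-i-1}}{\sqrt{\left(n - i - 1 + t_{p,\,n-i-1}^{2}\right)\left(n - i + 1\right)}}, \qquad p = 1 - \frac{\alpha}{2(n - i + 1)},$$

and the number of outliers is declared to be the largest $i$ for which $R_{i} > \lambda_{i}$.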

Figure 1: Code for the GESD Test Procedure
CREATE PROCEDURE [Calculations].[GESDTestSP]
@DatabaseName as nvarchar(128) = NULL, @SchemaName as nvarchar(128), @TableName as nvarchar(128), @ColumnName AS nvarchar(128), @PrimaryKeyName as nvarchar(400), @Alpha decimal(38,35) = 0.05, @NumberOfTests bigint = 10
AS

SET @DatabaseName = @DatabaseName + '.'
DECLARE @SchemaAndTableName nvarchar(400), @SQLString nvarchar(max)
SET @SchemaAndTableName = ISNull(@DatabaseName, '') + @SchemaName + '.' + @TableName

SET @SQLString =
'DECLARE @Count bigint

SELECT @Count = Count(' + @ColumnName + ')
FROM ' + @SchemaAndTableName + '
WHERE ' + @ColumnName + ' IS NOT NULL

SELECT RN, ' + @PrimaryKeyName + ', ' + @ColumnName + ', GESDStatistic, P, V, CriticalRegion, "IsOutlier" = CASE WHEN GESDStatistic > CriticalRegion THEN 1 ELSE 0 END
FROM (SELECT RN, ' + @PrimaryKeyName + ', ' + @ColumnName + ', GESDStatistic, P, V, (Calculations.FindCriticalRegionForTDistributionFunction(V, 1, 1 - P) * (@Count - RN)) / Power(((@Count - (RN - (1 + (Power(Calculations.FindCriticalRegionForTDistributionFunction(V, 1, 1 - P), 2))))) * (@Count - (RN + 1))), 0.5) AS CriticalRegion
       FROM (SELECT RN, ' + @PrimaryKeyName + ', ' + @ColumnName + ', GESDStatistic, 1 - (' + CAST(@Alpha AS nvarchar(50)) + ' / (2 * (@Count - (RN + 1)))) AS P, ((@Count - RN) - 1) AS V
      FROM (SELECT TOP ' + CAST(@NumberOfTests AS nvarchar(50)) + ' ROW_NUMBER() OVER (PARTITION BY 1 ORDER BY ' + @ColumnName + ' DESC) AS RN, ' + @PrimaryKeyName + ', ' + @ColumnName + ',
              (' + @ColumnName + ' - Avg(' + @ColumnName + ') OVER (PARTITION BY 1 ORDER BY ' + @ColumnName + ' DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)) / StDev(' + @ColumnName + ') OVER (PARTITION BY 1 ORDER BY ' + @ColumnName + ' DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS GESDStatistic
              FROM ' + @SchemaAndTableName + '
              WHERE ' + @ColumnName + ' IS NOT NULL
              ORDER BY ' + @ColumnName + ' DESC) AS T1) AS T2) AS T3'

--SELECT @SQLString -- uncomment this to debug string errors
EXEC (@SQLString)

…………Anyone who has read previous articles in this series should be able to cut through all the gobbledygook in Figure 1 by noticing the commonalities between this T-SQL code and that of previous procedures. Once again, the first five parameters allow users to perform the test on any column in any database for which they have permissions, and the first three lines of the procedure make some adjustments to the string names of those parameters for legibility purposes. There is no @OrderByCode as in previous procedures and @DecimalPrecision is likewise not needed. Instead, we must supply values for two new parameters, the @NumberOfTests to perform (which I’ve set to an arbitrary default of 10) and the @Alpha value, which is a core concept in hypothesis testing along with confidence levels, critical values, critical regions, statistical significance and the like. I strongly recommend Will G. Hopkins’ website A New View of Statistics for anyone interested in a readable introduction to these topics, which really aren’t as hard to grasp as they seem – as long as someone explains them in plain English. As usual, the procedure is created in a schema called Calculations, which you can change to your liking; uncommenting the next-to-last line allows you to debug the dynamic SQL; and you’ll have to add your own code to accommodate spaces in object names, which I don’t allow, or to handle SQL injection, which I haven’t included to keep my code short and to the point. This procedure is much shorter than last week’s because I’m reusing the code I already published for Calculations.FindCriticalRegionForTDistributionFunction, which performs the comparisons against the T-distribution to see if a particular value is an outlier. The code is also much shorter because there’s only one version of the GESD, whereas Grubbs’ Test has two-tailed, lower one-tailed and upper one-tailed versions. As is customary, I retrieve the global statistics once only at the beginning of the dynamic SQL, but in this case all we need is a simple record count. As usual, it takes several subqueries to perform all of the calculations required, which form the bulk of the rest of the dynamic SQL. I’m really quite proud of myself for coding it with the DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING clauses rather than a common table expression (CTE), since one of my goals is to make my T-SQL more efficient by implementing the kind of windowing functions Itzik Ben-Gan discusses in his classic reference Microsoft SQL Server 2012 High-Performance T-SQL Using Window Functions.[4] It seemed to be the most efficient solution to the problem, although performance really isn’t much of an issue with either GESD or Grubbs’. Figure 3 shows only a minimal change in the execution time when a non-clustered index was added, while the execution plans weren’t really noteworthy enough to publish.
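…………To illustrate that windowing trick in isolation, the throwaway snippet below (run against a hypothetical temp table, not against anything in the procedure above) shows how aggregates restricted to a frame of CURRENT ROW AND UNBOUNDED FOLLOWING recompute the mean and standard deviation over only the current row and the rows below it in descending order; in other words, over the sample that remains once the larger values have been stripped away, which is exactly what the sequential GESD statistic calls for:

-- throwaway illustration; the #GESDSample table and its values are made up
CREATE TABLE #GESDSample (Value decimal(7,2))
INSERT INTO #GESDSample VALUES (245.00), (203.10), (120.50), (118.20), (99.70)

SELECT Value,
Avg(Value) OVER (ORDER BY Value DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS MeanOfRemainingRows,
StDev(Value) OVER (ORDER BY Value DESC ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS StDevOfRemainingRows
FROM #GESDSample
ORDER BY Value DESC

DROP TABLE #GESDSample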

Figure 2: Results for the GESD Test Procedure
EXEC   Calculations.GESDTestSP
             @DatabaseName = N'DataMiningProjects',
             @SchemaName = N'Health',
             @TableName = N'DuchennesTable',
             @ColumnName = N'PyruvateKinase',
             @PrimaryKeyName = N'ID',
             @Alpha = 0.05,
             @NumberOfTests = 15

[Image: GESD Results]

Figure 3: Client Statistics for the GESD Test Procedure
[Image: GESDClientResults]

…………If performance had been an issue I would have run the procedure against the first float column in the Higgs Boson Dataset made publicly available by the University of California at Irvine’s Machine Learning Repository, which occupies nearly 6 gigabytes of space in the DataMiningProjects database I’ve created to encompass all three of the practice datasets we’ll be using in this tutorial series. Instead, I ran it against the PyruvateKinase column of a tiny 9-kilobyte dataset on the Duchennes form of muscular dystrophy, published online by Vanderbilt University’s Department of Biostatistics. Since calculations on the same enzyme were also performed in last week’s article, we can easily contrast them. The stark, obvious difference is that we only received one row of output for Grubbs’, whereas Figure 2 flags the first 13 rows as outliers based on the fact that their GESD statistics exceed the values in the T-distribution lookup table for the associated critical region, which is in turn determined by the probability values (P) and the degrees of freedom (V, which is derived from the record count).
…………The EXEC code above specified a common @Alpha value of 0.05 and a test of 15 values, which is why we received 15 results. This is certainly more useful than returning a single output of the kind we get from Grubbs’ Test, but still of limited utility to DBAs who normally work with tables of thousands or even billions of rows. While writing this series, one of the weaknesses I’ve discovered with applying standard outlier detection methods to SQL Server databases is that many of them simply aren’t designed with the Big Data buzzword in mind. Many of them do a fantastic job when used in hypothesis testing, the main use case scenario they were designed for, which is a much narrower and more specific task than the two most common use cases DBAs face, exploratory data mining and checking for data quality problems. Studies of the first kind often involve just a few hundred or even just a few dozen cases, whereas in the latter we may be dealing with millions of records. It is almost impossible, however, to find lookup tables for many of the distributions and calculations associated with these hypothesis tests that go beyond more than a few hundred values. For example, I had to hunt all over the Internet to find a table of T-distribution values that went up as far as 200 values, which is still below the 209 rows of the Duchennes table and minuscule in comparison to the 11 million rows of the Physics.HiggsBosonTable. I’ve also learned that attempting to fill the gap by calculating the missing values for these tables yourself can be quite computationally expensive and surprisingly difficult to code, as the cumulative distribution function (CDF) for the Gaussian bell curve can be. I now suspect that Grubbs and GESD belong to a subset of outlier detection methods that DBAs will rarely find use cases for, along with such related means as the Tietjen-Moore Test, Dixon’s Q-Test, Chauvenet’s Criterion and the Modified Thompson Tau Test. I’ll dispense with these in the middle of the series since I’ve already written most of the code for them and might as well not let it go to waste, just in case a reader out there discovers a need to include them in their toolbelt. After this quick stretch we’ll get into methods like Interquartile Range and Peirce’s Criterion, which I expect to be much more useful for our scenarios, although perhaps not as much as topics we’ve already covered like Benford’s Law and Z-Scores. I also have high hopes for Cook’s Distance and Mahalanobis Distance, which I’ll tackle after a recap of SSDM Clustering and an interlude into Visual Outlier Detection with Reporting Services, in which we can spot outliers with the naked eye. For now, I’ll finish by quickly getting out of the way the other means of outlier detection from the same class as GESD and Grubbs. Many of these share some of the same severe limitations, such as dependence on a normal distribution. GESD may be the most flexible among them, since it allows you to specify the number of outliers you want to look for, whereas Dixon’s Q-Test and Grubbs limit you to just one. As we shall see next week, the Tietjen-Moore test appears at first glance to be more useful since it also includes a parameter like @NumberofTests. Its utility is crimped, however, by the subtle difference that it only tells you whether or not the dataset contains that number of outliers. GESD will likely be more useful, in that it can actually flag the specified number of records as aberrant data points.

[1] See National Institute for Standards and Technology, 2014,  “1.3.5.17.1. Grubbs’ Test for Outliers,” published in the online edition of the Engineering Statistics Handbook. Available at http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h1.htm

[2] Rosner, Bernard, 1983,  “Percentage Points for a Generalized ESD Many-Outlier Procedure,” pp. 165-172 in Technometrics, May, 1983. Vol. 25, No. 2.  Original citation found at the NIST webpage “1.4.3. References For Chapter 1: Exploratory Data Analysis,” published in the online edition of the Engineering Statistics Handbook. Available  on the Internet at  http://www.itl.nist.gov/div898/handbook/eda/section4/eda43.htm#Rosner

[3] See the Wikipedia page on the “Student’s T-distribution,” available at the web address http://en.wikipedia.org/wiki/Student’s_t-distribution .

[4] Ben-Gan, Itzik, 2012, Microsoft SQL Server 2012 High-Performance T-SQL Using Window Functions . O’Reilly Media, Inc.: Sebastopol, California.

 

Outlier Detection with SQL Server, part 3.1: Grubbs’ Test


By Steve Bolton

…………In the last two installments of this series of amateur self-tutorials, I mentioned that the various means of detecting outliers with SQL Server might best be explained as a function of the use cases, the context determined by the questions one chooses to ask of the data, the number and data types of the inputs and the desired mathematical properties of the outputs. The means of calculation in between the input and output stage may also be pertinent for performance reasons. Moreover, we can differentiate these aberrant data points we call “outliers” by their underlying causes, which must be matched with the correct response; it does us no good to find extreme values in our datasets if we can’t determine whether they were the product of faulty data collection, corruption during storage, natural random variations or some other root cause, then use that determination to handle them correctly. If we could build a matrix of outlier detection methods and their use cases, then Grubbs’ Test would occupy a very narrow range. The inputs and questions the test can answer are quite constrained, since the test can only determine whether the highest and lowest values in a sample are outliers. It outputs a single test statistic, which can be used to produce a single Boolean answer, rejecting or accepting the null hypothesis that there are no outliers in the dataset. The National Institute for Standards and Technology’s Engineering Statistics Handbook, one of the best online resources for explaining statistics in plain English, warns that, “If you suspect more than one outlier may be present, it is recommended that you use either the Tietjen-Moore test or the generalized extreme Studentized deviate test instead of the Grubbs’ test.” In the kinds of billion-row databases that SQL Server DBAs work with on a day-to-day basis, we can expect far more than a single aberrant data point just by random chance alone. Grubbs’ Test is more applicable to hypothesis testing on small samples in a research environment, but I’ll provide some code anyway on the chance that it might prove useful to someone in the SQL Server community on small datasets.
…………The “maximum normed residual test” originated with a paper penned for the journal Technometrics by Frank E. Grubbs, a statistician for the U.S. Army’s Ballistics Research Laboratory (BRL), six years before his retirement in 1975. Apparently the Allies owe him some gratitude, given that “he was dispatched to England in 1944 as part of a team on a priority mission to sample and sort the artillery ammunition stockpiled for the invasion of France. After the team conducted thousands of test firings of the hundreds of different lots of artillery ammunition in the stockpiles, he analyzed the statistical variations in the data and was able to divide the ammunition into four large categories based on their ballistic characteristics. As a result, the firing batteries did not need to register each lot of ammunition before it was unpacked; they only needed to apply four sets of ballistic corrections to the firing tables to achieve their objectives.” After the war, he assisted the BRL in evaluating the reliability and ballistic characteristics of projectiles, rockets, and guided missiles; maybe he wasn’t a “rocket scientist,” as the saying goes, but close enough. The groundwork for the test that bears his name was laid in 1950, when he published a paper titled “Procedures for Detecting Outlying Observations in Samples” for Annals of Mathematical Statistics, which I also had to consult for this article. The 1950 paper is made publicly available by the Project Euclid website, while the one establishing the test itself is made available at California Institute of Technology’s Infrared Processing and Analysis Center, for anyone wise enough to double-check my calculations and code or to get more background.
…………Calculating the test statistic from the formula at the NIST webpage is really trivial; the difficulty is in finding proper tables of the T-distribution to interpret the statistic with. The equation for the two-sided test is quite similar to the familiar Z-Score, except that we take the maximum value of the absolute deviation (i.e., the data point minus the mean) before dividing by the standard deviation. The one-sided tests for determining if a minimum or maximum value in a dataset is an outlier are only slightly different; in the former we subtract the minimum value from the mean, while in the latter we subtract the mean from the maximum. Since the code is so easy that even a caveman can do it, I decided not to complicate it by adding logic to let the user select which of the three tests to use; I simply return all three in one row, along with the critical regions for each. The formulas for calculating the critical regions at the NIST’s webpage on Grubbs’ Test are more involved and require the use of the function in Figure 3. This in turn calls a rather sloppy but effective function to find the correct critical values for the T-distribution, from the lookup tables defined in Figure 1. I haven’t supplied any code to populate them, but this can be easily rectified by using one of the thousands of lookup tables available on the Internet for that distribution. The tricky part was finding a table that was reasonably complete, since many sources begin skipping over deciles around 40 or 50 degrees of freedom; I populated my own from the best source I could find, the “Tables of Critical values of t for Probabilities” at the StatsToDo website. In fact, you may need to tweak the DDL and retrieval code if you use a different source, since my ranges and stopping point of 200 degrees of freedom are derived from that particular source. According to another lookup table (from a Blogspot post by a writer who I’ve been unable to identify to give proper credit) that I didn’t use because it skips some deciles, the values for 200 and 250 are nearly identical, differing only by hundredths of a percentage point; the next value listed there is for infinity, which varies only a few hundredths of a percentage point from 250. Unlike researchers working with small samples drawn from an unknown population, SQL Server users can often instantly call up millions of records, so using the smaller values of these lookup tables may be of limited utility for our use cases. I only recently learned how to do hypothesis testing, so my standard advice to check my code before putting it into a production environment definitely holds here. The Grubbs Statistic values match the NIST sample results though and could prove useful, by providing a measurement of “the largest absolute deviation from the sample mean in units of the sample standard deviation.”
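…………In my own notation (double-check it against the NIST page before relying on it), the three statistics and the two-sided critical region work out to:

$$G = \frac{\max_{i}\,\lvert Y_{i} - \bar{Y}\rvert}{s}, \qquad G_{\min} = \frac{\bar{Y} - Y_{\min}}{s}, \qquad G_{\max} = \frac{Y_{\max} - \bar{Y}}{s}$$

$$G > \frac{N - 1}{\sqrt{N}}\,\sqrt{\frac{t_{\alpha/(2N),\,N-2}^{2}}{N - 2 + t_{\alpha/(2N),\,N-2}^{2}}}$$

where $t_{\alpha/(2N),\,N-2}$ is the critical value of the T-distribution with $N - 2$ degrees of freedom at a significance level of $\alpha/(2N)$; the one-sided versions substitute $\alpha/N$. That ratio is the quantity the function in Figure 3 assembles from the lookup values returned by the function in Figure 2.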

Figure 1: DDL for the T-Distribution Lookup Tables
CREATE TABLE [Calculations].[CriticalValueRangeTable](
       [ID] [smallint] IDENTITY(1,1) NOT NULL,
       [OneTail] [decimal](5, 4) NULL,
       [TwoTail] [decimal](5, 4) NULL,
 CONSTRAINT [PK_CriticalValueRangeTable] PRIMARY KEY CLUSTERED ([ID] ASC)
 WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

CREATE TABLE [Calculations].[TDistributionTable](
       [ID] [smallint] IDENTITY(1,1) NOT NULL,
       [ProbabilityRange1] [decimal](7, 4) NULL,
       [ProbabilityRange2] [decimal](7, 4) NULL,
       [ProbabilityRange3] [decimal](7, 4) NULL,
       [ProbabilityRange4] [decimal](7, 4) NULL,
       [ProbabilityRange5] [decimal](7, 4) NULL,
       [ProbabilityRange6] [decimal](7, 4) NULL,
       [ProbabilityRange7] [decimal](7, 4) NULL,
 CONSTRAINT [PK_TDistributionTable] PRIMARY KEY CLUSTERED ([ID] ASC)
 WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
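…………Since I haven’t supplied population code, here is a hypothetical sketch of what it might look like. The column-to-probability mapping below is an assumption on my part (the first range as one-tail 0.05/two-tail 0.10, down to one-tail 0.0005/two-tail 0.001), so adjust it to whatever source you transcribe your lookup values from; the single TDistributionTable row shown holds the standard critical t values for one degree of freedom at those probabilities, which you should verify against your own table.

-- hypothetical mapping of the seven ProbabilityRange columns; change it to match your source
INSERT INTO [Calculations].[CriticalValueRangeTable] (OneTail, TwoTail)
VALUES (0.05, 0.10), (0.025, 0.05), (0.01, 0.02), (0.005, 0.01), (0.0025, 0.005), (0.001, 0.002), (0.0005, 0.001)

-- one sample row: critical t values for 1 degree of freedom (ID = 1) at the probabilities above
INSERT INTO [Calculations].[TDistributionTable]
(ProbabilityRange1, ProbabilityRange2, ProbabilityRange3, ProbabilityRange4, ProbabilityRange5, ProbabilityRange6, ProbabilityRange7)
VALUES (6.3138, 12.7062, 31.8205, 63.6567, 127.3213, 318.3088, 636.6192)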

Figure 2: Function to Look Up Values in the T-Distribution Tables
CREATE FUNCTION [Calculations].[FindCriticalRegionForTDistributionFunction]
(@DegreesOfFreedom bigint, @SidedTestType bit, @ProbabilityValue decimal(5,4))
RETURNS decimal(7, 4)
AS
BEGIN
DECLARE @CriticalRegion decimal(7, 4)
-- this is a little more awkward than I'd like, but hey
      SELECT @CriticalRegion = CASE WHEN ProbabilityRangeColumnID = 1 THEN (SELECT ProbabilityRange1
FROM Calculations.TDistributionTable WHERE ID = (CASE WHEN @DegreesOfFreedom
>= 200 THEN 200 WHEN ID = @DegreesOfFreedom THEN @DegreesOfFreedom ELSE NULL END))
      WHEN ProbabilityRangeColumnID = 2 THEN (SELECT ProbabilityRange2
FROM Calculations.TDistributionTable WHERE ID = (CASE WHEN @DegreesOfFreedom
>= 200 THEN 200 WHEN ID = @DegreesOfFreedom THEN @DegreesOfFreedom
ELSE NULL END))
       WHEN ProbabilityRangeColumnID = 3 THEN (SELECT ProbabilityRange3
FROM Calculations.TDistributionTable WHERE ID = (CASE WHEN @DegreesOfFreedom
>= 200 THEN 200 WHEN ID = @DegreesOfFreedom THEN @DegreesOfFreedom ELSE NULL END))
       WHEN ProbabilityRangeColumnID = 4 THEN (SELECT ProbabilityRange4
FROM Calculations.TDistributionTable WHERE ID = (CASE WHEN @DegreesOfFreedom
>= 200 THEN 200 WHEN ID = @DegreesOfFreedom THEN @DegreesOfFreedom ELSE NULL END))
       WHEN ProbabilityRangeColumnID = 5 THEN (SELECT ProbabilityRange5
FROM Calculations.TDistributionTable WHERE ID = (CASE WHEN @DegreesOfFreedom
>= 200 THEN 200 WHEN ID = @DegreesOfFreedom THEN @DegreesOfFreedom ELSE NULL END))
      WHEN ProbabilityRangeColumnID = 6 THEN (SELECT ProbabilityRange6
FROM Calculations.TDistributionTable WHERE ID = (CASE WHEN @DegreesOfFreedom >= 200 THEN 200 WHEN ID = @DegreesOfFreedom THEN @DegreesOfFreedom ELSE NULL END))
       WHEN ProbabilityRangeColumnID = 7 THEN (SELECT ProbabilityRange7
FROM Calculations.TDistributionTable WHERE ID = (CASE WHEN @DegreesOfFreedom >= 200 THEN
200 WHEN ID = @DegreesOfFreedom THEN @DegreesOfFreedom ELSE NULL END))
       ELSE NULL END
      FROM   (SELECT TOP 1 ID AS ProbabilityRangeColumnID
             FROM  (SELECT ID, ValueRange, ABS(ValueRange - @ProbabilityValue) AS RangeDifference, Lead(ValueRange, 1, 0) OVER (ORDER BY ValueRange) AS Lead
                     FROM (SELECT ID, 'ValueRange' = CASE WHEN @SidedTestType = 0 THEN OneTail
                           WHEN @SidedTestType = 1 THEN TwoTail
                           ELSE NULL END
                     FROM [Calculations].[CriticalValueRangeTable] ) AS T1) AS T2
              ORDER BY RangeDifference ASC) AS T3
      RETURN @CriticalRegion
END
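…………Once the lookup tables have been populated, a call like the following (with made-up argument values) returns the critical t value for 45 degrees of freedom in a two-sided test, at a probability value of 0.05:

SELECT [Calculations].[FindCriticalRegionForTDistributionFunction] (45, 1, 0.05) AS CriticalTValue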

Figure 3: Grubbs Hypothesis Testing Function
CREATE FUNCTION [Calculations].[GrubbsHypothesisTestSP](
@DegreesofFreedom bigint, @TestType bit = 0, @SignificanceLevel decimal(38,35))
RETURNS decimal(38,32)
AS
BEGIN
       DECLARE @CriticalValue decimal(38,32), -- *** look this up in a table by the SignificanceLevel I've already recalculated according to the formulas, and also the Degrees of Freedom - 2
      @ReturnValue decimal(38,32)
       SELECT @CriticalValue = [Calculations].[FindCriticalRegionForTDistributionFunction] (@DegreesofFreedom, @TestType, @SignificanceLevel)
      SELECT @ReturnValue = Power(Power(@CriticalValue, 2) / (@DegreesOfFreedom - 2
+ Power(@CriticalValue, 2)), 0.5) * ((@DegreesOfFreedom -1) / Power(@DegreesOfFreedom,
0.5))
       RETURN @ReturnValue
END
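…………For example, the critical region for a two-sided Grubbs’ test on 209 records at a significance level of 0.05 (roughly the scenario in Figure 5 below, give or take a few null values) can be retrieved like this:

SELECT [Calculations].[GrubbsHypothesisTestSP] (209, 1, 0.05) AS CriticalRegionForGrubbsTwoSided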

Figure 4: Code for the Grubbs Test Procedure
CREATE PROCEDURE [Calculations].[GrubbsTestSP]
@DatabaseName as nvarchar(128) = NULL, @SchemaName as nvarchar(128), @TableName as nvarchar(128), @ColumnName AS nvarchar(128), @DecimalPrecision AS nvarchar(50), @Alpha decimal(38,35) = 0.05
AS

DECLARE @SchemaAndTableName nvarchar(400), @SQLString nvarchar(max)
SET @DatabaseName = @DatabaseName + '.'
SET @SchemaAndTableName = ISNull(@DatabaseName, '') + @SchemaName + '.' + @TableName --I'll change this value one time, mainly for legibility purposes

SET @SQLString = 'DECLARE @Mean decimal(' + @DecimalPrecision + '),
@StDev decimal(' + @DecimalPrecision + '),
@Min decimal(' + @DecimalPrecision + '),
@Max decimal(' + @DecimalPrecision + '),
@GrubbsVersion1 decimal(' + @DecimalPrecision + '),
@GrubbsVersion2 decimal(' + @DecimalPrecision + '),
@GrubbsVersion3 decimal(' + @DecimalPrecision + '),
@DegreesofFreedom bigint,
@SignificanceLevel decimal(' + @DecimalPrecision + '),
@SignificanceLevelOneSided decimal(' + @DecimalPrecision + ')

SELECT @DegreesofFreedom = Count(' + @ColumnName + '), @Mean = Avg(CAST(' + @ColumnName + ' AS Decimal(' + @DecimalPrecision + '))), @StDev = StDev(CAST(' + @ColumnName + ' AS Decimal(' + @DecimalPrecision + '))), @Min = Min(' + @ColumnName + '), @Max = Max(' + @ColumnName + ') FROM ' + @SchemaAndTableName + ' WHERE ' + @ColumnName + ' IS NOT NULL

-- the sample exercise at the NIST webpage uses a straight value of @SignificanceLevel = @Alpha, rather than the two calculations for two- and one-sided tests that are recommended elsewhere in the directions; hence they are commented out for now
--SET @SignificanceLevel = ' + CAST(@Alpha AS nvarchar(50)) + ' / (2 * @DegreesofFreedom)
--SET @SignificanceLevelOneSided = ' + CAST(@Alpha AS nvarchar(50)) + ' / (@DegreesofFreedom)
SET @SignificanceLevel = ' + CAST(@Alpha AS nvarchar(50)) + '
SET @SignificanceLevelOneSided = ' + CAST(@Alpha AS nvarchar(50)) + '

SELECT @GrubbsVersion1 = Max(' + @ColumnName + ' - @Mean) / @StDev, @GrubbsVersion2 = ((@Mean - @Min) / @StDev),
@GrubbsVersion3 = ((@Max - @Mean) / @StDev)
FROM ' + @SchemaAndTableName + '
WHERE ' + @ColumnName + ' IS NOT NULL

SELECT @GrubbsVersion1 AS GrubbsTwoSided, CAST([Calculations].[GrubbsHypothesisTestSP] (@DegreesOfFreedom, 1, @SignificanceLevel) AS decimal(' + @DecimalPrecision + ')) AS CriticalRegionForGrubbsTwoSided,
@GrubbsVersion2 AS GrubbsLowerOneSided, @GrubbsVersion3 AS GrubbsUpperOneSided, CAST([Calculations].[GrubbsHypothesisTestSP] (@DegreesOfFreedom, 0, @SignificanceLevelOneSided) AS decimal(' + @DecimalPrecision + ')) AS CriticalRegionForGrubbsUpperOneSided'
--SELECT @SQLString -- uncomment this to debug string errors
EXEC (@SQLString)

…………Thankfully, we will be able to reuse the mass of T-SQL in the first three figures in next week’s tutorial, which also requires looking up T-distribution values. The code in Figure 4 will look familiar if you’ve been following this mistutorial series. As always, you’ll have to add the brackets and program in the logic yourself if you allow spaces in your object names; you may also need to add code to guard against SQL injection and other security risks. Also, keep in mind that I’m still using a Calculations schema for these sample routines, so you may need to create one in your database or change the name as needed. The first four parameters allow you to run the procedure against any column in any database for which you have the requisite permissions and access. The @DecimalPrecision also allows you to manually set the precision and scale for the internal calculations of the procedure, in order to avoid arithmetic overflows. There are some slight differences between the parameters of this procedure and those discussed in the last few articles though, such as the fact that the usual @OrderByCode and @PrimaryKeyName are not needed here. The @Alpha parameter allows you to set the significance level of the test to any value you please (including incorrect ones, since I haven’t added any validation code) like the default of 0.05, which corresponds to a 95 percent confidence level. Oddly, Grubbs wrote in the 1969 paper that the confidence levels ought to be at least 99 percent for use with this test, but still used the standard 95 percent in his own example. Similarly, the NIST webpage says to use @Alpha divided by the degrees of freedom for a one-sided test and divided by twice the degrees of freedom for a two-sided test, yet uses a plain significance value of @Alpha = 0.05 in its sample data. Hence the commenting out of the code that would change the @SignificanceLevel to the alternate values. When run against the sample exercise on the NIST page, my code is accurate within about a hundredth of a percentage point, as long as this adjustment is made.
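…………To put numbers on that adjustment: with a sample of, say, N = 209 and the default @Alpha of 0.05, the commented-out formulas would shrink the significance levels to

$$\frac{\alpha}{2N} = \frac{0.05}{2 \cdot 209} \approx 0.00012 \;\text{(two-sided)}, \qquad \frac{\alpha}{N} = \frac{0.05}{209} \approx 0.00024 \;\text{(one-sided)},$$

which sits far deeper in the tail of the T-distribution than the plain 0.05 the NIST example uses; that difference is why the adjustment has to be made (or not made) consistently when trying to reproduce their sample results.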

Figure 5: Results for the Grubbs Test Procedure
EXEC   Calculations.GrubbsTestSP
             @DatabaseName = N'DataMiningProjects',
             @SchemaName = N'Health',
             @TableName = N'DuchennesTable',
             @ColumnName = N'PyruvateKinase',
             @DecimalPrecision = N'12,7',
             @Alpha = 0.05

[Image: GrubbsTestResults]

 

…………Executing a query like the one above against the Duchennes dataset we’ll be working with in this tutorial series produced the five columns above. The first, third and fourth columns represent the Grubbs Statistics for the two-sided, minimum and maximum tests respectively, while the Critical Regions are derived from the aforementioned code in Figures 1 through 3. The Grubbs Statistics are far beyond the critical regions, so yes, the maximum values in the dataset are beyond the thresholds and thus can be defined as “outliers.” Keep in mind that after testing the procedure against several different columns in different datasets, I’ve only seen slight differences between the two-sided result and the upper one-sided; due to lack of appropriate samples to work through, I cannot say whether or not that can be expected or is an error on my part. The PyruvateKinase column refers to an enzyme that is apparently involved in the devastating Duchennes form of muscular dystrophy, which is the subject of a tiny nine-kilobyte sample dataset made publicly available by Vanderbilt University’s Department of Biostatistics. In the last couple of blog posts I used the first float column of the Higgs Boson Dataset that the University of California at Irvine’s Machine Learning Repository has made available online, which occupies nearly 6 gigabytes of space in the DataMiningProjects database I’ve created to encompass all of the practice datasets I’ll be using in tutorial series to come. Traversing that much space in index scans and seeks turned out to be somewhat costly for the two versions of Z-Scores that I coded in the last two posts, but I haven’t bothered to post Client Statistics from SQL Server Management Studio (SSMS) because the Grubbs’ Test procedure takes only a matter of seconds, even for the massive float columns of the Physics.HiggsBosonTable.
…………The procedure may perform well compared to other outlier detection methods, but its usefulness is limited. I have yet to test it against a table that didn’t turn out to have outliers – which is likely to be the case for most of the large tables that DBAs might run the procedure on. Grubbs’ Test is more useful for the kind of small tables used in hypothesis testing, rather than exploratory data mining and data quality analyses, which are far more common use cases in the SQL Server user community. Statistical testing of this kind is also prone to several real dangers I’ve touched on in the last few articles and others that I have yet to address. The confidence levels commonly associated with them are pretty much drawn out of thin air; you’ll see 95 percent used most of the time, but only because it was an arbitrary de facto standard long before the UNIVAC. There really isn’t a good justification for it, except for the fact that it has been used for so long. Secondly, probabilities are not guarantees; there is a finite chance that random fluctuations alone could produce a dataset that consisted of nothing but outliers, using any definition and detection method. Worst of all, Grubbs’ Test requires a Gaussian (i.e. “normal”) distribution, better known as the bell curve. Without goodness-of-fit tests that clearly demonstrate that the data ought to fit the normal distribution, such tests are useless – or worse, deceptive. As Grubbs himself puts it mildly, “Until such time as criteria not sensitive to the normality assumption are developed, the experimenter is cautioned against interpreting the probabilities too literally when normality of the data is not assured.” I don’t yet know how to apply some of the best goodness-of-fit tests (mainly because I’m still having trouble wrapping my head around some of the intricacies of cumulative distribution functions) but that won’t stop me from harping on this point repeatedly: the extent to which statistics are bandied about in many academic disciplines without proper testing today is simply alarming. The one place we can least afford to see them is in medical research, where they become a matter of life and death, but at least half of all studies published today contain at least one statistical error. The most common error appears to be the lack of goodness-of-fit testing; researchers in many fields seem to be in the habit of routinely applying tests that depend on a Gaussian distribution with reckless disregard for their validity. It’s not surprising that this occurs, given that there are so few statistical tests that can be used with the scores of other distributions that data might follow. If researchers everywhere were held to a proper standard of evidence, they might not be able to back up claims for a favorite project or experimental medicine with any statistics at all.
…………This leads straight back to the logical and moral issues that I’ve tried to reiterate throughout this series: there are an awful lot of shenanigans that go on with “lies, damned lies and statistics,” as the old saying goes. Grubbs’ Test is no more vulnerable to misuse than any other measure based on the normal distribution, all of which can be invalid, misleading or worse when the data is not naturally Gaussian. It is sometimes, however, applied recursively by simply deleting outliers until some stopping criterion is reached, which raises the grim possibility of improper handling of unfavorable data points. In some situations, an iterative Grubbs Test is vulnerable to false negatives, in which actual outliers are not detected, or false positives, in which haphazard definitions of stopping criteria lead to the deletion of good data. That brings us back full circle to the confluence of subjective definitions, “moving the goalposts” and inadequate analysis of the causes of aberrance which I discussed at length in the last article. Thankfully, the Generalized Extreme Studentized Deviate Test (GESD) ameliorates some of the standard problems associated with the sequential application of Grubbs’ Test by making a few adjustments; the NIST recommends that either the GESD or the Tietjen-Moore test be applied when looking for more than one outlier, which is almost always going to be the case in a SQL Server database. The math and code for both are also relatively simple. Unfortunately, they are both dependent on a Gaussian distribution, which means they also require goodness-of-fit tests that are often dispensed with in an irresponsible manner. The same limitation applies to Dixon’s Q-Test, which is simple to code, as well as to Chauvenet’s Criterion, which is not. It may also be true of Peirce’s Criterion, which will also be featured later in this series. Interquartile Range is a much more easily coded method of outlier detection which may also be less dependent on the normal distribution. Later in the series, I’ll give a quick recap of the Clustering algorithm in SQL Server Data Mining (SSDM) and supply some eye candy that is much easier on the brain than these fancy equations in Visual Outlier Detection with SQL Server Reporting Services. Towards the end I’ll delve into more difficult methods like Cook’s Distance and the Modified Thompson Tau Test, then Mahalanobis Distance. Many of these more sophisticated methods are of course more difficult to code than GESD, Tietjen-Moore and Dixon’s Q-Test, but they may also be more applicable to distributions besides the bell curve.

Outlier Detection with SQL Server, part 2.2: Modified Z-Scores

By Steve Bolton

…………There are apparently many subtle variations on Z-Scores, a ubiquitous measure that is practically a cornerstone in the foundation of statistics. The popularity and ease of implementation of Z-Scores are what made me decide to tackle them early on in this series of amateur self-tutorials, on using various components of SQL Server to detect those aberrant data points we call “outliers.” As discussed in the last two posts, there are many different means of identifying outliers, which may be understood best by categorizing them by their use cases; the right choice of detection tools is essentially a function of the questions one wants to ask of the data, the number and types of inputs, the desired mathematical properties of the outputs and in between, the performance and other requirements used in transforming the inputs into outputs. From my scant understanding of what little literature I’ve read on the topic, statisticians and other researchers commonly encounter use cases where the sensitivity of ordinary measurements to outliers has to be toned down, often in response to fat-tailed (i.e. highly skewed) distributions. The Modified Z-Scores developed by Temple University Prof. Boris Iglewicz and University of Massachusetts Prof. David C. Hoaglin are one means of adjusting Z-Scores for such cases, but hardly the only one. I’m highlighting it merely because I was introduced to it early on, while trying to learn the subject of stats from the National Institute for Standards and Technology’s Engineering Statistics Handbook, one of the best online resources for anyone trying to wade through this notoriously dry subject.[i]
…………Iglewicz and Hoaglin suggest specific cut-off criteria for their measure, which can of course be adjusted as needed by users – which raises the whole question of researchers “moving the goalposts” or setting them haphazardly when using any means of outlier detection. Correct classification is a thorny issue with every method we’ll discuss in this series; I’m merely using the hard boundary associated with Modified Z-Scores as an introduction to the all-important topic of subjectivity in category definitions. As an amateur, I can’t give any guidance on whether or not to use a specific cut-off point for classifying outliers, although it would seem to be common sense that more detailed rough and fuzzy sets ought to be more commonly used in place of hard limits than they are. It is worth reiterating that outliers also vary significantly in their causes and the responses made to them, not just the means of differentiation. The frequent mismatches between the causes and responses and the lack of attention paid to discerning them both leave the door wide open to innumerable fallacies and sometimes outright fraud, which as I discussed earlier in this series, is frighteningly common in certain fields. The definition of an outlier is subjective, depending on the kind of investigation a researcher chooses to perform, but whether or not a particular data point meets the chosen criteria is wholly objective. Fallacies and fraud arise when the distinction between the right uses and proper places of subjectivity and objectivity is blurred; the whole history of human philosophy demonstrates that when the former is loosed from such bonds, the result is always maniacal madness. For example, a person can choose to affix the name “pepperoni pizza” to anything they want; but once they’ve set tomato sauce, bread, cheese and the like as part of the criteria, then they can’t pretend that a pencil sharpener or a Planck length fits the definition, because whether or not they consist of the same ingredients set forth in the criteria is an objective matter. That’s plain common sense, which suddenly becomes uncommon when the labeler has an incentive to fudge their definitions, or worse yet, a pedantic justification for it, like solipsism (i.e., one of the major symptoms of schizophrenia). Outlier detection presents a serious temptation to simply ignore the distinctions between causes and put no effort into matching them with the correct responses, so that data miners and others who use these tools frequently just delete records that don’t fit their models and theories, or adjust their definitions of the term to achieve the same purpose. I’ll delve into outlier deletion in more depth a few posts from now, but the issue of subjective limits can serve as a transition into my usual dire disclaimer that math formulas, including those underpinning data mining algorithms and outlier detection, reside in a Pandora’s Box. The Modified Z-Scores under discussion today do not open the box any wider than any other formula; this is merely the context in which all statistical measures naturally reside, in which the slightest logical deviation in their use may lead to erroneous, misleading or even fraudulent conclusions. Data mining tools can be used quite easily by amateurs like myself for exploratory data analysis, but need to be handled like scalpels when attempting to prove a specific point. Nevertheless, they’re often employed carelessly like sledge hammers by professionals in many different fields, particularly health care. The specter of fallacious reasoning hems us in on all sides, and wielding these tools properly for this purpose requires more skill than the old board game of Operation. The difference with math and logic is that there is no buzzer to warn us when we’ve used them wrong; there may be terrible consequences down the line in the form of falling bridges and adverse medical reactions, but the intelligentsia also has the intellectual power to explain those away using the same poor reasoning. What is called for here is not intelligence, but wisdom; without it, outlier detection methods merely prove the old adage, “There are three kinds of lies: lies, damned lies and statistics.”[ii] No data mining tool or math formula is going to provide a quick fix for this overarching problem, which hangs like a Sword of Damocles over everything that researchers, scientists, mathematicians and data miners do; the only fix is to apply the use of reason rigorously, which requires a deep understanding of logical fallacies and in turn, painful self-examination. Neither I nor most of the DBAs who read this probably have that kind of training, so our use cases ought to be limited to exploratory analysis – which can be a highly productive exercise, even for the unqualified – rather than hypothesis testing and the like.
…………The point of using Modified Z-Scores is to address situations where it is desirable to reduce the sensitivity to outliers, so that there are for all intents and purposes fewer false positives when classifying them. Whether or not such reduced sensitivity is a wise choice to fit the problem at hand is one question; whether or not Modified Z-Scores succeed in doing so seems to be an open and shut case. In this series I’m trying to grasp the mechanisms that make these algorithms and formulas work as they do, which is something I didn’t delve into adequately in my series on SQL Server Data Mining (SSDM). The reason why Iglewicz and Hoaglin’s Z-Scores are less sensitive to outliers without being completely blind to them is that they use medians rather than means, an alternate measure of central tendency that is known for being less affected by unusual values. Both medians and means are basically primordial forms of clustering that identify a specific location near the center of a dataset, but the former is less affected by the most distant points. The formula given at the NIST website is not all that difficult to decipher or code in T-SQL; I was unable to get ahold of a copy of their original paper to see what the reasoning was behind the constant that appears in it, but it is child’s play to simply include it in the code and be done with it.[iii] This was my first introduction to median absolute deviation (MAD), which is a variation of the average absolute deviation that is even less affected by extremes in the tail because “the data in the tails have less influence on the calculation of the median than they do on the mean.”[iv] I initially confused it with a more common calculation, the mean absolute deviation, because of the similar names. The idea is basically the same though: instead of taking a mean of a mean, we compare each data point to the median of the whole dataset, then calculate a new median for the absolute value of those distances. Then we subtract the median from each data point again, multiply that result by a constant, 0.6745, and divide the result by the MAD. The equations are actually quite easy to read; most of the T-SQL involved in implementing them is dedicated to calculating the two medians, using some subqueries and windowing functions. I’ve precalculated both in Common Table Expressions (CTEs) at the beginning of this week’s T-SQL stored procedure, because this reduces them to one-time operations (I think the technical term might be an “Θ(n) operation”) and makes the complicated dynamic SQL a little more legible. The T-SQL in Figure 1 could be streamlined further to suit your needs by removing the DENSE_RANK calculation, the OutlierCandidate column and the @OrderByCode logic, which are dispensable elements I’ve added as conveniences.
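…………Putting the same recipe into symbols (this is my own rendering, so verify it against the NIST page), with $\tilde{x}$ standing for the median of the column:

$$MAD = \operatorname{median}_{i}\,\lvert x_{i} - \tilde{x} \rvert, \qquad M_{i} = \frac{0.6745\,(x_{i} - \tilde{x})}{MAD},$$

with any data point where $\lvert M_{i} \rvert > 3.5$ flagged as a potential outlier. A toy example: for the values {1, 2, 3, 4, 100}, the median is 3 and the absolute deviations are {2, 1, 0, 1, 97}, so the MAD is 1; the Modified Z-Score for 100 is then $0.6745 \times 97 \approx 65.4$, far past the 3.5 cut-off, while the score for 4 is only 0.6745.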

Figure 1: Code for the Modified Z-Score Stored Procedure[v]
CREATE PROCEDURE [Calculations].[ModifiedZScoreSP]
@DatabaseName as nvarchar(128) = NULL, @SchemaName as nvarchar(128), @TableName as nvarchar(128), @ColumnName AS nvarchar(128), @PrimaryKeyName as nvarchar(400), @OrderByCode as tinyint = 1, @DecimalPrecision AS nvarchar(50)
-- @OrderByCode: 1 is by PK ASC, 2 is by PK DESC, 3 is by ColumnName ASC, 4 is by ColumnName DESC, 5 is by ModifiedZScore ASC, 6 is by ModifiedZScore DESC
AS

SET @DatabaseName = @DatabaseName + '.'
DECLARE @SchemaAndTableName nvarchar(400), @SQLString nvarchar(max)
SET @SchemaAndTableName = ISNull(@DatabaseName, '') + @SchemaName + '.' + @TableName --I'll change this value one time, mainly for legibility purposes

SET @SQLString =
'DECLARE @OrderByCode as tinyint, -- pass the outer value like a parameter of sorts
@Median AS decimal(' + @DecimalPrecision + '),
@MedianAbsoluteDeviation AS decimal(' + @DecimalPrecision + ');

-- PRECALCULATED STATS
--------------------------
-- precalculating these stats not only makes the code more legible, but is also more efficient, because it turns them into one-time operations
-- first get the median

WITH MedianCTE
(' + @ColumnName + ', RN, DenseRank)
AS
(
SELECT ' + @ColumnName + ', RN, DENSE_RANK() OVER (PARTITION BY 1 ORDER BY RN DESC) AS DenseRank
FROM (SELECT ' + @ColumnName + ', ROW_NUMBER() OVER (ORDER BY ' + @ColumnName + ') AS RN
     FROM ' + @SchemaAndTableName + '
     WHERE ' + @ColumnName + ' IS NOT NULL) AS T1
)

SELECT @Median = Avg(' + @ColumnName + ') FROM MedianCTE WHERE RN BETWEEN DenseRank - 1 AND DenseRank + 1;

-- get the MedianAbsoluteDeviation
WITH MedianAbsoluteDeviationCTE
(' + @ColumnName + ', RN, DenseRank)
AS
(
SELECT NewMedian, RN, DENSE_RANK() OVER (PARTITION BY 1 ORDER BY RN DESC) AS DenseRank
FROM (SELECT NewMedian, ROW_NUMBER() OVER (ORDER BY NewMedian) AS RN
       FROM (SELECT ABS(' + @ColumnName + ' - @Median) AS NewMedian
              FROM ' + @SchemaAndTableName + '
              WHERE ' + @ColumnName + ' IS NOT NULL) AS T1) AS T2)

SELECT @MedianAbsoluteDeviation = Avg(' + @ColumnName + ')
FROM MedianAbsoluteDeviationCTE
WHERE RN BETWEEN DenseRank - 1 AND DenseRank + 1;

--SELECT @Median
--SELECT @MedianAbsoluteDeviation
SET @OrderByCode = ' + CAST(@OrderByCode AS nvarchar(50)) + '

-- OUTLIER COMPARISON OPERATIONS
-- now check each data point
SELECT ' + @PrimaryKeyName + ', ' + @ColumnName + ', ModifiedZScore, DENSE_RANK () OVER (ORDER BY ModifiedZScore) AS GroupRank,
"OutlierCandidate" = CASE WHEN Abs(ModifiedZScore) > 3.5 THEN 1
       ELSE 0
       END
FROM (SELECT ' + @PrimaryKeyName + ', ' + @ColumnName + ', ((0.6745 * (' + @ColumnName + ' - @Median)) / @MedianAbsoluteDeviation) AS ModifiedZScore
FROM ' + @SchemaAndTableName + ') AS T1
       ORDER BY
CASE WHEN @OrderByCode = 1 THEN ' + @PrimaryKeyName + ' END ASC,
CASE WHEN @OrderByCode = 2 THEN ' + @PrimaryKeyName + ' END DESC,
CASE WHEN @OrderByCode = 3 THEN ' + @ColumnName + ' END ASC,
CASE WHEN @OrderByCode = 4 THEN ' + @ColumnName + ' END DESC,
CASE WHEN @OrderByCode = 5 THEN ModifiedZScore END ASC,
CASE WHEN @OrderByCode = 6 THEN ModifiedZScore END DESC'
--SELECT @SQLString -- uncomment this to debug string errors
EXEC (@SQLString)

…………Note that I’m using almost all of the same parameters and dynamic SQL format as in the article on regular Z-Scores. The combination of the first three parameters allows you to execute the procedure against any table in any database, assuming you have the requisite permissions. Just like in the last tutorial, the @OrderByCode parameter allows you to sort the results by 1) the primary key values, ascending; 2) the primary key value descending; 3) the column values ascending; 4) the column values descending; 5) the Modified Z-Score ascending and 6) the Modified Z-Score descending. You’ll have to supply your own @DecimalPrecision values and tweak them to avoid arithmetic overflows, which are tricky to handle when multiple calculations can change the number of decimal places repeatedly. I usually try setting these values to the original precision and scale of decimal and numeric columns if they’re high enough, but when working with integers you’ll have to decide how many decimal places are appropriate for your output. You can debug the dynamic SQL by uncommenting the next-to-last line and two others beginning with comment marks and SELECTs. I’ve also used the DENSE_RANK windowing function again to assign identical results to specific groups by their Modified Z-Score values, which comes in handy with columns that have few distinct values that are repeated many times. The OutlierCandidate is merely a bit column that reveals whether or not the ModifiedZScore falls outside the ±3.5 threshold set by Iglewicz and Hoaglin. Your requirements may be different, so feel free to change the threshold or eliminate it altogether; it wouldn’t be all that difficult either to replace hard thresholds like this with more flexible fuzzy set criteria with graded memberships. If you use @OrderByCode 5 or 6, values where OutlierCandidate = 1 will be sorted to the top and bottom of the results. As usual, you’ll have to add your own brackets and logic to handle spaces if you allow them in your object names (I have a ruthless ban on them in my own code, for legibility purposes) and program in your own security to handle risks like SQL injection.

Figure 2: Results for Column1 of the HiggsBosonTable
EXEC [Calculations].[ModifiedZScoreSP]
              @DatabaseName = N'DataMiningProjects',
              @SchemaName = N'Physics',
              @TableName = N'HiggsBosonTable',
              @ColumnName = N'Column1',
              @PrimaryKeyName = N'ID',
              @OrderByCode = 6,
              @DecimalPrecision = N'33,29'
[Image: ModifiedZScoresValues]

 

Figure 3: Client Statistics for the Modified Z-Scores Procedure
[Images: ModifiedZScoresClientStats (1) and (2)]

…………In last week’s tutorial, I tested my Z-Score stored procedure on the first float column of a nearly 6-gigabyte table from the Higgs Boson Dataset, which is made publicly available by the University of California at Irvine’s Machine Learning Repository. In future mistutorials I will use a dataset on the Duchennes form of muscular dystrophy provided by Vanderbilt University’s Department of Biostatistics, as well as transcriptions of the Voynich Manuscript, a creepy tome whose medieval author encrypted it so well that no one has been able to crack it since, including the National Security Agency (NSA). For the sake of consistency, I tested my Modified Z-Scores procedure against the same Higgs Boson column. Using the query at the top of Figure 2 returned the corresponding results, plus about 11 million more rows that I somehow couldn’t cram into the article. There were some records at the very bottom with Modified Z-Scores near -1, but none that qualified for Iglewicz and Hoaglin’s cut-off point for outliers.
…………I didn’t bother to post screenshots of the execution plans because they weren’t pretty, nor would they fit on an NFL scoreboard. The novel approach I took of comparing the middle point of two windowing functions moving in opposite directions added a lot of expensive sorts, which even the addition of a non-clustered index couldn’t fix. As depicted in Figure 3, the index improved the Client Processing Time and Total Execution Time significantly, but the procedure still consumed far too much memory on my poor beat-up development machine and took too long for my tastes. It will do just fine on columns in small tables, but expect it to take a while if you’re executing it on 11 million rows of a 6-gig database using an outdated workstation in place of a real server.
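…………A non-clustered index of the kind referred to above would look something like the sketch below; the exact definition I tested may have differed, so treat this as an approximation and adjust the names to your own schema:

-- a sketch of the sort of index referred to above; the INCLUDE column is an assumption
CREATE NONCLUSTERED INDEX IX_HiggsBosonTable_Column1
ON [Physics].[HiggsBosonTable] ([Column1])
INCLUDE ([ID])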
…………That drawback ought to refocus attention on one of the caveats I want to stress in this series: I'm posting these articles because I don't know what I'm doing and want to learn, not because I have any real expertise. As with my series A Rickety Stairway to SQL Server Data Mining, I'm passing on my misadventures so that others don't repeat them. Another error I rectified along the way was accidentally substituting a mode for the median while wool-gathering; that procedure might actually be useful in catching certain outliers and I will post it if anyone thinks they can benefit from it, but the bottom line is that I almost posted an article based on the wrong formula. Just keep in mind that my code samples in this series will always need further testing before going into a production environment. Consider these posts an introduction to the topic, not the last word. If all goes according to plan, I'll be introducing both myself and my readers to Chauvenet's Criterion, which is a means of outlier detection that is intrinsically dependent on a Gaussian distribution. I may follow that up by going on a tangent through some fairly easy means of outlier detection, like Grubbs' Test and the Tietjen-Moore Test, the Generalized Extreme Studentized Deviate (ESD) Test, Interquartile Range and Dixon's Q-Test. At some point I'll also get into a discussion of Visual Outlier Detection with Reporting Services (featuring a lot of eye candy) and do a quick recap of Clustering with SSDM. Towards the end of the series I'll tackle Cook's Distance and the Modified Thompson Tau Test, then end with the daunting task of coding Mahalanobis Distance. I hope to use that last post as a springboard towards a much longer and more difficult series months down the line, Information Measurement with SQL Server.

[i] That’s more “dry humor,” but not as bad as the hydrology joke in the last column.

[ii] Although frequently attributed to Mark Twain and Benjamin Disraeli, the quip apparently originated with British politician Leonard H. Courtney in 1895. See the TwainQuotes webpage “Statistics” at http://www.twainquotes.com/Statistics.html.

[iii] See “Detection of Outliers,” an undated article published at the National Institute for Standards and Technology’s Engineering Statistics Handbook website. Available online at http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm. The page in turn cites Iglewicz, Boris and Hoaglin, David, 1993, “Volume 16: How to Detect and Handle Outliers,” The ASQC Basic References in Quality Control: Statistical Techniques, Edward F. Mykytka, Ph.D., Editor.

[iv] See “Measures of Scale,” an undated article published at the National Institute for Standards and Technology’s Engineering Statistics Handbook website. Available online at http://www.itl.nist.gov/div898/handbook/eda/section3/eda356.htm#MAD

[v] While coding this, I forgot how to use the modulo operator properly and consulted Byers, Mark, 2009, response to the thread "How Can I Determine is Digit Even Number?" published Nov. 26, 2009 at StackOverflow.com. Available online at http://stackoverflow.com/questions/1805420/how-can-i-determine-is-digit-even-number. I also double-checked my median calculations against the MathIsFun webpage "How to Find the Median Value" at http://www.mathsisfun.com/median.html.

Outlier Detection with SQL Server, part 2.1: Z-Scores

By Steve Bolton

…………Using SQL Server to ferret out those aberrant data points we call outliers may call for some complex T-SQL, Multidimensional Expressions (MDX) or Common Language Runtime (CLR) code. Yet thankfully, the logic and math that underpin the standard means of outlier detection I'll delve into in this series are sometimes mercifully simple. That was not the case in the introductory article in this amateur series of self-tutorials, in which I tackled the fascinating mystery of Benford's Law, a means of finding outliers that is often used to expose fraud. I also used the occasion to bring up the topic of potential misuses of outlier detection itself, including some duplicitous practices that are nonetheless frighteningly frequent in academic studies – particularly in the health care field, where a lot of money is at stake and where poorly handled stats can do the most damage to ordinary people. I cannot say enough about how critical it is to moor statistics and data mining solidly in reason, because all it takes is a single fallacy to render the conclusions drawn from them useless, misleading or downright false; this point is worth reiterating throughout this blog series, given that it dwarfs all of the technical issues that mathematicians, statisticians, data miners and the field of computing spend much more time on. Our algorithms may go on correctly generating numbers, but if they're not firmly embedded in reason, they may be deceptively erroneous. Some of the saving graces of this week's method of outlier detection are that it is well-understood and simple to implement, both of which mean there is less room for fallacies to worm their way into Z-Score calculations than there is with other methods.
…………Z-Scores may be more run-of-the-mill than cutting edge methods like Benford’s Law, but the technique is ubiquitous throughout the field of statistics precisely because it is a reliable workhorse. In fact, it is used as a building block in many other higher-order statistics, many of which are in turn used in the edifices of today’s sophisticated data mining algorithms. When I wrote my series of self-tutorials on SQL Server Data Mining (SSDM) I merely set out to prove that this neglected component could be used by inexperienced amateurs like myself for practical purposes, but I didn’t yet grasp the mechanics of why the constituent algorithms worked. In this series and the next several I hope to write, I plan to rectify that defect by looking more under the hood. I’m still hardly qualified to write about statistics, but I have found it is much easier to grasp the material by categorizing various stats by their use cases. It is easiest to sort through the confusing welter of numbers and algorithms by viewing the use cases as a function of the questions one wants to ask of the data, the number and type of inputs an algorithm requires, the number and type of outputs it returns, as well as the mathematical properties associated with the inputs and outputs. For example, if you choose to ask a particular question of a dataset but don’t have the correct data types or number of input parameters, or sample sizes, your choices will be quickly narrowed down to a few stats and algorithms. If you require the outputs to have specific mathematical properties, such as positivity or homoscedasticity, your choices will be further constrained, because the formulas are also highly specific as to what kind of data they spit back out. Will G. Hopkins, the author of an excellent plain English guide to stats[i], likewise writes that he had a difficult time sorting out the various types of statistical models until he categorized them by the types of comparisons being made, such as numerical vs. nominal data, or numeric vs. numeric, etc. Those categories are basically equivalent to the column Content types I discussed in the series on SSDM, where they represent an important consideration in the design of mining models. It might be helpful in the future to develop a matrix of use cases for all permutations of Content types, data types, numbers of inputs, properties of outputs and perhaps even the performance requirements for the inner workings of the algorithms in between the input and output stages. For now, however, we’ll just use the concept to illustrate what we’re doing with Z-Scores.
…………This particular measure is merely a comparison of a data point to the mean and standard deviation of the dataset it belongs to. The formula for Z-Scores is fairly simple: subtract the mean from the data point, then divide by the standard deviation. What's the purpose in doing this though? I was able to grope towards a better understanding by resorting to categorization again. Averages are merely the most popular instance of a broader classification of measures of central tendency or location, which identify the center of a dataset; the mean really represents the most primordial form of clustering we know of. If we want to tell how close a particular data point is to that center point, we also need a yardstick to measure the distance by. This is where standard deviation, the most basic metric of dispersion, comes in handy. Rather than pinpointing the single center of the dataset, the deviation is a single measure of how diffuse or spread out all the data points are. Like the mean, standard deviation is a fundamental building block of higher statistics, one which also gives us a convenient means of gauging how far a particular data point is from the center point identified by the mean. In plain English, a Z-Score essentially tells us how many standard deviations (i.e. units of dispersion) there are between a given data point and the center. Many other calculations we'll encounter in the next few tutorial series are either derived directly from Z-Scores, or resemble them in their use of the mean and standard deviation. To someone lost in the swirl of numbers and equations surrounding statistics, it may seem that there is no rhyme or reason to any of them, but there is a method behind the madness. In each Z-Score calculation, we're not plugging in just anything, like a whole probability distribution or a sequence or a set, but a single data point – which matches our question, "Is this an outlier?" More sophisticated calculations may require us to further limit our choices by such considerations as data types, Content types, the internal performance requirements of the algorithm, the number of inputs, the sample size and the desired mathematical properties of the output. In the case of Z-Scores, all we really have to make sure of is that we're inputting one of SQL Server's numeric data types. We obviously can't plug text or dates into a Z-Score equation, although we could perform calculations on such fields and then plug them in as needed. We also need to know the mean and standard deviation for the entire dataset, rather than a mere sample; in some situations it might be impractical to calculate them due to resource constraints, but DBAs usually have one over on those engaged in scientific research, in that they usually have populations of millions of records to draw from if they choose to. Sometimes researchers only have access to small samples taken from unknown populations, in which case it may not be possible to apply Z-Scores at all.
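…………To make that concrete, here is the same arithmetic stripped down to a single made-up value; the procedure in Figure 1 simply performs this calculation across an entire column, with a guard against division by zero.

-- Toy example: a value of 75 in a dataset whose mean is 50 and whose standard deviation is 10
-- lies (75 - 50) / 10 = 2.5 standard deviations above the center.
DECLARE @Value decimal(38,19) = 75, @Mean decimal(38,19) = 50, @StDev decimal(38,19) = 10

SELECT CASE WHEN @StDev = 0 THEN 0
            ELSE (@Value - @Mean) / @StDev END AS ZScore -- returns 2.5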
…………Fortunately, coding a Z-Score is also a lot less taxing on the brain than the subject of the last post, Benford's Law. That is why the stored procedure in Figure 1 is a lot easier to follow. The first three parameters allow you to specify a table in any database for which you have privileges, while the fourth identifies the column to be sampled and the fifth is the name of the table's primary key. Keep in mind that I don't allow spaces in my object names, so if you're going to be operating on objects that have them, you're going to have to add the requisite brackets to this code yourself. Enter the @DecimalPrecision parameter with care, since an incorrect setting will result in arithmetic overflows; leaving that setting up to the end user was a lot easier to code than a workaround that would fit all use cases. The most difficult part of the code to grasp may be the @OrderByCode, which allows you to sort the results by 1) the primary key values, ascending; 2) the primary key values, descending; 3) the column values, ascending; 4) the column values, descending; 5) the Z-Score, ascending; and 6) the Z-Score, descending. I'm a stickler about giving credit where it is due, so I'll point out that I've written ORDER BY CASE logic before, but double-checked the syntax against a thread by one of the greatest assets of the SQL Server community, Pinal Dave.[ii] Uncommenting the next-to-last line will allow you to debug the procedure as needed by checking the dynamic SQL. Also be aware that I haven't taken any steps to proof this against SQL injection attacks, so be ready to program your own security requirements into it. In addition, the procedure is created in a schema called Calculations that I will be using frequently throughout this series, so be prepared to add it to your own database or change the code.

Figure 1: Code for the Z-Score Stored Procedure
CREATE PROCEDURE [Calculations].[ZScoreSP]
@DatabaseName as nvarchar(128) = NULL, @SchemaName as nvarchar(128), @TableName as nvarchar(128), @ColumnName AS nvarchar(128), @PrimaryKeyName as nvarchar(400), @OrderByCode as tinyint = 1, @DecimalPrecision AS nvarchar(50)
AS

SET @DatabaseName = @DatabaseName + '.'
DECLARE @SchemaAndTableName nvarchar(400)
SET @SchemaAndTableName = ISNull(@DatabaseName, '') + @SchemaName + '.' + @TableName -- I only set this value once, mainly for legibility purposes

DECLARE @SQLString nvarchar(max)
SET @SQLString = 'DECLARE @OrderByCode as tinyint, -- pass the outer value like a parameter of sorts
@StDev AS decimal(' + @DecimalPrecision + '),
@Mean AS decimal(' + @DecimalPrecision + ')

-- precalculating these not only makes the code more legible, but is more efficient, because it is a one-time operation
SELECT @StDev = StDev(' + @ColumnName + ') FROM ' + @SchemaAndTableName + '
SELECT @Mean = Avg(' + @ColumnName + ') FROM ' + @SchemaAndTableName + '

--SELECT @StDev -- uncomment these to debug value errors
--SELECT @Mean

SET @OrderByCode = ' + CAST(@OrderByCode AS nvarchar(50)) + '

SELECT ' + @PrimaryKeyName + ', ' + @ColumnName + ', ZScore, DENSE_RANK () OVER (ORDER BY ZScore) AS GroupRank
FROM
       (SELECT ' + @PrimaryKeyName + ', ' + @ColumnName + ', ''ZScore'' = CASE WHEN @StDev = 0 THEN 0
        ELSE (' + @ColumnName + ' - @Mean) / @StDev
        END
        FROM ' + @SchemaAndTableName + '
        GROUP BY ' + @PrimaryKeyName + ', ' + @ColumnName + ') AS T1 -- the purpose of the inner query is to allow us to order by the ZScore

ORDER BY
CASE WHEN @OrderByCode = 1 THEN ' + @PrimaryKeyName + ' END ASC,
CASE WHEN @OrderByCode = 2 THEN ' + @PrimaryKeyName + ' END DESC,
CASE WHEN @OrderByCode = 3 THEN ' + @ColumnName + ' END ASC,
CASE WHEN @OrderByCode = 4 THEN ' + @ColumnName + ' END DESC,
CASE WHEN @OrderByCode = 5 THEN ZScore END ASC,
CASE WHEN @OrderByCode = 6 THEN ZScore END DESC'

--SELECT @SQLString -- uncomment this to debug string errors
EXEC (@SQLString)

Figure 2: Sample Results from Column1 of the HiggsBosonTable

EXEC [Calculations].[ZScoreSP]
             @DatabaseName = N'DataMiningProjects',
             @SchemaName = N'Physics',
             @TableName = N'HiggsBosonTable',
             @ColumnName = N'Column1',
             @PrimaryKeyName = N'ID',
             @OrderByCode = 6,
             @DecimalPrecision = N'38,28'

ZScoreResults

…………As discussed in my last few posts, I'll be using three publicly available practice datasets for my next three or four series of tutorials, beginning in last week's post with a dataset on the Duchennes form of muscular dystrophy provided by Vanderbilt University's Department of Biostatistics. In time, I will probably also perform outlier detection and other data mining calculations on the Voynich Manuscript, an inscrutable medieval tome with an encryption scheme so obscure that no one has been able to crack it for more than five centuries, including the National Security Agency (NSA). The best of the three datasets to stress test this procedure is the data on the Higgs Boson made available by the University of California at Irvine's Machine Learning Repository, given that its single table dwarfs the tiny 9-kilobyte Duchennes table by almost 6 gigabytes. It also consists entirely of numeric data, unlike the transcriptions of the Voynich Manuscript I've imported, which are mainly appropriate for text mining. To date, I have yet to find an explanation of what the 28 float columns actually measure, although there's an outside chance I might be able to interpret an explanation if I find one, given that I understood particle physics too well for my own good back in fourth grade.[iii] Figure 2 depicts a sample query against the first float column in the HiggsBosonTable of my DataMiningProjects database, which includes all three datasets. Note that it's in descending order by Z-Scores. The GroupRank separates identical Z-Score values into distinct groups, through the DENSE_RANK windowing function; feel free to eliminate it from the code if it drags down performance on your databases. I find it handy when running the procedure against tables with small ranges of distinct and frequently duplicated values. This logic may also be enhanced by intrepid programmers to handle bucketing and banding of contiguous but not quite identical values, perhaps using fuzzy sets with graded memberships. The interpretation is not difficult at all: the further away Z-Scores are in either direction from zero, the more likely they are to represent outliers. It only becomes difficult once we compare the results to particular probability distributions, which often expect a certain percentage of their values to occur in specific ranges, which makes the definition of an outlier less arbitrary in that context; for example, the Gaussian or "normal" distribution, i.e. the bell curve, expects about 68 percent of the values to fall within the first standard deviation, 95 percent within the second and 99.7 percent within the third.
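…………If you want to gauge how closely a column hews to those expected percentages, one rough way is to capture the procedure's output with INSERT…EXEC and tally the bands yourself. The sketch below is only an illustration under a couple of assumptions – the temp table is hypothetical and its column types are my guesses at what the result set returns – so adjust it to your own data before trusting the numbers.

-- Rough check of the 68-95-99.7 rule against the procedure's output; the temp table is hypothetical.
CREATE TABLE #ZScoreResults
(ID bigint, Column1 decimal(33,29), ZScore decimal(38,19), GroupRank bigint)

INSERT INTO #ZScoreResults (ID, Column1, ZScore, GroupRank)
EXEC [Calculations].[ZScoreSP]
     @DatabaseName = N'DataMiningProjects', @SchemaName = N'Physics',
     @TableName = N'HiggsBosonTable', @ColumnName = N'Column1',
     @PrimaryKeyName = N'ID', @OrderByCode = 1, @DecimalPrecision = N'38,28'

SELECT 100.0 * SUM(CASE WHEN ABS(ZScore) <= 1 THEN 1 ELSE 0 END) / COUNT(*) AS PercentWithin1StDev,
       100.0 * SUM(CASE WHEN ABS(ZScore) <= 2 THEN 1 ELSE 0 END) / COUNT(*) AS PercentWithin2StDev,
       100.0 * SUM(CASE WHEN ABS(ZScore) <= 3 THEN 1 ELSE 0 END) / COUNT(*) AS PercentWithin3StDev
FROM #ZScoreResults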
…………The procedure took much less time to execute on the 11 million rows (the seven at the tail end are accidental duplicates I've been procrastinating on removing, but they're inconsequential for today's article) of the HiggsBosonTable than I expected, given that my poor beat-up development machine is hardly a match for the servers most DBAs use every day. Unfortunately, I was unable to test it with a columnstore index because I haven't been able to afford to upgrade from SQL Server 2012 Developer Edition to 2014, in which Microsoft lifted many of the restrictions that made it impossible to apply them to many user scenarios. The Column1 I was testing in the HiggsBosonTable has a precision of 33 and a scale of 29, so I naturally received this error when trying to create one: "CREATE INDEX statement failed. A columnstore index cannot include a decimal or numeric data type with a precision greater than 18. Reduce the precision of column 'Column1' to 18 or omit column 'Column1'. (Microsoft SQL Server, Error: 35341)." I was, however, able to reduce the client processing time by adding a regular nonclustered index to Column1. The total execution time was higher, but only because of an increased Wait Time on Server Replies, which was probably due to pressure on the server from other unrelated tasks. If you click on the pictures of the execution plans to enlarge them, you'll see that the index was used when calculating the standard deviation, the average and the Z-Scores alike.
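…………For reference, the nonclustered index in question probably needs no more than a statement along the lines of the sketch below; the index name is merely my own naming convention, not something the procedure depends on.

-- a plain nonclustered index on the column being tested; the name is just a convention
-- (a columnstore index failed here because Column1's precision of 33 exceeds the 18-digit limit)
CREATE NONCLUSTERED INDEX IX_HiggsBosonTable_Column1
ON [Physics].[HiggsBosonTable] ([Column1])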

Figure 3: Client Statistics and Execution Plans for the Z-Score Procedure on the HiggsBosonTable
ZScoreClientStats ZScoreExecutionPlans (1) ZScoreExecutionPlans (2)

…………In my last post, I cited many articles by professionals who gave a long laundry list of use cases, limitations and nuances of interpretation for Benford's Law. I searched for similar materials for Z-Scores but essentially came up empty, perhaps because they're so well-established, well-understood and trivial to calculate. The one clear exception I saw mentioned in the literature is that fat-tailed distributions, in which the data is significantly skewed in one or more directions, may require the use of less outlier-sensitive techniques. This particular outlier identification method is used most often in conjunction with the normal distribution but doesn't have to be; out of all the techniques I'll survey in this series, it is perhaps the most flexible and suited to the widest range of use cases. It is fairly easy to understand and interpret, performs well and isn't dependent on any particular cut-off criteria for defining outliers, unlike many others that force unnecessary all-or-nothing choices. This means we can use it to ask how much of an outlier a record is, which is a much more informative question than simply asking if a record belongs in a single outlier bucket or outside it.
…………In the next installment of this series I'll briefly touch on Iglewicz and Hoaglin's Modified Z-Scores, which are one of many twists that can be applied to Z-Scores depending on the use cases at hand. That material should be fairly light and easy, as should some of the posts that follow on stats like Grubbs' Test and the Tietjen-Moore Test that have deceptively scary names. Interquartile Range and Dixon's Q-Test should also be trivial to handle. I'm not sure yet how difficult it will be to explain and code lesser-known measures like the Generalized Extreme Studentized Deviate (ESD) Test, Cook's Distance, Peirce's Criterion, Chauvenet's Criterion and the Modified Thompson Tau Test. Eventually I'll also be doing a recap of how to use SSDM Clustering for finding aberrant data and writing a post tentatively titled Visual Outlier Detection with Reporting Services, which will be full of eye candy rather than T-SQL and equations like the others. I'm not yet sure what order I'll be tackling them all in, except for Mahalanobis Distance, which is apparently a quite sophisticated and useful method that unfortunately has math of commensurate difficulty. If all goes according to plan, climbing that final hurdle will propel me into a much more difficult but useful series, Information Measurement with SQL Server. We will see references to Z-Scores and formulas like them throughout that series, as well as the remainder of this one.

[i] See Hopkins, Will G., 2001, A New View of Statistics website. The home page for the site is http://www.sportsci.org/resource/stats/index.html.

[ii] Pinal Dave, 2007, "SQL SERVER – CASE Statement in ORDER BY Clause – ORDER BY Using Variable," published July 17, 2007 at the Journey to SQL Authority with Pinal Dave website. Available online at http://blog.sqlauthority.com/2007/07/17/sql-server-case-statement-in-order-by-clause-order-by-using-variable/

[iii] When my family moved out of the home we had in my elementary school days, we never did find the uranium and radium samples that came with my do-it-yourself cloud chamber kit. Oh well, I suppose the new owners either sold it to some North Koreans, or their kids all have seven toes. I suppose I shouldn’t lose hope of finding them again, given that their half-lives were several million years apiece.

Outlier Detection with SQL Server, part 1: Finding Fraud and Folly with Benford’s Law

By Steve Bolton

…………My last blog series, A Rickety Stairway to SQL Server Data Mining, often epitomized a quip by University of Connecticut statistician Daniel T. Larose, to the effect that "data mining is easy to do badly."[1] It is clear that today's sophisticated mining algorithms can still return useful results even when misused by a bumbling amateur like myself, but it is better to understand why they work, which is an area I'm seriously deficient in. In that series of self-tutorials, I was more interested in demonstrating that SSDM could be used by someone who understood it badly, while gleaning what I could about the mechanics and concepts that underpin data mining. I intended to rectify that with a new series tentatively titled Information Measurement with SQL Server, in which I would gradually teach myself how to code dozens of metrics used in information theory and statistics in Transact-SQL (T-SQL), Multidimensional Expressions (MDX) and Visual Basic (VB) Common Language Runtime (CLR) routines. In time, I still intend to publish my misadventures with such sundry metrics as the Kullback-Leibler Divergence, Minimum Description Length, Bayes Factors, the Akaike Information Criterion and Shannon's Entropy. I will probably also stick to the plan of writing a parallel series on Using Other Data Mining Tools with SQL Server, in which I may pass on my inexperience with accessing such alternative mining platforms as WEKA and RapidMiner from Microsoft's database server software. Prior to writing up both series, however, I decided to sneak in a short but potentially useful series on myriad means of detecting outliers using SQL Server, in part because the material is lighter and the equations involved less complicated by several orders of magnitude. That will help me bridge the gap in skills required for the Information Measurement series. All three series could be of real practical use to database administrators (DBAs) and data miners, but outlier detection is simpler both to implement and discuss. In fact, I'll dispense in the very first article with one of the more complex and intriguing methods, Benford's Law, which I will implement with a simple lookup table and some dynamic SQL with windowing functions. In the next eight or nine articles I'll tackle several other means that can be implemented in SQL Server using Reporting Services, SSDM, T-SQL and possibly some CLR stored procedures written in VB.

The Uses and Abuses of Outliers

…………Before discussing any of these detection methods, it is critical to delineate what an outlier is and more importantly, how to respond to them depending on their causes. One of the greatest pitfalls of statistics (and math in general) is that it is quite easy for the numbers to become detached from their underlying logic and meaning; in such cases, the statistical packages and data mining tools will continue to spit out figures, but they will be meaningless, misleading or otherwise downright false. When math goes wrong, it is often at this juncture with logic and reason, which are terribly neglected in every academic field these days; in fact, one of the most productive things anyone in the Western intelligentsia can do today is to learn about logical fallacies and acquire the skill of recognizing them.[2] As I pointed out in the SSDM series and will reiterate in the future, at a deeper level, it is even more important to acquire the skills of recognizing one's own fallacies and the motivations for holding them. In a nutshell, sometimes people don't like the truth, so they apply whatever form of intelligence they're proficient in to find a way around it; this is one of the great banes and most common defects of all human history, one that requires painful self-examination and swallowing of pride. When math, statistics, data mining and kindred topics go wrong, it is often because they have come uncoupled from reason, not because the tools are spitting out bad numbers; the few sentences I've spent here are all out of proportion to the importance of keeping our calculations moored in reason and reality, given that the slightest mistake in this regard can render them useless or counterproductive. For these reasons, it is always wise to take a page from the philosophers of ages past and reduce the wiggle room available to fallacies by carefully defining our terms, without going to the opposite extreme of hair-splitting reductionism and getting lost in jargon. With all that background out of the way, let's define an outlier rigidly but plainly so that anyone can understand it: it is a data point that contrasts with the other data points it is grouped with. How we group them is dependent on the questions we want to ask, which amount to a subjective choice; whether or not a particular data point meets the criteria we've set down for membership in a set or its outliers is an objective thing beyond our control. This distinction between subjectivity and objectivity is frequently confused and sometimes abused by intellectuals these days, but particularly when it comes to the treatment of outliers. Once we've detected them, the potential responses range from doing nothing, to applying alternate outlier-resistant measures like medians, to throwing out the data altogether.[3]
…………There are two broad classes of hazards involved with outliers, one of which is changing the definition of what an outlier is, or altering the criteria for membership in the larger set they are being contrasted with. Either way, it may amount to moving the goalposts, when not done for a plainly justified reason. The second snare is making assumptions about what caused the aberration in these data points, which is sometimes combined with the first. Outliers can of course be useful in detecting data quality problems, systematic errors in data collection and in the case of Benford's Law, fraud. It would be a mistake, however, to simply change the definition of outliers or assign those data points to a different population without ascertaining whether erroneous data of some kind is indeed the culprit. This presents researchers with a subtle temptation to simply throw out data they don't like because it doesn't fit their preconceived notions, by changing categories. Furthermore, because outlier detection is so useful in ferreting out data quality issues, aberrant data points of any kind are often assumed to be "bad." Like gold nuggets in the pan of a prospector, they may actually be positive, depending on the context. To a businessman, they may represent an untapped market; to a health care researcher, they may represent a population that turns out to be immune to a particular disease; to a scientist, they may represent a subgroup of materials more resistant to a particular physical strain of some kind. If you found yourself in a disaster or horror film with a high body count, you wouldn't want to be among the "normal" population, because the norm is to end up dead; only the outliers survive in The Poseidon Adventure and Aliens.
…………Sometimes outliers have no significance at all, due to the luck of the draw. Probability theory only tells us that over time we can normally expect a certain set of events to occur with a given frequency, but it does not guarantee it; I wouldn't bet on your chances of rolling snake eyes a hundred times in a row or getting three straight loner hands in Euchre, but it is a fallacy to believe that it can't happen. Theoretically, all of your data points could be outliers in a given trial of any size, simply by random chance, no matter how infinitesimal it might be. Moreover, a certain number of outliers can always be expected, but that number may vary significantly depending on what the underlying distribution of the data is. In these cases, the presence of many outliers indicates that the dataset is highly skewed and comes from a fat-tailed distribution, or from a mix of one or more distributions.[4] This presents researchers with an even more subtle temptation to continue using popular methods of statistical analysis that are meaningful only with the distribution they have chosen, which more often than not is the Gaussian or "normal" distribution. Researchers in many fields prefer this distribution because there are far fewer statistical tests available for the others, so it is not in their interest to recognize outliers as evidence of some other underlying distribution. In fact, it is frightening to see how rarely goodness-of-fit tests are applied – particularly in medical research, which I had some incomplete training in back around my grad school days. This is part of a larger pattern of unconscionable misuse of statistics that is standard fare in medical research – which is exactly the one field in which it can do the most damage to ordinary people. At least half the research published today contains at least one statistical error, and just one is enough to render any study invalid; this figure does not take into account the logical errors that may accompany incorrect use of statistical methods and also render them useless or misleading.[5] When I brought up this figure to a health care professional I know, she told me that it is standard fare in her field to simply not publish negative findings, which is likewise dishonorable, because it amounts to suppressing all but the desired evidence. Outliers in particular are easy game for academics who are at best negligent in learning how to use their statistical tools, and at worst are hell-bent on reaching particular conclusions. This is particularly true when the presence of outliers indicates that the underlying theory is flawed, or that prohibitively expensive data collection methods are necessary. No scholar ever wants to admit that a theory they've invested years of thought and personal prestige into is wrong, especially when money is at stake and applying more rigorous statistical techniques would require a lot more brain work. Ethical issues of this kind are not uncommon with outlier detection, although SQL Server DBAs are less likely to encounter them if their main use case scenarios are merely looking for data quality problems or doing exploratory data mining, to ferret out possible relationships between variables for further testing.

The Enigma of Benford’s Law

…………Both types of DBAs, however, can make use of Benford's Law, a strikingly different and somewhat mysterious form of outlier detection that can be used by the SQL Server community right off the bat. The gist of it is that the leading digits in many number series often follow a particular distribution, unless there is interference by some non-random process. When this occurs it is often regarded as evidence of intelligent interference, especially when there is something to be gained by it; for this reason, it is most often used to detect fraud, which really amounts to an intentional degradation of data quality. As mathematicians Arno Berger and Theodore P. Hill note, "For instance, the U.S. Internal Revenue Service uses BL to ferret out suspicious tax returns, political scientists use it to identify voter fraud, and engineers to detect altered digital images."[6] One of my favorite uses to date has been to uncover the kind of academic fraud I discussed in the last section, particularly in health care research and the discipline of economics.[7] West Virginia University Prof. Mark Nigrini, a noted expert on Benford's Law, was one of the pioneers in using it to identify financial fraud, which seems to be its most popular use case. Williams College math Prof. Steven J. Miller has published several anecdotes of embezzlement uncovered through Benford's Law and mentions a few less colorful uses that might be of use to DBAs, like "analyzing round-off errors" and "determining the optimal way to store numbers."[8] Another application that might be of interest to DBAs is "computer disk space allocation."[9] It has also been applied to such dry subjects as hydrology.[10] Its more glamorous use in exposing swindlers is what earned Benford's Law a starring role on primetime television in 2006, when it was used by the good guys to catch a thief in an episode of the crime series NUMB3RS.[11]
…………Demonstrating that Benford's Law works is something altogether different than stating definitively why it does. Astronomer Simon Newcomb was the first to notice it when he was flipping through a reference on logarithm tables in 1881 and saw that pages for logs starting with 1 and 2 were far more common than those starting with 8 and 9.[12] Physicist Frank Benford got his name attached instead of Newcomb because he demonstrated in 1938 that it held for "over 20,000 numbers from sources as diverse as Readers' Digest articles, street addresses of American Men of Science, atomic weights, population sizes, drainage rates of rivers, and physical constants."[13] It is actually more of an observation, like Moore's Law, but even more enigmatic in that explanations of the phenomenon to date have fallen short. Some observers link it to scaling laws like Zipf's Law, which is apparently an instance of the general scaling laws generalized by Benoit Mandelbrot (1924–2010)[14] and popularized in chaos theory.[15] Hill is credited with providing the most tenable theory to date, which statistician R.M. Fewster says is a "measure-theoretical proof that data drawn from a random mix of different distributions—rather like Frank Benford's original 20,000—will ultimately converge to Benford's law."[16] Fewster attempts to "provide nothing rigorous, but provide a simple, intuitive explanation of why and when the law applies" that is helpful in finding use cases. Yet even Berger and Hill concurred with Fewster in a recent paper in which they take issue with recent explanations, then conclude with the statement that it "remains mysterious."[17] Anyone interested in getting to the bottom of that mystery can begin at the Benford Online Bibliography at http://www.benfordonline.net/, which is a plentiful source for all things Benford-related.
…………As with other data mining techniques, we can still make use of Benford's Law without understanding exactly why it works – just as bumblebees don't stop flying merely because scientists can't explain how they apparently defy the laws of aerodynamics.[18] To put it to use, we'll first build a lookup table, using the Data Definition Language (DDL) code below. Note that I'm using a bigint identity column for personal reasons and hosting both the table and its associated stored procedure in a Calculations schema, which may not fit your particular use cases. The identity column also starts with a zero base, to represent the digit zero. After the DDL statement, I populate it using constants provided in a couple of articles published by Nigrini.[19]

Figure 1: DDL Code for the BenfordTable
CREATE SCHEMA Calculations AUTHORIZATION dbo;
GO

CREATE TABLE [Calculations].[BenfordTable](
       [ID] [bigint] IDENTITY(0,1) NOT NULL,
       [ExpectedPercentageDigit1] [decimal](5, 5) NULL,
       [ExpectedPercentageDigit2] [decimal](5, 5) NULL,
       [ExpectedPercentageDigit3] [decimal](5, 5) NULL,
       [ExpectedPercentageDigit4] [decimal](5, 5) NULL,

 CONSTRAINT [PK_BenfordTable] PRIMARY KEY CLUSTERED
( [ID] ASC) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE
= OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

GO

INSERT INTO Calculations.BenfordTable
(ExpectedPercentageDigit1, ExpectedPercentageDigit2,
ExpectedPercentageDigit3, ExpectedPercentageDigit4)

VALUES (NULL, 0.11968, 0.10178, 0.10018),
(0.30103, 0.11389, 0.10138, 0.10014),
(0.17609, 0.19882, 0.10097, 0.10010),
(0.12494, 0.10433, 0.10057, 0.10006),
(0.09691, 0.10031, 0.10018, 0.10002),
(0.07918, 0.09668, 0.09979, 0.09998),
(0.06695, 0.09337, 0.09940, 0.09994),
(0.05799, 0.09035, 0.09902, 0.09990),
(0.05115, 0.08757, 0.09864, 0.09986),
(0.04576, 0.08500, 0.09827, 0.09982)
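…………If you'd rather not take those hard-coded constants on faith, the first-digit column can be sanity-checked against the usual formula for Benford's Law, in which the probability of a leading digit d is LOG10(1 + 1/d); the expected frequencies for the second through fourth digits follow a similar formula summed over the possible leading digits. The quick check below should reproduce the ExpectedPercentageDigit1 values above to within rounding.

-- sanity check of the first-digit constants against P(d) = LOG10(1 + 1/d);
-- the VALUES constructor is just a convenient way to generate the digits 1 through 9
SELECT Digit, LOG10(1 + 1.0 / Digit) AS ExpectedPercentageDigit1
FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9)) AS Digits (Digit)
ORDER BY Digit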

Figure 2: Comparison Procedure with Dynamic SQL and Windowing Functions

CREATE PROCEDURE Calculations.BenfordComparisonSP
@SchemaAndTableName as nvarchar(128) = NULL, @ColumnName AS nvarchar(128) = NULL
AS

DECLARE @SQLString nvarchar(max)

SET @SQLString = 'WITH TempCTE
(RecordCount, Digit1, Digit2, Digit3, Digit4)
AS
(SELECT Count(*) OVER (PARTITION BY 1) AS RecordCount,
(' + @ColumnName + ' / CAST(''1'' + REPLICATE(''0'', LEN(CAST(' + @ColumnName + ' AS VARCHAR(255))) - 1) AS bigint)) % 10 AS Digit1,
(' + @ColumnName + ' / CAST(''1'' + REPLICATE(''0'', LEN(CAST(' + @ColumnName + ' AS VARCHAR(255))) - 2) AS bigint)) % 10 AS Digit2,
(' + @ColumnName + ' / CAST(''1'' + REPLICATE(''0'', LEN(CAST(' + @ColumnName + ' AS VARCHAR(255))) - 3) AS bigint)) % 10 AS Digit3,
(' + @ColumnName + ' / CAST(''1'' + REPLICATE(''0'', LEN(CAST(' + @ColumnName + ' AS VARCHAR(255))) - 4) AS bigint)) % 10 AS Digit4
FROM ' + @SchemaAndTableName + '
WHERE ' + @ColumnName + ' IS NOT NULL
),
SecondTempCTE
(RecordCount, Digit, CountForDigit, DigitNumber)
AS
(
SELECT DISTINCT RecordCount, Digit1, Count(Digit1) OVER (PARTITION BY Digit1) AS CountForDigit1, ''CountForDigit1'' AS DigitNumber
FROM TempCTE

UNION

SELECT DISTINCT RecordCount, Digit2, Count(Digit2) OVER (PARTITION BY Digit2) AS CountForDigit2, ''CountForDigit2'' AS DigitNumber
FROM TempCTE

UNION

SELECT DISTINCT RecordCount, Digit3, Count(Digit3) OVER (PARTITION BY Digit3) AS CountForDigit3, ''CountForDigit3'' AS DigitNumber
FROM TempCTE

UNION

SELECT DISTINCT RecordCount, Digit4, Count(Digit4) OVER (PARTITION BY Digit4) AS CountForDigit4, ''CountForDigit4'' AS DigitNumber
FROM TempCTE
)

-- the reason zeroes occur is that there are a handful of zero values
SELECT Digit, CountForDigit1, CountForDigit2, CountForDigit3, CountForDigit4, PercentageDigit1, PercentageDigit2, PercentageDigit3, PercentageDigit4,
PercentageDigit1 - (ExpectedPercentageDigit1 * 100) AS DeviationFromExpectedPercentageDigit1,
PercentageDigit2 - (ExpectedPercentageDigit2 * 100) AS DeviationFromExpectedPercentageDigit2,
PercentageDigit3 - (ExpectedPercentageDigit3 * 100) AS DeviationFromExpectedPercentageDigit3,
PercentageDigit4 - (ExpectedPercentageDigit4 * 100) AS DeviationFromExpectedPercentageDigit4,
-- the Nigrini z-statistics below follow the equations as transcribed; as noted in the text,
-- the results could not be checked against a complete worked example, so verify before relying on them
(PercentageDigit1 - ExpectedPercentageDigit1) - (1 / (2 * RecordCount))
/ Power(((ExpectedPercentageDigit1 * (1 - ExpectedPercentageDigit1)) * RecordCount), 0.5) AS NigriniZStatisticDigit1,
(PercentageDigit2 - ExpectedPercentageDigit2) - (1 / (2 * RecordCount)) / Power(((ExpectedPercentageDigit2 * (1 - ExpectedPercentageDigit2)) * RecordCount), 0.5) AS NigriniZStatisticDigit2,
(PercentageDigit3 - ExpectedPercentageDigit3) - (1 / (2 * RecordCount)) / Power(((ExpectedPercentageDigit3 * (1 - ExpectedPercentageDigit3)) * RecordCount), 0.5) AS NigriniZStatisticDigit3,
(PercentageDigit4 - ExpectedPercentageDigit4) - (1 / (2 * RecordCount)) / Power(((ExpectedPercentageDigit4 * (1 - ExpectedPercentageDigit4)) * RecordCount), 0.5) AS NigriniZStatisticDigit4
FROM
(SELECT RecordCount, Digit, IsNull(CountForDigit1, 0) AS CountForDigit1, IsNull(CountForDigit2, 0) AS CountForDigit2, IsNull(CountForDigit3, 0) AS CountForDigit3, IsNull(CountForDigit4, 0) AS CountForDigit4,
IsNull(CountForDigit1, 0) * CAST(100 AS decimal(38,19)) / (CAST(RecordCount AS decimal(38,19))) AS PercentageDigit1, IsNull(CountForDigit2, 0) * CAST(100 AS decimal(38,19)) / (CAST(RecordCount AS decimal(38,19))) AS PercentageDigit2,
IsNull(CountForDigit3, 0) * CAST(100 AS decimal(38,19)) / (CAST(RecordCount AS decimal(38,19))) AS PercentageDigit3, IsNull(CountForDigit4, 0) * CAST(100 AS decimal(38,19)) / (CAST(RecordCount AS decimal(38,19))) AS PercentageDigit4
FROM SecondTempCTE AS T1
       PIVOT (MAX(CountForDigit) FOR DigitNumber IN (CountForDigit1, CountForDigit2, CountForDigit3, CountForDigit4)) AS pvt) AS T2
       LEFT JOIN Calculations.BenfordTable AS T3
ON Digit = T3.ID
WHERE Digit IS NOT NULL'

EXEC (@SQLString)

…………There's a lot going on in Figure 2, including dynamic SQL, a pivot, some windowing functions and a digit-parsing trick I heavily adapted from Gordon Linoff's example in the StackOverflow thread "Parsing an Integer Field in SQL."[20] At present, it only works on tinyint, smallint, int and bigint fields, because handling data types like floats and numerics that allow decimal places will require some extra logic I haven't implemented yet. I will if I get a request to do so or discover a need to; for now, my purpose is just to demonstrate how Benford's Law might be applied in a SQL Server database. Some cautions are in order here. First, I don't allow spaces in my T-SQL object names, so you'll have to add brackets to the stored procedure manually if you want to run it against tables and columns that have them. Secondly, I also haven't done any optimizations on the procedure at all, or even checked the execution plan yet, so test this out before running it against large tables and views. Third, there are probably a lot of undiscovered logical errors embedded here, since I haven't tested it thoroughly. It's also important to keep in mind that if you're using it against a column that only has one digit, you will naturally get nulls returned for the calculations performed on the second through fourth digits, which I momentarily forgot (causing a moment of panic and the invention of a few brand new curse words). Also be aware that I may not be calculating Nigrini's Z-Statistic correctly, since I wasn't able to find any complete examples online to compare my results against. I derived the code from the equations on the Wikipedia page for Benford's Law[21] and in Durtschi, et al.'s "The Effective Use of Benford's Law to Assist in Detecting Fraud in Accounting Data."[22] The latter has directions on how to interpret it though.
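…………To see the digit-parsing trick in isolation, you can run it against a hard-coded value; the number below is arbitrary, but the expressions are the same ones the procedure builds dynamically for each digit position.

-- isolating the digit-parsing trick on an arbitrary integer: divide by the right power of ten
-- (built up as a string of zeroes), then take the remainder modulo 10
DECLARE @N bigint = 8675309

SELECT (@N / CAST('1' + REPLICATE('0', LEN(CAST(@N AS varchar(255))) - 1) AS bigint)) % 10 AS Digit1, -- 8
       (@N / CAST('1' + REPLICATE('0', LEN(CAST(@N AS varchar(255))) - 2) AS bigint)) % 10 AS Digit2, -- 6
       (@N / CAST('1' + REPLICATE('0', LEN(CAST(@N AS varchar(255))) - 3) AS bigint)) % 10 AS Digit3, -- 7
       (@N / CAST('1' + REPLICATE('0', LEN(CAST(@N AS varchar(255))) - 4) AS bigint)) % 10 AS Digit4  -- 5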
…………If all goes according to plan, however, executing a query like the one in Figure 3 ought to produce a single table with the values in the three tables below, which provide the counts, percentages, deviations and Nigrini Z-stats for the first four digits of that column. Figure 4 does the same for the LactateDehydrogenase column in the same table of the Duchennes database, which is one of three datasets I will be using for practice purposes over the course of my next three or four tutorial series. This tiny nine-kilobyte table of data on a particularly debilitating form of muscular dystrophy is one of many health care research datasets made publicly available through Vanderbilt University’s Department of Biostatistics. In future articles, I will use the much larger Higgs Boson dataset made available through the University of California at Irvine’s Machine Learning Repository and transcriptions of the Voynich Manuscript, an inscrutable medieval tome with an encryption scheme so obscure that even the National Security Agency (NSA) can’t crack it. The problem with using the Voynich Manuscript for this week’s tutorial is that it is the result of an intelligent author rather than a random natural process, one who intentionally kept his words shrouded in mystery and has succeeded for nearly six centuries. This means that it is highly unlikely to follow Benford’s Law. Likewise, the meaning of the seven gigs of float values found in the Higgs Boson dataset is still unclear to me, since I haven’t yet found documentation on its blandly named columns. The significance of the Age column in the Duchennes dataset is self-explanatory though. We can expect significant deviation from Benford’s Law here, given the natural limitations on human longevity and the preponderance of female carriers of childbearing age in the survey population. I deleted the results for the third and fourth digits of this column since there were no carriers in the population over the ages of 100 and 1,000, for obvious reasons. The LactateDehydrogenase column likewise had no values with four digits, so I deleted the corresponding results for that digit for the sake of brevity.

Figure 3: Results of the BenfordComparisonSP on the Age Column of the Duchennes Database
EXEC [Calculations].[BenfordComparisonSP] @SchemaAndTableName = N'Health.DuchennesTable', @ColumnName = N'Age'

BenfordsLawFigure3

Figure 4: Results of the BenfordComparisonSP on the LactateDehydrogenase Column
BenfordsLawFigure4

…………Some of the stats for the Digit 0 are of course null, given that it can't be the first digit of a number. Note that there is an unexpectedly high number of LactateDehydrogenase values that start with 1 and 2, but that the deviations for the rest of the values and digits drop off sharply after that. I knew very little about this enzyme before converting the Duchennes dataset into a SQL Server database, so I figured that it might be a candidate for Benford's Law, but the data shows that it deviates from it almost as much as the Age column. After the fact, I looked up the normal value range at MedlinePlus and sure enough, it runs from 105 to 333 IU/L (international units per liter).[23] This is why we see such high deviations for 1 and 2 on the first digit, but far smaller deviations on the second and third digits, which are not affected much by the natural range limits. In this case, the unusually high number of values starting with 1 and 2 qualify as outliers if we're analyzing them through the prism of a Benford distribution, but it is easy to pinpoint a highly skewed natural distribution as the cause.

Uses and Misuses of Benford’s Law

…………Like any other statistical tool, Benford’s Law has to be used as carefully as a scalpel, because the slightest slip beyond the proper context and mooring in logic can render it useless or misleading. The literature on Benford’s Law is chock full of conditions that can be used by DBAs and .Net programmers to develop use cases though. The article by Durtschi, et al. states for example that “if the mean of a particular set of numbers is larger than the median and the skewness value is positive, the data set likely follows a Benford distribution. It follows that the larger the ratio of the mean divided by the median, the more closely the set will follow Benford’s law.”[24] They also point out a few common sense examples where Benford’s Law doesn’t apply, such as numbers assigned by humans (like operational codes) and numbers with specific minimum and maximum bounds.[25] The Age column discussed above probably falls under the rubric of the latter. Furthermore, they also warn that in regards to detecting financial fraud, “if there are only a few fraudulent transactions, a significant difference will not be triggered even if the total dollar amount is large. Second, statistically, if the account being tested has a large number of transactions, it will take a smaller proportion of inconsistent numbers to trigger a significant difference from expected than it would take if the account had fewer observations.”[26] It is also apparently possible to jump to conclusions in the opposite direction, given that in a case from 2006, the falsified data used in a fraudulent cancer study was found to have nonetheless followed Benford’s Law.[27]  The Wikipedia page also lists some sequences that are likely to fit a Benford distribution, such as Fibonacci numbers, factorials, powers of 2 and “the powers of almost any other number”, and some that are unlikely, like “square roots and reciprocals.”[28]
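…………The mean-to-median rule of thumb quoted above is easy to turn into a quick pre-screen in T-SQL. The sketch below is merely one way of doing so, using the LactateDehydrogenase column from the Duchennes table discussed above; PERCENTILE_CONT requires SQL Server 2012 or later, and a more thorough screen would also include a skewness check, which I've omitted here.

-- rough Benford pre-screen: a mean well above the median (with positive skew) suggests a better fit
SELECT TOP (1)
       AVG(CAST(LactateDehydrogenase AS float)) OVER () AS Mean,
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY LactateDehydrogenase) OVER () AS Median,
       AVG(CAST(LactateDehydrogenase AS float)) OVER ()
          / NULLIF(PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY LactateDehydrogenase) OVER (), 0) AS MeanToMedianRatio
FROM [Health].[DuchennesTable]
WHERE LactateDehydrogenase IS NOT NULL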
…………One of the most well-written resources on the topic is accountant Tommie W. Singleton’s “Understanding and Applying Benford’s Law,” which has succinct, convenient lists of use cases, including “credit card transactions, purchase orders, loan data, customer balances, journal entries, stock prices, accounts payable transactions, inventory prices and customer refunds.”[29] He also lists “examples of data sets that are not likely to be suitable for Benford’s Law include: airline passenger counts per plane; telephone numbers; data sets with 500 or fewer transactions; data generated by formulas (e.g., YYMM#### as an insurance policy number); and Data restricted by a maximum or minimum number (e.g., hourly wage rate).”[30] Large, randomly generated datasets of unassigned numbers without minimum and maximum limits are ideal, particularly those from exponential series of several orders of magnitude.[31] He also writes that “The theory does not hold true for data sets in which digits are predisposed to begin with a limited set of digits. For instance, Benford’s Law will not hold true for data sets of human heights, human weights and intellectual quotient (IQ) scores. Another example would be small insurance claims (e.g., between US $50 and US $100). The theory also does not hold true when a data set covers only one or two orders of magnitude.”
…………In addition, we need to keep in mind that when applying this to subjects like accounting that are human inventions rather than observations of purely physical phenomena, we can’t necessarily expect them to stay static for deep reasons involving mind-blowing topics like free will and determinism. I suspect that is one of the reasons behind the puzzling increase in zeroes over time in the accounting field, which Jialan Wang discusses in depth in an interesting article on Benford’s Law.[32] Keep in mind that fraud detection is based on the principle that deviation from expected random values is often a sign of purposeful interference by a non-random process, i.e. a human intelligence. In the case of the mysterious increase in zeroes in the accounting field, we may be seeing the accidental results of half-conscious interference by human intelligences, who are unwittingly affecting the outcome of Benford’s Law by such gradual methods as routine changes in accounting practices.
To make a long story short, use your head when applying Benford’s Law. That includes using the code provided here, given that I have no professional experience with this and am learning as I go. If you can grasp Benford’s Law, the rest of the series will be a breeze – at least until the final article on a metric called Mahalanobis Distance. That will probably be the most difficult installment and will help me transition into the Information Measurement with SQL Server series, which will include implementations of various other distances and divergences. Between now and then I’ll probably tackle such simpler means of outlier detection as Z-Scores, Grubbs’ Test, the Tietjen-Moore Test, the Generalized Extreme Studentized Deviate (GESD) Test and Cook’s Distance, perhaps along with Interquartile Range, Dixon’s Q-Test, Peirce’s Criterion, Chauvenet’s Criterion and the Modified Thompson Tau test. I’m not certain of the order I’ll tackle them in yet or whether or not some of the topics might be combined, given that I haven’t coded them yet. The calculations and logic for some of these are a lot easier on the brain than their dry-sounding names, while others like Z-Scores are commonly known statistics. Along the way I will probably also write an article on Visual Outlier Detection with Reporting Services using the various graphs and charts RS provides, as well as touch on the SSDM Clustering algorithm again. As with the Rickety Stairway series on SSDM, I’m bound to make mistakes, but will probably be able to provide some benefit to other DBAs and data miners as I learn through misadventure. Due to my inexperience, I expect many of my articles to qualify as outliers in and of themselves, but by the end of the series, I aim to acquire the skills to recognize them as such, quickly and accurately.


[1] p. xii, LaRose, Daniel T., 2005, Discovering Knowledge in Data: An Introduction to Data Mining. Wiley-Interscience: Hoboken, N.J.

[2] Which I have yet to do.

[3] ASTM International (formerly known as the as the American Society for Testing and Materials), which “publishes voluntary consensus technical standards for a wide range of materials, products, systems, and services” sells a guide called Standard Practice for Dealing With Outlying Observations, if you’ve got $48 to spare for a .pdf copy (the hard cover’s the same price). That isn’t happening on my budget, but there’s the link if you want a more professional guide to the topic. I originally found out about the ASTM publication from the Wikipedia page “Outlier” at http://en.wikipedia.org/wiki/Outlier.

[4] IBID. To make sure I didn't leave anything out, I consulted the Wikipedia page "Outlier" at http://en.wikipedia.org/wiki/Outlier, which is where I learned that mixed distributions can also be problematic.

[5] For a quick gloss on the topic of the misuse of stats in health care, see the Influential Points.com webpage "Statistical Errors in Research: Use and Misuse of Statistics in Biology," at http://influentialpoints.com/Training/Statistical_Errors_in_Research_Use_and_misuse_of_statistics_in_biology.htm. Earlier this year another meta-analysis was released showing that half of the journal articles surveyed contained statistical errors, but I wasn't immediately able to find a reference to that particular one. The webpage above is sufficient to help someone interested in further reading get started.

[6] Berger, Arno and Hill, Theodore P., 2011, “Benford’s Law Strikes Back: No Simple Explanation in Sight for Mathematical Gem,” published by Springer Science Business Media, LLC, Vol. 33, No. 1. Available online at http://people.math.gatech.edu/~hill/publications/PAPER%20PDFS/BenfordsLawStrikesBack2011.pdf.

[7] Tödter, Karl-Heinz, 2009, "Benford's Law as an Indicator of Fraud in Economics," pp. 339-351 in German Economic Review, August 2009. Vol. 10, No. 3. I think I merely read the abstract of this one, or references to it, but the important thing for my point is to know that the article exists.

[8] Miller, Steven J., 2008, “Benford’s Law and Fraud Detection, or Why the IRS Should Care About Number Theory!” Published as an outline at the Bronfman Science Lunch, Oct. 21, 2008, at Williams College. Available online at http://web.williams.edu/Mathematics/sjmiller/public_html/math/talks/Benford_Williams20min.pdf.

[9] See Fewster, R.M., 2009, “A Simple Explanation of Benford’s Law,” pp. 26-32 in The American Statistician, February 2009, Vol. 63, No. 1. Available online at https://www.stat.auckland.ac.nz/~fewster/RFewster_Benford.pdf.

[10] What good is a blog post on stats without a little “dry humor?”

[11] See the webpage “Benford’s Law” at the Wolfram Mathworld website, published by Eric W. Weisstein.

[12] See Bogomolny, Alexander, 2014, “Benford’s Law and Zipf’s Law,” published at the Interactive Mathematics Miscellany and Puzzles website. Available online at http://www.cut-the-knot.org/do_you_know/zipfLaw.shtml. The tale is related in more depth in Fewster, p. 26.

[13] p. 26, Fewster.

[14] See Bogomolny.

[15] For a readable introduction to that topic for non-mathematicians, I highly recommend, Gleick, James, 1988, Chaos: Making a New Science. Penguin: New York.

[16] p. 27, Fewster.

[17] p. 90, Berger and Hill.

[18] At least according to a famous quip by Igor Sikorsky (1889–1972), one of the inventors of the helicopter.

[19] I originally used Weisstein’s figures, until I found a more detailed table in Nigrini, Mark J. and Mittermaier, Linda J., 1997, “The Use of Benford’s Law as an Aid in Analytical Procedures,” pp. 52-67 in Auditing: A Journal of Practice and Theory, Fall 1997. Vol. 16, No. 2. Reprinted at the University of North Carolina at Charlotte’s Belk College of Business web address http://belkcollegeofbusiness.uncc.edu/cwiggins/acct6210/Lecture%20Notes/Nigrini%20article%20pdf1302507.pdf. This work in turn cites Nigrini, M. J. 1996. “Taxpayer Compliance Application of Benford’s Law,” pp. 72-92, Journal of the American Taxation Association. Vol. 18, No. 1.

[20] See Linoff, Gordon, 2013, response to the thread “Parsing an Integer Field in SQL,” published Jan. 14, 2013 at the StackOverflow.com website. Available online at http://stackoverflow.com/questions/14319107/parsing-an-integer-field-in-sql.

[21] See the Wikipedia page “Benford’s Law,” available at http://en.wikipedia.org/wiki/Benford’s_law.

[22] pp. 25-26, Durtschi, Cindy; Hillison, William and Carl Pacini, 2004, “The Effective Use of Benford’s Law to Assist in Detecting Fraud in Accounting Data”, pp. 17-33 in Journal of Forensic Accounting, Vol. 5. Available online at the University of South Florida St. Petersburg webpage http://faculty.usfsp.edu/gkearns/Articles_Fraud/Benford%20Analysis%20Article.pdf

[23] See the MedlinePlus webpage “Lactate Dehydrogenase Test” webpage available at http://www.nlm.nih.gov/medlineplus/ency/article/003471.htm.

[24] p. 24, Durtschi, et al. They cite “Wallace (2002).”

[25] IBID., pp. 23-24.

[26] IBID., p. 26.

[27] Diekmann, Andreas and Jann, Ben, 2010, ‘Benford’s Law and Fraud Detection: Facts and Legends,” pp. 397–401 in German Economic Review. Vol. 11, No. 3. Available online at http://www.socio.ethz.ch/socio/news/docs/Diekmann_Jann_GER_2010_1.pdf.

[28] See the Wikipedia page “Benford’s Law,” available at http://en.wikipedia.org/wiki/Benford’s_law.

[29] Singleton, Tommie W., “Understanding and Applying Benford’s Law,” undated article published at the ISACA website and retrieved Aug. 11, 2014. Available online at http://www.isaca.org/Journal/Past-Issues/2011/Volume-3/Pages/Understanding-and-Applying-Benfords-Law.aspx.

[30] IBID. I added the semi-colons myself in order to convert the list into a sentence, which I hope is permissible.

[31] IBID.

[32] Wang, Jialan, 2011, “Benford’s Law: A Revised Analysis,”  published Oct. 24, 2011 at the Studies in Everyday Life: Investigations and Experiments on Food, Ethics, Economics, and the Environment website. Available online at http://econerdfood.blogspot.com/2011/10/benfords-law-revised-analysis.html

 

Stay Tuned…for a SQL Server Tutorial Series Juggling Act

by Steve Bolton

…………If all goes according to plan, my blog will return in a few weeks with two brand new series, Using Other Data Mining Tools with SQL Server and Information Measurement with SQL Server. Yes, I will be attempting what amounts to a circus act among SQL Server bloggers, maintaining two separate tutorial series at the same time. Like my series A Rickety Stairway to SQL Server Data Mining, it might be more apt to call them mistutorials; once again I’ll be tackling subject matter I’m not yet qualified to write about, in the hopes of learning as I go. The first series will compare and contrast the capabilities of SSDM against other tools in the market, with the proviso that it is possible to run them on data stored in SQL Server. That eliminates a lot of applications that run only on Linux or otherwise cannot access SQL Server databases right off the bat. The software packages I survey may include such recognizable names as RapidMiner and WEKA, plus a few I promised others to review long ago, like Autobox and Predixion Software. The latter is the brainchild of Jamie MacLennan and Bogdan Crivat, two former members of Microsoft’s Data Mining Team who were instrumental in the development of SSDM.
…………The second series of amateur tutorials may be updated more frequently because those posts won’t require lengthy periods of time to evaluate unfamiliar software. This will expand my minuscule knowledge of data mining in a different direction, by figuring out how to code some of the building blocks used routinely in the field, like Shannon’s Entropy, Bayes factors and the Akaike Information Criterion. Not only can such metrics be used in the development of new mining algorithms, but they can be applied out-of-the-box to answer myriad basic questions about the type of information stored in our SQL Server databases – such as how random, complex, ordered and compressed it might be. No mining company would ever excavate a new quarry without first performing a geological survey of what precious metals might be beneath the surface; likewise, it may be helpful for data miners to have a surface impression of how much useful information might be stored in our tables and cubes, before digging into them with sophisticated algorithms that can have formidable performance costs. Scores of such measures are scattered throughout the mathematical literature that underpins data mining applications, so it may take quite a while to slog through them all; while haphazardly researching the topic, I ran across a couple of quite interesting measures of information that seem to have been forgotten. I will try to make these exercises useful to SQL Server users by providing T-SQL, MDX, Common Language Runtime (CLR) functions in Visual Basic and perhaps even short SSDM plug-in algorithms, as well as use cases for when they might be appropriate. To illustrate these uses, I may test them on freely available databases of interest to me, like the Higgs Boson dataset provided by the University of California at Irvine’s Machine Learning Repository. I may also make use of a tiny 9 kilobyte dataset on Duchenne’s form of muscular dystrophy, which Vanderbilt University’s Department of Biostatistics has made publicly available, and transcriptions of the Voynich Manuscript, an enigmatic medieval manuscript with an encryption scheme so obscure that even the National Security Agency (NSA) can’t crack it. Both tutorial series will use the same datasets in order to cut down on overhead and make it easier for readers to follow both. When and if I manage to complete both series, the next distant item on this blog’s roadmap will be a tutorial series on how to use various types of neural nets with SQL Server, which is a topic I had some intriguing experiences with many years ago.

A Rickety Stairway to SQL Server Data Mining, Part 15, The Grand Finale: Custom Data Mining Viewers


By Steve Bolton

…………As mentioned previously in this amateur self-tutorial series on the most neglected component of Microsoft’s leading database server software, SQL Server Data Mining (SSDM) can be extended through many means, such as Analysis Services stored procedures, CLR functionality, custom mining functions and plug-in algorithms. I had originally planned to close out this series with at least two separate articles on the subject of writing custom means of data visualization, but these plug-in viewers turned out to be simpler to implement than I expected. Deployment is naturally the most difficult aspect, but the code behind custom data mining viewers is actually much less challenging than that for plug-in algorithms, which we already discussed in depth in previous articles. They’re by no means trivial, but programmers familiar with the .Net framework shouldn’t find them too difficult, especially if they’re already experienced with the Windows Forms user interface elements that were popular in the second half of the last decade. Throughout this series I have tried to provide Visual Basic code samples because most of the SSDM tutorials available on the Web are written in C#, the other major .Net language. Likewise, I originally intended to write a viewer using a more modern Windows Presentation Foundation (WPF) architecture, since all of the tutorials I’m aware of are for Windows Forms. Even though it would be child’s play to use a WPF datagrid in a WPF container control in Windows Forms, I decided against it in order to simplify the instructions and avoid another layer of controls to debug. I also decided to adhere to the Keep It Simple Stupid (KISS) design principle by writing a viewer that performed a very basic task with easily decipherable output; in this case, we’ll simply be using a datagrid to display the output of a simple Data Mining Expressions (DMX) query, which selects some familiar metadata from the PluginModel.CONTENT we’ve used throughout this series, like ATTRIBUTE_NAME, MSOLAP_NODE_SCORE and 17 other columns.
…………Despite the fact that it was written in August of 2004, nearly a decade ago, the original tutorial provided by Microsoft for plug-in viewers, by Jesper Lind and Scott Oveson, is still quite valuable. There is no custom viewer tutorial in the SDK and the only other resource I know of on the topic is a four-page discussion in Data Mining with Microsoft SQL Server 2008, the indispensable book by former members of Microsoft’s Data Mining Team.[i] In order to make use of the Lind-Oveson tutorial, however, you’ll need to download the original SQL Server 2005 samples for plug-in algorithms and custom viewers from the respective links at Codeplex. As I did with Bogdan Crivat’s sample code from the software development kit (SDK) for custom algorithms, I’ll start off with their codebase, but it will look nothing like the original after I get done converting the C# to Visual Basic, renaming the variables to my liking and repurposing the methods to create a completely different kind of viewer (and perhaps mangling it beyond recognition in the course of adding my own bugs). Most of the code is fairly straightforward, so there are only so many ways you can write it anyway, although there are some methods like LoadViewerData that provide more wiggle room for creativity.
…………Writing a viewer is not really much harder than using ADOMD in an ordinary Windows Forms or WPF application. The only substantial difference is that we’ll compile it in a class library that can run inside Visual Studio or SQL Server Management Studio (SSMS), like other data viewers do. The process is a little counterintuitive at first, in that you begin by creating a .Net class project, but instead of using a class object (the one automatically generated by Visual Studio can be deleted), you use Add…New Item from the context menu to create a Windows Forms User Control class. This is not a tutorial on basic Visual Studio functionality, so to avoid the common affliction of “tutorial clutter” I’ll leave out the details on how to work with user controls, just as I have left the common Windows Forms code and user interface declarations out of the code in Figure 1. Next you’ll need to add a reference to Microsoft.DataWarehouse.Interfaces, which should be listed on the .Net tab of Visual Studio’s Add Reference dialog once you right-click the class and select Add Reference… on the context menu. The ViewerUserControl class also Inherits System.Windows.Forms.UserControl and Implements IMiningModelViewerControl2, an upgrade available in SQL Server 2008 from the original plug-in viewer interface, which introduces further slight differences between my class and Lind and Oveson’s example. One of the crucial mistakes I made at this juncture was failing to coordinate my assembly, class name and namespace, which led to a lot of weeping, wailing and gnashing of teeth when I tried to register them during deployment. The winning combination for me was “PluginViewer.PluginViewer.ViewerUserControl.”

Figure 1: Partial Code for the ViewerUserControl Class
Imports System.Collections
Imports System.ComponentModel
Imports System.Drawing
Imports System.Data
Imports System.Data.OleDb
Imports System.Windows.Forms
Imports Microsoft.DataWarehouse.Interfaces
Imports Microsoft.AnalysisServices.AdomdClient 

Namespace PluginViewer
    Public Class ViewerUserControl
        Inherits System.Windows.Forms.UserControl
        Implements IMiningModelViewerControl 

        Private ClassConnectionString As String = String.Empty
        Private ClassMiningModelName As String = String.Empty
        Private ClassServiceProvider As IServiceProvider
        Private ClassConnection As System.Data.IDbConnection
        Private ClassCommand As New Microsoft.AnalysisServices.AdomdClient.AdomdCommand
        Private ClassAdapter As New Microsoft.AnalysisServices.AdomdClient.AdomdDataAdapter

        Private Sub ViewerUserControl_Load(sender As System.Object, e As System.EventArgs)
            'this procedure was automatically added by Visual Studio, not in the original sample code
        End Sub

        Public Shared ReadOnly Property DisplayName() As String
            Get
                Return "My PluginViewer"
            End Get
        End Property

        Public Property ServiceProvider() As IServiceProvider Implements IMiningModelViewerControl.ServiceProvider
            Get
                Return Me.ClassServiceProvider
            End Get
            Set(Value As IServiceProvider)
                If Me.ClassServiceProvider IsNot Value Then
                    Me.ClassServiceProvider = Value
                End If
            End Set
        End Property

        Public Property ConnectionString() As String Implements IMiningModelViewerControl.ConnectionString
            Get
                Return Me.ClassConnectionString
            End Get
            Set(value As String)
                If Me.ClassConnectionString <> value Then
                    Me.ClassConnectionString = value
                End If
            End Set
        End Property

        Public Property MiningModelName() As String Implements IMiningModelViewerControl.MiningModelName
            Get
                Return Me.ClassMiningModelName
            End Get
            Set(value As String)
                If Me.ClassMiningModelName <> value Then
                    Me.ClassMiningModelName = value
                End If
            End Set
        End Property

        Public Sub ViewerActivated(IsActivated As Boolean) Implements IMiningModelViewerControl.ViewerActivated

        End Sub

        Public Function LoadViewerData(LocalContext As Object) As Boolean Implements IMiningModelViewerControl.LoadViewerData
            ClassConnection = New AdomdConnection(Me.ConnectionString)
            ClassConnection.Open()

            'I'm going to use a Data Adapter instead of an ADOMD command
            ClassCommand = CType(ClassConnection.CreateCommand(), AdomdCommand)
            ClassCommand.CommandText = String.Format("SELECT MODEL_CATALOG, MODEL_NAME, ATTRIBUTE_NAME, NODE_NAME, NODE_UNIQUE_NAME, " _
                & "NODE_TYPE, NODE_DESCRIPTION, NODE_CAPTION, NODE_RULE, MARGINAL_RULE, " _
                & "[CHILDREN_CARDINALITY], [PARENT_UNIQUE_NAME], NODE_SUPPORT, " _
                & "NODE_PROBABILITY, MARGINAL_PROBABILITY, NODE_DISTRIBUTION, " _
                & "MSOLAP_MODEL_COLUMN, MSOLAP_NODE_SCORE, MSOLAP_NODE_SHORT_CAPTION " _
                & "FROM [{0}].CONTENT", Me.MiningModelName)

            ClassAdapter.SelectCommand = ClassCommand
            'PluginViewerDataset1 is the dataset added in the designer and bound to the datagrid
            ClassAdapter.Fill(Me.PluginViewerDataset1, "PluginViewerDataTable")
            ClassConnection.Close()

            Return True
        End Function
    End Class
End Namespace

…………Most of the code above consists of simple Get and Set statements for viewer properties, like DisplayName, ServiceProvider, ConnectionString and MiningModelName. The DisplayName is a description of your choice which will be displayed in the Viewer dropdown box in SSDM projects (which does not necessarily have to be the class name), while the other properties are mostly taken care of by SQL Server. In my class, these property statements are used to populate the class-scoped ClassConnectionString, ClassMiningModelName and ClassServiceProvider variables. The trickiest part of this for me was using the right variable types in my Implements statements. The IMiningModelViewerControl2 interface also has a property that wasn’t in the previous interface version, DBConnection, of type System.Data.IDbConnection; I’m unfamiliar with it and the code works fine as is, so I’ll omit discussion of it.
…………When loading a viewer, the ViewerActivated function will also be called; during testing I wrote messages to the Event log in the procedure just for tracking purposes, but I’ve left it blank here in order to simplify the tutorial. Suffice it to say that you can include your own initialization code here for events that occur after the viewer is activated. Lind and Oveson state that “If the viewer has another window open such as a legend, it may want to bring that window to the user’s attention when the view is activated (that is, clicked on, or selected using the tabs in the viewer or by selecting it using the dropdown box).” The meat and potatoes are found in the LoadViewerData routine, where the connection is opened and a command is issued to load the data. Much of the action takes place off-screen in my example because I used a data adapter directly bound to a Windows Forms datagrid, which hides a lot of the code-behind from view. I took a cue from A Rickety Stairway to SQL Server Data Mining, Part 12: Accessing SSDM with AMO, ADOMD, SSIS, XMLA and RS  and used an ADO.Net dataset rather than the old OLEDB interface in the Lind-Oveson tutorial, or the ActiveX Data Objects Multi-Dimensional for .Net (ADOMD.Net) objects cited in MacLennan et al. If you’re using ADOMD, you may have to download and install the right package for the version of SQL Server you’re working with, which for SQL Server 2012 are the SQL_AS_AMO.msi and SQL_AS_ADOMD.msi installers on the SQL Server 2012 Feature Pack webpage. MacLennan et al. also use a DataReader object and Lind and Oveson use an OLEDB command to access the data in their respective tutorials, so I opted for using an ADO command object just to add a little contrast; the DM Team’s book also discusses some nuances about the proper use of connection objects in LoadViewerData, which may be helpful in circumstances that I didn’t encounter in my brief foray into writing viewers.[ii] The most helpful suggestion I can provide for this procedure is to test out any DMX query you call here in a SQL Server Management Studio (SSMS) window first, because it’s much more cumbersome to troubleshoot incorrect syntax from within tests of a custom viewer. Also keep in mind that if you execute a simple metadata query like mine, the PARENT_UNIQUE_NAME and CHILDREN_CARDINALITY column names must be in brackets. I don’t know the reason for this strange limitation, but it’s a consistent syntax requirement of any DMX query.
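…………For instance, here is a pared-down version of the content query from Figure 1 that can be pasted into an SSMS DMX window as a sanity check before wiring it into LoadViewerData; the column list is just a subset of the one used above, with the two troublesome columns bracketed:

SELECT MODEL_NAME, ATTRIBUTE_NAME, NODE_CAPTION, NODE_SUPPORT,
[PARENT_UNIQUE_NAME], [CHILDREN_CARDINALITY], MSOLAP_NODE_SCORE
FROM [PluginModel].CONTENT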
…………The most challenging aspects of writing a plug-in viewer are the setup and testing, but the process is not much different than the one described for custom algorithms in A Rickety Stairway to SQL Server Data Mining, Part 14.3: Debugging and Deployment. In some ways it is actually much easier, because we don’t need to reference a wrapper like DMPluginWrapper.dll and coordinate such things as .Net and bit versions between it and our classes. We can also repeat some of the procedures discussed there, such as creating a Strong Name Key File (.snk) for our project; in this case, however, we will need to find the public key token and include it in the registry. It is possible to run sn.exe from a command prompt to extract the public key token, but I added a command to the Visual Studio tools menu in order to reuse it in future projects, by following the directions at the MSDN posts Getting Public Key Token of Assembly Within Visual Studio, Visual Studio Tip: Get Public Key Token for a Strong Named Assembly and Visual Studio 2010 Get Public Key Token. The gist of it is that you assign the Title of the command as it will appear in your Tools menu, then the full path to the command (which in my case was C:\Program Files\Microsoft SDKs\Windows\v7.0A\bin\NETFX 4.0 Tools\sn.exe) and -Tp $(TargetPath) in the Arguments section. The Use Output Window checkbox should also be selected. Some of the sources I read said to use quote marks around $(TargetPath) but that produced an error message saying that the file could not be read because it was missing. Once the command completes successfully, copy the 16-digit code after the phrase “Public key token,” which in my case was “b51c4a48a2da4bbd.” To register the viewer, we will have to create a registry key with the name of our viewer and four entries within that key. The Algorithms entry must be set to a value equal to the name of the plug-in algorithm we will use it with, which in the case of the practice project we’ve been using in the Rickety series would be Internal_Name_For_My_Algorithm. We also need to create an entry named Assembly with our namespace, followed by a comma and the public key value we just extracted. In this case, the registry entry value would be “PluginViewer, PublicKeyToken=b51c4a48a2da4bbd” sans the quote marks. It is also necessary to create a Class entry set to our project’s namespace plus the name of the class, which corresponds to “PluginViewer.ViewerUserControl,” and a DisplayName entry with the string you want to appear in the Viewer dropdown on the Mining Model Viewer tab.
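…………For reference, the same token can also be extracted at an ordinary command prompt by pointing sn.exe at the compiled assembly; the path below is simply the one from my machine, so adjust it to wherever your copy of the tool lives:

"C:\Program Files\Microsoft SDKs\Windows\v7.0A\bin\NETFX 4.0 Tools\sn.exe" -Tp PluginViewer.dll

The output ends with a line along the lines of “Public key token is b51c4a48a2da4bbd,” which is the value we need for the registry.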
…………The tricky part is adding the subkey to the right registry keys. The Lind-Oveson tutorial says that it must be added to HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\90\Tools\BIShell\Packages\{4a0c6509-bf90-43da-abee-0aba3a8527f1}\Settings\Designers\DataMiningViewers to make a custom viewer work with Visual Studio and HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\90\Tools\Shell\Packages\{4a0c6509-bf90-43da-abee-0aba3a8527f1}\Settings\Designers\DataMiningViewers for SQL Server Management Studio (SSMS). That tutorial was designed with SQL Server 2005 and Visual Studio 2005 in mind though. On my development machine, I had to install them to the HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Microsoft\VisualStudio\10.0\Packages\{4a0c6509-bf90-43da-abee-0aba3a8527f1}\Settings\Designers\DataMiningViewers key for Visual Studio 2010 and HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Microsoft\Microsoft SQL Server\110\Tools\Shell\Packages\{4a0c6509-bf90-43da-abee-0aba3a8527f1}\Settings\Designers\DataMiningViewers for the SQL Server 2012 version of SSMS. There are also DataMiningViewers keys in HKEY_CURRENT_USER, but these are unimportant for our purpose. Note that the GUIDs are the same as in the Lind-Oveson tutorial, but that there are some subtle differences, such as the SQL Server version number (90 vs. 110), the name of the tool (“BIShell” vs. Visual Studio 10) and the fact that in my case, they occurred under the Wow6432Node (which is mystifying, because this key is for 32-bit compatibility, but I’m running Analysis Services 64-bit on a 64-bit version of Windows). There is probably some shortcut to determining the key locations that would make writing setup programs for custom mining viewers much easier, but so far I’ve merely done a search for the DataMiningViewers key in Regedit. Once that determination is made, you can add the registry entries in whatever way you prefer, such as reg.exe at a command prompt, merging .reg files or editing the registry manually with Regedit. In order to familiarize myself with the process of writing setup programs for custom viewers, I opted to add a deployment project to the MonitorMiningSolution that all of the projects in this series belong to. In Figure 3 you can see an example of my registry settings; note that I used the Software key instead of explicitly specifying the Wow6432Node, which will assign it the latter through reflection when the 32-bit registry is viewed in the 64-bit version of Regedit.
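…………To make those four entries concrete, a minimal .reg sketch for the SSMS 2012 key on my machine would look something like the following; treat the key path, the subkey name and the DisplayName string as assumptions that need to be adjusted to your own environment:

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Microsoft\Microsoft SQL Server\110\Tools\Shell\Packages\{4a0c6509-bf90-43da-abee-0aba3a8527f1}\Settings\Designers\DataMiningViewers\PluginViewer]
"Algorithms"="Internal_Name_For_My_Algorithm"
"Assembly"="PluginViewer, PublicKeyToken=b51c4a48a2da4bbd"
"Class"="PluginViewer.ViewerUserControl"
"DisplayName"="My PluginViewer"

An equivalent key under the Visual Studio 2010 path listed above would make the viewer available in Visual Studio as well.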
…………This is not a lesson in how to work with Visual Studio 2010 deployment projects, for which there are certainly many freely available tutorials; the aim here is to point out nuances to it that are relevant to the writers of custom mining viewers. One thing such developers need to be aware of is that the Lind-Oveson tutorial advises to copy our PluginViewer.dll file to the same location as the out-of-the-box viewers, which in the case of SQL Server 2005 is apparently C:\Program Files\Microsoft SQL Server\90\Tools\Binn\VSShell\Common7\IDE\. I can’t find any path named “VSShell” on my machine and no folder that would be roughly equivalent, even after substituting the 110 version for 90 and using a Program Files (x86) location. I did, however, find Microsoft.AnalysisServices.Viewers.dll in two locations in a hard drive search, C:\Program Files (x86)\Microsoft SQL Server\110\Tools\Binn\ManagementStudio and C:\Program Files (x86)\Microsoft Visual Studio 10.0\Common7\IDE\PrivateAssemblies. After moving the DLL to the latter folder and restarting Visual Studio the viewer loaded correctly. The Lind-Oveson tutorial also advises to use the gacutil utility to register the assembly, like so: gacutil /i PluginViewer.dll, which should add it to the Global Assembly Cache (GAC). Figures 2 and 3 depict the settings I used for the deployment project in the Visual Studio 2010 File System Editor and Registry Editor.

Figure 2: Visual Studio 2010 File System Editor Example for the PluginViewer Project
[Image: FileSystemEditorOutput]

Figure 3: Visual Studio 2010 Registry Editor Example for the PluginViewer Project
[Image: RegistryEditorOutput]

…………The directions Lind and Oveson give for associating a particular algorithm with a new viewer seem to be applicable only to COM projects written in C++. These days, plug-in developers are probably going to use the .Net platform exclusively, so my segment of the Rickety series on custom algorithms was strictly limited to that architecture. My sample code was written in Visual Basic, whereas other SSDM tutorial writers have opted for C#, but both compile down to the same .Net intermediate language. The solution they provide involves adding a variable to the GetViewerType method of a C++ file that doesn’t even exist in the .Net wrapper DMPluginWrapper.dll we had to compile back in A Rickety Stairway to SQL Server Data Mining, Part 14.1: An Introduction to Plug-In Algorithms, so that we could reference the plug-in architecture in the .dll of our custom algorithm. As discussed in A Rickety Stairway to SQL Server Data Mining, Part 14.2: Writing a Bare Bones Plugin Algorithm, the GetViewerType method of AlgorithmMetadataBase is ultimately responsible for associating a custom algorithm with specific viewers. Normally you’d return at least one of the seven values of the MiningViewerType class, but since the return value type is just a string, we can substitute any value we want – including the name of our PluginViewer, sans the UserControl portion of the class name. Simply edit it, then recompile that class and perform all of the other steps discussed in Post 14.3 for redeploying a custom algorithm.[iii] If redeployment of the algorithm is successful, the new data mining viewer should be listed in the VIEWER_TYPE column of the DMSCHEMA_MINING_SERVICES rowset in the same row as the custom algorithm. As Lind and Oveson mention, you can associate multiple viewers with an algorithm by separating their names with semi-colons in the algorithm code, which in our case would be in the string returned by AlgorithmMetadataBase.GetViewerType. For example, we could specify our custom mining viewer in GetViewerType together with one of the enumeration values for the standard viewers like so: Return “PluginViewer;” + MiningViewerType.MicrosoftCluster. Since PluginViewer is listed first, it would appear first in the VIEWER_TYPE column and at the top of the dropdown menus in SSMS and Visual Studio. I’m not aware of any means of enabling one of the nine out-of-the-box algorithms we discussed much earlier in this series to use a custom viewer (even by editing the msmdsrv.ini file by hand), but that does not mean one does not already exist or cannot be invented.
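…………Here is a minimal sketch of what that change might look like in the AlgorithmMetadataBase class of the sample plug-in; the exact signature may vary with the version of the wrapper you compiled, so treat this as an illustration rather than a drop-in override:

Public Overrides Function GetViewerType() As String
    'list our custom viewer first so that it appears at the top of the Viewer dropdown,
    'followed by one of the standard viewers as a fallback
    Return "PluginViewer;" + MiningViewerType.MicrosoftCluster
End Function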
…………Be aware that when a custom viewer fails to load, no messages are recorded in msmdsrv.log or the event logs. If an error occurs too early in the loading process, the viewer simply won’t be listed in the VIEWER_TYPE column of the DMSCHEMA_MINING_SERVICES rowset, but if it occurs late enough you may see a non-specific message like “An error occurred attempting to create the viewer MyCustomMiningViewerName.” So far, I have been unable to debug custom mining viewers in Visual Studio because I invariably receive messages on every breakpoint that “The breakpoint will not currently be hit. No symbols have been loaded for this document.” This is true whether I attach to msmdsrv.exe or to the SSMS process that the viewer is supposed to run in. Perhaps there is a more straightforward way of doing this by debugging through the same means that are used with ordinary Windows Forms user controls, which I am more than a little rusty on. The path of least resistance for me was to simply write messages to the Event log that tracked exceptions and the routines they occurred in, which helped expose the call stack. From what I can gather, the viewer loading sequence begins with DisplayName and is followed by calls to InitializeComponent, Set ServiceProvider, Set ConnectionString, Set MiningModelName, LoadViewerData, Get ConnectionString and Get MiningModelName, in that order. Once I debugged my viewer sufficiently well to get through this sequence, I was able to visualize the PluginModel’s metadata in a Windows Forms datagrid, as depicted in the figure below.
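…………The “tracing” amounted to a few throwaway calls to a helper like the hypothetical one below, sprinkled into the property setters and LoadViewerData; the source name is an arbitrary assumption and the Try…Catch is only there so that a logging failure can never break the viewer itself:

Private Sub TraceViewerStep(StepName As String)
    Try
        'write a breadcrumb to the Application event log while the viewer loads
        System.Diagnostics.EventLog.WriteEntry("PluginViewer", StepName)
    Catch ex As Exception
        'swallow logging failures so the tracing never interferes with viewer loading
    End Try
End Sub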

Figure 4: The PluginViewer in Use in a Visual Studio Data Mining Project
[Image: PluginViewerOutput]

…………Now that we’ve gotten through the most difficult aspects of custom algorithms and viewers, we can touch on the subject of GetNodeIDsForCase, a function in AlgorithmBase that I neglected till now because it is only required for plug-ins that make use of data mining dimensions and drillthrough. The former is a feature that we skipped over because of its limited utility, whereas drillthrough can’t be implemented in the Generic Content Viewer and thus is more applicable when custom algorithms are used with custom viewers. Returning a DrillThroughNodeFlags.Content value in GetNavigator rather than DrillThroughNodeFlags.DimensionContent will direct SSDM to apply it to drillthrough rather than mining dimensions. It is called internally by SSDM (usually at least once after InsertCases and possibly before SaveContent), so the developer has little direct control over it. Basically, it returns the NodeUniqueNames of content nodes that are associated with a particular case as an array of strings; according to Books Online, it “Maps cases to node IDs, which are simply labels for nodes in a tree layout of your model content.” Crivat also says in the tutorial in the SDK “that if you are implementing a decision tree algorithm, it is likely that more nodes (the full path from root to leaf) will contain any given case. You can push all the node unique identifiers for those nodes in the string array returned by this function.” When the CaseID is -1 and there is no InputCase, the root node will be returned, but otherwise the nodes associated with the particular case are returned. I added the code in Figure 5 to the custom algorithms I posted a few articles ago and verified that it worked correctly, although I found no immediate use for it. When debugging code like this, note how CHILDREN_CARDINALITY changes for the root node.

Figure 5: An Example of the GetNodeIDsForCase Method
Protected Overrides Function GetNodeIDsForCase(nodeFlags As DrillThroughNodeFlags, CaseID As Long, inputCase As MiningCase, maxNodes As ULong) As String()

    Dim ReturnString() As String
    Select Case CaseID 'I implemented this as a Select in case we want to add custom logic later to make the type of node added conditional on the case
        Case -1 'the flag for the root node at the top of the hierarchy, containing all cases
            ReDim ReturnString(0)
            ReturnString(0) = "Root Node"
        Case Else
            'the attribute indexes are assumed to be 0-based, so the upper bound is GetAttributeCount - 1
            ReDim ReturnString(AttributeSet.GetAttributeCount - 1)

            For I As UInteger = 0 To AttributeSet.GetAttributeCount - 1
                'in this model, we have 20 rows/input cases, with two columns; I want each row to be associated with both columns
                'normally you'd associate the NODE_UNIQUE_NAME with it
                ReturnString(I) = "This case is associated with the " + AttributeSet.GetAttributeDisplayName(I, True) + " column."
            Next I
    End Select

    Return ReturnString
End Function

…………Another SSDM feature I neglected to delve into was Classified columns, which seem to be beneficial only in custom algorithms and then in certain narrow use cases. I also neglected some broad topics that might have certain nuances in SSDM, but are generally taken care of by other components of SQL Server, such as security. This is handled by the ordinary Analysis Services mechanisms, which have already been fully fleshed out elsewhere. I don’t mean to relegate such an important topic to an afterthought, but I’m not conscious of any earthshaking distinctions between the ways SQL Server handles dimension and model permissions. Setup programs are a topic I’ve dealt with on an ongoing basis as part of other tutorials; there really isn’t any separate mechanism specific just to data mining projects, aside from the tricky business of deploying custom viewers and algorithms. The latter was tackled in Post 14.3, which really represented the pinnacle of difficulty in this series. As expected, deployment and debugging were also the biggest hurdles with custom viewers, but I seriously overestimated how difficult it would be to write them. I knew from the beginning of this series that I would make a lot of mistakes because I was in over my head, but decided that the best way to learn was to dive in – especially when I could provide some badly needed free press to the most underrated feature of SQL Server. I could probably make a further contribution by writing on the topic of data mining performance, but that will have to wait until I get a much firmer grasp of SSDM internals. A year and a half ago I won a copy of Kalen Delaney’s incredibly useful book on SQL Server 2008 R2 internals, but there is no such guide for SSDM. There isn’t even an equivalent to the Analysis Services Performance Guide, which nevertheless has some hints about how SSDM data might be stored to disk in a manner similar to SSAS dimensions. I’m still new to this, so my experience is only as valuable as the pitfalls a rookie may encounter. I really ought to thank SQLServerCentral for their support in publishing more than a dozen of the articles in this series despite my inexperience, which was of immense help.
…………At present, the best way I can probably make a worthwhile contribution while still acquiring some of the data mining skills I need is to evaluate SSDM’s competitors, and explain how to use them in conjunction with SQL Server. Some of the tools I may survey include Weka, RapidMiner, Autobox and Predixion Software, the last of which was established by former members of Microsoft’s Data Mining Team. This will be my second attempt at this upcoming series, which never even got off the ground when I tried to launch it in 2012 with evaluations of IBM and Oracle’s data mining products. As explained in Thank God I Chose SQL Server part I: The Tribulations of a DB2 Trial and Thank God I Chose SQL Server part II: How to Improperly Install Oracle 11gR2, I gained new respect for SQL Server after encountering numerous usability problems that made even the setup processes for its competitors a real challenge. SQL Server simply has its competitors beat hands down when it comes to usability and user interface design, which aren’t trivial matters; just think of the many ways a defective user interface might burden a DBA who’s trying to troubleshoot a production server that’s crashed, for example. As I discovered when reading up on Oracle a few days ago, their data mining offerings are light years behind SSDM. There is so little content there that the 2007 edition of Oracle Data Mining: Mining Gold from Your Warehouse consists of little except giant screenshots and white space. At the time, Oracle only supported two kinds of clustering, adaptive Naïve Bayes networks and support vector machines (SVMs, which overlap Linear and Logistic Regression). Two years earlier, at the time the Lind-Oveson tutorial was written, SSDM was already almost in its present state, with additional algorithms like Association Rules, Sequence Clustering, Time Series and Neural Networks that Oracle simply couldn’t compete with. To illustrate just how great a lead the SSDM user interface enjoyed, consider that Oracle didn’t have any graphical means of viewing Decision Trees yet.[iv] Thanks to years of neglect of data mining by the top brass at Microsoft, Oracle and other competitors may well have slowly inched their way towards competitiveness with SSDM. Yet it is impossible to determine just how good Oracle’s data mining capabilities are because the user interface and setup procedures have regressed since then, to the point that it’s practically unusable. Installing DB2 is likewise a chore, but their Windows GUI is currently the equivalent of the junkyard truck in Sanford and Son. That is why I believed it was entirely reasonable to refer to decade-old instructional materials on SSDM throughout this series, like the Lind-Oveson tutorial. In my next series, I’ll investigate whether or not SQL Server’s other competitors have succeeded in surpassing SSDM’s capabilities over the last decade of neglect, much like the tortoise and the hare. There’ll be a long learning curve, but hopefully along the way I’ll be able to provide some practical advice on how to mine SQL Server data with these tools, as well as some guidelines for when they would be appropriate to use rather than SSDM.


[i] see pp. 558-562, MacLennan, Jamie; Tang, ZhaoHui and Crivat, Bogdan, 2009, Data Mining with Microsoft SQL Server 2008. Wiley Publishing: Indianapolis.

[ii] IBID., pp. 558-559.

 [iii] When writing this, I originally thought that MiningViewerType was an enumeration and that I was therefore limited to using only the seven values predefined by Microsoft. Then I discovered that you can override predefined enums using the Shadows keyword, as described in the user “Boop Boop”’s response in the MSDN thread [RESOLVED] Extending an Enum. This would have been an ironic solution to keep us from wandering in the obscurity of the underlying C++ code, down in the murky depths a couple of assemblies below our custom algorithm code.

[iv] p. 96, Hamm, Carolyn K., 2007, Oracle Data Mining: Mining Gold from Your Warehouse. Rampant TechPress: Kittrell, North Carolina.

A Rickety Stairway to SQL Server Data Mining, Part 14.8: PMML Hell


By Steve Bolton

…………In A Rickety Stairway to SQL Server Data Mining, Part 14.3: Debugging and Deployment, we passed the apex of this series of amateur self-tutorials on SQL Server Data Mining (SSDM) and have seen the difficulty level and real-world usefulness of the material decline on a steady slope since then. This week’s topic is a bit more challenging than the material covered in the last article, A Rickety Stairway to SQL Server Data Mining, Part 14.7: Additional Plugin Functionality, but only because it pertains to an SSDM feature that is poorly documented, to the point that it is sometimes difficult to determine whether a lack of results is due to bugs or incorrect usage. In fact, the only way I have been able to get the product’s Predictive Model Markup Language (PMML) support features to work at all is by implementing the relevant methods in SSDM’s extensibility architecture for adding custom algorithms.
…………PMML was developed in the late 1990s by the Data Mining Group (DMG), a consortium of data mining product vendors, in order to facilitate interoperability between their disparate mining tools using XML. The consortium maintains a handy list of vendors that support particular versions of PMML at the DMG.org website, which is also the central repository of other critical information on PMML, particularly schema specifications. These include Weka, Tibco, Angoss, Augustus, BigML, Experian, IBM InfoSphere, Knime, KXen, MicroStrategy, Pervasive DataRush, Predixion Software, RapidMiner, R/Rattle, Salford Systems, Sand CDBMS, SAS Enterprise Miner, Teradata, Clementine and Zementis. The last of these even has a PMML Converter product that can translate older PMML documents to the latest standard, which I have yet to try out. Unfortunately, Microsoft has fallen further behind the current standard than any other vendor on the list, thanks to the pattern of neglect by the company’s top brass that I have spoken about before. It only supports version 2.1, which was released way back in March 2003, according to the helpful Wikipedia page on PMML. The latest release was version 4.1, back in December of 2011, a scant two years ago. Most portability features in the computing field don’t live up to their billing in their earliest incarnations, due to various kinks in the translation processes, lack of strict adherence to interoperability standards and lack of widespread adoption among vendors. PMML is no exception to that rule of thumb and Microsoft’s support for the standard is frozen at a point in time when it was much more applicable. Because I am a rookie at this, I cannot yet gauge how common it is for mining models to be exchanged today between different tools using PMML; perhaps Microsoft determined it was one of those pie-in-the-sky standards that didn’t provide much benefit for end users and thus didn’t bother to update it. Either way, the bottom line is the same: regardless of what the rest of the industry is doing with PMML, use cases in which mining models can be successfully exchanged in this way with SSDM are going to be quite rare. I’m not going to spend much time on this topic because SSDM is using an obsolete version of PMML that is more than a decade old. Furthermore, it is limited to just two or three (Microsoft’s documentation is unclear) of the nine algorithms included with SSDM out-of-the-box, which we discussed much earlier in this series. Moreover, this stunted feature is also unwieldy and poorly documented, thereby making it even less useful.
…………In theory, it is possible to create a SQL Server mining model using a statement like CREATE MINING MODEL MyPMMLMiningModelName FROM PMML ‘My XML String’, by pasting the full string of a PMML 2.1 document created by a different mining tool. It should also be possible to create PMML documents from existing SQL Server mining models using a statement like SELECT * FROM MyPMMLMiningModelName.PMML. The DMSCHEMA_MINING_MODEL_CONTENT_PMML and DMSCHEMA_MINING_MODEL_XML schema rowsets return basically the same information in a common set of columns, which will be blank for mining models that are not derived from PMML. The MODEL_PMML column is supposed to display the PMML text associated with a model, while the SIZE column contains the length of the PMML definition in bytes and LOCATION should contain the document’s storage location on disk. Unless a model is created from PMML, however, these schema rowsets are useless. Yet in many cases there is no practical benefit to this, because recreating the ported mining model in the Visual Studio GUI may be less of a hassle than dealing with endless XML parsing errors, even for the PMML version and algorithm types that SQL Server can theoretically handle. It is good that Microsoft’s Data Mining Team included nascent features for working with PMML way back in 2005, but the neglect the top brass at Microsoft has displayed towards data mining tools since then has had a greater impact on this stillborn feature than on any other component of SSDM. I’m not going to say that it’s useless in its current stunted form, because there are undoubtedly PMML documents out there that SSDM can parse correctly, thereby saving data miners the time of implementing them a second time. Models that will port correctly for the narrow set of algorithms and ancient PMML version SQL Server theoretically supports are not easy to come by though, which diminishes its usefulness further.
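…………To make the syntax concrete, the two statement shapes in question look like this, using the placeholder model name from the paragraph above; the string after FROM PMML would be the full text of a PMML 2.1 document:

CREATE MINING MODEL MyPMMLMiningModelName FROM PMML 'My XML String'

SELECT * FROM MyPMMLMiningModelName.PMML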
…………For example, five of the sample models available at the DMG’s PMML Sample Models Archive are Clustering or Decision Trees algorithms encoded in PMML 2.0 or 2.1, which means SQL Server should theoretically have been able to parse them. Yet all five models produced a variety of error messages when the text was pasted into SQL Server’s CREATE MINING MODEL statement, including “Incorrect document syntax,” “Cannot parse mining model due to malformed PMML” and various “PMML parsing failed” errors. The most ominous of these was a warning that “The model inside the PMML 2.1 document uses an algorithm that is not supported,” after the document was apparently parsed correctly. Whether they were malformed because of some fault in DB2 Intelligent Miner, the IBM product that produced the documents, or could not be parsed due to some error on SQL Server’s part is difficult to determine. The bottom line is that getting this portability feature to actually work between different data mining tools may be a taller order than it seems, one that may involve a lot of unwieldy, time-consuming editing of documents by hand. For relatively small documents under the size of a megabyte it might be feasible to fix invalid PMML manually, but it would defeat the purpose of this portability feature if we had to do this kind of menial work to translate all 22 gigabytes of data stored in the mining structures we’ve used as samples in this series. And this is a pittance compared to the terabytes or even petabytes of information required when Big Analysis is performed on Big Data. The latter term is of course a buzzword that obscures the fact that Data has always been getting Bigger, throughout much of the history of the human race; it is a lot like the term “globalization,” which has been occurring steadily for millennia, as I learned while I was studying foreign policy history in grad school. One of the best strategies for coping with the Even Bigger Data of the 21st Century is to develop tools that automate the processing of such data, which will become a Big Hassle in and of itself, in tandem with the size of the datasets. PMML is designed to do this by performing the task of porting mining models, but at the point when the 2.1 specification was released, it apparently could not be done sufficiently “under the hood” to make user intervention unnecessary. Until Data Mining Expressions (DMX) statements like these can be made to work without a lot of costly attention on the part of end users, their practical benefits will diminish in tandem with the rise of Big Data, as the costs in terms of time and energy to use them increase with each megabyte of XML that has to be checked manually. Certainly, with sufficient expenditure of resources, PMML documents can be reworked to fit into the existing SSDM architecture, but those costs are already prohibitive and only increasing as time passes.
…………Similarly, if you run a DMX query like SELECT * FROM DMSCHEMA_MINING_MODEL_CONTENT_PMML, it is a good idea to add a WHERE clause, because the statement will return an error message about the first model it fails to process instead of the results. Unfortunately, most of the models I created earlier in this series could not be included in this statement due to various error messages. Books Online says that Naive Bayes, Decision Trees and Clustering are the only ones out of the nine out-of-the-box algorithms we surveyed earlier in this series which support “the use of Predictive Model Markup Language (PMML) to create mining models,” but it is unclear from this wording whether BOL means creating models from PMML or vice-versa. Apparently it is the former, because the ALLOW_PMML_INITIALIZATION column of the DMSCHEMA_MINING_SERVICES schema rowset is only true for Decision Trees and Clustering, whereas queries of DMSCHEMA_MINING_MODEL_CONTENT_PMML against Naive Bayes models always fail. All three of the models we retained in A Rickety Stairway to SQL Server Data Mining, Algorithm 1: Not-So-Naïve Bayes returned error messages when included in queries to this schema rowset, but for different reasons. Limiting the WHERE clause to NBRecordTableViewModel returned a long message including the phrase “persisted file cannot be found” and the path of the SQL Server Analysis Services (SSAS) file containing the model, which usually signifies that the model has not been processed yet. Processing Naive Bayes models won’t lead to successful queries of DMSCHEMA_MINING_MODEL_CONTENT_PMML though, because you’re likely to encounter a message to the effect that the algorithm does not support the functionality requested by the model, which is a tip-off that it does not support deriving PMML documents from model content queries. This was the case with NBDenormalized3Model, but NBDenormalizedAllModel returned one of those inscrutable error messages that crop up every so often in Analysis Services: “An unexpected error occurred (file ‘mddiscretizer.cpp’, line 1689, function ‘MDDiscretizer::GetBucketRange’).” Not surprisingly, searches for the term “GetBucketRange” with both “Sql server” and “Discretizer” returned no hits on Google. I suspect that GetBucketRange might someday belong in the invaluable list Jeannine Takaki posted last year in the TechNet thread A Guide to the (Undocumented) System Stored Procedures for Data Mining. Given that it is unlikely that Microsoft’s obsolete implementation of PMML can be used with Naive Bayes in this way, I didn’t make solving this bug a priority.
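…………In practice that means filtering the rowset down to one model at a time, along these lines; substitute the name of whichever mining model you want to inspect for the sample name below:

SELECT * FROM DMSCHEMA_MINING_MODEL_CONTENT_PMML
WHERE MODEL_NAME = 'ClusteringDenormalizedView3Model'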
…………Since BOL and ALLOW_PMML_INITIALIZATION both state that Decision Trees is supported by SSDM, I did not expect to have different errors returned for all ten of the mining models we used in A Rickety Stairway to SQL Server Data Mining, Algorithm 3: Decision Trees. DTDenormalizedView3Model, DTDenormalizedView3ModelDiscrete and DTDenormalizedView3ModelDiscrete2 all returned an error to the effect that the “mining model has a column that uses the DATE data type. This data type is not supported by the PMML 2.1 specification. The DATE data type can typically be converted to the DOUBLE data type as a workaround, which is supported by the PMML 2.1 specification.” Because the other seven models all returned the more ominous error that the “mining model contains features that are not supported in the PMML 2.1 specification,” I skipped debugging the date error and altered one model so that it had just one attribute (marked as both an input and output), to see if I could return any Decision Trees PMML at all. Alas, I received the same error and reached the conclusion that Microsoft’s implementation of Decision Trees is not compliant with its implementation of PMML 2.1, but that it didn’t document precisely which of the algorithm features are the root cause. I also reached the conclusion that it was not worth solving the problem by trial and error, since we’re dealing with an implementation of PMML that is not just obsolete, but buggy and undocumented. I also received the date error when running the query against ClusteringDenormalizedView3Model. The other 16 mining models we used in A Rickety Stairway to SQL Server Data Mining, Algorithm 7: Clustering returned without errors, but the MODEL_PMML column was blank in each case. As Figure 1 shows, SSDM returned such basic information as the MODEL_CATALOG, MODEL_NAME and MODEL_GUID, but left the others blank. Note how the second query returns identical information when using a SELECT…PMML statement. The result is the same: the MODEL_PMML column that is supposed to return “An XML representation of the model’s content in PMML format,” as BOL states, is still blank.

Figure 1: Sample Query of DMSCHEMA_MINING_MODEL_CONTENT_PMML
[Image: PMMLSchemaRowsetResult1]

…………To date, the only way I have been able to get the MODEL_PMML columns of these statements to return a PMML document is by implementing the logic myself during the development of custom algorithms. The only documents I have encountered so far which the CREATE MINING MODEL statement would accept as valid were likewise produced this way, using several methods built into the plug-in architecture for that purpose. Implementing PMML support in a plug-in algorithm is mainly a matter of adding three optional Overridable methods to your code. The easiest of these is the SupportsPMML member of AlgorithmMetadataBase, which is another one of those methods that behaves like a property and merely returns a Boolean value. The heavy lifting is done in the RenderPMMLContent and LoadPMMLContent methods of AlgorithmBase. First we’ll use RenderPMMLContent to produce a PMML document from the PluginModel we’ve been using in this section of the Rickety series, on building custom algorithms for SQL Server. When you perform a SELECT…PMML query against a plug-in algorithm, SQL Server will call your plug-in’s RenderPMMLContent method first, followed by the HasFeatureSelection and GetAllFeatureSelectionInfo methods we discussed in our last article. I have yet to see msmdsrv.exe call GetAttributeFeatureSelectionInfo during debugging, but these three methods are apparently designed merely to populate PMML models with feature selection information. The PMMLWriter is automatically passed by the msmdsrv process and only needs to be populated with valid PMML attributes. The WriteStartElement and WriteAttributeString statements in Figure 2 assign the values of certain mandatory fields of the algorithm content portion of an XML document. WriteAttributeString is one of a family of PMMLWriter object members that perform similar functions, except on different data types, such as WriteAttributes, WriteBinHex, WriteCData, WriteChars, WriteComment, WriteElementString, WriteEndAttribute, WriteFullEndElement, WriteName, WriteNmToken, WriteNode, WriteQualifiedName, WriteRaw, WriteStartAttribute, WriteString, WriteValue and WriteWhitespace. The plug-in documentation mentions that some of the methods will throw an exception if invoked in .Net plug-ins, but be wary, because the list is incomplete.

Figure 2: The RenderPMMLContent Method

Protected Overrides Sub RenderPMMLContent(PMMLWriter As PMMLWriter)

    'the element name must be one of the model types defined by the PMML 2.1 standard
    PMMLWriter.WriteStartElement("NaiveBayesModel")
    PMMLWriter.WriteAttributeString("modelName", Me.Model.Name)
    PMMLWriter.WriteAttributeString("functionName", "classification")
    PMMLWriter.WriteAttributeString("algorithmName", Me.Model.Metadata.GetServiceName())
    PMMLWriter.WriteAttributeString("threshold", "1")

    Me.Model.RenderModelCreationFlagsPMML(PMMLWriter)
    Me.Model.RenderMiningSchemaPMML(PMMLWriter)
    Me.Model.RenderModelStatisticsPMML(PMMLWriter)

    PMMLWriter.WriteEndElement()

End Sub

…………The .chm help file of the plug-in software development kit (SDK) provides somewhat impenetrable explanations of the RenderModelCreationFlagsPMML, RenderMiningSchemaPMML and RenderModelStatisticsPMML methods – which is better than no explanation at all, i.e. the result you’ll get when using a search engine to look them up. The good news is that plug-in developers apparently don’t have to write their own implementations of these methods, because the code in Figure 2 above works. RenderModelCreationFlagsPMML renders the PMML extensions that are the equivalent of algorithm parameters and flags, RenderMiningSchemaPMML takes care of the MiningSchema section and RenderModelStatisticsPMML renders the global stats section (an equivalent of the MarginalStats object in SSDM). AlgorithmMetadataBase.GetMarginalRequirements must be set to AllStats (as discussed earlier in this segment on plug-in algorithms) in order for RenderMiningSchemaPMML to work correctly. Just like the OpenScope and CloseScope statements we used to encapsulate the logic of the LoadContent and SaveContent methods of the AlgorithmBase class (as explained in A Rickety Stairway to SQL Server Data Mining, Part 14.2: Writing a Bare Bones Plugin Algorithm), our PMMLWriter object requires WriteStartElement and WriteEndElement statements. The most confusing part of the code above might be the mention of Naïve Bayes in the WriteStartElement. The PluginModel we’ve been working with in this segment on custom algorithms doesn’t match any of the broad classes of mining methods supported by the PMML 2.1 standard, so I simply substituted a value that is compliant with that standard. Other potential values include GeneralRegressionModel, ClusteringModel, SequenceModel, NeuralNetwork, AssociationModel, RegressionModel and TreeModel, and possibly also a Taxonomy for hierarchized categories. Each of these requires highly specific mandatory XML elements that make them more challenging to render; for the full reference, see the DMG webpages on how to structure a 2.1 document and the XSD for that version. For example, the sample code provided by Bogdan Crivat, one of the original SSDM programmers, in the plug-in SDK requires a modelClass tag that is specific to Clustering algorithms. For General Regression, you would have to fill in a ModelType with values like regression, generalLinear, logLinear, multinomialLogistic and so on. I won’t get into an in-depth discussion of these XML elements, because this is a tutorial in how to use Microsoft’s implementation of PMML, not in PMML syntax itself. Now if we run a DMX query like SELECT * FROM PluginModel.PMML, the MODEL_PMML column ought to be populated with a valid XML document like the one depicted in Figure 3, except crammed into one line. We would also see the SIZE column populated with the size of the document in bytes.

Figure 3: Sample PMML Document Returned in the MODEL_PMML Column
MyValidPMMLDocument

…………If we feed this PMML document to SSDM in a single string, at the end of a statement like CREATE MINING MODEL MyPMMLMiningModelName2 FROM PMML, a mining model of the same name ought to be created on the server, within a mining structure that also uses that name, except terminated with the suffix _Structure. Provided, that is, that we implement the LoadPMMLContent method of AlgorithmBase correctly. The XML string passed to one of these statements will be converted to an XMLElement object that SQL Server automatically passes to the LoadPMMLContent method. The code within it must parse the document and build a new mining model out of its PMML elements. The first order of business is to make sure the root element name matches one of the values that PMMLWriter.WriteStartElement will accept; I used a dummy value of Naïve Bayes even though my plug-in is in no way related to that algorithm, as discussed a few paragraphs ago. Although I do not make practical use of this in Figure 4, Crivat also notes in his SDK sample code that it is wise to “Create a namespace manager with the namespace of the content, and prefix it with ‘pmml.’ This way, all the XPath queries will be qualified with pmml.” The If…End If block immediately after the NamespaceManager code is designed to map the PMML document’s Extensions element to our model’s parameters, which were defined in the last tutorial. Crivat advises to “Look for the Extension element with the MiningAlgorithmFlags name and ‘Microsoft Analysis Services’ as extender,” as I do when declaring the ParameterXPath variable. The rest of the code in this block merely loops through the model parameters and assigns the values to our model. After this If…End If block, we would normally put any code (or call methods in our custom classes) for loading elements specific to the type of model we’re working with. For example, in Bogdan’s tutorial, this is the point where you’d retrieve the number of clusters, their distributions and other information specific to the variant Clustering algorithm he demonstrates. For Naive Bayes we’d include information on the inputs, outputs and value pairs; a SequenceModel has many required tags, such as those identifying the number of sequences, items and rules; an AssociationModel would of course require information on the itemsets, rules and so forth; neural nets have a lot of specific functionality such as the activation function, thresholds and the number of layers. Likewise, the definitions of models labeled Trees might include split characteristics, whereas General Regression requires specifications for its predictors and the Regression type takes familiar values such as intercepts and coefficients. These would all have to be retrieved from the PMML string here and plugged into the variables of our plug-in that define the structure of the model. I’ll spare readers the quite lengthy code that would be required to illustrate every type of conversion, especially since that would require the development of multiple sample plug-ins, one for each type of PMML model. The important thing is to grasp the process behind it, which entails parsing the XMLElement supplied to the method, using standard .Net XML functionality, and translating it to your model’s variables.

Figure 4: Sample Implementation of the LoadPMMLContent Method

Protected Overrides Sub LoadPMMLContent(TempAlgorithmContent As System.Xml.XmlElement)

            If String.Compare(TempAlgorithmContent.Name, "NaiveBayesModel", True) <> 0 Then

                Throw New System.Exception("Invalid PMML node: " & TempAlgorithmContent.Name)

            End If

            Dim NamespaceManager As New System.Xml.XmlNamespaceManager(TempAlgorithmContent.OwnerDocument.NameTable)

            NamespaceManager.AddNamespace("pmml", TempAlgorithmContent.NamespaceURI)

            Dim ParameterXPath As String = "pmml:Extension['MiningAlgorithmFlags'=@name and 'Microsoft Analysis Services'=@extender]"

            Dim ParameterNode As System.Xml.XmlNode = TempAlgorithmContent.SelectSingleNode(ParameterXPath, NamespaceManager)

            If ParameterNode IsNot Nothing Then

                Dim ParameterXMLNodeList As System.Xml.XmlNodeList = ParameterNode.SelectNodes("pmml:key-val", NamespaceManager)

                'loop through the PMML model parameters and assign them to our model's parameters

                For I As Integer = 0 To ParameterXMLNodeList.Count - 1

                    If ParameterXMLNodeList.Item(I).Attributes("name") IsNot Nothing Then

                        Select Case ParameterXMLNodeList.Item(I).Attributes("name").Value

                            Case "MyFeatureSelectionParameter"

                                MyMiningParameterCollection(ParameterXMLNodeList.Item(I).Attributes("name").Value).Value = System.Convert.ToInt32(ParameterXMLNodeList.Item(I).Attributes("value").Value)

                            Case "MyTrainingParameter"

                                MyMiningParameterCollection(ParameterXMLNodeList.Item(I).Attributes("name").Value).Value = ParameterXMLNodeList.Item(I).Attributes("value").Value

                        End Select

                    End If

                Next I

            End If

        End Sub

…………If the plug-in code works correctly, msmdsrv will respond to a CREATE MINING MODEL FROM PMML statement by calling this method and then SaveContent to persist the structure of the newly created model. After this, a new mining model and parent structure will be added to your SSAS database. It might conceivably be possible to add functionality to LoadPMMLContent and RenderPMMLContent to handle documents that are compliant with higher versions of PMML than 2.1, but I have yet to verify this; it is also entirely possible that the SSAS query parser will reject syntax that is not 2.1-compliant, regardless of whether a plug-in algorithm’s internal code could successfully process the string or not. This capability would greatly increase the utility of SSDM’s PMML functionality, which in its current stunted form would otherwise be the tool’s least useful feature. Regardless of whether or not the plug-in architecture can be used to fix the shortcomings of Microsoft’s PMML implementation, the ability to write custom algorithms is one of the most useful features of SSDM, which is in turn the most underrated but productive component of SQL Server. The same can be said of the ability to extend SSDM with custom data visualization capabilities, as we shall discuss in the final segment of the Rickety series.
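…………For readers who want to exercise this end to end from client code rather than an SSMS DMX window, a statement of this shape can be sent through ADOMD.NET. This is only a sketch: the connection string, catalog name and the PMML string itself are hypothetical placeholders, and the exact quoting rules for embedding a long XML document in DMX are worth double-checking against your own server before relying on it.

Imports Microsoft.AnalysisServices.AdomdClient

Module CreateMiningModelFromPMML

    Sub Main()

        'hypothetical placeholders: substitute your own server, database and a complete PMML document
        Dim PMMLDocument As String = "<NaiveBayesModel modelName=""PluginModel"" functionName=""classification"" />"

        Using Connection As New AdomdConnection("Data Source=localhost;Catalog=MySSASDatabase")

            Connection.Open()

            'the DMX statement hands the PMML string to msmdsrv, which in turn calls LoadPMMLContent;
            'single quotes inside the document are doubled so the DMX string literal stays intact
            Dim DMXStatement As String = "CREATE MINING MODEL MyPMMLMiningModelName2 FROM PMML '" & _
                PMMLDocument.Replace("'", "''") & "'"

            Using Command As New AdomdCommand(DMXStatement, Connection)
                Command.ExecuteNonQuery()
            End Using

        End Using

    End Sub

End Module

If the statement succeeds, the new model and its _Structure parent should show up in the SSAS database just as described above.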

A Rickety Stairway to SQL Server Data Mining, Part 14.7: Additional Plugin Functionality


By Steve Bolton

…………In order to divide this segment of my amateur tutorial series on SQL Server Data Mining (SSDM) into digestible chunks, I deferred discussion of some functionality that can be implemented in custom algorithms. In the last installment I explained how to write custom data mining functions, which is a quite advanced but powerful feature of the plug-in architecture – even by the standards of SSDM, which is in turn perhaps the most underrated and least publicized component of SQL Server. This week we’ll be covering a few odds and ends that aren’t absolutely necessary to implement in a bare bones plug-in, yet are more commonly used than custom mining functions and not nearly as challenging. It is really quite easy to code custom mining model flags, for instance, whereas parameters for custom algorithms are only slightly more difficult. In order to kill two birds with the same stone, so to speak, we’ll also dispose of the topic of feature selection by implementing custom mining flags and parameters that control it.
…………Mining flags merely require a class-scoped declaration of the MiningModelingFlag object type, as well as the implementation of two optional methods in AlgorithmMetadataBase. The GetSupModelingFlags method returns an array of MiningModelingFlag objects, including such enumeration values as MiningModelingFlag.Binary, Invalid, NotNull and Regressor. Returning a value of MiningModelingFlag.Null apparently causes a value of “error” to be listed in the SUPPORTED_MINING_FLAGS column of the DMSCHEMA_MINING_SERVICES schema rowset, which is roughly the equivalent of a relational dynamic management view (DMV). That was visible in the screenshot of DMSCHEMA_MINING_SERVICES I posted in A Rickety Stairway to SQL Server Data Mining, Part 14.2: Writing a Bare Bones Plugin Algorithm, and it is corrected by the code depicted in Figure 1. For whatever reason, returning values of MiningModelingFlag.CustomBase rather than references to the flags themselves invariably raised an exception, so be vigilant for this fine distinction when implementing your own flags.

Figure 1: Sample AlgorithmMetadataBase Class Code for Custom Mining Flags
        Friend Shared MyCustomMiningFlag1 As MiningModelingFlag = MiningModelingFlag.CustomBase + 1
        Friend Shared MyCustomMiningFlag2 As MiningModelingFlag = MiningModelingFlag.CustomBase + 2

        Public Overrides Function GetSupModelingFlags() As MiningModelingFlag()

            Dim TempMiningFlags As MiningModelingFlag() = New MiningModelingFlag() {MiningModelingFlag.Regressor, MyCustomMiningFlag1, MyCustomMiningFlag2}

            Return TempMiningFlags

        End Function

        Public Overrides Function GetModelingFlagName(TempFlag As MiningModelingFlag) As String

            Select Case TempFlag

                Case MyCustomMiningFlag1

                    Return "MyCustomMiningFlag1"

                Case MyCustomMiningFlag2

                    Return "MyCustomMiningFlag2"

                Case Else

                    Throw New System.Exception("Unknown modeling flag : " & TempFlag.ToString())

            End Select

        End Function

…………I also corrected the declarations of the flags, which should have been Friend Shared in the original version. GetModelingFlagName merely returns string names associated with the flags, which is also trivial to implement. As noted in previous tutorials in this segment of the Rickety series, most of the code I’m providing initially came from the C# samples provided by Bogdan Crivat, one of the original SSDM programmers, in the plug-in software development kit (SDK). I converted it to Visual Basic .Net in part because it’s my favorite language and secondarily because the Internet is practically devoid of SSDM plug-in code written in VB; after that, I made so many changes that the original material is practically unrecognizable. One of the common changes I made was to use Select Cases rather than If statements in order to accommodate wider ranges of values in future projects, as I did in GetModelingFlagName. After replacing these declarations and methods in the sample code I posted for the AlgorithmMetadataBase class a while back, the SUPPORTED_MINING_FLAGS column of DMSCHEMA_MINING_SERVICES should now depict a value of “REGRESSOR, MyCustomMiningFlag1, MyCustomMiningFlag2” rather than “Error.” As you can see in Figure 2, SSDM capitalized the MiningModelingFlag.Regressor value I returned in GetSupModelingFlags – as it does with the other out-of-the-box flag types – but I didn’t stick to this convention when naming my custom flags.

Figure 2: Custom Mining Flags in the SUPPORTED_MINING_FLAGS Column
Plugins7DMVs1
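
…………If you’d rather not rely on a screenshot, the rowset can also be queried directly. The snippet below simply composes the DMX as a Visual Basic string, to keep the examples in one language; paste the text into an SSMS DMX window or run it through ADOMD.NET. The WHERE clause assumes the sample plug-in registers itself under the Internal_Name_For_My_Algorithm service name, so adjust it to whatever name your own algorithm reports.

'a DMX query against the schema rowset; the SERVICE_NAME filter is an assumption based on the
'sample plug-in's internal name, so substitute the name your algorithm reports
Dim FlagCheckQuery As String = "SELECT SERVICE_NAME, SUPPORTED_MINING_FLAGS " & _
    "FROM $system.DMSCHEMA_MINING_SERVICES " & _
    "WHERE SERVICE_NAME = 'Internal_Name_For_My_Algorithm'"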

…………Using parameters in a plug-in algorithm is a little trickier and requires the implementation of three methods, plus invocation code somewhere around the AlgorithmMetadataBase constructor. I’m not sure yet if it is necessary to add a constructor to that class, but it makes an ideal place to instantiate your parameters and add them to a class-scoped MiningParameterCollection object, as depicted in Figure 3. Crivat encapsulates declaration code of this kind in the constructor by calling a DeclareParameters method he added to the base class, which is probably a good idea in many circumstances; I opted not to add methods of my own devising to this class for now, just to avoid causing any confusion with other methods that are marked Must Override or optional in the base class. As Figure 3 also shows, MiningParameter objects have several properties that can be used to perform basic tasks like assigning defaults and descriptions, as well as identifying the parameters as hidden or required. The expected return type and name are specified in the same line of code as the parameter declaration.

Figure 3: Parameter Declaration Code for the AlgorithmMetadataBase Class

Protected MyMiningParameterCollection As New MiningParameterCollection

'#2 declare the members of the collection – possibly in AlgorithmMetadataBase.New or a method called by it

        Public Sub New()

            Dim MiningParameter1 As New MiningParameter("MyFeatureSelectionParameter", GetType(System.Int32))
            Dim MiningParameter2 As New MiningParameter("MyTrainingParameter", GetType(System.String))

            MiningParameter1.DefaultValue = "0"
            MiningParameter1.Description = "A Description for My Only Feature Selection Parameter"
            MiningParameter1.IsExposed = True
            MiningParameter1.IsRequired = False
            MyMiningParameterCollection.Add(MiningParameter1)

            MiningParameter2.DefaultValue = "My Default Value"
            MiningParameter2.Description = "A Description for My Only Training Parameter"
            MiningParameter2.IsExposed = True
            MiningParameter2.IsRequired = False
            MyMiningParameterCollection.Add(MiningParameter2)

        End Sub

…………We must also implement three other methods, which are essentially trivial. The GetParametersCollection method returns either the MiningParameterCollection we declared at the beginning of the class, or Nothing if there are no parameters and therefore no such collection. There’s nothing particularly difficult about the ParseParameterValue method either, although it does require a few more lines of code. My implementation is pretty close to Crivat’s original sample code, in part because there are only so many ways the simple task of parsing the parameter values supplied by the user can be done. The parameter types are different and the exception handling syntax varies a little, with the major change being the Select Case. The logic is easy to grasp: when the user supplies the parameter at a given ordinal index of the zero-based collection, we merely convert the supplied string to the data type we originally declared for that parameter and return the result. In his original code, Crivat notes that the parameters must be cast to the exact type, which I think means that they won’t be implicitly cast. All of the code we’ve added so far in this tutorial belongs in AlgorithmMetadataBase, but the GetTrainingParameterActualValue method is part of the AlgorithmBase class. It is no more difficult to implement, however, because all we have to do is return the default value if the user declined to specify one. As Crivat notes, special logic can be added here to substitute “The best value automatically (heuristically) detected by the algorithm for the current training set.” In a related vein, the .chm help file included with the SDK says that “The actual value may be different from what was requested by the user in situations like the following: the value specified by the user indicates some ‘auto-detect’ mechanism, that computes an optimized value based on the input data.” It also says that “implementing this function usually requires persisting the values of the parameters in the SaveContent implementation of your algorithm,” but we already demonstrated how to persist variable values of any kind in Post 14.2.

Figure 4: Three Methods Relevant to Parameters in AlgorithmMetadataBase and AlgorithmBase

Public Overrides Function GetParametersCollection() As Microsoft.SqlServer.DataMining.PluginAlgorithms.MiningParameterCollection
 

            Return MyMiningParameterCollection 

        End Function

         Public Overrides Function ParseParameterValue(ParameterIndex As Integer, ParameterValue As String) As Object

            Dim ReturnValue As Object

            Select Case ParameterIndex  

                Case 0

                    ReturnValue = System.Convert.ToInt32(ParameterValue)

                Case 1

                    ReturnValue = ParameterValue

                Case Else

                    Throw New System.Exception("An error occurred when casting the parameter values.")

            End Select

            Return ReturnValue

         End Function

 

        Protected Overrides Function GetTrainingParameterActualValue(ParameterOrdinal As Integer) As System.Object

            If MyMiningParameterCollection.Item(ParameterOrdinal).Value Is Nothing Then

                Return MyMiningParameterCollection.Item(ParameterOrdinal).DefaultValue

            Else

                Return MyMiningParameterCollection.Item(ParameterOrdinal).Value

            End If

        End Function

…………The ValidateAttributeSet method of AlgorithmMetadataBase is also an ideal place to validate mining flags for incorrect or missing mandatory values. SSDM automatically passes it a reference to the AttributeSet, which can be parsed to make sure that the user passed valid values; for example, you’d obviously want to iterate over the AttributeSet and verify that at least one attribute is marked as an input column (a minimal sketch of such a check appears just after Figure 5). The code for this method is even more trivial, in the sense that it doesn’t even have a return type; you merely iterate over your AttributeSet and custom flags and throw an exception if a bad value is encountered, which halts training. On the other hand, the .chm states that “the attribute set here is not fully constructed, so only a limited set of methods can be invoked: GetAttributeCount, GetAttributeFlags, IsFunctionallyDependant and IsCaseAttribute.” I have yet to verify this, but it may preclude you from using this method to parse parameter values as well. Normally, however, they would be checked in ParseParameterValue or GetTrainingParameterActualValue. Regardless of how these methods are implemented, the code in the constructor should be sufficient to expose the parameters to the end user, which can be verified by running a query like the one in Figure 5 on the DMSCHEMA_MINING_SERVICE_PARAMETERS schema rowset. Note that the values depicted for the PARAMETER_NAME, PARAMETER_TYPE, IS_REQUIRED and DESCRIPTION columns correspond to the property values we assigned in the constructor.

Figure 5: DMSCHEMA_MINING_SERVICE_PARAMETERS
Plugins7DMVs2
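
…………Since I didn’t include ValidateAttributeSet in the earlier sample classes, here is a minimal sketch of the kind of check described above. I’m assuming the override’s exact signature (a single AttributeSet parameter and no return value) and that GetAttributeCount and GetAttributeFlags behave the way the .chm describes; the exception text and the parameter name are my own, so treat this as an illustration rather than verified plug-in code.

        Public Overrides Sub ValidateAttributeSet(TempAttributeSet As AttributeSet)

            Dim InputAttributeCount As Integer = 0

            'GetAttributeCount and GetAttributeFlags are among the few members the .chm says can be invoked here
            For I As Integer = 0 To TempAttributeSet.GetAttributeCount() - 1

                If (TempAttributeSet.GetAttributeFlags(I) And AttributeFlags.Input) = AttributeFlags.Input Then
                    InputAttributeCount = InputAttributeCount + 1
                End If

            Next I

            'throwing an exception here halts training, which is how invalid flag combinations are reported
            If InputAttributeCount = 0 Then
                Throw New System.Exception("At least one attribute must be marked as an input column.")
            End If

        End Sub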

…………The only difficulty I encountered when implementing parameters was that SSDM never called my GetTrainingParameterActualValue code as expected, which forced me to declare a public MiningParameterCollection in AlgorithmBase as well. I then added code to the CreateAlgorithm method of AlgorithmMetadataBase to make the two collections equivalent; there may be some kind of performance penalty for this workaround, but since AlgorithmMetadataBase is only called when the algorithm is instantiated, I have yet to see it. I then iterate over AlgorithmBase’s version of the collection in the InsertCases method and call GetTrainingParameterActualValue once for each constituent parameter. As Crivat suggests in his code examples, the beginning of InsertCases is an ideal place to handle training parameters; in my updated sample code for AlgorithmBase, I left a placeholder If…End If block to encapsulate conditional processing based on these parameter values. Since we’re using a class-scoped parameters collection, we can reference the parameters in many other routines, though. We could, for example, provide end users more control over model processing by implementing new training phases, or performance parameters that change the way SSDM performs calculations in the ProcessCase method, depending on the available system resources. I had originally planned to use parameters to illustrate how to use ProgressNotification objects, which update users on the status of model processing. As we have seen throughout this series, processing can be quite resource-intensive in terms of CPU, RAM and I/O, as well as time-consuming, thereby necessitating regular communication between algorithms and their users so that the latter can cancel processing or move on to some other task while they wait. The ProgressNotification object is designed to perform this task, by providing messages to the users that are displayed in the Visual Studio GUI, SQL Server Management Studio (SSMS) or in Profiler. As relational DBAs know, polling events in Profiler too frequently can also impact performance, so it is necessary to strike a fine balance between collecting too much and too little performance information. It is technically possible to expose parameters that allow users to control how often the algorithm emits its own developer-defined messages, such as at intervals based on the case count in ProcessCase. Yet to date, I have not been able to use any of the Start, Progress or EmitSingleTraceNotification methods of SSDM’s ProgressNotification object without triggering a System.AccessViolationException. I can instantiate ProgressNotification variables and manipulate their other properties and members without incident, but the three methods that actually fire notifications invariably trigger exceptions that can’t even be caught in standard Try…Catch blocks. The concept is really simple: you merely set a class-scoped ProgressNotification object equal to the output of a Model.CreateTaskNotification method call, then use properties like Format, Total and Current to specify a message and track the intervals at which it fires. Strangely, the exact same code in one plug-in project works but triggers the fatal exception in another, even when the code from the first is directly cut and pasted into the second. I’m considering filing a bug report at Technet, but can’t dismiss the possibility that there’s an overlooked error in my project settings or something. Either way, I’ll hold off on posting code that might be buggier than usual until I get to the bottom of it.
…………The code I wrote for the three Overridable methods relevant to feature selection is also buggy, but the good news is that it is largely irrelevant. HasFeatureSelection is the easiest one, because it is another one of those plug-in methods that behaves more like a property, which in this case is merely a Boolean choice. If you are going to perform feature selection, you merely return a value of True; if not, you can return False or not implement it at all. If you wish, you can also add trivial conditional logic to turn it on and off, depending on the values of any feature selection parameters you’ve implemented, or any other variable for that matter. Instead of using my example parameter in this way, I check its value in GetAttributeFeatureSelectionInfo and then return an AttributeFlags value that identifies the column as an Input attribute that participates in feature selection, if the value the user supplied matches the ordinal index passed by SSDM as the TempAttribute unsigned integer. The increment in the conditional logic merely addresses the fact that MiningParameterCollection objects are one-based, while the model’s AttributeSet object is zero-based. As the .chm points out, there are really only four values that this method can return: zero for attributes that feature selection isn’t applied to; AttributeFlags.Input; AttributeFlags.Output; and AttributeFlags.Input|AttributeFlags.Output for those columns which are feature selected both ways. Another optional method, GetAllFeatureSelectionInfo, returns both the set of attributes which were feature selected and the attribute flags associated with them. In this case, the latter are limited to values of AttributeFlags.Input, AttributeFlags.Output and a combination of both. This method is on the tricky side, since the return values are passed through output parameters. These are a native part of the syntax of the C family of languages; in Visual Basic, we have to work around this by adding the <Out()> tags and passing them ByRef. The sample implementation in Figure 6 is a little more convoluted than GetAttributeFeatureSelectionInfo because we need to create an array of attributes with smaller dimensions than the AttributeSet of all attributes, so their indexes won’t match; hence the need for the ArrayCounter variable. This method performs pretty much the same function as GetAttributeFeatureSelectionInfo, except that it populates an array of every attribute that is feature selected along with the correct flags, instead of just returning the flags one attribute at a time.

Figure 6: Feature Selection Code

        Protected Overrides Function HasFeatureSelection() As Boolean 

            Return True

        End Function 

        Protected Overrides Function GetAttributeFeatureSelectionInfo(TempAttribute As UInteger) As AttributeFlags

            If MyMiningParameterCollection.Item(0).Value = TempAttribute Then

                Return AttributeFlags.Input

            End If

            'zero indicates that feature selection is not applied to this attribute
            Return CType(0, AttributeFlags)

        End Function

        Protected Overrides Sub GetAllFeatureSelectionInfo(<Out()> ByRef TempAttributes() As UInteger, <Out()> ByRef TempAttributeFlags() As AttributeFlags)

            Dim ArrayCounter As UInteger

            For I = 0 To AttributeSet.GetAttributeCount - 1

                If MyMiningParameterCollection.Item(0).Value = I Then

                    'the output arrays must be allocated before they can be populated
                    ReDim Preserve TempAttributes(ArrayCounter)
                    ReDim Preserve TempAttributeFlags(ArrayCounter)

                    TempAttributes(ArrayCounter) = I
                    TempAttributeFlags(ArrayCounter) = AttributeFlags.Input
                    ArrayCounter = ArrayCounter + 1

                End If

            Next I

        End Sub

…………The problem with the code above is that these methods appear to be designed to be triggered internally by msmdsrv.exe, but I have yet to see Analysis Services hit the breakpoints I set in these routines, with one exception: when rendering Predictive Model Markup Language (PMML) representations of mining models, which we will discuss in the next installment of this series. I could work around this by calling them directly in another routine, like I did with the methods relevant to parameters in InsertCases, but there’s not much of a point in doing that. They don’t seem to do much of anything relevant to processing that can’t be implemented by adding our own methods to AlgorithmBase and calling them in other routines, so we’re not losing much of value by skipping over these methods. The plug-in architecture performs any special processing based on the values returned by them; it’s not as if we can feed them values and have SSDM calculate values based on the feature selection methods included in most of the nine out-of-the-box algorithms discussed much earlier in this series, like Interestingness Score, Shannon’s Entropy, Bayesian with K2 Prior and Bayesian Dirichlet Equivalent with Uniform Prior. Nor is there apparently any relevant information out there on these relatively irrelevant functions; there are no implementations of them in the sample SDK code (at least based on a Windows search of the files) and nothing in the tutorial I found online titled Microsoft Business Intelligence with Numerical Libraries: A White Paper by Visual Numerics, Inc. In fact, there are practically no hits at all on Google for any of them; the best I could do was find references to them in the old MSDN tutorial A Tutorial for Constructing a Plug-in Algorithm written by Max Chickering and Raman Iyer back in August of 2004, but that was designed for users working with COM in C++. My sample code may be buggy, but it is the only code I’ve seen published anywhere on the planet to date for any of these methods, in Visual Basic, C# or any other language. As I’ve said a few times throughout this series and my earlier series An Informal Compendium of SSAS Errors (which remains one of the few centralized sources of information for certain arcane Analysis Services exception messages), troubleshooting SSAS is a bit like blasting off into space, since you’re basically “boldly going where no man has gone before” thanks to the lack of documentation. This is doubly true of SSDM, which has even less documentation thanks to its status as SQL Server’s most underrated and under-utilized component, despite the fact that it can be put to such productive uses. Plug-in algorithms are in turn the cutting edge of SSDM, so it should be no surprise that we’re entering a vacuum – and just as in deep space, we can do without a lot of the material we’re encountering now. We’ve already traversed the most challenging topics in this segment of the Rickety series, after passing the apex of difficulty in A Rickety Stairway to SQL Server Data Mining, Part 14.3: Debugging and Deployment. Now the series is winding down to the point where we’re addressing features that aren’t of any great necessity in plug-in development. In the next installment, we’ll tackle SQL Server’s implementation of PMML, a powerful XML means of making mining models interchangeable with other products. Unfortunately, SQL Server is still stuck on the PMML 2.1 standard promulgated back in March of 2003. Despite the lack of attention from the top brass at Microsoft in that time period, SSDM remains one of the best data mining tools available, but PMML support is the one area in which its usefulness has been noticeably reduced by this neglect. In contrast, the product’s support for custom data mining viewers is one of its greatest unsung features, as we shall see in a few weeks as we close out the Rickety series.

A Rickety Stairway to SQL Server Data Mining, Part 14.6: Custom Mining Functions


By Steve Bolton

…………In the last installment of this amateur series of mistutorials on SQL Server Data Mining (SSDM), I explained how to implement the Predict method, which controls how the various out-of-the-box prediction functions included with SSDM are populated in custom algorithms. This limits users to returning the same standard columns returned by calls to prediction functions against the nine algorithms that ship with SSDM. The PredictHistogram function, for example, will return a single prediction column, plus columns labeled $Support, $Probability, $AdjustedProbability, $Variance and $StDev, just as it does with the out-of-the-box algorithms. Although we can control the output of the standard prediction functions to return any values we want, it may be more suitable in some cases to specify our own output format with more flexibility. This is where the ability to define your own custom mining functions comes in handy. Another important use case is when you need to perform custom calculations that wouldn’t normally be implemented in standard prediction functions like PredictStDev or PredictSupport. Plug-in programmers could conceivably perform any complicated calculation they like in the Predict method and return it in the results for PredictVariance, but this would be awkward and illogical if the result is not a measure of variance. Thankfully, the SSDM plug-in architecture we’ve been discussing in the past articles includes a mechanism for defining your own custom mining functions, which addresses all of these needs.
…………Aside from the calculation of cumulative statistics, developing a custom mining function is actually quite simple. The development process begins by declaring a public function in your AlgorithmBase class decorated with a MiningFunction attribute, which looks like the attribute declaration in my sample code: <MiningFunction(“My Custom Mining Function Description”)>. The text within the quotation marks will appear in the DESCRIPTION column of the DMSCHEMA_MINING_FUNCTIONS schema rowset, for the row corresponding to the new function. The parameter list can contain variables of your choosing, like the RecordCount integer I use in my sample code for limiting the rows returned by MyCustomMiningFunction, but special rules apply in a few cases. Parameters decorated with the MiningTableColumnRefAttribute or MiningScalarColumnRefAttribute tags must be AttributeGroup objects, such as the <MiningScalarColumnRefAttribute()> TempAttributeGroup As AttributeGroup declaration in my sample code. I believe MiningTableColumnRefAttribute tags are used for feeding nested tables to custom mining functions, but the PluginModel mining model we’ve been working with in this segment on custom mining algorithms doesn’t have one, nor do I see any benefit in adding one because of the risk of cluttering the tutorial needlessly. MiningScalarColumnRefAttribute, however, is a more common tag that allows you to pass a reference to a specific model column; in the prediction query I’ll execute to test the function, I’ll pass a value of “InputStat” to identify the first column in the PluginModel. The Flag attribute can also be used to identify optional flags passed as parameters to the function, as I do in the parameter declaration <Flag()> INCLUDE_COUNT As Boolean. Apparently, these flags must be Booleans, because I received the following warning in the FUNCTION_SIGNATURE column of DMSCHEMA_MINING_FUNCTIONS whenever I used other data types with a flag I called ATTRIBUTE_NUMBER that I later removed: “Invalid Flag type (must be Boolean) in method MyCustomMiningFunction:ATTRIBUTE_NUMBER.” The algorithm will still deploy correctly if these parameters are set wrong but will return error messages when prediction queries are executed against the custom functions. The parameter list may also include a MiningCase object, which is not exposed to end users and causes SSDM to automatically iterate over the case set supplied in a prediction join, OpenQuery, or other such statement. As we shall see, one of the primary challenges with custom mining functions is to calculate cumulative statistics by iterating over this MiningCase object.
…………You can return basically any values you want from your function, but if they’re scalar, the return type must be declared as System.Object. To return more than one value your only choice is to declare the return type as System.Data.DataTable. If your revised AlgorithmBase code compiles and deploys correctly, then the next time the service starts you should see your choice reflected in the RETURNS_TABLE column of DMSCHEMA_MINING_FUNCTIONS. Note how the query in Figure 1 reflects the name of the method we added to AlgorithmBase in the FUNCTION_NAME column, the string from the MiningFunction attribute in the DESCRIPTION and the lists of parameters in FUNCTION_SIGNATURE. Some other standard functions enabled for our sample plug-in are also listed in the results (minus the prediction functions I enabled for the last tutorial, which were removed to shorten the list).

Figure 1: The Custom Mining Function Depicted in DMSCHEMA_MINING_FUNCTIONS
DMSCHEMA_MINING_FUNCTIONSResult

…………Note that MyCustomMiningFunction is listed twice; this is because it is declared in AlgorithmBase and because SupportedFunction.CustomFunctionBase is included in the array returned by AlgorithmMetadataBase. Removing the function from AlgorithmBase without removing SupportedFunction.CustomFunctionBase from the array returned by AlgorithmMetadataBase.GetSupportedStandardFunctions led to this error on msmdsrv, when the algorithm failed to initialize: “The data mining algorithm provider (ProgID: 06da68d6-4ef0-4cea-b4dd-1a7c62801ed2) for the Internal_Name_For_My_Algorithm algorithm cannot be loaded. COM error: COM error: DMPluginWrapper; Parameter name: in_FunctionOrdinal Actual value was 10000.” Another “gotcha” you may encounter is the warning that “PREDICTION JOIN is not supported” for a particular mining model when you try to use the function in a query. After many comparisons to the original sample code provided in the plug-in software development kit (SDK) by Bogdan Crivat, one of the original developers of SSDM, I was able to narrow this down to the cases in which the GetCaseIdModeled method of AlgorithmMetadataBase returns True. Simply changing the return value for that method ought to fix the problem. Another common mistake you may encounter while testing custom mining functions is leaving off the ON clause in the PREDICTION JOIN, which may result in this warning: “Parser: The end of the input was reached.” As renowned SQL Server Analysis Services (SSAS) guru Chris Webb says in his helpful post Error messages in MDX SELECT statements and what they mean, this can also be due to missing parentheses, brackets and semi-colon marks. All of these myriad issues were fairly inconsequential in comparison to those that arose when I debugged the AlgorithmNavigationBase class in A Rickety Stairway to SQL Server Data Mining, Part 14.4: Node Navigation, which is far easier than troubleshooting AlgorithmBase and less difficult by several orders of magnitude than dealing with exceptions in AlgorithmMetadataBase. The biggest challenge I encountered while writing the sample code depicted in Figure 2 was calculating cumulative values across iterations over the MiningCase object, which were problematic to persist.
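
…………Before tackling that cumulative-value problem, for the sake of completeness, the GetCaseIdModeled fix mentioned above amounts to nothing more than an override along the following lines in AlgorithmMetadataBase. I’m assuming a parameterless Boolean signature here, which is what the return-value behavior described above implies, so double-check it against the SDK before copying it.

        Public Overrides Function GetCaseIdModeled() As Boolean

            'returning True here is what triggered the "PREDICTION JOIN is not supported" warning in my tests
            Return False

        End Function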

Figure 2: Sample Code for MyCustomMiningFunction
<MiningFunction("My Custom Mining Function Description")>
Public Function MyCustomMiningFunction(InputCase As MiningCase, RecordCount As Integer, <MiningScalarColumnRefAttribute()> TempAttributeGroup As AttributeGroup, <Flag()> INCLUDE_COUNT As Boolean) As System.Data.DataTable

            Static CaseCount As UInteger = 0 'the 0th case will be the Prepare statement

            'for whatever reason, the first InputCase is always Nothing
            'we can use that opportunity to set up the data table one time only
            If IsNothing(InputCase) = True Then

                MyCustomMiningFunctionDataTable = New System.Data.DataTable 'on the first iteration, reset the datatable
                MyCustomMiningFunctionDataTable.Columns.Add("Skewness", GetType(Double))
                MyCustomMiningFunctionDataTable.Columns.Add("Kurtosis", GetType(Double))
                MyCustomMiningFunctionDataTable.Columns.Add("ExcessKurtosis", GetType(Double))
                MyCustomMiningFunctionDataTable.Columns.Add("JBStatistic", GetType(Double))

                If INCLUDE_COUNT = True Then
                    MyCustomMiningFunctionDataTable.Columns.Add("TotalRows", GetType(UInteger))
                End If

                'once the data type is set, we can return it one time, in the event of a Prepare statement
                'I will have to double-check later and make sure this statement doesn't need to execute for each row during Prepared statements
                'Bogdan: "If the execution is intended for a Prepare statement, it returns an empty string. A Prepare execution must return data that has the same type and schema (in case of tables) with the real result."
                'so if it is only for a prepare statement, we'd send it back the empty data table?
                'chm: on Context.ExecuteForPrepare: "Determines whether the current query is preparing the execution or it is fully executing. Should only be used in the implementation of custom functions"
                If Context.ExecuteForPrepare = True Then
                    Return MyCustomMiningFunctionDataTable
                End If

            Else

                'dealing with actual cases; add them to the cumulatives so we can perform calculations at the end
                'for now we will deal with a single attribute for the sake of simplicity; it would be fairly easy though to extend this to deal with more than one attribute at a time

                Try

                    If CaseCount < RecordCount - 1 Then 'RecordCount is 1-based, CaseCount is not

                        'both variables are 0-based
                        If CaseCount = 0 Then
                            ReDim TempSkewnessKurtosisClassArray(0) 'initialize the arrays one time; this won't work during the Preparation phase when InputCase = 0, because other variables will be set to 1
                            ReDim TempDataPointArray(0, RecordCount - 1)
                            TempSkewnessKurtosisClassArray(0) = New SkewnessKurtosisClass
                        End If

                        TempDataPointArray(0, CaseCount) = New DataPointClass
                        TempDataPointArray(0, CaseCount).DataPoint = InputCase.DoubleValue
                        TempSkewnessKurtosisClassArray(0).TotalValue = TempSkewnessKurtosisClassArray(0).TotalValue + InputCase.DoubleValue
                        CaseCount = CaseCount + 1

                    Else

                        'add the last set
                        If IsNothing(TempSkewnessKurtosisClassArray(0)) Then TempSkewnessKurtosisClassArray(0) = New SkewnessKurtosisClass
                        TempDataPointArray(0, CaseCount) = New DataPointClass
                        TempDataPointArray(0, CaseCount).DataPoint = InputCase.DoubleValue
                        TempSkewnessKurtosisClassArray(0).TotalValue = TempSkewnessKurtosisClassArray(0).TotalValue + InputCase.DoubleValue
                        CaseCount = CaseCount + 1

                        'if we're on the final case, calculate all of the stats
                        Dim TempRow As System.Data.DataRow
                        TempRow = MyCustomMiningFunctionDataTable.NewRow()
                        TempSkewnessKurtosisClassArray(0).Mean = SkewnessKurtosisClassArray(0).CalculateAverage(RecordCount, TempSkewnessKurtosisClassArray(0).TotalValue)
                        TempSkewnessKurtosisClassArray(0).StandardDeviation = SkewnessKurtosisClassArray(0).CalculateAverage(RecordCount, TempSkewnessKurtosisClassArray(0).Mean)
                        TempRow.Item("Skewness") = TempSkewnessKurtosisClassArray(0).CalculateSkewness(TempDataPointArray, 0, TempSkewnessKurtosisClassArray(0).Mean, TempSkewnessKurtosisClassArray(0).StandardDeviation)
                        TempRow.Item("Kurtosis") = TempSkewnessKurtosisClassArray(0).CalculateKurtosis(TempDataPointArray, 0, TempSkewnessKurtosisClassArray(0).Mean, TempSkewnessKurtosisClassArray(0).StandardDeviation)
                        TempRow.Item("ExcessKurtosis") = TempSkewnessKurtosisClassArray(0).CalculateExcessKurtosis(TempSkewnessKurtosisClassArray(0).Kurtosis)
                        TempRow.Item("JBStatistic") = TempSkewnessKurtosisClassArray(0).CalculateJarqueBeraTest(RecordCount, TempSkewnessKurtosisClassArray(0).Skewness, TempSkewnessKurtosisClassArray(0).ExcessKurtosis)

                        If INCLUDE_COUNT = True Then
                            TempRow.Item("TotalRows") = RecordCount
                        End If

                        MyCustomMiningFunctionDataTable.Rows.Add(TempRow) 'add the row
                        CaseCount = 0 'reset the counter
                        Array.Clear(TempSkewnessKurtosisClassArray, 0, TempSkewnessKurtosisClassArray.Length) 'clear the arrays so that there is no overhead for using a class-scoped variable
                        Array.Clear(TempDataPointArray, 0, TempDataPointArray.Length)
                        Return MyCustomMiningFunctionDataTable

                        'then reset the counters so that the function starts off fresh the next time it is called
                        'any code I set after this to reset the DataTable won't be hit - so hopefully SSDM does garbage collection on its own; the best I can do is reinitialize the values in the event of an error
                        Exit Function 'so we don't Return it again

                    End If

                Catch ex As Exception

                    'if there's an exception the static and class-scoped variables for this function must be reset, otherwise they will interfere with the next execution
                    CaseCount = 0 'reset the counter
                    Array.Clear(TempSkewnessKurtosisClassArray, 0, TempSkewnessKurtosisClassArray.Length) 'clear the arrays so that there is no overhead for using a class-scoped variable
                    Array.Clear(TempDataPointArray, 0, TempDataPointArray.Length)
                    MyCustomMiningFunctionDataTable.Rows.Clear() 'clear all the rows
                    MyCustomMiningFunctionDataTable = Nothing

                End Try

            End If

            Return MyCustomMiningFunctionDataTable

        End Function

…………Since there are already a handful of better tutorials out there that demonstrate how to write a custom mining function (most notably Crivat’s sample code in the SDK), I wanted to make a contribution of sorts by pushing the boundaries of the architecture. Since the very basic mining method I developed in A Rickety Stairway to SQL Server Data Mining, Part 14.2: Writing a Bare Bones Plugin Algorithm merely returned some basic global stats for each column that I wanted to learn more about, like skewness, kurtosis and excess kurtosis, I figured it would be interesting to calculate the same figures for the values supplied in a prediction join. For good measure, I added a method to the SkewnessKurtosisClass I used in Post 14.2 to perform Jarque-Bera tests, which use measures of lopsidedness like skewness and kurtosis to determine how well datasets fit a normal distribution. I used the formula available at the Wikipedia page Jarque-Bera Test, which is fairly easy to follow in comparison to more complicated goodness-of-fit tests. I soon discovered, however, that keeping track of the cumulative stats required to calculate these measures is problematic when using custom mining functions. SQL Server will make numerous calls to your function, depending on how many input cases there are, but there isn’t a straightforward solution to keeping track of values across these calls. As a result, I had to create two class-scoped variable arrays that mirrored the AlgorithmBase’s SkewnessKurtosisClassArray and DataPointArray variables, which I explained how to use in Post 14.2. The static CaseCount variable I declared at the beginning of the function is used in conditional logic throughout the outer Else…End If statement to control the population of the TempDataPointArray. The DataPointClass objects that accumulate in this array with each iteration of the function are fed en masse to the routines that calculate skewness, kurtosis, excess kurtosis and the Jarque-Bera test when the CaseCount reaches the final input case. On this final pass, those four stats are also added to the single row that comprises the data table the function returns. This MyCustomMiningFunctionDataTable variable is defined as an instance of System.Data.DataTable in the outer If…Else block, which will only be triggered when InputCase has a value of Nothing. This condition is only met on the first pass through the function, which is apparently designed solely to provide a means of declaring the return variable. The Context.ExecuteForPrepare condition is only true on this first pass. Special exception-handling logic is required to reset the values of the class-scoped variables and the static CaseCount in the event of an error, which is comprised by the Catch…End Try block at the end of the method. The last five lines in the End If immediately above the Catch…End Try block perform pretty much the same function, but only after the final pass executes successfully. If these two blocks of code were not included, the CaseCount would continue to accumulate and the calculations of the TempDataPointArray and TempSkewnessKurtosisClassArray would at best throw exceptions, or worse yet, return incorrect results.
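…………For reference, the Jarque-Bera calculation itself boils down to a one-liner once the skewness and excess kurtosis are in hand. The sketch below is my reading of the Wikipedia formula rather than a verbatim copy of the CalculateJarqueBeraTest method in the SkewnessKurtosisClass from Post 14.2, so the method name and parameter list are mine, not that class’s.

        'a minimal sketch of the Jarque-Bera statistic, JB = (n / 6) * (S^2 + (K - 3)^2 / 4),
        'where n is the record count, S the sample skewness and (K - 3) the excess kurtosis
        Private Function CalculateJarqueBeraStatistic(RecordCount As Integer, Skewness As Double, ExcessKurtosis As Double) As Double

            Return (RecordCount / 6) * (Math.Pow(Skewness, 2) + (Math.Pow(ExcessKurtosis, 2) / 4))

        End Function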
…………Ideally, it would be easier to retrieve the count of input rows from the function itself or some variable exposed by SSDM, or at least find a way to determine when the function has exited, but thus far I have not found such a means. One alternative I tried was to use a CaseSet as an input parameter, but the Description column of the DMSCHEMA_MINING_FUNCTIONS rowset returned an “Invalid parameter type” error. Defining InputCase as an array of MiningCase objects will produce a different error when you run a Data Mining Extensions (DMX) query, including the text, “Too many arguments or flags were supplied for the function.” Nor can you return multiple rows, one at a time; this is why there are three Return statements in the function, one for the initial ExecuteForPrepare phase, one for returning all of the data en masse one time only in a single table, and another for cases in which an empty table is returned because there are no results to report. Returning a value of Nothing instead of the return type declared in the ExecuteForPrepare phase will raise an error, whereas returning an empty table with no rows is also undesirable, since it adds an extra line to the output. If you return a null value instead of the declared return type, you will also receive an error in SQL Server Management Studio (SSMS) worded like this: “COM error: COM error: DMPluginWrapper; Custom Function MyCustomMiningFunction should return a table, but it returned NULL.” There may still be errors lurking in the conditional logic I implemented in this function, but it at least amounts to the crude beginnings of a workaround for the problem of calculating stats dependent on cumulative values in custom mining functions. Would-be plug-in developers are basically stuck with iterating over the input cases one row at a time, rather than operating on an entire dataset, so some type of workaround involving static or class-scoped variables may be necessary when calculating cumulatives. Displaying the results with this type of hack is also problematic, as Figure 3 demonstrates. It is necessary to add another column to the results, as the first reference to “InputStat” does, otherwise FLATTENED results will only display a single blank row, but this means each row in the input set (taken from an OpenQuery on the PluginModel’s data source) will have a corresponding row in the results. In this case, each of the 20 input cases in our dataset is represented by the skewness value calculated for the training set of the InputStat column. It is possible to omit this superfluous column by removing the FLATTENED statement, but then it returns one empty nested table for each row in the input set. Either way, the only row we’re interested in is the last one, which depicts the skewness, kurtosis, excess kurtosis and Jarque-Bera statistic for the entire dataset. The TotalRows column is also populated with the value the user input for RecordCount when the INCLUDE_COUNT flag is specified; this takes just six lines of code in two separate sections of Figure 2, once to add the column to the return table and again to assign the value. To date, this is still impractical because I haven’t found a way to use a TOP, ORDER BY, sub-select or other clause to return only the row with the cumulative statistics we calculated, but my ugly hack at least shows that cumulatives can be calculated with custom mining functions. That’s not something that’s ordinarily done, but this comparatively short topic seemed an ideal place to contribute a little bleeding-edge experimentation. The “Rickety” disclaimer I affixed to this whole series is sometimes apt, because I’m learning as I go; the only reason I’m fit to write amateur tutorials of this kind is that there is so little information out there on SSDM, which is an underrated feature of SQL Server that is badly in need of free press of any kind. In the next installment of this series, we’ll delve into other advanced plug-in functionality, like custom mining flags, algorithm parameters and feature selection.

Figure 3: Sample Query Result for the Custom Mining Function (With the Cumulatives Hack)
CustomFunctionQueryResults
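
…………For readers who want to reproduce output along the lines of Figure 3, the prediction query I used was shaped roughly like the string below. The data source and source table names are hypothetical placeholders, the literal 20 is simply the RecordCount argument for my 20 input cases, and the argument order follows the function declaration in Figure 2 with the hidden MiningCase parameter omitted, so verify it against your own deployment before relying on it.

        'roughly the shape of the DMX behind Figure 3; MyDataSource and MyInputTable are hypothetical placeholders
        Dim PredictionQuery As String = "SELECT FLATTENED [InputStat], " & _
            "MyCustomMiningFunction(20, [InputStat], INCLUDE_COUNT) " & _
            "FROM [PluginModel] " & _
            "PREDICTION JOIN " & _
            "OPENQUERY([MyDataSource], 'SELECT InputStat FROM MyInputTable') AS T " & _
            "ON [PluginModel].[InputStat] = T.[InputStat]"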
