A Rickety Stairway to SQL Server Data Mining, Part 0.1: Data In, Data Out
In the first of a series of amateur tutorials on SQL Server Data Mining (SSDM), I promised to pull off an impossible stunt: explaining the broad field of statistics in a few paragraphs without the use of equations. What other SQL Server blog ends with a cliffhanger like that? Anyone who aims at incorporating data mining into their IT infrastructure or skill set in any substantial way is going to have to learn to interpret equations, but it is possible to condense a few key statistical concepts in a way that will help those who aren’t statisticians – like me – to make productive use of SSDM without them. These crude Cliff’s Notes can at least familiarize DBAs, programmers and other readers of these tutorials with the minimal bare bones concepts they will need to know in order to interpret the data output by SSDM’s nine algorithms, as well as to illuminate the inner workings of the algorithms themselves. Without that minimal foundation, it will be more difficult to extract useful meaning from your data mining efforts.
The first principle to keep in mind is so absurdly obvious that it is often half-consciously forgotten – perhaps because it is right before our noses – but it is indispensable to understanding both the field of statistics and the stats output by SSDM. To wit, the numbers signify something. Some intelligence assigned meaning to them. One of the biggest hurdles when interpreting statistical data, reading equations or learning a foreign language is the subtle, almost subconscious error of forgetting that these symbols reflect ideas in the head of another conscious human being, which probably correspond to ideas that you also have in your head, but simply lack the symbols to express. An Englishman learning to read or write Spanish, Portuguese, Russian or Polish may often forget that the native speakers of these languages are trying to express the exact same concepts that an English speaker would; they have the exact same ideas in their heads as we do, but communicate them quite differently. Quite often, the seemingly incoherent quirks and rules of a particular foreign language may actually be part of a complex structure designed to convey identical, ordinary ideas in a dissimilar, extraordinary way. It is the same way with mathematical equations: the scientists and mathematicians who use them are trying to convey ideas in the most succinct way they know. It is often easier for laymen to understand the ideas and supporting evidence that those equations are supposed to express, when they’re not particularly well-versed in the detailed language that equations represent. I’m a layman, like some of my readers probably are. My only claim to expertise in this area is that when I was in fourth grade, I learned enough about equations to solve the ones my father, a college physics teacher, taught every week – but then I forgot it all, so I found myself back at Square One when I took up data mining a few years back.
On a side note, it would be wise for anyone who works with equations regularly to consciously remind themselves that they are merely symbols representing ideas, rather than the other way around; a common pitfall among physicists and other scientists who work with equations regularly seems to be the Pythagorean heresy, i.e. the quasi-religious belief that reality actually consists of mathematical equations. It doesn’t. If we add two apples to two apples, we end up with four apples; the equation 2 + 2 = 4 expresses the nature and reality of several apples, rather than the apples merely being a stand-in for the equation. Reality is not a phantom that obscures some deep, dark equation underlying all we know; math is simply a shortcut to expressing certain truths about the external world. This danger is magnified when we pile abstraction on top of abstraction, which may lead to the construction of ivory towers that eventually fall, often spectacularly. This is a common hazard in the field of finance, where our economists often forget that money is just an abstraction based on agreements among large numbers of people to assign certain meanings to it that correspond to tangible, physical goods; all of the periodic financial crashes that have plagued Western civilization since Tulipmania have been accompanied by a distinct forgetfulness of this fact, which automatically produces the scourge of speculation. I’ve often wondered if this subtle mistake has also contributed to the rash of severe mental illness among mathematicians and physicists, with John Nash (of the film A Beautiful Mind), Nicolai Tesla and Georg Cantor being among the most recognized names in a long list of victims. It may also be linked to the uncanny ineptitude of our most brilliant physicists and mathematicians when it comes to philosophy, such as Rene Descartes, Albert Einstein, Stephen Hawking and Alan Turing. In his most famous work, Orthodoxy, 20th Century British journalist G.K. Chesterton noticed the same pattern, which he summed up thus: “Poets do not go mad; but chess-players do. Mathematicians go mad, and cashiers; but creative artists very seldom. I am not, as will be seen, in any sense attacking logic: I only say that this danger does lie in logic, not in imagination.” At a deeper level, some of the risk to mental health from excessive math may pertain to seeking patterns that aren’t really there, which may be closely linked to the madness underlying ancient “arts” of divination like haruspicy and alectromancy.
This digression into philosophy and mysticism is actually quite relevant to data mining, in order to avoid some serious potential pitfalls. One is the possibility of forgetting that the writer of an equation has assigned a meaning to it; typically, it’s not all just gobbledygook, although it may sometimes seem so to laymen like us. The most significant risk is the recognition of patterns that aren’t really there, or of assigning unwarranted meanings to patterns we’ve found or imagined; this is an error even the most highly trained mathematicians and logicians can fall into, including the leading data miners. To make matters even worse, intellectual dishonesty can lead people to deliberately assign unwarranted meanings to statistics, apply the wrong to stats to the wrong question, or resort to many other means of subterfuge to obscure the truth. Our politicians do it every day, as we all know. It is because of this that British Prime Minister Benjamin Disraeli allegedly coined the famous phrase, “There are three kinds of lies: lies, damned lies, and statistics.” There are ways of spotting dishonesty in logic and statistics, but my main purpose here is to help novice data miners avoid honest mistakes, a fate which can be avoided through a better understanding of SSDM. The data output by it has a common format using identical terms, which aids in the critical task of assigning meaning, but there are some subtle distinctions in how those terms should be interpreted from one algorithm to the next. Furthermore, setting inputs, predictable attributes and parameters incorrectly can lead to faulty assessments of the meaning of the output data, as can poor choices of algorithms or simply feeding them bad data. Garbage in, garbage out, as the saying goes. Data mining is more complex, however, in that we can taint our data by using the wrong garbage cans to collect it, so to speak. These algorithms are brilliantly designed, to the point where they can sometimes spit out useful results even when used haphazardly, but they do not work magic. They can only operate on the data they are fed; they cannot violate the Ex Nihilo principle that information cannot simply be created out of thin air. The Time Series algorithm, for example, cannot predict the weather using only temperature data as accurately as it could if we also included records about other variables like wind speed. DBAs and database programmers are normally skilled in modern methods of validation, de-duplication and other means of assuring data integrity, so putting garbage data in is theoretically less of a problem with SSDM. Over-interpretation of data, inputting the wrong kinds of data and the like are much more relevant issues with data mining.
Statistical methods are just tools designed to address some of these potential problems. Data mining cannot introduce new information that is not already present, but statistical methods can be used to conserve existing information and put it to use. It is akin to combining bubble plastic wrap and cardboard boxes to make sure that a package is not damaged in the mail, or better yet, picture it is as using many different methods like caulking, storm windows and the like to keep the heat from leaking out of your home. Statistical methods are no different from ordinary implements like hammers and screwdrivers, except that they are abstract tools used to conserve and present information in a useful way. Of course, we need to choose the right combination of statistical tools for a particular job, just as we need to be careful in the tools we use to build a house or repair a bicycle; just as we wouldn’t use a wrench to pry out a Phillips head screw, we wouldn’t measure how much two datasets resemble each other merely by comparing their modes. It might be better to picture some of the more sophisticated combination of statistical methods as a sort of contraption made up of different cameras and lights on armatures, which can illuminate and photograph a particular object from many directions all at once; humans cannot easily visualize such a device operating on objects of more than three dimensions, as we routinely encounter in SQL Server Analysis Services (SSAS) cubes, but the basic concept is still the same. Imagine identifying an object as an elephant by taking photographs from different directions simultaneously to reveal a tail and a trunk, then shining a light from another angle to reveal two tusks. This is basically what occurs in the field of statistics, which has a vast array of tools available to fit a lot of different applications. I’m only going to give a very brief summary of the tools applicable to SSDM, without going into the many variations on them that I’m sure competent statisticians could point out. For example, probability is a simple concept that I assume all of this blog’s readers can already understand. All of the nine algorithms make use of it, but there is no sense in delving into variations of it that SSDM doesn’t depend upon, like unordered pairs, compound events and the like.
SQL Server-Specific Statistics
It is easier to understand the statistics applicable to data mining when you realize that there is a kind of hierarchy among them, with the more complex algorithms and stats depending on varying combinations of certain basic building blocks. At the base of the pyramid are three measures of central tendency, i.e. the predilection of data to cluster around certain values: mean, median and mode. The last of these is the most difficult of the three to calculate (which requires the use of the Count aggregate and TOP function in T-SQL) but is the least applicable to SSDM; medians, on the other hand, can be useful if data is skewed in a particular direction. It is a basic building block of quantiles, which divide datasets into fragments of n size, as the NTILE windowing function does in T-SQL. Averages are the most important of these three measures of central tendency in SSDM, for they are a basic building block in several hierarchies of increasingly sophisticated statistical methods. For example, both SSAS and SSDM make frequent use of variance, which is a measure of how much variability there is among the values in a dataset; this is calculated by subtracting the mean from each number in the dataset, then squaring them to change any resulting negatives to positive numbers that can be used to compare distance from the mean. The results are then divided by the number of items in the dataset minus one. Standard deviation is simply the square root of the variance and is useful in showing how much variety there is in a dataset; for example, if you suspect that a few middle aged students are throwing off the average age of a college classroom, you could use standard deviation to investigate how much variety there is in the ages in your sample. Z-scores in turn measure how closely an individual value is related to the rest of the dataset by subtracting the mean and dividing by the standard deviation. Covariance likewise builds upon standard deviation, by averaging the products of the deviations of two datasets. It is useful in measuring how different two datasets are from each other, with high positive numbers signifying a close relationship, negative numbers signifying a pronounced inverse relationship and numbers around zero indicating a lack of a connection. Pearson Product Moment Correlation, often referred to simply as Correlation, in turn builds on that method by using the product of the standard deviations of two datasets as divisors with their covariance, to give a more accurate measure of the relationship of two datasets to each other. Because some of these methods are subject to a slight bias towards underestimation, particularly with small samples, a small adjustment known as Bessel’s Correction is sometimes made by dividing by the number of points in the dataset minus the number of parameters (usually 1). The result is known as an unbiased population formula, while statistics like this which don’t apply this correction are known as biased.
These statistical methods have their counterparts in MDX, T-SQL and DMX. Multidimensional Expressions (MDX), the language used in SSAS cubes, provides corresponding functions like Median, Stdev, StdevP (for biased standard deviations), Variance (or Var), VarianceP (or VarP, using the biased formula), Correlation, Covariance and CovarianceN (for unbiased covariance). T-SQL likewise provides the AVG, STDEV, STDEVP, VAR and VARP aggregates, which are now more useful than ever before thanks to several enhancements of the OVER clause. One of the most exciting developments in SQL Server 2012 is the complementing of ROW_NUMBER, RANK, DENSE_RANK, and NTILE with new windowing functions like LEAD, LAG, FIRST_VALUE, LAST_VALUE, PERCENTILE_CONT, PERCENTILE_DISC, PERCENT_RANK and CUME_DIST, all of which may prove useful in massaging data on the relational side before mining. In a nutshell, the latter four calculate relative ranks of a row as percents or within percentiles. For an in-depth discussion of these T-SQL enhancements, see Itzik Ben-Gan’s excellent book, Microsoft SQL Server 2012 High-Performance T-SQL Using Window Functions. DMX (Data Mining Extensions), the language for working in SSDM, provides several prediction functions which I will discuss towards the end of this series of tutorials, as well as statistical functions like PredictHistogram, PredictVariance, PredictStdev and PredictAdjustedProbability. The last of these uses an undocumented internal formula to correct probability estimates, to reduce the chances of a data value being considered probable even when it is rarely or never found in a dataset.
As I touched upon briefly in my last column, the nine algorithms Microsoft provides are much more sophisticated, sometimes by several orders of magnitude, but even the most refined of them depends upon basic building blocks like these. Linear Regression can also be calculated in MDX through the LinRegIntercept, LinRegPoint, LinRegR2, LinRegSlope and LinRegVariance functions, but that is about as far as we can go using solely MDX or T-SQL, unless we want to duplicate the functionality in SSDM’s algorithms, which may waste a lot of developer time and entail a lot of risk in reinventing the wheel. Logistic Regression can be seen as an adjustment to Linear Regression to constrain for minimum and maximum output values; it constitutes the equivalent of the Neural Network algorithm, except without a hidden layer of weighted neurons. Decision Trees in turn makes use of multiple regressions, with adjustments applied. Time Series also applies a long chain of corrections to data obtained through Decision Trees and moving averages, as I will discuss in greater detail when the time for that algorithm’s blog post comes up for discussion. Likewise, Sequence Clustering is a more sophisticated version of the Clustering algorithm optimized for chains of events. The simplest algorithm, Naïve Bayes, plays a hidden role in several of these more sophisticated cousins when SQL Server performs Feature Selection under the hood. Before a dataset is trained, SQL Server may optimize performance and the usefulness of the output under certain conditions, by capping the number of columns used in making predictions, as well as discarding those that it deems unlikely to provide any useful information. This may be determined by four different methods, two of which are sophisticated variations on Naïve Bayes based on data distributions, known as Bayesian with K2 Prior and Bayesian Dirichlet Equivalent with Uniform Prior. The Interestingness Score is also always used with continuous columns, while a probabilistic method known as Shannon’s Entropy is used for discrete and discretized values. Association Rules, Sequence Clustering and Time Series don’t make use of Feature Selection, while Clustering and Linear Regression must use the Interestingness Score, which Naïve Bayes cannot use. Most of the algorithms that make use of Feature Selection have parameters named MAXIMUM_INPUT_ATTRIBUTES and MAXIMUM_OUTPUT_ATTRIBUTES, which represent the highest number of inputs or predictable columns (the default for both is 255) that the algorithm will accept before applying Feature Selection. Setting the number to zero disables Feature Selection altogether, which may allow you to use inputs or outputs that it disallows, at the cost of possible degradations in performance and usefulness, of course. In algorithms that support it, MAXIMUM_STATES can also be used to cap the number of values considered for a column to the top n most popular values, with the rest of the values being treated as Missing. SQL Server Books Online (BOL) provides a more detailed look at Feature Selection, complete with links to technical articles on the Bayesian methods. If you choose your input columns well, however, you won’t have to worry about your choices being overridden by Feature Selection, which you can disable if necessary. The important thing to keep in mind is that this selective sampling of your data may be going on under the hood if you’ve reached the threshold for invoking Feature Selection, in which case some of your input or output data may unexpectedly be absent.
SSDM Output: A Bird’s Nest of Nested Tables
After you process a mining structure, SQL Server populates objects in an SSAS database with the results, which can be retrieved several different ways. The most sophisticated method is to view the charts and other fancy graphical illustrations on the Mining Model Viewer tab in BIDS or SSDT. Microsoft provides a Tree Viewer, Association Rules Viewer, Naïve Bayes Content Viewer, Neural Network Viewer, Cluster Viewer, Sequence Clustering Viewer and Time Series Viewer (each prepended by the company’s name). Logistic Regression uses the Neural Network Viewer because of the similarity of those two algorithms, just as Linear Regression shares the Tree Viewer with Decision Trees. Because these illustrations are so varied, I won’t go into detail on any of them until their particular algorithms come up for discussion in later posts. It is easier to discuss the common format they all share in the Generic Content Viewer, which you can select on the Mining Model Viewer tab, as depicted below:
This particular output is associated with a Sequence Clustering mining model, but the format is similar to that used by all of the other algorithms; essentially, the nine algorithms spit out disparate information, which is then stored in a common format, which the graphs in the Mining Model Viewer tab depict in dissimilar ways. Because the output of several disparate algorithms is mashed together like this, a lot of interpretation is required, especially since the data also represents several different levels of denormalized tables in flattened tree structures. This degree of denormalization and the common format are important considerations when working with DMX, as opposed to MDX and T-SQL. The most basic method to return this raw data is to run a DMX SELECT Content query against the name of the mining model, which I will deal with much later in the series. At that point I will discuss different ways of eliminating some of the redundancies in the data and how to separate some of the denormalized data it returns into a series of tables linked by foreign keys, but that is an advanced topic we don’t have to worry about right now. The Generic Content Viewer is useful in that it does some of the grunt work by representing some of the denormalized data in a tree structure for you, as you can see from the list of Clusters on the left side of the picture above. When you retrieve any of the same data with DMX, it will be returned as a flattened tree, with several different branches included in the same table. The way to differentiate each branch from the other is by the most important column returned in DMX Content queries, the NODE_TYPE. The tables these queries return contain multiple levels of the same tree structure, with the roots, leaves and intermediate nodes all sharing the same columns of data, but representing completely different meanings depending on the NODE_TYPE. The potential values vary considerably from one algorithm to the next, as Figure 1 demonstrates:
Once we know the NODE_TYPE, we can determine what kind of node we are dealing with and interpret the values in the other columns returned by the DMX Content Query, which are represented on the right side of the previous image of the Generic Content Viewer. The information returned in the Generic Content Tree Viewer and DMX content queries contains some obvious or superfluous information, such as the MODEL_CATALOG and MODEL_NAME, which are simply the name of the SSAS database the data mining objects and metadata are contained in and the name of the mining model. NODE_NAME and NODE_UNIQUE_NAME are useful in that they can be used to identify a particular node by its assigned number and help correlate it with other nodes, but they are redundant. PARENT_UNIQUE_NAME is also used to help correlate child nodes with their parents, while CHILDREN_CARDINALITY can help identify how many children a node has. NODE_DESCRIPTION, NODE_CAPTION and ATTRIBUTE_NAME can all be used to identify the purpose of a node better, by giving a text explanation, such as a list of columns in a cluster with either of the Clustering algorithms. There is quite a bit of overlap here though, especially when the XML representations in MSOLAP_MODELCOLUMN and MSOLAP_NODE_SHORT_CAPTION are thrown into the mix. To add to the confusion, some of the nodes are not included at all in this common format when you perform a DMX SELECT content query with certain algorithms, such as NODE_RULE and MARGINAL_RULE in the case of Clustering. At other times, some of these columns are simply left blank, set to NULL or identified as Missing, depending on the algorithm and NODE_TYPE in question. The same is also true with respect to the four columns that provide actual statistics: NODE_SUPPORT (the count of cases), NODE_PROBABLITY, MARGINAL_PROBABILITY and MSOLAP_NODE_SCORE. Refer to the table below for more details:
The presence of multiple levels of nodes in a single table denotes a high degree of denormalization, but NODE_DISTRIBUTION takes that one step further by nesting another table within each row returned. Each NODE_DISTRIBUTION contains a table of values with a common format, including columns for ATTRIBUTE_NAME, ATTRIBUTE_VALUE, SUPPORT, PROBABILITY and VARIANCE whose meaning may vary in dramatic or subtle ways from one NODE_TYPE or algorithm to the next. It also returns a flag named VALUETYPE, which may cause further confusion by assigning a different meaning to the data in a particular row of the NODE_DISTRIBUTION table. For example, if we encounter a NODE_TYPE of 28, that means we are dealing with a periodic structure of the ARIMA algorithm within Time Series. The NODE_DISTRIBUTION table for that row will contain its own rows, which may assigned different meanings to ATTRIBUTE_NAME, ATTRIBUTE_VALUE, SUPPORT, PROBABILITY and VARIANCE, depending on whether the VALUETYPE for that subordinate row’s VALUETYPE is 12 for Periodicity, 13 for Auto Regressive Order, 14 for Moving Average Order or 15 to signify a Difference Order. As with NODE_TYPE, the values represented in VALUE_TYPE are sometimes limited to certain algorithms; the values 12 through 15 are unique to Time Series, for example. The figure below lists the potential values for the VALUETYPE flag:
Figure 3: Potential Values for VALUETYPE in a NODE_DISTRIBUTION Table
This hodgepodge of nested tables and separate meanings for columns within each table, depending on the NODE_TYPE or VALUETYPE, can create a lot of confusion. Sometimes the differences in meaning between statistics like NODE_PROBABILITY and VARIANCE can also vary in subtle ways. In the future, I’ll discuss some methods I’ve devised (possibly with the help of other sources on the Internet) of getting the data into T-SQL queries and sorting out this confusing mass of information. As long as we’ve fed the right algorithms enough good data, there should be some useful information buried in there, which we can dig out, by mining the mining results so to speak. Comparing the results of the nine algorithms can be a little like comparing apples and oranges at times, so it is remarkable that SQL Server’s Data Mining Team was able to devise a basket capable of holding the results of all of them, in the form of this common format. To really understand the results, however, we will have to discuss the inner workings of each algorithm separately, since each is essentially a different flavor, or rather a different kind of tool that answers dissimilar questions. In next week’s post, A Rickety Stairway to SQL Server Data Mining, Algorithm 1: Not-So-Naïve Bayes, I’ll demonstrate how to set up a mining experiment on data that DBAs are more likely to be familiar with, like the output of sys.dm_os_wait_stats, sys.dm_io_virtual_file_stats and the System Log.That ought to make it easier to interpret the results of the first graphs we encounter, in the Naïve Bayes Content Viewer.
 See Chesterton, G.K., 2001, Orthodoxy. Image Books: London.
 Mark Twain attributed it to Disraeli, but it is unclear if he was actually the source. See the Wikipedia page “Lies, Damned Lies, and Statistics”, available at http://en.wikipedia.org/wiki/Lies,_damned_lies,_and_statistics .
 For a readable primer on the topic, see Wagner, Susan F., 1991, Introduction to Statistics. HarperPerennial: New York.
 For a more in-depth discussion, see the Wikipedia page “Unbiased Estimation of Standard Deviation,” available at http://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation .
 pp. 68-74, Ben-Gan, Itzik, 2012, Microsoft SQL Sever 2012 High-Performance T-SQL Using Window Functions. O’Reilly Media, Inc.: Sebastopol, California.
 For a discussion of adjusted probability in SSDM, see the post at the SQL Server Data Mining Forum on June 2, 2009 titled “Adjusted Probability, Again.” Available at http://social.msdn.microsoft.com/Forums/en/sqldatamining/thread/e804ceda-f8f7-4a4b-846e-717b0ee20518 . Also see “Probability vs. Adjusted Probability,” posted at sqldev.org on Oct. 5, 2009. Available at http://www.sqldev.org/data-mining/probability-vs-adjusted-probability-2169.shtml .
 I adapted this from Books Online in a roundabout way, first by pasting the table into a DMX query so that I could create a CASE statement, then pasted it back here and added some simple commentary. So I suppose BOL should ultimately get the credit for this piece of information (as it should for a lot of other things I write about in this series), but I’m having trouble finding the original page I cited it from.