A Rickety Stairway to SQL Server Data Mining, Part 12: Accessing SSDM with AMO, ADOMD, SSIS, XMLA and RS

by Steve Bolton

                It may set a SQLServerCentral record for the thickest alphabet soup of acronyms ever crammed into a single title, but this post neatly wraps up numerous alternative means of accessing SQL Server Data Mining (SSDM) programmatically. In four previous posts in the series of amateur self-tutorials on SSDM (as well as a brief wrap-up of four validation procedures in last week’s article), I introduced Data Mining Expressions (DMX), the SQL-like language normally used to access SQL Server mining data. It takes far less time and effort to cover the other programmatic means, since many of them have quite limited use cases and act as wrappers for DMX code. When training in new software packages, users typically learn to perform automation tasks and write custom code at the tail end of the process, since these are advanced lessons that few of them will ever need to put to practical use. Furthermore, users need to understand the features they’re automating or customizing first, which is why we tackled such heavy but indispensable topics as SSDM’s nine algorithms and how to validate mining models earlier in the series. My usual disclaimer that I’m writing this series in order to familiarize with SSDM better while giving the most under-utilized component of SQL Server some badly needed free press is especially true in this week’s post, because I’ve only recently practiced some of the functionality I’ll be introducing. I have been vaguely familiar for years with the whole alphabet soup of AMO (Analysis Management Objects), ActiveX Data Objects Multi-Dimensional for .Net (ADOMD.Net), the SQL Server Integration Services (SSIS) and Reporting Services (RS) functionality pertaining to data mining and XMLA (XML for Analysis), but never used them in-depth until preparing to write this post. As usual, I learned a few things along the way, which means one of the primary purposes of this series was served. Hopefully it will also contribute to another purpose of this series, by demonstrating to other SQL Server DBAs that they too can make productive use of SSDM without a formal background in statistics, just as they don’t need the ability to build a custom motorcycle in order to ride one.
                One of the benefits of a series like this is that it can illuminate some of the pitfalls fellow amateurs should watch out for when learning the product. A couple of these “gotchas” occur in Reporting Services, which has fairly simple functionality for one-way retrieval of DMX queries, solely for presentation purposes, in the singular medium of RS webpages. I’ll assume for the sake of convenience that readers already understand the basics of Reporting Services and how to create reports and related objects through two rival means, Report Builder and Repotting Services projects built with Visual Studio. Many tutorials posted on computing topics are bloated with steps that are self-evident or are basic to the point of being unnecessary, so throughout this series I’ve avoided in-depth discussions of how to do such simplistic tasks as creating projects in Visual Studio. Likewise, I’ll assume readers are already familiar with Report Builder, since executing DMX queries within it would be considered an advanced lesson in a tutorial on RS itself. Suffice it to say that Report Builder is an application developers download through their browsers to write reports with, while Visual Studio Reporting Services projects represent an alternate means that provides additional functionality. Some of the GUI elements of these tools are similar in appearance and are used for nearly identical purposes, but there are some subtle differences between them. The most important dissimilarity for our purposes is that DMX queries apparently can’t be written with Report Builder.

Figures 1 and 2: The Visual Studio Report Designer Interface vs. Report Builder
ReportDesignerWindow
ReportBuilderWindow

                In Figure 1, the mouse is positioned near a hammer icon that allows users to toggle the Query Designer to DMX mode; hovering over it shows the label “Command Type DMX.” The GUI in Report Builder is practically identical, except there is no hammer icon at the same mouse point in the main menu. The column names in the left cell are for a fake cube I had to add to the MonitorMiningProject we’ve been using throughout this series, in order to bypass a warning that “No cubes can be found”” when creating a data source with Report Builder. If you try to generate a report model from an RS data source through Report Builder, it only includes the cubes in your database but not the mining models. The interoperability problems between Report Builder and Visual Studio’s Report Designer don’t end there though. After creating an RS project in Visual Studio, followed by an RS data source and RS dataset, you can access the DMX Query Designer depicted in Figure 3. It works almost exactly like the Mining Model Prediction tab we already covered in A Rickety Stairway to SQL Server Data Mining, Part 10.3: DMX Prediction Queries, so I won’t belabor the point. The icon next to the mouse point allows you to edit the DMX query text directly, which is often easier for the same reasons that writing T-SQL by hand is often preferable to writing queries through SQL Server Management Studio (SSMS)’s GUI tools. The icon in between the hammer and the Import… menu item allows you to toggle back to writing Multidimensional Expressions (MDX) code for SQL Server Analysis Services (SSAS) cubes. The process of writing and deploying the datasets that use DMX is straightforward with Report Designer, until you try to open reports written with them in Report Builder. You may receive an error to the effect that “one or more data sources is missing credentials” because the data sources you created in Report Designer are left blank, even though Report Builder is capable of  opening the datasets you built from them. Adding the correct data source from your Report Server folders through Report Builder leads to the same error. If you click on the Data Source Properties and click Test Connection, it may succeed. Yet if you switch to the Credentials tab, you may see it set to “Do no use credentials,” regardless of what the setting actually is in the original data source; furthermore, all of the radio buttons to change this to the correct setting may be greyed out. I leave this for Reporting Services aficionados to puzzle over, since it’s a little beyond the scope of this series. Suffice it to say that a data miner is likely to have less difficulty using Report Designer than Report t Builder, whenever it is possible to choose between them.

Figure 3: The Query Designer in the Reporting Services Report Designer
RSQueryDesigner

                There isn’t much more to say about the data mining features SSIS provides, which is only a little more sophisticated than what we’ve already covered with Reporting Services. The two components of SQL Server are designed for different purposes though, one for extraction, transformation and loading  (ETL) and the other for publishing views of data on the Web, so any functionality they duplicate isn’t really redundant. SSIS provides two tasks specifically for SSAS, but the Analysis Services Execute DDL Task does not work with mining objects; I verified this by using the DMX code DROP MINING MODEL [SQQueryTextIDView 1], which threw a “DDL is not valid” error even before I could run the package.

Figure 4: The Analysis Services Processing Task Using Mining Models
SSISProcessingTask

                As you can see from Analysis Services Processing Task Editor in Figure 4, it is possible to use the other SSAS task to process mining objects; many of the mining models we used for practice purposes throughout this series are listed in the Add Analysis Services Object dialog in the bottom left corner. SSIS also provides a dedicated Data Mining Query Task with three tabs, the first of which allows you to select a mining model from the SSAS database you’re currently connected to. The DMX code you want to execute is supplied on the Query tab, which also has child Parameter Mapping and Result Set tabs for coordinating the data returned by the query with SSIS objects like package variables. The Build New Query button brings up the same aforementioned graphical DMX query builder used in SSMS and Report Designer. As depicted in Figure 7, the Output tab allows users to insert the results into a relational table using the specified relational connection. It appears to automatically prepend the relational database’s default schema – i.e., dbo in most cases – so don’t try tacking a schema onto the table name in the Output tab. One day I may attempt a performance comparison between the SSIS method of inserting DMX results (which seems to perform a bulk insert, from what I can glean from a Profiler trace) and the one I discussed in A Rickety Stairway to SQL Server Data Mining, Part 10.4: Data Mining Dimensions vs. Drastic Dynamic DMX, using OpenQuery. I mentioned in that post that there are three further operations we can performed on data once it has been mined, including presentation to the end user for human analysis by a real intelligence, plugging it back into the mining process for further refinement, or automated actions that are taken without human intervention. SSIS is a good candidate tool for the second and is really the only option for the third. Recursive mining pipelines which pump data in a circle from relational tables into cubes and mining algorithms, then back to the relational side again for another round trip reside on the cutting edge of data mining, given that SQL Server (and probably none of its competitors) doesn’t provide a way to do it out of the box. Automated action is beyond the bleeding edge, into uncharted regions only futurists dare to explore.

Figures 5 to 7: The SSIS Data Mining Query Task and Its Tabs
DataMiningQueryTaskMiningModelTab
DataMiningQueryTaskQueryTab
DataMiningQueryTaskOutputTab

                Whenever that future comes to pass, automated action based on mining results will be performed by tools like SSIS. It is unlikely to be performed by XMLA, which is merely the language SSAS uses to communicate with other components, using HTTP and the Simple Object Access Protocol (SOAP). Thankfully, server communication is ordinarily performed through a compressed binary version of XMLA[i], which is otherwise verbose to the point that it would waste server resources. Both the code and data returned by XMLA are illegible and obese, but SSMS provides a button for XMLA queries for those souls intrepid or foolhardy enough to attempt the feat of writing an XMLA query by hand. Except for checking performance counters or performing backup operations, I have yet to find a use case for writing XMLA by hand, and I question why examples like these are even necessary. The relational side of SQL Server doesn’t demand that you access management information or perform important tasks like backups in such a wasteful manner; imagine the outcry – or laughter by our competitors at Oracle – if our relational DBAs had to hand-code XML each time they wanted to check a performance counter or perform a backup. Figure 8 shows an example of how to query the DMSCHEMA_MINING_COLUMNS rowset, which provides metadata in a manner similar to a dynamic management view (DMV). Using SELECT * FROM [$system].[DMSCHEMA_MINING_COLUMNS] we can perform the same query with one line of code that takes milliseconds to type. Equivalent relational code likewise takes one line: SELECT * FROM sys.columns. The query in Figure 8 takes eight lines, while the output requires a long scrollbar. Simply reading through the code to figure out what table is being referenced is difficult enough, but tuning out all of the noise in the XML results is practically impossible. I hereby throw down the gauntlet and challenge anyone on the planet to a duel, to see how much time it takes to type XMLA code or read its results, vs. alternative means like DMX. It is possible to run DMX Select queries in XMLA by encapsulating them with tags like <Execute> and <Statement>, but they suffer from the same illegibility.

Figure 8: A Typical Bloated XMLA Query
XMLAQuery

               You’re much better off simply issuing a DMX query, or even encapsulating it in a T-SQL OpenQuery statement, as I discussed in the last tutorial; no matter how complex the syntax gets, it can’t be harder to wade through such a mess. It is a mystery to me why languages like MDX and DMX haven’t been upgraded with T-SQL style keywords like BACKUP; perhaps it’s all part of the XML programming craze that began a decade ago, and went overboard on occasion by reviving verbose declarative programming and reversing the long-term industry trend towards encapsulation of code. I would love Windows Presentation Foundation (WPF), except for the fact that I now find myself hand-typing Extensible Markup Language (XAML) code, provoking flashbacks of slaving over typesetters at various newspapers in the 1980s. I get the same sinking feeling of retrogression when I’m forced to use PowerShell, which can be utilized to issue ADOMD and XMLA commands, thereby further eroding encapsulation and legibility, in addition to vastly increasing the amount of time it takes to type commands. As a courtesy, I will post a few links here to those who want to use PowerShell with SSDM, such as ScriptingGuy’s Use PowerShell to Perform SQL Data Mining page and the Marktab.net pages Programming SQL Server Data Mining with PowerShell 2.0 and Ed Wilson Interview — Microsoft PowerShell. I also ran across an example of someone who attempted to perform data mining entirely within PowerShell, without accessing any tool like SSDM. That’s impressive in the same sense that walking on stilts deserves applause, but I wouldn’t use it in a production environment, for the same reason that I wouldn’t try to run from a hungry lion on stilts. It’s just not practical. Data mining is difficult enough without the added clutter imposed by verbose programming languages, so whenever possible, substitute DMX code, SSIS tasks or ADOMD.Net. Perhaps there are use cases for XMLA and PowerShell with SSDM that I haven’t thought of yet, so I will bite my lip about my true inner feelings about both, which are whispered about by other SQL Server users in hushed tones. At the risk of starting a nuclear flame war that might burn down half of the SQL Server blogosphere, I will question its usefulness as  a SQL Server tool in a Windows environment, until someone gives me clear examples of use cases where writing PowerShell code is faster than equivalent SQL Server Management Object (SMO) code. If it can access functionality that doesn’t yet exist in SMO, it needs to be added ASAP, because SMO beats PowerShell hands down in code reuse. This is one of the many Achilles Heels which make command line tools such a drag on the whole computing industry; they usually perform the exact same tasks as their GUI counterparts, but with a lot more unnecessary and arcane code that is bound to slow down even those who are proficient at using them. Usually I’m open-minded about programming tools, where “the more the merrier” is sometimes the case – as long as there is a valid use case, no matter how narrow, that the tool can perform better than others. There’s no point in keeping a hand crank telephone in your kitchen though, except perhaps for aesthetic reasons. I suppose that a new market could be created for hand crank telephones if AT&T and the other leading telecommunications companies added a few bells and whistles and mounted a big propaganda campaign, painting them as a “hot technology.” But I wouldn’t want to have to dial 911 with one in an emergency.
               Data mining is a high-level activity that is not going to leave you much time or mental energy for monkeying around with low-level tasks like hand-editing XML or PowerShell scripts on a regular basis. In practically every use case, you’re better off writing .Net code using AMO and ADOMD classes, which perform the same automation tasks as SMO, except for Analysis Services. They are used for DDL and DML respectively and can be programmed with any of the .Net languages. I went to the trouble of getting a Microsoft Certified Solution Developer (MCSD) certification in Visual Basic 6.0 back in the day and have kept up with the language ever since, so I’ll stick with VB in the examples I provide. It’s a matter of taste, but I consider the syntax to be more legible than C#. Unfortunately, I’ve watched the quality of Microsoft’s Visual Studio documentation decline precipitously with each passing release since VB 5.0, with the latest manifestation being the elimination of many useful code examples in favor of C#, which is clearly Microsoft’s preferred programming primus inter pares now. That is also one of the reasons why I have seen calls for AMO and ADOMD code samples written in VB.Net proliferate in recent years. I don’t recall running across any yet, so I’ll provide a few in this post.
                The DLLs Visual Studio needs to reference for AMO and ADOMD programming may not already be installed with SQL Server 2012 Developer Edition, in which case you will need to download SQL_AS_AMO.msi and SQL_AS_ADOMD.msi from the SQL Server 2012 Feature Pack webpage. You will also see a SQL_AS_OLEDB.msi installer for OLEDB programming, which I’m not familiar with. You can also download SQL_AS_DMViewer.msi if you want to use the Data Mining Viewer control in a Visual Studio project, but the disappointing thing is that it only works in Windows Forms 2.0, which is rapidly becoming obsolete. It was already superseded by WPF in Visual Studio 9.0 and 10.0, but even WPF may be on its way out, should Microsoft succeeded in its wildest dreams with its unworkable Metro interface. After finishing installation of AMO and AMOMD .msi packages (should it prove necessary), set references in your Visual Studio project to Analysis Management Objects and Microsoft.AnalysisServices.AdomdClient.dll. If the latter is not visible in the list of available references, try navigating to your application’s Properties command under the Project menu item, then select the Compile tab and click on the Advanced Compile Options… button hidden at the bottom of the page. Changing the Target framework from .net Framework 4 Client Profile to .Net Framework 4 in the Advanced Compiler Settings dialog window fixed this problem for me.[ii] For some tasks we will cover in next week’s tutorial on ADOMDServer stored procedures it may also be necessary to set a reference in your Visual Studio project to msmgdsrv.dll, which may be located in your Program Files\Microsoft Analysis Services\AS OLEDB\110 folder.
                As usual, I’m not going to clutter thise tutorial with unnecessary or irrelevant steps, so I’ll assume readers have a working knowledge of the Visual Basic language and how to create and manage solutions in Visual Studio.Net 2010. After creating a project, you must add an Imports Microsoft.AnalysisServices statement at the top of any code window in which you want to reference the AMO or ADOMD object models. The next task is usually to instantiate some items at the apex of the object model, such as the server and database objects in the AMO example in Figure 9. Typically when working with SQL Server, one of the first things we need to do is open a connection – which can be trickier than it seems, given that connection strings for any components of all database server software in general seem to be poorly documented. Visual Studio simplifies the rest of the process, because you can use a combination of the Object Browser, Intellisense and watches to figure out what an object model’s items can do. Quite often I can figure out how to write code for an object model on my own faster this way than by consulting the documentation or professional tutorials. I suppose I should use some of that saved time to learn how to post VB code properly; this is my first attempt, and I obviously have much to learn about incorporating line numbers and wrapping text. The word “Temp” is just the flag I’ve used for years to indicate local variables. Yet it is fairly easy to decipher the code in Figure 9, which loops through the mining structure collection of a database and the mining models collection of each structure to display some important properties. In most cases of AMO and ADOMD objects, you can also use standard VB collection syntax to perform operations like Add, Remove, Count and Item. If we chose to, we could also dig down deeper into lower-level collections, like the Columns collections for both the structures and the models. This example goes four levels below the parent database, into the model algorithms and their parameters.

Figure 9: AMO Code to View Mining Object Properties
Dim TempServer As New Microsoft.AnalysisServices.Server
Dim TempDatabase As New Microsoft.AnalysisServices.Database

TempServer.Connect(“Data Source=127.0.0.1:2384;Initial Catalog=MonitorMiningProject”)
TempDatabase = TempServer.Databases(“MonitorMiningProject”)

For I = 0 To TempDatabase.MiningStructures.Count - 1

Debug.Print(“Structure Name: “ + TempDatabase.MiningStructures(I).Name)
       Debug.Print(HoldoutActualSize: ” + TempDatabase.MiningStructures(1).HoldoutActualSize.ToString
      
Debug.Print(“HoldoutMaxPercent: “ + TempDatabase.MiningStructures(1).HoldoutMaxPercent.ToString) ‘you can set these structure properties as well etc.
      
Debug.Print(“HoldoutMaxCases: “ + TempDatabase.MiningStructures(1).HoldoutMaxCases.ToString)
      
Debug.Print(“HoldoutSeed: “ + TempDatabase.MiningStructures(1).HoldoutSeed.ToString)

For J = 0 To TempDatabase.MiningStructures(I).MiningModels.Count – 1
          Debug.Print(    + TempDatabase.MiningStructures(I).MiningModels(J).Name()
          Debug.Print(         + TempDatabase.MiningStructures(I).MiningModels(J).Algorithm())

 For K = 0 To TempDatabase.MiningStructures(I).MiningModels(J).AlgorithmParameters.Count – 1
            
Debug.Print(         + TempDatabase.MiningStructures(I).MiningModels(J).AlgorithmParameters(K).Name
+ ” “ + TempDatabase.MiningStructures(I).MiningModels(J).AlgorithmParameters(K).Value.ToString)
  
        Next K
     
Next J
Next I       We can also instantiate or delete objects and change most of their properties. In fact, we can do things with AMO that aren’t possible in DMX, such as creating data sources and data source views (DSVs). It is necessary, however, to call the Update command on parent objects before creating objects that reference, which is why the data source in Figure 10 is created and updated first, then the DSV, followed by the mining structure and its model. It is also necessary to set a unique ID and often the Name property for many new objects as well. The data source is little more than a glorified connection, but the DSV requires code showing SSDM how to load data from the relational table or cube that supplies the data to be mined. In this case, we’ll be selecting some records from Monitoring.dm_os_wait_stats, a table of wait stat data that we’ve been using throughout this series for practice data. The dataset/adapter/SQLCommand code depicted below is fairly standard; I double-checked the syntax against Data Mining with Microsoft SQL Server 2008, the classic reference written by former members of Microsoft’s Data Mining Team, but there really aren’t many different ways you can code this. I did depend on their reference to learn how to do data bindings though.[iii]

Figure 10: AMO Code to Create New Mining Objects
create the data source
Dim NewRelationalDataSource As New RelationalDataSource
NewRelationalDataSource.ConnectionString = “Provider=SQLNCLI11.1;Data Source=MYSERVER;Integrated Security=SSPI;Initial Catalog=Monitoring”
NewRelationalDataSource.Name = NewRelationalDataSource
NewRelationalDataSource.ID = NewRelationalDataSource

TempDatabase.DataSources.Add(NewRelationalDataSource)

in this case, the DataSource must exist server side before we can create the DSV below that references it
TempDatabase.Update(UpdateOptions.ExpandFull)

 

create a DSV that references the dataset
Dim TempDataset As New System.Data.DataSet

Dim TempDataAdapter As New System.Data.SqlClient.SqlDataAdapter
Dim TempSQLCommand As New System.Data.SqlClient.SqlCommand
Dim TempSQLConnection As New System.Data.SqlClient.SqlConnection

TempSQLConnection.ConnectionString = “Data Source=MYSERVER;Integrated Security=SSPI;Initial Catalog=Monitoring”
TempSQLCommand.Connection = TempSQLConnection
TempSQLConnection.Open()

TempSQLCommand.CommandText = “SELECT TOP 100 ID, WaitTypeID FROM Monitoring.dm_os_wait_stats ‘this is a table of practice data we used throughout the series of tutorials
TempDataAdapter.SelectCommand = TempSQLCommand
TempDataAdapter.Fill(TempDataset, WaitTypeTable)
TempSQLConnection.Close()

Dim NewDataSourceView As New Microsoft.AnalysisServices.DataSourceView
NewDataSourceView.DataSourceID = NewRelationalDataSource
NewDataSourceView.Name = NewDataSourceView
NewDataSourceView.ID = NewDataSourceView
TempDatabase.DataSourceViews.Add(NewDataSourceView)

use the Update command after performing DDL operations on the database
in this case, the DSV must exist server side before we can create the mining structure below that references it
TempDatabase.Update(UpdateOptions.ExpandFull)

 create a new mining structure based on the new data source

Dim NewMiningStructure As New Microsoft.AnalysisServices.MiningStructure
Dim IDStructureColumn As New Microsoft.AnalysisServices.ScalarMiningStructureColumn
Dim WaitTypeIDStructureColumn As New Microsoft.AnalysisServices.ScalarMiningStructureColumn

set the structure binding to the new data source view
NewMiningStructure.Source = New DataSourceViewBinding(NewDataSourceView)

now set various column properties and bindings
IDStructureColumn.Content = “Key”  ‘don’t forget to set this for at least 1 column
IDStructureColumn.IsKey = True ‘another “Key” property to set ;)
IDStructureColumn.Type = “Long”
IDStructureColumn.Distribution = “Uniform”
IDStructureColumn.Name = “ID”
IDStructureColumn.ID = “ID”
IDStructureColumn.KeyColumns.Add(WaitTypeTable, “ID”, System.Data.OleDb.OleDbType.BigInt)
‘ set the data bindings
WaitTypeIDStructureColumn.Content = “Discretized” ‘the Content property can also refer to ClassifiedColumns types, which are relevant to the advanced topic of plug-in algorithms
WaitTypeIDStructureColumn.DiscretizationBucketCount = 15
WaitTypeIDStructureColumn.DiscretizationMethod = “EqualAreas”
WaitTypeIDStructureColumn.Type = “Long”
WaitTypeIDStructureColumn.Distribution = “Uniform”
WaitTypeIDStructureColumn.Name = “WaitTypeID”
WaitTypeIDStructureColumn.ID = “WaitTypeID”
WaitTypeIDStructureColumn.ModelingFlags.Add(“NOT
NULL”
)
WaitTypeIDStructureColumn.KeyColumns.Add(WaitTypeTable, “WaitTypeID”, System.Data.OleDb.OleDbType.BigInt)

set the data bindings
NewMiningStructure.Name = VBPracticeMiningStructure
NewMiningStructure.ID = VBPracticeMiningStructure
NewMiningStructure.Columns.Add(IDStructureColumn)
NewMiningStructure.Columns.Add(WaitTypeIDStructureColumn)
 
use the Update command after performing DDL operations on the database
TempDatabase.MiningStructures.Add(NewMiningStructure)
TempDatabase.Update()

Dim NewMiningModel As New Microsoft.AnalysisServices.MiningModel
NewMiningModel.Name = VBPracticeMiningModel
NewMiningModel.Algorithm = Microsoft_Clustering
NewMiningModel.AlgorithmParameters.Add(“CLUSTER_COUNT”, 15)

Dim NewMiningModelColumn As New Microsoft.AnalysisServices.MiningModelColumn
NewMiningModelColumn.Name = “ID” ‘the mining model must include the case key column
NewMiningModelColumn.Usage = “Key” ‘yet another “Key” property to set ;)
NewMiningModelColumn.SourceColumnID = “ID” ‘this is the ID of Structure column
NewMiningModel.Columns.Add(NewMiningModelColumn)

Dim NewMiningModelColumn2 As New Microsoft.AnalysisServices.MiningModelColumn
NewMiningModelColumn2.Name = “WaitTypeID”
NewMiningModelColumn2.SourceColumnID = “WaitTypeID” ‘this is the ID of Structure column
NewMiningModel.Columns.Add(NewMiningModelColumn2)

use the Update command after performing DDL operations on the database
TempDatabase.MiningStructures(“VBPracticeMiningStructure”).MiningModels.Add(NewMiningModel)
NewMiningStructure.Update()
NewMiningModel.Update()

Debug.Print(TempDatabase.MiningStructures(1).LastProcessed.ToString)
            process a structure or model
           TempDatabase.MiningStructures(1).Process()
            TempDatabase.MiningStructures(1).MiningModels(1).Process()

             The example above creates two columns from the new DSV, ID and WaitTypeID, and adds them to the columns collection of a new mining structure. Pretty much all of the same structure, model and column properties you can set in the GUI of SSDM are available in the object model, plus a few that aren’t. At least one of the columns must have  a Content type of Key and have its IsKey property set to true, otherwise you’ll receive an error like: “Error (Data mining): The ‘VBPracticeMiningStructure’ mining structure does not have a case key column.” After this we can create mining model columns and correlate them with the appropriate structure columns using their SourceColumnID properties. We can add the model columns to their parent models after this, then add the new mining model to its parent structure, followed by a couple of Update statements. For the sake of simplicity I’ll leave out nested tables, but it would be fairly simple to include a second table in the dataset code that defines the DSV, or by creating an object of type AnalysisServices.TableMiningStructureColumn rather than AnalysisServices.ScalarMiningStructureColumn like we did above; just remember to add some child columns, otherwise you’ll receive an error when you try to deploy the new objects. That big mass of code up there may look intimidating, but it’s all really quite elementary. If you prefer C# examples I suggest you refer to the DM Team’s book, which goes into much more depth and tackles other important use cases, such as how to process models. In a nutshell, it’s fairly simple; both structure and model objects have a Process method, which you can call like so: TempDatabase.MiningStructures(1).MiningModels(1).Process(). Voila. If you can think of an operation that can be performed on an SSDM object, it’s certain to be in the object model – plus a few that can’t be done in the GUI or in DMX, such as creating DSVs. SSMS and Business Intelligence Development Studio, i.e. the 2008 version of SQL Server Data Tools (SSDT), were both written with AMO, so any functionality you find there has to be in the object model somewhere.[iv] Nonetheless, it’s a fairly simple object model that isn’t difficult to navigate either.
               ADOMD.Net is probably even easier to use, although it doesn’t look like it from the three figures that follow. I’m more accustomed to using .Net datagrids and datasets, so I decided to build some of those for my example rather than use an ADOMD DataReader, as the DM Team explains how to do in their book. In Figure 11 you’ll see an ordinary .Net 2010 dataset, which will hold the results of our DMX query. In this case, we will be retrieving a subset of the columns of DMSCHEMA_MINING_SERVICES, the equivalent of a relational system table for SSDM algorithms, as we touched on briefly in A Rickety Stairway to SQL Server Data Mining, Part 10.2: DMX DML. Keep in mind that your column names must match those returned by the query, unless you use column aliases. I had a devil of a time with data type conversion errors until I realized that I couldn’t remove the underscores from the column names in the dataset, as I usually do whenever possible for legibility purposes, because .Net and ADOMD does not automatically convert them by their ordinal position in a query. Some of my columns were populated properly and others were left null in my query and dataset, at a time when I was retrieving the full set of DMSCHEMA_MINING_SERVICES columns. Also make sure to assign the .Net data types that correspond to the right OLE DB data types that ADOMD will return; you may need to consult Books Online (BOL) to see the OLE DB types that system objects like this will return, then look up the conversion table at a webpage like Mapping .NET Framework Data Provider Data Types to .NET Framework Data Types. I’ll skip over an explanation of Figure 12, which is merely the XAML to create the WPF datagrid that we will display our results in. There’s nothing special of interest there unless you want to reverse engineer this project.

Figure 11: Dataset Structure for the ADOMD Query Example
Dataset

 Figure 12: XAML for the ADOMD Query Example
<Window x:Class=”MainWindow”
    xmlns=”http://schemas.microsoft.com/winfx/2006/xaml/presentation&#8221; xmlns:x=”http://schemas.microsoft.com/winfx/2006/xaml“ Title=”MainWindow xmlns:my=”clr-namespace:WpfApplication1″ Height=”617″ Width=”1481″>
<Window.Resources>
<my:ADOMDResultDataaset x:Key=”ADOMDResultDataset”
/>
</Window.Resources>
<Grid Width=”1466″>
<Grid.ColumnDefinitions>     
     
<ColumnDefinition Width=”1437*” />
            <ColumnDefinition Width=”12*” />
</Grid.ColumnDefinitions>
<DataGrid AutoGenerateColumns=”False” Height=”524″ HorizontalAlignment=”Left” Margin=”12,42,0,0 Name=”DataGrid1″ VerticalAlignment=”Top” Width=”1422″ ItemsSource=”{Binding Source={StaticResource MiningServicesViewSource}}” FontSize=”9″>
             <DataGrid.Columns>
                <DataGridTextColumn x:Name=”Column1″ Binding=”{Binding Path=ID}” Header=”ID” Width=”30″ IsReadOnly=”True” FontSize=”8″ />
                <DataGridTextColumn x:Name=”Column2″ Binding=”{Binding Path=Service_Name}” Header=”ServiceName Width=”100″  />
                <DataGridTextColumn x:Name=”Column5″ Binding=”{Binding Path=Service_GUID}” Header=”ServiceGUID Width=”80″  />
                <DataGridTextColumn x:Name=”Column8″ Binding=”{Binding Path=Supported_Distribution_Flags}” Header=”DistributionFlags Width=”100″  />
                <DataGridTextColumn x:Name=”Column9″ Binding=”{Binding Path=Supported_Input_Content_Types}” Header=”InputContentTypes Width=”100″  />
                <DataGridTextColumn x:Name=”Column10″ Binding=”{Binding Path=Supported_Prediction_Content_Types}” Header=”PredictionContentTypes Width=”100″  />
                <DataGridTextColumn x:Name=”Column11″ Binding=”{Binding Path=Supported_Modeling_Flags}” Header=”ModelingFlags” Width=”100″  />
                <DataGridTextColumn x:Name=”Column13″ Binding=”{Binding Path=Training_Complexity}” Header=”TrainingComplexity Width=”100″  />               
                <DataGridTextColumn x:Name=”Column14″ Binding=”{Binding Path=Prediction_Complexity}” Header=”PredictionComplexity Width=”100″  />
               
<DataGridTextColumn x:Name=”Column15″ Binding=”{Binding Path=Expected_Quality}” Header=”ExpectedQuality Width=”100″  />
                <DataGridCheckBoxColumn x:Name=”Column18″ Binding=”{Binding Path=Allow_PMML_Initialization}” Header=”AllowPMMLInitialization Width=”100″  />
                <DataGridCheckBoxColumn x:Name=”Column24″ Binding=”{Binding Path=MSOLAP_Supports_Analysis_Services_DDL}” Header=”SupportsSSASDDL Width=”100″  />
               <DataGridCheckBoxColumn x:Name=”Column25″ Binding=”{Binding Path=MSOLAP_Supports_OLAP_Mining_Models}” Header=”SupportsOLAPModels Width=”100″  />
              <DataGridCheckBoxColumn x:Name=”Column26″ Binding=”{Binding Path=MSOLAP_Supports_Data_Mining_Dimensions}” Header=”SupportsMiningDimensions Width=”100″  />
              <DataGridCheckBoxColumn x:Name=”Column27″ Binding=”{Binding Path=MSOLAP_Supports_Drillthrough}” Header=”SupportsDrillthrough Width=”100″  />
 </DataGrid.Columns>
</DataGrid>

<Button Content=”Fill Grid” Height=”32″ HorizontalAlignment=”Left” Margin=”12,0,0,0 Name=”Button1″ VerticalAlignment=”Top” Width=”75″ />
<Button Content=”Create New Structure” Height=”37″ HorizontalAlignment=”Left” Margin=”93,-2,0,0 Name=”Button2″ VerticalAlignment=”Top” Width=”176″ Visibility=”Hidden” />
<Button Content=”Create SSAS SP” Height=”28″ HorizontalAlignment=”Left” Margin=”287,4,0,0 Name=”Button3″ VerticalAlignment=”Top” Width=”103″ Visibility=”Hidden” />
</Grid>
</Window>

               The actual ADOMD code is much shorter than either the dataset or XAML code above, or the AMO code from the previous example. The first line is just a reference to the dataset we just created, which has project-wide scope and requires a reference in the Windows.Resource section of the XAML. The code to create data command and data adapter objects is of little consequence, since it’s nearly identical to ADO and SMO code. We just create a connection, open it, set the command text, then fill the only table in our dataset and close the connection. After we start the project, all we need to do to fill the datagrid in Figure 13 with the results of our query is click the Fill Grid button, whose event handler includes this ADOMD code.

Figure 13: ADOMD Code to Display a Simple DMX Query in WPF
Dim ADOMDResultDataaset As WpfApplication1.ADOMDResultDataaset = CType(Me.FindResource(ADOMDResultDataset), WpfApplication1.ADOMDResultDataaset)
Dim TempConnection As New Microsoft.AnalysisServices.AdomdClient.AdomdConnection
Dim TempCommand As New Microsoft.AnalysisServices.AdomdClient.AdomdCommand
Dim TempAdapter As New Microsoft.AnalysisServices.AdomdClient.AdomdDataAdapter

TempConnection.ConnectionString = “Data Source=127.0.0.1:2384;Initial Catalog=MonitorMiningProject”
TempConnection.Open()

TempCommand = TempConnection.CreateCommand() ‘make sure the column names match
TempCommand.CommandText = “SELECT Service_Name, Service_GUID, Supported_Distribution_Flags,
Supported_Input_Content_Types, Supported_Prediction_Content_Types, Supported_Modeling_Flags, Training_Complexity,
Prediction_Complexity, Expected_Quality, Allow_PMML_Initialization,  MSOLAP_Supports_Analysis_Services_DDL, MSOLAP_Supports_OLAP_Mining_Models,
MSOLAP_Supports_Data_Mining_Dimensions, MSOLAP_Supports_Drillthrough FROM
[$system].[DMSCHEMA_MINING_SERVICES]“
‘select some data from the server

TempAdapter.SelectCommand = TempCommand
TempAdapter.Fill(ADOMDResultDataaset, MiningServicesTable)
TempConnection.Close()

Figure 14: Results of an ADOMD Query Displayed in a VB WPF Datagrid (click to enlarge)
GridResults

                The ADOMD object model also exposes pretty much every property or method you need to perform anything you’d ordinarily do in DMX code or the SSDM GUI. It’s also fairly simple as object models go, especially since most of the class members are relevant to OLAP functionality, such as collections and properties for cubes, tuples, named sets, KPIs, members, levels, hierarchies, dimensions and cells. If you go exploring through the ADOMD and AMO object models, you may even find curious references to functionality that doesn’t seem to be anywhere else. For example, the model column DiscretizationMethod property allows you to specify two values I’ve never seen before, Thresholds and EqualRanges, and which I’ve only seen documented in the AMO object model. I also ran across a curious method called GetFullStatistics while experimenting with Microsoft.AnalysisServices.AdomdServer, a version of ADOMD.Net which can boost conserve processing resources by performing its operations server-side. This was a feature of SSDM I was completely unaware of until I read the DM Team’s book, or so I thought; I had planned ages ago to learn how to write SSAS stored procedures and was glad to learn in middle of my experiments that they’re the same thing. This is one of the most neglected features of Analysis Services at present, but I suspect that in the long run it may prove to be a hidden gem. Writing the code for my first SSAS procedures was a beneficial learning experience for me, one that really calls for a separate article. I’ve also done little experimentation with similar CLR functionality on the relational side, like managed code aggregates and user defined functions (UDFs), so it might be appropriate to compare the workflows, use cases and pitfalls of them all in one gulp, in next week’s installment. After that, there will be no place left to climb up our stairway except the ricketiest, most challenging and perhaps most rewarding one of them all, writing plug-in algorithms.


[i] p. 480, MacLennan, Jamie; Tang, ZhaoHui and Crivat, Bogdan, 2009, Data Mining with Microsoft SQL Server 2008. Wiley Publishing: Indianapolis.

[ii] I found this solution in a post by Robert Williams at the webpage “How to Change VS2010 Add Reference Box Filter?” published April 16, 2010 at StackOverflow.com.

[iii] pp. 510-511, MacLennan, et al.

[iv] IBID., p. 502

A Rickety Stairway to SQL Server Data Mining, Part 11: Model Comparison and Validation

by Steve Bolton               

              Throughout this series of amateur self-tutorials on SQL Server Data Mining (SSDM) I’ve typically included some kind of disclaimer to the effect that I’m writing this in order to learn the material faster (while simultaneously providing the most underrated component of SQL Server some badly needed free press), not because I necessarily know what I’m talking about. This week’s post is proof that I have much more to learn, because I deliberately delayed the important topic of mining model validation until the tail end of the series precisely because I thought it belonged at the end of a typical data mining workflow. A case can still be made for first discussing the nine algorithms Microsoft includes in the product, since we can’t validate mining models we haven’t built yet, but I realized while writing this post that validation can be seen as an initial stage of the process, not merely a coda to assess how well a particular mining project went. I was only familiar with the topic in passing before this post, after having put off formally learning it for several years because of some performance issues which persisted in this week’s experiments, as I’ll get to shortly. For that reason, I never got in the habit of using validation to select the mining models most likely to produce useful information with a minimal investment of server resources. It may be a good idea to perform validation at the end of a project as well, but it ought to be incorporated at earlier stages. In fact, it might be better to view it as one more recurring stage in an iterative mining process. Since data mining is still in its infancy as a field, there may be more of a tendency to view mining projects using the kind of simplistic engineering processes that used to characterize other types of software development, like the old waterfall model. I must confess I unconsciously fell into the trap of assuming simple sequential processes like this without even realizing it, when I should have been applying the Microsoft Solutions Framework formula for software development, which I had to practically memorize back in the day to get pass my last Microsoft Certified Solution Developer (MCSD) certification in Visual Basic 6.0. As I discussed last week, the field has not advanced to the point where it’s standard practice to continually refine the information content of data by mining it recursively, but it will probably come to that someday; scripts like the ones I posted could be used to export mining results back into relational tables, where they can be incorporated into cubes and then mined again. Once refined mining like that become more common, the necessity of performing validation reiteratively at set stages in the circular process will probably be more obvious.
                Until the latest release of SQL Server, validation was always referred to in Microsoft’s documentation as a stage an engineering process called the Cross Industry Standard Process for Data Mining (CRISP-DM). This six-stage process begins with setting the objectives of a project from the perspective of end users plus an initial plan to achieve them, followed by the Data Understanding phase,  in which initial explorations of data are performed with small-scale collections and analysis. Some of the tasks in the Preparation and Modeling phases occur in a circular manner, with continual refinements in data models and collection processes, for example. Validation can be seen as part of the fifth phase, Evaluation, which follows the creation of mining models in the previous phases, in the same order as I’ve deal with these topics in this series. This should lead to the development of models with the highest information content in return for the smallest performance impact, prior to the final phase, Deployment. Elimination of models that don’t perform as expected and substitution with better ones may still be desirable after this point, so validation may be necessary after the deployment phase as well. I have not seen the original documentation for CRISP-DM because their website, crisp-dm.org, has been down for some time, but what I’ve read about it at Wikipedia seems to suggest something of a waterfall model, alongside some recognition that certain tasks would require repetition.[1] Perhaps a more explicit acknowledgment of the iterative nature of the data mining engineering process was slated for inclusion in CRISP-DM 2.0, but this planned revision to the standard is apparently defunct at the moment. The last post at the Cross Industry Standard Process for Data Mining Blog at http://crispdm.wordpress.com/  is titled “CRISP-DM to be Updated” and has a single comment stating that it had been a long time since the status of the project was revised, but rest assured, it was on its way. That was back in 2007. Perhaps CRISP-DM was a victim of its own success, however, because there really hasn’t been any other competing standard since the European Union (EU) got the ball rolling on development of the standard in 1997, with the collaboration of Teradata, NCR, Daimler and two lesser known companies. SEMMA (Sample, Explore, Modify, Model and Assess) is sometimes viewed as a rival to CRISP-DM, but its developer, SAS Institute Inc., says it is  “’rather a logical organization of the functional tool set of’ one of their products, SAS Enterprise Miner.”[2] There are really only so many sane ways to organize a mining project, however, just as there are with other types of software development. It’s not surprising that CRISP-DM has no competition and has had little revision, since it’s fairly self-evident – at least once it’s been thought through for the first time, so credit still must go to its original inventors. Perhaps there is little more that can be done with it, except perhaps to put a little more emphasis on the iterative aspects. This is also the key characteristic of the Microsoft Framework for software development, most of which is entirely applicable to SSDM.
                Until the current version of SQL Server, the MSDN webpage “Validating Data Mining Models (Analysis Services – Data Mining)” said that “CRISP-DM is a well-known methodology that describes the steps in defining, developing, and implementing a data mining project. However, CRISP-DM is a conceptual framework that does not provide specific guidance in how to scope and schedule a project. To better meet the particular needs of business users who are interested in data mining but do not know where to begin planning, and the needs of developers who might be skilled in .NET application development but are new to data mining, Microsoft has developed a method for implementing a data mining project that includes a comprehensive system of evaluation.” This could have been misconstrued to mean that the method of validation employed by Microsoft was proprietary, when in fact it consists of an array of well-known and fairly simple statistical tests, many of which we have already discussed. The same page of documentation, however, does a nice job of summarizing the three goals of validation: 1) ensuring that the model accurately reflects the data it has been fed; 2) testing to see that the findings can be extrapolated to other data, i.e. measures of reliability; and 3) checking to make sure the data is useful, i.e. that it is not riddled with tautologies and other logical problems which might render it true yet meaningless. These goals can sometimes be addressed by comparing a model’s results against other data, yet such data is not always available. Quite often the only recourse is to test a model’s results against the data it has already been supplied with, which is fraught with a real risk of unwittingly investing a lot of resources to devise a really complex tautology. One of the chief means of avoiding this is to divide data into testing and training sets, as we have done throughout this series by leaving the HoldoutSeed, HoldoutMaxPercent and HoldoutMaxCases properties at their default values, which reserves 30 percent of a given model’s data for training. This is an even more important consideration with Logistic Regression and the Neural Network algorithms, which is why they have additional parameters like HOLDOUT_PERCENTAGE, HOLDOUT_SEED and SAMPLE_SIZE that allow users to set similar properties at the level of individual models, not just the parent structures. As detailed in A Rickety Stairway to SQL Server Data Mining, Algorithm 9: Time Series, that mining method handles these matters differently, through the HISTORIC_MODEL_COUNT and HISTORICAL_MODEL_GAP parameters. The former bifurcates a dataset into n number of models, separated by the number of time slices specified in the latter parameter. The built-in validation methods cannot be used with Time Series – try it and you’ll receive the error message “A Mining Accuracy Chart cannot be displayed for a mining model using this algorithm type” – but one of the methods involves testing subsets of a given dataset against each other, as Time Series does in conjunction with the HISTORIC_MODEL_COUNT parameter. By either splitting datasets into testing and training sets or testing subsets against each other, we can use the information content available in a model to test the accuracy of our mining efforts. [3]We can also test mining models based on the same dataset but with different parameters and filters against each other; it is common, for example, to use the Clustering algorithm in conjunction with different CLUSTER_SEED settings, and to pit neural nets with different HOLDOUT_PERCENTAGE, HOLDOUT_SEED and SAMPLE_SIZE values against each other.
               Validation is performed in the last element of the GUI we have yet to discuss, the Mining Accuracy Chart tab, which has four child tabs: Input Selection, Lift Chart, Classification Matrix and Cross Validation. The first of these is used to determine which input is used for the second and is so simple that I won’t bother to provide a screenshot. You merely select as many mining models as you wish from the currently selected structure, pick a single column common to all of them and optionally, select a single value for that column to check for. Then you either select one of the three radio buttons labeled Use mining model test cases, Use mining structure test cases and Specify a different data set. The last of these options brings up the window depicted in Figure 1, which you can use to choose a different external dataset from a separate mining structure or data source view (DSV) to test the current dataset against. You must supply column mappings there, in the same manner as on the Mining Model Prediction tab we discussed two posts ago in A Rickety Stairway to SQL Server Data Mining, Part 10.3: DMX Prediction Queries. The third radio button also enables a box to input a filter expression. You’ll also notice a checkbox labeled Synchronize prediction columns and values at the top of the tab, which Books Online (BOL) says should normally remain checked; this is not an option I’ve used before, but I assume from the wording of the documentation that this is a rare use case when mining models within the same structure have a column with the same name, but with different values or data types. The only difficulty I’ve had thus far with this tab is the Predict Value column, which I’ve been unable to type any values into and has not been automatically populated from the data of any model I’ve validated yet.

Figure 1: Specifying a Different Data Set in the Input Selection Tab

SpecifyColumnMapping

                The Input Selection tab doesn’t present any data; it merely specifies what data to work with in the Lift Chart tab. The tab will automatically display a scatter plot if you’re working with a regression (which typically signifies a Content type of Continuous) or a lift chart if you’re not. We’ve already briefly touched upon scatter plots and the concept of statistical lift earlier in this series, but they’re fairly simple and serve similar purposes. In the former, a 45 degree line in the center represents the ideal prediction, while the rest of the graph is filled with data points in which predicted values run along the vertical axis and the actual values on the horizontal axis. You can also hover over a particular data point to view the predicted and actual values that specify its location. The closer the data points are in a scatter plot, the better the prediction. It’s really a fairly simple idea, one that is commonly introduced to high school and early college math students. Generating the scatter plot is easier said than done, however, because I have seen the Visual Studio devenv.exe run on one core for quite a while on these, followed by the SQL Server Analysis Services (SSAS) process msmdsrv.exe as also running on a single core and consuming more than a gigabyte of RAM, before crashing Visual Studio altogether on datasets of moderate size. This is the first performance pitfall we need to be wary of with validation, although the results are fairly easy to interpret if the process completes successfully. As depicted in Figure 2,  the only complicated part about this type of scatter plot with SSDM validation is that you may be seeing data points from different models displayed in the same chart, which necessitates the use of color coding. In this case we’re looking at data for the CpuTicks column in the mining models in the ClusteringDenormalizedView1Structure, which we used for practice data in A Rickety Stairway to SQL Server Data Mining, Algorithm 7: Clustering. The scores are equivalent to the Score Gain values in the NODE_DISTRIBUTION tables of the various regression algorithms we’ve covered to date; these represent the Interestingness Score of the attribute, in which higher numbers represent greater information gain, which is calculated by measuring the entropy when compared against a random distribution.

Figure 2: Multiple Mining Models in a Scatter Plot

ScatterPlotExample

 
              The object with scatter plots is to get the data points as close to the 45-degree line as possible, because this indicates an ideal regression line. As depicted in Figure 3, Lift charts also feature a 45-degree line, but the object is to get the accompanying squiggly lines representing the actual values to the top of the chart as quickly as possible, between the random diagonal and the solid lines representing hypothetical perfect predictions. Users can easily visualize how much information they’ve gained through the data mining process with lift charts, since it’s equivalent to the space between the shaky lines for actual values and the solid diagonal. The really interesting feature users might overlook at first glance, however, is that lift charts also tell you how many cases must be mined in order to return a given increase in information gain. This is represented by the grey line in Figure 3, which can be set via the mouse to different positions on the horizontal axis, in order to specify different percentages of the total cases. This lift chart tells us how much information we gained for all values of CounterID (representing various SQL Server performance counters, from the IO data used for practice purposes earlier in the series) for the DTDenormalizedView1Structure mining structure we used in A Rickety Stairway to SQL Server Data Mining, Algorithm 3: Decision Trees. If I had been able to type in the Predict Value column in the Input Selection box, I could have narrowed it down to just a single specific value for the column I chose, but as mentioned before, I have had trouble using the GUI for this function. Even without this functionality, we can still learn a lot from the Mining Legend, which tells us that with 18 percent of the cases – which is the point where the adjustable grey line crosses the diagonal representing a random distribution – we can achieve a lift score of 0.97 on 77.11 percent of the target cases for one mining model, with a probability of 35.52 percent of meeting the predicted value for those cases. The second model only has a lift score of 0.74 for 32.34 percent of the cases with a probability of 25.84 at the same point, so it doesn’t predict as well by any measure. As BOL points out, you may have to balance the pros and cons of using a model with higher support coverage and a lower probability against models that have the reverse. I didn’t encounter that situation in any of the mining structures I successfully validated, in all likelihood because the practice data I used throughout this series had few Discrete or Discretized predictable attributes, which are the only Content types lift charts can be used with.

Figure 3: Example of a Lift Chart (click to enlarge)

LiftChartExample

                Ideally, we would want to predict all of the cases in a model successfully with 100 percent accuracy, by mining as few cases as possible. The horizontal axis and the grey line that identifies an intersection along it thus allow us to economize our computational resources. By its very nature, it also lends itself to economic interpretations in a stricter sense, since it can tell us how much investment is likely to be required in order to successfully reach a certain percentage of market share. Many of the tutorials I’ve run across with SSDM, particularly BOL and Data Mining with Microsoft SQL Server 2008,the classic reference by former members of Microsoft’s Data Mining Team[4], use direct mailing campaigns as the ideal illustration for the concept. It can also be translated into financial terms so easily that Microsoft allows you to transform the data into a Profit Chart, which can be selected from a dropdown control. It is essentially a convenience that calculates profit figures for you, based on the fixed cost, individual cost and revenue per individual you supply for a specified target case size, then substitutes this more complex statistic for the target case figure on the x-axis. We’re speaking of dollars and people instead of cases, but the Profit Chart represents the same sort of relationships: by choosing a setting for the percentage of total cases on the horizontal axis, we can see where the grey line intersects the lines for the various mining models to see how much profit is projected. A few more calculations are necessary, but it is nothing the average college freshman can’t follow. In return, however, Profit Charts provide decision makers with a lot of power.
              Anyone who can grasp profit charts will find the classification matrices on the third Mining Accuracy Chart tab a breeze to interpret. First of all, they don’t apply to Continuous attributes, which eliminated most of my practice data from consideration; this limitation may simplify your validation efforts on real world data for the same reason, but it of course comes at the cost of returning less information your organization’s decision makers could act on. I had to hunt through the IO practice data we used earlier in this series to find some Discrete predictable attributes with sufficient data to illustrate all of the capabilities of a Classification Matrix, which might normally have many more values listed than the three counts for the two mining models depicted below. The simplicity of the example makes it easier to explain the top chart in Figure 1, which can be read out loud like so: “When the mining model predicted a FileID of 3, the actual value was correct in 1,147 instances and was equal to FileID #1 or #2 in 0 instances. When it predicted a FileID of 1, it was correct 20,688 times and made 116 incorrect predictions in which the actual value was 2. When the model predicted a FileID of 2, it was correct 20,607 times and made 66 incorrect predictions in which the actual value turned out to be 1.” The second chart below is for the same column on a different model in the same structure, in which a filter of Database Id = 1 was applied

Figure 4: A Simple Classification Matrix

ClassificationMatrix

                 The Cross Validation tab is a far more useful tool – so much so that it is only available in Enterprise Edition, so keep that in mind when deciding which version of SQL Server your organization needs to invest in before starting your mining projects. The inputs are fairly easy to explain: the Fold Count represents the number of subsets you want to divide your mining structure’s data into, while the Target Attribute and Target State let you narrow the focus down to a particular attribute or attribute-value pair. Supplying a Target State with a Continuous column will elicit the error message “An attribute state cannot be specified in the procedure call because the target attribute is continuous,” but if a value is supplied with other Content types then you can also specify a minimum probability in the Target Threshold box. I strongly recommend setting the Max Cases parameter when validating structures for the first time, since cross-validation can be resource-intensive, to the point of crashing SSAS and Visual Studio if you’re not careful. Keeping a Profiler trace going at all times during development may not make the official Best Practices list, but it can definitely be classified as a good idea when working with SSAS. It is even truer with SSDM and truer still with SSDM validation. This is how I spotted an infinitely recursive error labeled “Training the subnet. Holdout error = -1.#IND00e+000” that began just four cases into my attempts to validate one of my neural net mining structures. The validation commands kept executing indefinitely even after the Cancel command in Visual Studio had apparently completed. I haven’t yet tracked down the specific cause of this error, which fits the pattern of other bugs I’ve seen that crash Analysis Services through infinite loops involving indeterminate (#IND) or NaN values. Checking the msmdsrv .log revealed a separate series of messages like, “An error occurred while writing a trace event to the rowset…Type: 3, Category: 289, Event ID: 0xC121000C).” On other occasions, validation consumed significant server resources – which isn’t surprising, given that it’s basically doing a half-hearted processing job on every model in a mining structure. For example, while validating one of my simpler Clustering structures with just a thousand cases, my beat-up development machine ran on just one of six cores for seven and a half minutes while consuming a little over a gigabyte of RAM. I also had some trouble getting my Decision Trees structures to validate properly without consuming inordinate resources.
               The information returned may be well worth your investment of computational power though. The information returned in the Cross-Validation Report takes a bit more interpreting because there’s no eye candy, but we’re basically getting many of the figures that go into the fancy diagrams we see on other tabs. All of the statistical measures it includes are only one level of abstraction above what college freshmen and sophomores are exposed to in their math prereqs and are fairly common, to the point that we’ve already touched on them all throughout this series. For Discrete attributes, lift scores like the ones we discussed above are calculated by dividing the actual probabilities by the marginal probabilities of test cases, with Missing values left out, according to BOL. Continuous attributes also have a Mean Absolute Error which indicates greater accuracy inversely, by how small the error is. Both types are accompanied by a Root Mean Square Error, which is a common statistical tool calculated in this case as the “square root of the mean error for all partition cases, divided by the number of cases in the partition, excluding rows that have missing values for the target attribute.” Both also include a Log Score, which is only a logarithm of the probability of a case. As I have tried to stress throughout this series, it is far more important to think about what statistics like these are used for, not the formulas they are calculated by, for the same reason that you really ought not think about the fundamentals of internal combustion engines while taking your driver’s license test. It is better to let professional statisticians who know that topic better than we do worry about what going on under the hood; just worry about what the simpler indicators like the speedometer are telling you. The indicators we’re working with here are even simpler, in that we don’t need to worry about them backfiring on us if they get too high or too low. In this instance, minimizing the Mean Absolute Error and Root Mean Square Error and maximizing the Lift Score are good; the opposite is bad. Likewise, we want to maximize the Log Score because that means better predictions are more probable. The logarithm part of it only has to do with converting the values returned to a scale that is easier to interpret; the calculations are fairly easy to learn, but I’ll leave them out for simplicity’s sake. If you’re a DBA, keep in mind that mathematicians refer to this as normalizing values, but it really has nothing in common with the kind of normalization we perform in relational database theory.                                                                                                                                                                          

Figure 5: Cross-Validation with Estimation Measures (click to enlarge)

CrossValidationSPResults

                Figure 5 is an example of a Cross-Validation Report for the single Linear Regression model in a mining structure we used earlier in this tutorial series. In the bottom right corner you’ll see figures for the average value for that column in a particular model, as well as the standard deviation, which is a common measure of variability. The higher the value is, the more dissimilar the subsets of the model are from each other. If they vary substantially, that may indicate that your dataset requires more training data.[5] Figure 6 displays part of a lengthy report for one of the DecisionTrees structures we used earlier in this series, which oddly produced sets of values for just two of its five processed models, DTDenormalizedView1ModelDiscrete and DTDenormalizedView1ModelDiscrete2. The results are based on a Target Attribute of “Name” and Target State of “dm_exec_query_stats” (which together refer to the name of a table recorded in the sp_spaceused table we used for practice data) and a Target Threshold of 0.2, Fold Count of 10 and Max Cases of 1,000. In addition to Lift Score, Root Mean Square Error and Log Score we discussed above, you’ll see other measures like True Positive, which refers to the count of successful predictions for the specified attribute-value pair and probability. True Negative refers to cases in which successful predictions were made outside the specified rangers, while False Positive and False Negative refer to incorrect predictions for the same.  Explanations for these measures are curiously missing from the current desktop documentation for SQL Server 2012, so your best bet is to consult the MSDN webpage Measures in the Cross-Validation Report for guidance. It also speaks of Pass and Fail measures which are not depicted here, but which refer to assignments of cases to correct and incorrect classifications respectively.

Figure 6: Cross-Validation with Classification and Likelihood Measures (click to enlarge)

ClassificationCrossValidationReport

               Validation for Clustering models is handled a bit differently. In the dropdown list for the Target Attribute you’ll see an additional choice, #CLUSTER, which refers to the cluster number. Selecting this option returns a report like the one depicted below, which contains only measures of likelihood for each cluster. Clustering also uses a different set of DMX stored procedures than the other algorithms to produce the data depicted in cross-validation reports. The syntax for SystemGetClusterCrossValidationResults is the same as the SystemGetCrossValidationResults used by the other algorithms, except that the former does not make use of the Target State and Target Threshold parameters. The same rules that apply in the GUI must also be adhered to in the DMX Code; in Figure 8, for example, we can’t set the Target State or Target Threshold because the predictable attribute we selected has a Content type of Continuous. In both cases, we can add more than one mining model by including it in a comma-separated list. As you can see, the output is equivalent to the cross-validation reports we just discussed.

Figure 7: Cross-Validation with #CLUSTER (click to enlarge)

ClusterCrossValidation

Figure 8: Cross-Validating a Cluster Using the Underlying DMX Stored Procedure
CALL SystemGetClusterCrossValidationResults(ClusteringDenormalizedView1Structure,ClusteringDenormalizedView1Model, ClusteringDenormalizedView1Model3,
10, – Fold Count
1000 – Max Cases
)

 

ClusterValidationResults

Figure 9: Cross-Validating Using the DMX Stored Procedure SystemGetCrossValidationResults
CALL SystemGetCrossValidationResults(
LRDenormalizedView1Structure, LRDenormalizedView1Model, – more than one model can be specified in a comma-separated list
10, – Fold Count
1000, – Max Cases
‘Io Stall’, – the predictable attribute in this case is Continuous, so we can’t set the Target State and Threshold
NULL, – Target State
NULL) – Target Threshold, i.e. minimum probability, on a scale from 0 to 1

 

CrossValidationSPResults

               Similarly, we can use other DMX procedures to directly retrieve the measures used to build the scatter plots and lift charts we discussed earlier. As you can see, Figure 10 shows the Root Mean Square Error, Mean Absolute Error and Log Score measures we’d expect when dealing with a Linear Regression algorithm, which is where the data in LRDenormalizedView1Model comes from. The data applies to the whole model though, which explains the absence of a Fold Count parameter.  Instead, we must specify a flag that tells the model to use only the training cases, the test cases, both, or either one of these choices with a filter applied as well. In this case we’ve supplied a value of 3 to indicate use of both the training and test cases; see the TechNet webpage “SystemGetAccuracyResults (Analysis Services – Data Mining)” for the other code values. Note that these are the same options we can choose from on the Input Selection tab. I won’t bother to provide a screenshot for SystemGetClusterAccuracyResults, the corresponding stored procedure for Clustering, since the main difference is that it’s merely missing the three attribute parameters at the end of the query in Figure 10.

Figure 10: Using the SystemGetAccuracyResults Procedure to Get Scatter Plot and Lift Chart Data 
CALL SystemGetAccuracyResults(
LRDenormalizedView1Structure, LRDenormalizedView1Model, – more than one model can be specified in a comma-separated list
3, – code returning both cases and content from the dataset
‘Io Stall’, – the predictable attribute in this case is Continuous, so we can’t set the Target State and Threshold
NULL, – Target State
NULL) – Target Threshold, i.e. minimum probability, on a scale from 0 to 1

 GetAccuracyResults

               These four stored procedures represent the last DMX code we hadn’t covered yet in this series. It would be child’s play to write procedures like the T-SQL scripts I provided last week to encapsulate the DMX code, so that we can import the results into relational tables. This not only allows us to slice and dice the results with greater ease using views, windowing functions, CASE statements and Multidimensional Expressions (MDX) queries in cubes, but nullifies the need to write any further DMX code, with the exception of prediction queries. There are other programmatic means of accessing SSDM data we have yet to cover, however, which satisfy some narrow use cases and fill some specific niches. In next week’s post, we’ll cover a veritable alphabet soup of additional programming tools like XMLA, SSIS, Reporting Services (RS), AMO and ADOMD.Net which can be used to supply DMX queries to an Analysis Server, or otherwise trigger some kind of functionality we’ve already covered in the GUI.


[1] See the Wikipedia page “Cross Industry Standard Process for Data Mining” at http://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining

[2] See the Wikipedia page “SEMMA” at http://en.wikipedia.org/wiki/SEMMA

[3] Yet it is important to keep in mind – as many academic disciplines, which I will not name, have done in recent memory – that this says nothing about how well our dataset might compare with others. No matter how good our sampling methods are, we can never be sure what anomalies might exist in the unknown data beyond our sample. Efforts to go beyond the sampled data are not just erroneous and deceptive, but do disservice to the scientific method by trying to circumvent the whole process of empirical verification. It also defies the ex nihilo principle of logic, i.e. the common sense constraint that you can’t get something for nothing, including information content.

[4] MacLennan, Jamie; Tang, ZhaoHui and Crivat, Bogdan, 2009, Data Mining with Microsoft SQL Server 2008. Wiley Publishing: Indianapolis.

[5] IBID., p. 175.

A Rickety Stairway to SQL Server Data Mining, Part 10.4: Data Mining Dimensions vs. Drastic Dynamic DMX


by Steve Bolton

                Anyone who’s gotten this far in this series of amateur self-tutorials on SQL Server Data Mining (SSDM) is probably capable of using it to unearth some worthwhile nuggets of wisdom in their data. One of the key remaining questions at this step of our rickety stairway is what we can do with the information we’ve mined. The most obvious answer to this would be to act upon the information; we must not lose sight of the fact that our ultimate mission is not to simply dig up truths based on past data for a person or organization, but to enable them to act with greater precision in the present to ensure a better future. Any subject under the sun may an intrinsic worth of its own, but appreciation for anything should not be divorced from its wider uses and implications for everything else beyond it; this is the point at which intellectuals to get lost in Ivory Towers and become irrelevant to the wider world outside their narrow vision. This also happens in the discipline of history, which I have some claim to real expertise in – unlike with SSDM, which I am writing about in order to familiarize myself with the topic better while simultaneously giving SQL Server’s most neglected component some badly needed free press. Sometimes historians get lost in mere antiquarian appreciation for their narrow specialties, while ignoring the more important tasks of helping mankind avoid fulfilling Santayana’s classic warning, “Those who cannot remember the past are condemned to repeat it.” Likewise, as the field of data mining blossoms in the coming decades, it will be critical to keep the giant mass of truths it is capable of unearthing relevant to the goals of the organizations the miners work for. Knowledge is worth something on its own, but it should also be remembered that “Knowledge is power.” Francis Bacon’s famous quip is incomplete, however, because knowledge is only one form of power; as Stephen King once pointed out in typically morbid fashion, it doesn’t do a mechanic whose car jack has slipped much good to truly know that he’s powerless to move the automobile crushing his chest. I suspect this is a fine distinction that will come to the fore as data mining matures. The most critical bottlenecks for organizations in the future, once data mining has become ubiquitous, may be the challenge of acting on it. The poor soul in Stephen King’s example can’t act on his information at all. Many organizations, however, may prove incapable of acting on good data due to other internal faults. Eight centuries ago, St. Thomas Aquinas pointed out that sometimes people prove incapable of doing things they know they ought to do, despite telling themselves to do them; such “defects of command” may well become glaringly evident in the future, as the next bottlenecks in unsuccessful organizations.
                At that point, data mining may already be automated to the point where algorithm processing results mechanically trigger actions by organizations. Some relational databases and data warehouses already have the capability of doing this to a degree, such as automatically sending out new orders to replenish inventories before they are depleted, or triggering a stock trade. Yet in most cases these automated decisions are based on fairly simple functions, not the ultra-, uber-, mega-sophisticated functions that modern data mining algorithms amount to. One day Wall Street firms may rise and fall in a day depending on whether the gigantic data warehouse at the center of an organization can out-mine its competitors and therefore out-trade them, without any human intervention at all since the formulas they’re acting on are too complex to put into human language; this would be a strange twist indeed on the science fiction chestnut of artificial intelligence gone awry. Using some of the scripts I’m going to provide in this post in conjunction with ADO.Net, SQL Server Integration Services (SSIS) and XML for Analysis (XMLA), we could indeed start laying the foundations for such automated processes right now, if need be.  The problem is that data mining is still in its infancy, to the point where it would be quite unwise for any organization to automatically rely on its mining results without prior human review by an actual intelligence. Our next programming task could thus be to build systems that automatically act on data mining results, but for the meantime it is much more prudent for humans to review the results first, then decide on courses of action themselves. That is probably why SSDM, or any other mining software I’m aware of, lacks built-in ways of handling such functionality. Without it, programmers are left with two other tasks they can perform on data they’ve already mined: present it more efficiently so that people can analyze it from a human perspective, or to feed the data back into mining process for further computational analysis. Either way, we’re talking about mining the mining results themselves; the process can be looked at as continuously refining hydrocarbon fuels or cutting diamonds we’ve taken from the ground, until they reach the point where the cost of further refining outweighs further improvements in quality. To perform the latter task, we need some programmatic means of feeding mining results from SSDM back into the mining process. Keep in mind that this is a fairly advanced task; it is obviously much more important to mine data the first time, as the previous posts in this series explained how to do, than to feed them back into the system reiteratively. It is not as advanced or risky as automated action though. That is probably why some limited means to perform the former task are provided in SSDM, but not any out-of-the-box solutions to the latter one.
               Such nascent functionality is included in SSDM in the form of data mining dimensions, which incorporate your mining results into a SQL Server Analysis Services (SSAS) cube. The official documentation carries a caveat that makes all the difference in the world: “If your mining structure is based on an OLAP cube.” The practice data I used throughout this series came strictly from relational data, because I didn’t want to burden DBAs who come from a relational background with the extra step of first importing our data into a cube, then into SSDM. That is why I was unable to add a data mining dimension through the usual means of right-clicking on a mining model and selecting the appropriate menu item, as depicted in Figure 1. I discovered this important limitation after getting back just 34 hits for the Google search terms “’Create Data Mining Dimension’ greyed” (or grayed) – all of which led to the same thread at MSDN with no reply.

Figure 1: Mining Dimension Creation Menu Command
CreateDataMiningDimension

                The rest of the screenshots in this post come from another project I started practicing on a long time ago, which was derived from cube data and therefore left the Create a Data Mining Dimension command enabled. Selecting it brings up the dialog in Figure 2, which performs the same function as the Create mining model dimension checkbox and a Create cube using mining model dimension checkbox on the Completing the Wizard page of the Data Mining Wizard. It’s all fairly self-explanatory.

Figure 2: Mining Dimension Creation Dialog Box
CreateDataMiningDimensionDialog

                A new data source view (DSV) is added to your project after you create the mining dimension. Depending on the algorithm you’ve chosen, certain columns from SSDM’s common metadata format will be automatically added to a MiningDimensionContentNodes hierarchy, including ATTRIBUTE_NAME, NODE_RULE, NODE_SUPPORT and NODE_UNIQUE_NAME. Some of these columns may be entirely blank and even those that are not provide only a fraction of the numeric data the algorithms churn out, such as NODE_SUPPORT, which is a mere case count. Such miserly returns would hardly be worth the effort unless we add more numeric columns for SSAS to operate on, such as NODE_PROBABILITY and MSOLAP_NODE_SCORE from the DSV list on the right-hand side of Figure:

Figure 3: Available Mining Dimension Columns
MiningDimensionContent

                Even after including these columns, you may not see any worthwhile data depicted in the cube’s Browser tab till you right-click the white space in the table and check the Include Empty Cells command, as depicted in Figure 4.

Figure 4: Include Empty Cells
IncludeEmptyCells

                Data mining dimensions are of limited use in their current form, however, because of a welter of serious inadequacies. As mentioned before, your data will have to be imported from a cube instead of a relational database, but it can also only be exported to a cube, not into a relational table; even within SSAS, you’re limited to incorporating it is a dimension, not a fact table. Furthermore, they only work with four of SQL Server’s nine data mining algorithms, Decision Trees, Clustering, Sequence Clustering and Association Rules, so we’re crippled right off the bat. One of these is narrowly specialized, as discussed in A Rickety Stairway to SQL Server Data Mining, Algorithm 8: Sequence Clustering, and another is a brute force algorithm that is high popular but appropriate only in a narrow set of use cases, as explained in A Rickety Stairway to SQL Server Data Mining, Algorithm 6: Association Rules. Furthermore, the data provided in mining dimensions comes solely from the common metadata format I introduced in A Rickety Stairway to SQL Server Data Mining, Part 0.1: Data In, Data Out. As I reiterated in each individual article on the nine algorithms, the format is capable of holding apples and oranges simultaneously (like a produce stand), but it comes at the cost of a bird’s nest of nested tables and columns which change their meaning from one algorithm to the next, or even one row to the next depending on the value of such flags as NODE_TYPE and VALUETYPE. The first instinct of anyone who comes from a relational background is to normalize it all, not necessarily for performance reasons, but for logical clarity. If you’re among this crowd, then you’re in luck, because the scripts I’m providing here will allow you to follow those instincts and import all of your mining results into relational tables.
                The concept is really quite simple, at least initially: you create a linked server to your SSAS database using sp_addlinkedserver or the GUI in SQL Server Management Studio (SSMS), then issue DMX queries through OpenQuery. For explicit directions, see the SQLServerDataMining.com webpage Getting Data Mining Results into SQL Tables posted by the user DMTeam on Oct. 24, 2008, but there’s not much left to say about that end of it.[i] The difficult part is formatting your DMX correctly, which can be challenging enough in an SSMS DMX Query window. There are many idiosyncrasies to the language which take getting used to, such as the odd necessity of putting brackets around the PARENT_UNIQUE_NAME column in all situations, the difficulty of scoping FLATTENED columns from NODE_DISTRIBUTION tables and a dozen others. One of the most nagging limitations is the inability of SQL Server to import its own SSDM results using the standard Microsoft OLE DB for Analysis Services data provider, which crashes whenever #IND (indeterminate) or NaN values are encountered in the mined data. The workaround I’ve developed currently requires you to have a copy of Microsoft Excel on your SSAS server because it uses Excel functions to format the #IND and NaN values correctly, but I will probably post a better solution that doesn’t require Excel fairly soon, now that I know how to write SSAS stored procedures. This can be a nightmare to debug, but the scripts I’ll provide take care of the problem for you. Debugging DMX in its native environment is difficult enough, but trying to track down a single error in a query that’s twenty lines long can be a mind-numbing experience when it’s nested inside a T-SQL OpenQuery statement. For all intents and purposes, you’re then also dealing with tracking endless quote marks just as you would with dynamic SQL. If you want to persist the results you’re going to have to use either sp_executesql or INSERT EXEC to get them into a temporary table, thereby adding another layer of quote marks to keep track of.
                The scripts I’m providing here relieve you of most of this burden of coding your own DMX, with the exception of prediction queries. These are handled by the PredictionQuerySP stored procedure I’ve also incorporated, which allows you to execute your own DMX queries in T-SQL, including those with functions that require quote marks, by doubling them like so: EXEC Mining.PredictionQuerySP‘MonitoringLS’, ‘SELECT ClusterProbability(”Cluster 1”) FROM [ClusteredDenormalizedView1Model2]‘. This code can be downloaded here, along with the other stored procedures for importing DMX data into T-SQL. The code was designed with SSMS T-SQL queries in mind, but it can be easily updated to accommodate queries submitted from other data access tools that don’t require double quoting, by adding a simple bit flag and IF statement. Aside from prediction queries, you need not write your own DMX at all, since the eight ImportContent stored procedures (two of the nine algorithms, Logistic Regression and neural nets, share the same procedure and table structure) I’m providing here encapsulate the whole process of transferring your mining results directly into the relational tables created by the companion Data Definition Language (DDL) scripts. There are also procedures and corresponding tables to import any information DMX queries can provide about mining structures, models, their columns and parameters. The procedures also take care of the messy work of normalizing all of SSDM’s nested tables into hierarchies of relational tables. I came up with this solution just a few weeks ago, after I tried to resuscitate a bunch of ad hoc scripts I’d written over the years for this article, only to realize that a more comprehensive and integrated architecture would make more sense. That left me no time to incorporate the necessary error checking or transactions in the procedures, add appropriate nonclustered indexes on the tables, try performance tuning, or any other such considerations. Don’t depend on these scripts until you’ve verified the accuracy of their results first. There are guaranteed to be mistakes in the code, which is a hodgepodge of dynamic SQL, CTEs, MERGE statements and multiple instances of a workaround I developed for the silly limitations Microsoft put on nested INSERT EXEC statements. Feel free to update it to your liking; just please give me credit when possible, because I’d like to get paid at some point for investing all of this energy learning yet another field. I intend to upgrade the scripts on my own for the purely selfish reason that they will help save me the time of writing any more DMX, except in the case of prediction queries. If there’s sufficient interest, I’ll post my revisions here. There’s no reason why we have to reinvent the wheel with each DMX query, when we can write all we need in one big guzzle like this, then store it in a single standardized, interchangeable format. The DDL statements to create the schema in Figure 5 are available here.

Figure 5: Schema v. 0.1 for Importing DMX into Relational Tables (click to enlarge)
MiningDiagramv1
                Your first task in importing SSDM results into this relational schema is to start at the top of the object model and create a new row for your SSAS database in the Mining.SSASDatabase; I haven’t created a stored procedure for this yet, because the metadata for this table is not yet set, so its really only useful at this point for generating identity foreign keys for the MiningStructureTable column SSASDatabaseID. Then use Mining.ImportStructureTableSP to create or update your list of mining structures, by supplying the procedure with the linked server and SSAS database names. At present I haven’t added code to simplify the process any further, so you’ll have to use the identity value generated for a particular structure when running Mining.ImportColumnTableSP and Mining.ImportModelTableSP, which gets the column information and collection of mining models in a single structure. The latter procedure also records the algorithm settings for a particular model, but the model column information will have to be retrieved by supplying the model identity value auto-generated by Mining.ImportModelTableSP to Mining.Import.ModelColumnTableSP. Once these steps are complete, you can import all of your model’s processing results into the appropriate table type, by choosing from one of the following stored procedures based on the model’s selected algorithm: ImportAssociationRulesModelContentSP, ImportClusterModelContentSP, ImportDecisionTreesModelContentSP, ImportLinearRegressionModelContentSP, ImportNaiveBayesModelContentSP, ImportNNModelContentSP and ImportTimeSeriesModelContentSP. They all take an identical set of parameters, including the linked server name, database name and model identity value, like so: EXEC [Mining].[ImportLinearRegressionModelContentSP]MonitoringLS, ‘MonitorMiningProject’, 18.
                What you can you do with the data once it’s been imported into relational tables? The sky’s the limit. You could create customized views of the results and apply fine-grained security access rights to them; perform sophisticated ANOVA analyses of your mining results; do custom clustering operations; slice and dice the data with the exciting new T-SQL windowing functions; or even build views to compare the intercepts of your regressions across models, or models applying different algorithms, should that prove meaningful to you. Best of all, you can then feed the data into the fact tables and dimensions of an SSAS cube – including the NODE_DISTRIBUTION table data, which you can’t extract through ordinary data mining dimensions – then mine the results again. You could conceivably continue this process recursively, if you had the need and the computational resources. As I explained in A Rickety Stairway to SQL Server Data Mining, Part 0.1: Data In, Data Out, data mining algorithms consist of combinations of statistical building blocks, which could conceivably be arranged in an infinite array of permutations. This whole series is predicated on the idea that a person without a doctorate in statistics can still use these algorithms and their building blocks without understanding their inner workings, as long as they grasp their uses – just as a driver only needs to know what a brake and steering wheel do, not how to reverse engineer them. Therefore I can’t predict what combinations of statistical building blocks you may find useful for a particular project, only that incorporating the results into relational tables will give you much greater freedom to apply them as you see fit.
              SSDM is impressive, but you reach the limits of what you can do with the data once you include it in a data mining dimension or depict it in the GUI; by importing the data into other tools, you can not only present the results differently, but perform further analysis. In my next post, we’ll make mention of four DMX stored procedures we have yet to cover, since they’re advanced means of performing validation tasks on mining models. After that we’ll get into ADO.Net, SSIS tasks, XMLA, Reporting Services and other such alternative means of accessing and extending SSDM programmatically. Functionally, writing custom code and model validation ought to come after learning how to use the nine algorithms correctly, since we can’t validate or retrieve what hasn’t yet been processed. Writing DMX and accessing SSDM with tools like SSIS is not an absolute necessity though, because users can always depend on the GUI tools to view their mining results, as explained in previous posts. It is less practical to do without model validation though, since this is a critical step in the mining process. Why should we rely on the results of faulty models and waste further server resources on them, when statistical and programmatic means are already available to us that can tell us if they’re defective? Stay tuned for that essential step up this long stairway. After that point we will be up the ladder to the roof and practically doing cartwheels on the chimney, which will be the ideal point for a final tutorial or two to close out the series, on the arcane topics of Predictive Model Markup Language (PMML) and plug-in algorithms.


[i] In previous iterations I depended on nearly identical information provided by other posters and sites, which I neglected to keep records of, so I can’t properly cite them. There isn’t much room for creativity with the basic formula they all provide for importing DMX queries so a citation probably isn’t necessary – yet I like to cover all my bases, so I found that citation at SQLServerDataMining before using it to overhaul these scripts a few weeks back.

A Rickety Stairway to SQL Server Data Mining, Part 10.3: DMX Prediction Queries

 
by Steve Bolton

               In the last two installments of this series of amateur self-tutorials on SQL Server Data Mining (SSDM), I discussed how to performed Data Definition Language (DDL) and Data Manipulation Language (DML) tasks with Data Mining Expressions (DMX), the SQL-like language used to administer SSDM. These topics were easier to explain and less critical to know than the material covered in previous posts, like the installments on SSDM’s nine algorithms; furthermore, the syntax for both operations is far simpler and serves far fewer purposes than DDL and DML with T-SQL and Multidimensional Expressions (MDX), the languages used for relational tables and OLAP cubes in SQL Server. Much of the functionality DMX provides can be also be performed faster in the GUI of SQL Server Data Tools (SSDT) and is thus only useful in certain limited scenarios. This is less true, however, of prediction queries, which allow users to make calculations based on the patterns contained in a previously trained mining model, sometimes by feeding it sets of input values that don’t exist in the model. It is often more difficult or even impossible to perform the same tasks without writing DMX code, which makes prediction queries the most important component of the language. We can perform some of the same tasks using the Mining Model Prediction tab in SSDT, which isn’t difficult to understand if you’ve already used the tools in SSDT and Visual Studio provides to graphically create tables and views instead of coding them by hand with T-SQL. Figure 2 is a screenshot of the tab, in which a few columns of the LRDenormalizedView1Model mining model we used previously in this series are selected in the table to the left and then appear in the grid at the bottom; to see the predicted values when compared against the set of cases selected to the right, simply right-click the tab surface and select the Result menu item. To view the DMX query the tool uses to create the prediction, select Query. The rest is fairly self-explanatory, except for the Singleton Query menu item, which requires an introduction to the various types of DMX prediction queries. Like equivalent graphical tools that generate T-SQL to create tables and views, the Mining Model Prediction Tab is awkward and deceptively simplistic; it seems like it ought to save you the trouble of writing DMX code, but often does not. As discussed in previous posts, the use cases for DML and DDL in DMX are quite limited because many of the same tasks can be performed more efficiently through the GUI, but prediction queries (like stored procedures and views in T-SQL) are really the one part of DMX that can typically be more efficiently coded by hand.

Figure 1: The Mining Model Prediction Tab
MiningModelPredictionTabExample

                The equivalent functionality in DMX is provided by a couple of interdependent clauses and functions which are often used in concert. The first of these are prediction joins, which employs syntax elements similar to those of T-SQL joins to compare the patterns stored in a mining model against a set, which can come from explicitly defined valued, the results of a sub-select, an MDX query or the data returned by an OpenQuery statement. To test a particular case (i.e., a single row of values) against your model, you supply a FROM statement coupled with the mining model name, followed by a PREDICTION JOIN clause with a list of explicit hand-coded values corresponding to the columns in the model. An example of such a “singleton query” can be found in Figure 4, in which we see what output is returned from a mining model we used in A Rickety Stairway to SQL Server Data Mining, Algorithm 5: The Neural Network Algorithm, when it is supplied with specific values for the DatabaseID and FileHandleID columns. In that query I used the NATURAL PREDICTION JOIN clause, which automatically correlates the columns from the mining model and the joined set together based on their names. If the keyword NATURAL is left out, then you must manually correlate each set of columns together using ON clauses in much the same manner as in a T-SQL query, which in Figure 3 would mean tacking on syntax like this: ON [NNDenormalizedView1Model10].[Database ID] = T2.[Database ID] AND [NNDenormalizedView1Model10].[File Handle ID] = T2.[File Handle ID]. An “unnatural” prediction join of this kind can be awkward to write when you’re dealing with numerous columns, but it is mandatory when the column names don’t match, which often occurs when the joined set data comes from an OpenQuery of an external data source. Selecting from a mining model without a prediction join clause (as depicted in Figure 3) amounts to what is called an empty prediction join, which returns the most likely values based on patterns already stored in a model. Like the OpenQuery syntax borrowed from T-SQL, the MDX statements that can be used to specify the joined set are beyond the scope of this tutorial series. We’ve already covered sub-selects and most of the syntax elements pertinent to SELECT statements in the last two tutorials, such as TOP, WHERE and ORDER BY, in previous tutorials, so I won’t belabor the point. Keep in mind, however, that when performing prediction queries you will often need to use the FLATTENED keyword, because many of the prediction functions that are often used alongside prediction joins return nested tables.

Figure 2: The PredictNodeID Function

SELECT PredictNodeID([Io Stall Read Ms])
FROM [LRDenormalizedView1Model]

SELECT ATTRIBUTE_NAME, NODE_CAPTION, NODE_SUPPORT
FROM [LRDenormalizedView1Model].CONTENT
WHERE NODE_UNIQUE_NAME= ’00000000e’

ATTRIBUTE_NAME NODE_CAPTION NODE_SUPPORT
Io Stall Read Ms All 99456

                Prediction functions can be applied to columns in a mining model without the use of prediction joins, as depicted in Figures 1 and 2, but they are often used in tandem, as seen in Figure 4. The algorithm used in a mining model determines which subset of prediction functions can be used against its data; for example, PredictNodeID is not available for the Logistic Regression and the Neural Network algorithms. It is not strictly a prediction function, since it merely returns a node name rather than predicting a value based on patterns in a model’s data, but I’ve included it here for the sake of clarity, given that it has the word Predict in its name. The first query in Figure 2 returns the NODE_UNIQUE_NAME value 00000000e, which you can use to look up information for a specific node. In the example above, the node identifier is used to restrict the results to the NODE_SUPPORT, ATTRIBUTE_NAME and NODE_CAPTION for a single node. You could also use the node identifier to retrieve important stats about a node from its nested NODE_DISTRIBUTION table, but this would imply the use of the FLATTENED keyword to denormalize the multiple rows it returns.

Figure 3: 5 DMX Functions in an Empty Prediction Join

SELECT PredictSupport([Io Pending Ms Ticks], INCLUDE_NULL) AS SupportResult,
PredictProbability([Io Pending Ms Ticks], INCLUDE_NULL) AS ProbabilityResult,
PredictAdjustedProbability([Io Pending Ms Ticks], INCLUDE_NULL) AS AdjustedProbabilityResult,
PredictStdev([Io Pending Ms Ticks]) AS StDevResult,
PredictVariance([Io Pending Ms Ticks]) AS  VarianceResult
FROM [NNDenormalizedView1Model10]

SupportResult ProbabilityResult AdjustedProbabilityResult StDevResult VarianceResult
99456 1 0 9.31134197258688 86.7010893304582

Figure 4: 5 DMX Functions in a Singleton Natural Prediction Join

SELECT PredictSupport([Io Pending Ms Ticks], INCLUDE_NULL) AS SupportResult,
PredictProbability([Io Pending Ms Ticks], INCLUDE_NULL) AS ProbabilityResult,
PredictAdjustedProbability([Io Pending Ms Ticks], INCLUDE_NULL) AS AdjustedProbabilityResult,
PredictStdev([Io Pending Ms Ticks]) AS StDevResult,
PredictVariance([Io Pending Ms Ticks]) AS  VarianceResult
FROM [NNDenormalizedView1Model10]
NATURAL PREDICTION JOIN
       (SELECT 7 AS [Database ID], 4 AS [File Handle ID]) AS T2

SupportResult ProbabilityResult AdjustedProbability
Result
StDevResult VarianceResult
9999.00019996001 0.999900019996001 0 1.36773308691065 1.87069379703014

                SSDM also provides the five functions depicted in the two figures above to return statistical predictions for some fairly simple, common measures we’ve used throughout this series, such as support (the number of cases in our mining model, i.e. rows of data), standard deviation, variance and probability. The only unique one is AdjustedProbability, which “is useful in associative predictions because it penalizes items that are too popular in favor of less-popular items.”[i] I have yet to see exact formula by which this measure is calculated though and vaguely recall reading somewhere that it is proprietary Microsoft code, so it may not be publicly available. As I’ve tried to stress throughout this series, however, detailed knowledge of the underlying equations is not necessary to derive substantial benefits from SSDM, any more that it is necessary to design your own combustion engine before driving a car. The most important thing to keep in mind is the concept of what AdjustedProbability is used for, so that if you encounter a scenario where you suspect that values with excessively high counts are obscuring underlying patterns in your data, it might be a good idea to apply this function and see what pops out.

Figure 5: PredictHistogram Usage on a Linear Regression Mining Model
SELECT FLATTENED PredictHistogram([Io Stall Write Ms]) AS T1
FROM [LRDenormalizedView1Model]

T1.IoStall WriteMs T1.$SUPPORT T1.$PROBABILITY T1.$ADJUSTED
PROBABILITY
T1.$VARIANCE T1.$STDEV
32921 99456 0.999989945504635 0 52333897.9797561 7234.21716426567
  0 1.00544953648776E-05 1.00544953648776E-05 0 0

               These statistical prediction functions are available for every algorithm except Time Series, which only supports PredictStDev and PredictVariance. Time Series is also the only algorithm which does not support PredictHistogram, contrary to Books Online (BOL), which states that it “can be used with all algorithm types except the Microsoft Association algorithm.” I verified this manually by executing the function on Time Series and Association Rules models, the first of which resulted in an “Error (Data mining): The PREDICTHISTOGRAM function cannot be used in the context at line 1, column 18.” The second returned results similar to those in Figure 5, which depicts the histogram returned for one of our old Linear Regression mining models used in A Rickety Stairway to SQL Server Data Mining, Algorithm 2: Linear Regression. The documentation for PredictHistogram is a bit puzzling, however, because it mentions phantom $ProbabilityVariance and $ProbabilityStdev columns which I have yet to encounter, but which are apparently meaningless anyways, given that they’re both accompanied by disclaimers like these: “Microsoft data mining algorithms do not support $ProbabilityVariance. This column always contains 0 for Microsoft algorithms.” The function does return an StDev column though, as depicted in Figure 5, along with the predicted value and columns for familiar measures like $Support, $Probability and $AdjustedProbability. To muddy the waters even further, the same page of Microsoft documentation states that the function should return $Cluster, $Distance and $Probability columns with mining models that use the Clustering algorithm, but it simply returns the same columns in the example below. All of the 17 clustering models we used in A Rickety Stairway to SQL Server Data Mining, Algorithm 7: Clustering return the same six columns. If we want to return this cluster information, we have to use one of the four functions used exclusively with Clustering, which are depicted in Figure 6. The Cluster () function tells us which cluster the values in our prediction join are most likely to fall into, while ClusterProbability and ClusterDistance tells us the probability of the input set belonging to a particular cluster or its distance from it. If no values are supplied to those functions, then the comparisons are made against the most likely clusters. They are not strictly prediction functions, because Clustering is rarely used for predictions, but I’ve included them here because they perform roughly analogous purposes. PredictCaseLikelihood is prefaced with the word Predict because it is a prediction function in a stricter sense. It provides the statistical likelihood that the input set will belong to a particular cluster, which is subtly different from ClusterProbability. Measures of likelihood make an inference based on observations made on actual data, while probabilities are calculated without reference to observations of a sample.[ii] The NORMALIZED keyword can be used with PredictCaseLikelihood to specify that the results are adjusted to an even scale, which is the default behavior, or NONNORMALIZED to return raw unadjusted calculations. The term “normalization” in the data mining literature most often refers to a process that adjusts disparate scales to make them comparable, not to any concepts relevant to relational database theory.

Figure 6: Clustering-Specific Functions
SELECT Cluster () AS T1Cluster,
ClusterProbability (‘Cluster 2′) AS T1ClusterProbability,
ClusterDistance () AS T1ClusterDistance,
PredictCaseLikelihood(NORMALIZED) AS T1DefaultNormalizedLikelihood,
PredictCaseLikelihood(NONNORMALIZED) AS T2RawLikelihood
FROM [ClusteringDenormalizedView1Model13]
     
NATURAL PREDICTION JOIN
    
(SELECT 7 AS [Database ID], 4 AS [File Handle ID]) AS T2

T1Cluster T1ClusterProbability T1ClusterDistance T1DefaultNormalized
Likelihood
T2RawLikelihood
Cluster 5 0.14552979054638 0.776691246872147 0.000134777874081471 3.6216743774023E-144

                All nine algorithms can make use of the polymorphic Predict function, but its behavior varies greatly depending on the data mining method selected and the type of value input to it. When used with the Time Series or Sequence Clustering algorithm to becomes an alias for the PredictTimeSeries or PredictSequence functions respectively. In other cases it must be supplied with the name of a predictable column. If a simple Input column is supplied, you will receive an error message like this: “Error (Data mining): Only a predictable column (or a column that is related to a predictable column) can be referenced from the mining model in the context at line 3, column 15.” As discussed in depth in A Rickety Stairway to SQL Server Data Mining, Part 0.0: An Introduction to an Introduction, the Content type assigned to a column represents one of the most important distinctions in data mining. If it is set to a predictable column with a Content type of Table, then the Predict function is equivalent to PredictAssociation, which makes predictions on nested tables. This is primarily used for recommendation engines of the kind Association Rules is often used to build[iii], but it is not precisely the same as making a prediction on an Association Rules model, since it can be applied to nested tables in other algorithms that support them. Beware of the confusing documentation in Books Online (BOL), which says that it applies to “Classification algorithms and clustering algorithms that contain predictable nested tables. Classification algorithms include the Microsoft Decision Trees, Microsoft Naive Bayes, and Microsoft Neural Network algorithms.” This seems to imply that it can be used with neural nets, but it can’t be since they don’t make use of predictable nested tables. While we’re on the topic of Association Rules, it might be an opportune time to mention OPTIMIZED_PREDICTION_COUNT, which we deferred discussing in A Rickety Stairway to SQL Server Data Mining, Algorithm 6: Association Rules. This is not a parameter to a function, but an algorithm parameter which limits prediction functions to return only the specified number of results with Association Rules models, regardless of how many are requested by a prediction query. When the Predict function is used without a nested table on an Association Rules model, it operates much like the Predict function does with other algorithms, rather than PredictAssociation. A column supplied to the Predict column can be specified with the keywords EXCLUDE_NULL or INCLUDE_NULL, while a table takes the same keywords used with PredictAssociation, INCLUSIVE, EXCLUSIVE, INPUT_ONLY, and INCLUDE_STATISTICS. The last of these returns two additional columns of stats, $Probability and $AdjustedProbability, which can also be supplied as parameters to order the results or limit them to the top n values for that column. Figure 7 provides a simple example of the Predict function applied to a column from one of the neural net models used previously in this series, while Figure 8 gives an example of PredictAssociation in action. Note that the first parameter supplied to it is the name of the predictable table, not a column within it; I initially made the mistake of specifying the table key, which leads to the confusing error message, “Only a nested table column with a KEY column inside is allowed in an association prediction function.”

Figure 7: Use of the Plain Predict Statement on a Neural Net Model (click to enlarge results)
SELECT *, Predict([Io Stall], INCLUDE_NULL) AS PredictResult
FROM [NNDenormalizedView1Model10]

Figure7Results

Figure 8: The PredictAssociation Function (click to enlarge results)
SELECT FLATTENED [Minute Gap],
    
(SELECT *
     FROM
 PredictAssociation([DmOsWaitStats], $Probability, INCLUDE_STATISTICS) ) AS T1
FROM [ARDoubleNestedTableStructure2Model1]

Figure8Results

               The query above returned 673 values, one for each WaitTypeID, but I only included the most noteworthy values in the result table because almost all were identical to the first row. As you can see in the query text, there are no clauses to limit the results. Frankly, this is one of the points at which we slam into the limitations of DMX like a brick wall, for there are no programmatic means I know of to reference the columns of the FLATTENED table to restrict them with a WHERE clause. I therefore had to delete all of the redundant rows by hand when pasting them into Microsoft Word. As I have pointed out many times in this series, I am an amateur at this, so listen to any data mining professional if they contradict anything I’ve written here, but I have yet to find a means of limiting such queries without importing the results en masse into relational or OLAP tables first. Neither of the queries in Figure 9 will allow you to do this, since the first returns the error message, “Error (Data mining): The specified DMX column was not found in the context at line 6, column 7” and the second returns, “The dot expression is not allowed in the context at line 6, column 7. Use sub-SELECT instead.” The latter also occurs if you qualify the WaitTypeID column with T1 instead. Note that in the second query, the sub-select is given the alias T2, to prevent an error to the effect that “Subselects and subcubes cannot be queried in flattened format.”

Figure 9: Column Problems with Two PredictAssociation Queries
SELECT FLATTENED [Minute Gap],
    
(SELECT *
    
FROM PredictAssociation([DmOsWaitStats], $Probability, INCLUDE_STATISTICS)) AS T1
FROM [ARDoubleNestedTableStructure2Model1]
WHERE WaitTypeID  = 1

 SELECT *
     FROM
(SELECT FLATTENED PredictAssociation([DmOsWaitStats], $Probability, INCLUDE_STATISTICS) AS T1
FROM [ARDoubleNestedTableStructure2Model1]) AS T2
WHERE DmOsWaitStats.WaitTypeID  = 1

                The same problem unfortunately afflicts PredictSequence and PredictTimeSeries, which are otherwise among the most useful features in DMX. This is not surprising, given that the word “prediction” ordinarily connotes some reference to the future, while Sequence Clustering and Time Series both imply temporal ordering of some kind.  If you recall from A Rickety Stairway to SQL Server Data Mining, Algorithm 8: Sequence Clustering, we found several clusters of queries that seemed to progress in a particular order during the collection phase for the data used throughout this series of tutorials, which was based on roughly three days of polling six IO-related Dynamic Management Views (DMVs) every minute using a SQL Server Integration Services (SSIS) job. Not surprisingly, query #339 was the most common one found using the Sequence Clustering algorithm, since it was part of the job that ran every minute during this period. That is probably why it is over-represented in Figure 10. The rest of the sequence revealed by the PredictSequence was highly useful, however, because it seemed to identify a chain of queries that I mistakenly did not group together when examining my data manually. The prediction join singles out query #6, which I originally thought on a casual glance to be a non-descript background process, when the query text actually referred to a change tracking background procedure. Queries #310 and #272 in the results below were related to full-text indexing on the same database that change tracking was enabled on, which had little use during this period, so SSDM effectively clustered these queries together into a sequence I initially missed. In retrospect, I think I altered a few values in a text column at some point, which triggered change tracking and changes to the full-text indexing. QueryTextID #119 is a common Agent background job that was merely executed so frequently that it happened to be included in this cluster, like #339. In the query below, I supplied the name of the nested table just like we would with PredictAssociation, along with two optional parameters to specify the beginning and ending steps in the sequence. Specifying just one number limits it to n number of steps. The NestedTableID column is the key column in the nested table, but I have no idea why it was included in the results but left blank. Other than this, the results might have been surprisingly useful in a real-world scenario. They would have also been much more difficult to extract from the data in the GUI, so we’ve definitely reached the point at which DMX becomes worthwhile to learn for particular scenarios.

Figure 10: PredictSequence
SELECT FLATTENED PredictSequence(QueryTextIDView,5, 15)
FROM
[SQQueryTextIDPredictableModel15]
    
 NATURAL PREDICTION JOIN
    
(SELECT 6 AS QueryTextID) AS T2

Expression.$SEQUENCE Expression.NestedTableID Expression.QueryTextID
5   339
6   339
7   357
8   310
9   272
10   119
11   339
12   339
13   339
14   339
15   339

               The results may have been useful, but they came at an increased cost in terms of model processing time. It took just 13 seconds to process the model but 3:49 to run the PredictSequence query, during which time msmdsrv.exe ran on just one core. Performance was even more of an issue with the PredictTimeSeries query in Figure 11, which I had to cancel the query after 52 minutes. Selecting just one column did not reduce the processing time down to reasonable levels either. I deleted some of the extraneous columns manually to highlight only the most interesting results, which represent just the first handful of rows out of thousands returned by the query. The first parameter is the nested table name, followed by n number of steps to predict into the future, which defaults to 1. Specifying two numeric values as we did in PredictSequence has the same effect of limiting the prediction to a particular range of steps into the future; for example, you could make predictions for 15 days to 25 days ahead using 15, 25, if your Key Time measures is in terms of days. The REPLACE_MODEL_CASES clause can also be supplied to indicate that predictions should be made using the patterns stored in the model, but by applying them to different data points in the join set. EXTEND_MODEL_CASES works slightly differently, by tacking new data points onto the end of the data already included in the model; for example, you could use it to append another seven days of data to compare them against the data stored in the model without having to actually alter it. I left both of these clauses out and limited it to a single prediction step for both legibility and performance reasons, since I would have to manually supply values for all 67 columns in the join set. Keep in mind that you can also apply PredictTimeSeries to columns outside your nested table, such as MinuteGap in the example below. If it is not applied to one of these columns or a nested table, you will receive this error: “Error (Data mining): Only a top-level column or a nested table column with a KEY TIME column inside can be used in a time series prediction function.”

Figure 11: PredictTimeSeries (click to enlarge results)
SELECT FLATTENED [Minute Gap],
   
(SELECT
   
FROM PredictTimeSeries([TS View], 1) AS T1) AS T2
FROM [TSColumnarModel2]

 Figure11Results

                Once again, however, I had to delete columns and rows by hand after the query was finished because I couldn’t restrict them in the query with a WHERE clause. This is one of the major drawbacks of DMX, which is still in its infancy as a language. Given that it deals with such complex forms of analysis, you’d expect it to have an even richer set of capabilities to slice and dice data than T-SQL or MDX have, but they simply haven’t been added to the language yet. Until the day comes when DMX is augmented with some badly needed new functionality, we will usually be better off importing our mining data into relational tables and OLAP cubes whenever possible. This is particularly true of the nested tables returned by these prediction functions and SSDM’s complicated metadata format, which I explained in A Rickety Stairway to SQL Server Data Mining, Part 0.1: Data In, Data Out and in each individual post on the nine algorithms. As explained throughout the series, the common format is useful in that it allows us to collect apples and oranges that we ordinarily couldn’t compare in the same basket. On the other hand, the price to be paid is a rat’s nest of nested tables and columns whose meaning changes between algorithms, or even from one row to the next. The first instinct of someone trained in relational database theory is to normalize all of these nested structures and rid ourselves of all this redundancy and overlapping meanings of attributes. SSDM provides a native means of importing its results into OLAP cubes in the form of data mining dimensions, which can’t be used with certain algorithms and still retain some of the limitations of this format. In next week’s tutorial, I’ll complement this method with some scripts to import your SSDM results directly into relational tables using a combination of DMX and T-SQL. From there, they can built into new cubes of your own design, rather than using the format imposed by data mining dimensions. This is the pinnacle of DMX, at which we find its primary use – which is to export the data it returns into tables and cubes for more efficient storage, retrieval and analysis.


[i] p. 121, 124, 571, MacLennan, Jamie; Tang, ZhaoHui and Crivat, Bogdan, 2009, Data Mining with Microsoft SQL Server 2008. Wiley Publishing: Indianapolis.

[ii] For an excellent discussion of this nuance, see the discussion titled “What is the difference between “likelihood” and “probability”?” at the CrossValidated website, available at http://stats.stackexchange.com/questions/2641/what-is-the-difference-between-likelihood-and-probability. Pay particular attention to the answer on Sept. 14, 2010 by the user named Thylacoleo. 

[iii] p. 123, MacLennan, et.al.

A Rickety Stairway to SQL Server Data Mining, Part 10.2: DMX DML

by Steve Bolton

                The three introductory posts in this series of self-tutorials on SQL Server Data Mining (SSDM) and the separate articles on the nine algorithms Microsoft ships with it covered the most complex and essential elements of of the most under-utilized and underestimated component of the product. Most of the remaining articles in this series touch on areas that are relatively more advanced, yet which are often less complex than most of the material we’ve covered so far, and certainly less necessary to the average casual user. I’ve aimed this series mainly at the large untapped market of database professionals who aren’t statisticians by trade, but who could be using SSDM to solve real-world problems with a minimal investment of training, time, energy and server resources, including some purely relational and OLTP problems. It is more important for this class of users to expend these resources determining how to apply the right algorithms correctly in the right circumstances, not writing Data Mining Extensions (DMX) code. You can certainly make productive use of SSDM without learning any DMX, just as you can get by without becoming adept at interpreting equations. There is less incentive to dwell on the topic, but these articles on DMX will be relatively short for an equally important reason: there’s not much to say. As explained in last week’s article on performing Data Definition Language (DDL) tasks, there is far less functionality to review with DMX in comparison to T-SQL or Multidimensional Expressions (MDX), SQL Server’s relational and OLAP languages. The topic of Data Manipulation Language (DML) in DMX is a little more complex, but is still far easier to learn than T-SQL or MDX. This allows users to focus their more energy on more pressing topics, such as applying the algorithms correctly to their data.
                In most SQL-based languages, the four major DML operations to take note of are UPDATE, DELETE, INSERT and SELECT statements. The same is true in DMX, but the functionality of these statements is far more limited. The meaning of the statements also differs in subtle ways. UPDATE does indeed update something, but it is merely the NODE_CAPTION of a cluster produced by the Clustering algorithm. For example, the dataset we used for example purposes throughout this series produced one named “Cluster 1” in the mining model we named ClusteringDenormalizedView1Model13, as discussed in A Rickety Stairway to SQL Server Data Mining, Algorithm 7: Clustering. To give it a new name that is more descriptive of the mining results it represents, we could use a statement like this: UPDATE ClusteringDenormalizedView1Model13.CONTENT SET NODE_CAPTION = ‘High MinuteGap Cluster’ WHERE NODE_UNIQUE_NAME = ‘001’. That’s all there is to it. We can also quickly dispose with the DELETE statement, which is designed to clear cases from either a mining model or its parent mining structure. Figure 1 succinctly explains the three different actions that can be performed using this statement. Note that the operation can be performed on either a mining structure or a model, and that the name of the specified object may be followed by a .CASES or .CONTENT keyword, which is pretty much the same pattern we’ll see with the SELECT statement. The former keyword refers to the cases used for training and the latter to those that are not.

Figure 1: Three Uses of the DELETE Statement in DMX

DMXDMLFigure1

                One use for the DELETE statement is to reclaim disk space to get rid of results from previous processing operations that are no longer needed. The most common usage scenario, however, is to empty structures or models in order to add fresh data to them using the INSERT INTO statement, which has the secondary effect of processing the objects it operates on. Contrary to the behavior of the same statement with T-SQL, you cannot use it to add new cases to a model or structure that already has any without receiving an error message; any existing data must be deleted, which severely crimps the usefulness of both DELETE and INSERT INTO. This wasteful limitation is precisely equivalent to being forced to truncate a relational table, then importing all of the same data to just to append a single new row. Likewise, inserting into a structure using syntax like INSERT INTO MyStructure is fairly straightforward because it processes the structure and all of its models, but applying it to a model in a statement like INSERT INTO MyModel is pointless if your structure has multiple models that you don’t want to process. When there is a single model in a structure, SSDM will process the model, but not its structure, unless the latter hasn’t been processed yet. Yet an error will be generated if there are any other models in the structure that need processing, so in effect, you will be forced to process any unprocessed ones anyways, which diminishes the usefulness of using INSERT INTO on a model. INSERT INTO on a structure has some performance advantages though, as former members of Microsoft’s Data Mining Team point out in their indispensable book Data Mining with Microsoft SQL Server 2008. Not only can SSDM can process all of the models efficiently by reading from the cache created in structure processing only once, but Enterprise Edition users can benefit from parallel processing of the models.[i]

Figure 2: A Simple Example of a DMX INSERT INTO Statement
INSERT INTO MyStructure
(Column1, Column2, Column3 etc.)
OPENQUERY(DataSourceName, ‘MyQueryText’)

                The second line of the INSERT INTO syntax is also upfront, in that you merely provide a comma-separated list of columns, all enclosed within parentheses, just as in T-SQL. After that, the syntax gets dicey. In the third line you must provide a comma-separated list of columns that proceeds in the same order as the corresponding columns in the first line, but this can be done through one of several different methods. One of the most common is the OPENQUERY statement, in which you provide the name of a preexisting data source and then a string used to retrieve data from it, using syntax specific to the provider the data source comes from. One of the drawbacks of this is that you can’t programmatically create the data sources, which are equivalent to the data sources we created for this tutorial series way back in A Rickety Stairway to SQL Server Data Mining, Part 0.0: An Introduction to an Introduction. The DM Team provides a complex workaround for this for adventurous users who are not faint of heart.[ii] Specifying a data source like this may not be necessary if you substitute the OPENROWSET statement, which allows you to specify the name of the provider, a connection string and the provider-specific text of your query, such as a T-SQL query. In other words, it works much like the T-SQL statement of the same name. Books Online (BOL) says you can also use other means like an MDX statement or XML for Analysis (XMLA) query, but I have yet to see practical examples of this being published anywhere, nor do I have an inkling of what syntax to use, given that I’d also have to specify the data source within them. One of the means is to use the old SHAPE statement that some of Microsoft’s data interfaces supported back at the beginning of this millennium, which is mandatory if you want to INSERT INTO nested tables. This is a complicated task which involves using five different keywords (SHAPE, APPEND, RELATE, TO and AS) in tandem with multiple parentheses and curly braces, plus your choice of access methods like OPENQUERY or OPENROWSET, which can lead to code that is both so torturous and tortuous that it makes me want to file a complaint with Human Rights Watch. All BOL says about it in each edition of the SQL Server documentation is that, “The complete syntax of the SHAPE command is documented in the Microsoft Data Access Components (MDAC) Software Development Kit (SDK).”[iii] I was tempted to say, if they’re not going to bother to document it, neither will I. The most formal yet readable example of it I have seen to date is in the DM Team book, as usual.[iv] Nevertheless I will give it the old college try and explain it as informally as possible: basically, you SHAPE { add the main query here in curly braces, such as an OPENQUERY } then APPEND ( now within parentheses, add another query here within { curly braces such as another OPENQUERY } then use a RELATE FirstQueryIDColumn TO SecondQueryIDColumn ) AS MyAlias. I’ve used it successfully before, but it’s hard to work with. If you want to use INSERT INTO with nested tables, however, learning the SHAPE command may be your only option.
                Moreover, INSERT INTO has a couple of other clauses to take into account. Among these is the SKIP keyword, which can be placed next to one of the columns in the first list to indicate that the corresponding columns in your OPENQUERY or OPENROWSET query are superfluous and should be ignored, perhaps because they don’t exist in your model. This scenario pertains mainly to nested tables. The documentation also mentions a .COLUMN_VALUES clause for INSERT INTO, but it is difficult to ascertain from BOL precisely what it is used for. As I’ve mentioned before in past posts, working with SQL Server Analysis Services (SSAS) is sometimes like blasting off into space, because the documentation is sometimes so thin that you’re basically going where no man has gone before, much like Capt. James T. Kirk and Co. For example, my series of posts on An Informal Compendium of SSAS Errors is still one of the few, if not the only, centralized sources of information for certain SSAS errors, including several pertaining exclusively to data mining. The .COLUMN_VALUES is in the same category, in that I could find only four relevant hits for it in a Google search, two of which merely repeated the thin information provided in BOL. A search for it in the text of both the 2005 and 2008 editions of the DM Team’s book also turned up nada[v]. Luckily, the last hit in Google turned up a reference to a readable explanation in the original OLE DB for Data Mining Specification, written way back in July 2000, back in the days when SSDM was a brand new feature in SQL Server, with only two algorithms.[vi] Figure 3 contains a hypothetical example of the clause, which I have yet to use in practice. It supposedly allows you to supply data to one column at a time and train it aside from others, with the caveat that any related columns must be trained together. This is the closest we can get in SSDM to incremental updating, which I surmise is mostly useful for performance purposes. The OLE DB document says that “the client application can browse those values but cannot yet perform queries or browse model content,” so I don’t see what advantage having a model with incomplete results would have, except to enable this kind of crude incremental processing. BOL says that it “provides column data to the model in a concise, ordered manner that is useful when you work with datasets that contain hierarchies or ordered columns,” but I am unsure what Microsoft means by this.

Figure 3: A Hypothetical Example of the COLUMN_VALUES Clause
INSERT INTO ClusteringDenormalizedView1Model.COLUMN_VALUES
(IoStall)
OPENROWSET(‘MyProvider’, MyConnectionString,
‘SELECT IOStall FROM MyTable)

                Because of such complexity, you’re better off not using INSERT INTO to populate and process your models, just as you are better off creating and altering them without DMX DDL, as explained in last week’s tutorial. Typing out statements like CREATE TABLE can have some advantages in T-SQL, because it forces the query writer to think more intently about the many variables that factor into object creation and use, but this is not usually the case with DMX. The main use case for INSERT INTO, DELETE and DDL operations in DMX operations is in those situations where you’re creating and changing structures and their models on the fly, especially when the results are depicted in programs other than SQL Server Data Tools (SSDT) or earlier Visual Studio-based incarnations of it with SQL Server editions prior to 2012. A casual data miner is much more likely to depend on the GUI Microsoft provides with SSDM than they are to rely solely on SQL Server Management Studio (SSMS) to view T-SQL query results, for example. This is due in large part to the complexity of the output and the visualization tools needed to depict it, which are a lot more sophisticated than a mere spreadsheet or grid control. The next most likely usage scenario is to retrieve the raw numbers these visualizations are built with directly through a DMX query, more often in SSMS than not. Even then, however, most users will be simply selecting data from objects that have already been created and fed data in SSDT. Other use cases do occur, but they’re proportionally much less common than on the relational or even the OLAP side of things. Perhaps this is why no means was even built in to create data sources or data source views (DSVs).
                Selecting data is a much more common operation in DMX, one with a bit more variety than these other DML operations but far less than corresponding SELECTs in T-SQL or MDX. The most diversity occurs in the FROM clause, which we will discuss after we get through the keywords that can be tacked on the beginning or the end of a query. This won’t take long because there are so few of them. The only keyword among them you won’t find in T-SQL is FLATTENED, which is designed mainly to denormalize DMX queries so that the NODE_DISTRIBUTION column can be depicted in GUIs that don’t supported nested tables. SSMS can depict this column correctly in DMX queries, but you can’t perform common operations like importing the data into relational tables (a crucial topic we will touch on two weeks from now) without this keyword. After this you can specify common T-SQL keywords like DISTINCT, TOP n, AS for column aliases, WHERE and ORDER BY. Their functionality is often stripped down in comparison to T-SQL, however, while in the case of DISTINCT, it is positively bizarre. When a column is specified with a Content type of Discrete (for a refresher on the crucial distinction, see the first couple of posts in this series), then it returns unique values as expected. With Continuous columns, however, it returns the median, while in the case of Discretized columns it returns the median of each bucket. The reasoning for this is quite beyond my ken. The usefulness of ORDER BY is also crimped by the fact that it can take only one condition. The operators DMX uses are mainly useful with ORDER BY and WHERE, but it features only a subset of those available with T-SQL. The language has the same arithmetic operators as T-SQL except modulo; a severely restricted subset of logical operators with AND, OR, NOT; the same comparison operators except !< and !> (not less than and not greater than); and the same unary operators, except for ~ for a bitwise NOT. Oddly, it doesn’t have this functionality, but DMX does support using // as a third means of commenting, in addition to the two used by T-SQL.
               The WHERE clause can also make use of EXISTS, which is technically a function in DMX rather than a keyword. Another function which can be useful in both the SELECT list and the WHERE clause is StructureColumn(“MyColumn”), which allows you to include a column from the parent mining structure in a model that doesn’t have it, either in the SELECT results or as a filter. Lag() can also be used in a WHERE clause to retrieve “the time slice between the date of the current case and the last date in the data,” but all it does is basically take a hatchet to your dataset. Using it on any other algorithm except Time Series raises this error: “The LAG function cannot be used in the context at line n, column n.” In my limited experience, it is also allowed only in CASES queries and seems to always take the final case in the mining model as its starting point. My experience is limited precisely because I can’t find much of a practical use for it, unlike the corresponding Lag command in MDX. One of the most exciting innovations in SQL Server 2012 is the addition of analytic functions like Lag to T-SQL, which has the left the DMX version even further behind in the dust. Unlike its MDX or T-SQL counterpart, the DMX version of Lag takes no arguments, so it’s not surprising that it returns information of such little use.
             Four other functions are commonly used with WHERE, including IsTestCase () and IsTrainingCase (), which take no arguments, and IsDescendant (NodeID) and IsInNode (NodeID), which you supply the parent node, or the node that you suspect contains the current case. Their functionality is fairly self-explanatory. The same is not the case with several other functions that are often used in the SELECT list, like TopCount, TopPercent and TopSum, each of which returns a table where the values for the specified column are => the count, percentile or sum. BottomCount, BottomPercent and BottomSum do the same except return the lowest values, in ascending order. They operate similar to the functions of the same name in MDX, except that the rank expression and the numerical value to check for are the second and third arguments to the function respectively, rather than the other way around. RangeMin, RangeMid and RangeMax can also be used to extract the minimum, average and maximum values for a bucket in a Discretized column. Figure 4 depicts the values returned for a Discretized column in the NBDenormalized3Model mining model used in A Rickety Stairway to SQL Server Data Mining, Algorithm 1: Not-So-Naïve Bayes. The blank first line returned with the results is a mystery to me, along with the annoying habit of SSDM to automatically insert spaces into your column names when creating mining structures, which is why we have to use brackets to enclose [Io Pending Ms Ticks] rather using the more legible name IOPendingMsTicks from my original schema. These three functions can also be used with prediction joins, in which case the results refer to the specific buckets predicted by the query.

Figure 4: Examples of RangeMin, RangeMid and RangeMax Applied to a Column in NBDenormalized3Model
SELECT DISTINCT RangeMin([Io
Pending Ms Ticks]) AS [MinIOPendingMSTicksBucket],
RangeMid([Io Pending Ms Ticks]) AS
[MidIOPendingMSTicksBucket],
RangeMax([Io
Pending Ms Ticks]) AS
[MaxIOPendingMSTicksBucket]
    
FROM
     [NBDenormalized3Model]

Results

DMXDMLFigure4Results

                  DMX has an entire class of prediction functions that are far more worthy of our time and attention, which deserve separate treatment next week because they are the meat and potatoes of the language. They are complemented by a specific syntax for SELECT to create what is called prediction join, which performs similar functions that it would be best to discuss separately. As depicted in Figure 5, simply selecting from a model without a CONTENT, CASES or SAMPLE_CASES clause will perform an empty prediction join, which is a topic that will also be deferred until next week. Likewise, using the DIMENSION_CONTENT clause in a FROM statement returns data from a mining dimension in a cube, which is a fairly simple topic. I will also defer that subject until after our discussion of DMX is over though, because I can augment the functionality with some custom DMX, T-SQL and MDX code that may aid DBAs who want to use SSDM successfully, without investing too much time and energy learning the whole complex metadata format I described in the posts on the individual algorithms. When you perform a CONTENT query on a model, you get back the 18 columns of that common metadata format, which vary substantially in meaning not only between algorithms but between rows, depending on the value of the NODE_TYPE column. The FLATTENED keyword is designed to denormalize the NODE_DISTRIBUTION column it returns, in which case six additional columns will be returned: ATTRIBUTE_NAME, ATTRIBUTE_VALUE, SUPPORT, PROBABILITY, VARIANCE and VALUETYPE. The meaning of these columns may also change from row to row, depending on the flag in the VALUETYPE column. For details on how to interpret these results, refer to the tutorials on each particular algorithm, which include handy charts for deciphering this rat’s nest of nested tables. Relational DBAs may be put off by this highly denormalized, flattened representation of the Generic Content Tree Viewer in SSDT, which is why I will supply code in a couple of weeks that will allow moderate users of SSDM to bypass CONTENT queries of this kind altogether. In some situations it may also be useful to inspect the cases used to train a particular structure or model, which is where the CASES clause comes into play.
               One of the strangest features in DMX is the SAMPLE_CASES clause, which applies only to the Sequence Clustering algorithm. There is one sentence referring to the clause in the 2005 edition of the DM Team’s book and no mention in the 2008 edition. Beyond that, there is next to no documentation on it anywhere on the planet, except for a single cryptic note in BOL. What few sources I have found seem to agree that it returns training cases which may be generated by SSDM itself, rather than representing actual data, but I have yet to see an explanation of why Sequence Clustering would require fake cases for internal use like this. I suspect it must have something to do with constructing the Markov models it depends on, but since I lack a solid background in statistics, I can’t really say for sure. Either way, it is a bit too arcane to discuss for the purposes of this series, given that it is a poorly documented and narrowly specialized feature, which probably pertains only to rare use cases for a single algorithm that is its useful only in exceedingly specific scenarios. If you already know how to use features like SAMPLE_CASES or COLUMN_VALUES, or even need to know it, chances are you’re not in the target audience for this series of tutorials; if you’re among the select half-dozen illuminati who use it regularly, chances are you’re the guy who wrote the code. One of the puzzling things about Analysis Services is that there are so many poorly or completely undocumented features, but someone had to have written the code behind it. Sometimes, trying to backtrack and rediscover the functionality is a bit like trying to figure out where the colonists of Roanoke went from the single cryptic word they left behind, CROATOAN.

Figure 5: Types of SELECT Clauses in DMX (adapted from BOL as usual – click to enlarge)

DMXDMLFigure5

                If you have several gigabytes of processed mining data and really want to tax your server, try running SELECT * FROM $system.DMSCHEMA_MINING_MODEL_CONTENT without a WHERE clause. This will return the all of the same information as a model CONTENT query, except for every model in your database. It is the first of ten schema rowsets for data mining, which are the only equivalents to the scores of dynamic management views (DMVs), information schema views, management stored procedures, metadata functions, DBCCs and other tools that relational DBAs are familiar with. As depicted in Figure 6, DMSCHEMA_MINING_STRUCTURES and DMSCHEMA_MINING_MODELS provides information roughly analogous to sys.tables, except for mining structures and models respectively. DMSCHEMA_MINING_STRUCTURE_COLUMNS and DMSCHEMA_MINING_COLUMNS are likewise analogous to the functionality of sys.columns, except for structure and model columns respectively. DMSCHEMA_MINING_FUNCTIONS provides information on the functions available with each algorithm, which DMSCHEMA_MINING_SERVICES and DMSCHEMA_MINING_SERVICE_PARAMETERS provide the properties and parameters available for them. DMSCHEMA_MINING_MODEL_XML has a couple extra columns like Size and Location which can supplement the information returned by DMSCHEMA_MINING_MODELS. It also has a column for an XML representation of a model’s PMML text, if it has any, but this is an advanced topic we won’t be tackling till the end of the series. The same goes for DMSCHEMA_MINING_MODEL_CONTENT_PMML.

Figure 6: The Ten Data Mining Schema Rowsets (click to enlarge)
DMXDMLFigure6

                That’s almost all you need to know about DMX – with the glaring exception of prediction queries, which provide functionality that you can’t always get using the GUI in SSDT. In the remaining articles in this series we’ll also touch on some really advanced leftovers like how to call DMX stored procedures, but that can wait until we discuss their primary use case, model validation, which is one of the most important steps in a mining workflow. In the upcoming tutorial on mining dimensions, I’ll introduce an alternative that allows DBAs who are more familiar with T-SQL or MDX to slice and dice the results SSDM returns in their favorite language. There are some usage scenarios where you can’t do without DMX, but it is not an ideal tool for interpreting the results returned by the algorithms. Its functionality is simply far too limited; as I mentioned a few paragraphs up, for example, the Lag function is barbaric compared to its newborn counterpart in T-SQL. Since the multi-dimensional data SSDM deals with is so complex, you would assume that DMX wouldn’t be so simplistic and one-dimensional. Perhaps it is still in its infancy as a language, just as the whole data mining field is, or perhaps it has died stillborn for lack of attention from the top brass at Microsoft, given that it hasn’t been updated much since SQL Server 2005. I look at it as the proper tool to do one of two jobs, the first of which is to get my mining data into relational tables or cubes, where I can perform operations on it that are more useful and sophisticated by several orders of magnitude, using T-SQL and MDX. The second of these is to perform prediction queries, which is the real raison d’être of the language and gives SSDM a whole new level of functionality that isn’t necessarily possible in the GUI.


[i] p. 109, MacLennan, Jamie; Tang, ZhaoHui and Crivat, Bogdan, 2009, Data Mining with Microsoft SQL Server 2008. Wiley Publishing: Indianapolis.

[ii]  IBID., pp. 109-110.

[iii] See the MSDN webpage “SHAPE (DMX)” -  http://msdn.microsoft.com/en-us/library/ms131972.aspx

[iv] pp. 110-112, MacLennan, et al.

[v]  IBID. Also see MacLennan, Jamie andTang, ZhaoHuin, 2005, Data Mining with Microsoft SQL Server 2005. Wiley Publishing: Indianapolis.

[vi] pp. 37-38, Microsoft Corporation, 2000, OLE DB for Data Mining Specification Version 1.0. Available online at  Dwight Calwhite’s website at www.calwhite.com/files/OLEDBDM1.doc

A Rickety Stairway to SQL Server Data Mining, Part 10.1: DDL with DMX

by Steve Bolton

                As users become more familiar with software, they typically follow a progression that begins with learning the GUI, followed by how to perform the same functions programmatically, then perhaps becoming adept at writing custom code. Other instructional material on SQL Server Data Mining (SSDM), such as that found in Books Online (BOL) and Data Mining with Microsoft SQL Server 2008[i], the classic book on the subject written by former members of Microsoft’s Data Mining Team, take a more topical approach. For example, they discuss each algorithm together with the Data Mining Extensions (DMX) functions pertinent to them. My series of amateur self-tutorials is aimed more at the untapped market of SQL Server professionals with moderate to low skills in statistics, who could nonetheless be making productive use of the most under-utilized tools in SQL Server, even to support purely relational or Online Transaction Processing (OLTP) tasks. For that reason, we’ve covered the indispensable parts of SSDM first and are now moving on to the extras, which can be dispensed with but are helpful in certain scenarios. It would be difficult to get useful results out of SSDM without the minimal grounding covered in post 0.0, post 0.1 and post 0.2, as well as the individual discussions of the nine algorithms. You don’t need a doctorate in statistics to use SSDM successfully, any more than you need to write a dissertation on combustion engines to drive a car, but users should have at least a basic familiarity with how to set up projects, structures and mining models, differentiate between Content types, apply the right algorithm for the right job and have an inkling of how to decipher the numbers they return. It helps to understand the inner workings of the algorithms, but SSDM can be put to good use without interpreting a single equation, aside perhaps from the simple calculations used in Linear Regression. All of these basic topics are prerequisites for the more advanced subjects we will be dealing with in the remainder of this series, such as data mining dimensions, plugging in your own custom algorithms with Predictive Model Markup Language (PMML) and mining model validation. It also makes more sense for the class of users this series is aimed at to learn such prerequisites as viewing mining results the easy way, through SSDM’s many visualization tools, before moving on to manipulating the results programmatically. For these purposes, DMX should be considered an advanced topic. Since we’re climbing the rickety stairway of these tutorials (which I am entirely unqualified to write) in order of the ease, complexity and necessity of the topics, I will the first of at least three tutorials on DMX with a discussion of Data Definition Language (DDL) queries. This involves creating mining structures and models programmatically rather than through the SQL Server Data Tools (SSDT) GUI, which we’ve already covered in depth. Once we’ve taken care of this basic task, we can tackle more advanced discussions in coming weeks of how to use DMX to do Data Manipulation Language (DML) on the structures and models we’ve built.
               This is really a baby step on the one stair in our series that is least likely to creak. DMX is general is much simpler than T-SQL or Multidimensional Expressions (MDX), the languages used by relational databases and cubes respectively in SQL Server. That is especially true when it comes to DDL, because there are only two types of objects we can create, mining structures and mining models, and only three DDL statements we can apply to them, DROP, CREATE and ALTER. To make matters even simpler, we’ll get rid of one of those right off the bat: to delete a model, simply use the syntax DROP MINING MODEL MyMiningModel  and to delete a structure, type DROP MINING STRUCTURE MyMiningStructure. That’s it. To simplify matters further, we won’t fuss over a variation of the CREATE statement geared towards PMML until that really advanced topic comes up for discussion at the tail of the end series, when our rickety stairway will practically be out of the attic and onto the roof. It looks something this, but neither the string nor the purpose of the PMML will make any sense until we’ve gotten to that subject: CREATE MINING MODEL MyMiningModel FROM PMML MyPMMLString. Before creating models of any kind, we would have to start by creating the parent mining structure. Figure 1 provides a visual example of what a typical structure creation looks like in SQL Server Management Studio (SSMS):

Figure 1: CREATE MINING STRUCTURE Example
CreateStructureQuery

                To create a DMX query like the one above, simply click on the reddish DMX icon on the top bar, just above the Execute button (assuming of course, that you’ve set up your SSMS interface like mine). Simply connect via the login window that appears and you’ll find a list of available mining models in your default SQL Server Analysis Services (SSAS) database; on the left you’ll see that the last one I selected was named ARDenormalizedView1Model14, which we used in the tutorial on the Association Rules algorithm. That doesn’t have much to do with what’s going on in the query window, where I’m creating a structure identical to the one we used in the tutorial on Sequence Clustering. That algorithm demands a nested table, which makes it ideal to demonstrate how to include the TABLE type; the nested table in MyStructure above is named QueryTextIDView and it has two columns in parentheses, which are specified by the same syntax as that used for regular mining structure columns. All of the data types in this structure are either Key or Long, but SSDM also provides a few other data types like Text, Boolean, Double, Date and Binary. The only reason I didn’t use them throughout this series is because the data we operated on in the examples came from three days of polling six different Dynamic Management Views (DMVs) related to IO pressure every minute, which produced six tables, some of which had more than a million rows, joined to a parent RecordTable with various measures of time. Some of these dependent tables originally had columns with data types like nvarchar or varbinary that could have easily been assigned Text or Binary data types in SSDM, but as explained in post 0.2, performance was an important concern, so I normalized those tables and substituted numeric foreign keys for them. This may have been wise, considering that with certain algorithms I was bedeviled with a frustrating error in which SSDM ran out of room to store strings on disk, which might have been exacerbated if I left these long text and binary values as they were. For many projects, however, it may make perfect sense to use these alternative data types. For a newcomer to SSDM, the only puzzling thing about the syntax might be the second clause following the data type in each column specifier. This corresponds to the Content type that flags the meaning of each column, which is a crucial distinction in data mining but not in relational databases. As discussed in depth throughout this series, there are nine of these Content types (Key, Key Sequence, Key Time, Cyclical, Ordered, Continuous, Discrete, Discretized and Table), but each algorithm has important restrictions on which ones it will accept. It is also possible to disallow nulls in a structure column by placing the NOT_NULL keyword between the data and Content types, but I neglected to depict that in the example above. After the data type and prior to NOT_NULL or the Content type, you can add the keyword UNIFORM, NORMAL or LOG NORMAL to specify the structure column’s Distribution property, which serves as a hint that some algorithms can use for optimization purposes. The first value refers to a flat line in which all values in a column occur with equal frequency, the second to a Gaussian bell curve and the third to a lopsided distribution in which certain values are much more likely. I have yet to see a good guide, however, as to which algorithms make use of this property and how they go about doing so; furthermore, it might require some preliminary queries in T-SQL or MDX to determine the distribution ahead of time, so I may discuss this in a future article on prep work before mining projects. Figure 3 has an example of one of these keywords in action in the CPU_Ticks column. You can also add a RELATED TO clause to specify that the column is related to a nested table key, a Discrete column or another column that has a RELATED TO clause applied to it, but I have yet to experiment with this feature much, nor have I seen a decent example in the literature of how to use it in practice. It creates a hierarchy in the same fashion as SSAS cubes and can apparently also be set using the ClassifiedColumns, but I have yet to see an example of how this would affect the outcomes of mining projects. The structure definition ends with some clauses regarding training cases which are much easier to explain. The optional WITH HOLDOUT clause merely specifies values for the mining structure properties HoldoutMaxCases and HoldoutMaxPercent, while the REPEATABLE keyword merely specifies the HoldoutSeed, all of which we covered way back at the beginning of this series.
                The real action takes place in the ALTER MINING STRUCTURE statement, which is only used to add mining models, in the same way that the ALTER TABLE statement in T-SQL can be used to add columns. The top of the query is fairly self-explanatory and quite similar to the CREATE statement we just covered. The difference is that instead of the specifying the data and Content types, we indicate whether the column should be used as an input (which is the default, and therefore blank for some columns), as a predictable column but not an input (using the PREDICT_ONLY keyword) or as both (using the PREDICT keyword). Note that the Hour column also has a mining flag applied to it called MODEL_EXISTENCE_ONLY, which we haven’t used throughout this series since we didn’t want to reduce our data down to a dichotomous choice between Missing or Existing states. The only other flag we can set is REGRESSOR, depending on whether or not we’re using an algorithm like Linear Regression that will accept it. After all that we see the USING statement, which specifies the algorithm. The parentheses after that include the specific parameters for that algorithm, if we don’t want to leave them at their default values. We’ve already covered these in depth in the articles on each individual algorithm, so refer to the pertinent prior posts for details.

Figure 2: ALTER MINING STRUCTURE Example
ALTER MINING STRUCTURE MyStructure
ADD MINING MODEL MyMiningModel
(
       MyIDColumn,
       Cpu_Ticks,
       DayOfWeek,
       Hour MODEL_EXISTENCE_ONLY,
       Minute,
      MinuteGap PREDICT_ONLY,
              QueryTextIDView PREDICT
              (
             MyNestedIDColumn,
              QueryTextID PREDICT
              )
)
USING [Microsoft_Sequence_Clustering]

parameters specific to each algorithm
(CLUSTER_COUNT = ’10′)
WITH DRILLTHROUGH,
FILTER(DayOfWeek = 7)

                 The optional keyword WITH DRILLTHROUGH refers to a feature I’ve neglected to discuss yet in this series. It sets the mining model property AllowDrillThrough to true, which allows users to view the cases used to train a model, contrary to the default behavior. Several prerequisites must be met first, such as granting the user AllowDrillThrough permissions on the structure or model, depending on which one is being accessed. Books Online (BOL) advises that “The model permission lets you drill through from the model, even if you do not have permissions on the structure. Drillthrough permissions on the structure provide the additional ability to include structure columns in drillthrough queries from the model, by using the StructureColumn (DMX) function,” but these are scenarios I’ve yet to encounter, since I’m the only user on the machine I’ve taught myself SSDM on.  The parent structure’s CacheMode property must be set to KeepTrainingCases rather than ClearAfterProcessing, which deletes them when processing is finished. The Naive Bayes, Logistic Regression and Neural Network algorithms do not allow drillthrough at all, since they “do not assign cases to specific nodes in the content,” whereas there are several other limitations with Linear Regression and Times Series, as detailed at the MSDN webpage Drillthrough Queries (Data Mining). If all those conditions are met, users can view the cases used to train a model with a DMX SELECT .Cases query, which we will cover next week along with other SELECT clause topics. The FILTER clause corresponds to the mining model Filter property, which I have also neglected to mention. The idea is not much different than a WHERE clause in T-SQL, except that the conditions are enclosed in parentheses and follow DMX-specific syntax, which would be more appropriate to discuss next week. Basically, you can add several conditions connected with familiar keywords such as AND and OR, as well as subqueries with SELECT, EXISTS and the like. It is also possible add a FILTER clause to a nested table by tacking it after the table’s definition. You can create a structure and a model together using the CREATE MINING MODEL statement as depicted in Figure 3, but you can’t add a filter or any of the holdout clauses.

Figure 3: CREATE MINING MODEL Example
CREATE MINING Model MyModel3
(
       MyIDColumn LONG KEY,
       Cpu_Ticks LONG  LOG NORMAL CONTINUOUS,
       DayOfWeek LONG CYCLICAL,
       Hour LONG CYCLICAL,
       Minute LONG CYCLICAL,
       MinuteGap LONG CONTINUOUS PREDICT_ONLY,
              QueryTextIDView TABLE PREDICT
              (
              MyNestedIDColumn LONG KEY SEQUENCE,
              QueryTextID LONG DISCRETE PREDICT
              )
)
USING [Microsoft_Sequence_Clustering]
parameters specific to each algorithm
(CLUSTER_COUNT = ’10′)
WITH DRILLTHROUGH

  

                It is also possible to duplicate an existing mining model using a SELECT INTO statement,  which is mainly useful when  you want to operate on a model that is nearly identical except for different algorithm parameters or WHERE filters. There are no column specifiers, but it is fairly simple to stipulate the algorithm, the parameters and a filter, like so: SELECT * INTO CopyOfMyMiningModel USING Microsoft_Neural_Network (HIDDEN_NODE_RATIO = 10) WITH FILTER (FileHandleID = 11) FROM MyMiningModel.
      
There are far fewer additional clauses to worry about with these statements than their corresponding CREATE and ALTER TABLE counterparts in T-SQL. There isn’t even an ALTER MINING MODEL statement to worry about. WeM might as well include the two RENAME statements here, which merely allow you to change the name of a model or structure like so: RENAME MINING MODEL MyModel3 TO MyModel4. To perform the same operation on a structure, simply change the keyword MODEL to STRUCTURE. The closest thing you have to a BACKUP statement in DMX is EXPORT, which backs up a model or structure to an SSAS backup file like so: EXPORT MINING STRUCTURE MyStructure, TSDenormalizedView1Structure MINING MODEL MyModel4 TO ‘C:\MyStructure3.abf’ WITH DEPENDENCIES, PASSWORD=‘duh’. The DEPENDENCIES keyword specifies that additional objects needed for processing, such as the Data Source View (DSV), are also included in the file, while the PASSWORD keyword is self-explanatory. You can export more structures all at once by separating them with commas (as shown above, where TSDenormalizedView1Structure is an old structure we used with the Time Series tutorial) and indicate that only a specific model should be included, using the MINING MODEL keyword. The EXPORT MINING MODEL statement has basically the same syntax, except that it also exports the structure the model depends on, but only the specified model within the structure. IMPORT is the equivalent of a RESTORE statement in T-SQL and has only one optional clause, for the password previously assigned. For example, IMPORT FROM ‘C:\MyStructure3.abf’ WITH PASSWORD=‘duh’ would restore the structure we just exported.
                There’s simply not much else to say about DDL in DMX, which is a welcome break after several weeks of covering complex topics in this blog. It doesn’t provide much additional functionality beyond what we already have in the GUI, nor is there as much benefit to be gained by typing these commands out by hand as there is with T-SQL DDL statements, which help you to think more clearly about the functionality of your tables and views. The main scenario for using the two CREATE statements is with the optional SESSION keyword, which allows you to create structures and models that are discarded when their associated connection is closed, in a manner analogous to temporary tables in T-SQL. I imagine that this is useful mainly if you’re remotely creating temporary objects, as you might through Excel. I can’t remember off-hand, but it may also be possible to do this in Reporting Services, which I only do on occasion; normally I write any DMX queries in SSMS, which may be different from your usage scenario. In other words, much of the DDL we use in DMX is not only extremely simple, but often unnecessary; using the GUI might not be the hip, cool thing to do, but it is often the fastest. In the next couple of tutorials we’ll venture into areas where DMX becomes a progressively more useful tool, or even outright indispensable in some cases. Even in many of these scenarios, however, the learning curve is still quite short compared to broader SQL Server languages like T-SQL and MDX. This simplicity can free up resources for more pressing tasks, like setting algorithm parameters correctly. It is also an antidote to the excessive complexity associated with SSDM’s common metadata format, which as discussed in previous issues has some drawbacks alongside its advantages. As we shall see, one of the primary uses of DMX is to address this limitation, by importing the results into relational tables or OLAP cubes where the results can be sliced and diced with greater ease.


[i] MacLennan, Jamie; Tang, ZhaoHui and Crivat, Bogdan, 2009, Data Mining with Microsoft SQL Server 2008. Wiley Publishing: Indianapolis.

A Rickety Stairway to SQL Server Data Mining, Algorithm 9: Time Series

by Steve Bolton

                The data mining method known as Time Series is aptly named, because it seems to take a long series of time to get it right.
                Out of the nine algorithms included in SQL Server Data Mining (SSDM), Time Series is on a par with Neural Networks in terms of sophistication, raw power and potential usefulness. When properly designed, neural nets can also be used to predict future trends based on past data, but Time Series is usually the preferred tool when projecting slices of time with definite intervals; its inner workings are also complex but still far easier to interpret than the innards of neural nets, as explained in A Rickety Stairway to SQL Server Data Mining, Algorithm 5: The Neural Network Algorithm. Sequence Clustering is the only other SSDM algorithm that is implicitly temporal, but it can also be applied to data that is ordered but not in terms of time – one of its most common applications is in gene sequencing, for example. That algorithm is usually applied to slices of time that are indefinite, which might make it an ideal choice for a DBA to harvest Windows and SQL Server log data for sporadic server events. Like Sequence Clustering and Association Rules, Time Series has a fairly long learning curve, which is another reason why I saved it for the end of this segment of my self-tutorials on SSDM. On the other hand, it is a much more versatile tool, in that it has far fewer restrictions on input data than many of the other algorithms and can be applied to a much wider range of temporal problems. Certain other algorithms like Clustering, Naïve Bayes and Logistic Regression can only be applied to temporal projections in an awkward way, while with Association Rules you might as well banish the thought of using it to accurately predict the future, because it’s not going to happen. Linear Regression and Decision Trees can be used for predictions, but SQL Server data miners might as well use Time Series if they have the computational resources, given that it uses both of these algorithms as building blocks to make predictions with much greater accuracy.
                As always, I preface my articles with a disclaimer that I’m posting this series in order to learn the ropes, while providing some badly needed publicity to the most under-utilized tool in SQL Server; if even an amateur like myself can use it to uncover profitable hidden relationships in data, then real DBAs can probably gain even more from it with minimal expenditures of time, training and server resources. On the other hand, I do have real expertise in history, including a Master’s degree and part of a doctorate. I remember getting into debates with other grad students about the purposes of history, where I took the sometimes unpopular stance that the chief use of the past is to affect the future. The past may have some intrinsic value of its own, but history is mainly antiquarian entertainment unless it is used to fix today’s problems tomorrow. Perhaps that is because I specialized in foreign policy, which is an area where forecasting is crucial, for the simple reason that people die in large numbers when our statesmen fail to learn from history. I tend to look at data from the same philosophical viewpoint. All of those numbers and letters DBAs toil to preserve are already part of the past, and their chief use is to apply them to tomorrow’s challenges. This is true of OLTP as well as OLAP data; one saves yesterday’s records to make them available for tomorrow’s transactions, while the other analyzes the patterns from one year’s records to forecast the challenges an organization will face the next year. Yesterday’s data makes tomorrow’s action possible. Time Series is one of the most valuable data mining algorithms precisely because it is more explicitly in line with this goal than some of its competitors. It is not necessarily a “better” mining method though. With data mining, matching the proper tool with the right task is a critical question, and Time Series is not always the best choice. You wouldn’t use it to divide data into groups based on characteristics that have nothing to do with time, in which case you’re better off using Clustering or Association Rules, or another preliminary algorithm like Naïve Bayes. A common design pattern in data mining, however, is to use non-temporal algorithms like these to identify groups, then use Time Series to see how they change over the course of a series of precise intervals. This is the basically the workflow I’ve used up until this post, in which we will apply some of the knowledge gained from the other algorithms to temporal predictions. When it is used in this way, the types of data that Time Series can operate on are bounded only by human imagination, as long the intervals we’re predicting are specific. For a long list of the many varied applications Time Series is put to today, see the Wikipedia page on the topic. Earthquake prediction and weather forecasting are among the more glamorous and highly useful applications of Time Series that benefit the human race every day.
               The output of Time Series can be interpreted with a fair amount of ease, even by laymen. Just picture a line chart, which for the uninitiated, simply has the time intervals depicted horizontally at the bottom and the values you’re measuring listed vertically on the left side, with one or more lines and curves in the middle to show the trends of your data. For a couple of simple examples, see the webpage 6.4.5.1. Example of Multivariate Time Series Analysis in the National Institute of Standards and Technology’s Engineering Statistics Handbook, which I highly recommend for amateurs like myself who want to learn the basics of stats quickly. Visualizing the results is the easy part. The hard part is explaining the inner workings of the algorithm, which is surpassed in complexity only by neural nets. It may be arcane, but keep in mind, that doesn’t mean it works magic. There is an unspoken tendency among those, like myself, who don’t fully understand the hard-core math behind it all to treat data mining with a touch of superstition, especially when it unearths valuable patterns. This tendency is magnified with Time Series since we’re predicting the future. It is not a black art, although the math behind it may be a black box to those of us without doctorates in stats, which aren’t needed to put SSDM to productive use. Time Series does not receive messages from beyond the mortal plane like St. Odilia or Mother Shipton. Nor does it perform the ex nihilo miracle of adding information out of thin air, which is logically impossible. As explained in A Rickety Stairway to SQL Server Data Mining, Part 0.1: Data In, Data Out, data mining can at best preserve and identify information that is already present in the finite data we feed into it. If we feed Time Series garbage in the form of wrong data, bad data or insufficient good data, we’re basically just projecting garbage into the future. It is not easy to understand how Time Series churns through the data it is fed, but this is mainly because so many separate steps are involved, not because they are beyond the ken of laymen. As mentioned before, the algorithm makes heavy use of Linear Regression and Decision Trees, two statistical building blocks we’ve covered before. If you can follow the explanations in those tutorials, you’re half-way home. If you haven’t gone through them yet, have no fear, it’s not as hard as it sound. Regression is a topic covered in many general math courses at high school senior or freshman college level, while Decision Trees is just an adaptation designed with non-linear data in mind, so that instead of a single line through a scatter plot, you get jagged ones that change course in tandem with the data. Microsoft’s version of Time Series depends on a variant of Linear Regression called auto-regression, which is even simpler in the sense that a single variable is used to calculate its own regression line, rather than its influence on a second variable. It also makes use of moving averages, which seems to be a somewhat advanced but relatively common calculation in relational databases, judging from the fact that it is often included at the tail end of the T-SQL cookbooks I’ve read.[i] I’ve avoided posting equations throughout much of the series, partly because I’ve somehow forgotten half of the math I used to know back when I was in fourth grade, and used to solve equations for my father, a college physics teacher. Secondly, we don’t really need it to use Time Series or any other SSDM algorithm, just as we can drive cars without giving dissertations on automotive engineering first. When driving a car, it is nonetheless helpful to understand basic concepts about what’s going on under the hood, such as what functions pistons serve in the engine. Likewise, the important things to understand in this discussion of the inner workings of Time Series are the functions the statistical building blocks are meant to serve. The equations themselves are secondary because the whole point of SSDM is to implement all of that advanced stuff for you. Chances are if you’re a DBA, you’re already familiar with moving averages, and if you’ve been to college, you’ve been exposed to regression analysis in the past. It only takes one baby step up our rickety stairway to understand that Decision Trees produces a jagged regression line rather than the single straight one Linear Regression puts out, for the purpose of modeling data that doesn’t follow a straight path. All that remains is to explain how these three building blocks are combined together in Microsoft’s implementation of Time Series.
               The Wikipedia page on the subject  lists scores of different methods of calculating Time Series, which is really a group of related algorithms. As with Clustering, keeping track of all of the research going on in the field would be quite difficult even for an expert, because its many constituent methods are put to diverse and highly specific purposes in widely separated fields that are exceedingly compartmentalized. One of the avenues of research, however, was pioneered by Microsoft itself, through an adaptation of the mining method we discussed in A Rickety Stairway to SQL Server Data Mining, Algorithm 3: Decision Trees. Much of the explanation that follows on the Auto Regression Trees with Cross Predict[ii] (ARTXP) algorithm within Time Series comes from a paper presented in 2002 by three Microsoft statisticians.[iii] It is adept at short-term predictions but not at long-term forecasting, so in SQL Server 2008 Microsoft added an implementation of the industry standard Box-Jenkins method, also known as an Arithmetic Moving Average (ARIMA). Users can set the FORECAST_METHOD parameter to use one or the other exclusively, but in this post I left it at its default value of MIXED. For lack of time I also did not experiment with PREDICTION_SMOOTHING, which can be used to specify how the two algorithms are blended together to produce a single set of outputs in SSDM. The closer the parameter is to zero, the more the results are exponentially weighted in favor of ARTXP, but the closer it is to 1, the more they are weighted towards ARIMA. I am not yet sure if the normalization curve Microsoft uses is applied before or after ARTXP and ARIMA are finished processing Time Series data in their own unique ways. This may be a subject for future experimentation, since a substantial impact on performance may occur if it is applied afterward and weighting turns out to be more costly for one algorithm than the other. Keep in mind that the two algorithms are also always weighted by temporal distance, so that ARTXP figures more prominently in short-term predictions and ARIMA towards long-term ones, assuming that they’re both enabled. Cross-prediction, i.e. forecasting the values of multiple columns so that they take each other’s predictions into account, is only available with ARTXP.[iv] It is also one of many Time Series features that are only available with Enterprise Edition, so it is important to plan for the correct version of SQL Server if your projects require this algorithm.
               The aforementioned research paper on ARTXP also mentions Markov models, which are the backbone of Sequence Clustering, but it is unclear to a novice like me from reading the paper if they are actually employed in SSDM. It also mentions that Monte Carlo methods, another common building block in statistics, can be used to interpolate values between leaf nodes in Decision Trees when future values are not known, but again, I’m not sure if it is used in the SSDM implementation or not.[v]  As usual, much of the original research paper is over my head, but it isn’t difficult to understand the point of it all. Auto-regression lines are obviously weak tools by themselves, given that the past behavior of a single variable is not necessarily a reliable predictor of how it will behave in the future. Worse still, simple auto-regressions can’t even model non-linear relationships in that single variable.. Without the Decision Trees component of SSDM to conserve and identify it, a lot of the information present in the data would be lost if depicted in a single straight line. The places where the modified auto-regression lines SSDM produces branch off at crooked angles depict places where a pronounced change has taken place in the data. If the change occurs in the temporal values used as the key it “means that there is a change in the trend at a certain point in time.”[vi] If there are no splits in the results SSDM returns, then what we’re left with is basically a simple auto-regression.[vii]
               It’s not quite this simple as this, but much of the added complexity in ARTXP and ARIMA comes from merely tacking additional terms on the end of regression formulae in order to account for additional variables. One of the most important of these are seasonal fluctuations in data, which are referred to as periodicity terms. The important thing to remember, once again, is the function those variables perform, not how they are included in the equations going on under the hood. The only equation I’ve included in this series to date is the common Linear Regression formula, y = ax + b, which is simple enough to understand that most college freshmen are exposed to it at some point; the key concept to remember is that when ARTXP or ARIMA have to take another variable account, they simply add it to the right side. Each new term alters the course of the regression lines produced by the equation. Data Mining with Microsoft SQL Server 2008, the classic reference written by former members of Microsoft’s Data Mining Team, says that the ARTXP equation may be modified by up to eight historical terms for each periodicity,” for example.[viii] More terms are tacked on to account for additional input and predictable variables until the equation becomes illegible by human standards, simply because of its length, not because the concepts are all that difficult. As we shall see, it is possible to retrieve the regression equations for both ARTXP and ARIMA, but there’s little point in doing so unless you’re plugging them into some other math program or doing documentation or some similar task, because they’re just too long to be user-friendly. With both algorithms, you’re going to have to rely on the visualizations SSDM provides, but with ARTXP this can be problematic. As Microsoft’s documentation points out,  “Interpreting the information conveyed by the model content is an art that requires a deep understanding of the data and its meaning in the business context.” In other words, all it does is flag potential patterns that might be of interest, out of a mass of possible relationships so vast that humans can’t conceive of them. All data mining methods require some level of human intervention for interpretation (all that the algorithms do is hide the chaff in the hopes that we can find some wheat) but with ARTXP and Decision Trees, I generally find myself having to spend more time on that phase than with other algorithms. Your mileage may vary (YMMV), but I normally find myself wading through a lot more redundancies and tautologies than with other data mining methods – with the exception of Association Rules, a brute force algorithm that is not much different than the way ‘49ers used to sort gold flakes out in pans during the California Gold Rush. For that reason, I personally prefer ARIMA because I seem to get worthwhile results back from it with less investment of time and other resources. Keep in mind, however, that ARTXP and ARIMA are designed to solve different yet complementary problems, so it may not be productive to favor one over the other in all situations.
                ARIMA is built on the work of British statistician Gwilym Jenkins and George Box, a chemist who taught himself stats while researching the effects of poison gas on animals for the British Army during World War II.[ix] Their 1971 book Time Series Analysis: Forecasting and Control popularized what became known as the Box-Jenkins method of Time Series, but much of the groundwork had already been set down in a thesis by New Zealand statistician Peter Whittle in 1951.[x] I can’t explain the Box-Jenkins method or moving averages any better than the NIST’s aforementioned Engineering Statistics Handbook does, but I’ll provide the Cliff’s Notes.[xi] The first step is lashing a moving average together with an auto-regression. Like a single straight regression line, a simple average of all the values in a column will not give you an accurate picture of fluctuations in your data over time. With regression, this limitation can be addressed with such methods as weighted caps, as in Logistic Regression, or by branching lines, as in Decision Trees. With averages, the answer is to smooth the values by taking averages at different points in time, such as the mean of a set of five consecutive values, beginning with a different row each time as you move forward through a dataset. To increase the accuracy of the change over time that these numbers represent, data miners can always add on more layers of computation to smooth the smoothed values, or to use more advanced methods like exponential smoothing and weighted data points. In its simplest form, smoothing merely means taking a moving average of a moving average. In ARIMA, the moving averages is added as another term to the auto-regression equation, in a manner similar to the addition of new terms to ARTXP equations. The ways in which these statistical tools can be combined is limited only by one’s imagination, since a data miner could conceivably add new layers of smoothing to their moving averages in an infinite regress, although the amount of information conserved would quickly diminish.
               The Box-Jenkins Model provides a well-known methodology for combining them in a standard way that conserves enough information to make it useful for a wide range of projects. First the algorithm detects how stationary a data series is through a run sequence or auto-correlation plot, then calculates the differences between each interval and repeatedly adds terms to the equation until “statistical properties such as mean, variance, autocorrelation, etc. are all constant over time.”[xii] The point of this stage is to make the equation easier to work with. The greater the variability in the trends of a dataset, the more orders of difference will be necessary to smooth out the values, according to Books Online (BOL).[xiii] Normally in SSDM this requires a single difference order – which is why we typically see just one of them represented for each ARIMA node in the Generic Content Tree Viewer – but on occasion the algorithm will add a second difference order. This behavior can be changed by using two hidden parameters, ARIMA_AR_ORDER and ARIMA_DIFFERENCE_ORDER, which are cutting edge topics we won’t delve into at this point; we’re still walking up a creaky stairway in this series, not taking a space elevator. Seasonality, i.e. periodicity, is also detected through such means as auto-correlation or spectral plots. In Microsoft’s implementation, a common statistical tool called a Fast Fourier Transformation (which has been called “the most important numerical algorithm of our lifetime”) is applied to identify seasonality.[xiv] New terms are then added to the equation for each cyclical pattern detected in the data. Auto-correlation calculations are then applied to order the auto-regression and moving averages terms properly, by examining how well the data produced correlates with itself. This part of the calculation process is a bit opaque to an amateur like me, but at some point it may make use of least squares calculations, which we touched on briefly in A Rickety Stairway to SQL Server Data Mining, Algorithm 4: Logistic Regression. None of the concepts I’ve just glossed over are difficult to grasp, with the FFT being perhaps the most difficult. It took me three paragraphs to explain crudely how they are all combined together to output Time Series values though, which ought to give users an idea of just how many calculations are involved in the process. The number of calculation steps makes the algorithm relatively resource-intensive, so proper modeling and setting algorithm parameters correctly is of critical importance. There seems to be less risk of seeing the bogeyman of data mining, overfitting, in which poor performance is paired with cluttered, meaningless or misleading results, at least in comparison to other algorithms like Association Rules or Decision Trees. In my limited experience, I have seen far more problems with pure performance bottlenecks, which can crash SQL Server Analysis Services (SSAS) during the processing phase or SQL Server Data Tools (SSDT) when retrieving the mining results. Even when processing and retrieval are successful, it may take an inordinate amount of time to finish them. I have also received a couple of puzzling errors repeatedly in the SSDT and previous versions of Visual Studio on different Time Series projects, which thankfully don’t seem to affect the results returned in the GUI. One of the two error messages depicted below recommends that I use a data type of higher precision, but I still receive it when using the largest data types available in SSDM:

Figure 1: Frequent Time Series Error Messages in the GUI
Error1
Error2

               Throughout this series we’ve skipped using the NOT NULL and MODEL_EXISTENCE_ONLY mining flags, since our data has no nulls and there would be no advantage in reducing our data to a dichotomous choice of Missing or Existing states. Time Series has a unique MISSING_VALUE_SUBSTITUTION parameter that performs a related function, however, by filling in missing values in a prediction series with the Previous value, the Mean taken from moving averages during training or a specific number of your choosing. Leaving it at its default of None is a common source of model processing errors, especially with newly created mining models. Thankfully, Microsoft provides this crucial parameter in every edition of SQL Server.              It may be important to set the other parameters of Time Series correctly to avoid performance problems and crashes, but beware, because many of them are only available in Enterprise Edition, like COMPLEXITY_PENALTY, HISTORIC_MODEL_COUNT, HISTORICAL_MODEL_GAP, INSTABILITY_SENSITIVITY, MAXIMUM_SERIES_VALUE, MINIMUM_SERIES_VALUE and PREDICTION_SMOOTHING. The DM Team’s aforementioned book says that MINIMUM_SUPPORT and COMPLEXITY_PENALTY are “rarely used” in versions after 2005 because of the addition of ARIMA, but I found it necessary to set them to non-default values in this week’s trials to avoid overfitting.[xv] As discussed  in the tutorial on Decision Trees, these two parameters can be used to prune useless branches, but with ARTXP, the trade-off is not just between quality of results, model accuracy and performance, but in the stability of predictions, which tend to improve when these parameters are set more restrictively. In past projects I had trouble getting SSDM to return any results at all until I stumbled on a helpful post at Microsoft’s Data Mining Forum that explained how to set them correctly in a Time Series model. One of the problems discussed in that thread was alleviated by the addition of the INSTABILITY_SENSITIVITY parameter in SQL Server 2008, which allows users to turn off an internal inhibition on tree growth when the standard deviation of the predictions crosses a certain threshold. Since excessive tree growth is more of problem in this week’s trials and the parameter is only operative when the FORECAST_METHOD is set exclusively to ARTXP, I won’t get into a discussion of it here. I found it necessary to set the MINIMUM_SERIES_VALUE and MAXIMUM_SERIES_VALUE parameters, which limit the range of output values much like the common math functions Floor and Ceiling do in many programming languages. Unfortunately, they are set for entire models, not individual columns, which robs them of much of their usefulness. For example, if you’re predicting one series with a logical range of values between 100 and 1,000, and a second one that should range between 0 and 10, then the best you can do is set these parameters to 0 and 1,000 respectively. The first column can thus still produce values below its own natural floor, while the second can still output values above its natural ceiling.
               The two seasonality parameters are indispensable though. When AUTO_DETECT_PERIODICITY is set closer to 1, it detects seasonal patterns more aggressively, with a corresponding risk of overfitting and possible reward of greater accuracy. The closer it is to zero, the less likely these three outcomes are. On several occasions I have found it useful to leave it set at its default of 0.6 and instead set PERIODICITY_HINT, which takes a comma-separated list of numerical values that you suspect occur at cyclical intervals in your data. A common use case might be to set it to {60, 24, 7} if you expect patterns to appear in your data by each minute in an hour, each hour in a day and each day in a week. You may track data by the minute, hour, day and week, yet that doesn’t necessarily mean that you should set periodicity hints for each of them. Unless your data tends to vary in a cyclical manner by those same intervals then there is no seasonality to detect. In that case SSDM even may ignore the hint, although this is unlikely, given that BOL says SSDM is “very sensitive” to it. For example, it might make sense to set it to 7 if you’re measuring sales at a restaurant which tends to do the same amount of business from one Monday to the next, but dissimilar amounts of business when comparing Monday’s sales to Saturday’s. In other business settings where sales are more uniform across a week, setting a periodicity hint of 7 would make less sense.
                There are no parameters to control feature selection (which may not be used at all with Time Series, although it is difficult to tell from the documentation), nor is it advisable to set the three training holdout properties on each mining structure, since all you may be doing is deleting case data.[xvi] The closest equivalent Time Series has to training data are historical models, which are compared against each other to improve the accuracy of predictions. Since we have no future data to test the models on, we’re limited to using what’s already in our dataset. Testing a dataset against itself is of limited utility, since it’s practically tautological to gauge how well a model is trained solely by the training set it has already been fed. SSDM addresses this by dividing the data into separate models based on the HISTORIC_MODEL_COUNT parameter, which has defaults to a single model, then comparing them against each other. HISTORICAL_MODEL_GAP controls how they are divided, by truncating the data at the specified interval of time slices. The default is 10, but if you’re using it in tandem with HISTORIC_MODEL_COUNT, the goal is set it equivalent to the number of time slices you want to predict into the future.[xvii] As depicted in Figure 2, processing time roughly doubled between the third and sixth Time Series models, which had equivalent parameters except for HISTORIC_MODEL_COUNT, which was first set to its default then to 3. There was not much of a difference in performance for different HISTORIC_MODEL_GAP values. Oddly, performance decreased sharply when AUTO_DETECT_PERIODICITY was set to any value other than its default, regardless of whether it was lower or higher. Normally I test one variable at a time, but because of the extremely long processing times at the defaults, I set values for PERIODICITY_HINT, MINIMUM_SERIES_VALUE and MAXIMUM_SERIES_VALUE simultaneously and instantly received much better performance.

Figure 2: Performance Comparison for Various Trials of Time Series vs. Other Algorithms (click to enlarge)
TSResultsGrid

               Before the trials depicted above, I was using an incorrect time key, which caused msmdrv.exe to load about a gigabyte and a half of RAM immediately in the initial trials, which later topped out at about 5 gigs. During that time the CPU seemed to run on one core, without much IO traffic at all. After I fixed that problem, SSAS ran on all six cores[xviii] most of the time and RAM usage rarely got above 600 megabytes, with no IO resource consumption. Processing was still time-consuming though, with the ARTXP phases seeming to take the longest. It’s always a good idea to monitor new SSDM projects in Profiler, but it is especially true with Time Series, which is aptly named because it can take quite a while to finish processing. During the ARIMA phases, you’ll see messages like “Learning ARIMA Time Series model for ‘MyColumn7′ in historical model number 1 (8 of 16)” repeated for each predictable column, first on the Main Model, then again for each historical model. Once you see messages like “Learning Time Series tree for ‘MyColumn5′ in main model (9 of 30)” then you’re in the ARTXP phase. The instances where msmdrv.exe got hung up for hours and I had to terminate the process manually seemed to occur most frequently during the ARTXP phase when building the trees for the Key Time columns, for whatever reason. I assume that the blending of the two algorithms takes place after this, but if so, it never contributed much to the final processing times. Some of the processing jobs took so long that it consumed much of the time I had set aside to write this post, so I limited my experiment to the mining structure built on sp_spaceused. As I discuss in more depth in past posts, the data used in these trials come from roughly three days of polling six dynamic management views (DMVs) every minute, in the hopes of creating IO pressure in order to study the topic further, as well as to use a type of data DBAs might be more familiar with. For a refresher on the schema we’ve basically stuck to throughout the series, see A Rickety Stairway to SQL Server Data Mining, Part 0.2: How to Dig Out of a Data Mining Cave-In. I did not dare run Time Series trials on the mining structures based on dm_os_wait_stats and dm_exec_query_stats, which had 7.6 million and 3.9 million rows respectively, plus many more columns with large data types.
                I ran the first set of Time Series trials on those same denormalized schemas mainly for purposes of comparing its performance against a few select trials of other algorithms. Using a single denormalized view as the sole input table is not always the ideal way to design Time Series projects though. As I have tried to hammer home through this series, choosing the right Content type for your data is critical, because certain algorithms are limited in the ones that they will accept. Like Linear Regression, Time Series can work with Continuous attributes but not Discrete or Discretized ones; like all of the others except Sequence Clustering, it can accept Cyclical and Ordered data but simply treats them as Discrete values, rendering those Content types useless. Like many other algorithms, Time Series can use nested tables as both inputs and outputs, which may be useful in some projects depending on the natural schema of the data. The “key difference”[xix] is that Time Series requires a Key Time attribute to represent each unique time slice, which must come from a single numeric or date column. In other algorithms the case key is identified by the Key type, but it is optional with Time Series. In relational database lingo, it must have a 1:1 relationship with the Key Time column though. In previous projects I have used received quite good results using timestamps, such as single records for each day, as Key Time columns. In the first set of trials mentioned above, however, I was forced to use ID columns which were only roughly equivalent to proper time slices, thanks to such problems as the mysterious delays in the SQL Server Integration Services (SSIS) job that collected the data, which were recorded in the special MinuteGap column. To get results that would produce accurate projections of my data, rather than simple performance comparisons against other algorithms, a redesign of the schema seemed to be in order.
               It was then that I finally understood the usefulness of the second Key, which I didn’t initially grasp from the written discussions on Time Series schemas in the DM Team’s book and Microsoft’s documentation.[xx] The algorithm can accept data in two ways, one of which is a columnar format, i.e. a normalized representation of data in which each Key Time represents a unique value. Unlike other algorithms, however, it can also be fed the same data in an interleaved format, in which each Key Time is not unique. Records in this somewhat denormalized format are identified through unique combinations of both the Key Time and Key columns, in a sort of composite key. The flexibility of the second approach is that users can add new values for whatever state the Key represents at will simply by appending new rows, whereas in the usual columnar approach a schema change would be required to add new columns to hold the new state value. On the other hand, it is not possible to use the new states as predictable attributes or set other mining structure properties if they’re not identified as separate columns. Perhaps the choice of the correct schema is highly dependent on whether or not you’re getting your data from a highly normalized relational database, like I am in these tutorials, or from the denormalized star or snowflake designs commonly used in cubes. The right choice probably also depends highly on whether or not all of the possible values for a state column are known, or whether they are subject to rapid change, as might be the case in a real-time OLAP scenario or one in which previously unseen values are generated frequently.. Because I learned the relational side of SQL Server first, (where “today’s new row is tomorrow’s new column”) my first instinct is usually to go with the most normalized approach – to the point that I found myself counting how many more cells it took to represent the same data in an interleaved format in the BOL and DM Team’s examples. Furthermore, I only have a few dozen columns to deal with rather than the several thousand used in real-world professional projects; if the schema was that large I’d probably be using a snowflake or star schema cube anyways. I thus made a quick redesign of the schema exclusively for this post, based on a columnar format in which I took a select few of the most important columns from the six DMVs the SSIS job collected data into. Using six ugly PIVOT statements, I put six of the most important performance counters in their own columns depending on whether they applied to All databases, the Monitoring database or TempDB; the nine tables I monitored using sp_spaceused received their columns, populated with their respective values for the Data column; one column was set aside for the io_stall values for the 25 unique FileHandleIDs in the dm_io_virtual_file_stats table; the nine most important wait types got their own columns; and each of the seven unique SchedulerAddress values from dm_io_pending_io_requests were given columns with their corresponding io_pending_ms_ticks values. Since we’re close to wrapping up the segment of this series that deals with the nine algorithms, I also used the unique groups of queries identified several weeks ago by the Clustering algorithm to create a new QueryTextGroupID column in my schema. For example, some of the queries fell into a particular cluster because they applied only to SQL Server Agent background tasks, while others applied to queries I made on the Monitoring database during data collection. I identified 25 of these distinct groups, each of which was filled with their respective values from the dm_exec_query_stats columns last_physical_reads and last_logical_writes added together. I also tossed in the mysterious MinuteGap column and a CombinedSmallDateTime column that collected all of the time measures in the RecordTable in a single Key Time attribute.
                The results I received back from SSDM in this second set of trials were really quite interesting, although we don’t have time to go into them all. I gained a lot of insight throughout this series into how a real performance monitoring system might be properly built using SSDM, such as how to identify relationships between specific query types and various measures of IO, but never more than at the tail end of this final algorithm discussion. The main point of this series is though is to use this kind of data to illustrate how SSDM works – or how it doesn’t always work right, as was the case in my first few trials with this new schema. The first trial got hung up and the second only completed processing because I set many of the new columns to Ignore. Nevertheless, an old problem from the beginning of this series reared its ugly head again: the processing job finished, but the GUI was unable to retrieve the results without being hung up indefinitely, forcing me to terminate msmdrv after a few hours of waiting. This in turn crashed SSDT without warning. The same thing occurred after I removed many more columns in the third trial, even though there was substantial improvement in processing time. It wasn’t until I removed the nested table I had been using in favor of a single denormalized view that SQL Server was capable of representing the results in the GUI. Processing time also improved dramatically, to just over five minutes.

Figure 3: The Charts Tab with the Columnar Schema
TSOtherSchema

                The run chart in Figure 3 would not be difficult to interpret, except for the fact that I found too many useful results and ended up cluttering it with too many columns. The date values at the bottom represent minute and hour figures as well as the day and date, but the GUI did not depict them in this case. Time moves forward as you go from left to right on the horizontal axis, while the values for the various columns increase or decrease by the percentages listed vertically on the left. The Abs button can be used to toggle between percentages and absolute values, which can harder to read if you have many columns with widely varied ranges of values. It is often wise to use the dropdown control on the right to select only a small subset of columns with similar ranges of values, otherwise those with smaller scales will be dwarfed by others in the run chart, regardless of their significance. The Prediction Steps dropdown can be used to control how far ahead the run chart peers into the future, up to a maximum of 100 intervals. On a few occasions I have had to close projects and reopen them before SSDT or Visual Studio would actually show the new predictions in the GUI, but I didn’t notice that bug in this week’s trials. Beware the Show Historic Predictions checkbox; it has always instantly crashed any version of SQL Server or Visual Studio I’ve ever used, without warning, regardless of the project. I vaguely recall seeing it work once or twice long ago, but I may be wrong; the most accurate prediction I can make in this post is that it will probably crash SSDT the next time I accidentally click on it. I’ll put that on the list of bugs I ought to submit to TechNet someday. Figure 4 shows how the Mining Legend depicts the time slice and values for two columns for whatever point the user clicks within the run chart, using data taken from the earlier set of trials performed on the sp_spaceused mining structure. Figure 5 shows the same data, except with the Show Deviations checkbox selected so that the variation of the data from the trend line is included in the regression lines.

Figures 4 and 5: The Chart Tab Mining Legend and Show Deviations Checkbox
TrendLineProblems
ShowDeviations

                The Model Tab shows the ARTXP trees in a format very similar to the Decision Trees Viewer, which we have already discussed in detail and won’t explain again for sake of brevity. One of the few differences is that hovering over a node shows the ARIMA equation for the whole tree, which is the same for every node. Hovering over a leaf node also provides the tree equation, plus the coefficients and histograms in the Mining Legend. There is no Histograms dropdown, presumably because Time Series does not accept Discrete columns. As was the case with Decision Trees, the results are useful, but it some of the nonsensical relationships must be sorted out manually. For example, check out the comparisons for NumOfBytesWritten in Figure 6, which don’t tell us much. If NumOfBytesWritten is greater than or equal to a certain number, it’s not very useful to inform us that it is likely to be greater than or equal to another number in the following step. The output in this series did contain some useful splits, but I had to hunt for them.

Figure 6: The Time Series Model Tab (click to enlarge)
TSModelTab

               As always, if the information in the GUI is insufficient for your purposes, you can always dig deeper into the data with the Generic Content Tree Viewer. Because the output is so complex, I’ve included an example in Figure 7, in which there is one node in the tree to the left for each predictable attribute for ARTXP, and another row for each predictable attribute for ARIMA, with the ARTXP nodes at the top and the ARIMA ones at the bottom. Relating the ARTXP and ARIMA nodes would be an apples and oranges comparison, so don’t bother trying. There is one NODE_TYPE #16 for each predictable attribute, by followed by leaf nodes and/or interior branches, with leaf nodes at the end of any interior branch. There is also one NODE_TYPE #27 (i.e. an ARIMA Root) for each predictable attribute in ARIMA nodes, with multiple type #28 nodes for each periodic structure, each of which contains a single type #29 node for an auto-regression equation and a single type #30 for a moving average.

Figure 7: An Example of the Generic Content Viewer with the Times Series Algorithm (click to enlarge)
TSGenericContentViewer

                As usual, Time Series uses the same columns for its output that other algorithms do, but assigns different meanings to them. The meaning even varies from one row to the next, depending on its NODE_TYPE. As I have mentioned in the past, this denormalized structure is useful in the sense that it allows SSDM to represent the disparate output of its algorithms together, much like a produce stand can be used to hold both apples and oranges. The price to be paid for that is added complexity, thanks to the high level of denormalization, which also includes a nested NODE_DISTRIBUTION table in each row. Time Series uses the standard ATTRIBUTE_NAME, ATTRIBUTE_VALUE, SUPPORT, PROBABILITY and VARIANCE columns within it, but these can vary in meaning from one row to the next depending on their VALUETYPE flags. The NODE_DISTRIBUTION table is blank with ARTXP except in the leaf nodes, where the first row is always a VALUETYPE 11, for the Y-intercept. The rows that follow have a VALUETYPE of 7 for the coefficient or 9 for the Statistics. A lot of these concepts should already be familiar, given that we’ve already covered how they are depicted in the Generic Content Tree Viewer in the previous tutorials on Linear and Logistic Regression. Each NODE_TYPE #27 ARIMA root seems to have one VALUETYPE 11 row for the intercept, then one more for each NODE_TYPE #28 periodic structure within it. Each of the type #28s seems to have exactly one row apiece in its NODE_DISTRIBUTION table for periodicities, auto-regressive orders, difference orders and moving average orders, in that order. The SUPPORT, PROBABILITY and VARIANCE for all of these always seems to be 0, so the attribute-value pair is the distinguishing characteristic;  this is true of both type #27 ARIMA roots and type #28 periodic structures.  BOL says the complement of the coefficient is included in the data returned by the auto-regression, but I have yet to see it, unless the complement is the only value returned. It also says that it is possible to retrieve “the equation used to blend the algorithms,” but I have yet to find it.

Figure 8: Metadata for TS (adapted from Books Online as always – click to enlarge)
TSMetadata

                Obviously, I have much more to learn about the nine SSDM algorithms, particularly Sequence Clustering and Time Series. This is especially true when nested tables are involved, since they still seem to be my Achilles Heel. I’ve covered the basics of the algorithms sufficiently, however, that we can move on to the more advanced step of retrieving results directly through Data Mining Expression (DMX) queries. Instead of relying on the Generic Content Tree Viewer to tell us what we need to know, we can perform tasks like directly retrieving ARIMA equations or even reassembling ARTXP calculations, which BOL says can be difficult, “because information for each split is in a different place within the tree. Therefore, with an ARTXP model, you must get all the pieces and then do some processing to reconstitute the complete formula. Retrieving an equation from an ARIMA model is easier because the formula has been made available throughout the tree.” DMX can even be used to bypass the Generic Content Tree Viewer and other SSDM visualizations altogether, by exporting all of the data into other tools. In the fourth installment of our upcoming segment on DMX, I’ll explain how to import all of this denormalized data into a normalized relational database, where we can use T-SQL to slice and dice it with much greater ease. From there, it can even be imported into SSAS cubes, where it can be sifted through with Multidimensional Expressions (MDX). Both of these languages have far richer Data Manipulation Language (DML) capabilities than DMX, which is also quite limited in terms of its Data Definition Language (DDL). After several weeks of being force-fed a lot of thick information on some of SSDM’s most complex algorithms, the upcoming tutorials on DML and DDL should be a nice change of pace. In the third installment I will delve into the primary use for DMX, which is to perform prediction queries, particularly on Time Series data. There may be a delay in this series for a week or two while I recuperate from surgery for the antibiotic resistant sinus infection I mentioned back in December, but I don’t need Time Series to forecast that the lag will be short.


 

[i] The new windowing functions in SQL Server 2012 can simplify the calculation of moving averages, as described in Ben-Gan, Itzik, 2012, Microsoft SQL Server 2012 High-Performance T-SQL Using Window Functions. O’Reilly Media, Inc.: Sebastopol, California.

[ii] I found the abbreviation at the blog of Jamie MacLennan, one of the leaders of Microsoft’s Data Mining Team in the last decade. See MacLennan, Jamie, 2007, “New Time Series Features for SQL Server 2008,” published Oct. 22, 2007 at Jamie’s Junk. Available at http://blogs.msdn.com/b/jamiemac/archive/2007/10/22/new-time-series-features-for-sql-server-2008.aspx

[iii] Meek, Chris; Chickering, David Maxwell and Heckerman, David, 2002 , “Autoregressive Tree Models for Time-Series Analysis,” published at the Microsoft Research website and available for download at http://research.microsoft.com/en-us/um/people/dmax/publications/dmart-final.pdf. Presented in the Proceedings of the Second International SIAM Conference on Data Mining (SIAM) at Arlington, Virginia, April 2002.

[iv] This done by setting multiple columns to Predict, which I have done throughout this series anyways regardless of the algorithm. For a short but enlightening discussion of cross-prediction, see Shuvro Mitra’s posts in the SQL Server Data Mining Forum thread “Time Series to Support Cross Prediction,” posted Oct. 28, 2008 at http://social.msdn.microsoft.com/forums/en-US/sqldatamining/thread/5a0d7196-9c1a-4ed6-b293-2d006042d5b2/

[v] p. 9, Meek, et al.

[vi] See the MSDN webpage “Mining Model Content for Time Series Models (Analysis Services – Data Mining)” at http://msdn.microsoft.com/en-us/library/bb677217.aspx

[vii] I’m paraphrasing a comment on p.12, Meek, et al.

[viii] p. 282, MacLennan, Jamie; Tang, ZhaoHui and Crivat, Bogdan, 2009, Data Mining with Microsoft SQL Server 2008. Wiley Publishing: Indianapolis.

[ix] Jenkins “was a jazz and blues enthusiast and an accomplished pianist,” according to the Wikipedia webpage “Gwilym Jenkins” at  http://en.wikipedia.org/wiki/Gwilym_Jenkins. Any statistician who can jam is a friend of mine. Box’s work during the war may seem repellent in the present age, now that chemical weapons are taboo and groups like PETA protect animal rights so rabidly, but at the time there was considerable fear that Hitler would use chemical weapons against Britain. So I would not include him in the long list of mathematicians and professionals in the hard sciences who have gone mad, wasted their talents on junk science or became terrible philosophers, as I have mentioned throughout this series. That might apply, however, to critics who base their censure on the inhuman idea that animal life is worth as much as human life. See the Wikipedia webpage “George E. P. Box” at http://en.wikipedia.org/wiki/George_Box

[x] See the Wikipedia webpage “Autoregressive–moving-average model” at http://en.wikipedia.org/wiki/Autoregressive_moving_average_model. Also see Box, George E.P. and Jenkins, Gwilym M., 1971, Time Series Analysis: Forecasting and Control. Holden-Day: San Francisco. I have not had the chance to read the latter yet.

[xi] For further background, also see the Wikipedia webpages  “Box–Jenkins” at http://en.wikipedia.org/wiki/Box%E2%80%93Jenkins,  “Corellogram” at http://en.wikipedia.org/wiki/Autocorrelation_plot and “Autocorrelation” http://en.wikipedia.org/wiki/Autocorrelation

[xii] For a quick explanation of differencing, see Prof. Bob Nau’s notes for the Forecasting – Decision 411 course at the Duke University webpage titled “Stationarity and Differencing,” available at http://people.duke.edu/~rnau/411diff.htm

[xiii] See the MSDN webpage “Microsoft Time Series Algorithm Technical Reference” at http://msdn.microsoft.com/en-us/library/bb677216.aspx

[xiv] p. 282, MacLennan, et al. For the comment about the FFT, see See Strang, Gilbert, 1994, “Appendix I: Wavelets,” pp. 250-255 in American Scientist No. 82, April, 1994. I’m not sure what version of the FFT is used by SSDM, but the most common one is the recursive Cooley-Tukey algorithm. See the Wikipedia webpage “Fast Fourier Transform” at http://en.wikipedia.org/wiki/Fast_Fourier_transform.

[xv]  IBID., pp. 288-289.

[xvi]  IBID., p. 271.

[xvii] IBID., pp. 287-288.

[xviii] My poor beat-up development machine uses an AMD CPU. Microsoft’s documentation warns that SSDM can return quite different results for Itanium processors when they are fed the same datasets, thanks to their notorious floating point processing problem. I can’t comment on that issue because I have never run on SSDM on an Itanium machine.

[xix] Pun intended.

[xx] See the MSDN webpage “Microsoft Time Series Algorithm” at http://msdn.microsoft.com/en-us/library/ms174923.aspx. Also see pp. 265-266, MacLennan, et al.

A Rickety Stairway to SQL Server Data Mining, Algorithm 8: Sequence Clustering

by Steve Bolton

               In last week’s edition of this amateur series of self-tutorials[i] on SQL Server Data Mining (SSDM), we covered Clustering, an algorithm with an exceptionally wide variety of uses in comparison to the other eight Microsoft includes with the product. Sequence Clustering may be derived from that data mining method, but it contrasts sharply with ordinary Clustering in that its range of uses and the kind of data it can operate on are greatly constrained. It makes use of the Expectation Maximization (EM) type of Clustering discussed in A Rickety Stairway to SQL Server Data Mining, Algorithm 7: Clustering but includes another processing step to link the groups it creates together by paths. One of the most common uses for it cited in the literature are protein sequencing of DNA[ii], which has a physical but not a temporal order, yet most of the other popular applications of Sequence Clustering imply some kind of ordering by time. It differs from Time Series, the last algorithm we will cover in this series, in that it is well-suited for analyzing data that is separated by undefined, unranked measures of time, whereas Time Series deals with fixed intervals. One of the reasons I have written this series as I learn the ropes is to demonstrate to other SQL Server professionals just how useful this terribly neglected tool can be for those with little statistical training, even when applied to purely relational or OLTP databases. One of the most practical uses that non-OLAP DBAs can put this specific SSDM algorithm to is harvesting the database server and Windows Application, System and Security logs, in order to identify events that might predict a particular adverse event, such as an eventual server failure. It is ideal for tracking events of this sort which often do not occur at predictable, fixed intervals. Another popular use for Sequence Clustering is clickstream analysis, to discover the paths users are most likely to take; faulty webpages can even be identified by looking for paths users are least apt to traverse on a website. Just as Clustering can be used for anomaly detection (which DBAs may find quite useful to identify bad data), Sequence Clustering can point out anomalous, uncommon paths. It is not used for prediction as frequently as Time Series, but it can be directed to predict the next path an object is likely to take.
               It can also be used in the market basket scenarios that Association Rules is designed to solve, with the added benefit of preserving ordering of the transaction items. As discussed in A Rickety Stairway to SQL Server Data Mining, Algorithm 6: Association Rules, that particular mining tool is limited to certain specific mining scenarios, requires a quite steep learning curve and is quite limited in the range of data it can accept. Sequence Clustering is not only useful for some of the same scenarios, but shares similar limitations, albeit for different reasons. Technically, Sequence Clustering accepts a larger range of the Content types offered by SSDM than any other algorithm, which represent a crucial distinction in the meaning of each mining model column. For a refresher on that topic, or how to perform basic tasks like setting up a mining project and its structures and models, see post 0.0, post 0.1 and post 0.2. It is the only one of the nine algorithms that accepts the Cyclical and Ordered types without a warning in Books Online (BOL) that the values will simply be treated as Discrete, without any special processing, which is pointless. In this week’s trials on the IO data we’ve been using for examples throughout this series, I was able to assign the Cyclical type to the Minute, Hour and DayOfWeek columns, all of which fit the bill since they repeat themselves at precise intervals. The number of clusters for that mining model increased to six, all with case support over 100, and processing time went up about 20 percent, but the results weren’t as good. There was a marked effect on the clusters the algorithm produced, but it seemed to treat the values for these columns as Discrete; I ended up with results which were cluttered with 60 separate attribute-value pairs for Minute and another 60 for Hour, which told me nothing useful. The documentation on the Cyclical and Ordered types is so thin in BOL and the other mining literature that it is difficult to ascertain exactly how they are processed with Sequence Clustering; even Data Mining with Microsoft SQL Server 2008,  the best reference ever published on SSDM, mentions it only once at a glance.[iii]  Unlike Association Rules and several other SSDM algorithms, Sequence Clustering can at least make use of the two leading Content types, Continuous and Discrete.
               Its major drawback (which almost amounts to a fatal flaw) is that it requires a single input table with a column set to the Key Sequence type, which uniquely identifies each row. As discussed in A Rickety Stairway to SQL Server Data Mining, Part 0.2: How to Dig Out of a Data Mining Cave-In, I’ve had my share of troubles with SSDM’s nested tables, which can be difficult to work with. I solved a lot of those problems in the tutorial on Association Rules, which depends heavily on nested tables, but the topic becomes murkier to the uninitiated with Sequence Clustering, thanks to the special way it handles them. Association Rules has a case key, i.e. a relationship defined in the Data Source View (DSV) a mining structure comes from, which is practically identical to a foreign key in the relational database world. Yet it also has a nested key to identify one measure in a nested table that you want to analyze. This is not the same as the case key, which you don’t even need to include in your mining models. If you learn Association Rules first, then the way Sequence Clustering handles nested tables may be a bit confusing. There is no nested key, only a mandatory Key Sequence column that uniquely defines each row in the nested table. It is the only algorithm which makes use of this Content type, which is like a case key in Association Rules, except it is mandatory. Association Rules allows multiple nested tables with multiple columns in addition to the nested keys, but with Sequence Clustering you’re limited to a single nested table with a mandatory Key Sequence column and a single column for the measure you want to analyze. I’ll refer to the latter as the state column, for lack of a better term. To make matters worse, that measure must use the Discrete Content type. The parent of the nested table can have other measures, but its nested table can only have one, so unless you artificially aggregate all of your other dependent table measures somehow and include them in the parent – which will inevitably decrease the granularity of its information content – then you’re limited to analyzing a gutted shell of your original dataset.
               Throughout this series, we’ve been experimenting on data taken from about three days of polling dm_exec_query_stats, dm_io_pending_io_requests, dm_io_virtual_file_stats, dm_os_performance_counters, dm_os_wait_stats and sp_spaceused each minute, in the hopes I could simultaneously learn more about IO bottlenecks while using measures working DBAs might be more at ease with. I later cut back to three denormalized views that left joined dm_io_pending_io_requests, dm_io_virtual_file_stats, dm_os_performance_counters and a parent RecordTable of time measures to sp_spaceused, dm_os_wait_stats and dm_exec_query_stats respectively. Because of the requirement of a single nested table with a single non-key attribute, comparing the performance of Sequence Clustering against the results of other algorithms as we’ve done in the past would be pointless. Since these strictures make Sequence Clustering a world unto itself, the best we will be able to do is compare its parameter settings against each other. This would be an opportune time to take a closer look at certain Discrete measures in our dataset, such as the WaitTypeID column of dm_os_wait_stats and the five columns in dm_exec_query_stats that can be used to uniquely identify particular queries, to see if they are associated with particular bottleneck stats. Our parent RecordTable, however, had mostly time measures, with all of the other measures of IO pressure being included in other nested tables we can’t include; the only measure in the parent which would be of particular interest without them would be MinuteGap, which tracked a mysterious increase in the intervals between runs of the SQL Server Agent job that took care of the data collection process. Furthermore, some of the Discrete measures in the dependent tables are of little use to us, like the wait figures, which occurred at predictable intervals but were accompanied with numerical values – which are not only Continuous, but can’t be included because we’re limited to just one column aside from the case key. The best we would be able to do is to treat each record where a single WaitTypeID passed a particular threshold as a distinct event to track, but we wouldn’t be able to compare it against other waits, or even against other values for the same wait type. This leaves us with the five columns in dm_exec_query_stats that uniquely identify queries: QueryTextID, SQLHandleID, PlanHandleID, QueryHandleID and PlanHashID. I picked the first of these for this week’s trials because I saved the text of the queries during the data collection process, but forgot to collect the particular query plans the other measures point to; theoretically though, if not for that oversight, we could have easily correlated performance bottlenecks with particular query plans. Instead, we will have to look for correlations with particular query texts. Even after all this, we still face another limitation, in that each of these Discrete values must have a 1:1 relationship with the Key Sequence column; sometimes the same query was run multiple times in a single polling interval, so I had to reduce the granularity of the data further by adding a T-SQL query in my DMV that asked a Boolean Yes-No question of whether or not a particular query was run in a polling interval.[iv] The count of how many times it occurred in each interval was probably significant, but we had to discard all of that information. We could just skip the whole mess and run Sequence Clustering without a sequence column, but in that case all we end up doing is creating EM clusters, without all of the extra parameters discussed in the Clustering article.
               Furthermore, certain errors are common with Sequence Clustering, which can be frustrating to deal with because of their unintuitive behavior and lack of documentation. More than a year after I posted the first installment, my series of posts labeled An Informal Compendium of SSAS Errors remains one of the only centralized sources of information for certain SQL Server Analysis Services (SSAS) errors, if not the only one on the Web for some of them. There are no hits at all on Google, for example, for the third error listed in Figure 1, which mentions the most common errors I have encountered over the course of the last couple years of struggling here and there with Sequence Clustering. I’ve counted at least two situations in which this can occur: first, you may really be using the wrong Content type, which SSDT will catch only part of the time, or you have more than one state column in your structure. The latter mistake can also lead to the second error in the list, even though it is worded to suggest that the problem stems from having more than one Key Sequence column. Some of these errors are badly worded, such as the fourth one, which can occur if the values for your state column aren’t unique for each case key; that can throw off the sort orders of the internal queries SSDM uses, but the ordinary users wouldn’t know that unless they dug deep into the Profiler data or logs. The fifth error occurred to me repeatedly because I was in the habit of copying and pasting old mining structures when creating new ones, just to save the time of entering all of the same information again, but SQL Server Data Tools (SSDT) sets your nested table and its columns to Ignore rather than Predict, PredictOnly or Input when you do this. Another potential glitch is that SSDT sometimes allows projects to be deployed and processed in which the state column is set to some disallowed value, like Discretized or Continuous. This can lead to outrageous performance bottlenecks; for example, I caught this error after some trials I ran using SQLHandleID as my state column crashed msmdrv.exe after consuming 5.4 gigs of RAM and processing endlessly on one core for hours. Somehow, SQL Server had failed to catch the fact that the SQLHandleID column had been set to Continuous rather than the mandatory Discrete value, then deployed and processed the project many times. After I set the attribute to Discrete, it processed within minutes with very little load on the server; once I tried to change it back to Continuous, or to other values like Discretized, SQL Server raised the errors it should have in the first place. The same error also occurred when the Content type of one of my structures was set to Key rather than Key Sequence, which SQL Server should have theoretically disallowed. If you get massive memory consumption while running on one core with this algorithm, check the Content types of your nested table columns to make sure they’re set to Key Sequence and Discrete, because SQL Server may not have caught these mistakes. Some of these are actually glitches in the program itself, which I probably really ought to report to TechNet, maybe after this series is done.

Figure 1: Common Error Messages with Sequence Clustering and How to Fix Them
SQErrors

               These errors and other difficulties mean that Sequence Clustering has one of the highest learning curves among the nine algorithms, behind Association Rules. These are the only two mining methods that I have yet to return useful results from immediately at the default values for their parameters. On the other hand, Association Rules is a bit simpler in certain respects, because its inner workings are much less complicated to understand – so if an explicit understanding of how SSDM arrived at a particular conclusion is critical to you, that would be the ideal choice for market basket analysis. The innards of Sequence Clustering are much more difficult to interpret, although they are nowhere near as inscrutable as those of neural nets, as I describe a little more colorfully in A Rickety Stairway to SQL Server Data Mining, Algorithm 5: The Neural Network Algorithm. I find it is easier to explain what little I know about stats in terms of the building blocks divided by their functions, which are brought together in particular combinations suited to the mining problem at hand. In this case, the first major building block is the EM version of Clustering, which we covered in depth last week. Sequence Clustering builds upon it by adding a second phase, in which a transition matrix is created full of probabilities that one cluster will be followed by the next.[v] One of the purposes of this whole series is to show that you don’t need a doctorate in statistics in order to gain valuable information from SSDM with a minimal investment of time, energy, training and server resources, just as you don’t need to write a dissertation on combustion engineering to drive a car. Microsoft is essentially failing to capitalize on a vast market for its data mining tools among moderate and low-skill information workers because of the incorrect assumption that you need to know how to write equations to use it; the indispensable elements are the ability to interpret the results, manipulate the parameters and set up mining models that return useful results with minimal performance impact. For this reason, I won’t get into a discussion of how transition matrices are formed, although I would really like to learn matrix math, since it is highly useful with neural nets, the area of data mining I’m most interested in. I also won’t get into the math of Markov chains, which form the backbone of the sequencing side of the algorithm. I don’t want to put readers to sleep any more than I already have to by including the equations behind transition matrices and Markov models, not when clusters are so easy to visualize using the tools Microsoft has provided, as we shall see.
               The inventor of this ubiquitous statistical tool was Andrey Markov (1865-1922), the most notable member of a family of Russian mathematicians. As I’ve noted before, some of the early history of math was speckled with bad judgment, junk science and even bad ethics on occasion, but in Markov’s character was exemplary – just like several other Eastern European mathematicians we’ve mentioned in the last few weeks, who went unsung because they toiled behind the Iron Curtain.. He may have succumbed to the dry allure of that nasty brand of atheism which was an intellectual fad of the European intelligentsia in the late Victorian Age, but he demonstrated bravery by refusing to cooperate with the tsarist government’s investigations of student dissidents following the failed 1908 revolution. He was removed from his office as a result of his principled stand for a good cause.[vi] Among his contributions before this persecution were Markov chains, which are widely used for many statistical purposes today beyond the narrow world of Sequence Clustering, especially in bioinformatics. In the simplest Markov models, the probability of one state leading to another is calculated independently of the states that came before it, but SSDM’s application of it in Sequence Clustering uses nth-order Markov chains, in which Bayesian probabilities (which I introduced in a typically slapdash fashion in A Rickety Stairway to SQL Server Data Mining, Algorithm 1: Not-So-Naïve Bayes) are applied that take into account prior states.[vii] In other words, SQL Server will take into account the possibility that Cluster X will lead to Clusters Y or Z depending on whether it arose from Clusters A or B. As the number of potential prior and future states the algorithm must take into account multiplies, the number of calculations SQL Server must perform grows exponentially rather than linearly; furthermore, the transitions matrices become proportionally sparser and therefore lose significance, thanks to the dreaded “curse of dimensionality” we discussed in the article on neural nets.[viii] This quickly can lead to the bogeyman feared by all data miners: overfitting, i.e. decreased performance in return for cluttered, nonsensical or misleading results.
               The Data Mining Team’s aforementioned book says that an excessive number of potential states can lead to matrices that are cluttered with too many transitions of low probability, which can be controlled by “storing only those probabilities that are above a certain threshold” but I am uncertain if this is possible using the four parameters provided with Sequence Clustering.[ix] It is certainly possible to control the growth of the transition matrices with them, but they seem to cap its growth by the number of clusters, cluster cases and states that the probabilities are generated from, not by eliminating low probability states after the fact, which is the way I read that comment. CLUSTER_COUNT works just like with the ordinary Clustering algorithm we discussed last week, by capping the number SQL Server generates during the clustering phase of processing. Likewise, the MINIMUM_SUPPORT parameter we’ve seen numerous times in this series sets the minimum number of cases (i.e. rows) needed to create a cluster. Both of these parameters default to 10 as usual, whereas MAXIMUM_STATES defaults to 100, as is typically the case with other algorithms. We’ve already discussed that parameter many times, but there is one slight difference in behavior here: to get rid of cardinality warnings, you must set MAXIMUM_SEQUENCE_STATES to 0 rather than MAXIMUM_STATES to 0, which would normally be the case. This parameter defaults to 64 and is unique to Sequence Clustering because it “Specifies the maximum number of states that a sequence can have,” as BOL puts it so succinctly. It also warns that setting it to more than 100 can lead to cluttered results, i.e. one of the twin horns of overfitting. For whatever reason, Microsoft doesn’t provide any of the other parameters available with ordinary Clustering, which would be helpful in controlling the first phase of processing. The usual MAXIMUM_INPUT_ATTRIBUTES and MAXIMUM_OUTPUT_ATTRIBUTES found with most other SSDM algorithms aren’t available either, perhaps because the sequencing phase doesn’t use feature selection and the clustering exclusively uses the Interestingness Score, which is not affected by these parameters. BOL says with the former that “Feature selection is not invoked; however, you can control the behavior of the algorithm by setting the value of the parameters MINIMUM_SUPPORT and MINIMUM_PROBABILIITY,” but there is no MINIMUM_PROBABILIITY parameter, nor can you add one manually in SSDT without receiving an error during deployment. As usual, we will avoid setting the MODEL_EXISTENCE_ONLY and NOT NULL mining model flags in this week’s trials, since our data doesn’t have any nulls and we don’t want to reduce our data to a dichotomous choice between Missing or Existing states.
                As you can see from Figure 2, all of the processing times were quite fast at a wide variety of parameter settings. The most noteworthy change was the elimination of cardinality warnings by setting MAXIMUM_SEQUENCE_STATES to 0. Keep in mind that these results are in no way comparable to the trials we’ve done in prior weeks. We’re only evaluating the 2,586 cases in the master RecordTable and just one state column and the key from the nested table dm_exec_query_stats, not all of its 40 columns. Moreover, the DISTINCT clause I used to make the values for the state column QueryTextID unique undoubtedly eliminated all of the rows where the same query occurred more than once during a polling interval, so we’re not even dealing with the full 1,105,650 rows in that dependent table either. The numerous strictures on Sequence Clustering structures mean we really can’t compare its performance with other algorithms at all, unless we were to edit most of our previous mining models to match the structure used here. It is interesting to note, however, that the twelfth model in the list took longer to retrieve in the GUI than it did to process it, which is unusual. It produced six clusters at first, two of which had just one case apiece, which is where MINIMUM_SUPPORT comes in handy. Theoretically this shouldn’t have happened since the default is supposed to be 10, but setting it to 100 had the expected effect of eliminating the two extra clusters, bring the total to four in the next model.

Figure 2: Performance Comparison by Sequence Clustering Parameters (click to enlarge)
SQParameterResults

                Comparing the usefulness of the results is also difficult, since almost all of our key measures of IO were in other dependent tables we can’t include in our structure. In retrospect, I could have gleaned much more useful information by doing some proper aggregates on our IO measures and included them in the parent RecordTable, but I’m usually reluctant to do that, since reducing the granularity of data directly empties it of information content; furthermore, since we didn’t aggregate our data in previous weeks, we still wouldn’t be able to compare our results to those of other algorithms. Almost all of the measures in RecordTable were related to time, except MinuteGap, which reflected the growth of a strange gap between runs of the Agent job that performed the data collection. Initially it ran every minute as directed, but over the course of three days it slowly crept up to two minutes, then three, then four, without any corresponding warnings or job failures in the server logs. The most useful correlations we could look for with our stripped-down mining structure would be between MinuteGap and specific queries, especially ones that followed each other in a particular order.
                This is exactly what I found, particularly in the thirteenth model. It was cleanly divided into six clusters, four of which had high values for MinuteGap. The relationships between them are depicted in Figure 3, which is a simple Cluster Diagram of the exact same kind as the one introduced in last week’s article. The Cluster Discrimination, Cluster Profiles and Cluster Characteristics tabs all work the same as described there, except that they may have additional values for Start and End to indicate that a particular value leads to the beginning or end of the transition matrix. This week we also have an additional tab, State Transitions. It’s actually quite easy to understand, despite the clutter in Figure 4, which stems merely from the fact that there were so many different queries in our dataset. The number in each rectangle represents the numerical QueryTextID I assigned to each distinct query in order to normalize dm_exec_query_stats at the beginning of these trials. Those with the highest case support are shaded in blue, those in the middle in grey and the least common ones in white. The lines signify a transition from one state to the next, i.e. from one query to the next in the case of our data, while the numbers next to them indicate the probability of the link. The probabilities can be hidden using the ShowEdgeLabels checkbox. The Strongest Links slider can be used to add or remove links from the Sequence Clustering Viewer based on their likelihood, while the Cluster dropdown can limit the clusters depicted.

Figures 3 and 4: Cluster Diagram and State Transitions for Model 13
Model13
StateTransitionsTab

              This model had the clearest and most useful relationships, but the gist with the same with its brethren, all of which seemed to divide clusters along two different lines: low vs. high values for MinuteGap, or transitions that began or ended with different query types. After looking up the texts of the most frequently cited queries, I found that almost all were on msdb or otherwise involved SQL Agent job steps, which seemed to confirm my initial suspicion: the MinuteGap mystery was probably a result of the performance bottleneck caused by the Agent jobs themselves. Furthermore, the Agent-related jobs often divided into two separate categories, one for background Agent processes that I’m usually blissfully unaware of, plus the data collection queries I wrote myself. The first group included queries on dbo.sp_verify_subsystems and sp_help_targetserver, plus regularly polled sys.configurations to see if Agent XPs were enabled in sp_configure, for example. To my surprise, when I checked some of the clusters with the lowest case support that seemed unconnected to these major groupings, I found the text of queries that I had run during the data collection process for activities that had nothing to do with this project at all. SSDM had segregated these queries off in a corner, just as it should have, without any conscious intervention on my part. On top of that, certain stored procedures like sp_help_jobhistory_full and sp_sqlagent_log_jobhistory were grouped together in a way that suggested the many checks I made of the progress of the Agent job in the server logs were segmented as well. I didn’t find the transitions that the algorithm returned as useful, but I think that is mainly because certain repetitive jobs, like the polling job I ran every minute, tended to overshadow other meaningful connections. In all of these models, the most common end point was to the main query in that job, but many of the links other queries had to it almost certainly had artificially high probabilities simply because that query ran every single minute, while they did not. Nevertheless, I did spot some vaguely defined transitions between queries that you might find on a real server, such as one particular Agent background process being followed by another with a fair amount of consistency. I don’t know enough about Agent’s inner workings to state with any certainty that the background queries I saw ought to follow each other, but the data gave me an excellent starting point with which to check into the matter further. This is where much of the untapped potential in SSDM resides, especially for DBAs. It’s not going to perform magic or take DBAs out of the loop, but it can greatly speed up the process of identifying things like possible bottlenecks or relationships between queries that a DBA might not have time to spot, regardless of how seasoned they are. There are simply too many variables out there for any DBA to name, let alone track in their heads, but SSDM can do this kind of grunt work. It can’t make decisions for you, but it can alert people with domain knowledge to items of interest they might not otherwise have spotted. When properly designed, sooner or later SSDM is bound to call attention to an important relationship even an expert in any field would have missed.

Figure 5: Metadata for Sequence Clustering (adapted from Books Online as usual)
SQMetadata

                As always, if the SSDT visualization tools do not provide enough detail for you, it is always possible to dig deeper into the raw data the visualizations are built from, using the Generic Content Tree Viewer. As discussed in each previous article, SSDM uses a common metadata format that is flexible enough to represent the disparate output of each of the nine algorithms, which is like comparing apples and oranges; I look at the format like a produce stand, which is capable of holding both apples and oranges side by side, so to speak. The price to be paid for this, however, is a degree of denormalization that relational DBAs probably aren’t accustomed to. The same column names are always used, but their meaning changes in subtle ways between algorithms. They even vary in meaning from one row to the next within each algorithm, depending on the NODE_TYPE value. In the case of Sequence Clustering, we have four of these to take into account, the first of which is the single root of the model at the top of the Generic Content Tree Viewer, as usual. Below that we will find a single NODE_TYPE 5 node for each cluster in our model, plus an additional NODE_TYPE 13 node labeled “Sequence Level for Cluster n,” where n is the maximum number of clusters we have plus one. Each cluster also has one of these sequence nodes beneath it, which represent transition matrices. Each of these transition matrices in turn has a long list of the transitions for the state column, which are represented by type 14 nodes. As usual, the most important information for all of these node types is found in the NODE_DISTRIBUTION table, which is not only nested but varies in meaning from one row to the next depending on its VALUETYPE code. Fortunately, in the case of Sequence Clustering, we’re only dealing with two different values for this flag, 1 for Missing and 4 to indicate a Discrete column, which should be easy to understand for readers who have followed the rest of this series. We must also take into account the ordering of the rows when our NODE_DISTRIBUTION table is within a NODE_TYPE 14 transition node, because the first row will represent the starting state and the following ones successive states, according to BOL. I assume that this means that all of the following rows are ordered from first to last, but I have yet to verify this empirically. In practice, what you’re likely to see when you check inside the Generic Content Tree Viewer are attribute-value pairs for your state column in both the transition matrices and their individual transitions, alongside figures for the probability, variance and count of cases for that particular attribute-value pair. The stats for the columns in your parent table will only be found in the cluster nodes and the model root. To date I have not yet used Data Mining Extensions (DMX) queries to retrieve this information directly from the mining models, but after perusing the Generic Content Tree Viewer, I can see how it might easily turn up useful information on particular transitions. It is easier to use this Viewer than with other algorithms, but the drawback is that there are hundreds of transitions in each cluster’s transition matrix, which would make it difficult to find the specific information you’re looking for. Using a DMX query, however, we would be able to return all of the most likely transitions for a particular QueryTextID.
                DMX queries are an advanced topic that we will not delve into unto after we’ve finished surveying all nine algorithms, which we will wrap up next week with Time Series. One of the reasons I have yet to use DMX to find specific transitions in Sequence Clustering models is that there are far too many constraints on the data you can input, which severely crimps the algorithm’s utility. My wishlist for future editions of SSDM now includes upgrades to Sequence Clustering so that users can input more than one nested table, plus analyze more than one state column within each. It would also be much more fruitful if we could at least use the Discretized Content type for other types of columns, or maybe even full-fledged Continuous columns. All of these limitations mean that more often than not, Sequence Clustering ought to be left for the end of a data mining workflow, once you already have quite specific ideas of what kind of events you want to search for in your data. Time Series thankfully has fewer such limitations, although it is limited only to Continuous attributes. Sequence Clustering is typically used for temporal data, but this is apparently not always the case, given that one of the problems it is most commonly applied to is protein sequencing. Time Series is designed specifically with temporal data in mind, although it deals with specific intervals of time rather than events with undefined gaps between them, as Sequence Clustering does. Time Series may have a wider range of uses and fewer limitations than Sequence Clustering, but it would be a mistake to see it is a “better” algorithm, because it is meant to address a complementary set of data mining tasks. The object in data mining is to select the right tool for the job. When investigating data at specific temporal intervals, that tool is usually Time Series, especially if you want to perform the highly valuable but equally risky task of predicting the future, based on the past.


[i] For a decent professional tutorial on this subject, see the AdventureWorks exercise on Sequence Clustering at Microsoft’s Technet website, titled “Lesson 4: Building a Sequence Clustering Scenario” at http://technet.microsoft.com/en-us/library/ms167594.aspx. It is difficult to find a comprehensible lesson on the subject, but this is one of the best available on the Internet.

[ii] A specific example is the National Institutes of Health (NIH) project at Indiana University’s Cloud Computing for Health Research website titled “Sequence Clustering: Pipeling Applications to Classify Biological Sequences,” which can be found at http://salsahpc.indiana.edu/nih/index.php/Sequence_Clustering I ran across a lot of sources that cited protein sequencing as an example of how it can be applied bioinformatics, which is just a fancy buzz word for developing computerized methods for biological analysis. Quite often, it just means applying computers to problems in the health field.

[iii] MacLennan, Jamie; Tang, ZhaoHui and Crivat, Bogdan, 2009, Data Mining with Microsoft SQL Server 2008. Wiley Publishing: Indianapolis.

[iv] In case the exact syntax might illustrate the problem, here’s the full query. The DISTINCT clause is the most important part to keep in mind:

CREATE VIEW [Monitoring].[QueryTextIDView]
AS
SELECT DISTINCT  TOP 9999999 ROW_NUMBER () OVER (PARTITION BY 1 ORDER BY RecordID) AS ID, RecordID, QueryTextID
FROM [Monitoring].[dm_exec_query_stats]
ORDER BY ID, RecordID, QueryTextID

[v]  IBID., pp. 334-335.

[vi] See the Wikipedia page“Andrey Markov”at http://en.wikipedia.org/wiki/Andrey_Markov

[vii] See the documentation from Books Online, which is available at the MSDN webpage “Microsoft Sequence Clustering Algorithm Technical Reference” at http://msdn.microsoft.com/en-us/library/cc645866.aspx

Also see the Wikipedia page “Markov Chain” at  http://en.wikipedia.org/wiki/Markov_chain I am not familiar with Hidden Markov Models or their differences with visible Markov models, but they are apparently pervasive enough that MacLennan, et al. mentioned on pp. 336-337 that SQL Server doesn’t use them.

[viii] Books Online makes the comment about matrix sparsity, but I added the rest. See the Wikipedia webpage “Curse of Dimensionality” at http://en.wikipedia.org/wiki/Curse_of_dimensionality

[ix] p. 336, MacLennan, et al.

A Rickety Stairway to SQL Server Data Mining, Algorithm 7: Clustering

by Steve Bolton

                In last week’s installment of this amateur series of self-tutorials on SQL Server Data Mining (SSDM), we covered my least favorite of the nine algorithms Microsoft includes with the product. There is a right time and place for every tool, but the times and places where Association Rules is the appropriate one are quite narrow. Its limitations make it an oddball among the major data mining methods, in that it is not used to construct other algorithms, nor does it use the others as building blocks. Functionally, it has the most in common with Clustering, an algorithm which also groups similar items within a dataset together. Association Rules is not technically referred to as a segmentation algorithm, but the two data mining tools group items in a dissimilar yet somewhat complementary way. Association Rules identifies the most common relationships between specific states of a couple of input or predictable columns in a mining model, in a sort of brute force, bottom-up way that leads to many small groups. In contrast, Clustering use more sophisticated, top-down methods to partition datasets into a handful of large groups, in which different states of the same column may be present in the same cluster. The first usually results in small itemsets, such as specific states of three or four mining model columns, while the latter may group together tens of thousands of rows in a particular cluster. In relational terminology, there are 1:M columns in each Association Rules itemset but a 1:1 relationship between columns and states, while the relationships for both are 1:M with Clustering. This week’s algorithm has the added advantage that a distance is often implied between clusters based on how divergent their constituent rows (i.e. cases in SSDM terminology) are from each other, whereas in Association Rules itemsets can only be crudely compared and contrasted by the probability with which they appear together.
                Not only does Clustering entail far less danger of overfitting (the bogeyman of data mining, in which decreased performance is paired with information glut, in the form of cluttered or misleading results) but it can accept the two major Content types, Discrete and Continuous.[i] Association Rules requires the latter Content type to be Discretized into buckets, which leads to loss of information content. In fact, Clustering is unique among the SSDM algorithms in that it does not even require you to identify a predictable column. The behavior with PredictOnly columns is a little different than usual, in that they are assigned values in a second pass over the training data based on the values in the clusters, rather than being used to form clusters themselves.[ii] This data mining method can be put to use in making predictions[iii], but this is not often done. That is one of the few broad applications where its utility is limited though. Association Rules is mainly useful for market basket analysis and recommendation engines, but Clustering is used for so many purposes in so many different industries that it is not possible to count all of the varieties of the algorithm in use today. The majority of the applications have this in common: they are instances of what is termed “unsupervised learning” in data mining lexicon, in which data is classified, without knowing the classifications or their properties in advance. It is well-suited to discovering previously unknown determinants of your dataset, which makes it an ideal starting point for research in many widely varied academic fields. Today, research on the algorithm is balkanized among many different fields for that reason, rather than being the exclusive domain of information theorists, mathematicians and statisticians. Its origins were also divided among several different academic disciplines between the 1950s and 1960s. The two names most commonly associated with its genesis are mathematician Hugo Steinhaus and Stuart Lloyd, a physicist in the Manhattan Project, who began publishing research on early variants of the algorithm in 1957.[iv] As mentioned earlier in this series, the origins of some of the other prominent mining algorithms were clouded by the uneven judgment of their inventors, who sometimes succumbed to junk science or sordid philosophes in fits of unreason. This was certainly not the case with Steinhaus, a Pole of Jewish ancestry who spent World War II in hiding, on the run in his homeland from the Nazis.[v] After the Soviet occupation of Poland he was forced to toil in obscurity behind the Iron Curtain, like Petr Hájek, Ivan Havel and Metodej Chytil, three Czechs who helped lay the groundwork for Association Rules in the mid-‘60s, as discussed in last week’s column.
               In addition to academic research, the clustering algorithm is quietly used in a wide range of applications that millions of American unwittingly encounter on a daily basis, like image data compression, optical character recognition (OCR), image grouping, computer vision and speech recognition.[vi] The algorithm can also be used in reverse to find oddball data that doesn’t fit normal patterns within a dataset, which can illuminate eccentric relationships or help identify outliers, i.e. values that can throw off measures of central tendency like averages. In some fields, such as “medical diagnostics, fraud detection, network security, anomaly detection, and computer immunology,” the whole point of Clustering might be to identify outliers.[vii] Data Mining with Microsoft SQL Server 2008, written by several former members of the Microsoft’s Data Mining Team, is not only a readable and indispensable tool for users of SSDM, but contains a brilliant explanation of how to use Clustering for anomaly detection.[viii] Even those DBAs who are strictly focused on the relational OLTP world could use it to identify bad data that needs to be cleansed. In other scenarios, it may be preferable to filter outliers out of a dataset before processing, or to limit them to a single cluster; how they are handled depends highly on the goals of a particular mining project.[ix]
               The types of problems the algorithm can be applied to are so vast that selecting the right means of Clustering becomes of central importance, especially since it is an “exploratory” algorithm that is designed to make a little more sense out of large quantities of relatively unknown data. As the DM Team puts it in their aforementioned book, “It is possible for people with particular domain expertise and a deep understanding of the data to create clusters in up to five or six dimensions, but modern data sets typically contain dozens (if not hundreds) of dimensions, leaving you with the impossible task of creating groupings when you can’t even conceive of the possible relationships between the attributes.”[x] On the other hand, the wide range of applications and large datasets it is designed to tackle also leave it open to one of the most dreaded terms in mathematics: “the curse of dimensionality.”[xi] Basically, several complications arise at an exponential pace as new dimensions are added to data, including the multiplication of irrelevant attributes, which seems to be a problem with other data mining algorithms as well. Certain Clustering methods, such as the popular K-Means method included in SSDM, have the additional difficulty in that their calculations imply distances between clusters, which quickly recede into a vague mass with higher dimensional problems. Perhaps the most dreaded acronym in mathematics is “NP,” which designates certain classes of problems which are particularly difficult to solve. K-Means is not among the most difficult class problems, NP-Complete (some of which may be logically impossible to solve with finite resources), but it is considered to be NP-Hard to prove that it has found the best possible solution (i.e. the “global optimum”) to particular Clustering problems, without taking the prohibitively expensive step of examining each and every solution. Both of these terms are bandied about frequently in the literature around Clustering algorithms, which indicates the high difficulty level of the problems that they are designed to cope with.
               Furthermore, “clustering is in the eye of the beholder.” These words from computer science Prof. Anil K. Jain, one of the gurus of Clustering, say it all.[xii] This data mining method is so useful that it has become ubiquitous, but that comes at a cost of diffuseness in defining what a cluster is. The answer to that question is basically determined by the type of Clustering method one chooses to use, since they all define the groups they create differently.  This is merely one a long list of “user dilemma” questions posed by Clustering, which are often answered by the choice of Clustering method, like how to validate clusters, which features to select when defining them, deciding whether or not to normalize the data, focusing on or discarding outliers and determining whether or not the data even has a tendency to cluster.[xiii] There is also a trade-off between Clustering methods, in that some of them operate better on certain types of data distributions, while others are conducive to building clusters of particular shapes. As Jain puts it so succinctly, these problems are compounded by the fact that “different clustering algorithms often result in entirely different partitions even on the same data.” Some Clustering methods, like K-Means, may create clusters even where there are none, because the data is uniformly distributed. Others may partition data in a particular way that is entirely valid, yet miss other equally valid partitions. For example, in one of his recently published papers Jain cites a particular study in which a particular Clustering experiment divided animals into mammals vs. birds based on one iteration in which their appearances were heavily weighted, but in another iteration in which their activities were heavily weighted, they were segmented on the lines of predators vs. non-predators.[xiv] Basically, the term Clustering covers several thousand disparate means of dividing datasets into distinct chunks based on the similarities of their constituent cases (i.e. rows, in the instance of our data). The selection of methods must be determined by the dataset and the goals of the data miners, since it will unavoidably affect the outcomes. It might even be logically impossible to develop a single Clustering method that fits every type of experiment imaginable. As Jain says,

           “The above discussion underscores one of the important facts about clustering; there is no best clustering algorithm. Each clustering algorithm imposes a structure on the data either explicitly or implicitly. When there is a good match between the model and the data, good partitions are obtained. Since the structure of the data is not known a priori, one needs to try competing and diverse approaches to determine an appropriate algorithm for the clustering task at hand. This idea of no best clustering algorithm is partially captured by the impossibility theorem [Kleinberg, 2002], which states that no single clustering algorithm simultaneously satisfies a set of basic axioms of data clustering.”[xv]

               All of these considerations are in addition to the usual trade-offs in terms of calculation performance and ease in visualizing the results. Luckily, the two Clustering approaches Microsoft has included in SSDM are very easy to visualize, as we shall see in a few moments, quite literally. They’re also among the most established and well-known varieties of Clustering methods, which makes performance tuning, cluster modification and interpretation of the results a little easier. K-Means and the Expectation Maximization (EM) methods iteratively apply their internal methods of dividing data over and over until the point is reached where no more information is being added to the mining model, so processing stops. I like to picture the process as wrapping a ball of yarn around an object until none is left, except that with clustering we are wrapping up several different objects, separated by a certain amount of space determined by the levels of similarity between them. The process might also be likened to the way that rain drops are formed by condensation of water vapor around minute particulates of dust. The way in which the two algorithms go about it differs somewhat though, which has an effect on the results returned. The object of the internal math of K-Mean is to assign data points to particular clusters by iteratively reducing the squared error of calculations of Euclidean geometric distance between them; EM assigns data points by iteratively calculating probabilities that they belong in a particular cluster, which may include discarding clusters and repeating processing if some are found to lack sufficient case support. K-Means explicitly implies distances between clusters, while the probabilities used in EM only imply distance in a less defined way. The “soft clustering” approach of EM has some advantages over the “hard clustering” techniques of K-Means, however, in that its data points can belong to more than one cluster and that its clusters can have a wider range of shapes, whereas data points are limited to a single cluster in K-Means, which tends to produce spherical groups of roughly the same size. I’ve seen several sources say that K-Means are ideal for Gaussian distributions (i.e. bell curves) but EM also assigns probabilities based on the same data distributions, so I’m not sure which of the two choices in SSDM would be ideal for that scenario.[xvi] K-Means might be preferred, however, if you have an explicit idea of the number of clusters your data ought to produce. Its drawbacks include a lack of clear starting points (i.e., the variable k) to being clustering, the inability of verifying whether or not a global optimum has been reached, sensitivity to skewing by outliers, lack of scalability and difficulty in analyzing certain irregular data shapes.[xvii] Users also have to specify the number of clusters in advance, which in some respects defeats the purpose of exploring unknown data.[xviii] The twin difficulties of identifying the best starting points and the proper number of clusters means it is best suited to being processed in multiple runs with different parameter settings.[xix] Although Microsoft deserves kudos for adding support for Discrete data to its version of K-Means (in contrast to other simpler, less versatile implementations of it), data of this Content type is of less practical use than with EM, because the probabilistic methods used to calculate it don’t have much in common with the distance-based calculations for Continuous data in K-Means.[xx] One serious limitation of Microsoft’s implementation is that the distances calculated by K-Means are not returned in the metadata returned by the Generic Content Tree Viewer. These can be retrieved by using prediction functions, according to the documentation in Books Online (BOL), but the Data Mining Extensions (DMX) language SSDM uses is an advanced topic that we won’t be tackling in this series until all nine algorithms have been introduced.
                Another advantage of EM over K-Means is that it requires less memory during processing and a maximum of one database scans, which may be why it “outperforms sampling approaches,” according to BOL. The documentation also says it “has the ability to use a forward-only cursor,” but I’m not sure if this referring to internal processing under the hood or in DMX queries; either way, I will never find out, because I haven’t used cursors in years due to a fatal allergy. The performance of both algorithms can also be enhanced by making wise choices between the scalable and non-scalable versions of both. In the former, SSDM loads 50,000 cases at a time and only reads additional batches of the same size if the mining models haven’t converged yet. In the latter, the datasets are loaded in one huge gobble. BOL advises that “Because it operates on a local buffer, iterating through the data is much faster, and the algorithm makes much better use of the CPU memory cache; EM is three times faster than non-scalable EM, even if all the data can fit in main memory. In the majority of cases, the performance improvement does not lead to lower quality of the complete model.”[xxi] Advantages like these may be why Microsoft chose 1 – Scalable EM as the default for the CLUSTERING_METHOD parameter, followed by 2 for Non-Scalable EM, 3 for Scalable K-Means and 4 for Non-Scalable K-Means. As we shall see, I didn’t see much of a performance drop when switching to the non-scalable means, probably because the datasets we’ve been using throughout these trials are relatively small, at least for Clustering applications, which seem to run a lot faster than certain other algorithms at any settings on the same data. The DM Team’s book says that “The scalable clustering framework was created to solve the problem of too little memory to maintain the entire case set in memory. However, if you have enough memory, you can set the SAMPLE SIZE parameter…to 0 to tell the algorithm to use as much memory as necessary.”[xxii] Most of the trials I ran only consumed about 1 of the 16 gigabytes of RAM on my poor beat-up development machine, which may explain why I saw little difference in processing time with any of the non-scalable means. Reducing the SAMPLE_SIZE to 10,000 improved processing time to 0:45 from roughly a minute or two on various other EM trials, while setting it to 0 also yielded a respectable time of 1:23 that was better than certain other trials. Setting it to 1,000, however, yielded the worst performance time of any of the 15 trials in Figure 1, at 6:26. Processing ran on just one of the six cores throughout most of the additional time. I’m going to go out on a limb and speculate that too low of a SAMPLE_SIZE can decrease performance in some scenarios, by forcing SSDM to process increased numbers of unnaturally small batches. Also note that if you accidentally set it between 0 and 100, as I did once on accident, SSDM will return an error message.

Figure 1: Trial Results for the Clustering Algorithm at Various Parameter Settings (click to enlarge)
ClusteringResults

                 Many of the other variables mentioned above can also be explicitly controlled in SSDM. For example, the point where processing stops can be specified with the STOPPING_TOLERANCE parameter, which defaults to 10. According to BOL, when the change in cluster probabilities is below this number divided by the total number of cases, the model is believed to have converged and processing stops. A trade-off between performance and results is involved, with higher numbers leading to faster processing and looser clusters and lower numbers to slower processing and tighter clusters; the DM Team’s book advises that it can be set to 1 if you suspect in advance that you’re dealing with a “small data set or very distinct clusters.”[xxiii] All of the flavors of Clustering can be directed to begin their searches for place to condense at different data points, depending on the random number generated by the CLUSTER_SEED parameter, which defaults to zero. Setting it to different numbers may produce diverse clusters on the same data, particularly with K-Means. This makes it an ideal candidate to experiment with when we discuss model validation, a very important step in data mining, which we really can’t address adequately until we’ve covered all nine algorithms; for this reason I will save it for later in this series of amateur tutorials, along with other validation experiments I’m procrastinating on, like feeding different training data to the Neural Networks algorithm, as mentioned in A Rickety Stairway to SQL Server Data Mining, Algorithm 5: The Neural Network Algorithm. MODELLING_CARDINALITY is another parameter unique to Clustering which really shouldn’t be put off, since reducing below its default of 10 may cause SSDM to evaluate too few sample models during processing, thereby potentially improving performance at the cost of the quality of the results returned. Setting the number higher may of course have the reverse effect; keep in mind the omnipresent risk of overfitting when tweaking any of these parameters though, because the wrong settings may actually decrease both. The DM Team advises in its aforementioned book that “Typically you can reduce this by half without significantly impacting accuracy. If you are running the Enterprise or Developer edition of SQL Server 2008, each candidate model will be processed on separate threads, allowing you to take advantage of better hardware investments.”[xxiv] By default, SQL Server attempts to build 10 clusters, but the number can be explicitly set using the CLUSTER_COUNT parameter. Keep in mind that if SQL Server can’t build the number you specify, it “builds as many clusters as possible,” according to BOL. Setting it to 0 means that SQL Server tries to determine by itself the proper number of clusters to build, using a heuristic method that follows a curve that favors about 10 clusters. The DM Team book points out that “this heuristic is particularly important with discrete clustering, which tends to favor either very few or very many clusters, neither of which is particularly useful,” but it is difficult to tell from the documentation precisely how the presence of many Discrete columns will affect the number of clusters chosen.[xxv] This could be of importance with the third mining structure we’ve been working with, based on data returned by dm_exec_query_stats, which has many Discrete columns to indicate particular query texts, plan handles and the like.
               That mining structure has 3.8 million rows that took more than an hour to process, even with this fast algorithm, so I limited by experimentation with these parameters to the first mining structure, based on the 142,080 rows of data returned by sp_spaceused. See the first three posts in this series for a refresher on the schema we’ve been using throughout this series, which is based on more than three days of polling the dynamic management views (DMVs) dm_exec_query_stats, dm_io_pending_io_requests, dm_io_virtual_file_stats, dm_os_performance_counters, dm_os_wait_stats and sp_spaceused every minute. This deliberately created a disk bottleneck, so that I could study the IO subsystem better, while simultaneously providing a type of data for this series that is highly relevant to DBAs. We’ve already previously discussed parameters like MAXIMUM_STATES and MAXIMUM_INPUT_ATTRIBUTES which are also available in Clustering, so I won’t belabor those topics further. It is interesting to note, however, that there is there is no MAXIMUM_OUTPUT_ATTRIBUTES parameter in this algorithm, which I surmise has to do with the unique way in which predictable columns are handled. MAXIMUM_INPUT_ATTRIBUTES is relevant to the four methods of feature selection in SSDM, including the Interestingness Score, which is the only one Clustering makes use of, according to the documentation.[xxvi] I am a little unsure of this, however, because the other three methods that the documentation says it doesn’t use are all Bayesian, as described in A Rickety Stairway to SQL Server Data Mining, Algorithm 1: Not-So-Naïve Bayes. As we will see shortly, however, the root of each mining model returns a Bayesian Information Criterion, i.e. a Schwarz Criterion. Jain mentions this measure of information content is sometimes used to determine the number of clusters created by K-Means, but it is unclear from the documentation if this is the case in SSDM.[xxvii] We also don’t have room to do justice to the MINIMUM_SUPPORT parameter, which behaves a little differently with this algorithm, which may be why it was known by the different moniker of MINIMUM_CLUSTER_CASES in SQL Server 2005. Basically, it specifies that a cluster will be discarded unless it has the indicated number of cases. Setting this to numbers other than its default of 1 might make sense, for example, if you know that your clusters should logically have a certain minimum case support. This might be the case with our second mining structure based on dm_os_wait_stats, which returned near the maximum of 673 different wait types with each poll, so that it might make sense to partition our data into clusters near that minimum size to see if they divide into clusters based on the polling time. As always, a trade-off is involved with this parameter, because some of the clusters you filter out may include relevant data, if you’re not careful to set it properly. Keep in mind when setting it that the same case may occur in multiple clusters with EM but not in K-Means. Books Online also mentions the addition of an internal parameter called NORMALIZATION which is meant to control the collation of Z-scores, but I have not yet found a way to set it. Besides this, the only other variables we might concern ourselves with are the mining structure flags MODEL_EXISTENCE_ONLY and NOT NULL. Throughout this series we’ve ignored them though, since our data is not supposed to have nulls and turning our columns into a dichotomous choice between Missing and Existing would rob our data of meaning.
                The Cluster Diagram shows at a glance just how easy it is to visualize Clustering, at least in comparison to other algorithms. It operates much like the Dependency Network tabs introduced in previous posts, with the usual All Links slider to limit connections between the clusters by their statistical strength. The bluer a cluster is the, more members it has, while those with fewer members are shaded from grey to white in descending order of case support. Like with the Dependency Network tab, you can move the clusters around with your mouse. You can also right-click a cluster and select Rename Cluster… to specify a user-friendly name, but be aware that it won’t be persisted if you reprocess the model. The Cluster Profiles tab works much like the Attribute Profiles tab described in the Naïve Bayes Tutorial, except that it also includes Continuous columns with the minimum, maximum, mean, mean minus standard deviation and mean plus standard deviation displayed in the Mining Legend instead of the data distributions. As always, I’m not crazy about this tab, since it is difficult to digest all the diamond charts mixed together with histograms at a glance. The Cluster Characteristics tab also behaves like the Attribute Characteristics tab in Naïve Bayes, except that you can’t retrieve the probabilities by hovering your mouse over the blue bars. It is indispensable, however, in determining what cases belong to a particular cluster, which is isn’t readily evident with the other tabs. Cluster Discrimination operates like a stripped-down version of the Neural Network Viewer, in which one cluster is compared against another to see where a proportion of the cases for a particular attribute-value range fits best. The “Complement of Cluster” option can be selected from the dropdowns to see how a particular cluster compares against all of the values not included in it.[xxviii]

Figures 2-5: The Four Clustering Visualization Tabs
ClusterDiagram

ClusterProfilesTab
ClusterCharacteristics
ClusterDiscrimination

               If this isn’t enough information for us, we can dig down deeper into the data with the Generic Content Tree Viewer, which as always involves a tradeoff in the form of decreased legibility. In past articles I’ve discussed the challenges of SSDM’s common metadata format, which is a convenient way of representing the output of all nine algorithms that would ordinarily be akin to comparing apples and oranges. I look at it as a produce stand that can hold both, so to speak. The cost is that we have to sift through a denormalized table, whose rows can vary in meaning depending on the NODE_TYPE value, as well as the nested NODE_DISTRIBUTION table, whose rows can also vary in meaning depending on their VALUETYPE values. Furthermore, there are subtle differences in the meanings of the columns in this common format depending on the selected algorithm. Fortunately, interpreting this format is perhaps easier with Clustering than any other algorithm. As you can see in Figure 6, many of the columns are simply left blank or set to 0 or 1, or provide redundant like the name of the attribute. There are only two NODE_TYPE values, with one of those being the single model root node as usual. ATTRIBUTE_NAME and ATTRIBUTE_VALUE tell you the name of the attribute and the particular state that is the topic of that row,  while the standard codes we’ve worked with for most of the past few weeks are used for VALUETYPE, including 1 for Missing, 3 for a Continuous column, 4 for Discrete (and presumably, 5 for Discretized). The number of supporting cases, the probability of the attribute-value pair occurring and the variance are also listed in columns by the same names, as usual. The only difficult part of it might be interpreting the MSOLAP_NODE_SCORE, which BOL says returns a Bayesian Information Criterion score, as discussed earlier. I have not yet spotted a means to retrieve the Interestingness Scores, which are available in the model content of other data mining methods that use them, such as Linear Regression.

Figure 6: Metadata for Clustering (adapted from Books Online as usual)
ClusteringMetadata

          The cluster with the most case support in the third mining structure, based on dm_exec_query_stats, was disconnected from the other nine clusters, which means it represented a group of highly distinct data. The Cluster Discrimination tab revealed dozens of states for specific Discrete measures like IOOffset, IOCompletionRequestAddress, IOuserDataAddress and SchedulerAddress, SQLHandleID tightly coupled with that cluster or its complement. The first group was strongly linked to values of 2 and 3 minutes for MinuteGap, an important measure I added to account for the unexpected increases in time between SQL Server Integration Services (SSIS) jobs during the data collection process. The complement strongly favored 1. A SQL Server internals or SSIS guru could probably easily use these figures to trace down the specific cause of this mysterious gap. The second structure, based on dm_os_wait_stats, had very weak links between its seven clusters, except for between the second and sixth clusters, which had just a few thousand rows between them out of the 5.4 million cases that were not reserved for training. This hints that the constituent attribute-value pairs represented outliers, previously unknown relationships or some other form of anomaly. This time the Cluster Discrimination tab revealed very high values for SignalWaitTimeMS, WaitingTasksCount, SizeOnDiskBytes, NumOfBytesWritten, IOStall, NumOfReads and IOStallReadMS in Cluster 2 and much lower values for many of the same values in its complement. This probably indicates that this group covered moments of particularly high IO pressure during the data collection process, possibly in terms of reads rather than writes, given that it was also associated with a NumOfWrites value of just 3.
                The output for the smallest mining structure, based on sp_spaceused, took more labor to interpret because it varied depending on the parameters in Figure 1. As shown in Figures 7 through 15, the number and arrangements of clusters varied between eight different shapes for the 15 mining models created in this structure. The first illustration shows the identical results returned for the first two mining models, where EM and EM-Scalable were used with their default values. Likewise, in the second picture, identical results were returned for K-Means and K-Means scalable at their default values. The fifth, seventh, tenth, eleventh, twelfth and fifteenth mining models all had the same structure depicted in the third image. This was similar to the structure in the sixth picture, for the ninth mining model. The remaining illustrations depict the results for the sixth, eighth, thirteenth and fourteenth models, all of which were unique. The major determinants of these differences seemed to be setting the CLUSTER_COUNT to 20 with the sixth model, which therefore produced twice as many cluster, reducing the MODELLING_CARDINALITY to 2 and raising it to 30 in the eighth and ninth, and setting the SAMPLE_SIZE to 100,000 and 10,000 respectively with the thirteenth and fourteenth. The STOPPING_TOLERANCE didn’t seem to have much effect on the cluster shapes in this trial. The data returned by these various trials was useful in complementary ways. For example, the cluster with the most cases in the two K-Means trials had a complement with very high measures of IO Pressure like NumOfBytesRead, Unused, Data, Reserved and IndexSize associated with MinuteGap, in contrast to very low values for the first cluster. The first two EM models had a very high correlation between a MinuteGap of 3 in one small cluster, together with very high values for other measures of IO pressure like NumOfBytesWritten, NumOfReads, IOStall, IOStallReadMs and SizeOnDiskBytes, with very low values for the same measures associated with its complement. This cluster was also associated with a particular table in the Monitoring database that was continuously written to during the data collection process. It is possible that some read activity was going on as well, because like these numbers weren’t strongly associated with any measures of IO write activity either way, which is reminiscent of what we discussed above with one of the clusters in the second mining structure. It is entirely possible, looking back, that this small cluster represented moments in which I read from that particular table during data collection just to take some visual samples of how the process was going. That would be taboo during a real, live data collection endeavor, but in my experiment, I didn’t care about contaminating the data in this way because I deliberately wanted to create a disk bottleneck. Also note that the first two EM models and the first two K-Means models both produced strong, useful insights at a casual glance, but did so using entirely different sets of measures that were available to both. The most useful measures in the K-Means trials came from the columns returned by sp_spaceused, while in the EM ones they came from dm_io_pending_io_requests, which were joined together in the same view that the structure was built on. So SSDM automatically partitioned the data from different angles depending on the parameters we set, which led to different results which were nonetheless both useful and valid.

Figures 7 to 15: Cluster Diagrams for the 15 Models Based on sp_spaceused
ClusterType1
ClusterType2
ClusterType3
ClusterType4
ClusterType5
ClusterType6
ClusterType7
ClusterType8


                If our results are this helpful at such a low performance cost using these two simple, well-known flavors of Clustering, it is intriguing to speculate on how much more might be revealed at minimal cost through other variants. As Jain has pointed out, there are already so many variations on the algorithm being put to productive use out in the wild that it is difficult even for an expert like him to track them all. Hierarchical clustering, for example, represents a whole set of Clustering methods that aren’t available in SSDM but which are useful for mining data that naturally follows a tier pattern; it is not derived from the Decision Trees algorithm we covered earlier in this series but is similar in many respects. Furthermore, new variants of Clustering are being developed all the time, including enhancements of older versions like K-Means. Some of these new forms of K-Means include two popular algorithms called ISODATA and FORGY, which sometimes also selects k through alternate methods like the Akaike Information Criterion.[xxix] New methods of Clustering based on data distribution densities are still being developed, such as DBSCAN and OPTICS, which don’t work well as well on Gaussian distributions as EM, which is in the same category of methods.[xxx] Other new variants are known by such colorful acronyms as CLARANS, BIRCH, CLIQUE, SUBCLU, DiSH and Eric.[xxxi] Much recent research has been focused on creating semi-supervised Clustering methods in order to improve the stability of the results, which can fluctuate greatly on similar datasets in some purely unsupervised scenarios.[xxxii] Canopy clustering is a method of particular interest, since it has a low performance impact and can be used to identify clusters before running more resource-intensive methods like K-Means and EM. It might thus be wise in some mining scenarios to perform canopy clustering on a cube or relational database, using your own code, in order to run SSDM trials more efficiently. We can add this to the growing list of preparation work that can be done in a cube or relational database prior to mining, along with determining the natural distributions of our data, computing variances and standard deviations and so on. More intensive preparation will be required to take the next step up our stairway, to a more limited variation on this family of data mining methods called Sequence Clustering. This algorithm first begins by Clustering data in much the same manner as we have in this article, but adds supplementary steps to place the clusters in a specific order, which requires the use of nested tables. Internally, it makes use of a common statistical building block called Markov chains, which are ironically often augmented by Clustering in Monte Carlo methods, a statistical building block which is beyond the scope of this series. Keep in mind that I’m posting this series in order to familiarize myself more with SSDM while simultaneously giving this incredibly useful product some badly needed free press, so if a data mining maven tells you that Sequence Clustering isn’t difficult to work with, listen to him instead. I have found it a challenge to work with in every SSDM application I’ve done to date though, to the extent that it has a steep learning curve similar to that of Association Rules. This is one of the reasons I’ve saved it for the tail end of this series, along with Time Series, another temporal mining algorithm.


[i] For a refresher on how important this distinction is in analysis, as well as instructions in basic tasks like setting up a mining project in SQL Server Data Tools (SSDT), see A Rickety Stairway to SQL Server Data Mining, Part 0.0: An Introduction to an Introduction.

[ii] See the MSDN webpage “Microsoft Clustering Algorithm Technical Reference” at http://msdn.microsoft.com/en-us/library/cc280445.aspx. Also see p. 295, MacLennan, Jamie; Tang, ZhaoHui and Crivat, Bogdan, 2009, Data Mining with Microsoft SQL Server 2008. Wiley Publishing: Indianapolis.

[iii] p. 314, MacLennan, et al.

[iv] No author, 2008, “Deaths,” University of Chicago Magazine, Vol. 100, No. 3, January-February 2008. http://magazine.uchicago.edu/0812/peer_review/deaths.shtml

[v] See the Wikipedia webpage “Hugo Steinhaus” at  http://en.wikipedia.org/wiki/Hugo_Steinhaus

[vi] p. 3, Berkhin, Pavel, 2002, “Survey of Clustering Data Mining Techniques,” a technical report for Accrue Software of San Jose, CA. Available online at the Appanews Topic Model Software Machine Learning Library at  http://www.bradblock.com.s3-website-us-west-1.amazonaws.com/Survey_of_Clustering_Data_Mining_Techniques.pdf

[vii] IBID., p. 43.

[viii] pp. 297-299, MacLennan, et al.

[ix] p. 42, Berkhin.

[x] p. 291, MacLennan, et al.

[xi] See the Wikipedia article “Curse of Dimensionality” at http://en.wikipedia.org/wiki/Curse_of_dimensionality

[xii] Jain, Anil K., 2010, “Data Clustering: 50 Years Beyond K-Means,” pp. 651-666 in Pattern Recognition Letters, Vol. 31, No. 8, June, 2010. Available online at the University of Central Florida Computer Science Division website at http://www.cs.ucf.edu/courses/cap6412/fall2009/papers/JainDataClusteringPRL09.pdf. The copy I read had no page numbers, so I won’t be able to cite them specifically here. The .pdf file I read has 35 pages, none of which are marked, and the published one has 15, so I’m at a loss as to how to cite it.

[xiii] IBID.

[xiv] IBID. The study he cites is Pampalk, Elias; Dixon, Simon and Widmer, Gerhard, 2003, “On the Evaluation of Perceptual Similarity Measures for Music,” pp. 7-12 in Proceedings of the Sixth International Conference on Digital Audio Effects (DAFx-03). Queen Mary University of London: London.

[xv] IBID.

[xvi] See the Wikipedia webpage “Cluster Analysis” at http://en.wikipedia.org/wiki/Cluster_analysis

[xvii] p. 17, Berkhin.

[xviii] See the Wikipedia page “K-Means Clustering” at  http://en.wikipedia.org/wiki/K-means_algorithm

[xix] IBID.

[xx] p. 312, MacLennan, et al. Also see MSDN webpage “Microsoft Clustering Algorithm Technical Reference” at http://msdn.microsoft.com/en-us/library/cc280445.aspx.

[xxi] See MSDN webpage “Microsoft Clustering Algorithm Technical Reference” at http://msdn.microsoft.com/en-us/library/cc280445.aspx.

[xxii] pp. 313-314, MacLennan, et al.

[xxiii]  IBID., p. 316.

[xxiv] IBID.

[xxv]  IBID., p. 315.

[xxvi] See the MSDN webpage “Feature Selection in Data Mining” at http://msdn.microsoft.com/en-us/library/ms175382(v=sql.105).aspx

[xxvii] See Jain, Anil K., 2010, “Data Clustering: 50 Years Beyond K-Means.”

[xxviii] While we’re on the topic of visualization, I thought it interesting to note that on p. 3, Berkhin states that “Describing the numbers of data points per every unit represents an extreme case of clustering, a histogram, where no actual clustering takes place. This is a very expensive representation, and not a very revealing one.” I never looked at a histogram this way before. Given that indexing in SQL Server is based on histograms, could it be possible to improve our indexes with less expensive, more revealing clusters in certain scenarios? If so, this would be yet another practical use for SSDM even in purely relational and OLTP applications.

[xxix] See the Wikipedia webpage “Cluster analysis”at http://en.wikipedia.org/wiki/Cluster_analysis

[xxx]  IBID.

[xxxi]  IBID.

[xxxii]  IBID.

A Rickety Stairway to SQL Server Data Mining, Algorithm 6: Association Rules

by Steve Bolton

                In last week’s installment of this series of self-tutorials on SQL Server Data Mining (SSDM), we covered my favorite algorithm, Neural Networks, which is also among the most intricate but productive of the nine data mining methods Microsoft provides. The major drawback is that its inner workings are almost impossible to interpret; it is about as difficult to figure out why a mining model reached a particular conclusion from examining its weighted artificial neurons as it is read a man’s thoughts from a coroner’s report on his cerebellum. This week we’re going to whipsaw into a completely unrelated algorithm, Association Rules, which has characteristics that are almost a mirror image of the properties of Neural Networks. Its inner workings are perhaps easier to understand than any of the other nine algorithms, because it is so simple. Despite its inner simplicity, it is paradoxically more difficult to extract useful information from it because of the steep learning curve required to manipulate its parameters properly. As I have made clear in prior posts, this series is meant to show that an amateur like me can still garner a wealth of useful information from the most under-utilized tool in SQL Server with a minimum of investment in time, energy, server resources and training. In every other algorithm we’ve tested to date (and in my prior experience with the three we have yet to discuss), interesting results have immediately popped out even when the parameters controlling each one were left at their default values, but Association Rules takes a lot of coaxing to extract useful results from. At least in my limited experience, it seems to be more prone to the scourge of data mining, overfitting, i.e. decreased performance in return for cluttered, meaningless or misleading results. For these reasons and others, it is my least favorite among Microsoft’s nine algorithms.
                That is not to say that it does not have its uses. It was first developed by chemical engineer Rakesh Agrawal and computer scientists Tomasz Imielinski and Arun Swami in a landmark research paper in 1993, which built on the study of GTE labs researcher Gregory Piatetsky-Shapiro.[i] It may have been foreshadowed by Petr Hájek and Ivan Havel and Metodej Chytil, three Czech researchers working behind the Iron Curtain, who published a similar paper in 1966. In past articles I have written about some sordid episodes in the history of statistics, math and the hard sciences, but in this case, we have people of seemingly sterling character, like Hajek, who was known for resisting the Communist occupiers of Czechoslovakia, and Havel, who was the brother of Vaclav Havel, the famed leader of the nation’s “Velvet Revolution” in 1989.[ii]  Either way, it was not popularized until the 1993 article, which makes it “one of the most cited papers in the data mining field.”[iii] There’s apparently quite a bit of ongoing research on this particular mining method, including the development of more advanced refinements like Eclat, FP-growth, GUHA and OPUS that allow for hierarchical categorization, sequence ordering and other uses that the original A Priori method doesn’t handle well. Alternative ways of gauging the interestingness of data have also been invented, including measures of collective strength, conviction and leverage. I can’t comment directly on any of these new variations, because my experience with Association Rules is limited to the version that Microsoft ships with SSDM, which Books Online (BOL) says “is a straightforward implementation of the well-known Apriori algorithm.” My gut feeling, however, is that the algorithm is a hot topic only because it is well-suited for a couple of specific, narrow implementations, like market basket analysis and recommendation engines, which are in high demand right now in the limited field of E-commerce. In fact, this specific scenario was spelled out in the very first sentence in Agrawal, et al.’s article. Like any other tool, it has its uses, but several drawbacks that severely limit its utility in comparison to other algorithms. It is not well-suited to scenarios where the order of items in a particular case of transactions (think of a row in a dependent table joined to a parent table) is important. Association Rules can be used for predictions, but only in a crude way, because it has no ordering and therefore can only refer to specific points in time. Even in data mining there are no crystal balls, so I can’t peek into the future and see if researchers will continue to devote as much attention to this algorithm. My gut feeling, however, is that we’re probably already nearing the top of a bell curve, after which the advantages of refining this particular algorithm will begin to level out. It may continue to be popular for many reasons, such as the fact that its inner workings are very simple to explain, but the very simplicity of this data mining method is what limits its utility and possible refinements so drastically. The shovel was one of the great inventions in human history, but since that time, it has undergone only slight modifications of limited utility. Research into building better shovels has not stopped by any means, but it has topped off – for example, anyone with more than a hundred bucks can purchase a high-tech “crovel” like the one featured on a recent episode of the National Geographic Channel show Doomsday Preppers. That souped-up shovel is a great tool, but its refinements are even more highly specialized, to the point that we now have one geared specifically for end-of-the-world scenarios.[iv] In the same way, research into Association Rules will probably continue but follow the law of diminishing returns, whereas we’ve barely scratched the surface of the potential of other data mining methods like neural nets and Clustering.
                Just like a shovel, the basic A Priori version of Association Rules is simple to explain. SQL Server counts all of the cases in your data mining model where certain columns values are found together, which can be a time-consuming process, then creates an itemset for each combination with a count above a certain threshold. Basically, it just counts the different states of your columns to see how frequently they occur in combination with the states of other columns, then tosses out the ones that are sufficiently numerous. You can certain properties to specify how many columns can appear in your itemsets; for example, you can tell SQL Server to only create itemsets with three or four items, i.e. column-value pairs. For example, using the IO data we’ve been working with throughout this series (see post 0.0, post 0.1 and post 0.2 for refreshers on the details of our schema, as well other basics like how to create a project), one of the itemsets returned was labeled “IO Stall = 4439-10428, IOPendingMsTicks = 1-28″ which was supported by 9,053 cases and had an itemset length of two columns, IOStall and IOPendingMsTicks. Throughout this series we have referred to the number of rows processed by SSDM as supporting cases, but in the lingo of Association Rules this simple measure is also sometimes referred to as frequency. Once the itemsets are generated, the algorithm creates a rule by simply calculating the probability (also referred to as confidence in the context of Association Rules) that one will be found with another. One of the more important rules I found, for example, was labeled  “IO Stall Read Ms = 15273-78862008 -> Num Of Bytes Read > = 3006464.” Several itemsets can appear on the left-hand side of a rule, but only one in the predictable column listed in the right of the -> arrow figure. The multiplicity of greater than, less than and equals signs in the presence of that arrow can be initially confusing, especially once we find itemsets with more than two members and rules with more than two itemsets on the left side, but users can get acclimated fairly quickly with little training.  The only operation that might be a little beyond the ordinary math skills of the Average Joe might be the lift assigned for each itemset and rule. For itemsets, the count of cases is divided by the total number of cases, then the probability is normalized ( i.e. a fancy way of saying that they’re put on a single comparable scale, as you might need to do if you one were comparing ranges of prices for euros, dollars and yen, or disparate rating systems for movies, etc.). With rules, the calculation is a bit more complex, in that a ratio is built by comparing the left-hand itemsets with the predictable result on the right on one hand, against the results on the right-hand when they’re not found in the presence of those itemsets. The comparisons of rules are then normalized, this time using a logarithm. The essential thing for those of us without doctorates in statistics to remember is that the point is to build a way for us to measure the interestingness of the results. This is actually simpler to calculate and easier for laymen to grasp with Association Rules, since it is the only algorithm which doesn’t use the four methods of feature selection discussed in prior posts to compute interestingness. This measure is sometimes referred to as importance in this context. As BOL points out, the simpler way in which Association Rules calculates it means has a drawback in that the associations it makes are less likely to provide relevant new information than the splits produced by Decision Trees, which we discussed a few weeks ago, or for any of the other algorithms for that matter. The structures produced by these calculations are also basic as a scoop or a spade. As BOL puts it, “An association model has a simple structure. Each model has a single parent node that represents the model and its metadata, and each parent node has a flat list of itemsets and rules. The itemsets and rules are not organized in trees, they are ordered with itemsets first and rules next.”
               The drawbacks of Association Rules, however, also seem to stem from the advantages of its simplicity. A shovel is a simple, basic tool, but it’s only economical to use it when the holes you want to dig are shallow, or when more advanced tools like back hoes and drills can’t be applied. You could theoretically use a shovel to excavate a gold mine two miles deep into the ground, like South Africa’s TauTona mine, but that would be an exceptionally painful and inefficient way of going about it. Likewise, I seriously doubt Association Rules can be used more efficiently than the other eight SSDM algorithms to unearth interesting relationships buried deep in any big dataset. It might be an appropriate choice when you need quick insight into a small dataset without many unique variables, in which case you don’t have to quarry very deep below the surface of your data, which means that the outlines of the relationships you seek may already be visible. It may also be appropriate when you need to explicitly understand how SSDM arrived at a particular answer, which is the only area in which beats the Neural Networks algorithm hands down. In this sense, Association Rules is so simple even a caveman can do it. Any caveman can also excavate a mineshaft by beating the ground with a stick, but they’re going to exhaust all of their brute strength long before they turn up their first precious gem or ounce of gold. To put it bluntly, that’s essentially what we’re doing with Association Rules: brute force data mining. The process is easy to grasp because it’s so basic, but it also has the drawbacks of any simple, clumsy tool: it takes an awful lot of effort to get any work done. A first-time user might expect that in return for the shallow view of data that Association Rules provides, we would have to expend correspondingly fewer resources, but this is not the case. In fact, you are much more likely to encounter immediate performance problems with this algorithm than any other, in return for a much greater risk of information glut, which together constitutes the twin horns of overfitting, the bogeyman of all data miners. For these reasons, it might be wise to limit Association Rules to preliminary examinations of data, but even for this narrow class of uses, Naïve Bayes does a better job by consuming few server resources in order to return its relatively shallow results. It also has the same innate Content type limitation which hobbles Naïve Bayes, in that it accepts only Discrete and Discretized data. Association Rules is in some ways even simpler to understand than Naïve Bayes, because we don’t have to worry about feature selection, nor is there a statistical mystery about why it returns the results it does. Yet even the simplicity of its inner workings is misleading, because a really steep learning curve is required to master the parameters that control Association Rules. I’ve seen all of the other algorithms immediately return back interesting results with their parameters set to their defaults, but have yet to see this occur with Association Rules. Its simplicity is deceptive in this sense. Excavating a mine with a simple tool like a spade, for example, can quickly turn into a maddeningly complex job. As I’ve said before, I’m an amateur at this, but in the last few years I’ve experimented with it, Association Rules has consistently returned data of far less quality in return for much higher consumption of server resources, on top of a much greater investment in time and energy merely to complete processing at all. I’m sure an expert can point out some specific scenarios where Association Rules might be the preferred tool, but I can’t think of any off the top of my head which are much different than those it was developed for: market basket analysis and recommendation engines, which we’ll explain shortly, after demonstrating how to misapply it to other scenarios. Perhaps it works better on certain data distributions, as Naïve Bayes does. These limitations, however, make it a much more specific but simple digging tool, analogous to a pick instead of an all-purpose shovel for light digging.
                One of the points I hope this series of self-tutorials gets across is that any SQL Server professional can make practical use of SSDM without much investment in time, energy, server resources and training. Association Rules is an exception to that rule, because you are likely to experience a baptism by fire that forces you to delve right into the parameters that control the algorithm. Regardless of your skill level you will have to get through it, otherwise you may not get any results back at all. Thankfully, we can skip over the mining flags MODEL_EXISTENCE_ONLY and NOT NULL as usual, since our data has no nulls and we don’t want to reduce our data to just two values, Missing or Existing. There are also two parameters we won’t worry about for now since they’re not relevant to this week’s trials. One of these is AUTODETECT_MINIMUM_SUPPORT, which former members of the Microsoft’s Data Mining Team say “represents the sensitivity of the algorithm used to autodetect minimum support. Setting this value to 1.0 will cause the algorithm to automatically detect the smallest appropriate value of minimum support. Setting this value to 0 turns off autodetection, and the algorithm operates on the actual value of minimum support. This parameter is used only when MINIMUM_SUPPORT is set to 0.0.”[v] I have set MINIMUM_SUPPORT to this value but have yet to see any place to set this particular parameter in SQL Server Data Tools (SSDT). We also won’t use OPTIMIZED_PREDICTION_COUNT, which is a brute force method of limiting the results returned by DMX prediction queries, a topic that will be discussed in this series once we’ve gotten through all nine algorithms. That leaves us seven parameters to control the behavior of Association Rules, the first of which is also a brute force method of limiting the results returned by the algorithm. MAXIMUM_ITEMSET_COUNT essentially takes a hatchet to your data mining results and hacks off any that might appear after your model has finished processing this number of itemsets, regardless of how important they might be. Unfortunately, at its default setting of 200,000, I continually ran into the String Store error described in the Decision Trees post: “File system error: A string store or binary store with a compatibility level of ’1050′ is at the maximum file size of 4 gigabytes. To store additional strings, you can change the StringStoresCompatibilityLevel property of the associated dimension or distinct count measure to ’1100′ and reprocess. This option is only available on databases with a compatibility level of ’1100′ or higher. Physical file: . Logical file: .” I have already recounted the sad story of how Microsoft added a new property in 2012 that allows SQL Server Analysis Services (SSAS) users to circumvent this serious limitation with cubes, but apparently forgot to add the functionality to SSDM. It is also unclear if SSDM is actually breaching this limit, or if some undocumented problem is causing this error message for unrelated reasons, as I explained in that post; in this instance, however, it seems more probably that the limit is actually being hit, given that one of my smaller Association Rules mining structures suddenly grew by 504 megabytes once processing finally succeeded. That occurred after I set the limit as low as 500, which finally rid me of that particular error.
               In a previous project, I saw error 7with the phrases “the job completed with failure” and “readdata” repeat themselves ad infinitum, ad nauseum in a Profiler trace until my server crashed during an Association Rules test, but thankfully that didn’t occur during this week’s trials. On one occasion in this project I also ran into an error that informed me “The Association Rules algorithm found no frequent itemsets matching the algorithm parameters for model ARDenormalizedView1Model.” A more persistent and trickier problem to solve was a third error, which sometimes occurred simultaneously with the String Store fault: “The Association Rules algorithm found no rules matching the algorithm parameters for model ARDenormalizedView1Model.” It took a lot of playing with the remaining six parameters to get my models to return any rules at all, which has been par for the course with other projects I’ve tested on SSDM. The one I played with the least was MAXIMUM_SUPPORT, which sets an absolute limit on the case support for any particular itemset when set to more than 1, or as a percentage of all cases when set to less than its default of 1. As BOL says, “This parameter can be used to eliminate items that appear frequently and therefore potentially have little meaning.” This of course depends highly on what you’re digging for in your data mine. If you want to spot frequent relationships, it would be better to leave this parameter alone, but if you want to spot previously unknown ones then tweaking MAXIMUM_SUPPORT may help, assuming you haven’t spotted these relationships yet because they’re relatively infrequent. Eliminating this rule-generating error required a difficult balancing act between two parameters we’ve encountered before in this series, MINIMUM_PROBABILITY and MINIMUM_SUPPORT, and three we have not, MINIMUM_ITEMSET_SIZE, MAXIMUM_ITEMSET_SIZE and MINIMUM_IMPORTANCE. As usual, relationships where the confidence level is below the MINIMUM_PROBABILITY setting or the case count is below MINIMUM_SUPPORT are discarded. The documentation says that the default for MINIMUM_SUPPORT with Association Rules is 0.03, i.e. that an itemset must be included in 3 percent of all cases – yet when you go to set the parameter in SSDT, you will find a text label there saying that the default is actually 0.0. I have yet to verify which one of these conflicting statements is accurate.
               The itemset size parameters limit the size of the data returned in a horizontal way, by shrinking the number of columns that can appear in any itemset. For example, if we set MINIMUM_ITEMSET_SIZE.to 2 and MAXIMUM_ITEMSET_SIZE to 3, in the dataset used throughout this series we could theoretically encounter an itemset that contains the three columns MinuteGap, IOStallReadMs and IndexSize, because that is three columns wide. We couldn’t have one that combined Rows, DayOfWeek, NumOfReads and NumOfBytesWritten, since that would be four columns wide, which would be beyond the setting in MAXIMUM_ITEMSET_SIZE. Nor could we have one that combined IOOffset and CounterID alone, since that would be two columns, one less than the setting of MINIMUM_ITEMSET_SIZE. Take care not to set MINIMUM_ITEMSET_SIZE higher than MAXIMUM_ITEMSET_SIZE, as I did once, because no results will be returned. BOL also warns that, “You cannot reduce model processing time by increasing the minimum value, because Analysis Services must calculate probabilities for single items anyway as part of processing. However, by setting this value higher you can filter out smaller itemsets.” From my experience, this is accurate.
                Frequently, working with SSAS is a lot like exploring outer space, because you’re bolding venturing out where few men have gone before. For a few examples, see my series on some of the errors you’ll encounter in SSAS for which there is apparently no documentation anywhere on the planet. Documentation for MINIMUM_IMPORTANCE exists, but it is so thin that you can read find it all within a few minutes. There are two sentences referring to it in the last edition of the Data Mining with Microsoft SQL Server 2008, which was written by former members of Microsoft’s Data Mining Team: “Minimum_Importance is a threshold parameter for association rules. Rules with importance less than Minimum_Importance are filtered out.”[vi] A search on Google turns up just 84 hits, of which only maybe four or five are relevant posts which aren’t repeats of each other. One of these is a post in which Jamie MacLennan of the DM Team says that a bug was found in the parameter back in 2006 and that a fix was being prepared for the next service pack in SQL Server 2005, but I haven’t yet found any reference to whether it was fixed or not since then.[vii] The only explanation I found was in another post by MacLennan, who said that “Minimum_Importance is a calculation that further filters rules based on the amount of lift they provide – the purpose is to filter out tautologies, e.g. “Everybody buys milk, so Cookies->Milk is true with 100%”.   This rule is not important, since <anything>->Milk would also be 100%.”[viii] I also found some other good explanations of the meaning of the statistical term importance in Association Rules at the Data Mining Forum, but without any direct mention of the parameter.[ix] They say that a little knowledge is a dangerous thing, yet I had no choice but to charge into the fray with this minimal understanding of the parameter, because it quickly became of maximum importance to the success of this week’s trials to get my processing times down using MINIMUM_IMPORTANCE.
               What followed was a process of trial and error, with a lot of the latter. Figure 1 is a summary of this week’s trials on the three views I have built my mining structures from throughout the series, which are taken from about three days of polling dm_exec_query_stats, dm_io_pending_io_requests, dm_io_virtual_file_stats, dm_os_performance_counters, dm_os_wait_stats and sp_spaceused every minute, in order to intentionally cause IO pressure. I had hoped that this would kill two birds with one stone, by familiarizing myself more with a SQL Server topic I don’ t yet know enough about, while providing a familiar reference point for those DBAs who already understand it. As discussed in previous posts, I cut back to three denormalized views that left joined dm_io_pending_io_requests, dm_io_virtual_file_stats, dm_os_performance_counters and a parent RecordTable (which tracked time) to sp_spaceused, dm_os_wait_stats and dm_exec_query_stats respectively. Ideally, Association Rules calls for market basket analysis and nested tables, but I’ll defer discussion of that till the end of this post. For the meantime we’ll stick with mining structures like those we’ve worked with in previous weeks, so that we don’t make apples vs. oranges comparisons. I’ve included some figures from our Logistic Regression and Neural Networks trials to show how processing times vary, although they use different parameters. As you can see, the results of this pioneering public test of these parameters were not positive. Not only did many of the jobs fail to complete, but the processing times were often simply awful whether they completed or not. It took a full day of playing with these parameters over the course of 14 trials to successfully return any results at all. That time would have stretched out across an entire week, if I hadn’t learned a really valuable tip the hard way: when processing an unfamiliar model, always keep a Profiler trace going and calculate how much time it will take in advance, by taking samples of the interval between messages that look like this, “Reading cases: n cases read” and multiplying by the number of cases in your model. You’ll need to click on Figure 1 to enlarge it, because the table contains a gold mine of information on the mining process and the parameters it took to finally get the job done. It doesn’t tell the whole story though. IO pressure was next to nil throughout all 16 trials on the Association Rules models, but when I finally got usable results back on the 14th model, 504 megs of data were suddenly added to the mining structure all at once, which is probably why I was hitting the 4 gigabyte string store limit when MAXIMUM_ITEMSET_COUNT was set much higher than a few hundred or a few thousand. I was recently fortunate enough to get a new 8 gig stick of RAM, otherwise I would have been in serious trouble in this week’s trials, in which the msmdrv process routinely gobbled up between 3 and 5 gigs at a time. My processor continually ran on just one of its six cores during this set of trials, which may indicate that SSDM could benefit from better multi-threading in certain phases of model processing. Perhaps one of the “forbidden parameters” in SSAS (i.e. those that can only be changed by editing the msmdsrv.ini file) could help, but I’m not familiar with them enough yet to try. All in all, this brute force method of data mining was brutal on my poor machine. Strangely, once I finally got results back from the 14th model, the final two processed in just a few tenths of a seconds, with just a couple of megs being added to the database size. I used different parameters in both cases, including multiplying the MAXIMUM_ITEMSET_COUNT by a factor of ten to 5,000, but I suspect that SQL Server simply reused some of the data stored during the processing of the 14th model.

Figure 1: Performance of Association Rules at Various Parameter Settings, Compared to Other Algorithms (click to enlarge)
TableOfResults

                What did we get in return for this large of investment of training, time, server resources and energy? Not much. In Figures 2 and 3 we see the output of the Rules and Itemsets tabs in the Microsoft Association Viewer. It also has a Dependency Network tab, which I found to be a useful starting point in the workflow for sifting through mining results, but we’ve already discussed that in depth in past posts, such as those on Naïve Bayes and Linear Regression. The Minimum Probability and Minimum Importance dropdowns and Maximum Rows control can be used to limit the results in the Rules tab, which can be sorted by Probability or Importance by clicking on the appropriate column header. The Minimum Support, Minimum Itemset Size and Maximum Rows controls serve basically the same purposes in the Itemsets tab, which can be sorted by the Support by number of cases or the Size, by the number of columns represented in each itemset. The Filter Rule, Filter Itemset, Show and Show Long Name are used to limit the rules or itemsets by the names contained in them, or to alter the way they are displayed. Luckily these controls aren’t hard to use or understand, because I had to really sift through the results to retrieve any relationships that were relevant to me. I noticed two relationships between IOStallReadMs and two other measures in the Rules tab and the one highlighted relationship between IOStall and IOPendingMsTicks in the Itemsets tab, but that was it. I suppose I could have found more relationships by sifting through the thousands of results on both tabs, but that would have taken far more time than with any of the other algorithms we’ve surveyed to data, where useful relationships were immediately apparent in the results. Every time I have used Association Rules in the past, I have had the same unpleasant sensation of having to mine the mining results themselves by hand, so to speak. I suppose this shouldn’t be that surprising, given that we can’t dig that deep with such a blunt instrument, which forces us to continue sifting through the results with the equivalent of our bare hands.

Figures 2 and 3: The Rules and Itemsets Tabs (click to enlarge)
Rules Tab
Itemsets Tab

               I will spare you the chore of having to sift through the long, unordered list of itemsets and rules that we can see in the Generic Content Viewer. It’s not torture to understand, like interpreting the innards of a neural net, but it doesn’t really tell us much more than we’ve already seen in the Rules and Itemsets tabs. There are only three types of nodes, two of which specify items or rules. The key node to look for is the model root at the top, since it has stats for the whole model, including ITEMSET_COUNT, RULE_COUNT, MIN_SUPPORT, MAX_SUPPORT, MIN_ITEMSET_SIZE, MAX_ITEMSET_SIZE, MIN_PROBABILITY, MAX_PROBABILITY, MIN_LIFT and MAX_LIFT. All of the minimums and maximums are those found for any itemset or rule in the whole model. For the sake of consistency I’ll provide a quick reference to the metadata returned by Association Rules as I have in past posts, although this would be most useful if we were parsing the results ourselves with some custom DMX queries. The nested NODE_DISTRIBUTION table isn’t difficult to interpret with this algorithm, because each row refers to a different item in an itemset or rule. VARIANCE is always 0, the ATTRIBUTE_NAME simply tells us the column name, ATTRIBUTE_VALUE gives us a label representing the state of the column and PROBABILITY is always equal to the NODE_PROBABILITY and MARGINAL_PROBABILITY. VALUETYPE tells us the Content type of the column, such as Discrete or Discretized.

Figure 4: Metadata for Association Rules (adapted from Books Online)
ARMetadata

                Overfitting is a big problem with Association Rules, as I found out the hard way long before I ever started this series of tutorials. These results are about par for the course. Of course, that begs the question of whether or not I’m using the algorithm the way it was designed to be used. The answer is that I haven’t, simply because I wanted to illustrate how much more clumsy the algorithm is when processing datasets of sizes similar to those we’ve already used. The documentation warns that large datasets with many distinct items or low thresholds for minimum itemset in the model parameters can be time-consuming and they’re not kidding. It also advises that “To minimize processing time and reduce the complexity of the itemsets, you might try grouping related items by categories before you analyze the data,” which might work. Yet in a sense, if we’re comparing it against other algorithms, that’s cheating, because we’re drastically reducing the granularity of our data in that way. Furthermore, if we already know enough our data to group them this way, then why run an algorithm that is designed to group items together? Yes, we can get more detail about specific items in this way, in order to link items together more accurately, but it’s a bit like digging a hole with your bare hands, then finishing the job with a spade. I’m not disputing that it can be useful to do this, only pointing out that the range of scenarios where this will be useful is quite limited.
                That brings us to market basket analysis and recommendation engines, which are all the rage now in the E-commerce world (which, it should be remembered to keep our priorities in proportoin, represents just a tiny corner of the wide range of human problems that can be addressed by data mining). The two are basically flip sides of the same coin; one notices that customers tend to buy certain things together, while the other recommends products to customers because they are frequently found in the purchases of other customers. Long before I ever used SSDM, an apocryphal story had already seeped across the Internet to me of how data miners had discovered a curious market basket relationship: men who buy beer also tend to buy diapers with them. I don’t know if it’s true, but I’ve seen it repeatedly referenced in the origins of Association Rules now. I crack a smile every time I see recommendation engines on websites offer me products, because I have an inkling of what’s going on under the hood at those web servers. I used to snicker because the recommendations were sometimes patently ridiculous, but the technology must be improving, because certain websites now actually point out things I might buy or want that I previously didn’t know about. It’s entirely possible that they’re using some other algorithm, like neural nets or Sequence Clustering, to do jobs like this – I can make a good case that they might do a more efficient job of recommending better products, if designed correctly. Or they may be using one of the newer enhancements to Association Rules. Chance are, however, that they are using the original A Priori version of it, which was designed with one thing in mind: transaction analysis.
               For that, you basically need two things: a parent table to uniquely identify transactions, plus a dependent table with a list of items for each. I now know that the major mistake I made in A Rickety Stairway to SQL Server Data Mining, Part 0.2: How to Dig Out of a Data Mining Cave-In was a common one: I assumed that nested tables in SSDM were exactly equivalent to relational tables related by a foreign key, but this isn’t precisely true. I missed a subtle difference which counts for a lot in the end. Two keys are actually required: a case key, which SSDM retrieves automatically from the relationships in your data source views (DSVs), and a nested key, which identifies a column in the dependent table that you want to analyze. In fact, it quite often makes sense to eliminate the case key from your model, if it has no information content other than to link the two tables together.[x] Since the data mining cave-in I mentioned in that post, I have had to resort to three denormalized views in order to reduce the outrageous processing times I encountered with my original DSV, which featured the tables corresponding to my six DMVs linked to a single RecordTable. I made the mistake of using RecordID, the case key in all six tables, as my nested key, which is probably what led to the mother of all performance bottlenecks. There were simply multiple RecordIDs for each unique row in the dependent tables, which the DM Team’s aforementioned book says can lead to “undefined” behavior. When you use nested tables, each column within them is stored on disk as a separate dimension, which may have been what caused the massive IO pressure I encountered, followed by massive page file usage and memory consumption. After this week’s trials on the three denormalized structures we’ve been using as substitutes, I was finally able to return to my original DSV schema and nest all six dependent tables the way I had planned at the beginning. At times, I departed slightly from the classic market basket setting because the data I collected wasn’t exactly suited to it. In many scenarios, we would have to do a lot of artificial aggregation of our data, if there were more than one distinct value for our nested key for each unique case key. I’m assuming, however, that it is safe to skip aggregation if our nested key has a 1:1 relationship with the case key. This is precisely the case with certain crucial measures in our data that I want to investigate further, such as WaitTypeID and CounterID. The same few hundred wait types and dozen or so performance counters I collected occurred in exactly the same proportion with each data collection job, so we don’t need to perform any smoothing of our data to use them. In a normal market basket analysis, a person may buy more than one of the same item in a single transaction, or not buy any at all, so I also experimented with aggregating on measures that didn’t have a 1:1 relationship with the case key.
               Now that I was using the algorithm and the nested tables feature as directed, SSDM behaved better. Processing time took only a dozen minutes or so on a couple of occasions and the single resulting mining structure took up just 27 megs of disk space. Furthermore, contrary to the previous trials, SSDM began making use of multiple processor cores, possibly because there were multiple dimensions to process; the algorithm still could have made better use of multi-threading in certain phases (such as the “Started training algorithm phase,” where it went back to one core to stay). Yet not all of the problems with Association Rules were due to my own inexperience, because as I go to press with this post several weeks after writing it, I’m back to playing with the parameters in the hopes of finally generating some rules out of these nested tables. Forget the “training algorithm phase”; I’m still stuck on the “training amateur data miner phase.” The learning curve for this data mining method seems to be prohibitively high in comparison to the other eight SSDM algorithms, on top of the unavoidable performance and lack of utility in the results. This algorithm is indeed a simple, blunt instrument, but has more in common with a pick than a shovel, because its utility is further limited to a few narrow chores. I occasionally find uses for picks when digging, but not as often as shovels.[xi] For this reason, I would almost always choose Naïve Bayes or Linear Regression as a preliminary algorithm in any workflow over Association Rules, unless there was some specific matching need like market basket analysis. Don’t get me wrong, I’m not knocking Association Rules. I’m not knocking picks either. I just wouldn’t want to do any serious digging with either one. The analogy breaks down there, however, because a pick doesn’t require much of an investment in training the way Association Rules does. Due to its inner simplicity, I would have put it at the beginning of this tutorial series, but I wanted to stick with the concept of statistical building blocks, to make the series a little more comprehensible. If I wanted to go in order of the complexity of the parameters and learning curve, this algorithm would have come last. This tool’s a little bit of an oddball; it’s not even used as a building block for any of the other eight algorithms, nor is it founded on any of them. I’ve only included it at this point in the series because it’s vaguely similar to Clustering, which will be the topic of the next two tutorials. Clustering isn’t often compared to Association Rules, but there is some overlap in their functionality, in that one groups items by the commonality with which they are found together, while the other groups them by common properties. Clustering also implies a measure of distance between groups of items, which can be useful. Its more advanced cousin, Sequence Clustering, can also perform many of the same functions as Association Rules, while taking temporal ordering into account. This is merely one more function that this week’s algorithm can’t perform well. Association Rules is a one-dimensional tool in the world of multidimensional databases, but there are times when one dimension is all we need.


[i] Piatetsky-Shapiro, Gregory, 1991, “Discovery, Analysis, and Presentation of Strong Rules,” pp. 229-248 in Piatetsky-Shapiro, Gregory and Frawley, William J. eds., Knowledge Discovery in Databases, MIT Press: Cambridge, Mass. Also see Agrawal, Rakesh; Imielinbski, Tomasz and Swami, Arun, 1993, “Mining Association Rules Between Sets of Items in Large Databases,” pp. 207-216 in Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data. No publisher or city listed. Available online through the Penn State CiteSeerX website at http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.217.4132

[ii] Hájek, Petr; Havel, Ivan and Chytil, Metodej, 1966,”The GUHA Method of Automatic Hypotheses Determination,” pp. 293-308 in Computing 1. No volume or number given. Also see the Wikipedia article “Petr Hájek” at http://en.wikipedia.org/wiki/Petr_H%C3%A1jek

[iii] See the Wikipedia article “Association Rule Learning” at http://en.wikipedia.org/wiki/Association_rule_learning

[iv] I was originally going to post the link to the site where you can buy a crovel, just because I like to cite everything as much as possible out of habit, but thought that might be construed as a commercial endorsement. I wonder how many of them were returned after the Mayan Apocalypse fizzled in December…and how many of the buyers thought that purchasing a really expensive shovel might actually save them from Armageddon.

[v] p. 369, MacLennan, Jamie; Tang, ZhaoHui and Crivat, Bogdan, 2009, Data Mining with Microsoft SQL Server 2008. Wiley Publishing: Indianapolis

[vi] IBID., p. 368.

[vii] MacLennan, Jamie, 2006, “Usage of Association Rules,” posted Aug. 18, 2006 at the MSDN Data Mining Forum at http://social.msdn.microsoft.com/Forums/en/sqldatamining/thread/c90a1c08-bf97-41a0-ac7d-4e4450423e0a

[viii] MacLennan, Jamie, 2007, “The Mean of Using Association with Importance and Probability,” posted April 17, 2007 at the MSDN Data Mining Forum at http://social.msdn.microsoft.com/forums/en-US/sqldatamining/thread/c7616723-9f98-4ea5-9d7e-a20f7c528a4a/

[ix] See the MSDN Data Mining Forum discussions “Association algorithm – Importance of a rule,” started March 6, 2006 and “Association Rules – Importance,” started Feb. 14, 2008, at

http://social.msdn.microsoft.com/forums/en-US/sqldatamining/thread/a1b1f0e4-0e8c-42f0-bf0b-cff176019e03/

and http://social.msdn.microsoft.com/forums/en-US/sqldatamining/thread/a02367ed-b6e6-421a-9b82-2f8ca0df71f9/

[x] pp. 89-90 in MacLennan, et al. contains a much better explanation of how nested tables work than that found in Books Online.

[xi] Since I was in high school, I’ve done my share of digging shale out of a particular mountain in Western New York to keep it from falling on a relative’s rooftop, and haven’t gotten very far yet. Every so often, however, I drop my shovel and reach for a pick, which helps me shrink the pile of fallen shale a little bit faster. When I use Association Rules, I get the same sensation of straining at a mountainside, wishing I had heavier equipment.

Follow

Get every new post delivered to your Inbox.