Category Archives: A Rickety Stairway to SQL Server Data Mining

A Rickety Stairway to SQL Server Data Mining, Part 15, The Grand Finale: Custom Data Mining Viewers

By Steve Bolton

…………As mentioned previously in this amateur self-tutorial series on the most neglected component of Microsoft’s leading database server software, SQL Server Data Mining (SSDM) can be extended through many means, such as Analysis Services stored procedures, CLR functionality, custom mining functions and plug-in algorithms. I had originally planned to close out this series with at least two separate articles on the subject of writing custom means of data visualization, but these plug-in viewers turned out to be simpler to implement than I expected. The deployment aspects are naturally the most difficult aspects, but the code behind custom data mining viewers is actually much less challenging than that for plug-in algorithms, which we already discussed in depth in previous articles. They’re by no means trivial, but programmers familiar with the .Net framework shouldn’t find them too difficult, especially if they’re already experienced with Window Forms user interface elements that were popular in the second half of the last decade. Throughout this series I have tried to provide Visual Basic code samples because most of the SSDM tutorials available on the Web are written in C#, the other major .Net language.  Likewise, I originally intended to write a viewer using a more modern Windows Presentation Foundation (WPF) architecture, since all of the tutorials I’m aware of are for Windows Forms. Even though it would be child’s play to use a WPF datagrid in a WPF container control in Windows Forms, I decided it against it in order to simplify the instructions and avoid another layer of controls to debug. I also decided to adhere to the Keep It Simple Stupid (KISS) design principle by writing a viewer that performed a very basic task with easily decipherable output; in this case, we’ll simply be using a datagrid to display the output of a simple Data Mining Expressions (DMX) query, which selects some familiar metadata from the PluginModel.CONTENT we’ve used throughout this series, like ATTRIBUTE_NAME, MSOLAP_NODE_SCORE and 17 other columns.
…………Despite the fact that it was written in August of 2004, nearly a decade ago, the original tutorial provided by Microsoft for plug-in viewers, by Jesper Lind and Scott Oveson is still quite valuable. There is no custom viewer tutorial in the SDK and the only other resource I know of on the topic is a four-page discussion in Data Mining with Microsoft SQL Server 2008, the indispensable book by former members of Microsoft’s Data Mining Team.[i] In order to make use of the Lind-Oveson tutorial, however, you’ll need to download the original SQL Server 2005 samples for plug-in algorithms and custom viewers from the respective links at Codeplex. As I did with Bogdan Crivat’s sample code in the from software development kit (SDK) for custom algorithms, I’ll start off with their codebase, but it will look nothing like the original after I get done converting the C# to Visual Basic, renaming the variables to my liking and repurposing the methods to create a completely different kind of viewer (and perhaps mangling it beyond recognition in the course of adding my own bugs). Most of the code is fairly straightforward, so there are only so many ways you can write it anyways, although there are some methods like LoadViewerData that provide more wiggle room for creativity.
…………Writing a viewer is not really much harder than using ADOMD in an ordinary Windows Forms or WPF application. The only substantial difference is that we’ll compile it in a class library that can run inside Visual Studio or SQL Server Management Studio (SSMS), like other data viewers do. The process is a little counterintuitive at first, in that you being by creating a .Net class project, but instead of using a class object (the one automatically generated by Visual Studio can be deleted), you use Add…New Item from the context menu to create a Windows Forms User Control class. This is not a tutorial on basic Visual Studio functionality, so to avoid the common affliction of “tutorial clutter” I’ll leave out the details on how to work with user controls, just as I have left the common Windows Forms code and user interface declarations out of the code in Figure 1. Next you’ll need to add a reference to Microsoft.DataWarehouse.Interfaces, which should be listed on the .Net tab of Visual Studio’s Add Reference dialog once you right-click the class and select Add Reference… on the context menu. The ViewerUserControl class also Inherits System.Windows.Forms.UserControl and Implements IMiningModelViewerControl2, an upgrade available in SQL Server 2008 from the original plug-in viewer interface, which introduces further slight differences between my class and Lind and Oveson’s example. One of the crucial mistakes I made at this juncture was failing to coordinate my assembly, class name and namespace, which led to a lot of weeping, wailing and gnashing of teeth when I tried to register them during deployment. The winning combination for me was “PluginViewer.PluginViewer.ViewerUserControl.”

Figure 1: Partial Code for the ViewerUserControl Class
Imports System.Collections
Imports System.ComponentModel
Imports System.Drawing
Imports System.Data
Imports System.Data.OleDb
Imports System.Windows.Forms
Imports Microsoft.DataWarehouse.Interfaces
Imports Microsoft.AnalysisServices.AdomdClient 

Namespace PluginViewer
    Public Class ViewerUserControl
        Inherits System.Windows.Forms.UserControl
        Implements IMiningModelViewerControl 

        Private ClassConnectionString As String = String.Empty
        Private ClassMiningModelName As String = String.Empty
        Private ClassServiceProvider As IServiceProvider
        Private ClassConnection As System.Data.IDbConnection
        Private ClassCommand As New Microsoft.AnalysisServices.AdomdClient.AdomdCommand
        Private ClassAdapter As New Microsoft.AnalysisServices.AdomdClient.AdomdDataAdapter

        Private Sub ViewerUserControl_Load(sender As System.Object, e As System.EventArgs)
            this procedure was automatically added by Visual Studio, not in the original sample code
        End Sub

        Public Shared ReadOnly Property DisplayName() As String
                Return “My PluginViewer”

            End Get
        End Property

        Public Property ServiceProvider() As IServiceProvider Implements IMiningModelViewerControl.ServiceProvider
                Return Me.ClassServiceProvider
            End Get
            Set(Value As IServiceProvider)
                If Me.ClassServiceProvider IsNot Value Then
                    Me.ClassServiceProvider = Value
                End If
            End Set
        End Property

        Public Property ConnectionString() As String Implements IMiningModelViewerControl.ConnectionString
                Return Me.ClassConnectionString
            End Get
            Set(value As String)
                If Me.ClassConnectionString <> value Then
                    Me.ClassConnectionString = value
                End If
            End Set
        End Property

        Public Property MiningModelName() As String Implements IMiningModelViewerControl.MiningModelName
                Return Me.ClassMiningModelName
            End Get
            Set(value As String)
                If Me.ClassMiningModelName <> value Then
                    Me.ClassMiningModelName = value
                End If
End Set
        End Property

        Public Sub ViewerActivated(IsActivated As Boolean) Implements IMiningModelViewerControl.ViewerActivated

        End Sub

        Public Function LoadViewerData(LocalContext As Object) As Boolean Implements IMiningModelViewerControl.LoadViewerData
            ClassConnection = New AdomdConnection(Me.ConnectionString)

            ‘I’m going to use a Data Adapter instead of an ADOMD command
            ClassCommand = ClassConnection.CreateCommand()
            ClassCommand.CommandText = String.Format(“SELECT
, Me.MiningModelName)

            ClassAdapter.SelectCommand = ClassCommand
            ClassAdapter.Fill(Me.PluginViewerDataset1, PluginViewerDataTable)

            Return True
        End Function
        End Class

…………Most of the code above consists of simple Get and Set statements for viewer properties, like DisplayName, ServiceProvider, ConnectionString and MiningModelName. The DisplayName is a description of your choice which will be displayed in the Viewer dropdown box in SSDM projects (which does not necessarily have to be the class name) while the other properties are mostly taken care of by SQL Server. In my class, these property statements are used to populate the class-scoped ClassConnectionString, ClassMiningModelName and ClassServiceProvider variables. The trickiest part of this for me was using the right variable types in my Implements statements. The IMiningModelViewerControl2 class also has a property that wasn’t in the previous interface version, DBConnection, of type System.Data.IDbConnection; I’m unfamiliar with it and the code works fine as is, so I’ll omit discussion of it.
…………When loading a viewer, the ViewerActivated function will also be called; during testing I wrote messages to the Event log in the procedure just for tracking purposes, but I’ve left it blank here in order to simplify the tutorial. Suffice it to say that you can include your own initialization code here for events that occur after the viewer is activated. Lind and Oveson state that “If the viewer has another window open such as a legend, it may want to bring that window to the user’s attention when the view is activated (that is, clicked on, or selected using the tabs in the viewer or by selecting it using the dropdown box).” The meat and potatoes are found in the LoadViewerData routine, where the connection is opened and a command is issued to load the data. Much of the action takes place off-screen in my example because I used a data adapter directly bound to a Windows Forms datagrid, which hides a lot of the code-behind from view. I took a cue from A Rickety Stairway to SQL Server Data Mining, Part 12: Accessing SSDM with AMO, ADOMD, SSIS, XMLA and RS  and used an ADO.Net dataset rather than the old OLEDB interface in the Lind-Oveson tutorial, or the ActiveX Data Objects Multi-Dimensional for .Net (ADOMD.Net) objects cited in MacLennan et al. If you’re using ADOMD, you may have to download and install the right package for the version of SQL Server you’re working with, which for SQL Server 2012 are the SQL_AS_AMO.msi and SQL_AS_ADOMD.msi installers on the SQL Server 2012 Feature Pack webpage. MacLennan et al. also use a DataReader object and Lind and Oveson use an OLEDB command to access the data in their respective tutorials, so I opted for using an ADO command object just to add a little contrast; the DM Team’s book also discusses some nuances about the proper use of connection objects in LoadViewerData, which may be helpful in circumstances that I didn’t encounter in my brief foray into writing viewers.[ii] The most helpful suggestion I can provide for this procedure is to test out any DMX query you call here in a SQL Server Management Studio (SSMS) window first, because it’s much more cumbersome to troubleshoot incorrect syntax from within tests of a custom viewer. Also keep in mind that if you execute a simple metadata query like mine, the PARENT_UNIQUE_NAME and CHILDREN_CARDINALITY column names must be in brackets. I don’t know the reason for this strange limitation, but it’s a consistent syntax requirement of any DMX query.
…………The most challenging aspects of writing a plug-in viewer are the setup and  testing, but the process is not much different than the one described for custom algorithms in A Rickety Stairway to SQL Server Data Mining, Part 14.3: Debugging and Deployment. In some ways it is actually much easier, because we don’t need to reference a wrapper like DMPluginWrapper.dll and coordinate such things as .Net and bit versions between them our classes. We can also repeat some of the procedures discussed there, such as creating a Strong Name Key File (.snk) for our project; in this case, however, we will need to find the public key token and include it in the registry. It is possible to run sn.exe from a command prompt to extract the public key token, but I added a command to the Visual Studio tools menu in order to reuse it in future projects, by following the directions at the MSDN posts Getting Public Key Token of Assembly Within Visual Studio, Visual Studio Tip: Get Public Key Token for a Strong Named Assembly and  Visual Studio 2010 Get Public Key Token. The gist of it is that you assign the Title of the command as it will appear in your Tools menu, then the full path to the command (which in my case was C:\Program Files\Microsoft SDKs\Windows\v7.0A\bin\NETFX 4.0 Tools\sn.exe) and -Tp $(TargetPath) in the Arguments section. The Use Output Window checkbox should also be selected. Some of the sources I read said to use quote marks around $(TargetPath) but that produced an error message saying that the file could not be read because it was missing. Once the command completely successfully, copy the 16-digit code after the phrase “Public key token,” which in my case was “b51c4a48a2da4bbd.” To register the viewer, we will have to create a registry key with the name of our viewer and four entries within that key. The Algorithms entry must be to a value equal to the name of the plug-in algorithm we will use it with, which in the case of the practice project we’ve been using in the Rickety series would be Internal_Name_For_My_Algorithm. We also need to create an entry named Assembly with our namespace, followed by a comma and the public key value we just extracted. In this case, the registry entry value would be “PluginViewer, PublicKeyToken= b51c4a48a2da4bbd” sans the quote marks. It is also necessary to create a Class key set to our project’s namespace plus the name of the class, which corresponds to ‘PluginViewer.ViewerUserControl” and a DisplayName key with the string you want to appear in the Viewer dropdown on the Mining Model Viewer tab.
…………The tricky part is to get add the subkey to the right registry keys. The Lind-Oveson tutorial says that it must be added to HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\90\Tools\BIShell\Packages\{4a0c6509-bf90-43da-abee-0aba3a8527f1}\Settings\Designers\DataMiningViewers to make a custom viewer work with Visual Studio and HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\90\Tools\Shell\Packages\{4a0c6509-bf90-43da-abee-0aba3a8527f1}\Settings\Designers\DataMiningViewers for SQL Server Management Studio (SSMS). That tutorial was designed with SQL Server 2005 and Visual Studio 2005 in mind though. On my development machine, I had to install them to the HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Microsoft\VisualStudio\10.0\Packages\{4a0c6509-bf90-43da-abee-0aba3a8527f1}\Settings\Designers\DataMiningViewers key for Visual Studio 2010 and HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Microsoft\Microsoft SQL Server\110\Tools\Shell\Packages\{4a0c6509-bf90-43da-abee-0aba3a8527f1}\Settings\Designers\DataMiningViewers for the SQL Server 2012 version of SSMS. There are also DataMiningViewers keys in HKEY_CURRENT_USER, but these are unimportant for our purpose. Note that the GUIDs are the same as in the Lind-Oveson tutorial, but that there are some subtle differences, such as the SQL Server version number (90 vs. 110), the name of the tool (“BIShell” vs. Visual Studio 10) and the fact that in my case, they occurred under the Wow6432Node (which is mystifying, because this key is for 32-bit compatibility, but I’m running Analysis Services 64-bit on a 64-bit version of Windows). There is probably some shortcut to determining the key locations that would make writing setup programs for custom mining viewers much easier, but so far I’ve merely done a search for the DataMiningViewers key in Regedit. Once that determination is made, you can add the registry entries in whatever way you prefer, such as reg.exe at a command prompt, merging .reg files or editing the registry manually with Regedit. In order to familiarize myself with the process of writing setup programs for custom viewers, I opted to add a deployment project to the MonitorMiningSolution that all of the projects in this series belong to. In Figure 3 you can see an example of my registry settings; note that I used the Software key instead of explicitly specifying the Wow6432Node, which will assign it the latter through reflection when the 32-bit registry is viewed in the 64-bit version of Regedit.
…………This is not a lesson in how to work with Visual Studio 2010 deployment projects, for which there are certainly many freely available tutorials; the aim here is to point out nuances to it that are relevant to the writers of custom mining viewers. One thing such developers need to be aware of is that the Lind-Oveson tutorial advises to copy our PluginViewer.dll file to the same location as the out-of-the-box viewers, which in the case of SQL Server 2005 is apparently C:\Program Files\Microsoft SQL Server\90\Tools\Binn\VSShell\Common7\IDE\. I can’t find any path named “VSShell” on my machine and no folder that would be roughly equivalent, even after substituting the 110 version for 90 and using a Program Files (x86) location. I did, however, find Microsoft.AnalysisServices.Viewers.dll in two locations in a hard drive search, C:\Program Files (x86)\Microsoft SQL Server\110\Tools\Binn\ManagementStudio and C:\Program Files (x86)\Microsoft Visual Studio 10.0\Common7\IDE\PrivateAssemblies. After moving the DLL to the latter folder and restarting Visual Studio the viewer loaded correctly. The Lind-Oveson tutorial also advises to use the gacutil utility to register the assembly, like so: gacutil /i PluginViewer.dll, which should add it to the Global Assembly Cache (GAC). Figures 2 and 3 depict the settings I used for the deployment project in the Visual Studio 2010 File System Editor and Registry Editor.

Figure 2: Visual Studio 2010 File System Editor Example for the PluginViewer Project

Figure 3: Visual Studio 2010 Registry Editor Example for the PluginViewer Project

…………The directions Lind and Oveson give for associating a particular algorithm with a new viewer seem to be applicable only to COM projects written in C++. These days, plug-in developers are probably going to use the .Net platform exclusively, so my segment of the Rickety series on custom algorithms was strictly limited to that architecture. My sample code was written in Visual Basic, whereas other SSDM tutorial writers have opted for C#, but both compile down to the same base .Net language. The solution they provide involves adding a variable to the GetViewerType method of a C++ file that doesn’t even exist in the .Net wrapper DMPluginWrapper.dll we had to compile back in A Rickety Stairway to SQL Server Data Mining, Part 14.1: An Introduction to Plug-In Algorithms, so that we could reference the plug-in architecture in the .dll of our custom algorithm. As discussed in A Rickety Stairway to SQL Server Data Mining, Part 14.2: Writing a Bare Bones Plugin Algorithm, the GetViewerType method of AlgorithmMetadataBase is ultimately responsible for associating a custom algorithm with specific viewers. Normally you’d return at least one of the seven values of the MiningViewerType class, but since the return value type is just a string, we can substitute any value we want – including the name of our PluginViewer, sans the UserControl portion of the class name. Simply edit it, then recompile that class and perform all of the other steps discussed in Post 14.3 for redeploying a custom algorithm.[iii] If redeployment of the algorithm is successful, the new data mining viewer should be listed in the VIEWER_TYPE column of the DMSCHEMA_MINING_SERVICES rowset in the same row as the custom algorithm. As Lind and Oveson mention, you can associate multiple viewers with an algorithm by separating their names with semi-colons in the algorithm code, which in our case would be in the string returned by AlgorithmMetadataBase.GetViewerType. For example, we could specify our custom mining viewer in GetViewerType together with the one of the enumeration values for the standard viewers like so: Return “PluginViewer;” + MiningViewerType.MicrosoftCluster. Since PluginViewer is listed first it would be listed first in the VIEWER_TYPE column and at the top of the dropdown menus in SSMS and Visual Studio. I’m not aware of any means of enabling one of the nine out-of-the-box algorithms we discussed much  earlier in this series to use a custom viewer (even by editing the msmdsrv.ini file by hand), but that does not indicate that they do not already exist or cannot be invented.
…………Be aware that when a custom viewer fails to load, no messages are recorded in msmdsrv.log or the event logs. If an error occurs too early in the loading process, the viewer simply won’t be listed in the VIEWER_TYPE column of the DMSCHEMA_MINING_SERVICES rowset, but if it occurs late enough you may see the non-specific message like “An error occurred attempting to create the viewer MyCustomMiningViewerName.” So far, I have been unable to debug custom mining viewers in Visual Studio because I invariably receive messages on every breakpoint that, “The breakpoint will not currently be hit. No symbols have been loaded for this document.” This is true whether I attach to msmdsrv.exe or to the SSMS process that the viewer is supposed to run in. Perhaps there is a more straightforward way of doing this by debugging through the same means that are used with ordinary Windows Forms user controls, which I am more than a little rusty on. The path of least resistance for me was to simply write messages to the Event log that tracked exceptions and the routines they occurred in, which helped expose the call stack. From what I can gather, the viewer loading sequence begins with DisplayName and is followed by calls to Initialize Component, Set ServiceProvider, Set ConnectionString, Set MiningModelName, LoadViewerData, Get ConnectionString and Get MiningModelName, in that order. Once I debugged my viewer sufficiently well to get through this sequence, I was able to visualize the PluginModel’s metadata in a Windows Forms datagrid, as depicted in the figure below.

Figure 4: The PluginViewer in Use in a Visual Studio Data Mining Project

…………Now that we’ve gotten through the most difficult aspects of custom algorithms and viewers, we can touch on the subject of GetNodeIDsForCase, a function in AlgorithmBase that I neglected till now because it is only required for plug-ins that make use of data mining dimensions and drillthrough. The former is a feature that we skipped over because it is of its limited utility, whereas drillthrough can’t be implemented in the Generic Content Viewer and thus is more applicable when custom algorithms are used with custom viewers. Returning a DrillThroughNodeFlags.Content value in GetNavigator rather than DrillThroughNodeFlags.DimensionContent will direct SSDM to apply it to drillthrough rather than mining dimensions. It is called internally by SSDM (usually at least once after InsertCases and possibly before SaveContent) so the developer has little direct control over it. Basically, it returns the NodeUniqueNames of content nodes that are associated with a particular case as an array of strings; according to Books Online, it “Maps cases to node IDs, which are simply labels for nodes in a tree layout of your model content.” Crivat also says in the tutorial in the SDK “that if you are implementing a decision tree algorithm, it is likely that more nodes (the full path from root to leaf) will contain any given case. You can push all the node unique identifiers for those nodes in the string array returned by this function.” When the CaseID is -1 and there is no InputCase, the root node will be returned, but otherwise the cases associated with particular nodes are returned. I added the code in Figure 5 to the custom algorithms I posted a few articles ago and verified that it worked correctly, although I found no immediate use for it. When debugging code like this, note how CHILDREN_CARDINALITY changes for the root node.

Figure 5: An Example of the GetNodeIDsForCase Method
Protected Overrides Function GetNodeIDsForCase(nodeFlags As DrillThroughNodeFlags, CaseID As Long, inputCase As MiningCase, maxNodes As ULong) As String()

            Dim ReturnString() As String
            Select Case CaseID ‘I implemented this as a Select in case we want to add custom logic later to make the type of node added conditional on the case
                Case -1 ‘the flag for the root node at the top of the hierarchy, containing all cases
                    ReDim ReturnString(0)
                    ReturnString(0) = “Root Node”
                Case Else
                    ReDim ReturnString(AttributeSet.GetAttributeCount)

                    For I As UInteger = 0 To AttributeSet.GetAttributeCount

                        in this model, we have 20 rows/input cases, with two columns; I want each row to be associated with both columns
                        normally you’d associate the NODE_UNIQUE_NAME with it
                        ReturnString(I) = “This case is associated with the “ + AttributeSet.GetAttributeDisplayName(I, True) + “column.”
                    Next I
            End Select
        End Function

…………Another SSDM feature I neglected to delve into was Classified columns, which seem to be beneficial only in custom algorithms and then in certain narrow use cases. I also neglected some broad topics that might have certain nuances in SSDM, but are generally taken care of by other components of SQL Server, such as security. This is handled by the ordinary Analysis Services mechanisms, which have already been fully fleshed out elsewhere. I don’t mean to relegate such an important topic to an afterthought, but I’m not conscious of any earthshaking distinctions between the ways SQL Server handles dimension and model permissions. Setup programs are a topic I’ve dealt with on an ongoing basis as part of other tutorials; there really isn’t any separate mechanism specific just to data mining projects, aside from the tricky business of deploying custom viewers and algorithms. The latter was tackled in Post 14.3, which really represented the pinnacle of difficulty in this series. As expected, deployment and debugging were also the biggest hurdles with custom viewers, but I seriously overestimated how difficult it would be to write them. I knew from the beginning of this series that I would make a lot of mistakes because I was in over my head, but decided that the best way to learn was to dive in – especially when I could provide some badly needed free press to the most underrated feature of SQL Server. I could probably make a further contribution by writing on the topic of data mining performance, but that will have to wait until I get a much firmer grasp of SSDM internals. A year and a half ago I won a copy of Kalen Delaney’s incredibly useful book on SQL Server 2008 R2 internals, but there is no such guide for SSDM. There isn’t even an equivalent to the Analysis Services Performance Guide, which nevertheless has some hints about how SSDM data might be stored to disk in a manner similar to SSAS dimensions. I’m still new to this, so my experience is only as valuable as the pitfalls a rookie may encounter. I really ought to thank SQLServerCentral for their support in publishing more than a dozen of the articles in this series despite my inexperience, which was of immense help.
…………At present, the best way I can probably make a worthwhile contribution while still acquiring some of the data mining skills I need is to evaluate SSDM’s competitors, and explain how to use them in conjunction with SQL Server. Some of the tools I may survey include Weka, RapidMiner, Autobox and Predixion Software, the last of which was established by former members of Microsoft’s Data Mining Team. This will be my second attempt at this upcoming series, which never even got off the ground when I tried to launch it in 2012 with evaluations of IBM and Oracle’s data mining products. As explained in Thank God I Chose SQL Server part I: The Tribulations of a DB2 Trial and Thank God I Chose SQL Server part II: How to Improperly Install Oracle 11gR2, I gained new respect for SQL Server after encountering numerous usability problems that made even the setup processes for their competitors a real challenge. SQL Server simply has its competitors beat hands down when it comes to usability and user interface design, which aren’t trivial matters; just think of the many ways a defective user interface might burden a DBA whose trying to troubleshoot a production server that’s crashed, for example. And as I discovered a few days ago when reading up on Oracle the other day, I discovered that their data mining offerings are light years behind SSDM. There is so little content there that the 2007 edition of Oracle Data Mining: Mining Hold from Your Warehouse consists of little except giant screenshots and white space. At the time, Oracle only supported two kinds of clustering, adaptive Naïve Bayes networks and support vector machines (SVMs, which overlap Linear and Logistic Regression). Two years earlier, at the time the Lind-Oveson tutorial was written, SSDM was already almost in its present state, with additional algorithms like Association Rules, Sequence Clustering, Time Series and Neural Networks that Oracle simply couldn’t compete with. To illustrate just how great of a lead the SSDM user interface enjoyed, consider that Oracle didn’t have any graphical means of viewing Decision Trees yet.[iv] Thanks to years of neglect of data mining by the top brass at Microsoft, Oracle and other competitors may well have slowly inched their way towards competitiveness with SSDM. Yet it is impossible to determine just how good Oracle’s data mining capabilities are because the user interface and setup procedures have regressed since then, to the point that it’s practically unusable. Installing DB2 is likewise a chore, but their Windows GUI is currently the equivalent of the junkyard truck in Sanford and Son. That is why I believed it was entirely reasonable to refer to decade-old instructional materials on SSDM throughout this series, like the Lind-Oveson tutorial. In my next series, I’ll investigate whether or not SQL Server’s other competitors have succeeded in surpassing SSDM’s capabilities over the last decade of neglect, much like the tortoise and the hare. There’ll be a long learning curve, but hopefully along the way I’ll be able to provide some practical advice on how to mining SQL Server data with these tools, as well as some guidelines for when they would be appropriate to use rather than SSDM.

[i] see pp. 558-562, MacLennan, Jamie; Tang, ZhaoHui and Crivat, Bogdan, 2009, Data Mining with Microsoft SQL Server 2008. Wiley Publishing: Indianapolis.

[ii] IBID., pp. 558-559.

 [iii] When writing this, I originally thought that MiningViewerType was an enumeration and that I was therefore limited to using only the seven values predefined by Microsoft. Then I discovered that you can override predefined enums using the Shadows keyword, as described in the user “Boop Boop”’s response in the MSDN thread [RESOLVED] Extending an Enum. This would have been an ironic solution to keep us from wandering in the obscurity of the underlying C++ code, down in the murky depths a couple of assemblies below our custom algorithm code.

 [iv] p. 96, Hamm, Carolyn K. 2007, Oracle Data Mining: Mining Hold from Your Warehouse. Rampant TechPress: Kittrell, North Carolina.


A Rickety Stairway to SQL Server Data Mining, Part 14.8: PMML Hell

By Steve Bolton

…………In A Rickety Stairway to SQL Server Data Mining, Part 14.3: Debugging and Deployment, we passed the apex of this series of amateur self-tutorials on SQL Server Data Mining (SSDM) and have seen the difficulty level and real-world usefulness of the material decline on a steady slope since then. This week’s topic is a bit more challenging than the material covered in the last article, A Rickety Stairway to SQL Server Data Mining, Part 14.7: Additional Plugin Functionality, but only because it pertains to an SSDM feature that is poorly documented, to the point that it sometimes difficult to determine whether the lack of results are due to bugs or incorrect usage. In fact, the only way I have been able to get the product’s Predictive Model Markup Language (PMML) support features to work at all is by implementing the relevant methods in SSDM’s extensibility architecture for adding custom algorithms.
…………PMML was developed in the late 1990s by the Data Mining Group (DMG), a consortium of data mining product vendors, in order to facilitate interoperability between their disparate mining tools using XML. The consortium maintains a handy list of vendors that support particular versions of PMML at the website, which is also the central repository of other critical information on PMML, particularly schema specifications.  These include Weka, Tibco, Angoss, Augustus, BigML, Experian, IBM InfoSphere, Knime, KXen, MicroStrategy, Pervasive DataRush, Predixion Software, RapidMiner, R/Rattle, Salford Systems, Sand CDBMS, SAS Enterprise Miner, Teradata, Clementine and Zementis. The last of these even has a PMML Converter product that can translate older PMML documents to the latest standard, which I have yet to try out. Unfortunately, Microsoft has fallen further behind the current standard than any other vendor on the list, thanks to the pattern of neglect by the company’s top brass that I have spoken about before. It only supports version 2.1, which was released way back in March 2003, according to the helpful Wikipedia page on PMML. The latest release was version 4.1, back in December of 2011, a scant two years ago. Most portability features in the computing field don’t live up to their billing in their earliest incarnations, due various kinks in the translation processes, lack of strict adherence to interoperability standards and lack of widespread adoption among vendors. PMML is no exception to that rule of thumb and Microsoft’s support for the standard is frozen at a point in time when it was much more applicable. Because I am a rookie at this, I cannot yet gauge how common it is for mining models to be exchanged today between different tools using PMML; perhaps Microsoft determined it was one of those pie-in-the-sky standards that didn’t provide much benefit for end users and thus didn’t bother to update it. Either way, the bottom line is the same: regardless of what the rest of the industry is doing with PMML, use cases in which mining models can be successfully exchanged in this way with SSDM are going to be quite rare. I’m not going to spend much time on this topic because SSDM is using an obsolete version of PMML that is more than a decade old. Furthermore, it is limited to just two or three (Microsoft’s documentation is unclear) of the nine algorithms included with SSDM out-of-the-box, which we discussed much earlier in this series. Moreover, this stunted feature is also unwieldy and poorly documented, thereby making it even less useful.
…………In theory, it is possible to create a SQL Server mining model using a statement like CREATE MINING MODEL MyPMMLMiningModelName FROM PMML ‘My XML String’, by pasting the full string of a PMML 2.1 document created by a different mining tool. It should also be possible to create PMML documents from existing SQL Server mining models using a statement like SELECT FROM MyPMMLMiningModelName.PMML. The DMSCHEMA_MINING_MODEL_CONTENT_PMML and DMSCHEMA_MINING_MODEL_XML schema rowsets return basically the same information in a common set of columns, which will be blank for mining models that are not derived from PMML. The MODEL_PMML column is supposed to display the PMML text associated with a model, while the SIZE column contains the length of the PMML definition in bytes and LOCATION should contain the document’s storage location on disk. Unless a model is created from PMML, however, these schema rowsets are useless. Yet in many cases there is no practical benefit to this, because recreating the ported mining model in the Visual Studio GUI may be less of a hassle than dealing with endless XML parsing errors, even for the PMML version and algorithm types that SQL Server can theoretically handle. It good that Microsoft’s Data Mining Team included nascent features for working with PMML way back in 2005, but the neglect the top brass at Microsoft has displayed towards data mining tools since then has had a greater impact on this stillborn feature than any other component of SSDM. I’m not going to say that it’s useless in its current stunted form, because there are undoubtedly PMML documents out there that SSDM can parse correctly, thereby saving data miners the time of implementing them a second time. Models that will port correctly for the narrow set of algorithms and ancient PMML version SQL Server theoretically supports are not easy to come by though, which diminishes its usefulness further.
…………For example, five of the sample models available at the DMG’s PMML Sample Models Archive are Clustering or Decision Trees algorithms encoded in PMML 2.0 or 2.1, which means SQL Server should theoretically have been able to parse them.  Yet all five models produced a variety of error messages when the text was pasted into SQL Server’s CREATE MINING MODEL statement, including “Incorrect document syntax,” “Cannot parse mining model due to malformed PMML” and various “PMML parsing failed” errors. The most ominous of these was a warning that “The model inside the PMML 2.1 document uses an algorithm that is not supported,” after the document was apparently parsed correctly. Whether they were malformed because of some fault in DB2 Intelligent Miner, the IBM product that produced the documents, or could not be parsed due to some error on SQL Server’s part is difficult to determine. The bottom line is that getting this portability feature to actually work between different data mining tools may be a taller order than it seems, one that may involve a lot of unwieldy, time-consuming editing of documents by hand. For relatively small documents under the size of a megabyte it might be feasible to fix invalid PMML manually, but it would defeat the purpose of this portability feature if we had to do this kind of menial work to translate all 22 gigagytes of data stored in the mining structures we’ve used as samples in this series. And this is a pittance compared to the terabytes or even petabytes of information required when Big Analysis is performed on Big Data. The latter term is of course a buzzword that obscures the fact that Data has always been getting Bigger, throughout much of the history of the human race; it is a lot like the term “globalization,” which has been occurring steadily for millennia, as I learned while I was studying foreign policy history in grad school. One of the best strategies for coping with the Even Bigger Data of the 21st Century is to develop tools that automate the processing of such data, which will become a Big Hassle in and of itself, in tandem with the size of the datasets. PMML is designed to do this by performing the task of porting mining models, but at the point when the 2.1 specification was released, it apparently could not be done sufficiently “under the hood” to make user intervention unnecessary. Until Data Mining Expressions (DMX) statements like these can be made to work without a lot of costly attention on the part of end users, their practical benefits will diminish in tandem with the rise of Big Data, as the costs in terms of time and energy to use them increase with each megabyte of XML that has to be checked manually. Certainly, with sufficient expenditure of resources, PMML documents can be reworked to fit into the existing SSDM architecture, but those costs are already prohibitive and only increasing as time passes.
…………Similarly, if you run a DMX query like SELECT * FROM DMSCHEMA_MINING_MODEL_CONTENT_PMML, it is a good idea to add a WHERE clause, because the statement will return an error message about the first model it fails to process instead of the results. Unfortunately, most of the models I created earlier in this series could not be included in this statement due to various error messages. Books Online says that Naive Bayes, Decision Trees and Clustering are the only ones out of the nine out-of-the-box algorithms we surveyed earlier in this series which support “the use of Predictive Model Markup Language (PMML) to create mining models,” but it is unclear from this wording whether BOL means creating models from PMML or vice-versa. Apparently it is the former, because the ALLOW_PMML_INITIALIZATION column of DMSCHEMA_MINING_SERVICES schema rowset is only true for Decision Trees and Clustering, whereas queries of DMSCHEMA_MINING_MODEL_CONTENT_PMML against Naive Bayes models always fail. All three of the models we retained in A Rickety Stairway to SQL Server Data Mining, Algorithm 1: Not-So-Naïve Bayes returned error messages when included in queries to this schema rowset, but for different reasons. Limiting the WHERE clause to NBRecordTableViewModel returned a long message including the phrase “persisted file cannot be found” and the path of the SQL Server Analysis Services (SSAS) file containing the model, which usually signifies that the model has not been processed yet. Processing Naive Bayes models won’t lead to successful queries of DMSCHEMA_MINING_MODEL_CONTENT_PMML though because you’re likely to encounter a message that “the algorithm does not support the functionality requested by the ” model,” which is a tip-off that it does not support deriving PMML documents from model content queries. This was the case with NBDenormalized3Model, but NBDenormalizedAllModel returned one of those inscrutable error messages that crop up every so often in Analysis Services: “An unexpected error occurred (file ‘mddiscretizer.cpp’, line 1689, function ‘MDDiscretizer::GetBucketRange’.” Not surprisingly, searches for the term “GetBucketRange” with both “Sql server” and “Discretizer” returned no hits on Google. I suspect that GetBucketRange might someday belong in the invaluable list Jeannine Takaki posted last year in the TechNet thread A Guide to the (Undocumented) System Stored Procedures for Data Mining. Given that it is unlikely that Microsoft’s obsolete implementation of PMML can be used with Naive Bayes in this way, I didn’t make solving this bug a priority.
…………Since BOL and ALLOW_PMML_INITIALIZATION both state that Decision Trees is supported by SSDM, I did not expect to have different errors returned for all ten of the mining models we used in A Rickety Stairway to SQL Server Data Mining, Algorithm 3: Decision Trees. DTDenormalizedView3Model, DTDenormalizedView3ModelDiscrete and DTDenormalizedView3ModelDiscrete2 all returned an error to the effect that the “mining model has a column that uses the DATE data type. This data type is not supported by the PMML 2.1 specification. The DATE data type can typically be converted to the DOUBLE data type as a workaround, which is supported by the PMML 2.1 specification.” Because the other seven models all returned the more ominous error that the “mining model contains features that are not supported in the PMML 2.1 specification,” I skipped debugging the date error and altered one model so that it had just one attribute (marked as both an input and output), to see if I could return any Decision Trees PMML at all. Alas, I received the same error and reached the conclusion that Microsoft’s implementation of Decision Trees is not compliant with its implementation of PMML 2.1, but that it didn’t document precisely which of the algorithm features are the root cause. I also reached the conclusion that it was not worth solving the problem by trial and error, since we’re dealing with an implementation of PMML that is not just obsolete, but buggy and undocumented. I also received the date error when running the query against ClusteringDenormalizedView3Model. The other 16 mining models we used in A Rickety Stairway to SQL Server Data Mining, Algorithm 7: Clustering returned without errors, but the MODEL_PMML column was blank in each case. As Figure 1 shows, SSDM returned such basic information as the MODEL_CATALOG, MODEL_NAME and MODEL_GUID, but left the others blank. Note how the second query returns identical information when using a SELECT…PMML statement. The result is the same: the MODEL_PMML column that is supposed to return “An XML representation of the model’s content in PMML format,” as BOL states, is still blank.


…………To date, the only way I have been able to get the MODEL_PMML columns of these statements to return a PMML document is by implementing the logic myself during the development of custom algorithms. The only documents I have yet to encounter which the CREATE MINING MODEL statement would accept as valid were likewise produced this way, using several methods built into the plug-in architecture for that purpose. Implementing PMML support in a plug-in algorithm is mainly a matter of adding three optional Overridable methods to your code. The easiest of these is the SupportsPMML member of AlgorithmMetadataBase, which is another one of those methods that behaves like a property and merely returns a Boolean value. The heavy lifting is done in RenderPMMLContent and LoadPMMLContent methods of AlgorithmBase. First we’ll use RenderPMMLContent to produce a PMML document from the PluginModel we’ve been using in this section of the Rickety series, on building custom algorithms for SQL Server. When you perform a SELECT…PMML query against a plug-in algorithm, SQL Server will you’re your plug-in’s RenderPMMLContent method first, followed by HasFeatureSelection and GetAllFeatureSelectionInfo methods we discussed in our last article. I have yet to see msmdsrv.exe call GetAttributeFeatureSelectionInfo yet during debugging, but these three methods are apparently designed merely to populate PMML models with feature selection information. The PMMLWriter is automatically passed by the msmdsrv process and only needs to be populated with valid PMML attributes. The WriteStartElement and WriteAttributeString statements in Figure 2 assign the values of certain mandatory fields of the algorithm content portion of an XML document. WriteAttributeString is one of a family of PMMLWriter object members that perform similar functions, except on different data types, such as WriteAttributes, WriteAttributeString, WriteBinHex, WriteCData, WriteChars, WriteComment, WriteElementString, WriteEndAttribute, WriteFullEndElement, WriteName, WriteNmToken, WriteNode, WriteQualifiedName, WriteRaw, WriteStartAttribute, WriteString, WriteValue and WriteWhitespace. The plug-in documentation mentions that some of the methods will throw an exception if invoked in .Net plug-ins, but be wary, because the list is incomplete.

Figure 2: The RenderPMMLContent Method

Protected Overrides Sub RenderPMMLContent(PMMLWriter As PMMLWriter)

            PMMLWriter.WriteAttributeString(modelName, Me.Model.Name)
            PMMLWriter.WriteAttributeString(functionName, “classification”)
            PMMLWriter.WriteAttributeString(algorithmName, Me.Model.Metadata.GetServiceName())
            PMMLWriter.WriteAttributeString(“threshold”, 1)



        End Sub

…………The .chm help file of the plug-in software development kit (SDK) provides somewhat impenetrable explanations of the RenderModelCreationFlagsPMML, RenderMiningSchemaPMML and RenderModelStatisticsPMML methods – which is better than no explanation at all, i.e. the result you’ll get when using a search engine to look them up. The good news is that plug-in developers apparently don’t have to write their own implementations of these methods, because the code below works. The first method renders the MiningSchema section of a PMML document, RenderMiningSchemaPMML takes care of the global stats section (an equivalent of the MarginalStats object in SSDM) and RenderModelStatisticsPMML renders PMML extensions, which are the equivalent of algorithm parameters and flags. AlgorithmMetadataBase.GetMarginalRequirements must be set to AllStats (as discussed earlier in this segment on plug-in algorithms) in order for RenderMiningSchemaPMML to work correctly. Just like the OpenScope and CloseScope statements we used to encapsulate the logic of the LoadContent and SaveContent methods of the AlgorithmBase class (as explained) in A Rickety Stairway to SQL Server Data Mining, Part 14.2: Writing a Bare Bones Plugin Algorithm), our PMMLWriter object requires WriteStartElement and WriteEndElement statements. The most confusing part of the code below might be the mention of Naïve Bayes in the WriteStartElement. The PluginModel we’ve been working with in this segment on custom algorithms doesn’t match any of the broad classes of mining methods supported by the PMML 2.1 standard, so I simply substituted a value that is compliant to that standard. Other potential values include GeneralRegressionModel, ClusteringModel, SequenceModel, NeuralNetwork, AssociationModel and RegressionModel, TreeModel and possibly also a Taxonomy for hierarchized categories. Each of these require highly specific mandatory XML elements that make them more challenging to render; for the full reference, see the DMG webpages on how to structure a 2.1 document and the XSD for that version. For example, the sample code provided by Bogdan Crivat, one of the original SSDM programmers, in the plug-in SDK requires a modelClass tag that is specific to Clustering algorithms. For General Regression, you would have to fill in a ModelType with values like regression, generalLinear, logLinear, multinomialLogistic, etc. and so on. I won’t get into an in-depth discussion of these XML elements, because this is a tutorial in how to use Microsoft’s implementation of PMML, not in PMML syntax itself. Now if we run a DMX query like SELECT * FROM PluginModel.PMML the MODEL_PMML Column ought to be populated with a valid XML document like the one depicted in Figure 3, except crammed into one line. We would also see the SIZE column populated with the size of the document in bytes.

Figure 3: Sample PMML Document Return in the MODEL_PMML Column

…………If we feed this PMML document to SSDM in a single string, at the end of a statement like CREATE MINING MODEL MyPMMLMiningModelName2 FROM PMML, a mining model of the same name ought to be created on the server, within a mining structure that also uses that name, except terminated with a the suffix _Structure. Provided, that is, that we implement the LoadPMMLContent method of AlgorithmBase correctly. The XML string passed to one of these statements will be converted to an XMLElement object that SQL Server automatically passes to the LoadPMMLContent method. The code within it must parse the document and build a new mining model out of its PMML elements. The first order of business is to make sure the model name matches one of the values that the PMMLWriter.WriteStartElement will accept; I used a dummy value of Naïve Bayes even though my plug-in is in no way related to that algorithm, as discussed a few paragraphs ago. Although I do not make practical use of this in Figure 4, Crivat also notes in his SDK sample code that is wise to “Create a namespace manager with the namespace of the content, and prefix it with ‘pmml.’ This way, all the XPath queries will be qualified with pmml.” The If…End If block immediately after the NamespaceManager code is designed to map the PMML document’s Extensions element to our model’s parameters, which were defined in the last tutorial.  Crivat advises to “Look for the Extension element with the MiningAlgorithmFlags name and ‘Microsoft Analysis Services’ as extender,” as I do when declaring the ParameterXPath variable. The rest of the code in this block merely loops through the model parameters and assigns the values to our model. After this If…End If block, we would normally put any code (or call methods in our custom classes) for loading elements specific to the type of model we’re working. For example, in Bogdan’s tutorial, this is the point where you’d retrieve the number of clusters, their distributions and other information specific to the variant Clustering algorithm he demonstrates. For Naive Bayes we’d include information on the inputs, outputs and value pairs; a SequenceModel has many required tags, such as those identifying the number of sequences, items and rules; an AssociationModel would of course require information on the itemsets, rules and so forth; neural nets have a lot of specific functionality such as the activation function, thresholds and the number of layers. Likewise, the definitions of models labeled Trees might include split characteristics, whereas General Regression requires specifications for its predictors and the Regression type takes familiar values as intercepts and coefficients. These would all have to be retrieved from the PMML string here and plugged into the variables of our plug-in that define the structure of the model. I’ll spare readers the quite lengthy code that would be required to illustrate every type of conversion, especially since that would require the development of multiple sample plug-ins, one for each type of PMML model. The important thing is to grasp the process behind it, which entails parsing the XMLElement supplied to the method, using standard .Net XML functionality, and translating it to your model’s variables.

Figure 4: Sample Implementation of the LoadPMMLContent Method

Protected Overrides Sub LoadPMMLContent(TempAlgorithmContent As System.Xml.XmlElement)

            If String.Compare(TempAlgorithmContent.Name, “NaiveBayesModel”, True) <> 0 Then

                Throw New System.Exception(“Invalid PMML node: “ & TempAlgorithmContent.Name)

            End If

Dim NamespaceManager As New System.Xml.XmlNamespaceManager(TempAlgorithmContent.OwnerDocument.NameTable)

            NamespaceManager.AddNamespace(pmml, TempAlgorithmContent.NamespaceURI)

Dim ParameterXPath As String = pmml:Extension[‘MiningAlgorithmFlags‘=@name and ‘Microsoft Analysis

Dim ParameterNode As System.Xml.XmlNode = TempAlgorithmContent.SelectSingleNode(ParameterXPath, NamespaceManager)

            If ParameterNode IsNot Nothing Then

                Dim ParameterXMLNodeList As System.Xml.XmlNodeList = ParameterNode.SelectNodes(pmml:key-val,

 loop through the PMML model parameters and assign them to our model’s parameters

                For I As UInteger = 0 To ParameterXMLNodeList.Count – 1

                    If ParameterXMLNodeList.Item(I).Attributes(“name”) IsNot Nothing Then

                        Select Case ParameterXMLNodeList.Item(I).Attributes(“name”).Value

                            Case MyFeatureSelectionParameter

                                MyMiningParameterCollection(ParameterXMLNodeList.Item(I).Attributes(“name”).Value).Value = System.Convert.ToInt32(ParameterXMLNodeList.Item(I).Attributes(“value”).Value)

                            Case MyTrainingParameter

                                MyMiningParameterCollection(ParameterXMLNodeList.Item(I).Attributes(“name”).Value).Value = ParameterXMLNodeList.Item(I).Attributes(“value”).Value

                        End Select

                    End If

                Next I

            End If
       End Sub

…………If the plug-in code works correctly, msmdsrv will respond to a CREATE MINING MODEL FROM PMML statement by calling this method, then SaveContent to persist the structure of the newly created model. After this, a new mining model and parent structure will be added to your SSAS database. It might conceivably be possible to add functionality to LoadPMMLContent and RenderPMMLContent to handle documents that are compliant with higher versions of PMML than 2.1, but I have yet to verify this; it is also entirely possible that the SSAS query parser will reject syntax that is not 2.1-compliant, regardless of whether a plug-in algorithm’s internal code could successfully process the string or not. This capability would greatly increase the utility of SSDM’s PMML functionality, which would otherwise be the tool’s least useful feature, in its current stunted form. Regardless of whether or not the plug-in architecture can be used to fix the shortcomings of Microsoft’s PMML implementation, the ability to write custom algorithms is one of the most useful features of SSDM, which is in turn the most underrated but productive component of SQL Server. The same can be said of the ability to extend SSDM with custom data visualization capabilities, as we shall discuss in the final segment of the Rickety series.

A Rickety Stairway to SQL Server Data Mining, Part 14.7: Additional Plugin Functionality

By Steve Bolton

…………In order to divide this segment of my amateur tutorial series on SQL Server Data Mining (SSDM) into digestible chunks, I deferred discussion of some functionality that can be implemented in custom algorithms. In the last installment I explained how to write custom data mining functions, which is a quite advanced but powerful feature of the plug-in architecture – even by the standards of SSDM, which is in turn perhaps the most underrated and least publicized component of SQL Server. This week we’ll be covering a few odds and ends that aren’t absolutely necessary to implement in a bare bones plug-in, yet are more commonly used than custom mining functions and not nearly as challenging. It is really quite easy to code custom mining model flags, for instance, whereas parameters for custom algorithms are only slightly more difficult. In order to kill two birds with the same stone, so to speak, we’ll also dispose of the topic of feature selection by implementing custom mining flags and parameters that control it.
…………Mining flags merely require a class-scoped declaration of the MiningModelingFlag object type, as well as the implementation of two optional methods in AlgorithmMetadataBase. The GetSupModelingFlags method merely returns an array of MiningModelingFlag objects, including such enumeration values as MiningModelingFlag.Binary, Invalid, NotNull and Regressor. Returning a value of MiningModelingFlag.Null apparently causes a value of  “error” to be listed in the SUPPORTED_MINING_FLAGS column of the DMSCHEMA_MINING_SERVICES schema rowset, which is roughly the equivalent of a relational dynamic management view (DMV). This can be seen in the screenshot I posted of DMSCHEMA_MINING_SERVICES in A Rickety Stairway to SQL Server Data Mining, Part 14.2: Writing a Bare Bones Plugin Algorithm, which should be corrected by the code depicted in Figure 1. For whatever reason, returning values of MiningModelingFlag.CustomBase rather than references to the flags themselves invariably raised an exception, so be vigilant for this fine distinction when implementing your own flags.

Figure 1: Sample AlgorithmMetadataBase Class Code for Custom Mining Flags
        Friend Shared MyCustomMiningFlag1 As MiningModelingFlag = MiningModelingFlag.CustomBase + 1
        Friend Shared MyCustomMiningFlag2 As MiningModelingFlag = MiningModelingFlag.CustomBase + 2 

        Public Overrides Function GetSupModelingFlags() As MiningModelingFlag()

            Dim TempMiningFlags As MiningModelingFlag() = New MiningModelingFlag() {MiningModelingFlag.Regressor, MyCustomMiningFlag1,MyCustomMiningFlag2} 

            Return TempMiningFlags

        End Function

        Public Overrides Function GetModelingFlagName(TempFlag As MiningModelingFlag) As String

            Select TempFlag 

                Case MyCustomMiningFlag1

                    Return “MyCustomMiningFlag1” 

                Case MyCustomMiningFlag2

                    Return “MyCustomMiningFlag2”

                Case Else

                    Throw New System.Exception(“Unknown modeling flag : “ & TempFlag.ToString())

            End Select

        End Function

…………I also corrected the declarations of the flags, which should have been Friend Shared in the original version. GetModelingFlagName merely returns string names associated with the flags, which is also trivial to implement. As noted in previous tutorials in this segment of the Rickety series, most of the code I’m providing initially came from the C# samples provided by Bogdan Crivat, one of the original SSDM programmers, in the plug-in software development kit (SDK). I converted it to Visual Basic .Net in part because it’s my favorite language and secondarily because the Internet is practically devoid of SSDM plug-in code written in VB; after that, I made so many changes that the original material is practically unrecognizable. One of the common changes I made was to use Select Cases rather than If statements in order to accommodate wider ranges of values in future projects, as I did in Get ModelingFlagName. After replacing these declarations and methods in sample code I posted for the AlgorithmMetadataBase class a while back, the SUPPORTED_MINING_FLAGS column of DMSCHEMA_MINING_SERVICES should now depict a value of “REGRESSOR, MyCustomMiningFlag1, MyCustomMiningFlag2” rather than “Error.” As you can see in Figure 2, SSDM capitalized the MiningModelingFlag.Regressor value I returned in GetSupModelingFlags – as it does with the other out-of-the-box flag types – but I didn’t stick to this convention when naming my custom flags.

Figure 2: Custom Mining Flags in the SUPPORTED_MINING_FLAGS Column

…………Using parameters in a plug-in algorithm is a little trickier and requires the implementation of three methods, plus invocation code somewhere around the AlgorithmMetadataBase constructor. I’m not sure yet if it is necessary to add a constructor to that class, but it makes an ideal place to instantiate your parameters and add them to a class-scoped MiningParameterCollection object, as depicted in Figure 3. Crivat encapsulates declaration code of this kind in the constructor by calling a DeclareParameters method he added to the base class, which is probably a good idea in many circumstances; I opted not to add methods of my own devising to this class for now, just to avoid causing any confusion with other methods that are marked Must Override or optional in the base class. As Figure 3 also shows, MiningParameter objects have several properties that can be used to perform basic tasks like assigning defaults and descriptions, as well as identifying the parameters as hidden or required. The expected return type and name are specified in the same line of code as the parameter declaration.

Figure 3: Parameter Declaration Code for the AlgorithmMetadataBase Class

Protected MyMiningParameterCollection As MiningParameterCollection 

‘#2 declare the members of the collection – possibly in AlgorithmMetadataBase.New or a method called by it

        Public Sub New()

            Dim MiningParameter1 As New MiningParameter(MyFeatureSelectionParameter, GetType(System.Int32))
            Dim MiningParameter2 As New MiningParameter(MyTrainingParameter, GetType(System.String))

            MiningParameter1.DefaultValue = “0”
            MiningParameter1.Description = “A Description for My Only Feature Selection Parameter”
            MiningParameter1.IsExposed = True
            MiningParameter1.IsRequired = False
            MiningParameter1.DefaultValue = “My Default Value”
            MiningParameter1.Description = “A Description for My Only Training Parameter”
            MiningParameter1.IsExposed = True
            MiningParameter1.IsRequired = False

        End Sub

…………We must also implement three other methods which are essentially inconsequential. The GetParametersCollection returns either the MiningParameterCollection we declared at the beginning of the class, or Nothing if there are no parameters and therefore no such collection. There’s nothing particularly difficult about the ParseParameterValue method either, although it does require a few more lines of code. My implementation is pretty close to Crivat’s original sample code, in part because there are only so many ways the simple task of parsing the parameter values supplied by the user can be done. The parameter types are different and the exception handling syntax varies a little, with the major change being the Select Case. The logic is easy to grasp: when the user supplies the parameter at the ordinal index of the zero-based collection, we merely return the data type we originally declared for that parameter. In his original code, Crivat notes that the parameters must be cast to the exact type, which I think means that they won’t be implicitly cast. All of the code we’ve added so far in this tutorial belongs in AlgorithmMetadataBase, but the GetTrainingParametersActualValue method is part of the AlgorithmBase class. It is no more difficult to implement, however, because all we have to do is return any default value if the user declined to specify one. As Crivat notes, special logic can be added here to substitute “The best value automatically (heuristically) detected by the algorithm for the current training set.” In a related vein, the .chm help file included with the SDK says that “The actual value may be different from what was requested by the user in situations like the following: the value specified by the user indicates some ‘auto-detect’ mechanism, that computes an optimized value based on the input data.” It also says that “implementing this function usually requires persisting the values of the parameters in the SaveContent implementation of your algorithm,” but we already demonstrated how to persist variable values of any kind in Post 14.2.

Figure 4: Three Methods Relevant to Parameters in AlgorithmMetadataBase and AlgorithmBase

Public Overrides Function GetParametersCollection() As Microsoft.SqlServer.DataMining.PluginAlgorithms.MiningParameterCollection

            Return MyMiningParameterCollection 

        End Function

         Public Overrides Function ParseParameterValue(ParameterIndex As Integer, ParameterValue As String) As Object

            Dim ReturnValue As Object

            Select Case ParameterIndex  

                Case 0

                    ReturnValue = System.Convert.ToInt32(ParameterValue)

                Case 1

                    ReturnValue = ParameterValue

                Case Else

                    Throw New System.Exception(“An error occurred when casting the parameter values.”)

            End Select

            Return ReturnValue

         End Function


        Protected Overrides Function GetTrainingParameterActualValue(ParameterOrdinal As Integer) As System.Object

            If MyMiningParameterCollection.Item(ParameterOrdinal).Value Is Nothing Then

                Return MyMiningParameterCollection.Item(ParameterOrdinal).DefaultValue


                Return MyMiningParameterCollection.Item(ParameterOrdinal).Value

            End If

        End Function

…………The ValidateAttributeSet method of AlgorithmMetadataBase is also an ideal place to validate mining flags for incorrect or missing mandatory values. SSDM automatically passes it a reference to the AttributeSet, which can be parsed to make sure that the user passed valid values; for example, you’d obviously want to iterate over the AttributeSet and verify that at least one attribute is marked as an input column. The code for this method is even more trivial, in the sense that it doesn’t even have a return type; you merely iterate over your AttributeSet and custom flags and throw an exception if a bad value is encountered, which halts training. On the other hand, the .chm states “that the attribute set here is not fully constructed, so only a limited set of methods can be invoked: GetAttributeCount, GetAttributeFlags, IsFunctionallyDependant and IsCaseAttribute.” I have yet to verify this, but this may preclude you from using it to parse parameter values as well. Normally, however, they would be checked in ParseParameterValue or GetTrainingParameterActualValue. Regardless of how these methods are implemented, the code in the constructor should be sufficient to expose the parameters to the end user, which can be verified by running a query like the one in Figure 5 on the DMSCHEMA_MINING_SERVICE_PARAMETERS schema rowset. Note that the values depicted for the PARAMETER_NAME, PARAMETER_TYPE, IS_REQUIRED and DESCRIPTION columns correspond to the property values we assigned in the constructor.


…………The only difficulty I encountered when implementing parameters was that SSDM never called my GetTrainingParameterActualValue code as expected, which forced me to create a declare a public MiningParameterCollection in AlgorithmBase as well. I then added code to the CreateAlgorithm method of AlgorithmMetadataBase to make the two collections equivalent; there may be some kind of performance penalty for this workaround since AlgorithmMetadataBase is only called when the algorithm is instantiated, but I have yet to see it. I then iterate over AlgorithmBase’s version of the collection in the InsertCases method and call GetTrainingParameterActualValue once for each constituent parameter. As Crivat suggests in his code examples, the beginning of InsertCases is an ideal place to handle training parameters; in my updated sample code for AlgorithmBase, I left a placeholder If…End If block to encapsulate conditional processing based on these parameter values. Since we’re using a class-scoped parameters collection, we can reference them in many other routines though. We could, for example, provide end users more control over model processing by implementing new training phases, or performance parameters that change the way SSDM performs calculations in the ProcessCase method, depending on the available system resources. I had originally planned to use parameters to illustrate how to use ProgressNotification objects, which update users on the status of model processing. As we have seen throughout this series, processing can be quite resource-intensive in terms of CPU, RAM and I/O, as well as time-consuming, thereby necessitating regular communication between algorithms and their users so that the latter can cancel processing or move on to some other task while they wait. The ProgressNotification object is designed to perform this task, by providing messages to the users that are displayed in the Visual Studio GUI, SQL Server Management Studio (SSDM) or in Profiler. As relational DBAs know, polling events in Profiler too frequently can also impact performance, so it is necessary to strike a fine balance between collecting too much and too little performance information. It is technically possible to expose parameters that allow users to control how often the algorithm emits its own developer-defined messages, such as at intervals based on the case count in ProcessCase. Yet to date, I have not been able to use either the Start, Progress or EmitSingleTraceNotification methods of SSDM’s ProgressNotification object without triggering a System.AccessViolationException. I can instantiate ProgressNotification variables and manipulate their other properties and members without incident, but the three methods that actually fire notifications invariably trigger exceptions that can’t even be caught in standard Try…Catch blocks. The concept is really simple; you merely set a class-scoped ProgressNotification object equal to the output of a Model.CreateTaskNotification method call, then use properties like Format, Total and Current to specify a message and track the intervals at which it fires. Strangely, the exact same code in one plug-in project works but triggers the fatal exception in another, even when the code from the first is directly cut and pasted into the second. I’m considering filing a bug report at Technet, but can’t dismiss the possibility that there’s an overlooked error in my project settings or something. Either way, I’ll hold off on posting code that might be buggier than usual until I get to the bottom of it.
…………The code I wrote for the three Overridable methods relevant to feature selection is also buggy, but the good news is that it is irrelevant. HasFeatureSelection is the easiest one, because it is another one of those plug-in methods that behaves more like a property, which in this case is merely a Boolean choice. If you are going to perform feature selection, you merely return a value of True; if not, you can return False or not implement it at all. If you wish, you can also add trivial conditional logic to turn it on and off, depending on the values of any feature selection parameters you’ve implemented, or any other variable for that matter. Instead of using my example parameter in this way, I check the value in GetAttributeFeatureSelectionInfo and then return an AttributeFlags object that identifies it as an Input parameter that participates in feature selection, if the value the user supplied matches the ordinal index passed by SSDM as the TempAttributes unsigned integer. The increment in the conditional logic merely addresses the fact that MiningParameterCollection objects are one-based, while the model’s AttributeSet object is zero-based. As the .chm points out, there are really only four values that this method can return: zero for attributes that feature selection isn’t applied to; AttributeFlags.Input; AttributeFlags.Output; and AttributeFlags.Input|AttributeFlags.Output for those columns which are feature selected both ways. Another optional method, GetAllFeatureSelectionInfo, returns both the set of attributes which were feature selected and the attribute flags associated with them. In this case, the latter are limited to values of AttributeFlags.Input AttributeFlags.Output and a combination of both. This method is on the tricky side, since the return values are passed through output parameters. They are a native part of the syntax of the C family of languages; in Visual Basic, we have to work around this by adding the <Out()> tags and passing them ByRef. The sample implementation in Figure 6 is a little more convoluted than GetAttributeFeatureSelectionInfo because we need to create an array of attributes with smaller dimensions than the AttributeSet of all attributes, so their indexes won’t match; hence the need for the ArrayCounter variable. This method performs pretty much the same function as GetAttributeFeatureSelectionInfo, except that it populates an array of every attribute that is feature selected along with the correct flags, instead of just returning the flags one attribute at a time.

Figure 6: Feature Selection Code

        Protected Overrides Function HasFeatureSelection() As Boolean 

            Return True

        End Function 

        Protected Overrides Function GetAttributeFeatureSelectionInfo(TempAttribute As UInteger) As AttributeFlags

            If MyMiningParameterCollection.Item(0).Value = TempAttribute Then

                Return AttributeFlags.Input

            End If

        End Function

        Protected Overrides Sub GetAllFeatureSelectionInfo(<Out()> ByRef TempAttributes() As UInteger, <Out()> ByRef TempAttributeFlags()
 As AttributeFlags)

            Dim ArrayCounter As UInteger

            For I = 0 To AttributeSet.GetAttributeCount – 1

                If MyMiningParameterCollection.Item(0).Value = I Then                                 
               TempAttributes(ArrayCounter) = I
               TempAttributeFlags(ArrayCounter) = AttributeFlags.Input
                    ArrayCounter = ArrayCounter + 1

                End If

            Next I

        End Sub

…………The problem with the code above is that these methods appear to be designed to be triggered internally by msmdsrv.exe, but I have yet to see Analysis Services hit the breakpoints I set in these routines, with one exception: when rendering Predictive Model Markup Language (PMML) representations of mining models, which we will discuss in the next installment of this series. I could work around this by calling them directly in another routine, like I did with the methods relevant to parameters in InsertCases, but there’s not much of a point in doing that. They don’t seem to do much of anything relevant to processing that can’t be implemented by adding our own methods to AlgorithmBase and calling them in other routines, so we’re not losing much of value by skipping over these methods. The plug-in architecture performs any special processing based on the values returned by them; it’s not like we can feed them values and have SSDM calculate values based on feature selection methods included in most of the nine out-of-the-box algorithms discussed much earlier in this series, like Interestingness Score, Shannon’s Entropy, Bayesian with K2 Prior, Bayesian Dirichlet Equivalent with Uniform Prior. Nor is there apparently any relevant information out there on these relatively irrelevant functions; there are no implementations of them in the sample SDK code (at least based on a Windows search of the files) and nothing in the tutorial I found online titled Microsoft Business Intelligence with Numerical Libraries: A White Paper by Visual Numerics, Inc. In fact, there are no hits at all on Google for any of them. At best, I did manage to find references to them at the old MSDN tutorial A Tutorial for Constructing a Plug-in Algorithm written by Max Chickering and Raman Iyer back in August of 2004, but that was designed for users working with COM in C++. My sample code may be buggy, but it is the only code I’ve seen published anywhere on the planet to date for any of these methods, in Visual Basic, C# or any other language. As I’ve said a few times throughout this series and my earlier series An Informal Compendium of SSAS Errors (which remains one of the few centralized sources of information for certain arcane Analysis Services exception messages), troubleshooting SSAS is a bit like blasting off into space, since you’re basically “boldly going where no man has gone before” thanks to the lack of documentation. This is doubly true of SSDM, which has even less documentation thanks to its status as SQL Server’s most underrated and under-utilized component, despite the fact that it can be put to such productive uses. Plug-in algorithms are in turn the cutting edge of SSDM, so it should be no surprise that we’re entering a vacuum – and just as in deep space, we can do without a lot of the material we’re encountering now. We’ve already traversed the most challenging topics in this segment of the Rickety series, after passing the apex of difficulty in A Rickety Stairway to SQL Server Data Mining, Part 14.3: Debugging and Deployment. Now the series is winding down to the point where we’re addressing features that aren’t of any great necessity in plug-in development. In the next installment, we’ll tackle SQL Server’s implementation of PMML, a powerful XML means of making mining models interchangeable with other products. Unfortunately, SQL Server is still stuck on the PMML standard promulgated back in March of 2003. Despite the lack of attention from the top brass at Microsoft in that time period, SSDM remains one of the best data mining tools available, but PMML support is the one area in which the usefulness of SSDM has really been noticeably reduced by this neglect. In contrast, the product’s support for custom data mining viewers is one of its greatest unsung features, as we shall see in a few weeks as we close out the Rickety series.

A Rickety Stairway to SQL Server Data Mining, Part 14.6: Custom Mining Functions

by Steve Bolton

…………In the last installment of this amateur series of mistutorials on SQL Server Data Mining (SSDM), I explained how to implement the Predict method, which controls how the various out-of-the-box prediction functions included with SSDM are populated in custom algorithms. This limits users to returning the same standard columns returned by calls to prediction functions against the nine algorithms that ship with SSDM. The PredictHistogram function, for example, will return a single prediction column, plus columns labeled $Support, $Probability, $AdjustedProbability, $Variance and $StDev, just as it does with the out-of-the-box algorithms. Although we can control the output of the standard prediction functions to return any values we want, it may be more suitable in some cases to specify our own output format with more flexibility. This is where the ability to define your own custom mining functions comes in handy. Another important use case is when you need to perform custom calculations that wouldn’t normally be implemented in standard prediction functions like PredictStDev or PredictSupport. Plug-in programmers could conceivably perform any complicated calculation they like in the Predict method and return it in the results for PredictVariance, but this would be awkward and illogical if the result is not a measure of variance. Thankfully, the SSDM plug-in architecture we’ve been discussing in the past articles includes a mechanism for defining your own custom mining functions, which addresses all of these needs.
…………Aside from the calculation of cumulative statistics, developing a custom mining function is actually quite simple. The development process begins by declaring a public function in your AlgorithmBase class ordained with a MiningAlgorithmClassAttribute, which looks like the attribute declaration in my sample code: <MiningFunction(“My Custom Mining Function Description”)>. The text within the quotation marks will appear in the DESCRIPTION column of the DMSCHEMA_MINING_FUNCTIONS schema rowset, for the row corresponding to the new function. The parameter list can contain variables of your choosing, like the RecordCount integer I use in my sample code for limiting the rows returned by MyCustomMiningFunction, but special rules apply in a few cases. Parameters decorated with the MiningTableColumnRefAttribute or MiningScalarColumnRefAttribute tags must be AttributeGroup objects, such as the <MiningScalarColumnRefAttribute()> TempAttributeGroup AsAttributeGroup declaration in my sample code. I believe MiningTableColumnRefAttribute tags are used for feeding nested tables to custom mining functions, but the PluginModel mining model we’ve been working with in this segment on custom mining algorithms doesn’t have one, nor do I see any benefit in adding one because of the risk of cluttering the tutorial needlessly. MiningScalarColumnRefAttribute, however, is a more common tag that allows you to pass a reference to a specific model column; in the prediction I’ll execute to test the function, I’ll pass a value of  “InputStat” to identify the first column in the PluginModel. The Flag attribute can also be used to identify optional flags passed as parameters to the function, as I do in the parameter declaration <Flag()> INCLUDE_COUNT AsBoolean. Apparently, these flags must be Booleans, because I received the following warning in the FUNCTION_SIGNATURE column of DMSCHEMA_MINING_FUNCTIONS whenever I use other data types with a flag I called ATTRIBUTE_NUMBER that I later removed: “Invalid Flag type (must be Boolean) in method MyCustomMiningFunction:ATTRIBUTE_NUMBER.” The algorithm will still deploy correctly if these parameters are set wrong but will return error messages when prediction queries are executed against the custom functions. The parameter list may also include a MiningCase object, which is not exposed to end users and causes SSDM to automatically iterate over the case set supplied in a prediction join, OpenQuery, or other such statement. As we shall see, one of the primary challenges with custom mining functions is to calculate cumulative statistics by iterating over this MiningCase object.
…………You can basically return values you want from your function, but if they’re scalar, they must be declared as System.Object. To return more than one value your only choice is to declare the return type as System.Data.DataTable. If your revised AlgorithmBase code compiles and deploys correctly, then the next time the service starts you should see your choice reflected in the RETURNS_TABLE column of DMSCHEMA_MINING_FUNCTIONS. Note how the query in Figure 1 reflects the name of the method we added to AlgorithmBase in the FUNCTION_Name column, the string from the MiningAlgorithmClassAttribute in the DESCRIPTION and the lists of parameters in FUNCTION_SIGNATURE. Some other standard functions enabled for our sample plug-in are also listed in the results (minus the prediction functions I enabled for the last tutorial, which were removed to shorten the list).

Figure 1: The Custom Mining Function Depicted in DMSCHEMA_MINING_FUNCTIONS

…………Note that the MyCustomMiningFunction is listed twice; this is because it is declared in AlgorithmBase and in SupportedFunction.CustomFunction Base, as part of the array returned by AlgorithmMetadataBase. Removing the function from AlgorithmBase without removing it from SupportedFunction.CustomFunction Base from the array returned by AlgorithmMetadataBase.GetSupportedStandardFunctions led to this error on msmdsrv, when the algorithm failed to initialize: “The data mining algorithm provider (ProgID: 06da68d6-4ef0-4cea-b4dd-1a7c62801ed2) for the Internal_Name_For_My_Algorithm algorithm cannot be loaded. COM error: COM error: DMPluginWrapper; Parameter name: in_FunctionOrdinal Actual value was 10000.” Another “gotcha” you may encounter is the warning that “PREDICTION JOIN is not supported” for a particular mining model when you try to use the function in a query. After many comparisons to the original sample code provided in the plug-in software development kit (SDK) by Bogdan Crivat, one of the original developers of SSDM, I was able to narrow this down to the cases in which the GetCaseIdModeled method of AlgorithmMetadataBase returns True. Simply changing the return value for that method ought to fix the problem. Another common mistake you may encounter while testing custom mining functions is leaving off the ON clause in the PREDICTION JOIN, which may result in this warnng: “Parser: The end of the input was reached.” As renowned SQL Server Analysis Services (SSAS) guru Chris Webb says in his helpful post Error messages in MDX SELECT statements and what they mean, this can also be due to missing parentheses, brackets and semi-colon marks. All of these myriad issues were fairly inconsequential in comparison to those that arose when I debugged the AlgorithmNavigationBase class in A Rickety Stairway to SQL Server Data Mining, Part 14.4: Node Navigation, which is far easier than troubleshooting AlgorithmBase and less difficult by several orders of magnitude than dealing with exceptions in AlgorithmMetadataBase. The biggest challenge I encountered while writing the sample code depicted in Figure 2 was calculating cumulative values across iterations over the MiningCase object, which were problematic to persist.

Figure 2: Sample Code for MyCustomMiningFunction
<MiningFunction(“My Custom Mining Function Description”)>
Public Function MyCustomMiningFunction(InputCase As MiningCase, RecordCount As Integer, <MiningScalarColumnRefAttribute()> TempAttributeGroup As AttributeGroup, <Flag()> INCLUDE_COUNT As Boolean) As System.Data.DataTable

            Static CaseCount As UInteger = 0 ‘the 0th case will be the Prepare statement
            for whatever reason, the first InputCase is always Nothing
            we can use that opportunity to set up the data table one time only
            If IsNothing(InputCase) = True Then
                MyCustomMiningFunctionDataTable = New System.Data.DataTable ‘on the first iteration, reset the datatable
                MyCustomMiningFunctionDataTable.Columns.Add(“Skewness”, GetType(Double))
                MyCustomMiningFunctionDataTable.Columns.Add(“Kurtosis”, GetType(Double))
                MyCustomMiningFunctionDataTable.Columns.Add(ExcessKurtosis, GetType(Double))
                MyCustomMiningFunctionDataTable.Columns.Add(JBStatistic, GetType(Double))

                If INCLUDE_COUNT = True Then
                    MyCustomMiningFunctionDataTable.Columns.Add(TotalRows, GetType(UInteger))
                End If

                once the data type is set, we can return it one time, in the event of a Prepare statement
                ‘I will have to double-check later and make sure this statement doesn’t need to execute for each row during Prepared statements
                ‘Bogdan: “•If the execution is intended for a Prepare statement, it returns an empty string. A Prepare execution must return data that has the same type and schema (in case of tables) with the real result.”
                so if it is only for a prepare statement, we’d send it back the empty data table?
                chm: on Context.ExecuteForPrepare: “Determines whether the current query is preparing the execution or it is fully executing. Should only be used in the implementation of custom functions”
                If Context.ExecuteForPrepare = True Then
                    Return MyCustomMiningFunctionDataTable
                End If
                dealing with actual cases; add them to the cumulatives so we
can perform calculations at the end
                for now we will deal with a single attribute for the sake of simplicity; it would be fairly easy though to extend this to deal with more than one attribute at a time

                    If CaseCount < RecordCount – 1 Then RecordCount is 1-based, CaseCount is not

                        both variables are 0-based
                        If CaseCount = 0 Then
                            ReDim TempSkewnessKurtosisClassArray(0) ‘initialize the arrays one time; this won’t work during the Preparation phase when InputCase = 0, because other variables will be set to 1
                            ReDim TempDataPointArray(0, RecordCount – 1)
                            TempSkewnessKurtosisClassArray(0) = New SkewnessKurtosisClass
                        End If

                        TempDataPointArray(0, CaseCount) = New DataPointClass
                        TempDataPointArray(0, CaseCount).DataPoint = InputCase.DoubleValue
                        TempSkewnessKurtosisClassArray(0).TotalValue = TempSkewnessKurtosisClassArray(0).TotalValue + InputCase.DoubleValue
                        CaseCount = CaseCount + 1


                        add the last set
                        If IsNothing(TempSkewnessKurtosisClassArray(0)) Then TempSkewnessKurtosisClassArray(0) = New SkewnessKurtosisClass
                        TempDataPointArray(0, CaseCount) = New DataPointClass
                        TempDataPointArray(0, CaseCount).DataPoint = InputCase.DoubleValue
                        TempSkewnessKurtosisClassArray(0).TotalValue = TempSkewnessKurtosisClassArray(0).TotalValue + InputCase.DoubleValue
                        CaseCount = CaseCount + 1

                        if we’re on the final case, calculate all of the stats
                        Dim TempRow As System.Data.DataRow
                        TempRow = MyCustomMiningFunctionDataTable.NewRow()
= SkewnessKurtosisClassArray(0).CalculateAverage(RecordCount, TempSkewnessKurtosisClassArray(0).TotalValue)
                        TempSkewnessKurtosisClassArray(0).StandardDeviation = SkewnessKurtosisClassArray(0).CalculateAverage(RecordCount, TempSkewnessKurtosisClassArray(0).Mean)
                        TempRow.Item(“Skewness”) = TempSkewnessKurtosisClassArray(0).CalculateSkewness(TempDataPointArray, 0, TempSkewnessKurtosisClassArray(0).Mean, TempSkewnessKurtosisClassArray(0).StandardDeviation)
                        TempRow.Item(“Kurtosis”) = TempSkewnessKurtosisClassArray(0).CalculateKurtosis(TempDataPointArray, 0, TempSkewnessKurtosisClassArray(0).Mean, TempSkewnessKurtosisClassArray(0).StandardDeviation)
                        TempRow.Item(ExcessKurtosis) = TempSkewnessKurtosisClassArray(0).CalculateExcessKurtosis(TempSkewnessKurtosisClassArray(0).Kurtosis)
                       TempRow.Item(JBStatistic) = TempSkewnessKurtosisClassArray(0).CalculateJarqueBeraTest(RecordCount, TempSkewnessKurtosisClassArray(0).Skewness, TempSkewnessKurtosisClassArray(0).ExcessKurtosis) 

                        If INCLUDE_COUNT = True Then
                            TempRow.Item(TotalRows) = RecordCount
                        End If

                        MyCustomMiningFunctionDataTable.Rows.Add(TempRow) ‘add the row
                        CaseCount = 0 ‘reset the counter
                        Array.Clear(TempSkewnessKurtosisClassArray, 0, TempSkewnessKurtosisClassArray.Length) ‘clear the arrays so that there is no overhead for using a class-scoped variable

                        Array.Clear(TempDataPointArray, 0, TempDataPointArray.Length)
                        Return MyCustomMiningFunctionDataTable

                        then reset the counters so that the function starts off fresh the next time it is called
                        ‘any code I set after this to reset the DataTable won’t be hit – so hopefully SSDM does garbage collection on its own; the best I can do is reinitialize the values in the event of an error
                        Exit Function ‘so we don’t Return it again
                    End If

                Catch ex As Exception

                    if there’s an exception the static and class-scoped variables for this function must be reset, otherwise they will interfere with the next execution
                    CaseCount = 0 ‘reset the counter
                    Array.Clear(TempSkewnessKurtosisClassArray,0, TempSkewnessKurtosisClassArray.Length) ‘clear the arrays so that there is no overhead for using a class-scoped variable
                    Array.Clear(TempDataPointArray, 0, TempDataPointArray.Length)
                    MyCustomMiningFunctionDataTable.Rows.Clear() ‘clear all the rows
                    MyCustomMiningFunctionDataTable = Nothing

                End Try
            End If

            Return MyCustomMiningFunctionDataTable

        End Function

…………Since there are already a handful of better tutorials out there that demonstrate how to write a custom mining function (most notably Crivat’s sample code in the SDK) I wanted to make a contribution of sorts by pushing the boundaries of the architecture. Since the very basic mining method I developed in A Rickety Stairway to SQL Server Data Mining, Part 14.2: Writing a Bare Bones Plugin Algorithm merely returned some basic global stats for each column that I wanted to learn more about, like skewness, kurtosis and excess kurtosis, I figured it would be interesting to calculate the same figures for the values supplied in a prediction join. For good measure, I added a method to the SkewnessKurtosisClass I used in Post 14.2 to perform Jarque-Bera tests, which use measures of lopsidedness like skewness and kurtosis to determine how well datasets fit a normal distribution. I used the formula available at the Wikipedia page Jarque-Bera Test, which is fairly easy to follow in comparison to more complicated goodness-of-fit tests. I soon discovered, however, that keeping track of the cumulative stats required to calculate these measures is problematic when using custom mining functions. SQL Server will make numerous calls to your function, depending on how many input cases there are, but there isn’t a straightforward solution to keeping track of values across these calls. As a result, I had to create two class-scoped variable arrays that mirrored the AlgorithmBase’s SkewnessKurtosisClassArray and DataPointArray variables, which I explained how to use in Post 14.2. The static CaseCount variable I declared at the beginning of the function is used in conditional logic throughout the outer Else…End If statement to control the population of the TempDataPointArray. The DataPointClass objects that accumulate in this array with each iteration of the function are fed en masse to the routines that calculate skewness, kurtosis, excess kurtosis and the Jarque-Bera test when the CaseCount reaches the final input case. On this final pass, those four stats are also added to the single row that comprises the data table the function returns. This MyCustomMiningFunctionDataTable variable is defined as an instance of System.Data.DataTable in the outer If…Else block, which will only be triggered when InputCase has a value of Nothing. This condition is only met on the first pass through the function, which is apparently designed solely to provide a means of declaring the return variable. The Context.ExecuteForPrepare condition is only true on this first pass. Special exception-handling logic is required to reset the values of the class-scoped variables and the static CaseCount in the event of an error, which is comprised by the Catch…End Try block at the end of the method. The last five lines in the End If immediately above the Catch…End Try block perform pretty much the same function, but only after the final pass executes successfully. If these two blocks of code were not included, the CaseCount would continue to accumulate and the calculations of the TempDataPointArray and TempSkewnessKurtosisClassArray would at best throw exceptions, or worse yet, incorrect results.
…………Ideally, it would be easier to retrieve the count of input rows from the function itself or some variable exposed by SSDM, or at least find a way to determine when the function has exited, but thus far I have not yet found such a means. One alternative I tried was to use a CaseSet as an input parameter, but the Description column of the DMSCHEMA_MINING_FUNCTIONS rowset returned an “Invalid parameter type” error. Defining InputCase as an array of MiningCase objects will produce a different error when you run a Data Mining Extensions (DMX) query, including the text, “Too many arguments or flags were supplied for the function.” Nor can you return multiple rows, one at a time; this is why there are three Return statements in the function, one for the initial ExecuteForPrepare phase, one for returning all of the data en masse one time only in a single table, and another for cases in which an empty table is returned because there are no results to report. Returning a value of Nothing instead of the return type declared in the ExecuteForPrepare phase will raise an error, whereas returning an empty table with no rows is also undesirable, since it adds an extra line to the output. If you return a null value instead of the declared return type, you will also receive an error in SQL Server Management Studio (SSMS) worded like this: “COM error: COM error: DMPluginWrapper; Custom Function MyCustomMiningFunction should return a table, but it returned NULL.” There may still be errors lurking in the conditional logic I implemented in this function, but it at least amounts to the crude beginnings of a workaround of the problem of calculating stats dependent on cumulative values in custom mining functions. Would-be plug-in developers are basically stuck with iterating over the input cases one row at a time, rather than operating on an entire dataset, so some type of workaround involving static or class-scoped variables may be necessary when calculating cumulatives. Displaying the results with this type of hack is also problematic, as Figure 3 demonstrates. It is necessary to add another column to the results, as the first reference to “InputStat” does, otherwise FLATTENED results will only display a single blank row, but this means each row in the input set (taken from an OpenQuery on the PluginModel’s data source) will have a corresponding row in the results. In this case, each of the 20 input cases in our dataset is represented by the skewness value calculated for the training set of the InputStat column. It is possible to omit this superfluous column by removing the FLATTENED statement, but then it returns one empty nested table for each row in the input set. Either way, the only row we’re interested in is the last one, which depicts the skewness, kurtosis, excess kurtosis and Jarque-Bera statistic for the entire dataset. The TotalRows column is also populated with the value the user input for RecordCount when the INCLUDE_COUNT flag is specified; this takes just six lines of code in two separate section of Figure 3, once to add the column to the return table and another to assign the value. To date, this is still impractical because I haven’t found a way to use a TOP, ORDER BY, sub-select other clause to return only the row with the cumulative statistics we calculated, but my ugly hack at least shows that cumulatives can be calculated with custom mining functions. That’s not something that’s ordinarily done, but this comparatively short topic seemed an ideal place to contribute a little bleeding edge experimentation. The “Rickety” disclaimer I affixed to this whole series is sometime apt, because I’m learning as I go; the only reason I’m fit to write amateur tutorials of this kind is that there so little information out there on SSDM, which is an underrated feature of SQL Server that is badly in need of free press of any kind. In the next installment of this series, we’ll delve into other advanced plug-in functionality, like custom mining flags, algorithm parameters and feature selection.

Figure 3: Sample Query Result for the Custom Mining Function (With the Cumulatives Hack)

A Rickety Stairway to SQL Server Data Mining, Part 14.5: The Predict Method

By Steve Bolton

…………In order to divide the Herculean task of describing custom algorithms into bite-sized chunks, I omitted discussion of some plug-in functionality from previous installments of this series of amateur tutorials on SQL Server Data Mining (SSDM). This week’s installment ought to be easily digestible, since it deals with a single method in the AlgorithmBase class that is relatively easy to implement – at least in comparison to the material covered in the last four articles, which is fairly thick. For primers on such indispensable topics as the benefits of the plug-in architecture, the object model, the arrangement of nodes in the mining results and debugging and deployment of custom mining methods, see those previous posts. The Predict method is marked Must Override, which signifies that it must be implemented in order for any SSDM plug-in to compile and run successfully, but it is possible to leave the code blank as we did in these previous tutorials in order to get the bare bones plug-in to return some simple results. Until it is fleshed out, the plug-in sample code we’ve been working with in recent weeks won’t return any results for prediction queries.
…………As I mentioned in those previous posts, I’m writing the sample plug-in code in Visual Basic .Net, in part because there’s a distinct lack of  SSDM code written in VB on the Internet, and secondly because it’s my favorite language. I am much more familiar with it than C#, which is what most of the available plug-in tutorials and sample code are written in. Like most of the code I’ve posted so far in this segment of the Rickety series, my implementation of the Predict method originally came from the samples provided by Bogdan Crivat (one of the founding SSDM programmers) in the plug-in software development kit (SDK). After converting to VB using online tools, changing the variable names to my liking, adding implementations of some objects found in the SDK help file and otherwise reworking it, the code is now practically unrecognizable. The design pattern and logical flow are pretty much the same, however, and would not have worked at all without Crivat’s indispensable sample code to guide me. My goal is merely to describe the basic tasks any plug-in programmer typically has to perform in order to return the simplest of results from prediction queries, which is a fairly straightforward task. That does not mean, however, that the Predict method cannot be insanely complex when a particular algorithm requires really advanced functionality; like most of the other plug-in class methods we’ve covered so far, there is plenty of room to add intricate conditional logic when the situation calls for it. An introductory tutorial like this is not one of those situations, because of the risk of clutter, so I’ll just to stick to the basics of what can or must be done in the Predict method. I’ll leave fancier implementations of that functionality up to the imaginations of intrepid plug-in programmers.
…………Most of the plug-in methods we’ve deal with so far are commonly linked to a couple of classes in the object model, but the relationship between the Predict method and the PredictionResult method is even tighter than in many of these cases. SSDM automatically populates the method with MiningCase objects, which we worked with back in A Rickety Stairway to SQL Server Data Mining, Part 14.2: Writing a Bare Bones Plugin Algorithm, plus a single PredictionResult that reflects the structure of the prediction query specified by the end user. For example, the class has a MaxPredictions member that is equivalent to the n parameter of the Predict function, to limit the number of records returned; an IncludeNodeID member to return a string value for the NODE_ID when the user specifies the INCLUDE_NODE_ID flag in the query; and AttributeSortBy and AttributeSortDirection members that describe the arrangement of columns specified in the ORDER BY clause.  If the user also used the INCLUDE_STATISTICS flag, then SSDM will also return a NODE_DISTRIBUTION table with the results that is limited to the number of predictions set in the MaxStates property. Any order assigned to this dependent table will likewise be reflected in the StateSortBy and StateSortDirection properties. StateSortBy and AttributeSortBy both take values of the SortBy enumeration, which includes values like SortBy.AdjustedProbability, SortBy.Probability, SortBy.Support, SortBy.Variance and SortBy.None, while the other two properties take either SortDirection.Ascending or SortDirection.Descending values. The good news is that we don’t have to worry about any of these sorting properties, because as Crivat says in the SDK tutorial, “…you do not need to take them into account during the prediction, except if your patterns layout allows you to dramatically optimize the performance. The infrastructure will push the attribute predictions in a heap structure, sorted by probability, support, or adjusted probability as requested in the prediction operation.” In other words, it’s an advanced lesson far beyond the scope of this crude tutorial series. I imagine that in such a scenario, you would add conditional logic to alter the results returned by the query, depending on which sort method was in use. For example, you might implement a Select Case like the one in Figure 1, which could also be performed on a PredictionResult.StateSortBy value:

Figure 1: Sample Conditional Logic for a SortBy Value


   Case PredictionResult.SortBy.AdjustedProbability

    Case PredictionResult.SortBy.Probability

    Case PredictionResult.SortBy.Support

    Case PredictionResult.SortBy.Variance

    Case PredictionResult.SortBy.None

End Select

…………That begs the question: what belongs in each Case clause? The answer might include any of the remaining operations we’ll perform with PredictionResult in the rest of this tutorial, many of which could be encapsulated in one of these Case statements. A typical implementation of the Predict method, would begin a While loop like the one in Figure 2, which iterates over the OutputAttributes collection of the supplied PredictionResult. The PredictionCounter variable also limits the loop by the value of the MaxPredictions property and is incremented with each iteration just before the End While, which occurs near the end of the function. The AttributeSet.Unspecified variable supplies the index of the currently selected output attribute in the AttributeSet collection; it is zero-based, so we won’t be subtracting one from some of the arrays we’ll deal with in the meat and potatoes of the function, unlike in some other methods we’ve dealt with in the past. I normally prefer to iterate over explicit arrays, but had to use a While loop in this case because AttributeGroup objects like the OutputAttributes member have only two commands to control iteration, Next and Reset. The Select Case that follows the While declaration merely determines whether or not the currently selected attribute is disallowed by the PredictionResult.InputAttributesTreatment property. If not, then we create an AttributeStatistics object to hold the results we’ll be returning.

Figure 2: Complete Sample Code for the Predict Method
Protected Overrides Sub Predict(InputCase As Microsoft.SqlServer.DataMining.PluginAlgorithms.MiningCase, PredictionResult As Microsoft.SqlServer.DataMining.PluginAlgorithms.PredictionResult)

            PredictionResult.OutputAttributes.Reset() ‘an AttributeGroup object

            Dim PredictThisAttributeBoolean = False
            Dim PredictionCounter As UInt32 = 0 ‘for determining whether or not we’ve reached the limit that may be defined in PredictionResult.MaxPredictions

            ‘I’m not sure yet but I think MaxPredictions is unlimited when = 0
            While PredictionResult.OutputAttributes.Next(AttributeSet.Unspecified) = True And PredictionCounter < PredictionResult.MaxPredictions

                ‘OK if the current attribute is input, or output, matches AttributeSet.GetAttributeFlags(1), but

                Select Case PredictionResult.InputAttributeRestrictions

                    Case PredictionResult.InputAttributesTreatment.ExcludeInputAttributes And Me.AttributeSet.GetAttributeFlags(AttributeSet.Unspecified).ToString Like “*Input*”

                        PredictThisAttributeBoolean = False

                    Case PredictionResult.InputAttributesTreatment.PredictOnlyInputAttributes And Me.AttributeSet.GetAttributeFlags(AttributeSet.Unspecified).ToString Like “*Output*”

                        PredictThisAttributeBoolean = False

                    Case PredictionResult.InputAttributesTreatment.None ‘this one really belongs in the Case Else, but I want to demonstrate it

                        PredictThisAttributeBoolean = True

                    Case Else

                        PredictThisAttributeBoolean = True

                End Select

                If PredictThisAttributeBoolean = True Then

                    Dim AttributeStatsToAdd As New AttributeStatistics()
                    AttributeStatsToAdd.Attribute = AttributeSet.Unspecified ‘the target attribute
                     AttributeStatsToAdd.AdjustedProbability = MarginalStats.GetAttributeStats(AttributeSet.Unspecified).AdjustedProbability
                    AttributeStatsToAdd.Max = MarginalStats.GetAttributeStats(AttributeSet.Unspecified).Max
                    AttributeStatsToAdd.Min = MarginalStats.GetAttributeStats(AttributeSet.Unspecified).Min
                   AttributeStatsToAdd.Probability = MarginalStats.GetAttributeStats(AttributeSet.Unspecified).Probability
                    AttributeStatsToAdd.Support = MarginalStats.GetAttributeStats(AttributeSet.Unspecified).Support

                    test to see if the Missing state ought to be returned (SSDM will filter out data that is not requested in the query, but there’s no point wasting resources letting it get to that point

                    If PredictionResult.IncludeStatistics = True Then

                        For I As Integer = 0 To MarginalStats.GetAttributeStats AttributeSet.Unspecified).StateStatistics.Count – 1

                            If (I <> 0 And I > PredictionResult.MaxStates) Then Exit For ‘exit if we’ve hit the MaxStates, if there is a limit

                            If MarginalStats.GetAttributeStats(AttributeSet.Unspecified).StateStatistics(I).Value.IsMissing = True And PredictionResult.IncludeMissingState = False Then
                                don’t add this node, because it’s the missing state

                            Else ‘in all other cases, add it

                                Dim SkewnessState As New StateStatistics()
                                SkewnessState.ValueType = MiningValueType.Other
‘don’t subtract 1 because Unspecified is zero-based
                               SkewnessState.AdjustedProbability = MarginalStats.GetAttributeStats(AttributeSet.Unspecified).StateStatistics(CUInt(I)).AdjustedProbability
                               SkewnessState.Probability = MarginalStats.GetAttributeStats(AttributeSet.Unspecified).StateStatistics(CUInt(I)).Probability
                                SkewnessState.Support = MarginalStats.GetAttributeStats(AttributeSet.Unspecified).StateStatistics(CUInt(I)).Support
                                SkewnessState.ProbabilityVariance = MarginalStats.GetAttributeStats(AttributeSet.Unspecified).StateStatistics(CUInt(I)).ProbabilityVariance
                               SkewnessState.Variance = MarginalStats.GetAttributeStats(AttributeSet.Unspecified).StateStatistics(CUInt(I)).Variance
                            End If
                        Next I
                    End If

                    If PredictionResult.IncludeNodeId Then
                        AttributeStatsToAdd.NodeId = AttributeStatsToAdd.NodeId = AttributeSet.GetAttributeDisplayName(AttributeSet.Unspecified, False)
                    End If

                    ‘ Add the result to the predictionResult collection

                    PredictionCounter = PredictionCounter + 1 ‘increment it only if a prediction is actually made (which it may not be if a comparison to the Input/Output to InputAttributeRestrictions leads to an exclusion

                End If

           End While
        End Sub

…………This sample (I apologize for the sloppy code formatting, but the WordPress GUI is sometimes more challenging to work with than SSDM Extensibility) merely assigns the values of a column’s MarginalStats object to the new AttributeStatistics object, which ought to be a familiar operation at this point in the tutorial series. It then tests the IncludeStatistics member to see if we ought to include the NODE_DISTRIBUTION table, followed by a test of the IncludeMissingState property, then creates a new StateStatistics object if everything passes. The VALUETYPE of the single NODE_DISTRIBUTION table row this represents is set to MiningValueType.Other and the AdjustedProbability, Probability, Support, ProbabilityVariance (i.e. Standard Deviation, or StDev) and Variance are set to the corresponding values in the MarginalStats object for that column. The most important part is the Value.SetDouble statement, which sets the ATTRIBUTE_VALUE to the skewness statistics for that column, which we computed in previous tutorials. The StateStatistics object is then added to the AttributeStatistics object. Note how that section of code resembles the assignments of state stats in Post 14.4. You could conceivably add multiple rows of StateStatistics for each AttributeStatistics, or multiple AttributeStatistics for each column you return, depending on the needs of your algorithm. The next order of business is to add conditional logic to deal with instances in which the IncludeNodeID is True; in this case, I’m simply returning the full name of the output column. The newly populated PredictionResult is then added to the collection of predictions that will be returned by the method, the PredictionCounter variable I created is incremented, then the While loop either makes another pass or the method returns.
…………The logic you implement in the Predict method might be complex but the basics of its operation are not, in part because most of the PredictionResult properties are set by the query writer and therefore read-only. Debugging of the Predict method is far easier than with the AlgorithmNavigationBase class we discussed in the last post, which is in turn far less of a hassle than troubleshooting the other AlgorithmBase members and several orders of magnitude easier than dealing with errors in the AlgorithmMetadataBase class. One of the few issues I ran into initially were errors with text including the word “GetResultFromAttStats,” which might belong on Jeannine Takaki’s incredibly valuable thread on undocumented SSDM stored procedures. Presently, it is so undocumented that it only returns two hits on Google. One of these is an MSDN thread titled problem with managed code plug-in to analysis services which had no replies. I was fortunate to find this sentence in Gaël Fatou’s invaluable reply at the MSDN thread Managed Plugin-Algorithm: “…if  (trainingStats.StateStatistics[0].Value.IsMissing) will not work, as it will always only consider the first state (which is always missing). The current state must be considered instead: if  (trainingStats.StateStatistics[nIndex].Value.IsMissing).” This is due to an error in the SDK sample code that has apparently bitten at least one other plug-in developer, which necessitates the replacement of the [0] in the StateStatistics[0].Value.IsMissing statement with the proper state stats index. I never encountered GetResultFromAttStats errors after applying this solution, but first-time plug-in developers ought to be aware of this stumbling block in advance. I also received a couple of “Object reference not set to an instance of an object” errors that referenced GetAttributeFlags, a method of AttributeSet that I was calling incorrectly in my IncludeNodeID code. Another problem I encountered that was equally easy to fix was forgetting to enable the prediction functions I called in my test query, which led SQL Server Management Studio (SSMS) to return error messages with the text “…function cannot be used in the context” every time I ran the queries in Figures 3 and 4. It was a simple matter to edit the code of GetSupportedStandardFunctions in AlgorithmMetadataBase to return the correct SupportedFunction enumeration values. Out of the 37 available values, I selected PredictAdjustedProbability, PredictCaseLikelihood, PredictHistogram, PredictProbability, PredictSupport, PredictStdDev, PredictVariance, PredictScalar and PredictNodeId, which enabled me to return the results in the two figures below. I also enabled PredictTable, CustomFunctionBase and IsInNode for future experimentation. The .chm included with the SDK mentions that other Data Mining Extensions (DMX) functions like DAdjustedProbability, DNodeId, DStdDev, DVariance, DSupport, PredictAssociation and PredictSequence are also dependent on the output of the Predict method, but I didn’t implement some of these because they’re applicable to other types of algorithms, which are beyond the scope of this series. Note that the PredictHistogram function in the top graphic includes a row for the Missing state, while the first row contains the same values as the PredictSupport, PredictProbability, PredictAdjustedProbability, PredictVariance and PredictStDev functions in the bottom graphic. The Predict function at the bottom returns the same skewness value for the InputStat column we saw in last week’s post, as well as the name of the InputStat column for the PredictNodeID function. Most of these columns are standard return values in prediction queries, but in next week’s tutorial, I’ll demonstrate how to use custom mining functions to return more flexible result formats and perform tailor-made processing.

Figures 3 and 4: The Results of Various DMX Prediction Functions

A Rickety Stairway to SQL Server Data Mining, Part 14.4: Node Navigation

By Steve Bolton

…………The pinnacle of difficulty in this series of amateur self-tutorials on SQL Server Data Mining (SSDM) was surmounted in the last installment, when we addressed the debugging and deployment of custom plug-in algorithms. From here on in, we will be descending back down the stairway, at least in terms of exertion required to squeeze practical benefit out of the lessons. In our first steps we’ll have to deal with a few aspects of plug-in development that were deferred in order to divide the tutorials into bite-sized chunks, particularly the topic of node navigation. The construction and navigation of a hierarchy of nodes out of our mining results takes place almost exclusively in the AlgorithmNavigationBase class, which we touched on briefly in A Rickety Stairway to SQL Server Data Mining, Part 14.2: Writing a Bare Bones Plugin Algorithm. In order to display rudimentary results in a single node in the Generic Content Viewer by the end of that article, it was necessary to return trivial or dummy values for many of the 18 methods in that class which were marked Must Override. The first bit of good news is that we’ve already introduced GetStringNodeProperty, GetDoubleNodePropert and GetNodeDistribution, the methods responsible for assigning the values of familiar columns from SSDM’s common metadata format, like NODE_DESCRIPTION, NODE_RULE, MARGINAL_RULE, NODE_PROBABILITY, MARGINAL_PROBABILITY, NODE_DISTRIBUTION, NODE_SUPPORT, MSOLAP_MODEL_COLUMN, MSOLAP_NODE_SHORT_CAPTION and NODE_CAPTION. The second good news item is that debugging of AlgorithmNavigationBase is much simpler than that of AlgorithmBase, where most of the number-crunching occurs. It is several orders of magnitude easier than debugging AlgorithmMetadataBase, where errors can be difficult to track down since they cause SQL Server Analysis Services (SSAS) to skip loading faulty algorithms when the service starts up. There are some pitfalls to constructing a node hierarchy out of your mining results, however, which can still be tricky to overcome.
…………Debugging of this class is much easier in part because we simply refresh the results in the Visual Studio GUI or SQL Server Management Studio (SSMS) in order to trigger breakpoints set here, whereas in AlgorithmBase or AlgorithmMetadataBase it is often necessary to reprocess mining models or restart the service. Restarts are still necessary after you deploy changes to the code, but not if you’re simply stepping through the results again. Expanding a closed node in the Generic Content Viewer may also cause various routines in AlgorithmNavigationBase to be called, which will activate any breakpoints you’ve set there in Visual Studio – assuming you’ve set them as prescribed in the last article. One of the more common errors you may see while debugging SSDM navigation may be worded like this: “Execution of the managed stored procedure GetAttributeValues failed with the following error: Exception has been thrown by the target of an invocation. The specified marginal node cannot be found in the mining model ‘PluginModel’.” The sentence indicates that the culprit may be an inappropriate choice of data mining viewer types, which are defined when the service starts by the GetViewerType routine of AlgorithmMetadataBase. I encountered this frequently while debugging this project because I initially returned a value of MiningViewerType.MicrosoftNeuralNetwork in that routine. The plug-in we developed a couple of tutorials ago merely returns a few global statistics like skewness, kurtosis and excess kurtosis, so it is not surprising that the neural net viewer would be a poor choice. Curiously, any attempt on my part to rectify it since that rookie mistake by substituting a different MiningViewerType enumeration value in GetViewerType has produced a more serious GetAttributeValues error. The text is the same except for the final line, which instead mentions the name of our practice plug-in algorithm at the end: “The targeted stored procedure does not support ‘Internal_Name_For_My_Algorithm’.” It apparently takes a little more work to change the viewer type of an algorithm once it’s been set, at least in some cases; it’s not a problem I’ve solved yet, but one plug-in developers should be wary of. It proved to be merely an inconvenience for the purposes of this series, since selecting Generic Content Viewer from the dropdown list in the Visual Studio GUI or SSMS works without errors whenever I return the original value of MicrosoftNeuralNetwork. Whenever I returned results in a new window in one of those tools I would receive the first GetAttributeValues error, then simply selected the Generic Content Viewer, which is the only one we need to worry about in a bare bones plug-in like this.
…………Developers might notice that there is no GetAttributeValues member in the object model or the DMPluginWrapper class it is derived from, via the means outlined in A Rickety Stairway to SQL Server Data Mining, Part 14.1: An Introduction to Plug-In Algorithms. Nor does either one expose the GetAttributeScores routine, which is mentioned in some of the arcane error messages SQL Server occasionally returns when processing the nine algorithms included with the product, which we covered much earlier in this series. These are internal routines within the SSAS process, msmdsrv.exe, which we can’t assign breakpoints to in Visual Studio, let alone debug and recompile. By good fortune, I stumbled across a list of 20 such undocumented internal routines at Jeannine Takaki’s invaluable Technet post A Guide to the (Undocumented) System Stored Procedures for Data Mining, at the last minute while writing up this article. She explains the uses of each in detail and how to use the CALL syntax on them in Data Mining Expressions (DMX) code.  GetAttributeValues and GetModelAttributes are the only two undocumented routines on the list which can be used with any one of the nine out-of-the-box algorithms. GetAttributeScores applies specifically to neural nets and Linear Regression. Some of the routines are specific only to the two Microsoft-supplied clustering algorithms, such as GetClusterCharacteristics, GetClusterDiscrimination, GetClusterProfiles and GetClusters, whereas ARGetNodes; GetItemsets, GetRules and GetStatistics only apply to the Association Rules algorithm. Linear Regression and Decision Trees likewise are the only data mining methods which make use of CalculateTreeDepth, DTAddNodes, DTGetNodeGraph, DTGetNodes and GetTreeScores, while GetAttributeCharacteristics, GetAttributeDiscrimination, GetAttributeHistogram and GetPredictableAttributes can only be used with Naïve Bayes mining models. I have yet to experiment with them myself, but they all seem to be related to the retrieval of mining model results, which means you may encounter them when developing plug-ins that use the corresponding mining viewer types.
…………It is imperative to be conscious of the order in which SSDM triggers the 18 mandatory methods in AlgorithmNavigationBase when you retrieve the mining results in the GUI. The sequence begins in the GetNavigator routine of AlgorithmBase, which instantiates a new instance of AlgorithmNavigationBase, which in turn calls any code you’ve included in the constructor. In my case, all the New method does is set a reference to a class-scoped instance of the algorithm and a Boolean which indicates whether or not the navigation operation is being performed on a data mining dimension. I cannot verify that the rest of the sequence is the same for all mining models or all algorithms, but to date, I have seen SSDM invariably trigger the GetCurrentNodeID routine next. The next seven calls are to routines which set either the type or the name of the node, such as GetNodeType, GetNodeUniqueName, GetUniqueNameFromNodeID, GetNodeType again, a single call to GetStringNodeProperty to assign the NODE_CAPTION value, then GetNodeUniqueName and GetUniqueNameFromNodeID once more. I’m not sure what the purpose of repeating these calls to GetNodeType, GetNodeUniqueName and GetUniqueNameFromNodeID is; I only know that it occurs. The code for these routines is fairly trivial. GetNodeUniqueName has no parameters and merely calls GetUniqueNameFromNodeId, which merely converts the NodeID integer parameter supplied by SSDM into a string value. GetNodeIdFromUniqueName merely performs the same function as GetUniqueNameFromNodeID, except in reverse; in my example, I merely converted the NodeUniqueName supplied by SSDM as a parameter to the function and converted it to a 32-bit integer. I suppose other plug-in developers might have a genuine need to develop a much more complex naming scheme than in my trifling example of this class (which can be downloaded here), but for our purposes, there’s no real reason to complicate things. GetCurrentNodeID is likewise called frequently during the retrieval of mining results, but only returns the value of the CurrentNode, which is a class-scoped integer that identifies the node the navigation class is currently assigning the values for. In the implementations I’ve seen so far, such as the tutorial that founding SSDM developer Bogdan Crivat included in the plug-in software development kit (SDK) and Microsoft Business Intelligence with Numerical Libraries: A White Paper by Visual Numerics, Inc., the value of the CurrentNode is not set in this routine, merely returned. So how do we direct SSDM to operate on a different node – or for that matter, determine the number and order of the nodes? That doesn’t occur in GetNodeType, which merely returns a value of the NodeType enumeration to assign the NODE_TYPE value we’ve seen returned throughout this series in SSDM’s common metadata format. You might need to know the number of the node in order to assign the proper value, which can be any one of the following: AssociationRule; Cluster; CustomBase; Distribution; InputAttribute; InputAttributeState; Interior; ItemSet; Model; NaiveBayesMarginalStatNode; NNetHiddenLayer; NNetHiddenNode;NNetInputLayer; NNetInputNode; NNetMarginalNode; NNetOutputLayer; NNetOutputNode; NNetSubNetwork; None; PredictableAttribute; RegressionTreeRoot; Sequence; TimeSeries; Transition; Tree; TSTree; and Unknown. In other words, we’d assign None, Unknown or one of the 25 NODE_TYPE values we’ve seen often throughout this series. In my implementation, I performed a Select Case on the value of CurrentNode and set a type of Root when it equaled zero and InputAttribute in all other cases.
…………Decisions already have to be made based on the CurrentNode value, yet we have not yet assigned it in code. As mentioned in the last couple of tutorials, one of the trickiest aspects of debugging AlgorithmBase and AlgorithmMetadataBase is that calls to methods appear to come out of thin air, because they originate within msmdsrv.exe. It is a different expression of the same problem embodied in the hidden calls to undocumented methods, like GetAttributeValues and GetAttributeScores. In AlgorithmNavigationBase, this is manifested in apparently uncontrollable calls to the 18 mandatory methods, directed by msmdsrv.exe, which seem to leave developers without any well-defined place to specify the very node structure these routines operate on. That is not to say that these methods calls are random; msmdsrv seems to follow a fairly consistent pattern, in which the calls listed above are followed by an invocation of GetParentCount, then GetNodeAttributes. After this, GetNodeUniqueName is usually followed by another call to GetUniqueNameFromNodeId, then both are repeated again, followed by yet another call to GetNodeType. The NODE_CAPTION is typically set again in GetStringNodeProperty for some reason, followed by further calls to GetNodeUniqueName and GetUniqueNameFromNodeId. SQL Server then invokes GetChildrenCount and GetParentCount, followed by three calls to GetStringNodeProperty in which the NODE_DESCRIPTION, NODE_RULE and MARGINAL_RULE are set; NODE_PROBABILITY and MARGINAL_PROBABILITY are then set in two separate calls to GetDoubleNodeProperty; the structure of the NODE_DISTRIBUTION table is then set in a single call to GetNodeDistribution, followed by alternating calls to GetDoubleNodeProperty and GetStringNodeProperty to set the NODE_SUPPORT, MSOLAP_MODEL_COLUMN, MSOLAP_NODE_SCORE and MSOLAP_NODE_SHORT_CAPTION. It was trivial to set the values of these in Post 14.2, when we were only dealing with a single node. If we’re dealing with multiples nodes, we’d need to insert Select Cases on the CurrentNode value to assign the correct values of these columns to the appropriate nodes, but it wouldn’t normally be suitable to change the value of CurrentNode itself here. After this, SSDM usually calls LocateNode twice, which merely invokes ValidateNodeID, i.e. the routine where you’d embed any custom code to authenticate your node structure; in my algorithm, I simply added code to throw an error if the CurrentNode value exceeded the count of the number of attributes. Somewhere in this vicinity, SSDM will call GetChildrenCount and sometimes MoveToChild, as well as GetParentNodeID and MoveToParent in some cases. Correctly specifying the cases when those particular routines are hit is the key to constructing your node network, not through a call to a method that simply declares the size and shape of the whole network explicitly. After this, SSDM will usually call GetCurrentNode and repeat this whole sequence over again until it reaches the final node – which it may never do, if you don’t specify the node structure correctly.
…………In a typical program, you’d expect to set a fixed number of nodes, perhaps in a class-scoped multi-dimensional array or collection that also determined the structure of the relationships between them. In SSDM plug-ins, however, the structure is specified at almost primordial levels, through calls to methods that determine the counts of each nodes parents and children. The numbers returned by GetParentCount and GetChildrenCount in turn determine how many times SSDM will iterate through MoveToParent and MoveToChild respectively. This group of methods also includes GetChildNodeId and GetParentNodeID, in which you merely return the parameter value supplied by SSDM, as far as I can tell. In my implementation, I merely wanted to return a root node with global statistics, followed by two child nodes representing two columns of input data I supplied to my mining model. Since the root node has no parents, I returned zero for GetParentCount when the CurrentNode was equal to zero, i.e. the unique index for the single root node; since the two input attributes have just one parent, I returned a value of one in all other cases. In GetChildrenCount, however, I returned a value equal to the number of input attributes when the CurrentNode matched the root node. Since the root node is the only parent, I returned zero when the CurrentNode is equal to any other value, since these are input attributes, which have only one parent in this case. The CurrentNode value is only changed in one place in this trifling plug-in, in the MoveToChild routine, where it is increased by one in each iteration. The number of iterations will be equal to the value assigned to the single root node in GetChildrenCount. In my plug-in, MoveToParent is never invoked because the only parent specified in GetParentCount has already been passed once, but it may be triggered in structures that require more complex implementations of GetParentCount and GetChildrenCount. This means of specifying the node structure is so basic that simple errors in your logic can cause this sequence of method calls to repeat itself indefinitely, until you receive a warning from SSMS that the XMLA query timed out. If you plan to develop plug-ins, expect to see a lot of these. Also expect to set aside more time than usual for the code that specifies your node structure, since it is done in an unconventional way that is not easy to control. I would prefer to deal with explicit definitions of those structures in old-fashioned class-scoped arrays or collections, which might be easier to manage when dealing with particularly complex node setups; yet perhaps there are sounds reasons for the way Microsoft implemented this, maybe as a performance enhancement of some kind. Just keep in mind that most of the time you spend in AlgorithmNavigationBase will be devoted to specifying your node setup. Most of its methods are trivial, sometimes to the point of merely returning the values SSDM supplies in function parameters. Likewise, even the most complex node setups may not require many lines of code, but that doesn’t mean they won’t be a challenge to arrange properly. It is a good idea to have a clear idea of the structure you want beforehand, then to keep a close eye on the conditional logic in GetChildrenCount and GetParentCount, as well as paying close attention to the way in which you increase the value of CurrentNode or any other index placeholder variable you might use. Expect to iterate through the sequence of method calls outlined above many times during debugging before you get it right; it is not a bad idea to set breakpoints in many of them keyed to the Hit Count or Condition (which can be set in Visual Studio by right-clicking on the breakpoints).
…………As Crivat points out in the tutorial he wrote for the SDK, there may be cases in which a single node may have multiple parents, such as tree structures that represent graphs, but these instances are too extraordinary to warrant a full discussion. I should at least mention MoveToNextTree, a mandatory method Microsoft included to enable unusual structures with more than one root node. As the help file included with the SDK states, “Positions the navigator to the next content tree, if your algorithm’s content is laid out as multiple trees…Usually, all the content can be represented as a single tree and this function should return false. It is useful to implement this function when your content graph is disjoint, to ensure that all nodes can be traversed in a content query.” As I’ve said many times in this series and in others like An Informal Compendium of SSAS Errors, troubleshooting Analysis Services is sometimes like exploring deep space with the crew of the Enterprise, since you may be boldly going where no man has gone before. The MoveToNextTree method is another one of those features of SSAS for which there is apparently no documentation anywhere on the planet aside from the single line of code in sample files Crivat includes with his tutorials, and a single brief C++ example at the MSDN webpage A Tutorial for Constructing a Plug-in Algorithm. The latter doesn’t do much, but says “This function sets _iIDNode to be the root of the next tree in the sequence. Note that if we are in browser mode, there will never be a next tree.” The only useful documentation I could find is the sentence quoted above from the .chm file. Undaunted, I decided to experiment by simply returning a value of True in all cases, which will create an infinite loop that eventually triggers the XMLA timeout warning. The code in Figure 1 worked because it included a limit on the number of times the routine returns True. Each time MoveToNextTree returns that value, it not only iterates through the sequence of method calls we discussed earlier, but does so for each node in that structure. Intrepid programmers with a  need for disjoint graphs (i.e. sets with no common members) could add conditional logic to the other methods we’ve already discussed to return different values, depending on which of the separate tree structures is being iterated through. In my case, the code below merely added three nodes under the root, on the same level as the other two inputs, although I suppose I could have included conditional logic in other routines to create secondary roots or the like.

Figure 1: Some Sample MoveToNextTree Code

Static TreeCount As Int16

If TreeCount < 3 Then TreeCount = TreeCount + 1
    Return True
    Return False
End If

…………The remaining functionality we have yet to discuss with AlgorithmNavigationBase all pertains to means that we have yet to discuss for supplying values for columns in the SSDM’s common metadata for a particular node. For the sake of brevity, I left a discussion of GetNodeAttributes out of Post 14.2, even though it’s relatively inconsequential. It merely specifies the values assigned to the ATTRIBUTE_NAME column by supplying the names assigned already assigned to your attributes, which in my case correspond to mining model columns labeled “InputStat” and “SecondInputStat.” All you need to do is supply the index of the attribute in your AttributeSet to the return variable, which is an array of unsigned integers. To create a comma-separated list of attributes associated with a particular node, return more than one value in the array. This week’s tutorial also features a more complex implementation of the NODE_DISTRIBUTION table than the one returned for a single node a couple of posts ago. The Select Case returns the marginal stats for the whole model when the CurrentNode is equal to the index of the root, but numbers specific to the InputStat or SecondInputStat column for the two child nodes. First I add the marginal stats for the appropriate column to a collection of AttributeStatistics objects, which in turn specifies the values for the first few rows of its NODE_DISTRIBUTION table. The real difference between this and the code we used a few weeks ago is that we’re also including three custom NODE_DISTRIBUTION rows, by adding three extra AttributeStatistics objects whose StateStatistics objects have custom values. First I instantiate the new AttributeStatistics objects, then the three StateStatistics objects which are added to them at the end of the Case statement. The VALUETYPE for each custom NODE_DISTRIBUTION row is set using the MiningValueType iteration, as depicted in Figure 2. Immediately after that, I assign the ATTRIBUTE_VALUE to either the skewness, kurtosis or excess kurtosis value we associated with that attribute in previous tutorials. I’ve left the NODE_SUPPORT, NODE_PROBABILITY and VARIANCE blank in these cases, but set the ATTRIBUTE_NAME (which should not be confused with the column of the same name which is one level higher in the hierarchy of SSDM’s common metadata format0 using the NodeID property.  Ordinarily it might also be a good idea to logic in this routine to do conditional processing based on the NODE_TYPE, but I didn’t want to complicate this tutorial further by nesting Select Cases and other such statements inside each other.

Figure 2: Adding Custom Rows to the NODE_DISTRIBUTION Table

Case Else

Dim TempStats As AttributeStatistics() = New AttributeStatistics(3) {}
empStats(0) = CurrentAlgorithm.MarginalStats.GetAttributeStats(CurrentNode – 1)
TempStats(1) = New AttributeStatistics
TempStats(2) = New AttributeStatistics
TempStats(3) = New AttributeStatistics

Dim SkewnessState, KurtosisState, ExcessKurtosisState As New StateStatistics()

SkewnessState.ValueType = MiningValueType.Other
KurtosisState.ValueType = MiningValueType.Coefficient
ExcessKurtosisState.ValueType = MiningValueType.Intercept

SkewnessState.Value.SetDouble(CurrentAlgorithm.SkewnessKurtosisClassArray(CurrentNode – 1).Skewness)
KurtosisState.Value.SetDouble(CurrentAlgorithm.SkewnessKurtosisClassArray(CurrentNode – 1).Kurtosis)
ExcessKurtosisState.Value.SetDouble(CurrentAlgorithm.SkewnessKurtosisClassArray(CurrentNode – 1).ExcessKurtosis)


we have to set the ATTRIBUTE_NAME string that appears next to these stats in the parent AttributeStats object, for whatever reason
TempStats(1).Attribute = AttributeSet.Unspecified
TempStats(1).NodeId = “Skewness”
TempStats(2).Attribute = AttributeSet.Unspecified
TempStats(2).NodeId = “Kurtosis”
TempStats(3).Attribute = AttributeSet.Unspecified
TempStats(3).NodeId = “Excess Kurtosis”

Return TempStats

…………The results for the updated code I’ve available here produced the three nodes depicted in Figures 3 through 5. The first NODE_DISTRIBUTION table includes the marginal stats for the whole model, plus some values for NODE_CAPTION, NODE_DESCRIPTION and other such columns that we assigned in GetDoubleNodeProperty and GetStringNodeProperty a couple of tutorials ago. Note how the NODE_DISTRIBUTION tables for the second and third figures include custom rows featuring the values for the skewness, kurtosis and excess kurtosis associated with the column mentioned in the ATTRIBUTE_NAME column (the one at the top of each figure, not in the NODE_DISTRIBUTION tables. I have yet to solve the mystery of why the VALUETYPE for the skewness rows is set to Periodicity, when the value supplied in code is MiningValueType.Other, but that is insignificant at this point. The mystery of how the Generic Content Viewer columns are populated in plug-in algorithms is essentially solved; would-be plug-in writers need only decide what they want to appear in them, rather than worrying about how to get the values in there. Only a few enigmas remain in the attic at the top of this Rickety Stairway, such as how use the Predict method, advanced functionality like custom mining functions and Predictive Model Markup Language (PMML). After those four plug-in tutorials are finished, we’ll unravel one final mystery to end the series, how to write custom data mining viewers. Discussion of GetNodeIDsForCase, a specialized AlgorithmNavigationBase method, will be deferred until then because it implements drillthrough to mining model cases, which can’t be done in the Generic Content Viewer. Since the more useful and attractive mining viewers that ship with the product depend on the same values as the Generic Content Viewer, it would be fairly easy to upgrade this sample AlgorithmNavigationBase code to create structures that are compatible with them, such as hierarchies than can be displayed by the Decision Trees Viewer. It wouldn’t be child’s play and the code would be lengthy enough to unduly clutter this series, but once you grasp the basic principles of how to navigate through the appropriate structure, upgrading this skeleton code into custom algorithms that can be displayed in the out-of-the-box mining viewers becomes feasible. The last tutorial in this series will address scenarios where those viewers don’t meet the needs of end users, which calls for custom means of displaying data in the Visual Studio GUI or SSMS.

Figures 3 to 5: Three Nodes Displaying Varied Information
3Nodes (1)

3Nodes (2)3Nodes (3)

A Rickety Stairway to SQL Server Data Mining, Part 14.3: Debugging and Deployment

By Steve Bolton

…………Throughout this series of amateur self-tutorials in SQL Server Data Mining (SSDM), I’ve often said that working with Analysis Services is a bit like blasting off into space with the Starship Enterprise, because you may be boldly going where no man has gone before. My An Informal Compendium of SSAS Errors series remains one of the few centralized sources of information on certain arcane SQL Server Analysis Services (SSAS) errors, some of which turn up no hits on Google and appear to have no extant documentation anywhere on the planet. SSDM is one of the most powerful yet under-utilized components of Analysis Services, which represents Microsoft’s contribution to the cutting edge field of data mining, so we’re venturing even further into uncharted territory in this series. Within SSDM, the most powerful but obscure feature is the extensibility mechanism for writing custom algorithms, so that takes us even further out on a limb. Debugging is likewise one of the most trying tasks in the computing field, since it often involves delving into the unknown to find the causes of errors that may be completely unique to a particular user, system or piece of software. It is thus not surprising that debugging SSDM plug-in algorithms is perhaps the most difficult step to climb in this Rickety Stairway.
…………To complicate matters even further, deployment bugs are often the most difficult to ferret out in any type of software, due to the endless variety of system configurations it may be installed on, not to mention the fact that it is difficult to debug a program that isn’t even running yet. This issue is more prominent in the deployment of SSDM algorithms, for the simple reason that Visual Studio debugger will only be triggered on breakpoints in your custom code or the DMPluginWrapper.dll it references, not in the SSAS process (msmdsrv.exe) that calls it. In the last two tutorials I gave examples of how to write a custom algorithm in Visual Basic.Net classes and compile the DMPluginWrapper file, in which you can set breakpoints at any line of code you prefer. After that it is a matter of starting Analysis Services, then using the Attach to Process command in Visual Studio to link the debugger to the msmdsrv process and instruct it to break on both native and .Net code (in the case of my project, the .Net version was 4.0). Whenever breakpoints in the methods of the three main plug-in classes that we discussed in last week’s tutorial are hit, or those methods call breakpoints in the original DMPluginWrapper class they’re derived from, Visual Studio should break. One of the leading challenges of debugging these projects is that sometimes they don’t break when you’d expect them to, primarily because msmdsrv calls these methods in an order that is not readily apparent to programmers. For example, in an ordinary desktop .Net program, the call stack would begin at the breakpoint, then move to whatever routine called the routine the breakpoint, and so on until you eventually hit a Sub Main or similar entry point at the beginning of the application. With SSDM plug-ins, the call stack often starts with a single routine in your project, which leads back to a routine in the DMPluginWrapper file, which in turn is called by internal msmdsrv.exe routines that you can’t debug or control the execution of. As a result, tracking down the causes of errors is much trickier than in ordinary applications, because you can’t always tell for certain which line of code caused a particular error, or even determine why a routine in your code was called rather than another. The call stack is essentially decoupled, which makes it appear at first as if some errors are being generated from thin air out of left field. In these cases the errors usually turn out to be in your plug-in code, but are only detected in the space in the call stack taken up by internal msmdsrv methods. As mentioned last week, my aim for now is merely to provide the leanest possible version of a VB plug-in, merely to illustrate how SSDM can be directed to return any results at all. Even a bare bones algorithm missing prediction, navigation, feature selection, parameters and other functionality is not feasible without first surmounting the challenge of deploying anything at all. The daunting dilemmas of debugging are readily apparent from the beginning of the deployment phase, which makes it necessary to discuss the two topics together, before fleshing out more advanced features of plug-ins.
…………The process of deployment consists primarily of eight steps, including 1) generating .snk Strong Name Key files for your DMPluginWrapper and plug-in class in order to debug them, which only needs to be done once before adding them to the Global Assembly Cache (GAC); 2) adding the DMPluginWrapper to the GAC, as well as removing it first if it already exists; 3) compiling the latest version of the project;  4) registering your plug-in class as a COM class and if necessary, unregistering the old version first; 5) adding the plug-in class to the GAC, as well as removing any old versions first; 6) informing SSAS of the existence of the new algorithm; 7) restarting the SSAS service; then 8) using Attach to Process as outlined above to begin debugging. after that, you merely need to trigger the appropriate routines in your code, such as processing a mining model to trigger the InsertCases and ProcessCase routines in AlgorithmBase or refreshing the GUI view of mining model results in SQL Server Management Studio (SSMS) or Visual Studio to trigger various routines in AlgorithmNavigationBase, all of which we discussed in depth in last week’s article. All but two of these steps must be performed repeatedly during plug-in development, so I highly recommend placing most of the command line prompts I’ll mention hereafter in a .bat file that can perform them all with a single keystroke. One of those exceptions is Step #6, which only has to be performed once for any plug-in you write. The easy way is to run the XML for Analysis (XMLA) script below, substituting the GUID of your algorithm in the CLSID value and the name of your algorithm for mine, “Internal_Name_For_My_Algorithm.” If the name does not match the one specified in the AlgorithmMetadataBase.GetServiceName routine we discussed last week, you’ll see an error like this in your msmdsrv.log and Application event log: “The ‘ (the name listed in the GetServiceName routine will appear here) ‘ service which was returned by the ” data mining algorithm, does not match its ‘(your project name will appear here)’ algorithm in the server configuration file.”

Figure 1: Sample XMLA Code to Register a Plug-In


Figure 2: Newly Deployed Plug-In Depicted in the SSAS .ini File

…………If the XMLA script runs successfully, then after restarting the service you’ll see the change reflected in the msmdsrv.ini file. Under the ConfigurationSettings\Data Mining\Algorithms node, you’ll find the nine out-of-the-box algorithms listed, all prepended with the name “Microsoft” and sans any GUID tag. For each custom plug-in, you’ll see a listing like the one in Figure 2, in which the opening and closing tags are equivalent to the service name of the algorithm I developed in last week’s tutorial. Note that some versions of the XMLA script listed in other tutorials will set the Enabled value to True rather than 1, which doesn’t seem to cause any unwanted side effects I’m yet aware of, but I changed from the Boolean to the integer value to be consistent with the format used in the .ini file for the out-of-the-box algorithms. There really aren’t many ways to mess up the XMLA script, which only has to be run correctly one time for each plug-in. While troubleshooting my deployment problems I tried substituting incorrect names and GUID values that did not yet exist in the registry, which led to errors like these in the Application log and msmdsrv file: “The data mining algorithm provider (ProgID: 06da68d6-4ef0-4cea-b4dd-1a7c62801ed2) for the Internal_Name_For_My_Algorithm algorithm cannot be loaded. The following system error occurred:  Class not registered” and “The data mining algorithm provider (ProgID: (a GUID other than the 06da68d6-4ef0-4cea-b4dd-1a7c62801ed2 sample value we’re using here) for the MyAlgorithm algorithm cannot be loaded. The following system error occurred:  No such interface supported.” When editing the .ini manually instead of using the XMLA script, the range of errors is only limited by one’s imagination; it’s a bit like the old proverb, “God put obvious limits on our intelligence, but none whatsoever on our stupidity.” Nonetheless, I like to at least manually inspect the .ini after running an XMLA script like this or receiving other deployment errors, just to make sure; that is how I caught the subtle difference between the data types for the Enabled tag, for example. When manually editing the .ini file, don’t edit the <Services> tag under <DataMining> to add your plug-in class, because it’s strictly for built-in algorithms.[i] I recommend keeping shortcuts to the .ini and SSAS log files in a handy place, because you’re going to need to check them repeatedly when testing plug-in algorithms.
…………The script only needs to be run one time for each plug-in, but you obviously can’t run it successfully before compiling and registering your class at least once prior to this. The first step in the process also only needs to be performed once, although there is more room for error in compiling your projects with Strong Name Key files. The other six steps in the deployment process will be repeated ad infinitum, ad nauseum while you debug your code, so we’ll discuss them separately, even though there is some overlap between these topics. The fun begins with generation of the CLSID you see in the XMLA script above, which uniquely identifies your algorithm in the registry and can be generated through a wide variety of tools familiar to Visual Studio and SQL Server programmers, like guidgen.exe, the Create GUID menu function and the NewID() function. The best option when working with VB projects, however, is to check the Sign the assembly checkbox under the Signing tab of Project Properties, which will generate a GUID for you; copy it from the Assembly Information dialog under Project Properties and put it in the GUID tag that adorns your AlgorithmMetadataBase class declaration before compiling the project. This should be identical to the CLSID in the XMLA script. In the Assembly Information box, also select the Make assembly COM Visible checkbox. Do not, however, select the Register for COM Interop checkbox on the Compile tab, otherwise you may receive the following error during compilation: “(Name and Path of your Project).dll” is not a valid assembly.” Back on the Signing tab, you must also add a Strong Name Key file generated by using the Visual Studio sn.exe tool. The DMPluginWrapper.dll file you referenced must also be signed with an .snk and recompiled first, otherwise Visual Studio won’t break on exceptions or hit your breakpoints during debugging, nor will it be installed correctly in the GAC. At one point I received the error, “Unable to emit assembly: Referenced assembly ‘DMPluginWrapper’ does not have a strong name” because I signed my plug-in class but not the DMPluginWrapper.dll it referenced. So I took the easy way out and simply removed signing from my plug-in, which won’t work because you’ll encounter the following error when trying to add DMPluginWrapper to the GAC: “Failure adding assembly to the cache: Attempt to install an assembly without a strong name” I adapted the examples provided by Inaki Ayucar at the webpage How to Sign C++/CLI Assemblies with a Strong Name and Prasanjit Mandal at the MSDN thread Failure Adding Assembly to the Cache: Attempt to Install an Assembly without a Strong Name and ended up with command line text like this: sn.exe -k DMPluginWrapper.snk. I didn’t need to specify the full path to the file, but as the saying goes, Your Mileage May Vary (YMMV). It is also a good idea to verify the signing by using the –v switch on sn.exe. It is critical not to forget to add the following tag to your DMPluginWrapper’s AssemblyInfo.cpp file, substituting the name of the .snk you just generated: [assembly:AssemblyKeyFile(“DMPluginWrapper.snk”)]. Then rebuild the DMPluginWrapper project and reset the reference to the .dll in your plug-in class project. Add an .snk for that project as well on the Signing tab, then rebuild it as well. Compilation should occur with errors at this point and adding both classes to the GAC should now be feasible – provided you do not stumble over one of the following hurdles when adding the assemblies.
…………Experienced .Net programmers know that mismatches between Framework versions or 32-bit and 64-bit platforms can get ugly, which is a problem that is likely to occur while adding the assemblies to the GAC or the registry. The DMPluginWrapper.dll file does not have to be registered and only needs to be added to the GAC one time, but your plug-in must be unregistered, registered again, then removed and added back to the GAC every time you change its code. The catch is to use the right versions of the Regasm.exe registry tool and gacutil.exe. It is not a simple matter of opening an administrative command prompt, navigating to the Debug or Release folders, then running gacutil.exe and Regasm.exe with the appropriate switches, because there is probably more than one version of gacutil and Regasm on your system. Since I compiled my plug-in as a 64-bit class on a 64-bit machine using version 4.0 of the .Net framework, I found the right version of gacutil.exe under the x64 subfolder of the NETFX 4.0 Tools folder, as listed in the sample paths in Figure 3. I also had to use the version of Regasm included under the Microsoft.NET\Framework64\v4.0.30319 folder for the same reason. There aren’t many sources of information on the Internet for debugging SSDM plug-in deployment, but in at least two of them, Bogdan Crivat (one of the developers of the plug-in architecture) helped users who apparently had problems with matching 32-bit versions of Regasm with 64-bit versions of DMPluginWrapper.dll or vice-versa.[ii] He recommends checking the 64-bit registry to make sure the plug-in class was registered in the right hive, but determining which version of the registry you’re actually viewing in Windows 7 is not always straightforward. The correct versions of Regasm and gacutil may of course be located in different folder on your machines, but the principle of matching them with the correct .Net version and platform remains the same. If you receive an error using the commands in Figure 3 with the text “Failure adding assembly to the cache: The filename, directory name, or volume label syntax is incorrect, try adding the file extension to the filename, which in this case would be “.dll.” Note in Figure 3 that you can leave off the extension when removing an assembly from the cache with the –u switch, but must include it when adding an assembly, which is typically done with the –if switch.

Figure 3: Sample Commands to GAC and Register Your Plug-Innote that the paths to the appropriate gacutil and Regasm versions are likely to be different on your computer

                Of course, none of that will work if your platform or .Net Framework versions aren’t identical in the DMPluginWrapper and your plug-in class. The good news is that I have had no problems running plug-ins with .Net 4.0, even though the architecture was originally developed on .Net 2.0. If the DMPluginWrapper has a different .Net version, however, you may receive the following error when using gacutil: “Failure adding the assembly to the cache: This assembly is built by a runtime newer than the currently loaded runtime and cannot be loaded.” In this case, you’re better off setting the .Net version in DMPluginWrapper to the one in the plug-in class; for whatever reason, when I tried to reset it in the other direction, I received an error that the DMPluginWrapper has an incompatible runtime and the references are no longer valid. To make them match, I had to edit the .vcxproj file for the C++ DMPluginWrapper project by hand and change the  set ToolsVersion=”4.0″ under the Project tag near the top. Strangely, I was able to leave the <TargetFrameworkVersion> tag set to v2.0 without receiving mismatch errors with Framework version of 4.0, but I’ll leave an explanation of that mystery to C++ and deployment gurus; all I know is that it worked. Making the platform versions match was much thornier problem. Simply setting the Build/Configuration Manager setting to x64 in the DMPluginWrapper.dll project was not sufficient to get the job done. Editing the AssemblyReference tag in DMPluginWrapper.vcproj by hand to set processorArchitecture=x64 was necessary, but not sufficient. The MSDN thread How to: Configure Visual C++ Projects to Target 64-Bit Platform was critical to fixing the problem, especially the hint about changing the Preprocessor Definitions entry on the Preprocessor tab of the Project Properties. It should be set to something like “WIN64;_DEBUG;%(PreprocessorDefinitions),” yet even that does not represent a complete solution. The Target Machine on DMPluginWrapper also has to be set to MachineX64 (/MACHINE:X64); while you’re at it, you might as well verify that your plug-in class also has  a Target CPU of x64, just to make sure. Then recompile the DMPluginWrapper and add it to the GAC again, then reset the reference to it in the plug-in project to the new version.
…………It is helpful to keep the Application event log and msmdsrv.log handy, because they will indicate whether or not deployment failed using roughly similar error messages. I keep the event logs open in a Microsoft Management Concole (MMC) window at all times anyways, but this is more useful than ever when debugging plug-ins. After successful deployment, they should report messages similar to this every time the SSAS service starts: “The data mining algorithm provider (ProgID: 06da68d6-4ef0-4cea-b4dd-1a7c62801ed2) for the Internal_Name_For_My_Algorithm algorithm was loaded.” The most difficult dilemmas in plug-in development arise when you see anything other than this text, because you can’t use Visual Studio to debug errors in plug-in classes if they can’t even be loaded. The server may continue to run fine, but your plug-in code won’t be called if you run into this category of errors. Some of them are easy to pin down, like an incorrect service name, GUID or class name, which will produce the aforementioned “no such interface supported,” “does not match its…algorithm in the server configuration file” and “class not registered” messages. A “cannot be loaded. The following system error occurred” message followed by blank text is sometimes evidence of incorrect Namespace declarations in your AlgorithmMetadataBase, AlgorithmBase and AlgorithmNavigationBase files. The most difficult problems I’ve yet encountered have been “class not registered” errors reported in the logs, even though the classes are registered correctly. Your plug-in class may appear in the registry with the correct framework version, GUID and all that, but loading may still fail due to errors in AlgorithmMetadataBase that Visual Studio can’t catch. As Crivat says in the SQL Server Data Mining Managed Plug-In Algorithms Tutorial file included with the plug-in SDK,

                “Certain metadata errors do not result in exceptions. An exception thrown when the metadata is initialized would prevent the managed plug-in from loading and there is no way for the developer to detect the cause of that error. For this reason, certain errors in the metadata can be detected by inspecting the schema rowsets exposed by the Analysis Services server. Such errors include mining parameter errors and custom functions errors. If your algorithm uses mining parameters or defines custom functions, make sure to inspect the schema rowsets for algorithm parameters and mining functions, respectively, and check the parameter/function description. If an error is detected in the definition of these objects, the error will be displayed in the respective schema rowset.”

…………Custom functions and parameters are a topic that we will put off for the time being, but last week’s tutorial features a screenshots in which the SUPPORTED_MODELING_FLAGS column of DMSCHEMA_MINING_SERVICES schema rowset held a value of “Error.” This means that my plug-in loaded successfully, but that the GetSupModelingFlags function in AlgorithmMetadataBase returned an invalid value of some kind. Since AlgorithmMetadataBase returns all of the basic properties and functionality of an algorithm, SSAS will call its routines when the service starts, whereas the routines in AlgorithmBase or AlgorithmNavigationBase are not called until model processing or navigation through the results takes place. Once you’ve progressed from errors in AlgorithmMetadataBase that prevent a plug-in from loading to errors like the one in GetSupModelingFlags that do not, then you’ve crossed the Rubicon of SSDM plug-in debugging. When the service starts, SQL Server seems to call functions in AlgorithmMetadataBase that we discussed last week, in a particular order: GetServiceName; GetServiceType; GetServiceName again; GetServiceName a third time; GetServiceDescription; GetViewerType; GetDisplayName; GetPredictionLimit; GetTrainingComplexity; GetPredictionComplexity; GetExpectedQuality; GetScaling; GetSupportsDMDimensions; and GetDrillThroughMustIncludeChildren, sometimes followed by certain methods in the AlgorithmNavigationBase class. If you can get far enough down this chain, then Visual Studio will hit your breakpoints and the most difficult part of debugging will be over. You can set breakpoints in most of these, but not in the class declaration. The worst error I received turned out to be a missing tag in the AlgorithmMetadataBase class declaration: <MiningAlgorithmClass(GetType(MyAlgorithm))>, in which MyAlgorithm was the name assigned to an instance of my AlgorithmBase class. The Application and SSAS logs reported that the algorithm could not be loaded because the class was not registered, even though it was, just not correctly. To catch these types of uncatchable errors, I recommend keeping a bare bones side project based on Crivat’s original tutorial code, either in the original C# or converted to a .Net language of your choice, like VB. I finally trapped this error by simply removing my own lines of code from AlgorithmMetadataBase with his code, which I knew worked, until I finally found the missing class declaration that reproduced the error. If you debug in this way, make sure both projects reference the same version of DMPluginWrapper.dll, just to make sure that there are no discrepancies that might be overlooked. The AlgorithmMetadataBase class may also be the source of a minor error that does not show up until model deployment, at which point you may receive an error stating that you mining model does not have a key – which it won’t, unless the array returned by the GetSupInputContentTypes method includes a value of MiningColumnContent.Key. Most of the other errors associated with this class, however, are often showstoppers that are difficult to track down.
…………Once you’re past this no man’s land, AlgorithmMetadataBase is easier to debug because you can hit its breakpoints in the debugger. Once an algorithm is loaded, the locus shifts to AlgorithmBase, which is not trivial to debug in comparison to classes in other .Net project types, but is still several orders of magnitude easier than AlgorithmMetadataBase. This is the point at which keeping a Profiler trace makes sense, because it may trap processing errors that may not appear in the Application log, and will never be seen in msmdsrv.log. Some of these are easy to debug, while others are not. As mentioned last week, I prefer iterating over explicit arrays to the While and For/Next loops often seen in other tutorials, so I expected to receive more errors about indexes being out of bounds and the like. It comes with the territory. What I did not expect was that I’d end up in uncharted territory, in which Profiler would report an error like “The job completed with failure” shortly after calculating the marginal stats and reporting that it was “Training Mining Model.” This suggested that there was a bug in one of my processing routines in AlgorithmBase, but I only had a couple of them, like InsertCases, ProcessCase, LoadContent and SaveContent, plus some simple algorithm-specific calculations in the SkewnessKurtosisClass objects I mentioned in the previous tutorial. Sometimes Visual Studio would hit my breakpoints in these routines as expected but not report any errors until they had exited, without moving to any other method. Setting breakpoints in the DMPluginWrapper code they referenced didn’t help either. Evidently, the internal routines within SSAS that call the plug-in code reported the error, but there was no way to determine what those routines consisted of, or why they reported errors. As expected, in many cases I missed some array bounds, but I couldn’t view the offending values in a Watch window, or break on the exceptions, because the exceptions were not thrown until after the routines exited with apparent success. On one occasion I was able to catch one of these errors by examining the extra information included in a Profiler trace, in which I found the message “Index was outside the bounds of the array” embedded deep in the verbose output. Object instantiation errors may likewise be accompanied by the message “COM error: COM error: PluginProject; Object reference not set to an instance of an object,” but determining which object in which routine remains an open question. This is not to be confused with a similar message that often accompanies failure to load algorithm, which is usually caused by a bug in your AlgorithmMetadataBase code. The most difficult instance was when I missed some array bounds in ProcessCase, which didn’t report an error until it exited with apparent success, returned to InsertCases, then moved to SaveContent and again exited that method with apparent success. InsertsCases doesn’t call SaveContent directly; Analysis Services does that for you, in the internal msmdsrv.exe code, which you can’t read or set breakpoints in. Worse still, in this particular case, msmdsrv called a couple of other methods after the erroneous code, which may set programmers looking in the wrong direction for their bugs. Another common error seems to be “Internal error: An unexpected error occurred (file ”, line , function ”),” which occurred most often in the custom algorithm logic embedded in my SkewnessKurtosisClass methods.
…………The difficulty of debugging plug-ins decreases by another order of magnitude once you reach the point of retrieving processed results. The first method called during retrieval is AlgorithmBase.LoadContent, which merely loads the records written to disk the last time SaveContent was called during successful processing. If you process a model, shut down the machine and return a week later to view the results in the GUI, Analysis Services will call LoadContent. The good news is that failure of this routine due to errors like missed array bounds and uninstantiated objects  will often be reported in dialogs in the GUI instead of Profiler or the logs. “The MyMiningModel mining model cannot be loaded” is usually a clue that the problem is somewhere in LoadContent. On the other hand, “Execution of the managed stored procedure GetAttributeValues failed with the following error: Exception has been thrown by the target of an invocation” could be the result of errors in practically any of the methods of AlgorithmNavigationBase. Such errors are a lot easier to debug than those in AlgorithmBase, because they don’t require reprocessing and seem to follow a more easily recognizable call stack. Debugging of these methods is not a breeze, but it is much easier than the critical troubleshooting that must be done in AlgorithmMetadataBase and AlgorithmBase just to deploy a plug-in and process models with it. Since it’s a less weighty topic, we can put it off a few weeks until we can get into an in-depth discussion of how to navigate more complex trees structures with this class. Before that, however, we need to construct those trees, which means adding nodes in the Predict method of AlgorithmBase. The next step up this Rickety Stairway will be a discussion of how to use that routine to produce the predictions returned in Data Mining Extensions (DMX) queries and the GUI tools.

Figure 4: Common SSDM Plug-In Deployment Error Messages (click to enlarge)

[i] For more information, see Bogdan Crivat’s post in the MSDN thread Msmdsrv.ini (Split post): Error registering Plugin.

[ii] See the replies by Bogdan Crivat the MSDN webpages Trouble Installing SVM Plug-In on X64 Machines and Managed Plugin Algorithm with SQL Server 2008 R2. The thread Error Registering Plug-In is not directly relevant but still helpful.

A Rickety Stairway to SQL Server Data Mining, Part 14.2: Writing a Bare Bones Plugin Algorithm

By Steve Bolton

…………As I’ve said many times throughout this series of amateur mistutorials, SQL Server Data Mining (SSDM) is one of the most powerful yet neglected products Microsoft has to offer. The ability to extend it with your own algorithms is in turn the most powerful yet neglected aspect of SSDM itself. The product ships with the nine algorithms discussed earlier in this series, which are quite useful out-of-the-box but nonetheless represent just a drop in the bucket among the thousands of extant mining methods in use today – not to mention the thousands more that are expected to be developed in coming decades, as “Big Analysis” becomes a leading field in IT alongside Big Data. Despite the fact that Microsoft has largely neglected SSDM and Analysis Services for much of the past decade (particularly now that the top brass is obsessed with Tablet Panic) they still remain among the leading database server products. One of the most attractive aspects of SSDM is its extensibility, which as discussed in my last article, helps data miners avoid reinventing the wheel by taking care of such functionality as backups, security and transaction management, and access through such tools as Excel, SQL Server Integration Services (SSIS), Reporting Services and the Data Mining Expression (DMX) language processor in SQL Server Management Studio (SSMS). Despite these advantages, only a few third-party vendors have developed professional plug-in algorithms for SSDM. This is mainly due to lack of awareness of just how good the tools are, which I hope to alleviate a little by giving SSDM a little badly needed free press in this series of articles.
…………I also put a drop in the bucket on behalf of the programming language I’m most comfortable with, Visual Basic .Net, by providing what may be the only tutorial on the Internet for SSDM plug-ins written in VB. I’ve only found a few other tutorials on the subject, all of which seem to be derived from the definitive SQL Server Data Mining Managed Plug-In Algorithms Tutorial, a Microsoft Word document written by founding SSDM developer Bogdan Crivat and included with the plug-in Software Development Kit (SDK). The last article in this series described how to download and install the SDK[i] and compile the DMWrapperPlugin.dll file, which your Visual Studio project must reference. The latter step is one of the aspects of SSDM that could use a makeover after eight years of declining attention from Microsoft, since a native .Net architecture for plug-ins would probably perform better and be easier to write and maintain that the COM interop made possible by this .dll. Once it’s built, however, it is a simple matter to set a reference to it, which should be one of the first actions you take after creating a new Visual Studio class project to host your plug-in. I’m not going to explain how to create such a project, which is one of those trivial steps that often clutter programming tutorials and therefore one of the last distractions we want to deal with when doing high-level algorithm development. The project we’ll be working with for the next few tutorials is a VB class type that was originally derived from the sample code provided in Crivat’s .doc, but which is now almost completely unrecognizable after being converted from C# and substantially and repeatedly altered after that to suit my needs. Perhaps the second-best tutorial I’ve found on SSDM plug-ins is a five-year-old .pdf titled Microsoft Business Intelligence with Numerical Libraries: A White Paper by Visual Numerics, Inc., which is also written in C# and appears closely related to Crivat’s original sample code. I have yet to find a lesson in plug-in development that is not centered on simple Clustering algorithms, so I’ll take a fresh approach and illustrate how to use other data mining methods.
…………In fact, just to get us started and get SSDM to return some kind of simple result, we’ll begin by calculating a few statistics that would normally be considered far too trivial to calculate with data mining software. Skewness, kurtosis and excess kurtosis are measures of how lopsided a dataset is, calculated using well-known and relatively simple formulas which I found at this online calculator and its sister site, and verified at this webpage in the National Institute of Standards and Technology (NIST) Engineering Statistics Handbook, which is one of the most readable and concise books on stats I’ve yet come across. I selected these particular statistics because they were fairly simple, yet I knew little about them, plus I needed some practice converting simple equations into code. I won’t even bother to clutter this tutorial further by posting the code of the classes here, but I’ll provide links to the two classes I developed to implement the logic of my “algorithm,” if it is worthy of such a title. SkewnessKurtosisClass has a constructor that takes the case count, mean and variance of a particular mining model column, then stores them in class-scoped variables. It is also feeds the variance to a method that calculates the column’s standard deviation and also stores it in the same kind of variable. During algorithm training, the standard deviation, mean and an array of data points representing the column (i.e. “attribute” in data mining parlance) and an iterator to identify which column is being operated on are fed to methods that calculate skewness and kurtosis. Excess kurtosis is merely the kurtosis minus three, which provides a more human-readable value of zero for normal distributions of data (i.e. bell curves). This class also has a class-scoped TotalValue variable, which is a cumulative sum of all the raw data values for a particular column that is calculated outside the class itself. My simple project also has a DataPointClass which tracks all of the individual values, which have to be stored in an array of these objects and fed en masse to the other class in order to calculate global stats like skewness and kurtosis. The DataPointClass also has a class-scoped Boolean named IsEqualToMaxSupport, which is supposed to identify whether a particular data point is equal to the maximum value in a column, but which is not working properly yet as of this writing. None of these internal details of these particular classes are of any great importance, since the object at this point is merely to prod SSDM into returning some kind of output. In fact, the calculations for skewness and kurtosis are a little off, because I’m relying on variance values precalculated by SSDM, which uses a biased formula (one which takes into account the entire dataset) rather than the unbiased formula (one derived from an incomplete sample of a dataset) I found in the above sources. It would only take a few lines of extra code to correct this, but I’m trying to keep the number of lines to the barest possible minimum in order to diminish possible confusion.  Our purpose for now is simply to make SSDM perform the simplest of tasks, so that we can concentrate on explaining the plug-in object model. If you need to review the trivial code within these classes, you can follow these links to the SkewnessKurtosisClass.vb and DataPointClass.vb.
…………A typical plug-in project will consist of user-defined classes like the SkewnessKurtosisClass and DataPointClass that encapsulate most of the logic of your custom algorithms, as well as three abstract classes that control processing, navigation and other basic tasks: AlgorithmMetadataBase, AlgorithmBase and AlgorithmNavigationBase. Almost all of their methods must be overridden, although you can add methods, properties and other members as you see fit. Each of these classes must be prefaced with Imports statements for Microsoft.SqlServer.DataMining.PluginAlgorithms and System.Runtime.InteropServices; some sources also include System.Collections.Generic and System.Text, and although I haven’t seen any harm come from leaving them out, there’s always a first time. Each of these three classes must inherit the respective class type within Microsoft.SqlServer.DataMining.PluginAlgorithms, of course. Normally, you would also include an Implements Microsoft.SqlServer.DataMining.PluginAlgorithms.ICaseProcessor clause in the class declaration for AlgorithmBase, which is not mandatory but makes it possible to use the ProcessCase method. AlgorithmMetadataBase is the most difficult one to declare, since it contains the three lines of COM gobbledygook in between the carets depicted in Figure 1. All that does is mark the .Net class for COM interop, identify the type of AlgorithmBase associated with it and assign the class GUID that will identify it in the registry, which in this case is defined as a constant in the #Region block. Some old tutorials on COM interop say that you also need to include an InterfaceID and EventsID in the class declaration, but this is not the cases with SSDM plug-in projects.

Figure 1: A Sample AlgorithmMetadataBase Declaration

Imports Microsoft.SqlServer.DataMining.PluginAlgorithms
Imports System.Runtime.InteropServices
Imports System.Collections.Generic
Imports System.Text

Namespace PluginProject


    Public Class AlgorithmMetadataBase
       Inherits Microsoft.SqlServer.DataMining.PluginAlgorithms.AlgorithmMetadataBase

#Region “COMRegion”
        Public Const ClassId As String = “06da68d6-4ef0-4cea-b4dd-1a7c62801ed2”
#End Region

Figure 2: A Sample AlgorithmBase Declaration
Imports Microsoft.SqlServer.DataMining.PluginAlgorithms

Namespace PluginProject
    Public Class MyAlgorithm
        Inherits Microsoft.SqlServer.DataMining.PluginAlgorithms.AlgorithmBase
        Implements Microsoft.SqlServer.DataMining.PluginAlgorithms.ICaseProcessor

Figure 3: A Sample AlgorithmNavigationBase Declaration
Imports Microsoft.SqlServer.DataMining.PluginAlgorithms

Namespace PluginProject

    Public Class AlgorithmNavigator
        Inherits Microsoft.SqlServer.DataMining.PluginAlgorithms.AlgorithmNavigationBase


…………In order to work with the SSDM plug-in object model, the SSDMPlugIn.chm help file included in the same folder as Crivat’s .doc will have to become your constant companion. It mentions other interfaces, classes and class members that you can implement in COM but which are taken care of behind the scenes in .Net, so there’s no sense in even mentioning the many objects that are marked with such warnings as “do not call this function directly in your plug-in.” Once you reference the DMPluginWrapper.dll and inherit Microsoft.SqlServer.DataMining.PluginAlgorithms, Intellisense and the Object Browser will also provide invaluable clues about how to work with the object model. Crivat’s .doc file is likewise indispensable in understanding the plug-in classes. Unlike with many other native classes in the .Net architecture, there is no single definitive guide or diagram available on sites like MSDN that summarizes the plug-in object model succinctly, so you’ll have to rely on all of these sources. I’m still an amateur, fumbling my way through the object model myself, but I’ll try to pass on what little I know in the hopes that it may save other users some time and energy.
…………The most helpful advice I can give is to look at it from this perspective: whether we’re taking about mining model properties, the NODE_DISTRIBUTION table, node names, or even the metadata returned by schema rowsets (i.e., roughly equivalent to relational DMVs), pretty much every type of data we’ve seen associated with SSDM in previous tutorials must be specified somewhere in these three mandatory classes. Most of their methods are marked Must Override and are designed to supply a specific type of data associated with the model, which sharply reduces the complexity of the development process. For example, AlgorithmMetadataBase consists mainly of quite simple methods that return values equivalent to properties associated with an algorithm. As depicted in Figure 4, more than a dozen methods map directly to columns in the DMSCHEMA_MINING_SERVICES schema rowset we covered in an earlier tutorial: GetServiceName, GetDisplayName, GetServiceDescription, GetServiceType, GetExpectedQuality, GetPredictionComplexity, GetTrainingComplexity, GetViewerType, GetSupportsDMDimensions, GetSupportsDrillThrough, GetDrillThroughMustIncludeChildren, GetSupInputContentTypes, GetSupPredictContentTypes, GetSupModelingFlags, GetModelingFlagName and GetPredictionLimit. It is a simple matter of mapping the appropriate data type or enumeration value to the result you want, then executing a Return statement. So far, I have not yet found a way to specify the values for the HELP_FILE, HELP_CONTEXT, MSOLAP_SUPPORTS_ANALYSIS_SERVICES_DDL, MSOLAP_SUPPORTS_OLAP_MINING_MODELS and SUPPORTED_SOURCE_QUERY columns in DMSCHEMA_MINING_SERVICES. For each Data Mining Query (DMX) language function you allow in GetSupportedStandardFunctions, pairs of values for you algorithm and the corresponding function will be added to the SERVICE_NAME and FUNCTION_NAME columns in the DMSCHEMA_MINING_FUNCTIONS rowset. In future tutorials we’ll implement algorithm parameters in methods like GetParametersCollection and ParseParameterValue, which are depicted in the DMSCHEMA_MINING_SERVICE_PARAMETERS schema rowset. The value of the ALLOW_PMML_INITIALIZATION column in DMSCHEMA_MINING_SERVICES is specified in the SupportsPMML routine, but Predictive Model Markup Language (PMML) is a topic we’ll be putting off for a future article. Four of these routines are not mapped to any schema rowsets, because they perform special internal functions. CreateAlgorithm merely instantiates an instance of the AlgorithmBase, where all of your training and prediction takes place; GetCaseIdModeled indicates whether or not the ID of each mining case should be treated as a separate variable during that processing; GetMarginalRequirements indicates what sort of marginal statistics SSDM should precalculate for you, and ValidateAttributeSet merely provides a place to implement your own validation code to make sure the attributes of the model being input follow the format you want. Because there are so many methods in the AlgorithmMetadataBase class, the VB code for my implementation of that class is too long to post here, even with error-checking Try…Catch blocks and other extraneous clutter removed. It can be downloaded from this link though. Note how the values for the selection of DMSCHEMA_MINING_SERVICES columns in Figure 5 reflect the values I assigned in the routines of that class, including the algorithm name “Internal_Name_For_My_Algorithm.” The functions I specified in GetSupportedStandardFunctions are likewise reflected in the query of DMSCHEMA_MINING_FUNCTIONS in Figure 6.

Figure 4: Mapping  DMSCHEMA_MINING_SERVICES and Other Schema Rowsets to AlgorithmMetadataBase Routines (click to enlarge)

Figure 5 and 6: Two Schema Rowsets for My Algorithm


…………It is not even necessary to reference the most common classes in the plug-in object model within the methods of AlgorithmMetadataBase. Figure 7 depicts the objects you work with most frequently in AlgorithmNavigationBase and AlgorithmBase, with the exception of some like PMMLWriter, MiningFunctionAttribute, MininParameter, MiningTableColumRefAttribute, FlagAttribute, ClusterMembershipInfo, PredictionResult and AttributeGroup, which we will delay for later tutorials in order to reduce the amount of information glut we have to deal with this week. The data fed into the mining model case either be considered vertically, in terms of columns (i.e. attributes), or horizontally, in terms of rows (i.e. mining cases). At the apex of the object model pertaining to attributes, we have the AttributeSet object, which provides a lot of basic properties of a particular column or some pertaining to all columns, like GetAttributeCount. GetAttributeFlags returns an AttributeFlags object which provides other column properties readers will be familiar with if they have been following this column, such as whether it is an Input or Output column (or both), the Contnet type, the Distribution property, various mining flags and the like. The MarginalStats object for a particular column includes precalculated statistics for that attribute such as the training case count, as well as an AttributeStatistics object that maps to columns commonly seen in mining model results, such as the probability and support figures. AttributeStatistics objects may also contain a collection of StateStatistics which in turn map to the nested NODE_DISTRIBUTION table. The ValueType column in the NODE_DISTRIBUTION table likewise maps to the MiningValueType object, which is at the bottom of the attribute hierarchy. When looking at our data in terms of rows rather than columns, the PushCaseSet represents all of the training cases in the model, whereas the MiningCase object represents a single row of data. The StateValue object can be used to retrieve further information about a particular case. The main use of ContextServices is the CheckCancelled method, which can be employed in the AlgorithmBase class to detect whether or not a user to has tried to cancel training or prediction. ModelServices returns some basic information about a model, including some PMML properties that we will put off for a later date. The TaskProgressNotification object is designed to send messages about processing progress to traces, but this must also be deferred for now, because I have yet to solve the Access Violation exceptions I’ve been getting with them. PersistenceWriter and PersistenceReader are used in the SaveContent and LoadContent methods of AlgorithmBase to save patterns to disk and then call them back up as needed, such as when a user tries to navigate through mining model results in Visual Studio or SSMS. The values are stored in PersistenceTag enumeration values which can be cast to your liking. Statements to get or set values for these objects must be encapsulated in OpenScope and CloseScope statements. In my implementation of AlgorithmBase, it is a simple matter of iterating through the arrays of SkewnessKurtosisClass and DataPointClass arrays and saving or loading the results. When viewing my code, keep in mind that I implemented this much differently than Bogdan Crivat or any other tutorial writer I’ve found to date, who usually use the C# equivalent of For/Next or While loops to accomplish the same task. I personally prefer explicit array iteration for debugging purposes and am not yet aware of any potential performance penalties for implementing my loops in this way.

Figure 7: The Plug-In Object Model (click to enlarge)

…………The fourth column in Figure 7 shows that certain values for these objects are calculated in the main processing methods of AlgorithmBase, like InsertCases, ProcessCase and Predict, then retrieved in certain methods of AlgorithmNavigationBase, particularly GetStringNodeProperty, GetDoubleNodeProperty and GetNodeDistribution. To simplify this further, let’s map the object model in Figure 7 in the opposite direction. Anyone who has followed this series or is otherwise familiar with SSDM will recognize that the leftmost column in Figure 8 contains the names of columns in the Generic Content Tree Viewer, which depicts SSDM results in their most basic format. For simplicity’s sake I left out MODEL_SCHEMA and NODE_GUID, which are always VT_NULL or NULL respectively, because as Books Online puts it, “this column is not supported by Analysis Services” – which of course begs a question I cannot answer, of why they are included in the SSDM object model at all. The most pertinent question is this: what do you want to appear in these columns of this standardized result format? Some of the values, like MSOLAP_NODE_COLUMN, NODE_RULE, MARGINAL_RULE, NODE_DESCRIPTION, NODE_CAPTION and MSOLAP_NODE_SHORT_CAPTION are set in AlgorithmNavigationBase.GetStringNodeProperty. For each node in your result set, SSDM will loop through that method several times until it has gone through several possible values of the NodeProperty enumeration; your job is to do a Select Case based on the NodeProperty value and return a string value that will be mapped to it. Likewise, SSDM will call GetDoubleNodeProperty several times for each node and feed it a different NodeProperty enumeration value each time; simply implement a Select Case to find out which value you’re assigning that time, then supply a double value equivalent to the NODE_SUPPORT, NODE_PROBABILITY, MARGINAL_PROBABILITY or MSOLAP_NODE_SCORE you want to appear in the results for that node. It is important to note that in my example of this class, I supplied the total of all column values, the skewness and the kurtosis for the first model column under NODE_SUPPORT, MSOLAP_NODE_SCORE and NODE_PROBABILITY respectively. This is not something you’d do in the real world; I just wanted to show how such values could be assigned. Other important result values can be set in methods like GetNodeUniqueName, GetNodeType and GetUniqueNameFromNodeID. I left my implementation of methods like GetParentCount, MoveToParent, GetParentNodeId, GetChildrenCount, GetChildNodeId, LocateNode, MoveToNextTree, ValidateNodeId and GetNodeAttributes as bare as possible, because their uses can best be illustrated with more complex node structures. For the sake of simplicity, I limited this week’s tutorial to a single node, but in a future article we’ll get into more depth about how to use these other methods in AlgorithmNavigationBase.

Figure 8: Mapping the Standard SSDM Result Format to AlgorithmNavigationBase and AlgorithmBase Methods

…………How are the values retrieved in AlgorithmNavgiationBase set in the first place? The meat and potatoes of the SSDM plug-in architecture consists of three methods in AlgorithmBase which encapsulate training and prediction. For the sake of simplicity, we’ll defer discussion of the Predict method for a couple of weeks, but suffice it to say that this is where patterns are created and added to the output on the basis of DMX prediction queries. All three of these routines are appropriate places to check ContextServices for cancellation by the end user, or to emit trace messages with ProgressNotification objects. Model training occurs in InsertCases, which is called one time only between calls to the Initialize method and either SaveContent or GetNodeIDsForCase; Crivat recommends putting logic dependent on the training parameters at the beginning of this routine. That doesn’t stop you, however, from iterating over the case set several times within this routine, so many plug-in developers implement several phases of training at this point. I split mine into two phases, beginning with one in which I populate the SkewnessKurtosis classes for each attribute with certain global stats that are needed for computing the skewness and kurtosis, such as the variance. Then I call CaseSet.StartCases(Me), which calls ProcessCase one time for each row of input in the mining model. For each row within ProcessCase, retrieve the value of each column in my model and add it the DataPointClass array, then detect whether or not it is equal to the maximum value for that row and set IsEqualToMaxSupport accordingly. Just to show how cumulative statistics can be computed, I also added a running total for each attribute in this routine. After ProcessCase is finished iterating over the whole case set and SSDM returns to the remainder of the InsertCases method, it then enters a post-process phase, in which I perform the calculations of skewness, kurtosis and the like by feeding the SkewnessKurtosisClass the whole DataPointClass array for that attribute. There may be much more efficient ways of implementing this, but my object for the time being was just to get some sort of results back from SSDM, given that the most difficult aspect of plug-in development is next week’s topic, deployment and debugging.
…………Although training is implemented in just two routines, the ways in which you can flexibly implement them are practically infinite. You can call ProcessCase more than once, or not at all, or nest it within a call to itself. You can create separate phases within InsertCases that do not call ProcessCase, as I did in my example. One of the things that began to dawn on me as I rejigged my InsertCases and ProcessCase methods is that the correct choice of implementations will be dictated in large part by the scope of the data itself. Think of the dilemma as a Cartesian product of comparing different attributes, cases and particular states to each other, or in relational lingo, columns, rows and states. It is critical to identify ahead of time what combination of comparisons you’re making in your algorithm. Do you need to know how the marginal stats for Attribute A compare against Attribute B? If so, then it’s probably more apt to put that code in InsertCases, before calling ProcessCase. Pretty much anytime you need to compare the value of any row against a global stat, you’re going to want to compute that global stat early in InsertCases, before row-by-row processing (row by agonizing row, or RBAR?) in ProcessCase. On the other hand, if you need to compare a particular row value against some cumulative value for that row, it is probably more appropriate to calculate the cumulatives one row at a time in ProcessCase, then perhaps compare the final value against the value of a particular row in another iteration of ProcessCase. Fortunately, SSDM precalculates some cumulatives and global stats for you, such as the mean and biased variances for each attribute, as long as you retuned an All value in the AlgorithmMetadataBase.GetMarginalRequirements. Your algorithm may need to make comparisons between rows but within the same attribute, within the same row across different attributes, or reference histograms of particular states of an attribute, or make even more complex calculations involving cumulatives and the like. You could conceivably combine all of these permutations together and perform calculations based on different rows, across different states of different attributes, which in turn depend upon cumulatives, precalculated marginal stats or more complex global stats. Once you’ve made those determinations, another factor you need to be cognizant of is how to retrieve the upper and lower bounds of the arrays you’ll be dealing with, if you choose to implement your plug-ins with explicit iterators rather than While loops and the like. Some of the most common iterators you’ll see are StateStatistics ().Count, AttributeSet.GetAttributeStateCount, AttributeSet.GetAttributeCount, MarginalStats.GetTotalCasesCount and MarginalStats.GetCasesCount, some of which must be supplied the appropriate index numbers. Much of the code in my implementation of AlgorithmBase revolves around iterating through the attributes, mining cases, attribute statistics and other such objects and setting or getting values appropriately. It also contains a trivial method called GetNavigator to call an instance of AlgorithmNavigationBase, which in turn has a class-scoped reference to the AlgorithmBase, plus an iterator for the CurrentNode.
…………It is still too premature to get into an in-depth discussion of more complex node structures, so I’ll defer discussion of certain other methods in AlgorithmNavigationBase. In future tutorials on more advanced topics like algorithm parameters, custom mining functions, feature selection and PMML, we’ll implement some optional members of these three classes, like GetNodeIDsForCase, GetAllFeatureSelectionInfo, GetAttributeFeatureSelectionInfo, HasFeatureSelection, LoadPMMLContent and RenderPMMLContent. As mentioned earlier, the handful of tutorials I’ve found to date on the Internet have been implementations of Clustering, so I’ll refer anyone interested in other optional clustering-specific methods like ClusterMembership and CaseLikelihood to those tutorials, particularly Crivat’s and the Visual Numerics .pdf. Writing a plug-in is not trivial, but it is not as difficult as it may seem either; nevertheless, the material is thick enough that I needed to break up this tail end of the Rickety series into bite-sized chunks. For that reason, I will also put off discussion of the Predict method within AlgorithmBase, because it is possible for SSDM to at least return training statistics without it. I still have plenty to learn about the object model and every other aspect of SSDM plug-ins, but one of the leading reasons I’ll delay discussion of important topics like prediction and navigation is that debugging and deployment are the more difficult aspects of writing custom algorithms. It took me a long time to climb that mountain to get the simple results depicted in Figure 9 on my screen, after processing a two-column model with 20 rows of dummy data. Remember, I substituted a cumulative total, the skewness and kurtosis from the first column for the NODE_SUPPORT, NODE_PROBABILITY and MSOLAP_NODE_SCORE, just to show how that can be done. The calculations for skewness and kurtosis would be accurate, except for the fat that SSDM uses the biased method of variance calculation; I could have added extra steps to correct this, but didn’t want to complicate this tutorial series any further. I don’t want to turn anyone’s brain to mush unnecessarily, because readers will need fresh grey matter to get through the next mistutorial, on the many pitfalls data miners may stumble over when debugging and deploying custom algorithms. This is really the summit of the whole series as far as difficulty goes, after which we start descending back to easier tasks until we wrap up on the topic of writing data mining viewers.

Figure 9: My First Plug-In Results, for a Single Node

[i] The word “SDK” does not appear in the title of the collection of shell plug-in projects Microsoft provides, but that’s basically what it is.

At the Top of the Rickety Stairway…

by Steve Bolton

…………As expected, I’ve run into quite a few obstacles while trying to deploy and debug my first plug-in algorithm. I’ve finally emerged from “DLL Hell” and have successfully deployed the first algorithm I’ll be using in this tutorial, but still have fatal errors lurking somewhere in the AlgorithmMetadataBase class which are preventing it from loading at server startup. These types of bugs are apparently quite difficult to pin down, since they occur before a debugger can be attached. As soon as I find a way through this conundrum, I’ll finish this series on plug-in algorithms with blog posts introducing the object model, discussing common “gotchas” like this in the deployment process and covering advanced functionality like feature selection and customer parameters. I want to make sure the plug-in I wrote actually works in practice before resuming this mistutorial series, so I may not be able to finish the series until sometime in August. Stay tuned though – barring unforeseen circumstances (I don’t want to jinx myself) it will get finished. I’ve climbed so far up this rickety stairway that I’m not going to let the locked attic I’ve found at the top deter me indefinitely from finishing the series. The last installments will be on writing custom data mining viewers for SQL Server Data Mining (SSDM), which should be a simpler matter.

A Rickety Stairway to SQL Server Data Mining, Part 14.1: An Introduction to Plug-In Algorithms

by Steve Bolton

…………In my last post in this amateur series of self-tutorials on SQL Server Data Mining (SSDM), I got into a lengthy discussion of how neglected but powerful SQL Server Analysis Services (SSAS) stored procedures are. This is part of a larger pattern of under-utilization of some of the database server software’s most potent features, including the development of CLR objects to significantly extend the functionality of SSAS and T-SQL, as well as Multidimensional Expressions (MDX) and SSDM itself. SSDM can also be extended in a way that is not applicable to any other component of SQL Server, through the addition of new user-defined mining algorithms. Despite the fact that this gives SQL Server a distinct advantage over other database server mining software in one of the hottest, wide-open fields in computing, it also falls into the same category of neglected functionality. In recent years, this same set of features has also been neglected by some of the top brass at Microsoft, who don’t seem to appreciate just how powerful and well-designed these products really are, or how important they are for the software giant’s position in a segment of computing that is bound to be profitable and relevant for decades to come. The good news is that writing plug-in algorithms for SSDM is also comparable to SSAS stored procedures, in that they appear to more difficult to write than they really are.
…………The development process for a plug-in is by no means trivial, but should not be intimidating either. Most of the kinks in the process actually arise from the age of the SSDM architecture, which hasn’t been significantly updated since 2005. It is a mystery why the top brass at Microsoft has invested so few resources in extending the lead over its competitors provided by tools like MDX and SSDM, but the fact that they both remain at the pinnacle of analysis software eight years later is a testimony to how powerful and well-designed they are. The process for developing plug-ins really could use a makeover, but it’s still useable in its original form, with some adaptations for the .Net platform. The second programming language I learned (besides BASIC, which I used to write games on an old Compucolor II when I was 12) was C++, but I haven’t used it since the 1990s, when I learned to appreciate the simplicity of the Visual Basic syntax. Back in the days of the old COM architecture, I earned a Microsoft Certified Solutions Developer (MCSD) certification in VB 6.0, but likewise abandoned COM for the simplicity and power of the .Net framework like most other Windows developers. I procrastinated in regard to writing plug-ins for far too long after reading that the architecture is based on C++ and COM, but thankfully, a .Net wrapper has been available for some time that greatly simplifies the process. There are some trade-offs in using the wrappers in place of COM, particularly in terms of “marshalling data between native C++ and managed code” and the inability of Analysis Services to manage the memory of .Net plug-ins in some circumstances. For those who cannot abide these slight performance penalties, or who have a masochistic or historical appreciation for C++ and COM, check out A Tutorial for Constructing a Plug-in Algorithm, written by Microsoft Data Mining Team members Max Chickering and Raman Iyer back in August 2004. I’ll confine my discussion of SSDM plug-ins to the wrapper classes written for .Net, which is superior to the practically obsolete COM architecture in many respects and much more adaptable to my own skill set.
…………Working with the wrapper classes requires a brief initial immersion with the C++ shells, however, which I will describe in this initial post in a four or five part series on SSDM plug-ins. Ideally, we would simply write plug-ins based on native classes within the .Net framework, but Microsoft has not yet upgraded SSDM in this way yet; this is just one of a wide range of enhancements that could be made to this long-neglected but powerful component of SQL Server, in order to establish long-term dominance of the data mining market for Microsoft. In the meantime, however, we must take a few preparatory steps that .Net coders aren’t normally accustomed to. Annoyances like this should not obscure the fact that adding to the nine algorithms included out-of-the-box with SSDM is incredibly useful. As mentioned in A Rickety Stairway to SQL Server Data Mining, Algorithm 7: Clustering, there are thousands of different clustering algorithms in use today, all of which could be implemented in SSDM. There are likewise innumerable variations on other SSDM algorithms we covered in previous tutorials, like Time Series, Neural Nets, Linear and Logistic Regression, Association Rules and Naïve Bayes – not to mention others like Kohonen nets that aren’t implemented at all. As I discussed in some of the introductory posts in this series, the statistical building blocks that these algorithms are built from can be combined, tweaked and rearranged in a practically infinite number of permutations that can be suited to the particular needs of any analytical project. The sky is the limit. New mining algorithms will probably still be being written a century from now, assuming mankind is still around. Because data mining can be applied to such an exceptionally wide range of problems, which mankind has no shortage of today, the field probably won’t reach the top of its bell curve of economic utility for many decades to come, unlike some of today’s hottest tech products, like current craze for tablet-sized devices. For this reason, extensibility is crucial to the data mining industry and products like SSDM in a way that it is not for some of Microsoft’s other software packages, like Visual Studio or SQL Server Management Studio (SSMS). It is unlikely that data miners of the late 21st Century will still be using SSDM, but Microsoft could retain its head start in a field with a promising long-term future by properly maintaining and promoting SSDM.
…………Finding a comprehensive, up-to-date list of the companies that have developed algorithms with this plug-in architecture is next to impossible, but a cursory search of the Web through Google turns up such names as Comelio, Angoss and KXEN. This list is probably shorter than it ought to be for the simple reason that there just aren’t enough qualified people available yet to write algorithms for the practically infinite array of problems that mining software like SSDM can solve. The lucrative potential of the field is perhaps best illustrated by the fact that C. L. Curotto and N.F.F. Ebecken’s eight-year-old hardcover book Implementing Data Mining Algorithms in Microsoft SQL Server will set you back a whopping $214 if bought on today. Given that I’m an amateur who has yet to be paid to work in the field, I don’t have that kind of money. Fortunately, cheaper sources of information are available, such as Raman Iyer’s MSDN article Add Custom Data Mining Algorithms to SQL Server 2005 and Data Mining with Microsoft SQL Server 2008, the classic reference on SSDM written by former members of the DM Team like Jamie MacLennan, ZhaoHui Tang and Bogdan Crivat. Sources like these list some of the incalculable advantages of using SSDM the architecture rather than starting from scratch, since it integrates with the larger SQL Server and Windows environment to provide managements of security, transactions, backups and access through such tools as Excel, SQL Server Integration Services (SSIS), Reporting Services and the Data Mining Expression (DMX) language processor in SSMS. There’s usually no sense in reinventing the wheel by implementing all of this out-of-the-box functionality yourself. Writing cutting edge algorithms is a high-level activity, which means the last thing you want to be bothered with is lower level architecture details like this, not to mention the painstaking care needed to work with C++ and COM. I’m a big fan of encapsulation of this kind; after all, the overwhelming advantage of computers as a whole is that they allow you to encode blueprints and then copy them effortlessly, thereby freeing users up to perform other tasks on top of those copies. When working with data mining software, however, it’s mandatory.
…………Because the top brass at Microsoft seems to have lost sight of this fact, they haven’t updated the code for SSDM plug-ins properly by providing a real .Net interface. That means we have to jump through a few hoops to build a plug-in with a managed code language like C# or Visual Basic, but the cost in terms of time and energy are inconsequential in comparison to the benefits of being able to write your own SSDM algorithms. Because there’s a relative shortage of SSDM tutorials written with VB, I’ll be creating a Visual Studio project in that language. Before we can get to that stage, however, it is necessary to compile the C++ shells into a .dll that our VB project can reference, otherwise the object model will be invalid. The very first step in the process of writing your own algorithm is to download the Data Mining Managed Plug-in Algorithm API for SQL Server 2005 from the Microsoft Download Center webpage Feature Pack for Microsoft SQL Server 2005 – April 2006. Keep track of the installation location. The first time I attempted to build a plug-in, I downloaded it to C:\Microsoft\Data Mining Managed Plug-in API\ but for this tutorial series, I set it to C:\Program Files (x86)\Microsoft SQL Server\Data Mining Plugin. The folder should have a structure identical to the one depicted in Figure 1:

Figure 1: The Structure of the Data Mining Managed Plug-In API Folder
Plugin Folder

…………The SQL Server Data Mining Managed Plug-In Algorithms Tutorial.doc file is a much more in-depth tutorial than the one I’m providing here, so it would be wise to consult it throughout your development process. I had to lean heavily on MacLennan, et al. to get me started, but after that I relied on this .doc file by Bogdan Crivat, a leading expert on SSDM who is now with Predixion Software. He goes into much greater detail about specialized scenarios and provides several lengthy demos, which are related to the code found in the topmost folder in Figure 1. Rather than rehashing what Bogdan wrote and ending up with a pale copy of his tutorials, I’ll be contributing separate mistutorials in VB that illustrate some of the pitfalls fellow amateurs may run into on their first attempts at writing a custom algorithm. The first mistake that I made was a costly one in terms of development time: I completely overlooked the SSDMPlugIn.chm file depicted in Figure 1, which describes the object model we’ll be working with in the next several blog posts. I wasted quite a bit of time trying to tease out the structure in Visual Studio through Intellisense and the Object Browser, so please, do yourself a favor and read both this Help file and Bogdan’s tutorial before writing your custom algorithm.
…………The kicker is that you can’t get to the point of writing your first lines of code for the algorithm without first compiling the Visual Studio solution file in the DMPluginWrapper folder into a .dll. There are 46 files related to this project in the root of this folder alone, most of which are C++ headers and .cpp documents needed for this shell project. I don’t know enough about the inner workings of the plug-in architecture to state why this extraneous step is necessary, although it would obviously be simpler if users could simply work with a precompiled DMPluginWrapper.dll file, especially one that it natively .Net-compliant. Compiling the project in Visual Studio is a fairly straightforward process that doesn’t take long though, unless you run into one of the following “gotchas” that I stumbled across. Problem #1 was an error at compilation time about a file named oledbm.h being missing. If you’re working with SQL Server 2005 you can probably find it in the C:\Program Files\Microsoft SQL Server\90\SDK\Include folder, or perhaps Program Files (x86). If not, it can be downloaded from this webpage at the website, which will require registration first. I put my copy in C:\Program Files\Microsoft SQL Server\110\SDK\Include, then added that directory to the Additional Include Directories under the Configuration Properties\C++\General tab, through the Project Properties window in Visual Studio. I’m not sure why dmalgo.h, msxml6.h are included in the same .zip file, but since they are was included in the same a set of bug fixes, I replaced those files with these versions in Data Mining Plugin\DMPluginWrapper as well with no ill effects.
…………After fixing this issue, I ran into Problem #2 on the next attempt at compilation: an error message including the text “C2065: ‘DMMVALUETYPE_INTERCEPT’: undeclared identifier.” To fix this, I added the line text “#define DMMVALUETYPE_INTERCEPT ( 11) ” to oledbm.h after “#define DMMVALUETYPE_NODE_UNIQUE_NAME ( 10 ).” Why I did this is a good question that even I really can’t answer. As I’ve said throughout this series, working with SSDM is sometimes akin to traveling on the starship Enterprise, because you’re “boldly going where no man has gone before.” The same is true of SSAS as a whole, which is why my series An Informal Compendium of SSAS Errors remains one of the few centralized sources of information on the planet for certain arcane Analysis Services bugs. In that series, I wrote about certain error messages that don’t seem to be documented anywhere, by anyone, but in this case I was fortunate to find a single helpful reference: a code sample I found at a somewhat anonymous ftp site that included this extra line. This was one of only two references on Google for the search term DMMVALUETYPE_INTERCEPT. I wasn’t even able to determine whose ftp site it was to give proper credit; I couldn’t even determine what its purpose was, other than what looked to be an inventory of someone’s hard drive. I can’t begin to explain why it worked, only that it did, which will have to do. For now, I will also have to settle for a similar workaround to Problem #3, which proved to be intractable but apparently irrelevant. On the next compilation attempt, I received “error MSB3073: The command “”C:\Program Files (x86)\Microsoft SDKs\Windows\v7.0A\Bin\gacutil.exe” /i “C:\Microsoft\Data Mining Managed Plug-in API\DMPluginWrapper\Debug\DMPluginWrapper.dll” /f: VCEnd” exited with code 1. C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\Microsoft.CppCommon.targets 113.” The first workaround I applied was to restart Visual Studio with Administrator privileges as mentioned in the thread Error Code 1 When I Install an Assembly in GAC started in 2011 by the user starblue at It had no effect, but running the offending command in a prompt with elevated privileges led to the error: “Failure adding assembly to the cache: This assembly is built by a runtime newer than the currently loaded runtime and cannot be loaded.” This commonly occurs when there is a mismatch with the .Net framework version targeted by a C++ project, so I tried the solution mentioned on April 27, 2010 by Brian Muth at the MSDN webpage Trying to Change Targeted Framework in VS2010. Because the Unload Project command was greyed out in Visual Studio for some mysterious reason, I simply edited the project’s .vcproj file by hand, by adding the new node v2.0 under the node. This set the framework to .Net 2.0 but it apparently did not fix the problem, since I still received the following compilation error after reopening the project: “.NET Framework 2.0/3.0/3.5 target the v90 platform toolset. Please make sure that Visual Studio 2008 is installed on the machine.” I applied a solution I found at this webpage at the knowledge base, which involved changing the Platform Toolset from v90 to v100 on the Configuration Properties\General tab in the Project Properties. I also altered the Stdafx.h file by adding the line “#define _WIN32_WINNT 0x050” because it was unclear if the solution at the esri page called for it or not, but it led to multiple compiler warnings with the text “C4005: ‘_WIN32_WINNT’: macro redefinition.” After removing that alteration to Stdafx.h, the solution still didn’t do the trick. My next course of action was to follow a suggestion at starblue’s aforementioned thread and simply delete the offending command in the Configuration Properties\Build Events\Post-Build Events, which was listed as “”$(FrameworkSDKDir)Bin\gacutil.exe” /i “$(TargetPath)” /f.” After that the .dll compiled just fine. Perhaps there may be unforeseen consequences for not properly referencing the .dll in the Global Assembly Cache as the command is designed to do, but I have yet to encounter them while working on more other installments in this series of mistutorials.
…………After this step, it is a simple matter to create a new Visual Basic (or C# if that is your preference) project and set a reference to the DMPluginWrapper.dll you just compiled in the C++ shell project. As long as it compiles fine, the abstract classes we will derive from in next week’s tutorial will be recognized by Visual Studio. The object model really isn’t that difficult to work with, because we’re only deriving from four base classes: AlgorithmMetadataBase, AlgorithmBase, AlgorithmNavigationBase and ICaseProcessor. As we will discuss next week, the functionality embodied in the methods, properties and other members of these classes is relatively easy to follow, especially with Bogdan’s tutorial to act as a guide. The Shell folder depicted in Figure 1 even contains basic code samples for these classes in C#, which I easily converted to Visual Basic. I was a little rusty in deriving from abstract classes which slowed me down a little more, but I was up and running with a VB plugin project within a matter of minutes after debugging the DMPluginWrapper.dll compilation issues. Designing and implementing the algorithm itself was of course much more time-consuming and thought-intensive, as we shall discuss in the next installment in this series. The extra obstacles we had to hurdle in this post were trivial in comparison, but that is precisely why they could be eliminated; if the “meat and potatoes” of the problems we’re trying to solve are so difficult that they necessitate building custom algorithms, we really shouldn’t be wasting time peeling the potatoes too, so to speak. There are many overwhelming advantages to writing custom algorithms and we’ve already disposed of the most serious drawback, which is the lack of upgrades to the SSDM architecture. Like Analysis Services as a whole, SSDM merely needs a simple makeover to keep it viable for another eight years or more, through such modernizations as the development of native .Net plug-ins and the like. If the DM Team is given the resources it needs to add such enhancements, as well as better documentation and new out-of-the-box algorithms, there is no reason why Microsoft could not quickly become the industry leader in data mining. As with MDX, it is a testimony to how well the product is designed that it has held up so well for so long against its competition, despite years of relative neglect by Microsoft bosses who seem unduly obsessed with Tablet Panic. Hopefully they will recognize in time that computer technology is fanning out in two directions, one leading to miniaturized devices on one end of the spectrum and Big Data on the other. Microsoft entered the former market far too late to ever achieve dominance, but the latter will require Big Analysis software that the company doesn’t seem to know it already has.