
A Rickety Stairway to SQL Server Data Mining, Part 14.5: The Predict Method

By Steve Bolton

…………In order to divide the Herculean task of describing custom algorithms into bite-sized chunks, I omitted discussion of some plug-in functionality from previous installments of this series of amateur tutorials on SQL Server Data Mining (SSDM). This week’s installment ought to be easily digestible, since it deals with a single method in the AlgorithmBase class that is relatively easy to implement – at least in comparison to the material covered in the last four articles, which is fairly thick. For primers on such indispensable topics as the benefits of the plug-in architecture, the object model, the arrangement of nodes in the mining results and debugging and deployment of custom mining methods, see those previous posts. The Predict method is marked Must Override, which signifies that it must be implemented in order for any SSDM plug-in to compile and run successfully, but it is possible to leave the code blank as we did in these previous tutorials in order to get the bare bones plug-in to return some simple results. Until it is fleshed out, the plug-in sample code we’ve been working with in recent weeks won’t return any results for prediction queries.
…………As I mentioned in those previous posts, I’m writing the sample plug-in code in Visual Basic .Net, in part because there’s a distinct lack of SSDM code written in VB on the Internet, and secondly because it’s my favorite language; I am much more familiar with it than C#, which is what most of the available plug-in tutorials and sample code are written in. Like most of the code I’ve posted so far in this segment of the Rickety series, my implementation of the Predict method originally came from the samples provided by Bogdan Crivat (one of the founding SSDM programmers) in the plug-in software development kit (SDK). After converting it to VB using online tools, changing the variable names to my liking, adding implementations of some objects found in the SDK help file and otherwise reworking it, the code is now practically unrecognizable. The design pattern and logical flow are pretty much the same, however, and would not have worked at all without Crivat’s indispensable sample code to guide me. My goal is merely to describe the basic tasks any plug-in programmer typically has to perform in order to return the simplest of results from prediction queries, which is a fairly straightforward task. That does not mean, however, that the Predict method cannot be insanely complex when a particular algorithm requires really advanced functionality; like most of the other plug-in class methods we’ve covered so far, there is plenty of room to add intricate conditional logic when the situation calls for it. An introductory tutorial like this is not one of those situations, because of the risk of clutter, so I’ll just stick to the basics of what can or must be done in the Predict method. I’ll leave fancier implementations of that functionality up to the imaginations of intrepid plug-in programmers.
…………Most of the plug-in methods we’ve dealt with so far are commonly linked to a couple of classes in the object model, but the relationship between the Predict method and the PredictionResult class is even tighter than in many of those cases. SSDM automatically supplies the method with MiningCase objects, which we worked with back in A Rickety Stairway to SQL Server Data Mining, Part 14.2: Writing a Bare Bones Plugin Algorithm, plus a single PredictionResult that reflects the structure of the prediction query specified by the end user. For example, the class has a MaxPredictions member that is equivalent to the n parameter of the Predict function, to limit the number of records returned; an IncludeNodeID member to return a string value for the NODE_ID when the user specifies the INCLUDE_NODE_ID flag in the query; and AttributeSortBy and AttributeSortDirection members that describe the arrangement of columns specified in the ORDER BY clause. If the user also used the INCLUDE_STATISTICS flag, then SSDM will also return a NODE_DISTRIBUTION table with the results, limited to the number of states set in the MaxStates property. Any order assigned to this dependent table will likewise be reflected in the StateSortBy and StateSortDirection properties. StateSortBy and AttributeSortBy both take values of the SortBy enumeration, which includes values like SortBy.AdjustedProbability, SortBy.Probability, SortBy.Support, SortBy.Variance and SortBy.None, while the other two properties take either SortDirection.Ascending or SortDirection.Descending values. The good news is that we don’t have to worry about any of these sorting properties, because as Crivat says in the SDK tutorial, “…you do not need to take them into account during the prediction, except if your patterns layout allows you to dramatically optimize the performance. The infrastructure will push the attribute predictions in a heap structure, sorted by probability, support, or adjusted probability as requested in the prediction operation.” In other words, it’s an advanced lesson far beyond the scope of this crude tutorial series. I imagine that in such a scenario, you would add conditional logic to alter the results returned by the query, depending on which sort method was in use. For example, you might implement a Select Case like the one in Figure 1, which could also be performed on the PredictionResult.StateSortBy value:

Figure 1: Sample Conditional Logic for a SortBy Value

Select Case PredictionResult.AttributeSortBy

    Case SortBy.AdjustedProbability

    Case SortBy.Probability

    Case SortBy.Support

    Case SortBy.Variance

    Case SortBy.None

End Select

…………That begs the question: what belongs in each Case clause? The answer might include any of the remaining operations we’ll perform with PredictionResult in the rest of this tutorial, many of which could be encapsulated in one of these Case statements. A typical implementation of the Predict method would begin with a While loop like the one in Figure 2, which iterates over the OutputAttributes collection of the supplied PredictionResult. The PredictionCounter variable also limits the loop to the value of the MaxPredictions property and is incremented with each iteration just before the End While, which occurs near the end of the routine. The AttributeSet.Unspecified variable supplies the index of the currently selected output attribute in the AttributeSet collection; it is zero-based, so we won’t be subtracting one from some of the arrays we’ll deal with in the meat and potatoes of the routine, unlike in some other methods we’ve dealt with in the past. I normally prefer to iterate over explicit arrays, but had to use a While loop in this case because AttributeGroup objects like the OutputAttributes member expose only two methods to control iteration, Next and Reset. The Select Case that follows the While declaration merely determines whether or not the currently selected attribute is disallowed by the PredictionResult.InputAttributeRestrictions property, which takes InputAttributesTreatment values. If the attribute is allowed, we create an AttributeStatistics object to hold the results we’ll be returning.

Figure 2: Complete Sample Code for the Predict Method
Protected Overrides Sub Predict(InputCase As Microsoft.SqlServer.DataMining.PluginAlgorithms.MiningCase, PredictionResult As Microsoft.SqlServer.DataMining.PluginAlgorithms.PredictionResult)

    PredictionResult.OutputAttributes.Reset() 'an AttributeGroup object

    Dim PredictThisAttributeBoolean = False
    Dim PredictionCounter As UInt32 = 0 'for determining whether or not we've reached the limit that may be defined in PredictionResult.MaxPredictions

    'I'm not sure yet, but I think MaxPredictions is unlimited when = 0
    While PredictionResult.OutputAttributes.Next(AttributeSet.Unspecified) = True And PredictionCounter < PredictionResult.MaxPredictions

        'decide whether the current attribute may be predicted, by comparing its Input/Output flags against the restrictions
        Select Case PredictionResult.InputAttributeRestrictions

            Case PredictionResult.InputAttributesTreatment.ExcludeInputAttributes
                'skip the attribute if it is an input
                PredictThisAttributeBoolean = Not Me.AttributeSet.GetAttributeFlags(AttributeSet.Unspecified).ToString Like "*Input*"

            Case PredictionResult.InputAttributesTreatment.PredictOnlyInputAttributes
                'skip the attribute if it is an output
                PredictThisAttributeBoolean = Not Me.AttributeSet.GetAttributeFlags(AttributeSet.Unspecified).ToString Like "*Output*"

            Case PredictionResult.InputAttributesTreatment.None 'this one really belongs in the Case Else, but I want to demonstrate it

                PredictThisAttributeBoolean = True

            Case Else

                PredictThisAttributeBoolean = True

        End Select

        If PredictThisAttributeBoolean = True Then

            Dim AttributeStatsToAdd As New AttributeStatistics()
            AttributeStatsToAdd.Attribute = AttributeSet.Unspecified 'the target attribute
            AttributeStatsToAdd.AdjustedProbability = MarginalStats.GetAttributeStats(AttributeSet.Unspecified).AdjustedProbability
            AttributeStatsToAdd.Max = MarginalStats.GetAttributeStats(AttributeSet.Unspecified).Max
            AttributeStatsToAdd.Min = MarginalStats.GetAttributeStats(AttributeSet.Unspecified).Min
            AttributeStatsToAdd.Probability = MarginalStats.GetAttributeStats(AttributeSet.Unspecified).Probability
            AttributeStatsToAdd.Support = MarginalStats.GetAttributeStats(AttributeSet.Unspecified).Support

            'test to see if the Missing state ought to be returned (SSDM will filter out data that is not requested in the query, but there's no point wasting resources letting it get to that point)
            If PredictionResult.IncludeStatistics = True Then

                For I As Integer = 0 To MarginalStats.GetAttributeStats(AttributeSet.Unspecified).StateStatistics.Count - 1

                    If (I <> 0 And I > PredictionResult.MaxStates) Then Exit For 'exit if we've hit the MaxStates, if there is a limit

                    If MarginalStats.GetAttributeStats(AttributeSet.Unspecified).StateStatistics(I).Value.IsMissing = True And PredictionResult.IncludeMissingState = False Then
                        'don't add this node, because it's the missing state

                    Else 'in all other cases, add it
                        'AttributeStatsToAdd.StateStatistics.Add(MarginalStats.GetAttributeStats(AttributeSet.Unspecified).StateStatistics(CUInt(I)))

                        Dim SkewnessState As New StateStatistics()
                        SkewnessState.ValueType = MiningValueType.Other
                        SkewnessState.Value.SetDouble(SkewnessKurtosisClassArray(AttributeSet.Unspecified).Skewness) 'don't subtract 1, because Unspecified is zero-based
                        SkewnessState.AdjustedProbability = MarginalStats.GetAttributeStats(AttributeSet.Unspecified).StateStatistics(CUInt(I)).AdjustedProbability
                        SkewnessState.Probability = MarginalStats.GetAttributeStats(AttributeSet.Unspecified).StateStatistics(CUInt(I)).Probability
                        SkewnessState.Support = MarginalStats.GetAttributeStats(AttributeSet.Unspecified).StateStatistics(CUInt(I)).Support
                        SkewnessState.ProbabilityVariance = MarginalStats.GetAttributeStats(AttributeSet.Unspecified).StateStatistics(CUInt(I)).ProbabilityVariance
                        SkewnessState.Variance = MarginalStats.GetAttributeStats(AttributeSet.Unspecified).StateStatistics(CUInt(I)).Variance
                        AttributeStatsToAdd.StateStatistics.Add(SkewnessState)
                    End If
                Next I
            End If

            If PredictionResult.IncludeNodeId Then
                AttributeStatsToAdd.NodeId = AttributeSet.GetAttributeDisplayName(AttributeSet.Unspecified, False)
            End If

            'add the result to the PredictionResult collection
            PredictionResult.AddPrediction(AttributeStatsToAdd)

            PredictionCounter = PredictionCounter + 1 'increment it only if a prediction is actually made (which it may not be, if a comparison of the Input/Output flags to InputAttributeRestrictions leads to an exclusion)

        End If

    End While
End Sub

…………This sample (I apologize for the sloppy code formatting, but the WordPress GUI is sometimes more challenging to work with than SSDM Extensibility) merely assigns the values of a column’s MarginalStats object to the new AttributeStatistics object, which ought to be a familiar operation at this point in the tutorial series. It then tests the IncludeStatistics member to see if we ought to include the NODE_DISTRIBUTION table, followed by a test of the IncludeMissingState property, then creates a new StateStatistics object if everything passes. The VALUETYPE of the single NODE_DISTRIBUTION table row this represents is set to MiningValueType.Other, and the AdjustedProbability, Probability, Support, ProbabilityVariance (i.e. Standard Deviation, or StDev) and Variance are set to the corresponding values in the MarginalStats object for that column. The most important part is the Value.SetDouble statement, which sets the ATTRIBUTE_VALUE to the skewness statistic for that column, which we computed in previous tutorials. The StateStatistics object is then added to the AttributeStatistics object. Note how that section of code resembles the assignments of state stats in Post 14.4. You could conceivably add multiple rows of StateStatistics for each AttributeStatistics, or multiple AttributeStatistics objects for each column you return, depending on the needs of your algorithm. The next order of business is to add conditional logic to deal with instances in which IncludeNodeId is True; in this case, I’m simply returning the full name of the output column. The newly populated AttributeStatistics object is then added to the collection of predictions that the method will return, the PredictionCounter variable I created is incremented, then the While loop either makes another pass or the method returns.
…………The logic you implement in the Predict method might be complex, but the basics of its operation are not, in part because most of the PredictionResult properties are set by the query writer and are therefore read-only. Debugging of the Predict method is far easier than with the AlgorithmNavigationBase class we discussed in the last post, which is in turn far less of a hassle than troubleshooting the other AlgorithmBase members and several orders of magnitude easier than dealing with errors in the AlgorithmMetadataBase class. One of the few issues I ran into initially was a set of errors whose text included the word “GetResultFromAttStats,” which might belong on Jeannine Takaki’s incredibly valuable thread on undocumented SSDM stored procedures. Presently, it is so undocumented that it only returns two hits on Google. One of these is an MSDN thread titled problem with managed code plug-in to analysis services, which had no replies. I was fortunate to find this sentence in Gaël Fatou’s invaluable reply at the MSDN thread Managed Plugin-Algorithm: “…if  (trainingStats.StateStatistics[0].Value.IsMissing) will not work, as it will always only consider the first state (which is always missing). The current state must be considered instead: if  (trainingStats.StateStatistics[nIndex].Value.IsMissing).” This is due to an error in the SDK sample code that has apparently bitten at least one other plug-in developer, which necessitates the replacement of the [0] in the StateStatistics[0].Value.IsMissing statement with the proper state stats index. I never encountered GetResultFromAttStats errors after applying this solution, but first-time plug-in developers ought to be aware of this stumbling block in advance. I also received a couple of “Object reference not set to an instance of an object” errors that referenced GetAttributeFlags, a method of AttributeSet that I was calling incorrectly in my IncludeNodeID code. Another problem I encountered that was equally easy to fix was forgetting to enable the prediction functions I called in my test query, which led SQL Server Management Studio (SSMS) to return error messages with the text “…function cannot be used in the context” every time I ran the queries in Figures 3 and 4. It was a simple matter to edit the code of GetSupportedStandardFunctions in AlgorithmMetadataBase to return the correct SupportedFunction enumeration values. Out of the 37 available values, I selected PredictAdjustedProbability, PredictCaseLikelihood, PredictHistogram, PredictProbability, PredictSupport, PredictStdDev, PredictVariance, PredictScalar and PredictNodeId, which enabled me to return the results in the two figures below. I also enabled PredictTable, CustomFunctionBase and IsInNode for future experimentation. The .chm included with the SDK mentions that other Data Mining Extensions (DMX) functions like DAdjustedProbability, DNodeId, DStdDev, DVariance, DSupport, PredictAssociation and PredictSequence are also dependent on the output of the Predict method, but I didn’t implement some of these because they’re applicable to other types of algorithms, which are beyond the scope of this series. Note that the PredictHistogram function in the top graphic includes a row for the Missing state, while the first row contains the same values as the PredictSupport, PredictProbability, PredictAdjustedProbability, PredictVariance and PredictStDev functions in the bottom graphic. The Predict function at the bottom returns the same skewness value for the InputStat column we saw in last week’s post, as well as the name of the InputStat column for the PredictNodeID function. Most of these columns are standard return values in prediction queries, but in next week’s tutorial, I’ll demonstrate how to use custom mining functions to return more flexible result formats and perform tailor-made processing.
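…………In case it saves other programmers a few minutes, here is a minimal sketch of the GetSupportedStandardFunctions fix described above. It assumes the method returns an array of SupportedFunction values, as it does in the SDK samples:

Protected Overrides Function GetSupportedStandardFunctions() As SupportedFunction()
    'enable the prediction functions called in the test queries; any function left out of
    'this array will trigger the "…function cannot be used in the context" error in SSMS
    Return New SupportedFunction() {SupportedFunction.PredictAdjustedProbability,
        SupportedFunction.PredictCaseLikelihood, SupportedFunction.PredictHistogram,
        SupportedFunction.PredictProbability, SupportedFunction.PredictSupport,
        SupportedFunction.PredictStdDev, SupportedFunction.PredictVariance,
        SupportedFunction.PredictScalar, SupportedFunction.PredictNodeId,
        SupportedFunction.PredictTable, SupportedFunction.CustomFunctionBase,
        SupportedFunction.IsInNode}
End Function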

Figures 3 and 4: The Results of Various DMX Prediction Functions
[images: PredictMethodResults1, PredictMethodResults2]

A Rickety Stairway to SQL Server Data Mining, Part 14.4: Node Navigation


By Steve Bolton

…………The pinnacle of difficulty in this series of amateur self-tutorials on SQL Server Data Mining (SSDM) was surmounted in the last installment, when we addressed the debugging and deployment of custom plug-in algorithms. From here on in, we will be descending back down the stairway, at least in terms of exertion required to squeeze practical benefit out of the lessons. In our first steps we’ll have to deal with a few aspects of plug-in development that were deferred in order to divide the tutorials into bite-sized chunks, particularly the topic of node navigation. The construction and navigation of a hierarchy of nodes out of our mining results take place almost exclusively in the AlgorithmNavigationBase class, which we touched on briefly in A Rickety Stairway to SQL Server Data Mining, Part 14.2: Writing a Bare Bones Plugin Algorithm. In order to display rudimentary results in a single node in the Generic Content Viewer by the end of that article, it was necessary to return trivial or dummy values for many of the 18 methods in that class which were marked Must Override. The first bit of good news is that we’ve already introduced GetStringNodeProperty, GetDoubleNodeProperty and GetNodeDistribution, the methods responsible for assigning the values of familiar columns from SSDM’s common metadata format, like NODE_DESCRIPTION, NODE_RULE, MARGINAL_RULE, NODE_PROBABILITY, MARGINAL_PROBABILITY, NODE_DISTRIBUTION, NODE_SUPPORT, MSOLAP_MODEL_COLUMN, MSOLAP_NODE_SHORT_CAPTION and NODE_CAPTION. The second good news item is that debugging of AlgorithmNavigationBase is much simpler than that of AlgorithmBase, where most of the number-crunching occurs. It is several orders of magnitude easier than debugging AlgorithmMetadataBase, where errors can be difficult to track down since they cause SQL Server Analysis Services (SSAS) to skip loading faulty algorithms when the service starts up. There are some pitfalls to constructing a node hierarchy out of your mining results, however, which can still be tricky to overcome.
…………Debugging of this class is much easier in part because we simply refresh the results in the Visual Studio GUI or SQL Server Management Studio (SSMS) in order to trigger breakpoints set here, whereas in AlgorithmBase or AlgorithmMetadataBase it is often necessary to reprocess mining models or restart the service. Restarts are still necessary after you deploy changes to the code, but not if you’re simply stepping through the results again. Expanding a closed node in the Generic Content Viewer may also cause various routines in AlgorithmNavigationBase to be called, which will activate any breakpoints you’ve set there in Visual Studio – assuming you’ve set them as prescribed in the last article. One of the more common errors you may see while debugging SSDM navigation may be worded like this: “Execution of the managed stored procedure GetAttributeValues failed with the following error: Exception has been thrown by the target of an invocation. The specified marginal node cannot be found in the mining model ‘PluginModel’.” This message indicates that the culprit may be an inappropriate choice of data mining viewer types, which are defined when the service starts by the GetViewerType routine of AlgorithmMetadataBase. I encountered this frequently while debugging this project because I initially returned a value of MiningViewerType.MicrosoftNeuralNetwork in that routine. The plug-in we developed a couple of tutorials ago merely returns a few global statistics like skewness, kurtosis and excess kurtosis, so it is not surprising that the neural net viewer would be a poor choice. Curiously, every attempt I have made since that rookie mistake to rectify it by substituting a different MiningViewerType enumeration value in GetViewerType has produced a more serious GetAttributeValues error. The text is the same except for the final line, which instead mentions the name of our practice plug-in algorithm at the end: “The targeted stored procedure does not support ‘Internal_Name_For_My_Algorithm’.” It apparently takes a little more work to change the viewer type of an algorithm once it’s been set, at least in some cases; it’s not a problem I’ve solved yet, but one plug-in developers should be wary of. It proved to be merely an inconvenience for the purposes of this series, since selecting Generic Content Viewer from the dropdown list in the Visual Studio GUI or SSMS works without errors whenever I return the original value of MicrosoftNeuralNetwork. Whenever I returned results in a new window in one of those tools, I would receive the first GetAttributeValues error, then simply select the Generic Content Viewer, which is the only one we need to worry about in a bare bones plug-in like this.
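…………To make that scenario concrete, here is a minimal sketch of the GetViewerType implementation in question, assuming the routine simply returns a MiningViewerType value as it does in the SDK samples:

Protected Overrides Function GetViewerType() As MiningViewerType
    'the rookie mistake described above: the neural net viewer is a poor match for an
    'algorithm that merely returns a few global statistics, but substituting another
    'value after the fact produced the more serious GetAttributeValues error
    Return MiningViewerType.MicrosoftNeuralNetwork
End Function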
…………Developers might notice that there is no GetAttributeValues member in the object model or the DMPluginWrapper class it is derived from, via the means outlined in A Rickety Stairway to SQL Server Data Mining, Part 14.1: An Introduction to Plug-In Algorithms. Nor does either one expose the GetAttributeScores routine, which is mentioned in some of the arcane error messages SQL Server occasionally returns when processing the nine algorithms included with the product, which we covered much earlier in this series. These are internal routines within the SSAS process, msmdsrv.exe, which we can’t assign breakpoints to in Visual Studio, let alone debug and recompile. By good fortune, I stumbled across a list of 20 such undocumented internal routines at Jeannine Takaki’s invaluable Technet post A Guide to the (Undocumented) System Stored Procedures for Data Mining, at the last minute while writing up this article. She explains the uses of each in detail and how to use the CALL syntax on them in Data Mining Expressions (DMX) code. GetAttributeValues and GetModelAttributes are the only two undocumented routines on the list which can be used with any one of the nine out-of-the-box algorithms. GetAttributeScores applies specifically to neural nets and Linear Regression. Some of the routines are specific only to the two Microsoft-supplied clustering algorithms, such as GetClusterCharacteristics, GetClusterDiscrimination, GetClusterProfiles and GetClusters, whereas ARGetNodes, GetItemsets, GetRules and GetStatistics only apply to the Association Rules algorithm. Linear Regression and Decision Trees likewise are the only data mining methods which make use of CalculateTreeDepth, DTAddNodes, DTGetNodeGraph, DTGetNodes and GetTreeScores, while GetAttributeCharacteristics, GetAttributeDiscrimination, GetAttributeHistogram and GetPredictableAttributes can only be used with Naïve Bayes mining models. I have yet to experiment with them myself, but they all seem to be related to the retrieval of mining model results, which means you may encounter them when developing plug-ins that use the corresponding mining viewer types.
…………It is imperative to be conscious of the order in which SSDM triggers the 18 mandatory methods in AlgorithmNavigationBase when you retrieve the mining results in the GUI. The sequence begins in the GetNavigator routine of AlgorithmBase, which instantiates a new instance of AlgorithmNavigationBase, which in turn calls any code you’ve included in the constructor. In my case, all the New method does is set a reference to a class-scoped instance of the algorithm and a Boolean which indicates whether or not the navigation operation is being performed on a data mining dimension. I cannot verify that the rest of the sequence is the same for all mining models or all algorithms, but to date, I have seen SSDM invariably trigger the GetCurrentNodeID routine next. The next seven calls are to routines which set either the type or the name of the node: GetNodeType, GetNodeUniqueName, GetUniqueNameFromNodeID, GetNodeType again, a single call to GetStringNodeProperty to assign the NODE_CAPTION value, then GetNodeUniqueName and GetUniqueNameFromNodeID once more. I’m not sure what the purpose of repeating the calls to GetNodeType, GetNodeUniqueName and GetUniqueNameFromNodeID is; I only know that it occurs. The code for these routines is fairly trivial. GetNodeUniqueName has no parameters and merely calls GetUniqueNameFromNodeId, which merely converts the NodeID integer parameter supplied by SSDM into a string value. GetNodeIdFromUniqueName performs the same function as GetUniqueNameFromNodeID, except in reverse; in my example, I merely took the NodeUniqueName supplied by SSDM as a parameter and converted it to a 32-bit integer. I suppose other plug-in developers might have a genuine need to develop a much more complex naming scheme than in my trifling example of this class (which can be downloaded here), but for our purposes, there’s no real reason to complicate things. GetCurrentNodeID is likewise called frequently during the retrieval of mining results, but only returns the value of the CurrentNode, a class-scoped integer that identifies the node the navigation class is currently assigning values for. In the implementations I’ve seen so far, such as the tutorial that founding SSDM developer Bogdan Crivat included in the plug-in software development kit (SDK) and Microsoft Business Intelligence with Numerical Libraries: A White Paper by Visual Numerics, Inc., the value of the CurrentNode is not set in this routine, merely returned. So how do we direct SSDM to operate on a different node – or for that matter, determine the number and order of the nodes? That doesn’t occur in GetNodeType, which merely returns a value of the NodeType enumeration to assign the NODE_TYPE value we’ve seen returned throughout this series in SSDM’s common metadata format. You might need to know the number of the node in order to assign the proper value, which can be any one of the following: AssociationRule; Cluster; CustomBase; Distribution; InputAttribute; InputAttributeState; Interior; ItemSet; Model; NaiveBayesMarginalStatNode; NNetHiddenLayer; NNetHiddenNode; NNetInputLayer; NNetInputNode; NNetMarginalNode; NNetOutputLayer; NNetOutputNode; NNetSubNetwork; None; PredictableAttribute; RegressionTreeRoot; Sequence; TimeSeries; Transition; Tree; TSTree; and Unknown. In other words, we’d assign None, Unknown or one of the 25 NODE_TYPE values we’ve seen often throughout this series. In my implementation, I performed a Select Case on the value of CurrentNode and set a type of Root when it equaled zero and InputAttribute in all other cases.
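…………Since the code for these routines is so trivial, a single sketch can cover all four. The signatures below are assumptions based on the descriptions above and may differ slightly from the actual DMPluginWrapper declarations; CurrentNode is the class-scoped integer index already mentioned:

Protected Overrides Function GetNodeUniqueName() As String
    'no parameters; merely delegates to GetUniqueNameFromNodeId for the current node
    Return GetUniqueNameFromNodeId(CurrentNode)
End Function

Protected Overrides Function GetUniqueNameFromNodeId(NodeId As Integer) As String
    'merely converts the integer node index supplied by SSDM into a string value
    Return NodeId.ToString()
End Function

Protected Overrides Function GetNodeIdFromUniqueName(NodeUniqueName As String) As Integer
    'the same conversion in reverse: parse the supplied name back into a 32-bit integer
    Return Convert.ToInt32(NodeUniqueName)
End Function

Protected Overrides Function GetNodeType() As NodeType
    'a Select Case on CurrentNode: the root node gets the Root type, all others InputAttribute
    '(substitute Model or another NodeType value if your wrapper version lacks Root)
    Select Case CurrentNode
        Case 0
            Return NodeType.Root
        Case Else
            Return NodeType.InputAttribute
    End Select
End Function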
…………Decisions already have to be made based on the CurrentNode value, yet we have not yet assigned it anywhere in code. As mentioned in the last couple of tutorials, one of the trickiest aspects of debugging AlgorithmBase and AlgorithmMetadataBase is that calls to methods appear to come out of thin air, because they originate within msmdsrv.exe. It is a different expression of the same problem embodied in the hidden calls to undocumented methods like GetAttributeValues and GetAttributeScores. In AlgorithmNavigationBase, this is manifested in apparently uncontrollable calls to the 18 mandatory methods, directed by msmdsrv.exe, which seem to leave developers without any well-defined place to specify the very node structure these routines operate on. That is not to say that these method calls are random; msmdsrv seems to follow a fairly consistent pattern, in which the calls listed above are followed by an invocation of GetParentCount, then GetNodeAttributes. After this, GetNodeUniqueName is usually followed by another call to GetUniqueNameFromNodeId, then both are repeated again, followed by yet another call to GetNodeType. The NODE_CAPTION is typically set again in GetStringNodeProperty for some reason, followed by further calls to GetNodeUniqueName and GetUniqueNameFromNodeId. SQL Server then invokes GetChildrenCount and GetParentCount, followed by three calls to GetStringNodeProperty in which the NODE_DESCRIPTION, NODE_RULE and MARGINAL_RULE are set; NODE_PROBABILITY and MARGINAL_PROBABILITY are then set in two separate calls to GetDoubleNodeProperty; the structure of the NODE_DISTRIBUTION table is then set in a single call to GetNodeDistribution, followed by alternating calls to GetDoubleNodeProperty and GetStringNodeProperty to set the NODE_SUPPORT, MSOLAP_MODEL_COLUMN, MSOLAP_NODE_SCORE and MSOLAP_NODE_SHORT_CAPTION. It was trivial to set the values of these in Post 14.2, when we were only dealing with a single node. If we’re dealing with multiple nodes, we’d need to insert Select Cases on the CurrentNode value to assign the correct values of these columns to the appropriate nodes, but it wouldn’t normally be suitable to change the value of CurrentNode itself here. After this, SSDM usually calls LocateNode twice, which merely invokes ValidateNodeID, i.e. the routine where you’d embed any custom code to authenticate your node structure; in my algorithm, I simply added code to throw an error if the CurrentNode value exceeded the count of the number of attributes, as shown in the sketch below. Somewhere in this vicinity, SSDM will call GetChildrenCount and sometimes MoveToChild, as well as GetParentNodeID and MoveToParent in some cases. Correctly specifying the cases when those particular routines are hit is the key to constructing your node network, rather than a call to a method that simply declares the size and shape of the whole network explicitly. After this, SSDM will usually call GetCurrentNode and repeat this whole sequence over again until it reaches the final node – which it may never do, if you don’t specify the node structure correctly.
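…………A sketch of the validation just described might read as follows; the method name and signature are assumptions based on the wrapper, and GetAttributeCount is a stand-in for however your algorithm exposes its attribute count:

Protected Overrides Sub ValidateNodeId(NodeId As Integer)
    'throw an error if the supplied index exceeds the count of the attributes
    If NodeId > CurrentAlgorithm.AttributeSet.GetAttributeCount() Then
        Throw New System.ArgumentOutOfRangeException("NodeId", "The node structure contains no such node.")
    End If
End Sub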
…………In a typical program, you’d expect to set a fixed number of nodes, perhaps in a class-scoped multi-dimensional array or collection that also determined the structure of the relationships between them. In SSDM plug-ins, however, the structure is specified at almost primordial levels, through calls to methods that determine the counts of each node’s parents and children. The numbers returned by GetParentCount and GetChildrenCount in turn determine how many times SSDM will iterate through MoveToParent and MoveToChild respectively. This group of methods also includes GetChildNodeId and GetParentNodeID, in which you merely return the parameter value supplied by SSDM, as far as I can tell. In my implementation, I merely wanted to return a root node with global statistics, followed by two child nodes representing two columns of input data I supplied to my mining model. Since the root node has no parents, I returned zero for GetParentCount when the CurrentNode was equal to zero, i.e. the unique index for the single root node; since the two input attributes have just one parent, I returned a value of one in all other cases. In GetChildrenCount, however, I returned a value equal to the number of input attributes when the CurrentNode matched the root node. Since the root node is the only parent, I returned zero when the CurrentNode is equal to any other value, since these are input attributes, which have only one parent in this case. The CurrentNode value is only changed in one place in this trifling plug-in, in the MoveToChild routine, where it is increased by one in each iteration; the number of iterations will be equal to the value assigned to the single root node in GetChildrenCount (see the sketch below). In my plug-in, MoveToParent is never invoked because the only parent specified in GetParentCount has already been passed once, but it may be triggered in structures that require more complex implementations of GetParentCount and GetChildrenCount. This means of specifying the node structure is so basic that simple errors in your logic can cause this sequence of method calls to repeat itself indefinitely, until you receive a warning from SSMS that the XMLA query timed out. If you plan to develop plug-ins, expect to see a lot of these. Also expect to set aside more time than usual for the code that specifies your node structure, since it is done in an unconventional way that is not easy to control. I would prefer to deal with explicit definitions of those structures in old-fashioned class-scoped arrays or collections, which might be easier to manage when dealing with particularly complex node setups; yet perhaps there are sound reasons for the way Microsoft implemented this, maybe as a performance enhancement of some kind. Just keep in mind that most of the time you spend in AlgorithmNavigationBase will be devoted to specifying your node setup. Most of its methods are trivial, sometimes to the point of merely returning the values SSDM supplies in function parameters. Likewise, even the most complex node setups may not require many lines of code, but that doesn’t mean they won’t be a challenge to arrange properly. It is a good idea to have a clear idea of the structure you want beforehand, then to keep a close eye on the conditional logic in GetChildrenCount and GetParentCount, as well as paying close attention to the way in which you increase the value of CurrentNode or any other index placeholder variable you might use.
Expect to iterate through the sequence of method calls outlined above many times during debugging before you get it right; it is not a bad idea to set breakpoints in many of them keyed to the Hit Count or Condition (which can be set in Visual Studio by right-clicking on the breakpoints).
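…………To recap that structure in code form, a minimal sketch of the three key routines might read like this, under the same assumptions as the earlier sketches (a class-scoped CurrentNode index, a CurrentAlgorithm reference and a GetAttributeCount stand-in):

Protected Overrides Function GetParentCount() As Integer
    'the root node has no parents; each input attribute node has exactly one parent, the root
    If CurrentNode = 0 Then Return 0
    Return 1
End Function

Protected Overrides Function GetChildrenCount() As Integer
    'the root's children are the input attributes; the attribute nodes are leaves
    If CurrentNode = 0 Then Return CurrentAlgorithm.AttributeSet.GetAttributeCount()
    Return 0
End Function

Protected Overrides Sub MoveToChild(ChildIndex As Integer)
    'the only place in this trifling plug-in where CurrentNode changes; SSDM calls this
    'routine once for each child declared in GetChildrenCount
    CurrentNode = CurrentNode + 1
End Sub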
…………As Crivat points out in the tutorial he wrote for the SDK, there may be cases in which a single node has multiple parents, such as tree structures that represent graphs, but these instances are too extraordinary to warrant a full discussion. I should at least mention MoveToNextTree, a mandatory method Microsoft included to enable unusual structures with more than one root node. As the help file included with the SDK states, “Positions the navigator to the next content tree, if your algorithm’s content is laid out as multiple trees…Usually, all the content can be represented as a single tree and this function should return false. It is useful to implement this function when your content graph is disjoint, to ensure that all nodes can be traversed in a content query.” As I’ve said many times in this series and in others like An Informal Compendium of SSAS Errors, troubleshooting Analysis Services is sometimes like exploring deep space with the crew of the Enterprise, since you may be boldly going where no man has gone before. The MoveToNextTree method is another one of those features of SSAS for which there is apparently no documentation anywhere on the planet, aside from the single line of code in the sample files Crivat includes with his tutorials and a single brief C++ example at the MSDN webpage A Tutorial for Constructing a Plug-in Algorithm. The latter doesn’t do much, but says “This function sets _iIDNode to be the root of the next tree in the sequence. Note that if we are in browser mode, there will never be a next tree.” The only useful documentation I could find is the sentence quoted above from the .chm file. Undaunted, I decided to experiment by simply returning a value of True in all cases, which creates an infinite loop that eventually triggers the XMLA timeout warning. The code in Figure 1 worked because it includes a limit on the number of times the routine returns True. Each time MoveToNextTree returns that value, it not only iterates through the sequence of method calls we discussed earlier, but does so for each node in that structure. Intrepid programmers with a need for disjoint graphs (i.e. sets with no common members) could add conditional logic to the other methods we’ve already discussed to return different values, depending on which of the separate tree structures is being iterated through. In my case, the code below merely added three nodes under the root, on the same level as the other two inputs, although I suppose I could have included conditional logic in other routines to create secondary roots or the like.

Figure 1: Some Sample MoveToNextTree Code

Static TreeCount As Int16

If TreeCount < 3 Then
    TreeCount = TreeCount + 1
    Return True
Else
    Return False
End If

…………The remaining functionality we have yet to discuss in AlgorithmNavigationBase all pertains to supplying values for the columns of SSDM’s common metadata format for a particular node. For the sake of brevity, I left a discussion of GetNodeAttributes out of Post 14.2, even though it’s relatively inconsequential. It merely specifies the values assigned to the ATTRIBUTE_NAME column by supplying the names already assigned to your attributes, which in my case correspond to mining model columns labeled “InputStat” and “SecondInputStat.” All you need to do is supply the index of the attribute in your AttributeSet to the return variable, which is an array of unsigned integers. To create a comma-separated list of attributes associated with a particular node, return more than one value in the array, as in the sketch below. This week’s tutorial also features a more complex implementation of the NODE_DISTRIBUTION table than the one returned for a single node a couple of posts ago. The Select Case returns the marginal stats for the whole model when the CurrentNode is equal to the index of the root, but numbers specific to the InputStat or SecondInputStat column for the two child nodes. First I add the marginal stats for the appropriate column to a collection of AttributeStatistics objects, which in turn specifies the values for the first few rows of its NODE_DISTRIBUTION table. The real difference between this and the code we used a few weeks ago is that we’re also including three custom NODE_DISTRIBUTION rows, by adding three extra AttributeStatistics objects whose StateStatistics objects have custom values. First I instantiate the new AttributeStatistics objects, then the three StateStatistics objects which are added to them at the end of the Case statement. The VALUETYPE for each custom NODE_DISTRIBUTION row is set using the MiningValueType enumeration, as depicted in Figure 2. Immediately after that, I assign the ATTRIBUTE_VALUE to either the skewness, kurtosis or excess kurtosis value we associated with that attribute in previous tutorials. I’ve left the NODE_SUPPORT, NODE_PROBABILITY and VARIANCE blank in these cases, but set the ATTRIBUTE_NAME (which should not be confused with the column of the same name one level higher in the hierarchy of SSDM’s common metadata format) using the NodeID property. Ordinarily it might also be a good idea to add logic to this routine to do conditional processing based on the NODE_TYPE, but I didn’t want to complicate this tutorial further by nesting Select Cases and other such statements inside each other.
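…………Before moving on to that code, here is a minimal sketch of the GetNodeAttributes logic just described, under the same CurrentNode scheme as the earlier sketches; the two indexes are assumed to correspond to the InputStat and SecondInputStat columns:

Protected Overrides Function GetNodeAttributes() As UInteger()
    'the root node returns both attribute indexes, which SSDM renders as a comma-separated
    'list in the ATTRIBUTE_NAME column; each child node reports only its own column
    If CurrentNode = 0 Then
        Return New UInteger() {0, 1} 'InputStat and SecondInputStat
    Else
        Return New UInteger() {CUInt(CurrentNode - 1)}
    End If
End Function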

Figure 2: Adding Custom Rows to the NODE_DISTRIBUTION Table

Case Else

    Dim TempStats As AttributeStatistics() = New AttributeStatistics(3) {}
    TempStats(0) = CurrentAlgorithm.MarginalStats.GetAttributeStats(CurrentNode - 1)
    TempStats(1) = New AttributeStatistics
    TempStats(2) = New AttributeStatistics
    TempStats(3) = New AttributeStatistics

    Dim SkewnessState, KurtosisState, ExcessKurtosisState As New StateStatistics()

    SkewnessState.ValueType = MiningValueType.Other
    KurtosisState.ValueType = MiningValueType.Coefficient
    ExcessKurtosisState.ValueType = MiningValueType.Intercept

    SkewnessState.Value.SetDouble(CurrentAlgorithm.SkewnessKurtosisClassArray(CurrentNode - 1).Skewness)
    KurtosisState.Value.SetDouble(CurrentAlgorithm.SkewnessKurtosisClassArray(CurrentNode - 1).Kurtosis)
    ExcessKurtosisState.Value.SetDouble(CurrentAlgorithm.SkewnessKurtosisClassArray(CurrentNode - 1).ExcessKurtosis)

    TempStats(1).StateStatistics.Add(SkewnessState)
    TempStats(2).StateStatistics.Add(KurtosisState)
    TempStats(3).StateStatistics.Add(ExcessKurtosisState)

    'we have to set the ATTRIBUTE_NAME string that appears next to these stats in the parent AttributeStats object, for whatever reason
    TempStats(1).Attribute = AttributeSet.Unspecified
    TempStats(1).NodeId = "Skewness"
    TempStats(2).Attribute = AttributeSet.Unspecified
    TempStats(2).NodeId = "Kurtosis"
    TempStats(3).Attribute = AttributeSet.Unspecified
    TempStats(3).NodeId = "Excess Kurtosis"

    Return TempStats

…………The updated code, which I’ve made available here, produced the three nodes depicted in Figures 3 through 5. The first NODE_DISTRIBUTION table includes the marginal stats for the whole model, plus some values for NODE_CAPTION, NODE_DESCRIPTION and other such columns that we assigned in GetDoubleNodeProperty and GetStringNodeProperty a couple of tutorials ago. Note how the NODE_DISTRIBUTION tables in the second and third figures include custom rows featuring the values for the skewness, kurtosis and excess kurtosis associated with the column mentioned in the ATTRIBUTE_NAME column (the one at the top of each figure, not the one in the NODE_DISTRIBUTION tables). I have yet to solve the mystery of why the VALUETYPE for the skewness rows is set to Periodicity, when the value supplied in code is MiningValueType.Other, but that is insignificant at this point. The mystery of how the Generic Content Viewer columns are populated in plug-in algorithms is essentially solved; would-be plug-in writers need only decide what they want to appear in them, rather than worrying about how to get the values in there. Only a few enigmas remain in the attic at the top of this Rickety Stairway, such as how to use the Predict method and advanced functionality like custom mining functions and Predictive Model Markup Language (PMML). After those four plug-in tutorials are finished, we’ll unravel one final mystery to end the series: how to write custom data mining viewers. Discussion of GetNodeIDsForCase, a specialized AlgorithmNavigationBase method, will be deferred until then because it implements drillthrough to mining model cases, which can’t be done in the Generic Content Viewer. Since the more useful and attractive mining viewers that ship with the product depend on the same values as the Generic Content Viewer, it would be fairly easy to upgrade this sample AlgorithmNavigationBase code to create structures that are compatible with them, such as hierarchies that can be displayed by the Decision Trees Viewer. It wouldn’t be child’s play and the code would be lengthy enough to unduly clutter this series, but once you grasp the basic principles of how to navigate through the appropriate structure, upgrading this skeleton code into custom algorithms that can be displayed in the out-of-the-box mining viewers becomes feasible. The last tutorial in this series will address scenarios where those viewers don’t meet the needs of end users, which calls for custom means of displaying data in the Visual Studio GUI or SSMS.

Figures 3 to 5: Three Nodes Displaying Varied Information
[images: 3Nodes (1), 3Nodes (2), 3Nodes (3)]