A Rickety Stairway to SQL Server Data Mining, Part 14.1: An Introduction to Plug-In Algorithms
by Steve Bolton
…………In my last post in this amateur series of self-tutorials on SQL Server Data Mining (SSDM), I got into a lengthy discussion of how neglected but powerful SQL Server Analysis Services (SSAS) stored procedures are. This is part of a larger pattern of under-utilization of some of the database server software’s most potent features, including the development of CLR objects to significantly extend the functionality of SSAS and T-SQL, as well as Multidimensional Expressions (MDX) and SSDM itself. SSDM can also be extended in a way that is not applicable to any other component of SQL Server, through the addition of new user-defined mining algorithms. Despite the fact that this gives SQL Server a distinct advantage over other database server mining software in one of the hottest, wide-open fields in computing, it also falls into the same category of neglected functionality. In recent years, this same set of features has also been neglected by some of the top brass at Microsoft, who don’t seem to appreciate just how powerful and well-designed these products really are, or how important they are for the software giant’s position in a segment of computing that is bound to be profitable and relevant for decades to come. The good news is that writing plug-in algorithms for SSDM is also comparable to SSAS stored procedures, in that they appear to more difficult to write than they really are.
…………The development process for a plug-in is by no means trivial, but should not be intimidating either. Most of the kinks in the process actually arise from the age of the SSDM architecture, which hasn’t been significantly updated since 2005. It is a mystery why the top brass at Microsoft has invested so few resources in extending the lead over its competitors provided by tools like MDX and SSDM, but the fact that they both remain at the pinnacle of analysis software eight years later is a testimony to how powerful and well-designed they are. The process for developing plug-ins really could use a makeover, but it’s still useable in its original form, with some adaptations for the .Net platform. The second programming language I learned (besides BASIC, which I used to write games on an old Compucolor II when I was 12) was C++, but I haven’t used it since the 1990s, when I learned to appreciate the simplicity of the Visual Basic syntax. Back in the days of the old COM architecture, I earned a Microsoft Certified Solutions Developer (MCSD) certification in VB 6.0, but likewise abandoned COM for the simplicity and power of the .Net framework like most other Windows developers. I procrastinated in regard to writing plug-ins for far too long after reading that the architecture is based on C++ and COM, but thankfully, a .Net wrapper has been available for some time that greatly simplifies the process. There are some trade-offs in using the wrappers in place of COM, particularly in terms of “marshalling data between native C++ and managed code” and the inability of Analysis Services to manage the memory of .Net plug-ins in some circumstances. For those who cannot abide these slight performance penalties, or who have a masochistic or historical appreciation for C++ and COM, check out A Tutorial for Constructing a Plug-in Algorithm, written by Microsoft Data Mining Team members Max Chickering and Raman Iyer back in August 2004. I’ll confine my discussion of SSDM plug-ins to the wrapper classes written for .Net, which is superior to the practically obsolete COM architecture in many respects and much more adaptable to my own skill set.
…………Working with the wrapper classes requires a brief initial immersion with the C++ shells, however, which I will describe in this initial post in a four or five part series on SSDM plug-ins. Ideally, we would simply write plug-ins based on native classes within the .Net framework, but Microsoft has not yet upgraded SSDM in this way yet; this is just one of a wide range of enhancements that could be made to this long-neglected but powerful component of SQL Server, in order to establish long-term dominance of the data mining market for Microsoft. In the meantime, however, we must take a few preparatory steps that .Net coders aren’t normally accustomed to. Annoyances like this should not obscure the fact that adding to the nine algorithms included out-of-the-box with SSDM is incredibly useful. As mentioned in A Rickety Stairway to SQL Server Data Mining, Algorithm 7: Clustering, there are thousands of different clustering algorithms in use today, all of which could be implemented in SSDM. There are likewise innumerable variations on other SSDM algorithms we covered in previous tutorials, like Time Series, Neural Nets, Linear and Logistic Regression, Association Rules and Naïve Bayes – not to mention others like Kohonen nets that aren’t implemented at all. As I discussed in some of the introductory posts in this series, the statistical building blocks that these algorithms are built from can be combined, tweaked and rearranged in a practically infinite number of permutations that can be suited to the particular needs of any analytical project. The sky is the limit. New mining algorithms will probably still be being written a century from now, assuming mankind is still around. Because data mining can be applied to such an exceptionally wide range of problems, which mankind has no shortage of today, the field probably won’t reach the top of its bell curve of economic utility for many decades to come, unlike some of today’s hottest tech products, like current craze for tablet-sized devices. For this reason, extensibility is crucial to the data mining industry and products like SSDM in a way that it is not for some of Microsoft’s other software packages, like Visual Studio or SQL Server Management Studio (SSMS). It is unlikely that data miners of the late 21st Century will still be using SSDM, but Microsoft could retain its head start in a field with a promising long-term future by properly maintaining and promoting SSDM.
…………Finding a comprehensive, up-to-date list of the companies that have developed algorithms with this plug-in architecture is next to impossible, but a cursory search of the Web through Google turns up such names as Comelio, Angoss and KXEN. This list is probably shorter than it ought to be for the simple reason that there just aren’t enough qualified people available yet to write algorithms for the practically infinite array of problems that mining software like SSDM can solve. The lucrative potential of the field is perhaps best illustrated by the fact that C. L. Curotto and N.F.F. Ebecken’s eight-year-old hardcover book Implementing Data Mining Algorithms in Microsoft SQL Server will set you back a whopping $214 if bought on Amazon.com today. Given that I’m an amateur who has yet to be paid to work in the field, I don’t have that kind of money. Fortunately, cheaper sources of information are available, such as Raman Iyer’s MSDN article Add Custom Data Mining Algorithms to SQL Server 2005 and Data Mining with Microsoft SQL Server 2008, the classic reference on SSDM written by former members of the DM Team like Jamie MacLennan, ZhaoHui Tang and Bogdan Crivat. Sources like these list some of the incalculable advantages of using SSDM the architecture rather than starting from scratch, since it integrates with the larger SQL Server and Windows environment to provide managements of security, transactions, backups and access through such tools as Excel, SQL Server Integration Services (SSIS), Reporting Services and the Data Mining Expression (DMX) language processor in SSMS. There’s usually no sense in reinventing the wheel by implementing all of this out-of-the-box functionality yourself. Writing cutting edge algorithms is a high-level activity, which means the last thing you want to be bothered with is lower level architecture details like this, not to mention the painstaking care needed to work with C++ and COM. I’m a big fan of encapsulation of this kind; after all, the overwhelming advantage of computers as a whole is that they allow you to encode blueprints and then copy them effortlessly, thereby freeing users up to perform other tasks on top of those copies. When working with data mining software, however, it’s mandatory.
…………Because the top brass at Microsoft seems to have lost sight of this fact, they haven’t updated the code for SSDM plug-ins properly by providing a real .Net interface. That means we have to jump through a few hoops to build a plug-in with a managed code language like C# or Visual Basic, but the cost in terms of time and energy are inconsequential in comparison to the benefits of being able to write your own SSDM algorithms. Because there’s a relative shortage of SSDM tutorials written with VB, I’ll be creating a Visual Studio project in that language. Before we can get to that stage, however, it is necessary to compile the C++ shells into a .dll that our VB project can reference, otherwise the object model will be invalid. The very first step in the process of writing your own algorithm is to download the Data Mining Managed Plug-in Algorithm API for SQL Server 2005 from the Microsoft Download Center webpage Feature Pack for Microsoft SQL Server 2005 – April 2006. Keep track of the installation location. The first time I attempted to build a plug-in, I downloaded it to C:\Microsoft\Data Mining Managed Plug-in API\ but for this tutorial series, I set it to C:\Program Files (x86)\Microsoft SQL Server\Data Mining Plugin. The folder should have a structure identical to the one depicted in Figure 1:
…………The SQL Server Data Mining Managed Plug-In Algorithms Tutorial.doc file is a much more in-depth tutorial than the one I’m providing here, so it would be wise to consult it throughout your development process. I had to lean heavily on MacLennan, et al. to get me started, but after that I relied on this .doc file by Bogdan Crivat, a leading expert on SSDM who is now with Predixion Software. He goes into much greater detail about specialized scenarios and provides several lengthy demos, which are related to the code found in the topmost folder in Figure 1. Rather than rehashing what Bogdan wrote and ending up with a pale copy of his tutorials, I’ll be contributing separate mistutorials in VB that illustrate some of the pitfalls fellow amateurs may run into on their first attempts at writing a custom algorithm. The first mistake that I made was a costly one in terms of development time: I completely overlooked the SSDMPlugIn.chm file depicted in Figure 1, which describes the object model we’ll be working with in the next several blog posts. I wasted quite a bit of time trying to tease out the structure in Visual Studio through Intellisense and the Object Browser, so please, do yourself a favor and read both this Help file and Bogdan’s tutorial before writing your custom algorithm.
…………The kicker is that you can’t get to the point of writing your first lines of code for the algorithm without first compiling the Visual Studio solution file in the DMPluginWrapper folder into a .dll. There are 46 files related to this project in the root of this folder alone, most of which are C++ headers and .cpp documents needed for this shell project. I don’t know enough about the inner workings of the plug-in architecture to state why this extraneous step is necessary, although it would obviously be simpler if users could simply work with a precompiled DMPluginWrapper.dll file, especially one that it natively .Net-compliant. Compiling the project in Visual Studio is a fairly straightforward process that doesn’t take long though, unless you run into one of the following “gotchas” that I stumbled across. Problem #1 was an error at compilation time about a file named oledbm.h being missing. If you’re working with SQL Server 2005 you can probably find it in the C:\Program Files\Microsoft SQL Server\90\SDK\Include folder, or perhaps Program Files (x86). If not, it can be downloaded from this webpage at the SQLServerDataMining.com website, which will require registration first. I put my copy in C:\Program Files\Microsoft SQL Server\110\SDK\Include, then added that directory to the Additional Include Directories under the Configuration Properties\C++\General tab, through the Project Properties window in Visual Studio. I’m not sure why dmalgo.h, msxml6.h are included in the same .zip file, but since they are was included in the same a set of bug fixes, I replaced those files with these versions in Data Mining Plugin\DMPluginWrapper as well with no ill effects.
…………After fixing this issue, I ran into Problem #2 on the next attempt at compilation: an error message including the text “C2065: ‘DMMVALUETYPE_INTERCEPT’: undeclared identifier.” To fix this, I added the line text “#define DMMVALUETYPE_INTERCEPT ( 11) ” to oledbm.h after “#define DMMVALUETYPE_NODE_UNIQUE_NAME ( 10 ).” Why I did this is a good question that even I really can’t answer. As I’ve said throughout this series, working with SSDM is sometimes akin to traveling on the starship Enterprise, because you’re “boldly going where no man has gone before.” The same is true of SSAS as a whole, which is why my series An Informal Compendium of SSAS Errors remains one of the few centralized sources of information on the planet for certain arcane Analysis Services bugs. In that series, I wrote about certain error messages that don’t seem to be documented anywhere, by anyone, but in this case I was fortunate to find a single helpful reference: a code sample I found at a somewhat anonymous ftp site that included this extra line. This was one of only two references on Google for the search term DMMVALUETYPE_INTERCEPT. I wasn’t even able to determine whose ftp site it was to give proper credit; I couldn’t even determine what its purpose was, other than what looked to be an inventory of someone’s hard drive. I can’t begin to explain why it worked, only that it did, which will have to do. For now, I will also have to settle for a similar workaround to Problem #3, which proved to be intractable but apparently irrelevant. On the next compilation attempt, I received “error MSB3073: The command “”C:\Program Files (x86)\Microsoft SDKs\Windows\v7.0A\Bin\gacutil.exe” /i “C:\Microsoft\Data Mining Managed Plug-in API\DMPluginWrapper\Debug\DMPluginWrapper.dll” /f: VCEnd” exited with code 1. C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\Microsoft.CppCommon.targets 113.” The first workaround I applied was to restart Visual Studio with Administrator privileges as mentioned in the thread Error Code 1 When I Install an Assembly in GAC started in 2011 by the user starblue at StackOverflow.com. It had no effect, but running the offending command in a prompt with elevated privileges led to the error: “Failure adding assembly to the cache: This assembly is built by a runtime newer than the currently loaded runtime and cannot be loaded.” This commonly occurs when there is a mismatch with the .Net framework version targeted by a C++ project, so I tried the solution mentioned on April 27, 2010 by Brian Muth at the MSDN webpage Trying to Change Targeted Framework in VS2010. Because the Unload Project command was greyed out in Visual Studio for some mysterious reason, I simply edited the project’s .vcproj file by hand, by adding the new node v2.0 under the node. This set the framework to .Net 2.0 but it apparently did not fix the problem, since I still received the following compilation error after reopening the project: “.NET Framework 2.0/3.0/3.5 target the v90 platform toolset. Please make sure that Visual Studio 2008 is installed on the machine.” I applied a solution I found at this webpage at the esri.com knowledge base, which involved changing the Platform Toolset from v90 to v100 on the Configuration Properties\General tab in the Project Properties. I also altered the Stdafx.h file by adding the line “#define _WIN32_WINNT 0x050” because it was unclear if the solution at the esri page called for it or not, but it led to multiple compiler warnings with the text “C4005: ‘_WIN32_WINNT’: macro redefinition.” After removing that alteration to Stdafx.h, the solution still didn’t do the trick. My next course of action was to follow a suggestion at starblue’s aforementioned thread and simply delete the offending command in the Configuration Properties\Build Events\Post-Build Events, which was listed as “”$(FrameworkSDKDir)Bin\gacutil.exe” /i “$(TargetPath)” /f.” After that the .dll compiled just fine. Perhaps there may be unforeseen consequences for not properly referencing the .dll in the Global Assembly Cache as the command is designed to do, but I have yet to encounter them while working on more other installments in this series of mistutorials.
…………After this step, it is a simple matter to create a new Visual Basic (or C# if that is your preference) project and set a reference to the DMPluginWrapper.dll you just compiled in the C++ shell project. As long as it compiles fine, the abstract classes we will derive from in next week’s tutorial will be recognized by Visual Studio. The object model really isn’t that difficult to work with, because we’re only deriving from four base classes: AlgorithmMetadataBase, AlgorithmBase, AlgorithmNavigationBase and ICaseProcessor. As we will discuss next week, the functionality embodied in the methods, properties and other members of these classes is relatively easy to follow, especially with Bogdan’s tutorial to act as a guide. The Shell folder depicted in Figure 1 even contains basic code samples for these classes in C#, which I easily converted to Visual Basic. I was a little rusty in deriving from abstract classes which slowed me down a little more, but I was up and running with a VB plugin project within a matter of minutes after debugging the DMPluginWrapper.dll compilation issues. Designing and implementing the algorithm itself was of course much more time-consuming and thought-intensive, as we shall discuss in the next installment in this series. The extra obstacles we had to hurdle in this post were trivial in comparison, but that is precisely why they could be eliminated; if the “meat and potatoes” of the problems we’re trying to solve are so difficult that they necessitate building custom algorithms, we really shouldn’t be wasting time peeling the potatoes too, so to speak. There are many overwhelming advantages to writing custom algorithms and we’ve already disposed of the most serious drawback, which is the lack of upgrades to the SSDM architecture. Like Analysis Services as a whole, SSDM merely needs a simple makeover to keep it viable for another eight years or more, through such modernizations as the development of native .Net plug-ins and the like. If the DM Team is given the resources it needs to add such enhancements, as well as better documentation and new out-of-the-box algorithms, there is no reason why Microsoft could not quickly become the industry leader in data mining. As with MDX, it is a testimony to how well the product is designed that it has held up so well for so long against its competition, despite years of relative neglect by Microsoft bosses who seem unduly obsessed with Tablet Panic. Hopefully they will recognize in time that computer technology is fanning out in two directions, one leading to miniaturized devices on one end of the spectrum and Big Data on the other. Microsoft entered the former market far too late to ever achieve dominance, but the latter will require Big Analysis software that the company doesn’t seem to know it already has.