A Rickety Stairway to SQL Server Data Mining, Part 13: Using CLR with Analysis Services
by Steve Bolton
I was drawn into the realm of SQL Server in a roundabout manner thanks to Visual Basic. Around the time I got my Microsoft Certified Solution Developer (MCSD) certification in VB 6.0 (at precisely the same time .Net hit the market, instantly leaving me a year or so behind the curve), a potential employer rescinded an offer because I didn’t know how to write stored procedures. When SQL Server 2005 came out I was still learning the ropes of relational databases, but was overjoyed at the prospect of writing stored procedures using the .Net Common Language Runtime (CLR). It meant I could now leverage my VB skills while writing T-SQL, but I got busy with other aspects of SQL Server while getting my Microsoft Certified Database Administrator (MCDBA) and three Microsoft Certified Technology Specialist (MCTS) certs, then familiarizing myself with fields like data mining. In the meantime it sank on my priorities list because practically every blog post I read on the subject of CLR was cautionary, to the say the least; although there still seems to be widespread agreement that the tool has a lot of potential, the SQL Server community has been slower to make use of it than expected for a variety of reasons. One of these is a lack of clear guidelines and best practices on when to use CLR triggers, functions, aggregates, procedures and data types, as well as how to avoid the performance problems that mediocre CLR code can cause. When studying for my last exam, SQL Server 2008 R2 Business Intelligence Development and Maintenance, a couple of summers ago, I learned that it was possible to write CLR stored procedures in SSAS, but put off learning to do it indefinitely for a variety of reasons. Among these was the sheer amount of information I had to absorb for the certification exams, which can be brutal if you’re self-taught; I can’t be more specific due to the non-disclosure agreements (NDAs) Microsoft requires, but I advise anyone trying to go it alone like this to memorize practically every page of documentation in Books Online, as well as to devise hands-on practice exercises for every conceivable operation. I simply didn’t have time to adequately learn this arcane aspect of SQL Server Analysis Services (SSAS), even though it was one feature that definitely interested me. Furthermore, adoption of CLR in Analysis Services seemed to be even slower than on the relational side. Even Chris Webb, one of the leading SSAS cube gurus, said he was late to the game back in 2006, in a blog post titled “My First Analysis Services Stored Procedure.” How’s this for late in the game: I wasn’t even aware that CLR stored procedures could be written for SQL Server Data Mining (SSDM) until I was half-way through this amateur series of self-tutorials on SSDM, which I am writing the familiarize myself with the topic better while simultaneously giving the most underappreciated components of SQL Server some badly needed free press. I discovered this when I ran across a C# tutorial on the subject tucked in the back of Data Mining with Microsoft SQL Server 2008, the classic reference written by former members of Microsoft’s Data Mining Team.[i] It was only when I began writing my last article, A Rickety Stairway to SQL Server Data Mining, Part 12: Accessing SSDM with AMO, ADOMD, SSIS, XMLA and RS, that I realized that they were identical to Analysis Services stored procedures and deserved separate treatment in a special blog post.
I’m glad I took the time to split this topic off into a separate article, because I now know first-hand how useful they can be. In A Rickety Stairway to SQL Server Data Mining, Part 10.4: Data Mining Dimensions vs. Drastic Dynamic DMX, I provided a couple dozen scripts that enable users to import their mining results directly into relational tables, where they can be sliced and diced with T-SQL or exported to cubes for further analysis. By importing them into the relational schema I came up with, users can essentially circumvent Data Mining Expressions (DMX), the language used to retrieve results from SSDM, with the lone exception of prediction queries, which are only partially automated. I basically encapsulated a lot of the rough spots that data miners are bound to encounter when accessing DMX through T-SQL, particularly the astounding inability of the Microsoft OLE DB for Analysis Services data provider to translate #IND and NaN values correctly. The workaround I developed for that bug involves some ugly-looking string conversion code using native Excel functions, which unfortunately require Excel to be installed on the same machine at the Analysis Server that SSDM runs on. Now that I know how to write stored procedures for SSAS, it will be fairly easy to write a more elegant workaround that won’t require Excel at all. I may post it at Codeplex’s Analysis Services Stored Procedure Project, whenever I get around to finishing it. At present, the SQL Server community has contributed 23 SSAS stored procedures, not counting multiple versions of the same procedures. ClusterNaming is the only one which is directly relevant to data mining; it addresses the fairly simple problem of persisting names of clusters between processing runs, by renaming them by their characteristics in the same fashion as a verbose Association Rules label. There’s a practically endless supply of real-world use cases, bug fixes and missing features that could be addressed by other CLR stored procedures for SSDM. I’m always happy to be able to kill two birds with one stone – or rather, to discover that there’s only bird, as I did when I realized that SSAS and SSDM stored procedures are one and the same. Yet my newfound appreciation for CLR goes beyond merely crossing another item off my To-Do List. They’re not only useful, but fairly easy to write, at least from my odd perspective, as a self-taught VB programmer and homeschooled DBA. Perhaps real programmer or DBAs won’t find the process as easy, useful or interesting, depending on one’s own needs or experience, but personally, if I’d know it was this simple, I would have started using CLR years ago.
This post also gives me a chance to put another drop in the bucket to resolve a side issue I mentioned in the last post, i.e. the lack of SSDM code examples written in VB. Almost all of the .Net code related to SSDM seems to be written in C#, including the examples in the DM Team’s book, which I highly recommend if you prefer that language. VB has been getting short shrift from Microsoft in recent years, with the number of code examples in the Visual Studio documentation declining proportionally with each passing year; frankly, Visual Studio’s Help sections are a mess these days, due to issues like the lack of Intellisense links to Help topics; code samples cluttered with extraneous code; and tautological code examples that don’t tell you anything, like simply declaring an object without showing how it’s actually used. This would have been unacceptable as far back as VB 5.0, when Visual Studio’s documentation was actually much more useful. In any event, I’m going to provide my CLR code samples in Visual Basic, to put my drop in the bucket on its behalf. One of its advantages is that it’s fairly easy to follow in comparison to many other .Net languages, so interpreting these samples shouldn’t be that difficult. I’m not going to give examples in how to do all this in the old COM interface, because it’s obsolete, although you can implement SSAS procedures in it. I will, however, contrast the process of creating CLR stored procedures on the relational side with the means of writing them for Analysis Services. There are many similarities, but there are a few important differences at each stage of comparable processes.
The dissimilarities begin with the first line of code, in which we use an Imports statement to add a reference to Microsoft.AnalysisServices.AdomdServer. Before adding this line you may first have to use the Add Reference… command in Visual Studio to set a reference to msmgdsrv.dll, which may be located in your Program Files\Microsoft Analysis Services\AS OLEDB\110 folder. If we were writing CLR code for the relational side, we’d be adding a reference to Microsoft.SqlServer.Server, in order to access the server automation functions in SQL Server Management Objects (SMO). In last week’s tutorial, we glossed over how to perform Data Definition Language (DDL) and Data Manipulation Language (DML) in SSDM using analogous Analysis Services automation libraries, Analysis Management Objects (AMO) and ActiveX Data Objects Multi-Dimensional for .Net (ADOMD.Net). AdomdServer is a version of ADOMD.Net that can run commands such as stored procedures on the server before returning results to remote clients, which allows us to return leaner resultsets and thereby conserve server, client and network resources.
The second line of code in Figure 1 is an example of a class attribute, which is an aspect of VB coding I completely forgot about because I hadn’t used it in years. The easy way to understand class attributes is as a kind of custom property. In this case our SafeToPrepare property is required in both the class declaration and any functions within it, otherwise your server will return the error message “Prepare is not safe during execution of the MyStoredProcedure stored procedure” when you run your code. Basically, this just tells the server that there’s no code within the procedure that could compromise security or access sensitive resources haphazardly. If we were working with relational CLR code, we would navigate to the SQLCLR tab in Project Properties and set the level to SAFE there, or to EXTERNAL_ACCESS if we needed to perform some kind of sensitive access, such as manipulating files, network resources and the like. We could also set it to UNSAFE if we wanted to do really risky things, like execute non-managed code outside the .Net framework. In a relational CLR project we would also have to use the WITH PERMISSION_SET clause with one of these values when updating or adding an assembly with T-SQL commands like ALTER ASSEMBLY and CREATE ASSEMBLY. There are no equivalent keywords in DMX or Multidimensional Expressions (MDX), the language used to manipulate cubes in SSAS, nor is there a SQLCLR tab in the Project Properties of SSAS stored procedure projects. A few paragraphs further down I’ll explain how we can specify the same three permission settings in the SQL Server Management Studio (SSMS) GUI (or hand-type them into verbose XMLA commands, if you’re a masochist) when we add our compiled code to the server in the form of a CLR assembly. On the project side, however, the only way we can specify the permission set is using the SafeToPrepare class attribute. In fact, I can’t find any reference to an alternative to SafeToPrepare = True in the object model, Books Online (BOL) or other SSAS procedure tutorials scattered around the Internet. It is possible to add an assembly marked for External Access or as Unrestricted (i.e., Unsafe in SSAS lingo) once it’s been built, but I haven’t yet found a means of writing .Net code with either of these alternative permission sets.
Figure 1: An Example of a Server-Side Stored Procedure with SSDM
‘don’t forget the Imports keyword at the beginning
<SafeToPrepare(True)> Public Class MainClass
<SafeToPrepare(True)> Public Function GetFullStatistics(ModelName As String) As System.Data.DataTable
Dim TempDataTable As New System.Data.DataTable
Dim TempColumn As New System.Data.DataColumn
TempColumn.ColumnName = “AdjustedProbability“
TempColumn.DataType = System.Type.GetType(“System.Double“)
TempColumn = New System.Data.DataColumn
TempColumn.DataType = System.Type.GetType(“System.Double“)
TempColumn.ColumnName = “Probability”
TempColumn = New System.Data.DataColumn
TempColumn.DataType = System.Type.GetType(“System.Double“)
TempColumn.ColumnName = “Support”
TempColumn = New System.Data.DataColumn
TempColumn.DataType = System.Type.GetType(“System.Double“)
TempColumn.ColumnName = “Max”
TempColumn = New System.Data.DataColumn
TempColumn.DataType = System.Type.GetType(“System.Double“)
TempColumn.ColumnName = “Min”
TempColumn = New System.Data.DataColumn
TempColumn.ColumnName = “States”
TempColumn.DataType = System.Type.GetType(“System.Int32″)
Dim TempRow As System.Data.DataRow
Dim TempAttributeStats As New AttributeStats
Dim TempStateStats As New StateStats
For I = 0 To Context.MiningModels(ModelName).GetAttributes(MiningFeatureSelection.All).Count – 1 ‘
TempAttributeStats = Context.MiningModels(ModelName).GetAttributeFullStatistics(I)
TempRow = TempDataTable.NewRow()
TempRow.Item(“AdjustedProbability“) = TempAttributeStats.ExistingAdjustedProbability
TempRow.Item(“Probability”) = TempAttributeStats.ExistingProbability
TempRow.Item(“Support”) = TempAttributeStats.ExistingSupport
TempRow.Item(“Max”) = TempAttributeStats.Max
TempRow.Item(“Min”) = TempAttributeStats.Min
TempRow.Item(“States”) = TempAttributeStats.States
‘we could add a nested table below this level with the state stats returned by TempAttributeStats.StateStatistics()
‘the data seems to be equivalent to a NODE_DISTRIBUTION table
‘when returning multiple tables, add them to a dataset and apply the appropriate relations
If you were able to follow the ADOMD code sample I provided last week, then the code above for the server version of ADOMD should be a breeze, since their object models are quite similar. The first 35 lines or so (please forgive the lack of line numbers or proper formatting; I’m just now getting around to posting my first VB code samples) merely create the table that will be used to hold our resultset. The prefix “Temp” is merely an old coding custom of mine for specifying variables of local scope. There’s nothing fancy at the end of the function declaration either; just don’t forget to declare the return type for your function at the top of the class, otherwise your server will return the same error message mine did when you run the code: “The user-defined function (UDF) returned an inconsistent number of columns.” The code within the loop is where the action of assigning values to our resultset takes place. It’s also where we encounter the one critical exception to the server version of ADOMD, which allows you to access various server objects through the Context object, which is the counterpart of the SQLContext object in SMO. Well, sometimes it allows you too. You’re supposed to be able to access the object that is specified in the FROM clause of an MDX or DMX query using syntax like Context.CurrentCube and Context.CurrentMiningModel, as well as retrieve connection, server and database references using CurrentConnection, CurrentDatabaseName and CurrentServerID.
I can’t speak for these other Current objects, but a fatal bug was apparently introduced into CurrentMiningModel property of the Context object, which worked as specified until the release of SQL Server 2012. References to it in the current version of SQL Server will lead to “Object reference not set to an instance of an object” errors when debugging in Visual Studio, while in SSMS you’ll get the following message when calling your procedure: “Execution of the managed stored procedure MyStoredProcedureName failed with the following error: Exception has been thrown by the target of an invocation. Object reference not set to an instance of an object.” At least according to Google, there’s only one source of information on this puzzling error, in the thread “Context.CurrentMiningModel returns null in UDF” at Techques.com. A user named Jan provided important information to the effect that “A contact at Microsoft confirmed this behaviour as desired due to a ‘Metadata-Refactoring’ (whatever this means…). However, the website still pends to be updated appropriately.” There’s no mention of this in the documentation, unless this odd line from the MSDN webpage “Context.CurrentMiningModel Property” refers to this: “The return value will be Nothing when called from a Multidimensional Expressions (MDX) query, or as a stored procedure.” As Adam Sandler once said, “Once again, things that could’ve been brought to my attention Yesterday!” The single reply to Jan’s thread mentioned a fairly simple workaround almost identical to the one I use in Figure 1, in which we supply the name of a mining model to retrieve the right one from the MiningModels collection. What this means is that we can still execute our procedure with either the CALL syntax, as in CALL SSASSPClassLibrary1.GetFullStatistics (‘LRDenormalizedView1Model’), or like a function in a DMX select list, like so: SELECT SSASSPClassLibrary1.GetFullStatistics (‘LRDenormalizedView1Model’) FROM [LRDenormalizedView1Model]. In both of these examples, however, we have to supply the name of the mining model as an argument to the function, which wasn’t necessary in previous versions of SQL Server. Basically, this is an undocumented breaking change to Analysis Services stored procedures. It is also another illustration of the difficulty of finding decent documentation on SSAS, which often entails sifting through the combined knowledge of the human race on the Internet just to find a single post written by someone on the other side of the planet. It’s in the same league as “going boldly where no man has gone before,” like the crew of the starship Enterprise, except it’s even uglier than the eyebrows of a Klingon. That’s why I started a series of blog posts in 2011 titled An Informal Compendium of SSAS Errors, which remains one of the few centralized sources of information on the planet for certain arcane Analysis Services errors. I may have to add this to the list. Yet as long as you’re willing to overlook this defect – which is really only cosmetic if you’re writing brand new code – you can still do great things with SSAS stored procedures. That may include some things you might not know were possible, just as I didn’t know it was possible to directly retrieve a mining model’s Adjusted Probability stats until I messed around with the GetAttributeFullStatistics method depicted above.
The next step after writing code is usually to deploy it for testing. The first step in this process with SSAS procedures is to right-click the Assemblies folder in your SSDM project and add a reference to your procedure project. Figure 2 depicts some of the mining models we used for practice earlier in this series, as well as some additional projects in the same Visual Studio solution; the mouse is positioned near the Assemblies folder, where the name of your procedure will appear in the same list as the WpfApplication1 project once you reference it. In between building your project and deploying it, make sure that the newly built .dll and accompanying files are in the procedure project’s Bin/Obj/Release folder, otherwise you won’t be able to debug the project. If necessary, copy all of the files from one of the other Debug folders if necessary. And for the love of all that’s holy, make sure you navigate to your mining project’s Properties window and set the Processing Option to Do Not Process. If you do not set this option, every mining model in your project will be processed. Like a monster from a horror film, “there is no way to stop it once it’s begun.” Basically, you’ll have to restart the server. It’s essentially Ripley’s strategy in Aliens: “Nuke the site from orbit. It’s the only way to be sure.” You don’t want to have to use the nuclear option on your server just to deploy a procedure – especially if you’re only testing it, which should only be done on a test server anyways.
Except for that single whopping mistake, I have yet to see any real catastrophic errors arise from failed deployments of SSAS procedures. I only encountered a couple of other minor errors when building and deploying my first one, which turned out to be far less of a hassle than I anticipated. As mentioned earlier, you can also add assemblies to an Analysis Server through the GUI by right-clicking the Assemblies folder under your SSAS database, as depicted in Figure 5. One of the interesting features of SSAS procedures compared to their T-SQL cousins is that you can choose to deploy them to an entire server rather than a single database, by right-clicking the Assemblies folder for the server rather than the database. Either way, you’re then prompted to fill in the dialog in Figure 5, which is titled Register Server Assembly title and Register Database Assembly depending on which folder you right-clicked. Otherwise there is no difference. I’m not aware of any means to deploy any type of CLR object to an entire server on the relational side, where commands like ALTER, CREATE and DROP ASSEMBLY must be repeated in each database where you want to use your CLR trigger, aggregate, procedure, function or type. It may be a shame that Analysis Services lacks comparable MDX or DMX clauses – the XMLA equivalents are so fat and incomprehensible[ii] that I won’t even bother to post them – but I actually had an easier time deploying assemblies to SSAS than to relational databases so far, because the Visual Studio-generated code has consistently proved to be unreliable in the case of relational CLR. When I prepared for this post, I wrote some procedures and other CLR objects on the relational side for practice, but spent more time debugging the deployment process than I did tracking down the AdomdServer Context bug. Basically, the scripts Visual Studio generated weren’t being updated as they should have been and therefore couldn’t apply the correct permission sets. I had much better luck simply writing my own ALTER, CREATE and DROP ASSEMBLY commands in T-SQL by hand. In contrast, I didn’t run into any such hitches due to bugs in Visual Studio when deploying SSAS procedures.
Another reason I put off using CLR for so long was the daunting prospect of debugging DLLs, which I’d always had trouble with in the past. This time around the process went much more smoothly, perhaps because I was using a much more advanced version of Visual Studio than in my last DLL project long ago, or perhaps because I was a little less inexperienced than before. Figure 6 depicts the dialog box that appears when you select the Attach to Process… command in Visual Studio. The Qualifier dropdown allows you to select the server name, which is covered up by the Select Code Type dialog, while the Transport dropdown allows you to choose between the Default and Remote debugging modes. The trick is to make sure the Show processes from all users and Show process in all sessions boxes are both checked and then use the Select… button to bring up the Select Code Type dialog, where you should choose both T-SQL and Managed Code (v4.0). If you only select the first, your breakpoints in VS will be white rather than brown and will be accompanied by a balloon message to the effect that they will never be hit. You’ll also receive similar bad tidings that the breakpoints will never be hit “because there are no symbols loaded’ if you forget to set your DLL project at the Startup Project in your Visual Studio solution.
Relational CLR code can be debugged through two additional methods, which are succinctly described by a user named msdntwt in the MSDN thread Three Ways to Debug CLR Stored Procedures. I had trouble with one of these alternate methods, which involves adding a test script in Visual Studio. Now that I understand the Attach To Process… debugger better and am more accustomed to writing queries in SSMS, I’ll refrain from debugging these alternate debugging methods, so to speak. Other users may prefer one of these other methods, but I’m more partial to writing code in Management Studio query windows, which will automatically trigger breakpoints in Visual Studio if the debugging process is set up correctly. One minor hitch I ran into with this approach on the relational side was that I had more than one SQL Server process running and attached to the wrong one on the first try, which is a situation I haven’t encountered yet with the Analysis Services process, msmdsrv.exe. The most time-consuming one was the fatal error, “Tracking the .Net Framework :Deploy error SQL01234: The database version is not supported.” The culprit turned out to be SQL Server Data Tools (SSDT), which is the name of the version of Visual Studio included with SQL Server. In previous versions it was known as Business Intelligence Development Studio (BIDS), but it was still basically a stripped-down version of Visual Studio. I already had SSDT installed, but apparently the version included with SQL Server 2012 wasn’t good enough, because I had to download an alternate version and integrate it with Visual Studio before I could build CLR objects that targeted the .Net 4.0 Framework. This is the Framework version SQL Server 2012 requires. So in a nutshell, the new version of SQL Server demands that you write code for a specific set of object models and ships with programming tools that only work with previous versions, thereby forcing you to download a different product by the same name.
It’s best to take such hacks in stride, because they come with the territory when you’re writing computer code of any kind. One of the things I like about Microsoft’s programming tools is that they seem to have far fewer mistakes of this kind than any other coding software I’ve tried, going as far back as my Macintosh programming days in the 1990s, or even further back to writing games in BASIC on an old Compucolor II. My posts may sometimes make SSDM and other software I discuss seem much more difficult to use than they really are – unlike Windows 8 and the current installers for Oracle and DB2, which I really meant to knock in separate posts because they really are atrocious.[iii] I tend to focus on the pitfalls new users might encounter, since I’m one of them, but this may give the impression that the stumbling blocks are more numerous than they really are. This shouldn’t obscure the fact that there really is a lot of untapped potential in CLR, especially when it comes to Analysis Services stored procedures. Readers of this post can tap more of that potential if they are cognizant of the many rookie mistakes I pointed out here.
The CLR functionality in Analysis Services could certainly use a few upgrades, such as counterparts to the ASSEMBLY commands in T-SQL and management views and tables like sys.assembly_files that are available on the relational side. One of the key differences between the two flavors of CLR is that a much wider variety of tasks can be performed on the relational side, such as .Net aggregates, triggers, functions and types. On the other hand, it makes little sense for Microsoft to upgrade SSAS stored procedures when they are still so little used, even in comparison to relational CLR, which is itself still uncharted territory more than seven years after its release. Extending it much further at this time would be a little bit like Lewis and Clark setting up fortresses in the Dakotas or Oklahoma Territory; no one would be around to use them for another half-century. The field of CLR seems to still be wide open and full of potential that has gone untapped largely because the understanding of when to apply it seems imprecise, even among the experts who really comprehend it much better than I do. I did some desultory reading on the topic while preparing for this article and came across quite a few use cases that filled an extremely wide variety of extremely narrow niches, without any central theme to tie all of the use cases together. There are still disputes going on about the wisdom of using some of them at all, including the classic discussion at Alex Papadimoulis’ .NET Blog “When Should I Use SQL-Server CLR User Defined Types (UDT)?” Alex set off a firestorm of controversy in 2005 when he said that CLR user defined types (UDTs) should never be used for any purpose, on the grounds that they violate the demand of first normal form that all table cells should be atomic. Some of those who objected pointed out examples of what they believed to be atomic units that nonetheless contained more than one numeric within them, such as latitude-longitude pairs, Fourier transforms and complex numbers. A poster named John Cowan summed up the counter-argument nicely: “I entirely agree that only scalar values should be allowed in database columns. The trouble is that one person’s ‘atom” is another’s “bucket o’ quarks with electrons in shells around it’.” Five years later in a different thread, Pedro DeRose pointed out that the new hierarchyid and spatial data types introduced in SQL Server 2008 were both implemented in CLR. Personally, I’ve considered using them to store B.C. dates prior to the year 0 A.D. that serves as the current minimum value in SQL Server’s datetime type, which is insufficient for anyone with an interest in storing historical dates prior to that. Thanks to imprecise time-keeping methods in antiquity it sometimes difficult to compare B.C. dates, but that is no reason why such functionality should be left entirely left out of SQL Server; users are more likely to have a need to store the date 1200 B.C. than they are the year 9999 A.D., which is the current max for datetime values. This is one use for CLR UDTs that I don’t think would violate first normal form, although it wouldn’t hurt to get Papadimoulis’ commentary on the idea, since I’m not really familiar with the topic enough to take a position in the discussion he started long ago.
The potential uses for CLR UDTs mentioned above have little in common with each other, except for the fact that they are more complex types that fill voids in T-SQL. They address completely different use case scenarios, ranging from temporal to spatial entities, to complex numbers that are off the Cartesian plane altogether. I’ve recently been reading Richard Kimball’s classic references on data warehousing and thanks to his clear writing style, I now understand the idea behind junk dimensions, which is to store a hodge-podge of dissimilar flag values to keep them from cluttering regular dimensions. It struck me that the gamut of CLR code in use today also seems like a junk drawer, full of functionality that would be nice to include in T-SQL, yet isn’t. Another rule of thumb is that the functionality ought to be computationally complex. A third principle is that the same tasks should either be impossible or inefficient to perform using set-based logic. As Jonathan Kehayias put it in a thread at SQLCLR.Net in 2008, it shouldn’t be used to reinvent the wheel by performing tasks that T-SQL can already do or do better, and in general ought not be used for routine SELECT, UPDATE, INSERT and DELETE activity. There may be exceptions to the latter conditions, such as using a CLR stored procedure to SELECT sensitive information that isn’t available through ordinary T-SQL, such as file system records. In the past, external access to resources like this required the use of extended stored procedures, which were inefficient and represented security holes. CLR stored procedures were intended as replacements with better memory management and security, among other things. These are also the same use cases which differentiate CLR stored procedures from CLR functions. As Solomon Rutzky puts it in his thread “CLR Performance Testing” at Simple-Talk.com, “The only reasons I can think of to create a CLR stored procedure would be to either modify the state of the database or server, or to return a dynamic result set. Both of these operations are not allowed in functions so stored procedures are necessary if you need to do those things. But for data retrieval and/or manipulation, most of the time T-SQL stored procedures will not only be easier to create and maintain, but it should perform better.” His post includes some nice performance tests, so I recommend it as a starting point for anyone who wants a better understanding of when CLR might out-perform T-SQL. Some other performance recommendations I ran across include using datareaders rather than dataset objects in CLR[iv]; avoiding table-valued parameters in CLR functions and using only SqlInt32 or System.Int32 for integer return types[v]; and being cognizant of the fact that “T-SQL table-valued functions materialize their results in temp tables, whereas CLR TVFs can stream them.”[vi] One of the most important pitfalls to avoid is implementing RBAR (Row-By-Agonizing-Row) operations in CLR when set-based logic could perform the same tasks more efficiently in T-SQL, which is capable of tackling many computationally complex problems. Of course, if set-based logic can’t solve the problem well, then CLR may be the optimal choice, since it can perform certain RBAR operations and calculations more efficiently. In these situations we would use either a CLR function, or perhaps an aggregate if we were performing reiterative calculations over many rows.
We need to be aware of the potential for performance degradation from insufficiently thought-out CLR code, but that doesn’t mean it can’t out-perform T-SQL in some situations. Adam Machanic, one of the leading CLR gurus, posted a CLR string splitting function that performs better than the T-SQL alternatives in a thread that is still active three years later.[vii] The uses listed below have been demonstrated to either perform better than T-SQL or can take actions T-SQL is completely incapable of on its own. I collected them from a variety of threads scattered across the Internet, which represent just a fraction of the solid sources for CLR information that are available out there in the wild, if you take the time to look for them.[viii]
- Regular expressions, which are perhaps the most commonly cited use.
- A bin to hex converter that performs efficiently.[ix]
- Nested string replacements: “batches of nested find and replace are also faster. REPLACE(REPLACE(REPLACE(….kind of stuff.”[x]
- Shredding XML in certain situations.
- “Additionally, avoiding xpath for XML parsing is a huge benefit – one offending tsql XML statement takes 35 minutes to complete, whereas a re-implementation in SQL CLR takes 3 seconds (plus it’s simpler for us .NET devs to understand, debug, and write tests against :] ).”[xi]
- Time zone conversions.[xii]
- Other complex date manipulation logic.
- Other string parsing routines.
- Using CLR stored procedures to circumvent SQL Server’s silly limitations on INSERT EXEC and other means of nesting calls to other stored procedures (which is now near the top of my list of code to implement myself).
- Accessing external data sources like networks and file system, which T-SQL procedures can’t do at all
- Modifying the state of the server in CLR procedures.
- Replacing SQL Server’s relatively inefficient and insecure extended stored procedures, which have been deprecated.
- Using UDTs to implement intricate yet atomic objects like complex numbers, Fourier transforms and ancient dates.
- Using CLR aggregates in place of T-SQL cursor operations.
There are no use cases for CLR triggers listed above because I haven’t run across any, but it may be a valid solution that has yet to be matched to the right problem. This is just another indication that we’ve barely scratched the surface of the potential of CLR, especially with SSAS stored procedures. In that uncharted territory there are bound to be niches where CLR could solve problems relevant to data mining. The types of problems it can be used to solve are hinted at by Henrik Staun Poulsen in a reply to the Simple-Talk.com thread “Time for a Rethink on SQL CLR?”, in which he writes that his team already routinely uses CLR to calculate statistical measures like median, linear regression, skewness, first quartiles and Weibull least squares on the relational side. So CLR clearly already has data mining applications, even if we’re talking strictly about relational CLR. It might become more important as the field matures to have a more explicit understanding of when to utilize the various types of CLR objects on the relational side, as well as when to make use of Analysis Services stored procedures. For example, it might prove desirable to do some pre-processing of certain measures in the relational database – such as calculations like skewness that are in the same league as standard deviation – before feeding them into cubes, where more pre-processing can be done, such as the built-in MDX linear regression functions. After that, they can be fed into more sophisticated SSDM algorithms. SSAS stored procedures might be applied at either of the latter two stages. Given that there aren’t any formal lists of best practices or self-evident use cases for relational CLR, we’re clearly still far away from an explicit understanding of when to apply SSAS procedures – let alone design guidelines for applying CLR at each stage of the mining process, beginning with pre-processing on the relational side of certain statistical building blocks. This field will be wide open for some time to come, just like data mining itself. I’ve often said that Microsoft doesn’t realize just how valuable its MDX language and SSDM mining tools really are, but the same is also true of CLR, on both the relational and analytic side. All three of these tools are so far ahead of the curve that they have received declining attention from Microsoft in each version of SQL Server since 2005, yet still possess much latent potential which has yet to be tapped. In the computing industry, seven years is an eon, but these tools were so far ahead of their time that they may still be relatively under-utilized another seven years from now.
[i] pp. 527-538, MacLennan, Jamie; Tang, ZhaoHui and Crivat, Bogdan, 2009, Data Mining with Microsoft SQL Server 2008. Wiley Publishing: Indianapolis.
[ii] Kind of like Rush Limbaugh.
[iii] Windows 8 is even more of a bungle than I thought it might be in my worst nightmare. If anything, I was too soft on it in my previous post. There’s a crying need for someone in the computer industry to keep a running total of all of its many problems, because I can’t keep track of them all. Vista was a good idea badly executed; Windows 8 it is just a bad idea badly executed. In fact, it is so ill-conceived that it would be a horrible thing if it were well-executed. It’s bad enough that it may go down as one of the biggest bombs in the history of computing, but it gets worse. I can make a plausible case that it’s an outright menace to the whole technology sector, but it is not yet the right time or place for this discussion.
[iv] See Davis, Tony, 2011, “Time for a Rethink on SQL CLR?” published Sept. 29, 2011 at Simple-Talk.com website. Available at https://www.simple-talk.com/blogs/2011/09/29/time-for-a-rethink-on-sql-clr/%20
[vi] DeRose, Pedro, 2010, reply to “Is There Any Justification for Really Using SQL CLR?” on Jan. 21, 2010 at the SQLServerCentral.com website. Available at http://www.sqlservercentral.com/Forums/Topic850429-386-1.aspx.
[vii] Machanic, Adam, 2009, “SQLCLR String Splitting Part 2: Even Faster, Even More Scalable,” published April 29, 2009 at SQLBlog.com. Available at http://sqlblog.com/blogs/adam_machanic/archive/2009/04/28/sqlclr-string-splitting-part-2-even-faster-even-more-scalable.aspx
[viii] Some other good sources of information include the StackExchange.com thread “What has Been Your Experience with SQL CLR for Complex Business Logic?”; Jennings, Roger, 2005, “When To Use SQLCLR, And When Not To,” published Sept. 1, 2005 in Visual Studio Magazine and available online at http://visualstudiomagazine.com/articles/2005/09/01/when-to-use-sqlclr-and-when-not-to.aspx; Evans, Kirk, 2007,
“If LINQ to SQL Is Good, Is SQLCLR Still Evil?” published Sept. 19, 2007 at MSDN.Com and available online at http://blogs.msdn.com/b/kaevans/archive/2007/09/19/if-linq-to-sql-is-good-is-sqlclr-still-evil.aspx; and the StackExchange.com thread “Using CLR User Defined Functions in SQL Server” started by Miles D. on Feb. 1, 2011. Available online at http://dba.stackexchange.com/questions/1010/using-clr-user-defined-functions-in-sql-server
[x] See the reply by the user Lowell to the thread “Is There Any Justification for Really Using SQL CLR?” on Jan. 20, 2010 at the SQLServerCentral.com website. Available at http://www.sqlservercentral.com/Forums/Topic850429-386-1.aspx
[xi] See Kevin Pullin’s reply on Oct. 4, 2011 to the Simple-Talk.com thread “Time for a Rethink on SQL CLR?” started by Tony Davis on Sept. 29, 2011. Available online at https://www.simple-talk.com/blogs/2011/09/29/time-for-a-rethink-on-sql-clr/
[xii] IBID. Pullin writes that “SQL CLR is a lifesaver, both in terms of performance and usability, in areas where tsql falls flat on its face. We use it for extremely basic things that you’d think would be natively exposed: integer/long parsing, time zone conversions.”