Stay Tuned…for a SQL Server Tutorial Series Juggling Act
Posted by Stevan Bolton
by Steve Bolton
…………If all goes according to plan, my blog will return in a few weeks with two brand new series, Using Other Data Mining Tools with SQL Server and Information Measurement with SQL Server. Yes, I will be attempting what amounts to a circus act among SQL Server bloggers, maintaining two separate tutorial series at the same time. Like my series A Rickety Stairway to SQL Server Data Mining, it might be more apt to call them mistutorials; once again I’ll be tackling subject matter I’m not yet qualified to write about, in the hopes of learning as I go. The first series will compare and contrast the capabilities of SSDM against other tools in the market, with the proviso that it is possible to run them on data stored in SQL Server. That eliminates a lot of applications that run only on Linux or otherwise cannot access SQL Server databases right off the bat. The software packages I survey may include such recognizable names as RapidMiner and WEKA, plus a few I promised others to review long ago, like Autobox and Predixion Software. The latter is the brainchild of Jamie MacLennan and Bogdan Crivat, two former members of Microsoft’s Data Mining Team who were instrumental in the development of SSDM.
…………The second series of amateur tutorials may be updated more frequently because those posts won’t require lengthy periods of time to evaluate unfamiliar software. This will expand my minuscule knowledge of data mining in a different direction, by figuring out how to code some of the building blocks used routinely in the field, like Shannon’s Entropy, Bayes factors and the Akaike Information Criterion. Not only can such metrics be used in the development of new mining algorithms, but they can be applied out-of-the-box to answer myriad basic questions about the type of information stored in our SQL Server databases – such as how random, complex, ordered and compressed it might be. No mining company would ever excavate a new quarry without first performing a geological survey of what precious metals might be beneath the surface; likewise, it may be helpful for data miners to have a surface impression of how much useful information might be stored in our tables and cubes, before digging into them with sophisticated algorithms that can have formidable performance costs. Scores of such measures are scattered throughout the mathematical literature that underpins data mining applications, so it may take quite awhile to slog through them all; while haphazardly researching the topic, I ran across a couple of quite interesting measures of information that seem to have been forgotten. I will try to make these exercises useful to SQL Server users by providing T-SQL, MDX, Common Language Runtime (CLR) functions in Visual Basic and perhaps even short SSDM plug-in algorithms, as well as use cases for when they might be appropriate. To illustrate these uses, I may test them on freely available databases of interest to me, like the Higgs Boson dataset provided by the Univeristy of California at Irvine’s Machine Learning Repository. I may also make use of a tiny 9 kilobyte dataset on Duchenne’s form of muscular dystrophy, which Vanderbilt University’s Department of Biostatistics has made publicly available, and transcriptions of the Voynich Manuscript, an enigmatic medieval manuscript with an encryption scheme so obscure that even the National Security Agency (NSA) can’t crack it. Both tutorial series will use the same datasets in order to cut down on overhead and make it easier for readers to follow both. When and if I manage to complete both series, the next distant item on this blog’s roadmap will be a tutorial series on how to use various types of neural nets with SQL Server, which is a topic I had some intriguing experiences with many years ago.
About Stevan BoltonI am a VB programmer and SQL Server DBA with an interest in MDX and multidimensional applications. I have an alphabet's soup of certifications: * 3 MCTS certifications in SQL Server 2008 R2, including a recent exam in MDX and Analysis Services * an MCDBA in SQL Server 2000 * an MCSD in VB6. I've kept up with each version of VB since then but haven't taken the newer exams * I also have a Master's in American history with a concentration in foreign affairs, as well as some work towards a doctorate in Latin American history * My B.S. is in print journalism I'll be posting whatever code I can to help out the SQL Server and VB developer communities. There is always someone out there more knowledgeable, so if you're a guru, feel free to correct any information I might post. I haven't yet been paid professionally to work with some of the technologies I've been trained in and enjoy, like MDX, so the word of those who have ought to carry more weight. There's a shortage of information on some of the topics I'll be posting on, such as the arcane error messages in Analysis Services (SSAS), so users might still find some value in my posts. If you learn of any job openings for MDX, SSAS, SQL Server and VB, feel free to E-mail me.
Posted on July 1, 2014, in DIY Data Mining, Integrating Other Data Mining Tools with SQL Server, Outlier Detection with SQL Server. Bookmark the permalink. Leave a comment.