Integrating Other Data Mining Tools with SQL Server, Part 1.1: The Weaknesses of WEKA

By Steve Bolton

…………The same rules that applied to my amateur mistutorial series on A Rickety Stairway to SQL Server Data Mining and Outlier Detection with SQL Server are in play for this series of occasional articles, which will provide a brief overview of using various third-party data mining tools with SQL Server. Basically, I’m a novice writing about my experiences as I go, partly because it helps me absorb the material a lot faster and partly so others won’t repeat my inevitable mistakes. I’ve only had a cursory introduction to tools like Waikato Environment for Knowledge Analysis (WEKA), an open source data mining suite developed at the University of Waikato in New Zealand[1], so I’m certain that aficionados can point out many more legitimate use cases that I haven’t thought of, in addition to many corrections for my desultory observations. Most of the use cases for WEKA simply don’t apply to SQL Server users, however, so it ought to occupy only a small portion of a DBA’s toolbox. My aim in this series is to assess how these tools may benefit SQL Server users, not the wider pool of data miners, so I will have to gloss over many obvious strengths they may possess. That includes many aspects of WEKA that the user community is rightly enthusiastic about, but which are difficult to integrate in a SQL Server environment. The Cliff’s Notes version of this article is that WEKA is useful to us mainly in cases where all three of the following conditions are met: 1) we need a particular algorithm that SSDM doesn’t provide and which can’t be coded easily in T-SQL or Multidimensional Expressions (MDX)[2]; 2) we can’t afford other professional tools that implement the same functionality; and 3) we only need to operate on small tables and views. It has some definite pluses, including the price tag: zero.
Despite the old adage that “you get what you pay for,” it behaved much better than IBM’s supposedly professional data mining suite did when I wrote Thank God I Chose SQL Server part I: The Tribulations of a DB2 Trial. The fact that installation was even possible put it head and shoulders above the Windows version of Oracle’s data mining tools, which have regressed badly since I last installed them in the mid-2000s. These misadventures are recapped in Thank God I Chose SQL Server part II: How to Improperly Install Oracle 11gR2. This series is essentially a continuation of those two articles, hopefully with results that will do more than serve as cautionary tales about the poor software quality of SQL Server’s competitors.
…………I can’t really put WEKA into perspective without doing a little editorializing and pontificating on some broader software industry issues. Some might see the buzzword “open source” attached to the product and automatically assume that it somehow provides some benefit over proprietary products, but this isn’t necessarily the case. Any open source software is much more vulnerable to security holes, since coding in the open is akin to an army handing over access to all of its battle plans, troop movements and communications to a foreign power. The fact that it is provided free under the GNU General Public License is a plus, but the risks of open source development also have to be factored in. Since it’s been downloaded by at least 2.5 million users at Sourceforge.net, we can be sure that it has better support than certain other fly-by-night open source projects. On the other hand, popularity is not always indicative of the worth of software; platform independence can be very difficult to maintain in many areas of the software business, especially when market forces get involved. One of the key lessons of software economics over the last few decades is that if there’s a stampede toward a particular product, stragglers eventually have no choice but to use it too – regardless of whether or not it suits their uses, or works at all. This is a lesson I had learned by the mid-1990s, when my father and I had to switch from programming for the Mac to Windows, even though we preferred the other platform. Another big minus is that WEKA is dependent on the Java Runtime Environment (JRE), which simply can’t hold a candle to .Net in a Windows environment. I understand precisely why the U.S. Department of Homeland Security once warned users to disable Java entirely[3], long after China and others had already restricted its use: the only times in recent memory I’ve detected security issues on my poor, wheezing artifact of a development machine, the Java Runtime has been the backdoor that allowed entry. For that reason I disallow its use as a matter of policy, except when testing out software packages that depend on it. One of the alleged benefits of Java is its portability, but the problem is that code written for the Java Runtime just doesn’t port well to a Windows environment; in my experience, portability of code has been a holy grail among the software giants since the 1990s, but every product I can think of that emphasized it as a benefit has fallen very short. Many of the aforementioned problems I encountered in the articles on Oracle and DB2 data mining were due to the fact that many of their components are dependent on Java, which may be the cat’s meow in the Linux world, but is a productivity killer in the Windows ecosystem.
…………In fact, I ran into installation problems right off the bat with WEKA, which were fortunately resolved much more satisfactorily than with DB2 and Oracle. The culprit once again, however, was Java, not the standard Windows installer available from the WEKA homepage. At present, running WEKA is a real inconvenience, given that I had to remove the Java environment to stay within my security policy and would thus have to reinstall it all over again, should some specific data mining use case arise. If the JRE is not installed, a command prompt window will pop up and immediately vanish whenever you run WEKA, and the process will likewise disappear from Task Manager. WEKA 3.7.11 and the JRE together took up 110.1 megabytes of space, which is a pittance in these days of cheap storage. The drawback, however, is that the size is indicative of the lack of functionality; the user interface is light precisely because it doesn’t do as much as your average SQL Server component, plus most of the algorithms developed by the user community have to be downloaded separately. The installer includes tabs and other controls for the License Agreement, Choose Start Menu Folder, and Start WEKA checkbox, all of which are fairly self-explanatory. The main choices can be found under the Choose Components window depicted in Figure 1, where an Install JRE control and Associate Files checkboxes for .arff and .xrff files can be found. After installation the structure of the new WEKA directory will look something like Figure 2. The Data directory includes a series of .arff files that are apparently sample datasets, such as “breast-cancer.arff” and “ReutersGrain-train.arff,” while the changelogs folder contains nothing but CHANGELOG files with different numbers appended, which apparently refer to particular versions of WEKA. The doc folder is almost exclusively composed of .html Help files.
Most of the other files in the screenshot are self-explanatory, with a couple of .gif, .ico, .exe, Java .jar and .bat files, plus a readme, a .pdf for the documentation and an uninstaller. It is also worth noting that the working directory will be set to an address like “C:\Users\your user name\wekafiles.”
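Since a missing JRE only shows up as a console window that flashes and vanishes, it can help to check for a Java launcher before starting WEKA. The following is a minimal Python sketch under a few assumptions: it only looks for an executable named java on the PATH, so it will not notice a runtime that the WEKA installer bundled into its own folder, and the function names are hypothetical.

```python
import shutil
import subprocess
from typing import Optional


def find_java(executable: str = "java") -> Optional[str]:
    """Return the full path of the Java launcher if it is on PATH, else None."""
    return shutil.which(executable)


def describe_java(executable: str = "java") -> str:
    """Summarize whether a Java runtime is reachable (assumption: PATH lookup only)."""
    path = find_java(executable)
    if path is None:
        return "No Java runtime found on PATH; the WEKA launcher may flash and exit."
    # `java -version` writes its banner to stderr rather than stdout.
    result = subprocess.run([path, "-version"], capture_output=True, text=True)
    banner = (result.stderr or result.stdout).strip().splitlines()
    first_line = banner[0] if banner else "version unknown"
    return f"Java found at {path}: {first_line}"
```

Calling describe_java() from a script or an interactive session prints either the location and version banner of the runtime or a warning that none was found.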

Figure 1: The Choose Components Window of the WEKA Installer
WEKA Choose Components

Figure 2: The WEKA Folder after Installation
WEKA Folder

Figure 3: The WEKA GUI Chooser
WEKA GUI Chooser

…………On a successful launch, the program starts out with the GUI Chooser in Figure 3. I’ll simply ignore the fourth choice, Simple CLI, which is merely a Java-based console application that is even more awkward to work with than ordinary Windows command prompts. As I pointed out in my SSDM series, command prompts aren’t something data miners ought to be hassled with on a regular basis, given that the topic is so broad, taxing and sophisticated that we can’t afford such unnecessary distractions. I stick to whatever GUI is available whenever possible unless a console app has some indispensable, exclusive functionality, but I’m not aware of any the Simple CLI has that are not found in the other three GUI options. In addition to these, the documentation says that it is also possible to install Multiple Document Interface (MDI) capabilities in the GUI, but I didn’t try this feature.[4] Of the remaining three, the Experimenter offers the most functionality not found or easily implementable in SSDM or other SQL Server components. In my Outlier Detection series I spoke frequently of how statistical hypothesis testing methods are not common use cases in a SQL Server environment, even in the case of data mining activities. If a need arose, however, we could use the WEKA Experimenter interface to calculate stats like entropy, mean absolute error, false positives, Root Mean Squared Error (RMSE) and others listed in Figure 4. In the next graphic, we see how it is possible to select standard statistical parameters like significance levels and perform two types of Paired T-tests. Some of the other controls in Figure 5 are self-explanatory, like the Sorting (Asc. By) button, Displayed Columns button and Show Std. Deviations checkbox. When I tried to set up an experiment using the Start button on the Run tab, in conjunction with Distribute experiment and no Hosts file, I received a “No hosts specified!” warning. The first time I was able to use the Stop button to end the run, but the second time around, both the Start and Stop buttons were greyed out – thereby forcing me to restart the whole experiment. The window didn’t freeze per se, so I was able to save the associated .exp definition file, but I had to close the Experiment Environment window, open it again, then reopen the same .exp. The Setup and Run tabs allow users to perform ordinary tasks like randomizing data and producing training data, creating new .exp Experimenter definition files, selecting datasets to operate on and choosing Destination types, like .csv files, JDBC databases or the default .arff files. Users can also select up to 10 Runs, with the results being reported in a Log window on the Run tab. The Output format on the Analyse tab also includes such types as plain text, GNUPlot, HTML and LaTeX, along with various options for including or excluding certain types of data. These were initially saved to the default location, the AppData/Local/Temp folder.
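For readers who want to see what a few of these statistics boil down to, here is a minimal Python sketch of mean absolute error, RMSE and a textbook paired t statistic. It is an illustration of the standard formulas only, not WEKA’s implementation; the Experimenter’s testers may apply corrections that this sketch omits, and the function names are hypothetical.

```python
import math
from statistics import mean, stdev


def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference."""
    return math.sqrt(mean((a - p) ** 2 for a, p in zip(actual, predicted)))


def paired_t_statistic(scores_a, scores_b):
    """Textbook paired t statistic: mean(d) / (stdev(d) / sqrt(n)).

    scores_a and scores_b are per-run scores of two models on the same runs.
    Uses the sample standard deviation (n - 1 in the denominator).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))
```

The resulting t value would then be compared against a critical value for n - 1 degrees of freedom at the chosen significance level, which is the role the significance setting plays on the Analyse tab.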

Figure 4: Some Stats That WEKA Experimenter Can Calculate
Experimenter Stat Selection

Figure 5: The Analyse Tab of the WEKA Experimenter
Experimenter Interface

 

…………Nevertheless, the WEKA Experimenter will probably only prove useful in a SQL Server user environment where several preconditions are met. First, there must be a need to perform tasks like experiment design and model comparison – but only for the basic stats available through the Cols button, some of which are depicted in Figure 4. Second, the need has to be just great enough to justify installing and learning to use the tool, which will take time – without being so pressing or frequent that proprietary tools are called for. In between trying WEKA and writing up this post, I became acquainted with Minitab, which is far beyond my budget at about $1,500 for a single user license. On the other hand, Minitab does far more than WEKA Experimenter ever could in terms of things like advanced experiment design, performing ANOVA and calculating more advanced model stats like the Akaike Information Criterion (which I hope to code in T-SQL in my next self-tutorial series). Everything WEKA can do, Minitab can do better; even the user interface is an immediate and obvious improvement. If I expected to do more than 75 hours of work of this kind for a client over the course of a couple of years, then a tool like Minitab would pay for itself, even if the labor rate was as low as $20 an hour[5]. The amount of time saved between using a professional and open source interface isn’t even comparable – although as I mentioned above, WEKA is actually more reliable than certain professional data mining packages like DB2 and Oracle, which often don’t work at all. I haven’t tried any of Minitab’s high-priced competitors yet, although I suspect that when I finally get around to it, I will find their interfaces and algorithms far superior to WEKA’s. The third caveat is that the datasets have to be much smaller than we’re accustomed to working with in SQL Server. This limitation applies to every component in WEKA, which sharply reduces the number of use cases we can utilize it for.
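The break-even figure above is simple division, which the following Python sketch spells out using the numbers quoted in this article. The function name is hypothetical, and as footnote 5 concedes, the result only marks the point at which the labor has paid for the license, not the point of profit.

```python
def break_even_hours(license_cost: float, hourly_rate: float) -> float:
    """Hours of billable work needed for the labor to cover the license cost."""
    if hourly_rate <= 0:
        raise ValueError("hourly_rate must be positive")
    return license_cost / hourly_rate


# Figures from the text: roughly $1,500 for a license, $20 an hour for labor.
hours = break_even_hours(1500, 20)
```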
…………Most of the action occurs in the WEKA Explorer, where we find a series of menus including Program (containing commands to log all of the text results, display MemoryUsage and Exit) and Help, with the usual .html documentation links and the like. Some of the most valuable documentation resources include the WEKA Wiki, a guide to Data Mining: Practical Machine Learning Tools and Techniques, and the WEKA forums. The 327-page WekaManual.pdf installed in the WEKA directory in Program Files is actually quite well-written and informative in comparison to other open source documentation I’ve seen before. In fact, it is indispensable for users trying to open XML Attribute Relation File Format (.xrff) and Attribute-Relation File Format (.arff) files, including the aforementioned practice datasets. The first format is used for “representing the data in a format that can store comments, attribute and instance weights,”[6] while the latter can be used to load the 23 plain text sample datasets included at installation, or others downloaded from the Internet. If you can’t get WEKA to work with SQL Server, you can still use it in conjunction with these .arff files, which are basically plain text, plus other extensions like .xrff, .bif and .dot that I’ve never seen used in a SQL Server environment. These can be inspected via the ArffViewer on the Tools menu, which also contains a SQLViewer that is really just a glorified text editor, not a worthy opponent for SQL Server Management Studio (SSMS). It is through this menu item that we begin our adventure in connecting to SQL Server. This comment in the WekaManual.pdf doesn’t exactly inspire confidence:

“A common query we get from our users is how to open a Windows database in the Weka Explorer. This page is intended as a guide to help you achieve this. It is a complicated process and we cannot guarantee that it will work for you. The process described makes use of the JDBC-ODBC bridge that is part of Sun’s JRE/JDK 1.3 (and higher). The following instructions are for Windows 2000. Under other Windows versions there may be slight differences.”[7]

…………Chapters 14 and 15 contain some information about connecting to Windows and JDBC databases, but it’s not really relevant. I complained often in my Rickety series about the incompleteness of the documentation for SSDM, but this takes the cake: the instructions are for Microsoft SQL Server 2000 (Desktop Engine), as well as Microsoft Access. Thankfully, I was able to connect with the aid of the WEKA Wiki page on Windows Databases and Anders Spur Hansen’s excellent tutorial on connecting through SQL Server, so I won’t waste time reinventing the wheel here.[8] Basically the procedure boils down to creating a DSN and using it in conjunction with the Query command through WEKA Explorer’s OpenDB… menu item, with some extra steps thrown in for Java-related hacks like editing the DatabaseUtils.props file. In the event of an error message like “No suitable driver found” when connecting via SQL Server, be aware that there’s at least one old MSDN thread about this, which turns out to be yet another Java issue.[9]
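To make the DSN step a little more concrete, here is a hedged Python sketch that renders the two connection entries such an edit typically touches. Several things here are assumptions rather than facts from this article: the key names jdbcDriver and jdbcURL and the bridge class name are given as recalled from WEKA’s sample props files and should be checked against the manual and the tutorials cited above, the DSN name WekaDemo is hypothetical, and the JDBC-ODBC bridge was removed from Java 8 onward, so this route applies only to older runtimes.

```python
# Assumption: class name of the JDBC-ODBC bridge shipped with older Sun/Oracle JREs.
ODBC_BRIDGE_DRIVER = "sun.jdbc.odbc.JdbcOdbcDriver"


def render_props(driver_class: str, jdbc_url: str) -> str:
    """Render the two connection lines for a DatabaseUtils.props file.

    Assumption: the file uses `jdbcDriver` and `jdbcURL` as key names.
    Other entries in the file (type mappings and so on) are not covered here.
    """
    lines = [
        "# Connection entries only; leave the rest of the file as shipped.",
        f"jdbcDriver={driver_class}",
        f"jdbcURL={jdbc_url}",
    ]
    return "\n".join(lines) + "\n"


def odbc_props(dsn_name: str) -> str:
    """Build the entries for a connection that goes through a Windows DSN."""
    return render_props(ODBC_BRIDGE_DRIVER, f"jdbc:odbc:{dsn_name}")


# Hypothetical DSN name, used purely for illustration.
example = odbc_props("WekaDemo")
```

Printing example shows the lines to compare against an existing props file; the sketch deliberately does not write to disk, since the right location for the file depends on how WEKA was installed.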
…………To date, I’ve been unable to find a means of connecting WEKA to Analysis Services, but the point is moot because the SQLViewer simply can’t handle cube-sized data. In fact, one of its chief drawbacks is that it simply chokes on relational tables of the size SQL Server users work with all the time, which limits us to really small datasets. The Lilliputian size of the 23 sample datasets included out-of-the-box is a clue to just how limited WEKA’s processing and display capabilities are in comparison to SQL Server: the breast-cancer.arff file, for example, has a measly 286 rows, whereas the largest, supermarket.arff, has 4,627, which is several million rows short of the “Big Data” league. SSMS can effortlessly display millions of rows, even on my clunker of a development machine, but WEKA simply gives up the ghost after a few thousand. To stress test the SQLViewer, I set the maximum rows to more than 51,000, knowing I was loading data from a 41,360-record table that occupied a mere 0.883 megs in a half-gig SQL Server database. This not only locked up the SQLViewer window, but set SQLServer.exe running on one core. The situation deteriorated from there, as I immediately discovered that you can’t simply quit Java apps in the Task Manager, since they’re not listed there. Nor could I exit any of the separate windows that open with WEKA using their control boxes, once one of them froze. To extricate myself, I had to use the Kill Process command in SQL Server’s Activity Monitor, which had the unexpected secondary effect of killing all of the WEKA windows, plus the JRE.
…………Since this is a matter of critical importance for SQL Server users who need to use WEKA, I set out to find a threshold below which SQL Server tables could be safely accessed. When I limited the same dataset to the first 1,500 rows, WEKA behaved itself, but at 5,500 it took about 2 minutes to unfreeze – during which SQL Server was locked on one core the whole time, which is unacceptable. Plan ahead for this limitation, because the performance of the whole server could be affected by accessing an unexpectedly small number of records. Just handling an ordinary record request that SSMS can do in a heartbeat can bring a server to its knees. After some experimentation I came to suspect that the breaking point was only a little above 5,500 records, which of course may be substantially higher on a real server or workstation. Even after successfully loading that many records, I still couldn’t open the WEKA Explorer, which was simply greyed out in the GUI Chooser after the crash. The Experimenter did the same after this particular Frankenstein’s Experiment. After restarting the program and finding the same problems again, I checked the LogWindow, which was full of red error messages with alarming titles like “Exception in thread ‘AWT-Event-Queue-0’ java.lang.NoClassDefFoundError.” Given that I was merely starting the program, I concluded that this was not good. I wasn’t able to restart WEKA successfully until I noticed that there were three different javaw.exe instances running and killed them all in Task Manager, then restarted the JRE.
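One way to stay under a threshold like this is to cap the row count in the query itself, before it ever reaches the SQLViewer. The following Python sketch builds a T-SQL SELECT with a TOP clause and refuses anything above a ceiling. The ceiling of 5,000 is only the rough figure suggested by the tests described here, and the table and column names in the example are hypothetical.

```python
import re
from typing import Sequence

# Rough ceiling suggested by the tests in this article; tune it per machine.
SAFE_ROW_CAP = 5000

# Conservative check so that only plain identifiers end up inside the brackets.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def capped_select(table: str, columns: Sequence[str],
                  max_rows: int = SAFE_ROW_CAP, schema: str = "dbo") -> str:
    """Build a T-SQL query that returns at most max_rows rows."""
    for name in (schema, table, *columns):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    if not 0 < max_rows <= SAFE_ROW_CAP:
        raise ValueError(f"max_rows must be between 1 and {SAFE_ROW_CAP}")
    column_list = ", ".join(f"[{c}]" for c in columns)
    return f"SELECT TOP ({max_rows}) {column_list} FROM [{schema}].[{table}]"


# Hypothetical table and columns, for illustration only.
query = capped_select("Monitoring", ["ID", "Value"], 1500)
```

Note that TOP without an ORDER BY returns an arbitrary subset of rows, so a sample taken this way is not guaranteed to be representative or repeatable; adding an ORDER BY or a sampling step would be a natural refinement.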
…………Given that WEKA had so many problems with merely displaying a few thousand rows, I strongly suspected that it wouldn’t be able to perform high-powered data mining calculations on the kinds of datasets I used for the Rickety and Outlier Detection series. There was simply no way that WEKA would be able to swallow the half-gig of fake Monitoring database data I routinely crammed through SSDM in the first series, nor the 11-million-row table of Higgs Boson data I tested my home-baked T-SQL outlier code on in the second. As a general rule of thumb, SSDM ought to be our go-to tool, except in cases where there’s a specific need for algorithms available only in WEKA. It is here that WEKA really shines, thanks to its vibrant user community and the growing pool of open source algorithms they’ve contributed; I still believe SSDM is a far better data mining tool than anything else I’ve yet used, but as I lamented in the Rickety series, Microsoft hasn’t supported it with any new algorithms since SQL Server 2008 R2. While familiarizing myself with the field over the last few years, I’ve been awe-struck by the sheer size of the gap between the data mining algorithms available in the academic literature and those that are available in the current software. This means there’s room for a lot of variation between software packages, none of which implements the same functionality as its competitors. The realization that building a toolbox that includes them all can be beneficial in meeting a wider range of use cases is what prompted me to write this series, in which I intend to match the software packages to the right problems. The limitation on dataset sizes hobbles WEKA even in these situations where it has a clear advantage over SSDM, but as long as we stay within the limit of about 5,000 records or so, it can meet some narrow use cases that SQL Server simply can’t. 
Now that we’ve got the preliminary steps of installing, connecting and displaying data out of the way, I’ll explain what WEKA can do with our data in the next article.

 

[1] Apparently the name’s a play on words, based on “a flightless bird with an inquisitive nature” that is native to New Zealand. See Asanka, Dinesh, 2013, “Weka 3: Data Mining Software in Java,” posted on Sept. 10, 2013 at the Toad World blog address http://www.toadworld.com/platforms/sql-server/b/weblog/archive/2013/09/10/weka-3-data-mining-software-in-java.aspx

[2] I rarely mention Data Analysis Expressions (DAX) in these contexts because I have found it more difficult to cook up custom data mining routines with it, in contrast to MDX and T-SQL. Perhaps my judgment is clouded though by the fact that I just plain dislike the language itself, which I find unnecessarily awkward and inflexible for my tastes.

[3] CBS News, 2013, “U.S. Tells Computer Users to Disable Java Software,” published Jan. 11, 2013 at the CBSNews.com web address http://www.cbsnews.com/news/us-tells-computer-users-to-disable-java-software/

[4] “If one prefers a MDI (“multiple document interface”) appearance, then this is provided by an alternative launcher called ‘Main’ (class weka.gui.Main).”

[5] Of course, at that rate the labor would end up paying for the program, rather than producing any immediate profit. The gist of the argument is clear though.

[6] p. 177, Bouckaert, Remco R.; Frank, Eibe; Hall, Mark; Kirkby, Richard; Reutemann, Peter; Seewald, Alex; Scuse, David, 2014, WEKA Manual for Version 3-7-11. The University of Waikato: Hamilton, New Zealand.

[7] IBID., p. 195.

[8] Hansen, Anders Spur, 2013, “Connect WEKA to SQL Server 2012 and ‘14’” posted Oct. 11, 2013 at the My Life with Business Intelligence blog address http://andersspur.wordpress.com/2013/10/11/connect-weka-to-sql-server-2012-and-14/

[9] See the replies by the user named “M i k e” and Joris Valkonet to the MSDN thread “Need Help Implementing Weka into SQL Server 2005” on Aug. 22, 2007. Available online at http://social.msdn.microsoft.com/forums/sqlserver/en-US/aaff46ae-a1ea-4297-9a74-e9a13effd21b/need-help-implementing-weka-into-sql-server-2005


About Stevan Bolton

I am a VB programmer and SQL Server DBA with an interest in MDX and multidimensional applications. I have an alphabet's soup of certifications: * 3 MCTS certifications in SQL Server 2008 R2, including a recent exam in MDX and Analysis Services * an MCDBA in SQL Server 2000 * an MCSD in VB6. I've kept up with each version of VB since then but haven't taken the newer exams * I also have a Master's in American history with a concentration in foreign affairs, as well as some work towards a doctorate in Latin American history * My B.S. is in print journalism I'll be posting whatever code I can to help out the SQL Server and VB developer communities. There is always someone out there more knowledgeable, so if you're a guru, feel free to correct any information I might post. I haven't yet been paid professionally to work with some of the technologies I've been trained in and enjoy, like MDX, so the word of those who have ought to carry more weight. There's a shortage of information on some of the topics I'll be posting on, such as the arcane error messages in Analysis Services (SSAS), so users might still find some value in my posts. If you learn of any job openings for MDX, SSAS, SQL Server and VB, feel free to E-mail me.

Posted on May 28, 2015, in Integrating Other Data Mining Tools with SQL Server.
