Implementing Fuzzy Sets in SQL Server, Part 0: The Buzz About Fuzz
By Steve Bolton
…………I originally planned to post a long-delayed series titled Information Measurement with SQL Server next, in which I’d like to cover scores of different metrics for quantifying the data our databases hold – such as how random, chaotic or ordered it might be, or how much information it might provide. I’m putting it off once again, however, because I stumbled onto a neglected topic that could be of immediate benefit to many DBAs: fuzzy sets and their applications in uncertainty management programs and software engineering processes. Since SQL Server is a set-based storage system, I always suspected that the topic would be directly relevant in some way, but never expected to discover just how advantageous it can be. As in my previous series on this blog, I’m posting this mistutorial series in order to introduce myself to the topic, not because I know what I’m talking about; writing about it helps reinforce what I learn along the way, which will hopefully still be of some use to others once all of the inevitable mistakes are overcome. In fact, I guarantee that every DBA and .Net programmer out there has encountered problems which could be more easily and quickly solved through these proven techniques for modeling imprecision, which is precisely what many software engineering and data modeling problems call for. Despite the fact that my own thinking on the topic is still fuzzy (as usual) I’m certain this series can be immediately helpful to many readers, since there’s such an incredible gap between the math, theory and implementations of fuzzy set techniques in other fields on one hand, and their slow adoption in the relational and data mining markets on the other.
…………Instead of beating around the bush, I’ll try to encapsulate the purposes and use cases of fuzzy sets as succinctly as possible: basically, you look for indefinite terms in ordinary speech, then squeeze what little information content you can out of them by assigning grades to records to signify how strongly they belong to a particular set. Most fuzzy set problems are modeled in terms of natural language like this. The overlap with Behavior-Driven Development (BDD) and user stories is quite obvious, but after reading on those hot topics a year prior to learning about fuzzy sets, I was immediately struck by how little these techniques of modeling imprecision are apparently used, and by how easy it would be to incorporate them into database and application development processes. Uncertainty is a notorious problem in any engineering process, but sets with graded memberships can even be used to capture it and flesh it out more thoroughly, as part of one of the programs of “uncertainty management” I’ll describe later in this series.
From Crisp Sets to Membership Functions
These powerful techniques arise from the quite simple premise that we can assign membership values to records, which some SQL Server users might already be doing from time to time, without realizing that they are approaching the borderlands of fuzzy set theory. Most relational and cube data is in the form of what mathematicians call “crisp sets,” which don’t require membership functions because they’re clear-cut yes-or-no decisions; to theoreticians, these are actually just a special case of a broader class of fuzzy sets, distinguished only by the fact that their membership functions are limited to values of either 0 or 1. In the relational field as it stands today, you either include a row in a set or you don’t, without any in-between. In contrast, most fuzzy membership functions assign continuous values between 0 and 1; although other scales are possible, I have yet to see an example in the literature where any other scale was used. I doubt it is wise to use any other range even if there might be a performance boost of some kind in applying larger-scale float or decimal data types, given that the 0 to 1 range helps integrate fuzzy sets with the scales used in a lot of other hot techniques I’ll cover later, like Dempster-Shafer evidence theory, possibility theory, decision theory and my personal favorite, neural net weights. That overlap transforms fuzzy sets into an interchangeable part of sorts, in what might be termed modular knowledge discovery.
…………That all sounds very grandiose, but anyone can practice picking out fuzzy sets represented in everyday speech. Artificial intelligence researchers Roger Jang and Enrique Ruspini provide a handy list of obvious ones in a set of slides reprinted by analytics consultant Piero P. Bonissone, including Height, Action Sequences, Hair Color, Sound Intensity, Money, Speed, Distance, Numbers and Decisions. Some corresponding instances of them we encounter routinely might include Tall People, Dangerous Maneuvers, Blonde Individuals, Loud Noises, Large Investments, High Speeds, Close Objects, Large Numbers and Desirable Actions.[i] The literature is replete with such simple examples, of which imprecise weather and height terms like “cloudy,” “hot” and “short” seem to be the most popular. The key thing to look for in any BDD or user story implementation is a linguistic state where the speech definitely signifies something, but the meaning is not quite clear – particularly when it would still be useful to have a sharper and numerically definable definition, even when we can’t be 100 percent precise.
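To make the crisp-versus-fuzzy distinction concrete, here is a minimal sketch built on the “Tall People” example. It is written in Python purely as a language-neutral illustration; the function names and the 160, 180 and 190 centimeter thresholds are my own assumptions, not values taken from any of the cited sources, and a linear ramp is only one of many possible membership functions.

```python
def crisp_tall(height_cm: float, cutoff_cm: float = 180.0) -> float:
    """Crisp set: membership is limited to exactly 0 or 1 (cutoff is hypothetical)."""
    return 1.0 if height_cm >= cutoff_cm else 0.0


def fuzzy_tall(height_cm: float, low_cm: float = 160.0, high_cm: float = 190.0) -> float:
    """Fuzzy set: grade rises linearly from 0 at low_cm to 1 at high_cm.

    The two thresholds are made-up values chosen only for illustration.
    """
    if height_cm <= low_cm:
        return 0.0
    if height_cm >= high_cm:
        return 1.0
    return (height_cm - low_cm) / (high_cm - low_cm)


if __name__ == "__main__":
    for height in (150.0, 175.0, 185.0, 200.0):
        print(height, crisp_tall(height), fuzzy_tall(height))
```

Under these assumed thresholds, someone 175 centimeters tall is simply excluded from the crisp set, but belongs to the fuzzy set with a grade of 0.5. The crisp function is just the special case in which the ramp collapses to a single cutoff.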
Filling a Unique Niche in the Hierarchy of Data Types
It may be helpful to look at fuzzy sets as a new tool occupying a new rung in the ladder of Content types we already work with routinely, especially in SQL Server Data Mining (SSDM). At the lowest level of data type complexity we have nominal data, which represents categories that are not ranked on any scale; these are basically equivalent to the Discrete Content type in SSDM and are often implemented in text data types or tinyint codes in T-SQL. On the next rung up the ladder we have ordinal data in which categories are assigned some rank, although the gaps may not be defined or even proportional; above that we have continuous data types (or the best approximation we can get, since modern computers can’t handle infinitesimal scales) that are often implemented in T-SQL in the float, numeric and decimal data types. Fuzzy sets represent a new bridge between the top two rungs, by providing more meaningful continuous values to ordinal data that in turn allow us to do productive things we couldn’t do with them before, like performing arithmetic, set operations or calculating stats. Any fuzzy set modeling process ought to focus on looking for data that is ordinal with an underlying scale that is not precisely discernible, but in which it would be useful to work with a continuous scale. That really isn’t much more difficult than picking imprecise terminology out of natural language, which anyone can make a game of. Given their “ability to translate imprecise/vague knowledge of human experts” we might also want to make a habit of flagging instances where we know a rule is operative, but has not yet been articulated.
…………If one were to apply these techniques to database server and other computing terminologies, one of the most obvious examples of imprecise terms would be “performance.” As George J. Klir and Bo Yuan point out in their classic tome Fuzzy Sets and Fuzzy Logic: Theory and Applications, this is actually an instance of a specific type of fuzzy set called a fuzzy number, which I will introduce later in the series.[ii] Say, for example, that you have a table full of performance data, in which you’ve graded the records on scales of 0 to 1 based on whether they fall into categories like “Acceptable,” “Good” and perhaps “Outside Service Level Agreement Boundaries.” That still leaves open the question of what the term “performance” itself means, so it constitutes another level of fuzziness on top of the membership issue; in fact, it might be necessary to use some of the techniques already hashed out by mathematicians decades ago for combining the opinions of multiple experts to arrive at a common fuzzy definition of it.
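As a rough illustration of the set operations and simple stats that graded memberships make possible, the sketch below applies the standard minimum, maximum and complement operators to a handful of performance records. Everything in it is hypothetical: the record names, the grades and the helper function names are invented for the example, and the min/max definitions are only the most common choice among several families of fuzzy operators.

```python
# Hypothetical membership grades (0 to 1) for three performance records.
acceptable = {"query_a": 0.9, "query_b": 0.6, "query_c": 0.1}
good = {"query_a": 0.7, "query_b": 0.2, "query_c": 0.0}


def fuzzy_intersection(a: dict, b: dict) -> dict:
    """Standard fuzzy AND: the smaller of the two grades for each record."""
    return {key: min(a[key], b[key]) for key in a}


def fuzzy_union(a: dict, b: dict) -> dict:
    """Standard fuzzy OR: the larger of the two grades for each record."""
    return {key: max(a[key], b[key]) for key in a}


def fuzzy_complement(a: dict) -> dict:
    """Standard fuzzy NOT: one minus the grade."""
    return {key: 1.0 - grade for key, grade in a.items()}


def scalar_cardinality(a: dict) -> float:
    """A simple statistic: the sum of all grades in the set."""
    return sum(a.values())


if __name__ == "__main__":
    print(fuzzy_intersection(acceptable, good))
    print(fuzzy_union(acceptable, good))
    print(fuzzy_complement(good))
    print(scalar_cardinality(acceptable))
```

When every grade is restricted to 0 or 1, these three operators give the same answers as ordinary crisp intersection, union and complement, which is one reason they are treated as the default.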
Modeling Natural Language
The heavy math in that resource may be too much for some readers to bear, but I highly recommend at least skimming the third section of Chapter 8, where Klir and Yuan identify many different types of fuzziness in ordinary speech. They separate them into four possible combinations of unconditional vs. conditional and unqualified vs. qualified fuzzy propositions, such as the statement “Tina is young is very true,” in which the truth qualifier “very true” makes it qualified and the absence of any if-then condition makes it unconditional.[iii] They also delve into identifying “fuzzy quantifiers” like “about 10, much more than 100, at least about 5,” or “almost all, about half, most,” each of which is modeled by a different type of fuzzy number, which I’ll describe at a later date.[iv] Other distinct types to watch for in natural language include linguistic hedges such as “very, more, less, fairly and extremely” that are used to qualify statements of likelihood or truth and falsehood. These can be chained together in myriad ways, in statements like “Tina is very young is very true,” and the like.[v]
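For readers who want to see how a linguistic hedge might act on a membership grade, here is a minimal sketch in Python. It assumes one common convention from the fuzzy set literature, in which “very” squares a grade and “fairly” takes its square root; other definitions are possible. The membership function for “young” and its age thresholds of 20 and 40 are made up for the example.

```python
import math


def young(age: float) -> float:
    """Hypothetical membership in 'young': 1 at age 20 or below, 0 at 40 or above."""
    if age <= 20:
        return 1.0
    if age >= 40:
        return 0.0
    return (40 - age) / 20


def very(grade: float) -> float:
    """Assumed hedge: squaring a grade makes the term more demanding."""
    return grade ** 2


def fairly(grade: float) -> float:
    """Assumed hedge: the square root of a grade makes the term more lenient."""
    return math.sqrt(grade)


if __name__ == "__main__":
    grade = young(30)           # halfway down the ramp
    print(grade)                # young
    print(very(grade))          # very young
    print(fairly(grade))        # fairly young
    print(very(very(grade)))    # hedges chained together
```

Because each hedge maps a grade between 0 and 1 to another grade between 0 and 1, hedges can be chained freely, which is what makes long strings of qualifiers tractable.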
In a moment I’ll describe how chaining together such fuzzy terms and fleshing out other types of imprecision can lead to lesser-known but occasionally invaluable twists on fuzzy sets, but for now I just want to call attention to how quickly this adds new layers of complexity to an otherwise simple topic. That is where the highly developed ideas of fuzzy set theory come in handy. The math for implementing all of these natural language concepts has existed for decades, so there’s little reason to reinvent the wheel – yet nor is there a need to overburden readers with all of the equations and jargon, which can look quite daunting on paper. There is a crying need in the data mining field for people willing to act as middlemen of sorts between the end users of the algorithms and their inventors, in the same way that a mechanic fills a need between automotive engineers and drivers; as I’ve pointed out before, it shouldn’t require a doctorate in artificial intelligence to operate data mining software, but the end users are nonetheless routinely buried in equations and formulas they shouldn’t have to decipher. It is imperative for end users to know what such techniques are used for, just as drivers must know how to read a speedometer and operate a brake, but it is not necessary for them to provide lemmas, or even know what a lemma is. While writing these mistutorial series, I’m trying to acquire the skills to do that for the end users by at least removing the bricks from the briefcase, so to speak, which means I’ll keep the equations and jargon down to a minimum and omit mathematical proofs altogether. The jargon is indispensable for helping mathematicians communicate with each other, but is an obstacle to implementing these techniques in practice. It is much easier for end users to think of this topic in terms of natural language, in which they’ve been unwittingly expressing fuzzy sets their whole lives on a daily basis.
I can’t simplify this or any other data mining topic completely, so wall-of-text explanations like this are inevitable – but I’d wager it’s a vast improvement over having to wade through whole textbooks of dry equations, which is sometimes the only alternative. Throughout this series I will have to lean heavily on Klir and Yuan’s aforementioned work for the underlying math, which I will implement in T-SQL. If you want a well-written discussion of the concepts in human language, I’d recommend Dan McNeill’s 1993 book Fuzzy Logic.[vi]
The Low-Hanging Fruits of Fuzzy Set Applications
These concepts have often proved to be insanely useful whenever they’ve managed to percolate down to various sectors of the economy. The literature is so chock full of them I don’t even know where to begin; the only thing I see in common to the whole smorgasbord is that they seem to seep into industries almost haphazardly, rather than as part of some concerted push. Their “ability to control unstable systems” makes them an ideal choice for many control theory applications.[vii] Klir and Yuan spend several chapters on the myriad implementations already extant when they wrote two decades ago, in fields like robotics,[viii] estimation of longevity of equipment[ix], mechanical and industrial engineering[x], assessing the strength of bridges[xi], traffic scheduling problems[xii] (including the infamous Traveling Salesman) and image sharpening.[xiii] Another example is the field of reliability ratings, where Boolean all-or-nothing rankings like “working” vs. “broken” are often not sufficient to capture in-between states.[xiv] In one detailed example, they demonstrate how to couple weighted matrices of symptoms with fuzzy sets in medical diagnosis.[xv] Klir and Yuan also lament that these techniques are not put to obvious uses in psychology[xvi], where imprecision is rampant, and provide some colorful examples of how to model the imprecision inherent in interpersonal communication, particularly in dating.[xvii] As they point out, some messages are inherently uncertain, on top of any fuzz introduced by the receiver in interpretation; to that we can add the internal imprecision of the speaker, who might not be thinking through their statements thoroughly or selecting their words carefully.
…………Then there is a whole class of applications directly relevant to data mining, such as fuzzy clustering algorithms (like C-Means)[xviii], fuzzy decision trees, neural nets, state sequencing (“fuzzy dynamic systems and automata”)[xix], fuzzified virtual chromosomes in genetic algorithms[xx], fuzzy parameter estimation, pattern recognition[xxi], fuzzy regression procedures and regression on fuzzy data.[xxii] Most of that falls under the rubric of “soft computing,” a catch-all term for bleeding edge topics like artificial intelligence. The one facet of the database server field where fuzzy sets have succeeded in permeating somewhat since Klir and Yuan mentioned the possibility[xxiii] is fuzzy information retrieval, which we can see in action in SQL Server full-text catalogs.
The Future of Fuzzy Sets in SQL Server
Like many of their colleagues, however, they wrote about ongoing research into fuzzy relational databases by researchers like Bill Buckles and F.E. Petry that has not come into widespread use since then.[xxiv] That is where this series comes in. I won’t be following any explicit prescriptions for implementing fuzzy relational databases per se, but will instead leverage the existing capabilities of T-SQL to demonstrate how easy it is to add your own fuzz for imprecision modeling purposes. Researcher Vivek V. Badami pointed out more than two decades ago that fuzz takes more code, but is easier to think about.[xxv] It takes very little experience with fuzzy sets to grasp what he meant by this – especially now that set-based languages like T-SQL that are ideal for this topic are widely used. I wonder if someday it might be possible to extend SQL or systems like SQL Server to incorporate fuzziness more explicitly, for example, by performing the extra operations on membership functions that are required for joins between fuzzy sets, or even more, fuzzy joins between fuzzy sets; later in the series I’ll demonstrate how DBAs can quickly implement DIY versions of these things, but perhaps there are ways to do the dirty work under the hood, in SQL Server internals. Maybe a generation from now we’ll see fuzzy indexes and SQL Server execution plans with Fuzzy Anti-Semi-Join operators – although I wonder how Microsoft could implement the retrieval of only one-seventh of a record and a third of another, using B-trees or any other type of internal data structure. In order to determine if a record is worthy of inclusion, it first has to be retrieved and inspected instead of passed over, which could lead to a quandary if SQL Server developers tried to implement fuzzy sets in the internals.
…………The good news is that we don’t have to wait for the theoreticians to hash out how to implement fuzzy relational databases, or for Microsoft and its competition to add the functionality for us. As it stands, T-SQL is already an ideal tool for implementing fuzzy sets. In the next article, I’ll demonstrate some trivial membership functions that any DBA can implement on their own quite easily, so that these concepts don’t seem so daunting. The difficulties can be boiled down chiefly to the fact that the possibilities are almost too wide open. Choosing the right membership functions to model the problem at hand is not necessarily straightforward, nor is selecting the right type of fuzzy set to model particular types of imprecision. As in regular data modeling, the wrong choices can sometimes lead not only to the waste of server resources, but also to incorrect answers. The greatest risk, in fact, consists of fuzzifying relationships that are inherently crisp and vice-versa, which can lead to fallacious reasoning. Fuzz has become a buzzword of sorts, so it would be wise to come up with a standard to discern its true uses from its abuses. In the next installment, I’ll tackle some criteria we can use to discern the difference, plus provide a crude taxonomy of fuzzy sets and get into some introductory T-SQL samples.
[i] p. 18, Bonissone, Piero P., 1998, “Fuzzy Sets & Expert Systems in Computer Eng. (1).” Available online at http://homepages.rpi.edu/~bonisp/fuzzy-course/99/L1/mot-conc2.pdf
[ii] pp. 101-102, Klir, George J. and Yuan, Bo, 1995, Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall: Upper Saddle River, N.J.
[iii] IBID., pp. 222-225.
[iv] IBID., pp. 225-226.
[v] IBID., pp. 229-230.
[vi] McNeill, Dan, 1993, Fuzzy Logic. Simon & Schuster: New York.
[vii] p. 8, Bonissone.
[viii] Klir and Yuan, p. 440.
[ix] IBID., p. 432.
[x] IBID., pp. 427-432.
[xi] IBID., p. 419.
[xii] IBID., pp. 422-423.
[xiii] IBID., pp. 374-376.
[xiv] IBID., p. 439.
[xv] IBID., pp. 443-450.
[xvi] IBID., pp. 463-464.
[xvii] IBID., pp. 459-461.
[xviii] IBID., pp. 358-364.
[xix] IBID., pp. 349-351.
[xx] IBID., p. 453.
[xxi] IBID., pp. 365-374.
[xxii] IBID., pp. 454-459.
[xxiii] IBID., p. 385.
[xxiv] IBID., pp. 380-381.
[xxv] p. 278, McNeill.
Posted on June 13, 2016, in Implementing Fuzzy Sets in SQL Server and tagged Analytics, Data Mining, Data Science, Fuzzy, Fuzzy Logic, Fuzzy Sets, Knowledge Discovery, SQL, SQL Server, Steve Bolton, T-SQL.