Blog Archives

Information Measurement with SQL Server, Part 3: Inverse Probability and Bayes Factors

By Steve Bolton

…………In the last segment of this series of amateur self-tutorials, we discussed how to code various ways of quantifying how much we don’t know about the data in SQL Server tables and cubes. The various probabilistic entropies I translated into T-SQL in those six articles can be viewed as measures of a particular sort of information, which might best be interpreted as “newsworthiness,” since they tell us how much we might learn from each data point. The rest of this wide-ranging series will have to be somewhat haphazard, due to the sheer number of information metrics and the fact that I’m learning about them as I go; writing on these topics helps me absorb the material a lot faster, while hopefully helping other amateurs avoid my inevitable mistakes. I thought it fitting, however, to complement the topic of information entropy by next discussing means of quantifying what we do know about our data. Bayes Factors are mercifully easy to calculate, since all we need to do is divide one probability by another and plug the ratio into a logarithm. In fact, we can reuse some of the code from Information Measurement with SQL Server, Part 2.4: Conditional and Joint Entropy and simply strip out the LOG operations that transformed the conditional probabilities into entropies. This step could also be dispensed with, if one of the chief uses of Bayes Factors didn’t involve multiplying them by one conditional probability in order to derive another. This information metric can be interpreted as a sort of crude gauge of existing knowledge, but is chiefly useful in adjusting probabilities in the light of new evidence, through a famous probability formula developed by Presbyterian Minister Thomas Bayes (1701-1761) and extended to induction a few decades later by Pierre-Simon Laplace (1749-1827).[1][2]
…………Bayes Factors are indeed simple to code, but making use of them requires a lot more context; this is basically the same paradox that made Information Measurement with SQL Server, Part 2.1: The Uses and Abuses of Shannon’s Entropy one of the longest articles in the last segment, although it was the simplest of the entropic metrics to code. This is due in part to the broad applications of Bayes’ Theorem, which essentially derives a conditional probability for one distinct event or value given another; this is arrived at by multiplying the probability of the first event by the conditional probability of the second given the first, then dividing by the overall probability of the second. The formula isn’t terribly difficult to follow even for an amateur like myself, but I don’t see any reason to post it and violate my longstanding ban on equations in this blog. Calculating a conditional probability in reverse in this way can be useful in some situations on its own, but can be harnessed for new uses by infusing it with additional meaning as a “degree of belief.” Basically, the probability of X given Y derived from this inverse probability is interpreted by Bayesian statisticians as a posterior distribution, or a degree of belief updated in light of the conditional probability of Y given X, which is known as a prior. The Bayes Factor merely represents the ratio between the other two elements in Bayes’ Theorem, the two overall probabilities for X and Y. If we know the Bayes Factor and the prior conditional probability, we can derive the posterior conditional probability, hence the common refrain “posterior is proportional to prior times likelihood.” Sometimes the term “likelihood ratio” is used as a synonym for Bayes Factors, but this can lead to confusion, given that it is also used for the likelihood-ratio tests, certain diagnostic tests and “the ratio of two likelihood functions.”[3] Moreover, the term “likelihood” carries specific connotations in statistical parlance, whereas in ordinary speech it is often used interchangeably with “probability.” For that reason, I’ll probably avoid the term in discussions of topics like Bayes Factors, even though they are directly related to likelihood.
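…………For anyone who would rather see that arithmetic as code than as an equation, the snippet below is a minimal T-SQL sketch of the reversal just described; the three input probabilities are arbitrary stand-ins chosen purely for illustration, not figures taken from any real dataset.

DECLARE @ProbabilityOfX float = 0.5,
@ProbabilityOfY float = 0.3,
@ConditionalProbabilityOfYGivenX float = 0.45 -- hypothetical inputs, for illustration only

-- Bayes' Theorem: multiply the probability of X by the conditional probability of Y given X,
-- then divide by the overall probability of Y to recover the conditional probability of X given Y
SELECT (@ConditionalProbabilityOfYGivenX * @ProbabilityOfX) / @ProbabilityOfY AS ConditionalProbabilityOfXGivenY -- returns 0.75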

From Inverse Probability to Decibans

                I’ll also stick to the broader and older term “inverse probability” in place of the now-common “Bayesian probability,” in order to stay out of the knock-down, drag-out turf wars that occur among statisticians today over the topic; these methods don’t necessarily have to be used in ways often labeled “Bayesian” today, which as we shall see, sometimes implies endorsement of controversial interpretations. The process was originally known as inverse probability, with good reason, as a few intuitive illustrations of its usage will demonstrate. Many of the intuitive examples given for inverse probability in the statistical literature are similar to one in Fazlollah Reza’s An Introduction to Information Theory, where the goal is to gauge the probability of particular combinations of red and black balls being taken from three urns. The catch is that the probability is derived after they are drawn, not before, as is typically the case in ordinary “frequentist” probability.[4] This is summarized quite succinctly in Dan McNeill’s Fuzzy Logic, one of the most readable books on the topics covered in my Implementing Fuzzy Sets in SQL Server series.[5] Basically, in ordinary probability we’re asking what color the next ball we draw is likely to be, but with inverse probability we’re transposing the question into, “What is the ratio of the balls in the urn?” given foreknowledge of those we’ve drawn so far. Following Laplace’s “principle of indifference” in the absence of extra information, we’d set the probabilities for the outcomes even, as in a uniform distribution[6]; with each subsequent draw from the urns, Bayesians would update these probabilities until they gradually approached the true distribution. A previous Wikipedia article on Bayes Theorem or Bayes Rule contained a well-written example, in which the goal was to guess the gender of a person with long hair, given an even distribution of men and women and the probabilities of men and women having long hair set to 15 and 75 percent respectively (with the remainder being unknown). In that example, the Bayes Factor would be a simple ratio between 75 and 15, which equals odds of 5:1. All we need to do is divide them and apply a LOG operation, which is normally done in base 10 and then multiplied by a factor of 10. These units are known as decibans, i.e. tenths of an information measure pioneered by famed cryptanalyst Alan Turing and alternately known as the hartley, in honor of the developer of the metric discussed in Information Measurement with SQL Server, Part 1: A Quick Review of the Hartley Function. Statisticians find decibans convenient because common odds ratios can be translated into them seamlessly, such as 100 to 1, which equals 20 decibans, a nice round number. Figure 1 also includes the calculations in base 2, in order to provide a comparison point with bits, the most common unit associated with entropic information measures. About 95 percent of the performance costs are incurred in the INSERT, which is identical to the initial code used in Information Measurement with SQL Server, Part 2.4: Conditional and Joint Entropy, except with the entropic LOG calculations stripped out. This simplification may be the reason why it executes in just 1:14 on the same two float columns of the 11-million-row Higgs Boson dataset I’ve been using throughout this series for demonstration purposes[7], compared to the 1:46 required for the Joint Entropy article.
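…………As a quick sanity check on that arithmetic, the long-hair example can be worked through in a couple of lines of T-SQL; the 75 and 15 percent figures come straight from the example described above, and the resulting 5:1 ratio works out to roughly 6.99 decibans or 2.32 bits.

DECLARE @OddsRatio float = 75.0 / 15.0 -- the 5:1 odds from the long-hair example

SELECT @OddsRatio AS OddsRatio,
10 * Log(@OddsRatio, 10) AS Decibans, -- about 6.9897
Log(@OddsRatio, 2) AS Bits -- about 2.3219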
One of the conditional probabilities can be reconstructed from the other as long as its inverse and individual probabilities are available, or can at least be derived from statistical estimation techniques[8]; in the database server world, we’re more likely to have the luxury of using actual proportions derived from extensive counts, whereas in other fields these inputs would have to be derived from probability distribution formulas and likelihood estimation methods. I included the individual proportions and conditional probabilities in the final SELECT merely to convey how Bayes’ Theorem can be used to derive them from each other in this manner.

Figure 1: Deriving Bayes Factors with T-SQL
DECLARE @Count1  bigint, @Count2 bigint, @JointCount bigint

SELECT @Count1=Count(*)
FROM DataMiningProjects.Physics.HiggsBosonTable
WHERE Column1 IS NOT NULL

SELECT @Count2=Count(*)
FROM DataMiningProjects.Physics.HiggsBosonTable
WHERE Column2 IS NOT NULL

SELECT @JointCount=Count(*)
FROM DataMiningProjects.Physics.HiggsBosonTable
WHERE Column1 IS NOT NULL OR Column2 IS NOT NULL

DECLARE @EntropyTable table
(Value1 decimal(33,29),
Value2 decimal(33,29),
ValueCount bigint,
Proportion1 decimal(38,37),
Proportion2 decimal(38,37),
JointProportion decimal(38,37)
) 

INSERT INTO @EntropyTable
(Value1, Value2, ValueCount, Proportion1, Proportion2, JointProportion)
SELECT Value1, Value2, ValueCount, Proportion1, Proportion2, JointProportion
FROM (SELECT Value1, Value2, ValueCount,
       CASE WHEN GroupingIDColumn1 = 0 AND GroupingIDColumn2 = 1 THEN ValueCount / CAST(@Count1 AS float) ELSE NULL END AS Proportion1,
       CASE WHEN GroupingIDColumn1 = 1 AND GroupingIDColumn2 = 0 THEN ValueCount / CAST(@Count2 AS float) ELSE NULL END AS Proportion2,
       CASE WHEN GroupingIDColumn1 = 0 AND GroupingIDColumn2 = 0 THEN ValueCount / CAST(@JointCount AS float) ELSE NULL END AS JointProportion
       FROM (SELECT Column1 AS Value1, Column2 AS Value2, Count(*) AS ValueCount, GROUPING_ID(Column1) AS GroupingIDColumn1, GROUPING_ID(Column2) AS GroupingIDColumn2
                     FROM Physics.HiggsBosonTable
                     WHERE Column1 IS NOT NULL AND Column2 IS NOT NULL
                     GROUP BY CUBE (Column1, Column2)) AS T1) AS T2 

SELECT 10 * Log(Proportion1 / Proportion2, 10) AS BayesFactorForHypothesisTesting,
10 * ABS(Log(Proportion1 / Proportion2, 10)) AS BayesFactorInDecibans,
ABS(Log(Proportion1 / Proportion2, 2)) AS BayesFactorInBits,
Proportion1 / Proportion2 AS OddsRatio,
Proportion1, Proportion2, ConditionalProbabilityOfColumn2GivenColumn1, ConditionalProbabilityOfColumn1GivenColumn2,
(CAST(ConditionalProbabilityOfColumn2GivenColumn1 as float) * CAST(Proportion1  as float)) / CAST(Proportion2 as float) AS RecalculationOfConditionalProbabilityOfColumn1GivenColumn2
FROM (SELECT Proportion1, Proportion2, JointProportion, JointProportion / Proportion1 AS ConditionalProbabilityOfColumn2GivenColumn1, JointProportion / Proportion2 AS ConditionalProbabilityOfColumn1GivenColumn2
      FROM (SELECT JointProportion
            FROM @EntropyTable
            WHERE Value1 = 0.61253148317337036130000000000 AND Value2 = 0.54839861392974853520000000000) AS JointProportion,
           (SELECT Proportion1
            FROM @EntropyTable
            WHERE Value1 = 0.61253148317337036130000000000 AND Proportion1 IS NOT NULL) AS Proportion1,
           (SELECT Proportion2
            FROM @EntropyTable
            WHERE Value2 = 0.54839861392974853520000000000 AND Proportion2 IS NOT NULL) AS Proportion2) AS T1

Figure 2: Results from the Higgs Boson Dataset

…………At 0 decibans or bits, the probability is evenly split. When the odds ratio is less than 1:1, the value, event, hypothesis or model embodied in the divisor is more likely, whereas when it is greater than 1:1, the evidence is weighted in favor of the dividend[9]; since an odds ratio in favor of the divisor results in a fraction less than 1, the LOG applied to translate this into a Bayes Factor will result in a negative value. This makes it difficult to compare them to entropic measures, most of which are non-negative by definition, hence the need for the two ABS functions in the last SELECT. On the other hand, the minus sign is useful when the ratio is interpreted as a ratio of the weights of evidence for the event or value in the dividend, vs. the evidence for the contrary hypothesis embodied in the divisor. Hence the BayesFactorForHypothesisTesting column, sans an ABS operation. Bayes Factors are used directly in the Bayesian version of hypothesis testing, on the grounds that the odds can be interpreted as evidence in favor of one hypothesis or the other.[10] The same sort of comparison can be made for the global figures, in which case they are useful in model selection, a use more closely associated with data mining than ordinary statistics. For this use case, “an advantage of the use of Bayes factors is that it automatically, and quite naturally, includes a penalty for including too much model structure. It thus guards against overfitting.”[11] Furthermore, the ability to update probabilities in a sequential manner as new information is acquired can be leveraged to improve Design of Experiments (DOE). It is more common to use Bayes Factors for these tasks than as stand-alone, pure information measures, but they can be treated as a sort of crude gauge of a priori knowledge. The mathematical properties that make Bayes Factors attractive for hypothesis testing and model selection were worked out in detail long ago, but some of them – such as additivity and eventual convergence to a Gaussian or “normal” distribution[12] – might be useful in terms of an information metric as well.
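…………To make the sign convention concrete, here is a tiny example using a hypothetical proportion ratio of 0.2, i.e. odds of 1:5 in favor of the divisor; the raw logarithm comes out negative, which is what the BayesFactorForHypothesisTesting column preserves, while the ABS version discards the sign so that the figure can sit alongside the non-negative entropic measures.

DECLARE @ProportionRatio float = 0.2 -- hypothetical odds of 1:5, favoring the divisor

SELECT 10 * Log(@ProportionRatio, 10) AS BayesFactorForHypothesisTesting, -- about -6.99, i.e. evidence against the dividend
10 * ABS(Log(@ProportionRatio, 10)) AS BayesFactorInDecibans -- about 6.99 once the sign is discarded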
…………I’m sure that theoreticians have worked out their relationships to measures of information entropy already, but I haven’t gotten that far in the literature yet. I do know that the journal Entropy recently published a special edition on integrating Bayesian stats with important information theory axioms like the Principles of Maximum Entropy and Minimum Cross Entropy, so apparently it’s still a hot research topic.[13] I strongly suspect that they share much in common with the Self-Information metric I introduced in the last article, given that all we have to do to derive the ratio of Self-Information is to apply the LOG operations first, then perform the division. I omitted those calculations in order to shrink the graphic in Figure 2, but calculating the ratio of the Self-Information of these two distinct values for Column1 and Column2 yields a figure of 1.06758789451323 for base 2, compared to 0.786263282058452 for the BayesFactorInBits depicted above. It would actually be quite economical to calculate Bayes Factors and Self-Information together, just as it wouldn’t cost us much more to derive the Conditional and Joint Entropies in the same pass across the original table in the INSERT. If we’re going to derive the two individual probabilities and one of the conditional probabilities across two columns, it’s really simple in T-SQL to derive the other conditional probability alongside them, rather than calculating it backwards from Bayes’ Theorem.
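…………The sketch below illustrates that relationship on a pair of made-up proportions (the actual Column1 and Column2 proportions behind the figures cited in this paragraph are not reproduced here): dividing first and then taking the LOG yields the Bayes Factor in bits, whereas taking the LOGs first and then dividing yields the ratio of the two Self-Information values.

DECLARE @Proportion1 float = 0.03, @Proportion2 float = 0.05 -- hypothetical proportions, for illustration only

SELECT ABS(Log(@Proportion1 / @Proportion2, 2)) AS BayesFactorInBits, -- divide first, then take the LOG
(-1 * Log(@Proportion1, 2)) / (-1 * Log(@Proportion2, 2)) AS RatioOfSelfInformation -- take the LOGs first, then divide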

Uses and Interpretations of Bayes Factors

                The popular examples of counting balls of different colors in separate urns would be easier to solve in SQL Server, given the extensive counts that are readily available in our tables and cubes. Bayesian probability would probably find wider use cases in terms of updating probabilities as new information comes in, which calls for interpreting them as “beliefs” in a manner similar to that introduced in Implementing Fuzzy Sets in SQL Server, Part 10.1: A Crude Introduction to Dempster-Shafer Evidence Theory. Bayesian stats can also be integrated with Decision Theory. In fact, it intersects with many other areas of statistics, which give rise to a wide array of use cases; for example, when Maximum Likelihood Estimates (MLEs) are used in Bayes Factors, it basically turns into an ordinary likelihood ratio test.[14] Bayes Factors are thus useful in a smorgasbord of applications beyond serving as just a pure information metric, which really doesn’t even qualify as the tip of the iceberg because it’s a somewhat obscure usage. Many of those applications occur in the field of data mining, including the Naïve Bayes algorithm in SSDM, which I covered in A Rickety Stairway to SQL Server Data Mining, Algorithm 1: Not-So-Naïve Bayes. Bayes Factors might be adapted for all kinds of DIY mining algorithms, but I wonder if Bayesian stats might be integrated in interesting ways with my pet topic, neural nets. The human brain cannot act on information it does not have, so it stands to reason that it must have a particularly efficient way of using priors to update beliefs, to help us quickly zero in on truth (or the fallacies we prefer out of pride, which has been one of mankind’s most tragic imperfections throughout history) out of an endless sea of wrong answers. It might daisy-chain priors together to update our internal knowledge bases with incoming sensory input every split second, perhaps by calibrating anticipation in the same manner theorized by artificial intelligence researcher Jeff Hawkins.[15]
…………The limits of human perception of differences in degrees of belief apparently occur around 1 deciban.[16] This corresponds to inserting a ratio of 1.258925411794168 into the LOG operation, or 0.332192809488737 bits using base 2. Various informal scales for gauging the strength of evidence have been developed over the years, including one by famed Bayesian Harold Jeffreys and a more recent one by statistics Profs. Robert E. Kass and Adrian E. Raftery.[17] They’re not terribly different from the one mentioned in a .pdf document published by Biostatistics Prof. Leonhard Held, who says that odds of 1:1 to 3:1 are “barely worth mentioning,” 3:1 to 20:1 are “substantial” and 20:1 through 150:1 are “strong,” with anything beyond that constituting “very strong” evidence in favor of the hypothesis embodied in the dividend.[18] The three fences of 3:1, 20:1 and 150:1 correspond to 4.77121254719662, 13.0102999566398 and 21.7609125905568 decibans respectively, or 1.58496250072116, 4.32192809488736 and 7.22881869049588 bits.
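…………Those conversions are easy to verify in T-SQL; the quick check below reproduces the deciban and bit figures for the three fences cited above.

SELECT OddsRatio,
10 * Log(OddsRatio, 10) AS Decibans,
Log(OddsRatio, 2) AS Bits
FROM (VALUES (3.0), (20.0), (150.0)) AS T1(OddsRatio)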

Too Much of a Good Thing: The Danger of Fringe Interpretations of Bayesian Statistics

                Beyond these somewhat subjective criteria, the inherent meaning of Bayesian statistics is hotly debated. As is often the case in fields related to information theory, the difficulty rests in the interpretation, not the calculations; in fact, interpretation of Bayesian stats has been the subject of incessant trench warfare among statisticians and researchers in related fields for the last few decades. Apparently the gist is that some Bayesians claim their techniques supersede all others, including those for describing uncertainty[19], which many other statisticians take issue with. Peter Cheeseman, a leading Bayesian, took note of the “nearly religious fervor” and “stand-up fights” that arose in the field as a result, as did many of his counterparts on the other side of the fence.[20] This rivalry apparently spilled over into funding wars and appointments; for example, McNeill laments in Fuzzy Logic that Bayesians allegedly torpedoed fuzzy research in the U.S. in the 1980s.[21] He points out, however, that at the time he wrote two decades ago, there were no Bayesian products in the statistical marketplace yet, but plenty that made use of fuzzy sets.[22] George J. Klir, one of the authors of my favorite mathematical resource on fuzzy sets, criticizes Bayesianism on the grounds that it assumes degrees of belief when the corresponding numbers are in fact hazy: it “forces us to be certain about our uncertainty.” “By making vague expectations precise,” he says, “the Bayesians fail to know their ignorance, and by disdaining to model vagueness appropriately, they ignore available knowledge.” Klir then suggests that they read the wisdom of the Tao Te Ching: “Knowing ignorance is strength. Ignoring knowledge is sickness.”[23]
…………Bayes Factors open up a whole world of Bayesian statistics that I have yet to set foot in, and do not intend to incorporate much of in this series, which is strictly concerned with coding information metrics in SQL Server for purposes of DIY data mining. Therefore, I won’t get into a holy war about something I don’t know enough about by joining one camp or another at this point. I’ll stick with my usual approach of finding ways of pragmatically incorporating these techniques in my toolbelt wherever I can and matching them up with appropriate use cases; I agree with some of the Bayesian criticisms of standard statistical tests involving P-values, confidence levels and the like, but do not think they warrant jettisoning the whole “frequentist” gamut of techniques. If the characterizations of some Bayesians as extremists who insist on throwing out many of the competing tools are accurate, then I would definitely have a problem with that kind of obsessiveness. I would also point out that the common Bayesian refrain that “certainty is simply total belief”[24] does not necessarily wash, since a person can be totally certain of falsehoods, even to the point of willingly embracing bad evidence; as Bill W. and his disciples are wont to point out, “Denial just ain’t a river in Egypt.” Other misgivings about Bayesian inference would also have to be cleared up, such as the ramifications of using it to reason “backwards from the evidence to the hypothesis.”[25] Counter-intuitive results are also known to arise in Bayesian interpretations, just as they do with the Evidence Theory belief and plausibility measures I covered in the fuzzy sets tutorial series. For example, in the well-known “Oliver’s Blood” problem, if two blood samples are left at a crime scene with Type O and Type AB blood, the evidence would be slightly against a person with Type O blood accounting for one of them, even though the former is more common than the latter among the general population by odds of about 60:1.[26] I’m not sure how Bayesians compensate for fallacious reasoning that often arises from ordinary probability problems, such as the common belief that priors have a direct, causative relationship in determining posteriors. This is an accurate statement in certain card games where drawing without replacement is the rule (or against the rules, in the case of Ben Affleck), but not in coin tosses, where getting ten heads in a row has no effect whatsoever on the odds of getting an eleventh. This is one of the most dangerous aspects of probability theory, a trap I’d argue that some physicists have fallen into by overinterpreting quantum theory as arising from probabilities, rather than interpreting them as field effects.
…………The ultimate culprit there is the same as in extreme interpretations of Bayesianism: the idea that reality is subjective. Both Bayesian and frequentist models of probability are ultimately derived from combinatorics, as are Bayes Factors and entropic measures; this in turn gives them a rigid foundation in logic, but leavening it with the wrong kind of subjectivity might be tantamount to adding a fly to the ointment. It is critical to always keep in mind that subjectivity is fine when it is strictly a matter of gauging personal aesthetic preferences, including the questions we prefer to ask while data mining. “Is my data mine played out?” is an example of a subjectively selected question, which might be answered through such objective means as declining information gain and a progressive lack of informative priors. Given that probabilities do not determine outcomes, we might have to check every last square inch of ground to be certain that a particular mine does not hide the nuggets of gold we’re looking for, but these measures can serve as a reasonable guide on when to stop and look elsewhere. As I’ve pointed out many times in previous posts, subjectivity becomes problematic when confused with the answers to those questions, which are always a matter of hard, cold, objective reality. I’ve seen noted authors in the hard sciences and mathematics unconsciously let their definitions of these things drift across the course of their books, in ways which obscure this dividing line without justification. To put it bluntly, subjectivity in the wrong place leads to madness; when men have accepted the idea that their personal thoughts affect reality, they end up in institutions, and when the same ideas infect entire civilizations, they come crashing down. It is a permanent temptation of mankind, a dangerous undercurrent that has reared its head time and again through human intellectual history in guises like phenomenalism, nominalism, solipsism (which is considered a marker of schizophrenia, unless it is championed by professors) and certain misuses of David Hume’s philosophy. It is dangerous enough when civilizations tamper with the definition of purely abstract ideas like paper money (which always leads to hyperinflation), but doubly so when subjectivity degrades their capacity for wise decision-making. As I have pointed out from time to time in the fuzzy set tutorial series and A Rickety Stairway to SQL Server Data Mining, many of these mind-blowing concepts are apparently fraught with danger; perhaps that is why so many of the “rock star” mathematicians and physicists have gone mad or become degenerates, although I’d wager that overweening pride is more to blame. In the next installment I’ll pilot this series through less dangerous waters that I nevertheless really don’t have the expertise to navigate[27], in search of answers about the Kullback-Leibler Divergence, a tried-and-true distance measure used often in data mining.

[1] Pierre-Simon Laplace (1749-1827). See the Wikipedia page “Pierre Simon Laplace” at http://en.wikipedia.org/wiki/Pierre-Simon_Laplace for his birth and death dates.

[2] See Statistical, LLC, 2015, “Likelihood,” available at the Bayesian-Inference web address http://www.bayesian-inference.com/likelihood and the Wikipedia pages “Bayes’ Theorem” and “Bayes’ Rule” at http://en.wikipedia.org/wiki/Bayes%27_theorem and http://en.wikipedia.org/wiki/Bayes%27_rule

[3] See the Wikipedia page “Likelihood Ratio” at http://en.wikipedia.org/wiki/Likelihood_ratio

[4] p. 48, Reza, Fazlollah M., 1961, An Introduction to Information Theory. McGraw-Hill: New York.

[5] p. 178, McNeill, Dan, 1993, Fuzzy Logic. Simon & Schuster: New York.

[6] IBID.

[7] This dataset is made publicly available by University of California at Irvine’s Machine Learning Repository. I downloaded it a few tutorial series ago and converted it to a SQL Server table, which now takes up about 5 gigabytes of space in a sham DataMiningProjects database I’ve been using ever since for practice data.

[8] “In order to complete the definition of a Bayesian model, both the prior distributions and the likelihood must be approximated or fully specified.” See Statistical, LLC, 2015, “Likelihood,” available at the Bayesian-Inference web address http://www.bayesian-inference.com/likelihood

[9] pp. 45-46, Downey, Allen B., 2012, Think Bayes: Bayesian Statistics Made Simple, Version 1.03. Green Tea Press: Needham, Mass. Available online at the Green Tea Press web address http://www.greenteapress.com/thinkbayes/thinkbayes.pdf

[10] Held, Leonhard, 2011, “Introducing Bayes Factors,” monograph published Nov. 25, 2011 at the Vienna University for Economics and Business Institute for Statistics and Mathematics web address http://statmath.wu.ac.at/research/talks/resources/talkheld.pdf

[11] See the Wikipedia article “Bayes Factor” at http://en.wikipedia.org/wiki/Bayes_factor

[12] Good, I. J., 2011, “A List of Properties of Bayes-Turing Factors,” undated monograph released March 9, 2011 by the National Security Agency as a result of Freedom of Information Act request #58820. Available online at the NSA web address https://www.nsa.gov/public_info/_files/tech_journals/list_of_properties.pdf

[13] Giffin, Adom, 2013, “Special Issue ‘Maximum Entropy and Bayes Theorem’,” commentary published Jan. 7, 2013 at the Multidisciplinary Digital Publishing Institute web address http://www.mdpi.com/journal/entropy/special_issues/bayes-theorem

[14] IBID.

[15] pp. 147, 238, Hawkins, Jeff, 2004, On Intelligence. Times Books: New York.

[16] Good, I.J., 1985, “Weight of Evidence: A Brief Survey,” pp. 249-270 in Bayesian Statistics 2, Bernardo, J.M.; DeGroot, M.H.; Lindley, D.V. and Smith, A.F.M., eds. Elsevier Science Publishers: North Holland. Available online at the California Water Boards web address http://www.waterboards.ca.gov/water_issues/programs/tmdl/docs/303d_policydocs/207.pdf. Originally cited at the Wikipedia webpage “Ban (Unit)” at http://en.wikipedia.org/wiki/Ban_(unit)

[17] Kass, Robert E. and Raftery, Adrian E., 1995, “Bayes Factors,” pp. 773-795 in Journal of the American Statistical Association, June 1995. Vol. 90, No. 430. Available online at the Carnegie Mellon web address http://www.andrew.cmu.edu/user/kk3n/simplicity/KassRaftery1995.pdf. I learned of it through the Wikipedia article “Bayes Factor” at http://en.wikipedia.org/wiki/Bayes_factor

[18] p. 2, Held.

[19] p. 181, McNeill.

[20] IBID., p. 181.

[21] IBID., p. 176.

[22] IBID., p. 190.

[23] IBID., pp. 188-189.

[24] IBID., pp. 177-178.

[25] p. 179, McNeill.

[26] pp. 45-46, Downey. The original source was MacKay, David J.C., 2003, Information Theory, Inference, and Learning Algorithms. Cambridge University Press: New York.

[27] You might be in better hands with Gene Wilder.


Information Measurement with SQL Server, Part 1: A Quick Review of the Hartley Function

By Steve Bolton

…………This long-delayed series of amateur self-tutorials has been in the works ever since I began writing my A Rickety Stairway to SQL Server Data Mining series, which made it clear to me that I didn’t know enough about what was going on under the hood in SSDM. I still don’t know enough about the reasoning behind the various data mining algorithms implemented by SQL Server and other tools, but I am certain of one thing: I never will know enough, even if I actually became competent in these topics. These fields are just too detailed, broad and rooted in poorly understood corners of pure reason for anyone to master, let alone myself. Like my series on SSDM, this foray into coding all of the basic measures of information theory and related fields may well exemplify University of Connecticut statistician Daniel T. Larose’s witticism that “data mining is easy to do badly.”[1] My purpose in the Rickety series was merely to demonstrate that useful results can still be derived from SSDM, even when it is badly mishandled. In this series, I will try to explain how the metrics used in many data mining algorithms can be used to answer a whole cornucopia of questions about our datasets, such as: How much information might there be, as measured in terms of possible state descriptions and probabilities? How much meaning (i.e. semantic information) might it have? How many state descriptions does it rule out? How much is already known? How random, aperiodic, redundant, complex or ordered is it? Another interesting challenge is determining the shortest possible specifications of a structure.
…………There are literally dozens upon dozens of measures available to answer all of these questions, taken from such diverse fields as information theory, chaos theory, algorithmic complexity and many others which provide the basic building blocks of most data mining algorithms. The techniques provided by these fields are powerful, yet contain many logical stumbling blocks that far smarter people than ourselves have tripped over, often without even knowing it; these range from instances of subtle “definition drift” in the meaning of terms like “randomness” over the course of textbooks, to the development of full-blown crackpot theories by scholars overexcited by the potential of systems like chaos and information theory. I am bound to make mistakes along the way, given that I’m an amateur at this, so by all means take care when implementing my code and trusting my analysis, which is sometimes just plain wrong. On the other hand, these techniques are so powerful and so under-utilized that there is a crying need for some explanation of how they can be applied in a SQL Server environment, even a poor one. I know a little bit more of certain areas of philosophy – especially historical instances of when it has gone very wrong – so I can occasionally make a contribution by commenting on how to avoid fallacious reasoning, which is even more of a problem once these sophisticated topics are clouded over by excess jargon and complex math formulas.

DIY Data Mining and the Scope of the Series

                One thing I’ve learned while trying to force-feed myself the underlying math is that the analysis marketplace is decades behind the research in some ways; there is no way that any single company is ever going to be able to code all of the extant mining algorithms, assuming it is even possible to tally them all up. This means that it may be beneficial in the years to come to have the skills to build DIY solutions. Throughout this series I will provide T-SQL code so that SQL Server DBAs and data miners can implement some of these techniques on their own, without waiting for developers of mining software to code the particular algorithms that fit their use cases. Why T-SQL? I could make a strong case that the gradual accretion of features like windowing functions is slowly making set-based languages ideal for this purpose, although they are rarely thought of in that way; it boils down to the fact that most of the problems covered in these interrelated fields are much easier to express in terms of sets. Furthermore, the sheer size of “Big Data” (which has steadily gotten “bigger” ever since the first records were kept; like “globalization”, it has only accelerated in recent years) requires taking computing to a new level of abstraction in order to simplify things, which is a purpose that set-based languages can fulfill. This series will be a learning experience for me as well, in which I hope to at least help others avoid my mistakes, by teaching through misadventure; writing a series like this aids me in absorbing the material a lot faster, while also getting valuable practice in translating the difficult underlying math formulas into code. Since I don’t know what I’ll discover along the way, I suspect that at some point I may resort to using some of my other favorite languages, like Multidimensional Expressions (MDX) and Visual Basic .Net, possibly in conjunction with Common Language Runtime (CLR) stored procedures. One of the benefits I hope to provide is to take most of the math out of the picture, so that readers don’t get bogged down in it; the jargon and formulas are absolutely necessary for the researchers to communicate with each other, but there is no need for data miners to be writing formal mathematical proofs, just as commuters shouldn’t have to give a dissertation on automotive engineering in order to drive their cars. I’ve sometimes received comments to the effect that there’s too much text in these articles, but that’s because they don’t see the hundreds of pages of math formulas that gave rise to them; rather than stringing together some meaningless screenshots or rehashing MSDN tutorials from AdventureWorks, I aim to show how these techniques might be coded and how they can be used appropriately.
…………The goal in this series is to corner the uncertainty in our datasets by shining lights on it from every possible direction, so that organizations can make better decisions that result in more effective actions. To that end, everything from measures used in Bayesian inference to Solomonoff Algorithmic Probability to the calculation of periodicities to the Lyapunov Exponent will be fair game. These metrics vary quite widely in terms of sophistication, so at points we’ll cross the imprecise boundaries separating them from full-blown data mining algorithms; the dividing lines separating statistics, data mining, machine learning, “soft computing,” predictive analytics and the like seem to boil down to degrees of intricacy, rather than differences in kind, especially since their algorithms are derived from the same metrics and building blocks. My last mistutorial series was designed from the ground up to build on existing concepts, one article at a time. This one will be inherently disorganized, since the scope is so wide and I don’t know what I will find along the way. I will have to skip around quite a bit across topics that may only be distantly related, or across levels of sophistication exhibited by the measures of information. It will also be an open-ended series, whereas the Rickety series was merely necessary to cover a specific set of easily denumerable features. We could delve into dozens of obscure metrics if the need arises, or even concoct our own special-purpose metrics, if a use case calls for it.
…………The series may explore wide-ranging topics along paths that are still somewhat unknown, but I can at least kick it off by introducing some of the primordial foundations of information theory. Perhaps the simplest is a function developed in 1928 by electronics pioneer Ralph Hartley[2], who applied it to signal transmission[3] a few decades before Claude Shannon did the same with his own renowned entropy measure. Although Hartley considered it a measure of information, his function did not lead to the use of the term “information theory,”[4] which was coined when Shannon’s famous equation gave birth to the field. This introduction will be made even easier by the fact that I already discussed a more advanced version of Hartley’s measure in Implementing Fuzzy Sets in SQL Server, Part 9: Measuring Nonspecificity with the Hartley Function. The version for ordinary “crisp” sets of the kind DBAs and data miners are accustomed to working with is actually quite a bit easier to code and interpret: all we have to do is count the records in a set and take the logarithm. The code in Figure 1 is actually longer than it has to be, given that I used three different logarithm bases to measure the same quantity, for the sake of completeness. When base 2 is used, the units are known as bits or “shannons.” When the default value of Euler’s Number is used, they’re known as “nats,” but with base 10 we’re measuring in hartleys or “bans” (a term coined by famed cryptographer Alan Turing). It would be trivial to turn this into a stored procedure with an option to select the desired units. It only took about a second to calculate the results in Figure 2 on the first float column of the Higgs Boson dataset I downloaded from University of California at Irvine’s Machine Learning Repository a few tutorial series ago, which I converted to a 5-gigabyte SQL Server table. This was calculated effortlessly across all 11 million rows only because we had to perform some simple counts, without traversing the whole table.

Figure 1: Code for the Ordinary “Crisp” Version of the Hartley Function
DECLARE @HartleyEntropy float, @DistinctCount bigint, @Count bigint

SELECT @DistinctCount = Count(DISTINCT Column1), @Count = Count(*)
FROM Physics.HiggsBosonTable

SELECT Log(@DistinctCount, 2) AS BitsOrShannons, Log(@DistinctCount, 2.7182818284590452353602874713526624977) AS Nats, Log(@DistinctCount, 10) AS Hartleys,
Log(@Count, 2) AS MultisetBitsOrShannons, Log(@Count, 2.7182818284590452353602874713526624977) AS MultisetNats, Log(@Count, 10) AS MultisetHartleys

 Figure 2: Results from the Higgs Boson Dataset

…………The main problem I ran into was a fundamental one: the formula calls for plugging in the cardinality of the set, but in ordinary set theory parlance, duplicate values are only counted in multisets. The set notation includes the symbols |A|, but the bars specify a cardinality measure rather than the use of the ABS function. The question is, which cardinality? I’ve included both versions in Figure 1, which differ solely by the fact that one uses a DISTINCT operator and the other takes a standard COUNT. This discrepancy could have its uses though. I have yet to see this issue raised in the information literature, where Hartley’s metric is often skipped over as a topic of mainly historical interest in comparison to Shannon’s, but it may be possible to derive a third metric based on the difference between the two. A simple subtraction might help us quantify the contribution of the repeated values to the uncertainty, which could have its uses in the kinds of uncertainty management programs I spoke of in the fuzzy set series. In essence, this difference could act as a crude measure of redundancy. If it reflects the information gained by using the DISTINCT operator, we could use this to assess its costs. We know for a fact that the DISTINCT version can’t exceed the multiset version, which acts as a cap on its range. At the other end of the scale, the measure reaches the limit of 0 when all records have the same value, therefore giving us perfect certainty. Another important issue is whether or not the DISTINCT clause adds information, by reducing the amount of uncertainty about how many different values a variable can take on.
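…………A crude version of that redundancy measure is trivial to bolt onto the code in Figure 1; the sketch below simply subtracts the DISTINCT figure from the multiset figure in bits, with the caveat that this is my own ad hoc measure rather than anything taken from the literature.

DECLARE @DistinctCount bigint, @Count bigint

SELECT @DistinctCount = Count(DISTINCT Column1), @Count = Count(*)
FROM Physics.HiggsBosonTable

-- the gap between the multiset and DISTINCT versions of the Hartley measure, as a crude gauge of redundancy
SELECT Log(@Count, 2) - Log(@DistinctCount, 2) AS CrudeRedundancyInBits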
…………Of course, the clause might not be necessary at all if we knew in advance precisely which values were permissible for a column, such as the range of a decimal type or a varchar column limited to a handful of known category values. On the other hand, this raises a subtle distinction between values that are permissible for a column, which can be determined by the data type, and the values actually found, which can only be counted through a DISTINCT operator. The issue becomes more sophisticated if we are able to determine the counts of each individual value; these measures of “multiplicity,” as they are known in multiset theory, further reduce the uncertainty associated with the dataset. It is easy enough to implement these internal counts using windowing functions and GROUP BY statements, but the issue of how to factor them in quickly complicates the discussion of the otherwise simple Hartley function. Thankfully, the order of the records is only an issue with tuples, not the kinds of sets or multisets we plug into it – except when we get to the end of the dataset and can determine the last records from the remaining counts, which is something I’ve not seen addressed in the literature. This brings sampling without replacement (i.e., the kind of probabilities associated with decks of cards where no reshuffling takes place) into play, at least briefly.
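…………For anyone who wants to see what those multiplicity counts look like in practice, a plain GROUP BY against the same column returns the count behind each distinct value; how best to fold these figures into the Hartley measure is the open question raised above, so this is merely the raw material.

-- the multiplicity of each distinct value, i.e. how many times it occurs in the multiset
SELECT Column1, Count(*) AS Multiplicity
FROM Physics.HiggsBosonTable
WHERE Column1 IS NOT NULL
GROUP BY Column1
ORDER BY Multiplicity DESC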

“Newsworthiness”: The Narrow Definition of “Information”

                Aside from all of these hidden subtleties, the information provided by both the DISTINCT and multiset versions can be summed up thus: how much am I learning each time I inspect a row and verify the actual value found there? This is equivalent to asking how much we learn from each slip of paper in a drawing, if we know the count of the jar in advance. In more advanced terms, we can think of this as increasing information by reducing the number of possible state descriptions the next record can take on; in such a context, whether or not repeated values are allowed makes a big difference. The same goes for their individual counts, if the answer is yes. Keep in mind that this type of “information” is practically the polar opposite of existing knowledge; basically, the higher the Hartley measure is, the less we already know, so the more the next record can tell us. It is a highly specific type of information, which former journalists like myself might equate with “newsworthiness.” This dividing line between existing and new knowledge is precisely where measures of entropy (like the Hartley function) intersect with topics like Bayesian probability, which I will also address at some point in this series; as we shall see, many of these information measures are interrelated in complex ways. This highly specific definition of information is an important distinction that pertains to all of the other measures of entropy we’ll discuss in this series; interpretation is a critical stage in every field associated with data mining, particularly information theory, which should never be shortchanged in any workflow.
…………The Hartley function can be leveraged in a data mining workflow in various ways, such as calculating the reduction in uncertainty between two separate measures; this could be useful, for example, in specifying a numerical cut-off point in bits or bans, after which it’s not worthwhile to go on inspecting rows, for whatever end purpose that might be, such as sampling. The sample code in Figure 3 takes the Hartley measure after the 10 millionth row, or about 9 percent away from the end of the dataset, which is why the remaining uncertainty in Figure 4 is so low; as we approach the last uninspected record, the remaining uncertainty would approach zero. This formula would be equivalent to counting the remaining records and plugging the results into the Hartley function. Another interesting question is whether or not we could pair this with cardinality estimation, to get a ballpark figure of how much we can learn from each record we inspect, before we’ve even traversed a dataset. I don’t know much about cardinality estimation yet, but the possibility is tantalizing.

Figure 3: Calculating the Remaining Uncertainty with the Hartley Function
DECLARE @DistinctCountOfKnownValues bigint, @CountOfKnownValues bigint

SELECT @DistinctCountOfKnownValues = Count(DISTINCT Column1), @CountOfKnownValues = Count(*)
FROM Physics.HiggsBosonTable
WHERE ID BETWEEN 1 AND 10000000
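-- note: @DistinctCount and @Count are assumed to still be in scope from Figure 1, which would need to be run in the same batch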

SELECT Bits, KnownBits, Bits - KnownBits AS RemainingUncertaintyInBits, MultisetBits, KnownMultisetBits, MultisetBits - KnownMultisetBits AS RemainingMultisetUncertaintyInBits
FROM (SELECT Log(@DistinctCount, 2) AS Bits, Log(@DistinctCountOfKnownValues, 2) AS KnownBits, Log(@Count, 2) AS MultisetBits, Log(@CountOfKnownValues, 2) AS KnownMultisetBits) AS T1

 Figure 4: Uncertainty Reduction Results

…………The kinship between the Hartley function and the rest of information theory is evident in some of its alternative names, like the Hartley Entropy or Max Entropy. It is equivalent to the Rényi Entropy[5] with its alpha parameter (α) set to 0, as I’ll explain a few articles from now. It’s also identical to the Shannon Entropy in cases of the uniform distribution, i.e. when all values are equally likely.[6] I’ll be spending a couple of articles on various aspects of entropy early on in this series, since it’s such an important concept in information theory. The math and logic can get thick pretty quickly in this field, so it is best to start off with the measure that started it all, Shannon’s infamous “H.” I recognized its signature combination of a negative summation operator and log operation when translating some of the equations used in Implementing Fuzzy Sets in SQL Server, Part 10.2: Measuring Uncertainty in Evidence Theory into T-SQL. As with the Hartley function, this previous exposure to more advanced fuzzy derivatives ought to make the material a little easier to swallow. The difficulty with Shannon’s Entropy, however, is not in its calculation, but in its proper interpretation. Yet as long as we have the rigor to avoid assigning unwarranted shades of meaning to the term “information,” it can be a powerful addition to our data mining toolbelts.
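…………As a quick back-of-the-envelope check on that last claim, the snippet below computes both measures for a hypothetical uniform distribution of eight equally likely values; the Shannon figure is calculated analytically from its signature negative summation of probabilities times their logarithms, and both come out to exactly 3 bits.

DECLARE @DistinctValueCount float = 8 -- a hypothetical count of equally likely values

SELECT Log(@DistinctValueCount, 2) AS HartleyBits, -- the Hartley measure: the LOG of the count
-1 * @DistinctValueCount * ((1.0 / @DistinctValueCount) * Log(1.0 / @DistinctValueCount, 2)) AS ShannonBits -- the Shannon measure collapses to the same figure when every value has probability 1/N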

[1] p. xii, Larose, Daniel T., 2005, Discovering Knowledge in Data: An Introduction to Data Mining. Wiley-Interscience: Hoboken, N.J.

[2] See the Wikipedia articles “Hartley Function” and “Ralph Hartley” at http://en.wikipedia.org/wiki/Hartley_function and http://en.wikipedia.org/wiki/Ralph_Hartley respectively.

[3] p. 5, Ritchie, L. David, 1991, Information. Sage Publications: Newbury Park, Calif.

[4] p. 288, Bar-Hillel, Yehoshua, 1964, Language and Information: Selected Essays On Their Theory and Application. Addison-Wesley Pub. Co.: Reading, Mass.

[5] See the Wikipedia article “Rényi Entropy” at http://en.wikipedia.org/wiki/R%C3%A9nyi_entropy

[6] See the Wikipedia article “Hartley Function” at http://en.wikipedia.org/wiki/Hartley_function

Implementing Fuzzy Sets in SQL Server, Part 11: Fuzzy Addenda

By Steve Bolton

…………One of the key reasons I looked into the topic of fuzzy sets in the first place was my suspicion that T-SQL, as a set-based language, would be ideal for modeling them. That turned out to be an understatement of sorts: I definitely was not prepared to discover just how useful they can be for translating imprecise linguistic modifiers in Behavior-Driven Development (BDD) environments and user stories, nor did I realize how little information has percolated down from the mammoth amount of theoretical research done on fuzzy topics over the last 40 years. Hopefully this series of amateur mistutorials helped rectify that gap by giving fuzzy sets some badly needed free press, of the kind I tried to bring to SSDM in my older A Rickety Stairway to SQL Server Data Mining series awhile back. I originally set aside this final article as a kitchen drawer of sorts, to dispense with some postscripts that would’ve interfered with the flow of the rest of the series, in which one concept was used as a building block for the next. One leftover concept I thought might be worthy of significant attention was fuzzy orders, which sounds as if it would be right up SQL Server’s alley. After all, DBAs use the ORDER BY statement every day. The problem is that it turns out T-SQL, like most other set-based languages, is not ideal for modeling this kind of fuzzy object.

Fuzzy Orders and the Limitations of Hierarchies in SQL

                In the literature, fuzzy set orders are created by applying continuous membership grades to a record’s position in a particular fuzzy set. Devices like Hasse diagrams and properties like “dominated” and “undominated” are useful in implementing them[1], but I won’t bother, for the simple reason that SQL Server lacks robust graph database capabilities. Modeling relationships of this kind is still notoriously difficult in the relational realm, even though they’ve been augmented by such useful tools as the hierarchyid data type in recent years. I am rather fond of hierarchyid, but it is unable to model multiparent trees in an efficient way, let alone multidimensional directed graphs. Just try modeling a simple genealogical tree with it. Trees are instances of what are known in mathematical parlance as partial orders; when you really stop and think about it, they represent a form of order, except in more than one dimension, such as “my grandparents and I have a descendant-ancestor relationship, but not my cousins and I.”[2] As far as I can tell, directed graphs open up more possibilities by relaxing the rules of composition, in the same way that Riemannian manifolds give us access to curved hyperspace. I for one would cast my vote for adding graph database capabilities similar to those found in Neo4j[3] to SQL Server, which would add a whole new dimension to the product in the same way that Analysis Services and Reporting Services do, without being a separate service.
…………Alas, until such capabilities are added to SQL Server, it wouldn’t be useful to model most forms of fuzzy orders in T-SQL, let alone Multidimensional Expressions (MDX) in SQL Server Analysis Services (SSAS) cubes, because they immediately require the flexibility of multiparent trees and directed graphs. These tasks could be accomplished in SQL Server 2014 as it stands, but in contrast to the other fuzzy objects I’ve introduced throughout this series, I doubt it can be done in an efficient way. It also doesn’t help matters at all that the Windows Presentation Foundation (WPF) tree control is a walking disaster – for years now, its shortcomings have been a thorn in the side of .Net developers of all skill levels. Microsoft simply didn’t build in such basic functionality as searching for specific members in a collapsed tree, and in fact made it virtually impossible for third-party developers to do it themselves. Needless to say, neither the WPF TreeView nor hierarchyid is well-suited to modeling directed graphs, which are simply more flexible generalizations of trees. The kissing cousins of fuzzy orders, like fuzzy rankings[4] and fuzzy morphisms[5], aren’t really feasible either. George J. Klir and Bo Yuan’s Fuzzy Sets and Fuzzy Logic: Theory and Applications, my favorite go-to resource for fuzzy math formulas, provides a decent starting point for all three[6], but from my little experience, I wouldn’t even try to implement them unless I had access to a good third-party product like GoXAM’s directed graph control (which may be expensive, but would probably recoup its costs by saving weeks of wasted labor on the unworkable WPF TreeView). If it one day does become worthwhile to model fuzzy orders and ranks in some future edition of SQL Server (or I turn out to be wrong), they’ll probably require the use of a lot of CASE statements in ORDER BY clauses and windowing functions, respectively. Given that there’s a mountain of currently unsolved problems out there that other aspects of fuzzy sets could tackle right away, we’ll save this topic for a later date. It’ll be a long time before all the low-hanging fruit is used up and we’re to the point where struggling to model them will become worthwhile.

Some Simple T-SQL for Fuzzy Medians

                Because I realized early on that fuzzy orders were an afterthought – at least by the present capabilities of SQL Server and other relational databases – I left the subject of fuzzy medians for this junk drawer of an article. After all, medians are inherently dependent on the order of data, given that they pick the one or two values that occur precisely in the middle of a set. Furthermore, I noticed that the formulas involved calculations on two sets rather than one, which would have cluttered Implementing Fuzzy Sets in SQL Server, Part 7: The Significance of Fuzzy Stats, where the sample code was all done on a single table. That should have been a clue, however, that the fuzzy medians in the literature are a separate subject, not just a fuzzified version of ordinary medians. That would be easy enough to implement, given the principles of fuzzy sets introduced throughout this series; for example, instead of selecting the one or two records at the dead center of the dataset, we could select a fuzzy range. The trapezoidal numbers discussed in Implementing Fuzzy Sets in SQL Server, Part 6: Fuzzy Numbers and Linguistic Modifiers might be ideal for this purpose. The type of fuzzy medians under discussion here instead belongs in the taxonomic hierarchy of fuzzy objects I mentioned in the fuzzy stats article, like Ordered Weighted Averages (OWAs), Lambda Averages (λ-Averages), T-norms, T-conorms and the like. Compared to some of those operations, the logic of fuzzy medians is fairly simple: we take the maximum of the values of two sets at each corresponding row when both membership scores are between 0 and the @LambdaParameter, the minimum values when both are between the @LambdaParameter and 1 and just the @LambdaParameter (which must be set between 0 and 1) in all other cases.[7] Assuming I read the formulas correctly – which is not a given, since I’m a novice at this – then this should all be implemented in Figure 1. As usual, it looks a lot longer than it really is; everything through the second UPDATE statement is just the same sample code I’ve used throughout this series to populate the membership functions for binary set relations. Keep in mind that we don’t need to use Z-Scores to assign membership values here; I’m just using them to illustrate how to assign memberships in a fuzzy set, using familiar code from older tutorials. The sky’s the limit as far as the number of functions you can use to assign such values; the key thing is to find the right match to the problem you’re trying to solve. This would be a good match if we were trying to rate outliers by two different forms of Z-Scores, for example. The only novel part is the last SELECT, which isn’t difficult at all. As always, the results in Figure 2 are derived from the Duchennes muscular dystrophy dataset I downloaded a few tutorial series ago from Vanderbilt University’s Department of Biostatistics and have been using for practice data ever since.

Figure 1: Sample Code for a Simple Fuzzy Median
DECLARE @RescalingMax decimal(38,6), @RescalingMin decimal(38,6), @RescalingRange decimal(38,6)
DECLARE       @ZScoreTable table
(PrimaryKey sql_variant,
Value decimal(38,6),
ZScore decimal(38,6),
ReversedZScore as CAST(1 as decimal(38,6)) - ABS(ZScore),
MembershipScore decimal(38,6),
GroupRank bigint
)

DECLARE @ModifiedZScoreTable table
(PrimaryKey sql_variant,
Value decimal(38,6),
ZScore decimal(38,6),
ReversedZScore as CAST(1 as decimal(38,6)) - ABS(ZScore),
MembershipScore decimal(38,6),
GroupRank bigint,
OutlierCandidate bit
)

INSERT INTO @ZScoreTable
(PrimaryKey, Value, ZScore, GroupRank)
EXEC   Calculations.ZScoreSP
             @DatabaseName = N’DataMiningProjects,
              @SchemaName = N’Health,
              @TableName = N’DuchennesTable,
              @ColumnName = N’LactateDehydrogenase,
              @PrimaryKeyName = N’ID’,
              @DecimalPrecision = ’38,32′,
              @OrderByCode = 8

— RESCALING
SELECT @RescalingMax = Max(ReversedZScore), @RescalingMin= Min(ReversedZScore)
FROM @ZScoreTable
SELECT @RescalingRange = @RescalingMax @RescalingMin

UPDATE @ZScoreTable
SET MembershipScore = (ReversedZScore @RescalingMin) / @RescalingRange

INSERT INTO @ModifiedZScoreTable
(PrimaryKey, Value, ZScore, GroupRank, OutlierCandidate)
EXEC   Calculations.ModifiedZScoreSP
              @DatabaseName = N’DataMiningProjects,
             @SchemaName = N’Health,
             @TableName = N’DuchennesTable,
             @ColumnName = N’LactateDehydrogenase,
             @PrimaryKeyName = N’ID’
              @OrderByCode = 8,
              @DecimalPrecision = ’38,32′

— RESCALING
SELECT @RescalingMax = Max(ReversedZScore), @RescalingMin= Min(ReversedZScore)
FROM @ModifiedZScoreTable
SELECT @RescalingRange = @RescalingMax @RescalingMin

UPDATE @ModifiedZScoreTable
SET MembershipScore = (ReversedZScore @RescalingMin) / @RescalingRange

DECLARE @LambdaParameter float = 0.43 

SELECT T1.PrimaryKey, T1.Value, T1.MembershipScore, T2.MembershipScore,
CASE WHEN (T1.MembershipScore BETWEEN 0 AND @LambdaParameter) AND (T2.MembershipScore BETWEEN 0 AND @LambdaParameter) THEN (SELECT MAX(Value) FROM (VALUES (T1.MembershipScore), (T2.MembershipScore)) AS ScorePair(Value))
WHEN (T1.MembershipScore BETWEEN @LambdaParameter AND 1) AND (T2.MembershipScore BETWEEN @LambdaParameter AND 1) THEN (SELECT MIN(Value) FROM (VALUES (T1.MembershipScore), (T2.MembershipScore)) AS ScorePair(Value))
ELSE @LambdaParameter END AS FuzzyMedian
FROM @ZScoreTable AS T1
       INNER JOIN @ModifiedZScoreTable AS T2
       ON T1.PrimaryKey = T2.PrimaryKey AND T1.Value IS NOT NULL AND T2.Value IS NOT NULL

Figure 2: Results from the Duchennes Dataset

…………I barely began to scratch the surface of fuzzy objects like fuzzy medians, λ-Averages, T-norms, T-conorms and OWAs in this series. In fact, there’s an entire sea of ripe research out there on all topics fuzzy that could be quite useful to relational DBAs and decision support specialists, but which has gone unpicked. There are many different directions this topic can be taken in, so I may revisit this series and tack some additional articles onto it in the future. I didn’t get a chance to mention the extension principle[8] at all and glossed over important applications of fuzzy techniques in Decision Theory, way back in Implementing Fuzzy Sets in SQL Server, Part 4: From Fuzzy Unions to Fuzzy Logic. I might provide more detail on the use cases for particular T-norms and T-conorms (if I can ever get my hands on the relevant academic journal articles, which are expensive), model more linguistic states and get into indexing considerations, other brands of fuzzy aggregates and other types of fuzzy partitions besides alpha cuts (α-cuts), among other things. Yet I’d rather branch off into “soft computing,” a grab-bag and hodge-podge of cutting-edge fields that are actually quite hard, which makes its name something of an oxymoron. Fuzzy logic is merely one of the buzzwords associated with it, like chaos theory, neural nets, support vector machines (SVMs) and genetic algorithms. What they all have in common is that they’re useful in situations where inexact solutions are acceptable, including NP-Complete problems.[9] The same hype and intellectual intoxication I spoke of in Implementing Fuzzy Sets in SQL Server, Part 1: Membership Functions and the Fuzzy Taxonomy also surrounds certain aspects of soft computing, which seems to make some theoreticians go soft in the head; I guarantee there will still be useful innovations occurring in these fields a century from now, assuming the human race lasts that long, but these incredible tools aren’t cure-alls. There are some things they just can’t do, and I’d wager that certain brands of artificial intelligence and machine learning are among them; I love science fiction but it’s not wise to confuse it with cold, hard reality.
…………That’s a discussion I’ll take up by dribs and drabs in my next, long-delayed mistutorial series, Information Measurement with SQL Server, which may serve as a stepping stone to my favorite topic, neural nets. Both topics dovetail nicely with fuzzy sets and many of the tangential topics we’ve covered in this series, like Shannon’s Entropy and the Hartley function. These are among dozens of metrics which can be coded in T-SQL and Multidimensional Expressions (MDX) and put to good use for data mining purposes, as I will demonstrate over the course of this long and possibly nomadic series. I aim to familiarize myself with semantic information, measures of order, measures of sensitivity to initial conditions (like the Lyapunov Exponent used in chaos theory), various means of quantifying algorithmic complexity – anything that will reduce uncertainty and glean whatever unused information is left in our datasets, by quantifying it in some way. Some of these metrics can be plugged into the formulas I introduced in this series for measuring fuzziness in terms of set complements, such as the Kullback-Leibler Divergence and Bhattacharyya Distance. We’ve already gotten our toes wet by introducing fuzzy stats and metrics for quantifying nonspecificity and fuzziness; now it’s time to jump in. Some of the topics will be quite shallow and easy to follow, while others may be incredibly deep. It’s largely unexplored territory for me as well, so I may have to skip around from topic to topic in an unsystematic way, instead of deliberately building up to more complex concepts as I did towards Dempster-Shafer Evidence Theory in this series. At a minimum, readers should at least benefit from learning from my mistakes, which don’t require a fancy fuzzy expert system to tell us that they’re inevitable; like death and taxes, they’re one of the few pieces of information that come with any certainty in predictive analytics and data mining.

[1] pp . 137-141, Klir, George J. and Yuan, Bo, 1995, Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall: Upper Saddle River, N.J. On this particular page, they’re extending the meaning of the term even further, to complex network topologies.

[2] For more information, see the article “Partially Ordered Set” at the Wikipedia web address  http://en.wikipedia.org/wiki/Partially_ordered_set

[3] Which I have yet to try; I’m only speaking here of what I’ve read about Neo4j casually.

[4] pp. 405-408, Klir and Yuan.

[5] IBID., pp. 141-144.

[6] IBID., pp. 137-144.

[7] IBID., p. 94.

[8] IBID., pp. 44-45.

[9] See the Wikipedia article “Soft Computing” at http://en.wikipedia.org/wiki/Soft_computing.

Implementing Fuzzy Sets in SQL Server, Part 10.2: Measuring Uncertainty in Evidence Theory

By Steve Bolton                                                                                                                      

…………To avoid overloading readers with too many concepts at once, I split my discussion of Dempster-Shafer Evidence Theory into two parts, with the bulk of the data modeling aspects and theory occurring in the last article. This time around, I’ll cover how fuzzy measures can be applied to it to quantify such forms of uncertainty as nonspecificity and imprecision (i.e., “fuzziness”) that were introduced in prior articles. Since the Plausibility, Belief and probability mass assignment figures work together to assign degrees of truth, they also introduce the potential for contradictory evidence, which leads to a few other measures of uncertainty: Strife, Discord and Conflict, which aren’t as relevant to possibility distributions and ordinary fuzzy sets. In addition, the probability mass for a universal hypothesis can be interpreted as a form of uncertainty left over after all of the probabilities for the subsets have been partitioned out. For example, in Figure 1, this crude type of uncertainty would be associated with the 0.0334928229665072 value for row 6. For the sake of brevity, I won’t rehash how I derived the ordinal LactateDehydrogenaseState category and the first three fuzzy measures associated with it, since the numbers are identical to those in the last tutorial. For the sake of convenience I added three columns with nearly identical names and calculated some sham data for them (based on the frequencies of some CreatineKinase data in the original table) so that we have some Conflicting data to work with. Ordinarily, such comparisons would be made using joins against an external view or table with its own separate ProbabilityMassAssignment, BeliefScore and PlausibilityScore columns, or a query that calculated them on the fly.

Figure 1: Some Sample Evidence Theory Data from the Last Tutorial

…………In Figure 2, I translated some of the most popular formulas for evidence theory measures into T-SQL, such as Strife, Discord and Conflict.[1] For these, I used a simpler version of the equations that performs calculations on differences in set values rather than fuzzy intersections and unions.[2] Despite the fact that the two measures only differ by the divisor and the order of the difference operation, Discord is apparently not used as often as Strife, on the grounds that it does not capture as much information. These subtle differences occur only in the alternate measures of Conflict they’re based on; since the one related to Strife is more important, I only included that one in Figure 3, where it’s represented by a score of 0.286225667126791. Versions of Strife and Discord are available for possibility distributions, but I omitted these because the fact that possibility theory is “almost conflict-free” signifies that they’re of “negligible” benefit.[3] I also coded the evidence theory version of nonspecificity and essentially rehashed the crude fuzziness measure I used in Implementing Fuzzy Sets in SQL Server, Part 2: Measuring Imprecision with Fuzzy Complements, except with the YagerComplement parameter arbitrarily set to 0.55 and the probability mass used in place of the membership function results. Both of these are unary fuzzy measures that apply only to the set defined by the first three float columns, whereas Strife, Discord and Conflict are binary measures that are calculated on the differences between the two sets encoded in the Health.DuchennesEvidenceTheoryTable. We can also add the Strife and fuzziness figures together to derive a measure of total uncertainty, plus interpret the height of a fuzzy set – i.e., the count of records with the maximum MembershipScore of 1 – as a sort of credibility measure. Keep in mind that I’m not only a novice at this, but am consulting mathematical resources that generally don’t have the kind of step-by-step examples with sample data used in the literature on statistics. This means I wasn’t able to validate my implementation of these formulas well at all, so it would be wise to recheck them before putting them to use in production environments where accuracy is an issue. I’m most concerned by the possibility that I may be incorrectly aggregating the individual focal elements for evidentiary fuzziness and nonspecificity, each of which should be weighted by the corresponding probability mass.
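…………Since that weighting is the part I’m least confident about, the snippet below sketches what a per-focal-element version of the nonspecificity calculation might look like: summing each hypothesis’s probability mass times the Base-2 log of the number of elementary states it covers, rather than using the overall row count as the code in Figure 2 does. The FocalElementCardinality column is hypothetical – it does not exist in my sample table – and would hold 1 for the singleton states and higher counts for interval hypotheses like “Medium or Low”; it could also be derived by counting rows in the interleaved interval table from the last article, but I haven’t validated that approach either:

-- a hedged sketch, not validated: weight the Base-2 log of EACH focal element's cardinality
-- by its probability mass; FocalElementCardinality is a hypothetical column that would have to
-- be added or derived, and singletons contribute nothing since LOG(1, 2) = 0
SELECT SUM(ProbabilityMassAssignment * Log(FocalElementCardinality, 2)) AS EvidenceTheoryNonspecificityInBits
FROM Health.DuchennesEvidenceTheoryTable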

Figure 2: Several Evidence Theory Measures Implemented in T-SQL
DECLARE @Conflict float, @ConflictForDiscord float

SELECT @Conflict = SUM(CASE WHEN BeliefScore2 = 0 THEN ProbabilityMassAssignment2 * ABS(BeliefScore - BeliefScore2)
       ELSE ProbabilityMassAssignment2 * ABS(BeliefScore - BeliefScore2) / ABS(CAST(BeliefScore AS float))
       END),
       @ConflictForDiscord = SUM(CASE WHEN BeliefScore2 = 0 THEN ProbabilityMassAssignment2 * ABS(BeliefScore2 - BeliefScore)
       ELSE ProbabilityMassAssignment2 * ABS(BeliefScore2 - BeliefScore) / ABS(CAST(BeliefScore2 AS float))
       END)
FROM Health.DuchennesEvidenceTheoryTable

-- FUZZINESS
DECLARE @Count bigint, @SimpleMeasureOfFuzziness float
DECLARE @OmegaParameter float = 0.55 -- ω

SELECT @Count = Count(*)
FROM Health.DuchennesEvidenceTheoryTable

SELECT @SimpleMeasureOfFuzziness = SUM(ABS(ProbabilityMassAssignment - YagerComplement)) / @Count
FROM (SELECT ProbabilityMassAssignment, Power(1 - Power(ProbabilityMassAssignment, @OmegaParameter), 1 / CAST(@OmegaParameter AS float)) AS YagerComplement
       FROM Health.DuchennesEvidenceTheoryTable) AS T1

-- NONSPECIFICITY
DECLARE @EvidenceTheoryNonspecificityInBits float

SELECT @EvidenceTheoryNonspecificityInBits = SUM(ProbabilityMassAssignment * Log(@Count, 2))
FROM Health.DuchennesEvidenceTheoryTable

SELECT Strife, Discord, Conflict, EvidenceTheoryNonspecificityInBits, SimpleMeasureOfFuzziness, Strife + EvidenceTheoryNonspecificityInBits
AS TotalUncertainty,
(SELECT ProbabilityMassAssignment
       FROM Health.DuchennesEvidenceTheoryTable
       WHERE LactateDehydrogenaseState = 'Any') AS ProbabilityMassRemainderUncertainty
             FROM (SELECT -1 * SUM(ProbabilityMassAssignment * Log((1 - @Conflict), 2)) AS Strife,
       -1 * SUM(ProbabilityMassAssignment * Log((1 - @ConflictForDiscord), 2)) AS Discord, @Conflict AS Conflict, @EvidenceTheoryNonspecificityInBits AS EvidenceTheoryNonspecificityInBits, @SimpleMeasureOfFuzziness AS SimpleMeasureOfFuzziness
       FROM Health.DuchennesEvidenceTheoryTable) AS T1

 

Figure 3: Sample Results from the Duchennes Evidence Theory Table

…………The nonspecificity measure in evidence theory is merely the Hartley function weighted by the probability mass assignments. On paper, the equation for Strife ought to appear awfully familiar to data miners who have worked with Shannon’s Entropy before. The evidence theory version incorporates some additional terms so that a comparison can be performed over two sets, but the negative summation operator and logarithm operation are immediately reminiscent of its more famous forerunner, which measures probabilistic uncertainty due to a lack of stochastic information.  Evidentiary nonspecificity trumps entropy in many situations because it is measured linearly, therefore avoiding computationally difficult nonlinear math (my paraphrase), but sometimes doesn’t produce unique solutions, in which case Klir and Yuan recommend using measures of Strife to quantify uncertainty.[4] Nevertheless, when interpreted correctly and used judiciously, they can be used in conjunction with axioms like the principles of minimum uncertainty, maximum uncertainty[5] and uncertainty invariance[6] to perform ampliative reasoning[7] and draw useful inferences about datasets:

                “Once uncertainty (and information) measures become well justified, they can very effectively be utilized for managing uncertainty and the associated information. For example, they can be utilized for extrapolating evidence, assessing the strength of relationship on between given groups of variables, assessing the influence of given input variables on given output variables, measuring the loss of information when a system is simplified, and the like. In many problem situations, the relevant measures of uncertainty are applicable only in their conditional or relative terms.”[8]

…………That often requires some really deep thinking in order to avoid various pitfalls in analysis; in essence, they all involve honing the use of pure reason, which I now see the benefits of, but could definitely use a lot more practice in. For example, Dempster-Shafer Theory has well-known issues with counter-intuitive results at the highest and lowest Conflict values, which may require mental discipline to ferret out; perhaps high values of Strife can act as a safeguard against this, by alerting analysts that inspection for these logical conundrums is warranted.[9] Critics like Judea Pearl have apparently elaborated at length on various other fallacies that can arise from “confusing probabilities of truth with probabilities of provability,” all of which need to be taken into account when modeling evidentiary uncertainty.[10] Keep in mind as well that Belief or Plausibility scores of 1 do not necessarily signify total certainty; as we saw a few articles ago, Possibility values of 1 only signify a state of complete surprise when an event does not occur rather than assurance that it will happen.
…………The issue with evidence theory is even deeper in a certain sense, especially if those figures are derived from subjective ratings. Nevertheless, even perfectly objective and accurate observations can be quibbled with, for reasons that basically boil down to Bill W.’s adage “Denial ain’t just a river in Egypt.” One of the banes of the human condition is our propensity to squeeze our eyes shut to evidence we don’t like, which can only be overcome by honesty, not education; more schooling may even make things worse, by enabling people to lie to themselves with bigger words than before. In that case, they may end up getting tenure for developing entirely preposterous philosophies, like solipsism, or doubting their own existence. As G.K. Chesterton warned more than a century ago, nothing can stop a man from piling doubt on top of doubt, perhaps by reaching for such desperate excuses as “perhaps all we know is just a dream.” He provided a litmus test for recognizing bad chains of logic, which can indeed go on forever, but can be judged on whether or not they tend to drive men into lunatic asylums. Cutting edge topics like fuzzy sets, chaos theory and information theory inevitably give birth to extravagant half-baked philosophies, born of precisely the kind of obsession and intellectual intoxication that Chesterton speaks of in his chapter on The Suicide of Thought[11] and that his colleague Arnold Lunn addresses in The Flight from Reason.[12] These are powerful techniques, but only when kept within the proper bounds; problems like “definition drift” and subtle, unwitting changes in the meanings assigned to fuzzy measures can easily lead to unwarranted, fallacious or even self-deceptive conclusions. As we shall see in the next series, information theory overlays some of its own interpretability issues on top of this, which means we must tread even more carefully when integrating it with evidence theory.
…………Fuzzy measures and information theory mesh so well together that George J. Klir and Bo Yuan included an entire chapter on the topic of “Uncertainty-Based Information” in my favorite resource for fuzzy formulas, Fuzzy Sets and Fuzzy Logic: Theory and Applications.[13] The field of uncertainty management is still in its infancy, but scholars now recognize that uncertainty is often “the result of some information deficiency. Information…may be incomplete, imprecise, fragmentary, not fully reliable, vague, contradictory, or deficient in some other way. In general, these various information deficiencies may result in different types of uncertainty.”[14] Information in this context is interpreted as uncertainty reduction[15]; the more information we have, the more certain we become. Methods to ascertain how the reduction of fuzziness (i.e., how imprecise the boundaries of fuzzy sets are) contributes to information gain were not fully worked out two decades ago when most of the literature I consulted for this series was written, but I have the impression that this still holds today. When we adapt the Hartley function to measure the nonspecificity of evidence, possibility distributions and fuzzy sets, all we’re doing is taking a count of how many states a dataset might take on. With Shannon’s Entropy, we’re performing a related calculation that incorporates the probabilities associated with those states. Given their status as the foundations of information theory, I’ll kick off my long-delayed tutorial series Information Measurement with SQL Server by discussing both from different vantage points. I hope to tackle a whole smorgasbord of various ways in which the amount of information associated with a dataset can be quantified, thereby helping to cut down further on uncertainty. Algorithmic complexity, the Lyapunov exponent, various measures of order and semantic information metrics can all be used to partition uncertainty and preserve the information content of our data, so that organizations can make more accurate decisions in the tangible world of the here and now.

[1] pp. 259, 262-263, 267, 269, Klir, George J. and Yuan, Bo, 1995, Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall: Upper Saddle River, N.J. The formulas are widely available, but I adopted this as my go-to resource whenever the math got thick.

[2] IBID., p. 263.

[3] IBID., pp. 262-265.

[4] IBID., p. 274

[5] IBID., pp. 271-272. Klir and Yuan’s explanation of how to use maximum uncertainty for ampliative reasoning almost sounds like a sort of reverse parsimony: “use all information available, but make sure that no additional information is unwittingly added…the principle requires that conclusions resulting from any ampliative inference maximize the relevant uncertainty within the constraints representing the premises. The principle guarantees that our ignorance be fully recognized when we try to enlarge our claims beyond the given premises and, at the same time, that all information contained in the premises be fully utilized. In other words, the principle guarantees that our conclusions are maximally noncommittal with regard to information not contained in the premises.”

[6] IBID., p. 275.

[7] IBID., p. 271.

[8] IBID., p. 269.

[9] See the Wikipedia webpage “Dempster Shafer Theory” at http://en.wikipedia.org/wiki/Dempster%E2%80%93Shafer_theory

[10] IBID.

[11] See Chesterton, G.K., 2001, Orthodoxy. Image Books: London. Available online at the G. K. Chesterton’s Works on the Web address http://www.cse.dmu.ac.uk/~mward/gkc/books/

[12] Lunn, Arnold, 1931, The Flight from Reason. Longmans, Green and Co.: New York.

[13]  pp. 245-276, Klir and Yuan.

[14] IBID.

[15] IBID., p. 245.

 

Implementing Fuzzy Sets in SQL Server, Part 10.1: A Crude Introduction to Dempster-Shafer Evidence Theory

By Steve Bolton

…………Early on in this series, we learned how the imprecision in natural language statements like “the weather is hot” can be modeled using fuzzy sets. Ordinarily, the membership grades assigned to fuzzy sets are not to be interpreted as probabilities, even though they’re both implemented on continuous scales between 0 and 1; the exception to this rule is when a probabilistic meaning is consciously assigned to the type of fuzziness. A couple of articles ago we saw how membership scores can be interpreted as assessing the logical possibility of the associated statements; the possibility distributions this nuance gives rise to quantify whether or not an event can occur, whereas a probability distribution assesses whether it will actually occur. The two scales are independent except at the maximum and minimum values, where possibility values act as caps on probabilities, since an event must be possible if it is to have a non-zero probability. The possibility and necessity measures that factor into possibility distributions are actually special cases of the plausibility and belief measures used in Dempster-Shafer Evidence Theory, which has a related shade of meaning: instead of gauging whether or not an event can or will happen, plausibility and belief work together to grade the credibility of the associated evidence. If we were sifting through user stories in a Behavior-Driven Development (BDD) process, we wouldn’t use evidence theory for fuzzy terms like “the weather is hot,” or questions like “the weather could be cold” or “the weather is probably mild,”[1] which might be candidates for possibilistic or stochastic modeling. “As far as I can tell, the weather will be hot,” might be fair game, since the subject of the sentence is the trustworthiness of the associated statement. The clearest example I’ve yet run across in the literature occurs in George J. Klir and Bo Yuan’s Fuzzy Sets and Fuzzy Logic: Theory and Applications, which I’ve used as my go-to resource throughout this series for the heavy math formulas:

                “Consider, however, the jury members for a criminal trial who are uncertain about the guilt or innocence of the defendant. The uncertainty in this situation seems to be of a different type; the set of people who are guilty of the crime and the set of innocent people are assumed to have very distinct boundaries. The concern, therefore, is not with the degree to which the defendant is guilty, but with the degree to which the evidence proves his membership in either the crisp set of guilty people or the crisp set of innocent people.”[2]

In the last article, I gave a monologue on how organizations can benefit from uncertainty management programs, which begin with partitioning uncertainty into various types, like probabilities, nonspecificity, fuzziness and conflicting information; these in turn stretch across five mathematical subtopics: information theory, stochastics, possibility theory, fuzzy sets and evidence theory. The last of these has its own corresponding formulas for measures like nonspecificity, but is particularly useful for quantifying the degree of conflict between pieces of information. For this reason, it is widely used to aggregate disparate sources of information, which in turn integrates seamlessly with Decision Theory; for example, one of its most common implementations is sensor fusion.[3] Klir and Yuan also provide a concise list of possible use cases in various fields:

                “For instance, suppose we are trying to diagnose an ill patient. In simplified terms, we may be trying to determine whether the patient belongs to the set of people with, say, pneumonia, bronchitis, emphysema, or a common cold. A physical examination may provide us with helpful yet inconclusive evidence. For example, we might assign a high value, say 0.75, to our best guess, bronchitis, and a lower value to the other possibilities, such as 0.45 for the set consisting of pneumonia and emphysema and 0 for a common cold. These values reflect the degree to which the patient’s symptoms provide evidence for the individual diseases or sets of diseases; their collection constitutes a fuzzy measure representing the uncertainty associated with several well-defined alternatives; It is important to realize that this type of uncertainty, which results from information deficiency, is fundamentally different from fuzziness, which arises from the lack of sharp boundaries.”[4]

…………Thankfully, a sturdy mathematical scaffolding to model these types of evidence-based uncertainty already exists, although it isn’t being tested much these days in the relational database, data warehousing and data mining fields. The modeling process is akin to the one I introduced a few weeks ago for possibility distributions, but a tad more complicated. A continuous data type like float, numeric or decimal is required for probability values, but possibility theory also calls for the addition of a bit column, which is often assigned to the Necessity measure. In the theory developed by statisticians Arthur Dempster and Glenn Shafer, we need three measures: a Probability Mass Assignment (often denoted by a lower case m) that tells us the strength of the evidence that a record belongs just to one set; a Belief measure that measures the same, plus the evidence for belonging to its subsets; and a Plausibility measure, which covers both of those, as well as “the additional evidence or belief associated with sets that overlap with A.”[5] The easy part is that all three are measured on a scale of 0 to 1, the same as fuzzy sets, probabilities, possibilities and the like; the complexity arises from the fact that they measure evidence at different levels. This leads to nested bodies of evidence, which alpha cuts (α-cuts) are ideal for modeling, as explained a couple of articles ago; I saved this topic for the next-to-last article precisely because it unites many of the concepts introduced throughout the series, like α-cuts, fuzzy unions, intersections and complements.
…………These relationships also give rise to various mathematical properties, some of which are similar to those used in possibility distributions. For example, just as Necessity is equal to 1 minus the Possibility of the complement, so too is Plausibility equal to 1 minus the Belief of the complement. Plausibility must be greater than or equal to the Belief, since it models evidence at a higher scope. These “fuzzy measures” have weaker forms of properties like monotonicity, continuity and additivity than probabilities do.[6] Belief measures are superadditive, which means that the Belief assigned to a union of disjoint subsets must be greater than or equal to the sum of their individual Beliefs; the practical upshot is that, unlike probabilities, Belief figures summed across all of the hypotheses in a table can legitimately exceed 1. For example, disjoint subsets might be assigned degrees of belief like 0.2, 0.3 and 0.1, which sum to 0.6, while the Belief for their union could be any figure from 0.6 up to 1. In contrast, probabilities must always sum to 1 across a dataset, including the probability mass assignments used in evidence theory. Plausibility is subadditive, which signifies the opposite relationship: the Plausibility of a union can be no greater than the sum of the Plausibilities of its subsets, so the measures taken across the subsets must sum to at least the Plausibility for the whole set. In short, Plausibility behaves more like a maximum than a sum across a union. This all sounds weird, but it’s a necessary logical consequence of the nesting of evidence. As explained in the discussion on α-cuts a couple of articles ago, this signifies that records can belong to multiple hierarchical partitions of a set, which is an unfamiliar situation in the relational world (despite the fact that it is easily modeled using set-theoretic relational technology). The good news is that this web of interrelationships makes the three evidence theory measures reconstructible from each other, which makes it possible to validate the values using queries like the samples in Figure 2.
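…………A quick sanity check along those lines is easy to write; the query below is only a rough sketch against the sample table used in the figures that follow, but it verifies that the probability masses sum to 1 and that no record’s Plausibility falls below its Belief:

-- a rough consistency check, not a formal validation: the first column should return 1
-- and the second should return 0 if the three measures were assigned coherently
SELECT SUM(ProbabilityMassAssignment) AS ProbabilityMassTotal,
       SUM(CASE WHEN PlausibilityScore < BeliefScore THEN 1 ELSE 0 END) AS CountOfRowsWherePlausibilityIsBelowBelief
FROM Health.DuchennesEvidenceTheoryTable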

Two Common Illustrations of Dempster-Shafer Evidence Theory in Action

                The Wikipedia article on Dempster-Shafer Theory has comprehensible examples of how these three measures work together, beginning with a sensor that detects whether a cat concealed in a box is in a Dead or Alive state. The value for Either obviously reaches the maximum value of 1 for both Belief and Plausibility, since it must be one of the two by logical necessity (that is, unless our cat happens to belong to Erwin Schrödinger or was buried in Pet Sematary). It is thus an instance of a “universal hypothesis,” which encompasses the whole dataset. Yet the probability mass assignment for the Either state is only 0.3, which signifies the fact that we don’t have solid information on its status; the probability figure for the whole dataset still sums to 1 though, once the stats for Alive and Dead are factored in. The probability value for the universal hypothesis thus constitutes a measure of the uncertainty remaining in the data, once the probability, Belief and Plausibility measures have partitioned it off. Since Dead and Alive are discrete states without fuzzy intervals, the Wikipedia example assigns them Belief figures equal to their probability masses – which, when added to the value of 1 for the Either state, means that the total Belief for the whole dataset is greater than 1, unlike the probability mass. The Plausibility can then be reconstructed by taking 1 minus the Belief of the complement.
…………The tricky part is that the Belief measures must sum to 1 for each subset, which calls for looking at our data in an unfamiliar way. I initially thought that the existence of these subsets meant that we could simply model this by applying the appropriate normal form, but that’s not the case. The second example in the Wikipedia article has examples of states like Red, Yellow and Green, which are mutually exclusive, as well as some that carry a bit of measurement uncertainty, like “Red or Yellow” and “Red or Green.” In this situation, the Belief figures for Red, Yellow and “Red or Yellow” must sum to 1, as must the Belief figures for Red, Green and “Red or Green,” since there are two overlapping subsets. Red, Yellow and Green are all members of more than one subset, but not the same ones. This leads to an odd predicament where each state is discrete and thus difficult to denormalize, yet the associated column still represents subsets; this is one situation where the presence of logical OR statements is not a hint that the design requires normalization. Since we can’t be certain how many other state descriptions a child could be related to, a single self-referencing ParentID column won’t do the job either. The next best thing is an interleaved solution, in which a separate table with two foreign keys pointing back to the primary key of the table holding the Belief measures keeps track of which subsets each record belongs to. To aggregate the Belief figures for each subset in the parent table, we just inspect the interleaved table for all of the categories a record can belong to.
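…………For readers who want to see what that interleaved design might look like in DDL, here is a bare-bones sketch; the table and column names match the ones used in the validation code later in this article, but the data types, keys and constraints are my own guesses rather than a faithful reproduction of the table behind the figures:

-- a bare-bones sketch of the interleaved table; the data types and constraints are assumptions
CREATE TABLE Health.DuchennesEvidenceTheoryIntervalTable (
       ID bigint IDENTITY(1,1) NOT NULL PRIMARY KEY,
       ParentID bigint NOT NULL,       -- the hypothesis whose Belief aggregates its subsets
       BeliefSubsetID bigint NOT NULL, -- a hypothesis counted among that parent's subsets
       CONSTRAINT FK_EvidenceTheoryInterval_Parent FOREIGN KEY (ParentID)
              REFERENCES Health.DuchennesEvidenceTheoryTable (ID),
       CONSTRAINT FK_EvidenceTheoryInterval_Subset FOREIGN KEY (BeliefSubsetID)
              REFERENCES Health.DuchennesEvidenceTheoryTable (ID)
)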

Server States: A SQL Server-Specific Example

                Let me give an example that might be more intuitive and relevant to SQL Server users: the state_desc column of sys.databases will assign one of seven mutually exclusive states to each database: Online, Offline, Restoring, Recovering, Recovery Pending, Suspect and Emergency. As far as I know, these states do not rule out which user modes a database can be in, which range from SINGLE_USER to RESTRICTED_USER to MULTI_USER. Nevertheless, many combinations would be improbable, so each unique pair of descriptions requires a probability assignment that will probably differ from other pairs of state_desc and user mode values. Now let’s pretend we have a sensor that guesses which pair of states a server is in at any given moment, perhaps based on I/O data or network bandwidth usage. If it can tell us the user mode plus whether we’re in one of the three recovery states, but can’t differentiate between them accurately, then we’re dealing with a fuzzy interval-valued set. From the point of view of the sensor, “Restoring | Recovering | Recovery Pending” is a discrete state and ought to be recorded as such in the database table. Nevertheless, to derive the Belief we must sum together all of the probabilities for the subsets it gives rise to, while the Plausibility equals one minus the sum of the probability assignments in the subsets it does not participate in. We could create a separate category like “Unknown” for situations where the sensor went offline or was otherwise unable to return accurate data – or better yet, establish a universal hypothesis like “Any State” with the Belief and Plausibility both set to 1, to which we add all of its possible subsets. Subtracting the sum of the probabilities of all known states from that of the universal hypothesis would allow us to measure one type of uncertainty associated with the table. In order to measure the uncertainty inherent in the interval-valued fuzzy subsets that the Belief and Plausibility measures are attached to, we’d have to use a measure of fuzziness tailored to evidence theory. In the same vein, the count of possible state descriptions could be used to derive a measure of nonspecificity, albeit through a different formula than the ones introduced in the last article. In addition, we can define measures of uncertainty based on how much the evidence from different sources conflicts, a topic I’ll take up in the next article when I introduce the Strife and Discord measures.
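…………To make that concrete before we turn to the real sample data, here is a toy version of the sensor scenario with entirely made-up probability masses (and the user modes left out for brevity). The Belief for the interval hypothesis “Restoring | Recovering | Recovery Pending” sums its own mass with the masses of the states nested inside it, while its Plausibility is 1 minus the masses of the states that don’t overlap it at all:

-- a toy illustration only; the state names come from sys.databases but the masses are invented
DECLARE @ServerStateEvidence table
(StateDescription nvarchar(128),
ProbabilityMassAssignment float)

INSERT INTO @ServerStateEvidence
VALUES (N'Online', 0.72), (N'Offline', 0.05), (N'Restoring', 0.04), (N'Recovering', 0.03),
(N'Recovery Pending', 0.02), (N'Suspect', 0.01), (N'Emergency', 0.01),
(N'Restoring | Recovering | Recovery Pending', 0.07), (N'Any State', 0.05)

SELECT SUM(CASE WHEN StateDescription IN (N'Restoring', N'Recovering', N'Recovery Pending', N'Restoring | Recovering | Recovery Pending') THEN ProbabilityMassAssignment ELSE 0 END) AS BeliefScore,
1 - SUM(CASE WHEN StateDescription IN (N'Online', N'Offline', N'Suspect', N'Emergency') THEN ProbabilityMassAssignment ELSE 0 END) AS PlausibilityScore
FROM @ServerStateEvidence

This returns a Belief of 0.16 and a Plausibility of 0.21 (allowing for floating point rounding) for the fuzzy recovery state, with the gap between the two reflecting the mass tied up in the overlapping “Any State” hypothesis.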
…………It is easier to illustrate all of this with T-SQL code samples, beginning with the easiest part, a simple snapshot of a table with probability mass, Belief and Plausibility measures defined on it. Degrees of Belief are usually derived from some kind of input method, akin to fuzzy set membership functions – except that subjective ratings tend to be more common in evidence theory. It is no surprise that Bayesian methods are often applied in deriving Belief functions, given that they actually represent a more specific subset of evidence theory measures. Instead of complicating the topic any further, I’ve derived the values in Figure 1 by creating an artificial category in the Duchennes muscular dystrophy data I’ve been using for practice data for the last few tutorial series[7], then simply assigning probability masses based on the frequency of the values for the LactateDehydrogenase column. From there, I derived the Belief measures, then constructed the Plausibility measures from those. I used the float data type for all three of the columns that associate these measures with the LactateDehydrogenaseState column, an ordinal category; this represents yet another use of fuzzy sets to model ordinals on continuous scales, except at a more advanced level where three columns are required.

Figure 1: Simple Evidence Theory Measures Defined on the LactateDehydrogenase Column

Figure 2: Sample Validation Code for the Relationships Between the Three Evidence Theory Measures
-- verifying the Belief via the probability mass assignments
SELECT ID, LactateDehydrogenaseState, ProbabilityMassAssignment, BeliefScore, PlausibilityScore,
CASE WHEN IntervalProbabilityMassAssignmentSum IS NOT NULL THEN IntervalProbabilityMassAssignmentSum ELSE ProbabilityMassAssignment END
AS BeliefReconstructedFromProbabilityMass
FROM Health.DuchennesEvidenceTheoryTable AS T3
        LEFT JOIN (SELECT ParentID, SUM(ProbabilityMassAssignment) AS IntervalProbabilityMassAssignmentSum
              FROM Health.DuchennesEvidenceTheoryTable AS T1
               INNER JOIN Health.DuchennesEvidenceTheoryIntervalTable AS T2
               ON T1.ID = T2.BeliefSubsetID
              GROUP BY ParentID) AS T4
       ON T3.ID = T4.ParentID 

SELECT ID, LactateDehydrogenaseState, BeliefScore, ProbabilityMassAssignment, ProbabilityMassAssignmentBySum,
CASE WHEN ProbabilityMassAssignmentBySum IS NULL THEN 1 ELSE ABS(1 - (ProbabilityMassAssignment + ProbabilityMassAssignmentBySum)) END AS PlausibilityScoreReconstructedFromProbability
FROM (SELECT ID, LactateDehydrogenaseState, BeliefScore, ProbabilityMassAssignment
FROM Health.DuchennesEvidenceTheoryTable) AS T5
       LEFT JOIN (SELECT BeliefSubsetID, SUM(ProbabilityMassAssignment) AS ProbabilityMassAssignmentBySum
       FROM (SELECT DISTINCT T1.BeliefSubsetID, T2.ParentID
              FROM Health.DuchennesEvidenceTheoryIntervalTable AS T1
                     INNER JOIN Health.DuchennesEvidenceTheoryIntervalTable AS T2
                     ON T1.ParentID = T2.BeliefSubsetID AND T1.BeliefSubsetID != T2.BeliefSubsetID) AS T4
                           INNER JOIN Health.DuchennesEvidenceTheoryTable AS T3
                           ON T4.ParentID = T3.ID
       GROUP BY BeliefSubsetID) AS T6
       ON T5.ID = T6.BeliefSubsetID

…………Note how the Belief is equal to the ProbabilityMassAssignment for Low, Medium and High, which is reflective of the fact that they have no substates; Medium or Low and High or Medium have BeliefScore values higher than their masses, precisely because we have to tack the values for Low, Medium and High onto them. The PlausibilityScore is in each case determined by adding together all of the ProbabilityMassAssignment values for the columns that aren’t among a record’s subsets, then taking an inverse, which is equivalent to subtracting the complement of the BeliefScore from 1. The second image depicts the Health.DuchennesEvidenceTheoryIntervalTable, in which the ParentID and BeliefSubsetID determine the linkages between subsets. For example, the records with ParentIDs of 4 tie together the Medium | Low, Medium and High | Medium values, so that we can aggregate the ProbabilityAssignments to derive the BeliefScore. The PlausibilityScore can be determined using the same table. Code similar to what I provided in Figure 2 can be used to validate the relationships between these fuzzy measures, with your own particular column and table names plugged in of course. The IS NULL condition is due to a bizarre problem in which setting the first condition in the CASE to BeliefScore = 1 THEN 1, or using NullIf, both led to NULL values. It is also possible to derive the ProbabilityMassAssignment values in reverse, but I’ll omit validation code for that scenario in the interest of brevity. To avoid pummeling readers with too much information all at once, I’ll also put off discussion of how to derive uncertainty measures like Strife and Discord from this crude example. In the next article, I’ll also mention some principles for interpreting the results that can in turn provide an important bridge to Information Theory. Among other things, the first table tells us that, “the belief that the Lactate Dehydrogenase values are Medium or Low is higher than that for Low alone, by a margin of 0.679425837320574 to 0.349282296650718. It is more plausible that the value is High than Low, by a margin of 0.822966507177033.” Once we define measures of fuzziness, nonspecificity and the like on top of them and apply some principles of inference drawn from Information Theory, we can partition the uncertainty further in order to glean additional valuable insights.

 

[1] Here in Western New York the natural language term “mild” has interesting shades of meaning (at least among local weathermen) which would be a challenge to model in terms of a fuzzy set. As winter approaches, “mild” means warmer than normal, but as the peak of summer comes, it means cooler than expected, so the meaning is inverted depending on the season. If we were to use an interval-valued set, we’d need a range of values somewhere between 30 and 70 degrees – which is so imprecise that it borders on meaningless.

[2] p. 177, Klir, George J. and Yuan, Bo, 1995, Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall: Upper Saddle River, N.J.

[3] See the Wikipedia article “Dempster Shafer Theory” at http://en.wikipedia.org/wiki/Dempster%E2%80%93Shafer_theory

[4]  p. 179, Klir and Yuan.

[5] IBID., p. 181-182.

[6] IBID., p. 179-181.

[7] Which I downloaded from the Vanderbilt University’s Department of Biostatistics and converted into a SQL Server table in my sham DataMiningProjects database.

 

Implementing Fuzzy Sets in SQL Server, Part 9: Measuring Nonspecificity with the Hartley Function

By Steve Bolton

…………Imagine how empowering it would be to quantify what you don’t know. Even an inaccurate measure might be helpful in making better decisions in any area of life, but particularly in the business world, where change is the only certainty. This is where a program of “uncertainty management” can come in handy and fuzzy set techniques find one of their most useful applications. Fuzzy sets don’t introduce new information, but they do conserve and put to good use some information left over after ordinary “crisp” sets are defined – particularly when it would be helpful to model ordinal categories on continuous number scales. As I pointed out at the beginning of this series, uncertainty reduction is akin to Stephen King’s adage that monsters are less fearsome once some scale of measurement can be applied to them; knowing that a bug is 10 feet tall is at least reassuring, in the sense that we now know that it is not 100 or 1,000 feet tall.[1] Uncertainty reduction can also be put to obvious uses in data mining activities like prediction and clustering. Another potential use is in simplification of data, so that information loss is minimized.[2] In today’s article I’ll shine a little light on the Hartley function, a tried and true method of quantifying one particular category of uncertainty that has been used since 1928 to simplify and demystify datasets of all kinds and could easily be extended to SQL Server data.
…………George J. Klir and Bo Yuan, the authors of my favorite resource for fuzzy set equations, note that data models must take uncertainty into account, along with complexity and credibility. Later in the book, they go on to subdivide uncertainty into three types that sprawl across possibility theory, stochastics, information theory, fuzzy sets and Dempster-Shafer Evidence Theory:

                “The relationship is not as yet fully understood…Although usually (but not always) undesirable when considered alone, uncertainty becomes very valuable when considered in connection to the other characteristics of systems models; in general, allowing more uncertainty tends to reduce complexity and increase credibility of the resulting model. Our challenge in systems modelling is to develop methods by which an optimal level of allowable uncertainty can be estimated for each modelling problem…”[3]

“…Three types of uncertainty are now recognized in the five theories, in which measurement of uncertainty is currently well established. These three uncertainty types are: nonspecificity (or imprecision), which is connected with sizes (cardinalities) of relevant sets of alternatives; fuzziness (or vagueness), which results from imprecise boundaries of fuzzy sets; and strife (or discord), which expresses conflicts among the various sets of alternatives.

“It is conceivable that other types of uncertainty will be discovered when the investigation of uncertainty extends to additional theories of uncertainty.”[4]

…………Some authors also include “ambiguity (lack of information),”[5] which Klir and Yuan define as a parent class of both discord and nonspecificity in an excellent diagram I wish I could reprint.[6] Probabilities probably also ought to be included as well.[7] As soon as I was introduced to the concept of uncertainty partitioning, I was intrigued by the possibility of defining human free will as an alternative form of uncertainty, but that raises many thorny philosophical questions. Among them is the contention that it doesn’t even exist, which is a disturbing tenet of many popular philosophies, like materialistic determinism and certain forms of theological predestination. I’d dispute that with evidence that would be hard to debunk and raise the possibility that it may not be possible to quantify it at all, by definition; the ability to assign values to it would certainly be helpful in academic fields like economics and psychology, where human behavior is the crux of the matter. This topic integrates quite nicely with the contention of authors like Lotfi A. Zadeh, the father of fuzzy set theory, that it might be helpful to apply fuzzy techniques in these fields to model “humanistic systems.”[8] Other controversial candidates for new categories of uncertainty include the notion that reality is somewhat subjective (which I would argue is fraught with risk, since it is a key component of many forms of madness) and the contention that some events (particularly at the quantum level) can be truly random, in the sense of being indeterminate or “uncaused.” Albert Einstein drove home the point that uncertainty is deeply rooted in all we see in his famous quote from a lecture at the Prussian Academy of Sciences in 1921, in which he seemed to extend it right into the heart of mathematics itself: “…as far as the propositions of mathematics refer to reality, they are not certain; and as far as they are certain, they do not refer to reality.”[9]

Partitioning Uncertainty

                The first step is to develop a habit of explicitly recognizing which type of uncertainty is under discussion, then partitioning it off using the appropriate type of fuzzy set. For example, whenever we need to cram continuous scales into finite data types like float, decimal and numeric, we end up creating measurement uncertainty about whatever values come after the precision we’ve chosen.[10] Like other types of measurement uncertainty, this is best addressed by fuzzy sets without any special probabilistic, possibilistic or evidence theory connotations attached to them. Incidentally, some theoreticians say that if we’re trying to quantify the uncertainty of a measurement, membership functions based on the normal distribution (i.e. the bell curve) are usually the best choice (based on empirical evidence from the aerospace industry).[11] If we were uncertain about the likelihood of an event occurring, we’d assign a probability value instead; if we were unsure of the logical necessity of an event, we’d use a possibility distribution, as explained in the last installment of this series. In the next installment, I’ll explain how Dempster-Shafer Theory can be used to judge the certainty and credibility of evidence, by assigning grades of membership in the set of true statements.
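…………Since the bell curve recommendation is easy to demonstrate, the snippet below is a minimal sketch of a Gaussian membership function applied to the same practice column used later in this article; the @Center and @Spread values are arbitrary placeholders rather than tuned recommendations, and any other shape could be substituted if it better matched the measurement process at hand:

-- a minimal sketch of a Gaussian (bell curve) membership function for measurement uncertainty:
-- EXP of -1 times the squared distance from @Center, divided by 2 times @Spread squared;
-- the parameter values below are placeholders, not recommendations
DECLARE @Center float = 200, @Spread float = 50

SELECT ID, LactateDehydrogenase,
EXP(-1 * POWER(LactateDehydrogenase - @Center, 2) / (2 * POWER(@Spread, 2))) AS MembershipScore
FROM Health.DuchennesTable
WHERE LactateDehydrogenase IS NOT NULL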
…………Once the appropriate method of uncertainty modeling has been selected, we can then apply its associated formulas to compute figures for nonspecificity, imprecision, discord and the like. The good news is that we already dispensed with the main means of computing fuzziness, back in Implementing Fuzzy Sets in SQL Server, Part 2: Measuring Imprecision with Fuzzy Complements. In the remainder of this article, I’ll provide sample T-SQL for implementing two of the three main methods for calculating the “U-Uncertainty,” a.k.a. the nonspecificity. Like many other authors I consulted for this series, Klir and Yuan stress that nonspecificity and fuzziness are completely independent stats, since they measure two distinct and unrelated types of uncertainty.[12] The former is dictated by the number of possible distinct states that a set can take on, whereas the latter quantifies imprecision in class boundaries.[13] A set can have many possible arrangements, yet still be entirely crisp; there’s no mistaking what a Lego or Lincoln Log is, but there’s apparently no end to the crazy things that can be built with either one. Sets with few arrangements but really fuzzy boundaries are also possible. That is why fuzzy sets sans any additional meaning like probability, possibility and credibility scores have both fuzziness and nonspecificity measures attached to them.
…………Possibility theory, the topic of the last blog post in this amateur series of self-tutorials, has a form of nonspecificity that is easier to specify (pun intended) than the ordinary fuzzy set version, so I’ll introduce that first. The SELECT in Figure 1 is performed on a column of muscular dystrophy data I downloaded from Vanderbilt University’s Department of Biostatistics and added to a sham DataMiningProjects database a few tutorial series ago. The PossibilityScore was assigned by a random number generator in the last article and tacked onto the table definition, for the sake of convenience. It’s time for my usual disclaimer: I’m writing this in order to learn this topic, not because I know it well, so it is a good idea to check over my T-SQL samples before putting them to serious use. This is especially true of this SELECT, where I may be applying a Lead where there should be a Lag; in contrast to the topics I posted on in previous series, examples with sample data are few and far between in the fuzzy set literature, which makes validation difficult. Furthermore, there is apparently a more compact version available for specific situations, but I’ll omit it for now because I’m still unclear on what mathematical prerequisites are needed.[14]

Figure 1: Possibilistic Nonspecificity for the LactateDehydrogenase Column
SELECT SUM(PossibilityDifference * Log(RN, 2)) AS PossibilisticUUncertainty
FROM (SELECT ROW_NUMBER() OVER (ORDER BY ID) AS RN, PossibilityScore - Lead(PossibilityScore, 1, 0) OVER (ORDER BY ID) AS PossibilityDifference
       FROM Health.DuchennesTable) AS T1

…………The SELECT returns a single value of 4.28638426128113, which measures that amount of uncertainty in bits; the greater the number of possible state descriptions, the higher the U-Uncertainty will be. The same relationship applies to the procedure below, which returns a value of 7.30278910848746 bits; the difference is that one measures uncertainty about the number of possible values the LactateDehydrogenase column can have, while the other measures lack of certainty about the number of membership function scores a row can be assigned. Figure 2 is practically identical to the sample code I’ve posted throughout this series, at least as far as the UPDATE; all I’m doing is running the stored procedure from Outlier Detection with SQL Server, part 2.1: Z-Scores on the DuchennesTable and storing the results in a table variable, then transforming them to a scale of 0 to 1 using the @Rescaling variables and ReversedZScores column. The GroupRank column can be safely ignored, as usual. The first SELECT with the AlphaCutLeftBound and AlphaCutRightBound columns is only provided to illustrate how the nonspecificity figure is arrived at in the last SELECT. What we’re basically doing is partitioning the dataset into nested levels, using the alpha cut (α-cut) technique I introduced in the last article, then applying a Base-2 LOG and summing the results across the hierarchy.[15] The tricky part is that with α-cuts, records can belong to more than one subset, as I pontificated on in my last post; the levels are widest at the bottom of the dataset, but narrowest at the top, where the MembershipScore values approach the maximum of 1. This calls for thinking about the data in an odd way, given that in most relational operations records are assigned to only a single subset.
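…………To make the α-cut logic a little more concrete before the full version in Figure 2, here is a toy run on just three hard-coded membership scores; the numbers are arbitrary and the query simply mirrors the final SELECT of Figure 2:

-- a toy walkthrough on made-up scores of 0.2, 0.5 and 1.0: the α-cut at 0.2 contains 3 records,
-- the cut at 0.5 contains 2 and the cut at 1.0 contains just 1, so the nonspecificity works out to
-- (0.2 * LOG(3, 2)) + (0.3 * LOG(2, 2)) + (0.5 * LOG(1, 2)), or roughly 0.617 bits
DECLARE @ToyScores table
(MembershipScore decimal(38,6))

INSERT INTO @ToyScores
VALUES (0.2), (0.5), (1.0)

SELECT SUM(AlphaCutBoundaryChange * Log(AlphaCutCount, 2)) AS FuzzySetNonspecificityInBits
FROM (SELECT AlphaCutCount, AlphaCutBound - Lag(AlphaCutBound, 1, 0) OVER (ORDER BY AlphaCutBound) AS AlphaCutBoundaryChange
       FROM (SELECT Count(*) AS AlphaCutCount, AlphaCutBound
              FROM @ToyScores AS T1
              INNER JOIN (SELECT DISTINCT MembershipScore AS AlphaCutBound
                     FROM @ToyScores) AS T2
              ON MembershipScore >= AlphaCutBound
              GROUP BY AlphaCutBound) AS T3) AS T4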

Figure 2: Code for Hartley Nonspecificity
DECLARE @RescalingMax decimal(38,6), @RescalingMin decimal(38,6), @RescalingRange decimal(38,6)
DECLARE @ZScoreTable table
(PrimaryKey sql_variant,
Value decimal(38,6),
ZScore decimal(38,6),
ReversedZScore as CAST(1 as decimal(38,6)) - ABS(ZScore),
MembershipScore decimal(38,6),
GroupRank bigint
)

INSERT INTO @ZScoreTable
(PrimaryKey, Value, ZScore, GroupRank)
EXEC   Calculations.ZScoreSP
              @DatabaseName = N'DataMiningProjects',
              @SchemaName = N'Health',
              @TableName = N'DuchennesTable',
              @ColumnName = N'LactateDehydrogenase',
              @PrimaryKeyName = N'ID',
              @DecimalPrecision = '38,32',
              @OrderByCode = 8

-- RESCALING
SELECT @RescalingMax = Max(ReversedZScore), @RescalingMin = Min(ReversedZScore) FROM @ZScoreTable
SELECT @RescalingRange = @RescalingMax - @RescalingMin

UPDATE @ZScoreTable
SET MembershipScore = (ReversedZScore - @RescalingMin) / @RescalingRange

SELECT AlphaCutBound AS AlphaCutLeftBound, Lag(AlphaCutBound, 1, 0) OVER (ORDER BY AlphaCutBound) AS AlphaCutRightBound,
AlphaCutBound - Lag(AlphaCutBound, 1, 0) OVER (ORDER BY AlphaCutBound) AS AlphaCutBoundaryChange, Log(AlphaCutCount, 2) AS IndividualLogValue
FROM (SELECT Count(*) AS AlphaCutCount, AlphaCutBound
       FROM @ZScoreTable AS T1
       INNER JOIN (SELECT DISTINCT MembershipScore AS AlphaCutBound
              FROM @ZScoreTable) AS T2
       ON MembershipScore >= AlphaCutBound
       GROUP BY AlphaCutBound) AS T3

SELECT SUM(AlphaCutBoundaryChange * Log(AlphaCutCount, 2)) AS FuzzySetNonspecificityInBits
FROM (SELECT AlphaCutCount, AlphaCutBound - Lag(AlphaCutBound, 1, 0) OVER (ORDER BY AlphaCutBound) AS AlphaCutBoundaryChange
       FROM (SELECT Count(*) AS AlphaCutCount, AlphaCutBound
              FROM @ZScoreTable AS T1
              INNER JOIN (SELECT DISTINCT MembershipScore AS AlphaCutBound
                     FROM @ZScoreTable) AS T2
              ON MembershipScore >= AlphaCutBound
              GROUP BY AlphaCutBound) AS T3) AS T4

 

Figure 3: Results for the Hartley Nonspecificity Example

…………The point of using the α-cuts is to chop the dataset up into combinations of possible state descriptions, which is problematic with fuzzy sets because the boundaries between states are less clear. The interpretation depends entirely on the meaning of the fuzzy attribute; as Klir and Yuan note, it can reflect “an unsettled historical question” in the case of retrodiction, possible future states in the case of prediction, prescriptive uncertainty in the case of policies, diagnostic uncertainty in the case of medical information and so forth.[16] In the same vein, we can interpret my sample above as measuring 7.30278910848746 bits of uncertainty about a record’s place within the range of Z-Scores, which can in turn be used as a form of outlier detection. The smaller the range of possible values, the smaller the number of possible state descriptions becomes, which means that the cardinality of the α-cuts and the value of the final statistic decline as well.
…………This is an adaptation of a function developed way back in 1928 by electronic pioneer Ralph Hartley[17]; since it serves as one of the foundations of information theory I’ll put off discussion of the crisp version until my long-delayed monster of a series, Information Measurement with SQL Server. We’ve got at least two more articles in the fuzzy set series to dispense with first, including an examination of Dempster-Shafer Theory in the next installment. Evidence theory also has its own brand of nonspecificity measure, also based on the Hartley function.[18] Measures like strife and discord are more relevant to that topic, since they deal with conflicts in evidence. Possibility theory has counterparts for both, but I’ll leave them out, given that Klir and Yuan counsel that “We may say that possibility theory is almost conflict-free. For large bodies of evidence, at least, these measures can be considered negligible when compared with the other type of uncertainty, nonspecificity. Neglecting strife (or discord), when justifiable, may substantially reduce computation complexity in dealing with large possibilistic bodies of evidence.”[19] Possibility theory is a useful springboard into the topic though, given that Belief and Plausibility measures are modeled in much the same way. In fact, Possibility and Necessity measures are just special cases of Belief and Plausibility, which should serve to decomplicate my introduction to Dempster-Shafer Theory a little.

 

[1] p. 114, King, Stephen, 1981, Stephen King’s Danse Macabre. Everest House: New York. I’m paraphrasing King, who in turn paraphrased an idea expressed to him by author William F. Nolan at the 1979 World Fantasy Convention.

[2] p. 269, Klir, George J. and Yuan, Bo, 1995, Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall: Upper Saddle River, N.J.

[3] IBID., p. 3.

[4] IBID., p. 246.

[5] p. 2, Hinde, Chris .J. and Yang, Yingjie., 2009, “A New Extension of Fuzzy Sets Using Rough Sets: R-Fuzzy Sets,” pp. 354-365 in Information Sciences, Vol. 180, No. 3. Available online at the Loughborough University Institutional Repository web address https://dspace.lboro.ac.uk/dspace-jspui/bitstream/2134/13244/3/rough_m13.pdf

[6]  p. 268, Klir and Yuan.

[7] IBID., p. 3.

[8] IBID., p. 451.

[9] Cited from the Common Mistakes in Using Statistics web address https://www.ma.utexas.edu/users/mks/statmistakes/uncertaintyquotes.html

[10] IBID., pp. 327-328.

[11] Kreinovich, Vladik; Quintana, Chris and Reznik, L.,1992, “Gaussian Membership Functions are Most Adequate in Representing Uncertainty in Measurements,” pp. 618-624 in Proceedings of the North American Fuzzy Information Processing Society Conference, Vol. 2. NASA Johnson Space Center: Houston. Available online at the University of Texas at El Paso web address www.cs.utep.edu/vladik/2014/tr14-30.pdf

[12] p. 258, Klir and Yuan.

[13] p. 2, Hinde and Yang.

[14] pp. 253, 269, Klir and Yuan.

[15] IBID., pp. 248-251.

[16] IBID., p. 247.

[17] See the Wikipedia articles “Hartley Function” and “Ralph Hartley” at http://en.wikipedia.org/wiki/Hartley_function and http://en.wikipedia.org/wiki/Ralph_Hartley respectively.

[18] pp. 259, Klir and Yuan.

[19] IBID., p. 264.

 

Implementing Fuzzy Sets in SQL Server, Part 8: Possibility Theory and Alpha Cuts

By Steve Bolton

…………To get the point across that fuzzy sets require membership grades of some sort, throughout this series I’ve borrowed the stored procedure I coded for Outlier Detection with SQL Server, part 2.1: Z-Scores and rescaled the results on the customary range of 0 to 1. The literature on fuzzy sets contains frequent warnings against automatically interpreting membership scores as probabilities, but I deliberately introduced a tie-in to stochastics by using Z-Scores, which are inherently probabilistic. Other shades of meaning may be assigned which are unfamiliar to modelers of ordinary “crisp” sets, which is why I pointed out early on in this series of amateur self-tutorials that interpretability is a more prominent issue with fuzzy sets. For example, membership functions can be viewed as assigning scores to the accuracy of the associated values, which is similar to the way in which we used fuzzy numbers two articles ago to code such linguistic concepts as “about” and “near.” If we add the subtle distinction that the membership scores may mean “cannot be near” or “can be around” a certain value, we’re stepping into the realm of Possibility Theory, which has important uses in fuzzy logic.[i]
…………Approximate reasoning and related concepts are more relevant to topics like expert systems that are beyond the purview of this series, but Possibility Theory can serve as a useful springboard into Evidence Theory, which is useful in developing programs of uncertainty management. Possibility distributions are in one sense a more restricted brand of probability distributions, while also acting as more restrictive versions of Evidence Theory measures; it may therefore be easier to use them as a bridge from one relatively familiar topic to a lesser-known one. I originally thought the topic would be quite difficult to grasp, but it’s actually a good deal easier than stochastics. Perhaps the most difficult aspect is that possibility distributions can be modeled using alpha cuts (α-cuts), a method of partitioning fuzzy sets that will prove useful in the next two articles.

From ‘Can’ and ‘Must’ to Surprise

                In fact, I’ll lighten the load further by dispensing with many of the details of Possibility Theory, since its simplicity can quickly give way to complexity, same as with any other fuzzy set topic. For example, stochastic concepts like conditional and marginal probabilities have their counterparts in Possibility Theory, all of which are too far afield for our purposes. For those who need the corresponding formulas and don’t mind wading through the thick math, I recommend consulting the seventh chapter of my favorite resource, George J. Klir and Bo Yuan’s Fuzzy Sets and Fuzzy Logic: Theory and Applications. I’m not even going to get into a discussion of how possibility scores are assigned; for the sake of argument, let’s assume any figures used in my examples are derived from subjective ratings by end users. The important thing to keep in mind is that we need two numbers to specify a possibility distribution, not just the single probability figure used in stochastics. One of these is known as the Possibility measure and the other as a measure of Necessity, which equals one minus the Possibility of the complementary event.
…………The two measures can be combined by adding them together and subtracting one, but the fact that this results in a non-standard range of -1 to 1 limits its usefulness.[ii] The simplest way to model this relationship is to use a bit column, in conjunction with the float, numeric or decimal columns normally used to represent fuzzy sets on a continuous scale between 0 and 1.[iii] The tricky thing is that an event must occur when Necessity equals 1, whereas a Possibility score of 0 means that it cannot; on the other hand, a Possibility score of 1 does not imply certainty, only that we would register no surprise at all if the event did occur. Apparently this is analogous to a measure of “surprise” developed in the mid-20th Century by economist G. L. S. Shackle,[iv] which has since been further developed by such household names in the fuzzy set field as Henri Prade and Ronald R. Yager.[v] As Lotfi A. Zadeh, the father of fuzzy set theory, explains it:

                “Consider a numerical age, say u = 28, whose grade of membership in the fuzzy set ‘young’ is approximately 0.7. First we interpret 0.7 as the degree of compatibility of 28 with the concept labelled young. Then we postulate that the proposition ‘Peter is young’ converts the meaning of 0.7 from the degree of compatibility of 28 with young to the degree of possibility that Peter is 28 given the proposition ‘Peter is young.’ In short, the compatibility of a value of u given ‘Peter is young.’”[vi]

…………This lack of symmetry is comparable to the way possibilities and probabilities differ. A Necessity measure of 1 leads inevitably to a probability score of 1, since what must happen is entirely probable; conversely, a Possibility measure of 0 leads to a probability score of 0, since what cannot happen is entirely improbable. Apart from these extremes, however, the two theories diverge. A Necessity or Possibility score of 0.5 has no effect on the probability, since whether or not a thing is logically conceivable is not equivalent to whether it is likely to happen; it is entirely possible that we may win the lottery tomorrow, but I wouldn’t bet on it. This is the core difference between the two theories: one expresses confidence in our information about whether a thing can happen, while the other reflects confidence in information about whether it will.
…………Because of this relationship, a possibility distribution acts as a cap on the associated probability distribution; this has many mathematical consequences[vii], the most important of which is that the two distribution types intersect at their minimum and maximum values. This in turn leads to the interesting property that possibility scores do not have to sum to 1 across a set of records, unlike probabilities; the only restriction is that the maximum value per record is 1.[viii] It also means that to assess whether or not we’ve reached a certain threshold of possibility values, all of the records with scores at or above the threshold must be taken into account. In other words, if we want to know if an event has a possibility of 0.3, we must examine all of the records with scores of 0.3 or higher to come to a verdict. Every record in a set will qualify for the lowest partition, where a possibility score of 0 is all it takes to qualify, but the number of records continually shrinks as we move up the dataset towards the perfect score of 1.
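…………To make that nesting concrete before we get to the real sample data, consider the minimal sketch below, which uses four hypothetical possibility scores invented purely for illustration. Each threshold’s count includes every record whose score is at or above it, so the tallies can only shrink as the threshold rises.

-- four hypothetical possibility scores, purely for illustration
DECLARE @PossibilityDemo table (EventID int, PossibilityScore decimal(5,2))
INSERT INTO @PossibilityDemo VALUES (1, 0.2), (2, 0.3), (3, 0.7), (4, 1)

-- each record is counted in every partition at or below its own score, hence the nesting
SELECT T1.PossibilityScore AS AlphaCutThreshold,
       (SELECT Count(*) FROM @PossibilityDemo AS T2
        WHERE T2.PossibilityScore >= T1.PossibilityScore) AS RecordsAtOrAboveThreshold
FROM @PossibilityDemo AS T1
ORDER BY T1.PossibilityScore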

Nested Sets and α-cuts

                This creates a nested set of evidence in which records can belong to multiple partitions, which can be easily implemented in T-SQL despite the fact that it calls for thinking about sets in unusual ways. We’re doing something uncommon here by cutting a set up hierarchically, so that a row belongs to more and more sets as we approach the maximum value of the membership function, rather than a single subset as we see in most relational joins. Klir and Yuan include a couple of handy illustrations which could get across the meaning of nested sets of evidence in a heartbeat, but I haven’t had a chance to seek permission to reprint them and don’t have the ability to draw my own.[ix] It turns out that the fuzzy set partitioning method known as α-cuts is an ideal tool for implementing these relationships[x] (not to mention many others that are beyond the scope of this series, like fuzzy equivalence relations[xi]). In plain English, this means that we have to use >= comparison operators to chop up a dataset into nested subsets, or > operators in the case of strong α-cuts.
…………I’m trying to keep the jargon to a minimum, but since the terms “cutworthy” and “strong cutworthy” occur frequently in the literature, it may be helpful to know that they refer to mathematical properties of fuzzy sets which are preserved in their α-cuts. [xii] Another important property is reconstructibility, which means that a fuzzy set can be rebuilt from its partitions. The manner in which possibility distributions establish maximum values for their associated probability distributions is essentially one and the same as the min/max types of unions and intersections we dealt with in previous articles, while the possibilities themselves are defined by their α-cuts.[xiii]
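…………As a quick illustration of the difference, the snippet below counts an ordinary α-cut and a strong α-cut at the same threshold against the @ZScoreTable variable populated in Figure 1 below; the 0.5 threshold is an arbitrary choice for demonstration purposes, and the two counts will only differ when records sit exactly on the boundary.

-- an ordinary α-cut uses >=, a strong α-cut uses >; the 0.5 threshold is arbitrary
SELECT Count(CASE WHEN MembershipScore >= 0.5 THEN 1 END) AS OrdinaryAlphaCutCount,
       Count(CASE WHEN MembershipScore > 0.5 THEN 1 END) AS StrongAlphaCutCount
FROM @ZScoreTable
WHERE MembershipScore IS NOT NULL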
…………The first SELECT statement in Figure 1 illustrates how a simple GROUP BY and SUM with a ROWS UNBOUNDED PRECEDING clause can be used to partition a SQL Server table in this unconventional manner. I also have an alternate version of these SELECTs in which partitioning is done by deciles (or any other arbitrary percentile value) rather than DISTINCT MembershipScores, which I omitted to keep things simple; if anyone needs it though, I’d be happy to post it. As usual, the sample data comes from a dataset on the Duchennes form of muscular dystrophy I downloaded from Vanderbilt University’s Department of Biostatistics a few tutorial series ago, which now resides in a sham DataMiningProjects database. The code from the beginning to the UPDATE statement is basically identical to the T-SQL samples I’ve posted throughout this series, which always begins with plugging the results of the aforementioned Z-Score procedure into a table variable. The GroupRank column is only included because it was part of the original procedure and can’t be omitted from the INSERT EXEC, but it can be safely ignored. The @Rescaling variables and the ReversedZScore column are then used to adjust the Z-Scores to the 0 to 1 range used in almost all fuzzy sets. There are only 202 records in the DuchennesTable where LactateDehydrogenase is NOT NULL, which is exactly equal to the count of values in Figure 2 where the MembershipScore is zero. The counts for each α-cut continually decline after that, till they reach the perfect score of 1, which is equivalent to the Height measure mentioned in last week’s article on fuzzy stats. I left out the middle values for the sake of brevity.

Figure 1: An Example of α-Cut Partitioning

DECLARE @RescalingMax decimal(38,6), @RescalingMin decimal(38,6), @RescalingRange decimal(38,6)
DECLARE  @ZScoreTable table
(PrimaryKey sql_variant,
Value decimal(38,6),
ZScore decimal(38,6),
ReversedZScore as CAST(1 as decimal(38,6)) - ABS(ZScore),
MembershipScore decimal(38,6),
GroupRank bigint
) 

INSERT INTO @ZScoreTable
(PrimaryKey, Value, ZScore, GroupRank)
EXEC   Calculations.ZScoreSP
              @DatabaseName = N'DataMiningProjects',
              @SchemaName = N'Health',
              @TableName = N'DuchennesTable',
              @ColumnName = N'LactateDehydrogenase',
              @PrimaryKeyName = N'ID',
              @DecimalPrecision = '38,32',
              @OrderByCode = 8

-- RESCALING
SELECT @RescalingMax = Max(ReversedZScore), @RescalingMin= Min(ReversedZScore) FROM @ZScoreTable
SELECT @RescalingRange = @RescalingMax - @RescalingMin

UPDATE @ZScoreTable
SET MembershipScore = (ReversedZScore - @RescalingMin) / @RescalingRange

-- ALPHA CUTS BY DISTINCT VALUES
-- =======================================

SELECT MembershipScore, SUM(DistinctCount) OVER (ORDER BY MembershipScore
DESC ROWS UNBOUNDED PRECEDING) AS AlphaCutCount
FROM (SELECT MembershipScore, Count(*) AS DistinctCount
       FROM @ZScoreTable
       WHERE MembershipScore IS NOT NULL
       GROUP BY MembershipScore) AS T1 

-- MEASURE OF SURPRISE
-- =======================================
SELECT ID, LactateDehydrogenase, NecessityMeasure, PossibilityScore, 1 - PossibilityScore AS SimpleMeasureOfSurprise
FROM Health.DuchennesTable
WHERE LactateDehydrogenase IS NOT NULL

 

Figure 2: Sample α-Cut Values from the Beginning and End of the Duchennes Dataset
alpha-cuts-result-1

alpha-cuts-result-2

Figure 3: Possibility Scores and the Measurement of Surprise
possibility-theory

…………The second SELECT merely returns some fake PossibilityScore values I randomly generated and tacked onto the DuchennesTable, with a simple inverse calculation to illustrate the most basic measure of Surprise.[xiv] Authors like Prade and Yager have extended the measure to address more sophisticated use cases, but Figure 3 is sufficient to get the point across for our purposes. The interpretation of any Surprise measure is straightforward: the higher the value, the greater our bewilderment will be if the associated event occurs. In this context, the Surprise would be attached to the possibility of observing the corresponding LactateDehydrogenase value; of course, these are actual values taken from a muscular dystrophy study in the 1980s, so if we weren’t using this for practice purposes we’d have to assign Necessity values of 1. These measurements of qualities like Surprise are of course not perfect, but they do allow us to attach some sort of ballpark figure to our expectations. As we shall see in the next two articles, one of the primary uses of fuzzy sets is to measure uncertainty, which can be valuable even when those measures are themselves uncertain. Two articles from now we’ll see how possibility theory is useful not merely in measuring surprise or in deriving interval-valued probabilities[xv], but also as a bridge to Dempster-Shafer Evidence Theory, which is useful in reckoning subtypes of uncertainty like Strife, Discord and Conflict. In the next installment, I’ll explain how both possibility distributions and α-cuts can measure nonspecificity, which is one of several types of uncertainty we can quantify with the aid of fuzzy sets.

[i] p. 200, Klir, George J. and Yuan, Bo, 1995, Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall: Upper Saddle River, N.J.

[ii] IBID., p. 198.

[iii] I got this idea from the Wikipedia article “Possibility Theory” at http://en.wikipedia.org/wiki/Possibility_theory.

[iv] IBID.

[v] Prade, Henri and Yager, Ronald R., 1994, “Estimations of Expectedness and Potential Surprise in Possibility Theory,” pp. 417-428 in International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, December 1994. Vol. 2, No. 4. Available online at the National Aeronautics and Space Administration (NASA) web address http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19930020329.pdf

[vi] Posted by Kornelia Brutoczki July 4, 2001 at  the Fuzzy Logic Home Page address http://mazsola.iit.uni-miskolc.hu/DATA/diploma/brutoczki_kornelia/fu_gz_02.html. The original source is not given.

[vii] pp. 206-207, Klir and Yuan.

[viii] IBID., p. 204.

[ix] IBID., pp. 24, 195.

[x] IBID., pp. 19-21, 35.

[xi] IBID., p. 133.

[xii] IBID., p. 23, 25, 36.

[xiii] IBID., pp. 187-188, 198.

[xiv] See Prade and Yager, 1994.

[xv] p. 205, Klir and Yuan.

 

Implementing Fuzzy Sets in SQL Server, Part 7: The Significance of Fuzzy Stats

By Steve Bolton

…………In the world of fuzzy sets and imprecision modeling, the concept of cardinality takes on new shades of meaning that are not applicable to ordinary “crisp” sets, i.e. those without membership grades. In the last article in this series of amateur self-tutorials, I mentioned one type of “fuzzy cardinality,” based on triangular, trapezoidal and other fuzzy numbers that are quite useful in modeling many vague statements found in everyday speech. Of course, another means of expressing cardinality is through ordinary numbers that are defined by a single value, rather than intervals and other such fuzzy set types. This raises some interesting questions, because one of the logical implications of graded set membership is that records with a score of 0 shouldn’t be included in the count. By extension, the values that are non-zero should only be counted in proportion to the score assigned by their membership function; since these are almost always on a scale of 0 to 1, the count of a fuzzy set never exceeds its crisp count, but may be much lower. Moreover, since membership scores are represented by fractional values, we’d normally use SQL Server data types like float, decimal and numeric to represent them, rather than members of the int family as we would with ordinary counts.
…………This apparently gives rise to many different possible calculations for fuzzy counts, but the most common one in the literature is the sigma count, in which we simply add up all of the membership scores for an entire set. Another stat seen occasionally in the literature is Support, which is defined as a crisp count of all the non-zero members of a fuzzy set; it thus always results in an integer somewhere between the sigma count and the ordinary crisp count. The Height refers to the maximum membership score found in the set, the Bandwidth to the number of records with scores greater than 0.5 and the Core to the count of those with the maximum score of 1; these concepts might be useful in such applications as fuzzy clustering, but I see the sigma count used far more often in connection with today’s topic, fuzzy stats, which come into play whenever we want to calculate aggregates on fuzzy sets in platforms like SQL Server.[1]
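…………Before diving into the sample code, a tiny hypothetical example may help make these definitions concrete; the four membership scores below are invented purely for illustration and have nothing to do with the Duchennes data used later in this article.

-- four invented membership scores: 0, 0.2, 0.6 and 1
DECLARE @CountDemo table (MembershipScore decimal(5,2))
INSERT INTO @CountDemo VALUES (0), (0.2), (0.6), (1)

SELECT Count(*) AS CrispCount,                                              -- 4
       SUM(MembershipScore) AS SigmaCount,                                  -- 1.8
       SUM(CASE WHEN MembershipScore > 0 THEN 1 ELSE 0 END) AS Support,     -- 3
       SUM(CASE WHEN MembershipScore > 0.5 THEN 1 ELSE 0 END) AS Bandwidth, -- 2
       SUM(CASE WHEN MembershipScore = 1 THEN 1 ELSE 0 END) AS Core,        -- 1
       MAX(MembershipScore) AS Height                                       -- 1
FROM @CountDemo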

Partial Credit for Partial Set Membership

                The trick with this topic is “Think partial credit!” to borrow a phrase from University of Minnesota Prof. Glen Meeden.[2]  Even if we restrict ourselves to the sigma count definition of fuzzy counts, the concept carries many interesting implications for all of the statistics that are derived from it. Averages, standard deviations, variances and all of the more advanced statistics derived from them must be recalculated, given that they’re derived from fundamental measures like values and counts that no longer apply. The logic inherent in partial set membership demands this fundamental rethinking of basic statistics. The crisp versions of some of these stats are precalculated by SQL Server, so by switching to the fuzzy set versions we’ll incur some performance costs by computing them on the fly instead, with the aid of T-SQL aggregates and windowing functions. Thankfully, some of these fuzzy stats are worth the extra computations, because they can shed light on our data in unusual ways. Perhaps the most obvious example is the difference between the crisp count and sigma count, which might be used as an alternative measure of fuzziness in the place of fuzzy complements, which as we saw early on in this series, are normally ideal for that use case.
…………Figure 1 provides a simple example of how to code this possible alternate measure of imprecision, by subtracting the sigma count from the crisp count. I also demonstrate how easy it is to derive the bandwidth, support and core stats, even though these are only used infrequently. As usual, most of the initial code involves assigning the membership scores, by drafting the procedure I wrote for Outlier Detection with SQL Server, part 2.1: Z-Scores for double duty as my membership function. The calculations are performed on the Duchennes muscular dystrophy data I downloaded a few tutorial series ago from Vanderbilt University’s Department of Biostatistics, which now resides in a dummy DataMiningProjects database; afterwards, they’re stored in a @ZScoreTable table variable that can be operated on as needed. For the sake of consistency, I’ve stuck to the same format I’ve used throughout this series by using the three @Rescaling variables and ReversedZScore column to transform the Z-Scores into membership scores on the traditional 0 to 1 range.

New Means, Medians and Modes

                Once we’ve derived the sigma count from these grades, I then calculate the standard fuzzy mean, which may be the simplest, most intuitive form of a “fuzzy absolute center.”[3] Another alternate measure of centrality is of course the mode, which I’ve thrown in because it’s so easy to calculate; to derive the fuzzy version, we just have to multiply each value’s count by its membership grade. This is one of the few fuzzy stats where the value is not affected by its score. In Figure 2 we can see that both versions of the mode return the same value of 198, which is in keeping with the general rule that both modes and their fuzzy counterparts will only return actual crisp values from their datasets.  Since medians are dependent on ordering, I’ll take up that topic when I address the fuzzification of ranks in a wrap-up of the whole series.
…………Instead, I’ve incorporated a higher class of averages known as Generalized Means, which can be used to derive a whole family of means between the minimum and maximum values, including the fuzzy arithmetic mean mentioned above, along with the harmonic and geometric means.[4] We basically plug in an @AlphaParameter bounded between 0 and 1, which allows us to cover the whole range, in much the same fashion that the various T-norm and T-conorm parameters empowered us to derive myriad types of fuzzy intersections and unions in previous articles. Note that in Figure 2, we see that the parameter value I arbitrarily chose led to a far different value for the GeneralizedMean than the one derived for the ordinary FuzzyMean.

Figure 1: Sample Code for Fuzzy Counts and Means
DECLARE @RescalingMax decimal(38,6), @RescalingMin decimal(38,6), @RescalingRange decimal(38,6)

DECLARE  @ZScoreTable table
(PrimaryKey sql_variant,
Value decimal(38,6),
ZScore decimal(38,6),
ReversedZScore as CAST(1 as decimal(38,6)) - ABS(ZScore),
MembershipScore decimal(38,6),
GroupRank bigint
)

INSERT INTO @ZScoreTable
(PrimaryKey, Value, ZScore, GroupRank)
EXEC   Calculations.ZScoreSP
              @DatabaseName = N'DataMiningProjects',
              @SchemaName = N'Health',
              @TableName = N'DuchennesTable',
              @ColumnName = N'LactateDehydrogenase',
              @PrimaryKeyName = N'ID',
              @DecimalPrecision = '38,32',
              @OrderByCode = 8

 

-- RESCALING
SELECT @RescalingMax = Max(ReversedZScore), @RescalingMin= Min(ReversedZScore) FROM @ZScoreTable
SELECT @RescalingRange = @RescalingMax - @RescalingMin

UPDATE @ZScoreTable
SET MembershipScore = (ReversedZScore - @RescalingMin) / @RescalingRange

DECLARE @Count bigint, @SigmaCount float, @Support float, @Bandwidth float, @Core float,
@Mean float, @FuzzyMean float, @GeneralizedMean float, @Mode float, @FuzzyMode float

-- COUNTS
SELECT @SigmaCount = SUM(MembershipScore), @Count = Count(*)
FROM @ZScoreTable
WHERE ZScore IS NOT NULL

SELECT @Support =  Count(*)
FROM @ZScoreTable
WHERE ZScore IS NOT NULL AND MembershipScore > 0 

SELECT @Bandwidth =  Count(*)
FROM @ZScoreTable
WHERE ZScore IS NOT NULL AND MembershipScore > 0.5

SELECT @Core =  Count(*)
FROM @ZScoreTable
WHERE ZScore IS NOT NULL AND MembershipScore = 1

-- MODES
SELECT @Mode = Value
FROM (SELECT TOP 1 ValueCount, Value
       FROM (SELECT Count(*) AS ValueCount, Value
              FROM @ZScoreTable
              WHERE ZScore IS NOT NULL
              GROUP BY Value) AS T1
       ORDER BY ValueCount DESC) AS T2

 

SELECT @FuzzyMode = Value
FROM (SELECT TOP 1 ValueCount, Value
       FROM (SELECT Count(*) * MembershipScore AS ValueCount, Value
              FROM @ZScoreTable
              WHERE ZScore IS NOT NULL
              GROUP BY Value,MembershipScore) AS T1
       ORDER BY ValueCount DESC) AS T2 

-- AVERAGES
DECLARE @AlphaParameter float
SELECT @AlphaParameter = 0.3

SELECT @FuzzyMean = SUM(MembershipScore * Value) / @SigmaCount , @Mean  = Avg(Value)
FROM @ZScoreTable
WHERE ZScore IS NOT NULL

SELECT @GeneralizedMean = Power(SUM(Power(Value, @AlphaParameter)) / CAST(@SigmaCount AS float), 1 / @AlphaParameter)
FROM @ZScoreTable
WHERE ZScore IS NOT NULL

SELECT @Count AS RegularCount, @SigmaCount AS SigmaCount,
@Count - @SigmaCount AS AlternativeMeasureOfFuzziness,
@Support AS Support, @Bandwidth As Bandwidth, @Core as Core,
@Mode AS Mode, @FuzzyMode as FuzzyMode,
@Mean AS Mean, @FuzzyMean AS FuzzyMean, @GeneralizedMean AS GeneralizedMean

 

Figure 2: Sample Results from the Duchennes Table
fuzzy-counts-and-averages

…………Generalized means occupy a space in the set of norm operations in between T-norms and T-conorms, along with Ordered Weighted Averages.[5] Basically, each record in an OWA is multiplied by a weight, with the weights constrained to sum to one across the whole set, but the choice of weights is so broad that I won’t bother with them; I’ll merely point out that this obviously overlaps the topic of neural net weights, at least to anyone who has coded them before. I’ll also omit my sample code for Lambda Averages (i.e. λ-Averages), because it’s simply too long in comparison to its usefulness. This class of norm operations is derived from binary set relations, which means we first have to create a second table variable, fill it and adjust the scores, as we did in previous articles with T-norms and T-conorms. We’d then apply a CASE statement to select the MIN value of the @LambdaParameter and the outcome of the union, when both records were between 0 and the @LambdaParameter; take the MAX of the @LambdaParameter and the outcome of the fuzzy intersection when the outcomes were greater than the @LambdaParameter; then use the @LambdaParameter value in the ELSE statement.[6]
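…………For anyone who wants to experiment with OWAs anyway, the sketch below computes one over the MembershipScore column of the same @ZScoreTable populated in Figure 1. The linearly decreasing weights are strictly my own arbitrary choice for demonstration purposes; any other scheme would do, as long as the weights sum to one.

-- an Ordered Weighted Average with linearly decreasing weights w(i) = 2 * (N - i + 1) / (N * (N + 1)),
-- which are guaranteed to sum to 1 across all N records; the weighting scheme is an arbitrary assumption
;WITH OrderedScores AS
(SELECT MembershipScore,
        ROW_NUMBER() OVER (ORDER BY MembershipScore DESC) AS RowPosition,
        Count(*) OVER () AS TotalRows
FROM @ZScoreTable
WHERE ZScore IS NOT NULL)

SELECT SUM(MembershipScore * (CAST(2 * (TotalRows - RowPosition + 1) AS float) / (TotalRows * (TotalRows + 1)))) AS OrderedWeightedAverage
FROM OrderedScores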
…………As with fuzzy complements, unions and intersections, the applications are determined by the selection of appropriate parameter values.[7] One method of accomplishing this is of course parameter estimation.[8] A good starting point for fuzzy parameter estimation may be Seyed Mahmoud Taheri’s select bibliography of recent developments in fuzzy stats, which also lists many resources for extending fuzzy stats to standard statistical and data mining topics like Bayesian priors, fuzzy regression and hypothesis testing.[9] A couple of the sources also connect fuzzy sets to information theory, which I will also begin doing in my next tutorial.

Fuzzy Variance: A Fresh Take on a Staid Stat

                Taheri also mentions published research on fuzzy sets and maximum likelihood, which makes me wonder if there is also some connection to Fisher Information. The same is true of the different types of fuzzy variance, given that variance is interpreted in Fisher Information as a form of uncertainty. This may be a more worthwhile topic to cover than λ-Averages and OWAs since the formulas are less broad and have clearer applications. First of all, it makes intuitive sense to calculate variance differently on fuzzy sets, for precisely the same reasons as fuzzy means: the crisp version of the statistic is dependent on counts, which ought to be replaced with alternative measures like the sigma count when possible. If a record has zero membership, for example, its value shouldn’t count at all in the computation, because it’s no longer part of the set. It thus follows that a value with partial membership should only be taken into consideration in proportion to its score, just as with fuzzy means; it stands to reason that the same principles would apply if we went beyond the first and second statistical moments, mean and variance, to the third and fourth, skewness and kurtosis.
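…………As a purely speculative illustration of that last point, the snippet below extends the same membership-weighting logic used for the fuzzy variance in Figure 3 to a third central moment. The formula is my own extrapolation rather than anything I have seen in the literature, and it assumes the @ZScoreTable, @FuzzyMean and @SigmaCount values from Figure 1 are still in scope.

-- a speculative fuzzy third central moment, weighting each value by its membership grade in the
-- same manner as the fuzzy variance below; this is an extrapolation, not an established formula
SELECT SUM(Power((Value * MembershipScore) - @FuzzyMean, 3)) / @SigmaCount AS FuzzyThirdCentralMoment
FROM @ZScoreTable
WHERE ZScore IS NOT NULL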
…………This leads to some interesting questions over how we can interpret the differences in the crisp and fuzzy variances. Given that the difference between crisp and sigma counts reflects a measure of fuzziness – albeit not as clearly as fuzzy complements – perhaps we can interpret this as a measure of how dispersed the fuzziness is. This might come in quite handy in many data mining applications. I haven’t seen it used that way in the literature – but it’s good to keep in mind that my exposure to the whole topic of fuzzy sets is limited, given that I can’t afford the hefty fees for many of the academic journals, which render them inaccessible to me. Nor have I seen trapezoidal numbers combined with variance, but a construction like the CASE statement for TrapezoidalRangeOnTheCrispVariance in Figure 3 might be useful in expressing natural language slang about dispersion like, “all over the place.” The TrapezoidalRangeOnTheFuzzyVariance expresses the same concept, except that it represents a fuzzy number on a fuzzy set, rather than a fuzzy number on a crisp set; it thus amounts to saying, “this graded set is all over the place.” I set the range boundaries arbitrarily so that both would have fractional scores in Figure 4, which serves as a better illustration of partial membership in a fuzzy number. Using the square root and power techniques mentioned in the last article, we could add superlatives to it like “really” or “somewhat.” If we were using Z-Scores in the context of a normal distribution, we might set graded boundaries based on the “68–95–99.7 Rule” I covered in Goodness-of-Fit Testing with SQL Server, part 1: The Simplest Methods, which involves the number of expected records within the first, second and third standard deviations. I left out the more complicated case of the superlative in the sample code below, just to illustrate how a SQL Server user might write T-SQL for simple cases of these two fuzzy variance types:

Figure 3: Some Possible Measures of Fuzzy Variance
DECLARE @StDev float, @FuzzyStDev float, @Var float, @FuzzyVar float

SELECT @FuzzyVar = Sum(Power((Value * MembershipScore) - @FuzzyMean, 2)) / @SigmaCount, @Var
= Var(Value), @StDev = StDev(Value)
FROM @ZScoreTable
WHERE ZScore IS NOT NULL

SELECT @FuzzyStDev = Power(@FuzzyVar, 0.5)

DECLARE @LowerBound float, @UpperBound float
SELECT @LowerBound  = 4000, @UpperBound = 5000

 SELECT @StDev AS StDev, @FuzzyStDev AS FuzzyStDev,
@Var AS Var, @FuzzyVar AS FuzzyVar,
@Var - @FuzzyVar AS PossiblyTheVarianceOfTheFuzziness,
CASE WHEN @Var BETWEEN @LowerBound AND @UpperBound THEN 1
       WHEN @Var < @LowerBound THEN ((@Var - @LowerBound)) / @Var + 1
       WHEN @Var > @UpperBound THEN ((@UpperBound - @Var)) / @Var + 1
       ELSE NULL END AS TrapezoidalRangeOnTheCrispVariance,
CASE WHEN @FuzzyVar BETWEEN @LowerBound AND @UpperBound THEN 1
       WHEN @FuzzyVar < @LowerBound THEN ((@FuzzyVar - @LowerBound)) / @FuzzyVar + 1
       WHEN @FuzzyVar > @UpperBound THEN ((@UpperBound - @FuzzyVar)) / @FuzzyVar + 1
       ELSE NULL END AS TrapezoidalRangeOnTheFuzzyVariance

Figure 4: Fuzzy Variance Result for the LactateDehydrogenase Column
fuzzy-variance

…………Fuzzy variance may serve as a bridge to Fisher Information, a topic I want to cover in my long-delayed series, Information Measurement with SQL Server. Early on in this series we saw how fuzzy complements serve as one important measure of a different type of information, fuzziness, which quantifies the imprecision of a dataset in a different manner than variance. The difference between the sigma and crisp counts might serve the same purposes, although I’ve seen the various types of complements used more often for this purpose. One of the coolest things about fuzzy sets is that they give rise to several useful statistics that quantify different types of imprecision, which can be used to derive a program of “uncertainty management” for an organization. In the next installment we’ll see how we can use some of the fuzzy stats defined here to pin down a different brand of imprecision known as nonspecificity. This will involve discussion of the Hartley function and possibly Shannon’s Entropy, the latter of which is a fundamental concept in many data mining algorithms. Since entropy is among the foundations of information theory, this introduction to its applications in nonspecificity will serve as a bridge to my future Information Measurement series.

 

[1] pp. 25-28, Bonissone, Piero P., 1998, “Fuzzy Sets & Expert Systems in Computer Eng. (1).” Available online at http://homepages.rpi.edu/~bonisp/fuzzy-course/99/L1/mot-conc2.pdf. Bonissone’s material is reprinted at least in part from slides produced by artificial intelligence researchers Roger Jang and Enrique Ruspini.

[2] p. 5, Meeden, Glen, 2015, “Two Examples of the Use of Fuzzy Set Theory in Statistics,” published online at the University of Minnesota web address http://users.stat.umn.edu/~gmeeden/talks/fuzznov09.pdf

[3] p. 435, Klir, George J. and Yuan, Bo, 1995, Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall: Upper Saddle River, N.J. On this particular page, they’re extending the meaning of the term even further, to complex network topologies.

[4] IBID., p. 90.

[5] IBID., pp. 92-93.

[6] IBID., p. 94.

[7] IBID., p. 93.

[8] IBID., p. 94.

[9] p. 240, Taheri, Seyed Mahmoud, 2003, “Trends in Fuzzy Statistics,” pp. 239-257 in Austrian Journal of Statistics, Vol. 32, No. 3. Available online at Vienna University of Technology web address http://www.statistik.tuwien.ac.at/oezstat/ausg033/papers/taheri.pdf

Implementing Fuzzy Sets in SQL Server, Part 6: Fuzzy Numbers and Linguistic Modifiers

By Steve Bolton

…………I’ve written several amateur tutorial series on this blog in order to more quickly absorb difficult data mining, statistical and machine learning topics, while hopefully helping other SQL Server users avoid some of my inevitable mistakes. Since I don’t know what I’m talking about, I’m occasionally surprised at how useful some of the material turns out to be – particularly in the case of this week’s topic, fuzzy numbers, which I originally thought were a curiosity. In recent years I’ve made moderate progress in recapturing some of the advanced math skills I had as a kid, but some of the resources I’m consulting still wear me out with densely packed, arcane symbols; that was also true of George J. Klir and Bo Yuan’s Fuzzy Sets and Fuzzy Logic: Theory and Applications and several other sources I’ve used for this series on using T-SQL to implement fuzzy sets. Much of the literature on the subject is written by mathematicians who need thick equations and highly precise terminology to communicate with each other effectively, which can be taxing on non-specialists not accustomed to the jargon of the field.
…………That communication gap may be responsible for the enormous lag between the refinement of the theory and its adoption; this comes despite all of the empirical evidence that these data modeling techniques are insanely useful in solving certain classes of real-world problems, in a wide variety of industries. Adoption has also lagged in the relational database and data mining markets, in spite of the fact that set-based languages like T-SQL and Multidimensional Expressions (MDX) are ideal for implementing fuzzy sets. I’m trying to put my drop in the bucket to remedy that, but my lack of adequate equation translation skills almost caused me to skip over fuzzy numbers, which are actually one of the simplest and most useful aspects of fuzzy set theory.

Modeling Everyday Speech 

…………One rule of thumb I’ve learned along the way is that whenever confusion arises over fuzzy sets, it is best to return to basics and state the problem in terms of natural language. After all, the main overarching use case for fuzzy set theory is to model imprecision that can be expressed linguistically, particularly when it would be useful to attach continuous scales to ordinal categories. If we were using a Behavior-Driven Development (BDD) methodology, we might want to flag any qualifiers we encounter in user stories like “about,” “around,” “near,” “approximately,” “most,” “few” and the like as candidates for fuzzy numbers. Fuzzy numbers require several values and calculations of varying intensity in order to assign a range of graded values to a single target number – but since ordinary numbers require none of this extra computation, the obvious question is, why bother? It turns out that these particular instances of fuzzy sets are useful in modeling these types of natural language qualifiers, which express uncertainty about how close a value is to a definite target value.
…………“About half” is a clear and simple example from everyday speech. In fuzzy set parlance, this would be modeled as a “triangular number,” which is a fancy way of saying that we’d assign a perfect membership grade of 1 to values exactly equal to 0.5, with grades descending on either side of 0.5 in proportion to how far away the value is from that target. The term “triangular” is used because if we use a line chart to depict a membership function of this kind, it peaks at the target value and descends to 0 as the values decrease or increase away from it. Trapezoidal numbers are a mouthful, but really aren’t much more difficult; they basically have the same shape as triangular numbers, except that the peak is flattened in order to express an interval of some kind.[1] University of Minnesota Prof. Glen Meeden’s natural language example of a trapezoidal number is the best I’ve yet run across in the literature: “What is the average yearly snowfall in the Twin Cities? You might answer somewhere between 20 and 50 inches.”[2]
…………Functions that only increase or decrease are useful in capturing related distinctions, like “a large number of” or “a small number of.”[3] Although it is wise to obey a few restrictions that lead to certain useful mathematical properties – particularly the industry-standard requirement of a boundary between 0 and 1 – it is not necessary for the membership functions to be symmetric. In fact, assigning a lopsided peak to a triangular number can be useful in modeling statements like “almost all,” which would be closer to the right edge of a line graph than a function that implements the term “most.”[4] They can even be bell-shaped.[5]
…………At a higher level of sophistication, fuzzy numbers can be used to model terms like “very,” which fall under the rubric of linguistic hedges and fuzzy modifiers – including statements like “very true,” which can be useful in fuzzy logic.[6] Klir and Yuan suggest applying powers and square roots to model such distinctions, since they don’t follow a linear scale. For example, they say that if we use a score of 0.8 to model the term “John is young,” squaring it could model the phrase “very young” and taking a square root could be used for “fairly young.” This is because the result of the first is 0.64, which strongly modifies the original term, while the second returns 0.89, which modifies it slightly.[7] Of course, the exact boundaries of all of these terms have to be set in light of domain knowledge of some kind. “Half” is a definite term but the modifier “about” can mean different things to different people, which may require aggregating the viewpoints of users in some way, perhaps using Decision Theory methods that integrate seamlessly with fuzzy sets. Neural nets likewise work well together with fuzzy sets because they are highly useful for encoding unknown functions, which means they can be put to use to derive such boundaries if greater precision is required. For many use cases, an informal guesstimate may suffice.
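…………A minimal sketch of that power-and-root trick is easy to write in T-SQL; the 0.8 score below is simply Klir and Yuan’s example value, not something derived from real data.

-- hedging a membership score of 0.8 for 'John is young': squaring intensifies the term ('very young'),
-- while taking the square root dilutes it ('fairly young')
DECLARE @YoungScore float = 0.8
SELECT @YoungScore AS Young,
       Power(@YoungScore, 2) AS VeryYoung,    -- 0.64
       Power(@YoungScore, 0.5) AS FairlyYoung -- roughly 0.89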

Translating Fuzzy Numbers into T-SQL

…………My sample code below implements four types of fuzzy numbers, using quite simple and arbitrary criteria that are merely designed to illustrate the concepts. For the sake of consistency, I’m once again using the procedure I wrote for Outlier Detection with SQL Server, part 2.1: Z-Scores to derive my membership grades and storing the results in a table variable (which has an extraneous column named GroupRank that can be safely ignored). In this case we’re calculating Z-Scores on the LactateDehydrogenase column of the Duchennes muscular dystrophy dataset I’ve been using as practice data for the last several tutorial series, which I downloaded ages ago from Vanderbilt University’s Department of Biostatistics and converted to a SQL Server table in a dummy DataMiningProjects database. Some of the key differences from previous articles include the absence of the ReversedZScore column and associated @Rescaling variables. This is because we don’t need to perform rescaling of any kind, since we’re measuring nearness to a few Z-Score values in all four examples, not calculating a relative score on a range of 0 to 1. The MembershipScore column is missing for the same reason. In its place, we have four computed columns, two of which measure the closeness of each Z-Score in the dataset to a target value or range.
…………To put it simply, the triangular number assigns a grade to  the natural language statement, “around a Z-Score of 0.960526,” while the trapezoidal expresses the concept of “somewhere between 0.450526 and 1.360526.” The other two columns define increasing and decreasing functions that model how close the values are to either the top or bottom of the dataset, which can be interpreted as “few” or “most.” The hard-coded numbers in the UPDATE and the declarations above it are picked out of thin air merely to illustrate the point, not because they express any domain knowledge. There are probably more efficient ways of coding this, but the point is to get the concepts across.

Figure 1: Code for the Fuzzy Number Sample
DECLARE @ZScoreTable table
(PrimaryKey sql_variant,
Value decimal(38,6),
ZScore decimal(38,6),
TriangularNearnessScore decimal(38,6),
TrapezoidalNearnessScore decimal(38,6),
FewScore decimal(38,6),
MostScore decimal(38,6),
GroupRank bigint
)

INSERT INTO @ZScoreTable
(PrimaryKey, Value, ZScore, GroupRank)
EXEC   Calculations.ZScoreSP
              @DatabaseName = N'DataMiningProjects',
              @SchemaName = N'Health',
              @TableName = N'DuchennesTable',
              @ColumnName = N'LactateDehydrogenase',
              @PrimaryKeyName = N'ID',
              @DecimalPrecision = '38,32',
              @OrderByCode = 8

DECLARE  @ComparisonPoint float = 0.960526
DECLARE @LowerBound float = @ComparisonPoint - 0.5, -- we could of course make it lopsided if that would model our data better
@UpperBound float = @ComparisonPoint + 0.5

UPDATE @ZScoreTable
SET TriangularNearnessScore =     CASE WHEN ZScore NOT BETWEEN @LowerBound and @UpperBound THEN 0
       WHEN ZScore = @ComparisonPoint THEN 1
       WHEN ZScore < @ComparisonPoint THEN (ZScore - @ComparisonPoint) + 1
       WHEN ZScore > @ComparisonPoint THEN (@ComparisonPoint - ZScore) + 1
       ELSE NULL END,
       TrapezoidalNearnessScore = CASE WHEN ZScore NOT BETWEEN @LowerBound and @UpperBound THEN 0
       WHEN ZScore BETWEEN @LowerBound + 0.1 and @UpperBound - 0.1 THEN 1
       WHEN ZScore < @LowerBound + 0.1 THEN (ZScore - @ComparisonPoint) + 1
       WHEN ZScore >  @UpperBound - 0.1 THEN (@ComparisonPoint - ZScore) + 1  ELSE NULL END,
       FewScore = CASE WHEN ZScore BETWEEN 0.2 AND 0.7 THEN 1  ELSE 1 - (ZScore * 0.3) END,
       MostScore = CASE WHEN ZScore BETWEEN 1 AND 2 THEN 1  ELSE ZScore / 1.2 END

SELECT PrimaryKey, ZScore, TriangularNearnessScore, TrapezoidalNearnessScore, FewScore, MostScore
FROM @ZScoreTable
WHERE ZScore BETWEEN 0.2 AND 2 AND TriangularNearnessScore != 0 AND TrapezoidalNearnessScore != 0
ORDER BY ZScore DESC

Figure 2: Sample Results from the Duchennes Practice Dataset
fuzzy-number-results

…………Note how the TriangularNearnessScore peaks at a single value, while the TrapezoidalNearnessScore is contained within a particular range. The FewScore and MostScore values peak at opposite ends of the dataset. Of course, a picture is worth a thousand words: if the values above aren’t clear, then the Reporting Services line graphs below ought to clear up any confusion. The TriangularNearnessScore doesn’t precisely follow a triangular shape, but it does come to single peak, which is good enough. The trapezoidal example reaches the same peak, but encompasses a range of values represented in the flat line. The FewScore and MostScore also have flat peaks, but these occur at the far edges of the membership grades. The shapes aren’t as neat as those in the literature, in part because I’m using real-world data from the Duchennes dataset together with some arbitrary range values, but I’m sure that readers will get the gist of it.

Figure 3: Reporting Services Line Graphs for 4 Fuzzy Number Samples
fuzzy-number-report
…………Asking questions like “what is about half of A plus almost all of B” is probably a rare use case, one that is more likely to come up in data mining than in relational situations. If the need for this kind of comparison does arise, be aware that at least some of the math has been worked out, so there’s no need to reinvent the entire wheel. I say “some” because, at least at the time Klir and Yuan wrote, mathematicians were still struggling with some of the strange properties and enigmatic logic associated with these kinds of comparisons.[8] Just imagine several of the trapezoids and triangles above overlaid on each other and you can see how complicated the topics of fuzzy arithmetic, fuzzy set relations and fuzzy matrix math can quickly become. Some basic procedures for solving them are available[9], as well as neural net techniques for solving fuzzy equations[10], but they lead to certain logical difficulties that I believe still aren’t fully understood, such as the fact that many of the approximate solutions may not be unique.[11] Thankfully, we don’t encounter statements like “between 3 and 5 of A minus a little of B” often in ordinary speech, so I imagine that SQL Server users are unlikely to encounter them. In contrast, qualifiers like “around” or “near” are so common that I guarantee these fuzzy numbers will prove valuable to many users in the long run.
…………Instead of taking fuzzy numbers in directions of this kind that lack real-world applications, I’ll use them as a stepping stone towards quantifying imprecision, which can be helpful in programs of uncertainty management. Many of the linguistic qualifiers mentioned here are actually instances of what is known as “fuzzy cardinality,” which enables modeling of phrases like “about a quarter” or “near.”[12] In the next article, I’ll delve into the realm of fuzzy statistics, where the implications of relative membership in a set lead automatically to a range of different types of cardinality, not just the single type of Count used in T-SQL. I have yet to see this done in the literature (I simply can’t afford access to most of the research published on certain advanced topics like fuzzy stats), but I’ll provide an example of how trapezoidal numbers might be implemented in order to create fuzzy analogues of standard deviation and variance. Given that the ordinary “crisp” versions of these aggregates are determined in part by counts, the fuzzification of counts plays into that as well. It may be useful to probe for connections to Fisher’s Information, a metric I hope to code for a long-delayed series titled Information Measurement with SQL Server, given that it apparently uses variance to model uncertainty. Using increasing and decreasing functions to model statements like “almost all” as we have done here can be seen as fuzzy instances of MIN and MAX aggregates, which essentially express the same sentiment as “near the bottom” or “near the top.” These can also be used to create new methods of fuzzy outlier detection, as I’ve essentially been doing throughout this series by using Z-Scores in my sample code; fuzzy set grades don’t have anything to do with stochastics or outlier detection unless such meanings are deliberately assigned.
…………In the next article I will probably continue to use Z-Scores in my sample code mainly for the sake of consistency and to reuse old, familiar concepts, but also to kill two birds with one stone and investigate possible uses in outlier detection. As we shall see, membership functions can also be interpreted in the light of Evidence Theory, in which each grade indicates the levels of credibility and truthfulness of a statement. In such cases, we’re speaking of fuzzy measures[13], which are also useful in partitioning and quantifying different types of uncertainty. As we already saw in Implementing Fuzzy Sets in SQL Server, Part 2: Measuring Imprecision with Fuzzy Complements, fuzzy complements can be used to measure one type of uncertainty, in quantifying just how imprecise the boundaries of fuzzy sets are; in essence, they become measures of fuzziness. There are other types of uncertainty, however, which require entirely different modeling techniques, which is where fuzzy stats and measures come in handy.

 

 

[1] Most fuzzy set references discuss triangulars and trapezoidals. One source I found to be helpful was pp. 12-13, Alavala, Chennakesava R., 2008, Fuzzy Logic and Neural Networks: Basic Concepts and Applications. New Age International Pvt. Ltd.: New Delhi.

[2] p. 3, Meeden, Glen, 2015, “Two Examples of the Use of Fuzzy Set Theory in Statistics,” published online at the University of Minnesota web address http://users.stat.umn.edu/~gmeeden/talks/fuzznov09.pdf

[3] pp. 96-98, Klir, George J. and Yuan, Bo, 1995, Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall: Upper Saddle River, N.J.

[4] IBID., p. 228.

[5] IBID., p. 99.

[6] IBID., p. 222.

[7] IBID., p. 230.

[8] IBID., p. 115.

[9] IBID., pp. 160-162.

[10] IBID., pp. 171-173.

[11] IBID., pp. 153-157, 166-167.

[12] IBID., p. 98.

[13] IBID., p. 177.

Implementing Fuzzy Sets in SQL Server, Part 5: The Mystery of the Missing Left Join

By Steve Bolton

…………Information on set operations like complements, intersections and unions is plentiful in the literature on fuzzy sets, which made the last three articles in this series of amateur self-tutorials easier to write in a certain sense. These topics are far more complex than with ordinary “crisp” sets because there are so many different methods for calculating membership grades for the resultsets, each of which can be applied to different use cases in order to model imprecision. This can be especially useful in Behavior-Driven Development (BDD) and user stories, as well as in uncertainty management programs of all kinds, including reducing the insecurity involved in software engineering and data modeling. There may be extra steps involved in calculating these fuzzy set relations, but the topic is at least well-studied and the sources of information for the formulas are readily available. Strangely, that is not true for some of the set operations that SQL Server users encounter more often, like LEFT JOINs, INNER JOINs, OUTER JOINs, RIGHT JOINs and Cartesian Products. There are only 41 hits on Google for the terms “left join” “relational” “fuzzy set” combined, one of which is an off-hand reference from my own blog. I haven’t seen these topics addressed much in the literature, which is rich in information on complements, intersections and unions, but not these common relational joins. Like many other puzzling aspects of fuzzy set theory, the reason behind this is readily apparent only after we get back to basics and frame the questions we’re asking of the database in natural language terms.
…………This problem is easily reducible to a single enigma, which nevertheless requires a lengthy explanation. As we saw in the last two tutorials, INNER and OUTER JOIN statements are useful in deriving fuzzy intersections and unions respectively. We really can’t use the standard T-SQL INTERSECT and UNION operators for these purposes, since we need to retrieve membership function values from both sides in order to calculate membership in the resultset. For all intents and purposes, these are the principal use cases for fuzzy INNER and OUTER JOIN operations. Although I won’t rule out the possibility of some oddball set operation that requires the fuzzy versions of these statements without qualifying as a fuzzy intersection or union, such use cases are probably pretty rare. CROSS JOIN operations are easier to explain once we realize that all fuzzy binary relations are performed on the complete crisp Cartesian Product to derive a fuzzy subset of some kind. To perform a fuzzy Cartesian Product, we would have to take the membership grades of both sets and implement a mathematical operation of some type, just like with fuzzy intersections and unions. The CROSS JOIN operator isn’t used as often in T-SQL as its kin, in large part because retrieving every possible combination of records from both sides can tax the server beyond belief, but cross products aren’t mentioned much in the fuzzy set literature for an additional reason: a wide range of fuzzy subsets of it also qualify as fuzzy intersections and unions, depending on whether they’re implemented with classes of functions known as T-norms and T-conorms. The domain of possible resultsets is no wider than the crisp CROSS JOIN, but fuzzy intersections and unions take up a far wider space within it than their crisp counterparts.
…………It is certainly possible to define fuzzy Cartesian Product operators that are neither T-norms nor T-conorms, but the use cases for doing so are neither clear nor very wide. In order to derive such a measure, we’d probably have to take the next step up and select from the class of “norm operations” that encompass all types of possible fuzzy aggregates, including T-norms and T-conorms as special cases.[i] Unfortunately, the concept is so general that it provides even less guidance for deriving new types of fuzzy Cartesian Products for particular use cases than we have for matching T-norms and T-conorms to the right problems. What the CROSS JOIN, INNER JOIN and OUTER JOIN all have in common, however, is that both participating sets are given equal weight in determining the results, which is also true for their fuzzy counterparts. As I always caution, I’m not an expert in these matters and am only writing on the topic in order to introduce myself to it, while hopefully helping other SQL Server users to avoid my inevitable mistakes. At the risk of committing another one, I’ll take a stab at solving the Mystery of the Missing Join, which may boil down to the fact that the LEFT JOIN statement gives preferential treatment to one side.

Comparing Membership Functions in LEFT JOINs

                One of the most common scenarios for a LEFT JOIN in ordinary crisp sets is to join a dependent table to a parent via a foreign key. When the relationship works in reverse, we’re speaking of a RIGHT JOIN – which is just the converse of a LEFT JOIN, so I’ll dispense with any discussion of it. The join results typically have repeated values for the parent table, since they often match multiple rows in the dependent table. Whenever we work with fuzzy sets, spelling out what we’re looking for in explicit natural language terms often simplifies what may seem like really complex problems. In the case of crisp LEFT JOINs, we’re asking the database, “Give me all of the records in Set1, plus some additional information from Set2 that we sometimes do without.” Here’s the key problem we face: fuzzy set theory only matters in the context of a LEFT JOIN if those additional columns of information are included or excluded based on some membership function, which can be mathematically compared to the membership function of the parent table. In other words, the math involved in a fuzzy complement, union or intersection is needed to determine whether or not the results are included in the new set, but with a LEFT JOIN, oftentimes we’re just supplying more information in a Master-Detail situation – in which case, the single membership function of the parent ought to be sufficient to determine membership in the resultset. With intersections and unions, both sides of the binary relation are of equal importance, but in a LEFT JOIN, the left side takes precedence, with the right just being filler material that doesn’t affect the membership values of the parent. Perhaps the only time we might take a membership function on the dependent table into account is when it measures a type of uncertainty comparable to the parent, plus we specifically want to grade how relevant the additional detail is to the parent.
…………For example, let’s say we’re performing a fuzzy LEFT JOIN from a parent CustomerTable to a child AddressTable. If the first has a function that ranks how Tall a person is and the second carries a grade for membership in the set of Rural places, then we can retrieve the extra data in order to answer questions like, “Give me the set of Short people who live in Suburbs” without performing any math operations on either membership function. Leaving aside the much more advanced topic of statistical independence, how Tall or Short a customer is has no bearing on how Rural or Urban their hometown is, so the two memberships don’t affect each other. We can just do a regular LEFT JOIN in such situations and return the two grades separately.
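…………A bare-bones sketch of that scenario might look something like the T-SQL below, assuming the hypothetical CustomerTable and AddressTable carry membership grade columns I’ve invented called TallGrade and RuralGrade, both scored between 0 and 1; the cutoffs in the WHERE clause are arbitrary placeholders for whatever definitions of “Short” and “Suburbs” fit the application. Note that no math is performed on either grade – they’re merely filtered and returned side by side.

-- return the two unrelated grades separately; no fuzzy set operation is applied to combine them
SELECT C.CustomerID, C.TallGrade, A.RuralGrade
FROM CustomerTable AS C
       LEFT JOIN AddressTable AS A
       ON C.CustomerID = A.CustomerID
WHERE C.TallGrade <= 0.3                                 -- “Short” customers, by some arbitrary cutoff
       AND ISNULL(A.RuralGrade, 0) BETWEEN 0.25 AND 0.75 -- “Suburbs” as a middling Rural grade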
…………The question gets murkier when we consider dependent tables that measure the same quality as the parent, or nearly so. For example, let’s pretend we’re operating a database for a hardware store that has a CeilingHeight membership function defined on the AddressTable, also split into categories like Tall, Short, etc. (for the sake of argument, let’s say the customers volunteered this information about their homes in a survey or whatever). We can ask the database a simple question like, “Give me the set of Short people who live in homes with Tall ceilings” without doing any extra math, but there may be use cases where we can compare these two types of information to draw inferences. An example might be asking a question like, “Give me the set of people who belong to the category of Mismatched Customers,” which can in turn be defined as Short people with Tall ceilings and vice-versa. This could be calculated through LEFT JOINs with WHERE clauses on the Tall and Short values for the Customers, after which some computation can be applied to derive a new category based on some comparison of that membership function to the CeilingHeight column of the child.
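…………Here is one tentative way to code that, again using my invented TallGrade column on CustomerTable plus a hypothetical CeilingHeightGrade on AddressTable; “Short” is taken as the standard complement (1 minus the Tall grade) and the two mismatch conditions are combined through the standard T-norm (minimum) and T-conorm (maximum), although any of the other functions covered in the last two articles could be substituted.

-- grade each customer’s membership in the new “Mismatched Customers” category
SELECT T.CustomerID,
       CASE WHEN T.ShortWithTallCeiling >= T.TallWithShortCeiling
            THEN T.ShortWithTallCeiling ELSE T.TallWithShortCeiling END AS MismatchedGrade
FROM (
       SELECT C.CustomerID,
              -- fuzzy intersection of Short customers (1 - TallGrade) with Tall ceilings
              CASE WHEN (1 - C.TallGrade) <= ISNULL(A.CeilingHeightGrade, 0)
                   THEN (1 - C.TallGrade) ELSE ISNULL(A.CeilingHeightGrade, 0) END AS ShortWithTallCeiling,
              -- fuzzy intersection of Tall customers with Short ceilings (1 - CeilingHeightGrade)
              CASE WHEN C.TallGrade <= ISNULL(1 - A.CeilingHeightGrade, 0)
                   THEN C.TallGrade ELSE ISNULL(1 - A.CeilingHeightGrade, 0) END AS TallWithShortCeiling
       FROM CustomerTable AS C
              LEFT JOIN AddressTable AS A
              ON C.CustomerID = A.CustomerID
) AS T
-- customers with no matching address row simply receive a MismatchedGrade of 0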

An Advanced Exception: LEFT JOINs Creating New Categories of Comparison

                Keep in mind that with fuzzy intersections and unions, we’re essentially adding a new layer of fuzzy membership grades to the operation that joins the two sets; if we extend the same principle to fuzzy LEFT JOINs, then we’re likewise creating a new type of fuzzy set by defining the join in an imprecise way. I’m still a novice at all of this, but would suggest that it might be possible to distinguish this type of fuzzy LEFT JOIN from one that doesn’t require any extra math by looking at the results: if they define a new ordinal category that is graded on a continuous scale, then we’ve created a new type of fuzzy set from the two membership functions. If no such category is being created, then we can probably just return the child’s membership function values as additional details when they’re germane to our query, without taking any extra steps. The question may be complicated by the fact that both tables may have multiple membership functions defined on them, some of which define sets that may not be relevant at all to the query at hand, while others can have various shades of meaning that may overlap. For example, a ProductTable might have different membership functions for such disparate characteristics as Color, Width and Availability, while the dependent table may even have stochastic labels like “Gaussian” or “Gamma,” if we needed to measure membership of a column in some kind of probability distribution.
…………Either way, we have to retrieve all the values we would for a crisp LEFT JOIN, but would only perform additional math in certain use cases that seem to be much narrower than those for fuzzy unions and intersections. In those instances, we’re essentially extending the membership grades to a new level, just as the Sugeno and Yager Complements add another layer of membership grades on top of the existing membership function values. For that reason, I strongly suspect that we would have to select a fuzzy join function that most closely matches the type of imprecision we’re trying to model. As we saw in the last couple of articles, T-norms and T-conorms are the ideal mathematical structures for expressing fuzzy intersections and unions, but selecting the ones with the right mix of mathematical properties and output histograms is often a tough call. Research has been ongoing in the field for decades to narrow down those use cases, but the paucity of information on LEFT JOINs leaves us with an even shakier starting point than that. This is the weakest article in this series, in the sense that I can offer only tentative sketches of T-SQL sample code for a rare (or perhaps non-existent) set of use cases, which would also require advanced research that the professionals apparently haven’t deemed worth investigating much; nevertheless, anyone coming from the realm of crisp SQL database servers is going to be immediately struck by the absence of LEFT JOINs from the fuzzy set literature and could use a tentative answer to this nagging question. Now that we’re over this hump, in the next installment I’ll explain how to use the most common fuzzy quantifiers and measures, like triangular and trapezoidal numbers. These can be incredibly useful in the right circumstances, including making it child’s play to model common linguistic phrases like “about half” or “most” on both crisp and fuzzy sets. It is here that many SQL Server users will probably find immediate applications for the material covered in this series, since this is the kind of modeling problem that the untapped potential of fuzzy set theory can handle better than any other alternative.

[i] p. 93, Klir, George J. and Yuan, Bo, 1995, Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall: Upper Saddle River, N.J.