Implementing Fuzzy Sets in SQL Server, Part 7: The Significance of Fuzzy Stats

By Steve Bolton

…………In the world of fuzzy sets and imprecision modeling, the concept of cardinality takes on new shades of meaning that are not applicable to ordinary “crisp” sets, i.e. those without membership grades. In the last article in this series of amateur self-tutorials, I mentioned one type of “fuzzy cardinality,” based on triangular, trapezoidal and other fuzzy numbers that are quite useful in modeling many vague statements found in everyday speech. Of course, another means of expressing cardinality is through ordinary numbers that are defined by a single value, rather than intervals and other such fuzzy set types. This raises some interesting questions, because one of the logical implications of graded set membership is that records with a score of 0 shouldn’t be included in the count. By extension, the values that are non-zero should only be counted in proportion to the score assigned by their membership function; since these scores are almost always on a scale of 0 to 1, the count of a fuzzy set never exceeds its crisp count, but it may be much lower. Moreover, since membership scores are represented by fractional values, we’d normally use SQL Server data types like float, decimal and numeric to represent them, rather than members of the int family as we would with ordinary counts.
…………This apparently gives rise to many different possible calculations for fuzzy counts, but the most common one in the literature is the sigma count, in which we simply add up all of the membership scores for an entire set. Another stat seen occasionally in the literature is Support, which is defined as a crisp count of all the non-zero members of a fuzzy set; it thus always results in an integer somewhere between the sigma count and the ordinary crisp count. The Height refers to the crisp count, the Bandwidth to the number of records with scores greater than 0.5 and the Core to those with the maximum score of 1; these concepts might be useful in such applications as fuzzy clustering, but I see the sigma count used far more often in connection with today’s topic, fuzzy stats, which come into play whenever we want to calculate aggregates on fuzzy sets in platforms like SQL Server.[1]
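…………Before diving into the full example below, it may help to see these definitions in miniature. The following sketch runs against a hypothetical @FuzzySet table variable with nothing but a MembershipScore column; the table and its five sample grades are invented purely for illustration.

DECLARE @FuzzySet table (MembershipScore decimal(38,6))

INSERT INTO @FuzzySet (MembershipScore)
VALUES (0), (0.25), (0.5), (0.75), (1)

SELECT Count(*) AS CrispCount, -- 5
SUM(MembershipScore) AS SigmaCount, -- 2.5
SUM(CASE WHEN MembershipScore > 0 THEN 1 ELSE 0 END) AS Support, -- 4
SUM(CASE WHEN MembershipScore > 0.5 THEN 1 ELSE 0 END) AS Bandwidth, -- 2
SUM(CASE WHEN MembershipScore = 1 THEN 1 ELSE 0 END) AS Core -- 1
FROM @FuzzySet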

Partial Credit for Partial Set Membership

                The trick with this topic is “Think partial credit!” to borrow a phrase from University of Minnesota Prof. Glen Meeden.[2] Even if we restrict ourselves to the sigma count definition of fuzzy counts, the concept carries many interesting implications for all of the statistics that are derived from it. Averages, standard deviations, variances and all of the more advanced statistics built on top of them must be recalculated, given that the fundamental measures they rest on, like values and counts, no longer mean quite the same thing. The logic inherent in partial set membership demands this fundamental rethinking of basic statistics. The crisp versions of some of these stats are precalculated by SQL Server, so by switching to the fuzzy set versions we’ll incur some performance costs by computing them on the fly instead, with the aid of T-SQL aggregates and windowing functions. Thankfully, some of these fuzzy stats are worth the extra computations, because they can shed light on our data in unusual ways. Perhaps the most obvious example is the difference between the crisp count and the sigma count, which might be used as an alternative measure of fuzziness in place of fuzzy complements, which, as we saw early on in this series, are normally ideal for that use case.
…………Figure 1 provides a simple example of how to code this possible alternate measure of imprecision, by subtracting the sigma count from the height, i.e. the crisp count. I also demonstrate how easy it is to derive the bandwidth, support and core stats, even though these are only used infrequently. As usual, most of the initial code involves assigning the membership scores, by drafting the procedure I wrote for Outlier Detection with SQL Server, part 2.1: Z-Scores for double duty as my membership function. The calculations are performed on the Duchennes muscular dystrophy data I downloaded a few tutorial series ago from Vanderbilt University’s Department of Biostatistics, which now resides in a dummy DataMiningProjects database; afterwards, the results are stored in a @ZScoreTable table variable that can be operated on as needed. For the sake of consistency, I’ve stuck to the same format I’ve used throughout this series by using the three @Rescaling variables and the ReversedZScore column to transform the Z-Scores into membership scores on the traditional 0 to 1 range.

New Means, Medians and Modes

                Once we’ve derived the sigma count from these grades, I then calculate the standard fuzzy mean, which may be the simplest, most intuitive form of a “fuzzy absolute center.”[3] Another alternate measure of centrality is of course the mode, which I’ve thrown in because it’s so easy to calculate; to derive the fuzzy version, we just have to weight each value’s count by its membership grade. This is one of the few fuzzy stats where the returned value itself is not adjusted by its score; only the ranking of the candidate values changes. In Figure 2 we can see that both versions of the mode return the same value of 198, which is in keeping with the general rule that both modes and their fuzzy counterparts will only return actual crisp values from their datasets. Since medians depend on ordering, I’ll take up that topic when I address the fuzzification of ranks in a wrap-up of the whole series.
…………Instead, I’ve incorporated a higher class of averages known as Generalized Means, which can be used to derive a whole family of means between the minimum and maximum values, including the fuzzy arithmetic mean mentioned above, along with the harmonic and geometric means.[4] We basically plug in an @AlphaParameter bounded between 0 and 1, which allows us to cover the whole range, in much the same fashion that the various T-norm and T-conorm parameters empowered us to derive myriad types of fuzzy intersections and unions in previous articles. Note that in Figure 2, we see that the parameter value I arbitrarily chose led to a far different value for the GeneralizedMean than the one derived for the ordinary FuzzyMean.

Figure 1: Sample Code for Fuzzy Counts and Means
DECLARE @RescalingMax decimal(38,6), @RescalingMin decimal(38,6), @RescalingRange decimal(38,6)

DECLARE  @ZScoreTable table
(PrimaryKey sql_variant,
Value decimal(38,6),
ZScore decimal(38,6),
ReversedZScore as CAST(1 as decimal(38,6)) - ABS(ZScore),
MembershipScore decimal(38,6),
GroupRank bigint
)

INSERT INTO @ZScoreTable
(PrimaryKey, Value, ZScore, GroupRank)
EXEC   Calculations.ZScoreSP
              @DatabaseName = N'DataMiningProjects',
              @SchemaName = N'Health',
              @TableName = N'DuchennesTable',
              @ColumnName = N'LactateDehydrogenase',
              @PrimaryKeyName = N'ID',
              @DecimalPrecision = '38,32',
              @OrderByCode = 8

 

-- RESCALING
SELECT @RescalingMax = Max(ReversedZScore), @RescalingMin = Min(ReversedZScore) FROM @ZScoreTable
SELECT @RescalingRange = @RescalingMax - @RescalingMin

UPDATE @ZScoreTable
SET MembershipScore = (ReversedZScore - @RescalingMin) / @RescalingRange

DECLARE @Count bigint, @SigmaCount float, @Support float, @Bandwidth float, @Core float,
@Mean float, @FuzzyMean float, @GeneralizedMean float, @Mode float, @FuzzyMode float

-- COUNTS
SELECT @SigmaCount = SUM(MembershipScore), @Count = Count(*)
FROM @ZScoreTable
WHERE ZScore IS NOT NULL

SELECT @Support =  Count(*)
FROM @ZScoreTable
WHERE ZScore IS NOT NULL AND MembershipScore > 0 

SELECT @Bandwidth =  Count(*)
FROM @ZScoreTable
WHERE ZScore IS NOT NULL AND MembershipScore > 0.5

SELECT @Core =  Count(*)
FROM @ZScoreTable
WHERE ZScore IS NOT NULL AND MembershipScore = 1

-- MODES
SELECT @Mode = Value
FROM (SELECT TOP 1 ValueCount, Value
       FROM (SELECT Count(*) AS ValueCount, Value
              FROM @ZScoreTable
              WHERE ZScore IS NOT NULL
              GROUP BY Value) AS T1
       ORDER BY ValueCount DESC) AS T2

 

SELECT @FuzzyMode = Value
FROM (SELECT TOP 1 ValueCount, Value
       FROM (SELECT Count(*) * MembershipScore AS ValueCount, Value
              FROM @ZScoreTable
              WHERE ZScore IS NOT NULL
              GROUP BY Value,MembershipScore) AS T1
       ORDER BY ValueCount DESC) AS T2 

-- AVERAGES
DECLARE @AlphaParameter float
SELECT @AlphaParameter = 0.3

SELECT @FuzzyMean = SUM(MembershipScore * Value) / @SigmaCount , @Mean  = Avg(Value)
FROM @ZScoreTable
WHERE ZScore IS NOT NULL

SELECT @GeneralizedMean = Power(SUM(Power(Value, @AlphaParameter)) / CAST(@SigmaCount AS float), 1 / @AlphaParameter)
FROM @ZScoreTable
WHERE ZScore IS NOT NULL

SELECT @Count AS RegularCount, @SigmaCount AS SigmaCount,
@Count - @SigmaCount AS AlternativeMeasureOfFuzziness,
@Support AS Support, @Bandwidth As Bandwidth, @Core as Core,
@Mode AS Mode, @FuzzyMode as FuzzyMode,
@Mean AS Mean, @FuzzyMean AS FuzzyMean, @GeneralizedMean AS GeneralizedMean

 

Figure 2: Sample Results from the Duchennes Table
fuzzy-counts-and-averages
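
…………To get a feel for how the @AlphaParameter sweeps out a whole family of averages, we can rerun the same expression from Figure 1 at several parameter values. The sketch below assumes the @ZScoreTable and @SigmaCount from Figure 1 are still in scope; the particular parameter values in the VALUES list are arbitrary, picked only for demonstration.

-- hypothetical parameter sweep over the generalized mean from Figure 1
SELECT A.AlphaParameter,
Power(SUM(Power(Value, A.AlphaParameter)) / CAST(@SigmaCount AS float), 1 / CAST(A.AlphaParameter AS float)) AS GeneralizedMean
FROM @ZScoreTable AS T
       CROSS JOIN (VALUES (0.1), (0.3), (0.5), (0.9)) AS A(AlphaParameter)
WHERE ZScore IS NOT NULL
GROUP BY A.AlphaParameter
ORDER BY A.AlphaParameter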

…………Generalized means occupy a space in the set of norm operations in between T-norms and T-conorms, along with Ordered Weighted Averages.[5] Basically, each record in an OWA is multiplied by a weight, with the weights summing to one across the whole set, but the choice of weights is so broad that I won’t bother with them; I’ll merely point out that this obviously overlaps the topic of neural net weights, at least to anyone who has coded them before. I’ll also omit my sample code for Lambda Averages (i.e. λ-Averages), because it’s simply too long in comparison to its usefulness. This class of norm operations is derived from binary set relations, which means we first have to create a second table variable, fill it and adjust the scores, as we did in previous articles with T-norms and T-conorms. We’d then apply a CASE statement to select the MIN value of the @LambdaParameter and the outcome of the union, when both records’ scores were between 0 and the @LambdaParameter; take the MAX of the @LambdaParameter and the outcome of the fuzzy intersection when both scores were greater than the @LambdaParameter; then use the @LambdaParameter value in the ELSE statement.[6]
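…………For readers who want to experiment anyway, here is a minimal sketch of how I read that description; the @BinaryRelationTable, its sample grades and the @LambdaParameter value are all hypothetical, and the CASE logic is merely my translation of the verbal definition above, so double-check it against Klir and Yuan before using it for anything serious.

DECLARE @LambdaParameter float = 0.5

DECLARE @BinaryRelationTable table
(PrimaryKey sql_variant,
MembershipScoreForSet1 decimal(38,6),
MembershipScoreForSet2 decimal(38,6),
StandardIntersection AS (CASE WHEN MembershipScoreForSet1 <= MembershipScoreForSet2 THEN MembershipScoreForSet1 ELSE MembershipScoreForSet2 END),
StandardUnion AS (CASE WHEN MembershipScoreForSet1 >= MembershipScoreForSet2 THEN MembershipScoreForSet1 ELSE MembershipScoreForSet2 END)
)

INSERT INTO @BinaryRelationTable (PrimaryKey, MembershipScoreForSet1, MembershipScoreForSet2)
VALUES (1, 0.2, 0.4), (2, 0.7, 0.9), (3, 0.3, 0.8)

SELECT PrimaryKey, MembershipScoreForSet1, MembershipScoreForSet2,
CASE WHEN MembershipScoreForSet1 <= @LambdaParameter AND MembershipScoreForSet2 <= @LambdaParameter
       THEN (SELECT MIN(Value) FROM (VALUES (@LambdaParameter), (CAST(StandardUnion AS float))) AS T1(Value)) -- MIN of lambda and the union
     WHEN MembershipScoreForSet1 >= @LambdaParameter AND MembershipScoreForSet2 >= @LambdaParameter
       THEN (SELECT MAX(Value) FROM (VALUES (@LambdaParameter), (CAST(StandardIntersection AS float))) AS T1(Value)) -- MAX of lambda and the intersection
     ELSE @LambdaParameter END AS LambdaAverage
FROM @BinaryRelationTable

…………Run against those three sample rows, the first returns the Standard Union capped at the @LambdaParameter, the second returns the Standard Intersection floored at it, and the third falls back to the @LambdaParameter itself.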
…………As with fuzzy complements, unions and intersections, the applications are determined by the selection of appropriate parameter values.[7] One method of accomplishing this is of course parameter estimation.[8] A good starting point for fuzzy parameter estimation may be Seyed Mahmoud Taheri’s select bibliography of recent developments in fuzzy stats, which also lists many resources for extending fuzzy stats to standard statistical and data mining topics like Bayesian priors, fuzzy regression and hypothesis testing.[9] A couple of the sources also connect fuzzy sets to information theory, which I will also begin doing in my next tutorial.

Fuzzy Variance: A Fresh Take on a Staid Stat

                Taheri also mentions published research on fuzzy sets and maximum likelihood, which makes me wonder if there is also some connection to Fisher Information. The same is true of the different types of fuzzy variance, given that variance is interpreted in Fisher Information as a form of uncertainty. This may be a more worthwhile topic to cover than λ-Averages and OWAs since the formulas are less broad and have clearer applications. First of all, it makes intuitive sense to calculate variance differently on fuzzy sets, for precisely the same reasons as fuzzy means: the crisp version of the statistic is dependent on counts, which ought to be replaced with alternative measures like the sigma count when possible. If a record has zero membership, for example, its value shouldn’t count at all in the computation, because it’s no longer part of the set. It thus follows that a value with partial membership should only be taken into consideration in proportion to its score, just as with fuzzy means; it stands to reason that the same principles would apply if we went beyond the first and second statistical moments, mean and variance, to the third and fourth, skewness and kurtosis.
…………This leads to some interesting questions over how we can interpret the differences in the crisp and fuzzy variances. Given that the difference between crisp and sigma counts reflects a measure of fuzziness – albeit not as clearly as fuzzy complements do – perhaps we can interpret this as a measure of how dispersed the fuzziness is. This might come in quite handy in many data mining applications. I haven’t seen it used that way in the literature – but it’s good to keep in mind that my exposure to the whole topic of fuzzy sets is limited, given that I can’t afford the hefty fees for many of the academic journals, which render them inaccessible to me. Nor have I seen trapezoidal numbers combined with variance, but a construction like the CASE statement for TrapezoidalRangeOnTheCrispVariance in Figure 3 might be useful in expressing natural language slang about dispersion like, “all over the place.” The TrapezoidalRangeOnTheFuzzyVariance expresses the same concept, except that it represents a fuzzy number on a fuzzy set, rather than a fuzzy number on a crisp set; it thus amounts to saying, “this graded set is all over the place.” I set the range boundaries arbitrarily so that both would have fractional scores in Figure 4, which serves as a better illustration of partial membership in a fuzzy number. Using the square root and power techniques mentioned in the last article, we could add superlatives to it like “really” or “somewhat.” If we were using Z-Scores in the context of a normal distribution, we might set graded boundaries based on the “68–95–99.7 Rule” I covered in Goodness-of-Fit Testing with SQL Server, part 1: The Simplest Methods, which involves the proportion of records expected within one, two and three standard deviations of the mean. I left out the more complicated case of the superlative in the sample code below, just to illustrate how a SQL Server user might write T-SQL for simple cases of these two fuzzy variance types:

Figure 3: Some Possible Measures of Fuzzy Variance
DECLARE @StDev float, @FuzzyStDev float, @Var float, @FuzzyVar float

SELECT @FuzzyVar = Sum(Power((Value * MembershipScore) - @FuzzyMean, 2)) / @SigmaCount, @Var
= Var(Value), @StDev = StDev(Value)
FROM @ZScoreTable
WHERE ZScore IS NOT NULL

SELECT @FuzzyStDev = Power(@FuzzyVar, 0.5)

DECLARE @LowerBound float, @UpperBound float
SELECT @LowerBound  = 4000, @UpperBound = 5000

 SELECT @StDev AS StDev, @FuzzyStDev AS FuzzyStDev,
@Var AS Var, @FuzzyVar AS FuzzyVar,
@Var - @FuzzyVar AS PossiblyTheVarianceOfTheFuzziness,
CASE WHEN @Var BETWEEN @LowerBound AND @UpperBound THEN 1
       WHEN @Var < @LowerBound THEN ((@Var - @LowerBound)) / @Var + 1
       WHEN @Var > @UpperBound THEN ((@UpperBound - @Var)) / @Var + 1
       ELSE NULL END AS TrapezoidalRangeOnTheCrispVariance,
CASE WHEN @FuzzyVar BETWEEN @LowerBound AND @UpperBound THEN 1
       WHEN @FuzzyVar < @LowerBound THEN ((@FuzzyVar - @LowerBound)) / @FuzzyVar + 1
       WHEN @FuzzyVar > @UpperBound THEN ((@UpperBound - @FuzzyVar)) / @FuzzyVar + 1
       ELSE NULL END AS TrapezoidalRangeOnTheFuzzyVariance

Figure 4: Fuzzy Variance Result for the LactateDehydrogenase Column
fuzzy-variance
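
…………Purely as a speculative sketch of how the same weighting pattern might extend to the third and fourth moments mentioned earlier, the code below reuses the @ZScoreTable, @FuzzyMean, @SigmaCount and @FuzzyVar values from Figures 1 and 3; these formulas are improvised analogues I dreamed up by following the same pattern as the fuzzy variance, not anything I have validated against the literature.

-- speculative fuzzy analogues of skewness and kurtosis, following the same
-- weighting pattern as @FuzzyVar above
DECLARE @FuzzySkewness float, @FuzzyKurtosis float

SELECT @FuzzySkewness = (SUM(Power((Value * MembershipScore) - @FuzzyMean, 3)) / @SigmaCount) / Power(@FuzzyVar, 1.5),
@FuzzyKurtosis = (SUM(Power((Value * MembershipScore) - @FuzzyMean, 4)) / @SigmaCount) / Power(@FuzzyVar, 2)
FROM @ZScoreTable
WHERE ZScore IS NOT NULL

SELECT @FuzzySkewness AS FuzzySkewness, @FuzzyKurtosis AS FuzzyKurtosis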

…………Fuzzy variance may serve as a bridge to Fisher Information, a topic I want to cover in my long-delayed series, Information Measurement with SQL Server. Early on in this series we saw how fuzzy complements serve as one important measure of a different type of information, fuzziness, which quantifies the imprecision of a dataset in a different manner than variance. The difference between the sigma and crisp counts might serve the same purposes, although I’ve seen the various types of complements used more often for this purpose. One of the coolest things about fuzzy sets is that they give rise to several useful statistics that quantify different types of imprecision, which can be used to derive a program of “uncertainty management” for an organization. In the next installment we’ll see how we can use some of the fuzzy stats defined here to pin down a different brand of imprecision known as nonspecificity. This will involve discussion of the Hartley function and possibly Shannon’s Entropy, the latter of which is a fundamental concept in many data mining algorithms. Since entropy is among the foundations of information theory, this introduction to its applications in nonspecificity will serve as a bridge to my future Information Measurement series.

 

[1] pp. 25-28, Bonissone, Piero P., 1998, “Fuzzy Sets & Expert Systems in Computer Eng. (1).” Available online at http://homepages.rpi.edu/~bonisp/fuzzy-course/99/L1/mot-conc2.pdf. Bonissone’s material is reprinted at least in part from slides produced by artificial intelligence researchers Roger Jang and Enrique Ruspini.

[2] p. 5, Meeden, Glen, 2015, “Two Examples of the Use of Fuzzy Set Theory in Statistics,” published online at the University of Minnesota web address http://users.stat.umn.edu/~gmeeden/talks/fuzznov09.pdf

[3] p. 435, Klir, George J. and Yuan, Bo, 1995, Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall: Upper Saddle River, N.J. On this particular page, they’re extending the meaning of the term even further, to complex network topologies.

[4] IBID., p. 90.

[5] IBID., pp. 92-93.

[6] IBID., p. 94.

[7] IBID., p. 93.

[8] IBID., p. 94.

[9] p. 240, Taheri, Seyed Mahmoud, 2003, “Trends in Fuzzy Statistics,” pp. 239-257 in Austrian Journal of Statistics, Vol. 32, No. 3. Available online at Vienna University of Technology web address http://www.statistik.tuwien.ac.at/oezstat/ausg033/papers/taheri.pdf

Implementing Fuzzy Sets in SQL Server, Part 6: Fuzzy Numbers and Linguistic Modifiers

By Steve Bolton

…………I’ve written several amateur tutorial series on this blog in order to more quickly absorb difficult data mining, statistical and machine learning topics, while hopefully helping other SQL Server users avoid some of my inevitable mistakes. Since I don’t know what I’m talking about, I’m occasionally surprised at how useful some of the material turns out to be – particularly in the case of this week’s topic, fuzzy numbers, which I originally thought were a curiosity. In recent years I’ve made moderate progress in recapturing some of the advanced math skills I had as a kid, but some of the resources I’m consulting still wear me out with densely packed, arcane symbols; that was also true of George J. Klir and Bo Yuan’s Fuzzy Sets and Fuzzy Logic: Theory and Applications and several other sources I’ve used for this series on using T-SQL to implement fuzzy sets. Much of the literature on the subject is written by mathematicians who need thick equations and highly precise terminology to communicate with each other effectively, which can be taxing on non-specialists not accustomed to the jargon of the field.
…………That communication gap may be responsible for the enormous lag between the refinement of the theory and its adoption; this comes despite all of the empirical evidence that these data modeling techniques are insanely useful in solving certain classes of real-world problems, in a wide variety of industries. Adoption has also lagged in the relational database and data mining markets, in spite of the fact that set-based languages like T-SQL and Multidimensional Expressions (MDX) are ideal for implementing fuzzy sets. I’m trying to put my drop in the bucket to remedy that, but my lack of adequate equation translation skills almost caused me to skip over fuzzy numbers, which are actually one of the simplest and most useful aspects of fuzzy set theory.

Modeling Everyday Speech 

 …………One rule of thumb I’ve learned along the way is that whenever confusion arises over fuzzy sets, it is best to return to basics and state the problem in terms of natural language. After all, the main overarching use case for fuzzy set theory is to model imprecision that can be expressed linguistically, particularly when it would be useful to apply continuous scales to ordinal categories. If we were using a Behavior-Driven Development (BDD) methodology, we might want to flag any qualifiers we encounter in user stories like “about,” “around,” “near,” “approximately,” “most,” “few” and the like as candidates for fuzzy numbers. These require several numbers and calculations of varying intensity in order to assign a graded range of values around a single number – but since ordinary numbers require none of this extra computation, the obvious question is, why bother? It turns out that these particular instances of fuzzy sets are useful in modeling these types of natural language qualifiers, which express uncertainty about how close a value is to a definite target value.
…………“About half” is a clear and simple example from everyday speech. In fuzzy set parlance, this would be modeled as a “triangular number,” which is a fancy way of saying that we’d assign a perfect membership grade of 1 to values that were exactly equal to 0.5, with the grades descending on either side of 0.5 in proportion to how far away the value was from that target. The term “triangular” is used because if we use a line chart to depict a membership function of this kind, it peaks at the target value and descends to 0 as the values decrease or increase away from it. Trapezoidal numbers are a mouthful, but really aren’t much more difficult; they basically have the same shape as triangular numbers, except that the peak is flattened in order to express an interval of some kind.[1] University of Minnesota Prof. Glen Meeden’s natural language example of a trapezoidal number is the best I’ve yet run across in the literature: “What is the average yearly snowfall in the Twin Cities? You might answer somewhere between 20 and 50 inches.”[2]
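…………Before moving on to the Z-Score example later in this article, here is a bare-bones sketch of the “about half” triangular number, with an arbitrary spread of 0.25 on either side of the 0.5 target; the sample values and the spread are invented purely for illustration.

DECLARE @Target float = 0.5, @Spread float = 0.25

SELECT RatingValue,
CASE WHEN ABS(RatingValue - @Target) >= @Spread THEN 0 -- outside the triangle entirely
       ELSE 1 - (ABS(RatingValue - @Target) / @Spread) END AS AboutHalfScore -- peaks at 1 when the value hits the target exactly
FROM (VALUES (0.1), (0.3), (0.5), (0.6), (0.9)) AS T1(RatingValue)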
…………Functions that only increase or decrease are useful in capturing related distinctions, like “a large number of” or “a small number of.”[3] Although it is wise to obey a few restrictions that lead to certain useful mathematical properties – particularly the industry-standard requirement that grades fall within the range of 0 to 1 – it is not necessary for the membership functions to be symmetric. In fact, assigning a lopsided peak to a triangular number can be useful in modeling statements like “almost all,” which would be closer to the right edge of a line graph than a function that implements the term “most.”[4] They can even be bell-shaped.[5]
…………At a higher level of sophistication, fuzzy numbers can be used to model terms like “very,” which fall under the rubric of linguistic hedges and fuzzy modifiers – including statements like “very true,” which can be useful in fuzzy logic.[6] Klir and Yuan suggest applying powers and square roots to model such distinctions, since they don’t follow a linear scale. For example, they say that if we use a score of 0.8 to model the term “John is young,” squaring it could model the phrase “very young” and taking its square root could be used for “fairly young.” This is because the result of the first is 0.64, which strongly modifies the original term, while the second returns 0.89, which modifies it only slightly.[7] Of course, the exact boundaries of all of these terms have to be set in light of domain knowledge of some kind. “Half” is a definite term but the modifier “about” can mean different things to different people, which may require aggregating the viewpoints of users in some way, perhaps using Decision Theory methods that integrate seamlessly with fuzzy sets. Neural nets likewise work well together with fuzzy sets because they are highly useful for encoding unknown functions, which means they can be put to use to derive such boundaries if greater precision is required. For many use cases, an informal guesstimate may suffice.
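…………The squaring and square root trick is simple enough to verify directly in T-SQL; this one-liner merely reproduces Klir and Yuan’s “John is young” example from the paragraph above.

SELECT Power(CAST(0.8 AS float), 2) AS VeryYoung, -- intensifies the grade to 0.64
Sqrt(0.8) AS FairlyYoung -- dilutes it to roughly 0.89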

Translating Fuzzy Numbers into T-SQL

…………My sample code below implements four types of fuzzy numbers, using quite simple and arbitrary criteria that are merely designed to illustrate the concepts. For the sake of consistency, I’m once again using the procedure I wrote for Outlier Detection with SQL Server, part 2.1: Z-Scores to derive my membership grades and storing the results in a table variable (which has an extraneous column named GroupRank that can be safely ignored). In this case we’re calculating Z-Scores on the LactateDehydrogenase column of the Duchennes muscular dystrophy dataset I’ve been using as practice data for the last several tutorial series, which I downloaded ages ago from Vanderbilt University’s Department of Biostatistics and converted to a SQL Server table in a dummy DataMiningProjects database. Some of the key differences from previous articles include the absence of the ReversedZScore column and the associated @Rescaling variables. This is because we don’t need to perform rescaling of any kind, since we’re measuring nearness to a few Z-Score values in all four examples, not calculating a relative score on a range of 0 to 1. The MembershipScore column is missing for the same reason. In its place, we have four computed columns, two of which measure the closeness of each Z-Score in the dataset to a target value or range.
…………To put it simply, the triangular number assigns a grade to  the natural language statement, “around a Z-Score of 0.960526,” while the trapezoidal expresses the concept of “somewhere between 0.450526 and 1.360526.” The other two columns define increasing and decreasing functions that model how close the values are to either the top or bottom of the dataset, which can be interpreted as “few” or “most.” The hard-coded numbers in the UPDATE and the declarations above it are picked out of thin air merely to illustrate the point, not because they express any domain knowledge. There are probably more efficient ways of coding this, but the point is to get the concepts across.

Figure 1: Code for the Fuzzy Number Sample
DECLARE @ZScoreTable table
(PrimaryKey sql_variant,
Value decimal(38,6),
ZScore decimal(38,6),
TriangularNearnessScore decimal(38,6),
TrapezoidalNearnessScore decimal(38,6),
FewScore decimal(38,6),
MostScore decimal(38,6),
GroupRank bigint
)

INSERT INTO @ZScoreTable
(PrimaryKey, Value, ZScore, GroupRank)
EXEC   Calculations.ZScoreSP
              @DatabaseName = N'DataMiningProjects',
              @SchemaName = N'Health',
              @TableName = N'DuchennesTable',
              @ColumnName = N'LactateDehydrogenase',
              @PrimaryKeyName = N'ID',
              @DecimalPrecision = '38,32',
              @OrderByCode = 8

DECLARE  @ComparisonPoint float = 0.960526
DECLARE @LowerBound float = @ComparisonPoint - 0.5, -- we could of course make it lopsided if that would model our data better
@UpperBound float = @ComparisonPoint + 0.5

UPDATE @ZScoreTable
SET TriangularNearnessScore =     CASE WHEN ZScore NOT BETWEEN @LowerBound and @UpperBound THEN 0
       WHEN ZScore = @ComparisonPoint THEN 1
       WHEN ZScore < @ComparisonPoint THEN (ZScore - @ComparisonPoint) + 1
       WHEN ZScore > @ComparisonPoint THEN (@ComparisonPoint - ZScore) + 1
       ELSE NULL END,
       TrapezoidalNearnessScore = CASE WHEN ZScore NOT BETWEEN @LowerBound and @UpperBound THEN 0
       WHEN ZScore BETWEEN @LowerBound + 0.1 and @UpperBound - 0.1 THEN 1
       WHEN ZScore < @LowerBound + 0.1 THEN (ZScore - @ComparisonPoint) + 1
       WHEN ZScore >  @UpperBound - 0.1 THEN (@ComparisonPoint - ZScore) + 1  ELSE NULL END,
       FewScore = CASE WHEN ZScore BETWEEN 0.2 AND 0.7 THEN 1  ELSE 1 - (ZScore * 0.3) END,
       MostScore = CASE WHEN ZScore BETWEEN 1 AND 2 THEN 1  ELSE ZScore / 1.2 END

SELECT PrimaryKey, ZScore, TriangularNearnessScore, TrapezoidalNearnessScore, FewScore, MostScore
FROM @ZScoreTable
WHERE ZScore BETWEEN 0.2 AND 2 AND TriangularNearnessScore != 0 AND TrapezoidalNearnessScore != 0
ORDER BY ZScore DESC

Figure 2: Sample Results from the Duchennes Practice Dataset
fuzzy-number-results

…………Note how the TriangularNearnessScore peaks at a single value, while the TrapezoidalNearnessScore is contained within a particular range. The FewScore and MostScore values peak at opposite ends of the dataset. Of course, a picture is worth a thousand words: if the values above aren’t clear, then the Reporting Services line graphs below ought to clear up any confusion. The TriangularNearnessScore doesn’t precisely follow a triangular shape, but it does come to single peak, which is good enough. The trapezoidal example reaches the same peak, but encompasses a range of values represented in the flat line. The FewScore and MostScore also have flat peaks, but these occur at the far edges of the membership grades. The shapes aren’t as neat as those in the literature, in part because I’m using real-world data from the Duchennes dataset together with some arbitrary range values, but I’m sure that readers will get the gist of it.

Figure 3: Reporting Services Line Graphs for 4 Fuzzy Number Samples
fuzzy-number-report
…………Asking questions like “what is about half of A plus almost all of B” is probably a rare use case, one that is more likely to come up in data mining than in relational situations. If the need for this kind of comparison does arise, be aware that at least some of the math has been worked out, so there’s no need to reinvent the entire wheel. I say “some” because, at least at the time Klir and Yuan wrote, mathematicians were still struggling with some of the strange properties and enigmatic logic associated with these kinds of comparisons.[8] Just imagine several of the trapezoids and triangles above overlaid and you can see how quickly the topics of fuzzy arithmetic, fuzzy set relations and fuzzy matrix math can become complicated. Some basic procedures for solving them are available[9], as well as neural net techniques for solving fuzzy equations[10], but they lead to certain logical difficulties that I believe still aren’t fully understood, such as the fact that many of the approximate solutions may not be unique.[11] Thankfully, statements like “between 3 and 5 of A minus a little of B” don’t come up often in ordinary speech, so I imagine that SQL Server users are unlikely to need them. In contrast, qualifiers like “around” or “near” are so common that I guarantee these fuzzy numbers will prove valuable to many users in the long run.
…………Instead of taking fuzzy numbers in directions of this kind that lack real-world applications, I’ll use them as a stepping stone towards quantifying imprecision, which can be helpful in programs of uncertainty management. Many of the linguistic qualifiers mentioned here are actually instances of what is known as “fuzzy cardinality,” which enables modeling of phrases like “about a quarter” or “near.”[12] In the next article, I’ll delve into the realm of fuzzy statistics, where the implications of relative membership in a set lead automatically to a range of different types of cardinality, not just the single type of Count used in T-SQL. I have yet to see this done in the literature (I simply can’t afford access to most of the research published on certain advanced topics like fuzzy stats), but I’ll provide an example of how trapezoidal numbers might be implemented in order to create fuzzy analogues of standard deviation and variance. Given that the ordinary “crisp” versions of these aggregates are determined in part by counts, the fuzzification of counts plays into that as well. It may be useful to probe for connections to Fisher’s Information, a metric I hope to code for a long-delayed series titled Information Measurement with SQL Server, given that it apparently uses variance to model uncertainty. Using increasing and decreasing functions to model statements like “almost all” as we have done here can be seen as creating fuzzy instances of MIN and MAX aggregates, which essentially express the same sentiment as “near the bottom” or “near the top.” These can also be used to create new methods of fuzzy outlier detection, as I’ve essentially been doing throughout this series by using Z-Scores in my sample code; fuzzy set grades don’t have anything to do with stochastics or outlier detection unless such meanings are deliberately assigned.
…………In the next article I will probably continue to use Z-Scores in my sample code mainly for the sake of consistency and to reuse old, familiar concepts, but also to kill two birds with one stone and investigate possible uses in outlier detection. As we shall see, membership functions can also be interpreted in the light of Evidence Theory, in which each grade indicates the levels of credibility and truthfulness of a statement. In such cases, we’re speaking of fuzzy measures[13], which are also useful in partitioning and quantifying different types of uncertainty. As we already saw in Implementing Fuzzy Sets in SQL Server, Part 2: Measuring Imprecision with Fuzzy Complements, fuzzy complements can be used to measure one type of uncertainty, in quantifying just how imprecise the boundaries of fuzzy sets are; in essence, they become measures of fuzziness. There are other types of uncertainty, however, which require entirely different modeling techniques, which is where fuzzy stats and measures come in handy.

 

 

[1] Most fuzzy set references discuss triangulars and trapezoidals. One source I found to be helpful was pp. 12-13, Alavala, Chennakesava R., 2008, Fuzzy Logic and Neural Networks: Basic Concepts and Applications. New Age International Pvt. Ltd.: New Delhi.

[2] p. 3, Meeden, Glen, 2015, “Two Examples of the Use of Fuzzy Set Theory in Statistics,” published online at the University of Minnesota web address http://users.stat.umn.edu/~gmeeden/talks/fuzznov09.pdf

[3] pp. 96-98, Klir, George J. and Yuan, Bo, 1995, Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall: Upper Saddle River, N.J.

[4] IBID., p. 228.

[5] IBID., p. 99.

[6] IBID., p. 222.

[7] IBID., p. 230.

[8] IBID., p. 115.

[9] IBID., pp. 160-162.

[10] IBID., pp. 171-173.

[11] IBID., pp. 153-157, 166-167.

[12] IBID., p. 98.

[13] IBID., p. 177.

Implementing Fuzzy Sets in SQL Server, Part 5: The Mystery of the Missing Left Join

By Steve Bolton

…………Information on set operations like complements, intersections and unions is plentiful in the literature on fuzzy sets, which made the last three articles in this series of amateur self-tutorials easier to write in a certain sense. These topics are far more complex than with ordinary “crisp” sets because there are so many different methods for calculating membership grades for the resultsets, each of which can be applied to different use cases in order to model imprecision. This can be especially useful in Behavior-Driven Development (BDD) and user stories, as well as in uncertainty management programs of all kinds, including reducing the insecurity involved in software engineering and data modeling. There may be extra steps involved in calculating these fuzzy set relations, but the topic is at least well-studied and the sources of information for the formulas are readily available. Strangely, that is not true for some of the set operations that SQL Server users encounter more often, like LEFT JOINs, INNER JOINs, OUTER JOINs, RIGHT JOINs and Cartesian Products. There are only 41 hits on Google for the terms “left join” “relational” “fuzzy set” combined, one of which is an off-hand reference from my own blog. I haven’t seen these topics addressed much in the literature, which is rich in information on complements, intersections and unions, but not these common relational joins. Like many other puzzling aspects of fuzzy set theory, the reason behind this is readily apparent only after we get back to basics and frame the questions we’re asking of the database in natural language terms.
…………This problem is easily reducible to a single enigma, which nevertheless requires a lengthy explanation. As we saw in the last two tutorials, INNER and OUTER JOIN statements are useful in deriving fuzzy intersections and unions respectively. We really can’t use the standard T-SQL INTERSECT and UNION operators for these purposes, since we need to retrieve membership function values from both sides in order to calculate membership in the resultset. For all intents and purposes, these are the principal use cases for fuzzy INNER and OUTER JOIN operations. Although I won’t rule out the possibility of some oddball set operation that requires the fuzzy versions of these statements without qualifying as a fuzzy intersection or union, such use cases are probably pretty rare. CROSS JOIN operations are easier to explain once we realize that all fuzzy binary relations are performed on the complete crisp Cartesian Product to derive a fuzzy subset of some kind. To perform a fuzzy Cartesian Product, we would have to take the membership grades of both sets and implement a mathematical operation of some type, just like with fuzzy intersections and unions. The CROSS JOIN operator isn’t used as often in T-SQL as its kin, in large part because retrieving every possible combination of records from both sides can tax the server beyond belief, but cross products aren’t mentioned much in the fuzzy set literature for an additional reason: a wide range of fuzzy subsets of the cross product also qualify as fuzzy intersections and unions, depending on whether they’re implemented with the classes of functions known as T-norms and T-conorms. The domain of possible resultsets is no wider than the crisp CROSS JOIN, but fuzzy intersections and unions take up a far wider space within it than their crisp counterparts.
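…………To make that point concrete, here is a hedged sketch of what a fuzzy Cartesian Product might look like, using two hypothetical table variables that each carry a MembershipScore and the minimum of the two grades as the combining operation; choosing the minimum is precisely what makes the result coincide with the standard fuzzy intersection taken over the full cross product.

DECLARE @Set1 table (PrimaryKey int, MembershipScore decimal(38,6))
DECLARE @Set2 table (PrimaryKey int, MembershipScore decimal(38,6))

INSERT INTO @Set1 (PrimaryKey, MembershipScore) VALUES (1, 0.9), (2, 0.4)
INSERT INTO @Set2 (PrimaryKey, MembershipScore) VALUES (1, 0.7), (2, 0.1)

-- every combination of records from both sides, graded by the lesser of the two scores
SELECT T1.PrimaryKey AS Set1Key, T2.PrimaryKey AS Set2Key,
CASE WHEN T1.MembershipScore <= T2.MembershipScore THEN T1.MembershipScore ELSE T2.MembershipScore END AS ProductMembershipScore
FROM @Set1 AS T1
       CROSS JOIN @Set2 AS T2

…………Swapping that CASE for any other T-norm or T-conorm would simply yield a different fuzzy subset of the same crisp product.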
…………It is certainly possible to define fuzzy Cartesian Product operators that are neither T-norms nor T-conorms, but the use cases for doing so are neither clear nor very wide. In order to derive such a measure, we’d probably have to take the next step up and select from the class of “norm operations” that encompass all types of possible fuzzy aggregates, including T-norms and T-conorms as special cases.[i] Unfortunately, the concept is so general that it provides even less guidance for deriving new types of fuzzy Cartesian Products for particular use cases than we have for matching T-norms and T-conorms to the right problems. What the CROSS JOIN, INNER JOIN and OUTER JOIN all have in common, however, is that both participating sets are given equal weight in determining the results, which is also true for their fuzzy counterparts. As I always caution, I’m not an expert in these matters and am only writing on the topic in order to introduce myself to it, while hopefully helping other SQL Server users to avoid my inevitable mistakes. At the risk of committing another one, I’ll take a stab at solving the Mystery of the Missing Join, which may boil down to the fact that the LEFT JOIN statement gives preferential treatment to one side.

Comparing Membership Functions in LEFT JOINs

                One of the most common scenarios for a LEFT JOIN in ordinary crisp sets is to join a dependent table to a parent via a foreign key. When the relationship works in reverse, we’re speaking of a RIGHT JOIN – which is just the converse of a LEFT JOIN, so I’ll dispense with any discussion of it. The join results typically have repeated values for the parent table, since they often match multiple rows in the dependent table. Whenever we work with fuzzy sets, spelling out what we’re looking for in explicit natural language terms often simplifies what may seem like really complex problems. In the case of crisp LEFT JOINs, we’re asking the database, “Give me all of the records in Set1, plus some additional information from Set2 that we sometimes do without.” Here’s the key problem we face: fuzzy set theory only matters in the context of a LEFT JOIN if those additional columns of information are included or excluded based on some membership function, which can be mathematically compared to the membership function of the parent table. In other words, the math involved in a fuzzy complement, union or intersection is needed to determine whether or not the results are included in the new set, but with a LEFT JOIN, oftentimes we’re just supplying more information in a Master-Detail situation – in which case, the single membership function of the parent ought to be sufficient to determine membership in the resultset. With intersections and unions, both sides of the binary relation are of equal importance, but in a LEFT JOIN, the left side takes precedence, with the right just being filler material that doesn’t affect the membership values of the parent. Perhaps the only time we might take a membership function on the dependent table into account is when it measures a type of uncertainty comparable to the parent, plus we specifically want to grade how relevant the additional detail is to the parent.
…………For example, let’s say we’re performing a fuzzy LEFT JOIN on a parent CustomerTable to a child AddressTable. If the first has a function that ranks how Tall a person is and the second carries a grade for membership in the set of Rural places, then we can retrieve the extra data in order to answer questions like, “Give me the set of Short people who live in Suburbs” without performing any math operations on either membership function. If we exclude the much more advanced topic of statistical independence, how Tall or Short a customer is has no bearing on how Rural or Urban their hometown is, so the two memberships don’t affect each other. We can just do a regular LEFT JOIN in such situations and return the two grades separately.
…………The question gets murkier when we consider dependent tables that measure the same quality as the parent, or nearly so. For example, let’s pretend we’re operating a database for a hardware store that has a CeilingHeight membership function defined on the AddressTable, also split into categories like Tall, Short, etc. (for the sake of argument, let’s say the customers volunteered this information about their homes in a survey or whatever). We can ask the database a simple question like, “Give me the set of Short people who live in homes with Tall ceilings” without doing any extra math, but there may be use cases where we can compare these two types of information to draw inferences. An example might be asking a question like, “Give me the set of people who belong to the category of Mismatched Customers,” which can in turn be defined as Short people with Tall ceilings and vice-versa. This could be calculated through LEFT JOINs with WHERE clauses on the Tall and Short values for the Customers, after which some computation can be applied to derive a new category based on some comparison of that membership function to the CeilingHeight column of the child.
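…………As a rough sketch of that kind of comparison, assuming hypothetical Person.CustomerTable and Person.AddressTable tables in which ShortnessScore and TallCeilingScore grades have already been computed and persisted, a Mismatched Customers query might combine the two membership functions with a standard fuzzy intersection after the LEFT JOIN; every object name here is invented purely for illustration.

-- hypothetical schema: Person.CustomerTable(ID, AddressID, ShortnessScore) and Person.AddressTable(ID, TallCeilingScore)
SELECT C.ID, C.ShortnessScore, A.TallCeilingScore,
CASE WHEN C.ShortnessScore <= IsNull(A.TallCeilingScore, 0) THEN C.ShortnessScore
       ELSE IsNull(A.TallCeilingScore, 0) END AS MismatchedCustomerScore -- standard fuzzy intersection of the two grades
FROM Person.CustomerTable AS C
       LEFT JOIN Person.AddressTable AS A
       ON C.AddressID = A.ID

…………The IsNull calls simply assign a zero grade whenever a customer has no matching address row, which is only one of several defensible choices.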

An Advanced Exception: LEFT JOINs Creating New Categories of Comparison

                Keep in mind that with fuzzy intersections and unions, we’re essentially adding a new layer of fuzzy membership grades to the operation that joins the two sets; if we extend the same principle to fuzzy LEFT JOINs, then we’re likewise creating a new type of fuzzy set by defining the join in an imprecise way. I’m still a novice at all of this, but would suggest that it might be possible to distinguish this type of fuzzy LEFT JOIN from one that doesn’t require any extra math by looking at the results: if they define a new ordinal category that is graded on a continuous scale, then we’ve created a new type of fuzzy set from the two membership functions. If no such category is being created, then we can probably just return the child’s membership function values as additional details if they’re germane to our query, without taking any extra steps. The question may be complicated by the fact that both tables may have multiple membership functions defined on them, some of which define sets that may not be relevant at all to the query at hand, while others can have various shades of meaning that may overlap. For example, a ProductTable might have different membership functions for such disparate characteristics as Color, Width and Availability, while the dependent table may even have stochastic labels like “Gaussian” or “Gamma,” if we needed to measure membership of a column in some kind of probability distribution.
…………Either way, we have to retrieve all the values we would for a LEFT JOIN, but would only perform additional math in certain use cases that seem to be much narrower than those for fuzzy unions and intersections. In these instances, we’re essentially extending the join to a new level, just as the Sugeno and Yager Complements add another level of membership grade on top of the existing membership function values. For that reason, I strongly suspect that we would have to select a fuzzy join function that most closely matches the type of imprecision we’re trying to model. As we saw in the last couple of articles, T-norms and T-conorms are the ideal mathematical structures for expressing fuzzy intersections and unions, but selecting the ones with the right mix of mathematical properties and output histograms is often a tough call. Research has been ongoing in the field for decades to narrow down those use cases, but the paucity of information on LEFT JOINs leaves us with an even shakier starting point than that. This is the weakest article in this series, in the sense that I can’t even provide T-SQL sample code for a rare (or even perhaps non-existent) set of use cases, which would also require advanced research that the professionals apparently haven’t deemed worth investigating much; nevertheless, anyone coming from the realm of crisp SQL database servers is going to be immediately struck by the absence of LEFT JOINs and could use a tentative answer to this nagging question. Now that we’re over this hump, in the next installment I’ll explain how to use the most common fuzzy quantifiers and measures, like triangular and trapezoidal numbers. These can be incredibly useful in the right circumstances, including making it child’s play to model common linguistic phrases like “about half” or “most” on both crisp and fuzzy sets. It is here that many SQL Server users can probably find immediate applications for the material covered in this series, use cases that the untapped potential of fuzzy set theory can address better than any other alternative.

[i] p. 93, Klir, George J. and Yuan, Bo, 1995, Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall: Upper Saddle River, N.J.

Implementing Fuzzy Sets in SQL Server, Part 4: From Fuzzy Unions to Fuzzy Logic

By Steve Bolton

…………Fuzzy set relations carry an added layer of complexity not seen in ordinary “crisp” sets, due to the need to derive new grades for membership in the resultset from the scores in the original sets. As I explained two weeks ago in this series of amateur self-tutorials, binary set relations like fuzzy intersections are a bit more complicated than reflexive set relations like fuzzy complements, since we have to perform fuzzy aggregation operations on multiple sets. The good news is that we can reuse most of the concepts introduced in the last article for fuzzy unions, which are handled in an almost identical way. With intersections, we’re basically requesting “all of the records that are members of both sets,” but with unions, we’re asking SQL Server, “Give me any records that are members of either set.” When we use the standard fuzzy versions of either, that amounts to retrieving the minimum of the two membership values in the case of intersections and the maximum in the case of unions. With the crisp sets SQL Server users work with every day, set relations are straightforward because no calculations of membership values are involved, but with fuzzy set relations there are myriad ways of defining these grades; fortunately, the mathematical functions for fuzzy unions are quite similar to those introduced in the last article. Last time around I provided sample code for four basic types of fuzzy intersections known as the standard intersection, drastic intersection, algebraic product and bounded difference; this week, I’ll demonstrate how to code the corresponding inverses, which are known as the Standard Union, Drastic Union, Algebraic Sum and Bounded Sum. Just as fuzzy intersections are best exemplified by a class of mathematical functions known as T-norms, so too are fuzzy unions typified by another class known as T-conorms. As we shall see, the authors and parameters of the most popular T-conorms are almost identical to those I coded for the last installment, based on my favorite mathematical reference on the topic, George J. Klir and Bo Yuan’s Fuzzy Sets and Fuzzy Logic: Theory and Applications. At the tail end of the last article I offered suggestions for coping with the odd data modeling issues that crop up in cases when it is desirable to persist parameter values for such advanced functions; I won’t rehash that here because the solutions are exactly the same for fuzzy unions. I’ll also omit discussion of the theorems and formulas that justify these functions, since it’s not really necessary to overload end users with that kind of information, for the same reason that it’s not necessary to give a dissertation on automotive engineering to get a driver’s license. Suffice it to say that T-conorms share many of the same mathematical properties as T-norms, except that they’re characterized by superidempotency instead of subidempotency, which you don’t really need to know unless you’re devising your own fuzzy aggregates.[i] Instead, I’ll spend more time discussing the use cases for the different varieties of fuzzy set relations; suffice it to say for now that it boils down to looking for different shades of meaning in conjunctions like “or” and “and” in natural language, which often correspond to fuzzy unions and intersections.
…………First, let me get the code for fuzzy unions out of the way. To readers of the last article, the structure of Figures 1 through 4 ought to look familiar. Once again, I’m using the output of two stored procedures I wrote for Outlier Detection with SQL Server, part 2.1: Z-Scores and Outlier Detection with SQL Server, part 2.2: Modified Z-Scores as my membership functions (you can of course test it on other membership functions that have nothing to do with Z-Scores), then storing the results in two table variables. As usual, the GroupRank and OutlierCandidate columns can be safely ignored (they’re only included in the INSERT EXEC because they were part of the original procedures), while the ReversedZScore column is used in conjunction with the @RescalingMax, @RescalingMin and @RescalingRange variables to normalize the results on the industry-standard fuzzy set range of 0 to 1. In fact, there aren’t any differences whatsoever in the code till we insert the results of the union in a third table variable, so that we can demonstrate several T-conorms with it later on; in practice, it may be possible to perform fuzzy unions without these table variables, depending on how the membership functions are being calculated. We don’t have to perform any mathematical operations on the records returned by ordinary crisp relations, but this is not true with their fuzzy counterparts. Using a regular INTERSECT operator in the sample code in the last tutorial would have been awkward to say the least, since it would have excluded unmatched values for the MembershipScores of both sets, which must be preserved to calculate the grade for membership in the fuzzy relation. The same problem arises when we try to implement a fuzzy union with the T-SQL UNION operator, so I had to resort to a workaround similar to the one used for fuzzy intersections, with a few cosmetic changes. I use a FULL JOIN here instead of an INNER JOIN and had to add IsNull statements and a CASE to substitute grades of 0 whenever the Value of one of the columns is NULL. The same two WHERE clauses I used after the join condition in the last tutorial are present here, but are buried in two subqueries; thanks to the null checks, they’re assigned membership grades of 0 on one of the two sets, whenever the WHERE clause removes them from one side but not the other.

Figure 1: Sample Code for a Fuzzy Union on Two Different Types of Z-Scores
DECLARE @RescalingMax decimal(38,6), @RescalingMin decimal(38,6), @RescalingRange decimal(38,6)
DECLARE   @ZScoreTable table
(PrimaryKey sql_variant,
Value decimal(38,6),
ZScore decimal(38,6),
ReversedZScore as CAST(1 as decimal(38,6)) - ABS(ZScore),
MembershipScore decimal(38,6),
GroupRank bigint
)

DECLARE  @ModifiedZScoreTable table
(PrimaryKey sql_variant,
Value decimal(38,6),
ZScore decimal(38,6),
ReversedZScore as CAST(1 as decimal(38,6)) - ABS(ZScore),
MembershipScore decimal(38,6),
GroupRank bigint,
OutlierCandidate bit
)

INSERT INTO @ZScoreTable
(PrimaryKey, Value, ZScore, GroupRank)
EXEC   Calculations.ZScoreSP
              @DatabaseName = N'DataMiningProjects',
              @SchemaName = N'Health',
              @TableName = N'DuchennesTable',
              @ColumnName = N'LactateDehydrogenase',
              @PrimaryKeyName = N'ID',
              @DecimalPrecision = '38,32',
              @OrderByCode = 8

-- RESCALING
SELECT @RescalingMax = Max(ReversedZScore), @RescalingMin = Min(ReversedZScore) FROM @ZScoreTable
SELECT @RescalingRange = @RescalingMax - @RescalingMin

UPDATE @ZScoreTable
SET MembershipScore = (ReversedZScore - @RescalingMin) / @RescalingRange

 INSERT INTO @ModifiedZScoreTable
(PrimaryKey, Value, ZScore, GroupRank, OutlierCandidate)
EXEC   Calculations.ModifiedZScoreSP
              @DatabaseName = N'DataMiningProjects',
              @SchemaName = N'Health',
              @TableName = N'DuchennesTable',
              @ColumnName = N'LactateDehydrogenase',
              @PrimaryKeyName = N'ID',
              @OrderByCode = 8,
              @DecimalPrecision = '38,32'

-- RESCALING
SELECT @RescalingMax = Max(ReversedZScore), @RescalingMin = Min(ReversedZScore) FROM @ModifiedZScoreTable
SELECT @RescalingRange = @RescalingMax - @RescalingMin

UPDATE @ModifiedZScoreTable
SET MembershipScore = (ReversedZScore - @RescalingMin) / @RescalingRange

-- DERIVING THE UNION
DECLARE @UnionTable table
(PrimaryKey sql_variant, Value decimal(38,6), MembershipScoreForSet1 decimal(38,6), MembershipScoreForSet2 decimal(38,6),
StandardUnion AS (CASE WHEN MembershipScoreForSet1 >= MembershipScoreForSet2 THEN MembershipScoreForSet1 ELSE MembershipScoreForSet2 END),
DrasticUnion AS (CASE WHEN MembershipScoreForSet1 = 0 THEN MembershipScoreForSet2 WHEN MembershipScoreForSet2 = 0 THEN MembershipScoreForSet1 ELSE 1 END),
AlgebraicSum AS (MembershipScoreForSet1 + MembershipScoreForSet2) - (MembershipScoreForSet1 * MembershipScoreForSet2),
BoundedSum AS (CASE WHEN MembershipScoreForSet1 + MembershipScoreForSet2 > 1 THEN 1 ELSE MembershipScoreForSet1 + MembershipScoreForSet2 END)
)

INSERT INTO @UnionTable
(PrimaryKey, Value, MembershipScoreForSet1, MembershipScoreForSet2)
SELECT CASE WHEN T1.PrimaryKey IS NULL THEN T2.PrimaryKey ELSE T1.PrimaryKey END,
CASE WHEN T1.Value IS NULL THEN T2.Value ELSE T1.Value END,
IsNull(T1.MembershipScore, 0), IsNull(T2.MembershipScore, 0)
FROM (SELECT PrimaryKey, Value, MembershipScore
       FROM @ZScoreTable WHERE Value != 142) AS T1
       FULL JOIN (SELECT PrimaryKey, Value, MembershipScore
              FROM @ModifiedZScoreTable WHERE Value != 147) AS T2
       ON T1.PrimaryKey = T2.PrimaryKey
       WHERE T1.Value IS NOT NULL OR T2.Value IS NOT NULL

SELECT *
FROM @UnionTable
ORDER BY DrasticUnion ASC

Figure 2: Sample Results from the Duchennes Table[ii]
fuzzy-union-basic-results

…………Basically, the four computed columns in Figure 1 implement the Standard Union, Drastic Union, Bounded Sum and Algebraic Sum by inverting the operations used for the corresponding fuzzy intersections.[iii] The only record which has a zero membership value for the first set also has a zero membership value for the second, hence it’s assigned a fuzzy union value of 0, so there are no Drastic Union values between 0 and 1 in this case. A few weeks ago I had to order the fuzzy intersection results by the DrasticIntersection DESC in order to depict any difference in the return values of that more restrictive function, but this time I had to order by the DrasticUnion in the opposite direction to do the same. That is because the two operations are basically inverses of each other, with the Drastic Intersection being the most restrictive of the fuzzy intersections and the Drastic Union the most expansive of the fuzzy unions.
…………The more complex T-conorms cited by Klir and Yuan and other fuzzy set experts also implement minimum bounds for membership in the resultset, just in a more specific manner than the four listed above.[iv] The code in Figure 3 is not much different from that of Figure 3 in the last article, except for the substitution of MIN operators for MAX, plugging in a reciprocal here and there and other such tweaks. Once again I used the VALUES trick posted by Jamie Thomson a few years back to perform row-by-row MAX and MIN operations, instead of using them to retrieve aggregates for the whole table.[v] The fuzzy unions are named after the same authors mentioned last week, with those by József Dombi, M.J. Frank, Horst Hamacher, Yandong Yu, Didier Dubois, Henri Prade, Ronald R. Yager, Michio Sugeno and Siegfried Weber segregated into the top query. For the sake of convenience and readability, I put the four T-conorms developed by Berthold Schweizer and Abe Sklar in the second query. A couple of differences include the @OneDividedByLambdaParameter and the complicated CASE statement for the SchweizerAndSklar3 fuzzy union, which are designed to strain out some opportunities for Divide By Zero and Invalid Floating Point Operation errors. The WHERE clauses in both SELECTs are designed to do the same through brute force, instead of cluttering the code, which is only for demonstration purposes. I suggest more strongly than ever that my calculations be rechecked if accuracy is important for your use cases, since there are few step-by-step examples with sample data in the literature for me to validate these on.
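…………For anyone who hasn’t run across Thomson’s VALUES trick before, the stripped-down example below shows the pattern in isolation, with made-up column names: the subquery simply treats a handful of scalar expressions as a one-column derived table so that MIN or MAX can operate across them row by row.

-- row-by-row minimum and maximum across three hypothetical columns, without UNPIVOT or nested CASEs
SELECT Score1, Score2, Score3,
(SELECT MIN(Value) FROM (VALUES (Score1), (Score2), (Score3)) AS T1(Value)) AS RowMinimum,
(SELECT MAX(Value) FROM (VALUES (Score1), (Score2), (Score3)) AS T1(Value)) AS RowMaximum
FROM (VALUES (0.2, 0.5, 0.9), (0.7, 0.1, 0.4)) AS T2(Score1, Score2, Score3)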

Figure 3: Code for Several Popular Types of T-Conorm Fuzzy Unions
SELECT PrimaryKey, Value, MembershipScoreForSet1, MembershipScoreForSet2,
1 - ((1 - MembershipScoreForSet1) * (1 - MembershipScoreForSet2)) / (SELECT MAX(Value) FROM (VALUES (1 - MembershipScoreForSet1),(1 - MembershipScoreForSet2),(@AlphaParameter)) AS T1(Value)) AS DuboisPradeTConorm, -- basically the same as the intersection except we add some 1 minuses; I adapted this trick from http://sqlblog.com/blogs/jamie_thomson/archive/2012/01/20/use-values-clause-to-get-the-maximum-value-from-some-columns-sql-server-t-sql.aspx
Power(1 + Power(Power(CAST(1 / MembershipScoreForSet1 AS float) - 1, -@LambdaParameter) + Power(CAST(1 / MembershipScoreForSet2 AS float) - 1, -@LambdaParameter), CAST(-1 / @LambdaParameter AS float)), -1) AS DombiTConorm,
1 - Log(1 + (((CAST(Power(@SParameter, 1 - MembershipScoreForSet1) AS float) - 1) * (CAST(Power(@SParameter, 1 - MembershipScoreForSet2) AS float) - 1)) / (@SParameter - 1)), @SParameter) AS Frank1979TConorm, -- basically the same as the intersection except we add some 1 minuses
MembershipScoreForSet1 + MembershipScoreForSet2 + (@RParameter - 2) * (MembershipScoreForSet1 * MembershipScoreForSet2) / (@RParameter + (@RParameter - 1) * (MembershipScoreForSet1 * MembershipScoreForSet2)) AS HamacherTConorm,
(SELECT MIN(Value) FROM (VALUES (1), (Power(Power(MembershipScoreForSet1, @OmegaParameter) + Power(MembershipScoreForSet2, @OmegaParameter), 1 / CAST(@OmegaParameter AS float)))) AS T1(Value)) AS YagerTConorm,
(SELECT MIN(Value) FROM (VALUES (1), ((MembershipScoreForSet1 + MembershipScoreForSet2 + (@LambdaParameter * MembershipScoreForSet1 * MembershipScoreForSet2)))) AS T1(Value)) AS YuTConorm,
(SELECT MIN(Value) FROM (VALUES (1), ((MembershipScoreForSet1 + MembershipScoreForSet2 - (@LambdaParameter / @OneDividedByLambdaParameter) * (MembershipScoreForSet1 * MembershipScoreForSet2)))) AS T1(Value)) AS WeberTConorm
FROM @UnionTable
WHERE MembershipScoreForSet1 != 0 AND MembershipScoreForSet2 != 0

SELECT PrimaryKey, Value, MembershipScoreForSet1, MembershipScoreForSet2,
1 - Power((SELECT MAX(Value) FROM (VALUES (0), (Power(MembershipScoreForSet1, @PParameter) + Power(MembershipScoreForSet2, @PParameter) - 1)) AS T1(Value)), 1 / CAST(@PParameter AS float)) AS SchweizerSklarTConorm1, -- just the inverse of the same intersection
Power(Power(MembershipScoreForSet1, @PParameter) + Power(MembershipScoreForSet2, @PParameter) - (Power(MembershipScoreForSet1, @PParameter) * Power(MembershipScoreForSet2, @PParameter)), 1 / CAST(@PParameter AS float)) AS SchweizerAndSklar2,
CASE WHEN 1 - MembershipScoreForSet1 <= 0 AND 1 - MembershipScoreForSet2 >= 0 THEN 1 - EXP(-1 * Power(Power(1, @PParameter) + Power(Abs(Log(1 - MembershipScoreForSet2)), @PParameter), 1 / CAST(@PParameter AS float)))
WHEN 1 - MembershipScoreForSet1 >= 0 AND 1 - MembershipScoreForSet2 <= 0 THEN 1 - EXP(-1 * Power(Power(Abs(Log(1 - MembershipScoreForSet1)), @PParameter) + Power(1, @PParameter), 1 / CAST(@PParameter AS float)))
ELSE 1 - EXP(-1 * Power(Power(Abs(Log(1 - MembershipScoreForSet1)), @PParameter) + Power(Abs(Log(1 - MembershipScoreForSet2)), @PParameter), 1 / CAST(@PParameter AS float))) END AS SchweizerAndSklar3,
((1 - MembershipScoreForSet1) * (1 - MembershipScoreForSet2)) / Power(Power((1 - MembershipScoreForSet1), @PParameter) + Power((1 - MembershipScoreForSet2), @PParameter) - (Power(1 - MembershipScoreForSet1, @PParameter) * Power(1 - MembershipScoreForSet2, @PParameter)), 1 / CAST(@PParameter AS float)) AS SchweizerAndSklar4 -- basically the same as the intersection except we add some 1 minuses
FROM @UnionTable
WHERE MembershipScoreForSet1 != 0 AND MembershipScoreForSet2 != 0

Figure 4: T-Conorm Results (click to enlarge)


…………So what's the point of going to all of this trouble to devise fuzzy unions and intersections? In many use cases this level of computation wouldn't be of much benefit, but for others we can leverage the different shapes of the function returns depicted in Figure 4 to model different types of uncertainty; in fact, we could combine them into new T-norms and T-conorms more suited to our purposes if necessary.[vi] Theoreticians prefer them not only because they mesh well with fuzzy intersections and unions, but because they're well-studied. Apparently, there are at least five main research papers[vii] that detail which use cases match the particular T-norms and T-conorms I've introduced in the last couple of articles, but unfortunately, I was unable to get my hands on any of them because of my limited budget.
…………I did manage to find some more recent papers on tangential matters like linguistic connectives, such as AND and OR, which were at least helpful in outlining the main principles for selecting the right ones.[viii] Such discussions usually lead in two directions, one of which is using AND/OR logical operations to aggregate opinions and factor in costs in Decision Theory. Klir and Yuan include a well-written section on how to perform such related tasks as weighting goals, actions and constraints; aggregating multiperson decisions and criteria over multiple stages; selecting fuzzy or crisp responses and using fuzzy input states; and implementing it all with state transition matrices, a topic I touched on briefly in A Rickety Stairway to SQL Server Data Mining, Algorithm 8: Sequence Clustering.[ix]
…………The other direction is towards constructing fuzzy logic arguments from the same building blocks, like NOT, AND and OR. Some surveys have found surprisingly subtle shades of meaning in ordinary linguistic terms like AND and OR, which factor into the selection of the right T-norms and T-conorms. Once I can get my hands on those sources I may tack another article onto the back end of this series, but for now I will have to leave readers dangling. One other suggestion I might offer for matching use cases is to use the Empirical Distribution Functions (EDFs) I introduced in my tutorial series Goodness-of-Fit Testing with SQL Server, perhaps along with scoring systems like the Kolmogorov-Smirnov Test. In crude terms, these are akin to measuring how closely two histograms match each other, so plugging in the T-norms and T-conorms alongside an ideal curve ought to give us a good starting point. Linear regression is a related technique in the same vein. Some alternatives mentioned elsewhere by Klir and Yuan include using maximum likelihood estimation (MLE) methods, curve fitting and neural nets to derive appropriate parameter values for the T-norms and T-conorms.
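…………For readers who want to experiment with the EDF suggestion, the hypothetical sketch below computes the Kolmogorov-Smirnov statistic, i.e. the largest gap between the empirical distribution functions of two columns; the @UnionResults table variable, its TConormValue and IdealValue columns and the sample values are all placeholders of my own invention, not anything from the code above. The T-conorm whose output yields the smallest statistic against the ideal curve is arguably the closest fit.

-- a minimal sketch of a two-sample Kolmogorov-Smirnov comparison; all object names and sample values are hypothetical
DECLARE @UnionResults table (ID bigint IDENTITY(1,1), TConormValue float, IdealValue float)

INSERT INTO @UnionResults (TConormValue, IdealValue)
VALUES (0.12, 0.15), (0.45, 0.40), (0.77, 0.80), (0.93, 0.90)

DECLARE @Count float
SELECT @Count = Count(*) FROM @UnionResults

;WITH PooledValues AS
(SELECT TConormValue AS Value FROM @UnionResults
UNION
SELECT IdealValue FROM @UnionResults)
SELECT MAX(ABS(EDF1 - EDF2)) AS KolmogorovSmirnovStatistic
FROM (SELECT Value,
       (SELECT COUNT(*) FROM @UnionResults WHERE TConormValue <= P.Value) / @Count AS EDF1, -- EDF of the T-conorm output at this pooled value
       (SELECT COUNT(*) FROM @UnionResults WHERE IdealValue <= P.Value) / @Count AS EDF2 -- EDF of the ideal membership curve at the same value
       FROM PooledValues AS P) AS T1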

The Bridge to Fuzzy Logic

                As is often the case in the fuzzy set field, simplicity quickly gives way to complexity; just as there are many more ways of defining fuzzy complements, intersections and unions than their crisp counterparts, so too are there many types of fuzzy logic. I'll omit any in-depth discussion of alternatives like Lukasiewicz logic, since they're beyond the range of ordinary SQL Server use cases. I'll limit my comments to pointing out that the translation of fuzzy complements, intersections and unions into fuzzy NOT, AND and OR statements is your ticket across that bridge, should you need to cross it. Also keep in mind that there may be trolls beneath that bridge. Some of the steps are well-traveled, like certain multi-valued logics, which are really a matter of common sense; even a child can grasp the concept of inserting an answer like "maybe" between a Boolean range of simple "yes" and "no" answers. Other innovations in the field of fuzzy logic smell of intellectual shock value, which may help a scholar to publish rather than perish; watch out for some of the tell-tale signs, like the justification of contradiction as a form of knowledge or the misuse of subjective evidence. In such cases, dreaming up a sly mathematical justification for madness may get a person tenure, but does nothing to advance the cause of human knowledge. A solid mathematical scaffolding is available for the less glitzy forms of fuzzy logic in case you need to climb it, but be aware that it requires some mental gymnastics to navigate.
…………At first glance, even the most reliable brands of fuzzy logic can lead to apparent violations of Aristotle's three laws of logic, like the Law of Contradiction, the Law of the Excluded Middle and the Law of Identity.[x] How can a thing be yet not be at the same time, or exist in a ghost-like state between being and not-being, or be something other than what it is? To keep from having your mind blown, just remember from Implementing Fuzzy Sets in SQL Server, Part 2: Measuring Imprecision with Fuzzy Complements how records can have partial membership in both a fuzzy set and its complement. That's an easily digestible example of how a thing "can be" and "not be" at the same time; the false discrepancy arises from applying ordinary crisp, Boolean-valued logic to a set that was defined on a fuzzy range of values. In the same way, many of the apparent violations of Aristotle's principles[xi] only arise from similar misapplications of fuzzy to crisp sets and vice-versa; in fact, this is usually the hidden culprit when we encounter the kind of manic misuse of fuzziness I addressed in Implementing Fuzzy Sets in SQL Server, Part 1: Membership Functions and the Fuzzy Taxonomy. It would be a mistake to think that Aristotelian logic is incapable of incorporating fuzzy set theory – as many authors in the field seem to suggest – given that Aristotle himself mentioned in his Metaphysics that ambiguity of definitions only leads to the appearance of contradictions.[xii]

Putting Fuzzy Logic to Use

                Once we get over this bump in the road, it is possible to construct towers of logical propositions from fuzzy sets, using tried-and-true methods like generalized modus ponens, generalized modus tollens, generalized hypothetical syllogisms, the intersection/product syllogism, approximate reasoning, multi-valued and interval-valued reasoning, as well as making inferences from "quantified propositions." [xiii] This includes the Kleene-Dienes, Reichenbach-Lukasiewicz and Goguen brands of fuzzy implications, which are a fuzzified version of regular syllogistic reasoning that leads into the topic of fuzzy expert systems.[xiv] As Klir and Yuan point out, "To select an appropriate implication for approximate reasoning under each particular situation is a difficult problem. Although some theoretically supported guidelines are now available for some situations, we are still far from a general solution to this problem."[xv] Fuzzy logic techniques can even enable us to make exotic apples-and-oranges comparisons of a kind rarely seen in ordinary speech, like Taner Bilgic and I.B. Turksen's examples of "John is taller than he is clever," "Inventory is higher than it is low," "Coffee is at least as unhealthy as it is tasty," and "Her last novel is more political than it is confessional."[xvi] These are all examples of sentences with two disparate and imprecisely defined objects, each of which has membership functions measured on the same standard fuzzy scale of 0 to 1. The question becomes even more complicated when we use sentences with two objects and two subjects, like "I am taller than you are skinny," which in fuzzy set terms means "I belong to the set of tall people more than you belong to the set of skinny people."[xvii]
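…………As a small taste of how such fuzzy implications might be coded, the hypothetical snippet below calculates three of the best-known ones for a single pair of grades, using the definitions I've seen most often in the literature: the Kleene-Dienes implication takes the maximum of the negated antecedent and the consequent, the Reichenbach version subtracts the antecedent from 1 and adds back the product of the two, and the Goguen implication returns 1 whenever the antecedent doesn't exceed the consequent and the ratio of the two otherwise. The variable names are merely illustrative and, as usual, my renditions ought to be double-checked against a trusted source before being put to real use.

DECLARE @Antecedent float = 0.8, @Consequent float = 0.3

SELECT CASE WHEN 1 - @Antecedent >= @Consequent THEN 1 - @Antecedent ELSE @Consequent END AS KleeneDienesImplication, -- max(1 - a, b) = 0.3
1 - @Antecedent + (@Antecedent * @Consequent) AS ReichenbachImplication, -- 0.2 + 0.24 = 0.44
CASE WHEN @Antecedent <= @Consequent THEN 1 ELSE @Consequent / @Antecedent END AS GoguenImplication -- 0.3 / 0.8 = 0.375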
…………Fuzzy logic certainly has many legitimate uses, but I'll omit any further discussion of the topic because of the obvious complexity involved. To that we can tack on the lack of direct application to use cases SQL Server users are liable to encounter anytime soon. Data mining and decision support are definitely within the realm of legitimate SQL Server use cases, but other types of "soft computing" like fuzzy expert systems represent advanced cases that are far beyond the scope of this series. For those with a need to know, Klir and Yuan include an entire chapter on how to assemble fuzzy expert systems out of individual components like knowledge bases, knowledge acquisition modules, explanatory interfaces, "blackboard interfaces" and inference engines, which are composed of chains of fuzzy IF-THEN clauses.[xviii] Of course, scores of books have been written on fuzzy expert systems since then, most of which seem to be just as heavy on arcane fuzzy math. Two articles from now, I'll explain how a much simpler fuzzy math construct known as a fuzzy number can be easily implemented in SQL Server to address certain use cases that we encounter every day, particularly in modeling fuzzy linguistic statements like "about half" and "almost all" seen in Behavior-Driven Development (BDD) and user stories. First I'll dispense with another topic that DBAs and data miners encounter on a daily basis, alongside set relations like unions and intersections: the all-pervasive LEFT JOIN operator. In the next installment, I'll take a stab at explaining its striking absence in the fuzzy set literature.

[i] p. 77, Klir, George J. and Yuan, Bo, 1995, Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall: Upper Saddle River, N.J.

[ii] These figures were derived from the Duchennes muscular dystrophy dataset made publicly available by Vanderbilt University’s Department of Biostatistics, which I have been using for practice data for the last couple of tutorial series.

[iii] p. 78, Klir and Yuan.

[iv] IBID., p. 82.

[v] Thomson, Jamie, 2012, “Use VALUES Clause to Get the Maximum Value from Some Columns,” published Jan. 20, 2012 at the SQLBlog web address http://sqlblog.com/blogs/jamie_thomson/archive/2012/01/20/use-values-clause-to-get-the-maximum-value-from-some-columns-sql-server-t-sql.aspx

[vi] pp. 75, 83, Klir and Yuan.

[vii] IBID., p. 95. These are the five sources, for anyone interested in implementing these T-norms and T-conorms:

  • Thole, U.; Zimmerman, Hans J. and Zysno, P., 1979, “On the Suitability of Minimum and Product Operators for the Intersection of Fuzzy Sets,” pp.167-180 in Fuzzy Sets and Systems, April 1979. Vol. 2, No. 2.
  • Yager, Ronald R., 1979, “A Measurement-Informational Discussion of Fuzzy Union and Intersection,” pp. 189-200 in International Journal of Man-Machine Studies, March 1979. Vol. 11, No. 2.
  • Yager, Ronald R., 1982, “Some Procedures for Selecting Fuzzy Set-Theoretic Operations” pp. 235-242 in International Journal of General Systems, 1982. Vol. 8.
  • Zimmerman, Hans J. 1978, “Results of Empirical Studies in Fuzzy Set Theory,” pp. 303-312 in Applied General Systems Research, NATO Conference Series. Vol. 5.
  • Zimmerman, Hans J. and Zysno, P., 1980, "Latent Connectives in Human Decision Making," pp. 37-51 in Fuzzy Sets and Systems, July 1980, Vol. 4, No. 1.

[viii] Among them were Alsina, C.; Trillas E. and Valverde, L. , 1983, “On Some Logical Connectives for Fuzzy Sets Theory,” pp. 15-26 in Journal of Mathematical Analysis and Applications. Vol. 93; Dubois, Didier and Prade, Henri, 1985, “A Review of Fuzzy Set Aggregation Connectives,” pp. 85-121 in Information Sciences, July-August, 1985. Vol. 36, Nos. 1-2.

[ix] pp. 391-405, Klir, George J. and Yuan, Bo, 1995, Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall: Upper Saddle River, N.J. On this particular page, they’re extending the meaning of the term even further, to complex network topologies.

[x] Dawkins, Jamie, 2010, “Aristotle’s Three Laws of Thought,” published July 26, 2010 at The Year to the Present blog address https://jamescarterdawkins.wordpress.com/2010/07/26/aristotles-three-laws-of-thought/

[xi] For a handy table of these, see p. 8, Klir and Yuan.

[xii] I’ve read very little Aristotle first-hand, so I’ll have to rely on second-hand quotes from him, such as this one: “It is impossible, then, that ‘being a man’ should mean precisely ‘not being a man’, if ‘man’ not only signifies something about one subject but also has one significance. … And it will not be possible to be and not to be the same thing, except in virtue of an ambiguity, just as if one whom we call ‘man’, and others were to call ‘not-man’; but the point in question is not this, whether the same thing can at the same time be and not be a man in name, but whether it can be in fact. (Metaphysics4.4, W.D. Ross (trans.), GBWW 8, 525–526).” Dawkins, Jamie, 2010, “Aristotle’s Three Laws of Thought,” published July 26, 2010 at The Year to the Present blog address https://jamescarterdawkins.wordpress.com/2010/07/26/aristotles-three-laws-of-thought/

[xiii]  pp. 235, 239, Klir and Yuan.

[xiv] pp. 304, 306-307, Klir  and Yuan.

[xv] p. 312, Klir and Yuan.

[xvi] p. 16, Bilgic, Taner and Turksen, I.B. August 1994, “Measurement–Theoretic Justification of Connectives in Fuzzy Set Theory,” pp. 289–308 in Fuzzy Sets and Systems, January 1995. Vol. 76, No. 3.

[xvii] IBID., p. 17.

[xviii] pp. 302-324, Klir and Yuan.

Implementing Fuzzy Sets in SQL Server, Part 3: Using Fuzzy Intersections as AND Statements

By Steve Bolton

…………Whenever we assign set membership grades to records on a continuous scale, we open up a whole new world of possibilities for measuring uncertainty and modeling different types of imprecision. Two articles ago in this series of amateur self-tutorials, we saw how a whole taxonomy of fuzzy set types can be derived by applying different combinations of membership grades; these fuzzy set types allow us to model imprecise natural language terms that are more difficult to handle with traditional “crisp” sets. In the last installment, we saw how crisp sets have only one type of complement, but how the concept of graded membership naturally leads to a whole smorgasbord of fuzzy complements. These can be quite useful in developing measures of imprecision that are tailor-made for specific real-world problems. Binary relations between fuzzy sets can likewise be put to good use, especially in the development of other measures that are useful in uncertainty management programs, but reside at a higher level of complexity than complements since multiple sets are involved.
…………All of the set relations I'll be introducing in the next three articles, like fuzzy unions and fuzzy joins, are defined on some subset of the crisp Cartesian product of the two sets. I've seen several discussions in the literature of how to model fuzzy relations as cylindrical projections of the Cartesian product in multidimensional space, which has uses in maximizing nonspecificity[1] – a metric of uncertainty I'll discuss later in this series – but for now I'll keep the discussion as down-to-earth as I can. Fuzzy set relations are essentially equivalent to defining a fuzzy membership function on the set that results from an operation, which leads to myriad ways of assigning grades to the relation. [2] This in turn results in a dizzying array of choices that are not available with crisp relations, which empowers us to create new measures, at the risk of getting lost in the complexity of all the available choices. I can hardly pretend to be an authority on which choices are the best match for particular use cases, but I can at least introduce the topic in the hopes that I can learn as I go and help others to at least avoid my inevitable mistakes.

Four Types of Fuzzy Intersections

                In the case of intersections, we're basically asking SQL Server, "Give me all of the records that are members of both sets." Partial membership introduces the need for some kind of math operation to assign a grade to the memberships of the resulting set, which essentially boils down to taking the minimum values of the membership grades of the two constituent sets. This is one of the unifying principles of fuzzy intersections, which differ only in the manner in which they select those minimums. The simplest ones are the Standard Intersection, which is merely the minimum of the two sets' membership grades; the algebraic product, in which the membership grades of the corresponding rows are merely multiplied together; the Bounded Difference, which is either 0 or the two set grades added together minus one, whichever is higher; and the Drastic Intersection, which is the membership value of the first set when the second equals one or vice-versa, but 0 otherwise.[3]
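…………Before wading into the sample code, it may help to see those four definitions applied to a single hypothetical pair of membership grades; the snippet below is merely a pocket-calculator version of the computed columns that appear in Figure 1, with 0.8 and 0.5 standing in for the rescaled Z-Score grades.

DECLARE @Grade1 decimal(38,6) = 0.8, @Grade2 decimal(38,6) = 0.5

SELECT CASE WHEN @Grade1 <= @Grade2 THEN @Grade1 ELSE @Grade2 END AS StandardIntersection, -- the simple minimum: 0.5
@Grade1 * @Grade2 AS AlgebraicProduct, -- 0.4
CASE WHEN @Grade1 + @Grade2 - 1 > 0 THEN @Grade1 + @Grade2 - 1 ELSE 0 END AS BoundedDifference, -- 0.3
CASE WHEN @Grade1 = 1 THEN @Grade2 WHEN @Grade2 = 1 THEN @Grade1 ELSE 0 END AS DrasticIntersection -- 0, since neither grade equals 1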
…………The code in Figure 1 implements all four of these, in a manner similar to the way I illustrated the use of membership functions and fuzzy complements in the last articles. The stored procedure I wrote for Outlier Detection with SQL Server, part 2.1: Z-Scores is pulled in here for double-duty as the membership function of my first set, while the alternate version I used in Outlier Detection with SQL Server, part 2.2: Modified Z-Scores is used for the second. The GroupRank and OutlierCandidate columns were members of the original procedures and could not be left out of the INSERT EXEC statements, but can be safely ignored. The @RescalingMax, @RescalingMin and @RescalingRange values are used in conjunction with the ReversedZScore of the two table variables to normalize the stored procedure results within the industry-standard 0 to 1 range for fuzzy sets. We could of course define much simpler membership functions and do without the stored procedures, but I wanted to demonstrate the usefulness of fuzzy intersections in a more meaningful way. We could also do without these two table variables and the third which holds the intersection values, but it is more practical in this situation to calculate all of this once only, since we have to demonstrate many different types of intersections afterwards. Since the membership values of both sets often differ, using an INTERSECT operator might backfire by discarding those that don’t match. Hence, I’ve found it easier in cases like this to use an INNER JOIN. The WHERE clause that follows merely illustrates how we can create an intersection on sets that have different crisp members, regardless of their membership grades. The code is lengthy because of the need for the three table variables in this particular instance, but is not difficult at all, once you strip away all of this preparatory code and see that the meat and potatoes can be found in the @IntersectionTable’s four computed columns.

Figure 1: Sample Code for a Fuzzy Intersection on Two Different Types of Z-Scores
-- CREATING THE FUZZY SET TEST DATA
DECLARE
@RescalingMax decimal(38,6), @RescalingMin decimal(38,6), @RescalingRange decimal(38,6)

DECLARE   @ZScoreTable table
(PrimaryKey sql_variant,
Value decimal(38,6),
ZScore decimal(38,6),
ReversedZScore as CAST(1 as decimal(38,6)) - ABS(ZScore),
MembershipScore decimal(38,6),
GroupRank bigint
) 

DECLARE   @ModifiedZScoreTable table
(PrimaryKey sql_variant,
Value decimal(38,6),
ZScore decimal(38,6),
ReversedZScore as CAST(1 as decimal(38,6)) - ABS(ZScore),
MembershipScore decimal(38,6),
GroupRank bigint,
OutlierCandidate bit
)

INSERT INTO @ZScoreTable
(PrimaryKey, Value, ZScore, GroupRank)
EXEC   Calculations.ZScoreSP
              @DatabaseName = N'DataMiningProjects',
              @SchemaName = N'Health',
              @TableName = N'DuchennesTable',
              @ColumnName = N'LactateDehydrogenase',
              @PrimaryKeyName = N'ID',
              @DecimalPrecision = '38,32',
              @OrderByCode = 8

-- RESCALING
SELECT @RescalingMax = Max(ReversedZScore), @RescalingMin= Min(ReversedZScore) FROM @ZScoreTable
SELECT @RescalingRange = @RescalingMax - @RescalingMin

UPDATE @ZScoreTable
SET MembershipScore = (ReversedZScore - @RescalingMin) / @RescalingRange

INSERT INTO @ModifiedZScoreTable
(PrimaryKey, Value, ZScore, GroupRank, OutlierCandidate)
EXEC   Calculations.ModifiedZScoreSP
              @DatabaseName = N'DataMiningProjects',
              @SchemaName = N'Health',
              @TableName = N'DuchennesTable',
              @ColumnName = N'LactateDehydrogenase',
              @PrimaryKeyName = N'ID',
              @OrderByCode = 8,
              @DecimalPrecision = '38,32'

-- RESCALING
SELECT @RescalingMax = Max(ReversedZScore), @RescalingMin= Min(ReversedZScore) FROM @ModifiedZScoreTable
SELECT @RescalingRange = @RescalingMax - @RescalingMin

UPDATE @ModifiedZScoreTable
SET MembershipScore = (ReversedZScore - @RescalingMin) / @RescalingRange

-- DERIVING THE INTERSECTION
DECLARE  @IntersectionTable table
(PrimaryKey sql_variant,
Value decimal(38,6),
MembershipScoreForSet1 decimal(38,6),
MembershipScoreForSet2 decimal(38,6),
StandardIntersection AS (CASE WHEN MembershipScoreForSet1 <= MembershipScoreForSet2 THEN MembershipScoreForSet1
ELSE MembershipScoreForSet2 END),
DrasticIntersection AS (CASE WHEN MembershipScoreForSet1 = 1 THEN MembershipScoreForSet2
WHEN MembershipScoreForSet2 = 1 THEN MembershipScoreForSet1
ELSE 0 END),
AlgebraicProduct AS MembershipScoreForSet1 * MembershipScoreForSet2,
BoundedDifference AS (CASE WHEN MembershipScoreForSet1 + MembershipScoreForSet2 - 1 > 0 THEN MembershipScoreForSet1 + MembershipScoreForSet2 - 1
ELSE
0 END)
) 

INSERT INTO @IntersectionTable
(PrimaryKey, Value, MembershipScoreForSet1, MembershipScoreForSet2)
SELECT T1.PrimaryKey, T1.Value, T1.MembershipScore, T2.MembershipScore
FROM @ZScoreTable AS T1
       INNER JOIN @ModifiedZScoreTable AS T2
       ON T1.PrimaryKey = T2.PrimaryKey
       WHERE T1.Value != 142 AND T2.Value != 147

SELECT *
FROM @IntersectionTable
ORDER BY DrasticIntersection DESC

 Figure 2: Sample Results from the Duchennes Table

…………Note how quickly the Drastic Intersection drops off to 0, when we order by that particular column. The other three undulate in comparison, but do so at varying rates I don't have screen space to fully illustrate here. That is because there are many duplicate values for the Lactate Dehydrogenase column of the practice dataset I used for Figure 2, which is full of Duchennes muscular dystrophy data I downloaded from Vanderbilt University's Department of Biostatistics a few tutorial series ago.
…………The different shapes of the function results can be leveraged to meet various use cases; for example, if the problem at hand calls for a rapid drop-off in intersection membership values, the Drastic Intersection might be ideal. As I’ve pointed out a few times in this series, fuzzy sets are much easier to understand once you recognize that they’re ideal for modeling imprecise natural language terms, including many found in everyday speech like “about” or “near,” not just broad high-tech concepts like “I/O bottleneck.” Membership grades should not be interpreted as probabilities unless they’re explicitly designed to measure uncertainty about stochastic measures. In this case, we’re deliberately assigning meanings to the two membership sets that are comparable as outlier detection methods, even though they measure aberrance in different ways; the purpose here is not to teach anything about Z-Scores, but to use a familiar concept to illustrate an unfamiliar one. The results here might be expressed in natural language as, “when both methods of Z-scores are taken into account, Lactate Dehydrogenase values of 198 are very close to the center point of the dataset. They have high membership value in the set of records that are NOT outliers.” As the values of each type of intersection move in the other direction, that becomes less true.

Upgrading to T-Norms

                These four intersection types are simple instances of a broader class of functions known as triangular norms (T-Norms) that have been shown to be equivalent to fuzzy intersections, according to various mathematical theorems. The Standard Intersection is basically identical to Gödel’s T-norm, while the algebraic product is equivalent to the Product T-norm and the Bounded Difference to the Lukasiewicz T-norm.[4] The functions in this family satisfy certain obligatory mathematical properties, like commutativity, associativity, monotonicity and sometimes also continuity (i.e. a continuous function).[5] An Archimedean T-norm has the additional property of subidempotency, while a strict Archimedean T-norm also has strict monotonicity.[6] I won’t bore you with the formal definitions of these things, let alone the mathematical proofs and equations, since it is essentially my function to explain this to readers who don’t have a mathematical background; there is a crying need in the data mining field for bridges between the underlying theories and the end users, in the same way that mechanics fill a gap between drivers and automotive engineers. I suggest that anyone who needs this extra detail consult my favorite reference, George J. Klir and Bo Yuan’s Fuzzy Sets and Fuzzy Logic: Theory and Applications, which I will be leaning on through much of this series. They also get into a discussion of concepts like decreasing and increasing generators and pseudo-inverses and how to use them in the construction of T-norms, but that is also beyond the bounds of this introduction.[7]
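…………Rather than reciting the formal definitions, a throwaway query may suffice to convey what two of those defining properties mean in practice: commutativity requires a T-norm to return the same value regardless of the order of its two arguments, while the standard boundary condition requires it to hand back the first argument untouched whenever the second equals 1. The sketch below checks both for the Product T-norm with an arbitrary pair of grades.

DECLARE @Grade1 float = 0.62, @Grade2 float = 0.35

SELECT @Grade1 * @Grade2 AS ProductTNorm, -- 0.217
@Grade2 * @Grade1 AS ReversedArguments, -- identical result, demonstrating commutativity
@Grade1 * 1 AS BoundaryCondition -- returns the first grade untouched when the second grade equals 1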
…………Just as we were able to pile different types of fuzziness on top of each other a few articles ago to create a whole fuzzy set taxonomy, so too can we combine T-norms to form an endless array of more complex T-norms.[8] I'll leave that to the imagination and needs of the reader and just give examples of how to code 11 of the more common T-norms. Figure 3 contains additional code that can be tacked on to Figure 1 to calculate four types of T-norms developed by Berthold Schweizer and Abe Sklar and others likewise developed in the '70s and '80s by József Dombi, M.J. Frank, Horst Hamacher, Yandong Yu, Didier Dubois and Henry Prade.[9] Ronald R. Yager, the author of the Yager Complement mentioned in the last article, also developed a corresponding fuzzy intersection that likewise takes an @OmegaParameter variable as an input, while Michio Sugeno, the creator of the Sugeno Complement, devised a T-norm that was independently rediscovered by Siegfried Weber. The declaration section at the top also includes parameters required by some of the other authors' formulas. These parameters allow us to vary the shape of each T-norm to suit particular use cases, just as we did with the parameters of the Sugeno and Yager Complements in the last article.

Figure 3: Code for Several Popular Types of T-Norm Fuzzy Intersections
DECLARE @AlphaParameter float = 1, -- 0 to 1
@PParameter float = 1, -- <> 0 for Schweizer and Sklar 1, > 0 otherwise
@RParameter float = 1,
@SParameter float = 0.7, -- must be > 0 but not = 1 for Frank1979
@LambdaParameter float = 1, -- > -1
@OmegaParameter float = 1, -- ω must be > 0
@Pi decimal(38,37) = 3.1415926535897932384626433832795028841 -- from http://www.eveandersson.com/pi/digits/100

SELECT PrimaryKey, Value, MembershipScoreForSet1, MembershipScoreForSet2,
(MembershipScoreForSet1 * MembershipScoreForSet2) / (SELECT MAX(Value) FROM (VALUES (MembershipScoreForSet1),(MembershipScoreForSet2),(@AlphaParameter)) AS T1(Value)) AS DuboisPradeTNorm,
Power(1 + Power(Power(CAST(1 / MembershipScoreForSet1 AS float) - 1, @LambdaParameter) + Power(CAST(1 / MembershipScoreForSet2 AS float) - 1, @LambdaParameter), CAST(1 / @LambdaParameter AS float)), -1) AS DombiTNorm,
Log(1 + (((CAST(Power(@SParameter, MembershipScoreForSet1) AS float) - 1) * (CAST(Power(@SParameter, MembershipScoreForSet2) AS float) - 1)) / (@SParameter - 1)), @SParameter) AS Frank1979TNorm,
(MembershipScoreForSet1 * MembershipScoreForSet2) / (@RParameter + (1 - @RParameter) * (MembershipScoreForSet1 + MembershipScoreForSet2 - (MembershipScoreForSet1 * MembershipScoreForSet2))) AS HamacherTNorm,
(SELECT MAX(Value) FROM (VALUES (0), ((MembershipScoreForSet1 + MembershipScoreForSet2 + (@LambdaParameter * MembershipScoreForSet1 * MembershipScoreForSet2) - 1) / CAST(1 + @LambdaParameter AS float))) AS T1(Value)) AS WeberTNorm,
(SELECT MAX(Value) FROM (VALUES (0), ((1 + @LambdaParameter) * (MembershipScoreForSet1 + MembershipScoreForSet2 - 1) - (@LambdaParameter * MembershipScoreForSet1 * MembershipScoreForSet2))) AS T1(Value)) AS YuNorm,
1 - (SELECT MIN(Value) FROM (VALUES (1), (Power(Power(1 - MembershipScoreForSet1, @OmegaParameter) + Power(1 - MembershipScoreForSet2, @OmegaParameter), 1 / CAST(@OmegaParameter AS float)))) AS T1(Value)) AS YagerTNorm
FROM @IntersectionTable
WHERE MembershipScoreForSet1 != 0 AND MembershipScoreForSet2 != 0 

SELECT PrimaryKey, Value, MembershipScoreForSet1, MembershipScoreForSet2,
Power((SELECT MAX(Value) FROM (VALUES (0), (Power(MembershipScoreForSet1, @PParameter) + Power(MembershipScoreForSet2, @PParameter) - 1)) AS T1(Value)), 1 / CAST(@PParameter AS float)) AS SchweizerSklarTNorm1,
1 - Power(Power(1 - MembershipScoreForSet1, @PParameter) + Power(1 - MembershipScoreForSet2, @PParameter) - (Power(1 - MembershipScoreForSet1, @PParameter) * Power(1 - MembershipScoreForSet2, @PParameter)), 1 / CAST(@PParameter AS float)) AS SchweizerAndSklar2,
EXP(-1 * Power(Power(Abs(Log(MembershipScoreForSet1)), @PParameter) + Power(Abs(Log(MembershipScoreForSet2)), @PParameter), 1 / CAST(@PParameter AS float))) AS SchweizerAndSklar3,
(MembershipScoreForSet1 * MembershipScoreForSet2) / Power(Power(MembershipScoreForSet1, @PParameter) + Power(MembershipScoreForSet2, @PParameter) - (Power(MembershipScoreForSet1, @PParameter) * Power(MembershipScoreForSet2, @PParameter)), 1 / CAST(@PParameter AS float)) AS SchweizerAndSklar4
FROM @IntersectionTable
WHERE MembershipScoreForSet1 != 0 AND MembershipScoreForSet2 != 0

Figure 4: T-Norm Results (click to enlarge)

…………There's an awful lot of math going on here but very few examples in the literature with sample data to verify the results on, so be on the lookout for undiscovered errors in this inexpert attempt to code these more advanced T-norms. To simplify things, I decided to add the brute force WHERE clause in order to prevent divide-by-zero errors, rather than including validation logic in each separate calculation. To make the code a little easier on the eyes, I split the Schweizer and Sklar T-norms into a separate query and put each on a line of its own, which probably won't render correctly in WordPress. In most of them, I adopted a T-SQL trick posted by Jamie Thomson at SQLBlog.com which allows you to select the MAX or MIN for a set of values at each row, without having to do an aggregate operation on the whole table.[10] The Yager T-Norm uses a MIN operation, but the outer subtraction from 1 turns that minimum into its complement, so it behaves like a MAX in the end.
…………Note that once again, each function follows a different pattern. Some of them overlap at times, like the Yager T-norm, which equals the Bounded Difference when the @OmegaParameter is set to 1; as it approaches infinity (or the best we can approximate at the limits of SQL Server's float, numeric and decimal data types) it joins up with the Standard Intersection.[11] Our goal is to find a T-norm that produces a pattern that closely matches the imprecision we're trying to model. It's a mouthful when put that way; suffice it to say that once again, it is easier to grasp all of this if we get away from computer code, heavy math and big words and look at fuzzy sets in terms of natural language. As Klir and Yuan put it, our selection of the right fuzzy set operation is dependent on the meaning of particular linguistic terms:

                "We have to determine, in each particular application, which of the available operations on fuzzy sets best represent the intended operations on the corresponding linguistic terms. It is now experimentally well established that the various connectives of linguistic terms, such as and, or, not, and if-then, have different meanings in different contexts. We have to determine which of the T-norms, T-conorms, complements or other operations on fuzzy sets best approximate the intended meanings of the connectives. The various linguistic hedges are context dependent as well."[12]

…………In Behavior-Driven Development (BDD) and user stories, it might be natural to flag “and” statements as candidates for fuzzy intersections. Once we’ve identified a suitable T-norm, parameter estimation techniques can be used to take educated guesses on where to set the corresponding input variables (I suppose this would include maximum likelihood estimation, which I am currently stuck on).[13] I’ll elaborate on this selection process further in the next article, which is likely to be a good deal shorter because fuzzy unions are implemented in almost the same way.
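…………In lieu of a formal estimation routine, one crude way to eyeball a parameter's effect is to sweep it across a handful of values and watch how the output changes for a fixed pair of grades. The hypothetical sketch below does this for the Yager T-norm; the grade values and the parameter list are arbitrary, but note how the ω = 1 row matches the Bounded Difference for the same grades (0.3) and how the result creeps toward the Standard Intersection (the plain minimum, 0.5 here) as ω grows.

-- a minimal sketch of a parameter sweep; the grades and the ω values are arbitrary
DECLARE @Grade1 float = 0.8, @Grade2 float = 0.5

SELECT OmegaParameter,
1 - (SELECT MIN(Value) FROM (VALUES (CAST(1 AS float)),
(Power(Power(1 - @Grade1, OmegaParameter) + Power(1 - @Grade2, OmegaParameter), 1 / CAST(OmegaParameter AS float)))) AS T1(Value)) AS YagerTNorm
FROM (VALUES (1), (2), (5), (10), (50)) AS T2(OmegaParameter)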
…………The same data modeling principles apply to both fuzzy unions and fuzzy intersections, so I'll dispense with the topic here. If we go to all the trouble of calculating these fuzzy intersection or union values, we might want to persist them instead of incurring their performance costs repeatedly by doing them on the fly. In the last article, I introduced the issue of storing the parameters for Sugeno and Yager Complements in such instances, which leads to some intriguing data modeling issues not often seen in ordinary crisp sets, where there are no choices of functions to apply to set relations and therefore no parameters to store. I listed a couple of possible solutions, like defining indexed views with the parameter values baked into them or creating a ComplementTable to store the parameter values alongside their function type, which might span schemas or whole databases and include text keys linking them to the tables and views the complements are applied to. We could go one step further and also store the complement values in one or more dependent tables linked to the ComplementTable by a foreign key, but if we want to enforce a FOREIGN KEY REFERENCES statement on the second key we'd need one such table for each fuzzy object we're getting records from. The issue becomes more complicated with binary relations, since we'd need another text column to identify the second object in the IntersectionTable and a third foreign key if we're also storing the row-by-row values returned for each parameter setting. The latter would amount to a many-to-many join across two fuzzy sets that are also joined to the IntersectionTable holding the parameters and T-norm type; a bare-bones sketch of what that might look like follows below. Thankfully, the modeling options don't get any more complex than this with other binary relations, such as fuzzy unions, the topic of next week's article. If you can wade through this tutorial, the next should be a breeze, considering that the counterparts of T-norms were developed by the same theorists and are implemented with similar formulas. The logic behind them isn't much different either, except that we're taking maximums rather than minimums at each row. This means we can reuse much of the sample code I've posted here and just swap out a few math operations, which is a piece of cake.
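-- a hypothetical sketch of one way to persist T-norm choices, their parameters and their row-by-row values;
-- the schema, table and column names are invented for illustration and would need to be adapted to a real database
CREATE TABLE Calculations.IntersectionTable (
IntersectionID bigint IDENTITY(1,1) PRIMARY KEY,
TNormType varchar(50) NOT NULL, -- e.g. 'Yager' or 'Hamacher'
ObjectName1 nvarchar(400) NOT NULL, -- text key identifying the table or view the first fuzzy set comes from
ObjectName2 nvarchar(400) NOT NULL, -- text key for the second fuzzy set's source
LambdaParameter float NULL,
OmegaParameter float NULL,
PParameter float NULL
)

CREATE TABLE Calculations.IntersectionValueTable (
IntersectionID bigint NOT NULL FOREIGN KEY REFERENCES Calculations.IntersectionTable (IntersectionID),
MemberPrimaryKey1 sql_variant NOT NULL, -- the row from the first fuzzy set
MemberPrimaryKey2 sql_variant NOT NULL, -- the row from the second fuzzy set
IntersectionValue float NULL -- as noted above, FOREIGN KEY REFERENCES can only be enforced on the member keys if one value table is created per pair of source tables
)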

[1] IBID., p. 122.

[2] p. 120, Klir, George J. and Yuan, Bo, 1995, Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall: Upper Saddle River, N.J.

[3] IBID., p. 63.

[4] See the Scholarpedia.com webpage "Triangular Norms and Conorms" at http://www.scholarpedia.org/article/Triangular_norms_and_conorms#Fuzzy_intersections_and_unions

[5] IBID., pp. 62-63.

[6] IBID., p. 63.

[7] IBID., p. 68.

[8] IBID., p. 70.

[9] I implemented the formulas I found in a table at p. 76, Klir and Yuan, which is widely available elsewhere. I had to look up some of the first names at the Wikipedia webpage "Construction of T-Norms" at http://en.wikipedia.org/wiki/Construction_of_t-norms

[10] Thomson, Jamie, 2012, “Use VALUES Clause to Get the Maximum Value from Some Columns,” published Jan. 20, 2012 at the SQLBlog web address http://sqlblog.com/blogs/jamie_thomson/archive/2012/01/20/use-values-clause-to-get-the-maximum-value-from-some-columns-sql-server-t-sql.aspx

[11] pp. 70-71, Klir and Yuan.

[12] IBID., pp. 280-281.

[13] IBID., p. 94.

Implementing Fuzzy Sets in SQL Server, Part 2: Measuring Imprecision with Fuzzy Complements

By Steve Bolton

…………Taking the dive into fuzzy sets immediately elicits the obvious question: just how fuzzy is the data we're operating on? As discussed in the first two installments of this amateur series of self-tutorials, the raison d'etre of fuzzy set theory is to model imprecise data that is typically expressible in nebulous terms in everyday speech; it addresses a class of data that is represented in ordinals (i.e. categories with orders) but for which it would be helpful to at least approximate with a continuous number scale. It essentially amounts to taking the information left over after we've defined ordinary (i.e. "crisp") sets and putting it to at least some good use, just as recycled aluminum and cork shavings can be used in less lucrative products like paint and corkboard. Of course, once we've used some of the procedures outlined in the last two articles to define them using graded membership functions, it's natural to try to quantify the imprecision we're modeling in some way. It at least sets a sort of outer boundary to the uncertainty we're modeling where none existed previously, even if that barrier is not completely accurate; it is the first baby step towards a program of Uncertainty Management. The ability to apply at least some yardstick to previously unquantifiable data is useful for the same reason that the unknown is invariably certain to scare readers of horror fiction. As Stephen King puts it so well,

                “You approach the door in the old, deserted house, and you hear something scratching at it. The audience holds its breath along with the protagonist as she/he (more often she) approaches that door. The protagonist throws it open, and there is a ten-foot-tall bug. The audience screams, but this particular scream has an oddly relieved sound to it. ‘A bug ten feet tall is pretty horrible,’ the audience thinks, ‘but I can deal with a ten-foot-tall bug. I was afraid it might be a hundred feet tall.’”[I]

…………A good horror writer will of course maximize the unknown in order to scare their customers, but the object of relational database management and data mining is to minimize it, to reassure end users. We know that bugs of an entirely different kind may arise whenever uncertainty is innately woven into the tapestry of the problems software engineers, DBAs and analysts seek to solve, but these underused precision modeling tools include a yardstick with which to measure our gremlins: the fuzzy complement, which allows us to assign a ballpark figure to the amount of fuzz involved. Complements are also one of the simplest types of set relations, which can serve as a bridge into discussion of more complicated topics like fuzzy intersections, unions and joins between two sets.
…………Another single-set operation that is useful in fuzzy set theory is a partitioning method known as the alpha cut, but I'll delay in-depth discussion of them since they're more applicable to higher-level topics like possibility theory. Fuzzy complements and α-cuts both lead to peculiar nested subsets, albeit in different ways. In the latter case, it is because they assign records to multiple, progressively larger subsets within the original fuzzy set, in tandem with the decrease in membership grade; in the former, it is because the various types encompass ever-wider sets of records, depending on what fuzzy complement we choose to apply and how generalized its definition is. As discussed in the taxonomy of fuzzy set types in the last article, the simplicity of these techniques quickly gives rise to complexity because there are so many ways of combining these concepts; just as the difficulty in using membership functions consists mainly in the fact that there are so many we can choose from, so too does the challenge with fuzzy complements consist mostly of choosing the right type. This complexity is not an issue in ordinary crisp sets, which can be viewed as highly specific cases of fuzzy set theory. There's no need to specify their membership functions in code because they either do or do not belong, which amounts to Boolean membership grades of 0 and 1 if we put it in fuzzy set terms. Likewise, there is only one type of complement in ordinary crisp sets, i.e. the set of all records that don't belong to them.
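…………Although α-cuts won't be covered in depth until later in the series, the basic mechanics are trivial to sketch in T-SQL: we simply pick a threshold α and retrieve every record whose membership grade meets or exceeds it, so that lower thresholds return progressively larger, nested subsets. The snippet below runs against a hypothetical @FuzzySetTable with a MembershipScore column like the ones used throughout this series.

-- a minimal sketch of an alpha cut; @FuzzySetTable and its sample values are hypothetical
DECLARE @FuzzySetTable table (PrimaryKey sql_variant, MembershipScore decimal(38,6))
DECLARE @Alpha decimal(38,6) = 0.7

INSERT INTO @FuzzySetTable (PrimaryKey, MembershipScore)
VALUES (1, 0.95), (2, 0.72), (3, 0.41), (4, 0.05)

SELECT PrimaryKey, MembershipScore
FROM @FuzzySetTable
WHERE MembershipScore >= @Alpha -- swap in > for a "strong" alpha cut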

Involutive and Non-Involutive Complements

…………The whole point of membership functions is to define "belonging" in terms of a grade on a continuous scale, so by logical necessity the concept of "not belonging" that defines complements also has to be assigned a grade on a continuous scale. The most obvious way to do this is to simply subtract a membership score from 1, assuming that we're defining our grades on a range of 0 to 1, as is almost always the case in fuzzy set theory. This is known as the standard complement, but we're not limited to that single choice. The injection of membership grades into set theory raises the possibility of performing further operations to alter them when we perform the inversion operation, which opens up the door to a whole class of new complements that aren't possible in crisp sets. In fact, we're only limited by our imagination and the restrictions that the inversion functions must be strictly increasing or strictly decreasing, i.e. the values must move in only one direction, with no repeats. A violation of this principle would appear in a Reporting Services line chart as a flat segment.
…………It is possible to define a really broad set of "non-involutive" fuzzy complements if we don't require that the inversion function return us to the original value when applied twice[ii], but I won't expend much time on these "since involutive fuzzy complements play by far the most important role in practical applications."[iii] Fuzzy Sets and Fuzzy Logic: Theory and Applications, the classic resource by George J. Klir and Bo Yuan that I'll be depending upon for many of the math formulas in this series, contains an elegant illustration of how the standard fuzzy complement is contained within the set of involutive complements, which is in turn nested within the set of non-involutives, which are likewise a subset of all possible continuous complements, and those in turn of the set of all fuzzy complements, without any restrictions.[iv]

Coding the Sugeno and Yager Complements

                To demonstrate how these types of complements lead to different values, I calculated the complements in Figure 1 on a column in the Duchennes muscular dystrophy data I downloaded a few tutorial series ago from  Vanderbilt University’s Department of Biostatistics. As in my last article, I’ll recycle the stored procedure I wrote for Outlier Detection with SQL Server, part 2.1: Z-Scores for double-duty as my membership function, but keep in mind that fuzzy sets have nothing to do with Z-Scores or probability in general – unless you specifically add that meaning, as I have done here to essentially turn it into a fuzzy outlier detection method. To illustrate the possibility of combining membership functions in myriad ways, I retrieved Z-Scores for two columns and multiplied them together, but this sample code is much simpler since we’re only acting on one column. The results of the stored procedure are plugged into a table variable identical to the one we used last time around, except for the addition of four columns to hold the complements we’re experimenting with (the GroupRank is part of the original procedure and required for the INSERT EXEC operation, but can be safely ignored). The same @RescalingMax, @RescalingMin and @RescalingRange values are used in conjunction with the ReversedZScore column to normalize the MembershipScores within ranges of 0 to 1.
…………The key difference is the addition of the @Pi, @LambdaParameter and @OmegaParameter variables, which are needed in the second UPDATE to calculate the non-involutive example[v] and the Sugeno and Yager Complements, which are the two most commonly mentioned ones in the literature. These two are useful when the problem at hand calls for bowing the number line between 0 and 1 in various arcs. When the @LambdaParameter is set to 0, the Sugeno Complement ought to be equal to the standard complement, while the Yager Complement ought to equal the same when its @OmegaParameter is set to 1. As the @LambdaParameter surpasses 0 the line bows towards lower values of both the complement and the membership grade, but at negative values it bows towards higher values of both. The Yager Complement moves in the opposite direction as its parameter is changed, but in different proportions.[vi] Although I have yet to do so in practice, these properties can be leveraged to fit different use cases – perhaps in curve or distribution fitting, sensitivity adjustments or instances where an object carries less imprecision than its negation. In essence, the act of parameterization turns them both into whole families of functions that can be adapted as needed.

Figure 1: Four Examples of Fuzzy Complements
DECLARE @RescalingMax decimal(38,6), @RescalingMin decimal(38,6), @RescalingRange decimal(38,6)
DECLARE @LambdaParameter float = 1,
@OmegaParameter float = 1, -- ω
@Pi decimal(38,37) = 3.1415926535897932384626433832795028841 -- from http://www.eveandersson.com/pi/digits/100

DECLARE  @ZScoreTable table
(ID bigint IDENTITY (1,1),
PrimaryKey sql_variant,
Value decimal(38,6),
ZScore decimal(38,6),
ReversedZScore as CAST(1 as decimal(38,6)) - ABS(ZScore),
MembershipScore decimal(38,6),
RegularComplement float,
NonInvolutiveComplementExample float,
SugenoComplement float,
YagerComplement float,
GroupRank bigint
)

INSERT INTO @ZScoreTable
(PrimaryKey, Value, ZScore, GroupRank)
EXEC   Calculations.ZScoreSP
              @DatabaseName = N'DataMiningProjects',
              @SchemaName = N'Health',
              @TableName = N'DuchennesTable',
              @ColumnName = N'LactateDehydrogenase',
              @PrimaryKeyName = N'ID',
              @DecimalPrecision = '38,32',
              @OrderByCode = 8

-- RESCALING
SELECT @RescalingMax = Max(ReversedZScore), @RescalingMin= Min(ReversedZScore) FROM @ZScoreTable
SELECT @RescalingRange = @RescalingMax - @RescalingMin

UPDATE @ZScoreTable
SET MembershipScore = (ReversedZScore - @RescalingMin) / @RescalingRange

UPDATE @ZScoreTable
SET RegularComplement = 1 - MembershipScore,
NonInvolutiveComplementExample = 0.5 * (1 + Cos(@Pi * MembershipScore)),
SugenoComplement = (1 - MembershipScore) / CAST(1 + (@LambdaParameter * MembershipScore) AS float),
YagerComplement = Power(1 - Power(MembershipScore, @OmegaParameter), 1 / CAST(@OmegaParameter AS float))
FROM @ZScoreTable 

-- RESULTS WITH EQUILIBRIA
DECLARE @StandardEquilibrium AS float = 0.5
DECLARE @SugenoEquilibrium AS float = (Power(1 + @LambdaParameter, 0.5) - 1) / @LambdaParameter

SELECT MembershipScore, RegularComplement, NonInvolutiveComplementExample, SugenoComplement, YagerComplement,
CASE WHEN MembershipScore - RegularComplement > @StandardEquilibrium THEN 1 ELSE 0 END AS IsGreaterThanStandardEquilibrium,
CASE WHEN MembershipScore - RegularComplement > @SugenoEquilibrium THEN 1 ELSE 0 END AS IsGreaterThanSugenoEquilibrium
FROM (SELECT MembershipScore, RegularComplement, NonInvolutiveComplementExample, SugenoComplement, YagerComplement
       FROM @ZScoreTable
       WHERE MembershipScore IS NOT NULL) AS T1
ORDER BY MembershipScore ASC

SELECT @StandardEquilibrium AS StandardEquilibrium, @SugenoEquilibrium AS SugenoEquilibrium

-- MEASURE OF FUZZINESS
DECLARE @Count  bigint
SELECT @Count=Count(*)
FROM @ZScoreTable

SELECT SUM(ABS(MembershipScore - YagerComplement)) / @Count AS SimpleMeasureOfFuzziness
FROM @ZScoreTable
WHERE MembershipScore IS NOT NULL

Figure 2: Sample Results from the Duchennes Table (click to enlarge)

…………The concept of fuzzy complements also gives rise to two properties which are mentioned in Klir and Yuan, but for which I have yet to find a practical use: equilibria and dual points. An equilibrium represents the single value in any set where the subtraction of the complement equals zero, which is 0.5 in a standard complement but can vary from that in more advanced types of complements. The dual point is apparently an extension of equilibria to non-involutives, which are seldom used as mentioned above – so I'll seldom mention them in this series.[vii] The sample data in Figure 1 did not contain any records where the complement values hit the equilibrium points, so I tried to illustrate the concept by declaring two variables to identify the standard and Sugeno equilibria (which requires plugging the @LambdaParameter into a simple math formula) and adding two CASE conditions to identify whether the corresponding complement was above or below its equilibrium point.
…………The end of the procedure contains an example of the most fundamental measure of fuzziness I could think of. As Klir and Yuan put it, “Another way of measuring fuzziness, which seems more practical as well as more general, is to view the fuzziness of a set in terms of the lack of distinction between the set and its complement. Indeed, it is precisely the lack of distinction between sets and their complements that distinguishes fuzzy sets from crisp sets. The less a set differs from its complement, the fuzzier it is.”[viii] In a later discussion on evidence theory, they give some clear examples of exactly what fuzziness measures:

                “Observing attributes such as ‘a type of cloud formation’ in meteorology, ‘a characteristic posture of an animal’ in ethnology, or ‘a degree of defect in a tree’ in forestry clearly involves situations in which it is not practical to draw sharp boundaries; such observations or measurements are inherently fuzzy and consequently, their connection with the concept of the fuzzy set is suggestive. In most measurement in physics, on the other hand, such as the measurement of length, weight, electric current, or light intensity, we define classes with sharp boundaries.”[ix]

…………They go on to provide an example which uses the Hamming Distance to measure the difference between each membership grade and its complement, but I chose to go the even simpler route of devising a mere ratio of the differences to the overall count. Later in the series I'll explain how partial set membership introduces many alternative definitions of cardinality and other statistics, but to stay on point I used a regular "crisp" count here. Note how we've added yet another layer of complexity here: we're no longer dependent just on varied definitions of complements, but also of the measurement we choose to use. This leads to greater potential of getting lost in an endless sea of choices, but it can also empower us by letting us choose a distance measure that is best suited to the kind of fuzziness we're trying to measure. It might be useful in one instance to use a simple Euclidean Distance, in another we might want to work in a Kullback-Leibler Divergence and in another, it might pay to figure out what the heck an Itakura-Saito Distance is and code it. In Figure 2, we can see that the Yager Complement leads to a fairly high measure of imprecision in this case, at 0.736238435643564, which would change in tandem with our parameter choices. Perhaps we don't know what the ideal parameter choices would be, but this at least gets us on the board with a measure of imprecision where we had none whatsoever before.
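…………For comparison's sake, the snippet below sketches my recollection of the Hamming Distance version Klir and Yuan describe, in which the distance between each grade and its standard complement is subtracted from 1, so that the most ambiguous grades (those closest to 0.5) contribute the most, and the results are summed; unlike my ratio, it grows with the size of the set unless we divide by the record count ourselves. It can be tacked onto the end of Figure 1, since it reuses the same @ZScoreTable and @Count values, but the formula ought to be rechecked against the book before being relied upon.

-- a sketch of the Hamming-distance-based measure of fuzziness, using the standard complement
SELECT SUM(1 - ABS(MembershipScore - RegularComplement)) AS HammingMeasureOfFuzziness,
SUM(1 - ABS(MembershipScore - RegularComplement)) / @Count AS NormalizedHammingMeasure
FROM @ZScoreTable
WHERE MembershipScore IS NOT NULL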
…………In the next couple of articles I'll show how these concepts can be extended in equally useful ways to fuzzy intersections, unions and common join types. These topics are a bit more complicated, since we're using binary relations rather than simple reflexive ones like complements. If we want to store the values of set operations of this kind, the options are easier to choose from in the case of complements: 1) using indexed views with the parameter values baked into them, or 2) creating special tables to hold the values of the Sugeno, Yager, etc. parameters if we need to reuse them in other SQL statements. In the second modeling option, it might pay to store all of the complement parameters in a single table, possibly for a whole schema or database, with some sort of text key identifying the table/view and column they're associated with. If we wanted to also store the row values derived from the Sugeno and Yager functions for later retrieval, we might create a ComplementValueTable with one foreign key on our ComplementTable and another leading to the primary key of whatever table the row belongs to. We wouldn't be able to leverage a FOREIGN KEY REFERENCES statement to enforce these relationships unless we used multiple ComplementValueTables, each associated with a different table with fuzzy data. The issue of fuzzy set relation storage takes us in strange data modeling directions because there are so many different options with unions, joins and the like that simply aren't available to us with crisp sets. We can at least rule out a many-to-many join on a ComplementTable of any kind, because complements are a reflexive relationship. That is no longer true when we're speaking of intersections and other binary relations that add another layer of both complexity and problem-solving potential, as we'll see in two weeks.
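…………A stripped-down version of that second modeling option might look something like the hypothetical sketch below; the schema and object names are invented for illustration and would need to be adapted to whatever naming conventions the database already follows.

-- a hypothetical sketch of a schema-wide ComplementTable plus a dependent table for the row-by-row values
CREATE TABLE Calculations.ComplementTable (
ComplementID bigint IDENTITY(1,1) PRIMARY KEY,
ComplementType varchar(50) NOT NULL, -- e.g. 'Sugeno' or 'Yager'
ObjectName nvarchar(400) NOT NULL, -- text key identifying the table or view the complement applies to
ColumnName nvarchar(400) NOT NULL,
LambdaParameter float NULL, -- used by the Sugeno Complement
OmegaParameter float NULL -- used by the Yager Complement
)

CREATE TABLE Calculations.ComplementValueTable (
ComplementID bigint NOT NULL FOREIGN KEY REFERENCES Calculations.ComplementTable (ComplementID),
MemberPrimaryKey sql_variant NOT NULL, -- points back to a row of the source table; as noted above, this can only be enforced with FOREIGN KEY REFERENCES if one such table exists per source table
ComplementValue float NULL
)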

[i] p. 114, King, Stephen, 1981, Stephen King’s Danse Macabre. Everest House: New York. King said he was basically paraphrasing an idea expressed to him by author William F. Nolan at the 1979 World Fantasy Convention.

[ii] For the definition of the term, I consulted the Wikipedia page “Involution (Mathematics)” at http://en.wikipedia.org/wiki/Involution_(mathematics).

[iii] p. 59, Klir, George J. and Yuan, Bo, 1995, Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall: Upper Saddle River, N.J.

[iv] IBID., p. 57.

[v] IBID. Adapted from a cosine-based formula on p. 54.

[vi] IBID., pp. 55-57.

[vii] IBID., pp. 57-59.

[viii] IBID., p. 255.

[ix] IBID., p. 179.

Implementing Fuzzy Sets in SQL Server, Part 1: Membership Functions and the Fuzzy Taxonomy

By Steve Bolton

…………In the first installment of this amateur self-tutorial series on applying fuzzy set theory to SQL Server databases, I discussed how neatly it dovetails with Behavior-Driven Development (BDD) principles and user stories. This is another compelling reason to take notice of fuzzy sets, beyond the advantages of using a set-based language like T-SQL to implement them, which will become obvious as this series progresses. There aren’t any taxing mental gymnastics involved in flagging imprecision in natural language statements like “hot,” “cloudy” or “wide,” which is strikingly similar to the way user stories are handled in BDD. What fuzzy sets bring to the table is the ability to handle imprecise data that resides in the no-man’s land between ordinal and continuous Content types. In addition to flagging imprecision in natural language and domain knowledge that is difficult to pin down, it may be helpful to look for attributes which represent categories that are ranked in some way (which sets them apart from nominal data, which is not ordered on any scale) but which it would be beneficial to express on a continuous numerical scale, even at the cost of inexactness. Thankfully, mathematicians have already hashed out a whole framework for modeling this notoriously tricky class of data, even though it is as underused as the SQL Server Data Mining (SSDM) components I tried to publicize in a previous mistutorial series. It is also fortunate that we already have an ideal tool for implementing it in T-SQL, which can already handle most of the mathematical formulas devised over the last few decades. As I’ll demonstrate in this article, it only takes a few minutes to implement simple membership functions that grade records based on how much they belong to a particular set. It is only when we begin combining different types of imprecision together and assigning more nuanced interpretations to the grading systems that complexity quickly arises and the math becomes challenging. Although I’m still learning the topic as I go – I find it is much easier to absorb this kind of material by writing about it – I hope to reduce the challenge involved by taking a stab at explaining it, which will at least help readers avoid repeating my inevitable mistakes.
…………The first challenge to overcome is intimidation, because the underlying concepts don’t even require a college education to grasp; in fact, some DBAs have probably already worked with forerunners of fuzzy sets unwittingly, on occasions where they’ve added columns that rate a row’s inclusion in a particular set. It doesn’t take much mental juggling to start thinking explicitly about such attributes as measures of membership in a particular set. Perhaps the simplest forms of membership functions are single columns filled with data that has been assigned that kind of meaning, which can even be derived from such sources as subjective grades assigned by end users in exactly the same manner as movie or restaurant ratings. The data can even be permanently static. At the next level of complexity, we could of course store such data in the form of computed columns, regardless of whether it is read-only or not.
…………A couple of really simple restrictions are needed to bring this kind of data into line with fuzzy set theory though. First, since the whole object is to treat ordinal data as if it were continuous, we’d normally use T-SQL data types like float, numeric and decimal – which are the closest we can get, considering that our finite computers can’t truly handle infinitesimal scales. Furthermore, it is probably wise to stick with the convention of using a scale between 0 and 1, since this enables us to integrate it seamlessly with evidence theory, stochastics, decision theory, control theory and neural net weights, all of which are also typically bounded in the same range or quite similar ones; some of the theoretical resources I consulted mentioned in an offhand way that it is possible to use other scales, but I haven’t seen a single instance of it yet in the literature. Ordinal categories are often modeled in SQL Server in text data types like nvarchar, tinyint codes or some type of foreign key, which might have to be retained alongside the membership function storage column; in other instances, our membership function may be scoring on the basis of several attributes in a table or view, or perhaps all of them. Of course, in many use cases we won’t need to store the membership function value at all, so it will have to be calculated on the fly. If we’re simply storing a subjective rating or whatever, we might only need some sort of interface to allow end users to enter their own numbers (on a continuous scale of 0 to 1), in which case there is no need for a membership function per se. If a table or view participates in different types of fuzzy sets, it may be necessary to add more of these membership columns for each of them, unless you want to calculate the values as you go. Simply apply the usual rules of data modeling and principles of performance maximization to determine the strategies that fit your use cases best.
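…………As a minimal sketch of these storage strategies, we might persist a user-entered grade in a plain float column constrained to the 0-to-1 scale, or derive a score from another attribute through a computed column. The Ratings.ProductReviewTable below and its StarRating column are invented for illustration, so don’t take the names literally:

ALTER TABLE Ratings.ProductReviewTable
ADD SatisfactionMembershipScore float NULL
CONSTRAINT CK_SatisfactionMembershipScore CHECK (SatisfactionMembershipScore BETWEEN 0 AND 1)

ALTER TABLE Ratings.ProductReviewTable
ADD StarRatingMembershipScore AS (CAST(StarRating as float) - 1) / 4 -- maps a 1-to-5 star rating onto the 0-to-1 scale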

Selecting Membership Functions

                That is all kid stuff for most DBAs. The challenges begin when we try to identify which membership function would be ideal for the use cases at hand. Since the questions being asked of the data vary from one problem to the next, I cannot possibly answer that. I suppose that you could say the general rule of thumb with membership functions is that the sky’s the limit, as long as we stay at an altitude between 0 and 1. Later in this series I’ll demonstrate how to use particular classes of functions called T-norms and T-conorms, since various mathematical theorems demonstrate that they’re ideal for implementing unions and intersections, but even in these cases, there are so many available to us that the difficulty consists chiefly in selecting an appropriate match to the problem you’re trying to solve. There might be more detailed guidelines available from more recent sources, but my favorite reference for the math formulas, George J. Klir and Bo Yuan’s classic Fuzzy Sets and Fuzzy Logic: Theory and Applications, provides some suggestions. For example, membership values can be derived from sample data through Lagrange interpolation and two methods I have used before, least-squares curve fitting and neural networks.[1] They also discuss how to aggregate the opinions of multiple experts using both direct and indirect methods of collection, in order to ascertain the meaning of fuzzy language terms. The specifics they provide get kind of involved, but it is once again not at all difficult to implement the premises in a basic way; a development team could, for example, reach a definition of the inherently fuzzy term “performance” by scoring their opinions, then weighting them by the authority of their positions.[2] The trick is to pick a mathematical operation that pools them all together into a single value that stays on a scale of 0 to 1, while still capturing the meaning in a way that is relevant to the problem at hand.
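…………For instance, one crude way to pool such scores (this is just my own sketch of the general idea, not Klir and Yuan’s actual method) is a weighted average, which conveniently stays on the 0-to-1 scale as long as the individual grades do. The table variable and its contents below are invented for illustration:

DECLARE @ExpertOpinionTable table
(ExpertName nvarchar(128),
MembershipScore decimal(38,6), -- each expert grades the fuzzy term on a 0-to-1 scale
AuthorityWeight decimal(38,6) -- e.g. a lead architect might be weighted above a junior developer
)

INSERT INTO @ExpertOpinionTable (ExpertName, MembershipScore, AuthorityWeight)
VALUES (N'Lead Architect', 0.8, 3), (N'Senior DBA', 0.6, 2), (N'Junior Developer', 0.9, 1)

SELECT SUM(MembershipScore * AuthorityWeight) / SUM(AuthorityWeight) AS PooledMembershipScore
FROM @ExpertOpinionTable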
…………Klir and Yuan refer to this as an application of the newborn field of “knowledge engineering,”[3] which has obvious connections to expert systems. Since fuzzy set theory is still a wide-open field there’s a lot of latitude for inventing your own functions; there might be an optimal function that matches the problem at hand, but no one may have discovered it yet. In situations like these, my first choice would be neural nets, since I saw spectacular evidence long ago of how they can be ideal for modeling unknown functions (which pretty much sparked my interest in data mining). Before trying one of these advanced approaches, however, it might be wise to think hard about what mathematical properties you require of your outputs and then consult a calculus book or other math reference to try to find a matching function. While trying to teach myself calculus all over again recently, I was reintroduced to the whole smorgasbord of properties that distinguish mathematical functions from each other, like differentiability, integrability, monotonicity, analyticity, concavity, subadditivity, superadditivity, discontinuity, splines, super- and subidempotence and the like. You’ll encounter these terms on every other page in fuzzy set math references, which can be differentiated (pun intended?) into broad categories like function magnitude, result, shape and mapping properties. One thing I can help with is to caution that it’s often difficult or even impossible to implement ones (like the popular gamma function) which require calculations of permutations or combinations. It doesn’t matter whether you’re talking about T-SQL, Visual Basic, C# or some computer language implemented outside of the Microsoft ecosystem: it only takes very small input values before you reach the boundaries of the highest data types. This renders certain otherwise useful data mining and statistical algorithms essentially useless in the era of Big Data. An exclamation point in a math formula ought to elicit a groan, because the highest value you might be able to plug into a factorial function in SQL Server is about 170.
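…………If you want to see that ceiling for yourself, a quick throwaway loop like the one below (my own illustration, not production code) shows how the float data type gives out just past 170!:

DECLARE @Factorial float = 1, @Counter int = 1
WHILE @Counter <= 170
BEGIN
SET @Factorial = @Factorial * @Counter
SET @Counter = @Counter + 1
END

SELECT @Factorial AS FactorialOf170 -- about 7.26E+306; raising the loop limit to 171 triggers
-- an arithmetic overflow error, since float tops out near 1.79E+308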

A Trivial Example with Two Membership Functions

                I’ll provide an example here of moderate difficulty, in between the two extremes of advanced techniques like least squares (or God forbid, the gamma function and its relatives) on the one hand and cheesy screenshots of an ordinary table that just happens to have a float column scored between 0 and 1 on the other. As we’ll see in the next few tutorials on fuzzy complements, unions, intersections and the like, when calculating set memberships on the fly we usually end up using a lot of CASE, BETWEEN and MIN/MAX statements in T-SQL, but that won’t be the case in the example below because the values are derived from a stored procedure and stored in two table variables. To demonstrate how seamlessly fuzzy set techniques can be integrated with standard outlier detection techniques, I’ll recycle the code from my old tutorial Outlier Detection with SQL Server, part 2.1: Z-Scores and use it as my membership function.
…………There’s a lot of code in Figure 1, but it’s really easy to follow, since all we’re doing is running the Z-Scores procedure on a dataset on the Duchennes form of muscular dystrophy I downloaded from  Vanderbilt University’s Department of Biostatistics a couple of tutorial series ago, which now occupies about 9 kilobytes of space in a sham DataMiningProjects database. There’s probably a more efficient way of going about this, but the results are stored in a table variable and the @RescalingMax, @RescalingMin and @RescalingRange variables and the ReversedZScore column are then used to normalize the Z-Score on a range of 0 to 1 (the GroupRank column was needed for the stored procedure definition in the original Z-Scores tutorial, but can be ignored in this context). To illustrate how we can combine fuzzy set approaches together in myriad combinations, I added an identical table that holds Z-Scores for a second column from the same dataset, which is rescaled in exactly the same way. In the subquery SELECT I merely multiply the two membership values together to derive a CombinedMembershipScore. What this essentially does is give us a novel means of multidimensional outlier detection.

Figure 1: Using Z-Scores for Membership Functions
DECLARE @RescalingMax decimal(38,6), @RescalingMin decimal(38,6), @RescalingRange decimal(38,6)
DECLARE  @ZScoreTable1 table
(ID bigint IDENTITY (1,1),
PrimaryKey sql_variant,
Value decimal(38,6),
ZScore decimal(38,6),
ReversedZScore as CAST(1 as decimal(38,6)) - ABS(ZScore),
MembershipScore decimal(38,6),
GroupRank bigint
)

DECLARE  @ZScoreTable2 table
(ID bigint IDENTITY (1,1),
PrimaryKey sql_variant,
Value decimal(38,6),
ZScore decimal(38,6),
ReversedZScore as CAST(1 as decimal(38,6)) - ABS(ZScore),
MembershipScore decimal(38,6),
GroupRank bigint
)

INSERT INTO @ZScoreTable1
(PrimaryKey, Value, ZScore, GroupRank)
EXEC   Calculations.ZScoreSP
              @DatabaseName = N'DataMiningProjects',
              @SchemaName = N'Health',
              @TableName = N'DuchennesTable',
              @ColumnName = N'CreatineKinase',
              @PrimaryKeyName = N'ID',
              @DecimalPrecision = '38,32',
              @OrderByCode = 8

INSERT INTO @ZScoreTable2
(PrimaryKey, Value, ZScore, GroupRank)
EXEC   Calculations.ZScoreSP
              @DatabaseName = N'DataMiningProjects',
              @SchemaName = N'Health',
              @TableName = N'DuchennesTable',
              @ColumnName = N'LactateDehydrogenase',
              @PrimaryKeyName = N'ID',
              @DecimalPrecision = '38,32',
              @OrderByCode = 8

-- RESCALING FOR COLUMN 1
SELECT @RescalingMax = Max(ReversedZScore), @RescalingMin= Min(ReversedZScore) FROM @ZScoreTable1
SELECT @RescalingRange = @RescalingMax - @RescalingMin

UPDATE @ZScoreTable1
SET MembershipScore = (ReversedZScore - @RescalingMin) / @RescalingRange

-- RESCALING FOR COLUMN 2
SELECT @RescalingMax = Max(ReversedZScore), @RescalingMin= Min(ReversedZScore) FROM @ZScoreTable2
SELECT @RescalingRange = @RescalingMax - @RescalingMin

UPDATE @ZScoreTable2
SET MembershipScore = (ReversedZScore - @RescalingMin) / @RescalingRange

SELECT ID, PrimaryKey, Value, ZScore1, ZScore2, MembershipScore1, MembershipScore2, CombinedMembershipScore
FROM (SELECT T1.ID, T1.PrimaryKey, T1.Value, T1.ZScore AS ZScore1, T2.ZScore as ZScore2,
       T1.MembershipScore MembershipScore1, T2.MembershipScore AS MembershipScore2, T1.MembershipScore * T2.MembershipScore AS CombinedMembershipScore
       FROM @ZScoreTable1 AS T1
              INNER JOIN @ZScoreTable2 AS T2
              ON T1.ID = T2.ID) AS T3
WHERE CombinedMembershipScore IS NOT NULL
ORDER BY CombinedMembershipScore DESC

If we want to store the values in the original table, we can use code like this:
UPDATE T4
SET T4.MembershipScore1 = T3.MembershipScore1, T4.MembershipScore2 = T3.MembershipScore2, T4.CombinedMembershipScore =
T3.CombinedMembershipScore
FROM DataMiningProjects.Health.DuchennesTable AS T4
       INNER JOIN  (SELECT T1.PrimaryKey, T1.MembershipScore AS MembershipScore1, T2.MembershipScore AS MembershipScore2, T1.MembershipScore * T2.MembershipScore AS CombinedMembershipScore
       FROM @ZScoreTable1 AS T1
              INNER JOIN @ZScoreTable2 AS T2
              ON T1.ID = T2.ID) AS T3
       ON T4.ID = T3.PrimaryKey

Figure 2: Sample Results from the Duchennes Practice Data
[Screenshot of the sample results: combined membership function example]

…………Figure 2 gives a glimpse of what the original DuchennesTable might look like if we wanted to store these values rather than calculate them on the fly, which can be accomplished by adding the three float columns on the right to the table definition and executing the UPDATE code at the end of Figure 1. In natural language, we might say that “the first record is 0.941446th of a member in the set around the average Creatine Kinase value” but “the fifth record is only 0.764556th of a member of the set near the mean Lactate Dehydrogenase value.” We could even model deeper levels of imprecision by creating categories like “near” for the high membership values in each column and “outlier” for the lowest ones, then define their boundaries in terms of fuzzy sets. This might be an ideal use for triangular and trapezoidal numbers, which can be worth the expense in extra code, as I’ll explain a few articles from now. We’re also modeling a different type of imprecision in another sense, because we know instinctively that there ought to be some way of gauging whether or not a record’s an outlier when both columns are taken into account; perhaps nobody knows precisely what the rules for constructing such a metric might be, but the CombinedMembershipScore at least allows us to get on the board.
…………Please keep in mind that I’m only using Z-Scores here because the technique is familiar to me and is ideal for illustrating how fuzzy sets can be easily adapted to one particular use case, outlier detection. If we needed to make inferences about how well the data fit a gamma or exponential distribution, we might as well have used the corresponding goodness-of-fit tests and applied some rescaling techniques to derive our membership values; if we needed to perform fuzzy clustering, we could have plugged in a Manhattan distance function or one of its relatives. Fuzzy set memberships are often completely unrelated to stochastics and should not be interpreted as probabilities unless you specifically intend to model them. The usefulness of fuzzy sets is greatly augmented when we move beyond mere set membership by tweaking the meaning a little, so that they can be interpreted as degrees of evidence, reliability, risk, desirability, or the like, which allow us to plug into various other well-developed mathematical theories. All functions can be differentiated by their return types, the number of input and return values, allowable data types and ranges, mathematical properties and the like (not to mention performance costs), but in fuzzy set theory the issue of meaning has a somewhat more prominent role. In some cases, it may even be desirable to use multiple membership functions to determine membership in one fuzzy set, as in my crude example above. These myriad shades of meaning and potential for combinations of them lead to a whole new level of complexity, which may nonetheless be worthwhile to wade through for certain imprecision modeling problems.

A Taxonomy of Fuzzy Sets (that Doesn’t Tax the Brain)

                I originally figured that I’d have to organize this series according to a taxonomy of different types of fuzzy sets, but it’s actually fairly simple to sketch the outlines of that otherwise advanced topic. Instead of delving into all of the complex math, it’s a lot easier for a layman to dream up all of the combinations of all the places in a set they can apply fuzziness, different means of encoding it and so on. The important thing to keep in mind is that there’s probably a term out there for whatever combination you’re using and that somewhere along the line, mathematicians have probably already figured out most of the logical implications decades ago (thereby saving a lot of the grunt work and sparing us from reinventing the wheel, assuming that you can interpret their writing and the really thick formulas that often accompany them). The easiest ones to explain are real-valued and interval sets, in which the membership functions are determined on the real number line (which is all we ever encounter in SQL Server) or by a range of values on it.[4] Type-2 Fuzzy Sets illustrate the concept of tacking on further fuzziness perfectly – all we do is take an interval-valued set and then assign grades to its boundaries as well. Fuzzy set theorists Yingjie Yang and Chris Hinde state that “A type-2 fuzzy set describes its memberships using type-1 fuzzy sets, but it needs precise crisp values to describe its secondary memberships.”[5] As the levels and number of values needed to define these sets proliferates, the performance costs do as well, so one has to be sure in advance that the extra complexity is useful in modeling real-world data. As Klir and Yuan put it, “Fuzzy sets of type 2 possess a great expressive power, and, hence, are conceptually quite appealing. However, computational demands for dealing with them are even greater than those for dealing with interval-valued sets. This seems to be the primary reason why they have almost never been utilized in any applications.”[6] I’d wager that’s still true, given the fact that the applications of ordinary fuzzy sets to data mining, data warehousing and relational databases have barely been scratched since the mathematicians invented these things years ago.
…………Rough sets also involve fuzzy values on intervals in a sense, but they model approximate distinctions between objects. Say, for example, you classify all of the objects in a child’s bedroom and want to see which qualify as part of a set labeled Toys. A sports car might be considered an adult toy to a certain degree, depending on such factors as whether or not the owner uses it for purposes other than occasional joy rides. The plastic dinosaurs and megafauna in a Prehistoric Playset are certainly toys, as are Fisher Price’s wooden people (well, cheap plastic these days). Medicine definitely wouldn’t belong to the set (at least according to these singing pills). Would one of these classic glow-in-the-dark Godzilla models from the ‘70s qualify? Well, that’s not quite clear, since it’s an object only a child would really appreciate, but they’re unlikely to actually play with it as a toy very often, since it’s designed to stay on display. They could conceivably take them off the shelf and pit them against the Fisher Price people; in this instance, the set membership might be defined by criteria as fuzzy as the whims of a child’s imagination, but we have tools to model it, if a need should arise. The definition of the attribute is in question, not whether a particular row belongs to a set, which is the case with ordinary fuzzy membership functions.
…………In Soft Sets, the characteristics that define the set are themselves fuzzy. I haven’t attempted to model those yet, but I imagine it may require comparisons between tables and views and placing weights on how comparable their different columns are to each other, rather than the rows. Here’s a crude and possibly mistaken example I came up with off the top of my head: in Soft Sets you might have a table with columns for Height, Width and Age and another with columns for Height, Width and Time, in which the first two columns of each are completely related to each other and therefore are assigned weights of one, whereas Age and Time are only tangentially related and therefore might be assigned a weight somewhere between 0 and 1. Near sets apparently address a problem tangential to rough and soft sets, by quantifying the quantity and quality of resemblances between objects that might belong to a fuzzy set. Once we’ve been introduced to these concepts, they can obviously be combined together into an endless array of variants, which go by such mouthfuls as “rough intuitionistic Level-2 fuzzy near sets.” Just keep in mind that it is more common to encounter such structures in the real world in everyday language than it is to know the labels and their mathematical properties. It is also easier than it sounds to implement them in practice, if we’re using set-based tools like T-SQL that are ideal for the job.
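…………Returning to that crude soft set example for a moment, the model might boil down to nothing fancier than a table of column-to-column comparability weights like the hypothetical sketch below, in which the grades live between columns rather than between records; every name here is invented:

CREATE TABLE Calculations.SoftSetColumnWeightTable
(SoftSetColumnWeightID int IDENTITY (1,1) PRIMARY KEY,
Table1Name nvarchar(400) NOT NULL,
Column1Name nvarchar(128) NOT NULL,
Table2Name nvarchar(400) NOT NULL,
Column2Name nvarchar(128) NOT NULL,
ComparabilityWeight decimal(38,6) NOT NULL CHECK (ComparabilityWeight BETWEEN 0 AND 1)
)

INSERT INTO Calculations.SoftSetColumnWeightTable
(Table1Name, Column1Name, Table2Name, Column2Name, ComparabilityWeight)
VALUES (N'Table1', N'Height', N'Table2', N'Height', 1),
(N'Table1', N'Width', N'Table2', N'Width', 1),
(N'Table1', N'Age', N'Table2', N'Time', 0.3) -- only tangentially related, so it gets a partial weight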
…………I probably won’t spend much time in this series on even more sophisticated variants that might nonetheless be useful in modeling particular problems. Shadowed sets use multidimensional projections to qualify the lack of knowledge of whether or not a data point belongs to a fuzzy set. Neural nets are a cutting-edge topic I hope to tackle on this blog in the distant future (my interest in data mining was piqued way back in the 1990s when I saw some I cooked up at home do remarkable things) but it is fairly easy to describe Neuro-Fuzzy Sets, in which we’re merely using neural nets to perform some of the functions related to fuzzy sets. The combinations that can be derived from them are limited only by one’s imagination; there are already neural nets in use in industry today that use fuzzy functions for activation and fuzzy sets whose membership values are derived from neural nets, and so forth. Undetermined and Neutrosophic Logic are variants of fuzzy logic that can be applied to fuzzy sets if we need to model different types of indeterminacy, which is a topic I’ll take up in a future article on how fuzzy sets can be put to good use in uncertainty management.
…………Blurry sets are a recent innovation designed to incorporate the kind of combinations of fuzziness we’ve just mentioned, but without sacrificing the benefits of normal logic – which might be of great benefit in the long run, since the value of some of these recently developed logical systems is at best unproven.[7] Some will probably be substantiated in the long run, but some seem to be motivated by the sort of attention-getting shock value that can make academicians famous overnight these days (some of them seem to be implementations and formal defenses of solipsism, i.e. one of the defining characteristics of schizophrenia). Q-Sets are apparently an even more advanced variant developed for use in the strange world of quantum physics; since making Schrödinger’s cat disappear isn’t among most SQL Server users’ daily duties, I’ll leave that one out for now. I’ll probably also steer away from discussing more advanced types of fuzzy sets that include multiple membership functions, which aren’t referenced often in the literature and apparently are implemented only in rare circumstances. Intuitionistic Sets have two, one for membership and one for non-membership, while Vague Sets also use two, except in that case one assesses the truth of the evidence for a record’s membership and the other its falsehood; I presume truth tables and the like are then built from the two values. A novel twist on this theme is the use of multiple membership functions to model the fact that the programmer is uncertain of which membership functions to use in defining fuzzy sets.[8] Multisets are often lumped in with the topic of fuzzy sets, but since they’re just sets that allow duplicate values, I don’t see much benefit in discussing them here. Genuine sets take fuzziness to a new level in an entirely different way, by generalizing the concept of fuzzy set in the same manner that fuzzy sets generalize ordinary “crisp” sets, but I won’t tack on another layer of mathematical complexity at this point, not when the potential for using the established methods of generalization has barely been scratched.

False Mysticism and the Fuzzy Mystique

                This wide-open field is paradoxically young in terms of mathematical intellectual history, but overripe for implementation, given that many productive uses for it were derived decades ago but haven’t percolated down from academia yet. Taking a long view of the history of math, it seems that new waves of innovation involve the addition of new dimensions to existing objects. Leonhard Euler brought complex numbers into the mathematical mainstream in the 18th Century, then theoreticians like Bernhard Riemann and Charles Hinton contributed the concepts of higher-dimensional space and its curvature in the 19th. Around the same time, Georg Cantor was working out set theory and such mind-blowing structures as high-cardinality infinities and transfinite numbers. More recently, Benoit Mandelbrot elaborated the theory of fractional dimensions, which are now cornerstones in chaos theory and modern art, where they go by the better-known term of fractals. This unifying principle of mathematical innovation stretches back as far as ancient Greece, when concepts like infinity, continuous scales and the like were still controversial; in fact, the concept of zero did not reach the West until it was imbibed from Arab sources in the Middle Ages. Given that zero was accepted so late in history, it is not at all surprising that negative numbers were often derided by Western mathematicians as absurdities well into the 18th and 19th Centuries, many centuries after their discovery by Chinese and Indian counterparts.[9] A half-conscious prejudice against the infinite regress of non-repeating digits in pi and Euler’s number is embedded in the moniker they still go by today, “irrational numbers.” The same culprit is behind the term “imaginary number” as well. Each of these incredibly useful innovations was powered by the extension of scales into previously uncharted territory; each was also met by derision and resistance at first, as were fuzzy sets to a certain extent after their development by 20th Century theoreticians like Max Black and Lotfi A. Zadeh.
…………Many of these leaps forward were also accompanied by hype and a sort of unbalanced intellectual intoxication, which is the main risk in using these techniques. Fuzzy sets are unique, however, in that some of the pioneers were conscious of the possibility of leveraging the term “fuzzy” for attention; Zadeh openly acknowledges that the term has its uses in terms of publicity power, although he did not originally invent the term for that purpose. The strategy has backfired to a certain extent, however, by drawing the wrong kind of attention. “Fuzzy” is a term that immediately conjures up many alternative images, many of which don’t seem conducive to a high-powered, mission-critical production environment – like teddy bears, static, 1970s cop shows and something out of the back of George Carlin’s fridge.
…………Many of the taxonomic terms listed above also carry a kind of shock value to them; in other branches of academia this usually signifies that the underlying theory is being overstated or is even the product of crackpots with tenure, but in this case there is substantial value once the advertising dross has been stripped away. In fact, I’d wager that if more neutral terms like “graded set” or “continuously-valued set” were used in place of “fuzzy,” these techniques would be commonplace today in a wide variety of industries, perhaps even database management; in this case, the hype has boomeranged by stunting the adoption of an otherwise indispensable set of tools. As McNeill points out, some of the researchers employed in implementing fuzzy sets in various industries (including the development of the space shuttle) back in the early ‘90s had to overcome significant institutional resistance from “higher-ups” who “fretted about image.”[10] They are right to fret within reason, because these tools can certainly be misapplied; in fact, I’ve seen brilliant theorists who grasp the math a lot better than I do abuse it in illogical ways (for the sake of being charitable, I don’t want to call them out by name). Some highly regarded intellectuals don’t recognize any boundaries to the theory, for all of reality is fuzzy in their eyes – which is the mark of fanaticism, and certain to stiffen any institutional resistance on the other side. Every mathematical innovation in history has not only been accompanied by knee-jerk opposition from Luddites on one side, but also unwarranted hype and irrational exuberance on the other; fuzzy sets are as susceptible to misuse by bad philosophers and fanatics as higher dimensions, chaos theory and information theory have been for decades, so it is not unwise to tread carefully and maintain intellectual sobriety when integrating fuzzy sets into any development process.
…………Perhaps the best way to overcome this kind of institutional resistance and receive backing for these techniques is to be up front and demonstrate that you recognize the hype factor, plus have clear litmus tests for discerning when and when not to apply fuzzy set theory. Two of these are the aforementioned criteria of searching for data that resides in between ordinal and continuous data in the hierarchy of Content types and sifting through natural language terms for imprecision modeling. It is also imperative to develop clear standards for differentiating between legitimate and illegitimate uses of fuzzy sets, to prevent the main risk: “fuzzifying” data that is inherently crisp. It is indeed possible to add graded boundaries to any mathematical objects (some of which we’ll explore later in this series), but in many cases, there is no need to bother. Fuzzy logic in the wrong doses and situations can even lead to fallacious conclusions. In fact, applying fuzziness to inherently crisp objects and vice-versa is one of the fundamental strategies human beings have employed since time immemorial to deceive both themselves and others. Here’s a case in point we’ve all seen: you tell your son or daughter they can’t have a snack, but you catch them eating crackers; invariably, their excuse involves taking advantage of the broad interval inherent in the term “snack,” a set which normally, but not always, included crackers. Of course, when people grow up they sometimes only get more skilled at blurring lines through such clever speech (in which case they often rise high in politics, Corporate America and the legal profession). Here’s an important principle to keep in mind: whenever you see a lot of mental energy expended to tamper with the definitions of things, but find the dividing lines less clear afterwards, then it’s time to throw a red flag. The whole point of fuzzy sets is not to obscure clear things, but to clear up the parts that remain obscure. Fuzziness is in exactly the same boat as mysticism, which as G.K. Chesterton once said, is only useful when it explains mysteries:

                “A verbal accident has confused the mystical with the mysterious. Mysticism is generally felt vaguely to be itself vague—a thing of clouds and curtains, of darkness or concealing vapours, of bewildering conspiracies or impenetrable symbols. Some quacks have indeed dealt in such things: but no true mystic ever loved darkness rather than light. No pure mystic ever loved mere mystery. The mystic does not bring doubts or riddles: the doubts and riddles exist already…The mystic is not the man who makes mysteries but the man who destroys them. The mystic is one who offers an explanation which may be true or false, but which is always comprehensible—by which I mean, not that it is always comprehended, but that it always can be comprehended, because there is always something to comprehend.”[11]

…………Fuzzy sets are not meant to mystify; they’re not nebulous or airy, but designed to squeeze some clarity out of apparently nebulous or airy data and logic. They are akin to spraying Windex on a streaky windshield; if you instead find your vision blocked by streaks of motor oil, it’s time to ask who smeared it there and what their motive was. Fuzziness isn’t an ingredient you add to a numerical recipe to make it better; it’s a quality inherent in the data, which is made clearer by modeling the innate imprecision that results from incomplete measurement, conflicting evidence and many other types of uncertainty. The point is not to make black and white into grey, but to shine a light on it, so that we can distinguish the individual points of black and white that make up grey, which is just a composite of them. These techniques don’t conjure up information; they only ensure that what little information is left over after we’ve defined the obvious crisp sets doesn’t go to waste. Fuzziness can actually arise from a surfeit of detail or thought, rather than a deficit of either; the definition of an object may be incomplete because so many sense impressions, images, stray thoughts, academic theories and whatnot are attached to its meaning that we can neither include them all nor leave any out.
…………As we shall see in future articles on uncertainty management, the manner in which the meaning of set membership can be altered to incorporate evidence theory and the like is indeed empowering, but calls for a lot of mental rigor to resist unconscious drifts in definition. It’s an all-too-human problem that can occur to anyone, particularly when mind-blowing topics are under discussion; it’s even noticeable at times in the writings of brilliant quantum physicists, who sometimes unconsciously define their terms slightly differently at the beginning of a book than at the end, in ways that nonetheless make all the difference between Schrödinger’s Cat being alive or dead. “Definition drift” also seems to be a Big Problem in Big Analysis for the same reason. It likewise seems to occur in texts on fuzzy sets, where the term “fuzz” is often accurately described on one page as a solution to innate imprecision, but on the next, is unconsciously treated as if it were a magic potion that ought to be poured on everything. Another pitfall is getting lost in all of the bewildering combinations of fuzziness I introduced briefly in the taxonomy, but the answer to that is probably to just think of them in terms of ordinary natural language and only use the academic names when sifting through the literature for appropriate membership functions and the like. Above all, avoid modeling crisp sets that have inherently Boolean yes-or-no membership values as fuzzy sets, because as the saying goes, you can’t be “a little bit pregnant.” Continuous scales can certainly be added to any math object, but if the object being modeled is naturally precise, then it is at best a waste of resources that introduces the risk of fallacious reasoning and at worst, an opening for someone with an axe to grind to pretend a particular scale is much more imprecise than it really is. One dead giveaway is the use of short scales in comparison to the length of the original crisp version. For example, this is the culprit when quibbling erupts over such obviously crisp sets as “dead” and “alive,” on the weak grounds that brain death takes a finite amount of time, albeit just a fraction of a person’s lifespan. It might be possible to develop a Ridiculousness Score by comparing the difference in intervals between those few moments, which occur on an almost infinitesimal scale, against an “alive” state that can span 70-plus years in human beings or the “dead,” which is always infinite. I haven’t seen that done in the literature, but in two weeks, I’ll demonstrate how the complements of fuzzy sets can be used to quantify just how imprecise our fuzzy sets are. The first two installments of this series were lengthy and heavy on text because we needed a solid grounding in the meaning of fuzzy sets before proceeding to lessons in T-SQL code, but the next few articles will be much shorter and immediately beneficial to anyone who wants to put it into action.

[1] pp. 290-293, Klir, George J. and Yuan, Bo, 1995, Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall: Upper Saddle River, N.J.

[2] IBID., pp. 287-288, 292-293.

[3] IBID., p. 281.

[4] For a quick introduction to the various fuzzy set types, see the Wikipedia article Fuzzy Sets at http://en.wikipedia.org/wiki/Fuzzy_set. I consulted it to make sure that I wasn’t leaving out some of the newer variants that came out since Klir and Yuan and some of the older fuzzy set literature I’ve read, much of which dates from the 1990s. I lost some of the citations to the notes I derived these three paragraphs from (so my apologies go out to anyone I might have inadvertently plagiarized) but nothing I said here can’t be looked up quickly on Wikipedia, Google or any recent reference on fuzzy sets.

[5] Hinde, Chris and Yang, Yingjie, 2000, A New Extension of Fuzzy Sets Using Rough Sets: R-Fuzzy Sets, pp. 354-365 in Information Sciences, Vol. 180, No. 3. Available online at the web address https://dspace.lboro.ac.uk/dspace-jspui/bitstream/2134/13244/3/rough_m13.pdf

[6] p. 17, Klir and Yuan.

[7] Smith, Nicholas J. J., 2004, Vagueness and Blurry Sets, pp 165-235 in Journal of Philosophical Logic, April 2004. Vol. 33, No. 2. Multiple online sources are available at http://philpapers.org/rec/SMIVAB

[8] See  Pagola, Miguel; Lopez-Molina, Carlos; Fernandez, Javier; Barrenechea, Edurne; Bustince, Humberto , 2013, “Interval Type-2 Fuzzy Sets Constructed From Several Membership Functions: Application to the Fuzzy Thresholding Algorithm,” pp. 230-244 in IEEE Transactions on Fuzzy Systems, April, 2013. Vol. 21, No. 2. I haven’t read the paper yet (I simply can’t afford access to many of these sources) but know of its existence.

[9] See Rogers, Leo, 2014, “The History of Negative Numbers,” published online at the NRICH.com web address http://nrich.maths.org/5961.

[10] pp. 261-262, McNeill.

[11] Chesterton, G.K., 1923, St Francis of Assisi. Published online at the Project Gutenberg web address http://gutenberg.net.au/ebooks09/0900611.txt

Implementing Fuzzy Sets in SQL Server, Part 0: The Buzz About Fuzz

By Steve Bolton

…………I originally planned to post a long-delayed series titled Information Measurement with SQL Server next, in which I’d like to cover scores of different metrics for quantifying the data our databases hold – such as how random, chaotic or ordered it might be, or how much information it might provide. I’m putting it off once again, however, because I stumbled onto a neglected topic that could be of immediate benefit to many DBAs: fuzzy sets and their applications in uncertainty management programs and software engineering processes. Since SQL Server is a set-based storage system, I always suspected that the topic would be directly relevant in some way, but never expected to discover just how advantageous they can be. As in my previous series on this blog, I’m posting this mistutorial series in order to introduce myself to the topic, not because I know what I’m talking about; writing about it helps reinforce what I learn along the way, which will hopefully still be of some use to others once all of the inevitable mistakes are overcome. In fact, I guarantee that every DBA and .Net programmer out there has encountered problems which could be more easily and quickly solved through these proven techniques for modeling imprecision, which is precisely what many software engineering and data modeling problems call for. Despite the fact that my own thinking on the topic is still fuzzy (as usual) I’m certain this series can be immediately helpful to many readers, since there’s such an incredible gap between the math, theory and implementations of fuzzy set techniques in other fields on the one hand, and their slow adoption in the relational and data mining markets on the other.
…………Instead of beating around the bush, I’ll try to encapsulate the purposes and use cases of fuzzy sets as succinctly as possible: basically, you look for indefinite terms in ordinary speech, then squeeze what little information content you can out of them by assigning grades to records to signify how strongly they belong to a particular set. Most fuzzy set problems are modeled in terms of natural language like this. The overlap with Behavior-Driven Development (BDD) and user stories is quite obvious, but after reading up on those hot topics a year prior to learning about fuzzy sets, I was immediately struck by how little these techniques of modeling imprecision are apparently used and how easy it would be to incorporate them into database and application development processes. Uncertainty is a notorious problem in any engineering process, but sets with graded memberships can even be used to capture it and flesh it out more thoroughly, as part of one of the programs of “uncertainty management” I’ll describe later in this series.

From Crisp Sets to Membership Functions

                These powerful techniques arise from the quite simple premise that we can assign membership values to records, which some SQL Server users might be doing from time to time without realizing that they were approaching the borderlands of fuzzy set theory. Most relational and cube data is in the form of what mathematicians call “crisp sets,” which don’t require membership functions because they’re clear-cut yes-or-no decisions; to theoreticians, these are actually just a special case of a broader class of fuzzy sets, distinguished only by the fact that their membership functions are limited to values of either 0 or 1. In the relational field as it stands today, you either include a row in a set or you don’t, without any in-between. In contrast, most fuzzy membership functions assign continuous values between 0 and 1; although other scales are possible, I have yet to see an example in the literature where any other scale was used. I doubt it is wise to use any other range even if there might be a performance boost of some kind in applying larger-scale float or decimal data types, given that it helps integrate fuzzy sets with the scales used in a lot of other hot techniques I’ll cover later, like Dempster-Shafer evidence theory, possibility theory, decision theory and my personal favorite, neural net weights. That overlap transforms fuzzy sets into an interchangeable part of sorts, in what might be termed modular knowledge discovery.
…………That all sounds very grandiose, but anyone can practice picking out fuzzy sets represented in everyday speech. Artificial intelligence researchers Roger Jang and Enrique Ruspini provide a handy list of obvious ones in a set of slides reprinted by analytics consultant Piero P. Bonissone, including Height, Action Sequences, Hair Color, Sound Intensity, Money, Speed, Distance, Numbers and Decisions. Some corresponding instances of them we encounter routinely might include Tall People, Dangerous Maneuvers, Blonde Individuals, Loud Noises, Large Investments, High Speeds, Close Objects, Large Numbers and Desirable Actions.[i] The literature is replete with such simple examples, of which imprecise weather and height terms like “cloudy,” “hot” and “short” seem to be the most popular. The key thing to look for in any BDD or user story implementation is linguistic states where the speech definitely signifies something, but the meaning is not quite clear – particularly when it would still be useful to have a sharper and numerically definable definition, even when we can’t be 100 percent precise.
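…………For instance, grading people by their membership in a fuzzy set of “tall” individuals takes nothing fancier than a CASE expression; the Physiques.PersonTable below, its columns and the cutoff points are all invented for illustration, with the crisp yes-or-no flag shown alongside the graded version for contrast:

SELECT PersonName, HeightInInches,
CASE WHEN HeightInInches >= 72 THEN 1 ELSE 0 END AS CrispTallFlag, -- the all-or-nothing version
CASE WHEN HeightInInches >= 76 THEN 1.0
WHEN HeightInInches <= 60 THEN 0.0
ELSE (HeightInInches - 60.0) / 16.0 END AS TallMembershipScore -- a graded version on the 0-to-1 scale
FROM Physiques.PersonTable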

Filling a Unique Niche in the Hierarchy of Data Types

                It may be helpful to look at fuzzy sets as a new tool occupying a new rung in the ladder of Content types we already work with routinely, especially in SQL Server Data Mining (SSDM). At the lowest level of data type complexity we have nominal data, which represents categories that are not ranked on any scale; these are basically equivalent to the Discrete Content type in SSDM and are often implemented in text data types or tinyint codes in T-SQL. On the next rung up the ladder we have ordinal data in which categories are assigned some rank, although the gaps may not be defined or even proportional; above that we have continuous data types (or the best approximation we can get, since modern computers can’t handle infinitesimal scales) that are often implemented in T-SQL in the float, numeric and decimal data types. Fuzzy sets represent a new bridge between the top two rungs, by providing more meaningful continuous values to ordinal data that in turn allow us to do productive things we couldn’t do with them before, like performing arithmetic, set operations or calculating stats. Any fuzzy set modeling process ought to focus on looking for data that is ordinal with an underlying scale that is not precisely discernible, but in which it would be useful to work with a continuous scale. That really isn’t much more difficult than picking imprecise terminology out of natural language, which anyone can make a game of. Given their “ability to translate imprecise/vague knowledge of human experts” we might also want to make a habit of flagging instances where we know a rule is operative, but has not yet been articulated.
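…………A bare-bones illustration of that bridge might look something like the query below, in which an ordinal column of weather category codes (invented here for the sake of example) is mapped onto a continuous membership grade in the fuzzy set of “hot days,” which can then feed the arithmetic, set operations and stats that the raw codes can’t:

SELECT ObservationDate, WeatherCategoryCode,
CASE WeatherCategoryCode
WHEN N'Frigid' THEN 0.0
WHEN N'Cold' THEN 0.15
WHEN N'Mild' THEN 0.4
WHEN N'Warm' THEN 0.7
WHEN N'Hot' THEN 0.9
WHEN N'Sweltering' THEN 1.0 END AS HotDayMembershipScore
FROM Weather.DailyObservationTable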
…………If one were to apply these techniques to database server and other computing terminologies, one of the most obvious examples of imprecise terms would be “performance.” As George J. Klir and Bo Yuan point out in their classic tome Fuzzy Sets and Fuzzy Logic: Theory and Applications, this is actually an instance of a specific type of fuzzy set called a fuzzy number, which I will introduce later in the series.[ii] Say, for example, that you have a table full of performance data, in which you’ve graded the records on scales of 0 to 1 based on whether they fall into categories like “Acceptable,” “Good” and perhaps “Outside Service Level Agreement Boundaries.” That still leaves open the question of what the term “performance” itself means, so it constitutes another level of fuzziness on top of the membership issue; in fact, it might be necessary to use some of the techniques already hashed out by mathematicians decades ago for combining the opinions of multiple experts to arrive at a common fuzzy definition of it.

Modeling Natural Language

                The heavy math in that resource may be too much for some readers to bear, but I highly recommended at least skimming the third section of Chapter 8, where Klir and Yuan identify many different types of fuzziness in ordinary speech. They separate them into four possible combinations of unconditional vs. conditional and unqualified vs. qualified fuzzy propositions, such as the statement “Tina is young is very true,” in which the terms “very” and “true” make it unconditional and qualified.[iii] They also delve into identifying “fuzzy quantifiers” like “about 10, much more than 100, at least about 5,” or “almost all, about half, most,” each of which is modeled by a different type of fuzzy number, which I’ll describe at a later date.[iv] Other distinct types to watch for in natural language include linguistic hedges such as “very, more, less, fairly and extremely” that are used to qualify statements of likelihood or truth and falsehood. These can be chained together in myriad ways, in statements like “Tina is very young is very true,” and the like.[v]
…………In a moment I’ll describe how chaining together such fuzzy terms and fleshing out other types of imprecision can lead to lesser-known but occasionally invaluable twists on fuzzy sets, but for now I just want to call attention to how quickly it added new layers of complexity to an otherwise simple topic. That is where the highly developed ideas of fuzzy set theory come in handy. The math for implementing all of these natural language concepts has existed for decades, so there’s little reason to reinvent the wheel – yet nor is there a need to overburden readers with all of the equations and jargon, which can look quite daunting on paper. There is a crying need in the data mining field for people willing to act as middlemen of sorts between the end users of the algorithms and their inventors, in the same way that a mechanic fills a similar niche between automotive engineers and drivers; as I’ve pointed out before, it shouldn’t require a doctorate in artificial intelligence to operate data mining software, but the end users are nonetheless routinely buried in equations and formulas they shouldn’t have to decipher. It is imperative for end users to know what such techniques are used for, just as drivers must know how to read a speedometer and operate a brake, but it is not necessary for them to provide lemmas, or even know what a lemma is. While writing these mistutorial series, I’m trying to acquire the skills to do that for the end users by at least removing the bricks from the briefcase, so to speak, which means I’ll keep the equations and jargon down to a minimum and omit mathematical proofs altogether. The jargon is indispensable for helping mathematicians communicate with each other, but is an obstacle to implementing these techniques in practice. It is much easier for end users to think of this topic in terms of natural language, in which they’ve been unwittingly expressing fuzzy sets their whole lives on a daily basis. I can’t simplify this or any other data mining topic completely, so wall-of-text explanations like this are inevitable – but I’d wager it’s a vast improvement over having to wade through whole textbooks of dry equations, which is sometimes the only alternative. Throughout this series I will have to lean heavily on Klir and Yuan’s aforementioned work for the underlying math, which I will implement in T-SQL. If you want a well-written discussion of the concepts in human language, I’d recommend Dan McNeill’s 1993 book Fuzzy Logic.[vi]

The Low-Hanging Fruits of Fuzzy Set Applications

                These concepts have often proved to be insanely useful whenever they’ve managed to percolate down to various sectors of the economy. The literature is so chock full of them I don’t even know where to begin; the only thing I see in common across the whole smorgasbord is that they seem to seep into industries almost haphazardly, rather than as part of some concerted push. Their “ability to control unstable systems” makes them an ideal choice for many control theory applications.[vii] Klir and Yuan spend several chapters on the myriad implementations already extant when they wrote two decades ago, in fields like robotics,[viii] estimation of longevity of equipment[ix], mechanical and industrial engineering[x], assessing the strength of bridges[xi], traffic scheduling problems[xii] (including the infamous Traveling Salesman) and image sharpening.[xiii] Another example is the field of reliability ratings, where Boolean all-or-nothing rankings like “working” vs. “broken” are often not sufficient to capture in-between states.[xiv] In one detailed example, they demonstrate how to couple weighted matrices of symptoms with fuzzy sets in medical diagnosis.[xv] Klir and Yuan also lament that these techniques are not put to obvious uses in psychology[xvi], where imprecision is rampant, and provide some colorful examples of how to model the imprecision inherent in interpersonal communication, particularly in dating.[xvii] As they point out, some messages are inherently uncertain, on top of any fuzz introduced by the receiver in interpretation; to that we can add the internal imprecision of the speaker, who might not be thinking through their statements thoroughly or selecting their words carefully.
…………Then there is a whole class of applications directly relevant to data mining, such as fuzzy clustering algorithms (like C-Means)[xviii], fuzzy decision trees, neural nets, state sequencing (“fuzzy dynamic systems and automata”)[xix], fuzzified virtual chromosomes in genetic algorithms[xx], fuzzy parameter estimation, pattern recognition[xxi], fuzzy regression procedures and regression on fuzzy data.[xxii] Most of that falls under the rubric of “soft computing,” a catch-all term for bleeding edge topics like artificial intelligence. The one facet of the database server field where fuzzy sets have succeeded in permeating somewhat since Klir and Yuan mentioned the possibility[xxiii] is fuzzy information retrieval, which we can see in action in SQL Server full-text catalogs.

The Future of Fuzzy Sets in SQL Server

                Like many of their colleagues, however, they wrote about ongoing research into fuzzy relational databases by researchers like Bill Buckles and F.E. Petry that has not come into widespread use since then.[xxiv] That is where this series comes in. I won’t be following any explicit prescriptions for implementing fuzzy relational databases per se, but will instead leverage the existing capabilities of T-SQL to demonstrate how easy it is to add your own fuzz for imprecision modeling purposes. Researcher Vivek V. Badami pointed out more than two decades ago that fuzz takes more code, but is easier to think about.[xxv] It takes very little experience with fuzzy sets to grasp what he meant by this – especially now that set-based languages like T-SQL that are ideal for this topic are widely used. I wonder if someday it might be possible to extend SQL or systems like SQL Server to incorporate fuzziness more explicitly, for example, by performing the extra operations on membership functions that are required for joins between fuzzy sets, or even more, fuzzy joins between fuzzy sets; later in the series I’ll demonstrate how DBAs can quickly implement DIY versions of these things, but perhaps there are ways to do the dirty work under the hood, in SQL Server internals. Maybe a generation from now we’ll see fuzzy indexes and SQL Server execution plans with Fuzzy Anti-Semi-Join operators – although I wonder how Microsoft could implement the retrieval of only one-seventh of a record and a third of another, using B-trees or any other type of internal data structure. In order to determine if a record is worthy of inclusion, it first has to be retrieved and inspected instead of passed over, which could lead to a quandary if SQL Server developers tried to implement fuzzy sets in the internals.
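…………Just to give a taste of the DIY approach I’ll flesh out later in the series, a crude fuzzy “join” can already be faked by matching two graded sets on their keys and letting the standard minimum operation supply the membership grade of each matched pair; the table and column names below are mere placeholders:

SELECT T1.PrimaryKey,
CASE WHEN T1.MembershipScore <= T2.MembershipScore THEN T1.MembershipScore
ELSE T2.MembershipScore END AS IntersectionMembershipScore -- the standard minimum T-norm
FROM FuzzySet1 AS T1
INNER JOIN FuzzySet2 AS T2
ON T1.PrimaryKey = T2.PrimaryKey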
…………The good news is that we don’t have to wait for the theoreticians to hash out how to implement fuzzy relational databases, or for Microsoft and its competition to add the functionality for us. As it stands, T-SQL is already an ideal tool for implementing fuzzy sets. In the next article, I’ll demonstrate some trivial membership functions that any DBA can implement on their own quite easily, so that these concepts don’t seem so daunting. The difficulties boil down chiefly to the fact that the possibilities are almost too wide open. Choosing the right membership functions to model the problem at hand is not necessarily straightforward, nor is selecting the right type of fuzzy set to model particular types of imprecision. As in regular data modeling, the wrong choices can sometimes lead not only to wasted server resources, but also to incorrect answers. The greatest risk, in fact, consists of fuzzifying relationships that are inherently crisp and vice-versa, which can lead to fallacious reasoning. Fuzz has become a buzzword of sorts, so it would be wise to come up with a standard to discern its true uses from its abuses. In the next installment, I’ll tackle some criteria we can use to discern the difference, plus provide a crude taxonomy of fuzzy sets and get into some introductory T-SQL samples.
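…………As a foretaste of just how trivial a membership function can be, the hedged sketch below rescales a numeric column to the 0-to-1 range with a simple min-max transform and treats the result as a grade of membership in the fuzzy set “unusually high values of this column.” The schema, table and column names are placeholders rather than objects from this series, and real applications would usually call for more carefully chosen functions than this one.

-- a minimal, hypothetical membership function: min-max rescaling of a numeric column,
-- so that the lowest value receives a grade of 0 and the highest a grade of 1
DECLARE @Min float, @Max float

SELECT @Min = Min(SomeNumericColumn), @Max = Max(SomeNumericColumn)
FROM SomeSchema.SomeTable
WHERE SomeNumericColumn IS NOT NULL

SELECT SomeNumericColumn,
       (SomeNumericColumn - @Min) / NULLIF(@Max - @Min, 0) AS MembershipScore
FROM SomeSchema.SomeTable
WHERE SomeNumericColumn IS NOT NULL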

[i] p. 18, Bonissone, Piero P., 1998, “Fuzzy Sets & Expert Systems in Computer Eng. (1).” Available online at http://homepages.rpi.edu/~bonisp/fuzzy-course/99/L1/mot-conc2.pdf

[ii] pp. 101-102, Klir, George J. and Yuan, Bo, 1995, Fuzzy sets and Fuzzy Logic: Theory and Applications. Prentice Hall: Upper Saddle River, N.J.

[iii] IBID., pp. 222-225.

[iv] IBID., pp. 225-226.

[v] IBID., pp. 229-230.

[vi] McNeill, Dan, 1993, Fuzzy Logic. Simon & Schuster: New York.

[vii] p. 8, Bonissone.

[viii] Klir and Yuan, p. 440.

[ix] IBID., p. 432.

[x] IBID., pp. 427-432.

[xi] IBID., p. 419.

[xii] IBID., pp. 422-423.

[xiii] IBID., pp. 374-376.

[xiv] IBID., p. 439.

[xv] IBID., pp. 443-450.

[xvi] IBID., pp. 463-464.

[xvii] IBID., pp. 459-461.

[xviii] IBID., pp. 358-364.

[xix] IBID., pp. 349-351.

[xx] IBID., p. 453.

[xxi] IBID., pp. 365-374.

[xxii] IBID., pp. 454-459.

[xxiii] IBID., p. 385.

[xxiv] IBID., pp. 380-381.

[xxv] p. 278, McNeill.

 

Goodness-of-Fit Testing with SQL Server Part 7.4: The Cramér–von Mises Criterion

By Steve Bolton

…………This last installment of this series of amateur tutorials features a goodness-of-fit metric that is closely related to the Anderson-Darling Test discussed in the last post, with one important caveat: I couldn’t find any published examples to verify my code against. Given that the code is already written and the series is winding down, I’ll post it anyway in the off-chance it may be useful to someone, but my usual disclaimer applies more than ever: I’m writing this series in order to learn about the fields of statistics and data mining, not because I have any real expertise. Apparently, the paucity of information on the Cramér–von Mises Criterion stems from the fact that experience with this particular measure is a lot less common than with its cousin, the Anderson-Darling Test. They’re both on the high end when it comes to statistical power, so the Cramér–von Mises Criterion might be a good choice when you need to be sure you’re detecting effects with sufficient accuracy.[i]
…………Although it is equivalent to the Anderson-Darling with the weighting function[ii] set to 1, the calculations are more akin to those of the other methods based on the empirical distribution function (EDF) we’ve discussed in this segment of the series. It is in fact a refinement[iii] of the Kolmogorov-Smirnov Test we discussed a few articles ago, one that originated with separate papers published in 1928 by statisticians Harald Cramér and Richard Edler von Mises.[iv] One of the advantages it appears to enjoy over the Anderson-Darling Test is that it seems to perform much better, in the same league as the Kolmogorov-Smirnov, Kuiper’s and Lilliefors Tests. One of the disadvantages is that the value of the test statistic might be distorted by Big Data-sized value ranges and counts, which many established statistical tests were never designed to handle; it does appear, however, to suffer from this to a lesser degree than the Anderson-Darling Test, as discussed last time around. Judging from past normality tests I’ve performed on datasets I’m familiar with, it seems to rank the ones that were definitely not Gaussian in the proper order, although perhaps not in the correct proportion.
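…………To spell that relationship out, both statistics belong (if I have the notation right) to the same family of quadratic EDF statistics, which integrate the squared discrepancy between the empirical distribution function and the hypothesized cumulative distribution function under a weighting function w:

Q = n \int_{-\infty}^{\infty} \left( F_n(x) - F(x) \right)^2 \, w(x) \, dF(x)

Setting w(x) = 1 yields the Cramér–von Mises Criterion, while weighting by 1 / [F(x)(1 − F(x))] yields the Anderson-Darling statistic, which is what gives the latter its extra sensitivity in the tails.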
…………That of course assumes that I coded it correctly, which I can’t verify in the case of this particular test. In fact, I had to cut out some of the T-SQL that calculated the Watson Test along with it, since the high negative numbers it routinely returned were obviously wrong. I’ve also omitted the commonly used two-sample version of the test, since sampling occupies a less prominent place in SQL Server use cases than it would in ordinary statistical testing and academic research; one of the benefits of having millions of rows of data in our cubes and relational tables is that we can derive more exact numbers, without depending as much on parameter estimation or inexact methods of random selection. I’ve also omitted the hypothesis testing step that usually accompanies the use of the criterion, for the usual reasons: the loss of information content incurred by boiling the metric down to a simple either-or choice, the frequency with which confidence intervals are misinterpreted and, most of all, the fact that we’re normally going to be doing exploratory data mining with SQL Server, not narrower tasks like hypothesis testing. One of the things that sets the Cramér–von Mises Criterion apart from other tests I’ve covered in the last two tutorial series is that the test statistic is compared to critical values from the F-distribution rather than the Chi-Squared or Student’s T, but the same limitation still arises: most of the lookup tables have gaps or stop at a few hundred values at best, and calculating them for the millions of degrees of freedom we’d need for such large tables would be computationally costly. Moreover, since I can’t be sure the code below for this less common metric is correct, there is less point in performing those expensive calculations.
…………The bulk of the procedure in Figure 1 is identical to the sample code posted for the Kolmogorov-Smirnov and Lilliefors Tests, which means they can and really ought to be calculated together. The only differences are in the final calculations of the test statistics, which are trivial in comparison to the derivation of the empirical distribution function (EDF) table variable from a dynamic SQL statement. The @Mean and @StDev aggregates are plugged into the Calculations.NormalDistributionSingleCDFFunction I wrote for Goodness-of-Fit Testing with SQL Server, part 2.1: Implementing Probability Plots in Reporting Services. If you want to test other distributions besides the Gaussian or “normal” distribution (i.e. the bell curve), simply substitute a different cumulative distribution function (CDF) here. The final calculation is straightforward: just subtract the CDF from the EDF, square the result and do a SUM over the whole dataset.[v] The two-sample test, which I haven’t included here for the sake of brevity and simplicity, merely involves adding together the results of the same calculation across two different samples and making a couple of minor corrections. I’ve also included a one-sample version of the test I saw cited at Wikipedia,[vi] since it was trivial to calculate. I would’ve liked to include the Watson Test, since “it is useful for distributions on a circle since its value does not depend on the arbitrary point chosen to begin cumulating the probability density and the sample points”[vii] and it therefore meets a distinct but important set of use cases related to circular and cyclical data, but my first endeavors were clearly inaccurate, probably due to mistranslations of the equations.
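…………For reference, the standard computational form of the one-sample statistic is usually written as below (assuming I’ve transcribed it correctly), where the x_i are the values sorted in ascending order, n is the count and F is the CDF being tested:

T = n\,\omega^2 = \frac{1}{12n} + \sum_{i=1}^{n} \left( \frac{2i - 1}{2n} - F(x_i) \right)^2

The 1/(12n) term corresponds to the @OneDividedByTwelveN constant in Figure 1 and the summand to the Power operation near the end of the procedure.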

Figure 1: T-SQL Code for the Cramér–von Mises Criterion

CREATE PROCEDURE [Calculations].[GoodnessOfFitCramerVonMisesCriterionSP]
@Database1 as nvarchar(128) = NULL, @Schema1 as nvarchar(128), @Table1 as nvarchar(128), @Column1 AS nvarchar(128)
AS

DECLARE @SchemaAndTable1 nvarchar(400), @SQLString nvarchar(max)
SET @SchemaAndTable1 = @Database1 + '.' + @Schema1 + '.' + @Table1

DECLARE @Mean float,
@StDev float,
@Count float

DECLARE @EDFTable table
(ID bigint IDENTITY (1,1),
Value float,
ValueCount bigint,
EDFValue float,
CDFValue decimal(38,37),
EDFCDFDifference decimal(38,37))

DECLARE @ExecSQLString nvarchar(max), @MeanOUT nvarchar(200), @StDevOUT nvarchar(200), @CountOUT nvarchar(200), @ParameterDefinition nvarchar(max)
SET @ParameterDefinition = '@MeanOUT nvarchar(200) OUTPUT, @StDevOUT nvarchar(200) OUTPUT, @CountOUT nvarchar(200) OUTPUT'

-- retrieve the global aggregates needed for the CDF and the final calculation
SET @ExecSQLString = 'SELECT @MeanOUT = CAST(Avg(Value) as float), @StDevOUT = StDev(Value), @CountOUT = CAST(Count(Value) as float)
       FROM (SELECT CAST(' + @Column1 + ' as float) as Value
              FROM ' + @SchemaAndTable1 + '
              WHERE ' + @Column1 + ' IS NOT NULL) AS T1'

EXEC sp_executesql @ExecSQLString, @ParameterDefinition, @MeanOUT = @Mean OUTPUT, @StDevOUT = @StDev OUTPUT, @CountOUT = @Count OUTPUT

-- build the EDF from the distinct values and their running counts
SET @SQLString = 'SELECT Value, ValueCount, SUM(ValueCount) OVER (ORDER BY Value ASC) / CAST(' + CAST(@Count as nvarchar(50)) + ' AS float) AS EDFValue
       FROM (SELECT DISTINCT ' + @Column1 + ' AS Value, Count(' + @Column1 + ') OVER (PARTITION BY ' + @Column1 + ' ORDER BY ' + @Column1 + ') AS ValueCount
              FROM ' + @SchemaAndTable1 + '
              WHERE ' + @Column1 + ' IS NOT NULL) AS T1'

INSERT INTO @EDFTable
(Value, ValueCount, EDFValue)
EXEC (@SQLString)

-- plug the global mean and standard deviation into the CDF, once per distinct value
UPDATE T1
SET CDFValue = T3.CDFValue, EDFCDFDifference = EDFValue - T3.CDFValue
FROM @EDFTable AS T1
       INNER JOIN (SELECT DistinctValue, Calculations.NormalDistributionSingleCDFFunction (DistinctValue, @Mean, @StDev) AS CDFValue
       FROM (SELECT DISTINCT Value AS DistinctValue
       FROM @EDFTable) AS T2) AS T3
       ON T1.Value = T3.DistinctValue

-- the 1 / (12n) term from the computational formula for the criterion
DECLARE @OneDividedByTwelveN As float = 1 / (12 * CAST(@Count as float))
DECLARE @TwoTimesCount as float = 2 * @Count
DECLARE @ReciprocalOfCount as float = 1 / CAST(@Count as float)

DECLARE @ResultTable table
(CramerVonMisesTest float)

INSERT INTO @ResultTable
SELECT CramerVonMisesTest
FROM (SELECT @OneDividedByTwelveN
+ Sum(Power((((2 * ID) - 1) / CAST(@TwoTimesCount as float)) - CDFValue, 2)) AS CramerVonMisesTest
       FROM @EDFTable) AS T1

SELECT CramerVonMisesTest
FROM @ResultTable

SELECT ID, Value, ValueCount, EDFValue, CDFValue, EDFCDFDifference
FROM @EDFTable

…………One potential issue with the code above is that I may need to input the EDF rather than the EDFCDFDifference in the final SELECT; in the absence of example data from published journal articles, I can’t be sure of that. Like some of its kin, the criterion can also be adapted “for comparing two empirical distributions”[viii] rather than using the difference between the EDF and the cumulative distribution function (CDF). Most of these concepts have already been covered in the last three articles in this segment of the series; in fact, pretty much everything except the last two SELECTs is identical to the Kolmogorov-Smirnov, Kuiper’s and Lilliefors procedures. As usual, the parameters allow users to perform the test on any numerical column in any database they have sufficient access to.

Figure 2: Sample Results from the Duchennes Table
EXEC   Calculations.GoodnessOfFitCramerVonMisesCriterionSP
              @Database1 = N'DataMiningProjects',
              @Schema1 = N'Health',
              @Table1 = N'DuchennesTable',
              @Column1 = N'PyruvateKinase'

Cramer Von Mises Results

…………I performed queries like the one in Figure 2 against two datasets I’ve used throughout the last two tutorial series, one on the Duchennes form of muscular dystrophy made publicly available by Vanderbilt University’s Department of Biostatistics and another on the Higgs Boson provided by the University of California at Irvine’s Machine Learning Repository. Now that I’m familiar with how closely their constituent columns follow the bell curve, I was not surprised to see that the LactateDehydrogenase and PyruvateKinase enzymes scored 0.579157871602764 and 2.25027709042408 respectively, or that the test statistic for the Hemopexin protein was 0.471206505704088. Once again, I can’t guarantee that those figures are accurate in the case of this test, but the values follow the expected order (the same cannot be said of the one-sample Wikipedia version, which varied widely across all of the columns I tested it on). Given that the test statistic is supposed to rise in tandem with the lack of fit, I was likewise not surprised to see that the highly abnormal first float column of the Higgs Boson Dataset scored a 118.555073824395. The second float column obviously follows a bell curve in histograms and had a test statistic of 0.6277795953021942279. Note that the results for the same columns in Goodness-of-Fit Testing with SQL Server Part 7.3: The Anderson-Darling Test were 5.43863473749926, 17.4386371653374, 5.27843535947881, 870424.402686672 and 12987.3380102254 respectively. One of the reasons I posted the code for this week’s test statistic despite my misgivings about its accuracy is that the numbers are easier to read than those for its closest cousin. Unlike the Kolmogorov-Smirnov, Kuiper’s and Lilliefors Tests, the Cramér–von Mises Criterion is not bounded between 0 and 1, but at least it doesn’t reach the inflated sizes of the Anderson-Darling stat. Furthermore, the vastly higher count in the Higgs Boson Dataset seems to swell the Anderson-Darling results even for clearly Gaussian data like the second float column, which makes it difficult to compare stats across datasets.

Figure 3: Execution Plan for the Cramér–von Mises Criterion (click to enlarge)
Cramer von Mises Execution Plan

…………Another advantage the Cramér–von Mises Criterion might enjoy over the Anderson-Darling Test is far better performance. The execution plan in Figure 3 is almost identical to those we saw in Goodness-of-Fit Testing with SQL Server Part 7.1: The Kolmogorov-Smirnov and Kuiper’s Tests and Part 7.2: The Lilliefors Test. Once again there are five queries, only two of which have any real cost. Both of those begin with nonclustered Index Seeks, which is usually a good sign. The only place we might be able to dredge up further savings is the Hash Match (Aggregate) operator – but there isn’t much point, since the procedure had the same stellar performance as the Kolmogorov-Smirnov, Kuiper’s and Lilliefors Tests. Given that the procedure is structured so similarly to its high-performing kin, it’s not surprising that it took only 24 seconds to race through the 11 million rows in the Higgs Boson Dataset when processing the first float column and 1:15 for the second; in contrast, the Anderson-Darling took a whopping 9:24 for Column1 and 9:09 for Column2. These times will be several orders of magnitude better when run on a real database server instead of my run-down development machine, but it would probably be wise to go with the better-performing measures anyway, assuming they’re otherwise a good fit for our use cases.
…………I originally intended to include the Akaike, Bayesian and several other metrics with “Information Criterion” in their name, but decided that these measures of mining model fitness would best be handled in an upcoming series titled Information Measurement with SQL Server. The last tutorial series on outlier detection ended with articles on Cook’s Distance and Mahalanobis Distance, which were both intended to segue into that more advanced series (which I’m entirely unqualified to write), in which we’ll tackle various types of entropy, measures of structure, quantification of order and other mind-blowing topics. We’ll hear from such familiar names as Cramér and Kolmogorov again in that series, but first I must take a detour into a topic that could be of immediate benefit to a wide range of SQL Server users. In the upcoming tutorial series Implementing Fuzzy Sets with SQL Server, I’ll explain how the tragically neglected techniques of fuzzy sets can be immediately applied to real-world problems that the SQL Server community encounters routinely, particularly those where modeling data on continuous scales would be preferable but is not currently done because of inexact measurement methods. This article was probably the weakest entry in the Goodness-of-Fit series, which was itself only of narrow interest to certain niches in the SQL Server user base; I only went off on this tangent because I recognized my own weaknesses in this area while writing the Outlier Detection with SQL Server series and sought to rectify them through the school of hard knocks. In the future I may tack a few more articles onto the end of this series, such as sample code for Mardia’s Multivariate Skewness and Kurtosis and other multivariate goodness-of-fit tests, but the bulk of this series is complete. In my next post I’ll introduce some of the basic concepts behind fuzzy sets, before providing code samples that should make this treasure trove of techniques immediately available to a wide range of SQL Server users, in order to solve classes of problems that are not being solved efficiently today. As with much of the research into data mining techniques, the academic study of fuzzy sets is at least two decades ahead of the practice. It’s high time it was brought within reach of non-specialists, many of whom could derive surprising practical benefit from these techniques.

 

[i] See the Wikipedia webpage “Anderson-Darling Test” at http://en.wikipedia.org/wiki/Anderson%E2%80%93Darling_test

[ii] IBID.

[iii] See the comment “Several goodness-of-fit tests, such as the Anderson-Darling test and the Cramer Von-Mises test, are refinements of the K-S test,” at National Institute for Standards and Technology, 2014,  “1.3.5.16 Kolmogorov-Smirnov Goodness-of-Fit Test,” published in the online edition of the Engineering Statistics Handbook. Available at http://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm

[iv] See the Wikipedia page “Cramér–von Mises Criterion” at http://en.wikipedia.org/wiki/Cram%C3%A9r%E2%80%93von_Mises_criterion

[v] I derived the formula from Anderson, T.W. and Stephens, M.A., 1994, The Modified Cramer-Von Mises Goodness-of-Fit Criterion for Time Series. Technical Report No. 47, Jan. 17, 1994. Published by the Office of Naval Research and the National Science Foundation. Available at the DTIC Online web address http://www.dtic.mil/dtic/tr/fulltext/u2/a275377.pdf Also see Xiao, Yuanhui; Gordon, Alexander and Yakovlev, Andrei, 2006, “A C++ Program for the Cramér-Von Mises Two-Sample Test,” pp. 1-15 in Journal of Statistical Software, January 2007. Vol. 17, No. 8. Available online at http://www.jourlib.org/paper/2885039#.VIXYZP4o4uU All three authors taught at the University of Rochester, which is on the other side of the city from me. I had to dust off my C++ for this one, which brought back interesting memories of typing “Hello World” while precariously balancing a laptop on my knees at my sister’s house ages ago, after having a few beers and a whole box of chocolates on Valentine’s Day.

[vi] See the Wikipedia page “Cramér–von Mises Criterion” at https://en.wikipedia.org/wiki/Cram%C3%A9r%E2%80%93von_Mises_criterion

[vii] Watson, G. S., 1962, “Goodness-of-Fit Tests on a Circle,” p. 57 in Biometrika, Vol. 49, No. 1 and 2. Available online at http://phdtree.org/pdf/33054228-goodness-of-fit-tests-on-a-circle-ii/

[viii] IBID.

Goodness-of-Fit Testing with SQL Server Part 7.3: The Anderson-Darling Test

By Steve Bolton

…………As mentioned in previous installments of this series of amateur self-tutorials, goodness-of-fit tests can be differentiated in many ways, including by the data and content types of the inputs and the mathematical properties, data types and cardinality of the outputs, not to mention the performance impact of the internal calculations in between them. Their uses can be further differentiated by the types of probability distributions or regression models they can be applied to and the points within those distributions where their statistical power is highest, such as in the tails of a bell curve or the central point around the median or mean. The Anderson-Darling Test differs from the Kolmogorov-Smirnov Test we recently surveyed and others in its class in a plethora of ways, some of which I was able to glean from sundry comments scattered across the Internet. Unlike many other such tests, it can be applied beyond the usual Gaussian or “normal” distribution to other distributions including the “lognormal, exponential, Weibull, logistic, extreme value type 1” and “Pareto, and logistic.”[1] This is perhaps its major drawing card, since many other such tests are limited to a really narrow range of distributions, usually revolving around the Gaussian.
…………In terms of interpretation of the test statistic, it is “generally valid to compare AD values between distributions and go with the lowest.”[2] When used with the normal distribution it is also “close to optimal” in terms of the Bahadur Slope, one of several methods of assessing the usefulness of the test statistics produced by goodness-of-fit tests.[3] One of the drawbacks is that “it performs poorly if there are many ties in the data.”[4] Another is that it may be necessary to multiply the final test statistic by specific constants when testing distributions other than the normal,[5] but I was unable to find references to any of them in time to include them in this week’s T-SQL stored procedure. This is not true of the Kolmogorov-Smirnov Test we surveyed a few weeks back, which is “distribution-free as the critical values do not depend on whether Gaussianity is being tested or some other form.”[6]

Parameter Estimation and EDA

                This particular limitation is not as much of an issue in the kind of exploratory data mining that the SQL Server community is more likely to use these tests for, given that we’re normally not performing hypothesis testing; I’ve generally shied away from that topic in this series for many reasons that I’ve belabored in previous articles, like the ease of misinterpretation of confidence intervals and the information loss involved in either-or hypothesis rejections. Don’t get me wrong, hypothesis testing is a valuable and often necessary step when trying to prove a specific point, at the stage of Confirmatory Data Analysis (CDA), but most of our mining use cases revolve around informal Exploratory Data Analysis (EDA), a distinction made in the ‘70s by John W. Tukey, the father of modern data mining.[7] Another issue with hypothesis testing is the fact that most of the lookup tables and approximations weren’t designed with datasets of millions of rows in mind, of the kind DBAs and miners of SQL Server cubes encounter every day.
…………This size difference has a side benefit, in that we generally don’t have to estimate the means and variances of our datasets, which is a much bigger issue in the kinds of small random samples that hypothesis tests are normally applied to. One of the properties of the Anderson-Darling Test is that parameter estimation is less of an issue with it, whereas the Lilliefors Test, the subject of last week’s article, is designed specifically for cases where the variance is unknown. There are apparently special formulations where different combinations of the mean and standard deviation are unknown, but these aren’t going to be common in our use cases, since the mean and variance are usually trivial to compute in an instant for millions of rows. Another noteworthy property that may be of more use to us is the fact that the Anderson-Darling Test is more sensitive to departures from normality in the tails of distributions in comparison with other popular fitness tests.[8]

The Perils and Pitfalls of Equation Translation

                It is not surprising that this long and varied list of properties differentiates the Anderson-Darling Test from the Kolmogorov-Smirnov, Kuiper’s and Lilliefors Tests we’ve surveyed in the last few articles, given that there are some marked differences in its internal calculations. The inner workings apparently involve transforming the inputs into a uniform distribution, which is still a bit above my head, because I’m still learning stats and stochastics as I go. The same can be said of some of the equations I had to translate for this week’s article, which contained some major stumbling blocks I wasn’t expecting. One of these was the fact that the Anderson-Darling Test is usually categorized along with other methods based on the empirical distribution function (EDF), which, as explained in recent articles, involves computing the difference between the actual values and the probabilities generated for them by the distribution’s cumulative distribution function (CDF). Nevertheless, the CDF is used twice in the calculation of the test statistic and the EDF is not used at all, which led to quite a bit of confusion on my part.
…………Another issue I ran into is the fact that the term “N + 1 – I” in the formula actually requires the calculation of an order statistic of the kind we used in Goodness-of-Fit Testing with SQL Server, part 6.1: The Shapiro-Wilk Test. I won’t recap that topic here, except to say that it is akin to writing all of the values in a dataset on a sheet of paper in order, then folding it in half and adding them up on each side. Prior to that discovery I was mired in trying various combinations of Lead and Lag that just weren’t returning the right outputs. I found an offhand remark after the fact in an academic paper (which I can’t recall in order to give proper credit) to the effect that the identification of this term as an order statistic is missing from most of the literature on the subject for some unknown reason. As I’ve learned over the past few months, the translation of equations[9] is not always as straightforward as I originally thought it would be (even though I already had some experience doing it back in fourth and fifth grade, when my father taught college physics classes and I used to read all of his textbooks). Other remaining issues with the code in Figure 1 include the fact that I may be setting the wrong defaults for the LOG operations on the CDFValues when they’re equal to zero, and the manner in which I handle ties in the order statistics, which may be incorrect. Some of the literature also refers to plugging in the standard normal distribution values of 0 and 1 for the mean and standard deviation. Nevertheless, I verified the output of the procedure on two different sets of examples I found on the Internet, so the code may be correct as is.[10]
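…………For readers who want to check the translation themselves, the standard form of the statistic (as given in sources like the NIST handbook cited below) can be written as follows, where the Y_i are the values sorted in ascending order, n is the count and F is the CDF of the distribution being tested:

A^2 = -n - \frac{1}{n} \sum_{i=1}^{n} (2i - 1) \left[ \ln F(Y_i) + \ln\left( 1 - F(Y_{n+1-i}) \right) \right]

The Y_{n+1-i} term is the order statistic mentioned above; in Figure 1 it is supplied by joining the ascending CDF values to the same values ranked in descending order with ROW_NUMBER.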

Figure 1: T-SQL Code for the Anderson-Darling Procedure
CREATE PROCEDURE Calculations.GoodnessofFitAndersonDarlingTestSP
@Database1 as nvarchar(128) = NULL, @Schema1 as nvarchar(128), @Table1 as nvarchar(128), @Column1 AS nvarchar(128)
AS

DECLARE @SchemaAndTable1 nvarchar(400), @SQLString nvarchar(max)
SET @SchemaAndTable1 = @Database1 + '.' + @Schema1 + '.' + @Table1

DECLARE @Mean float,
@StDev float,
@Count float

DECLARE @ValueTable table
(Value float)

DECLARE @CDFTable table
(ID bigint IDENTITY (1,1),
Value float,
CDFValue float)

DECLARE @ExecSQLString nvarchar(max), @MeanOUT nvarchar(200), @StDevOUT nvarchar(200), @CountOUT nvarchar(200), @ParameterDefinition nvarchar(max)
SET @ParameterDefinition = '@MeanOUT nvarchar(200) OUTPUT, @StDevOUT nvarchar(200) OUTPUT, @CountOUT nvarchar(200) OUTPUT'
SET @ExecSQLString = 'SELECT @MeanOUT = CAST(Avg(' + @Column1 + ') as float), @StDevOUT = CAST(StDev(' + @Column1 + ') as float), @CountOUT = CAST(Count(' + @Column1 + ') as float)
       FROM ' + @SchemaAndTable1 + '
       WHERE ' + @Column1 + ' IS NOT NULL'

EXEC sp_executesql @ExecSQLString, @ParameterDefinition, @MeanOUT = @Mean OUTPUT, @StDevOUT = @StDev OUTPUT, @CountOUT = @Count OUTPUT

SET @SQLString = 'SELECT ' + @Column1 + ' AS Value
FROM ' + @SchemaAndTable1 + '
WHERE ' + @Column1 + ' IS NOT NULL'

INSERT INTO @ValueTable
(Value)
EXEC (@SQLString)

-- calculate the CDF just once per distinct value, then join back to the full list of values
INSERT INTO @CDFTable
(Value, CDFValue)
SELECT T1.Value, CDFValue
FROM @ValueTable AS T1
INNER JOIN (SELECT DistinctValue, Calculations.NormalDistributionSingleCDFFunction (DistinctValue, @Mean, @StDev) AS CDFValue
       FROM (SELECT DISTINCT Value AS DistinctValue
       FROM @ValueTable) AS T2) AS T3
       ON T1.Value = T3.DistinctValue

-- the descending ROW_NUMBER supplies the N + 1 - I order statistic discussed in the text
SELECT SUM(((1 - (ID * 2)) / @Count) * (AscendingValue + DescendingValue)) - @Count AS AndersonDarlingTestStatistic
FROM
(SELECT TOP 9999999999 ID, CASE WHEN CDFValue = 0 THEN 0 ELSE Log(CDFValue) END AS AscendingValue
FROM @CDFTable
ORDER BY ID) AS T1
       INNER JOIN (SELECT ROW_NUMBER() OVER (ORDER BY CDFValue DESC) AS RN, CASE WHEN 1 - CDFValue = 0 THEN 0 ELSE Log(1 - CDFValue) END AS DescendingValue
       FROM @CDFTable) AS T2
       ON T1.ID = T2.RN

-- this statement is included merely for convenience and can be eliminated
SELECT Value, CDFValue
FROM @CDFTable
ORDER BY Value

…………The code turned out to be quite short in comparison to the length of time it took to write and the number of mistakes I made along the way. Most of it is self-explanatory for readers who are used to the format of the T-SQL procedures I’ve posted in the last two tutorial series. As usual, there is no null-handling, SQL injection or validation code, nor can the parameters handle brackets, which I don’t allow in my own code. The first four allow users to run the procedure on any column in any database they have sufficient access to. The final statement returns the table of CDF values used to calculate the test statistic, since there’s no reason not to now that the costs have already been incurred; as noted in the code comment, that statement is included merely for convenience and can be eliminated. The joins in the INSERT statement lengthen the code but actually make it more efficient, by enabling the calculation of CDF values just once for each unique column value. The Calculations.NormalDistributionSingleCDFFunction has been introduced in several previous articles, so I won’t rehash it here. In the SELECT where the test statistic is derived, I used the identity value in the join because ROW_NUMBER operations can be expensive on big tables, so I wanted to avoid doing two in one statement.

Figure 2: Sample Results from the Anderson-Darling Test
EXEC   Calculations.GoodnessofFitAndersonDarlingTestSP
              @Database1 = N'DataMiningProjects',
              @Schema1 = N'Health',
              @Table1 = N'DuchennesTable',
              @Column1 = N'PyruvateKinase'

Anderson-Darling Results

…………One of the concerns I had when running queries like the one in Figure 2 against the 209 rows of the Duchennes muscular dystrophy dataset and the 11 million rows of the Higgs Boson dataset (which I downloaded from Vanderbilt University’s Department of Biostatistics and the University of California at Irvine’s Machine Learning Repository and converted into SQL Server tables for use in these tutorial series) is that the values seemed to be influenced by the value ranges and cardinality of the inputs. Unlike the last three tests covered in this series, it is not supposed to be bounded between 0 and 1. As I discovered when verifying the procedure against other people’s examples, it’s not uncommon for the test statistic to get into the single or double digits. In the examples at the NIST webpage, those for the Lognormal and Cauchy distributions were in the double and triple digits respectively, while that of the double exponential distribution was well above 1, so it may not be unusual to get a test statistic this high; the range might also be distribution-dependent. This is exactly what happened with the LactateDehydrogenase, PyruvateKinase and Hemopexin columns, which scored 5.43863473749926, 17.4386371653374 and 5.27843535947881 respectively. Now contrast that range with the Higgs Boson results, where the second float column scored a 12987.3380102254 and the first a whopping 870424.402686672. The problem is not with the accuracy of the results, which are about what I’d expect, given that Column2 clearly follows a bell curve in a histogram while Column1 is ridiculously lopsided. The issue is that for very large counts, the test statistic seems to be inflated, so that it can’t be compared across datasets. Furthermore, once a measure gets up to about six or seven digits to the left of the decimal point, it is common for readers to semiconsciously count the digits and interpolate commas, which is a slow and tedious process. The test statistics are accurate, but suffer from legibility and comparability issues when using Big Data-sized record counts.

Figure 3: Execution Plan for the Anderson-Darling Procedure
Anderson-Darling Execution Plan

…………The procedure also performed poorly on the super-sized Higgs Boson dataset, clocking in at 9:24 for Column1 and 9:09 for Column2; moreover, it gobbled up several gigs of RAM and a lot of space in TempDB, probably as a result of the Table Spool in the execution plan above. Perhaps some optimization could also be performed on the Merge Join, which was accompanied by some expensive Sort operators, by forcing a Hash Match or Nested Loops. The major stumbling block is the number of Table Scans, which I tried to overcome with a series of clustered and non-clustered indexes on the table variables, but this unexpectedly degraded the execution time badly, in tandem with outrageous transaction log growth. I’m sure a T-SQL expert could spot ways to optimize this procedure, but as it stands, the inferior performance means it’s not a good fit for our use cases, unless we’re dealing with small recordsets and need to leverage its specific properties. All told, the Anderson-Darling procedure has some limitations that make it a less attractive option for general-purpose fitness testing than the Kolmogorov-Smirnov Test, at least for our unique use cases. On the other hand, it has a well-defined set of use cases based on well-established properties, which means it could be applied to a wide variety of niche cases. Among these properties is its superior ability “for detecting most departures from normality.”[11] In the last installment of this series, we’ll discuss the Cramér–von Mises Criterion, another EDF-based method that is closely related to the Anderson-Darling Test and enjoys comparable statistical power in detecting non-normality.[12]

[1] See National Institute for Standards and Technology, 2014,  “1.3.5.14 Anderson-Darling Test,” published in the online edition of the Engineering Statistics Handbook. Available at http://www.itl.nist.gov/div898/handbook/eda/section3/eda35e.htm

[2] Frost, Jim, 2012, “How to Identify the Distribution of Your Data using Minitab,” published March, 2012 at The Minitab Blog  web address http://blog.minitab.com/blog/adventures-in-statistics/how-to-identify-the-distribution-of-your-data-using-minitab

[3] p. 52, No author listed, 1997, Encyclopaedia of Mathematics, Supplemental Vol. 1. Reidel: Dordrecht. This particular page was retrieved from Google Books.

[4] No author listed, 2014, “Checking Gaussianity,” published online at the MedicalBiostatistics.com web address http://www.medicalbiostatistics.com/checkinggaussianity.pdf

[5] See the Wikipedia webpage “Anderson-Darling Test” at http://en.wikipedia.org/wiki/Anderson%E2%80%93Darling_test

[6] See the aforementioned MedicalBiostatistics.com article.

[7] See Tukey, John W., 1977, Exploratory Data Analysis. Addison-Wesley: Reading, Mass. I’ve merely heard about the book second-hand and have yet to read it, although I may have encountered a few sections here and there.

[8] IBID.

[9] For the most part, I depended on the more legible version in National Institute for Standards and Technology, 2014,  “1.3.5.14 Anderson-Darling Test,” published in the online edition of the Engineering Statistics Handbook. Available at http://www.itl.nist.gov/div898/handbook/eda/section3/eda35e.htm All of the sources I consulted though had the same notation, without using the term order statistic.

[10] See Frost, 2012, for the sample data he calculated in Minitab, and see Alion System Reliability Center, 2014, “Anderson-Darling: A Goodness of Fit Test for Small Samples Assumptions,” published in Selected Topics in Assurance Related Technologies, Vol. 10, No. 5. Available online at the Alion System Reliability Center web address http://src.alionscience.com/pdf/A_DTest.pdf  These START publications are well-written, so I’m glad I discovered them recently through Freebird2008’s post on Sept. 4, 2008 at the TalkStats thread “The Anderson-Darling Test,” which is available at http://www.talkstats.com/showthread.php/5484-The-Anderson-Darling-Test

[11] See the Wikipedia webpage “Anderson-Darling Test” at http://en.wikipedia.org/wiki/Anderson%E2%80%93Darling_test

[12] IBID.