Implementing Fuzzy Sets in SQL Server, Part 5: The Mystery of the Missing Left Join
By Steve Bolton
…………Information on set operations like complements, intersections and unions is plentiful in the literature on fuzzy sets, which made the last three articles in this series of amateur self-tutorials easier to write in a certain sense. These topics are far more complex than with ordinary “crisp” sets because there are so many different methods for calculating membership grades for the resultsets, each of which can be applied to different use cases in order to model imprecision. This can be especially useful in Behavior-Driven Development (BDD) and user stories, as well as in uncertainty management programs of all kinds, including reducing the insecurity involved in software engineering and data modeling. There may be extra steps involved in calculating these fuzzy set relations, but the topic is at least well-studied and the sources of information for the formulas are readily available. Strangely, that is not true for some of the set operations that SQL Server users encounter more often, like LEFT JOINs, INNER JOINs, OUTER JOINs, RIGHT JOINs and Cartesian Products. There are only 41 hits on Google for the terms “left join” “relational” “fuzzy set” combined, one of which is an off-hand reference from my own blog. I haven’t see these topics addressed much in the literature, which is rich in information on complements, intersections and unions, but not these common relational joins. Like many other puzzling aspects of fuzzy set theory, the reason behind this is readily apparent only after we get back to basics and frame the questions we’re asking of the database in natural language terms.
…………This problem is easily reducible to a single enigma, which nevertheless requires a lengthy explanation. As we saw in the last two tutorials, INNER and OUTER JOIN statements are useful in deriving fuzzy intersections and unions respectively. We really can’t use the standard T-SQL INTERSECT and UNION operators for these purposes, since we need to retrieve membership function values from both sides in order to calculate membership in the resultset. For all intents and purposes, these are the principal use cases for fuzzy INNER and OUTER JOIN operations. Although I won’t rule out the possibly of some oddball set operation that requires the fuzzy versions of these statements, without qualifying as fuzzy intersections and unions, such use cases are probably pretty rare. CROSS JOIN operations are easier to explain once we realize that all fuzzy binary relations are performed on the complete crisp Cartesian Product to derive a fuzzy subset of some kind. To perform a fuzzy Cartesian Product, we would have to take the membership grades of both sets and implement a mathematical operation of some type, just like with fuzzy intersections and unions. The CROSS JOIN operator isn’t used as often in T-SQL as its kin, in large part because retrieving every possible combination of records from both sides can tax the server beyond belief, but cross products aren’t mentioned much in the fuzzy set literature for an additional reason: a wide range of fuzzy subsets of it also qualify as fuzzy intersections and unions, depending on whether they’re implemented with classes of functions known as T-norms and T-conorms. The domain of possible resultsets is no wider than the crisp CROSS JOIN, but fuzzy intersections and unions take up a far wider space within it than their crisp counterparts.
…………It is certainly possible to define fuzzy Cartesian Product operators that are neither T-norms nor T-conorms, but the use cases for doing so are neither clear nor very wide. In order to derive such a measure, we’d probably have to take the next step up and select from the class of “norm operations” that encompass all types of possible fuzzy aggregates, including T-norms and T-conorms as special cases.[i] Unfortunately, the concept is so general that it provides even less guidance for deriving new types of fuzzy Cartesian Products for particular use cases than we have for matching T-norms and T-conorms to the right problems. What the CROSS JOIN, INNER JOIN and OUTER JOIN all have in common, however, is that both participating sets are given equal weight in the determining the results, which is also true for their fuzzy counterparts. As I always caution, I’m not an expert in these matters and am only writing on the topic in order to introduce myself to it, while hopefully helping other SQL Server users to avoid my inevitable mistakes. At the risk of committing another one, I’ll take a stab at solving the Mystery of the Missing Join, which may boil down to the fact that the LEFT JOIN statement gives preferential treatment to one side.
Comparing Membership Functions in LEFT JOINs
One of the most common scenarios for a LEFT JOIN in ordinary crisp sets is to join a dependent table to a parent via a foreign key. When the relationship works in reverse, we’re speaking of a RIGHT JOIN – which is just the converse of a LEFT JOIN, so I’ll dispense with any discussion of it. The join results typically have repeated values for the parent table, since they often match multiple rows in the dependent table. Whenever we work with fuzzy sets, spelling out what we’re looking for in explicit natural language terms often simplifies what may seem like really complex problems. In the case of crisp LEFT JOINs, we’re asking the database, “Give me all of the records in Set1, plus some additional information from Set2 that we sometimes do without.” Here’s the key problem we face: fuzzy set theory only matters in the context of a LEFT JOIN if those additional columns of information are included or excluded based on some membership function, which can be mathematically compared to the membership function of the parent table. In other words, the math involved in a fuzzy complement, union or intersection is needed to determine whether or not the results are included in the new set, but with a LEFT JOIN, oftentimes we’re just suppling more information in a Master-Detail situation – in which case, the single membership function of the parent ought to be sufficient to determine membership in the resultset. With intersections and unions, both sides of the binary relation are of equal importance, but in a LEFT JOIN, the left side takes precedence, with the right just being filler material that doesn’t affect the membership values of the parent. Perhaps the only time we might take a membership function on the dependent table into account is when it measures a type of uncertainty comparable to the parent, plus we specifically want to grade how relevant the additional detail is to the parent.
…………For example, let’s say we’re performing a fuzzy LEFT JOIN on a parent CustomerTable to a child AddressTable. If the first has a function that ranks how Tall a person is and the second carries a grade for membership in the set of Rural places, then we can retrieve the extra data in order to answer questions like, “Give me the set of Short people who live in Suburbs” without performing any math operations on either membership function. If we exclude the much more advanced topic of statistical independence, how Tall or Short a customer is has no bearing on how Rural or Urban their hometown is, so the two memberships don’t affect each other. We can just do a regular LEFT JOIN in such situations and return the two grades separately.
…………The question gets murkier when we consider dependent tables that measure the same quality as the parent, or nearly so. For example, let’s pretend we’re operating a database for a hardware store that has a CeilingHeight membership function defined on the AddressTable, also split into categories like Tall, Short, etc. (for the sake of argument, let’s say the customers volunteered this information about their homes in a survey or whatever). We can ask the database a simple question like, “Give me the set of Short people who live in homes with Tall ceilings” without doing any extra math, but there may be use cases where we can compare these two types of information to draw inferences. An example might be asking a question like, “Give me the set of people who belong to the category of Mismatched Customers,” which can in turn be defined as Short people with Tall ceilings and vice-versa. This could be calculated through LEFT JOINs with WHERE clauses on the Tall and Short values for the Customers, after which some computation can be applied to derive a new category from based on some comparison of that membership function to the CeilingHeight column of the child.
An Advanced Exception: LEFT JOINs Creating New Categories of Comparison
Keep in mind that with fuzzy intersections and unions, we’re essentially adding a new layer of fuzzy membership grades to the operation that joins the two sets; if we extend the same principle to fuzzy LEFT JOINs, then we’re likewise creating a new type of fuzzy set by defining the join in an imprecise way. I’m still a novice at all of this, but would suggest that it might be possible to discern the difference between this type of fuzzy LEFT JOIN from one that doesn’t require any extra math by looking at the results: if they define a new ordinal category that is graded on a continuous scale, then we’ve created a new type of fuzzy set from the two membership functions. If no such category is being created, then we can probably just return the child’s membership functions values as additional details if it’s germane to our query, without taking any extra steps. The question may be complicated by the fact that both tables may have multiple membership functions defined on them, some of which define sets that may not be relevant at all to the query at hand, while others can have various shades of meaning that may overlap. For example, a ProductTable might have different membership functions for such disparate characteristics as Color, Width and Availability, while the dependent table may even have stochastic labels like “Gaussian” or “Gamma,” if we needed to measure membership of a column in some kind of probability distribution.
…………Either way, we have to retrieve all the values we would for a LEFT JOIN, but would only perform additional math in certain use cases that seem to be much narrower than those for fuzzy unions and intersections. In these instances, we’re essentially automatically extending it to a new level, just as the Sugeno and Yager Complements add another level of membership grade on top of the existing membership function values. For that reason, I strongly suspect that we would have to select a fuzzy join function that most closely matches the type of imprecision we’re trying to model. As we saw in the last couple of articles, T-norms and T-conorms are the ideal mathematical structures for expressing fuzzy intersections and unions, but selecting the ones with the right mix of mathematical properties and output histograms is often a tough call. Research has been ongoing in the field for decades to narrow down those use cases, but the paucity of information on LEFT JOINs leaves us with an even shakier starting point than that. This is the weakest article in this series, in the sense that I can’t even provide T-SQL sample code for a rare (or even perhaps non-existent) set of use cases, which also would require advanced research that the professionals apparently haven’t deemed worthy to investigate much; nevertheless, anyone coming from the realm of crisp SQL database servers is going to be immediately struck by the absence of LEFT JOINs and could use a tentative answer to this nagging question. Now that we’re over this hump, in the next installment I’ll explain how to use the most common fuzzy quantifiers and measures, like triangular and trapezoidal numbers. These can be incredibly useful in the right circumstances, including making it child’s play to model common linguistic phrases like “about half” or “most” on both crisp and fuzzy sets. It is here that many SQL Server users can probably find immediate applications for the material covered in this series, which the untapped potential of fuzzy set theory can deal with better than any other alternative.
[i] p. 93, Klir, George J. and Yuan, Bo, 1995, Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall: Upper Saddle River, N.J.
Posted on November 7, 2016, in Implementing Fuzzy Sets in SQL Server and tagged Analytics, Data Mining, Data Science, Fuzzy, Fuzzy Logic, Fuzzy Sets, Knowledge Discovery, SQL, SQL Server, Steve Bolton, T-SQL. Bookmark the permalink. Leave a comment.