# Blog Archives

## Information Measurement with SQL Server, Part 4.1: The Kullback-Leibler Divergence

By Steve Bolton

…………This informal series of tutorials on how to implement a hodgepodge of information metrics in SQL Server is somewhat disorganized by necessity, given that I’m trying to cram in every noteworthy type of measure into one series while simultaneously acquainting myself with the topics along the way. That’s going to require some skipping around, but whenever possible, I try to provide what little structure I can by starting off each segment with the simplest and/or more fundamental measures. The formula underlying the Kullback-Leibler Divergence is not quite as easy as certain intuitive distance measures like Euclidean Distance, which belongs to a group of measures that are more appropriate for calculating the separation between data points in clustering algorithms, all of which I’ll deal with later on in this series if all goes according to plan. Yet neither is the Kullback-Leibler Divergence (KL Divergence) much of a burden in terms of comprehension and calculation, as long as we already have concepts like information entropy under our belts.
…………I kicked off this series with a segment on entropic measures because they are such elementary parts of information theory, whose subject matter often overlaps that of this amateur series. I saved Information Measurement with SQL Server, Part 2.4: Conditional and Joint Entropy for the tail end of that segment because it involved comparisons between two entropies rather than one, which is also the case with the topic of this week’s mistutorial; the good news is that it is actually takes a little less code and even performs better, with faster calculation times and fewer burdens on server resources. The KL Divergence and its relatives serve a different purpose though: instead of measuring the association between two probability distributions as Conditional and Joint Entropy do, they are used to assess how far apart an approximate distribution is from its target. To compute it, we merely need to divide each actual probability value by its approximate counterpart and plug the results into the LOG operation in Shannon’s Entropy, which I covered in a previous article. Various shades of meaning can be attached to the KL (as is always the case with the fundamental information metrics), but they all boil down to measuring some kind of information loss, in terms of how much we can learn from particular data values when we encounter them, i.e. of  “newsworthiness.”. The ultimate goal is often to select better approximations by minimizing the KL Divergence, which is useful in a wide range of applications.

The Cold War Origins of a Foundational Information Metric

This indispensable information measure was the brainchild of Solomon Kullback (1907-1994) and Richard Leibler (1914-2003), two more of those pioneering American cryptanalysts who never received proper credit for breaking the Japanese and Soviet ciphers in World War II and the early Cold War.[1] While working at the National Security Agency (NSA) at a time when it supposedly did not exist (hence the nickname “No Such Agency”[2]), the two code-breakers outlined this simple yet powerful measure in an classic 1951 paper in The Annals of Mathematical Statistics.[3] At the time Claude Shannon’s newborn measure of information entropy hadn’t yet been repurposed for discriminating between two distributions, so their groundbreaking idea opened up a whole new world of use cases for these entropic measures, which were then confined more to coding theory. Its inventors noted in the original paper how the measure can serve as a bridge between Shannon’s Entropy and Fisher Information, an important metric based on variances and covariances that I’ll have to explain in detail later in this series.
…………Both the Fisher Information and KL Divergence are vital to the field of information geometry, but we needn’t look to the cutting edge to find use cases. For example, it has applications in the study of “large deviations” where it acts as the rate function and can be used as an alternative measure of “surprise” in the kind of Bayesian statistics I introduced in the last article, if the approximation is treated as incorrect a priori knowledge.[4] The latter might be of more use to SQL Server DBAs and data miners than its original applications in coding theory, where it was interpreted either as a measure of compression or the count of bits needed to derive the incorrect code embodied in the approximation.[5] It can be adapted for statistical hypothesis testing[6], which as I noted in my Outlier Detection series is an uncommon use case in the SQL Server world; the same principles also make it attractive as a measure of the usefulness of mining models though, which users of SQL Server Data Mining (SSDM) might be able to leverage. Later in this long series I hope to cover the Akaike Information Criterion, which performs the same mining purposes by inferring the equivalent of the KL Divergence through calculations of statistical likelihood.[7] It has even served in recent years as an alternative to the distribution matching tasks I covered in the Goodness-of-Fit Testing with SQL Server series.
…………It has many deep relationships with other forms of entropy, hence the frequency of the nickname “Relative Entropy” in the information theory literature. In some situations it acts as the expected value of the Mutual Information[8] measure we covered a couple of articles ago, which is closely related to the type of information gain we’ll tackle in a future article on tree structures. Since the KL Divergence measures information loss, it makes sense that it would be useful in gauging information gain. Furthermore, the KL Divergence intersects in many complex ways with other measures of information distance; for example, when the main formula discussed in Information Measurement with SQL Server, Part 2.2: The Rényi Entropy and Its Kin is treated as a distance measure, it becomes the Rényi Divergence or α-Divergence, which the KL Divergence serves as a special case of[9]. It is also an F-Divergence derivable from a Bregman Divergence, all of which I’ll explain to the best of my inability later in this segment of the series. Given that it integrates with all of these diverse information measures in myriad ways, it is not surprising that the Kullback-Leibler is probably encountered more often in the data mining and information theory literature than any of these other information distances. These complex interrelationships work to our advantage, however, by enabling us to cast a web of metrics around whatever information we’re trying to capture in our tables and cubes. The KL Divergence is useful in large part because it integrates so well with many of its relatives, which makes it one of the strongest threads in the web.

Coding the KL

Since the KL Divergence can also be derived by subtracting the Shannon’s Entropy for the target distribution from another simple measure known as Cross Entropy[10], I included it in the sample T-SQL in Figure 1 as a check against my calculations. The code is structured like the six T-SQL samples in the segment on entropic measures, so it shouldn’t be too hard to follow despite the length of the text. In fact, I just cut and pasted a good part of it, particularly the derivation of probabilities from actual proportions found for the second float column in the same Higgs Boson dataset I’ve been using for practice material for ages[11] and their storage in an @EntropyTable table variable. The second set of probabilities are derived from a Calculations.NormalDistributionCDFFindProbabilityFunction, which is not depicted here but takes the mean and standard deviation for the column and plugs them into the Cumulative Distribution Function (CDF) for the Gaussian distribution (i.e. the bell curve). Note that I substituted an arbitrary value of 0.61 for the standard deviation in this case, simply because it generated non-zero probabilities for every ActualProportion; most other values did not, even when using the real standard deviation, which is a strong clue that Column2 would fail a normality test based on the Kullback-Leibler Divergence. In fact, the KL isn’t even defined in many cases like this. One of the restrictions on the Kullback-Leibler Divergence is that both distributions must be “absolutely continuous,” i.e. they sum to 1, which signifies that the counterpart of each non-zero probability must also be non-zero; this means that we can use the Value column as a sort of primary key for the @EntropyTable, but it’s actually an irritation that disqualifies many distributions from comparison. The first set of probabilities can never be zero because we’re only storing ones that have observed values in the dataset, but some nil values might sneak into the second set of probabilities if we populated it through a distribution that wasn’t absolutely continuous with the first. The @IsAbsolutelyContinuous variable acts as a check against this and sets the KL results to NULL when there are any non-zero values for the second probability. Both the ActualProportion and GaussianProbability indeed add up to 1 in this case, as can be verified by uncommenting the validation block in the middle.
…………The target probability (the ActualProportion in this case) is divided by the approximation (the GaussianProbability in this sample) and the result is bubbled up to what amounts to the Shannon’s Entropy formula,  featuring the usual LOG, multiplication and summation operations, except with a different input. I also calculate the Cross Entropy and Shannon’s Entropy in the same pass, in order to compute the simple check in the last SELECT against my calculation of the KL Divergence. Figure 2 depicts only slight discrepancies for this check at really small precisions, which could be due to rounding errors alone. The code could use some sprucing up (the CASE checks might not even be necessary) but it gets the point across and can serve as a springboard for anyone who wants to write production code based on it. Note how in Figure 2 the KL Divergence is much higher for Column1 in the top graphic than for Column2 in the bottom one. After using these two columns as practice data for the last few tutorial series, I know that Column2 is somewhat Gaussian and Column1 definitely is not, so I thought they’d make an ideal contrast. Since Column1 has a different distribution, I had to substitute a different fake standard deviation value to simulate the property of absolute continuity. I also had to set an arbitrary lag of Value – 0.1 to cover cases when the first proportion in the @EntropyTable generated a GaussianProbability of zero.

Figure 1: Sample T-SQL Code for the KL Divergence and Cross Entropy
DECLARE @LogarithmBase decimal(38,36)
SET @LogarithmBase = 2 2.7182818284590452353602874713526624977  10

DECLARE @KullbackLeiblerDivergence float, @CrossEntropy float, @ShannonsEntropy float, @IsAbsolutelyContinuous bit
–@JeffreysDivergence float,
–@KDivergence float,
–@TopsoeDistance float,
–@JensenShannonDivergence float,
–@JensenDifference float,

DECLARE @Count bigint, @Mean float, @StDev float

SELECT @Count=Count(*), @Mean = Avg(Column2), @StDev= 0.61 –, @StDev= StDev(Column2)
FROM Physics.HiggsBosonTable
WHERE Column2 IS NOT NULL

DECLARE @EntropyTable table
(Value decimal(33,29),
ActualProportion decimal(38,37),
GaussianProbability decimal(38,37),
SummationInputForEntropy float,
SummationInputForKullbackLeibler float,
SummationInputForCrossEntropy float
)

INSERT INTO @EntropyTable
(Value, ActualProportion, GaussianProbability, SummationInputForEntropy, SummationInputForKullbackLeibler, SummationInputForCrossEntropy)
SELECT Value, ActualProportion, GaussianProbability, ActualProportion *
Log(ActualProportion, @LogarithmBase) AS SummationInputForEntropy,
CASE WHEN GaussianProbability<= 0 THEN 0 ELSE ActualProportion * Log(ActualProportion / GaussianProbability, @LogarithmBase) END AS SummationInputForKullbackLeibler,
CASE WHEN GaussianProbability <= 0 THEN 0 ELSE ActualProportion * Log(GaussianProbability, @LogarithmBase) END AS SummationInputForCrossEntropy
FROM (SELECT Value, ActualProportion, Calculations.NormalDistributionCDFFindProbabilityFunction(ActualValueLag, Value, @Mean, @StDev) AS GaussianProbability
FROM (SELECT Value,  ActualProportion, Lag(Value, 1, Value 0.1) OVER (ORDER BY Value ASC) AS ActualValueLag
FROM (SELECT Value,  ValueCount / CAST(@Count as float) AS ActualProportion
FROM (SELECT Column2 AS Value, Count(*) AS ValueCount
FROM Physics.HiggsBosonTable
GROUP BY Column2) AS T1) AS T2) AS T3) AS T4

— VALIDATION – uncomment to debug
–SELECT SUM(CAST(ActualProportion AS float)), SUM(CAST(GaussianProbability AS float))
–FROM @EntropyTable

if this count is not = 0, then the distributions are not absolutely continuous and the KL Divergence is not defined
SELECT @IsAbsolutelyContinuous = CASE WHEN Count(*) > 0 THEN 0 ELSE 1 END
FROM @EntropyTable
WHERE GaussianProbability = 0

SELECT @ShannonsEntropy = ABS(SUM(SummationInputForEntropy)), @CrossEntropy = ABS(SUM(SummationInputForCrossEntropy)),
@KullbackLeiblerDivergence = SUM(SummationInputForKullbackLeibler)
FROM @EntropyTable

SELECT @IsAbsolutelyContinuous AS IsAbsolutelyContinuous, @ShannonsEntropy AS ShannonsEntropy,
@CrossEntropy AS CrossEntropy,
ABS(@CrossEntropy @ShannonsEntropy) AS CalculationCheck,
ABS(@CrossEntropy @ShannonsEntropy) @KullbackLeiblerDivergence AS DiscrepancyWithCrossEntropyCheck,
CASE WHEN @IsAbsolutelyContinuous = 1 THEN @KullbackLeiblerDivergence ELSE NULL END AS KullbackLeiblerDivergence

Figure 2: Results from the First and Second Float Columns in the Higgs Boson Dataset (click to enlarge)

…………The accuracy of the CrossEntropy check still needs some work, but the performance is stellar. It runs in just 3 seconds on my rusty abacus of a development machine, which hardly qualifies as a modern production server. The execution plans are quite similar to those for in the segment on entropic measures, albeit even simpler. The INSERT accounts for 98 percent of the batch costs, and one Sort operator accounts for 49 percent of its costs, with 45 percent more attributable to the actual Table Insert. That means it probably can’t optimized much further – unless of course you can have precalculated probabilities that can be read directly from a table, in which case most of this code and the associated costs in the INSERT can be sliced in half.  It does soak up more RAM and TempDB space than I’d like, but this is only a problem when calculating all of these probabilities on the fly on a little desktop machine.
…………You’ve probably noticed already how I commented out five other measures at the beginning of Figure 1. Since they’re all members of the same family of entropic distance measures I’ll be tackling them in the next article, which will involve merely pasting the same code and altering some of the math operations in the INSERT. None of them are particularly difficult to translate into T-SQL, so I’ll dispose of them in one fell swoop. This segment will basically consist of diving into a maelstrom of distance measures, many of which are easy to code and also have minimal performance costs; many of them are used in clustering algorithms, which may be the hidden reason why the SSDM functionality I covered in A Rickety Stairway to SQL Server Data Mining, Algorithm 7: Clustering ran so efficiently compared to other built-in algorithms. The overarching reason why theoreticians have develop so many of these distances metrics is that they exhibit a welter of different properties, some of which are more conducive to certain data mining and statistical tasks than others. For example, divergences don’t have to be symmetric, in the sense that measuring in reverse from the destination to the source is not guaranteed to return the same values as the original formula. Nor are they subadditive, in that the distances between each set of data points and a third reference point must always sum to greater than or equal to the distance between the two distributions. Technically speaking, “distances” exhibit symmetry but are not required to be subadditive (i.e. they do not have to obey the “triangle inequality”). Throughout this series I’ve been using the term “metric” in a casual way, but in mathematical parlance it also has the stricter meaning of a distance measure that exhibits both symmetry and subadditivity. The KL does neither, hence the need for some of the compensatory tweaks embodied in the divergences we’ll discuss next time around; the KL is still remarkably useful without these desirable properties though[12], to the extent that it is still more commonly used than any of these variants, the most well-known of which may be the Jensen-Shannon Divergence. After that, we’ll tackle divergences, distances and metrics that aren’t limited to probabilistic inputs. In the meantime, I highly recommend Cha Sung-Hyuk’s well-organized and highly readable taxonomy of distance measures, which explains better than I can how they fit together.[13] There’s such an abundance of data mining measures available in the mathematical literature that I doubt anybody will be able to corner them all in the analytics marketplace. By the end of the series, however, I hope to demonstrate how several dozen of the leading metrics can be “weaponized” in T-SQL form and put into action in cornering the uncertainty in our SQL Server tables and cubes from every possible direction. The most numerous section of our arsenal will consist of these divergences and distances, which will allow us to perform DIY data mining with greater skill than before.

[1] See the Wikipedia pages “Solomon Kullback” and “Richard Leibler” at http://en.wikipedia.org/wiki/Solomon_Kullback and http://en.wikipedia.org/wiki/Richard_Leibler respectively.

[2] Which was still in use long after the Church Committee publicly exposed its illegal domestic surveillance operations in the mid-‘70s.

[3] Kullback, Solomon and Leibler, Richard, 1951, “On Information and Sufficiency,” pp. 79-86 in The Annals of Mathematical Statistics, March 1951. Vol. 22, No. 1. Available online at the West Virginia University Lane Department of Computer Science and Electrical Engineering web address http://www.csee.wvu.edu/~xinl/library/papers/math/statistics/Kullback_Leibler_1951.pdf KL

[4] See the Wikipedia page “Kullback-Leibler Divergence” at http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence

[5] See the Wikipedia page “Information Theory” at http://en.wikipedia.org/wiki/Information_theory

[6] See the Wikipedia page “Kullback-Leibler Divergence” at http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence

[7] See the Wikipedia page “Akaike Information Criterion” at http://en.wikipedia.org/wiki/Akaike_information_criterion

[8] See the Wikipedia page “Information Gain in Decision Trees” at http://en.wikipedia.org/wiki/Information_gain_in_decision_trees

[9] See the Wikipedia article “Rényi Entropy” at http://en.wikipedia.org/wiki/R%C3%A9nyi_entropy

[10] See the Wikipedia page “Cross Entropy” at http://en.wikipedia.org/wiki/Cross_entropy

[11] This dataset is made publicly available by the University of California at Irvine’s Machine Learning Repository. which I converted it to a SQL Server table of about 5 gigabytes at the start of my Outlier Detection series.

[12]  “We can say that KL divergence has some of the properties of a metric on the space of probability distribution: it’s non-negative, with equality only when the two distributions are equal (a.e.). Unfortunately, however, it is not symmetric, and it does not obey the triangle inequality. (This is why it’s the KL divergence rather than the KL distance.) Nonetheless, it’s enough like a metric that it can be used to construct a kind of geometry on the space of probability distributions, and so of statistical models, which can be extremely useful.” p. 193, Cosma Shalizi, Cosma, 2006, “Advanced Probability II or Almost None of Stochastic Processes,” monograph published at the Carnegie Mellon University Department of Statistics web address http://www.stat.cmu.edu/~cshalizi/754/2006/notes/all.pdf

[13] Sung-Hyuk Cha, 2007, “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions,” pp. 300-307 in International Journal of Mathematical Models and Methods in Applied Sciences, Vol. 1, No. 4.

Advertisements

## Information Measurement with SQL Server, Part 2.5: Mutual, Lautum and Shared Information

By Steve Bolton

…………The sample T-SQL I posted in the last article wasn’t as difficult as it looked, considering that it merely implemented the same code on the same data we used in Information Measurement with SQL Server, Part 2.1: The Uses and Abuses of Shannon’s Entropy, except for two columns rather than one. A little more logic was required to calculate Joint Entropy from them, but a GROUP BY CUBE and some GROUPING_IDs took care of that. That sample code was basically a two-for-one deal, because we merely need to tack the practically effortless code from this article onto it. Once we’ve derived measures like Conditional and Joint Entropy, it is child’s play to derive Mutual Information, which is one of the most widely used metrics in information theory. There are many equivalent ways of deriving it, all of which involve simple subtraction and addition operations on various combinations of the Joint Entropy, @ConditionalEntropyOfBGivenA, @ConditionalEntropyOfAGivenB and the Shannon’s Entropies for Column1 and Column2.[1] All of these were already calculated in last week’s code (which I’ll omit here for the sake of brevity) so this week’s installment should be a breeze. As is often the case in information theory, the difficulties consist chiefly in interpreting Mutual Information and some of the kindred metrics depicted in Figures 1 and 2.
…………Mutual Information functions as a kind of two-way version of Conditional Entropy, so that we can gauge how much “how much knowing one of these variables reduces uncertainty about the other,”[2] rather than just about one variable given a value for the other. The interpretation is pretty much the same though: if the two variables are perfectly dependent on one another the value will be zero, but as their degree of independence rises, so does their Mutual Information. In this way, Mutual Information can be harnessed to complement standard measures of association like covariance and correlation. In coding theory it can be interpreted as the information transferred before and after receiving a signal, which can be adapted to our purposes by interpreting it as the information transferred before and after a particular event is observed and added to our records.[3] The information-carrying capacity and other properties of communication channels also play a part in determining whether or not particular messages affect the Mutual Information of a code[4], which can be interpreted in terms of the frequency of values in database tables and cubes. Daniel P. Palomar and Sergio Verdí, the reinventors of an up-and-coming alternative known as Lautum Information[5], sum up the many uses for Mutual Information quite succinctly: “Mutual information has also proven a popular measure of statistical dependence in many experimental applications such as neurobiology, genetics, machine learning, medical imaging, linguistics, artificial intelligence, authentication, and signal processing.”[6]
…………Palomar and Verdí’s measure has been lauded as replacement for Mutual Information in certain contexts based on various mathematical properties, but I had a difficult time wading through the notation in their original paper from 2008, so I had to rely on the formula given by chemist Gavin E. Crooks in a handy .pdf guide to measures of entropy.[7] As I mentioned in the last article, there are several equivalent methods of calculating Joint Entropy, so I selected the one I wagered would be easiest to incorporate in the T-SQL I had already written for Shannon’s Entropy. Among these is a formula for computing it using a division operation, which we simply need to invert to derive Lautum Information; the name is apparently an anadrome of “mutual” chosen because it means “elegant” in Latin, so it too is an inversion of sorts.[8] It may be possible to simplify Figure 1 to calculate Lautum Information inside the same INSERT that Joint Entropy is derived in (which is not included here, for the sake of brevity), or at least derive it from Joint Entropy after the fact through simpler means than a complete pass over the whole @EntropyTable table variable. Surprisingly, the convoluted solution in the fourth SELECT adds next to nothing to the performance costs, despite the fact that the @EntropyTable contains more than 9,119,674 distinct combinations for Column1 and Column2, out of the 11 million rows in the Higgs Boson dataset I’ve been using for practice data for the last few self-tutorial series.[9] It was simpler to code Shared Information Distance, another metric that has apparently gained in popularity in recent years, which has been used in cutting-edge fields like information geometry and applications like plagiarism detection.[10] Its advantages over the better-established Mutual Information metric include the fact that it obeys the triangle inequality (a transitive property related to subadditivity, that separates true “metrics” from ordinary distance measures) and that it may qualify as a “universal metric,” so “that if any other distance measure two items close-by, then the variation of information will also judge them close.”[11] The Shared Information Distance is really easy to compute[12] once we have Joint Entropy and a Conditional Entropy, which in Figure 1 is calculated via the Shannon’s Entropy for Column 1.

Figure 1: Sample T-SQL for Four Entropic Information Measures
DECLARE @MutualInformation float,
@LautumInformation float,
@SharedInformationDistance float,
@ConditionalEntropyOfBGivenA float,
@ConditionalEntropyOfAGivenB float

SELECT @ConditionalEntropyOfBGivenA = @JointEntropy @ShannonsEntropy1,
@ConditionalEntropyOfAGivenB = @JointEntropy @ShannonsEntropy2
SELECT @MutualInformation = @ShannonsEntropy1 + @ShannonsEntropy2 @JointEntropy
SELECT @SharedInformationDistance = 1 (@ShannonsEntropy1 @ConditionalEntropyOfAGivenB) / @JointEntropy

SELECT @LautumInformation = SUM(JointProportion * Log(@JointEntropy / JointProportion))
FROM @EntropyTable

SELECT @ShannonsEntropy1 AS ShannonsEntropyForX, @ShannonsEntropy2 AS ShannonsEntropyForY, @JointEntropy AS JointEntropy,
@ShannonsEntropy1 + @ShannonsEntropy2 AS SumOfIndividualShannonEntropies,
@ConditionalEntropyOfBGivenA AS ConditionalEntropyOfBGivenA, @ConditionalEntropyOfAGivenB AS ConditionalEntropyOfAGivenB,
@MutualInformation AS MutualInformation, @LautumInformation AS LautumInformation, @SharedInformationDistance AS SharedInformationDistance

Figure 2: Self-Information and Rate Calculations
— Information and Entropy Rates
SELECT @ShannonsEntropy1 / CAST (@Count1 as float) AS EntropyRateForColumn1, @ShannonsEntropy2 / CAST (@Count2 as float) AS EntropyRateForColumn2, @JointEntropy / CAST (@JointCount as float) AS JointEntropyRate, @MutualInformation / @JointCount AS InformationRate

— Self-Information
SELECT DISTINCT Value1, 1 * Log(Proportion1, @LogarithmBase) AS SelfInformation
FROM @EntropyTable
WHERE Proportion1 IS NOT NULL
ORDER BY Value1
OFFSET 0 ROWS FETCH FIRST 100 ROWS ONLY

Figure 3: Results from the First Two Float Columns of the Higgs Boson Dataset (Click to enlarge)

…………The routine clocked in at 1:46, four seconds faster than the sample T-SQL from the last article, despite the fact that it’s identical except for the extra routines I appended. I expected the Table Scan required for my brute force calculation of Lautum Information to add significantly to the performance costs, but it accounted for less than 1 percent of the query batch; otherwise, the execution plan was pretty much indistinguishable to the one from the last post. Note that the results also contain a sample of the first 100 Self-Information values for the first float column, as defined in Figure 2. In a nutshell, this trivial calculation[13] tells us the entropy for each individual record. This can be used to identify, partition and cluster records that offer greater potential for information gain, among other things. Like many other information metrics, it has multiple layers of meaning. At the most fundamental level, Self-Information quantifies the question, “How much can we learn when this particular event occurs?” Another subtle implication is that it may constitute another means of quantifying “surprise” in addition to the possibilistic one I mentioned in Implementing Fuzzy Sets in SQL Server, Part 8: Possibility Theory and Alpha Cuts. Rare events are surprising by definition, but also more informative in the same specific sense of all information theory entropies, which tell us how “newsworthy” specific values can be. For example, it is really rare to see spectacular lenticular, anvil or Undulatus Asperatus clouds. Their presence can tell us a lot more about the weather than ordinary patterns though, since they require specific atmospheric conditions to produce them; I don’t know what their specific entropy values would be a weather model, but I’d wager that they’d be astronomically high. In the same way, the observation of rare events with high self-information values can be a clue that some extraordinary and often highly specific underlying process is at work.
…………Figure 2 also includes code for the Information Rate, which is even more trivial than Self-Information to compute once we have Mutual Information out of the way. The term is sometimes applied to the ordinary Entropy Rate, so I differentiated it in the code by referring to it as the MutualInformationRate. Both are merely averages over different quantities, the Mutual Information and Shannon’s Entropy; I should note though that there is a supremum symbol in the equations I found for the Information Rate[14], which may mean that I need to incorporate a MAX operation somewhere in there. My usual disclaimer is always lurking in the background: I’m posting code in these tutorial series in order to absorb the material faster, not because I know what I’m doing, so check my code over before putting into production if accuracy is of paramount concern. Also keep in mind that the interpretation and calculation of both may be affected by the properties of the underlying data, such as whether or not they can modeled as irreducible or aperiodic Markov chains.[15] Further nuances can be added for calculating entropy per message in cases where the input is unknown  and the Conditional Entropy “per message if the input is known,” which indicates the presence of noise if the value is non-zero.[16] Erroneous data leaves the former rate unchanged but increases the second at a steady rate dependent on the maximum probability.[17]
…………Just as with simply statistical building blocks like averages and standard deviation, the sky’s the limit when it comes to the combinations and variations we can build out of Shannon’s Entropy and its relatives. Information theory is a vast field, which was birthed in the ‘40s out of a desire to send clearer communication signals, but now extends its tendrils into fields as diverse as cryptography, data compression, Internet search algorithms and quantum physics. Modern Man is perpetually surrounded by technology that relies on the principles of information theory, regardless of whether they were introduced to the term as late in life as I was. My only concern in this series is to focus on the actual metrics used in information theory, particularly those that might be of benefit to SQL Server users in DIY data mining. In the future I might revisit these entropic measures, once I have a better understanding of which variants might be beneficial in such applications; for instance, I’m allured by the tantalizing possibility of calculating the channel capacity of neural nets, although I’m not yet sure how to go about it. I’ll cross those bridges when I come to them, but for now it would be more beneficial to shift our focus to other classes of information metrics, which calls for a whole lengthy segment on various distance and divergence measures. Shared Information Distance implies some share in that group by its very name, but it is also intimately related to Kolmogorov Complexity, a fascinating topic I hope to take up in a later segment on minimum information length metrics. Lautum Information has also apparently been referred to as the “reverse Kullback-Leibler Divergence,”[18] which links it to one of the most important distance measures used in data mining and knowledge discovery. The KL-Divergence is also known by the alias “Relative Entropy,” which has its own associated Relative Entropy Rate.[19] Mutual Information is likewise an important building block in many other information metrics, including Pointwise Mutual Information (PMI), which is used in Internet search engines; I’ll have to save PMI and related topics for a much later segment on semantic information metrics, which are rather advanced and often difficult to calculate and interpret. New information measures are being churned out by theoreticians much faster than the analytics marketplace can keep up with them; after skimming the literature over the last few years, I doubt that the gap will be crossed anytime soon, because it yawns as wide as the Valles Marineris. No single software package is going to be capable of implementing all of the worthwhile algorithms and metrics for a long time to come, which makes DIY data mining skills worthwhile to acquire. To that end, it would make sense to become familiar with certain rudimentary measures of existing knowledge, to complement entropic measures that tell us how much we don’t know about our data. To that end, next time around I’ll kick off another segment of this meandering mistutorial series with an introduction to Bayes Factors. These are fairly easily to calculate and may serve as bridges to Bayesian probability – a topic I know little about, but which makes use of Conditional Entropy – and Fisher Information, which is of vital importance in fields like data mining and information geometry. As the series progresses, metrics like these will allow us to box in the remaining uncertainty in our datasets little by little, as each new piece of the information puzzle sheds new light on our data from fresh directions.

[1] The Wikipedia page “Mutual Information” at http://en.wikipedia.org/wiki/Mutual_information is a handy reference for these formulas, but they’re readily available in many information theory texts, such as p. 16, Jones, D.S., 1979, Elementary Information Theory. Oxford University Press:  New York and pp. 49-62, Mansuripur, Masud, 1987, Introduction to Information Theory. Prentice-Hall: Englewood Cliffs, N.J.

[2] See the Wikipedia page “Mutual Information” at http://en.wikipedia.org/wiki/Mutual_information. It may be a free resource and quoting from it may be frowned upon, but sometimes the contributing writers word things particularly well or provide really succinct explanations that you can’t find in professional texts in various fields.

[3] pp. 126-127, Moser, Stefan M. and Po-Ning, Chen, 2012, A Student’s Guide to Coding and Information Theory. Cambridge University Press: New York.

[4] p. 64, Mansuripur.

[5] Which was originally mentioned in a 1945 paper by renowned Hungarian statistican Abraham Wald, the inventor of the well-known Wald Test. p. 964, Palomar, Daniel P, and Verdí, Sergio, 2008, “Lautum Information,” pp. 964-975 in IEEE Transactions on Information Theory, March 2008. Vol. 54, No. 3.

[6] IBID.

[7] p. 3, Crooks, Gavin E., 2015, , “On Measures of Entropy and Information,” monograph published Jan. 22, 2015 at the ThreePlusOne.com web address  http://threeplusone.com/info

[8] p. 964, Palomar and Verdí.

[9] The original source was the University of California at Irvine’s Machine Learning Repository. I converted it ages ago to a SQL Server table, which now takes up about 5 gigabytes of space in a sham DataMiningProjects database.

[10] For the latter, see Kleiman, Alan Bustos and Kowaltowski, Tomasz, 2009, Qualitative Analysis and Comparison of Plagiarism-Detection Systems in Student Programs. Technical Report available in pdf format at the Universidade Estadual de Campinas web address http://www.ic.unicamp.br/~reltech/2009/09-08.pdf

[11] See the Wikipedia page “Variation of Information” at http://en.wikipedia.org/wiki/Variation_of_information

[12] p. 3, Li, Ming, 2006, “Information Distance and Its Applications,” pp. 1-9 in Implementation and Application of Automata: 11th International Conference, CIAA 2006. Taipei, Taiwan, August 2006 Proceedings. Ibarra, Oscar H. ed. Springer-Verlag: Berlin.

[13] See p. 16, Jones and the Wikipedia article “Self-Information” at http://en.wikipedia.org/wiki/Self-information

[14] p. 219, Gray, Robert M., 2011, Entropy and Information Theory. Springer: New York.

[15] See the Wikipedia page “Entropy Rate ” at http://en.wikipedia.org/wiki/Entropy_rate

[16] p. 47-48, Goldman, Stanford, 1953, Information Theory. Prentice-Hall: New York.

[17] IBID.

[18] See Sarwate, Anand, 2013, “C.R. Rao and Information Geometry,” posted on April 13, 2013 at The Ergodic Walk web address http://ergodicity.net/2013/04/13/c-r-rao-and-information-geometry/

[19] p. 45, Goldman, Stanford, 1953, Information Theory. Prentice-Hall: New York.

## Information Measurement with SQL Server, Part 2.4: Conditional and Joint Entropy

By Steve Bolton

…………Since this series on using SQL Server to implement the whole gamut of information metrics is wide-ranging in nature, it will also be somewhat disorganized by necessity; this is doubly true, given that I’m writing it in order to learn the material faster, not because I’m already intimately familiar with these topics. Nonetheless, I’m trying to provide what structure I can along the way, including segregating the most important entropy measures of information theory in this early segment. I’m slowly ratcheting up the level of complexity within this section by introducing simpler concepts first and using them as stepping stones to more difficult ones, yet this week’s topic differs in the manner of its complexity from that of the previous article. The foray we took into thermodynamic entropies in the last post was difficult due to the depth of the subject matter whenever definitions of qualities like “randomness” and “order” are in play. I put off discussing Conditional and Joint Entropy until this post because their complexity is of a different type; basically, they just involve binary comparisons of entropy rather than the simple unary metrics I introduced in Information Measurement with SQL Server, Part 2.1: The Uses and Abuses of Shannon’s Entropy and Information Measurement with SQL Server, Part 2.2: The Rényi Entropy and Its Kin. They’re among the first topics discussed in texts on information theory and coding theory and shouldn’t be too difficult to fathom, if readers could swallow the last two installments in this series. Coding them is not terribly difficult, although they do present some challenges in terms of performance and interpretation.
…………One of the benefits of reading my amateur self-tutorials is that you get sample T-SQL code and a few paragraphs of commentary, which basically saves the hassle of having to consult gigantic math tomes chock full of arcane equations. Data miners shouldn’t have to give a dissertation on the theorems and lemmas that justify the underlying algorithms, any more than a commuter should be required to give a dissertation on automotive engineering in order to get a license. To that end, I usually remove all of the associated equations to suit my target audience of SQL Server DBAs and data miners, who can probably grasp them much easier in T-SQL form. I broke that rule in the article on Shannon’s Entropy for a sound reason and this time around, I’ll bend it a little in order to clear up some possible sources of confusion about probability notation. Symbols like P(A|B) denote a conditional probability, which signifies, “Given a value for B, what is the probability of A?” Joint probability is denoted by P(A,B) or P(AB), which can be read as, “What is the probability of two specific values for A and B occurring together?” Both concepts have counterparts in information theory, where we simply replace the P with the H symbol to denote entropy rather than probability. This may come in handy if readers want to try to one of several alternatives formulas for arriving at the same figures with these entropies, or want to double-check my sample T-SQL in Figure 1 (which is always a wise idea, since I don’t yet have any real expertise in these subjects).

Unit and Scaling Issues in the Sample T-SQL

I chose the methods for calculating both that I believed would mesh well with the code I posted a few articles ago for Shannon’s Entropy. Although the Figure 1 is somewhat lengthier than the sample T-SQL in that post, the same basic format is at work. Once again, I include a @LogarithmBase parameter so that users can easily switch between base 2, base 10 and Euler’s Number to derive the three main units use for entropy, bits (i.e. shannons), bans (i.e. hartleys) and nats respectively. The INSERT INTO populates a table variable by selecting from the same Higgs Boson dataset, which is ideal for stress-testing the kind of routines I’ve been posting for the last few tutorial series, since it has 11 million rows of mainly float columns.[1] Shannon’s Entropy is derived in precisely the same way as before by using intermediate table variable columns like SummationInput1, except that it must be done for two of the dataset’s float columns this time around. The tricky part is the use of the GROUP BY CUBE statement, which allows us to simultaneously calculate the joint proportions of both columns without traversing the table repeatedly. The GROUPING_ID conditions derived from it are used to distinguish the aggregates of the two columns and their joint proportions in the CASE statements, which were not necessary in the original sample T-SQL for Shannon’s Entropy. We likewise need to take two additional counts in this routine, since we need to perform calculations on an additional column and its joint distribution with the other; in this particular instance Count1 and Count2 are equal to the count of the whole dataset, i.e. the JointCount, simply because the Higgs Boson dataset has no null values for these columns. This is not going to be the case with many other datasets.
…………It is also good to keep in mind that I’m once again cheating a little by using known proportions as probability values, which could also be derived from alternative sources like sampling, probability distribution formulas and deductive reasoning about the underlying processes, as is possible with simple examples like dice and card games. Most of the texts I’ve read on information theory and coding theory begin with the same dice and card examples, but in most real-world applications, the underlying processes are far too complex to reason them out in advance. As I mentioned in the articles on the Hartley Function and thermodynamic entropies, in some cases it may be necessary to calculate probabilities across all possible values of a particular column, but that can get messy since it involves taking permutations and combinations over large data types. There’s simply no way we’re going to cram all 1038 + 1 values that are permissible in a decimal(38,0) column into standard combinatorics formulas, given that the highest values we can use for factorials is about 170. This is an inherent limitation of SQL Server’s data types, which actually do a much better job at this than many of its competitors in the data mining market, like Minitab and WEKA.  By using existing proportions, we can eliminate millions of values that are not found in the dataset, but for which we might have to assign nonzero values if we were using uniform distribution or whatever. It is worthwhile to note though that this kind of subtle fudging is more appropriate to the sizes of the tables used in relational databases and cubes than in ordinary scientific analysis, since we have extensive counts that can be leveraged in this manner. Our roles are almost reversed: it is far easier for DBAs and Analysis Services users to leverage actual counts of this kind, whereas it is much easier for scientists to perform calculations involving permutations, factorials and the like.
…………I also took the easy way out in setting the table variable data types, since Figure 1 is merely intended to convey the underlying concepts rather than to serve as production code. Values1 and 2 in the table variable are set to decimal(33,29) because that’s the precision and scale of the original columns, whereas the proportions are set to decimal(38,37), since they should never exceed 1 and therefore require just 1 digit to the left of the decimal place. The SummationInput columns are set to (38,21) arbitrarily, since I’m familiar with these two columns and was certain that this was enough space to accommodate them, while retaining enough precision to get the point across; I could have used floats, but kept getting annoying floating point rounding errors at lower precisions than I expected. The validation code in the middle of the routine can be uncommented if users want to inspect the table variable’s individual values or make sure that the proportions sum to 1; these check may reveal a few rounding errors that I haven’t yet been able to track down and weed out, but they’re occurring at trivial precision values and don’t really detract from the lessons. The good news is that once we calculate either the Joint Entropy or Conditional Entropy, it is trivial to derive the other using the Chain Rule[2], which involves some simple subtraction or addition operations. We therefore get two metrics for the price of one. I wagered that it would be less costly to derive the Joint Entropy first, since it can be done alongside the individual Shannon’s Entropy values less awkwardly than the Conditional Entropy can through alternative formulas.

Figure 1: T-SQL for Joint and Conditional Entropy
DECLARE @LogarithmBase decimal(38,36)
SET @LogarithmBase = 2 2.7182818284590452353602874713526624977  10

DECLARE @Count1  bigint, @Count2 bigint, @JointCount bigint, @ShannonsEntropy1 float, @ShannonsEntropy2 float, @JointEntropy float
SELECT @Count1=Count(*)
FROM DataMiningProjects.Physics.HiggsBosonTable
WHERE Column1 IS NOT NULL

SELECT @Count2=Count(*)
FROM DataMiningProjects.Physics.HiggsBosonTable
WHERE Column2 IS NOT NULL

SELECT @JointCount=Count(*)
FROM DataMiningProjects.Physics.HiggsBosonTable
WHERE Column1 IS NOT NULL OR Column2 IS NOT NULL

DECLARE @EntropyTable table
(Value1 decimal(33,29),
Value2 decimal(33,29),
ValueCount bigint,
Proportion1 decimal(38,37),
Proportion2 decimal(38,37),
JointProportion decimal(38,21),
SummationInput1 decimal(38,21),
SummationInput2 decimal(38,21),
JointSummationInput decimal(38,21)
)

INSERT INTO @EntropyTable
(Value1, Value2, ValueCount, Proportion1,Proportion2, JointProportion, SummationInput1, SummationInput2, JointSummationInput)
SELECT Value1, Value2,ValueCount, Proportion1,Proportion2, JointProportion,
Proportion1 * Log(Proportion1, @LogarithmBase) AS SummationInput1,
Proportion2 * Log(Proportion2, @LogarithmBase) AS SummationInput2,
JointProportion * Log(JointProportion, @LogarithmBase) AS JointSummationInput
FROM (SELECT Value1, Value2,ValueCount,
CASE WHEN GroupingIDColumn1 = 0 AND GroupingIDColumn2 = 1 THEN ValueCount / CAST(@Count1 AS float) ELSE NULL END AS Proportion1,
CASE WHEN GroupingIDColumn1 = 1 AND GroupingIDColumn2 = 0 THEN ValueCount / CAST(@Count2 AS float) ELSE NULL END AS Proportion2,
CASE WHEN GroupingIDColumn1 = 0 AND GroupingIDColumn2 = 0 THEN ValueCount / CAST(@JointCount AS float) ELSE NULL END AS JointProportion,
GroupingIDColumn1 = 0,GroupingIDColumn2
FROM  (SELECT Column1 AS Value1, Column2 AS Value2,  Count(*) AS ValueCount, GROUPING_ID(Column1) AS GroupingIDColumn1, GROUPING_ID(Column2) AS GroupingIDColumn2
FROM Physics.HiggsBosonTable
WHERE Column1 IS NOT NULL AND Column2 IS NOT NULL
GROUP BY CUBE (Column1, Column2)) AS T1) AS T2

for validation
–SELECT * FROM @EntropyTable

–SELECT SUM(Proportion1), SUM(Proportion2), SUM(JointProportion)
–FROM @EntropyTable

SELECT @ShannonsEntropy1 = SUM(SummationInput1) * 1,
@ShannonsEntropy2 = SUM(SummationInput2) * 1,
@JointEntropy = SUM(JointSummationInput) * 1
FROM @EntropyTable

SELECT @ShannonsEntropy1 AS ShannonsEntropyForX, @ShannonsEntropy2 AS ShannonsEntropyForY, @JointEntropy AS JointEntropy,
@ShannonsEntropy1 + @ShannonsEntropy2 AS SumOfIndividualShannonEntropies,
@JointEntropy @ShannonsEntropy1 AS ConditionalEntropyOfBGivenA,
@JointEntropy @ShannonsEntropy2 AS ConditionalEntropyOfAGivenB

Figure 2: Results from the Higgs Boson Dataset

…………Both Joint and Conditional Entropy must meet a bevy of validity tests, which the results in Figure 2 pass with flying colors. First, the Shannon’s Entropy for the first float column is identical to the results we received in previous articles. Likewise, the Joint Entropy is greater than the individual entropies of both columns, but less than their sum, as indicated by SumOfIndividualShannonEntropies.[3] As expected, the ConditionalValueOfBGivenA is below that of the Shannon’s Entropy for the second column and ConditionalValueOfAGivenB is below that of the first. One unexpected result was that the JointEntropy is close to the output of the multiset versions of the Hartley function I coded in Information Measurement with SQL Server, Part 1: A Quick Review of the Hartley Function, which returned 23.2534966642115 and 23.3910011060398 respectively. This may merely be a fluke, since Column2 has a very different distribution and the Hartley function only took Column1 into account in that particular article. Moreover, I tested this procedure on other combinations of float columns in the same dataset and received results between 22 and 23 most of the time, even when Column1 wasn’t being tested at all. I’m not yet well-versed enough in these matters to say if this indicates that we could derive additional insights from our data, by comparing each column’s Hartley function against the Joint Entropy.
…………The performance implications are of more immediate concern, of course. These two columns have 9,119,674 distinct combinations between them, which required a few gigabytes of TempDB space for spooling; SQL Server also gobbled up a few more gigs of RAM during these calculations, so it is probably wise to keep ample memory at hand. The good news is that this code ran in just 1:46, which is much better than I expected for multiple calculations on two float columns across 11 million rows. I have to perform these kinds of calculations on a beat-up clunker of a development machine that routinely falls apart like the Blues Brothers’ car, so the results on professional hardware are likely to be several orders of magnitude better. They could also probably benefit from a tune-up in the hands of a qualified T-SQL expert, which I am not. I’ve always had a little trouble with grouping statements, so there may be more efficient ways to write this; I doubt that would involve calculating the counts in the INSERT though, even though this could be done, albeit more awkwardly than taking them in advance. The INSERT statement accounted for 96 percent of the batch and 46 percent of those costs were incurred in a single Sort, so that operator might be a good target for performance tweaks. Another 49 percent was locked up in a single Table Insert operator, which probably can’t be dispensed with through optimization. The execution plan was short and otherwise uneventful though, since seeks and scans on the nonclustered indexes of the two columns did the bulk of the heavy lifting.

Interpretations and Use Cases for Joint and Conditional Entropies

…………So why bother going to the trouble of calculating these metrics at all? There’s no point in expending any server resources without first answering Aristotle’s causa efficiens. There are several different categories of use cases for these measures, some of which are really straightforward: whenever any question arises in any data model arises concerning the “news value” of one column given a value for another, or of a state description with two specific values for those columns, these are usually the measures we’d turn to. Conditional Entropy answers the fundamental question “How much can knowing a value of Y tell us about X?” whereas Joint Entropy measures the same thing as Shannon’s Entropy, except for two or more columns. As I’ve emphasized throughout this series, interpretation is a critical stage in any workflow involving information theory metrics, and in the case of these binary entropies, their subtle implications give rise to a whole additional class of use cases. For example, then knowing Y can tell us everything about X if the value of Conditional Entropy is zero,  but progressively less as the value rises; in this roundabout way, it becomes a sort of measure of association by proxy, which might be integrated with more familiar measures like correlation, covariance and regression. Conditional Entropy is also sometimes referred to as “equivocation,” especially when it is interpreted a conditional uncertainty. If we were speaking strictly of the original application of information theory to communication channels, then we can view a change in equivocation as a “decrease of uncertainty as to what message might have been enciphered.”[4] It is also used to “justify the definition for channel capacity.”[5] Wendell R. Garner’s excellent but largely unnoticed work from 1962, Uncertainty and Structure as Psychological Concepts, is chock full of formulas in which Conditional Entropy can be transmuted into various measures of uncertainty interactions[6], which are akin to the interaction effects in analysis of variance (ANOVA). These might be useful in the kind of “uncertainty management” programs I delved into in the Implementing Fuzzy Sets with SQL Server series. Some of these related measures, like conditional and contingent uncertainty, can be used to quantify such qualities as redundancy, the relationships between variables and the amount of information transmitted between them.[7] The math is rather thick but Garner’s arguments are sufficiently rigorous for demonstrating connections between Conditional Entropy to both “irrelevant information”[8] and patterns: as he puts it, “structure is related uncertainty…structure is still the important kind of uncertainty, and unstructured uncertainty is equivocation, or noise, or error, whichever you prefer. It is uncertainty which is unrelated to another uncertainty.”[9] It also has subtle relationships with measures of “irony”[10] and a priori knowledge.[11]
…………The third class of use cases for these entropies involves using them to calculate other information metrics, just as we derived Conditional Entropy from Joint Entropy in this article. In like manner, Joint Entropy can be useful as a stepping stone to Mutual Information, one of the most important measures in information theory. It would thus be wise to have it in our toolbelts, if we want to go beyond what the existing software can do and do some wildcat data mining, using DIY code for metrics and algorithms that haven’t been implemented yet. One of the surprises I’ve encountered while trying to learn these fields in recent years has been the sheer size of the yawning gap between the research and theory that undergirds data mining and the available software, which is in some respects decades behind the theoreticians. Only a fraction of the available measures and techniques are available anywhere in the analytics marketplace. Since this situation is likely to last for a long time to come, it may be helpful to acquire the skills to develop our own DIY solutions, which is where posts like this might prove useful (perhaps as cautionary tales against making the same amateur mistakes). To that end, I’ll also address a couple of up-and-coming metrics known as Shared Information Distance and Lautum Information, both of which are related to the more established topic of Mutual Information. It would be feasible to take this segment of the series on a whole array of tangents, including coding all of the information theory counterparts of probabilistic concepts like multiple outcomes, independent events, mutually exclusive events, non-exclusive, unordered pairs and compound events. All of these have equivalent implementations and ramifications in terms of entropy, but I’ll restrict my scope to actual information metrics in keeping with the title of the series, rather than all of the calculations and principles that can be derived from them. In the next article I’ll also dispense with Self Information, which is the entropic counterpart to probabilities for a single record. I’ll complete my wrap-up of this segment of the Information Measurement series with brief discussions of Entropy Rate and Information Rate, which are fairly simple to code.

[1] I downloaded this ages ago from the University of California at Irvine’s Machine Learning Repository and converted it to a SQL Server table, which now takes up about 5 gigabytes of space in a sham DataMiningProjects database.

[2] See the Wikipedia pages “Conditional Entropy” and “Entropy (Information Theory) ” at http://en.wikipedia.org/wiki/Entropy_(information_theory) and http://en.wikipedia.org/wiki/Conditional_entropy respectively.

[3] See the Wikipedia page “Joint Entropy ” at http://en.wikipedia.org/wiki/Joint_entropy

[4] p.  272, Pierce, John Robinson, 1980, An Introduction to Information Theory: Symbols, Signals & Noise. Dover Publications: New York. Also see Pierce, John Robinson, 1961, Symbols, Signals and Noise: The Nature and Process of Communication. Harper: New York.

[5] pp. 49-62, Mansuripur, Masud, 1987, Introduction to Information Theory. Prentice-Hall: Englewood Cliffs, N.J.

[6] p. 106, Garner, Wendell R., 1962, Uncertainty and Structure as Psychological Concepts. Wiley: New York.

[7] IBID., p. 96, 136.

[8] IBID., p. 316.

[9] IBID., p. 339.

[10] p. 46, Ritchie, L. David., 1991, Information. Sage Publications: Newbury Park, Calif.

[11] p. 19, Brillouin, Léon, 1962, Science and Information Theory. Academic Press: New York.

## Information Measurement with SQL Server, Part 2.3: Thermodynamic and Quantum Entropies

By Steve Bolton

…………When I was about 12 years old, I suddenly discovered football. Many lessons still awaited far in the future – such as the risks of being a Buffalo Bills fan, the explosive sound footballs make when they hit a Saguaro cactus, or the solid reasons for not playing tackle on city streets – but I did make one important finding right away. During the following summer vacation, I fought off boredom by looking up old rosters (in a sort of precursor to fantasy football) and running the previous year’s standings through the NFL’s comprehensive formula for creating schedules. The rules were simple: teams with higher won-loss percentages were the victors whenever they were scheduled against opponents with worse records the previous year, then at the end of each fake season, I ran the new standings through the scheduling formula again. Based on the initial patterns I expected my eccentric amusement to last for a while, but what I ended up getting was a hard lesson in information entropy. No matter what standings I plugged into the formula, they eventually stopped changing, given enough iterations; the best I could get was permanent oscillations of two teams who swapped positions in their division each year, usually with records of 9-7 and 7-9. It occurred to me that I could increase the range of possible patterns by changing the scheduling rules, but that I was merely putting off the inevitable, for in due time they would freeze in some configuration or another. The information system was purely self-contained and was not dependent on any input of physical energy whatsoever; some of my favorite underrated players of the day, like Neil Lomax and Danny White, might have been able to change the course of a real NFL season, but they were powerless to alter this one. This experience helped me to quickly grasp the Law of Conversation of Information[1] when I heard of it as an adult. It is analogous to the Second Law of Thermodynamics, but for a different quantity that isn’t really interchangeable. All of the energy of the sun itself could not affect the outcome of that closed information system, which was doomed to run down sooner or later, once set in motion. The only way to change that fate was to input more information into the system, by adding new scheduling rules, random standing changes and new teams. This is almost precisely what occurs in the Third Law of Thermodynamics[2], in which a closed physical system slowly loses its ability to change as it approaches equilibrium of some kind.
…………Both sets of principles are intimately related to entropy, which is the topic of this segment of my amateur self-tutorial series on coding various information metrics in SQL Server. Nevertheless, both are subject to misunderstandings, some of which are innocuous and others which can be treacherously fallacious, particularly when they are used interchangeably with broad terms like “order.” The most common errors with information theory occur in its interpretations rather than its calculations, since it intersects in complex ways with deep concepts like order, as well as many of the other fundamental quantities I hope to measure later in this series, like redundancy, complexity and the like. As I touched on in my lengthy introduction to Shannon’s Entropy, order is in the eye of the beholder. The choice to search for a particular order is thus entirely subjective, although whether or not a particular empirical observation meets the criteria inherent in the question is entirely objective. Say, for example, that the arrangement of stars was considered “random,” another broad term which intersects with the meaning of “order” but is not synonymous. If some we were to discover a tattoo on the arm of some celestial being corresponding to the arrangement of stars in our particular universe, we would have to assume that it was either put there through conscious effort or as the output of some unfathomable natural process. Either way, it would require an enormous amount of energy to derive that highly complex order. It might also require a lot of energy to derive an unnaturally simple order through the destruction of a fault-tolerant, complex one; furthermore, in cryptography, it sometimes requires a greater expenditure of resources to derive deceptively simple information, which is actually generated from a more complex process than the information it is designed to conceal. A perfectly uniform distribution is often difficult to derive either by natural processes or intelligent intervention; the first examples that spring to mind are the many machinists I know, who do highly skilled labor day-in, day-out to create metal parts that have the smoothest possible tolerances. If kids build a sand castle, that’s one particular form of order; perhaps a real estate developer would prefer a different order, in which case he might bulldoze the sand castle, grade the beach and put up a hotel. Both require inputs of energy to derive their particular orders, which exhibit entirely different levels of complexity. Neither “disorder” nor “order” are synonymous with entropy, which only measures the capacity of an information, thermodynamic or quantum system to be reordered under its own impetus, without some new input of information or energy from some external source.

Adjusting Hartley and Shannon Entropy by Boltzmann’s Constant

As we shall see, coding the formulas for the original thermodynamic measures isn’t terribly difficult, provided that we’ve been introduced to the corresponding information theory concepts covered in the first four blog posts of this series. Despite the obvious similarities in the underlying equations, measures of entropy are the subject of subtle differences in the way these fields handle them. In the physical sciences, entropic states are often treated as something to be avoided, particularly when cosmic topics like the Heath Death of the Universe are brought up.[3] Entropy is often embraced in data mining and related fields, because it represents complete knowledge; paradoxically, the higher the entropy, the higher the potential information gain from adding to our existing knowledge. The more incomplete our understanding is, the more benefit we can derive from “news,” so to speak.
…………It’s not surprising that the measures of entropy in thermodynamics and information theory have similar formulas, given that the former inspired the latter. Information theory repaid this debt to physics by giving birth to quantum information theory and spotlighted the deep principles which give rise to the laws of thermodynamics – which was originally an empirical observation rather than an inevitable consequence of logic. Both ultimately stem from the same principles of math and logic, like the Law of Large Numbers, but as usual, I’ll skip over the related theorems, proofs and lemmas to get to the meat and potatoes. Nor am I going to explain the principles of thermodynamics any more than I have to; this blog is not intended to be a resource for basic science, especially since that topic is much older and well-known than information theory, which means there are many more resources available for readers who want to learn more. This detour in my Information Measurement series is posted mainly for the sake of completeness, as well as the off-chance that some readers may encounter use cases where they have to use T-SQL versions of some of these common thermodynamic formulas; I also hope to illustrate the differences between physical and informational entropies, as well as demonstrate how more concepts of thermodynamics can be assimilated into information theory and put to good use in our databases.
…………On the other hand, I’m going to make another exception to my unwritten rule against posting equations on this blog, in order to point out the obvious similarities between Shannon’s Entropy and its cousin in thermodynamics, Gibb’s Entropy (a.k.a. the Shannon-Gibbs, Boltzmann-Gibbs, Boltzmann–Gibbs–Shannon, BGS or BG are all used interchangeably).[4]. As I noted in Information Measurement with SQL Server, Part 2.1: The Uses and Abuses of Shannon’s Entropy, H = -Σ pi logb pi is one of the most famous equations in the history of mathematics. Except for the fact it multiplies the result by the infamous Boltzmann’s Constant, kb [5] Gibb’s Entropy is practically identical: S = -kb -Σ pi logb pi. The thermodynamic formulas are distinguishable from their information theory kin in large part by the presence of his constant, which is measured in different branches of science via 15 different units[6]; in the Code in Figure 1, I used the formulas corresponding to the three main units used in information theory, which I introduced in the article on Hartley’s Function. Its discoverer, Austrian physicist Ludwig Boltzmann (1844-1906) was another one of those unbalanced math wizzes who bequeathed us many other crucial advances in the hard sciences, but who was eccentric to the point of self-destruction; he was in all likelihood a manic depressive, which may have led to his suicide attempts.[7]  He also feuded with Ernst Mach (1838-1916), another influential Austrian physicist – yet that may have been to his credit, given that Mach was one of the last holdouts in the field who opposed the existence of atoms.[8]
…………Boltzmann also lent his name to the Boltzmann Entropy, whose formula is obviously similar to the Hartley function. As I pointed out in Information Measurement with SQL Server, Part 1: A Quick Review of the Hartley Function, deciding which count of records to plug into the Hartley measure is not as straightforward as it seems; as expected, the DISTINCT count version returned the same results as a uniform distribution plugged into Shannon’s equation, but counting repeated records (as we would in a multiset) can also tell us useful things about our data. With Boltzmann’s Entropy, we’re more likely to use a really expansive definition of cardinality that incorporates all permissible arrangements in a particular space. This would be equivalent to using the Hartley measure across all permissible states, regardless of their probabilities. This would in turn equal the Shannon’s Entropy on a uniform distribution where all permissible states – not just the ones with nonzero values – are included. Counting all permissible states is usually a lot easier in physics than actually much easier than determining their probabilities, which is turn a far cry from determining the number of unique particle types, let alone the exact counts of particles. In modern databases, this relationship is almost completely reversed; in SQL Server, table counts are preaggregated and thus instantly available, while DISTINCT clauses are costly in T-SQL and even more so in Analysis Services. To make matters worse, calculating a factorial on values higher than about 170 is impossible, even using the float data type; this precludes counting all 1038 + 1 permissible value in a decimal(38,0) column and plugging it into the Boltzmann Entropy, which uses factorials to precalculate the counts plugged into it. In other words, we simply can’t use the Boltzmann Entropy on all permissible values, if the universal set would contain more than 170 members. For that reason, I used the old-fashioned SQL Server count, in this case on the first float column of the same Higgs Boson dataset I’ve been using for practice purposes for several tutorial series.[9]  Despite the fact that the table takes up 5 gigabytes and consists of 11 million rows, my sample code ran in just 1 second on my clunker of a development machine. The code is actually quite trivial and can be condensed from the deliberately verbose version below:

Figure 1: Various Measures of Thermodynamic Entropy
DECLARE @LogarithmBase decimal(38,36)
SET @LogarithmBase = 2  2.7182818284590452353602874713526624977 — 10

DECLARE @EntropicIndexParameter float = 0.99
DECLARE @Count  bigint, @Mean decimal(38,32), @StDev decimal(38,32), @GibbsEntropy float, @BoltzmannEntropy float,  @TsallisEntropy float, @ConfigurationEntropy float @BoltzmannConstant float

SELECT @BoltzmannConstant  = CASE WHEN @LogarithmBase = 2 THEN 1 / CAST(Log(2) AS float)
WHEN @LogarithmBase = 10 THEN 10 / CAST(Log(2) AS float)
WHEN @LogarithmBase = 2.7182818284590452353602874713526624977 THEN 1
ELSE NULL END

SELECT @Count = Count(*)
FROM Physics.HiggsBosonTable
WHERE Column1 IS NOT NULL

DECLARE @EntropyTable table
(Value decimal(33,29),
ValueCount bigint,
Proportion float
)

INSERT INTO @EntropyTable
(Value, ValueCount, Proportion)
SELECT Value, ValueCount, ValueCount / CAST(@Count AS float) AS Proportion
FROM  (SELECT Column1 AS Value, Count(*) AS ValueCount
FROM Physics.HiggsBosonTable
WHERE Column1 IS NOT NULL
GROUP BY Column1) AS T1

SELECT @GibbsEntropy  = 1 * @BoltzmannConstant * SUM(CASE WHEN Proportion = 0 THEN 0 ELSE Proportion * CAST(Log(Proportion, @LogarithmBase) as float) END), @TsallisEntropy = (1 / CAST(@EntropicIndexParameter
1 AS float)) * (1 SUM(CAST(Power(Proportion, @EntropicIndexParameter) AS float))),
@ConfigurationEntropy =
1 *  @BoltzmannConstant  * SUM(Proportion * Log(Proportion, @LogarithmBase))
FROM @EntropyTable

SELECT @BoltzmannEntropy = @BoltzmannConstant  * Log(@Count, @LogarithmBase)

SELECT @BoltzmannConstant AS BoltzmannConstant, @BoltzmannEntropy AS BoltzmannEntropy, @GibbsEntropy AS GibbsEntropy, @ConfigurationEntropy AS ConfigurationEntropy, @TsallisEntropy AS TsallisEntropy

Figure 2: Results from the Higgs Boson Dataset

…………The formula used here is for Boltzmann’s Constant is based on a common approximation of 1 / Log(2), which introduces inaccuracy as early as the third decimal place; greater precision can be achieved for the common Base-2 log using more accurate approximations, such as a hard-coded constant of 1.3806485097962231207904142850573. Of course, the results in Figure 3 would only make sense if the particular columns measured thermodynamic quantities, and I’m not familiar enough with the semantic meaning of the Higgs Boson data to say if that’s the case. The same is true of the Tsallis Entropy (which may be identical to the Havrda-Charvat Entropy)[10] derived in the first SELECT after the INSERT, which is the thermodynamic counterpart of the Rényi Entropy we covered a few articles ago. The two formulas are almost identical, except that the Tsallis takes an @EntropicIndexParameter similar to the @AlphaParameter used in the Rényi, which also cannot equal 1 because it would lead to a divide-by-zero error. As it approaches this forbidden value, however, it nears the Gibbs Entropy, just as the Rényi Entropy approaches Shannon’s measure. At 0 it is equivalent to the DISTINCT COUNT minus 1. A wide range of other values have proven useful in multifarious physics problems, as well as “sensitivity to initial conditions and entropy production at the edge of chaos,”[11] which could prove useful when I tackle chaos theory at the tail end of this series. The formula for the Configuration Entropy[12] is exactly the same as that for the Gibbs Entropy, except that it measures possible configurations rather than the energy of a system; for our immediate purposes in translating this for DIY data mining purposes, the number of configurations and energy are both equivalent to the information content.
…………Although all of the above measures are derived from thermodynamics, they have obviously similarities and uses in information theory and by extension, related fields like data mining. Some additional concepts from thermodynamics might prove good matches for certain SQL Server use cases in the future as well, although I have yet to see any formulas posted that could be translated into T-SQL at this point. The concept of Entropy of Mixing[13] involves measuring the change in entropy from the merger of two closed thermodynamic systems, separated by some sort of impermeable barrier. This might prove incredibly useful if it could be ported to information theory (assuming it already hasn’t), since it could be used to gauge the change in entropy that occurs from the merger of two datasets or partitions. This might include appending new data in temporal order, partitioning sets by value or topic and incorporating new samples – all of which could be useful in ascertaining whether the possible information gain is worth the performance costs, in advance of the merge operation. In the same vein, a “perfect crystal, at absolute zero” has zero entropy and therefore no more potential for internally-generated change, but it may be in one of several microstates at the point it is frozen in stasis; this is what is measured by Residual Entropy[14], which might be transferable to information theory by quantifying the states an information system can exhibit once its potential for internally-generated change is gone.

Adapting Other Thermodynamic Entropies

One of the most promising new information metrics is information enthalpy, which is an analogue of an older thermodynamic concept. The original version performed calculations on measures of energy, pressure and volume, in which the first “term can be interpreted as the energy required to create the system” and the second as the energy that would be required to ‘make room’ for the system if the pressure of the environment remained constant.”[15] It measures “the amount of heat content used or released in a system at constant pressure,” but is usually expresses as the change in enthalpy as measured in joules.[16] This could be adapted to quantify the information needed to give rise to a particular data structure, or to assess changes in it as information is added or removed. The thermodynamic version can be further partitioned into the enthalpy due to such specific processes as hydrogenation, combustion, atomization, hydration, vaporization and the like; perhaps information enthalpy can be partitioned as well, except by processes specific to information and data science. Information enthalpy is the subject of at least one patent for a cybersecurity algorithm to prevent data leaks, depending on security rating.[17] At least two other research papers use information enthalpy for data modeling with neural nets, which is a subject nearer to my heart.[18] A more recent journal article uses it in measuring artificial intelligence[19], which is also directly relevant to data mining and information theory.
…………Loop Entropy is specific not just to thermodynamics, but to specific materials within it, since it represents “the entropy lost upon bringing together two residues of a polymer within a prescribed distance.”[20] Nevertheless, it might be possible to develop a similar measure in information theory to quantify the entropy lost in the mixing between two probability spaces. Conformational Entropy is even more specific to chemistry, but it might be helpful to develop similar measures to quantify the structures an information system can take on, as this entropy does with molecular arrangements. Incidentally, the formula is identical to that of the Gibbs Entropy, except that the Gas Constant version of the Boltzmann Constant is used.[21]  Likewise, Entropic Force is a concept mainly used to quantify phenomena related to Brownian motion, crystallization, gravity and “hydrophobic force.”[22] In recent years, however, it has been linked together with “entropy-like measures of complexity,” intelligence and the knowledge discovery principle of Occam’s Razor, which brings it within the purview of information theory.[23] I surmise that a similar concept could be put to good use in measuring entropic forces in data science, for such constructive purposes as estimating data loss and contamination, or even ascertaining tendencies of data to form clusters of a particular type. It is also possible that a more thorough relationship between thermodynamic free energy, Free Entropy and “free probability” can be fleshed out.[24] These are related to the “internal energy of a system minus the amount of energy that cannot be used to perform work,” but it might be useful outside of the thermodynamic context, if we can further partition measures like Shannon’s Entropy to strain out information that is likewise unavailable for our purposes. I cannot think of possible adaptations for more distant thermodynamic concepts like Entropic Explosion and Free Entropy off the top of my head, but that does not mean they are not possible or would not be useful in data science.

Esoteric Quantum Entropies

These measures of physical entropy are of course prerequisites for bleeding-edge topics like quantum mechanics, where the strange properties of matter below the atomic level of ordinary particle physics introduce mind-bending complications like entanglement. This gives rise a whole host of quantum-specific entropies which I won’t provide code for because they’re too far afield for my intended audience, the SQL Server community, where making Schrödinger’s Cat reappear like a rabbit out of a hat isn’t usually required in C.V.s. These metrics will only prove useful if we can find objects and data that can be modeled like quantum states, which might actually be feasible in the future. Thermodynamics served as a precursor to information theory, which in turn provided a foundation for the newborn field of quantum information theory, so further cross-pollination between these subject areas can be expected. This could arise out of information geometry, another bleeding-edge field that borrows concepts like Riemann manifolds and multidimensional hyperspace and applies them to information theory; I hope that by the end of this wide-ranging series I’ll have acquired the skills to at least explore the topic, but the day when I can write tutorials on it is far off. Another interesting instance of cross-pollination is occurring as we speak between spin glasses, i.e. disordered magnets which are deeply interesting to physicists for their phase transitions, and neural nets, which apparently share analogous properties in common with them. It is nonetheless far more likely that some of the simpler thermodynamic concepts like enthalpy will be adapted for use in information theory (and by extension, data mining) before any of the quantum information measures I’ll quickly dispense with here.
…………The Von Neumann Entropy is one of the brands of entropy most frequently mentioned in books on quantum mechanics, like Ingemar Bengtsson’s Geometry of Quantum States: An Introduction to Quantum Entanglement [25]and Vlatko Vedral’s Decoding Reality: The Universe as Quantum Information[26], but it merely extends the information theory and thermodynamic concepts we’ve already discussed, using the kind of high-level math employed in quantum physics. For example, “the Gibbs entropy translates over almost unchanged into the world of quantum physics to give the von Neumann entropy” [27] except that we plug in a stochastic density matrix for the probabilities and use a Trace operation on it instead of a summation operator. Moreover, when Von Neumann’s Entropy is calculated via its eigenvectors it reduces to Shannon ‘s version.[28] Linear Entropy (or “Impurity”) is likewise in some respects an extension of the concept of entropic mixing to the field birthed from the union of these two cutting-edge fields, quantum information theory.[29]
…………Certain quantum-specific measures, such as the Belavkin-Staszewski Entropy, are obscure enough that it is difficult to find references on them anywhere, even in professional quantum theory and mathematical texts. I became acquainted with these quantum measures awhile back while skimming works like Bengtsson’s and Vedral’s, but as you can tell from my citations, I had to rely heavily on Wikipedia to fill in a lot of the blanks (which were quite sizeable, given that I’ve evidently forgotten most of what I learned on the topic as a kid, when I imbibed a lot from my father’s moonlighting as a college physics teacher). Fortunately, I was able to find a Wikipedia article on the Sackur-Tetrode Entropy[30], which was just comprehensible enough to allow me to decipher its purpose. Evidently, it’s used to partition quantum entropies by the types of missing information they quantify, similar to how I used various measures of imprecision, nonspecificity, strife, discord, conflict and the like at the tail of my Implementing Fuzzy Sets in SQL Server series to partition “uncertainty.” In those tutorials I in turn likened uncertainty partitioning to the manner in which variance is partitioned in analysis of variance (ANOVA).
…………It is much more common to find references to measures like Quantum Relative Entropy and Generalized Relative Entropy[31] in works on quantum mechanics, which are just highly specialized, souped-up versions of the Kullback-Leibler Divergence we’ll be tackling later in this series. They’re used to quantify the dissimilarity of indistinguishability of quantum states across Hilbert Spaces, which aren’t exactly everyday use cases for SQL Server DBAs and data miners (yet not entirely irrelevant, given that they’re constructed from inner products). On the other hand, the KL-Divergence they’re derived from is almost as important in information theory as Shannon’s Entropy, so I’ll be writing at length on it once I can set aside room in this series for a really long segment on distance and divergence metrics. It can certainly be put to a wide range of productive uses by SQL Server end users. The same can be said of the conditional and joint information entropies we’ll be tackling in the next article, alongside such ubiquitous measures as information rates. In one sense, they’re a little more complex that the topics we covered so far in this series, since they’re binary relations between probability figures rather than simple unary measures. On the other hand, they’re far more simple and useful than counterparts like Quantum Mutual Information and Conditional Quantum Entropy. I’m only mentioning these metrics and their quantum kin for the benefit of readers who want to learn more about information theory and don’t want to wade into the literature blind, without any inkling as to what these more esoteric entropies do or how compartmentalized their use cases are. When applying information theory outside of quantum mechanics, we don’t need to concern ourselves with such exotic properties as non-separability and oddities like the fact that quantum conditional entropy can be negative, which is equivalent to a measure known as “coherent information.”[32] In contrast, I’ll provide code at the end of this segment of the series for measures like Mutual, Lautum and Shared Information, which indeed have more general-purpose use cases in data mining and knowledge discovery. The same can be said of next week’s article on Conditional and Joint Entropy, which are among the tried-and-true principles of information theory.

[1] This is also justified by such principles as the Data Processing Theorem, which demonstrate that a loss of information occurs at each step in data processing, since no new information can be added. It is actually a more solid proof than that its thermodynamic counterpart, since it stems directly from logical and mathematical consistency rather than empirical observations with unknown causes. For a discussion of the Data Processing Theorem, see p. 30, Jones, D.S., 1979, Elementary Information Theory. Oxford University Press:  New York.

[2] Boundless Chemistry, 2014, “The Third Law of Thermodynamics and Absolute Energy,” published Nov. 19, 2014 at the Boundless.com web address https://www.boundless.com/chemistry/textbooks/boundless-chemistry-textbook/thermodynamics-17/entropy-124/the-third-law-of-thermodynamics-and-absolute-energy-502-3636/

[3] See the Wikipedia webpage “Heat Death of the Universe” at http://en.wikipedia.org/wiki/Heat_death_of_the_universe

[4] See the Wikipedia page “History of Entropy” at http://en.wikipedia.org/wiki/History_of_entropy

[5] See the Wikipedia article “Boltzmann’s Entropy Formula” at http://en.wikipedia.org/wiki/Boltzmann%27s_entropy_formula

[6] I retrieved this value from the Wikipedia page “Boltzmann’s Constant” at http://en.wikipedia.org/wiki/Boltzmann_constant

[7] See the Wikipedia article “Ludwig Boltzmann” at  http://en.wikipedia.org/wiki/Ludwig_Boltzmann

[8] See the Wikipedia page “Ernst Mach” at http://en.wikipedia.org/wiki/Ernst_Mach

[9] I originally downloaded this from the University of California at Irvine’s Machine Learning Repository and converted it into a single table in a sham SQL Server database called DataMiningProjects.

[10] See the Wikipedia article “Talk:Entropy/Archive11” at http://en.wikipedia.org/wiki/Talk%3AEntropy/Archive11

[11] See the Wikipedia webpage “Tsallis Entropy” at http://en.wikipedia.org/wiki/Tsallis_entropy

[12] See the Wikipedia article “Configuration Entropy” at http://en.wikipedia.org/wiki/Configuration_entropy

[13] See the Wikipedia page http://en.wikipedia.org/wiki/Entropy_of_mixing

[14] See the Wikipedia article “Residual Entropy” at http://en.wikipedia.org/wiki/Residual_entropy

[15] See the Wikipedia page “Enthalpy” at http://en.wikipedia.org/wiki/Enthalpy

[16] See the ChemWiki webpage “Enthalpy” at

http://chemwiki.ucdavis.edu/Physical_Chemistry/Thermodynamics/State_Functions/Enthalpy

[17] See the Google.com webpage “High Granularity Reactive Measures for Selective Pruning of Information” at http://www.google.com/patents/US8141127

[18] Lin, Jun-Shien and Jang, Shi-Shang, 1998, “Nonlinear Dynamic Artificial Neural Network Modeling Using an Information Theory Based Experimental Design Approach,” pp. 3640-3651 in Industrial and Engineering Chemistry Research, Vol. 37, No. 9. Also see Lin, Jun-Shien; Jang, Shi-Shang; Shieh, Shyan-Shu and M. Subramaniam, M., 1999, “Generalized Multivariable Dynamic Artificial Neural Network Modeling for Chemical Processes,” pp. 4700-4711 in Industrial and Engineering Chemistry Research, Vol. 38, No. 12.

[19] Benjun, Guo; Peng, Wang; Dongdong, Chen;  and Gaoyun, Chen, 2009, “Decide by Information Enthalpy Based on Intelligent Algorithm,” pp. 719-722 in Information Technology and Applications. Vol. 1.

[20] See the Wikipedia article “Loop Entropy” at http://en.wikipedia.org/wiki/Loop_entropy

[21] See the Wikipedia article “Conformational Entropy” at http://en.wikipedia.org/wiki/Conformational_entropy

[22] See the Wikipedia article “Entropic Force” at http://en.wikipedia.org/wiki/Entropic_force

[23] IBID.

[24] See the Wikipedia pages “Free Entropy,” “Thermodynamic Free Energy” and “Free Probability” at http://en.wikipedia.org/wiki/Free_entropy, http://en.wikipedia.org/wiki/Thermodynamic_free_energy and http://en.wikipedia.org/wiki/Free_probability respectively.

[25] Bengtsson, Ingemar, 2008, Geometry of Quantum States: An Introduction to Quantum Entanglement. Cambridge University Press: New York.

[26] Vedral, Vlatko, 2010, Decoding Reality: The Universe as Quantum Information. Oxford University Press: New York

[27] See the Wikipedia article “Entropy (Information Theory)” at http://en.wikipedia.org/wiki/Entropy_(information_theory)

[28] See the Wikipedia article “Von Neumman Entropy” at http://en.wikipedia.org/wiki/Von_Neumann_entropy

[29] See the Wikipedia article “Linear Entropy” at http://en.wikipedia.org/wiki/Linear_entropy

[30] See the Wikipedia webpage “Sackur-Tetrode Equation” at http://en.wikipedia.org/wiki/Sackur%E2%80%93Tetrode_equation

[31] See the Wikipedia articles “Quantum Relative Entropy” and “Generalized Relative Entropy” at http://en.wikipedia.org/wiki/Quantum_relative_entropy and http://en.wikipedia.org/wiki/Generalized_relative_entropy respectively

[32] See the Wikipedia articles Quantum Mutual Information and “Conditional Quantum Entropy” at

## Information Measurement with SQL Server, Part 2.2: The Rényi Entropy and Its Kin

By Steve Bolton

…………I kicked off this far-ranging series on using SQL Server to quantify information by discussing two of the earliest and most important measures, the Hartley function and Shannon’s Entropy. These foundations of information theory are intimately related to a more general measure, Rényi Entropy, which is a bit more complex but nonetheless worthwhile to discuss, since its unites many different information measures under one umbrella. The underlying math formula isn’t much more difficult than the one for Shannon’s Entropy I posted last time around, but its alpha parameter (α-parameter) enables it to give rise to a wider range of results. This assortment of entropy types can be adapted to solving a wider range of problems.
…………A general-purpose means of parameterizing the concept of entropy was the explicit goal of Alfréd Rényi (1920-1971), a mathematician who overcame anti-Semitic persecution at the hands of Hungary’s World War II regime, which was allied with the Nazis and passed stringent laws against their own Jewish minority. Incidentally, his coffee addiction was the inspiration for the colorful saying, “A mathematician is a device for turning coffee into theorems”, which apparently predates the saying, “A programmer is a machine that turns coffee into code.”[1] The formula that bears his name has proven useful in many diverse fields and industries, from quantum mechanics to fractals to ecology and statistics, where it is useful in deriving indices of diversity.[2] When the α-parameter is set to 2, it becomes Collision or Quadratic Entropy (or sometimes just “the Rényi Entropy”), which “has been used in physics, in signal processing and in economics” and is attractive to statisticians because they “have found an easy way to estimate it directly from samples.”[3] Given that it has multifarious uses and is both easy to explain and compute in SQL Server, it makes sense to add it to our toolbelt, in order to derive our own DIY data mining methods. One thing I’ve come to realize since writing my initial A Rickety Stairway to SQL Server Data Mining tutorial series awhile back is that the data mining marketplace is decades behind the research in some respects, to the point where all of the available software taken together is probably several orders of magnitude behind the formulas available in the math books. If we encounter situations where the particular algorithms that would work best for our use cases aren’t yet implemented, waiting might not be an option, and if we’re going to learn to build our own, Rényi Entropy is bound to come in handy at some point.

Coding the Range of α-Parameter  Values

The key to understanding Rényi Entropy is grasping how it intersects with its cousins. The α-parameter can be any non-negative value except for 1, which would result in a divide-by-zero error since one of the terms is 1 / (1 – α). This result is multiplied by a logarithm taken on the sum of the probabilities taken to the power of α; this differs from Shannon’s famous equation in that his log operation is performed on each probability and then multiplied by that probability, then summed afterwards and multiplied by -1. This logic is all implemented in the INSERT operation in Figure 1, in which most of the code is devoted to deriving actual probability values from the proportions of a single float column, taken from the same Higgs Boson dataset[4] I’ve been using for practice purposes for the past few tutorial series. I omitted any display of the intermediate proportions stored in the @EntropyTable variable, since they’re identical to the sample results posted in last week’s article. Evidently, it’s economical to compute these entropies, given that this code takes less than 2 seconds to run on my ancient abacus of a development machine (I might as well be using quipus) and most of that was devoted to deriving the probabilities for all 11 million rows; if we already had probability values derived from some other source like estimates, sampling or deductive methods, it would run practically instantaneously. I also hard-coded the results we already calculated over the last two articles for the Hartley and Shannon Entropies, given that they are also unchanged. The final SELECT in Figure 2 provides the same information from another vantage point, by plugging a range of α-values rather than a single parameter into the Rényi formula.
…………Note that as the values approach 1, they converge towards the Shannon Entropy, although they can never quite reach it because of the divide-by-zero error that would occur if plugged in the forbidden parameter value of 1.Likewise, the return values approach the Hartley measure as the parameter values approach 0; this acts as a cap on the permissible values, hence the frequent use of the synonym Max Entropy. Because SQL Server’s calculation powers are impressive yet finite, we can’t plug in values approaching infinity (if such a thing were possible). In fact, we can’t even go much higher than α-values  of about 50 (which are rarely used in practice anyways) on this particular dataset without getting floating point errors. Figure 3 is sufficient to nevertheless sufficient to illustrate how the Rényi results approach an information measure known as the Min Entropy, which acts as a cap on the other end. References to it as the Chebyshev Entropy[5] seem to be few and far between outside the realm of quantum physics, but I’ve run into a few here and there. It is nonetheless easy to calculate, given that all we have to do is find the maximum probability value and take the negative log. As Wikipedia points out quite succinctly, “The min entropy is never greater than the ordinary or Shannon entropy (which measures the average unpredictability of the outcomes) and that in turn is never greater than the Hartley or max entropy, defined as the logarithm of the number of outcomes.”[6] It is also useful as the “most conservative way of measuring the unpredictability of a set of outcomes,” a property that makes it useful in determining randomness and quantum cryptography, in a way that the Shannon Entropy can’t handle.[7] Each Rényi value in between the fences set by the Min, Hartley and Shannon Entropies has its own distinct uses in ascertaining “the unpredictability of a nonuniform distribution in different ways.”[8] We can thus choose whatever α-parameter is ideal for dealing with the particular distribution of our data. As an amateur, I’m lacking in the experience needed to determine which α-values are best-suited to specific problems, but this code at least provides a launching pad for exploring such questions.

Figure 1: The Rényi Entropy in T-SQL
DECLARE @LogarithmBase decimal(38,36) = 2  –2.7182818284590452353602874713526624977 — 10
DECLARE @AlphaParameter decimal(38,35) = 2

DECLARE @Count  bigint, @DistinctValueCount bigint, @RenyiEntropy float, @MinOrChebyshevEntropy float

SELECT @Count = Count(*)
FROM Physics.HiggsBosonTable
WHERE Column1 IS NOT NULL

DECLARE @EntropyTable table
(Value decimal(33,29),
ValueCount bigint,
Proportion float,
SummationInput float
)

INSERT INTO @EntropyTable
(Value, ValueCount, Proportion, SummationInput)
SELECT Value, ValueCount, Proportion, Power(Proportion, @AlphaParameter) AS SummationInput
FROM (SELECT Value, ValueCount, ValueCount / CAST(@Count AS float) AS Proportion
FROM  (SELECT Column1 AS Value, Count(*) AS ValueCount
FROM Physics.HiggsBosonTable
WHERE Column1 IS NOT NULL
GROUP BY Column1) AS T1) AS T2

SELECT @RenyiEntropy = 1 / (1 @AlphaParameter) * Log(SUM(SummationInput), @LogarithmBase), @MinOrChebyshevEntropy = 1 * Log(Max(Proportion), @LogarithmBase)
FROM @EntropyTable

SELECT @RenyiEntropy AS RenyiEntropy, @MinOrChebyshevEntropy AS MinEntropy, 13.2879304339032 AS ShannonsEntropy, 14.772263018717 AS HartleyOrMaxEntropy

Figure 2: A Range of Sample T-SQL
–WHOLE RANGE OF RENYI ENTROPY
SELECT AlphaParameter, 1 / (1 AlphaParameter) * Log(SUM(Power(Proportion, AlphaParameter)), @LogarithmBase) AS ResultingRenyiEntropy
FROM (VALUES (0.01), (0.5),(1.01),(2),(3),(4),(5),(10),(50)) AS T1(AlphaParameter)
LEFT JOIN @EntropyTable AS T2
ON 1 = 1
GROUP BY AlphaParameter
ORDER BY AlphaParameter

Figure 3: Results from the Higgs Boson Dataset

…………Jose C. Principe, a neural net expert who teaches at the University of Florida, has written a handy 35-page guide to using the Rényi Entropy for tasks even further beyond my level of inexpertise, such as deriving measures like Information Potential based on probability distribution function (PDF) estimates. This quickly leads into the bleeding edge topic of information geometry, which I won’t touch with a ten-foot pole for a few more years, at least until I have a better understanding of such mind-blowing concepts as statistical Riemann manifolds and multidimensional spaces. The introductory pages are nevertheless useful in illustrating how Rényi Entropy works on simpler expressions of information. This is particularly true of the excellent diagram on the page 4, which shows how the various α-values measure the distance of a PDF to the origin point, thereby filling in the available space for information content in different proportions. As we shall see later in this series of self-tutorials, plugging different α-values into a related concept known as the Rényi Divergence also allows us to measure the distances between two PDFs in various ways. This is intimately related to the Kullback–Leibler Divergence in much the same way as the Rényi Entropy is related to the Shannon Entropy. The KL-Divergence is among the most important measures in information theory and related fields so I’ll be writing quite a bit on the subject later on in the series, as I did with the Shannon Entropy in the last article. It also goes by the name of Relative Entropy, but I’ll have to put off that discussion for a later time, when I can set aside a separate segment of the series for distance and divergence measures between probability distributions. Over the next four articles I’ll stick to the topic of entropy measures on single distributions, including Leaf and Root Entropy, which are inextricably related to the SSDM algorithm I covered in A Rickety Stairway to SQL Server Data Mining, Algorithm 3: Decision Trees.

[1] See the Wikipedia page “Alfréd Rényi” at  http://en.wikipedia.org/wiki/Alfr%C3%A9d_R%C3%A9nyi

[2] See the Wikipedia article “Rényi Entropy” at http://en.wikipedia.org/wiki/R%C3%A9nyi_entropy

[3] pp. 6-7, Principe, Jose C., 2009, “Rényi Entropy,” course notes posted at the University of Florida webpage http://www.cnel.ufl.edu/courses/EEL6814/renyis_entropy.pdf

[4] I downloaded this publicly available dataset from University of California at Irvine’s Machine Learning Repository and while back and converted it to a SQL Server table, which now takes up about 5 gigabytes in a sham DataMiningProjects database.

[5] One source I found it mentioned in was Bengtsson, Ingemar, 2008, Geometry of Quantum States: An Introduction to Quantum Entanglement. Cambridge University Press: New York.

[6] See the Wikipedia article “Min Entropy” at http://en.wikipedia.org/wiki/Min_entropy

[7] IBID.

[8] IBID.

## Information Measurement with SQL Server, Part 2.1: The Uses and Abuses of Shannon’s Entropy

By Steve Bolton

…………In the first installment of this wide-ranging series of amateur tutorials, I noted that the Hartley function indeed returns “information,” but of a specific kind that could be described as “newsworthiness.” This week’s measure also quantifies how much we add to our existing knowledge from each new fact, but through explicitly stochastic methods, whereas the Hartley function is more closely related to counts of state descriptions than probabilities.[1] Shannon’s Entropy is of greater renown than its predecessor, but isn’t much more difficult to code in T-SQL. In fact, it should be a breeze given that I already posted a more advanced version of it tailor-made for Implementing Fuzzy Sets in SQL Server, Part 10.2: Measuring Uncertainty in Evidence Theory, just as I had already coded a modified version of Hartley’s measure for Implementing Fuzzy Sets in SQL Server, Part 9: Measuring Nonspecificity with the Hartley Function.
…………Historically, the real difficulty with the metrics of information theory is with their interpretation, largely due to the broad and fuzzy meaning of the imprecise term, “information.” As we shall see, many brilliant theorists far smarter than we have made the mistake of overextending it beyond its original domain, just as some thinkers have gotten carried with the hype surrounding fuzzy sets and chaos theory. Like those cutting-edge topics, Shannon’s Entropy and its relatives within information theory are indeed powerful, but can go badly wrong when misapplied outside its specific use cases. It has really subtle yet mind-blowing implications for all of the different classes of information measurement I hope to cover in this wide-ranging and open-ended series, not all of which are fully understood. My purposes in this series is to illuminate datasets from every possible direction, using a whole smorgasbord of measures of meaning (semantic information), randomness, complexity, order, redundancy and sensitivity to initial conditions, among others; since many of these metrics serve as the foundation of many mining algorithms, we can use them in a SQL Server environment to devise DIY data mining algorithms. Shannon’s Entropy intersects many of these disparate class of information in multifarious ways, such as the fact that it is equal to the Hartley measure in the case of uniform distributions (i.e. when all values are equally likely). It is not a measure of what is known, but of how much can be learned by adding to it; entropy measures rise in tandem with uncertainty precisely we because we can learn more from new facts when we don’t already know everything. It can thus be considered complementary to measures of existing knowledge, such as Bayes Factors. I’ve read more on information theory than many other fields I’ve commented on over the past couple of years, when I started writing self-tutorials on data mining, but that doesn’t mean I understand all of these complex interrelationships well. In fact, I’m writing this series in part because it helps me absorb the material a lot faster, and posting it publicly in the hopes that it can at least help readers avoid my inevitable mistakes. I lamented often in A Rickety Stairway to SQL Server Data Mining that SSDM was woefully under-utilized in comparison to its potential benefits, but the same can also be said of the algorithms that underpin it.

Channel Capacity as a Godsend

Few of these information metrics are as well-known as the measure that Claude E. Shannon (1916-2001), a key American cryptographic specialist during the Second World War, introduced in 1948 in a two-part journal article titled “A Mathematical Theory of Communication.”[2] Most of it consists of math theorems and formulas that only got thicker as researchers built on his theory over the years, but it is noteworthy to mention that he credits earlier work by electrical engineers Ralph Hartley[3] (1888-1970) and Harry Nyquist (1889-1976) in the opening paragraphs.[4] I’ll skip over most of the proofs and equations involved – even the ones I understand – for the usual reasons: as I’ve pointed out in past articles, users of mining tools shouldn’t be burdened with these details, for the same reason that commuters don’t need a degree in automotive engineering in order to drive their cars. Suffice it to say that the original purpose was to determine the shortest possible codes in the case of noisy transmission lines. In this effort to extend coding theory, Shannon gave birth to the whole wider field of information theory. Masud Mansuripur, the chair of the Optical Data Storage Department at the University of Arizona, sums up Shannon’s surprising mathematical discovery best: “In the past, communications engineers believed that the rate of information transmission over a noisy channel had to decline to zero, if we require the error probability to approach zero. Shannon was the first to show that the information-transmission rate can be kept constant for an arbitrarily small probability of error.”[5] It is also possible to determine the information-carrying capacity of the channel at the same time.[6]
…………The idea is still somewhat startling to this day, as many writers in the field of information theory have pointed out ever since; it almost seems too good to be true. Basically, the idea is to engineer a “decrease of uncertainty as to what message might have been enciphered,” [7] in part by leveraging combinatorics to derive probabilities. For example, the Law of Large Numbers states that rolling a pair of dice is far more likely to result in a number like eight, since there are multiple ways of adding up both dice to get that result, whereas there’s only one combination apiece for snake eyes or boxcars. The edges of the resulting dataset are thus far less likely than the values in the middle, just as we’d find in a Gaussian or “normal” distribution, i.e. the ubiquitous bell curve. This gives rise to a dizzying array of axioms and lemmas and a whole set of procedures for determining codes, all of which I’ll skip over. For perhaps the first time in the history of this blog, however, I have a sound reason for posting the underlying equation: H = -Σ pi logb pi. It’s not a household name like E = mc2 or 2 + 2 = 4, but it’s still one of the most famous formulas of all-time. It relatives can be spotted in the literature by the telltale combination of a negative summation operator with a logarithm operation. Aside from the fact that it possesses many ideal mathematical properties[8], “it was proven in numerous ways, from several well-justified axiomatic characterizations, that this function is the only sensible measure of uncertainty in probability theory.”[9]

Translating H into Code

As with Hartley’s measure, a base 10 logarithm results in units known as hartleys or bans (as famed cryptographer Alan Turing called them). With Euler’s Number (as in the natural logarithm) the units are known as nats and with base 2, they’re referred to as shannons or more commonly, bits. The code in Figure 1 uses bits, but users can uncomment the numbers that follow to use one of the other units. It is also worth noting that physicist Léon Brillouin used a “negentropy” measure that is basically the inverse of H, to compensate for the fact that Shannon’s Entropy measures information in terms of uncertainty, which is a bit counter-intuitive; it never really caught on though.[10]
…………My sample code derives some fake probabilities by taking the known proportions of values for a column in the same Higgs Boson dataset[11] I’ve been using for practice purposes for several tutorial series. The probabilities are then multiplied by the logs for those probabilities, then summed across the dataset and multiplied by -1. It’s actually quite simpler than it looks; the table variable and INSERT statement could be done away with altogether if we already had probability figures calculated through some other means, such as some sort of statistical sampling method or even reasoning out the probabilities from the underlying process, as writers on information theory often do with dice and cards to illustrate the concepts. In fact, if we know the underlying probability distribution in advance, we can sometimes calculate the entropy in reverse by using special formulas specific to certain distributions, such as the normal. Moreover, all of the code pertaining the @UniformProportion is included solely for the purpose of validating the results, which equal the results we received in the last article for the DISTINCT version of the Hartley function, as expected. If need be, users can also validate their results using Lukasz Kozlowski’s Shannon Entropy Calculator. All of that can code can also be removed at will. Despite all of this extraneous baggage, the code ran in just 1.9 seconds according to the Client Statistics, on 11 million rows of float values on my beat-up, wheezing semblance of a development machine. The execution plan was uneventful and consisted mainly of seeks on the nonclustered index I put on the column a couple of tutorial series ago.

Figure 1: Shannon’s Entropy in T-SQL
DECLARE @LogarithmBase decimal(38,36) =–2.7182818284590452353602874713526624977 — 10
DECLARE @Count bigint, @DistinctValueCount bigint, @ShannonsEntropy float, @EntropyForUniformDistribution float,
@MaxProbability float, @ChebyshevEntropy float, @MetricEntropy float, @UniformProportion float

SELECT @Count = Count(*), @DistinctValueCount = Count(DISTINCT Column1)
FROM Physics.HiggsBosonTable
WHERE Column1 IS NOT NULL

SELECT @UniformProportion = 1 / CAST(@DistinctValueCount as float)
FROM Physics.HiggsBosonTable
WHERE Column1 IS NOT NULL

DECLARE @EntropyTable table
(Value decimal(33,29),
ValueCount bigint,
Proportion float,
SelfInformation float
)

INSERT INTO @EntropyTable
(Value, ValueCount, Proportion, SelfInformation)
SELECT Value, ValueCount, Proportion,  1 * Proportion * Log(Proportion, @LogarithmBase) AS SelfInformation
FROM (SELECT Value, ValueCount, ValueCount / CAST(@Count AS float) AS Proportion
FROM  (SELECT Column1 AS Value, Count(*) AS ValueCount
FROM Physics.HiggsBosonTable
WHERE Column1 IS NOT NULL
GROUP BY Column1) AS T1) AS T2

SELECT * FROM @EntropyTable

SELECT @ShannonsEntropy = SUM(SelfInformation), @EntropyForUniformDistribution=   SUM (@UniformProportion * Log(@UniformProportion, @LogarithmBase)) * 1, @MaxProbability = Max(Proportion)
FROM @EntropyTable

— ====================================
— SOME RELATED ENTROPIES
— ====================================
SELECT @ChebyshevEntropy  = 1 * Log(@MaxProbability, @LogarithmBase)
SELECT @MetricEntropy  = @ShannonsEntropy / CAST(@Count as float)

SELECT @ShannonsEntropy AS ShannonsEntropy, @MetricEntropy AS MetricEntropy, @ChebyshevEntropy AS ChebyshevEntropy, @EntropyForUniformDistribution as EntropyForUniformDistribution

Figure 2: Calculating Shannon’s Entropy on the Higgs Boson Dataset

…………Note that in the last section of the code, I tacked on the Metric and Chebyshev Entropies, which are trivial to calculate once we have building blocks like Shannon’s Entropy. References to the Chebyshev Entropy[12], a.k.a the Min Entropy because it represents the minimum amount a variable can exhibit, seem to be few and far between outside the realm of quantum physics. Metric Entropy, on the other hand, can  serve as a simple measure of randomness[13], which is more likely to be useful to SQL Server users. These are among a couple dozen extensions and relatives of Shannon’s Entropy, a few of which I’ve already dealt with, like the Hartley Entropy (i.e. the Max Entropy). I won’t discuss Differential Entropy, the extensions to continuous variables using various methods of calculus, because SQL Server data types like float, decimal and numeric are actually discrete in the strict sense; they can’t actually represent infinitesimal grades in between values, any more than the data types of any other software can on finite computers. Nor will I delve into the myriad applications that have been developed from Shannon’s Entropy for signal transmission and coding theory, since these are off-topic.

Detection of “Randomness” and Other Uses

Its relationship to cryptography might be a more appropriate subject, but for the sake of brevity I’ll limit my discussion to pointing out how it sheds light on the nature of “randomness.” Like “information,” it’s a broad term that raises the danger of “definition drift,” even within one’s own mind; I’ve seen big names in certain scientific fields unconsciously shift their usage of the term from “unintentioned” to “uncaused” to “indeterminate” and back again, without discerning the subtle differences between them all. In data mining, we’re basically trying to uncover information that is badly concealed, whereas in cryptography it’s deliberately concealed by mixing it in with apparently random information; the patterns generated by cryptographers are the opposite of random in one specific sense, that it takes great intelligence and hard work to deliberately create these false patterns. Cryptanalysis involves detecting these deliberate patterns, in order to remove the noise obscuring the original message. Of course, “noise” is a subjective matter determined entirely by the questions we choose to ask of the data (although the answers to those questions are entirely objective). Randomness can thus be viewed as excess information of the wrong kind, not an absence of pattern or lack of causation. For example, if we’re looking for evidence of solar flares, then the static on shortwave radios might tell us a lot, but if we’re trying to listen to a broadcast from some foreign land (like the mysterious “numbers stations,” which I picked up when I was younger) then it degrades our information. Consider the case of paleontologists trying to follow dinosaur tracks: each rain drop that has fallen upon them over the eons has interfered with that pattern, but the rain itself follows a pattern, albeit the wrong kind. It is only random for our chosen purposes and pattern recognition goals, but perhaps not to some prehistoric weathermen to whom the pattern of rain might’ve been of keen interest. The relationship between randomness and information is thus quite deep; this brief introduction is merely a wade into the kiddie pool. Later on in this series we may have to dive in, if the measures of randomness I hope to discuss call for it.
…………Shannon’s Entropy and related concepts from information theory have all sorts of far-flung implications, some of which are understood in great detail (particularly when they relate directly to designing codes with minimum information loss) and others which are really intellectually challenging. One of the simplest yet trickiest rules is that the greater the uncertainty, the greater the amount of probabilistic information that can be conveyed with each new record. [14] Some of the corollaries of information theory can be put to good use in reasoning from the data we’ve mined, which is after all, its raison d’etre. The principle of maximum entropy, for example, states that the ideal probability density functions (PDFs) maximize Shannon’s Entropy when producing the expected values.[15] I’d imagine that this could be leveraged for such good purposes as mining model selection. There are all kinds of hidden relationships to other forms of information I hope to cover eventually, such as redundancy, which is a big topic within coding theory. Someday I hope to tackle measures of semantic information, which can be extremely difficult to grasp because they quantify the actual meaning of data, which can be quite slippery. Sometimes the absence of any data at all can tell us everything we need to know, which many mystery writers and crime scene investigation shows have put to good use. More often, people often differ not only between but within themselves as the meaning they choose to assign to terms and numbers, without thinking it through clearly. Perhaps this would make it an ideal field to apply fuzzy sets to, since their chief use cases include modeling imprecision in natural language, as I discussed ad nauseum in the last tutorial series.

Semantic Misinterpretation, Cybernetics, “Disorder” and Other Subtle Stumbling Blocks

Shannon’s Entropy is probably further away from semantic information than fuzzy sets, but that hasn’t stopped many theorists from mistakenly conflating stochastic information with meaning. Shannon himself warned against this kind of bandwagon-jumping.[16] In fact, that’s probably the most common stumbling block of all. Perhaps the clearest cautionary tale is the whole philosophy developed out of information theory by bombastic mathematician Norbert Wiener (1894-1964), who asserted explicitly and deliberately” the very same connection to semantic meaning that Shannon and others cautioned against.[17] Another great mathematician who took Shannon’s Entropy a little too far into the realm of semantic meaning was Shannon’s colleague, Warren Weaver (1894-1978). Wiener’s philosophy is known by the cool name of “cybernetics,” which has a certain intriguing flair to it – just like the loaded terms “fuzzy sets” and “chaos theory”, which are often used in a manner precisely opposite to their meaning and purpose. Nobody is really quite certain what “cybernetics” is, but that hasn’t stopped academics from establishing a journal by that name; some of its contributors are among the leading names in analytics I most respect, but I find other commentators on the field as disturbing as some of the frenetic theologians who followed Pierre Teilhard de Chardin. To put it simply, information theory has been put to even more bad uses in the last seven decades than other hot mathematical ideas, like chaos theory and fuzzy sets – some of which border on madness.
…………I’m not convinced that even a lot of the authors who specialize this sort of thing really grasp all of these intricacies yet, although most at least have the common sense not to idolize these ideas and blow them up into full-blown crackpot philosophies. For example, there seems to be a lot of misunderstanding about its relationship to measures of complexity, structure and order, which are alike but not precisely the same. To complicate matters further, they all overlap thermodynamic entropy, in which not the same thing as probabilistic entropy. They both ultimately proceed from mathematical relationships like the Law of the Large Numbers, but measure different things which are not always connected; Shannon noticed the resemblance between the two off the bat, as did Leo Szilard (1898-1964), hence the name “entropy.” This kind of entropy is not really “disorder”; it merely removes the energy a system one would need to move from a disordered to an ordered state and vice-versa. Essentially, it freezes a system into its current state, no matter what order it exhibits. Likewise, zero probabilistic entropy always signifies complete certainty, but not necessarily lack of structure; we could for example, be quite certain that our data is disorganized. It is perhaps true that “…H can also be considered as a measure of disorganization…The more organized a system, the lower the value of H”[18] but only in a really broad sense. Furthermore, entropy prevents new complexity from arising in a system; the range of possible states that are reachable from a system’s current state are determined by its energy, so that the greater the entropy, the greater the number of unreachable states. Without an input of fresh energy, this set of reachable states cannot increase. This means that entropy tends towards simplicity, which can nevertheless exhibit order, while still not ruling out complexity. The same is true of probabilistic entropy, which intersects with algorithmic complexity at certain points.
…………The latter measures the shortest possible description of a program, which Shannon also investigated in terms of the minimum length of codes. He “further proclaimed that random sources – such as speech, music, or image signals – possess an irreducible complexity beyond which they cannot be compressed distortion-free. He called this complexity the source entropy (see the discussion in Chapter 5). He went on to assert that if a source has an entropy that is less than the capacity of a communication channel, the asymptotically error-free transmission of the source over the channel can be achieved.”[19] It is not wise to conflate these two powerful techniques, but when probabilistic and algorithmic information intersect, we can leverage the properties of both to shed further light on our data from both angles at once. In essence, we can borrow the axioms of both at the same time to discover more new knowledge, with greater certainty. One of the axioms of algorithmic complexity is that a program cannot contain a program more sophisticated than itself, for essentially the same reasons that a smaller box cannot contain a larger one in the physical realm. This is related to a principle called the Conservation of Information, which operates like the Second Law of Thermodynamics, in that absent or lost information cannot be added to a system from itself; since its violation would be a logical contradiction, actually more solid than its thermodynamic counterpart, which is based merely on empirical observation. It is essentially a violation of various No Free Lunch theorems. This has profound implications for fields like artificial intelligence and concepts like “self-organization” that are deeply intertwined with data mining. Since the thermodynamic entropies aren’t as closely related to data mining and information theory, I’ll only spend a little bit of time on them a couple of articles from now, where I’ll also dispense with extraneous topics like quantum entropies. There are many other directions we could take this discussion in by factoring in things like applications without replacement, compound events, multiple possible outcomes, mutually exclusive events, unordered pairs and sources with memory (non-Markov models). Instead, I’ll concentrate on using this article as a springboard to more complex forms of probabilistic entropy that might be of more use to SQL Server data miners, like leaf and root entropies, binary entropies like the conditional and joint, the information and entropy rates and  I’ll gradually build up towards more complex cases like Mutual, Lautum and Shared Information at the tail end of this segment of the tutorial series, whereas the Cross Entropy will be saved for a future segment on an important distance measure called the Kullback-Leibler Divergence. The next logical step  is to discuss the Rényi Entropy, which subsumes the Shannon and Hartley Entropies with other relatives in a single formula.

[1] I lost my original citation for this, but believe it is buried somewhere in Klir, George J., 2006, Uncertainty and Information: Foundations of Generalized Information Theory, Wiley-Interscience: Hoboken, N.J.

[2] See the Wikipedia page “Claude Shannon” at http://en.wikipedia.org/wiki/Claude_Shannon

[3] See the Wikipedia articles “Hartley Function” and “Ralph Hartley” at http://en.wikipedia.org/wiki/Hartley_function and http://en.wikipedia.org/wiki/Ralph_Hartley respectively.

[4] pp. 5-6, Shannon, C.E., 1974, “A Mathematical Theory of Communication,” pp. 5-18 in Key Papers in the Development of Information Theory, Slepian, David Slepian ed. IEEE Press: New York.

[5] p. xv, Mansuripur, Masud, 1987, Introduction to Information Theory. Prentice-Hall: Englewood Cliffs, N.J.

[6]  “It is nevertheless quite remarkable that, as originally shown by Shannon, one can show that, by proper encoding into long signals, one can attain the maximum possible language transmisssion capacity of a system while at the same time obtaining a vanishingly small percentage of errors.” p. 172, Goldman, Stanford, 1953, Information Theory. Prentice-Hall: New York.

[7] p.  272, Pierce, John Robinson, 1980, An Introduction to Information Theory: Symbols, Signals & Noise. Dover Publications: New York. Also see Pierce, John Robinson, 1961, Symbols, Signals and Noise: The Nature and Process of Communication. Harper: New York

[8] p. 31, Ritchie, L. David., 1991, Information. Sage Publications: Newbury Park, Calif.

[9] p. 259, Klir, George J. and Yuan, Bo, 1995, Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall: Upper Saddle River, N.J.

[10] p. 12, Brillouin, Léon, 1964, Science, Uncertainty and Information. Academic Press: New York. .

[11] Which I originally downloaded from the University of California at Irvine’s Machine Learning Repository.

[12] One source I found it mentioned in was Bengtsson, Ingemar, 2008, Geometry of Quantum States: An Introduction to Quantum Entanglement. Cambridge University Press: New York.

[13] See the Wikipedia page “Entropy (Information Theory)” at http://en.wikipedia.org/wiki/Entropy_(information_theory)

[14] p. 87, Wright, Robert, 1988, Three Scientists and Their Gods: Looking For Meaning in an Age of Information. Times Books: New York.

[15] See the Wikipedia page “Minimum Fisher Information” at http://en.wikipedia.org/wiki/Minimum_Fisher_information

[16] p. xvi, Mansuripur and p.  61, Ritchie, 1991.

[17] p. 289, Bar-Hillel, Yehoshua, 1964, Language and Information: Selected Essays On Their Theory and Application. Addison-Wesley Pub. Co.: Reading, Mass.

[18] p. 5, Ritchie.

[19] p. 143, Moser, Stefan M. and Po-Ning, Chen, 2012, A Student’s Guide to Coding and Information Theory.  Cambridge University Press: New York

## Information Measurement with SQL Server, Part 1: A Quick Review of the Hartley Function

By Steve Bolton

…………This long-delayed series of amateur self-tutorials has been in the works ever since I began writing my A Rickety Stairway to SQL Server Data Mining series, which made it clear to me that I didn’t know enough about what was going on under the hood in SSDM. I still don’t know enough about the reasoning behind the various data mining algorithms implemented by SQL Server and other tools, but I am certain of one thing: I never will know enough, even if I actually became competent in these topics. These fields are just too detailed, broad and rooted in poorly understood corners of pure reason for anyone to master, let alone myself. Like my series on SSDM, this foray into coding all of the basic measures of information theory and related fields may well exemplify University of Connecticut statistician Daniel T. Larose’s witticism that “data mining is easy to do badly.”[1] My purpose in the Rickety series was merely to demonstrate that useful results can still be derived from SSDM, even when it is badly mishandled. In this series, I will try to explain how the metrics used in many data mining algorithms can be used to answer a whole cornucopia of questions about our datasets, such as: How much information might there be, as measured in terms of possible state descriptions and probabilities? How much meaning (i.e. semantic information) might it have? How many state descriptions does it rule out? How much is already known? How random, aperiodic, redundant, complex or ordered is it? Another interesting challenge is determining the shortest possible specifications of a structure.
…………There are literally dozens upon dozens of measures available to answer all of these questions, taken from such diverse fields as information theory, chaos theory, algorithmic complexity and many others which provide the basic building blocks of most data mining algorithms. The techniques provided by these fields are powerful, yet contain many logical stumbling blocks that far smarter people than ourselves have tripped over, often without even knowing it; these range from instances of subtle “definition drift” in the meaning of terms like “randomness” over the course of textbooks, to the development of full-blown crackpot theories by scholars overexcited by the potential of systems like chaos and information theory. I am bound to make mistakes along the way, given that I’m an amateur at this, so by all means take care when implementing my code and trusting my analysis, which is sometimes just plain wrong. On the other hand, these techniques are so powerful and so under-utilized that there is a crying need for some explanation of how they can be applied in a SQL Server environment, even a poor one. I know a little bit more of certain areas of philosophy – especially historical instances of when it has gone very wrong – so I can occasionally make a contribution by commenting on how to avoid fallacious reasoning, which is even more of a problem once these sophisticated topics are clouded over by excess jargon and complex math formulas.

DIY Data Mining and the Scope of the Series

One thing I’ve learned while trying to force-feed myself the underlying math is that the analysis marketplace is decades behind the research in some ways; there is no way that any single company is ever going to be able to code all of the extant mining algorithms, assuming it is even possible to tally them all up. This means that it may be beneficial in the years to come to have the skills to build DIY solutions. Throughout this series I will provide T-SQL code so that SQL Server DBAs and data miners can implement some of these techniques on their own, without waiting for developers of mining software to code the particular algorithms that fit their use cases. Why T-SQL? I could make a strong case that the gradual accretion of features like windowing functions is slowly making set-based languages ideal for this purpose, although they are rarely thought of in that way; it boils down to the fact that most of the problems covered in these interrelated fields are much easier to express in terms of sets. Furthermore, the sheer size of “Big Data” (which has steadily gotten “bigger” ever since the first records were kept; like “globalization”, it has only accelerated in recent years) requires taking computing to a new level of abstraction in order to simplify things, which is a purpose that set-based languages can fulfill. This series will be a learning experience for me as well, in which I hope to at least help others avoid my mistakes, by teaching through misadventure; writing a series like this aids me in absorbing the material a lot faster, while also getting valuable practice in translating the difficult underlying math formulas into code. Since I don’t know what I’ll discover along the way, I suspect that at some point I may resort to using some of my other favorite languages, like Multidimensional Expressions (MDX) and Visual Basic .Net, possibly in conjunction with Common Language Runtime (CLR) stored procedures. One of the benefits I hope to provide is to take most of the math out of the picture, so that readers don’t get bogged down in it; the jargon and formulas are absolutely necessary for the researchers to communicate with each other, but there is no need for data miners to be writing formal mathematical proofs, just as commuters shouldn’t have to give a dissertation on automotive engineering in order to drive their cars. I’ve sometimes received comments to the effect that there’s too much text in these articles, but that’s because they don’t see the hundreds of pages of math formulas that gave rise to them; rather than stringing together some meaningless screenshots or rehashing MSDN tutorials from AdventureWorks, I aim to show how these techniques might be coded and how they can be used appropriately.
…………The goal in this series is to corner the uncertainty in our datasets by shining lights on it from every possible direction, so that organizations can make better decisions that result in more effective actions. To that end, everything from measures used in Bayesian inference to Solomonoff Algorithmic Probability to the calculation of periodicities to the Lyapunov Exponent will be fair game. These metrics vary quite widely in terms of sophistication, so at points we’ll cross the imprecise boundaries separating them from full-blown data mining algorithms; the dividing lines separating statistics, data mining, machine learning, “soft computing,” predictive analytics and the like seem to boil down to degrees of intricacy, rather than differences in kind, especially since their algorithms are derived from the same metrics and building blocks. My last mistutorial series was designed from the ground up to build on existing concepts, one article at a time. This one will be inherently disorganized, since the scope is so wide and I don’t know what I will find along the way. I will have to skip around quite a bit across topics that may only be distantly related, or across levels of sophistication exhibited by the measures of information. It will also be an open-ended series, whereas the Rickety series was merely necessary to cover a specific set of easily denumerable features. We could delve into dozens of obscure metrics if the need arises, or even concoct our own special-purpose metrics, if a use case calls for it.
…………The series may explore a wide-ranging topics along paths are still somewhat unknown, but I can at least kick it off by introducing some of the primordial foundations of information theory. Perhaps the simplest is a function developed in 1928 by electronics pioneer Ralph Hartley[2], who applied it to signal transmission[3] a few decades before Claude Shannon did the same with his own renowned entropy measure. Although Hartley considered it a measure of information, his function did not lead to the use of the term of “information theory,”[4] which was coined when Shannon’s famous equation gave birth to the field. This introduction will be made even easier by the fact that I already discussed a more advanced version of Hartley’s measure in Implementing Fuzzy Sets in SQL Server, Part 9: Measuring Nonspecificity with the Hartley Function. The version for ordinary “crisp” sets of the kind DBAs and data miners are accustomed to working is actually quite a bit easier to code and interpret: all we have to do is count the records in a set and take the logarithm. The code in Figure 1 is actually longer than it has to be, given that I used three different logarithm bases to measure the same quantity, for the sake of completeness. When base 2 is used, the units are known as bits or “shannons.” When the default value of Euler’s Number is used, they’re known as “nats,” but with base 10 we’re measuring in hartleys or “bans” (a term coined by famed cryptographer Alan Turing). It would be trivial to turn this into a stored procedure with an option to select the desired units. It only took about a second to calculate the results in Figure 2 on the first float column of the Higgs Boson dataset I downloaded from University of California at Irvine’s Machine Learning Repository a few tutorial series ago, which I converted to a 5-gigabyte SQL Server table. This was calculated effortlessly across all 11 million rows only because we had to performs some simple counts, without traversing the whole table.

Figure 1: Code for the Ordinary “Crisp” Version of the Hartley Function
DECLARE @HartleyEntropy float, @DistinctCount bigint, @Count bigint

SELECT @DistinctCount = Count(DISTINCT Column1), @Count = Count(*)
FROM Physics.HiggsBosonTable

SELECT Log(@DistinctCount, 2) AS BitsOrShannons, Log(@DistinctCount, 2.7182818284590452353602874713526624977) AS Nats, Log(@DistinctCount, 10) AS Hartleys,
Log(@Count, 2) AS MultisetBitsOrShannons, Log(@Count, 2.7182818284590452353602874713526624977) AS MultisetNats, Log(@Count, 10) AS MultisetHartleys

Figure 2: Results from the Higgs Boson Dataset

…………The main problem I ran into was a fundamental one: the formula calls for plugging in the cardinality of the set, but in ordinary set theory parlance, duplicate values are only counted in multisets. The set notation includes the symbols |A|, but the bars specify a cardinality measure rather than the use of the ABS function. The question is, which cardinality? I’ve included both versions in Figure 1, which differ solely by the fact that one uses a DISTINCT operator and the other takes a standard COUNT. This discrepancy could have its uses though. I have yet to see this issue raised in the information literature, where Hartley’s metric is often skipped over as a topic of mainly historical interest in comparison to Shannon’s, but it may be possible to derive a third metric based on the difference between the two. A simple subtraction might help us quantify the contribution of the repeated values to the uncertainty, which could have its uses in the kinds of uncertainty management programs I spoke of in the fuzzy set series. In essence, this difference could act as a crude measure of redundancy. If it reflects the information gained by using the DISTINCT operator, we could use this to assess its costs. We know for a fact that the DISTINCT version can’t exceed the multiset version, which acts as a cap on its range. At the other end of the scale, the measure reaches the limit of 0 when all records have the same value, therefore giving us perfect certainty. Another important issue is whether or not the DISTINCT clause adds information, by reducing the amount of uncertainty about how many different values a variable can take on.
…………Of course, the clause might not be necessary at all if we knew in advance precisely which values were permissible for a column, such as the range of a decimal type or a varchar column limited to a handful of known category values. On the other hand, this raises a subtle distinction between values that are permissible for a column, which can be determined by the data type, and the values actually found, which can only be counted through a DISTINCT operator. The issue becomes more sophisticated if we are able to determine the counts of each individual value; these measures of “multiplicity,” as they are known in multiset theory, further reduce the uncertainty associated with the dataset. It is easy enough to implement these internal counts using windowing functions and GROUP BY statements, but the issue of how to factor them in quickly complicates the discussion of the otherwise simple Hartley function. Thankfully, the order of the records is only an issue with tuples, not the kinds of sets or multisets we plug into it – except when we get to the end of the dataset and can determine the last records from the remaining counts, which is something I’ve not seen addressed in the literature. This brings applications without replacement (i.e., the kind of probabilities associated with decks of cards where no reshuffling takes place) into play, at least briefly.

“Newsworthiness”: The Narrow Definition of “Information”

Aside from all of these hidden subtleties, the information provided by both the DISTINCT and multiset versions can be summed up thus: how much am I learning each time I inspect a row and verify the actual value found there? This is equivalent to asking how much we learn from each slip of paper in a drawing, if we know the count of the jar in advance. In more advanced terms, we can think of this as increasing information by reducing the number of possible state descriptions the next record can take on; in such a context, whether or not repeated values are allowed makes a big difference. The same goes for their individual counts, if the answer is yes. Keep in mind that this type of “information” is practically the polar opposite of existing knowledge; basically, the higher the Hartley measure is, the less we don’t already know, so the more the next record can tell us. It is a highly specific type of information, which former journalists like myself might equate with “newsworthiness.” This is the dividing line between existing and new knowledge is precisely where measures of entropy (like the Hartley function) intersect with topics like Bayesian probability, which I will also address at some point in this series; as we shall see, many of these information measures are interrelated in complex ways. This highly specific definition of information is an important distinction that pertains to all of the other measures of entropy we’ll discuss in this series; interpretation is a critical stage in every field associated with data mining, particularly information theory, which should never be shortchanged in any workflow.
…………The Hartley function can be leveraged in a data mining workflow in various ways, such as calculating the reduction in uncertainty between two separate measures; this could be useful, for example, in specifying a numerical cut-off point in bits or bans, after which it’s not worthwhile to go on inspecting rows, for whatever end purpose that might be, such as sampling. The sample code in Figure 3 takes the Hartley measure after the 10 millionth row, or about 9 percent away from the end of the dataset, which is why remaining uncertainty in Figure 4 is so low; as we approach the last uninspected record, the remaining uncertainty would approach zero. This formula would be equivalent to counting the remaining records and plugging the results into the Hartley function. Another interesting question is whether or not we could we pair this with cardinality estimation, to get a ballpark figure of how much we can learn from each record we inspect, before we’ve even traversed a dataset. I don’t know much about cardinality estimation yet, but the possibility is tantalizing

Figure 3: Calculating the Remaining Uncertainty with the Hartley Function
DECLARE @DistinctCountOfKnownValues bigint, @CountOfKnownValues bigint

SELECT @DistinctCountOfKnownValues = Count(DISTINCT Column1), @CountOfKnownValues = Count(*)
FROM Physics.HiggsBosonTable
WHERE ID BETWEEN 1 AND 10000000

SELECT Bits, KnownBits, Bits KnownBits AS RemainingUncertaintyInBits, MultisetBits, KnownMultisetBits, MultisetBits KnownMultisetBits AS RemainingMultisetUncertaintyInBits
FROM (SELECT Log(@DistinctCount, 2) AS Bits, Log(@DistinctCountOfKnownValues, 2) AS KnownBits, Log(@Count, 2) AS MultisetBits, Log(@CountOfKnownValues, 2) AS KnownMultisetBits) AS T1

Figure 4: Uncertainty Reduction Results

…………The kinship between the Hartley function and the rest of information theory is evident in some of its alternative names, like the Hartley Entropy or Max Entropy. It is equivalent to the Rényi Entropy[5] with its alpha parameter (α) set to 0, as I’ll explain a few articles from now. It’s also identical to the Shannon Entropy in cases of the uniform distribution, i.e. when all values are equally likely.[6] I’ll be spending a couple of articles on various aspects of entropy early on this series, since it’s such an important concept in information theory. The math and logic can get thick pretty quickly in this field, so it is best to start off with the measure that started it all, Shannon’s infamous “H.” I recognized its signature combination of a negative summation operator and log operation when translating some of the equations used in Implementing Fuzzy Sets in SQL Server, Part 10.2: Measuring Uncertainty in Evidence Theory into T-SQL. As with the Hartley function, this previous exposure to more advanced fuzzy derivatives ought to make the material a little easier to swallow. The difficulty with Shannon’s Entropy, however, is not in its calculation, but in its proper interpretation. Yet as long as we have the rigor to avoid assigning unwarranted shades of meaning to the term “information,” it can be a powerful addition to our data mining toolbelts.

[1] p. xii, LaRose, Daniel T., 2005, Discovering Knowledge in Data: An Introduction to Data Mining. Wiley-Interscience: Hoboken, N.J.

[2] See the Wikipedia articles “Hartley Function” and “Ralph Hartley” at http://en.wikipedia.org/wiki/Hartley_function and http://en.wikipedia.org/wiki/Ralph_Hartley respectively.

[3] p. 5 Ritchie, L. David., 1991, Information. Sage Publications: Newbury Park, Calif.

[4] p. 288, Bar-Hillel, Yehoshua, 1964, Language and Information: Selected Essays On Their Theory and Application. Addison-Wesley Pub. Co.:        Reading, Mass.

[5] See the Wikipedia article “Rényi Entropy” at http://en.wikipedia.org/wiki/R%C3%A9nyi_entropy

[6] See the Wikipedia article “Hartley Function” at http://en.wikipedia.org/wiki/Hartley_function