Blog Archives

Information Measurement with SQL Server, Part 2.2: The Rényi Entropy and Its Kin

By Steve Bolton

…………I kicked off this far-ranging series on using SQL Server to quantify information by discussing two of the earliest and most important measures, the Hartley function and Shannon’s Entropy. These foundations of information theory are intimately related to a more general measure, Rényi Entropy, which is a bit more complex but nonetheless worthwhile to discuss, since its unites many different information measures under one umbrella. The underlying math formula isn’t much more difficult than the one for Shannon’s Entropy I posted last time around, but its alpha parameter (α-parameter) enables it to give rise to a wider range of results. This assortment of entropy types can be adapted to solving a wider range of problems.
…………A general-purpose means of parameterizing the concept of entropy was the explicit goal of Alfréd Rényi (1920-1971), a mathematician who overcame anti-Semitic persecution at the hands of Hungary’s World War II regime, which was allied with the Nazis and passed stringent laws against their own Jewish minority. Incidentally, his coffee addiction was the inspiration for the colorful saying, “A mathematician is a device for turning coffee into theorems”, which apparently predates the saying, “A programmer is a machine that turns coffee into code.”[1] The formula that bears his name has proven useful in many diverse fields and industries, from quantum mechanics to fractals to ecology and statistics, where it is useful in deriving indices of diversity.[2] When the α-parameter is set to 2, it becomes Collision or Quadratic Entropy (or sometimes just “the Rényi Entropy”), which “has been used in physics, in signal processing and in economics” and is attractive to statisticians because they “have found an easy way to estimate it directly from samples.”[3] Given that it has multifarious uses and is both easy to explain and compute in SQL Server, it makes sense to add it to our toolbelt, in order to derive our own DIY data mining methods. One thing I’ve come to realize since writing my initial A Rickety Stairway to SQL Server Data Mining tutorial series awhile back is that the data mining marketplace is decades behind the research in some respects, to the point where all of the available software taken together is probably several orders of magnitude behind the formulas available in the math books. If we encounter situations where the particular algorithms that would work best for our use cases aren’t yet implemented, waiting might not be an option, and if we’re going to learn to build our own, Rényi Entropy is bound to come in handy at some point.

Coding the Range of α-Parameter  Values

                The key to understanding Rényi Entropy is grasping how it intersects with its cousins. The α-parameter can be any non-negative value except for 1, which would result in a divide-by-zero error since one of the terms is 1 / (1 – α). This result is multiplied by a logarithm taken on the sum of the probabilities taken to the power of α; this differs from Shannon’s famous equation in that his log operation is performed on each probability and then multiplied by that probability, then summed afterwards and multiplied by -1. This logic is all implemented in the INSERT operation in Figure 1, in which most of the code is devoted to deriving actual probability values from the proportions of a single float column, taken from the same Higgs Boson dataset[4] I’ve been using for practice purposes for the past few tutorial series. I omitted any display of the intermediate proportions stored in the @EntropyTable variable, since they’re identical to the sample results posted in last week’s article. Evidently, it’s economical to compute these entropies, given that this code takes less than 2 seconds to run on my ancient abacus of a development machine (I might as well be using quipus) and most of that was devoted to deriving the probabilities for all 11 million rows; if we already had probability values derived from some other source like estimates, sampling or deductive methods, it would run practically instantaneously. I also hard-coded the results we already calculated over the last two articles for the Hartley and Shannon Entropies, given that they are also unchanged. The final SELECT in Figure 2 provides the same information from another vantage point, by plugging a range of α-values rather than a single parameter into the Rényi formula.
…………Note that as the values approach 1, they converge towards the Shannon Entropy, although they can never quite reach it because of the divide-by-zero error that would occur if plugged in the forbidden parameter value of 1.Likewise, the return values approach the Hartley measure as the parameter values approach 0; this acts as a cap on the permissible values, hence the frequent use of the synonym Max Entropy. Because SQL Server’s calculation powers are impressive yet finite, we can’t plug in values approaching infinity (if such a thing were possible). In fact, we can’t even go much higher than α-values  of about 50 (which are rarely used in practice anyways) on this particular dataset without getting floating point errors. Figure 3 is sufficient to nevertheless sufficient to illustrate how the Rényi results approach an information measure known as the Min Entropy, which acts as a cap on the other end. References to it as the Chebyshev Entropy[5] seem to be few and far between outside the realm of quantum physics, but I’ve run into a few here and there. It is nonetheless easy to calculate, given that all we have to do is find the maximum probability value and take the negative log. As Wikipedia points out quite succinctly, “The min entropy is never greater than the ordinary or Shannon entropy (which measures the average unpredictability of the outcomes) and that in turn is never greater than the Hartley or max entropy, defined as the logarithm of the number of outcomes.”[6] It is also useful as the “most conservative way of measuring the unpredictability of a set of outcomes,” a property that makes it useful in determining randomness and quantum cryptography, in a way that the Shannon Entropy can’t handle.[7] Each Rényi value in between the fences set by the Min, Hartley and Shannon Entropies has its own distinct uses in ascertaining “the unpredictability of a nonuniform distribution in different ways.”[8] We can thus choose whatever α-parameter is ideal for dealing with the particular distribution of our data. As an amateur, I’m lacking in the experience needed to determine which α-values are best-suited to specific problems, but this code at least provides a launching pad for exploring such questions.

Figure 1: The Rényi Entropy in T-SQL
DECLARE @LogarithmBase decimal(38,36) = 2  –2.7182818284590452353602874713526624977 — 10
DECLARE @AlphaParameter decimal(38,35) = 2  

DECLARE @Count  bigint, @DistinctValueCount bigint, @RenyiEntropy float, @MinOrChebyshevEntropy float

SELECT @Count = Count(*)
FROM Physics.HiggsBosonTable
WHERE Column1 IS NOT NULL

DECLARE @EntropyTable table
(Value decimal(33,29),
ValueCount bigint,
Proportion float,
SummationInput float
) 

INSERT INTO @EntropyTable
(Value, ValueCount, Proportion, SummationInput)
SELECT Value, ValueCount, Proportion, Power(Proportion, @AlphaParameter) AS SummationInput
FROM (SELECT Value, ValueCount, ValueCount / CAST(@Count AS float) AS Proportion
       FROM  (SELECT Column1 AS Value, Count(*) AS ValueCount
              FROM Physics.HiggsBosonTable
              WHERE Column1 IS NOT NULL
              GROUP BY Column1) AS T1) AS T2

SELECT @RenyiEntropy = 1 / (1 @AlphaParameter) * Log(SUM(SummationInput), @LogarithmBase), @MinOrChebyshevEntropy = 1 * Log(Max(Proportion), @LogarithmBase)
FROM @EntropyTable

SELECT @RenyiEntropy AS RenyiEntropy, @MinOrChebyshevEntropy AS MinEntropy, 13.2879304339032 AS ShannonsEntropy, 14.772263018717 AS HartleyOrMaxEntropy 

Figure 2: A Range of Sample T-SQL
–WHOLE RANGE OF RENYI ENTROPY
SELECT AlphaParameter, 1 / (1 AlphaParameter) * Log(SUM(Power(Proportion, AlphaParameter)), @LogarithmBase) AS ResultingRenyiEntropy
FROM (VALUES (0.01), (0.5),(1.01),(2),(3),(4),(5),(10),(50)) AS T1(AlphaParameter)
     LEFT JOIN @EntropyTable AS T2
     ON 1 = 1
GROUP BY AlphaParameter
ORDER BY AlphaParameter

 Figure 3: Results from the Higgs Boson Dataset

…………Jose C. Principe, a neural net expert who teaches at the University of Florida, has written a handy 35-page guide to using the Rényi Entropy for tasks even further beyond my level of inexpertise, such as deriving measures like Information Potential based on probability distribution function (PDF) estimates. This quickly leads into the bleeding edge topic of information geometry, which I won’t touch with a ten-foot pole for a few more years, at least until I have a better understanding of such mind-blowing concepts as statistical Riemann manifolds and multidimensional spaces. The introductory pages are nevertheless useful in illustrating how Rényi Entropy works on simpler expressions of information. This is particularly true of the excellent diagram on the page 4, which shows how the various α-values measure the distance of a PDF to the origin point, thereby filling in the available space for information content in different proportions. As we shall see later in this series of self-tutorials, plugging different α-values into a related concept known as the Rényi Divergence also allows us to measure the distances between two PDFs in various ways. This is intimately related to the Kullback–Leibler Divergence in much the same way as the Rényi Entropy is related to the Shannon Entropy. The KL-Divergence is among the most important measures in information theory and related fields so I’ll be writing quite a bit on the subject later on in the series, as I did with the Shannon Entropy in the last article. It also goes by the name of Relative Entropy, but I’ll have to put off that discussion for a later time, when I can set aside a separate segment of the series for distance and divergence measures between probability distributions. Over the next four articles I’ll stick to the topic of entropy measures on single distributions, including Leaf and Root Entropy, which are inextricably related to the SSDM algorithm I covered in A Rickety Stairway to SQL Server Data Mining, Algorithm 3: Decision Trees.

 

[1] See the Wikipedia page “Alfréd Rényi” at  http://en.wikipedia.org/wiki/Alfr%C3%A9d_R%C3%A9nyi

[2] See the Wikipedia article “Rényi Entropy” at http://en.wikipedia.org/wiki/R%C3%A9nyi_entropy

[3] pp. 6-7, Principe, Jose C., 2009, “Rényi Entropy,” course notes posted at the University of Florida webpage http://www.cnel.ufl.edu/courses/EEL6814/renyis_entropy.pdf

[4] I downloaded this publicly available dataset from University of California at Irvine’s Machine Learning Repository and while back and converted it to a SQL Server table, which now takes up about 5 gigabytes in a sham DataMiningProjects database.

[5] One source I found it mentioned in was Bengtsson, Ingemar, 2008, Geometry of Quantum States: An Introduction to Quantum Entanglement. Cambridge University Press: New York.

[6] See the Wikipedia article “Min Entropy” at http://en.wikipedia.org/wiki/Min_entropy

[7] IBID.

[8] IBID.

Advertisements

Information Measurement with SQL Server, Part 2.1: The Uses and Abuses of Shannon’s Entropy

By Steve Bolton

…………In the first installment of this wide-ranging series of amateur tutorials, I noted that the Hartley function indeed returns “information,” but of a specific kind that could be described as “newsworthiness.” This week’s measure also quantifies how much we add to our existing knowledge from each new fact, but through explicitly stochastic methods, whereas the Hartley function is more closely related to counts of state descriptions than probabilities.[1] Shannon’s Entropy is of greater renown than its predecessor, but isn’t much more difficult to code in T-SQL. In fact, it should be a breeze given that I already posted a more advanced version of it tailor-made for Implementing Fuzzy Sets in SQL Server, Part 10.2: Measuring Uncertainty in Evidence Theory, just as I had already coded a modified version of Hartley’s measure for Implementing Fuzzy Sets in SQL Server, Part 9: Measuring Nonspecificity with the Hartley Function.
…………Historically, the real difficulty with the metrics of information theory is with their interpretation, largely due to the broad and fuzzy meaning of the imprecise term, “information.” As we shall see, many brilliant theorists far smarter than we have made the mistake of overextending it beyond its original domain, just as some thinkers have gotten carried with the hype surrounding fuzzy sets and chaos theory. Like those cutting-edge topics, Shannon’s Entropy and its relatives within information theory are indeed powerful, but can go badly wrong when misapplied outside its specific use cases. It has really subtle yet mind-blowing implications for all of the different classes of information measurement I hope to cover in this wide-ranging and open-ended series, not all of which are fully understood. My purposes in this series is to illuminate datasets from every possible direction, using a whole smorgasbord of measures of meaning (semantic information), randomness, complexity, order, redundancy and sensitivity to initial conditions, among others; since many of these metrics serve as the foundation of many mining algorithms, we can use them in a SQL Server environment to devise DIY data mining algorithms. Shannon’s Entropy intersects many of these disparate class of information in multifarious ways, such as the fact that it is equal to the Hartley measure in the case of uniform distributions (i.e. when all values are equally likely). It is not a measure of what is known, but of how much can be learned by adding to it; entropy measures rise in tandem with uncertainty precisely we because we can learn more from new facts when we don’t already know everything. It can thus be considered complementary to measures of existing knowledge, such as Bayes Factors. I’ve read more on information theory than many other fields I’ve commented on over the past couple of years, when I started writing self-tutorials on data mining, but that doesn’t mean I understand all of these complex interrelationships well. In fact, I’m writing this series in part because it helps me absorb the material a lot faster, and posting it publicly in the hopes that it can at least help readers avoid my inevitable mistakes. I lamented often in A Rickety Stairway to SQL Server Data Mining that SSDM was woefully under-utilized in comparison to its potential benefits, but the same can also be said of the algorithms that underpin it.

Channel Capacity as a Godsend

                Few of these information metrics are as well-known as the measure that Claude E. Shannon (1916-2001), a key American cryptographic specialist during the Second World War, introduced in 1948 in a two-part journal article titled “A Mathematical Theory of Communication.”[2] Most of it consists of math theorems and formulas that only got thicker as researchers built on his theory over the years, but it is noteworthy to mention that he credits earlier work by electrical engineers Ralph Hartley[3] (1888-1970) and Harry Nyquist (1889-1976) in the opening paragraphs.[4] I’ll skip over most of the proofs and equations involved – even the ones I understand – for the usual reasons: as I’ve pointed out in past articles, users of mining tools shouldn’t be burdened with these details, for the same reason that commuters don’t need a degree in automotive engineering in order to drive their cars. Suffice it to say that the original purpose was to determine the shortest possible codes in the case of noisy transmission lines. In this effort to extend coding theory, Shannon gave birth to the whole wider field of information theory. Masud Mansuripur, the chair of the Optical Data Storage Department at the University of Arizona, sums up Shannon’s surprising mathematical discovery best: “In the past, communications engineers believed that the rate of information transmission over a noisy channel had to decline to zero, if we require the error probability to approach zero. Shannon was the first to show that the information-transmission rate can be kept constant for an arbitrarily small probability of error.”[5] It is also possible to determine the information-carrying capacity of the channel at the same time.[6]
…………The idea is still somewhat startling to this day, as many writers in the field of information theory have pointed out ever since; it almost seems too good to be true. Basically, the idea is to engineer a “decrease of uncertainty as to what message might have been enciphered,” [7] in part by leveraging combinatorics to derive probabilities. For example, the Law of Large Numbers states that rolling a pair of dice is far more likely to result in a number like eight, since there are multiple ways of adding up both dice to get that result, whereas there’s only one combination apiece for snake eyes or boxcars. The edges of the resulting dataset are thus far less likely than the values in the middle, just as we’d find in a Gaussian or “normal” distribution, i.e. the ubiquitous bell curve. This gives rise to a dizzying array of axioms and lemmas and a whole set of procedures for determining codes, all of which I’ll skip over. For perhaps the first time in the history of this blog, however, I have a sound reason for posting the underlying equation: H = -Σ pi logb pi. It’s not a household name like E = mc2 or 2 + 2 = 4, but it’s still one of the most famous formulas of all-time. It relatives can be spotted in the literature by the telltale combination of a negative summation operator with a logarithm operation. Aside from the fact that it possesses many ideal mathematical properties[8], “it was proven in numerous ways, from several well-justified axiomatic characterizations, that this function is the only sensible measure of uncertainty in probability theory.”[9]

Translating H into Code

                As with Hartley’s measure, a base 10 logarithm results in units known as hartleys or bans (as famed cryptographer Alan Turing called them). With Euler’s Number (as in the natural logarithm) the units are known as nats and with base 2, they’re referred to as shannons or more commonly, bits. The code in Figure 1 uses bits, but users can uncomment the numbers that follow to use one of the other units. It is also worth noting that physicist Léon Brillouin used a “negentropy” measure that is basically the inverse of H, to compensate for the fact that Shannon’s Entropy measures information in terms of uncertainty, which is a bit counter-intuitive; it never really caught on though.[10]
…………My sample code derives some fake probabilities by taking the known proportions of values for a column in the same Higgs Boson dataset[11] I’ve been using for practice purposes for several tutorial series. The probabilities are then multiplied by the logs for those probabilities, then summed across the dataset and multiplied by -1. It’s actually quite simpler than it looks; the table variable and INSERT statement could be done away with altogether if we already had probability figures calculated through some other means, such as some sort of statistical sampling method or even reasoning out the probabilities from the underlying process, as writers on information theory often do with dice and cards to illustrate the concepts. In fact, if we know the underlying probability distribution in advance, we can sometimes calculate the entropy in reverse by using special formulas specific to certain distributions, such as the normal. Moreover, all of the code pertaining the @UniformProportion is included solely for the purpose of validating the results, which equal the results we received in the last article for the DISTINCT version of the Hartley function, as expected. If need be, users can also validate their results using Lukasz Kozlowski’s Shannon Entropy Calculator. All of that can code can also be removed at will. Despite all of this extraneous baggage, the code ran in just 1.9 seconds according to the Client Statistics, on 11 million rows of float values on my beat-up, wheezing semblance of a development machine. The execution plan was uneventful and consisted mainly of seeks on the nonclustered index I put on the column a couple of tutorial series ago.

Figure 1: Shannon’s Entropy in T-SQL
DECLARE @LogarithmBase decimal(38,36) =–2.7182818284590452353602874713526624977 — 10
DECLARE @Count bigint, @DistinctValueCount bigint, @ShannonsEntropy float, @EntropyForUniformDistribution float,
@MaxProbability float, @ChebyshevEntropy float, @MetricEntropy float, @UniformProportion float

SELECT @Count = Count(*), @DistinctValueCount = Count(DISTINCT Column1)
FROM Physics.HiggsBosonTable
WHERE Column1 IS NOT NULL 

SELECT @UniformProportion = 1 / CAST(@DistinctValueCount as float)
FROM Physics.HiggsBosonTable
WHERE Column1 IS NOT NULL

DECLARE @EntropyTable table
(Value decimal(33,29),
ValueCount bigint,
Proportion float,
SelfInformation float
)

INSERT INTO @EntropyTable
(Value, ValueCount, Proportion, SelfInformation)
SELECT Value, ValueCount, Proportion,  1 * Proportion * Log(Proportion, @LogarithmBase) AS SelfInformation
FROM (SELECT Value, ValueCount, ValueCount / CAST(@Count AS float) AS Proportion
       FROM  (SELECT Column1 AS Value, Count(*) AS ValueCount
              FROM Physics.HiggsBosonTable
              WHERE Column1 IS NOT NULL
              GROUP BY Column1) AS T1) AS T2 

SELECT * FROM @EntropyTable

SELECT @ShannonsEntropy = SUM(SelfInformation), @EntropyForUniformDistribution=   SUM (@UniformProportion * Log(@UniformProportion, @LogarithmBase)) * 1, @MaxProbability = Max(Proportion)
FROM @EntropyTable

— ====================================
— SOME RELATED ENTROPIES
— ====================================
SELECT @ChebyshevEntropy  = 1 * Log(@MaxProbability, @LogarithmBase)
SELECT @MetricEntropy  = @ShannonsEntropy / CAST(@Count as float)

SELECT @ShannonsEntropy AS ShannonsEntropy, @MetricEntropy AS MetricEntropy, @ChebyshevEntropy AS ChebyshevEntropy, @EntropyForUniformDistribution as EntropyForUniformDistribution

 

Figure 2: Calculating Shannon’s Entropy on the Higgs Boson Dataset

…………Note that in the last section of the code, I tacked on the Metric and Chebyshev Entropies, which are trivial to calculate once we have building blocks like Shannon’s Entropy. References to the Chebyshev Entropy[12], a.k.a the Min Entropy because it represents the minimum amount a variable can exhibit, seem to be few and far between outside the realm of quantum physics. Metric Entropy, on the other hand, can  serve as a simple measure of randomness[13], which is more likely to be useful to SQL Server users. These are among a couple dozen extensions and relatives of Shannon’s Entropy, a few of which I’ve already dealt with, like the Hartley Entropy (i.e. the Max Entropy). I won’t discuss Differential Entropy, the extensions to continuous variables using various methods of calculus, because SQL Server data types like float, decimal and numeric are actually discrete in the strict sense; they can’t actually represent infinitesimal grades in between values, any more than the data types of any other software can on finite computers. Nor will I delve into the myriad applications that have been developed from Shannon’s Entropy for signal transmission and coding theory, since these are off-topic.

Detection of “Randomness” and Other Uses

                Its relationship to cryptography might be a more appropriate subject, but for the sake of brevity I’ll limit my discussion to pointing out how it sheds light on the nature of “randomness.” Like “information,” it’s a broad term that raises the danger of “definition drift,” even within one’s own mind; I’ve seen big names in certain scientific fields unconsciously shift their usage of the term from “unintentioned” to “uncaused” to “indeterminate” and back again, without discerning the subtle differences between them all. In data mining, we’re basically trying to uncover information that is badly concealed, whereas in cryptography it’s deliberately concealed by mixing it in with apparently random information; the patterns generated by cryptographers are the opposite of random in one specific sense, that it takes great intelligence and hard work to deliberately create these false patterns. Cryptanalysis involves detecting these deliberate patterns, in order to remove the noise obscuring the original message. Of course, “noise” is a subjective matter determined entirely by the questions we choose to ask of the data (although the answers to those questions are entirely objective). Randomness can thus be viewed as excess information of the wrong kind, not an absence of pattern or lack of causation. For example, if we’re looking for evidence of solar flares, then the static on shortwave radios might tell us a lot, but if we’re trying to listen to a broadcast from some foreign land (like the mysterious “numbers stations,” which I picked up when I was younger) then it degrades our information. Consider the case of paleontologists trying to follow dinosaur tracks: each rain drop that has fallen upon them over the eons has interfered with that pattern, but the rain itself follows a pattern, albeit the wrong kind. It is only random for our chosen purposes and pattern recognition goals, but perhaps not to some prehistoric weathermen to whom the pattern of rain might’ve been of keen interest. The relationship between randomness and information is thus quite deep; this brief introduction is merely a wade into the kiddie pool. Later on in this series we may have to dive in, if the measures of randomness I hope to discuss call for it.
…………Shannon’s Entropy and related concepts from information theory have all sorts of far-flung implications, some of which are understood in great detail (particularly when they relate directly to designing codes with minimum information loss) and others which are really intellectually challenging. One of the simplest yet trickiest rules is that the greater the uncertainty, the greater the amount of probabilistic information that can be conveyed with each new record. [14] Some of the corollaries of information theory can be put to good use in reasoning from the data we’ve mined, which is after all, its raison d’etre. The principle of maximum entropy, for example, states that the ideal probability density functions (PDFs) maximize Shannon’s Entropy when producing the expected values.[15] I’d imagine that this could be leveraged for such good purposes as mining model selection. There are all kinds of hidden relationships to other forms of information I hope to cover eventually, such as redundancy, which is a big topic within coding theory. Someday I hope to tackle measures of semantic information, which can be extremely difficult to grasp because they quantify the actual meaning of data, which can be quite slippery. Sometimes the absence of any data at all can tell us everything we need to know, which many mystery writers and crime scene investigation shows have put to good use. More often, people often differ not only between but within themselves as the meaning they choose to assign to terms and numbers, without thinking it through clearly. Perhaps this would make it an ideal field to apply fuzzy sets to, since their chief use cases include modeling imprecision in natural language, as I discussed ad nauseum in the last tutorial series.

Semantic Misinterpretation, Cybernetics, “Disorder” and Other Subtle Stumbling Blocks

                Shannon’s Entropy is probably further away from semantic information than fuzzy sets, but that hasn’t stopped many theorists from mistakenly conflating stochastic information with meaning. Shannon himself warned against this kind of bandwagon-jumping.[16] In fact, that’s probably the most common stumbling block of all. Perhaps the clearest cautionary tale is the whole philosophy developed out of information theory by bombastic mathematician Norbert Wiener (1894-1964), who asserted explicitly and deliberately” the very same connection to semantic meaning that Shannon and others cautioned against.[17] Another great mathematician who took Shannon’s Entropy a little too far into the realm of semantic meaning was Shannon’s colleague, Warren Weaver (1894-1978). Wiener’s philosophy is known by the cool name of “cybernetics,” which has a certain intriguing flair to it – just like the loaded terms “fuzzy sets” and “chaos theory”, which are often used in a manner precisely opposite to their meaning and purpose. Nobody is really quite certain what “cybernetics” is, but that hasn’t stopped academics from establishing a journal by that name; some of its contributors are among the leading names in analytics I most respect, but I find other commentators on the field as disturbing as some of the frenetic theologians who followed Pierre Teilhard de Chardin. To put it simply, information theory has been put to even more bad uses in the last seven decades than other hot mathematical ideas, like chaos theory and fuzzy sets – some of which border on madness.
…………I’m not convinced that even a lot of the authors who specialize this sort of thing really grasp all of these intricacies yet, although most at least have the common sense not to idolize these ideas and blow them up into full-blown crackpot philosophies. For example, there seems to be a lot of misunderstanding about its relationship to measures of complexity, structure and order, which are alike but not precisely the same. To complicate matters further, they all overlap thermodynamic entropy, in which not the same thing as probabilistic entropy. They both ultimately proceed from mathematical relationships like the Law of the Large Numbers, but measure different things which are not always connected; Shannon noticed the resemblance between the two off the bat, as did Leo Szilard (1898-1964), hence the name “entropy.” This kind of entropy is not really “disorder”; it merely removes the energy a system one would need to move from a disordered to an ordered state and vice-versa. Essentially, it freezes a system into its current state, no matter what order it exhibits. Likewise, zero probabilistic entropy always signifies complete certainty, but not necessarily lack of structure; we could for example, be quite certain that our data is disorganized. It is perhaps true that “…H can also be considered as a measure of disorganization…The more organized a system, the lower the value of H”[18] but only in a really broad sense. Furthermore, entropy prevents new complexity from arising in a system; the range of possible states that are reachable from a system’s current state are determined by its energy, so that the greater the entropy, the greater the number of unreachable states. Without an input of fresh energy, this set of reachable states cannot increase. This means that entropy tends towards simplicity, which can nevertheless exhibit order, while still not ruling out complexity. The same is true of probabilistic entropy, which intersects with algorithmic complexity at certain points.
…………The latter measures the shortest possible description of a program, which Shannon also investigated in terms of the minimum length of codes. He “further proclaimed that random sources – such as speech, music, or image signals – possess an irreducible complexity beyond which they cannot be compressed distortion-free. He called this complexity the source entropy (see the discussion in Chapter 5). He went on to assert that if a source has an entropy that is less than the capacity of a communication channel, the asymptotically error-free transmission of the source over the channel can be achieved.”[19] It is not wise to conflate these two powerful techniques, but when probabilistic and algorithmic information intersect, we can leverage the properties of both to shed further light on our data from both angles at once. In essence, we can borrow the axioms of both at the same time to discover more new knowledge, with greater certainty. One of the axioms of algorithmic complexity is that a program cannot contain a program more sophisticated than itself, for essentially the same reasons that a smaller box cannot contain a larger one in the physical realm. This is related to a principle called the Conservation of Information, which operates like the Second Law of Thermodynamics, in that absent or lost information cannot be added to a system from itself; since its violation would be a logical contradiction, actually more solid than its thermodynamic counterpart, which is based merely on empirical observation. It is essentially a violation of various No Free Lunch theorems. This has profound implications for fields like artificial intelligence and concepts like “self-organization” that are deeply intertwined with data mining. Since the thermodynamic entropies aren’t as closely related to data mining and information theory, I’ll only spend a little bit of time on them a couple of articles from now, where I’ll also dispense with extraneous topics like quantum entropies. There are many other directions we could take this discussion in by factoring in things like applications without replacement, compound events, multiple possible outcomes, mutually exclusive events, unordered pairs and sources with memory (non-Markov models). Instead, I’ll concentrate on using this article as a springboard to more complex forms of probabilistic entropy that might be of more use to SQL Server data miners, like leaf and root entropies, binary entropies like the conditional and joint, the information and entropy rates and  I’ll gradually build up towards more complex cases like Mutual, Lautum and Shared Information at the tail end of this segment of the tutorial series, whereas the Cross Entropy will be saved for a future segment on an important distance measure called the Kullback-Leibler Divergence. The next logical step  is to discuss the Rényi Entropy, which subsumes the Shannon and Hartley Entropies with other relatives in a single formula.

 

[1] I lost my original citation for this, but believe it is buried somewhere in Klir, George J., 2006, Uncertainty and Information: Foundations of Generalized Information Theory, Wiley-Interscience: Hoboken, N.J.

[2] See the Wikipedia page “Claude Shannon” at http://en.wikipedia.org/wiki/Claude_Shannon

[3] See the Wikipedia articles “Hartley Function” and “Ralph Hartley” at http://en.wikipedia.org/wiki/Hartley_function and http://en.wikipedia.org/wiki/Ralph_Hartley respectively.

[4] pp. 5-6, Shannon, C.E., 1974, “A Mathematical Theory of Communication,” pp. 5-18 in Key Papers in the Development of Information Theory, Slepian, David Slepian ed. IEEE Press: New York.

[5] p. xv, Mansuripur, Masud, 1987, Introduction to Information Theory. Prentice-Hall: Englewood Cliffs, N.J.

[6]  “It is nevertheless quite remarkable that, as originally shown by Shannon, one can show that, by proper encoding into long signals, one can attain the maximum possible language transmisssion capacity of a system while at the same time obtaining a vanishingly small percentage of errors.” p. 172, Goldman, Stanford, 1953, Information Theory. Prentice-Hall: New York.

[7] p.  272, Pierce, John Robinson, 1980, An Introduction to Information Theory: Symbols, Signals & Noise. Dover Publications: New York. Also see Pierce, John Robinson, 1961, Symbols, Signals and Noise: The Nature and Process of Communication. Harper: New York

[8] p. 31, Ritchie, L. David., 1991, Information. Sage Publications: Newbury Park, Calif.

[9] p. 259, Klir, George J. and Yuan, Bo, 1995, Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall: Upper Saddle River, N.J.

[10] p. 12, Brillouin, Léon, 1964, Science, Uncertainty and Information. Academic Press: New York. .

[11] Which I originally downloaded from the University of California at Irvine’s Machine Learning Repository.

[12] One source I found it mentioned in was Bengtsson, Ingemar, 2008, Geometry of Quantum States: An Introduction to Quantum Entanglement. Cambridge University Press: New York.

[13] See the Wikipedia page “Entropy (Information Theory)” at http://en.wikipedia.org/wiki/Entropy_(information_theory)

[14] p. 87, Wright, Robert, 1988, Three Scientists and Their Gods: Looking For Meaning in an Age of Information. Times Books: New York.

[15] See the Wikipedia page “Minimum Fisher Information” at http://en.wikipedia.org/wiki/Minimum_Fisher_information

[16] p. xvi, Mansuripur and p.  61, Ritchie, 1991.

[17] p. 289, Bar-Hillel, Yehoshua, 1964, Language and Information: Selected Essays On Their Theory and Application. Addison-Wesley Pub. Co.: Reading, Mass.

[18] p. 5, Ritchie.

[19] p. 143, Moser, Stefan M. and Po-Ning, Chen, 2012, A Student’s Guide to Coding and Information Theory.  Cambridge University Press: New York

 

Information Measurement with SQL Server, Part 1: A Quick Review of the Hartley Function

By Steve Bolton

…………This long-delayed series of amateur self-tutorials has been in the works ever since I began writing my A Rickety Stairway to SQL Server Data Mining series, which made it clear to me that I didn’t know enough about what was going on under the hood in SSDM. I still don’t know enough about the reasoning behind the various data mining algorithms implemented by SQL Server and other tools, but I am certain of one thing: I never will know enough, even if I actually became competent in these topics. These fields are just too detailed, broad and rooted in poorly understood corners of pure reason for anyone to master, let alone myself. Like my series on SSDM, this foray into coding all of the basic measures of information theory and related fields may well exemplify University of Connecticut statistician Daniel T. Larose’s witticism that “data mining is easy to do badly.”[1] My purpose in the Rickety series was merely to demonstrate that useful results can still be derived from SSDM, even when it is badly mishandled. In this series, I will try to explain how the metrics used in many data mining algorithms can be used to answer a whole cornucopia of questions about our datasets, such as: How much information might there be, as measured in terms of possible state descriptions and probabilities? How much meaning (i.e. semantic information) might it have? How many state descriptions does it rule out? How much is already known? How random, aperiodic, redundant, complex or ordered is it? Another interesting challenge is determining the shortest possible specifications of a structure.
…………There are literally dozens upon dozens of measures available to answer all of these questions, taken from such diverse fields as information theory, chaos theory, algorithmic complexity and many others which provide the basic building blocks of most data mining algorithms. The techniques provided by these fields are powerful, yet contain many logical stumbling blocks that far smarter people than ourselves have tripped over, often without even knowing it; these range from instances of subtle “definition drift” in the meaning of terms like “randomness” over the course of textbooks, to the development of full-blown crackpot theories by scholars overexcited by the potential of systems like chaos and information theory. I am bound to make mistakes along the way, given that I’m an amateur at this, so by all means take care when implementing my code and trusting my analysis, which is sometimes just plain wrong. On the other hand, these techniques are so powerful and so under-utilized that there is a crying need for some explanation of how they can be applied in a SQL Server environment, even a poor one. I know a little bit more of certain areas of philosophy – especially historical instances of when it has gone very wrong – so I can occasionally make a contribution by commenting on how to avoid fallacious reasoning, which is even more of a problem once these sophisticated topics are clouded over by excess jargon and complex math formulas.

DIY Data Mining and the Scope of the Series

                One thing I’ve learned while trying to force-feed myself the underlying math is that the analysis marketplace is decades behind the research in some ways; there is no way that any single company is ever going to be able to code all of the extant mining algorithms, assuming it is even possible to tally them all up. This means that it may be beneficial in the years to come to have the skills to build DIY solutions. Throughout this series I will provide T-SQL code so that SQL Server DBAs and data miners can implement some of these techniques on their own, without waiting for developers of mining software to code the particular algorithms that fit their use cases. Why T-SQL? I could make a strong case that the gradual accretion of features like windowing functions is slowly making set-based languages ideal for this purpose, although they are rarely thought of in that way; it boils down to the fact that most of the problems covered in these interrelated fields are much easier to express in terms of sets. Furthermore, the sheer size of “Big Data” (which has steadily gotten “bigger” ever since the first records were kept; like “globalization”, it has only accelerated in recent years) requires taking computing to a new level of abstraction in order to simplify things, which is a purpose that set-based languages can fulfill. This series will be a learning experience for me as well, in which I hope to at least help others avoid my mistakes, by teaching through misadventure; writing a series like this aids me in absorbing the material a lot faster, while also getting valuable practice in translating the difficult underlying math formulas into code. Since I don’t know what I’ll discover along the way, I suspect that at some point I may resort to using some of my other favorite languages, like Multidimensional Expressions (MDX) and Visual Basic .Net, possibly in conjunction with Common Language Runtime (CLR) stored procedures. One of the benefits I hope to provide is to take most of the math out of the picture, so that readers don’t get bogged down in it; the jargon and formulas are absolutely necessary for the researchers to communicate with each other, but there is no need for data miners to be writing formal mathematical proofs, just as commuters shouldn’t have to give a dissertation on automotive engineering in order to drive their cars. I’ve sometimes received comments to the effect that there’s too much text in these articles, but that’s because they don’t see the hundreds of pages of math formulas that gave rise to them; rather than stringing together some meaningless screenshots or rehashing MSDN tutorials from AdventureWorks, I aim to show how these techniques might be coded and how they can be used appropriately.
…………The goal in this series is to corner the uncertainty in our datasets by shining lights on it from every possible direction, so that organizations can make better decisions that result in more effective actions. To that end, everything from measures used in Bayesian inference to Solomonoff Algorithmic Probability to the calculation of periodicities to the Lyapunov Exponent will be fair game. These metrics vary quite widely in terms of sophistication, so at points we’ll cross the imprecise boundaries separating them from full-blown data mining algorithms; the dividing lines separating statistics, data mining, machine learning, “soft computing,” predictive analytics and the like seem to boil down to degrees of intricacy, rather than differences in kind, especially since their algorithms are derived from the same metrics and building blocks. My last mistutorial series was designed from the ground up to build on existing concepts, one article at a time. This one will be inherently disorganized, since the scope is so wide and I don’t know what I will find along the way. I will have to skip around quite a bit across topics that may only be distantly related, or across levels of sophistication exhibited by the measures of information. It will also be an open-ended series, whereas the Rickety series was merely necessary to cover a specific set of easily denumerable features. We could delve into dozens of obscure metrics if the need arises, or even concoct our own special-purpose metrics, if a use case calls for it.
…………The series may explore a wide-ranging topics along paths are still somewhat unknown, but I can at least kick it off by introducing some of the primordial foundations of information theory. Perhaps the simplest is a function developed in 1928 by electronics pioneer Ralph Hartley[2], who applied it to signal transmission[3] a few decades before Claude Shannon did the same with his own renowned entropy measure. Although Hartley considered it a measure of information, his function did not lead to the use of the term of “information theory,”[4] which was coined when Shannon’s famous equation gave birth to the field. This introduction will be made even easier by the fact that I already discussed a more advanced version of Hartley’s measure in Implementing Fuzzy Sets in SQL Server, Part 9: Measuring Nonspecificity with the Hartley Function. The version for ordinary “crisp” sets of the kind DBAs and data miners are accustomed to working is actually quite a bit easier to code and interpret: all we have to do is count the records in a set and take the logarithm. The code in Figure 1 is actually longer than it has to be, given that I used three different logarithm bases to measure the same quantity, for the sake of completeness. When base 2 is used, the units are known as bits or “shannons.” When the default value of Euler’s Number is used, they’re known as “nats,” but with base 10 we’re measuring in hartleys or “bans” (a term coined by famed cryptographer Alan Turing). It would be trivial to turn this into a stored procedure with an option to select the desired units. It only took about a second to calculate the results in Figure 2 on the first float column of the Higgs Boson dataset I downloaded from University of California at Irvine’s Machine Learning Repository a few tutorial series ago, which I converted to a 5-gigabyte SQL Server table. This was calculated effortlessly across all 11 million rows only because we had to performs some simple counts, without traversing the whole table.

Figure 1: Code for the Ordinary “Crisp” Version of the Hartley Function
DECLARE @HartleyEntropy float, @DistinctCount bigint, @Count bigint

SELECT @DistinctCount = Count(DISTINCT Column1), @Count = Count(*)
FROM Physics.HiggsBosonTable

SELECT Log(@DistinctCount, 2) AS BitsOrShannons, Log(@DistinctCount, 2.7182818284590452353602874713526624977) AS Nats, Log(@DistinctCount, 10) AS Hartleys,
Log(@Count, 2) AS MultisetBitsOrShannons, Log(@Count, 2.7182818284590452353602874713526624977) AS MultisetNats, Log(@Count, 10) AS MultisetHartleys

 Figure 2: Results from the Higgs Boson Dataset

…………The main problem I ran into was a fundamental one: the formula calls for plugging in the cardinality of the set, but in ordinary set theory parlance, duplicate values are only counted in multisets. The set notation includes the symbols |A|, but the bars specify a cardinality measure rather than the use of the ABS function. The question is, which cardinality? I’ve included both versions in Figure 1, which differ solely by the fact that one uses a DISTINCT operator and the other takes a standard COUNT. This discrepancy could have its uses though. I have yet to see this issue raised in the information literature, where Hartley’s metric is often skipped over as a topic of mainly historical interest in comparison to Shannon’s, but it may be possible to derive a third metric based on the difference between the two. A simple subtraction might help us quantify the contribution of the repeated values to the uncertainty, which could have its uses in the kinds of uncertainty management programs I spoke of in the fuzzy set series. In essence, this difference could act as a crude measure of redundancy. If it reflects the information gained by using the DISTINCT operator, we could use this to assess its costs. We know for a fact that the DISTINCT version can’t exceed the multiset version, which acts as a cap on its range. At the other end of the scale, the measure reaches the limit of 0 when all records have the same value, therefore giving us perfect certainty. Another important issue is whether or not the DISTINCT clause adds information, by reducing the amount of uncertainty about how many different values a variable can take on.
…………Of course, the clause might not be necessary at all if we knew in advance precisely which values were permissible for a column, such as the range of a decimal type or a varchar column limited to a handful of known category values. On the other hand, this raises a subtle distinction between values that are permissible for a column, which can be determined by the data type, and the values actually found, which can only be counted through a DISTINCT operator. The issue becomes more sophisticated if we are able to determine the counts of each individual value; these measures of “multiplicity,” as they are known in multiset theory, further reduce the uncertainty associated with the dataset. It is easy enough to implement these internal counts using windowing functions and GROUP BY statements, but the issue of how to factor them in quickly complicates the discussion of the otherwise simple Hartley function. Thankfully, the order of the records is only an issue with tuples, not the kinds of sets or multisets we plug into it – except when we get to the end of the dataset and can determine the last records from the remaining counts, which is something I’ve not seen addressed in the literature. This brings applications without replacement (i.e., the kind of probabilities associated with decks of cards where no reshuffling takes place) into play, at least briefly.

“Newsworthiness”: The Narrow Definition of “Information”

                Aside from all of these hidden subtleties, the information provided by both the DISTINCT and multiset versions can be summed up thus: how much am I learning each time I inspect a row and verify the actual value found there? This is equivalent to asking how much we learn from each slip of paper in a drawing, if we know the count of the jar in advance. In more advanced terms, we can think of this as increasing information by reducing the number of possible state descriptions the next record can take on; in such a context, whether or not repeated values are allowed makes a big difference. The same goes for their individual counts, if the answer is yes. Keep in mind that this type of “information” is practically the polar opposite of existing knowledge; basically, the higher the Hartley measure is, the less we don’t already know, so the more the next record can tell us. It is a highly specific type of information, which former journalists like myself might equate with “newsworthiness.” This is the dividing line between existing and new knowledge is precisely where measures of entropy (like the Hartley function) intersect with topics like Bayesian probability, which I will also address at some point in this series; as we shall see, many of these information measures are interrelated in complex ways. This highly specific definition of information is an important distinction that pertains to all of the other measures of entropy we’ll discuss in this series; interpretation is a critical stage in every field associated with data mining, particularly information theory, which should never be shortchanged in any workflow.
…………The Hartley function can be leveraged in a data mining workflow in various ways, such as calculating the reduction in uncertainty between two separate measures; this could be useful, for example, in specifying a numerical cut-off point in bits or bans, after which it’s not worthwhile to go on inspecting rows, for whatever end purpose that might be, such as sampling. The sample code in Figure 3 takes the Hartley measure after the 10 millionth row, or about 9 percent away from the end of the dataset, which is why remaining uncertainty in Figure 4 is so low; as we approach the last uninspected record, the remaining uncertainty would approach zero. This formula would be equivalent to counting the remaining records and plugging the results into the Hartley function. Another interesting question is whether or not we could we pair this with cardinality estimation, to get a ballpark figure of how much we can learn from each record we inspect, before we’ve even traversed a dataset. I don’t know much about cardinality estimation yet, but the possibility is tantalizing

Figure 3: Calculating the Remaining Uncertainty with the Hartley Function
DECLARE @DistinctCountOfKnownValues bigint, @CountOfKnownValues bigint

SELECT @DistinctCountOfKnownValues = Count(DISTINCT Column1), @CountOfKnownValues = Count(*)
FROM Physics.HiggsBosonTable
WHERE ID BETWEEN 1 AND 10000000

SELECT Bits, KnownBits, Bits KnownBits AS RemainingUncertaintyInBits, MultisetBits, KnownMultisetBits, MultisetBits KnownMultisetBits AS RemainingMultisetUncertaintyInBits
FROM (SELECT Log(@DistinctCount, 2) AS Bits, Log(@DistinctCountOfKnownValues, 2) AS KnownBits, Log(@Count, 2) AS MultisetBits, Log(@CountOfKnownValues, 2) AS KnownMultisetBits) AS T1

 Figure 4: Uncertainty Reduction Results

…………The kinship between the Hartley function and the rest of information theory is evident in some of its alternative names, like the Hartley Entropy or Max Entropy. It is equivalent to the Rényi Entropy[5] with its alpha parameter (α) set to 0, as I’ll explain a few articles from now. It’s also identical to the Shannon Entropy in cases of the uniform distribution, i.e. when all values are equally likely.[6] I’ll be spending a couple of articles on various aspects of entropy early on this series, since it’s such an important concept in information theory. The math and logic can get thick pretty quickly in this field, so it is best to start off with the measure that started it all, Shannon’s infamous “H.” I recognized its signature combination of a negative summation operator and log operation when translating some of the equations used in Implementing Fuzzy Sets in SQL Server, Part 10.2: Measuring Uncertainty in Evidence Theory into T-SQL. As with the Hartley function, this previous exposure to more advanced fuzzy derivatives ought to make the material a little easier to swallow. The difficulty with Shannon’s Entropy, however, is not in its calculation, but in its proper interpretation. Yet as long as we have the rigor to avoid assigning unwarranted shades of meaning to the term “information,” it can be a powerful addition to our data mining toolbelts.

[1] p. xii, LaRose, Daniel T., 2005, Discovering Knowledge in Data: An Introduction to Data Mining. Wiley-Interscience: Hoboken, N.J.

[2] See the Wikipedia articles “Hartley Function” and “Ralph Hartley” at http://en.wikipedia.org/wiki/Hartley_function and http://en.wikipedia.org/wiki/Ralph_Hartley respectively.

[3] p. 5 Ritchie, L. David., 1991, Information. Sage Publications: Newbury Park, Calif.

[4] p. 288, Bar-Hillel, Yehoshua, 1964, Language and Information: Selected Essays On Their Theory and Application. Addison-Wesley Pub. Co.:        Reading, Mass.

[5] See the Wikipedia article “Rényi Entropy” at http://en.wikipedia.org/wiki/R%C3%A9nyi_entropy

[6] See the Wikipedia article “Hartley Function” at http://en.wikipedia.org/wiki/Hartley_function