For more information on Shannon Entropy, please check his article on wolfram.
In physics, the word entropy has important physical implications as the amount of “disorder” of a system. In mathematics, a more abstract definition is used. The (Shannon) entropy of a variable
is defined as
bits, where
is the probability that
is in the state
, and
is defined as 0 if
. The joint entropy of variables
, …,
is then defined by
It is actually pretty simple to calculate this in java. First you need to count the occurrences of each value. Then, you can calculate the entropy of it.
public static Double calculateShannonEntropy(List<String> values) {
Map<String, Integer> map = new HashMap<String, Integer>();
// count the occurrences of each value
for (String sequence : values) {
if (!map.containsKey(sequence)) {
map.put(sequence, 0);
}
map.put(sequence, map.get(sequence) + 1);
}
// calculate the entropy
Double result = 0.0;
for (String sequence : map.keySet()) {
Double frequency = (double) map.get(sequence) / values.size();
result -= frequency * (Math.log(frequency) / Math.log(2));
}
return result;
}
This was a nice post.I need similar code for implementation of Kullback-leibler divergence in java.
Do we need to – every time?
result -= frequency * (Math.log(frequency) / Math.log(2));
shall we do as follows?
public static Double calculateShannonEntropy(List values)
{
for (String sequence : map.keySet()) {
Double frequency = (double) map.get(sequence) / values.size();
result += frequency * (Math.log(frequency) / Math.log(2));
}
return -result;
}
pl ignore my comment
no problem
you didn’t do summation.
it should be:
result -= result + frequency * (Math.log(frequency) / Math.log(2));
Can’t we factorize “/ log(2)” to save instructions in the loop ?
Indeed we can
would u plz. help me the full implemetation
Nice, but you can generify the algorithm to be able to throw just about everything into it:
public static double entropy(List values) {
final Map valueOccurances = new HashMap();
for (Comparable value : values) {
Long valueOccurance = valueOccurances.get(value);
valueOccurances.put(value, valueOccurance == null ? 1L : ++valueOccurance);
}
double combinedEntropy = 0.0d;
for (Comparable value : valueOccurances.keySet()) {
double entropy = valueOccurances.get(value) / (double) values.size();
combinedEntropy += entropy * (Math.log(entropy) / Math.log(2));
}
return -combinedEntropy;
}
Optimization points: A .containsKey(…) is essentially the same as a .get(…) != null, except for the fact that .containsKey(..) has to support null as key and is thus more complex (expensive). By definition, null makes no sense in an entropy analysis algorithm so we might as well just make one call alltogether. Rather than adding 0 (for initial item), retrieving and then adding; we can just add 1. Avoid boxing by using primitive types when possible (null not a possibility). There’s no reason to factor out stuff like i.e. Math.log(2) as one of the comments says, since this is basic compiler low-hanging fruit. Consider adding an initial capacity argument to be passed to HashMap’s constructor, to avoid rehashing when the alphabet size is known (I.e. all letters from A-Z).
Ah, the blog ate angle braces… proper code here: http://pastebin.com/1SC7PYvM