A java implementation for Shannon Entropy

For more information on Shannon Entropy, please check his article on wolfram.

In physics, the word entropy has important physical implications as the amount of “disorder” of a system. In mathematics, a more abstract definition is used. The (Shannon) entropy of a variable X is defined as

 H(X)=-sum_(x)P(x)log_2[P(x)]

bits, where P(x) is the probability that X is in the state x, and Plog_2P is defined as 0 if P=0. The joint entropy of variables X_1, …, X_n is then defined by

  H(X_1,...,X_n)=-sum_(x_1)...sum_(x_n)P(x_1,...,x_n)log_2[P(x_1,...,x_n)].

It is actually pretty simple to calculate this in java. First you need to count the occurrences of each value. Then, you can calculate the entropy of it.


public static Double calculateShannonEntropy(List<String> values) {
  Map<String, Integer> map = new HashMap<String, Integer>();
  // count the occurrences of each value
  for (String sequence : values) {
    if (!map.containsKey(sequence)) {
      map.put(sequence, 0);
    }
    map.put(sequence, map.get(sequence) + 1);
  }

  // calculate the entropy
  Double result = 0.0;
  for (String sequence : map.keySet()) {
    Double frequency = (double) map.get(sequence) / values.size();
    result -= frequency * (Math.log(frequency) / Math.log(2));
  }

  return result;
}

About these ads

10 thoughts on “A java implementation for Shannon Entropy

  1. This was a nice post.I need similar code for implementation of Kullback-leibler divergence in java.

  2. Do we need to – every time?
    result -= frequency * (Math.log(frequency) / Math.log(2));

    shall we do as follows?
    public static Double calculateShannonEntropy(List values)
    {
    for (String sequence : map.keySet()) {

    Double frequency = (double) map.get(sequence) / values.size();
    result += frequency * (Math.log(frequency) / Math.log(2));
    }

    return -result;
    }

  3. you didn’t do summation.

    it should be:

    result -= result + frequency * (Math.log(frequency) / Math.log(2));

  4. Nice, but you can generify the algorithm to be able to throw just about everything into it:

    public static double entropy(List values) {

    final Map valueOccurances = new HashMap();

    for (Comparable value : values) {
    Long valueOccurance = valueOccurances.get(value);
    valueOccurances.put(value, valueOccurance == null ? 1L : ++valueOccurance);
    }

    double combinedEntropy = 0.0d;

    for (Comparable value : valueOccurances.keySet()) {
    double entropy = valueOccurances.get(value) / (double) values.size();
    combinedEntropy += entropy * (Math.log(entropy) / Math.log(2));
    }

    return -combinedEntropy;
    }

    Optimization points: A .containsKey(…) is essentially the same as a .get(…) != null, except for the fact that .containsKey(..) has to support null as key and is thus more complex (expensive). By definition, null makes no sense in an entropy analysis algorithm so we might as well just make one call alltogether. Rather than adding 0 (for initial item), retrieving and then adding; we can just add 1. Avoid boxing by using primitive types when possible (null not a possibility). There’s no reason to factor out stuff like i.e. Math.log(2) as one of the comments says, since this is basic compiler low-hanging fruit. Consider adding an initial capacity argument to be passed to HashMap’s constructor, to avoid rehashing when the alphabet size is known (I.e. all letters from A-Z).

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Connecting to %s