One of the most popular entropy encoding schemes for text data compression is Huffman
coding. A Huffman code is an optimal prefix code: given a set of characters for which the
frequency of occurrence of each character is known in advance, Huffman coding achieves
the maximum possible compression among all prefix codes. Moreover, no character's code
is a proper prefix of another character's code, which makes the code unambiguous and
uniquely decodable. Unlike uniform coding, which encodes each of the n characters in
the character set with a fixed code length of ⌈log n⌉ bits, Huffman coding is a
variable-length coding. For more detailed information on Huffman coding, refer
to Cormen et al. (2006).
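The greedy construction behind Huffman coding can be sketched as follows; this is a minimal illustration (the character frequencies are chosen arbitrarily for the example), not the implementation analyzed in this paper:

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build a Huffman code for a {character: frequency} map (illustrative sketch)."""
    tie = count()  # unique tie-breaker so equal-weight subtrees compare cleanly
    heap = [(f, next(tie), {c: ""}) for c, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Repeatedly merge the two least frequent subtrees.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {c: "0" + code for c, code in left.items()}
        merged.update({c: "1" + code for c, code in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

codes = huffman_codes({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5})
# More frequent characters receive shorter codes, and no code is a
# proper prefix of another (the prefix-free property noted above).
```

Here the more frequent character "a" gets a 1-bit code while the rarest characters get 4-bit codes, whereas uniform coding would spend ⌈log 6⌉ = 3 bits on every character.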
In this paper, we compare the compression ratio achieved by Huffman coding with
that of uniform coding. To this end, we first study the skewness property of the Huffman
coding tree. We show that this tree is completely skewed when the sorted
frequency distribution of the characters satisfies a certain prefix property. We also establish
that, among all frequency distributions f1, f2, …, fn of a set of n characters satisfying
the prefix property, the average code length is maximized when the frequency
distribution is a Fibonacci sequence. We then estimate the average code length of
Huffman coding for this frequency distribution.
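The skewed-tree behavior for Fibonacci frequencies can be checked empirically. A minimal sketch, tracking only the code lengths produced by the greedy merging (the frequency list is chosen for illustration):

```python
import heapq
from itertools import count

def huffman_code_lengths(freqs):
    """Return the sorted Huffman code lengths for a list of frequencies (sketch)."""
    tie = count()  # unique tie-breaker for equal-weight subtrees
    heap = [(f, next(tie), [0]) for f in freqs]  # (weight, tie, leaf depths)
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every leaf in them one level deeper.
        heapq.heappush(heap, (f1 + f2, next(tie), [d + 1 for d in d1 + d2]))
    return sorted(heap[0][2])

# Fibonacci frequencies produce a completely skewed tree: one leaf at
# every depth from 1 to n-1, with two leaves at the deepest level.
print(huffman_code_lengths([1, 1, 2, 3, 5, 8, 13, 21]))  # → [1, 2, 3, 4, 5, 6, 7, 7]
```

Each greedy merge combines the two smallest Fibonacci-weighted subtrees, and the resulting weight never overtakes more than the next frequency in the sequence, so the tree grows along a single spine.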
The paper is organized as follows. In Section 2, we study the
frequency distributions of characters for which the Huffman coding tree is
completely skewed. To this end, we introduce the notion of the prefix property of a sorted
sequence. We also establish that for the Fibonacci sequence, which satisfies this prefix
property, the average code length is maximum. In Section 3, we compute the
average code length of the Huffman code when the frequency distribution follows the
Fibonacci sequence. Finally, in Section 4, we conclude by summarizing our contributions.