Programmer's Guide To Theory - Splitting the Bit
Written by Mike James
Monday, 22 July 2024

The two symbols that are least likely now are D and E with a combined probability of 0.55. This also completes the coding because there are now only two groups of symbols and we might as well combine these to produce the finished tree.

The final step

This coding tree gives the most efficient representation of the five letters possible. To find the code for a symbol you simply move down the tree reading off the zeros and ones as you go until you arrive at the symbol.

To decode a set of bits that has just arrived you start at the top of the tree and take each branch in turn according to whether the bit is a 0 or a 1 until you run out of bits and arrive at the symbol. Notice that the length of the code used for each symbol varies depending on how deep in the tree the symbol is.
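The whole procedure - repeatedly merging the two least likely nodes and then reading each code off the 0/1 path to its leaf - can be sketched in Python. The probabilities below are illustrative stand-ins, not the article's exact values, so the codes produced may differ from those in the figure:

```python
import heapq
from itertools import count

def huffman_codes(probs):
    """Build a Huffman tree by repeatedly merging the two least likely
    nodes, then read each symbol's code off the 0/1 path to its leaf."""
    tick = count()  # tie-breaker so equal probabilities never compare nodes
    heap = [(p, next(tick), sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)    # least likely
        p2, _, right = heapq.heappop(heap)   # next least likely
        heapq.heappush(heap, (p1 + p2, next(tick), (left, right)))
    _, _, tree = heap[0]
    codes = {}
    def walk(node, path):
        if isinstance(node, tuple):          # internal node: branch on 0/1
            walk(node[0], path + "0")
            walk(node[1], path + "1")
        else:                                # leaf: record the symbol's code
            codes[node] = path
    walk(tree, "")
    return codes

def encode(text, codes):
    return "".join(codes[ch] for ch in text)

def decode(bits, codes):
    inverse = {v: k for k, v in codes.items()}
    decoded, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # prefix-free, so a match is unambiguous
            decoded.append(inverse[current])
            current = ""
    return "".join(decoded)

# Illustrative probabilities - an assumption, not the article's exact values
probs = {"A": 0.25, "B": 0.15, "C": 0.2, "D": 0.25, "E": 0.15}
codes = huffman_codes(probs)
assert decode(encode("BAD", codes), codes) == "BAD"
```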

The theoretical average information per symbol in this example is 2.3 bits - this is what you get if you work out the average information formula given earlier. If you code B you will find that it corresponds to 111, i.e. three bits, reached by moving down the far right-hand branch of the tree. If you code D you will find it corresponds to 00, i.e. the far left-hand branch of the tree. In fact, each letter is coded as either a two- or three-bit code and, guess what? If the symbols occur with their specified probabilities, the average length of code used is 2.3 bits.
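As a rough check on the arithmetic, here is the average information formula and the average code length worked out for a set of illustrative probabilities chosen to reproduce the 2.3-bit figure - an assumption for this sketch, not necessarily the article's exact values:

```python
import math

# Illustrative probabilities and code lengths, chosen as an assumption to
# reproduce the 2.3-bit figure (B gets a three-bit code, D a two-bit code,
# matching the examples above).
probs = {"A": 0.25, "B": 0.15, "C": 0.2, "D": 0.25, "E": 0.15}
code_lengths = {"A": 2, "B": 3, "C": 2, "D": 2, "E": 3}

# Shannon's average information (entropy): H = -sum(p * log2(p))
entropy = -sum(p * math.log2(p) for p in probs.values())

# Expected bits per symbol under the variable-length code
average_length = sum(probs[s] * code_lengths[s] for s in probs)

print(f"average information is about {entropy:.2f} bits")
print(f"average code length is {average_length:.1f} bits")
```

The two numbers agree to one decimal place, which is what makes the code close to optimal: no code can average fewer bits than the entropy, and this one gets within a few hundredths of a bit.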

So we have indeed split the bit! The code we are using averages 2.3 bits to send a symbol.

Notice that there are some problems with variable-length codes in that they are more difficult to store because you need some way of telling where one group of bits ends and the next begins. The most common way of overcoming this is to use code words with a unique sequence of initial bits - that is, no code word is the start of any other. This wastes some code words, but it still generally produces a good degree of data compression.
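For example, take a hypothetical prefix-free code table consistent with the codes quoted above (D = 00 and B = 111; the remaining assignments are assumptions for this sketch). A greedy decoder can then split an unbroken bit stream without any separators or length counts:

```python
# Hypothetical prefix-free code table: D = 00 and B = 111 match the codes
# quoted above; the remaining assignments are assumptions for this sketch.
codes = {"D": "00", "A": "01", "C": "10", "E": "110", "B": "111"}

def decode_stream(bits, codes):
    """Greedy left-to-right decode. Because no code word is the start of
    another, the first match is always the correct one - no separators
    or length counts are needed in the stream."""
    inverse = {v: k for k, v in codes.items()}
    decoded, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            decoded.append(inverse[current])
            current = ""
    if current:
        raise ValueError("bit stream ended in the middle of a code word")
    return "".join(decoded)

# "BAD" is 111 01 00, sent as one unbroken stream:
print(decode_stream("1110100", codes))  # prints BAD
```

Note that any code read off a tree, as above, is automatically prefix-free: each symbol sits at a leaf, so no code word can continue past it into another.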

## Summary

• Information theory predicts the number of bits required to represent a message, which sometimes is a fractional quantity.

• In particular, the average information contained in a message is generally a fractional number of bits.

• No coding method can use fewer bits than the average information, but inefficient coding methods can use many more bits.

• The average information in a message is at a maximum when all of the symbols are equally likely.

• We can get close to the number of bits in the average information using the Shannon-Fano coding algorithm, which attempts to split the symbols into groups that are close to being equally likely.

• Shannon-Fano coding is good, but in most cases it isn't optimal. To get an optimal code we need to use the Huffman coding algorithm, which assigns variable-length codes so that the shortest codes represent the most likely symbols.

• Practical methods for representing messages in fewer bits generally don't use the theoretically optimal codes. The reason is that speed of coding is important. The most common method of compressing data is to find repeating patterns and represent them as entries in a dictionary.
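The DEFLATE algorithm used by zip, and available in Python's standard zlib module, combines exactly these two ideas: repeated patterns become back-references into a sliding dictionary, and the resulting tokens are then Huffman coded. A quick sketch of the effect on repetitive data:

```python
import zlib

# Highly repetitive input - the kind of data dictionary methods thrive on
text = b"the quick brown fox jumps over the lazy dog " * 100

compressed = zlib.compress(text)
print(len(text), "->", len(compressed), "bytes")

# The round trip is lossless
assert zlib.decompress(compressed) == text
```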

#### Related Articles

Information Theory

How Error Correcting Codes Work

Claude Shannon

Introduction to Cryptography with Open-Source Software

## A Programmer's Guide To Theory

#### Contents

1. What Is Computer Science?
Part I What Is Computable?
2. What Is Computation?
3. The Halting Problem
4. Finite State Machines
Extract 1: Finite State Machines
5. Practical Grammar
6. Numbers, Infinity and Computation
Extract 1: Numbers
Extract 2: Aleph Zero The First Transfinite
Extract 3: In Search Of Aleph-One
Extract 4: Transcendental Numbers
7. Kolmogorov Complexity and Randomness
Extract 1: Kolmogorov Complexity
8. The Algorithm of Choice
9. Gödel’s Incompleteness Theorem
10. Lambda Calculus
Part II Bits, Codes and Logic
11. Information Theory
12. Splitting the Bit ***NEW!
13. Error Correction
14. Boolean Logic
Part III Computational Complexity
15. How Hard Can It Be?
Extract 1: Where Do The Big Os Come From
16. Recursion
Extract 1: What Is Recursion
Extract 2: Why Recursion
17. NP Versus P Algorithms
Extract 1: NP & Co-NP
Extract 2: NP Complete
