Programmer's Guide To Theory - Splitting the Bit
Written by Mike James   
Monday, 22 July 2024

The two symbols that are least likely now are D and E, with a combined probability of 0.55. This also completes the coding because there are now only two groups of symbols and we might as well combine them to produce the finished tree.

[Figure: The final step of building the coding tree]

This coding tree gives the most efficient representation of the five letters possible. To find the code for a symbol you simply move down the tree reading off the zeros and ones as you go until you arrive at the symbol.
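If you want to see the whole procedure as a program, here is a minimal sketch in Python that repeatedly combines the two least likely groups and then reads the codes off the finished tree. The probabilities are placeholders picked for illustration and aren't necessarily the ones used in the example above:

import heapq
import itertools

# Illustrative probabilities for the five letters - placeholders only,
# not necessarily the values used in the article's example.
probs = {"A": 0.25, "B": 0.15, "C": 0.25, "D": 0.20, "E": 0.15}

# Repeatedly combine the two least likely groups, as described above.
# A tree is either a single letter or a (left, right) pair of trees.
tie = itertools.count()
heap = [(p, next(tie), letter) for letter, p in probs.items()]
heapq.heapify(heap)
while len(heap) > 1:
    p1, _, t1 = heapq.heappop(heap)    # least likely group
    p2, _, t2 = heapq.heappop(heap)    # next least likely group
    heapq.heappush(heap, (p1 + p2, next(tie), (t1, t2)))
tree = heap[0][2]

# Read the code for each letter off the tree - 0 for one branch,
# 1 for the other.
def codes(node, prefix=""):
    if isinstance(node, str):
        return {node: prefix or "0"}
    left, right = node
    table = codes(left, prefix + "0")
    table.update(codes(right, prefix + "1"))
    return table

code = codes(tree)
print(code)   # {'D': '00', 'A': '01', 'C': '10', 'B': '110', 'E': '111'}

With these made-up probabilities D happens to come out as 00 and B as a three-bit code word, much as in the text, but the exact codes you get depend entirely on the probabilities you start with.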

To decode a set of bits that has just arrived you start at the top of the tree and take each branch in turn according to whether the bit is a 0 or a 1 until you run out of bits and arrive at the symbol. Notice that the length of the code used for each symbol varies depending on how deep in the tree the symbol is.
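The tree walking can be written directly as code. This sketch carries on from the previous listing and reuses its tree and code table:

def encode(text, code):
    # Concatenate the code word for each letter.
    return "".join(code[letter] for letter in text)

def decode(bits, tree):
    # Start at the top of the tree, take a branch for each bit and
    # emit a letter every time a leaf is reached.
    out = []
    node = tree
    for bit in bits:
        node = node[0] if bit == "0" else node[1]
        if isinstance(node, str):        # reached a letter
            out.append(node)
            node = tree                  # back to the top of the tree
    return "".join(out)

bits = encode("DADCAB", code)
print(bits)                   # 0001001001110
print(decode(bits, tree))     # DADCAB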

The theoretical average information in a symbol in this example is 2.3 bits - this is what you get if you work out the average information formula given earlier. If you code B you will find that it corresponds to 111, i.e. three bits, reached by moving down the far right-hand branch of the tree. If you code D you will find it corresponds to 00, the far left-hand branch of the tree. In fact each remaining letter is coded as either a two- or three-bit code and guess what? If the symbols occur with their specified probabilities, the average length of code used is 2.3 bits.

So we have indeed split the bit! The code we are using averages 2.3 bits to send a symbol.
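You can check the arithmetic with a couple of lines that compare the expected code length with the average information for the same probabilities. Carrying on from the sketches above, with the illustrative probabilities both come out at roughly 2.3 bits, though the exact figures depend on the probabilities you choose:

import math

avg_len = sum(probs[s] * len(code[s]) for s in probs)        # expected code length
entropy = -sum(p * math.log2(p) for p in probs.values())     # average information

print(f"average code length = {avg_len:.2f} bits")   # about 2.30
print(f"average information = {entropy:.2f} bits")   # about 2.29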

Notice that variable-length codes bring a problem of their own: when the code words are simply run together it is harder to store and decode them because you need some way of knowing how many bits belong to each group. The most common way of overcoming this is to use code words that each have a unique sequence of initial bits, so that no code word is the start of another. This wastes some possible code words, but it still generally produces a good degree of data compression.
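This is usually called the prefix property, and it is easy to check for any code table - again using the table built in the earlier sketch:

# Check that no code word is the start (prefix) of another - this is
# what lets the decoder find the boundaries without any stored lengths.
words = list(code.values())
prefix_free = all(not b.startswith(a)
                  for a in words for b in words if a != b)
print(prefix_free)    # True for a Huffman code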


Summary

  • Information theory predicts the number of bits required to represent a message, which sometimes is a fractional quantity.

  • In particular, if we consider the average information contained in a message, a fractional number of bits is generally needed to represent it.

  • No coding method can use fewer bits than the average information, but inefficient coding methods can use many more bits.

  • The average information in a message is at a maximum when all of the symbols are equally likely.

  • We can get close to the number of bits given by the average information using the Shannon-Fano coding algorithm, which attempts to split the symbols into groups that are as close to equally likely as possible.

  • Shannon-Fano coding is good, but in most cases it isn't optimal. To get an optimal code we need the Huffman coding algorithm, which assigns variable-length codes so that the shortest codes represent the most likely symbols.

  • Practical methods for representing messages in fewer bits generally don't use the theoretically optimal codes, because the speed of coding is important. The most common method of compressing data is to find repeating patterns and represent them as entries in a dictionary - see the sketch below.
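To give a flavour of the dictionary approach mentioned in the final point, here is a toy compressor in the spirit of LZ78. It isn't any particular production algorithm, but it shows the idea of replacing repeating patterns with (dictionary index, next character) pairs:

def dictionary_compress(text):
    # Grow a dictionary of phrases seen so far and output a
    # (dictionary index, next character) pair for each new phrase.
    dictionary = {"": 0}
    phrase = ""
    out = []
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch                      # keep extending the match
        else:
            out.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:
        out.append((dictionary[phrase], ""))
    return out

print(dictionary_compress("ABABABABA"))
# [(0, 'A'), (0, 'B'), (1, 'B'), (3, 'A'), (2, 'A')]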

Related Articles

Information Theory

How Error Correcting Codes Work

Claude Shannon

Introduction to Cryptography with Open-Source Software 
