Programmer's Guide To Theory - Splitting the Bit
Written by Mike James   
Monday, 22 July 2024

The two least likely symbols are now D and E, with a combined probability of 0.55. This also completes the coding because there are now only two groups of symbols, and we might as well combine them to produce the finished tree.

[Figure: The final step - the completed coding tree]

This coding tree gives the most efficient possible representation of the five letters. To find the code for a symbol you simply move down the tree, reading off the zeros and ones as you go, until you arrive at the symbol.

To decode a set of bits that has just arrived, you start at the top of the tree and take each branch in turn according to whether the next bit is a 0 or a 1, until you arrive at a symbol; then you return to the top of the tree and start again on the next symbol. Notice that the length of the code used for each symbol varies depending on how deep in the tree the symbol is.

The theoretical average information in a symbol in this example is 2.3 bits - this is what you get if you work out the average information formula given earlier. If you code B you will find that it corresponds to 111, i.e. three bits, which corresponds to moving down the far right-hand branch of the tree. If you code D you will find it corresponds to 00, i.e. the far left-hand branch of the tree. In fact, each of the five letters is coded as either a two- or a three-bit code and, guess what? If the symbols occur with their specified probabilities, the average length of code used is 2.3 bits.

So we have indeed split the bit! The code we are using averages 2.3 bits per symbol - you can check this with the sketch below.
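To see the whole construction in code, here is a minimal Python sketch of building a Huffman tree and reading off the codes. The article's exact probabilities appear only in the figure, so the values below are an assumption, chosen so that the results quoted above come out - D coded as 00, B as 111 and an average of 2.3 bits per symbol. Which symbol gets which code word also depends on how ties between equal probabilities are broken.

import heapq
import math

# Assumed probabilities - the article's exact values are in the figure.
# Ties between equal probabilities are broken by insertion order, via
# the counter field in each heap entry.
probs = {"A": 0.25, "C": 0.25, "D": 0.2, "E": 0.15, "B": 0.15}

# Build the Huffman tree by repeatedly merging the two least likely
# nodes. A heap entry is (probability, tie-breaker, tree), where a tree
# is either a symbol or a (left, right) pair.
heap = [(p, i, sym) for i, (sym, p) in enumerate(probs.items())]
heapq.heapify(heap)
counter = len(heap)
while len(heap) > 1:
    p1, _, t1 = heapq.heappop(heap)
    p2, _, t2 = heapq.heappop(heap)
    heapq.heappush(heap, (p1 + p2, counter, (t1, t2)))
    counter += 1
_, _, tree = heap[0]

# Read the codes off the tree: 0 for a left branch, 1 for a right one.
def codes(tree, prefix=""):
    if isinstance(tree, str):                  # a leaf, i.e. a symbol
        return {tree: prefix or "0"}
    left, right = tree
    return {**codes(left, prefix + "0"), **codes(right, prefix + "1")}

table = codes(tree)
entropy = -sum(p * math.log2(p) for p in probs.values())
avg_len = sum(probs[s] * len(c) for s, c in table.items())
print(table)                                   # D -> 00, ..., B -> 111
print(f"entropy {entropy:.2f} bits, average length {avg_len:.2f} bits")

Running this prints the code table {'D': '00', 'A': '01', 'C': '10', 'E': '110', 'B': '111'} and an average code length of 2.30 bits against an entropy of about 2.29 bits.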

Notice that there is a problem with variable-length codes: the decoder has to know where one code word ends and the next begins, without any explicit indication of how many bits are in each group. The most common way of overcoming this is to use a prefix code, in which no code word is an initial sequence, i.e. a prefix, of any other code word. This wastes some possible code words, but it still generally produces a good degree of data compression.
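Codes read off a Huffman tree have the prefix property automatically, because symbols only ever sit at the leaves. A short decoding sketch, using the code table derived above, shows why no length markers are needed - the decoder simply accumulates bits until they match a complete code word:

# The prefix code table from the sketch above (an assumption - the
# exact assignment depends on the article's probabilities).
code = {"A": "01", "B": "111", "C": "10", "D": "00", "E": "110"}
decode_table = {bits: sym for sym, bits in code.items()}

def decode(bitstring):
    symbols, current = [], ""
    for bit in bitstring:
        current += bit
        # Because no code word is a prefix of another, a match here
        # can never be the start of some longer code word.
        if current in decode_table:
            symbols.append(decode_table[current])
            current = ""
    if current:
        raise ValueError("bit string ended in mid code word")
    return "".join(symbols)

bits = "".join(code[s] for s in "DECADE")
print(bits)            # 00110100100110 - no separators needed
print(decode(bits))    # DECADE, recovered unambiguously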


Summary

  • Information theory predicts the number of bits required to represent a message, and this can be a fractional quantity.

  • In particular, the average information contained in a message is generally a fractional number of bits.

  • No coding method can use fewer bits, on average, than the average information, but inefficient coding methods can use many more.

  • The average information in a message is at a maximum when all of the symbols are equally likely.

  • We can get close to the average information using the Shannon-Fano coding algorithm, which attempts to split the symbols into groups that are close to being equally likely.

  • Shannon-Fano coding is good, but in most cases it isn't optimal. To get an optimal code we need to use the Huffman coding algorithm, which assigns variable-length codes so that the shortest codes represent the most likely symbols.

  • Practical methods for representing messages in fewer bits generally don't use these theoretically optimal codes, because speed of coding matters. The most common method of compressing data is to find repeating patterns and represent them as entries in a dictionary - see the sketch below.
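As a taste of the dictionary idea in the final bullet, here is a minimal LZ78-style sketch in Python. It illustrates the principle only - ZIP itself uses the related LZ77 scheme combined with Huffman coding of the output:

# A minimal LZ78-style sketch of dictionary compression: each output
# token is (index of the longest previously seen phrase, next char).
def lz78_compress(text):
    dictionary = {"": 0}                   # phrase -> index
    phrase, tokens = "", []
    for ch in text:
        if phrase + ch in dictionary:      # keep extending the match
            phrase += ch
        else:                              # emit a token, grow the dictionary
            tokens.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                             # flush any trailing match
        tokens.append((dictionary[phrase[:-1]], phrase[-1]))
    return tokens

print(lz78_compress("abababababab"))
# [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a'), (2, 'a'), (5, 'b')]

However long a repeated pattern grows, each repeat costs only one token, which is where the compression comes from.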


