Artificial Intelligence
Of course, with such an interest in relays it was natural enough that he should move, in 1941, from MIT to the Bell Telephone Labs. The telephone companies of the time were very heavy users of relays of all types to connect phone calls, and Bell was the biggest. Perhaps no other single research establishment has had such a profound influence on computing. We can thank Bell Labs, among other things, for the invention of the transistor, C and Unix.
While at Bell, Shannon carried on showing how relays could be used to create computational machines. He wasn't just interested in arithmetic, though, and he should be counted among the earliest pioneers of what we now call artificial intelligence, or AI. In 1950 he wrote a paper called "Programming a Computer for Playing Chess", which essentially invented the whole subject of computer game playing.
Interestingly, he also felt a need to give the idea a wider audience and wrote a more approachable article called "Automatic Chess Player" in a 1950 issue of Scientific American. (Reprinted in The World of Mathematics Vol 4, see side panel.)
He even built a relay-controlled mouse, called Theseus after the legendary king of Athens who escaped the labyrinth of the Minotaur, that could run a maze and learn by storing the maze pattern as relay states. You may remember the micromouse competitions where contestants had to build a microprocessor-controlled mouse that would beat all others at running a maze. Well, Shannon did it first using relays!
Theseus
Information Theory
If you think that the work described so far is enough for any lifetime, then you will be surprised to hear that we haven't even touched on Shannon's real claim to fame.
At the time that he was musing on using the binary system for computations, we knew very little about what information was. It was possible for very knowledgeable engineers to propose schemes that today we would realize were just plain silly.
For example, it wasn't understood at all clearly why radio signals spilled over from the frequency that they were transmitted on to occupy a band of frequencies. It was thought that with improved technology you might be able to reduce the bandwidth needed to almost nothing. Engineers were mystified why, when they transmitted a radio signal on a single frequency, say 100kHz, the signal actually spread out to occupy a range of frequencies, say 80kHz to 120kHz. This is where the term "bandwidth" comes from, and it limited how closely you could pack radio stations. If the bandwidth could be reduced you could get more radio stations on the air.
Today we know that you need a certain amount of bandwidth to transmit a given amount of information, and this is a law of nature. For example, you can't transmit data down a telephone line faster than a given speed because the phone line has a very limited bandwidth, just enough to transmit the human voice. To do better you need a cable with a wider bandwidth, such as a coaxial cable or a fiber optic cable.
In the same way, we know that you can take a 1MByte file and compress it down to, say, 0.5MByte, but after you have applied the best compression algorithm to it you can't squeeze it into any less space.
The reason is that once the file's true information content is reached you cannot compress it any more. All of these ideas and many more are due to Shannon and his theory of information.
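You can see this limit in action with any standard compression library. As a quick sketch (this example is ours, not part of the original article), compare how well Python's zlib compresses 1MByte of highly predictable data against 1MByte of random bytes, which are already at their true information content:

```python
import os
import zlib

repetitive = b"ABCD" * 250_000        # 1 MByte of very predictable data
random_data = os.urandom(1_000_000)   # 1 MByte of incompressible random bytes

# Predictable data shrinks dramatically; random data barely changes size,
# because it already carries close to 8 bits of information per byte.
print(len(zlib.compress(repetitive, 9)))
print(len(zlib.compress(random_data, 9)))
```

The repetitive file compresses to a few kilobytes, while the random file stays at roughly its original size; no cleverer algorithm could do fundamentally better on the random bytes.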
Although Shannon cast his theory in terms of communication it is just as applicable to information storage and retrieval. The general ideas are difficult to describe because they involve probability theory and considerations of how surprised you are to receive any particular information.
It may sound strange to say that the amount of information in a message depends on how surprising its content is, but that's the key to a coherent theory of information. If you can receive a total of M possible messages and they are all equally likely, then the amount of information in any message is log_{2}M, where log_{2} is the logarithm to the base 2.
To prove this would take us into probability theory, but from our computer-oriented standpoint it seems obvious enough because you need exactly log_{2}M bits to represent M messages. If you have 1 bit you can represent two messages as 0 or 1; 2 bits can code 4 messages as 00, 01, 10 and 11; and so on.
It should be clear that given b bits you can represent 2^{b} messages. That is M=2^{b}, and as log_{2}M is simply the power to which you have to raise 2 to get M, you should also be able to see that log_{2}M=b.
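The relationship between bits and message counts is easy to check for yourself. A small sketch in Python (ours, not Shannon's):

```python
from math import log2

# With b bits you can label 2**b distinct messages,
# so M equally likely messages carry log2(M) bits of information each.
for b in range(1, 5):
    M = 2 ** b
    print(f"{b} bits -> {M} messages, log2({M}) = {log2(M)}")
```

Running this confirms that log_{2}M simply recovers b: 1 bit gives 2 messages, 2 bits give 4, and so on.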
If the messages are not all equally likely then you can use a more sophisticated code to reduce the average number of bits needed to represent the messages.
This is the principle of data compression, and you can prove that if each message M_{i} happens with a probability P_{i}, then the average number of bits needed is the sum of -P_{i}log_{2}P_{i} over all possible messages. This is the information content of the messages in bits.
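This average, Shannon's entropy, is easy to compute. The sketch below (our illustration, with a hypothetical helper function) shows that four equally likely messages need exactly log_{2}4 = 2 bits each, while a skewed distribution needs fewer bits on average:

```python
from math import log2

def entropy(probs):
    """Average bits per message: the sum of -p*log2(p) over the distribution."""
    return sum(-p * log2(p) for p in probs if p > 0)

# Four equally likely messages: entropy equals log2(4) = 2 bits.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0

# Skewed probabilities: a clever code can average fewer than 2 bits.
print(entropy([0.7, 0.1, 0.1, 0.1]))
```

The second value comes out below 2, which is exactly why unequal probabilities leave room for compression.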
