The big split in design technologies, RISC v CISC, is just part of the story. It gets even more interesting when you start looking at what can be done to tweak the basic design of a processor.
If software can be multi-tasking so can hardware. The early processor designs carried out part of a single instruction at every clock cycle. First the instruction had to be fetched from memory, then it had to be decoded, then perhaps data had to be fetched, the operation was then carried out and so on. The exact steps vary according to the processor but there are always a number of steps involved in completing an instruction.
A non-pipeline processor has to do everything before moving on to the next instruction.
Modern processors can speed things up by overlapping the execution of instructions. In other words, by starting a new instruction before the current one is completed, the number of instructions completed per clock cycle can be increased. This idea is called “pipelining” and you can think of it as bringing the ideas of a production line to executing instructions.
A pipeline processor on the other hand can be working on more than one instruction at a time – just like a production line.
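To put some rough numbers on the production line idea, here is a minimal sketch in Python. It assumes every stage takes exactly one clock cycle and that nothing ever stalls, which is a simplification, and the function names are invented for the illustration.

```python
def cycles_unpipelined(n_instructions: int, n_stages: int) -> int:
    """Each instruction goes through every stage before the next one starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions: int, n_stages: int) -> int:
    """The first instruction needs n_stages cycles to travel down the pipeline;
    after that, one instruction completes on every cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


# 100 instructions on a 5-stage design (illustrative figures only).
print(cycles_unpipelined(100, 5))  # prints 500
print(cycles_pipelined(100, 5))    # prints 104
```

Under these idealised assumptions the pipelined design approaches one instruction per clock cycle once the pipeline is full.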
The Intel Core architecture, for example, has up to 14 stages of pipelining. Surely longer pipelines have to be better? Not necessarily. The big problem with pipelining is what happens if the instruction that you’ve just completed makes the partially completed instructions still in the pipeline invalid. In this case the pipeline has to be restarted and this costs clock cycles. This can be so expensive that most benchmarks showed that the Pentium 4 with a 20-stage pipeline was slower than a Pentium III with a 10-stage pipeline at the same clock speed.
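A back-of-the-envelope model shows why a deeper pipeline can lose out. The sketch below is in Python and rests on a loudly stated assumption that is not from any processor manual: every restart costs one cycle less than the number of stages while the pipeline refills. The function name and the instruction and restart counts are made up for the example.

```python
def cycles_with_flushes(n_instructions: int, n_stages: int, n_flushes: int) -> int:
    """Estimate total cycles for a pipeline that is restarted n_flushes times.

    Assumption: each restart costs n_stages - 1 cycles of refilling.
    """
    if n_instructions == 0:
        return 0
    ideal = n_stages + (n_instructions - 1)   # fill once, then one per cycle
    penalty = n_flushes * (n_stages - 1)      # refill after every restart
    return ideal + penalty


# Same program and same number of restarts, two pipeline depths.
short_pipeline = cycles_with_flushes(1000, 10, 100)
long_pipeline = cycles_with_flushes(1000, 20, 100)
print(short_pipeline, long_pipeline)  # prints 1909 2919
```

At the same clock speed the deeper pipeline needs more cycles for the same work in this model, because every restart throws away more partially completed instructions.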
One way of avoiding the need to restart a pipeline is to make sure that it is always filled with the right instructions. When a processor reaches a branch instruction it has a choice of two possible sets of instructions that it could follow. Branch prediction aims to guess which set is the right one. It sounds like fortune telling but it works because picking the wrong branch is no worse than waiting to find out which way the branch goes, and picking the right branch pays off.
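To see why guessing works so well, here is a small Python simulation of one common textbook scheme, a two-bit saturating counter. This is an illustrative choice, not a description of what any particular Pentium does, and the function name and test pattern are invented for the example.

```python
def count_correct_predictions(outcomes) -> int:
    """Run a single 2-bit saturating counter over a list of branch outcomes
    (True = branch taken) and return how many predictions were right."""
    counter = 2  # 0-1 predict "not taken", 2-3 predict "taken"; start at weakly taken
    correct = 0
    for taken in outcomes:
        prediction = counter >= 2
        if prediction == taken:
            correct += 1
        # Nudge the counter towards what actually happened, without overflowing.
        if taken:
            counter = min(3, counter + 1)
        else:
            counter = max(0, counter - 1)
    return correct


# A loop branch that is taken 9 times and then falls through, run 10 times over.
loop_pattern = ([True] * 9 + [False]) * 10
print(count_correct_predictions(loop_pattern))  # prints 90
```

The predictor is only wrong on the final pass of each loop, so 90 out of 100 guesses keep the pipeline full.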
In physics a scalar is a single number, as opposed to a vector which is a lot of numbers. Most processors do arithmetic one data value at a time and these are called scalar processors. A superscalar processor can split its instruction stream into two or more pipelines and, in principle, get things done twice as quickly.
In most cases this doesn’t happen because one pipeline ends up having to wait for the other. The Pentium “hyperthreading” architecture has two parallel pipelined execution paths and it uses out-of-order execution to try to keep them running. Out-of-order execution is a difficult thing to implement because in any sensible program the results of one instruction depend on the instructions that went before it.
Out-of-order execution goes hand in hand with “speculative execution”, where the processor carries out instructions before it knows they are needed, because at the end of the day you may well have to throw away the results! You can see that this all relates to branch prediction and a general tendency for processors to do extra work just in case it proves to save them time!
As we run out of simpler options for making our processors faster this is likely to become the only way forward.
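The way dependencies hold back a two-pipeline design can be shown with a toy model in Python. It assumes a simple in-order processor that can start two instructions in the same cycle only if the second does not read or write the register the first one writes. It does not attempt out-of-order execution, and the register names and instruction format are invented for the illustration.

```python
def dual_issue_cycles(program) -> int:
    """program is a list of (dest, sources) tuples, e.g. ("r1", ("r2", "r3")).
    Returns the number of cycles needed to issue every instruction, two at a
    time where the pair is independent."""
    cycles = 0
    i = 0
    while i < len(program):
        dest, _ = program[i]
        cycles += 1
        if i + 1 < len(program):
            next_dest, next_sources = program[i + 1]
            independent = dest not in next_sources and dest != next_dest
            if independent:
                i += 2  # both pipelines were used this cycle
                continue
        i += 1  # second pipeline sat idle this cycle
    return cycles


independent_work = [("r1", ("r2", "r3")), ("r4", ("r5", "r6")),
                    ("r7", ("r8", "r9")), ("r10", ("r11", "r12"))]
dependent_chain = [("r1", ("r0", "r0")), ("r2", ("r1", "r0")),
                   ("r3", ("r2", "r0")), ("r4", ("r3", "r0"))]
print(dual_issue_cycles(independent_work))  # prints 2
print(dual_issue_cycles(dependent_chain))   # prints 4
```

With independent instructions the second pipeline doubles the throughput; with a chain where each result feeds the next, it contributes nothing.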
Along with the superscalar idea there is also SIMD (Single Instruction Multiple Data). This is another way of getting the job done faster by doing the same operation on multiple data items. For example, if you want to add 10 numbers together why not bring them all into the processor in one fetch and add them using one huge adder? This is exactly what the MMX extensions to the Pentium and related processors allow programmers to do. The MMX instructions perform arithmetic on arrays of values in one operation. The Pentium III extended the idea to floating point values held in 128-bit registers with SSE, and the Pentium 4 took this another step further with SSE2 and double precision values. The Intel Core architecture extends this to SSE4.
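To make the idea concrete, here is a Python model of a packed add in the style of MMX: one 64-bit register treated as four independent 16-bit lanes. It is only a sketch of the arithmetic. Python works through the lanes one at a time, whereas the hardware handles them all at once, and the function names are invented for the example.

```python
LANES = 4
LANE_BITS = 16
LANE_MASK = (1 << LANE_BITS) - 1


def pack(values) -> int:
    """Pack four 16-bit values into one 64-bit integer, lane 0 in the low bits."""
    assert len(values) == LANES
    word = 0
    for i, value in enumerate(values):
        word |= (value & LANE_MASK) << (i * LANE_BITS)
    return word


def unpack(word: int):
    """Split a 64-bit integer back into its four 16-bit lanes."""
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


def packed_add(a: int, b: int) -> int:
    """Add corresponding lanes of a and b. Each lane wraps around on its own,
    so a carry never spills into the neighbouring lane."""
    result = 0
    for i in range(LANES):
        shift = i * LANE_BITS
        lane_sum = ((a >> shift) + (b >> shift)) & LANE_MASK
        result |= lane_sum << shift
    return result


total = packed_add(pack([1, 2, 3, 4]), pack([10, 20, 30, 40]))
print(unpack(total))  # prints [11, 22, 33, 44]
```

Four additions are expressed as a single operation on one wide value, which is the whole point of SIMD.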
The big problem with SIMD hardware is that it only makes a difference if you can make use of it. That is, if you only have one number to add then the ability to add ten in the same time is wasted. What this means is that SIMD improvements, like SSE, really only make a difference to multimedia, graphics, games and signal processing software. If all you want to do is run Excel or Word then they have little impact.
Back to clock speeds?
Sadly, for all our cleverness, increasing clock speeds is still the best way to make a processor go faster. The Pentium 4 was expected to reach 10GHz but it stuck at 3.8GHz. The highest speeds achievable seem to be below 4GHz.
Fast logic is expensive logic and so one thing that processor designers have been doing for some time is to use split clock speeds. They run the external logic at much lower speeds than the processor and even run different parts of the processor at different speeds. This of course is another opportunity to make use of the cache principle to speed things up when systems run at different speeds.
Intel always planned to make faster versions of the Pentium architecture but now manufacturers have withdrawn from the clock speed race. Instead the hope for the future is to put more processors or cores on the same chip. Multi-core processors can be seen as the only thing you can do when all of the other ways to make a processor go faster have run out of steam. The big problem is that you may have multiple cores but just like SIMD, if you don't have work for them to do they don't help. The big challenge for the future is making the software that can take advantage of the multi-processor configurations. I wouldn't say that the days of the ever faster processor were over but they seem to have dried up for the moment.