In physics a scalar is a single number - as opposed to a vector which is a lot of numbers. Most processors do arithmetic one data value at a time and these are called scalar processors.
A superscalar processor can spilt its instruction stream into two or more pipelines and get things done twice as quickly.
In most cases this doesn’t happen because one pipeline ends up having to wait for the other. The Pentium hyperthreading" architecture has two parallel pipelined executions paths and it uses out-of-order execution to try to keep them running. Out-of-order execution is a difficult thing to implement because in any sensible program the results of one instruction depend on the instructions that went before it.
Out-of-order execution is sometimes called “speculative execution” because at the end of the day you may well have to throw away the results! You can see that this all relates to branch prediction and a general tendency for processors to do extra work just in case it proves to save them time!
As we run out of simpler options for making our processors faster this is likely to become the only way forward.
Along with the superscalar idea there is also SIMD – Single Instruction Multiple Data. This is another way of getting the job done faster by doing the same operation on multiple data items. For example, if you want to add 10 numbers together why not bring them all in to the processor in one fetch and add them using one huge adder.
This is exactly what the MMX extensions to the Pentium and related processors allow programmers to do. The MMX instructions perform arithmetic on arrays of values in one operation. The Pentium III extended MMX to 128-bit floating point values and the Pentium 4 took this another step further with SSE2 and double precision values. The Intel Core architecture extends this to SSE4.
The big problem with SIMD hardware is that it only makes a difference if you can make use of it. That is, if you only have one number to add then the ability to add ten in the same time is wasted. What this means is that SIMD improvements,like SSE, really only make a difference to multimedia, graphics, games and signal processing software. If all you want to do is run Excel or Word then they have little impact.
The best example of SIMD processors in use to day are the many different types of GPU - Graphics Processing Units. A GPU is a SIMD processor optimized to work on the sort of data operations that are needed in graphics.
Back to clock speeds?
Sadly, for all our cleverness, increasing clock speeds is still the best way to make a processor go faster. The Pentium 4 was expected to reach 10GHz but it stuck at around 2GHz. The highest speeds achievable seem to be below 4GHz. Only very recently has a production processor managed to reach the speed of 5GHz - the AMD FX-9590 (June 2013).
Fast logic is expensive logic and so one thing that processor designers have been doing for some time is to use split clock speeds. They run the external logic at much lower speeds than the processor and even running different parts of the processor at different speeds.This of course is another opportunity to make use of the cache principle to speed things up when systems run at different speeds.
Intel always planned to make faster versions of the Pentium architecture but now manufacturers have mostly withdrawn from the clock speed race.
Instead the hope for the future is to put more processors or cores on the same chip. Multi-core processors can be seen as the only thing you can do when all of the other ways to make a processor go faster have run out of steam. The big problem is that you may have multiple cores but just like SIMD, if you don't have work for them to do they don't help.
The big challenge for the future is making the software that can take advantage of the multi-processor configurations. I wouldn't say that the days of the ever faster processor were over but they seem to have dried up for the moment.
Today the hopes for the future speed increases are in more cores in more heterogeneous systems. Today it isn't only the multicore CPU that does processing but the GPU with tens of simple but fast processing cores.