The GPU - Graphics Processing Unit - inside most modern machines is a parallel supercomputer dedicated to rendering complex graphics. It didn't take long for programmers to notice that this particular bit of hardware was capable of doing more general tasks - hence GPGPU, General Purpose GPU.
Initially, the problem with using a GPU for number-crunching applications was simply the lack of tools. Even using a GPU for the purpose for which it was intended was difficult. Add to this the variation in GPU hardware, and the fact that early GPUs were primitive in the facilities they provided, and you can start to understand the difficulties. Applications were created using graphics tools such as OpenGL or DirectX and had to be cast in graphics terms.
However, as things improved - working our way up to Shader Model 3.0 and tools such as CUDA, an SDK that allows C programming of the GPU - its use for general purpose computing grew. Microsoft also has DirectCompute, an extension to DirectX that lets you write GPGPU applications in HLSL.
The key idea is that of a kernel. Most GPU programs work on a 2D array of pixel data and have an implicit double for loop built in that effectively scans the entire array. In programming the GPU you only need to specify the kernel of the loop, that is the loop body, because the code that you specify is automatically applied to each and every pixel. The reason why the GPU implements this faster than, say, the CPU is simply that it has many processors that process the pixel data in parallel. Thus a GPU is an example of the Single Program Multiple Data (SPMD) approach to parallel architecture.
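To make the kernel idea concrete, here is a minimal sketch in plain Python that simulates on the CPU what the GPU does for you. It is not GPU code: the names apply_kernel and invert are hypothetical, chosen for illustration, and the loop runs sequentially, whereas a GPU would run the kernel on many pixels at once.

```python
def apply_kernel(kernel, image):
    """Stand-in for the implicit double for loop that the GPU supplies."""
    height = len(image)
    width = len(image[0])
    # On a GPU these iterations would run in parallel, roughly one thread
    # per pixel; here they simply run one after another on the CPU.
    return [[kernel(image, x, y) for x in range(width)]
            for y in range(height)]


def invert(image, x, y):
    """Hypothetical kernel: invert a single 8-bit greyscale pixel."""
    return 255 - image[y][x]


# A tiny 2 x 3 greyscale "image" used only for illustration.
image = [[0, 64, 128],
         [255, 32, 16]]

result = apply_kernel(invert, image)
```

The programmer writes only invert, the body of the loop. Everything in apply_kernel corresponds to work that the GPU and its runtime handle automatically, which is why the same small program can be applied to every data element in SPMD fashion.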
In the case of a modern GPU the statistics are impressive - typically 16 streaming multiprocessors, each with 8 streaming processors, giving 128 processors in all. Each of these processors is kept busy by a large number of threads. You can sum it all up by quoting the usual statistic of 10 gigaflops for the CPU but 1 teraflop for the GPU. This is slightly misleading, as the GPU can only perform at this speed on data that has the correct structure - but it is good at common tasks that involve matrices.
Currently one of the problems is that the nature of the hardware makes it difficult to write kernel functions that work as fast as possible. Small changes in the code can make big differences in efficiency, and these changes are not obvious or intuitive and depend on the hardware. Now a research team have taken the first few steps in building an optimising compiler that provides speed gains of 128 times over an unoptimised kernel function and beats hand-optimised code by some 30%. The compiler is open source and could be modified to work with other hardware: A GPGPU Compiler for Memory Optimization and Parallelism Management
An alternative approach is manual optimisation, and this also has the benefit of a new tool. The ATI Stream Profiler v1.3 has just been released. It integrates with Visual Studio and gathers GPU data as an OpenCL application runs.
If you would like more information on GPGPU programming in general then visit: GPGPU
Other relevant articles:
GPU Gems Volume 1