Halide is a new open source language designed specifically for image processing and computational photography. It not only makes it easy to implement photo algorithms, it also makes them run fast by semi-automatic parallelization.
Algorithms that work with images are ideal for parallel implementation because they usually work with small isolated blocks of data that means the task can be parallelized without worry about interactions. The only problem is that even converting something that is ripe for parallelization from serial code to something that runs on today's confusing architecture of CPU cores and GPUs is difficult.
Halide is a new functional programming language from MIT, (with help from Stanford and Adobe) that allows you to specify image processing algorithms, mostly block convolution methods, more easily and without having to worry about how the algorithm is implemented. A second section of the program then provides a general description of how the algorithm should be parallelized. It not only describes how the algorithm should be split up among computational elements but how to organize the data to keep the processing pipelines running at maximum efficiency by avoiding restarts.
The easiest way to understand the general idea is to see a simple example (taken from the paper):
Func halide_blur(Func in) f Func tmp, blurred; Var x, y, xi, yi; // The algorithm tmp(x, y) = (in(x-1, y) + in(x, y) + in(x+1, y))/3; blurred(x, y) = (tmp(x, y-1) + tmp(x, y) + tmp(x, y+1))/3; // The schedule blurred.tile(x, y, xi, yi, 256, 32) .vectorize(xi, 8).parallel(y); tmp.chunk(x).vectorize(x, 8); return blurred; }
The first part of the program defines a simple 3x3 blur filter split into a blur horizontal followed by a blur vertical step. The last part of the program, the schedule specifies how the algorithm can be treated in a parallel implementation. The Schuyler is machine specific and has to be changed to get the best performance out of a particular processor pipeline.
Local Laplacian Filter 69 lines 158ms 2x faster
Compared to a simple non-parallel implementation in C++ the Halide version runs in 0,9ms where the C++ takes 9.94ms. An optimized C++ parallel version runs at the same speed but of course takes a lot longer to construct.
The Halide complier outputs code for x86, ARM and PTX using the LLVM compiler. In use it tends to perform as well or better than optimized code but using far fewer lines to implement. For example, a Laplacian filter takes 335ms and 262 lines of C++/OpenMP but Halide does the job in 158ms and 79 lines. Typically Halide programs are about on third the length of C++ and anything from 2x to 10x.
Not only is this a step forward for image processing it is also an interesting approach to parallelism that could be extended into other application areas. Specifying and algorithm and how the different variables can be treated in parallelization seems like a good compromise between pure manual and pure automatic methods.