The Kinect is currently the hardware that provides developers with the greatest opportunities for innovative programs - both games and "serious" artificial applications. How does it work? How do you use it? What can you use it for?
Microsoft's Kinect is described as a "controller-free gaming and entertainment experience" and is commonly sold bundled with the Xbox 360 - but to see it as just another way to play games is to underestimate its significance. In this article we look at how it works, why it is special and what people are doing with it.
Microsoft acquired the 3D sensing technology that is the key to the Kinect hardware from Israeli company PrimeSense.
Essentially this hardware is a box with some cameras that captures color images and sound, and uses infra-red (IR) illumination to obtain depth data. The IR is used as a distance ranging device in much the same way as a camera autofocus works - see later for more details. It is claimed that the system can measure distance with an accuracy of 1 cm at 2 m and has a resolution of 3 mm at 2 m. The depth image is 640x480, i.e. standard VGA resolution. The color image is 1600x1200.
A custom chip processes the data to provide a depth field that is correlated with the color image. That is, the software can match each pixel with its approximate depth. The preprocessed data is fed to the machine via a USB interface in the form of a depth field map and a color image.
The PrimeSense reference implementation
If you would like to see more of the actual hardware inside the Kinect, then view the "teardown" video prepared by the people at iFixit.
The depth map
So how does the Kinect create a depth map?
It has been suggested that the Kinect works as a sort of laser radar. That is, the laser fires a pulse and the device times how long the pulse takes to reflect off a surface. This is most definitely not how the system works. The accuracy and speed needed to implement so-called "time of flight" distance measuring equipment are too demanding for a low-cost device like the Kinect.
Instead the Kinect uses a much simpler method, called "structured light", that is equally effective over the distances we are concerned with. The idea is simple. If you have a light source offset from a detector by a small distance, then a projected spot of light appears shifted by an amount that depends on the distance of the surface that reflects it.
So by projecting a fixed grid of dots onto a scene and measuring how much each one has shifted when viewed with a video camera, you can work out how far away each dot was reflected back from.
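The geometry can be sketched with the standard triangulation relation for a projector and a camera separated by a baseline: the distance is inversely proportional to the observed shift. This is only an illustration of the principle. The function name is hypothetical, and the focal length and baseline values are assumed for the example rather than taken from any Kinect calibration.

```python
def depth_from_shift(shift_px, focal_px=580.0, baseline_m=0.075):
    """Return the distance in metres for a dot shifted by shift_px pixels.

    shift_px is measured relative to where the dot would appear if the
    reflecting surface were infinitely far away. focal_px (focal length
    in pixels) and baseline_m (projector-to-camera offset in metres) are
    assumed, illustrative values.
    """
    if shift_px <= 0:
        raise ValueError("shift must be positive")
    # Similar triangles: distance = focal length * baseline / shift
    return focal_px * baseline_m / shift_px


for shift in (10.0, 20.0, 40.0):
    print(f"shift {shift:5.1f} px -> {depth_from_shift(shift):.3f} m")
```

Because the distance varies as one over the shift, a one-pixel error matters far more for distant surfaces than for near ones, which is one reason this approach suits room-sized distances.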
The actual details are more complicated than this because the IR laser in the Kinect uses a hologram to project a random speckle pattern onto the scene. It then measures the offset of each of the points to generate an 11-bit depth map, i.e. one with 2048 possible values per pixel. A custom chip does the computation involved in converting the dot map into a depth map so that it can process the data at the standard frame rate.
The exact details can be found in patent US 2010/0118123 A1, which PrimeSense registered. Once you know the trick it is tempting to think that it is easy. However, to measure accurately the optics have to be precise, and you will discover that the engineering of the Kinect goes to great lengths to keep the whole thing cool - there is even a Peltier cooling device on the back of the IR projector to stop heat expanding the holographic interference filter.
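To see why a random pattern helps, here is a minimal sketch of measuring the offset of one patch of the pattern along a single image row. It slides a window from a stored reference row across the observed row and keeps the shift with the smallest sum of absolute differences. This is a generic block-matching illustration, not the algorithm used by the PrimeSense chip; the function name, window size and search range are all assumptions.

```python
import random


def find_shift(reference, observed, start, window=15, max_shift=16):
    """Return the shift in pixels that best aligns a window of the
    reference row with the observed row (smallest sum of absolute
    differences). All parameters are illustrative."""
    patch = reference[start:start + window]
    best_shift, best_cost = 0, float("inf")
    for shift in range(max_shift + 1):
        candidate = observed[start + shift:start + shift + window]
        if len(candidate) < window:
            break  # ran off the end of the row
        cost = sum(abs(a - b) for a, b in zip(patch, candidate))
        if cost < best_cost:
            best_shift, best_cost = shift, cost
    return best_shift


# Simulate one row of a random speckle pattern and a copy displaced
# by a known amount, as if reflected from a surface at some distance.
random.seed(1)
reference = [random.random() for _ in range(200)]
true_shift = 7
observed = [0.0] * true_shift + reference[:-true_shift]

print(find_shift(reference, observed, start=50))
```

Because the pattern is random, each window is effectively unique, so there is only one shift at which the patch lines up exactly. A regular grid would match equally well at every multiple of the grid spacing.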
At this point you might be wondering why a depth map is so important.
The answer is that many visual recognition tasks are much easier with depth information. If you try to process a flat 2D image then pixels with similar colors that are near to each other might not belong to the same object. If you have 3D information then pixels that correspond to locations physically near to each other tend to belong to the same object, irrespective of their color! It has often been said that pattern recognition has been made artificially difficult because most systems rely on 2D data.
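As a rough illustration of this point, the sketch below groups pixels into objects using nothing but depth: neighboring pixels are put in the same region when their depths differ by less than a threshold. The function name, the threshold and the assumption that depths are in millimetres are all hypothetical choices for the example.

```python
from collections import deque


def segment_by_depth(depth, max_step=50):
    """Label 4-connected regions whose neighboring pixels differ in
    depth by at most max_step (same units as the depth map, assumed
    here to be millimetres). Returns (labels, number_of_regions)."""
    rows, cols = len(depth), len(depth[0])
    labels = [[0] * cols for _ in range(rows)]  # 0 means "not yet labelled"
    next_label = 0
    for r0 in range(rows):
        for c0 in range(cols):
            if labels[r0][c0]:
                continue
            # Start a new region and flood-fill it breadth first.
            next_label += 1
            labels[r0][c0] = next_label
            queue = deque([(r0, c0)])
            while queue:
                r, c = queue.popleft()
                for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and not labels[nr][nc]
                            and abs(depth[nr][nc] - depth[r][c]) <= max_step):
                        labels[nr][nc] = next_label
                        queue.append((nr, nc))
    return labels, next_label


# A small object about 0.9 m away in front of a wall 2 m away.
depth = [
    [2000, 2000, 2000, 2000],
    [2000,  900,  910, 2000],
    [2000,  905,  915, 2000],
    [2000, 2000, 2000, 2000],
]
labels, count = segment_by_depth(depth)
for row in labels:
    print(row)
```

The object is separated from the wall whatever color either of them happens to be, which is exactly the advantage the depth data provides.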
Once you have a depth field many seemingly difficult tasks become much easier. You can use fairly simple algorithms to recognize objects. You can even use the data to reconstruct 3D models, or to guide robots with collision avoidance and so on.
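Both of the last two uses start from the same step: turning each depth pixel into a point in 3D space. A minimal sketch using the usual pinhole camera model follows. The focal lengths, the choice of the image center as the optical center and the function names are assumptions made for illustration, not Kinect calibration data.

```python
def depth_to_points(depth_m, fx=580.0, fy=580.0, cx=None, cy=None):
    """Convert a depth map (metres, rows of floats) into a list of
    (x, y, z) points using a pinhole camera model. fx, fy, cx and cy
    are assumed intrinsics; cx and cy default to the image center."""
    rows, cols = len(depth_m), len(depth_m[0])
    cx = (cols - 1) / 2 if cx is None else cx
    cy = (rows - 1) / 2 if cy is None else cy
    points = []
    for v, row in enumerate(depth_m):
        for u, z in enumerate(row):
            if z <= 0:  # treat non-positive depth as "no reading"
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def nearest_obstacle(points):
    """Return the smallest z value in the cloud, or None if it is empty."""
    return min((z for _, _, z in points), default=None)


depth_m = [
    [2.0, 2.0, 2.0],
    [2.0, 1.0, 0.0],  # 0.0 marks a pixel with no depth reading
    [2.0, 2.0, 2.0],
]
cloud = depth_to_points(depth_m)
print(len(cloud), "points, nearest at", nearest_obstacle(cloud), "m")
```

A robot could stop or steer when the nearest distance falls below a safety margin, and the same list of points is the raw material for building a 3D model.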
There is no doubt the depth map is essential to the Kinect's working and it will be some time before the same tricks will be possible using standard video cameras.