Kinect SDK 1 - Depth and Video Space
Written by Mike James   

The Kinect has both a video and a depth camera and they have slightly different viewpoints on the world. Relating their co-ordinate systems is the subject of this chapter of our ebook on using the Kinect for Windows SDK 1. We create a background remover along the way.

Practical Windows Kinect in C#
Chapter List

  1. Introduction to Kinect
  2. Getting started with Microsoft Kinect SDK 1
  3. Using the Depth Sensor
  4. The Player Index
  5. Depth and Video Space
  6. Skeletons
  7. The Full Skeleton
  8. A 3D Point Cloud


In previous chapters we learned how to read in video and depth data. Now we take a break from extracting data from the cameras and concentrate on how to relate data from the depth camera to data from the video camera.

The Kinect has two cameras, one video and one depth. Each returns a bitmap corresponding to what it sees. For the video camera the bitmap is usually 640x480 and for the depth camera it is often 320x240, i.e. exactly half the resolution in each direction.

A common requirement is to pick out the video pixels that correspond to particular depth pixels. This sounds fairly easy, as you might think that the pixel at x,y in the depth image corresponds to the block of four pixels starting at 2x,2y in the video image.

Unfortunately this simple mapping doesn't work, and this can be the cause of much wasted time trying to figure out why your perfectly reasonable program doesn't quite do what you expect. Fortunately, the solution is fairly easy - once you know how.

As a simple example of using the data from the depth camera with the video camera, we construct a demo that uses the player index (see chapter 4) to create masks that can be used to remove the background from a user's image.

Getting started

If you don't know how to set up the Kinect or the basic process of getting data from it, you need to read the earlier chapters. In this chapter it is assumed you know how to get started.

Also it is assumed that you know how the depth plus player index data is processed.

Start a new C# Windows Forms project.

First you need to include a reference to the Microsoft.Kinect assembly (Microsoft.Kinect.dll).


To avoid having to type fully qualified names add:

using Microsoft.Kinect;

to the start of the code.

The program starts off in the usual way by obtaining a KinectSensor object:

sensor = KinectSensor.KinectSensors[0];

Next you have to initialize it to use skeletal tracking, the video stream, and the depth stream with player index.


Next we set up the event handlers to process the frames as they become ready. In this program we are going to be working with both the depth and the video frame, so you might think that we need two event handlers. However, the new AllFramesReady event can be used to trigger code when all of the frame types you have requested are ready to be processed. So we can simply use a single event handler:

sensor.AllFramesReady += FramesReady;

Finally we can set the sensor running by calling its Start method.
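
The setup steps described above might be sketched as follows. This is a minimal sketch rather than the chapter's own listing: the class name KinectSetupSketch, the StartSensor method, and the choice of stream formats (640x480 video, 320x240 depth, matching the resolutions mentioned earlier) are assumptions made for illustration.

```csharp
using System;
using Microsoft.Kinect;

// Hypothetical wrapper class, used only to make the sketch self-contained.
class KinectSetupSketch
{
    private KinectSensor sensor;

    public void StartSensor()
    {
        // Assumes at least one Kinect is attached to the machine.
        if (KinectSensor.KinectSensors.Count == 0)
            throw new InvalidOperationException("No Kinect sensor found.");
        sensor = KinectSensor.KinectSensors[0];

        // Enable skeletal tracking, video and depth, as the text describes.
        // The formats below are assumptions, not taken from the chapter.
        sensor.SkeletonStream.Enable();
        sensor.ColorStream.Enable(ColorImageFormat.RgbResolution640x480Fps30);
        sensor.DepthStream.Enable(DepthImageFormat.Resolution320x240Fps30);

        // One handler is called when all enabled frame types are ready.
        sensor.AllFramesReady += FramesReady;

        // Set the sensor running.
        sensor.Start();
    }

    private void FramesReady(object sender, AllFramesReadyEventArgs e)
    {
        // Frame processing goes here.
    }
}
```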


A Simple Mask

In this case we are going to use the depth image and the video image together. Specifically, we use the player index to derive a mask that can be applied to the video stream.

When the FramesReady event handler is called both the depth and the video frame are ready to be processed:

void FramesReady(object sender,
 AllFramesReadyEventArgs e)

Notice that we have to use a different type of EventArgs.

The first thing we need to do is to retrieve the raw bits for both the depth and the video data. This is done in the same way for each type of frame - the big difference is that the video data is stored in a byte array and the depth data in a short array. The depth data retrieval is:

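// Open the depth frame; it is null if the frame is no longer available.
DepthImageFrame DFrame = e.OpenDepthImageFrame();
if (DFrame == null) return;
short[] depthimage =
 new short[DFrame.PixelDataLength];
// Copy the raw depth values into the array.
DFrame.CopyPixelDataTo(depthimage);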

As always, in a real production program you would create resources such as the depthimage array just once, not each time the routine is called.
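
One way to follow this advice, sketched here under assumptions, is a small helper that allocates a new array only when the required length changes. The names FrameBuffers and Ensure are hypothetical and not part of the Kinect SDK.

```csharp
using System;

// Hypothetical helper for reusing frame buffers between calls.
static class FrameBuffers
{
    // Returns the existing array when it already has the requested length,
    // otherwise allocates and returns a new one.
    public static T[] Ensure<T>(T[] buffer, int length)
    {
        if (buffer == null || buffer.Length != length)
            return new T[length];
        return buffer;
    }
}
```

With depthimage declared as a field rather than a local variable, the event handler could then use depthimage = FrameBuffers.Ensure(depthimage, DFrame.PixelDataLength); so that the array is allocated on the first frame only.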

The video data follows the same pattern:

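// Open the video frame; it is null if the frame is no longer available.
ColorImageFrame VFrame = e.OpenColorImageFrame();
if (VFrame == null) return;
byte[] pixeldata =
 new byte[VFrame.PixelDataLength];
// Copy the raw video bytes into the array.
VFrame.CopyPixelDataTo(pixeldata);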

In the first attempt at building a mask we will assume that the mapping between depth and video images is simple and just a matter of x->2x and y->2y, i.e. the video image has twice the resolution of the depth image in each direction.

In more general terms we are applying a simple mapping between depth image co-ordinates x and y and video image co-ordinates vx, vy say and:


This simple approach will give you some idea of why this doesn't work and why we need a more complex transformation between the two types of co-ordinate.
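
A minimal sketch of this naive mapping, under stated assumptions, is given below. It assumes the player index occupies the low three bits of each depth value (see chapter 4), that the video data uses four bytes per pixel, and that each depth pixel covers a 2x2 block of video pixels. The names NaiveMask and Apply are hypothetical. It illustrates the simple approach only, not a correct alignment of the two cameras.

```csharp
using System;

// Hypothetical helper illustrating the naive x->2x, y->2y mapping.
static class NaiveMask
{
    const int PlayerIndexBitmask = 7; // low three bits hold the player index
    const int BytesPerPixel = 4;      // assumed 32-bit video pixels

    // Blanks every video pixel whose assumed depth pixel contains no player.
    public static void Apply(short[] depth, int depthWidth, int depthHeight,
                             byte[] video, int videoWidth)
    {
        for (int y = 0; y < depthHeight; y++)
        {
            for (int x = 0; x < depthWidth; x++)
            {
                int player = depth[y * depthWidth + x] & PlayerIndexBitmask;
                if (player != 0) continue; // keep pixels that belong to a player

                // Naive mapping: vx = 2x, vy = 2y, covering a 2x2 block.
                for (int dy = 0; dy < 2; dy++)
                {
                    for (int dx = 0; dx < 2; dx++)
                    {
                        int vx = 2 * x + dx;
                        int vy = 2 * y + dy;
                        int i = (vy * videoWidth + vx) * BytesPerPixel;
                        Array.Clear(video, i, BytesPerPixel);
                    }
                }
            }
        }
    }
}
```

For a 320x240 depth frame and a 640x480 video frame the call would be NaiveMask.Apply(depthimage, 320, 240, pixeldata, 640);.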