There are two basic variations on this idea, depending on whether or not the data you are trying to work with has a lot of statistical variation.
If the data isn't noisy then you can simply take the first layers and train them as an auto-encoder. An auto-encoder is a neural network that learns to reproduce its inputs as its outputs. You might think that this is just a memory, but if you set the network up so that it doesn't have enough resources to memorize the inputs, then something really interesting happens.
To reproduce the inputs, the neural network has to find an internal representation of the data that isn't a simple memory of the inputs. It has to find features that allow the inputs to be stored in less space. For example, to reproduce a face I could describe its features, "big mouth, wide eyes, small nose" and so on. By describing features I can tell you what a face is like using much less data than a photo would need. You don't have to remember a pixel-perfect representation of a face because you can reconstruct it from your memory of its features.
A memory restricted auto-encoder has to discover a feature representation of the input data to be able to reconstruct it.
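To make this concrete, here is a minimal sketch of a memory-restricted auto-encoder: eight-dimensional inputs squeezed through a three-unit bottleneck and trained by plain gradient descent to reproduce themselves. All of the sizes, the learning rate and the toy data set are illustrative choices, not anything prescribed above.

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_hidden = 8, 3          # bottleneck is smaller than the input
X = rng.random((64, n_in))     # a toy "data set" of 64 examples

# Encoder and decoder weights (biases omitted for brevity)
W_enc = rng.normal(0, 0.1, (n_in, n_hidden))
W_dec = rng.normal(0, 0.1, (n_hidden, n_in))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X):
    H = sigmoid(X @ W_enc)     # compressed feature representation
    X_hat = H @ W_dec          # attempted reconstruction of the input
    return H, X_hat

lr = 0.5
_, X_hat = forward(X)
loss_before = np.mean((X_hat - X) ** 2)

for _ in range(500):
    H, X_hat = forward(X)
    err = (X_hat - X) / len(X)            # gradient of the squared error
    dH = err @ W_dec.T * H * (1 - H)      # back-prop through the sigmoid
    W_dec -= lr * H.T @ err
    W_enc -= lr * X.T @ dH

_, X_hat = forward(X)
loss_after = np.mean((X_hat - X) ** 2)
```

Because eight numbers have to pass through three hidden units, the network cannot simply memorize `X`; the drop in reconstruction error comes from finding a feature encoding of the data.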
Notice that the features an auto-encoder discovers aren't guaranteed to be useful to the next layers of the network, but there is every chance that they will be. What you do next is take the features the first auto-encoder has found and use them as the input to the next layer, which you also train as an auto-encoder. This layer learns to reproduce its input, and in doing so it learns higher-level features, and so on until you have pre-trained all of the layers of the deep network. At this point you can put the entire network together and use back propagation to fine-tune its settings.
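The greedy layer-by-layer procedure can be sketched as follows. The helper `train_autoencoder` and all of the layer sizes are illustrative inventions; the point is only the shape of the process - train one auto-encoder, feed its features to the next, then stack the encoders.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, steps=400, lr=0.5):
    """Fit a one-hidden-layer auto-encoder; return the encoder weights."""
    n_in = X.shape[1]
    W_enc = rng.normal(0, 0.1, (n_in, n_hidden))
    W_dec = rng.normal(0, 0.1, (n_hidden, n_in))
    for _ in range(steps):
        H = sigmoid(X @ W_enc)
        err = (H @ W_dec - X) / len(X)
        dH = err @ W_dec.T * H * (1 - H)
        W_dec -= lr * H.T @ err
        W_enc -= lr * X.T @ dH
    return W_enc

X = rng.random((64, 8))

# Layer 1: learn features of the raw data
W1 = train_autoencoder(X, 5)
H1 = sigmoid(X @ W1)

# Layer 2: learn higher-level features of layer 1's features
W2 = train_autoencoder(H1, 3)

# Stacked, pre-trained encoder 8 -> 5 -> 3, ready for fine-tuning
features = sigmoid(H1 @ W2)
```

After stacking, the decoder halves are thrown away and the two encoder layers become the lower part of the deep network that back propagation then fine-tunes.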
The result is a network that delivers on the promise of deep networks.
If the data is noisy or statistical in nature then you can do the same thing, but instead of auto-encoder pre-training you train the lower layers as a Restricted Boltzmann Machine (RBM), which learns to reproduce the statistical distribution of the input data. The same argument about not having enough memory to simply store the distribution means that the RBM has to learn statistical features to do a good job of reproducing the input data. Once the first RBM has been trained, you can use its output to train the next layer as an RBM, and so on until the entire network has been trained. Then you stack the RBMs together and throw away the extra mechanisms needed to make the sub-layers work as RBMs, and you have a standard neural network again. A little back-propagation fine-tuning and once again you have a network that delivers the performance you are looking for.
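A single RBM layer of this kind is usually trained with one step of contrastive divergence (CD-1). The sketch below is illustrative, with invented sizes, rates and a toy binary data set built from two patterns; the update nudges the weights so that the model's statistics match the data's statistics.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy binary data drawn from two patterns, so there is structure to learn
proto = np.array([[1, 1, 1, 0, 0, 0],
                  [0, 0, 0, 1, 1, 1]], dtype=float)
V = proto[rng.integers(0, 2, size=32)]

n_visible, n_hidden = 6, 4
W = rng.normal(0, 0.1, (n_visible, n_hidden))
b_v = np.zeros(n_visible)   # visible biases
b_h = np.zeros(n_hidden)    # hidden biases

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.1
for _ in range(200):
    # Positive phase: hidden probabilities given the data
    ph = sigmoid(V @ W + b_h)
    h = (rng.random(ph.shape) < ph).astype(float)   # sample hidden units
    # Negative phase: one reconstruction step (the "1" in CD-1)
    pv = sigmoid(h @ W.T + b_v)
    ph2 = sigmoid(pv @ W + b_h)
    # Move the model's statistics toward the data's statistics
    W += lr * (V.T @ ph - pv.T @ ph2) / len(V)
    b_v += lr * (V - pv).mean(axis=0)
    b_h += lr * (ph - ph2).mean(axis=0)

# The hidden probabilities are the statistical features that would be
# fed to the next RBM in the stack
features = sigmoid(V @ W + b_h)
recon_error = np.mean((sigmoid(features @ W.T + b_v) - V) ** 2)
```

Once every layer has been trained this way, only the feed-forward weights are kept; the sampling machinery above is exactly the "extra mechanism" that gets thrown away when the RBMs are stacked into an ordinary network.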
There are other justifications for why the RBM approach works, and there are even some weak theoretical results suggesting that it should, but what really matters is that we have practical evidence that deep learning does work. It is not that the original idea was wrong; it was just that we needed more data than we imagined, a lot more data, and more computational power than we imagined. Put simply, deep neural networks didn't learn in the past because we didn't give them long enough. Pre-training reduces the time a neural network takes to learn by allowing it to build a structure that is likely to succeed when the training proper actually starts. There is evidence, however, that if you have enough computer power then perhaps you don't even need pre-training.
Neural networks were first put forward as analogs of the way the brain works. Surely this sort of pre-training can't be an analog of how a biological brain works? It does seem fairly obvious that there isn't an initial pre-training phase in an infant brain, trained layer by layer and then stacked together to create a deep network. So the whole analogy probably breaks down, and whatever we are doing at the moment doesn't have much to do with biology.
This view misses the simple fact that brains are the product of evolution as well as training. It could be that layer-by-layer training is part of the evolutionary development of a brain, that this is how the structure comes about, and that what we think of as learning is just the fine-tuning.
Of course this is speculation intended more to make you think about the problem rather than present a solution. We really don't understand that much about how the brain works.
There is still a lot of work to be done, but it really does seem that deep neural networks work after all. In the future we will need more computer power to make them work even better, but success appears to be a matter of doing the job correctly and throwing what would, back in the early days, have seemed a ridiculous amount of computing power at the problem.
If you would like to see the state of the art in 2010, the following video by Geoffrey Hinton gives you a good idea of what was going on - but remember, things have moved on a great deal since then!