The first layer of neurons learns primitive features, such as an edge in an image, by finding combinations of digitized pixels that occur more often than they would by chance. Once that layer can reliably recognize those features, its outputs are fed to the next layer, which trains itself to recognize more complex features, such as a corner. The process repeats through successive layers until the system can reliably recognize objects or phonemes.

An interesting paper that Jordi Nin sent me comes from Google. The authors consider the problem of building high-level, class-specific feature detectors from unlabelled data alone: they trained a 9-layer network of virtual neurons, with one billion connections, on a dataset of 10 million images. Training the many layers of virtual neurons in that experiment required 16,000 computer cores! Is it clear now why our research group is entering this amazing world?
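The layer-by-layer idea can be sketched in a few lines of NumPy. This is only an illustration, not the Google model: the layer sizes are made up, and the weights below are random, whereas in a real network each layer's weights are learned from data. The point is just the structure: each layer computes combinations of the features produced by the layer beneath it.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, weights):
    """One layer: a linear combination of its inputs followed by a ReLU,
    so each unit responds to a particular pattern in the layer below."""
    return np.maximum(0.0, x @ weights)

# Hypothetical sizes: 64 "pixels" -> 32 edge-like units
# -> 16 corner-like units -> 8 object-like units.
w1 = rng.normal(size=(64, 32))
w2 = rng.normal(size=(32, 16))
w3 = rng.normal(size=(16, 8))

pixels = rng.normal(size=(1, 64))   # one tiny "image"
edges = layer(pixels, w1)           # first layer: primitive features
corners = layer(edges, w2)          # next layer: combinations of those
objects_ = layer(corners, w3)       # deeper layer: higher-level features

print(edges.shape, corners.shape, objects_.shape)
```

Stacking more such layers, and learning the weights instead of drawing them at random, is what lets deep networks go from raw pixels to class-specific detectors.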

(*) Picture from Andrew Ng (Stanford)