AI - It's All Downhill From Here?
Written by Mike James
Wednesday, 31 December 2025
AI is a complex beast, but it is based on some very simple and very powerful ideas that deserve to be better known, as they throw much light not only on the way AI works but on the way the universe works.

The current approach to AI is amazing. No really, it looks more like a conjurer's trick than science or engineering. You take a box full of artificial neurons complete with lots and lots of connection strengths, you apply a simple learning algorithm based on predicting the next word and, if you train it long enough, it can talk to you. You can argue about how well it can talk to you and whether or not this is intelligence, but you cannot be unimpressed by its abilities. What you have done is so remarkably simple that the whole outcome is unreasonable. This is the unreasonable effectiveness of the back propagation learning algorithm. How can something so very simple be so good at learning complicated things? This is just the tip of the iceberg, as the idea is even more general than this suggests, and it all comes down to something called "gradient descent".

Up Hill Or Down Hill?

Let's suppose you have a machine which controls the brightness of an LED via a lot of control knobs - 100 to be concrete. If you don't know what the knobs do, then trying to get the LED as bright as possible might seem like an impossible task, but in fact there is a very easy algorithm that nearly always works. First number all the knobs from 1 to 100. Then repeatedly pick a knob at random, give it a small turn in a random direction and look at the LED - if it gets brighter, keep the change; if it gets dimmer, turn the knob back. This procedure, taking steps in a random direction and only accepting those that take you up, is usually called hill climbing - see the code sketch at the end of this section. Of course, if you were trying to make the LED as dim as possible it would be hill descending, but for some reason this is a much less common terminology.

Hill climbing is guaranteed to find a "sort of" maximum brightness, but it might take some time. It is not very efficient. It also has another problem - it does always find a maximum, but it may not be the biggest maximum. It might be what is called a local maximum. You can easily understand this if you consider the hill climbing algorithm on a real landscape. Which hill you climb depends on where you start. You always climb a hill, but it is the one you are nearest to when you set off. This hill is a local high point, but it might not be the highest possible high point in the overall landscape - it is one of, possibly many, local maxima. This is a defect of the hill climbing algorithm - it always finds a local maximum, but not necessarily the global maximum. The usual way of trying to find a global maximum is simply to start the hill climbing off from lots of random starting places. Again this is not efficient, and it is only highly likely to produce the correct answer, as it is still possible to miss the actual global maximum.

Keep in mind that hill climbing, or hill descending, does get you a solution - it is just very, very slow. It is the mechanism that in part makes evolution work. Each generation tries out random variations in design, and anything that makes things worse is eliminated, the examples concerned being removed from the gene pool, while the things that make things better increase in the gene pool. This is what "survival of the fittest" really means. It is not a purposeful guided program of improvement, but a blind culling of any random step that takes you to a lower point on the survival probability surface. It's hill climbing where effective fitness to survive is the hill being climbed and, yes, it takes a long time.
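Here is a minimal sketch of hill climbing in code. The brightness function is a made-up stand-in for the LED machine - it peaks when every knob is set to 0.7 - and the twist size and iteration count are arbitrary illustrative choices:

```python
import random

# Toy stand-in for the LED machine: a made-up brightness function that
# peaks when every knob is set to 0.7. The real machine is a black box;
# the loop below never looks inside it, only at the brightness.
def brightness(knobs):
    return -sum((k - 0.7) ** 2 for k in knobs)

knobs = [random.random() for _ in range(100)]  # 100 knobs, random start
best = brightness(knobs)

for step in range(100_000):
    i = random.randrange(100)              # pick a knob at random
    twist = random.uniform(-0.01, 0.01)    # give it a small random twist
    knobs[i] += twist
    new = brightness(knobs)
    if new > best:
        best = new                         # brighter - keep the change
    else:
        knobs[i] -= twist                  # dimmer - turn it back

print(best)  # creeps slowly towards the maximum brightness of 0
```

Note that this toy surface has a single peak, so hill climbing always gets there eventually; on a bumpier surface the same loop would happily get stuck on a local maximum.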
Gradients!

We could use simple hill climbing to train our neural networks, but it would take many years - just like evolution. What we actually use is a simple improvement on hill climbing. If you can work out the gradient of the surface you are on, then you can find a good direction to move in by selecting the direction in which the height increases most rapidly. In plain terms, you can get to the top of a hill more quickly by always walking up the steepest slope from where you are. This is gradient ascent but, for reasons that are unclear, we usually refer to it as gradient descent and try to find the lowest point.

Gradient descent - follow the steepest slope down - seems like an obvious algorithm, and it is, but implementing it in the real world can be difficult. Back in the early days of neural networks we could only work out the gradients for a network with a single layer of neurons. This worked, but there were some very simple things it couldn't learn. This led to the first AI Winter, when everyone was forced to give up on neural networks because no one was prepared to fund such stupid research and few researchers were prepared to risk their careers pursuing unfashionable goals.

But, and this is a very big but, no one had proved that networks with multiple layers suffered from things that they couldn't learn. In principle everyone thought that they would be capable of learning anything that was thrown at them. A mathematical theorem, the Kolmogorov representation theorem, proves not only that a multilayer network can learn anything, but that you don't need more than two layers. This is remarkable, yet no one had any idea how to train a multilayer network. The big breakthrough was when someone worked out the gradient of a multi-layer network and so allowed gradient descent to be used to train it. This is the learning algorithm generally called back propagation, or back prop. While it was Paul Werbos in his 1974 PhD thesis who was the first to explicitly apply it to training multi-layer neural networks, the concept of efficiently computing gradients backwards existed earlier in control theory. The algorithm then gained massive traction and became the foundational learning method for neural networks thanks to the work of David E. Rumelhart, Geoffrey Hinton, and Ronald J. Williams in the mid-1980s.

So neural networks are trained by back propagation, which is just gradient descent. In fact, a big chunk of the success of AI is due simply to the application of gradient descent to the training of deep, i.e. multilayer, neural networks. Hence "deep learning". You might be wondering why, if Kolmogorov proves that just two layers are enough, we bother with more? The answer is that, while a two-layer network can be taught anything, a deep network with more than two layers can learn some things much, much quicker and can generalize, i.e. go beyond what it has been taught. Both plain gradient descent and back prop are illustrated in the sketches below.
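To see what the gradient buys you, here is the same toy knob problem solved by gradient ascent. Since the machine is a black box, this sketch estimates the gradient by finite differences, i.e. by nudging each knob in turn; in a neural network, back prop delivers the same information analytically and far more cheaply:

```python
# The same toy brightness function as in the hill climbing sketch.
def brightness(knobs):
    return -sum((k - 0.7) ** 2 for k in knobs)

knobs = [0.0] * 100      # start with every knob turned fully down
eps = 1e-4               # nudge size for estimating the gradient
lr = 0.1                 # how far to step along the steepest slope

for sweep in range(200):
    base = brightness(knobs)
    # Estimate the gradient: nudge each knob in turn and measure how
    # much the brightness changes per unit of twist.
    grad = []
    for i in range(len(knobs)):
        knobs[i] += eps
        grad.append((brightness(knobs) - base) / eps)
        knobs[i] -= eps
    # Move every knob at once in the steepest uphill direction.
    knobs = [k + lr * g for k, g in zip(knobs, grad)]

print(brightness(knobs))  # essentially 0 after just 200 sweeps
```

Two hundred sweeps replace the hundred thousand random twists of the hill climbing version - following the steepest slope is simply a much better-informed way to move.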
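And here is back prop itself in miniature: a from-scratch two-layer network learning XOR, the classic "very simple thing" that a single-layer network cannot learn. The layer sizes, learning rate and epoch count are arbitrary choices for illustration, not anyone's canonical implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

# XOR - the classic task a single-layer network cannot learn.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Two layers of weights: 2 inputs -> 3 hidden neurons -> 1 output.
W1 = rng.normal(size=(2, 3))
b1 = np.zeros(3)
W2 = rng.normal(size=(3, 1))
b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 1.0  # learning rate - the step size down the slope
for epoch in range(10_000):
    # Forward pass: run the network on all four examples at once.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass (back prop): push the squared-error gradient back
    # through the network, layer by layer.
    d_out = (out - y) * out * (1 - out)  # gradient at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)   # gradient at the hidden layer

    # Gradient descent: step every weight a little way downhill.
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)

print(out.round(2))  # should end up close to [[0], [1], [1], [0]]
```

Like hill climbing, gradient descent can still get stuck: an unlucky random start can land this little network in a local minimum, in which case a different seed - a different starting place - usually fixes it.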
Differentiable Systems

The key idea here is that back prop was invented as soon as it was realized that it was possible to work out a gradient. Any system for which we can work out a gradient is called "differentiable", because working out gradients is what differential calculus is all about. What we have discovered is that if you have a differentiable system of any sort, then we can use gradient descent to train it. Researchers have created all sorts of unlikely differentiable versions of systems that at first sight appear to be very non-differentiable. For example, a differentiable programming language allows programs to be created or corrected by gradient descent - there is a toy sketch of the idea below. If we could find a way to work out a gradient for evolution, we could make it operate in days rather than years. In fact, many have pointed out that our ability to manipulate genes has done exactly this. Humans add gradients to survival of the fittest and move organisms more quickly towards much better performance.

This is a crazy idea, but you cannot take any old differentiable black box and apply gradient descent to make it do what you want. If this were true, we could have used single-layer networks, or something even more trivial, instead of the expensive-to-run very deep networks we use today. So what does it take to make something trainable by gradient descent?
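As a toy taste of the differentiable-programming idea mentioned above - this is not a real differentiable language, just a one-parameter "program" whose bug is corrected by gradient descent; the program, test data and learning rate are all invented for illustration:

```python
# A toy "program" with one tunable constant. It is meant to convert
# Celsius to Fahrenheit, but the scale factor theta starts out wrong.
def program(theta, celsius):
    return theta * celsius + 32.0

# Test cases the program should satisfy: (celsius, correct fahrenheit).
tests = [(0.0, 32.0), (10.0, 50.0), (100.0, 212.0)]

theta = 1.0   # the "bug" - the correct value is 9/5 = 1.8
lr = 1e-5     # a small learning rate keeps the steps stable

for step in range(500):
    # Gradient of the total squared error with respect to theta,
    # worked out by ordinary calculus:
    #   d/dtheta (theta*c + 32 - f)^2 = 2*(theta*c + 32 - f)*c
    grad = sum(2.0 * (program(theta, c) - f) * c for c, f in tests)
    theta -= lr * grad  # gradient descent "corrects" the program

print(theta)  # converges to 1.8 - the bug is fixed
```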