The Math Behind Neural Networks
Interactive prerequisites for Karpathy's "Neural Networks: Zero to Hero"
Machine learning is about finding patterns in data to make predictions.
A model is a function with tunable parameters. We adjust these parameters until the model's predictions match our data.
Example: Predicting house prices from square footage.
price = slope × sqft + intercept
The slope tells us the price per square foot.
The intercept is the base price.
Our job: find the best values for these parameters.
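The model above can be sketched in a few lines of Python. The parameter values here are made up for illustration, not fitted to any real data:

```python
# A linear model: price = slope * sqft + intercept.
def predict_price(sqft, slope, intercept):
    """Predict a house price from square footage."""
    return slope * sqft + intercept

# Hypothetical parameters: $200 per square foot, $50,000 base price.
price = predict_price(1500, slope=200.0, intercept=50_000.0)
```

Finding "the best values" means choosing `slope` and `intercept` so predictions like this one land close to real prices.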
Here's some house price data. Try to drag the line so it fits the pattern.
How do we know if our line is good? We measure the error — how far off each prediction is.
The red squares show the squared error for each point. We square the errors so they're all positive (errors above and below don't cancel out) and so big errors are penalized more heavily.
The loss is the total area of all squares — a single number measuring how wrong our model is.
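The loss described above can be computed directly. The `(sqft, price)` data points here are hypothetical, just to make the sketch runnable:

```python
# Squared-error loss: sum the squared error over every data point.
data = [(1000, 250_000.0), (1500, 350_000.0), (2000, 460_000.0)]

def loss(slope, intercept):
    total = 0.0
    for sqft, price in data:
        prediction = slope * sqft + intercept
        error = prediction - price   # may be positive or negative
        total += error ** 2          # squaring makes every term positive
    return total
```

With `slope=200.0, intercept=50_000.0`, the first two points are predicted exactly and the third is off by $10,000, so the loss is that one squared error.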
We want to minimize the loss. But there's a catch...
We can only see the loss at our current parameter values.
Which way should we adjust to make it smaller?
What if we could see the loss at every possible parameter value?
Here's a chart showing loss (vertical) for different slope values (horizontal):
Now we can see there's a minimum — a valley where the loss is lowest! That's where we want to be.
But in practice, we can't compute this entire curve. We can only see the loss at our current position...
Even without seeing the whole curve, we can ask a simpler question:
"At my current position, which way is downhill?"
The derivative answers this. It tells us the slope of the curve at our current point.
Positive derivative → curve is going up → move left to descend
Negative derivative → curve is going down → move right to descend
Zero derivative → you're at the bottom (or top) — a flat spot!
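The "which way is downhill" question can be answered numerically, without any calculus machinery. This is a minimal sketch using a symmetric finite difference on a toy loss curve (the curve is invented, with its minimum at slope = 3):

```python
# Estimate the derivative of f at x with a small symmetric step.
def derivative(f, x, eps=1e-6):
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# Toy loss curve with its minimum at slope = 3.
f = lambda slope: (slope - 3.0) ** 2

d = derivative(f, 5.0)  # positive: curve goes up here, so step left
```

At `x = 5.0` the derivative is positive (about 4), so moving left decreases the loss; at `x = 1.0` it is negative, so moving right decreases it.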
Now we have an algorithm! Repeat these steps:
1. Compute the derivative → 2. Step in the opposite direction → 3. Repeat
The learning rate controls step size. Too small = slow progress. Too big = overshoot the minimum.
Notice how steps naturally get smaller near the bottom — because the curve flattens out there!
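The three-step loop above fits in a few lines. This sketch reuses the toy loss from before (minimum at slope = 3) and an assumed learning rate of 0.1:

```python
# Gradient descent on a toy loss curve with its minimum at slope = 3.
def f(slope):
    return (slope - 3.0) ** 2

def derivative(f, x, eps=1e-6):
    return (f(x + eps) - f(x - eps)) / (2 * eps)

slope = 0.0
learning_rate = 0.1
for _ in range(100):
    d = derivative(f, slope)
    slope -= learning_rate * d  # step opposite the derivative
```

Because each step is proportional to the derivative, and the derivative shrinks as the curve flattens near the bottom, the steps shrink automatically.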
Real models have many parameters. With two parameters, the loss becomes a surface:
X-axis = first parameter, Y-axis = second parameter, Height = loss
The goal is the same: find the lowest point (the valley).
The gradient is the derivative for each parameter — a list of numbers pointing toward steepest ascent.
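A gradient can be estimated the same way as a single derivative: one partial derivative per parameter, collected into a list. The two-parameter "bowl" loss here is invented for illustration:

```python
# Toy two-parameter loss with its minimum at (2, 5).
def loss(slope, intercept):
    return (slope - 2.0) ** 2 + (intercept - 5.0) ** 2

# Numerical gradient: nudge each parameter in turn.
def gradient(f, params, eps=1e-6):
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((f(*up) - f(*down)) / (2 * eps))
    return grads

g = gradient(loss, [0.0, 0.0])  # one number per parameter
```

At `(0, 0)` the gradient is roughly `[-4, -10]`: both components point toward steepest ascent, so gradient descent steps in the opposite direction.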
Real neural networks have millions of parameters. We need a clean way to organize all these numbers...
A vector is simply a list. Our house isn't just square footage — it's [sqft, bedrooms, age]. Our weights become [w₁, w₂, w₃].
The dot product combines weights and inputs into a single prediction:
output = (w₁ × x₁) + (w₂ × x₂) + (w₃ × x₃)
This is exactly what a single neuron does: take a vector of inputs, apply a vector of weights, and produce one number.
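A single neuron, then, is just a dot product. The feature values and weights below are illustrative:

```python
# One neuron: dot product of weights and inputs -> one number.
def neuron(weights, inputs):
    return sum(w * x for w, x in zip(weights, inputs))

x = [1500.0, 3.0, 20.0]          # [sqft, bedrooms, age]
w = [200.0, 10_000.0, -1_000.0]  # hypothetical weights [w1, w2, w3]
output = neuron(w, x)
```

Each weight scales one input feature; the negative weight on age means older houses pull the prediction down in this toy example.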
But a neural network layer doesn't have just one neuron — it has many.
Each neuron has its own weights. Stack these weight vectors as rows → you get a matrix.
This matrix multiplication is the neural network layer:
Each row of the matrix = one neuron's weights. The lines are the weights, the circles are the computations.
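A minimal sketch of that matrix-vector product, with made-up weights, shows the layer computing every neuron at once:

```python
# A layer as a matrix-vector product: each row is one neuron's weights,
# so each output element is one neuron's dot product with the input.
def layer(weight_matrix, inputs):
    return [sum(w * x for w, x in zip(row, inputs)) for row in weight_matrix]

W = [[1.0, 2.0],   # neuron 1's weights
     [3.0, 4.0],   # neuron 2's weights
     [5.0, 6.0]]   # neuron 3's weights

x = [10.0, 1.0]    # input vector (2 features)
out = layer(W, x)  # 3 neurons -> 3 outputs
```

Three rows in, three numbers out: the layer maps a 2-dimensional input vector to a 3-dimensional output vector.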
Neural networks stack multiple layers: the output of one becomes the input to the next...
So a neural network is a chain of matrix multiplications, layer after layer.
Each layer transforms its input vector into an output vector, which feeds into the next layer.
The final output is compared to our target to compute the loss.
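The full chain can be sketched end to end. The weights and target below are invented, and for simplicity this sketch omits the nonlinearity that real networks apply between layers:

```python
# Two stacked layers: the output of layer 1 is the input to layer 2.
def layer(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

W1 = [[1.0, -1.0], [0.5, 0.5]]  # layer 1: 2 inputs -> 2 outputs
W2 = [[1.0, 2.0]]               # layer 2: 2 inputs -> 1 output

x = [3.0, 1.0]
h = layer(W1, x)                # hidden vector
y = layer(W2, h)                # final output: a single number
target = 5.0
loss = (y[0] - target) ** 2     # compare output to target
```

The last line is the same squared-error idea from earlier, now applied to the network's final output.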
We want to adjust weights to minimize loss. But there are weights in every layer...
If we tweak a weight in Layer 1, how much does that change the final loss?
The effect ripples through Layers 2, 3, and beyond. We need a way to trace this chain.
The chain rule tells us: multiply the effects at each step.
If changing w changes h by some amount, and changing h changes L by some amount...
Then the total effect of w on L is the product of these effects.
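The "multiply the effects" rule can be checked numerically. In this sketch the chain is invented: h = 3w, then L = h², so dh/dw = 3 and dL/dh = 2h:

```python
# Chain rule: dL/dw = (dL/dh) * (dh/dw) for the chain h = 3*w, L = h**2.
w = 2.0
h = 3.0 * w  # dh/dw = 3
L = h ** 2   # dL/dh = 2*h

dL_dw = (2.0 * h) * 3.0  # multiply the effects: 12 * 3 = 36

# Cross-check against a direct numerical estimate of dL/dw.
eps = 1e-6
numeric = (((3.0 * (w + eps)) ** 2) - ((3.0 * (w - eps)) ** 2)) / (2 * eps)
```

The chain-rule product and the brute-force numerical derivative agree, which is the whole point: we never had to differentiate the composite function directly.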
Backpropagation = applying the chain rule backward through the network.
Starting from the loss, we work backward, computing how much each weight contributed to the error. This gives us the gradient for every weight in the network!
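Here is backpropagation by hand on the smallest possible "network": two weights in a chain, with made-up values. Forward pass first, then the chain rule applied backward from the loss:

```python
# Tiny two-layer chain: h = w1*x, y = w2*h, L = (y - t)**2.
x, t = 2.0, 10.0
w1, w2 = 1.0, 3.0

# Forward pass.
h = w1 * x        # h = 2
y = w2 * h        # y = 6
L = (y - t) ** 2  # L = 16

# Backward pass: chain rule from the loss outward.
dL_dy = 2.0 * (y - t)  # derivative of the squared error
dL_dw2 = dL_dy * h     # dy/dw2 = h
dL_dh = dL_dy * w2     # dy/dh = w2
dL_dw1 = dL_dh * x     # dh/dw1 = x
```

Note the reuse: `dL_dh` is computed once and then extended one more step to reach `w1`. That reuse of intermediate results is what makes backpropagation efficient in networks with millions of weights.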
You now have intuition for the core mathematical ideas behind neural networks:
Models — functions with tunable parameters (weights)
Loss — a single number measuring prediction error
Derivatives — tell us which way is "downhill"
Gradient Descent — iteratively step opposite to the derivative
Vectors — lists of numbers; dot products give single outputs
Matrices — grids of weights; matrix × vector = layer transformation
Chain Rule — multiply effects through a chain of operations
Backpropagation — applying chain rule to compute all gradients
These concepts will appear again and again as you learn to build neural networks.
You're now equipped to understand why things work, not just how.
One more tool that appears throughout neural networks: the bell curve.
You'll encounter it when:
• Initializing weights — start with small random values from a Gaussian
• Understanding data — many natural measurements follow this distribution
• Modeling uncertainty — representing noise and confidence
Mean (μ) = the center of the distribution
Standard deviation (σ) = how spread out it is
About 68% of values fall within 1σ of the mean, 95% within 2σ.
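The 68% figure is easy to verify empirically. This sketch draws samples from a standard Gaussian (μ = 0, σ = 1) with Python's built-in `random.gauss` and counts how many land within one standard deviation of the mean:

```python
import random

# Sample a standard Gaussian and measure the fraction within 1 sigma.
random.seed(0)  # fixed seed so the run is repeatable
mu, sigma = 0.0, 1.0
samples = [random.gauss(mu, sigma) for _ in range(100_000)]
within_1sigma = sum(abs(s - mu) <= sigma for s in samples) / len(samples)
```

With 100,000 samples the fraction comes out close to the theoretical 68.3%.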