The Math Behind Neural Networks
Interactive prerequisites for Karpathy's "Neural Networks: Zero to Hero"
Machine learning is about finding patterns in data to make predictions.
A model is a function with tunable parameters. We adjust these parameters until the model's predictions match our data.
Example: Predicting house prices from square footage.
price = slope × sqft + intercept
The slope tells us the price per square foot.
The intercept is the base price.
Our job: find the best values for these parameters.
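The model above can be sketched in a few lines of Python. The parameter values here are made up for illustration, not fitted to any real data:

```python
# A linear model: price = slope * sqft + intercept.
def predict_price(sqft, slope, intercept):
    """Predict a house price from square footage."""
    return slope * sqft + intercept

# Hypothetical parameters: $200 per square foot, $50,000 base price.
price = predict_price(1500, slope=200.0, intercept=50_000.0)
```

Finding "the best values" means choosing `slope` and `intercept` so predictions like this one land close to real prices.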
Here's some house price data. Try to drag the line so it fits the pattern.
How do we know if our line is good? We measure the error — how far off each prediction is.
The red squares show the squared error for each point. We square the errors so they're all positive (errors above and below don't cancel out) and so big errors are penalized more heavily.
The loss is the total area of all squares — a single number measuring how wrong our model is.
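The loss described above can be computed directly. The `(sqft, price)` data points here are hypothetical, just to make the sketch runnable:

```python
# Squared-error loss: sum the squared error over every data point.
data = [(1000, 250_000.0), (1500, 350_000.0), (2000, 460_000.0)]

def loss(slope, intercept):
    total = 0.0
    for sqft, price in data:
        prediction = slope * sqft + intercept
        error = prediction - price   # may be positive or negative
        total += error ** 2          # squaring makes every term positive
    return total
```

With `slope=200.0, intercept=50_000.0`, the first two points are predicted exactly and the third is off by $10,000, so the loss is that one squared error.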
We want to minimize the loss. But there's a catch...
We can only see the loss at our current parameter values.
Which way should we adjust to make it smaller?
What if we could see the loss at every possible parameter value?
Here's a chart showing loss (vertical) for different slope values (horizontal):
Now we can see there's a minimum — a valley where the loss is lowest! That's where we want to be.
But in practice, we can't compute this entire curve. We can only see the loss at our current position...
Even without seeing the whole curve, we can ask a simpler question:
"At my current position, which way is downhill?"
The derivative answers this. It tells us the slope of the curve at our current point.
Positive derivative → curve is going up → move left to descend
Negative derivative → curve is going down → move right to descend
Zero derivative → you're at the bottom (or top) — a flat spot!
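The "which way is downhill" question can be answered numerically, without any calculus machinery. This is a minimal sketch using a symmetric finite difference on a toy loss curve (the curve is invented, with its minimum at slope = 3):

```python
# Estimate the derivative of f at x with a small symmetric step.
def derivative(f, x, eps=1e-6):
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# Toy loss curve with its minimum at slope = 3.
f = lambda slope: (slope - 3.0) ** 2

d = derivative(f, 5.0)  # positive: curve goes up here, so step left
```

At `x = 5.0` the derivative is positive (about 4), so moving left decreases the loss; at `x = 1.0` it is negative, so moving right decreases it.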
Now we have an algorithm! Repeat these steps:
1. Compute the derivative → 2. Step in the opposite direction → 3. Repeat
The learning rate controls step size. Too small = slow progress. Too big = overshoot the minimum.
Notice how steps naturally get smaller near the bottom — because the curve flattens out there!
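The three-step loop above fits in a few lines. This sketch reuses the toy loss from before (minimum at slope = 3) and an assumed learning rate of 0.1:

```python
# Gradient descent on a toy loss curve with its minimum at slope = 3.
def f(slope):
    return (slope - 3.0) ** 2

def derivative(f, x, eps=1e-6):
    return (f(x + eps) - f(x - eps)) / (2 * eps)

slope = 0.0
learning_rate = 0.1
for _ in range(100):
    d = derivative(f, slope)
    slope -= learning_rate * d  # step opposite the derivative
```

Because each step is proportional to the derivative, and the derivative shrinks as the curve flattens near the bottom, the steps shrink automatically.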
Real models have many parameters. With two parameters, the loss becomes a surface:
X-axis = first parameter, Y-axis = second parameter, Height = loss
The goal is the same: find the lowest point (the valley).
The gradient is the derivative for each parameter — a list of numbers pointing toward steepest ascent.
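A gradient can be estimated the same way as a single derivative: one partial derivative per parameter, collected into a list. The two-parameter "bowl" loss here is invented for illustration:

```python
# Toy two-parameter loss with its minimum at (2, 5).
def loss(slope, intercept):
    return (slope - 2.0) ** 2 + (intercept - 5.0) ** 2

# Numerical gradient: nudge each parameter in turn.
def gradient(f, params, eps=1e-6):
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((f(*up) - f(*down)) / (2 * eps))
    return grads

g = gradient(loss, [0.0, 0.0])  # one number per parameter
```

At `(0, 0)` the gradient is roughly `[-4, -10]`: both components point toward steepest ascent, so gradient descent steps in the opposite direction.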
Real neural networks have millions of parameters. We need a clean way to organize all these numbers...
A vector is simply a list. Our house isn't just square footage — it's [sqft, bedrooms, age]. Our weights become [w₁, w₂, w₃].
The dot product combines weights and inputs into a single prediction:
output = (w₁ × x₁) + (w₂ × x₂) + (w₃ × x₃)
This is exactly what a single neuron does: take a vector of inputs, apply a vector of weights, and produce one number.
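A single neuron, then, is just a dot product. The feature values and weights below are illustrative:

```python
# One neuron: dot product of weights and inputs -> one number.
def neuron(weights, inputs):
    return sum(w * x for w, x in zip(weights, inputs))

x = [1500.0, 3.0, 20.0]          # [sqft, bedrooms, age]
w = [200.0, 10_000.0, -1_000.0]  # hypothetical weights [w1, w2, w3]
output = neuron(w, x)
```

Each weight scales one input feature; the negative weight on age means older houses pull the prediction down in this toy example.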
But a neural network layer doesn't have just one neuron — it has many.
Each neuron has its own weights. Stack these weight vectors as rows → you get a matrix.
This matrix multiplication is the neural network layer:
Each row of the matrix = one neuron's weights. The lines are the weights, the circles are the computations.
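A minimal sketch of that matrix-vector product, with made-up weights, shows the layer computing every neuron at once:

```python
# A layer as a matrix-vector product: each row is one neuron's weights,
# so each output element is one neuron's dot product with the input.
def layer(weight_matrix, inputs):
    return [sum(w * x for w, x in zip(row, inputs)) for row in weight_matrix]

W = [[1.0, 2.0],   # neuron 1's weights
     [3.0, 4.0],   # neuron 2's weights
     [5.0, 6.0]]   # neuron 3's weights

x = [10.0, 1.0]    # input vector (2 features)
out = layer(W, x)  # 3 neurons -> 3 outputs
```

Three rows in, three numbers out: the layer maps a 2-dimensional input vector to a 3-dimensional output vector.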
Neural networks stack multiple layers: the output of one becomes the input to the next...
So a neural network is a chain of matrix multiplications, layer after layer.
Each layer transforms its input vector into an output vector, which feeds into the next layer.
The final output is compared to our target to compute the loss.
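The full chain can be sketched end to end. The weights and target below are invented, and for simplicity this sketch omits the nonlinearity that real networks apply between layers:

```python
# Two stacked layers: the output of layer 1 is the input to layer 2.
def layer(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

W1 = [[1.0, -1.0], [0.5, 0.5]]  # layer 1: 2 inputs -> 2 outputs
W2 = [[1.0, 2.0]]               # layer 2: 2 inputs -> 1 output

x = [3.0, 1.0]
h = layer(W1, x)                # hidden vector
y = layer(W2, h)                # final output: a single number
target = 5.0
loss = (y[0] - target) ** 2     # compare output to target
```

The last line is the same squared-error idea from earlier, now applied to the network's final output.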
We want to adjust weights to minimize loss. But there are weights in every layer...
If we tweak a weight in Layer 1, how much does that change the final loss?
The effect ripples through Layers 2, 3, and beyond. We need a way to trace this chain.
The chain rule tells us: multiply the effects at each step.
If changing w changes h by some amount, and changing h changes L by some amount...
Then the total effect of w on L is the product of these effects.
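The "multiply the effects" rule can be checked numerically. In this sketch the chain is invented: h = 3w, then L = h², so dh/dw = 3 and dL/dh = 2h:

```python
# Chain rule: dL/dw = (dL/dh) * (dh/dw) for the chain h = 3*w, L = h**2.
w = 2.0
h = 3.0 * w  # dh/dw = 3
L = h ** 2   # dL/dh = 2*h

dL_dw = (2.0 * h) * 3.0  # multiply the effects: 12 * 3 = 36

# Cross-check against a direct numerical estimate of dL/dw.
eps = 1e-6
numeric = (((3.0 * (w + eps)) ** 2) - ((3.0 * (w - eps)) ** 2)) / (2 * eps)
```

The chain-rule product and the brute-force numerical derivative agree, which is the whole point: we never had to differentiate the composite function directly.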
Backpropagation = applying the chain rule backward through the network.
Starting from the loss, we work backward, computing how much each weight contributed to the error. This gives us the gradient for every weight in the network!
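Here is backpropagation by hand on the smallest possible "network": two weights in a chain, with made-up values. Forward pass first, then the chain rule applied backward from the loss:

```python
# Tiny two-layer chain: h = w1*x, y = w2*h, L = (y - t)**2.
x, t = 2.0, 10.0
w1, w2 = 1.0, 3.0

# Forward pass.
h = w1 * x        # h = 2
y = w2 * h        # y = 6
L = (y - t) ** 2  # L = 16

# Backward pass: chain rule from the loss outward.
dL_dy = 2.0 * (y - t)  # derivative of the squared error
dL_dw2 = dL_dy * h     # dy/dw2 = h
dL_dh = dL_dy * w2     # dy/dh = w2
dL_dw1 = dL_dh * x     # dh/dw1 = x
```

Note the reuse: `dL_dh` is computed once and then extended one more step to reach `w1`. That reuse of intermediate results is what makes backpropagation efficient in networks with millions of weights.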
You now have intuition for the core mathematical ideas behind neural networks:
Models — functions with tunable parameters (weights)
Loss — a single number measuring prediction error
Derivatives — tell us which way is "downhill"
Gradient Descent — iteratively step opposite to the derivative
Vectors — lists of numbers; dot products give single outputs
Matrices — grids of weights; matrix × vector = layer transformation
Chain Rule — multiply effects through a chain of operations
Backpropagation — applying chain rule to compute all gradients
These concepts will appear again and again as you learn to build neural networks.
You're now equipped to understand why things work, not just how.
One more tool that appears throughout neural networks: the bell curve.
You'll encounter it when:
• Initializing weights — start with small random values from a Gaussian
• Understanding data — many natural measurements follow this distribution
• Modeling uncertainty — representing noise and confidence
Mean (μ) = the center of the distribution
Standard deviation (σ) = how spread out it is
About 68% of values fall within 1σ of the mean, 95% within 2σ.
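The 68% figure is easy to verify empirically. This sketch draws samples from a standard Gaussian (μ = 0, σ = 1) with Python's built-in `random.gauss` and counts how many land within one standard deviation of the mean:

```python
import random

# Sample a standard Gaussian and measure the fraction within 1 sigma.
random.seed(0)  # fixed seed so the run is repeatable
mu, sigma = 0.0, 1.0
samples = [random.gauss(mu, sigma) for _ in range(100_000)]
within_1sigma = sum(abs(s - mu) <= sigma for s in samples) / len(samples)
```

With 100,000 samples the fraction comes out close to the theoretical 68.3%.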