This lesson connects the abstract linear algebra from previous lessons directly to the operations inside modern ML systems — from a simple linear regression to transformer attention.
Linear Regression: The Simplest ML Model
Linear regression fits a line (or hyperplane) to data. In matrix form:
y = Xw + b
Where:
- X: n × d matrix of input features (n samples, d features)
- w: d × 1 weight vector
- b: scalar bias
- y: n × 1 vector of predictions
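The optimal weights minimise the squared error loss (the mean squared error up to a constant factor of 1/n). Here y denotes the observed targets, and the bias is absorbed into w by appending a column of ones to X:
L = ‖y_pred - y_true‖² = ‖Xw - y‖²
The closed-form solution (Normal Equation):
w* = (XᵀX)⁻¹ Xᵀ y
This is matrix algebra giving the optimal weights in one shot, provided XᵀX is invertible. For large datasets, gradient descent is preferred over computing the inverse, but the equation reveals the structure.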
import numpy as np
# Fit linear regression via Normal Equation
def fit(X, y):
    # Solve the linear system (XᵀX) w = Xᵀy for w
    return np.linalg.solve(X.T @ X, X.T @ y)
np.linalg.solve avoids computing the inverse directly, which is numerically safer.
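As a quick check of the two routes mentioned above, the sketch below fits the same synthetic data with the Normal Equation and with plain gradient descent. The data, the learning rate `lr`, the step count, and the helper names `fit_normal_equation` and `fit_gradient_descent` are illustrative assumptions, not part of the lesson. The bias is handled by appending a column of ones to X.

```python
import numpy as np

def fit_normal_equation(X, y):
    # Solve (X^T X) w = X^T y without forming the inverse
    return np.linalg.solve(X.T @ X, X.T @ y)

def fit_gradient_descent(X, y, lr=0.1, steps=2000):
    # Minimise the mean squared error (1/n) * ||Xw - y||^2 by gradient descent
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        grad = (2.0 / n) * X.T @ (X @ w - y)  # gradient of the MSE w.r.t. w
        w -= lr * grad
    return w

# Synthetic data (assumed for illustration): two features plus a bias column
rng = np.random.default_rng(0)
n = 200
features = rng.normal(size=(n, 2))
X = np.hstack([features, np.ones((n, 1))])  # column of ones models the bias b
true_w = np.array([2.0, -1.0, 0.5])         # last entry plays the role of b
y = X @ true_w + 0.01 * rng.normal(size=n)

w_ne = fit_normal_equation(X, y)
w_gd = fit_gradient_descent(X, y)
print(w_ne, w_gd)
```

On well-conditioned data like this, both routes should land on essentially the same weights. Gradient descent gets there with repeated matrix-vector products instead of solving a d × d system.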
Neural Networks: Stacked Matrix Operations
A neural network with L layers computes:
h₀ = x # input
h₁ = σ(W₁h₀ + b₁) # layer 1
h₂ = σ(W₂h₁ + b₂) # layer 2
...
output = Wₗhₗ₋₁ + bₗ # output layer
Each hidden layer is: (1) a matrix multiply Wᵢhᵢ₋₁, (2) add bias bᵢ, (3) apply non-linearity σ. The output layer shown here skips step (3).
The non-linearity (ReLU, sigmoid, tanh) is what makes neural networks more than just one big linear transformation. Without it, stacked linear layers collapse to a single linear layer.
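The collapse claim can be checked on a tiny hand-picked example. In this sketch the matrices `W1`, `W2` and the input `x` are arbitrary values chosen for illustration, and biases are omitted for brevity.

```python
import numpy as np

# Hand-picked toy values (assumptions for illustration only)
W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
W2 = np.array([[1.0, 1.0]])
x = np.array([1.0, 0.0])

def relu(z):
    return np.maximum(z, 0.0)

# Two linear layers with no non-linearity in between...
two_layers = W2 @ (W1 @ x)
# ...equal a single linear layer whose weight matrix is the product W2 W1
one_layer = (W2 @ W1) @ x

# With a ReLU between the layers the result differs from that single linear map
with_relu = W2 @ relu(W1 @ x)

print(two_layers, one_layer, with_relu)
```

Here `W1 @ x` has a negative entry, so ReLU zeroes it out and the non-linear network produces a different output from the collapsed linear one.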
Batched computation: processing n samples simultaneously is a single matrix multiply:
# W: (out_features, in_features)
# X: (batch_size, in_features)
# output: (batch_size, out_features)
output = X @ W.T + b
GPUs are optimised for exactly this: thousands of cores performing multiply-accumulate operations in parallel.
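To see that the batched form really is the per-sample computation done all at once, the sketch below compares it against an explicit loop. The shapes and random values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, in_features, out_features = 5, 3, 2

W = rng.normal(size=(out_features, in_features))
b = rng.normal(size=out_features)            # one bias per output feature
X = rng.normal(size=(batch_size, in_features))

# One matrix multiply for the whole batch; b is broadcast across the rows
batched = X @ W.T + b

# The same computation, one sample at a time
looped = np.stack([W @ x + b for x in X])

print(batched.shape, np.allclose(batched, looped))
```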
Backpropagation: Gradients as Matrix Operations
Training a neural network requires computing how much each weight contributed to the loss — the gradient. This is the chain rule applied to matrix operations.
For a layer y = Wx, with x and ∂L/∂y treated as column vectors, the gradients are:
∂L/∂W = (∂L/∂y) xᵀ # outer product
∂L/∂x = Wᵀ (∂L/∂y) # matrix-vector multiply
Every backward pass is a sequence of matrix operations: transposes and multiplies of the same weight matrices used in the forward pass.
# Manual forward + backward for one linear layer
def forward(W, x):
    return W @ x
def backward(W, x, grad_output):
    grad_W = np.outer(grad_output, x) # gradient for W
    grad_x = W.T @ grad_output # gradient for x
    return grad_W, grad_x
This is why automatic differentiation frameworks (PyTorch, JAX) are built on tensors: they record operations as a computation graph and walk it in reverse, applying transposed matrix operations along the way.
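One way to gain confidence in hand-written gradients like these is a finite-difference check. The sketch below assumes a toy loss L = ½‖Wx‖² (so ∂L/∂y = y), which is not from the lesson; the random shapes and the step size `eps` are likewise illustrative choices.

```python
import numpy as np

def backward(W, x, grad_output):
    grad_W = np.outer(grad_output, x)  # (dL/dy) x^T
    grad_x = W.T @ grad_output         # W^T (dL/dy)
    return grad_W, grad_x

def loss(W, x):
    # Assumed toy loss: L = 0.5 * ||Wx||^2, hence dL/dy = y
    y = W @ x
    return 0.5 * np.sum(y ** 2)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
x = rng.normal(size=4)

grad_W, grad_x = backward(W, x, grad_output=W @ x)

# Central finite differences, one entry of W at a time
eps = 1e-6
numeric_grad_W = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        numeric_grad_W[i, j] = (loss(W_plus, x) - loss(W_minus, x)) / (2 * eps)

# The same check for each entry of x
numeric_grad_x = np.zeros_like(x)
for j in range(x.size):
    x_plus, x_minus = x.copy(), x.copy()
    x_plus[j] += eps
    x_minus[j] -= eps
    numeric_grad_x[j] = (loss(W, x_plus) - loss(W, x_minus)) / (2 * eps)

print(np.allclose(grad_W, numeric_grad_W, atol=1e-5),
      np.allclose(grad_x, numeric_grad_x, atol=1e-5))
```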
Embeddings: Vectors as Meaning
An embedding is a function that maps discrete objects (words, users, products) to dense vectors in a learned space.
In a word embedding, the embedding matrix E is of shape (vocab_size, embedding_dim):
E = np.random.randn(50000, 512) # vocab size 50k, 512-dim embeddings
word_vector = E[word_id] # O(1) lookup: just matrix row indexing
After training, semantically related words end up close in the embedding space (high cosine similarity):
cosine_sim(E["king"] - E["man"] + E["woman"], E["queen"]) ≈ 0.9
Here E["king"] is shorthand for the row of E at that word's id, and the exact similarity depends on the trained model. This "vector arithmetic" on meanings only works because the embedding space has learned linear structure: relationships are encoded as directions.
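The mechanics of the analogy can be sketched with a hand-made toy embedding. These 2-dimensional vectors are not trained; they are constructed (as an assumption, purely for illustration) so that one axis stands for "royalty" and the other for "gender". The `vocab` mapping and the `embed` helper are hypothetical names.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine of the angle between two vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical word -> row id mapping and hand-built embedding matrix
vocab = {"king": 0, "queen": 1, "man": 2, "woman": 3}
E = np.array([
    [1.0,  1.0],   # king:  royal, male
    [1.0, -1.0],   # queen: royal, female
    [0.0,  1.0],   # man:   not royal, male
    [0.0, -1.0],   # woman: not royal, female
])

def embed(word):
    return E[vocab[word]]  # row lookup by word id

analogy = embed("king") - embed("man") + embed("woman")
similarity = cosine_sim(analogy, embed("queen"))
print(similarity)
```

Because the toy vectors encode gender as a single direction, subtracting "man" and adding "woman" moves "king" exactly onto "queen", so the similarity is as high as it can be. Trained embeddings only approximate this structure.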
Attention: Matrix Operations All the Way Down
The scaled dot-product attention in transformers is pure linear algebra:
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
Where Q (queries), K (keys), V (values) are all matrices: linear projections of the input.
Breaking it down:
- QKᵀ: matrix multiply, computes dot products between all query-key pairs → (n × n) attention score matrix
- / √dₖ: scalar division, keeps large dot products from saturating the softmax, which would make gradients vanish
- softmax(...): row-wise normalisation, converts scores to probability distributions
- × V: weighted sum of value vectors, produces output
def attention(Q, K, V):
    d_k = Q.shape[-1]
    # np.swapaxes transposes the last two axes of K (K.transpose(-2, -1) is the PyTorch spelling)
    scores = Q @ np.swapaxes(K, -2, -1) / np.sqrt(d_k)
    weights = softmax(scores) # softmax: assumed helper, normalises along the last axis
    return weights @ V
The entire attention mechanism is four matrix operations. The "magic" of transformers is that the projections producing Q, K and V are learned: the model figures out which queries should attend to which keys.
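A self-contained NumPy sketch of the attention formula follows, including a row-wise `softmax` helper and random projection matrices. The projection names `W_q`, `W_k`, `W_v`, the sizes, and the random input are assumptions for illustration. In a real transformer the projections are learned rather than sampled.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax; subtracting the row max avoids overflow in exp
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -2, -1) / np.sqrt(d_k)  # (n, n) score matrix
    weights = softmax(scores)                           # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8                 # 4 tokens, 8-dimensional vectors
X = rng.normal(size=(n, d_model))         # stand-in for token representations

# Random stand-ins for the learned projection matrices
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
output, weights = attention(Q, K, V)
print(output.shape, weights.sum(axis=-1))
```

Each row of `weights` is a probability distribution over the tokens, and each row of `output` is the corresponding weighted average of the value vectors.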
Matrix Factorisation for Recommendations
Collaborative filtering models user-item interactions as a low-rank matrix factorisation:
R ≈ U × Vᵀ
Where:
- R: (users × items) rating matrix (sparse, mostly unknown)
- U: (users × k) user embedding matrix
- V: (items × k) item embedding matrix
- k: latent dimension (typical: 32–256)
Training minimises the difference between predicted and observed ratings on known entries. After training:
# Predict rating for user u, item i
predicted_rating = U[u] @ V[i] # dot product of embeddings
# Recommend top-N items for user u
scores = U[u] @ V.T # dot product with all items at once
top_items = scores.argsort()[-N:][::-1] # indices of the N highest scores, best first
This is the idea behind SVD-style recommendation: the same kind of low-rank decomposition as in the eigenvalue lesson, but fitted only on the observed entries and then used to fill in the missing entries of a sparse matrix.
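The training step described above can be sketched with full-batch gradient descent on the observed entries only. Everything here is an illustrative assumption: the synthetic rank-2 ratings, the random `observed` mask, the learning rate, and the step count. Practical systems typically add regularisation and use SGD or alternating least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 6, 5, 2

# Synthetic low-rank "ratings" and a mask of which entries are observed
R = rng.normal(size=(n_users, k)) @ rng.normal(size=(k, n_items))
observed = rng.random((n_users, n_items)) < 0.7

# Small random initial embeddings
U = 0.1 * rng.normal(size=(n_users, k))
V = 0.1 * rng.normal(size=(n_items, k))

def observed_mse(U, V):
    # Mean squared error over the observed entries only
    err = (U @ V.T - R) * observed
    return np.sum(err ** 2) / observed.sum()

initial_loss = observed_mse(U, V)

lr = 0.01
for _ in range(2000):
    err = (U @ V.T - R) * observed  # unobserved entries contribute zero error
    grad_U = err @ V                # gradient of 0.5 * sum(err^2) w.r.t. U
    grad_V = err.T @ U              # gradient of 0.5 * sum(err^2) w.r.t. V
    U -= lr * grad_U
    V -= lr * grad_V

final_loss = observed_mse(U, V)

# Recommend the top-N items for one user from the learned embeddings
u, N = 0, 3
scores = U[u] @ V.T
top_items = scores.argsort()[-N:][::-1]
print(initial_loss, final_loss, top_items)
```

Masking the error matrix is what distinguishes this from a plain SVD: entries that were never rated exert no pull on the embeddings, yet `U @ V.T` still produces a prediction for them.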
Key Takeaways
- Linear regression's optimal weights are w* = (XᵀX)⁻¹Xᵀy — a closed-form matrix operation.
- A neural network forward pass is a sequence of matrix multiplies + bias additions + non-linearities. Batching needs no extra machinery: it is just one bigger matrix multiply.
- Backpropagation computes gradients using transposed versions of the same weight matrices — chain rule over matrix operations.
- Embeddings are matrix row lookups. Their geometry encodes semantic relationships as directions and distances.
- Transformer attention is softmax(QKᵀ/√dₖ)V: four matrix operations that compute which tokens attend to which.
- Collaborative filtering decomposes a sparse rating matrix into two low-rank embedding matrices, so recommendations are dot products.