A matrix is a rectangular array of numbers — and more fundamentally, a description of a linear transformation. Every layer in a neural network, every coordinate transformation in computer graphics, and every system of linear equations is captured by matrix operations.
Matrix Basics
An m × n matrix has m rows and n columns:
A = [[1, 2, 3],
     [4, 5, 6]]  # 2×3 matrix

import numpy as np
A = np.array([[1, 2, 3],
              [4, 5, 6]])
A.shape  # (2, 3)

Element access: A[i][j] is the element in row i, column j (0-indexed in code).
Matrix Multiplication
For matrices A (m × n) and B (n × p), the product C = AB is m × p:
C[i][j] = sum(A[i][k] × B[k][j] for k in range(n))

The number of columns in A must equal the number of rows in B.
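The summation formula can be written out as three nested loops. The sketch below is for illustration only: `matmul_naive` is a hypothetical helper name, not a NumPy function, and the example matrices are arbitrary. In real code, `A @ B` does the same job far faster.

```python
import numpy as np

def matmul_naive(A, B):
    """Multiply A (m x n) by B (n x p) directly from the definition."""
    m, n = A.shape
    n_b, p = B.shape
    if n != n_b:
        raise ValueError(f"shape mismatch: {A.shape} and {B.shape}")
    C = np.zeros((m, p))
    for i in range(m):          # each row of A
        for j in range(p):      # each column of B
            for k in range(n):  # sum over the shared dimension
                C[i, j] += A[i, k] * B[k, j]
    return C

A = np.array([[1, 2, 3],
              [4, 5, 6]])       # 2×3
B = np.array([[1, 0],
              [0, 1],
              [1, 1]])          # 3×2
C = matmul_naive(A, B)          # 2×2 result
same_as_numpy = np.allclose(C, A @ B)
```

The three loops run m × p × n times in total, which is where the cost of the standard algorithm comes from.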
A = np.array([[1, 2],
              [3, 4]])  # 2×2
B = np.array([[5, 6],
              [7, 8]])  # 2×2
C = A @ B
# C[0][0] = 1×5 + 2×7 = 19
# C[0][1] = 1×6 + 2×8 = 22
# C[1][0] = 3×5 + 4×7 = 43
# C[1][1] = 3×6 + 4×8 = 50

Complexity: O(m × n × p) for the standard algorithm, so multiplying large matrices is expensive. This is why hardware accelerators (GPUs, TPUs) are designed to perform this operation efficiently.
Matrix multiplication is not commutative: AB ≠ BA in general.
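A quick check makes the non-commutativity concrete. This sketch reuses the two 2×2 matrices from the multiplication example and compares both orderings.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

AB = A @ B   # [[19, 22], [43, 50]]
BA = B @ A   # [[23, 34], [31, 46]]
commutes = np.array_equal(AB, BA)   # False
```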
Matrices as Linear Transformations
Every matrix A represents a function that transforms vectors — a linear transformation.
Applying A to vector x produces a new vector Ax:
A = np.array([[2, 0],
              [0, 3]])  # scales x by 2, y by 3
x = np.array([1, 1])
A @ x  # [2, 3]

Common transformations in 2D:
| Matrix | Effect |
|---|---|
| [[2,0],[0,2]] | Scale by 2 (uniform) |
| [[1,0],[0,-1]] | Reflect over x-axis |
| [[0,-1],[1,0]] | Rotate 90° counterclockwise |
| [[1,s],[0,1]] | Shear horizontally |
| [[1,0],[0,0]] | Project onto x-axis |
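To see the table in action, the sketch below applies each matrix to one test vector. The vector `v` and the shear factor `s = 0.5` are arbitrary choices made for this example.

```python
import numpy as np

v = np.array([1, 2])
s = 0.5  # shear factor, chosen arbitrarily for illustration

scale   = np.array([[2, 0], [0, 2]])
reflect = np.array([[1, 0], [0, -1]])
rotate  = np.array([[0, -1], [1, 0]])
shear   = np.array([[1, s], [0, 1]])
project = np.array([[1, 0], [0, 0]])

scaled    = scale @ v     # [2, 4]
reflected = reflect @ v   # [1, -2]
rotated   = rotate @ v    # [-2, 1]
sheared   = shear @ v     # [2.0, 2.0]
projected = project @ v   # [1, 0]

# Multiplying matrices composes the transformations:
# (scale @ rotate) @ v rotates first, then scales.
composed = (scale @ rotate) @ v   # [-4, 2]
```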
Transpose
The transpose Aᵀ swaps rows and columns: Aᵀ[i][j] = A[j][i].
A = np.array([[1, 2, 3],
              [4, 5, 6]])  # shape (2, 3)
A.T  # shape (3, 2)
# [[1, 4],
#  [2, 5],
#  [3, 6]]

Key property: (AB)ᵀ = BᵀAᵀ.
A symmetric matrix satisfies A = Aᵀ. Covariance matrices in statistics are symmetric, as are many matrices in physics and ML.
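Both facts are easy to check numerically. The sketch below uses two small integer matrices (arbitrary values) to confirm the transpose-of-a-product rule, then builds a symmetric matrix as AᵀA, which is symmetric for any A.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])      # 2×3
B = np.array([[1, 0],
              [2, 1],
              [0, 3]])         # 3×2

lhs = (A @ B).T                # transpose of the product
rhs = B.T @ A.T                # product of the transposes, order reversed
rule_holds = np.array_equal(lhs, rhs)      # True

S = A.T @ A                    # 3×3, symmetric by construction
is_symmetric = np.array_equal(S, S.T)      # True
```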
Identity and Inverse
The identity matrix I is the matrix equivalent of the number 1 — multiplying by it changes nothing.
I = np.eye(3)  # 3×3 identity
# [[1, 0, 0],
#  [0, 1, 0],
#  [0, 0, 1]]

The inverse of a square matrix A, written A⁻¹, satisfies AA⁻¹ = A⁻¹A = I.
Not all matrices are invertible. A square matrix is invertible if and only if its determinant is non-zero.
A = np.array([[1., 2.],
              [3., 4.]])
A_inv = np.linalg.inv(A)
A @ A_inv  # ≈ identity (up to floating-point error)

Solving linear systems: the system Ax = b has solution x = A⁻¹b (if A is invertible). In practice, avoid computing the inverse explicitly; use np.linalg.solve(A, b), which solves the system directly via a matrix factorization and is both faster and more numerically stable.
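A minimal comparison of the two approaches follows. The right-hand side `b` is an assumption chosen so that the exact solution is x = [1, 2].

```python
import numpy as np

A = np.array([[1., 2.],
              [3., 4.]])
b = np.array([5., 11.])        # chosen so that x = [1, 2] solves Ax = b

x_solve = np.linalg.solve(A, b)    # preferred: solves the system directly
x_inv = np.linalg.inv(A) @ b       # works here, but less robust in general

# A residual close to zero confirms the solution
residual = A @ x_solve - b
```

On a small, well-conditioned system like this one both give the same answer; the difference shows up with large or nearly singular matrices.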
Determinant
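The determinant det(A) is a scalar defined for square matrices. Its absolute value measures how much the transformation scales area (2D) or volume (nD), and its sign indicates whether orientation is preserved.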
For a 2×2 matrix:
det([[a, b], [c, d]]) = ad - bc

- det(A) > 0: transformation preserves orientation
- det(A) < 0: transformation flips orientation
- det(A) = 0: transformation collapses space to a lower dimension — matrix is not invertible
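The three cases can be illustrated with matrices from earlier sections. In this sketch, `det2` is a hypothetical helper that implements the 2×2 formula by hand so it can be compared against NumPy's routine.

```python
import numpy as np

def det2(M):
    """Determinant of a 2×2 matrix via ad - bc."""
    (a, b), (c, d) = M
    return a * d - b * c

scale   = np.array([[2., 0.], [0., 3.]])    # det = 6: areas grow by a factor of 6
reflect = np.array([[1., 0.], [0., -1.]])   # det = -1: orientation flips
project = np.array([[1., 0.], [0., 0.]])    # det = 0: the plane collapses to a line

dets = {"scale": det2(scale),
        "reflect": det2(reflect),
        "project": det2(project)}

# The hand-written formula agrees with NumPy's general routine
matches_numpy = all(np.isclose(det2(M), np.linalg.det(M))
                    for M in (scale, reflect, project))
```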
np.linalg.det(A)

Rank
The rank of a matrix is the number of linearly independent rows (= number of linearly independent columns). It is the dimension of the column space — how much of the output space the matrix can actually reach.
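- Full rank: rank = min(m, n). If rank = n (full column rank), distinct inputs map to distinct outputs, so the transformation loses no information; if rank = m (full row rank), every point of the output space is reachable.
- Rank-deficient: rank < min(m, n). The transformation collapses information: inputs that differ by a vector in the null space all map to the same output.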
In ML: low-rank matrices appear in model compression (LoRA fine-tuning approximates weight updates with low-rank matrices), and rank deficiency signals redundant features.
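Rank can be computed with `np.linalg.matrix_rank`. The sketch below contrasts a full-rank and a rank-deficient matrix, then builds a low-rank product in the spirit of LoRA. The factor names, the shapes (8×2 and 2×6), and the random values are assumptions for illustration, not taken from any particular LoRA implementation.

```python
import numpy as np

full = np.array([[1., 2.],
                 [3., 4.]])
deficient = np.array([[1., 2.],
                      [2., 4.]])    # second row = 2 × first row

rank_full = np.linalg.matrix_rank(full)            # 2
rank_deficient = np.linalg.matrix_rank(deficient)  # 1

# Low-rank update: an (8×2) @ (2×6) product fills an 8×6 matrix
# but its rank can be at most 2.
rng = np.random.default_rng(0)
B_factor = rng.standard_normal((8, 2))   # hypothetical factor names
A_factor = rng.standard_normal((2, 6))
delta_W = B_factor @ A_factor
rank_delta = np.linalg.matrix_rank(delta_W)        # at most 2

params_full = 8 * 6                 # 48 entries stored directly
params_low_rank = 8 * 2 + 2 * 6     # 28 entries stored as two factors
```

The saving grows with matrix size: storing two thin factors costs far fewer parameters than storing the full matrix.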
Neural Network Layers as Matrix Operations
A fully connected (dense) layer computes:
output = activation(W @ input + b)

Where:
- W is the weight matrix, a linear transformation
- b is the bias vector
- activation applies a non-linear function element-wise
For a layer with 512 inputs and 256 outputs, W is a 256×512 matrix. A forward pass through this layer is one matrix-vector multiply — 256 × 512 = 131,072 multiplications.
A batch of n inputs is represented as an n × 512 matrix. The layer processes the entire batch as a single matrix multiply:
# Input: (batch_size, 512)
# W: (256, 512)
# Output: (batch_size, 256)
output = input @ W.T + b

Everything from attention mechanisms in transformers to convolutional filters reduces to matrix operations, which is why GPU acceleration is so impactful.
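The batched layer can be sketched end to end as follows. The random weights, the batch size of 4, and the choice of ReLU as the activation are assumptions made for this example. It also checks that the batched multiply matches applying `W @ x + b` to each sample separately.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, n_in, n_out = 4, 512, 256

W = rng.standard_normal((n_out, n_in)) * 0.01   # weights, shape (256, 512)
b = np.zeros(n_out)                              # bias, shape (256,)
X = rng.standard_normal((batch_size, n_in))      # batch, shape (4, 512)

def relu(z):
    """Element-wise non-linearity (one common choice of activation)."""
    return np.maximum(z, 0.0)

# Batched forward pass: one matrix multiply for the whole batch
out_batch = relu(X @ W.T + b)                    # shape (4, 256)

# Same computation, one sample at a time
out_single = np.stack([relu(W @ x + b) for x in X])
batched_matches = np.allclose(out_batch, out_single)
```

The bias `b` has shape (256,) and is broadcast across the batch dimension, so the same vector is added to every row of `X @ W.T`.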
Key Takeaways
- An m×n matrix represents a linear transformation from n-dimensional to m-dimensional space.
- Matrix multiplication AB: the number of columns of A must equal the number of rows of B; for A (m×n) and B (n×p) the result is m×p. Not commutative.
- Transpose swaps rows and columns: used in attention, covariance matrices, and gradient computation.
- The inverse A⁻¹ undoes the transformation. Non-invertible matrices have determinant = 0 and collapse space.
- Rank measures how much information a transformation preserves: full column rank (rank = n) means no information loss.
- A fully connected neural network layer is a matrix multiply plus a bias, followed by a non-linearity. GPUs and TPUs are heavily optimised for this operation.