Matrices and Linear Transformations — Linear Algebra for Machine Learning | Sabaoon Academy

A matrix is a rectangular array of numbers — and more fundamentally, a description of a linear transformation. Every layer in a neural network, every coordinate transformation in computer graphics, and every system of linear equations is captured by matrix operations.

Matrix Basics

An m × n matrix has m rows and n columns:

A = [[1, 2, 3],
     [4, 5, 6]]    # 2×3 matrix

import numpy as np
A = np.array([[1, 2, 3],
              [4, 5, 6]])
A.shape   # (2, 3)

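Element access: A[i][j] (or, more idiomatically in NumPy, A[i, j]) is the element in row i, column j. Indices are 0-based in code, whereas mathematical notation usually counts from 1.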
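As a small illustration of the indexing rule, the sketch below reads single elements, a whole row, and a whole column from the 2×3 matrix defined above. The variable names are chosen here for readability and are not part of any API.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])

first_row_second_col = A[0, 1]   # row 0, column 1 -> 2
same_element = A[0][1]           # chained indexing reaches the same element
last_row = A[1]                  # whole row 1 -> array([4, 5, 6])
third_column = A[:, 2]           # every row, column 2 -> array([3, 6])

print(first_row_second_col, same_element, last_row, third_column)
```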

Matrix Multiplication

For matrices A (m × n) and B (n × p), the product C = AB is m × p:

C[i][j] = sum(A[i][k] × B[k][j] for k in range(n))

The number of columns in A must equal the number of rows in B.

A = np.array([[1, 2],
              [3, 4]])   # 2×2

B = np.array([[5, 6],
              [7, 8]])   # 2×2

C = A @ B
# C[0][0] = 1×5 + 2×7 = 19
# C[0][1] = 1×6 + 2×8 = 22
# C[1][0] = 3×5 + 4×7 = 43
# C[1][1] = 3×6 + 4×8 = 50
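To make the summation formula concrete, here is a minimal sketch that implements it with three explicit loops on plain Python lists. The helper name `matmul` is hypothetical and used only for this illustration; in real code the `@` operator is far faster.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows, using the definition directly."""
    m, n = len(A), len(A[0])
    if len(B) != n:
        raise ValueError("columns of A must equal rows of B")
    p = len(B[0])
    C = [[0] * p for _ in range(m)]
    for i in range(m):          # each row of A
        for j in range(p):      # each column of B
            for k in range(n):  # shared inner dimension
                C[i][j] += A[i][k] * B[k][j]
    return C

C = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(C)  # [[19, 22], [43, 50]]
```

The three nested loops run m × p × n times in total, which is where the cost of the straightforward algorithm comes from.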

Complexity: the straightforward algorithm performs O(m × n × p) scalar multiplications, so multiplying large matrices is expensive. This is why hardware accelerators (GPUs, TPUs) are so heavily optimised for this operation.

Matrix multiplication is not commutative: AB ≠ BA in general.
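A quick check with the two 2×2 matrices from the example above shows the order dependence. This is only one counterexample; some specific pairs of matrices do commute.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

AB = A @ B   # [[19, 22], [43, 50]]
BA = B @ A   # [[23, 34], [31, 46]]

print(np.array_equal(AB, BA))  # False
```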

Matrices as Linear Transformations

Every matrix A represents a function that transforms vectors — a linear transformation.

Applying A to vector x produces a new vector Ax:

A = np.array([[2, 0],
              [0, 3]])   # scales x by 2, y by 3

x = np.array([1, 1])
A @ x   # [2, 3]
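What makes the transformation linear is that it preserves addition and scalar multiplication: A(ax + by) = a(Ax) + b(Ay). The sketch below checks this for the scaling matrix above; the second vector y and the scalars a and b are arbitrary values picked for illustration.

```python
import numpy as np

A = np.array([[2, 0],
              [0, 3]])
x = np.array([1, 1])
y = np.array([4, -2])   # arbitrary second vector, chosen for illustration
a, b = 3, 5             # arbitrary scalars

left = A @ (a * x + b * y)          # transform the combination
right = a * (A @ x) + b * (A @ y)   # combine the transformed vectors

print(left, right)                  # both are [ 46 -21]
```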

Common transformations in 2D:

Matrix	Effect
`[[2,0],[0,2]]`	Scale by 2 (uniform)
`[[1,0],[0,-1]]`	Reflect over x-axis
`[[0,-1],[1,0]]`	Rotate 90° counterclockwise
`[[1,s],[0,1]]`	Shear horizontally
`[[1,0],[0,0]]`	Project onto x-axis
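The table can be checked by applying each matrix to a test vector. The sketch below uses v = [1, 2] and a shear factor s = 1, both arbitrary choices made for this example.

```python
import numpy as np

v = np.array([1, 2])
s = 1  # shear factor, chosen arbitrarily for this example

transforms = {
    "scale by 2":        np.array([[2, 0], [0, 2]]),
    "reflect over x":    np.array([[1, 0], [0, -1]]),
    "rotate 90 deg ccw": np.array([[0, -1], [1, 0]]),
    "shear":             np.array([[1, s], [0, 1]]),
    "project onto x":    np.array([[1, 0], [0, 0]]),
}

results = {name: (M @ v).tolist() for name, M in transforms.items()}
for name, out in results.items():
    print(f"{name:>18}: {out}")
```

For this input the outputs are [2, 4], [1, -2], [-2, 1], [3, 2] and [1, 0] respectively.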

Transpose

The transpose Aᵀ swaps rows and columns: Aᵀ[i][j] = A[j][i].

A = np.array([[1, 2, 3],
              [4, 5, 6]])   # shape (2, 3)

A.T   # shape (3, 2)
# [[1, 4],
#  [2, 5],
#  [3, 6]]

Key property: (AB)ᵀ = BᵀAᵀ.
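The reversal of order in this identity is easy to verify numerically. In the sketch below, B is an arbitrary 3×2 matrix introduced only so that the product AB is defined.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])        # shape (2, 3)
B = np.array([[1, 0],
              [2, 1],
              [0, 3]])           # shape (3, 2), arbitrary example values

lhs = (A @ B).T                  # transpose of the product
rhs = B.T @ A.T                  # product of transposes, order reversed

print(np.array_equal(lhs, rhs))  # True
```

Keeping the original order does not work here: A.T @ B.T has shape (3, 3), while (A @ B).T has shape (2, 2).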

A symmetric matrix satisfies A = Aᵀ. Covariance matrices in statistics are symmetric, as are many matrices in physics and ML.
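As a sketch of the covariance case, the code below builds a covariance matrix from a small made-up data set (4 samples, 3 features) and checks that it equals its own transpose. The data values are invented purely for illustration.

```python
import numpy as np

# Small made-up data set: 4 samples (rows), 3 features (columns).
X = np.array([[1.0, 2.0, 0.5],
              [2.0, 1.0, 1.5],
              [3.0, 4.0, 2.5],
              [4.0, 3.0, 3.0]])

cov = np.cov(X, rowvar=False)    # 3x3 covariance matrix of the features

print(cov.shape)                 # (3, 3)
print(np.allclose(cov, cov.T))   # True
```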

Identity and Inverse

The identity matrix I is the matrix equivalent of the number 1 — multiplying by it changes nothing.

I = np.eye(3)   # 3×3 identity
# [[1, 0, 0],
#  [0, 1, 0],
#  [0, 0, 1]]

The inverse of a square matrix A, written A⁻¹, satisfies AA⁻¹ = A⁻¹A = I.

Not all matrices are invertible. Only square matrices can have an inverse, and a square matrix is invertible if and only if its determinant is non-zero.

A = np.array([[1., 2.],
              [3., 4.]])
A_inv = np.linalg.inv(A)
A @ A_inv   # ≈ identity (floating point errors)
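For contrast, here is a minimal sketch of what happens with a non-invertible matrix. The matrix S is an example made up for this purpose: its second row is twice the first, so its determinant is zero. Note that matrices that are only nearly singular may not raise an error and can instead return a wildly inaccurate inverse.

```python
import numpy as np

S = np.array([[1., 2.],
              [2., 4.]])   # second row is 2x the first

det_S = np.linalg.det(S)   # 1*4 - 2*2 = 0

try:
    np.linalg.inv(S)
    inverted = True
except np.linalg.LinAlgError:
    inverted = False       # NumPy refuses to invert a singular matrix

print(det_S, inverted)
```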

Solving linear systems: the system Ax = b has solution x = A⁻¹b (if A is invertible). In practice, avoid computing the inverse explicitly: np.linalg.solve(A, b) solves the system directly through a matrix factorisation, which is faster and numerically more accurate.
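The two approaches can be compared on a small system. The right-hand side b below is an arbitrary choice for illustration; on a well-conditioned 2×2 system both routes agree, and the difference in accuracy and speed only shows up on larger or ill-conditioned problems.

```python
import numpy as np

A = np.array([[1., 2.],
              [3., 4.]])
b = np.array([5., 6.])              # right-hand side chosen for illustration

x = np.linalg.solve(A, b)           # preferred: solves Ax = b directly
x_via_inv = np.linalg.inv(A) @ b    # works here, but not recommended in general

print(x)                            # approximately [-4.   4.5]
print(np.allclose(A @ x, b))        # True
```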

Determinant

The determinant det(A), defined only for square matrices, is a scalar whose absolute value is the factor by which the transformation scales area (in 2D) or volume (in higher dimensions).

For a 2×2 matrix:

det([[a, b], [c, d]]) = ad - bc

det(A) > 0: transformation preserves orientation
det(A) < 0: transformation flips orientation
det(A) = 0: transformation collapses space to a lower dimension — matrix is not invertible

np.linalg.det(A)   # for A = [[1., 2.], [3., 4.]]: ≈ -2.0, matching 1×4 - 2×3
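The three sign cases can be illustrated with transformations from earlier in this lesson. In the sketch below, `det2` is a hypothetical helper that applies the ad - bc formula, and its result is compared with NumPy's general routine.

```python
import numpy as np

def det2(M):
    """Determinant of a 2x2 matrix via ad - bc."""
    (a, b), (c, d) = M
    return a * d - b * c

examples = {
    "scale x by 2, y by 3": np.array([[2., 0.], [0., 3.]]),   # area grows 6x
    "reflect over x-axis":  np.array([[1., 0.], [0., -1.]]),  # orientation flips
    "project onto x-axis":  np.array([[1., 0.], [0., 0.]]),   # area collapses to 0
}

dets = {name: det2(M) for name, M in examples.items()}
for name, M in examples.items():
    print(f"{name:>22}: formula={dets[name]}, numpy={np.linalg.det(M)}")
```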

Rank

The rank of a matrix is the number of linearly independent rows (= number of linearly independent columns). It is the dimension of the column space — how much of the output space the matrix can actually reach.

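Full rank: rank = min(m, n), the largest value possible. If rank = n (full column rank), distinct inputs always give distinct outputs, so no information is lost. If rank = m (full row rank), every vector in the output space can be reached.
Rank-deficient: rank < min(m, n). The transformation collapses information: inputs that differ by a vector in the null space all map to the same output.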
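A short sketch of both cases, using np.linalg.matrix_rank. The two matrices and the vectors u and v are examples chosen for illustration; v differs from u by (2, -1), which the rank-deficient matrix sends to zero.

```python
import numpy as np

full = np.array([[1., 2.],
                 [3., 4.]])       # rows are independent
deficient = np.array([[1., 2.],
                      [2., 4.]])  # second row = 2 x first row

rank_full = np.linalg.matrix_rank(full)            # 2
rank_deficient = np.linalg.matrix_rank(deficient)  # 1

# Two different inputs that differ by the vector (2, -1)
u = np.array([1., 1.])
v = u + np.array([2., -1.])

print(deficient @ u, deficient @ v)  # same output for both inputs
print(full @ u, full @ v)            # different outputs
```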

In ML: low-rank matrices appear in model compression and parameter-efficient fine-tuning (LoRA approximates weight updates with a product of two low-rank matrices), and a rank-deficient data matrix signals redundant, linearly dependent features.
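The parameter saving behind a low-rank update can be sketched as follows. This is only an illustration of the idea, not an implementation of LoRA: the layer size, the rank r = 4, and the random values are all assumptions, and training, initialisation, and scaling details are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r = 256, 512, 4             # hypothetical layer size and rank

W = rng.standard_normal((d_out, d_in))   # stand-in for existing weights
B = rng.standard_normal((d_out, r))      # tall low-rank factor
A = rng.standard_normal((r, d_in))       # wide low-rank factor

delta_W = B @ A                          # full-size update with rank at most r
W_adapted = W + delta_W

full_params = d_out * d_in               # 131072 numbers for a full update
low_rank_params = r * (d_out + d_in)     # 3072 numbers for the two factors

print(np.linalg.matrix_rank(delta_W), full_params, low_rank_params)
```

The update has the same shape as W, but it is described by roughly 2% as many numbers.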

Neural Network Layers as Matrix Operations

A fully connected (dense) layer computes:

output = activation(W @ input + b)

Where:

W is the weight matrix — a linear transformation
b is the bias vector
activation applies a non-linear function element-wise

For a layer with 512 inputs and 256 outputs, W is a 256×512 matrix. A forward pass through this layer is one matrix-vector multiply — 256 × 512 = 131,072 multiplications.

A batch of n inputs is represented as an n × 512 matrix. The layer processes the entire batch as a single matrix multiply:

# Input: (batch_size, 512)
# W: (256, 512)
# Output: (batch_size, 256)
output = input @ W.T + b
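Putting the pieces together, here is a minimal runnable sketch of such a layer. The random weights, the batch size of 8, the choice of ReLU as the activation, and the helper names `relu` and `dense` are all assumptions made for illustration. It also confirms that the batched form agrees with the single-vector form W @ x + b.

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_out, batch_size = 512, 256, 8           # batch size chosen arbitrarily

W = rng.standard_normal((n_out, n_in)) * 0.01   # weight matrix, shape (256, 512)
b = np.zeros(n_out)                             # bias vector, shape (256,)

def relu(z):
    """Element-wise non-linearity: max(0, z)."""
    return np.maximum(z, 0.0)

def dense(x):
    """Apply the layer to one vector (512,) or a batch (batch_size, 512)."""
    return relu(x @ W.T + b)

X = rng.standard_normal((batch_size, n_in))     # batch of inputs
batch_out = dense(X)                            # shape (8, 256)
single_out = relu(W @ X[0] + b)                 # one input, matrix-vector form

print(batch_out.shape)                          # (8, 256)
print(np.allclose(batch_out[0], single_out))    # True
```

The bias b has shape (256,) and is broadcast across the batch dimension, so every row of the output receives the same bias.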

Attention mechanisms in transformers and convolutional filters are likewise computed largely as matrix multiplications, which is why GPU acceleration is so impactful.

Key Takeaways

An m×n matrix represents a linear transformation from n-dimensional to m-dimensional space.
Matrix multiplication AB: the number of columns of A must equal the number of rows of B; for A (m×n) and B (n×p) the result is m×p. Not commutative.
Transpose swaps rows and columns: used in attention, covariance matrices, and gradient computation.
The inverse A⁻¹ undoes the transformation. Non-invertible square matrices have determinant = 0 and collapse space.
Rank measures how much information a transformation preserves: full column rank (rank = n) means no information loss.
A dense neural network layer is a matrix multiply plus a bias, followed by a non-linearity. GPUs and TPUs are heavily optimised for this operation.