Dimensionality Reduction — Linear Algebra for Machine Learning | Sabaoon Academy

Real-world ML data is high-dimensional: images are thousands of pixels, text embeddings are hundreds of dimensions, genomic data has millions of features. Dimensionality reduction finds compact representations that preserve what matters. This lesson covers the three most important methods: PCA, t-SNE, and UMAP.

Why Reduce Dimensions?

The curse of dimensionality: in high-dimensional spaces, data becomes sparse. Distance metrics lose meaning — everything is approximately the same distance from everything else. Models that work well in 10 dimensions often fail with 1,000.

Practical reasons:

Visualisation: humans can see 2D and 3D, not 512D
Training speed: fewer features → faster computation
Noise removal: irrelevant dimensions add noise, hurting generalisation
Storage: compressed representations are cheaper to store and retrieve

PCA: Linear Dimensionality Reduction

PCA (Principal Component Analysis) projects data onto the directions of maximum variance.

Step-by-step:

import numpy as np

def pca(X, k):
    # 1. Centre the data
    X_mean = X.mean(axis=0)
    X_c = X - X_mean

    # 2. Compute covariance matrix
    C = X_c.T @ X_c / len(X)   # (d × d)

    # 3. Eigendecomposition
    eigenvalues, eigenvectors = np.linalg.eigh(C)

    # 4. Sort descending
    idx = eigenvalues.argsort()[::-1]
    eigenvectors = eigenvectors[:, idx]
    eigenvalues = eigenvalues[idx]

    # 5. Project onto top k eigenvectors
    W = eigenvectors[:, :k]         # (d × k) projection matrix
    X_reduced = X_c @ W             # (n × k)

    return X_reduced, eigenvalues, W

The projection matrix W transforms new data points the same way:

new_point_reduced = (new_point - X_mean) @ W

Choosing k

The explained variance ratio of component i is eigenvalue λᵢ / Σλ.

explained_variance_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_variance_ratio)

# Find k to explain 95% of variance
k = np.searchsorted(cumulative, 0.95) + 1

A common rule: choose k to explain 90–95% of variance. The scree plot (explained variance vs component index) often shows an "elbow" — components after the elbow add little.

What PCA Preserves and Loses

PCA preserves global structure — points far apart in high dimensions remain far apart. It preserves variance — the principal components capture the spread of the data.

PCA does NOT preserve local structure — clusters and neighbourhoods can be distorted. For visualising cluster structure, non-linear methods (t-SNE, UMAP) work better.

Whitening

After PCA, you can further normalise each component to unit variance:

X_white = X_reduced / np.sqrt(eigenvalues[:k])

Whitened data has identity covariance — all components carry equal variance. This is used as a preprocessing step before training some models (especially neural networks and k-means).

t-SNE: Non-Linear Visualisation

t-SNE (t-Distributed Stochastic Neighbour Embedding) is designed for visualisation in 2D or 3D. It preserves local structure — nearby points in high-dimensional space end up nearby in the 2D projection.

The intuition:

For each pair of points in high-dimensional space, compute a probability that they are "neighbours" (using a Gaussian kernel based on distance).
Initialise random 2D positions.
Compute similar probabilities in 2D (using a t-distribution — the heavy tails prevent crowding).
Minimise the difference (KL divergence) between the two probability distributions — this pushes 2D positions to match high-dimensional neighbour relationships.

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_2d = tsne.fit_transform(X_high_dim)   # shape (n, 2)

Perplexity (typically 5–50) controls the effective number of neighbours each point considers.

t-SNE Caveats

Not for distance interpretation: distances in the t-SNE plot are not meaningful. Clusters being far apart in 2D does not mean they are far apart in high-dimensional space.
Stochastic: different runs produce different layouts. Use a fixed random_state.
Slow: O(n² log n) — impractical for n > 100k. Use PCA to reduce to ~50 dimensions first.
Not for new data: you cannot project new points; you must refit on the full dataset.

UMAP: Fast Non-Linear Reduction

UMAP (Uniform Manifold Approximation and Projection) is newer than t-SNE and has significant practical advantages:

	PCA	t-SNE	UMAP
Structure preserved	Global	Local	Both (adjustable)
Speed	Fast (O(nd²))	Slow (O(n²))	Fast (O(n^1.14))
New data projection	Yes	No	Yes
Interpretable distances	Yes	No	Partially
Deterministic	Yes	No	No (but seeds help)
Typical use	Preprocessing, features	Visualisation only	Visualisation + features

import umap

reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1)
X_2d = reducer.fit_transform(X_high_dim)

# Project new points:
new_2d = reducer.transform(new_points)

n_neighbors (typical: 5–50): balances local vs global structure. Low values = local focus, high values = more global.

min_dist (typical: 0.0–0.5): controls how tightly points cluster in the low-dimensional space.

UMAP is now widely used for both visualisation and as a feature engineering step before training classifiers on the reduced representation.

Autoencoders: Learned Dimensionality Reduction

Autoencoders are neural networks that learn to compress and decompress data. The bottleneck forces a compact representation.

Input (d dims) → Encoder → Latent space (k dims) → Decoder → Reconstruction (d dims)

import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, d, k):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(d, 256), nn.ReLU(),
            nn.Linear(256, k)
        )
        self.decoder = nn.Sequential(
            nn.Linear(k, 256), nn.ReLU(),
            nn.Linear(256, d)
        )

    def forward(self, x):
        z = self.encoder(x)           # compress to k dims
        return self.decoder(z), z     # reconstruct + return latent

Training minimises reconstruction loss. The latent vectors z are the learned low-dimensional representations.

Variational autoencoders (VAEs) go further — the latent space is probabilistic, enabling generation of new data points.

Choosing the Right Method

Situation	Method
Preprocessing before ML training	PCA
Visualising cluster structure	UMAP or t-SNE
Need to project new data points	PCA or UMAP
Interpretable components	PCA
Non-linear structure, large dataset	UMAP
Non-linear structure, small dataset	t-SNE
Learn task-specific representations	Autoencoder

Key Takeaways

PCA projects data onto eigenvectors of the covariance matrix — maximum variance directions. Linear, fast, interpretable, projects new data.
Explained variance ratio tells you how many components to keep. Choose k for 90–95% cumulative variance.
t-SNE preserves local neighbourhoods for visualisation but distances are not interpretable, it cannot project new data, and it is slow.
UMAP is faster than t-SNE, can project new data, and balances local and global structure. Preferred for large datasets.
Autoencoders learn non-linear compression end-to-end, optimised for a specific task.
Always reduce to ~50 dims with PCA before applying t-SNE on large datasets — reduces compute from O(n²) to manageable.