Skip to main content

Dimensionality Reduction

Real-world ML data is high-dimensional: images are thousands of pixels, text embeddings are hundreds of dimensions, genomic data has millions of features. Dimensionality reduction finds compact representations that preserve what matters. This lesson covers the three most important methods: PCA, t-SNE, and UMAP.

Why Reduce Dimensions?

The curse of dimensionality: in high-dimensional spaces, data becomes sparse. Distance metrics lose meaning — everything is approximately the same distance from everything else. Models that work well in 10 dimensions often fail with 1,000.

Practical reasons:

  • Visualisation: humans can see 2D and 3D, not 512D
  • Training speed: fewer features → faster computation
  • Noise removal: irrelevant dimensions add noise, hurting generalisation
  • Storage: compressed representations are cheaper to store and retrieve

PCA: Linear Dimensionality Reduction

PCA (Principal Component Analysis) projects data onto the directions of maximum variance.

Step-by-step:

import numpy as np

def pca(X, k):
    # 1. Centre the data
    X_mean = X.mean(axis=0)
    X_c = X - X_mean

    # 2. Compute covariance matrix
    C = X_c.T @ X_c / len(X)   # (d × d)

    # 3. Eigendecomposition
    eigenvalues, eigenvectors = np.linalg.eigh(C)

    # 4. Sort descending
    idx = eigenvalues.argsort()[::-1]
    eigenvectors = eigenvectors[:, idx]
    eigenvalues = eigenvalues[idx]

    # 5. Project onto top k eigenvectors
    W = eigenvectors[:, :k]         # (d × k) projection matrix
    X_reduced = X_c @ W             # (n × k)

    return X_reduced, eigenvalues, W

The projection matrix W transforms new data points the same way:

new_point_reduced = (new_point - X_mean) @ W

Choosing k

The explained variance ratio of component i is eigenvalue λᵢ / Σλ.

explained_variance_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_variance_ratio)

# Find k to explain 95% of variance
k = np.searchsorted(cumulative, 0.95) + 1

A common rule: choose k to explain 90–95% of variance. The scree plot (explained variance vs component index) often shows an "elbow" — components after the elbow add little.

What PCA Preserves and Loses

PCA preserves global structure — points far apart in high dimensions remain far apart. It preserves variance — the principal components capture the spread of the data.

PCA does NOT preserve local structure — clusters and neighbourhoods can be distorted. For visualising cluster structure, non-linear methods (t-SNE, UMAP) work better.

Whitening

After PCA, you can further normalise each component to unit variance:

X_white = X_reduced / np.sqrt(eigenvalues[:k])

Whitened data has identity covariance — all components carry equal variance. This is used as a preprocessing step before training some models (especially neural networks and k-means).

t-SNE: Non-Linear Visualisation

t-SNE (t-Distributed Stochastic Neighbour Embedding) is designed for visualisation in 2D or 3D. It preserves local structure — nearby points in high-dimensional space end up nearby in the 2D projection.

The intuition:

  1. For each pair of points in high-dimensional space, compute a probability that they are "neighbours" (using a Gaussian kernel based on distance).
  2. Initialise random 2D positions.
  3. Compute similar probabilities in 2D (using a t-distribution — the heavy tails prevent crowding).
  4. Minimise the difference (KL divergence) between the two probability distributions — this pushes 2D positions to match high-dimensional neighbour relationships.
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_2d = tsne.fit_transform(X_high_dim)   # shape (n, 2)

Perplexity (typically 5–50) controls the effective number of neighbours each point considers.

t-SNE Caveats

  • Not for distance interpretation: distances in the t-SNE plot are not meaningful. Clusters being far apart in 2D does not mean they are far apart in high-dimensional space.
  • Stochastic: different runs produce different layouts. Use a fixed random_state.
  • Slow: O(n² log n) — impractical for n > 100k. Use PCA to reduce to ~50 dimensions first.
  • Not for new data: you cannot project new points; you must refit on the full dataset.

UMAP: Fast Non-Linear Reduction

UMAP (Uniform Manifold Approximation and Projection) is newer than t-SNE and has significant practical advantages:

PCAt-SNEUMAP
Structure preservedGlobalLocalBoth (adjustable)
SpeedFast (O(nd²))Slow (O(n²))Fast (O(n^1.14))
New data projectionYesNoYes
Interpretable distancesYesNoPartially
DeterministicYesNoNo (but seeds help)
Typical usePreprocessing, featuresVisualisation onlyVisualisation + features
import umap

reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1)
X_2d = reducer.fit_transform(X_high_dim)

# Project new points:
new_2d = reducer.transform(new_points)

n_neighbors (typical: 5–50): balances local vs global structure. Low values = local focus, high values = more global.

min_dist (typical: 0.0–0.5): controls how tightly points cluster in the low-dimensional space.

UMAP is now widely used for both visualisation and as a feature engineering step before training classifiers on the reduced representation.

Autoencoders: Learned Dimensionality Reduction

Autoencoders are neural networks that learn to compress and decompress data. The bottleneck forces a compact representation.

Input (d dims)  Encoder  Latent space (k dims)  Decoder  Reconstruction (d dims)
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, d, k):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(d, 256), nn.ReLU(),
            nn.Linear(256, k)
        )
        self.decoder = nn.Sequential(
            nn.Linear(k, 256), nn.ReLU(),
            nn.Linear(256, d)
        )

    def forward(self, x):
        z = self.encoder(x)           # compress to k dims
        return self.decoder(z), z     # reconstruct + return latent

Training minimises reconstruction loss. The latent vectors z are the learned low-dimensional representations.

Variational autoencoders (VAEs) go further — the latent space is probabilistic, enabling generation of new data points.

Choosing the Right Method

SituationMethod
Preprocessing before ML trainingPCA
Visualising cluster structureUMAP or t-SNE
Need to project new data pointsPCA or UMAP
Interpretable componentsPCA
Non-linear structure, large datasetUMAP
Non-linear structure, small datasett-SNE
Learn task-specific representationsAutoencoder

Key Takeaways

  • PCA projects data onto eigenvectors of the covariance matrix — maximum variance directions. Linear, fast, interpretable, projects new data.
  • Explained variance ratio tells you how many components to keep. Choose k for 90–95% cumulative variance.
  • t-SNE preserves local neighbourhoods for visualisation but distances are not interpretable, it cannot project new data, and it is slow.
  • UMAP is faster than t-SNE, can project new data, and balances local and global structure. Preferred for large datasets.
  • Autoencoders learn non-linear compression end-to-end, optimised for a specific task.
  • Always reduce to ~50 dims with PCA before applying t-SNE on large datasets — reduces compute from O(n²) to manageable.