Real-world ML data is high-dimensional: an image can have thousands of pixels, text embeddings have hundreds of dimensions, and genomic data can have millions of features. Dimensionality reduction finds compact representations that preserve what matters. This lesson covers three widely used methods (PCA, t-SNE, and UMAP) and closes with autoencoders.
Why Reduce Dimensions?
The curse of dimensionality: in high-dimensional spaces, data becomes sparse. Distance metrics lose meaning — everything is approximately the same distance from everything else. Models that work well in 10 dimensions often fail with 1,000.
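The loss of distance contrast can be seen in a small experiment. The sketch below is illustrative only: the uniform random data, the sample size, and the helper name `distance_contrast` are assumptions, not part of any library. It measures how much farther the farthest point is than the nearest one, relative to the nearest.

```python
import numpy as np

def distance_contrast(d, n=500, seed=0):
    """Relative gap between the farthest and nearest point from one query."""
    rng = np.random.default_rng(seed)
    X = rng.random((n, d))               # n points, uniform in the unit hypercube
    q = rng.random(d)                    # one query point
    dist = np.linalg.norm(X - q, axis=1)
    return (dist.max() - dist.min()) / dist.min()

for d in (2, 10, 100, 1000):
    print(f"d={d:5d}  contrast={distance_contrast(d):.2f}")
```

The contrast shrinks as d grows, which is the sense in which "nearest" and "farthest" stop being informative.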
Practical reasons:
- Visualisation: humans can see 2D and 3D, not 512D
- Training speed: fewer features → faster computation
- Noise removal: irrelevant dimensions add noise, hurting generalisation
- Storage: compressed representations are cheaper to store and retrieve
PCA: Linear Dimensionality Reduction
PCA (Principal Component Analysis) projects data onto the directions of maximum variance.
Step-by-step:
```python
import numpy as np

def pca(X, k):
    # 1. Centre the data
    X_mean = X.mean(axis=0)
    X_c = X - X_mean
    # 2. Compute covariance matrix (use len(X) - 1 for the unbiased estimate)
    C = X_c.T @ X_c / len(X)  # (d × d)
    # 3. Eigendecomposition (eigh suits symmetric C; eigenvalues come back ascending)
    eigenvalues, eigenvectors = np.linalg.eigh(C)
    # 4. Sort descending
    idx = eigenvalues.argsort()[::-1]
    eigenvectors = eigenvectors[:, idx]
    eigenvalues = eigenvalues[idx]
    # 5. Project onto top k eigenvectors
    W = eigenvectors[:, :k]  # (d × k) projection matrix
    X_reduced = X_c @ W      # (n × k)
    return X_reduced, eigenvalues, W
```
The projection matrix W transforms new data points the same way, provided you subtract the training mean X_mean (computed as in step 1 and kept alongside W):
```python
new_point_reduced = (new_point - X_mean) @ W
```
Choosing k
The explained variance ratio of component i is eigenvalue λᵢ / Σλ.
```python
# eigenvalues: all d eigenvalues, sorted descending
explained_variance_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_variance_ratio)
# Find the smallest k that explains at least 95% of variance
k = np.searchsorted(cumulative, 0.95) + 1
```
A common rule: choose k to explain 90–95% of variance. The scree plot (explained variance vs component index) often shows an "elbow"; components after the elbow add little.
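If you use scikit-learn rather than a hand-written PCA, the same selection can be requested directly. The sketch below assumes synthetic data with five strong directions; the data and variable names are illustrative. Passing a float between 0 and 1 as `n_components` (with `svd_solver="full"`) asks scikit-learn for the smallest number of components whose cumulative explained variance exceeds that fraction.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data (assumption for illustration): 5 strong latent directions
# embedded in 50 dimensions, plus a little isotropic noise.
latent = rng.normal(size=(500, 5)) * np.array([10.0, 8.0, 6.0, 4.0, 2.0])
X = latent @ rng.normal(size=(5, 50)) + 0.1 * rng.normal(size=(500, 50))

# Keep as many components as needed to pass 95% cumulative explained variance.
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X)
k = pca_95.n_components_
cumulative = np.cumsum(pca_95.explained_variance_ratio_)
print(f"kept k={k} components, cumulative variance={cumulative[-1]:.3f}")
```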
What PCA Preserves and Loses
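PCA tends to preserve global structure: because it keeps the directions of largest variance, points that are far apart along those directions remain far apart after projection. It also preserves as much variance as any k-dimensional linear projection can.
PCA does not guarantee that local structure survives: points separated only along discarded directions collapse together, and non-linear clusters or neighbourhoods can be distorted. For visualising cluster structure, non-linear methods (t-SNE, UMAP) usually work better.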
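What PCA loses can be quantified. With the covariance defined as above (dividing by n), the mean squared reconstruction error after keeping k components equals the sum of the discarded eigenvalues. The sketch below checks this numerically on synthetic data; the data and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic correlated data (illustrative assumption): 300 points in 6 dimensions
X = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))
X_c = X - X.mean(axis=0)

C = X_c.T @ X_c / len(X)
eigenvalues, eigenvectors = np.linalg.eigh(C)
order = eigenvalues.argsort()[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

k = 2
W = eigenvectors[:, :k]
X_hat = (X_c @ W) @ W.T                              # project down, then map back to 6 dims
mse = np.mean(np.sum((X_c - X_hat) ** 2, axis=1))    # mean squared error per point
discarded = eigenvalues[k:].sum()

print(mse, discarded)                                # the two values should agree
```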
Whitening
After PCA, you can further normalise each component to unit variance:
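```python
X_white = X_reduced / np.sqrt(eigenvalues[:k])
```
Whitened data has identity covariance: the components are uncorrelated and each has unit variance. This is sometimes used as a preprocessing step before training models that are sensitive to feature scale (especially neural networks and k-means). Components with near-zero eigenvalues are mostly noise, and dividing by their tiny standard deviation amplifies it, so either drop them or add a small constant inside the square root.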
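The identity-covariance claim is easy to verify. This is a minimal check on synthetic data; the data and the small guard `eps` are assumptions added for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 5))   # correlated features
X_c = X - X.mean(axis=0)

C = X_c.T @ X_c / len(X)
eigenvalues, eigenvectors = np.linalg.eigh(C)
order = eigenvalues.argsort()[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

k = 3
eps = 1e-12                                  # hypothetical guard against division by ~0
X_reduced = X_c @ eigenvectors[:, :k]
X_white = X_reduced / np.sqrt(eigenvalues[:k] + eps)

# Covariance of the whitened data, using the same 1/n convention as C
cov_white = X_white.T @ X_white / len(X)
print(np.round(cov_white, 3))
```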
t-SNE: Non-Linear Visualisation
t-SNE (t-Distributed Stochastic Neighbour Embedding) is designed for visualisation in 2D or 3D. It preserves local structure — nearby points in high-dimensional space end up nearby in the 2D projection.
The intuition:
- For each pair of points in high-dimensional space, compute a probability that they are "neighbours" (using a Gaussian kernel based on distance).
- Initialise random 2D positions.
- Compute similar probabilities in 2D (using a t-distribution — the heavy tails prevent crowding).
- Minimise the difference (KL divergence) between the two probability distributions — this pushes 2D positions to match high-dimensional neighbour relationships.
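The quantities in these steps can be written out in a few lines of NumPy. This is a simplified sketch, not the full algorithm: it uses one fixed Gaussian bandwidth `sigma` for every point (real t-SNE tunes it per point from the perplexity), it skips the gradient descent, and all function names are made up for illustration. It only evaluates the objective for two candidate 2D layouts.

```python
import numpy as np

def pairwise_sq_dists(Z):
    sq = np.sum(Z ** 2, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2 * Z @ Z.T, 0.0)

def high_dim_affinities(X, sigma=2.0):
    """Symmetric Gaussian affinities P (fixed bandwidth: a simplification)."""
    P = np.exp(-pairwise_sq_dists(X) / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    P /= P.sum(axis=1, keepdims=True)        # conditional p(j|i)
    return (P + P.T) / (2 * len(X))          # symmetrise into a joint distribution

def low_dim_affinities(Y):
    """Student-t affinities Q (one degree of freedom, heavy tails)."""
    num = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(num, 0.0)
    return num / num.sum()

def kl_divergence(P, Q, eps=1e-12):
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps)))

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))
X[30:] += 5.0                                # two well-separated clusters in 10 dims

P = high_dim_affinities(X)
Y_good = X[:, :2]                            # a layout that keeps the clusters apart
Y_bad = Y_good[rng.permutation(60)]          # same positions, assigned to the wrong points

kl_good = kl_divergence(P, low_dim_affinities(Y_good))
kl_bad = kl_divergence(P, low_dim_affinities(Y_bad))
print(f"KL(good layout)={kl_good:.3f}  KL(bad layout)={kl_bad:.3f}")
```

A layout that keeps high-dimensional neighbours together scores a lower KL divergence than one that scatters them. In practice you would rely on a library implementation for the optimisation itself.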
```python
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_2d = tsne.fit_transform(X_high_dim)  # shape (n, 2)
```
Perplexity (typically 5–50) controls the effective number of neighbours each point considers.
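To make "effective number of neighbours" concrete: perplexity is 2 raised to the Shannon entropy (in bits) of a point's neighbour distribution, and t-SNE picks each point's Gaussian bandwidth so that this value matches the requested perplexity. The sketch below reproduces that calibration for a single point with a binary search. The function names, search bounds, and synthetic data are assumptions for illustration, not scikit-learn internals.

```python
import numpy as np

def conditional_probs(sq_dists, sigma):
    """p(j|i) for one point i, given squared distances to every other point."""
    logits = -sq_dists / (2 * sigma ** 2)
    logits -= logits.max()                   # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def perplexity(p):
    """2 ** entropy in bits: the effective number of neighbours."""
    p = p[p > 0]
    return 2.0 ** (-np.sum(p * np.log2(p)))

def sigma_for_perplexity(sq_dists, target, lo=1e-3, hi=1e3, iters=60):
    """Binary-search the bandwidth so that the perplexity matches the target."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)               # geometric midpoint of the bracket
        if perplexity(conditional_probs(sq_dists, mid)) > target:
            hi = mid                         # too many effective neighbours: shrink sigma
        else:
            lo = mid
    return np.sqrt(lo * hi)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
sq_dists = np.sum((X[1:] - X[0]) ** 2, axis=1)   # from point 0 to the other 199

for target in (5, 30, 50):
    s = sigma_for_perplexity(sq_dists, target)
    achieved = perplexity(conditional_probs(sq_dists, s))
    print(f"target={target:2d}  sigma={s:.3f}  achieved={achieved:.2f}")
```

A larger perplexity yields a larger bandwidth, so each point spreads its attention over more neighbours.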
t-SNE Caveats
- Not for distance interpretation: distances in the t-SNE plot are not meaningful. Clusters being far apart in 2D does not mean they are far apart in high-dimensional space.
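- Stochastic: different runs produce different layouts. Use a fixed random_state for reproducibility.
- Slow: exact t-SNE costs O(n²) in the number of points. The Barnes-Hut approximation (scikit-learn's default) brings this to roughly O(n log n), but it is still impractical for n > 100k. Use PCA to reduce to ~50 dimensions first.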
- Not for new data: you cannot project new points; you must refit on the full dataset.
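The PCA-then-t-SNE recipe from the list above can be sketched as two chained `fit_transform` calls. The synthetic clusters and their sizes are assumptions standing in for real data; with real data, replace `X_high_dim` with your own array.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Synthetic stand-in for real data: 3 Gaussian clusters in 200 dimensions
centres = rng.normal(scale=5.0, size=(3, 200))
labels = np.repeat(np.arange(3), 100)
X_high_dim = centres[labels] + rng.normal(size=(300, 200))

# Step 1: linear reduction to 50 dimensions (cheap, removes some noise)
X_50 = PCA(n_components=50, random_state=42).fit_transform(X_high_dim)

# Step 2: t-SNE on the reduced data; fixed random_state for a repeatable layout
X_2d = TSNE(n_components=2, perplexity=30, init="pca",
            random_state=42).fit_transform(X_50)

print(X_50.shape, X_2d.shape)
```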
UMAP: Fast Non-Linear Reduction
UMAP (Uniform Manifold Approximation and Projection) is newer than t-SNE and has significant practical advantages:
| | PCA | t-SNE | UMAP |
|---|---|---|---|
| Structure preserved | Global | Local | Both (adjustable) |
| Speed | Fast (O(nd²)) | Slow (O(n²) exact, O(n log n) with Barnes-Hut) | Fast (roughly O(n^1.14), empirical) |
| New data projection | Yes | No | Yes |
| Interpretable distances | Yes | No | Partially |
| Deterministic | Yes | No | No (but seeds help) |
| Typical use | Preprocessing, features | Visualisation only | Visualisation + features |
```python
import umap  # provided by the umap-learn package

reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1)
X_2d = reducer.fit_transform(X_high_dim)
# Project new points:
new_2d = reducer.transform(new_points)
```
n_neighbors (typical: 5–50): balances local vs global structure. Low values = local focus, high values = more global.
min_dist (typical: 0.0–0.5): controls how tightly points cluster in the low-dimensional space.
UMAP is now widely used for both visualisation and as a feature engineering step before training classifiers on the reduced representation.
Autoencoders: Learned Dimensionality Reduction
Autoencoders are neural networks that learn to compress and decompress data. The bottleneck forces a compact representation.
Input (d dims) → Encoder → Latent space (k dims) → Decoder → Reconstruction (d dims)
```python
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, d, k):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(d, 256), nn.ReLU(),
            nn.Linear(256, k)
        )
        self.decoder = nn.Sequential(
            nn.Linear(k, 256), nn.ReLU(),
            nn.Linear(256, d)
        )

    def forward(self, x):
        z = self.encoder(x)        # compress to k dims
        return self.decoder(z), z  # reconstruct + return latent
```
Training minimises reconstruction loss. The latent vectors z are the learned low-dimensional representations.
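A training loop for this kind of model might look like the sketch below. It is a minimal example under stated assumptions: the data is synthetic, the loss is mean squared error, the optimiser is Adam with full-batch updates, and the hyperparameters are arbitrary. The encoder and decoder have the same layer sizes as the Autoencoder class above and are written inline so the block runs on its own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d, k = 20, 3
# Synthetic data (assumption): points near a 3-dimensional subspace of a 20-dim space
X = torch.randn(512, k) @ torch.randn(k, d) + 0.05 * torch.randn(512, d)

encoder = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, k))
decoder = nn.Sequential(nn.Linear(k, 256), nn.ReLU(), nn.Linear(256, d))
params = list(encoder.parameters()) + list(decoder.parameters())

optimiser = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.MSELoss()                       # reconstruction loss

losses = []
for step in range(200):
    optimiser.zero_grad()
    z = encoder(X)                           # compress to k dims
    loss = loss_fn(decoder(z), X)            # compare reconstruction with input
    loss.backward()
    optimiser.step()
    losses.append(loss.item())

with torch.no_grad():
    Z = encoder(X)                           # learned low-dimensional representation

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}, latent shape {tuple(Z.shape)}")
```

For real datasets you would normally train on mini-batches and track reconstruction loss on held-out data as well.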
Variational autoencoders (VAEs) go further — the latent space is probabilistic, enabling generation of new data points.
Choosing the Right Method
| Situation | Method |
|---|---|
| Preprocessing before ML training | PCA |
| Visualising cluster structure | UMAP or t-SNE |
| Need to project new data points | PCA or UMAP |
| Interpretable components | PCA |
| Non-linear structure, large dataset | UMAP |
| Non-linear structure, small dataset | t-SNE |
| Learn non-linear, data-specific representations | Autoencoder |
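When the table leaves more than one candidate, one hedged way to compare them is to measure how well each embedding keeps local neighbourhoods. The sketch below uses scikit-learn's `trustworthiness` score on a 500-sample subset of the digits dataset; the dataset, subset size, and `n_neighbors=10` are illustrative choices, and the score covers only local structure, not global layout.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness

X, _ = load_digits(return_X_y=True)
X = X[:500]                                  # small subset keeps the example quick

embeddings = {
    "PCA": PCA(n_components=2, random_state=0).fit_transform(X),
    "t-SNE": TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X),
}

# trustworthiness lies in [0, 1]; higher means points that are neighbours in the
# embedding were also neighbours in the original space.
scores = {name: trustworthiness(X, emb, n_neighbors=10)
          for name, emb in embeddings.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Non-linear methods typically score higher on this measure for cluster-structured data, which matches the table's recommendation for visualising clusters.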
Key Takeaways
- PCA projects data onto eigenvectors of the covariance matrix — maximum variance directions. Linear, fast, interpretable, projects new data.
- Explained variance ratio tells you how many components to keep. Choose k for 90–95% cumulative variance.
- t-SNE preserves local neighbourhoods for visualisation but distances are not interpretable, it cannot project new data, and it is slow.
- UMAP is faster than t-SNE, can project new data, and balances local and global structure. Preferred for large datasets.
- Autoencoders learn non-linear compression end-to-end by minimising reconstruction error on your data.
- On large datasets, reduce to ~50 dims with PCA before applying t-SNE: this cuts the cost of every distance computation and removes some noise, although t-SNE's scaling with the number of points is unchanged.