Something remarkable happened in early 2026. Alibaba's Qwen 3.5 9B — a model you can run on a laptop — started outperforming models with 70B and even 120B parameters on key benchmarks. Not by a little. By statistically significant margins on coding, math, and instruction following.
This is not an anomaly. Allen AI's OLMo 3 and NVIDIA's Nemotron 3 Super are posting similar results. The era of "bigger is always better" in LLMs is definitively over.
Let me walk you through what these models can do, how they achieve it, and — most importantly — how to run them locally and integrate them into your workflow.
The Benchmark Results That Turned Heads
Qwen 3.5 9B
Alibaba's Qwen team has been quietly shipping excellent models for two years. Qwen 3.5 9B is their most impressive release yet.
| Benchmark | Qwen 3.5 9B | Llama 3.3 70B | Qwen 2.5 72B | GPT-4o mini |
|---|---|---|---|---|
| MMLU-Pro | 72.1 | 70.4 | 71.8 | 73.0 |
| HumanEval+ | 84.2 | 80.5 | 81.9 | 86.7 |
| MATH-500 | 89.4 | 85.3 | 87.6 | 90.1 |
| IFEval (strict) | 81.7 | 78.2 | 79.5 | 82.3 |
| MT-Bench | 8.91 | 8.72 | 8.84 | 9.01 |
| Arena-Hard | 71.8 | 68.4 | 70.2 | 74.5 |
Look at the MMLU-Pro and HumanEval+ columns. A 9B model beating a 70B model on knowledge and coding benchmarks. The MATH-500 score of 89.4 is within striking distance of GPT-4o mini, which is a closed-source API model backed by massive infrastructure.
How Is This Possible?
Three factors converge to make small models punch above their weight.
1. Better training data curation. The Qwen team invested heavily in data quality over quantity. They use multi-stage filtering: deduplication, quality scoring with smaller classifier models, domain-specific filtering, and synthetic data generation for underrepresented tasks. The training mix matters more than the training volume.
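The exact filters are not public, but the multi-stage idea is easy to sketch. Below is a toy version of the dedupe-then-score flow, with a trivial heuristic standing in for the smaller classifier models the text mentions (the function names, heuristic, and threshold are all hypothetical):

```python
import hashlib

def dedupe(docs):
    """Stage 1: exact-match deduplication via content hashing."""
    seen, out = set(), []
    for d in docs:
        h = hashlib.sha256(d.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(d)
    return out

def quality_score(doc):
    """Stage 2: stand-in for a small classifier's quality score.
    Toy heuristic: mostly-alphabetic, punctuated text scores higher."""
    if not doc:
        return 0.0
    letters = sum(c.isalpha() for c in doc)
    return min(1.0, letters / max(len(doc), 1) + (0.2 if "." in doc else 0.0))

def curate(docs, threshold=0.5):
    """Dedupe, then keep only documents above the quality threshold."""
    return [d for d in dedupe(docs) if quality_score(d) >= threshold]

docs = ["Clean prose about tokenizers.", "Clean prose about tokenizers.", "$$$///###"]
print(curate(docs))  # duplicate and junk documents are dropped
```

A production pipeline would replace `quality_score` with a learned classifier and add the domain-specific and synthetic-data stages, but the filtering skeleton is the same.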
2. Architectural improvements. Qwen 3.5 uses grouped-query attention (GQA), SwiGLU activations, RoPE with extended context, and a tokenizer optimized for multilingual text. These are not revolutionary individually, but combined, they extract more capability per parameter.
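Of these pieces, SwiGLU is the easiest to show concretely. Here is a sketch of a SwiGLU feed-forward block in NumPy (the weight shapes are illustrative toy sizes, not Qwen's actual dimensions):

```python
import numpy as np

def swish(x):
    """Swish / SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, W_gate, W_up, W_down):
    """SwiGLU feed-forward block: a swish-gated linear unit, then a down-projection.
    out = (swish(x @ W_gate) * (x @ W_up)) @ W_down"""
    return (swish(x @ W_gate) * (x @ W_up)) @ W_down

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32  # toy sizes; real models use thousands
x = rng.standard_normal((4, d_model))            # a batch of 4 token vectors
W_gate = rng.standard_normal((d_model, d_ff))
W_up = rng.standard_normal((d_model, d_ff))
W_down = rng.standard_normal((d_ff, d_model))
out = swiglu_ffn(x, W_gate, W_up, W_down)
print(out.shape)  # (4, 8), same shape as the input
```

The gating term is what distinguishes SwiGLU from a plain two-layer MLP: the `W_gate` path modulates the `W_up` path elementwise before the down-projection.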
3. Post-training alignment. The model goes through multiple stages of supervised fine-tuning (SFT) followed by RLHF and DPO. The Qwen team's post-training pipeline is arguably the most sophisticated in the open-source world. They use a mixture of human feedback and AI feedback (RLAIF) to maximize instruction following without sacrificing raw capability.
Training Pipeline:
Pre-training (trillions of tokens, curated web + books + code)
↓
Supervised Fine-Tuning (high-quality instruction pairs)
↓
RLHF / DPO (preference optimization)
↓
Safety alignment (red-teaming + constitutional AI)
↓
Quantization-aware training (maintains quality at INT4/INT8)
OLMo 3: The Fully Open Alternative
Allen AI's OLMo project takes a different philosophical approach: everything is open. Not just the weights — the training data, the training code, the evaluation framework, and the intermediate checkpoints.
What Makes OLMo 3 Special
| Aspect | OLMo 3 | Typical "Open" Models |
|---|---|---|
| Model weights | Open | Open |
| Training data | Open (Dolma 3 dataset) | Closed |
| Training code | Open | Usually closed |
| Intermediate checkpoints | Available | Rarely |
| Evaluation framework | Open (Catwalk) | Varies |
| Fine-tuning recipes | Open | Sometimes |
This matters for several reasons:
- Reproducibility. You can verify their claims by retraining from scratch.
- Research. Intermediate checkpoints let you study how capabilities emerge during training.
- Compliance. You know exactly what data the model was trained on — critical for regulated industries.
- Customization. You can continue pre-training on your domain data using their exact pipeline.
OLMo 3 7B Performance
OLMo 3 7B does not quite match Qwen 3.5 9B on raw benchmarks (it is a smaller model with less compute budget), but it holds its own:
| Benchmark | OLMo 3 7B | Llama 3.1 8B | Mistral 7B v0.3 |
|---|---|---|---|
| MMLU-Pro | 62.8 | 61.4 | 60.1 |
| HumanEval+ | 71.3 | 69.5 | 65.8 |
| MATH-500 | 74.6 | 72.1 | 68.9 |
| GSM8K | 83.2 | 81.6 | 78.4 |
| TruthfulQA | 58.7 | 55.3 | 52.9 |
The TruthfulQA score is notable — OLMo 3 is less likely to hallucinate confident nonsense, a property that often matters more in production than headline capability numbers.
NVIDIA Nemotron 3 Super: Hybrid Architecture
NVIDIA's contribution to the small model revolution is architecturally the most interesting. Nemotron 3 Super uses a hybrid Mamba-Transformer architecture that combines the best of both worlds.
The Mamba Advantage
Traditional Transformers have quadratic attention complexity — processing a 32K context window takes 4x the compute of a 16K window. Mamba (a state-space model) has linear complexity, making long contexts dramatically cheaper.
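The quadratic-vs-linear claim is just arithmetic. Attention cost scales with the square of sequence length, while a state-space scan scales linearly:

```python
def attention_cost(n):
    # Self-attention scores every token against every other token: O(n^2)
    return n * n

def ssm_cost(n):
    # A state-space scan visits each token once: O(n)
    return n

# Doubling the context window from 16K to 32K tokens:
print(attention_cost(32_768) / attention_cost(16_384))  # 4.0, i.e. 4x the compute
print(ssm_cost(32_768) / ssm_cost(16_384))              # 2.0, i.e. linear scaling
```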
Nemotron 3 Super interleaves Transformer layers (for precise attention) with Mamba layers (for efficient long-range processing):
Layer structure (simplified):
Layer 1: [Mamba] — Efficient sequential processing
Layer 2: [Mamba] — Long-range dependencies
Layer 3: [Transformer] — Precise attention for complex reasoning
Layer 4: [Mamba] — Efficient context compression
Layer 5: [Mamba] — Pattern recognition
Layer 6: [Transformer] — Cross-reference and logical inference
...repeating pattern...
The result: Nemotron 3 Super 8B processes 128K context windows with the memory footprint of a 32K Transformer model.
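The memory claim follows from the KV cache: only attention layers keep a cache that grows with sequence length, while Mamba layers carry a small fixed-size state. A rough estimate (the layer counts, head count, and head dimension here are illustrative, not Nemotron's published configuration):

```python
def kv_cache_gb(attn_layers, seq_len, kv_heads=8, head_dim=128, bytes_per=2):
    """KV cache size: 2 (K and V) x layers x kv_heads x head_dim x seq_len bytes, in GB."""
    return 2 * attn_layers * kv_heads * head_dim * seq_len * bytes_per / 1e9

# Pure Transformer: every one of 32 layers keeps a 128K-token cache.
full = kv_cache_gb(attn_layers=32, seq_len=131_072)
# Hybrid: if only 8 of 32 layers are attention, the cache shrinks 4x
# (the Mamba layers' fixed state is negligible by comparison).
hybrid = kv_cache_gb(attn_layers=8, seq_len=131_072)
print(f"{full:.1f} GB vs {hybrid:.1f} GB")
# The hybrid's 128K cache equals a pure Transformer's cache at 32K:
assert hybrid == kv_cache_gb(attn_layers=32, seq_len=32_768)
```

Under these assumed proportions, a 128K hybrid context costs what a 32K pure-Transformer context would, which is the shape of the trade-off the architecture is making.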
Benchmark Comparison
| Benchmark | Nemotron 3 Super 8B | Qwen 3.5 9B | Llama 3.3 8B |
|---|---|---|---|
| MMLU-Pro | 64.7 | 72.1 | 61.4 |
| HumanEval+ | 73.8 | 84.2 | 69.5 |
| RULER (128K) | 91.2 | 82.4 | 76.8 |
| Needle-in-Haystack | 99.1% | 94.3% | 88.7% |
| Tokens/sec (A100) | 4,820 | 3,650 | 3,890 |
The RULER and Needle-in-Haystack scores tell the story. When you need to process very long documents — legal contracts, codebases, research papers — the hybrid architecture dominates. The throughput advantage (tokens per second) also matters for production deployments where you are paying per GPU-hour.
Running These Models Locally
Here is the practical part. You do not need a data center to run these models.
Hardware Requirements
| Model | RAM (FP16) | RAM (Q4_K_M) | GPU VRAM (Q4_K_M) | CPU-Only Viable? |
|---|---|---|---|---|
| Qwen 3.5 9B | 18 GB | 5.5 GB | 6 GB | Yes (slow) |
| OLMo 3 7B | 14 GB | 4.5 GB | 5 GB | Yes |
| Nemotron 3 Super 8B | 16 GB | 5 GB | 6 GB | Yes (slow) |
Q4_K_M quantization reduces memory usage by roughly 70% with minimal quality loss (typically less than 1% on benchmarks). For most practical tasks, you will not notice the difference.
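The numbers in the table follow from bits per parameter: FP16 stores 16 bits per weight, while Q4_K_M averages roughly 4.8 bits once quantization scales and a few higher-precision tensors are included (the 4.8 figure is an approximation):

```python
def weight_memory_gb(params_billions, bits_per_param):
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

fp16 = weight_memory_gb(9, 16)   # 18.0 GB, matching the table's FP16 column
q4 = weight_memory_gb(9, 4.8)    # ~5.4 GB
print(fp16, q4, f"reduction: {1 - q4 / fp16:.0%}")
```

Activations, the KV cache, and runtime overhead add a bit on top, which is why a 6 GB GPU is the practical floor rather than 5.4 GB exactly.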
Setup with Ollama
Ollama is the fastest way to get running locally.
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull models
ollama pull qwen3.5:9b
ollama pull olmo3:7b
ollama pull nemotron3-super:8b
# Run interactively
ollama run qwen3.5:9b
# Or serve as an API
ollama serve
The API is OpenAI-compatible, so you can point any OpenAI SDK client at it:
import OpenAI from 'openai'
const client = new OpenAI({
baseURL: 'http://localhost:11434/v1',
apiKey: 'ollama', // Required but unused
})
const response = await client.chat.completions.create({
model: 'qwen3.5:9b',
messages: [
{
role: 'system',
content: 'You are a senior software engineer. Be concise and precise.',
},
{
role: 'user',
content: 'Review this function for bugs and performance issues:\n\n```python\ndef find_duplicates(lst):\n dupes = []\n for i in range(len(lst)):\n for j in range(i+1, len(lst)):\n if lst[i] == lst[j] and lst[i] not in dupes:\n dupes.append(lst[i])\n return dupes\n```',
},
],
temperature: 0.3,
})
console.log(response.choices[0].message.content)
Setup with llama.cpp for Maximum Performance
If you want to squeeze every last token per second out of your hardware:
# Clone and build with GPU support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON  # or -DGGML_METAL=ON for Mac
cmake --build build --config Release -j
# Download a GGUF quantized model
# (Check huggingface.co for official GGUF releases)
# Run the server
# -ngl 99 offloads all layers to the GPU; -c 32768 sets the context length
./build/bin/llama-server \
  -m models/qwen3.5-9b-q4_k_m.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 99 \
  -c 32768 \
  --threads 8
Performance Comparison: Local vs API
I benchmarked these models on a MacBook Pro M3 Max (36 GB) and compared against API calls:
| Setup | Tokens/sec | Cost per 1M tokens | Latency (first token) |
|---|---|---|---|
| Qwen 3.5 9B (M3 Max, Q4) | 42 t/s | $0 (local) | ~180ms |
| Qwen 3.5 9B (Ollama, CPU) | 8 t/s | $0 (local) | ~900ms |
| GPT-4o mini (API) | ~85 t/s | $0.60 | ~300ms |
| Claude 3.5 Haiku (API) | ~95 t/s | $1.00 | ~250ms |
Local inference on Apple Silicon is genuinely competitive for interactive use. At 42 tokens per second, you are reading faster than the model generates — which means it feels instant.
Where Small Models Beat Large Ones
Small models are not universally better. But there are specific use cases where they are the right choice.
1. Code Completion and Editing
For inline code suggestions, autocomplete, and small refactors, Qwen 3.5 9B performs within 5% of GPT-4o on HumanEval+ benchmarks. The latency advantage of local inference makes it feel faster.
# Use with Continue.dev (VS Code extension)
# In .continue/config.json:
{
"models": [{
"title": "Qwen 3.5 9B",
"provider": "ollama",
"model": "qwen3.5:9b",
"contextLength": 32768
}],
"tabAutocompleteModel": {
"title": "Qwen 3.5 9B",
"provider": "ollama",
"model": "qwen3.5:9b"
}
}
2. Data Processing Pipelines
When you need to classify, extract, or transform thousands of documents, API costs add up fast. A local model processing 10,000 documents costs $0 vs $50-200 in API calls.
import ollama
import json
def classify_support_tickets(tickets: list[str]) -> list[dict]:
results = []
for ticket in tickets:
response = ollama.chat(
model='qwen3.5:9b',
messages=[{
'role': 'user',
'content': f"""Classify this support ticket. Return JSON only.
Categories: billing, technical, feature-request, bug-report
Ticket: {ticket}
Output: {{"category": "...", "priority": "low|medium|high", "summary": "..."}}"""
}],
format='json',
)
results.append(json.loads(response['message']['content']))
    return results
3. Privacy-Sensitive Applications
Healthcare, legal, and financial applications often cannot send data to external APIs. Local models solve this completely — no data leaves your infrastructure.
4. Edge and Embedded Deployments
Nemotron 3 Super 8B can run on NVIDIA Jetson devices for robotics, IoT gateways, and edge inference. The Mamba layers make it particularly efficient for streaming sensor data with long context windows.
5. Offline Development
Working on a plane, in a secure facility, or in a region with unreliable internet? Local models work everywhere.
When You Still Need Large Models
Be honest about the limitations. Small models struggle with:
- Complex multi-step reasoning over long documents (legal analysis, research synthesis)
- Creative writing that requires nuance and stylistic range
- Rare knowledge domains where training data is sparse
- Agentic workflows requiring 10+ tool calls with complex state management
- Multilingual tasks in low-resource languages
For these, GPT-4o, Claude Opus, and Gemini 2 Pro are still significantly better. The practical strategy is to use small models for 80% of your tasks and route the hard 20% to frontier models.
The Routing Pattern
// Illustrative Task and ModelConfig shapes so the example type-checks; adapt to yours.
interface Task {
  complexity: 'low' | 'medium' | 'high'
  tokens: number
  requiresLatestKnowledge: boolean
}
interface ModelConfig {
  model: string
  provider: 'ollama' | 'anthropic'
}
interface ModelRouter {
  route(task: Task): ModelConfig
}
class CostOptimizedRouter implements ModelRouter {
route(task: Task): ModelConfig {
// Simple classification, extraction, formatting
if (task.complexity === 'low' && task.tokens < 4000) {
return { model: 'qwen3.5:9b', provider: 'ollama' }
}
// Medium complexity: code review, summarization, analysis
if (task.complexity === 'medium' && !task.requiresLatestKnowledge) {
return { model: 'qwen3.5:9b', provider: 'ollama' }
}
// Complex reasoning, creative tasks, agentic workflows
return { model: 'claude-sonnet-4-20250514', provider: 'anthropic' }
}
}
This pattern can cut your API costs by 60-80% while maintaining quality where it matters.
Fine-Tuning for Your Domain
The real superpower of small models is fine-tuning. You can specialize a 9B model to outperform GPT-4o on your specific domain with a few hundred high-quality examples.
Quick Fine-Tune with Unsloth
from unsloth import FastLanguageModel
# Load base model with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="Qwen/Qwen3.5-9B",
max_seq_length=8192,
load_in_4bit=True,
)
# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
model,
r=16, # LoRA rank
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_alpha=16,
lora_dropout=0,
use_gradient_checkpointing="unsloth",
)
# Train on your data
from trl import SFTTrainer
from transformers import TrainingArguments
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=your_dataset, # HuggingFace Dataset object
dataset_text_field="text",
max_seq_length=8192,
args=TrainingArguments(
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
warmup_steps=10,
num_train_epochs=3,
learning_rate=2e-4,
output_dir="outputs",
bf16=True,
),
)
trainer.train()
# Export to GGUF for Ollama
model.save_pretrained_gguf("qwen-custom", tokenizer, quantization_method="q4_k_m")
A LoRA fine-tune on 500 examples takes about 30 minutes on a single A100 or 2 hours on an RTX 4090. The resulting model can be served with Ollama just like the base model.
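The trainer above reads its examples from a `text` field (`dataset_text_field="text"`). A minimal sketch of flattening instruction/response pairs into that shape — the prompt template here is a generic placeholder, not Qwen's official chat template:

```python
pairs = [
    {"instruction": "Summarize: the cache invalidation bug was fixed in v2.1.",
     "response": "v2.1 fixes the cache invalidation bug."},
    {"instruction": "What does HTTP 429 mean?",
     "response": "Too Many Requests: the client exceeded a rate limit."},
]

def to_text(pair):
    # Generic instruction format; use the model's real chat template in production.
    return {"text": f"### Instruction:\n{pair['instruction']}\n\n"
                    f"### Response:\n{pair['response']}"}

rows = [to_text(p) for p in pairs]
# Wrap for the trainer: your_dataset = datasets.Dataset.from_list(rows)
print(rows[0]["text"].splitlines()[0])
```

Consistency matters more than the specific template: whatever format you train on is the format you must prompt with at inference time.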
What This Means for the Industry
The small model revolution has real consequences.
For startups: You can build AI-powered products without a massive API budget. A $2,000 GPU server running Qwen 3.5 9B can handle thousands of requests per day at zero marginal cost.
For enterprises: On-premise deployment becomes viable. No data sovereignty concerns, no vendor lock-in, predictable costs.
For researchers: OLMo 3's full openness means you can actually study how these models work, not just use them as black boxes.
For the frontier labs: The pressure to justify API pricing intensifies when a free, local model covers 80% of use cases. Expect frontier models to differentiate on reasoning depth, tool use, and multimodal capabilities — areas where small models still lag.
Getting Started Today
Here is a concrete action plan:
- Install Ollama and pull `qwen3.5:9b`. Takes 5 minutes.
- Replace one API call in your current project with a local model call. Pick something simple — text classification, formatting, or code explanation.
- Measure the quality delta. Run your existing test cases against the local model. You might be surprised how small the gap is.
- Set up a router that sends simple tasks to the local model and complex tasks to your API provider.
- Consider fine-tuning if you have domain-specific data. Even 100 high-quality examples can meaningfully improve performance on your specific task.
The models are here. The tooling is mature. The only thing left is to try them.