The fundamental limitation of a raw LLM is that it has no persistent memory. Every conversation starts with a blank context window. Ask it about a meeting you had last Tuesday—it has no idea what you're talking about. This statelessness is fine for one-off queries but breaks down the moment you want an agent that learns, adapts, and builds on previous interactions.
Memory architecture is what transforms a stateless language model into a persistent AI agent. There are three distinct types of memory, each serving a different purpose, and the right combination depends on your use case.
The Three Types of Agent Memory
AGENT MEMORY ARCHITECTURE:
┌──────────────────────────────────────────────────────────┐
│                         AI Agent                         │
│                                                          │
│  ┌─────────────────┐  ┌───────────────┐  ┌───────────┐   │
│  │ Short-Term      │  │ Long-Term     │  │ Episodic  │   │
│  │ (Context        │  │ (Vector DB /  │  │ (Specific │   │
│  │  Window)        │  │  KV Store)    │  │  Events)  │   │
│  │                 │  │               │  │           │   │
│  │ Current session │  │ User prefs    │  │ Past      │   │
│  │ Last N messages │  │ Documents     │  │ decisions │   │
│  │ Working memory  │  │ Facts         │  │ Episodes  │   │
│  └─────────────────┘  └───────────────┘  └───────────┘   │
└──────────────────────────────────────────────────────────┘
Short-Term Memory: The Context Window
Short-term memory is the conversation history kept within a single session. It lives entirely inside the LLM's context window, a fixed-size buffer whose capacity is measured in tokens rather than in messages.
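Because the window is bounded, something has to decide which messages still fit. The sketch below shows one simple way to do that trimming. It assumes a rough estimate of four characters per token instead of a real tokenizer, and `estimateTokens` and `fitToBudget` are hypothetical names, not part of any SDK.

```typescript
interface ChatMessage {
  role: 'user' | 'assistant';
  content: string;
}

// Assumption: roughly 4 characters per token. This is a heuristic, not the
// model's actual tokenizer, so leave headroom when choosing a budget.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep the newest messages whose combined estimate fits within maxTokens.
function fitToBudget(messages: ChatMessage[], maxTokens: number): ChatMessage[] {
  const kept: ChatMessage[] = [];
  let used = 0;
  // Walk backwards so the most recent messages are considered first.
  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = estimateTokens(messages[i].content);
    if (used + cost > maxTokens) break;
    kept.unshift(messages[i]); // preserve chronological order
    used += cost;
  }
  return kept;
}
```

Trimming like this discards older messages outright, which is exactly the loss the rest of this section deals with.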
The Sliding Window Problem
As conversations grow longer, the history eventually outgrows the context window: either older messages are dropped and lost, or the request is typically rejected for exceeding the limit. A naive implementation simply includes all messages:
// ❌ Naive: will eventually exceed the context window
const response = await client.messages.create({
  model: 'claude-sonnet-4-5',
  max_tokens: 4096,
  messages: allMessages, // This grows without bound
});
Sliding Window with Summarization
The production solution is to maintain a fixed window of recent messages and summarize older context:
// lib/memory/short-term.ts
import Anthropic from '@anthropic-ai/sdk';

// The client reads ANTHROPIC_API_KEY from the environment by default
const client = new Anthropic();

interface Message {
  role: 'user' | 'assistant';
  content: string;
  timestamp: number;
}

export class ShortTermMemory {
  private messages: Message[] = [];
  private summary: string = '';
  private readonly windowSize: number;

  constructor(windowSize = 20) {
    this.windowSize = windowSize;
  }

  // Async so callers can await compression before reading the context
  async add(role: 'user' | 'assistant', content: string): Promise<void> {
    this.messages.push({ role, content, timestamp: Date.now() });
    // When the window is full, summarize the oldest messages and trim
    if (this.messages.length > this.windowSize) {
      await this.compressOlderMessages();
    }
  }

  private async compressOlderMessages(): Promise<void> {
    // Remove an even number of messages so user/assistant turns stay paired
    const toCompress = this.messages.splice(0, 10);
    const summaryResponse = await client.messages.create({
      model: 'claude-haiku-4-5', // Use a cheaper model for summarization
      max_tokens: 500,
      messages: [{
        role: 'user',
        content: `Summarize this conversation segment concisely, preserving key facts and decisions:\n\n${
          toCompress.map(m => `${m.role}: ${m.content}`).join('\n')
        }`,
      }],
    });
    // Assumes the first content block of the reply is a text block
    const newSummary = (summaryResponse.content[0] as { text: string }).text;
    this.summary = this.summary
      ? `${this.summary}\n\n[Later]: ${newSummary}`
      : newSummary;
  }

  // Returns the running summary (if any) followed by the recent messages
  getContextMessages(): { role: 'user' | 'assistant'; content: string }[] {
    const context: { role: 'user' | 'assistant'; content: string }[] = [];
    if (this.summary) {
      context.push({
        role: 'user',
        content: `[Previous conversation summary]: ${this.summary}`,
      });
      context.push({ role: 'assistant', content: 'I understand the context.' });
    }
    return [...context, ...this.messages.map(m => ({ role: m.role, content: m.content }))];
  }
}
Long-Term Memory: Vector Databases
Long-term memory persists across sessions. Keeping every past conversation in the context window does not scale, so long-term memory uses semantic search instead: convert memories to vector embeddings and retrieve only the most relevant ones when needed.
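To make the retrieval step concrete, here is a minimal in-memory sketch of semantic search: score each stored vector against the query vector by cosine similarity and keep the top matches. A vector database performs the same ranking at scale, usually with approximate indexes. The names `StoredMemory`, `cosineSimilarity`, and `topKMemories` are illustrative only.

```typescript
interface StoredMemory {
  content: string;
  vector: number[];
}

// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction, so treat it as unrelated to everything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank every memory against the query and return the k most similar.
function topKMemories(query: number[], memories: StoredMemory[], k: number): StoredMemory[] {
  return memories
    .map(memory => ({ memory, score: cosineSimilarity(query, memory.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map(scored => scored.memory);
}
```

This brute-force scan is linear in the number of memories, which is fine for a prototype but is the reason a dedicated index is used in practice.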
// lib/memory/long-term.ts
import { Pinecone } from '@pinecone-database/pinecone';
import Anthropic from '@anthropic-ai/sdk';
import { randomUUID } from 'crypto';

const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const index = pinecone.index('agent-memory');
const client = new Anthropic();

async function embed(text: string): Promise<number[]> {
  // Use a text-embedding model to convert text to a vector
  const response = await fetch('https://api.openai.com/v1/embeddings', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ model: 'text-embedding-3-small', input: text }),
  });
  // Fail loudly instead of reading an embedding out of an error payload
  if (!response.ok) {
    throw new Error(`Embedding request failed with status ${response.status}`);
  }
  const data = await response.json();
  return data.data[0].embedding;
}

export class LongTermMemory {
  constructor(private readonly userId: string) {}

  // Store a new memory (fact, preference, or conversation summary)
  async store(content: string, metadata: Record<string, string> = {}) {
    const vector = await embed(content);
    await index.upsert([{
      id: randomUUID(),
      values: vector,
      metadata: {
        userId: this.userId,
        content,
        storedAt: new Date().toISOString(),
        ...metadata,
      },
    }]);
  }

  // Retrieve the most relevant memories for a given query
  async retrieve(query: string, topK = 5): Promise<string[]> {
    const queryVector = await embed(query);
    const results = await index.query({
      vector: queryVector,
      topK,
      filter: { userId: this.userId }, // Never return another user's memories
      includeMetadata: true,
    });
    return results.matches
      // The 0.7 cutoff assumes a cosine-similarity index; tune it for your metric and embedding model
      .filter(m => (m.score ?? 0) > 0.7)
      .map(m => m.metadata?.content as string)
      .filter(Boolean);
  }
}

// Usage in an agent
async function agentResponseWithLongTermMemory(userId: string, userMessage: string) {
  const memory = new LongTermMemory(userId);

  // Retrieve relevant past memories before generating a response
  const relevantMemories = await memory.retrieve(userMessage);

  const response = await client.messages.create({
    model: 'claude-sonnet-4-5',
    max_tokens: 2048,
    system: `You are a personalized assistant. Use the user's memory context to give relevant, personalized responses.`,
    messages: [
      ...(relevantMemories.length > 0 ? [{
        role: 'user' as const,
        content: `[Relevant memories from past conversations]:\n${relevantMemories.join('\n')}`,
      }, {
        role: 'assistant' as const,
        content: 'I have reviewed the relevant context from our past interactions.',
      }] : []),
      { role: 'user', content: userMessage },
    ],
  });
  // Assumes the first content block of the reply is a text block
  const responseText = (response.content[0] as { text: string }).text;

  // Extract and store new facts from this conversation
  // (extractAndStoreMemories is an application helper not shown here)
  await extractAndStoreMemories(memory, userMessage, responseText);
  return responseText;
}
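The fact-extraction step called at the end of the agent function is not spelled out, so here is one possible shape for it, offered as a sketch and not as the definitive helper. It asks a model for durable facts, one per line, and stores each one. Everything here is an assumption for illustration: the `FactStore` and `Complete` types, the prompt wording, and the function name `extractAndStoreFacts`. The model call is injected as a plain function so the logic can be exercised without network access.

```typescript
// Hypothetical shapes. Any object with a compatible store() method fits FactStore,
// and Complete wraps whatever model call the application uses.
interface FactStore {
  store(content: string, metadata?: Record<string, string>): Promise<void>;
}
type Complete = (prompt: string) => Promise<string>;

async function extractAndStoreFacts(
  memory: FactStore,
  complete: Complete,
  userMessage: string,
  responseText: string,
): Promise<string[]> {
  const prompt =
    'List durable facts or preferences about the user from this exchange, ' +
    'one per line. Reply with NONE if there are none.\n\n' +
    `user: ${userMessage}\nassistant: ${responseText}`;
  const raw = await complete(prompt);

  const facts = raw
    .split('\n')
    // Strip a leading bullet ("- ", "* ") or list number ("1. ") if present.
    .map(line => line.replace(/^\s*(?:[-*]|\d+\.)\s+/, '').trim())
    .filter(line => line.length > 0 && line.toUpperCase() !== 'NONE');

  // Store sequentially to keep the example simple.
  for (const fact of facts) {
    await memory.store(fact, { type: 'fact' });
  }
  return facts;
}
```

In a real agent, `complete` would wrap a call to a small, inexpensive model, and the extracted facts would likely need deduplication against what is already stored.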
Episodic Memory: Remembering Specific Events
Episodic memory stores discrete, timestamped events — not just facts, but the full context of a specific interaction:
// lib/memory/episodic.ts
import { randomUUID } from 'crypto';
import { Pool } from 'pg';
import { LongTermMemory } from './long-term';

// Assumption: db is a node-postgres pool configured through the standard PG* environment variables
const db = new Pool();

interface Episode {
  id: string;
  userId: string;
  title: string;          // Summary of what happened
  outcome: string;        // What was decided or achieved
  context: string;        // Full conversation or event details
  participants: string[]; // Who was involved
  timestamp: Date;
  tags: string[];
}

export class EpisodicMemory {
  // Store a completed episode (e.g., end of a task or decision)
  async storeEpisode(episode: Omit<Episode, 'id'>) {
    await db.query(
      `INSERT INTO episodes (id, user_id, title, outcome, context, participants, timestamp, tags)
       VALUES ($1, $2, $3, $4, $5, $6, $7, $8)`,
      [
        randomUUID(),
        episode.userId,
        episode.title,
        episode.outcome,
        episode.context,
        episode.participants,
        episode.timestamp,
        episode.tags,
      ]
    );
    // Also embed for semantic search, scoped to the same user
    const longTermMemory = new LongTermMemory(episode.userId);
    await longTermMemory.store(
      `Episode: ${episode.title}. Outcome: ${episode.outcome}`,
      { type: 'episode', tags: episode.tags.join(',') }
    );
  }

  // Recall the most recent episodes whose title or outcome matches the query terms
  async recall(query: string, userId: string): Promise<Episode[]> {
    const episodes = await db.query<Episode>(
      // Alias user_id so the returned rows match the Episode interface
      `SELECT id, user_id AS "userId", title, outcome, context, participants, timestamp, tags
       FROM episodes
       WHERE user_id = $1
         AND to_tsvector('english', title || ' ' || outcome) @@ plainto_tsquery('english', $2)
       ORDER BY timestamp DESC
       LIMIT 3`,
      [userId, query]
    );
    return episodes.rows;
  }
}
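One caveat with the recall query: `plainto_tsquery` joins the search terms with AND, so passing a whole user message matches only episodes that contain every non-stop word. A possible workaround, sketched here under assumptions, is to build an OR-joined query string and pass it to `to_tsquery('english', $2)` instead. The helper name `toOrTsQuery`, the term limit, and the minimum word length are all illustrative choices.

```typescript
// Hypothetical helper: turns free text into an OR-joined tsquery string.
// Only ASCII letters and digits survive, which also removes characters that
// have special meaning in tsquery syntax (&, |, !, parentheses, quotes).
function toOrTsQuery(text: string, maxTerms = 8): string {
  const words = text
    .toLowerCase()
    .split(/[^a-z0-9]+/)          // split on anything that is not a letter or digit
    .filter(w => w.length > 2);   // skip very short tokens
  const unique = Array.from(new Set(words)).slice(0, maxTerms);
  return unique.join(' | ');
}
```

The WHERE clause would then read `to_tsvector('english', title || ' ' || outcome) @@ to_tsquery('english', $2)`. An empty result from the helper matches nothing, so the caller should skip the query in that case. OR matching trades precision for recall, so ranking the results (for example with `ts_rank`) becomes more important than it is with AND matching.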
Combining All Three Memory Types
In production, all three types work together:
// getSessionMemory, generateResponse, and persistNewFacts are application helpers not shown here
async function fullMemoryAgent(userId: string, message: string) {
  const shortTerm = await getSessionMemory(userId); // Current conversation
  const longTerm = new LongTermMemory(userId);
  const episodic = new EpisodicMemory();

  // 1. Retrieve relevant long-term memories and past episodes
  const [relevantFacts, relevantEpisodes] = await Promise.all([
    longTerm.retrieve(message),
    episodic.recall(message, userId),
  ]);

  // 2. Build enriched context
  const enrichedContext = [
    relevantFacts.length > 0 ? `User preferences/facts:\n${relevantFacts.join('\n')}` : '',
    relevantEpisodes.length > 0 ? `Relevant past episodes:\n${
      relevantEpisodes.map(e => `- ${e.title}: ${e.outcome}`).join('\n')
    }` : '',
  ].filter(Boolean).join('\n\n');

  // 3. Generate response using all memory layers
  // add() may summarize older messages through an API call, so await it
  await shortTerm.add('user', message);
  const response = await generateResponse(shortTerm.getContextMessages(), enrichedContext);
  await shortTerm.add('assistant', response);

  // 4. Extract and persist new facts for long-term memory
  await persistNewFacts(userId, message, response, longTerm);
  return response;
}
Conclusion
Memory architecture is the difference between an AI assistant and an AI agent that actually knows who you are and what you care about. Short-term memory handles the current conversation. Long-term vector memory surfaces relevant facts from the past. Episodic memory recalls specific events and decisions. Implementing all three together creates an agent that compounds knowledge over time — building a model of each user that makes every subsequent interaction more useful than the last.