How AI Agents Use Memory: Short-Term, Long-Term, and Episodic

The fundamental limitation of a raw LLM is that it has no persistent memory. Every conversation starts with a blank context window. Ask it about a meeting you had last Tuesday—it has no idea what you're talking about. This statelessness is fine for one-off queries but breaks down the moment you want an agent that learns, adapts, and builds on previous interactions.

Memory architecture is what transforms a stateless language model into a persistent AI agent. There are three distinct types of memory, each serving a different purpose, and the right combination depends on your use case.

The Three Types of Agent Memory

AGENT MEMORY ARCHITECTURE:

┌──────────────────────────────────────────────────────────┐
│                     AI Agent                             │
│                                                          │
│  ┌─────────────────┐  ┌───────────────┐  ┌───────────┐  │
│  │  Short-Term     │  │  Long-Term    │  │ Episodic  │  │
│  │  (Context       │  │  (Vector DB / │  │ (Specific │  │
│  │   Window)       │  │   KV Store)   │  │  Events)  │  │
│  │                 │  │               │  │           │  │
│  │ Current session │  │ User prefs    │  │ Past      │  │
│  │ Last N messages │  │ Documents     │  │ decisions │  │
│  │ Working memory  │  │ Facts         │  │ Episodes  │  │
│  └─────────────────┘  └───────────────┘  └───────────┘  │
└──────────────────────────────────────────────────────────┘

Short-Term Memory: The Context Window

Short-term memory is the conversation history kept within a single session. It lives entirely inside the LLM's context window — a fixed-size buffer of recent messages.

The Sliding Window Problem

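As conversations grow longer, the message history eventually exceeds the context window. At that point the request is either rejected or the oldest messages have to be dropped, and whatever they contained is lost. A naive implementation simply sends every message on every turn: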

// ❌ Naive: will eventually exceed the context window
const response = await client.messages.create({
  model: 'claude-sonnet-4-5',
  max_tokens: 4096,
  messages: allMessages, // This grows without bound
});
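
The simplest guard against this is a plain sliding window that keeps only the newest messages that fit a token budget. The following is a minimal sketch, not a definitive implementation: `trimToBudget` and `estimateTokens` are hypothetical helper names, and the four-characters-per-token ratio is a rough heuristic for English text rather than a real tokenizer.

```typescript
// Hypothetical helpers for illustration; not part of any SDK.
interface ChatMessage {
  role: 'user' | 'assistant';
  content: string;
}

// Rough heuristic (assumption): about 4 characters per token for English text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep the newest messages whose estimated size fits within `maxTokens`.
function trimToBudget(messages: ChatMessage[], maxTokens: number): ChatMessage[] {
  const kept: ChatMessage[] = [];
  let used = 0;

  // Walk backwards so the most recent messages are kept first.
  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = estimateTokens(messages[i].content);
    if (used + cost > maxTokens) break;
    kept.unshift(messages[i]);
    used += cost;
  }

  // Assumption: the conversation sent to the model should open with a user turn,
  // so drop any assistant messages left at the front after trimming.
  while (kept.length > 0 && kept[0].role !== 'user') {
    kept.shift();
  }
  return kept;
}
```

Passing `trimToBudget(allMessages, budget)` instead of `allMessages` keeps the request size bounded. The cost is that anything outside the window is forgotten outright.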

Sliding Window with Summarization

A common production approach is to keep a fixed window of recent messages verbatim and fold older messages into a running summary:

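// lib/memory/short-term.ts
import Anthropic from '@anthropic-ai/sdk';

// Reads ANTHROPIC_API_KEY from the environment
const client = new Anthropic();
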
interface Message {
  role: 'user' | 'assistant';
  content: string;
  timestamp: number;
}

export class ShortTermMemory {
  private messages: Message[] = [];
  private summary: string = '';
  private readonly windowSize: number;

  constructor(windowSize = 20) {
    this.windowSize = windowSize;
  }

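  // Async because compression calls the summarization model; callers should await it
  async add(role: 'user' | 'assistant', content: string): Promise<void> {
    this.messages.push({ role, content, timestamp: Date.now() });

    // When the window is full, summarize the oldest messages and trim them
    if (this.messages.length > this.windowSize) {
      await this.compressOlderMessages();
    }
  }

  private async compressOlderMessages(): Promise<void> {
    // Compress the older half of the window (10 messages at the default size of 20)
    const count = Math.max(1, Math.floor(this.windowSize / 2));
    const toCompress = this.messages.splice(0, count);

    try {
      const summaryResponse = await client.messages.create({
        model: 'claude-haiku-4-5', // Use a cheaper model for summarization
        max_tokens: 500,
        messages: [{
          role: 'user',
          content: `Summarize this conversation segment concisely, preserving key facts and decisions:\n\n${
            toCompress.map(m => `${m.role}: ${m.content}`).join('\n')
          }`,
        }],
      });

      // Only text blocks carry a `text` field, so check the type instead of casting
      const first = summaryResponse.content[0];
      const newSummary = first?.type === 'text' ? first.text : '';
      if (!newSummary) throw new Error('Summarization returned no text');

      this.summary = this.summary
        ? `${this.summary}\n\n[Later]: ${newSummary}`
        : newSummary;
    } catch (err) {
      // Put the messages back so a failed summarization does not lose history
      this.messages.unshift(...toCompress);
      throw err;
    }
  }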

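  getContextMessages(): { role: 'user' | 'assistant'; content: string }[] {
    // Typed explicitly so the result can be passed straight to client.messages.create
    const context: { role: 'user' | 'assistant'; content: string }[] = [];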

    if (this.summary) {
      context.push({
        role: 'user' as const,
        content: `[Previous conversation summary]: ${this.summary}`,
      });
      context.push({ role: 'assistant' as const, content: 'I understand the context.' });
    }

    return [...context, ...this.messages.map(m => ({ role: m.role, content: m.content }))];
  }
}

Long-Term Memory: Vector Databases

Long-term memory persists across sessions. Keeping every past conversation in the context window is not feasible, because the history soon outgrows any fixed token limit. Long-term memory relies on semantic search instead: memories are converted to vector embeddings, and only the most relevant ones are retrieved when needed.
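
To make "most relevant" concrete: relevance in a vector store is commonly scored with cosine similarity between the query embedding and each stored embedding. The sketch below shows that scoring over an in-memory array. It is an illustration under stated assumptions, not how a vector database works internally: real embeddings have hundreds or thousands of dimensions, and a vector database typically uses an approximate nearest-neighbor index rather than a linear scan. `StoredMemory`, `cosineSimilarity`, and `topKBySimilarity` are hypothetical names.

```typescript
// Hypothetical in-memory stand-in for a vector store, for illustration only.
interface StoredMemory {
  content: string;
  vector: number[];
}

// Cosine similarity: dot product divided by the product of the vector lengths.
// Returns 1 for identical directions and 0 for orthogonal vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // Avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every memory against the query and return the k best matches.
function topKBySimilarity(query: number[], memories: StoredMemory[], k: number): StoredMemory[] {
  return memories
    .map(memory => ({ memory, score: cosineSimilarity(query, memory.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map(entry => entry.memory);
}
```

The listing that follows delegates this scoring to a hosted vector database, which returns a comparable `score` for each match.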

// lib/memory/long-term.ts
import { Pinecone } from '@pinecone-database/pinecone';
import Anthropic from '@anthropic-ai/sdk';
import { randomUUID } from 'crypto';

const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const index = pinecone.index('agent-memory');
const client = new Anthropic();

async function embed(text: string): Promise<number[]> {
  // Use a text-embedding model to convert text to a vector
  const response = await fetch('https://api.openai.com/v1/embeddings', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ model: 'text-embedding-3-small', input: text }),
  });
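  if (!response.ok) {
    throw new Error(`Embedding request failed: ${response.status} ${response.statusText}`);
  }
  const data = (await response.json()) as { data: { embedding: number[] }[] };
  return data.data[0].embedding;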
}

export class LongTermMemory {
  constructor(private readonly userId: string) {}

  // Store a new memory (fact, preference, or conversation summary)
  async store(content: string, metadata: Record<string, string> = {}) {
    const vector = await embed(content);

    await index.upsert([{
      id: randomUUID(),
      values: vector,
      metadata: {
        userId: this.userId,
        content,
        storedAt: new Date().toISOString(),
        ...metadata,
      },
    }]);
  }

  // Retrieve the most relevant memories for a given query
  async retrieve(query: string, topK = 5): Promise<string[]> {
    const queryVector = await embed(query);

    const results = await index.query({
      vector: queryVector,
      topK,
      filter: { userId: this.userId },
      includeMetadata: true,
    });

    return results.matches
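      // Only high-relevance memories; 0.7 is a starting point, since score
      // distributions differ between embedding models and should be tuned
      .filter(m => m.score && m.score > 0.7)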
      .map(m => m.metadata?.content as string)
      .filter(Boolean);
  }
}

// Usage in an agent
async function agentResponseWithLongTermMemory(userId: string, userMessage: string) {
  const memory = new LongTermMemory(userId);

  // Retrieve relevant past memories before generating a response
  const relevantMemories = await memory.retrieve(userMessage);

  const response = await client.messages.create({
    model: 'claude-sonnet-4-5',
    max_tokens: 2048,
    system: `You are a personalized assistant. Use the user's memory context to give relevant, personalized responses.`,
    messages: [
      ...(relevantMemories.length > 0 ? [{
        role: 'user' as const,
        content: `[Relevant memories from past conversations]:\n${relevantMemories.join('\n')}`,
      }, {
        role: 'assistant' as const,
        content: 'I have reviewed the relevant context from our past interactions.',
      }] : []),
      { role: 'user', content: userMessage },
    ],
  });

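  // Only text blocks carry a `text` field, so check the type instead of casting
  const firstBlock = response.content[0];
  const responseText = firstBlock?.type === 'text' ? firstBlock.text : '';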

  // Extract and store new facts from this conversation
  await extractAndStoreMemories(memory, userMessage, responseText);

  return responseText;
}
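
The `extractAndStoreMemories` helper is called above but never defined. One way to approach it, sketched here under stated assumptions, is to ask a model to list durable facts one per line, parse that list, and store each fact. Everything in this block is hypothetical: the prompt wording, the `MemoryStore` and `Completion` types, and the fourth `complete` parameter, which is an addition so the sketch runs without network access. In the article's code it could be a thin wrapper around `client.messages.create`.

```typescript
// Hypothetical sketch; the article does not define this helper.
// Anything with a compatible store() method works, including LongTermMemory.
interface MemoryStore {
  store(content: string, metadata?: Record<string, string>): Promise<void>;
}

// Assumption: a function that sends a prompt to a model and resolves to its text reply.
type Completion = (prompt: string) => Promise<string>;

// Turn a line-per-fact model reply into clean strings.
// Strips leading list markers ("- ", "* ", "1. ", "2) ") and drops blank or NONE lines.
function parseFactLines(modelOutput: string): string[] {
  return modelOutput
    .split('\n')
    .map(line => line.replace(/^\s*(?:[-*•]|\d+[.)])\s+/, '').trim())
    .filter(line => line.length > 0 && line.toUpperCase() !== 'NONE');
}

async function extractAndStoreMemories(
  memory: MemoryStore,
  userMessage: string,
  responseText: string,
  complete: Completion,
): Promise<string[]> {
  const prompt =
    'List any durable facts or preferences about the user from this exchange, ' +
    'one per line. Reply with NONE if there are none.\n\n' +
    `User: ${userMessage}\nAssistant: ${responseText}`;

  const facts = parseFactLines(await complete(prompt));

  // Store facts one at a time so each gets its own embedding.
  for (const fact of facts) {
    await memory.store(fact, { type: 'fact' });
  }
  return facts;
}
```

Keeping the parsing separate from the model call makes the parsing easy to unit test, and it leaves room to deduplicate facts against existing memories before storing them.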

Episodic Memory: Remembering Specific Events

Episodic memory stores discrete, timestamped events — not just facts, but the full context of a specific interaction:

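// lib/memory/episodic.ts
import { randomUUID } from 'crypto';
import { Pool } from 'pg';
import { LongTermMemory } from './long-term';

// Assumption: a node-postgres pool configured through DATABASE_URL
const db = new Pool({ connectionString: process.env.DATABASE_URL });

interface Episode {
  id: string;
  userId: string;
  title: string;           // Summary of what happened
  outcome: string;         // What was decided or achieved
  context: string;         // Full conversation or event details
  participants: string[];  // Who was involved
  timestamp: Date;
  tags: string[];
}

export class EpisodicMemory {
  // Store a completed episode (e.g., end of a task or decision)
  async storeEpisode(episode: Omit<Episode, 'id'>) {
    await db.query(
      `INSERT INTO episodes (id, user_id, title, outcome, context, participants, timestamp, tags)
       VALUES ($1, $2, $3, $4, $5, $6, $7, $8)`,
      [
        randomUUID(),
        episode.userId,
        episode.title,
        episode.outcome,
        episode.context,
        episode.participants,
        episode.timestamp,
        episode.tags,
      ]
    );

    // Also embed for semantic search, scoped to the same user
    const longTermMemory = new LongTermMemory(episode.userId);
    await longTermMemory.store(
      `Episode: ${episode.title}. Outcome: ${episode.outcome}`,
      { type: 'episode', tags: episode.tags.join(',') }
    );
  }

  // Recall the most recent episodes whose title or outcome matches the query keywords.
  // plainto_tsquery requires every term to match, so short, focused queries work best.
  async recall(query: string, userId: string): Promise<Episode[]> {
    const episodes = await db.query<Episode>(
      // Alias user_id so the row shape matches the camelCase Episode interface,
      // and use the same 'english' configuration on both sides of the match
      `SELECT id, user_id AS "userId", title, outcome, context, participants, timestamp, tags
       FROM episodes
       WHERE user_id = $1
       AND to_tsvector('english', title || ' ' || outcome) @@ plainto_tsquery('english', $2)
       ORDER BY timestamp DESC
       LIMIT 3`,
      [userId, query]
    );
    return episodes.rows;
  }
}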
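
One caveat with keyword recall: `plainto_tsquery` combines every term with AND, so passing a full user message as the query will rarely match anything. A possible workaround, sketched below, is to reduce the message to a handful of keywords and join them with OR for PostgreSQL's `to_tsquery`. `toOrTsQuery` and its stop-word list are hypothetical, and the tokenizer assumes English text in ASCII.

```typescript
// Hypothetical helper: builds an OR query string for PostgreSQL's to_tsquery.
// The stop-word list is a small illustrative sample, not a complete one.
const STOP_WORDS = new Set([
  'the', 'and', 'for', 'with', 'that', 'this', 'what', 'about',
  'have', 'from', 'did', 'was', 'were', 'are', 'you', 'our', 'we',
]);

function toOrTsQuery(message: string, maxTerms = 8): string {
  const terms = message
    .toLowerCase()
    // Split on anything that is not a letter or digit, so tsquery operators
    // such as |, &, ! and : can never leak into the generated query.
    .split(/[^a-z0-9]+/)
    .filter(term => term.length > 1 && !STOP_WORDS.has(term));

  // Deduplicate while preserving order, cap the number of terms, join with OR.
  return Array.from(new Set(terms)).slice(0, maxTerms).join(' | ');
}
```

The result would be passed as `$2` to `to_tsquery('english', $2)` in place of `plainto_tsquery`. When the function returns an empty string, the caller should skip the query, since there is nothing to search for.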

Combining All Three Memory Types

In production, all three types work together:

async function fullMemoryAgent(userId: string, message: string) {
  const shortTerm = await getSessionMemory(userId);   // Current conversation
  const longTerm = new LongTermMemory(userId);
  const episodic = new EpisodicMemory();

  // 1. Retrieve relevant long-term memories and past episodes
  const [relevantFacts, relevantEpisodes] = await Promise.all([
    longTerm.retrieve(message),
    episodic.recall(message, userId),
  ]);

  // 2. Build enriched context
  const enrichedContext = [
    relevantFacts.length > 0 ? `User preferences/facts:\n${relevantFacts.join('\n')}` : '',
    relevantEpisodes.length > 0 ? `Relevant past episodes:\n${
      relevantEpisodes.map(e => `- ${e.title}: ${e.outcome}`).join('\n')
    }` : '',
  ].filter(Boolean).join('\n\n');

  // 3. Generate response using all memory layers
  // add() may summarize older messages, so wait for it before reading the context
  await shortTerm.add('user', message);
  const response = await generateResponse(shortTerm.getContextMessages(), enrichedContext);
  await shortTerm.add('assistant', response);

  // 4. Extract and persist new facts for long-term memory
  await persistNewFacts(userId, message, response, longTerm);

  return response;
}
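
The `generateResponse` helper is not shown above. One plausible shape for it, sketched here as an assumption rather than the author's implementation, is to place the retrieved context in the system prompt and pass the short-term messages through unchanged. `buildRequest`, `ChatTurn`, `RequestParams`, and the prompt wording are hypothetical. The field names mirror the `client.messages.create` calls used earlier in this article.

```typescript
// Hypothetical sketch of how retrieved context could be combined with the session messages.
interface ChatTurn {
  role: 'user' | 'assistant';
  content: string;
}

interface RequestParams {
  model: string;
  max_tokens: number;
  system: string;
  messages: ChatTurn[];
}

const BASE_SYSTEM_PROMPT = 'You are a personalized assistant.';

// Put long-term facts and episodes in the system prompt so they inform the reply
// without appearing as conversation turns.
function buildRequest(messages: ChatTurn[], enrichedContext: string): RequestParams {
  const system = enrichedContext
    ? `${BASE_SYSTEM_PROMPT}\n\nKnown context about this user (may be incomplete or outdated):\n${enrichedContext}`
    : BASE_SYSTEM_PROMPT;

  return {
    model: 'claude-sonnet-4-5',
    max_tokens: 2048,
    system,
    messages,
  };
}
```

A `generateResponse` built on this would pass `buildRequest(messages, enrichedContext)` to `client.messages.create` and return the text of the first content block. Keeping memories in the system prompt, rather than as synthetic user and assistant turns, keeps the short-term history an accurate record of what was actually said.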

Conclusion

Memory architecture is the difference between an AI assistant and an AI agent that actually knows who you are and what you care about. Short-term memory handles the current conversation. Long-term vector memory surfaces relevant facts from the past. Episodic memory recalls specific events and decisions. Implementing all three together creates an agent that compounds knowledge over time — building a model of each user that makes every subsequent interaction more useful than the last.