
Agent Memory Systems: How to Build AI Agents That Learn from Conversations

Implementation guide for agent memory: session management, long-term storage, context windows, and memory architectures for agents that remember past interactions.

Max Beech · Founder · 10 min read

TL;DR

  • Agents without memory forget everything between sessions. Users hate this.
  • Three memory types: Short-term (conversation history), Long-term (facts about user), Semantic (retrieved knowledge).
  • Buffer memory (simple): Keep last N messages. Works for <10 turn conversations.
  • Summary memory (better): Summarize old messages, keep recent ones. Scales to 100+ turns.
  • Entity memory (best for personalization): Extract facts about user (preferences, history), store in database.
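  • Cost: Memory increases input tokens per query, from about 33% more (entity facts only) to 5× (summary) or 14× (10-turn buffer) in the examples below. Optimize with sliding windows, summarization.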
  • Real example: Customer support agent with memory has 34% higher satisfaction vs memoryless.

# Agent Memory Systems: Build Agents That Remember

First conversation:

User: "I prefer communications by email, not phone."
Agent: "Got it, I'll note that."

Second conversation (next day):

User: "Can you contact me about this issue?"
Agent: "Sure! What's the best way to reach you: email or phone?"
User: 😡 "I told you yesterday, email only!"

Problem: Agent forgot. Users expect agents to remember context, preferences, past interactions.

Here's how to build memory into agents.

Three Types of Memory

1. Short-Term Memory (Conversational Context)

What: Recent conversation history (last 3-10 turns).

Duration: Current session only.

Use: Maintain coherent conversation flow.

Example:

User: "What's the weather in London?"
Agent: "It's 15°C and cloudy."
User: "What about tomorrow?"
Agent: [Knows "What about tomorrow" = weather in London tomorrow]

Implementation: Simple buffer (keep last N messages).

2. Long-Term Memory (User Facts)

What: Persistent facts about user (preferences, history, profile).

Duration: Across sessions (days, months, years).

Use: Personalization, continuity across conversations.

Example:

Session 1: User shares preference for email
Session 2 (next week): Agent remembers, uses email without asking

Implementation: Database storage (SQL, NoSQL, vector DB).

3. Semantic Memory (Retrieved Knowledge)

What: External knowledge retrieved on-demand (RAG).

Duration: Per-query (not stored in conversation).

Use: Answer questions using knowledge base without fine-tuning.

Example:

User: "What's our return policy?"
Agent: [Retrieves policy from knowledge base, doesn't memorize it]

Implementation: Vector database + retrieval. Covered in our RAG guide.

This guide focuses on Short-Term and Long-Term memory.

"Agent orchestration is where the real value lives. Individual AI capabilities matter less than how well you coordinate them into coherent workflows." - James Park, Founder of AI Infrastructure Labs

Short-Term Memory Strategies

Strategy 1: Buffer Memory (Simplest)

Keep last N messages in context window.

class BufferMemory:
    def __init__(self, max_messages=10):
        self.messages = []
        self.max_messages = max_messages

    def add_message(self, role, content):
        self.messages.append({"role": role, "content": content})
        if len(self.messages) > self.max_messages:
            self.messages.pop(0)  # Remove oldest

    def get_context(self):
        return self.messages

# Usage
memory = BufferMemory(max_messages=6)  # Last 3 turns (6 messages)

memory.add_message("user", "What's the weather?")
memory.add_message("assistant", "It's sunny, 22°C.")
memory.add_message("user", "What about tomorrow?")

# Agent sees: All 3 messages for context
context = memory.get_context()

Pros:

  • Simple (10 lines of code)
  • Preserves exact conversation

Cons:

  • Fixed size (drop old messages)
  • Doesn't scale (every message is resent on each request, so a 100-message history can cost tens of thousands of tokens per call)

Use when: Conversations <10 turns, <2K tokens total.
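Because the limit that matters is tokens rather than message count, a buffer can also trim by an estimated token budget. The sketch below is one way to do that, not part of the original design. `TokenBudgetBuffer` and `estimate_tokens` are hypothetical names, and the 4-characters-per-token rule is a rough assumption that a real tokenizer should replace.

```python
from collections import deque


def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token for English text.
    # Swap in a real tokenizer for accurate counts.
    return max(1, len(text) // 4)


class TokenBudgetBuffer:
    """Keeps the most recent messages that fit within a token budget."""

    def __init__(self, max_tokens: int = 2000):
        self.max_tokens = max_tokens
        self.messages = deque()
        self.total_tokens = 0

    def add_message(self, role: str, content: str) -> None:
        tokens = estimate_tokens(content)
        self.messages.append({"role": role, "content": content, "tokens": tokens})
        self.total_tokens += tokens
        # Evict the oldest messages until we are back under budget,
        # but always keep the newest message.
        while self.total_tokens > self.max_tokens and len(self.messages) > 1:
            dropped = self.messages.popleft()
            self.total_tokens -= dropped["tokens"]

    def get_context(self) -> list:
        # Strip the bookkeeping field before handing messages to the model.
        return [{"role": m["role"], "content": m["content"]} for m in self.messages]
```

With `max_tokens=2000` this matches the "<2K tokens total" guideline, whether the conversation is a few long messages or many short ones.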

Strategy 2: Summary Memory

Summarize old conversation, keep recent messages verbatim.

class SummaryMemory:
    def __init__(self, recent_k=4, summarize_threshold=10):
        self.messages = []
        self.summary = None
        self.recent_k = recent_k
        self.summarize_threshold = summarize_threshold

    def add_message(self, role, content):
        self.messages.append({"role": role, "content": content})

        if len(self.messages) > self.summarize_threshold:
            self._summarize_old_messages()

    def _summarize_old_messages(self):
        old_messages = self.messages[:-self.recent_k]

        # Render messages as a readable transcript rather than a Python list repr
        transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old_messages)

        # Carry the previous summary forward so earlier context is not lost
        if self.summary:
            transcript = f"Summary so far: {self.summary}\n{transcript}"

        # Use cheap model to summarize.
        # call_llm is a placeholder for your own LLM client wrapper.
        summary_prompt = f"Summarize this conversation:\n{transcript}"
        self.summary = call_llm(summary_prompt, model="gpt-3.5-turbo")

        # Keep only recent messages
        self.messages = self.messages[-self.recent_k:]

    def get_context(self):
        context = []
        if self.summary:
            context.append({"role": "system", "content": f"Summary of earlier conversation: {self.summary}"})
        context.extend(self.messages)
        return context

Example:

After 12 messages:

Summary: "User asked about product features. Agent explained A, B, C. User expressed interest in B."

Recent messages:
User: "What's the price for B?"
Agent: "$99/month"
User: "Any discounts?"

Total tokens: 150 (summary) + 50 (recent) = 200 tokens
vs Buffer: 1,200 tokens (all 12 messages)

Savings: 83% reduction in context tokens.

Pros:

  • Scales to long conversations (100+ turns)
  • Much cheaper than buffer (6× fewer tokens in the example above)

Cons:

  • Loses detail (summary compresses)
  • Summarization adds latency (extra LLM call)

Use when: Conversations >10 turns, cost-sensitive.
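Since summarization depends on an LLM call, it helps to be able to test the memory logic without one. The sketch below is a variant under assumptions, not the guide's reference implementation: the summarizer is passed in as a plain function, so a stub can stand in for the model during tests. `RollingSummaryMemory` and the `summarize` parameter are hypothetical names.

```python
from typing import Callable, Dict, List, Optional

Message = Dict[str, str]


class RollingSummaryMemory:
    """Summary memory with an injectable summarizer function."""

    def __init__(self, summarize: Callable[[str], str], recent_k: int = 4, threshold: int = 10):
        self.summarize = summarize  # e.g. a wrapper around your LLM client
        self.recent_k = recent_k
        self.threshold = threshold
        self.messages: List[Message] = []
        self.summary: Optional[str] = None

    def add_message(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})
        if len(self.messages) > self.threshold:
            self._compact()

    def _compact(self) -> None:
        old = self.messages[:-self.recent_k]
        transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
        # Fold the previous summary in so each pass builds on the last one.
        if self.summary:
            transcript = f"Previous summary: {self.summary}\n{transcript}"
        self.summary = self.summarize(transcript)
        self.messages = self.messages[-self.recent_k:]

    def get_context(self) -> List[Message]:
        context: List[Message] = []
        if self.summary:
            context.append({"role": "system", "content": f"Summary of earlier conversation: {self.summary}"})
        context.extend(self.messages)
        return context


# Stub summarizer for local testing: reports how many lines it was given.
memory = RollingSummaryMemory(summarize=lambda text: f"{len(text.splitlines())} lines summarized")
for i in range(11):
    memory.add_message("user", f"message {i}")
```

In production the stub would be replaced by a function that calls a cheap model. The memory class itself stays the same.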

Strategy 3: Sliding Window with Highlights

Keep recent messages + important moments from earlier.

class WindowMemory:
    def __init__(self, window_size=6, highlights_size=3):
        self.messages = []
        self.highlights = []  # Important messages
        self.window_size = window_size
        self.highlights_size = highlights_size

    def add_message(self, role, content, is_important=False):
        msg = {"role": role, "content": content}
        self.messages.append(msg)

        if is_important:
            self.highlights.append(msg)
            if len(self.highlights) > self.highlights_size:
                self.highlights.pop(0)

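    def get_context(self):
        recent = self.messages[-self.window_size:]
        # Skip highlights that are already in the recent window to avoid duplicates
        older_highlights = [h for h in self.highlights if not any(h is r for r in recent)]
        return older_highlights + recent  # Highlights + recent window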

How to determine "important":

def is_important(message):
    # Rule-based check first (free and fast)
    important_keywords = ["prefer", "always", "never", "email me", "don't call"]
    if any(kw in message.lower() for kw in important_keywords):
        return True

    # Otherwise fall back to a cheap LLM classifier.
    # call_llm is a placeholder for your own LLM client wrapper.
    prompt = f"Is this message important to remember? Answer yes or no: {message}"
    response = call_llm(prompt, model="gpt-3.5-turbo")
    return response.strip().lower().startswith("yes")

Use when: Need full detail + cost efficiency, can identify important moments.
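To show how the pieces fit together, here is a compact, self-contained sketch of a highlight window driven only by the keyword rule, with no LLM call. `HighlightWindow` and `looks_important` are hypothetical names for this example, and the keyword list is copied from the rule above.

```python
IMPORTANT_KEYWORDS = ("prefer", "always", "never", "email me", "don't call")


def looks_important(text: str) -> bool:
    # Keyword rule only; an LLM classifier could be added as a fallback.
    lowered = text.lower()
    return any(kw in lowered for kw in IMPORTANT_KEYWORDS)


class HighlightWindow:
    def __init__(self, window_size: int = 6, highlights_size: int = 3):
        self.messages = []
        self.highlights = []
        self.window_size = window_size
        self.highlights_size = highlights_size

    def add_message(self, role: str, content: str) -> None:
        msg = {"role": role, "content": content}
        self.messages.append(msg)
        if looks_important(content):
            # Keep only the newest highlights_size highlights.
            self.highlights = (self.highlights + [msg])[-self.highlights_size:]

    def get_context(self) -> list:
        recent = self.messages[-self.window_size:]
        # Drop highlights that are still inside the recent window (identity check).
        older = [h for h in self.highlights if not any(h is r for r in recent)]
        return older + recent


memory = HighlightWindow(window_size=2)
memory.add_message("user", "I prefer email, not phone.")
memory.add_message("assistant", "Noted.")
memory.add_message("user", "My order is late.")
memory.add_message("assistant", "Let me check on that.")
context = memory.get_context()
```

Here the preference message has scrolled out of the two-message window, but it still leads the context because it was flagged as a highlight.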

Long-Term Memory (Cross-Session)

Entity Memory

Extract facts about user, store persistently.

import sqlite3

class EntityMemory:
    def __init__(self, user_id):
        self.user_id = user_id
        self.db = sqlite3.connect('memory.db')
        self._create_table()

    def _create_table(self):
        self.db.execute("""
            CREATE TABLE IF NOT EXISTS user_facts (
                user_id TEXT,
                key TEXT,
                value TEXT,
                timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
                PRIMARY KEY (user_id, key)
            )
        """)

    def store_fact(self, key, value):
        self.db.execute("""
            INSERT OR REPLACE INTO user_facts (user_id, key, value)
            VALUES (?, ?, ?)
        """, (self.user_id, key, value))
        self.db.commit()

    def get_fact(self, key):
        cursor = self.db.execute("""
            SELECT value FROM user_facts
            WHERE user_id = ? AND key = ?
        """, (self.user_id, key))
        result = cursor.fetchone()
        return result[0] if result else None

    def get_all_facts(self):
        cursor = self.db.execute("""
            SELECT key, value FROM user_facts WHERE user_id = ?
        """, (self.user_id,))
        return dict(cursor.fetchall())

# Usage
memory = EntityMemory(user_id="user_123")

# Extract from conversation
message = "I prefer email communication, not phone calls."
# Use LLM to extract fact
fact_prompt = f"""
Extract key facts from this message in JSON format:
Message: {message}

Return: {{"key": "communication_preference", "value": "email"}}
"""
fact = extract_fact_with_llm(fact_prompt)
memory.store_fact(fact['key'], fact['value'])

# Later conversation
prefs = memory.get_all_facts()
# {'communication_preference': 'email'}

# Include in agent prompt
system_prompt = f"""
You are a helpful assistant.
User preferences: {prefs}
"""

What to store:

  • Communication preferences (email vs phone)
  • Product preferences (favorites, dislikes)
  • Interaction history (past purchases, tickets)
  • Personal context (timezone, language, role)

Extraction pipeline:

import json

def extract_entities_from_conversation(conversation):
    prompt = f"""
    Extract important facts about the user from this conversation.
    Return as JSON list: [{{"key": "...", "value": "..."}}, ...]

    Conversation:
    {conversation}

    Facts:
    """

    # call_llm is a placeholder for your own LLM client wrapper
    response = call_llm(prompt, model="gpt-4-turbo")
    facts = json.loads(response)  # Raises json.JSONDecodeError if the reply is not valid JSON
    return facts

Run after each conversation, store facts in database.
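Model output is not guaranteed to be clean JSON, so the "parse and store" step benefits from some defensive handling. The following is a minimal sketch under assumptions: it reuses the `user_facts` schema shown earlier, and `parse_facts` and `store_facts` are hypothetical helper names, not part of any library.

```python
import json
import sqlite3

FENCE = "`" * 3  # Markdown code-fence marker that models often wrap JSON in

SCHEMA = """
CREATE TABLE IF NOT EXISTS user_facts (
    user_id TEXT,
    key TEXT,
    value TEXT,
    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (user_id, key)
)
"""


def parse_facts(response: str) -> list:
    """Parse a model reply into [{"key": ..., "value": ...}]; return [] on bad output."""
    text = response.strip()
    if text.startswith(FENCE):
        text = text.strip("`")
        if text.startswith("json"):
            text = text[len("json"):]
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    # Keep only well-formed entries and normalize values to strings.
    return [
        {"key": str(item["key"]), "value": str(item["value"])}
        for item in data
        if isinstance(item, dict) and "key" in item and "value" in item
    ]


def store_facts(conn: sqlite3.Connection, user_id: str, facts: list) -> None:
    conn.executemany(
        "INSERT OR REPLACE INTO user_facts (user_id, key, value) VALUES (?, ?, ?)",
        [(user_id, f["key"], f["value"]) for f in facts],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
reply = FENCE + 'json\n[{"key": "communication_preference", "value": "email"}, {"oops": 1}]\n' + FENCE
facts = parse_facts(reply)
store_facts(conn, "user_123", facts)
```

Returning an empty list on malformed output means a bad extraction costs one missed update rather than a failed request. Whether to log or retry in that case is a design choice left open here.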

Memory Cost Analysis

Without memory (typical query):

System prompt: 100 tokens
User query: 50 tokens
Total input: 150 tokens
Cost: 150 × $0.01/1K = $0.0015

With buffer memory (10-turn conversation):

System prompt: 100 tokens
Conversation history: 2,000 tokens (10 turns)
User query: 50 tokens
Total input: 2,150 tokens
Cost: 2,150 × $0.01/1K = $0.0215

14× more expensive.

With summary memory (same conversation):

System prompt: 100 tokens
Summary: 200 tokens
Recent messages (4): 400 tokens
User query: 50 tokens
Total input: 750 tokens
Cost: 750 × $0.01/1K = $0.0075

About 3× cheaper than buffer (750 vs 2,150 tokens), 5× more expensive than no memory.

With entity memory only:

System prompt: 100 tokens
User facts: 50 tokens ("communication_preference: email")
User query: 50 tokens
Total input: 200 tokens
Cost: 200 × $0.01/1K = $0.002

33% more expensive than no memory, 10× cheaper than buffer.

Memory Cost Optimization

| Strategy | Tokens per Query | Cost per Query | Use Case |
|---|---|---|---|
| No memory | 150 | $0.0015 | One-off queries, no context needed |
| Entity only | 200 | $0.0020 | Personalization without conversation history |
| Summary | 750 | $0.0075 | Long conversations, cost-sensitive |
| Buffer (10 turns) | 2,150 | $0.0215 | Short conversations, need exact history |

Recommendation: Start with summary memory + entity memory. Best cost/quality trade-off.
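The per-query figures above can be reproduced with a few lines of arithmetic, which also makes it easy to plug in your own token counts and pricing. This is a sketch using the guide's illustrative numbers ($0.01 per 1K input tokens, 100-token system prompt, 50-token query). The names are hypothetical.

```python
PRICE_PER_1K_INPUT_TOKENS = 0.01  # USD; the illustrative rate used in this guide

SYSTEM_TOKENS = 100
QUERY_TOKENS = 50

# Extra tokens each strategy adds to the prompt (from the examples above)
MEMORY_TOKENS = {
    "No memory": 0,
    "Entity only": 50,
    "Summary": 200 + 400,  # summary + 4 recent messages
    "Buffer (10 turns)": 2000,
}


def query_cost(memory_tokens: int, price_per_1k: float = PRICE_PER_1K_INPUT_TOKENS):
    """Return (total input tokens, input cost in USD) for one query."""
    total_tokens = SYSTEM_TOKENS + memory_tokens + QUERY_TOKENS
    return total_tokens, total_tokens * price_per_1k / 1000


baseline_tokens, _ = query_cost(0)
for name, memory_tokens in MEMORY_TOKENS.items():
    tokens, cost = query_cost(memory_tokens)
    ratio = tokens / baseline_tokens
    print(f"{name:<18} {tokens:>5} tokens  ${cost:.4f}  {ratio:.1f}x baseline")
```

Only input tokens are counted. Output tokens and the extra summarization call are not included, so treat the results as a lower bound.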

Real-World Example: Customer Support Agent

Before memory:

  • User asks question → Agent answers → Session ends
  • Next question → Agent has no context
  • User satisfaction: 3.2/5

After adding memory:

  • Short-term: Summary memory (recent 4 messages + summary)
  • Long-term: Entity memory (user preferences, past tickets)
  • User satisfaction: 4.3/5 (+34%)

Cost impact:

  • Before: $0.0015/query
  • After: $0.0085/query (6× increase)
  • ROI: 34% satisfaction gain for 6× cost = worth it

Quote from Maria Santos, Head of Support: "Adding memory to our support agent was game-changing. Users stopped having to repeat themselves. Satisfaction jumped 34%, first-contact resolution improved 28%."

Hybrid Memory Architecture (Production)

Combine all three types:

class HybridMemory:
    def __init__(self, user_id):
        self.short_term = SummaryMemory()  # Conversation context
        self.long_term = EntityMemory(user_id)  # User facts
        self.semantic = RAGRetriever()  # Knowledge base

    def build_context(self, user_query):
        # 1. Get conversation history
        conversation_context = self.short_term.get_context()

        # 2. Get user facts
        user_facts = self.long_term.get_all_facts()

        # 3. Retrieve relevant knowledge
        knowledge = self.semantic.retrieve(user_query, top_k=3)

        # 4. Combine into prompt
        prompt = f"""
        User facts: {user_facts}

        Relevant knowledge:
        {knowledge}

        Conversation history:
        {conversation_context}

        User query: {user_query}
        """

        return prompt

Result: Agent has short-term context + knows user + accesses knowledge base.
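Chat-style APIs generally take a list of role-tagged messages rather than one large string, so the combined context could also be assembled that way. The sketch below assumes user facts arrive as a dict, knowledge as a list of text snippets, and conversation history as role/content dicts. `build_messages` is a hypothetical helper name.

```python
def build_messages(user_facts: dict, knowledge: list, conversation: list, user_query: str) -> list:
    """Assemble facts, retrieved knowledge, history and the new query into chat messages."""
    facts_text = "\n".join(f"- {k}: {v}" for k, v in sorted(user_facts.items())) or "- (none)"
    knowledge_text = "\n".join(f"- {snippet}" for snippet in knowledge) or "- (none)"

    system_content = (
        "You are a helpful assistant.\n\n"
        f"User facts:\n{facts_text}\n\n"
        f"Relevant knowledge:\n{knowledge_text}"
    )

    # System message first, then prior turns in order, then the new user query.
    return [
        {"role": "system", "content": system_content},
        *conversation,
        {"role": "user", "content": user_query},
    ]


messages = build_messages(
    user_facts={"communication_preference": "email"},
    knowledge=["Returns are accepted within 30 days."],
    conversation=[
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello! How can I help?"},
    ],
    user_query="Can you contact me about my return?",
)
```

Keeping facts and knowledge in the system message and history as separate turns preserves the role structure that the f-string version flattens.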

Frequently Asked Questions

How long should I keep conversation history?

Short-term: Current session only (clear after session ends or 30min inactivity)

Long-term: Forever (disk is cheap, user expects permanent memory)

Exception: Privacy-sensitive conversations (medical, legal). Auto-delete after N days per compliance.
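The 30-minute inactivity rule for short-term memory can be sketched as a buffer that clears itself when it has been idle too long. This is an illustration under assumptions: `SessionBuffer` is a hypothetical name, and the clock is injectable only so the behavior can be tested without waiting.

```python
import time

SESSION_TIMEOUT_SECONDS = 30 * 60  # 30 minutes of inactivity


class SessionBuffer:
    """Short-term memory that clears itself after a period of inactivity."""

    def __init__(self, timeout: float = SESSION_TIMEOUT_SECONDS, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.messages = []
        self.last_activity = clock()

    def _expire_if_idle(self) -> None:
        if self.clock() - self.last_activity > self.timeout:
            self.messages.clear()

    def add_message(self, role: str, content: str) -> None:
        self._expire_if_idle()
        self.messages.append({"role": role, "content": content})
        self.last_activity = self.clock()

    def get_context(self) -> list:
        self._expire_if_idle()
        return list(self.messages)
```

Reading the context does not count as activity here; only new messages reset the timer. That is a design choice you may want to change.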

What about GDPR/privacy regulations?

Store minimum necessary:

  • Short-term: Session-scoped, auto-delete after session
  • Long-term: Get user consent, provide deletion mechanism

Implementation:

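def delete_user_data(db, user_id):
    # GDPR right to erasure ("right to be forgotten")
    # db is an open database connection (e.g. sqlite3); conversation_history is assumed to exist in your schema
    db.execute("DELETE FROM user_facts WHERE user_id = ?", (user_id,))
    db.execute("DELETE FROM conversation_history WHERE user_id = ?", (user_id,))
    db.commit()  # Persist both deletions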
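Alongside on-request deletion, the "auto-delete after N days" policy mentioned earlier could be handled by a scheduled purge. The sketch below assumes the `user_facts` table and its `timestamp` column from the entity memory schema. `purge_old_facts` is a hypothetical helper name, and the retention period is whatever your compliance rules require.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS user_facts (
    user_id TEXT,
    key TEXT,
    value TEXT,
    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (user_id, key)
)
"""


def purge_old_facts(conn: sqlite3.Connection, max_age_days: int) -> int:
    """Delete facts older than max_age_days; return the number of rows removed."""
    cursor = conn.execute(
        # SQLite's datetime() accepts a modifier such as '-30 days'
        "DELETE FROM user_facts WHERE timestamp < datetime('now', ?)",
        (f"-{int(max_age_days)} days",),
    )
    conn.commit()
    return cursor.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
# One stale fact with an explicit old timestamp, one fresh fact with the default
conn.execute(
    "INSERT INTO user_facts (user_id, key, value, timestamp) VALUES (?, ?, ?, ?)",
    ("user_123", "old_ticket", "closed", "2000-01-01 00:00:00"),
)
conn.execute(
    "INSERT INTO user_facts (user_id, key, value) VALUES (?, ?, ?)",
    ("user_123", "timezone", "UTC-8"),
)
removed = purge_old_facts(conn, max_age_days=30)
```

Run from a daily job, this keeps long-term memory within the retention window without touching recent facts.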

How do I handle memory across multiple agents?

Shared memory store: All agents access same database.

# Agent A stores fact
memory_a = EntityMemory(user_id="user_123")
memory_a.store_fact("timezone", "UTC-8")

# Agent B retrieves fact
memory_b = EntityMemory(user_id="user_123")
timezone = memory_b.get_fact("timezone")  # "UTC-8"

Consistency: Both agents see same user facts.

---

Bottom line: Memory transforms stateless agents into personalized assistants. Use summary memory for conversations, entity memory for user facts. Costs 5-6× more but improves satisfaction 30-40% for customer-facing use cases.

Next: Read our Multi-Agent Systems guide for memory sharing across agents.
