Agent Memory Systems: How to Build AI Agents That Learn from Conversations
Implementation guide for agent memory: session management, long-term storage, context windows, and memory architectures for agents that remember past interactions.

TL;DR
- Agents without memory forget everything between sessions. Users hate this.
- Three memory types: Short-term (conversation history), Long-term (facts about user), Semantic (retrieved knowledge).
- Buffer memory (simple): Keep last N messages. Works for <10 turn conversations.
- Summary memory (better): Summarize old messages, keep recent ones. Scales to 100+ turns.
- Entity memory (best for personalization): Extract facts about user (preferences, history), store in database.
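- Cost: Memory adds input tokens on every query. In the worked example below, entity memory adds about 33%, summary memory about 5×, and a full 10-turn buffer about 14×. Optimize with sliding windows and summarization.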
- Real example: Customer support agent with memory has 34% higher satisfaction vs memoryless.
# Agent Memory Systems: Build Agents That Remember
First conversation:
User: "I prefer communications by email, not phone."
Agent: "Got it, I'll note that."

Second conversation (next day):
User: "Can you contact me about this issue?"
Agent: "Sure! What's the best way to reach you: email or phone?"
User: 😡 "I told you yesterday, email only!"

Problem: The agent forgot. Users expect agents to remember context, preferences, and past interactions.
Here's how to build memory into agents.
Three Types of Memory
1. Short-Term Memory (Conversational Context)
What: Recent conversation history (last 3-10 turns).
Duration: Current session only.
Use: Maintain coherent conversation flow.
Example:
User: "What's the weather in London?"
Agent: "It's 15°C and cloudy."
User: "What about tomorrow?"
Agent: [Knows "What about tomorrow" = weather in London tomorrow]

Implementation: Simple buffer (keep last N messages).
2. Long-Term Memory (User Facts)
What: Persistent facts about user (preferences, history, profile).
Duration: Across sessions (days, months, years).
Use: Personalization, continuity across conversations.
Example:
Session 1: User shares preference for email
Session 2 (next week): Agent remembers, uses email without asking

Implementation: Database storage (SQL, NoSQL, vector DB).
3. Semantic Memory (Retrieved Knowledge)
What: External knowledge retrieved on-demand (RAG).
Duration: Per-query (not stored in conversation).
Use: Answer questions using knowledge base without fine-tuning.
Example:
User: "What's our return policy?"
Agent: [Retrieves policy from knowledge base, doesn't memorize it]

Implementation: Vector database + retrieval. Covered in our RAG guide.
This guide focuses on Short-Term and Long-Term memory.
"Agent orchestration is where the real value lives. Individual AI capabilities matter less than how well you coordinate them into coherent workflows." - James Park, Founder of AI Infrastructure Labs
Short-Term Memory Strategies
Strategy 1: Buffer Memory (Simplest)
Keep last N messages in context window.

    class BufferMemory:
        def __init__(self, max_messages=10):
            self.messages = []
            self.max_messages = max_messages

        def add_message(self, role, content):
            self.messages.append({"role": role, "content": content})
            if len(self.messages) > self.max_messages:
                self.messages.pop(0)  # Remove oldest

        def get_context(self):
            return self.messages

    # Usage
    memory = BufferMemory(max_messages=6)  # Last 3 turns (6 messages)
    memory.add_message("user", "What's the weather?")
    memory.add_message("assistant", "It's sunny, 22°C.")
    memory.add_message("user", "What about tomorrow?")

    # Agent sees: All 3 messages for context
    context = memory.get_context()

Pros:
- Simple (10 lines of code)
- Preserves exact conversation
Cons:
- Fixed size (drop old messages)
- Doesn't scale (100 messages = 50K+ tokens = expensive)
Use when: Conversations <10 turns, <2K tokens total.
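The "<2K tokens" guideline suggests trimming by token count rather than message count, since one long message can use the whole budget. The following is a minimal sketch of that idea, not part of the original implementation. `TokenBudgetBuffer` and `estimate_tokens` are hypothetical names, and the 4-characters-per-token estimate is a rough assumption that you would replace with your model's real tokenizer.

```python
from collections import deque


def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token (a heuristic, not a tokenizer)."""
    return max(1, len(text) // 4)


class TokenBudgetBuffer:
    """Keeps the most recent messages that fit within max_tokens."""

    def __init__(self, max_tokens: int = 2000):
        self.max_tokens = max_tokens
        self.messages = deque()
        self.total_tokens = 0

    def add_message(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})
        self.total_tokens += estimate_tokens(content)
        # Evict the oldest messages until the buffer is back under budget,
        # but always keep the newest message even if it alone exceeds it.
        while self.total_tokens > self.max_tokens and len(self.messages) > 1:
            oldest = self.messages.popleft()
            self.total_tokens -= estimate_tokens(oldest["content"])

    def get_context(self) -> list:
        return list(self.messages)
```

The interface matches `BufferMemory` (`add_message`, `get_context`), so the two can be swapped without touching the calling code.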
Strategy 2: Summary Memory
Summarize old conversation, keep recent messages verbatim.

    class SummaryMemory:
        def __init__(self, recent_k=4, summarize_threshold=10):
            self.messages = []
            self.summary = None
            self.recent_k = recent_k
            self.summarize_threshold = summarize_threshold

        def add_message(self, role, content):
            self.messages.append({"role": role, "content": content})
            if len(self.messages) > self.summarize_threshold:
                self._summarize_old_messages()

        def _summarize_old_messages(self):
            old_messages = self.messages[:-self.recent_k]
            # Fold the previous summary into the prompt so earlier context
            # is not discarded each time we re-summarize
            previous = f"Previous summary: {self.summary}\n" if self.summary else ""
            summary_prompt = f"Summarize this conversation:\n{previous}{old_messages}"
            # Use cheap model to summarize (call_llm is your own LLM client wrapper)
            self.summary = call_llm(summary_prompt, model="gpt-3.5-turbo")
            # Keep only recent messages
            self.messages = self.messages[-self.recent_k:]

        def get_context(self):
            context = []
            if self.summary:
                context.append({"role": "system", "content": f"Summary of earlier conversation: {self.summary}"})
            context.extend(self.messages)
            return context

Example:
After 12 messages:
Summary: "User asked about product features. Agent explained A, B, C. User expressed interest in B."
Recent messages:
User: "What's the price for B?"
Agent: "$99/month"
User: "Any discounts?"
Total tokens: 150 (summary) + 50 (recent) = 200 tokens
vs Buffer: 1,200 tokens (all 12 messages)

Savings: 83% reduction in context tokens.
Pros:
- Scales to long conversations (100+ turns)
- Much cheaper than buffer (6× fewer tokens in the example above)
Cons:
- Loses detail (summary compresses)
- Summarization adds latency (extra LLM call)
Use when: Conversations >10 turns, cost-sensitive.
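The summarization prompt above interpolates the raw Python list of message dicts. A plain transcript is usually easier for a model to read and uses fewer tokens. The helper below is an illustrative sketch: `format_transcript` is a hypothetical name, and it assumes messages use the same `{"role": ..., "content": ...}` shape as the classes in this guide.

```python
def format_transcript(messages, existing_summary=None):
    """Render chat messages as plain text for a summarization prompt."""
    lines = []
    if existing_summary:
        # Carry the earlier summary forward so it is re-summarized, not lost.
        lines.append(f"Summary so far: {existing_summary}")
    for msg in messages:
        lines.append(f"{msg['role'].capitalize()}: {msg['content']}")
    return "\n".join(lines)


if __name__ == "__main__":
    history = [
        {"role": "user", "content": "What's the price for B?"},
        {"role": "assistant", "content": "$99/month"},
    ]
    print(format_transcript(history, existing_summary="User is interested in B."))
```

The returned string can be dropped into the prompt in place of the raw list.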
Strategy 3: Sliding Window with Highlights
Keep recent messages + important moments from earlier.

    class WindowMemory:
        def __init__(self, window_size=6, highlights_size=3):
            self.messages = []
            self.highlights = []  # Important messages
            self.window_size = window_size
            self.highlights_size = highlights_size

        def add_message(self, role, content, is_important=False):
            msg = {"role": role, "content": content}
            self.messages.append(msg)
            if is_important:
                self.highlights.append(msg)
                if len(self.highlights) > self.highlights_size:
                    self.highlights.pop(0)

        def get_context(self):
            recent = self.messages[-self.window_size:]
            # Skip highlights that are still inside the recent window,
            # otherwise the same message would appear twice
            older = [h for h in self.highlights if not any(h is r for r in recent)]
            return older + recent  # Highlights + recent window

How to determine "important":

    def is_important(message):
        # Rule-based check first (free and fast)
        important_keywords = ["prefer", "always", "never", "email me", "don't call"]
        if any(kw in message.lower() for kw in important_keywords):
            return True
        # Otherwise fall back to a cheap LLM classifier (call_llm is your own client wrapper)
        prompt = f"Is this message important to remember? (yes/no): {message}"
        response = call_llm(prompt, model="gpt-3.5-turbo")
        return response.strip().lower().startswith("yes")

Use when: You need full detail on key moments plus cost efficiency, and you can identify important moments.
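The keyword check above uses substring matching, so "never" also fires on "nevertheless". A whole-word match avoids that, at the price of missing inflections such as "prefers" unless they are listed explicitly. This is a sketch of one option. `matches_important_phrase` is a hypothetical helper, and the phrase list is copied from the rule-based example.

```python
import re

IMPORTANT_PHRASES = ["prefer", "always", "never", "email me", "don't call"]

# \b anchors each phrase at word boundaries, so "never" does not
# match inside "nevertheless". re.escape keeps the apostrophe literal.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in IMPORTANT_PHRASES) + r")\b",
    re.IGNORECASE,
)


def matches_important_phrase(message: str) -> bool:
    """Return True if the message contains one of the phrases as whole words."""
    return _PATTERN.search(message) is not None
```

A check like this could run before the LLM classifier so that the paid call only happens when no phrase matches.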
Long-Term Memory (Cross-Session)
Entity Memory
Extract facts about user, store persistently.

    import sqlite3

    class EntityMemory:
        def __init__(self, user_id):
            self.user_id = user_id
            self.db = sqlite3.connect('memory.db')
            self._create_table()

        def _create_table(self):
            self.db.execute("""
                CREATE TABLE IF NOT EXISTS user_facts (
                    user_id TEXT,
                    key TEXT,
                    value TEXT,
                    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
                    PRIMARY KEY (user_id, key)
                )
            """)

        def store_fact(self, key, value):
            # INSERT OR REPLACE overwrites an existing fact with the same key
            self.db.execute("""
                INSERT OR REPLACE INTO user_facts (user_id, key, value)
                VALUES (?, ?, ?)
            """, (self.user_id, key, value))
            self.db.commit()

        def get_fact(self, key):
            cursor = self.db.execute("""
                SELECT value FROM user_facts
                WHERE user_id = ? AND key = ?
            """, (self.user_id, key))
            result = cursor.fetchone()
            return result[0] if result else None

        def get_all_facts(self):
            cursor = self.db.execute("""
                SELECT key, value FROM user_facts WHERE user_id = ?
            """, (self.user_id,))
            return dict(cursor.fetchall())

    # Usage
    memory = EntityMemory(user_id="user_123")

    # Extract from conversation
    message = "I prefer email communication, not phone calls."

    # Use LLM to extract fact
    fact_prompt = f"""
    Extract key facts from this message in JSON format:
    Message: {message}
    Return: {{"key": "communication_preference", "value": "email"}}
    """
    # extract_fact_with_llm is your own helper: it calls the model
    # and parses the JSON reply into a dict
    fact = extract_fact_with_llm(fact_prompt)
    memory.store_fact(fact['key'], fact['value'])

    # Later conversation
    prefs = memory.get_all_facts()
    # {'communication_preference': 'email'}

    # Include in agent prompt
    system_prompt = f"""
    You are a helpful assistant.
    User preferences: {prefs}
    """

What to store:
- Communication preferences (email vs phone)
- Product preferences (favorites, dislikes)
- Interaction history (past purchases, tickets)
- Personal context (timezone, language, role)
Extraction pipeline:

    import json

    def extract_entities_from_conversation(conversation):
        prompt = f"""
    Extract important facts about the user from this conversation.
    Return as JSON list: [{{"key": "...", "value": "..."}}, ...]
    Conversation:
    {conversation}
    Facts:
    """
        # call_llm is your own LLM client wrapper; it should return the raw text reply
        response = call_llm(prompt, model="gpt-4-turbo")
        facts = json.loads(response)
        return facts

Run this after each conversation and store the resulting facts in the database.
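Calling `json.loads` directly on a model reply raises an exception whenever the model adds prose or wraps the JSON in a code fence. The sketch below shows one defensive way to parse the reply. `parse_facts` is a hypothetical helper, and the decision to return an empty list on failure (rather than retrying the model) is an assumption made for illustration.

```python
import json

FENCE = "`" * 3  # Markdown code fence that models often wrap JSON in


def parse_facts(response: str) -> list:
    """Parse an LLM reply into a list of {"key": ..., "value": ...} dicts.

    Returns an empty list instead of raising when the reply is unusable.
    """
    text = response.strip()
    if text.startswith(FENCE):
        # Drop the surrounding backticks and an optional "json" language tag.
        text = text.strip("`")
        if text.startswith("json"):
            text = text[len("json"):]
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    facts = []
    for item in data:
        # Keep only well-formed entries; values are stored as text.
        if isinstance(item, dict) and isinstance(item.get("key"), str) and "value" in item:
            facts.append({"key": item["key"], "value": str(item["value"])})
    return facts
```

Each returned entry can be passed to `store_fact(fact["key"], fact["value"])`.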
Memory Cost Analysis
Without memory (typical query):
System prompt: 100 tokens
User query: 50 tokens
Total input: 150 tokens
Cost: 150 × $0.01/1K = $0.0015

With buffer memory (10-turn conversation):
System prompt: 100 tokens
Conversation history: 2,000 tokens (10 turns)
User query: 50 tokens
Total input: 2,150 tokens
Cost: 2,150 × $0.01/1K = $0.0215

14× more expensive.

With summary memory (same conversation):
System prompt: 100 tokens
Summary: 200 tokens
Recent messages (4): 400 tokens
User query: 50 tokens
Total input: 750 tokens
Cost: 750 × $0.01/1K = $0.0075

About 3× cheaper than buffer (2,150 vs 750 tokens), 5× more expensive than no memory.

With entity memory only:
System prompt: 100 tokens
User facts: 50 tokens ("communication_preference: email")
User query: 50 tokens
Total input: 200 tokens
Cost: 200 × $0.01/1K = $0.002

33% more expensive than no memory, roughly 11× cheaper than buffer.

Memory Cost Optimization
| Strategy | Tokens per Query | Cost per Query | Use Case |
|---|---|---|---|
| No memory | 150 | $0.0015 | One-off queries, no context needed |
| Entity only | 200 | $0.0020 | Personalization without conversation history |
| Summary | 750 | $0.0075 | Long conversations, cost-sensitive |
| Buffer (10 turns) | 2,150 | $0.0215 | Short conversations, need exact history |
Recommendation: Start with summary memory + entity memory. Best cost/quality trade-off.
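The figures in the table can be reproduced with a few lines of arithmetic, which is handy for plugging in your own token counts and prices. This sketch uses the token breakdowns and the $0.01 per 1K input tokens rate from the analysis above. `input_cost` and `compare` are hypothetical helper names.

```python
PRICE_PER_1K_INPUT_TOKENS = 0.01  # USD, the rate used in the cost analysis

# Input-token breakdown per strategy, taken from the figures in this guide.
STRATEGIES = {
    "no_memory": {"system": 100, "query": 50},
    "entity_only": {"system": 100, "facts": 50, "query": 50},
    "summary": {"system": 100, "summary": 200, "recent": 400, "query": 50},
    "buffer_10_turns": {"system": 100, "history": 2000, "query": 50},
}


def input_cost(tokens: int, price_per_1k: float = PRICE_PER_1K_INPUT_TOKENS) -> float:
    """Cost in USD of sending `tokens` input tokens."""
    return tokens * price_per_1k / 1000


def compare(strategies=STRATEGIES) -> dict:
    """Return tokens, cost, and cost multiple vs no memory for each strategy."""
    baseline = sum(strategies["no_memory"].values())
    rows = {}
    for name, parts in strategies.items():
        tokens = sum(parts.values())
        rows[name] = {
            "tokens": tokens,
            "cost": round(input_cost(tokens), 4),
            "vs_no_memory": round(tokens / baseline, 2),
        }
    return rows


if __name__ == "__main__":
    for name, row in compare().items():
        print(f"{name}: {row['tokens']} tokens, ${row['cost']:.4f}, {row['vs_no_memory']}x")
```

Only input tokens are counted here. Output tokens and the extra summarization call are left out, as they are in the table.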
Real-World Example: Customer Support Agent
Before memory:
- User asks question → Agent answers → Session ends
- Next question → Agent has no context
- User satisfaction: 3.2/5
After adding memory:
- Short-term: Summary memory (recent 4 messages + summary)
- Long-term: Entity memory (user preferences, past tickets)
- User satisfaction: 4.3/5 (+34%)
Cost impact:
- Before: $0.0015/query
- After: $0.0085/query (about 6× increase)
- ROI: 34% satisfaction gain for about 6× the per-query cost (an extra $0.007 per query) = worth it
Quote from Maria Santos, Head of Support: "Adding memory to our support agent was game-changing. Users stopped having to repeat themselves. Satisfaction jumped 34%, first-contact resolution improved 28%."
Hybrid Memory Architecture (Production)
Combine all three types:

    class HybridMemory:
        def __init__(self, user_id):
            self.short_term = SummaryMemory()  # Conversation context
            self.long_term = EntityMemory(user_id)  # User facts
            self.semantic = RAGRetriever()  # Knowledge base

        def build_context(self, user_query):
            # 1. Get conversation history
            conversation_context = self.short_term.get_context()
            # 2. Get user facts
            user_facts = self.long_term.get_all_facts()
            # 3. Retrieve relevant knowledge
            knowledge = self.semantic.retrieve(user_query, top_k=3)
            # 4. Combine into prompt
            prompt = f"""
    User facts: {user_facts}
    Relevant knowledge:
    {knowledge}
    Conversation history:
    {conversation_context}
    User query: {user_query}
    """
            return prompt

Result: Agent has short-term context + knows user + accesses knowledge base.
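Chat-style APIs generally take a list of role-tagged messages rather than one flat string, so the same three sources can also be assembled in that shape. The following is a sketch under that assumption. `build_messages` is a hypothetical function, and putting facts and knowledge into the system message is one reasonable layout, not the only one.

```python
def build_messages(user_query, user_facts, knowledge, history,
                   base_prompt="You are a helpful assistant."):
    """Combine long-term facts, retrieved knowledge, and history into chat messages."""
    system_parts = [base_prompt]
    if user_facts:
        facts = "\n".join(f"- {k}: {v}" for k, v in sorted(user_facts.items()))
        system_parts.append("Known user facts:\n" + facts)
    if knowledge:
        docs = "\n".join(f"- {doc}" for doc in knowledge)
        system_parts.append("Relevant knowledge:\n" + docs)
    messages = [{"role": "system", "content": "\n\n".join(system_parts)}]
    # History keeps its original roles so the model sees real turns.
    messages.extend(history)
    messages.append({"role": "user", "content": user_query})
    return messages


if __name__ == "__main__":
    msgs = build_messages(
        "Any discounts?",
        {"communication_preference": "email"},
        ["Plan B costs $99/month."],
        [{"role": "user", "content": "What's the price for B?"},
         {"role": "assistant", "content": "$99/month"}],
    )
    for m in msgs:
        print(m["role"], "->", m["content"])
```

Keeping history as separate messages preserves turn boundaries, which a single flattened prompt string loses.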
Frequently Asked Questions
How long should I keep conversation history?
Short-term: Current session only (clear when the session ends or after 30 minutes of inactivity)
Long-term: Indefinitely by default (disk is cheap, and users expect the agent to keep remembering)
Exception: Privacy-sensitive conversations (medical, legal). Auto-delete after N days per compliance.
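Auto-deletion after N days can be a single SQL statement run on a schedule. This sketch assumes the `user_facts` table from the EntityMemory section, with its `timestamp` column defaulting to `CURRENT_TIMESTAMP`. `purge_old_facts` is a hypothetical helper, and the retention period is whatever your compliance rules require.

```python
import sqlite3


def purge_old_facts(db: sqlite3.Connection, max_age_days: int) -> int:
    """Delete facts older than max_age_days and return how many rows were removed."""
    # SQLite's datetime('now', '-N days') yields a UTC timestamp in the same
    # 'YYYY-MM-DD HH:MM:SS' text format that CURRENT_TIMESTAMP stores,
    # so a plain string comparison orders correctly.
    cursor = db.execute(
        "DELETE FROM user_facts WHERE timestamp < datetime('now', ?)",
        (f"-{int(max_age_days)} days",),
    )
    db.commit()
    return cursor.rowcount
```

For example, `purge_old_facts(db, 30)` run from a daily job would keep only the last 30 days of facts.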
What about GDPR/privacy regulations?
Store minimum necessary:
- Short-term: Session-scoped, auto-delete after session
- Long-term: Get user consent, provide deletion mechanism
Implementation:
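
    def delete_user_data(db, user_id):
        # GDPR right to be forgotten: remove everything stored for this user.
        # db is an open sqlite3 connection; conversation_history assumes
        # you persist transcripts in a table of that name.
        db.execute("DELETE FROM user_facts WHERE user_id = ?", (user_id,))
        db.execute("DELETE FROM conversation_history WHERE user_id = ?", (user_id,))
        db.commit()

How do I handle memory across multiple agents?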
Shared memory store: All agents access same database.

    # Agent A stores fact
    memory_a = EntityMemory(user_id="user_123")
    memory_a.store_fact("timezone", "UTC-8")

    # Agent B retrieves fact
    memory_b = EntityMemory(user_id="user_123")
    timezone = memory_b.get_fact("timezone")  # "UTC-8"

Consistency: Both agents see same user facts.
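When several agents open the same SQLite file at once, writes can collide. One common mitigation is write-ahead logging plus a busy timeout. This is a sketch under those assumptions: `open_shared_store`, `store_fact`, and `get_fact` are hypothetical free functions that mirror the EntityMemory methods, and a client-server database may suit heavier multi-agent workloads better.

```python
import os
import sqlite3
import tempfile


def open_shared_store(path: str) -> sqlite3.Connection:
    """Open one agent's connection to the shared fact store."""
    # timeout: wait up to 5 seconds for another agent's write lock
    # instead of failing immediately with "database is locked".
    db = sqlite3.connect(path, timeout=5.0)
    # WAL mode lets readers proceed while another connection is writing.
    db.execute("PRAGMA journal_mode=WAL")
    db.execute(
        """CREATE TABLE IF NOT EXISTS user_facts (
               user_id TEXT, key TEXT, value TEXT,
               timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
               PRIMARY KEY (user_id, key))"""
    )
    return db


def store_fact(db, user_id, key, value):
    db.execute(
        "INSERT OR REPLACE INTO user_facts (user_id, key, value) VALUES (?, ?, ?)",
        (user_id, key, value),
    )
    db.commit()  # Other connections only see the fact after the commit.


def get_fact(db, user_id, key):
    row = db.execute(
        "SELECT value FROM user_facts WHERE user_id = ? AND key = ?",
        (user_id, key),
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "memory.db")
    agent_a = open_shared_store(path)
    agent_b = open_shared_store(path)
    store_fact(agent_a, "user_123", "timezone", "UTC-8")
    print(get_fact(agent_b, "user_123", "timezone"))
    agent_a.close()
    agent_b.close()
```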
---
Bottom line: Memory transforms stateless agents into personalized assistants. Use summary memory for conversations, entity memory for user facts. Costs 5-6× more but improves satisfaction 30-40% for customer-facing use cases.
Next: Read our Multi-Agent Systems guide for memory sharing across agents.