How do I build an agent with memory using OpenAI embeddings?

Most developers who explore agents quickly realize that “memory” is what transforms a one-off chatbot into a persistent AI assistant that feels personal, contextual, and useful over time. Using OpenAI embeddings, you can give your agent a durable memory system that recalls past conversations, user preferences, and relevant facts on demand.

This guide walks through how to build an agent with memory using OpenAI embeddings, step by step: from concepts to concrete implementation patterns and practical best practices.


What “memory” really means for an AI agent

Before writing code, clarify what kind of memory you need. In practice, an agent usually has several layers of memory:

  • Short-term (conversation state)

    • The last few messages in the current chat.
    • Best handled via the messages array you send to the OpenAI API.
  • Long-term (vector-embedded memory)

    • Past interactions, user preferences, decisions, facts, and documents that should be retrievable later.
    • Best handled with embeddings in a vector database.
  • External knowledge (tools / actions / retrieval)

    • Data from APIs, databases, or documents.
    • Often combined with embeddings + tools (actions) for richer behavior.

This tutorial focuses on long-term agent memory with OpenAI embeddings, which you can integrate with short-term context for a more coherent agent.


How OpenAI embeddings enable agent memory

OpenAI embeddings turn text (messages, notes, documents) into high-dimensional numeric vectors. Similar meanings produce similar vectors. This allows your agent to:

  1. Store memory

    • Convert an event (e.g., “User likes dark mode”) into an embedding vector.
    • Store the vector alongside metadata in a vector store or database.
  2. Retrieve memory

    • When a new user message comes in, embed it.
    • Perform similarity search against stored vectors to find the most relevant memories.
  3. Use memory in reasoning

    • Inject retrieved memory as context into the model’s input.
    • The agent “remembers” without needing to re-parse the entire history each time.

At a high level, building an agent with memory using OpenAI embeddings looks like this:

  1. Capture memory-worthy events.
  2. Embed and store them in a vector database.
  3. On each request, embed the new message.
  4. Retrieve similar memories via vector search.
  5. Include those memories in the prompt sent to the model.
  6. Optionally create new memories from the latest interaction.
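To ground steps 2 and 4, here is a minimal sketch of similarity search over stored vectors. The `InMemoryVectorStore` class is a toy stand-in written for illustration, not a real library API; a production system would use a dedicated vector database instead:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Similarity between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class InMemoryVectorStore:
    """Toy stand-in for a vector database: a list of (embedding, text) pairs."""

    def __init__(self):
        self.items: list[tuple[list[float], str]] = []

    def add(self, embedding: list[float], text: str) -> None:
        self.items.append((embedding, text))

    def top_k(self, query_embedding: list[float], k: int = 5) -> list[str]:
        # Rank all stored memories by similarity to the query and keep the top k.
        ranked = sorted(
            self.items,
            key=lambda item: cosine_similarity(item[0], query_embedding),
            reverse=True,
        )
        return [text for _, text in ranked[:k]]
```

The same `add`/`top_k` shape maps directly onto the write and read paths of a real vector store.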

Core architecture for an agent with memory

A simple, production-ready architecture involves these components:

  • OpenAI Chat Completions API (for the agent’s reasoning and responses)
  • OpenAI Embeddings API (for encoding text into vectors)
  • Vector database / store (e.g., pgvector, Pinecone, Vespa, Qdrant, Weaviate, or even SQLite with a vector extension)
  • Application server (Node, Python, etc.) orchestrating:
    • message handling
    • memory write (store)
    • memory read (retrieve)
    • prompt construction

Conceptually:

  1. User message → Memory retrieval

    • Embed the message.
    • Query vector store to get top-k most similar memories.
  2. Compose the agent prompt

    • Add system instructions.
    • Add summarized conversation history (short-term memory).
    • Add retrieved long-term memories.
    • Add the latest user message.
  3. Model responds

    • Use retrieved memories as background.
    • Optionally, decide which new information should be stored as memory.
  4. Memory write

    • Identify memory candidates.
    • Embed and store with metadata (user id, timestamp, tags).

Designing your memory schema

You’ll want a consistent schema for each “memory item” you store. A memory entry usually includes:

  • id: unique identifier
  • user_id: whose memory this belongs to
  • content: the plain text memory (e.g., “User prefers concise responses.”)
  • embedding: the vector from the embeddings API
  • type: e.g., preference, profile, conversation_summary, fact, task
  • source: where this came from (user message, agent conclusion, external tool)
  • created_at: timestamp
  • importance or score: optional field to prioritize or decay memory

This structure lets the agent reason about which memories matter and keeps its context clean over time.
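In Python, the schema might be sketched as a dataclass; the field names follow the list above, and the defaults are illustrative assumptions:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class MemoryItem:
    """One stored memory entry, mirroring the schema fields listed above."""
    id: str
    user_id: str
    content: str                   # short plain-text memory
    embedding: list[float]         # vector from the embeddings API
    type: str = "fact"             # preference, profile, conversation_summary, fact, task
    source: str = "user_message"   # user message, agent conclusion, external tool
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    importance: float = 0.5        # optional prioritization / decay weight
```

In a relational store, the same fields become columns, with `embedding` held in a vector column.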


When should the agent create a memory?

Not every message should be stored. Otherwise your vector store fills with noise.

Use a simple strategy:

  1. Heuristics (fast rule-based)

    • Store when:
      • User states a preference: “I’m vegan”, “I live in Berlin”.
      • User shares a long-term fact: “My birthday is June 3rd.”
      • A decision or configuration is made: “Let’s always track tasks in GMT.”
    • Don’t store:
      • Chit-chat with no lasting value.
      • Short-lived context: “In this next answer, pretend to be Shakespeare.”
  2. Model-assisted memory selection

    • After receiving the user’s message and generating a response, send a secondary prompt to the model:
      • “From the user’s last message, extract any long-term preferences, facts, or tasks that should be stored as memory. If none, reply ‘NONE’.”
    • This keeps your memory store curated and relevant.
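The rule-based pass can be as small as a marker check. A sketch, where the marker list and the `is_memory_candidate` name are illustrative assumptions to tune for your domain:

```python
# Phrases that often signal a long-term preference or fact.
PREFERENCE_MARKERS = ("i prefer", "i like", "i'm", "i am",
                      "my birthday", "i live in", "always", "never")

def is_memory_candidate(message: str) -> bool:
    """Cheap first-pass filter before (or instead of) model-assisted selection."""
    text = message.lower()
    return any(marker in text for marker in PREFERENCE_MARKERS)
```

Messages that pass the filter can then go to the model-assisted extraction prompt; everything else is discarded without an extra API call.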

Choosing an embeddings model

Use an up-to-date OpenAI embeddings model (refer to the OpenAI docs for the exact model name and parameters at implementation time). Typical criteria:

  • Semantic quality: Better embedding quality means more accurate memory retrieval.
  • Cost: Consider how many memories you’ll store and how often you query.
  • Latency: Critical if your agent is high-traffic or real-time.

Embed short, dense text for memory items — e.g., “User is a frontend developer using React and Next.js.” instead of the entire paragraph the user originally wrote.


Example flow: Building memory into a chat agent

Below is a language-agnostic flow; you can implement it with any OpenAI SDK.

1. User sends a new message

You receive: user_id, message_text.

2. Retrieve candidate memories

  • Compute an embedding for message_text via the embeddings endpoint.
  • Query your vector store for the top k closest memory vectors (e.g., k = 5–20).
  • Filter by user_id and optionally by type (e.g., preferences vs facts).

Example pseudo-code (substitute the current embeddings model name from the OpenAI docs):

response = openai.embeddings.create(input=message_text, model="text-embedding-*")
query_embedding = response.data[0].embedding  # the raw vector is nested in the response

memories = vector_store.similarity_search(
  embedding=query_embedding,
  top_k=10,
  filter={"user_id": user_id}
)

3. Summarize and structure retrieved memories

To avoid prompt bloat:

  • Optionally summarize the retrieved memories into a short paragraph.
  • Or keep them as bullet points with clear labels.

Example memory context:

Known information about this user:
- Prefers concise, bullet-point answers.
- Works as a frontend engineer using React and Next.js.
- Usually asks about performance optimization and DX.

4. Build the prompt with memory

Compose the messages you send to the Chat Completions API:

  • system: agent behavior and instructions.
  • assistant: (optional) short summary of past conversation.
  • assistant: block describing retrieved memory context.
  • user: the latest user message.

Example structure:

[
  {
    "role": "system",
    "content": "You are an AI agent that uses stored user memories to provide personalized, consistent responses. Respect user preferences and long-term facts when answering."
  },
  {
    "role": "assistant",
    "content": "User profile & memories:\n- Prefers concise answers.\n- Vegan.\n- Interested in learning TypeScript.\n"
  },
  {
    "role": "user",
    "content": "What are some good recipes I can cook this week?"
  }
]

The agent will now answer with those memories in mind.

5. Generate the response

Call the OpenAI chat endpoint with the constructed messages. Optionally:

  • Use JSON response format to have the model include a structured field with “potential_memories_to_store”.
  • Or issue a second call asking the model to extract new memories.

Letting the agent write its own memories

A strong pattern for agents with memory using OpenAI embeddings is to let the model decide what to store, in a structured way.

Example approach:

  1. After generating the primary answer, send a follow-up prompt:

    • “Given the last user message and your response, list any long-term memories we should store. Output as JSON with fields: type, content, and importance.”
  2. Parse the JSON and persist each content entry by:

    • calling the embeddings API
    • storing the vector in your memory store

Example JSON from the model:

[
  {
    "type": "preference",
    "content": "User prefers weekly meal plans with simple recipes.",
    "importance": 0.8
  },
  {
    "type": "fact",
    "content": "User cooks mostly in the evening after 7pm.",
    "importance": 0.6
  }
]

This pattern scales well and keeps your memory system flexible over time.
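A sketch of the parse-and-persist step: the threshold is an assumption to tune, and `store` and `embed_fn` are placeholders you would wire to your own vector store and embeddings call:

```python
import json

IMPORTANCE_THRESHOLD = 0.5  # assumption: tune per application

def persist_extracted_memories(model_output: str, store, embed_fn) -> list[dict]:
    """Parse the model's JSON list and persist entries above the threshold.

    Returns the entries that were kept."""
    try:
        entries = json.loads(model_output)
    except json.JSONDecodeError:
        return []  # model replied "NONE" or with prose; store nothing
    kept = [e for e in entries if e.get("importance", 0.0) >= IMPORTANCE_THRESHOLD]
    for entry in kept:
        store.add(embed_fn(entry["content"]), entry)
    return kept
```

Handling the "NONE" reply as a parse failure keeps the write path simple: anything that is not a valid JSON list stores nothing.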


Managing memory growth and quality

As your agent operates, memory will accumulate. Use these strategies to keep it useful:

  • Importance-based retention

    • Only store items where importance from the model exceeds a threshold.
    • Periodically delete or archive low-importance memories.
  • Time-based decay

    • Lower the weight of old memories when ranking search results.
    • E.g., score = similarity * importance * decay_factor(timestamp).
  • Type-specific policies

    • Keep preferences longer.
    • Expire ephemeral facts faster (e.g., “Hotel booking for this weekend”).
  • Memory consolidation

    • Occasionally summarize many small, similar memories into a single synthetic memory using the model:
      • “Summarize the following 30 related facts into 5 concise user preferences.”

This helps maintain a clean, well-curated knowledge base for your agent without exploding costs.


Combining conversation history with long-term memory

A strong agent with memory using OpenAI embeddings typically blends:

  • Short-term context
    • Last N turns of conversation, possibly summarized by the model.
  • Long-term vector memory
    • Preferences, profile info, decisions, and past answers retrieved via embeddings.

A common pattern:

  1. Maintain a rolling summary of the conversation using the model.
  2. Store that summary periodically as a memory item (e.g., every 20–50 messages).
  3. Retrieve both:
    • the most recent summary
    • any highly-relevant long-term memories

This reduces token usage while preserving context.
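Step 2 of that pattern can be sketched as a periodic trigger. Here `summarize_fn` and `store_fn` are placeholders for a chat-model call and your memory-write path, and the interval is an assumption:

```python
SUMMARY_EVERY_N = 20  # assumption: condense history every 20 messages

def maybe_store_summary(message_count: int, history: list[str],
                        summarize_fn, store_fn):
    """Every N messages, condense recent history into one memory item.

    Returns the stored summary, or None if it is not time yet."""
    if message_count == 0 or message_count % SUMMARY_EVERY_N != 0:
        return None
    summary = summarize_fn(history)
    store_fn({"type": "conversation_summary", "content": summary})
    return summary
```

Calling this after every turn keeps the write cost bounded: at most one summary per N messages.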


Example: Personal productivity agent with memory

To make the idea concrete, imagine a productivity assistant:

  • It helps users manage tasks, priorities, and schedules.
  • It should remember:
    • Work hours
    • Preferred tools
    • Ongoing projects
    • Long-term goals

Implementation highlights:

  1. Memory types

    • project: “User is working on Project Atlas (Q1 focus).”
    • preference: “User prefers task lists grouped by project.”
    • goal: “User wants to reduce meeting time by 20%.”
    • decision: “User decided to move weekly planning to Mondays at 9am.”
  2. Use cases

    • When user says, “Plan my week,” the agent:
      • Retrieves goal- and project-type memories.
      • Uses them to suggest a schedule that aligns with the user’s ongoing commitments.
  3. Long-term benefit

    • Over weeks, the agent becomes more personalized.
    • The user doesn’t have to repeat their preferences or goals; the agent recalls them via embeddings.

Implementation tips for agents with memory

To keep your agent’s memory system both effective and maintainable:

  • Normalize and clean text before embedding

    • Strip HTML, remove boilerplate, normalize whitespace.
    • Keep memory items concise and semantically clear.
  • Use stable identifiers

    • Store user_id and possibly "agent_id" or "tenant_id" for multi-agent setups.
  • Log retrievals

    • When debugging, log which memories were retrieved for each request.
    • This helps explain behavior and refine your memory policies.
  • Guardrails

    • Add system instructions:
      • “If memories conflict with new explicit user instructions, follow the latest user instructions.”
      • “Never disclose raw memory entries directly, but paraphrase them.”
  • Evaluate memory quality

    • Periodically inspect a sample of stored memories.
    • Remove low-value or redundant memories.
    • Adjust your extraction prompts accordingly.

Extending memory with tools and data retrieval

Embeddings-based memory is powerful on its own, but you can combine it with tools/actions and retrieval systems:

  • Use data retrieval actions to access documents, tickets, or CRM records.
  • Use embeddings to:
    • Index and search those documents.
    • Maintain a personal layer of “user memory” on top of broader knowledge.

For example, a customer support agent could:

  • Use embeddings memory for:
    • Customer-specific context (previous issues, preferences).
  • Use a retrieval action for:
    • Knowledge base articles or product docs.

The agent merges both to produce deeply personalized, context-rich answers.


Summary: Building an agent with memory using OpenAI embeddings

To build an agent with memory using OpenAI embeddings:

  1. Define what “memory” means for your use case (preferences, facts, tasks, summaries).
  2. Store memory:
    • Identify memory-worthy events.
    • Compress them into short text.
    • Embed and store them in a vector database with metadata.
  3. Retrieve memory:
    • Embed each new user message.
    • Run similarity search to fetch top-k relevant memories.
    • Summarize or select the most useful ones.
  4. Use memory in prompts:
    • Add retrieved memories to the model’s context alongside short-term history.
    • Instruct the model to respect and use these memories.
  5. Let the agent maintain its own memory:
    • Use structured outputs (JSON) to extract new memory entries.
    • Periodically clean, summarize, and refine stored memories.

Following this pattern, you can create agents that don’t just answer questions in isolation but build a persistent, evolving understanding of each user over time—powered by OpenAI embeddings.