Memory Systems for AI Agents: Conversation State, Vector Stores, and Beyond
A three-tier taxonomy: short-term, long-term, and semantic. RAG sits in the semantic tier; the agent decides whether and how to use it.
Three-tier taxonomy
The memory model that recurs across framework documentation and the research literature splits agent memory into three tiers.
- Short-term / conversation. The current context window. Messages, tool results, and observations the model sees in the active call. Ephemeral; gone at session end.
- Long-term / episodic. Persistent records of past interactions, often a structured log keyed by user, session, or entity. The agent retrieves from this when prior context is relevant; the framework writes to it after each interaction.
- Semantic / vector. Retrievable knowledge stored in a vector database. The agent queries by semantic similarity; the result becomes context for the next generation step. RAG lives in this tier.
RAG in depth
Retrieval-Augmented Generation, introduced in Lewis et al. 2020, is the canonical pattern for grounding a language model’s output in external knowledge. The mechanism: a retriever fetches relevant documents from a corpus (typically by embedding the query and finding nearest-neighbour vectors in a store); the retrieved documents are inserted into the prompt; the generator produces an answer conditioned on both the question and the retrieved context.
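A minimal sketch of that retrieve-then-generate flow. `embed`, `vector_store`, and `llm` are illustrative placeholders for whatever embedding model, vector database, and model client a given system already uses, not a specific library's API:

```python
# Minimal RAG sketch: retrieve, augment the prompt, generate.
# `embed`, `vector_store`, and `llm` are placeholders, not a real library's API.

def answer_with_rag(question: str, vector_store, embed, llm, k: int = 4) -> str:
    # 1. Retrieve: embed the query and fetch the k nearest documents.
    query_vector = embed(question)
    documents = vector_store.search(query_vector, top_k=k)

    # 2. Augment: insert the retrieved text into the prompt.
    context = "\n\n".join(doc.text for doc in documents)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

    # 3. Generate: the model answers conditioned on question + retrieved context.
    return llm(prompt)
```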
RAG fits when the corpus is too large for the context window, when the knowledge changes faster than retraining cycles, or when answers must cite specific source documents. It does not fit when the question is general enough that the model already knows the answer (retrieval adds latency without benefit), or when the corpus is small enough to include in full (no need for retrieval).
Agentic RAG is the variant where the agent decides whether to retrieve, what to retrieve, and how many retrieval hops to perform. A non-agentic RAG pipeline always retrieves once; an agentic system might retrieve zero times if it knows the answer, or three times for a question that requires chained sub-queries.
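A sketch of the agentic variant under the same assumptions; `retrieve` and `is_sufficient` are hypothetical helpers used only to make the control flow concrete:

```python
# Agentic RAG sketch: the model decides whether to retrieve at all, and may
# issue follow-up retrievals when the first results are insufficient.
# `llm`, `retrieve`, and `is_sufficient` are illustrative placeholders.

def agentic_rag(question: str, llm, retrieve, is_sufficient, max_hops: int = 3) -> str:
    decision = llm(
        "Can you answer this without external documents? Reply YES or NO.\n\n"
        f"Question: {question}"
    )
    if decision.strip().upper().startswith("YES"):
        return llm(question)               # zero retrieval hops

    context: list[str] = []
    query = question
    for _ in range(max_hops):              # bounded number of retrieval hops
        context.extend(retrieve(query))
        if is_sufficient(question, context):
            break
        # Ask the model for a follow-up sub-query covering what is still missing.
        query = llm(
            "Given the question and the context so far, write one follow-up "
            f"search query.\n\nQuestion: {question}\nContext:\n" + "\n".join(context)
        )

    return llm(f"Question: {question}\n\nContext:\n" + "\n".join(context))
```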
Vector databases as a category
Vector databases provide three core capabilities: approximate nearest-neighbour search over embedding vectors, metadata filtering on attached fields (date, author, document type), and hybrid search combining vector similarity with keyword (BM25) ranking. Most production systems benefit from hybrid search, since pure vector similarity can miss exact-match queries and pure keyword search misses paraphrases.
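One common way to combine the keyword ranking and the vector ranking is reciprocal rank fusion, which works on ranks alone and so does not require the two score scales to be comparable. A minimal sketch (most stores ship their own fusion; this is illustration only):

```python
# Reciprocal rank fusion (RRF) sketch. Each input ranking is a list of
# document ids ordered best-first, e.g. one from BM25 and one from vector search.

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Each ranker contributes 1 / (k + rank); documents ranked highly
            # by either list accumulate a higher fused score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```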
A reference list of major options. This page does not rank them; the right choice depends on existing infrastructure, scale, and operational preference.
| Vector store | Notes |
|---|---|
| Pinecone | Managed; production-mature; serverless tier. |
| Weaviate | Open source with managed cloud; strong hybrid search. |
| Qdrant | Open source with managed cloud; written in Rust. |
| Chroma | Lightweight open source; embedded use widely supported. |
| pgvector | PostgreSQL extension; suits teams already on Postgres. |
| Milvus | Open source; mature for very-large-scale deployments. |
| LanceDB | Open source; embedded and serverless options. |
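As an illustration of the add-then-query workflow rather than an endorsement of any row above, a sketch using Chroma's embedded mode (chosen only because it runs in-process with no separate service). The collection name, documents, and metadata fields are invented for the example, and the exact API may vary across library versions:

```python
import chromadb

# Embedded client: data lives in-process, no separate server required.
client = chromadb.Client()
collection = client.create_collection(name="notes")

# Index a few documents; Chroma applies its default embedding function
# unless one is configured explicitly.
collection.add(
    ids=["n1", "n2"],
    documents=[
        "The deploy pipeline runs integration tests before shipping.",
        "Quarterly planning happens in the first week of each quarter.",
    ],
    metadatas=[{"topic": "engineering"}, {"topic": "process"}],
)

# Query by meaning, with a metadata filter narrowing the candidate set.
results = collection.query(
    query_texts=["when do we plan the quarter?"],
    n_results=1,
    where={"topic": "process"},
)
print(results["documents"])
```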
Memory architectures for agents
Within the long-term and semantic tiers, several architectures recur. Production systems usually combine more than one.
- Summarisation memory. Periodically summarise the conversation history into a shorter form; replace the raw history with the summary. Reduces context cost; loses fine detail.
- Sliding-window memory. Keep only the last N turns. Simple; works when older context rarely matters.
- Retrieval memory. Embed each past turn; retrieve relevant turns by semantic similarity when needed. Works when older context occasionally matters but rarely all at once.
- Entity memory. Structured records per entity (per user, per project, per topic). Updated by the agent as it learns facts. Works when there are stable, identifiable entities the agent should remember.
- Hybrid architectures. The combination most production systems use: a sliding window for recent turns, retrieval over older history, and structured entity records for stable facts. A minimal sketch follows this list.
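The sketch below assumes a placeholder embedding function and vector index rather than any particular framework's memory classes:

```python
from collections import deque

# Hybrid memory sketch: sliding window of recent turns, retrieval over older
# turns, and structured entity records. `embed` and `vector_index` are
# placeholders for whatever embedding model and vector store the system uses.

class HybridMemory:
    def __init__(self, embed, vector_index, window_size: int = 10):
        self.recent = deque(maxlen=window_size)    # sliding window of recent turns
        self.embed = embed
        self.index = vector_index                  # retrieval over older turns
        self.entities: dict[str, dict] = {}        # structured entity records

    def add_turn(self, turn: str) -> None:
        if len(self.recent) == self.recent.maxlen:
            oldest = self.recent[0]                # about to fall out of the window
            self.index.add(self.embed(oldest), oldest)
        self.recent.append(turn)

    def remember_fact(self, entity: str, field: str, value) -> None:
        self.entities.setdefault(entity, {})[field] = value

    def build_context(self, query: str, k: int = 3) -> str:
        retrieved = self.index.search(self.embed(query), top_k=k)
        return "\n".join([
            "Known facts: " + str(self.entities),
            "Relevant history: " + "\n".join(retrieved),
            "Recent turns: " + "\n".join(self.recent),
        ])
```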
Embedding models
An embedding model maps text to a dense vector that captures semantic similarity: closer vectors mean closer meaning. The major providers’ embedding APIs (OpenAI text-embedding-3, Cohere embed-v3, Voyage voyage-3, Google text-embedding-004) cover most production use cases. Open-source alternatives (sentence-transformers, BGE, E5) avoid API dependence.
The choice between hosted providers and open-source models is usually decided by three factors: data sensitivity (some content cannot be sent to an external API for embedding), cost at scale (embedding millions of documents adds up), and quality on the specific corpus (embedding models vary in performance on technical, legal, or multilingual text).
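A small sketch of the basic property (closer vectors, closer meaning), using sentence-transformers only because it runs locally; a hosted embedding API would play the same role, and the model name is just one commonly used small model:

```python
from sentence_transformers import SentenceTransformer
from numpy import dot
from numpy.linalg import norm

# Local open-source embedding model; any provider's embedding API is equivalent here.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "How do I reset my password?",
    "Steps to recover account access",
    "Quarterly revenue grew by 4%",
]
vectors = model.encode(sentences)

def cosine(a, b):
    return float(dot(a, b) / (norm(a) * norm(b)))

# Paraphrases land closer together than unrelated sentences.
print(cosine(vectors[0], vectors[1]))  # relatively high
print(cosine(vectors[0], vectors[2]))  # relatively low
```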
Failure modes
- Retrieval of irrelevant context. The retriever returns documents that look semantically similar but do not answer the question, and the model gets confused by or hallucinates around them. Mitigation: rerank retrieved results with a cross-encoder (see the sketch after this list); add metadata filters; raise the similarity threshold.
- Lost-in-the-middle. Models attend more weakly to the middle of a long context than to its beginning and end, so important retrieved content placed there is under-used. Mitigation: rank retrieved chunks; place the most relevant first or last; chunk into smaller pieces.
- Stale embeddings. The source data changed since embeddings were computed; retrieval returns outdated material. Mitigation: re-embed on source change; track embedding versions.
- Embedding drift across model updates. Switching to a newer embedding model invalidates the existing index, since the vector spaces are incompatible. Mitigation: plan migration windows; embed in both old and new for a transition period.
- PII in vector stores. Personal data embedded into a vector store is harder to delete than personal data in a row of a relational database (row-level deletion is cheap; vector-index re-build can be expensive). Mitigation: redact PII before embedding; design with the right-to-erasure case in mind.
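The rerank-and-threshold mitigation mentioned above, sketched with the sentence-transformers cross-encoder interface; the model name, threshold, and cut-off are illustrative choices, not recommendations:

```python
from sentence_transformers import CrossEncoder

# Rerank-and-threshold sketch. The cross-encoder scores each (query, document)
# pair jointly, which is slower than vector lookup but better at judging
# actual relevance. Model name and threshold are illustrative, not prescriptive.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_and_filter(query: str, candidates: list[str],
                      keep: int = 4, min_score: float = 0.0) -> list[str]:
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    # Drop anything below the relevance threshold, then keep the top few.
    return [doc for doc, score in ranked if score >= min_score][:keep]
```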
Frequently asked questions
What is RAG?
Retrieval-Augmented Generation is a pattern where a system retrieves relevant context from an external knowledge source (typically a vector database) and includes that context in the prompt to the language model. RAG is useful when the relevant knowledge is too large to fit in the model’s context window, when the knowledge changes frequently (so it cannot be encoded in training data), or when an answer needs to be grounded in specific source documents.
Is RAG an agent?
A pure RAG pipeline (retrieve, then generate) is not an agent. It does not decide what to do; it always retrieves and always generates. RAG becomes agentic when the system dynamically chooses whether to retrieve, what to retrieve, and how many retrieval hops to perform. Examples include agents that decide a question can be answered without retrieval, or that issue a follow-up retrieval after the first results turn out to be insufficient.
What is the difference between conversation memory and long-term memory?
Conversation memory is the current context window: the messages, tool results, and observations the model can see right now. It is ephemeral; it disappears when the session ends. Long-term memory is persistent storage of past interactions, often as a structured log or as embeddings in a vector store. The agent retrieves from long-term memory when context warrants, and writes to it after each interaction.
What does lost-in-the-middle mean?
A reliability issue with long-context language models: the model attends most strongly to the beginning and end of its input, with reduced attention to the middle. Documented in Liu et al. 2023 (Lost in the Middle: How Language Models Use Long Contexts). Practical implication: stuffing the full retrieved context into the middle of a long prompt can be less effective than chunking, ranking, and including only the most relevant pieces.
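One simple ordering mitigation, sketched as a small illustrative utility: given chunks already ranked best-first, interleave them so the strongest sit at both ends of the prompt and the weakest fall in the middle:

```python
# Place the best-ranked chunks at the beginning and end of the context,
# pushing the weakest toward the middle, where attention is weakest.

def order_for_long_context(chunks_best_first: list[str]) -> list[str]:
    front: list[str] = []
    back: list[str] = []
    for i, chunk in enumerate(chunks_best_first):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]   # strongest chunks at both ends

# order_for_long_context(["c1", "c2", "c3", "c4", "c5"])
# -> ["c1", "c3", "c5", "c4", "c2"]
```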
Related pages
- How AI Agents Work - memory as a core component
- ReAct - retrieval as a tool call
- Orchestration tooling - state management across steps
Sources and Further Reading
- P. Lewis et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, arXiv:2005.11401 (2020).
- N. F. Liu et al., Lost in the Middle: How Language Models Use Long Contexts, arXiv:2307.03172 (2023).
- OpenAI, Embeddings guide.
- Cohere, Embed API documentation.
- Voyage AI, documentation.
- pgvector, repository.
- Pinecone, Weaviate, Qdrant, Chroma, Milvus, LanceDB - vendor documentation.