there's a quiet consensus forming in the AI infrastructure space: agents need memory. LLMs forget everything between sessions, users hate re‑explaining context, and personalization requires persistence. a handful of startups have raised venture capital on this thesis. mem0 raised $24M. letta (formerly memgpt) raised $10M. zep, hindsight, supermemory, and others are building competing offerings.
after spending time with the research on agent memory, from large‑scale surveys to the latest RL‑based memory management papers, i came away with a different perspective on this space (references at the bottom). the memory problem is real. but the way we're solving it today may be transitional.
this post walks through the technical landscape and where i think it's heading.
the three forms of memory
the Hu et al. survey (see references) proposes a useful taxonomy: agent memory takes three forms, distinguished by where the memory physically lives. token‑level memory lives in explicit text outside the model, parametric memory lives in the model's weights, and latent memory lives in the model's hidden states and KV caches.
every memory middleware startup operates exclusively at the token level. mem0, letta, zep: they all store text entries in vector databases, retrieve them via semantic similarity, and inject them into prompts. this works, and it has real strengths in transparency and auditability. but it has a fundamental limitation: text is a lossy representation of what the model internally understands.
how token‑level memory works today
the architecture is straightforward. after a conversation, an LLM extracts important facts from the transcript. these facts get embedded into vectors and stored in a database. on the next interaction, a similarity search retrieves relevant facts and prepends them to the prompt.
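the extract‑embed‑store‑retrieve pipeline can be sketched end to end. everything here is illustrative: the bag‑of‑words `embed` stands in for a real embedding model, the in‑memory list stands in for a vector database, and the stored facts are assumed to have been extracted by an LLM upstream.

```python
import math
import re
from collections import Counter

def embed(text):
    # toy bag-of-words "embedding"; a real system would call an embedding model
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class MemoryStore:
    """stand-in for a vector database: stores (fact, embedding) pairs."""

    def __init__(self):
        self.entries = []

    def add(self, fact):
        # extraction is assumed done upstream by an LLM; we store the fact directly
        self.entries.append((fact, embed(fact)))

    def retrieve(self, query, k=2):
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [fact for fact, _ in ranked[:k]]

store = MemoryStore()
store.add("user prefers replies in french")
store.add("user is allergic to peanuts")
store.add("user works as a data engineer")

# on the next interaction: retrieve relevant facts and prepend them to the prompt
memories = store.retrieve("what language should i reply in?", k=1)
prompt = "relevant memories:\n" + "\n".join(memories) + "\n\nuser: bonjour!"
```

the whole mechanism is similarity search plus string concatenation, which is why the storage layer is not where the hard problems live.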
this works for many use cases. and as i discovered running an AI startup for 8 months, a postgres database with basic queries often handles it fine. you don't always need a dedicated memory service to store user preferences. you need a database.
parametric memory: knowledge baked into weights
parametric memory stores information directly in the model's parameters. there are two subtypes. internal parametric memory modifies the base model's own weights through fine‑tuning or knowledge editing (techniques like ROME and MEMIT). external parametric memory introduces additional parameter modules, like LoRA adapters, alongside a frozen base model.
this matters because personalized memory could be delivered as lightweight LoRA adapters trained on user interaction data, hot‑swapped at inference time. no vector database, no retrieval pipeline, no middleware layer. just a different adapter loaded per user.
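a minimal numpy sketch of the hot‑swap idea, using the LoRA update rule (effective weight = frozen base + low‑rank `B @ A`). the random matrices stand in for trained adapters, and `make_adapter` and `forward` are illustrative names, not any library's API.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2  # hidden size and LoRA rank (tiny, for illustration)

W = rng.normal(size=(d, d))  # frozen base model weight, shared by all users

def make_adapter():
    # low-rank pair (B, A); in a real system these are trained per user
    return rng.normal(size=(d, r)), rng.normal(size=(r, d))

adapters = {"alice": make_adapter(), "bob": make_adapter()}

def forward(x, user):
    B, A = adapters[user]        # hot-swap: load this user's adapter
    return x @ (W + B @ A).T    # effective weight = frozen base + low-rank update

x = rng.normal(size=(1, d))
ya, yb = forward(x, "alice"), forward(x, "bob")
# per-user adapter stores 2*d*r = 32 params here, vs d*d = 64 for a full weight copy
```

the same frozen base produces personalized outputs per user, and the per‑user state is just a small pair of matrices rather than a retrieval pipeline.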
latent memory: the model's native format
latent memory lives in the model's hidden states, KV caches, and activations. it's not human‑readable and it's not stored in parameters. it's the model's internal representation, preserved and reused across interactions.
there are three mechanisms:
- generate: auxiliary modules create compact latent representations (gist tokens compress prompts into KV activations; AutoCompressor compresses documents into summary vectors).
- reuse: the model carries over its KV cache from prior computation, so it doesn't recompute attention for previous turns.
- transform: existing KV cache states are compressed to reduce their footprint (SnapKV, PyramidKV, and TurboQuant, which achieves 6x compression with zero quality loss).
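the reuse mechanism is the easiest to see concretely. a toy single‑head causal attention in numpy, showing that decoding one token at a time against cached K/V gives exactly the same output as recomputing attention over the whole sequence:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attend_full(X):
    # recompute attention over the whole sequence (no cache)
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    n = len(X)
    mask = np.triu(np.full((n, n), -np.inf), k=1)  # causal mask
    return softmax(Q @ K.T / np.sqrt(d) + mask) @ V

def attend_step(x, cache):
    # process one new token, reusing cached K/V from all earlier tokens
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    cache["K"].append(k)
    cache["V"].append(v)
    K, V = np.stack(cache["K"]), np.stack(cache["V"])
    return softmax(q @ K.T / np.sqrt(d)) @ V

X = rng.normal(size=(5, d))
cache = {"K": [], "V": []}
incremental = np.stack([attend_step(x, cache) for x in X])
assert np.allclose(incremental, attend_full(X))  # identical output, no recomputation
```

persisting that `cache` dict across interactions is, in miniature, what KV‑cache‑based latent memory does; the transform mechanism then compresses it so it stays affordable.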
why does latent memory outperform? because converting the model's understanding into words and then converting those words back destroys the nuanced statistical patterns the model captured. latent memory preserves the model's native representations without this lossy round‑trip.
| form | best for | key tradeoff |
|---|---|---|
| token‑level | multi‑turn chatbots, personalization, high‑stakes domains (legal, medical, finance) | transparent and auditable, but lossy and retrieval‑dependent |
| parametric | role‑playing, math/coding, alignment, stylized responses | zero‑latency access, but expensive to update |
| latent | multimodal agents, edge deployment, privacy‑sensitive domains | high compression and performance, but model‑specific and opaque |
here's the key observation: every memory middleware startup operates at the token level, because that's the only form they can access. parametric and latent memory require access to model weights, KV caches, and training infrastructure that only foundation model companies control.
memory management is becoming a learned capability
the most important development in agent memory isn't a new storage format. it's the application of reinforcement learning to memory management itself. a wave of recent papers shows that agents can learn to manage their own memory end‑to‑end, without hand‑designed heuristics.
the progression tells a clear story: from hand‑designed heuristics, to RL policies that learn structured operations over explicit text memories (Memory‑R1), to end‑to‑end learned extraction and schemas (Mem‑α), to latent memory injected directly into the reasoning stream with no text at all (MemGen). each step trades interpretability for performance, and each step moves value away from the middleware layer and into the model's own training pipeline. once memory management is a learned capability entangled with model weights, it becomes very hard to offer as a third‑party API.
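to make the learned‑operations idea concrete, here's a sketch of the kind of structured operation interface Memory‑R1 describes (ADD/UPDATE/DELETE/NOOP). the RL policy that chooses the operations is stubbed out entirely; the `MemoryBank` class, its keys, and the operation semantics are my own illustrative choices.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryBank:
    facts: dict = field(default_factory=dict)  # key -> remembered text

    def apply(self, op, key=None, text=None):
        # the four structured operations the trained policy chooses among each turn
        if op == "ADD":
            self.facts.setdefault(key, text)   # insert a new fact
        elif op == "UPDATE":
            self.facts[key] = text             # overwrite with newer info
        elif op == "DELETE":
            self.facts.pop(key, None)          # drop a stale or wrong fact
        elif op == "NOOP":
            pass                               # nothing worth storing this turn
        else:
            raise ValueError(f"unknown op: {op}")

bank = MemoryBank()
bank.apply("ADD", "diet", "user is vegetarian")
bank.apply("UPDATE", "diet", "user is vegan")   # newer info supersedes old
bank.apply("ADD", "city", "user lives in berlin")
bank.apply("DELETE", "city")                    # stale fact removed
bank.apply("NOOP")                              # policy decides this turn adds nothing
```

the interesting part is not this executor, which is trivial, but that the choice of operation is learned from reward rather than hand‑coded, which is exactly what makes it hard to ship as middleware.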
where memory middleware gets squeezed
memory middleware startups aren't just threatened by latent memory and RL‑trained management. they face pressure from multiple directions at once.
from below, the storage problem is largely solved. as i found firsthand, the core pipeline of extract‑embed‑store‑retrieve can be built straightforwardly on top of existing databases. you don't need a dedicated memory service for this any more than you need a dedicated service for user authentication data.
from the sides, context windows keep expanding (Gemini has demonstrated up to 10M tokens in research settings; Claude ships 200K), and KV cache compression is making long contexts cheaper. if you can hold and compress the full conversation history in latent form, the motivation for lossy text extraction weakens.
i experienced this firsthand. when i built zarie, a single‑threaded AI assistant that needs to remember user preferences, contacts, lists, and interaction patterns across conversations, i never reached for a memory middleware. the entire approach is: keep the full conversation history in context, periodically summarize it into a structured JSON, and inject that summary into the system prompt. no vector database, no embedding pipeline, no retrieval layer. just the model's own context window plus a summarization step. and it works. the assistant remembers everything it needs to, personalizes its responses, and handles preference updates naturally. for a single‑threaded assistant, the "memory problem" was solved with about 200 lines of summarization code.
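the zarie approach fits in a short sketch. `summarize` here is a toy stand‑in for the LLM summarization call (the real version prompts the model with the transcript and a schema), and the JSON fields are illustrative:

```python
import json

def summarize(transcript, previous):
    # toy stand-in for an LLM call that distills the transcript into JSON;
    # here we only extract a preferred name, to keep the sketch runnable
    summary = dict(previous)
    for turn in transcript:
        if turn["role"] == "user" and "call me" in turn["content"]:
            summary["preferred_name"] = turn["content"].split("call me ")[-1]
    return summary

def build_system_prompt(summary):
    return ("you are a helpful assistant.\n\n"
            "what you know about the user:\n" + json.dumps(summary, indent=2))

memory = {}
transcript = [
    {"role": "user", "content": "hi, please call me sam"},
    {"role": "assistant", "content": "got it, sam!"},
]
memory = summarize(transcript, memory)  # run periodically, not on every turn
prompt = build_system_prompt(memory)    # injected as the system prompt
```

no vector database, no retrieval layer: the state is one JSON object, and updating a preference is just the next summarization pass overwriting a field.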
why model companies are incentivized to own memory
this is the pressure point i find most compelling. foundation model companies don't just have the technical ability to build memory natively. they have a strong economic reason to do it themselves.
OpenAI already has ChatGPT memory. Anthropic has Claude memory. Google is building memory into Gemini. these are all token‑level systems today, but they control the full stack: the weights, the training infrastructure, the KV cache, and the serving layer. they can move to latent representations whenever they choose.
the strategic logic is straightforward. if your conversation history and accumulated preferences are encoded as model‑specific latent representations that don't port to competitors, switching costs become enormous. your AI remembers you perfectly, but only on their platform. this is a powerful retention mechanism. a cross‑platform memory layer that lets users move freely between models is solving a problem that platform companies actively don't want solved.
the combination of technical capability and strategic incentive makes this different from a typical "will the big company copy us" risk. the big companies aren't just capable of building memory. they benefit from memory being proprietary and deeply tied to their platform.
where token‑level memory persists
this isn't a story of complete obsolescence. token‑level memory has structural advantages that latent memory can't match in certain contexts.
compliance and auditability. in regulated industries (healthcare, finance, legal), you need to inspect, edit, and delete specific memories on demand. GDPR's right to erasure (article 17) requires auditable, deletable memory records. opaque latent representations are a liability here; explicit text memory is effectively a legal requirement.
multi‑model environments. if enterprises stay multi‑model (using Claude for some tasks, GPT for others, open‑source for internal use), a cross‑platform memory layer has genuine value. latent memory is inherently model‑specific. text memories work with any model.
user trust and control. users want to see what the AI "knows" about them. they want to correct wrong memories and delete sensitive ones. this requires memory in a format humans can read and edit.
what i think happens
the honest assessment is that memory middleware startups like mem0 and letta are building valuable products for a real problem, but the technical ground beneath them is shifting. the progression from hand‑coded rules to RL‑trained policies to fully latent memory injection points toward a future where memory management is deeply integrated into the model layer itself.
in the short term (through 2026), token‑level memory frameworks will continue capturing developer adoption; model‑agnosticism is a real advantage while the market is fragmented. in the medium term, RL‑trained memory management will become standard, and agents that manage their own memory will outperform those calling external services. longer term, latent memory will likely subsume most token‑level use in performance‑critical applications, while token‑level memory persists in compliance‑heavy and multi‑model contexts.
the structural winner is whoever controls the model and inference stack. that's where memory is heading. the historical parallel is instructive: nobody built a successful standalone company selling L2 cache. cache became a feature of the processor. memory may follow the same path, becoming a feature of the model platform rather than a standalone middleware layer.
this doesn't mean these companies fail. an acquisition by a foundation model company or a cloud provider is a realistic and potentially good outcome. and the compliance‑driven niche for transparent, auditable, cross‑platform memory is real, even if it doesn't support the kind of valuation a $24M raise implies.
the takeaway: the memory problem is real and lasting. the memory middleware solution is likely transitional. the value will accrue to whoever controls the model layer, because that's where the most powerful forms of memory (parametric, latent) and the most effective management approaches (RL‑trained) naturally live.
references
Hu et al. (2026). "Memory in the Age of AI Agents: A Survey." arXiv:2512.13564v2. 107 pages, 400+ references. The most comprehensive survey of agent memory to date.
Zhou et al. (2025). MEM1: fixed‑size scratchpad trained via PPO. 7B model beats 14B on multi‑hop QA with 3.7x less memory usage.
Zhang et al. (2025). MemGen: latent memory injection via RL‑trained LoRA adapters. No text extraction; continuous vectors injected into the reasoning stream.
Yan et al. (2025). Memory‑R1: structured memory operations (ADD/UPDATE/DELETE/NOOP) learned via RL. Only 152 training examples needed.
Wang et al. (2025). Mem‑α: end‑to‑end RL training of both extraction and management. The schema itself is learned, not hand‑designed.
Zeng et al. (2026). MemArt: KV cache‑centric memory achieving 11%+ accuracy improvement over text‑based methods with 90x reduction in prefill tokens.
Mu et al. (2023). Gist tokens: learning prompt compression via attention masks.
Chevalier et al. (2023). AutoCompressor: recursive document compression into summary vectors.