sankalp phadnis


where ai memory is heading

there's a quiet consensus forming in the AI infrastructure space: agents need memory. LLMs forget everything between sessions, users hate re‑explaining context, and personalization requires persistence. a handful of startups have raised venture capital on this thesis. mem0 raised $24M. letta (formerly memgpt) raised $10M. zep, hindsight, supermemory, and others are building competing offerings.

after spending time with the research on agent memory, from large‑scale surveys to the latest RL‑based memory management papers, i came away with a different perspective on this space (references at the bottom). the memory problem is real. but the way we're solving it today may be transitional.

this post walks through the technical landscape and where i think it's heading.

the most comprehensive of these surveys, Hu et al.'s "Memory in the Age of AI Agents," proposes a useful taxonomy: agent memory takes three forms, distinguished by where the memory physically lives.

three forms of agent memory

token‑level (what startups sell): memory stored as explicit, human‑readable text. facts, preferences, and conversation logs retrieved via similarity search and injected into the prompt. transparent, editable, model‑agnostic.

parametric: memory stored in model weights. the model doesn't look up the memory, it knows it, like knowing Paris is the capital of France. updated via fine‑tuning or LoRA adapters. zero‑latency access and generalizable, but hard to update.

latent: memory carried in the model's internal representations: KV caches, hidden states, activations. not human‑readable; the model's native format. high compression, machine‑native, outperforms text.

every memory middleware startup operates exclusively at the token level. mem0, letta, zep: they all store text entries in vector databases, retrieve them via semantic similarity, and inject them into prompts. this works, and it has real strengths in transparency and auditability. but it has a fundamental limitation: text is a lossy representation of what the model internally understands.

how token‑level memory works today

the architecture is straightforward. after a conversation, an LLM extracts important facts from the transcript. these facts get embedded into vectors and stored in a database. on the next interaction, a similarity search retrieves relevant facts and prepends them to the prompt.

the token‑level memory pipeline: conversation → LLM extracts facts → embed as vectors → store in vector DB → retrieve & inject. this is essentially a RAG pipeline with a different marketing wrapper.

this works for many use cases. and as i discovered running an AI startup for 8 months, a postgres database with basic queries often handles it fine. you don't always need a dedicated memory service to store user preferences. you need a database.
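the extract‑embed‑store‑retrieve loop is small enough to sketch in full. this is a toy illustration, not any vendor's implementation: the hash‑based embed() stands in for a real embedding model, a python list stands in for the vector database, and facts arrive pre‑extracted instead of coming from an LLM call.

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    # toy stand-in for an embedding model: hash each word into a bucket
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))  # vectors are unit-normalized

class MemoryStore:
    """extract -> embed -> store -> retrieve: the middleware core."""

    def __init__(self) -> None:
        self.entries: list[tuple[str, list[float]]] = []

    def add(self, fact: str) -> None:
        self.entries.append((fact, embed(fact)))

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [fact for fact, _ in ranked[:k]]

store = MemoryStore()
store.add("user prefers concise answers")
store.add("user is allergic to peanuts")
store.add("user's dog is named Biscuit")
relevant = store.retrieve("is the user allergic to anything?")
```

swap embed() for a real embedding model and the list for postgres with pgvector, and this is, structurally, the product.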

parametric memory: knowledge baked into weights

parametric memory stores information directly in the model's parameters. there are two subtypes. internal parametric memory modifies the base model's own weights through fine‑tuning or knowledge editing (techniques like ROME and MEMIT). external parametric memory introduces additional parameter modules, like LoRA adapters, alongside a frozen base model.

how LoRA works: lightweight adapter modules. instead of updating all ~16.8M parameters of a 4096 × 4096 weight matrix W, LoRA learns two small matrices, A (4096 × 16) and B (16 × 4096), about 130K parameters in total. the base model weights stay frozen; the personalized model is W′ = W + A × B, hot‑swappable per user. memory becomes modular.

this matters because personalized memory could be delivered as lightweight LoRA adapters trained on user interaction data, hot‑swapped at inference time. no vector database, no retrieval pipeline, no middleware layer. just a different adapter loaded per user.
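the adapter arithmetic is easy to check directly. a minimal numpy sketch, with smaller matrices for the demo and the post's 4096 × 4096, rank‑16 numbers computed alongside; the adapter values here are untrained placeholders:

```python
import numpy as np

d, r = 512, 16  # demo dimensions; the post's example uses d = 4096

rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))          # frozen base weights
A = rng.standard_normal((d, r)) * 0.01   # low-rank factor, one per user
B = np.zeros((r, d))                     # zero-initialized: the adapter starts as a no-op

# hot-swapping a user's memory = adding their adapter to the frozen base
W_prime = W + A @ B

# parameter counts at the post's scale (4096 x 4096 weight matrix, rank 16)
full = 4096 * 4096            # 16777216 (~16.8M) parameters
lora = 4096 * 16 + 16 * 4096  # 131072 (~130K) parameters, ~128x fewer
```

zero-initializing B is the standard LoRA trick: a freshly loaded adapter changes nothing until it has been trained on that user's data.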

latent memory: the model's native format

latent memory lives in the model's hidden states, KV caches, and activations. it's not human‑readable and it's not stored in parameters. it's the model's internal representation, preserved and reused across interactions.

there are three mechanisms. generate: auxiliary modules create compact latent representations (gist tokens compress prompts into KV activations; AutoCompressor compresses documents into summary vectors). reuse: the model carries over the KV cache from prior computation, so it doesn't recompute attention for previous turns. transform: existing KV cache states are compressed to reduce their footprint (SnapKV, PyramidKV; TurboQuant achieves 6x compression with zero quality loss).
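the "transform" mechanism is the easiest to illustrate. here is a toy SnapKV‑flavored compression step in numpy: score cached key/value pairs by how much recent queries attend to them, and keep only the top‑k. a simplified sketch of the idea, not the published algorithm:

```python
import numpy as np

def compress_kv(keys, values, recent_queries, keep):
    """keep the cached KV entries that recent queries attend to most; drop the rest."""
    d = keys.shape[-1]
    # attention scores of recent queries over the cached keys: (num_queries, num_entries)
    scores = recent_queries @ keys.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax per query
    importance = weights.sum(axis=0)                 # total attention each entry receives
    top = np.sort(np.argsort(importance)[-keep:])    # top-k entries, original order preserved
    return keys[top], values[top]

rng = np.random.default_rng(1)
n, d = 1024, 64
keys, values = rng.standard_normal((n, d)), rng.standard_normal((n, d))
queries = rng.standard_normal((8, d))
k2, v2 = compress_kv(keys, values, queries, keep=128)  # 8x smaller cache
```

the compressed cache stays in the model's native format throughout: no text is ever produced.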

the contrast with token‑level memory is a lossy round‑trip versus native preservation. token‑level: model understanding → convert to text → convert back, with information lost at both conversion steps. latent: model understanding → persist KV cache → reuse directly, with no lossy conversion and the model's native representations preserved.
MemArt, a KV cache‑centric memory system, improved accuracy by over 11% compared to state‑of‑the‑art text‑based methods while reducing prefill tokens by over 90x.

why does latent memory outperform? because converting the model's understanding into words and then converting those words back destroys the nuanced statistical patterns the model captured. latent memory preserves the model's native representations without this lossy round‑trip.

when to use which form

token‑level: best for multi‑turn chatbots, personalization, and high‑stakes domains (legal, medical, finance). transparent and auditable, but lossy and retrieval‑dependent.

parametric: best for role‑playing, math/coding, alignment, and stylized responses. zero‑latency access, but expensive to update.

latent: best for multimodal agents, edge deployment, and privacy‑sensitive domains. high compression and performance, but model‑specific and opaque.

here's the key observation: every memory middleware startup operates at the token level, because that's the only form they can access. parametric and latent memory require access to model weights, KV caches, and training infrastructure that only foundation model companies control.

the most important development in agent memory isn't a new storage format. it's the application of reinforcement learning to memory management itself. a wave of recent papers shows that agents can learn to manage their own memory end‑to‑end, without hand‑designed heuristics.

the RL progression in memory management

1. hand‑coded rules. heuristic rules like "if similarity > 0.8, update; else add." same rules for every user, every domain, every context. this is what most memory middleware uses today.

2. learned management, fixed extraction (Memory‑R1). RL trains the agent to decide ADD, UPDATE, DELETE, or NOOP for each memory entry. extraction is still rule‑based, but management adapts to the domain. only 152 training examples were needed.

3. learned management + learned extraction (Mem‑α). RL trains both what insights to extract from conversations (factual? behavioral? emotional?) and how to organize them. the entire pipeline from raw conversation to stored memory is learned end‑to‑end.

4. fixed‑size scratchpad via RL (MEM1). the agent gets a fixed‑size text scratchpad and must rewrite it after every turn. trained with PPO on task success alone, a 7B model with learned memory management beats a 14B model using full context. emergent behaviors include tracking multiple objectives and self‑verification.

5. fully latent memory injection (MemGen). no text at all. two LoRA adapters are added to the base model: a "memory trigger" that detects when the agent needs context, and a "memory weaver" that generates latent tokens (continuous vectors, not text) injected into the reasoning stream. trained jointly via RL.

the progression tells a clear story: each step trades interpretability for performance, and each step moves value away from the middleware layer and into the model's own training pipeline. once memory management is a learned capability entangled with model weights, it becomes very hard to offer as a third‑party API.
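to make the first step of that progression concrete, here is the hand‑coded rule quoted above next to the four‑operation action space Memory‑R1 learns a policy over. the learned_policy body is illustrative stand‑in logic; the real policy is an RL‑trained LLM, not an if‑statement:

```python
from enum import Enum

class Op(Enum):
    ADD = "add"
    UPDATE = "update"
    DELETE = "delete"
    NOOP = "noop"

def heuristic_policy(similarity: float) -> Op:
    # the hand-coded rule most memory middleware ships today
    return Op.UPDATE if similarity > 0.8 else Op.ADD

def learned_policy(new_fact: str, existing_fact: str, similarity: float) -> Op:
    # Memory-R1 trains a model via RL to choose among the same four operations,
    # conditioned on content rather than a bare similarity threshold.
    # illustrative stand-in logic only:
    if "no longer" in new_fact and similarity > 0.5:
        return Op.DELETE  # e.g. a contradiction retires the stale entry
    return heuristic_policy(similarity)
```

the point of the contrast: the heuristic sees only a number, while the learned policy conditions on what the memories actually say.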

memory middleware startups aren't just threatened by latent memory and RL‑trained management. they face pressure from multiple directions at once.

the middleware squeeze

1. foundation model companies build it themselves. OpenAI, Anthropic, and Google are building memory natively. they control the full stack and can move to latent representations at any time.

2. context windows keep expanding. Gemini handles 10M tokens, and KV cache compression (6x+) makes long contexts cheaper. the motivation for lossy text extraction weakens.

3. RL‑trained agents manage their own memory. agents learn when to store, retrieve, and forget, making hand‑crafted memory frameworks unnecessary.

4. storage is a commodity. the core pipeline (extract, embed, store, retrieve) is a modest amount of code on top of Postgres with pgvector.

from below, the storage problem is largely solved. as i found firsthand, the core pipeline of extract‑embed‑store‑retrieve can be built straightforwardly on top of existing databases. you don't need a dedicated memory service for this any more than you need a dedicated service for user authentication data.

from the sides, context windows keep expanding (10M tokens in Gemini, 200K in Claude), and KV cache compression is making long contexts cheaper. if you can hold and compress the full conversation history in latent form, the motivation for lossy text extraction weakens.

i experienced this firsthand. when i built zarie, a single‑threaded AI assistant that needs to remember user preferences, contacts, lists, and interaction patterns across conversations, i never reached for a memory middleware. the entire approach is: keep the full conversation history in context, periodically summarize it into a structured JSON, and inject that summary into the system prompt. no vector database, no embedding pipeline, no retrieval layer. just the model's own context window plus a summarization step. and it works. the assistant remembers everything it needs to, personalizes its responses, and handles preference updates naturally. for a single‑threaded assistant, the "memory problem" was solved with about 200 lines of summarization code.
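the core of that loop fits in a sketch. the llm() parameter is a placeholder for whatever model client you use, and the JSON schema shown is illustrative; the structure, full history plus periodic summarization plus system‑prompt injection, is the point:

```python
import json

SUMMARIZE_PROMPT = (
    "summarize this conversation into JSON with keys "
    "'preferences', 'contacts', 'lists', and 'patterns':\n{history}"
)

def summarize(llm, history: list[dict]) -> dict:
    # periodically compress the full conversation into one structured summary
    text = "\n".join(f"{m['role']}: {m['content']}" for m in history)
    return json.loads(llm(SUMMARIZE_PROMPT.format(history=text)))

def build_system_prompt(summary: dict) -> str:
    # inject the summary into the system prompt: this is the entire memory layer
    return (
        "you are a personal assistant. what you know about the user:\n"
        + json.dumps(summary, indent=2)
    )
```

preference updates fall out naturally: the next summarization pass simply overwrites stale entries, with no vector database to reconcile.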

this is the pressure point i find most compelling. foundation model companies don't just have the technical ability to build memory natively. they have a strong economic reason to do it themselves.

OpenAI already has ChatGPT memory. Anthropic has Claude memory. Google is building memory into Gemini. these are all token‑level systems today, but they control the full stack: the weights, the training infrastructure, the KV cache, and the serving layer. they can move to latent representations whenever they choose.

the strategic logic is straightforward. if your conversation history and accumulated preferences are encoded as model‑specific latent representations that don't port to competitors, switching costs become enormous. your AI remembers you perfectly, but only on their platform. this is a powerful retention mechanism. a cross‑platform memory layer that lets users move freely between models is solving a problem that platform companies actively don't want solved.

the combination of technical capability and strategic incentive makes this different from a typical "will the big company copy us" risk. the big companies aren't just capable of building memory. they benefit from memory being proprietary and deeply tied to their platform.

this isn't a story of complete obsolescence. token‑level memory has structural advantages that latent memory can't match in certain contexts.

compliance and auditability. in regulated industries (healthcare, finance, legal), you need to inspect, edit, and delete specific memories on demand. GDPR's right‑to‑erasure requires auditable, deletable memory records. opaque latent representations are a liability here. explicit text memory becomes a legal requirement.

multi‑model environments. if enterprises stay multi‑model (using Claude for some tasks, GPT for others, open‑source for internal use), a cross‑platform memory layer has genuine value. latent memory is inherently model‑specific. text memories work with any model.

user trust and control. users want to see what the AI "knows" about them. they want to correct wrong memories and delete sensitive ones. this requires memory in a format humans can read and edit.

the honest assessment is that memory middleware startups like mem0 and letta are building valuable products for a real problem, but the technical ground beneath them is shifting. the progression from hand‑coded rules to RL‑trained policies to fully latent memory injection points toward a future where memory management is deeply integrated into the model layer itself.

in the short term (through 2026), token‑level memory frameworks will continue capturing developer adoption. model‑agnosticism is a real advantage while the market is fragmented. in the medium term, RL‑trained memory management will become standard, and agents that manage their own memory will outperform those calling external services. longer term, latent memory will likely subsume most of token‑level for performance‑critical applications, while token‑level persists for compliance‑heavy and multi‑model contexts.

the structural winner is whoever controls the model and inference stack. that's where memory is heading. the historical parallel is instructive: nobody built a successful standalone company selling L2 cache. cache became a feature of the processor. memory may follow the same path, becoming a feature of the model platform rather than a standalone middleware layer.

this doesn't mean these companies fail. an acquisition by a foundation model company or a cloud provider is a realistic and potentially good outcome. and the compliance‑driven niche for transparent, auditable, cross‑platform memory is real, even if it doesn't support the kind of valuation a $24M raise implies.

the takeaway: the memory problem is real and lasting. the memory middleware solution is likely transitional. the value will accrue to whoever controls the model layer, because that's where the most powerful forms of memory (parametric, latent) and the most effective management approaches (RL‑trained) naturally live.

references

Hu et al. (2026). "Memory in the Age of AI Agents: A Survey." arXiv:2512.13564v2. 107 pages, 400+ references. The most comprehensive survey of agent memory to date.

Zhou et al. (2025). MEM1: fixed‑size scratchpad trained via PPO. 7B model beats 14B on multi‑hop QA with 3.7x less memory usage.

Zhang et al. (2025). MemGen: latent memory injection via RL‑trained LoRA adapters. No text extraction; continuous vectors injected into the reasoning stream.

Yan et al. (2025). Memory‑R1: structured memory operations (ADD/UPDATE/DELETE/NOOP) learned via RL. Only 152 training examples needed.

Wang et al. (2025). Mem‑α: end‑to‑end RL training of both extraction and management. The schema itself is learned, not hand‑designed.

Zeng et al. (2026). MemArt: KV cache‑centric memory achieving 11%+ accuracy improvement over text‑based methods with 90x reduction in prefill tokens.

Mu et al. (2023). Gist tokens: learning prompt compression via attention masks.

Chevalier et al. (2023). AutoCompressor: recursive document compression into summary vectors.
