Your Agent Doesn't Need a Better Vector DB. It Needs a Memory Architecture.

An analyst uses your agent to explore a dataset. The agent runs a full battery of tests, surfaces findings, and works around a data quality issue it caught mid-session. A week later the analyst comes back with a follow-up question. The agent runs the entire battery again from scratch, as if last week never happened.

This isn't a retrieval failure. The prior work existed. The agent could have retrieved it. The problem is that nobody built the layer that decides what to do with it, so the agent defaulted to the only behavior it knew: start over.

Vector-DB-as-memory has dominated this space for two years. It answers the retrieval question while leaving three harder ones alone: what memory types does your system actually need, how does memory expire, and once you've retrieved prior context, what decision do you make? Mem0, Cognee, LangMem, Graphiti: the comparisons I've found benchmark APIs and integrations. As far as I can tell, none of these tools ships an answer to all three questions out of the box. A quick check is to read their docs for a per-type TTL policy or a retrieval-to-decision handoff.

What follows is a specific architecture that solves each of these: four typed memory units with different TTLs and scopes, a dual-backend storage pattern, semantic dedup at write time, and a mode switch that sits between retrieval and execution. The mode switch is the piece no framework gives you, and it's the one that makes the system actually useful across repeated sessions.

The failure mode isn't retrieval. It's what happens after.

Consider a data quality issue found in session 3: a revenue column has a three-day reporting lag. The agent surfaces it, the analyst accounts for it, the analysis proceeds correctly.

Nine sessions later, that finding gets retrieved. The agent has it in context. Now what?

If the system has no framework for what that finding means, the agent treats it as soft context, something to weight alongside more recent observations. Maybe it factors in the lag. Maybe it doesn't. If it's competing against exploratory findings from last month, it might effectively get ignored.

This is the distinction that breaks things quietly. A data quality constraint is not a prior analytical finding. It is a hard fact about the dataset that applies every time anyone queries it. An analytical conclusion about Q4 revenue trends is time-bounded and scoped to one analyst's session. These two things should not live in the same bucket, expire at the same rate, or be injected into the planning prompt in the same way.

The is_durative flag makes this explicit:

is_durative: bool = False  # True once this pattern has been seen in >= 2 sessions

When the same data quality issue appears across two or more sessions on the same dataset, it stops being a one-off observation and becomes a structural fact. The planner treats it as a required workaround, not optional context.

Memory needs types, not just content

Every memory system hits this wall eventually: you've stored enough that retrieval returns a mix of critical constraints, stale findings, and noise. You can tune retrieval forever. The problem is upstream.

Four types cover the space well:

from typing import Literal
MemoryType = Literal["finding", "data_quality", "user_preference", "domain_pattern"]

| Type | TTL | Scope | What it stores |
| --- | --- | --- | --- |
| `finding` | 90 days | User | Prior analytical conclusions, confirmed at some confidence level |
| `data_quality` | 365 days | Org-wide | Null lags, stale columns, known dataset gaps |
| `user_preference` | 365 days | User | Corrections, preferred methods, output format choices |
| `domain_pattern` | 180 days | User | Procedural workarounds ("always winsorize this column before correlations") |

The TTLs encode a knowledge claim about each type. A specific analytical finding is worth keeping for 90 days. If it hasn't been reconfirmed by then, it's stale. A known null lag in a revenue column is worth keeping for a year because it's a structural property of the data.

The scope distinction matters as much as the TTL. Data quality issues live in an org-scoped index. Any analyst hitting the same dataset inherits the accumulated knowledge of its quirks. Findings are user-scoped. One analyst's conclusions don't contaminate another analyst's fresh read.

Mem0 and LangMem both scope memory to the user by default. Their identifiers and namespaces can be pointed at something broader, but as far as I can tell neither assigns scope by memory type out of the box. Cognee's graph approach can model relationships, but the org vs. user boundary for different memory types isn't something I could find addressed in its documentation either.

The consequence of collapsing that distinction is concrete. Without an org-scoped index, every analyst who hits a dataset with a known three-day revenue lag has to discover it themselves. The sixth person to encounter that dataset learns the same thing the first person did and stores a near-duplicate finding under their own user namespace. With an org-scoped index, the first analyst's data quality observation is immediately available to everyone else. The knowledge belongs to the data, not to the person who found it. Treating it as user-scoped means re-discovering org-level facts indefinitely.
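One way to keep the TTL and scope rules from the table in a single place is a small policy lookup. This is a minimal sketch rather than the production code: `MemoryPolicy`, `POLICIES`, `expires_at`, and `target_index` are hypothetical names, and the index names are borrowed from the retrieval example later in this post.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Literal

MemoryType = Literal["finding", "data_quality", "user_preference", "domain_pattern"]


@dataclass(frozen=True)
class MemoryPolicy:
    ttl_days: int
    scope: Literal["user", "org"]


# Values copied from the table above; the dict itself is illustrative.
POLICIES: dict[str, MemoryPolicy] = {
    "finding": MemoryPolicy(ttl_days=90, scope="user"),
    "data_quality": MemoryPolicy(ttl_days=365, scope="org"),
    "user_preference": MemoryPolicy(ttl_days=365, scope="user"),
    "domain_pattern": MemoryPolicy(ttl_days=180, scope="user"),
}


def expires_at(memory_type: MemoryType, written_at: datetime) -> int:
    """Return the expiry time as epoch seconds for the given memory type."""
    ttl = timedelta(days=POLICIES[memory_type].ttl_days)
    return int((written_at + ttl).timestamp())


def target_index(memory_type: MemoryType) -> str:
    """Route org-scoped types to the shared index, everything else to the user index."""
    return "memory-org" if POLICIES[memory_type].scope == "org" else "memory-user"


written = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(target_index("data_quality"))                                # memory-org
print(expires_at("finding", written) - int(written.timestamp()))   # 7776000
```

Keeping the policy in one table means a change to a TTL or a scope is a one-line edit, and the write path never has to branch on memory type by hand.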

The full unit structure:

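from datetime import date, datetime

from pydantic import BaseModel


class MemoryUnit(BaseModel):
    # Identity and ownership
    memory_id: str
    session_id: str
    owner_sub: str

    # What kind of memory this is, in short and long form
    memory_type: MemoryType
    summary: str           # 1-2 sentence condensed fact
    detail: str            # full context

    # When it was recorded vs. the period it describes
    dialogue_time: datetime
    event_time_start: date | None
    event_time_end: date | None
    is_durative: bool = False

    # Provenance and scoring
    data_source: str       # dataset key -- scopes dedup
    bank_id: str
    confidence: float      # derived from the analysis label, never set by hand
    artifact_id: str | None = None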

Confidence is not set manually. It's derived from how the finding was labeled during analysis:

label_to_confidence = {
    "confirmatory": 0.90,
    "exploratory": 0.70,
    "exploratory_divergent": 0.50,
}

Units below 0.6 are dropped at consolidation. Exploratory-divergent findings don't persist. The LLM's own labeling during analysis determines what survives into long-term memory. This closes the loop between execution quality and memory quality: confident findings propagate, speculative ones don't.
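The filtering rule is small enough to show in full. The following is a sketch under assumptions: `LabeledFinding`, `CONSOLIDATION_FLOOR`, and `survivors` are hypothetical names, and the second and third sample findings are made up for illustration.

```python
from typing import NamedTuple

LABEL_TO_CONFIDENCE = {
    "confirmatory": 0.90,
    "exploratory": 0.70,
    "exploratory_divergent": 0.50,
}
CONSOLIDATION_FLOOR = 0.6  # units below this are dropped


class LabeledFinding(NamedTuple):
    summary: str
    label: str


def survivors(findings: list[LabeledFinding]) -> list[tuple[str, float]]:
    """Map each label to a confidence and keep only findings at or above the floor."""
    kept = []
    for finding in findings:
        confidence = LABEL_TO_CONFIDENCE[finding.label]
        if confidence >= CONSOLIDATION_FLOOR:
            kept.append((finding.summary, confidence))
    return kept


batch = [
    LabeledFinding("Q4 revenue grew 12% YoY", "confirmatory"),
    LabeledFinding("Churn may correlate with plan tier", "exploratory"),
    LabeledFinding("Weekend signups look anomalous", "exploratory_divergent"),
]
print(survivors(batch))
# [('Q4 revenue grew 12% YoY', 0.9), ('Churn may correlate with plan tier', 0.7)]
```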

Two backends, two different jobs

The simplest possible memory system is a vector database with a query interface: store embeddings, retrieve by similarity. This works until you need TTL, structured lookup by data source, or an audit trail.

The pattern that works separates retrieval from lifecycle management into two backends:

              Memory Write
              (after REPORT phase)
                    |
       _____________|_____________
      |                           |
 S3 Vectors                   DynamoDB
 (semantic search)          (TTL, audit, lookup)

Retrieval at query time hits both indexes, filtered by scope and confidence floor:

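import boto3

s3v = boto3.client("s3vectors")
VECTOR_BUCKET = "agent-memory"  # placeholder: your vector bucket name

# User-scoped memories: only this user's units above the confidence floor
user_results = s3v.query_vectors(
    vectorBucketName=VECTOR_BUCKET,
    indexName="memory-user",
    topK=10,
    queryVector={"float32": query_embedding},
    filter={
        "$and": [
            {"owner_sub": {"$eq": user_id}},
            {"confidence": {"$gte": 0.5}}
        ]
    },
    returnMetadata=True,
    returnDistance=True,
)
# Org-scoped memories: shared across users, so no owner filter
org_results = s3v.query_vectors(
    vectorBucketName=VECTOR_BUCKET,
    indexName="memory-org",
    topK=10,
    queryVector={"float32": query_embedding},
    filter={"confidence": {"$gte": 0.5}},
    returnMetadata=True,
    returnDistance=True,
)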

The split exists because TTL-aware expiry in a vector index is either application-layer code that will drift, or it simply doesn't exist. Key-value stores can't do semantic similarity search. These are different problems that map to different data structures, and trying to solve both with one is how you end up with fragile application-layer workarounds for the one it doesn't do natively.
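To make the division of labor concrete, here is a rough sketch of the two payloads a single write might produce. The helper names, the `expires_at` attribute, and the sample values are assumptions for illustration. The vector record shape (`key`, `data`, `metadata`) reflects my reading of the S3 Vectors `put_vectors` request, so check it against the current API reference before relying on it. No AWS calls are made here.

```python
from datetime import datetime, timedelta, timezone


def build_dynamo_item(unit: dict, ttl_days: int) -> dict:
    """Full record for lifecycle management; DynamoDB TTL expects epoch seconds."""
    written_at = unit["dialogue_time"]
    item = {k: v for k, v in unit.items() if k != "dialogue_time"}
    item["dialogue_time"] = written_at.isoformat()
    item["expires_at"] = int((written_at + timedelta(days=ttl_days)).timestamp())
    return item


def build_vector_record(unit: dict, embedding: list[float]) -> dict:
    """Slim record for semantic search: embedding plus only the filterable metadata."""
    return {
        "key": unit["memory_id"],
        "data": {"float32": embedding},
        "metadata": {
            "owner_sub": unit["owner_sub"],
            "confidence": unit["confidence"],
            "memory_type": unit["memory_type"],
            "data_source": unit["data_source"],
        },
    }


unit = {
    "memory_id": "mem-001",
    "owner_sub": "user-123",
    "memory_type": "data_quality",
    "summary": "Revenue column has a 3-day reporting lag.",
    "data_source": "sales_daily",
    "confidence": 0.90,
    "dialogue_time": datetime(2025, 1, 1, tzinfo=timezone.utc),
}
item = build_dynamo_item(unit, ttl_days=365)
record = build_vector_record(unit, embedding=[0.1, 0.2, 0.3])
print(item["expires_at"] - int(unit["dialogue_time"].timestamp()))  # 31536000
print(sorted(record["metadata"]))
```

The DynamoDB item carries everything, including the expiry timestamp that the table's TTL setting acts on. The vector record carries only what retrieval filters need. One practical wrinkle: boto3's DynamoDB resource layer rejects Python floats, so `confidence` would need converting to `Decimal` before a real `put_item`.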

Cognee's graph-native approach is interesting because it can represent relationships a flat vector index can't. The tradeoff is that you're now doing expiry and lifecycle management in a graph store, which isn't what graph stores are built for. Every architecture involves a split somewhere. The question is where you put it.

Retrieval is not the decision

Here is the gap. Every memory framework gives you retrieval. A well-designed retrieval path returns prior findings, data quality flags, user preferences, and domain patterns, all filtered by owner and confidence floor. Good retrieval is necessary.

It isn't sufficient.

Once the agent has retrieved N prior findings about a dataset, something has to decide what to do with them. The agent could ignore the prior work and run fresh analysis. It could build on the prior findings and skip already-explored ground. It could run one targeted test to confirm whether the prior answer still holds. Or it could repeat the exact prior battery unchanged.

These are four different behaviors, and the right one depends entirely on what the user is trying to do. Without a decision layer, the agent picks one implicitly. Usually it's "run fresh," which wastes computation and frustrates repeat users. Sometimes it's "incorporate everything," which means stale findings from six months ago influence current analysis.

The mode switch makes this explicit:

memory_mode: Literal["fresh", "extend", "revalidate", "re-run"]

| Mode | Behavior |
| --- | --- |
| `fresh` | Ignore prior findings. Full new battery. Default for new datasets. |
| `extend` | Build on prior findings. Skip explored steps. Drill into unknowns. |
| `revalidate` | Run one confirmatory test. Check if data has changed. Stop. |
| `re-run` | Repeat prior battery unchanged. No adaptation. |

The mode fires at a gate when three or more prior findings exist for the same dataset. The gate surfaces the prior findings and asks the user which applies to their question today.
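The gate itself can be very small. The sketch below assumes that anything under the threshold falls back to `fresh`, which the table implies for new datasets but does not state for datasets with one or two prior findings. `choose_mode`, `GATE_MIN_FINDINGS`, and `fake_prompt` are hypothetical names, and the prompt function stands in for whatever UI actually asks the user.

```python
from typing import Callable, Literal

MemoryMode = Literal["fresh", "extend", "revalidate", "re-run"]
GATE_MIN_FINDINGS = 3  # the gate fires at three or more prior findings


def choose_mode(
    prior_findings: list[str],
    ask_user: Callable[[list[str]], MemoryMode],
) -> MemoryMode:
    """Default to a fresh run; hand the choice to the user once enough history exists."""
    if len(prior_findings) < GATE_MIN_FINDINGS:
        return "fresh"
    return ask_user(prior_findings)


# Stand-in for the real UI prompt: always answers "revalidate".
def fake_prompt(findings: list[str]) -> MemoryMode:
    return "revalidate"


print(choose_mode(["finding A"], fake_prompt))                             # fresh
print(choose_mode(["finding A", "finding B", "finding C"], fake_prompt))   # revalidate
```

Passing the prompt in as a callable keeps the gate testable and leaves room to swap the user prompt for an inferred mode later.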

revalidate is the mode that changes agent behavior most, and it's the one nobody builds. Most repeat questions on a dataset aren't "I want fresh analysis." They're "has the answer changed since last month?" A user who explored Q4 revenue trends in December and asks about Q1 in April doesn't need a full new battery. They need one targeted confirmatory test to check whether the prior dynamics still hold.

Running the full battery again is wasteful. Ignoring prior findings entirely loses continuity. Without revalidate, you're choosing between those two bad options. With it, you confirm or invalidate the prior conclusion with minimal computation and stop.

The storage problem precedes the retrieval problem

Twenty sessions on the same dataset, all surfacing variants of "Q4 revenue growth was 12% YoY." Without dedup, retrieval pulls all twenty. The planner can't tell which is current. The noise degrades planning quality in ways that are hard to diagnose because the retrieval system is technically working fine.

Most posts treat dedup as a retrieval-time problem: clean up duplicates when you pull. This is the wrong place to fix it. Dedup at write time is cheaper, cleaner, and prevents accumulation from happening.

Before storing a new MemoryUnit, the consolidation step embeds the candidate and queries the target index filtered by (data_source, memory_type). If the nearest existing unit is within 0.15 Euclidean distance, the new unit is dropped:

if distance < config.memory_dedup_distance_threshold:  # default: 0.15
    # skip -- too similar to existing memory
    continue

0.15 is a tight threshold. Two findings need to be semantically very close to trigger dedup. The tradeoff is real: a lower threshold lets more variants through, which is appropriate when subtle differences between findings matter. A higher threshold is more aggressive and keeps the index leaner but risks collapsing findings that are genuinely distinct. This is a domain-specific tuning decision and not one to set once and forget.
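For readers who want to see the whole check rather than the one-line guard, here is a self-contained sketch. It assumes an in-memory list in place of the real vector index and a linear scan in place of a nearest-neighbor query. `StoredUnit`, `is_duplicate`, and `consolidate` are hypothetical names, and the two-dimensional embeddings are toy values.

```python
import math
from dataclasses import dataclass

DEDUP_DISTANCE_THRESHOLD = 0.15


@dataclass
class StoredUnit:
    data_source: str
    memory_type: str
    embedding: list[float]


def is_duplicate(candidate: StoredUnit, index: list[StoredUnit],
                 threshold: float = DEDUP_DISTANCE_THRESHOLD) -> bool:
    """True if the nearest unit with the same (data_source, memory_type) is within threshold."""
    peers = [u for u in index
             if (u.data_source, u.memory_type) == (candidate.data_source, candidate.memory_type)]
    if not peers:
        return False
    nearest = min(math.dist(candidate.embedding, u.embedding) for u in peers)
    return nearest < threshold


def consolidate(candidates: list[StoredUnit], index: list[StoredUnit]) -> list[StoredUnit]:
    """Append only candidates that are not near-duplicates of what is already stored."""
    written = []
    for candidate in candidates:
        if is_duplicate(candidate, index):
            continue  # skip: too similar to an existing memory
        index.append(candidate)
        written.append(candidate)
    return written


index = [StoredUnit("sales_daily", "finding", [1.0, 0.0])]
new = [
    StoredUnit("sales_daily", "finding", [0.95, 0.05]),  # distance ~0.07, dropped
    StoredUnit("sales_daily", "finding", [0.0, 1.0]),    # distance ~1.41, kept
    StoredUnit("other_table", "finding", [1.0, 0.0]),    # different dataset, kept
]
print(len(consolidate(new, index)))  # 2
```

Note that the third candidate has an identical embedding to the stored unit and still gets written, because the `(data_source, memory_type)` filter runs before the distance comparison.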

The is_durative flag works alongside dedup. When the dedup check matches the same data quality issue in two or more sessions, the existing unit is promoted to durative rather than stored again. The planner injects durative constraints as required workarounds rather than optional context. This is how a one-off observation becomes a structural fact: through recurrence, not through manual tagging.
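One plausible way to implement promotion by recurrence is to count the distinct sessions that have observed an issue and flip the flag at the second one. This is a sketch under assumptions: `QualityFlag`, `record_sighting`, and the `seen_in_sessions` field are hypothetical, and the MemoryUnit shown earlier has no such field.

```python
from dataclasses import dataclass, field

DURATIVE_MIN_SESSIONS = 2


@dataclass
class QualityFlag:
    summary: str
    is_durative: bool = False
    # Hypothetical bookkeeping field; the MemoryUnit shown earlier does not include it.
    seen_in_sessions: set[str] = field(default_factory=set)


def record_sighting(flag: QualityFlag, session_id: str) -> QualityFlag:
    """Note that a session re-observed this issue and promote it once it recurs."""
    flag.seen_in_sessions.add(session_id)
    if len(flag.seen_in_sessions) >= DURATIVE_MIN_SESSIONS:
        flag.is_durative = True
    return flag


lag = QualityFlag("Revenue column has a 3-day reporting lag.")
record_sighting(lag, "session-3")
print(lag.is_durative)  # False
record_sighting(lag, "session-3")   # same session again: no promotion
print(lag.is_durative)  # False
record_sighting(lag, "session-12")
print(lag.is_durative)  # True
```

Tracking sessions as a set rather than a counter means repeated sightings within one session do not count as recurrence.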

What this looks like end to end

A user opens a session on a dataset they've queried before. At the UNDERSTAND phase, before any profiling runs, the system fires a recall step. It embeds the user's question, queries both the user-scoped and org-scoped indexes, merges the results, and partitions them into four buckets: prior findings, data quality flags, user preferences, and domain patterns.
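The merge-and-partition step might look roughly like this. It assumes each hit is a dict with a `key` and a `metadata` dict that includes `memory_type`, which in turn assumes the type was written into the vector metadata at store time. `merge_and_partition` is a hypothetical name.

```python
BUCKETS = ("finding", "data_quality", "user_preference", "domain_pattern")


def merge_and_partition(user_hits: list[dict], org_hits: list[dict]) -> dict[str, list[dict]]:
    """Merge hits from both indexes, drop repeated keys, and group by memory type."""
    buckets: dict[str, list[dict]] = {name: [] for name in BUCKETS}
    seen: set[str] = set()
    for hit in user_hits + org_hits:
        if hit["key"] in seen:
            continue
        seen.add(hit["key"])
        buckets[hit["metadata"]["memory_type"]].append(hit)
    return buckets


user_hits = [
    {"key": "m1", "metadata": {"memory_type": "finding"}},
    {"key": "m2", "metadata": {"memory_type": "user_preference"}},
]
org_hits = [{"key": "m3", "metadata": {"memory_type": "data_quality"}}]
buckets = merge_and_partition(user_hits, org_hits)
print({name: len(hits) for name, hits in buckets.items()})
# {'finding': 1, 'data_quality': 1, 'user_preference': 1, 'domain_pattern': 0}
```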

At the PLAN phase, two blocks go into the planning prompt. Data quality flags are always injected as hard constraints, regardless of the memory mode the user chose:

## Known Data Issues -- Required Workarounds
Your plan MUST account for these before proposing battery steps:
- Revenue column has a 3-day reporting lag. Any trailing-7-day window
  will undercount the most recent 3 days.
  -> Account for this in your plan before running any related steps.

Prior findings are mode-dependent. If the user selected extend, the findings go in with an explicit directive to build on them and skip already-confirmed steps. If the user selected fresh, that block is omitted entirely.
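A sketch of how the two blocks could be assembled follows. It covers only the `fresh` and `extend` cases described above; how `revalidate` and `re-run` shape the prompt is not specified here, so the sketch leaves them out. `build_memory_blocks` and the "Prior Findings" heading text are hypothetical.

```python
def build_memory_blocks(mode: str, data_quality: list[str], findings: list[str]) -> str:
    """Assemble the memory section of the planning prompt."""
    blocks = []
    # Data quality flags go in regardless of mode.
    if data_quality:
        lines = [
            "## Known Data Issues -- Required Workarounds",
            "Your plan MUST account for these before proposing battery steps:",
        ]
        lines += [f"- {issue}" for issue in data_quality]
        blocks.append("\n".join(lines))
    # Prior findings are included only when the user chose to extend.
    if mode == "extend" and findings:
        lines = [
            "## Prior Findings -- Build On These",
            "Skip steps these findings already confirm and drill into unknowns:",
        ]
        lines += [f"- {finding}" for finding in findings]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


issues = ["Revenue column has a 3-day reporting lag."]
prior = ["Q4 revenue growth was 12% YoY."]
print(build_memory_blocks("fresh", issues, prior))
```

With `mode="fresh"` the output contains only the data issues block. With `mode="extend"` the prior findings block is appended after it.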

Analysis runs. At the REPORT phase, findings get labeled: confirmatory, exploratory, or exploratory-divergent. The confidence score follows from that label. The consolidation step takes the labeled findings, embeds each one, runs the dedup check, and writes survivors to both backends. DynamoDB gets the full MemoryUnit with TTL set. The vector index gets the embedding with metadata for owner and confidence filtering.

The next session on the same dataset starts with a richer index and a more informed planner. Over time, the org-scoped index accumulates structural knowledge about the dataset that any user of the system can benefit from. The user-scoped index accumulates one person's analytical history, building a picture of what they've confirmed, what they've asked, and how they prefer to work.

Findings aren't the only thing worth remembering. After the report phase completes, a separate lightweight agent runs on the last 40 turns of the session transcript. Its job is different from the consolidation step: where consolidation works on labeled findings with confidence scores, the transcript pass looks for things that never surfaced as explicit findings. A preference the user stated in passing ("I always want weekly grain, not daily"), a correction they made mid-session, a non-obvious filter they mentioned once. These go into user_preference and domain_pattern memory rather than into findings. The distinction matters because not everything worth remembering takes the form of an analytical conclusion. A user who says "exclude OpenAI namespaces from all my revenue queries" is expressing a standing constraint, not making a finding. Without a pass on the transcript, that kind of knowledge evaporates at session end.

The question frameworks haven't answered

The pieces described here aren't novel in isolation. Dual-backend storage, TTL-based expiry, semantic similarity for retrieval: these all exist. What's absent from every framework comparison is the mode switch and the type-differentiated storage design.

A 2026 comparison of eight memory frameworks reports that none of them handle freshness scoring, semantic memory types with different business semantics, or the decision layer between retrieval and execution. In its feature matrix, those three rows are blank for all eight tools. The mode switch is that decision layer, made concrete.

The piece I'd most want to compare notes on with people who've built something similar: what does your user-facing decision point look like? The gate approach works well when users have enough context to choose. There are clearly cases where the agent should infer the mode automatically, maybe from the phrasing of the question or from how much time has passed since the last session. Which heuristics to use for that isn't settled, and if you've worked out a pattern that holds, I'd want to know how it compares to putting the choice directly in front of the user.