Engineering Patterns

RAG loops, agent orchestration, reflection, multi-agent composition, memory graphs.

What this is

Building a thinking system is mostly about structuring existing parts — LLMs, retrievers, APIs, eval loops — into a stable pipeline. These are the patterns that keep showing up.


Core patterns

A. Retrieval-augmented loop

Purpose: ground reasoning in verifiable data.

Flow:

User Input → Embedding → Vector Search → Context Assembly → Generation → Output

Components:

  • Embedding model (e.g., text-embedding-3-small)
  • Vector store (pgvector, Pinecone, Weaviate, Chroma)
  • Retriever
  • Context assembler
  • LLM

Things that matter:

  • Chunking granularity drives relevance and cost.
  • Embedding schema has to map to domain concepts.
  • Combine semantic similarity with metadata filtering.
  • Cache embeddings.

What you get: factual continuity. Fewer hallucinations.
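
The flow above can be sketched in a few lines. This is a minimal illustration, not a production retriever: `embed` is a toy bag-of-words stand-in for a real embedding model, `DOCS` is an in-memory stand-in for a vector store, and the names `retrieve` and `assemble_prompt` are hypothetical. It shows cached embeddings, metadata filtering combined with similarity ranking, and context assembly.

```python
import math
from collections import Counter
from functools import lru_cache

@lru_cache(maxsize=1024)          # cache embeddings so repeated text is not re-embedded
def embed(text: str) -> tuple:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return tuple(sorted(Counter(text.lower().split()).items()))

def cosine(a: tuple, b: tuple) -> float:
    da, db = dict(a), dict(b)
    dot = sum(v * db.get(k, 0) for k, v in da.items())
    na = math.sqrt(sum(v * v for v in da.values()))
    nb = math.sqrt(sum(v * v for v in db.values()))
    return dot / (na * nb) if na and nb else 0.0

# In-memory stand-in for a vector store; "source" is the metadata field.
DOCS = [
    {"text": "pgvector stores embeddings in postgres", "source": "db"},
    {"text": "chunk size affects retrieval relevance and cost", "source": "rag"},
    {"text": "agents call tools through typed schemas", "source": "agents"},
]

def retrieve(query, docs, source=None, k=2):
    """Filter on metadata first, then rank the survivors by similarity."""
    candidates = [d for d in docs if source is None or d["source"] == source]
    q = embed(query)
    ranked = sorted(candidates, key=lambda d: cosine(q, embed(d["text"])), reverse=True)
    return ranked[:k]

def assemble_prompt(query, chunks):
    """Context assembly: inject retrieved chunks ahead of the question."""
    context = "\n".join(f"- {c['text']}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

top = retrieve("how does chunk size affect cost", DOCS, k=1)
prompt = assemble_prompt("how does chunk size affect cost", top)
# `prompt` would then be sent to the LLM of your choice.
```

In a real system the filter and the similarity search would usually run inside the vector store as one query rather than in application code.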


B. Agent-orchestrated workflow

Purpose: dynamic planning and tool use.

Flow:

Goal → Planner → Tool Calls → Feedback → Plan Update → Result

Components:

  • Planner: the LLM deciding which tool to call.
  • Tool registry: typed functions callable by schema.
  • Sandbox: safe, isolated execution.
  • Observation handler: captures results.

Things that matter:

  • Strict, typed schemas (JSONSchema).
  • Timeout and validation around tools.
  • Memory connectors for plan state.
  • Audit every action.

What you get: something that behaves like an intelligent process manager, not a stateless prompt.
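
A rough sketch of the loop, under some loud assumptions: the schema is a plain `{name: type}` dict rather than full JSONSchema, the planner is a scripted function standing in for an LLM, and a worker thread stands in for a sandbox (a thread cannot be killed on timeout, so a real sandbox would use a separate process or container). All names here are hypothetical.

```python
import concurrent.futures as cf

TOOLS = {}       # tool registry: name -> {"fn", "params"}
AUDIT_LOG = []   # every call is recorded, whether it succeeded or not

def tool(name, params):
    """Register a function with a minimal typed schema: {param_name: type}."""
    def wrap(fn):
        TOOLS[name] = {"fn": fn, "params": params}
        return fn
    return wrap

@tool("add", {"a": int, "b": int})
def add(a, b):
    return a + b

def call_tool(name, args, timeout_s=2.0):
    spec = TOOLS.get(name)
    if spec is None:
        result = {"ok": False, "error": f"unknown tool: {name}"}
    elif set(args) != set(spec["params"]) or not all(
        isinstance(args[k], t) for k, t in spec["params"].items()
    ):
        result = {"ok": False, "error": "arguments do not match schema"}
    else:
        pool = cf.ThreadPoolExecutor(max_workers=1)
        future = pool.submit(spec["fn"], **args)
        try:
            result = {"ok": True, "value": future.result(timeout=timeout_s)}
        except cf.TimeoutError:
            result = {"ok": False, "error": "timeout"}
        except Exception as exc:           # tool raised: report it, do not crash the agent
            result = {"ok": False, "error": repr(exc)}
        finally:
            pool.shutdown(wait=False)      # do not block on a slow tool
    AUDIT_LOG.append({"tool": name, "args": args, "result": result})
    return result

def scripted_planner(goal, observations):
    """Stand-in for the LLM planner: call one tool, then finish."""
    if not observations:
        return {"action": "call", "tool": "add", "args": {"a": 2, "b": 3}}
    return {"action": "finish", "answer": observations[-1].get("value")}

def run_agent(goal, planner, max_steps=5):
    observations = []                      # plan state fed back to the planner
    for _ in range(max_steps):             # hard step budget
        step = planner(goal, observations)
        if step["action"] == "finish":
            return step["answer"]
        observations.append(call_tool(step["tool"], step["args"]))
    return None                            # budget exhausted without an answer

answer = run_agent("sum two numbers", scripted_planner)
```

Validation failures are returned as observations rather than raised, so the planner can see the error and revise the plan.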


C. Reflection and evaluation loop

Purpose: self-correction and monitoring.

Flow:

Action Result → Evaluator → Score → Memory Update → Next Iteration

Components:

  • Evaluator: small model or human.
  • Metrics engine: coherence, accuracy, success rate.
  • Feedback store: log for retraining.

Things that matter:

  • Use cheap models for evaluation.
  • Dashboard the results.
  • RL or weighted rules if you're moving toward autonomy.

What you get: reasoning becomes adaptive instead of fixed.
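
One way to picture the loop: generate, score, store the score, and feed the evaluator's hint into the next attempt. The sketch below assumes a keyword-checking function in place of a small evaluator model, and a list in place of a feedback store. `reflect_loop`, `toy_generate` and `toy_evaluate` are illustrative names only.

```python
REQUIRED = ["retrieval", "evaluation"]   # toy success criteria

def toy_generate(hint):
    """Stand-in for the generator: folds the evaluator's hint into the next draft."""
    base = "This answer covers retrieval."
    return base if hint is None else f"{base} It also covers {hint}."

def toy_evaluate(text):
    """Stand-in for a cheap evaluator: returns (score in [0, 1], hint or None)."""
    missing = [w for w in REQUIRED if w not in text]
    score = 1 - len(missing) / len(REQUIRED)
    return score, (missing[0] if missing else None)

def reflect_loop(generate, evaluate, threshold=0.8, max_iters=3):
    feedback_store = []                  # kept for dashboards or later retraining
    best, hint = None, None
    for i in range(max_iters):           # bounded: reflection must terminate
        candidate = generate(hint)
        score, hint = evaluate(candidate)
        feedback_store.append({"iteration": i, "candidate": candidate, "score": score})
        if best is None or score > best[1]:
            best = (candidate, score)
        if score >= threshold:
            break
    return best[0], feedback_store

final, log = reflect_loop(toy_generate, toy_evaluate)
```

Returning the best candidate rather than the last one guards against a later iteration scoring worse than an earlier one.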


D. Multi-agent composition

Purpose: scale by specialising.

Flow:

Controller → Sub-Agent Delegation → Results → Aggregation → Final Response

Components:

  • Controller: decomposes the goal.
  • Sub-agents: retrieval, synthesis, evaluation.
  • Message bus.
  • Consensus protocol: voting, confidence scoring.

Things that matter:

  • Clear interface contracts.
  • Depth limits — no uncontrolled recursion.
  • Latency tracking across hops.
  • Trace IDs for observability.

What you get: composable intelligence that scales across domains.
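
A small sketch of delegation and aggregation, assuming each sub-agent is a plain callable returning `{"answer", "confidence"}` (a hypothetical contract) and that consensus is a confidence-weighted vote. Calls run sequentially here for clarity. A message bus and depth limits are left out.

```python
import time
import uuid
from collections import defaultdict

def aggregate(results):
    """Confidence-weighted vote: sum confidence per distinct answer, pick the largest."""
    totals = defaultdict(float)
    for r in results:
        totals[r["answer"]] += r["confidence"]
    return max(totals, key=totals.get)

def controller(task, sub_agents):
    trace_id = uuid.uuid4().hex          # one ID shared by every hop of this request
    results = []
    for name, agent in sub_agents.items():
        start = time.perf_counter()
        out = agent(task)                # interface contract: {"answer", "confidence"}
        results.append({
            "agent": name,
            "trace_id": trace_id,
            "latency_s": time.perf_counter() - start,
            **out,
        })
    return {"trace_id": trace_id, "answer": aggregate(results), "results": results}

# Stub sub-agents; real ones would wrap retrieval, synthesis, or evaluation models.
agents = {
    "a": lambda task: {"answer": "42", "confidence": 0.6},
    "b": lambda task: {"answer": "41", "confidence": 0.5},
    "c": lambda task: {"answer": "41", "confidence": 0.3},
}
response = controller("what is the answer?", agents)
```

Here the two lower-confidence agents outvote the single most confident one, which is the behaviour a weighted vote is meant to allow.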


E. Persistent memory graph

Purpose: context as a network, not a log.

Structure:

  • Nodes: events, entities, decisions, observations.
  • Edges: causal or semantic relationships.
  • Queries: vector + symbolic hybrid.

Things that matter:

  • Property graphs (Neo4j, ArangoDB) or hybrid vector-graph stores.
  • Summarisation nodes for long history.
  • Integrate with RAG: find entry nodes by vector similarity, then expand along graph edges for related context.

What you get: memory that generalises, not just recalls.
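
To make the hybrid query concrete, here is a toy in-memory graph. It is a sketch only: a real deployment would use a graph database, the two-dimensional vectors are hand-written placeholders for embeddings, and `MemoryGraph` and its methods are hypothetical names.

```python
import math

class MemoryGraph:
    def __init__(self):
        self.nodes = {}   # node_id -> {"kind", "text", "vec"}
        self.edges = []   # (src, relation, dst)

    def add_node(self, node_id, kind, text, vec):
        self.nodes[node_id] = {"kind": kind, "text": text, "vec": vec}

    def add_edge(self, src, relation, dst):
        self.edges.append((src, relation, dst))

    def nearest(self, query_vec, kind=None):
        """Vector half of the query: most similar node, optionally filtered by kind."""
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.hypot(*a) * math.hypot(*b)
            return dot / norm if norm else 0.0
        candidates = [
            (nid, n) for nid, n in self.nodes.items()
            if kind is None or n["kind"] == kind
        ]
        return max(candidates, key=lambda item: cos(query_vec, item[1]["vec"]))[0]

    def neighbours(self, node_id, relation):
        """Symbolic half of the query: follow edges of one relation type."""
        return [dst for src, rel, dst in self.edges if src == node_id and rel == relation]

g = MemoryGraph()
g.add_node("e1", "event", "deploy failed", [1.0, 0.0])
g.add_node("d1", "decision", "roll back release", [0.0, 1.0])
g.add_node("o1", "observation", "error rate dropped", [0.5, 0.5])
g.add_edge("e1", "caused", "d1")
g.add_edge("d1", "caused", "o1")

seed = g.nearest([0.9, 0.1], kind="event")   # similarity search finds the entry point
effects = g.neighbours(seed, "caused")       # graph traversal finds what followed
```

The traversal step is what a flat log cannot give you: the decision is returned because it is causally linked to the event, not because its text resembles the query.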


Reference flow

A production system stitches these together:

1. Input from user or environment (Interface)
2. Intent parsed, context retrieved (Orchestration + RAG)
3. Reasoning plan generated (Agent)
4. Tools invoked (Action)
5. Results evaluated (Reflection)
6. Memory graph updated (Knowledge)
7. Response generated (Interface)

Closed loop. Feedback, grounding, continuity.
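
As a skeleton, the seven steps reduce to a list of stages passing a shared state along. Every stage below is a stub with a hypothetical name. The point is only the shape: one state object, one ordered pipeline, one trace entry per step.

```python
MEMORY = []   # stand-in for the memory graph

def retrieve(state):
    state["context"] = [f"doc about {state['input']}"]        # RAG stub
    return state

def plan(state):
    state["plan"] = ["lookup"]                                 # agent stub
    return state

def act(state):
    state["results"] = [f"ran {step}" for step in state["plan"]]
    return state

def reflect(state):
    state["score"] = 1.0 if state["results"] else 0.0          # evaluator stub
    return state

def remember(state):
    MEMORY.append({"input": state["input"], "score": state["score"]})
    return state

def respond(state):
    state["response"] = f"{state['results'][0]} using {len(state['context'])} context item(s)"
    return state

STAGES = [
    ("retrieve", retrieve), ("plan", plan), ("act", act),
    ("reflect", reflect), ("remember", remember), ("respond", respond),
]

def run_turn(user_input):
    state = {"input": user_input, "trace": []}
    for name, stage in STAGES:
        state = stage(state)
        state["trace"].append(name)   # log every step for later audit
    return state

out = run_turn("billing")
```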


Deployment

| Component            | Hosting                         | Stack                            |
| -------------------- | ------------------------------- | -------------------------------- |
| Frontend / Interface | Serverless (Vercel, Cloudflare) | Next.js + AI SDK                 |
| Agent Orchestrator   | Stateful microservice           | Node/Express, FastAPI, LangGraph |
| Vector Store         | Managed DB                      | Supabase (pgvector) / Pinecone   |
| Memory Graph         | Persistent DB                   | Neo4j / RedisGraph               |
| Observability        | Logging + Metrics               | OpenTelemetry, Prometheus        |
| Security             | AuthN/AuthZ, rate limiting      | JWT, API Gateway                 |

Operational notes:

  • Log every reasoning step.
  • Version your pipelines.
  • Sandbox tools.
  • Treat token use and latency as first-class metrics.

Failure modes

| Risk              | Cause               | Mitigation                                       |
| ----------------- | ------------------- | ------------------------------------------------ |
| Hallucination     | Weak retrieval      | Tighten RAG relevance, enforce context injection |
| Looping           | Unbounded recursion | Iteration limits, plan termination checks        |
| Data drift        | Outdated embeddings | Re-embed periodically                            |
| Context explosion | Oversized prompts   | Summarise history dynamically                    |
| Latency spikes    | Deep chains         | Parallelise sub-agents, batch tool calls         |
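
As one example, the "summarise history dynamically" mitigation might look like the sketch below. It assumes whitespace word counts as a crude token estimate and takes the summariser as a parameter (in practice an LLM call). `compact_history` is a hypothetical helper, and the summary itself is not counted against the budget here.

```python
def compact_history(turns, budget, summarise, count_tokens=lambda t: len(t.split())):
    """Keep the most recent turns that fit the budget; fold older ones into a summary."""
    kept, used = [], 0
    for turn in reversed(turns):          # walk from newest to oldest
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()                        # restore chronological order
    older = turns[: len(turns) - len(kept)]
    return ([summarise(older)] if older else []) + kept

history = ["a b c", "d e", "f g h i"]
compacted = compact_history(
    history,
    budget=6,
    summarise=lambda older: f"[summary of {len(older)} earlier turn(s)]",
)
```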


Governance

Transparent and auditable by design.

  • Trace every chain (input → output → action → eval).
  • Store traces for reproducibility.
  • Feedback ledger for human review.
  • Safety guardrails in orchestration, not afterthoughts.

Telemetry I track:

  • Token and cost per interaction
  • Retrieval hit ratio
  • Tool success/failure
  • Task completion times
  • Reflection score deltas

The operational view of cognition.
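
A sketch of how those metrics could be derived from stored traces. The field names are assumptions, as is the definition of retrieval hit ratio used here (retrieved chunks that the output actually used, divided by chunks retrieved). The input must be a non-empty list.

```python
from dataclasses import dataclass

@dataclass
class Trace:
    trace_id: str
    tokens: int
    cost_usd: float
    retrieved: int          # chunks returned by the retriever
    retrieved_used: int     # chunks the output actually drew on (assumed definition)
    tool_calls: int
    tool_failures: int
    duration_s: float
    reflection_score: float

def summarise(traces):
    """Aggregate per-interaction traces into the telemetry listed above."""
    n = len(traces)
    retrieved = sum(t.retrieved for t in traces)
    calls = sum(t.tool_calls for t in traces)
    scores = [t.reflection_score for t in traces]
    return {
        "tokens_per_interaction": sum(t.tokens for t in traces) / n,
        "cost_per_interaction": sum(t.cost_usd for t in traces) / n,
        "retrieval_hit_ratio": sum(t.retrieved_used for t in traces) / retrieved if retrieved else 0.0,
        "tool_success_rate": (calls - sum(t.tool_failures for t in traces)) / calls if calls else 1.0,
        "mean_completion_s": sum(t.duration_s for t in traces) / n,
        "reflection_score_deltas": [b - a for a, b in zip(scores, scores[1:])],
    }

traces = [
    Trace("t1", tokens=100, cost_usd=0.01, retrieved=4, retrieved_used=2,
          tool_calls=3, tool_failures=1, duration_s=1.0, reflection_score=0.5),
    Trace("t2", tokens=300, cost_usd=0.03, retrieved=6, retrieved_used=3,
          tool_calls=1, tool_failures=0, duration_s=3.0, reflection_score=0.75),
]
report = summarise(traces)
```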


Evolution path

| Stage         | Description               | Transition              |
| ------------- | ------------------------- | ----------------------- |
| 1. Reactive   | Static response to input  | Add RAG                 |
| 2. Contextual | Recalls past context      | Add persistent memory   |
| 3. Procedural | Plans multi-step          | Add agents              |
| 4. Reflective | Evaluates own performance | Add feedback loops      |
| 5. Adaptive   | Improves autonomously     | RL or retraining cycles |

Each stage builds on the last. Intelligence emerges from the stability of feedback between them.


The shift

Building a thinking system means moving from code-centric to system-centric design. The question isn't "which model do we use." It's:

How do information, reasoning, and memory interact to produce reliable understanding?

When that's clear, observable, and grounded, the result behaves less like a chatbot and more like a collaborator.