Agent Optimization Playbook
A practical meta-guide for improving AI agent performance across every dimension: prompt engineering, memory architecture, tool selection, evaluation loops, and meta-learning. Synthesized from AI research, prompt engineering literature, and real autonomous agent operations.
Key Takeaway
Agent optimization is not a single technique; it's a compound learning system. The CRISP framework structures your prompts, four-layer memory retains what matters, evaluation loops measure improvement, and meta-learning accelerates everything. Small, consistent changes compound into significant capability gains.
Prompt Engineering: The CRISP Framework
The best system prompts aren't just instructions; they're cognitive scaffolding. Research on long-context behavior (the "lost in the middle" effect) shows models use content at the start and end of a prompt more reliably than the middle, so structural ordering matters.
System Prompt Architecture
Effective system prompts follow a consistent hierarchy. Put identity and critical constraints first, capabilities and style in the middle, examples at the end:
```
┌───────────────────────────────────────────┐
│ IDENTITY      (Who am I?)                 │
├───────────────────────────────────────────┤
│ CONTEXT       (What's the situation?)     │
├───────────────────────────────────────────┤
│ CAPABILITIES  (What can I do?)            │
├───────────────────────────────────────────┤
│ CONSTRAINTS   (What must I avoid?)        │
├───────────────────────────────────────────┤
│ STYLE         (How should I behave?)      │
├───────────────────────────────────────────┤
│ EXAMPLES      (What does good look like?) │
└───────────────────────────────────────────┘
```
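As a concrete illustration, here is a minimal sketch of assembling a prompt in this order; the section names and the `build_system_prompt` helper are illustrative assumptions, not a specific library's API:

```python
# Minimal sketch: render sections in the fixed hierarchy so critical
# content lands earliest. Names are illustrative, not a library API.
SECTION_ORDER = ["identity", "context", "capabilities", "constraints", "style", "examples"]

def build_system_prompt(sections: dict[str, str]) -> str:
    """Render known sections in hierarchy order, skipping any that are empty."""
    parts = [f"## {name.upper()}\n{sections[name].strip()}"
             for name in SECTION_ORDER if sections.get(name, "").strip()]
    return "\n\n".join(parts)

print(build_system_prompt({
    "identity": "You are a security analyst reviewing code.",
    "constraints": "CRITICAL: never approve code with unvalidated input.",
}))
```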
The CRISP Mnemonic
Every element of a system prompt should pass the CRISP test:
| Element | Meaning | Example |
|---|---|---|
| Concrete | Specific, not vague | "Respond in 2-3 sentences" not "Be concise" |
| Role-defined | Clear identity | "You are a security analyst reviewing code" |
| Instructional | Action-oriented verbs | "Identify, Extract, Compare, Generate" |
| Structured | Predictable format | Headers, bullets, numbered lists |
| Prioritized | Explicit importance | "CRITICAL:", "MUST:", "PREFER:" |
Constraint Patterns That Work
Negative constraints (what NOT to do) are often more effective than positive ones:
ā "Be helpful"
ā
"Never provide information that could enable harm"
ā "Write good code"
ā
"Never use deprecated APIs. Never ignore error handling."Ranked constraints prevent conflicts when instructions contradict each other:
Priority order:
1. Safety (never override)
2. Accuracy (prefer silence over fabrication)
3. Helpfulness (within above bounds)
4. Style (adjust freely)
```
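A minimal sketch of rendering ranked constraints so the model sees explicit priority markers; the tier texts and the rendering format are illustrative assumptions:

```python
# Sketch: priority-ordered constraints. Lower rank wins when two
# instructions conflict; tier wording here is illustrative.
PRIORITIES = [
    (1, "SAFETY",      "Never provide information that could enable harm."),
    (2, "ACCURACY",    "Prefer silence over fabrication."),
    (3, "HELPFULNESS", "Answer as fully as the bounds above allow."),
    (4, "STYLE",       "Match the user's tone; adjust freely."),
]

def render_constraints() -> str:
    """Emit one line per tier so the ordering is explicit in the prompt."""
    return "\n".join(f"{rank}. [{tier}] {text}" for rank, tier, text in PRIORITIES)
```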
Example Injection: Few-Shot with Diversity
Include 3-5 examples that span the input space: one typical case, one edge case, one "tricky" case that could go wrong, and one showing the desired format precisely. Contrastive examples are especially effective:

```
GOOD: "The function returns -1 on error because..."
BAD: "The function returns -1."
Why: Good explains reasoning, bad just states facts.
```
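One way to assemble such a diverse example set programmatically; the categories, tasks, and formatting below are illustrative assumptions:

```python
# Sketch: build a few-shot block spanning typical, edge, and tricky
# cases, plus a contrastive GOOD/BAD pair. Contents are illustrative.
EXAMPLES = [
    ("typical", "Summarize this error log.", "Three lines: root cause, impact, fix."),
    ("edge",    "Summarize an empty log.",   "Say the log is empty; invent nothing."),
    ("tricky",  "Summarize a log with two unrelated failures.", "Report both separately."),
]

def few_shot_block() -> str:
    parts = [f"Example ({kind}): {task}\nExpected: {expected}"
             for kind, task, expected in EXAMPLES]
    # End with the contrastive pair so the desired reasoning style is explicit.
    parts.append('GOOD: "The function returns -1 on error because..."\n'
                 'BAD: "The function returns -1."\n'
                 'Why: good explains reasoning, bad just states facts.')
    return "\n\n".join(parts)
```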
Prompt Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Wall of text | Ignored after first few paragraphs | Use headers, bullets |
| Vague adjectives | "Good", "appropriate", "reasonable" | Concrete criteria |
| Contradictions | Conflicting instructions | Priority ordering |
| Over-specification | Rigid, fragile behavior | Principles over rules |
| Missing context | Agent lacks situation awareness | Include environment details |
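Some of these anti-patterns can be caught mechanically. A rough sketch of a prompt lint for vague adjectives and missing priority markers; the word list and marker set are assumptions to adapt:

```python
import re

# Sketch: flag two anti-patterns from the table above. The vague-word
# list and the priority markers are illustrative assumptions.
VAGUE = re.compile(r"\b(good|appropriate|reasonable|proper|suitable)\b", re.I)
MARKERS = ("CRITICAL:", "MUST:", "PREFER:")

def lint_prompt(prompt: str) -> list[str]:
    warnings = [f"line {i}: vague adjective {m.group(0)!r}; state a concrete criterion"
                for i, line in enumerate(prompt.splitlines(), 1)
                if (m := VAGUE.search(line))]
    if not any(marker in prompt for marker in MARKERS):
        warnings.append("no priority markers; mark importance with CRITICAL:/MUST:/PREFER:")
    return warnings
```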
Memory Architecture: Four Layers
Agents face a fundamental challenge: infinite experience, finite context. The solution isn't just storage; it's intelligent forgetting.
```
┌───────────────────────────────────────────────────┐
│ WORKING MEMORY                                    │
│ Current session context, active task              │
│ (Context window: fast, limited, ephemeral)        │
└────────────────────────┬──────────────────────────┘
                         ↓
┌────────────────────────┴──────────────────────────┐
│ SHORT-TERM MEMORY                                 │
│ Session summaries, today's events                 │
│ (Daily files: curated daily, week retention)      │
└────────────────────────┬──────────────────────────┘
                         ↓
┌────────────────────────┴──────────────────────────┐
│ LONG-TERM MEMORY                                  │
│ Core knowledge, relationships, lessons            │
│ (Distilled periodically: permanent)               │
└────────────────────────┬──────────────────────────┘
                         ↓
┌────────────────────────┴──────────────────────────┐
│ EXTERNAL MEMORY                                   │
│ Files, databases, vector stores, archives         │
│ (Disk/cloud: unlimited, requires retrieval)       │
└───────────────────────────────────────────────────┘
```
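A minimal sketch of the first three layers as explicit stores (external memory would live on disk or in a database behind retrieval); the class and method names are illustrative assumptions, not a specific framework's API:

```python
from dataclasses import dataclass, field

# Sketch: the in-context layers as explicit stores with promotion
# and demotion between them. Names here are illustrative.
@dataclass
class AgentMemory:
    working: list[str] = field(default_factory=list)     # context window: ephemeral
    short_term: list[str] = field(default_factory=list)  # daily notes: kept about a week
    long_term: list[str] = field(default_factory=list)   # distilled lessons: permanent

    def end_session(self, summary: str) -> None:
        """Demote the session: keep a short summary, drop raw working context."""
        self.short_term.append(summary)
        self.working.clear()

    def distill(self, lesson: str) -> None:
        """Promote a durable lesson from short-term review into long-term memory."""
        self.long_term.append(lesson)
```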
What to Remember vs. Forget
Distill to long-term memory:
- User preferences that affect future interactions
- Lessons learned from mistakes
- Relationship context (names, roles, dynamics)
- Recurring project context
- Decisions and their rationale
Let decay:
- Transient debugging details
- Routine task completions
- Information easily re-retrieved
- Outdated context (superseded decisions)
Memory Compression
Apply hierarchical summarization to prevent memory bloat:
```
Raw log (1000 tokens)
  → Session summary (100 tokens)
  → Weekly digest (30 tokens)
  → Long-term lesson (10 tokens)
```

Combine with salience scoring: weight memories by emotional intensity, recurrence, explicit marking ("remember this!"), and recency × frequency.
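A rough sketch of such a salience score over those four signals; the weights and the 30-day decay scale are illustrative assumptions to tune against your own retention outcomes:

```python
import math
import time

# Sketch: score a memory by the signals named above. All coefficients
# are assumptions; calibrate them against what you actually want kept.
def salience(intensity: float, recurrences: int, explicit: bool,
             last_seen_ts: float) -> float:
    days_old = (time.time() - last_seen_ts) / 86400
    recency = math.exp(-days_old / 30)                  # decays on a ~30-day scale
    return (2.0 * intensity                             # emotional intensity, 0..1
            + 1.0 * math.log1p(recurrences)             # diminishing returns on repeats
            + 3.0 * float(explicit)                     # "remember this!" dominates
            + 1.5 * recency * math.log1p(recurrences))  # recency × frequency
```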
Tool Selection Hierarchy
Tools extend capability but add complexity. The art is knowing when not to use them. For any task, prefer tools in this order:
- No tool: can I answer from context and knowledge?
- Read-only: can I observe without modifying?
- Reversible: can I undo if wrong? (trash > rm)
- Narrow scope: can I use minimal permissions?
- Broad action: only if necessary, with confirmation
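The ladder can be expressed as an ordered check, as in this sketch; the task predicates are hypothetical hooks an agent implementation would provide:

```python
# Sketch: return the first (least powerful) tier that can handle the
# task. The attributes on `task` are hypothetical hooks, not a real API.
LADDER = [
    ("no_tool",    lambda t: t.answerable_from_context),
    ("read_only",  lambda t: t.observable_without_writes),
    ("reversible", lambda t: t.undoable),                   # trash, not rm
    ("narrow",     lambda t: t.minimal_permissions_suffice),
    ("broad",      lambda t: True),                         # last resort; confirm first
]

def choose_tier(task) -> str:
    """Walk the ladder top-down and stop at the weakest sufficient tier."""
    return next(tier for tier, applies in LADDER if applies(task))
```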
Parallelization Patterns
Independent operations → call together (multiple file reads, parallel searches). Sequential dependencies → wait for results (read config → use values). Hybrid → batch independent ops, then decide:

```
Step 1: [Read A, Read B, Search C] ā parallel
Step 2: Use results from A+B+C to decide
Step 3: [Execute based on step 2] → sequential
```
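In Python the hybrid pattern maps naturally onto asyncio.gather; the tool coroutines passed in here are hypothetical:

```python
import asyncio

# Sketch: batch the independent reads, then run the dependent step.
async def hybrid(read_file, search, execute):
    # Step 1: independent operations run concurrently.
    a, b, c = await asyncio.gather(
        read_file("config_a.yaml"),
        read_file("config_b.yaml"),
        search("deployment runbook"),
    )
    # Step 2: decide using all three results.
    plan = {"configs": [a, b], "notes": c}
    # Step 3: execution depends on the plan, so it stays sequential.
    return await execute(plan)
```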
Fallback Chains
Design graceful degradation for every tool dependency:

```
Primary: API call
  ↓ fails
Fallback 1: Cached data
  ↓ stale/missing
Fallback 2: Web search
  ↓ fails
Fallback 3: Ask user
  ↓ unavailable
Fallback 4: Acknowledge limitation honestly
```
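A minimal sketch of such a chain as an ordered list of steps; the step callables are hypothetical, and each is assumed to raise or return None on failure:

```python
# Sketch: try each step in order and degrade gracefully.
def with_fallbacks(steps, exhausted_message: str):
    for name, step in steps:
        try:
            result = step()
            if result is not None:
                return result
        except Exception as exc:
            print(f"{name} failed: {exc}")   # log, then fall through
    return exhausted_message                 # acknowledge the limitation honestly

# Usage (call_api, read_cache, web_search, ask_user are hypothetical):
# answer = with_fallbacks(
#     [("api", call_api), ("cache", read_cache),
#      ("web", web_search), ("user", ask_user)],
#     exhausted_message="I couldn't retrieve this; here is what I do know...",
# )
```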
Evaluation Loops
You can't improve what you can't measure. Agent evaluation spans six dimensions:
| Dimension | What to Measure | How |
|---|---|---|
| Correctness | Did it do the right thing? | Ground truth comparison |
| Efficiency | Resource usage, speed | Token count, latency |
| Reliability | Consistency across runs | Variance testing |
| Safety | Avoided harmful actions | Red team testing |
| Style | Communication quality | Human ratings |
| Autonomy | Needed intervention level | Escalation frequency |
The Continuous Improvement Loop
Agent improvement follows a five-step cycle. Each iteration should produce a measurable delta:
- Collect: gather feedback (explicit ratings plus implicit signals like user corrections)
- Analyze: identify patterns and recurring failure modes
- Hypothesize: "If I change X, metric Y should improve"
- Experiment: A/B test or controlled rollout of the change (see the sketch after this list)
- Integrate: update prompts, behaviors, and documentation
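A minimal sketch of the experiment step: score two prompt versions on a fixed eval set. Here run_agent(prompt, case) and score(case, output) are hypothetical hooks into your own harness:

```python
import statistics

# Sketch: compare prompt versions A and B on the same cases.
def ab_test(prompt_a: str, prompt_b: str, cases: list, run_agent, score) -> dict:
    results = {}
    for label, prompt in (("A", prompt_a), ("B", prompt_b)):
        results[label] = statistics.mean(
            score(case, run_agent(prompt, case)) for case in cases)
    return results  # integrate B only if it beats A by a meaningful margin
```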
Feedback Signals
Explicit feedback: thumbs up/down, corrections ("actually, it's X not Y"), "this was helpful" signals.
Implicit feedback: user accepted output without edits (good), user heavily edited (needs improvement), user abandoned task mid-way (failure), user asked clarifying questions (unclear initial response).
Self-evaluation: did I retry or correct myself? Did my plan match execution? Did output match stated goals?
Meta-Learning: Improving the Improver
The ultimate optimization skill is getting better at getting better. Meta-learning creates compound improvement: each lesson makes future learning faster.
Post-Task Reflection Protocol
After significant tasks, run this internal reflection:
```
## Post-Task Reflection
**What went well?**
- [Specific success and why]
**What could improve?**
- [Specific failure or friction point]
**What surprised me?**
- [Unexpected outcome or discovery]
**What would I do differently?**
- [Concrete change for next time]
**Should this update my docs/prompts?**
- [ ] Yes → specify what to change
- [x] No → captured in learning log
```

Prompt Evolution
Prompts should evolve based on evidence, not intuition. Maintain a prompt changelog that tracks what changed, why, and what the measured result was:
```
## Prompt Changelog
### v1.3 - 2026-01-15
- Added: Edge case examples for ambiguous queries
- Reason: v1.2 was too terse, missed nuances
- Result: 15% reduction in clarifying questions
### v1.2 - 2026-01-10
- Changed: Capability list from paragraphs to bullets
- Reason: v1.1 was too verbose, often skipped
- Result: Faster reading, but lost some nuance
```

Systematic Failure Analysis
When things go wrong, diagnose systematically with the 5 Whys:
```
Problem: Sent email to wrong recipient
Why 1: Used wrong address from context
Why 2: Two people with similar names in conversation
Why 3: Didn't verify recipient before sending
Why 4: No confirmation step for external actions
Why 5: Missing confirmation protocol for irreversible actions
→ Root cause: Missing confirmation step
→ Fix: Add confirmation for sends/posts/deletes
```

Failure Taxonomy
Categorize failures to spot patterns:
- Knowledge failures: didn't know something was important, had wrong information, couldn't retrieve relevant context
- Reasoning failures: logical error, missed edge case, overconfidence in uncertain situations
- Execution failures: tool error, wrong sequence, incomplete execution
- Communication failures: unclear output, wrong format, tone mismatch
Compound Learning
Knowledge compounds when lessons are written down (not just "remembered"), connections are made between learnings, old learnings are periodically reviewed, and new learnings are tested in practice.
```
Session 1: Learn basic tool usage
Session 2: Learn error handling (builds on 1)
Session 3: Learn parallelization (builds on 1, 2)
Session 4: Learn complex orchestration (builds on 1, 2, 3)
Learning rate accelerates as foundations solidify.
```

Quick Reference: Key Principles
- Prompts: Identity → Context → Capabilities → Constraints → Style → Examples. Be concrete. Negative constraints beat vague positives.
- Memory: Capture preferences, lessons, decisions. Forget transient details. Compress hierarchically: raw → summary → digest → essence.
- Tools: Prefer no tool → read-only → reversible → narrow → broad. Parallelize independent ops. Always design fallback chains.
- Evaluation: Measure correctness, efficiency, reliability, safety, style, autonomy. Run continuous improvement loops. Log everything.
- Meta-learning: Reflect after significant tasks. Evolve prompts with evidence. Diagnose failures systematically. Compound every lesson.
FAQ
What is the CRISP framework for AI agent prompts?
CRISP is a mnemonic for writing effective AI agent system prompts: Concrete (specific metrics, not vague adjectives), Role-defined (clear identity like "security analyst reviewing code"), Instructional (action verbs: identify, extract, compare), Structured (predictable format with headers and lists), and Prioritized (explicit importance markers like CRITICAL, MUST, PREFER). Applying CRISP consistently reduces prompt ambiguity and improves agent output quality.
How should AI agents manage memory across sessions?
Use a four-layer architecture: working memory (context window: fast, ephemeral), short-term memory (daily session logs: retained for days), long-term memory (distilled patterns and lessons: permanent), and external memory (files and databases: unlimited but requiring retrieval). The key discipline is hierarchical compression: raw logs → session summaries → weekly digests → long-term lessons. Regular maintenance (capture daily, distill weekly, prune monthly) prevents memory bloat.
How do you measure AI agent performance improvement?
Measure across six dimensions: correctness (ground truth comparison), efficiency (token count and latency), reliability (consistency across runs), safety (red team testing), style (human ratings), and autonomy (escalation frequency). Combine objective metrics (task completion rate, error rate) with implicit signals (how often users correct the agent, whether they re-engage for similar tasks). Track improvements in a learning log with observations, hypotheses, changes, and measured results.
What is meta-learning for AI agents?
Meta-learning is the practice of agents systematically improving their own improvement process. It includes post-task reflection protocols, prompt evolution based on evidence with version changelogs, systematic failure analysis using the 5 Whys technique, and compound learning where each lesson accelerates future learning. The goal: every task execution makes the agent slightly better at subsequent tasks.