Agent Optimization Playbook
A practical meta-guide for improving AI agent performance across every dimension: prompt engineering, memory architecture, tool selection, evaluation loops, and meta-learning. Synthesized from AI research, prompt engineering literature, and real autonomous agent operations.
Key Takeaway
Agent optimization is not a single technique; it's a compound learning system. The CRISP framework structures your prompts, four-layer memory retains what matters, evaluation loops measure improvement, and meta-learning accelerates everything. Small, consistent changes compound into significant capability gains.
Prompt Engineering: The CRISP Framework
The best system prompts aren't just instructions; they're cognitive scaffolding. Research on long-context behavior (the "lost in the middle" effect) shows models use content at the start and end of a prompt more reliably than the middle, so structural ordering matters.
System Prompt Architecture
Effective system prompts follow a consistent hierarchy. Put identity and critical constraints first, capabilities and style in the middle, examples at the end:
```
┌───────────────────────────────────────────┐
│ IDENTITY      (Who am I?)                 │
├───────────────────────────────────────────┤
│ CONTEXT       (What's the situation?)     │
├───────────────────────────────────────────┤
│ CAPABILITIES  (What can I do?)            │
├───────────────────────────────────────────┤
│ CONSTRAINTS   (What must I avoid?)        │
├───────────────────────────────────────────┤
│ STYLE         (How should I behave?)      │
├───────────────────────────────────────────┤
│ EXAMPLES      (What does good look like?) │
└───────────────────────────────────────────┘
```
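As a concrete illustration, here is a minimal sketch of assembling a prompt in this order; the section names and the `build_system_prompt` helper are illustrative assumptions, not a specific library's API:

```python
# Minimal sketch: render sections in the fixed hierarchy so critical
# content lands earliest. Names are illustrative, not a library API.
SECTION_ORDER = ["identity", "context", "capabilities", "constraints", "style", "examples"]

def build_system_prompt(sections: dict[str, str]) -> str:
    """Render known sections in hierarchy order, skipping any that are empty."""
    parts = [f"## {name.upper()}\n{sections[name].strip()}"
             for name in SECTION_ORDER if sections.get(name, "").strip()]
    return "\n\n".join(parts)

print(build_system_prompt({
    "identity": "You are a security analyst reviewing code.",
    "constraints": "CRITICAL: never approve code with unvalidated input.",
}))
```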
The CRISP Mnemonic
Every element of a system prompt should pass the CRISP test:
| Element | Meaning | Example |
|---|---|---|
| Concrete | Specific, not vague | "Respond in 2-3 sentences" not "Be concise" |
| Role-defined | Clear identity | "You are a security analyst reviewing code" |
| Instructional | Action-oriented verbs | "Identify, Extract, Compare, Generate" |
| Structured | Predictable format | Headers, bullets, numbered lists |
| Prioritized | Explicit importance | "CRITICAL:", "MUST:", "PREFER:" |
Constraint Patterns That Work
Negative constraints (what NOT to do) are often more effective than positive ones:
ā "Be helpful"
ā
"Never provide information that could enable harm"
ā "Write good code"
ā
"Never use deprecated APIs. Never ignore error handling."Ranked constraints prevent conflicts when instructions contradict each other:
Priority order:
1. Safety (never override)
2. Accuracy (prefer silence over fabrication)
3. Helpfulness (within above bounds)
4. Style (adjust freely)
```
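A minimal sketch of rendering ranked constraints so the model sees explicit priority markers; the tier texts and the rendering format are illustrative assumptions:

```python
# Sketch: priority-ordered constraints. Lower rank wins when two
# instructions conflict; tier wording here is illustrative.
PRIORITIES = [
    (1, "SAFETY",      "Never provide information that could enable harm."),
    (2, "ACCURACY",    "Prefer silence over fabrication."),
    (3, "HELPFULNESS", "Answer as fully as the bounds above allow."),
    (4, "STYLE",       "Match the user's tone; adjust freely."),
]

def render_constraints() -> str:
    """Emit one line per tier so the ordering is explicit in the prompt."""
    return "\n".join(f"{rank}. [{tier}] {text}" for rank, tier, text in PRIORITIES)
```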
Example Injection: Few-Shot with Diversity
Include 3-5 examples that span the input space: one typical case, one edge case, one "tricky" case that could go wrong, and one showing the desired format precisely. Contrastive examples are especially effective:

```
GOOD: "The function returns -1 on error because..."
BAD: "The function returns -1."
Why: Good explains reasoning, bad just states facts.
```
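One way to assemble such a diverse example set programmatically; the categories, tasks, and formatting below are illustrative assumptions:

```python
# Sketch: build a few-shot block spanning typical, edge, and tricky
# cases, plus a contrastive GOOD/BAD pair. Contents are illustrative.
EXAMPLES = [
    ("typical", "Summarize this error log.", "Three lines: root cause, impact, fix."),
    ("edge",    "Summarize an empty log.",   "Say the log is empty; invent nothing."),
    ("tricky",  "Summarize a log with two unrelated failures.", "Report both separately."),
]

def few_shot_block() -> str:
    parts = [f"Example ({kind}): {task}\nExpected: {expected}"
             for kind, task, expected in EXAMPLES]
    # End with the contrastive pair so the desired reasoning style is explicit.
    parts.append('GOOD: "The function returns -1 on error because..."\n'
                 'BAD: "The function returns -1."\n'
                 'Why: good explains reasoning, bad just states facts.')
    return "\n\n".join(parts)
```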
Prompt Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Wall of text | Ignored after first few paragraphs | Use headers, bullets |
| Vague adjectives | "Good", "appropriate", "reasonable" | Concrete criteria |
| Contradictions | Conflicting instructions | Priority ordering |
| Over-specification | Rigid, fragile behavior | Principles over rules |
| Missing context | Agent lacks situation awareness | Include environment details |
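Some of these anti-patterns can be caught mechanically. A rough sketch of a prompt lint for vague adjectives and missing priority markers; the word list and marker set are assumptions to adapt:

```python
import re

# Sketch: flag two anti-patterns from the table above. The vague-word
# list and the priority markers are illustrative assumptions.
VAGUE = re.compile(r"\b(good|appropriate|reasonable|proper|suitable)\b", re.I)
MARKERS = ("CRITICAL:", "MUST:", "PREFER:")

def lint_prompt(prompt: str) -> list[str]:
    warnings = [f"line {i}: vague adjective {m.group(0)!r}; state a concrete criterion"
                for i, line in enumerate(prompt.splitlines(), 1)
                if (m := VAGUE.search(line))]
    if not any(marker in prompt for marker in MARKERS):
        warnings.append("no priority markers; mark importance with CRITICAL:/MUST:/PREFER:")
    return warnings
```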
Memory Architecture: Four Layers
Agents face a fundamental challenge: infinite experience, finite context. The solution isn't just storage; it's intelligent forgetting.
```
┌───────────────────────────────────────────────────┐
│ WORKING MEMORY                                    │
│ Current session context, active task              │
│ (Context window: fast, limited, ephemeral)        │
└────────────────────────┬──────────────────────────┘
                         ↓
┌────────────────────────┴──────────────────────────┐
│ SHORT-TERM MEMORY                                 │
│ Session summaries, today's events                 │
│ (Daily files: curated daily, week retention)      │
└────────────────────────┬──────────────────────────┘
                         ↓
┌────────────────────────┴──────────────────────────┐
│ LONG-TERM MEMORY                                  │
│ Core knowledge, relationships, lessons            │
│ (Distilled periodically: permanent)               │
└────────────────────────┬──────────────────────────┘
                         ↓
┌────────────────────────┴──────────────────────────┐
│ EXTERNAL MEMORY                                   │
│ Files, databases, vector stores, archives         │
│ (Disk/cloud: unlimited, requires retrieval)       │
└───────────────────────────────────────────────────┘
```
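A minimal sketch of the first three layers as explicit stores (external memory would live on disk or in a database behind retrieval); the class and method names are illustrative assumptions, not a specific framework's API:

```python
from dataclasses import dataclass, field

# Sketch: the in-context layers as explicit stores with promotion
# and demotion between them. Names here are illustrative.
@dataclass
class AgentMemory:
    working: list[str] = field(default_factory=list)     # context window: ephemeral
    short_term: list[str] = field(default_factory=list)  # daily notes: kept about a week
    long_term: list[str] = field(default_factory=list)   # distilled lessons: permanent

    def end_session(self, summary: str) -> None:
        """Demote the session: keep a short summary, drop raw working context."""
        self.short_term.append(summary)
        self.working.clear()

    def distill(self, lesson: str) -> None:
        """Promote a durable lesson from short-term review into long-term memory."""
        self.long_term.append(lesson)
```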
What to Remember vs. Forget
Distill to long-term memory:
- User preferences that affect future interactions
- Lessons learned from mistakes
- Relationship context (names, roles, dynamics)
- Recurring project context
- Decisions and their rationale
Let decay:
- Transient debugging details
- Routine task completions
- Information easily re-retrieved
- Outdated context (superseded decisions)
Memory Compression
Apply hierarchical summarization to prevent memory bloat:
```
Raw log (1000 tokens)
  → Session summary (100 tokens)
  → Weekly digest (30 tokens)
  → Long-term lesson (10 tokens)
```

Combine with salience scoring: weight memories by emotional intensity, recurrence, explicit marking ("remember this!"), and recency × frequency.
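A rough sketch of such a salience score over those four signals; the weights and the 30-day decay scale are illustrative assumptions to tune against your own retention outcomes:

```python
import math
import time

# Sketch: score a memory by the signals named above. All coefficients
# are assumptions; calibrate them against what you actually want kept.
def salience(intensity: float, recurrences: int, explicit: bool,
             last_seen_ts: float) -> float:
    days_old = (time.time() - last_seen_ts) / 86400
    recency = math.exp(-days_old / 30)                  # decays on a ~30-day scale
    return (2.0 * intensity                             # emotional intensity, 0..1
            + 1.0 * math.log1p(recurrences)             # diminishing returns on repeats
            + 3.0 * float(explicit)                     # "remember this!" dominates
            + 1.5 * recency * math.log1p(recurrences))  # recency × frequency
```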
Tool Selection Hierarchy
Tools extend capability but add complexity. The art is knowing when not to use them. For any task, prefer tools in this order:
- No tool: can I answer from context and knowledge?
- Read-only: can I observe without modifying?
- Reversible: can I undo if wrong? (trash > rm)
- Narrow scope: can I use minimal permissions?
- Broad action: only if necessary, with confirmation
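The ladder can be expressed as an ordered check, as in this sketch; the task predicates are hypothetical hooks an agent implementation would provide:

```python
# Sketch: return the first (least powerful) tier that can handle the
# task. The attributes on `task` are hypothetical hooks, not a real API.
LADDER = [
    ("no_tool",    lambda t: t.answerable_from_context),
    ("read_only",  lambda t: t.observable_without_writes),
    ("reversible", lambda t: t.undoable),                   # trash, not rm
    ("narrow",     lambda t: t.minimal_permissions_suffice),
    ("broad",      lambda t: True),                         # last resort; confirm first
]

def choose_tier(task) -> str:
    """Walk the ladder top-down and stop at the weakest sufficient tier."""
    return next(tier for tier, applies in LADDER if applies(task))
```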
Parallelization Patterns
Independent operations → call together (multiple file reads, parallel searches). Sequential dependencies → wait for results (read config → use values). Hybrid → batch independent ops, then decide:

```
Step 1: [Read A, Read B, Search C] ā parallel
Step 2: Use results from A+B+C to decide
Step 3: [Execute based on step 2] → sequential
```
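In Python the hybrid pattern maps naturally onto asyncio.gather; the tool coroutines passed in here are hypothetical:

```python
import asyncio

# Sketch: batch the independent reads, then run the dependent step.
async def hybrid(read_file, search, execute):
    # Step 1: independent operations run concurrently.
    a, b, c = await asyncio.gather(
        read_file("config_a.yaml"),
        read_file("config_b.yaml"),
        search("deployment runbook"),
    )
    # Step 2: decide using all three results.
    plan = {"configs": [a, b], "notes": c}
    # Step 3: execution depends on the plan, so it stays sequential.
    return await execute(plan)
```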
Fallback Chains
Design graceful degradation for every tool dependency:

```
Primary: API call
  ↓ fails
Fallback 1: Cached data
  ↓ stale/missing
Fallback 2: Web search
  ↓ fails
Fallback 3: Ask user
  ↓ unavailable
Fallback 4: Acknowledge limitation honestly
```
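A minimal sketch of such a chain as an ordered list of steps; the step callables are hypothetical, and each is assumed to raise or return None on failure:

```python
# Sketch: try each step in order and degrade gracefully.
def with_fallbacks(steps, exhausted_message: str):
    for name, step in steps:
        try:
            result = step()
            if result is not None:
                return result
        except Exception as exc:
            print(f"{name} failed: {exc}")   # log, then fall through
    return exhausted_message                 # acknowledge the limitation honestly

# Usage (call_api, read_cache, web_search, ask_user are hypothetical):
# answer = with_fallbacks(
#     [("api", call_api), ("cache", read_cache),
#      ("web", web_search), ("user", ask_user)],
#     exhausted_message="I couldn't retrieve this; here is what I do know...",
# )
```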
Evaluation Loops
You can't improve what you can't measure. Agent evaluation spans six dimensions:
| Dimension | What to Measure | How |
|---|---|---|
| Correctness | Did it do the right thing? | Ground truth comparison |
| Efficiency | Resource usage, speed | Token count, latency |
| Reliability | Consistency across runs | Variance testing |
| Safety | Avoided harmful actions | Red team testing |
| Style | Communication quality | Human ratings |
| Autonomy | Needed intervention level | Escalation frequency |
The Continuous Improvement Loop
Agent improvement follows a five-step cycle. Each iteration should produce a measurable delta:
- Collect: gather feedback (explicit ratings plus implicit signals like user corrections)
- Analyze: identify patterns and recurring failure modes
- Hypothesize: "If I change X, metric Y should improve"
- Experiment: A/B test or controlled rollout of the change (see the sketch after this list)
- Integrate: update prompts, behaviors, and documentation
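A minimal sketch of the experiment step: score two prompt versions on a fixed eval set. Here run_agent(prompt, case) and score(case, output) are hypothetical hooks into your own harness:

```python
import statistics

# Sketch: compare prompt versions A and B on the same cases.
def ab_test(prompt_a: str, prompt_b: str, cases: list, run_agent, score) -> dict:
    results = {}
    for label, prompt in (("A", prompt_a), ("B", prompt_b)):
        results[label] = statistics.mean(
            score(case, run_agent(prompt, case)) for case in cases)
    return results  # integrate B only if it beats A by a meaningful margin
```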
Feedback Signals
Explicit feedback: thumbs up/down, corrections ("actually, it's X not Y"), "this was helpful" signals.
Implicit feedback: user accepted output without edits (good), user heavily edited (needs improvement), user abandoned task mid-way (failure), user asked clarifying questions (unclear initial response).
Self-evaluation: did I retry or correct myself? Did my plan match execution? Did output match stated goals?
Meta-Learning: Improving the Improver
The ultimate optimization skill is getting better at getting better. Meta-learning creates compound improvement: each lesson makes future learning faster.
Post-Task Reflection Protocol
After significant tasks, run this internal reflection:
```
## Post-Task Reflection
**What went well?**
- [Specific success and why]
**What could improve?**
- [Specific failure or friction point]
**What surprised me?**
- [Unexpected outcome or discovery]
**What would I do differently?**
- [Concrete change for next time]
**Should this update my docs/prompts?**
- [ ] Yes → specify what to change
- [x] No → captured in learning log
```

Prompt Evolution
Prompts should evolve based on evidence, not intuition. Maintain a prompt changelog that tracks what changed, why, and what the measured result was:
```
## Prompt Changelog
### v1.3 - 2026-01-15
- Added: Edge case examples for ambiguous queries
- Reason: v1.2 was too terse, missed nuances
- Result: 15% reduction in clarifying questions
### v1.2 - 2026-01-10
- Changed: Capability list from paragraphs to bullets
- Reason: v1.1 was too verbose, often skipped
- Result: Faster reading, but lost some nuance
```

Systematic Failure Analysis
When things go wrong, diagnose systematically with the 5 Whys:
```
Problem: Sent email to wrong recipient
Why 1: Used wrong address from context
Why 2: Two people with similar names in conversation
Why 3: Didn't verify recipient before sending
Why 4: No confirmation step for external actions
Why 5: Missing confirmation protocol for irreversible actions
→ Root cause: Missing confirmation step
→ Fix: Add confirmation for sends/posts/deletes
```

Failure Taxonomy
Categorize failures to spot patterns:
- Knowledge failures: didn't know something was important, had wrong information, couldn't retrieve relevant context
- Reasoning failures: logical error, missed edge case, overconfidence in uncertain situations
- Execution failures: tool error, wrong sequence, incomplete execution
- Communication failures: unclear output, wrong format, tone mismatch
Compound Learning
Knowledge compounds when lessons are written down (not just "remembered"), connections are made between learnings, old learnings are periodically reviewed, and new learnings are tested in practice.
```
Session 1: Learn basic tool usage
Session 2: Learn error handling (builds on 1)
Session 3: Learn parallelization (builds on 1, 2)
Session 4: Learn complex orchestration (builds on 1, 2, 3)
Learning rate accelerates as foundations solidify.
```

Quick Reference: Key Principles
- Prompts: Identity → Context → Capabilities → Constraints → Style → Examples. Be concrete. Negative constraints beat vague positives.
- Memory: Capture preferences, lessons, decisions. Forget transient details. Compress hierarchically: raw → summary → digest → essence.
- Tools: Prefer no tool → read-only → reversible → narrow → broad. Parallelize independent ops. Always design fallback chains.
- Evaluation: Measure correctness, efficiency, reliability, safety, style, autonomy. Run continuous improvement loops. Log everything.
- Meta-learning: Reflect after significant tasks. Evolve prompts with evidence. Diagnose failures systematically. Compound every lesson.
FAQ
What is the CRISP framework for AI agent prompts?
CRISP is a mnemonic for writing effective AI agent system prompts: Concrete (specific metrics, not vague adjectives), Role-defined (clear identity like "security analyst reviewing code"), Instructional (action verbs: identify, extract, compare), Structured (predictable format with headers and lists), and Prioritized (explicit importance markers like CRITICAL, MUST, PREFER). Applying CRISP consistently reduces prompt ambiguity and improves agent output quality.
How should AI agents manage memory across sessions?
Use a four-layer architecture: working memory (context window: fast, ephemeral), short-term memory (daily session logs: retained for days), long-term memory (distilled patterns and lessons: permanent), and external memory (files and databases: unlimited but requiring retrieval). The key discipline is hierarchical compression: raw logs → session summaries → weekly digests → long-term lessons. Regular maintenance (capture daily, distill weekly, prune monthly) prevents memory bloat.
How do you measure AI agent performance improvement?
Measure across six dimensions: correctness (ground truth comparison), efficiency (token count and latency), reliability (consistency across runs), safety (red team testing), style (human ratings), and autonomy (escalation frequency). Combine objective metrics (task completion rate, error rate) with implicit signals (how often users correct the agent, whether they re-engage for similar tasks). Track improvements in a learning log with observations, hypotheses, changes, and measured results.
What is meta-learning for AI agents?
Meta-learning is the practice of agents systematically improving their own improvement process. It includes post-task reflection protocols, prompt evolution based on evidence with version changelogs, systematic failure analysis using the 5 Whys technique, and compound learning where each lesson accelerates future learning. The goal: every task execution makes the agent slightly better at subsequent tasks.