Retrospectives

Goal: Get a client-deliverable report and learn what went well / what went wrong, automatically.

Do this

When the engagement is done (or just paused for the day):

"Run the retrospective and write it to disk."

The AI calls run_retrospective with write_to_disk: true. Output lands in ./retrospective/<engagement-id>/.
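
If you want to script the same step from your own harness instead of asking in chat, the call reduces to a single flag. A minimal TypeScript sketch, assuming a generic callTool dispatcher (hypothetical); only the tool name run_retrospective and the write_to_disk parameter come from this page:

// Hypothetical sketch; callTool stands in for however your agent
// framework dispatches tool calls.
declare function callTool(name: string, args: Record<string, unknown>): Promise<unknown>;

async function runRetro() {
  // write_to_disk: true makes the output land in ./retrospective/<engagement-id>/
  const result = await callTool("run_retrospective", { write_to_disk: true });
  console.log(result);
}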

Or via shell:

npm run retrospective

That's it. Open retrospective/<engagement-id>/report.md and you have a client-ready Markdown report.

What you get

File                         What it is
report.md                    Client-deliverable attack-path report — narrative timeline, findings, evidence, recommendations
inference-suggestions.json   New inference rules the engine spotted from graph patterns
skill-gaps.json              Skills that were never used + techniques that were missing
context-improvements.json    Frontier scoring observations, OPSEC noise patterns, logging gaps
training-traces.json         RLVR training triplets (state → action → outcome → reward)
trace-quality.json           Quality assessment of the training data
summary.txt                  High-level "what happened"
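
Everything except report.md and summary.txt is plain JSON, so a run is easy to post-process. A minimal sketch that loads whatever JSON landed in the output directory, assuming only the directory layout described above and making no schema assumptions (loadRetrospective is a hypothetical helper, not part of the tool):

import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Read every JSON artifact from one retrospective run.
function loadRetrospective(engagementId: string): Record<string, unknown> {
  const dir = join("retrospective", engagementId);
  const artifacts: Record<string, unknown> = {};
  for (const file of readdirSync(dir)) {
    if (file.endsWith(".json")) {
      artifacts[file] = JSON.parse(readFileSync(join(dir, file), "utf8"));
    }
  }
  return artifacts;
}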

What to do with the outputs

  • report.md → review, customize, deliver to the client.
  • inference-suggestions.json → apply the high-confidence ones with suggest_inference_rule. Patterns with 5+ occurrences are auto-applied to the active rule set; the rest are flagged for review.
  • skill-gaps.json → tells you which skills to write or update before the next engagement.
  • training-traces.json → feed into your RLVR pipeline if you have one (see the loading sketch after this list).
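
The documented shape of a trace is only state → action → outcome → reward, so the field names below and the assumption that the file is a flat array are illustrative. A sketch of loading the triplets and keeping the positively rewarded ones for a fine-tuning batch:

import { readFileSync } from "node:fs";

// Field names are illustrative; only the state/action/outcome/reward
// structure is documented. Assumes the file is a flat JSON array.
interface TrainingTrace {
  state: unknown;
  action: unknown;
  outcome: unknown;
  reward: number;
}

const traces: TrainingTrace[] = JSON.parse(
  readFileSync("retrospective/<engagement-id>/training-traces.json", "utf8"),
);

// Keep only positively rewarded triplets.
const positive = traces.filter((t) => t.reward > 0);
console.log(`${positive.length}/${traces.length} traces have positive reward`);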

When to run it

  • End of every engagement, even partial ones.
  • End of each working day during long engagements — you'll catch logging gaps while you can still fix them.
  • Before starting the next engagement of the same type — apply the inference suggestions first.

How auto-improvement works

Two things happen automatically when you run a retrospective:

Inference rule auto-apply

Occurrences ≥ 5  →  Auto-applied (added to active rule set)
Occurrences < 5  →  Suggestion only (logged for review)

The threshold prevents one-off coincidences from poisoning the rule set. applyInferenceSuggestions() returns counts of what was applied vs. skipped.
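
The decision itself is just a threshold check. A sketch of the logic, not the real implementation: the suggestion shape is an assumption, while the 5-occurrence cutoff and the applied/skipped counts come from this page:

// Suggestion shape is illustrative; only `occurrences` and the >= 5
// threshold are documented.
interface InferenceSuggestion {
  rule: string;
  occurrences: number;
}

function applyInferenceSuggestions(suggestions: InferenceSuggestion[]) {
  let applied = 0;
  let skipped = 0;
  for (const s of suggestions) {
    if (s.occurrences >= 5) {
      applied++;   // added to the active rule set (the actual write is elided)
    } else {
      skipped++;   // logged as a suggestion for manual review
    }
  }
  return { applied, skipped };
}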

Technique priors + skill annotations

Per-technique success rates are computed from training traces and feed into frontier scoring on the next engagement — techniques with historically higher success get a scoring boost. Skills get usage counts and success rates so you know which ones are pulling weight.

Annotation      What it measures
use_count       Total times the skill was referenced
success_count   Times a skill-associated action succeeded
failure_count   Times it failed
success_rate    success_count / use_count
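
A hedged sketch of how both numbers could fall out of the traces: the per-skill annotation above and the per-technique prior that feeds frontier scoring. The counter fields mirror the table; the 0.5 baseline and 0.2 boost weight are assumptions, not the engine's actual formula:

// Counter fields mirror the annotation table; the baseline and boost
// weight below are illustrative, not documented constants.
interface Counts {
  use_count: number;
  success_count: number;
  failure_count: number;
}

// success_rate = success_count / use_count (0 when never used).
function successRate(c: Counts): number {
  return c.use_count === 0 ? 0 : c.success_count / c.use_count;
}

// Skill annotation as described in the table above.
function annotateSkill(c: Counts) {
  return { ...c, success_rate: successRate(c) };
}

// Technique prior feeding frontier scoring: techniques with historically
// higher success get a boost. The 0.5 baseline and 0.2 weight are assumptions.
function boostedFrontierScore(baseScore: number, technique: Counts): number {
  const prior = technique.use_count === 0 ? 0.5 : successRate(technique);
  return baseScore * (1 + 0.2 * (prior - 0.5));
}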

See also