Optimization Analysis

Page embargoed pending paper publication

The detailed comparison of hemlock's three adversarial optimization modes (Cross-Entropy Method (CEM), Genetic Algorithm, and Whitebox gradient-based) against the unoptimized baseline is part of a research paper currently under peer review. The full results, including paired-replay claim-grade evidence at 72B FP8 fullfw, will be published in the paper and summarized on this page after publication.

Until then, this page holds only a brief qualitative summary. The specific attack success rate (ASR) baselines that previously appeared here came from pre-fix exploratory sweeps that the paper itself does not cite as evidence; they have been moved out of the public repository while the paper is under review.

What this page will cover (after publication)

  • Head-to-head comparison: Baseline vs CEM vs Genetic vs Whitebox at 7B
  • Per-optimizer behavior on retrieval and injection axes
  • The Optimization-Retrieval Disconnect finding (paired-replay claim-grade at 72B FP8)
  • The three-shape coupling taxonomy (override decoupled-and-canary-positive, redirect framework-template-prior-suppressed, multistage coupled-and-suppressed)
  • Per-framework variance under optimizer pressure

Qualitative observation that holds

The headline qualitative finding, to be substantiated in the paper:

  • Embedding-similarity optimization that improves retrieval ranking does not consistently translate to improved end-to-end injection success. Optimizers that increase the probability the poisoned document appears in the retrieved context can simultaneously decrease the probability the LLM emits the attacker's payload, because the modifications that improve retrieval (keyword stuffing, semantic alignment) can disrupt the hidden payload's ability to influence model output.

The mechanism, the magnitude, and the per-category structure are deferred to the paper.
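
One way to see the disconnect described above in your own runs is to report retrieval and injection separately rather than as a single success number. The sketch below is a minimal illustration, not hemlock's implementation: the `Trial` record, its field names, and the `summarize` helper are all hypothetical, and the two trial lists are made-up toy inputs chosen only to show the arithmetic, not measurements from any model or framework. It assumes end-to-end success requires both that the poisoned document was retrieved and that the model emitted the payload.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Trial:
    """One attack attempt (hypothetical record layout, not a hemlock type)."""
    retrieved: bool  # poisoned document appeared in the retrieved context
    injected: bool   # LLM emitted the attacker's payload


def summarize(trials: List[Trial]) -> Dict[str, float]:
    """Split end-to-end success into a retrieval stage and an injection stage."""
    if not trials:
        raise ValueError("need at least one trial")
    n = len(trials)
    n_retrieved = sum(t.retrieved for t in trials)
    # End-to-end success needs both stages to succeed.
    n_success = sum(t.retrieved and t.injected for t in trials)
    return {
        "retrieval_rate": n_retrieved / n,
        "end_to_end_asr": n_success / n,
        # P(payload emitted | document retrieved); 0.0 if nothing was retrieved.
        "injection_given_retrieval": (
            n_success / n_retrieved if n_retrieved else 0.0
        ),
    }


# Toy inputs for illustration only (NOT experimental results).
baseline = (
    [Trial(True, True)] * 3 + [Trial(True, False)] * 1 + [Trial(False, False)] * 6
)
optimized = (
    [Trial(True, True)] * 2 + [Trial(True, False)] * 6 + [Trial(False, False)] * 2
)

baseline_stats = summarize(baseline)
optimized_stats = summarize(optimized)

if __name__ == "__main__":
    for name, stats in (("baseline", baseline_stats), ("optimized", optimized_stats)):
        print(name, stats)
```

In the toy inputs, the "optimized" list has a higher retrieval rate but a lower conditional injection rate, so its end-to-end rate is lower than the baseline's. Reporting the three quantities side by side makes that kind of trade-off visible where a single ASR figure would hide it.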

Reproduce locally

The optimizers themselves are public:

The four-framework target environment is also public via hemlock stack. You can run your own optimizer head-to-head using these primitives.