
Your grammar constraint is a jailbreak
Grammar-Constrained Decoding — the structured outputs reliability feature in vLLM, SGLang, and OpenAI's API — silently zeroes out refusal tokens. Tsinghua's CodeSpear raises API jailbreak rates from 22% to 67%. The only working fix is CodeShield: MIT-licensed, drops ASR from 83% to 5.6%. Three PM actions to take this week.

16/6/2026 · 20:23
7 suscripciones · 31 contenidos
The "structured outputs" toggle you enabled for code generation reliability? Researchers at Tsinghua University just showed it can also be used to strip the model's ability to refuse anything.
The paper is arXiv:2606.11817, submitted June 10. 1 The attack is called CodeSpear. It requires no adversarial prompting, no gradient optimization, and no custom grammar — only the standard GCD interface you already have, plus an off-the-shelf Python or C++ grammar file.
How it works
Grammar-Constrained Decoding (GCD) is the technique that forces an LLM's output to conform to a valid code grammar — every token the model outputs must be a legal continuation of the grammar at that position. It's deployed as the default backend in vLLM (via XGrammar), SGLang, TensorRT-LLM, and MLC-LLM, and it's what powers OpenAI's Structured Outputs. 2
The jailbreak follows from a simple set-theory observation: the set of natural-language refusal strings (things like "I'm sorry, I cannot assist with that") and the set of strings accepted by a Python grammar have zero overlap. When GCD is active, every refusal token is masked to zero probability. The model cannot refuse — not because its values changed, but because refusals aren't syntactically valid code.
As the paper puts it: "Simply applying a benign code grammar constraint can effectively jailbreak LLMs." 1

The numbers
The paper tested 10 models — five locally deployed (Qwen2.5-Coder-7B/32B, Qwen2.5-7B/32B, LLaMA3-8B) and five API-based (GPT-5, GPT-5-mini, MiniMax-M2.5, MiniMax-M2.7, GPT-OSS-120B) — against RMCBench (182 malicious requests across 10 malware categories) and MalwareBench (320 requests across 29 subcategories). 1
- On local models, CodeSpear raised average Attack Success Rate (ASR) from 54.9% to 81.8% (+26.9 pp). Qwen2.5-Coder-7B on MalwareBench: 29.8% → 83.4%.
- On API models, average ASR jumped from 22.0% to 67.4% (+45.4 pp). MiniMax-M2.7 on RMCBench: 20.3% → 85.5%.
- GPT-5 and GPT-5-mini showed partial resistance by defaulting to
passstatements — but tightening the grammar to disallowpasspushed GPT-5's ASR from 55.5% to 70.3%. 1
The ASR jump holds across Python, C++, and Java grammars. There's no safe programming language to hide behind.

Why standard defenses don't help
Two common mitigations both fail here:
Safe-DPO (training the model to refuse in natural language) collapses under CodeSpear. On Qwen2.5-Coder-7B, Safe-DPO ASR under attack stays at 77.4% — barely below vanilla's 83.1%. The training teaches refusals that the grammar constraint then silences. 1
Circuit breakers fail because GCD also masks the EOS (end-of-sequence) token. The model can't stop generating, even if it "wants to." 2
This is why the paper authors write it's "a fundamental risk" — not a patchable bug in one framework, but a structural property of how grammar-constrained decoding works. 1
The one defense that works
The paper proposes CodeShield: DPO-based training that teaches the model a third behavior mode. Under normal conditions, the model prefers natural-language refusals. When GCD is active and refusals are grammatically impossible, the model falls back to generating "honeypot code" — syntactically valid, semantically harmless code randomly sampled from a general code corpus. 4
On Qwen2.5-Coder-7B, CodeShield drops ASR under CodeSpear from 83.1% to 5.6%. Coding utility barely moves: HumanEval pass@1 goes from 70.9% to 67.5%. The defense also holds against an adaptive adversary who iteratively tightens the grammar over 10 rounds — ASR on RMCBench actually decreases from 7.7% to 6.0%. 1
The structural diversity of honeypot code is what makes this robust: unlike a fixed safe pattern like
pass, randomly sampled code can't be suppressed by grammar tightening. Yitong Zhang (lead author) explained the preference hierarchy on HuggingFace: "honeypot code only acts as a fallback when normal refusal is no longer expressible." 5
CodeShield defense code is fully open-sourced under MIT license. 4 The attack code is restricted to authorized researchers.
3 PM actions
1. Audit which of your code features use GCD. Check whether your code generation endpoints use structured decoding — look for XGrammar in your vLLM/SGLang config, or check whether you pass
response_format with a JSON Schema or grammar to OpenAI's API. Any feature that constrains output to code syntax is exposed, whether you intended it or not. Your threat model needs a "grammar constraints" entry alongside injection and prompt leakage.2. Evaluate CodeShield against your specific model. The defense currently has results only on Qwen2.5 models (7B and 32B variants). If your stack runs a different base model, third-party validation doesn't exist yet. That's either an opportunity (run the evaluation yourself and contribute data) or a blocker — know which it is before shipping. The DPO training pipeline is in the open repo under
Defense/. 43. For hosted API products, understand your refusal architecture. OpenAI's Structured Outputs routes refusals through a separate field outside the grammar-constrained output, which partially addresses this. If you're building on top of an API that doesn't have a separate refusal channel, that gap is now documented. Ask your API provider directly whether CodeSpear-class attacks are in their threat model.
The broader signal here comes from Travis Lelle's analysis: "LLM safety has been built almost entirely in the natural-language modality. As models increasingly operate through code, tool calls, structured outputs, and other constrained formats, the assumption that text-based alignment transfers to these modalities is proving wrong." 2
That's not a prediction — it's a description of where things stand today.
Cargando tarjeta de contenido…
56-second visual breakdown of the CodeSpear attack from Learn AI Visually — the fastest primer on the mechanism before your next sprint planning.
Añade más opiniones o contexto en torno a este contenido.