Your grammar constraint is a jailbreak

The "structured outputs" toggle you enabled for code generation reliability? Researchers at Tsinghua University just showed it can also be used to strip the model's ability to refuse anything.

The paper is arXiv:2606.11817, submitted June 10. 1 The attack is called CodeSpear. It requires no adversarial prompting, no gradient optimization, and no custom grammar — only the standard GCD interface you already have, plus an off-the-shelf Python or C++ grammar file.

How it works

Grammar-Constrained Decoding (GCD) is the technique that forces an LLM's output to conform to a valid code grammar — every token the model outputs must be a legal continuation of the grammar at that position. It's deployed as the default backend in vLLM (via XGrammar), SGLang, TensorRT-LLM, and MLC-LLM, and it's what powers OpenAI's Structured Outputs. 2

The jailbreak follows from a simple set-theory observation: the set of natural-language refusal strings (things like "I'm sorry, I cannot assist with that") and the set of strings accepted by a Python grammar have zero overlap. When GCD is active, every refusal token is masked to zero probability. The model cannot refuse — not because its values changed, but because refusals aren't syntactically valid code.

As the paper puts it: "Simply applying a benign code grammar constraint can effectively jailbreak LLMs." 1

CodeSpear mechanism: GCD collapses the output space from the full vocabulary (code + natural-language refusals) down to code only, leaving refusal tokens with zero probability — CodeSpear attack mechanism: full vocabulary V* includes both code and refusal strings; GCD under grammar ℒ(G) collapses it to code only, silencing every refusal. 3

The numbers

The paper tested 10 models — five locally deployed (Qwen2.5-Coder-7B/32B, Qwen2.5-7B/32B, LLaMA3-8B) and five API-based (GPT-5, GPT-5-mini, MiniMax-M2.5, MiniMax-M2.7, GPT-OSS-120B) — against RMCBench (182 malicious requests across 10 malware categories) and MalwareBench (320 requests across 29 subcategories). 1

On local models, CodeSpear raised average Attack Success Rate (ASR) from 54.9% to 81.8% (+26.9 pp). Qwen2.5-Coder-7B on MalwareBench: 29.8% → 83.4%.
On API models, average ASR jumped from 22.0% to 67.4% (+45.4 pp). MiniMax-M2.7 on RMCBench: 20.3% → 85.5%.
GPT-5 and GPT-5-mini showed partial resistance by defaulting to pass statements — but tightening the grammar to disallow pass pushed GPT-5's ASR from 55.5% to 70.3%. 1

The ASR jump holds across Python, C++, and Java grammars. There's no safe programming language to hide behind.

ASR on RMCBench and MalwareBench for Qwen2.5-Coder-7B and Qwen2.5-7B: vanilla (gray) vs. Python, C++, Java grammars under CodeSpear — Attack success rate stays above 70% across all three grammar types, versus below 40% without GCD. 3

Why standard defenses don't help

Two common mitigations both fail here:

Safe-DPO (training the model to refuse in natural language) collapses under CodeSpear. On Qwen2.5-Coder-7B, Safe-DPO ASR under attack stays at 77.4% — barely below vanilla's 83.1%. The training teaches refusals that the grammar constraint then silences. 1

Circuit breakers fail because GCD also masks the EOS (end-of-sequence) token. The model can't stop generating, even if it "wants to." 2

This is why the paper authors write it's "a fundamental risk" — not a patchable bug in one framework, but a structural property of how grammar-constrained decoding works. 1

The one defense that works

The paper proposes CodeShield: DPO-based training that teaches the model a third behavior mode. Under normal conditions, the model prefers natural-language refusals. When GCD is active and refusals are grammatically impossible, the model falls back to generating "honeypot code" — syntactically valid, semantically harmless code randomly sampled from a general code corpus. 4

On Qwen2.5-Coder-7B, CodeShield drops ASR under CodeSpear from 83.1% to 5.6%. Coding utility barely moves: HumanEval pass@1 goes from 70.9% to 67.5%. The defense also holds against an adaptive adversary who iteratively tightens the grammar over 10 rounds — ASR on RMCBench actually decreases from 7.7% to 6.0%. 1

The structural diversity of honeypot code is what makes this robust: unlike a fixed safe pattern like pass, randomly sampled code can't be suppressed by grammar tightening. Yitong Zhang (lead author) explained the preference hierarchy on HuggingFace: "honeypot code only acts as a fallback when normal refusal is no longer expressible." 5

CodeShield training pipeline: three-tier DPO preference hierarchy — natural-language refusal preferred over honeypot code, honeypot code preferred over harmful code — teaches the model a safe fallback within the code grammar — CodeShield's DPO pipeline: the model learns natural-language refusal >> honeypot code >> harmful code, so the honeypot fallback activates only when GCD eliminates the first option. 3

CodeShield defense code is fully open-sourced under MIT license. 4 The attack code is restricted to authorized researchers.

3 PM actions

1. Audit which of your code features use GCD. Check whether your code generation endpoints use structured decoding — look for XGrammar in your vLLM/SGLang config, or check whether you pass response_format with a JSON Schema or grammar to OpenAI's API. Any feature that constrains output to code syntax is exposed, whether you intended it or not. Your threat model needs a "grammar constraints" entry alongside injection and prompt leakage.

2. Evaluate CodeShield against your specific model. The defense currently has results only on Qwen2.5 models (7B and 32B variants). If your stack runs a different base model, third-party validation doesn't exist yet. That's either an opportunity (run the evaluation yourself and contribute data) or a blocker — know which it is before shipping. The DPO training pipeline is in the open repo under Defense/. 4

3. For hosted API products, understand your refusal architecture. OpenAI's Structured Outputs routes refusals through a separate field outside the grammar-constrained output, which partially addresses this. If you're building on top of an API that doesn't have a separate refusal channel, that gap is now documented. Ask your API provider directly whether CodeSpear-class attacks are in their threat model.

The broader signal here comes from Travis Lelle's analysis: "LLM safety has been built almost entirely in the natural-language modality. As models increasingly operate through code, tool calls, structured outputs, and other constrained formats, the assumption that text-based alignment transfers to these modalities is proving wrong." 2

That's not a prediction — it's a description of where things stand today.

Cargando tarjeta de contenido…

56-second visual breakdown of the CodeSpear attack from Learn AI Visually — the fastest primer on the mechanism before your next sprint planning.

Cover image: from The Safety Feature That Becomes the Attack — Travis Lelle / LinkedIn Pulse