The benchmark

How it works

A reproducible ablation over layered detectors: each detector runs once per dataset, each combination is built as the span-union of its detectors' outputs and merged into the redaction mask that would be deployed, and the result is scored recall-first.


Figure: a stack of translucent detector layers over a page of text; each layer catches different kinds of marks as they pass through. Caption: layers compose by span-union; each catches what the others miss.
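Span-union composition can be sketched in a few lines. This is a minimal illustration, not the benchmark's own code: it assumes each detector returns half-open character offsets `(start, end)`, and the names `span_union` and `redact` are hypothetical.

```python
from typing import Iterable, List, Tuple

Span = Tuple[int, int]  # assumed convention: half-open character offsets [start, end)


def span_union(*layers: Iterable[Span]) -> List[Span]:
    """Merge spans from any number of detector layers into one sorted, non-overlapping mask."""
    spans = sorted(s for layer in layers for s in layer if s[1] > s[0])
    merged: List[Span] = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            # Overlapping or touching the previous span: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def redact(text: str, mask: List[Span], fill: str = "#") -> str:
    """Replace every masked character with a single-character filler."""
    out = list(text)
    for start, end in mask:
        out[start:end] = fill * (end - start)
    return "".join(out)


# Illustrative spans only; no real detector was run to produce them.
regex_spans = [(0, 5), (20, 25)]
ner_spans = [(3, 9), (30, 34)]
llm_spans = [(9, 12)]
mask = span_union(regex_spans, ner_spans, llm_spans)
```

Because the union only ever adds characters to the mask, adding a layer can raise recall but never lower it; the cost shows up as extra over-redaction.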

Datasets

Four slices. Every therapy transcript is synthetic/fictional; the one real-text anchor (RU-real) is external, anonymized, non-clinical public data.

dataset	type	docs	gold spans	what it is
RU-synth	synthetic	30	1076	Synthetic Russian therapy series (6 fictional clients, 30 sessions). Entity-level direct/quasi labels.
RU-adversarial	synthetic	16	20	Robustness probe: patronymics, transliteration, diminutives, VK/Telegram handles, Russian national-ID numbers (SNILS / INN / passport), code-switching.
EN-synth	synthetic	32	46	Curated English therapy-style snippets. Mention-level coverage (no per-entity ids).
RU-real (JayGuard)	real	60	77	External, anonymized, real-but-non-clinical Russian text (Apache-2.0). PERSON/LOCATION only, machine-derived gold.

The detector layers

The benchmark composes these by span-union. The proposed default (★) trades a little recall for large speed/precision gains — and runs entirely locally.

regex

Bilingual pattern rules — phones, emails, URLs, structured IDs, numeric & relative/colloquial dates.

Natasha

Russian NER (Cyrillic-only) — names, locations, orgs.

OpenAI Privacy Filter

English token-classification model — the EN name/address backbone.

Local LLM — Qwen2.5-3B ★ / Gemma3 / Gemma4

Published local LLM baseline (Qwen2.5-3B via Ollama, temp 0) — the layer that lifts world-knowledge quasi-identifiers (medication, profession, spelled-out age). Gemma3 and Gemma4 swaps of this layer beat the Qwen baseline on every full short slice and are reported as exploratory candidates, not promoted to ★ defaults yet.

GLiNER-multi PII

Zero-shot multilingual NER (gliner_multi_pii-v1, Apache-2.0, ~1.2 GB) — deterministic and fully local, works on Russian and English. Exploratory candidate layer alongside the LLM.
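To make the simplest layer concrete, here is a rough sketch of what a regex layer can look like. The patterns below are deliberately naive and entirely hypothetical (the benchmark's real rules are not reproduced in this text); `PATTERNS` and `regex_layer` are illustrative names. It covers only emails, URLs, and phone-like digit runs, not structured IDs or dates.

```python
import re
from typing import List, Tuple

# Hypothetical, simplified patterns for illustration only.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "URL": re.compile(r"https?://\S+"),
    # A leading "+" is optional; digits may be separated by spaces, brackets, or hyphens.
    "PHONE": re.compile(r"\+?\d[\d\s()-]{8,}\d"),
}


def regex_layer(text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, label) for every pattern match, sorted by position."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


sample = "Пишите на anna@example.com или звоните +7 912 345-67-89."
found = regex_layer(sample)
```

Dropping the labels gives plain `(start, end)` spans, which is all a span-union step needs from this layer.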

Proposed defaults: natasha + regex + qwen2.5:3b ★ for Russian, opf + regex + qwen2.5:3b ★ for English (opf = OpenAI Privacy Filter).

Gemma3/Gemma4 are candidates, not defaults (◇). They beat Qwen in single runs, and the cloud Gemma4 has now passed the run-variance gate (5 replicates, stable at temp 0). Three items are still open before any ★ changes: the long-Russian slice, the confidence-interval comparison against Qwen, and a preregistration amendment.

Presidio & Philter are not candidates: they are established third-party systems shown as reference baselines, English-only by construction (Presidio runs an English spaCy model; Philter implements US HIPAA rules).

Exploratory model swaps — Gemma vs the Qwen baseline

Same stack, different LLM layer: each row swaps only the model behind the ★ default. Gemma beats Qwen on every full short slice: local Gemma4 12B-MLX is strongest on the Russian sets, and the cloud Gemma4 26B-A4B probe (synthetic text only) leads the English stack. GLiNER-multi PII rows are included for comparison as a zero-shot NER candidate. The ★ defaults stay Qwen until the remaining variance and promotion gates pass.

dataset	model	cov R	cov F2	ent R
RU-synth long	Qwen2.5-3B ★	0.875	0.802	0.726
	Gemma3	0.954	0.362	0.850
	GLiNER-multi PII (zero-shot NER)	0.980	0.407	0.885
RU-adversarial	Qwen2.5-3B ★	0.950	0.887	0.950
	Gemma3	1.000	0.917	1.000
	Gemma4 12B-MLX	1.000	0.948	1.000
	Gemma4 26B-A4B (HF cloud)	1.000	0.943	1.000
	GLiNER-multi PII (zero-shot NER)	1.000	0.826	1.000
RU-real	Qwen2.5-3B ★	0.792	0.614	0.792
	Gemma3	0.961	0.730	0.961
	Gemma4 12B-MLX	1.000	0.820	1.000
EN-synth	Qwen2.5-3B ★	0.978	0.870	—
	Gemma3	1.000	0.904	—
	Gemma4 12B-MLX	1.000	0.955	—
	Gemma4 26B-A4B (HF cloud)	1.000	0.962	—
	GLiNER-multi PII (zero-shot NER)	0.957	0.782	—

Gemma3 on the long RU slice trades precision for recall (cov F2 0.362 — heavy over-redaction from chunked prompting); it is included as a recall/noise tradeoff, not a candidate default. Missing rows are omitted, not scored as zero. Full tables and caveats in the interactive report.
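The size of that precision trade can be estimated from the two published numbers. The sketch below inverts the standard F-beta formula to recover the precision implied by a given recall and F2. It assumes cov F2 is computed from cov R and a single matching precision value; if the report averages per document, the result is only approximate. `implied_precision` is a hypothetical helper name.

```python
def implied_precision(f_score: float, recall: float, beta: float = 2.0) -> float:
    """Solve F = (1 + b^2) * P * R / (b^2 * P + R) for P."""
    b2 = beta * beta
    denominator = (1 + b2) * recall - b2 * f_score
    if denominator <= 0:
        raise ValueError("no valid precision exists for this F-score/recall pair")
    return f_score * recall / denominator


# Gemma3 on the long RU slice: cov R 0.954, cov F2 0.362 (values from the table above).
gemma3_long_ru = implied_precision(0.362, 0.954)
```

Under that assumption the implied precision comes out at roughly 0.10, which matches the "heavy over-redaction" reading.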

Reading the metrics

column	what it means	direction
`cov R`	Mask-coverage recall (relaxed, ≥1-char overlap): did the mask touch the PII span at all? A miss = leaked PII.	↑ higher is better
`cov F2`	Recall-weighted F2 (β=2): recall counts 2× precision, because a miss leaks while a false positive only over-redacts.	↑ higher is better
`ent R`	Entity-level recall (TAB): protected only if ALL mentions are masked — one recurrence is a leak.	↑ higher is better
`direct`	Entity recall for direct identifiers (name, phone, email, policy/ID).	↑ higher is better
`quasi`	Entity recall for quasi-identifiers (age, profession, city, employer, medication, date).	↑ higher is better
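The two mask-level columns can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's scorer: spans are half-open character offsets, and precision is assumed to be the share of predicted spans that touch at least one gold span (the table above defines only the recall side). All function names are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # assumed convention: half-open character offsets [start, end)


def overlaps(a: Span, b: Span) -> bool:
    """True if the two spans share at least one character."""
    return max(a[0], b[0]) < min(a[1], b[1])


def coverage_recall(gold: List[Span], mask: List[Span]) -> float:
    """Share of gold PII spans touched by the mask (relaxed, >= 1-char overlap)."""
    if not gold:
        return 1.0
    return sum(any(overlaps(g, m) for m in mask) for g in gold) / len(gold)


def coverage_precision(gold: List[Span], mask: List[Span]) -> float:
    """Assumption: a predicted span counts as correct if it touches any gold span."""
    if not mask:
        return 1.0
    return sum(any(overlaps(m, g) for g in gold) for m in mask) / len(mask)


def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """Standard F-beta; beta=2 weights recall more heavily than precision."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    return (1 + b2) * precision * recall / denominator if denominator else 0.0


# Toy example: two of three gold spans are touched, two of four predictions are useful.
gold = [(0, 4), (10, 15), (20, 30)]
mask = [(2, 6), (12, 13), (40, 45), (50, 55)]
recall = coverage_recall(gold, mask)
precision = coverage_precision(gold, mask)
f2 = f_beta(precision, recall)
```

In the toy example recall is 2/3 and precision is 1/2, giving an F2 of 0.625, closer to recall than the balanced F1 would be.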

Type-aware micro/macro-F1 (i2b2) are also reported. Numbers are mention-level unless marked entity-level.
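The entity-level columns (`ent R`, `direct`, `quasi`) differ from the mention-level ones in that a single unmasked recurrence fails the whole entity. A minimal sketch, assuming each gold mention carries an entity id and a direct/quasi tag, and assuming the same relaxed overlap test is used for "masked"; the `Mention` layout and `entity_recall` are hypothetical.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]
Mention = Tuple[str, str, int, int]  # hypothetical layout: (entity_id, kind, start, end)


def entity_recall(mentions: List[Mention], mask: List[Span]) -> Dict[str, float]:
    """An entity is protected only if every one of its mentions is touched by the mask."""
    protected: Dict[str, bool] = {}
    kinds: Dict[str, str] = {}
    for entity_id, kind, start, end in mentions:
        covered = any(max(start, s) < min(end, e) for s, e in mask)
        protected[entity_id] = protected.get(entity_id, True) and covered
        kinds[entity_id] = kind
    result: Dict[str, float] = {}
    for group in ("all", "direct", "quasi"):
        ids = [i for i in protected if group == "all" or kinds[i] == group]
        if ids:
            result[group] = sum(protected[i] for i in ids) / len(ids)
    return result


# Toy example: entity e1 (a name) appears twice, and the second mention is missed.
mentions = [
    ("e1", "direct", 0, 4),
    ("e1", "direct", 50, 54),
    ("e2", "quasi", 10, 15),
    ("e3", "quasi", 20, 30),
]
scores = entity_recall(mentions, [(0, 4), (10, 15), (20, 30)])
```

Here three of four mentions are masked, yet direct-identifier recall is zero because the one direct entity leaks through its second mention.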

Beyond detection

Residual risk — what survives

Detection metrics measure what we catch; regulators care what survives. The report maps the default stack onto named risks:

Reconstruction. ~31% of quasi-identifiers survive the default RU stack (both clients); a local 3B attacker recovers attributes from the redacted text — a lower bound, since a frontier model would recover more.
Privacy vs. utility. The same masking keeps the clinical signal usable — the report shows distortion-extraction preserved alongside a char-level non-PII floor.
Regulatory tier. Under the WP29 triad (singling-out / linkability / inference) + a HIPAA-inspired checklist, the RU default lands RED (a direct identifier still leaks at entity level), English AMBER.

Full numbers, charts, and citations in the interactive report →