The benchmark

How it works

A reproducible, layered-detector ablation: each detector runs once per dataset; combinations are formed as span-unions, merged into the deployed redaction mask, and then scored recall-first.

Open the interactive report | Russian version

A stack of translucent detector layers over a page of text; each layer catches different kinds of marks as they pass through.
Layers compose by span-union; each catches what the others miss.
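
As a rough illustration of span-union composition, the sketch below merges the cached output of each detector into one mask per detector combination. The function names, the half-open `(start, end)` character-offset convention, and the sample spans are assumptions made for this example; they are not taken from the benchmark's code.

```python
from itertools import combinations

Span = tuple[int, int]  # assumed convention: half-open character offsets [start, end)


def merge_spans(spans):
    """Union overlapping or touching spans into a sorted, disjoint mask."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            # Extend the previous span instead of opening a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def ablate(cached):
    """Build one merged mask per non-empty detector combination.

    `cached` maps a detector name to the spans from that detector's single
    run, so no detector is re-run while combinations are enumerated.
    """
    names = sorted(cached)
    masks = {}
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            union = [span for name in combo for span in cached[name]]
            masks[combo] = merge_spans(union)
    return masks


# Hypothetical detector outputs for one document.
cached = {
    "regex": [(10, 22)],
    "natasha": [(0, 5), (20, 30)],
}
masks = ablate(cached)
```

Here `masks[("natasha", "regex")]` holds the union mask for the two-layer stack, while the single-layer entries keep each detector's own spans for comparison.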

Datasets

Four slices. Every therapy transcript is synthetic/fictional; the one real-text anchor (RU-real) is external, anonymized, non-clinical public data.

dataset | type | docs | gold spans | what it is
RU-synth | synthetic | 30 | 1076 | Synthetic Russian therapy series (6 fictional clients, 30 sessions). Entity-level direct/quasi labels.
RU-adversarial | synthetic | 16 | 20 | Robustness probe: patronymics, transliteration, diminutives, VK/Telegram handles, Russian national-ID numbers (SNILS / INN / passport), code-switching.
EN-synth | synthetic | 32 | 46 | Curated English therapy-style snippets. Mention-level coverage (no per-entity ids).
RU-real (JayGuard) | real | 60 | 77 | External, anonymized, real-but-non-clinical Russian text (Apache-2.0). PERSON/LOCATION only, machine-derived gold.

The detector layers

The benchmark composes these by span-union. The proposed default (★) trades a little recall for large speed/precision gains, and runs entirely locally.

regex

Bilingual pattern rules: phones, emails, URLs, structured IDs, numeric & relative/colloquial dates.
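
To make the idea of a pattern-rule layer concrete, here is a minimal sketch that returns labelled character spans. The patterns are deliberately simplified and purely illustrative; the benchmark's actual rules are not shown in this document, and the labels and function name are assumptions.

```python
import re

# Illustrative patterns only (assumed for this example, not the benchmark's rules).
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "url": re.compile(r"https?://\S+"),
    # Russian-style mobile number such as "+7 (912) 345-67-89".
    "phone_ru": re.compile(r"(?:\+7|8)[\s(-]*\d{3}[\s)-]*\d{3}[\s-]?\d{2}[\s-]?\d{2}"),
    # Numeric dates such as "03.05.2024" or "3/5/24".
    "date_numeric": re.compile(r"\b\d{1,2}[./]\d{1,2}[./]\d{2,4}\b"),
}


def regex_spans(text):
    """Return sorted (start, end, label) triples for every pattern match."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.end(), label))
    return sorted(found)


text = "Call +7 (912) 345-67-89 or write to anna@example.com before 03.05.2024."
for start, end, label in regex_spans(text):
    print(label, repr(text[start:end]))
```

Relative or colloquial dates ("last Tuesday") would need word lists per language rather than digit patterns, which is why they are left out of this sketch.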

Natasha

Russian NER (Cyrillic-only): names, locations, orgs.

OpenAI Privacy Filter

English token-classification model: the EN name/address backbone.

Local LLM: Qwen2.5-3B ★ / Gemma3 / Gemma4

Published local LLM baseline (Qwen2.5-3B via Ollama, temp 0): the layer that lifts world-knowledge quasi-identifiers (medication, profession, spelled-out age). Gemma3 and Gemma4 swaps of this layer beat the Qwen baseline on every full short slice and are reported as exploratory candidates, not promoted to ★ defaults yet.

GLiNER-multi PII

Zero-shot multilingual NER (gliner_multi_pii-v1, Apache-2.0, ~1.2 GB): deterministic and fully local, works on Russian and English. Exploratory candidate layer alongside the LLM.

Proposed defaults: natasha + regex + qwen2.5:3b ★ for Russian, opf + regex + qwen2.5:3b ★ for English. Gemma3/Gemma4 are candidates, not defaults (◇): they beat Qwen in single runs, and the cloud Gemma4 has now passed the run-variance gate (5 replicates, stable at temp 0). Three items are still open before any ★ changes: the long-Russian slice, the confidence-interval comparison against Qwen, and a preregistration amendment. Presidio & Philter are not candidates: they are established third-party systems shown as reference baselines, English-only by construction (Presidio runs an English spaCy model; Philter implements US HIPAA rules).
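
The proposed defaults and layer statuses can be written down as a small lookup. The layer names and statuses below follow the text; the dictionary layout, the language codes, and the `layers_for` helper are hypothetical and only sketch one way to encode them.

```python
# Layer names come from the text; this layout is an assumed encoding.
DEFAULT_STACKS = {
    "ru": ("natasha", "regex", "qwen2.5:3b"),
    "en": ("opf", "regex", "qwen2.5:3b"),
}

# "default" = starred, "candidate" = exploratory, "reference" = third-party baseline.
LAYER_STATUS = {
    "qwen2.5:3b": "default",
    "gemma3": "candidate",
    "gemma4": "candidate",
    "gliner_multi_pii-v1": "candidate",
    "presidio": "reference",
    "philter": "reference",
}


def layers_for(lang):
    """Return the default detector stack for a language code ("ru" or "en")."""
    try:
        return DEFAULT_STACKS[lang]
    except KeyError:
        raise ValueError(f"no default stack for language {lang!r}") from None
```

Keeping status separate from the stacks means a candidate can be scored in an ablation without changing what is deployed by default.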

Exploratory model swaps: Gemma vs the Qwen baseline

Same stack, different LLM layer: each row swaps only the model behind the ★ default. Gemma beats Qwen on every full short slice: local Gemma4 12B-MLX is strongest on the Russian sets, and the cloud Gemma4 26B-A4B probe (synthetic text only) leads the English stack. The ★ defaults stay Qwen until the remaining variance and promotion gates pass.

dataset | model | cov R | cov F2 | ent R
RU-synth long | Qwen2.5-3B ★ | 0.875 | 0.802 | 0.726
RU-synth long | Gemma3 | 0.954 | 0.362 | 0.850
RU-synth long | GLiNER-multi PII (zero-shot NER) | 0.980 | 0.407 | 0.885
RU-adv | Qwen2.5-3B ★ | 0.950 | 0.887 | 0.950
RU-adv | Gemma3 | 1.000 | 0.917 | 1.000
RU-adv | Gemma4 12B-MLX | 1.000 | 0.948 | 1.000
RU-adv | Gemma4 26B-A4B (HF cloud) | 1.000 | 0.943 | 1.000
RU-adv | GLiNER-multi PII (zero-shot NER) | 1.000 | 0.826 | 1.000
RU-real | Qwen2.5-3B ★ | 0.792 | 0.614 | 0.792
RU-real | Gemma3 | 0.961 | 0.730 | 0.961
RU-real | Gemma4 12B-MLX | 1.000 | 0.820 | 1.000
EN-synth | Qwen2.5-3B ★ | 0.978 | 0.870 | n/a
EN-synth | Gemma3 | 1.000 | 0.904 | n/a
EN-synth | Gemma4 12B-MLX | 1.000 | 0.955 | n/a
EN-synth | Gemma4 26B-A4B (HF cloud) | 1.000 | 0.962 | n/a
EN-synth | GLiNER-multi PII (zero-shot NER) | 0.957 | 0.782 | n/a

Gemma3 on the long RU slice trades precision for recall (cov F2 0.362, heavy over-redaction from chunked prompting); it is included as a recall/noise tradeoff, not a candidate default. Missing rows are omitted, not scored as zero. Full tables and caveats in the interactive report.

Reading the metrics

column | what it means | direction
cov R | Mask-coverage recall (relaxed, ≥1-char overlap): did the mask touch the PII span at all? A miss = leaked PII. | ↑ higher is better
cov F2 | Recall-weighted F2 (β=2): recall counts 2× precision, because a miss leaks while a false positive only over-redacts. | ↑ higher is better
ent R | Entity-level recall (TAB): protected only if ALL mentions are masked; one recurrence is a leak. | ↑ higher is better
direct | Entity recall for direct identifiers (name, phone, email, policy/ID). | ↑ higher is better
quasi | Entity recall for quasi-identifiers (age, profession, city, employer, medication, date). | ↑ higher is better
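
The coverage metrics can be sketched as follows. Recall uses the relaxed rule from the table (a gold span counts as covered if the mask overlaps it by at least one character). The table does not define precision, so this sketch assumes the mirror-image rule: a mask span counts as useful if it overlaps any gold span. The function names and sample spans are illustrative.

```python
def overlaps(a, b):
    """True if half-open spans a and b share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def coverage_scores(gold, mask, beta=2.0):
    """Return (recall, precision, F-beta) under relaxed span overlap.

    Precision here is an assumption: the share of mask spans that touch
    some gold span. The report may define it differently.
    """
    covered = sum(any(overlaps(g, m) for m in mask) for g in gold)
    useful = sum(any(overlaps(m, g) for g in gold) for m in mask)
    recall = covered / len(gold) if gold else 0.0
    precision = useful / len(mask) if mask else 0.0
    denom = beta**2 * precision + recall
    f_beta = (1 + beta**2) * precision * recall / denom if denom else 0.0
    return recall, precision, f_beta


# Hypothetical spans: two of four gold spans are touched, one mask span is spurious.
gold = [(0, 5), (10, 22), (40, 48), (60, 64)]
mask = [(0, 5), (12, 30), (70, 75)]
recall, precision, f2 = coverage_scores(gold, mask)
```

With β = 2 the score sits closer to recall than to precision, which matches the recall-first framing: in this example recall is 0.5, precision is 2/3, and F2 is 10/19, nearer the lower recall value.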

Type-aware micro/macro-F1 (i2b2) are also reported. Numbers are mention-level unless marked entity-level.
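
Entity-level recall differs from the mention-level numbers in that one unmasked recurrence makes the whole entity count as leaked. A minimal sketch, assuming each gold mention carries an entity id and a direct/quasi kind, and that "masked" again means at least one character of overlap; the tuple layout and names are hypothetical.

```python
def entity_recall(mentions, mask):
    """Entity-level recall, overall and split by identifier kind.

    `mentions` is a list of (entity_id, kind, start, end) with kind in
    {"direct", "quasi"}. An entity is protected only if every one of its
    mentions is touched by the mask.
    """
    protected = {}
    kind_of = {}
    for entity_id, kind, start, end in mentions:
        hit = any(start < m_end and m_start < end for m_start, m_end in mask)
        protected[entity_id] = protected.get(entity_id, True) and hit
        kind_of[entity_id] = kind

    def rate(kind=None):
        ids = [i for i, k in kind_of.items() if kind is None or k == kind]
        return sum(protected[i] for i in ids) / len(ids) if ids else None

    return {"all": rate(), "direct": rate("direct"), "quasi": rate("quasi")}


# Hypothetical gold: the name recurs at offset 50 and that recurrence is missed.
mentions = [
    ("name:anna", "direct", 0, 4),
    ("name:anna", "direct", 50, 54),
    ("city:kazan", "quasi", 20, 26),
    ("job:nurse", "quasi", 30, 35),
]
mask = [(0, 4), (20, 26)]
scores = entity_recall(mentions, mask)
```

In this example the name is masked once but leaks on its second mention, so direct recall is 0.0 even though half of the name mentions were caught.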

Beyond detection

Residual risk: what survives

Detection metrics measure what we catch; regulators care what survives. The report maps the default stack onto named risks:

  • Reconstruction. ~31% of quasi-identifiers survive the default RU stack (both clients); a local 3B attacker recovers attributes from the redacted text. This is a lower bound, since a frontier model would recover more.
  • Privacy vs. utility. The same masking keeps the clinical signal usable: the report shows distortion-extraction preserved alongside a char-level non-PII floor.
  • Regulatory tier. Under the WP29 triad (singling-out / linkability / inference) plus a HIPAA-inspired checklist, the RU default lands RED (a direct identifier still leaks at entity level) and the English default lands AMBER.

The takeaway

Redacting direct identifiers is necessary but not sufficient. Quasi-identifier survival is the gate worth checking before any session goes to cloud analysis, and the reason the gold needs human, multi-annotator labelling.

We surface the RED tier deliberately: de-identification is harm-reduction, not anonymization. It never replaces explicit client consent and local-only handling for real recordings. See Ethics & consent for our full position.

Full numbers, charts, and citations in the interactive report →