The benchmark
How it works
A reproducible, layered-detector ablation: each detector runs once per dataset, detector combinations are formed as span unions merged into the deployed redaction mask, and each combination is then scored recall-first.
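The span-union step can be sketched as follows. This is a minimal illustration, not the benchmark's own code: the names `union_spans` and `apply_mask`, the half-open character-offset convention, and the sample spans are all assumptions made for the example.

```python
from typing import Iterable

Span = tuple[int, int]  # assumed convention: half-open character offsets [start, end)


def union_spans(*layers: Iterable[Span]) -> list[Span]:
    """Merge spans from any number of detector layers into a sorted,
    non-overlapping mask. Overlapping or touching spans are fused."""
    spans = sorted(s for layer in layers for s in layer if s[1] > s[0])
    merged: list[Span] = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            # Overlaps or abuts the previous span: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def apply_mask(text: str, mask: list[Span], fill: str = "#") -> str:
    """Replace every masked character with a fill character (length-preserving)."""
    chars = list(text)
    for start, end in mask:
        for i in range(start, min(end, len(chars))):
            chars[i] = fill
    return "".join(chars)


# Hypothetical cached outputs of two layers on one document.
text = "Call Anna at 555-0101 tomorrow."
regex_spans = [(13, 21)]          # the phone number
ner_spans = [(5, 9), (13, 16)]    # the name, plus a partial hit on the phone
mask = union_spans(regex_spans, ner_spans)
redacted = apply_mask(text, mask)
print(mask)
print(redacted)
```

Because each layer's spans are computed once and cached, any combination of layers can be evaluated by re-running only the cheap union step, which is what makes a full ablation over combinations practical.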
Datasets
Four slices. Every therapy transcript is synthetic/fictional; the one real-text anchor (RU-real) is external, anonymized, non-clinical public data.
| dataset | docs | gold spans | what it is |
|---|---|---|---|
| RU-synth (synthetic) | 30 | 1076 | Synthetic Russian therapy series (6 fictional clients, 30 sessions). Entity-level direct/quasi labels. |
| RU-adversarial (synthetic) | 16 | 20 | Robustness probe: patronymics, transliteration, diminutives, VK/Telegram handles, Russian national-ID numbers (SNILS / INN / passport), code-switching. |
| EN-synth (synthetic) | 32 | 46 | Curated English therapy-style snippets. Mention-level coverage (no per-entity ids). |
| RU-real (JayGuard, real) | 60 | 77 | External, anonymized, real-but-non-clinical Russian text (Apache-2.0). PERSON/LOCATION only, machine-derived gold. |
The detector layers
The benchmark composes these by span-union. The proposed default (★) trades a little recall for large speed/precision gains and runs entirely locally.
regex
Bilingual pattern rules: phones, emails, URLs, structured IDs, numeric & relative/colloquial dates.
Natasha
Russian NER (Cyrillic-only): names, locations, orgs.
OpenAI Privacy Filter
English token-classification model: the EN name/address backbone.
Local LLM: Qwen2.5-3B ★ / Gemma3 / Gemma4
Published local LLM baseline (Qwen2.5-3B via Ollama, temp 0): the layer that lifts world-knowledge quasi-identifiers (medication, profession, spelled-out age). Gemma3 and Gemma4 swaps of this layer beat the Qwen baseline on every full short slice and are reported as exploratory candidates, not yet promoted to ★ defaults.
GLiNER-multi PII
Zero-shot multilingual NER (gliner_multi_pii-v1, Apache-2.0, ~1.2 GB): deterministic and fully local, works on Russian and English. Exploratory candidate layer alongside the LLM.
Proposed defaults: natasha + regex + qwen2.5:3b ★ for Russian, opf + regex + qwen2.5:3b ★ for English. Gemma3/Gemma4 are candidates, not ★ defaults: they beat Qwen in single runs, and the cloud Gemma4 has now passed the run-variance gate (5 replicates, stable at temp 0). However, the long-Russian slice, the confidence-interval comparison against Qwen, and a preregistration amendment are still open before any ★ default changes. Presidio & Philter are not candidates: they are established third-party systems shown as reference baselines, English-only by construction (Presidio runs an English spaCy model; Philter implements US HIPAA rules).
Exploratory model swaps: Gemma vs the Qwen baseline
Same stack, different LLM layer: each row swaps only the model behind the ★ default. Gemma beats Qwen on every full short slice: local Gemma4 12B-MLX is strongest on the Russian sets, and the cloud Gemma4 26B-A4B probe (synthetic text only) leads the English stack. The ★ defaults stay Qwen until variance and promotion gates pass.
| dataset | model | cov R | cov F2 | ent R |
|---|---|---|---|---|
| RU-synth long | Qwen2.5-3B ★ | 0.875 | 0.802 | 0.726 |
| RU-synth long | Gemma3 | 0.954 | 0.362 | 0.850 |
| RU-synth long | GLiNER-multi PII (zero-shot NER) | 0.980 | 0.407 | 0.885 |
| RU-adv | Qwen2.5-3B ★ | 0.950 | 0.887 | 0.950 |
| RU-adv | Gemma3 | 1.000 | 0.917 | 1.000 |
| RU-adv | Gemma4 12B-MLX | 1.000 | 0.948 | 1.000 |
| RU-adv | Gemma4 26B-A4B (HF cloud) | 1.000 | 0.943 | 1.000 |
| RU-adv | GLiNER-multi PII (zero-shot NER) | 1.000 | 0.826 | 1.000 |
| RU-real | Qwen2.5-3B ★ | 0.792 | 0.614 | 0.792 |
| RU-real | Gemma3 | 0.961 | 0.730 | 0.961 |
| RU-real | Gemma4 12B-MLX | 1.000 | 0.820 | 1.000 |
| EN-synth | Qwen2.5-3B ★ | 0.978 | 0.870 | n/a |
| EN-synth | Gemma3 | 1.000 | 0.904 | n/a |
| EN-synth | Gemma4 12B-MLX | 1.000 | 0.955 | n/a |
| EN-synth | Gemma4 26B-A4B (HF cloud) | 1.000 | 0.962 | n/a |
| EN-synth | GLiNER-multi PII (zero-shot NER) | 0.957 | 0.782 | n/a |
Gemma3 on the long RU slice trades precision for recall (cov F2 0.362: heavy over-redaction from chunked prompting); it is included as a recall/noise tradeoff, not a candidate default. Missing rows are omitted, not scored as zero. Full tables and caveats in the interactive report.
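To see how much over-redaction a low cov F2 implies, the F-beta formula can be inverted for precision. The sketch below assumes that cov F2 was computed from the reported cov R and a single matching precision figure at the same granularity. If the report averages per document or per type, the implied value is only approximate. The function name `precision_from_fbeta` is hypothetical.

```python
def precision_from_fbeta(f: float, recall: float, beta: float = 2.0) -> float:
    """Invert F_beta = (1 + b^2) * P * R / (b^2 * P + R) to recover P."""
    b2 = beta * beta
    denom = (1 + b2) * recall - b2 * f
    if denom <= 0:
        raise ValueError("no valid precision for this (F, recall) pair")
    return f * recall / denom


# Figures quoted for the long RU slice (cov F2, cov R).
gemma3_p = precision_from_fbeta(0.362, 0.954)
qwen_p = precision_from_fbeta(0.802, 0.875)
print(f"Gemma3 implied precision ~ {gemma3_p:.2f}")
print(f"Qwen   implied precision ~ {qwen_p:.2f}")
```

Under that assumption, Gemma3's implied precision on this slice is about 0.10 against about 0.60 for Qwen, which is consistent with the "heavy over-redaction" reading.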
Reading the metrics
| column | what it means | direction |
|---|---|---|
| cov R | Mask-coverage recall (relaxed, ≥1-char overlap): did the mask touch the PII span at all? A miss = leaked PII. | ↑ higher is better |
| cov F2 | Recall-weighted F2 (β=2): recall counts 2× precision, because a miss leaks while a false positive only over-redacts. | ↑ higher is better |
| ent R | Entity-level recall (TAB): protected only if ALL mentions are masked; one recurrence is a leak. | ↑ higher is better |
| direct | Entity recall for direct identifiers (name, phone, email, policy/ID). | ↑ higher is better |
| quasi | Entity recall for quasi-identifiers (age, profession, city, employer, medication, date). | ↑ higher is better |
Type-aware micro/macro-F1 (i2b2) are also reported. Numbers are mention-level unless marked entity-level.
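The three headline columns can be sketched in a few lines. This is an illustration under stated assumptions rather than the benchmark's scoring code: spans are half-open character offsets, the ≥1-char overlap test is reused for entity-level scoring, precision is passed in as a number because its exact definition is not given here, and the entity ids and sample spans are hypothetical.

```python
from collections import defaultdict

Span = tuple[int, int]  # assumed convention: half-open character offsets [start, end)


def overlaps(a: Span, b: Span) -> bool:
    """True when the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def coverage_recall(gold: list[Span], mask: list[Span]) -> float:
    """Share of gold spans touched by the mask (relaxed, >=1-char overlap)."""
    if not gold:
        return 1.0
    return sum(any(overlaps(g, m) for m in mask) for g in gold) / len(gold)


def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """F_beta; beta=2 treats recall as twice as important as precision."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


def entity_recall(gold_mentions: list[tuple[str, Span]], mask: list[Span]) -> float:
    """An entity counts as protected only if every one of its mentions is masked."""
    by_entity: dict[str, list[Span]] = defaultdict(list)
    for entity_id, span in gold_mentions:
        by_entity[entity_id].append(span)
    if not by_entity:
        return 1.0
    protected = sum(
        all(any(overlaps(s, m) for m in mask) for s in spans)
        for spans in by_entity.values()
    )
    return protected / len(by_entity)


# Hypothetical document: "anna" is mentioned twice, but only once is masked.
gold_mentions = [("anna", (5, 9)), ("phone", (13, 21)), ("anna", (30, 34))]
mask = [(5, 9), (13, 21)]
cov_r = coverage_recall([span for _, span in gold_mentions], mask)
ent_r = entity_recall(gold_mentions, mask)
f2 = f_beta(precision=1.0, recall=cov_r)
print(cov_r, ent_r, f2)
```

In this toy case two of three mentions are covered, yet only one of two entities is protected, which shows why ent R can sit well below cov R on data with recurring names.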
Beyond detection
Residual risk: what survives
Detection metrics measure what we catch; regulators care what survives. The report maps the default stack onto named risks:
- Reconstruction. ~31% of quasi-identifiers survive the default RU stack (both clients); a local 3B attacker recovers attributes from the redacted text. This is a lower bound, since a frontier model would recover more.
- Privacy vs. utility. The same masking keeps the clinical signal usable: the report shows distortion-extraction preserved alongside a char-level non-PII floor.
- Regulatory tier. Under the WP29 triad (singling-out / linkability / inference) + a HIPAA-inspired checklist, the RU default lands RED (a direct identifier still leaks at entity level), English AMBER.
Redacting direct identifiers is necessary but not sufficient. Quasi-identifier survival is the gate worth checking before any session goes to cloud analysis, and the reason the gold needs human, multi-annotator labelling.
We surface the RED tier deliberately: de-identification is harm-reduction, not anonymization. It never replaces explicit client consent and local-only handling for real recordings. See Ethics & consent for our full position.
Full numbers, charts, and citations are in the interactive report.