CONFIDE-Bench — De-identification Layer Benchmark

CONFIDE · github.com/glebis/confide · by Gleb Kalinin & CONFIDE contributors · released for research & teaching under the repository license. Therapy transcripts are synthetic/fictional — no real patient data; the one real-text slice (RU-real / JayGuard) is external, anonymized, non-clinical public data.

LLM detector layer (ollama) & local attacker: Qwen2.5-3B-Instruct (qwen2.5:3b) via Ollama, temperature 0. Deterministic layers: Natasha (Russian NER), a bilingual regex layer, and the OpenAI Privacy Filter (English).

De-identification is not one tool but a stack of detectors, and the honest question is which layer pays for the CPU it burns. This benchmark composes detector layers by span-union over psychotherapy transcripts in Russian and English, scores each combination the way published de-id work does — recall-first, entity-level, direct vs quasi — and asks a sharper question than “how good is the tool”: what can only an LLM catch, and what still leaks after we redact?
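Span-union composition can be pictured with a minimal sketch. It assumes each detector layer returns half-open character offsets; `find_span`, `union_spans`, `redact`, the sample sentence, and the per-layer outputs are hypothetical illustrations, not CONFIDE's actual code.

```python
from typing import Iterable

Span = tuple[int, int]  # half-open character offsets [start, end)


def find_span(text: str, needle: str) -> Span:
    """Locate the first occurrence of needle (stand-in for a real detector)."""
    start = text.index(needle)
    return (start, start + len(needle))


def union_spans(layers: Iterable[Iterable[Span]]) -> list[Span]:
    """Pool spans from every layer and merge any that overlap or touch."""
    spans = sorted(s for layer in layers for s in layer if s[1] > s[0])
    merged: list[Span] = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def redact(text: str, spans: list[Span], mask: str = "[REDACTED]") -> str:
    """Replace each merged span with a mask token."""
    out, cursor = [], 0
    for start, end in spans:
        out.append(text[cursor:start])
        out.append(mask)
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


# Hypothetical sentence and hypothetical per-layer detections.
text = "Anna, 34, takes sertraline; call +7 900 123-45-67."
ner_layer = [find_span(text, "Anna")]
regex_layer = [find_span(text, "+7 900 123-45-67")]
llm_layer = [find_span(text, s) for s in ("34", "sertraline", "Anna")]

merged = union_spans([ner_layer, regex_layer, llm_layer])
print(redact(text, merged))
```

In this toy case the age and medication are masked only because the LLM-like layer contributed them, while the name is found by two layers and merged into one span.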

1. Which layer catches what

The LLM layer is what moves age, medication, and profession above the deterministic baseline.

2. Best sequence per language

Bars show coverage recall; ★ is the proposed default, which trades a little recall for large speed/precision gains — not always the single highest bar.

RU-synth long

◇ marks an exploratory swap in the ★ stack (Gemma LLM or GLiNER NER layer), scored from a separate detector cache and not a promoted default (variance and promotion gates pending; see the model comparison section).

combo	cov R↑	cov F2↑	ent R↑	direct↑	quasi↑	preds
regex	0.071	0.086	0.398	0.407	0.390	85
natasha	0.775	0.752	0.301	0.370	0.237	1151
ollama·qwen2.5:3b	0.293	0.329	0.230	0.185	0.271	421
natasha+regex	0.846	0.807	0.699	0.778	0.627	1236
natasha+ollama·qwen2.5:3b	0.825	0.774	0.487	0.537	0.441	1333
regex+ollama·qwen2.5:3b	0.342	0.379	0.469	0.463	0.475	480
natasha+regex+ollama·qwen2.5:3b ★	0.875	0.811	0.726	0.815	0.644	1392
natasha+regex+gemma3 ◇	0.954	0.366	0.850	0.907	0.797	9093
natasha+regex+gliner ◇	0.980	0.412	0.885	0.889	0.881	7927
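The gap between recall and F2 in the ◇ rows can be made concrete with the recall-weighted F-beta formula (β=2). The sketch below assumes that cov F2 is computed from cov R and a matching coverage precision, which this report does not state explicitly; `f_beta` and `implied_precision` are illustrative helper names.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-beta score; beta=2 weights recall more heavily than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def implied_precision(f: float, recall: float, beta: float = 2.0) -> float:
    """Invert f_beta: solve F = (1+b2)PR / (b2*P + R) for P."""
    b2 = beta * beta
    denom = (1 + b2) * recall - b2 * f
    if denom <= 0:
        raise ValueError("no valid precision for this (F, recall) pair")
    return f * recall / denom


# (cov R, cov F2) pairs copied from the table above.
p_default = implied_precision(0.811, 0.875)  # natasha+regex+ollama ★
p_gemma3 = implied_precision(0.366, 0.954)   # natasha+regex+gemma3 ◇

print(f"implied precision, default stack: {p_default:.2f}")  # roughly 0.63
print(f"implied precision, gemma3 swap:   {p_gemma3:.2f}")   # roughly 0.11
```

Under that assumption, the Gemma3 swap buys about eight points of recall at the cost of most of its precision, which is consistent with its prediction count rising from 1392 to 9093.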

RU-real (JayGuard slice) — 60 docs, 77 gold mentions (external, anonymized, real-but-non-clinical Russian text — Apache-2.0; PERSON/LOCATION only, machine-derived gold, not human-adjudicated)

combo	cov R↑	cov F2↑	ent R↑	direct↑	quasi↑	preds
regex	0.000	0.000	0.000	0.000	0.000	13
natasha	0.403	0.418	0.403	0.358	0.700	63
ollama·qwen2.5:3b	0.610	0.547	0.610	0.657	0.300	160
natasha+regex	0.403	0.404	0.403	0.358	0.700	76
natasha+ollama·qwen2.5:3b	0.792	0.674	0.792	0.791	0.800	180
regex+ollama·qwen2.5:3b	0.610	0.547	0.610	0.657	0.300	160
natasha+regex+ollama·qwen2.5:3b ★	0.792	0.674	0.792	0.791	0.800	180
natasha+regex+gemma3 ◇	0.961	0.772	0.961	0.970	0.900	201
natasha+regex+gemma4-12b-mlx ◇	1.000	0.830	1.000	1.000	1.000	172

EN-synth — 32 docs, 46 gold mentions (no entity-level / direct-quasi: the EN sets carry no per-entity entity_id annotation, so only mention-level coverage is scored)

combo	cov R↑	cov F2↑	preds
regex	0.370	0.419	19
opf	0.783	0.818	38
ollama·qwen2.5:3b	0.500	0.525	49
opf+regex	0.913	0.921	46
opf+ollama·qwen2.5:3b	0.848	0.815	58
regex+ollama·qwen2.5:3b	0.761	0.743	59
opf+regex+ollama·qwen2.5:3b ★	0.978	0.910	66
natasha+regex+ollama·qwen2.5:3b	0.783	0.758	61
presidio	0.913	0.907	51
philter	0.783	0.799	47
presidio+regex+ollama·qwen2.5:3b	0.935	0.880	66
opf+regex+gemma3 ◇	1.000	0.945	62
opf+regex+gemma4-12b-mlx ◇	1.000	0.976	54
opf+regex+gemma4-26b·cloud ◇	1.000	0.984	52
opf+regex+gliner ◇	0.957	0.800	91

2b. Adversarial robustness (RU)

RU-adv (hard-forms probe, 20 adversarial identifiers)

combo	cov R↑	cov F2↑	ent R↑	direct↑	quasi↑	preds
regex	0.400	0.455	0.400	0.471	0.000	8
natasha	0.350	0.389	0.350	0.353	0.333	10
ollama·qwen2.5:3b	0.650	0.663	0.650	0.588	1.000	25
natasha+regex	0.750	0.765	0.750	0.824	0.333	18
natasha+regex+ollama·qwen2.5:3b ★	0.950	0.887	0.950	0.941	1.000	30
natasha+regex+gemma3 ◇	1.000	0.917	1.000	1.000	1.000	32
natasha+regex+gemma4-12b-mlx ◇	1.000	0.948	1.000	1.000	1.000	28
natasha+regex+gemma4-26b·cloud ◇	1.000	0.943	1.000	1.000	1.000	26
natasha+regex+gliner ◇	1.000	0.826	1.000	1.000	1.000	41

On the hard-forms probe the full stack catches 19/20 adversarial identifiers; the lone leak is a Latin-transliterated Russian name.

2c. Model comparison

Exploratory model swaps show Gemma beating the Qwen baseline on every full short slice; local Gemma4 is strongest on the Russian sets, while HF cloud Gemma4 leads the English stack and is stable across 5 replicates. GLiNER-multi, a local zero-shot NER layer, is reported on the same footing. The ★ defaults are unchanged until variance and promotion gates pass.

Rows are stack scores from separate detector caches, generated by score_llm_experiment.py. Cloud rows used synthetic text only. Long-RU Gemma3 chunking is included as a recall/noise tradeoff, not a promoted default; missing rows are omitted, not scored as zero.

dataset	model	cov R↑	cov F2↑	ent R↑	preds
RU-synth long	Qwen2.5-3B	0.875	0.802	0.726	1392
RU-synth long	Gemma3	0.954	0.362	0.850	9093
RU-synth long	GLiNER-multi PII (zero-shot NER)	0.980	0.407	0.885	7927
RU-adv	Qwen2.5-3B	0.950	0.887	0.950	30
RU-adv	Gemma3	1.000	0.917	1.000	32
RU-adv	Gemma4 12B-MLX	1.000	0.948	1.000	28
RU-adv	Gemma4 26B-A4B (HF cloud)	1.000	0.943	1.000	26
RU-adv	GLiNER-multi PII (zero-shot NER)	1.000	0.826	1.000	41
RU-real	Qwen2.5-3B	0.792	0.614	0.792	180
RU-real	Gemma3	0.961	0.730	0.961	201
RU-real	Gemma4 12B-MLX	1.000	0.820	1.000	172
EN-synth	Qwen2.5-3B	0.978	0.870	—	66
EN-synth	Gemma3	1.000	0.904	—	62
EN-synth	Gemma4 12B-MLX	1.000	0.955	—	54
EN-synth	Gemma4 26B-A4B (HF cloud)	1.000	0.962	—	52
EN-synth	GLiNER-multi PII (zero-shot NER)	0.957	0.782	—	91

3. Direct vs quasi-identifiers (TAB)

Under the ★ default stack on RU-synth long, direct identifiers reach 0.81 entity recall; quasi-identifiers remain lower at 0.64.
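Entity-level recall in the TAB sense counts an entity as caught only when every one of its mentions is masked, which is why it can sit well below mention-level coverage. The sketch below illustrates that rule and the direct/quasi split; the `Mention` record, the containment test in `is_masked`, and the toy offsets are assumptions made for illustration, not the benchmark's scorer.

```python
from collections import defaultdict
from dataclasses import dataclass

Span = tuple[int, int]  # half-open character offsets [start, end)


@dataclass(frozen=True)
class Mention:
    entity_id: str  # groups mentions of the same real-world entity
    kind: str       # "direct" or "quasi"
    start: int
    end: int


def is_masked(mention: Mention, predicted: list[Span]) -> bool:
    """Assumption: a mention is masked if one predicted span fully contains it."""
    return any(s <= mention.start and mention.end <= e for s, e in predicted)


def entity_recall(gold: list[Mention], predicted: list[Span]) -> dict[str, float]:
    """Recall over entities: an entity is a hit only if all its mentions are masked."""
    by_entity: dict[str, list[Mention]] = defaultdict(list)
    for m in gold:
        by_entity[m.entity_id].append(m)

    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for mentions in by_entity.values():
        kind = mentions[0].kind  # assumption: one kind per entity
        caught = all(is_masked(m, predicted) for m in mentions)
        for key in (kind, "all"):
            totals[key] += 1
            hits[key] += int(caught)
    return {key: hits[key] / totals[key] for key in totals}


# Toy gold: entity e1 is mentioned twice, and the second mention is missed.
gold = [
    Mention("e1", "direct", 0, 4),
    Mention("e1", "direct", 40, 44),
    Mention("e2", "direct", 10, 20),
    Mention("e3", "quasi", 25, 27),
]
predicted = [(0, 4), (10, 20), (25, 27)]

scores = entity_recall(gold, predicted)
mention_recall = sum(is_masked(m, predicted) for m in gold) / len(gold)
print(scores, mention_recall)
```

Here three of four mentions are masked (mention-level recall 0.75), yet only one of the two direct entities counts as protected, because a single surviving mention of e1 is enough to leave it exposed.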

4. What survives — reconstruction & re-identification

31% of quasi-identifiers survive the default stack (both clients); over-redaction costs 20% of redactions.

5. Privacy vs utility — can you de-identify and still analyze?

A weak local attacker recovers 1/10 attributes (top-3); yet 91% of the clinical signal survives redaction.

6. CONFIDE stack vs established baselines (Presidio, Philter)

Off-the-shelf de-identifiers can match the stack on coverage but fall far behind on type-aware micro-F1.
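A toy example shows how a detector can match on coverage while trailing on type-aware micro-F1. It assumes a prediction counts as a true positive only when both its span and its type label match the gold exactly; the type labels, offsets, and function names are hypothetical and not the benchmark's scorer.

```python
Typed = tuple[int, int, str]  # (start, end, type label)


def coverage_recall(gold: set[Typed], pred: set[Typed]) -> float:
    """Share of gold spans contained in some predicted span, ignoring type."""
    covered = sum(
        any(ps <= gs and ge <= pe for ps, pe, _ in pred) for gs, ge, _ in gold
    )
    return covered / len(gold)


def micro_f1(gold: set[Typed], pred: set[Typed]) -> float:
    """Micro-averaged F1: true positives pooled across all types."""
    tp = len(gold & pred)  # span and type must both match
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = {(0, 4, "PERSON"), (6, 8, "AGE"), (16, 26, "MEDICATION")}

# A hypothetical detector that finds every span but labels all of them PERSON.
untyped = {(0, 4, "PERSON"), (6, 8, "PERSON"), (16, 26, "PERSON")}

print(coverage_recall(gold, untyped))  # every gold span is covered
print(micro_f1(gold, untyped))         # only one span carries the right type
```

Coverage is what matters for masking text, while the type-aware score matters when the redacted span is replaced by a typed placeholder or surrogate.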

7. Regulatory residual-risk (RU · EN)

Detection metrics measure what we catch; regulators care what survives. Mapped onto named risks, the RU default stack lands at RED, driven by 9 in-scope residual direct-identifier entities (each a re-identification key left in the text). One further residual is a spelled-out digit ID, out of scope for the regex layer by design and reported separately.

dataset	tier	direct res	special res	HIPAA	worst doc	inf rate	link AUC
RU	RED	9	6	4/6	70%	20%	0.46
EN-synth	AMBER	0	0	6/7	0%	20%	0.46

Per-language residual-risk tier under each language's ★ default stack. RU lands RED (direct identifiers leak at the strict TAB entity bar); EN lands AMBER (no direct-ID leak, but nonzero inference / incomplete HIPAA coverage). EN's worst-doc recall reads 0% because its tiny gold set means a single PII-bearing doc can be missed entirely; this is small-N noise, not a systematic EN failure. The RU detail follows.
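The tier logic described above can be sketched as a small decision rule. This is a simplified reading: the GREEN tier, the field names, the interpretation of the HIPAA column as covered/total categories, and the zero thresholds are all assumptions, and the real mapping may weigh more signals (special-category residuals, worst-doc recall, linkage AUC).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResidualRisk:
    direct_residual: int    # in-scope direct-identifier entities left in text
    hipaa_covered: int      # assumed: HIPAA categories fully covered
    hipaa_total: int        # assumed: HIPAA categories present in the data
    inference_rate: float   # share of attributes the attacker infers


def tier(risk: ResidualRisk) -> str:
    """Map residual-risk signals onto a traffic-light tier (assumed rule)."""
    if risk.direct_residual > 0:
        return "RED"    # a direct identifier survived redaction
    if risk.inference_rate > 0 or risk.hipaa_covered < risk.hipaa_total:
        return "AMBER"  # no direct leak, but residual inference or coverage gaps
    return "GREEN"      # hypothetical tier, not named in the report


# Values copied from the RU and EN-synth rows of the table above.
ru = ResidualRisk(direct_residual=9, hipaa_covered=4, hipaa_total=6, inference_rate=0.20)
en = ResidualRisk(direct_residual=0, hipaa_covered=6, hipaa_total=7, inference_rate=0.20)

print(tier(ru), tier(en))
```

With the table's values this rule reproduces the reported tiers: RU is RED because of its 9 residual direct identifiers, and EN is AMBER because inference and HIPAA coverage are both short of clean.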

References & credits

CONFIDE-Bench builds on the de-identification, re-identification, and documentation literature listed below. Every work named or relied on in this report is credited here with a link to its canonical page (DOI / arXiv / HuggingFace / GitHub). We credit only what the report actually uses; inclusion does not imply endorsement by those authors.

Benchmarks & metrics

TAB — Text Anonymization Benchmark. Pilán, Lison, Øvrelid, Papadopoulou, Sánchez & Batet (2022), Computational Linguistics 48(4):1053–1101. Source of the direct vs. quasi-identifier distinction and entity-level (all-mentions-masked) recall. doi:10.1162/coli_a_00458 · ACL Anthology
2014 i2b2/UTHealth de-identification (Track 1). Stubbs, Kotfila & Uzuner (2015), J. Biomedical Informatics. Strict entity-based de-id evaluation; comparison point for clinical-note de-id. PubMed 26225918
2016 CEGS N-GRID / n2c2 psychiatric-intake de-identification. Stubbs, Filannino & Uzuner (2017), J. Biomedical Informatics. Psychiatric-intake-note de-id comparison point. PMC5705537
MEDDOCAN. Spanish synthetic clinical-case de-identification shared task (IberLEF 2019), ~22 PHI types. Related clinical de-id benchmark. PlanTL SPACCC_MEDDOCAN
Presidio-research (F2 evaluation). Microsoft, MIT-licensed. Basis for the recall-weighted F₂ (β=2) de-id scoring framing. github.com/microsoft/presidio-research
Tau-Eval. Loiseau et al. (2025), EMNLP System Demonstrations. Task-sensitive privacy-and-utility evaluation framing. arXiv:2506.05979

Re-identification & privacy attacks

Staab et al. — Beyond Memorization: Violating Privacy via Inference with LLMs. ICLR 2024. LLM inference-attack framing; frontier attackers infer far more than the local lower-bound attacker used here. arXiv:2310.07298
Anonymeter. Giomi, Boenisch, Wehmeyer & Tasnádi (2022/PETS 2023), Statice. Attack-based singling-out / linkability / inference framing (the three GDPR risks). arXiv:2211.10459 · GitHub
RAT-Bench. Imperial College (2026 preprint). Attacker-based residual re-identification benchmark framing (cited as preprint evidence). OpenReview FjbU4kLriN

Detectors & tools

Microsoft Presidio. MIT license; spaCy-backed PII detection (EN-first baseline). github.com/microsoft/presidio
Philter / philter-lite. UCSF clinical de-identification rule set; philter-lite is the Sirona Medical fork. github.com/SironaMedical/philter-lite · PyPI
Natasha. Russian NLP/NER toolkit (Cyrillic-only — the basis for the documented transliteration leak). github.com/natasha/natasha
OpenAI Privacy Filter (OPF), openai/privacy-filter. Apache-2.0 token-classification PII model (used as the EN name/address backbone). The model card states it is a redaction / data-minimization aid, not an anonymization or compliance guarantee. huggingface.co/openai/privacy-filter
Ollama + Qwen. Local LLM runner and the Qwen model family used for the local-LLM detector layer and the local 3B re-identification attacker. ollama.com · QwenLM/Qwen2.5

Datasets

JayGuard NER Benchmark. Just AI (2025), Hugging Face Datasets. External, anonymized, real-but-non-clinical conversational Russian PII dataset (Apache-2.0); the RU-real slice is built from it (PERSON/LOCATION). Used with attribution as required by the licence. huggingface.co/datasets/just-ai/jayguard-ner-benchmark

Documentation & regulatory framing

Datasheets for Datasets. Gebru et al. (2021), CACM. Microsoft Research
Data Statements for NLP. Bender & Friedman (2018), TACL. ACL Anthology Q18-1041
GDPR Recital 26 & WP29/EDPB anonymisation framework. “Reasonably likely means” and the singling-out / linkability / inference triad. GDPR (EUR-Lex) · EDPB SME guide
HIPAA de-identification (Safe Harbor & Expert Determination). Mapping is illustrative only — benchmark success is not a compliance certification. HHS HIPAA de-id guidance
