A bilingual de-identification benchmark for psychotherapy transcripts.
CONFIDE · github.com/glebis/confide · by Gleb Kalinin & CONFIDE contributors · released for research & teaching under the repository license. Therapy transcripts are synthetic/fictional — no real patient data; the one real-text slice (RU-real / JayGuard) is external, anonymized, non-clinical public data.
Project site, plain-language explainer & how to contribute: confide.salient.community.
LLM detector layer (ollama) & local attacker: Qwen2.5-3B-Instruct (qwen2.5:3b) via Ollama, temperature 0. Deterministic layers: Natasha (Russian NER), a bilingual regex layer, and the OpenAI Privacy Filter (English).
TL;DR — we test how well automatic privacy tools hide personal details in psychotherapy session transcripts (Russian & English), and which combination of tools earns its compute.
How to read this: ★ marks the recommended default stack · bars show coverage (higher is better) · a blank bar means that combination was not run for that language · every table has a column key explaining its abbreviations.
De-identification is not one tool but a stack of detectors, and the honest question is which layer pays for the CPU it burns. This benchmark composes detector layers by span-union over psychotherapy transcripts in Russian and English, scores each combination the way published de-id work does — recall-first, entity-level, direct vs quasi — and asks a sharper question than “how good is the tool”: what can only an LLM catch, and what still leaks after we redact?
Three PII types (age 0%, medication 4%, profession 2%) are near-zero under the deterministic layers (Natasha NER + regex). Adding the local qwen layer raises their mention recall, but medication and profession still have very low entity recall, because an entity counts as protected only when every one of its mentions is masked. Meanwhile, 31% of quasi-identifiers still survive the default stack. Redacting direct identifiers is necessary but not sufficient.
The LLM layer is what moves age, medication, and profession above the deterministic baseline.
RU per-category mention recall (relaxed overlap), 1076 gold mentions: deterministic (Natasha+regex) vs. +qwen.
Structured direct IDs (email, phone, policy ID) and names/orgs/locations reach 0.9–1.0 from regex + Natasha. Dates are now caught too — a numeric-date regex rule was added after the benchmark exposed the gap.
Quasi-identifiers needing world-knowledge — a drug name, an occupation, a spelled-out age — are invisible to pattern and NER layers; the LLM is the only layer that lifts their mention-recall.
Caveat: this is mention recall. At entity level (all mentions masked), medication and profession stay at 0 even with qwen — a higher bar.
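The gap between mention recall and entity recall can be made concrete with a minimal sketch. It assumes gold mentions carry a character span and an `entity_id`, and that a mention counts as caught when any predicted span overlaps it (the relaxed-overlap rule). The function names and the toy data are illustrative, not the benchmark's own code.

```python
from collections import defaultdict

def overlaps(a, b):
    """True if half-open character spans a=(start, end) and b share any character."""
    return a[0] < b[1] and b[0] < a[1]

def mention_recall(gold, preds):
    """Share of gold mentions touched by at least one predicted span."""
    hits = sum(any(overlaps(g["span"], p) for p in preds) for g in gold)
    return hits / len(gold)

def entity_recall(gold, preds):
    """Share of entities whose mentions are ALL touched by a predicted span."""
    caught_by_entity = defaultdict(list)
    for g in gold:
        caught = any(overlaps(g["span"], p) for p in preds)
        caught_by_entity[g["entity_id"]].append(caught)
    protected = sum(all(flags) for flags in caught_by_entity.values())
    return protected / len(caught_by_entity)

# Toy data (hypothetical): one medication mentioned three times, one name once.
gold = [
    {"entity_id": "med1", "span": (10, 20)},
    {"entity_id": "med1", "span": (50, 60)},
    {"entity_id": "med1", "span": (90, 100)},
    {"entity_id": "name1", "span": (0, 5)},
]
preds = [(0, 5), (10, 20), (48, 61)]  # the third medication mention is missed

print(mention_recall(gold, preds))  # 0.75
print(entity_recall(gold, preds))   # 0.5
```

Three of four mentions are caught, yet only one of two entities is protected: a single missed mention is enough to leave the medication exposed.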
Bars show coverage recall; ★ is the proposed default, which trades a little recall for large speed/precision gains — not always the single highest bar.
Coverage recall by combination across the plotted datasets. A missing bar means that combination was not run on that language (see note), not a zero score. ★ = proposed default; see table for F2/precision.
Russian runs only the local-first stack — Natasha + regex + qwen2.5:3b. Three detectors are English-only by design, so any combo containing them has no RU bar: Presidio (its RU is spaCy-NER-dependent and weak — left unscored rather than misrepresented), Philter (an English clinical-notes rule set), and the OpenAI Privacy Filter (an English token-classifier). This is a scope decision, not a measured RU failure.
RU: the proposed default natasha+regex+ollama ★ reaches 88% coverage recall. The OPF-on-RU row is omitted until its detector cache is regenerated for the current 30-document corpus.
EN-synth: OPF is the name/address backbone (English’s Natasha). Default opf+regex+ollama ★; opf+regex edges it on F2.
RU-real (JayGuard): on external real Russian text the local stack reaches strong coverage — a real-text anchor for the otherwise-synthetic RU corpus (PERSON/LOCATION only).
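The F2 referred to above is the F-score with beta = 2, which weights recall four times as heavily as precision. The sketch below is a worked check under stated assumptions: recall 0.783 on 46 gold mentions (the EN-synth `opf` row) is consistent with 36 of 46 mentions caught, which is an inference from the rounded figure, and precision is assumed to be 1.0 because the leaderboard excerpt does not list it.

```python
def f_beta(precision, recall, beta=2.0):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

recall = 36 / 46   # ASSUMPTION: 36 of 46 gold mentions, inferred from 0.783
precision = 1.0    # ASSUMPTION: not listed in the table, chosen for illustration

print(round(f_beta(precision, recall), 3))  # 0.818
```

Under those assumptions the formula reproduces the 0.818 shown for that row; a different precision would give a different F2 for the same recall.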
| combo | cov R↑ | cov F2↑ | ent R↑ | direct↑ | quasi↑ | preds |
|---|---|---|---|---|---|---|
| regex | 0.071 | 0.086 | 0.398 | 0.407 | 0.390 | 85 |
| natasha | 0.775 | 0.752 | 0.301 | 0.370 | 0.237 | 1151 |
| ollama·qwen2.5:3b | 0.293 | 0.329 | 0.230 | 0.185 | 0.271 | 421 |
| natasha+regex | 0.846 | 0.807 | 0.699 | 0.778 | 0.627 | 1236 |
| natasha+ollama·qwen2.5:3b | 0.825 | 0.774 | 0.487 | 0.537 | 0.441 | 1333 |
| regex+ollama·qwen2.5:3b | 0.342 | 0.379 | 0.469 | 0.463 | 0.475 | 480 |
| natasha+regex+ollama·qwen2.5:3b ★ | 0.875 | 0.811 | 0.726 | 0.815 | 0.644 | 1392 |
| natasha+regex+gemma3 ◇ | 0.954 | 0.366 | 0.850 | 0.907 | 0.797 | 9093 |
| natasha+regex+gliner ◇ | 0.980 | 0.412 | 0.885 | 0.889 | 0.881 | 7927 |
◇ exploratory swap in the ★ stack (Gemma LLM or GLiNER NER layer) — separate detector cache, not a promoted default (variance and promotion gates pending; see the model comparison section).
RU-real (JayGuard slice) — 60 docs, 77 gold mentions (external, anonymized, real-but-non-clinical Russian text — Apache-2.0; PERSON/LOCATION only, machine-derived gold, not human-adjudicated)
| combo | cov R↑ | cov F2↑ | ent R↑ | direct↑ | quasi↑ | preds |
|---|---|---|---|---|---|---|
| regex | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 13 |
| natasha | 0.403 | 0.418 | 0.403 | 0.358 | 0.700 | 63 |
| ollama·qwen2.5:3b | 0.610 | 0.547 | 0.610 | 0.657 | 0.300 | 160 |
| natasha+regex | 0.403 | 0.404 | 0.403 | 0.358 | 0.700 | 76 |
| natasha+ollama·qwen2.5:3b | 0.792 | 0.674 | 0.792 | 0.791 | 0.800 | 180 |
| regex+ollama·qwen2.5:3b | 0.610 | 0.547 | 0.610 | 0.657 | 0.300 | 160 |
| natasha+regex+ollama·qwen2.5:3b ★ | 0.792 | 0.674 | 0.792 | 0.791 | 0.800 | 180 |
| natasha+regex+gemma3 ◇ | 0.961 | 0.772 | 0.961 | 0.970 | 0.900 | 201 |
| natasha+regex+gemma4-12b-mlx ◇ | 1.000 | 0.830 | 1.000 | 1.000 | 1.000 | 172 |
EN-synth — 32 docs, 46 gold mentions (no entity-level / direct-quasi: the EN sets carry no per-entity entity_id annotation, so only mention-level coverage is scored)
| combo | cov R↑ | cov F2↑ | preds |
|---|---|---|---|
| regex | 0.370 | 0.419 | 19 |
| opf | 0.783 | 0.818 | 38 |
| ollama·qwen2.5:3b | 0.500 | 0.525 | 49 |
| opf+regex | 0.913 | 0.921 | 46 |
| opf+ollama·qwen2.5:3b | 0.848 | 0.815 | 58 |
| regex+ollama·qwen2.5:3b | 0.761 | 0.743 | 59 |
| opf+regex+ollama·qwen2.5:3b ★ | 0.978 | 0.910 | 66 |
| natasha+regex+ollama·qwen2.5:3b | 0.783 | 0.758 | 61 |
| presidio | 0.913 | 0.907 | 51 |
| philter | 0.783 | 0.799 | 47 |
| presidio+regex+ollama·qwen2.5:3b | 0.935 | 0.880 | 66 |
| opf+regex+gemma3 ◇ | 1.000 | 0.945 | 62 |
| opf+regex+gemma4-12b-mlx ◇ | 1.000 | 0.976 | 54 |
| opf+regex+gemma4-26b·cloud ◇ | 1.000 | 0.984 | 52 |
| opf+regex+gliner ◇ | 0.957 | 0.800 | 91 |
On the hard-forms probe the full stack catches 19/20 adversarial identifiers — the lone leak is a Latin-transliterated Russian name.
| combo | cov R↑ | cov F2↑ | ent R↑ | direct↑ | quasi↑ | preds |
|---|---|---|---|---|---|---|
| regex | 0.400 | 0.455 | 0.400 | 0.471 | 0.000 | 8 |
| natasha | 0.350 | 0.389 | 0.350 | 0.353 | 0.333 | 10 |
| ollama·qwen2.5:3b | 0.650 | 0.663 | 0.650 | 0.588 | 1.000 | 25 |
| natasha+regex | 0.750 | 0.765 | 0.750 | 0.824 | 0.333 | 18 |
| natasha+regex+ollama·qwen2.5:3b ★ | 0.950 | 0.887 | 0.950 | 0.941 | 1.000 | 30 |
| natasha+regex+gemma3 ◇ | 1.000 | 0.917 | 1.000 | 1.000 | 1.000 | 32 |
| natasha+regex+gemma4-12b-mlx ◇ | 1.000 | 0.948 | 1.000 | 1.000 | 1.000 | 28 |
| natasha+regex+gemma4-26b·cloud ◇ | 1.000 | 0.943 | 1.000 | 1.000 | 1.000 | 26 |
| natasha+regex+gliner ◇ | 1.000 | 0.826 | 1.000 | 1.000 | 1.000 | 41 |
16 snippets, 20 gold forms: patronymics, transliteration, diminutives, VK/Telegram handles, SNILS/INN/passport IDs, abbreviated addresses, and code-switching.
Regex catches the structured IDs and handles; Natasha + qwen2.5:3b recover patronymics, diminutives and code-switching. The one residual leak is “Sergey Volkov” — a Latin-transliterated Russian name: Natasha is Cyrillic-only, regex has no name rule, and qwen missed it. This is the argument for adding an English/Latin NER (OPF) when transliteration is expected.
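One way to decide when a Latin-script NER pass is worth running is to flag capitalised Latin word runs inside otherwise Cyrillic text. The sketch below is a hypothetical pre-filter, not part of the CONFIDE stack; the pattern and the function name are assumptions, and it will also flag brand names and other harmless Latin words, so its output is a list of candidates to route to an NER layer rather than a redaction decision.

```python
import re

# Runs of capitalised Latin-script words, e.g. "Sergey Volkov" (hypothetical rule).
LATIN_NAME_RUN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")
CYRILLIC = re.compile(r"[А-Яа-яЁё]")

def latin_name_candidates(text):
    """Return (start, end, surface) for capitalised Latin runs in Cyrillic text."""
    if not CYRILLIC.search(text):
        return []  # not a Cyrillic document, nothing to route
    return [(m.start(), m.end(), m.group()) for m in LATIN_NAME_RUN.finditer(text)]

text = "Вчера мне написал Sergey Volkov, мы давно не общались."
print(latin_name_candidates(text))  # [(18, 31, 'Sergey Volkov')]
```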
Exploratory model swaps show Gemma beating the Qwen baseline on every full short slice; local Gemma4 is strongest on the Russian sets, while HF cloud Gemma4 leads the English stack and is stable across 5 replicates. GLiNER-multi, a local zero-shot NER layer, is reported on the same footing. The ★ defaults are unchanged until variance and promotion gates pass.
| dataset | model | cov R↑ | cov F2↑ | ent R | preds |
|---|---|---|---|---|---|
| RU-synth long | Qwen2.5-3B | 0.875 | 0.802 | 0.726 | 1392 |
| RU-synth long | Gemma3 | 0.954 | 0.362 | 0.850 | 9093 |
| RU-synth long | GLiNER-multi PII (zero-shot NER) | 0.980 | 0.407 | 0.885 | 7927 |
| RU-adv | Qwen2.5-3B | 0.950 | 0.887 | 0.950 | 30 |
| RU-adv | Gemma3 | 1.000 | 0.917 | 1.000 | 32 |
| RU-adv | Gemma4 12B-MLX | 1.000 | 0.948 | 1.000 | 28 |
| RU-adv | Gemma4 26B-A4B (HF cloud) | 1.000 | 0.943 | 1.000 | 26 |
| RU-adv | GLiNER-multi PII (zero-shot NER) | 1.000 | 0.826 | 1.000 | 41 |
| RU-real | Qwen2.5-3B | 0.792 | 0.614 | 0.792 | 180 |
| RU-real | Gemma3 | 0.961 | 0.730 | 0.961 | 201 |
| RU-real | Gemma4 12B-MLX | 1.000 | 0.820 | 1.000 | 172 |
| EN-synth | Qwen2.5-3B | 0.978 | 0.870 | — | 66 |
| EN-synth | Gemma3 | 1.000 | 0.904 | — | 62 |
| EN-synth | Gemma4 12B-MLX | 1.000 | 0.955 | — | 54 |
| EN-synth | Gemma4 26B-A4B (HF cloud) | 1.000 | 0.962 | — | 52 |
| EN-synth | GLiNER-multi PII (zero-shot NER) | 0.957 | 0.782 | — | 91 |
Rows are stack scores from separate detector caches, generated by score_llm_experiment.py. Cloud rows used synthetic text only. Long-RU Gemma3 chunking is included as a recall/noise tradeoff, not a promoted default; missing rows are omitted, not scored as zero.
Direct identifiers reach 0.81 entity recall; quasi-identifiers remain lower at 0.64.
RU entity-level recall (an entity is protected only if all mentions are masked) by identifier class.
Direct (name, phone, email, policy): masked at 0.81 entity recall in the default stack.
Quasi (age, profession, city, employer, medication, date): the LLM helps but the ceiling stays low — these are the attributes that, combined, re-identify a person.
31% of quasi-identifiers survive the default stack (both clients); over-redaction costs 20% of redactions.
Method — following the re-identification / inference-attack literature (Staab et al.; RAT-Bench; Tau-Eval). An entity survives if any one of its mentions is left unmasked.
A local 3B qwen attacker recovered 2 of 10 tested attributes from the redacted text (e.g. the medication, because its entity-recall is 0). A weak attacker is a lower bound — the inference-attack literature (Staab et al.) reports much higher re-identification rates for frontier models.
Implication: redacting direct identifiers is table stakes; quasi-identifier survival is a useful gate to check before sending a session to cloud analysis.
A weak local attacker recovers 1/10 attributes (top-3); yet 91% of the clinical signal survives redaction.
Method — top-k inference attack with a fixed, declared budget (qwen2.5:3b, temp 0.4, top-3 guesses/attribute, redacted text only) + downstream task preservation (re-run cognitive-distortion extraction on redacted vs. original). Aligned with Staab et al. / RAT-Bench (privacy) and Tau-Eval (task-sensitive utility).
The same masking that lowers attacker success can erase clinical content. Here it does not: the default stack keeps ~91% of distortion signal and 99.5% of non-PII text while a weak attacker recovers nothing top-3.
Caveat: this attacker is a lower bound — quasi-identifiers still survive in text (medication entity-recall is 0), so a frontier attacker would score higher. Residual risk stays MEDIUM for client B.
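The two measurements in this section, top-k attacker success and downstream task preservation, reduce to simple set arithmetic once the model outputs are collected. The sketch below assumes the attacker returns a ranked list of guesses per attribute and that the downstream extractor returns a set of labels per transcript. All names and the toy values are hypothetical, and exact string matching after normalisation is a simplification of whatever matching the benchmark applies.

```python
def normalise(value):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(value.lower().split())

def attack_success(truth, guesses, k=3):
    """Fraction of attributes whose true value appears in the attacker's top-k."""
    recovered = [
        attr for attr, value in truth.items()
        if normalise(value) in {normalise(g) for g in guesses.get(attr, [])[:k]}
    ]
    return len(recovered) / len(truth), recovered

def preservation(original_items, redacted_items):
    """Share of items found on the original text that are still found after redaction."""
    original = set(original_items)
    if not original:
        return 1.0
    return len(original & set(redacted_items)) / len(original)

# Toy data (hypothetical, not benchmark output).
truth = {"medication": "sertraline", "profession": "nurse", "city": "Kazan", "age": "34"}
guesses = {
    "medication": ["fluoxetine", "Sertraline", "escitalopram"],
    "profession": ["teacher", "doctor", "engineer"],
    "city": ["Moscow"],
    "age": ["30", "35", "40"],
}
rate, recovered = attack_success(truth, guesses, k=3)
print(rate, recovered)  # 0.25 ['medication']

kept = preservation({"catastrophizing", "mind reading", "labeling"},
                    {"catastrophizing", "labeling"})
print(round(kept, 3))  # 0.667
```

Fixing `k` and the attacker model in advance is what makes the success rate comparable across stacks: a larger budget can only raise it.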
Off-the-shelf de-identifiers can match the stack on coverage but fall far behind on type-aware micro-F1.
EN-synth: coverage F2 (recall-weighted, headline) vs type-aware micro-F1 for the CONFIDE ★ stack and the established baselines.
Coverage F2 (orange) asks only “did we mask the span at all”. Type micro-F1 (green) demands the right label too — what a redaction policy actually needs.
On EN-synth, Presidio edges the stack on coverage F2 (a broad DATE_TIME recognizer) but its type-F1 is far lower; Philter is high-coverage yet emits almost everything as untyped OTHER, so its type-F1 is unusable.
Takeaway: a generic system is not a therapy-tuned one; the only coverage a baseline adds is relative/colloquial dates.
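The difference between the two scores can be sketched with a simplified scorer. It assumes spans are `(start, end, label)` tuples and uses relaxed overlap for both scores, with the typed score also requiring matching labels. The i2b2-style scoring named in this report is stricter in its span matching, so this is an illustration of the idea rather than a reimplementation.

```python
def prf(gold, preds, match):
    """Precision, recall and F1 pooled over all types (micro), given a match rule."""
    tp_pred = sum(any(match(p, g) for g in gold) for p in preds)
    tp_gold = sum(any(match(p, g) for p in preds) for g in gold)
    precision = tp_pred / len(preds) if preds else 0.0
    recall = tp_gold / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def any_overlap(p, g):
    """Coverage rule: the spans share at least one character."""
    return p[0] < g[1] and g[0] < p[1]

def typed_overlap(p, g):
    """Type-aware rule: the spans overlap AND the labels agree."""
    return any_overlap(p, g) and p[2] == g[2]

# Toy data (hypothetical): a detector that finds every span but labels most as OTHER.
gold = [(0, 5, "NAME"), (10, 20, "DATE"), (30, 40, "MEDICATION")]
preds = [(0, 5, "OTHER"), (10, 20, "DATE"), (30, 40, "OTHER")]

coverage = prf(gold, preds, any_overlap)
typed = prf(gold, preds, typed_overlap)
print(coverage[2])         # 1.0
print(round(typed[2], 3))  # 0.333
```

A detector that masks everything but labels it OTHER scores perfectly on coverage and poorly on the typed metric, which is the pattern described for Philter above.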
Detection metrics measure what we catch; regulators care what survives. Mapped onto named risks, the RU default stack lands at RED, driven by 9 in-scope residual direct-identifier entities (each a re-identification key left in the text). One further residual is a spelled-out digit ID, out of scope for the regex layer by design and reported separately.
All datasets, side by side
| dataset | tier | direct res | special res | HIPAA | worst doc | inf rate | link AUC |
|---|---|---|---|---|---|---|---|
| RU | RED | 9 | 6 | 4/6 | 70% | 20% | 0.46 |
| EN-synth | AMBER | 0 | 0 | 6/7 | 0% | 20% | 0.46 |
Per-language residual-risk tier under each language's ★ default stack. RU lands RED (direct identifiers leak at the strict TAB entity bar); EN lands AMBER (no direct-ID leak, but nonzero inference / incomplete HIPAA coverage). EN's worst-doc recall reads 0% because its tiny gold means one PII-bearing doc can be missed entirely — small-N noise, not a systematic EN failure. The RU detail follows.
WP29 (Art-29 WP 05/2014) re-identification triad: identifiability decomposes into three risks. Singling out is the residual quasi-identifier surface, estimated with a caveated population-fraction estimator (not corpus k-anonymity; N is tiny). Linkability is the pairwise session-linking ROC AUC, here 0.46, where 1.0 means perfectly linkable and 0.50 means no better than chance (the safe direction). Inference is the attribute-recovery attack, which recovers 20%. HIPAA coverage is a Safe-Harbor-inspired checklist, not a legal determination (AGE is N/A; structured IDs collapsed).
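For the linkability number, ROC AUC has a direct pairwise reading: it is the probability that a randomly chosen same-client session pair gets a higher similarity score than a randomly chosen different-client pair, with ties counted as half. The sketch below computes it that way. How the benchmark scores pair similarity is not stated here, so the inputs are assumed to be arbitrary similarity scores, and the values are toy numbers.

```python
def roc_auc(scores_same, scores_diff):
    """P(same-client pair scores higher than different-client pair); ties count 0.5."""
    wins = 0.0
    for s in scores_same:
        for d in scores_diff:
            if s > d:
                wins += 1.0
            elif s == d:
                wins += 0.5
    return wins / (len(scores_same) * len(scores_diff))

# Toy similarity scores (hypothetical) for redacted session pairs.
same_client = [0.9, 0.4]
different_client = [0.5, 0.3, 0.1]
print(round(roc_auc(same_client, different_client), 3))  # 0.833
```

A value near 0.5 means the similarity score cannot tell same-client pairs from different-client pairs.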
RED = any in-scope direct identifier leaks at entity level (one unmasked mention is a key). AMBER = special-category residual, nonzero inference, or linkability above chance. GREEN = all clear.
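The tier rule reads naturally as a short decision function. The sketch below is one interpretation of the definitions above: the parameter names are hypothetical, and "linkability above chance" is taken to mean an AUC strictly greater than 0.5 with no tolerance band, which is an assumption.

```python
def risk_tier(direct_residual, special_residual, inference_rate, link_auc,
              chance_auc=0.5):
    """Map residual-risk measurements to RED / AMBER / GREEN."""
    # Any in-scope direct identifier leaking at entity level is decisive.
    if direct_residual > 0:
        return "RED"
    # Otherwise any remaining signal downgrades the result to AMBER.
    if special_residual > 0 or inference_rate > 0 or link_auc > chance_auc:
        return "AMBER"
    return "GREEN"

# Values taken from the side-by-side table above.
print(risk_tier(9, 6, 0.20, 0.46))  # RED   (RU)
print(risk_tier(0, 0, 0.20, 0.46))  # AMBER (EN-synth)
```

With the table's values, RU is RED on the direct-identifier check alone, and EN-synth is AMBER solely because the inference rate is nonzero.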
What leaks: not whole names but specific variants — inflected/possessive/patronymic forms (Артёмом, Натальин, Денису), lowercase surnames, vocatives, Latin transliteration (Timur), and name/common-word collisions (Вера, Роман). Mention-level recall hides this; the strict TAB entity bar (one miss ⇒ unprotected) surfaces it.
The singling-out estimate is illustrative (a caveated population-fraction method, not corpus k-anonymity) — it is not a guarantee of non-identifiability.
Each detector runs once per dataset; combinations are span-unions of cached spans, interval-merged into the deployed redaction mask before scoring. This report headlines coverage recall (relaxed overlap), the privacy-critical number; recall-weighted F2 and precision are in the leaderboard table. Type-aware micro/macro-F1 (i2b2) and entity-level recall (TAB; all mentions masked) are also reported. Numbers are mention-level unless marked entity-level. Gold for RU is located from the two answer-key PII inventories and hand-verified (a planted-signal recovery eval, not independently annotated gold); the EN set is a curated synthetic slice, and the one real-text anchor is the external RU-real (JayGuard) slice. The data is mostly synthetic, with no real patients. N is small: treat per-type numbers as directional.
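The composition step can be sketched in a few lines: take the cached spans of each layer in a combination, union them, and merge overlapping intervals into one mask. The cache layout, the function names, and the choice to merge touching spans as well as overlapping ones are assumptions made for this illustration, not the repository's actual code.

```python
def merge_spans(spans):
    """Merge overlapping or touching half-open (start, end) spans into a mask."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of opening a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def combine(cache, layers):
    """Span-union of the named detector layers, merged into one redaction mask."""
    union = [span for name in layers for span in cache[name]]
    return merge_spans(union)

def redact(text, mask, token="[REDACTED]"):
    """Replace every masked interval with a placeholder token."""
    out, cursor = [], 0
    for start, end in mask:
        out.append(text[cursor:start])
        out.append(token)
        cursor = end
    out.append(text[cursor:])
    return "".join(out)

# Hypothetical per-document cache: each detector ran once, spans are stored.
cache = {
    "natasha": [(0, 4), (20, 26)],
    "regex": [(30, 42)],
    "ollama": [(2, 8), (26, 29)],
}
mask = combine(cache, ["natasha", "regex", "ollama"])
print(mask)  # [(0, 8), (20, 29), (30, 42)]
print(redact("Anna Petrova called.", [(0, 12)]))  # [REDACTED] called.
```

Because the spans are cached, scoring a new combination costs only a union and a merge; no detector has to be rerun.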
CONFIDE-Bench builds on the de-identification, re-identification, and documentation literature listed below. Every work named or relied on in this report is credited here with a link to its canonical page (DOI / arXiv / HuggingFace / GitHub). We credit only what the report actually uses; inclusion does not imply endorsement by those authors.
philter-lite is the Sirona Medical fork. github.com/SironaMedical/philter-lite · PyPI
openai/privacy-filter. Apache-2.0 token-classification PII model (used as the EN name/address backbone). The model card states it is a redaction / data-minimization aid, not an anonymization or compliance guarantee. huggingface.co/openai/privacy-filter