CONFIDE-Bench — Which Layer Earns Its Compute?

A bilingual de-identification benchmark for psychotherapy transcripts.

CONFIDE · github.com/glebis/confide · by Gleb Kalinin & CONFIDE contributors · released for research & teaching under the repository license. Therapy transcripts are synthetic/fictional — no real patient data; the one real-text slice (RU-real / JayGuard) is external, anonymized, non-clinical public data.

Project site, plain-language explainer & how to contribute: confide.salient.community.

TAB · i2b2/n2c2 · Presidio-F2 · Datasheets for Datasets

LLM detector layer (ollama) & local attacker: Qwen2.5-3B-Instruct (qwen2.5:3b) via Ollama, temperature 0. Deterministic layers: Natasha (Russian NER), a bilingual regex layer, and the OpenAI Privacy Filter (English).

TL;DR — we test how well automatic privacy tools hide personal details in psychotherapy session transcripts (Russian & English), and which combination of tools earns its compute.

What we measure
Whether the redaction mask actually covered each piece of personal information. Recall comes first, because a miss is leaked data.
Why it matters
Therapy text is deeply sensitive; one un-hidden name, phone, or medication can re-identify a client.
What it is NOT
Not a HIPAA/GDPR compliance certificate. The therapy transcripts are synthetic/fictional and samples are small — treat results as directional. The one real-text exception is the external RU-real (JayGuard) slice: anonymized, non-clinical public data.
Who it is for
Anyone choosing or building a de-identification pipeline for clinical or therapy text.

How to read this: ★ marks the recommended default stack · bars show coverage (higher is better) · a blank bar means that combination was not run for that language · every table has a column key explaining its abbreviations.

datasets: 4 (RU · RU-adv · EN · RU-real)
combos × dataset: 8 (union-composed ablation)
RU default recall: 88% (natasha+regex+ollama ★)
quasi-ID survival: 31% (both clients · re-id surface)

De-identification is not one tool but a stack of detectors, and the honest question is which layer pays for the CPU it burns. This benchmark composes detector layers by span-union over psychotherapy transcripts in Russian and English, scores each combination the way published de-id work does — recall-first, entity-level, direct vs quasi — and asks a sharper question than “how good is the tool”: what can only an LLM catch, and what still leaks after we redact?

headline

Three PII types have near-zero mention recall under the deterministic layers (Natasha NER + regex): age 0%, medication 4%, profession 2%. Adding the local qwen layer raises their mention recall, but medication and profession still have very low entity recall, because an entity counts as protected only when every one of its mentions is masked. Meanwhile, 31% of quasi-identifiers still survive the default stack. Redacting direct identifiers is necessary but not sufficient.

1. Which layer catches what

The LLM layer is what moves age, medication, and profession above the deterministic baseline.

RU per-category mention recall (relaxed overlap), 1076 gold mentions: deterministic (Natasha+regex) vs. +qwen.

reading it

Structured direct IDs (email, phone, policy ID) and names/orgs/locations reach 0.9–1.0 from regex + Natasha. Dates are now caught too — a numeric-date regex rule was added after the benchmark exposed the gap.

Quasi-identifiers needing world-knowledge — a drug name, an occupation, a spelled-out age — are invisible to pattern and NER layers; the LLM is the only layer that lifts their mention-recall.

Caveat: this is mention recall. Entity-level recall sets a higher bar (every mention of an entity must be masked), and there medication and profession stay at 0 even with qwen.


2. Best sequence per language

Bars show coverage recall; ★ is the proposed default, which trades a little recall for large speed/precision gains — not always the single highest bar.

Coverage recall by combination across the plotted datasets. A missing bar means that combination was not run on that language (see note), not a zero score. ★ = proposed default; see table for F2/precision.

why some bars are blank for RU

Russian runs only the local-first stack — Natasha + regex + qwen2.5:3b. Three detectors are English-only by design, so any combo containing them has no RU bar: Presidio (its RU is spaCy-NER-dependent and weak — left unscored rather than misrepresented), Philter (an English clinical-notes rule set), and the OpenAI Privacy Filter (an English token-classifier). This is a scope decision, not a measured RU failure.

why they differ

RU: the proposed default natasha+regex+ollama ★ reaches 88% coverage recall. The OPF-on-RU row is omitted until its detector cache is regenerated for the current 30-document corpus.

EN-synth: OPF is the name/address backbone (English’s Natasha). Default opf+regex+ollama ★; opf+regex edges it on F2.

RU-real (JayGuard): on external real Russian text the local ★ stack reaches 0.79 coverage recall, a real-text anchor for the otherwise-synthetic RU corpus (PERSON/LOCATION only).

combo                               cov R   cov F2  ent R   direct  quasi   preds
regex                               0.071   0.086   0.398   0.407   0.390   85
natasha                             0.775   0.752   0.301   0.370   0.237   1151
ollama·qwen2.5:3b                   0.293   0.329   0.230   0.185   0.271   421
natasha+regex                       0.846   0.807   0.699   0.778   0.627   1236
natasha+ollama·qwen2.5:3b           0.825   0.774   0.487   0.537   0.441   1333
regex+ollama·qwen2.5:3b             0.342   0.379   0.469   0.463   0.475   480
natasha+regex+ollama·qwen2.5:3b ★   0.875   0.811   0.726   0.815   0.644   1392
natasha+regex+gemma3 ◇              0.954   0.366   0.850   0.907   0.797   9093
natasha+regex+gliner ◇              0.980   0.412   0.885   0.889   0.881   7927

◇ exploratory swap in the ★ stack (Gemma LLM or GLiNER NER layer) — separate detector cache, not a promoted default (variance and promotion gates pending; see the model comparison section).

column key: the abbreviations, and which direction is better for each, are defined in the key under the section 2b table.

RU-real (JayGuard slice) — 60 docs, 77 gold mentions (external, anonymized, real-but-non-clinical Russian text — Apache-2.0; PERSON/LOCATION only, machine-derived gold, not human-adjudicated)

combo                               cov R   cov F2  ent R   direct  quasi   preds
regex                               0.000   0.000   0.000   0.000   0.000   13
natasha                             0.403   0.418   0.403   0.358   0.700   63
ollama·qwen2.5:3b                   0.610   0.547   0.610   0.657   0.300   160
natasha+regex                       0.403   0.404   0.403   0.358   0.700   76
natasha+ollama·qwen2.5:3b           0.792   0.674   0.792   0.791   0.800   180
regex+ollama·qwen2.5:3b             0.610   0.547   0.610   0.657   0.300   160
natasha+regex+ollama·qwen2.5:3b ★   0.792   0.674   0.792   0.791   0.800   180
natasha+regex+gemma3 ◇              0.961   0.772   0.961   0.970   0.900   201
natasha+regex+gemma4-12b-mlx ◇      1.000   0.830   1.000   1.000   1.000   172


EN-synth — 32 docs, 46 gold mentions (no entity-level / direct-quasi: the EN sets carry no per-entity entity_id annotation, so only mention-level coverage is scored)

combo                               cov R   cov F2  preds
regex                               0.370   0.419   19
opf                                 0.783   0.818   38
ollama·qwen2.5:3b                   0.500   0.525   49
opf+regex                           0.913   0.921   46
opf+ollama·qwen2.5:3b               0.848   0.815   58
regex+ollama·qwen2.5:3b             0.761   0.743   59
opf+regex+ollama·qwen2.5:3b ★       0.978   0.910   66
natasha+regex+ollama·qwen2.5:3b     0.783   0.758   61
presidio                            0.913   0.907   51
philter                             0.783   0.799   47
presidio+regex+ollama·qwen2.5:3b    0.935   0.880   66
opf+regex+gemma3 ◇                  1.000   0.945   62
opf+regex+gemma4-12b-mlx ◇          1.000   0.976   54
opf+regex+gemma4-26b·cloud ◇        1.000   0.984   52
opf+regex+gliner ◇                  0.957   0.800   91


2b. Adversarial robustness (RU)

On the hard-forms probe the full stack catches 19/20 adversarial identifiers — the lone leak is a Latin-transliterated Russian name.

combo                               cov R   cov F2  ent R   direct  quasi   preds
regex                               0.400   0.455   0.400   0.471   0.000   8
natasha                             0.350   0.389   0.350   0.353   0.333   10
ollama·qwen2.5:3b                   0.650   0.663   0.650   0.588   1.000   25
natasha+regex                       0.750   0.765   0.750   0.824   0.333   18
natasha+regex+ollama·qwen2.5:3b ★   0.950   0.887   0.950   0.941   1.000   30
natasha+regex+gemma3 ◇              1.000   0.917   1.000   1.000   1.000   32
natasha+regex+gemma4-12b-mlx ◇      1.000   0.948   1.000   1.000   1.000   28
natasha+regex+gemma4-26b·cloud ◇    1.000   0.943   1.000   1.000   1.000   26
natasha+regex+gliner ◇              1.000   0.826   1.000   1.000   1.000   41


column key — what each abbreviation means, and which direction is better
  • cov R (higher is better) — Mask-coverage recall (relaxed, ≥1-char overlap): fraction of gold PII spans the redaction mask touched at all. A miss = leaked PII.
  • cov F2 (higher is better) — Mask-coverage F2 (recall-weighted, β=2): recall counts 2× precision, because a missed entity leaks PII while a false positive only over-redacts.
  • ent R (higher is better) — Entity-level recall (TAB): an entity counts as protected only if ALL its mentions are masked — one un-redacted recurrence is a leak.
  • direct (higher is better) — Entity recall for direct identifiers (name, phone, email, policy/ID).
  • quasi (higher is better) — Entity recall for quasi-identifiers (age, profession, city, employer, medication, date) — the combinable re-identification surface.
  • preds · (context) — Number of predicted spans the stack emitted (redaction volume). Context, not a score: more masking trades precision for recall.
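
The two coverage metrics in the key can be sketched in a few lines. This is a minimal illustration, not the benchmark's scorer: spans are assumed to be half-open character offsets, and precision is assumed to count a predicted span as correct when it touches any gold span (the key defines only the recall side). The function names are hypothetical.

```python
Span = tuple[int, int]  # half-open character offsets [start, end)

def overlaps(a: Span, b: Span) -> bool:
    """True if the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]

def coverage_scores(gold: list[Span], pred: list[Span], beta: float = 2.0):
    """Return (recall, precision, F_beta) under relaxed >=1-char overlap."""
    hit_gold = sum(any(overlaps(g, p) for p in pred) for g in gold)
    # Assumption: a predicted span is a true positive if it touches any gold span.
    hit_pred = sum(any(overlaps(p, g) for g in gold) for p in pred)
    recall = hit_gold / len(gold) if gold else 0.0
    precision = hit_pred / len(pred) if pred else 0.0
    if recall == 0.0 and precision == 0.0:
        return recall, precision, 0.0
    b2 = beta * beta
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return recall, precision, f_beta

# Hypothetical document: 4 gold spans, 3 predicted spans.
gold = [(0, 5), (10, 15), (20, 25), (30, 35)]
pred = [(1, 4), (12, 18), (40, 45)]
r, p, f2 = coverage_scores(gold, pred)
print(round(r, 3), round(p, 3), round(f2, 3))  # 0.5 0.667 0.526
```

With β=2 the score sits closer to recall (0.5) than to precision (0.667), which is the intended bias: a missed span costs more than an extra mask.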

what the probe contains

16 snippets, 20 gold forms: patronymics, transliteration, diminutives, VK/Telegram handles, SNILS/INN/passport IDs, abbreviated addresses, and code-switching.

Regex catches the structured IDs and handles; Natasha + qwen2.5:3b recover patronymics, diminutives and code-switching. The one residual leak is “Sergey Volkov” — a Latin-transliterated Russian name: Natasha is Cyrillic-only, regex has no name rule, and qwen missed it. This is the argument for adding an English/Latin NER (OPF) when transliteration is expected.


2c. Model comparison

Exploratory model swaps show Gemma beating the Qwen baseline on every short slice (RU-adv, RU-real, EN-synth). Local Gemma4 is strongest on the Russian sets, while HF cloud Gemma4 leads the English stack and is stable across 5 replicates. GLiNER-multi, a local zero-shot NER layer, is reported on the same footing. The ★ defaults are unchanged until variance and promotion gates pass.

dataset        model                             cov R   cov F2  ent R   preds
RU-synth long  Qwen2.5-3B                        0.875   0.802   0.726   1392
RU-synth long  Gemma3                            0.954   0.362   0.850   9093
RU-synth long  GLiNER-multi PII (zero-shot NER)  0.980   0.407   0.885   7927
RU-adv         Qwen2.5-3B                        0.950   0.887   0.950   30
RU-adv         Gemma3                            1.000   0.917   1.000   32
RU-adv         Gemma4 12B-MLX                    1.000   0.948   1.000   28
RU-adv         Gemma4 26B-A4B (HF cloud)         1.000   0.943   1.000   26
RU-adv         GLiNER-multi PII (zero-shot NER)  1.000   0.826   1.000   41
RU-real        Qwen2.5-3B                        0.792   0.614   0.792   180
RU-real        Gemma3                            0.961   0.730   0.961   201
RU-real        Gemma4 12B-MLX                    1.000   0.820   1.000   172
EN-synth       Qwen2.5-3B                        0.978   0.870   -       66
EN-synth       Gemma3                            1.000   0.904   -       62
EN-synth       Gemma4 12B-MLX                    1.000   0.955   -       54
EN-synth       Gemma4 26B-A4B (HF cloud)         1.000   0.962   -       52
EN-synth       GLiNER-multi PII (zero-shot NER)  0.957   0.782   -       91

Rows are stack scores from separate detector caches, generated by score_llm_experiment.py. Cloud rows used synthetic text only. Long-RU Gemma3 chunking is included as a recall/noise tradeoff, not a promoted default; missing rows are omitted, not scored as zero.


3. Direct vs quasi-identifiers (TAB)

Direct identifiers reach 0.81 entity recall; quasi-identifiers remain lower at 0.64.

RU entity-level recall (an entity is protected only if all mentions are masked) by identifier class.

the asymmetry

Direct (name, phone, email, policy): masked at 0.81 entity recall in the default stack.

Quasi (age, profession, city, employer, medication, date): the LLM helps, but entity recall still tops out at 0.64 in the default stack. These are the attributes that, combined, re-identify a person.


4. What survives — reconstruction & re-identification

31% of quasi-identifiers survive the default stack (both clients); over-redaction accounts for 20% of redactions.

quasi survival (A): 27% (client A · re-id surface)
quasi survival (B): 33% (client B)
over-redaction (C): 20% (readability cost)
attacker: qwen2.5:3b (Qwen2.5-3B · recovers attrs from redacted text)

Method — following the re-identification / inference-attack literature (Staab et al.; RAT-Bench; Tau-Eval). An entity survives if any one of its mentions is left unmasked.
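
The survival rule can be made concrete with a short sketch. It assumes gold mentions carry an entity id and a character span, and that a mention counts as masked when it shares at least one character with the redaction mask; the report does not say whether the entity-level check uses relaxed or full overlap, so that criterion is an assumption, as are the names and sample data.

```python
Span = tuple[int, int]  # half-open character offsets [start, end)

def is_masked(mention: Span, mask: list[Span]) -> bool:
    """Assumed criterion: the mention shares >= 1 character with the mask."""
    return any(mention[0] < end and start < mention[1] for start, end in mask)

def surviving_entities(mentions: list[tuple[str, Span]], mask: list[Span]) -> set[str]:
    """Return ids of entities with at least one unmasked mention."""
    return {eid for eid, span in mentions if not is_masked(span, mask)}

# Hypothetical gold mentions: (entity_id, span). Each entity recurs twice.
mentions = [
    ("name1", (0, 5)), ("name1", (30, 35)),
    ("med1", (10, 20)), ("med1", (50, 60)),
]
mask = [(0, 5), (10, 20), (30, 35)]  # the second "med1" mention is missed

survivors = surviving_entities(mentions, mask)
entity_ids = {eid for eid, _ in mentions}
survival_rate = len(survivors) / len(entity_ids)
print(survivors, survival_rate)  # {'med1'} 0.5
```

Entity-level recall is the complement of this rate: here one of two entities is fully masked, so entity recall is 0.5 even though three of four mentions are covered.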

interpretation

A local 3B qwen attacker recovered 2 of 10 tested attributes from the redacted text (e.g. the medication, because its entity-recall is 0). A weak attacker is a lower bound — the inference-attack literature (Staab et al.) reports much higher re-identification rates for frontier models.

Implication: redacting direct identifiers is table stakes; quasi-identifier survival is a useful gate to check before sending a session to cloud analysis.


5. Privacy vs utility — can you de-identify and still analyze?

A weak local attacker recovers only 1/10 attributes (top-3), while 91% of the clinical signal survives redaction.

CBT-signal preserved: 91% (distortion types, redacted vs orig)
non-PII text kept: 99.5% (char-level utility floor)
attack top-3: 1/10 (qwen2.5:3b, lower bound)
residual risk: MEDIUM / MEDIUM (client A / B)

Method — top-k inference attack with a fixed, declared budget (qwen2.5:3b, temp 0.4, top-3 guesses/attribute, redacted text only) + downstream task preservation (re-run cognitive-distortion extraction on redacted vs. original). Aligned with Staab et al. / RAT-Bench (privacy) and Tau-Eval (task-sensitive utility).
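
A top-k attack score of this kind can be computed as below. This is a sketch under stated assumptions: the attacker's guesses are already collected per attribute, and a guess counts as a hit on a case-insensitive exact match. The report does not describe its matching rule, and the attribute values shown are invented for illustration.

```python
def topk_hit_rate(truth: dict[str, str], guesses: dict[str, list[str]], k: int = 3) -> float:
    """Fraction of attributes whose true value is among the top-k guesses."""
    def norm(text: str) -> str:
        # Assumed matching rule: exact match after trimming and case-folding.
        return text.strip().casefold()

    hits = 0
    for attribute, true_value in truth.items():
        top_k = {norm(g) for g in guesses.get(attribute, [])[:k]}
        hits += norm(true_value) in top_k
    return hits / len(truth) if truth else 0.0

# Hypothetical ground truth and attacker output for one client.
truth = {"medication": "sertraline", "profession": "nurse", "city": "Kazan", "age": "34"}
guesses = {
    "medication": ["fluoxetine", "Sertraline", "escitalopram"],
    "profession": ["teacher", "doctor", "manager", "nurse"],  # 4th guess is outside top-3
    "city": ["Moscow", "Saint Petersburg", "Novosibirsk"],
    # no guesses returned for "age"
}
print(topk_hit_rate(truth, guesses))  # 0.25
```

Fixing k in advance is what makes the budget "declared": a correct guess in fourth place does not count.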

the tension, resolved

The same masking that lowers attacker success can erase clinical content. Here it does not: the default stack keeps ~91% of distortion signal and 99.5% of non-PII text, while a weak attacker recovers only 1 of 10 attributes within its top-3 guesses.

Caveat: this attacker is a lower bound. Quasi-identifiers still survive in the text (medication entity-recall is 0), so a frontier attacker would score higher. Residual risk stays MEDIUM for both clients.


6. CONFIDE stack vs established baselines (Presidio, Philter)

Off-the-shelf de-identifiers can match the stack on coverage but fall far behind on type-aware micro-F1.

EN-synth: coverage F2 (recall-weighted, headline) vs type-aware micro-F1 for the CONFIDE ★ stack and the established baselines.

reading it

Coverage F2 (orange) asks only “did we mask the span at all”. Type micro-F1 (green) demands the right label too — what a redaction policy actually needs.

On EN-synth, Presidio edges the stack on coverage F2 (a broad DATE_TIME recognizer) but its type-F1 is far lower; Philter is high-coverage yet emits almost everything as untyped OTHER, so its type-F1 is unusable.

Takeaway: a generic system is not a therapy-tuned one; the only coverage a baseline adds is relative/colloquial dates.


7. Regulatory residual-risk (RU · EN)

Detection metrics measure what we catch; regulators care what survives. Mapped onto named risks, the RU default stack lands at RED, driven by 9 in-scope residual direct-identifier entities (a re-identification key left in the text). One further residual is a spelled-out digit ID, which is out of scope for the regex layer by design and reported separately.

All datasets, side by side

dataset   tier   direct res  special res  HIPAA  worst doc  inf rate  link AUC
RU        RED    9           6            4/6    70%        20%       0.46
EN-synth  AMBER  0           0            6/7    0%         20%       0.46

Per-language residual-risk tier under each language's ★ default stack. RU lands RED (direct identifiers leak at the strict TAB entity bar); EN lands AMBER (no direct-ID leak, but nonzero inference / incomplete HIPAA coverage). EN's worst-doc recall reads 0% because its tiny gold means one PII-bearing doc can be missed entirely — small-N noise, not a systematic EN failure. The RU detail follows.

residual-risk tier: RED (ordinal R/A/G · RU ★ stack)
HIPAA-inspired coverage: 4/6 (categories fully removed)
worst-doc recall: 70% (containment · 1.424 leaks / 10k chars)
singled out: 0/6 (clients · residual quasi surface)

WP29 (Art-29 WP 05/2014) re-identification triad: identifiability decomposes into singling out, linkability, and inference.
  • Singling out: residual quasi surface via a caveated population-fraction estimator (NOT corpus k-anonymity; N is tiny).
  • Linkability: pairwise session-linking ROC AUC of 0.46. AUC is the area under the ROC curve: 1.0 = perfectly linkable, 0.50 = no better than chance, which is the safe direction.
  • Inference: the attribute-recovery attack recovers 20%.
HIPAA coverage is a Safe-Harbor-inspired checklist, not a legal determination (AGE is N/A; structured IDs collapsed).

reading the tier

RED = any in-scope direct identifier leaks at entity level (one unmasked mention is a key). AMBER = special-category residual, nonzero inference, or linkability above chance. GREEN = all clear.
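
The tier rule reads naturally as a small decision function. The sketch below encodes only the three conditions stated above, with "above chance" taken to mean a link AUC greater than 0.5; the function and parameter names are hypothetical, and the inputs are the values from the side-by-side table.

```python
def residual_risk_tier(direct_residual: int, special_residual: int,
                       inference_rate: float, link_auc: float) -> str:
    """Map residual-risk measurements onto the ordinal RED/AMBER/GREEN tier."""
    if direct_residual > 0:
        # Any in-scope direct identifier left unmasked at entity level.
        return "RED"
    if special_residual > 0 or inference_rate > 0 or link_auc > 0.5:
        # Assumption: "linkability above chance" means AUC strictly above 0.5.
        return "AMBER"
    return "GREEN"

print(residual_risk_tier(9, 6, 0.20, 0.46))  # RED   (RU row)
print(residual_risk_tier(0, 0, 0.20, 0.46))  # AMBER (EN-synth row)
```

Both rows reproduce the reported tiers: RU is RED on direct-identifier residuals alone, and EN-synth is AMBER because its inference rate is nonzero.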

What leaks: not whole names but specific variants — inflected/possessive/patronymic forms (Артёмом, Натальин, Денису), lowercase surnames, vocatives, Latin transliteration (Timur), and name/common-word collisions (Вера, Роман). Mention-level recall hides this; the strict TAB entity bar (one miss ⇒ unprotected) surfaces it.

The singling-out estimate is illustrative (a caveated population-fraction method, not corpus k-anonymity) — it is not a guarantee of non-identifiability.

methodology

Each detector runs once per dataset; combinations are span-unions of cached spans, interval-merged to the deployed redaction mask before scoring. This report headlines coverage recall (relaxed overlap) — the privacy-critical number — and recall-weighted F2 + precision sit in the leaderboard table. Type-aware micro/macro-F1 (i2b2) and entity-level recall (TAB; all mentions masked) are also reported. Numbers are mention-level unless marked entity-level. Gold for RU is located from the two answer-key PII inventories and hand-verified (a planted-signal recovery eval, not independently annotated gold); the EN set is a curated synthetic slice, and the one real-text anchor is the external RU-real (JayGuard) slice. Mostly synthetic data — no real patients. Small N: treat per-type numbers as directional.
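
The composition step can be sketched as follows: each detector's spans are cached once, a combination is the union of the cached spans for its layers, and the union is interval-merged into a single mask. This is an illustrative sketch, not the repository's code; the cache layout, the function names, and the choice to merge spans that merely touch are assumptions.

```python
from typing import Iterable

Span = tuple[int, int]  # half-open character offsets [start, end)

def merge_spans(spans: Iterable[Span]) -> list[Span]:
    """Union overlapping or touching spans into a sorted, disjoint mask."""
    merged: list[Span] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def compose(cache: dict[str, list[Span]], combo: str) -> list[Span]:
    """Build the mask for a combo such as 'natasha+regex' from cached spans."""
    layers = combo.split("+")
    return merge_spans(span for layer in layers for span in cache[layer])

# Hypothetical cached detector output for one document.
cache = {
    "natasha": [(0, 5), (20, 28)],
    "regex": [(3, 9), (40, 52)],
    "ollama": [(26, 31)],
}
mask = compose(cache, "natasha+regex+ollama")
print(mask)  # [(0, 9), (20, 31), (40, 52)]
```

Because every combination is derived from the same cached spans, an ablation over all layer subsets costs one detector run per layer rather than one per combination.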

References & credits

CONFIDE-Bench builds on the de-identification, re-identification, and documentation literature listed below. Every work named or relied on in this report is credited here with a link to its canonical page (DOI / arXiv / HuggingFace / GitHub). We credit only what the report actually uses; inclusion does not imply endorsement by those authors.

Benchmarks & metrics

Re-identification & privacy attacks

Detectors & tools

Datasets

Documentation & regulatory framing