Bilingual · recall-first · open

Which layer earns its compute?

CONFIDE-Bench measures how well automatic privacy tools hide personal details in psychotherapy session transcripts — in Russian and English — and which combination of detectors is actually worth running.

A missed identifier is leaked client data. So the benchmark is recall-first, scores detectors the way published de-identification work does (entity-level, direct vs. quasi), and asks two sharper questions: what can only an LLM catch, and what still leaks after we redact?

Figure: a psychotherapy transcript in which names, places, and a medication are covered by redaction blocks.

The problem

Therapists want AI help. The transcript can't leak the client.

Therapists increasingly want to use AI to review their sessions — but a raw transcript can reveal a client's name, town, employer, diagnosis, or the medication they take. The fix is to strip identifying information out before anything leaves the therapist's machine. That is harder than it sounds:

  • Easy clues: a phone number or email address has a recognizable format, so a simple rule catches it.
  • Hard clues — "I'm the only puppet-maker in my small town" contains no name, yet still points to one person.
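
The easy/hard split can be made concrete with a minimal sketch. The two patterns and the function name `mask_easy_clues` are illustrative assumptions, not the benchmark's actual rules; real phone and email formats vary far more than this.

```python
import re

# Illustrative patterns only (assumptions for this sketch, not CONFIDE's rules).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{8,}\d")

def mask_easy_clues(text: str) -> str:
    """Replace anything shaped like an email or phone number with a tag."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

easy = "Call me on +7 912 345-67-89 or write to anna@example.com."
hard = "I'm the only puppet-maker in my small town."

masked_easy = mask_easy_clues(easy)   # both identifiers are replaced
masked_hard = mask_easy_clues(hard)   # nothing matches, so the clue survives
```

The second sentence passes through unchanged: there is no surface pattern to match, which is why a rule layer alone cannot protect it.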

CONFIDE-Bench is the scorecard for that job — and, unusually, it covers Russian as well as English, and therapy conversations rather than hospital notes.

What it measures

Did we catch it?

Coverage recall at the entity level, as in the Text Anonymization Benchmark (TAB): an entity counts as protected only if every mention is masked. One unredacted recurrence is a leak. Recall is weighted over precision (F2, the F-score with β = 2).
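
As a rough sketch of that scoring rule, assuming gold mentions and predicted masks are character spans and that a mention counts as covered only when one predicted span contains it entirely (a simplification; the benchmark's exact matching rule may differ). The names `Mention`, `entity_recall`, and `f_beta` are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mention:
    entity_id: str  # gold entity this mention refers to
    start: int      # character offset, inclusive
    end: int        # character offset, exclusive

def is_covered(mention, masked_spans):
    """True if a single masked span fully contains the mention."""
    return any(s <= mention.start and mention.end <= e for s, e in masked_spans)

def entity_recall(gold_mentions, masked_spans):
    """Share of gold entities whose every mention is masked."""
    by_entity = {}
    for m in gold_mentions:
        by_entity.setdefault(m.entity_id, []).append(m)
    if not by_entity:
        return 1.0
    protected = sum(
        all(is_covered(m, masked_spans) for m in mentions)
        for mentions in by_entity.values()
    )
    return protected / len(by_entity)

def f_beta(precision, recall, beta=2.0):
    """F-beta score; beta=2 weights recall more heavily than precision."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0

# Toy data: the client's name recurs, and the second mention is missed.
gold = [
    Mention("client_name", 0, 4),
    Mention("client_name", 50, 54),
    Mention("home_town", 20, 30),
]
masked = [(0, 4), (20, 30)]
recall = entity_recall(gold, masked)  # one of two entities is fully masked
```

Two of the three mentions are masked, yet only one of the two entities counts as protected, which is the sense in which one missed recurrence is a leak.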

What only an LLM catches

Direct IDs fall to rules + NER. Quasi-identifiers that need world knowledge (a drug name, an occupation, a spelled-out age) are invisible to pattern layers; the local LLM is the only layer that raises recall on them.
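
One way to ask which layer earns its compute is to credit each layer only with the entities that no cheaper layer already caught. The sketch below assumes each layer's catches are available as a set of gold entity labels; the function name `marginal_gain` and the toy data are invented for illustration and are not benchmark results.

```python
def marginal_gain(caught_by_layer, order):
    """For each layer, in the given order, return the entities it catches
    that no earlier layer in the order has already caught."""
    seen = set()
    gains = {}
    for layer in order:
        new = caught_by_layer[layer] - seen
        gains[layer] = new
        seen |= new
    return gains

# Toy data (made up): which gold entities each layer detects on its own.
caught_by_layer = {
    "regex": {"phone", "email"},
    "ner": {"name", "city", "email"},
    "llm": {"name", "medication", "occupation", "age"},
}

# Cheapest layer first, so the LLM is credited only with what it alone adds.
gains = marginal_gain(caught_by_layer, ["regex", "ner", "llm"])
```

Ordering the layers from cheapest to most expensive means an expensive layer gets no credit for re-detecting what a rule already found.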

What still leaks

Even after redaction, surviving quasi-identifiers and an inference attack on the masked text show how much re-identification surface remains — mapped onto named regulatory risks.

Headline findings

Direct identifiers are table stakes; quasi-identifiers are the real surface.

  • 88% RU default coverage (natasha + regex + qwen2.5:3b ★)
  • 4 datasets: RU · RU-adv · EN · RU-real
  • 31% quasi-ID survival (both clients; re-id surface)
  • 19/20 adversarial caught (lone leak: a transliterated name)

Three PII types (age, medication, profession) have near-zero recall under the deterministic layers. The local LLM raises their mention-level recall, but medication and profession stay low at the strict entity level, because every mention must be masked. Redaction of direct identifiers is necessary but not sufficient.
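
The gap between mention-level and entity-level recall follows from simple arithmetic. As a back-of-the-envelope sketch, assume each mention is caught independently with the same probability (real misses are likely correlated, so treat this as intuition only). The function name and the numbers are illustrative, not benchmark results.

```python
def entity_protection_rate(mention_recall: float, mentions_per_entity: int) -> float:
    """Chance that every mention of an entity is masked, assuming each
    mention is caught independently with probability `mention_recall`."""
    return mention_recall ** mentions_per_entity

# Invented numbers: 80% mention recall, a medication named three times.
rate = entity_protection_rate(0.8, 3)  # roughly 0.51
```

An entity that recurs several times can therefore stay exposed even when most of its individual mentions are caught.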

Exploratory model-swap finding: Gemma3 and Gemma4 both beat the current Qwen baseline on completed full short slices; Gemma4 is strongest on quality, while Qwen remains the published default until variance and promotion gates pass.

Our position

We report the RED residual-risk tier on purpose. De-identification is harm-reduction, not anonymization — it never replaces explicit, informed client consent and local-only handling for any real recording. A tool that promised "safe to upload" would be the dangerous one. Read our ethics & consent stance →

See the full interactive report →  ·  Russian version →

The toolkit

Not just a scorecard — the tools too.

CONFIDE ships the redactor and the adversary. The session-anonymizer skill does three-layer, fully-local de-identification inside your coding agent — install it with npx skills add glebis/confide. CONFIDE-Red then tries to re-identify what survived (inference / singling-out / linkability) — the honest check behind the RED tier.

Help

The gold standard needs more than one annotator.

To make the benchmark trustworthy, its "gold" labels must be annotated independently by several people, and Russian speakers are especially needed. A pilot is ~6–10 transcripts, roughly 3–5 hours, on your own schedule. You only ever see synthetic or consented data.