Retrieval beats memorization for clinical code selection
A two-stage retrieval-then-classify method outperforms direct LLM generation for assembling clinical value sets from large standardized vocabularies.
Retrieve similar existing value sets, then classify candidates to build clinical code lists more accurately than direct LLM generation.
- Clinical value set authoring is the task of identifying all codes in a standardized vocabulary that define a medical concept.
- LLMs cannot reliably recall large, versioned clinical vocabularies from pretraining alone.
- RASC retrieves the K most similar existing value sets, then applies a classifier to rank the pooled candidate codes (see the sketch after this list).
- A cross-encoder on SAPBert achieves AUROC 0.852 and F1 0.298, beating zero-shot GPT-4o at F1 0.105.
- Retrieval-then-classify reduces irrelevant candidates per true positive from 12.3 to 3.2–4.4.
- The performance gap widens as value set size increases, confirming the theoretical advantage of shrinking the output space.
- The benchmark, built from 11,803 publicly available VSAC value sets, is the first large-scale dataset for this task.
- Gains replicate across the SAPBert cross-encoder, LightGBM, and other classifier architectures.
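To make the two-stage pattern concrete, here is a minimal sketch of retrieve-then-classify. It is not the RASC implementation: character n-gram TF-IDF stands in for dense SAPBert embeddings, plain cosine similarity stands in for the trained cross-encoder, and the toy value sets are invented for illustration.

```python
# Minimal sketch of retrieve-then-classify; all data and names are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus of existing value sets: concept name -> {code: description}.
value_sets = {
    "Type 2 diabetes": {"E11.9": "type 2 diabetes without complications",
                        "E11.65": "type 2 diabetes with hyperglycemia"},
    "Type 1 diabetes": {"E10.9": "type 1 diabetes without complications"},
    "Hypertension": {"I10": "essential primary hypertension"},
}

# Character n-gram TF-IDF as a cheap stand-in for SAPBert embeddings.
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
vec.fit(list(value_sets) +
        [d for codes in value_sets.values() for d in codes.values()])

def retrieve_then_classify(query: str, k: int = 2):
    q = vec.transform([query])
    # Stage 1: retrieve the K existing value sets most similar to the query
    # and pool their member codes -- this pool is the shrunken output space.
    names = list(value_sets)
    set_sims = cosine_similarity(q, vec.transform(names))[0]
    top = sorted(range(len(names)), key=lambda i: -set_sims[i])[:k]
    pool = {c: d for i in top for c, d in value_sets[names[i]].items()}
    # Stage 2: score each pooled code against the query. RASC trains a
    # classifier for this step; cosine similarity is only a placeholder.
    code_sims = cosine_similarity(q, vec.transform(list(pool.values())))[0]
    return sorted(zip(pool, code_sims), key=lambda t: -t[1])

# Candidates come only from the retrieved sets, so unrelated codes
# (e.g. hypertension's I10) never enter the ranking when k is small.
print(retrieve_then_classify("diabetes mellitus type 2"))
```

Swapping stage 2 for a trained classifier over query–code pairs (the paper's SAPBert cross-encoder or LightGBM) is what produces the reported AUROC and F1 gains; the retrieval stage alone is what keeps the candidate pool small.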
Astrobobo tool mapping
- Knowledge Capture: Record the structure of your current value set authoring process: how many manual reviews, how many codes per set, and the error rate. Use this as a baseline to measure RASC's impact.
- Reading Queue: Add the RASC GitHub repository and the VSAC benchmark documentation to your queue. Understand the SAPBert cross-encoder architecture and why it outperforms GPT-4o on this task.
- Focus Brief: Summarize the three key findings: (1) retrieval shrinks the output space, (2) a cross-encoder beats a zero-shot LLM, (3) gains hold across model types. Use this to pitch a pilot to your clinical leadership.
Frequently asked
- Why does retrieval-then-classify beat direct LLM generation? LLMs do not reliably memorize large, versioned clinical vocabularies. RASC retrieves similar existing value sets to form a candidate pool, shrinking the effective output space, and a classifier then ranks those candidates, avoiding hallucinated codes. This two-stage approach reduces irrelevant candidates per true positive from 12.3 to 3.2–4.4 (illustrated in the sketch below).
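A quick sketch of the candidate-efficiency arithmetic behind those numbers. The code sets and pool sizes here are invented to reproduce ratios near the reported 12.3 and 3.2–4.4; only the ratio, not the specific codes, comes from the paper.

```python
def irrelevant_per_true_positive(candidates: set[str], truth: set[str]) -> float:
    """Irrelevant candidates surfaced for each true positive found."""
    tp = len(candidates & truth)
    return (len(candidates) - tp) / tp if tp else float("inf")

# Illustrative pools for one value set with three true codes.
truth = {"E11.9", "E11.65", "E11.22"}
generated = truth | {f"X{i:02d}" for i in range(37)}   # wide direct-LLM pool
retrieved = truth | {"E10.9", "E10.65", "E13.9", "Z79.4",
                     "E08.9", "E09.9", "O24.4", "R73.03",
                     "E11.8", "E11.00"}                # narrow retrieved pool
print(irrelevant_per_true_positive(generated, truth))  # 12.33 (cf. 12.3)
print(irrelevant_per_true_positive(retrieved, truth))  # 3.33  (cf. 3.2-4.4)
```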
Cite
Mukherjee, S., Shu, J., Mazumder, N., Kernell, T., Wheeler, C., Hastings, S., & Sidey-Gibbons, C. (2026, April 17). Retrieval beats memorization for clinical code selection. Astrobobo Content Engine (rewrite of arxiv/cs.LG). https://astrobobo-content-engine.vercel.app/article/retrieval-beats-memorization-for-clinical-code-selection-55495e
Mukherjee, Sumit, et al. "Retrieval beats memorization for clinical code selection." Astrobobo Content Engine, 17 Apr. 2026, https://astrobobo-content-engine.vercel.app/article/retrieval-beats-memorization-for-clinical-code-selection-55495e. Based on "arxiv/cs.LG", https://arxiv.org/abs/2604.14616.
@misc{astrobobo_retrieval-beats-memorization-for-clinical-code-selection-55495e_2026,
author = {Mukherjee, Sumit and Shu, Juan and Mazumder, Nairwita and Kernell, Tate and Wheeler, Celena and Hastings, Shannon and Sidey-Gibbons, Chris},
title = {Retrieval beats memorization for clinical code selection},
year = {2026},
url = {https://astrobobo-content-engine.vercel.app/article/retrieval-beats-memorization-for-clinical-code-selection-55495e},
note = {Astrobobo rewrite of arxiv/cs.LG, https://arxiv.org/abs/2604.14616},
}