Retrieval beats memorization for clinical code selection
A two-stage retrieval-then-classify method outperforms direct LLM generation for assembling clinical value sets from large standardized vocabularies.
Retrieve similar existing value sets, then classify candidates to build clinical code lists more accurately than direct LLM generation.
- Clinical value set authoring is the task of identifying all codes in a standardized vocabulary that define a medical concept.
- LLMs cannot reliably recall large, versioned clinical vocabularies from pretraining alone.
- RASC retrieves the K most similar existing value sets, then applies a classifier to rank the pooled candidate codes (see the sketch after this list).
- A cross-encoder on SAPBert achieves AUROC 0.852 and F1 0.298, beating zero-shot GPT-4o at F1 0.105.
- Retrieval-then-classify reduces irrelevant candidates per true positive from 12.3 to 3.2–4.4.
- The performance gap widens as value set size increases, confirming the theoretical advantage of shrinking the output space.
- The benchmark, built from 11,803 publicly available VSAC value sets, is the first large-scale dataset for this task.
- Gains replicate across the SAPBert cross-encoder, LightGBM, and other classifier architectures.
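To make the two-stage pattern concrete, here is a minimal sketch of retrieve-then-classify. It is not the RASC implementation: character n-gram TF-IDF stands in for dense SAPBert embeddings, plain cosine similarity stands in for the trained cross-encoder, and the toy value sets are invented for illustration.

```python
# Minimal sketch of retrieve-then-classify; all data and names are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus of existing value sets: concept name -> {code: description}.
value_sets = {
    "Type 2 diabetes": {"E11.9": "type 2 diabetes without complications",
                        "E11.65": "type 2 diabetes with hyperglycemia"},
    "Type 1 diabetes": {"E10.9": "type 1 diabetes without complications"},
    "Hypertension": {"I10": "essential primary hypertension"},
}

# Character n-gram TF-IDF as a cheap stand-in for SAPBert embeddings.
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
vec.fit(list(value_sets) +
        [d for codes in value_sets.values() for d in codes.values()])

def retrieve_then_classify(query: str, k: int = 2):
    q = vec.transform([query])
    # Stage 1: retrieve the K existing value sets most similar to the query
    # and pool their member codes -- this pool is the shrunken output space.
    names = list(value_sets)
    set_sims = cosine_similarity(q, vec.transform(names))[0]
    top = sorted(range(len(names)), key=lambda i: -set_sims[i])[:k]
    pool = {c: d for i in top for c, d in value_sets[names[i]].items()}
    # Stage 2: score each pooled code against the query. RASC trains a
    # classifier for this step; cosine similarity is only a placeholder.
    code_sims = cosine_similarity(q, vec.transform(list(pool.values())))[0]
    return sorted(zip(pool, code_sims), key=lambda t: -t[1])

# Candidates come only from the retrieved sets, so unrelated codes
# (e.g. hypertension's I10) never enter the ranking when k is small.
print(retrieve_then_classify("diabetes mellitus type 2"))
```

Swapping stage 2 for a trained classifier over query–code pairs (the paper's SAPBert cross-encoder or LightGBM) is what produces the reported AUROC and F1 gains; the retrieval stage alone is what keeps the candidate pool small.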
Astrobobo tool mapping
- Knowledge Capture: Record the structure of your current value set authoring process: how many manual reviews, how many codes per set, and the error rate. Use this as a baseline to measure RASC's impact.
- Reading Queue: Add the RASC GitHub repository and the VSAC benchmark documentation to your queue. Understand the SAPBert cross-encoder architecture and why it outperforms GPT-4o on this task.
- Focus Brief: Summarize the three key findings: (1) retrieval shrinks the output space, (2) a cross-encoder beats a zero-shot LLM, (3) gains hold across model types. Use this to pitch a pilot to your clinical leadership.
Frequently asked
- Why does retrieval-then-classify beat direct LLM generation? LLMs do not reliably memorize large, versioned clinical vocabularies. RASC retrieves similar existing value sets to form a candidate pool, shrinking the effective output space, and a classifier then ranks those candidates, avoiding hallucinated codes. This two-stage approach reduces irrelevant candidates per true positive from 12.3 to 3.2–4.4 (illustrated in the sketch below).
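A quick sketch of the candidate-efficiency arithmetic behind those numbers. The code sets and pool sizes here are invented to reproduce ratios near the reported 12.3 and 3.2–4.4; only the ratio, not the specific codes, comes from the paper.

```python
def irrelevant_per_true_positive(candidates: set[str], truth: set[str]) -> float:
    """Irrelevant candidates surfaced for each true positive found."""
    tp = len(candidates & truth)
    return (len(candidates) - tp) / tp if tp else float("inf")

# Illustrative pools for one value set with three true codes.
truth = {"E11.9", "E11.65", "E11.22"}
generated = truth | {f"X{i:02d}" for i in range(37)}   # wide direct-LLM pool
retrieved = truth | {"E10.9", "E10.65", "E13.9", "Z79.4",
                     "E08.9", "E09.9", "O24.4", "R73.03",
                     "E11.8", "E11.00"}                # narrow retrieved pool
print(irrelevant_per_true_positive(generated, truth))  # 12.33 (cf. 12.3)
print(irrelevant_per_true_positive(retrieved, truth))  # 3.33  (cf. 3.2-4.4)
```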
Cite
Mukherjee, S., Shu, J., Mazumder, N., Kernell, T., Wheeler, C., Hastings, S., & Sidey-Gibbons, C. (2026, April 17). Retrieval beats memorization for clinical code selection. Astrobobo Content Engine (rewrite of arxiv/cs.LG). https://astrobobo-content-engine.vercel.app/article/retrieval-beats-memorization-for-clinical-code-selection-55495e
Mukherjee, Sumit, et al. "Retrieval beats memorization for clinical code selection." Astrobobo Content Engine, 17 Apr. 2026, https://astrobobo-content-engine.vercel.app/article/retrieval-beats-memorization-for-clinical-code-selection-55495e. Based on "arxiv/cs.LG", https://arxiv.org/abs/2604.14616.
@misc{astrobobo_retrieval-beats-memorization-for-clinical-code-selection-55495e_2026,
author = {Mukherjee, Sumit and Shu, Juan and Mazumder, Nairwita and Kernell, Tate and Wheeler, Celena and Hastings, Shannon and Sidey-Gibbons, Chris},
title = {Retrieval beats memorization for clinical code selection},
year = {2026},
url = {https://astrobobo-content-engine.vercel.app/article/retrieval-beats-memorization-for-clinical-code-selection-55495e},
note = {Astrobobo rewrite of arxiv/cs.LG, https://arxiv.org/abs/2604.14616},
}