ai · 4 min read · Apr 17, 2026

Retrieval beats memorization for clinical code selection

A two-stage retrieval-then-classify method outperforms direct LLM generation for assembling clinical value sets from large standardized vocabularies.

Source: arxiv/cs.LG · Sumit Mukherjee, Juan Shu, Nairwita Mazumder, Tate Kernell, Celena Wheeler, Shannon Hastings, Chris Sidey-Gibbons

Retrieve similar existing value sets, then classify candidates to build clinical code lists more accurately than direct LLM generation.

  • Clinical value set authoring is the task of identifying all codes that define a medical concept in standardized vocabularies.
  • LLMs cannot reliably recall large, versioned clinical vocabularies from pretraining alone.
  • RASC retrieves the K most similar existing value sets, then applies a classifier to rank the pooled candidate codes (see the sketch after this list).
  • A SapBERT-based cross-encoder achieves AUROC 0.852 and F1 0.298, beating zero-shot GPT-4o's F1 of 0.105.
  • Retrieval-then-classify reduces irrelevant candidates per true positive from 12.3 to 3.2–4.4.
  • The performance gap widens as value set size increases, confirming the theoretical advantage of shrinking the output space.
  • The benchmark is built from 11,803 publicly available VSAC value sets, the first large-scale dataset for this task.
  • Gains replicate across the SapBERT cross-encoder, LightGBM, and other classifier architectures.
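To make the retrieval stage concrete, here is a minimal Python sketch under stated assumptions: value set names are embedded with a SapBERT-style bi-encoder, the K most similar existing value sets are found by cosine similarity, and their codes are pooled as candidates. The `ValueSet` dataclass, the `retrieve_candidates` helper, and the use of this particular checkpoint as the retriever are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of RASC's retrieval stage: find K similar existing
# value sets, then pool their member codes as the candidate set.
from dataclasses import dataclass

import numpy as np
from sentence_transformers import SentenceTransformer


@dataclass
class ValueSet:
    name: str          # e.g. "Type 2 Diabetes Mellitus"
    codes: set[str]    # member codes from the standardized vocabulary


# SapBERT checkpoint used here purely as an illustrative bi-encoder retriever.
encoder = SentenceTransformer("cambridgeltl/SapBERT-from-PubMedBERT-fulltext")


def retrieve_candidates(query_name: str, library: list[ValueSet], k: int = 5) -> set[str]:
    """Return the union of codes from the K existing value sets whose names
    are most similar to the query concept."""
    names = [vs.name for vs in library]
    emb = encoder.encode([query_name] + names, normalize_embeddings=True)
    sims = emb[1:] @ emb[0]              # cosine similarity of each library name to the query
    top_k = np.argsort(-sims)[:k]        # indices of the K most similar value sets
    candidates: set[str] = set()
    for i in top_k:
        candidates |= library[i].codes   # pool member codes into one candidate set
    return candidates
```

Pooling from only K neighbors is what shrinks the effective output space: the classifier in the second stage never has to consider the full vocabulary, only this candidate set.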

Astrobobo tool mapping

  • Knowledge Capture: Record the structure of your current value set authoring process — how many manual reviews, how many codes per set, the error rate. Use this as a baseline to measure RASC's impact.
  • Reading Queue: Add the RASC GitHub repository and VSAC benchmark documentation to your queue. Understand the SapBERT cross-encoder architecture and why it outperforms GPT-4o on this task.
  • Focus Brief: Summarize the three key findings: (1) retrieval shrinks the output space, (2) the cross-encoder beats the zero-shot LLM, (3) gains hold across model types. Use this to pitch a pilot to your clinical leadership.

Frequently asked

  • Why does retrieval-then-classify beat direct LLM generation? LLMs do not reliably memorize large, versioned clinical vocabularies. RASC retrieves similar existing value sets to form a candidate pool, shrinking the effective output space; a classifier trained against ground-truth memberships then ranks the candidates, avoiding hallucinated codes. This two-stage approach reduces irrelevant candidates per true positive from 12.3 to 3.2–4.4 (see the classifier sketch below).
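The second stage can be sketched as a cross-encoder that scores each (concept, candidate description) pair. The checkpoint name below is a real SapBERT release, but the single-logit classification head is freshly initialized and would need fine-tuning on VSAC-style labels, as the paper's own classifier is; the `rank_candidates` helper is an assumption for illustration, not the authors' code.

```python
# Hedged sketch of the classify stage: a SapBERT-based cross-encoder scores
# candidate codes against the target concept. The classification head here
# is untrained and serves only to show the pipeline shape.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "cambridgeltl/SapBERT-from-PubMedBERT-fulltext"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=1)


def rank_candidates(concept: str, candidates: dict[str, str]) -> list[tuple[str, float]]:
    """Score candidate codes (code -> text description) against the target
    concept and return (code, score) pairs sorted by relevance."""
    descriptions = list(candidates.values())
    batch = tokenizer([concept] * len(descriptions), descriptions,
                      padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        scores = model(**batch).logits.squeeze(-1)   # one relevance logit per pair
    return sorted(zip(candidates.keys(), scores.tolist()),
                  key=lambda x: x[1], reverse=True)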
cite
APA
Sumit Mukherjee, Juan Shu, Nairwita Mazumder, Tate Kernell, Celena Wheeler, Shannon Hastings, Chris Sidey-Gibbons. (2026, April 17). Retrieval beats memorization for clinical code selection. Astrobobo Content Engine (rewrite of arxiv/cs.LG). https://astrobobo-content-engine.vercel.app/article/retrieval-beats-memorization-for-clinical-code-selection-55495e
MLA
Sumit Mukherjee, Juan Shu, Nairwita Mazumder, Tate Kernell, Celena Wheeler, Shannon Hastings, Chris Sidey-Gibbons. "Retrieval beats memorization for clinical code selection." Astrobobo Content Engine, 17 Apr 2026, https://astrobobo-content-engine.vercel.app/article/retrieval-beats-memorization-for-clinical-code-selection-55495e. Based on "arxiv/cs.LG", https://arxiv.org/abs/2604.14616.
BibTeX
@misc{astrobobo_retrieval-beats-memorization-for-clinical-code-selection-55495e_2026,
  author       = {Sumit Mukherjee and Juan Shu and Nairwita Mazumder and Tate Kernell and Celena Wheeler and Shannon Hastings and Chris Sidey-Gibbons},
  title        = {Retrieval beats memorization for clinical code selection},
  year         = {2026},
  url          = {https://astrobobo-content-engine.vercel.app/article/retrieval-beats-memorization-for-clinical-code-selection-55495e},
  note         = {Astrobobo rewrite of arxiv/cs.LG, https://arxiv.org/abs/2604.14616},
}
