Retrieval-Augmented Set Completion for Clinical Code Authoring
A two-stage approach, RASC, first retrieves similar existing clinical value sets and then classifies each candidate code with a fine-tuned model, reducing hallucination and outperforming direct LLM generation on standardized medical vocabularies.
- Clinical value set authoring is the task of identifying all codes that represent a medical concept in standardized vocabularies.
- Direct LLM prompting fails because these vocabularies are large, versioned, and not reliably memorized by the model.
- RASC retrieves the K most similar existing value sets from a corpus, then applies a classifier to each candidate code.
- A cross-encoder fine-tuned on SapBERT achieves AUROC 0.852, outperforming an MLP (0.799) and GPT-4o zero-shot (F1 0.105).
- GPT-4o returns 48.6% of codes absent from the official vocabulary, indicating hallucination.
- A retrieval-only baseline produces 12.3 irrelevant codes per true positive; the classifiers reduce this to 3.2–4.4.
- The performance gap widens as value set size increases, confirming the theoretical advantage of shrinking the output space.
- A benchmark dataset of 11,803 VSAC value sets enables reproducible evaluation.
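The retrieve-then-classify pipeline above can be sketched in miniature. This is a hypothetical illustration, not the paper's implementation: the toy corpus, term-overlap retrieval, and lexical scoring stand in for the paper's dense retrieval over VSAC value sets and its fine-tuned cross-encoder classifier.

```python
# Minimal sketch of the two-stage RASC idea. All data and scoring here are
# toy stand-ins: the paper retrieves over 11,803 VSAC value sets and scores
# (query, code) pairs with a cross-encoder fine-tuned on SapBERT.

def retrieve_similar_sets(query_terms, corpus, k=2):
    """Stage 1: rank existing value sets by term overlap with the query."""
    def overlap(value_set):
        return len(query_terms & value_set["terms"])
    return sorted(corpus, key=overlap, reverse=True)[:k]

def classify_candidates(query_terms, candidate_codes, threshold=0.6):
    """Stage 2: keep candidate codes whose relevance score clears a
    threshold. A real system would use a learned classifier; this toy
    score is the fraction of a code's terms shared with the query."""
    kept = []
    for code, terms in candidate_codes.items():
        score = len(query_terms & terms) / max(len(terms), 1)
        if score >= threshold:
            kept.append(code)
    return kept

# Toy corpus of existing value sets: descriptive terms plus member codes.
corpus = [
    {"name": "diabetes", "terms": {"diabetes", "mellitus", "type2"},
     "codes": {"E11.9": {"diabetes", "type2"},
               "E10.9": {"diabetes", "type1"}}},
    {"name": "hypertension", "terms": {"hypertension", "blood", "pressure"},
     "codes": {"I10": {"hypertension", "essential"}}},
]

query = {"diabetes", "type2"}
similar = retrieve_similar_sets(query, corpus, k=1)
candidates = {c: t for s in similar for c, t in s["codes"].items()}
selected = classify_candidates(query, candidates)
```

Because candidates come only from retrieved real value sets, every code the pipeline emits exists in the vocabulary by construction, which is the structural reason RASC avoids the out-of-vocabulary hallucinations seen with direct generation.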
Astrobobo tool mapping
- Knowledge Capture: Document your organization's existing value set corpus and metadata (e.g., clinical domain, code counts, update frequency) to prepare for retrieval-based automation.
- Focus Brief: Summarize the AUROC and F1 scores from RASC and competing baselines in a one-page decision brief for clinical leadership, highlighting hallucination rates in GPT-4o.
- Reading Queue: Queue related papers on clinical NLP, vocabulary standardization (SNOMED CT, ICD-10), and retrieval-augmented generation to deepen domain context.
- Daily Log: Track pilot results if you implement RASC: measure time saved per value set, false positive rate, and clinician feedback on code relevance.
Frequently asked
- Why does direct LLM generation fail at this task? Large language models are not reliably trained on the full, versioned clinical vocabularies (e.g., SNOMED CT, ICD-10). They hallucinate codes that do not exist in official systems, creating compliance and data integrity risks. Retrieval-augmented approaches ground the model in a curated corpus of real codes, eliminating out-of-vocabulary hallucinations.
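The hallucination rate reported above (48.6% of GPT-4o's codes absent from the official vocabulary) amounts to a simple membership check. A minimal sketch, assuming a toy vocabulary subset; a real check would load the full versioned code system from its official distribution, and the example codes here are illustrative only:

```python
# Flag generated codes that do not exist in the official vocabulary.
# The vocabulary below is a tiny toy subset, not a real ICD-10-CM release.

official_vocabulary = {"E11.9", "E10.9", "I10"}

def out_of_vocabulary(generated_codes, vocabulary):
    """Return generated codes absent from the vocabulary (hallucinations)."""
    return [c for c in generated_codes if c not in vocabulary]

# Hypothetical LLM output: two real codes and two invented ones.
llm_output = ["E11.9", "E11.99", "I10", "Z99.X1"]
hallucinated = out_of_vocabulary(llm_output, official_vocabulary)
rate = len(hallucinated) / len(llm_output)
```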
Cite
Sumit Mukherjee, Juan Shu, Nairwita Mazumder, Tate Kernell, Celena Wheeler, Shannon Hastings, Chris Sidey-Gibbons. (2026, April 17). Retrieval-Augmented Set Completion for Clinical Code Authoring. Astrobobo Content Engine (rewrite of arxiv/cs.LG). https://astrobobo-content-engine.vercel.app/article/retrieval-augmented-set-completion-for-clinical-code-authoring-55495e
Sumit Mukherjee, Juan Shu, Nairwita Mazumder, Tate Kernell, Celena Wheeler, Shannon Hastings, Chris Sidey-Gibbons. "Retrieval-Augmented Set Completion for Clinical Code Authoring." Astrobobo Content Engine, 17 Apr 2026, https://astrobobo-content-engine.vercel.app/article/retrieval-augmented-set-completion-for-clinical-code-authoring-55495e. Based on "arxiv/cs.LG", https://arxiv.org/abs/2604.14616.
@misc{astrobobo_retrieval-augmented-set-completion-for-clinical-code-authoring-55495e_2026,
  author = {Sumit Mukherjee and Juan Shu and Nairwita Mazumder and Tate Kernell and Celena Wheeler and Shannon Hastings and Chris Sidey-Gibbons},
  title = {Retrieval-Augmented Set Completion for Clinical Code Authoring},
  year = {2026},
  url = {https://astrobobo-content-engine.vercel.app/article/retrieval-augmented-set-completion-for-clinical-code-authoring-55495e},
  note = {Astrobobo rewrite of arxiv/cs.LG, https://arxiv.org/abs/2604.14616},
}