ai · 6 min read · Apr 21, 2026

Automating Dataset Creation with LLMs and Search Engines

Researchers propose ADC, a method to build large labeled datasets automatically using language models and web search, reducing manual annotation work and cost.

Source: arxiv/cs.LG · Minghao Liu, Zonglin Di, Jiaheng Wei, Zhongruo Wang, Hengxiang Zhang, Ruixuan Xiao, Haoyu Wang, Jinlong Pang, Hao Chen, Ankit Shah, Hongxin Wei, Xinlei He, Zhaowei Zhao, Haobo Wang, Lei Feng, Jindong Wang, James Davis, Yang Liu

ADC automates dataset construction by using LLMs to design classes and generate search queries, collecting and curating samples with minimal human effort.

  • LLMs design detailed class hierarchies and generate the code that collects images from search engines automatically.
  • The Clothing-ADC dataset contains more than 1M images across 12 main classes and 12,000 fine-grained subclasses, all collected through this automated pipeline.
  • Automated curation achieves 79% agreement with human annotators, reducing label noise from 22.2% to 10.7%.
  • Real-world challenges remain: label errors persist and data distributions become imbalanced across classes.
  • Open-source toolkit includes methods for label error detection and robust learning under noisy conditions.
  • Three benchmark datasets support evaluation of label noise detection, noise-robust learning, and class-imbalanced learning.
  • Existing methods are evaluated on these benchmarks to establish baselines and guide future research.
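The agreement, label-noise, and class-imbalance figures above can be made concrete with a small calculation. The sketch below is a minimal illustration, not the paper's evaluation code: it assumes agreement is the fraction of audited samples whose automated label matches a human label, treats noise as one minus that fraction, and measures imbalance as the largest class count divided by the smallest. The function names and the ten-sample audit are hypothetical placeholders.

```python
from collections import Counter


def agreement_rate(auto_labels, human_labels):
    """Fraction of audited samples where the automated label matches the human label."""
    if len(auto_labels) != len(human_labels) or not auto_labels:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(a == h for a, h in zip(auto_labels, human_labels))
    return matches / len(auto_labels)


def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest class."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Hypothetical audit sample (placeholder values, not data from the paper).
auto = ["dress", "dress", "shirt", "coat", "dress",
        "shirt", "coat", "dress", "dress", "shirt"]
human = ["dress", "shirt", "shirt", "coat", "dress",
         "shirt", "dress", "dress", "dress", "shirt"]

agreement = agreement_rate(auto, human)   # 8 of 10 labels match
noise = 1.0 - agreement                   # estimated label-noise rate
ratio = imbalance_ratio(auto)             # 5 "dress" vs 2 "coat"

print(f"agreement={agreement:.2f} noise={noise:.2f} imbalance={ratio:.1f}")
```

Running the same two measurements on a small human-audited subset is one way to check whether an automatically collected dataset needs noise-robust or imbalance-aware training before it is used.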

Astrobobo tool mapping

  • Knowledge Capture: Document your domain's class taxonomy and annotation rules in a structured format so an LLM can parse and extend them for automated search query generation.
  • Reading Queue: Queue the ADC paper and related label-noise detection papers to understand which noise-handling methods suit your downstream model architecture.
  • Focus Brief: Create a brief comparing ADC's approach to your current manual labeling workflow: time per sample, cost, and error rate. Use it to justify pilot automation.
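As a rough illustration of the Knowledge Capture item, the sketch below stores a taxonomy and its annotation rules as a plain dictionary, expands it into search queries, and renders it as a prompt an LLM could be asked to extend. Everything in it is an assumption for illustration: the schema, the class and attribute names, and the helper names `build_queries` and `render_prompt` are hypothetical, and a deterministic template stands in for the LLM step that ADC uses.

```python
import json
from itertools import product

# Hypothetical taxonomy; class names, attributes, and rules are placeholders.
TAXONOMY = {
    "rules": ["one garment per image", "label by the most prominent garment"],
    "classes": {
        "dress": {"attributes": {"color": ["red", "black"], "style": ["maxi", "wrap"]}},
        "coat": {"attributes": {"color": ["beige"], "style": ["trench"]}},
    },
}


def build_queries(taxonomy):
    """Return (query, main_class) pairs, one per attribute combination."""
    pairs = []
    for name, spec in taxonomy["classes"].items():
        attrs = spec["attributes"]
        keys = sorted(attrs)  # fixed attribute order keeps queries reproducible
        for combo in product(*(attrs[k] for k in keys)):
            pairs.append((" ".join(combo + (name,)), name))
    return pairs


def render_prompt(taxonomy):
    """Embed the taxonomy as JSON in a prompt asking an LLM to extend it."""
    return (
        "Extend the following taxonomy with more subclasses. "
        "Keep the same JSON schema.\n"
        + json.dumps(taxonomy, indent=2, sort_keys=True)
    )


queries = build_queries(TAXONOMY)
prompt = render_prompt(TAXONOMY)

for query, label in queries:
    print(f"{label}: {query}")
```

Because each query carries the class it was generated for, every image returned for that query can inherit the label without per-sample annotation, which is also where label noise enters if the search results drift off topic.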

Frequently asked

  • ADC uses LLMs to design classes and generate search queries automatically, then collects samples from search engines without manual labeling. This removes the per-sample cost of human annotation. The Clothing-ADC dataset (1M+ images) was built at negligible cost, whereas labeling it manually would require thousands of hours. The remaining costs are limited to quality assurance and noise detection on the automated output.
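The "thousands of hours" estimate can be sanity-checked with simple arithmetic. The sketch below assumes a labeling speed of 10 seconds per image, which is an illustrative guess rather than a figure from the paper, and uses 1,000,000 images as a lower bound for the "1M+" dataset size.

```python
def manual_labeling_hours(num_samples, seconds_per_sample):
    """Total annotator hours needed to label every sample once."""
    return num_samples * seconds_per_sample / 3600


NUM_IMAGES = 1_000_000   # lower bound taken from the "1M+" figure
SECONDS_PER_IMAGE = 10   # assumption, not a number from the paper

hours = manual_labeling_hours(NUM_IMAGES, SECONDS_PER_IMAGE)
print(f"{hours:,.0f} annotator hours")
```

Under these assumptions a single labeling pass comes to roughly 2,800 hours. Slower labeling or multiple annotators per image would raise the total proportionally.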
cite
APA
Minghao Liu, Zonglin Di, Jiaheng Wei, Zhongruo Wang, Hengxiang Zhang, Ruixuan Xiao, Haoyu Wang, Jinlong Pang, Hao Chen, Ankit Shah, Hongxin Wei, Xinlei He, Zhaowei Zhao, Haobo Wang, Lei Feng, Jindong Wang, James Davis, Yang Liu. (2026, April 21). Automating Dataset Creation with LLMs and Search Engines. Astrobobo Content Engine (rewrite of arxiv/cs.LG). https://astrobobo-content-engine.vercel.app/article/automating-dataset-creation-with-llms-and-search-engines-5169c6
MLA
Minghao Liu, Zonglin Di, Jiaheng Wei, Zhongruo Wang, Hengxiang Zhang, Ruixuan Xiao, Haoyu Wang, Jinlong Pang, Hao Chen, Ankit Shah, Hongxin Wei, Xinlei He, Zhaowei Zhao, Haobo Wang, Lei Feng, Jindong Wang, James Davis, Yang Liu. "Automating Dataset Creation with LLMs and Search Engines." Astrobobo Content Engine, 21 Apr 2026, https://astrobobo-content-engine.vercel.app/article/automating-dataset-creation-with-llms-and-search-engines-5169c6. Based on "arxiv/cs.LG", https://arxiv.org/abs/2408.11338.
BibTeX
@misc{astrobobo_automating-dataset-creation-with-llms-and-search-engines-5169c6_2026,
  author       = {Minghao Liu and Zonglin Di and Jiaheng Wei and Zhongruo Wang and Hengxiang Zhang and Ruixuan Xiao and Haoyu Wang and Jinlong Pang and Hao Chen and Ankit Shah and Hongxin Wei and Xinlei He and Zhaowei Zhao and Haobo Wang and Lei Feng and Jindong Wang and James Davis and Yang Liu},
  title        = {Automating Dataset Creation with LLMs and Search Engines},
  year         = {2026},
  url          = {https://astrobobo-content-engine.vercel.app/article/automating-dataset-creation-with-llms-and-search-engines-5169c6},
  note         = {Astrobobo rewrite of arxiv/cs.LG, https://arxiv.org/abs/2408.11338},
}
