Tag
#llm
42 insights
- ai · arxiv/cs.AI · 8 min
Safe Bilevel Delegation: Runtime Safety Control for Multi-Agent LLM Systems
A formal framework that dynamically adjusts safety-efficiency trade-offs when delegating tasks to specialized AI sub-agents during execution.
May 2, 2026 Read → - ai · arxiv/cs.AI · 8 min
Benchmark Rubrics Shift LLM Scores in Financial NLP Tasks
How wording changes in evaluation criteria and metric selection alter model rankings on financial text benchmarks, requiring governance over gold-label assumptions.
May 2, 2026 Read → - ai · arxiv/cs.AI · 3 min
Multi-agent framework automates recommendation system tuning
AgenticRecTune uses specialized LLM agents to optimize configuration across pre-ranking, ranking, and re-ranking pipelines without manual tuning.
May 1, 2026 Read → - ai · arxiv/cs.AI · 8 min
LLMs Withhold Help When They Misread Intent, Not Lack Knowledge
A new benchmark reveals that language models often refuse benign requests due to misinterpreting user intent, and their ability to recover utility through clarification varies widely.
May 1, 2026 Read → - ai · hackernoon · 2 min
HackerNoon's April 2026 Digest: AI Costs, Data Pipelines, and Local Models
A structured pass through HackerNoon's April 29 roundup, surfacing the signal on AI tooling costs, data sourcing, and LLM deployment tradeoffs.
Apr 30, 2026 Read → - ai · hackernoon · 6 min
Continuity in AI agents requires architecture, not bigger memory stores
A solo builder argues that persistent AI identity depends on scheduled cognition cycles and narrative compression, not retrieval systems.
Apr 30, 2026 Read → - ai · arxiv/cs.AI · 4 min
Evergreen: Cost-Efficient Verification of LLM-Generated Claims
A system that recasts claim verification as semantic queries, reducing LLM costs by 3.2x while maintaining accuracy on aggregated data.
Apr 30, 2026 Read → - ai · arxiv/cs.AI · 8 min
LATTICE: Measuring Crypto Agent Quality Beyond Accuracy
New benchmark evaluates how well AI agents support user decisions in crypto, not just whether they get answers right.
Apr 30, 2026 Read → - ai · arxiv/cs.LG · 4 min
Efficient Rationale Retrieval via Student-Teacher Distillation
Rabtriever reduces computational cost of LLM-based document ranking by distilling cross-encoder knowledge into independent query-document encoders.
Apr 28, 2026 Read → - ai · arxiv/cs.AI · 8 min
Poisoned Pretraining: Hidden Attacks Embedded in LLM Training Data
Researchers demonstrate how adversaries can plant dormant malicious logic in large language models by seeding poisoned content across obscure websites, evading detection until triggered.
Apr 27, 2026 Read → - ai · arxiv/cs.LG · 4 min
LLMs use hidden confidence signals to detect and fix their own errors
Research shows large language models maintain a second-order evaluative signal that predicts error detection and self-correction beyond what their output probabilities reveal.
Apr 27, 2026 Read → - ai · arxiv/cs.AI · 8 min
Rule-Based AI Needs Policy Grounding, Not Label Agreement
Content moderation systems fail when evaluated by human agreement alone. A new framework measures whether decisions logically follow stated rules instead.
Apr 26, 2026 Read → - ai · hackernoon · 6 min
HackerNoon's 100 AI Reading List: What It Covers and Where It Falls Short
A ranked collection of free AI articles from HackerNoon, ordered by reader engagement, spanning deployment, ethics, and applied ML.
Apr 26, 2026 Read → - ai · arxiv/cs.AI · 8 min
FHIR Format Choice Shifts LLM Medication Safety by 19 Points
How you serialize patient data to language models dramatically changes reconciliation accuracy, with smaller models favoring narrative text and large models preferring raw JSON.
Apr 25, 2026 Read → - ai · arxiv/cs.AI · 6 min
LLM Safety Filters Fail Differently Across Dialects and Explicit Identity
Research shows language models refuse requests more often when users state their identity explicitly, but bypass safety guardrails when using dialect signals like AAVE.
Apr 24, 2026 Read → - ai · arxiv/cs.AI · 8 min
Junk Data Degrades LLM Reasoning; Twitter Study Shows Lasting Harm
Continual training on low-quality social media text causes measurable cognitive decline in language models, with reasoning and safety capabilities dropping significantly.
Apr 23, 2026 Read → - ai · arxiv/cs.AI · 8 min
AI Bias in Code Decisions: Prompt Wording Shifts Model Choices
Researchers find that small phrasing changes in prompts push AI systems toward poor software engineering decisions, and standard prompt techniques don't fix it.
Apr 23, 2026 Read → - ai · arxiv/cs.AI · 4 min
Automated quantization shrinks spike-driven language models for edge devices
QSLM framework compresses neural network models by up to 86.5% while preserving accuracy, enabling deployment on resource-constrained embedded hardware.
Apr 22, 2026 Read → - ai · arxiv/cs.LG · 8 min
Simpler Optimizers Make LLM Unlearning More Robust
Research shows that using lower-order optimization methods during LLM unlearning produces forgetting that resists post-training attacks better than sophisticated gradient-based approaches.
Apr 21, 2026 Read → - ai · arxiv/cs.LG · 4 min
LLMs complement but don't replace classical hyperparameter optimization
A study comparing LLM agents to classical algorithms like CMA-ES and TPE finds hybrid approaches work best for tuning model hyperparameters under compute constraints.
Apr 21, 2026 Read → - ai · arxiv/cs.LG · 6 min
Automating Dataset Creation with LLMs and Search Engines
Researchers propose ADC, a method to build large labeled datasets automatically using language models and web search, reducing manual annotation work and cost.
Apr 21, 2026 Read → - engineering · arxiv/cs.LG · 4 min
Kernel-Level LLM Safety via Logit Inspection
ProbeLogits reads token probabilities before generation to enforce safety policies at the OS level, achieving parity with learned classifiers at 2.5x speed.
Apr 21, 2026 Read → - ai · arxiv/cs.AI · 4 min
Interpretable Traces Don't Guarantee Better LLM Reasoning
Research shows Chain-of-Thought traces improve model performance but confuse users, and correctness of intermediate steps barely predicts final accuracy.
Apr 20, 2026 Read → - ai · arxiv/cs.AI · 5 min
LLMs Can Infer Unspoken Intent in Collaborative Tasks
Researchers tested whether large language models can interpret incomplete instructions by reasoning about a human partner's mental state, matching human performance.
Apr 20, 2026 Read → - ai · arxiv/cs.AI · 6 min
OjaKV: Online Low-Rank Compression for LLM Key-Value Caches
A hybrid storage and adaptive subspace method reduces KV cache memory by compressing intermediate tokens while preserving critical anchors, compatible with FlashAttention.
Apr 20, 2026 Read → - engineering · hackernoon · 6 min
Indirect Prompt Injection Turns RAG Documents Into Attack Vectors
Malicious instructions hidden inside ingested PDFs can override LLM system prompts before any chat-layer firewall ever sees them.
Apr 19, 2026 Read → - engineering · hackernoon · 7 min
Claude Code model tiers and effort levels, explained plainly
Choosing the wrong model or effort level in Claude Code wastes tokens silently. Here is what each setting actually controls.
Apr 19, 2026 Read → - engineering · hackernoon · 7 min
LLMesh routes local LLM requests across machines via one endpoint
A distributed inference broker lets teams share GPU hardware without changing application code between dev, staging, and production.
Apr 18, 2026 Read → - engineering · hackernoon · 6 min
Bots Follow Scripts; Agents Pursue Goals — Know the Difference
A structural comparison of rule-based bots and LLM-driven agents, with a framework for choosing the right autonomy level.
Apr 18, 2026 Read → - ai · hackernoon · 2 min
HackerNoon indexes 218 articles on AI agents for self-directed study
A curated reading list from HackerNoon's Learn Repo maps the AI agent landscape across frameworks, protocols, security, and production failures.
Apr 18, 2026 Read → - ai · arxiv/cs.AI · 4 min
TableNet: LLM-Driven Dataset for Table Structure Recognition
Researchers introduce an autonomous multi-agent system that generates synthetic tables at scale and uses active learning to train structure recognition models more efficiently.
Apr 17, 2026 Read → - ai · arxiv/cs.AI · 8 min
Token Importance in On-Policy Distillation: Entropy and Disagreement
Research identifies two regions of high-value tokens in knowledge distillation: high-entropy positions and low-entropy positions where student and teacher disagree, enabling 50–80% token reduction.
Apr 17, 2026 Read → - ai · arxiv/cs.AI · 8 min
Formal framework for multi-agent AI system safety and coordination
Researchers propose unified semantic models and 30 temporal-logic properties to verify behavior, detect coordination failures, and prevent vulnerabilities in agentic AI systems.
Apr 17, 2026 Read → - ai · arxiv/cs.AI · 4 min
LLM scripting brings petascale climate visualization to laptops
Researchers demonstrate a framework that lets domain scientists animate massive NASA climate datasets on commodity hardware using natural-language prompts instead of specialized graphics expertise.
Apr 17, 2026 Read → - ai · arxiv/cs.AI · 8 min
Small Models Match Large Ones via Inference Scaffolding
McClendon et al. show that role-based prompt structuring at inference time doubles small-model performance on complex tasks without retraining.
Apr 17, 2026 Read → - ai · arxiv/cs.AI · 8 min
LLMs show human-like trust bias toward people, with demographic blind spots
Study of 43,200 experiments reveals language models develop trust patterns similar to humans, including susceptibility to age, religion, and gender bias in financial decisions.
Apr 17, 2026 Read → - ai · arxiv/cs.AI · 6 min
Measuring Where Chatbots Beat Humans on Tests
Researchers apply psychometric methods to identify test items where LLMs systematically outperform human learners, revealing assessment vulnerabilities.
Apr 17, 2026 Read → - ai · arxiv/cs.AI · 8 min
LLMs hit formal reasoning ceiling; Chomsky Hierarchy reveals efficiency gap
New benchmark shows large language models struggle with structured complexity tasks and require prohibitive compute to achieve reliability in formal reasoning.
Apr 17, 2026 Read → - ai · arxiv/cs.LG · 3 min
Framework uses AI outputs as features, not proxies, for labeled data
Generative Augmented Inference treats LLM predictions as informative signals rather than direct substitutes, reducing human labeling needs by 75–90% across operations tasks.
Apr 17, 2026 Read → - ai · arxiv/cs.LG · 8 min
LLM Panels Match Expert Clinicians in Medical Diagnosis Scoring
A study of three frontier AI models scoring real hospital cases shows calibrated LLM juries can reliably replace human expert panels for medical AI evaluation.
Apr 17, 2026 Read → - ai · arxiv/cs.LG · 4 min
Retrieval-Augmented Set Completion for Clinical Code Authoring
A two-stage approach retrieves similar clinical value sets then classifies candidates, outperforming direct LLM generation on standardized medical vocabularies.
Apr 17, 2026 Read → - ai · arxiv/cs.LG · 4 min
Retrieval beats memorization for clinical code selection
A two-stage retrieval-then-classify method outperforms direct LLM generation for assembling clinical value sets from large standardized vocabularies.
Apr 17, 2026 Read →