Speech Models Fail Safety Tests That Text Models Pass
A new benchmark reveals that speech language models drop safety, fairness, and privacy protections when cues arrive as audio rather than text.
Speech language models recognize social norms in text but fail to enforce them when speaker identity, tone, or environment arrive as audio.
- — VoxSafeBench tests safety, fairness, and privacy across 22 bilingual tasks in speech contexts.
- — Tier 1 compares matched text and audio inputs to isolate audio-specific risks.
- — Tier 2 uses benign transcripts where response depends on speaker, tone, or location.
- — Models detect acoustic cues but fail to apply appropriate safeguards based on them.
- — Safety drops for speaker- and scene-conditioned risks; fairness erodes with vocal demographic cues.
- — Privacy protections weaken when contextual information arrives through speech rather than text.
- — A speech grounding gap exists: models recognize norms textually but not acoustically.
Astrobobo tool mapping
- Knowledge Capture Record the specific failure mode (e.g., 'model detects child speaker but grants access to adult-only content') and the acoustic feature that triggered it (age, tone, background noise).
- Focus Brief Summarize the gap between your text-based safety tests and your speech-based safety tests in a one-page comparison. Identify which risk category (safety, fairness, privacy) shows the largest drop.
- Reading Queue Queue the VoxSafeBench paper and its code repository for deeper review of Tier 2 task design and intermediate perception probes.
Frequently asked
- The speech grounding gap is the failure of speech language models to apply safety, fairness, and privacy rules when the relevant cue arrives as audio rather than text. Models often recognize the social norm when presented as text but ignore it when the same information is embedded in speaker identity, tone, or environment. This gap exposes a mismatch between text-based and audio-based reasoning.
cite ▸
Yuxiang Wang, Hongyu Liu, Yijiang Xu, Qinke Ni, Li Wang, Wan Lin, Kunyu Feng, Dekun Chen, Xu Tan, Lei Wang, Jie Shi, Zhizheng Wu. (2026, April 17). Speech Models Fail Safety Tests That Text Models Pass. Astrobobo Content Engine (rewrite of arxiv/cs.LG). https://astrobobo-content-engine.vercel.app/article/speech-models-fail-safety-tests-that-text-models-pass-210565
Yuxiang Wang, Hongyu Liu, Yijiang Xu, Qinke Ni, Li Wang, Wan Lin, Kunyu Feng, Dekun Chen, Xu Tan, Lei Wang, Jie Shi, Zhizheng Wu. "Speech Models Fail Safety Tests That Text Models Pass." Astrobobo Content Engine, 17 Apr 2026, https://astrobobo-content-engine.vercel.app/article/speech-models-fail-safety-tests-that-text-models-pass-210565. Based on "arxiv/cs.LG", https://arxiv.org/abs/2604.14548.
@misc{astrobobo_speech-models-fail-safety-tests-that-text-models-pass-210565_2026,
author = {Yuxiang Wang, Hongyu Liu, Yijiang Xu, Qinke Ni, Li Wang, Wan Lin, Kunyu Feng, Dekun Chen, Xu Tan, Lei Wang, Jie Shi, Zhizheng Wu},
title = {Speech Models Fail Safety Tests That Text Models Pass},
year = {2026},
url = {https://astrobobo-content-engine.vercel.app/article/speech-models-fail-safety-tests-that-text-models-pass-210565},
note = {Astrobobo rewrite of arxiv/cs.LG, https://arxiv.org/abs/2604.14548},
}