ai · 6 min read · Apr 24, 2026

LLM Safety Filters Fail Differently Across Dialects and Explicit Identity

Research shows language models refuse requests more often when users state their identity explicitly, yet dialect signals such as AAVE slip past the same safety guardrails.

Source: arxiv/cs.AI · Irti Haq, Belén Saldías

LLMs apply stricter safety filters to explicit identity claims than to implicit dialect signals, creating unequal user experiences.

  • Explicit identity prompts (e.g., 'I am Black') trigger higher refusal rates and aggressive content filtering.
  • Implicit dialect cues (AAVE, Singlish) reduce refusal probability to near zero while increasing semantic similarity to reference text.
  • Safety alignment mechanisms rely heavily on explicit keywords, missing socio-linguistic signals that bypass guardrails.
  • Dialect-based requests receive less sanitized, potentially more hostile information than standard English equivalents.
  • Current safety techniques create a bifurcated user experience: cautious output for standard-English speakers, raw output for dialect speakers.
  • The study analyzed 24,000+ responses from Gemma-3-12B and Qwen-3-VL-8B across sensitive domains using a factorial design.
  • Fundamental tension exists between equitable safety and linguistic diversity in alignment training.
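The refusal-rate comparison above can be sketched as a small audit loop. This is a minimal illustration, not the paper's methodology: the prompt-variant names, refusal markers, and mock responses are all invented for the example.

```python
# Hypothetical audit sketch: compare refusal rates across prompt variants.
# Variant names, refusal markers, and mock responses are illustrative only.

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "as an ai")

def is_refusal(response: str) -> bool:
    """Crude surface check: does the response open with a refusal phrase?"""
    head = response.lower().strip()
    return any(head.startswith(m) for m in REFUSAL_MARKERS)

def refusal_rate(responses: list[str]) -> float:
    """Fraction of responses classified as refusals."""
    return sum(is_refusal(r) for r in responses) / len(responses)

# Mock responses standing in for model outputs per prompt variant.
runs = {
    "standard_english":  ["Here is an overview...", "I can't help with that."],
    "explicit_identity": ["I cannot assist with this request.",
                          "I can't help with that."],
    "dialect_aave":      ["Here is an overview...", "Sure, here you go..."],
}

deltas = {variant: refusal_rate(rs) for variant, rs in runs.items()}
# explicit_identity -> 1.0, dialect_aave -> 0.0, standard_english -> 0.5
```

A real audit would replace the surface-string refusal check with a classifier and the mock responses with live model outputs, but the per-variant delta computation stays the same shape.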

Astrobobo tool mapping

  • Knowledge Capture: Log the three-prompt test results and refusal-rate deltas in a structured format. Tag with 'safety-audit' and 'dialect-bias' for later analysis.
  • Focus Brief: Summarize findings in a one-page brief for your safety or product team. Highlight which dialects trigger higher refusal and which bypass filters.
  • Reading Queue: Queue the full arxiv paper and related work on socio-linguistic bias in NLP. Assign to team members responsible for safety alignment.
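A structured log entry for the Knowledge Capture step might look like the sketch below. The schema, field names, and refusal-rate numbers are placeholders invented for illustration, not a real Astrobobo format or results from the paper.

```python
# Sketch of a structured audit-log entry for the Knowledge Capture step.
# Schema and numbers are assumptions, not a real Astrobobo format.
import json
from datetime import date

entry = {
    "date": date.today().isoformat(),
    "source": "https://arxiv.org/abs/2604.21152",
    "tags": ["safety-audit", "dialect-bias"],
    "results": [  # placeholder refusal rates, not from the paper
        {"variant": "standard_english",  "refusal_rate": 0.41},
        {"variant": "explicit_identity", "refusal_rate": 0.78},
        {"variant": "dialect_aave",      "refusal_rate": 0.05},
    ],
}

print(json.dumps(entry, indent=2))
```

Keeping the entries machine-readable (rather than free-form notes) makes later cross-run comparison of refusal-rate deltas trivial to script.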

Frequently asked

  • Why does an explicit statement like 'I am Black' trigger a refusal? Safety filters are trained to detect and block explicit demographic keywords as a crude risk-mitigation tactic. When a user says 'I am Black,' the model's safety layer flags the demographic label itself, not the actual request content. This over-indexes on explicit cues and misses the nuance of what the user is actually asking for.
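The failure mode described above, over-indexing on explicit keywords while being blind to dialect cues, can be caricatured in a few lines. The trigger list and prompts are invented for illustration; real safety layers are learned classifiers, not literal keyword lists.

```python
# Caricature of a keyword-triggered safety layer: it flags explicit
# demographic labels but passes dialect-signaled requests untouched.
# Trigger phrases and example prompts are invented for illustration.

EXPLICIT_TRIGGERS = {"i am black", "as a black person", "i am muslim"}

def keyword_safety_flag(prompt: str) -> bool:
    """Return True if the prompt contains an explicit demographic label."""
    p = prompt.lower()
    return any(trigger in p for trigger in EXPLICIT_TRIGGERS)

# Same benign request, two framings:
keyword_safety_flag("I am Black. How do I contest a parking ticket?")
# -> True: the explicit label alone trips the filter.
keyword_safety_flag("How I'm finna contest this parking ticket?")
# -> False: the dialect cue carries the same identity signal but passes.
```

The point of the sketch is the asymmetry: the filter keys on the surface string, so the same underlying user and request get different treatment depending on how identity is expressed.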
Cite
APA
Haq, I., & Saldías, B. (2026, April 24). LLM Safety Filters Fail Differently Across Dialects and Explicit Identity. Astrobobo Content Engine (rewrite of arxiv/cs.AI). https://astrobobo-content-engine.vercel.app/article/llm-safety-filters-fail-differently-across-dialects-and-explicit-identity-601b69
MLA
Haq, Irti, and Belén Saldías. "LLM Safety Filters Fail Differently Across Dialects and Explicit Identity." Astrobobo Content Engine, 24 Apr 2026, https://astrobobo-content-engine.vercel.app/article/llm-safety-filters-fail-differently-across-dialects-and-explicit-identity-601b69. Based on "arxiv/cs.AI", https://arxiv.org/abs/2604.21152.
BibTeX
@misc{astrobobo_llm-safety-filters-fail-differently-across-dialects-and-explicit-identity-601b69_2026,
  author       = {Irti Haq and Bel{\'e}n Sald{\'i}as},
  title        = {LLM Safety Filters Fail Differently Across Dialects and Explicit Identity},
  year         = {2026},
  url          = {https://astrobobo-content-engine.vercel.app/article/llm-safety-filters-fail-differently-across-dialects-and-explicit-identity-601b69},
  note         = {Astrobobo rewrite of arxiv/cs.AI, https://arxiv.org/abs/2604.21152},
}
