arXiv:2605.19069v3 Announce Type: replace-cross
Abstract: Code-switching — the natural alternation between two languages within a single utterance — remains one of the most challenging and under-studied conditions for automatic speech recognition (ASR). We present a benchmark evaluating five commercial ASR providers across four language pairs: Egyptian Arabic–English, Saudi Arabic (Najdi/Hijazi)–English, Persian (Farsi)–English, and German–English, comprising 300 samples per pair selected by a two-stage pipeline combining heuristic filtering with a GPT-4o and Gemini 1.5 Pro ensemble scorer, reducing LLM costs by $approx$91%. We evaluate on both WER and BERTScore, showing that while both metrics agree on the ordinal ranking of systems for all Arabic and Persian pairs ($tau = 1.0$), WER inflates the magnitude of quality gaps by approximately 3$times$ by penalising semantically correct transliteration choices. ElevenLabs Scribe v2 achieves the lowest WER (13.2% overall) and leads on BERTScore (0.936 overall). Difficulty-stratified analysis reveals performance gaps masked by aggregate averages, and BERT embedding projections confirm semantic proximity between reference and hypothesis despite surface-level script differences. The dataset is publicly available at https://huggingface.co/datasets/Perle-ai/ASR_Code_Switch.