
Diversifying Toxicity Search in Large Language Models Through Speciation

arXiv:2601.20981v2 Announce Type: replace-cross
Abstract: Evolutionary prompt search is a practical black-box approach for red teaming large language models; however, existing methods often collapse onto a small family of high-performing prompts, limiting coverage of distinct failure modes. We present a speciated quality-diversity extension of ToxSearch that maintains multiple high-toxicity prompt niches in parallel rather than optimizing a single best prompt. ToxSearch-S introduces unsupervised prompt speciation via a search methodology that maintains capacity-limited species with exemplar leaders, a reserve pool for emerging niches, and species-aware parent selection that trades off within-niche exploitation against cross-niche exploration. Preliminary results show ToxSearch-S reaching higher peak toxicity ($\approx 0.73$ vs. $\approx 0.47$) with a heavier tail (top-10 median $0.66$ vs. $0.45$) than the baseline. Speciation also yields broader semantic coverage under a topics-as-species analysis (higher effective topic diversity and larger unique topic coverage). Finally, the species formed are well separated in embedding space (mean separation ratio $\approx 1.93$) and exhibit distinct toxicity distributions, indicating that speciation partitions the adversarial space into behaviorally differentiated niches rather than superficial lexical variants.
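The three mechanisms named in the abstract (capacity-limited species with exemplar leaders, a reserve pool, and species-aware parent selection) can be illustrated with a minimal Python sketch. The abstract does not specify the distance metric, thresholds, promotion rule, or selection probabilities, so everything here is an assumption made for illustration: cosine distance to the leader, the `radius`, `capacity`, `min_size`, and `explore_prob` parameters, and the names `Prompt`, `Species`, `assign`, and `select_parents` are all hypothetical and are not taken from ToxSearch-S.

```python
import math
import random
from dataclasses import dataclass, field


@dataclass
class Prompt:
    text: str
    embedding: tuple   # assumed: a non-zero embedding vector
    toxicity: float    # assumed: a score in [0, 1]


@dataclass
class Species:
    leader: Prompt                      # exemplar: highest-toxicity member
    members: list = field(default_factory=list)


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def assign(prompt, species, reserve, radius=0.3, capacity=3, min_size=2):
    """Place a prompt in the nearest species, or in the reserve pool."""
    best, best_d = None, radius
    for s in species:
        d = cosine_distance(prompt.embedding, s.leader.embedding)
        if d <= best_d:
            best, best_d = s, d
    if best is not None:
        # Capacity limit: keep only the top `capacity` members by toxicity.
        best.members.append(prompt)
        best.members.sort(key=lambda p: p.toxicity, reverse=True)
        del best.members[capacity:]
        best.leader = best.members[0]
        return
    # No species is close enough: hold the prompt in the reserve pool.
    reserve.append(prompt)
    near = [p for p in reserve
            if cosine_distance(p.embedding, prompt.embedding) <= radius]
    if len(near) >= min_size:
        # Enough similar reserve prompts: promote them to a new species.
        near_ids = {id(p) for p in near}
        reserve[:] = [p for p in reserve if id(p) not in near_ids]
        near.sort(key=lambda p: p.toxicity, reverse=True)
        species.append(Species(leader=near[0], members=near[:capacity]))


def select_parents(species, rng, explore_prob=0.3):
    """Pick two parents: cross-niche with probability explore_prob."""
    first = rng.choice(species)
    if len(species) > 1 and rng.random() < explore_prob:
        second = rng.choice([s for s in species if s is not first])
        return first.leader, second.leader      # cross-niche exploration
    return first.leader, rng.choice(first.members)  # within-niche exploitation
```

In this sketch a caller would feed each newly evaluated prompt through `assign` and then draw parents for the next variation step with `select_parents(species, random.Random(seed))`. Raising `explore_prob` shifts the search toward cross-niche recombination, while lowering it concentrates effort within a single niche.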
