arXiv:2605.02908v1 Announce Type: cross
Abstract: Understanding how textual embeddings contribute to memorization in text-to-image diffusion models is crucial for both interpretability and safety. This paper investigates an unexpected behavior of CLIP embeddings in Stable Diffusion, revealing that the model relies disproportionately on specific embeddings. We categorize input tokens as start-of-text, prompt, end-of-text, and padding tokens, with corresponding embeddings $\mathbf{v}^{\mathbf{sot}}, \mathbf{v}^{\mathbf{pr}}, \mathbf{v}^{\mathbf{eot}}, \mathbf{v}^{\mathbf{pad}}$. We find that the prompt embeddings $\mathbf{v}^{\mathbf{pr}}$ contribute minimally to generation in memorized cases. In contrast, the padding embeddings $\mathbf{v}^{\mathbf{pad}}$ strongly affect memorization because they structurally duplicate $\mathbf{v}^{\mathbf{eot}}$, the only embedding explicitly optimized during CLIP training. This duplication unintentionally amplifies the influence of $\mathbf{v}^{\mathbf{eot}}$, causing the model to over-rely on it and thereby driving memorization. Based on these observations, we propose two simple yet effective inference-time mitigation strategies: (1) changing the tokenizer’s default padding token from the end-of-text token to the ! token before embedding, and masking $\mathbf{v}^{\mathbf{eot}}$; (2) partially masking $\mathbf{v}^{\mathbf{pad}}$. Both suppress memorization without degrading quality, and both are readily deployable without prior detection of memorized prompts.
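The two strategies can be illustrated at the level of token ids with a minimal sketch, which is not the authors' implementation. It assumes the standard CLIP BPE vocabulary used by Stable Diffusion v1 (start-of-text id 49406, end-of-text id 49407, `!` at id 0) and a tokenizer that pads with the end-of-text token. The function names, the `masked_fraction` parameter, and the choice to mask the trailing padding positions are all hypothetical; the abstract does not specify which or how many padding positions are masked, nor how the mask is applied inside the model.

```python
from typing import List, Tuple

# Assumed ids from the standard CLIP BPE vocabulary (not stated in the abstract).
SOT_ID = 49406   # <|startoftext|>
EOT_ID = 49407   # <|endoftext|>, also the default padding token in SD v1
BANG_ID = 0      # the "!" token


def categorize(ids: List[int]) -> List[str]:
    """Label each position as 'sot', 'pr', 'eot', or 'pad'.

    The first end-of-text token is the real 'eot'; everything after it
    is treated as padding, even though it shares the same id by default.
    """
    labels = []
    seen_eot = False
    for i, tok in enumerate(ids):
        if seen_eot:
            labels.append("pad")
        elif i == 0 and tok == SOT_ID:
            labels.append("sot")
        elif tok == EOT_ID:
            labels.append("eot")
            seen_eot = True
        else:
            labels.append("pr")
    return labels


def strategy_one(ids: List[int]) -> Tuple[List[int], List[int]]:
    """Strategy (1): pad with '!' instead of end-of-text, and mask the eot position.

    Returns the rewritten ids and a keep-mask (1 = keep, 0 = masked).
    """
    labels = categorize(ids)
    new_ids = [BANG_ID if lab == "pad" else tok for tok, lab in zip(ids, labels)]
    keep = [0 if lab == "eot" else 1 for lab in labels]
    return new_ids, keep


def strategy_two(ids: List[int], masked_fraction: float = 0.5) -> List[int]:
    """Strategy (2): mask part of the padding positions.

    Hypothetical choice: mask the trailing `masked_fraction` of pad positions.
    """
    labels = categorize(ids)
    pad_positions = [i for i, lab in enumerate(labels) if lab == "pad"]
    n_masked = int(len(pad_positions) * masked_fraction)
    masked = set(pad_positions[len(pad_positions) - n_masked:])
    return [0 if i in masked else 1 for i in range(len(ids))]


# Toy sequence of length 8 (real CLIP inputs have length 77).
toy_ids = [SOT_ID, 320, 1125, EOT_ID, EOT_ID, EOT_ID, EOT_ID, EOT_ID]
print(categorize(toy_ids))
print(strategy_one(toy_ids))
print(strategy_two(toy_ids))
```

In a real pipeline the keep-mask would have to be applied to the text encoder output or the cross-attention inputs; that step is left out here because the abstract does not describe it.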
Crisis support teams’ technological openness and learning attitudes toward the AI based virtual patient system crisis support VR
Background: Against the backdrop of escalating global humanitarian crises, innovative didactic simulations are becoming increasingly important. A promising alternative to traditional classroom-based didactics for learning psychological