arXiv:2603.24511v1 Announce Type: cross
Abstract: LLM agents like Claude Code can not only write code but also be used for autonomous AI research and engineering \citep{rank2026posttrainbench, novikov2025alphaevolve}. We show that an \emph{autoresearch}-style pipeline \citep{karpathy2026autoresearch} powered by Claude Code discovers novel white-box adversarial attack \textit{algorithms} that \textbf{significantly} outperform all existing (30+) methods in jailbreaking and prompt injection evaluations.
Starting from existing attack implementations, such as GCG~\citep{zou2023universal}, the agent iterates to produce new algorithms achieving up to a 40\% attack success rate on CBRN queries against GPT-OSS-Safeguard-20B, compared to $\leq$10\% for existing algorithms (\Cref{fig:teaser}, left).
The discovered algorithms generalize: attacks optimized on surrogate models transfer directly to held-out models, achieving \textbf{100\%} ASR against Meta-SecAlign-70B \citep{chen2025secalign} versus 56\% for the best baseline (\Cref{fig:teaser}, middle). Extending the findings of~\cite{carlini2025autoadvexbench}, our results are an early demonstration that incremental safety and security research can be automated using LLM agents. White-box adversarial red-teaming is particularly well-suited for this: existing methods provide strong starting points, and the optimization objective yields dense, quantitative feedback. We release all discovered attacks alongside baseline implementations and evaluation code at https://github.com/romovpa/claudini.
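The dense-feedback property the abstract highlights can be illustrated with a toy GCG-style coordinate search. This is a minimal sketch under loose assumptions, not the paper's discovered algorithms or the real GCG implementation: the `loss` function here is a stand-in scalar objective, whereas a real white-box attack would compute the target-sequence log-likelihood and rank candidate token swaps by its gradient with respect to one-hot token indicators.

```python
import random

random.seed(0)
VOCAB = list(range(50))  # toy vocabulary of token ids (assumption)

def loss(suffix, target=(7, 7, 7)):
    # Stand-in objective: distance of the suffix to a fixed target.
    # A real attack would instead score the model's likelihood of the
    # harmful target completion, giving dense per-step feedback.
    return sum(abs(t - s) for t, s in zip(target, suffix))

def attack_step(suffix, k=8):
    # Coordinate search: at each position, evaluate k candidate token
    # swaps and keep the single best-scoring candidate overall.
    # Real GCG samples candidates from the top-k gradient coordinates
    # rather than uniformly at random, as done here for simplicity.
    best, best_loss = suffix, loss(suffix)
    for pos in range(len(suffix)):
        for tok in random.sample(VOCAB, k):
            cand = suffix[:pos] + (tok,) + suffix[pos + 1:]
            cand_loss = loss(cand)
            if cand_loss < best_loss:
                best, best_loss = cand, cand_loss
    return best

suffix = (3, 41, 19)          # initial adversarial suffix (token ids)
for _ in range(10):           # iterate; the loss drops monotonically
    suffix = attack_step(suffix)
print(suffix, loss(suffix))
```

Because every candidate swap gets an exact scalar score, the loop receives quantitative feedback at each step, which is the property that makes white-box red-teaming a good target for automated algorithm discovery.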
Depression subtype classification from social media posts: few-shot prompting vs. fine-tuning of large language models
Background: Social media provides timely proxy signals of mental health, but reliable tweet-level classification of depression subtypes remains challenging due to short, noisy text, overlapping symptomatology,