Similar Items: You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation
- Attention Is Where You Attack
- Can You Keep a Secret? Involuntary Information Leakage in Language Model Writing
- When Alignment Isn't Enough: Response-Path Attacks on LLM Agents
- STARE: Step-wise Temporal Alignment and Red-teaming Engine for Multi-modal Toxicity Attack
- AutoSOUP: Safety-Oriented Unit Proof Generation for Component-level Memory-Safety Verification
- An Evaluation of Chat Safety Moderations in Roblox