Similar Items:
- Gaming the Metric, Not the Harm: Certifying Safety Audits against Strategic Platform Manipulation
- Safety Anchor: Defending Harmful Fine-tuning via Geometric Bottlenecks
- Graph Representation Learning Augmented Model Manipulation on Federated Fine-Tuning of LLMs
- Semia: Auditing Agent Skills via Constraint-Guided Representation Synthesis
- AutoSOUP: Safety-Oriented Unit Proof Generation for Component-level Memory-Safety Verification
- An Evaluation of Chat Safety Moderations in Roblox
- Combating Organized Platform Abuse: Amplifying Weak Risk Signals with Structural Information