Similar Items:
- An Evaluation of Chat Safety Moderations in Roblox
- AgentTrust: Runtime Safety Evaluation and Interception for AI Agent Tool Use
- Safety Anchor: Defending Harmful Fine-tuning via Geometric Bottlenecks
- Gaming the Metric, Not the Harm: Certifying Safety Audits against Strategic Platform Manipulation
- Exposing LLM Safety Gaps Through Mathematical Encoding: New Attacks and Systematic Analysis
- You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation
- ML-Bench&Guard: Policy-Grounded Multilingual Safety Benchmark and Guardrail for Large Language Models