Similar Items: Mitigating Misalignment Contagion by Steering with Implicit Traits
- Misaligned by Reward: Socially Undesirable Preferences in LLMs
- Conceptors for Semantic Steering
- Implicit Representations of Grammaticality in Language Models
- Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
- Measuring and Mitigating the Distributional Gap Between Real and Simulated User Behaviors
- Reasoning over Object Descriptions Improves Coreference Resolution in Task-Based Dialogue Systems