Trust the Batch, On- or Off-Policy: Adaptive Policy Optimization for RL Post-Training