G$^2$TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models