Text this: Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context