Similar Items: G$^2$TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models
- SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
- Enhancing Visual Question Answering with Multimodal LLMs via Chain-of-Question Guided Retrieval-Augmented Generation
- Unified Map Prior Encoder for Mapping and Planning
- Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation
- UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors
- One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy