Similar Items: Let ViT Speak: Generative Language-Image Pre-training
- RD-ViT: Recurrent-Depth Vision Transformer for Semantic Segmentation with Reduced Data Dependence: Extending the Recurrent-Depth Transformer Architecture to Dense Prediction
- Large Language Models are Universal Reasoners for Visual Generation
- LoViF 2026: The First Challenge on Holistic Quality Assessment for 4D World Model (PhyScore)
- FREPix: Frequency-Heterogeneous Flow Matching for Pixel-Space Image Generation
- SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation
- FlowDIS: Language-Guided Dichotomous Image Segmentation with Flow Matching