Similar Items:
- RD-ViT: Recurrent-Depth Vision Transformer for Semantic Segmentation with Reduced Data Dependence: Extending the Recurrent-Depth Transformer Architecture to Dense Prediction
- Let ViT Speak: Generative Language-Image Pre-training
- Linearizing Vision Transformer with Test-Time Training
- H-ViT: hardware-friendly post-training quantization for efficient vision transformer inference
- Faithful Extreme Image Rescaling with Learnable Reversible Transformation and Semantic Priors
- AnchorD: Metric Grounding of Monocular Depth Using Factor Graphs
- Foundation AI Models for Aerosol Optical Depth Estimation from PACE Satellite Data