Similar Items: Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment