Similar Items: Transformer-based fusion of acoustic and textual cues with proportional augmentation for emotion recognition