AVCLNet: Multimodal Multispeaker Tracking Network Using Audio‐Visual Contrastive Learning