A comparative analysis of video vision transformers on word-level sign language datasets