Text this: Automatic video captioning using spatiotemporal convolutions on temporally sampled frames