Text this: Region-Aware and Temporally Consistent Human-Object Contact Detection in Video