Extracting A Large Corpus from the Internet Archive, A Case Study