Similar Items: Extracting A Large Corpus from the Internet Archive, A Case Study