Improving Searchability of Automatically Transcribed Lectures Through Dynamic Language Modelling

Recording university lectures through lecture capture systems is increasingly common. However, a single continuous audio recording is often unhelpful for users, who may wish to navigate quickly to a particular part of a lecture, or locate a specific lecture within a set of recordings. A transcri...

Bibliographic Details
Main Author: Marquard, Stephen
Format: Thesis
Language: English
Published: Computer Science 2016
_version_ 1867613186715287552
access_status_str Open Access
author Marquard, Stephen
author_browse Marquard, Stephen
author_facet Marquard, Stephen
author_sort Marquard, Stephen
collection Thesis
description Recording university lectures through lecture capture systems is increasingly common. However, a single continuous audio recording is often unhelpful for users, who may wish to navigate quickly to a particular part of a lecture, or locate a specific lecture within a set of recordings. A transcript of the recording can enable faster navigation and searching. Automatic speech recognition (ASR) technologies may be used to create automated transcripts, to avoid the significant time and cost involved in manual transcription. Low accuracy of ASR-generated transcripts may however limit their usefulness. In particular, ASR systems optimized for general speech recognition may not recognize the many technical or discipline-specific words occurring in university lectures. To improve the usefulness of ASR transcripts for the purposes of information retrieval (search) and navigating within recordings, the lexicon and language model used by the ASR engine may be dynamically adapted for the topic of each lecture. A prototype is presented which uses the English Wikipedia as a semantically dense, large language corpus to generate a custom lexicon and language model for each lecture from a small set of keywords. Two strategies for extracting a topic-specific subset of Wikipedia articles are investigated: a naïve crawler which follows all article links from a set of seed articles produced by a Wikipedia search from the initial keywords, and a refinement which follows only links to articles sufficiently similar to the parent article. Pair-wise article similarity is computed from a pre-computed vector space model of Wikipedia article term scores generated using latent semantic indexing. The CMU Sphinx4 ASR engine is used to generate transcripts from thirteen recorded lectures from Open Yale Courses, using the English HUB4 language model as a reference and the two topic-specific language models generated for each lecture from Wikipedia.
format Thesis
id oai:open.uct.ac.za:11427/21226
institution University of Cape Town (South Africa)
language eng
last_indexed 2026-06-10T12:32:08.355Z
license_str Not specified — see source repository
provenance_str_mv Harvested via OAI-PMH from UCTD — University of Cape Town Open Access Repository
publishDate 2016
publishDateRange 2016
publishDateSort 2016
publisher Computer Science
publisherStr Computer Science
record_format dspace
source_str UCTD — University of Cape Town Open Access Repository
spelling oai:open.uct.ac.za:11427/21226 Improving Searchability of Automatically Transcribed Lectures Through Dynamic Language Modelling Marquard, Stephen 2016-08-13T18:55:00Z 2016-08-13T18:55:00Z 2012 2016-08-13T18:25:02Z Master Thesis Masters MPhil http://hdl.handle.net/11427/21226 eng application/pdf Computer Science Unknown University of Cape Town University of Cape Town
spellingShingle Marquard, Stephen
Improving Searchability of Automatically Transcribed Lectures Through Dynamic Language Modelling
thesis_degree_str Master's
title Improving Searchability of Automatically Transcribed Lectures Through Dynamic Language Modelling
title_full Improving Searchability of Automatically Transcribed Lectures Through Dynamic Language Modelling
title_fullStr Improving Searchability of Automatically Transcribed Lectures Through Dynamic Language Modelling
title_full_unstemmed Improving Searchability of Automatically Transcribed Lectures Through Dynamic Language Modelling
title_short Improving Searchability of Automatically Transcribed Lectures Through Dynamic Language Modelling
title_sort improving searchability of automatically transcribed lectures through dynamic language modelling
url http://hdl.handle.net/11427/21226
work_keys_str_mv AT marquardstephen improvingsearchabilityofautomaticallytranscribedlecturesthroughdynamiclanguagemodelling
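
Illustrative note: the description field above contrasts two ways of extracting a topic-specific set of Wikipedia articles: a naive crawl that follows every link from the seed articles, and a refinement that follows a link only when the linked article is sufficiently similar to its parent. The Python sketch below shows that difference on a toy link graph. It is not the thesis prototype: the function names, the two-dimensional article vectors (standing in for pre-computed latent semantic indexing vectors), the 0.8 similarity threshold, and the depth limit are all assumptions made for illustration.

```python
import math
from collections import deque


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def crawl(seeds, links, vectors=None, min_similarity=None, max_depth=2):
    """Breadth-first crawl of an article link graph.

    With min_similarity=None every link is followed (the naive strategy).
    Otherwise a link is followed only if the child article's vector is at
    least min_similarity (cosine) from its parent's vector.
    """
    order = list(dict.fromkeys(seeds))      # de-duplicated seeds, order kept
    visited = set(order)
    queue = deque((title, 0) for title in order)
    while queue:
        parent, depth = queue.popleft()
        if depth >= max_depth:              # hypothetical depth limit
            continue
        for child in links.get(parent, []):
            if child in visited:
                continue
            if min_similarity is not None:
                if cosine(vectors[parent], vectors[child]) < min_similarity:
                    continue                # off-topic link: do not follow
            visited.add(child)
            order.append(child)
            queue.append((child, depth + 1))
    return order


# Toy data (hypothetical): outgoing links and 2-d stand-ins for LSI vectors.
LINKS = {
    "Speech recognition": ["Language model", "Hidden Markov model", "Yale University"],
    "Language model": ["N-gram"],
    "Yale University": ["Connecticut"],
}
VECTORS = {
    "Speech recognition": [0.9, 0.1],
    "Language model": [0.8, 0.2],
    "Hidden Markov model": [0.7, 0.3],
    "N-gram": [0.85, 0.15],
    "Yale University": [0.1, 0.9],
    "Connecticut": [0.05, 0.95],
}

naive = crawl(["Speech recognition"], LINKS)
filtered = crawl(["Speech recognition"], LINKS, VECTORS, min_similarity=0.8)
print(len(naive), "articles (naive);", len(filtered), "articles (similarity-filtered)")
```

In this toy graph the naive crawl also collects the off-topic "Yale University" and "Connecticut" articles, while the filtered crawl stops at the first dissimilar link. The text of the selected articles would then feed lexicon and language-model generation, which is outside the scope of this sketch.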