Towards answering unanswerable questions: data augmentation for enhanced medical domain question answering

Hospitals store patient information in relational databases known as Electronic Health Records (EHRs). Existing EHRs have filter and search options on the front end that are converted to SQL queries at the back end. However, these search and filter options become cumbersome when querying the EHR. W...

Bibliographic Details
Main Author: Alexander, Natalie
Other Authors: Buys, Jan
Format: Thesis
Language: English
Published: Department of Statistical Sciences 2025
Subjects: Statistical Science
_version_ 1867613188103602176
access_status_str Open Access
author Alexander, Natalie
author2 Buys, Jan
author_browse Alexander, Natalie
Buys, Jan
author_facet Buys, Jan
Alexander, Natalie
author_sort Alexander, Natalie
collection Thesis
description Hospitals store patient information in relational databases known as Electronic Health Records (EHRs). Existing EHRs have filter and search options on the front end that are converted to SQL queries at the back end. However, these search and filter options become cumbersome when querying the EHR. While users could write custom SQL queries to query the EHR directly, this approach requires database expertise. Recent advancements in medical question-answering leverage text-to-SQL parsing, which translates a user's natural language question into an executable SQL query, enabling information retrieval from a database. However, current medical text-to-SQL research only addresses a limited scope of questions, known as answerable questions. Questions that the system cannot reliably answer (unanswerable questions) result in inexecutable or incorrect SQL predictions which may return incorrect information that affects clinical decision-making. This limitation underscores the need for a medical text-to-SQL system that can reliably address both answerable and unanswerable questions. This project aims to expand the coverage of questions answered by medical text-to-SQL systems, by addressing unanswerable questions that are out-of-schema or require medical knowledge to simplify complex, medical jargon. More specifically, we focus on addressing real-world unanswerable questions related to diagnoses and medication. This research first explores methods for addressing out-of-schema questions by assessing how incorporating an unseen schema, during inference, enhances the performance of a sequence-to-sequence (T5) text-to-SQL model. We then compare this approach to the effectiveness of fine-tuning the model on a training dataset that includes these out-of-schema questions and their corresponding schema.
Secondly, this research examines how external medical knowledge sources, related to diagnoses and medication, can be used in data augmentation (either through retrieval-augmented generation or SQL post-processing) to improve the answerability of unanswerable questions with complex medical jargon. In addition, we ensure model reliability by applying answer abstention when the text-to-SQL model cannot reliably answer a question, while also ensuring that the model does not deteriorate the answerability of the original answerable questions. As a result of these experiments, we find that out-of-schema questions are addressed by fine-tuning a T5-Base model on a training dataset that includes out-of-schema question representations, excluding additional schema information. In addition, we find that fine-tuning a T5-Large model with retrieval-augmented generation, which incorporates medical knowledge from the SNOMED CT and RxNorm medical vocabularies, improves the model's ability to answer unanswerable questions with complex medical jargon. We also find that an entropy-based uncertainty estimation method, which uses K-means clustering to establish the abstention threshold, is suitable for answer abstention. Finally, we find that our proposed models do not compromise the answerability of the original answerable questions.
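The abstract names an entropy-based uncertainty estimate with a K-means-derived abstention threshold but does not give the details. The following is a minimal sketch of one way such a scheme could look, not the thesis's implementation. It assumes that uncertainty is the mean per-token Shannon entropy of the decoder's output distributions, that scores are split into two clusters, and that the threshold is the midpoint between the two centroids. All function names are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sequence_entropy(step_distributions):
    """Mean per-token entropy over the decoding steps of one predicted query.

    Assumption: averaging over steps is one plausible aggregation; the
    abstract does not say which aggregation the thesis uses.
    """
    total = sum(token_entropy(dist) for dist in step_distributions)
    return total / len(step_distributions)


def kmeans_threshold(scores, iterations=100):
    """Two-cluster, one-dimensional K-means over uncertainty scores.

    Returns the midpoint between the two centroids. Assumption: the
    midpoint serves as the abstention threshold.
    """
    low, high = min(scores), max(scores)
    for _ in range(iterations):
        # Assign each score to its nearest centroid (ties go to the low one).
        low_group = [s for s in scores if abs(s - low) <= abs(s - high)]
        high_group = [s for s in scores if abs(s - low) > abs(s - high)]
        if not high_group:  # all scores are identical
            break
        new_low = sum(low_group) / len(low_group)
        new_high = sum(high_group) / len(high_group)
        if new_low == low and new_high == high:  # assignments are stable
            break
        low, high = new_low, new_high
    return (low + high) / 2.0


def should_abstain(score, threshold):
    """Abstain when the predicted query's uncertainty exceeds the threshold."""
    return score > threshold


# Illustrative, made-up uncertainty scores for six predicted SQL queries.
example_scores = [0.05, 0.10, 0.08, 1.20, 1.35, 1.10]
example_threshold = kmeans_threshold(example_scores)
```

In this sketch the threshold would be fitted on a held-out set of scores and then applied to each new prediction with `should_abstain`, so that high-entropy predictions are withheld instead of returned.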
format Thesis
id oai:open.uct.ac.za:11427/42164
institution University of Cape Town (South Africa)
language eng
last_indexed 2026-06-10T12:32:09.918Z
license_str Not specified — see source repository
provenance_str_mv Harvested via OAI-PMH from UCTD — University of Cape Town Open Access Repository
publishDate 2025
publishDateRange 2025
publishDateSort 2025
publisher Department of Statistical Sciences
publisherStr Department of Statistical Sciences
record_format dspace
source_str UCTD — University of Cape Town Open Access Repository
spelling oai:open.uct.ac.za:11427/42164 Towards answering unanswerable questions: data augmentation for enhanced medical domain question answering Alexander, Natalie Buys, Jan Statistical Science 2025-11-10T09:49:58Z 2025-11-10T09:49:58Z 2025 2025-11-10T09:45:15Z Thesis / Dissertation Masters MSc http://hdl.handle.net/11427/42164 eng application/pdf Department of Statistical Sciences Faculty of Science University of Cape Town
spellingShingle Statistical Science
Alexander, Natalie
Towards answering unanswerable questions: data augmentation for enhanced medical domain question answering
thesis_degree_str Master's
title Towards answering unanswerable questions: data augmentation for enhanced medical domain question answering
title_full Towards answering unanswerable questions: data augmentation for enhanced medical domain question answering
title_fullStr Towards answering unanswerable questions: data augmentation for enhanced medical domain question answering
title_full_unstemmed Towards answering unanswerable questions: data augmentation for enhanced medical domain question answering
title_short Towards answering unanswerable questions: data augmentation for enhanced medical domain question answering
title_sort towards answering unanswerable questions data augmentation for enhanced medical domain question answering
topic Statistical Science
url http://hdl.handle.net/11427/42164
work_keys_str_mv AT alexandernatalie towardsansweringunanswerablequestionsdataaugmentationforenhancedmedicaldomainquestionanswering