Investigating language preferences in improving multilingual Swahili information retrieval

Multilingual Information Retrieval (MLIR) systems are designed to retrieve information from multiple languages in response to a query posed in another language or in one of the languages in which a user is looking for information. Researchers have proposed several approaches for combining the result...

Bibliographic Details
Main Author: Telemala, Joseph Philipo
Other Authors: Suleman, Hussein
Format: Thesis
Language: English
Published: Department of Computer Science 2022
_version_ 1867613197155958784
access_status_str Open Access
author Telemala, Joseph Philipo
author2 Suleman, Hussein
author_browse Suleman, Hussein
Telemala, Joseph Philipo
author_facet Suleman, Hussein
Telemala, Joseph Philipo
author_sort Telemala, Joseph Philipo
collection Thesis
description Multilingual Information Retrieval (MLIR) systems are designed to retrieve information from multiple languages in response to a query posed in another language or in one of the languages in which a user is looking for information. Researchers have proposed several approaches for combining the results from individual result lists to produce a single result list. Some are heuristics, such as round-robin, in which a result is drawn from each result list one at a time until all lists are exhausted, while others are Machine Learning (ML)-based, in which a model is trained using a variety of features from the query and the required documents. These approaches strive for topical relevance, which is the most important goal in satisfying users' information needs. However, multilingual speakers exhibit a variety of behaviours, some of which are unique to certain individuals based on their historical, cultural, and linguistic backgrounds. Unfortunately, these behaviours are ignored in current MLIR system design and implementation: current MLIR systems do not take people's language preferences into account when ranking results. Studies have shown that users have different language preferences based on their search topics – Topic-Language (T-L) preferences. This study proposes using T-L preferences to improve the relevance of the ranked MLIR results. To achieve this aim, we used a survey-based study to understand the information needs and Web search behaviour of Swahili-speaking Web users in Tanzania. One prominent behaviour of such multilingual Web users that emerged is code-switching. Several factors, such as information context and search topic, were identified as reasons for such frequent language switching. We then created a prototype multilingual search engine with which users interacted in order to quantify how much the language of the query or the selected results is influenced by the search topic. We estimated the relationship between the topic of search and the language of the query and clicked results using the resulting query and click-through logs. The findings revealed that Swahili-speaking Web users have language preferences for certain topics. For example, Kiswahili was significantly preferred as a results language in only 9% of the examined topics, English was preferred in 26% of the topics, and there was no preference for language of results in the remaining 65% of the topics. Based on these findings, we created the T-L-based algorithm, which re-ranks the results based on T-L associations/preferences. We evaluated our proposed T-L-based algorithm using click-through logs from our prototype guided multilingual search engine. The results show that incorporating language preferences into the ranking model significantly improves the relevance of MLIR results in some specific cases. The strength of the T-L association and the number of relevant results in the preferred language's list were discovered to be driving factors in the performance improvement of the T-L-based algorithm. This thesis provides evidence that using language preferences can potentially improve the relevance of MLIR results for some topics that are preferentially expressed in specific languages. This is important in communities where information search and access are hampered by a variety of factors and there is a clear lineage in language use, where MLIR's topical relevance alone may not be sufficient.
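The round-robin heuristic mentioned in the description (draw one result from each per-language list in turn until all lists are exhausted) can be illustrated with a minimal Python sketch. This is not taken from the thesis: the function name `round_robin_merge` and the sample document identifiers are hypothetical, and the sketch assumes each input list is already ranked and does not handle duplicate documents across lists.

```python
from itertools import zip_longest


def round_robin_merge(result_lists):
    """Interleave ranked result lists, one result per list per round.

    Illustrative only: assumes each list is already ranked best-first
    and that no document appears in more than one list.
    """
    missing = object()  # sentinel marking lists that have run out
    merged = []
    # zip_longest yields one "round" at a time: the i-th result of every list.
    for round_of_results in zip_longest(*result_lists, fillvalue=missing):
        merged.extend(doc for doc in round_of_results if doc is not missing)
    return merged


# Hypothetical per-language result lists for one query.
swahili_results = ["sw1", "sw2", "sw3"]
english_results = ["en1", "en2"]

print(round_robin_merge([swahili_results, english_results]))
# → ['sw1', 'en1', 'sw2', 'en2', 'sw3']
```

Because the interleaving ignores the search topic entirely, it treats every language equally on every query, which is the kind of behaviour the T-L-based re-ranking described above is meant to improve on.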
format Thesis
id oai:open.uct.ac.za:11427/36568
institution University of Cape Town (South Africa)
language eng
last_indexed 2026-06-10T12:32:18.917Z
license_str Not specified — see source repository
provenance_str_mv Harvested via OAI-PMH from UCTD — University of Cape Town Open Access Repository
publishDate 2022
publishDateRange 2022
publishDateSort 2022
publisher Department of Computer Science
publisherStr Department of Computer Science
record_format dspace
source_str UCTD — University of Cape Town Open Access Repository
spelling oai:open.uct.ac.za:11427/36568 Investigating language preferences in improving multilingual Swahili information retrieval Telemala, Joseph Philipo Suleman, Hussein computer science 2022-06-29T10:45:14Z 2022-06-29T10:45:14Z 2022 2022-06-29T10:44:53Z Doctoral Thesis Doctoral PhD http://hdl.handle.net/11427/36568 eng application/pdf Department of Computer Science Faculty of Science
spellingShingle computer science
Telemala, Joseph Philipo
Investigating language preferences in improving multilingual Swahili information retrieval
thesis_degree_str Doctoral
title Investigating language preferences in improving multilingual Swahili information retrieval
title_full Investigating language preferences in improving multilingual Swahili information retrieval
title_fullStr Investigating language preferences in improving multilingual Swahili information retrieval
title_full_unstemmed Investigating language preferences in improving multilingual Swahili information retrieval
title_short Investigating language preferences in improving multilingual Swahili information retrieval
title_sort investigating language preferences in improving multilingual swahili information retrieval
topic computer science
url http://hdl.handle.net/11427/36568
work_keys_str_mv AT telemalajosephphilipo investigatinglanguagepreferencesinimprovingmultilingualswahiliinformationretrieval