Addressing deep reinforcement learning: empirical algorithm performance evaluations

Bibliographic Details
Main Author: Dubb, Roland
Other Authors: Shock, Jonathan
Format: Thesis
Language: English
Published: Department of Mathematics and Applied Mathematics 2025
Description
Summary: Due to the rapid pace at which deep reinforcement learning (RL) research papers are produced, some recent publications have begun to critique the manner in which RL algorithm performance is evaluated. Building on this recent scrutiny, our work attempts to identify the precise aspects of empirical deep RL algorithm performance evaluations that need improvement. This dissertation begins by briefly introducing the RL problem. Thereafter, we review the literature and discuss recent scrutiny of various aspects of deep RL algorithm performance evaluations. Specifically, we discuss the following aspects: (i) the choice of RL environment, (ii) the measurement of uncertainty, (iii) the collection of data, and (iv) the aggregation of that data. From this discussion, we identify two particular problems with RL evaluations, namely the non-linear scaling of algorithm performance scores with the level of skill achieved by that particular algorithm, and the (potentially) biased weighting of scores across RL environments in the data aggregation process. As multi-agent RL (MARL) is a recently popular research paradigm whose evaluation procedures have not yet been carefully scrutinised in the literature, we analyse a dataset by Gorsane et al. [1] which documents the evaluation methodologies of many recent deep cooperative MARL publications. This analysis, which reveals several flawed aspects of MARL evaluation, along with the RL evaluation issues reviewed from the literature, motivates an attempt to construct an improved guideline for empirical RL algorithm performance evaluation. Multi-criteria decision analysis (MCDA) is discussed as a potential framework that offers a data aggregation procedure that resolves the two aforementioned problems with RL evaluations. Combining the use of MCDA with our insights from the literature, we propose an improved guideline for deep RL empirical algorithm performance evaluations.
This is contrasted with another proposed guideline by Gorsane et al. [1] before a proof-of-concept test is conducted. Overall, we aim to move toward the better evaluation of RL algorithms and contribute toward an increased sensitivity to a lack of scientific rigour [2, 3] in the field of machine learning.