| Main Author: | |
|---|---|
| Other Authors: | |
| Format: | Thesis |
| Language: | en_ZA |
| Published: | Stellenbosch : Stellenbosch University, 2021 |
| Subjects: | |
| Summary: | ENGLISH ABSTRACT: Humans have the remarkable ability to perform actions at various levels of abstraction. In
addition to this, humans are also able to learn new skills by applying relevant knowledge,
observing experts and refining through experience. Many current reinforcement learning
(RL) algorithms rely on a lengthy trial-and-error training process, making it infeasible
to train them in the real world. In this thesis, to address sparse, hierarchical problems,
we propose the following: (1) an RL algorithm, Branched Rainbow from Demonstrations
(BRfD), which combines several improvements to the Deep Q-Networks (DQN) algorithm,
and is capable of learning from human demonstrations; (2) a hierarchically structured RL
algorithm using BRfD to solve a set of sub-tasks in order to reach a goal. We evaluate both
of these algorithms in the 2019 MineRL challenge environments. The MineRL competition
challenged participants to find a diamond in Minecraft, a 3D, open-world, procedurally
generated game. We analyse the efficiency of several improvements implemented in the
BRfD algorithm through an extensive ablation study. For this study, the agents are tasked
with collecting 64 logs in a Minecraft forest environment. We show that our algorithm
outperforms the overall winner of the MineRL challenge in the TreeChop environment.
Additionally, we show that nearly all of the improvements impact the performance either in
terms of learning speed or rewards received. For the hierarchical algorithm, we segment the
demonstrations into the respective sub-tasks. The algorithm then trains a version of BRfD
on these demonstrations before learning from its own experiences in the environment. We
then evaluate the algorithm by inspecting the proportion of episodes in which certain items
were obtained. While our algorithm is able to obtain iron ore, the current state-of-the-art
algorithms are capable of obtaining a diamond. |
|---|