User description

Learning rational behaviors in open-world games like Minecraft remains challenging for Reinforcement Learning (RL) research due to the compound challenge of partial observability, high-dimensional visual perception, and delayed reward. To address this, we propose JueWu-MC, a sample-efficient hierarchical RL approach equipped with representation learning and imitation learning to deal with perception and exploration. Specifically, our approach includes two levels of hierarchy, where the high-level controller learns a policy to control over options and the low-level workers learn to solve each sub-task. To boost the learning of sub-tasks, we propose a combination of techniques including 1) action-aware representation learning, which captures the underlying relations between actions and representations; 2) discriminator-based self-imitation learning for efficient exploration; and 3) ensemble behavior cloning with consistency filtering for policy robustness. Extensive experiments show that JueWu-MC significantly improves sample efficiency and outperforms a set of baselines by a large margin. Notably, we won the championship of the NeurIPS MineRL 2021 research competition and achieved the highest performance score ever.

Deep reinforcement learning (DRL) has shown great success in many genres of games, including board games (Silver et al., 2016), Atari (Mnih et al., 2013), simple first-person shooters (FPS) (Huang et al., 2019), real-time strategy (RTS) games (Vinyals et al., 2019), multiplayer online battle arenas (MOBA) (Berner et al., 2019), etc. Recently, open-world games have been attracting attention due to their playing mechanisms and similarity to real-world control tasks (Guss et al., 2021). Minecraft, as a typical open-world game, has been increasingly explored over the past few years (Oh et al., 2016; Tessler et al., 2017; Guss et al., 2019; Kanervisto et al., 2020; Skrynnik et al., 2021; Mao et al., 2021).

Compared to other games, the characteristics of Minecraft make it a suitable testbed for RL research, as it emphasizes exploration, perception, and construction in a 3D open world (Oh et al., 2016). The agent has only partial observability and faces occlusions, and the tasks in the game are chained and long-term. Generally, humans can make rational decisions to explore basic items and construct desired higher-level items using a reasonable number of samples, while it can be hard for an AI agent to do so autonomously. Therefore, to facilitate efficient decision-making of agents in playing Minecraft, MineRL (Guss et al., 2019) has been developed as a research competition platform, which provides human demonstrations and encourages the development of sample-efficient RL agents for playing Minecraft. Since the release of MineRL, a number of efforts have been made toward developing Minecraft AI agents, e.g., ForgER (Skrynnik et al., 2021) and SEIHAI (Mao et al., 2021).

However, it is still difficult for existing RL algorithms to mine items in Minecraft due to the compound challenge it poses, expanded below.

Long-time Horizons
In order to achieve goals (e.g., mining a diamond) in Minecraft, the agent is required to finish a variety of sub-tasks (e.g., log, craft) that highly depend on each other; an illustrative chain is sketched below. Due to the sparse reward, it is hard for agents to learn long-horizon decisions efficiently. Hierarchical RL from demonstrations (Le et al., 2018; Pertsch et al., 2020) has been explored to leverage the task structure to accelerate the learning process. However, learning from unstructured demonstrations without domain knowledge remains challenging.
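To make the chained dependencies concrete, the following sketch lists one plausible sub-task chain behind the diamond goal. The sub-task names and their ordering are illustrative assumptions based on Minecraft's crafting tree, not the sub-goal definitions extracted by our controller.

```python
# Illustrative only: an assumed chain of dependent sub-tasks for the
# "mine a diamond" goal. Each step requires items produced by earlier steps,
# so reward is only obtained after a long sequence of correct decisions.
DIAMOND_SUBTASK_CHAIN = [
    "chop_log",
    "craft_planks",
    "craft_crafting_table",
    "craft_wooden_pickaxe",
    "mine_cobblestone",
    "craft_stone_pickaxe",
    "mine_iron_ore",
    "craft_furnace",
    "smelt_iron_ingot",
    "craft_iron_pickaxe",
    "mine_diamond",
]
```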
High-dimensional Visual Perception
Minecraft is a flexible 3D first-person game revolving around gathering resources (i.e., exploring) and creating structures and items (i.e., constructing). In this environment, agents are required to deal with high-dimensional visual input to enable efficient control. However, the agent's surroundings are varied and dynamic, which makes it difficult to learn a good representation.

Inefficient Exploration
With partial observability, the agent needs to explore in the right way and collect information from the environment so as to achieve its goals. A naive exploration strategy can waste a lot of samples on useless exploration. Self-Imitation Learning (SIL) (Oh et al., 2018) is a simple method that learns to reproduce past good behaviors to incentivize deep exploration. However, SIL is not sample-efficient because its advantage-clipping operation discards many samples. Moreover, SIL does not make use of the transitions between samples.

Imperfect Demonstrations
Human demonstrations of playing Minecraft are highly diverse in distribution (Kanervisto et al., 2020), and they contain noisy data due to imperfect human operation (Guss et al., 2019).

To address the aforementioned compound challenges, we develop an efficient hierarchical RL approach equipped with novel representation and imitation learning techniques. Our method makes effective use of human demonstrations to boost the learning of agents and enables the RL algorithm to learn rational behaviors with high sample efficiency.

Hierarchical Planning with Prior
We first propose a hierarchical RL (HRL) framework with two levels of hierarchy, where the high-level controller automatically extracts sub-goals in long-horizon trajectories from the unstructured human demonstrations and learns a policy to control over options, while the low-level workers learn sub-tasks to achieve sub-goals by leveraging both demonstrations dispatched by the high-level controller and interactions with the environment. Our approach automatically structures the demonstrations and learns a hierarchical agent, which enables better decisions over long-horizon tasks; the control flow is sketched below.
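As a rough illustration of this two-level control flow, the sketch below shows a controller repeatedly dispatching sub-goals to per-option workers within an episode. The class and method names, the gym-style environment interface, and the step limit are our own placeholders, not JueWu-MC's actual implementation.

```python
# Minimal sketch of the two-level hierarchy under assumed interfaces:
# a high-level controller selects an option (sub-goal), and the corresponding
# low-level worker keeps control until that sub-goal terminates.

class Controller:
    """High-level policy: chooses the next option (sub-goal) to pursue."""
    def select_option(self, obs):
        raise NotImplementedError


class Worker:
    """Low-level policy trained for a single sub-task."""
    def act(self, obs):
        raise NotImplementedError

    def finished(self, obs):
        """Whether the sub-goal is achieved (or should be abandoned)."""
        raise NotImplementedError


def run_episode(env, controller, workers, max_steps=10000):
    obs = env.reset()
    total_reward, step, done = 0.0, 0, False
    while not done and step < max_steps:
        option = controller.select_option(obs)   # pick the next sub-goal
        worker = workers[option]                  # dispatch its worker
        # The worker keeps control until its sub-goal terminates.
        while not done and step < max_steps:
            obs, reward, done, _ = env.step(worker.act(obs))
            total_reward += reward
            step += 1
            if worker.finished(obs):
                break
    return total_reward
```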
Under our HRL framework, we devise the following key techniques to boost agent learning.

Action-aware Representation Learning
Although some prior works (Huang et al., 2019) proposed using auxiliary tasks (e.g., enemy detection) to better understand the 3D world, such methods require a large amount of labeled data. We propose a self-supervised action-aware representation learning (A2RL) technique, which learns to capture the underlying relations between actions and representations in 3D visual environments like Minecraft. As we will show, A2RL not only enables effective control by learning a compact representation but also improves the interpretability of the learned policy.

Discriminator-based Self-imitation Learning
As mentioned, existing self-imitation learning is advantage-based and becomes sample-inefficient for handling tasks in Minecraft, as it wastes a lot of samples due to the clipped objective and does not utilize transitions between samples. Therefore, we propose discriminator-based self-imitation learning (DSIL), which leverages self-generated experiences to learn self-correctable policies for better exploration.

Ensemble Behavior Cloning with Consistency Filtering
Learning a robust policy from imperfect demonstrations is difficult (Wu et al., 2019). To address this issue, we first propose consistency filtering to identify the most common human behavior, and then perform ensemble behavior cloning to learn a robust agent with reduced uncertainty.

In summary, our contributions are: 1) We propose JueWu-MC, a sample-efficient hierarchical RL approach, equipped with action-aware representation learning, discriminator-based self-imitation, and ensemble behavior cloning with consistency filtering, for training Minecraft AI agents. 2) Our approach outperforms competitive baselines by a significantly large margin and achieves the best performance ever throughout the MineRL competition history. Thorough ablations and visualizations are further conducted to help understand why our approach works.

Game AI
Games have long been a preferred field for artificial intelligence research. AlphaGo (Silver et al., 2016) mastered the game of Go with DRL and tree search. Since then, DRL has been used in other, more sophisticated games, including StarCraft (RTS) (Vinyals et al., 2019), Google Football (sports) (Kurach et al., 2020), VizDoom (FPS) (Huang et al., 2019), and Dota (MOBA) (Berner et al., 2019). Recently, the 3D open-world game Minecraft has been drawing rising attention. Oh et al. (2016) showed that existing RL algorithms suffer from poor generalization in Minecraft and proposed a new memory-based DRL architecture. Tessler et al. (2017) proposed H-DRLN, a combination of a deep skill array and a skill distillation system, to promote lifelong learning and transfer knowledge among different tasks in Minecraft. Since MineRL was first held in 2019, many solutions have been proposed to learn to play in Minecraft. These works can be grouped into two categories: 1) end-to-end learning (Amiranashvili et al., 2020; Kanervisto et al., 2020; Scheller et al., 2020); 2) HRL with human demonstrations (Skrynnik et al., 2021; Mao et al., 2021). Our approach belongs to the second category. In this category, prior works leverage the structure of the tasks and learn a hierarchical agent to play in Minecraft: ForgER (Skrynnik et al., 2021) proposed a hierarchical method with forgetful experience replay to allow the agent to learn from low-quality demonstrations, and Mao et al. (2021) proposed SEIHAI, which fully takes advantage of the human demonstrations and the task structure.

Sample-efficient Reinforcement Learning
Our goal is to build a sample-efficient RL agent for playing Minecraft, and we therefore develop a combination of efficient learning techniques. We discuss the most relevant works below.

Our work is related to recent HRL research that builds upon human priors. To expand, Le et al. (2018) proposed to warm up the hierarchical agent from demonstrations and fine-tune it with RL algorithms. Pertsch et al. (2020) proposed to learn a skill prior from demonstrations to accelerate HRL algorithms. Compared to existing works, we are faced with highly unstructured demonstrations of a 3D first-person video game played by crowds. We address this challenge by structuring the demonstrations and defining sub-tasks and sub-goals automatically.

Representation learning in RL has two broad directions: self-supervised learning and contrastive learning. The former (Wu et al., 2021) aims at learning rich representations of high-dimensional unlabeled data that are useful across tasks, while the latter (Srinivas et al., 2020) learns representations that obey similarity constraints in a dataset organized into similar and dissimilar pairs. Our work proposes a novel self-supervised representation learning method that can measure action effects in 3D video games.
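For intuition, the following sketch shows one way an action-aware objective could relate actions to changes in the learned representation, using inverse and forward prediction heads on top of an image encoder. It is a minimal illustration under our own assumptions (a discrete action space and a given encoder), not the exact A2RL loss used in our method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch of an action-aware auxiliary objective: train the
# encoder so that the change between consecutive frame embeddings is both
# predictive of the action taken (inverse head) and predictable from it
# (forward head). Names and losses are illustrative assumptions only.

class ActionAwareAux(nn.Module):
    def __init__(self, encoder, emb_dim, n_actions):
        super().__init__()
        self.encoder = encoder                                        # image -> embedding
        self.n_actions = n_actions
        self.inverse_head = nn.Linear(2 * emb_dim, n_actions)        # (z_t, z_{t+1}) -> a_t
        self.forward_head = nn.Linear(emb_dim + n_actions, emb_dim)  # (z_t, a_t) -> z_{t+1}

    def forward(self, obs_t, obs_tp1, action_t):
        z_t, z_tp1 = self.encoder(obs_t), self.encoder(obs_tp1)
        a_onehot = F.one_hot(action_t, self.n_actions).float()
        # Inverse model: infer which action caused the observed change.
        inv_logits = self.inverse_head(torch.cat([z_t, z_tp1], dim=-1))
        inv_loss = F.cross_entropy(inv_logits, action_t)
        # Forward model: predict the action's effect in embedding space.
        pred_tp1 = self.forward_head(torch.cat([z_t, a_onehot], dim=-1))
        fwd_loss = F.mse_loss(pred_tp1, z_tp1.detach())
        return inv_loss + fwd_loss
```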
Existing exploration methods use curiosity or uncertainty as a signal for exploration (Pathak et al., 2017; Burda et al., 2018) so that the learned agent is able to cover a large state space. However, the exploration-exploitation dilemma, together with the sample-efficiency requirement, drives us to develop self-imitation learning (SIL) (Oh et al., 2018) methods that focus on exploiting past good experiences for better exploration. Hence, we propose discriminator-based self-imitation learning (DSIL) for efficient exploration.

Our work is also related to learning from imperfect demonstrations, such as DQfD (Hester et al., 2018) and Q-filter (Nair et al., 2018). Most methods in this field leverage online interactions with the environment to handle the noise in demonstrations. We propose ensemble behavior cloning with consistency filtering (EBC), which leverages imperfect demonstrations to learn robust policies for playing Minecraft.

3 Method
In this section, we first introduce our overall HRL framework, and then illustrate the details of each component.

3.1 Overview
Figure 1 shows our overall framework. We define the human demonstrations as