Dynamic Vast Sparse Action Space - RL
Project Overview
This project focuses on reinforcement learning (RL) in environments with dynamically changing and sparse action spaces. In general, RL suffers from poor sample efficiency, requiring extensive training time and computational resources. Even when learning is successful, policies often get trapped in local optima, making further improvement computationally expensive.
The Problem of Conventional Reinforcement Learning Methods
While the curse of dimensionality in the state space can often be addressed through dimensionality reduction techniques or by designing compact state representations, the challenge of large action spaces remains critical. In standard actor-critic architectures, the actor network typically outputs evaluations over the entire action space, so when that space is extremely large, the output dimensionality becomes a major bottleneck. Moreover, many tasks have a huge action space that is also sparse: the actions actually available depend on the state and are often far fewer than the full action space. AlphaZero, developed by DeepMind, outperforms professional human players, and it relies on action masking to exclude impossible (illegal) actions.
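The sketch below illustrates the output-dimension bottleneck. It is a minimal, hypothetical example assuming PyTorch; the state dimension, hidden size, and action-space size are made-up numbers, not values from this project.

```python
# Minimal sketch (assuming PyTorch): a standard actor head whose final layer
# spans the ENTIRE action space. When TOTAL_ACTIONS is very large, this layer
# dominates parameters and compute even though only a few actions are valid
# in any given state.
import torch
import torch.nn as nn

STATE_DIM = 64            # hypothetical state feature size
TOTAL_ACTIONS = 100_000   # hypothetical size of the full (mostly invalid) action space

actor = nn.Sequential(
    nn.Linear(STATE_DIM, 256),
    nn.ReLU(),
    nn.Linear(256, TOTAL_ACTIONS),  # bottleneck: one logit per action, valid or not
)

state = torch.randn(1, STATE_DIM)
logits = actor(state)               # shape (1, TOTAL_ACTIONS); most entries are wasted work
print(logits.shape)
```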
After Action Masking
Masking infeasible actions removes invalid actions and renormalizes the action-selection probability distribution, but it does not fundamentally solve the problem of resource consumption: the network still spends computation producing before-mask logits for every action, so the wasted cost becomes substantial when the entire action space is extremely large. Consider a simple example of action masking. Let the action space consist of action0 through action7, and suppose actions 1, 2, 6, and 7 are invalid in the current state. By overriding the logits of the invalid actions, a new action-selection probability distribution over only the valid actions is obtained.
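The sketch below works through this example, assuming PyTorch: the logits of the invalid actions are set to negative infinity before the softmax, so those actions receive zero probability and the remaining mass is renormalized over the valid ones.

```python
# Minimal sketch (assuming PyTorch) of the masking example above:
# actions 0..7, with actions 1, 2, 6, 7 invalid in the current state.
import torch

logits = torch.randn(8)  # before-mask logits for action0..action7
valid_mask = torch.tensor([1, 0, 0, 1, 1, 1, 0, 0], dtype=torch.bool)

# Push invalid logits to -inf, then softmax over the masked logits.
masked_logits = logits.masked_fill(~valid_mask, float("-inf"))
probs = torch.softmax(masked_logits, dim=-1)

print(probs)        # actions 1, 2, 6, 7 get probability 0
print(probs.sum())  # the distribution still sums to 1 over the valid actions
```

Note that the full vector of before-mask logits is still computed; masking only repairs the distribution after the fact.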
Beyond Action Masking
I am particularly interested in RL methods that can operate efficiently in environments where the total action space is vast, yet only a sparse subset of actions is valid in each state. For example, shogi has a huge overall action space, but the set of legal moves in a given position is often small because most moves are prohibited by the rules and the current state. In robot control, some actions, such as rotating an arm dangerously or running too fast, may damage the hardware; such dangerous actions should be excluded for both safety and sample efficiency.
Motivation for the Simulator
One example of such environments is grid-based, turn-based simulation RPGs, a genre I am personally interested in for its strategic complexity. These games typically exhibit sparse, state-dependent action spaces, making them an ideal testbed for this research. Based on this idea, I began designing and implementing a custom environment specifically tailored for reinforcement learning in such settings.