Exploration is an important part of reinforcement learning, in which agents learn to predict and control stochastic, unknown environments. Exploration is essential because it supplies information without which effective learning cannot happen, yet it remains one of the field's hardest problems. Over years of research, an effective way to increase an agent's tendency to explore has been to augment trajectories with intrinsic rewards for reaching new environmental states. The challenge, however, is deciding which states count as new, and that in turn depends on how the states of the environment are represented.
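To make the idea of intrinsic rewards for new states concrete, here is a minimal sketch of a count-based novelty bonus. It is not the method from any particular paper; the `state_key` abstraction and the inverse-square-root decay are illustrative choices:

```python
from collections import defaultdict

class CountBasedBonus:
    """Minimal count-based novelty bonus: the intrinsic reward for a state
    decays as that state is revisited, encouraging the agent to seek new ones."""

    def __init__(self, scale=1.0):
        self.counts = defaultdict(int)
        self.scale = scale

    def intrinsic_reward(self, state_key):
        # state_key is any hashable abstraction of the raw observation;
        # which abstraction to use is exactly the hard problem discussed above
        self.counts[state_key] += 1
        return self.scale / (self.counts[state_key] ** 0.5)

bonus = CountBasedBonus()
r1 = bonus.intrinsic_reward("room_A")  # first visit: full bonus
r2 = bonus.intrinsic_reward("room_A")  # repeat visit: smaller bonus
```

In practice this bonus is added to the environment's reward when training the policy; the sketch highlights why the choice of `state_key` matters, since two observations only share a count if the representation maps them to the same key.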
To address this challenge, DeepMind researchers introduced a new method in which agents are equipped with prior knowledge in the form of abstractions derived from large vision-language models pre-trained on image-captioning data.
Previously, novelty-driven exploration research has offered several approaches for deriving state representations in reinforcement learning agents. One of the most popular is to use random features, where the state is represented by embedding visual observations with a fixed, randomly initialized target network. Another popular method is to learn visual features extracted from an inverse dynamics model.
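The random-feature approach can be sketched as follows. This is a simplified, NumPy-only illustration of the idea behind methods like random network distillation, not an implementation of any specific paper: a fixed random projection defines the feature space, a learned predictor tries to match it, and the prediction error serves as the novelty signal:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed, randomly initialized "target" projection: its outputs define
# the (random) feature space in which novelty is measured.
W_target = rng.normal(size=(64, 16))

# A learned "predictor" is trained to match the target's features on
# visited states; its remaining error is the novelty signal.
W_pred = np.zeros((64, 16))

def novelty(obs, lr=0.01):
    """Return the predictor's error on obs, then take one SGD step so
    that repeatedly visited observations become progressively less novel."""
    global W_pred
    target = obs @ W_target
    pred = obs @ W_pred
    err = np.mean((pred - target) ** 2)
    grad = obs[:, None] * (pred - target)[None, :] * 2 / target.size
    W_pred -= lr * grad
    return err

obs = rng.normal(size=64)
first = novelty(obs)        # high error: the observation is unfamiliar
for _ in range(200):
    novelty(obs)            # repeated visits train the predictor
later = novelty(obs)        # low error: the observation is now familiar
```

The key property is that novelty is defined entirely by the fixed random feature space, which is what makes the approach brittle in 3D environments where that space need not align with task-relevant distinctions.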
Both of these methods, and others like them, work well for classic 2D environments, but their effectiveness for high-dimensional, partially observable 3D environments is still unproven. In 3D environments, different viewpoints of the same scene can map to distinct features, so it is difficult to identify a good mapping between the visual state and the feature space, a problem further exacerbated by the fact that useful state abstractions are highly task-dependent. In their paper, the DeepMind researchers call the acquisition of environment representations that support effective exploration a "chicken and egg" problem: an agent can only judge whether two states should be considered similar or different once it has actually explored its environment.
Now, what if an agent's representations came in the form of abstractions derived from language models? This is what the DeepMind researchers did. They hypothesized that representations acquired through vision-language pre-training support effective exploration of 3D environments because they are shaped by the abstract nature of language.
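One way to picture this is to swap the random projection in a novelty scheme for a frozen vision-language encoder, then reward the agent for reaching observations whose embeddings are far from anything seen so far. The sketch below is a hypothetical illustration, not DeepMind's method: `vlm_embed` stands in for a real pre-trained encoder (here it is just a fixed random projection so the example runs), and the nearest-neighbor episodic memory is one simple way to turn embeddings into a novelty reward:

```python
import numpy as np

def vlm_embed(observation):
    """Hypothetical stand-in for a frozen, pre-trained vision-language
    encoder. A real system would use an image-captioning model's visual
    backbone; here a fixed random projection keeps the sketch runnable."""
    rng = np.random.default_rng(42)  # fixed seed: the encoder is frozen
    W = rng.normal(size=(64, 8))
    return observation @ W

class EpisodicNovelty:
    """Novelty measured in embedding space: the intrinsic reward is the
    distance from the current embedding to the nearest one in memory."""

    def __init__(self):
        self.memory = []

    def intrinsic_reward(self, observation):
        z = vlm_embed(observation)
        if not self.memory:
            self.memory.append(z)
            return 1.0
        d = min(np.linalg.norm(z - m) for m in self.memory)
        self.memory.append(z)
        return float(d)

novelty = EpisodicNovelty()
obs = np.ones(64)
first = novelty.intrinsic_reward(obs)   # empty memory: maximal novelty
second = novelty.intrinsic_reward(obs)  # identical observation: no novelty
```

The hypothesis tested in the paper is that distances in a language-shaped embedding space track task-relevant differences between states better than distances in a random feature space.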
The researchers were able to demonstrate that language abstractions and pre-trained vision-language representations improve the sample efficiency of existing exploration methods. The advantage holds across on- and off-policy algorithms, different 3D domains, and varied task specifications. The team also designed control experiments to understand how language contributes to better exploration and found that their results are consistent with cognitive-science perspectives on the usefulness of human language.
Read the full article here.
Using language to guide exploration is not a new idea; many researchers have explored this direction. In 2021, a group of Stanford researchers introduced Exploration through Learned Language Abstraction (ELLA), a reward-shaping approach that increases sample efficiency in sparse-reward environments by correlating high-level instructions with their simpler constituents. Specifically, the approach has two important components: a completion classifier that identifies when agents complete low-level instructions, and a relevance classifier that correlates low-level instructions with the success of high-level tasks.
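The interaction of those two components can be sketched as reward shaping: a bonus is paid only when a low-level instruction is both completed and relevant to the high-level task. In the sketch below, both classifiers are toy stand-ins (substring membership and word overlap); in ELLA they are trained neural classifiers, and the `bonus` value is an arbitrary illustrative choice:

```python
def completion_classifier(segment, instruction):
    # Stand-in: an instruction counts as completed if it appears in the
    # trajectory segment's event log. ELLA trains a classifier for this.
    return instruction in segment

def relevance_classifier(instruction, task):
    # Stand-in: relevant if the instruction shares a word with the task.
    # ELLA learns this relation from data.
    return bool(set(instruction.split()) & set(task.split()))

def shaped_reward(env_reward, segment, low_level_instructions, task, bonus=0.1):
    """Add a bonus for each completed low-level instruction that is
    relevant to the high-level task."""
    shaping = 0.0
    for instr in low_level_instructions:
        if completion_classifier(segment, instr) and relevance_classifier(instr, task):
            shaping += bonus
    return env_reward + shaping

r = shaped_reward(
    env_reward=0.0,
    segment=["pick up key"],                       # events so far this segment
    low_level_instructions=["pick up key", "open door"],
    task="use key to open door",
)
```

The shaping term densifies an otherwise sparse reward: the agent gets feedback for relevant sub-steps long before the high-level task succeeds.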
Earlier this year, a group of researchers published a paper titled "Improving Intrinsic Exploration with Language Abstractions," which showed how natural language can be used as a general means of highlighting relevant abstractions in an environment. This work differed from previous efforts because the researchers assessed whether language could improve on existing exploration methods by extending it directly to competitive exploration baselines such as AMIGo and NovelD. They saw a 45-85% improvement over the non-linguistic counterparts.