Davide Paglieri1, Bartłomiej Cupiał2∗, Samuel Coward3, Ulyana Piterbarg4,
Maciej Wolczyk2, Akbir Khan1,5, Eduardo Pignatelli1, Łukasz Kuciński2, Lerrel Pinto4,
Rob Fergus4, Jakob Nicolaus Foerster3, Jack Parker-Holder1, Tim Rocktäschel1
1AI Centre, University College London, 2IDEAS NCBR, 3University of Oxford,
4New York University, 5Anthropic
∗Equal technical contribution; the first author was the project lead. Correspondence to d.paglieri@cs.ucl.ac.uk.
Code and Leaderboard at balrogai.com
Abstract
Large Language Models (LLMs) and Vision Language Models (VLMs) possess extensive knowledge and exhibit promising reasoning abilities; however, they still struggle to perform well in complex, dynamic environments. Real-world tasks require handling intricate interactions, advanced spatial reasoning, long-term planning, and continuous exploration of new strategies—areas in which we lack effective methodologies for comprehensively evaluating these capabilities. To address this gap, we introduce BALROG, a novel benchmark designed to assess the agentic capabilities of LLMs and VLMs through a diverse set of challenging games. Our benchmark incorporates a range of existing reinforcement learning environments of varying difficulty, from tasks that non-expert humans can solve in seconds to extremely challenging ones that may take years to master (e.g., the NetHack Learning Environment). We devise fine-grained metrics to measure performance and conduct an extensive evaluation of several popular open-source and closed-source LLMs and VLMs. Our findings indicate that while current models achieve partial success in the easier games, they struggle significantly with more challenging tasks. Notably, we observe severe deficiencies in vision-based decision-making, as models perform worse when visual representations of the environments are provided. We release BALROG as an open and user-friendly benchmark to facilitate future research and development in the agentic community.
1 Introduction
Recent successes of Large Language Models (LLMs) have renewed interest in building general-purpose agents capable of autonomously achieving complex goals (Yang et al., 2023). LLMs possess vast knowledge across domains (Brown, 2020; Hendrycks et al., 2020), can reason in specific scenarios (Wei et al., 2022a; Shinn et al., 2023; Rein et al., 2023), and can reliably follow human instructions in simple settings (Ouyang et al., 2022). These abilities suggest that LLMs have the potential to become efficient agents, capable of autonomously performing a wide range of human tasks that require sequential decision-making. Today, however, state-of-the-art models continue to exhibit persistent failure modes on many of the skills that are crucial for autonomous real-world interaction. For example, LLMs fail to act robustly in dynamic environments, and they cannot reliably learn from mistakes, reason about space and time, or plan over long time horizons (Xing et al., 2024; Yamada et al., 2023; Kambhampati et al., 2024). Improving our understanding of LLM capabilities through rigorous, safe evaluations is key to assessing the risks and limitations of deploying agentic LLMs in the real world.
Current agentic benchmarks evaluate LLM performance in settings that involve no more than a few dozen rounds of interaction between a model and an environment, e.g., solving simple office tasks (Wang et al., 2024), navigating the Internet (Zhou et al., 2023), and resolving GitHub issues (Jimenez et al., 2023). New agentic prompting frameworks and improvements to short-horizon reasoning via LLMs like OpenAI o1 have led to dramatic and fast-paced gains in state-of-the-art performance on these benchmarks (OpenAI, 2024b; Wang et al., 2023; Fernando et al., 2023; Hu et al., 2024). However, many realistic tasks require orders of magnitude more interactions (Pignatiello et al., 2020; Wansink and Sobal, 2007).
In this paper, we argue that the next frontier for language and vision-language model capabilities lies in long-horizon reasoning and decision-making. To that end, we propose BALROG: Benchmarking Agentic LLM/VLM Reasoning On Games. BALROG is a benchmark and framework that aggregates a diverse set of complex reinforcement learning game environments into a unified testbed for research on long-context LLMs. Games have historically served as highly effective metrics for evaluating progress in deep reinforcement learning research (Bellemare et al., 2013; Silver et al., 2018; Schrittwieser et al., 2020; Vinyals et al., 2019). By aggregating many different game environments into a single evaluation, we look to spur progress on developing truly generalist agents that can meaningfully address embodied, real-world tasks. Specifically, BALROG enables seamless running of LLM and VLM agents on BabyAI, Crafter, TextWorld, Baba Is AI, MiniHack, and NetHack (Chevalier-Boisvert et al., 2019; Hafner, 2021; Côté et al., 2019; Cloos et al., 2024; Samvelyan et al., 2021; Küttler et al., 2020). These environments have lightweight simulators, ensuring that the benchmark is affordable for the research community. Furthermore, while all of these games are long-horizon, they span a broad range of difficulty levels, from tasks where we see fair zero-shot performance by state-of-the-art long-context models (BabyAI) to those where even specialized neural models trained on billions of in-domain datapoints make very limited progress (NetHack) (Piterbarg et al., 2024; Klissarov et al., 2023; Wołczyk et al., 2024). BALROG is difficult to solve through simple memorization: all of the environments used in the benchmark are procedurally generated, and encountering the same instance of an environment twice is unlikely.
Using the six proposed environments, we evaluate the capabilities of various popular LLMs and VLMs. We employ a fine-grained metric that captures how close each model is to completing a task, which gives us a thorough understanding of the resulting trajectories. In our qualitative analysis, we study the agents’ capabilities for spatial reasoning, systematic exploration, long-term planning, and discovering environment dynamics. We find that the current top LLMs show promise on the simplest tasks but completely fail to make meaningful progress on the more difficult tasks, such as MiniHack and NetHack. Some models exhibit knowledge about the game from pre-training but fail to use it in practice. For example, in NetHack, GPT-4o often dies from consuming rotten food, even though, when prompted, it correctly identifies rotten food as very dangerous. Furthermore, we study the impact of the input representation. Although the majority of the environments were created with vision in mind, we find that multimodal LLMs perform much worse when presented with an image of the environment in addition to a text-only description of the observation. This suggests that reliable vision-based decision-making is currently far out of reach.
Our results show that BALROG is a very difficult benchmark that nevertheless allows us to observe fine-grained progress in crucial areas such as long-term planning, spatial reasoning, and navigation. We share the codebase and open the benchmark for external submissions. We summarize our contributions as follows:
- BALROG, a suite of six reinforcement learning environments for testing the agentic capabilities of long-context LLMs. We provide a fine-grained metric for model evaluation, and we develop a novel data-informed progression system for NetHack.
- Baseline evaluations of state-of-the-art LLMs on BALROG using zero-shot prompting, in both vision-language and language-only modalities. We show that while models exhibit decent performance on easier games, all are very far from solving the hardest game in the benchmark, NetHack. We observe that performance drops further when images of the environment are presented, suggesting severe problems with VLM decision-making.
- A qualitative analysis of the results across capabilities such as spatial reasoning, systematic exploration, and long-term planning. We identify an intriguing knowing-doing gap where the models cannot employ the knowledge they possess.
- An open-source toolkit for benchmarking long-context models on BALROG. This toolkit enables researchers and practitioners to quickly evaluate model performance. While the baseline evaluations performed in this paper are zero-shot, the BALROG toolkit supports inference-time prompting strategies like chain-of-thought (Wei et al., 2022b), few-shot learning, and more.
2 BALROG
BALROG is a benchmark and framework that aims to improve our understanding of whether existing long-context LLMs are agentic, i.e., whether they can be used to automate complex activities that require sequential decision-making. It supports model evaluation on challenging reinforcement learning environments that test skills such as long-term planning, spatial reasoning, and the ability to deduce the mechanics of the environment.
By design, the BALROG framework explicitly decouples inference-time prompting strategies from underlying models. The goal of this design choice is two-fold: (1) to facilitate rapid prototyping of inference-time methods for improving model performance on long-context decision-making beyond zero-shot prompting and (2) to ensure that model evaluations are consistent and rigorous.
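To make this decoupling concrete, the sketch below shows one way an agent could be assembled from an interchangeable model client and prompting strategy. The class and method names are illustrative assumptions and do not correspond to the actual BALROG API.

```python
# Illustrative sketch only: the class and method names are hypothetical
# and do not reflect the actual BALROG codebase.
from abc import ABC, abstractmethod


class ModelClient(ABC):
    """Wraps a specific LLM/VLM backend (API-based or local) behind one interface."""

    @abstractmethod
    def generate(self, messages: list) -> str:
        """Return the raw text completion for a chat-style message list."""


class PromptingStrategy(ABC):
    """Turns an interaction history into prompts and parses the next action."""

    @abstractmethod
    def act(self, client: ModelClient, history: list) -> str:
        """Query the client and return the next action as a natural-language string."""


class ZeroShotStrategy(PromptingStrategy):
    """Zero-shot baseline: pass the history through unchanged."""

    def act(self, client: ModelClient, history: list) -> str:
        # Take the first line of the completion as the action.
        return client.generate(history).strip().splitlines()[0]
```

Under this kind of interface, a new prompting strategy can be evaluated against every supported model without touching the evaluation loop, and vice versa.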
In the remainder of this section, we introduce the game environments evaluated in the benchmark and we discuss our protocols for model submission to the BALROG Benchmark Leaderboard (the Leaderboard will open to the public at the time of publication).
2.1 Environments
BALROG evaluates long-context models as agents on the games described below.
BabyAI (Chevalier-Boisvert et al., 2019; Carta et al., 2023). A simple, two-dimensional grid-world in which the agent has to solve tasks of varying complexity described in natural language (e.g., “go to the blue ball, then pick up the grey key”). Agents are tested across five different types of navigation tasks; see Appendix A.
Crafter (Hafner, 2021). A Minecraft-inspired grid environment where the player has to explore, gather resources, and craft items to ensure their survival. Agents are evaluated based on the number of achieved milestones, such as discovering new resources and crafting tools; see Appendix B.
TextWorld (Côté et al., 2019). An entirely text-based game with no visual component, where the agent has to explore mazes and interact with everyday objects through natural language (e.g., “cook potato with oven”). Unlike the other environments in BALROG, TextWorld is not a grid-world. Models are evaluated on three different tasks; see Appendix C.
Baba Is AI (Cloos et al., 2024). An environment based on the popular puzzle video game Baba Is You. The player manipulates the rules of the game world by pushing word blocks, altering how objects interact. Agents are tested on 40 puzzles; see Appendix D.
MiniHack (Samvelyan et al., 2021). A multi-task framework built on top of the NetHack Learning Environment (Küttler et al., 2020). We select five tasks: Maze, Corridor, CorridorBattle, Boxoban, and Quest. Collectively, they assess a wide range of skills, including exploration, navigation, long-term planning, and resource management; see Appendix E.
NetHack Learning Environment (NLE) (Küttler et al., 2020). Based on the classic roguelike game NetHack, known for its extreme difficulty and complexity. Success in NetHack demands both long-term strategic planning, since a winning game can involve hundreds of thousands of steps, and short-term tactics to fight hordes of monsters. Accurate credit assignment is also crucial for understanding which actions contributed to success or failure. It takes human players years to master NetHack without consulting external guides. Notably, LLMs can answer questions about the game’s mechanics and optimal strategies (see Appendix F.5), yet they fail to apply this knowledge in practice. See Appendix F for more details.
Skills | BabyAI | TextWorld | Crafter | Baba Is AI | MiniHack | NLE |
Navigation | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Exploration | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Resource Management | ✗ | ✔ | ✔ | ✗ | ✔ | ✔ |
Complex Credit Assignment | ✗ | ✗ | ✔ | ✔ | ✔ | ✔ |
Deducing Env. Dynamics | ✗ | ✗ | ✗ | ✔ | ✔ | ✔ |
Long-term Planning | ✗ | ✗ | ✗ | ✔ | ✔ | ✔ |
Turns to Complete | – | – | – | – | – | – |
Time to Master for Humans | Seconds | Minutes | Hours | Hours | Hours | Years |
Table 1 provides an overview of the environments used in the benchmark, detailing the reasoning and agentic capabilities required to succeed in each. This diverse set of environments positions BALROG as a comprehensive benchmark for assessing the capabilities of LLM agents, making it a valuable tool for evaluating their performance for years to come.
2.2 Submitting to the Benchmark Leaderboard
The BALROG benchmark accepts two types of submissions.
New Models. Submissions may include any type of new model, such as large language models (LLMs), vision-language models (VLMs), large-action models (LAMs), or fine-tuned versions of existing models. The key requirement is that these models must be capable of generating actions in natural language. By default, these models will be evaluated zero-shot.
Agentic Strategies. Submissions may propose novel inference-time prompting strategies for improving the reasoning, planning, or in-context learning capability of an existing model. These strategies should extend beyond simple zero-shot prompting for direct action prediction, demonstrating more sophisticated techniques for inference-time decision-making.
3 Zero-Shot Evaluation Protocol
In this section, we provide a description of our protocols for evaluating state-of-the-art, long-context LLMs and VLMs on BALROG. These evaluations are intended to serve as baselines for the benchmark. As a result, they probe zero-shot performance only.
3.1 Evaluation Setting
We aim to keep the evaluation setting simple. During each timestep of interaction, agents are prompted to output the next action as a natural language string, conditioned on their past interaction history in the environment. To perform successfully in BALROG, models must demonstrate robust instruction-following capabilities, including reading and interpreting game rules, understanding the action space, and producing valid actions to complete tasks effectively.
To address cases where the LLM/VLM outputs a hallucinated or invalid action, BALROG provides feedback to the agent indicating that the action is invalid, then executes a default fallback action (such as a “do-nothing” action or a standard move like “north”), and logs the occurrence for trajectory statistics. This keeps the interaction continuous and robust while enabling users to analyze the context and frequency of such errors in post-evaluation analysis.
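A minimal sketch of this fallback logic is shown below, assuming a gym-style environment interface; the action set, fallback choice, and statistics structure are illustrative assumptions rather than the exact BALROG implementation.

```python
# Minimal sketch of invalid-action handling, assuming a gym-style env.step();
# the fallback action and logging format are illustrative assumptions.
def step_with_fallback(env, agent_action: str, valid_actions: set,
                       fallback_action: str, stats: dict):
    """Execute the agent's action if valid; otherwise fall back and log the error."""
    feedback = None
    if agent_action not in valid_actions:
        stats["invalid_actions"] = stats.get("invalid_actions", 0) + 1
        stats.setdefault("invalid_examples", []).append(agent_action)
        feedback = (f"'{agent_action}' is not a valid action; "
                    f"executing '{fallback_action}' instead.")
        agent_action = fallback_action
    observation, reward, done, info = env.step(agent_action)
    # The feedback string is appended to the next prompt so the agent can self-correct.
    return observation, reward, done, info, feedback
```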
A diagrammatic visualization of BALROG is shown in Figure 1. We conceptualize the agent as a combination of the underlying LLM/VLM model and a particular prompting strategy. We provide a unified client wrapper that seamlessly integrates APIs for closed-source LLMs and VLMs such as OpenAI, Gemini, and Claude, allowing users to switch between and evaluate models effortlessly. For evaluating locally served models, we include native support for the vLLM library (Kwon et al., 2023), which optimizes throughput by efficiently batching generation requests. We use multiple seeds for each environment to ensure statistically meaningful results.
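As a rough illustration of how such a client wrapper might route requests, the sketch below dispatches on the model name to the corresponding backend. The prefix-based routing rules and constructor calls are assumptions for illustration, not the wrapper shipped with BALROG.

```python
# Hedged sketch of a unified client factory; the prefix-based routing and
# constructor arguments are assumptions, not the actual BALROG wrapper.
def make_client(model_name: str):
    if model_name.startswith(("gpt-", "o1-")):
        from openai import OpenAI
        return "openai", OpenAI()
    if model_name.startswith("claude-"):
        import anthropic
        return "anthropic", anthropic.Anthropic()
    if model_name.startswith("gemini-"):
        import google.generativeai as genai
        return "gemini", genai.GenerativeModel(model_name)
    # Everything else is served locally through vLLM, which batches requests.
    from vllm import LLM
    return "vllm", LLM(model=model_name)
```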
Metrics. To ensure a fair and interpretable evaluation, we introduce a standardized metric that scores performance on each task within a range of 0 to 100. For environments like MiniHack, BabyAI, and Baba Is AI, each episode is scored as either 0 or 100 based on task completion. For TextWorld, Crafter, and NetHack, the score is a real number between 0 and 100 representing the proportion of progress toward the maximum achievable score. For NetHack, since the in-game scoring system does not adequately reflect actual progression (Wołczyk et al., 2024), we propose a novel, data-informed progression metric, described in Appendix F.2, to better capture agent performance.
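The following toy snippet illustrates how such per-episode scores could be normalized and aggregated; the scoring rules paraphrase the text above and are not the exact BALROG formulas (the NetHack progression metric, in particular, is defined separately in Appendix F.2).

```python
# Toy illustration of the 0-100 progression metric; not the exact BALROG formulas.
import statistics


def episode_score(achieved: float, maximum: float, binary: bool) -> float:
    """Return the per-episode progression in [0, 100]."""
    if binary:  # MiniHack, BabyAI, Baba Is AI: full credit only on task completion
        return 100.0 if achieved >= maximum else 0.0
    # TextWorld, Crafter, NetHack: fraction of the maximum achievable progress
    return 100.0 * min(achieved / maximum, 1.0)


def average_progress(scores: list) -> tuple:
    """Mean progression and its standard error across episodes/seeds."""
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / len(scores) ** 0.5 if len(scores) > 1 else 0.0
    return mean, sem
```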
Performance. BALROG supports highly parallelized evaluations, leveraging the lightweight simulators of the environments in the suite. Multiple agents and environment instances can run concurrently with minimal computational overhead. Environment instances run asynchronously from one another, accommodating varying observation lengths and ensuring that agents with faster per-action generation speeds are not bottlenecked by slower agents.
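A simplified sketch of such a parallel evaluation loop is given below, using thread-based concurrency; the rollout helper and agent/environment interfaces are illustrative assumptions rather than the benchmark’s actual runner.

```python
# Simplified sketch of concurrent episode rollouts; the env/agent interfaces
# are illustrative assumptions, not BALROG's actual runner.
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_episode(env_factory, agent, seed: int) -> float:
    env = env_factory(seed=seed)
    obs, done, score = env.reset(), False, 0.0
    while not done:
        action = agent.act(obs)                 # each agent proceeds at its own pace
        obs, reward, done, _ = env.step(action)
        score += reward
    return score


def evaluate(env_factory, agent, seeds: list, workers: int = 8) -> list:
    # Episodes run asynchronously, so a slow model never blocks faster ones.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_episode, env_factory, agent, s) for s in seeds]
        return [f.result() for f in as_completed(futures)]
```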
3.2 Observations
In the initial prompt, the agent is introduced to the game rules and provided with a list of available actions, each accompanied by a brief description. To avoid overspecialization, we design a general prompt that is not tailored to any specific LLM. Subsequent prompts present the observation-action history in a chat-based format. The game rules and observations are conveyed from the perspective of the “user”, while prior actions are attributed to the “assistant” or “model” role, depending on the type of model used. This structure mirrors the standard format used for fine-tuning instruction-following LLMs. Detailed examples of game observations are included in the appendices.
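The sketch below shows how such a chat-formatted prompt could be assembled from the rules, the action list, and the observation-action history; the exact strings are illustrative assumptions, not the prompts used in our experiments.

```python
# Illustrative assembly of the chat-style prompt; the wording of the strings
# is an assumption, not the exact prompt used in the paper.
def build_messages(rules: str, actions_doc: str,
                   history: list, current_obs: str) -> list:
    """history is a list of (observation, action) pairs, oldest first."""
    messages = [{"role": "user",
                 "content": f"{rules}\n\nAvailable actions:\n{actions_doc}"}]
    for observation, action in history:
        messages.append({"role": "user", "content": observation})
        messages.append({"role": "assistant", "content": action})  # "model" for Gemini
    messages.append({"role": "user", "content": current_obs})
    return messages
```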
Except for TextWorld, which lacks a visual component, we evaluate all environments using two observation modalities:
Language-Only Format. Observations are expressed as natural language descriptions of the environment’s state (e.g., “a wall 5 steps ahead, a wall 2 steps to the left…”). For environments without native textual representations, we either generate descriptions using open-source language wrappers (BabyAI (Carta et al., 2023), Crafter (Wu et al., 2023), NetHack, and MiniHack (Goodger et al., 2023)) or develop a custom wrapper ourselves (Baba Is AI; see Appendix D).
Vision-Language Format. For VLMs, the observation consists of an image representing the environment’s current state, alongside its natural language description (as above). In this format, the image corresponds only to the current observation, although we support including multiple images in the observation history.
For the most complex environments, i.e., MiniHack and NetHack, we augment the language-based observations with a two-dimensional map rendered using ASCII characters. For all experiments, we use a history length of 16 observations to maintain consistency across tasks. However, participants submitting to this benchmark are allowed to modify the observation history length as needed for their respective models and experiments.
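As a rough sketch, the snippet below assembles a vision-language observation with a 16-step text history and a single image attached to the current step; the base64 message schema follows the common OpenAI-style format and is an assumption about the concrete API being called.

```python
# Hedged sketch of a vision-language observation; the message schema follows
# the common OpenAI-style format and is an assumption, not the BALROG internals.
import base64

HISTORY_LENGTH = 16  # number of past observations kept in the prompt


def build_vlm_turn(text_history: list, current_text: str, current_png: bytes) -> list:
    recent = text_history[-HISTORY_LENGTH:]           # keep only the last 16 observations
    encoded = base64.b64encode(current_png).decode()  # image for the current step only
    content = [
        {"type": "text", "text": "\n\n".join(recent + [current_text])},
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{encoded}"}},
    ]
    return [{"role": "user", "content": content}]
```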
3.3 Models
We evaluate a range of popular closed-source and open-source models, including Gemini-1.5-Flash and Gemini-1.5-Pro (Reid et al., 2024), GPT-4o-mini (2024-07-18 release) and GPT-4o (2024-05-13 release) (Achiam et al., 2023; OpenAI, 2024a), Claude 3.5 Sonnet (Anthropic, 2024), as well as Llama 3.1 Instruct (8B and 70B) (Dubey et al., 2024) and Llama 3.2 Instruct (1B, 3B, 11B, and 90B) (MetaAI, 2024). Additionally, we test o1-mini (2024-09-12 release) and o1-preview (2024-09-12 release) (OpenAI, 2024b) exclusively on the NetHack environment due to budget constraints.
4 Results
In Figure 2, we present the results of our experiments using the BALROG evaluation script for both language-only and vision-language formats. Most leading models demonstrate fair average progression on BabyAI, Crafter, and Baba Is AI, with GPT-4o performing best. Interestingly, the open-source Llama 3.1 70B and Llama 3.2 90B models achieve the highest results on the Baba Is AI language-only format, narrowly surpassing GPT-4o and Claude 3.5 Sonnet. In TextWorld, GPT-4o and Claude 3.5 Sonnet lead, while Gemini models fail to complete any tasks because their requests are flagged as ‘unsafe’ by the Google Gemini API, despite the prompts containing no actual safety concerns. The MiniHack suite proves very challenging for all models, especially the Quest and Boxoban tasks, which no model solved. Finally, all models flatline on NetHack, with the best-performing model, o1-preview, achieving a meager 1.5% average game progression.
Table 2 summarizes the aggregated results across all environments in the language-only format. Overall, GPT-4o is the best-performing model, with an average progression of 31.62%, followed closely by Claude 3.5 Sonnet and Llama 3.1 70B. Gemini-1.5-Pro lags behind the other large models, partly due to its 0% performance on TextWorld. However, results differ for the vision-language format, as shown in Table 3. Here, both GPT-4o and Llama 3.2 exhibit a decline in performance when image observations are included, likely due to confusion arising from the added visual input. In contrast, Gemini-1.5-Pro and, especially, Claude 3.5 Sonnet maintain consistent performance across both formats. This suggests that current multimodal Transformer architectures are still better at handling textual information than visual input, a topic we explore further in Section 6. Additionally, Llama 3.1 70B outperforms the larger and more recent Llama 3.2 90B in the language-only format, suggesting that the introduction of visual processing in the latter may have negatively impacted its linguistic and reasoning capabilities. We report more detailed per-environment results in the corresponding appendices.
Model | Average Progress (%) |
gpt-4o | 32.34 ± 1.49 |
claude-3.5-sonnet | 29.98 ± 1.98 |
llama-3.1-70b-it | 27.88 ± 1.43 |
llama-3.2-90B-it | 23.66 ± 1.09 |
gemini-1.5-pro | 21.00 ± 1.18 |
gpt-4o-mini | 17.36 ± 1.35 |
llama-3.1-8b-it | 14.14 ± 1.51 |
llama-3.2-11B-it | 13.54 ± 1.05 |
gemini-1.5-flash | 9.73 ± 0.77 |
llama-3.2-3B-it | 8.47 ± 1.12 |
llama-3.2-1B-it | 6.32 ± 1.00 |
Model | Average Progress (%) |
claude-3.5-sonnet | 29.08 ± 2.21 |
gemini-1.5-pro | 25.76 ± 1.36 |
gpt-4o | 22.56 ± 1.44 |
gpt-4o-mini | 15.36 ± 1.29 |
gemini-1.5-flash | 14.94 ± 1.40 |
llama-3.2-90B-it | 13.43 ± 1.16 |
llama-3.2-11B-it | 6.91 ± 0.84 |
4.1 Qualitative analysis
We conducted an analysis of the model trajectories across the environments to identify common behaviors and challenges specific to each setting.
Spatial Reasoning. While language models demonstrate some proficiency in basic navigation, they exhibit significant limitations in more complex spatial reasoning tasks. In the BabyAI suite, we observed notable shortcomings in the agents’ ability to place objects adjacent to other objects, which some scenarios require. In NetHack and MiniHack CorridorBattle, good spatial reasoning is crucial during combat, as players need to maneuver within confined corridors to avoid being surrounded by monsters. However, the agents frequently ended up cornered.
Systematic Exploration. Our experiments revealed a significant weakness in the models’ ability to explore. In TextWorld’s Coin Collector, where agents must explore a house to locate a coin, agents often wander aimlessly, revisiting rooms they have already explored while missing important areas entirely. An efficient agent would behave in a DFS-like manner, methodically searching each room, keeping track of visited areas, and prioritizing unexplored spaces. The more complex quests in MiniHack expose similar issues, with models failing to efficiently navigate maze-like structures.
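For contrast, the toy sketch below spells out the DFS-like exploration policy an efficient agent would follow in Coin Collector; the room-graph interface is hypothetical and serves only to make the intended behavior explicit.

```python
# Toy sketch of DFS-like exploration over a hypothetical room graph.
def dfs_explore(start_room, get_exits, contains_coin):
    """Return the sequence of rooms visited until the coin is found."""
    visited, stack, path = set(), [start_room], []
    while stack:
        room = stack.pop()
        if room in visited:
            continue                      # never revisit an explored room
        visited.add(room)
        path.append(room)
        if contains_coin(room):
            return path                   # stop as soon as the goal is reached
        # Push unvisited neighbours so unexplored space is prioritized.
        stack.extend(r for r in get_exits(room) if r not in visited)
    return path
```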
Long-term Planning. The agents exhibit substantial deficiencies in devising and executing long-term plans. We observe near-zero performance on MiniHack and NLE, both of which require careful planning. In particular, we do not observe a single successful trajectory on the Boxoban logical puzzles in MiniHack, which require careful planning at every step to avoid irreversible failures. LLMs, with the finite amount of compute available to them in a single forward pass, are necessarily confined to solving some subset of reasoning problems. We observe that with current models’ depth, per-forward-pass FLOPs, and the reasoning templates embedded in their weights, these models cannot solve the harder reasoning tasks in BALROG. We see a notable improvement with OpenAI o1’s chain-of-thought capabilities on NetHack, where it performs close to three times better than its closest language-only competitor, Claude 3.5 Sonnet. However, its average progression of 1.57% is still far from satisfactory.
Discovering and Leveraging Environment Dynamics. Some games require inferring non-trivial causal structure through experimentation in order to devise new strategies. For example, a player might identify a potion of paralysis by drinking it, and then realize they can use it strategically by throwing such potions at enemies to incapacitate them. This kind of experimentation and strategic thinking is crucial for success in NetHack. However, current models struggle to formulate and execute such context-dependent strategies. In the MiniHack Quest environments, models fail to devise and implement multi-step strategies, such as using a wand of cold or a ring of levitation to cross lava rivers. In Crafter, agents can handle basic tasks such as collecting wood, crafting items, drinking water, and even engaging in combat, but they fail to learn long-term survival skills such as building shelters for protection against nocturnal threats.
Knowing-Doing Gap. We observe a pronounced “knowing-doing” gap, where models execute undesirable actions during gameplay despite knowing their negative consequences. For instance, in NetHack, models often exit the dungeon shortly after starting the game, resulting in instant game termination. When queried in a separate thread about the consequences of exiting the first level in NetHack, they correctly identify that it results in instant death, making it a highly undesirable action. Similarly, although the models correctly identify that eating rotten food in NetHack can result in death, this remains a common cause of failure, underscoring a disconnect between knowledge and decision-making. Additionally, models tend to ignore even hints directly present in the input prompt, dying from overeating even when advised against it. To study this problem in more detail, we prepared a questionnaire probing basic NetHack knowledge (see Appendix F.5).
5 Related Work
The evaluation of large language models has historically relied on benchmarks that emphasize static, non-interactive tasks. Benchmarks such as SuperGLUE (Wang et al., 2019), which tests general-purpose language understanding, and MMLU (Hendrycks et al., 2020), which measures massive multitask language understanding, have been instrumental in advancing LLM research. BigBench (Srivastava et al., 2022) further expands the scope by including a diverse set of linguistic and cognitive challenges. Mathematical reasoning datasets like GSM8K and MATH (Cobbe et al., 2021; Hendrycks et al., 2021) assess models’ abilities to solve grade-school and competition-level math problems, while Shi et al. (2022) explore multilingual chain-of-thought reasoning. In the domain of code understanding and generation, benchmarks such as HumanEval (Chen et al., 2021) and CodeXGLUE (Lu et al., 2021) evaluate models’ capabilities on programming tasks.
These benchmarks, however, are limited to single-turn or short-context scenarios, do not require sequential decision-making or adaptation to changing environments, and have been saturating rapidly (Kiela et al., 2021). Static benchmarks may not fully capture the progress we are seeking, since the research community aims to push the frontier of agentic foundation models capable of acting in dynamic environments, using tools, planning ahead, and reasoning about their surroundings. Researchers have recently investigated how LLMs use these skills to solve practical tasks, including using computer interfaces to perform office-related chores (Wang et al., 2024; Qin et al., 2024), navigating web pages (Yao et al., 2022; Zhou et al., 2023), and resolving GitHub issues (Jimenez et al., 2023). Several works have studied the multi-agent capabilities of LLMs to see whether they can cooperate (Gong et al., 2023; Piatti et al., 2024) or effectively play against other agents (Jin et al., 2024; Wu et al., 2024).
In this work, we study agentic skills in the context of video games, as they offer challenges well-tailored to human players and test skills that are useful for embodied agents. Some related works have previously employed games to benchmark LLMs (Liu et al., 2023b; Todd et al., 2024; Wu et al., 2023), highlighting their emphasis on problem-solving, spatial reasoning, and well-defined rules and objectives. Some of these benchmarks, however, are already reaching saturation, with environments like Crafter being the most challenging in their suite. In contrast, BALROG fills an important gap by providing a wide range of games at varying difficulties, including the NetHack Learning Environment (Küttler et al., 2020), which takes humans years to master and on which zero-shot LLMs struggle greatly, as also seen in prior work (Jeurissen et al., 2024). These tasks represent a rich and granular testbed for evaluating agentic foundation models, pushing decision-making evaluations of LLMs/VLMs to the very limit of their context lengths. Other environments such as MineDojo (Fan et al., 2022) and MineRL (Guss et al., 2019) also present open-ended challenges for agentic capabilities, but their steep computational requirements and reliance on multimodal inputs make them less practical for accessible, large-scale benchmarks.
While BALROG currently focuses on evaluating single-agent foundational capabilities, future extensions could explore multi-agent collaboration environments that provide unique opportunities to test teamwork and coordination skills in LLMs. For example, Overcooked (Carroll etal., 2019; Liu etal., 2023a) simulates a cooperative cooking environment where agents must collaborate efficiently under time constraints and task dependencies, testing planning and communication abilities. Another compelling environment is Hanabi (Bard etal., 2020), a cooperative card game where players must rely on indirect communication and inferential reasoning to achieve a shared objective under partial observability. These environments present rich opportunities to benchmark advanced collaboration and multi-agent decision-making skills, which are essential for broader deployment of agentic LLMs.
6 Open Research Problems
Aside from its utility for model evaluations, BALROG also offers a test-bed for rapidly prototyping new inference-time methods for improving the agentic capabilities of LLMs and VLMs. There are many open research problems in this space. As of the writing of this paper, some of the most performant methods for improving model reasoning capabilities on short-form and/or shorter-context problems are infeasible to apply naively to BALROG due to the extremely long-context nature of tasks. Addressing these challenges could further enhance the development of stronger autonomous agents. We highlight several key areas for future work below.
In-Context Learning and Few-Shot Prompting
BALROG enables evaluation of In-Context Learning (ICL) agents, which can use few-shot examples to adapt to out-of-distribution tasks. We provide a small dataset of human demonstrations for each environment and an implementation of few-shot conditioning in the BALROG codebase. The benchmark codebase also supports the study of In-Context Reinforcement Learning (Lee et al., 2024; Laskin et al., 2022; Lin et al., 2023), where agents learn to improve from mistakes during inference. For the large models benchmarked in Section 4, naive few-shot learning (i.e., prompting LLM and VLM agents with examples of full human games in-context) is extremely computationally expensive to run on BALROG. For example, a single demonstration of NetHack gameplay can require an extremely large number of input tokens to represent in a prompt. Despite advancements in fast inference technologies like caching and falling API costs for long-context prompting, we found these experiments to be infeasible to conduct at this time. Sub-selecting only the relevant parts of demonstrations via retrieval-augmented few-shot prompting strategies (Lewis et al., 2020) might offer a way to circumvent these challenges. We leave exploration of such methods for future work.
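One illustrative way to sub-select demonstration content is sketched below, using a simple bag-of-words similarity as a stand-in for a real retriever; the chunking scheme and similarity function are assumptions, not a method used in the paper.

```python
# Illustrative retrieval-augmented few-shot selection; bag-of-words similarity
# is a stand-in for a real retriever, and the chunking scheme is an assumption.
from collections import Counter


def _similarity(a: str, b: str) -> float:
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    overlap = sum((ca & cb).values())
    return overlap / max(1, min(sum(ca.values()), sum(cb.values())))


def retrieve_demo_chunks(current_obs: str, demo_chunks: list, k: int = 3) -> list:
    """Pick the k demonstration snippets most relevant to the current observation."""
    ranked = sorted(demo_chunks, key=lambda c: _similarity(current_obs, c), reverse=True)
    return ranked[:k]  # only these snippets go into the prompt, not full games
```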
Advanced Reasoning Strategies
Beyond simply prompting LLMs and VLMs to directly predict the next action of game-play, BALROG also supports the study of more advanced reasoning techniques like chain-of-thought (Wei etal., 2022b), self-refinement (Madaan etal., 2024), and basic planning. These methods have been demonstrated to improve model performance on shorter-context problems. We believe them to be an exciting direction for future work on long-context reasoning and decision-making. For example, model performance on the tasks in BALROG might be improved by integrating multi-agent collaboration (Chang, 2023; Khan etal., 2024; Yao etal., 2024) and tool usage (Shen etal., 2024; Ruan etal., 2023; Schick etal., 2024; Qin etal., 2023) in decision-making. Additionally, incorporating memory mechanisms or reinforcement learning techniques could help bridge the “knowing-doing” gap, enabling models to apply their knowledge effectively in practical, long-horizon tasks. Finally, experimenting with open-ended self-improvement loops (Wang etal., 2023; Hu etal., 2024) could lead to more adaptive and general agents (Team etal., 2023; Hughes etal., 2024), offering a pathway toward truly autonomous systems.
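As one example of such a strategy, the sketch below asks the model to reason before acting and then extracts the final action; the client interface and the “ACTION:” extraction convention are illustrative assumptions, not a format required by the benchmark.

```python
# Minimal chain-of-thought wrapper; the "ACTION:" convention is an assumption.
def chain_of_thought_act(client, messages: list) -> str:
    cot_messages = messages + [{
        "role": "user",
        "content": ("Think step by step about the best next move, then end "
                    "your reply with a line of the form 'ACTION: <action>'."),
    }]
    reply = client.generate(cot_messages)
    for line in reversed(reply.splitlines()):
        if line.strip().upper().startswith("ACTION:"):
            return line.split(":", 1)[1].strip()
    return reply.strip().splitlines()[-1]  # fall back to the last line if no tag is found
```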
Limitations of Current Vision-Language Models
Despite their potential, our benchmark reveals significant variability in VLM performance. While some models, such as Llama 3.2, struggle to integrate visual information into coherent decision-making, others, most notably Claude 3.5 Sonnet, demonstrate stronger performance in VLM mode. This disparity may stem from differences in training objectives and datasets. For example, Claude 3.5 Sonnet’s superior performance can be attributed in part to its training on tasks involving computer use (Anthropic, 2024), which inherently require integrating visual and textual inputs for action-based reasoning.
Recent studies have identified key limitations of VLMs that align with our findings, including biases toward natural image-text pairs, optimization for image description rather than action-oriented reasoning, and challenges with out-of-distribution inputs (Tan et al., 2024; Tong et al., 2024; Rahmanzadehgervi et al., 2024; Zang et al., 2024; Guan et al., 2023). These limitations are further exemplified in our benchmark, where grid-based image observations differ significantly from the natural image-text pairs on which many VLMs are trained (Yu et al., 2023; Rahmanzadehgervi et al., 2024). Moreover, the computational cost of image processing constrained our evaluation to a single image per observation, with the remainder of the history provided in text. While this constraint may hinder performance for some models, our results show that certain VLMs, like Claude 3.5 Sonnet, can still perform robustly under these conditions.
To address these challenges, our codebase already supports multi-image observation histories, and future iterations will incorporate video observations, which are likely better suited for the long-horizon sequential decision-making tasks central to our benchmark. These enhancements aim to better evaluate and leverage the potential of VLMs in complex reasoning scenarios. We plan to introduce support for video observations once prominent models with efficient video-processing capabilities become available, ensuring that our benchmark remains aligned with the latest advancements in VLM technology.
Computational Limitations of Large Language Models
Mechanistic interpretability could provide valuable insights into the computational limitations of agentic LLMs. The computational expressiveness of LLMs is fundamentally linked to their ability to solve complex reasoning problems (Wei et al., 2022a). While current models perform well on simple tasks such as navigation and object manipulation, they struggle with more complex tasks that require non-trivial and general-purpose computation, for example, building a shelter or developing combat strategies. This could be due to the models’ inability to retrieve relevant computational circuits (Olah et al., 2020), limits on the inference-time compute budget (Snell et al., 2024), or representational expressivity. These observations raise important questions about the scope of tasks effectively solvable by LLMs and VLMs, which depends on factors such as model depth, context size, and the distribution shift between pre-training and downstream tasks. Further research is needed to understand the underlying causes of these limitations and to develop strategies for overcoming them, such as adaptive simulation of computational circuits at runtime.
7 Conclusion
We introduce BALROG, a novel benchmark designed to assess the agentic capabilities of LLMs and VLMs across a diverse set of challenging, long-horizon tasks. Through easily reproducible evaluation protocols, BALROG reveals critical shortcomings in current models, particularly in areas such as vision-based decision-making and long-term planning, identifying clear gaps between model performance and human-level capabilities. These deficiencies, uncovered through our qualitative analysis, reflect the challenges faced in real-world scenarios, underscoring the practical relevance of our benchmark for agentic applications. Our evaluation framework leverages fast, procedurally generated environments, ensuring rigorous and fair comparisons by preventing test-set leakage, a common issue in other benchmarks. We believe that BALROG will serve as a critical tool for supporting and advancing research towards autonomous LLM agents.
Ethics Statement
This work provides a benchmark for the agentic capabilities of LLMs. We believe that experimentation in simulated environments, where the behavior of the agents is easy to interpret, is crucial for building safe agentic systems. It is important to address questions on how to ensure that the agent’s behavior is well aligned with human intentions.
Reproducibility Statement
We strive to make all experiments in this paper fully reproducible. We share the evaluation codebase, which is available in the supplementary materials. Full descriptions of the evaluation schemes for each environment are provided in Appendices A to F.
References
- Achiam etal. (2023)Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya,FlorenciaLeoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman,Shyamal Anadkat, etal.Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023.
- Anthropic (2024)Anthropic.Developing a computer use model, 2024.URL https://www.anthropic.com/news/developing-computer-use.Accessed: 2024-11-17.
- Anthropic (2024)Anthropic.Claude 3.5 sonnet: Enhanced intelligence and versatility, 2024.URL https://www.anthropic.com/news/claude-3-5-sonnet.Accessed: 2024-11-18.
- Bard etal. (2020)Nolan Bard, JakobN Foerster, Sarath Chandar, Neil Burch, Marc Lanctot,HFrancis Song, Emilio Parisotto, Vincent Dumoulin, Subhodeep Moitra, EdwardHughes, etal.The hanabi challenge: A new frontier for ai research.Artificial Intelligence, 280:103216, 2020.
- Bellemare etal. (2013)MarcG Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling.The arcade learning environment: An evaluation platform for generalagents.Journal of Artificial Intelligence Research, 47:253–279, 2013.
- Brown (2020)TomB Brown.Language models are few-shot learners.arXiv preprint arXiv:2005.14165, 2020.
- Carroll etal. (2019)Micah Carroll, Rohin Shah, MarkK Ho, Tom Griffiths, Sanjit Seshia, PieterAbbeel, and Anca Dragan.On the utility of learning about humans for human-ai coordination.Advances in neural information processing systems, 32, 2019.
- Carta etal. (2023)Thomas Carta, Clément Romac, Thomas Wolf, Sylvain Lamprier, Olivier Sigaud,and Pierre-Yves Oudeyer.Grounding large language models in interactive environments withonline reinforcement learning.In International Conference on Machine Learning, pages3676–3713. PMLR, 2023.
- Chang (2023)EdwardY Chang.Prompting large language models with the socratic method.In 2023 IEEE 13th Annual Computing and Communication Workshopand Conference (CCWC), pages 0351–0360. IEEE, 2023.
- Chen etal. (2021)Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde DeOliveiraPinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, GregBrockman, etal.Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021.
- Chevalier-Boisvert etal. (2019)Maxime Chevalier-Boisvert, Dzmitry Bahdanau, Salem Lahlou, Lucas Willems,Chitwan Saharia, ThienHuu Nguyen, and Yoshua Bengio.BabyAI: First steps towards grounded language learning with a humanin the loop.In International Conference on Learning Representations, 2019.URL https://openreview.net/forum?id=rJeXCo0cYX.
- Cloos etal. (2024)Nathan Cloos, Meagan Jens, Michelangelo Naim, Yen-Ling Kuo, Ignacio Cases,Andrei Barbu, and ChristopherJ Cueva.Baba is ai: Break the rules to beat the benchmark.In ICML 2024 Workshop on LLMs and Cognition, 2024.
- Cobbe etal. (2021)Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, LukaszKaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano,etal.Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021.
- Côté etal. (2019)Marc-Alexandre Côté, Akos Kádár, Xingdi Yuan, Ben Kybartas,Tavian Barnes, Emery Fine, James Moore, Matthew Hausknecht, Layla ElAsri,Mahmoud Adada, etal.Textworld: A learning environment for text-based games.In Computer Games: 7th Workshop, CGW 2018, Held in Conjunctionwith the 27th International Conference on Artificial Intelligence, IJCAI2018, Stockholm, Sweden, July 13, 2018, Revised Selected Papers 7, pages41–75. Springer, 2019.
- Dubey etal. (2024)Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, AhmadAl-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan,etal.The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024.
- Fan etal. (2022)Linxi Fan, Guanzhi Wang, Yunfan Jiang, Ajay Mandlekar, Yuncong Yang, Haoyi Zhu,Andrew Tang, De-An Huang, Yuke Zhu, and Anima Anandkumar.Minedojo: Building open-ended embodied agents with internet-scaleknowledge.Advances in Neural Information Processing Systems,35:18343–18362, 2022.
- Fernando etal. (2023)Chrisantha Fernando, Dylan Banarse, Henryk Michalewski, Simon Osindero, and TimRocktäschel.Promptbreeder: Self-referential self-improvement via promptevolution.arXiv preprint arXiv:2309.16797, 2023.
- Gong etal. (2023)Ran Gong, Qiuyuan Huang, Xiaojian Ma, Hoi Vo, Zane Durante, Yusuke Noda, ZilongZheng, Song-Chun Zhu, Demetri Terzopoulos, LiFei-Fei, etal.Mindagent: Emergent gaming interaction.arXiv preprint arXiv:2309.09971, 2023.
- Goodger etal. (2023)Nikolaj Goodger, Peter Vamplew, Cameron Foale, and Richard Dazeley.A nethack learning environment language wrapper for autonomousagents.Journal of Open Research Software, 11, 06 2023.doi: 10.5334/jors.444.
- Guan etal. (2023)Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, XijunWang, Lichang Chen, Furong Huang, Yaser Yacoob, etal.Hallusionbench: An advanced diagnostic suite for entangled languagehallucination and visual illusion in large vision-language models.arXiv preprint arXiv:2310.14566, 2023.
- Guss etal. (2019)WilliamH Guss, Brandon Houghton, Nicholay Topin, Phillip Wang, Cayden Codel,Manuela Veloso, and Ruslan Salakhutdinov.Minerl: A large-scale dataset of minecraft demonstrations.arXiv preprint arXiv:1907.13440, 2019.
- Hafner (2021)Danijar Hafner.Benchmarking the spectrum of agent capabilities.arXiv preprint arXiv:2109.06780, 2021.
- Hambro etal. (2022a)Eric Hambro, Sharada Mohanty, Dmitrii Babaev, Minwoo Byeon, Dipam Chakraborty,Edward Grefenstette, Minqi Jiang, JoDaejin, Anssi Kanervisto, Jongmin Kim,etal.Insights from the neurips 2021 nethack challenge.In NeurIPS 2021 Competitions and Demonstrations Track, pages41–52. PMLR, 2022a.
- Hambro etal. (2022b)Eric Hambro, Roberta Raileanu, Danielle Rothermel, Vegard Mella, TimRocktäschel, Heinrich Küttler, and Naila Murray.Dungeons and data: A large-scale nethack dataset.Advances in Neural Information Processing Systems,35:24864–24878, 2022b.
- Hendrycks etal. (2020)Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, DawnSong, and Jacob Steinhardt.Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020.
- Hendrycks etal. (2021)Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, EricTang, Dawn Song, and Jacob Steinhardt.Measuring mathematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874, 2021.
- Hu etal. (2024)Shengran Hu, Cong Lu, and Jeff Clune.Automated design of agentic systems.arXiv preprint arXiv:2408.08435, 2024.
- Hughes etal. (2024)Edward Hughes, Michael Dennis, Jack Parker-Holder, Feryal Behbahani, AditiMavalankar, Yuge Shi, Tom Schaul, and Tim Rocktaschel.Open-endedness is essential for artificial superhuman intelligence.arXiv preprint arXiv:2406.04268, 2024.
- Jeurissen etal. (2024)Dominik Jeurissen, Diego Perez-Liebana, Jeremy Gow, Duygu Cakmak, and JamesKwan.Playing nethack with llms: Potential & limitations as zero-shotagents.arXiv preprint arXiv:2403.00690, 2024.
- Jimenez etal. (2023)CarlosE Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, OfirPress, and Karthik Narasimhan.Swe-bench: Can language models resolve real-world github issues?arXiv preprint arXiv:2310.06770, 2023.
- Jin etal. (2024)Xuanfa Jin, Ziyan Wang, Yali Du, Meng Fang, Haifeng Zhang, and Jun Wang.Learning to discuss strategically: A case study on one night ultimatewerewolf.arXiv preprint arXiv:2405.19946, 2024.
- Kambhampati etal. (2024)Subbarao Kambhampati, Karthik Valmeekam, Lin Guan, Kaya Stechly, Mudit Verma,Siddhant Bhambri, Lucas Saldyt, and Anil Murthy.Llms can’t plan, but can help planning in llm-modulo frameworks.arXiv preprint arXiv:2402.01817, 2024.
- Khan etal. (2024)Akbir Khan, John Hughes, Dan Valentine, Laura Ruis, Kshitij Sachan, AnshRadhakrishnan, Edward Grefenstette, SamuelR Bowman, Tim Rocktäschel, andEthan Perez.Debating with more persuasive llms leads to more truthful answers.arXiv preprint arXiv:2402.06782, 2024.
- Kiela etal. (2021)Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger,Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia,etal.Dynabench: Rethinking benchmarking in nlp.arXiv preprint arXiv:2104.14337, 2021.
- Klissarov etal. (2023)Martin Klissarov, Pierluca D’Oro, Shagun Sodhani, Roberta Raileanu, Pierre-LucBacon, Pascal Vincent, Amy Zhang, and Mikael Henaff.Motif: Intrinsic motivation from artificial intelligence feedback.arXiv preprint arXiv:2310.00166, 2023.
- Küttler etal. (2020)Heinrich Küttler, Nantas Nardelli, Alexander Miller, Roberta Raileanu,Marco Selvatici, Edward Grefenstette, and Tim Rocktäschel.The nethack learning environment.Advances in Neural Information Processing Systems,33:7671–7684, 2020.
- Kwon etal. (2023)Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, CodyHao Yu,JosephE. Gonzalez, Hao Zhang, and Ion Stoica.Efficient memory management for large language model serving withpagedattention.In Proceedings of the ACM SIGOPS 29th Symposium on OperatingSystems Principles, 2023.
- Laskin etal. (2022)Michael Laskin, Luyu Wang, Junhyuk Oh, Emilio Parisotto, Stephen Spencer,Richie Steigerwald, DJStrouse, Steven Hansen, Angelos Filos, Ethan Brooks,etal.In-context reinforcement learning with algorithm distillation.arXiv preprint arXiv:2210.14215, 2022.
- Lee etal. (2024)Jonathan Lee, Annie Xie, Aldo Pacchiano, Yash Chandak, Chelsea Finn, OfirNachum, and Emma Brunskill.Supervised pretraining can learn in-context reinforcement learning.Advances in Neural Information Processing Systems, 36, 2024.
- Lewis etal. (2020)Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, VladimirKarpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, TimRocktäschel, etal.Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in Neural Information Processing Systems,33:9459–9474, 2020.
- Li etal. (2022)Shuang Li, Xavier Puig, Chris Paxton, Yilun Du, Clinton Wang, Linxi Fan, TaoChen, De-An Huang, Ekin Akyürek, Anima Anandkumar, etal.Pre-trained language models for interactive decision-making.Advances in Neural Information Processing Systems,35:31199–31212, 2022.
- Lin etal. (2023)Licong Lin, YuBai, and Song Mei.Transformers as decision makers: Provable in-context reinforcementlearning via supervised pretraining.arXiv preprint arXiv:2310.08566, 2023.
- Liu etal. (2023a)Jijia Liu, Chao Yu, Jiaxuan Gao, Yuqing Xie, Qingmin Liao, YiWu, and YuWang.Llm-powered hierarchical language agent for real-time human-aicoordination.arXiv preprint arXiv:2312.15224, 2023a.
- Liu etal. (2023b)Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, YuGu,Hangliang Ding, Kaiwen Men, Kejuan Yang, etal.Agentbench: Evaluating llms as agents.arXiv preprint arXiv:2308.03688, 2023b.
- Lu etal. (2024)Cong Lu, Shengran Hu, and Jeff Clune.Intelligent go-explore: Standing on the shoulders of giant foundationmodels.arXiv preprint arXiv:2405.15143, 2024.
- Lu etal. (2021)Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, AmbrosioBlanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, etal.Codexglue: A machine learning benchmark dataset for codeunderstanding and generation.arXiv preprint arXiv:2102.04664, 2021.
- Madaan etal. (2024)Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, SarahWiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, etal.Self-refine: Iterative refinement with self-feedback.Advances in Neural Information Processing Systems, 36, 2024.
- MetaAI (2024)MetaAI.Llama 3.2: Revolutionizing edge ai and vision with open, customizablemodels, 2024.URLhttps://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/.Accessed: 2024-09-28.
- Olah etal. (2020)Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, andShan Carter.Zoom in: An introduction to circuits.Distill, 5(3):e00024–001, 2020.
- OpenAI (2024a)OpenAI.Hello gpt-4o, 2024a.URL https://openai.com/index/hello-gpt-4o/.Accessed: 2024-09-28.
- OpenAI (2024b)OpenAI.Introducing openai o1 preview, September 2024b.URL https://openai.com/index/introducing-openai-o1-preview/.Accessed: 2024-09-27.
- Ouyang etal. (2022)Long Ouyang, Jeffrey Wu, XuJiang, Diogo Almeida, Carroll Wainwright, PamelaMishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, etal.Training language models to follow instructions with human feedback.Advances in neural information processing systems,35:27730–27744, 2022.
- Piatti etal. (2024)Giorgio Piatti, Zhijing Jin, Max Kleiman-Weiner, Bernhard Schölkopf,Mrinmaya Sachan, and Rada Mihalcea.Cooperate or collapse: Emergence of sustainability behaviors in asociety of llm agents.arXiv preprint arXiv:2404.16698, 2024.
- Pignatiello etal. (2020)GrantA Pignatiello, RichardJ Martin, and RonaldL HickmanJr.Decision fatigue: A conceptual analysis.Journal of health psychology, 25(1):123–135, 2020.
- Piterbarg etal. (2024)Ulyana Piterbarg, Lerrel Pinto, and Rob Fergus.diff history for neural language agents.In Forty-first International Conference on Machine Learning,2024.
- Qin etal. (2024)Yanzhao Qin, Tao Zhang, Yanjun Shen, Wenjing Luo, Haoze Sun, Yan Zhang, YujingQiao, Weipeng Chen, Zenan Zhou, Wentao Zhang, etal.Sysbench: Can large language models follow system messages?arXiv preprint arXiv:2408.10943, 2024.
- Qin etal. (2023)Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin,Xin Cong, Xiangru Tang, Bill Qian, etal.Toolllm: Facilitating large language models to master 16000+real-world apis.arXiv preprint arXiv:2307.16789, 2023.
- Rahmanzadehgervi etal. (2024)Pooyan Rahmanzadehgervi, Logan Bolton, MohammadReza Taesiri, and AnhTottiNguyen.Vision language models are blind.arXiv preprint arXiv:2407.06581, 2024.
- Reed etal. (2022)Scott Reed, Konrad Zolna, Emilio Parisotto, SergioGomez Colmenarejo, AlexanderNovikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay,JostTobias Springenberg, etal.A generalist agent.arXiv preprint arXiv:2205.06175, 2022.
- Reid etal. (2024)Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, TimothyLillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, OrhanFirat, Julian Schrittwieser, etal.Gemini 1.5: Unlocking multimodal understanding across millions oftokens of context.arXiv preprint arXiv:2403.05530, 2024.
- Rein etal. (2023)David Rein, BettyLi Hou, AsaCooper Stickland, Jackson Petty, RichardYuanzhePang, Julien Dirani, Julian Michael, and SamuelR Bowman.Gpqa: A graduate-level google-proof q&a benchmark.arXiv preprint arXiv:2311.12022, 2023.
- Ruan etal. (2023)Jingqing Ruan, Yihong Chen, Bin Zhang, Zhiwei Xu, Tianpeng Bao, Hangyu Mao,Ziyue Li, Xingyu Zeng, Rui Zhao, etal.Tptu: Task planning and tool usage of large language model-based aiagents.In NeurIPS 2023 Foundation Models for Decision MakingWorkshop, 2023.
- Samvelyan etal. (2021)Mikayel Samvelyan, Robert Kirk, Vitaly Kurin, Jack Parker-Holder, Minqi Jiang,Eric Hambro, Fabio Petroni, Heinrich Küttler, Edward Grefenstette, andTim Rocktäschel.Minihack the planet: A sandbox for open-ended reinforcement learningresearch.arXiv preprint arXiv:2109.13202, 2021.
- Schick etal. (2024)Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, MariaLomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom.Toolformer: Language models can teach themselves to use tools.Advances in Neural Information Processing Systems, 36, 2024.
- Schrittwieser etal. (2020)Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan,Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis,Thore Graepel, etal.Mastering atari, go, chess and shogi by planning with a learnedmodel.Nature, 588(7839):604–609, 2020.
- Shen etal. (2024)Yongliang Shen, Kaitao Song, XuTan, Dongsheng Li, Weiming Lu, and YuetingZhuang.Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface.Advances in Neural Information Processing Systems, 36, 2024.
- Shi etal. (2022)Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, SoroushVosoughi, HyungWon Chung, YiTay, Sebastian Ruder, Denny Zhou, etal.Language models are multilingual chain-of-thought reasoners.arXiv preprint arXiv:2210.03057, 2022.
- Shinn etal. (2023)Noah Shinn, Beck Labash, and Ashwin Gopinath.Reflexion: an autonomous agent with dynamic memory andself-reflection.arXiv preprint arXiv:2303.11366, 2(5):9,2023.
- Silver etal. (2018)David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, MatthewLai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, ThoreGraepel, etal.A general reinforcement learning algorithm that masters chess, shogi,and go through self-play.Science, 362(6419):1140–1144, 2018.
- Snell etal. (2024)Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar.Scaling llm test-time compute optimally can be more effective thanscaling model parameters.arXiv preprint arXiv:2408.03314, 2024.
- Srivastava etal. (2022)Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu AwalMd Shoeb, AbubakarAbid, Adam Fisch, AdamR Brown, Adam Santoro, Aditya Gupta, AdriàGarriga-Alonso, etal.Beyond the imitation game: Quantifying and extrapolating thecapabilities of language models.arXiv preprint arXiv:2206.04615, 2022.
- Tan etal. (2024)Weihao Tan, Ziluo Ding, Wentao Zhang, Boyu Li, Bohan Zhou, Junpeng Yue,Haochong Xia, Jiechuan Jiang, Longtao Zheng, Xinrun Xu, etal.Towards general computer control: A multimodal agent for red deadredemption ii as a case study.In ICLR 2024 Workshop on Large Language Model (LLM) Agents,2024.
- Team et al. (2023) Adaptive Agent Team, Jakob Bauer, Kate Baumli, Satinder Baveja, Feryal Behbahani, Avishkar Bhoopchand, Nathalie Bradley-Schmieg, Michael Chang, Natalie Clay, Adrian Collister, et al. Human-timescale adaptation in an open-ended task space. arXiv preprint arXiv:2301.07608, 2023.
- Todd et al. (2024) Graham Todd, Tim Merino, Sam Earle, and Julian Togelius. Missed connections: Lateral thinking puzzles for large language models. arXiv preprint arXiv:2404.11730, 2024.
- Tong et al. (2024) Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9568–9578, 2024.
- Vinyals et al. (2019) Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
- Wang et al. (2019) Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. Advances in Neural Information Processing Systems, 32, 2019.
- Wang et al. (2023) Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023.
- Wang et al. (2024) Zilong Wang, Yuedong Cui, Li Zhong, Zimin Zhang, Da Yin, Bill Yuchen Lin, and Jingbo Shang. OfficeBench: Benchmarking language agents across multiple applications for office automation. arXiv preprint arXiv:2407.19056, 2024.
- Wansink and Sobal (2007) Brian Wansink and Jeffery Sobal. Mindless eating: The 200 daily food decisions we overlook. Environment and Behavior, 39(1):106–123, 2007.
- Wei et al. (2022a) Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022a.
- Wei et al. (2022b) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022b.
- Wołczyk et al. (2024) Maciej Wołczyk, Bartłomiej Cupiał, Mateusz Ostaszewski, Michał Bortkiewicz, Michał Zając, Razvan Pascanu, Łukasz Kuciński, and Piotr Miłoś. Fine-tuning reinforcement learning models is secretly a forgetting mitigation problem. arXiv preprint arXiv:2402.02868, 2024.
- Wu et al. (2024) Shuang Wu, Liwen Zhu, Tao Yang, Shiwei Xu, Qiang Fu, Yang Wei, and Haobo Fu. Enhance reasoning for large language models in the game Werewolf. arXiv preprint arXiv:2402.02330, 2024.
- Wu et al. (2023) Yue Wu, Xuan Tang, Tom M Mitchell, and Yuanzhi Li. SmartPlay: A benchmark for LLMs as intelligent agents. arXiv preprint arXiv:2310.01557, 2023.
- Xing et al. (2024) Mingzhe Xing, Rongkai Zhang, Hui Xue, Qi Chen, Fan Yang, and Zhen Xiao. Understanding the weakness of large language model agents within a complex Android environment. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 6061–6072, 2024.
- Yamada et al. (2023) Yutaro Yamada, Yihan Bao, Andrew K Lampinen, Jungo Kasai, and Ilker Yildirim. Evaluating spatial understanding of large language models. arXiv preprint arXiv:2310.14540, 2023.
- Yang et al. (2023) Hui Yang, Sifu Yue, and Yunzhong He. Auto-GPT for online decision making: Benchmarks and additional opinions. arXiv preprint arXiv:2306.02224, 2023.
- Yao et al. (2022) Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. WebShop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757, 2022.
- Yao et al. (2024) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36, 2024.
- Yu et al. (2023) Lili Yu, Bowen Shi, Ramakanth Pasunuru, Benjamin Muller, Olga Golovneva, Tianlu Wang, Arun Babu, Binh Tang, Brian Karrer, Shelly Sheynin, et al. Scaling autoregressive multi-modal models: Pretraining and instruction tuning. arXiv preprint arXiv:2309.02591, 2023.
- Zang et al. (2024) Yuhang Zang, Hanlin Goh, Josh Susskind, and Chen Huang. Overcoming the pitfalls of vision-language model finetuning for OOD generalization. arXiv preprint arXiv:2401.15914, 2024.
- Zhou et al. (2023) Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. WebArena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023.
Appendix A Baby AI
BabyAI [Chevalier-Boisvert et al., 2019] is a research platform designed to study grounded language learning and instruction following in artificial agents. It consists of a suite of 2D grid world environments with increasing levels of complexity. In these environments, an agent navigates through rooms and interacts with various objects like doors, keys, balls, and boxes of different colors. The agent receives natural language instructions, called “missions”, which describe tasks it needs to complete, such as picking up specific objects or navigating to certain locations. Many existing works on decision-making have studied model performance on this environment [Reed et al., 2022, Li et al., 2022]. We use it as a historically relevant environment that we expect to be relatively easy to solve.
A.1 BabyAI-Text
We evaluate the agents on 5 tasks introduced in BabyAI-Text [Carta et al., 2023], which provides a description of each observation instead of a symbolic representation. A textual description consists of a list of template descriptions with the following structure (a minimal sketch of this formatting is given after the list):
- “You see a <object> <location>”, if the object is a key, a ball, a box or a wall.
- “You see a(n) open/closed door <location>”, if the agent sees a door.
- “You carry a <object>”, if the agent carries an object.
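Below is a minimal, illustrative sketch of how such template descriptions could be rendered from structured object descriptors. This is not the BabyAI-Text implementation; the class, field names, and example values are hypothetical.

```python
# Minimal illustrative sketch of BabyAI-Text-style template descriptions.
# NOT the BabyAI-Text implementation; names and fields are hypothetical.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VisibleObject:
    kind: str                       # e.g. "key", "ball", "box", "wall", "door"
    color: str                      # e.g. "red"
    location: str                   # relative description, e.g. "2 steps ahead"
    is_open: Optional[bool] = None  # only meaningful for doors


def describe(objects: List[VisibleObject], carried: Optional[str] = None) -> List[str]:
    """Render the template sentences listed above."""
    lines = []
    for obj in objects:
        if obj.kind == "door":
            state = "open" if obj.is_open else "closed"
            article = "an" if state == "open" else "a"
            lines.append(f"You see {article} {state} door {obj.location}")
        else:
            lines.append(f"You see a {obj.color} {obj.kind} {obj.location}")
    if carried is not None:
        lines.append(f"You carry a {carried}")
    return lines


if __name__ == "__main__":
    print("\n".join(describe(
        [VisibleObject("key", "red", "1 step ahead"),
         VisibleObject("door", "blue", "3 steps ahead", is_open=False)],
        carried="green ball",
    )))
```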
A.2 BabyAI Results
We provide BabyAI results for LLM and VLM mode in the two tables below. Errors are computed with 25 seeds for each of the 5 tasks of BabyAI. GPT-4o leads, closely followed by Llama 3.1 70B. When vision is added to the observation, the performance of most models decreases, while Gemini-1.5-Pro remains stable.
| Model | Average Progress (%) |
|---|---|
| gpt-4o | 77.60 ± 3.73 |
| llama-3.1-70B-it | 73.20 ± 3.96 |
| gemini-1.5-pro | 58.40 ± 4.41 |
| llama-3.2-90B-it | 55.20 ± 4.45 |
| claude-3.5-sonnet | 52.00 ± 7.07 |
| gpt-4o-mini | 50.40 ± 4.47 |
| llama-3.2-11B-it | 32.80 ± 4.20 |
| llama-3.1-8B-it | 30.00 ± 6.48 |
| gemini-1.5-flash | 25.60 ± 3.90 |
| llama-3.2-3B-it | 10.00 ± 4.24 |
| llama-3.2-1B-it | 6.00 ± 3.36 |
| Model | Average Progress (%) |
|---|---|
| gpt-4o | 62.00 ± 4.34 |
| gemini-1.5-pro | 58.40 ± 4.41 |
| claude-3.5-sonnet | 50.00 ± 7.07 |
| gemini-1.5-flash | 43.20 ± 4.43 |
| gpt-4o-mini | 38.00 ± 4.34 |
| llama-3.2-90B-it | 28.20 ± 4.02 |
| llama-3.2-11B-it | 10.40 ± 2.73 |
A.3 Observations
Example of instruction prompt and observation for BabyAI
Appendix B Crafter
Crafter [Hafner, 2021] is an open-source 2D survival game designed specifically for research on strong generalization, deep exploration, and long-term reasoning in reinforcement learning. It is a Minecraft-inspired, procedurally generated environment that combines resource gathering, crafting, and combat elements. Additionally, the game includes a comprehensive set of tasks and achievements, enabling researchers to evaluate agent performance across multiple objectives and time scales. To enable interaction with language models, we use the same language wrapper as proposed in Wu et al. [2023].
B.1 Crafter Results
We provide Crafter results for LLM and VLM mode in the two tables below; standard errors are computed using 10 seeds. GPT-4o leads in language-only mode, and Claude 3.5 Sonnet leads in vision-language mode. Surprisingly, Llama 3.2 90B's performance drops sharply when images are added, reaching a lower average progress than the smaller 11B model.
| Model | Average Progress (%) |
|---|---|
| gpt-4o | 33.10 ± 2.32 |
| claude-3.5-sonnet | 32.73 ± 3.20 |
| llama-3.2-90B-it | 31.69 ± 1.36 |
| llama-3.1-70B-it | 31.31 ± 2.68 |
| gemini-1.5-pro | 30.21 ± 2.86 |
| llama-3.2-11B-it | 26.20 ± 3.30 |
| llama-3.1-8B-it | 25.45 ± 3.23 |
| gemini-1.5-flash | 20.00 ± 0.74 |
| gpt-4o-mini | 12.72 ± 1.13 |
| llama-3.2-3B-it | 17.27 ± 2.79 |
| llama-3.2-1B-it | 12.73 ± 1.91 |
| Model | Average Progress (%) |
|---|---|
| claude-3.5-sonnet | 37.27 ± 3.14 |
| gemini-1.5-pro | 33.50 ± 2.07 |
| gpt-4o | 26.81 ± 3.74 |
| llama-3.2-11B-it | 23.63 ± 1.48 |
| gemini-1.5-flash | 20.70 ± 4.43 |
| gpt-4o-mini | 19.91 ± 3.13 |
| llama-3.2-90B-it | 10.00 ± 1.13 |
B.2 Observations
Appendix C TextWorld
TextWorld [Côté et al., 2019] is a text-based game environment developed by Microsoft Research that allows for the creation and customization of interactive fiction games. In our experiments, we utilize three games from the TextWorld domain: “Treasure Hunter”, “The Cooking Game”, and “Coin Collector”. Each task can be generated at different difficulty levels by changing the number of rooms, enabling obstacles, and including distractor rooms. We use the generation rules introduced in Lu et al. [2024].
C.1 Treasure Hunter
In Treasure Hunter, we create a challenging maze-like environment with 20 rooms. The game is set to the maximum difficulty level of 30, introducing locked doors and containers that must be manipulated to locate the target object. To increase complexity, we remove the solution description and filter out tasks that can be solved optimally in 20 steps or fewer. This setup requires the agent to navigate a complex space, interact with various objects, and devise strategies to overcome obstacles in its quest to find the treasure.
C.2 The Cooking Game
The Cooking Game presents a culinary challenge set across 13 rooms. We maximize the complexity by including up to 5 ingredients and enabling all additional challenging options. The agent must navigate through doors, process food items using tools like knives, and cook ingredients using various methods such as grilling, frying, and roasting. This game tests the agent’s ability to plan and execute multi-step processes in a dynamic environment, simulating the complexities of real-world cooking tasks.
C.3 Coin Collector
Coin Collector features an expansive environment with 40 rooms, including potential distractor rooms to increase navigation difficulty. Similar to Treasure Hunter, we remove the solution description to enhance the challenge. The optimal path from the agent’s starting point to the target is set to 20 steps, requiring efficient exploration and decision-making. This game tests the agent’s ability to navigate large spaces, avoid distractions, and efficiently reach its goal in a complex, maze-like structure.
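For reference, the task settings above can be summarised as plain configuration data. The sketch below is illustrative only: the field names are hypothetical and it does not use TextWorld's actual generation tooling; it merely records the settings described in this section.

```python
# Plain-data summary of the three TextWorld task configurations described
# above. Field names are hypothetical; this does not call TextWorld's
# generation API.
TEXTWORLD_TASKS = {
    "Treasure Hunter": {
        "rooms": 20,
        "difficulty": 30,               # maximum generator difficulty
        "solution_description": False,  # removed to increase the challenge
        "min_optimal_steps": 21,        # tasks solvable in <= 20 steps are filtered out
    },
    "The Cooking Game": {
        "rooms": 13,
        "max_ingredients": 5,
        "extra_challenges": True,       # all additional challenging options enabled
    },
    "Coin Collector": {
        "rooms": 40,
        "distractor_rooms": True,
        "solution_description": False,
        "optimal_path_steps": 20,       # optimal path from start to target
    },
}

if __name__ == "__main__":
    for name, cfg in TEXTWORLD_TASKS.items():
        print(name, cfg)
```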
C.4 TextWorld Results
In Table 8, we provide results for TextWorld. Standard errors are computed using 20 seeds for each of the 3 tasks. Claude 3.5 Sonnet and GPT-4o lead, each obtaining more than twice the average progression of the next-best model, Llama 3.1 70B. Coin Collector was by far the most challenging task, with GPT-4o managing to solve it only once out of 20 attempts. The Gemini APIs often failed to return completions on TextWorld, flagging the inputs as “unsafe” despite there being no genuine safety concerns in the gameplay. Since this made it impossible to complete a full round of evaluation, we mark the Gemini models as 0% progression.
| Model | Average Progress (%) |
|---|---|
| claude-3.5-sonnet | 42.06 ± 5.41 |
| gpt-4o | 39.31 ± 5.24 |
| llama-3.1-70B-it | 15.00 ± 4.61 |
| gpt-4o-mini | 12.25 ± 3.55 |
| llama-3.2-90B-it | 11.18 ± 2.98 |
| llama-3.2-11B-it | 6.67 ± 2.17 |
| gemini-1.5-flash | 0.00 ± 0.00 |
| gemini-1.5-pro | 0.00 ± 0.00 |
C.5 Observations
Appendix D Baba Is AI
Baba Is AI is a benchmark environment based on the puzzle game “Baba Is You”. In this gridworld game, players interact with various objects and textual rule blocks to achieve specific goals. The unique aspect of Baba Is AI is that the rules of the game can be manipulated and rearranged by the player, creating a dynamic environment where agents must identify relevant objects and rules and then manipulate them to change or create new rules to succeed. This benchmark allows researchers to explore a broader notion of generalization than current benchmarks, as it requires agents not only to learn and follow the rules but also to combine previously seen rules in novel ways. Agents are tested on 40 different puzzle levels.
D.1 Baba Is AI Language Wrapper
To enable interaction with language models, we implemented a custom language wrapper for Baba Is AI. It constructs a language observation from the active rules and describes object positions relative to the player. We do not provide the solution to the agent and do not specify grid boundaries in the text-only experiments.
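The following is a minimal sketch of the kind of text observation such a wrapper can produce, with active rules listed first and object positions phrased relative to the player. Function names, coordinate conventions, and exact phrasing are illustrative assumptions, not the wrapper's actual output format.

```python
# Illustrative sketch of a Baba Is AI-style language observation: active rules
# followed by object positions relative to the player. Not the actual wrapper.
def relative_position(obj_xy, player_xy):
    """Describe an (x, y) grid position relative to the player (y grows downwards)."""
    dx, dy = obj_xy[0] - player_xy[0], obj_xy[1] - player_xy[1]
    parts = []
    if dy:
        parts.append(f"{abs(dy)} step{'s' if abs(dy) > 1 else ''} {'down' if dy > 0 else 'up'}")
    if dx:
        parts.append(f"{abs(dx)} step{'s' if abs(dx) > 1 else ''} {'to the right' if dx > 0 else 'to the left'}")
    return " and ".join(parts) if parts else "on your tile"


def build_observation(active_rules, objects, player_xy):
    lines = ["Active rules: " + "; ".join(active_rules)]
    for name, xy in objects.items():
        lines.append(f"{name} is {relative_position(xy, player_xy)}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_observation(
        active_rules=["baba is you", "flag is win"],
        objects={"flag": (4, 1), "rule block 'wall'": (2, 3)},
        player_xy=(1, 1),
    ))
```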
D.2 Baba Is AI Results
We provide the Baba Is AI results for LLM and VLM mode in the two tables below. Standard errors are computed using 5 seeds for each of the 40 Baba Is AI tasks. Surprisingly, the Llama models lead, with Llama 3.2 90B surpassing GPT-4o by roughly 10 percentage points in language-only mode. Once again, model performance suffers when vision is added, with only Gemini-1.5-Pro remaining stable.
| Model | Average Progress (%) |
|---|---|
| llama-3.2-90B-it | 43.90 ± 3.47 |
| llama-3.1-70B-it | 40.00 ± 3.42 |
| claude-3.5-sonnet | 37.50 ± 4.42 |
| gpt-4o | 33.66 ± 3.30 |
| gemini-1.5-pro | 32.02 ± 3.26 |
| llama-3.1-8B-it | 18.33 ± 3.53 |
| llama-3.2-3B-it | 17.50 ± 3.47 |
| gpt-4o-mini | 15.60 ± 2.53 |
| llama-3.2-11B-it | 15.60 ± 2.50 |
| gemini-1.5-flash | 12.80 ± 2.33 |
| llama-3.2-1B-it | 10.83 ± 2.84 |
| Model | Average Progress (%) |
|---|---|
| claude-3.5-sonnet | 34.45 ± 4.36 |
| gemini-1.5-pro | 31.40 ± 3.24 |
| llama-3.2-90B-it | 21.90 ± 2.89 |
| gpt-4o | 18.62 ± 2.72 |
| gpt-4o-mini | 16.41 ± 2.59 |
| gemini-1.5-flash | 8.30 ± 1.93 |
| llama-3.2-11B-it | 5.76 ± 1.63 |
D.3 Observations
Appendix E MiniHack
MiniHack [Samvelyan et al., 2021] is a powerful sandbox framework built on top of the NLE [Küttler et al., 2020] that enables researchers to easily design rich and diverse environments for RL. It provides a flexible platform for creating custom RL tasks ranging from simple grid-world navigation to complex, procedurally generated worlds with intricate game mechanics. The framework allows users to define environments using a human-readable description language or a simple Python interface, giving fine-grained control over environment elements such as terrain, objects, monsters, and traps. MiniHack offers a diverse array of tasks, which can be broadly categorized into three main groups: Navigation Tasks, Skill Acquisition Tasks, and Ported Tasks. To enable interaction with language models, we use the NetHack Language Wrapper described in Appendix F.
From the MiniHack Navigation Tasks, we picked Maze 9x9, Maze 15x15, Corridor, and CorridorBattle, which challenge the agent to reach the goal position by overcoming various difficulties along the way, such as fighting monsters in corridors and navigating through complex or procedurally generated mazes. These tasks feature a relatively small action space: movement in the 8 compass directions plus, depending on the environment, the search, kick, open, and eat actions.
From the MiniHack Skill Acquisition Tasks, we picked Quest (with three difficulty levels: Easy, Medium, and Hard), which challenges the agent to use objects found in the environment to cross a lava river (these objects can grant levitation or freezing abilities), fight monsters, navigate through rooms or mazes, and, towards the end of the quest, use a wand of death to defeat a powerful monster guarding the goal location.
We additionally test the agents on MiniHack Boxoban. This family of environments is an adaptation of the Boxoban puzzle game, which itself is inspired by the classic Sokoban. These environments present a challenging puzzle-solving task within the MiniHack framework, leveraging the NetHack game mechanics. The primary goal in MiniHack Boxoban is to push four boulders (MiniHack’s equivalent of boxes) onto four designated goal locations, which are represented by fountains. This task requires strategic thinking and planning, as the agent must carefully maneuver the boulders through the environment without getting them stuck in corners or against walls.
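As a compact summary, the sketch below lists the selected MiniHack tasks and the restricted text action set described above. The grouping and action names follow the descriptions in this section; any mapping to registered MiniHack environment IDs is deliberately omitted, since those identifiers are not specified here.

```python
# Compact summary of the MiniHack task suite and the restricted text action
# set described above. The grouping is ours; it does not name official
# MiniHack environment IDs.
MINIHACK_TASKS = {
    "navigation": ["Maze 9x9", "Maze 15x15", "Corridor", "CorridorBattle"],
    "skill_acquisition": ["Quest (Easy)", "Quest (Medium)", "Quest (Hard)"],
    "puzzle": ["Boxoban"],
}

# Movement in the 8 compass directions, plus a few verbs that are only
# enabled in environments that need them.
TEXT_ACTIONS = [
    "north", "south", "east", "west",
    "northeast", "northwest", "southeast", "southwest",
    "search", "kick", "open", "eat",
]

if __name__ == "__main__":
    for group, tasks in MINIHACK_TASKS.items():
        print(f"{group}: {', '.join(tasks)}")
    print("actions:", ", ".join(TEXT_ACTIONS))
```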
We provide MiniHack results for LLM and VLM mode in the two tables below; standard errors were computed using 5 seeds for each task. Claude 3.5 Sonnet achieves the highest progress in both language-only and vision-language mode. Overall progress remains low, with successful completions limited to some of the Corridor and CorridorBattle tasks.
| Model | Average Progress (%) |
|---|---|
| claude-3.5-sonnet | 15.00 ± 5.65 |
| gpt-4o | 10.00 ± 4.74 |
| gpt-4o-mini | 10.00 ± 4.74 |
| llama-3.1-70B-it | 7.50 ± 4.16 |
| gemini-1.5-pro | 5.00 ± 3.45 |
| llama-3.1-8B-it | 5.00 ± 3.45 |
| gemini-1.5-flash | 5.00 ± 3.45 |
| llama-3.2-1B-it | 5.00 ± 3.45 |
| llama-3.2-11B-it | 2.50 ± 2.47 |
| llama-3.2-3B-it | 2.50 ± 2.47 |
| Model | Average Progress (%) |
|---|---|
| claude-3.5-sonnet | 22.50 ± 6.60 |
| gpt-4o | 5.00 ± 3.44 |
| gemini-1.5-pro | 5.00 ± 3.44 |
| llama-3.2-90B-it | 2.50 ± 2.47 |
| gpt-4o-mini | 2.50 ± 2.47 |
| gemini-1.5-flash | 2.50 ± 2.47 |
| llama-3.2-11B-it | 2.50 ± 2.47 |
E.1 Observations
Appendix F NetHack Learning Environment
The NetHack Learning Environment (NLE) [Küttler et al., 2020] is a scalable, procedurally generated, stochastic, rich, and challenging environment designed to drive long-term research in RL on problems such as exploration, planning, skill acquisition, and language-conditioned RL. Built around the classic and highly complex terminal roguelike game NetHack, NLE provides a complex and dynamic environment where agents must navigate through procedurally generated dungeons, interact with hundreds of entity types, and learn to overcome various challenges.
The goal of the player is to descend through procedurally generated dungeon levels while killing monsters, solving puzzles, and gathering better equipment, in order to retrieve the Amulet of Yendor and finally ascend back to the surface to win the game. NetHack is notoriously challenging, even for human players; mastering the game can take years, even with online resources like the NetHack Wiki. Success in NetHack demands long-term strategic planning, as a winning game can involve hundreds of thousands of steps, as well as short-term tactics to fight hordes of monsters. Accurate credit assignment is also crucial to understanding which actions contributed to success or failure. NetHack has already been used extensively as a testbed for RL agents [Wołczyk et al., 2024, Piterbarg et al., 2024, Hambro et al., 2022b]; tabula-rasa RL agents particularly struggle due to the sparse reward, complex credit assignment, extremely long time horizons, and high stochasticity of the game. The current state-of-the-art agent remains a hand-coded symbolic policy [Hambro et al., 2022a].
F.1 NetHack Language Wrapper
The NetHack Language Wrapper [Goodger et al., 2023] is a tool designed to interface with the NLE and MiniHack by translating non-language observations into text-based representations. The wrapper converts various NLE observations such as glyphs, blstats, tty_chars, inv_letters, inv_strs, and tty_cursor into readable text equivalents. For example, it transforms the visual display of the game environment into a textual description, including details about the surroundings, inventory, and player statistics. The wrapper also supports text-based actions, allowing users to interact with the environment using commands like wait, apply, and north, which are then converted into the discrete actions required by the NLE. This functionality enables easier interaction with the NetHack environment, particularly for language models.
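To make the translation concrete, here is a small illustrative sketch of the kind of conversion such a wrapper performs, turning low-level fields into a textual observation. This is not the actual wrapper's code; the field names and phrasing are assumptions.

```python
# Illustrative sketch of converting low-level NetHack fields into text.
# Not the actual NetHack Language Wrapper; field names are assumptions.
def describe_stats(blstats):
    return (f"HP: {blstats['hp']}/{blstats['max_hp']}  "
            f"Dlvl: {blstats['dungeon_level']}  "
            f"Exp: {blstats['experience_level']}  "
            f"Gold: {blstats['gold']}")


def text_observation(message, blstats, inventory):
    lines = [f"Message: {message}", describe_stats(blstats)]
    lines.append("Inventory: " + (", ".join(inventory) if inventory else "empty"))
    return "\n".join(lines)


if __name__ == "__main__":
    print(text_observation(
        message="You see here a rusty dagger.",
        blstats={"hp": 14, "max_hp": 16, "dungeon_level": 2,
                 "experience_level": 3, "gold": 47},
        inventory=["a blessed +1 long sword", "2 food rations"],
    ))
```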
F.2 New NetHack Progression System
NetHack features an in-game scoring system that rewards players for actions such as killing monsters, identifying objects, eating food, collecting gold and items, and ultimately ascending in the game. However, we argue that this scoring system does not effectively capture true game progression, as players can win the game with scores ranging from a few hundred thousand to several million points. To address this limitation, we developed a novel, data-informed progression metric using a dataset of human-played NetHack games [Hambro etal., 2022b]. Specifically, we recorded the dungeon levels and experience levels achieved in each game, as well as whether the game resulted in an ascension. Utilizing these statistics, we constructed a data-centric progression system where each data point represents the probability of a human player winning the game after reaching a specific dungeon level or experience level. The resulting progression curves are presented in Figure 10. For practical purposes, we define Dungeon Level 1 (Dlvl:1) and Experience Level 1 as representing 0% progression, corresponding to the game’s starting point, and ascension as 100% progression. The agent’s overall progress is thus determined by the highest progression achieved between the dungeon level and experience level attained.
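A minimal sketch of this computation is given below. The win-probability tables here are placeholders; in our metric they are estimated from the human game dataset [Hambro et al., 2022b], with level 1 pinned to 0% and ascension to 100%.

```python
# Minimal sketch of the data-driven progression metric described above.
# The progression tables below contain placeholder values only.
import bisect

DLVL_PROGRESS = {1: 0.0, 2: 0.8, 3: 1.6, 5: 4.0, 10: 12.0, 53: 100.0}  # placeholders
XLVL_PROGRESS = {1: 0.0, 2: 0.5, 4: 2.5, 10: 15.0, 30: 100.0}          # placeholders


def lookup(table, level):
    """Piecewise-constant lookup of the progression value for a reached level."""
    keys = sorted(table)
    idx = bisect.bisect_right(keys, level) - 1
    return table[keys[max(idx, 0)]]


def nethack_progression(max_dlvl, max_xlvl, ascended):
    """Overall progress: 100% on ascension, otherwise the better of the two curves."""
    if ascended:
        return 100.0
    return max(lookup(DLVL_PROGRESS, max_dlvl), lookup(XLVL_PROGRESS, max_xlvl))


if __name__ == "__main__":
    # e.g. the best individual run reported below: Dlvl 3, Exp level 4, no ascension.
    print(nethack_progression(max_dlvl=3, max_xlvl=4, ascended=False))
```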
F.3 NetHack Results
We provide NetHack results for LLM and VLM mode in the two tables below. Standard errors are computed using 5 seeds. o1-preview achieves the highest progression of all tested models, but it is still very far from making any significant progress in the game. The best individual run was achieved by Gemini-1.5-Pro in vision-language mode, reaching dungeon level 3 and experience level 4.
| Model | Average Progress (%) |
|---|---|
| o1-preview | 1.57 ± 0.40 |
| claude-3.5-sonnet | 0.58 ± 0.52 |
| gpt-4o | 0.37 ± 0.37 |
| o1-mini | 0.36 ± 0.24 |
| llama-3.1-70B-it | 0.35 ± 0.35 |
| llama-3.1-8B-it | 0.00 ± 0.00 |
| gemini-1.5-pro | 0.31 ± 0.31 |
| gpt-4o-mini | 0.00 ± 0.00 |
| gemini-1.5-flash | 0.00 ± 0.00 |
| llama-3.2-90B-it | 0.00 ± 0.00 |
| llama-3.2-11B-it | 0.00 ± 0.00 |
| llama-3.2-3B-it | 0.00 ± 0.00 |
| llama-3.2-1B-it | 0.00 ± 0.00 |
| Model | Average Progress (%) |
|---|---|
| claude-3.5-sonnet | 1.16 ± 0.42 |
| gemini-1.5-pro | 0.48 ± 0.48 |
| gpt-4o | 0.37 ± 0.37 |
| gpt-4o-mini | 0.00 ± 0.00 |
| gemini-1.5-flash | 0.00 ± 0.00 |
| llama-3.2-90B-it | 0.00 ± 0.00 |
| llama-3.2-11B-it | 0.00 ± 0.00 |
F.4 Observations
Despite the availability of a language wrapper that describes its observations [Goodger et al., 2023], NetHack is not meant to be played with language only, so we provide the ASCII map in language mode and the RGB tiles map in vision-language mode. In the LLM context, we only keep information that is important in the long term, i.e., the game message and the language observation. Agent stats and the inventory are only needed at the current step, so we do not keep them in the context; this also prevents the NetHack context length from growing out of control.
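A small sketch of this context-management policy is shown below; the class and method names are illustrative, not the benchmark's actual implementation.

```python
# Illustrative sketch of the context-management policy described above: the
# running history keeps only game messages and language observations, while
# per-step details (stats, inventory) are included only for the current step.
from collections import deque


class NetHackContext:
    def __init__(self, max_history=16):
        # Long-term history: messages + language observations only.
        self.history = deque(maxlen=max_history)

    def add_step(self, message, language_obs):
        self.history.append(f"Message: {message}\nObservation: {language_obs}")

    def build_prompt(self, current_stats, current_inventory):
        # Stats and inventory matter only for the current step, so they are
        # appended here rather than stored in the history.
        past = "\n\n".join(self.history)
        return (f"{past}\n\n"
                f"Current stats: {current_stats}\n"
                f"Inventory: {current_inventory}")


if __name__ == "__main__":
    ctx = NetHackContext()
    ctx.add_step("You kill the newt!", "You see stairs down to the east.")
    print(ctx.build_prompt("HP 14/16, Dlvl 2", "a long sword, 2 food rations"))
```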
F.5 LLMs possess extensive NetHack knowledge
Agents need extensive knowledge to play NetHack effectively. Skilled players leverage their understanding of the game, acquired through hours of gameplay, while new players often rely on written resources online, such as the NetHack Wiki. We use a questionnaire to test if LLMs possess some of this knowledge and assess if agents can apply it. The questions are as follows:
1. Are there any consequences of eating rotten food in NetHack? (Answer: It might lead to blindness, hallucination, or even instant death.)
2. Are there any consequences of eating too much food in NetHack? (Answer: The player character might choke on the food and die.)
3. What happens when you ascend the stairs on level 1 of NetHack? (Answer: It leads to instant game over.)
4. Is it a good idea to use an invisibility wand on a monster in NetHack? (Answer: No, as it will be harder to fight them.)
5. Is it a good idea to pray to an angered god in NetHack? (Answer: No, the god will punish the player.)
These questions test a fundamental understanding of the game mechanics, particularly focusing on behaviors that new players may mistakenly attempt and should be avoided.
We summarize each LLM's responses in the table below. For each question, we assess whether the response is accurate, whether the conclusion is correct (i.e., the LLM recognizes the need to avoid the behavior), and whether the agent successfully avoids the mistake during gameplay.
| LLM | | Q1 | Q2 | Q3 | Q4 | Q5 |
|---|---|---|---|---|---|---|
| GPT 4o | Correct | ✔ | ✔ | ✔ | ✔ | |
| | Conclusion | ✔ | ✔ | ✔ | ✔ | ✔ |
| | Behaviour | ✗ | ✔ | ✗ | N/A | ✔ |
| GPT 4o-mini | Correct | ✗ | ✔ | ✗ | ✔ | |
| | Conclusion | ✔ | ✔ | ✔ | ✔ | ✔ |
| | Behaviour | ✗ | ✔ | ✔ | N/A | N/A |
| Gemini 1.5-flash | Correct | ✗ | ✗ | ✗ | ✗ | ✔ |
| | Conclusion | ✔ | ✗ | ✗ | ✗ | ✔ |
| | Behaviour | ✔ | ✔ | ✗ | N/A | N/A |
| Gemini 1.5-pro | Correct | ✔ | ✗ | ✔ | ✔ | |
| | Conclusion | ✔ | ✔ | ✗ | ✔ | ✔ |
| | Behaviour | ✔ | ✔ | ✗ | N/A | N/A |
| Llama 3.1 70B Instruct | Correct | ✔ | ✗ | ✔ | ✗ | ✔ |
| | Conclusion | ✔ | ✗ | ✗ | ✔ | ✔ |
| | Behaviour | ✗ | ✗ | ✗ | ✗ | ✗ |
| Llama 3.2 11B Instruct | Correct | ✗ | ✗ | ✗ | ✗ | ✔ |
| | Conclusion | ✔ | ✗ | ✗ | ✔ | ✔ |
| | Behaviour | ✗ | ✗ | ✗ | N/A | N/A |
| Llama 3.2 90B Instruct | Correct | ✔ | ✔ | ✗ | ✔ | |
| | Conclusion | ✔ | ✔ | ✔ | ✔ | ✔ |
| | Behaviour | ✗ | ✔ | ✗ | N/A | N/A |
We observe that, while the LLMs generally understand that these common mistakes should be avoided, regardless of whether their reasoning is entirely correct, they still struggle to consistently exploit that knowledge. Agents will often consume rotten food and prematurely exit the game by ascending the stairs on the first level. This illustrates a gap between LLM agents' declarative knowledge of NetHack and their ability to exploit it in practice.