This study evaluates the performance of state-of-the-art text-based generative large language models in indoor robot navigation planning, focusing on instructions that require object-centric, spatial, and common-sense reasoning. Three scenes from the Matterport3D dataset were selected, along with corresponding instruction sequences and routes. Object-labeled semantic maps were generated from the RGB-D images and camera poses of the scenes. The instructions were provided to the models, and the generated robot code was executed on a mobile robot within the selected scenes. The routes followed by the robot, which detected objects through the semantic map, were recorded. The findings indicate that while the models successfully executed object- and spatial-based instructions, some struggled with those requiring common-sense reasoning. This study aims to contribute to robotics research by providing insights into the navigation planning capabilities of language models.
First scene: 5LpN3gDmAk7
Second scene: jh4fc5c5qoQ
Third scene: mJXqzFtmKg4
Semantic understanding is essential for object- and spatial-centric navigation tasks in robotics. To this end, we employed Visual Language Maps (VLMaps) to construct semantic maps from the Matterport3D dataset, which offers a large collection of realistic indoor scenes. Using RGB-D frames and their corresponding poses, we built semantic maps for three distinct scenes. Each scene includes diverse objects, obstacles, and room layouts, providing a benchmark to evaluate LLMs under dynamic and varied conditions.
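For illustration, the sketch below shows the core back-projection step behind such maps: lifting each depth pixel into world coordinates with the camera pose and writing its semantic label into a top-down grid. The intrinsics, grid resolution, and the assumption that per-pixel semantic labels are already available are simplifications for this example and do not reflect the exact VLMaps pipeline.

import numpy as np

# Assumed pinhole intrinsics and map settings; not the VLMaps defaults.
FX, FY, CX, CY = 540.0, 540.0, 320.0, 240.0
CELL = 0.05                                    # metres per grid cell
GRID = np.zeros((1000, 1000), dtype=np.int32)  # top-down semantic label grid

def accumulate_frame(depth, pose, labels):
    """Back-project one RGB-D frame and write its labels into the top-down grid.
    depth: HxW in metres, pose: 4x4 camera-to-world, labels: HxW semantic class ids."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    x = (u - CX) * depth / FX
    y = (v - CY) * depth / FY
    pts = np.stack([x, y, depth, np.ones_like(depth)], axis=-1).reshape(-1, 4)
    world = (pose @ pts.T).T[:, :3]
    # Which two axes span the floor plane depends on the dataset convention;
    # here we assume world x/y form the top-down plane.
    gx = (world[:, 0] / CELL).astype(int) + GRID.shape[1] // 2
    gy = (world[:, 1] / CELL).astype(int) + GRID.shape[0] // 2
    ok = (depth.reshape(-1) > 0)
    ok &= (gx >= 0) & (gx < GRID.shape[1]) & (gy >= 0) & (gy < GRID.shape[0])
    GRID[gy[ok], gx[ok]] = labels.reshape(-1)[ok]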
System prompt:
You are an experienced robot familiar with where each object belongs in a house.
You are asked to execute a batch of tasks involving various objects within a home,
including a table, picture, cabinet, cushion, window, sofa, bed, curtain, plant, sink, stairs,
toilet, towel, mirror, tv_monitor, counter, and shelving. For each task, you need to identify
an object based on the context and reach it to execute the task. For the third task only,
we assume that reaching the object completes the task.
We aim to evaluate the performance of state-of-the-art text-based generative large language models in indoor robot navigation planning, focusing on the execution of instructions that require object-centric, spatial, and common-sense reasoning. To align the models with these scenarios, we applied the persona pattern prompt engineering technique, configuring them to function as assistants via the system prompt shown above. For each scene, we designed three instruction sets targeting object-goal, spatial-goal, and common-sense reasoning-based navigation tasks, aiming to comprehensively assess the models' planning capabilities.
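As a minimal sketch of this setup, the snippet below combines the persona system prompt with one of the task instructions in a single chat request, here using the OpenAI Python client as one example; the model name and prompt file paths are illustrative, and the other model families (Gemini, DeepSeek, Claude) would be queried through their own clients in the same fashion.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The persona system prompt shown above and one task instruction (paths are illustrative).
system_prompt = open("prompts/system_prompt.txt").read()
task_prompt = open("prompts/object_goal_task.txt").read()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": task_prompt},
    ],
)
generated_robot_code = response.choices[0].message.content  # code to run on the robot
print(generated_robot_code)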
First task - Object Goal Navigation:
I want you to perform multiple tasks in an order, one by one.
Go by the counter, approach the sink, go to the stairs,
and then reach the cushion.
Now, repeat this process in a reverse order and
conclude it by reaching the chair.
Once you complete these, go back and forth between
cushion and counter twice.
Second task - Spatial Goal Navigation:
I want you to perform multiple tasks in an order, one by one.
Move first to the left side of the chair in front of you,
face the sofa, and then move to the west of the counter.
Later, with the counter on your right,
go to the east of the window,
face the chair in front of you
and finally move to the south of the door.
Finally, turn absolute 180 degrees.
Third task - Common-Sense Reasoning:
I need you to complete a series of tasks, step by step.
First, move the fruits and vegetables
to a suitable spot for washing.
Then, put the towels in an appropriate place,
organize my reading books,
retrieve the book I left where I last sat,
and adjust the brightness for resting.
Finally, arrange the sleeping pillows.
The defined prompts are provided to the models to generate robot code, which is then executed in parallel within the Habitat-Sim simulation environment using a Fetch robot. The resulting trajectories are subsequently extracted and compared against ground-truth trajectories.
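To make this concrete, the sketch below shows one plausible decomposition of the first task in the style of code the models are expected to return. The move_to_object primitive and the StubRobot class are hypothetical stand-ins for the wrappers that drive the Fetch agent over the semantic map; they are not an actual VLMaps or Habitat-Sim API.

class StubRobot:
    """Placeholder executor that only logs actions; the real wrapper drives the agent in simulation."""
    def move_to_object(self, name):
        print(f"navigating to the nearest '{name}' on the semantic map")

def object_goal_task(robot):
    # Visit the objects in the given order...
    for target in ["counter", "sink", "stairs", "cushion"]:
        robot.move_to_object(target)
    # ...repeat the sequence in reverse order and conclude at the chair...
    for target in ["cushion", "stairs", "sink", "counter"]:
        robot.move_to_object(target)
    robot.move_to_object("chair")
    # ...then go back and forth between cushion and counter twice.
    for _ in range(2):
        robot.move_to_object("cushion")
        robot.move_to_object("counter")

object_goal_task(StubRobot())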
Top-down semantic maps of the three MP3D scenes used in this study: 5LpN3gDmAk7, jh4fc5c5qoQ, and mJXqzFtmKg4.
Each task is designed to evaluate a specific aspect of an LLM's navigation planning capability. The first task assesses the model's basic understanding of simple navigation instructions by requiring it to follow a sequence of objects in a defined order. The second task evaluates spatial reasoning by instructing the model to approach objects from specific positions and align the robot accordingly. The third task tests the model's common-sense reasoning skills, requiring it to choose the appropriate object based on contextual cues. For this final task, we assume that reaching the selected object completes the navigation objective.
Ground-truth execution for each task, showing the robot’s point of view, its top-down trajectory over the semantic map, and the corresponding prompt.
We leveraged evaluation metrics widely used in robot navigation, namely Success Rate (SR), Navigation Error (NE), Success weighted by normalized Dynamic Time Warping (SDTW), and Coverage weighted by Length Score (CLS), to assess the executed trajectories and analyze the models’ navigation planning performance. A total of 10 models from the ChatGPT, Gemini, DeepSeek, and Claude language model families were evaluated. Our experimental results indicate that current LLMs are capable of effective navigation planning in both object-centric and spatial tasks. Notably, GPT-4o, DeepSeek-R1, and Claude 3.5 Sonnet outperformed the others, particularly due to their stronger common-sense reasoning capabilities.
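As a reference for how these trajectory metrics can be computed, the sketch below implements NE, SR, nDTW, and SDTW for 2-D trajectories; the 1.5 m success threshold is an assumed value for illustration, and CLS is omitted for brevity.

import numpy as np

SUCCESS_DIST = 1.5  # metres; assumed success threshold

def navigation_error(pred, gt):
    """Distance between the final predicted position and the goal position."""
    return float(np.linalg.norm(pred[-1] - gt[-1]))

def success(pred, gt):
    return navigation_error(pred, gt) <= SUCCESS_DIST

def ndtw(pred, gt):
    """Normalized Dynamic Time Warping similarity between two trajectories."""
    n, m = len(pred), len(gt)
    dtw = np.full((n + 1, m + 1), np.inf)
    dtw[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(pred[i - 1] - gt[j - 1])
            dtw[i, j] = cost + min(dtw[i - 1, j], dtw[i, j - 1], dtw[i - 1, j - 1])
    return float(np.exp(-dtw[n, m] / (m * SUCCESS_DIST)))

def sdtw(pred, gt):
    """Success weighted by normalized Dynamic Time Warping."""
    return ndtw(pred, gt) if success(pred, gt) else 0.0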
Trajectories of the best-performing models (GPT-4o, DeepSeek-R1, and Claude 3.5 Sonnet, respectively) across each test scenario.
@inproceedings{balci25benchmarking,
  title     = {Benchmarking Large Language Model Reasoning in Indoor Robot Navigation},
  author    = {Emirhan Balcı and Mehmet Sarıgül and Barış Ata},
  booktitle = {Proceedings of the 33rd IEEE Conference on Signal Processing and Communications Applications (SIU)},
  year      = {2025},
  address   = {Istanbul, TR}
}
This work was funded by the Scientific and Technological Research Council of Turkey (TÜBİTAK) under project number 123E694. We also thank the VLMaps project for releasing their work as open-source.