1 change: 1 addition & 0 deletions .github/workflows/test.yml
@@ -19,6 +19,7 @@ jobs:
benchmark:
- example_bench
- course_exam_bench
- toposense_bench
# TODO: For now, we comment out other benchmarks as they have no tests
# - arteval_bench
# - cache_bench
1 change: 1 addition & 0 deletions README.md
@@ -21,6 +21,7 @@ System Intelligence Benchmark currently includes the following example benchmark
- **System Lab Benchmark** ([benchmarks/course_lab_bench/](benchmarks/course_lab_bench/)) - Assesses AI capability on practical system course labs and projects
- **System Artifact Benchmark** ([benchmarks/arteval_bench/](benchmarks/arteval_bench/)) - Evaluates AI performance on artifact evaluation
- **System Modeling Benchmark** ([benchmarks/sysmobench/](benchmarks/sysmobench/)) - Evaluates an agent's ability to produce correct TLA+ models for real-world concurrent and distributed systems, covering system capabilities across system comprehension, abstraction, and potentially tool fluency.
- **TopoSense Benchmark** ([benchmarks/toposense_bench/](benchmarks/toposense_bench/)) - Evaluates Semantic-Spatial Sensor Scheduling (S³) capabilities in large-scale IoT digital twins (5,250 queries across 2,510 cameras)
- **Example Benchmark** ([benchmarks/example_bench/](benchmarks/example_bench/)) - Template and reference implementation for creating new benchmarks

## Quick Start
90 changes: 90 additions & 0 deletions benchmarks/toposense_bench/README.md
> **Review comment (Collaborator):** Could you add an entry for the benchmark to the root project README?

> **Review comment (Collaborator):** There’s also ongoing work in another PR to add a Why.md file to each benchmark directory. See the discussion: #21 (comment)
@@ -0,0 +1,90 @@
# TopoSense-Bench: Semantic-Spatial Sensor Scheduling

**TopoSense-Bench** is a large-scale, rigorous benchmark designed to evaluate Large Language Models (LLMs) on the **Semantic-Spatial Sensor Scheduling (S³)** problem.

Originating from the **ACM MobiCom '26** paper *"IoT-Brain: Grounding LLMs for Semantic-Spatial Sensor Scheduling"*, this benchmark tests an agent's ability to translate high-level natural language user intents (e.g., *"Find my backpack lost between the library and the gym"*) into precise physical sensor activation plans within a large-scale digital twin.

## 📊 Overview

- **Source**: Hosted on [Hugging Face](https://huggingface.co/datasets/IoT-Brain/TopoSense-Bench) (seamlessly integrated via the `datasets` library).
- **Scale**:
- **5,250** Natural Language Queries.
- **2,510** Sensors (Cameras).
- **161** Floor Plans across **33** Buildings.
- **Problem Domain**: Embodied AI, IoT, Spatial Reasoning, and RAG (Retrieval-Augmented Generation).

## 🎯 Task Taxonomy

The benchmark categorizes queries into three tiers of complexity based on spatial scope and reasoning difficulty:

- **Tier 1: Intra-Zone Perception**
- Simple queries focused on specific rooms or focal areas (e.g., *"Check the entrance of the conference hall"*).
- **Tier 2: Intra-Building Coordination**
- Complex queries requiring navigation across multiple floors within a single building (e.g., *"Track the path from the 4th-floor lab to the ground floor exit"*).
- **Tier 3: Inter-Building Coordination**
- Long-horizon queries involving transitions between outdoor spaces and multiple buildings (e.g., *"I walked from the Library to the Gym, check cameras along the way"*).

## ⚙️ Evaluation Methodology

Unlike standard QA benchmarks, TopoSense-Bench employs a **Retrieval-Augmented Generation (RAG)** workflow to simulate realistic sensor scheduling:

1. **Context Retrieval**: The system dynamically retrieves the relevant topological map data (textual representation of buildings/floors) based on the user's query using a heuristic `TopologyManager`.
2. **Reasoning**: The LLM acts as a scheduler. It must analyze the provided map data and the user's intent to identify the specific sensor node ID that best satisfies the request.
3. **Scoring**: The evaluation uses a parsing-based matching metric: the core identifier is parsed out of the ground truth (e.g., `teaching_building_1_camera_03`) and, after separator normalization, checked for containment in the LLM's output.
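
As a concrete illustration, the matching rule mirrors the containment check implemented in `src/evaluator.py` later in this diff:

```python
# Minimal sketch of the scoring rule: lowercase, normalize separators,
# then check that the ground-truth identifier appears in the model's answer.
def matches(llm_answer: str, gt_name: str) -> bool:
    clean = lambda s: s.lower().replace("_", " ").replace("-", " ")
    return clean(gt_name) in clean(llm_answer)

assert matches("Use teaching_building_1_camera_03.", "teaching_building_1_camera_03")
```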

## 🚀 Quick Start

### 1. Installation

Ensure you are in the `benchmarks/toposense_bench` directory, then install the required dependencies:

```bash
pip install -r requirements.txt
```

### 2. Configuration

Create or edit `env.toml` to configure your LLM provider. This benchmark uses `litellm` for model calls.

```toml
[llm]
# Example for OpenAI
OPENAI_API_KEY = "sk-..."

# Example for DeepSeek (OpenAI-Compatible)
# OPENAI_API_KEY = "sk-..."
# OPENAI_API_BASE = "https://api.deepseek.com"
```

### 3. Run Evaluation

Run the evaluation script, passing the model name as the first argument (if omitted, `run.sh` defaults to `gpt-4o`).

> **Note**: If using a non-OpenAI provider (like DeepSeek or Qwen) via the OpenAI-compatible endpoint, please add the `openai/` prefix to the model name.

```bash
# Run with GPT-4o
bash run.sh "gpt-4o"

# Run with DeepSeek-Chat
bash run.sh "openai/deepseek-chat"
```
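
For context, the `openai/` prefix is how `litellm` routes a request through its OpenAI-compatible provider, so a custom `OPENAI_API_BASE` (e.g., DeepSeek's) is honored. Below is a rough sketch of the call pattern, assuming the standard `litellm.completion` API; the actual invocation lives in `src/main.py`, which is not shown here:

```python
# Hypothetical sketch of an OpenAI-compatible call via litellm.
from litellm import completion

response = completion(
    model="openai/deepseek-chat",  # "openai/" prefix -> OpenAI-compatible endpoint
    messages=[{"role": "user", "content": "Which camera covers the gym entrance?"}],
)
print(response.choices[0].message.content)
```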

### 4. Results

After the run completes, results will be saved in the `outputs/` directory:
- `summary.json`: Overall accuracy and breakdown by task tier.
- `results.jsonl`: Detailed logs including retrieval status, model input/output, and correctness for every query.
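
A quick way to inspect these files, assuming the layout described above (the exact field names in `results.jsonl` may differ):

```python
import json
import pandas as pd

# Per-query records: one JSON object per line.
df = pd.read_json("outputs/results.jsonl", lines=True)
print(df.head())

# Aggregate accuracy and per-tier breakdown.
with open("outputs/summary.json") as f:
    print(json.load(f))
```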

## 📚 Citation

If you use this benchmark in your research, please cite our MobiCom '26 paper:

```bibtex
@inproceedings{iotbrain2026,
  title={IoT-Brain: Grounding LLMs for Semantic-Spatial Sensor Scheduling},
  author={Anonymous Author(s)},
  booktitle={Proceedings of the 32nd Annual International Conference on Mobile Computing and Networking (MobiCom '26)},
  year={2026}
}
```
15 changes: 15 additions & 0 deletions benchmarks/toposense_bench/Why.md
@@ -0,0 +1,15 @@
# Why TopoSense-Bench?

## The Problem: The Semantic-Physical Mapping Gap
Modern IoT systems are transitioning from passive monitoring to intent-driven operation. However, a critical gap exists between high-level human intent (e.g., *"Find my backpack lost between the library and the gym"*) and the precise physical sensor actions required to fulfill it.

Existing benchmarks often focus on pure QA or code generation, overlooking the **embodied** and **spatial** reasoning capabilities required for real-world cyber-physical systems.

## The Solution: Semantic-Spatial Sensor Scheduling (S³)
TopoSense-Bench introduces the S³ challenge, requiring LLMs to:
1. **Reason Spatially**: Understand complex topological relationships (connectivity, floor transitions) in a large-scale digital twin.
2. **Act Proactively**: Select the optimal subset of sensors from a massive network (2,510 cameras) to satisfy a query, rather than just answering a text question.
3. **Ground in Reality**: Map vague natural language to concrete sensor identifiers (e.g., `teaching_building_1_camera_03`).

## Impact
By mastering this benchmark, LLMs demonstrate the capability to serve as the "brain" for large-scale smart city and smart campus infrastructures, moving beyond chatbots to actionable physical agents.
5 changes: 5 additions & 0 deletions benchmarks/toposense_bench/env.toml.example
@@ -0,0 +1,5 @@
[llm]

OPENAI_API_KEY = "your_key_here"

OPENAI_API_BASE = "your_url_here"
16 changes: 16 additions & 0 deletions benchmarks/toposense_bench/install.sh
@@ -0,0 +1,16 @@
#!/bin/bash

# Create virtual environment
python3 -m venv .venv

# Activate virtual environment
source .venv/bin/activate

# Upgrade pip
pip install --upgrade pip

# Install requirements
pip install -r requirements.txt

echo "✅ Installation complete. Virtual environment created in .venv/"
echo "👉 To activate: source .venv/bin/activate"
11 changes: 11 additions & 0 deletions benchmarks/toposense_bench/requirements.txt
@@ -0,0 +1,11 @@
# Hugging Face Ecosystem
datasets>=2.14.0
huggingface_hub>=0.16.0

# Data Processing & Utilities
pandas>=1.5.0
tqdm
loguru

# Configuration parsing (for compatibility with older Python versions)
tomli>=2.0.1; python_version < "3.11"
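
The environment marker on `tomli` exists because `tomllib` only entered the standard library in Python 3.11. A likely loading pattern this supports — the actual parsing code is not shown here, so treat this as an assumption:

```python
# Compatibility shim: stdlib tomllib on 3.11+, the tomli backport otherwise.
try:
    import tomllib
except ModuleNotFoundError:
    import tomli as tomllib

with open("env.toml", "rb") as f:
    config = tomllib.load(f)  # e.g. config["llm"]["OPENAI_API_KEY"]
```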
23 changes: 23 additions & 0 deletions benchmarks/toposense_bench/run.sh
@@ -0,0 +1,23 @@
#!/bin/bash
> **Review comment (Collaborator):** Can you follow the template to add an install.sh that is needed for our integration? Thanks.

# ==============================================================================
# TopoSense-Bench Execution Script
#
# Usage:
# ./run.sh [model_name]
#
# Examples:
# ./run.sh "gpt-4o" # Run with OpenAI GPT-4o (Default)
# ./run.sh "openai/deepseek-chat" # Run with DeepSeek (via OpenAI-compatible endpoint)
#
# Note: Ensure that API keys are correctly configured in 'env.toml'.
# ==============================================================================

# Set default model to "gpt-4o" if no argument is provided
MODEL_NAME=${1:-"gpt-4o"}

echo "🚀 Starting TopoSense-Bench evaluation..."
echo "🤖 Model: $MODEL_NAME"

# Run the main evaluation script
python src/main.py --model_name "$MODEL_NAME"
Empty file.
77 changes: 77 additions & 0 deletions benchmarks/toposense_bench/src/evaluator.py
@@ -0,0 +1,77 @@
"""Evaluator for TopoSense Benchmark."""

import re
import ast
from loguru import logger


class TopoSenseEvaluator:
"""Evaluator class for Semantic-Spatial Sensor Scheduling tasks."""

def __init__(self):
pass

def parse_node_info(self, text):
"""
Parses the Node string representation to extract the critical 'name' tag.

Input format example:
"Node(223, 307, Tags: {'man_made': 'surveillance', 'name': 'camera_1'})"

Args:
text (str): The raw ground truth string from the dataset.

Returns:
str: The extracted sensor name (e.g., "camera_1") or the original text if parsing fails.
"""
try:
# 1. Attempt to extract the Tags dictionary part using regex
tags_match = re.search(r"Tags:\s*(\{.*?\})", text)
if tags_match:
tags_str = tags_match.group(1)
# Safely evaluate the string as a Python dictionary
tags = ast.literal_eval(tags_str)
# Return the 'name' tag converted to lowercase
return tags.get('name', '').lower()

# 2. Fallback: If it's a pure ID format or regex fails, return normalized text
return text.strip().lower()
except Exception:
return text.strip().lower()

def eval(self, llm_response_json, ground_truth_str):
"""
Evaluate the LLM's response against the ground truth.

Args:
llm_response_json (dict): The JSON output from the LLM.
Expected format: {"answer": "...", "explanation": "..."}
ground_truth_str (str): The raw answer string from the dataset.

Returns:
dict: Evaluation result containing status, score, and parsed ground truth.
"""
# 1. Extract the core answer from the LLM response
llm_answer = str(llm_response_json.get("answer", "")).lower()

# 2. Parse the unique identifier (Target Name) from the Ground Truth
gt_target_name = self.parse_node_info(ground_truth_str)

# 3. Evaluation Logic
# Requirement: The LLM's answer must contain the core identifier of the GT.
# Example:
# GT: "fire_fighting_access_1_camera_1"
# LLM: "I suggest using fire_fighting_access_1_camera_1" -> Correct

# Normalize strings by replacing underscores and hyphens with spaces for robust matching
clean_llm = llm_answer.replace("_", " ").replace("-", " ")
clean_gt = gt_target_name.replace("_", " ").replace("-", " ")

# Perform containment check
is_correct = clean_gt in clean_llm

return {
"status": "correct" if is_correct else "incorrect",
"score": 1 if is_correct else 0,
"parsed_gt": gt_target_name
}
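
A quick usage sketch of the evaluator above; the `Node(...)` ground-truth string follows the format documented in `parse_node_info`:

```python
evaluator = TopoSenseEvaluator()
result = evaluator.eval(
    {"answer": "I suggest using fire_fighting_access_1_camera_1"},
    "Node(223, 307, Tags: {'man_made': 'surveillance', 'name': 'fire_fighting_access_1_camera_1'})",
)
print(result)
# {'status': 'correct', 'score': 1, 'parsed_gt': 'fire_fighting_access_1_camera_1'}
```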