1 change: 1 addition & 0 deletions .github/workflows/test.yml
@@ -19,6 +19,7 @@ jobs:
benchmark:
- example_bench
- course_exam_bench
- toposense_bench
# TODO: For now, we comment out other benchmarks as they have no tests
# - arteval_bench
# - cache_bench
1 change: 1 addition & 0 deletions README.md
@@ -21,6 +21,7 @@ System Intelligence Benchmark currently includes the following example benchmark
- **System Lab Benchmark** ([benchmarks/course_lab_bench/](benchmarks/course_lab_bench/)) - Assesses AI capability on practical system course labs and projects
- **System Artifact Benchmark** ([benchmarks/arteval_bench/](benchmarks/arteval_bench/)) - Evaluates AI performance on artifact evaluation
- **System Modeling Benchmark** ([benchmarks/sysmobench/](benchmarks/sysmobench/)) - Evaluates an agent's ability to produce correct TLA+ models for real-world concurrent and distributed systems, covering system capabilities across system comprehension, abstraction, and potentially tool fluency.
- **TopoSense Benchmark** ([benchmarks/toposense_bench/](benchmarks/toposense_bench/)) - Evaluates Semantic-Spatial Sensor Scheduling (S³) capabilities in large-scale IoT digital twins (5,250 queries across 2,510 cameras)
- **Example Benchmark** ([benchmarks/example_bench/](benchmarks/example_bench/)) - Template and reference implementation for creating new benchmarks

## Quick Start
90 changes: 90 additions & 0 deletions benchmarks/toposense_bench/README.md
> **Review comment (Collaborator):** Could you add an entry for the benchmark to the root project README?

> **Review comment (Collaborator):** There’s also ongoing work in another PR to add a Why.md file to each benchmark directory. See the discussion: #21 (comment)
@@ -0,0 +1,90 @@
# TopoSense-Bench: Semantic-Spatial Sensor Scheduling

**TopoSense-Bench** is a large-scale, rigorous benchmark designed to evaluate Large Language Models (LLMs) on the **Semantic-Spatial Sensor Scheduling (S³)** problem.

Originating from the **ACM MobiCom '26** paper *"IoT-Brain: Grounding LLMs for Semantic-Spatial Sensor Scheduling"*, this benchmark tests an agent's ability to translate high-level natural language user intents (e.g., *"Find my backpack lost between the library and the gym"*) into precise physical sensor activation plans within a large-scale digital twin.

## 📊 Overview

- **Source**: Hosted on [Hugging Face](https://huggingface.co/datasets/IoT-Brain/TopoSense-Bench) (seamlessly integrated via the `datasets` library).
- **Scale**:
- **5,250** Natural Language Queries.
- **2,510** Sensors (Cameras).
- **161** Floor Plans across **33** Buildings.
- **Problem Domain**: Embodied AI, IoT, Spatial Reasoning, and RAG (Retrieval-Augmented Generation).

## 🎯 Task Taxonomy

The benchmark categorizes queries into three tiers of complexity based on spatial scope and reasoning difficulty:

- **Tier 1: Intra-Zone Perception**
- Simple queries focused on specific rooms or focal areas (e.g., *"Check the entrance of the conference hall"*).
- **Tier 2: Intra-Building Coordination**
- Complex queries requiring navigation across multiple floors within a single building (e.g., *"Track the path from the 4th-floor lab to the ground floor exit"*).
- **Tier 3: Inter-Building Coordination**
- Long-horizon queries involving transitions between outdoor spaces and multiple buildings (e.g., *"I walked from the Library to the Gym, check cameras along the way"*).

## ⚙️ Evaluation Methodology

Unlike standard QA benchmarks, TopoSense-Bench employs a **Retrieval-Augmented Generation (RAG)** workflow to simulate realistic sensor scheduling:

1. **Context Retrieval**: The system dynamically retrieves the relevant topological map data (textual representation of buildings/floors) based on the user's query using a heuristic `TopologyManager`.
2. **Reasoning**: The LLM acts as a scheduler. It must analyze the provided map data and the user's intent to identify the specific sensor node ID that best satisfies the request.
3. **Scoring**: The evaluation uses a parsing-based matching metric: the core identifier is parsed out of the ground truth (e.g., `teaching_building_1_camera_03`) and, after separator normalization, checked for containment in the LLM's output.
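
As a concrete illustration, the matching rule mirrors the containment check implemented in `src/evaluator.py` later in this diff:

```python
# Minimal sketch of the scoring rule: lowercase, normalize separators,
# then check that the ground-truth identifier appears in the model's answer.
def matches(llm_answer: str, gt_name: str) -> bool:
    clean = lambda s: s.lower().replace("_", " ").replace("-", " ")
    return clean(gt_name) in clean(llm_answer)

assert matches("Use teaching_building_1_camera_03.", "teaching_building_1_camera_03")
```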

## 🚀 Quick Start

### 1. Installation

Ensure you are in the `benchmarks/toposense_bench` directory, then install the required dependencies:

```bash
pip install -r requirements.txt
```

### 2. Configuration

Create or edit `env.toml` to configure your LLM provider. This benchmark uses `litellm` for model calls.

```toml
[llm]
# Example for OpenAI
OPENAI_API_KEY = "sk-..."

# Example for DeepSeek (OpenAI-Compatible)
# OPENAI_API_KEY = "sk-..."
# OPENAI_API_BASE = "https://api.deepseek.com"
```

### 3. Run Evaluation

Run the evaluation script, passing the model name as the first argument (if omitted, `run.sh` defaults to `gpt-4o`).

> **Note**: If using a non-OpenAI provider (like DeepSeek or Qwen) via the OpenAI-compatible endpoint, please add the `openai/` prefix to the model name.

```bash
# Run with GPT-4o
bash run.sh "gpt-4o"

# Run with DeepSeek-Chat
bash run.sh "openai/deepseek-chat"
```
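
For context, the `openai/` prefix is how `litellm` routes a request through its OpenAI-compatible provider, so a custom `OPENAI_API_BASE` (e.g., DeepSeek's) is honored. Below is a rough sketch of the call pattern, assuming the standard `litellm.completion` API; the actual invocation lives in `src/main.py`, which is not shown here:

```python
# Hypothetical sketch of an OpenAI-compatible call via litellm.
from litellm import completion

response = completion(
    model="openai/deepseek-chat",  # "openai/" prefix -> OpenAI-compatible endpoint
    messages=[{"role": "user", "content": "Which camera covers the gym entrance?"}],
)
print(response.choices[0].message.content)
```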

### 4. Results

After the run completes, results will be saved in the `outputs/` directory:
- `summary.json`: Overall accuracy and breakdown by task tier.
- `results.jsonl`: Detailed logs including retrieval status, model input/output, and correctness for every query.
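
A quick way to inspect these files, assuming the layout described above (the exact field names in `results.jsonl` may differ):

```python
import json
import pandas as pd

# Per-query records: one JSON object per line.
df = pd.read_json("outputs/results.jsonl", lines=True)
print(df.head())

# Aggregate accuracy and per-tier breakdown.
with open("outputs/summary.json") as f:
    print(json.load(f))
```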

## 📚 Citation

If you use this benchmark in your research, please cite our MobiCom '26 paper:

```bibtex
@inproceedings{iotbrain2026,
  title={IoT-Brain: Grounding LLMs for Semantic-Spatial Sensor Scheduling},
  author={Anonymous Author(s)},
  booktitle={Proceedings of the 32nd Annual International Conference on Mobile Computing and Networking (MobiCom '26)},
  year={2026}
}
```
15 changes: 15 additions & 0 deletions benchmarks/toposense_bench/Why.md
@@ -0,0 +1,15 @@
# Why TopoSense-Bench?

## The Problem: The Semantic-Physical Mapping Gap
Modern IoT systems are transitioning from passive monitoring to intent-driven operation. However, a critical gap exists between high-level human intent (e.g., *"Find my backpack lost between the library and the gym"*) and the precise physical sensor actions required to fulfill it.

Existing benchmarks often focus on pure QA or code generation, overlooking the **embodied** and **spatial** reasoning capabilities required for real-world cyber-physical systems.

## The Solution: Semantic-Spatial Sensor Scheduling (S³)
TopoSense-Bench introduces the S³ challenge, requiring LLMs to:
1. **Reason Spatially**: Understand complex topological relationships (connectivity, floor transitions) in a large-scale digital twin.
2. **Act Proactively**: Select the optimal subset of sensors from a massive network (2,510 cameras) to satisfy a query, rather than just answering a text question.
3. **Ground in Reality**: Map vague natural language to concrete sensor identifiers (e.g., `teaching_building_1_camera_03`).

## Impact
By mastering this benchmark, LLMs demonstrate the capability to serve as the "brain" for large-scale smart city and smart campus infrastructures, moving beyond chatbots to actionable physical agents.
5 changes: 5 additions & 0 deletions benchmarks/toposense_bench/env.toml.example
@@ -0,0 +1,5 @@
[llm]

OPENAI_API_KEY = "your_key_here"

OPENAI_API_BASE = "your_url_here"
16 changes: 16 additions & 0 deletions benchmarks/toposense_bench/install.sh
@@ -0,0 +1,16 @@
#!/bin/bash

# Create virtual environment
python3 -m venv .venv

# Activate virtual environment
source .venv/bin/activate

# Upgrade pip
pip install --upgrade pip

# Install requirements
pip install -r requirements.txt

echo "✅ Installation complete. Virtual environment created in .venv/"
echo "👉 To activate: source .venv/bin/activate"
11 changes: 11 additions & 0 deletions benchmarks/toposense_bench/requirements.txt
@@ -0,0 +1,11 @@
# Hugging Face Ecosystem
datasets>=2.14.0
huggingface_hub>=0.16.0

# Data Processing & Utilities
pandas>=1.5.0
tqdm
loguru

# Configuration parsing (for compatibility with older Python versions)
tomli>=2.0.1; python_version < "3.11"
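
The environment marker on `tomli` exists because `tomllib` only entered the standard library in Python 3.11. A likely loading pattern this supports — the actual parsing code is not shown here, so treat this as an assumption:

```python
# Compatibility shim: stdlib tomllib on 3.11+, the tomli backport otherwise.
try:
    import tomllib
except ModuleNotFoundError:
    import tomli as tomllib

with open("env.toml", "rb") as f:
    config = tomllib.load(f)  # e.g. config["llm"]["OPENAI_API_KEY"]
```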
23 changes: 23 additions & 0 deletions benchmarks/toposense_bench/run.sh
@@ -0,0 +1,23 @@
#!/bin/bash
> **Review comment (Collaborator):** Can you follow the template to add an install.sh that is needed for our integration? Thanks.

# ==============================================================================
# TopoSense-Bench Execution Script
#
# Usage:
# ./run.sh [model_name]
#
# Examples:
# ./run.sh "gpt-4o" # Run with OpenAI GPT-4o (Default)
# ./run.sh "openai/deepseek-chat" # Run with DeepSeek (via OpenAI-compatible endpoint)
#
# Note: Ensure that API keys are correctly configured in 'env.toml'.
# ==============================================================================

# Set default model to "gpt-4o" if no argument is provided
MODEL_NAME=${1:-"gpt-4o"}

echo "🚀 Starting TopoSense-Bench evaluation..."
echo "🤖 Model: $MODEL_NAME"

# Run the main evaluation script
python src/main.py --model_name "$MODEL_NAME"
Empty file.
77 changes: 77 additions & 0 deletions benchmarks/toposense_bench/src/evaluator.py
@@ -0,0 +1,77 @@
"""Evaluator for TopoSense Benchmark."""

import re
import ast
from loguru import logger


class TopoSenseEvaluator:
"""Evaluator class for Semantic-Spatial Sensor Scheduling tasks."""

def __init__(self):
pass

def parse_node_info(self, text):
"""
Parses the Node string representation to extract the critical 'name' tag.

Input format example:
"Node(223, 307, Tags: {'man_made': 'surveillance', 'name': 'camera_1'})"

Args:
text (str): The raw ground truth string from the dataset.

Returns:
str: The extracted sensor name (e.g., "camera_1") or the original text if parsing fails.
"""
try:
# 1. Attempt to extract the Tags dictionary part using regex
tags_match = re.search(r"Tags:\s*(\{.*?\})", text)
if tags_match:
tags_str = tags_match.group(1)
# Safely evaluate the string as a Python dictionary
tags = ast.literal_eval(tags_str)
# Return the 'name' tag converted to lowercase
return tags.get('name', '').lower()

# 2. Fallback: If it's a pure ID format or regex fails, return normalized text
return text.strip().lower()
except Exception:
return text.strip().lower()

def eval(self, llm_response_json, ground_truth_str):
"""
Evaluate the LLM's response against the ground truth.

Args:
llm_response_json (dict): The JSON output from the LLM.
Expected format: {"answer": "...", "explanation": "..."}
ground_truth_str (str): The raw answer string from the dataset.

Returns:
dict: Evaluation result containing status, score, and parsed ground truth.
"""
# 1. Extract the core answer from the LLM response
llm_answer = str(llm_response_json.get("answer", "")).lower()

# 2. Parse the unique identifier (Target Name) from the Ground Truth
gt_target_name = self.parse_node_info(ground_truth_str)

# 3. Evaluation Logic
# Requirement: The LLM's answer must contain the core identifier of the GT.
# Example:
# GT: "fire_fighting_access_1_camera_1"
# LLM: "I suggest using fire_fighting_access_1_camera_1" -> Correct

# Normalize strings by replacing underscores and hyphens with spaces for robust matching
clean_llm = llm_answer.replace("_", " ").replace("-", " ")
clean_gt = gt_target_name.replace("_", " ").replace("-", " ")

# Perform containment check
is_correct = clean_gt in clean_llm

return {
"status": "correct" if is_correct else "incorrect",
"score": 1 if is_correct else 0,
"parsed_gt": gt_target_name
}
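
A quick usage sketch of the evaluator above; the `Node(...)` ground-truth string follows the format documented in `parse_node_info`:

```python
evaluator = TopoSenseEvaluator()
result = evaluator.eval(
    {"answer": "I suggest using fire_fighting_access_1_camera_1"},
    "Node(223, 307, Tags: {'man_made': 'surveillance', 'name': 'fire_fighting_access_1_camera_1'})",
)
print(result)
# {'status': 'correct', 'score': 1, 'parsed_gt': 'fire_fighting_access_1_camera_1'}
```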