Add TopoSense-Bench: A Semantic-Spatial Sensor Scheduling Benchmark #38
Collaborator
There’s also ongoing work in another PR to add a …
@@ -0,0 +1,90 @@
# TopoSense-Bench: Semantic-Spatial Sensor Scheduling

**TopoSense-Bench** is a large-scale, rigorous benchmark designed to evaluate Large Language Models (LLMs) on the **Semantic-Spatial Sensor Scheduling (S³)** problem.

Originating from the **ACM MobiCom '26** paper *"IoT-Brain: Grounding LLMs for Semantic-Spatial Sensor Scheduling"*, this benchmark tests an agent's ability to translate high-level natural-language user intents (e.g., *"Find my backpack lost between the library and the gym"*) into precise physical sensor activation plans within a large-scale digital twin.

## 📊 Overview

- **Source**: Hosted on [Hugging Face](https://huggingface.co/datasets/IoT-Brain/TopoSense-Bench) and seamlessly integrated via the `datasets` library (see the loading sketch below).
- **Scale**:
  - **5,250** natural-language queries.
  - **2,510** sensors (cameras).
  - **161** floor plans across **33** buildings.
- **Problem Domain**: Embodied AI, IoT, spatial reasoning, and Retrieval-Augmented Generation (RAG).
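As a quick sanity check, the snippet below pulls the dataset straight from the Hub with the `datasets` library. Only the dataset path is taken from this README; the split layout and column names are whatever the hosted dataset defines, so the sketch just prints the structure.

```python
# Minimal loading sketch -- the dataset path comes from the link above; split
# and column names are not documented here, so we only inspect the structure.
from datasets import load_dataset

dataset = load_dataset("IoT-Brain/TopoSense-Bench")
print(dataset)  # shows the available splits, features, and row counts
```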
## 🎯 Task Taxonomy

The benchmark categorizes queries into three tiers of complexity based on the spatial scope and reasoning difficulty:

- **Tier 1: Intra-Zone Perception**
  - Simple queries focused on specific rooms or focal areas (e.g., *"Check the entrance of the conference hall"*).
- **Tier 2: Intra-Building Coordination**
  - Complex queries requiring navigation across multiple floors within a single building (e.g., *"Track the path from the 4th-floor lab to the ground floor exit"*).
- **Tier 3: Inter-Building Coordination**
  - Long-horizon queries involving transitions between outdoor spaces and multiple buildings (e.g., *"I walked from the Library to the Gym, check cameras along the way"*).
## ⚙️ Evaluation Methodology

Unlike standard QA benchmarks, TopoSense-Bench employs a **Retrieval-Augmented Generation (RAG)** workflow to simulate realistic sensor scheduling (an illustrative sketch of this loop follows the steps below):

1. **Context Retrieval**: The system dynamically retrieves the relevant topological map data (a textual representation of buildings/floors) for the user's query using a heuristic `TopologyManager`.
2. **Reasoning**: The LLM acts as a scheduler. It must analyze the provided map data and the user's intent to identify the specific sensor node ID that best satisfies the request.
3. **Scoring**: The evaluation uses a parsing-based exact-match metric. It compares the core identifier in the LLM's output against the ground-truth sensor ID (e.g., `teaching_building_1_camera_03`).
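For illustration only, here is a minimal sketch of this retrieve-reason-score loop. It is not the repository's `src/main.py`; the prompt wording and the placeholder `retrieve_topology` helper are assumptions made for the example, and only the `litellm` call and the sensor-ID check reflect the workflow described above.

```python
# Illustrative sketch of the workflow above, not the benchmark's actual code.
# The prompt text and the placeholder retrieve_topology() are assumptions.
from litellm import completion

def retrieve_topology(query: str) -> str:
    """Stand-in for the heuristic TopologyManager retrieval step."""
    return "teaching_building_1, floor 1: camera_01, camera_02, camera_03"

def schedule_sensor(query: str, model: str = "gpt-4o") -> str:
    context = retrieve_topology(query)
    prompt = (
        "You are a sensor scheduler. Using the topology below, reply with JSON "
        'of the form {"answer": "<sensor_id>", "explanation": "..."}.\n\n'
        f"Topology:\n{context}\n\nQuery: {query}"
    )
    response = completion(model=model, messages=[{"role": "user", "content": prompt}])
    # Scoring later checks whether the ground-truth sensor ID appears in this text.
    return response.choices[0].message.content

print(schedule_sensor("Check the entrance of the teaching building"))
```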
## 🚀 Quick Start

### 1. Installation

Ensure you are in the `benchmarks/toposense_bench` directory, then install the required dependencies:

```bash
pip install -r requirements.txt
```
### 2. Configuration

Create or edit `env.toml` to configure your LLM provider. This benchmark uses `litellm` for model calls.

```toml
[llm]
# Example for OpenAI
OPENAI_API_KEY = "sk-..."

# Example for DeepSeek (OpenAI-Compatible)
# OPENAI_API_KEY = "sk-..."
# OPENAI_API_BASE = "https://api.deepseek.com"
```
### 3. Run Evaluation

Run the evaluation script, specifying the model name (it defaults to `gpt-4o` if omitted).

> **Note**: If you use a non-OpenAI provider (such as DeepSeek or Qwen) via an OpenAI-compatible endpoint, add the `openai/` prefix to the model name.

```bash
# Run with GPT-4o
bash run.sh "gpt-4o"

# Run with DeepSeek-Chat
bash run.sh "openai/deepseek-chat"
```
### 4. Results

After the run completes, results are saved in the `outputs/` directory (a small inspection snippet follows this list):
- `summary.json`: Overall accuracy and a breakdown by task tier.
- `results.jsonl`: Detailed logs including retrieval status, model input/output, and correctness for every query.
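The snippet below simply pretty-prints these two files; their exact field names are not documented in this README, so nothing beyond the file paths is assumed.

```python
# Inspect the run outputs without assuming any particular schema.
import json

with open("outputs/summary.json") as f:
    print(json.dumps(json.load(f), indent=2))  # overall accuracy + per-tier breakdown

with open("outputs/results.jsonl") as f:
    first = json.loads(next(f))                # one evaluated query per line
    print(sorted(first.keys()))                # discover the logged fields
```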
## 📚 Citation

If you use this benchmark in your research, please cite our MobiCom '26 paper:

```bibtex
@inproceedings{iotbrain2026,
  title={IoT-Brain: Grounding LLMs for Semantic-Spatial Sensor Scheduling},
  author={Anonymous Author(s)},
  booktitle={Proceedings of the 32nd Annual International Conference on Mobile Computing and Networking (MobiCom '26)},
  year={2026}
}
```
@@ -0,0 +1,15 @@
# Why TopoSense-Bench?

## The Problem: The Semantic-Physical Mapping Gap
Modern IoT systems are transitioning from passive monitoring to intent-driven operation. However, a critical gap exists between high-level human intent (e.g., *"Find my backpack lost between the library and the gym"*) and the precise physical sensor actions required to fulfill it.

Existing benchmarks often focus on pure QA or code generation, overlooking the **embodied** and **spatial** reasoning capabilities required for real-world cyber-physical systems.

## The Solution: Semantic-Spatial Sensor Scheduling (S³)
TopoSense-Bench introduces the S³ challenge, requiring LLMs to:
1. **Reason Spatially**: Understand complex topological relationships (connectivity, floor transitions) in a large-scale digital twin.
2. **Act Proactively**: Select the optimal subset of sensors from a massive network (2,510 cameras) to satisfy a query, rather than just answering a text question.
3. **Ground in Reality**: Map vague natural language to concrete sensor identifiers (e.g., `teaching_building_1_camera_03`).

## Impact
By mastering this benchmark, LLMs demonstrate the capability to serve as the "brain" for large-scale smart city and smart campus infrastructures, moving beyond chatbots to actionable physical agents.
| @@ -0,0 +1,5 @@ | ||
| [llm] | ||
|
|
||
| OPENAI_API_KEY = "your_key_here" | ||
|
|
||
| OPENAI_API_BASE = "your_url_here" |
@@ -0,0 +1,16 @@
#!/bin/bash

# Create virtual environment
python3 -m venv .venv

# Activate virtual environment
source .venv/bin/activate

# Upgrade pip
pip install --upgrade pip

# Install requirements
pip install -r requirements.txt

echo "✅ Installation complete. Virtual environment created in .venv/"
echo "👉 To activate: source .venv/bin/activate"
@@ -0,0 +1,11 @@
# Hugging Face Ecosystem
datasets>=2.14.0
huggingface_hub>=0.16.0

# Data Processing & Utilities
pandas>=1.5.0
tqdm
loguru

# Configuration parsing (for compatibility with older Python versions)
tomli>=2.0.1; python_version < "3.11"
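The `tomli` marker above supports the usual stdlib fallback pattern sketched below; whether `src/main.py` loads `env.toml` exactly this way is an assumption, and the `[llm]` keys mirror the `env.toml` template in this PR.

```python
# Sketch of the tomllib/tomli fallback that the requirement above enables.
# The exact loading code in src/main.py is an assumption; the [llm] keys
# follow the env.toml template included in this PR.
import sys

if sys.version_info >= (3, 11):
    import tomllib                # stdlib TOML parser on Python 3.11+
else:
    import tomli as tomllib       # backport pinned in requirements.txt

with open("env.toml", "rb") as f:  # TOML parsers require binary mode
    config = tomllib.load(f)

api_key = config["llm"]["OPENAI_API_KEY"]
base_url = config["llm"].get("OPENAI_API_BASE")
```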
@@ -0,0 +1,23 @@
#!/bin/bash

Collaborator
Can you follow the template to add an install.sh, which is needed for our integration? Thanks.
# ==============================================================================
# TopoSense-Bench Execution Script
#
# Usage:
#   ./run.sh [model_name]
#
# Examples:
#   ./run.sh "gpt-4o"                 # Run with OpenAI GPT-4o (default)
#   ./run.sh "openai/deepseek-chat"   # Run with DeepSeek (via OpenAI-compatible endpoint)
#
# Note: Ensure that API keys are correctly configured in 'env.toml'.
# ==============================================================================

# Set default model to "gpt-4o" if no argument is provided
MODEL_NAME=${1:-"gpt-4o"}

echo "🚀 Starting TopoSense-Bench evaluation..."
echo "🤖 Model: $MODEL_NAME"

# Run the main evaluation script
python src/main.py --model_name "$MODEL_NAME"
@@ -0,0 +1,77 @@
"""Evaluator for TopoSense Benchmark."""

import re
import ast
from loguru import logger


class TopoSenseEvaluator:
    """Evaluator class for Semantic-Spatial Sensor Scheduling tasks."""

    def __init__(self):
        pass

    def parse_node_info(self, text):
        """
        Parses the Node string representation to extract the critical 'name' tag.

        Input format example:
            "Node(223, 307, Tags: {'man_made': 'surveillance', 'name': 'camera_1'})"

        Args:
            text (str): The raw ground truth string from the dataset.

        Returns:
            str: The extracted sensor name (e.g., "camera_1") or the original text if parsing fails.
        """
        try:
            # 1. Attempt to extract the Tags dictionary part using regex
            tags_match = re.search(r"Tags:\s*(\{.*?\})", text)
            if tags_match:
                tags_str = tags_match.group(1)
                # Safely evaluate the string as a Python dictionary
                tags = ast.literal_eval(tags_str)
                # Return the 'name' tag converted to lowercase
                return tags.get('name', '').lower()

            # 2. Fallback: If it's a pure ID format or regex fails, return normalized text
            return text.strip().lower()
        except Exception:
            return text.strip().lower()

    def eval(self, llm_response_json, ground_truth_str):
        """
        Evaluate the LLM's response against the ground truth.

        Args:
            llm_response_json (dict): The JSON output from the LLM.
                Expected format: {"answer": "...", "explanation": "..."}
            ground_truth_str (str): The raw answer string from the dataset.

        Returns:
            dict: Evaluation result containing status, score, and parsed ground truth.
        """
        # 1. Extract the core answer from the LLM response
        llm_answer = str(llm_response_json.get("answer", "")).lower()

        # 2. Parse the unique identifier (Target Name) from the Ground Truth
        gt_target_name = self.parse_node_info(ground_truth_str)

        # 3. Evaluation Logic
        # Requirement: The LLM's answer must contain the core identifier of the GT.
        # Example:
        #   GT:  "fire_fighting_access_1_camera_1"
        #   LLM: "I suggest using fire_fighting_access_1_camera_1" -> Correct

        # Normalize strings by replacing underscores and hyphens with spaces for robust matching
        clean_llm = llm_answer.replace("_", " ").replace("-", " ")
        clean_gt = gt_target_name.replace("_", " ").replace("-", " ")

        # Perform containment check
        is_correct = clean_gt in clean_llm

        return {
            "status": "correct" if is_correct else "incorrect",
            "score": 1 if is_correct else 0,
            "parsed_gt": gt_target_name,
        }
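A hypothetical usage example follows; the import path for `TopoSenseEvaluator` is not shown in this PR, so the class is assumed to be in scope, and the inputs mirror the formats documented in the docstrings above.

```python
# Hypothetical usage; inputs follow the formats shown in the docstrings above.
evaluator = TopoSenseEvaluator()

result = evaluator.eval(
    llm_response_json={
        "answer": "I suggest using fire_fighting_access_1_camera_1",
        "explanation": "It covers the corridor between the two buildings.",
    },
    ground_truth_str=(
        "Node(223, 307, Tags: {'man_made': 'surveillance', "
        "'name': 'fire_fighting_access_1_camera_1'})"
    ),
)
print(result)  # {'status': 'correct', 'score': 1, 'parsed_gt': 'fire_fighting_access_1_camera_1'}
```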
Could you add an entry for the benchmark to the root project README?