This is the official repository for the ACL 2025 (Findings) paper "Exploring Multi-Modal Data with Tool-Augmented LLM Agents for Precise Causal Discovery". The paper proposes MATMCD (Multi-Agent system with Tool-augmented LLMs for Multi-modal enhancement of Causal Discovery), a novel framework designed to improve causal discovery by integrating multi-modal data using tool-augmented large language model (LLM) agents.
Traditional causal discovery methods rely solely on statistical correlations in observational data, overlooking valuable semantic cues from external sources. MATMCD addresses this gap with a multi-agent system. It supports modular integration with statistical causal discovery (SCD) algorithms (e.g., PC, ES, DirectLiNGAM) and enables enhanced reasoning by combining symbolic causal graphs with unstructured textual data.
The MATMCD architecture, illustrated in Figure 1, consists of the following key components.
- Causal Graph Estimator: generates an initial causal graph by calling an SCD algorithm.
- Data Augmentation Agent (DA-Agent): retrieves and summarizes semantic context (e.g., from web or log data) using search tools and LLMs.
- Causal Constraint Agent (CC-Agent): integrates the augmented data with the initial causal graph to verify or refute causal links using a reasoning pipeline.
- Causal Graph Refiner: reconstructs the final causal graph by combining LLM-inferred constraints with an SCD algorithm.
Figure 1: An illustration of the MATMCD framework: (a) an overview of the framework, (b) the inner workings of DA-Agent, and (c) the inner workings of CC-Agent.
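The four components listed above can be read as a simple pipeline. The following is a minimal sketch under stated assumptions, not the repository's implementation: every function name here (`estimate_graph`, `augment`, `derive_constraints`, `refine_graph`, `chain_scd`) is hypothetical, the SCD algorithm is replaced by a trivial stub, and both agents are replaced by deterministic placeholders instead of LLM and search-tool calls.

```python
from typing import Callable, Dict, List, Set, Tuple

Edge = Tuple[str, str]  # directed edge (cause, effect)
SCD = Callable[[List[str]], Set[Edge]]  # pluggable SCD algorithm


def estimate_graph(variables: List[str], scd: SCD) -> Set[Edge]:
    """Causal Graph Estimator: delegate to the chosen SCD algorithm."""
    return scd(variables)


def augment(variables: List[str], knowledge: Dict[str, str]) -> Dict[str, str]:
    """DA-Agent stand-in: look up a text description for each variable.

    A real agent would call search tools and an LLM; here we read a dict.
    """
    return {v: knowledge.get(v, "") for v in variables}


def derive_constraints(graph: Set[Edge], context: Dict[str, str]) -> Set[Edge]:
    """CC-Agent stand-in: return the edges judged implausible.

    Placeholder rule (an assumption for illustration): refute an edge when
    the description of its effect contains the word 'exogenous'.
    """
    return {(a, b) for (a, b) in graph if "exogenous" in context.get(b, "")}


def refine_graph(variables: List[str], scd: SCD, forbidden: Set[Edge]) -> Set[Edge]:
    """Causal Graph Refiner stand-in: rerun SCD and drop the refuted edges.

    A real refiner would pass the constraints into the SCD algorithm itself;
    filtering afterwards is a simplification.
    """
    return {edge for edge in scd(variables) if edge not in forbidden}


def chain_scd(variables: List[str]) -> Set[Edge]:
    """Stub SCD (NOT PC, ES, or DirectLiNGAM): chain variables in order."""
    return set(zip(variables, variables[1:]))


def run_pipeline(variables: List[str], scd: SCD, knowledge: Dict[str, str]) -> Set[Edge]:
    initial = estimate_graph(variables, scd)
    context = augment(variables, knowledge)
    forbidden = derive_constraints(initial, context)
    return refine_graph(variables, scd, forbidden)


# Toy inputs: the variable names and the description are made up.
variables = ["x1", "x2", "x3"]
knowledge = {"x2": "exogenous in this toy example"}
final_graph = run_pipeline(variables, chain_scd, knowledge)
print(sorted(final_graph))
```

Because the SCD step is passed in as a plain callable, swapping it for another algorithm only changes the argument to `run_pipeline`, which mirrors the modular integration described above.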
- Multi-modal data: integrates time series data, metadata, web documents, and logs to enrich semantic context for causal discovery.
- LLM reasoning: employs tool-augmented LLMs to reason over causal structures using external knowledge and contextual cues.
- Modular design: features a modular architecture that allows easy swapping of LLMs and SCD algorithms for flexible adaptation.
- Clone the Repository
  ```bash
  git clone https://github.com/your_username/MATMCD.git
  cd MATMCD
  ```
- Set Up the Environment
  - We recommend using `conda` or `virtualenv` to create an isolated environment.
    ```bash
    python3 -m venv venv
    source venv/bin/activate  # or .\venv\Scripts\activate on Windows
    pip install -r requirements.txt
    ```
- Configure API Keys
  - Add your API keys to the `config.py` file.
- Download the datasets
  - The original data can be downloaded from the AutoMPG, DWD Climate, Sachs, Asia, and Child sources; the LEMMA_RCA datasets are available from LEMMA-RCA.
  - The CSV format of the AutoMPG, DWD Climate, and Sachs datasets can be downloaded from here. The Asia and Child datasets can be converted to CSV format via the script `data/SampleFromBIF.py`.
  - Place the data in the `data` folder.
- Run the Application
  - Make sure the environment, API keys, and datasets are set up correctly.
  - Run `python GTdatasets_experiment.py` to start.
- Run Experiments and Evaluate
  - Run benchmark experiments on standard datasets:
    ```bash
    python GTdatasets_experiment.py
    ```
  - For root cause analysis on microservice datasets:
    ```bash
    python RCA_experiment.py
    ```
  - Results will be saved in the `results/` folder.
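Before launching an experiment, it can help to confirm that the setup steps above were completed. The following is a minimal sketch of such a check, not a script shipped with the repository: the function name `preflight` is hypothetical, and it only assumes what the steps state, namely a `config.py` file in the project root and CSV files in the `data` folder. It also assumes each CSV starts with a header row. It does not verify that the API keys themselves are valid.

```python
import csv
from pathlib import Path
from typing import List


def preflight(root: str = ".") -> List[str]:
    """Return a list of setup problems; an empty list means nothing was flagged."""
    base = Path(root)
    problems = []

    # Step "Configure API Keys": the keys are expected to live in config.py.
    if not (base / "config.py").is_file():
        problems.append("config.py not found")

    # Step "Download the datasets": CSV files are expected in the data folder.
    csv_files = sorted((base / "data").glob("*.csv"))
    if not csv_files:
        problems.append("no CSV files found in data/")

    # Assumption: the first row of each CSV is a header with column names.
    for path in csv_files:
        with path.open(newline="") as f:
            header = next(csv.reader(f), [])
        if not header:
            problems.append(f"{path.name} has no header row")

    return problems


if __name__ == "__main__":
    issues = preflight()
    print("Setup looks complete." if not issues else "\n".join(issues))
```

Run it from the repository root; any message it prints points back to the corresponding setup step.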
MATMCD is evaluated on:
- Benchmark Datasets: AutoMPG, DWDClimate, SachsProtein, Asia, and Child — covering both time-series and sequence data.
- AIOps Datasets: Product Review and Cloud Computing — large-scale multivariate time series with event logs.
Key results:
- Up to 66.7% reduction of causal inference errors (in terms of NHD) over baseline methods.
- Up to 83.3% improvement in root cause locating accuracy (in terms of MAP@10).
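For readers unfamiliar with the first metric, the sketch below assumes that NHD denotes the normalized Hamming distance in its common form: the fraction of the m × m adjacency-matrix entries on which the predicted and ground-truth graphs disagree (lower is better). This README does not define the metric, so treat the formula as an assumption and check the paper for the exact variant used. The function name and the toy matrices are illustrative only.

```python
from typing import List

Matrix = List[List[int]]  # adjacency matrix: entry [i][j] == 1 means i -> j


def normalized_hamming_distance(pred: Matrix, truth: Matrix) -> float:
    """Fraction of ordered node pairs where the two graphs disagree."""
    m = len(truth)
    if m == 0:
        raise ValueError("graphs must have at least one node")
    if len(pred) != m or any(len(row) != m for row in pred + truth):
        raise ValueError("both graphs must be m x m adjacency matrices")
    mismatches = sum(
        pred[i][j] != truth[i][j] for i in range(m) for j in range(m)
    )
    return mismatches / (m * m)


# Toy 3-node example: the prediction adds one wrong edge and misses one.
truth = [[0, 1, 0],
         [0, 0, 1],
         [0, 0, 0]]
pred = [[0, 1, 1],
        [0, 0, 0],
        [0, 0, 0]]
nhd = normalized_hamming_distance(pred, truth)
print(f"NHD = {nhd:.3f}")  # 2 of the 9 entries differ
```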
```bibtex
@inproceedings{shen2025MATMCD,
  title={Exploring Multi-Modal Data with Tool-Augmented LLM Agents for Precise Causal Discovery},
  author={Shen, ChengAo and Chen, Zhengzhang and Luo, Dongsheng and Xu, Dongkuan and Chen, Haifeng and Ni, Jingchao},
  booktitle={ACL (Findings)},
  year={2025}
}
```
If you have any questions or concerns, please contact us at cshen9 [at] uh [dot] edu or submit an issue.



