This project implements a sophisticated Multi-Agent System (MAS) architecture featuring a Research Coordinator that orchestrates specialized AI agents to analyze academic research. The system transforms academic papers into a searchable, structured, and semantically rich knowledge base through intelligent agent coordination powered by *LangGraph* state management and Command-based routing.
- Intelligent Agent Orchestration: Research Coordinator uses LangGraph StateGraph with Command-based routing to dynamically delegate queries to specialized agents
- Multi-Database Architecture: Integrates Neo4j (graph), MongoDB (documents), and ChromaDB (vectors) for comprehensive data storage
- Automated PDF Processing: Advanced ingestion pipeline with entity extraction, topic modeling, and metadata enrichment using OpenAI GPT-4
- Semantic Search & Retrieval: Hybrid search combining vector similarity, graph traversal, and document analysis
- Real-time Research Analysis: Dynamic routing between relationship analysis and thematic analysis based on query classification
- Professional CLI Interface: Comprehensive command-line management with health checks, logging, and testing capabilities
Research Coordinator (LangGraph StateGraph)
↓
Query Classification Node
↓
┌─────────────┬─────────────┬─────────────┐
│ Greeting │ Simple │ Research │
│ Handler │ Question │ Query │
└─────────────┴─────────────┴─────────────┘
↓
Planning Node
↓
┌─────────────┬─────────────┐
│Relationship │ Theme │
│ Analyst │ Analyst │
│ Node │ Node │
└─────────────┴─────────────┘
↓
Synthesis Node
↓
Final Response
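The routing flow above can be sketched in plain Python. This is an illustrative stand-in, not the project's actual LangGraph code: the function names and classification heuristics are hypothetical, and the real coordinator returns LangGraph `Command` objects rather than node lists.

```python
# Hypothetical sketch of the diagram's routing logic; the real system
# implements this as LangGraph nodes with Command-based transitions.

def classify_query(query: str) -> str:
    """Toy stand-in for the Query Classification Node."""
    q = query.lower().strip()
    if q in ("hi", "hello", "hey"):
        return "GREETING"
    if "?" in q and len(q.split()) <= 5:
        return "SIMPLE_QUESTION"
    return "RESEARCH_QUERY"

def route(query: str) -> list[str]:
    """Return the sequence of nodes a query would visit."""
    query_type = classify_query(query)
    if query_type == "GREETING":
        return ["greeting_handler"]
    if query_type == "SIMPLE_QUESTION":
        return ["simple_question"]
    # Research queries go through planning, both specialists, then synthesis.
    return ["planning", "relationship_analyst", "theme_analyst", "synthesis"]
```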
- 🎯 Research Coordinator: Central supervisor using LangGraph Commands for intelligent query classification and agent delegation
- 🔗 Relationship Analyst: Maps connections between papers, authors, concepts, and research lineages using Neo4j graph queries
- 📊 Theme Analyst: Identifies patterns, topics, and trends across research literature using MongoDB document analysis
- 🏷️ Entity Extraction: Automated identification of key concepts, methodologies, and research entities via LLM processing
- 📈 Topic Modeling: Latent theme discovery and research domain classification with weighted term extraction
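To make "weighted term extraction" concrete: each paper's topics are stored as `{term, weight}` pairs. The real pipeline derives terms via GPT-4; this toy sketch uses normalized frequency counts only to illustrate the output shape (stopword list and function name are assumptions).

```python
# Toy illustration of weighted term extraction; the real pipeline uses
# GPT-4, but the stored {term, weight} shape resembles this output.
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "to", "in", "is", "for", "on"}

def extract_weighted_terms(text: str, max_terms: int = 5) -> list[dict]:
    words = [w.strip(".,").lower() for w in text.split()]
    counts = Counter(w for w in words if w and w not in STOPWORDS)
    total = sum(counts.values()) or 1
    # Normalize counts into weights so they sum to at most 1.0.
    return [
        {"term": term, "weight": round(count / total, 3)}
        for term, count in counts.most_common(max_terms)
    ]
```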
PDF Ingestion ➜ Entity Extraction ➜ Topic Modeling ➜ Graph Construction ➜ Vector Embedding ➜ Agent Analysis ➜ User Interaction
The system features a sophisticated multi-stage ingestion pipeline that processes academic PDFs:
- 📄 PDF Text Extraction: Uses PyMuPDF for robust text and metadata extraction
- 🧠 LLM-Powered Analysis: OpenAI GPT-4 extracts entities, relationships, and topics
- 🔗 Knowledge Graph Construction: Builds Neo4j nodes and relationships for papers, authors, and concepts
- 📊 Topic Modeling: Discovers research themes and categorizes content in MongoDB
- 🎯 Vector Embeddings: Creates semantic embeddings for similarity search in ChromaDB
- ✅ Quality Validation: Tests data integrity across all databases
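The six stages above can be pictured as a chain of functions, each passing an enriched document to the next. This is a hedged sketch of the data handoff only; the stage stubs and signatures are hypothetical, and the real implementation lives in `src/utils/ingestion_pipeline.py`.

```python
# Sketch of the ingestion stages as a function chain; each stub marks
# where the real pipeline calls PyMuPDF, GPT-4, Neo4j, or ChromaDB.

def extract_text(pdf_path: str) -> dict:
    return {"source": pdf_path, "pages": []}      # PyMuPDF in the real pipeline

def extract_entities(doc: dict) -> dict:
    return {**doc, "entities": []}                # GPT-4 entity extraction

def model_topics(doc: dict) -> dict:
    return {**doc, "topics": []}                  # stored in MongoDB

def build_graph(doc: dict) -> dict:
    return {**doc, "graph_written": True}         # Neo4j nodes/relationships

def embed_chunks(doc: dict) -> dict:
    return {**doc, "embedded": True}              # ChromaDB embeddings

def validate(doc: dict) -> dict:
    assert doc["graph_written"] and doc["embedded"]   # cross-database check
    return doc

def ingest(pdf_path: str) -> dict:
    doc = extract_text(pdf_path)
    for stage in (extract_entities, model_topics, build_graph, embed_chunks, validate):
        doc = stage(doc)
    return doc
```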
# Run complete ingestion pipeline
python src/utils/ingestion_pipeline.py
# Test with a single PDF
python src/utils/ingestion_pipeline.py --test

supervisor-multi-agent-system/
├── src/
│ ├── main.py # FastAPI application entry point
│ ├── api/v1/endpoints/ # API endpoints
│ │ ├── status.py # Health check endpoints
│ │ └── agent.py # Main agent interaction endpoint
│ ├── domain/
│ │ ├── agents/ # Specialized AI agents
│ │ │ ├── research_coordinator.py # LangGraph orchestration agent
│ │ │ ├── relationship_analyst.py # Neo4j graph analysis
│ │ │ └── theme_analyst.py # MongoDB topic analysis
│ ├── databases/ # Database configurations
│ │ ├── graph/ # Neo4j configuration
│ │ ├── document/ # MongoDB configuration
│ │ └── vector/ # ChromaDB configuration
│ ├── services/ # Database service layers
│ │ ├── graph_service.py # Neo4j operations
│ │ ├── document_service.py # MongoDB operations
│ │ └── vector_service.py # ChromaDB operations
│ └── utils/ # Utilities and tools
│ ├── ingestion_pipeline.py # Comprehensive PDF processing
│ ├── model_init.py # LLM initialization
│ └── agent_wrapper.py # Agent response utilities
├── cli.py # Professional CLI interface
├── docker-compose.yml # Multi-service orchestration
├── requirements.txt # Python dependencies
└── sources/ # PDF documents for ingestion
| Component | Technology | Purpose |
|---|---|---|
| Agent Framework | LangGraph + LangChain | Modern state-based multi-agent orchestration |
| LLM Integration | OpenAI GPT-4 | Entity extraction, topic modeling, and analysis |
| API Framework | FastAPI | High-performance web API with automatic documentation |
| Graph Database | Neo4j | Knowledge graph for entity relationships |
| Document Database | MongoDB | Structured document storage and topic modeling |
| Vector Database | ChromaDB | Semantic search and similarity matching |
| Containerization | Docker + Docker Compose | Consistent deployment and scaling |
| CLI Interface | Click | Professional command-line management |
- Python: 3.11+
- Docker: Latest version with Docker Compose
- OpenAI API Key: Required for LLM operations
- System Requirements: 8GB RAM minimum, 16GB recommended
- Storage: 20GB minimum, 50GB recommended for large document collections
git clone https://github.com/Ricoledan/supervisor-multi-agent-system
cd supervisor-multi-agent-system
cp .env.defaults .env
# Edit .env and add your OPENAI_API_KEY

# Using the CLI (recommended)
python cli.py start
# Quick start with minimal health checks
python cli.py quick-start
# Start only databases for development
python cli.py start --databases-only

# Create sources directory and add PDFs
mkdir -p sources
# Copy your academic PDF files to sources/

# Quick test with clean output
python cli.py test --simple
# Detailed system test
python cli.py test --query "machine learning applications"
# Test specific functionality
curl -X POST "http://localhost:8000/api/v1/agent" \
-H "Content-Type: application/json" \
-d '{"query": "How do neural networks relate to computer vision?"}'

The system includes a comprehensive CLI for professional management:
python cli.py start # Start all services
python cli.py stop # Stop all services
python cli.py restart # Restart system
python cli.py status # Check service status

python cli.py test # Test system functionality
python cli.py test --simple # Clean, formatted output
python cli.py health # Run health checks
python cli.py health --detailed # Comprehensive health analysis
python cli.py logs # View system logs
python cli.py logs --follow # Follow logs in real-time

python cli.py start --databases-only # Start only databases
python cli.py restart --service neo4j # Restart specific service

{
"query": "What are the main approaches to transformer architectures in natural language processing?"
}
{
"query": "How do computer vision techniques connect to medical diagnosis research?"
}
{
"query": "What themes are emerging in climate change adaptation research over the past 5 years?"
}
{
"query": "Show me the research lineage and evolution of BERT language models"
}
{
"query": "How does reinforcement learning apply to robotics and autonomous systems?"
}

POST /api/v1/agent
curl -X POST "http://localhost:8000/api/v1/agent" \
-H "Content-Type: application/json" \
-d '{"query": "How do neural networks relate to computer vision?"}'

{
"status": "success",
"message": "# 🎯 Research Analysis Results\n\n**Query:** How do neural networks relate to computer vision?\n\n## 🔗 Relationship Analysis\n\nBased on the knowledge graph analysis...",
"query": "How do neural networks relate to computer vision?",
"query_type": "RESEARCH_QUERY",
"specialists_used": {
"relationship_analyst": true,
"theme_analyst": true
},
"system_health": {
"relationship_analyst": "✅ Active",
"theme_analyst": "✅ Active",
"database_usage": "✅ High",
"response_quality": "Database-driven"
}
}

| Endpoint | Method | Description |
|---|---|---|
| /api/v1/status | GET | System health check |
| /api/v1/agent | POST | Main research analysis endpoint |
| /api/v1/agent/detailed | POST | Full conversation state |
| /api/v1/agent/raw | POST | Debug endpoint with raw outputs |
| /api/v1/agent/health | GET | Agent system health check |
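The agent endpoint can also be called from Python instead of curl. A minimal stdlib-only client sketch, using the URL and payload shape shown in the curl examples (error handling and timeouts omitted):

```python
import json
import urllib.request

API_URL = "http://localhost:8000/api/v1/agent"  # default port from this setup

def build_request(query: str, url: str = API_URL) -> urllib.request.Request:
    """Build the JSON POST request the agent endpoint expects."""
    payload = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={"Content-Type": "application/json"},
    )

def ask_agent(query: str) -> dict:
    """Send a research query and return the parsed JSON response."""
    with urllib.request.urlopen(build_request(query)) as resp:
        return json.load(resp)
```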
// Nodes
(:Paper {id, title, year, source, research_field, methodology})
(:Author {name})
(:Concept {name, category, description})
// Relationships
(:Author)-[:AUTHORED]->(:Paper)
(:Paper)-[:CONTAINS]->(:Concept)
(:Concept)-[:RELATES_TO {type, description}]->(:Concept)

// papers collection
{
  paper_id: String,
  metadata: {
    title, authors, year, abstract, keywords,
    journal, doi, research_field, methodology
  },
  content: [{page, text}],
  entities: {concepts, relationships},
  processed_at: Date
}
// topics collection
{
  paper_id: String,
  category: String,
  terms: [{term, weight}],
  source: String,
  created_at: Date
}

# Collection: academic_papers
{
  documents: [text_chunks],
  embeddings: [vector_embeddings],
  metadatas: [{
    paper_id, page, source, title,
    authors, year, research_field,
    chunk_id, chunk_total
  }],
  ids: [unique_chunk_ids]
}

# Required
OPENAI_API_KEY=your_openai_api_key_here
# Database Configuration (defaults provided)
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=password
NEO4J_DB=neo4j
MONGODB_HOST=localhost
MONGODB_PORT=27017
MONGODB_USER=user
MONGODB_PASSWORD=password
MONGODB_DB=research_db
CHROMA_HOST=localhost
CHROMA_PORT=8001

# LLM Model Selection
OPENAI_MODEL=gpt-4 # or gpt-3.5-turbo for faster responses
# Ingestion Pipeline Settings
CHUNK_SIZE=1000 # Text chunk size for embeddings
CHUNK_OVERLAP=200 # Overlap between chunks
MAX_CONCEPTS=15 # Maximum concepts per paper
# Performance Tuning
NEO4J_POOL_SIZE=10
MONGODB_POOL_SIZE=10

analyze_research_relationships(): Queries Neo4j for entity connections
- Paper lineages and citation networks
- Author collaboration patterns
- Cross-disciplinary concept relationships
- Research influence patterns
analyze_research_themes(): Queries MongoDB for topic patterns
- Latent theme discovery across document collections
- Research trend identification and evolution
- Methodological approach analysis
- Domain-specific terminology extraction
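A pure-Python illustration of the kind of aggregation analyze_research_themes() performs. The input documents follow the topics-collection schema ({paper_id, category, terms: [{term, weight}], ...}); in the real system this aggregation runs against MongoDB rather than in-memory lists, and the function name here is an assumption.

```python
# Illustrative in-memory version of a theme aggregation: sum term
# weights per category across papers and keep the top-n terms.
from collections import defaultdict

def top_terms_by_category(topic_docs: list[dict], n: int = 3) -> dict:
    totals: dict = defaultdict(lambda: defaultdict(float))
    for doc in topic_docs:
        for t in doc["terms"]:
            totals[doc["category"]][t["term"]] += t["weight"]
    # Rank terms within each category by accumulated weight.
    return {
        category: sorted(weights, key=weights.get, reverse=True)[:n]
        for category, weights in totals.items()
    }
```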
- Query Response Time: 15-45 seconds (depends on database size and complexity)
- PDF Processing Speed: 2-3 minutes per paper (including all extractions)
- Concurrent Users: Supports 5-10 simultaneous research queries
- Database Storage: ~500MB per 100 research papers
- Minimum: 8GB RAM, 4 CPU cores, 20GB storage
- Recommended: 16GB RAM, 8 CPU cores, 50GB storage
- Production: 32GB RAM, 8+ CPU cores, 100GB+ storage
- Horizontal Scaling: Docker Compose replicas for API services
- Database Optimization: Connection pooling and memory tuning
- Caching: Redis integration for frequent queries (future enhancement)
# Check service status
python cli.py status
# View detailed logs
python cli.py logs --service neo4j
python cli.py logs --service mongodb
python cli.py logs --service chromadb
# Restart specific service
python cli.py restart --service neo4j

# Verify data ingestion completed
python cli.py test --query "machine learning"
# Check ingestion quality
python src/utils/ingestion_pipeline.py --test
# Re-run full ingestion if needed
python src/utils/ingestion_pipeline.py

# Increase timeout for complex queries
python cli.py test --timeout 120
# Check database performance
python cli.py health --detailed
# Monitor system resources
python cli.py logs --follow

# Verify sources directory exists
ls -la sources/
# Check PDF file permissions
python cli.py test --query "test"

- Create Agent File: src/domain/agents/new_specialist.py
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent
@tool
def analyze_custom_data(query: str) -> str:
    """Custom analysis tool"""
    # Your custom database queries here
    return analysis_result

specialist_agent = create_react_agent(
    model=model,
    tools=[analyze_custom_data],
    prompt=SYSTEM_PROMPT
)

- Update Coordinator: Add routing logic in research_coordinator.py
- Add API Endpoints: Update agent.py if needed
# Example: Custom Neo4j analysis
def analyze_author_networks(query: str):
    with driver.session() as session:
        result = session.run("""
            MATCH (a1:Author)-[:AUTHORED]->(p)<-[:AUTHORED]-(a2:Author)
            WHERE a1.name CONTAINS $query
            RETURN a1.name, a2.name, count(p) as collaborations
            ORDER BY collaborations DESC LIMIT 10
        """, query=query)
        return result.data()

# Start databases only for development
python cli.py start --databases-only
# Run API in development mode
python -m uvicorn src.main:app --reload --host 0.0.0.0 --port 8000
# Monitor logs in separate terminal
python cli.py logs --follow
# Test changes
python cli.py test --simple

After starting the system, access these interfaces:
- API Documentation: http://localhost:8000/docs
- API Status: http://localhost:8000/api/v1/status
- Neo4j Browser: http://localhost:7474 (neo4j/password)
- MongoDB Express: http://localhost:8081
- ChromaDB: http://localhost:8001
# Full system test with clean output
python cli.py test --simple
# Test with specific queries
python cli.py test --query "transformer models" --timeout 60
# Health check all components
python cli.py health --detailed
# Test ingestion pipeline
python src/utils/ingestion_pipeline.py --test
# API endpoint testing
curl -X GET "http://localhost:8000/api/v1/agent/health"

The system includes built-in quality checks:
- Data Integrity: Validates cross-database consistency
- Response Quality: Monitors agent specialist usage
- Performance Metrics: Tracks query response times
- Database Health: Monitors connection status and query performance
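The data-integrity check can be pictured as a paper_id reconciliation across the three stores. A minimal sketch under the assumption that each store can report its set of ingested paper_ids; the function name is hypothetical and the real validation lives in the ingestion pipeline's quality step.

```python
# Cross-database consistency sketch: a paper is consistent only if its
# id appears in Neo4j, MongoDB, and ChromaDB alike.
def find_inconsistent_papers(graph_ids: set, doc_ids: set, vector_ids: set) -> set:
    all_ids = graph_ids | doc_ids | vector_ids
    present_everywhere = graph_ids & doc_ids & vector_ids
    return all_ids - present_everywhere
```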
- Primary: PDF research papers with text content
- Secondary: Text files (.txt, .md) for preprocessing
- Future: DOI-based ingestion, arXiv API integration
- Academic conferences (NeurIPS, ICML, ACL, ICLR, etc.)
- Journal articles from major publishers (IEEE, ACM, Springer, Elsevier)
- Preprint servers (arXiv, bioRxiv, medRxiv)
- Technical reports and white papers
- LangGraph Multi-Agent Documentation
- Neo4j Graph Database Documentation
- ChromaDB Vector Database Documentation
- FastAPI Framework Documentation
- OpenAI API Documentation
- Docker Compose Documentation
- Multi-Agent Systems
Contributions are welcome! Please read our contributing guidelines and submit pull requests for any improvements.
This project is licensed under the MIT License—see the LICENSE file for details.