Robaina/ProtScout

🎯 About

ProtScout is a Python package that enables ranking of protein sequences based on multiple properties predicted by state-of-the-art AI models. It provides a unified interface to assess and compare proteins using various characteristics such as stability, solubility, catalytic efficiency, and thermal properties.

✨ Features

  • 🧬 Comprehensive protein property analysis (structure, embeddings, catalytic activity, kinetic parameters, thermal stability, melting temperature, environmental tolerances, solubility, classical properties)
  • 🐳 Containerized execution of prediction tools with Docker
  • 🚀 Modular, parallel workflow with configurable steps
  • 🔄 Automatic retry and resume support for robust execution
  • ✅ Validation and dry-run modes to preview the workflow
  • 🔧 Fully configurable via YAML files and environment variable overrides
  • 📈 Detailed logging and resource monitoring

🚀 Installation

Prerequisites

  • Python 3.8 or higher
  • Docker (for running containerized prediction tools)
  • NVIDIA GPU with CUDA support (recommended)
  • Conda or Poetry for environment management

Install with Poetry (Recommended)

# Clone repository
git clone https://github.com/Robaina/ProtScout.git
cd ProtScout

# Install Poetry if you haven't already
pip install poetry


# Install package and dependencies
poetry install

# Activate the virtual environment
poetry shell

Install with pip

# Clone repository
git clone https://github.com/Robaina/ProtScout.git
cd ProtScout


# Install in development mode
pip install -e .

📋 Quick Start

Generate a configuration file:

protscout init -o my_config.yaml

Edit the configuration file with your paths and settings:

condition: ultra
workdir: /path/to/your/workdir
modeldir: /path/to/model/weights
python_executable: /path/to/conda/env/bin/python
memory: 100g
workers: 2
quiet: true
max_retries: 2
preserve_artifacts: false

Validate your setup:

protscout validate -c my_config.yaml

Run the workflow:

protscout run -c my_config.yaml

📖 Usage Examples

Basic Workflow

# Run complete workflow
protscout run -c config.yaml

# Run specific steps only
protscout run -c config.yaml -s clean_sequences -s esmfold

# Override condition from command line
protscout run -c config.yaml --condition ultra

Advanced Features

# Resume from last successful step after failure
protscout run -c config.yaml --resume

# Dry run to see what would be executed
protscout run -c config.yaml --dry-run

# Monitor logs in real-time
protscout logs logs/protscout_run_20240112_143022.log -f

Parallel Execution

The workflow automatically runs compatible steps in parallel:

  • ESMFold and ESM-2 run simultaneously
  • All prediction tools (CatPred, Catapro, Temberture, Temstapro, GeoPoc, GATSol) run in parallel
  • Result processing steps are parallelized
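
The grouping above can be pictured as a sequence of stages, where the steps inside one stage do not depend on each other. The following is a minimal sketch of that idea using only the Python standard library, not ProtScout's actual scheduler. `run_step`, `run_parallel`, `run_stages`, and `STAGES` are hypothetical names, and `run_step` is a stand-in for launching one containerized tool.

```python
from concurrent.futures import ThreadPoolExecutor


def run_step(name):
    """Hypothetical stand-in for launching one containerized tool."""
    return f"{name}: done"


def run_parallel(step_names, workers=2):
    """Run independent steps concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_step, step_names))


# Assumed grouping, taken from the list above: steps within a stage
# are independent, while stages run one after another.
STAGES = [
    ["esmfold", "esm2"],
    ["catpred", "catapro", "temberture", "temstapro", "geopoc", "gatsol"],
]


def run_stages(stages, workers=2):
    """Run each stage to completion before starting the next one."""
    results = []
    for stage in stages:
        results.extend(run_parallel(stage, workers))
    return results


if __name__ == "__main__":
    for line in run_stages(STAGES):
        print(line)
```

In this sketch the `workers` argument plays the role of the `workers` setting in the configuration file; whether ProtScout uses threads, processes, or container limits internally is not stated here.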

🔧 Configuration

ProtScout uses YAML configuration files for workflow management. Key configuration sections:

# Analysis condition
condition: ultra

# Core directories
workdir: /path/to/your/workdir
modeldir: /path/to/model/weights

# Execution settings
python_executable: /path/to/conda/env/bin/python
memory: 100g
workers: 2
quiet: true
max_retries: 2
preserve_artifacts: false

# Container images (optional overrides)
containers:
  esmfold:
    image: ghcr.io/new-atlantis-labs/esmfold:latest
    max_containers: 1
  # ... other containers: esm2, catpred, catapro, temberture, temstapro, geopoc, gatsol

# GPU and shared memory settings
resources:
  gpus: all
  shm_size: 100g

# Workflow steps
steps:
  - clean_sequences
  - esmfold
  - esm2
  - remove_sequences_without_pdb
  - prepare_catpred
  - catpred
  - catapro
  - temberture
  - temstapro
  - geopoc
  - gatsol
  - classical_properties
  - process_temberture
  - process_temstapro
  - process_geopoc
  - process_gatsol
  - process_catpred
  - process_catapro
  - consolidate_results

See configs/example_workflow.yaml for a complete example.
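
The feature list mentions environment variable overrides, but this README does not say how variables are named. The sketch below shows one way such an override layer could work for top-level keys. The `PROTSCOUT_` prefix and the `apply_env_overrides` helper are assumptions made for illustration, so check the ProtScout source for the real naming scheme.

```python
import os

# Hypothetical prefix; ProtScout's real variable names may differ.
ENV_PREFIX = "PROTSCOUT_"


def apply_env_overrides(config, environ=None):
    """Return a copy of config with top-level keys overridden from the environment.

    Each override is converted to the type of the value it replaces.
    """
    environ = os.environ if environ is None else environ
    merged = dict(config)
    for key, current in config.items():
        raw = environ.get(ENV_PREFIX + key.upper())
        if raw is None:
            continue
        # bool must be checked before int because bool is a subclass of int.
        if isinstance(current, bool):
            merged[key] = raw.strip().lower() in ("1", "true", "yes")
        elif isinstance(current, int):
            merged[key] = int(raw)
        else:
            merged[key] = raw
    return merged


if __name__ == "__main__":
    base = {"condition": "ultra", "workers": 2, "quiet": True, "memory": "100g"}
    print(apply_env_overrides(base, {"PROTSCOUT_WORKERS": "4"}))
```

Passing `environ` explicitly keeps the function easy to test; with no argument it reads `os.environ`.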

🛠️ Workflow Steps

  • clean_sequences - Clean and deduplicate input sequences
  • esmfold - Predict protein structures using ESMFold
  • esm2 - Generate protein embeddings using ESM-2
  • remove_sequences_without_pdb - Filter sequences without structures
  • prepare_catpred - Prepare inputs for catalytic prediction
  • catpred - Predict catalytic properties
  • catapro - Predict kinetic parameters (KM, Kcat, catalytic efficiency)
  • temberture - Predict temperature stability
  • temstapro - Predict melting temperature
  • geopoc - Predict environmental conditions (temp, pH, salt)
  • gatsol - Predict solubility
  • classical_properties - Calculate classical protein properties
  • process_* - Process results from each tool
  • consolidate_results - Create final output tables

📊 Output Structure

<artifacts_dir>/                   # raw outputs (artifacts)
├── structures/                    # PDB files from ESMFold
├── embeddings/                    # ESM-2 embeddings
├── clean_sequences/               # cleaned FASTA files
├── catpred_data/                  # prepared inputs for CatPred
├── catpred/                       # CatPred raw output
├── catapro/                       # Catapro kinetic predictions (KM, Kcat, efficiency)
├── temberture/                    # temperature stability predictions
├── temstapro/                     # melting temperature predictions
├── geopoc/                        # environmental predictions (temp, pH, salt)
└── gatsol/                        # solubility predictions

<results_dir>/                     # processed results
├── classical_properties_results/  # classical property outputs
├── temberture_results/            # processed temperature results
├── temstapro_results/             # processed melting temperature results
├── geopoc_results/                # processed environmental results
├── gatsol_results/                # processed solubility results
├── catpred_results/               # processed CatPred results
├── catapro_results/               # processed Catapro kinetic results
└── consolidated_results/          # final consolidated tables

🔄 Resume Capability

ProtScout automatically saves workflow state and can resume from failures:

# If workflow fails at step 'gatsol'
protscout run -c config.yaml --resume
# Workflow will skip completed steps and continue from 'gatsol'
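
To make the resume behaviour concrete, here is a minimal sketch of the general technique: record each completed step in a state file and skip recorded steps on the next run. It is not ProtScout's implementation. The JSON state file layout and the `run_with_resume` helper are assumptions made for illustration.

```python
import json
from pathlib import Path


def run_with_resume(steps, run_step, state_file, resume=False):
    """Run steps in order, recording each success so a later run can skip it.

    run_step is any callable taking a step name; it should raise on failure.
    """
    state_path = Path(state_file)
    completed = []
    if resume and state_path.exists():
        completed = json.loads(state_path.read_text())
    for step in steps:
        if step in completed:
            continue  # finished in an earlier run
        # If this raises, the state file still reflects the last success.
        run_step(step)
        completed.append(step)
        state_path.write_text(json.dumps(completed))
    return completed


if __name__ == "__main__":
    demo_steps = ["geopoc", "gatsol", "classical_properties"]
    run_with_resume(demo_steps, print, "workflow_state.json", resume=True)
```

Writing the state after every step, instead of once at the end, is what lets a rerun with `resume=True` continue from the step that failed.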

📝 Logging

Comprehensive logging with multiple levels:

  • Console output: INFO level (progress and important messages)
  • Log file: DEBUG level (detailed execution information)

Logs are saved to: {workdir}/logs/protscout_run_YYYYMMDD_HHMMSS.log
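
The two-level setup described above maps directly onto Python's standard `logging` module: one handler per destination, each with its own level. The sketch below illustrates that pattern and the timestamped file name. `setup_logging` and the logger name `protscout_sketch` are hypothetical and are not taken from the ProtScout code base.

```python
import logging
from datetime import datetime
from pathlib import Path


def setup_logging(workdir):
    """Configure INFO console output and a DEBUG log file under <workdir>/logs."""
    log_dir = Path(workdir) / "logs"
    log_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    log_path = log_dir / f"protscout_run_{stamp}.log"

    logger = logging.getLogger("protscout_sketch")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)  # let handlers do the filtering
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls

    console = logging.StreamHandler()
    console.setLevel(logging.INFO)
    file_handler = logging.FileHandler(log_path)
    file_handler.setLevel(logging.DEBUG)

    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    for handler in (console, file_handler):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger, log_path


if __name__ == "__main__":
    log, path = setup_logging(".")
    log.info("shown on the console and written to the file")
    log.debug("written to the file only")
```

The logger itself is set to DEBUG so that every record reaches the handlers, and each handler then applies its own threshold.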

πŸ› Troubleshooting

Docker Issues

# Check if Docker is running
docker info

# Ensure user has Docker permissions (log out and back in for the group change to take effect)
sudo usermod -aG docker $USER

GPU Issues

# Check GPU availability
nvidia-smi

# Verify CUDA installation
nvcc --version

Memory Issues

  • Reduce max_containers in configuration
  • Decrease toks_per_batch for ESM-2
  • Lower batch sizes for prediction tools

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📄 License

This project is licensed under the GPL-3.0 License - see the LICENSE file for details.

📚 Citation

If you use ProtScout in your research, please cite:

@software{protscout2024,
  author = {Robaina-Estévez, Semidán},
  title = {ProtScout: AI-powered protein sequence ranking},
  year = {2025},
  url = {https://github.com/Robaina/ProtScout}
}

🙏 Acknowledgments

ProtScout integrates several state-of-the-art protein prediction tools:

  • ESMFold for structure prediction
  • ESM-2 for sequence embeddings
  • CatPred for catalytic activity prediction
  • Catapro for kinetic parameters (KM, Kcat, catalytic efficiency)
  • Temberture for thermal stability prediction
  • Temstapro for melting temperature prediction
  • GeoPoc for environmental condition prediction
  • GATSol for solubility prediction
