ProtScout is a Python package that enables ranking of protein sequences based on multiple properties predicted by state-of-the-art AI models. It provides a unified interface to assess and compare proteins using various characteristics such as stability, solubility, catalytic efficiency, and thermal properties.
- Comprehensive protein property analysis (structure, embeddings, catalytic activity, kinetic parameters, thermal stability, melting temperature, environmental tolerances, solubility, classical properties)
- Containerized execution of prediction tools with Docker
- Modular, parallel workflow with configurable steps
- Automatic retry and resume for robust execution
- Validation and dry-run modes to preview the workflow
- Fully configurable via YAML files and environment variable overrides
- Detailed logging and resource monitoring
- Python 3.8 or higher
- Docker (for running containerized prediction tools)
- NVIDIA GPU with CUDA support (recommended)
- Conda or Poetry for environment management
# Clone repository
git clone https://github.com/Robaina/ProtScout.git
cd ProtScout
# Install Poetry if you haven't already
pip install poetry
# Install package and dependencies
poetry install
# Activate the virtual environment
poetry shell
# Alternative: clone repository and install with pip
git clone https://github.com/Robaina/ProtScout.git
cd ProtScout
# Install in development mode
pip install -e .
Generate a configuration file:
protscout init -o my_config.yaml
Edit the configuration file with your paths and settings:
condition: ultra
workdir: /path/to/your/workdir
modeldir: /path/to/model/weights
python_executable: /path/to/conda/env/bin/python
memory: 100g
workers: 2
quiet: true
max_retries: 2
preserve_artifacts: false
Validate your setup:
protscout validate -c my_config.yaml
Run the workflow:
protscout run -c my_config.yaml
# Run complete workflow
protscout run -c config.yaml
# Run specific steps only
protscout run -c config.yaml -s clean_sequences -s esmfold
# Override condition from command line
protscout run -c config.yaml --condition ultra
# Resume from last successful step after failure
protscout run -c config.yaml --resume
# Dry run to see what would be executed
protscout run -c config.yaml --dry-run
# Monitor logs in real-time
protscout logs logs/protscout_run_20240112_143022.log -f
The workflow automatically runs compatible steps in parallel:
- ESMFold and ESM-2 run simultaneously
- All prediction tools (CatPred, Catapro, Temberture, Temstapro, GeoPoc, GATSol) run in parallel
- Result processing steps are parallelized
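The parallel execution and `--resume` behavior described above can be pictured with a small sketch. This is not ProtScout's scheduler: the `STAGES` grouping, `run_step`, and `run_stages` are hypothetical names chosen for illustration, and the grouping is only inferred from the step list in this README.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical grouping: steps inside one stage do not depend on each other.
STAGES = [
    ["clean_sequences"],
    ["esmfold", "esm2"],
    ["remove_sequences_without_pdb"],
    ["prepare_catpred"],
    ["catpred", "catapro", "temberture", "temstapro", "geopoc", "gatsol"],
]


def run_step(name):
    """Placeholder for launching one step (for example, a container)."""
    return name


def run_stages(stages, completed=(), workers=2, runner=run_step):
    """Run stages in order; steps within a stage run concurrently.

    Steps listed in `completed` are skipped, which mimics resuming a run.
    """
    done = list(completed)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for stage in stages:
            pending = [step for step in stage if step not in done]
            # list() waits for the whole stage and re-raises the first
            # exception raised by any step, so later stages do not start.
            done.extend(list(pool.map(runner, pending)))
    return done


if __name__ == "__main__":
    # Pretend the first two steps finished in an earlier run.
    print(run_stages(STAGES, completed=["clean_sequences", "esmfold"]))
```

Here `workers` plays the role of the `workers` setting in the configuration; a real runner would also persist `done` to disk after each stage so that a later run can pick it up.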
ProtScout uses YAML configuration files for workflow management. Key configuration sections:
# Analysis condition
condition: ultra
# Core directories
workdir: /path/to/your/workdir
modeldir: /path/to/model/weights
# Execution settings
python_executable: /path/to/conda/env/bin/python
memory: 100g
workers: 2
quiet: true
max_retries: 2
preserve_artifacts: false
# Container images (optional overrides)
containers:
  esmfold:
    image: ghcr.io/new-atlantis-labs/esmfold:latest
    max_containers: 1
  # ... other containers: esm2, catpred, catapro, temberture, temstapro, geopoc, gatsol
# GPU and shared memory settings
resources:
  gpus: all
  shm_size: 100g
# Workflow steps
steps:
- clean_sequences
- esmfold
- esm2
- remove_sequences_without_pdb
- prepare_catpred
- catpred
- catapro
- temberture
- temstapro
- geopoc
- gatsol
- classical_properties
- process_temberture
- process_temstapro
- process_geopoc
- process_gatsol
- process_catpred
- process_catapro
- consolidate_results
See configs/example_workflow.yaml for a complete example.
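When editing the `steps` list by hand, it is easy to list a `process_*` step before the tool whose output it consumes. The sketch below shows one way to catch that before a run. It assumes that each `process_<tool>` step depends on a step named `<tool>`, which matches the names in this README but is not confirmed behavior of `protscout validate`; `check_step_order` is a hypothetical helper.

```python
def check_step_order(steps):
    """Return a list of problems found in a `steps` list (empty if none)."""
    problems = []
    position = {name: index for index, name in enumerate(steps)}
    for name, index in position.items():
        if not name.startswith("process_"):
            continue
        tool = name[len("process_"):]
        if tool not in position:
            problems.append(f"{name}: step '{tool}' is not in the list")
        elif position[tool] > index:
            problems.append(f"{name}: listed before '{tool}'")
    return problems


if __name__ == "__main__":
    print(check_step_order(["gatsol", "process_gatsol"]))  # no problems
    print(check_step_order(["process_gatsol", "gatsol"]))  # wrong order
```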
- clean_sequences: Clean and deduplicate input sequences
- esmfold: Predict protein structures using ESMFold
- esm2: Generate protein embeddings using ESM-2
- remove_sequences_without_pdb: Remove sequences that have no predicted structure
- prepare_catpred: Prepare inputs for catalytic prediction
- catpred: Predict catalytic properties
- catapro: Predict kinetic parameters (KM, Kcat, catalytic efficiency)
- temberture: Predict temperature stability
- temstapro: Predict melting temperature
- geopoc: Predict environmental conditions (temp, pH, salt)
- gatsol: Predict solubility
- classical_properties: Calculate classical protein properties
- process_*: Process results from each tool
- consolidate_results: Create final output tables
<artifacts_dir>/                      # raw outputs (artifacts)
├── structures/                       # PDB files from ESMFold
├── embeddings/                       # ESM-2 embeddings
├── clean_sequences/                  # cleaned FASTA files
├── catpred_data/                     # prepared inputs for CatPred
├── catpred/                          # CatPred raw output
├── catapro/                          # Catapro kinetic predictions (KM, Kcat, efficiency)
├── temberture/                       # temperature stability predictions
├── temstapro/                        # melting temperature predictions
├── geopoc/                           # environmental predictions (temp, pH, salt)
└── gatsol/                           # solubility predictions
<results_dir>/                        # processed results
├── classical_properties_results/     # classical property outputs
├── temberture_results/               # processed temperature results
├── temstapro_results/                # processed melting temperature results
├── geopoc_results/                   # processed environmental results
├── gatsol_results/                   # processed solubility results
├── catpred_results/                  # processed CatPred results
├── catapro_results/                  # processed Catapro kinetic results
└── consolidated_results/             # final consolidated tables
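After a run, a quick way to see which parts of this layout were actually produced is to compare the directories on disk with the expected names. The sketch below does only that; `missing_dirs` is a hypothetical helper, not part of ProtScout, and the paths passed to it are placeholders for your own artifacts and results directories.

```python
from pathlib import Path

# Directory names taken from the layout above.
ARTIFACT_DIRS = [
    "structures", "embeddings", "clean_sequences", "catpred_data", "catpred",
    "catapro", "temberture", "temstapro", "geopoc", "gatsol",
]
RESULT_DIRS = [
    "classical_properties_results", "temberture_results", "temstapro_results",
    "geopoc_results", "gatsol_results", "catpred_results", "catapro_results",
    "consolidated_results",
]


def missing_dirs(root, expected):
    """Return the expected subdirectory names that do not exist under `root`."""
    root = Path(root)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    # Placeholder paths; replace with your own directories.
    print("missing artifacts:", missing_dirs("artifacts", ARTIFACT_DIRS))
    print("missing results:", missing_dirs("results", RESULT_DIRS))
```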
ProtScout automatically saves workflow state and can resume from failures:
# If workflow fails at step 'gatsol'
protscout run -c config.yaml --resume
# Workflow will skip completed steps and continue from 'gatsol'
Comprehensive logging with multiple levels:
- Console output: INFO level (progress and important messages)
- Log file: DEBUG level (detailed execution information)
Logs are saved to: {workdir}/logs/protscout_run_YYYYMMDD_HHMMSS.log
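The two-level setup described above (INFO on the console, DEBUG in a timestamped file) can be reproduced with Python's standard `logging` module. This is a minimal sketch of the pattern, not ProtScout's actual logging code; `setup_logging` and the logger name `protscout_sketch` are made up for the example.

```python
import logging
from datetime import datetime
from pathlib import Path


def setup_logging(workdir, name="protscout_sketch"):
    """Create a logger with an INFO console handler and a DEBUG file handler."""
    log_dir = Path(workdir) / "logs"
    log_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    log_file = log_dir / f"protscout_run_{stamp}.log"

    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)  # accept everything; handlers filter
    logger.handlers.clear()         # avoid duplicate handlers on repeated calls

    console = logging.StreamHandler()
    console.setLevel(logging.INFO)
    file_handler = logging.FileHandler(log_file)
    file_handler.setLevel(logging.DEBUG)

    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    for handler in (console, file_handler):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger, log_file


if __name__ == "__main__":
    logger, log_file = setup_logging(".")
    logger.info("shown on the console and written to the file")
    logger.debug("written to the file only")
    print("log file:", log_file)
```

Setting the logger itself to DEBUG and filtering per handler is what lets the same message stream feed both destinations at different verbosity.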
# Check if Docker is running
docker info
# Ensure user has Docker permissions
sudo usermod -aG docker $USER
# Check GPU availability
nvidia-smi
# Verify CUDA installation
nvcc --version
If a step runs out of memory:
- Reduce max_containers in the configuration
- Decrease toks_per_batch for ESM-2
- Lower batch sizes for prediction tools
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the GPL-3.0 License - see the LICENSE file for details.
If you use ProtScout in your research, please cite:
@software{protscout2024,
author = {Robaina-Estévez, Semidán},
title = {ProtScout: AI-powered protein sequence ranking},
year = {2025},
url = {https://github.com/Robaina/ProtScout}
}
ProtScout integrates several state-of-the-art protein prediction tools:
- ESMFold for structure prediction
- ESM-2 for sequence embeddings
- CatPred for catalytic activity prediction
- Catapro for kinetic parameters (KM, Kcat, catalytic efficiency)
- Temberture for thermal stability prediction
- Temstapro for melting temperature prediction
- GeoPoc for environmental condition prediction
- GATSol for solubility prediction
