A workspace created by @danniesim while working with the Wellcome Collection Digital Platform and Machine Learning team.
This workspace is meant to be used with VS Code.
Install Claude Code to use the AI assistant with this project.
Before running Claude Code, load the environment variables:
source ./load_env.sh
Note: don't run ./load_env.sh directly (executing it won't persist variables in your current shell).
If your env file isn't named .env, you can pass a path:
source ./load_env.sh path/to/.env
This sets AWS_PROFILE so Claude Code uses AWS Bedrock for LLM access.
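The difference between sourcing and executing can be checked with a small, self-contained sketch. It does not use the real load_env.sh: the temporary script and the DEMO_PROFILE variable below are hypothetical stand-ins, chosen only to illustrate why `source` (or its POSIX equivalent `.`) is needed.

```shell
# Hypothetical stand-in for load_env.sh: a throwaway script that exports one variable.
tmp_script="$(mktemp)"
echo 'export DEMO_PROFILE=example' > "$tmp_script"

# Executing the script runs it in a child process,
# so the exported variable never reaches the current shell.
unset DEMO_PROFILE
sh "$tmp_script"
after_exec="${DEMO_PROFILE:-unset}"

# Sourcing the script runs it in the current shell,
# so the exported variable persists after it finishes.
. "$tmp_script"
after_source="${DEMO_PROFILE:-unset}"

echo "after executing: $after_exec"
echo "after sourcing:  $after_source"

# Remove the temporary script.
rm -f "$tmp_script"
```

After sourcing the real script, `echo "$AWS_PROFILE"` is a quick way to confirm the variable is set in the shell that will launch Claude Code.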
- /commit - Create git commits using conventional commit style (feat:, fix:, docs:, etc.)
- /spark-docker - Start/stop local Spark cluster for development
- /data-explore - Query Hive tables and explore the data warehouse
- /iiif - Download IIIF manifests and process Wellcome Collection data
- /ec2 - Manage EC2 instances for GPU workloads
- /embed - Generate text embeddings using the Qwen3 embedding service
- /vlm-embed - Run VLM embedding jobs or start the Flask service
- /run-tests - Run pytest for the wc_simd package
- /notebook - Work with Jupyter notebooks
- /deploy - Deploy to AWS (SageMaker, Amplify, EC2)
Based on past work in this repo, Claude Code can help with:
- VLM embeddings - Add features to the embedding pipeline, debug sharding issues, optimize multi-GPU processing
- Spark/Hive - Write Spark jobs, debug OOM errors, create new Hive tables, optimize queries
- IIIF processing - Extend manifest processing, add new CLI commands, handle edge cases
- Infrastructure - Update Docker configs, add CDK stacks, configure EC2/SageMaker
- Notebooks - Explore data, run experiments, visualize embeddings
- Bug fixes - Debug failing jobs, fix configuration issues, improve error handling
- Documentation - Update CLAUDE.md with new patterns, add notebook overviews
The source of the wc_simd Python package is found in src/wc_simd.
I use uv for Python dependency management. You can install it with:
curl -LsSf https://astral.sh/uv/install.sh | sh
To install the Python dependencies and make the wc_simd module importable, run uv sync from the repo root.
If requirements.txt is needed, run: uv run pip freeze > requirements.txt
Explorations and experiments are found in notebooks.
Bespoke data files are placed in data. Files that can be imported from other sources are found in data/imports. See: Data Import Index
I use pytest for testing. Tests can be found in tests.
Demos like bookchat can be found in demos.
I use LTeX and Markdown Lint to keep my docs and comments sane. Here are my recommended VS Code extensions.
This VS Code workspace can be run on a remote EC2 instance. I use a c5.4xlarge instance.
Long-running jobs can be run with screen, which keeps programs started on the command line running even if the SSH connection drops.
Once reconnected via SSH, we can retrieve the terminal running the program with screen -R. List all terminals with screen -list.
sudo apt install clang