wellcomecollection/wc_simd

Workspace of SIMD

A workspace created by @danniesim while working with the Wellcome Collection Digital Platform and Machine Learning team.

This workspace is meant to be used with VS Code.

Claude Code

Install Claude Code to use the AI assistant with this project.

Before running Claude Code, load the environment variables:

source ./load_env.sh

Note: don't run ./load_env.sh (executing it won't persist variables in your current shell). If your env file isn't named .env, you can pass a path:

source ./load_env.sh path/to/.env

This sets AWS_PROFILE so Claude Code uses AWS Bedrock for LLM access.
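
The difference between sourcing and executing can be seen with a small sketch. It does not use the real load_env.sh. Instead it writes a temporary stand-in script that exports a made-up variable, DEMO_PROFILE, and compares the two ways of running it. The `.` command is the POSIX spelling of `source`.

```shell
# Hypothetical stand-in for load_env.sh: a script that exports one variable.
tmp_script="$(mktemp)"
printf 'export DEMO_PROFILE=example\n' > "$tmp_script"

# Executing runs the script in a child process; its exports vanish when it exits.
unset DEMO_PROFILE
sh "$tmp_script"
after_exec="${DEMO_PROFILE:-unset}"

# Sourcing runs the script in the current shell, so the export persists.
. "$tmp_script"
after_source="${DEMO_PROFILE:-unset}"

echo "after executing: $after_exec"
echo "after sourcing:  $after_source"
rm -f "$tmp_script"
```

After sourcing the real script, echo "${AWS_PROFILE:-not set}" is a quick way to confirm that the variable reached your shell.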

Slash commands

  • /commit - Create git commits using conventional commit style (feat:, fix:, docs:, etc.)
  • /spark-docker - Start/stop local Spark cluster for development
  • /data-explore - Query Hive tables and explore the data warehouse
  • /iiif - Download IIIF manifests and process Wellcome Collection data
  • /ec2 - Manage EC2 instances for GPU workloads
  • /embed - Generate text embeddings using the Qwen3 embedding service
  • /vlm-embed - Run VLM embedding jobs or start the Flask service
  • /run-tests - Run pytest for the wc_simd package
  • /notebook - Work with Jupyter notebooks
  • /deploy - Deploy to AWS (SageMaker, Amplify, EC2)

General development tasks

Based on past work in this repo, Claude Code can help with:

  • VLM embeddings - Add features to the embedding pipeline, debug sharding issues, optimize multi-GPU processing
  • Spark/Hive - Write Spark jobs, debug OOM errors, create new Hive tables, optimize queries
  • IIIF processing - Extend manifest processing, add new CLI commands, handle edge cases
  • Infrastructure - Update Docker configs, add CDK stacks, configure EC2/SageMaker
  • Notebooks - Explore data, run experiments, visualize embeddings
  • Bug fixes - Debug failing jobs, fix configuration issues, improve error handling
  • Documentation - Update CLAUDE.md with new patterns, add notebook overviews

UV and Python Package

The source of the wc_simd Python package is in src/wc_simd.

I use UV for Python dependency management. You can install it with: curl -LsSf https://astral.sh/uv/install.sh | sh

To install the Python dependencies and make the wc_simd module importable, run this from the repo root: uv sync

If requirements.txt is needed, run: uv run pip freeze > requirements.txt
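
A minimal sketch of the setup steps above, assuming it runs from the repo root and that the project is described by a pyproject.toml there (an assumption; the file name is not stated in this README). The import check is an illustrative extra, not part of the documented workflow.

```shell
# Run from the repo root; skip quietly if uv or the project file is missing.
if command -v uv >/dev/null 2>&1 && [ -f pyproject.toml ]; then
  # Create or update .venv with the project and its dependencies.
  uv sync
  # Confirm that the package imports inside the managed environment.
  uv run python -c "import wc_simd"
  import_status=$?
else
  import_status="skipped"
fi
echo "wc_simd import check: $import_status"
```

A status of 0 means the package imported cleanly. If pip is not installed in the project environment, uv pip freeze > requirements.txt is an alternative way to produce the requirements file.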

Notebooks

Explorations and experiments are found in notebooks.

Data Directory

Bespoke data files are placed in data. Files that can be imported from other sources are found in data/imports. See: Data Import Index

Local Spark with Hadoop and Hive

See this

Tests

I use pytest for testing. Tests can be found in tests.

Demos

Demos like bookchat can be found in demos.

Lint and Spellcheck

I use LTeX and Markdown Lint to keep my docs and comments sane. Here are my recommended VS Code extensions.

EC2 Worker Instance

This VS Code workspace can be run on a remote EC2 instance. I use a c5.4xlarge instance.

Long-running jobs are typically run inside screen, which keeps programs started on the command line running even if the SSH connection drops. After reconnecting via SSH, reattach to the terminal running the program with screen -R. List all sessions with screen -list.
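
The screen workflow above might look like the following sketch. The session name wc_simd_job and the sleep command are placeholders standing in for a real job. The block does nothing if screen is not installed.

```shell
# Hypothetical session name and placeholder command; substitute the real job.
session="wc_simd_job"

if command -v screen >/dev/null 2>&1; then
  # Start the job in a detached session so it survives an SSH disconnect.
  screen -dmS "$session" sh -c 'sleep 30'

  # List sessions; the new one should appear as Detached.
  screen -list || true

  # Interactive alternatives (not run here):
  #   screen -R "$session"   reattach, creating the session if it does not exist
  #   Ctrl-a d               detach again and leave the job running

  # Clean up the demo session.
  screen -S "$session" -X quit || true
fi
```

Naming the session with -S makes it easier to find the right one when several jobs are running on the same instance.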

Python Packages Building

fasttext

sudo apt install clang
