Crawlers that scrape URLs from domains and submit them to ArchiveBox.
- Parallel crawling of multiple domains.
- Respects `robots.txt` (optional).
- Parses sitemaps (XML) for seed URLs.
- Filters out unwanted domains or URLs.
- Submits discovered URLs in bulk to ArchiveBox (v0.8.0+) via its REST API.
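The bulk-submission step can be sketched as a single authenticated POST. Note that the endpoint path and payload field names below are assumptions, not the project's verified API — check your ArchiveBox instance's API docs (served under `/api/` on v0.8.0+) for the real route:

```python
# Sketch: building a bulk "add URLs" request for an ArchiveBox REST API.
# The route and payload shape are illustrative assumptions.
import json
import urllib.request


def build_add_request(base_url: str, token: str, urls: list, tag: str):
    """Build an authenticated POST submitting many URLs in one call."""
    payload = json.dumps({"urls": urls, "tag": tag}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/v1/cli/add",  # assumed endpoint path
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending it would then be a matter of `urllib.request.urlopen(...)`, with whatever batching and retry logic the crawler needs on top.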
- Python ≥ 3.8
- pip
- (optional) Docker & docker-compose
- ArchiveBox v0.8.0 or higher (the version that introduces the REST API)
```bash
# Clone the repo
git clone https://github.com/egg82/archivers.git
cd archivers

# Install the domain-archiver
cd domain_archiver
python3 -m pip install -r requirements.txt

# ... or install the gov-archiver
cd ../gov_archiver
python3 -m pip install -r requirements.txt
```

All options are passed in via environment variables.
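A minimal sketch of how such environment-driven configuration might be read — the variable names match this README, but the defaults, helper names, and boolean parsing shown here are illustrative assumptions, not the project's code:

```python
# Sketch of reading crawler configuration from environment variables.
# Defaults and parsing details are assumptions for illustration.
import os


def env_bool(name: str, default: bool = True) -> bool:
    """Interpret 1/true/yes (and 0/false/no) style env vars."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes")


def load_config() -> dict:
    return {
        "archivebox_url": os.environ["ARCHIVEBOX_URL"].rstrip("/"),
        "api_token": os.environ["API_TOKEN"],
        # DOMAIN_LIST is semicolon-separated; empty entries are dropped
        "domains": [d for d in os.environ.get("DOMAIN_LIST", "").split(";") if d],
        "depth_limit": int(os.environ.get("DEPTH_LIMIT", "1")),
        "crawl_delay": float(os.environ.get("CRAWL_DELAY", "0.5")),
        "follow_robots": env_bool("FOLLOW_ROBOTS", True),
    }
```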
Configure and run the domain archiver:

```bash
export ARCHIVEBOX_URL="https://archivebox.example.com"
export API_TOKEN="your_api_token"
export DOMAIN_LIST="example.com;test.com"
export TAG="archivebox-tag"

# tuning
export DEPTH_LIMIT=1
export CRAWL_DELAY=0.5
export SIMULTANEOUS_DOMAINS=3
export THREADS_PER_DOMAIN=8
export FOLLOW_ROBOTS=true
export LOG_LEVEL=info

# Optional: customize which URLs to include
# export URL_FILTERS_REGEX="^https?://([A-Za-z0-9-]+\.)*example\.com(/.*)?$"

# Optional: use Redis for parallel crawling, very deep, or resumable crawls
# export REDIS_URL="redis://redis-host:6379/0"

python3 import.py
```

Configure and run the gov archiver:

```bash
export ARCHIVEBOX_URL="https://archivebox.example.com"
export API_TOKEN="your_api_token"
export TAG="gov-run-$(date +%Y%m%d)"
export DEPTH_LIMIT=0
export CRAWL_DELAY=0.2
export SIMULTANEOUS_DOMAINS=5
export THREADS_PER_DOMAIN=4
export FOLLOW_ROBOTS=true
export LOG_LEVEL=info

# Optional: customize which URLs to include
# export URL_FILTERS_REGEX="^https?://([A-Za-z0-9-]+\.)*example\.com(/.*)?$"

# Optional: use Redis for parallel crawling, very deep, or resumable crawls
# export REDIS_URL="redis://redis-host:6379/0"

python3 import.py
```

| Variable | Description |
|---|---|
| `ARCHIVEBOX_URL` | Base URL of your ArchiveBox instance (no trailing slash). |
| `API_TOKEN` | Bearer token for the ArchiveBox REST API. |
| `TAG` | Tag to apply to all submitted URLs. |
| `DOMAIN_LIST` (domain_archiver) | Semicolon-separated list of domains to crawl. |
| `FOLLOW_ROBOTS` | `1`/`true`/`yes` or `0`/`false`/`no`: whether to obey `robots.txt`. |
| `URL_FILTERS_REGEX` | (Optional) Semicolon-separated regexes to override the default URL filters. |
| `EXCLUDE_URLS_REGEX` | (Optional) Regex to override the default exclude list (default: skips archiving various file extensions). |
| `NO_CRAWL_URLS_REGEX` | (Optional) Regex to override the default no-crawl list (default: skips crawling various file extensions). |
| `DOMAIN_TYPE_NEGATIVE_FILTER_REGEX` (gov_archiver) | Regex to exclude certain "Domain type" values when fetching the .gov list. |
| `DEPTH_LIMIT` | How many link-hops from each seed URL (0 = just the seeds). |
| `CRAWL_DELAY` | Seconds to wait between requests to the same site. |
| `SIMULTANEOUS_DOMAINS` | Number of domains to crawl in parallel. |
| `THREADS_PER_DOMAIN` | Threads per domain for recursive crawling. |
| `REQUEST_TIMEOUT` | HTTP timeout (seconds) for all requests. |
| `USER_AGENT` | User-agent string presented when fetching pages or `robots.txt`. |
| `LOG_LEVEL` | `debug`, `info`, `warn`, `error`, or `critical`. |
| `REDIS_URL` | (Optional) Redis URL, e.g. `redis://<host>:6379/0`. Only required for parallel, very deep, or resumable crawls. |
| `REDIS_USER` | (Optional) Username for Redis authentication. |
| `REDIS_PASS` | (Optional) Password for Redis authentication. |
| `REDIS_NAMESPACE` | Namespace/prefix for all Redis keys (default: `crawler`). |
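For example, `URL_FILTERS_REGEX` holds semicolon-separated patterns, and a URL presumably passes if it matches at least one of them. The helper below is a sketch of that semantics (the function name is ours, not the project's):

```python
# Sketch of URL_FILTERS_REGEX-style filtering: keep a URL only if it
# matches at least one semicolon-separated pattern. Illustrative only.
import re


def url_passes(url: str, filters_regex: str) -> bool:
    patterns = [p for p in filters_regex.split(";") if p]
    return any(re.search(p, url) for p in patterns)
```

With the example pattern from the config snippets above, `url_passes("https://sub.example.com/page", r"^https?://([A-Za-z0-9-]+\.)*example\.com(/.*)?$")` keeps `example.com` and its subdomains and rejects everything else.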
You can also run either archiver via Docker Hub images:

```bash
# Domain Archiver
docker run --rm \
  -e ARCHIVEBOX_URL="https://archivebox.example.com" \
  -e API_TOKEN="your_api_token" \
  -e DOMAIN_LIST="example.com;test.com" \
  egg82/domain_archiver:1.0.1-alpine

# Gov Archiver
docker run --rm \
  -e ARCHIVEBOX_URL="https://archivebox.example.com" \
  -e API_TOKEN="your_api_token" \
  egg82/gov_archiver:1.0.1-alpine
```

If you want to build locally:

```bash
cd domain_archiver
docker build -t domain_archiver-local --file Dockerfile-alpine .

cd ../gov_archiver
docker build -t gov_archiver-local --file Dockerfile-alpine .
```

- Fork the repo
- Create a feature branch
- Submit a pull request
Please open an issue first for major changes or feature requests.
This project is licensed under the MIT License - see LICENSE for details.