Archivers

Crawlers that scrape URLs from domains and submit them to ArchiveBox.

domain_archiver
gov_archiver

Features

Parallel crawling of multiple domains.
Respects robots.txt (optional).
Parses sitemaps (XML) for seed URLs.
Filters out unwanted domains or URLs.
Submits discovered URLs in bulk to ArchiveBox (v0.8.0+) via its REST API.
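As a rough illustration of the sitemap and robots.txt features listed above, the following sketch uses only the Python standard library. It is not taken from the archivers' source: the function names `sitemap_locs` and `allowed_by_robots`, the sample documents, and the `example-bot` user agent are all hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser


def sitemap_locs(xml_text):
    """Return every <loc> value from a sitemap or sitemap-index document."""
    root = ET.fromstring(xml_text)
    locs = []
    for el in root.iter():
        # Tags are namespaced, e.g. "{http://www.sitemaps.org/schemas/sitemap/0.9}loc",
        # so compare only the local part of the tag name.
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text:
            locs.append(el.text.strip())
    return locs


def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against robots.txt rules that were already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical sample data standing in for real HTTP responses.
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/private/report</loc></url>
</urlset>"""

ROBOTS = "User-agent: *\nDisallow: /private/\n"

# Keep only the sitemap URLs that robots.txt permits.
seeds = [u for u in sitemap_locs(SITEMAP) if allowed_by_robots(ROBOTS, "example-bot", u)]
print(seeds)
```

With robots.txt handling turned off, the filtering step would simply be skipped and every sitemap URL would become a seed.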

Getting Started

Prerequisites

Python ≥ 3.8
pip
(optional) Docker & docker-compose
ArchiveBox v0.8.0 or higher (required for its REST API)

Installation

# Clone the repo
git clone https://github.com/egg82/archivers.git
cd archivers

# Install the domain-archiver
cd domain_archiver
python3 -m pip install -r requirements.txt

# ... or install the gov-archiver
cd ../gov_archiver
python3 -m pip install -r requirements.txt

Usage

All options are passed in via environment variables.
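For readers wondering how such environment variables are typically turned into typed settings, here is a minimal sketch. It is an assumption about the general approach, not the archivers' actual code: the helper names (`env_bool`, `env_list`, `load_settings`) and every default value are hypothetical. The accepted boolean spellings and the semicolon separator follow the Configuration table.

```python
import os

TRUE_VALUES = {"1", "true", "yes"}
FALSE_VALUES = {"0", "false", "no"}


def env_bool(name, default, env=os.environ):
    """Parse a 1/true/yes/0/false/no style flag, case-insensitively."""
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    value = raw.strip().lower()
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    raise ValueError(f"{name} must be one of 1/true/yes/0/false/no, got {raw!r}")


def env_list(name, env=os.environ):
    """Split a semicolon-separated variable, dropping empty items."""
    return [item.strip() for item in env.get(name, "").split(";") if item.strip()]


def load_settings(env=os.environ):
    """Collect a few settings; the defaults here are assumptions, not documented values."""
    return {
        "archivebox_url": env["ARCHIVEBOX_URL"].rstrip("/"),  # no trailing slash
        "api_token": env["API_TOKEN"],
        "domains": env_list("DOMAIN_LIST", env),
        "depth_limit": int(env.get("DEPTH_LIMIT", "0")),
        "crawl_delay": float(env.get("CRAWL_DELAY", "0")),
        "follow_robots": env_bool("FOLLOW_ROBOTS", True, env),
    }


# A plain dict stands in for os.environ so the example has no side effects.
example_env = {
    "ARCHIVEBOX_URL": "https://archivebox.example.com/",
    "API_TOKEN": "your_api_token",
    "DOMAIN_LIST": "example.com;test.com",
    "DEPTH_LIMIT": "1",
    "FOLLOW_ROBOTS": "yes",
}
settings = load_settings(example_env)
print(settings["domains"], settings["follow_robots"])
```

Failing fast on a missing `ARCHIVEBOX_URL` or `API_TOKEN` (a `KeyError` here) is a design choice of this sketch; the real scripts may report missing variables differently.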

Domain Archiver

export ARCHIVEBOX_URL="https://archivebox.example.com"
export API_TOKEN="your_api_token"
export DOMAIN_LIST="example.com;test.com"
export TAG="archivebox-tag"

# tuning
export DEPTH_LIMIT=1
export CRAWL_DELAY=0.5
export SIMULTANEOUS_DOMAINS=3
export THREADS_PER_DOMAIN=8
export FOLLOW_ROBOTS=true
export LOG_LEVEL=info

# optional: customize which URLs to include
# export URL_FILTERS_REGEX="^https?://([A-Za-z0-9-]+\.)*example\.com(/.*)?$"

# optional: use Redis for parallel, very deep, or resumable crawls
# export REDIS_URL="redis://redis-host:6379/0"

python3 import.py

Gov Archiver

export ARCHIVEBOX_URL="https://archivebox.example.com"
export API_TOKEN="your_api_token"
export TAG="gov-run-$(date +%Y%m%d)"
export DEPTH_LIMIT=0
export CRAWL_DELAY=0.2
export SIMULTANEOUS_DOMAINS=5
export THREADS_PER_DOMAIN=4
export FOLLOW_ROBOTS=true
export LOG_LEVEL=info

# optional: customize which URLs to include
# export URL_FILTERS_REGEX="^https?://([A-Za-z0-9-]+\.)*example\.com(/.*)?$"

# optional: use Redis for parallel, very deep, or resumable crawls
# export REDIS_URL="redis://redis-host:6379/0"

python3 import.py
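Both archivers finish by submitting the discovered URLs to ArchiveBox in bulk, using `ARCHIVEBOX_URL`, `API_TOKEN`, and `TAG`. The sketch below shows one way such a request could be built with the standard library. It is illustrative only: the endpoint path `/api/v1/cli/add`, the payload field names `urls` and `tag`, the batch size, and the helper names are assumptions, so check the API documentation served by your own ArchiveBox instance before relying on them. The request is constructed but never sent.

```python
import json
import urllib.request


def batches(urls, size):
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def build_add_request(base_url, token, urls, tag, endpoint="/api/v1/cli/add"):
    """Build (but do not send) a bulk-add request.

    The endpoint path and payload shape are assumptions for illustration.
    """
    payload = {"urls": urls, "tag": tag}
    return urllib.request.Request(
        base_url.rstrip("/") + endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


discovered = [f"https://example.com/page/{i}" for i in range(5)]
requests_to_send = [
    build_add_request("https://archivebox.example.com", "your_api_token", chunk, "archivebox-tag")
    for chunk in batches(discovered, 2)  # batch size of 2 is arbitrary
]
print(len(requests_to_send), requests_to_send[0].full_url)
```

Passing each request to `urllib.request.urlopen` would perform the actual submission; that step is left out so the example runs without a server.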

Configuration

Variable	Description
`ARCHIVEBOX_URL`	Base URL of your ArchiveBox instance (no trailing slash).
`API_TOKEN`	Bearer token for the ArchiveBox REST API.
`TAG`	Tag to apply to all submitted URLs.
`DOMAIN_LIST` (domain_archiver)	Semicolon-separated list of domains to crawl.
`FOLLOW_ROBOTS`	Whether to obey `robots.txt`. Accepts `1`/`true`/`yes` or `0`/`false`/`no`.
`URL_FILTERS_REGEX`	(Optional) Semicolon-separated regexes to override default URL filters.
`EXCLUDE_URLS_REGEX`	(Optional) Regex that overrides the default URL exclusions. By default, URLs with various file extensions are not archived.
`NO_CRAWL_URLS_REGEX`	(Optional) Regex that overrides the default no-crawl URLs. By default, URLs with various file extensions are not crawled.
`DOMAIN_TYPE_NEGATIVE_FILTER_REGEX` (gov_archiver)	Regex to exclude certain “Domain type” values when fetching the `.gov` list.
`DEPTH_LIMIT`	How many link hops to follow from each seed URL (0 = just the seeds).
`CRAWL_DELAY`	Seconds to wait between requests to the same site.
`SIMULTANEOUS_DOMAINS`	Number of domains to crawl in parallel.
`THREADS_PER_DOMAIN`	Threads per domain for recursive crawling.
`REQUEST_TIMEOUT`	HTTP timeout (seconds) for all requests.
`USER_AGENT`	User-agent string to present when fetching pages or robots.txt.
`LOG_LEVEL`	`debug`, `info`, `warn`, `error`, or `critical`.
`REDIS_URL`	(Optional) Redis URL, e.g. `redis://<host>:6379/0`. Only needed for parallel, very deep, or resumable crawls.
`REDIS_USER`	(Optional) Username for Redis authentication.
`REDIS_PASS`	(Optional) Password for Redis authentication.
`REDIS_NAMESPACE`	Namespace/prefix for all Redis keys (default: `crawler`).
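
To make the interaction between `DEPTH_LIMIT` and `URL_FILTERS_REGEX` concrete, here is a small breadth-first sketch. It is a simplified model under stated assumptions rather than the archivers' implementation: the `crawl` and `compile_filters` functions and the in-memory link graph are hypothetical, the filters are applied only to discovered links (not to seeds), and `CRAWL_DELAY`, threading, and Redis are left out.

```python
import re
from collections import deque


def compile_filters(spec):
    """Compile a semicolon-separated list of include regexes.

    Note: this simple split cannot handle a pattern that itself contains ';'.
    """
    return [re.compile(p) for p in spec.split(";") if p]


def crawl(seeds, get_links, depth_limit, include):
    """Breadth-first traversal that stops `depth_limit` hops from the seeds."""
    seeds = list(dict.fromkeys(seeds))  # de-duplicate, keep order
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    found = []
    while queue:
        url, depth = queue.popleft()
        found.append(url)
        if depth >= depth_limit:
            continue  # depth_limit=0 means the seeds are never expanded
        for link in get_links(url):
            if link not in seen and any(p.search(link) for p in include):
                seen.add(link)
                queue.append((link, depth + 1))
    return found


# Hypothetical link graph standing in for fetched pages.
LINKS = {
    "https://example.com/": ["https://example.com/a", "https://other.test/x"],
    "https://example.com/a": ["https://example.com/b"],
}
include = compile_filters(r"^https?://([A-Za-z0-9-]+\.)*example\.com(/.*)?$")
links_of = lambda url: LINKS.get(url, [])

print(crawl(["https://example.com/"], links_of, 0, include))
print(crawl(["https://example.com/"], links_of, 1, include))
```

In this model a depth limit of 0 returns only the seed, a limit of 1 adds `https://example.com/a`, and the off-domain link is dropped by the include filter at every depth.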

Docker

You can also run either archiver from the images published on Docker Hub:

# Domain Archiver
docker run --rm \
  -e ARCHIVEBOX_URL="https://archivebox.example.com" -e API_TOKEN="your_api_token" -e DOMAIN_LIST="example.com;test.com" \
  egg82/domain_archiver:1.0.1-alpine

# Gov Archiver
docker run --rm \
  -e ARCHIVEBOX_URL="https://archivebox.example.com" -e API_TOKEN="your_api_token" \
  egg82/gov_archiver:1.0.1-alpine

If you want to build locally:

cd domain_archiver
docker build -t domain_archiver-local --file Dockerfile-alpine .

cd ../gov_archiver
docker build -t gov_archiver-local --file Dockerfile-alpine .

Contributing

Fork the repo
Create a feature branch
Submit a pull request

Please open an issue first for major changes or feature requests.

License

This project is licensed under the MIT License - see LICENSE for details.
