Crawls websites and saves found URLs to a file.
Install Node.js and run npm install in ./crawler.
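For example, from the directory containing ./crawler:

  cd ./crawler
  npm install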
There are 2 required CLI arguments:
- First argument: domain to crawl
- Second argument: path to the file where the URLs should be saved
And 2 optional CLI arguments:
- Third argument: connection count limit. Default is 15.
- Fourth argument: redirect count limit. Default is 15.
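For reference, here is a minimal sketch of how these four arguments could be read. The variable names (domain, outputFile, connectionLimit, redirectLimit) are illustrative assumptions, not necessarily what index.js uses internally:

  // Sketch only: reads the arguments described above, with their defaults.
  const args = process.argv.slice(2);

  if (args.length < 2) {
    console.error('Usage: node ./index.js <domain> <output-file> [connection-limit] [redirect-limit]');
    process.exit(1);
  }

  const domain = args[0];                                         // required: domain to crawl
  const outputFile = args[1];                                     // required: where found URLs are saved
  const connectionLimit = args.length > 2 ? Number(args[2]) : 15; // optional, defaults to 15
  const redirectLimit = args.length > 3 ? Number(args[3]) : 15;   // optional, defaults to 15

  console.log({ domain, outputFile, connectionLimit, redirectLimit });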
For example, if you want to crawl example.com and save found URLs to ./test.txt, run the following command:
  node ./index.js example.com test.txt

To download the crawled URLs with Wget, point its --input-file option at the saved URL list (replace CHANGE_THIS with the path to that file):

  wget --input-file=CHANGE_THIS --warc-file="warc" --force-directories --tries=10
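For instance, using the test.txt file produced by the example above:

  wget --input-file=test.txt --warc-file="warc" --force-directories --tries=10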