- Linux operating system.
- c++ version 11.
- g++ version 8+.
- POSIX threads.
-
Download Chilkat for your operating system.
-
Configure Chilkat library:
- Extract library from
.tar.gzfile. - Copy relative path from extracted folder.
- Replace
CHILKATPATHinMakefile.incfile with the relative path.
- Extract library from
-
Download Gumbo v0.10.1.
-
Configure Gumbo:
- Extract library from
.tar.gzor.zipfile. - Move to inside library folder.
- Run the following commands on terminal:
$ sudo apt-get install libtool $ ./autogen.sh $ ./configure $ make $ sudo make install - Extract library from
The simple_crwaler is a simple clawler that receives a parent URL and a number, and outputs the first parameter followed by the requested number of links inside it.
$ cd simples_crawler
$ make
$ ./a.out <parentUrl> <numAdditionalLinks>
<urlToCrawl>: the parent url.<numAdditionalLinks>: the number of additional urls to crawl inside parent url.
The complex_crawler module is a crawler with two options for crawling: a long term and a short term scheduler. If the crawler is configured with the first option, it will use a single thread and a priroty queue based on the URL size to choose witch URL it should visit first. Otherwise, it will use multiple threads to collect a large number of pages in parallel.
$ cd complex_crawler
$ make
$ ./demo/main.out <seedFileName> <storageDir> <mode> <numPages> <numThreads?>
<seedFileName>: the file name for seed irls separed by a line break. Ex:seed/seed.txt.<storageDir>: the directory to put the collected HTML pages and aindex.txtfile, that has the collection information. Ex:storage/orstorage_extra/.<mode>: 0 for long term scheduler and 2 for short term scheduler. The value 1 chooses an option under construction and should not be used.<numPages>: the number of total pages to crawl.<numThreads?>: the number of threads to use when choosing the short term scheduler (default value is 1).
The inverted_index module indexes the crawled collection and searches for specified pages inside it.
$ cd inverted_index
$ make
./demo/indexing.out <collectionDir> <outDir>
<collectionDir>: the directory where collection is stored. Ex:../complex_crawler/storage/.<outDir>: the directory to store the encoded index and the vocabulary for it.
./demo/indexing.out <indexFileName> <term>
<indexFileName>: the index file name. Ex:storage/inverted_index.txt.<term>: the vocabulary term to search for.
- Install clang-format:
$ sudo apt-get install clang-format
- Run:
$ clang-format -i <file>
All files of this software, except HtmlParser.cpp, are distributed under the MIT License. The HtmlParser.cpp file is distributed under the Apache License, Version 2.0.