
search_engine

System requirements

  • Linux operating system.
  • C++11.
  • g++ version 8+.
  • POSIX threads.

Configure external libraries

  1. Download Chilkat for your operating system.

  2. Configure the Chilkat library:

    • Extract the library from the .tar.gz file.
    • Copy the relative path to the extracted folder.
    • Replace CHILKATPATH in the Makefile.inc file with that relative path (see the sketch after this list).
  3. Download Gumbo v0.10.1.

  4. Configure Gumbo:

    • Extract the library from the .tar.gz or .zip file.
    • Move into the library folder.
    • Run the following commands in a terminal:
    $ sudo apt-get install libtool
    $ ./autogen.sh
    $ ./configure
    $ make
    $ sudo make install
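
For step 2 above, the CHILKATPATH replacement in Makefile.inc might look like the following. The variable name comes from this README; the exact path and surrounding lines are illustrative assumptions:

# Makefile.inc (sketch): point CHILKATPATH at the extracted Chilkat folder
CHILKATPATH = ./chilkat-9.5.0-x86_64-linux-gcc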
    

Test the Chilkat library installation

The simples_crawler module is a simple crawler that receives a parent URL and a number, and outputs the parent URL followed by the requested number of links found inside it.

Build

$ cd simples_crawler
$ make

Run

$ ./a.out <parentUrl> <numAdditionalLinks>
  • <parentUrl>: the parent URL.
  • <numAdditionalLinks>: the number of additional URLs to crawl inside the parent URL.
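
An illustrative run; the URLs below are placeholders, not real output:

$ ./a.out http://example.com 2
http://example.com
http://example.com/about
http://example.com/contact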

Crawl web pages

The complex_crawler module is a crawler with two scheduling options: a long-term and a short-term scheduler. Configured with the first option, it uses a single thread and a priority queue based on URL size to choose which URL to visit first (sketched below). Otherwise, it uses multiple threads to collect a large number of pages in parallel.
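
A minimal sketch of the long-term scheduler's frontier, assuming "URL size" means string length and that shorter URLs are visited first; the repository's actual comparator and ordering may differ:

// frontier_sketch.cpp — illustrative only; build with: g++ -std=c++11 frontier_sketch.cpp
#include <iostream>
#include <queue>
#include <string>
#include <vector>

// Comparator so the shortest URL sits on top of the priority queue.
struct UrlSizeCompare {
  bool operator()(const std::string &a, const std::string &b) const {
    return a.size() > b.size(); // std::priority_queue keeps the "largest" element on top
  }
};

int main() {
  std::priority_queue<std::string, std::vector<std::string>, UrlSizeCompare> frontier;
  frontier.push("http://example.com/a/very/deep/page.html");
  frontier.push("http://example.com/");
  frontier.push("http://example.com/blog");

  while (!frontier.empty()) {
    std::cout << frontier.top() << "\n"; // shortest remaining URL first
    frontier.pop();
  }
  return 0;
}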

Build

$ cd complex_crawler
$ make

Run

$ ./demo/main.out <seedFileName> <storageDir> <mode> <numPages> <numThreads?>
  • <seedFileName>: the file containing the seed URLs, one per line. Ex: seed/seed.txt.
  • <storageDir>: the directory in which to store the collected HTML pages and an index.txt file with the collection information. Ex: storage/ or storage_extra/.
  • <mode>: 0 for the long-term scheduler and 2 for the short-term scheduler. The value 1 selects an option that is under construction and should not be used.
  • <numPages>: the total number of pages to crawl.
  • <numThreads?>: the number of threads to use with the short-term scheduler (default: 1).
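
For example, a hypothetical run that crawls 1000 pages with the short-term scheduler on 8 threads (the page and thread counts are illustrative):

$ ./demo/main.out seed/seed.txt storage/ 2 1000 8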

Index and search the crawled pages

The inverted_index module indexes the crawled collection and searches it for the pages that contain a given term (a conceptual sketch of the data structure follows).
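
Conceptually, an inverted index maps each vocabulary term to a postings list of the documents (and term frequencies) in which it occurs. The sketch below illustrates the idea on in-memory strings; it is not the repository's actual encoding or file format:

// inverted_index_sketch.cpp — illustrative only; build with: g++ -std=c++11 inverted_index_sketch.cpp
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// term -> postings list of (docId, term frequency) pairs
typedef std::map<std::string, std::vector<std::pair<int, int> > > InvertedIndex;

int main() {
  std::vector<std::string> docs;
  docs.push_back("the quick fox");
  docs.push_back("the lazy dog");
  docs.push_back("quick dog");

  // Build: tokenize each document and record its term frequencies.
  InvertedIndex index;
  for (int id = 0; id < (int)docs.size(); ++id) {
    std::map<std::string, int> freq;
    std::istringstream in(docs[id]);
    std::string term;
    while (in >> term) ++freq[term];
    for (auto &entry : freq)
      index[entry.first].push_back(std::make_pair(id, entry.second));
  }

  // Search: print every document that contains the term "quick".
  for (auto &posting : index["quick"])
    std::cout << "doc " << posting.first << " (tf=" << posting.second << ")\n";
  return 0;
}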

Build

$ cd inverted_index
$ make

Indexing

$ ./demo/indexing.out <collectionDir> <outDir>
  • <collectionDir>: the directory where the collection is stored. Ex: ../complex_crawler/storage/.
  • <outDir>: the directory in which to store the encoded index and its vocabulary.
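
For example, assuming the collection produced by the crawler run above:

$ ./demo/indexing.out ../complex_crawler/storage/ storage/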

Search

$ ./demo/indexing.out <indexFileName> <term>
  • <indexFileName>: the index file name. Ex: storage/inverted_index.txt.
  • <term>: the vocabulary term to search for.
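
For example, a hypothetical search for the term engine in the index produced above:

$ ./demo/indexing.out storage/inverted_index.txt engine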

Format project files

  1. Install clang-format:
$ sudo apt-get install clang-format
  2. Run:
$ clang-format -i <file>
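
For example, assuming a hypothetical source file src/main.cpp:

$ clang-format -i src/main.cpp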

LICENSE

All files of this software, except HtmlParser.cpp, are distributed under the MIT License. The HtmlParser.cpp file is distributed under the Apache License, Version 2.0.

About

A simple search engine to collect, index and search HTML pages on the web.
