CS121 - Web Crawler

Date: 05/03/2024

🌐 Overview

This project is a web crawler developed for CS121 (Information Retrieval). It crawls allowed web pages within a set of permitted UCI domains, collects word frequencies, tracks visited URLs, and avoids traps such as infinite crawling loops and duplicate pages.


🔧 Global Variables

  • not_allowed: URLs disallowed by robots.txt.
  • visited_page: Total number of pages visited.
  • visited: Set of visited URLs (to avoid duplication).
  • longest_number: Word count of the longest page.
  • longest_url: URL of the longest page.
  • WordCount: Dictionary of word frequency.
  • domain: Dictionary counting domain frequency.
  • depth: Tracks page depth to detect traps.
  • finger_print: Stores fingerprints for similarity detection.
  • stop_words: Set of ignored words.
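
For reference, the state above could be declared as module-level globals roughly as follows. This is a minimal sketch: the initial values and container types are assumptions inferred from the descriptions, not the original source.

```python
# Sketch of the module-level state described above (types/values are assumptions).
not_allowed = set()     # URLs disallowed by robots.txt
visited_page = 0        # total number of pages visited
visited = set()         # visited URLs, used to avoid duplication
longest_number = 0      # word count of the longest page seen so far
longest_url = ""        # URL of that longest page
WordCount = {}          # word -> frequency
domain = {}             # domain -> number of pages crawled
depth = {}              # path -> depth count, used for trap detection
finger_print = []       # page fingerprints for near-duplicate detection
stop_words = set()      # common English words ignored when counting
```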

🧠 Main Functions

scraper(url, resp)

  1. Calls extract_next_links on a valid URL.
  2. Catches unexpected exceptions and skips the offending page instead of stopping the crawl.
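
A minimal sketch of this control flow, assuming extract_next_links behaves as described below; the bare except-and-skip shown here is an assumption based on point 2, not the exact original code.

```python
def scraper(url, resp):
    # Delegate to extract_next_links; if the page raises an unexpected
    # exception, skip it rather than stopping the crawl.
    try:
        return extract_next_links(url, resp)
    except Exception:
        return []
```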

printall()

  1. Displays and writes the longest page URL and word count.
  2. Outputs:
    • result.txt: Summary including total pages, longest URL, sorted word and domain frequencies.
    • not_allowed.txt: URLs disallowed by robots.txt.
    • visited.txt: All visited URLs.
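
A sketch of the reporting step, assuming the globals listed above; the file names match the list, but the exact line formatting is an assumption.

```python
def printall():
    # Display the longest page, then write the summary report plus the
    # two URL lists described above.
    print(f"Longest page: {longest_url} ({longest_number} words)")
    with open("result.txt", "w") as f:
        f.write(f"Total pages visited: {visited_page}\n")
        f.write(f"Longest page: {longest_url} ({longest_number} words)\n")
        for word, count in sorted(WordCount.items(), key=lambda kv: kv[1], reverse=True):
            f.write(f"{word}: {count}\n")
        for name, count in sorted(domain.items(), key=lambda kv: kv[1], reverse=True):
            f.write(f"{name}: {count}\n")
    with open("not_allowed.txt", "w") as f:
        f.write("\n".join(not_allowed))
    with open("visited.txt", "w") as f:
        f.write("\n".join(visited))
```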

extract_next_links(url, resp)

  1. Checks for a 200 OK status; returns [] otherwise.
  2. Detects traps based on depth (>20), logs to ING_trap.txt.
  3. Determines the character encoding from the Content-Type header, defaulting to utf-8.
  4. Parses HTML using BeautifulSoup.
  5. Discards empty pages or overly large files (>1,000,000 chars); logs to:
    • ING_empty_file.txt
    • ING_too_large_file.txt
  6. Computes fingerprints from hashed 3-word sequences; if similarity with a previously seen page exceeds 95%, skips the page and logs it to ING_similar.txt (see the sketch after this list).
  7. Extracts and normalizes links using urllib and posixpath.
  8. Updates:
    • WordCount
    • longest_url
    • domain frequency
    • visited_page count
  9. Valid links are added to the return list and marked as visited.
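
Steps 6 and 7 are the least obvious parts, so here is a hedged sketch of how 3-word-shingle fingerprinting and link normalization could be implemented with the stated libraries. The helper names (fingerprint, is_near_duplicate, normalize_link) and the Jaccard-style overlap measure are assumptions for illustration, not the exact code.

```python
import posixpath
from urllib.parse import urljoin, urldefrag, urlparse

def fingerprint(words):
    # Hash every consecutive 3-word sequence (shingle) of the page text.
    return {hash(" ".join(words[i:i + 3])) for i in range(len(words) - 2)}

def is_near_duplicate(new_fp, seen_fps, threshold=0.95):
    # Flag the page if its shingle overlap with any previously seen page
    # exceeds the 95% similarity threshold mentioned in step 6.
    for fp in seen_fps:
        if new_fp and fp and len(new_fp & fp) / len(new_fp | fp) > threshold:
            return True
    return False

def normalize_link(base_url, href):
    # Step 7: resolve relative links, strip fragments, and collapse
    # "." / ".." segments in the path with posixpath.normpath.
    absolute, _ = urldefrag(urljoin(base_url, href))
    parts = urlparse(absolute)
    path = posixpath.normpath(parts.path) if parts.path else ""
    return parts._replace(path=path).geturl()
```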

is_valid(url)

  1. Rejects banned URLs from not_allowed.
  2. Only allows http/https schemes.
  3. Only allows domains ending in:
    • ics.uci.edu
    • stat.uci.edu
    • informatics.uci.edu
    • cs.uci.edu
  4. Rejects URLs containing calendar or stayconnected.
  5. Rejects URLs longer than 300 characters.
  6. Checks robots.txt permissions via urllib.robotparser (see the sketch after this list).
  7. Applies regex filtering for invalid URL patterns.
  8. Handles SSL exceptions gracefully by returning False.
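
A condensed sketch of these checks, assuming the not_allowed set from the globals above. The extension regex and the per-call robots.txt fetch are illustrative assumptions; a real crawler would typically cache one RobotFileParser per domain rather than re-reading robots.txt for every URL.

```python
import re
from urllib import robotparser
from urllib.parse import urlparse

ALLOWED_DOMAINS = ("ics.uci.edu", "stat.uci.edu", "informatics.uci.edu", "cs.uci.edu")

def is_valid(url):
    try:
        if url in not_allowed or len(url) > 300:
            return False
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https"):
            return False
        if not any(parsed.netloc.endswith(d) for d in ALLOWED_DOMAINS):
            return False
        if "calendar" in url or "stayconnected" in url:
            return False
        # Filter obvious non-HTML resources by extension (pattern is illustrative).
        if re.search(r"\.(css|js|pdf|png|jpe?g|gif|zip|tar|gz|mp[34])$", parsed.path.lower()):
            return False
        # Check robots.txt permissions via urllib.robotparser.
        rp = robotparser.RobotFileParser()
        rp.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
        rp.read()
        return rp.can_fetch("*", url)
    except Exception:
        # SSL or parsing errors: treat the URL as not crawlable.
        return False
```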

📄 Output Files

  • result.txt: Final statistics and sorted data.
  • not_allowed.txt: URLs disallowed by robots.txt.
  • visited.txt: Successfully visited pages.
  • ING_trap.txt: Trap URLs (too deep).
  • ING_empty_file.txt: Empty pages.
  • ING_too_large_file.txt: Pages too large to parse.
  • ING_similar.txt: Duplicate/similar pages.

📚 Dependencies

  • BeautifulSoup (bs4): third-party HTML parser
  • urllib.parse (urlparse): standard library
  • urllib.robotparser (robotparser): standard library
  • posixpath: standard library
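
Only BeautifulSoup needs to be installed; the rest ship with the Python 3 standard library. The corresponding imports would look roughly like this (the exact set of urllib.parse functions used is an assumption):

```python
from bs4 import BeautifulSoup                           # pip install beautifulsoup4
from urllib.parse import urlparse, urljoin, urldefrag   # URL parsing and normalization
from urllib import robotparser                          # robots.txt permission checks
import posixpath                                        # path normalization
import re                                               # URL pattern filtering
```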

✅ Notes

  • Make sure to run the crawler within allowed domains and under reasonable depth limits.
  • Adjust the thresholds (e.g., trap depth, similarity, page size) as requirements change.
