Date: 05/03/2024
This project is a web crawler developed for CS121 (Information Retrieval). It crawls allowed web pages within certain domains, collects word frequencies, tracks visited URLs, and avoids traps such as infinite crawling and duplicate pages.
The crawler maintains the following state:

- `not_allowed`: URLs disallowed by `robots.txt`.
- `visited_page`: Total number of pages visited.
- `visited`: Set of visited URLs (to avoid duplication).
- `longest_number`: Word count of the longest page.
- `longest_url`: URL of the longest page.
- `WordCount`: Dictionary of word frequencies.
- `domain`: Dictionary of domain frequencies.
- `depth`: Tracks page depth to detect traps.
- `finger_print`: Stores page fingerprints for similarity detection.
- `stop_words`: Set of words ignored during word counting.
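As a rough sketch, these attributes could be grouped into a single state object; the class name `CrawlerState`, the types, and the defaults below are illustrative assumptions, not the project's actual code:

```python
from collections import defaultdict

class CrawlerState:
    """Illustrative container for the bookkeeping attributes listed above."""

    def __init__(self, stop_words=None):
        self.not_allowed = set()                  # URLs disallowed by robots.txt
        self.visited = set()                      # visited URLs, to avoid duplication
        self.visited_page = 0                     # total number of pages visited
        self.longest_number = 0                   # word count of the longest page
        self.longest_url = ""                     # URL of the longest page
        self.WordCount = defaultdict(int)         # word -> frequency
        self.domain = defaultdict(int)            # domain -> frequency
        self.depth = defaultdict(int)             # depth bookkeeping used for trap detection
        self.finger_print = []                    # stored page fingerprints for similarity checks
        self.stop_words = set(stop_words or [])   # words ignored while counting
```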
The main crawling routine:

- Calls `extract_next_links` on each valid URL.
- Handles and skips unknown exceptions gracefully.
- Displays and writes the longest page's URL and word count.
- Outputs (a reporting sketch follows this list):
  - `result.txt`: Summary including the total page count, the longest URL, and word and domain frequencies sorted by count.
  - `not_allowed.txt`: URLs disallowed by `robots.txt`.
  - `visited.txt`: All visited URLs.
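A hedged sketch of how the `result.txt` summary might be written, reusing the illustrative `CrawlerState` from the earlier sketch; the file layout and sort order shown here are assumptions rather than the project's exact format:

```python
def write_results(state, path="result.txt"):
    """Write total pages, the longest page, and frequency tables sorted by count (illustrative)."""
    with open(path, "w", encoding="utf-8") as out:
        out.write(f"Total pages visited: {state.visited_page}\n")
        out.write(f"Longest page: {state.longest_url} ({state.longest_number} words)\n")

        out.write("\nWord frequencies (most common first):\n")
        for word, count in sorted(state.WordCount.items(), key=lambda kv: kv[1], reverse=True):
            out.write(f"{word}\t{count}\n")

        out.write("\nDomain frequencies (most common first):\n")
        for dom, count in sorted(state.domain.items(), key=lambda kv: kv[1], reverse=True):
            out.write(f"{dom}\t{count}\n")
```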
Behavior of `extract_next_links`:

- Checks for a 200 OK status; returns `[]` otherwise (the response-handling sketch after this list illustrates these early checks).
- Detects traps based on depth (> 20) and logs them to `ING_trap.txt`.
- Determines the encoding from the `Content-Type` header, defaulting to `utf-8`.
- Parses the HTML with `BeautifulSoup`.
- Discards empty pages and overly large files (> 1,000,000 characters), logging them to `ING_empty_file.txt` and `ING_too_large_file.txt`.
- Computes fingerprints from hashed 3-word sequences; if similarity exceeds 95%, skips the page and logs it to `ING_similar.txt` (see the fingerprint sketch below).
- Extracts and normalizes links using `urllib` and `posixpath` (see the normalization sketch below).
- Updates `WordCount`, `longest_url`, the `domain` frequencies, and the `visited_page` count.
- Adds valid links to the return list and marks them as visited.
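The early checks described above (status code, trap depth, charset handling, empty/oversized pages) could be organized roughly as follows. The helper names `get_page_text` and `log`, the `depth` parameter, and the assumption that `resp.raw_response` behaves like a `requests` response are all illustrative, not taken from the project:

```python
from bs4 import BeautifulSoup

MAX_CHARS = 1_000_000   # pages longer than this are skipped as too large
MAX_DEPTH = 20          # URLs deeper than this are treated as traps

def log(url, filename):
    """Append a skipped URL to the named log file."""
    with open(filename, "a", encoding="utf-8") as f:
        f.write(url + "\n")

def get_page_text(url, resp, depth):
    """Return the visible text of a page, or None if it should be skipped (illustrative)."""
    if resp.status != 200:                       # only 200 OK responses are processed
        return None
    if depth > MAX_DEPTH:                        # likely an infinite path / trap
        log(url, "ING_trap.txt")
        return None

    # Pick the charset from the Content-Type header, falling back to utf-8.
    content_type = resp.raw_response.headers.get("Content-Type", "")
    encoding = content_type.split("charset=")[-1].strip() if "charset=" in content_type else "utf-8"
    try:
        html = resp.raw_response.content.decode(encoding, errors="ignore")
    except LookupError:                          # unknown charset name
        html = resp.raw_response.content.decode("utf-8", errors="ignore")

    if not html.strip():                         # empty page
        log(url, "ING_empty_file.txt")
        return None
    if len(html) > MAX_CHARS:                    # overly large page
        log(url, "ING_too_large_file.txt")
        return None

    return BeautifulSoup(html, "html.parser").get_text()
```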
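A minimal sketch of the fingerprinting step: hash every 3-word window of the page text and treat a page as a near-duplicate when its hash set overlaps a previously stored set by more than 95%. Measuring overlap against the smaller set is an assumption; the project's exact similarity measure is not given in this README:

```python
def fingerprint(words, n=3):
    """Hash every n-word window (shingle) of the page into a set of fingerprints."""
    return {hash(" ".join(words[i:i + n])) for i in range(len(words) - n + 1)}

def is_near_duplicate(new_fp, seen_fingerprints, threshold=0.95):
    """Return True if new_fp overlaps any previously seen fingerprint set above the threshold."""
    for old_fp in seen_fingerprints:
        if not new_fp or not old_fp:
            continue
        overlap = len(new_fp & old_fp) / min(len(new_fp), len(old_fp))
        if overlap > threshold:
            return True
    return False

# Usage sketch:
#   words = page_text.lower().split()
#   fp = fingerprint(words)
#   if is_near_duplicate(fp, finger_print):
#       log(url, "ING_similar.txt")   # skip the page
#   else:
#       finger_print.append(fp)       # remember it for future comparisons
```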
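Link extraction and normalization with `urllib` and `posixpath` might look like this sketch: resolve relative links against the page URL, strip fragments, and collapse redundant path segments. The function name `extract_links` and the specific normalization steps are assumptions:

```python
import posixpath
from urllib.parse import urljoin, urldefrag, urlparse, urlunparse

from bs4 import BeautifulSoup

def extract_links(base_url, html):
    """Collect absolute, defragmented, path-normalized links from an HTML page (illustrative)."""
    soup = BeautifulSoup(html, "html.parser")
    links = set()
    for anchor in soup.find_all("a", href=True):
        absolute = urljoin(base_url, anchor["href"])   # resolve relative links
        absolute, _fragment = urldefrag(absolute)      # drop "#..." fragments
        parts = urlparse(absolute)
        clean_path = posixpath.normpath(parts.path) if parts.path else ""
        links.add(urlunparse(parts._replace(path=clean_path)))
    return links
```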
URL validation rules (a combined sketch follows this list):

- Rejects URLs already recorded in `not_allowed`.
- Only allows the `http` and `https` schemes.
- Only allows domains ending in `ics.uci.edu`, `stat.uci.edu`, `informatics.uci.edu`, or `cs.uci.edu`.
- Rejects URLs containing `calendar` or `stayconnected`.
- Rejects URLs longer than 300 characters.
- Checks `robots.txt` permissions via `urllib.robotparser`.
- Applies regex filtering for invalid URL patterns.
- Handles SSL exceptions politely by returning `False`.
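Putting the rules above into a single validator could look like the sketch below. The function name `is_valid`, the per-host `RobotFileParser` cache, and the mutation of `not_allowed` are assumptions, and the regex filtering of invalid URL patterns is omitted for brevity:

```python
import ssl
from urllib import robotparser
from urllib.parse import urlparse

ALLOWED_DOMAINS = ("ics.uci.edu", "stat.uci.edu", "informatics.uci.edu", "cs.uci.edu")
BANNED_SUBSTRINGS = ("calendar", "stayconnected")
MAX_URL_LENGTH = 300

_robot_parsers = {}  # host -> RobotFileParser, cached so robots.txt is fetched once per host

def is_valid(url, not_allowed):
    """Apply the scheme, domain, length, and robots.txt rules described above (illustrative)."""
    try:
        if url in not_allowed or len(url) > MAX_URL_LENGTH:
            return False
        if any(token in url for token in BANNED_SUBSTRINGS):
            return False

        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https"):
            return False
        if not parsed.netloc.endswith(ALLOWED_DOMAINS):
            return False

        # Consult robots.txt for this host via urllib.robotparser.
        host = f"{parsed.scheme}://{parsed.netloc}"
        if host not in _robot_parsers:
            rp = robotparser.RobotFileParser(host + "/robots.txt")
            rp.read()
            _robot_parsers[host] = rp
        if not _robot_parsers[host].can_fetch("*", url):
            not_allowed.add(url)
            return False
        return True
    except ssl.SSLError:
        # Be polite about SSL problems: simply reject the URL.
        return False
    except Exception:
        # Any other unexpected failure (e.g., while fetching robots.txt) also rejects the URL.
        return False
```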
Output files:

- `result.txt`: Final statistics and sorted data.
- `not_allowed.txt`: URLs disallowed by `robots.txt`.
- `visited.txt`: Successfully visited pages.
- `ING_trap.txt`: Trap URLs (too deep).
- `ING_empty_file.txt`: Empty pages.
- `ING_too_large_file.txt`: Pages too large to parse.
- `ING_similar.txt`: Duplicate/similar pages.
Dependencies: `BeautifulSoup` (`bs4`), `urllib` (`urlparse`, `robotparser`), and `posixpath`.
- Make sure to run the crawler only within the allowed domains and under reasonable depth limits.
- Adjust thresholds (e.g., trap depth, similarity, page size) as requirements change (see the configuration sketch below).
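If those thresholds need tuning later, it may help to keep them in one place; the constant names below are arbitrary, and the values simply mirror the numbers quoted in this README:

```python
# Tunable crawler thresholds (values mirror the numbers quoted in this README).
TRAP_DEPTH_LIMIT = 20         # pages deeper than this are logged as traps
SIMILARITY_THRESHOLD = 0.95   # fingerprint overlap above this counts as a duplicate
MAX_PAGE_CHARS = 1_000_000    # pages longer than this are skipped
MAX_URL_LENGTH = 300          # URLs longer than this are rejected
```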
