@zees-dev
Collaborator

Description

This PR introduces the bless-crawl host module integration.
It allows a WASM app to use the host runtime's functionality to scrape web pages.

The host runtime must implement the following host FFI functions:

  • scrape
  • map
  • crawl
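
To illustrate how the three operations relate, here is a rough sketch that models them as a trait with an in-memory mock host. Everything in it is hypothetical: `CrawlHost`, `MockHost`, and the method signatures are made up for illustration and are not the SDK's actual FFI bindings, which live in src/bless_crawl/mod.rs and may look quite different.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical stand-in for the three host-provided operations.
/// The real FFI signatures may differ.
trait CrawlHost {
    /// Fetch one page and return its content.
    fn scrape(&self, url: &str) -> Result<String, String>;
    /// List the links found on one page.
    fn map(&self, url: &str) -> Result<Vec<String>, String>;
    /// Follow links from a start page, visiting at most `max_pages` pages.
    fn crawl(&self, url: &str, max_pages: usize) -> Result<Vec<String>, String>;
}

/// In-memory host used only for illustration: URL -> (body, outgoing links).
struct MockHost {
    pages: HashMap<String, (String, Vec<String>)>,
}

impl CrawlHost for MockHost {
    fn scrape(&self, url: &str) -> Result<String, String> {
        self.pages
            .get(url)
            .map(|(body, _)| body.clone())
            .ok_or_else(|| format!("not found: {url}"))
    }

    fn map(&self, url: &str) -> Result<Vec<String>, String> {
        self.pages
            .get(url)
            .map(|(_, links)| links.clone())
            .ok_or_else(|| format!("not found: {url}"))
    }

    fn crawl(&self, url: &str, max_pages: usize) -> Result<Vec<String>, String> {
        // Breadth-first traversal built on `map`, visiting each URL once.
        // In this sketch a missing page aborts the crawl with an error.
        let mut visited: Vec<String> = Vec::new();
        let mut queue = VecDeque::from([url.to_string()]);
        while let Some(next) = queue.pop_front() {
            if visited.len() >= max_pages {
                break;
            }
            if visited.contains(&next) {
                continue;
            }
            let links = self.map(&next)?;
            visited.push(next);
            queue.extend(links);
        }
        Ok(visited)
    }
}
```

In this model `crawl` is simply repeated `map` calls with a page limit; a real host may implement it differently.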

This pull request adds web scraping, HTML transformation, and Markdown conversion to the SDK.
It also updates dependencies and documentation to support these features. The most important changes, grouped by theme:

New Features

  • Added a new example web-scrape.rs demonstrating web scraping capabilities using the Blockless SDK. This includes examples for basic scraping, link mapping, and recursive crawling. (examples/web-scrape.rs)
  • Implemented parse_markdown in html_to_markdown.rs to convert HTML to Markdown, process multi-line links, and remove "Skip to Content" links. Includes comprehensive test cases. (src/bless_crawl/html_to_markdown.rs)
  • Developed transform_html in html_transform.rs for filtering and processing HTML content, including removing unwanted elements, handling relative URLs, and processing srcset attributes. Includes extensive test coverage. (src/bless_crawl/html_transform.rs)
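
As a rough illustration of the Markdown post-processing steps mentioned above (multi-line links and "Skip to Content" links), here is a minimal std-only sketch. The function names `escape_newlines_in_links` and `remove_skip_links` are hypothetical, and the approach is an assumption about one plausible way to do this, not the code in html_to_markdown.rs (which may rely on the regex crate instead).

```rust
/// Escape newlines that fall inside link text (`[ ... ]`) so the link stays
/// on one logical line. Bracket depth is tracked with a simple counter.
fn escape_newlines_in_links(markdown: &str) -> String {
    let mut out = String::with_capacity(markdown.len());
    let mut depth: usize = 0;
    for ch in markdown.chars() {
        match ch {
            '[' => depth += 1,
            ']' => depth = depth.saturating_sub(1),
            _ => {}
        }
        if ch == '\n' && depth > 0 {
            // Inside link text: emit a backslash before the newline.
            out.push('\\');
        }
        out.push(ch);
    }
    out
}

/// Remove in-page "Skip to Content" links such as `[Skip to Content](#main)`,
/// matching the label case-insensitively.
fn remove_skip_links(markdown: &str) -> String {
    const PREFIX: &str = "[skip to content](#";
    // ASCII lowercasing keeps byte offsets identical to the original string.
    let lower = markdown.to_ascii_lowercase();
    let mut out = String::with_capacity(markdown.len());
    let mut pos = 0;
    while let Some(found) = lower[pos..].find(PREFIX) {
        let start = pos + found;
        let after_prefix = start + PREFIX.len();
        match lower[after_prefix..].find(')') {
            Some(close) => {
                // Keep the text before the link, then skip past its ')'.
                out.push_str(&markdown[pos..start]);
                pos = after_prefix + close + 1;
            }
            None => break, // unterminated link: leave the rest untouched
        }
    }
    out.push_str(&markdown[pos..]);
    out
}
```

A caller would typically chain the two, for example `remove_skip_links(&escape_newlines_in_links(&md))`, after the HTML-to-Markdown conversion has run.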

Dependency Updates

  • Updated Cargo.toml to include new dependencies: htmd for HTML-to-Markdown conversion, kuchikiki for HTML parsing, regex for pattern matching, and url for URL handling. Adjusted serde_json features for compatibility. (Cargo.toml)

Documentation Enhancements

  • Updated README.md to include the new web-scrape example in the list of supported examples, highlighting its functionality for scraping content from a URL. (README.md)

Codebase Integration

  • Exposed the bless_crawl module at the library level to make the new scraping and transformation utilities accessible to external users. (src/lib.rs)

Copilot AI left a comment

Pull Request Overview

This PR integrates the bless-crawl module to enable a WASM app to leverage host functionality for web scraping, HTML transformation, and Markdown conversion.

  • Added the bless-crawl module with FFI bindings for scrape, map, and crawl functions
  • Introduced HTML transformation and Markdown parsing utilities with comprehensive test cases
  • Updated examples, documentation, and dependencies to support the new web scraping features

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/lib.rs | Exposes the new bless_crawl module to external users |
| src/bless_crawl/mod.rs | Implements core web scraping functionality and FFI bindings, configs, and errors |
| src/bless_crawl/html_transform.rs | Provides HTML transformation utilities including element filtering and URL adjustments |
| src/bless_crawl/html_to_markdown.rs | Converts HTML content to Markdown with additional link processing |
| examples/web-scrape.rs | Demonstrates web scraping usage with the BlessCrawl interface |
| README.md | Updates documentation to include the new web scraping example |
| Cargo.toml | Adds new dependencies and updates configurations for HTML and Markdown conversion |

@zees-dev zees-dev merged commit f7d0d52 into main Jun 26, 2025
1 check passed
@zees-dev zees-dev deleted the feat/bless-crawl branch June 26, 2025 13:55