feat: bless-crawler module #23
Merged
Pull Request Overview
This PR integrates the bless-crawl module to enable a WASM app to leverage host functionality for web scraping, HTML transformation, and Markdown conversion.
- Added the bless-crawl module with FFI bindings for scrape, map, and crawl functions
- Introduced HTML transformation and Markdown parsing utilities with comprehensive test cases
- Updated examples, documentation, and dependencies to support the new web scraping features
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/lib.rs | Exposes the new bless_crawl module to external users |
| src/bless_crawl/mod.rs | Implements core web scraping functionality and FFI bindings, configs, and errors |
| src/bless_crawl/html_transform.rs | Provides HTML transformation utilities including element filtering and URL adjustments |
| src/bless_crawl/html_to_markdown.rs | Converts HTML content to Markdown with additional link processing |
| examples/web-scrape.rs | Demonstrates web scraping usage with the BlessCrawl interface |
| README.md | Updates documentation to include the new web scraping example |
| Cargo.toml | Adds new dependencies and updates configurations for HTML and Markdown conversion |
Joinhack
approved these changes
Jun 26, 2025
Description
Introducing bless-crawl host module integration.
This essentially allows a wasm app to use the host runtime's functionality to scrape web pages.
This requires the following host function FFI to be implemented by the host runtime:
- `scrape`
- `map`
- `crawl`

This pull request adds web scraping, HTML transformation, and Markdown conversion to the SDK.
It also updates dependencies and documentation to support these features. Below are the most important changes grouped by theme:
New Features
- Added a `web-scrape.rs` example demonstrating web scraping capabilities using the Blockless SDK. This includes examples for basic scraping, link mapping, and recursive crawling. (`examples/web-scrape.rs`)
- Added `parse_markdown` in `html_to_markdown.rs` to convert HTML to Markdown, process multi-line links, and remove "Skip to Content" links. Includes comprehensive test cases. (`src/bless_crawl/html_to_markdown.rs`)
- Added `transform_html` in `html_transform.rs` for filtering and processing HTML content, including removing unwanted elements, handling relative URLs, and processing `srcset` attributes. Includes extensive test coverage. (`src/bless_crawl/html_transform.rs`)
Dependency Updates
- Updated `Cargo.toml` to include new dependencies: `htmd` for HTML-to-Markdown conversion, `kuchikiki` for HTML parsing, `regex` for pattern matching, and `url` for URL handling. Adjusted `serde_json` features for compatibility. (`Cargo.toml`)
Documentation Enhancements
- Updated `README.md` to include the new `web-scrape` example in the list of supported examples, highlighting its functionality for scraping content from a URL. (`README.md`)
Codebase Integration
- Exposed the `bless_crawl` module at the library level to make the new scraping and transformation utilities accessible to external users. (`src/lib.rs`)
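
To illustrate the "Skip to Content" link removal mentioned under New Features, here is a minimal sketch of what such a post-processing step could look like. It is not the code from this PR: the function name `remove_skip_to_content_links` is hypothetical, and it uses only the standard library, whereas the PR lists `regex` as a dependency and may well take a pattern-based approach instead.

```rust
/// Hypothetical helper (not part of this PR): strips Markdown links whose
/// text is "Skip to Content" (case-insensitive), e.g. `[Skip to Content](#main)`.
fn remove_skip_to_content_links(markdown: &str) -> String {
    const NEEDLE: &str = "[skip to content](";
    // ASCII lowercasing changes no byte lengths, so offsets found in `lower`
    // are valid offsets into `markdown` as well.
    let lower = markdown.to_ascii_lowercase();
    let mut out = String::with_capacity(markdown.len());
    let mut pos = 0;

    while let Some(found) = lower[pos..].find(NEEDLE) {
        let start = pos + found;
        let target_start = start + NEEDLE.len();
        // Look for the ')' that closes the link target.
        match lower[target_start..].find(')') {
            Some(close) => {
                // Keep the text before the link, then skip past the link.
                out.push_str(&markdown[pos..start]);
                pos = target_start + close + 1;
            }
            // Unterminated link: leave the rest of the input untouched.
            None => break,
        }
    }

    out.push_str(&markdown[pos..]);
    out
}

fn main() {
    let md = "[Skip to Content](#main)\n# Title\nSee [docs](https://example.com/docs).";
    let cleaned = remove_skip_to_content_links(md);
    println!("{cleaned}");
}
```

The sketch deliberately leaves unterminated links alone and does not handle link targets that themselves contain a `)`; a real implementation would need to decide how to treat those cases.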