feat: bless-crawler module #23

zees-dev · 2025-06-25T04:52:45Z

Description

Introduces the bless-crawl host module integration.
It lets a WASM app use the host runtime's functionality to scrape web pages.

The host runtime must implement the following FFI host functions:

scrape
map
crawl
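
To illustrate the shape of this boundary, here is a minimal sketch that models the three host functions as a trait. Everything in it is an assumption made for illustration: the `CrawlHost` trait, the `ExitCode` alias, the buffer-based signatures, and the convention that `0` means success are hypothetical and are not the SDK's actual FFI declarations. The `UnavailableHost` stand-in loosely mirrors the idea of a mock implementation that returns `1` as its exit code.

```rust
// Hypothetical sketch: names and signatures are assumptions, not the SDK's FFI.

/// Exit code returned by a host call; 0 is assumed to mean success.
pub type ExitCode = u8;

/// The three operations the host runtime is expected to provide.
pub trait CrawlHost {
    /// Fetch a single page and append the response bytes to `out`.
    fn scrape(&self, url: &str, out: &mut Vec<u8>) -> ExitCode;
    /// Discover the links reachable from `url`.
    fn map(&self, url: &str, out: &mut Vec<u8>) -> ExitCode;
    /// Recursively crawl starting at `url`.
    fn crawl(&self, url: &str, out: &mut Vec<u8>) -> ExitCode;
}

/// Stand-in for builds without a host runtime: every call fails with 1.
pub struct UnavailableHost;

impl CrawlHost for UnavailableHost {
    fn scrape(&self, _url: &str, _out: &mut Vec<u8>) -> ExitCode { 1 }
    fn map(&self, _url: &str, _out: &mut Vec<u8>) -> ExitCode { 1 }
    fn crawl(&self, _url: &str, _out: &mut Vec<u8>) -> ExitCode { 1 }
}

/// Test double that answers every call with a fixed body.
pub struct StaticHost {
    pub body: &'static str,
}

impl CrawlHost for StaticHost {
    fn scrape(&self, _url: &str, out: &mut Vec<u8>) -> ExitCode {
        out.extend_from_slice(self.body.as_bytes());
        0
    }
    fn map(&self, _url: &str, out: &mut Vec<u8>) -> ExitCode {
        out.extend_from_slice(self.body.as_bytes());
        0
    }
    fn crawl(&self, _url: &str, out: &mut Vec<u8>) -> ExitCode {
        out.extend_from_slice(self.body.as_bytes());
        0
    }
}

/// Convert a raw exit code plus output buffer into a `Result`.
pub fn call_scrape(host: &dyn CrawlHost, url: &str) -> Result<String, ExitCode> {
    let mut buf = Vec::new();
    match host.scrape(url, &mut buf) {
        0 => Ok(String::from_utf8_lossy(&buf).into_owned()),
        code => Err(code),
    }
}
```

With this sketch, `call_scrape(&UnavailableHost, "https://example.com")` yields `Err(1)`, while a `StaticHost` returns its fixed body. Wrapping the raw exit code in a `Result` keeps error handling on the caller side idiomatic.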

This pull request adds web scraping, HTML transformation, and Markdown conversion to the SDK.
It also updates dependencies and documentation to support these features. The most important changes, grouped by theme:

New Features

Added a new example web-scrape.rs demonstrating web scraping capabilities using the Blockless SDK. This includes examples for basic scraping, link mapping, and recursive crawling. (examples/web-scrape.rs)
Implemented parse_markdown in html_to_markdown.rs to convert HTML to Markdown, process multi-line links, and remove "Skip to Content" links. Includes comprehensive test cases. (src/bless_crawl/html_to_markdown.rs)
Developed transform_html in html_transform.rs for filtering and processing HTML content, including removing unwanted elements, handling relative URLs, and processing srcset attributes. Includes extensive test coverage. (src/bless_crawl/html_transform.rs)
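
As a rough illustration of the relative-URL and srcset handling mentioned for transform_html, the sketch below rewrites each srcset candidate against a base URL. It is not the code from html_transform.rs: `resolve_simple` and `rewrite_srcset` are hypothetical helpers, written with the standard library only, and the resolution logic is deliberately simplified (no `..` normalization, no special handling of `data:` URIs, and a naive comma split). A real implementation would more likely rely on the `url` crate listed in the dependency updates.

```rust
/// Hypothetical helper: resolve `reference` against `base` in a simplified way.
/// Assumes `base` is an absolute URL such as "https://example.com/blog/post.html".
pub fn resolve_simple(base: &str, reference: &str) -> String {
    // A reference that already carries a scheme is left untouched.
    if reference.contains("://") {
        return reference.to_string();
    }
    // Without an absolute base there is nothing to resolve against.
    let (scheme, rest) = match base.split_once("://") {
        Some(parts) => parts,
        None => return reference.to_string(),
    };
    // Protocol-relative reference: reuse the base scheme.
    if let Some(stripped) = reference.strip_prefix("//") {
        return format!("{scheme}://{stripped}");
    }
    // Split the remainder of the base into host and path.
    let (host, path) = match rest.find('/') {
        Some(i) => (&rest[..i], &rest[i..]),
        None => (rest, "/"),
    };
    // Root-relative reference: attach directly to the origin.
    if reference.starts_with('/') {
        return format!("{scheme}://{host}{reference}");
    }
    // Otherwise resolve relative to the directory of the base path.
    let dir_end = path.rfind('/').map(|i| i + 1).unwrap_or(0);
    let dir = &path[..dir_end];
    format!("{scheme}://{host}{dir}{reference}")
}

/// Hypothetical helper: rewrite every URL in a `srcset` value to absolute form.
pub fn rewrite_srcset(base: &str, srcset: &str) -> String {
    srcset
        .split(',')
        .map(str::trim)
        .filter(|candidate| !candidate.is_empty())
        .map(|candidate| {
            // Each candidate is "<url> [descriptor]", for example "a.png 2x".
            let mut parts = candidate.split_whitespace();
            let url = parts.next().unwrap_or("");
            let resolved = resolve_simple(base, url);
            match parts.next() {
                Some(descriptor) => format!("{resolved} {descriptor}"),
                None => resolved,
            }
        })
        .collect::<Vec<_>>()
        .join(", ")
}
```

For example, with the base `https://example.com/blog/post.html`, the srcset `img/a.png 1x, /img/b.png 2x` becomes `https://example.com/blog/img/a.png 1x, https://example.com/img/b.png 2x`.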

Dependency Updates

Updated Cargo.toml to include new dependencies: htmd for HTML-to-Markdown conversion, kuchikiki for HTML parsing, regex for pattern matching, and url for URL handling. Adjusted serde_json features for compatibility. (Cargo.toml)

Documentation Enhancements

Updated README.md to include the new web-scrape example in the list of supported examples, highlighting its functionality for scraping content from a URL. (README.md)

Codebase Integration

Exposed the bless_crawl module at the library level to make the new scraping and transformation utilities accessible to external users. (src/lib.rs)

Copilot

Pull Request Overview

This PR integrates the bless-crawl module to enable a WASM app to leverage host functionality for web scraping, HTML transformation, and Markdown conversion.

Added the bless-crawl module with FFI bindings for scrape, map, and crawl functions
Introduced HTML transformation and Markdown parsing utilities with comprehensive test cases
Updated examples, documentation, and dependencies to support the new web scraping features
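
One piece of the Markdown post-processing named in the description, removing "Skip to Content" links, can be sketched as follows. This is an illustrative, standard-library-only version: `remove_skip_links` is a hypothetical name, and the actual parse_markdown implementation may work differently (for instance with the `regex` crate). The sketch does not cover multi-line link processing.

```rust
/// Hypothetical helper: drop Markdown links whose label is "Skip to Content"
/// (compared case-insensitively), leaving all other text untouched.
pub fn remove_skip_links(markdown: &str) -> String {
    let mut out = String::with_capacity(markdown.len());
    let mut rest = markdown;

    while let Some(open) = rest.find('[') {
        let after_open = &rest[open + 1..];

        // Try to read "label](target)" right after the opening bracket.
        // `consumed` counts the bytes of `after_open` covered by the link.
        let parsed = after_open.find("](").and_then(|label_end| {
            let label = &after_open[..label_end];
            let after_label = &after_open[label_end + 2..];
            after_label
                .find(')')
                .map(|target_end| (label, label_end + 2 + target_end + 1))
        });

        match parsed {
            Some((label, consumed))
                if label.trim().eq_ignore_ascii_case("skip to content") =>
            {
                // Keep the text before the link, then skip the link itself.
                out.push_str(&rest[..open]);
                rest = &after_open[consumed..];
            }
            _ => {
                // Not a skip link: keep the bracket and continue after it.
                out.push_str(&rest[..=open]);
                rest = after_open;
            }
        }
    }

    out.push_str(rest);
    out
}
```

Ordinary links such as `[docs](https://example.com)` pass through unchanged, since only the label text decides whether a link is dropped.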

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

File	Description
src/lib.rs	Exposes the new bless_crawl module to external users
src/bless_crawl/mod.rs	Implements core web scraping functionality, FFI bindings, configs, and errors
src/bless_crawl/html_transform.rs	Provides HTML transformation utilities including element filtering and URL adjustments
src/bless_crawl/html_to_markdown.rs	Converts HTML content to Markdown with additional link processing
examples/web-scrape.rs	Demonstrates web scraping usage with the BlessCrawl interface
README.md	Updates documentation to include the new web scraping example
Cargo.toml	Adds new dependencies and updates configurations for HTML and Markdown conversion

zees-dev added 7 commits June 25, 2025 16:47

upd cargo.toml deps

85f6b01

bless-crawl plugin impl

ea2d17d

html to markdown impl

deefc87

html transformation impl for include and exclude tags

dd01f92

bless-crawl plugin impl - lib

3f82c40

webscrape example

057835f

readme

02316d3

zees-dev requested review from Joinhack, Copilot, michalzajda and uditdc June 25, 2025 04:52

Copilot AI reviewed Jun 25, 2025

src/bless_crawl/mod.rs (outdated, resolved)

zees-dev added 5 commits June 25, 2025 16:53

cargo fmt --all

f161018

fixed clippy errors

260340a

fixed clippy warnings

7460e59

return 1 as exitcode for mock-ffi impl

eb02b5f

fixed doc tests

1e9295f

zees-dev mentioned this pull request Jun 25, 2025

bless-crawl plugin support blessnetwork/javy-bless-plugins#14

Merged

Joinhack approved these changes Jun 26, 2025

zees-dev merged commit f7d0d52 into main Jun 26, 2025
1 check passed

zees-dev deleted the feat/bless-crawl branch June 26, 2025 13:55

3 participants
