
Commit 9c91855

Merge pull request #1 from trybyte-app/add-stress-tests-and-docs

2 parents: 472a7be + 08e327e

File tree: 9 files changed, +406 -58 lines


.prettierrc

Lines changed: 3 additions & 0 deletions

@@ -0,0 +1,3 @@
+{
+	"useTabs": true
+}

README.md

Lines changed: 47 additions & 0 deletions
@@ -276,6 +276,53 @@ The parser supports standard robots.txt pattern syntax:
 
 **Priority**: When both Allow and Disallow match, the longer pattern wins.
 
+## Production Usage
+
+This library is designed for correctness and RFC 9309 compliance. When using it in production environments that fetch robots.txt from untrusted sources, consider these safeguards:
+
+### File Size Limits
+
+The library does not enforce file size limits. Both RFC 9309 and Google require parsing at least 500 KiB. Implement size checks before parsing:
+
+```typescript
+const MAX_ROBOTS_SIZE = 500 * 1024; // 500 KiB (per RFC 9309)
+
+async function fetchAndParse(url: string) {
+	const response = await fetch(url);
+	const contentLength = response.headers.get('content-length');
+
+	if (contentLength && parseInt(contentLength) > MAX_ROBOTS_SIZE) {
+		throw new Error('robots.txt too large');
+	}
+
+	const text = await response.text();
+	if (text.length > MAX_ROBOTS_SIZE) {
+		throw new Error('robots.txt too large');
+	}
+
+	return ParsedRobots.parse(text);
+}
+```
+
+### Timeouts
+
+Implement timeouts when fetching robots.txt to prevent hanging requests.
+
+## Google-Specific Behaviors
+
+This library is a port of Google's C++ parser and includes several behaviors that are Google-specific extensions beyond RFC 9309:
+
+| Behavior | Google | RFC 9309 |
+|----------|--------|----------|
+| **Line length limit** | Truncates at 16,664 bytes | No limit specified |
+| **Typo tolerance** | Accepts "disalow", "useragent", etc. | "MAY be lenient" (unspecified) |
+| **index.html normalization** | `Allow: /path/index.html` also allows `/path/` | Not specified |
+| **User-agent `*` with trailing text** | `* foo` treated as global agent | Not specified |
+
+The core matching behavior (longest-match-wins, case-insensitive user-agent matching, UTF-8 encoding) follows RFC 9309.
+
+**Note:** This library only handles parsing and matching. HTTP behaviors like redirect following, caching, and status code handling are your responsibility to implement.
+
 ## Project Structure
 
 ```
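The new Timeouts subsection stops short of showing code. Below is a minimal sketch (not part of this commit) of one way to combine the timeout and size checks, assuming `ParsedRobots` is re-exported from the package entry point (the diff does not show the exports) and using the standard `AbortSignal.timeout` helper available in Node 18+ and Bun:

```typescript
// Sketch only: enforce a fetch timeout and a size cap before parsing.
// Assumes ParsedRobots is re-exported from the package root.
import { ParsedRobots } from '@trybyte/robotstxt-parser';

const MAX_ROBOTS_SIZE = 500 * 1024; // 500 KiB (per RFC 9309)
const FETCH_TIMEOUT_MS = 5_000; // example budget; tune for your crawler

async function fetchRobots(url: string) {
	// AbortSignal.timeout aborts the request if the server hangs.
	const response = await fetch(url, { signal: AbortSignal.timeout(FETCH_TIMEOUT_MS) });
	const text = await response.text();
	if (text.length > MAX_ROBOTS_SIZE) {
		throw new Error('robots.txt too large');
	}
	return ParsedRobots.parse(text);
}
```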

TESTS.md

Lines changed: 84 additions & 5 deletions
@@ -20,9 +20,9 @@ This document provides comprehensive documentation of all tests in Google's robo
 
 | Metric | Count |
 | ---------------- | ----------------- |
-| Total Test Files | 4 |
-| Total Test Cases | 196 |
-| Total Assertions | 476 |
+| Total Test Files | 5 |
+| Total Test Cases | 206 |
+| Total Assertions | 495 |
 | Coverage | 100% of C++ tests |
 
 ## Test Naming Conventions

@@ -46,6 +46,7 @@ This document provides comprehensive documentation of all tests in Google's robo
 2. **tests/reporter.test.ts** - Reporting/parsing metadata tests (6 test cases)
 3. **tests/url-utils.test.ts** - URL utility function tests (22 test cases)
 4. **tests/bulk-check.test.ts** - Bulk URL checking API tests (23 test cases)
+5. **tests/stress.test.ts** - Performance and stress tests (10 test cases)
 
 ---
 

@@ -706,6 +707,84 @@ Test 2 - ParsedRobots reuse vs repeated parsing: 3. Batch check (single parse) i
 
 ---
 
+## Category F: Stress Tests (TypeScript Extension)
+
+These tests validate the library's performance and stability under extreme conditions.
+
+### StressTest_LargeFileHandling (stress.test.ts:18-63)
+
+**Purpose**: Tests parsing of large robots.txt files.
+
+**Assertions (3 total)**:
+
+Test 1 - 1MB robots.txt:
+1. Parser completes without crashing → expects TRUE
+2. Completes within 5 seconds → expects TRUE
+
+Test 2 - 100K lines:
+3. Parser handles 100,000 Disallow rules efficiently → expects TRUE
+
+Test 3 - Many user-agent groups:
+4. Parser handles 1,000 separate user-agent groups → expects TRUE
+
+**Edge Cases**: Memory efficiency, parsing speed with large inputs
+
+---
+
+### StressTest_PathologicalPatterns (stress.test.ts:65-124)
+
+**Purpose**: Tests pattern matching with complex wildcard patterns.
+
+**Assertions (3 total)**:
+
+Test 1 - Many wildcards:
+1. Pattern `/a*b*c*d*e*f*g*h*i*j*` matches efficiently → expects TRUE (< 100ms)
+
+Test 2 - Deeply nested wildcards:
+2. Pattern with 16 wildcard segments matches efficiently → expects TRUE
+
+Test 3 - Many rules with same prefix:
+3. 10,000 rules starting with `/api/v1/users/` checked efficiently → expects TRUE
+
+**Edge Cases**: Avoids exponential backtracking in pattern matching
+
+---
+
+### StressTest_BulkURLCheckingPerformance (stress.test.ts:126-146)
+
+**Purpose**: Tests bulk URL checking at scale.
+
+**Assertions (2 total)**:
+
+Test 1 - 10K URLs:
+1. 10,000 URLs processed → expects 10,000 results
+2. Completes under 1 second → expects TRUE
+
+**Edge Cases**: Linear scaling with URL count
+
+---
+
+### StressTest_EdgeCases (stress.test.ts:148-188)
+
+**Purpose**: Tests graceful handling of edge cases.
+
+**Assertions (5 total)**:
+
+Test 1 - Empty robots.txt:
+1. Returns allowed (true) → expects TRUE
+
+Test 2 - Comments only:
+2. Returns allowed (true) → expects TRUE
+
+Test 3 - Malformed URLs:
+3. Empty URL doesn't throw → expects no exception
+4. Invalid URL doesn't throw → expects no exception
+5. Missing scheme URL doesn't throw → expects no exception
+
+**Edge Cases**: Graceful degradation with invalid input
+
+---
+
 ## Helper Classes
 
 ### RobotsStatsReporter (robots_test.cc:765-819)

@@ -869,7 +948,7 @@ The TypeScript port has been verified to provide **100% test coverage** of all C
 bun test
 
 # Expected output:
-# 173 pass
+# 206 pass
 # 0 fail
-# 420 expect() calls
+# 495 expect() calls
 ```
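Category F's bulk-check timing assertion is easy to picture as a bun:test case. The following is an illustrative sketch only, not the actual tests/stress.test.ts added by this commit; the import path and the robots.txt fixture are assumptions:

```typescript
import { expect, test } from 'bun:test';
// Assumed relative import; the real stress tests live in tests/stress.test.ts.
import { ParsedRobots } from '../src/parsed-robots';

test('bulk check: 10,000 URLs complete in under 1 second', () => {
	// Parse once, then reuse the ParsedRobots instance for every URL.
	const robots = ParsedRobots.parse('User-agent: *\nDisallow: /private/\n');
	const urls = Array.from({ length: 10_000 }, (_, i) => `https://example.com/page/${i}`);

	const start = performance.now();
	const results = robots.checkUrls('Googlebot', urls);
	const elapsedMs = performance.now() - start;

	expect(results.length).toBe(10_000);
	expect(elapsedMs).toBeLessThan(1_000);
});
```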

bun.lock

Lines changed: 3 additions & 0 deletions
@@ -6,6 +6,7 @@
       "name": "robotstxt-ts-port",
       "devDependencies": {
         "@types/bun": "latest",
+        "prettier": "^3.7.4",
         "typescript": "^5.0.0",
       },
     },

@@ -17,6 +18,8 @@
 
     "bun-types": ["bun-types@1.3.3", "", { "dependencies": { "@types/node": "*" } }, "sha512-z3Xwlg7j2l9JY27x5Qn3Wlyos8YAp0kKRlrePAOjgjMGS5IG6E7Jnlx736vH9UVI4wUICwwhC9anYL++XeOgTQ=="],
 
+    "prettier": ["prettier@3.7.4", "", { "bin": { "prettier": "bin/prettier.cjs" } }, "sha512-v6UNi1+3hSlVvv8fSaoUbggEM5VErKmmpGA7Pl3HF8V6uKY7rvClBOJlH6yNwQtfTueNkGVpOv/mtWL9L4bgRA=="],
+
     "typescript": ["typescript@5.9.3", "", { "bin": { "tsc": "bin/tsc", "tsserver": "bin/tsserver" } }, "sha512-jl1vZzPDinLr9eUt3J/t7V6FgNEw9QjvBPdysz9KfQDD41fQrC2Y4vKQdiaUpFT4bXlb1RHhLpp8wtm6M5TgSw=="],
 
     "undici-types": ["undici-types@7.16.0", "", {}, "sha512-Zz+aZWSj8LE6zoxD+xrjh4VfkIG8Ya6LvYkZqtUQGJPZjYl53ypCaUwWqo7eI0x66KBGeRo+mlBEkMSeSZ38Nw=="],

index.ts

Lines changed: 0 additions & 1 deletion
This file was deleted.

package.json

Lines changed: 52 additions & 50 deletions
@@ -1,52 +1,54 @@
 {
-  "name": "@trybyte/robotstxt-parser",
-  "version": "1.0.0",
-  "description": "Google's robots.txt parser ported to TypeScript - RFC 9309 compliant",
-  "keywords": [
-    "robots.txt",
-    "robots",
-    "parser",
-    "crawler",
-    "seo",
-    "geo",
-    "google",
-    "rfc9309"
-  ],
-  "homepage": "https://github.com/trybyte-app/robotstxt-ts-port#readme",
-  "bugs": {
-    "url": "https://github.com/trybyte-app/robotstxt-ts-port/issues"
-  },
-  "repository": {
-    "type": "git",
-    "url": "git+https://github.com/trybyte-app/robotstxt-ts-port.git"
-  },
-  "license": "Apache-2.0",
-  "author": "Alireza Esmikhani",
-  "type": "module",
-  "exports": {
-    ".": {
-      "types": "./dist/index.d.ts",
-      "import": "./dist/index.js"
-    }
-  },
-  "main": "./dist/index.js",
-  "types": "./dist/index.d.ts",
-  "directories": {
-    "test": "tests"
-  },
-  "files": [
-    "dist"
-  ],
-  "scripts": {
-    "build": "tsc",
-    "test": "bun test",
-    "prepublishOnly": "bun run build"
-  },
-  "devDependencies": {
-    "@types/bun": "latest",
-    "typescript": "^5.0.0"
-  },
-  "engines": {
-    "node": ">=20.0.0"
-  }
+	"name": "@trybyte/robotstxt-parser",
+	"version": "1.0.0",
+	"description": "Google's robots.txt parser ported to TypeScript - RFC 9309 compliant",
+	"keywords": [
+		"robots.txt",
+		"robots",
+		"parser",
+		"crawler",
+		"seo",
+		"geo",
+		"google",
+		"rfc9309"
+	],
+	"homepage": "https://github.com/trybyte-app/robotstxt-ts-port#readme",
+	"bugs": {
+		"url": "https://github.com/trybyte-app/robotstxt-ts-port/issues"
+	},
+	"repository": {
+		"type": "git",
+		"url": "git+https://github.com/trybyte-app/robotstxt-ts-port.git"
+	},
+	"license": "Apache-2.0",
+	"author": "Byte Team (trybyte.app)",
+	"type": "module",
+	"exports": {
+		".": {
+			"types": "./dist/index.d.ts",
+			"import": "./dist/index.js"
+		}
+	},
+	"main": "./dist/index.js",
+	"types": "./dist/index.d.ts",
+	"directories": {
+		"test": "tests"
+	},
+	"files": [
+		"dist"
+	],
+	"scripts": {
+		"build": "tsc",
+		"test": "bun test",
+		"prepublishOnly": "bun run build",
+		"format": "prettier \"**/*.{js,jsx,mjs,ts,tsx,json,jsonc}\" --write"
+	},
+	"devDependencies": {
+		"@types/bun": "latest",
+		"prettier": "^3.7.4",
+		"typescript": "^5.0.0"
+	},
+	"engines": {
+		"node": ">=20.0.0"
+	}
 }

src/matcher.ts

Lines changed: 16 additions & 0 deletions
@@ -184,6 +184,14 @@ export class RobotsMatcher extends RobotsParseHandler {
 	/**
 	 * Returns true iff 'url' is allowed to be fetched by any member of the
 	 * "userAgents" array. 'url' must be %-encoded according to RFC3986.
+	 *
+	 * Invalid or malformed URLs are handled gracefully - if the path cannot be
+	 * extracted, it defaults to "/" which typically allows access.
+	 *
+	 * @param robotsBody - The robots.txt content to parse
+	 * @param userAgents - Array of user-agent strings to check
+	 * @param url - The URL to check (should be %-encoded per RFC3986)
+	 * @returns true if access is allowed, false if disallowed
 	 */
 	public allowedByRobots(
 		robotsBody: string,

@@ -201,6 +209,14 @@
 	/**
 	 * Do robots check for 'url' when there is only one user agent. 'url' must
 	 * be %-encoded according to RFC3986.
+	 *
+	 * Invalid or malformed URLs are handled gracefully - if the path cannot be
+	 * extracted, it defaults to "/" which typically allows access.
+	 *
+	 * @param robotsTxt - The robots.txt content to parse
+	 * @param userAgent - The user-agent string to check
+	 * @param url - The URL to check (should be %-encoded per RFC3986)
+	 * @returns true if access is allowed, false if disallowed
 	 */
 	public oneAgentAllowedByRobots(
 		robotsTxt: string,
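A usage sketch of the two methods documented above. This is not part of the diff; it assumes `RobotsMatcher` is exported from the package root and takes no constructor arguments (as in Google's C++ API), and the expected results follow from the doc comments rather than from running this commit:

```typescript
import { RobotsMatcher } from '@trybyte/robotstxt-parser';

const robotsTxt = 'User-agent: *\nDisallow: /private/\n';
const matcher = new RobotsMatcher();

// Single user-agent check against a %-encoded URL.
matcher.oneAgentAllowedByRobots(robotsTxt, 'Googlebot', 'https://example.com/private/x'); // false

// Allowed if any of the listed user-agents may fetch the URL.
matcher.allowedByRobots(robotsTxt, ['Googlebot', 'Googlebot-Image'], 'https://example.com/docs'); // true

// A malformed URL does not throw; the path falls back to "/", which is allowed here.
matcher.oneAgentAllowedByRobots(robotsTxt, 'Googlebot', 'not a url'); // true
```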

src/parsed-robots.ts

Lines changed: 9 additions & 2 deletions
@@ -253,8 +253,12 @@ export class ParsedRobots {
 	 * Check multiple URLs for a single user-agent.
 	 * This is the fast operation - O(urls * rules) with no parsing overhead.
 	 *
+	 * Invalid or malformed URLs are handled gracefully - if the path cannot be
+	 * extracted, it defaults to "/" which typically allows access. No exceptions
+	 * are thrown for invalid input.
+	 *
 	 * @param userAgent - The user-agent to check (e.g., 'Googlebot', 'Googlebot/2.1')
-	 * @param urls - Array of URLs to check (must be %-encoded per RFC3986)
+	 * @param urls - Array of URLs to check (should be %-encoded per RFC3986)
 	 * @returns Array of results in the same order as input URLs
 	 */
 	public checkUrls(userAgent: string, urls: string[]): UrlCheckResult[] {

@@ -274,8 +278,11 @@
 	/**
 	 * Check a single URL (convenience method).
 	 *
+	 * Invalid or malformed URLs are handled gracefully - if the path cannot be
+	 * extracted, it defaults to "/" which typically allows access.
+	 *
 	 * @param userAgent - The user-agent to check
-	 * @param url - The URL to check (must be %-encoded per RFC3986)
+	 * @param url - The URL to check (should be %-encoded per RFC3986)
 	 * @returns Result with detailed match information
 	 */
 	public checkUrl(userAgent: string, url: string): UrlCheckResult {
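And a matching sketch for the ParsedRobots methods, again not part of the diff and again assuming the class is exported from the package root; the malformed-URL behavior shown is what the new doc comments describe:

```typescript
import { ParsedRobots } from '@trybyte/robotstxt-parser';

const robots = ParsedRobots.parse('User-agent: *\nDisallow: /private/\n');

// Batch check: one parse, many URLs (the fast path called out in the JSDoc).
const results = robots.checkUrls('Googlebot', [
	'https://example.com/',
	'https://example.com/private/report',
	'::not-a-url::', // no exception; the path falls back to "/"
]);

// Convenience single-URL check with detailed match information.
const detail = robots.checkUrl('Googlebot', 'https://example.com/private/report');

console.log(results.length, detail);
```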

0 commit comments
