Commit f578d74
automata: reduce regex contention somewhat
> **Context:** A `Regex` uses internal mutable space (called a `Cache`)
> while executing a search. Since a `Regex` really wants to be easily
> shared across multiple threads simultaneously, it follows that a
> `Regex` either needs to provide search functions that accept a `&mut
> Cache` (thereby pushing synchronization to a problem for the caller
> to solve) or it needs to do synchronization itself. While there are
> lower level APIs in `regex-automata` that do the former, they are
> less convenient. The higher level APIs, especially in the `regex`
> crate proper, need to do some kind of synchronization to give a
> search the mutable `Cache` that it needs.
>
> The current approach to that synchronization essentially uses a
> `Mutex<Vec<Cache>>` with an optimization for the "owning" thread
> that lets it bypass the `Mutex`. The owning thread optimization
> makes it so the single threaded use case essentially doesn't pay for
> any synchronization overhead, and that all works fine. But once the
> `Regex` is shared across multiple threads, that `Mutex<Vec<Cache>>`
> gets hit. And if you're doing a lot of regex searches on short
> haystacks in parallel, that `Mutex` comes under extremely heavy
> contention. To the point that a program can slow down by enormous
> amounts.
>
> This PR attempts to address that problem.
>
> Note that this issue can be worked around.
>
> The simplest work-around is to clone a `Regex` and send it to other
> threads instead of sharing a single `Regex`. This won't use any
> additional memory (a `Regex` is reference counted internally),
> but it will force each thread to use the "owner" optimization
> described above. This does mean, for example, that you can't
> share a `Regex` across multiple threads conveniently with a
> `lazy_static`/`OnceCell`/`OnceLock`/whatever.
>
> The other work-around is to use the lower level search APIs on a
> `meta::Regex` in the `regex-automata` crate. Those APIs accept a
> `&mut Cache` explicitly. In that case, you can use the `thread_local`
> crate or even an actual `thread_local!` or something else entirely.
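
To make the second work-around concrete, here is a minimal sketch of the thread-local pattern using only the standard library. `Cache`, `Searcher`, `find_with`, `find` and `count_matching` are hypothetical stand-ins invented for illustration. They are not the `regex-automata` API, and the "search" is just a substring scan. The point is only the shape: the searcher is shared immutably, and each thread lazily gets its own mutable scratch space, so no lock is involved.

```rust
use std::cell::RefCell;
use std::thread;

/// Hypothetical stand-in for the mutable scratch space a search needs.
#[derive(Default)]
struct Cache {
    scratch: Vec<usize>,
}

/// Hypothetical stand-in for a compiled regex: immutable, so it can be shared.
struct Searcher {
    needle: &'static str,
}

impl Searcher {
    /// Explicit-cache search: the caller supplies the `&mut Cache`.
    fn find_with(&self, cache: &mut Cache, haystack: &str) -> Option<usize> {
        cache.scratch.clear();
        cache.scratch.extend(haystack.match_indices(self.needle).map(|(i, _)| i));
        cache.scratch.first().copied()
    }
}

thread_local! {
    // One cache per thread, created lazily on first use. No synchronization.
    static CACHE: RefCell<Cache> = RefCell::new(Cache::default());
}

/// Convenience wrapper that borrows the calling thread's own cache.
fn find(searcher: &Searcher, haystack: &str) -> Option<usize> {
    CACHE.with(|cache| searcher.find_with(&mut cache.borrow_mut(), haystack))
}

/// Shares one `Searcher` across one thread per haystack and counts the hits.
fn count_matching(searcher: &Searcher, haystacks: &[&str]) -> usize {
    thread::scope(|s| {
        let handles: Vec<_> = haystacks
            .iter()
            .map(|&h| s.spawn(move || find(searcher, h).is_some() as usize))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum::<usize>()
    })
}
```

In this sketch there is a single searcher, so a single thread-local cache is enough. With several regexes, each one would presumably need its own per-thread cache.
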

I wish I could say this PR was a home run that fixed the contention
issues with `Regex` once and for all, but it's not. It just makes
things a fair bit better by switching from one stack to eight stacks
for the pool, plus a couple of other heuristics. The stack is chosen
by doing `self.stacks[thread_id % 8]`. It's a pretty dumb strategy,
but it limits extra memory usage while at least reducing contention.
Obviously, it works a lot better for the 8-16 thread case, and while
it helps with the 64-128 thread case too, things are still pretty slow
there.
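
The stack selection described above can be sketched roughly as follows. This is a simplified illustration and not the code from this commit: it leaves out the owner-thread fast path and the other heuristics, and the names `Pool`, `NUM_STACKS`, `THREAD_ID`, `stack_index`, `get` and `put` are assumptions chosen for the example. Thread IDs are modeled as a process-wide counter handed out on first use.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Number of independent stacks (eight, per the description above).
const NUM_STACKS: usize = 8;

/// Hands out small sequential IDs, one per thread, on first use.
static NEXT_ID: AtomicUsize = AtomicUsize::new(0);
thread_local! {
    static THREAD_ID: usize = NEXT_ID.fetch_add(1, Ordering::Relaxed);
}

/// A pool of reusable values spread over several mutex-protected stacks.
struct Pool<T> {
    stacks: Vec<Mutex<Vec<T>>>,
    create: fn() -> T,
}

impl<T> Pool<T> {
    fn new(create: fn() -> T) -> Self {
        let stacks = (0..NUM_STACKS).map(|_| Mutex::new(Vec::new())).collect();
        Pool { stacks, create }
    }

    /// The "pretty dumb" strategy: pick a stack by thread ID modulo 8.
    fn stack_index(thread_id: usize) -> usize {
        thread_id % NUM_STACKS
    }

    /// Pops a value from this thread's stack, or creates one if it is empty.
    fn get(&self) -> T {
        let idx = THREAD_ID.with(|id| Self::stack_index(*id));
        // The lock guard is a temporary, so it is released at the end of
        // this statement, before a new value is (possibly) created.
        let popped = self.stacks[idx].lock().unwrap().pop();
        popped.unwrap_or_else(|| (self.create)())
    }

    /// Returns a value to this thread's stack for later reuse.
    fn put(&self, value: T) {
        let idx = THREAD_ID.with(|id| Self::stack_index(*id));
        self.stacks[idx].lock().unwrap().push(value);
    }
}
```

With eight stacks, up to eight threads can each land on their own mutex. Beyond that, threads start sharing stacks again, which fits the observation that the 64-128 thread case is helped but remains slow.
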

A benchmark for this problem is described in #934. We compare 8 and 16
threads, and for each thread count, we compare a `cloned` and `shared`
approach. The `cloned` approach clones the regex before sending it to
each thread, whereas the `shared` approach shares a single regex across
multiple threads. The `cloned` approach is expected to be fast (and
it is) because it forces each thread into the owner optimization. The
`shared` approach, however, hits the shared stack behind a mutex and
suffers majorly from contention.

Here's what that benchmark looks like before this PR for 64 threads (on a
24-core CPU).
```
$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./tmp/repro-master"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro
  Time (mean ± σ): 9.0 ms ± 0.6 ms [User: 128.3 ms, System: 5.7 ms]
  Range (min … max): 7.7 ms … 11.1 ms 278 runs

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./tmp/repro-master
  Time (mean ± σ): 1.938 s ± 0.036 s [User: 4.827 s, System: 41.401 s]
  Range (min … max): 1.885 s … 1.992 s 10 runs

Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro' ran
  215.02 ± 15.45 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./tmp/repro-master'
```
And here's what it looks like after this PR:
```
$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro
  Time (mean ± σ): 9.0 ms ± 0.6 ms [User: 127.6 ms, System: 6.2 ms]
  Range (min … max): 7.9 ms … 11.7 ms 287 runs

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./target/release/repro
  Time (mean ± σ): 55.0 ms ± 5.1 ms [User: 1050.4 ms, System: 12.0 ms]
  Range (min … max): 46.1 ms … 67.3 ms 57 runs

Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro' ran
  6.09 ± 0.71 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./target/release/repro'
```
So instead of things getting over 215x slower in the 64 thread case, it
"only" gets 6x slower.
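
As a quick sanity check on those numbers, the ratios can be recomputed from the rounded means that hyperfine printed. The helper names `ratio` and `ratios` are made up for this sketch. The results differ slightly from hyperfine's own summary (215.02 and 6.09), presumably because hyperfine works from unrounded measurements.

```rust
/// Ratio of two mean wall-clock times, both given in milliseconds.
fn ratio(slow_ms: f64, fast_ms: f64) -> f64 {
    slow_ms / fast_ms
}

/// Returns (slowdown before, slowdown after, improvement of `shared`),
/// using the rounded means from the two benchmark runs above.
fn ratios() -> (f64, f64, f64) {
    let cloned_ms = 9.0; // `cloned`, same mean in both runs
    let shared_before_ms = 1938.0; // `shared` before this PR (1.938 s)
    let shared_after_ms = 55.0; // `shared` after this PR
    (
        ratio(shared_before_ms, cloned_ms),       // roughly 215x
        ratio(shared_after_ms, cloned_ms),        // roughly 6x
        ratio(shared_before_ms, shared_after_ms), // roughly 35x
    )
}
```

Read the other way around, the `shared` case itself gets roughly 35x faster at 64 threads, going by these means.
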
Closes #934

Parent commit: 9a505a1
1 file changed, with 168 additions and 19 deletions.