Optimize HLS playlist parsing by caching regex Matchers #2986
Conversation
Introduce a per-thread regex Matcher cache in HlsPlaylistParser to significantly reduce object allocation overhead during playlist parsing.

Problem:
Each Matcher object allocates memory in both the Java heap and the native heap, creating two critical issues in production:
1. Native heap exhaustion: native allocations are substantial and not subject to normal Java GC pressure. When Matcher objects are created faster than they're garbage collected, the native heap can be exhausted even when the Java heap has space available, causing OutOfMemoryError in the native allocator.
2. GC-induced ANRs: excessive Matcher allocation causes frequent GC cycles. This is particularly severe with our MultiView feature (4 concurrent playback sessions) on lower-performance devices, where sustained GC pressure from thousands of short-lived Matcher objects causes Application Not Responding (ANR) events.
Both issues are exacerbated by frequent HLS playlist refreshes (every 2-6 seconds), creating continuous allocation pressure.

Solution:
- Uses ThreadLocal<MatcherCacheState> for lock-free per-thread isolation
- Employs an access-ordered LinkedHashMap as an LRU cache (max 32 entries)
- Reuses Matcher objects via reset() instead of creating new instances
- Eliminates both Java heap and native heap allocation pressure

Performance impact:
- Reduces Matcher allocations by >99% in production workloads (from ~20M allocations to ~37K allocations + 19.9M reuses)
- Eliminates native heap exhaustion risk from Matcher object churn
- Drastically reduces GC frequency and duration, preventing ANRs
- Removes dependency on GC timing for native memory reclamation
- Typical cache occupancy: 6-12 patterns (well under the 32-entry limit)
- Zero synchronization overhead through thread-local storage
- Critical for MultiView and lower-performance devices

Testing:
- Validated over 2+ hours with production HLS streams
- 99.82% reuse rate across 3,692 loader threads
- No native memory allocation errors observed
- Significant reduction in GC events during MultiView usage
- No functional changes to parsing behavior
- All existing tests pass
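The diff itself isn't included in this conversation view. As a rough illustration of the approach described above, here is a minimal sketch of an access-ordered LinkedHashMap used as a bounded LRU cache of Matchers behind a ThreadLocal. The MatcherCacheState and matcherCache names appear in the PR; the get() helper, the MatcherCacheHolder class, and the constants are assumptions made for illustration, not the actual PR code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative per-thread Matcher cache; names other than MatcherCacheState are hypothetical. */
final class MatcherCacheState {
  private static final int MAX_ENTRIES = 32;

  // accessOrder=true makes iteration follow access recency, so the eldest entry is the
  // least recently used one and removeEldestEntry gives us LRU eviction.
  private final Map<Pattern, Matcher> matchers =
      new LinkedHashMap<Pattern, Matcher>(
          /* initialCapacity= */ 16, /* loadFactor= */ 0.75f, /* accessOrder= */ true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<Pattern, Matcher> eldest) {
          return size() > MAX_ENTRIES;
        }
      };

  /** Returns a Matcher for {@code pattern} reset onto {@code input}, reusing a cached instance. */
  Matcher get(Pattern pattern, CharSequence input) {
    Matcher matcher = matchers.get(pattern);
    if (matcher == null) {
      matcher = pattern.matcher(input);
      matchers.put(pattern, matcher);
    } else {
      matcher.reset(input); // reuse instead of allocating a new Matcher
    }
    return matcher;
  }
}

// Per-thread cache as described in the PR: lock-free isolation via ThreadLocal.
// (Shown in a hypothetical holder class here; in the PR the field lives in HlsPlaylistParser.)
final class MatcherCacheHolder {
  static final ThreadLocal<MatcherCacheState> matcherCache =
      ThreadLocal.withInitial(MatcherCacheState::new);
}
```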
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.

@googlebot I signed the CLA.
marcbaechinger left a comment
Thanks! I added one comment to discuss use of thread local.
private static final ThreadLocal<MatcherCacheState> matcherCache =
Thanks for your PR and the reasoning in the PR description. Also, many thanks for the data regarding your usage with your stream and the reduction of issues you see. That's all very convincing.
I'm a bit skeptical about the use of ThreadLocal. My assumption is you've chosen this approach mainly because all the methods that need access to the cache are static. So I think, and appreciate, that you were kind of forced into such an approach by the design of the parser.
My issue with that approach is (probably besides me never actually having really used it :)) that from the perspective of HlsPlaylistParser it's unclear how the parser is used with regard to threading. We don't know whether it is used only by a short-lived thread, for example as part of a Loader, or whether someone attempts to call this from a long-lived thread or one that is part of a thread pool.
It also appears to me that the parser is created by DefaultPlaylistTracker on the playback thread and is then passed to the Loader, where the parser is actually used on a separate (short-lived, I believe) thread.
I think what I'm trying to say is that, if we had an approach that is not bound to a thread, we would probably have much less to think about now, and we also would not have to care about any future refactoring that attempts to change the threading model in which the parser may be used.
If I don't want to think about all that too much, then I would argue we can instantiate a MatcherCacheState instance per parser instance or per single parse(Uri) call. We would then need to pass that cache object to each of the static methods you changed, like parseOptionalBooleanAttribute and friends. Given these are private methods, that would be OK API-wise. There are 7 such methods.
I think for your case, for which you have measured things, this would have the same effect. Currently an HlsPlaylistParser instance is created each time an HLS playlist is parsed [1]. There isn't a case where a parser instance is used twice on the same thread.
WDYT when you hear this? My suggestion would therefore be to keep pretty much everything from this change except the ThreadLocal.
[1] Lines 801 to 825 in 7cc1056:
private void loadPlaylistImmediately(Uri playlistRequestUri) {
  ParsingLoadable.Parser<HlsPlaylist> mediaPlaylistParser =
      playlistParserFactory.createPlaylistParser(multivariantPlaylist, playlistSnapshot);
  DataSpec dataSpec =
      new DataSpec.Builder()
          .setUri(playlistRequestUri)
          .setFlags(DataSpec.FLAG_ALLOW_GZIP)
          .build();
  if (cmcdConfiguration != null) {
    CmcdData.Factory cmcdDataFactory =
        new CmcdData.Factory(cmcdConfiguration, CmcdData.STREAMING_FORMAT_HLS)
            .setObjectType(CmcdData.OBJECT_TYPE_MANIFEST);
    if (primaryMediaPlaylistSnapshot != null) {
      cmcdDataFactory.setIsLive(!primaryMediaPlaylistSnapshot.hasEndTag);
    }
    dataSpec = cmcdDataFactory.createCmcdData().addToDataSpec(dataSpec);
  }
  ParsingLoadable<HlsPlaylist> mediaPlaylistLoadable =
      new ParsingLoadable<>(
          mediaPlaylistDataSource, dataSpec, C.DATA_TYPE_MANIFEST, mediaPlaylistParser);
  mediaPlaylistLoader.startLoading(
      mediaPlaylistLoadable,
      /* callback= */ this,
      loadErrorHandlingPolicy.getMinimumLoadableRetryCount(mediaPlaylistLoadable.type));
}
> There isn't a case where a parser instance is used twice on the same thread.
Just to keep me honest: I don't think what I said in the quote above is true.
Thanks for the feedback. I've removed the ThreadLocal and now create a MatcherCacheState per parse() call, passed through the private helpers. This keeps matcher reuse scoped to a single parse operation while preserving the allocation benefits.
Instantiate a MatcherCacheState per parse(...) call and pass it through the private parsing helpers. Keeps regex matcher reuse without relying on ThreadLocal, aligning with feedback and clarifying ownership/lifetime of cached matchers. No functional change intended.
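As a rough sketch of what this shape looks like (reusing the illustrative MatcherCacheState from the earlier sketch; the pattern, helper body, and class name below are hypothetical stand-ins, not the actual HlsPlaylistParser code or the exact PR diff):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: the cache is created once per parse call and handed to each
// private helper, so Matcher reuse is scoped to a single parse operation and no
// ThreadLocal is needed.
final class PerParseCacheSketch {
  // Hypothetical attribute pattern, standing in for the parser's real patterns.
  private static final Pattern EXAMPLE_BOOLEAN_ATTR = Pattern.compile("EXAMPLE-ATTR=(NO|YES)");

  static boolean parsePlaylistLine(String line) {
    MatcherCacheState matcherCache = new MatcherCacheState(); // lives only for this parse
    return parseOptionalBooleanAttribute(
        line, EXAMPLE_BOOLEAN_ATTR, /* defaultValue= */ false, matcherCache);
  }

  // The helper gains a cache parameter and calls matcherCache.get(pattern, line)
  // instead of pattern.matcher(line).
  private static boolean parseOptionalBooleanAttribute(
      String line, Pattern pattern, boolean defaultValue, MatcherCacheState matcherCache) {
    Matcher matcher = matcherCache.get(pattern, line);
    return matcher.find() ? "YES".equals(matcher.group(1)) : defaultValue;
  }
}
```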
Thanks a lot for the change. I wanted to import the pull into the internal source tree, but for that I had to merge quite a few changes, mostly in RELEASE_NOTES.md. I'm not quite sure why this is the case, as you didn't actually touch these files. Part of taking a pull is rebasing your branch onto main, which is where the conflicts happened. Would it be possible for you to update your branch and sync it with the main branch? I wonder whether this is easier on your end. If it fails for you in a similar way as for me, you could just sync to the head of 'main' and then copy back the single file you changed. This may be easier, and this way we preserve the credits you get for the pull as I take the change from your branch. WDYT? Can you give this a try?
Ah, sorry. I just got made aware that you've taken your pull from the release branch. You'd need to take this from the main branch. Sorry, I hadn't noticed this earlier. Can you drop this change and make a new one from main?
Hi @marcbaechinger, oops, sure, no problem. I recreated the change on top of main and opened a new PR here: #3008
Closing this PR to avoid confusion; please review/land #3008 instead. |
Summary
Cache and reuse regex Matcher objects to prevent OOM and ANR issues during HLS playlist parsing.

Problem:
Each Matcher object allocates memory in both the Java heap and the native heap, creating two critical issues in production:
1. Native heap exhaustion: native allocations are substantial and not subject to normal Java GC pressure. When Matcher objects are created faster than they're garbage collected, the native heap can be exhausted even when the Java heap has space available, causing OutOfMemoryError in the native allocator.
2. GC-induced ANRs: excessive Matcher allocation causes frequent GC cycles. This is particularly severe with multiple concurrent playback sessions on lower-performance devices, where sustained GC pressure from thousands of short-lived Matcher objects causes Application Not Responding (ANR) events.
Both issues are exacerbated by frequent HLS playlist refreshes (every 2-6 seconds), creating continuous allocation pressure.

Performance impact:
Reduces Matcher allocations by >99% in production workloads (e.g., live TV with large caches or multiple concurrent streams can generate 20M+ allocations, reduced to ~37K allocations + 19.9M reuses).