⚡️ Speed up function get_sample_data by 18%
#215
📄 18% (0.18x) speedup for `get_sample_data` in `lib/matplotlib/cbook.py`

⏱️ Runtime: 193 microseconds → 163 microseconds (best of 5 runs)

📝 Explanation and details
The optimization achieves an 18% speedup through two key improvements:

1. Function Attribute Caching in `_get_data_path`

The most significant optimization caches the result of `matplotlib.get_data_path()` as a function attribute (`_get_data_path._base_path`). The line profiler shows this reduces time from 1.89 ms to 1.39 ms (a 27% improvement) by eliminating repeated expensive path lookups. Since `_get_data_path` is called 50 times in the test, this caching provides substantial savings when accessing multiple sample data files.

2. Early Return and Tuple Optimization in `get_sample_data`

Moving the `asfileobj=False` check to the top enables faster early returns for path-only requests. Additionally, replacing the list literal `['.npy', '.npz']` with the tuple `('.npy', '.npz')` for suffix checks may provide a minor performance gain, since tuple membership tests can be slightly faster than list membership tests in Python.

Impact on Workloads

Based on the function references, `get_sample_data` is called in matplotlib test suites and visualization demos where sample datasets are loaded. The caching optimization is particularly beneficial when:

The annotated tests show consistent 15-27% speedups across different file types, with the optimization being most effective for workflows that make repeated calls to `get_sample_data` or that use `asfileobj=False` (path-only access).

✅ Correctness verification report:
⚙️ Existing Unit Tests and Runtime
🌀 Generated Regression Tests and Runtime
To edit these changes, `git checkout codeflash/optimize-get_sample_data-miscep90` and push.