make diff of time series to compare test productions+AliasDataFrame #2014
base: master
✨ Add `AliasDataFrame` Utilities for On-Demand Evaluation

This PR adds support for alias-based derived column computation, as used for example in TPC distortion error parameterization. `AliasDataFrame` is a small utility that extends `pandas.DataFrame` functionality by enabling:

✅ Key Features

* **Lazy evaluation of derived columns via named aliases**
* **Automatic dependency resolution across aliases**
* **Persistence via Parquet + JSON or ROOT TTree (via `uproot` + `PyROOT`)**
* **ROOT-compatible TTree export/import, including alias metadata**
🧪 Example Usage

The function below demonstrates how derived error estimates and quality flags can be defined in terms of other DataFrame columns and aliases:

```python
import numpy as np

def makeErrParamAlias(adf):
    adf.df["Beta2"] = np.minimum(50 / adf.df["dEdxTPC"], 1.0).astype(np.float16)
    adf.add_alias("errz0a0", "0.35*sqrt(1+(mP4**2)/Beta2)/(nClsTPC/150.)**1.5", dtype=np.float16)
    adf.add_alias("errz0b0", "0.006*sqrt(1+(mP4**2)/Beta2)/(nClsTPC/150.)**1.5", dtype=np.float16)
    adf.add_alias("errz0b1", "0.0015*sqrt(0.25+(mP4**2)/Beta2)/(nClsTPC/150.)**1.5", dtype=np.float16)
    adf.add_alias("erry0c1", "0.5*sqrt(0.25+(mP4**2)/Beta2)/(nClsTPC/150.)**2.5/150**2", dtype=np.float16)
    adf.add_alias("cutB6", "((abs(dz0_b0/errz0b0) > 6) * 1) + ((abs(dz0_b1/errz0b1) > 6) * 2) + ((abs(dy0_b0/errz0b0) > 6) * 4) + ((abs(dy0_b1/errz0b1) > 6) * 8)", dtype=np.uint8)
    adf.add_alias("cutC6", "((abs(dy0_c1/erry0c1) > 6) * 1) + ((abs(dy0_c0/erry0c1) > 6) * 2)", dtype=np.uint8)
    adf.add_alias("cutA6", "((abs(dy0_a0/errz0a0) > 6) * 1) + ((abs(dz0_a0/errz0a0) > 6) * 2)", dtype=np.uint8)
    adf.add_alias("cutT", "((cutB6 + cutC6 + cutA6) > 0)", dtype=np.uint8)
    return adf
```

📊 Alias Dependency Graph
- Allow an optional dtype per alias via `add_alias(..., dtype=...)`
- Enable a global dtype override in `materialize_alias` and `materialize_all`
- Add `plot_alias_dependencies()` for visualizing alias dependencies
- Improve alias validation with support for numpy/math functions
- Extend `save()` with `dropAliasColumns` to skip derived columns (previously done only for TTree)
- Store alias output dtypes in JSON metadata
- Restore dtypes on load using numpy type resolution
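The dtype round-trip mentioned above can be sketched as follows. The helper name and metadata shape here are hypothetical; the point is only that dtype names stored in JSON can be resolved back to numpy types on load:

```python
import numpy as np

def resolve_dtype(name):
    """Map a stored dtype name like 'float16' back to a numpy dtype.

    A None entry means the column had no explicit alias dtype.
    (Illustrative helper; the actual metadata layout is defined by the PR.)
    """
    return np.dtype(name) if name is not None else None

# Hypothetical JSON metadata: alias name -> stored dtype name
meta = {"errz0a0": "float16", "cutB6": "uint8", "raw": None}
restored = {k: resolve_dtype(v) for k, v in meta.items()}
```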
🧾 Output Storage for

🔄 Update Changes Summary

✅ Constants Support
🧠 Smart Dependency Handling
💾 Parquet and ROOT I/O Support
🧪 Unit Tests
…ses` **Extended commit description:** * Introduced `convert_expr_to_root()` static method using `ast` to translate Python expressions into ROOT-compatible syntax, including function mapping (`mod → fmod`, `arctan2 → atan2`, etc.). * Patched `export_tree()` to: * Apply ROOT-compatible expression conversion. * Handle ROOT’s TTree::SetAlias limitations (e.g. constants) using `(<value> + 0)` workaround. * Save full Python alias metadata (`aliases`, `dtypes`, `constants`) as JSON in `TTree::GetUserInfo()`. * Patched `read_tree()` to: * Restore alias expressions and metadata from `UserInfo` JSON. * Maintain full alias context including constants and types. * Preserved full compatibility with the existing parquet export/load code. * Ensured Python remains the canonical representation; conversion is only needed for ROOT alias usage.
Extended commit description:
…verbosity

- Introduced `materialize_aliases(targets, cleanTemporary=True, verbose=False)` method:
  - Builds a dependency graph among defined aliases using NetworkX.
  - Topologically sorts dependencies to ensure correct materialization order.
  - Materializes only the requested aliases and their dependencies.
  - Optionally cleans up intermediate (temporary) columns not in the target list.
  - Includes verbose logging to trace evaluation and cleanup steps.
- Improves memory efficiency and control when working with layered alias chains.
- Ensures robust handling of mixed alias and non-alias columns.
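The ordering step can be sketched in isolation as below. The alias-dict shape and function name are assumptions for illustration; the real `materialize_aliases` is a method on the frame class:

```python
import re
import networkx as nx

def materialization_order(aliases, targets):
    """Return the requested aliases plus their dependencies, in evaluation order.

    `aliases` maps alias name -> expression string.  An edge dep -> name means
    `dep` must be materialized before `name`.  A circular dependency makes
    nx.topological_sort raise NetworkXUnfeasible.
    """
    g = nx.DiGraph()
    for name, expr in aliases.items():
        g.add_node(name)
        for tok in re.findall(r"[A-Za-z_]\w*", expr):
            if tok in aliases and tok != name:
                g.add_edge(tok, name)
    needed = set()
    for t in targets:
        needed |= {t} | nx.ancestors(g, t)
    return [n for n in nx.topological_sort(g) if n in needed]
```

Restricting the topological order to `needed` is what lets only the requested aliases and their dependencies be materialized.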
…ror handling

- Added tests for:
  * Circular dependency detection
  * Undefined alias symbols
  * Invalid expression syntax
  * Partial materialization logic
  * Subframe behavior with unregistered references
- Improved save/load integrity checks with alias mean-delta validation
- Direct alias dictionary comparison after load

Known test failures to be addressed:

- Circular dependency not detected (`ValueError` not raised)
- Syntax error not caught (`SyntaxError` not raised)
- Undefined symbol not caught (`Exception` not raised)
- Partial materialization does not preserve dependency logic
- Subframe alias on an unregistered frame does not raise `NameError`
- Introduces per-channel, detector-agnostic model: X(Q,n) = a(q0,n) + b(q0,n)·(Q−q0), centered on Δq
- Defines inputs/outputs, fit steps, and monotonicity policy (b > b_min)
- Details nuisance-axis interpolation (linear/PCHIP) and uncertainty (σ_Q, σ_Q_irr)
- Provides API sketch (`fit_quantile_linear_nd`, `QuantileEvaluator`) and persistence (Parquet/Arrow/ROOT)
- Outlines unit tests, diagnostics, and performance expectations

Refs: calibration, multiplicity/flow estimator framework
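The Δq-centered model can be illustrated with a minimal single-window fit. The function name and signature below are a hypothetical simplification of the ND API (`fit_quantile_linear_nd`); it fits X = a + b·(Q − q0) by OLS inside one |Q − q0| ≤ dq window and applies the monotonicity floor:

```python
import numpy as np

def fit_quantile_linear(Q, X, q0, dq, b_min=1e-6):
    """OLS fit of X = a + b*(Q - q0) within |Q - q0| <= dq, with b >= b_min."""
    m = np.abs(Q - q0) <= dq
    d = Q[m] - q0
    A = np.column_stack([np.ones(d.size), d])          # design matrix [1, Q - q0]
    (a, b), *_ = np.linalg.lstsq(A, X[m], rcond=None)  # least-squares solve
    b = max(b, b_min)                                  # enforce monotonicity
    return a, b
```

Centering on q0 makes `a` directly interpretable as the predicted X at the window center.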
…er plots

- Added NumPy-style docstrings to `df_draw_scatter` and `drawExample`
…ench

- Introduces `dfextensions/quantile_fit_nd`:
  - `quantile_fit_nd.py`: per-channel ND fit, separable interpolation, evaluator, I/O
  - `test_quantile_fit_nd.py`: synthetic unit tests (uniform/poisson/gaussian, z nuisance)
  - `bench_quantile_fit_nd.py`: simple timing benchmark over N and distributions
- Uses Δq-centered model: X = a(q0,n) + b(q0,n)·(Q − q0)
- Enforces monotonicity with configurable b_min (auto/fixed)
- Outputs DataFrame (Parquet/Arrow/ROOT) with diagnostics and metadata
…ust edge expectations

- Define `evaluator.invert_rank()` with self-consistent candidate + fixed-point refinement
- Compute the b(z) expectation by averaging b_true over the sample per z-bin
- Relax the sigma_Q tolerance to 0.25 (finite-window OLS)
- Update the edge-case test to assert edge coverage instead of an unrealistic 90% overall
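The fixed-point idea behind `invert_rank` can be sketched as follows. The `a_of`/`b_of` accessors are hypothetical stand-ins for the evaluator's tabulated coefficients, and b > 0 is assumed per the monotonicity policy:

```python
def invert_rank(X, a_of, b_of, q_init=0.5, n_steps=2):
    """Recover the rank Q for an observed X under the local linear model.

    At each step the model is re-centered at the current candidate q:
        X = a(q) + b(q) * (Q - q)  =>  Q = q + (X - a(q)) / b(q)
    A couple of refinement steps suffice when the model is nearly linear
    around the solution (the PR uses a 2-step refine).
    """
    q = q_init
    for _ in range(n_steps):
        q = q + (X - a_of(q)) / b_of(q)
    return q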
…ngle-groupby warning

- The evaluator was treating 'q_center' as a nuisance axis (detected by `*_center`), causing axis misalignment and an `AxisError` in `moveaxis`. Exclude it explicitly.
- When grouping by a single nuisance bin column, use a scalar grouper to avoid a pandas `FutureWarning`.
…b_min + stable inversion

- `QuantileEvaluator`: exclude 'q_center' from nuisance axes (fix `AxisError` in `moveaxis`)
- Groupby: use a scalar grouper for a single nuisance bin column (silence `FutureWarning`)
- Fit: compute b_min per |Q−q0| ≤ dq window (avoid over-clipping b in low-b regions)
- Inversion: implement self-consistent candidate + 2-step fixed-point refinement (`invert_rank`)
- Keep API/metadata unchanged; prepare for ND nuisances and time
…(exclude IDE files)

- Remove `.idea/` from the repo and add `.gitignore`
…d record reason

- Apply b_min only when a valid fit yields b ≤ 0 (monotonicity enforcement)
- For low-Q-spread / low-N windows, keep NaN (no floor) and record the reason in fit_stats
- Greatly reduces bias in the Poisson case; z-bin averages use informative windows only
- Use Q = F(k−1) + U·(F(k) − F(k−1)) for Poisson synthetic data
- Ensures continuous ranks and informative Δq windows
- Keeps the fitter unchanged; diagnostics remain valid
- Explain the continuous-Q assumption and discrete preprocessing (PIT/mid-ranks)
- Add utils: `discrete_to_uniform_rank_poisson` / `_empirical` for reuse
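The randomized-PIT formula Q = F(k−1) + U·(F(k) − F(k−1)) can be sketched for the Poisson case as below. The signature is assumed; the project's `discrete_to_uniform_rank_poisson` may differ, and the Poisson CDF is computed by direct summation for clarity:

```python
import numpy as np

def discrete_to_uniform_rank_poisson(k, mu, rng):
    """Map Poisson counts k to continuous U(0,1) ranks via randomized PIT."""
    k = np.asarray(k)
    kmax = int(k.max())
    # pmf(0..kmax) via the recurrence p(i) = p(i-1) * mu / i
    pmf = np.empty(kmax + 1)
    pmf[0] = np.exp(-mu)
    for i in range(1, kmax + 1):
        pmf[i] = pmf[i - 1] * mu / i
    cdf = np.concatenate([[0.0], np.cumsum(pmf)])   # cdf[i] = F(i-1)
    u = rng.uniform(size=k.shape)
    return cdf[k] + u * (cdf[k + 1] - cdf[k])        # F(k-1) + U*(F(k)-F(k-1))
```

With the exact CDF this yields exactly uniform ranks, which is what restores informative Δq windows for the fitter.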
- Round-trip RMS is dominated by per-event noise → expect α_rt ≈ 0 (flat), not −0.5
- Keep the rms_b scaling check near −0.5 (loosen tolerance to ±0.2 across 5 N points)
- Clarify summary prints and expectations; keep the constancy check only for rms_b·√N

PWGPP-643
- One-page snapshot of goals, assumptions, API, commands
- Documents discrete-input policy (PIT/mid-rank) and monotonicity
- Links code, tests, and benchmark usage with scaling expectations

PWGPP-643
- `bench_groupby_regression.py`: self-contained scenarios (clean/outliers, serial/parallel)
- Emits TXT and JSON (CSV optional) for easy doc inclusion and CI checks
- Fits y ~ x1 + x2 per group via `GroupByRegressor.make_parallel_fit`
- Workaround for a single-column group key (duplicate the column to form tuple keys)

Sample results show:

- ~1.75 s / 1k groups (serial clean, 50k rows, 10k groups)
- ~0.41 s / 1k groups with n_jobs=10 (≈4.3× speedup)
- Current y-shift outliers do not slow down the OLS path (no refits triggered)
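The per-group y ~ x1 + x2 fit that the benchmark times can be illustrated in plain pandas. This is only a sketch of the underlying OLS, not `GroupByRegressor.make_parallel_fit` itself (whose API lives in the project):

```python
import numpy as np
import pandas as pd

def fit_group(g):
    """OLS of y ~ 1 + x1 + x2 within one group."""
    A = np.column_stack([np.ones(len(g)), g["x1"], g["x2"]])
    coef, *_ = np.linalg.lstsq(A, g["y"].to_numpy(), rcond=None)
    return pd.Series(coef, index=["intercept", "b_x1", "b_x2"])

# Synthetic data: 100 groups x 50 rows with known coefficients
rng = np.random.default_rng(1)
n = 5000
df = pd.DataFrame({
    "group": np.repeat(np.arange(100), 50),
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
})
df["y"] = 1.0 + 2.0 * df["x1"] - 0.5 * df["x2"] + 0.01 * rng.normal(size=n)

coefs = df.groupby("group")[["x1", "x2", "y"]].apply(fit_group)
```

The parallel path in the benchmark distributes exactly this kind of per-group solve across workers, which is where the ~4.3× speedup at n_jobs=10 comes from.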
…x Markdown tables

- Added a new "Performance & Benchmarking" section describing benchmark usage, results, and interpretation
- Included CLion-compatible Markdown tables for output columns, example results, and recommendations
- Documented the benchmark command line and sample outputs (50k rows / 10k groups)
- Clarified how sigmaCut and parallelization affect runtime
- Minor formatting and readability improvements across the file
- Default benchmark: 5 rows/group, 5k groups (faster, still representative)
- Added a 30% outlier scenario to the examples; clarified that response-only outliers don't trigger slow robust re-fits
- Updated example tables for Mac and Linux with new per-1k-group timings
- (optional) bench CLI default `--groups=5000`
…erage-outlier plan

- Record new cross-platform results (Mac vs Linux) and the observation that response-only outliers do not slow runtime
- Add action plan: leverage-outlier generator + refit counters + multi-target cost check
- Keep the PR target; align benchmarks and docs with the 5k/5 default
…iag_prefix)

- `process_group_robust`: record n_refits, frac_rejected, hat_max, cond_xtx, time_ms, n_rows (only when diag=True)
- `make_parallel_fit`: new args `diag` / `diag_prefix` (default off; no behavior change)
- Add `summarize_diagnostics(dfGB)` helper for quick triage
… report

- Append a scenario-wise diagnostics summary to benchmark_report.txt
- Save the top-10 violators per scenario (time/refits) as CSVs
- Supports the suffix-aware `summarize_diagnostics()` from `GroupByRegressor`
- Verified clean pytest and benchmark runs on real and synthetic data
…lidation

- Added suffix-aware `summarize_diagnostics` + benchmark report integration
- Confirmed the robust re-fit loop on real datasets
- Prepared next-phase plan for real-use-case profiling and fast-path study

