
Conversation

@miranov25
Contributor

@miranov25 miranov25 commented May 29, 2025

Adding - AliasDataFrame is a small utility that extends pandas.DataFrame functionality by enabling:

  • Lazy evaluation of derived columns via named aliases
  • Automatic dependency resolution across aliases
  • Persistence via Parquet + JSON or ROOT TTree (via uproot + PyROOT)
  • ROOT-compatible TTree export/import including alias metadata
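The lazy-evaluation idea can be sketched in a few lines. This is a minimal self-contained illustration, not the actual `AliasDataFrame` API: `MiniAliasFrame` and its naive substring-based dependency resolution are hypothetical stand-ins for the real implementation.

```python
import numpy as np
import pandas as pd

# Minimal sketch of the lazy-alias idea (illustrative, not the real API):
# aliases are stored as expression strings and evaluated on demand.
class MiniAliasFrame:
    def __init__(self, df):
        self.df = df
        self.aliases = {}                      # name -> (expression, dtype)

    def add_alias(self, name, expr, dtype=None):
        self.aliases[name] = (expr, dtype)

    def materialize(self, name):
        expr, dtype = self.aliases[name]
        # naive dependency resolution: materialize any alias referenced in expr
        # (the real implementation tokenizes and topologically sorts instead)
        for dep in self.aliases:
            if dep != name and dep in expr and dep not in self.df.columns:
                self.materialize(dep)
        col = self.df.eval(expr)
        if dtype is not None:
            col = col.astype(dtype)
        self.df[name] = col
        return self.df[name]

adf = MiniAliasFrame(pd.DataFrame({"x": [1.0, 2.0, 3.0]}))
adf.add_alias("x2", "x * 2")
adf.add_alias("x2p1", "x2 + 1", dtype=np.float16)
adf.materialize("x2p1")                        # pulls in "x2" automatically
```

Materializing `x2p1` first materializes its dependency `x2`, mirroring the automatic dependency resolution described above.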

@github-actions

REQUEST FOR PRODUCTION RELEASES:
To request that your PR be included in production software, please add the corresponding "async-" labels to it. Add the labels directly (if you have the permissions) or add a comment of the following form (note that labels are separated by ","):

+async-label <label1>, <label2>, !<label3> ...

This will add <label1> and <label2> and remove <label3>.

The following labels are available
async-2023-pbpb-apass4
async-2023-pp-apass4
async-2024-pp-apass1
async-2022-pp-apass7
async-2024-pp-cpass0
async-2024-PbPb-apass1
async-2024-ppRef-apass1
async-2024-PbPb-apass2
async-2023-PbPb-apass5

@miranov25 miranov25 changed the title from "make diff of time series to compare test productions" to "make diff of time series to compare test productions+AliasDataFrame" May 29, 2025
@miranov25 miranov25 marked this pull request as draft May 31, 2025 17:35
@miranov25
Contributor Author

✨ Add `AliasDataFrame` Utilities for On-Demand Evaluation

This PR adds support for alias-based derived column computation, as used for example in TPC distortion error parameterization. It includes:

✅ Key Features

  • Function Validation: Supports expressions using standard math, numpy, and previously defined aliases; invalid aliases trigger a warning at definition time.

  • Alias Dependency Resolution: Automatic topological sort of aliases with dependency tracking and detection of circular references.

  • Output Type Specification: Each alias can specify its desired output dtype (e.g. np.float16, np.uint8). This can also be overridden during materialization.

    • Dtypes are preserved in .parquet exports.
    • TTree support can be extended to encode dtype metadata in a structured way.
  • Alias Dependency Graph: Visualization of alias relationships using networkx and matplotlib.

🧪 Example Usage

The function below demonstrates how derived error estimates and quality flags can be defined in terms of other DataFrame columns and aliases:

import numpy as np

def makeErrParamAlias(adf):
    # Base column: Beta2 estimate from dE/dx, clipped at 1.0
    adf.df["Beta2"] = np.minimum(50 / adf.df["dEdxTPC"], 1.0).astype(np.float16)
    # Error parameterizations in terms of momentum (mP4), Beta2, and TPC cluster count
    adf.add_alias("errz0a0", "0.35*sqrt(1+(mP4**2)/Beta2)/(nClsTPC/150.)**1.5", dtype=np.float16)
    adf.add_alias("errz0b0", "0.006*sqrt(1+(mP4**2)/Beta2)/(nClsTPC/150.)**1.5", dtype=np.float16)
    adf.add_alias("errz0b1", "0.0015*sqrt(0.25+(mP4**2)/Beta2)/(nClsTPC/150.)**1.5", dtype=np.float16)
    adf.add_alias("erry0c1", "0.5*sqrt(0.25+(mP4**2)/Beta2)/(nClsTPC/150.)**2.5/150**2", dtype=np.float16)
    # Quality-flag bitmasks built from the error aliases (6-sigma cuts)
    adf.add_alias("cutB6", "((abs(dz0_b0/errz0b0) > 6) * 1) + ((abs(dz0_b1/errz0b1) > 6) * 2) + ((abs(dy0_b0/errz0b0) > 6) * 4) + ((abs(dy0_b1/errz0b1) > 6) * 8)", dtype=np.uint8)
    adf.add_alias("cutC6", "((abs(dy0_c1/erry0c1) > 6) * 1) + ((abs(dy0_c0/erry0c1) > 6) * 2)", dtype=np.uint8)
    adf.add_alias("cutA6", "((abs(dy0_a0/errz0a0) > 6) * 1) + ((abs(dz0_a0/errz0a0) > 6) * 2)", dtype=np.uint8)
    # Combined pass/fail flag
    adf.add_alias("cutT", "((cutB6 + cutC6 + cutA6) > 0)", dtype=np.uint8)
    return adf

📊 Alias Dependency Graph

Visual representation of dependencies: (alias dependency graph image)

miranov25 added 3 commits June 1, 2025 09:07
- Allow optional dtype per alias via `add_alias(..., dtype=...)`
- Enable global override dtype in `materialize_alias` and `materialize_all`
- Add `plot_alias_dependencies()` for visualizing alias dependencies
- Improve alias validation with support for numpy/math functions
- Extend `save()` with `dropAliasColumns` to skip derived columns (previously done only for TTree)
- Store alias output dtypes in JSON metadata
- Restore dtypes on load using numpy type resolution
@miranov25
Contributor Author

🧾 Output Storage for AliasDataFrame: TTree or Parquet + JSON/Metadata

This update improves how aliases and derived columns are saved and loaded across different formats.

✅ Key Features

  • Selective Column Export:

    • save(..., dropAliasColumns=True) stores only non-alias columns in .parquet (default), matching export_tree() behavior.
  • Alias Dtype Persistence:

    • Output dtypes (e.g. np.float16, np.uint8) are now stored as type names (e.g. "float16").
    • Correctly reloaded using getattr(np, ...) to ensure .astype(...) works.
  • Dual Metadata Storage:

    • Aliases and dtypes are stored both:
      • in .parquet file metadata (pyarrow)
      • and in a .aliases.json file for inspection and fallback

🔍 Example Outputs (for the example above, #2014 (comment))

ROOT TTree alias list:

TNamed	cutB6	((abs(dz0_b0/errz0b0) > 6) * 1) + ((abs(dz0_b1/errz0b1) > 6) * 2) + ...
TNamed	cutT	((cutB6 + cutC6 + cutA6) > 0)

Parquet + JSON metadata:

{
  "aliases": {
    "cutB6": "((abs(dz0_b0/errz0b0) > 6) * 1) + ((abs(dz0_b1/errz0b1) > 6) * 2) + ...",
    "cutT": "((cutB6 + cutC6 + cutA6) > 0)"
  },
  "dtypes": {
    "cutB6": "uint8",
    "cutT": "uint8"
  }
}
  • The .parquet file contains embedded metadata.
  • The .aliases.json provides a transparent sidecar view.
  • If metadata is missing or outdated, the loader will fall back to the JSON.

Example usage of the tree with aliases (later also via RDataFrame) in a TTree query:

root [20] tree->GetListOfAliases()->ls()
OBJ: TList	TList	Doubly linked list : 0
 OBJ: TNamed	errz0a0	0.35*sqrt(1+(mP4**2)/Beta2)/(nClsTPC/150.)**1.5 : 0 at: 0x445d4e0
 OBJ: TNamed	errz0b0	0.006*sqrt(1+(mP4**2)/Beta2)/(nClsTPC/150.)**1.5 : 0 at: 0x4465bb0
 OBJ: TNamed	errz0b1	(0.0015*sqrt(0.25+(mP4**2)/Beta2)/(nClsTPC/150.)**1.5) : 0 at: 0x4465cb0
 OBJ: TNamed	erry0c1	(0.5*sqrt(0.25+(mP4**2)/Beta2)/(nClsTPC/150.)**2.5)/(150**2) : 0 at: 0x4465db0
 OBJ: TNamed	cutB6	((abs(dz0_b0/errz0b0) > 6) * 1) + ((abs(dz0_b1/errz0b1) > 6) * 2) + ((abs(dy0_b0/errz0b0) > 6) * 4) + ((abs(dy0_b1/errz0b1) > 6) * 8) : 0 at: 0x4465eb0
 OBJ: TNamed	cutC6	((abs(dy0_c1/erry0c1) > 6) * 1) + ((abs(dy0_c0/erry0c1) > 6) * 2) : 0 at: 0x4465f60
 OBJ: TNamed	cutA6	((abs(dy0_a0/errz0a0) > 6) * 1) + ((abs(dz0_a0/errz0a0) > 6) * 2) : 0 at: 0x4466070
 OBJ: TNamed	cutT	((cutB6 + cutC6 + cutA6) > 0) : 0 at: 0x4466180

root [17] tree->Draw("(dz0_a0/errz0a0):mP4>>his(200,-5,5,100,-5,5)","cutT==0","colz")
(long long) 3682828
root [18] tree->Draw("abs(dz0_a0/errz0a0):mP4>>rofdz(200,-5,5,100,0,20)","cutT==0&&abs(dy0_a0/errz0a0)<20","profsame")

(histogram image produced by the Draw commands above)

@miranov25
Contributor Author

🔄 Summary of Updates

✅ Constants Support

  • New parameter is_constant=True in add_alias marks aliases as constants (e.g. countsFV0_median = 2096.0).
  • Constants are evaluated once and injected directly during alias materialization.
  • Constants are not materialized as DataFrame columns unless explicitly requested.

🧠 Smart Dependency Handling

  • During materialize_alias or materialize_all, constants are evaluated and injected before dependency resolution.
  • Dependency graphs and topological sorting skip constants to avoid recomputation.

💾 Parquet and ROOT I/O Support

  • Metadata (aliases, dtypes, constants) is stored inside .parquet file as Arrow schema metadata.
  • Export to ROOT TTree uses SetAlias, with a workaround for a ROOT interpreter limitation: numeric constants are written as (val + 0).
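The `(val + 0)` workaround can be sketched as a small helper; the function name is hypothetical (the PR applies the same wrapping inside `export_tree()`):

```python
# ROOT's TTree::SetAlias does not accept a bare numeric literal as an alias
# body, so numeric constants are wrapped as "(<value> + 0)"; string
# expressions pass through unchanged.
def alias_expr_for_root(value_or_expr):
    if isinstance(value_or_expr, (int, float)):
        return f"({value_or_expr} + 0)"
    return value_or_expr
```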

🧪 Unit Tests

  • Added robust pytest unit tests with:

    • Standard aliases
    • Aliases with custom dtype
    • Constants and mixed expressions
    • Dependency resolution
    • Reloading and validating from saved Parquet files

@miranov25
Contributor Author

Add ROOT SetAlias export and Python-to-ROOT AST translation for aliases

Extended commit description:

  • Introduced convert_expr_to_root() static method using ast to translate Python expressions into ROOT-compatible syntax, including function mapping (mod → fmod, arctan2 → atan2, etc.).

  • Patched export_tree() to:

    • Apply ROOT-compatible expression conversion.
    • Handle ROOT’s TTree::SetAlias limitations (e.g. constants) using (<value> + 0) workaround.
    • Save full Python alias metadata (aliases, dtypes, constants) as JSON in TTree::GetUserInfo().
  • Patched read_tree() to:

    • Restore alias expressions and metadata from UserInfo JSON.
    • Maintain full alias context including constants and types.
  • Preserved full compatibility with the existing parquet export/load code.

  • Ensured Python remains the canonical representation; conversion is only needed for ROOT alias usage.
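The AST-based translation idea behind `convert_expr_to_root()` can be sketched as below. This minimal version only renames function calls (`arctan2 → atan2`, `mod → fmod`); the real method in the PR covers more constructs:

```python
import ast

# numpy-style function names and their ROOT/C equivalents
_FUNC_MAP = {"arctan2": "atan2", "mod": "fmod"}

class _RootRenamer(ast.NodeTransformer):
    def visit_Call(self, node):
        # rewrite nested calls first, then rename this call if it is mapped
        self.generic_visit(node)
        if isinstance(node.func, ast.Name) and node.func.id in _FUNC_MAP:
            node.func.id = _FUNC_MAP[node.func.id]
        return node

def to_root_expr(expr: str) -> str:
    # parse, rewrite function names, unparse back to a string (Python >= 3.9)
    return ast.unparse(_RootRenamer().visit(ast.parse(expr, mode="eval")))
```

Keeping Python as the canonical representation and translating only at export time, as the PR does, means the round-trip through ROOT never has to invert this mapping.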

miranov25 added 5 commits June 4, 2025 13:58
…verbosity

- Introduced `materialize_aliases(targets, cleanTemporary=True, verbose=False)` method:
  - Builds a dependency graph among defined aliases using NetworkX.
  - Topologically sorts dependencies to ensure correct materialization order.
  - Materializes only the requested aliases and their dependencies.
  - Optionally cleans up intermediate (temporary) columns not in the target list.
  - Includes verbose logging to trace evaluation and cleanup steps.
- Improves memory efficiency and control when working with layered alias chains.
- Ensures robust handling of mixed alias and non-alias columns.
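The dependency-ordering step can be sketched with NetworkX as below. The alias dictionary is taken from the example above; the substring-based dependency detection is a simplification (a real implementation tokenizes expressions instead):

```python
import networkx as nx

# A few aliases from the error-parameterization example (expressions abridged)
aliases = {
    "errz0a0": "0.35*sqrt(1+(mP4**2)/Beta2)",
    "cutA6": "abs(dz0_a0/errz0a0) > 6",
    "cutT": "cutA6 > 0",
}

# Edge dep -> name means "dep must be materialized before name"
g = nx.DiGraph()
g.add_nodes_from(aliases)
for name, expr in aliases.items():
    for dep in aliases:
        if dep != name and dep in expr:      # simplification: substring match
            g.add_edge(dep, name)

# Topological order = safe materialization order;
# a circular dependency would raise nx.NetworkXUnfeasible here.
order = list(nx.topological_sort(g))
```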
…ror handling

- Added tests for:
  * Circular dependency detection
  * Undefined alias symbols
  * Invalid expression syntax
  * Partial materialization logic
  * Subframe behavior with unregistered references
  * Improved save/load integrity checks with alias mean delta validation
  * Direct alias dictionary comparison after load

Known test failures to be addressed:
- Circular dependency not detected (ValueError not raised)
- Syntax error not caught (SyntaxError not raised)
- Undefined symbol not caught (Exception not raised)
- Partial materialization does not preserve dependency logic
- Subframe alias on unregistered frame does not raise NameError
miranov25 added 30 commits June 25, 2025 13:37
- Introduces per-channel, detector-agnostic model:
  X(Q,n) = a(q0,n) + b(q0,n)·(Q−q0), centered on Δq
- Defines inputs/outputs, fit steps, and monotonicity policy (b > b_min)
- Details nuisance-axis interpolation (linear/PCHIP) and uncertainty (σ_Q, σ_Q_irr)
- Provides API sketch (fit_quantile_linear_nd, QuantileEvaluator) and persistence (Parquet/Arrow/ROOT)
- Outlines unit tests, diagnostics, and performance expectations

Refs: calibration, multiplicity/flow estimator framework
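The Δq-centered per-channel model X(Q,n) = a(q0,n) + b(q0,n)·(Q−q0) reduces, within one window and ignoring nuisance axes, to a plain OLS fit. The sketch below uses synthetic data and illustrative names (`a_true`, `b_true`); the actual `fit_quantile_linear_nd` adds nuisance-axis binning and the b > b_min monotonicity policy:

```python
import numpy as np

# Synthetic ranks Q and observable X following X = a + b*(Q - q0) + noise
rng = np.random.default_rng(0)
Q = rng.uniform(0.0, 1.0, 20000)
a_true, b_true, q0, dq = 10.0, 4.0, 0.5, 0.2
X = a_true + b_true * (Q - q0) + rng.normal(0.0, 0.05, Q.size)

# Fit a, b by OLS on the window |Q - q0| <= dq, centered on q0
sel = np.abs(Q - q0) <= dq
A = np.column_stack([np.ones(sel.sum()), Q[sel] - q0])
(a_fit, b_fit), *_ = np.linalg.lstsq(A, X[sel], rcond=None)
```

Centering on q0 makes the intercept a directly interpretable as the observable at the window center, which is why the design note writes the model in Δq form.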
…er plots

- Added NumPy-style docstrings to df_draw_scatter and drawExample
…ench

- Introduces dfextensions/quantile_fit_nd:
  - quantile_fit_nd.py: per-channel ND fit, separable interpolation, evaluator, I/O
  - test_quantile_fit_nd.py: synthetic unit tests (uniform/poisson/gaussian, z nuisance)
  - bench_quantile_fit_nd.py: simple timing benchmark over N and distributions
- Uses Δq-centered model: X = a(q0,n) + b(q0,n)·(Q − q0)
- Enforces monotonicity with configurable b_min (auto/fixed)
- Outputs DataFrame (Parquet/Arrow/ROOT) with diagnostics and metadata
…ust edge expectations

- Define evaluator.invert_rank() with self-consistent candidate + fixed-point refinement
- Compute b(z) expectation by averaging b_true over sample per z-bin
- Relax sigma_Q tolerance to 0.25 (finite-window OLS)
- Update edge-case test to assert edge coverage instead of unrealistic 90% overall
…ngle-groupby warning

- Evaluator was treating 'q_center' as a nuisance axis (detected by *_center),
  causing axis misalignment and AxisError in moveaxis. Exclude it explicitly.
- When grouping by a single nuisance bin column, use scalar grouper to avoid
  pandas FutureWarning.
…b_min + stable inversion

- QuantileEvaluator: exclude 'q_center' from nuisance axes (fix AxisError in moveaxis)
- Groupby: use scalar grouper for single nuisance bin column (silence FutureWarning)
- Fit: compute b_min per |Q−q0|≤dq window (avoid over-clipping b in low-b regions)
- Inversion: implement self-consistent candidate + 2-step fixed-point refine (invert_rank)
- Keep API/metadata unchanged; prepare for ND nuisances and time
…(exclude IDE files)

- remove .idea/ from repo and add .gitignore
…d record reason

- Apply b_min only when a valid fit yields b<=0 (monotonicity enforcement)
- For low-Q-spread / low-N windows, keep NaN (no floor), add reason in fit_stats
- Greatly reduces bias in Poisson case; z-bin averages use informative windows only
- Use Q = F(k-1) + U*(F(k)-F(k-1)) for Poisson synthetic data
- Ensures continuous ranks and informative Δq windows
- Keeps fitter unchanged; diagnostics remain valid
- Explain continuous-Q assumption and discrete preprocessing (PIT/mid-ranks)
- Add utils: discrete_to_uniform_rank_poisson / _empirical for reuse
- Round-trip RMS is dominated by per-event noise → expect α_rt≈0 (flat), not −0.5
- Keep rms_b scaling check near −0.5 (loosen tol to ±0.2 across 5 N points)
- Clarify summary prints and expectations; leave constancy check only for rms_b·√N
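The randomized PIT Q = F(k−1) + U·(F(k) − F(k−1)) for Poisson counts can be written out directly. The PR ships `discrete_to_uniform_rank_poisson`; this hand-rolled version with an explicit Poisson CDF is for illustration only:

```python
import math
import numpy as np

def poisson_cdf(k, lam):
    # F(k) = exp(-lam) * sum_{i=0..k} lam^i / i!, with F(-1) = 0
    if k < 0:
        return 0.0
    return math.exp(-lam) * sum(lam**i / math.factorial(i) for i in range(k + 1))

def poisson_pit(k, lam, u):
    # Q = F(k-1) + U*(F(k) - F(k-1)): spreads each discrete count k
    # uniformly over its CDF step, yielding continuous U(0,1) ranks
    lo, hi = poisson_cdf(k - 1, lam), poisson_cdf(k, lam)
    return lo + u * (hi - lo)

rng = np.random.default_rng(1)
lam = 5.0
ks = rng.poisson(lam, 5000)
us = rng.uniform(0.0, 1.0, ks.size)
ranks = np.array([poisson_pit(int(k), lam, u) for k, u in zip(ks, us)])
```

Because the resulting ranks are continuous and uniform, every Δq window stays informative, which is exactly the motivation stated in the commit message above.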

PWGPP-643
- One-page snapshot of goals, assumptions, API, commands
- Documents discrete-input policy (PIT/mid-rank) and monotonicity
- Links code, tests, and benchmark usage with scaling expectations
PWGPP-643
- bench_groupby_regression.py: self-contained scenarios (clean/outliers, serial/parallel)
- Emits TXT and JSON (CSV optional) for easy doc inclusion and CI checks
- Uses y ~ x1 + x2 per-group via GroupByRegressor.make_parallel_fit
- Workaround for single-col group key (duplicate column for tuple keys)

Sample results show:
- ~1.75 s / 1k groups (serial clean, 50k rows, 10k groups)
- ~0.41 s / 1k groups with n_jobs=10 (≈4.3× speedup)
- Current y-shift outliers do not slow down OLS path (no refits triggered)
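The benchmark workload (per-group OLS of y ~ x1 + x2) can be reproduced generically; `GroupByRegressor.make_parallel_fit` lives in the repo, so the sketch below writes the per-group fit out directly with numpy on a small synthetic frame:

```python
import numpy as np
import pandas as pd

# Synthetic data: 200 groups of 5 rows (mirrors the 5 rows/group default)
rng = np.random.default_rng(2)
n_groups, rows_per_group = 200, 5
n = n_groups * rows_per_group
df = pd.DataFrame({
    "g": np.repeat(np.arange(n_groups), rows_per_group),
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
})
df["y"] = 1.0 + 2.0 * df["x1"] - 3.0 * df["x2"] + rng.normal(0.0, 0.01, n)

def fit_group(sub):
    # OLS with intercept: solve [1, x1, x2] @ beta = y in least squares
    A = np.column_stack([np.ones(len(sub)), sub["x1"], sub["x2"]])
    beta, *_ = np.linalg.lstsq(A, sub["y"].to_numpy(), rcond=None)
    return pd.Series(beta, index=["intercept", "b_x1", "b_x2"])

fits = df.groupby("g").apply(fit_group)   # one coefficient row per group
```

The serial/parallel split in the benchmark then amounts to running `fit_group` over groups either in a loop or across workers.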
…x Markdown tables

- Added new "Performance & Benchmarking" section describing benchmark usage, results, and interpretation
- Included CLion-compatible Markdown tables for output columns, example results, and recommendations
- Documented benchmark command line and sample outputs (50k rows / 10k groups)
- Clarified how sigmaCut and parallelization affect runtime
- Minor formatting and readability improvements across the file
- Default benchmark: 5 rows/group, 5k groups (faster, still representative)
- Added 30% outlier scenario to examples; clarified that response-only outliers
  don’t trigger slow robust re-fits
- Updated example tables for Mac and Linux with new per-1k-group timings
- (optional) bench CLI default --groups=5000
…erage-outlier plan

- Record new cross-platform results (Mac vs Linux) and observation that response-only outliers do not slow runtime
- Add action plan: leverage-outlier generator + refit counters + multi-target cost check
- Keep PR target; align benchmarks and docs with 5k/5 default
…iag_prefix)

- process_group_robust: record n_refits, frac_rejected, hat_max, cond_xtx, time_ms, n_rows (only when diag=True)
- make_parallel_fit: new args diag / diag_prefix (default off; no behavior change)
- add summarize_diagnostics(dfGB) helper for quick triage
… report

- Append scenario-wise diagnostics summary to benchmark_report.txt
- Save top-10 violators per scenario (time/refits) as CSVs
- Supports suffix-aware summarize_diagnostics() from GroupByRegressor
- Verified clean pytest and benchmark runs on real and synthetic data
…lidation

Added suffix-aware summarize_diagnostics + benchmark report integration

Confirmed robust re-fit loop in real datasets

Prepared next-phase plan for real-use-case profiling and fast-path study