Fix pivot_table corruption with large datasets in Python 3.14 #63316
Closes #63314
Description
This PR fixes a critical bug where `pivot_table()` produces corrupted output with duplicate index values when processing large datasets under Python 3.14.

Problem
When pivoting ~100,000 rows in Python 3.14, the result contained only ~33,334 unique index values instead of 100,000, with duplicate index entries.
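A minimal reproduction along the lines described above (the column names and aggregation are illustrative assumptions, not taken from the original report):

```python
import numpy as np
import pandas as pd

n = 100_000
df = pd.DataFrame({
    "key": np.arange(n),            # 100,000 distinct index keys
    "col": np.zeros(n, dtype=int),  # single pivot column
    "val": np.random.rand(n),
})

result = df.pivot_table(index="key", columns="col", values="val")

# With the bug, result.index held only ~33,334 unique values with
# duplicates; a correct pivot preserves all 100,000 keys exactly once.
assert result.index.nunique() == n
assert not result.index.duplicated().any()
```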
Root Cause
The
compress_group_indexfunction inpandas/core/sorting.pywas usingInt64HashTable.get_labels_groupby()which produces incorrect results in Python 3.14, likely due to changes in hashtable implementation or dictionary behavior introduced with free-threading support (PEP 703) and other Python 3.14 improvements.Solution
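For context, `compress_group_index` collapses an array of raw group ids into dense consecutive labels plus the list of observed ids. A simplified NumPy model of that behavior (an assumption for illustration, not the actual pandas implementation):

```python
import numpy as np

def compress_group_index_sketch(group_index: np.ndarray):
    """Map raw group ids onto dense labels 0..k-1.

    Simplified model of pandas' compress_group_index; the real
    implementation uses a hashtable rather than np.unique.
    """
    # return_inverse yields, for each row, the position of its id
    # in the sorted unique-id array -- i.e. a dense group label.
    obs_ids, comp_labels = np.unique(group_index, return_inverse=True)
    return comp_labels, obs_ids

labels, obs = compress_group_index_sketch(np.array([10, 10, 7, 10, 3]))
# obs    -> [3, 7, 10]      (unique ids observed)
# labels -> [2, 2, 1, 2, 0] (dense label per row)
```

If this compression step returns wrong labels, distinct groups get merged, which is consistent with the duplicate-index symptom described above.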
Modified `compress_group_index` to handle Python 3.14+ correctly.

Changes

- Updated the `compress_group_index()` function in `pandas/core/sorting.py` to handle Python 3.14+
- Added a regression test, `test_pivot_table_large_dataset_no_duplicates()`

Testing
Added `test_pivot_table_large_dataset_no_duplicates()`, which pivots a large dataset and verifies that the result contains no duplicate index values. The fix has been tested to ensure backward compatibility with Python <3.14.
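A sketch of what the regression test might look like (the original test body is not shown in this description, so the data construction and assertions here are assumptions based on the reported symptom):

```python
import numpy as np
import pandas as pd

def test_pivot_table_large_dataset_no_duplicates():
    # Build a frame large enough to exercise the reported failure mode.
    n = 100_000
    df = pd.DataFrame({
        "idx": np.arange(n),
        "col": ["a"] * n,
        "val": np.ones(n),
    })
    result = df.pivot_table(index="idx", columns="col", values="val")
    # The bug yielded only ~33,334 unique index values with duplicates.
    assert len(result.index) == n
    assert result.index.is_unique

test_pivot_table_large_dataset_no_duplicates()
```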