Commit 8b0fc27

Merge remote-tracking branch 'upstream/main' into issue_61917
2 parents: b919b7e + 8476e0f

201 files changed: +4177 −2153 lines


.github/workflows/wheels.yml
Lines changed: 1 addition & 1 deletion

@@ -162,7 +162,7 @@ jobs:
         run: echo "sdist_name=$(cd ./dist && ls -d */)" >> "$GITHUB_ENV"

       - name: Build wheels
-        uses: pypa/cibuildwheel@v3.1.4
+        uses: pypa/cibuildwheel@v3.2.0
         with:
           package-dir: ./dist/${{ startsWith(matrix.buildplat[1], 'macosx') && env.sdist_name || needs.build_sdist.outputs.sdist_file }}
         env:

Dockerfile
Lines changed: 11 additions & 1 deletion

@@ -1,6 +1,9 @@
 FROM python:3.11.13
 WORKDIR /home/pandas

+# https://docs.docker.com/reference/dockerfile/#automatic-platform-args-in-the-global-scope
+ARG TARGETPLATFORM
+
 RUN apt-get update && \
     apt-get --no-install-recommends -y upgrade && \
     apt-get --no-install-recommends -y install \
@@ -13,7 +16,14 @@ RUN apt-get update && \
     rm -rf /var/lib/apt/lists/*

 COPY requirements-dev.txt /tmp
-RUN python -m pip install --no-cache-dir --upgrade pip && \
+
+RUN case "$TARGETPLATFORM" in \
+    linux/arm*) \
+        # Drop PyQt5 for ARM GH#61037
+        sed -i "/^pyqt5/Id" /tmp/requirements-dev.txt \
+        ;; \
+    esac && \
+    python -m pip install --no-cache-dir --upgrade pip && \
     python -m pip install --no-cache-dir -r /tmp/requirements-dev.txt
 RUN git config --global --add safe.directory /home/pandas
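
The sed invocation above deletes any requirements line that starts with "pyqt5", case-insensitively (the `I` flag). A rough Python equivalent of that filtering step, for illustration only and not part of the commit:

    import re

    def drop_pyqt5(requirements: str) -> str:
        # Mirrors `sed "/^pyqt5/Id"`: delete lines beginning with
        # "pyqt5" in any letter case, keep everything else.
        kept = [
            line
            for line in requirements.splitlines()
            if not re.match(r"pyqt5", line, flags=re.IGNORECASE)
        ]
        return "\n".join(kept)

    print(drop_pyqt5("numpy\nPyQt5>=5.15\npyqt5-sip\nscipy"))
    # numpy
    # scipy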

asv_bench/benchmarks/algorithms.py
Lines changed: 2 additions & 2 deletions

@@ -199,8 +199,8 @@ class SortIntegerArray:
     params = [10**3, 10**5]

     def setup(self, N):
-        data = np.arange(N, dtype=float)
-        data[40] = np.nan
+        data = np.arange(N, dtype=float).astype(object)
+        data[40] = pd.NA
         self.array = pd.array(data, dtype="Int64")

     def time_argsort(self, N):
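
This benchmark change, like the two that follow, builds nullable-dtype test data from object arrays holding pd.NA instead of float arrays holding np.nan, so the masked dtype's own missing-value marker is placed explicitly. A minimal sketch of the resulting construction, assuming a recent pandas:

    import numpy as np
    import pandas as pd

    # Object-dtype data can hold pd.NA, the missing-value marker used
    # natively by the nullable (masked) dtypes such as Int64.
    data = np.arange(10, dtype=float).astype(object)
    data[4] = pd.NA

    arr = pd.array(data, dtype="Int64")  # IntegerArray with a real mask
    arr.isna()  # True only at position 4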

asv_bench/benchmarks/frame_methods.py
Lines changed: 3 additions & 0 deletions

@@ -4,6 +4,7 @@
 import numpy as np

 from pandas import (
+    NA,
     DataFrame,
     Index,
     MultiIndex,
@@ -445,6 +446,8 @@ def setup(self, inplace, dtype):
         values[::2] = np.nan
         if dtype == "Int64":
             values = values.round()
+            values = values.astype(object)
+            values[::2] = NA
         self.df = DataFrame(values, dtype=dtype)
         self.fill_values = self.df.iloc[self.df.first_valid_index()].to_dict()

asv_bench/benchmarks/groupby.py
Lines changed: 4 additions & 0 deletions

@@ -689,6 +689,10 @@ def setup(self, dtype, method, with_nans):
         null_vals = vals.astype(float, copy=True)
         null_vals[::2, :] = np.nan
         null_vals[::3, :] = np.nan
+        if dtype in ["Int64", "Float64"]:
+            null_vals = null_vals.astype(object)
+            null_vals[::2, :] = NA
+            null_vals[::3, :] = NA
         df = DataFrame(null_vals, columns=list("abcde"), dtype=dtype)
         df["key"] = keys
         self.df = df

doc/source/getting_started/comparison/comparison_with_sql.rst
Lines changed: 36 additions & 0 deletions

@@ -270,6 +270,42 @@ column with another DataFrame's index.
     indexed_df2 = df2.set_index("key")
     pd.merge(df1, indexed_df2, left_on="key", right_index=True)

+:meth:`~pandas.merge` also supports joining on multiple columns by passing a list of column names.
+
+.. code-block:: sql
+
+    SELECT *
+    FROM df1_multi
+    INNER JOIN df2_multi
+      ON df1_multi.key1 = df2_multi.key1
+      AND df1_multi.key2 = df2_multi.key2;
+
+.. ipython:: python
+
+    df1_multi = pd.DataFrame({
+        "key1": ["A", "B", "C", "D"],
+        "key2": [1, 2, 3, 4],
+        "value": np.random.randn(4)
+    })
+    df2_multi = pd.DataFrame({
+        "key1": ["B", "D", "D", "E"],
+        "key2": [2, 4, 4, 5],
+        "value": np.random.randn(4)
+    })
+    pd.merge(df1_multi, df2_multi, on=["key1", "key2"])
+
+If the columns have different names between DataFrames, on can be replaced with left_on and
+right_on.
+
+.. ipython:: python
+
+    df2_multi = pd.DataFrame({
+        "key_1": ["B", "D", "D", "E"],
+        "key_2": [2, 4, 4, 5],
+        "value": np.random.randn(4)
+    })
+    pd.merge(df1_multi, df2_multi, left_on=["key1", "key2"], right_on=["key_1", "key_2"])
+
 LEFT OUTER JOIN
 ~~~~~~~~~~~~~~~

doc/source/reference/missing_value.rst
Lines changed: 0 additions & 2 deletions

@@ -11,14 +11,12 @@ NA is the way to represent missing values for nullable dtypes (see below):

 .. autosummary::
    :toctree: api/
-   :template: autosummary/class_without_autosummary.rst

    NA

 NaT is the missing value for timedelta and datetime data (see below):

 .. autosummary::
    :toctree: api/
-   :template: autosummary/class_without_autosummary.rst

    NaT

doc/source/user_guide/scale.rst
Lines changed: 23 additions & 18 deletions

@@ -164,35 +164,35 @@ files. Each file in the directory represents a different year of the entire data
 .. ipython:: python
    :okwarning:

-    import pathlib
+    import glob
+    import tempfile

     N = 12
     starts = [f"20{i:>02d}-01-01" for i in range(N)]
     ends = [f"20{i:>02d}-12-13" for i in range(N)]

-    pathlib.Path("data/timeseries").mkdir(exist_ok=True)
+    tmpdir = tempfile.TemporaryDirectory(ignore_cleanup_errors=True)

     for i, (start, end) in enumerate(zip(starts, ends)):
         ts = make_timeseries(start=start, end=end, freq="1min", seed=i)
-        ts.to_parquet(f"data/timeseries/ts-{i:0>2d}.parquet")
+        ts.to_parquet(f"{tmpdir.name}/ts-{i:0>2d}.parquet")


 ::

-    data
-    └── timeseries
-        ├── ts-00.parquet
-        ├── ts-01.parquet
-        ├── ts-02.parquet
-        ├── ts-03.parquet
-        ├── ts-04.parquet
-        ├── ts-05.parquet
-        ├── ts-06.parquet
-        ├── ts-07.parquet
-        ├── ts-08.parquet
-        ├── ts-09.parquet
-        ├── ts-10.parquet
-        └── ts-11.parquet
+    tmpdir
+    ├── ts-00.parquet
+    ├── ts-01.parquet
+    ├── ts-02.parquet
+    ├── ts-03.parquet
+    ├── ts-04.parquet
+    ├── ts-05.parquet
+    ├── ts-06.parquet
+    ├── ts-07.parquet
+    ├── ts-08.parquet
+    ├── ts-09.parquet
+    ├── ts-10.parquet
+    └── ts-11.parquet

 Now we'll implement an out-of-core :meth:`pandas.Series.value_counts`. The peak memory usage of this
 workflow is the single largest chunk, plus a small series storing the unique value
@@ -202,13 +202,18 @@ work for arbitrary-sized datasets.
 .. ipython:: python

     %%time
-    files = pathlib.Path("data/timeseries/").glob("ts*.parquet")
+    files = glob.iglob(f"{tmpdir.name}/ts*.parquet")
     counts = pd.Series(dtype=int)
     for path in files:
         df = pd.read_parquet(path)
         counts = counts.add(df["name"].value_counts(), fill_value=0)
     counts.astype(int)

+.. ipython:: python
+   :suppress:
+
+   tmpdir.cleanup()
+
 Some readers, like :meth:`pandas.read_csv`, offer parameters to control the
 ``chunksize`` when reading a single file.
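
The closing sentence of this section mentions chunked reading; a minimal sketch of that pattern, using a hypothetical file name (not part of the commit):

    import pandas as pd

    # read_csv with chunksize yields DataFrames of at most that many rows,
    # so only one chunk needs to be in memory at a time.
    counts = pd.Series(dtype=int)
    for chunk in pd.read_csv("timeseries.csv", chunksize=100_000):
        counts = counts.add(chunk["name"].value_counts(), fill_value=0)
    counts = counts.astype(int)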

doc/source/user_guide/text.rst
Lines changed: 1 addition & 1 deletion

@@ -75,7 +75,7 @@ or convert from existing pandas data:

 .. ipython:: python

-    s1 = pd.Series([1, 2, np.nan], dtype="Int64")
+    s1 = pd.Series([1, 2, pd.NA], dtype="Int64")
     s1
     s2 = s1.astype("string")
     s2
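
As in the benchmark changes above, pd.NA is the native missing-value marker for the nullable Int64 dtype, so the documentation example now constructs the missing entry with it directly. A quick sketch of what the snippet produces, assuming a recent pandas:

    import pandas as pd

    s1 = pd.Series([1, 2, pd.NA], dtype="Int64")
    s2 = s1.astype("string")
    # The missing value survives the cast: s2 has StringDtype and
    # still reports exactly one NA.
    s2.isna().sum()  # 1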

doc/source/whatsnew/v0.21.0.rst
Lines changed: 6 additions & 11 deletions

@@ -635,22 +635,17 @@ Previous behavior:

 New behavior:

-.. code-block:: ipython
+.. ipython:: python

-   In [1]: pi = pd.period_range('2017-01', periods=12, freq='M')
+   pi = pd.period_range('2017-01', periods=12, freq='M')

-   In [2]: s = pd.Series(np.arange(12), index=pi)
+   s = pd.Series(np.arange(12), index=pi)

-   In [3]: resampled = s.resample('2Q').mean()
+   resampled = s.resample('2Q').mean()

-   In [4]: resampled
-   Out[4]:
-   2017Q1    2.5
-   2017Q3    8.5
-   Freq: 2Q-DEC, dtype: float64
+   resampled

-   In [5]: resampled.index
-   Out[5]: PeriodIndex(['2017Q1', '2017Q3'], dtype='period[2Q-DEC]')
+   resampled.index

 Upsampling and calling ``.ohlc()`` previously returned a ``Series``, basically identical to calling ``.asfreq()``. OHLC upsampling now returns a DataFrame with columns ``open``, ``high``, ``low`` and ``close`` (:issue:`13083`). This is consistent with downsampling and ``DatetimeIndex`` behavior.
