
Commit 7abddcd

DOC: Improve scale.rst (#45385)
1 parent 43245c9 commit 7abddcd

File tree: 2 files changed (40 additions, 121 deletions)

  doc/source/user_guide/scale.rst
  pandas/_testing/__init__.py

doc/source/user_guide/scale.rst

Lines changed: 40 additions & 40 deletions
@@ -18,36 +18,9 @@ tool for all situations. If you're working with very large datasets and a tool
 like PostgreSQL fits your needs, then you should probably be using that.
 Assuming you want or need the expressiveness and power of pandas, let's carry on.
 
-.. ipython:: python
-
-    import pandas as pd
-    import numpy as np
-
-.. ipython:: python
-    :suppress:
-
-    from pandas._testing import _make_timeseries
-
-    # Make a random in-memory dataset
-    ts = _make_timeseries(freq="30S", seed=0)
-    ts.to_csv("timeseries.csv")
-    ts.to_parquet("timeseries.parquet")
-
-
 Load less data
 --------------
 
-.. ipython:: python
-    :suppress:
-
-    # make a similar dataset with many columns
-    timeseries = [
-        _make_timeseries(freq="1T", seed=i).rename(columns=lambda x: f"{x}_{i}")
-        for i in range(10)
-    ]
-    ts_wide = pd.concat(timeseries, axis=1)
-    ts_wide.to_parquet("timeseries_wide.parquet")
-
 Suppose our raw dataset on disk has many columns::
 
     id_0 name_0 x_0 y_0 id_1 name_1 x_1 ... name_8 x_8 y_8 id_9 name_9 x_9 y_9
@@ -66,6 +39,34 @@ Suppose our raw dataset on disk has many columns::
 
     [525601 rows x 40 columns]
 
+That can be generated by the following code snippet:
+
+.. ipython:: python
+
+    import pandas as pd
+    import numpy as np
+
+    def make_timeseries(start="2000-01-01", end="2000-12-31", freq="1D", seed=None):
+        index = pd.date_range(start=start, end=end, freq=freq, name="timestamp")
+        n = len(index)
+        state = np.random.RandomState(seed)
+        columns = {
+            "name": state.choice(["Alice", "Bob", "Charlie"], size=n),
+            "id": state.poisson(1000, size=n),
+            "x": state.rand(n) * 2 - 1,
+            "y": state.rand(n) * 2 - 1,
+        }
+        df = pd.DataFrame(columns, index=index, columns=sorted(columns))
+        if df.index[-1] == end:
+            df = df.iloc[:-1]
+        return df
+
+    timeseries = [
+        make_timeseries(freq="1T", seed=i).rename(columns=lambda x: f"{x}_{i}")
+        for i in range(10)
+    ]
+    ts_wide = pd.concat(timeseries, axis=1)
+    ts_wide.to_parquet("timeseries_wide.parquet")
 
 To load the columns we want, we have two options.
 Option 1 loads in all the data and then filters to what we need.
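For readers skimming the diff, the two options differ little in code but a lot in peak memory. A minimal sketch, assuming the ``timeseries_wide.parquet`` file written by the snippet above and an illustrative column subset::

    import pandas as pd

    columns = ["id_0", "name_0", "x_0", "y_0"]  # illustrative subset

    # Option 1: read all 40 columns, then keep four; peak memory holds everything.
    df = pd.read_parquet("timeseries_wide.parquet")[columns]

    # Option 2: ask the reader to materialize only the requested columns.
    df = pd.read_parquet("timeseries_wide.parquet", columns=columns)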
@@ -99,6 +100,8 @@ can store larger datasets in memory.
 
 .. ipython:: python
 
+    ts = make_timeseries(freq="30S", seed=0)
+    ts.to_parquet("timeseries.parquet")
     ts = pd.read_parquet("timeseries.parquet")
     ts
 
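The "attention" referred to in the next hunk header comes from inspecting dtypes and per-column memory. A short sketch, reusing the ``ts`` frame loaded above::

    # Object-dtype columns (here "name") usually dominate the footprint.
    ts.dtypes
    ts.memory_usage(deep=True)  # deep=True counts the string payloads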
@@ -116,7 +119,7 @@ attention.
 
 The ``name`` column is taking up much more memory than any other. It has just a
 few unique values, so it's a good candidate for converting to a
-:class:`Categorical`. With a Categorical, we store each unique name once and use
+:class:`pandas.Categorical`. With a :class:`pandas.Categorical`, we store each unique name once and use
 space-efficient integers to know which specific name is used in each row.
 
 
@@ -147,7 +150,7 @@ using :func:`pandas.to_numeric`.
 In all, we've reduced the in-memory footprint of this dataset to 1/5 of its
 original size.
 
-See :ref:`categorical` for more on ``Categorical`` and :ref:`basics.dtypes`
+See :ref:`categorical` for more on :class:`pandas.Categorical` and :ref:`basics.dtypes`
 for an overview of all of pandas' dtypes.
 
 Use chunking
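The two dtype changes discussed in the hunks above, converting ``name`` to :class:`pandas.Categorical` and downcasting the numeric columns with :func:`pandas.to_numeric`, look roughly like this. A sketch, with ``ts2`` as a hypothetical working copy of ``ts``::

    # Store each unique name once plus small integer codes per row.
    ts2 = ts.copy()
    ts2["name"] = ts2["name"].astype("category")

    # Downcast the ids and keep x/y at single precision.
    ts2["id"] = pd.to_numeric(ts2["id"], downcast="unsigned")
    ts2[["x", "y"]] = ts2[["x", "y"]].apply(pd.to_numeric, downcast="float")

    ts2.memory_usage(deep=True)  # compare against ts.memory_usage(deep=True)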
@@ -168,7 +171,6 @@ Suppose we have an even larger "logical dataset" on disk that's a directory of p
 files. Each file in the directory represents a different year of the entire dataset.
 
 .. ipython:: python
-    :suppress:
 
     import pathlib
 
@@ -179,7 +181,7 @@ files. Each file in the directory represents a different year of the entire data
     pathlib.Path("data/timeseries").mkdir(exist_ok=True)
 
     for i, (start, end) in enumerate(zip(starts, ends)):
-        ts = _make_timeseries(start=start, end=end, freq="1T", seed=i)
+        ts = make_timeseries(start=start, end=end, freq="1T", seed=i)
         ts.to_parquet(f"data/timeseries/ts-{i:0>2d}.parquet")
 
 
@@ -200,7 +202,7 @@ files. Each file in the directory represents a different year of the entire data
         ├── ts-10.parquet
         └── ts-11.parquet
 
-Now we'll implement an out-of-core ``value_counts``. The peak memory usage of this
+Now we'll implement an out-of-core :meth:`pandas.Series.value_counts`. The peak memory usage of this
 workflow is the single largest chunk, plus a small series storing the unique value
 counts up to this point. As long as each individual file fits in memory, this will
 work for arbitrary-sized datasets.
@@ -211,17 +213,15 @@ work for arbitrary-sized datasets.
     files = pathlib.Path("data/timeseries/").glob("ts*.parquet")
     counts = pd.Series(dtype=int)
     for path in files:
-        # Only one dataframe is in memory at a time...
         df = pd.read_parquet(path)
-        # ... plus a small Series ``counts``, which is updated.
         counts = counts.add(df["name"].value_counts(), fill_value=0)
     counts.astype(int)
 
 Some readers, like :meth:`pandas.read_csv`, offer parameters to control the
 ``chunksize`` when reading a single file.
 
 Manually chunking is an OK option for workflows that don't
-require too sophisticated of operations. Some operations, like ``groupby``, are
+require too sophisticated of operations. Some operations, like :meth:`pandas.DataFrame.groupby`, are
 much harder to do chunkwise. In these cases, you may be better switching to a
 different library that implements these out-of-core algorithms for you.
 
@@ -259,7 +259,7 @@ Inspecting the ``ddf`` object, we see a few things
 * There are new attributes like ``.npartitions`` and ``.divisions``
 
 The partitions and divisions are how Dask parallelizes computation. A **Dask**
-DataFrame is made up of many pandas DataFrames. A single method call on a
+DataFrame is made up of many pandas :class:`pandas.DataFrame`. A single method call on a
 Dask DataFrame ends up making many pandas method calls, and Dask knows how to
 coordinate everything to get the result.
 
@@ -283,8 +283,8 @@ Rather than executing immediately, doing operations build up a **task graph**.
 
 Each of these calls is instant because the result isn't being computed yet.
 We're just building up a list of computation to do when someone needs the
-result. Dask knows that the return type of a ``pandas.Series.value_counts``
-is a pandas Series with a certain dtype and a certain name. So the Dask version
+result. Dask knows that the return type of a :class:`pandas.Series.value_counts`
+is a pandas :class:`pandas.Series` with a certain dtype and a certain name. So the Dask version
 returns a Dask Series with the same dtype and the same name.
 
 To get the actual result you can call ``.compute()``.
@@ -294,13 +294,13 @@ To get the actual result you can call ``.compute()``.
     %time ddf["name"].value_counts().compute()
 
 At that point, you get back the same thing you'd get with pandas, in this case
-a concrete pandas Series with the count of each ``name``.
+a concrete pandas :class:`pandas.Series` with the count of each ``name``.
 
 Calling ``.compute`` causes the full task graph to be executed. This includes
 reading the data, selecting the columns, and doing the ``value_counts``. The
 execution is done *in parallel* where possible, and Dask tries to keep the
 overall memory footprint small. You can work with datasets that are much larger
-than memory, as long as each partition (a regular pandas DataFrame) fits in memory.
+than memory, as long as each partition (a regular pandas :class:`pandas.DataFrame`) fits in memory.
 
 By default, ``dask.dataframe`` operations use a threadpool to do operations in
 parallel. We can also connect to a cluster to distribute the work on many
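Putting the Dask pieces from these hunks together, the lazy-then-compute flow looks roughly like this. A sketch, assuming ``dask.dataframe`` is installed and the ``data/timeseries`` directory from the chunking section exists::

    import dask.dataframe as dd

    # One Dask DataFrame backed by many pandas DataFrames, one per parquet file.
    ddf = dd.read_parquet("data/timeseries/")

    # Lazy: this only builds a task graph, nothing is read or computed yet.
    lazy_counts = ddf["name"].value_counts()

    # .compute() executes the graph and returns a concrete pandas Series.
    result = lazy_counts.compute()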

pandas/_testing/__init__.py

Lines changed: 0 additions & 81 deletions
@@ -385,87 +385,6 @@ def makeMultiIndex(k=10, names=None, **kwargs):
     return mi[:k]
 
 
-_names = [
-    "Alice",
-    "Bob",
-    "Charlie",
-    "Dan",
-    "Edith",
-    "Frank",
-    "George",
-    "Hannah",
-    "Ingrid",
-    "Jerry",
-    "Kevin",
-    "Laura",
-    "Michael",
-    "Norbert",
-    "Oliver",
-    "Patricia",
-    "Quinn",
-    "Ray",
-    "Sarah",
-    "Tim",
-    "Ursula",
-    "Victor",
-    "Wendy",
-    "Xavier",
-    "Yvonne",
-    "Zelda",
-]
-
-
-def _make_timeseries(start="2000-01-01", end="2000-12-31", freq="1D", seed=None):
-    """
-    Make a DataFrame with a DatetimeIndex
-
-    Parameters
-    ----------
-    start : str or Timestamp, default "2000-01-01"
-        The start of the index. Passed to date_range with `freq`.
-    end : str or Timestamp, default "2000-12-31"
-        The end of the index. Passed to date_range with `freq`.
-    freq : str or Freq
-        The frequency to use for the DatetimeIndex
-    seed : int, optional
-        The random state seed.
-
-        * name : object dtype with string names
-        * id : int dtype with
-        * x, y : float dtype
-
-    Examples
-    --------
-    >>> _make_timeseries()  # doctest: +SKIP
-                  id    name         x         y
-    timestamp
-    2000-01-01   982   Frank  0.031261  0.986727
-    2000-01-02  1025   Edith -0.086358 -0.032920
-    2000-01-03   982   Edith  0.473177  0.298654
-    2000-01-04  1009   Sarah  0.534344 -0.750377
-    2000-01-05   963   Zelda -0.271573  0.054424
-    ...          ...     ...       ...       ...
-    2000-12-27   980  Ingrid -0.132333 -0.422195
-    2000-12-28   972   Frank -0.376007 -0.298687
-    2000-12-29  1009  Ursula -0.865047 -0.503133
-    2000-12-30  1000  Hannah -0.063757 -0.507336
-    2000-12-31   972     Tim -0.869120  0.531685
-    """
-    index = pd.date_range(start=start, end=end, freq=freq, name="timestamp")
-    n = len(index)
-    state = np.random.RandomState(seed)
-    columns = {
-        "name": state.choice(_names, size=n),
-        "id": state.poisson(1000, size=n),
-        "x": state.rand(n) * 2 - 1,
-        "y": state.rand(n) * 2 - 1,
-    }
-    df = DataFrame(columns, index=index, columns=sorted(columns))
-    if df.index[-1] == end:
-        df = df.iloc[:-1]
-    return df
-
-
 def index_subclass_makers_generator():
     make_index_funcs = [
         makeDateIndex,