@@ -18,36 +18,9 @@ tool for all situations. If you're working with very large datasets and a tool
1818like PostgreSQL fits your needs, then you should probably be using that.
1919Assuming you want or need the expressiveness and power of pandas, let's carry on.
2020
21- .. ipython:: python
22-
23-     import pandas as pd
24-     import numpy as np
25-
26- .. ipython:: python
27-     :suppress:
28-
29-     from pandas._testing import _make_timeseries
30-
31-     # Make a random in-memory dataset
32-     ts = _make_timeseries(freq="30S", seed=0)
33-     ts.to_csv("timeseries.csv")
34-     ts.to_parquet("timeseries.parquet")
36-
3721Load less data
3822--------------
3923
40- .. ipython:: python
41-     :suppress:
42-
43-     # make a similar dataset with many columns
44-     timeseries = [
45-         _make_timeseries(freq="1T", seed=i).rename(columns=lambda x: f"{x}_{i}")
46-         for i in range(10)
47-     ]
48-     ts_wide = pd.concat(timeseries, axis=1)
49-     ts_wide.to_parquet("timeseries_wide.parquet")
50-
5124Suppose our raw dataset on disk has many columns::
5225
5326 id_0 name_0 x_0 y_0 id_1 name_1 x_1 ... name_8 x_8 y_8 id_9 name_9 x_9 y_9
@@ -66,6 +39,34 @@ Suppose our raw dataset on disk has many columns::
6639
6740 [525601 rows x 40 columns]
6841
42+ That can be generated by the following code snippet:
43+
44+ .. ipython:: python
45+
46+     import pandas as pd
47+     import numpy as np
48+
49+     def make_timeseries(start="2000-01-01", end="2000-12-31", freq="1D", seed=None):
50+         index = pd.date_range(start=start, end=end, freq=freq, name="timestamp")
51+         n = len(index)
52+         state = np.random.RandomState(seed)
53+         columns = {
54+             "name": state.choice(["Alice", "Bob", "Charlie"], size=n),
55+             "id": state.poisson(1000, size=n),
56+             "x": state.rand(n) * 2 - 1,
57+             "y": state.rand(n) * 2 - 1,
58+         }
59+         df = pd.DataFrame(columns, index=index, columns=sorted(columns))
60+         if df.index[-1] == end:
61+             df = df.iloc[:-1]
62+         return df
63+
64+     timeseries = [
65+         make_timeseries(freq="1T", seed=i).rename(columns=lambda x: f"{x}_{i}")
66+         for i in range(10)
67+     ]
68+     ts_wide = pd.concat(timeseries, axis=1)
69+     ts_wide.to_parquet("timeseries_wide.parquet")
6970
7071 To load the columns we want, we have two options.
7172Option 1 loads in all the data and then filters to what we need.
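
As a rough sketch of the two approaches (the ``columns`` list here is an assumption; parquet readers accept a ``columns`` argument), the code might look like:

.. ipython:: python

    columns = ["id_0", "name_0", "x_0", "y_0"]

    # Option 1: read everything, then subset; peak memory holds all 40 columns
    option1 = pd.read_parquet("timeseries_wide.parquet")[columns]

    # Option 2: ask the reader for only the columns we need
    option2 = pd.read_parquet("timeseries_wide.parquet", columns=columns)
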
@@ -99,6 +100,8 @@ can store larger datasets in memory.
99100
100101.. ipython:: python
101102
103+ ts = make_timeseries(freq="30S", seed=0)
104+ ts.to_parquet("timeseries.parquet")
102105 ts = pd.read_parquet("timeseries.parquet")
103106 ts
104107
@@ -116,7 +119,7 @@ attention.
116119
117120 The ``name`` column is taking up much more memory than any other. It has just a
118121few unique values, so it's a good candidate for converting to a
119- :class:`Categorical`. With a Categorical, we store each unique name once and use
122+ :class:`pandas.Categorical`. With a :class:`pandas.Categorical`, we store each unique name once and use
120123space-efficient integers to know which specific name is used in each row.
121124
122125
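A minimal sketch of that conversion, and of checking its effect, assuming the ``ts`` frame loaded above:

.. ipython:: python

    ts2 = ts.copy()
    ts2["name"] = ts2["name"].astype("category")
    ts2.memory_usage(deep=True)
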
@@ -147,7 +150,7 @@ using :func:`pandas.to_numeric`.
147150 In all, we've reduced the in-memory footprint of this dataset to 1/5 of its
148151original size.
149152
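The reduction comes from combining the categorical conversion with downcasting the numeric columns via :func:`pandas.to_numeric`; continuing the ``ts2`` sketch above (column names assumed from the earlier example):

.. ipython:: python

    ts2["id"] = pd.to_numeric(ts2["id"], downcast="unsigned")
    ts2[["x", "y"]] = ts2[["x", "y"]].apply(pd.to_numeric, downcast="float")
    ts2.memory_usage(deep=True)
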
150- See :ref:`categorical` for more on ``Categorical`` and :ref:`basics.dtypes`
153+ See :ref:`categorical` for more on :class:`pandas.Categorical` and :ref:`basics.dtypes`
151154for an overview of all of pandas' dtypes.
152155
153156Use chunking
@@ -168,7 +171,6 @@ Suppose we have an even larger "logical dataset" on disk that's a directory of p
168171files. Each file in the directory represents a different year of the entire dataset.
169172
170173 .. ipython:: python
171- :suppress:
172174
173175 import pathlib
174176
@@ -179,7 +181,7 @@ files. Each file in the directory represents a different year of the entire data
179181 pathlib.Path("data/timeseries").mkdir(exist_ok=True)
180182
181183 for i, (start, end) in enumerate(zip(starts, ends)):
182-     ts = _make_timeseries(start=start, end=end, freq="1T", seed=i)
184+     ts = make_timeseries(start=start, end=end, freq="1T", seed=i)
183185     ts.to_parquet(f"data/timeseries/ts-{i:0>2d}.parquet")
184186
185187
@@ -200,7 +202,7 @@ files. Each file in the directory represents a different year of the entire data
200202 ├── ts-10.parquet
201203 └── ts-11.parquet
202204
203- Now we'll implement an out-of-core ``value_counts``. The peak memory usage of this
205+ Now we'll implement an out-of-core :meth:`pandas.Series.value_counts`. The peak memory usage of this
204206workflow is the single largest chunk, plus a small series storing the unique value
205207counts up to this point. As long as each individual file fits in memory, this will
206208work for arbitrary-sized datasets.
@@ -211,17 +213,15 @@ work for arbitrary-sized datasets.
211213 files = pathlib.Path("data/timeseries/").glob("ts*.parquet")
212214 counts = pd.Series(dtype=int)
213215 for path in files:
214-     # Only one dataframe is in memory at a time...
215216     df = pd.read_parquet(path)
216-     # ... plus a small Series ``counts``, which is updated.
217217     counts = counts.add(df["name"].value_counts(), fill_value=0)
218218 counts.astype(int)
219219
220220 Some readers, like :meth:`pandas.read_csv`, offer parameters to control the
221221``chunksize`` when reading a single file.
222222
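For instance, :meth:`pandas.read_csv` with ``chunksize`` returns an iterator of smaller frames; a sketch (the CSV path here is assumed):

.. ipython:: python

    reader = pd.read_csv("timeseries.csv", chunksize=100_000)
    counts = pd.Series(dtype=int)
    for chunk in reader:
        # each chunk is a regular pandas DataFrame of up to 100,000 rows
        counts = counts.add(chunk["name"].value_counts(), fill_value=0)
    counts.astype(int)
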
223223Manually chunking is an OK option for workflows that don't
224- require too sophisticated of operations. Some operations, like ``groupby``, are
224+ require overly sophisticated operations. Some operations, like :meth:`pandas.DataFrame.groupby`, are
225225much harder to do chunkwise. In these cases, you may be better off switching to a
226226different library that implements these out-of-core algorithms for you.
227227
@@ -259,7 +259,7 @@ Inspecting the ``ddf`` object, we see a few things
259259* There are new attributes like ``.npartitions`` and ``.divisions``
260260
261261The partitions and divisions are how Dask parallelizes computation. A **Dask**
262- DataFrame is made up of many pandas DataFrames. A single method call on a
262+ DataFrame is made up of many pandas :class:`pandas.DataFrame` objects. A single method call on a
263263Dask DataFrame ends up making many pandas method calls, and Dask knows how to
264264coordinate everything to get the result.
265265
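As a sketch (assuming ``ddf`` is built with ``dask.dataframe.read_parquet`` over the directory written earlier), you can inspect the partition structure and see that method calls stay lazy:

.. ipython:: python

    import dask.dataframe as dd

    ddf = dd.read_parquet("data/timeseries/")
    ddf.npartitions
    ddf.divisions
    ddf["name"]  # still lazy: a Dask Series backed by many pandas pieces
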
@@ -283,8 +283,8 @@ Rather than executing immediately, doing operations build up a **task graph**.
283283
284284 Each of these calls is instant because the result isn't being computed yet.
285285We're just building up a list of computations to do when someone needs the
286- result. Dask knows that the return type of a ``pandas.Series.value_counts``
287- is a pandas Series with a certain dtype and a certain name. So the Dask version
286+ result. Dask knows that the return type of a :meth:`pandas.Series.value_counts`
287+ is a pandas :class:`pandas.Series` with a certain dtype and a certain name. So the Dask version
288288returns a Dask Series with the same dtype and the same name.
289289
290290To get the actual result you can call ``.compute()``.
@@ -294,13 +294,13 @@ To get the actual result you can call ``.compute()``.
294294 %time ddf["name"].value_counts().compute()
295295
296296 At that point, you get back the same thing you'd get with pandas, in this case
297- a concrete pandas Series with the count of each ``name``.
297+ a concrete pandas :class:`pandas.Series` with the count of each ``name``.
298298
299299Calling ``.compute`` causes the full task graph to be executed. This includes
300300reading the data, selecting the columns, and doing the ``value_counts``. The
301301execution is done *in parallel* where possible, and Dask tries to keep the
302302overall memory footprint small. You can work with datasets that are much larger
303- than memory, as long as each partition (a regular pandas DataFrame) fits in memory.
303+ than memory, as long as each partition (a regular pandas :class:`pandas.DataFrame`) fits in memory.
304304
305305By default, ``dask.dataframe`` operations use a threadpool to do operations in
306306parallel. We can also connect to a cluster to distribute the work on many