
Commit 7abddcd

DOC: Improve scale.rst (#45385)
1 parent 43245c9 commit 7abddcd

File tree: 2 files changed (40 additions, 121 deletions)

  doc/source/user_guide/scale.rst
  pandas/_testing/__init__.py

doc/source/user_guide/scale.rst

Lines changed: 40 additions & 40 deletions
@@ -18,36 +18,9 @@ tool for all situations. If you're working with very large datasets and a tool
 like PostgreSQL fits your needs, then you should probably be using that.
 Assuming you want or need the expressiveness and power of pandas, let's carry on.
 
-.. ipython:: python
-
-    import pandas as pd
-    import numpy as np
-
-.. ipython:: python
-    :suppress:
-
-    from pandas._testing import _make_timeseries
-
-    # Make a random in-memory dataset
-    ts = _make_timeseries(freq="30S", seed=0)
-    ts.to_csv("timeseries.csv")
-    ts.to_parquet("timeseries.parquet")
-
-
 Load less data
 --------------
 
-.. ipython:: python
-    :suppress:
-
-    # make a similar dataset with many columns
-    timeseries = [
-        _make_timeseries(freq="1T", seed=i).rename(columns=lambda x: f"{x}_{i}")
-        for i in range(10)
-    ]
-    ts_wide = pd.concat(timeseries, axis=1)
-    ts_wide.to_parquet("timeseries_wide.parquet")
-
 Suppose our raw dataset on disk has many columns::
 
     id_0 name_0 x_0 y_0 id_1 name_1 x_1 ... name_8 x_8 y_8 id_9 name_9 x_9 y_9
@@ -66,6 +39,34 @@ Suppose our raw dataset on disk has many columns::
 
     [525601 rows x 40 columns]
 
+That can be generated by the following code snippet:
+
+.. ipython:: python
+
+    import pandas as pd
+    import numpy as np
+
+    def make_timeseries(start="2000-01-01", end="2000-12-31", freq="1D", seed=None):
+        index = pd.date_range(start=start, end=end, freq=freq, name="timestamp")
+        n = len(index)
+        state = np.random.RandomState(seed)
+        columns = {
+            "name": state.choice(["Alice", "Bob", "Charlie"], size=n),
+            "id": state.poisson(1000, size=n),
+            "x": state.rand(n) * 2 - 1,
+            "y": state.rand(n) * 2 - 1,
+        }
+        df = pd.DataFrame(columns, index=index, columns=sorted(columns))
+        if df.index[-1] == end:
+            df = df.iloc[:-1]
+        return df
+
+    timeseries = [
+        make_timeseries(freq="1T", seed=i).rename(columns=lambda x: f"{x}_{i}")
+        for i in range(10)
+    ]
+    ts_wide = pd.concat(timeseries, axis=1)
+    ts_wide.to_parquet("timeseries_wide.parquet")
 
 To load the columns we want, we have two options.
 Option 1 loads in all the data and then filters to what we need.
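For readers skimming the diff, the two options differ little in code but a lot in peak memory. A minimal sketch, assuming the ``timeseries_wide.parquet`` file written by the snippet above and an illustrative column subset::

    import pandas as pd

    columns = ["id_0", "name_0", "x_0", "y_0"]  # illustrative subset

    # Option 1: read all 40 columns, then keep four; peak memory holds everything.
    df = pd.read_parquet("timeseries_wide.parquet")[columns]

    # Option 2: ask the reader to materialize only the requested columns.
    df = pd.read_parquet("timeseries_wide.parquet", columns=columns)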
@@ -99,6 +100,8 @@ can store larger datasets in memory.
 
 .. ipython:: python
 
+    ts = make_timeseries(freq="30S", seed=0)
+    ts.to_parquet("timeseries.parquet")
     ts = pd.read_parquet("timeseries.parquet")
     ts
 
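The "attention" referred to in the next hunk header comes from inspecting dtypes and per-column memory. A short sketch, reusing the ``ts`` frame loaded above::

    # Object-dtype columns (here "name") usually dominate the footprint.
    ts.dtypes
    ts.memory_usage(deep=True)  # deep=True counts the string payloads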
@@ -116,7 +119,7 @@ attention.
 
 The ``name`` column is taking up much more memory than any other. It has just a
 few unique values, so it's a good candidate for converting to a
-:class:`Categorical`. With a Categorical, we store each unique name once and use
+:class:`pandas.Categorical`. With a :class:`pandas.Categorical`, we store each unique name once and use
 space-efficient integers to know which specific name is used in each row.
 
 
@@ -147,7 +150,7 @@ using :func:`pandas.to_numeric`.
 In all, we've reduced the in-memory footprint of this dataset to 1/5 of its
 original size.
 
-See :ref:`categorical` for more on ``Categorical`` and :ref:`basics.dtypes`
+See :ref:`categorical` for more on :class:`pandas.Categorical` and :ref:`basics.dtypes`
 for an overview of all of pandas' dtypes.
 
 Use chunking
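The two dtype changes discussed in the hunks above, converting ``name`` to :class:`pandas.Categorical` and downcasting the numeric columns with :func:`pandas.to_numeric`, look roughly like this. A sketch, with ``ts2`` as a hypothetical working copy of ``ts``::

    # Store each unique name once plus small integer codes per row.
    ts2 = ts.copy()
    ts2["name"] = ts2["name"].astype("category")

    # Downcast the ids and keep x/y at single precision.
    ts2["id"] = pd.to_numeric(ts2["id"], downcast="unsigned")
    ts2[["x", "y"]] = ts2[["x", "y"]].apply(pd.to_numeric, downcast="float")

    ts2.memory_usage(deep=True)  # compare against ts.memory_usage(deep=True)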
@@ -168,7 +171,6 @@ Suppose we have an even larger "logical dataset" on disk that's a directory of p
 files. Each file in the directory represents a different year of the entire dataset.
 
 .. ipython:: python
-    :suppress:
 
     import pathlib
 
@@ -179,7 +181,7 @@ files. Each file in the directory represents a different year of the entire data
     pathlib.Path("data/timeseries").mkdir(exist_ok=True)
 
     for i, (start, end) in enumerate(zip(starts, ends)):
-        ts = _make_timeseries(start=start, end=end, freq="1T", seed=i)
+        ts = make_timeseries(start=start, end=end, freq="1T", seed=i)
         ts.to_parquet(f"data/timeseries/ts-{i:0>2d}.parquet")
 
 
@@ -200,7 +202,7 @@ files. Each file in the directory represents a different year of the entire data
         ├── ts-10.parquet
         └── ts-11.parquet
 
-Now we'll implement an out-of-core ``value_counts``. The peak memory usage of this
+Now we'll implement an out-of-core :meth:`pandas.Series.value_counts`. The peak memory usage of this
 workflow is the single largest chunk, plus a small series storing the unique value
 counts up to this point. As long as each individual file fits in memory, this will
 work for arbitrary-sized datasets.
@@ -211,17 +213,15 @@ work for arbitrary-sized datasets.
     files = pathlib.Path("data/timeseries/").glob("ts*.parquet")
     counts = pd.Series(dtype=int)
     for path in files:
-        # Only one dataframe is in memory at a time...
         df = pd.read_parquet(path)
-        # ... plus a small Series ``counts``, which is updated.
         counts = counts.add(df["name"].value_counts(), fill_value=0)
     counts.astype(int)
 
 Some readers, like :meth:`pandas.read_csv`, offer parameters to control the
 ``chunksize`` when reading a single file.
 
 Manually chunking is an OK option for workflows that don't
-require too sophisticated of operations. Some operations, like ``groupby``, are
+require too sophisticated of operations. Some operations, like :meth:`pandas.DataFrame.groupby`, are
 much harder to do chunkwise. In these cases, you may be better switching to a
 different library that implements these out-of-core algorithms for you.
 
@@ -259,7 +259,7 @@ Inspecting the ``ddf`` object, we see a few things
 * There are new attributes like ``.npartitions`` and ``.divisions``
 
 The partitions and divisions are how Dask parallelizes computation. A **Dask**
-DataFrame is made up of many pandas DataFrames. A single method call on a
+DataFrame is made up of many pandas :class:`pandas.DataFrame`. A single method call on a
 Dask DataFrame ends up making many pandas method calls, and Dask knows how to
 coordinate everything to get the result.
 
@@ -283,8 +283,8 @@ Rather than executing immediately, doing operations build up a **task graph**.
 
 Each of these calls is instant because the result isn't being computed yet.
 We're just building up a list of computation to do when someone needs the
-result. Dask knows that the return type of a ``pandas.Series.value_counts``
-is a pandas Series with a certain dtype and a certain name. So the Dask version
+result. Dask knows that the return type of a :class:`pandas.Series.value_counts`
+is a pandas :class:`pandas.Series` with a certain dtype and a certain name. So the Dask version
 returns a Dask Series with the same dtype and the same name.
 
 To get the actual result you can call ``.compute()``.
@@ -294,13 +294,13 @@ To get the actual result you can call ``.compute()``.
     %time ddf["name"].value_counts().compute()
 
 At that point, you get back the same thing you'd get with pandas, in this case
-a concrete pandas Series with the count of each ``name``.
+a concrete pandas :class:`pandas.Series` with the count of each ``name``.
 
 Calling ``.compute`` causes the full task graph to be executed. This includes
 reading the data, selecting the columns, and doing the ``value_counts``. The
 execution is done *in parallel* where possible, and Dask tries to keep the
 overall memory footprint small. You can work with datasets that are much larger
-than memory, as long as each partition (a regular pandas DataFrame) fits in memory.
+than memory, as long as each partition (a regular pandas :class:`pandas.DataFrame`) fits in memory.
 
 By default, ``dask.dataframe`` operations use a threadpool to do operations in
 parallel. We can also connect to a cluster to distribute the work on many
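Putting the Dask pieces from these hunks together, the lazy-then-compute flow looks roughly like this. A sketch, assuming ``dask.dataframe`` is installed and the ``data/timeseries`` directory from the chunking section exists::

    import dask.dataframe as dd

    # One Dask DataFrame backed by many pandas DataFrames, one per parquet file.
    ddf = dd.read_parquet("data/timeseries/")

    # Lazy: this only builds a task graph, nothing is read or computed yet.
    lazy_counts = ddf["name"].value_counts()

    # .compute() executes the graph and returns a concrete pandas Series.
    result = lazy_counts.compute()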

pandas/_testing/__init__.py

Lines changed: 0 additions & 81 deletions
@@ -385,87 +385,6 @@ def makeMultiIndex(k=10, names=None, **kwargs):
     return mi[:k]
 
 
-_names = [
-    "Alice",
-    "Bob",
-    "Charlie",
-    "Dan",
-    "Edith",
-    "Frank",
-    "George",
-    "Hannah",
-    "Ingrid",
-    "Jerry",
-    "Kevin",
-    "Laura",
-    "Michael",
-    "Norbert",
-    "Oliver",
-    "Patricia",
-    "Quinn",
-    "Ray",
-    "Sarah",
-    "Tim",
-    "Ursula",
-    "Victor",
-    "Wendy",
-    "Xavier",
-    "Yvonne",
-    "Zelda",
-]
-
-
-def _make_timeseries(start="2000-01-01", end="2000-12-31", freq="1D", seed=None):
-    """
-    Make a DataFrame with a DatetimeIndex
-
-    Parameters
-    ----------
-    start : str or Timestamp, default "2000-01-01"
-        The start of the index. Passed to date_range with `freq`.
-    end : str or Timestamp, default "2000-12-31"
-        The end of the index. Passed to date_range with `freq`.
-    freq : str or Freq
-        The frequency to use for the DatetimeIndex
-    seed : int, optional
-        The random state seed.
-
-        * name : object dtype with string names
-        * id : int dtype with
-        * x, y : float dtype
-
-    Examples
-    --------
-    >>> _make_timeseries()  # doctest: +SKIP
-                  id    name         x         y
-    timestamp
-    2000-01-01   982   Frank  0.031261  0.986727
-    2000-01-02  1025   Edith -0.086358 -0.032920
-    2000-01-03   982   Edith  0.473177  0.298654
-    2000-01-04  1009   Sarah  0.534344 -0.750377
-    2000-01-05   963   Zelda -0.271573  0.054424
-    ...          ...     ...       ...       ...
-    2000-12-27   980  Ingrid -0.132333 -0.422195
-    2000-12-28   972   Frank -0.376007 -0.298687
-    2000-12-29  1009  Ursula -0.865047 -0.503133
-    2000-12-30  1000  Hannah -0.063757 -0.507336
-    2000-12-31   972     Tim -0.869120  0.531685
-    """
-    index = pd.date_range(start=start, end=end, freq=freq, name="timestamp")
-    n = len(index)
-    state = np.random.RandomState(seed)
-    columns = {
-        "name": state.choice(_names, size=n),
-        "id": state.poisson(1000, size=n),
-        "x": state.rand(n) * 2 - 1,
-        "y": state.rand(n) * 2 - 1,
-    }
-    df = DataFrame(columns, index=index, columns=sorted(columns))
-    if df.index[-1] == end:
-        df = df.iloc[:-1]
-    return df
-
-
 def index_subclass_makers_generator():
     make_index_funcs = [
         makeDateIndex,