Add notebook on apply method

gjbex · gjbex · commit f01bc969eb01 · 2025-11-20T17:33:53.000+01:00
diff --git a/source-code/pandas/README.md b/source-code/pandas/README.md
@@ -25,7 +25,10 @@ easy to use.
 1. `pipes.ipynb`: consolidating data processing using pipes.
 1. `screenshots`: screenshots made for the slides.
 1. `generate_csv_files.py`: script to generate CSV files in different
     formats.
 1. `copy_on_write.ipynb`: Jupyter notebook that illustrates how data is shared
    between related notebooks and the role Copy-on-Write plays in order to
    prevent accidental data modifications in more than one dataframe.
+1. `apply.ipynb`: Jupyter notebook that illustrates the use of the `apply` method
+   in pandas dataframes for applying functions along rows or columns. It includes
+   a comparison of performance between using `apply` and vectorized operations.
diff --git a/source-code/pandas/apply.ipynb b/source-code/pandas/apply.ipynb
@@ -0,0 +1,360 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "c507c033-f47a-40f3-9d9d-d24d23e25474",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import pandas as pd"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c973633e-eccd-4a0f-873d-faf43fa3b836",
+   "metadata": {},
+   "source": [
+    "## apply"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f1401362-5955-495e-be57-5436a7446530",
+   "metadata": {},
+   "source": [
+    "Code that uses `.apply()` looks clean, but it is rather slow when used row-wise (`axis=1`). To quantify this, you can run the example below."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 31,
+   "id": "af048047-df04-4c5f-8b36-d48f53d021ae",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "size = 100_000\n",
+    "df = pd.DataFrame({\n",
+    "    'A': np.random.uniform(0.0, 1.0, size=size),\n",
+    "    'B': np.random.uniform(0.0, 1.0, size=size),\n",
+    "})"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 32,
+   "id": "84b3d0d6-d9c3-4921-8561-80ef6d766f6f",
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<class 'pandas.core.frame.DataFrame'>\n",
+      "RangeIndex: 100000 entries, 0 to 99999\n",
+      "Data columns (total 2 columns):\n",
+      " #   Column  Non-Null Count   Dtype  \n",
+      "---  ------  --------------   -----  \n",
+      " 0   A       100000 non-null  float64\n",
+      " 1   B       100000 non-null  float64\n",
+      "dtypes: float64(2)\n",
+      "memory usage: 1.5 MB\n"
+     ]
+    }
+   ],
+   "source": [
+    "df.info()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9dfd0c4b-996d-4426-8b58-d66c78124a8f",
+   "metadata": {},
+   "source": [
+    "Note that this dataframe is fairly small: 100,000 rows that take up about 1.5 MB."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d0b672e5-9762-496e-932f-4c5729c62061",
+   "metadata": {},
+   "source": [
+    "### Evaluating a condition"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 38,
+   "id": "093ddcde-ee7f-4d66-847d-221e8181b9dc",
+   "metadata": {
+    "editable": true,
+    "slideshow": {
+     "slide_type": ""
+    },
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "551 ms ± 8.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%timeit df.apply(lambda x: 0 if x.A + x.B < 1.0 else 1, axis=1)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 39,
+   "id": "6b10519f-26b5-4c74-af2f-ee34af35e96d",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "1.17 ms ± 5.24 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%timeit np.select([df.A + df.B < 1.0, df.A + df.B >= 1.0], [0, 1])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 40,
+   "id": "e8b003c0-7445-475e-9ece-68a9783b1388",
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "510 μs ± 4.17 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%timeit np.where(df.A + df.B < 1.0, 0, 1)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "35ebd7e1-48bb-4d3b-860d-f0d765ffa62e",
+   "metadata": {},
+   "source": [
+    "Clearly, `.apply()` is very slow compared to `np.select()` and `np.where()`: it is slower by roughly a factor of 500 and 1,000, respectively.  Note that `np.where()` is faster than `np.select()` by about a factor of 2."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 50,
+   "id": "9bc83bfe-680e-4b3d-8017-970cf08fd956",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "assert np.array_equal(\n",
+    "    df.apply(lambda x: 0 if x.A + x.B < 1.0 else 1, axis=1).to_numpy(),\n",
+    "    np.where(df.A + df.B < 1.0, 0, 1),\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 51,
+   "id": "de5e05b5-154e-498c-a565-3116e490ae11",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "assert np.array_equal(\n",
+    "    df.apply(lambda x: 0 if x.A + x.B < 1.0 else 1, axis=1).to_numpy(),\n",
+    "    np.select([df.A + df.B < 1.0, df.A + df.B >= 1.0], [0, 1]),\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9b46cd48-f1ce-4041-9560-6c1b09556d53",
+   "metadata": {},
+   "source": [
+    "All three approaches produce the same results."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c63e4df2-6fed-4072-aadd-3256a7c8cede",
+   "metadata": {},
+   "source": [
+    "### Adding a column"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 41,
+   "id": "ef441507-f6f5-4485-b03f-36636259a848",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "563 ms ± 8.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%timeit df['C'] = df.apply(lambda x: x.A + x.B, axis=1)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 42,
+   "id": "bd13d78b-b7fd-40c0-8b0e-3bdafdef4b33",
+   "metadata": {
+    "editable": true,
+    "slideshow": {
+     "slide_type": ""
+    },
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "176 μs ± 2.21 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%timeit df['C'] = df.A + df.B"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3f092bfc-9f32-4636-ba95-52b2c07d2fdb",
+   "metadata": {},
+   "source": [
+    "Clearly, `.apply()` is very slow compared to a straightforward column definition.  The difference is a factor of roughly 3,000."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 48,
+   "id": "5c8f3b66-1eea-4e58-9035-f6db4af3df3f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "assert df.apply(lambda x: x.A + x.B, axis=1).equals(df.A + df.B)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c31c53ea-e297-4658-b55b-35ab47987237",
+   "metadata": {},
+   "source": [
+    "Both approaches yield the same result."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a32a0791-8063-40ac-83d0-93a5ab796c70",
+   "metadata": {},
+   "source": [
+    "### Aggregating columns"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8be8ec5b-878b-4452-9815-9c0a23f97d9d",
+   "metadata": {},
+   "source": [
+    "Although less dramatically so, using `.apply()` along axis 0 is also slower than its NumPy counterpart, here by a factor of about 1.7."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 52,
+   "id": "47d6aca2-f52e-4746-a139-119fcdfe3030",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "303 μs ± 4.28 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%timeit df.apply(np.sum, axis=0)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 54,
+   "id": "1e4ed799-08fd-4c14-bdf0-f6db5b829c0c",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "179 μs ± 10.2 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%timeit np.sum(df.to_numpy(), axis=0)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 55,
+   "id": "d0504d19-a6d8-4f3d-a4ff-73e9c04152e4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "assert np.array_equal(df.apply(np.sum, axis=0), np.sum(df.to_numpy(), axis=0))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ce8fb4ac-795e-43e3-ae72-fa528df86855",
+   "metadata": {},
+   "source": [
+    "Again, both produce the same result."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
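
The notebook relies on the IPython `%timeit` magic, which is not available in a plain Python script. As a minimal sketch, the first comparison could be reproduced with the standard library `timeit` module. The helper names `with_apply` and `with_where`, the reduced size of 10,000 rows, the seed and the number of repetitions are assumptions made for illustration and are not part of this commit.

```python
import timeit

import numpy as np
import pandas as pd

# Assumption: fewer rows than the notebook's 100_000 so the script finishes quickly.
size = 10_000
# Assumption: a fixed, arbitrary seed to make the run reproducible.
rng = np.random.default_rng(seed=1234)
df = pd.DataFrame({
    'A': rng.uniform(0.0, 1.0, size=size),
    'B': rng.uniform(0.0, 1.0, size=size),
})


def with_apply(df):
    """Row-wise evaluation: the lambda is called once per row."""
    return df.apply(lambda x: 0 if x.A + x.B < 1.0 else 1, axis=1).to_numpy()


def with_where(df):
    """Vectorized evaluation that operates on whole columns at once."""
    return np.where(df.A + df.B < 1.0, 0, 1)


# Both approaches must agree before their timings are worth comparing.
assert np.array_equal(with_apply(df), with_where(df))

# Average wall-clock time per call over a few repetitions.
nr_runs = 3
t_apply = timeit.timeit(lambda: with_apply(df), number=nr_runs) / nr_runs
t_where = timeit.timeit(lambda: with_where(df), number=nr_runs) / nr_runs
print(f'apply: {t_apply:.2e} s, where: {t_where:.2e} s, '
      f'ratio: {t_apply / t_where:.0f}')
```

The measured ratio depends on the machine, the pandas and NumPy versions, and the dataframe size, so it will likely differ from the numbers recorded in the notebook.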