Commit f01bc96

Add notebook on apply method
1 parent 27de073 commit f01bc96

2 files changed, +364 -1 lines changed

source-code/pandas/README.md

Lines changed: 4 additions & 1 deletion
@@ -25,7 +25,10 @@ easy to use.
25 25
1. `pipes.ipynb`: consolidating data processing using pipes.
26 26
1. `screenshots`: screenshots made for the slides.
27 27
1. `generate_csv_files.py`: script to generate CSV files in different
28-
formats.
28+
formatg.
29 29
1. `copy_on_write.ipynb`: Jupyter notebook that illustrates how data is shared
30 30
between related notebooks and the role Copy-on-Write plays in order to
31 31
prevent accidental data modifications in more than one dataframe.
32+
1. `apply.ipynb`: Jupyter notebook that illustrates the use of the `apply` method
33+
in pandas dataframes for applying functions along rows or columns. It includes
34+
a comparison of performance between using `apply` and vectorized operations.
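
The comparison that this README entry describes can be sketched outside a notebook as well. The script below is a minimal, hedged illustration and not the notebook's own code: it uses the standard-library `timeit` module instead of the IPython `%timeit` magic, a seeded generator, and a smaller frame (10,000 rows rather than 100,000) so it finishes quickly. The helper names `flag_with_apply` and `flag_with_where` are hypothetical, introduced only for this example.

```python
import timeit

import numpy as np
import pandas as pd

# Assumption: a seeded generator and a smaller frame than the notebook uses,
# chosen only to keep this sketch fast and reproducible.
rng = np.random.default_rng(seed=1234)
size = 10_000
df = pd.DataFrame({
    'A': rng.uniform(0.0, 1.0, size=size),
    'B': rng.uniform(0.0, 1.0, size=size),
})


def flag_with_apply(frame):
    """Row-wise evaluation: the lambda is called once per row."""
    return frame.apply(lambda x: 0 if x.A + x.B < 1.0 else 1, axis=1).to_numpy()


def flag_with_where(frame):
    """Vectorized evaluation: one pass over the whole columns."""
    return np.where(frame.A + frame.B < 1.0, 0, 1)


repeats = 3
t_apply = timeit.timeit(lambda: flag_with_apply(df), number=repeats) / repeats
t_where = timeit.timeit(lambda: flag_with_where(df), number=repeats) / repeats
print(f'apply (axis=1): {t_apply * 1e3:.3f} ms per call')
print(f'np.where:       {t_where * 1e3:.3f} ms per call')
```

Absolute timings depend on the machine and the pandas version, so only the ratio between the two numbers is of interest here.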

source-code/pandas/apply.ipynb

Lines changed: 360 additions & 0 deletions
@@ -0,0 +1,360 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "code",
5+
"execution_count": 2,
6+
"id": "c507c033-f47a-40f3-9d9d-d24d23e25474",
7+
"metadata": {},
8+
"outputs": [],
9+
"source": [
10+
"import numpy as np\n",
11+
"import pandas as pd"
12+
]
13+
},
14+
{
15+
"cell_type": "markdown",
16+
"id": "c973633e-eccd-4a0f-873d-faf43fa3b836",
17+
"metadata": {},
18+
"source": [
19+
"## apply"
20+
]
21+
},
22+
{
23+
"cell_type": "markdown",
24+
"id": "f1401362-5955-495e-be57-5436a7446530",
25+
"metadata": {},
26+
"source": [
27+
"Code that uses `.apply()` looks clean, but it is rather slow when used row-wise (`axis=1`). To quantify this, you can run the example below."
28+
]
29+
},
30+
{
31+
"cell_type": "code",
32+
"execution_count": 31,
33+
"id": "af048047-df04-4c5f-8b36-d48f53d021ae",
34+
"metadata": {},
35+
"outputs": [],
36+
"source": [
37+
"size = 100_000\n",
38+
"df = pd.DataFrame({\n",
39+
" 'A': np.random.uniform(0.0, 1.0, size=size),\n",
40+
" 'B': np.random.uniform(0.0, 1.0, size=size),\n",
41+
"})"
42+
]
43+
},
44+
{
45+
"cell_type": "code",
46+
"execution_count": 32,
47+
"id": "84b3d0d6-d9c3-4921-8561-80ef6d766f6f",
48+
"metadata": {
49+
"scrolled": true
50+
},
51+
"outputs": [
52+
{
53+
"name": "stdout",
54+
"output_type": "stream",
55+
"text": [
56+
"<class 'pandas.core.frame.DataFrame'>\n",
57+
"RangeIndex: 100000 entries, 0 to 99999\n",
58+
"Data columns (total 2 columns):\n",
59+
" # Column Non-Null Count Dtype \n",
60+
"--- ------ -------------- ----- \n",
61+
" 0 A 100000 non-null float64\n",
62+
" 1 B 100000 non-null float64\n",
63+
"dtypes: float64(2)\n",
64+
"memory usage: 1.5 MB\n"
65+
]
66+
}
67+
],
68+
"source": [
69+
"df.info()"
70+
]
71+
},
72+
{
73+
"cell_type": "markdown",
74+
"id": "9dfd0c4b-996d-4426-8b58-d66c78124a8f",
75+
"metadata": {},
76+
"source": [
77+
"Note that this dataframe is fairly small."
78+
]
79+
},
80+
{
81+
"cell_type": "markdown",
82+
"id": "d0b672e5-9762-496e-932f-4c5729c62061",
83+
"metadata": {},
84+
"source": [
85+
"### Evaluating a condition"
86+
]
87+
},
88+
{
89+
"cell_type": "code",
90+
"execution_count": 38,
91+
"id": "093ddcde-ee7f-4d66-847d-221e8181b9dc",
92+
"metadata": {
93+
"editable": true,
94+
"slideshow": {
95+
"slide_type": ""
96+
},
97+
"tags": []
98+
},
99+
"outputs": [
100+
{
101+
"name": "stdout",
102+
"output_type": "stream",
103+
"text": [
104+
"551 ms ± 8.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
105+
]
106+
}
107+
],
108+
"source": [
109+
"%timeit df.apply(lambda x: 0 if x.A + x.B < 1.0 else 1, axis=1)"
110+
]
111+
},
112+
{
113+
"cell_type": "code",
114+
"execution_count": 39,
115+
"id": "6b10519f-26b5-4c74-af2f-ee34af35e96d",
116+
"metadata": {},
117+
"outputs": [
118+
{
119+
"name": "stdout",
120+
"output_type": "stream",
121+
"text": [
122+
"1.17 ms ± 5.24 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)\n"
123+
]
124+
}
125+
],
126+
"source": [
127+
"%timeit np.select([df.A + df.B < 1.0, df.A + df.B >= 1.0], [0, 1])"
128+
]
129+
},
130+
{
131+
"cell_type": "code",
132+
"execution_count": 40,
133+
"id": "e8b003c0-7445-475e-9ece-68a9783b1388",
134+
"metadata": {
135+
"scrolled": true
136+
},
137+
"outputs": [
138+
{
139+
"name": "stdout",
140+
"output_type": "stream",
141+
"text": [
142+
"510 μs ± 4.17 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)\n"
143+
]
144+
}
145+
],
146+
"source": [
147+
"%timeit np.where(df.A + df.B < 1.0, 0, 1)"
148+
]
149+
},
150+
{
151+
"cell_type": "markdown",
152+
"id": "35ebd7e1-48bb-4d3b-860d-f0d765ffa62e",
153+
"metadata": {},
154+
"source": [
155+
"Clearly, `.apply()` is very slow comparted to `np.select()` and `np.where()`. Note that `np.where()` is faster than `np.select()` by a factor of 2."
156+
]
157+
},
158+
{
159+
"cell_type": "code",
160+
"execution_count": 50,
161+
"id": "9bc83bfe-680e-4b3d-8017-970cf08fd956",
162+
"metadata": {},
163+
"outputs": [],
164+
"source": [
165+
"assert np.array_equal(\n",
166+
" df.apply(lambda x: 0 if x.A + x.B < 1.0 else 1, axis=1).to_numpy(),\n",
167+
" np.where(df.A + df.B < 1.0, 0, 1),\n",
168+
")"
169+
]
170+
},
171+
{
172+
"cell_type": "code",
173+
"execution_count": 51,
174+
"id": "de5e05b5-154e-498c-a565-3116e490ae11",
175+
"metadata": {},
176+
"outputs": [],
177+
"source": [
178+
"assert np.array_equal(\n",
179+
" df.apply(lambda x: 0 if x.A + x.B < 1.0 else 1, axis=1).to_numpy(),\n",
180+
" np.select([df.A + df.B < 1.0, df.A + df.B >= 1.0], [0, 1]),\n",
181+
")"
182+
]
183+
},
184+
{
185+
"cell_type": "markdown",
186+
"id": "9b46cd48-f1ce-4041-9560-6c1b09556d53",
187+
"metadata": {},
188+
"source": [
189+
"All three approaches produce the same results."
190+
]
191+
},
192+
{
193+
"cell_type": "markdown",
194+
"id": "c63e4df2-6fed-4072-aadd-3256a7c8cede",
195+
"metadata": {},
196+
"source": [
197+
"### Adding a column"
198+
]
199+
},
200+
{
201+
"cell_type": "code",
202+
"execution_count": 41,
203+
"id": "ef441507-f6f5-4485-b03f-36636259a848",
204+
"metadata": {},
205+
"outputs": [
206+
{
207+
"name": "stdout",
208+
"output_type": "stream",
209+
"text": [
210+
"563 ms ± 8.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
211+
]
212+
}
213+
],
214+
"source": [
215+
"%timeit df['C'] = df.apply(lambda x: x.A + x.B, axis=1)"
216+
]
217+
},
218+
{
219+
"cell_type": "code",
220+
"execution_count": 42,
221+
"id": "bd13d78b-b7fd-40c0-8b0e-3bdafdef4b33",
222+
"metadata": {
223+
"editable": true,
224+
"slideshow": {
225+
"slide_type": ""
226+
},
227+
"tags": []
228+
},
229+
"outputs": [
230+
{
231+
"name": "stdout",
232+
"output_type": "stream",
233+
"text": [
234+
"176 μs ± 2.21 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)\n"
235+
]
236+
}
237+
],
238+
"source": [
239+
"%timeit df['C'] = df.A + df.B"
240+
]
241+
},
242+
{
243+
"cell_type": "markdown",
244+
"id": "3f092bfc-9f32-4636-ba95-52b2c07d2fdb",
245+
"metadata": {},
246+
"source": [
247+
"Clearly, `.apply()` is very slow comparted to a straightforward column definition. The difference is a factor of 1,000."
248+
]
249+
},
250+
{
251+
"cell_type": "code",
252+
"execution_count": 48,
253+
"id": "5c8f3b66-1eea-4e58-9035-f6db4af3df3f",
254+
"metadata": {},
255+
"outputs": [],
256+
"source": [
257+
"assert df.apply(lambda x: x.A + x.B, axis=1).equals(df.A + df.B)"
258+
]
259+
},
260+
{
261+
"cell_type": "markdown",
262+
"id": "c31c53ea-e297-4658-b55b-35ab47987237",
263+
"metadata": {},
264+
"source": [
265+
"Both approaches yield the same result."
266+
]
267+
},
268+
{
269+
"cell_type": "markdown",
270+
"id": "a32a0791-8063-40ac-83d0-93a5ab796c70",
271+
"metadata": {},
272+
"source": [
273+
"### Aggregating columns"
274+
]
275+
},
276+
{
277+
"cell_type": "markdown",
278+
"id": "8be8ec5b-878b-4452-9815-9c0a23f97d9d",
279+
"metadata": {},
280+
"source": [
281+
"Although less dramatically so, applying `.apply()` along axis 0 is also slower than its numpy counterpart."
282+
]
283+
},
284+
{
285+
"cell_type": "code",
286+
"execution_count": 52,
287+
"id": "47d6aca2-f52e-4746-a139-119fcdfe3030",
288+
"metadata": {},
289+
"outputs": [
290+
{
291+
"name": "stdout",
292+
"output_type": "stream",
293+
"text": [
294+
"303 μs ± 4.28 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)\n"
295+
]
296+
}
297+
],
298+
"source": [
299+
"%timeit df.apply(np.sum, axis=0)"
300+
]
301+
},
302+
{
303+
"cell_type": "code",
304+
"execution_count": 54,
305+
"id": "1e4ed799-08fd-4c14-bdf0-f6db5b829c0c",
306+
"metadata": {},
307+
"outputs": [
308+
{
309+
"name": "stdout",
310+
"output_type": "stream",
311+
"text": [
312+
"179 μs ± 10.2 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)\n"
313+
]
314+
}
315+
],
316+
"source": [
317+
"%timeit np.sum(df.to_numpy(), axis=0)"
318+
]
319+
},
320+
{
321+
"cell_type": "code",
322+
"execution_count": 55,
323+
"id": "d0504d19-a6d8-4f3d-a4ff-73e9c04152e4",
324+
"metadata": {},
325+
"outputs": [],
326+
"source": [
327+
"assert np.array_equal(df.apply(np.sum, axis=0), np.sum(df.to_numpy(), axis=0))"
328+
]
329+
},
330+
{
331+
"cell_type": "markdown",
332+
"id": "ce8fb4ac-795e-43e3-ae72-fa528df86855",
333+
"metadata": {},
334+
"source": [
335+
"Again, both produce the same result."
336+
]
337+
}
338+
],
339+
"metadata": {
340+
"kernelspec": {
341+
"display_name": "Python 3 (ipykernel)",
342+
"language": "python",
343+
"name": "python3"
344+
},
345+
"language_info": {
346+
"codemirror_mode": {
347+
"name": "ipython",
348+
"version": 3
349+
},
350+
"file_extension": ".py",
351+
"mimetype": "text/x-python",
352+
"name": "python",
353+
"nbconvert_exporter": "python",
354+
"pygments_lexer": "ipython3",
355+
"version": "3.12.12"
356+
}
357+
},
358+
"nbformat": 4,
359+
"nbformat_minor": 5
360+
}
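
For readers who want to check the last two notebook sections without Jupyter, the following is a hedged sketch of the same equivalence checks as a plain script. It is not part of the commit. It assumes a seeded generator and a smaller frame, adds the built-in `DataFrame.sum` reduction as a third variant that the notebook does not time, and compares with `np.allclose` rather than exact equality, since different summation orders can differ in the last bits of a floating-point result.

```python
import numpy as np
import pandas as pd

# Assumption: seeded data and a smaller size than in the notebook.
rng = np.random.default_rng(seed=1234)
size = 10_000
df = pd.DataFrame({
    'A': rng.uniform(0.0, 1.0, size=size),
    'B': rng.uniform(0.0, 1.0, size=size),
})

# Adding a column: row-wise apply versus a vectorized column expression.
col_apply = df.apply(lambda x: x.A + x.B, axis=1)
col_vectorized = df.A + df.B
assert np.allclose(col_apply.to_numpy(), col_vectorized.to_numpy())

# Aggregating columns, three ways.
sums_apply = df.apply(np.sum, axis=0)        # Series indexed by column name
sums_numpy = np.sum(df.to_numpy(), axis=0)   # plain ndarray, column labels are lost
sums_pandas = df.sum(axis=0)                 # built-in reduction, keeps the labels

assert np.allclose(sums_apply.to_numpy(), sums_numpy)
assert np.allclose(sums_pandas.to_numpy(), sums_numpy)
print(sums_pandas)
```

Whether the label-preserving `df.sum()` or the bare NumPy array is preferable depends on what the result is used for; the sketch only shows that the values agree.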