|
1 | 1 | { |
2 | 2 | "cells": [ |
3 | | - { |
4 | | - "cell_type": "markdown", |
5 | | - "metadata": {}, |
6 | | - "source": [ |
7 | | - "[](https://mybinder.org/v2/gh/treehouse-projects/python-introducing-pandas/master?filepath=s2n5-handling-missing-and-duplicated-data.ipynb)" |
8 | | - ] |
9 | | - }, |
10 | 3 | { |
11 | 4 | "cell_type": "markdown", |
12 | 5 | "metadata": {}, |
|
44 | 37 | "transactions = pd.read_csv(os.path.join('data', 'transactions.csv'), index_col=0)\n", |
45 | 38 | "requests = pd.read_csv(os.path.join('data', 'requests.csv'), index_col=0)\n", |
46 | 39 | "\n", |
47 | | - "# Perform the merge from the previous notebook (s2n4-combining-dataframes.ipynb)\n", |
| 40 | + "# Perform the merge from the previous notebook (s2n6-combining-dataframes.ipynb)\n", |
48 | 41 | "successful_requests = requests.merge(\n", |
49 | 42 | " transactions,\n", |
50 | 43 | " left_on=['from_user', 'to_user', 'amount'], \n", |
|
64 | 57 | "source": [ |
65 | 58 | "## Duplicated Data\n", |
66 | 59 | "\n", |
67 | | - "We realized in our the previous notebook (s2n4-combining-dataframes.ipynb) that the **`requests`** `DataFrame` had duplicates. Unfortunately this means that our **`successful_requests`** also contains duplicates because we merged those same values with a transaction, even though in actuality, only one of those duplicated requests should be deemed \"successful\".\n", |
| 60 | + "We realized in our the previous notebook (s2n6-combining-dataframes.ipynb) that the **`requests`** `DataFrame` had duplicates. Unfortunately this means that our **`successful_requests`** also contains duplicates because we merged those same values with a transaction, even though in actuality, only one of those duplicated requests should be deemed \"successful\".\n", |
68 | 61 | "\n", |
69 | | - "We should correct our `DataFrame` by removing the duplicate requests, keeping only the last one, as that is really the one that triggered the actual transaction. The great news is that there is a method named [`drop_duplicates`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html) that does just that. Like `duplicated` there is a `keep` parameter that works similarly, you tell it which of the duplicates to keep. " |
| 62 | + "We should correct our `DataFrame` by removing the duplicate requests, keeping only the last one, as that is really the one that triggered the actual transaction. The great news is that there is a method named [`drop_duplicates`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html) that does just that. \n", |
| 63 | + "\n", |
| 64 | + "Like `duplicated` there is a `keep` parameter that works similarly, you tell it which of the duplicates to keep. " |
70 | 65 | ] |
71 | 66 | }, |
72 | 67 | { |
|
88 | 83 | "source": [ |
89 | 84 | "# Let's get our records sorted chronologically\n", |
90 | 85 | "successful_requests.sort_values('request_date', inplace=True) \n", |
| 86 | + "\n", |
91 | 87 | "# And then we'll drop dupes keeping only the last one. Note the call to inplace \n", |
92 | 88 | "successful_requests.drop_duplicates(('from_user', 'to_user', 'amount'), keep='last', inplace=True)\n", |
| 89 | + "\n", |
93 | 90 | "# Statement from previous notebook\n", |
94 | 91 | "\"Wow! ${:,.2f} has passed through the request system in {} transactions!!!\".format(\n", |
95 | 92 | " successful_requests.amount.sum(),\n", |
|
363 | 360 | "source": [ |
364 | 361 | "## Locating Missing Data\n", |
365 | 362 | "\n", |
366 | | - "As I was looking at these people who hadn't made requests I noticed that a few of them had a Not A Number (`np.nan`) for a **`last_name`**.\n", |
| 363 | + "As I was looking at these people who hadn't made requests I noticed that a few of them had a NaN (Not A Number) for a **`last_name`**.\n", |
367 | 364 | "\n", |
368 | 365 | "We can get a quick overview of how many blank values we have by using the [`DataFrame.count`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.count.html)\n" |
369 | 366 | ] |
|
529 | 526 | }, |
530 | 527 | { |
531 | 528 | "cell_type": "code", |
532 | | - "execution_count": 7, |
| 529 | + "execution_count": 9, |
533 | 530 | "metadata": {}, |
534 | 531 | "outputs": [ |
535 | 532 | { |
|
573 | 570 | "Index: []" |
574 | 571 | ] |
575 | 572 | }, |
576 | | - "execution_count": 7, |
| 573 | + "execution_count": 9, |
577 | 574 | "metadata": {}, |
578 | 575 | "output_type": "execute_result" |
579 | 576 | } |
580 | 577 | ], |
581 | 578 | "source": [ |
582 | 579 | "# Make a copy of the DataFrame with \"Unknown\" as the last name where it is missing\n", |
583 | 580 | "users_with_unknown = users.fillna('Unknown')\n", |
| 581 | + "\n", |
584 | 582 | "# Make sure we got 'em all\n", |
585 | 583 | "users_with_unknown[users_with_unknown.last_name.isna()]" |
586 | 584 | ] |
|
598 | 596 | }, |
599 | 597 | { |
600 | 598 | "cell_type": "code", |
601 | | - "execution_count": 9, |
| 599 | + "execution_count": 10, |
602 | 600 | "metadata": {}, |
603 | 601 | "outputs": [ |
604 | 602 | { |
|
607 | 605 | "(475, 430)" |
608 | 606 | ] |
609 | 607 | }, |
610 | | - "execution_count": 9, |
| 608 | + "execution_count": 10, |
611 | 609 | "metadata": {}, |
612 | 610 | "output_type": "execute_result" |
613 | 611 | } |
614 | 612 | ], |
615 | 613 | "source": [ |
616 | 614 | "users_with_last_names = users.dropna()\n", |
| 615 | + "\n", |
617 | 616 | "# Row counts of the original \n", |
618 | 617 | "(len(users), len(users_with_last_names))" |
619 | 618 | ] |
|