Commit 8b3ae34

DOC: Add comprehensive Google Colab data loading guide
1 parent 1b5b02c commit 8b3ae34

1 file changed

+148
-1
lines changed

doc/source/user_guide/io.rst

Lines changed: 148 additions & 1 deletion
@@ -6305,7 +6305,154 @@ xarray_ provides data structures inspired by the pandas ``DataFrame`` for workin
63056305
with multi-dimensional datasets, with a focus on the netCDF file format and
63066306
easy conversion to and from pandas.
63076307

6308-
.. _xarray: https://xarray.pydata.org/en/stable/
6308+
.. _io.google_colab:
6309+
6310+
Google Colab
6311+
------------
6312+
6313+
Google Colab is a hosted Jupyter notebook environment with pandas preinstalled.
6314+
This section covers several ways to load data into a pandas ``DataFrame`` from
6315+
a Colab notebook.
6316+
6317+
.. _io.google_colab.drive:
6318+
6319+
Reading from Google Drive
6320+
'''''''''''''''''''''''''
6321+
6322+
The most common approach is to mount your Google Drive, which allows you to
6323+
access files stored in Drive as if they were local files.
6324+
6325+
.. code-block:: python
6326+
6327+
from google.colab import drive
6328+
import pandas as pd
6329+
6330+
# Mount Google Drive
6331+
drive.mount('/content/drive')
6332+
6333+
# Read a CSV file from Google Drive
6334+
df = pd.read_csv('/content/drive/MyDrive/path/to/your/file.csv')
6335+
6336+
After running the mount command, you'll be prompted to authorize access to your
6337+
Google Drive. Once mounted, you can navigate to your files using the file browser
6338+
in the Colab sidebar and copy the path to use in pandas read functions.
6339+
6340+
Once Drive is mounted, any pandas reader that accepts a file path works the same way:
6341+
6342+
.. code-block:: python
6343+
6344+
# Read Excel file
6345+
df = pd.read_excel('/content/drive/MyDrive/data.xlsx')
6346+
6347+
# Read JSON file
6348+
df = pd.read_json('/content/drive/MyDrive/data.json')
6349+
6350+
# Read Parquet file
6351+
df = pd.read_parquet('/content/drive/MyDrive/data.parquet')
6352+
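Because mounted Drive files behave like local files, the reader can also be
chosen from the file extension. The following is a minimal sketch, not part of
pandas: ``READERS`` and ``read_any`` are hypothetical names introduced here for
illustration, and the mapping covers only the four formats shown above.

.. code-block:: python

    from pathlib import Path

    import pandas as pd

    # Hypothetical mapping from file suffix to the matching pandas reader
    READERS = {
        ".csv": pd.read_csv,
        ".json": pd.read_json,
        ".xlsx": pd.read_excel,
        ".parquet": pd.read_parquet,
    }


    def read_any(path, **kwargs):
        """Read a file with the pandas reader that matches its suffix."""
        path = Path(path)
        suffix = path.suffix.lower()
        if suffix not in READERS:
            raise ValueError(f"Unsupported file type: {suffix!r}")
        # Extra keyword arguments are passed through to the pandas reader
        return READERS[suffix](path, **kwargs)

For example, ``read_any('/content/drive/MyDrive/data.csv')`` would call
``pd.read_csv`` on that path.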
6353+
.. _io.google_colab.upload:
6354+
6355+
Uploading files directly
6356+
'''''''''''''''''''''''''
6357+
6358+
For smaller files or one-time uploads, you can upload files directly from your
6359+
local machine:
6360+
6361+
.. code-block:: python
6362+
6363+
from google.colab import files
6364+
import pandas as pd
6365+
import io
6366+
6367+
# Upload file(s)
6368+
uploaded = files.upload()
6369+
6370+
# Read the uploaded CSV file
6371+
# Replace 'filename.csv' with your actual filename
6372+
df = pd.read_csv(io.BytesIO(uploaded['filename.csv']))
6373+
6374+
.. note::
6375+
Uploaded files are stored in the Colab session's temporary storage and will
6376+
be lost when the runtime disconnects.
6377+
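``files.upload()`` returns a dictionary that maps each uploaded filename to its
contents as bytes, so several CSV files can be read in one pass. The sketch
below assumes that structure; ``frames_from_upload`` is a hypothetical helper,
and the ``uploaded`` dictionary is a stand-in so the example runs outside Colab.

.. code-block:: python

    import io

    import pandas as pd


    def frames_from_upload(uploaded):
        """Build one DataFrame per uploaded CSV file, keyed by filename."""
        return {
            name: pd.read_csv(io.BytesIO(content))
            for name, content in uploaded.items()
            if name.lower().endswith(".csv")
        }


    # Stand-in for the dict returned by files.upload() in Colab
    uploaded = {"example.csv": b"a,b\n1,2\n3,4\n"}
    frames = frames_from_upload(uploaded)
    df = frames["example.csv"]

In a notebook, replace the stand-in dictionary with the value returned by
``files.upload()``.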
6378+
.. _io.google_colab.url:
6379+
6380+
Reading from URLs
6381+
'''''''''''''''''
6382+
6383+
pandas can read files directly from URLs, which is useful for accessing data
6384+
from GitHub, public datasets, or other web sources:
6385+
6386+
.. code-block:: python
6387+
6388+
import pandas as pd
6389+
6390+
# Read CSV from a URL
6391+
url = 'https://raw.githubusercontent.com/user/repo/main/data.csv'
6392+
df = pd.read_csv(url)
6393+
6394+
# Read an Excel file through GitHub's raw-file link
6395+
github_url = 'https://github.com/user/repo/raw/main/data.xlsx'
6396+
df = pd.read_excel(github_url)
6397+
6398+
.. _io.google_colab.gsheets:
6399+
6400+
Reading from Google Sheets
6401+
'''''''''''''''''''''''''''
6402+
6403+
You can read data directly from Google Sheets by sharing the sheet so that
6404+
anyone with the link can view it, then reading its CSV export URL:
6405+
6406+
.. code-block:: python
6407+
6408+
import pandas as pd
6409+
6410+
# Build the sheet's CSV export URL
6411+
sheet_id = 'your-sheet-id'
6412+
sheet_name = 'Sheet1'
6413+
url = f'https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}'
6414+
df = pd.read_csv(url)
6415+
6416+
For more advanced Google Sheets integration with authentication, consider using
6417+
the ``gspread`` library alongside pandas.
6418+
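Sheet names that contain spaces or other special characters need to be
percent-encoded before they are placed in the URL. A minimal sketch using the
standard library follows; ``sheet_csv_url`` is a hypothetical helper, and the
sheet ID and sheet name are placeholders.

.. code-block:: python

    from urllib.parse import quote


    def sheet_csv_url(sheet_id, sheet_name):
        """Return the CSV export URL for one tab of a shared Google Sheet."""
        return (
            f"https://docs.google.com/spreadsheets/d/{sheet_id}"
            f"/gviz/tq?tqx=out:csv&sheet={quote(sheet_name)}"
        )


    # Placeholder values; substitute your own sheet ID and tab name
    url = sheet_csv_url("your-sheet-id", "Q1 Results")

The resulting ``url`` can then be passed to ``pd.read_csv`` as in the example
above.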
6419+
.. _io.google_colab.kaggle:
6420+
6421+
Reading Kaggle datasets
6422+
''''''''''''''''''''''''
6423+
6424+
To access Kaggle datasets in Colab, you need to authenticate using your Kaggle
6425+
API credentials:
6426+
6427+
.. code-block:: python
6428+
6429+
# Upload your kaggle.json file
6430+
from google.colab import files
6431+
files.upload() # Select kaggle.json when prompted
6432+
6433+
# Setup Kaggle
6434+
!mkdir -p ~/.kaggle
6435+
!cp kaggle.json ~/.kaggle/
6436+
!chmod 600 ~/.kaggle/kaggle.json
6437+
6438+
# Download a dataset
6439+
!kaggle datasets download -d dataset-owner/dataset-name
6440+
!unzip dataset-name.zip
6441+
6442+
# Read the data
6443+
import pandas as pd
6444+
df = pd.read_csv('datafile.csv')
6445+
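As an alternative to ``!unzip``, a CSV file can be read straight from the
downloaded archive with the standard ``zipfile`` module. This is a sketch under
the assumption that the archive contains at least one CSV file;
``read_csv_from_zip`` is a hypothetical helper, and the archive and member names
are placeholders.

.. code-block:: python

    import zipfile

    import pandas as pd


    def read_csv_from_zip(zip_path, member=None, **kwargs):
        """Read one CSV member of a zip archive into a DataFrame."""
        with zipfile.ZipFile(zip_path) as archive:
            if member is None:
                # Default to the first CSV file found in the archive
                csv_names = [
                    name for name in archive.namelist()
                    if name.lower().endswith(".csv")
                ]
                if not csv_names:
                    raise ValueError("No CSV files found in archive")
                member = csv_names[0]
            with archive.open(member) as handle:
                return pd.read_csv(handle, **kwargs)

For example, ``read_csv_from_zip('dataset-name.zip', 'datafile.csv')`` would
load the same file as the ``pd.read_csv('datafile.csv')`` call above without
extracting the archive first.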
6446+
.. _io.google_colab.best_practices:
6447+
6448+
Best practices for Colab
6449+
'''''''''''''''''''''''''
6450+
6451+
- **For repeated use**: Mount Google Drive and store your data there
6452+
- **For small files**: Use the upload widget for quick one-time analysis
6453+
- **For public datasets**: Read directly from URLs when possible
6454+
- **For large files**: Consider using Parquet format for faster loading and smaller file sizes
6455+
- **Session management**: Remember that uploaded files and variables are lost when the runtime disconnects.

.. _xarray: https://xarray.pydata.org/en/stable/
63096456

63106457
.. _io.perf:
63116458
