
Commit eaaf560

DarthMax and Mats-SX committed
Improve notebook documentation
Co-authored-by: Mats Rydberg <mats@neo4j.org>
1 parent ec66ddb commit eaaf560

2 files changed: +17, -16 lines

examples/graph-analytics-serverless-spark.ipynb

Lines changed: 16 additions & 15 deletions
@@ -26,12 +26,12 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "This Jupyter notebook is hosted [here](https://github.com/neo4j/graph-data-science-client/blob/main/examples/graph-analytics-serverless.ipynb) in the Neo4j Graph Data Science Client Github repository.\n",
+    "This Jupyter notebook is hosted [here](https://github.com/neo4j/graph-data-science-client/blob/main/examples/graph-analytics-serverless-spark.ipynb) in the Neo4j Graph Data Science Client Github repository.\n",
     "\n",
-    "The notebook shows how to use the `graphdatascience` Python library to create, manage, and use a GDS Session.\n",
+    "The notebook shows how to use the `graphdatascience` Python library to create, manage, and use a GDS Session from within an Apache Spark cluster.\n",
     "\n",
-    "We consider a graph of bicycle rentals, which we're using as a simple example to show how project data from Spark to a GDS Session, run algorithms, and eventually retrieving the results back to Spark.\n",
-    "We will cover all management operations: creation, listing, and deletion."
+    "We consider a graph of bicycle rentals, which we're using as a simple example to show how to project data from Spark to a GDS Session, run algorithms, and eventually return results back to Spark.\n",
+    "In this notebook we will focus on the interaction with Apache Spark, and will not cover all possible actions using GDS sessions. We refer to other Tutorials for additional details."
    ]
   },
   {
@@ -74,7 +74,7 @@
    "source": [
     "### Connecting to a Spark Session\n",
     "\n",
-    "To interact with the Spark Cluster we need to first instantiate a Spark session. In this example we will use a local Spark session, which will run Spark on the same machine.\n",
+    "To interact with the Spark cluster we need to first instantiate a Spark session. In this example we will use a local Spark session, which will run Spark on the same machine.\n",
     "Working with a remote Spark cluster will work similarly."
    ]
   },
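For context on the cell being edited here, a minimal sketch of starting such a local Spark session; the app name is an illustrative assumption, not part of this diff:

```python
# Minimal sketch: a local Spark session as described in the cell above.
# The app name is an illustrative assumption.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gds-session-spark-example")
    .master("local[*]")  # run Spark locally on this machine, using all cores
    .getOrCreate()
)
```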
@@ -115,7 +115,7 @@
    "api_credentials = AuraAPICredentials(\n",
    "    client_id=os.environ[\"CLIENT_ID\"],\n",
    "    client_secret=os.environ[\"CLIENT_SECRET\"],\n",
-   "    # If your account is a member of several project, you must also specify the project ID to use\n",
+   "    # If your account is a member of several projects, you must also specify the project ID to use\n",
    "    project_id=os.environ.get(\"PROJECT_ID\", None),\n",
    ")\n",
    "\n",
@@ -164,7 +164,8 @@
    "source": [
     "## Adding a dataset\n",
     "\n",
-    "As the next step we will setup a dataset in Spark. In this example we will use the New York Bike trips dataset (https://www.kaggle.com/datasets/gabrielramos87/bike-trips)."
+    "As the next step we will setup a dataset in Spark. In this example we will use the New York Bike trips dataset (https://www.kaggle.com/datasets/gabrielramos87/bike-trips).",
+    "The bike trips form a graph where nodes represent bike renting stations and relationships represent start and end points for a bike rental trip."
    ]
   },
   {
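A plausible sketch of loading such a dataset into Spark and registering the `bike_trips` view that the projection code in the later hunks queries; the file path and CSV options are assumptions for illustration:

```python
# Sketch: load the New York bike trips data into Spark.
# The local file path and CSV options are assumptions, not from this diff.
bike_trips = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("bike-trips.csv")
)

# Register the DataFrame so later cells can query it as `bike_trips` in Spark SQL.
bike_trips.createOrReplaceTempView("bike_trips")
```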
@@ -209,8 +210,8 @@
    "\n",
    "1. Send an action `v2/graph.project.fromTriplets`\n",
    "   This will initialize the import process and allows us to specify the graph name, and settings like `undirected_relationship_types`. It returns a job id, that we need to reference the import job in the following steps.\n",
-   "2. Send the data in batches to the arrow server.\n",
-   "3. Send another action called `v2/graph.project.fromTriplets.done` to tell the import process that no more data will be send. This will trigger the final graph creation inside the session.\n",
+   "2. Send the data in batches to the Arrow server.\n",
+   "3. Send another action called `v2/graph.project.fromTriplets.done` to tell the import process that no more data will be sent. This will trigger the final graph creation inside the GDS session.\n",
    "4. Wait for the import process to reach the `DONE` state.\n",
    "\n",
    "While the overall process is straight forward, we need to somehow tell Spark to"
@@ -234,7 +235,7 @@
    "job_id = arrow_client.create_graph_from_triplets(graph_name, concurrency=4)\n",
    "\n",
    "\n",
-   "# Define a function that receives an arrow batch and uploads it to the session\n",
+   "# Define a function that receives an arrow batch and uploads it to the GDS session\n",
    "def upload_batch(iterator):\n",
    "    for batch in iterator:\n",
    "        arrow_client.upload_triplets(job_id, [batch])\n",
@@ -247,7 +248,7 @@
    "    FROM bike_trips\n",
    "\"\"\")\n",
    "\n",
-   "# 2. Use the `mapInArrow` function to upload the data to the sessions. Returns a dataframe with a single column with the batch sizes.\n",
+   "# 2. Use the `mapInArrow` function to upload the data to the GDS session. Returns a DataFrame with a single column containing the batch sizes.\n",
    "uploaded_batches = source_target_pairs.mapInArrow(upload_batch, \"batch_rows_imported long\")\n",
    "\n",
    "# Aggregate the batch sizes to receive the row count.\n",
@@ -291,7 +292,7 @@
    "\n",
    "Once the computation is done, we might want to further use the result in Spark.\n",
    "We can do this in a similar way to the projection, by streaming batches of data into each of the Spark workers.\n",
-   "Retrieving the data is a bit more complicated since we need some input data frame in order to trigger computations on the Spark workers.\n",
+   "Retrieving the data is a bit more complicated since we need some input DataFrame in order to trigger computations on the Spark workers.\n",
    "We use a data range equal to the size of workers we have in our cluster as our driving table.\n",
    "On the workers we will disregard the input and instead stream the computation data from the GDS Session."
    ]
@@ -302,7 +303,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-   "# 1. Start the node property export on the session\n",
+   "# 1. Start the node property export on the GDS session\n",
    "job_id = arrow_client.get_node_properties(G.name(), [\"pagerank\"])\n",
    "\n",
    "\n",
@@ -330,9 +331,9 @@
    "source": [
     "## Cleanup\n",
     "\n",
-    "Now that we have finished our analysis, we can delete the session and stop the spark connection.\n",
+    "Now that we have finished our analysis, we can delete the GDS session and stop the Spark session.\n",
     "\n",
-    "Deleting the session will release all resources associated with it, and stop incurring costs."
+    "Deleting the GDS session will release all resources associated with it, and stop incurring costs."
    ]
   },
   {
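A short sketch of the cleanup the new wording describes; the session name and the `sessions.delete` call are assumptions based on the `graphdatascience` sessions API, not part of this diff:

```python
# Delete the GDS session to release its resources and stop incurring costs
# (session name and delete signature are assumptions).
sessions.delete(session_name="my-gds-session")

# Stop the Spark session started at the beginning of the notebook.
spark.stop()
```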

src/graphdatascience/session/aura_graph_data_science.py

Lines changed: 1 addition & 1 deletion
@@ -184,7 +184,7 @@ def __getattr__(self, attr: str) -> IndirectCallBuilder:
     def arrow_client(self) -> GdsArrowClient:
         """
         Returns a GdsArrowClient that is authenticated to communicate with the Aura Graph Analytics Session.
-        This client can be used to get direct access to the sessions Arrow Flight server.
+        This client can be used to get direct access to the specific session's Arrow Flight server.
 
         Returns:
             A GdsArrowClient
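A one-line usage sketch for the documented accessor; the `gds` variable is assumed to be an existing `AuraGraphDataScience` instance:

```python
# Obtain the authenticated Arrow client for this session's Arrow Flight
# server (assumes an existing AuraGraphDataScience instance named `gds`).
arrow_client = gds.arrow_client()
```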
