
Commit 9569339

Improve notebook descriptions and address PR comments
1 parent eaaf560 commit 9569339

2 files changed: +32, -18 lines


examples/graph-analytics-serverless-spark.ipynb

Lines changed: 31 additions & 17 deletions
@@ -40,9 +40,7 @@
 "source": [
 "## Prerequisites\n",
 "\n",
-"This notebook requires having an AuraDB instance available and have the Aura Graph Analytics [feature](https://neo4j.com/docs/aura/graph-analytics/#aura-gds-serverless) enabled for your project.\n",
-"\n",
-"We also need to have the `graphdatascience` Python library installed, version `1.18` or later, as well as `pyspark`. For more information about setting up pyspark visit https://spark.apache.org/docs/latest/api/python/getting_started/"
+"We also need to have the `graphdatascience` Python library installed, version `1.18` or later, as well as `pyspark`."
 ]
 },
 {
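As a quick way to verify these prerequisites, the sketch below checks that both packages are importable and prints their installed versions. It is only an illustration; how the packages are installed (for example `pip install "graphdatascience>=1.18" pyspark`) is an assumption and not part of the notebook.

```python
# Minimal prerequisite check (illustrative, not part of the notebook).
# Assumes both packages are already installed, e.g. via
#   pip install "graphdatascience>=1.18" pyspark
from importlib.metadata import version

print("graphdatascience:", version("graphdatascience"))  # expect 1.18 or later
print("pyspark:", version("pyspark"))
```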
@@ -75,7 +73,7 @@
 "### Connecting to a Spark Session\n",
 "\n",
 "To interact with the Spark cluster we need to first instantiate a Spark session. In this example we will use a local Spark session, which will run Spark on the same machine.\n",
-"Working with a remote Spark cluster will work similarly."
+"Working with a remote Spark cluster works similarly. For more information about setting up PySpark, visit https://spark.apache.org/docs/latest/api/python/getting_started/"
 ]
 },
 {
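For readers who have not started Spark before, a minimal local session can be created as sketched below. The master setting and app name are illustrative assumptions; the notebook's own setup cell is not shown in this diff.

```python
# A minimal sketch of starting a local Spark session. The master setting and
# app name are assumptions, not taken from the notebook.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")  # run Spark on this machine, using all available cores
    .appName("graph-analytics-serverless-spark")
    .getOrCreate()
)
```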
@@ -164,8 +162,7 @@
 "source": [
 "## Adding a dataset\n",
 "\n",
-"As the next step we will setup a dataset in Spark. In this example we will use the New York Bike trips dataset (https://www.kaggle.com/datasets/gabrielramos87/bike-trips).",
-"The bike trips form a graph where nodes represent bike renting stations and relationships represent start and end points for a bike rental trip."
+"As the next step we will set up a dataset in Spark. In this example we will use the New York Bike trips dataset (https://www.kaggle.com/datasets/gabrielramos87/bike-trips). The bike trips form a graph where nodes represent bike rental stations and relationships represent trips from a start station to an end station."
 ]
 },
 {
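The later cells query a `bike_trips` view with `start_station_id` and `end_station_id` columns. A hedged sketch of how such a view could be registered from the downloaded CSV is shown below; the file path and loading options are assumptions, since the notebook's actual loading cell is not part of this diff.

```python
# Hedged sketch: register the bike trips CSV as the `bike_trips` view used by
# the SQL query later in the notebook. The file path and options are assumptions.
bike_trips = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("bike-trips.csv")
)
bike_trips.createOrReplaceTempView("bike_trips")
bike_trips.select("start_station_id", "end_station_id").show(5)
```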
@@ -206,23 +203,37 @@
 "\n",
 "We first need to get access to the GDSArrowClient. This client allows us to directly communicate with the Arrow Flight server provided by the session.\n",
 "\n",
-"Our input data already resembles edge triplets, where each of the rows represents an edge from a source station to a target station. This allows us to use the arrows servers graph import from triplets functionality, which requires the following protocol:\n",
+"Our input data already resembles triplets, where each row represents an edge from a source station to a target station. This allows us to use the Arrow server's \"graph import from triplets\" functionality, which requires the following protocol:\n",
 "\n",
 "1. Send an action `v2/graph.project.fromTriplets`\n",
 " This will initialize the import process and allows us to specify the graph name, and settings like `undirected_relationship_types`. It returns a job id, that we need to reference the import job in the following steps.\n",
 "2. Send the data in batches to the Arrow server.\n",
 "3. Send another action called `v2/graph.project.fromTriplets.done` to tell the import process that no more data will be sent. This will trigger the final graph creation inside the GDS session.\n",
 "4. Wait for the import process to reach the `DONE` state.\n",
 "\n",
-"While the overall process is straight forward, we need to somehow tell Spark to"
+"The most complicated step is running the actual data upload on each Spark worker. We will use the `mapInArrow` function to run custom code on the workers; each worker receives a number of Arrow record batches that we can send directly to the GDS session's Arrow server."
+]
+},
+{
+"metadata": {},
+"cell_type": "markdown",
+"source": [
+"We add a 1-second delay within the loop that waits for the import job to finish, so that we poll the job status instead of busy-waiting. This requires importing the `time` module and calling `time.sleep(1)` inside the `while` loop at the end of the cell.\n",
+"\n"
 ]
 },
 {
-"cell_type": "code",
-"execution_count": null,
 "metadata": {},
+"cell_type": "markdown",
+"source": "The next cell runs the import: it uploads the data from the Spark workers and waits for the graph to be created inside the GDS session.\n"
+},
+{
+"metadata": {},
+"cell_type": "code",
 "outputs": [],
+"execution_count": null,
 "source": [
+"import time\n",
 "import pandas as pd\n",
 "import pyarrow\n",
 "from pyspark.sql import functions\n",
@@ -244,30 +255,33 @@
 "\n",
 "# Select the source target pairs from our source data\n",
 "source_target_pairs = spark.sql(\"\"\"\n",
-" SELECT start_station_id AS sourceNode, end_station_id AS targetNode\n",
-" FROM bike_trips\n",
-"\"\"\")\n",
+" SELECT start_station_id AS sourceNode, end_station_id AS targetNode\n",
+" FROM bike_trips\n",
+" \"\"\")\n",
 "\n",
 "# 2. Use the `mapInArrow` function to upload the data to the GDS session. Returns a DataFrame with a single column containing the batch sizes.\n",
 "uploaded_batches = source_target_pairs.mapInArrow(upload_batch, \"batch_rows_imported long\")\n",
 "\n",
 "# Aggregate the batch sizes to receive the row count.\n",
-"uploaded_batches.agg(functions.sum(\"batch_rows_imported\").alias(\"rows_imported\")).show()\n",
+"aggregated_batch_sizes = uploaded_batches.agg(functions.sum(\"batch_rows_imported\").alias(\"rows_imported\"))\n",
+"\n",
+"# Show the result. This will trigger the computation and thus run the data upload.\n",
+"aggregated_batch_sizes.show()\n",
 "\n",
 "# 3. Finish the import process\n",
 "arrow_client.triplet_load_done(job_id)\n",
 "\n",
 "# 4. Wait for the import to finish\n",
 "while not arrow_client.job_status(job_id).succeeded():\n",
-" pass\n",
+" time.sleep(1)\n",
 "\n",
 "G = gds.v2.graph.get(graph_name)\n",
 "G"
 ]
 },
 {
-"cell_type": "markdown",
 "metadata": {},
+"cell_type": "markdown",
 "source": [
 "## Running Algorithms\n",
 "\n",
@@ -322,7 +336,7 @@
 "# Optional: Repartition the data to make sure it is distributed equally\n",
 "result = received_batches.repartition(numPartitions=spark.sparkContext.defaultParallelism)\n",
 "\n",
-"result.show()"
+"result.toPandas()"
 ]
 },
 {

src/graphdatascience/arrow_client/v2/api_types.py

Lines changed: 1 addition & 1 deletion
@@ -34,7 +34,7 @@ def sub_tasks(self) -> str | None:
         return None
 
     def aborted(self) -> bool:
-        return self.status == "Aborted"
+        return self.status.lower() == "aborted"
 
     def succeeded(self) -> bool:
         return self.status.lower() == "done"
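With both status checks now case-insensitive, a polling loop like the notebook's wait step can use them as sketched below. Extending the loop with an `aborted()` check is an illustrative addition, and `arrow_client` and `job_id` are assumed to exist as in the notebook.

```python
# Hedged polling sketch using the case-insensitive status helpers. The
# aborted() branch is an illustrative extension of the notebook's wait loop.
import time

while True:
    status = arrow_client.job_status(job_id)
    if status.succeeded():
        break
    if status.aborted():
        raise RuntimeError("Triplet import job was aborted")
    time.sleep(1)
```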
