|
26 | 26 | "cell_type": "markdown", |
27 | 27 | "metadata": {}, |
28 | 28 | "source": [ |
29 | | - "This Jupyter notebook is hosted [here](https://github.com/neo4j/graph-data-science-client/blob/main/examples/graph-analytics-serverless.ipynb) in the Neo4j Graph Data Science Client Github repository.\n", |
| 29 | + "This Jupyter notebook is hosted [here](https://github.com/neo4j/graph-data-science-client/blob/main/examples/graph-analytics-serverless-spark.ipynb) in the Neo4j Graph Data Science Client Github repository.\n", |
30 | 30 | "\n", |
31 | | - "The notebook shows how to use the `graphdatascience` Python library to create, manage, and use a GDS Session.\n", |
| 31 | + "The notebook shows how to use the `graphdatascience` Python library to create, manage, and use a GDS Session from within an Apache Spark cluster.\n", |
32 | 32 | "\n", |
33 | | - "We consider a graph of bicycle rentals, which we're using as a simple example to show how project data from Spark to a GDS Session, run algorithms, and eventually retrieving the results back to Spark.\n", |
34 | | - "We will cover all management operations: creation, listing, and deletion." |
| 33 | + "We consider a graph of bicycle rentals, which we're using as a simple example to show how to project data from Spark to a GDS Session, run algorithms, and eventually return results back to Spark.\n", |
| 34 | + "In this notebook we will focus on the interaction with Apache Spark, and will not cover all possible actions using GDS sessions. We refer to other Tutorials for additional details." |
35 | 35 | ] |
36 | 36 | }, |
37 | 37 | { |
|
74 | 74 | "source": [ |
75 | 75 | "### Connecting to a Spark Session\n", |
76 | 76 | "\n", |
77 | | - "To interact with the Spark Cluster we need to first instantiate a Spark session. In this example we will use a local Spark session, which will run Spark on the same machine.\n", |
| 77 | + "To interact with the Spark cluster we need to first instantiate a Spark session. In this example we will use a local Spark session, which will run Spark on the same machine.\n", |
78 | 78 | "Working with a remote Spark cluster will work similarly." |
79 | 79 | ] |
80 | 80 | }, |
|
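The cells around this hunk create the local Spark session described above. As a minimal sketch, assuming only that `pyspark` is installed (the application name below is a placeholder, not taken from the notebook), such a session can be started like this:

```python
# Minimal sketch: start a local Spark session that runs on the same machine.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")                 # run Spark locally, using all available cores
    .appName("gds-session-bike-trips")  # placeholder application name
    .getOrCreate()
)
```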
115 | 115 | "api_credentials = AuraAPICredentials(\n", |
116 | 116 | " client_id=os.environ[\"CLIENT_ID\"],\n", |
117 | 117 | " client_secret=os.environ[\"CLIENT_SECRET\"],\n", |
118 | | - " # If your account is a member of several project, you must also specify the project ID to use\n", |
| 118 | + " # If your account is a member of several projects, you must also specify the project ID to use\n", |
119 | 119 | " project_id=os.environ.get(\"PROJECT_ID\", None),\n", |
120 | 120 | ")\n", |
121 | 121 | "\n", |
|
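These API credentials are typically handed to a `GdsSessions` object to create or reuse a GDS Session. The following is a hedged sketch based on the general GDS Sessions tutorials; the session name, memory size, and cloud location are assumptions rather than values from this notebook:

```python
# Hedged sketch: create (or reuse) a GDS Session using the credentials defined above.
from graphdatascience.session import CloudLocation, GdsSessions, SessionMemory

sessions = GdsSessions(api_credentials=api_credentials)
session = sessions.get_or_create(
    session_name="spark-bike-trips",                      # placeholder session name
    memory=SessionMemory.m_8GB,                           # assumed memory size
    cloud_location=CloudLocation("gcp", "europe-west1"),  # assumed provider and region
)
```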
164 | 164 | "source": [ |
165 | 165 | "## Adding a dataset\n", |
166 | 166 | "\n", |
167 | | - "As the next step we will setup a dataset in Spark. In this example we will use the New York Bike trips dataset (https://www.kaggle.com/datasets/gabrielramos87/bike-trips)." |
| 167 | + "As the next step we will setup a dataset in Spark. In this example we will use the New York Bike trips dataset (https://www.kaggle.com/datasets/gabrielramos87/bike-trips).", |
| 168 | + "The bike trips form a graph where nodes represent bike renting stations and relationships represent start and end points for a bike rental trip." |
168 | 169 | ] |
169 | 170 | }, |
170 | 171 | { |
|
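The cells that follow load this dataset into Spark. As a sketch, assuming the Kaggle data has been downloaded as a local CSV file (the file name is a placeholder), it can be registered as the `bike_trips` view that the later SQL query reads from:

```python
# Sketch: load the downloaded bike trips CSV and expose it to Spark SQL.
bike_trips_df = (
    spark.read
    .option("header", True)       # first row holds the column names
    .option("inferSchema", True)  # let Spark infer the column types
    .csv("bike-trips.csv")        # placeholder path to the downloaded dataset
)

# Register a temporary view so the later `FROM bike_trips` query can find the data.
bike_trips_df.createOrReplaceTempView("bike_trips")
```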
209 | 210 | "\n", |
210 | 211 | "1. Send an action `v2/graph.project.fromTriplets`\n", |
211 | 212 | " This will initialize the import process and allows us to specify the graph name, and settings like `undirected_relationship_types`. It returns a job id, that we need to reference the import job in the following steps.\n", |
212 | | - "2. Send the data in batches to the arrow server.\n", |
213 | | - "3. Send another action called `v2/graph.project.fromTriplets.done` to tell the import process that no more data will be send. This will trigger the final graph creation inside the session.\n", |
| 213 | + "2. Send the data in batches to the Arrow server.\n", |
| 214 | + "3. Send another action called `v2/graph.project.fromTriplets.done` to tell the import process that no more data will be sent. This will trigger the final graph creation inside the GDS session.\n", |
214 | 215 | "4. Wait for the import process to reach the `DONE` state.\n", |
215 | 216 | "\n", |
216 | 217 | "While the overall process is straight forward, we need to somehow tell Spark to" |
|
234 | 235 | "job_id = arrow_client.create_graph_from_triplets(graph_name, concurrency=4)\n", |
235 | 236 | "\n", |
236 | 237 | "\n", |
237 | | - "# Define a function that receives an arrow batch and uploads it to the session\n", |
| 238 | + "# Define a function that receives an arrow batch and uploads it to the GDS session\n", |
238 | 239 | "def upload_batch(iterator):\n", |
239 | 240 | " for batch in iterator:\n", |
240 | 241 | " arrow_client.upload_triplets(job_id, [batch])\n", |
|
247 | 248 | " FROM bike_trips\n", |
248 | 249 | "\"\"\")\n", |
249 | 250 | "\n", |
250 | | - "# 2. Use the `mapInArrow` function to upload the data to the sessions. Returns a dataframe with a single column with the batch sizes.\n", |
| 251 | + "# 2. Use the `mapInArrow` function to upload the data to the GDS session. Returns a DataFrame with a single column containing the batch sizes.\n", |
251 | 252 | "uploaded_batches = source_target_pairs.mapInArrow(upload_batch, \"batch_rows_imported long\")\n", |
252 | 253 | "\n", |
253 | 254 | "# Aggregate the batch sizes to receive the row count.\n", |
|
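Note that `mapInArrow` is lazy: `upload_batch` only runs on the workers once an action forces `uploaded_batches` to be evaluated. The aggregation mentioned in the comment above could look roughly like the sketch below; the calls that finish the import (steps 3 and 4 of the protocol) are shown only as comments with hypothetical method names, since the exact client calls are not visible here:

```python
from pyspark.sql import functions as F

# Trigger the upload: evaluating this aggregation runs upload_batch on the workers
# and sums the per-batch row counts into a single total.
total_rows = uploaded_batches.agg(F.sum("batch_rows_imported")).collect()[0][0]
print(f"Imported {total_rows} relationship rows")

# Hypothetical calls (method names are assumptions, not from the notebook):
# tell the session that no more triplets will be sent, then wait for the DONE state.
# arrow_client.finish_triplet_import(job_id)
# arrow_client.wait_until_done(job_id)
```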
291 | 292 | "\n", |
292 | 293 | "Once the computation is done, we might want to further use the result in Spark.\n", |
293 | 294 | "We can do this in a similar way to the projection, by streaming batches of data into each of the Spark workers.\n", |
294 | | - "Retrieving the data is a bit more complicated since we need some input data frame in order to trigger computations on the Spark workers.\n", |
| 295 | + "Retrieving the data is a bit more complicated since we need some input DataFrame in order to trigger computations on the Spark workers.\n", |
295 | 296 | "We use a data range equal to the size of workers we have in our cluster as our driving table.\n", |
296 | 297 | "On the workers we will disregard the input and instead stream the computation data from the GDS Session." |
297 | 298 | ] |
|
302 | 303 | "metadata": {}, |
303 | 304 | "outputs": [], |
304 | 305 | "source": [ |
305 | | - "# 1. Start the node property export on the session\n", |
| 306 | + "# 1. Start the node property export on the GDS session\n", |
306 | 307 | "job_id = arrow_client.get_node_properties(G.name(), [\"pagerank\"])\n", |
307 | 308 | "\n", |
308 | 309 | "\n", |
|
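The rest of this cell implements the driving-table pattern described above. The sketch below shows one way it could look; the number of workers, the result schema, and the `stream_node_properties` call are assumptions, since the exact client API is not shown here:

```python
# Hedged sketch of streaming the PageRank results back into a Spark DataFrame.
num_workers = 2  # placeholder: set to the number of Spark workers in the cluster


def download_batches(iterator):
    # The rows coming from spark.range() only trigger execution on the workers; ignore them.
    for _ in iterator:
        pass
    # Hypothetical call: stream this worker's share of the exported node properties
    # from the GDS Session as Arrow record batches.
    yield from arrow_client.stream_node_properties(job_id)


# One input row per worker acts as the driving table; the streamed batches become a
# regular Spark DataFrame (assumed schema: node id plus the PageRank score).
pagerank_df = spark.range(num_workers).mapInArrow(download_batches, "nodeId long, pagerank double")
pagerank_df.show()
```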
330 | 331 | "source": [ |
331 | 332 | "## Cleanup\n", |
332 | 333 | "\n", |
333 | | - "Now that we have finished our analysis, we can delete the session and stop the spark connection.\n", |
| 334 | + "Now that we have finished our analysis, we can delete the GDS session and stop the Spark session.\n", |
334 | 335 | "\n", |
335 | | - "Deleting the session will release all resources associated with it, and stop incurring costs." |
| 336 | + "Deleting the GDS session will release all resources associated with it, and stop incurring costs." |
336 | 337 | ] |
337 | 338 | }, |
338 | 339 | { |
|
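For completeness, a sketch of the cleanup step, assuming the `GdsSessions` object and the placeholder session name from the session-creation sketch earlier; the GDS Session is deleted first, then the Spark session is stopped:

```python
# Delete the GDS Session to release its resources and stop incurring costs
# (the session name is the placeholder used earlier), then stop the Spark session.
sessions.delete(session_name="spark-bike-trips")
spark.stop()
```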