5 changes: 4 additions & 1 deletion gcloud/.gitignore
@@ -1 +1,4 @@
init/*/
init/*/
tls
*~
env.json
255 changes: 70 additions & 185 deletions gcloud/README.md
@@ -16,200 +16,85 @@ limitations under the License.

-->

## Introduction
# Dataproc Critical User Journey (CUJ) Framework

This README file describes how to use this collection of gcloud bash examples to
reproduce common Dataproc cluster creation problems relating to the GCE startup
script, Dataproc startup script, and Dataproc initialization-actions scripts.
This directory contains a collection of scripts that form a test framework for exercising Critical User Journeys (CUJs) on Google Cloud Dataproc. The goal of this framework is to provide a robust, maintainable, and automated way to reproduce and validate the common and complex use cases that are essential for our customers.

## Clone the git repository
This framework replaces the previous monolithic scripts with a modular, scalable, and self-documenting structure designed for both interactive use and CI/CD automation.

```
$ git clone git@github.com:GoogleCloudDataproc/cloud-dataproc
$ cd cloud-dataproc/gcloud
$ cp env.json.sample env.json
$ vi env.json
```
## Framework Overview

## Environment configuration
The framework is organized into several key directories, each with a distinct purpose:

First, copy `env.json.sample` to `env.json` and modify the environment
variable names and their values in `env.json` to match your
environment:
* **`onboarding/`**: Contains idempotent scripts to set up persistent, shared infrastructure that multiple CUJs might depend on. These are typically run once per project. Examples include setting up a shared Cloud SQL instance or a Squid proxy VM.

```
{
"PROJECT_ID":"ldap-example-yyyy-nn",
"ORG_NUMBER":"100000000001",
"DOMAIN": "your-domain-goes-here.com",
"BILLING_ACCOUNT":"100000-000000-000001",
"FOLDER_NUMBER":"100000000001",
"REGION":"us-west4",
"RANGE":"10.00.01.0/24",
"IDLE_TIMEOUT":"30m",
"ASN_NUMBER":"65531",
"IMAGE_VERSION":"2.2,
"BIGTABLE_INSTANCE":"my-bigtable"
}
* **`cuj/`**: The heart of the framework. This directory contains the individual, self-contained CUJs, grouped by the Dataproc platform (`gce`, `gke`, `s8s`). Each CUJ represents a specific, testable customer scenario.

* **`lib/`**: A collection of modular bash script libraries (`_core.sh`, `_network.sh`, `_database.sh`, etc.). These files contain all the powerful, reusable functions for creating and managing GCP resources, forming a shared API for all `onboarding` and `cuj` scripts.

* **`ci/`**: Includes scripts specifically for CI/CD automation. The `pristine_check.sh` script is designed to enforce a clean project state before and after test runs, preventing bitrot and ensuring reproducibility (see the example invocation after this list).
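
For illustration only (the CUJ path below is just an example), a CI pipeline might wrap a test run with the pristine check like this:

```bash
# Cleanup mode before the run, strict validation afterwards.
bash gcloud/ci/pristine_check.sh
bash gcloud/cuj/gce/standard/manage.sh rebuild
bash gcloud/ci/pristine_check.sh --strict
```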

## Getting Started

Follow these steps to configure your environment and run your first CUJ.

### 1. Prerequisites

Ensure you have the following tools installed and configured:
* `gcloud` CLI (authenticated to your Google account)
* `jq`
* A Google Cloud project with billing enabled.

### 2. Configure Your Environment

Copy the sample configuration file and edit it to match your environment.

```bash
cp gcloud/env.json.sample gcloud/env.json
vi gcloud/env.json
```

The values that you enter here are used to build reasonable defaults in
`lib/env.sh`; you can view and modify `lib/env.sh` to tune your environment more
finely. The code in `lib/env.sh` is sourced and executed at the head of many
scripts in this suite to ensure that the environment is configured for this
reproduction.
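
As orientation, the following is a minimal sketch, assuming `env.json` sits in the `gcloud/` directory and `jq` is installed, of the kind of lookup `lib/env.sh` performs; the real file derives many more defaults than shown here:

```bash
# Hypothetical sketch only; the real lib/env.sh builds many more defaults.
ENV_JSON="$(dirname "${BASH_SOURCE[0]}")/../env.json"

PROJECT_ID="$(jq -r '.PROJECT_ID' "${ENV_JSON}")"
REGION="$(jq -r '.REGION' "${ENV_JSON}")"
IDLE_TIMEOUT="$(jq -r '.IDLE_TIMEOUT // "30m"' "${ENV_JSON}")"

export PROJECT_ID REGION IDLE_TIMEOUT
```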

#### Dataproc on GCE

To tune the reproduction environment for your (customer's) GCE use case, review
the `create_dpgce_cluster` function in `lib/shared-functions.sh`. This is where
you select which arguments are passed to the `gcloud dataproc clusters create
${CLUSTER_NAME}` command. The comments below the gcloud call contain many
examples of common use cases; an illustrative sketch follows.
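
The sketch below is illustrative only and not the real function; it simply shows the shape of the wrapper and where customer-specific arguments would go:

```bash
# Sketch of the wrapper's shape; the real create_dpgce_cluster in
# lib/shared-functions.sh may pass a different set of arguments. Append
# customer-specific flags (e.g. --optional-components, --enable-component-gateway)
# to the gcloud call as needed.
function create_dpgce_cluster() {
  gcloud dataproc clusters create "${CLUSTER_NAME}" \
    --project "${PROJECT_ID}" \
    --region "${REGION}" \
    --subnet "${SUBNET_NAME}" \
    --image-version "${IMAGE_VERSION}" \
    --max-idle "${IDLE_TIMEOUT}"
}
```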

## creation phase

When reviewing `lib/shared-functions.sh`, pay attention to the
`--metadata startup-script="..."` and `--initialization-actions
"${INIT_ACTIONS_ROOT}/<script-name>"` arguments. These can be used to
execute arbitrary code during the creation of Dataproc clusters. Many
Google Cloud Support cases relate to failures in either a) Dataproc's
internal startup script, which runs after the script supplied via
`--metadata startup-script="..."`, or b) scripts passed using the
`--initialization-actions` cluster creation argument.
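
As a hedged example (the bucket path and script name below are placeholders, not files shipped with this repository), both hooks can be exercised in a single creation command:

```bash
# Placeholder bucket and script names; adjust them to your reproduction.
gcloud dataproc clusters create "${CLUSTER_NAME}" \
  --region "${REGION}" \
  --metadata startup-script='#!/bin/bash
logger "GCE startup script ran before the Dataproc agent"' \
  --initialization-actions "gs://${BUCKET}/init/my-init-action.sh"
```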

## creating the environment and cluster

Once you have edited `env.json` and reviewed the function names in
`lib/shared-functions.sh`, you can create your cluster environment and launch
your cluster by running `bin/create-dpgce`. Although the functions are intended
to be idempotent, do not plan to run this more than once for a single
reproduction, as a re-run may leave the environment in a non-functional state.

Running the `bin/create-dpgce` script will create the staging bucket, enable the
required services, create a dedicated VPC network, router, NAT, subnet, firewall
rules, and finally, the cluster itself.

By default, your cluster will time out and be destroyed after 30 minutes of
inactivity. Activity is defined as receipt of a job via the `gcloud dataproc
jobs submit` command. You can change this default by altering the value of
`IDLE_TIMEOUT` in `env.json`. This saves your project and your org the operating
cost of reproduction clusters that are not actively being used to reproduce
problems, while still giving you half an hour to do your work before worrying
that your cluster will be torn down.
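
The behaviour described above most likely corresponds to Dataproc scheduled deletion; as an assumption (this README does not spell out the flag), the create script would forward `IDLE_TIMEOUT` as `--max-idle`, so a two-hour timeout looks like:

```bash
# Scheduled deletion: delete the cluster after 2h with no submitted jobs.
gcloud dataproc clusters create "${CLUSTER_NAME}" \
  --region "${REGION}" \
  --max-idle "2h"
```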

## recreating the cluster

If your cluster has been destroyed, either by timeout or by manually calling
`gcloud dataproc clusters delete`, you can re-create it by running
`bin/recreate-dpgce`. This script does not re-create any of the resources the
cluster depends on, such as the network, router, or staging bucket. It only
deletes and re-creates the cluster that is already defined in `env.json` and was
previously provisioned using `bin/create-dpgce`.
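
Conceptually, the recreate step reduces to something like the sketch below; the real `bin/recreate-dpgce` sources the library code and reuses the values already recorded in `env.json`:

```bash
# Conceptual sketch only: drop the cluster if present, then create it again
# on the existing network, subnet, and staging bucket.
gcloud dataproc clusters delete "${CLUSTER_NAME}" --region "${REGION}" --quiet || true
gcloud dataproc clusters create "${CLUSTER_NAME}" \
  --region "${REGION}" \
  --subnet "${SUBNET_NAME}" \
  --max-idle "${IDLE_TIMEOUT}"
```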

## deleting the environment and cluster

If you need to delete the entire environment, run `bin/destroy-dpgce`; this
will delete the cluster and remove the firewall rules, subnet, NAT, router, VPC
network, and staging bucket. To re-create a deleted environment, run
`bin/create-dpgce` after `bin/destroy-dpgce` completes successfully.

### Metadata store

Any startup script that runs on a GCE instance, including Dataproc GCE cluster
nodes, may use the `/usr/share/google/get_metadata_value` script to look up
information in the metadata store. The information available from the metadata
server includes some of the arguments passed with the `--metadata` flag when the
cluster was created.

For instance, if you call `gcloud dataproc clusters create ${CLUSTER_NAME}` with
the argument `--metadata init-actions-repo=${INIT_ACTIONS_ROOT}`, you can
retrieve that value by running
`/usr/share/google/get_metadata_value "attributes/init-actions-repo"`. Some
attributes are set by default on Dataproc nodes; important ones are listed
below, with example queries after the list:

* attributes/dataproc-role
- value: `Master` for master nodes
- value: `Worker` for primary and secondary worker nodes
* attributes/dataproc-cluster-name
* attributes/dataproc-bucket
* attributes/dataproc-cluster-uuid
* attributes/dataproc-region
* hostname (FQDN)
* name (short hostname)
* machine-type
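
A few example queries, run from a shell on any cluster node:

```bash
# "Master" or "Worker", plus other attributes from the list above.
/usr/share/google/get_metadata_value "attributes/dataproc-role"
/usr/share/google/get_metadata_value "attributes/dataproc-cluster-name"
# Present only if the cluster was created with --metadata init-actions-repo=...
/usr/share/google/get_metadata_value "attributes/init-actions-repo"
```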

### GCE Startup script

Before reading this section, please become familiar with the GCE documentation
for the
[startup-script](https://cloud.google.com/compute/docs/instances/startup-scripts/linux)
metadata argument.

The content of the startup script, if passed as a string, is stored as
`attributes/startup-script` in the metadata store. If passed as a URL, the URL
can be found under `attributes/startup-script-url`.

The GCE startup script runs prior to the Dataproc Agent. This script can be
used to make small modifications to the environment prior to starting Dataproc
services on the host.
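
For example, the startup script can be supplied by URL rather than inline; the bucket path below is a placeholder:

```bash
# Placeholder bucket path; this script runs before the Dataproc agent starts.
gcloud dataproc clusters create "${CLUSTER_NAME}" \
  --region "${REGION}" \
  --metadata startup-script-url="gs://${BUCKET}/scripts/gce-startup.sh"
```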

### Dataproc Startup script

The Dataproc agent is responsible for launching the [Dataproc startup
script](https://cs/piper///depot/google3/cloud/hadoop/services/images/startup-script.sh)
and the [initialization
actions](https://github.com/GoogleCloudDataproc/initialization-actions) in order
of specification.

The Dataproc startup script runs before the initialization actions and logs its
output to `/var/log/dataproc-startup-script.log`. It is linked to by
`/usr/local/share/google/dataproc/startup-script.sh` on all Dataproc nodes. The
tasks that the startup script runs are influenced by the arguments in the list
further below. The list is not exhaustive: if you are troubleshooting startup
errors, determine whether any arguments or properties are being supplied to the
`clusters create` command, especially any similar to the ones listed; an example
invocation follows that list.

You only need to edit the universal and onboarding settings. The `load_config` function in the library will dynamically generate a `PROJECT_ID` if the default value is present.
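
A hypothetical sketch of that behaviour follows; the sentinel value and naming scheme are illustrative only, and the authoritative logic lives in `lib/common.sh`:

```bash
# Hypothetical: if PROJECT_ID was left at its sample default, generate one.
if [[ "${CONFIG[PROJECT_ID]}" == "your-project-id" ]]; then
  CONFIG[PROJECT_ID]="cuj-$(date +%Y%m%d)-${RANDOM}"
fi
```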

### 3. Run Onboarding Scripts

Before running any CUJs, you must set up the shared infrastructure for your project. These scripts are idempotent and can be run multiple times safely.

```bash
# Set up the shared Cloud SQL instance with VPC Peering
bash gcloud/onboarding/create_cloudsql_instance.sh

# Set up the shared Squid Proxy VM and its networking
bash gcloud/onboarding/create_squid_proxy.sh
```
* `--optional-components`
* `--enable-component-gateway`
* `--properties 'dataproc:conda.*=...'`
* `--properties 'dataproc:pip.*=...'`
* `--properties 'dataproc:kerberos.*=...'`
* `--properties 'dataproc:ranger.*=...'`
* `--properties 'dataproc:druid.*=...'`
* `--properties 'dataproc:kafka.*=...'`
* `--properties 'dataproc:yarn.docker.*=...'`
* `--properties 'dataproc:solr.*=...'`
* `--properties 'dataproc:jupyter.*=...'`
* `--properties 'dataproc:zeppelin.*=...'`
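
A hedged example of a creation command that exercises this path; the components and property values are placeholders rather than recommendations:

```bash
# Optional components plus a pip property, both of which the startup script
# configures on images prior to 2.3.
gcloud dataproc clusters create "${CLUSTER_NAME}" \
  --region "${REGION}" \
  --optional-components JUPYTER,ZEPPELIN \
  --enable-component-gateway \
  --properties 'dataproc:pip.packages=pandas==2.1.4'
```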

### 4. Run a Critical User Journey

Navigate to the directory of the CUJ you want to run and use its `manage.sh` script.

**Example: Running the standard GCE cluster CUJ**

```bash
# Navigate to the CUJ directory
cd gcloud/cuj/gce/standard/

# Create all resources for this CUJ
./manage.sh up

# When finished, tear down all resources for this CUJ
./manage.sh down
```

On Dataproc images prior to 2.3, the startup script is responsible for
configuring the optional components that the customer selected, in the way the
customer specified with properties. Errors that reference
`dataproc-startup-script.log` therefore often relate to the configuration of
optional components and their services.

### Dataproc Initialization Actions scripts

Documentation for the
[initialization-actions](https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions)
argument to the `gcloud dataproc clusters create` command can be found in the
Dataproc documentation. You may also want to review the
[README.md](https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/README.md)
from the public initialization-actions repo on GitHub.

Note that you can specify multiple initialization action scripts; they are
executed in the order of specification. The initialization action scripts are
stored at
`/etc/google-dataproc/startup-scripts/dataproc-initialization-script-${INDEX}`
on the filesystem of each cluster node, where `${INDEX}` is the script number,
starting at 0 and incrementing for each additional script. The URL of each
script can be found by querying the metadata server for
`attributes/dataproc-initialization-action-script-${INDEX}`. From within the
script itself, you can refer to `attributes/$0`.

Logs for each initialization action script are created under `/var/log`.
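
For example, on a cluster node you can cross-reference the first initialization action's source URL, its on-disk copy, and (by the usual naming convention) its log file:

```bash
# Source URL recorded in metadata (index 0 is the first script specified).
/usr/share/google/get_metadata_value "attributes/dataproc-initialization-action-script-0"
# On-disk copy executed by the agent.
cat /etc/google-dataproc/startup-scripts/dataproc-initialization-script-0
# Log output, following the /var/log/dataproc-initialization-script-<INDEX>.log convention.
less /var/log/dataproc-initialization-script-0.log
```
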
Each `manage.sh` script supports several commands (a short usage example follows the list):
* **`up`**: Creates all resources for the CUJ.
* **`down`**: Deletes all resources created by this CUJ.
* **`rebuild`**: Runs `down` and then `up` for a full cycle.
* **`validate`**: Checks for prerequisites, such as required APIs or shared infrastructure.
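
For example, when iterating on a CUJ:

```bash
# Confirm shared infrastructure and APIs are in place, then do a full cycle.
./manage.sh validate
./manage.sh rebuild
```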

## Available CUJs

This framework includes the following initial CUJs:

* **`gce/standard`**: Creates a standard Dataproc on GCE cluster in a dedicated VPC with a Cloud NAT gateway for secure internet egress.
* **`gce/proxy-egress`**: Creates a Dataproc on GCE cluster in a private network configured to use the shared Squid proxy for all outbound internet traffic.
* **`gke/standard`**: Creates a standard Dataproc on GKE virtual cluster on a new GKE cluster.
105 changes: 105 additions & 0 deletions gcloud/ci/pristine_check.sh
@@ -0,0 +1,105 @@
#!/bin/bash
#
# Verifies and enforces a pristine state in the project for CUJ testing
# by finding and deleting all resources tagged with the CUJ_TAG.
#
# This script is designed to be run from a CI/CD pipeline at the beginning
# (in cleanup mode) and at the end (in strict mode) of a test run.
#
# Usage:
# ./pristine_check.sh # Cleanup mode: Aggressively deletes resources.
# ./pristine_check.sh --strict # Validation mode: Fails if any resources are found.

set -e
SCRIPT_DIR=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)
source "${SCRIPT_DIR}/../lib/common.sh"
load_config

STRICT_MODE=false
if [[ "$1" == "--strict" ]]; then
STRICT_MODE=true
fi

# Store leftover resources to report at the end
LEFTOVERS_FILE=$(mktemp)
trap 'rm -f -- "${LEFTOVERS_FILE}"' EXIT

# --- Helper Functions ---

# Generic function to find, report, and optionally delete tagged resources.
# Arguments:
# $1: The type of resource (for logging purposes, e.g., "Dataproc Clusters")
# $2: The gcloud command to list resources (e.g., "gcloud dataproc clusters list ...")
# $3: The gcloud command to delete resources (e.g., "gcloud dataproc clusters delete ...")
function process_resources() {
  local resource_type="$1"
  local list_command="$2"
  local delete_command="$3"

  # The "tr" command handles cases where no resources are found (to avoid errors)
  # and where multiple resources are found (one per line).
  local resources
  resources=$(eval "${list_command}" | tr '\n' ' ' | sed 's/ *$//')

  if [[ -n "${resources}" ]]; then
    echo "Found leftover ${resource_type}: ${resources}" | tee -a "${LEFTOVERS_FILE}"
    if [[ "$STRICT_MODE" == false ]]; then
      echo "Cleaning up ${resource_type}..."
      # gcloud takes resource names as trailing positionals; delete one at a
      # time because some commands (e.g. "dataproc clusters delete") accept
      # only a single name per invocation.
      local resource
      for resource in ${resources}; do
        eval "${delete_command} ${resource}" &
      done
    fi
  fi
}

# --- Main Execution ---

header "Pristine Check running in $([[ "$STRICT_MODE" == true ]] && echo 'STRICT' || echo 'CLEANUP') mode"

# Define commands for each resource type. All are filtered by the CUJ_TAG where possible.
LIST_CLUSTERS_CMD="gcloud dataproc clusters list --region='${CONFIG[REGION]}' --filter='config.gceClusterConfig.tags.items=${CONFIG[CUJ_TAG]}' --format='value(clusterName)' 2>/dev/null"
DELETE_CLUSTERS_CMD="gcloud dataproc clusters delete --quiet --region='${CONFIG[REGION]}'"

LIST_INSTANCES_CMD="gcloud compute instances list --filter='tags.items=${CONFIG[CUJ_TAG]}' --format='value(name)' 2>/dev/null"
DELETE_INSTANCES_CMD="gcloud compute instances delete --quiet --zone='${CONFIG[ZONE]}'"

# Routers and Networks cannot be tagged, so we must rely on a naming convention for them.
LIST_ROUTERS_CMD="gcloud compute routers list --filter='name~^cuj-' --format='value(name)' 2>/dev/null"
DELETE_ROUTERS_CMD="gcloud compute routers delete --quiet --region='${CONFIG[REGION]}'"

LIST_FIREWALLS_CMD="gcloud compute firewall-rules list --filter='targetTags.items=${CONFIG[CUJ_TAG]} OR name~^cuj-' --format='value(name)' 2>/dev/null"
DELETE_FIREWALLS_CMD="gcloud compute firewall-rules delete --quiet"

# Process resources that can be deleted in parallel first.
process_resources "Dataproc Clusters" "${LIST_CLUSTERS_CMD}" "${DELETE_CLUSTERS_CMD}"
process_resources "GCE Instances" "${LIST_INSTANCES_CMD}" "${DELETE_INSTANCES_CMD}"
process_resources "Firewall Rules" "${LIST_FIREWALLS_CMD}" "${DELETE_FIREWALLS_CMD}"
process_resources "Cloud Routers" "${LIST_ROUTERS_CMD}" "${DELETE_ROUTERS_CMD}"

if [[ "$STRICT_MODE" == false ]]; then
echo "Waiting for initial resource cleanup to complete..."
wait
fi

# Process networks last, as they have dependencies.
LIST_NETWORKS_CMD="gcloud compute networks list --filter='name~^cuj-' --format='value(name)' 2>/dev/null"
DELETE_NETWORKS_CMD="gcloud compute networks delete --quiet"
process_resources "VPC Networks" "${LIST_NETWORKS_CMD}" "${DELETE_NETWORKS_CMD}"

if [[ "$STRICT_MODE" == false ]]; then
wait
fi

# --- Final Report ---
if [[ -s "${LEFTOVERS_FILE}" ]]; then
echo "--------------------------------------------------" >&2
echo "ERROR: Leftover resources were detected:" >&2
cat "${LEFTOVERS_FILE}" >&2
echo "--------------------------------------------------" >&2
if [[ "$STRICT_MODE" == true ]]; then
echo "STRICT mode failed. The project is not pristine." >&2
exit 1
fi
# In non-strict mode, we report but don't fail, assuming the next run will succeed.
fi

echo "Pristine check complete."