5 changes: 4 additions & 1 deletion gcloud/.gitignore
@@ -1 +1,4 @@
init/*/
init/*/
tls
*~
env.json
255 changes: 70 additions & 185 deletions gcloud/README.md
@@ -16,200 +16,85 @@ limitations under the License.

-->

## Introduction
# Dataproc Critical User Journey (CUJ) Framework

This README file describes how to use this collection of gcloud bash examples to
reproduce common Dataproc cluster creation problems relating to the GCE startup
script, Dataproc startup script, and Dataproc initialization-actions scripts.
This directory contains a collection of scripts that form a test framework for exercising Critical User Journeys (CUJs) on Google Cloud Dataproc. The goal of this framework is to provide a robust, maintainable, and automated way to reproduce and validate the common and complex use cases that are essential for our customers.

## Clone the git repository
This framework replaces the previous monolithic scripts with a modular, scalable, and self-documenting structure designed for both interactive use and CI/CD automation.

```
$ git clone git@github.com:GoogleCloudDataproc/cloud-dataproc
$ cd cloud-dataproc/gcloud
$ cp env.json.sample env.json
$ vi env.json
```
## Framework Overview

## Environment configuration
The framework is organized into several key directories, each with a distinct purpose:

First, copy `env.json.sample` to `env.json` and modify the environment
variable names and their values in `env.json` to match your
environment:
* **`onboarding/`**: Contains idempotent scripts to set up persistent, shared infrastructure that multiple CUJs might depend on. These are typically run once per project. Examples include setting up a shared Cloud SQL instance or a Squid proxy VM.

```
{
"PROJECT_ID":"ldap-example-yyyy-nn",
"ORG_NUMBER":"100000000001",
"DOMAIN": "your-domain-goes-here.com",
"BILLING_ACCOUNT":"100000-000000-000001",
"FOLDER_NUMBER":"100000000001",
"REGION":"us-west4",
"RANGE":"10.00.01.0/24",
"IDLE_TIMEOUT":"30m",
"ASN_NUMBER":"65531",
"IMAGE_VERSION":"2.2,
"BIGTABLE_INSTANCE":"my-bigtable"
}
* **`cuj/`**: The heart of the framework. This directory contains the individual, self-contained CUJs, grouped by the Dataproc platform (`gce`, `gke`, `s8s`). Each CUJ represents a specific, testable customer scenario.

* **`lib/`**: A collection of modular bash script libraries (`_core.sh`, `_network.sh`, `_database.sh`, etc.). These files contain all the powerful, reusable functions for creating and managing GCP resources, forming a shared API for all `onboarding` and `cuj` scripts.

* **`ci/`**: Includes scripts specifically for CI/CD automation. The `pristine_check.sh` script is designed to enforce a clean project state before and after test runs, preventing bitrot and ensuring reproducibility (see the example invocation after this list).
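
For illustration only (the CUJ path below is just an example), a CI pipeline might wrap a test run with the pristine check like this:

```bash
# Cleanup mode before the run, strict validation afterwards.
bash gcloud/ci/pristine_check.sh
bash gcloud/cuj/gce/standard/manage.sh rebuild
bash gcloud/ci/pristine_check.sh --strict
```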

## Getting Started

Follow these steps to configure your environment and run your first CUJ.

### 1. Prerequisites

Ensure you have the following tools installed and configured:
* `gcloud` CLI (authenticated to your Google account)
* `jq`
* A Google Cloud project with billing enabled.

### 2. Configure Your Environment

Copy the sample configuration file and edit it to match your environment.

```bash
cp gcloud/env.json.sample gcloud/env.json
vi gcloud/env.json
```

The values that you enter here are used to build reasonable defaults in
`lib/env.sh`; you can view and modify `lib/env.sh` to tune your environment more
finely. The code in `lib/env.sh` is sourced and executed at the head of many
scripts in this suite to ensure that the environment is configured for this
reproduction.
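
As orientation, the following is a minimal sketch, assuming `env.json` sits in the `gcloud/` directory and `jq` is installed, of the kind of lookup `lib/env.sh` performs; the real file derives many more defaults than shown here:

```bash
# Hypothetical sketch only; the real lib/env.sh builds many more defaults.
ENV_JSON="$(dirname "${BASH_SOURCE[0]}")/../env.json"

PROJECT_ID="$(jq -r '.PROJECT_ID' "${ENV_JSON}")"
REGION="$(jq -r '.REGION' "${ENV_JSON}")"
IDLE_TIMEOUT="$(jq -r '.IDLE_TIMEOUT // "30m"' "${ENV_JSON}")"

export PROJECT_ID REGION IDLE_TIMEOUT
```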

#### Dataproc on GCE

To tune the reproduction environment for your (customer's) GCE use case, review
the `create_dpgce_cluster` function in `lib/shared-functions.sh`. This is where
you select which arguments are passed to the `gcloud dataproc clusters create
${CLUSTER_NAME}` command. The comments below the gcloud call contain many
examples of common use cases; an illustrative sketch follows.
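
The sketch below is illustrative only and not the real function; it simply shows the shape of the wrapper and where customer-specific arguments would go:

```bash
# Sketch of the wrapper's shape; the real create_dpgce_cluster in
# lib/shared-functions.sh may pass a different set of arguments. Append
# customer-specific flags (e.g. --optional-components, --enable-component-gateway)
# to the gcloud call as needed.
function create_dpgce_cluster() {
  gcloud dataproc clusters create "${CLUSTER_NAME}" \
    --project "${PROJECT_ID}" \
    --region "${REGION}" \
    --subnet "${SUBNET_NAME}" \
    --image-version "${IMAGE_VERSION}" \
    --max-idle "${IDLE_TIMEOUT}"
}
```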

## creation phase

When reviewing `lib/shared-functions.sh`, pay attention to the
`--metadata startup-script="..."` and `--initialization-actions
"${INIT_ACTIONS_ROOT}/<script-name>"` arguments. These can be used to
execute arbitrary code during the creation of Dataproc clusters. Many
Google Cloud Support cases relate to failures in either a) Dataproc's
internal startup script, which runs after the script supplied via
`--metadata startup-script="..."`, or b) scripts passed using the
`--initialization-actions` cluster creation argument.
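
As a hedged example (the bucket path and script name below are placeholders, not files shipped with this repository), both hooks can be exercised in a single creation command:

```bash
# Placeholder bucket and script names; adjust them to your reproduction.
gcloud dataproc clusters create "${CLUSTER_NAME}" \
  --region "${REGION}" \
  --metadata startup-script='#!/bin/bash
logger "GCE startup script ran before the Dataproc agent"' \
  --initialization-actions "gs://${BUCKET}/init/my-init-action.sh"
```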

## creating the environment and cluster

Once you have edited `env.json` and reviewed the function names in
`lib/shared-functions.sh`, you can create your cluster environment and launch
your cluster by running `bin/create-dpgce`. Although the functions are intended
to be idempotent, do not plan to run this more than once for a single
reproduction, as a re-run may leave the environment in a non-functional state.

Running the `bin/create-dpgce` script will create the staging bucket, enable the
required services, create a dedicated VPC network, router, NAT, subnet, firewall
rules, and finally, the cluster itself.

By default, your cluster will time out and be destroyed after 30 minutes of
inactivity. Activity is defined as receipt of a job via the `gcloud dataproc
jobs submit` command. You can change this default by altering the value of
`IDLE_TIMEOUT` in `env.json`. This saves your project and your org the operating
cost of reproduction clusters that are not actively being used to reproduce
problems, while still giving you half an hour to do your work before worrying
that your cluster will be torn down.
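
The behaviour described above most likely corresponds to Dataproc scheduled deletion; as an assumption (this README does not spell out the flag), the create script would forward `IDLE_TIMEOUT` as `--max-idle`, so a two-hour timeout looks like:

```bash
# Scheduled deletion: delete the cluster after 2h with no submitted jobs.
gcloud dataproc clusters create "${CLUSTER_NAME}" \
  --region "${REGION}" \
  --max-idle "2h"
```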

## recreating the cluster

If your cluster has been destroyed, either by timeout or by manually calling
`gcloud dataproc clusters delete`, you can re-create it by running
`bin/recreate-dpgce`. This script does not re-create any of the resources the
cluster depends on, such as the network, router, or staging bucket. It only
deletes and re-creates the cluster that is already defined in `env.json` and was
previously provisioned using `bin/create-dpgce`.
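
Conceptually, the recreate step reduces to something like the sketch below; the real `bin/recreate-dpgce` sources the library code and reuses the values already recorded in `env.json`:

```bash
# Conceptual sketch only: drop the cluster if present, then create it again
# on the existing network, subnet, and staging bucket.
gcloud dataproc clusters delete "${CLUSTER_NAME}" --region "${REGION}" --quiet || true
gcloud dataproc clusters create "${CLUSTER_NAME}" \
  --region "${REGION}" \
  --subnet "${SUBNET_NAME}" \
  --max-idle "${IDLE_TIMEOUT}"
```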

## deleting the environment and cluster

If you need to delete the entire environment, run `bin/destroy-dpgce`; this
will delete the cluster and remove the firewall rules, subnet, NAT, router, VPC
network, and staging bucket. To re-create a deleted environment, run
`bin/create-dpgce` after `bin/destroy-dpgce` completes successfully.

### Metadata store

Any startup script that runs on a GCE instance, including Dataproc GCE cluster
nodes, may use the `/usr/share/google/get_metadata_value` script to look up
information in the metadata store. The information available from the metadata
server includes some of the arguments passed with the `--metadata` flag when the
cluster was created.

For instance, if you call `gcloud dataproc clusters create ${CLUSTER_NAME}` with
the argument `--metadata init-actions-repo=${INIT_ACTIONS_ROOT}`, you can
retrieve that value by running
`/usr/share/google/get_metadata_value "attributes/init-actions-repo"`. Some
attributes are set by default on Dataproc nodes; important ones are listed
below, with example queries after the list:

* attributes/dataproc-role
- value: `Master` for master nodes
- value: `Worker` for primary and secondary worker nodes
* attributes/dataproc-cluster-name
* attributes/dataproc-bucket
* attributes/dataproc-cluster-uuid
* attributes/dataproc-region
* hostname (FQDN)
* name (short hostname)
* machine-type
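
A few example queries, run from a shell on any cluster node:

```bash
# "Master" or "Worker", plus other attributes from the list above.
/usr/share/google/get_metadata_value "attributes/dataproc-role"
/usr/share/google/get_metadata_value "attributes/dataproc-cluster-name"
# Present only if the cluster was created with --metadata init-actions-repo=...
/usr/share/google/get_metadata_value "attributes/init-actions-repo"
```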

### GCE Startup script

Before reading this section, please become familiar with the GCE documentation
for the
[startup-script](https://cloud.google.com/compute/docs/instances/startup-scripts/linux)
metadata argument.

The content of the startup script, if passed as a string, is stored as
`attributes/startup-script` in the metadata store. If passed as a URL, the URL
can be found under `attributes/startup-script-url`.

The GCE startup script runs prior to the Dataproc Agent. This script can be
used to make small modifications to the environment prior to starting Dataproc
services on the host.
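
For example, the startup script can be supplied by URL rather than inline; the bucket path below is a placeholder:

```bash
# Placeholder bucket path; this script runs before the Dataproc agent starts.
gcloud dataproc clusters create "${CLUSTER_NAME}" \
  --region "${REGION}" \
  --metadata startup-script-url="gs://${BUCKET}/scripts/gce-startup.sh"
```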

### Dataproc Startup script

The Dataproc agent is responsible for launching the [Dataproc startup
script](https://cs/piper///depot/google3/cloud/hadoop/services/images/startup-script.sh)
and the [initialization
actions](https://github.com/GoogleCloudDataproc/initialization-actions) in order
of specification.

The Dataproc startup script runs before the initialization actions and logs its
output to `/var/log/dataproc-startup-script.log`. It is linked to by
`/usr/local/share/google/dataproc/startup-script.sh` on all Dataproc nodes. The
tasks that the startup script runs are influenced by the arguments in the list
further below. The list is not exhaustive: if you are troubleshooting startup
errors, determine whether any arguments or properties are being supplied to the
`clusters create` command, especially any similar to the ones listed; an example
invocation follows that list.

You only need to edit the universal and onboarding settings. The `load_config` function in the library will dynamically generate a `PROJECT_ID` if the default value is present.
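
A hypothetical sketch of that behaviour follows; the sentinel value and naming scheme are illustrative only, and the authoritative logic lives in `lib/common.sh`:

```bash
# Hypothetical: if PROJECT_ID was left at its sample default, generate one.
if [[ "${CONFIG[PROJECT_ID]}" == "your-project-id" ]]; then
  CONFIG[PROJECT_ID]="cuj-$(date +%Y%m%d)-${RANDOM}"
fi
```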

### 3. Run Onboarding Scripts

Before running any CUJs, you must set up the shared infrastructure for your project. These scripts are idempotent and can be run multiple times safely.

```bash
# Set up the shared Cloud SQL instance with VPC Peering
bash gcloud/onboarding/create_cloudsql_instance.sh

# Set up the shared Squid Proxy VM and its networking
bash gcloud/onboarding/create_squid_proxy.sh
```
* `--optional-components`
* `--enable-component-gateway`
* `--properties 'dataproc:conda.*=...'`
* `--properties 'dataproc:pip.*=...'`
* `--properties 'dataproc:kerberos.*=...'`
* `--properties 'dataproc:ranger.*=...'`
* `--properties 'dataproc:druid.*=...'`
* `--properties 'dataproc:kafka.*=...'`
* `--properties 'dataproc:yarn.docker.*=...'`
* `--properties 'dataproc:solr.*=...'`
* `--properties 'dataproc:jupyter.*=...'`
* `--properties 'dataproc:zeppelin.*=...'`
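
A hedged example of a creation command that exercises this path; the components and property values are placeholders rather than recommendations:

```bash
# Optional components plus a pip property, both of which the startup script
# configures on images prior to 2.3.
gcloud dataproc clusters create "${CLUSTER_NAME}" \
  --region "${REGION}" \
  --optional-components JUPYTER,ZEPPELIN \
  --enable-component-gateway \
  --properties 'dataproc:pip.packages=pandas==2.1.4'
```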

### 4. Run a Critical User Journey

Navigate to the directory of the CUJ you want to run and use its `manage.sh` script.

**Example: Running the standard GCE cluster CUJ**

```bash
# Navigate to the CUJ directory
cd gcloud/cuj/gce/standard/

# Create all resources for this CUJ
./manage.sh up

# When finished, tear down all resources for this CUJ
./manage.sh down
```

On Dataproc images prior to 2.3, the startup script is responsible for
configuring the optional components that the customer selected, in the way the
customer specified with properties. Errors that reference
`dataproc-startup-script.log` therefore often relate to the configuration of
optional components and their services.

### Dataproc Initialization Actions scripts

Documentation for the
[initialization-actions](https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions)
argument to the `gcloud dataproc clusters create` command can be found in the
Dataproc documentation. You may also want to review the
[README.md](https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/README.md)
from the public initialization-actions repo on GitHub.

Note that you can specify multiple initialization action scripts; they are
executed in the order of specification. The initialization action scripts are
stored at
`/etc/google-dataproc/startup-scripts/dataproc-initialization-script-${INDEX}`
on the filesystem of each cluster node, where `${INDEX}` is the script number,
starting at 0 and incrementing for each additional script. The URL of each
script can be found by querying the metadata server for
`attributes/dataproc-initialization-action-script-${INDEX}`. From within the
script itself, you can refer to `attributes/$0`.

Logs for each initialization action script are created under `/var/log`.
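
For example, on a cluster node you can cross-reference the first initialization action's source URL, its on-disk copy, and (by the usual naming convention) its log file:

```bash
# Source URL recorded in metadata (index 0 is the first script specified).
/usr/share/google/get_metadata_value "attributes/dataproc-initialization-action-script-0"
# On-disk copy executed by the agent.
cat /etc/google-dataproc/startup-scripts/dataproc-initialization-script-0
# Log output, following the /var/log/dataproc-initialization-script-<INDEX>.log convention.
less /var/log/dataproc-initialization-script-0.log
```
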
Each `manage.sh` script supports several commands (a short usage example follows the list):
* **`up`**: Creates all resources for the CUJ.
* **`down`**: Deletes all resources created by this CUJ.
* **`rebuild`**: Runs `down` and then `up` for a full cycle.
* **`validate`**: Checks for prerequisites, such as required APIs or shared infrastructure.
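
For example, when iterating on a CUJ:

```bash
# Confirm shared infrastructure and APIs are in place, then do a full cycle.
./manage.sh validate
./manage.sh rebuild
```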

## Available CUJs

This framework includes the following initial CUJs:

* **`gce/standard`**: Creates a standard Dataproc on GCE cluster in a dedicated VPC with a Cloud NAT gateway for secure internet egress.
* **`gce/proxy-egress`**: Creates a Dataproc on GCE cluster in a private network configured to use the shared Squid proxy for all outbound internet traffic.
* **`gke/standard`**: Creates a standard Dataproc on GKE virtual cluster on a new GKE cluster.
105 changes: 105 additions & 0 deletions gcloud/ci/pristine_check.sh
@@ -0,0 +1,105 @@
#!/bin/bash
#
# Verifies and enforces a pristine state in the project for CUJ testing
# by finding and deleting all resources tagged with the CUJ_TAG.
#
# This script is designed to be run from a CI/CD pipeline at the beginning
# (in cleanup mode) and at the end (in strict mode) of a test run.
#
# Usage:
# ./pristine_check.sh # Cleanup mode: Aggressively deletes resources.
# ./pristine_check.sh --strict # Validation mode: Fails if any resources are found.

set -e
SCRIPT_DIR=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)
source "${SCRIPT_DIR}/../lib/common.sh"
load_config

STRICT_MODE=false
if [[ "$1" == "--strict" ]]; then
STRICT_MODE=true
fi

# Store leftover resources to report at the end
LEFTOVERS_FILE=$(mktemp)
trap 'rm -f -- "${LEFTOVERS_FILE}"' EXIT

# --- Helper Functions ---

# Generic function to find, report, and optionally delete tagged resources.
# Arguments:
# $1: The type of resource (for logging purposes, e.g., "Dataproc Clusters")
# $2: The gcloud command to list resources (e.g., "gcloud dataproc clusters list ...")
# $3: The gcloud command to delete resources (e.g., "gcloud dataproc clusters delete ...")
function process_resources() {
  local resource_type="$1"
  local list_command="$2"
  local delete_command="$3"

  # The "tr" command handles cases where no resources are found (to avoid errors)
  # and where multiple resources are found (one per line).
  local resources
  resources=$(eval "${list_command}" | tr '\n' ' ' | sed 's/ *$//')

  if [[ -n "${resources}" ]]; then
    echo "Found leftover ${resource_type}: ${resources}" | tee -a "${LEFTOVERS_FILE}"
    if [[ "$STRICT_MODE" == false ]]; then
      echo "Cleaning up ${resource_type}..."
      # gcloud takes resource names as trailing positionals; delete one at a
      # time because some commands (e.g. "dataproc clusters delete") accept
      # only a single name per invocation.
      local resource
      for resource in ${resources}; do
        eval "${delete_command} ${resource}" &
      done
    fi
  fi
}

# --- Main Execution ---

header "Pristine Check running in $([[ "$STRICT_MODE" == true ]] && echo 'STRICT' || echo 'CLEANUP') mode"

# Define commands for each resource type. All are filtered by the CUJ_TAG where possible.
LIST_CLUSTERS_CMD="gcloud dataproc clusters list --region='${CONFIG[REGION]}' --filter='config.gceClusterConfig.tags.items=${CONFIG[CUJ_TAG]}' --format='value(clusterName)' 2>/dev/null"
DELETE_CLUSTERS_CMD="gcloud dataproc clusters delete --quiet --region='${CONFIG[REGION]}'"

LIST_INSTANCES_CMD="gcloud compute instances list --filter='tags.items=${CONFIG[CUJ_TAG]}' --format='value(name)' 2>/dev/null"
DELETE_INSTANCES_CMD="gcloud compute instances delete --quiet --zone='${CONFIG[ZONE]}'"

# Routers and Networks cannot be tagged, so we must rely on a naming convention for them.
LIST_ROUTERS_CMD="gcloud compute routers list --filter='name~^cuj-' --format='value(name)' 2>/dev/null"
DELETE_ROUTERS_CMD="gcloud compute routers delete --quiet --region='${CONFIG[REGION]}'"

LIST_FIREWALLS_CMD="gcloud compute firewall-rules list --filter='targetTags.items=${CONFIG[CUJ_TAG]} OR name~^cuj-' --format='value(name)' 2>/dev/null"
DELETE_FIREWALLS_CMD="gcloud compute firewall-rules delete --quiet"

# Process resources that can be deleted in parallel first.
process_resources "Dataproc Clusters" "${LIST_CLUSTERS_CMD}" "${DELETE_CLUSTERS_CMD}"
process_resources "GCE Instances" "${LIST_INSTANCES_CMD}" "${DELETE_INSTANCES_CMD}"
process_resources "Firewall Rules" "${LIST_FIREWALLS_CMD}" "${DELETE_FIREWALLS_CMD}"
process_resources "Cloud Routers" "${LIST_ROUTERS_CMD}" "${DELETE_ROUTERS_CMD}"

if [[ "$STRICT_MODE" == false ]]; then
echo "Waiting for initial resource cleanup to complete..."
wait
fi

# Process networks last, as they have dependencies.
LIST_NETWORKS_CMD="gcloud compute networks list --filter='name~^cuj-' --format='value(name)' 2>/dev/null"
DELETE_NETWORKS_CMD="gcloud compute networks delete --quiet"
process_resources "VPC Networks" "${LIST_NETWORKS_CMD}" "${DELETE_NETWORKS_CMD}"

if [[ "$STRICT_MODE" == false ]]; then
wait
fi

# --- Final Report ---
if [[ -s "${LEFTOVERS_FILE}" ]]; then
echo "--------------------------------------------------" >&2
echo "ERROR: Leftover resources were detected:" >&2
cat "${LEFTOVERS_FILE}" >&2
echo "--------------------------------------------------" >&2
if [[ "$STRICT_MODE" == true ]]; then
echo "STRICT mode failed. The project is not pristine." >&2
exit 1
fi
# In non-strict mode, we report but don't fail, assuming the next run will succeed.
fi

echo "Pristine check complete."