From 653466733ef41430bd62de2abcaa834ca6963d3b Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Wed, 16 Jun 2021 10:09:22 +0200 Subject: [PATCH 001/177] Initial commit --- LICENSE | 201 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ README.md | 1 + 2 files changed, 202 insertions(+) create mode 100644 LICENSE create mode 100644 README.md diff --git a/LICENSE b/LICENSE new file mode 100644 index 00000000..261eeb9e --- /dev/null +++ b/LICENSE @@ -0,0 +1,201 @@ + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. + + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." 
+ + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. Grant of Patent License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative Works, in at least one + of the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. The contents + of the NOTICE file are for informational purposes only and + do not modify the License. You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. 
+ + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. Limitation of Liability. In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. + + END OF TERMS AND CONDITIONS + + APPENDIX: How to apply the Apache License to your work. + + To apply the Apache License to your work, attach the following + boilerplate notice, with the fields enclosed by brackets "[]" + replaced with your own identifying information. (Don't include + the brackets!) The text should be enclosed in the appropriate + comment syntax for the file format. 
We also recommend that a + file or class name and description of purpose be included on the + same "printed page" as the copyright notice for easier + identification within third-party archives. + + Copyright [yyyy] [name of copyright owner] + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. diff --git a/README.md b/README.md new file mode 100644 index 00000000..4f04ba17 --- /dev/null +++ b/README.md @@ -0,0 +1 @@ +# c3 \ No newline at end of file From 17c947a8915cd1db32348d825775272d1efb0c56 Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Wed, 16 Jun 2021 10:12:20 +0200 Subject: [PATCH 002/177] Update README.md --- README.md | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 4f04ba17..c2fd7809 100644 --- a/README.md +++ b/README.md @@ -1 +1,6 @@ -# c3 \ No newline at end of file +# C3 - the CLAIMED Component Compiler + +This repository contains C3 - the [CLAIMED](https://arxiv.org/abs/2103.03281) Component Compiler responsible for compipling CLAIMED components to Kubeflow Pipeline Components by containerizing the CLAIMED notebooks / scripts, creating the component.yaml and pushing the container image to a registry. + +# Prerequisites +- docker From 87e99aeb9e6864694ab7c02524086b3895aa795c Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Wed, 16 Jun 2021 10:57:45 +0200 Subject: [PATCH 003/177] test component --- .vscode/settings.json | 3 ++ Dockerfile | 3 ++ component.tar.gz | Bin 0 -> 586 bytes component.yaml | 27 +++++++++++ src/data/test.csv | 101 ++++++++++++++++++++++++++++++++++++++++++ src/program.py | 29 ++++++++++++ src/tmp/test.csv | 10 +++++ 7 files changed, 173 insertions(+) create mode 100644 .vscode/settings.json create mode 100644 Dockerfile create mode 100644 component.tar.gz create mode 100644 component.yaml create mode 100644 src/data/test.csv create mode 100755 src/program.py create mode 100644 src/tmp/test.csv diff --git a/.vscode/settings.json b/.vscode/settings.json new file mode 100644 index 00000000..94a6c908 --- /dev/null +++ b/.vscode/settings.json @@ -0,0 +1,3 @@ +{ + "jupyter.jupyterServerType": "remote" +} \ No newline at end of file diff --git a/Dockerfile b/Dockerfile new file mode 100644 index 00000000..b9639b20 --- /dev/null +++ b/Dockerfile @@ -0,0 +1,3 @@ +FROM python:3.7 +#RUN python3 -m pip install keras +COPY ./src /pipelines/component/src \ No newline at end of file diff --git a/component.tar.gz b/component.tar.gz new file mode 100644 index 0000000000000000000000000000000000000000..a05891b8bec6bc6469d5c97cbc8b6934c26567cf GIT binary patch literal 586 zcmV-Q0=4}giwFSHyUAbx1MQUGYuhjo$MfvJ;%FbTLG0K|n?T+MgRRj0fQ`Wzdr%$O z5w#^vlF7W%|9vM}c9$^dQyGQ(yeK-|eea~ZxKyoHl1rEMtZimbLXNE07k(`^mw8-^ zVE4b*MLsJo^7X6L>T*+TW@uRz#SHSNDd(8ZI%WW7Mzvhs3NC*)-2B7uvGZq&WG&yq zTkhamNN$rIx1|x<2_=Ksg4=VjnwO#ydnf#lH~?E(u@`++afEHvpn# zHR>Hr(Xw1^8|Jum!Q~ud+Lp;32n!6Wwa!;kiYwp5b*53T%(H{PNwx$Mw3Xk7;_&h77>gCsiXG0)&yJzMjbY_2&Qcas<2|&T-ZZ^!3|MNRdFbx49KUBvYszJe&2+zru4x*Et4KAb7>#$q=95m7 zdjAmM@75Mmb%XO))^x{X`B%g+S24#FH&}-(;{?T^AIbmz4iE%E5ClOG1VIo4K@bE% 
Y5ClOG1VIo4K@iWAKRSay4*)0t03jP7R{#J2 literal 0 HcmV?d00001 diff --git a/component.yaml b/component.yaml new file mode 100644 index 00000000..de9639ee --- /dev/null +++ b/component.yaml @@ -0,0 +1,27 @@ +name: Get Lines +description: Gets the specified number of lines from the input file. + +inputs: +- {name: Input 1, type: String, description: 'Data for input 1'} +- {name: Parameter 1, type: Integer, default: '100', description: 'Number of lines to copy'} + +outputs: +- {name: Output 1, type: String, description: 'Output 1 data.'} + +implementation: + container: + image: romeokienzler/c3:latest + # command is a list of strings (command-line arguments). + # The YAML language has two syntaxes for lists and you can use either of them. + # Here we use the "flow syntax" - comma-separated strings inside square brackets. + command: [ + python3, + # Path of the program inside the container + /pipelines/component/src/program.py, + --input1-path, + {inputPath: Input 1}, + --param1, + {inputValue: Parameter 1}, + --output1-path, + {outputPath: Output 1}, + ] diff --git a/src/data/test.csv b/src/data/test.csv new file mode 100644 index 00000000..d9f23e4f --- /dev/null +++ b/src/data/test.csv @@ -0,0 +1,101 @@ +x,y,z +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 \ No newline at end of file diff --git a/src/program.py b/src/program.py new file mode 100755 index 00000000..f0216c49 --- /dev/null +++ b/src/program.py @@ -0,0 +1,29 @@ +#!/usr/bin/env python3 +import argparse +from pathlib import Path + +# Function doing the actual work (Outputs first N lines from a text file) +def do_work(input1_file, output1_file, param1): + for x, line in enumerate(input1_file): + if x >= param1: + break + _ = output1_file.write(line) + +# Defining and parsing the command-line arguments +parser = argparse.ArgumentParser(description='My program description') +# Paths must be passed in, not hardcoded +parser.add_argument('--input1-path', type=str, + help='Path of the local file containing the Input 1 data.') +parser.add_argument('--output1-path', type=str, + help='Path of the local file where the Output 1 data should be written.') +parser.add_argument('--param1', type=int, default=100, + help='The number of lines to read from the input and write to the output.') +args = parser.parse_args() + +# Creating the directory where the output file is created (the directory +# may or may not exist). 
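+# parents=True creates any missing parent directories instead of raising
+# FileNotFoundError; exist_ok=True makes the call a no-op if the directory
+# already exists, so this is safe to run on every pipeline invocation.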
+Path(args.output1_path).parent.mkdir(parents=True, exist_ok=True) + +with open(args.input1_path, 'r') as input1_file: + with open(args.output1_path, 'w') as output1_file: + do_work(input1_file, output1_file, args.param1) diff --git a/src/tmp/test.csv b/src/tmp/test.csv new file mode 100644 index 00000000..df33812c --- /dev/null +++ b/src/tmp/test.csv @@ -0,0 +1,10 @@ +x,y,z +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 +1,2,3 From 94cfbbdcad70b16ab5c26f9bbe062f676078949f Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Tue, 22 Jun 2021 10:43:36 +0200 Subject: [PATCH 004/177] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index c2fd7809..4bb7a6ae 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,6 @@ # C3 - the CLAIMED Component Compiler -This repository contains C3 - the [CLAIMED](https://arxiv.org/abs/2103.03281) Component Compiler responsible for compipling CLAIMED components to Kubeflow Pipeline Components by containerizing the CLAIMED notebooks / scripts, creating the component.yaml and pushing the container image to a registry. +This repository contains C3 - the [CLAIMED](https://arxiv.org/abs/2103.03281) Component Compiler responsible for compiling CLAIMED components to Kubeflow Pipeline Components by containerizing the CLAIMED notebooks / scripts, creating the component.yaml and pushing the container image to a registry. # Prerequisites - docker From 06478db3e6c02e67d803897f6ff071aafe6e0555 Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Tue, 22 Jun 2021 10:44:28 +0200 Subject: [PATCH 005/177] Update README.md --- README.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/README.md b/README.md index 4bb7a6ae..658a3ecf 100644 --- a/README.md +++ b/README.md @@ -2,5 +2,7 @@ This repository contains C3 - the [CLAIMED](https://arxiv.org/abs/2103.03281) Component Compiler responsible for compiling CLAIMED components to Kubeflow Pipeline Components by containerizing the CLAIMED notebooks / scripts, creating the component.yaml and pushing the container image to a registry. +Please note: this is a very early version of the solution - please come back in a couple of weeks. 
+ # Prerequisites - docker From 865d236667deee88789813ffeccd41b6dc312849 Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Thu, 24 Jun 2021 12:27:34 +0200 Subject: [PATCH 006/177] 1st iteration on mlx upload + nbconvert --- .gitignore | 1 + component.tar.gz | Bin 586 -> 0 bytes component.yaml | 2 +- requirements.txt | 1 + src/convert/a_notebook.ipynb | 200 +++++++++++++++++++++++ src/convert/a_python_script.py | 112 +++++++++++++ src/convert/notebook_to_python_script.py | 19 +++ src/mlx/publish.py | 83 ++++++++++ 8 files changed, 417 insertions(+), 1 deletion(-) create mode 100644 .gitignore delete mode 100644 component.tar.gz create mode 100644 requirements.txt create mode 100644 src/convert/a_notebook.ipynb create mode 100644 src/convert/a_python_script.py create mode 100644 src/convert/notebook_to_python_script.py create mode 100644 src/mlx/publish.py diff --git a/.gitignore b/.gitignore new file mode 100644 index 00000000..f5e96dbf --- /dev/null +++ b/.gitignore @@ -0,0 +1 @@ +venv \ No newline at end of file diff --git a/component.tar.gz b/component.tar.gz deleted file mode 100644 index a05891b8bec6bc6469d5c97cbc8b6934c26567cf..0000000000000000000000000000000000000000 GIT binary patch literal 0 HcmV?d00001 literal 586 zcmV-Q0=4}giwFSHyUAbx1MQUGYuhjo$MfvJ;%FbTLG0K|n?T+MgRRj0fQ`Wzdr%$O z5w#^vlF7W%|9vM}c9$^dQyGQ(yeK-|eea~ZxKyoHl1rEMtZimbLXNE07k(`^mw8-^ zVE4b*MLsJo^7X6L>T*+TW@uRz#SHSNDd(8ZI%WW7Mzvhs3NC*)-2B7uvGZq&WG&yq zTkhamNN$rIx1|x<2_=Ksg4=VjnwO#ydnf#lH~?E(u@`++afEHvpn# zHR>Hr(Xw1^8|Jum!Q~ud+Lp;32n!6Wwa!;kiYwp5b*53T%(H{PNwx$Mw3Xk7;_&h77>gCsiXG0)&yJzMjbY_2&Qcas<2|&T-ZZ^!3|MNRdFbx49KUBvYszJe&2+zru4x*Et4KAb7>#$q=95m7 zdjAmM@75Mmb%XO))^x{X`B%g+S24#FH&}-(;{?T^AIbmz4iE%E5ClOG1VIo4K@bE% Y5ClOG1VIo4K@iWAKRSay4*)0t03jP7R{#J2 diff --git a/component.yaml b/component.yaml index de9639ee..367e3454 100644 --- a/component.yaml +++ b/component.yaml @@ -1,4 +1,4 @@ -name: Get Lines +name: Get Lines2 description: Gets the specified number of lines from the input file. 
inputs: diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 00000000..68e2a42d --- /dev/null +++ b/requirements.txt @@ -0,0 +1 @@ +p"git+https://github.com/machine-learning-exchange/mlx.git@main#egg=mlx-client&subdirectory=api/client" diff --git a/src/convert/a_notebook.ipynb b/src/convert/a_notebook.ipynb new file mode 100644 index 00000000..9aca72b0 --- /dev/null +++ b/src/convert/a_notebook.ipynb @@ -0,0 +1,200 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This notebook pulls the HMP accelerometer sensor data classification data set" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install pyspark==2.4.4" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# @param data_dir temporal data storage for local execution\n", + "# @param data_csv csv path and file name (default: data.csv)\n", + "# @param data_parquet path and parquet file name (default: data.parquet)\n", + "# @param master url of master (default: local mode)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from pyspark import SparkContext, SparkConf\n", + "from pyspark.sql import SparkSession\n", + "import os\n", + "from pyspark.sql.types import StructType, StructField, IntegerType\n", + "import fnmatch\n", + "from pyspark.sql.functions import lit" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data_csv = os.environ.get('data_csv', 'data.csv')\n", + "master = os.environ.get('master', \"local[*]\")\n", + "data_dir = os.environ.get('data_dir', '../../data/')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Lets create a local spark context (sc) and session (spark)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sc = SparkContext.getOrCreate(SparkConf().setMaster(master))\n", + "\n", + "spark = SparkSession \\\n", + " .builder \\\n", + " .getOrCreate()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Lets pull the data in raw format from the source (github)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!rm -Rf HMP_Dataset\n", + "!git clone https://github.com/wchill/HMP_Dataset" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "schema = StructType([\n", + " StructField(\"x\", IntegerType(), True),\n", + " StructField(\"y\", IntegerType(), True),\n", + " StructField(\"z\", IntegerType(), True)])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This step takes a while, it parses through all files and folders and creates a temporary dataframe for each file which gets appended to an overall data-frame \"df\". In addition, a column called \"class\" is added to allow for straightforward usage in Spark afterwards in a supervised machine learning scenario for example." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true, + "tags": [] + }, + "outputs": [], + "source": [ + "d = 'HMP_Dataset/'\n", + "\n", + "# filter list for all folders containing data (folders that don't start with .)\n", + "file_list_filtered = [s for s in os.listdir(d)\n", + " if os.path.isdir(os.path.join(d, s)) &\n", + " ~fnmatch.fnmatch(s, '.*')]\n", + "\n", + "# create pandas data frame for all the data\n", + "\n", + "df = None\n", + "\n", + "for category in file_list_filtered:\n", + " data_files = os.listdir('HMP_Dataset/' + category)\n", + "\n", + " # create a temporary pandas data frame for each data file\n", + " for data_file in data_files:\n", + " print(data_file)\n", + " temp_df = spark.read. \\\n", + " option(\"header\", \"false\"). \\\n", + " option(\"delimiter\", \" \"). \\\n", + " csv('HMP_Dataset/' + category + '/' + data_file, schema=schema)\n", + "\n", + " # create a column called \"source\" storing the current CSV file\n", + " temp_df = temp_df.withColumn(\"source\", lit(data_file))\n", + "\n", + " # create a column called \"class\" storing the current data folder\n", + " temp_df = temp_df.withColumn(\"class\", lit(category))\n", + "\n", + " if df is None:\n", + " df = temp_df\n", + " else:\n", + " df = df.union(temp_df)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Lets write the dataf-rame to a file in \"CSV\" format, this will also take quite some time:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df.write.option(\"header\", \"true\").csv(data_dir + data_csv)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we should have a CSV file with our contents" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.6" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/src/convert/a_python_script.py b/src/convert/a_python_script.py new file mode 100644 index 00000000..4bef91d2 --- /dev/null +++ b/src/convert/a_python_script.py @@ -0,0 +1,112 @@ +#!/usr/bin/env python +# coding: utf-8 + +# This notebook pulls the HMP accelerometer sensor data classification data set + +# In[ ]: + + +get_ipython().system('pip install pyspark==2.4.4') + + +# In[ ]: + + +# @param data_dir temporal data storage for local execution +# @param data_csv csv path and file name (default: data.csv) +# @param data_parquet path and parquet file name (default: data.parquet) +# @param master url of master (default: local mode) + + +# In[ ]: + + +from pyspark import SparkContext, SparkConf +from pyspark.sql import SparkSession +import os +from pyspark.sql.types import StructType, StructField, IntegerType +import fnmatch +from pyspark.sql.functions import lit + + +# In[ ]: + + +data_csv = os.environ.get('data_csv', 'data.csv') +master = os.environ.get('master', "local[*]") +data_dir = os.environ.get('data_dir', '../../data/') + + +# Lets create a local spark context (sc) and session (spark) + +# In[ ]: + + +sc = SparkContext.getOrCreate(SparkConf().setMaster(master)) + +spark = SparkSession .builder .getOrCreate() + + +# Lets pull the data in raw format from the source (github) + +# In[ ]: + + 
+get_ipython().system('rm -Rf HMP_Dataset') +get_ipython().system('git clone https://github.com/wchill/HMP_Dataset') + + +# In[ ]: + + +schema = StructType([ + StructField("x", IntegerType(), True), + StructField("y", IntegerType(), True), + StructField("z", IntegerType(), True)]) + + +# This step takes a while, it parses through all files and folders and creates a temporary dataframe for each file which gets appended to an overall data-frame "df". In addition, a column called "class" is added to allow for straightforward usage in Spark afterwards in a supervised machine learning scenario for example. + +# In[ ]: + + +d = 'HMP_Dataset/' + +# filter list for all folders containing data (folders that don't start with .) +file_list_filtered = [s for s in os.listdir(d) + if os.path.isdir(os.path.join(d, s)) & + ~fnmatch.fnmatch(s, '.*')] + +# create pandas data frame for all the data + +df = None + +for category in file_list_filtered: + data_files = os.listdir('HMP_Dataset/' + category) + + # create a temporary pandas data frame for each data file + for data_file in data_files: + print(data_file) + temp_df = spark.read. option("header", "false"). option("delimiter", " "). csv('HMP_Dataset/' + category + '/' + data_file, schema=schema) + + # create a column called "source" storing the current CSV file + temp_df = temp_df.withColumn("source", lit(data_file)) + + # create a column called "class" storing the current data folder + temp_df = temp_df.withColumn("class", lit(category)) + + if df is None: + df = temp_df + else: + df = df.union(temp_df) + + +# Lets write the dataf-rame to a file in "CSV" format, this will also take quite some time: + +# In[ ]: + + +df.write.option("header", "true").csv(data_dir + data_csv) + + +# Now we should have a CSV file with our contents diff --git a/src/convert/notebook_to_python_script.py b/src/convert/notebook_to_python_script.py new file mode 100644 index 00000000..76c70e99 --- /dev/null +++ b/src/convert/notebook_to_python_script.py @@ -0,0 +1,19 @@ +import nbformat as nbf +from nbconvert.exporters import PythonExporter +from nbconvert.preprocessors import TagRemovePreprocessor + +with open("a_notebook.ipynb", 'r', encoding='utf-8') as f: + the_notebook_nodes = nbf.read(f, as_version = 4) + +trp = TagRemovePreprocessor() + +trp.remove_cell_tags = ("remove",) + +pexp = PythonExporter() + +pexp.register_preprocessor(trp, enabled= True) + +the_python_script, meta = pexp.from_notebook_node(the_notebook_nodes) + +with open("a_python_script.py", 'w', encoding='utf-8') as f: + f.writelines(the_python_script) \ No newline at end of file diff --git a/src/mlx/publish.py b/src/mlx/publish.py new file mode 100644 index 00000000..464e32a7 --- /dev/null +++ b/src/mlx/publish.py @@ -0,0 +1,83 @@ +from __future__ import print_function + +import glob +import json +import os +import random +import re +import swagger_client +import tarfile +import tempfile + +from io import BytesIO +from os import environ as env +from pprint import pprint +from swagger_client.api_client import ApiClient, Configuration +# Copyright 2021 IBM Corporation +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. +# + +from swagger_client.models import ApiComponent, ApiGetTemplateResponse, ApiListComponentsResponse, \ + ApiGenerateCodeResponse, ApiRunCodeResponse +from swagger_client.rest import ApiException +from sys import stderr +from urllib3.response import HTTPResponse + +host = env.get("MLX_API_SERVICE_HOST",'127.0.0.1') +port = env.get("MLX_API_SERVICE_PORT", '8080') + + +api_base_path = 'apis/v1alpha1' + +def get_swagger_client(): + + config = Configuration() + config.host = f'http://{host}:{port}/{api_base_path}' + api_client = ApiClient(configuration=config) + + return api_client + +def create_tar_file(yamlfile_name): + + yamlfile_basename = os.path.basename(yamlfile_name) + tmp_dir = tempfile.gettempdir() + tarfile_path = os.path.join(tmp_dir, yamlfile_basename.replace(".yaml", ".tgz")) + + with tarfile.open(tarfile_path, "w:gz") as tar: + tar.add(yamlfile_name, arcname=yamlfile_basename) + + tar.close() + + return tarfile_path + +def upload_component_file(component_id, file_path): + + api_client = get_swagger_client() + api_instance = swagger_client.ComponentServiceApi(api_client=api_client) + + try: + response = api_instance.upload_component_file(id=component_id, uploadfile=file_path) + print(f"Upload file '{file_path}' to component with ID '{component_id}'") + + except ApiException as e: + print("Exception when calling ComponentServiceApi -> upload_component_file: %s\n" % e, file=stderr) + raise e + +def main(): + + component_file = create_tar_file('component.yaml') + upload_component_file('test3',component_file) + +if __name__ == '__main__': + main() \ No newline at end of file From d0305ce674bcc8bffba6805066d60f4b7ef1f47a Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Thu, 24 Jun 2021 21:41:03 +0200 Subject: [PATCH 007/177] name, description and envs extractable from notebook --- .gitignore | 3 +- requirements.txt | 5 +- src/builder/notebook.py | 13 ++ src/builder/parser.py | 209 ++++++++++++++++++ src/builder/test_notebook.py | 6 + .../notebooks}/a_notebook.ipynb | 9 +- 6 files changed, 242 insertions(+), 3 deletions(-) create mode 100644 src/builder/notebook.py create mode 100644 src/builder/parser.py create mode 100644 src/builder/test_notebook.py rename {src/convert => test/notebooks}/a_notebook.ipynb (98%) diff --git a/.gitignore b/.gitignore index f5e96dbf..d75edeae 100644 --- a/.gitignore +++ b/.gitignore @@ -1 +1,2 @@ -venv \ No newline at end of file +venv +__pycache__ \ No newline at end of file diff --git a/requirements.txt b/requirements.txt index 68e2a42d..10541fe0 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1 +1,4 @@ -p"git+https://github.com/machine-learning-exchange/mlx.git@main#egg=mlx-client&subdirectory=api/client" +"git+https://github.com/machine-learning-exchange/mlx.git@main#egg=mlx-client&subdirectory=api/client" +nbformat==5.1.3 +nbconvert==6.0.7 +ipython==7.16.1 \ No newline at end of file diff --git a/src/builder/notebook.py b/src/builder/notebook.py new file mode 100644 index 00000000..57f218d6 --- /dev/null +++ b/src/builder/notebook.py @@ -0,0 +1,13 @@ +import json +from parser import ContentParser + +class Notebook(): + def __init__(self, path): + with open(path) as json_file: + notebook = json.load(json_file) + self.name = notebook['cells'][0]['source'][0] + self.description = notebook['cells'][1]['source'][0] + + cp = ContentParser() + self.envs = cp.parse(path)['env_vars'] + diff --git a/src/builder/parser.py 
b/src/builder/parser.py new file mode 100644 index 00000000..8a130d8f --- /dev/null +++ b/src/builder/parser.py @@ -0,0 +1,209 @@ +# +# Copyright 2018-2021 Elyra Authors +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import os +import nbformat +import re + +from traitlets.config import LoggingConfigurable + +from typing import TypeVar, List, Dict + +# Setup forward reference for type hint on return from class factory method. See +# https://stackoverflow.com/questions/39205527/can-you-annotate-return-type-when-value-is-instance-of-cls/39205612#39205612 +F = TypeVar('F', bound='FileReader') + + +class FileReader(LoggingConfigurable): + """ + Base class for parsing a file for resources according to operation type. Subclasses set + their own parser member variable according to their implementation language. + """ + + def __init__(self, filepath: str): + self._filepath = filepath + + @property + def filepath(self): + return self._filepath + + @property + def language(self) -> str: + file_extension = os.path.splitext(self._filepath)[-1] + if file_extension == '.py': + return 'python' + elif file_extension == '.r': + return 'r' + else: + return None + + def read_next_code_chunk(self) -> List[str]: + """ + Implements a generator for lines of code in the specified filepath. Subclasses + may override if explicit line-by-line parsing is not feasible, e.g. with Notebooks. + """ + with open(self._filepath) as f: + for line in f: + yield [line.strip()] + + +class NotebookReader(FileReader): + def __init__(self, filepath: str): + super().__init__(filepath) + + with open(self._filepath) as f: + self._notebook = nbformat.read(f, as_version=4) + self._language = None + + try: + self._language = self._notebook['metadata']['kernelspec']['language'].lower() + + except KeyError: + self.log.warning(f'No language metadata found in {self._filepath}') + pass + + @property + def language(self) -> str: + return self._language + + def read_next_code_chunk(self) -> List[str]: + for cell in self._notebook.cells: + if cell.source and cell.cell_type == "code": + yield cell.source.split('\n') + + +class ScriptParser(): + """ + Base class for parsing individual lines of code. Subclasses implement a search_expressions() + function that returns language-specific regexes to match against code lines. 
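+    Parsing is line based, so matches that span multiple source lines are not detected.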
+ """ + + _comment_char = "#" + + def _get_line_without_comments(self, line): + if self._comment_char in line: + index = line.find(self._comment_char) + line = line[:index] + return line.strip() + + def parse_environment_variables(self, line): + # Parse a line fed from file and match each regex in regex dictionary + line = self._get_line_without_comments(line) + if not line: + return [] + + matches = [] + for key, value in self.search_expressions().items(): + for pattern in value: + regex = re.compile(pattern) + for match in regex.finditer(line): + matches.append((key, match)) + return matches + + +class PythonScriptParser(ScriptParser): + def search_expressions(self) -> Dict[str, List]: + # TODO: add more key:list-of-regex pairs to parse for additional resources + regex_dict = dict() + + # First regex matches envvar assignments of form os.environ["name"] = value w or w/o value provided + # Second regex matches envvar assignments that use os.getenv("name", "value") with ow w/o default provided + # Third regex matches envvar assignments that use os.environ.get("name", "value") with or w/o default provided + # Both name and value are captured if possible + envs = [r"os\.environ\[[\"']([a-zA-Z_]+[A-Za-z0-9_]*)[\"']\](?:\s*=(?:\s*[\"'](.[^\"']*)?[\"'])?)*", + r"os\.getenv\([\"']([a-zA-Z_]+[A-Za-z0-9_]*)[\"'](?:\s*\,\s*[\"'](.[^\"']*)?[\"'])?", + r"os\.environ\.get\([\"']([a-zA-Z_]+[A-Za-z0-9_]*)[\"'](?:\s*\,(?:\s*[\"'](.[^\"']*)?[\"'])?)*"] + regex_dict["env_vars"] = envs + return regex_dict + + +class RScriptParser(ScriptParser): + def search_expressions(self) -> Dict[str, List]: + # TODO: add more key:list-of-regex pairs to parse for additional resources + regex_dict = dict() + + # Tests for matches of the form Sys.setenv("key" = "value") + envs = [r"Sys\.setenv\([\"']*([a-zA-Z_]+[A-Za-z0-9_]*)[\"']*\s*=\s*[\"']*(.[^\"']*)?[\"']*\)", + r"Sys\.getenv\([\"']*([a-zA-Z_]+[A-Za-z0-9_]*)[\"']*\)(.)*"] + regex_dict["env_vars"] = envs + return regex_dict + + +class ContentParser(LoggingConfigurable): + parsers = { + 'python': PythonScriptParser(), + 'r': RScriptParser() + } + + def parse(self, filepath: str) -> dict: + """Returns a model dictionary of all the regex matches for each key in the regex dictionary""" + + properties = {"env_vars": {}, "inputs": [], "outputs": []} + reader = self._get_reader(filepath) + parser = self._get_parser(reader.language) + + if not parser: + return properties + + for chunk in reader.read_next_code_chunk(): + if chunk: + for line in chunk: + matches = parser.parse_environment_variables(line) + for key, match in matches: + if key == "env_vars": + properties[key][match.group(1)] = match.group(2) + else: + properties[key].append(match.group(1)) + + return properties + + def _validate_file(self, filepath: str): + """ + Validate file exists and is file (e.g. 
not a directory) + """ + if not os.path.exists(filepath): + raise FileNotFoundError(f'No such file or directory: {filepath}') + if not os.path.isfile(filepath): + raise IsADirectoryError(f'Is a directory: {filepath}') + + def _get_reader(self, filepath: str): + """ + Find the proper reader based on the file extension + """ + file_extension = os.path.splitext(filepath)[-1] + + self._validate_file(filepath) + + if file_extension == '.ipynb': + return NotebookReader(filepath) + elif file_extension in ['.py', '.r']: + return FileReader(filepath) + else: + raise ValueError(f'File type {file_extension} is not supported.') + + def _get_parser(self, language: str): + """ + Find the proper parser based on content language + """ + parser = None + if language: + parser = self.parsers.get(language) + + if not parser: + self.log.warning(f'Content parser for {language} is not available.') + pass + + return parser diff --git a/src/builder/test_notebook.py b/src/builder/test_notebook.py new file mode 100644 index 00000000..78d9648b --- /dev/null +++ b/src/builder/test_notebook.py @@ -0,0 +1,6 @@ +from notebook import Notebook + +nb = Notebook('../../test/notebooks/a_notebook.ipynb') +print(nb.name) +print(nb.description) +print(nb.envs) \ No newline at end of file diff --git a/src/convert/a_notebook.ipynb b/test/notebooks/a_notebook.ipynb similarity index 98% rename from src/convert/a_notebook.ipynb rename to test/notebooks/a_notebook.ipynb index 9aca72b0..ddf948bf 100644 --- a/src/convert/a_notebook.ipynb +++ b/test/notebooks/a_notebook.ipynb @@ -1,5 +1,12 @@ { "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "input_hmp" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -197,4 +204,4 @@ }, "nbformat": 4, "nbformat_minor": 4 -} +} \ No newline at end of file From d5e7244f8bc6ec1428520afd56b8421ab1b70839 Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Fri, 25 Jun 2021 11:05:44 +0200 Subject: [PATCH 008/177] start parsing for requirements --- src/builder/notebook.py | 18 +++++++++++++++--- src/builder/test_notebook.py | 3 ++- test/notebooks/a_notebook.ipynb | 4 ++-- 3 files changed, 19 insertions(+), 6 deletions(-) diff --git a/src/builder/notebook.py b/src/builder/notebook.py index 57f218d6..1a448c04 100644 --- a/src/builder/notebook.py +++ b/src/builder/notebook.py @@ -1,13 +1,25 @@ import json +import re from parser import ContentParser class Notebook(): def __init__(self, path): with open(path) as json_file: - notebook = json.load(json_file) - self.name = notebook['cells'][0]['source'][0] - self.description = notebook['cells'][1]['source'][0] + self.notebook = json.load(json_file) + self.name = self.notebook['cells'][0]['source'][0] + self.description = self.notebook['cells'][1]['source'][0] cp = ContentParser() self.envs = cp.parse(path)['env_vars'] + self.requirements = self._get_requirements() + + def _get_requirements(self): + for cell in self.notebook['cells']: + cell_content = cell['source'][0] + pattern = r"(![ ]*pip[ ]*install[ ]*)([A-Za-z=0-9.:]*)" + + print(re.findall(pattern,cell_content)) # TODO romeo multiple matches not working + + + diff --git a/src/builder/test_notebook.py b/src/builder/test_notebook.py index 78d9648b..82bdfff8 100644 --- a/src/builder/test_notebook.py +++ b/src/builder/test_notebook.py @@ -3,4 +3,5 @@ nb = Notebook('../../test/notebooks/a_notebook.ipynb') print(nb.name) print(nb.description) -print(nb.envs) \ No newline at end of file +print(nb.envs) +print(nb.requirements) \ No newline at end of file diff --git 
a/test/notebooks/a_notebook.ipynb b/test/notebooks/a_notebook.ipynb index ddf948bf..9afcb28c 100644 --- a/test/notebooks/a_notebook.ipynb +++ b/test/notebooks/a_notebook.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "input_hmp" + "# input_hmp" ] }, { @@ -20,7 +20,7 @@ "metadata": {}, "outputs": [], "source": [ - "!pip install pyspark==2.4.4" + "!pip install pyspark1==2.4.4 pyspark2==2.4.4 pyspark3==2.4.4 pyspark4 pyspark5" ] }, { From 244cc1c0e064afdee286cbb5537bc20f07f821a2 Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Fri, 25 Jun 2021 17:45:26 +0200 Subject: [PATCH 009/177] start implementing kfp component writer --- src/builder/base_component_spec.py | 18 ++++++++++++++++++ src/builder/kfp_component.py | 6 ++++++ src/builder/notebook.py | 17 ++++++++++++++--- src/builder/test_notebook.py | 25 +++++++++++++++++++++---- test/notebooks/a_notebook.ipynb | 11 +++++++++-- 5 files changed, 68 insertions(+), 9 deletions(-) create mode 100644 src/builder/base_component_spec.py create mode 100644 src/builder/kfp_component.py diff --git a/src/builder/base_component_spec.py b/src/builder/base_component_spec.py new file mode 100644 index 00000000..187e747c --- /dev/null +++ b/src/builder/base_component_spec.py @@ -0,0 +1,18 @@ +class BaseComponentSpec(): + def get_name() -> str: + raise Exception("Not implemented") + + def get_description() -> str: + raise Exception("Not implemented") + + def get_inputs() -> List[Dict[str,str]]: + raise Exception("Not implemented") + + def get_outputs() -> List[Dict[str,str]]: + raise Exception("Not implemented") + + def get_container_uri() -> str: + raise Exception("Not implemented") + + def get_requirements() -> List[str]: + raise Exception("Not implemented") \ No newline at end of file diff --git a/src/builder/kfp_component.py b/src/builder/kfp_component.py new file mode 100644 index 00000000..76a1c1d0 --- /dev/null +++ b/src/builder/kfp_component.py @@ -0,0 +1,6 @@ +from base_component_spec import BaseComponentSpec +from notebook import Notebook + +class KfpComponent(BaseComponentSpec): + def __init__(self, noteboook : Notebook): + noteboook. 
diff --git a/src/builder/notebook.py b/src/builder/notebook.py index 1a448c04..ad53cbce 100644 --- a/src/builder/notebook.py +++ b/src/builder/notebook.py @@ -6,7 +6,7 @@ class Notebook(): def __init__(self, path): with open(path) as json_file: self.notebook = json.load(json_file) - self.name = self.notebook['cells'][0]['source'][0] + self.name = self.notebook['cells'][0]['source'][0].replace('#', '').strip() self.description = self.notebook['cells'][1]['source'][0] cp = ContentParser() @@ -19,7 +19,18 @@ def _get_requirements(self): cell_content = cell['source'][0] pattern = r"(![ ]*pip[ ]*install[ ]*)([A-Za-z=0-9.:]*)" - print(re.findall(pattern,cell_content)) # TODO romeo multiple matches not working - + #print(re.findall(pattern,cell_content)) # TODO romeo multiple matches not working + + def get_name(self): + return self.name + + def get_description(self): + return self.description + + def get_inputs(self): + return { key:value for (key,value) in self.envs.items() if not key.startswith('output_') } + + def get_outputs(self): + return { key:value for (key,value) in self.envs.items() if key.startswith('output_') } diff --git a/src/builder/test_notebook.py b/src/builder/test_notebook.py index 82bdfff8..cd285bd2 100644 --- a/src/builder/test_notebook.py +++ b/src/builder/test_notebook.py @@ -1,7 +1,24 @@ from notebook import Notebook nb = Notebook('../../test/notebooks/a_notebook.ipynb') -print(nb.name) -print(nb.description) -print(nb.envs) -print(nb.requirements) \ No newline at end of file + +assert 'input_hmp' == nb.get_name() +assert 'This notebook pulls the HMP accelerometer sensor data classification data set' == nb.get_description() +inputs = nb.get_inputs() +assert 'data_csv' in inputs +assert 'master' in inputs +assert 'master2' in inputs +assert 'data_dir' in inputs + +assert 'data.csv' == inputs['data_csv'] +assert 'local[*]' == inputs['master'] +assert '../../data/' == inputs['data_dir'] +outputs = nb.get_outputs() +assert not 'output_data' in inputs +assert 'output_data' in outputs +assert 'output_data2' in outputs +assert '/tmp/output.csv' == outputs['output_data'] +assert 'data_dir' not in outputs + + + diff --git a/test/notebooks/a_notebook.ipynb b/test/notebooks/a_notebook.ipynb index 9afcb28c..21cc7c95 100644 --- a/test/notebooks/a_notebook.ipynb +++ b/test/notebooks/a_notebook.ipynb @@ -20,7 +20,10 @@ "metadata": {}, "outputs": [], "source": [ - "!pip install pyspark1==2.4.4 pyspark2==2.4.4 pyspark3==2.4.4 pyspark4 pyspark5" + "!pip install pyspark1==2.4.4 pyspark2==2.4.4 pyspark3==2.4.4 pyspark4 pyspark5\n", + "\n", + "\n", + "\n" ] }, { @@ -57,7 +60,11 @@ "source": [ "data_csv = os.environ.get('data_csv', 'data.csv')\n", "master = os.environ.get('master', \"local[*]\")\n", - "data_dir = os.environ.get('data_dir', '../../data/')" + "master2 = os.environ.get('master2')\n", + "\n", + "data_dir = os.environ.get('data_dir', '../../data/')\n", + "output = os.environ.get('output_data','/tmp/output.csv')\n", + "output2 = os.environ.get('output_data2')" ] }, { From af46cc18f0163b45992d977142a1b08834fb271d Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Fri, 25 Jun 2021 18:22:18 +0200 Subject: [PATCH 010/177] finalize kfp_component --- src/builder/base_component_spec.py | 6 +++--- src/builder/kfp_component.py | 20 +++++++++++++++++++- src/builder/test_kfp_component.py | 25 +++++++++++++++++++++++++ 3 files changed, 47 insertions(+), 4 deletions(-) create mode 100644 src/builder/test_kfp_component.py diff --git a/src/builder/base_component_spec.py 
b/src/builder/base_component_spec.py index 187e747c..45ae08c5 100644 --- a/src/builder/base_component_spec.py +++ b/src/builder/base_component_spec.py @@ -5,14 +5,14 @@ def get_name() -> str: def get_description() -> str: raise Exception("Not implemented") - def get_inputs() -> List[Dict[str,str]]: + def get_inputs(): raise Exception("Not implemented") - def get_outputs() -> List[Dict[str,str]]: + def get_outputs(): raise Exception("Not implemented") def get_container_uri() -> str: raise Exception("Not implemented") - def get_requirements() -> List[str]: + def get_requirements(): raise Exception("Not implemented") \ No newline at end of file diff --git a/src/builder/kfp_component.py b/src/builder/kfp_component.py index 76a1c1d0..7c8f46f5 100644 --- a/src/builder/kfp_component.py +++ b/src/builder/kfp_component.py @@ -3,4 +3,22 @@ class KfpComponent(BaseComponentSpec): def __init__(self, noteboook : Notebook): - noteboook. + self.name = noteboook.get_name() + self.description = noteboook.get_description() + self.inputs = noteboook.get_inputs() + self.outputs = noteboook.get_outputs() + + def get_name(self) -> str: + return self.name + + def get_description(self) -> str: + return self.description + + def get_container_uri(self) -> str: + return 'continuumio/anaconda3:2020.07' + + def get_inputs(self): + return self.inputs + + def get_outputs(self): + return self.outputs diff --git a/src/builder/test_kfp_component.py b/src/builder/test_kfp_component.py new file mode 100644 index 00000000..c7de904a --- /dev/null +++ b/src/builder/test_kfp_component.py @@ -0,0 +1,25 @@ +from notebook import Notebook +from kfp_component import KfpComponent + +nb = Notebook('../../test/notebooks/a_notebook.ipynb') +kfp = KfpComponent(nb) +assert 'input_hmp' == kfp.get_name() +assert 'This notebook pulls the HMP accelerometer sensor data classification data set' == kfp.get_description() +inputs = kfp.get_inputs() +assert 'data_csv' in inputs +assert 'master' in inputs +assert 'master2' in inputs +assert 'data_dir' in inputs +assert 'continuumio/anaconda3:2020.07' == kfp.get_container_uri() +assert 'data.csv' == inputs['data_csv'] +assert 'local[*]' == inputs['master'] +assert '../../data/' == inputs['data_dir'] +outputs = kfp.get_outputs() +assert not 'output_data' in inputs +assert 'output_data' in outputs +assert 'output_data2' in outputs +assert '/tmp/output.csv' == outputs['output_data'] +assert 'data_dir' not in outputs + + + From 569cd5e0a53fb596b09b33dbba8b52a7877abc54 Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Mon, 5 Jul 2021 15:16:27 +0200 Subject: [PATCH 011/177] add test trigger script --- run_tests.sh | 4 ++++ src/builder/kfp_component_builder.py | 5 +++++ 2 files changed, 9 insertions(+) create mode 100755 run_tests.sh create mode 100644 src/builder/kfp_component_builder.py diff --git a/run_tests.sh b/run_tests.sh new file mode 100755 index 00000000..6ef8d21d --- /dev/null +++ b/run_tests.sh @@ -0,0 +1,4 @@ +source ./venv/bin/activate +cd src/builder +python ./test_kfp_component.py +python ./test_notebook.py \ No newline at end of file diff --git a/src/builder/kfp_component_builder.py b/src/builder/kfp_component_builder.py new file mode 100644 index 00000000..723d2892 --- /dev/null +++ b/src/builder/kfp_component_builder.py @@ -0,0 +1,5 @@ +from kfp_component import KfpComponent +from notebook import Notebook + +nb = Notebook('../../test/notebooks/a_notebook.ipynb') +kfp = KfpComponent(nb) \ No newline at end of file From a9585625573ad01cbc78e2d9c734e334784cd114 Mon Sep 17 00:00:00 2001 From: 
Romeo Kienzler Date: Fri, 30 Jul 2021 11:16:43 +0200 Subject: [PATCH 012/177] finish kfp component builder --- run_tests.sh | 3 +- src/builder/kfp_component_builder.py | 49 +++++++++++++++- src/builder/test_kfp_component_builder.py | 7 +++ src/mlx/publish.py | 4 +- test_component.yaml | 70 +++++++++++++++++++++++ 5 files changed, 128 insertions(+), 5 deletions(-) create mode 100644 src/builder/test_kfp_component_builder.py create mode 100644 test_component.yaml diff --git a/run_tests.sh b/run_tests.sh index 6ef8d21d..4dba9fff 100755 --- a/run_tests.sh +++ b/run_tests.sh @@ -1,4 +1,5 @@ source ./venv/bin/activate cd src/builder python ./test_kfp_component.py -python ./test_notebook.py \ No newline at end of file +python ./test_notebook.py +python ./test_kfp_component_builder.py \ No newline at end of file diff --git a/src/builder/kfp_component_builder.py b/src/builder/kfp_component_builder.py index 723d2892..8f510eb7 100644 --- a/src/builder/kfp_component_builder.py +++ b/src/builder/kfp_component_builder.py @@ -1,5 +1,50 @@ from kfp_component import KfpComponent from notebook import Notebook +from string import Template +from io import StringIO -nb = Notebook('../../test/notebooks/a_notebook.ipynb') -kfp = KfpComponent(nb) \ No newline at end of file + +class KfpComponentBuilder(): + def __init__(self, notebook_url : str): + nb = Notebook('../../test/notebooks/a_notebook.ipynb') + self.kfp = KfpComponent(nb) + + def get_inputs(self): + with StringIO() as inputs_str: + for input in self.kfp.get_inputs(): + t = Template("- {name: $name, type: String, description: 'not yet supported'}") + print(t.substitute(name=input), file=inputs_str) + return inputs_str.getvalue() + + def get_outputs(self): + with StringIO() as outputs_str: + for output in self.kfp.get_outputs(): + t = Template("- {name: $name, type: String, description: 'not yet supported'}") + print(t.substitute(name=output), file=outputs_str) + return outputs_str.getvalue() + + def get_yaml(self): + t = Template(''' +name: $name +description: $description + +inputs: +$inputs + +outputs: +$outputs + +implementation: + container: + image: $container_uri + command: [ + seq 100 + ] + ''') + return t.substitute( + name=self.kfp.get_name(), + description=self.kfp.get_description(), + inputs=self.get_inputs(), + outputs=self.get_outputs(), + container_uri=self.kfp.get_container_uri() + ) diff --git a/src/builder/test_kfp_component_builder.py b/src/builder/test_kfp_component_builder.py new file mode 100644 index 00000000..65614ed5 --- /dev/null +++ b/src/builder/test_kfp_component_builder.py @@ -0,0 +1,7 @@ +from notebook import Notebook +from kfp_component import KfpComponent +from kfp_component_builder import KfpComponentBuilder + + +kfpcb = KfpComponentBuilder('../../test/notebooks/a_notebook.ipynb') +print(kfpcb.get_yaml()) \ No newline at end of file diff --git a/src/mlx/publish.py b/src/mlx/publish.py index 464e32a7..06569b96 100644 --- a/src/mlx/publish.py +++ b/src/mlx/publish.py @@ -76,8 +76,8 @@ def upload_component_file(component_id, file_path): def main(): - component_file = create_tar_file('component.yaml') - upload_component_file('test3',component_file) + component_file = create_tar_file('../../test_component.yaml') + upload_component_file('test4',component_file) if __name__ == '__main__': main() \ No newline at end of file diff --git a/test_component.yaml b/test_component.yaml new file mode 100644 index 00000000..e682e21e --- /dev/null +++ b/test_component.yaml @@ -0,0 +1,70 @@ +name: input_hmp +description: This notebook pulls 
the HMP accelerometer sensor data classification data set + +inputs: +- {name: data_csv, type: String, description: 'not yet supported'} +- {name: master, type: String, description: 'not yet supported'} +- {name: master2, type: String, description: 'not yet supported'} +- {name: data_dir, type: String, description: 'not yet supported'} + + +outputs: +- {name: output_data, type: String, description: 'not yet supported'} +- {name: output_data2, type: String, description: 'not yet supported'} + + +implementation: +container: + image: continuumio/anaconda3:2020.07 + command + +(venv) romeokienzler:c3$ ^C +(venv) romeokienzler:c3$ ^C +(venv) romeokienzler:c3$ ./run_tests.sh + +name: input_hmp +description: This notebook pulls the HMP accelerometer sensor data classification data set + +inputs: +- {name: data_csv, type: String, description: 'not yet supported'} +- {name: master, type: String, description: 'not yet supported'} +- {name: master2, type: String, description: 'not yet supported'} +- {name: data_dir, type: String, description: 'not yet supported'} + + +outputs: +- {name: output_data, type: String, description: 'not yet supported'} +- {name: output_data2, type: String, description: 'not yet supported'} + + +implementation: +container: + image: continuumio/anaconda3:2020.07 + command: [ + seq 100 + ] + +(venv) romeokienzler:c3$ ./run_tests.sh + +name: input_hmp +description: This notebook pulls the HMP accelerometer sensor data classification data set + +inputs: +- {name: data_csv, type: String, description: 'not yet supported'} +- {name: master, type: String, description: 'not yet supported'} +- {name: master2, type: String, description: 'not yet supported'} +- {name: data_dir, type: String, description: 'not yet supported'} + + +outputs: +- {name: output_data, type: String, description: 'not yet supported'} +- {name: output_data2, type: String, description: 'not yet supported'} + + +implementation: + container: + image: continuumio/anaconda3:2020.07 + command: [ + seq 100 + ] + \ No newline at end of file From 96112fab10c43f387ededf6abfadcc5393b40083 Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Mon, 2 Aug 2021 17:08:56 +0200 Subject: [PATCH 013/177] fix test_component.yaml --- test_component.yaml | 51 +-------------------------------------------- 1 file changed, 1 insertion(+), 50 deletions(-) diff --git a/test_component.yaml b/test_component.yaml index e682e21e..2b5a6965 100644 --- a/test_component.yaml +++ b/test_component.yaml @@ -13,58 +13,9 @@ outputs: - {name: output_data2, type: String, description: 'not yet supported'} -implementation: -container: - image: continuumio/anaconda3:2020.07 - command - -(venv) romeokienzler:c3$ ^C -(venv) romeokienzler:c3$ ^C -(venv) romeokienzler:c3$ ./run_tests.sh - -name: input_hmp -description: This notebook pulls the HMP accelerometer sensor data classification data set - -inputs: -- {name: data_csv, type: String, description: 'not yet supported'} -- {name: master, type: String, description: 'not yet supported'} -- {name: master2, type: String, description: 'not yet supported'} -- {name: data_dir, type: String, description: 'not yet supported'} - - -outputs: -- {name: output_data, type: String, description: 'not yet supported'} -- {name: output_data2, type: String, description: 'not yet supported'} - - implementation: container: image: continuumio/anaconda3:2020.07 command: [ seq 100 - ] - -(venv) romeokienzler:c3$ ./run_tests.sh - -name: input_hmp -description: This notebook pulls the HMP accelerometer sensor data classification data 
set - -inputs: -- {name: data_csv, type: String, description: 'not yet supported'} -- {name: master, type: String, description: 'not yet supported'} -- {name: master2, type: String, description: 'not yet supported'} -- {name: data_dir, type: String, description: 'not yet supported'} - - -outputs: -- {name: output_data, type: String, description: 'not yet supported'} -- {name: output_data2, type: String, description: 'not yet supported'} - - -implementation: - container: - image: continuumio/anaconda3:2020.07 - command: [ - seq 100 - ] - \ No newline at end of file + ] \ No newline at end of file From db7f3edd2299c54d8de6660fe86fbd9fca0f790c Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Wed, 11 Aug 2021 13:15:11 +0200 Subject: [PATCH 014/177] find and work on env lines --- src/builder/kfp_component_builder.py | 2 +- src/builder/notebook.py | 18 ++- src/builder/test_kfp_component_builder.py | 2 +- test/notebooks/input-postgresql.ipynb | 161 ++++++++++++++++++++++ test_component.yaml | 10 +- 5 files changed, 183 insertions(+), 10 deletions(-) create mode 100644 test/notebooks/input-postgresql.ipynb diff --git a/src/builder/kfp_component_builder.py b/src/builder/kfp_component_builder.py index 8f510eb7..6e80a151 100644 --- a/src/builder/kfp_component_builder.py +++ b/src/builder/kfp_component_builder.py @@ -6,7 +6,7 @@ class KfpComponentBuilder(): def __init__(self, notebook_url : str): - nb = Notebook('../../test/notebooks/a_notebook.ipynb') + nb = Notebook(notebook_url) self.kfp = KfpComponent(nb) def get_inputs(self): diff --git a/src/builder/notebook.py b/src/builder/notebook.py index ad53cbce..e57898d7 100644 --- a/src/builder/notebook.py +++ b/src/builder/notebook.py @@ -8,11 +8,23 @@ def __init__(self, path): self.notebook = json.load(json_file) self.name = self.notebook['cells'][0]['source'][0].replace('#', '').strip() self.description = self.notebook['cells'][1]['source'][0] + self.envs = self._get_env_vars(path) + self.requirements = self._get_requirements() + + def _get_env_vars(self, path): + cp = ContentParser() + env_names = cp.parse(path)['env_vars'] + for env_name in env_names: + comment_line = str() + for line in self.notebook['cells'][4]['source']: + if re.search("[\"']" + env_name + "[\"']", line): + print(env_name + ':') + print(comment_line) + print(line) + comment_line = line + return env_names - cp = ContentParser() - self.envs = cp.parse(path)['env_vars'] - self.requirements = self._get_requirements() def _get_requirements(self): for cell in self.notebook['cells']: diff --git a/src/builder/test_kfp_component_builder.py b/src/builder/test_kfp_component_builder.py index 65614ed5..2a0355b8 100644 --- a/src/builder/test_kfp_component_builder.py +++ b/src/builder/test_kfp_component_builder.py @@ -3,5 +3,5 @@ from kfp_component_builder import KfpComponentBuilder -kfpcb = KfpComponentBuilder('../../test/notebooks/a_notebook.ipynb') +kfpcb = KfpComponentBuilder('../../test/notebooks/input-postgresql.ipynb') print(kfpcb.get_yaml()) \ No newline at end of file diff --git a/test/notebooks/input-postgresql.ipynb b/test/notebooks/input-postgresql.ipynb new file mode 100644 index 00000000..97108ff3 --- /dev/null +++ b/test/notebooks/input-postgresql.ipynb @@ -0,0 +1,161 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Input Postgresql" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This notebook pulls data from a postgresql database as CSV on a given SQL statement" + ] + }, + { + "cell_type": "code", + 
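Patch 014's _get_env_vars builds on a convention visible in the notebook below: every parameter line reads its value via os.environ.get, and the comment line directly above it describes the parameter. The parser therefore only has to remember the previous line while scanning the cell. A minimal sketch of that pairing, with the cell source hard-coded instead of coming from ContentParser:

import re

cell_source = [
    "# hostname of database server\n",
    "host = os.environ.get('host')\n",
]

comment_line = ''
for line in cell_source:
    if re.search("[\"']host[\"']", line):
        # The remembered previous line is the human-readable description.
        print('host:', comment_line.replace('#', '').strip())
    comment_line = line
# -> host: hostname of database server
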
"execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install psycopg2-binary==2.9.1 pandas==1.3.1" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import pandas as pd\n", + "import psycopg2\n", + "import re\n", + "import sys" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# path and file name for output\n", + "data_csv = os.environ.get('output_data_csv', 'data.csv')\n", + "\n", + "# hostname of database server\n", + "host = os.environ.get('host')\n", + "\n", + "# database name\n", + "database = os.environ.get('database')\n", + "\n", + "# db user\n", + "user = os.environ.get('user')\n", + "\n", + "# db password\n", + "password = os.environ.get('password')\n", + "\n", + "# db port\n", + "port = int(os.environ.get('port', 5432))\n", + "\n", + "# sql query statement to be executed\n", + "sql = os.environ.get('sql')\n", + "\n", + "# temporal data storage for local execution\n", + "data_dir = os.environ.get('data_dir', '../../data/')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# override parameters received from a potential call using %run magic\n", + "parameters = list(\n", + " map(\n", + " lambda s: re.sub('$', '\"', s),\n", + " map(\n", + " lambda s: s.replace('=', '=\"'),\n", + " filter(\n", + " lambda s: s.find('=') > -1,\n", + " sys.argv\n", + " )\n", + " )\n", + " )\n", + ")\n", + "\n", + "for parameter in parameters:\n", + " exec(parameter)\n", + "\n", + "# cast parameters to appropriate type\n", + "port = int(port)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "conn = psycopg2.connect(\n", + " host=host,\n", + " database=database,\n", + " user=user,\n", + " password=password,\n", + " port=port\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "d = pd.read_sql_query(sql, conn)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "conn.close()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "d.to_csv(output_data_csv, index=False)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.6" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/test_component.yaml b/test_component.yaml index 2b5a6965..e95450ba 100644 --- a/test_component.yaml +++ b/test_component.yaml @@ -14,8 +14,8 @@ outputs: implementation: -container: - image: continuumio/anaconda3:2020.07 - command: [ - seq 100 - ] \ No newline at end of file + container: + image: continuumio/anaconda3:2020.07 + command: [ + seq 100 + ] \ No newline at end of file From 6ff1149db694ac0a5427b00caada1aad8f6deeff Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Thu, 12 Aug 2021 10:36:35 +0200 Subject: [PATCH 015/177] add support for parsing description and data type --- src/builder/notebook.py | 15 +++++++++++---- 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/src/builder/notebook.py 
b/src/builder/notebook.py index e57898d7..fc0e251c 100644 --- a/src/builder/notebook.py +++ b/src/builder/notebook.py @@ -14,15 +14,22 @@ def __init__(self, path): def _get_env_vars(self, path): cp = ContentParser() env_names = cp.parse(path)['env_vars'] + return_value = dict() for env_name in env_names: comment_line = str() for line in self.notebook['cells'][4]['source']: if re.search("[\"']" + env_name + "[\"']", line): - print(env_name + ':') - print(comment_line) - print(line) + assert '#' in comment_line, "comment line didn't contain #" + if "int(" in line: + type = 'Integer' + elif "float(" in line: + type = 'Float' + else: + type = 'String' + + return_value[env_name]=(comment_line.replace('#', '').strip(),type,None) comment_line = line - return env_names + return return_value From 2a4ef31d5d534fc3fbd51b15b222110f209696e1 Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Fri, 13 Aug 2021 20:47:04 +0200 Subject: [PATCH 016/177] start working on command --- component.yaml | 39 +++++++++++++--------------- src/builder/kfp_component_builder.py | 15 ++++++----- test_component.yaml | 29 +++++++++++---------- 3 files changed, 42 insertions(+), 41 deletions(-) diff --git a/component.yaml b/component.yaml index 367e3454..7c5ffa37 100644 --- a/component.yaml +++ b/component.yaml @@ -1,27 +1,24 @@ -name: Get Lines2 -description: Gets the specified number of lines from the input file. +name: Input Postgresql +description: This notebook pulls data from a postgresql database as CSV on a given SQL statement inputs: -- {name: Input 1, type: String, description: 'Data for input 1'} -- {name: Parameter 1, type: Integer, default: '100', description: 'Number of lines to copy'} +- {name: host, type: String, description: 'hostname of database server'} +- {name: database, type: String, description: 'database name'} +- {name: user, type: String, description: 'db user'} +- {name: password, type: String, description: 'db password'} +- {name: port, type: Integer, description: 'db port'} +- {name: sql, type: String, description: 'sql query statement to be executed'} +- {name: data_dir, type: String, description: 'temporal data storage for local execution'} + outputs: -- {name: Output 1, type: String, description: 'Output 1 data.'} +- {name: output_data_csv, type: String, description: 'path and file name for output'} + implementation: - container: - image: romeokienzler/c3:latest - # command is a list of strings (command-line arguments). - # The YAML language has two syntaxes for lists and you can use either of them. - # Here we use the "flow syntax" - comma-separated strings inside square brackets. 
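Patch 015 turns the remembered comment into a description and derives the KFP type with a plain substring heuristic: a cast through int() maps to Integer, float() to Float, and everything else stays String. A minimal sketch of that heuristic, with example lines borrowed from the postgresql notebook:

def infer_kfp_type(line):
    # Assumes the cast happens on the same line as the os.environ.get call,
    # which is exactly what the substring checks in notebook.py rely on.
    if 'int(' in line:
        return 'Integer'
    elif 'float(' in line:
        return 'Float'
    return 'String'

assert infer_kfp_type("port = int(os.environ.get('port', 5432))") == 'Integer'
assert infer_kfp_type("sql = os.environ.get('sql')") == 'String'
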
- command: [ - python3, - # Path of the program inside the container - /pipelines/component/src/program.py, - --input1-path, - {inputPath: Input 1}, - --param1, - {inputValue: Parameter 1}, - --output1-path, - {outputPath: Output 1}, - ] + container: + image: continuumio/anaconda3:2020.07 + command: [ + wget https://raw.githubusercontent.com/IBM/claimed/master/component-library/input/input-postgresql.ipynb &&, + ipython ./input-postgresql.ipynb data_dir=., + ] \ No newline at end of file diff --git a/src/builder/kfp_component_builder.py b/src/builder/kfp_component_builder.py index 6e80a151..54ac85a2 100644 --- a/src/builder/kfp_component_builder.py +++ b/src/builder/kfp_component_builder.py @@ -11,16 +11,16 @@ def __init__(self, notebook_url : str): def get_inputs(self): with StringIO() as inputs_str: - for input in self.kfp.get_inputs(): - t = Template("- {name: $name, type: String, description: 'not yet supported'}") - print(t.substitute(name=input), file=inputs_str) + for input_key, input_value in self.kfp.get_inputs().items(): + t = Template("- {name: $name, type: $type, description: '$description'}") + print(t.substitute(name=input_key, type=input_value[1], description=input_value[0]), file=inputs_str) return inputs_str.getvalue() def get_outputs(self): with StringIO() as outputs_str: - for output in self.kfp.get_outputs(): - t = Template("- {name: $name, type: String, description: 'not yet supported'}") - print(t.substitute(name=output), file=outputs_str) + for output_key, output_value in self.kfp.get_outputs().items(): + t = Template("- {name: $name, type: $type, description: '$description'}") + print(t.substitute(name=output_key, type=output_value[1], description=output_value[0]), file=outputs_str) return outputs_str.getvalue() def get_yaml(self): @@ -38,7 +38,8 @@ def get_yaml(self): container: image: $container_uri command: [ - seq 100 + wget https://raw.githubusercontent.com/IBM/claimed/master/component-library/input/input-postgresql.ipynb &&, + ipython ./input-postgresql.ipynb data_dir=., ] ''') return t.substitute( diff --git a/test_component.yaml b/test_component.yaml index e95450ba..7c5ffa37 100644 --- a/test_component.yaml +++ b/test_component.yaml @@ -1,21 +1,24 @@ -name: input_hmp -description: This notebook pulls the HMP accelerometer sensor data classification data set +name: Input Postgresql +description: This notebook pulls data from a postgresql database as CSV on a given SQL statement inputs: -- {name: data_csv, type: String, description: 'not yet supported'} -- {name: master, type: String, description: 'not yet supported'} -- {name: master2, type: String, description: 'not yet supported'} -- {name: data_dir, type: String, description: 'not yet supported'} +- {name: host, type: String, description: 'hostname of database server'} +- {name: database, type: String, description: 'database name'} +- {name: user, type: String, description: 'db user'} +- {name: password, type: String, description: 'db password'} +- {name: port, type: Integer, description: 'db port'} +- {name: sql, type: String, description: 'sql query statement to be executed'} +- {name: data_dir, type: String, description: 'temporal data storage for local execution'} outputs: -- {name: output_data, type: String, description: 'not yet supported'} -- {name: output_data2, type: String, description: 'not yet supported'} +- {name: output_data_csv, type: String, description: 'path and file name for output'} implementation: - container: - image: continuumio/anaconda3:2020.07 - command: [ - seq 100 - ] \ No 
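The `ipython ./input-postgresql.ipynb data_dir=.` invocation generated above only works because of the override cell in the notebook: every key=value token in sys.argv is rewritten into a quoted assignment and exec'd. A minimal sketch of that transformation, with argv hard-coded in place of a real command line:

import re

argv = ['./input-postgresql.ipynb', 'port=5432', 'host=localhost']

parameters = [
    re.sub('$', '"', s.replace('=', '="'))   # port=5432 -> port="5432"
    for s in argv
    if '=' in s
]
for parameter in parameters:
    exec(parameter)            # defines port and host as strings
port = int(port)               # cast afterwards, as the notebook does
print(port, host)              # -> 5432 localhost
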
newline at end of file + container: + image: continuumio/anaconda3:2020.07 + command: [ + wget https://raw.githubusercontent.com/IBM/claimed/master/component-library/input/input-postgresql.ipynb &&, + ipython ./input-postgresql.ipynb data_dir=., + ] \ No newline at end of file From 6d2a7e10b1c0d505aec66cf5b16ef917e3a96f25 Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Sat, 14 Aug 2021 19:47:28 +0200 Subject: [PATCH 017/177] add dummy parameters --- component.yaml | 18 +++++++++++++++++- 1 file changed, 17 insertions(+), 1 deletion(-) diff --git a/component.yaml b/component.yaml index 7c5ffa37..083bf6f3 100644 --- a/component.yaml +++ b/component.yaml @@ -20,5 +20,21 @@ implementation: image: continuumio/anaconda3:2020.07 command: [ wget https://raw.githubusercontent.com/IBM/claimed/master/component-library/input/input-postgresql.ipynb &&, - ipython ./input-postgresql.ipynb data_dir=., + ipython ./input-postgresql.ipynb, + host=., + {host}, + database=., + {database}, + user=., + {user}, + password=., + {password}, + port=., + {port}, + sql=., + {sql}, + data_dir=., + {data_dir}, + output_data_csv= + {output_data_csv} ] \ No newline at end of file From 22bc2229a6bf23b1fdcb311727608d61ab04485a Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Mon, 16 Aug 2021 08:26:50 +0200 Subject: [PATCH 018/177] fix missing comma --- component.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/component.yaml b/component.yaml index 083bf6f3..264d10b1 100644 --- a/component.yaml +++ b/component.yaml @@ -35,6 +35,6 @@ implementation: {sql}, data_dir=., {data_dir}, - output_data_csv= + output_data_csv=, {output_data_csv} ] \ No newline at end of file From 771b10b1cfbdffa803e39830460fa32faefe4e3f Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Mon, 16 Aug 2021 08:38:06 +0200 Subject: [PATCH 019/177] fix --- component.yaml | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/component.yaml b/component.yaml index 264d10b1..611bd6c3 100644 --- a/component.yaml +++ b/component.yaml @@ -21,20 +21,20 @@ implementation: command: [ wget https://raw.githubusercontent.com/IBM/claimed/master/component-library/input/input-postgresql.ipynb &&, ipython ./input-postgresql.ipynb, - host=., + host=, {host}, - database=., + database=, {database}, - user=., + user=, {user}, - password=., + password=, {password}, - port=., + port=, {port}, - sql=., + sql=, {sql}, - data_dir=., + data_dir=, {data_dir}, output_data_csv=, - {output_data_csv} + {output_data_csv}, ] \ No newline at end of file From 2c358af853c11d4d891be18938576128fcfd1816 Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Mon, 16 Aug 2021 09:08:52 +0200 Subject: [PATCH 020/177] fix --- component.yaml | 31 +++++++++++-------------------- 1 file changed, 11 insertions(+), 20 deletions(-) diff --git a/component.yaml b/component.yaml index 611bd6c3..5f5ac7c2 100644 --- a/component.yaml +++ b/component.yaml @@ -18,23 +18,14 @@ outputs: implementation: container: image: continuumio/anaconda3:2020.07 - command: [ - wget https://raw.githubusercontent.com/IBM/claimed/master/component-library/input/input-postgresql.ipynb &&, - ipython ./input-postgresql.ipynb, - host=, - {host}, - database=, - {database}, - user=, - {user}, - password=, - {password}, - port=, - {port}, - sql=, - {sql}, - data_dir=, - {data_dir}, - output_data_csv=, - {output_data_csv}, - ] \ No newline at end of file + command: [ + python3, + # Path of the program inside the container + /pipelines/component/src/program.py, + --input1-path, + 
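The comma churn in patches 017 through 019 follows from how KFP treats the command list: every list element becomes its own argv token, so `host=` and the substituted value of `{host}` reach the container as two separate arguments rather than one `host=value` string. A minimal sketch of what the entrypoint sees under that layout, with placeholder values invented for illustration:

# argv roughly as the patch-019 style command would deliver it:
argv = ['ipython', './input-postgresql.ipynb',
        'host=', 'db.example.com',     # two tokens, not 'host=db.example.com'
        'port=', '5432']

# Tokens carrying a complete key=value pair:
pairs = [a for a in argv if '=' in a and not a.endswith('=')]
print(pairs)   # -> [] : the values arrived detached from their keys
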
{inputPath: Input 1}, + --param1, + {inputValue: Parameter 1}, + --output1-path, + {outputPath: Output 1}, + ] \ No newline at end of file From eb8efc7f77547ebba36e00bd3a2967d22b6017d3 Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Mon, 16 Aug 2021 09:11:06 +0200 Subject: [PATCH 021/177] fix --- component.yaml | 31 ++++++++++++++++++++----------- 1 file changed, 20 insertions(+), 11 deletions(-) diff --git a/component.yaml b/component.yaml index 5f5ac7c2..d1a9b0fc 100644 --- a/component.yaml +++ b/component.yaml @@ -18,14 +18,23 @@ outputs: implementation: container: image: continuumio/anaconda3:2020.07 - command: [ - python3, - # Path of the program inside the container - /pipelines/component/src/program.py, - --input1-path, - {inputPath: Input 1}, - --param1, - {inputValue: Parameter 1}, - --output1-path, - {outputPath: Output 1}, - ] \ No newline at end of file + command: [ + wget https://raw.githubusercontent.com/IBM/claimed/master/component-library/input/input-postgresql.ipynb &&, + ipython ./input-postgresql.ipynb, + host=, + {inputValue: host}, + database=, + {inputValue: database}, + user=, + {inputValue: user}, + password=, + {inputValue: password}, + port=, + {inputValue: port}, + sql=, + {inputValue: sql}, + data_dir=, + {inputValue: data_dir}, + output_data_csv=, + {outputPath: output_data_csv}, + ] \ No newline at end of file From e536df8ad260f51940b437f184682232d236b048 Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Mon, 16 Aug 2021 09:57:48 +0200 Subject: [PATCH 022/177] fix --- component.yaml | 40 ++++++++++++++++++++-------------------- 1 file changed, 20 insertions(+), 20 deletions(-) diff --git a/component.yaml b/component.yaml index d1a9b0fc..306efd07 100644 --- a/component.yaml +++ b/component.yaml @@ -18,23 +18,23 @@ outputs: implementation: container: image: continuumio/anaconda3:2020.07 - command: [ - wget https://raw.githubusercontent.com/IBM/claimed/master/component-library/input/input-postgresql.ipynb &&, - ipython ./input-postgresql.ipynb, - host=, - {inputValue: host}, - database=, - {inputValue: database}, - user=, - {inputValue: user}, - password=, - {inputValue: password}, - port=, - {inputValue: port}, - sql=, - {inputValue: sql}, - data_dir=, - {inputValue: data_dir}, - output_data_csv=, - {outputPath: output_data_csv}, - ] \ No newline at end of file + command: [ + wget https://raw.githubusercontent.com/IBM/claimed/master/component-library/input/input-postgresql.ipynb &&, + ipython ./input-postgresql.ipynb, + host=, + {inputValue: host}, + database=, + {inputValue: database}, + user=, + {inputValue: user}, + password=, + {inputValue: password}, + port=, + {inputValue: port}, + sql=, + {inputValue: sql}, + data_dir=, + {inputValue: data_dir}, + output_data_csv=, + {outputPath: output_data_csv}, + ] \ No newline at end of file From c7c3006b905566af3e00e8e38bb6e4a14dd34c32 Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Wed, 18 Aug 2021 15:19:46 +0200 Subject: [PATCH 023/177] fix --- component.yaml | 40 ++++++++++++++++++++-------------------- 1 file changed, 20 insertions(+), 20 deletions(-) diff --git a/component.yaml b/component.yaml index 306efd07..d1a9b0fc 100644 --- a/component.yaml +++ b/component.yaml @@ -18,23 +18,23 @@ outputs: implementation: container: image: continuumio/anaconda3:2020.07 - command: [ - wget https://raw.githubusercontent.com/IBM/claimed/master/component-library/input/input-postgresql.ipynb &&, - ipython ./input-postgresql.ipynb, - host=, - {inputValue: host}, - database=, - {inputValue: database}, - user=, - 
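Most of the small 'fix' commits around here come down to YAML nesting: `container:` must sit indented beneath `implementation:`, and `image:`/`command:` beneath `container:`, or the spec no longer describes a container implementation at all. A quick structural check, assuming PyYAML is available; the file name is illustrative:

import yaml

with open('component.yaml') as f:
    spec = yaml.safe_load(f)

# Fails loudly when the indentation flattens container to the top level.
assert 'container' in (spec.get('implementation') or {}), \
    'container must be nested under implementation'
assert 'image' in spec['implementation']['container']
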
{inputValue: user}, - password=, - {inputValue: password}, - port=, - {inputValue: port}, - sql=, - {inputValue: sql}, - data_dir=, - {inputValue: data_dir}, - output_data_csv=, - {outputPath: output_data_csv}, - ] \ No newline at end of file + command: [ + wget https://raw.githubusercontent.com/IBM/claimed/master/component-library/input/input-postgresql.ipynb &&, + ipython ./input-postgresql.ipynb, + host=, + {inputValue: host}, + database=, + {inputValue: database}, + user=, + {inputValue: user}, + password=, + {inputValue: password}, + port=, + {inputValue: port}, + sql=, + {inputValue: sql}, + data_dir=, + {inputValue: data_dir}, + output_data_csv=, + {outputPath: output_data_csv}, + ] \ No newline at end of file From ca3f7518b95216484873c2440f4f6908e7e76095 Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Mon, 23 Aug 2021 13:42:12 +0200 Subject: [PATCH 024/177] fix template problem --- src/builder/kfp_component_builder.py | 32 +++++++++++++++++++++++----- 1 file changed, 27 insertions(+), 5 deletions(-) diff --git a/src/builder/kfp_component_builder.py b/src/builder/kfp_component_builder.py index 54ac85a2..c75bddc5 100644 --- a/src/builder/kfp_component_builder.py +++ b/src/builder/kfp_component_builder.py @@ -18,11 +18,19 @@ def get_inputs(self): def get_outputs(self): with StringIO() as outputs_str: + assert len(self.kfp.get_outputs()) == 1, 'exactly one output currently supported' for output_key, output_value in self.kfp.get_outputs().items(): t = Template("- {name: $name, type: $type, description: '$description'}") print(t.substitute(name=output_key, type=output_value[1], description=output_value[0]), file=outputs_str) return outputs_str.getvalue() + def get_output_name(self): + assert len(self.kfp.get_outputs()) == 1, 'exactly one output currently supported' + for output_key, output_value in self.kfp.get_outputs().items(): + return output_key + + + def get_yaml(self): t = Template(''' name: $name @@ -37,15 +45,29 @@ def get_yaml(self): implementation: container: image: $container_uri - command: [ - wget https://raw.githubusercontent.com/IBM/claimed/master/component-library/input/input-postgresql.ipynb &&, - ipython ./input-postgresql.ipynb data_dir=., - ] + command: + - sh + - -ec + - | + $mkdir + wget https://raw.githubusercontent.com/IBM/claimed/master/component-library/input/input-postgresql.ipynb + $call + - {outputPath: $outputPath} + - {inputValue: host} + - {inputValue: database} + - {inputValue: user} + - {inputValue: password} + - {inputValue: port} + - {inputValue: sql} + - {inputValue: data_dir} ''') return t.substitute( name=self.kfp.get_name(), description=self.kfp.get_description(), inputs=self.get_inputs(), outputs=self.get_outputs(), - container_uri=self.kfp.get_container_uri() + container_uri=self.kfp.get_container_uri(), + outputPath=self.get_output_name(), + mkdir="mkdir -p `echo $0 |sed -e 's/\/[a-zA-Z0-9]*$//'`", + call='ipython ./input-postgresql.ipynb output_data_csv="$0" host="$1" database="$2" user="$3" password="$4" port="$5" sql="$6" data_dir="$7"' ) From fe3006946d7e21eaafdaadf8fdb0f737313755ae Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Mon, 23 Aug 2021 15:44:45 +0200 Subject: [PATCH 025/177] release candidate 1 --- component.yaml | 35 ++++++++++++---------------- src/builder/kfp_component_builder.py | 16 +++++++------ 2 files changed, 24 insertions(+), 27 deletions(-) diff --git a/component.yaml b/component.yaml index d1a9b0fc..22f47cb2 100644 --- a/component.yaml +++ b/component.yaml @@ -18,23 +18,18 @@ outputs: implementation: container: 
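Patch 024's rewrite works because `sh -ec <script>` treats the remaining command-list entries as positional parameters of the inline script: `{outputPath: output_data_csv}` arrives as `$0`, the first `{inputValue: ...}` as `$1`, and so on, and the sed one-liner strips the file name from `$0` so the output directory can be created first. For comparison, the Python equivalent of that mkdir line (os.path.dirname standing in for the sed expression):

import os

output_data_csv = '/tmp/outputs/output_data_csv/data'   # illustrative outputPath
os.makedirs(os.path.dirname(output_data_csv), exist_ok=True)
# mirrors: mkdir -p `echo $0 | sed -e 's/\/[a-zA-Z0-9]*$//'`
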
image: continuumio/anaconda3:2020.07 - command: [ - wget https://raw.githubusercontent.com/IBM/claimed/master/component-library/input/input-postgresql.ipynb &&, - ipython ./input-postgresql.ipynb, - host=, - {inputValue: host}, - database=, - {inputValue: database}, - user=, - {inputValue: user}, - password=, - {inputValue: password}, - port=, - {inputValue: port}, - sql=, - {inputValue: sql}, - data_dir=, - {inputValue: data_dir}, - output_data_csv=, - {outputPath: output_data_csv}, - ] \ No newline at end of file + command: + - sh + - -ec + - | + mkdir -p `echo $0 |sed -e 's/\/[a-zA-Z0-9]*$//'` + wget https://raw.githubusercontent.com/IBM/claimed/master/component-library/input/input-postgresql.ipynb + ipython ./input-postgresql.ipynb output_data_csv="$0" host="$1" database="$2" user="$3" password="$4" port="$5" sql="$6" data_dir="$7" + - {outputPath: output_data_csv} + - {inputValue: host} + - {inputValue: database} + - {inputValue: user} + - {inputValue: password} + - {inputValue: port} + - {inputValue: sql} + - {inputValue: data_dir} \ No newline at end of file diff --git a/src/builder/kfp_component_builder.py b/src/builder/kfp_component_builder.py index c75bddc5..5851d846 100644 --- a/src/builder/kfp_component_builder.py +++ b/src/builder/kfp_component_builder.py @@ -16,6 +16,13 @@ def get_inputs(self): print(t.substitute(name=input_key, type=input_value[1], description=input_value[0]), file=inputs_str) return inputs_str.getvalue() + def get_input_for_implementation(self): + with StringIO() as inputs_str: + for input_key, input_value in self.kfp.get_inputs().items(): + t = Template(" - {inputValue: $name}") + print(t.substitute(name=input_key), file=inputs_str) + return inputs_str.getvalue() + def get_outputs(self): with StringIO() as outputs_str: assert len(self.kfp.get_outputs()) == 1, 'exactly one output currently supported' @@ -53,13 +60,7 @@ def get_yaml(self): wget https://raw.githubusercontent.com/IBM/claimed/master/component-library/input/input-postgresql.ipynb $call - {outputPath: $outputPath} - - {inputValue: host} - - {inputValue: database} - - {inputValue: user} - - {inputValue: password} - - {inputValue: port} - - {inputValue: sql} - - {inputValue: data_dir} +$input_for_implementation ''') return t.substitute( name=self.kfp.get_name(), @@ -68,6 +69,7 @@ def get_yaml(self): outputs=self.get_outputs(), container_uri=self.kfp.get_container_uri(), outputPath=self.get_output_name(), + input_for_implementation=self.get_input_for_implementation(), mkdir="mkdir -p `echo $0 |sed -e 's/\/[a-zA-Z0-9]*$//'`", call='ipython ./input-postgresql.ipynb output_data_csv="$0" host="$1" database="$2" user="$3" password="$4" port="$5" sql="$6" data_dir="$7"' ) From c1e738a2f0750d9b7c76065603fda28648e892ab Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Mon, 6 Sep 2021 17:48:12 +0200 Subject: [PATCH 026/177] add generator scripts --- bin/generate_claimed.sh | 2 ++ bin/generate_elyra_component_config.sh | 7 +++++++ bin/generate_kfp_component.sh | 3 +++ component.yaml | 2 +- src/builder/generate_kfp_component.py | 18 ++++++++++++++++++ src/builder/kfp_component_builder.py | 6 ++---- 6 files changed, 33 insertions(+), 5 deletions(-) create mode 100755 bin/generate_claimed.sh create mode 100755 bin/generate_elyra_component_config.sh create mode 100755 bin/generate_kfp_component.sh create mode 100755 src/builder/generate_kfp_component.py diff --git a/bin/generate_claimed.sh b/bin/generate_claimed.sh new file mode 100755 index 00000000..c6ecd391 --- /dev/null +++ b/bin/generate_claimed.sh @@ -0,0 
+1,2 @@ +#!/bin/bash +for file in `find ../claimed/component-library/ -name "*.ipynb"`; do ./bin/generate_kfp_component.sh $file `echo $file |sed 's/.ipynb/.yaml/g'`; done \ No newline at end of file diff --git a/bin/generate_elyra_component_config.sh b/bin/generate_elyra_component_config.sh new file mode 100755 index 00000000..e639d894 --- /dev/null +++ b/bin/generate_elyra_component_config.sh @@ -0,0 +1,7 @@ +#!/bin/bash +for file in `find ../claimed/component-library/ -name "*.yaml"`; do + new_file=`echo $file|sed -s 's/..\/claimed\/component-library//'` + component_name=${file##*/} + component_name=`echo $component_name | sed -s 's/.yaml//'` + printf '"%s": {\n "location": {\n "url": "https://raw.githubusercontent.com/IBM/claimed/master/component-library/%s"\n },\n "category": "kfp"\n }' $component_name $new_file +done \ No newline at end of file diff --git a/bin/generate_kfp_component.sh b/bin/generate_kfp_component.sh new file mode 100755 index 00000000..15fd1dfe --- /dev/null +++ b/bin/generate_kfp_component.sh @@ -0,0 +1,3 @@ +#!/bin/bash +source venv/bin/activate +python ./src/builder/generate_kfp_component.py $1 $2 \ No newline at end of file diff --git a/component.yaml b/component.yaml index 22f47cb2..b1fa3906 100644 --- a/component.yaml +++ b/component.yaml @@ -32,4 +32,4 @@ implementation: - {inputValue: password} - {inputValue: port} - {inputValue: sql} - - {inputValue: data_dir} \ No newline at end of file + - {inputValue: data_dir} diff --git a/src/builder/generate_kfp_component.py b/src/builder/generate_kfp_component.py new file mode 100755 index 00000000..242ab28f --- /dev/null +++ b/src/builder/generate_kfp_component.py @@ -0,0 +1,18 @@ +from notebook import Notebook +from kfp_component import KfpComponent +from kfp_component_builder import KfpComponentBuilder +import sys + + +def main(): + args = sys.argv[1:] + input_path = args[0] + output_path = args[1] + kfpcb = KfpComponentBuilder(input_path) + with open(output_path, "w") as output_file: + output_file.write(kfpcb.get_yaml()) + + +if __name__ == "__main__": + main() + diff --git a/src/builder/kfp_component_builder.py b/src/builder/kfp_component_builder.py index 5851d846..bd175df1 100644 --- a/src/builder/kfp_component_builder.py +++ b/src/builder/kfp_component_builder.py @@ -39,8 +39,7 @@ def get_output_name(self): def get_yaml(self): - t = Template(''' -name: $name + t = Template('''name: $name description: $description inputs: @@ -60,8 +59,7 @@ def get_yaml(self): wget https://raw.githubusercontent.com/IBM/claimed/master/component-library/input/input-postgresql.ipynb $call - {outputPath: $outputPath} -$input_for_implementation - ''') +$input_for_implementation''') return t.substitute( name=self.kfp.get_name(), description=self.kfp.get_description(), From 43d581ee1b3a41e7b0f23f65db791cb2ed7b8842 Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Mon, 6 Sep 2021 17:58:45 +0200 Subject: [PATCH 027/177] fix --- bin/generate_elyra_component_config.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/bin/generate_elyra_component_config.sh b/bin/generate_elyra_component_config.sh index e639d894..18b59487 100755 --- a/bin/generate_elyra_component_config.sh +++ b/bin/generate_elyra_component_config.sh @@ -3,5 +3,5 @@ for file in `find ../claimed/component-library/ -name "*.yaml"`; do new_file=`echo $file|sed -s 's/..\/claimed\/component-library//'` component_name=${file##*/} component_name=`echo $component_name | sed -s 's/.yaml//'` - printf '"%s": {\n "location": {\n "url": 
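The printf in generate_elyra_component_config.sh assembles one JSON fragment per generated YAML file. The same record is easier to keep correctly quoted from Python with json.dumps; a minimal sketch, with the component name and relative path hard-coded for illustration:

import json

component_name = 'input-postgresql'
relative_path = '/input/input-postgresql.yaml'

entry = {
    component_name: {
        'location': {
            'url': 'https://raw.githubusercontent.com/IBM/claimed/master/component-library' + relative_path,
        },
        'category': 'kfp',
    }
}
print(json.dumps(entry, indent=2))
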
"https://raw.githubusercontent.com/IBM/claimed/master/component-library/%s"\n },\n "category": "kfp"\n }' $component_name $new_file + printf '"%s": {\n "location": {\n "url": "https://raw.githubusercontent.com/IBM/claimed/master/component-library/%s"\n },\n "category": "kfp"\n },\n' $component_name $new_file done \ No newline at end of file From bc0759b4740eec40eff3b27fc0183a8e8f8ef6ba Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Thu, 28 Oct 2021 17:32:27 +0200 Subject: [PATCH 028/177] ongoing work --- .vscode/settings.json | 2 +- bin/generate_claimed.sh | 2 +- src/builder/kfp_component_builder.py | 2 +- test/notebooks/input-postgresql.ipynb | 58 +++++++++++++-------------- 4 files changed, 32 insertions(+), 32 deletions(-) diff --git a/.vscode/settings.json b/.vscode/settings.json index 94a6c908..625b81c9 100644 --- a/.vscode/settings.json +++ b/.vscode/settings.json @@ -1,3 +1,3 @@ { - "jupyter.jupyterServerType": "remote" + "jupyter.jupyterServerType": "local" } \ No newline at end of file diff --git a/bin/generate_claimed.sh b/bin/generate_claimed.sh index c6ecd391..ff55d11f 100755 --- a/bin/generate_claimed.sh +++ b/bin/generate_claimed.sh @@ -1,2 +1,2 @@ #!/bin/bash -for file in `find ../claimed/component-library/ -name "*.ipynb"`; do ./bin/generate_kfp_component.sh $file `echo $file |sed 's/.ipynb/.yaml/g'`; done \ No newline at end of file +for file in `find ../claimed/component-library/ -name "*.ipynb"`; do echo $file; echo `echo $file |sed 's/.ipynb/.yaml/g'`; ./bin/generate_kfp_component.sh $file `echo $file |sed 's/.ipynb/.yaml/g'`; done \ No newline at end of file diff --git a/src/builder/kfp_component_builder.py b/src/builder/kfp_component_builder.py index bd175df1..4142b20b 100644 --- a/src/builder/kfp_component_builder.py +++ b/src/builder/kfp_component_builder.py @@ -25,7 +25,7 @@ def get_input_for_implementation(self): def get_outputs(self): with StringIO() as outputs_str: - assert len(self.kfp.get_outputs()) == 1, 'exactly one output currently supported' + assert len(self.kfp.get_outputs()) == 1, 'exactly one output currently supported: '+ str((len(self.kfp.get_outputs()))) for output_key, output_value in self.kfp.get_outputs().items(): t = Template("- {name: $name, type: $type, description: '$description'}") print(t.substitute(name=output_key, type=output_value[1], description=output_value[0]), file=outputs_str) diff --git a/test/notebooks/input-postgresql.ipynb b/test/notebooks/input-postgresql.ipynb index 97108ff3..160c3b62 100644 --- a/test/notebooks/input-postgresql.ipynb +++ b/test/notebooks/input-postgresql.ipynb @@ -2,45 +2,43 @@ "cells": [ { "cell_type": "markdown", - "metadata": {}, "source": [ "# Input Postgresql" - ] + ], + "metadata": {} }, { "cell_type": "markdown", - "metadata": {}, "source": [ "This notebook pulls data from a postgresql database as CSV on a given SQL statement" - ] + ], + "metadata": {} }, { "cell_type": "code", "execution_count": null, - "metadata": {}, - "outputs": [], "source": [ "!pip install psycopg2-binary==2.9.1 pandas==1.3.1" - ] + ], + "outputs": [], + "metadata": {} }, { "cell_type": "code", "execution_count": null, - "metadata": {}, - "outputs": [], "source": [ "import os\n", "import pandas as pd\n", "import psycopg2\n", "import re\n", "import sys" - ] + ], + "outputs": [], + "metadata": {} }, { "cell_type": "code", "execution_count": null, - "metadata": {}, - "outputs": [], "source": [ "# path and file name for output\n", "data_csv = os.environ.get('output_data_csv', 'data.csv')\n", @@ -65,13 +63,13 @@ "\n", "# temporal 
data storage for local execution\n", "data_dir = os.environ.get('data_dir', '../../data/')" - ] + ], + "outputs": [], + "metadata": {} }, { "cell_type": "code", "execution_count": null, - "metadata": {}, - "outputs": [], "source": [ "# override parameters received from a potential call using %run magic\n", "parameters = list(\n", @@ -92,13 +90,13 @@ "\n", "# cast parameters to appropriate type\n", "port = int(port)" - ] + ], + "outputs": [], + "metadata": {} }, { "cell_type": "code", "execution_count": null, - "metadata": {}, - "outputs": [], "source": [ "conn = psycopg2.connect(\n", " host=host,\n", @@ -107,34 +105,36 @@ " password=password,\n", " port=port\n", ")" - ] + ], + "outputs": [], + "metadata": {} }, { "cell_type": "code", "execution_count": null, - "metadata": {}, - "outputs": [], "source": [ "d = pd.read_sql_query(sql, conn)" - ] + ], + "outputs": [], + "metadata": {} }, { "cell_type": "code", "execution_count": null, - "metadata": {}, - "outputs": [], "source": [ "conn.close()" - ] + ], + "outputs": [], + "metadata": {} }, { "cell_type": "code", "execution_count": null, - "metadata": {}, - "outputs": [], "source": [ "d.to_csv(output_data_csv, index=False)" - ] + ], + "outputs": [], + "metadata": {} } ], "metadata": { @@ -158,4 +158,4 @@ }, "nbformat": 4, "nbformat_minor": 4 -} +} \ No newline at end of file From 971667a1065c924d8e014cd6451d75ce4b9989ba Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Thu, 25 Nov 2021 17:36:29 +0100 Subject: [PATCH 029/177] commit for backup --- bin/generate_kfp_component.sh | 2 +- src/builder/generate_kfp_component.py | 4 +++- src/builder/kfp_component_builder.py | 22 ++++++++++++++++++---- 3 files changed, 22 insertions(+), 6 deletions(-) diff --git a/bin/generate_kfp_component.sh b/bin/generate_kfp_component.sh index 15fd1dfe..877f07a0 100755 --- a/bin/generate_kfp_component.sh +++ b/bin/generate_kfp_component.sh @@ -1,3 +1,3 @@ #!/bin/bash source venv/bin/activate -python ./src/builder/generate_kfp_component.py $1 $2 \ No newline at end of file +python ./src/builder/generate_kfp_component.py $1 $2 $3 $4 \ No newline at end of file diff --git a/src/builder/generate_kfp_component.py b/src/builder/generate_kfp_component.py index 242ab28f..90efbb63 100755 --- a/src/builder/generate_kfp_component.py +++ b/src/builder/generate_kfp_component.py @@ -8,7 +8,9 @@ def main(): args = sys.argv[1:] input_path = args[0] output_path = args[1] - kfpcb = KfpComponentBuilder(input_path) + source_uri = args[2] # URI to the component source code to be downloaded + source_file_name = args[3] # file name to be executed + kfpcb = KfpComponentBuilder(input_path,source_uri,source_file_name) with open(output_path, "w") as output_file: output_file.write(kfpcb.get_yaml()) diff --git a/src/builder/kfp_component_builder.py b/src/builder/kfp_component_builder.py index 4142b20b..a4774d15 100644 --- a/src/builder/kfp_component_builder.py +++ b/src/builder/kfp_component_builder.py @@ -5,8 +5,10 @@ class KfpComponentBuilder(): - def __init__(self, notebook_url : str): + def __init__(self, notebook_url : str, source_uri : str, source_file_name : str): nb = Notebook(notebook_url) + self.source_uri = source_uri + self.source_file_name = source_file_name self.kfp = KfpComponent(nb) def get_inputs(self): @@ -36,6 +38,17 @@ def get_output_name(self): for output_key, output_value in self.kfp.get_outputs().items(): return output_key + def get_parameter_list(self): + return_value = str() + index = 0 + for output_key, output_value in self.kfp.get_outputs().items(): + return_value = 
return_value + output_key + '="$' + str(index) + '" ' + index = index + 1 + for input_key, input_value in self.kfp.get_inputs().items(): + return_value = return_value + input_key + '="$' + str(index) + '" ' + index = index + 1 + return return_value + def get_yaml(self): @@ -56,7 +69,7 @@ def get_yaml(self): - -ec - | $mkdir - wget https://raw.githubusercontent.com/IBM/claimed/master/component-library/input/input-postgresql.ipynb + wget $source_uri $call - {outputPath: $outputPath} $input_for_implementation''') @@ -69,5 +82,6 @@ def get_yaml(self): outputPath=self.get_output_name(), input_for_implementation=self.get_input_for_implementation(), mkdir="mkdir -p `echo $0 |sed -e 's/\/[a-zA-Z0-9]*$//'`", - call='ipython ./input-postgresql.ipynb output_data_csv="$0" host="$1" database="$2" user="$3" password="$4" port="$5" sql="$6" data_dir="$7"' - ) + source_uri=self.source_uri, + call='ipython ' + self.source_file_name + ' ' + self.get_parameter_list() + ) From e216ade5829cf9925d3251d16e1387bb23c4704b Mon Sep 17 00:00:00 2001 From: romeo kienzler Date: Wed, 26 Oct 2022 12:13:33 +0200 Subject: [PATCH 030/177] add argument check --- src/builder/generate_kfp_component.py | 3 +++ src/builder/kfp_component_builder.py | 4 +++- 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/src/builder/generate_kfp_component.py b/src/builder/generate_kfp_component.py index 90efbb63..1bbb2c02 100755 --- a/src/builder/generate_kfp_component.py +++ b/src/builder/generate_kfp_component.py @@ -6,6 +6,9 @@ def main(): args = sys.argv[1:] + if len(args) < 4: + print('Usage: input_path output_path source_uri(URI to the component source code to be downloaded) source_file_name(file name to be executed)') + exit(-1) input_path = args[0] output_path = args[1] source_uri = args[2] # URI to the component source code to be downloaded diff --git a/src/builder/kfp_component_builder.py b/src/builder/kfp_component_builder.py index a4774d15..98a05b8e 100644 --- a/src/builder/kfp_component_builder.py +++ b/src/builder/kfp_component_builder.py @@ -18,6 +18,7 @@ def get_inputs(self): print(t.substitute(name=input_key, type=input_value[1], description=input_value[0]), file=inputs_str) return inputs_str.getvalue() + def get_input_for_implementation(self): with StringIO() as inputs_str: for input_key, input_value in self.kfp.get_inputs().items(): @@ -25,6 +26,7 @@ def get_input_for_implementation(self): print(t.substitute(name=input_key), file=inputs_str) return inputs_str.getvalue() + def get_outputs(self): with StringIO() as outputs_str: assert len(self.kfp.get_outputs()) == 1, 'exactly one output currently supported: '+ str((len(self.kfp.get_outputs()))) @@ -33,8 +35,8 @@ def get_outputs(self): print(t.substitute(name=output_key, type=output_value[1], description=output_value[0]), file=outputs_str) return outputs_str.getvalue() + def get_output_name(self): - assert len(self.kfp.get_outputs()) == 1, 'exactly one output currently supported' for output_key, output_value in self.kfp.get_outputs().items(): return output_key From e5a2bd885ec9577943f954f8493a9e0c340bf1a3 Mon Sep 17 00:00:00 2001 From: romeo kienzler Date: Wed, 23 Nov 2022 16:57:17 +0100 Subject: [PATCH 031/177] clear v1 of the tool --- Dockerfile | 3 - LICENSE | 201 -------------------- README.md | 8 - bin/generate_claimed.sh | 2 - bin/generate_elyra_component_config.sh | 7 - bin/generate_kfp_component.sh | 3 - component.yaml | 35 ---- requirements.txt | 4 - run_tests.sh | 5 - src/builder/base_component_spec.py | 18 -- src/builder/generate_kfp_component.py | 23 --- 
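get_parameter_list from patch 029 emits the positional mapping consumed by the generated sh script: the output name first, then the inputs, each bound to the next $-index. A minimal sketch of the same assembly using enumerate, with the names taken from the postgresql component:

names = ['output_data_csv', 'host', 'database', 'user',
         'password', 'port', 'sql', 'data_dir']
parameter_list = ' '.join('{}="${}"'.format(name, index)
                          for index, name in enumerate(names))
print(parameter_list)
# -> output_data_csv="$0" host="$1" database="$2" ... data_dir="$7"
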
src/builder/kfp_component.py | 24 --- src/builder/kfp_component_builder.py | 89 --------- src/builder/notebook.py | 55 ------ src/builder/parser.py | 209 --------------------- src/builder/test_kfp_component.py | 25 --- src/builder/test_kfp_component_builder.py | 7 - src/builder/test_notebook.py | 24 --- src/convert/a_python_script.py | 112 ----------- src/convert/notebook_to_python_script.py | 19 -- src/data/test.csv | 101 ---------- src/mlx/publish.py | 83 --------- src/program.py | 29 --- src/tmp/test.csv | 10 - test/notebooks/a_notebook.ipynb | 214 ---------------------- test/notebooks/input-postgresql.ipynb | 161 ---------------- test_component.yaml | 24 --- 27 files changed, 1495 deletions(-) delete mode 100644 Dockerfile delete mode 100644 LICENSE delete mode 100644 README.md delete mode 100755 bin/generate_claimed.sh delete mode 100755 bin/generate_elyra_component_config.sh delete mode 100755 bin/generate_kfp_component.sh delete mode 100644 component.yaml delete mode 100644 requirements.txt delete mode 100755 run_tests.sh delete mode 100644 src/builder/base_component_spec.py delete mode 100755 src/builder/generate_kfp_component.py delete mode 100644 src/builder/kfp_component.py delete mode 100644 src/builder/kfp_component_builder.py delete mode 100644 src/builder/notebook.py delete mode 100644 src/builder/parser.py delete mode 100644 src/builder/test_kfp_component.py delete mode 100644 src/builder/test_kfp_component_builder.py delete mode 100644 src/builder/test_notebook.py delete mode 100644 src/convert/a_python_script.py delete mode 100644 src/convert/notebook_to_python_script.py delete mode 100644 src/data/test.csv delete mode 100644 src/mlx/publish.py delete mode 100755 src/program.py delete mode 100644 src/tmp/test.csv delete mode 100644 test/notebooks/a_notebook.ipynb delete mode 100644 test/notebooks/input-postgresql.ipynb delete mode 100644 test_component.yaml diff --git a/Dockerfile b/Dockerfile deleted file mode 100644 index b9639b20..00000000 --- a/Dockerfile +++ /dev/null @@ -1,3 +0,0 @@ -FROM python:3.7 -#RUN python3 -m pip install keras -COPY ./src /pipelines/component/src \ No newline at end of file diff --git a/LICENSE b/LICENSE deleted file mode 100644 index 261eeb9e..00000000 --- a/LICENSE +++ /dev/null @@ -1,201 +0,0 @@ - Apache License - Version 2.0, January 2004 - http://www.apache.org/licenses/ - - TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION - - 1. Definitions. - - "License" shall mean the terms and conditions for use, reproduction, - and distribution as defined by Sections 1 through 9 of this document. - - "Licensor" shall mean the copyright owner or entity authorized by - the copyright owner that is granting the License. - - "Legal Entity" shall mean the union of the acting entity and all - other entities that control, are controlled by, or are under common - control with that entity. For the purposes of this definition, - "control" means (i) the power, direct or indirect, to cause the - direction or management of such entity, whether by contract or - otherwise, or (ii) ownership of fifty percent (50%) or more of the - outstanding shares, or (iii) beneficial ownership of such entity. - - "You" (or "Your") shall mean an individual or Legal Entity - exercising permissions granted by this License. - - "Source" form shall mean the preferred form for making modifications, - including but not limited to software source code, documentation - source, and configuration files. 
- - "Object" form shall mean any form resulting from mechanical - transformation or translation of a Source form, including but - not limited to compiled object code, generated documentation, - and conversions to other media types. - - "Work" shall mean the work of authorship, whether in Source or - Object form, made available under the License, as indicated by a - copyright notice that is included in or attached to the work - (an example is provided in the Appendix below). - - "Derivative Works" shall mean any work, whether in Source or Object - form, that is based on (or derived from) the Work and for which the - editorial revisions, annotations, elaborations, or other modifications - represent, as a whole, an original work of authorship. For the purposes - of this License, Derivative Works shall not include works that remain - separable from, or merely link (or bind by name) to the interfaces of, - the Work and Derivative Works thereof. - - "Contribution" shall mean any work of authorship, including - the original version of the Work and any modifications or additions - to that Work or Derivative Works thereof, that is intentionally - submitted to Licensor for inclusion in the Work by the copyright owner - or by an individual or Legal Entity authorized to submit on behalf of - the copyright owner. For the purposes of this definition, "submitted" - means any form of electronic, verbal, or written communication sent - to the Licensor or its representatives, including but not limited to - communication on electronic mailing lists, source code control systems, - and issue tracking systems that are managed by, or on behalf of, the - Licensor for the purpose of discussing and improving the Work, but - excluding communication that is conspicuously marked or otherwise - designated in writing by the copyright owner as "Not a Contribution." - - "Contributor" shall mean Licensor and any individual or Legal Entity - on behalf of whom a Contribution has been received by Licensor and - subsequently incorporated within the Work. - - 2. Grant of Copyright License. Subject to the terms and conditions of - this License, each Contributor hereby grants to You a perpetual, - worldwide, non-exclusive, no-charge, royalty-free, irrevocable - copyright license to reproduce, prepare Derivative Works of, - publicly display, publicly perform, sublicense, and distribute the - Work and such Derivative Works in Source or Object form. - - 3. Grant of Patent License. Subject to the terms and conditions of - this License, each Contributor hereby grants to You a perpetual, - worldwide, non-exclusive, no-charge, royalty-free, irrevocable - (except as stated in this section) patent license to make, have made, - use, offer to sell, sell, import, and otherwise transfer the Work, - where such license applies only to those patent claims licensable - by such Contributor that are necessarily infringed by their - Contribution(s) alone or by combination of their Contribution(s) - with the Work to which such Contribution(s) was submitted. If You - institute patent litigation against any entity (including a - cross-claim or counterclaim in a lawsuit) alleging that the Work - or a Contribution incorporated within the Work constitutes direct - or contributory patent infringement, then any patent licenses - granted to You under this License for that Work shall terminate - as of the date such litigation is filed. - - 4. Redistribution. 
You may reproduce and distribute copies of the - Work or Derivative Works thereof in any medium, with or without - modifications, and in Source or Object form, provided that You - meet the following conditions: - - (a) You must give any other recipients of the Work or - Derivative Works a copy of this License; and - - (b) You must cause any modified files to carry prominent notices - stating that You changed the files; and - - (c) You must retain, in the Source form of any Derivative Works - that You distribute, all copyright, patent, trademark, and - attribution notices from the Source form of the Work, - excluding those notices that do not pertain to any part of - the Derivative Works; and - - (d) If the Work includes a "NOTICE" text file as part of its - distribution, then any Derivative Works that You distribute must - include a readable copy of the attribution notices contained - within such NOTICE file, excluding those notices that do not - pertain to any part of the Derivative Works, in at least one - of the following places: within a NOTICE text file distributed - as part of the Derivative Works; within the Source form or - documentation, if provided along with the Derivative Works; or, - within a display generated by the Derivative Works, if and - wherever such third-party notices normally appear. The contents - of the NOTICE file are for informational purposes only and - do not modify the License. You may add Your own attribution - notices within Derivative Works that You distribute, alongside - or as an addendum to the NOTICE text from the Work, provided - that such additional attribution notices cannot be construed - as modifying the License. - - You may add Your own copyright statement to Your modifications and - may provide additional or different license terms and conditions - for use, reproduction, or distribution of Your modifications, or - for any such Derivative Works as a whole, provided Your use, - reproduction, and distribution of the Work otherwise complies with - the conditions stated in this License. - - 5. Submission of Contributions. Unless You explicitly state otherwise, - any Contribution intentionally submitted for inclusion in the Work - by You to the Licensor shall be under the terms and conditions of - this License, without any additional terms or conditions. - Notwithstanding the above, nothing herein shall supersede or modify - the terms of any separate license agreement you may have executed - with Licensor regarding such Contributions. - - 6. Trademarks. This License does not grant permission to use the trade - names, trademarks, service marks, or product names of the Licensor, - except as required for reasonable and customary use in describing the - origin of the Work and reproducing the content of the NOTICE file. - - 7. Disclaimer of Warranty. Unless required by applicable law or - agreed to in writing, Licensor provides the Work (and each - Contributor provides its Contributions) on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or - implied, including, without limitation, any warranties or conditions - of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A - PARTICULAR PURPOSE. You are solely responsible for determining the - appropriateness of using or redistributing the Work and assume any - risks associated with Your exercise of permissions under this License. - - 8. Limitation of Liability. 
In no event and under no legal theory, - whether in tort (including negligence), contract, or otherwise, - unless required by applicable law (such as deliberate and grossly - negligent acts) or agreed to in writing, shall any Contributor be - liable to You for damages, including any direct, indirect, special, - incidental, or consequential damages of any character arising as a - result of this License or out of the use or inability to use the - Work (including but not limited to damages for loss of goodwill, - work stoppage, computer failure or malfunction, or any and all - other commercial damages or losses), even if such Contributor - has been advised of the possibility of such damages. - - 9. Accepting Warranty or Additional Liability. While redistributing - the Work or Derivative Works thereof, You may choose to offer, - and charge a fee for, acceptance of support, warranty, indemnity, - or other liability obligations and/or rights consistent with this - License. However, in accepting such obligations, You may act only - on Your own behalf and on Your sole responsibility, not on behalf - of any other Contributor, and only if You agree to indemnify, - defend, and hold each Contributor harmless for any liability - incurred by, or claims asserted against, such Contributor by reason - of your accepting any such warranty or additional liability. - - END OF TERMS AND CONDITIONS - - APPENDIX: How to apply the Apache License to your work. - - To apply the Apache License to your work, attach the following - boilerplate notice, with the fields enclosed by brackets "[]" - replaced with your own identifying information. (Don't include - the brackets!) The text should be enclosed in the appropriate - comment syntax for the file format. We also recommend that a - file or class name and description of purpose be included on the - same "printed page" as the copyright notice for easier - identification within third-party archives. - - Copyright [yyyy] [name of copyright owner] - - Licensed under the Apache License, Version 2.0 (the "License"); - you may not use this file except in compliance with the License. - You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. diff --git a/README.md b/README.md deleted file mode 100644 index 658a3ecf..00000000 --- a/README.md +++ /dev/null @@ -1,8 +0,0 @@ -# C3 - the CLAIMED Component Compiler - -This repository contains C3 - the [CLAIMED](https://arxiv.org/abs/2103.03281) Component Compiler responsible for compiling CLAIMED components to Kubeflow Pipeline Components by containerizing the CLAIMED notebooks / scripts, creating the component.yaml and pushing the container image to a registry. - -Please note: this is a very early version of the solution - please come back in a couple of weeks. 
- -# Prerequisites -- docker diff --git a/bin/generate_claimed.sh b/bin/generate_claimed.sh deleted file mode 100755 index ff55d11f..00000000 --- a/bin/generate_claimed.sh +++ /dev/null @@ -1,2 +0,0 @@ -#!/bin/bash -for file in `find ../claimed/component-library/ -name "*.ipynb"`; do echo $file; echo `echo $file |sed 's/.ipynb/.yaml/g'`; ./bin/generate_kfp_component.sh $file `echo $file |sed 's/.ipynb/.yaml/g'`; done \ No newline at end of file diff --git a/bin/generate_elyra_component_config.sh b/bin/generate_elyra_component_config.sh deleted file mode 100755 index 18b59487..00000000 --- a/bin/generate_elyra_component_config.sh +++ /dev/null @@ -1,7 +0,0 @@ -#!/bin/bash -for file in `find ../claimed/component-library/ -name "*.yaml"`; do - new_file=`echo $file|sed -s 's/..\/claimed\/component-library//'` - component_name=${file##*/} - component_name=`echo $component_name | sed -s 's/.yaml//'` - printf '"%s": {\n "location": {\n "url": "https://raw.githubusercontent.com/IBM/claimed/master/component-library/%s"\n },\n "category": "kfp"\n },\n' $component_name $new_file -done \ No newline at end of file diff --git a/bin/generate_kfp_component.sh b/bin/generate_kfp_component.sh deleted file mode 100755 index 877f07a0..00000000 --- a/bin/generate_kfp_component.sh +++ /dev/null @@ -1,3 +0,0 @@ -#!/bin/bash -source venv/bin/activate -python ./src/builder/generate_kfp_component.py $1 $2 $3 $4 \ No newline at end of file diff --git a/component.yaml b/component.yaml deleted file mode 100644 index b1fa3906..00000000 --- a/component.yaml +++ /dev/null @@ -1,35 +0,0 @@ -name: Input Postgresql -description: This notebook pulls data from a postgresql database as CSV on a given SQL statement - -inputs: -- {name: host, type: String, description: 'hostname of database server'} -- {name: database, type: String, description: 'database name'} -- {name: user, type: String, description: 'db user'} -- {name: password, type: String, description: 'db password'} -- {name: port, type: Integer, description: 'db port'} -- {name: sql, type: String, description: 'sql query statement to be executed'} -- {name: data_dir, type: String, description: 'temporal data storage for local execution'} - - -outputs: -- {name: output_data_csv, type: String, description: 'path and file name for output'} - - -implementation: - container: - image: continuumio/anaconda3:2020.07 - command: - - sh - - -ec - - | - mkdir -p `echo $0 |sed -e 's/\/[a-zA-Z0-9]*$//'` - wget https://raw.githubusercontent.com/IBM/claimed/master/component-library/input/input-postgresql.ipynb - ipython ./input-postgresql.ipynb output_data_csv="$0" host="$1" database="$2" user="$3" password="$4" port="$5" sql="$6" data_dir="$7" - - {outputPath: output_data_csv} - - {inputValue: host} - - {inputValue: database} - - {inputValue: user} - - {inputValue: password} - - {inputValue: port} - - {inputValue: sql} - - {inputValue: data_dir} diff --git a/requirements.txt b/requirements.txt deleted file mode 100644 index 10541fe0..00000000 --- a/requirements.txt +++ /dev/null @@ -1,4 +0,0 @@ -"git+https://github.com/machine-learning-exchange/mlx.git@main#egg=mlx-client&subdirectory=api/client" -nbformat==5.1.3 -nbconvert==6.0.7 -ipython==7.16.1 \ No newline at end of file diff --git a/run_tests.sh b/run_tests.sh deleted file mode 100755 index 4dba9fff..00000000 --- a/run_tests.sh +++ /dev/null @@ -1,5 +0,0 @@ -source ./venv/bin/activate -cd src/builder -python ./test_kfp_component.py -python ./test_notebook.py -python ./test_kfp_component_builder.py \ No newline at end of file 
diff --git a/src/builder/base_component_spec.py b/src/builder/base_component_spec.py deleted file mode 100644 index 45ae08c5..00000000 --- a/src/builder/base_component_spec.py +++ /dev/null @@ -1,18 +0,0 @@ -class BaseComponentSpec(): - def get_name() -> str: - raise Exception("Not implemented") - - def get_description() -> str: - raise Exception("Not implemented") - - def get_inputs(): - raise Exception("Not implemented") - - def get_outputs(): - raise Exception("Not implemented") - - def get_container_uri() -> str: - raise Exception("Not implemented") - - def get_requirements(): - raise Exception("Not implemented") \ No newline at end of file diff --git a/src/builder/generate_kfp_component.py b/src/builder/generate_kfp_component.py deleted file mode 100755 index 1bbb2c02..00000000 --- a/src/builder/generate_kfp_component.py +++ /dev/null @@ -1,23 +0,0 @@ -from notebook import Notebook -from kfp_component import KfpComponent -from kfp_component_builder import KfpComponentBuilder -import sys - - -def main(): - args = sys.argv[1:] - if len(args) < 4: - print('Usage: input_path output_path source_uri(URI to the component source code to be downloaded) source_file_name(file name to be executed)') - exit(-1) - input_path = args[0] - output_path = args[1] - source_uri = args[2] # URI to the component source code to be downloaded - source_file_name = args[3] # file name to be executed - kfpcb = KfpComponentBuilder(input_path,source_uri,source_file_name) - with open(output_path, "w") as output_file: - output_file.write(kfpcb.get_yaml()) - - -if __name__ == "__main__": - main() - diff --git a/src/builder/kfp_component.py b/src/builder/kfp_component.py deleted file mode 100644 index 7c8f46f5..00000000 --- a/src/builder/kfp_component.py +++ /dev/null @@ -1,24 +0,0 @@ -from base_component_spec import BaseComponentSpec -from notebook import Notebook - -class KfpComponent(BaseComponentSpec): - def __init__(self, noteboook : Notebook): - self.name = noteboook.get_name() - self.description = noteboook.get_description() - self.inputs = noteboook.get_inputs() - self.outputs = noteboook.get_outputs() - - def get_name(self) -> str: - return self.name - - def get_description(self) -> str: - return self.description - - def get_container_uri(self) -> str: - return 'continuumio/anaconda3:2020.07' - - def get_inputs(self): - return self.inputs - - def get_outputs(self): - return self.outputs diff --git a/src/builder/kfp_component_builder.py b/src/builder/kfp_component_builder.py deleted file mode 100644 index 98a05b8e..00000000 --- a/src/builder/kfp_component_builder.py +++ /dev/null @@ -1,89 +0,0 @@ -from kfp_component import KfpComponent -from notebook import Notebook -from string import Template -from io import StringIO - - -class KfpComponentBuilder(): - def __init__(self, notebook_url : str, source_uri : str, source_file_name : str): - nb = Notebook(notebook_url) - self.source_uri = source_uri - self.source_file_name = source_file_name - self.kfp = KfpComponent(nb) - - def get_inputs(self): - with StringIO() as inputs_str: - for input_key, input_value in self.kfp.get_inputs().items(): - t = Template("- {name: $name, type: $type, description: '$description'}") - print(t.substitute(name=input_key, type=input_value[1], description=input_value[0]), file=inputs_str) - return inputs_str.getvalue() - - - def get_input_for_implementation(self): - with StringIO() as inputs_str: - for input_key, input_value in self.kfp.get_inputs().items(): - t = Template(" - {inputValue: $name}") - print(t.substitute(name=input_key), 
file=inputs_str) - return inputs_str.getvalue() - - - def get_outputs(self): - with StringIO() as outputs_str: - assert len(self.kfp.get_outputs()) == 1, 'exactly one output currently supported: '+ str((len(self.kfp.get_outputs()))) - for output_key, output_value in self.kfp.get_outputs().items(): - t = Template("- {name: $name, type: $type, description: '$description'}") - print(t.substitute(name=output_key, type=output_value[1], description=output_value[0]), file=outputs_str) - return outputs_str.getvalue() - - - def get_output_name(self): - for output_key, output_value in self.kfp.get_outputs().items(): - return output_key - - def get_parameter_list(self): - return_value = str() - index = 0 - for output_key, output_value in self.kfp.get_outputs().items(): - return_value = return_value + output_key + '="$' + str(index) + '" ' - index = index + 1 - for input_key, input_value in self.kfp.get_inputs().items(): - return_value = return_value + input_key + '="$' + str(index) + '" ' - index = index + 1 - return return_value - - - - def get_yaml(self): - t = Template('''name: $name -description: $description - -inputs: -$inputs - -outputs: -$outputs - -implementation: - container: - image: $container_uri - command: - - sh - - -ec - - | - $mkdir - wget $source_uri - $call - - {outputPath: $outputPath} -$input_for_implementation''') - return t.substitute( - name=self.kfp.get_name(), - description=self.kfp.get_description(), - inputs=self.get_inputs(), - outputs=self.get_outputs(), - container_uri=self.kfp.get_container_uri(), - outputPath=self.get_output_name(), - input_for_implementation=self.get_input_for_implementation(), - mkdir="mkdir -p `echo $0 |sed -e 's/\/[a-zA-Z0-9]*$//'`", - source_uri=self.source_uri, - call='ipython ' + self.source_file_name + ' ' + self.get_parameter_list() - ) diff --git a/src/builder/notebook.py b/src/builder/notebook.py deleted file mode 100644 index fc0e251c..00000000 --- a/src/builder/notebook.py +++ /dev/null @@ -1,55 +0,0 @@ -import json -import re -from parser import ContentParser - -class Notebook(): - def __init__(self, path): - with open(path) as json_file: - self.notebook = json.load(json_file) - self.name = self.notebook['cells'][0]['source'][0].replace('#', '').strip() - self.description = self.notebook['cells'][1]['source'][0] - self.envs = self._get_env_vars(path) - self.requirements = self._get_requirements() - - def _get_env_vars(self, path): - cp = ContentParser() - env_names = cp.parse(path)['env_vars'] - return_value = dict() - for env_name in env_names: - comment_line = str() - for line in self.notebook['cells'][4]['source']: - if re.search("[\"']" + env_name + "[\"']", line): - assert '#' in comment_line, "comment line didn't contain #" - if "int(" in line: - type = 'Integer' - elif "float(" in line: - type = 'Float' - else: - type = 'String' - - return_value[env_name]=(comment_line.replace('#', '').strip(),type,None) - comment_line = line - return return_value - - - - def _get_requirements(self): - for cell in self.notebook['cells']: - cell_content = cell['source'][0] - pattern = r"(![ ]*pip[ ]*install[ ]*)([A-Za-z=0-9.:]*)" - - #print(re.findall(pattern,cell_content)) # TODO romeo multiple matches not working - - def get_name(self): - return self.name - - def get_description(self): - return self.description - - def get_inputs(self): - return { key:value for (key,value) in self.envs.items() if not key.startswith('output_') } - - def get_outputs(self): - return { key:value for (key,value) in self.envs.items() if key.startswith('output_') } - - 
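The notebook.py removed above infers a component's interface from environment-variable accesses in a notebook cell: the comment line directly above each access becomes the input description, an int() or float() cast maps to the Integer or Float type (String otherwise), and any name starting with output_ is treated as an output rather than an input. A hypothetical parameter cell illustrating that convention (the variable names are borrowed from the components above, not required by the parser):

    import os

    # hostname of the database server
    host = os.environ.get('host')  # plain access -> String input named "host"

    # db port
    port = int(os.environ.get('port', 5432))  # int(...) -> Integer input

    # path and file name for output
    output_data_csv = os.environ.get('output_data_csv', 'data.csv')  # "output_" prefix -> output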
diff --git a/src/builder/parser.py b/src/builder/parser.py deleted file mode 100644 index 8a130d8f..00000000 --- a/src/builder/parser.py +++ /dev/null @@ -1,209 +0,0 @@ -# -# Copyright 2018-2021 Elyra Authors -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# - -import os -import nbformat -import re - -from traitlets.config import LoggingConfigurable - -from typing import TypeVar, List, Dict - -# Setup forward reference for type hint on return from class factory method. See -# https://stackoverflow.com/questions/39205527/can-you-annotate-return-type-when-value-is-instance-of-cls/39205612#39205612 -F = TypeVar('F', bound='FileReader') - - -class FileReader(LoggingConfigurable): - """ - Base class for parsing a file for resources according to operation type. Subclasses set - their own parser member variable according to their implementation language. - """ - - def __init__(self, filepath: str): - self._filepath = filepath - - @property - def filepath(self): - return self._filepath - - @property - def language(self) -> str: - file_extension = os.path.splitext(self._filepath)[-1] - if file_extension == '.py': - return 'python' - elif file_extension == '.r': - return 'r' - else: - return None - - def read_next_code_chunk(self) -> List[str]: - """ - Implements a generator for lines of code in the specified filepath. Subclasses - may override if explicit line-by-line parsing is not feasible, e.g. with Notebooks. - """ - with open(self._filepath) as f: - for line in f: - yield [line.strip()] - - -class NotebookReader(FileReader): - def __init__(self, filepath: str): - super().__init__(filepath) - - with open(self._filepath) as f: - self._notebook = nbformat.read(f, as_version=4) - self._language = None - - try: - self._language = self._notebook['metadata']['kernelspec']['language'].lower() - - except KeyError: - self.log.warning(f'No language metadata found in {self._filepath}') - pass - - @property - def language(self) -> str: - return self._language - - def read_next_code_chunk(self) -> List[str]: - for cell in self._notebook.cells: - if cell.source and cell.cell_type == "code": - yield cell.source.split('\n') - - -class ScriptParser(): - """ - Base class for parsing individual lines of code. Subclasses implement a search_expressions() - function that returns language-specific regexes to match against code lines. 
- """ - - _comment_char = "#" - - def _get_line_without_comments(self, line): - if self._comment_char in line: - index = line.find(self._comment_char) - line = line[:index] - return line.strip() - - def parse_environment_variables(self, line): - # Parse a line fed from file and match each regex in regex dictionary - line = self._get_line_without_comments(line) - if not line: - return [] - - matches = [] - for key, value in self.search_expressions().items(): - for pattern in value: - regex = re.compile(pattern) - for match in regex.finditer(line): - matches.append((key, match)) - return matches - - -class PythonScriptParser(ScriptParser): - def search_expressions(self) -> Dict[str, List]: - # TODO: add more key:list-of-regex pairs to parse for additional resources - regex_dict = dict() - - # First regex matches envvar assignments of form os.environ["name"] = value w or w/o value provided - # Second regex matches envvar assignments that use os.getenv("name", "value") with ow w/o default provided - # Third regex matches envvar assignments that use os.environ.get("name", "value") with or w/o default provided - # Both name and value are captured if possible - envs = [r"os\.environ\[[\"']([a-zA-Z_]+[A-Za-z0-9_]*)[\"']\](?:\s*=(?:\s*[\"'](.[^\"']*)?[\"'])?)*", - r"os\.getenv\([\"']([a-zA-Z_]+[A-Za-z0-9_]*)[\"'](?:\s*\,\s*[\"'](.[^\"']*)?[\"'])?", - r"os\.environ\.get\([\"']([a-zA-Z_]+[A-Za-z0-9_]*)[\"'](?:\s*\,(?:\s*[\"'](.[^\"']*)?[\"'])?)*"] - regex_dict["env_vars"] = envs - return regex_dict - - -class RScriptParser(ScriptParser): - def search_expressions(self) -> Dict[str, List]: - # TODO: add more key:list-of-regex pairs to parse for additional resources - regex_dict = dict() - - # Tests for matches of the form Sys.setenv("key" = "value") - envs = [r"Sys\.setenv\([\"']*([a-zA-Z_]+[A-Za-z0-9_]*)[\"']*\s*=\s*[\"']*(.[^\"']*)?[\"']*\)", - r"Sys\.getenv\([\"']*([a-zA-Z_]+[A-Za-z0-9_]*)[\"']*\)(.)*"] - regex_dict["env_vars"] = envs - return regex_dict - - -class ContentParser(LoggingConfigurable): - parsers = { - 'python': PythonScriptParser(), - 'r': RScriptParser() - } - - def parse(self, filepath: str) -> dict: - """Returns a model dictionary of all the regex matches for each key in the regex dictionary""" - - properties = {"env_vars": {}, "inputs": [], "outputs": []} - reader = self._get_reader(filepath) - parser = self._get_parser(reader.language) - - if not parser: - return properties - - for chunk in reader.read_next_code_chunk(): - if chunk: - for line in chunk: - matches = parser.parse_environment_variables(line) - for key, match in matches: - if key == "env_vars": - properties[key][match.group(1)] = match.group(2) - else: - properties[key].append(match.group(1)) - - return properties - - def _validate_file(self, filepath: str): - """ - Validate file exists and is file (e.g. 
not a directory) - """ - if not os.path.exists(filepath): - raise FileNotFoundError(f'No such file or directory: {filepath}') - if not os.path.isfile(filepath): - raise IsADirectoryError(f'Is a directory: {filepath}') - - def _get_reader(self, filepath: str): - """ - Find the proper reader based on the file extension - """ - file_extension = os.path.splitext(filepath)[-1] - - self._validate_file(filepath) - - if file_extension == '.ipynb': - return NotebookReader(filepath) - elif file_extension in ['.py', '.r']: - return FileReader(filepath) - else: - raise ValueError(f'File type {file_extension} is not supported.') - - def _get_parser(self, language: str): - """ - Find the proper parser based on content language - """ - parser = None - if language: - parser = self.parsers.get(language) - - if not parser: - self.log.warning(f'Content parser for {language} is not available.') - pass - - return parser diff --git a/src/builder/test_kfp_component.py b/src/builder/test_kfp_component.py deleted file mode 100644 index c7de904a..00000000 --- a/src/builder/test_kfp_component.py +++ /dev/null @@ -1,25 +0,0 @@ -from notebook import Notebook -from kfp_component import KfpComponent - -nb = Notebook('../../test/notebooks/a_notebook.ipynb') -kfp = KfpComponent(nb) -assert 'input_hmp' == kfp.get_name() -assert 'This notebook pulls the HMP accelerometer sensor data classification data set' == kfp.get_description() -inputs = kfp.get_inputs() -assert 'data_csv' in inputs -assert 'master' in inputs -assert 'master2' in inputs -assert 'data_dir' in inputs -assert 'continuumio/anaconda3:2020.07' == kfp.get_container_uri() -assert 'data.csv' == inputs['data_csv'] -assert 'local[*]' == inputs['master'] -assert '../../data/' == inputs['data_dir'] -outputs = kfp.get_outputs() -assert not 'output_data' in inputs -assert 'output_data' in outputs -assert 'output_data2' in outputs -assert '/tmp/output.csv' == outputs['output_data'] -assert 'data_dir' not in outputs - - - diff --git a/src/builder/test_kfp_component_builder.py b/src/builder/test_kfp_component_builder.py deleted file mode 100644 index 2a0355b8..00000000 --- a/src/builder/test_kfp_component_builder.py +++ /dev/null @@ -1,7 +0,0 @@ -from notebook import Notebook -from kfp_component import KfpComponent -from kfp_component_builder import KfpComponentBuilder - - -kfpcb = KfpComponentBuilder('../../test/notebooks/input-postgresql.ipynb') -print(kfpcb.get_yaml()) \ No newline at end of file diff --git a/src/builder/test_notebook.py b/src/builder/test_notebook.py deleted file mode 100644 index cd285bd2..00000000 --- a/src/builder/test_notebook.py +++ /dev/null @@ -1,24 +0,0 @@ -from notebook import Notebook - -nb = Notebook('../../test/notebooks/a_notebook.ipynb') - -assert 'input_hmp' == nb.get_name() -assert 'This notebook pulls the HMP accelerometer sensor data classification data set' == nb.get_description() -inputs = nb.get_inputs() -assert 'data_csv' in inputs -assert 'master' in inputs -assert 'master2' in inputs -assert 'data_dir' in inputs - -assert 'data.csv' == inputs['data_csv'] -assert 'local[*]' == inputs['master'] -assert '../../data/' == inputs['data_dir'] -outputs = nb.get_outputs() -assert not 'output_data' in inputs -assert 'output_data' in outputs -assert 'output_data2' in outputs -assert '/tmp/output.csv' == outputs['output_data'] -assert 'data_dir' not in outputs - - - diff --git a/src/convert/a_python_script.py b/src/convert/a_python_script.py deleted file mode 100644 index 4bef91d2..00000000 --- a/src/convert/a_python_script.py +++ 
/dev/null @@ -1,112 +0,0 @@ -#!/usr/bin/env python -# coding: utf-8 - -# This notebook pulls the HMP accelerometer sensor data classification data set - -# In[ ]: - - -get_ipython().system('pip install pyspark==2.4.4') - - -# In[ ]: - - -# @param data_dir temporal data storage for local execution -# @param data_csv csv path and file name (default: data.csv) -# @param data_parquet path and parquet file name (default: data.parquet) -# @param master url of master (default: local mode) - - -# In[ ]: - - -from pyspark import SparkContext, SparkConf -from pyspark.sql import SparkSession -import os -from pyspark.sql.types import StructType, StructField, IntegerType -import fnmatch -from pyspark.sql.functions import lit - - -# In[ ]: - - -data_csv = os.environ.get('data_csv', 'data.csv') -master = os.environ.get('master', "local[*]") -data_dir = os.environ.get('data_dir', '../../data/') - - -# Lets create a local spark context (sc) and session (spark) - -# In[ ]: - - -sc = SparkContext.getOrCreate(SparkConf().setMaster(master)) - -spark = SparkSession .builder .getOrCreate() - - -# Lets pull the data in raw format from the source (github) - -# In[ ]: - - -get_ipython().system('rm -Rf HMP_Dataset') -get_ipython().system('git clone https://github.com/wchill/HMP_Dataset') - - -# In[ ]: - - -schema = StructType([ - StructField("x", IntegerType(), True), - StructField("y", IntegerType(), True), - StructField("z", IntegerType(), True)]) - - -# This step takes a while, it parses through all files and folders and creates a temporary dataframe for each file which gets appended to an overall data-frame "df". In addition, a column called "class" is added to allow for straightforward usage in Spark afterwards in a supervised machine learning scenario for example. - -# In[ ]: - - -d = 'HMP_Dataset/' - -# filter list for all folders containing data (folders that don't start with .) -file_list_filtered = [s for s in os.listdir(d) - if os.path.isdir(os.path.join(d, s)) & - ~fnmatch.fnmatch(s, '.*')] - -# create pandas data frame for all the data - -df = None - -for category in file_list_filtered: - data_files = os.listdir('HMP_Dataset/' + category) - - # create a temporary pandas data frame for each data file - for data_file in data_files: - print(data_file) - temp_df = spark.read. option("header", "false"). option("delimiter", " "). 
csv('HMP_Dataset/' + category + '/' + data_file, schema=schema) - - # create a column called "source" storing the current CSV file - temp_df = temp_df.withColumn("source", lit(data_file)) - - # create a column called "class" storing the current data folder - temp_df = temp_df.withColumn("class", lit(category)) - - if df is None: - df = temp_df - else: - df = df.union(temp_df) - - -# Lets write the dataf-rame to a file in "CSV" format, this will also take quite some time: - -# In[ ]: - - -df.write.option("header", "true").csv(data_dir + data_csv) - - -# Now we should have a CSV file with our contents diff --git a/src/convert/notebook_to_python_script.py b/src/convert/notebook_to_python_script.py deleted file mode 100644 index 76c70e99..00000000 --- a/src/convert/notebook_to_python_script.py +++ /dev/null @@ -1,19 +0,0 @@ -import nbformat as nbf -from nbconvert.exporters import PythonExporter -from nbconvert.preprocessors import TagRemovePreprocessor - -with open("a_notebook.ipynb", 'r', encoding='utf-8') as f: - the_notebook_nodes = nbf.read(f, as_version = 4) - -trp = TagRemovePreprocessor() - -trp.remove_cell_tags = ("remove",) - -pexp = PythonExporter() - -pexp.register_preprocessor(trp, enabled= True) - -the_python_script, meta = pexp.from_notebook_node(the_notebook_nodes) - -with open("a_python_script.py", 'w', encoding='utf-8') as f: - f.writelines(the_python_script) \ No newline at end of file diff --git a/src/data/test.csv b/src/data/test.csv deleted file mode 100644 index d9f23e4f..00000000 --- a/src/data/test.csv +++ /dev/null @@ -1,101 +0,0 @@ -x,y,z -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 \ No newline at end of file diff --git a/src/mlx/publish.py b/src/mlx/publish.py deleted file mode 100644 index 06569b96..00000000 --- a/src/mlx/publish.py +++ /dev/null @@ -1,83 +0,0 @@ -from __future__ import print_function - -import glob -import json -import os -import random -import re -import swagger_client -import tarfile -import tempfile - -from io import BytesIO -from os import environ as env -from pprint import pprint -from swagger_client.api_client import ApiClient, Configuration -# Copyright 2021 IBM Corporation -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-# - -from swagger_client.models import ApiComponent, ApiGetTemplateResponse, ApiListComponentsResponse, \ - ApiGenerateCodeResponse, ApiRunCodeResponse -from swagger_client.rest import ApiException -from sys import stderr -from urllib3.response import HTTPResponse - -host = env.get("MLX_API_SERVICE_HOST",'127.0.0.1') -port = env.get("MLX_API_SERVICE_PORT", '8080') - - -api_base_path = 'apis/v1alpha1' - -def get_swagger_client(): - - config = Configuration() - config.host = f'http://{host}:{port}/{api_base_path}' - api_client = ApiClient(configuration=config) - - return api_client - -def create_tar_file(yamlfile_name): - - yamlfile_basename = os.path.basename(yamlfile_name) - tmp_dir = tempfile.gettempdir() - tarfile_path = os.path.join(tmp_dir, yamlfile_basename.replace(".yaml", ".tgz")) - - with tarfile.open(tarfile_path, "w:gz") as tar: - tar.add(yamlfile_name, arcname=yamlfile_basename) - - tar.close() - - return tarfile_path - -def upload_component_file(component_id, file_path): - - api_client = get_swagger_client() - api_instance = swagger_client.ComponentServiceApi(api_client=api_client) - - try: - response = api_instance.upload_component_file(id=component_id, uploadfile=file_path) - print(f"Upload file '{file_path}' to component with ID '{component_id}'") - - except ApiException as e: - print("Exception when calling ComponentServiceApi -> upload_component_file: %s\n" % e, file=stderr) - raise e - -def main(): - - component_file = create_tar_file('../../test_component.yaml') - upload_component_file('test4',component_file) - -if __name__ == '__main__': - main() \ No newline at end of file diff --git a/src/program.py b/src/program.py deleted file mode 100755 index f0216c49..00000000 --- a/src/program.py +++ /dev/null @@ -1,29 +0,0 @@ -#!/usr/bin/env python3 -import argparse -from pathlib import Path - -# Function doing the actual work (Outputs first N lines from a text file) -def do_work(input1_file, output1_file, param1): - for x, line in enumerate(input1_file): - if x >= param1: - break - _ = output1_file.write(line) - -# Defining and parsing the command-line arguments -parser = argparse.ArgumentParser(description='My program description') -# Paths must be passed in, not hardcoded -parser.add_argument('--input1-path', type=str, - help='Path of the local file containing the Input 1 data.') -parser.add_argument('--output1-path', type=str, - help='Path of the local file where the Output 1 data should be written.') -parser.add_argument('--param1', type=int, default=100, - help='The number of lines to read from the input and write to the output.') -args = parser.parse_args() - -# Creating the directory where the output file is created (the directory -# may or may not exist). 
-Path(args.output1_path).parent.mkdir(parents=True, exist_ok=True) - -with open(args.input1_path, 'r') as input1_file: - with open(args.output1_path, 'w') as output1_file: - do_work(input1_file, output1_file, args.param1) diff --git a/src/tmp/test.csv b/src/tmp/test.csv deleted file mode 100644 index df33812c..00000000 --- a/src/tmp/test.csv +++ /dev/null @@ -1,10 +0,0 @@ -x,y,z -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 -1,2,3 diff --git a/test/notebooks/a_notebook.ipynb b/test/notebooks/a_notebook.ipynb deleted file mode 100644 index 21cc7c95..00000000 --- a/test/notebooks/a_notebook.ipynb +++ /dev/null @@ -1,214 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# input_hmp" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This notebook pulls the HMP accelerometer sensor data classification data set" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!pip install pyspark1==2.4.4 pyspark2==2.4.4 pyspark3==2.4.4 pyspark4 pyspark5\n", - "\n", - "\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# @param data_dir temporal data storage for local execution\n", - "# @param data_csv csv path and file name (default: data.csv)\n", - "# @param data_parquet path and parquet file name (default: data.parquet)\n", - "# @param master url of master (default: local mode)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from pyspark import SparkContext, SparkConf\n", - "from pyspark.sql import SparkSession\n", - "import os\n", - "from pyspark.sql.types import StructType, StructField, IntegerType\n", - "import fnmatch\n", - "from pyspark.sql.functions import lit" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "data_csv = os.environ.get('data_csv', 'data.csv')\n", - "master = os.environ.get('master', \"local[*]\")\n", - "master2 = os.environ.get('master2')\n", - "\n", - "data_dir = os.environ.get('data_dir', '../../data/')\n", - "output = os.environ.get('output_data','/tmp/output.csv')\n", - "output2 = os.environ.get('output_data2')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Lets create a local spark context (sc) and session (spark)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "sc = SparkContext.getOrCreate(SparkConf().setMaster(master))\n", - "\n", - "spark = SparkSession \\\n", - " .builder \\\n", - " .getOrCreate()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Lets pull the data in raw format from the source (github)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!rm -Rf HMP_Dataset\n", - "!git clone https://github.com/wchill/HMP_Dataset" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "schema = StructType([\n", - " StructField(\"x\", IntegerType(), True),\n", - " StructField(\"y\", IntegerType(), True),\n", - " StructField(\"z\", IntegerType(), True)])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This step takes a while, it parses through all files and folders and creates a temporary dataframe for each file which gets appended to an overall data-frame \"df\". 
In addition, a column called \"class\" is added to allow for straightforward usage in Spark afterwards in a supervised machine learning scenario for example." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true, - "tags": [] - }, - "outputs": [], - "source": [ - "d = 'HMP_Dataset/'\n", - "\n", - "# filter list for all folders containing data (folders that don't start with .)\n", - "file_list_filtered = [s for s in os.listdir(d)\n", - " if os.path.isdir(os.path.join(d, s)) &\n", - " ~fnmatch.fnmatch(s, '.*')]\n", - "\n", - "# create pandas data frame for all the data\n", - "\n", - "df = None\n", - "\n", - "for category in file_list_filtered:\n", - " data_files = os.listdir('HMP_Dataset/' + category)\n", - "\n", - " # create a temporary pandas data frame for each data file\n", - " for data_file in data_files:\n", - " print(data_file)\n", - " temp_df = spark.read. \\\n", - " option(\"header\", \"false\"). \\\n", - " option(\"delimiter\", \" \"). \\\n", - " csv('HMP_Dataset/' + category + '/' + data_file, schema=schema)\n", - "\n", - " # create a column called \"source\" storing the current CSV file\n", - " temp_df = temp_df.withColumn(\"source\", lit(data_file))\n", - "\n", - " # create a column called \"class\" storing the current data folder\n", - " temp_df = temp_df.withColumn(\"class\", lit(category))\n", - "\n", - " if df is None:\n", - " df = temp_df\n", - " else:\n", - " df = df.union(temp_df)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Lets write the dataf-rame to a file in \"CSV\" format, this will also take quite some time:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "df.write.option(\"header\", \"true\").csv(data_dir + data_csv)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now we should have a CSV file with our contents" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.6" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} \ No newline at end of file diff --git a/test/notebooks/input-postgresql.ipynb b/test/notebooks/input-postgresql.ipynb deleted file mode 100644 index 160c3b62..00000000 --- a/test/notebooks/input-postgresql.ipynb +++ /dev/null @@ -1,161 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "source": [ - "# Input Postgresql" - ], - "metadata": {} - }, - { - "cell_type": "markdown", - "source": [ - "This notebook pulls data from a postgresql database as CSV on a given SQL statement" - ], - "metadata": {} - }, - { - "cell_type": "code", - "execution_count": null, - "source": [ - "!pip install psycopg2-binary==2.9.1 pandas==1.3.1" - ], - "outputs": [], - "metadata": {} - }, - { - "cell_type": "code", - "execution_count": null, - "source": [ - "import os\n", - "import pandas as pd\n", - "import psycopg2\n", - "import re\n", - "import sys" - ], - "outputs": [], - "metadata": {} - }, - { - "cell_type": "code", - "execution_count": null, - "source": [ - "# path and file name for output\n", - "data_csv = os.environ.get('output_data_csv', 'data.csv')\n", - "\n", - "# hostname of database server\n", - "host = os.environ.get('host')\n", - "\n", - "# database name\n", - "database 
= os.environ.get('database')\n", - "\n", - "# db user\n", - "user = os.environ.get('user')\n", - "\n", - "# db password\n", - "password = os.environ.get('password')\n", - "\n", - "# db port\n", - "port = int(os.environ.get('port', 5432))\n", - "\n", - "# sql query statement to be executed\n", - "sql = os.environ.get('sql')\n", - "\n", - "# temporal data storage for local execution\n", - "data_dir = os.environ.get('data_dir', '../../data/')" - ], - "outputs": [], - "metadata": {} - }, - { - "cell_type": "code", - "execution_count": null, - "source": [ - "# override parameters received from a potential call using %run magic\n", - "parameters = list(\n", - " map(\n", - " lambda s: re.sub('$', '\"', s),\n", - " map(\n", - " lambda s: s.replace('=', '=\"'),\n", - " filter(\n", - " lambda s: s.find('=') > -1,\n", - " sys.argv\n", - " )\n", - " )\n", - " )\n", - ")\n", - "\n", - "for parameter in parameters:\n", - " exec(parameter)\n", - "\n", - "# cast parameters to appropriate type\n", - "port = int(port)" - ], - "outputs": [], - "metadata": {} - }, - { - "cell_type": "code", - "execution_count": null, - "source": [ - "conn = psycopg2.connect(\n", - " host=host,\n", - " database=database,\n", - " user=user,\n", - " password=password,\n", - " port=port\n", - ")" - ], - "outputs": [], - "metadata": {} - }, - { - "cell_type": "code", - "execution_count": null, - "source": [ - "d = pd.read_sql_query(sql, conn)" - ], - "outputs": [], - "metadata": {} - }, - { - "cell_type": "code", - "execution_count": null, - "source": [ - "conn.close()" - ], - "outputs": [], - "metadata": {} - }, - { - "cell_type": "code", - "execution_count": null, - "source": [ - "d.to_csv(output_data_csv, index=False)" - ], - "outputs": [], - "metadata": {} - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.6" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} \ No newline at end of file diff --git a/test_component.yaml b/test_component.yaml deleted file mode 100644 index 7c5ffa37..00000000 --- a/test_component.yaml +++ /dev/null @@ -1,24 +0,0 @@ -name: Input Postgresql -description: This notebook pulls data from a postgresql database as CSV on a given SQL statement - -inputs: -- {name: host, type: String, description: 'hostname of database server'} -- {name: database, type: String, description: 'database name'} -- {name: user, type: String, description: 'db user'} -- {name: password, type: String, description: 'db password'} -- {name: port, type: Integer, description: 'db port'} -- {name: sql, type: String, description: 'sql query statement to be executed'} -- {name: data_dir, type: String, description: 'temporal data storage for local execution'} - - -outputs: -- {name: output_data_csv, type: String, description: 'path and file name for output'} - - -implementation: - container: - image: continuumio/anaconda3:2020.07 - command: [ - wget https://raw.githubusercontent.com/IBM/claimed/master/component-library/input/input-postgresql.ipynb &&, - ipython ./input-postgresql.ipynb data_dir=., - ] \ No newline at end of file From f4e918c2d91a8a54bb9149f08798aba145c56999 Mon Sep 17 00:00:00 2001 From: romeo kienzler Date: Wed, 23 Nov 2022 16:58:55 +0100 Subject: [PATCH 032/177] remove vscode clutter --- 
.vscode/settings.json | 3 --- 1 file changed, 3 deletions(-) delete mode 100644 .vscode/settings.json diff --git a/.vscode/settings.json b/.vscode/settings.json deleted file mode 100644 index 625b81c9..00000000 --- a/.vscode/settings.json +++ /dev/null @@ -1,3 +0,0 @@ -{ - "jupyter.jupyterServerType": "local" -} \ No newline at end of file From 129380d90ba95863f63718767e7f35a64d6a0983 Mon Sep 17 00:00:00 2001 From: romeo kienzler Date: Wed, 23 Nov 2022 17:00:11 +0100 Subject: [PATCH 033/177] add v2 of C3 --- src/create_component_library.ipynb | 119 ++++++++++ src/generate_kfp_component.ipynb | 346 +++++++++++++++++++++++++++++ src/notebook.py | 59 +++++ src/parser.py | 209 +++++++++++++++++ 4 files changed, 733 insertions(+) create mode 100644 src/create_component_library.ipynb create mode 100644 src/generate_kfp_component.ipynb create mode 100644 src/notebook.py create mode 100644 src/parser.py diff --git a/src/create_component_library.ipynb b/src/create_component_library.ipynb new file mode 100644 index 00000000..bce67b6a --- /dev/null +++ b/src/create_component_library.ipynb @@ -0,0 +1,119 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "id": "a972c366-03a3-4d79-b917-01592f594eac", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import shutil" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "728ce188-e84a-4e5b-953b-cff00c95d8d8", + "metadata": {}, + "outputs": [], + "source": [ + "os.scandir('../component-library/')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "528e6ac7-efd8-4b5e-9da5-9cc971b9b4b9", + "metadata": { + "scrolled": true, + "tags": [] + }, + "outputs": [], + "source": [ + "%%bash\n", + "export version=0.1i\n", + "for file in `find ../component-library/ -name \"*.ipynb\" |grep -vi test |grep -v checkpoints`\n", + "do \n", + " ipython generate_kfp_component.ipynb $file $version 2>> log.txt >> log.txt\n", + " echo \"Status:\"$file:$? 
>> log.txt\n", + "done" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5303c806-b63a-4392-b97f-5bb962ae8f4e", + "metadata": { + "scrolled": true, + "tags": [] + }, + "outputs": [], + "source": [ + "%%bash\n", + "export version=0.2i\n", + "ipython generate_kfp_component.ipynb ../component-library/input/input-url.ipynb $version\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ced8585b-2208-4b77-b4ea-c629da4c5834", + "metadata": { + "scrolled": true, + "tags": [] + }, + "outputs": [], + "source": [ + "%%bash\n", + "export version=0.2m\n", + "ipython generate_kfp_component.ipynb ../component-library/transform/spark-json-to-parquet.ipynb $version\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fe9ecd44-7ba2-4077-918b-5b369e6da32c", + "metadata": { + "scrolled": true, + "tags": [] + }, + "outputs": [], + "source": [ + "%%bash\n", + "export version=0.2n\n", + "ipython generate_kfp_component.ipynb ../component-library/output/upload-to-cos.ipynb $version\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2b7a758f-a293-4fa2-8e3f-a0e8557369b9", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.6" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/src/generate_kfp_component.ipynb b/src/generate_kfp_component.ipynb new file mode 100644 index 00000000..c4074f3f --- /dev/null +++ b/src/generate_kfp_component.ipynb @@ -0,0 +1,346 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "id": "c08007ab-0366-459b-8d61-695e003c3ce5", + "metadata": {}, + "outputs": [], + "source": [ + "from notebook import Notebook\n", + "import os\n", + "import shutil\n", + "from string import Template\n", + "import sys\n", + "from io import StringIO\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0a0b99ea-4fdd-4f65-bb3d-1057c5eb5c05", + "metadata": {}, + "outputs": [], + "source": [ + "if len(sys.argv)<3:\n", + " print('TODO gracefully shutdown')\n", + "\n", + "notebook_path = sys.argv[1]\n", + "version = sys.argv[2]\n", + "\n", + "#notebook_path = os.environ.get('notebook_path','../component-library/input/input-url.ipynb')\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "141b9a18-c302-4360-bfec-ac2c5b29fad9", + "metadata": {}, + "outputs": [], + "source": [ + "#version=\"0.1n\"\n", + "#notebook_path = '../component-library/input/input-url.ipynb'" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "055a924f-ae8f-40c4-ab5c-91f136cee8ab", + "metadata": {}, + "outputs": [], + "source": [ + "nb = Notebook(notebook_path)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "40670316-3b8f-41a5-99b5-12c9162627f1", + "metadata": {}, + "outputs": [], + "source": [ + "name = nb.get_name()\n", + "description = nb.get_description() + \" CLAIMED v\"+ version\n", + "inputs = nb.get_inputs()\n", + "outputs = nb.get_outputs()\n", + "requirements = nb.get_requirements()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "92c0f904-8afd-4130-8c27-ac955fb61266", + "metadata": {}, + "outputs": [], + "source": [
"print(name)\n", + "print(description)\n", + "print(inputs)\n", + "print(outputs)\n", + "print(requirements)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0604b83b-88c5-4fa7-a2bb-b452d80e2a61", + "metadata": {}, + "outputs": [], + "source": [ + "!echo {notebook_path}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e271b688-b307-4bc9-803c-4c8bed0761ae", + "metadata": {}, + "outputs": [], + "source": [ + "#!jupyter nbconvert --to script `echo {notebook_path}` " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5c0fcb61-43e3-4711-919c-078da0a02e72", + "metadata": {}, + "outputs": [], + "source": [ + "#target_code = notebook_path.replace('.ipynb','.py').split('/')[-1:][0]\n", + "target_code = notebook_path.split('/')[-1:][0]\n", + "\n", + "shutil.copy(notebook_path,target_code)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f67a914e-70b2-4ab6-ab37-121195b59ec4", + "metadata": {}, + "outputs": [], + "source": [ + "import re\n", + "\n", + "file = target_code\n", + "\n", + "# they can be also raw string and regex\n", + "textToSearch = r'!pip' \n", + "textToReplace = '#!pip'\n", + "\n", + "# read and replace\n", + "with open(file, 'r') as fd:\n", + " text, counter = re.subn(textToSearch, textToReplace, fd.read(), re.I)\n", + "\n", + "# check if there is at least a match\n", + "if counter > 0:\n", + " # edit the file\n", + " with open(file, 'w') as fd:\n", + " fd.write(text)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fa5c5b41-1637-456b-9193-fdd7c680c490", + "metadata": {}, + "outputs": [], + "source": [ + "requirements_docker = list(map(lambda s: 'RUN '+s, requirements))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "21beb828-c09b-42c0-9b89-94a143d55a03", + "metadata": {}, + "outputs": [], + "source": [ + "docker_file = \"\"\"FROM registry.access.redhat.com/ubi8/python-39\n", + "USER root\n", + "RUN dnf install -y java-11-openjdk\n", + "USER default\n", + "RUN pip install ipython==8.6.0 nbformat==5.7.0\n", + "{}\n", + "ADD {} /opt/app-root/src/ \n", + "\"\"\".format(\n", + " '\\n'.join(requirements_docker),\n", + " target_code,\n", + " target_code\n", + ")\n", + "with open(\"Dockerfile\", \"w\") as text_file:\n", + " text_file.write(docker_file)\n", + "!cat Dockerfile" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "db44dd53-ee2f-497a-b9a0-e92cfcfd7ef7", + "metadata": { + "scrolled": true, + "tags": [] + }, + "outputs": [], + "source": [ + "!docker build -t `echo claimed-{name}:{version}` .\n", + "!docker tag `echo claimed-{name}:{version}` `echo romeokienzler/claimed-{name}:{version}`\n", + "!docker push `echo romeokienzler/claimed-{name}:{version}`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8747a2f1-ed15-41ea-b864-d4482e751eb5", + "metadata": {}, + "outputs": [], + "source": [ + "def get_inputs():\n", + " with StringIO() as inputs_str:\n", + " for input_key, input_value in inputs.items():\n", + " t = Template(\"- {name: $name, type: $type, description: '$description'}\")\n", + " print(t.substitute(name=input_key, type=input_value[1], description=input_value[0]), file=inputs_str)\n", + " return inputs_str.getvalue()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d0e5e56a-0e65-42b3-98d2-cd410fcd52ab", + "metadata": {}, + "outputs": [], + "source": [ + "def get_outputs():\n", + " with StringIO() as outputs_str:\n", + " assert len(outputs) == 1, 'exactly one output 
currently supported: '+ str((len(outputs.items())))\n", + " for output_key, output_value in outputs.items():\n", + " t = Template(\"- {name: $name, type: $type, description: '$description'}\")\n", + " print(t.substitute(name=output_key, type=output_value[1], description=output_value[0]), file=outputs_str)\n", + " return outputs_str.getvalue()\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "27608ff6-60b7-4921-b360-4cafb6d8b11d", + "metadata": {}, + "outputs": [], + "source": [ + "def get_output_name():\n", + " for output_key, output_value in outputs.items():\n", + " return output_key" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9b44cadc-b5e7-4cef-8ebd-f3bcf5a56328", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "def get_input_for_implementation():\n", + " with StringIO() as inputs_str:\n", + " for input_key, input_value in inputs.items():\n", + " t = Template(\" - {inputValue: $name}\")\n", + " print(t.substitute(name=input_key), file=inputs_str)\n", + " return inputs_str.getvalue() " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dd6c276c-1e4d-4212-a7b8-469a0310adea", + "metadata": {}, + "outputs": [], + "source": [ + "def get_parameter_list():\n", + " return_value = str()\n", + " index = 0\n", + " for output_key, output_value in outputs.items():\n", + " return_value = return_value + output_key + '=\"$' + str(index) + '\" '\n", + " index = index + 1\n", + " for input_key, input_value in inputs.items():\n", + " return_value = return_value + input_key + '=\"$' + str(index) + '\" '\n", + " index = index + 1\n", + " return return_value " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9168a5bd-76a5-4ae4-baf9-018397fa1d80", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "t = Template('''name: $name\n", + "description: $description\n", + "\n", + "inputs:\n", + "$inputs\n", + "\n", + "outputs:\n", + "$outputs\n", + "\n", + "implementation:\n", + " container:\n", + " image: $container_uri:$version\n", + " command:\n", + " - sh\n", + " - -ec\n", + " - |\n", + " ipython $call\n", + " - {outputPath: $outputPath}\n", + "$input_for_implementation''')\n", + "yaml = t.substitute(\n", + " name=name,\n", + " description=description,\n", + " inputs=get_inputs(),\n", + " outputs=get_outputs(),\n", + " container_uri=f\"romeokienzler/claimed-{name}\",\n", + " version=version,\n", + " outputPath=get_output_name(),\n", + " input_for_implementation=get_input_for_implementation(),\n", + " call=f'./{target_code} {get_parameter_list()}' \n", + " )\n", + "print(yaml)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "320e6d14-9584-4154-a225-437f6026fe4b", + "metadata": {}, + "outputs": [], + "source": [ + "target_yaml_path = notebook_path.replace('.ipynb','.yaml')\n", + "\n", + "with open(target_yaml_path, \"w\") as text_file:\n", + " text_file.write(yaml)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.6" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/src/notebook.py b/src/notebook.py new file mode 100644 index 00000000..5204c347 --- /dev/null +++ b/src/notebook.py @@ -0,0 +1,59 @@ +import json +import re +from parser 
import ContentParser + +class Notebook(): + def __init__(self, path): + self.path = path + with open(path) as json_file: + self.notebook = json.load(json_file) + self.name = self.notebook['cells'][0]['source'][0].replace('#', '').strip() + self.description = self.notebook['cells'][1]['source'][0] + self.envs = self._get_env_vars() + + def _get_env_vars(self): + cp = ContentParser() + env_names = cp.parse(self.path)['env_vars'] + return_value = dict() + for env_name in env_names: + comment_line = str() + for line in self.notebook['cells'][4]['source']: + if re.search("[\"']" + env_name + "[\"']", line): + assert '#' in comment_line, "comment line didn't contain #" + if "int(" in line: + type = 'Integer' + elif "float(" in line: + type = 'Float' + else: + type = 'String' + if ',' in line: + default=line.split(',')[1].split(')')[0] + else: + default = None + return_value[env_name]=(comment_line.replace('#', '').strip(),type,default) + comment_line = line + return return_value + + def get_requirements(self): + requirements = [] + for cell in self.notebook['cells']: + for cell_content in cell['source']: + pattern = r"(![ ]*pip[ ]*install[ ]*)([A-Za-z=0-9.: ]*)" + result = re.findall(pattern,cell_content) + if len(result) == 1: + requirements.append((result[0][0]+ ' ' +result[0][1])[1:]) + return requirements + + def get_name(self): + return self.name + + def get_description(self): + return self.description + + def get_inputs(self): + return { key:value for (key,value) in self.envs.items() if not key.startswith('output_') } + + def get_outputs(self): + return { key:value for (key,value) in self.envs.items() if key.startswith('output_') } + + diff --git a/src/parser.py b/src/parser.py new file mode 100644 index 00000000..8a130d8f --- /dev/null +++ b/src/parser.py @@ -0,0 +1,209 @@ +# +# Copyright 2018-2021 Elyra Authors +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import os +import nbformat +import re + +from traitlets.config import LoggingConfigurable + +from typing import TypeVar, List, Dict + +# Setup forward reference for type hint on return from class factory method. See +# https://stackoverflow.com/questions/39205527/can-you-annotate-return-type-when-value-is-instance-of-cls/39205612#39205612 +F = TypeVar('F', bound='FileReader') + + +class FileReader(LoggingConfigurable): + """ + Base class for parsing a file for resources according to operation type. Subclasses set + their own parser member variable according to their implementation language. + """ + + def __init__(self, filepath: str): + self._filepath = filepath + + @property + def filepath(self): + return self._filepath + + @property + def language(self) -> str: + file_extension = os.path.splitext(self._filepath)[-1] + if file_extension == '.py': + return 'python' + elif file_extension == '.r': + return 'r' + else: + return None + + def read_next_code_chunk(self) -> List[str]: + """ + Implements a generator for lines of code in the specified filepath. Subclasses + may override if explicit line-by-line parsing is not feasible, e.g. 
with Notebooks. + """ + with open(self._filepath) as f: + for line in f: + yield [line.strip()] + + +class NotebookReader(FileReader): + def __init__(self, filepath: str): + super().__init__(filepath) + + with open(self._filepath) as f: + self._notebook = nbformat.read(f, as_version=4) + self._language = None + + try: + self._language = self._notebook['metadata']['kernelspec']['language'].lower() + + except KeyError: + self.log.warning(f'No language metadata found in {self._filepath}') + pass + + @property + def language(self) -> str: + return self._language + + def read_next_code_chunk(self) -> List[str]: + for cell in self._notebook.cells: + if cell.source and cell.cell_type == "code": + yield cell.source.split('\n') + + +class ScriptParser(): + """ + Base class for parsing individual lines of code. Subclasses implement a search_expressions() + function that returns language-specific regexes to match against code lines. + """ + + _comment_char = "#" + + def _get_line_without_comments(self, line): + if self._comment_char in line: + index = line.find(self._comment_char) + line = line[:index] + return line.strip() + + def parse_environment_variables(self, line): + # Parse a line fed from file and match each regex in regex dictionary + line = self._get_line_without_comments(line) + if not line: + return [] + + matches = [] + for key, value in self.search_expressions().items(): + for pattern in value: + regex = re.compile(pattern) + for match in regex.finditer(line): + matches.append((key, match)) + return matches + + +class PythonScriptParser(ScriptParser): + def search_expressions(self) -> Dict[str, List]: + # TODO: add more key:list-of-regex pairs to parse for additional resources + regex_dict = dict() + + # First regex matches envvar assignments of form os.environ["name"] = value w or w/o value provided + # Second regex matches envvar assignments that use os.getenv("name", "value") with or w/o default provided + # Third regex matches envvar assignments that use os.environ.get("name", "value") with or w/o default provided + # Both name and value are captured if possible + envs = [r"os\.environ\[[\"']([a-zA-Z_]+[A-Za-z0-9_]*)[\"']\](?:\s*=(?:\s*[\"'](.[^\"']*)?[\"'])?)*", + r"os\.getenv\([\"']([a-zA-Z_]+[A-Za-z0-9_]*)[\"'](?:\s*\,\s*[\"'](.[^\"']*)?[\"'])?", + r"os\.environ\.get\([\"']([a-zA-Z_]+[A-Za-z0-9_]*)[\"'](?:\s*\,(?:\s*[\"'](.[^\"']*)?[\"'])?)*"] + regex_dict["env_vars"] = envs + return regex_dict + + +class RScriptParser(ScriptParser): + def search_expressions(self) -> Dict[str, List]: + # TODO: add more key:list-of-regex pairs to parse for additional resources + regex_dict = dict() + + # Tests for matches of the form Sys.setenv("key" = "value") + envs = [r"Sys\.setenv\([\"']*([a-zA-Z_]+[A-Za-z0-9_]*)[\"']*\s*=\s*[\"']*(.[^\"']*)?[\"']*\)", + r"Sys\.getenv\([\"']*([a-zA-Z_]+[A-Za-z0-9_]*)[\"']*\)(.)*"] + regex_dict["env_vars"] = envs + return regex_dict + + +class ContentParser(LoggingConfigurable): + parsers = { + 'python': PythonScriptParser(), + 'r': RScriptParser() + } + + def parse(self, filepath: str) -> dict: + """Returns a model dictionary of all the regex matches for each key in the regex dictionary""" + + properties = {"env_vars": {}, "inputs": [], "outputs": []} + reader = self._get_reader(filepath) + parser = self._get_parser(reader.language) + + if not parser: + return properties + + for chunk in reader.read_next_code_chunk(): + if chunk: + for line in chunk: + matches = parser.parse_environment_variables(line) + for key, match in matches: + if key == "env_vars":
properties[key][match.group(1)] = match.group(2) + else: + properties[key].append(match.group(1)) + + return properties + + def _validate_file(self, filepath: str): + """ + Validate file exists and is file (e.g. not a directory) + """ + if not os.path.exists(filepath): + raise FileNotFoundError(f'No such file or directory: {filepath}') + if not os.path.isfile(filepath): + raise IsADirectoryError(f'Is a directory: {filepath}') + + def _get_reader(self, filepath: str): + """ + Find the proper reader based on the file extension + """ + file_extension = os.path.splitext(filepath)[-1] + + self._validate_file(filepath) + + if file_extension == '.ipynb': + return NotebookReader(filepath) + elif file_extension in ['.py', '.r']: + return FileReader(filepath) + else: + raise ValueError(f'File type {file_extension} is not supported.') + + def _get_parser(self, language: str): + """ + Find the proper parser based on content language + """ + parser = None + if language: + parser = self.parsers.get(language) + + if not parser: + self.log.warning(f'Content parser for {language} is not available.') + pass + + return parser From 0529267d059ec1eb46752ad35181052b92cedd26 Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Mon, 27 Feb 2023 13:03:21 +0100 Subject: [PATCH 034/177] add README.md --- README.md | 33 +++++++++++++++++++++++++++++++++ 1 file changed, 33 insertions(+) create mode 100644 README.md diff --git a/README.md b/README.md new file mode 100644 index 00000000..b1f35372 --- /dev/null +++ b/README.md @@ -0,0 +1,33 @@ +[![OpenSSF Best Practices](https://bestpractices.coreinfrastructure.org/projects/6718/badge)](https://bestpractices.coreinfrastructure.org/projects/6718) +[![GitHub](https://img.shields.io/badge/issue_tracking-github-blue.svg)](https://github.com/claimed-framework/component-library/issues) + + + +# C3 - the CLAIMED Component Compiler + +**TL;DR** +- takes arbitrary assets (Jupyter notebooks, python/R/shell/SQL scripts) as input +- automatically creates container images and pushed to container registries +- automatically installs all required dependencies into the container image +- creates KubeFlow Pipeline components (target workflow execution engines are pluggable) +- can be triggered from CICD pipelines + + +To learn more about how this library works in practice, please have a look at the following [video](https://www.youtube.com/watch?v=FuV2oG55C5s) + +## Related work +[Ploomber](https://github.com/ploomber/ploomber) + +[Orchest](https://www.orchest.io/) + +## Getting Help + +We welcome your questions, ideas, and feedback. Please create an [issue](https://github.com/claimed-framework/component-library/issues) or a [discussion thread](https://github.com/claimed-framework/component-library/discussions). +Please see [VULNERABILITIES.md](VULNERABILITIES.md) for reporting vulnerabilities. + +## Contributing to CLAIMED +Interested in helping make CLAIMED better? We encourage you to take a look at our
From b34ee5704c6c66f9a90502c8c741dcfdad8e69eb Mon Sep 17 00:00:00 2001
From: Romeo Kienzler
Date: Mon, 27 Feb 2023 13:03:46 +0100
Subject: [PATCH 035/177] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index b1f35372..b5a24329 100644
--- a/README.md
+++ b/README.md
@@ -7,7 +7,7 @@
 **TL;DR**
 - takes arbitrary assets (Jupyter notebooks, python/R/shell/SQL scripts) as input
-- automatically creates container images and pushed to container registries
+- automatically creates container images and pushes to container registries
 - automatically installs all required dependencies into the container image
 - creates KubeFlow Pipeline components (target workflow execution engines are pluggable)
 - can be triggered from CICD pipelines

From 8e09650bda9fefdc762dc296103f7ce728c510c6 Mon Sep 17 00:00:00 2001
From: Romeo Kienzler
Date: Thu, 1 Jun 2023 10:58:19 +0200
Subject: [PATCH 036/177] improve .gitignore

---
 .gitignore | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/.gitignore b/.gitignore
index d75edeae..16e6e1ce 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,2 +1,3 @@
-venv
-__pycache__
\ No newline at end of file
+venv/
+__pycache__
+.ipynb_checkpoints/

From 74b9c28a0922dccb97266dfb303a9d02861f0b46 Mon Sep 17 00:00:00 2001
From: Colin Lin
Date: Tue, 27 Jun 2023 14:16:23 -0400
Subject: [PATCH 037/177] Support < 1 outputs

Signed-off-by: Colin Lin
---
 src/generate_kfp_component.ipynb | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/generate_kfp_component.ipynb b/src/generate_kfp_component.ipynb
index c4074f3f..59e2fa2a 100644
--- a/src/generate_kfp_component.ipynb
+++ b/src/generate_kfp_component.ipynb
@@ -213,7 +213,7 @@
    "source": [
     "def get_outputs():\n",
     "    with StringIO() as outputs_str:\n",
-    "        assert len(outputs) == 1, 'exactly one output currently supported: '+ str((len(outputs.items())))\n",
+    "        # assert len(outputs) == 1, 'exactly one output currently supported: '+ str((len(outputs.items())))\n",
     "        for output_key, output_value in outputs.items():\n",
     "            t = Template(\"- {name: $name, type: $type, description: '$description'}\")\n",
     "            print(t.substitute(name=output_key, type=output_value[1], description=output_value[0]), file=outputs_str)\n",
     "        return outputs_str.getvalue()\n"
@@ -338,7 +338,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.6"
+   "version": "3.11.3"
   }
  },
 "nbformat": 4,

From a0c3055d13f87fcb65e6312c4024408cabe6d222 Mon Sep 17 00:00:00 2001
From: Romeo Kienzler
Date: Fri, 7 Jul 2023 12:13:49 +0200
Subject: [PATCH 038/177] add additional data types

---
 .gitignore                         |   1 +
 src/create_component_library.ipynb | 327 ++++++++++++++++++++++++++++-
 src/generate_kfp_component.ipynb   |  56 ++---
 src/notebook.py                    |   9 +-
 4 files changed, 354 insertions(+), 39 deletions(-)

diff --git a/.gitignore b/.gitignore
index 16e6e1ce..5fbd6c8e 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,3 +1,4 @@
 venv/
+.venv/
 __pycache__
 .ipynb_checkpoints/

diff --git a/src/create_component_library.ipynb b/src/create_component_library.ipynb
index bce67b6a..ff490804 100644
--- a/src/create_component_library.ipynb
+++ b/src/create_component_library.ipynb
@@ -1,5 +1,15 @@
 {
  "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9c7ce914",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install ipython nbformat"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -51,8 +61,8 @@
    "outputs": 
[], "source": [ "%%bash\n", - "export version=0.2i\n", - "ipython generate_kfp_component.ipynb ../component-library/input/input-url.ipynb $version\n" + "export version=0.2n\n", + "ipython generate_kfp_component.ipynb ../../component-library/component-library/input/input-url.ipynb $version\n" ] }, { @@ -88,9 +98,318 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 33, "id": "2b7a758f-a293-4fa2-8e3f-a0e8557369b9", "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "util-cos\n", + "This component provides COS utility functions (e.g. creating a bucket, listing contents of a bucket)\n", + " CLAIMED v0.29\n", + "{'access_key_id': {'description': 'access key id', 'type': 'String', 'default': None}, 'secret_access_key': {'description': 'secret access key', 'type': 'String', 'default': None}, 'endpoint': {'description': 'cos/s3 endpoint', 'type': 'String', 'default': None}, 'bucket_name': {'description': 'cos bucket name', 'type': 'String', 'default': None}, 'path': {'description': 'path', 'type': 'String', 'default': \"''\"}, 'source': {'description': 'source in case of uploads', 'type': 'String', 'default': \" ''\"}, 'target': {'description': 'target in case of downloads', 'type': 'String', 'default': \" ''\"}, 'recursive': {'description': 'recursive', 'type': 'Boolean', 'default': \"'False'\"}, 'operation': {'description': 'operation (mkdir|ls|find|get|put|rm|sync_to_cos|sync_to_local)', 'type': 'String', 'default': None}, 'log_level': {'description': 'log level', 'type': 'String', 'default': \" 'INFO'\"}}\n", + "{}\n", + "['pip install aiobotocore botocore s3fs']\n", + "../../component-library/component-library/util/util-cos.ipynb\n", + "\n", + "FROM registry.access.redhat.com/ubi8/python-39 \n", + "USER root\n", + "RUN dnf install -y java-11-openjdk\n", + "USER default\n", + "RUN pip install ipython==8.6.0 nbformat==5.7.0\n", + "RUN pip install aiobotocore botocore s3fs\n", + "ADD util-cos.ipynb /opt/app-root/src/\n", + "CMD [\"ipython\", \"/opt/app-root/src/util-cos.ipynb\"]\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "#1 [internal] load .dockerignore\n", + "#1 transferring context: 2B done\n", + "#1 DONE 0.0s\n", + "\n", + "#2 [internal] load build definition from Dockerfile\n", + "#2 transferring dockerfile: 385B done\n", + "#2 DONE 0.0s\n", + "\n", + "#3 [internal] load metadata for registry.access.redhat.com/ubi8/python-39:latest\n", + "#3 DONE 0.0s\n", + "\n", + "#4 [1/5] FROM registry.access.redhat.com/ubi8/python-39\n", + "#4 DONE 0.0s\n", + "\n", + "#5 [internal] load build context\n", + "#5 transferring context: 8.64kB done\n", + "#5 DONE 0.0s\n", + "\n", + "#6 [2/5] RUN dnf install -y java-11-openjdk\n", + "#6 CACHED\n", + "\n", + "#7 [4/5] RUN pip install aiobotocore botocore s3fs\n", + "#7 CACHED\n", + "\n", + "#8 [3/5] RUN pip install ipython==8.6.0 nbformat==5.7.0\n", + "#8 CACHED\n", + "\n", + "#9 [5/5] ADD util-cos.ipynb /opt/app-root/src/\n", + "#9 CACHED\n", + "\n", + "#10 exporting to image\n", + "#10 exporting layers done\n", + "#10 writing image sha256:07187cacad33f42c17f1686b51452342d6bb0bd91d5452634f92d96a4bf5cf1f done\n", + "#10 naming to docker.io/library/claimed-util-cos:0.29 done\n", + "#10 DONE 0.0s\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The push refers to repository [docker.io/romeokienzler/claimed-util-cos]\n", + "ade67154bab7: Preparing\n", + "d58d29942c09: Preparing\n", + "2902622ed6d0: Preparing\n", + "20b51819921d: 
Preparing\n", + "6d3c2c9872b8: Preparing\n", + "f945189ba7d6: Preparing\n", + "c7649cc32711: Preparing\n", + "486dcc5a5ac3: Preparing\n", + "c7649cc32711: Waiting\n", + "f945189ba7d6: Waiting\n", + "486dcc5a5ac3: Waiting\n", + "6d3c2c9872b8: Layer already exists\n", + "20b51819921d: Layer already exists\n", + "2902622ed6d0: Layer already exists\n", + "ade67154bab7: Layer already exists\n", + "486dcc5a5ac3: Layer already exists\n", + "c7649cc32711: Layer already exists\n", + "d58d29942c09: Layer already exists\n", + "f945189ba7d6: Layer already exists\n", + "0.29: digest: sha256:e0e1f4215bb97e8a9666ce98f413c8392e2431b068ff54009abcdd3167d71d83 size: 2011\n", + "The push refers to repository [docker.io/romeokienzler/claimed-util-cos]\n", + "ade67154bab7: Preparing\n", + "d58d29942c09: Preparing\n", + "2902622ed6d0: Preparing\n", + "20b51819921d: Preparing\n", + "6d3c2c9872b8: Preparing\n", + "f945189ba7d6: Preparing\n", + "c7649cc32711: Preparing\n", + "486dcc5a5ac3: Preparing\n", + "f945189ba7d6: Waiting\n", + "c7649cc32711: Waiting\n", + "486dcc5a5ac3: Waiting\n", + "6d3c2c9872b8: Layer already exists\n", + "ade67154bab7: Layer already exists\n", + "2902622ed6d0: Layer already exists\n", + "20b51819921d: Layer already exists\n", + "d58d29942c09: Layer already exists\n", + "f945189ba7d6: Layer already exists\n", + "c7649cc32711: Layer already exists\n", + "486dcc5a5ac3: Layer already exists\n", + "latest: digest: sha256:e0e1f4215bb97e8a9666ce98f413c8392e2431b068ff54009abcdd3167d71d83 size: 2011\n", + "name: util-cos\n", + "description: This component provides COS utility functions (e.g. creating a bucket, listing contents of a bucket)\n", + " CLAIMED v0.29\n", + "\n", + "inputs:\n", + "- {name: access_key_id, type: String, description: access key id}\n", + "- {name: secret_access_key, type: String, description: secret access key}\n", + "- {name: endpoint, type: String, description: cos/s3 endpoint}\n", + "- {name: bucket_name, type: String, description: cos bucket name}\n", + "- {name: path, type: String, description: path, default: ''}\n", + "- {name: source, type: String, description: source in case of uploads, default: ''}\n", + "- {name: target, type: String, description: target in case of downloads, default: ''}\n", + "- {name: recursive, type: Boolean, description: recursive, default: 'False'}\n", + "- {name: operation, type: String, description: operation (mkdir|ls|find|get|put|rm|sync_to_cos|sync_to_local)}\n", + "- {name: log_level, type: String, description: log level, default: 'INFO'}\n", + "\n", + "\n", + "implementation:\n", + " container:\n", + " image: romeokienzler/claimed-util-cos:0.29\n", + " command:\n", + " - sh\n", + " - -ec\n", + " - |\n", + " ipython ./util-cos.ipynb access_key_id=\"$0\" secret_access_key=\"$1\" endpoint=\"$2\" bucket_name=\"$3\" path=\"$4\" source=\"$5\" target=\"$6\" recursive=\"$7\" operation=\"$8\" log_level=\"$9\" \n", + " - {inputValue: access_key_id}\n", + " - {inputValue: secret_access_key}\n", + " - {inputValue: endpoint}\n", + " - {inputValue: bucket_name}\n", + " - {inputValue: path}\n", + " - {inputValue: source}\n", + " - {inputValue: target}\n", + " - {inputValue: recursive}\n", + " - {inputValue: operation}\n", + " - {inputValue: log_level}\n", + "\n" + ] + } + ], + "source": [ + "%%bash\n", + "export version=0.29\n", + "ipython generate_kfp_component.ipynb ../../component-library/component-library/util/util-cos.ipynb $version" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "id": "dc39195e", + "metadata": {}, + "outputs": 
[ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "geo-hls-remove-clouds\n", + "Removes clouds from HLS data CLAIMED v0.32\n", + "{'input_path': {'description': 'path for input', 'type': 'String', 'default': \"'/home/romeokienzler/Downloads/HLS2022/HLS/**/*'\"}, 'target_path': {'description': 'path for output', 'type': 'String', 'default': \"'/home/romeokienzler/Downloads/HLSS30.CF2.v3/'\"}, 'satellite': {'description': 'satellite', 'type': 'String', 'default': \"'HLS.L30'\"}, 'file_filter_pattern': {'description': 'file filter pattern', 'type': 'String', 'default': \"'HLS.S30*0.B*tif'\"}, 'log_level': {'description': 'log level', 'type': 'String', 'default': \" 'INFO'\"}}\n", + "{}\n", + "['pip install xarray matplotlib geopandas rioxarray numpy shapely rasterio pyproj ipython dask distributed jinja2 bokeh ipython nbformat aiobotocore botocore s3fs']\n", + "../../workflows-and-operators/operators/hls_remove_clouds.ipynb\n", + "\n", + "FROM registry.access.redhat.com/ubi8/python-39 \n", + "USER root\n", + "RUN dnf install -y java-11-openjdk\n", + "USER default\n", + "RUN pip install ipython==8.6.0 nbformat==5.7.0\n", + "RUN pip install xarray matplotlib geopandas rioxarray numpy shapely rasterio pyproj ipython dask distributed jinja2 bokeh ipython nbformat aiobotocore botocore s3fs\n", + "ADD hls_remove_clouds.ipynb /opt/app-root/src/\n", + "CMD [\"ipython\", \"/opt/app-root/src/hls_remove_clouds.ipynb\"]\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "#1 [internal] load .dockerignore\n", + "#1 transferring context: 2B done\n", + "#1 DONE 0.0s\n", + "\n", + "#2 [internal] load build definition from Dockerfile\n", + "#2 transferring dockerfile: 526B done\n", + "#2 DONE 0.0s\n", + "\n", + "#3 [internal] load metadata for registry.access.redhat.com/ubi8/python-39:latest\n", + "#3 DONE 0.0s\n", + "\n", + "#4 [1/5] FROM registry.access.redhat.com/ubi8/python-39\n", + "#4 DONE 0.0s\n", + "\n", + "#5 [internal] load build context\n", + "#5 transferring context: 13.90kB done\n", + "#5 DONE 0.0s\n", + "\n", + "#6 [2/5] RUN dnf install -y java-11-openjdk\n", + "#6 CACHED\n", + "\n", + "#7 [3/5] RUN pip install ipython==8.6.0 nbformat==5.7.0\n", + "#7 CACHED\n", + "\n", + "#8 [4/5] RUN pip install xarray matplotlib geopandas rioxarray numpy shapely rasterio pyproj ipython dask distributed jinja2 bokeh ipython nbformat aiobotocore botocore s3fs\n", + "#8 CACHED\n", + "\n", + "#9 [5/5] ADD hls_remove_clouds.ipynb /opt/app-root/src/\n", + "#9 DONE 0.1s\n", + "\n", + "#10 exporting to image\n", + "#10 exporting layers\n", + "#10 exporting layers 1.9s done\n", + "#10 writing image sha256:751a9436d257bf6d92a3a6ae71cb5217f001f87876a155f0df0e053e7e50a708 done\n", + "#10 naming to docker.io/library/claimed-geo-hls-remove-clouds:0.32 done\n", + "#10 DONE 1.9s\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The push refers to repository [docker.io/romeokienzler/claimed-geo-hls-remove-clouds]\n", + "f1f396df7f96: Preparing\n", + "1bb304a805f1: Preparing\n", + "2902622ed6d0: Preparing\n", + "20b51819921d: Preparing\n", + "6d3c2c9872b8: Preparing\n", + "f945189ba7d6: Preparing\n", + "c7649cc32711: Preparing\n", + "486dcc5a5ac3: Preparing\n", + "c7649cc32711: Waiting\n", + "486dcc5a5ac3: Waiting\n", + "f945189ba7d6: Waiting\n", + "20b51819921d: Layer already exists\n", + "1bb304a805f1: Layer already exists\n", + "6d3c2c9872b8: Layer already exists\n", + "2902622ed6d0: Layer already exists\n", + "f945189ba7d6: Layer already exists\n", + 
"486dcc5a5ac3: Layer already exists\n", + "c7649cc32711: Layer already exists\n", + "f1f396df7f96: Pushed\n", + "0.32: digest: sha256:e7adcd7923d40ef3aa2125090469e94b2955fc42bef364784b97cb46508c0966 size: 2012\n", + "The push refers to repository [docker.io/romeokienzler/claimed-geo-hls-remove-clouds]\n", + "f1f396df7f96: Preparing\n", + "1bb304a805f1: Preparing\n", + "2902622ed6d0: Preparing\n", + "20b51819921d: Preparing\n", + "6d3c2c9872b8: Preparing\n", + "f945189ba7d6: Preparing\n", + "c7649cc32711: Preparing\n", + "486dcc5a5ac3: Preparing\n", + "c7649cc32711: Waiting\n", + "486dcc5a5ac3: Waiting\n", + "f945189ba7d6: Waiting\n", + "2902622ed6d0: Layer already exists\n", + "1bb304a805f1: Layer already exists\n", + "6d3c2c9872b8: Layer already exists\n", + "f1f396df7f96: Layer already exists\n", + "20b51819921d: Layer already exists\n", + "f945189ba7d6: Layer already exists\n", + "486dcc5a5ac3: Layer already exists\n", + "c7649cc32711: Layer already exists\n", + "latest: digest: sha256:e7adcd7923d40ef3aa2125090469e94b2955fc42bef364784b97cb46508c0966 size: 2012\n", + "name: geo-hls-remove-clouds\n", + "description: Removes clouds from HLS data CLAIMED v0.32\n", + "\n", + "inputs:\n", + "- {name: input_path, type: String, description: path for input, default: '/home/romeokienzler/Downloads/HLS2022/HLS/**/*'}\n", + "- {name: target_path, type: String, description: path for output, default: '/home/romeokienzler/Downloads/HLSS30.CF2.v3/'}\n", + "- {name: satellite, type: String, description: satellite, default: 'HLS.L30'}\n", + "- {name: file_filter_pattern, type: String, description: file filter pattern, default: 'HLS.S30*0.B*tif'}\n", + "- {name: log_level, type: String, description: log level, default: 'INFO'}\n", + "\n", + "\n", + "implementation:\n", + " container:\n", + " image: romeokienzler/claimed-geo-hls-remove-clouds:0.32\n", + " command:\n", + " - sh\n", + " - -ec\n", + " - |\n", + " ipython ./hls_remove_clouds.ipynb input_path=\"$0\" target_path=\"$1\" satellite=\"$2\" file_filter_pattern=\"$3\" log_level=\"$4\" \n", + " - {inputValue: input_path}\n", + " - {inputValue: target_path}\n", + " - {inputValue: satellite}\n", + " - {inputValue: file_filter_pattern}\n", + " - {inputValue: log_level}\n", + "\n" + ] + } + ], + "source": [ + "%%bash\n", + "export version=0.33\n", + "ipython generate_kfp_component.ipynb ../../workflows-and-operators/operators/hls_remove_clouds.ipynb $version\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "70241e41", + "metadata": {}, "outputs": [], "source": [] } @@ -111,7 +430,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.6" + "version": "3.11.3" } }, "nbformat": 4, diff --git a/src/generate_kfp_component.ipynb b/src/generate_kfp_component.ipynb index 59e2fa2a..9ef59a39 100644 --- a/src/generate_kfp_component.ipynb +++ b/src/generate_kfp_component.ipynb @@ -2,7 +2,7 @@ "cells": [ { "cell_type": "code", - "execution_count": null, + "execution_count": 1, "id": "c08007ab-0366-459b-8d61-695e003c3ce5", "metadata": {}, "outputs": [], @@ -12,7 +12,8 @@ "import shutil\n", "from string import Template\n", "import sys\n", - "from io import StringIO\n" + "from io import StringIO\n", + "from enum import Enum" ] }, { @@ -157,16 +158,19 @@ "metadata": {}, "outputs": [], "source": [ - "docker_file = \"\"\"FROM registry.access.redhat.com/ubi8/python-39\n", + "docker_file = \"\"\"\n", + "FROM registry.access.redhat.com/ubi8/python-39 \n", "USER root\n", "RUN dnf install -y 
java-11-openjdk\n",
     "USER default\n",
     "RUN pip install ipython==8.6.0 nbformat==5.7.0\n",
     "{}\n",
-    "ADD {} /opt/app-root/src/ \n",
+    "ADD {} /opt/app-root/src/\n",
+    "CMD [\"ipython\", \"/opt/app-root/src/{}\"]\n",
     "\"\"\".format(\n",
     "    '\\n'.join(requirements_docker),\n",
     "    target_code,\n",
+    "    target_code,\n",
     "    target_code\n",
     ")\n",
     "with open(\"Dockerfile\", \"w\") as text_file:\n",
@@ -186,7 +190,9 @@
    "source": [
     "!docker build -t `echo claimed-{name}:{version}` .\n",
     "!docker tag `echo claimed-{name}:{version}` `echo romeokienzler/claimed-{name}:{version}`\n",
-    "!docker push `echo romeokienzler/claimed-{name}:{version}`"
+    "!docker tag `echo claimed-{name}:{version}` `echo romeokienzler/claimed-{name}:latest`\n",
+    "!docker push `echo romeokienzler/claimed-{name}:{version}`\n",
+    "!docker push `echo romeokienzler/claimed-{name}:latest`"
    ]
   },
@@ -196,28 +202,17 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "def get_inputs():\n",
-    "    with StringIO() as inputs_str:\n",
-    "        for input_key, input_value in inputs.items():\n",
-    "            t = Template(\"- {name: $name, type: $type, description: '$description'}\")\n",
-    "            print(t.substitute(name=input_key, type=input_value[1], description=input_value[0]), file=inputs_str)\n",
-    "        return inputs_str.getvalue()"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "d0e5e56a-0e65-42b3-98d2-cd410fcd52ab",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "def get_outputs():\n",
-    "    with StringIO() as outputs_str:\n",
-    "        # assert len(outputs) == 1, 'exactly one output currently supported: '+ str((len(outputs.items())))\n",
-    "        for output_key, output_value in outputs.items():\n",
-    "            t = Template(\"- {name: $name, type: $type, description: '$description'}\")\n",
-    "            print(t.substitute(name=output_key, type=output_value[1], description=output_value[0]), file=outputs_str)\n",
-    "        return outputs_str.getvalue()\n"
+    "parameter_type = Enum('parameter_type', ['INPUT', 'OUTPUT'])\n",
+    "\n",
+    "def get_component_interface(parameters, type: parameter_type):\n",
+    "    template_string = str()\n",
+    "    for parameter_name, parameter_options in parameters.items():\n",
+    "        default = ''\n",
+    "        if parameter_options['default'] is not None and type == parameter_type.INPUT:\n",
+    "            default = f\", default: {parameter_options['default']}\"\n",
+    "        template_string += f\"- {{name: {parameter_name}, type: {parameter_options['type']}, description: {parameter_options['description']}{default}}}\"\n",
+    "        template_string += '\\n'\n",
+    "    return template_string"
    ]
   },
     "inputs:\n",
     "$inputs\n",
     "\n",
-    "outputs:\n",
-    "$outputs\n",
-    "\n",
     "implementation:\n",
     "  container:\n",
     "    image: $container_uri:$version\n",
     "    command:\n",
     "    - sh\n",
     "    - -ec\n",
     "    - |\n",
     "      ipython $call\n",
-    "    - {outputPath: $outputPath}\n",
     "$input_for_implementation''')\n",
     "yaml = t.substitute(\n",
     "    name=name,\n",
     "    description=description,\n",
-    "    inputs=get_inputs(),\n",
-    "    outputs=get_outputs(),\n",
+    "    inputs=get_component_interface(inputs, parameter_type.INPUT),\n",
     "    container_uri=f\"romeokienzler/claimed-{name}\",\n",
     "    version=version,\n",
     "    outputPath=get_output_name(),\n",

diff --git a/src/notebook.py b/src/notebook.py
index 5204c347..611ea3ec 100644
--- a/src/notebook.py
+++ b/src/notebook.py
@@ -23,14 +23,19 @@ def _get_env_vars(self):
         if "int(" in line:
             type = 'Integer'
         elif "float(" in line:
-            type = 'Float'
+            type = 'Float'
+        elif "bool(" in line:
+            type = 'Boolean'
         else:
             type = 'String'
         if ',' in line:
             default=line.split(',')[1].split(')')[0]
         else:
             default = None
-        return_value[env_name]=(comment_line.replace('#', '').strip(),type,default)
+        return_value[env_name]={
+            'description': comment_line.replace('#', '').strip(),
+            'type': type,
+            'default': default}
         comment_line = line
     return return_value
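
The payoff of the notebook.py change is easiest to see on one annotated line. A runnable sketch (editorial illustration, not part of any commit; the variable name, comment, and default are made up, and the type inference is simplified to the int case):

    # given a notebook cell annotated in the CLAIMED style:
    comment_line = '# timeout in seconds'
    line = "timeout = int(os.environ.get('timeout', 300))"

    record = {
        'description': comment_line.replace('#', '').strip(),
        'type': 'Integer' if 'int(' in line else 'String',  # cast-based inference, as above
        'default': line.split(',')[1].split(')')[0] if ',' in line else None,
    }
    # record == {'description': 'timeout in seconds', 'type': 'Integer', 'default': ' 300'}

The leading space in the default (' 300') is preserved by the split, which is why the generated component descriptions in this series show defaults like " 'INFO'".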
From d9ac049175279033f7110eb349b3f030fe46b871 Mon Sep 17 00:00:00 2001
From: Romeo Kienzler
Date: Mon, 17 Jul 2023 14:00:33 +0200
Subject: [PATCH 039/177] on the way to a pypi package

---
 src/build/lib/c3/__init__.py                |   0
 src/build/lib/c3/compiler.py                |  13 +++
 src/c3/__init__.py                          |   0
 src/c3/compiler.py                          |  13 +++
 src/{ => c3}/create_component_library.ipynb | 123 ++++++++++----------
 src/{ => c3}/generate_kfp_component.ipynb   |   6 +-
 src/{ => c3}/notebook.py                    |   0
 src/{ => c3}/parser.py                      |   0
 src/setup.py                                |  27 +++++
 9 files changed, 120 insertions(+), 62 deletions(-)
 create mode 100644 src/build/lib/c3/__init__.py
 create mode 100644 src/build/lib/c3/compiler.py
 create mode 100644 src/c3/__init__.py
 create mode 100644 src/c3/compiler.py
 rename src/{ => c3}/create_component_library.ipynb (85%)
 rename src/{ => c3}/generate_kfp_component.ipynb (98%)
 rename src/{ => c3}/notebook.py (100%)
 rename src/{ => c3}/parser.py (100%)
 create mode 100644 src/setup.py

diff --git a/src/build/lib/c3/__init__.py b/src/build/lib/c3/__init__.py
new file mode 100644
index 00000000..e69de29b

diff --git a/src/build/lib/c3/compiler.py b/src/build/lib/c3/compiler.py
new file mode 100644
index 00000000..88809def
--- /dev/null
+++ b/src/build/lib/c3/compiler.py
@@ -0,0 +1,13 @@
+import subprocess
+
+def main():
+    try:
+        #output = subprocess.check_output('pwd', shell=True, universal_newlines=True)
+        output = subprocess.check_output('ipython generate_kfp_component.ipynb', shell=True, universal_newlines=True)
+        print(output)
+    except subprocess.CalledProcessError as e:
+        print(f"Error executing command: {e}")
+
+
+if __name__ == "__main__":
+    main()
\ No newline at end of file

diff --git a/src/c3/__init__.py b/src/c3/__init__.py
new file mode 100644
index 00000000..e69de29b

diff --git a/src/c3/compiler.py b/src/c3/compiler.py
new file mode 100644
index 00000000..88809def
--- /dev/null
+++ b/src/c3/compiler.py
@@ -0,0 +1,13 @@
+import subprocess
+
+def main():
+    try:
+        #output = subprocess.check_output('pwd', shell=True, universal_newlines=True)
+        output = subprocess.check_output('ipython generate_kfp_component.ipynb', shell=True, universal_newlines=True)
+        print(output)
+    except subprocess.CalledProcessError as e:
+        print(f"Error executing command: {e}")
+
+
+if __name__ == "__main__":
+    main()
\ No newline at end of file

diff --git a/src/create_component_library.ipynb b/src/c3/create_component_library.ipynb
similarity index 85%
rename from src/create_component_library.ipynb
rename to src/c3/create_component_library.ipynb
index ff490804..c5abf333 100644
--- a/src/create_component_library.ipynb
+++ b/src/c3/create_component_library.ipynb
@@ -98,7 +98,7 @@
   {
    "cell_type": "code",
-   "execution_count": 33,
+   "execution_count": 3,
    "id": "2b7a758f-a293-4fa2-8e3f-a0e8557369b9",
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
       "util-cos\n",
       "This component provides COS utility functions (e.g. 
creating a bucket, listing contents of a bucket)\n", - " CLAIMED v0.29\n", + " CLAIMED v0.30\n", "{'access_key_id': {'description': 'access key id', 'type': 'String', 'default': None}, 'secret_access_key': {'description': 'secret access key', 'type': 'String', 'default': None}, 'endpoint': {'description': 'cos/s3 endpoint', 'type': 'String', 'default': None}, 'bucket_name': {'description': 'cos bucket name', 'type': 'String', 'default': None}, 'path': {'description': 'path', 'type': 'String', 'default': \"''\"}, 'source': {'description': 'source in case of uploads', 'type': 'String', 'default': \" ''\"}, 'target': {'description': 'target in case of downloads', 'type': 'String', 'default': \" ''\"}, 'recursive': {'description': 'recursive', 'type': 'Boolean', 'default': \"'False'\"}, 'operation': {'description': 'operation (mkdir|ls|find|get|put|rm|sync_to_cos|sync_to_local)', 'type': 'String', 'default': None}, 'log_level': {'description': 'log level', 'type': 'String', 'default': \" 'INFO'\"}}\n", "{}\n", "['pip install aiobotocore botocore s3fs']\n", - "../../component-library/component-library/util/util-cos.ipynb\n", + "../../../../component-library/component-library/util/util-cos.ipynb\n", "\n", "FROM registry.access.redhat.com/ubi8/python-39 \n", "USER root\n", @@ -128,13 +128,13 @@ "name": "stderr", "output_type": "stream", "text": [ - "#1 [internal] load .dockerignore\n", - "#1 transferring context: 2B done\n", - "#1 DONE 0.0s\n", + "#1 [internal] load build definition from Dockerfile\n", + "#1 transferring dockerfile: 385B done\n", + "#1 DONE 0.1s\n", "\n", - "#2 [internal] load build definition from Dockerfile\n", - "#2 transferring dockerfile: 385B done\n", - "#2 DONE 0.0s\n", + "#2 [internal] load .dockerignore\n", + "#2 transferring context: 2B done\n", + "#2 DONE 0.1s\n", "\n", "#3 [internal] load metadata for registry.access.redhat.com/ubi8/python-39:latest\n", "#3 DONE 0.0s\n", @@ -144,25 +144,26 @@ "\n", "#5 [internal] load build context\n", "#5 transferring context: 8.64kB done\n", - "#5 DONE 0.0s\n", + "#5 DONE 0.1s\n", "\n", "#6 [2/5] RUN dnf install -y java-11-openjdk\n", "#6 CACHED\n", "\n", - "#7 [4/5] RUN pip install aiobotocore botocore s3fs\n", + "#7 [3/5] RUN pip install ipython==8.6.0 nbformat==5.7.0\n", "#7 CACHED\n", "\n", - "#8 [3/5] RUN pip install ipython==8.6.0 nbformat==5.7.0\n", + "#8 [4/5] RUN pip install aiobotocore botocore s3fs\n", "#8 CACHED\n", "\n", "#9 [5/5] ADD util-cos.ipynb /opt/app-root/src/\n", - "#9 CACHED\n", + "#9 DONE 0.1s\n", "\n", "#10 exporting to image\n", - "#10 exporting layers done\n", - "#10 writing image sha256:07187cacad33f42c17f1686b51452342d6bb0bd91d5452634f92d96a4bf5cf1f done\n", - "#10 naming to docker.io/library/claimed-util-cos:0.29 done\n", - "#10 DONE 0.0s\n" + "#10 exporting layers\n", + "#10 exporting layers 2.1s done\n", + "#10 writing image sha256:6ee2f7dd70f11dda51013e41c934e65f3044820b6de12585764329649d4259e8 done\n", + "#10 naming to docker.io/library/claimed-util-cos:0.30 done\n", + "#10 DONE 2.1s\n" ] }, { @@ -170,7 +171,7 @@ "output_type": "stream", "text": [ "The push refers to repository [docker.io/romeokienzler/claimed-util-cos]\n", - "ade67154bab7: Preparing\n", + "eb3bee7d35b9: Preparing\n", "d58d29942c09: Preparing\n", "2902622ed6d0: Preparing\n", "20b51819921d: Preparing\n", @@ -178,20 +179,20 @@ "f945189ba7d6: Preparing\n", "c7649cc32711: Preparing\n", "486dcc5a5ac3: Preparing\n", - "c7649cc32711: Waiting\n", "f945189ba7d6: Waiting\n", + "c7649cc32711: Waiting\n", "486dcc5a5ac3: Waiting\n", 
"6d3c2c9872b8: Layer already exists\n", "20b51819921d: Layer already exists\n", "2902622ed6d0: Layer already exists\n", - "ade67154bab7: Layer already exists\n", - "486dcc5a5ac3: Layer already exists\n", - "c7649cc32711: Layer already exists\n", "d58d29942c09: Layer already exists\n", + "c7649cc32711: Layer already exists\n", + "486dcc5a5ac3: Layer already exists\n", "f945189ba7d6: Layer already exists\n", - "0.29: digest: sha256:e0e1f4215bb97e8a9666ce98f413c8392e2431b068ff54009abcdd3167d71d83 size: 2011\n", + "eb3bee7d35b9: Pushed\n", + "0.30: digest: sha256:b2365be03cd64b1821003f8cc6a88b275b3e9b6988ee63a4bc9ec92d62a76a09 size: 2011\n", "The push refers to repository [docker.io/romeokienzler/claimed-util-cos]\n", - "ade67154bab7: Preparing\n", + "eb3bee7d35b9: Preparing\n", "d58d29942c09: Preparing\n", "2902622ed6d0: Preparing\n", "20b51819921d: Preparing\n", @@ -202,18 +203,18 @@ "f945189ba7d6: Waiting\n", "c7649cc32711: Waiting\n", "486dcc5a5ac3: Waiting\n", - "6d3c2c9872b8: Layer already exists\n", - "ade67154bab7: Layer already exists\n", - "2902622ed6d0: Layer already exists\n", "20b51819921d: Layer already exists\n", + "6d3c2c9872b8: Layer already exists\n", + "eb3bee7d35b9: Layer already exists\n", "d58d29942c09: Layer already exists\n", - "f945189ba7d6: Layer already exists\n", - "c7649cc32711: Layer already exists\n", + "2902622ed6d0: Layer already exists\n", "486dcc5a5ac3: Layer already exists\n", - "latest: digest: sha256:e0e1f4215bb97e8a9666ce98f413c8392e2431b068ff54009abcdd3167d71d83 size: 2011\n", + "c7649cc32711: Layer already exists\n", + "f945189ba7d6: Layer already exists\n", + "latest: digest: sha256:b2365be03cd64b1821003f8cc6a88b275b3e9b6988ee63a4bc9ec92d62a76a09 size: 2011\n", "name: util-cos\n", "description: This component provides COS utility functions (e.g. 
creating a bucket, listing contents of a bucket)\n", - " CLAIMED v0.29\n", + " CLAIMED v0.30\n", "\n", "inputs:\n", "- {name: access_key_id, type: String, description: access key id}\n", @@ -230,7 +231,7 @@ "\n", "implementation:\n", " container:\n", - " image: romeokienzler/claimed-util-cos:0.29\n", + " image: romeokienzler/claimed-util-cos:0.30\n", " command:\n", " - sh\n", " - -ec\n", @@ -252,13 +253,13 @@ ], "source": [ "%%bash\n", - "export version=0.29\n", - "ipython generate_kfp_component.ipynb ../../component-library/component-library/util/util-cos.ipynb $version" + "export version=0.30\n", + "ipython generate_kfp_component.ipynb ../../../../component-library/component-library/util/util-cos.ipynb $version" ] }, { "cell_type": "code", - "execution_count": 37, + "execution_count": 39, "id": "dc39195e", "metadata": {}, "outputs": [ @@ -267,7 +268,7 @@ "output_type": "stream", "text": [ "geo-hls-remove-clouds\n", - "Removes clouds from HLS data CLAIMED v0.32\n", + "Removes clouds from HLS data CLAIMED v0.34\n", "{'input_path': {'description': 'path for input', 'type': 'String', 'default': \"'/home/romeokienzler/Downloads/HLS2022/HLS/**/*'\"}, 'target_path': {'description': 'path for output', 'type': 'String', 'default': \"'/home/romeokienzler/Downloads/HLSS30.CF2.v3/'\"}, 'satellite': {'description': 'satellite', 'type': 'String', 'default': \"'HLS.L30'\"}, 'file_filter_pattern': {'description': 'file filter pattern', 'type': 'String', 'default': \"'HLS.S30*0.B*tif'\"}, 'log_level': {'description': 'log level', 'type': 'String', 'default': \" 'INFO'\"}}\n", "{}\n", "['pip install xarray matplotlib geopandas rioxarray numpy shapely rasterio pyproj ipython dask distributed jinja2 bokeh ipython nbformat aiobotocore botocore s3fs']\n", @@ -302,7 +303,7 @@ "#4 DONE 0.0s\n", "\n", "#5 [internal] load build context\n", - "#5 transferring context: 13.90kB done\n", + "#5 transferring context: 14.31kB done\n", "#5 DONE 0.0s\n", "\n", "#6 [2/5] RUN dnf install -y java-11-openjdk\n", @@ -315,14 +316,14 @@ "#8 CACHED\n", "\n", "#9 [5/5] ADD hls_remove_clouds.ipynb /opt/app-root/src/\n", - "#9 DONE 0.1s\n", + "#9 DONE 0.2s\n", "\n", "#10 exporting to image\n", "#10 exporting layers\n", - "#10 exporting layers 1.9s done\n", - "#10 writing image sha256:751a9436d257bf6d92a3a6ae71cb5217f001f87876a155f0df0e053e7e50a708 done\n", - "#10 naming to docker.io/library/claimed-geo-hls-remove-clouds:0.32 done\n", - "#10 DONE 1.9s\n" + "#10 exporting layers 2.2s done\n", + "#10 writing image sha256:4fce555cb3149b2f5e15968995c00dd9f9479e7e73eac5cca6d798166038d740 done\n", + "#10 naming to docker.io/library/claimed-geo-hls-remove-clouds:0.34 done\n", + "#10 DONE 2.2s\n" ] }, { @@ -330,7 +331,7 @@ "output_type": "stream", "text": [ "The push refers to repository [docker.io/romeokienzler/claimed-geo-hls-remove-clouds]\n", - "f1f396df7f96: Preparing\n", + "44de34118e36: Preparing\n", "1bb304a805f1: Preparing\n", "2902622ed6d0: Preparing\n", "20b51819921d: Preparing\n", @@ -341,17 +342,17 @@ "c7649cc32711: Waiting\n", "486dcc5a5ac3: Waiting\n", "f945189ba7d6: Waiting\n", + "2902622ed6d0: Layer already exists\n", "20b51819921d: Layer already exists\n", - "1bb304a805f1: Layer already exists\n", "6d3c2c9872b8: Layer already exists\n", - "2902622ed6d0: Layer already exists\n", + "1bb304a805f1: Layer already exists\n", "f945189ba7d6: Layer already exists\n", "486dcc5a5ac3: Layer already exists\n", "c7649cc32711: Layer already exists\n", - "f1f396df7f96: Pushed\n", - "0.32: digest: 
sha256:e7adcd7923d40ef3aa2125090469e94b2955fc42bef364784b97cb46508c0966 size: 2012\n", + "44de34118e36: Pushed\n", + "0.34: digest: sha256:7186445102f56c7fc4c36790bf0f4baf5348a76b3abe2102db156c1b530c96d6 size: 2012\n", "The push refers to repository [docker.io/romeokienzler/claimed-geo-hls-remove-clouds]\n", - "f1f396df7f96: Preparing\n", + "44de34118e36: Preparing\n", "1bb304a805f1: Preparing\n", "2902622ed6d0: Preparing\n", "20b51819921d: Preparing\n", @@ -360,19 +361,19 @@ "c7649cc32711: Preparing\n", "486dcc5a5ac3: Preparing\n", "c7649cc32711: Waiting\n", - "486dcc5a5ac3: Waiting\n", "f945189ba7d6: Waiting\n", + "486dcc5a5ac3: Waiting\n", + "20b51819921d: Layer already exists\n", + "6d3c2c9872b8: Layer already exists\n", "2902622ed6d0: Layer already exists\n", "1bb304a805f1: Layer already exists\n", - "6d3c2c9872b8: Layer already exists\n", - "f1f396df7f96: Layer already exists\n", - "20b51819921d: Layer already exists\n", + "44de34118e36: Layer already exists\n", "f945189ba7d6: Layer already exists\n", - "486dcc5a5ac3: Layer already exists\n", "c7649cc32711: Layer already exists\n", - "latest: digest: sha256:e7adcd7923d40ef3aa2125090469e94b2955fc42bef364784b97cb46508c0966 size: 2012\n", + "486dcc5a5ac3: Layer already exists\n", + "latest: digest: sha256:7186445102f56c7fc4c36790bf0f4baf5348a76b3abe2102db156c1b530c96d6 size: 2012\n", "name: geo-hls-remove-clouds\n", - "description: Removes clouds from HLS data CLAIMED v0.32\n", + "description: Removes clouds from HLS data CLAIMED v0.34\n", "\n", "inputs:\n", "- {name: input_path, type: String, description: path for input, default: '/home/romeokienzler/Downloads/HLS2022/HLS/**/*'}\n", @@ -384,7 +385,7 @@ "\n", "implementation:\n", " container:\n", - " image: romeokienzler/claimed-geo-hls-remove-clouds:0.32\n", + " image: romeokienzler/claimed-geo-hls-remove-clouds:0.34\n", " command:\n", " - sh\n", " - -ec\n", @@ -401,7 +402,7 @@ ], "source": [ "%%bash\n", - "export version=0.33\n", + "export version=0.34\n", "ipython generate_kfp_component.ipynb ../../workflows-and-operators/operators/hls_remove_clouds.ipynb $version\n" ] }, @@ -411,7 +412,11 @@ "id": "70241e41", "metadata": {}, "outputs": [], - "source": [] + "source": [ + "%%bash\n", + "export version=0.34\n", + "ipython generate_kfp_component.ipynb ../../workflows-and-operators/operators/planetdownloader.ipynb $version\n" + ] } ], "metadata": { @@ -430,7 +435,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.3" + "version": "3.10.11" } }, "nbformat": 4, diff --git a/src/generate_kfp_component.ipynb b/src/c3/generate_kfp_component.ipynb similarity index 98% rename from src/generate_kfp_component.ipynb rename to src/c3/generate_kfp_component.ipynb index 9ef59a39..0f75b2da 100644 --- a/src/generate_kfp_component.ipynb +++ b/src/c3/generate_kfp_component.ipynb @@ -2,8 +2,8 @@ "cells": [ { "cell_type": "code", - "execution_count": 1, - "id": "c08007ab-0366-459b-8d61-695e003c3ce5", + "execution_count": null, + "id": "2fc91001", "metadata": {}, "outputs": [], "source": [ @@ -328,7 +328,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.3" + "version": "3.10.11" } }, "nbformat": 4, diff --git a/src/notebook.py b/src/c3/notebook.py similarity index 100% rename from src/notebook.py rename to src/c3/notebook.py diff --git a/src/parser.py b/src/c3/parser.py similarity index 100% rename from src/parser.py rename to src/c3/parser.py diff --git a/src/setup.py b/src/setup.py new file mode 
100644 index 00000000..67226c0a --- /dev/null +++ b/src/setup.py @@ -0,0 +1,27 @@ +from setuptools import setup, find_packages + +setup( + name='c3', + version='0.1.0', + author='The CLAIMED authors', + author_email='your@email.com', + description='Description of your package', + url='https://github.com/yourusername/your-package-name', + packages=find_packages(), + entry_points={ + 'console_scripts': [ + 'c3 = c3.compiler:main' + ] + }, + package_data={ + 'c3': ['./c3/generate_kfp_component.ipynb'], + }, + install_requires=[ + 'ipython', + ], + classifiers=[ + 'License :: OSI Approved :: MIT License', + 'Programming Language :: Python :: 3', + 'Operating System :: OS Independent', + ], +) From b1adf9b82e3d5ec8918204d7b2806463de9bd7e8 Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Thu, 20 Jul 2023 16:33:40 +0200 Subject: [PATCH 040/177] add kubernetes job as compiler target --- src/c3/create_component_library.ipynb | 445 +++++++++++++++++++++++++- src/c3/generate_kfp_component.ipynb | 120 +++++-- 2 files changed, 536 insertions(+), 29 deletions(-) diff --git a/src/c3/create_component_library.ipynb b/src/c3/create_component_library.ipynb index c5abf333..d87ca2c1 100644 --- a/src/c3/create_component_library.ipynb +++ b/src/c3/create_component_library.ipynb @@ -2,10 +2,45 @@ "cells": [ { "cell_type": "code", - "execution_count": null, + "execution_count": 3, "id": "9c7ce914", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Requirement already satisfied: ipython in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (8.14.0)\n", + "Requirement already satisfied: nbformat in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (5.9.1)\n", + "Requirement already satisfied: backcall in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from ipython) (0.2.0)\n", + "Requirement already satisfied: decorator in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from ipython) (5.1.1)\n", + "Requirement already satisfied: jedi>=0.16 in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from ipython) (0.18.2)\n", + "Requirement already satisfied: matplotlib-inline in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from ipython) (0.1.6)\n", + "Requirement already satisfied: pickleshare in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from ipython) (0.7.5)\n", + "Requirement already satisfied: prompt-toolkit!=3.0.37,<3.1.0,>=3.0.30 in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from ipython) (3.0.39)\n", + "Requirement already satisfied: pygments>=2.4.0 in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from ipython) (2.15.1)\n", + "Requirement already satisfied: stack-data in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from ipython) (0.6.2)\n", + "Requirement already satisfied: traitlets>=5 in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from ipython) (5.9.0)\n", + "Requirement already satisfied: pexpect>4.3 in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from ipython) (4.8.0)\n", + "Requirement already satisfied: fastjsonschema in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from nbformat) (2.17.1)\n", + "Requirement already satisfied: jsonschema>=2.6 in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from nbformat) (4.17.3)\n", + "Requirement already satisfied: jupyter-core in 
/home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from nbformat) (5.3.1)\n", + "Requirement already satisfied: parso<0.9.0,>=0.8.0 in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from jedi>=0.16->ipython) (0.8.3)\n", + "Requirement already satisfied: attrs>=17.4.0 in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from jsonschema>=2.6->nbformat) (23.1.0)\n", + "Requirement already satisfied: pyrsistent!=0.17.0,!=0.17.1,!=0.17.2,>=0.14.0 in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from jsonschema>=2.6->nbformat) (0.19.3)\n", + "Requirement already satisfied: ptyprocess>=0.5 in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from pexpect>4.3->ipython) (0.7.0)\n", + "Requirement already satisfied: wcwidth in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from prompt-toolkit!=3.0.37,<3.1.0,>=3.0.30->ipython) (0.2.6)\n", + "Requirement already satisfied: platformdirs>=2.5 in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from jupyter-core->nbformat) (3.8.1)\n", + "Requirement already satisfied: executing>=1.2.0 in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from stack-data->ipython) (1.2.0)\n", + "Requirement already satisfied: asttokens>=2.1.0 in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from stack-data->ipython) (2.2.1)\n", + "Requirement already satisfied: pure-eval in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from stack-data->ipython) (0.2.2)\n", + "Requirement already satisfied: six in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from asttokens>=2.1.0->stack-data->ipython) (1.16.0)\n", + "\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.1.2\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.2\u001b[0m\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n" + ] + } + ], "source": [ "!pip install ipython nbformat" ] @@ -408,15 +443,413 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 47, + "id": "1df26cfb", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2023-07-20 13:55:30,188 - root - INFO - Logging parameters: notebook_path=\"../../../workflows-and-operators/operators/cgw-hls-remove-clouds.ipynb\"version=\"0.23\"repository=\"docker.io/romeokienzler\"additionl_files=\"../../../workflows-and-operators/operators/hls_remove_clouds.ipynb\"\n", + "2023-07-20 13:55:30,188 - root - INFO - Parameter: notebook_path=\"../../../workflows-and-operators/operators/cgw-hls-remove-clouds.ipynb\"\n", + "2023-07-20 13:55:30,188 - root - INFO - Parameter: version=\"0.23\"\n", + "2023-07-20 13:55:30,188 - root - INFO - Parameter: repository=\"docker.io/romeokienzler\"\n", + "2023-07-20 13:55:30,188 - root - INFO - Parameter: additionl_files=\"../../../workflows-and-operators/operators/hls_remove_clouds.ipynb\"\n", + "ccgw-hls-remove-clouds\n", + "hls_remove_clouds got wrappted by cos_grid_wrapper, which wraps any CLAIMED component and implements the generic grid computing pattern https://romeokienzler.medium.com/the-generic-grid-computing-pattern-transforms-any-sequential-workflow-step-into-a-transient-grid-c7f3ca7459c8 \n", + " CLAIMED v0.23\n", + "{'cgw_source_path': {'description': 'cos path to get job 
(files) from (including bucket)', 'type': 'String', 'default': None}, 'cgw_source_access_key_id': {'description': 'cgw_source_access_key_id', 'type': 'String', 'default': None}, 'cgw_source_secret_access_key': {'description': 'source_secret_access_key', 'type': 'String', 'default': None}, 'cgw_source_endpoint': {'description': 'source_endpoint', 'type': 'String', 'default': None}, 'cgw_target_access_key_id': {'description': 'cgw_target_access_key_id', 'type': 'String', 'default': None}, 'cgw_target_secret_access_key': {'description': 'cgw_target_secret_access_key', 'type': 'String', 'default': None}, 'cgw_target_endpoint': {'description': 'cgw_target_endpoint', 'type': 'String', 'default': None}, 'cgw_target_path': {'description': 'cgw_target_path (including bucket)', 'type': 'String', 'default': None}, 'cgw_lock_file_suffix': {'description': 'lock file suffix', 'type': 'String', 'default': \" '.lock'\"}, 'cgw_processed_file_suffix': {'description': 'processed file suffix', 'type': 'String', 'default': \" '.processed'\"}, 'log_level': {'description': 'log level', 'type': 'String', 'default': \" 'INFO'\"}, 'cgw_lock_timeout': {'description': 'timeout in seconds to remove lock file from struggling job, default 1 hour', 'type': 'Integer', 'default': ' 60*60'}, 'cgw_group_by': {'description': 'group files which need to be processed together', 'type': 'String', 'default': ' None'}, 'satellite': {'description': 'satellite', 'type': 'String', 'default': \"'HLS.L30'\"}}\n", + "{}\n", + "['pip install xarray matplotlib geopandas rioxarray numpy shapely rasterio pyproj ipython dask distributed jinja2 bokeh ipython nbformat aiobotocore botocore s3fs']\n", + "../../../workflows-and-operators/operators/cgw-hls-remove-clouds.ipynb\n", + "\n", + "FROM registry.access.redhat.com/ubi8/python-39 \n", + "USER root\n", + "RUN dnf install -y java-11-openjdk\n", + "USER default\n", + "RUN pip install ipython==8.6.0 nbformat==5.7.0\n", + "RUN pip install xarray matplotlib geopandas rioxarray numpy shapely rasterio pyproj ipython dask distributed jinja2 bokeh ipython nbformat aiobotocore botocore s3fs\n", + "ADD hls_remove_clouds.ipynb /opt/app-root/src/\n", + "ADD cgw-hls-remove-clouds.ipynb /opt/app-root/src/\n", + "USER root\n", + "RUN chmod -R 777 /opt/app-root/src/\n", + "USER default\n", + "CMD [\"ipython\", \"/opt/app-root/src/cgw-hls-remove-clouds.ipynb\"]\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "#1 [internal] load .dockerignore\n", + "#1 transferring context: 2B done\n", + "#1 DONE 0.0s\n", + "\n", + "#2 [internal] load build definition from Dockerfile\n", + "#2 transferring dockerfile: 640B done\n", + "#2 DONE 0.0s\n", + "\n", + "#3 [internal] load metadata for registry.access.redhat.com/ubi8/python-39:latest\n", + "#3 DONE 0.0s\n", + "\n", + "#4 [1/7] FROM registry.access.redhat.com/ubi8/python-39\n", + "#4 DONE 0.0s\n", + "\n", + "#5 [internal] load build context\n", + "#5 transferring context: 24.77kB done\n", + "#5 DONE 0.0s\n", + "\n", + "#6 [2/7] RUN dnf install -y java-11-openjdk\n", + "#6 CACHED\n", + "\n", + "#7 [3/7] RUN pip install ipython==8.6.0 nbformat==5.7.0\n", + "#7 CACHED\n", + "\n", + "#8 [4/7] RUN pip install xarray matplotlib geopandas rioxarray numpy shapely rasterio pyproj ipython dask distributed jinja2 bokeh ipython nbformat aiobotocore botocore s3fs\n", + "#8 CACHED\n", + "\n", + "#9 [5/7] ADD hls_remove_clouds.ipynb /opt/app-root/src/\n", + "#9 DONE 0.1s\n", + "\n", + "#10 [6/7] ADD cgw-hls-remove-clouds.ipynb /opt/app-root/src/\n", + "#10 
DONE 0.1s\n", + "\n", + "#11 [7/7] RUN chmod -R 777 /opt/app-root/src/\n", + "#11 DONE 0.7s\n", + "\n", + "#12 exporting to image\n", + "#12 exporting layers\n", + "#12 exporting layers 7.4s done\n", + "#12 writing image sha256:6642e733d224518402f32661f30f4e8751be64769e342360ef17dd85e8747edc done\n", + "#12 naming to docker.io/library/claimed-ccgw-hls-remove-clouds:0.23 0.0s done\n", + "#12 DONE 7.4s\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The push refers to repository [docker.io/romeokienzler/claimed-ccgw-hls-remove-clouds]\n", + "0ec2bcbd8f7f: Preparing\n", + "c9d2cf3919a5: Preparing\n", + "a43fa6775580: Preparing\n", + "b638398d394b: Preparing\n", + "151382e656b8: Preparing\n", + "2d89553fcdef: Preparing\n", + "3568498d40ea: Preparing\n", + "e813c91400f3: Preparing\n", + "fb6a7cccdb84: Preparing\n", + "b51194abfc91: Preparing\n", + "3568498d40ea: Waiting\n", + "e813c91400f3: Waiting\n", + "fb6a7cccdb84: Waiting\n", + "b51194abfc91: Waiting\n", + "2d89553fcdef: Waiting\n", + "b638398d394b: Layer already exists\n", + "151382e656b8: Layer already exists\n", + "3568498d40ea: Layer already exists\n", + "2d89553fcdef: Layer already exists\n", + "fb6a7cccdb84: Layer already exists\n", + "e813c91400f3: Layer already exists\n", + "0ec2bcbd8f7f: Pushed\n", + "a43fa6775580: Pushed\n", + "c9d2cf3919a5: Pushed\n", + "b51194abfc91: Layer already exists\n", + "0.23: digest: sha256:42081189c25e9363364c89437d70eeb599c62e4f99ab6c85df6d8428b81a633a size: 2428\n", + "The push refers to repository [docker.io/romeokienzler/claimed-ccgw-hls-remove-clouds]\n", + "0ec2bcbd8f7f: Preparing\n", + "c9d2cf3919a5: Preparing\n", + "a43fa6775580: Preparing\n", + "b638398d394b: Preparing\n", + "151382e656b8: Preparing\n", + "2d89553fcdef: Preparing\n", + "3568498d40ea: Preparing\n", + "e813c91400f3: Preparing\n", + "fb6a7cccdb84: Preparing\n", + "b51194abfc91: Preparing\n", + "e813c91400f3: Waiting\n", + "fb6a7cccdb84: Waiting\n", + "b51194abfc91: Waiting\n", + "2d89553fcdef: Waiting\n", + "3568498d40ea: Waiting\n", + "b638398d394b: Layer already exists\n", + "151382e656b8: Layer already exists\n", + "a43fa6775580: Layer already exists\n", + "0ec2bcbd8f7f: Layer already exists\n", + "c9d2cf3919a5: Layer already exists\n", + "2d89553fcdef: Layer already exists\n", + "3568498d40ea: Layer already exists\n", + "e813c91400f3: Layer already exists\n", + "fb6a7cccdb84: Layer already exists\n", + "b51194abfc91: Layer already exists\n", + "latest: digest: sha256:42081189c25e9363364c89437d70eeb599c62e4f99ab6c85df6d8428b81a633a size: 2428\n", + "name: ccgw-hls-remove-clouds\n", + "description: hls_remove_clouds got wrappted by cos_grid_wrapper, which wraps any CLAIMED component and implements the generic grid computing pattern https://romeokienzler.medium.com/the-generic-grid-computing-pattern-transforms-any-sequential-workflow-step-into-a-transient-grid-c7f3ca7459c8 \n", + " CLAIMED v0.23\n", + "\n", + "inputs:\n", + "- {name: cgw_source_path, type: String, description: cos path to get job (files) from (including bucket)}\n", + "- {name: cgw_source_access_key_id, type: String, description: cgw_source_access_key_id}\n", + "- {name: cgw_source_secret_access_key, type: String, description: source_secret_access_key}\n", + "- {name: cgw_source_endpoint, type: String, description: source_endpoint}\n", + "- {name: cgw_target_access_key_id, type: String, description: cgw_target_access_key_id}\n", + "- {name: cgw_target_secret_access_key, type: String, description: cgw_target_secret_access_key}\n", + 
"- {name: cgw_target_endpoint, type: String, description: cgw_target_endpoint}\n", + "- {name: cgw_target_path, type: String, description: cgw_target_path (including bucket)}\n", + "- {name: cgw_lock_file_suffix, type: String, description: lock file suffix, default: '.lock'}\n", + "- {name: cgw_processed_file_suffix, type: String, description: processed file suffix, default: '.processed'}\n", + "- {name: log_level, type: String, description: log level, default: 'INFO'}\n", + "- {name: cgw_lock_timeout, type: Integer, description: timeout in seconds to remove lock file from struggling job, default 1 hour, default: 60*60}\n", + "- {name: cgw_group_by, type: String, description: group files which need to be processed together, default: None}\n", + "- {name: satellite, type: String, description: satellite, default: 'HLS.L30'}\n", + "\n", + "\n", + "implementation:\n", + " container:\n", + " image: romeokienzler/claimed-ccgw-hls-remove-clouds:0.23\n", + " command:\n", + " - sh\n", + " - -ec\n", + " - |\n", + " ipython ./cgw-hls-remove-clouds.ipynb cgw_source_path=\"$0\" cgw_source_access_key_id=\"$1\" cgw_source_secret_access_key=\"$2\" cgw_source_endpoint=\"$3\" cgw_target_access_key_id=\"$4\" cgw_target_secret_access_key=\"$5\" cgw_target_endpoint=\"$6\" cgw_target_path=\"$7\" cgw_lock_file_suffix=\"$8\" cgw_processed_file_suffix=\"$9\" log_level=\"$10\" cgw_lock_timeout=\"$11\" cgw_group_by=\"$12\" satellite=\"$13\" \n", + " - {inputValue: cgw_source_path}\n", + " - {inputValue: cgw_source_access_key_id}\n", + " - {inputValue: cgw_source_secret_access_key}\n", + " - {inputValue: cgw_source_endpoint}\n", + " - {inputValue: cgw_target_access_key_id}\n", + " - {inputValue: cgw_target_secret_access_key}\n", + " - {inputValue: cgw_target_endpoint}\n", + " - {inputValue: cgw_target_path}\n", + " - {inputValue: cgw_lock_file_suffix}\n", + " - {inputValue: cgw_processed_file_suffix}\n", + " - {inputValue: log_level}\n", + " - {inputValue: cgw_lock_timeout}\n", + " - {inputValue: cgw_group_by}\n", + " - {inputValue: satellite}\n", + "\n", + "apiVersion: batch/v1\n", + "kind: Job\n", + "metadata:\n", + " name: ccgw-hls-remove-clouds\n", + "spec:\n", + " template:\n", + " spec:\n", + " containers:\n", + " - name: ccgw-hls-remove-clouds\n", + " image: docker.io/romeokienzler/claimed-ccgw-hls-remove-clouds:0.23\n", + " command: [\"/opt/app-root/bin/ipython\",\"/opt/app-root/src/cgw-hls-remove-clouds.ipynb\"]\n", + " env:\n", + " - name: cgw_source_path\n", + " value: value_of_cgw_source_path\n", + " - name: cgw_source_access_key_id\n", + " value: value_of_cgw_source_access_key_id\n", + " - name: cgw_source_secret_access_key\n", + " value: value_of_cgw_source_secret_access_key\n", + " - name: cgw_source_endpoint\n", + " value: value_of_cgw_source_endpoint\n", + " - name: cgw_target_access_key_id\n", + " value: value_of_cgw_target_access_key_id\n", + " - name: cgw_target_secret_access_key\n", + " value: value_of_cgw_target_secret_access_key\n", + " - name: cgw_target_endpoint\n", + " value: value_of_cgw_target_endpoint\n", + " - name: cgw_target_path\n", + " value: value_of_cgw_target_path\n", + " - name: cgw_lock_file_suffix\n", + " value: value_of_cgw_lock_file_suffix\n", + " - name: cgw_processed_file_suffix\n", + " value: value_of_cgw_processed_file_suffix\n", + " - name: log_level\n", + " value: value_of_log_level\n", + " - name: cgw_lock_timeout\n", + " value: value_of_cgw_lock_timeout\n", + " - name: cgw_group_by\n", + " value: value_of_cgw_group_by\n", + " - name: satellite\n", + " value: 
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "d1fa8a81",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "build README.md src your_package_name.egg-info\n"
+     ]
+    }
+   ],
+   "source": [
+    "!ls ../../\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "70241e41",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "planet-data-downloader\n",
+      "downloads data from planet CLAIMED v0.2\n",
+      "{'api_key': {'description': 'if your Planet API Key is not set as an environment variable, you can paste it below', 'type': 'String', 'default': \" 'PLAK710197d73217456199842c8db825a773'\"}, 'api_url': {'description': 'if your Planet API url is not set as an environment variable, you can paste it below', 'type': 'String', 'default': \" 'https://api.planet.com/basemaps/v1/mosaics'\"}, 'log_level': {'description': 'log_level', 'type': 'String', 'default': \" 'INFO'\"}}\n",
+      "{}\n",
+      "['pip install shapely geopandas ipython ']\n",
+      "../../../workflows-and-operators/operators/planetdownloader.ipynb\n",
+      "\n",
+      "FROM registry.access.redhat.com/ubi8/python-39 \n",
+      "USER root\n",
+      "RUN dnf install -y java-11-openjdk\n",
+      "USER default\n",
+      "RUN pip install ipython==8.6.0 nbformat==5.7.0\n",
+      "RUN pip install shapely geopandas ipython \n",
+      "ADD planetdownloader.ipynb /opt/app-root/src/\n",
+      "CMD [\"ipython\", \"/opt/app-root/src/planetdownloader.ipynb\"]\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "#1 [internal] load .dockerignore\n",
+      "#1 transferring context: 2B done\n",
+      "#1 DONE 0.0s\n",
+      "\n",
+      "#2 [internal] load build definition from Dockerfile\n",
+      "#2 transferring dockerfile: 402B done\n",
+      "#2 DONE 0.1s\n",
+      "\n",
+      "#3 [internal] load metadata for registry.access.redhat.com/ubi8/python-39:latest\n",
+      "#3 DONE 0.0s\n",
+      "\n",
+      "#4 [1/5] FROM registry.access.redhat.com/ubi8/python-39\n",
+      "#4 DONE 0.0s\n",
+      "\n",
+      "#5 [2/5] RUN dnf install -y java-11-openjdk\n",
+      "#5 CACHED\n",
+      "\n",
+      "#6 [3/5] RUN pip install ipython==8.6.0 nbformat==5.7.0\n",
+      "#6 CACHED\n",
+      "\n",
+      "#7 [internal] load build context\n",
+      "#7 transferring context: 5.75kB done\n",
+      "#7 DONE 0.0s\n",
+      "\n",
+      "#8 [4/5] RUN pip install shapely geopandas ipython\n",
+      "#8 3.623 Collecting shapely\n",
+      "#8 3.749 Downloading shapely-2.0.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.3 MB)\n",
+      "#8 4.577 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.3/2.3 MB 2.8 MB/s eta 0:00:00\n",
+      "#8 4.742 Collecting geopandas\n",
+      "#8 4.778 Downloading geopandas-0.13.2-py3-none-any.whl (1.1 MB)\n",
+      "#8 5.213 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 2.6 MB/s eta 0:00:00\n",
+      "#8 5.237 Requirement already satisfied: ipython in /opt/app-root/lib/python3.9/site-packages (8.6.0)\n",
+      "#8 6.880 Collecting numpy>=1.14\n",
+      "#8 6.946 Downloading numpy-1.25.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.7 MB)\n",
+      "#8 13.43 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17.7/17.7 MB 3.4 MB/s eta 0:00:00\n",
+      "#8 13.89 Collecting 
packaging\n", + "#8 13.95 Downloading packaging-23.1-py3-none-any.whl (48 kB)\n", + "#8 13.99 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 48.9/48.9 kB 1.4 MB/s eta 0:00:00\n", + "#8 14.56 Collecting fiona>=1.8.19\n", + "#8 14.61 Downloading Fiona-1.9.4.post1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.4 MB)\n", + "#8 18.86 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16.4/16.4 MB 3.9 MB/s eta 0:00:00\n", + "#8 20.14 Collecting pandas>=1.1.0\n", + "#8 20.23 Downloading pandas-2.0.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.4 MB)\n", + "#8 22.88 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.4/12.4 MB 4.7 MB/s eta 0:00:00\n", + "#8 23.74 Collecting pyproj>=3.0.1\n", + "#8 23.78 Downloading pyproj-3.6.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.9 MB)\n", + "#8 25.56 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.9/7.9 MB 4.4 MB/s eta 0:00:00\n", + "#8 25.96 Requirement already satisfied: pickleshare in /opt/app-root/lib/python3.9/site-packages (from ipython) (0.7.5)\n", + "#8 25.96 Requirement already satisfied: pexpect>4.3 in /opt/app-root/lib/python3.9/site-packages (from ipython) (4.8.0)\n", + "#8 25.96 Requirement already satisfied: backcall in /opt/app-root/lib/python3.9/site-packages (from ipython) (0.2.0)\n", + "#8 25.97 Requirement already satisfied: stack-data in /opt/app-root/lib/python3.9/site-packages (from ipython) (0.6.2)\n", + "#8 25.97 Requirement already satisfied: pygments>=2.4.0 in /opt/app-root/lib/python3.9/site-packages (from ipython) (2.15.1)\n", + "#8 25.97 Requirement already satisfied: jedi>=0.16 in /opt/app-root/lib/python3.9/site-packages (from ipython) (0.18.2)\n", + "#8 25.98 Requirement already satisfied: matplotlib-inline in /opt/app-root/lib/python3.9/site-packages (from ipython) (0.1.6)\n", + "#8 25.98 Requirement already satisfied: traitlets>=5 in /opt/app-root/lib/python3.9/site-packages (from ipython) (5.9.0)\n", + "#8 25.98 Requirement already satisfied: prompt-toolkit<3.1.0,>3.0.1 in /opt/app-root/lib/python3.9/site-packages (from ipython) (3.0.39)\n", + "#8 25.99 Requirement already satisfied: decorator in /opt/app-root/lib/python3.9/site-packages (from ipython) (5.1.1)\n", + "#8 26.06 Requirement already satisfied: attrs>=19.2.0 in /opt/app-root/lib/python3.9/site-packages (from fiona>=1.8.19->geopandas) (23.1.0)\n", + "#8 26.18 Collecting cligj>=0.5\n", + "#8 26.23 Downloading cligj-0.7.2-py3-none-any.whl (7.1 kB)\n", + "#8 26.32 Collecting click-plugins>=1.0\n", + "#8 26.39 Downloading click_plugins-1.1.1-py2.py3-none-any.whl (7.5 kB)\n", + "#8 26.40 Requirement already satisfied: six in /opt/app-root/lib/python3.9/site-packages (from fiona>=1.8.19->geopandas) (1.16.0)\n", + "#8 26.54 Collecting click~=8.0\n", + "#8 26.61 Downloading click-8.1.6-py3-none-any.whl (97 kB)\n", + "#8 26.63 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 97.9/97.9 kB 5.0 MB/s eta 0:00:00\n", + "#8 26.81 Collecting certifi\n", + "#8 26.84 Downloading certifi-2023.5.7-py3-none-any.whl (156 kB)\n", + "#8 26.87 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 157.0/157.0 kB 6.5 MB/s eta 0:00:00\n", + "#8 27.28 Collecting importlib-metadata\n", + "#8 27.35 Downloading importlib_metadata-6.8.0-py3-none-any.whl (22 kB)\n", + "#8 27.53 Requirement already satisfied: parso<0.9.0,>=0.8.0 in /opt/app-root/lib/python3.9/site-packages (from jedi>=0.16->ipython) (0.8.3)\n", + "#8 27.98 Collecting python-dateutil>=2.8.2\n", + "#8 28.01 Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)\n", + "#8 28.08 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 
247.7/247.7 kB 4.1 MB/s eta 0:00:00\n", + "#8 28.41 Collecting pytz>=2020.1\n", + "#8 28.44 Downloading pytz-2023.3-py2.py3-none-any.whl (502 kB)\n", + "#8 28.54 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 502.3/502.3 kB 5.6 MB/s eta 0:00:00\n", + "#8 28.78 Collecting tzdata>=2022.1\n", + "#8 28.81 Downloading tzdata-2023.3-py2.py3-none-any.whl (341 kB)\n", + "#8 28.87 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 341.8/341.8 kB 6.4 MB/s eta 0:00:00\n", + "#8 28.91 Requirement already satisfied: ptyprocess>=0.5 in /opt/app-root/lib/python3.9/site-packages (from pexpect>4.3->ipython) (0.7.0)\n", + "#8 28.92 Requirement already satisfied: wcwidth in /opt/app-root/lib/python3.9/site-packages (from prompt-toolkit<3.1.0,>3.0.1->ipython) (0.2.6)\n", + "#8 29.07 Requirement already satisfied: executing>=1.2.0 in /opt/app-root/lib/python3.9/site-packages (from stack-data->ipython) (1.2.0)\n", + "#8 29.07 Requirement already satisfied: asttokens>=2.1.0 in /opt/app-root/lib/python3.9/site-packages (from stack-data->ipython) (2.2.1)\n", + "#8 29.07 Requirement already satisfied: pure-eval in /opt/app-root/lib/python3.9/site-packages (from stack-data->ipython) (0.2.2)\n", + "#8 29.66 Collecting zipp>=0.5\n", + "#8 29.68 Downloading zipp-3.16.2-py3-none-any.whl (7.2 kB)\n", + "#8 30.98 Installing collected packages: pytz, zipp, tzdata, python-dateutil, packaging, numpy, click, certifi, shapely, pyproj, pandas, importlib-metadata, cligj, click-plugins, fiona, geopandas\n" + ] + } + ], "source": [ "%%bash\n", - "export version=0.34\n", - "ipython generate_kfp_component.ipynb ../../workflows-and-operators/operators/planetdownloader.ipynb $version\n" + "export version=0.2\n", + "ipython generate_kfp_component.ipynb ../../../workflows-and-operators/operators/planetdownloader.ipynb $version\n" ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "7da81684", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/home/romeokienzler/gitco/c3/src/c3\n" + ] + } + ], + "source": [ + "!pwd\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5c156cb8", + "metadata": {}, + "outputs": [], + "source": [] } ], "metadata": { diff --git a/src/c3/generate_kfp_component.ipynb b/src/c3/generate_kfp_component.ipynb index 0f75b2da..4369a939 100644 --- a/src/c3/generate_kfp_component.ipynb +++ b/src/c3/generate_kfp_component.ipynb @@ -13,7 +13,9 @@ "from string import Template\n", "import sys\n", "from io import StringIO\n", - "from enum import Enum" + "from enum import Enum\n", + "import logging\n", + "import re" ] }, { @@ -23,14 +25,30 @@ "metadata": {}, "outputs": [], "source": [ - "if len(sys.argv)<2:\n", - " print('TODO gracefully shutdown')\n", + "root = logging.getLogger()\n", + "root.setLevel('INFO')\n", "\n", - "notebook_path = sys.argv[1]\n", - "version = sys.argv[2]\n", + "handler = logging.StreamHandler(sys.stdout)\n", + "handler.setLevel('INFO')\n", + "formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')\n", + "handler.setFormatter(formatter)\n", + "root.addHandler(handler)\n", "\n", - "#notebook_path = os.environ.get('notebook_path','../component-library/input/input-url.ipynb')\n", - "\n" + "\n", + "parameters = list(\n", + " map(lambda s: re.sub('$', '\"', s),\n", + " map(\n", + " lambda s: s.replace('=', '=\"'),\n", + " filter(\n", + " lambda s: s.find('=') > -1 and bool(re.match(r'[A-Za-z0-9_]*=[.\\/A-Za-z0-9]*', s)),\n", + " sys.argv\n", + " )\n", + " )))\n", + "\n", + "logging.info('Logging parameters: ' + 
''.join(parameters))\n", + "for parameter in parameters:\n", + " logging.info('Parameter: ' + parameter)\n", + " exec(parameter)" ] }, { @@ -111,8 +129,9 @@ "source": [ "#target_code = notebook_path.replace('.ipynb','.py').split('/')[-1:][0]\n", "target_code = notebook_path.split('/')[-1:][0]\n", - "\n", - "shutil.copy(notebook_path,target_code)" + "shutil.copy(notebook_path,target_code)\n", + "additionl_files_local = additionl_files.split('/')[-1:][0]\n", + "shutil.copy(additionl_files,additionl_files_local)" ] }, { @@ -158,21 +177,21 @@ "metadata": {}, "outputs": [], "source": [ - "docker_file = \"\"\"\n", + "requirements_docker = '\\n'.join(requirements_docker)\n", + "docker_file = f\"\"\"\n", "FROM registry.access.redhat.com/ubi8/python-39 \n", "USER root\n", "RUN dnf install -y java-11-openjdk\n", "USER default\n", "RUN pip install ipython==8.6.0 nbformat==5.7.0\n", - "{}\n", - "ADD {} /opt/app-root/src/\n", - "CMD [\"ipython\", \"/opt/app-root/src/{}\"]\n", - "\"\"\".format(\n", - " '\\n'.join(requirements_docker),\n", - " target_code,\n", - " target_code,\n", - " target_code\n", - ")\n", + "{requirements_docker}\n", + "ADD {additionl_files_local} /opt/app-root/src/\n", + "ADD {target_code} /opt/app-root/src/\n", + "USER root\n", + "RUN chmod -R 777 /opt/app-root/src/\n", + "USER default\n", + "CMD [\"ipython\", \"/opt/app-root/src/{target_code}\"]\n", + "\"\"\"\n", "with open(\"Dockerfile\", \"w\") as text_file:\n", " text_file.write(docker_file)\n", "!cat Dockerfile" @@ -189,10 +208,10 @@ "outputs": [], "source": [ "!docker build -t `echo claimed-{name}:{version}` .\n", - "!docker tag `echo claimed-{name}:{version}` `echo romeokienzler/claimed-{name}:{version}`\n", - "!docker tag `echo claimed-{name}:{version}` `echo romeokienzler/claimed-{name}:latest`\n", - "!docker push `echo romeokienzler/claimed-{name}:{version}`\n", - "!docker push `echo romeokienzler/claimed-{name}:latest`" + "!docker tag `echo claimed-{name}:{version}` `echo {repository}/claimed-{name}:{version}`\n", + "!docker tag `echo claimed-{name}:{version}` `echo {repository}/claimed-{name}:latest`\n", + "!docker push `echo {repository}/claimed-{name}:{version}`\n", + "!docker push `echo {repository}/claimed-{name}:latest`" ] }, { @@ -310,6 +329,61 @@ "with open(target_yaml_path, \"w\") as text_file:\n", " text_file.write(yaml)" ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "83115456", + "metadata": {}, + "outputs": [], + "source": [ + "env_entries = []\n", + "for input_key, _ in inputs.items():\n", + " env_entry = f\" - name: {input_key}\\n value: value_of_{input_key}\"\n", + " env_entries.append(env_entry)\n", + " env_entries.append('\\n')\n", + "env_entries.pop(-1)\n", + "env_entries = ''.join(env_entries)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1d40b282", + "metadata": {}, + "outputs": [], + "source": [ + "job_yaml = f'''apiVersion: batch/v1\n", + "kind: Job\n", + "metadata:\n", + " name: {name}\n", + "spec:\n", + " template:\n", + " spec:\n", + " containers:\n", + " - name: {name}\n", + " image: {repository}/claimed-{name}:{version}\n", + " command: [\"/opt/app-root/bin/ipython\",\"/opt/app-root/src/{target_code}\"]\n", + " env:\n", + "{env_entries}\n", + " restartPolicy: OnFailure'''\n", + "\n", + "\n", + "print(job_yaml)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "52a6bafb", + "metadata": {}, + "outputs": [], + "source": [ + "target_job_yaml_path = notebook_path.replace('.ipynb','.job.yaml')\n", + "\n", + "with 
open(target_job_yaml_path, \"w\") as text_file:\n", + " text_file.write(job_yaml)" + ] } ], "metadata": { From f6b855efb0f2d5cbce07d63469eab360e365a8c6 Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Fri, 21 Jul 2023 17:23:51 +0200 Subject: [PATCH 041/177] bugfix: fix notebook.py to work with new notebook format --- src/c3/create_component_library.ipynb | 444 ++++++++++++-------------- src/c3/generate_kfp_component.ipynb | 12 +- src/c3/notebook.py | 2 + 3 files changed, 209 insertions(+), 249 deletions(-) diff --git a/src/c3/create_component_library.ipynb b/src/c3/create_component_library.ipynb index d87ca2c1..f63beaba 100644 --- a/src/c3/create_component_library.ipynb +++ b/src/c3/create_component_library.ipynb @@ -133,7 +133,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 55, "id": "2b7a758f-a293-4fa2-8e3f-a0e8557369b9", "metadata": {}, "outputs": [ @@ -141,13 +141,17 @@ "name": "stdout", "output_type": "stream", "text": [ + "2023-07-20 19:24:35,109 - root - INFO - Logging parameters: notebook_path=\"../../../component-library/component-library/util/util-cos.ipynb\"version=\"0.32\"repository=\"docker.io/romeokienzler\"\n", + "2023-07-20 19:24:35,110 - root - INFO - Parameter: notebook_path=\"../../../component-library/component-library/util/util-cos.ipynb\"\n", + "2023-07-20 19:24:35,110 - root - INFO - Parameter: version=\"0.32\"\n", + "2023-07-20 19:24:35,110 - root - INFO - Parameter: repository=\"docker.io/romeokienzler\"\n", "util-cos\n", "This component provides COS utility functions (e.g. creating a bucket, listing contents of a bucket)\n", - " CLAIMED v0.30\n", - "{'access_key_id': {'description': 'access key id', 'type': 'String', 'default': None}, 'secret_access_key': {'description': 'secret access key', 'type': 'String', 'default': None}, 'endpoint': {'description': 'cos/s3 endpoint', 'type': 'String', 'default': None}, 'bucket_name': {'description': 'cos bucket name', 'type': 'String', 'default': None}, 'path': {'description': 'path', 'type': 'String', 'default': \"''\"}, 'source': {'description': 'source in case of uploads', 'type': 'String', 'default': \" ''\"}, 'target': {'description': 'target in case of downloads', 'type': 'String', 'default': \" ''\"}, 'recursive': {'description': 'recursive', 'type': 'Boolean', 'default': \"'False'\"}, 'operation': {'description': 'operation (mkdir|ls|find|get|put|rm|sync_to_cos|sync_to_local)', 'type': 'String', 'default': None}, 'log_level': {'description': 'log level', 'type': 'String', 'default': \" 'INFO'\"}}\n", + " CLAIMED v0.32\n", + "{'access_key_id': {'description': 'access key id', 'type': 'String', 'default': None}, 'secret_access_key': {'description': 'secret access key', 'type': 'String', 'default': None}, 'endpoint': {'description': 'cos/s3 endpoint', 'type': 'String', 'default': None}, 'bucket_name': {'description': 'cos bucket name', 'type': 'String', 'default': None}, 'path': {'description': 'path', 'type': 'String', 'default': \"''\"}, 'source': {'description': 'source in case of uploads', 'type': 'String', 'default': \" ''\"}, 'target': {'description': 'target in case of downloads', 'type': 'String', 'default': \" ''\"}, 'recursive': {'description': 'recursive', 'type': 'Boolean', 'default': \"'False'\"}, 'operation': {'description': 'operation (mkdir|ls|find|get|put|rm|sync_to_cos|sync_to_local|glob)', 'type': 'String', 'default': None}, 'log_level': {'description': 'log level', 'type': 'String', 'default': \" 'INFO'\"}}\n", "{}\n", "['pip install aiobotocore botocore s3fs']\n", - 
"../../../../component-library/component-library/util/util-cos.ipynb\n", + "../../../component-library/component-library/util/util-cos.ipynb\n", "\n", "FROM registry.access.redhat.com/ubi8/python-39 \n", "USER root\n", @@ -156,6 +160,10 @@ "RUN pip install ipython==8.6.0 nbformat==5.7.0\n", "RUN pip install aiobotocore botocore s3fs\n", "ADD util-cos.ipynb /opt/app-root/src/\n", + "ADD util-cos.ipynb /opt/app-root/src/\n", + "USER root\n", + "RUN chmod -R 777 /opt/app-root/src/\n", + "USER default\n", "CMD [\"ipython\", \"/opt/app-root/src/util-cos.ipynb\"]\n" ] }, @@ -163,42 +171,52 @@ "name": "stderr", "output_type": "stream", "text": [ - "#1 [internal] load build definition from Dockerfile\n", - "#1 transferring dockerfile: 385B done\n", - "#1 DONE 0.1s\n", + "#1 [internal] load .dockerignore\n", + "#1 transferring context:\n", + "#1 transferring context: 2B done\n", + "#1 DONE 0.0s\n", "\n", - "#2 [internal] load .dockerignore\n", - "#2 transferring context: 2B done\n", + "#2 [internal] load build definition from Dockerfile\n", + "#2 transferring dockerfile: 482B done\n", "#2 DONE 0.1s\n", "\n", "#3 [internal] load metadata for registry.access.redhat.com/ubi8/python-39:latest\n", "#3 DONE 0.0s\n", "\n", - "#4 [1/5] FROM registry.access.redhat.com/ubi8/python-39\n", + "#4 [internal] load build context\n", "#4 DONE 0.0s\n", "\n", - "#5 [internal] load build context\n", - "#5 transferring context: 8.64kB done\n", - "#5 DONE 0.1s\n", + "#5 [1/7] FROM registry.access.redhat.com/ubi8/python-39\n", + "#5 DONE 0.0s\n", "\n", - "#6 [2/5] RUN dnf install -y java-11-openjdk\n", + "#4 [internal] load build context\n", + "#4 transferring context: 8.74kB done\n", + "#4 DONE 0.0s\n", + "\n", + "#6 [3/7] RUN pip install ipython==8.6.0 nbformat==5.7.0\n", "#6 CACHED\n", "\n", - "#7 [3/5] RUN pip install ipython==8.6.0 nbformat==5.7.0\n", + "#7 [2/7] RUN dnf install -y java-11-openjdk\n", "#7 CACHED\n", "\n", - "#8 [4/5] RUN pip install aiobotocore botocore s3fs\n", + "#8 [4/7] RUN pip install aiobotocore botocore s3fs\n", "#8 CACHED\n", "\n", - "#9 [5/5] ADD util-cos.ipynb /opt/app-root/src/\n", - "#9 DONE 0.1s\n", + "#9 [5/7] ADD util-cos.ipynb /opt/app-root/src/\n", + "#9 DONE 0.2s\n", + "\n", + "#10 [6/7] ADD util-cos.ipynb /opt/app-root/src/\n", + "#10 DONE 0.2s\n", + "\n", + "#11 [7/7] RUN chmod -R 777 /opt/app-root/src/\n", + "#11 DONE 0.7s\n", "\n", - "#10 exporting to image\n", - "#10 exporting layers\n", - "#10 exporting layers 2.1s done\n", - "#10 writing image sha256:6ee2f7dd70f11dda51013e41c934e65f3044820b6de12585764329649d4259e8 done\n", - "#10 naming to docker.io/library/claimed-util-cos:0.30 done\n", - "#10 DONE 2.1s\n" + "#12 exporting to image\n", + "#12 exporting layers\n", + "#12 exporting layers 5.9s done\n", + "#12 writing image sha256:43801d41bd756a495aa85fd975efa79e1bc06fe857fcbcc813fe8a73a2b5bf6c done\n", + "#12 naming to docker.io/library/claimed-util-cos:0.32 0.0s done\n", + "#12 DONE 5.9s\n" ] }, { @@ -206,50 +224,61 @@ "output_type": "stream", "text": [ "The push refers to repository [docker.io/romeokienzler/claimed-util-cos]\n", - "eb3bee7d35b9: Preparing\n", - "d58d29942c09: Preparing\n", - "2902622ed6d0: Preparing\n", - "20b51819921d: Preparing\n", - "6d3c2c9872b8: Preparing\n", - "f945189ba7d6: Preparing\n", - "c7649cc32711: Preparing\n", - "486dcc5a5ac3: Preparing\n", - "f945189ba7d6: Waiting\n", - "c7649cc32711: Waiting\n", - "486dcc5a5ac3: Waiting\n", - "6d3c2c9872b8: Layer already exists\n", - "20b51819921d: Layer already exists\n", - "2902622ed6d0: Layer already 
exists\n", - "d58d29942c09: Layer already exists\n", - "c7649cc32711: Layer already exists\n", - "486dcc5a5ac3: Layer already exists\n", - "f945189ba7d6: Layer already exists\n", - "eb3bee7d35b9: Pushed\n", - "0.30: digest: sha256:b2365be03cd64b1821003f8cc6a88b275b3e9b6988ee63a4bc9ec92d62a76a09 size: 2011\n", + "44f526e9989f: Preparing\n", + "5f70bf18a086: Preparing\n", + "e2c24570dcf1: Preparing\n", + "a38988a799dd: Preparing\n", + "151382e656b8: Preparing\n", + "2d89553fcdef: Preparing\n", + "3568498d40ea: Preparing\n", + "e813c91400f3: Preparing\n", + "fb6a7cccdb84: Preparing\n", + "b51194abfc91: Preparing\n", + "3568498d40ea: Waiting\n", + "e813c91400f3: Waiting\n", + "fb6a7cccdb84: Waiting\n", + "2d89553fcdef: Waiting\n", + "5f70bf18a086: Layer already exists\n", + "151382e656b8: Layer already exists\n", + "a38988a799dd: Layer already exists\n", + "3568498d40ea: Layer already exists\n", + "2d89553fcdef: Layer already exists\n", + "e813c91400f3: Layer already exists\n", + "b51194abfc91: Layer already exists\n", + "fb6a7cccdb84: Layer already exists\n", + "44f526e9989f: Pushed\n", + "e2c24570dcf1: Pushed\n", + "0.32: digest: sha256:d3d22874c39ff5273a50b0216a0e72049291ce953521497f3a1c76b6e21713b7 size: 2425\n", "The push refers to repository [docker.io/romeokienzler/claimed-util-cos]\n", - "eb3bee7d35b9: Preparing\n", - "d58d29942c09: Preparing\n", - "2902622ed6d0: Preparing\n", - "20b51819921d: Preparing\n", - "6d3c2c9872b8: Preparing\n", - "f945189ba7d6: Preparing\n", - "c7649cc32711: Preparing\n", - "486dcc5a5ac3: Preparing\n", - "f945189ba7d6: Waiting\n", - "c7649cc32711: Waiting\n", - "486dcc5a5ac3: Waiting\n", - "20b51819921d: Layer already exists\n", - "6d3c2c9872b8: Layer already exists\n", - "eb3bee7d35b9: Layer already exists\n", - "d58d29942c09: Layer already exists\n", - "2902622ed6d0: Layer already exists\n", - "486dcc5a5ac3: Layer already exists\n", - "c7649cc32711: Layer already exists\n", - "f945189ba7d6: Layer already exists\n", - "latest: digest: sha256:b2365be03cd64b1821003f8cc6a88b275b3e9b6988ee63a4bc9ec92d62a76a09 size: 2011\n", + "44f526e9989f: Preparing\n", + "5f70bf18a086: Preparing\n", + "e2c24570dcf1: Preparing\n", + "a38988a799dd: Preparing\n", + "151382e656b8: Preparing\n", + "2d89553fcdef: Preparing\n", + "3568498d40ea: Preparing\n", + "e813c91400f3: Preparing\n", + "fb6a7cccdb84: Preparing\n", + "b51194abfc91: Preparing\n", + "3568498d40ea: Waiting\n", + "e813c91400f3: Waiting\n", + "fb6a7cccdb84: Waiting\n", + "b51194abfc91: Waiting\n", + "2d89553fcdef: Waiting\n", + "e2c24570dcf1: Layer already exists\n", + "5f70bf18a086: Layer already exists\n", + "151382e656b8: Layer already exists\n", + "a38988a799dd: Layer already exists\n", + "44f526e9989f: Layer already exists\n", + "2d89553fcdef: Layer already exists\n", + "e813c91400f3: Layer already exists\n", + "3568498d40ea: Layer already exists\n", + "fb6a7cccdb84: Layer already exists\n", + "b51194abfc91: Layer already exists\n", + "latest: digest: sha256:d3d22874c39ff5273a50b0216a0e72049291ce953521497f3a1c76b6e21713b7 size: 2425\n", "name: util-cos\n", "description: This component provides COS utility functions (e.g. 
creating a bucket, listing contents of a bucket)\n", - " CLAIMED v0.30\n", + " CLAIMED v0.32\n", "\n", "inputs:\n", "- {name: access_key_id, type: String, description: access key id}\n", @@ -260,13 +289,13 @@ "- {name: source, type: String, description: source in case of uploads, default: ''}\n", "- {name: target, type: String, description: target in case of downloads, default: ''}\n", "- {name: recursive, type: Boolean, description: recursive, default: 'False'}\n", - "- {name: operation, type: String, description: operation (mkdir|ls|find|get|put|rm|sync_to_cos|sync_to_local)}\n", + "- {name: operation, type: String, description: operation (mkdir|ls|find|get|put|rm|sync_to_cos|sync_to_local|glob)}\n", "- {name: log_level, type: String, description: log level, default: 'INFO'}\n", "\n", "\n", "implementation:\n", " container:\n", - " image: romeokienzler/claimed-util-cos:0.30\n", + " image: romeokienzler/claimed-util-cos:0.32\n", " command:\n", " - sh\n", " - -ec\n", @@ -282,19 +311,52 @@ " - {inputValue: recursive}\n", " - {inputValue: operation}\n", " - {inputValue: log_level}\n", - "\n" + "\n", + "apiVersion: batch/v1\n", + "kind: Job\n", + "metadata:\n", + " name: util-cos\n", + "spec:\n", + " template:\n", + " spec:\n", + " containers:\n", + " - name: util-cos\n", + " image: docker.io/romeokienzler/claimed-util-cos:0.32\n", + " command: [\"/opt/app-root/bin/ipython\",\"/opt/app-root/src/util-cos.ipynb\"]\n", + " env:\n", + " - name: access_key_id\n", + " value: value_of_access_key_id\n", + " - name: secret_access_key\n", + " value: value_of_secret_access_key\n", + " - name: endpoint\n", + " value: value_of_endpoint\n", + " - name: bucket_name\n", + " value: value_of_bucket_name\n", + " - name: path\n", + " value: value_of_path\n", + " - name: source\n", + " value: value_of_source\n", + " - name: target\n", + " value: value_of_target\n", + " - name: recursive\n", + " value: value_of_recursive\n", + " - name: operation\n", + " value: value_of_operation\n", + " - name: log_level\n", + " value: value_of_log_level\n", + " restartPolicy: OnFailure\n" ] } ], "source": [ "%%bash\n", - "export version=0.30\n", - "ipython generate_kfp_component.ipynb ../../../../component-library/component-library/util/util-cos.ipynb $version" + "export version=0.32\n", + "ipython generate_kfp_component.ipynb notebook_path=../../../component-library/component-library/util/util-cos.ipynb version=$version repository=docker.io/romeokienzler" ] }, { "cell_type": "code", - "execution_count": 39, + "execution_count": 50, "id": "dc39195e", "metadata": {}, "outputs": [ @@ -302,148 +364,40 @@ "name": "stdout", "output_type": "stream", "text": [ - "geo-hls-remove-clouds\n", - "Removes clouds from HLS data CLAIMED v0.34\n", - "{'input_path': {'description': 'path for input', 'type': 'String', 'default': \"'/home/romeokienzler/Downloads/HLS2022/HLS/**/*'\"}, 'target_path': {'description': 'path for output', 'type': 'String', 'default': \"'/home/romeokienzler/Downloads/HLSS30.CF2.v3/'\"}, 'satellite': {'description': 'satellite', 'type': 'String', 'default': \"'HLS.L30'\"}, 'file_filter_pattern': {'description': 'file filter pattern', 'type': 'String', 'default': \"'HLS.S30*0.B*tif'\"}, 'log_level': {'description': 'log level', 'type': 'String', 'default': \" 'INFO'\"}}\n", - "{}\n", - "['pip install xarray matplotlib geopandas rioxarray numpy shapely rasterio pyproj ipython dask distributed jinja2 bokeh ipython nbformat aiobotocore botocore s3fs']\n", - "../../workflows-and-operators/operators/hls_remove_clouds.ipynb\n", - 
"\n", - "FROM registry.access.redhat.com/ubi8/python-39 \n", - "USER root\n", - "RUN dnf install -y java-11-openjdk\n", - "USER default\n", - "RUN pip install ipython==8.6.0 nbformat==5.7.0\n", - "RUN pip install xarray matplotlib geopandas rioxarray numpy shapely rasterio pyproj ipython dask distributed jinja2 bokeh ipython nbformat aiobotocore botocore s3fs\n", - "ADD hls_remove_clouds.ipynb /opt/app-root/src/\n", - "CMD [\"ipython\", \"/opt/app-root/src/hls_remove_clouds.ipynb\"]\n" + "2023-07-20 18:27:17,441 - root - INFO - Logging parameters: repository=\"docker.io/romeokienzler\"\n", + "2023-07-20 18:27:17,441 - root - INFO - Parameter: repository=\"docker.io/romeokienzler\"\n", + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n", + "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)\n", + "Cell \u001b[0;32mIn[1], line 1\u001b[0m\n", + "\u001b[0;32m----> 1\u001b[0m nb \u001b[38;5;241m=\u001b[39m Notebook(\u001b[43mnotebook_path\u001b[49m)\n", + "\n", + "\u001b[0;31mNameError\u001b[0m: name 'notebook_path' is not defined\n" ] }, { - "name": "stderr", - "output_type": "stream", - "text": [ - "#1 [internal] load .dockerignore\n", - "#1 transferring context: 2B done\n", - "#1 DONE 0.0s\n", - "\n", - "#2 [internal] load build definition from Dockerfile\n", - "#2 transferring dockerfile: 526B done\n", - "#2 DONE 0.0s\n", - "\n", - "#3 [internal] load metadata for registry.access.redhat.com/ubi8/python-39:latest\n", - "#3 DONE 0.0s\n", - "\n", - "#4 [1/5] FROM registry.access.redhat.com/ubi8/python-39\n", - "#4 DONE 0.0s\n", - "\n", - "#5 [internal] load build context\n", - "#5 transferring context: 14.31kB done\n", - "#5 DONE 0.0s\n", - "\n", - "#6 [2/5] RUN dnf install -y java-11-openjdk\n", - "#6 CACHED\n", - "\n", - "#7 [3/5] RUN pip install ipython==8.6.0 nbformat==5.7.0\n", - "#7 CACHED\n", - "\n", - "#8 [4/5] RUN pip install xarray matplotlib geopandas rioxarray numpy shapely rasterio pyproj ipython dask distributed jinja2 bokeh ipython nbformat aiobotocore botocore s3fs\n", - "#8 CACHED\n", - "\n", - "#9 [5/5] ADD hls_remove_clouds.ipynb /opt/app-root/src/\n", - "#9 DONE 0.2s\n", - "\n", - "#10 exporting to image\n", - "#10 exporting layers\n", - "#10 exporting layers 2.2s done\n", - "#10 writing image sha256:4fce555cb3149b2f5e15968995c00dd9f9479e7e73eac5cca6d798166038d740 done\n", - "#10 naming to docker.io/library/claimed-geo-hls-remove-clouds:0.34 done\n", - "#10 DONE 2.2s\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "The push refers to repository [docker.io/romeokienzler/claimed-geo-hls-remove-clouds]\n", - "44de34118e36: Preparing\n", - "1bb304a805f1: Preparing\n", - "2902622ed6d0: Preparing\n", - "20b51819921d: Preparing\n", - "6d3c2c9872b8: Preparing\n", - "f945189ba7d6: Preparing\n", - "c7649cc32711: Preparing\n", - "486dcc5a5ac3: Preparing\n", - "c7649cc32711: Waiting\n", - "486dcc5a5ac3: Waiting\n", - "f945189ba7d6: Waiting\n", - "2902622ed6d0: Layer already exists\n", - "20b51819921d: Layer already exists\n", - "6d3c2c9872b8: Layer already exists\n", - "1bb304a805f1: Layer already exists\n", - "f945189ba7d6: Layer already exists\n", - "486dcc5a5ac3: Layer already exists\n", - "c7649cc32711: Layer already exists\n", - "44de34118e36: Pushed\n", - "0.34: digest: sha256:7186445102f56c7fc4c36790bf0f4baf5348a76b3abe2102db156c1b530c96d6 size: 2012\n", - "The push refers to repository [docker.io/romeokienzler/claimed-geo-hls-remove-clouds]\n", - "44de34118e36: Preparing\n", - 
"1bb304a805f1: Preparing\n", - "2902622ed6d0: Preparing\n", - "20b51819921d: Preparing\n", - "6d3c2c9872b8: Preparing\n", - "f945189ba7d6: Preparing\n", - "c7649cc32711: Preparing\n", - "486dcc5a5ac3: Preparing\n", - "c7649cc32711: Waiting\n", - "f945189ba7d6: Waiting\n", - "486dcc5a5ac3: Waiting\n", - "20b51819921d: Layer already exists\n", - "6d3c2c9872b8: Layer already exists\n", - "2902622ed6d0: Layer already exists\n", - "1bb304a805f1: Layer already exists\n", - "44de34118e36: Layer already exists\n", - "f945189ba7d6: Layer already exists\n", - "c7649cc32711: Layer already exists\n", - "486dcc5a5ac3: Layer already exists\n", - "latest: digest: sha256:7186445102f56c7fc4c36790bf0f4baf5348a76b3abe2102db156c1b530c96d6 size: 2012\n", - "name: geo-hls-remove-clouds\n", - "description: Removes clouds from HLS data CLAIMED v0.34\n", - "\n", - "inputs:\n", - "- {name: input_path, type: String, description: path for input, default: '/home/romeokienzler/Downloads/HLS2022/HLS/**/*'}\n", - "- {name: target_path, type: String, description: path for output, default: '/home/romeokienzler/Downloads/HLSS30.CF2.v3/'}\n", - "- {name: satellite, type: String, description: satellite, default: 'HLS.L30'}\n", - "- {name: file_filter_pattern, type: String, description: file filter pattern, default: 'HLS.S30*0.B*tif'}\n", - "- {name: log_level, type: String, description: log level, default: 'INFO'}\n", - "\n", - "\n", - "implementation:\n", - " container:\n", - " image: romeokienzler/claimed-geo-hls-remove-clouds:0.34\n", - " command:\n", - " - sh\n", - " - -ec\n", - " - |\n", - " ipython ./hls_remove_clouds.ipynb input_path=\"$0\" target_path=\"$1\" satellite=\"$2\" file_filter_pattern=\"$3\" log_level=\"$4\" \n", - " - {inputValue: input_path}\n", - " - {inputValue: target_path}\n", - " - {inputValue: satellite}\n", - " - {inputValue: file_filter_pattern}\n", - " - {inputValue: log_level}\n", - "\n" + "ename": "CalledProcessError", + "evalue": "Command 'b'export version=0.34\\nipython generate_kfp_component.ipynb ../../workflows-and-operators/operators/hls_remove_clouds.ipynb $version repository=docker.io/romeokienzler\\n'' returned non-zero exit status 1.", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mCalledProcessError\u001b[0m Traceback (most recent call last)", + "Cell \u001b[0;32mIn[50], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m get_ipython()\u001b[39m.\u001b[39;49mrun_cell_magic(\u001b[39m'\u001b[39;49m\u001b[39mbash\u001b[39;49m\u001b[39m'\u001b[39;49m, \u001b[39m'\u001b[39;49m\u001b[39m'\u001b[39;49m, \u001b[39m'\u001b[39;49m\u001b[39mexport version=0.34\u001b[39;49m\u001b[39m\\n\u001b[39;49;00m\u001b[39mipython generate_kfp_component.ipynb ../../workflows-and-operators/operators/hls_remove_clouds.ipynb $version repository=docker.io/romeokienzler\u001b[39;49m\u001b[39m\\n\u001b[39;49;00m\u001b[39m'\u001b[39;49m)\n", + "File \u001b[0;32m~/gitco/c3/.venv/lib64/python3.10/site-packages/IPython/core/interactiveshell.py:2478\u001b[0m, in \u001b[0;36mInteractiveShell.run_cell_magic\u001b[0;34m(self, magic_name, line, cell)\u001b[0m\n\u001b[1;32m 2476\u001b[0m \u001b[39mwith\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mbuiltin_trap:\n\u001b[1;32m 2477\u001b[0m args \u001b[39m=\u001b[39m (magic_arg_s, cell)\n\u001b[0;32m-> 2478\u001b[0m result \u001b[39m=\u001b[39m fn(\u001b[39m*\u001b[39;49margs, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwargs)\n\u001b[1;32m 
2480\u001b[0m \u001b[39m# The code below prevents the output from being displayed\u001b[39;00m\n\u001b[1;32m 2481\u001b[0m \u001b[39m# when using magics with decodator @output_can_be_silenced\u001b[39;00m\n\u001b[1;32m 2482\u001b[0m \u001b[39m# when the last Python token in the expression is a ';'.\u001b[39;00m\n\u001b[1;32m 2483\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39mgetattr\u001b[39m(fn, magic\u001b[39m.\u001b[39mMAGIC_OUTPUT_CAN_BE_SILENCED, \u001b[39mFalse\u001b[39;00m):\n", + "File \u001b[0;32m~/gitco/c3/.venv/lib64/python3.10/site-packages/IPython/core/magics/script.py:154\u001b[0m, in \u001b[0;36mScriptMagics._make_script_magic..named_script_magic\u001b[0;34m(line, cell)\u001b[0m\n\u001b[1;32m 152\u001b[0m \u001b[39melse\u001b[39;00m:\n\u001b[1;32m 153\u001b[0m line \u001b[39m=\u001b[39m script\n\u001b[0;32m--> 154\u001b[0m \u001b[39mreturn\u001b[39;00m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mshebang(line, cell)\n", + "File \u001b[0;32m~/gitco/c3/.venv/lib64/python3.10/site-packages/IPython/core/magics/script.py:314\u001b[0m, in \u001b[0;36mScriptMagics.shebang\u001b[0;34m(self, line, cell)\u001b[0m\n\u001b[1;32m 309\u001b[0m \u001b[39mif\u001b[39;00m args\u001b[39m.\u001b[39mraise_error \u001b[39mand\u001b[39;00m p\u001b[39m.\u001b[39mreturncode \u001b[39m!=\u001b[39m \u001b[39m0\u001b[39m:\n\u001b[1;32m 310\u001b[0m \u001b[39m# If we get here and p.returncode is still None, we must have\u001b[39;00m\n\u001b[1;32m 311\u001b[0m \u001b[39m# killed it but not yet seen its return code. We don't wait for it,\u001b[39;00m\n\u001b[1;32m 312\u001b[0m \u001b[39m# in case it's stuck in uninterruptible sleep. -9 = SIGKILL\u001b[39;00m\n\u001b[1;32m 313\u001b[0m rc \u001b[39m=\u001b[39m p\u001b[39m.\u001b[39mreturncode \u001b[39mor\u001b[39;00m \u001b[39m-\u001b[39m\u001b[39m9\u001b[39m\n\u001b[0;32m--> 314\u001b[0m \u001b[39mraise\u001b[39;00m CalledProcessError(rc, cell)\n", + "\u001b[0;31mCalledProcessError\u001b[0m: Command 'b'export version=0.34\\nipython generate_kfp_component.ipynb ../../workflows-and-operators/operators/hls_remove_clouds.ipynb $version repository=docker.io/romeokienzler\\n'' returned non-zero exit status 1." 
] } ], "source": [ "%%bash\n", "export version=0.34\n", - "ipython generate_kfp_component.ipynb ../../workflows-and-operators/operators/hls_remove_clouds.ipynb $version\n" + "ipython generate_kfp_component.ipynb ../../workflows-and-operators/operators/hls_remove_clouds.ipynb $version repository=docker.io/romeokienzler\n" ] }, { "cell_type": "code", - "execution_count": 47, + "execution_count": 61, "id": "1df26cfb", "metadata": {}, "outputs": [ @@ -451,15 +405,15 @@ "name": "stdout", "output_type": "stream", "text": [ - "2023-07-20 13:55:30,188 - root - INFO - Logging parameters: notebook_path=\"../../../workflows-and-operators/operators/cgw-hls-remove-clouds.ipynb\"version=\"0.23\"repository=\"docker.io/romeokienzler\"additionl_files=\"../../../workflows-and-operators/operators/hls_remove_clouds.ipynb\"\n", - "2023-07-20 13:55:30,188 - root - INFO - Parameter: notebook_path=\"../../../workflows-and-operators/operators/cgw-hls-remove-clouds.ipynb\"\n", - "2023-07-20 13:55:30,188 - root - INFO - Parameter: version=\"0.23\"\n", - "2023-07-20 13:55:30,188 - root - INFO - Parameter: repository=\"docker.io/romeokienzler\"\n", - "2023-07-20 13:55:30,188 - root - INFO - Parameter: additionl_files=\"../../../workflows-and-operators/operators/hls_remove_clouds.ipynb\"\n", + "2023-07-21 10:53:27,950 - root - INFO - Logging parameters: notebook_path=\"../../../workflows-and-operators/operators/cgw-hls-remove-clouds.ipynb\"version=\"0.29\"repository=\"docker.io/romeokienzler\"additionl_files=\"../../../workflows-and-operators/operators/hls_remove_clouds.ipynb\"\n", + "2023-07-21 10:53:27,950 - root - INFO - Parameter: notebook_path=\"../../../workflows-and-operators/operators/cgw-hls-remove-clouds.ipynb\"\n", + "2023-07-21 10:53:27,951 - root - INFO - Parameter: version=\"0.29\"\n", + "2023-07-21 10:53:27,951 - root - INFO - Parameter: repository=\"docker.io/romeokienzler\"\n", + "2023-07-21 10:53:27,951 - root - INFO - Parameter: additionl_files=\"../../../workflows-and-operators/operators/hls_remove_clouds.ipynb\"\n", "ccgw-hls-remove-clouds\n", "hls_remove_clouds got wrappted by cos_grid_wrapper, which wraps any CLAIMED component and implements the generic grid computing pattern https://romeokienzler.medium.com/the-generic-grid-computing-pattern-transforms-any-sequential-workflow-step-into-a-transient-grid-c7f3ca7459c8 \n", - " CLAIMED v0.23\n", - "{'cgw_source_path': {'description': 'cos path to get job (files) from (including bucket)', 'type': 'String', 'default': None}, 'cgw_source_access_key_id': {'description': 'cgw_source_access_key_id', 'type': 'String', 'default': None}, 'cgw_source_secret_access_key': {'description': 'source_secret_access_key', 'type': 'String', 'default': None}, 'cgw_source_endpoint': {'description': 'source_endpoint', 'type': 'String', 'default': None}, 'cgw_target_access_key_id': {'description': 'cgw_target_access_key_id', 'type': 'String', 'default': None}, 'cgw_target_secret_access_key': {'description': 'cgw_target_secret_access_key', 'type': 'String', 'default': None}, 'cgw_target_endpoint': {'description': 'cgw_target_endpoint', 'type': 'String', 'default': None}, 'cgw_target_path': {'description': 'cgw_target_path (including bucket)', 'type': 'String', 'default': None}, 'cgw_lock_file_suffix': {'description': 'lock file suffix', 'type': 'String', 'default': \" '.lock'\"}, 'cgw_processed_file_suffix': {'description': 'processed file suffix', 'type': 'String', 'default': \" '.processed'\"}, 'log_level': {'description': 'log level', 'type': 'String', 'default': \" 
'INFO'\"}, 'cgw_lock_timeout': {'description': 'timeout in seconds to remove lock file from struggling job, default 1 hour', 'type': 'Integer', 'default': ' 60*60'}, 'cgw_group_by': {'description': 'group files which need to be processed together', 'type': 'String', 'default': ' None'}, 'satellite': {'description': 'satellite', 'type': 'String', 'default': \"'HLS.L30'\"}}\n", + " CLAIMED v0.29\n", + "{'cgw_source_path': {'description': 'cos path to get job (files) from (including bucket)', 'type': 'String', 'default': None}, 'cgw_source_access_key_id': {'description': 'cgw_source_access_key_id', 'type': 'String', 'default': None}, 'cgw_source_secret_access_key': {'description': 'source_secret_access_key', 'type': 'String', 'default': None}, 'cgw_source_endpoint': {'description': 'source_endpoint', 'type': 'String', 'default': None}, 'cgw_target_access_key_id': {'description': 'cgw_target_access_key_id', 'type': 'String', 'default': None}, 'cgw_target_secret_access_key': {'description': 'cgw_target_secret_access_key', 'type': 'String', 'default': None}, 'cgw_target_endpoint': {'description': 'cgw_target_endpoint', 'type': 'String', 'default': None}, 'cgw_target_path': {'description': 'cgw_target_path (including bucket)', 'type': 'String', 'default': None}, 'cgw_lock_file_suffix': {'description': 'lock file suffix', 'type': 'String', 'default': \" '.lock'\"}, 'cgw_processed_file_suffix': {'description': 'processed file suffix', 'type': 'String', 'default': \" '.processed'\"}, 'cgw_log_level': {'description': 'log level', 'type': 'String', 'default': \" 'INFO'\"}, 'cgw_lock_timeout': {'description': 'timeout in seconds to remove lock file from struggling job (default 1 hour)', 'type': 'Integer', 'default': ' 60*60'}, 'cgw_group_by': {'description': 'group files which need to be processed together', 'type': 'String', 'default': ' None'}, 'satellite': {'description': 'satellite', 'type': 'String', 'default': \"'HLS.L30'\"}}\n", "{}\n", "['pip install xarray matplotlib geopandas rioxarray numpy shapely rasterio pyproj ipython dask distributed jinja2 bokeh ipython nbformat aiobotocore botocore s3fs']\n", "../../../workflows-and-operators/operators/cgw-hls-remove-clouds.ipynb\n", @@ -488,7 +442,7 @@ "\n", "#2 [internal] load build definition from Dockerfile\n", "#2 transferring dockerfile: 640B done\n", - "#2 DONE 0.0s\n", + "#2 DONE 0.1s\n", "\n", "#3 [internal] load metadata for registry.access.redhat.com/ubi8/python-39:latest\n", "#3 DONE 0.0s\n", @@ -497,7 +451,7 @@ "#4 DONE 0.0s\n", "\n", "#5 [internal] load build context\n", - "#5 transferring context: 24.77kB done\n", + "#5 transferring context: 25.50kB done\n", "#5 DONE 0.0s\n", "\n", "#6 [2/7] RUN dnf install -y java-11-openjdk\n", @@ -510,20 +464,20 @@ "#8 CACHED\n", "\n", "#9 [5/7] ADD hls_remove_clouds.ipynb /opt/app-root/src/\n", - "#9 DONE 0.1s\n", + "#9 CACHED\n", "\n", "#10 [6/7] ADD cgw-hls-remove-clouds.ipynb /opt/app-root/src/\n", - "#10 DONE 0.1s\n", + "#10 DONE 0.4s\n", "\n", "#11 [7/7] RUN chmod -R 777 /opt/app-root/src/\n", - "#11 DONE 0.7s\n", + "#11 DONE 0.8s\n", "\n", "#12 exporting to image\n", "#12 exporting layers\n", - "#12 exporting layers 7.4s done\n", - "#12 writing image sha256:6642e733d224518402f32661f30f4e8751be64769e342360ef17dd85e8747edc done\n", - "#12 naming to docker.io/library/claimed-ccgw-hls-remove-clouds:0.23 0.0s done\n", - "#12 DONE 7.4s\n" + "#12 exporting layers 7.1s done\n", + "#12 writing image sha256:e5be079e3dc558e7ba26186c603b3bd2acb3a87e013fc9da581db66e10f262f7 done\n", + "#12 naming to 
docker.io/library/claimed-ccgw-hls-remove-clouds:0.29 0.0s done\n", + "#12 DONE 7.1s\n" ] }, { @@ -531,8 +485,8 @@ "output_type": "stream", "text": [ "The push refers to repository [docker.io/romeokienzler/claimed-ccgw-hls-remove-clouds]\n", - "0ec2bcbd8f7f: Preparing\n", - "c9d2cf3919a5: Preparing\n", + "49db0df0ed8e: Preparing\n", + "1eb725d74412: Preparing\n", "a43fa6775580: Preparing\n", "b638398d394b: Preparing\n", "151382e656b8: Preparing\n", @@ -541,25 +495,25 @@ "e813c91400f3: Preparing\n", "fb6a7cccdb84: Preparing\n", "b51194abfc91: Preparing\n", + "2d89553fcdef: Waiting\n", "3568498d40ea: Waiting\n", "e813c91400f3: Waiting\n", "fb6a7cccdb84: Waiting\n", "b51194abfc91: Waiting\n", - "2d89553fcdef: Waiting\n", - "b638398d394b: Layer already exists\n", "151382e656b8: Layer already exists\n", - "3568498d40ea: Layer already exists\n", + "a43fa6775580: Layer already exists\n", + "b638398d394b: Layer already exists\n", "2d89553fcdef: Layer already exists\n", - "fb6a7cccdb84: Layer already exists\n", "e813c91400f3: Layer already exists\n", - "0ec2bcbd8f7f: Pushed\n", - "a43fa6775580: Pushed\n", - "c9d2cf3919a5: Pushed\n", + "3568498d40ea: Layer already exists\n", + "fb6a7cccdb84: Layer already exists\n", "b51194abfc91: Layer already exists\n", - "0.23: digest: sha256:42081189c25e9363364c89437d70eeb599c62e4f99ab6c85df6d8428b81a633a size: 2428\n", + "49db0df0ed8e: Pushed\n", + "1eb725d74412: Pushed\n", + "0.29: digest: sha256:bc946868a164e528ac8019c0187c114d14fdf529078dacb7dc7364c8a767bd7a size: 2428\n", "The push refers to repository [docker.io/romeokienzler/claimed-ccgw-hls-remove-clouds]\n", - "0ec2bcbd8f7f: Preparing\n", - "c9d2cf3919a5: Preparing\n", + "49db0df0ed8e: Preparing\n", + "1eb725d74412: Preparing\n", "a43fa6775580: Preparing\n", "b638398d394b: Preparing\n", "151382e656b8: Preparing\n", @@ -568,25 +522,25 @@ "e813c91400f3: Preparing\n", "fb6a7cccdb84: Preparing\n", "b51194abfc91: Preparing\n", - "e813c91400f3: Waiting\n", + "3568498d40ea: Waiting\n", "fb6a7cccdb84: Waiting\n", "b51194abfc91: Waiting\n", "2d89553fcdef: Waiting\n", - "3568498d40ea: Waiting\n", + "e813c91400f3: Waiting\n", + "a43fa6775580: Layer already exists\n", "b638398d394b: Layer already exists\n", "151382e656b8: Layer already exists\n", - "a43fa6775580: Layer already exists\n", - "0ec2bcbd8f7f: Layer already exists\n", - "c9d2cf3919a5: Layer already exists\n", + "1eb725d74412: Layer already exists\n", + "49db0df0ed8e: Layer already exists\n", "2d89553fcdef: Layer already exists\n", "3568498d40ea: Layer already exists\n", "e813c91400f3: Layer already exists\n", - "fb6a7cccdb84: Layer already exists\n", "b51194abfc91: Layer already exists\n", - "latest: digest: sha256:42081189c25e9363364c89437d70eeb599c62e4f99ab6c85df6d8428b81a633a size: 2428\n", + "fb6a7cccdb84: Layer already exists\n", + "latest: digest: sha256:bc946868a164e528ac8019c0187c114d14fdf529078dacb7dc7364c8a767bd7a size: 2428\n", "name: ccgw-hls-remove-clouds\n", "description: hls_remove_clouds got wrappted by cos_grid_wrapper, which wraps any CLAIMED component and implements the generic grid computing pattern https://romeokienzler.medium.com/the-generic-grid-computing-pattern-transforms-any-sequential-workflow-step-into-a-transient-grid-c7f3ca7459c8 \n", - " CLAIMED v0.23\n", + " CLAIMED v0.29\n", "\n", "inputs:\n", "- {name: cgw_source_path, type: String, description: cos path to get job (files) from (including bucket)}\n", @@ -599,20 +553,20 @@ "- {name: cgw_target_path, type: String, description: cgw_target_path (including bucket)}\n", "- 
{name: cgw_lock_file_suffix, type: String, description: lock file suffix, default: '.lock'}\n", "- {name: cgw_processed_file_suffix, type: String, description: processed file suffix, default: '.processed'}\n", - "- {name: log_level, type: String, description: log level, default: 'INFO'}\n", - "- {name: cgw_lock_timeout, type: Integer, description: timeout in seconds to remove lock file from struggling job, default 1 hour, default: 60*60}\n", + "- {name: cgw_log_level, type: String, description: log level, default: 'INFO'}\n", + "- {name: cgw_lock_timeout, type: Integer, description: timeout in seconds to remove lock file from struggling job (default 1 hour), default: 60*60}\n", "- {name: cgw_group_by, type: String, description: group files which need to be processed together, default: None}\n", "- {name: satellite, type: String, description: satellite, default: 'HLS.L30'}\n", "\n", "\n", "implementation:\n", " container:\n", - " image: romeokienzler/claimed-ccgw-hls-remove-clouds:0.23\n", + " image: romeokienzler/claimed-ccgw-hls-remove-clouds:0.29\n", " command:\n", " - sh\n", " - -ec\n", " - |\n", - " ipython ./cgw-hls-remove-clouds.ipynb cgw_source_path=\"$0\" cgw_source_access_key_id=\"$1\" cgw_source_secret_access_key=\"$2\" cgw_source_endpoint=\"$3\" cgw_target_access_key_id=\"$4\" cgw_target_secret_access_key=\"$5\" cgw_target_endpoint=\"$6\" cgw_target_path=\"$7\" cgw_lock_file_suffix=\"$8\" cgw_processed_file_suffix=\"$9\" log_level=\"$10\" cgw_lock_timeout=\"$11\" cgw_group_by=\"$12\" satellite=\"$13\" \n", + " ipython ./cgw-hls-remove-clouds.ipynb cgw_source_path=\"$0\" cgw_source_access_key_id=\"$1\" cgw_source_secret_access_key=\"$2\" cgw_source_endpoint=\"$3\" cgw_target_access_key_id=\"$4\" cgw_target_secret_access_key=\"$5\" cgw_target_endpoint=\"$6\" cgw_target_path=\"$7\" cgw_lock_file_suffix=\"$8\" cgw_processed_file_suffix=\"$9\" cgw_log_level=\"$10\" cgw_lock_timeout=\"$11\" cgw_group_by=\"$12\" satellite=\"$13\" \n", " - {inputValue: cgw_source_path}\n", " - {inputValue: cgw_source_access_key_id}\n", " - {inputValue: cgw_source_secret_access_key}\n", @@ -623,7 +577,7 @@ " - {inputValue: cgw_target_path}\n", " - {inputValue: cgw_lock_file_suffix}\n", " - {inputValue: cgw_processed_file_suffix}\n", - " - {inputValue: log_level}\n", + " - {inputValue: cgw_log_level}\n", " - {inputValue: cgw_lock_timeout}\n", " - {inputValue: cgw_group_by}\n", " - {inputValue: satellite}\n", @@ -637,7 +591,7 @@ " spec:\n", " containers:\n", " - name: ccgw-hls-remove-clouds\n", - " image: docker.io/romeokienzler/claimed-ccgw-hls-remove-clouds:0.23\n", + " image: docker.io/romeokienzler/claimed-ccgw-hls-remove-clouds:0.29\n", " command: [\"/opt/app-root/bin/ipython\",\"/opt/app-root/src/cgw-hls-remove-clouds.ipynb\"]\n", " env:\n", " - name: cgw_source_path\n", @@ -660,8 +614,8 @@ " value: value_of_cgw_lock_file_suffix\n", " - name: cgw_processed_file_suffix\n", " value: value_of_cgw_processed_file_suffix\n", - " - name: log_level\n", - " value: value_of_log_level\n", + " - name: cgw_log_level\n", + " value: value_of_cgw_log_level\n", " - name: cgw_lock_timeout\n", " value: value_of_cgw_lock_timeout\n", " - name: cgw_group_by\n", @@ -674,7 +628,7 @@ ], "source": [ "%%bash\n", - "export version=0.24\n", + "export version=0.30\n", "ipython generate_kfp_component.ipynb notebook_path=../../../workflows-and-operators/operators/cgw-hls-remove-clouds.ipynb version=$version repository=docker.io/romeokienzler additionl_files=../../../workflows-and-operators/operators/hls_remove_clouds.ipynb" ] }, 
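The diffs above replace the positional CLI of generate_kfp_component.ipynb with named name=value parameters: every sys.argv token matching name=value is rewritten into a quoted Python assignment and exec'd, so notebook_path, version, repository and the optional additionl_files end up as ordinary string variables inside the notebook. A minimal sketch of that pattern, simplified from the regex/exec cell in the patch (the argument values below are invented for illustration):

    import re

    # Simplified restatement of the argv handling the patch adds: keep only
    # name=value tokens, wrap the value in quotes, and exec the assignment.
    def parse_parameters(argv):
        assignments = []
        for token in argv:
            if re.match(r'^[A-Za-z0-9_]+=[./A-Za-z0-9_:-]*$', token):
                name, value = token.split('=', 1)
                assignments.append(f'{name} = "{value}"')
        return assignments

    # Stand-in for sys.argv; all values here are placeholders, not real paths.
    argv = ['generate_kfp_component.ipynb',
            'notebook_path=operators/some-component.ipynb',
            'version=0.30',
            'repository=docker.io/someuser']

    for assignment in parse_parameters(argv):
        exec(assignment)  # defines notebook_path, version, repository as strings
    print(notebook_path, version, repository)

Because each value is wrapped in double quotes before exec, every parameter arrives as a string, which is why the log lines above report version="0.30" rather than a number. The hunk below then adds a check_variable helper so that additionl_files stays optional.

The other new technique in these patches is the Job manifest generation: one env entry is rendered per component input and spliced into a batch/v1 Job template, which is written next to the notebook as a .job.yaml file. A runnable approximation, assuming placeholder values for name, repository, version, target_code and inputs (indentation approximated from the manifest printed in the log above):

    # Each component input becomes an env var with a value_of_ placeholder.
    name = 'util-cos'
    repository = 'docker.io/someuser'
    version = '0.30'
    target_code = 'util-cos.ipynb'
    inputs = {'access_key_id': {}, 'operation': {}}  # trimmed input spec

    env_entries = '\n'.join(
        f'        - name: {key}\n          value: value_of_{key}'
        for key in inputs
    )

    job_yaml = f'''apiVersion: batch/v1
    kind: Job
    metadata:
      name: {name}
    spec:
      template:
        spec:
          containers:
          - name: {name}
            image: {repository}/claimed-{name}:{version}
            command: ["/opt/app-root/bin/ipython","/opt/app-root/src/{target_code}"]
            env:
    {env_entries}
          restartPolicy: OnFailure'''

    print(job_yaml)

Every input is thus exposed as an env var with a value_of_ placeholder, presumably so the generated .job.yaml can be filled in and applied directly to a cluster without touching the component itself.
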
diff --git a/src/c3/generate_kfp_component.ipynb b/src/c3/generate_kfp_component.ipynb index 4369a939..34b7556b 100644 --- a/src/c3/generate_kfp_component.ipynb +++ b/src/c3/generate_kfp_component.ipynb @@ -112,12 +112,13 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 4, "id": "e271b688-b307-4bc9-803c-4c8bed0761ae", "metadata": {}, "outputs": [], "source": [ - "#!jupyter nbconvert --to script `echo {notebook_path}` " + "def check_variable(var_name):\n", + " return var_name in locals() or var_name in globals()\n" ] }, { @@ -130,8 +131,11 @@ "#target_code = notebook_path.replace('.ipynb','.py').split('/')[-1:][0]\n", "target_code = notebook_path.split('/')[-1:][0]\n", "shutil.copy(notebook_path,target_code)\n", - "additionl_files_local = additionl_files.split('/')[-1:][0]\n", - "shutil.copy(additionl_files,additionl_files_local)" + "if check_variable('additionl_files'):\n", + " additionl_files_local = additionl_files.split('/')[-1:][0]\n", + " shutil.copy(additionl_files,additionl_files_local)\n", + "else:\n", + " additionl_files_local=target_code #hack" ] }, { diff --git a/src/c3/notebook.py b/src/c3/notebook.py index 611ea3ec..e4f2408b 100644 --- a/src/c3/notebook.py +++ b/src/c3/notebook.py @@ -20,6 +20,8 @@ def _get_env_vars(self): for line in self.notebook['cells'][4]['source']: if re.search("[\"']" + env_name + "[\"']", line): assert '#' in comment_line, "comment line didn't contain #" + assert ',' not in comment_line, "comment line contains ," + if "int(" in line: type = 'Integer' elif "float(" in line: From fa698c5babaa9a18c7775f1f6de5efac8db88eaf Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Fri, 21 Jul 2023 17:43:26 +0200 Subject: [PATCH 042/177] changes for hiwot --- src/c3/create_component_library.ipynb | 225 +++++++++----------------- 1 file changed, 72 insertions(+), 153 deletions(-) diff --git a/src/c3/create_component_library.ipynb b/src/c3/create_component_library.ipynb index f63beaba..4322b520 100644 --- a/src/c3/create_component_library.ipynb +++ b/src/c3/create_component_library.ipynb @@ -397,7 +397,7 @@ }, { "cell_type": "code", - "execution_count": 61, + "execution_count": 62, "id": "1df26cfb", "metadata": {}, "outputs": [ @@ -405,14 +405,14 @@ "name": "stdout", "output_type": "stream", "text": [ - "2023-07-21 10:53:27,950 - root - INFO - Logging parameters: notebook_path=\"../../../workflows-and-operators/operators/cgw-hls-remove-clouds.ipynb\"version=\"0.29\"repository=\"docker.io/romeokienzler\"additionl_files=\"../../../workflows-and-operators/operators/hls_remove_clouds.ipynb\"\n", - "2023-07-21 10:53:27,950 - root - INFO - Parameter: notebook_path=\"../../../workflows-and-operators/operators/cgw-hls-remove-clouds.ipynb\"\n", - "2023-07-21 10:53:27,951 - root - INFO - Parameter: version=\"0.29\"\n", - "2023-07-21 10:53:27,951 - root - INFO - Parameter: repository=\"docker.io/romeokienzler\"\n", - "2023-07-21 10:53:27,951 - root - INFO - Parameter: additionl_files=\"../../../workflows-and-operators/operators/hls_remove_clouds.ipynb\"\n", + "2023-07-21 11:11:41,617 - root - INFO - Logging parameters: notebook_path=\"../../../workflows-and-operators/operators/cgw-hls-remove-clouds.ipynb\"version=\"0.30\"repository=\"docker.io/romeokienzler\"additionl_files=\"../../../workflows-and-operators/operators/hls_remove_clouds.ipynb\"\n", + "2023-07-21 11:11:41,618 - root - INFO - Parameter: notebook_path=\"../../../workflows-and-operators/operators/cgw-hls-remove-clouds.ipynb\"\n", + "2023-07-21 11:11:41,618 - root - INFO - Parameter: 
version=\"0.30\"\n", + "2023-07-21 11:11:41,618 - root - INFO - Parameter: repository=\"docker.io/romeokienzler\"\n", + "2023-07-21 11:11:41,619 - root - INFO - Parameter: additionl_files=\"../../../workflows-and-operators/operators/hls_remove_clouds.ipynb\"\n", "ccgw-hls-remove-clouds\n", "hls_remove_clouds got wrappted by cos_grid_wrapper, which wraps any CLAIMED component and implements the generic grid computing pattern https://romeokienzler.medium.com/the-generic-grid-computing-pattern-transforms-any-sequential-workflow-step-into-a-transient-grid-c7f3ca7459c8 \n", - " CLAIMED v0.29\n", + " CLAIMED v0.30\n", "{'cgw_source_path': {'description': 'cos path to get job (files) from (including bucket)', 'type': 'String', 'default': None}, 'cgw_source_access_key_id': {'description': 'cgw_source_access_key_id', 'type': 'String', 'default': None}, 'cgw_source_secret_access_key': {'description': 'source_secret_access_key', 'type': 'String', 'default': None}, 'cgw_source_endpoint': {'description': 'source_endpoint', 'type': 'String', 'default': None}, 'cgw_target_access_key_id': {'description': 'cgw_target_access_key_id', 'type': 'String', 'default': None}, 'cgw_target_secret_access_key': {'description': 'cgw_target_secret_access_key', 'type': 'String', 'default': None}, 'cgw_target_endpoint': {'description': 'cgw_target_endpoint', 'type': 'String', 'default': None}, 'cgw_target_path': {'description': 'cgw_target_path (including bucket)', 'type': 'String', 'default': None}, 'cgw_lock_file_suffix': {'description': 'lock file suffix', 'type': 'String', 'default': \" '.lock'\"}, 'cgw_processed_file_suffix': {'description': 'processed file suffix', 'type': 'String', 'default': \" '.processed'\"}, 'cgw_log_level': {'description': 'log level', 'type': 'String', 'default': \" 'INFO'\"}, 'cgw_lock_timeout': {'description': 'timeout in seconds to remove lock file from struggling job (default 1 hour)', 'type': 'Integer', 'default': ' 60*60'}, 'cgw_group_by': {'description': 'group files which need to be processed together', 'type': 'String', 'default': ' None'}, 'satellite': {'description': 'satellite', 'type': 'String', 'default': \"'HLS.L30'\"}}\n", "{}\n", "['pip install xarray matplotlib geopandas rioxarray numpy shapely rasterio pyproj ipython dask distributed jinja2 bokeh ipython nbformat aiobotocore botocore s3fs']\n", @@ -436,13 +436,13 @@ "name": "stderr", "output_type": "stream", "text": [ - "#1 [internal] load .dockerignore\n", - "#1 transferring context: 2B done\n", + "#1 [internal] load build definition from Dockerfile\n", + "#1 transferring dockerfile: 640B done\n", "#1 DONE 0.0s\n", "\n", - "#2 [internal] load build definition from Dockerfile\n", - "#2 transferring dockerfile: 640B done\n", - "#2 DONE 0.1s\n", + "#2 [internal] load .dockerignore\n", + "#2 transferring context: 2B done\n", + "#2 DONE 0.0s\n", "\n", "#3 [internal] load metadata for registry.access.redhat.com/ubi8/python-39:latest\n", "#3 DONE 0.0s\n", @@ -451,33 +451,33 @@ "#4 DONE 0.0s\n", "\n", "#5 [internal] load build context\n", - "#5 transferring context: 25.50kB done\n", + "#5 transferring context: 25.49kB done\n", "#5 DONE 0.0s\n", "\n", - "#6 [2/7] RUN dnf install -y java-11-openjdk\n", + "#6 [4/7] RUN pip install xarray matplotlib geopandas rioxarray numpy shapely rasterio pyproj ipython dask distributed jinja2 bokeh ipython nbformat aiobotocore botocore s3fs\n", "#6 CACHED\n", "\n", - "#7 [3/7] RUN pip install ipython==8.6.0 nbformat==5.7.0\n", + "#7 [2/7] RUN dnf install -y java-11-openjdk\n", "#7 CACHED\n", "\n", 
- "#8 [4/7] RUN pip install xarray matplotlib geopandas rioxarray numpy shapely rasterio pyproj ipython dask distributed jinja2 bokeh ipython nbformat aiobotocore botocore s3fs\n", + "#8 [3/7] RUN pip install ipython==8.6.0 nbformat==5.7.0\n", "#8 CACHED\n", "\n", "#9 [5/7] ADD hls_remove_clouds.ipynb /opt/app-root/src/\n", "#9 CACHED\n", "\n", "#10 [6/7] ADD cgw-hls-remove-clouds.ipynb /opt/app-root/src/\n", - "#10 DONE 0.4s\n", + "#10 DONE 0.2s\n", "\n", "#11 [7/7] RUN chmod -R 777 /opt/app-root/src/\n", - "#11 DONE 0.8s\n", + "#11 DONE 0.7s\n", "\n", "#12 exporting to image\n", "#12 exporting layers\n", - "#12 exporting layers 7.1s done\n", - "#12 writing image sha256:e5be079e3dc558e7ba26186c603b3bd2acb3a87e013fc9da581db66e10f262f7 done\n", - "#12 naming to docker.io/library/claimed-ccgw-hls-remove-clouds:0.29 0.0s done\n", - "#12 DONE 7.1s\n" + "#12 exporting layers 7.0s done\n", + "#12 writing image sha256:2d5e087b09230262740a48e74c38b7204cb730091a81a0c7c8a2d8f7ef35a6eb done\n", + "#12 naming to docker.io/library/claimed-ccgw-hls-remove-clouds:0.30 0.0s done\n", + "#12 DONE 7.0s\n" ] }, { @@ -485,8 +485,8 @@ "output_type": "stream", "text": [ "The push refers to repository [docker.io/romeokienzler/claimed-ccgw-hls-remove-clouds]\n", - "49db0df0ed8e: Preparing\n", - "1eb725d74412: Preparing\n", + "8c83b3d3eb0f: Preparing\n", + "f01eaaffe9ea: Preparing\n", "a43fa6775580: Preparing\n", "b638398d394b: Preparing\n", "151382e656b8: Preparing\n", @@ -497,23 +497,21 @@ "b51194abfc91: Preparing\n", "2d89553fcdef: Waiting\n", "3568498d40ea: Waiting\n", - "e813c91400f3: Waiting\n", "fb6a7cccdb84: Waiting\n", - "b51194abfc91: Waiting\n", "151382e656b8: Layer already exists\n", "a43fa6775580: Layer already exists\n", "b638398d394b: Layer already exists\n", - "2d89553fcdef: Layer already exists\n", "e813c91400f3: Layer already exists\n", + "2d89553fcdef: Layer already exists\n", "3568498d40ea: Layer already exists\n", - "fb6a7cccdb84: Layer already exists\n", "b51194abfc91: Layer already exists\n", - "49db0df0ed8e: Pushed\n", - "1eb725d74412: Pushed\n", - "0.29: digest: sha256:bc946868a164e528ac8019c0187c114d14fdf529078dacb7dc7364c8a767bd7a size: 2428\n", + "fb6a7cccdb84: Layer already exists\n", + "f01eaaffe9ea: Pushed\n", + "8c83b3d3eb0f: Pushed\n", + "0.30: digest: sha256:61041d44168883c1b2ff07abb249832ce6eb6457f099e6e2e20faaca60f74cf3 size: 2428\n", "The push refers to repository [docker.io/romeokienzler/claimed-ccgw-hls-remove-clouds]\n", - "49db0df0ed8e: Preparing\n", - "1eb725d74412: Preparing\n", + "8c83b3d3eb0f: Preparing\n", + "f01eaaffe9ea: Preparing\n", "a43fa6775580: Preparing\n", "b638398d394b: Preparing\n", "151382e656b8: Preparing\n", @@ -522,25 +520,25 @@ "e813c91400f3: Preparing\n", "fb6a7cccdb84: Preparing\n", "b51194abfc91: Preparing\n", + "e813c91400f3: Waiting\n", + "2d89553fcdef: Waiting\n", "3568498d40ea: Waiting\n", "fb6a7cccdb84: Waiting\n", "b51194abfc91: Waiting\n", - "2d89553fcdef: Waiting\n", - "e813c91400f3: Waiting\n", "a43fa6775580: Layer already exists\n", - "b638398d394b: Layer already exists\n", + "f01eaaffe9ea: Layer already exists\n", "151382e656b8: Layer already exists\n", - "1eb725d74412: Layer already exists\n", - "49db0df0ed8e: Layer already exists\n", - "2d89553fcdef: Layer already exists\n", + "b638398d394b: Layer already exists\n", + "8c83b3d3eb0f: Layer already exists\n", "3568498d40ea: Layer already exists\n", + "2d89553fcdef: Layer already exists\n", + "fb6a7cccdb84: Layer already exists\n", "e813c91400f3: Layer already exists\n", "b51194abfc91: Layer 
already exists\n", - "fb6a7cccdb84: Layer already exists\n", - "latest: digest: sha256:bc946868a164e528ac8019c0187c114d14fdf529078dacb7dc7364c8a767bd7a size: 2428\n", + "latest: digest: sha256:61041d44168883c1b2ff07abb249832ce6eb6457f099e6e2e20faaca60f74cf3 size: 2428\n", "name: ccgw-hls-remove-clouds\n", "description: hls_remove_clouds got wrappted by cos_grid_wrapper, which wraps any CLAIMED component and implements the generic grid computing pattern https://romeokienzler.medium.com/the-generic-grid-computing-pattern-transforms-any-sequential-workflow-step-into-a-transient-grid-c7f3ca7459c8 \n", - " CLAIMED v0.29\n", + " CLAIMED v0.30\n", "\n", "inputs:\n", "- {name: cgw_source_path, type: String, description: cos path to get job (files) from (including bucket)}\n", @@ -561,7 +559,7 @@ "\n", "implementation:\n", " container:\n", - " image: romeokienzler/claimed-ccgw-hls-remove-clouds:0.29\n", + " image: romeokienzler/claimed-ccgw-hls-remove-clouds:0.30\n", " command:\n", " - sh\n", " - -ec\n", @@ -591,7 +589,7 @@ " spec:\n", " containers:\n", " - name: ccgw-hls-remove-clouds\n", - " image: docker.io/romeokienzler/claimed-ccgw-hls-remove-clouds:0.29\n", + " image: docker.io/romeokienzler/claimed-ccgw-hls-remove-clouds:0.30\n", " command: [\"/opt/app-root/bin/ipython\",\"/opt/app-root/src/cgw-hls-remove-clouds.ipynb\"]\n", " env:\n", " - name: cgw_source_path\n", @@ -652,7 +650,7 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 64, "id": "70241e41", "metadata": {}, "outputs": [ @@ -660,123 +658,44 @@ "name": "stdout", "output_type": "stream", "text": [ - "planet-data-downloader\n", - "downloads data from planet CLAIMED v0.2\n", - "{'api_key': {'description': 'if your Planet API Key is not set as an environment variable, you can paste it below', 'type': 'String', 'default': \" 'PLAK710197d73217456199842c8db825a773'\"}, 'api_url': {'description': 'if your Planet API url is not set as an environment variable, you can paste it below', 'type': 'String', 'default': \" 'https://api.planet.com/basemaps/v1/mosaics'\"}, 'log_level': {'description': 'log_level', 'type': 'String', 'default': \" 'INFO'\"}}\n", - "{}\n", - "['pip install shapely geopandas ipython ']\n", - "../../../workflows-and-operators/operators/planetdownloader.ipynb\n", + "2023-07-21 17:32:01,822 - root - INFO - Logging parameters: notebook_path=\"../../workflows-and-operators/operators/planetdownloader.ipynb\"version=\"0.36\"repository=\"us.icr.io/geodn\"\n", + "2023-07-21 17:32:01,823 - root - INFO - Parameter: notebook_path=\"../../workflows-and-operators/operators/planetdownloader.ipynb\"\n", + "2023-07-21 17:32:01,823 - root - INFO - Parameter: version=\"0.36\"\n", + "2023-07-21 17:32:01,823 - root - INFO - Parameter: repository=\"us.icr.io/geodn\"\n", + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n", + "\u001b[0;31mFileNotFoundError\u001b[0m Traceback (most recent call last)\n", + "Cell \u001b[0;32mIn[1], line 1\u001b[0m\n", + "\u001b[0;32m----> 1\u001b[0m nb \u001b[38;5;241m=\u001b[39m \u001b[43mNotebook\u001b[49m\u001b[43m(\u001b[49m\u001b[43mnotebook_path\u001b[49m\u001b[43m)\u001b[49m\n", "\n", - "FROM registry.access.redhat.com/ubi8/python-39 \n", - "USER root\n", - "RUN dnf install -y java-11-openjdk\n", - "USER default\n", - "RUN pip install ipython==8.6.0 nbformat==5.7.0\n", - "RUN pip install shapely geopandas ipython \n", - "ADD planetdownloader.ipynb /opt/app-root/src/\n", - "CMD [\"ipython\", 
\"/opt/app-root/src/planetdownloader.ipynb\"]\n" + "File \u001b[0;32m~/gitco/c3/src/c3/notebook.py:8\u001b[0m, in \u001b[0;36mNotebook.__init__\u001b[0;34m(self, path)\u001b[0m\n", + "\u001b[1;32m 6\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21m__init__\u001b[39m(\u001b[38;5;28mself\u001b[39m, path):\n", + "\u001b[1;32m 7\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mpath \u001b[38;5;241m=\u001b[39m path\n", + "\u001b[0;32m----> 8\u001b[0m \u001b[38;5;28;01mwith\u001b[39;00m \u001b[38;5;28;43mopen\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43mpath\u001b[49m\u001b[43m)\u001b[49m \u001b[38;5;28;01mas\u001b[39;00m json_file:\n", + "\u001b[1;32m 9\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mnotebook \u001b[38;5;241m=\u001b[39m json\u001b[38;5;241m.\u001b[39mload(json_file)\n", + "\u001b[1;32m 10\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mname \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mnotebook[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mcells\u001b[39m\u001b[38;5;124m'\u001b[39m][\u001b[38;5;241m0\u001b[39m][\u001b[38;5;124m'\u001b[39m\u001b[38;5;124msource\u001b[39m\u001b[38;5;124m'\u001b[39m][\u001b[38;5;241m0\u001b[39m]\u001b[38;5;241m.\u001b[39mreplace(\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m#\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;124m'\u001b[39m\u001b[38;5;124m'\u001b[39m)\u001b[38;5;241m.\u001b[39mstrip()\n", + "\n", + "\u001b[0;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: '../../workflows-and-operators/operators/planetdownloader.ipynb'\n" ] }, { - "name": "stderr", - "output_type": "stream", - "text": [ - "#1 [internal] load .dockerignore\n", - "#1 transferring context: 2B done\n", - "#1 DONE 0.0s\n", - "\n", - "#2 [internal] load build definition from Dockerfile\n", - "#2 transferring dockerfile: 402B done\n", - "#2 DONE 0.1s\n", - "\n", - "#3 [internal] load metadata for registry.access.redhat.com/ubi8/python-39:latest\n", - "#3 DONE 0.0s\n", - "\n", - "#4 [1/5] FROM registry.access.redhat.com/ubi8/python-39\n", - "#4 DONE 0.0s\n", - "\n", - "#5 [2/5] RUN dnf install -y java-11-openjdk\n", - "#5 CACHED\n", - "\n", - "#6 [3/5] RUN pip install ipython==8.6.0 nbformat==5.7.0\n", - "#6 CACHED\n", - "\n", - "#7 [internal] load build context\n", - "#7 transferring context: 5.75kB done\n", - "#7 DONE 0.0s\n", - "\n", - "#8 [4/5] RUN pip install shapely geopandas ipython\n", - "#8 3.623 Collecting shapely\n", - "#8 3.749 Downloading shapely-2.0.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.3 MB)\n", - "#8 4.577 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.3/2.3 MB 2.8 MB/s eta 0:00:00\n", - "#8 4.742 Collecting geopandas\n", - "#8 4.778 Downloading geopandas-0.13.2-py3-none-any.whl (1.1 MB)\n", - "#8 5.213 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 2.6 MB/s eta 0:00:00\n", - "#8 5.237 Requirement already satisfied: ipython in /opt/app-root/lib/python3.9/site-packages (8.6.0)\n", - "#8 6.880 Collecting numpy>=1.14\n", - "#8 6.946 Downloading numpy-1.25.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.7 MB)\n", - "#8 13.43 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17.7/17.7 MB 3.4 MB/s eta 0:00:00\n", - "#8 13.89 Collecting packaging\n", - "#8 13.95 Downloading packaging-23.1-py3-none-any.whl (48 kB)\n", - "#8 13.99 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 48.9/48.9 kB 1.4 MB/s eta 0:00:00\n", - "#8 14.56 Collecting fiona>=1.8.19\n", - "#8 14.61 Downloading 
Fiona-1.9.4.post1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.4 MB)\n", - "#8 18.86 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16.4/16.4 MB 3.9 MB/s eta 0:00:00\n", - "#8 20.14 Collecting pandas>=1.1.0\n", - "#8 20.23 Downloading pandas-2.0.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.4 MB)\n", - "#8 22.88 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.4/12.4 MB 4.7 MB/s eta 0:00:00\n", - "#8 23.74 Collecting pyproj>=3.0.1\n", - "#8 23.78 Downloading pyproj-3.6.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.9 MB)\n", - "#8 25.56 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.9/7.9 MB 4.4 MB/s eta 0:00:00\n", - "#8 25.96 Requirement already satisfied: pickleshare in /opt/app-root/lib/python3.9/site-packages (from ipython) (0.7.5)\n", - "#8 25.96 Requirement already satisfied: pexpect>4.3 in /opt/app-root/lib/python3.9/site-packages (from ipython) (4.8.0)\n", - "#8 25.96 Requirement already satisfied: backcall in /opt/app-root/lib/python3.9/site-packages (from ipython) (0.2.0)\n", - "#8 25.97 Requirement already satisfied: stack-data in /opt/app-root/lib/python3.9/site-packages (from ipython) (0.6.2)\n", - "#8 25.97 Requirement already satisfied: pygments>=2.4.0 in /opt/app-root/lib/python3.9/site-packages (from ipython) (2.15.1)\n", - "#8 25.97 Requirement already satisfied: jedi>=0.16 in /opt/app-root/lib/python3.9/site-packages (from ipython) (0.18.2)\n", - "#8 25.98 Requirement already satisfied: matplotlib-inline in /opt/app-root/lib/python3.9/site-packages (from ipython) (0.1.6)\n", - "#8 25.98 Requirement already satisfied: traitlets>=5 in /opt/app-root/lib/python3.9/site-packages (from ipython) (5.9.0)\n", - "#8 25.98 Requirement already satisfied: prompt-toolkit<3.1.0,>3.0.1 in /opt/app-root/lib/python3.9/site-packages (from ipython) (3.0.39)\n", - "#8 25.99 Requirement already satisfied: decorator in /opt/app-root/lib/python3.9/site-packages (from ipython) (5.1.1)\n", - "#8 26.06 Requirement already satisfied: attrs>=19.2.0 in /opt/app-root/lib/python3.9/site-packages (from fiona>=1.8.19->geopandas) (23.1.0)\n", - "#8 26.18 Collecting cligj>=0.5\n", - "#8 26.23 Downloading cligj-0.7.2-py3-none-any.whl (7.1 kB)\n", - "#8 26.32 Collecting click-plugins>=1.0\n", - "#8 26.39 Downloading click_plugins-1.1.1-py2.py3-none-any.whl (7.5 kB)\n", - "#8 26.40 Requirement already satisfied: six in /opt/app-root/lib/python3.9/site-packages (from fiona>=1.8.19->geopandas) (1.16.0)\n", - "#8 26.54 Collecting click~=8.0\n", - "#8 26.61 Downloading click-8.1.6-py3-none-any.whl (97 kB)\n", - "#8 26.63 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 97.9/97.9 kB 5.0 MB/s eta 0:00:00\n", - "#8 26.81 Collecting certifi\n", - "#8 26.84 Downloading certifi-2023.5.7-py3-none-any.whl (156 kB)\n", - "#8 26.87 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 157.0/157.0 kB 6.5 MB/s eta 0:00:00\n", - "#8 27.28 Collecting importlib-metadata\n", - "#8 27.35 Downloading importlib_metadata-6.8.0-py3-none-any.whl (22 kB)\n", - "#8 27.53 Requirement already satisfied: parso<0.9.0,>=0.8.0 in /opt/app-root/lib/python3.9/site-packages (from jedi>=0.16->ipython) (0.8.3)\n", - "#8 27.98 Collecting python-dateutil>=2.8.2\n", - "#8 28.01 Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)\n", - "#8 28.08 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 247.7/247.7 kB 4.1 MB/s eta 0:00:00\n", - "#8 28.41 Collecting pytz>=2020.1\n", - "#8 28.44 Downloading pytz-2023.3-py2.py3-none-any.whl (502 kB)\n", - "#8 28.54 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 502.3/502.3 kB 5.6 MB/s eta 0:00:00\n", - "#8 
28.78 Collecting tzdata>=2022.1\n", - "#8 28.81 Downloading tzdata-2023.3-py2.py3-none-any.whl (341 kB)\n", - "#8 28.87 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 341.8/341.8 kB 6.4 MB/s eta 0:00:00\n", - "#8 28.91 Requirement already satisfied: ptyprocess>=0.5 in /opt/app-root/lib/python3.9/site-packages (from pexpect>4.3->ipython) (0.7.0)\n", - "#8 28.92 Requirement already satisfied: wcwidth in /opt/app-root/lib/python3.9/site-packages (from prompt-toolkit<3.1.0,>3.0.1->ipython) (0.2.6)\n", - "#8 29.07 Requirement already satisfied: executing>=1.2.0 in /opt/app-root/lib/python3.9/site-packages (from stack-data->ipython) (1.2.0)\n", - "#8 29.07 Requirement already satisfied: asttokens>=2.1.0 in /opt/app-root/lib/python3.9/site-packages (from stack-data->ipython) (2.2.1)\n", - "#8 29.07 Requirement already satisfied: pure-eval in /opt/app-root/lib/python3.9/site-packages (from stack-data->ipython) (0.2.2)\n", - "#8 29.66 Collecting zipp>=0.5\n", - "#8 29.68 Downloading zipp-3.16.2-py3-none-any.whl (7.2 kB)\n", - "#8 30.98 Installing collected packages: pytz, zipp, tzdata, python-dateutil, packaging, numpy, click, certifi, shapely, pyproj, pandas, importlib-metadata, cligj, click-plugins, fiona, geopandas\n" + "ename": "CalledProcessError", + "evalue": "Command 'b'export version=0.36\\nipython generate_kfp_component.ipynb notebook_path=../../workflows-and-operators/operators/planetdownloader.ipynb version=$version repository=us.icr.io/geodn\\n'' returned non-zero exit status 1.", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mCalledProcessError\u001b[0m Traceback (most recent call last)", + "Cell \u001b[0;32mIn[64], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m get_ipython()\u001b[39m.\u001b[39;49mrun_cell_magic(\u001b[39m'\u001b[39;49m\u001b[39mbash\u001b[39;49m\u001b[39m'\u001b[39;49m, \u001b[39m'\u001b[39;49m\u001b[39m'\u001b[39;49m, \u001b[39m'\u001b[39;49m\u001b[39mexport version=0.36\u001b[39;49m\u001b[39m\\n\u001b[39;49;00m\u001b[39mipython generate_kfp_component.ipynb notebook_path=../../workflows-and-operators/operators/planetdownloader.ipynb version=$version repository=us.icr.io/geodn\u001b[39;49m\u001b[39m\\n\u001b[39;49;00m\u001b[39m'\u001b[39;49m)\n", + "File \u001b[0;32m~/gitco/c3/.venv/lib64/python3.10/site-packages/IPython/core/interactiveshell.py:2478\u001b[0m, in \u001b[0;36mInteractiveShell.run_cell_magic\u001b[0;34m(self, magic_name, line, cell)\u001b[0m\n\u001b[1;32m 2476\u001b[0m \u001b[39mwith\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mbuiltin_trap:\n\u001b[1;32m 2477\u001b[0m args \u001b[39m=\u001b[39m (magic_arg_s, cell)\n\u001b[0;32m-> 2478\u001b[0m result \u001b[39m=\u001b[39m fn(\u001b[39m*\u001b[39;49margs, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwargs)\n\u001b[1;32m 2480\u001b[0m \u001b[39m# The code below prevents the output from being displayed\u001b[39;00m\n\u001b[1;32m 2481\u001b[0m \u001b[39m# when using magics with decodator @output_can_be_silenced\u001b[39;00m\n\u001b[1;32m 2482\u001b[0m \u001b[39m# when the last Python token in the expression is a ';'.\u001b[39;00m\n\u001b[1;32m 2483\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39mgetattr\u001b[39m(fn, magic\u001b[39m.\u001b[39mMAGIC_OUTPUT_CAN_BE_SILENCED, \u001b[39mFalse\u001b[39;00m):\n", + "File \u001b[0;32m~/gitco/c3/.venv/lib64/python3.10/site-packages/IPython/core/magics/script.py:154\u001b[0m, in 
\u001b[0;36mScriptMagics._make_script_magic..named_script_magic\u001b[0;34m(line, cell)\u001b[0m\n\u001b[1;32m 152\u001b[0m \u001b[39melse\u001b[39;00m:\n\u001b[1;32m 153\u001b[0m line \u001b[39m=\u001b[39m script\n\u001b[0;32m--> 154\u001b[0m \u001b[39mreturn\u001b[39;00m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mshebang(line, cell)\n", + "File \u001b[0;32m~/gitco/c3/.venv/lib64/python3.10/site-packages/IPython/core/magics/script.py:314\u001b[0m, in \u001b[0;36mScriptMagics.shebang\u001b[0;34m(self, line, cell)\u001b[0m\n\u001b[1;32m 309\u001b[0m \u001b[39mif\u001b[39;00m args\u001b[39m.\u001b[39mraise_error \u001b[39mand\u001b[39;00m p\u001b[39m.\u001b[39mreturncode \u001b[39m!=\u001b[39m \u001b[39m0\u001b[39m:\n\u001b[1;32m 310\u001b[0m \u001b[39m# If we get here and p.returncode is still None, we must have\u001b[39;00m\n\u001b[1;32m 311\u001b[0m \u001b[39m# killed it but not yet seen its return code. We don't wait for it,\u001b[39;00m\n\u001b[1;32m 312\u001b[0m \u001b[39m# in case it's stuck in uninterruptible sleep. -9 = SIGKILL\u001b[39;00m\n\u001b[1;32m 313\u001b[0m rc \u001b[39m=\u001b[39m p\u001b[39m.\u001b[39mreturncode \u001b[39mor\u001b[39;00m \u001b[39m-\u001b[39m\u001b[39m9\u001b[39m\n\u001b[0;32m--> 314\u001b[0m \u001b[39mraise\u001b[39;00m CalledProcessError(rc, cell)\n", + "\u001b[0;31mCalledProcessError\u001b[0m: Command 'b'export version=0.36\\nipython generate_kfp_component.ipynb notebook_path=../../workflows-and-operators/operators/planetdownloader.ipynb version=$version repository=us.icr.io/geodn\\n'' returned non-zero exit status 1." ] } ], "source": [ "%%bash\n", - "export version=0.2\n", - "ipython generate_kfp_component.ipynb ../../../workflows-and-operators/operators/planetdownloader.ipynb $version\n" + "export version=0.36\n", + "ipython generate_kfp_component.ipynb notebook_path=../../../workflows-and-operators/operators/planetdownloader.ipynb version=$version repository=us.icr.io/geodn\n" ] }, { From 400cf1ae5b6e9cb66edd961430fde17cd77bf49d Mon Sep 17 00:00:00 2001 From: fredotieno Date: Tue, 25 Jul 2023 16:20:20 +0300 Subject: [PATCH 043/177] fix: :adhesive_bandage: specify build platform Signed-off-by: Fred Otieno --- src/c3/generate_kfp_component.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/c3/generate_kfp_component.ipynb b/src/c3/generate_kfp_component.ipynb index 34b7556b..94183761 100644 --- a/src/c3/generate_kfp_component.ipynb +++ b/src/c3/generate_kfp_component.ipynb @@ -211,7 +211,7 @@ }, "outputs": [], "source": [ - "!docker build -t `echo claimed-{name}:{version}` .\n", + "!docker build --platform=linux/amd64 -t `echo claimed-{name}:{version}` .\n", "!docker tag `echo claimed-{name}:{version}` `echo {repository}/claimed-{name}:{version}`\n", "!docker tag `echo claimed-{name}:{version}` `echo {repository}/claimed-{name}:latest`\n", "!docker push `echo {repository}/claimed-{name}:{version}`\n", From eaca49987d81b0aa9704c3d8e31b9e44d64d3cff Mon Sep 17 00:00:00 2001 From: fredotieno Date: Tue, 25 Jul 2023 16:21:38 +0300 Subject: [PATCH 044/177] fix: :adhesive_bandage: include hyphen in regex, to capture libs with hyphens Signed-off-by: Fred Otieno --- src/c3/notebook.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/c3/notebook.py b/src/c3/notebook.py index e4f2408b..5453d5bf 100644 --- a/src/c3/notebook.py +++ b/src/c3/notebook.py @@ -45,7 +45,7 @@ def get_requirements(self): requirements = [] for cell in self.notebook['cells']: for cell_content in cell['source']: - pattern = r"(![ ]*pip[ 
]*install[ ]*)([A-Za-z=0-9.: ]*)" + pattern = r"(![ ]*pip[ ]*install[ ]*)([A-Za-z=0-9.\-: ]*)" result = re.findall(pattern,cell_content) if len(result) == 1: requirements.append((result[0][0]+ ' ' +result[0][1])[1:]) From 37ec9f4bc02cf051898441a760725098a74b44fb Mon Sep 17 00:00:00 2001 From: fredotieno Date: Tue, 25 Jul 2023 16:23:10 +0300 Subject: [PATCH 045/177] fix: :adhesive_bandage: use repository var Signed-off-by: Fred Otieno --- src/c3/generate_kfp_component.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/c3/generate_kfp_component.ipynb b/src/c3/generate_kfp_component.ipynb index 94183761..4da12e26 100644 --- a/src/c3/generate_kfp_component.ipynb +++ b/src/c3/generate_kfp_component.ipynb @@ -312,7 +312,7 @@ " name=name,\n", " description=description,\n", " inputs=get_component_interface(inputs, parameter_type.INPUT),\n", - " container_uri=f\"romeokienzler/claimed-{name}\",\n", + " container_uri=f\"{repository}/claimed-{name}\",\n", " version=version,\n", " outputPath=get_output_name(),\n", " input_for_implementation=get_input_for_implementation(),\n", From 9ef5a887c320e112febca7dac65550061fb97203 Mon Sep 17 00:00:00 2001 From: Fred Otieno Date: Wed, 16 Aug 2023 17:55:07 +0300 Subject: [PATCH 046/177] fix: :adhesive_bandage: use curly brackets for parameter list generation Signed-off-by: Fred Otieno --- src/c3/generate_kfp_component.ipynb | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/src/c3/generate_kfp_component.ipynb b/src/c3/generate_kfp_component.ipynb index 4da12e26..4f3ca38e 100644 --- a/src/c3/generate_kfp_component.ipynb +++ b/src/c3/generate_kfp_component.ipynb @@ -277,10 +277,10 @@ " return_value = str()\n", " index = 0\n", " for output_key, output_value in outputs.items():\n", - " return_value = return_value + output_key + '=\"$' + str(index) + '\" '\n", + " return_value = return_value + output_key + '=\"${' + str(index) + '}\" '\n", " index = index + 1\n", " for input_key, input_value in inputs.items():\n", - " return_value = return_value + input_key + '=\"$' + str(index) + '\" '\n", + " return_value = return_value + input_key + '=\"${' + str(index) + '}\" '\n", " index = index + 1\n", " return return_value " ] From 50dae1132a2c22b68b5985a14f0fb0620254dae0 Mon Sep 17 00:00:00 2001 From: Julian-Kuehnert Date: Fri, 18 Aug 2023 11:25:22 +0200 Subject: [PATCH 047/177] allow multiple additional files --- src/c3/generate_kfp_component.ipynb | 17 +++++++++++++++-- 1 file changed, 15 insertions(+), 2 deletions(-) diff --git a/src/c3/generate_kfp_component.ipynb b/src/c3/generate_kfp_component.ipynb index 4f3ca38e..fd6e1cbf 100644 --- a/src/c3/generate_kfp_component.ipynb +++ b/src/c3/generate_kfp_component.ipynb @@ -129,11 +129,24 @@ "outputs": [], "source": [ "#target_code = notebook_path.replace('.ipynb','.py').split('/')[-1:][0]\n", + "\n", "target_code = notebook_path.split('/')[-1:][0]\n", "shutil.copy(notebook_path,target_code)\n", "if check_variable('additionl_files'):\n", - " additionl_files_local = additionl_files.split('/')[-1:][0]\n", - " shutil.copy(additionl_files,additionl_files_local)\n", + " if additionl_files.startswith('['):\n", + " local_path = 'additionl_file_path'\n", + " if not os.path.exists(local_path):\n", + " os.makedirs(local_path)\n", + " additionl_files_local = local_path\n", + " additionl_files=additionl_files[1:-1].split(',')\n", + " print('Additional files to add to container:')\n", + " for additionl_file in additionl_files:\n", + " print(additionl_file)\n", + " shutil.copy(additionl_file, 
additionl_files_local)\n", + " print(os.listdir(local_path))\n", + " else:\n", + " additionl_files_local = additionl_files.split('/')[-1:][0]\n", + " shutil.copy(additionl_files,additionl_files_local)\n", "else:\n", " additionl_files_local=target_code #hack" ] From c986e4b1da60c8fad40562ae024683a709aee94b Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Wed, 30 Aug 2023 15:10:31 +0200 Subject: [PATCH 048/177] Corrected misspelling --- src/c3/generate_kfp_component.ipynb | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/src/c3/generate_kfp_component.ipynb b/src/c3/generate_kfp_component.ipynb index fd6e1cbf..5ab293c0 100644 --- a/src/c3/generate_kfp_component.ipynb +++ b/src/c3/generate_kfp_component.ipynb @@ -132,23 +132,23 @@ "\n", "target_code = notebook_path.split('/')[-1:][0]\n", "shutil.copy(notebook_path,target_code)\n", - "if check_variable('additionl_files'):\n", - " if additionl_files.startswith('['):\n", - " local_path = 'additionl_file_path'\n", + "if check_variable('additional_files'):\n", + " if additional_files.startswith('['):\n", + " local_path = 'additional_files_path'\n", " if not os.path.exists(local_path):\n", " os.makedirs(local_path)\n", - " additionl_files_local = local_path\n", - " additionl_files=additionl_files[1:-1].split(',')\n", + " additional_files_local = local_path\n", + " additional_files=additional_files[1:-1].split(',')\n", " print('Additional files to add to container:')\n", - " for additionl_file in additionl_files:\n", - " print(additionl_file)\n", - " shutil.copy(additionl_file, additionl_files_local)\n", + " for additional_file in additional_files:\n", + " print(additional_file)\n", + " shutil.copy(additional_file, additional_files_local)\n", " print(os.listdir(local_path))\n", " else:\n", - " additionl_files_local = additionl_files.split('/')[-1:][0]\n", - " shutil.copy(additionl_files,additionl_files_local)\n", + " additional_files_local = additional_files.split('/')[-1:][0]\n", + " shutil.copy(additional_files,additional_files_local)\n", "else:\n", - " additionl_files_local=target_code #hack" + " additional_files_local=target_code #hack" ] }, { @@ -202,7 +202,7 @@ "USER default\n", "RUN pip install ipython==8.6.0 nbformat==5.7.0\n", "{requirements_docker}\n", - "ADD {additionl_files_local} /opt/app-root/src/\n", + "ADD {additional_files_local} /opt/app-root/src/\n", "ADD {target_code} /opt/app-root/src/\n", "USER root\n", "RUN chmod -R 777 /opt/app-root/src/\n", From 76064044dec88efa018c761cd7226fdcf57b1bd5 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Wed, 30 Aug 2023 15:20:45 +0200 Subject: [PATCH 049/177] Added getting started to README.md --- README.md | 47 ++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 46 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index b5a24329..62c0048d 100644 --- a/README.md +++ b/README.md @@ -8,7 +8,7 @@ **TL;DR** - takes arbitrary assets (Jupyter notebooks, python/R/shell/SQL scripts) as input - automatically creates container images and pushes to container registries -- automatically installes all required dependencies into the container image +- automatically installs all required dependencies into the container image - creates KubeFlow Pipeline components (target workflow execution engines are pluggable) - can be triggered from CICD pipelines @@ -20,6 +20,51 @@ To learn more on how this library works in practice, please have a look at the f [Orchest](https://www.orchest.io/) +## Getting started + +### Install + 
+Download the code from https://github.com/claimed-framework/c3/tree/main and install the package.
+
+```sh
+git clone https://github.com/claimed-framework/c3.git
+cd c3
+pip install -e src
+```
+
+### Usage
+
+Run `generate_kfp_component.ipynb` with ipython and provide the required additional information (`notebook_path`, `version`, `repository`, optional: `additionl_files`).
+
+Example from the `c3` project root. Remember to add `..//` when selecting your notebook path. Note that the code creates containers and therefore requires a running docker instance.
+```sh
+ipython src/c3/generate_kfp_component.ipynb notebook_path="" version="" additionl_files="[file1,file2]" repository="docker.io/"
+```
+
+### Notebook requirements
+
+The c3 compiler requires your notebook to follow a certain pattern:
+
+1. Cell: Markdown with the component name
+2. Cell: Markdown with the component description
+3. Cell: Requirements installed by pip, e.g., `!pip install <...>`
+4. Cell: Imports, e.g., `import numpy as np`
+5. Cell: Component interface, e.g., `input_path = os.environ.get('input_path')`. Output variables have to start with `output`, more details in the following.
+6. Cell and following: Your code
+
+## Component interface
+
+The interface consists of input and output variables that are defined by environment variables. Output variables have to start with `output`, e.g., `output_path`.
+Environment variables and arguments are by default string values. You can cast a specific type by wrapping the `os.environ.get()` into the methods `bool()`, `int()`, or `float()`.
+The c3 compiler cannot handle other types than string, boolean, integer, and float values.
+
+```py
+input_string = os.environ.get('input_string', 'default_value')
+input_bool = bool(os.environ.get('input_bool', False))
+input_int = int(os.environ.get('input_int'))
+input_float = float(os.environ.get('input_float'))
+```
+
 ## Getting Help

 We welcome your questions, ideas, and feedback. Please create an [issue](https://github.com/claimed-framework/component-library/issues) or a [discussion thread](https://github.com/claimed-framework/component-library/discussions).
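The component.yaml files produced by the compiler are plain KubeFlow Pipelines component specs. As a minimal sketch of how such a generated spec can be consumed downstream (assuming the KFP v1 SDK; the file name `planetdownloader.yaml` and the `log_level` argument are illustrative, not prescribed by this patch series):

```python
# Minimal sketch: loading a C3-generated component.yaml into a KFP v1 pipeline.
import kfp
import kfp.components as comp

# Hypothetical output file of a previous C3 run.
planet_op = comp.load_component_from_file('planetdownloader.yaml')

@kfp.dsl.pipeline(name='claimed-demo')
def claimed_demo():
    # Every interface variable of the operator becomes a component input.
    planet_op(log_level='INFO')

if __name__ == '__main__':
    kfp.compiler.Compiler().compile(claimed_demo, 'claimed-demo.tar.gz')
```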
From bc301551a92fdb1fd1ce9b6f5349c0b59729ed33 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Fri, 1 Sep 2023 08:36:30 +0200 Subject: [PATCH 050/177] Replace "_" with "-" in container name --- src/c3/notebook.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/c3/notebook.py b/src/c3/notebook.py index 5453d5bf..5f9a8f62 100644 --- a/src/c3/notebook.py +++ b/src/c3/notebook.py @@ -7,7 +7,7 @@ def __init__(self, path): self.path = path with open(path) as json_file: self.notebook = json.load(json_file) - self.name = self.notebook['cells'][0]['source'][0].replace('#', '').strip() + self.name = self.notebook['cells'][0]['source'][0].replace('#', '').replace('_', '-').strip() self.description = self.notebook['cells'][1]['source'][0] self.envs = self._get_env_vars() From a709f6f33ce15bf4bf0e68eff432df2b88c9b1b8 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Fri, 1 Sep 2023 09:31:06 +0200 Subject: [PATCH 051/177] Delete local copies after containerization --- src/c3/generate_kfp_component.ipynb | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/src/c3/generate_kfp_component.ipynb b/src/c3/generate_kfp_component.ipynb index 5ab293c0..355fcf0c 100644 --- a/src/c3/generate_kfp_component.ipynb +++ b/src/c3/generate_kfp_component.ipynb @@ -401,6 +401,18 @@ "with open(target_job_yaml_path, \"w\") as text_file:\n", " text_file.write(job_yaml)" ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e36aa496", + "metadata": {}, + "outputs": [], + "source": [ + "# remove local files\n", + "shutil.rmtree(local_path, ignore_errors=True)\n", + "os.remove(target_code)" + ] } ], "metadata": { @@ -419,7 +431,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.11" + "version": "3.11.4" } }, "nbformat": 4, From 32d4ef008732499acc0afad71c521c1691ed5861 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Mon, 4 Sep 2023 15:59:47 +0200 Subject: [PATCH 052/177] Delete local copies after containerization --- src/c3/generate_kfp_component.ipynb | 19 +++++++++++-------- 1 file changed, 11 insertions(+), 8 deletions(-) diff --git a/src/c3/generate_kfp_component.ipynb b/src/c3/generate_kfp_component.ipynb index 355fcf0c..7bf640cb 100644 --- a/src/c3/generate_kfp_component.ipynb +++ b/src/c3/generate_kfp_component.ipynb @@ -134,21 +134,22 @@ "shutil.copy(notebook_path,target_code)\n", "if check_variable('additional_files'):\n", " if additional_files.startswith('['):\n", - " local_path = 'additional_files_path'\n", - " if not os.path.exists(local_path):\n", - " os.makedirs(local_path)\n", - " additional_files_local = local_path\n", + " additional_files_path = 'additional_files_path'\n", + " if not os.path.exists(additional_files_path):\n", + " os.makedirs(additional_files_path)\n", + " additional_files_local = additional_files_path\n", " additional_files=additional_files[1:-1].split(',')\n", " print('Additional files to add to container:')\n", " for additional_file in additional_files:\n", " print(additional_file)\n", " shutil.copy(additional_file, additional_files_local)\n", - " print(os.listdir(local_path))\n", + " print(os.listdir(additional_files_path))\n", " else:\n", " additional_files_local = additional_files.split('/')[-1:][0]\n", " shutil.copy(additional_files,additional_files_local)\n", "else:\n", - " additional_files_local=target_code #hack" + " additional_files_local=target_code #hack\n", + " additional_files_path = None" ] }, { @@ -410,8 +411,10 @@ "outputs": [], 
"source": [ "# remove local files\n", - "shutil.rmtree(local_path, ignore_errors=True)\n", - "os.remove(target_code)" + "os.remove(target_code)\n", + "os.remove('Dockerfile')\n", + "if additional_files_path is not None:\n", + " shutil.rmtree(additional_files_path, ignore_errors=True)" ] } ], From ba0a652d1aeabc0e2c78d73422837114dbb6c9ea Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Mon, 4 Sep 2023 16:03:13 +0200 Subject: [PATCH 053/177] Added the cpython compiler --- src/c3/generate_kfp_component.py | 245 +++++++++++++++++++++++++++++++ src/c3/pythonscript.py | 79 ++++++++++ 2 files changed, 324 insertions(+) create mode 100644 src/c3/generate_kfp_component.py create mode 100644 src/c3/pythonscript.py diff --git a/src/c3/generate_kfp_component.py b/src/c3/generate_kfp_component.py new file mode 100644 index 00000000..840eaef0 --- /dev/null +++ b/src/c3/generate_kfp_component.py @@ -0,0 +1,245 @@ + +import os +import sys +import re +import logging +import shutil +from notebook import Notebook +from pythonscript import Pythonscript +from string import Template +from io import StringIO +from enum import Enum + + +def generate_component(file_path: str, repository: str, version: str, additional_files: str = None): + + root = logging.getLogger() + root.setLevel('INFO') + + handler = logging.StreamHandler(sys.stdout) + handler.setLevel('INFO') + formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s') + handler.setFormatter(formatter) + root.addHandler(handler) + + logging.info('Parameters: ') + logging.info('file_path: ' + file_path) + logging.info('repository: ' + repository) + logging.info('version: ' + version) + logging.info('additional_files: ' + str(additional_files)) + + if file_path.endswith('.ipynb'): + nb = Notebook(file_path) + name = nb.get_name() + description = nb.get_description() + " CLAIMED v" + version + inputs = nb.get_inputs() + outputs = nb.get_outputs() + requirements = nb.get_requirements() + elif file_path.endswith('.py'): + py = Pythonscript(file_path) + name = py.get_name() + description = py.get_description() + " CLAIMED v" + version + inputs = py.get_inputs() + outputs = py.get_outputs() + requirements = py.get_requirements() + else: + print('Please provide a file_path to a jupyter notebook or python script.') + raise NotImplementedError + + print(name) + print(description) + print(inputs) + print(outputs) + print(requirements) + + def check_variable(var_name): + return var_name in locals() or var_name in globals() + + target_code = file_path.split('/')[-1:][0] + shutil.copy(file_path,target_code) + if check_variable('additional_files'): + if additional_files.startswith('['): + additional_files_path = 'additional_files_path' + if not os.path.exists(additional_files_path): + os.makedirs(additional_files_path) + additional_files_local = additional_files_path + additional_files=additional_files[1:-1].split(',') + print('Additional files to add to container:') + for additional_file in additional_files: + print(additional_file) + shutil.copy(additional_file, additional_files_local) + print(os.listdir(additional_files_path)) + else: + additional_files_local = additional_files.split('/')[-1:][0] + shutil.copy(additional_files,additional_files_local) + else: + additional_files_local=target_code # hack + additional_files_path = None + file = target_code + + # read and replace '!pip' in notebooks + with open(file, 'r') as fd: + text, counter = re.subn(r'!pip', '#!pip', fd.read(), re.I) + + # check if there is at least a match + if 
counter > 0: + # edit the file + with open(file, 'w') as fd: + fd.write(text) + + requirements_docker = list(map(lambda s: 'RUN '+s, requirements)) + requirements_docker = '\n'.join(requirements_docker) + + python_command = 'python' if target_code.endswith('.py') else 'ipython' + + docker_file = f""" + FROM registry.access.redhat.com/ubi8/python-39 + USER root + RUN dnf install -y java-11-openjdk + USER default + RUN pip install ipython==8.6.0 nbformat==5.7.0 + {requirements_docker} + ADD {additional_files_local} /opt/app-root/src/ + ADD {target_code} /opt/app-root/src/ + USER root + RUN chmod -R 777 /opt/app-root/src/ + USER default + CMD ["{python_command}", "/opt/app-root/src/{target_code}"] + """ + + # Remove packages that are not used for python scripts + if target_code.endswith('.py'): + docker_file = docker_file.replace('RUN pip install ipython==8.6.0 nbformat==5.7.0\n', '') + + with open("Dockerfile", "w") as text_file: + text_file.write(docker_file) + + os.system('cat Dockerfile') + os.system(f'docker build --platform=linux/amd64 -t `echo claimed-{name}:{version}` .') + os.system(f'docker tag `echo claimed-{name}:{version}` `echo {repository}/claimed-{name}:{version}`') + os.system(f'docker tag `echo claimed-{name}:{version}` `echo {repository}/claimed-{name}:latest`') + os.system(f'docker push `echo {repository}/claimed-{name}:{version}`') + os.system(f'docker push `echo {repository}/claimed-{name}:latest`') + parameter_type = Enum('parameter_type', ['INPUT', 'OUTPUT']) + + def get_component_interface(parameters, type : parameter_type): + template_string = str() + for parameter_name, parameter_options in parameters.items(): + default = '' + if parameter_options['default'] is not None and type == parameter_type.INPUT: + default = f", default: {parameter_options['default']}" + template_string += f"- {{name: {parameter_name}, type: {parameter_options['type']}, description: {parameter_options['description']}{default}}}" + template_string += '\n' + return template_string + + def get_output_name(): + for output_key, output_value in outputs.items(): + return output_key + + def get_input_for_implementation(): + with StringIO() as inputs_str: + for input_key, input_value in inputs.items(): + t = Template(" - {inputValue: $name}") + print(t.substitute(name=input_key), file=inputs_str) + return inputs_str.getvalue() + + def get_parameter_list(): + return_value = str() + index = 0 + for output_key, output_value in outputs.items(): + return_value = return_value + output_key + '="${' + str(index) + '}" ' + index = index + 1 + for input_key, input_value in inputs.items(): + return_value = return_value + input_key + '="${' + str(index) + '}" ' + index = index + 1 + return return_value + + t = Template('''name: $name +description: $description + +inputs: +$inputs + +implementation: + container: + image: $container_uri:$version + command: + - sh + - -ec + - | + $python $call +$input_for_implementation''') + + yaml = t.substitute( + name=name, + description=description, + inputs=get_component_interface(inputs, parameter_type.INPUT), + container_uri=f"{repository}/claimed-{name}", + version=version, + outputPath=get_output_name(), + input_for_implementation=get_input_for_implementation(), + call=f'./{target_code} {get_parameter_list()}', + python=python_command, + ) + + print(yaml) + target_yaml_path = file_path.replace('.ipynb','.yaml').replace('.py','.yaml') + + with open(target_yaml_path, "w") as text_file: + text_file.write(yaml) + + # get environment entries + env_entries = [] + for input_key, 
_ in inputs.items(): + env_entry = f" - name: {input_key}\n value: value_of_{input_key}" + env_entries.append(env_entry) + env_entries.append('\n') + env_entries.pop(-1) + env_entries = ''.join(env_entries) + + job_yaml = f'''apiVersion: batch/v1 +kind: Job +metadata: + name: {name} +spec: + template: + spec: + containers: + - name: {name} + image: {repository}/claimed-{name}:{version} + command: ["/opt/app-root/bin/{python_command}","/opt/app-root/src/{target_code}"] + env: +{env_entries} + restartPolicy: OnFailure + imagePullSecrets: + - name: image_pull_secret''' + + print(job_yaml) + target_job_yaml_path = file_path.replace('.ipynb','.job.yaml').replace('.py','.job.yaml') + + with open(target_job_yaml_path, "w") as text_file: + text_file.write(job_yaml) + + # remove local files + os.remove(target_code) + os.remove('Dockerfile') + if additional_files_path is not None: + shutil.rmtree(additional_files_path, ignore_errors=True) + + +if __name__ == '__main__': + import argparse + + parser = argparse.ArgumentParser() + parser.add_argument('--file_path', type=str, required=True, + help='Path to python script or notebook') + parser.add_argument('--repository', type=str, required=True, + help='Container registry address, e.g. docker.io/') + parser.add_argument('--version', type=str, required=True, + help='Image version') + parser.add_argument('--additional_files', type=str, + help='Comma-separated list of paths to additional files to include in the container image') + + args = parser.parse_args() + + generate_component(**vars(args)) diff --git a/src/c3/pythonscript.py b/src/c3/pythonscript.py new file mode 100644 index 00000000..0de400c3 --- /dev/null +++ b/src/c3/pythonscript.py @@ -0,0 +1,79 @@ +import json +import logging +import os +import re +from parser import ContentParser + + +class Pythonscript: + def __init__(self, path, function_name: str = None): + + self.path = path + with open(path, 'r') as f: + self.script = f.read() + + self.name = os.path.basename(path)[:-3].replace('_', '-') + assert '"""' in self.script, 'Please provide a description of the operator inside the first doc string.' + self.description = self.script.split('"""')[1].strip() + self.envs = self._get_env_vars() + + def _get_env_vars(self): + cp = ContentParser() + env_names = cp.parse(self.path)['env_vars'] + return_value = dict() + for env_name in env_names: + comment_line = str() + for line in self.script.split('\n'): + if re.search("[\"']" + env_name + "[\"']", line): + # Check the description for current variable + if not comment_line.strip().startswith('#'): + # previous line was no description, reset comment_line. 
+ comment_line = '' + if comment_line == '': + logging.info(f'Interface: No description for variable {env_name} provided.') + if ',' in comment_line: + logging.info( + f"Interface: comment line for variable {env_name} contains commas which will be deleted.") + comment_line = comment_line.replace(',', '') + + if "int(" in line: + type = 'Integer' + elif "float(" in line: + type = 'Float' + elif "bool(" in line: + type = 'Boolean' + else: + type = 'String' + if ',' in line: + default = line.split(',')[1].split(')')[0] + else: + default = None + return_value[env_name] = { + 'description': comment_line.replace('#', '').strip(), + 'type': type, + 'default': default + } + break + comment_line = line + return return_value + + def get_requirements(self): + requirements = [] + for line in self.script.split('\n'): + pattern = r"([ ]*pip[ ]*install[ ]*)([A-Za-z=0-9.\-: ]*)" + result = re.findall(pattern, line) + if len(result) == 1: + requirements.append((result[0][0].strip() + ' ' + result[0][1].strip())) + return requirements + + def get_name(self): + return self.name + + def get_description(self): + return self.description + + def get_inputs(self): + return {key: value for (key, value) in self.envs.items() if not key.startswith('output_')} + + def get_outputs(self): + return {key: value for (key, value) in self.envs.items() if key.startswith('output_')} From fe60bf52b386fdd38a6290dc6d1f0cc85bea1f89 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Tue, 12 Sep 2023 16:49:20 +0200 Subject: [PATCH 054/177] Added an operator example --- examples/operator_template.py | 112 ++++++++++++++++++++++++++++++++++ 1 file changed, 112 insertions(+) create mode 100644 examples/operator_template.py diff --git a/examples/operator_template.py b/examples/operator_template.py new file mode 100644 index 00000000..a88adfac --- /dev/null +++ b/examples/operator_template.py @@ -0,0 +1,112 @@ +# TODO: Rename the file to the desired operator name. +""" +# TODO: Update the description of the operator. +This is a template for an operator that read files from COS, processes them, and saves the results to COS. +You can create a container image and KubeFlow job with C3. +""" + +# TODO: Update the required pip packages. +# pip install xarray s3fs + +import os +import logging +import sys +import re +import s3fs +import xarray as xr + +# TODO: Add the operator interface. +# You can use os.environ["name"], os.getenv("name"), or os.environ.get("name"). +# The default type is string. You can also use int, float, and bool values with type casting. +# Optionally, you can set a default value like in the following. +# string example description with default value +string_example = os.getenv('string_example', 'default_value') +# int example description +int_example = int(os.getenv('int_example', 10)) +# float example description +float_example = float(os.getenv('float_example', 0.1)) +# bool example description +bool_example = bool(os.getenv('bool_example', False)) + +# # # Exemplary interface for processing COS files # # # + +# glob pattern for all zarr files to process (e.g. 
path/to/files/**/*.zarr) +file_path_pattern = os.getenv('file_path_pattern') +# directory for the output files +target_dir = os.getenv('target_dir') +# access_key_id +access_key_id = os.getenv('access_key_id') +# secret_access_key +secret_access_key = os.getenv('secret_access_key') +# endpoint +endpoint = os.getenv('endpoint') +# bucket +bucket = os.getenv('bucket') +# set log level +log_level = os.getenv('log_level', "INFO") + +# Init logging +root = logging.getLogger() +root.setLevel(log_level) + +handler = logging.StreamHandler(sys.stdout) +handler.setLevel(log_level) +formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s') +handler.setFormatter(formatter) +root.addHandler(handler) + +logging.basicConfig(level=logging.CRITICAL) + +# get arguments from the command (C3 passes all arguments in the form '=') +parameters = list( + map(lambda s: re.sub('$', '"', s), + map( + lambda s: s.replace('=', '="'), + filter( + lambda s: s.find('=') > -1 and bool(re.match(r'[A-Za-z0-9_]*=[.\/A-Za-z0-9]*', s)), + sys.argv + ) + ))) + +# set values from command arguments +for parameter in parameters: + logging.info('Parameter: ' + parameter) + exec(parameter) + +# TODO: You might want to add type casting after the exec(parameter). +# C3 will added this automatically in the future, but it not implemented yet. +# type casting +int_example = int(int_example) +float_example = float(float_example) +bool_example = bool(bool_example) + + +# TODO: Add your code. +# You can just call a function from an additional file (must be in the same directory) or add your code here. +# Example code for processing COS files based on a file pattern +def main(): + # init s3 + s3 = s3fs.S3FileSystem( + anon=False, + key=access_key_id, + secret=secret_access_key, + client_kwargs={'endpoint_url': endpoint}) + + # get file paths from a glob pattern, e.g., path/to/files/**/*.zarr + file_paths = s3.glob(os.path.join(bucket, file_path_pattern)) + + for file_path in file_paths: + # open a zarr file from COS as xarray dataset + ds = xr.open_zarr(s3fs.S3Map(root=f's3://{file_path}', s3=s3)) + + # TODO: do something with the dataset + processed_ds = ds + + # write processed dataset to s3 + # TODO: edit how to save the processed data + target_path = os.path.join(bucket, target_dir, os.path.basename(file_path)) + processed_ds.to_zarr(s3fs.S3Map(root=f's3://{target_path}', s3=s3)) + + +if __name__ == '__main__': + main() From 094cf348ede9b88c414aa14b5ff277b5408ab613 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Wed, 13 Sep 2023 19:34:06 +0200 Subject: [PATCH 055/177] Update ReadMe --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 62c0048d..34dc38e2 100644 --- a/README.md +++ b/README.md @@ -52,7 +52,7 @@ The c3 compiler requires your notebook to follow a certain pattern: 5. Cell: Component interface, e.g., `input_path = os.environ.get('input_path')`. Output variables have to start with `output`, more details in the following. 6. Cell and following: Your code -## Component interface +### Component interface The interface consists of input and output variables that are defined by environment variables. Output variables have to start with `output`, e.g., `output_path`. Environment variables and arguments are by default string values. You can cast a specific type by wrapping the `os.environ.get()` into the methods `bool()`, `int()`, or `float()`. 
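Because `operator_template.py` above both reads its interface from environment variables and re-binds the same names from `key=value` command-line arguments, a compressed, runnable sketch of that argument-handling path may help; the fake `argv` list and the `int_example` name are illustrative only:

```python
# Runnable sketch of the key=value argument handling used in operator_template.py.
import re

argv = ['operator_template.py', 'int_example=5', 'no-assignment-here']

parameters = list(
    map(lambda s: re.sub('$', '"', s),       # append closing quote -> int_example="5"
        map(lambda s: s.replace('=', '="'),  # insert opening quote -> int_example="5
            filter(lambda s: s.find('=') > -1 and bool(re.match(r'[A-Za-z0-9_]*=[.\/A-Za-z0-9]*', s)),
                   argv))))

for parameter in parameters:
    exec(parameter)  # binds int_example = "5" (a string) in module globals

int_example = int(int_example)  # cast afterwards, as the template's TODO notes
assert int_example == 5
```

The quoting via `replace('=', '="')` plus `re.sub('$', '"', s)` is what lets `exec()` bind every value as a string; casting to the target type happens in a separate step afterwards.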
From 8a27a85bfb53ceec0bf4b5fa824915e15501fb66 Mon Sep 17 00:00:00 2001
From: Benedikt Blumenstiel
Date: Wed, 13 Sep 2023 19:52:04 +0200
Subject: [PATCH 056/177] Add GettingStarted

---
 GettingStarted.md | 37 +++++++++++++++++++++++++++++++++++++
 README.md         | 29 +++--------------------------
 2 files changed, 40 insertions(+), 26 deletions(-)
 create mode 100644 GettingStarted.md

diff --git a/GettingStarted.md b/GettingStarted.md
new file mode 100644
index 00000000..a26691e6
--- /dev/null
+++ b/GettingStarted.md
@@ -0,0 +1,37 @@
+# Getting started with CLAIMED
+
+The [CLAIMED framework](https://github.com/claimed-framework) enables easy development and deployment of cloud native data processing applications on Kubernetes using operators and workflows.
+
+A central tool of CLAIMED is the **Claimed Component Compiler (C3)**, which creates a docker image with all dependencies, pushes the container to a registry, and creates a kubernetes-job.yaml as well as a kubeflow-pipeline-component.yaml.
+The following explains how to build operators yourself.
+
+## C3 requirements
+
+Your operator script has to follow certain requirements to be processed by C3.
+
+#### Python scripts
+
+- The operator name is the python file name: `your_operator_name.py`
+- The operator description is the first doc string in the script: `"""Operator description"""`
+- You need to provide the required pip packages in comments starting: `# pip install `
+- The interface is defined by environment variables `your_parameter = os.getenv('your_parameter')`. Output variables start with `output_`.
+- You can cast a specific type by wrapping `os.getenv()` with `int()`, `float()`, or `bool()`. The default type is string. Only these four types are currently supported. You can use `None` as a default value but not pass the `NoneType` via the `job.yaml`.
+
+#### iPython notebooks
+
+- The operator name is the notebook file name: `your_operator_name.ipynb`
+- The notebook is converted to a python script before creating the operator by merging all cells.
+- Markdown cells are converted into doc strings. Shell commands with `!...` are converted into `os.system()`.
+- The requirements of python scripts apply to the notebook code (the operator description can be a markdown cell).
+
+## Compile an operator with C3
+
+With a running Docker engine and your operator script matching the C3 requirements, you can execute the C3 compiler by running `generate_kfp_component.py`:
+
+```sh
+python /src/c3/generate_kfp_component.py --file_path ".py" --version "X.X" --repository "us.icr.io/" --additional_files "[file1,file2]"
+```
+
+The `file_path` can point to a python script or an ipython notebook. It is recommended to increase the `version` with every compilation, because clusters pull a cached image of a given version if that version has been used before.
+`additional_files` is an optional parameter and must include all files you're using in your operator script. The additional files are placed within the same directory as the operator script.
+
diff --git a/README.md b/README.md
index 34dc38e2..48648ca5 100644
--- a/README.md
+++ b/README.md
@@ -34,36 +34,13 @@

 ### Usage

-Run `generate_kfp_component.ipynb` with ipython and provide the required additional information (`notebook_path`, `version`, `repository`, optional: `additionl_files`).
-
-Example from the `c3` project root. Remember to add `..//` when selecting your notebook path. Note that the code creates containers and therefore requires a running docker instance.
+Just run the following command with your python script or notebook: ```sh -ipython src/c3/generate_kfp_component.ipynb notebook_path="" version="" additionl_files="[file1,file2]" repository="docker.io/" +python /src/c3/generate_kfp_component.py --file_path ".py" --version "X.X" --repository "us.icr.io/" --additional_files "[file1,file2]" ``` -### Notebook requirements - -The c3 compiler requires your notebook to follow a certain pattern: - -1. Cell: Markdown with the component name -2. Cell: Markdown with the component description -3. Cell: Requirements installed by pip, e.g., `!pip install <...>` -4. Cell: Imports, e.g., `import numpy as np` -5. Cell: Component interface, e.g., `input_path = os.environ.get('input_path')`. Output variables have to start with `output`, more details in the following. -6. Cell and following: Your code +Your code include certain requirements which are explained in [Getting Started](GettingStarted.md). -### Component interface - -The interface consists of input and output variables that are defined by environment variables. Output variables have to start with `output`, e.g., `output_path`. -Environment variables and arguments are by default string values. You can cast a specific type by wrapping the `os.environ.get()` into the methods `bool()`, `int()`, or `float()`. -The c3 compiler cannot handle other types than string, boolean, integer, and float values. - -```py -input_string = os.environ.get('input_string', 'default_value') -input_bool = bool(os.environ.get('input_bool', False)) -input_int = int(os.environ.get('input_int')) -input_float = float(os.environ.get('input_float')) -``` ## Getting Help From 01cdd144747a5fe3c981518ec2be144ed0b16a9d Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Wed, 13 Sep 2023 19:53:21 +0200 Subject: [PATCH 057/177] Convert notebooks to py scripts --- src/c3/generate_kfp_component.py | 62 ++++++++++++++++---------------- src/c3/notebook_converter.py | 46 ++++++++++++++++++++++++ 2 files changed, 76 insertions(+), 32 deletions(-) create mode 100644 src/c3/notebook_converter.py diff --git a/src/c3/generate_kfp_component.py b/src/c3/generate_kfp_component.py index 840eaef0..9407b238 100644 --- a/src/c3/generate_kfp_component.py +++ b/src/c3/generate_kfp_component.py @@ -1,18 +1,16 @@ - import os import sys import re import logging import shutil -from notebook import Notebook -from pythonscript import Pythonscript from string import Template from io import StringIO from enum import Enum +from pythonscript import Pythonscript +from notebook_converter import convert_notebook def generate_component(file_path: str, repository: str, version: str, additional_files: str = None): - root = logging.getLogger() root.setLevel('INFO') @@ -29,13 +27,10 @@ def generate_component(file_path: str, repository: str, version: str, additional logging.info('additional_files: ' + str(additional_files)) if file_path.endswith('.ipynb'): - nb = Notebook(file_path) - name = nb.get_name() - description = nb.get_description() + " CLAIMED v" + version - inputs = nb.get_inputs() - outputs = nb.get_outputs() - requirements = nb.get_requirements() - elif file_path.endswith('.py'): + logging.info('Convert notebook to python script') + file_path = convert_notebook(file_path) + + if file_path.endswith('.py'): py = Pythonscript(file_path) name = py.get_name() description = py.get_description() + " CLAIMED v" + version @@ -55,15 +50,16 @@ def generate_component(file_path: str, repository: str, version: str, additional def check_variable(var_name): return 
var_name in locals() or var_name in globals() - target_code = file_path.split('/')[-1:][0] - shutil.copy(file_path,target_code) + target_code = file_path.split('/')[-1] + if file_path != target_code: + shutil.copy(file_path, target_code) if check_variable('additional_files'): if additional_files.startswith('['): additional_files_path = 'additional_files_path' if not os.path.exists(additional_files_path): os.makedirs(additional_files_path) additional_files_local = additional_files_path - additional_files=additional_files[1:-1].split(',') + additional_files = additional_files[1:-1].split(',') print('Additional files to add to container:') for additional_file in additional_files: print(additional_file) @@ -71,9 +67,9 @@ def check_variable(var_name): print(os.listdir(additional_files_path)) else: additional_files_local = additional_files.split('/')[-1:][0] - shutil.copy(additional_files,additional_files_local) + shutil.copy(additional_files, additional_files_local) else: - additional_files_local=target_code # hack + additional_files_local = target_code # hack additional_files_path = None file = target_code @@ -87,7 +83,7 @@ def check_variable(var_name): with open(file, 'w') as fd: fd.write(text) - requirements_docker = list(map(lambda s: 'RUN '+s, requirements)) + requirements_docker = list(map(lambda s: 'RUN ' + s, requirements)) requirements_docker = '\n'.join(requirements_docker) python_command = 'python' if target_code.endswith('.py') else 'ipython' @@ -122,12 +118,12 @@ def check_variable(var_name): os.system(f'docker push `echo {repository}/claimed-{name}:latest`') parameter_type = Enum('parameter_type', ['INPUT', 'OUTPUT']) - def get_component_interface(parameters, type : parameter_type): + def get_component_interface(parameters, type: parameter_type): template_string = str() for parameter_name, parameter_options in parameters.items(): default = '' if parameter_options['default'] is not None and type == parameter_type.INPUT: - default = f", default: {parameter_options['default']}" + default = f", default: {parameter_options['default']}" template_string += f"- {{name: {parameter_name}, type: {parameter_options['type']}, description: {parameter_options['description']}{default}}}" template_string += '\n' return template_string @@ -171,19 +167,19 @@ def get_parameter_list(): $input_for_implementation''') yaml = t.substitute( - name=name, - description=description, - inputs=get_component_interface(inputs, parameter_type.INPUT), - container_uri=f"{repository}/claimed-{name}", - version=version, - outputPath=get_output_name(), - input_for_implementation=get_input_for_implementation(), - call=f'./{target_code} {get_parameter_list()}', - python=python_command, - ) + name=name, + description=description, + inputs=get_component_interface(inputs, parameter_type.INPUT), + container_uri=f"{repository}/claimed-{name}", + version=version, + outputPath=get_output_name(), + input_for_implementation=get_input_for_implementation(), + call=f'./{target_code} {get_parameter_list()}', + python=python_command, + ) print(yaml) - target_yaml_path = file_path.replace('.ipynb','.yaml').replace('.py','.yaml') + target_yaml_path = file_path.replace('.ipynb', '.yaml').replace('.py', '.yaml') with open(target_yaml_path, "w") as text_file: text_file.write(yaml) @@ -194,7 +190,9 @@ def get_parameter_list(): env_entry = f" - name: {input_key}\n value: value_of_{input_key}" env_entries.append(env_entry) env_entries.append('\n') - env_entries.pop(-1) + # TODO: Is it possible that a component has no inputs? 
+    if len(env_entries) != 0:
+        env_entries.pop(-1)
     env_entries = ''.join(env_entries)

     job_yaml = f'''apiVersion: batch/v1
kind: Job
metadata:
  name: {name}
spec:
  template:
    spec:
      containers:
      - name: {name}
        image: {repository}/claimed-{name}:{version}
        command: ["/opt/app-root/bin/{python_command}","/opt/app-root/src/{target_code}"]
        env:
{env_entries}
      restartPolicy: OnFailure
      imagePullSecrets:
      - name: image_pull_secret'''

     print(job_yaml)
-    target_job_yaml_path = file_path.replace('.ipynb','.job.yaml').replace('.py','.job.yaml')
+    target_job_yaml_path = file_path.replace('.ipynb', '.job.yaml').replace('.py', '.job.yaml')

     with open(target_job_yaml_path, "w") as text_file:
         text_file.write(job_yaml)

diff --git a/src/c3/notebook_converter.py b/src/c3/notebook_converter.py
new file mode 100644
index 00000000..9bcd2b06
--- /dev/null
+++ b/src/c3/notebook_converter.py
@@ -0,0 +1,46 @@
+
+import json
+import logging
+import os
+
+
+def convert_notebook(path):
+    with open(path) as json_file:
+        notebook = json.load(json_file)
+
+    # backwards compatibility
+    if notebook['cells'][0]['cell_type'] == 'markdown' and notebook['cells'][1]['cell_type'] == 'markdown':
+        logging.info('Merge first two markdown cells. File name is used as operator name, not first markdown cell.')
+        notebook['cells'][1]['source'] = notebook['cells'][0]['source'] + ['\n'] + notebook['cells'][1]['source']
+        notebook['cells'] = notebook['cells'][1:]
+
+    code_lines = []
+    for cell in notebook['cells']:
+        if cell['cell_type'] == 'markdown':
+            # add markdown as doc string
+            code_lines.extend(['"""\n'] + [f'{line}' for line in cell['source']] + ['\n"""'])
+        elif cell['cell_type'] == 'code':
+            for line in cell['source']:
+                if line.strip().startswith('!'):
+                    # convert shell commands
+                    if line.strip().startswith('!pip'):
+                        # change pip install to comment
+                        code_lines.append(line.replace('!pip', '# pip', 1))
+                    else:
+                        # change sh command to os.system()
+                        logging.info(f'Replace shell command with os.system() ({line})')
+                        code_lines.append(line.replace('!', 'os.system(', 1).replace('\n', ')\n'))
+                else:
+                    # add code
+                    code_lines.append(line)
+            # add line break after cell
+            code_lines.append('\n')
+    code = ''.join(code_lines)
+
+    py_path = path.split('/')[-1].replace('.ipynb', '.py')
+
+    assert not os.path.exists(py_path), f"File {py_path} already exists. Cannot convert notebook."
+    with open(py_path, 'w') as py_file:
+        py_file.write(code)
+
+    return py_path

From ea92df61c2bf4ead6fa46b5f617ef061a2c2d98e Mon Sep 17 00:00:00 2001
From: Benedikt Blumenstiel
Date: Thu, 14 Sep 2023 09:48:17 +0200
Subject: [PATCH 058/177] Fix removal of temporary files

---
 src/c3/generate_kfp_component.py | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/c3/generate_kfp_component.py b/src/c3/generate_kfp_component.py
index 9407b238..b92ba796 100644
--- a/src/c3/generate_kfp_component.py
+++ b/src/c3/generate_kfp_component.py
@@ -218,8 +218,9 @@ def get_parameter_list():
     with open(target_job_yaml_path, "w") as text_file:
         text_file.write(job_yaml)

-    # remove local files
-    os.remove(target_code)
+    # remove temporary files
+    if file_path != target_code:
+        os.remove(target_code)
     os.remove('Dockerfile')
     if additional_files_path is not None:
         shutil.rmtree(additional_files_path, ignore_errors=True)

From e99ceaefeb0f1b64f73367c0ddd5608ca5936d21 Mon Sep 17 00:00:00 2001
From: Benedikt Blumenstiel
Date: Thu, 14 Sep 2023 10:14:20 +0200
Subject: [PATCH 059/177] Fix description formatting and file_path copy

---
 src/c3/generate_kfp_component.py | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/src/c3/generate_kfp_component.py b/src/c3/generate_kfp_component.py
index b92ba796..797d6aca 100644
--- a/src/c3/generate_kfp_component.py
+++ b/src/c3/generate_kfp_component.py
@@ -9,6 +9,8 @@
 from pythonscript import Pythonscript
 from notebook_converter import convert_notebook

+CLAIMED_VERSION = 'V0.1'
+

 def generate_component(file_path: str, repository: str, version: str, additional_files: str = None):
     root = logging.getLogger()
     root.setLevel('INFO')
@@ -28,12 +30,19 @@
     if file_path.endswith('.ipynb'):
         logging.info('Convert notebook to python script')
-        file_path = convert_notebook(file_path)
+        target_code = convert_notebook(file_path)
+    else:
+        target_code = file_path.split('/')[-1]
+        if file_path != target_code:
+            # Copy file to current working directory
+            shutil.copy(file_path, target_code)

-    if file_path.endswith('.py'):
-        py = Pythonscript(file_path)
+    if target_code.endswith('.py'):
+        py = Pythonscript(target_code)
         name = py.get_name()
-        description = py.get_description() + " CLAIMED v" + version
+        # convert description into a string with a single line
+        description = ('"' + py.get_description().replace('\n', ' ').replace('"', '\'')
+                       + ' – CLAIMED ' + CLAIMED_VERSION + '"')
         inputs = py.get_inputs()
         outputs = py.get_outputs()
         requirements = py.get_requirements()

From 18e422b59ca3a6a0438d5176c203cc559d052c8d Mon Sep 17 00:00:00 2001
From: Benedikt Blumenstiel
Date: Thu, 14 Sep 2023 15:55:15 +0200
Subject: [PATCH 060/177] Add setup code to each script for logging and setting
 CLI parameters; remove ipython commands
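A sketch of the convention this commit introduces (script and parameter names here are hypothetical): a built operator is now invoked as `python /opt/app-root/src/my_op.py input_path=data.csv log_level=DEBUG`, and the injected setup code exports every matching `key=value` argument as an environment variable before the component code runs:

```py
# Minimal sketch of the injected key=value parsing; argv values are hypothetical.
import os
import re
import sys

sys.argv = ['my_op.py', 'input_path=data.csv', 'log_level=DEBUG']

# keep only arguments that look like key=value, as the setup code does
parameters = [s for s in sys.argv
              if '=' in s and re.match(r'[A-Za-z0-9_]*=[.\/A-Za-z0-9]*', s)]

for parameter in parameters:
    variable = parameter.split('=')[0]
    value = '='.join(parameter.split('=')[1:])
    os.environ[variable] = value  # the component reads it via os.environ.get(...)

assert os.environ['input_path'] == 'data.csv'
```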
--- src/c3/generate_kfp_component.py | 72 +++++++++++++++++++++++++------- 1 file changed, 58 insertions(+), 14 deletions(-) diff --git a/src/c3/generate_kfp_component.py b/src/c3/generate_kfp_component.py index 797d6aca..4654849c 100644 --- a/src/c3/generate_kfp_component.py +++ b/src/c3/generate_kfp_component.py @@ -12,6 +12,46 @@ CLAIMED_VERSION = 'V0.1' +ADDITIONAL_CODE = """ +# default code for each operator +import os +import sys +import re +import logging + +# init logger +default_log_level = 'INFO' +root = logging.getLogger() +root.setLevel(default_log_level) +handler = logging.StreamHandler(sys.stdout) +handler.setLevel(default_log_level) +formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s') +handler.setFormatter(formatter) +root.addHandler(handler) +logging.basicConfig(level=logging.CRITICAL) + +# get parameters from args +parameters = list(filter( + lambda s: s.find('=') > -1 and bool(re.match(r'[A-Za-z0-9_]*=[.\/A-Za-z0-9]*', s)), + sys.argv + )) + +# set parameters to env variables +for parameter in parameters: + variable = parameter.split('=')[0] + value = '='.join(parameter.split('=')[1:]) + logging.info(f'Parameter: {variable} = "{value}"') + os.environ[variable] = value + +# update log level +log_level = os.environ.get('log_level', default_log_level) +if default_log_level != log_level: + logging.info(f'Updating log level to {log_level}') + handler.setLevel(log_level) + +""" + + def generate_component(file_path: str, repository: str, version: str, additional_files: str = None): root = logging.getLogger() root.setLevel('INFO') @@ -33,10 +73,21 @@ def generate_component(file_path: str, repository: str, version: str, additional target_code = convert_notebook(file_path) else: target_code = file_path.split('/')[-1] - if file_path != target_code: - # Copy file to current working directory - shutil.copy(file_path, target_code) + if file_path == target_code: + # use temp file for processing + target_code = 'copy_' + target_code + # Copy file to current working directory + shutil.copy(file_path, target_code) + if target_code.endswith('.py'): + # Add code for logging and cli parameters to the beginning of the script + with open(target_code, 'r') as f: + script = f.read() + script = ADDITIONAL_CODE + script + with open(target_code, 'w') as f: + f.write(script) + + # getting parameter from the script if target_code.endswith('.py'): py = Pythonscript(target_code) name = py.get_name() @@ -92,27 +143,20 @@ def check_variable(var_name): requirements_docker = list(map(lambda s: 'RUN ' + s, requirements)) requirements_docker = '\n'.join(requirements_docker) - python_command = 'python' if target_code.endswith('.py') else 'ipython' - docker_file = f""" FROM registry.access.redhat.com/ubi8/python-39 USER root RUN dnf install -y java-11-openjdk USER default - RUN pip install ipython==8.6.0 nbformat==5.7.0 {requirements_docker} ADD {additional_files_local} /opt/app-root/src/ ADD {target_code} /opt/app-root/src/ USER root RUN chmod -R 777 /opt/app-root/src/ USER default - CMD ["{python_command}", "/opt/app-root/src/{target_code}"] + CMD ["python", "/opt/app-root/src/{target_code}"] """ - # Remove packages that are not used for python scripts - if target_code.endswith('.py'): - docker_file = docker_file.replace('RUN pip install ipython==8.6.0 nbformat==5.7.0\n', '') - with open("Dockerfile", "w") as text_file: text_file.write(docker_file) @@ -122,6 +166,7 @@ def check_variable(var_name): os.system(f'docker tag `echo claimed-{name}:{version}` `echo 
{repository}/claimed-{name}:latest`') os.system(f'docker push `echo {repository}/claimed-{name}:{version}`') os.system(f'docker push `echo {repository}/claimed-{name}:latest`') + parameter_type = Enum('parameter_type', ['INPUT', 'OUTPUT']) def get_component_interface(parameters, type: parameter_type): @@ -169,7 +214,7 @@ def get_parameter_list(): - sh - -ec - | - $python $call + python $call $input_for_implementation''') yaml = t.substitute( @@ -181,7 +226,6 @@ def get_parameter_list(): outputPath=get_output_name(), input_for_implementation=get_input_for_implementation(), call=f'./{target_code} {get_parameter_list()}', - python=python_command, ) print(yaml) @@ -211,7 +255,7 @@ def get_parameter_list(): containers: - name: {name} image: {repository}/claimed-{name}:{version} - command: ["/opt/app-root/bin/{python_command}","/opt/app-root/src/{target_code}"] + command: ["/opt/app-root/bin/python","/opt/app-root/src/{target_code}"] env: {env_entries} restartPolicy: OnFailure From 043b1d7797f2366411f03c347b9117451c12ece9 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Sat, 16 Sep 2023 10:37:50 +0200 Subject: [PATCH 061/177] Moved templates to src/templates --- src/c3/generate_kfp_component.py | 140 +++++------------- src/templates/__init__.py | 33 +++++ src/templates/component_setup_code.py | 35 +++++ src/templates/dockerfile_template | 11 ++ src/templates/kfp_component_template.yaml | 15 ++ .../kubernetes_job_template.job.yaml | 16 ++ 6 files changed, 148 insertions(+), 102 deletions(-) create mode 100644 src/templates/__init__.py create mode 100644 src/templates/component_setup_code.py create mode 100644 src/templates/dockerfile_template create mode 100644 src/templates/kfp_component_template.yaml create mode 100644 src/templates/kubernetes_job_template.job.yaml diff --git a/src/c3/generate_kfp_component.py b/src/c3/generate_kfp_component.py index 4654849c..e4eedb94 100644 --- a/src/c3/generate_kfp_component.py +++ b/src/c3/generate_kfp_component.py @@ -3,64 +3,18 @@ import re import logging import shutil +import argparse from string import Template from io import StringIO from enum import Enum from pythonscript import Pythonscript from notebook_converter import convert_notebook +from src.templates import component_setup_code, dockerfile_template, kfp_component_template, kubernetes_job_template CLAIMED_VERSION = 'V0.1' -ADDITIONAL_CODE = """ -# default code for each operator -import os -import sys -import re -import logging - -# init logger -default_log_level = 'INFO' -root = logging.getLogger() -root.setLevel(default_log_level) -handler = logging.StreamHandler(sys.stdout) -handler.setLevel(default_log_level) -formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s') -handler.setFormatter(formatter) -root.addHandler(handler) -logging.basicConfig(level=logging.CRITICAL) - -# get parameters from args -parameters = list(filter( - lambda s: s.find('=') > -1 and bool(re.match(r'[A-Za-z0-9_]*=[.\/A-Za-z0-9]*', s)), - sys.argv - )) - -# set parameters to env variables -for parameter in parameters: - variable = parameter.split('=')[0] - value = '='.join(parameter.split('=')[1:]) - logging.info(f'Parameter: {variable} = "{value}"') - os.environ[variable] = value - -# update log level -log_level = os.environ.get('log_level', default_log_level) -if default_log_level != log_level: - logging.info(f'Updating log level to {log_level}') - handler.setLevel(log_level) - -""" - - def generate_component(file_path: str, repository: str, version: str, additional_files: str = 
None): - root = logging.getLogger() - root.setLevel('INFO') - - handler = logging.StreamHandler(sys.stdout) - handler.setLevel('INFO') - formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s') - handler.setFormatter(formatter) - root.addHandler(handler) logging.info('Parameters: ') logging.info('file_path: ' + file_path) @@ -83,7 +37,7 @@ def generate_component(file_path: str, repository: str, version: str, additional # Add code for logging and cli parameters to the beginning of the script with open(target_code, 'r') as f: script = f.read() - script = ADDITIONAL_CODE + script + script = component_setup_code + script with open(target_code, 'w') as f: f.write(script) @@ -107,10 +61,7 @@ def generate_component(file_path: str, repository: str, version: str, additional print(outputs) print(requirements) - def check_variable(var_name): - return var_name in locals() or var_name in globals() - - if check_variable('additional_files'): + if additional_files is not None: if additional_files.startswith('['): additional_files_path = 'additional_files_path' if not os.path.exists(additional_files_path): @@ -125,6 +76,11 @@ def check_variable(var_name): else: additional_files_local = additional_files.split('/')[-1:][0] shutil.copy(additional_files, additional_files_local) + # ensure the original file is not deleted later + if additional_files != additional_files_local: + additional_files_path = additional_files_local + else: + additional_files_path = None else: additional_files_local = target_code # hack additional_files_path = None @@ -143,19 +99,11 @@ def check_variable(var_name): requirements_docker = list(map(lambda s: 'RUN ' + s, requirements)) requirements_docker = '\n'.join(requirements_docker) - docker_file = f""" - FROM registry.access.redhat.com/ubi8/python-39 - USER root - RUN dnf install -y java-11-openjdk - USER default - {requirements_docker} - ADD {additional_files_local} /opt/app-root/src/ - ADD {target_code} /opt/app-root/src/ - USER root - RUN chmod -R 777 /opt/app-root/src/ - USER default - CMD ["python", "/opt/app-root/src/{target_code}"] - """ + docker_file = dockerfile_template.substitute( + requirements_docker=requirements_docker, + target_code=target_code, + additional_files_local=additional_files_local, + ) with open("Dockerfile", "w") as text_file: text_file.write(docker_file) @@ -164,8 +112,8 @@ def check_variable(var_name): os.system(f'docker build --platform=linux/amd64 -t `echo claimed-{name}:{version}` .') os.system(f'docker tag `echo claimed-{name}:{version}` `echo {repository}/claimed-{name}:{version}`') os.system(f'docker tag `echo claimed-{name}:{version}` `echo {repository}/claimed-{name}:latest`') - os.system(f'docker push `echo {repository}/claimed-{name}:{version}`') os.system(f'docker push `echo {repository}/claimed-{name}:latest`') + os.system(f'docker push `echo {repository}/claimed-{name}:{version}`') parameter_type = Enum('parameter_type', ['INPUT', 'OUTPUT']) @@ -201,23 +149,7 @@ def get_parameter_list(): index = index + 1 return return_value - t = Template('''name: $name -description: $description - -inputs: -$inputs - -implementation: - container: - image: $container_uri:$version - command: - - sh - - -ec - - | - python $call -$input_for_implementation''') - - yaml = t.substitute( + yaml = kfp_component_template.substitute( name=name, description=description, inputs=get_component_interface(inputs, parameter_type.INPUT), @@ -245,22 +177,13 @@ def get_parameter_list(): env_entries.pop(-1) env_entries = ''.join(env_entries) - 
job_yaml = f'''apiVersion: batch/v1 -kind: Job -metadata: - name: {name} -spec: - template: - spec: - containers: - - name: {name} - image: {repository}/claimed-{name}:{version} - command: ["/opt/app-root/bin/python","/opt/app-root/src/{target_code}"] - env: -{env_entries} - restartPolicy: OnFailure - imagePullSecrets: - - name: image_pull_secret''' + job_yaml = kubernetes_job_template.substitute( + name=name, + repository=repository, + version=version, + target_code=target_code, + env_entries=env_entries, + ) print(job_yaml) target_job_yaml_path = file_path.replace('.ipynb', '.job.yaml').replace('.py', '.job.yaml') @@ -277,7 +200,6 @@ def get_parameter_list(): if __name__ == '__main__': - import argparse parser = argparse.ArgumentParser() parser.add_argument('--file_path', type=str, required=True, @@ -288,7 +210,21 @@ def get_parameter_list(): help='Image version') parser.add_argument('--additional_files', type=str, help='Comma-separated list of paths to additional files to include in the container image') + parser.add_argument('--log_level', type=str, default='INFO') + args = parser.parse_args() - generate_component(**vars(args)) + # Init logging + root = logging.getLogger() + root.setLevel(args.log_level) + handler = logging.StreamHandler(sys.stdout) + formatter = logging.Formatter('%(name)s - %(levelname)s - %(message)s') + handler.setFormatter(formatter) + handler.setLevel(args.log_level) + root.addHandler(handler) + + generate_component(file_path=args.file_path, + repository=args.repository, + version=args.version, + additional_files=args.additional_files) diff --git a/src/templates/__init__.py b/src/templates/__init__.py new file mode 100644 index 00000000..e80048fb --- /dev/null +++ b/src/templates/__init__.py @@ -0,0 +1,33 @@ + +import os +from string import Template +from pathlib import Path + +# template file names +COMPONENT_SETUP_CODE = 'component_setup_code.py' +GW_COMPONENT_SETUP_CODE = 'gw_component_setup_code.py' +DOCKERFILE_FILE = 'dockerfile_template' +KFP_COMPONENT_FILE = 'kfp_component_template.yaml' +KUBERNETES_JOB_FILE = 'kubernetes_job_template.job.yaml' +GRID_WRAPPER_FILE = 'grid_wrapper_template.py' + +# load templates +template_path = Path(os.path.dirname(__file__)) + +with open(template_path / COMPONENT_SETUP_CODE, 'r') as f: + component_setup_code = Template(f.read()) + +with open(template_path / GW_COMPONENT_SETUP_CODE, 'r') as f: + gw_component_setup_code = Template(f.read()) + +with open(template_path / DOCKERFILE_FILE, 'r') as f: + dockerfile_template = Template(f.read()) + +with open(template_path / KFP_COMPONENT_FILE, 'r') as f: + kfp_component_template = Template(f.read()) + +with open(template_path / KUBERNETES_JOB_FILE, 'r') as f: + kubernetes_job_template = Template(f.read()) + +with open(template_path / GRID_WRAPPER_FILE, 'r') as f: + grid_wrapper_template = Template(f.read()) diff --git a/src/templates/component_setup_code.py b/src/templates/component_setup_code.py new file mode 100644 index 00000000..348cae7b --- /dev/null +++ b/src/templates/component_setup_code.py @@ -0,0 +1,35 @@ +# default code for each operator +import os +import sys +import re +import logging + +# init logger +root = logging.getLogger() +root.setLevel('INFO') +handler = logging.StreamHandler(sys.stdout) +handler.setLevel('INFO') +formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s') +handler.setFormatter(formatter) +root.addHandler(handler) +logging.basicConfig(level=logging.CRITICAL) + +# get parameters from args +parameters = list(filter( + 
lambda s: s.find('=') > -1 and bool(re.match(r'[A-Za-z0-9_]*=[.\/A-Za-z0-9]*', s)), + sys.argv + )) + +# set parameters to env variables +for parameter in parameters: + variable = parameter.split('=')[0] + value = parameter.split('=', 1)[-1] + logging.info(f'Parameter: {variable} = "{value}"') + os.environ[variable] = value + +# update log level +log_level = os.environ.get('log_level', 'INFO') +if log_level !='INFO': + logging.info(f'Updating log level to {log_level}') + root.setLevel(log_level) + handler.setLevel(log_level) diff --git a/src/templates/dockerfile_template b/src/templates/dockerfile_template new file mode 100644 index 00000000..dfc1134d --- /dev/null +++ b/src/templates/dockerfile_template @@ -0,0 +1,11 @@ +FROM registry.access.redhat.com/ubi8/python-39 +USER root +RUN dnf install -y java-11-openjdk +USER default +${requirements_docker} +ADD ${target_code} /opt/app-root/src/ +ADD ${additional_files_local} /opt/app-root/src/ +USER root +RUN chmod -R 777 /opt/app-root/src/ +USER default +CMD ["python", "/opt/app-root/src/${target_code}"] \ No newline at end of file diff --git a/src/templates/kfp_component_template.yaml b/src/templates/kfp_component_template.yaml new file mode 100644 index 00000000..be12ebb7 --- /dev/null +++ b/src/templates/kfp_component_template.yaml @@ -0,0 +1,15 @@ +name: $name +description: $description + +inputs: +$inputs + +implementation: + container: + image: $container_uri:$version + command: + - sh + - -ec + - | + python $call +$input_for_implementation \ No newline at end of file diff --git a/src/templates/kubernetes_job_template.job.yaml b/src/templates/kubernetes_job_template.job.yaml new file mode 100644 index 00000000..f5210a4d --- /dev/null +++ b/src/templates/kubernetes_job_template.job.yaml @@ -0,0 +1,16 @@ +apiVersion: batch/v1 +kind: Job +metadata: + name: ${name} +spec: + template: + spec: + containers: + - name: ${name} + image: ${repository}/claimed-${name}:${version} + command: ["/opt/app-root/bin/python","/opt/app-root/src/${target_code}"] + env: +${env_entries} + restartPolicy: OnFailure + imagePullSecrets: + - name: image_pull_secret \ No newline at end of file From ef4322062b1b3a4cf4750813cab49a388fa16fdf Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Sat, 16 Sep 2023 10:38:06 +0200 Subject: [PATCH 062/177] Added grid wrapper --- src/c3/generate_grid_wrapper.py | 186 +++++++++++++++++++++++ src/templates/grid_wrapper_template.py | 177 +++++++++++++++++++++ src/templates/gw_component_setup_code.py | 17 +++ 3 files changed, 380 insertions(+) create mode 100644 src/c3/generate_grid_wrapper.py create mode 100644 src/templates/grid_wrapper_template.py create mode 100644 src/templates/gw_component_setup_code.py diff --git a/src/c3/generate_grid_wrapper.py b/src/c3/generate_grid_wrapper.py new file mode 100644 index 00000000..175779a8 --- /dev/null +++ b/src/c3/generate_grid_wrapper.py @@ -0,0 +1,186 @@ +import logging +import os +import argparse +import sys +from pythonscript import Pythonscript +from notebook_converter import convert_notebook +from generate_kfp_component import generate_component +from src.templates import grid_wrapper_template, gw_component_setup_code + + +def wrap_component(component_path, + component_description, + component_dependencies, + component_interface, + component_inputs, + component_process, + component_pre_process, + component_post_process, + ): + # get component name from path + component_name = os.path.splitext(os.path.basename(component_path))[0] + + grid_wrapper_code = 
grid_wrapper_template.substitute( + component_name=component_name, + component_description=component_description, + component_dependencies=component_dependencies, + component_inputs=component_inputs, + component_interface=component_interface, + component_process=component_process, + component_pre_process=component_pre_process, + component_post_process=component_post_process, + ) + + # Write edited code to file + grid_wrapper_file_path = os.path.join(os.path.dirname(component_path), f'gw_{component_name}.py') + with open(grid_wrapper_file_path, 'w') as f: + f.write(grid_wrapper_code) + + logging.info(f'Saved wrapped component to {grid_wrapper_file_path}') + + return grid_wrapper_file_path + + +def get_component_elements(file_path): + # get required elements from component code + py = Pythonscript(file_path) + # convert description into a string with a single line + description = (py.get_description().replace('\n', ' ').replace('"', '\'')) + inputs = py.get_inputs() + outputs = py.get_outputs() + dependencies = py.get_requirements() + + # combine inputs and outputs + interface_values = {} + interface_values.update(inputs) + interface_values.update(outputs) + + # combine dependencies list + dependencies = '\n# '.join(dependencies) + + # generate interface code from inputs and outputs + interface = '' + type_to_func = {'String': '', 'Boolean': 'bool', 'Integer': 'int', 'Float': 'float'} + for variable, d in interface_values.items(): + interface += f"# {d['description']}\n" + interface += f"component_{variable} = {type_to_func[d['type']]}(os.getenv('{variable}', {d['default']}))\n" + + # generate kwargs for the subprocesses + process_inputs = ', '.join([f'{i}=component_{i}' for i in inputs.keys()]) + # use log level from grid wrapper + process_inputs = process_inputs.replace('component_log_level', 'log_level') + + return description, interface, process_inputs, dependencies + + +# Adding code +def edit_component_code(file_path): + file_name = os.path.basename(file_path) + if file_path.endswith('.ipynb'): + logging.info('Convert notebook to python script') + target_file = convert_notebook(file_path) + file_path = target_file + file_name = os.path.basename(file_path) + else: + # write edited code to different file + target_file = os.path.join(os.path.dirname(file_path), 'component_' + file_name) + + target_file_name = os.path.basename(target_file) + + with open(file_path, 'r') as f: + script = f.read() + # Add code for logging and cli parameters to the beginning of the script + script = gw_component_setup_code + script + # replace old filename with new file name + script = script.replace(file_name, target_file_name) + with open(target_file, 'w') as f: + f.write(script) + + if '__main__' not in script: + logging.warning('No __main__ found in component code. Grid wrapper will import functions from component, ' + 'which can lead to unexpected behaviour without using __main__.') + + logging.info('Saved component python script in ' + target_file) + + return target_file + + +def apply_grid_wrapper(file_path, component_process, component_pre_process, component_post_process, + *args, **kwargs): + + assert file_path.endswith('.py') or file_path.endswith('.ipynb'), \ + "Please provide a component file path to a python script or notebook." 
+ + file_path = edit_component_code(file_path) + + description, interface, inputs, dependencies = get_component_elements(file_path) + + component_elements = dict(component_path=file_path, + component_description=description, + component_dependencies=dependencies, + component_interface=interface, + component_inputs=inputs, + component_process=component_process, + component_pre_process=component_pre_process, + component_post_process=component_post_process, + ) + + logging.debug('Wrap component with parameters:') + for component, value in component_elements.items(): + logging.debug(component + ':\n' + str(value) + '\n') + + logging.info('Wrap component') + grid_wrapper_file_path = wrap_component(**component_elements) + return grid_wrapper_file_path, file_path + + +if __name__ == '__main__': + parser = argparse.ArgumentParser() + parser.add_argument('-f', '--file_path', type=str, required=True, + help = 'Path to python script or notebook') + parser.add_argument('-p', '--component_process', type=str, required=True, + help='Name of the component sub process that is executed for each batch.') + parser.add_argument('-pre', '--component_pre_process', type=str, + help='Name of the component pre process which is executed once before parallelization.') + parser.add_argument('-post', '--component_post_process', type=str, + help='Name of the component post process which is executed once after parallelization.') + + parser.add_argument('-r', '--repository', type=str, + help='Container registry address, e.g. docker.io/') + parser.add_argument('-v', '--version', type=str, + help='Image version') + parser.add_argument('-a', '--additional_files', type=str, + help='Comma-separated list of paths to additional files to include in the container image') + parser.add_argument('-l', '--log_level', type=str, default='INFO') + + args = parser.parse_args() + + # Init logging + root = logging.getLogger() + root.setLevel(args.log_level) + handler = logging.StreamHandler(sys.stdout) + formatter = logging.Formatter('%(name)s - %(levelname)s - %(message)s') + handler.setFormatter(formatter) + handler.setLevel(args.log_level) + root.addHandler(handler) + + grid_wrapper_file_path, component_path = apply_grid_wrapper(**vars(args)) + + if args.repository is not None and args.version is not None: + logging.info('Generate CLAIMED operator for grid wrapper') + + # Add component path and init file path to additional_files + if args.additional_files is None: + args.additional_files = component_path + else: + if args.additional_files.startswith('['): + args.additional_files = f'{args.additional_files[:-1]},{component_path}]' + else: + args.additional_files = f'[{args.additional_files},{component_path}]' + + generate_component(file_path=grid_wrapper_file_path, + repository=args.repository, + version=args.version, + additional_files=args.additional_files) + + # TODO: Delete component_path? 
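For orientation, a minimal component that this wrapper can process might look like the sketch below (file, function, and variable names are hypothetical). It could then be wrapped and packaged with `python src/c3/generate_grid_wrapper.py -f my_component.py -p process_batch -r docker.io/<repository> -v 0.1`:

```py
"""
A hypothetical example component (my_component.py) for the grid wrapper.
The first doc string becomes the operator description.
"""
import os

# directory containing the files of all batches
input_dir = os.environ.get('input_dir', 'data/')


def process_batch(batch, input_dir):
    # process the files of one batch and return the target file(s),
    # which the wrapper uses to verify that the batch succeeded
    os.makedirs(input_dir, exist_ok=True)
    target_file = os.path.join(input_dir, batch + '.processed')
    open(target_file, 'w').close()
    return target_file


if __name__ == '__main__':
    # keep top-level work behind __main__: the wrapper imports this module
    process_batch('demo_batch', input_dir)
```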
diff --git a/src/templates/grid_wrapper_template.py b/src/templates/grid_wrapper_template.py
new file mode 100644
index 00000000..a69d52aa
--- /dev/null
+++ b/src/templates/grid_wrapper_template.py
@@ -0,0 +1,177 @@
+"""
+${component_name} was wrapped with the grid wrapper, which extends any CLAIMED component with the generic grid computing pattern https://romeokienzler.medium.com/the-generic-grid-computing-pattern-transforms-any-sequential-workflow-step-into-a-transient-grid-c7f3ca7459c8
+
+CLAIMED component description: ${component_description}
+"""
+
+# component dependencies
+# ${component_dependencies}
+
+import os
+import json
+import random
+import logging
+import time
+import glob
+from pathlib import Path
+
+# import component code
+from ${component_name} import *
+
+
+# File with batches. Provided as a comma-separated list of strings or keys in a json dict.
+gw_batch_file = os.environ.get('gw_batch_file', None)
+# File path pattern like your/path/**/*.tif. Multiple patterns can be separated with commas. Is ignored if gw_batch_file is provided.
+gw_file_path_pattern = os.environ.get('gw_file_path_pattern', None)
+# Pattern for grouping file paths into batches like ".split('.')[-1]". Is ignored if gw_batch_file is provided.
+gw_group_by = os.environ.get('gw_group_by', None)
+# Path to the grid wrapper coordinator directory
+gw_coordinator_path = os.environ.get('gw_coordinator_path')
+# Lock file suffix
+gw_lock_file_suffix = os.environ.get('gw_lock_file_suffix', '.lock')
+# Processed file suffix
+gw_processed_file_suffix = os.environ.get('gw_processed_file_suffix', '.processed')
+# Timeout in seconds after which the lock file of a struggling job is removed (default 1 hour)
+gw_lock_timeout = int(os.environ.get('gw_lock_timeout', 3600))
+
+# component interface
+${component_interface}
+
+def load_batches_from_file(batch_file):
+    if batch_file.endswith('.json'):
+        # load batches from keys of a json file
+        logging.info(f'Loading batches from json file: {batch_file}')
+        with open(batch_file, 'r') as f:
+            batch_dict = json.load(f)
+        batches = batch_dict.keys()
+
+    else:
+        # Load batches from comma-separated txt file
+        logging.info(f'Loading comma-separated batch strings from file: {batch_file}')
+        with open(batch_file, 'r') as f:
+            batch_string = f.read()
+        batches = [b.strip() for b in batch_string.split(',')]
+
+    logging.info(f'Loaded {len(batches)} batches')
+    logging.debug(f'List of batches: {batches}')
+    assert len(batches) > 0, f"batch_file {batch_file} has no batches."
+    return batches
+
+
+def identify_batches_from_pattern(file_path_patterns, group_by):
+    logging.info('Start identifying files and batches')
+    batches = set()
+    all_files = []
+
+    # Iterate over comma-separated paths
+    for file_path_pattern in file_path_patterns.split(','):
+        logging.info(f'Get file paths from pattern: {file_path_pattern}')
+        all_files.extend(glob.glob(file_path_pattern.strip()))
+    assert len(all_files) > 0, f"Found no files with file_path_patterns {file_path_patterns}."
+
+    # get batches by applying the group by function to all file paths
+    for path_string in all_files:
+        part = eval('path_string' + group_by)
+        batches.add(part)
+
+    logging.info(f'Identified {len(batches)} batches')
+    logging.debug(f'List of batches: {batches}')
+    assert len(batches) > 0, (f"Found no batches with group_by {group_by}. 
" + f"Identified {len(all_files)} files, e.g., {all_files[:10]}.") + return batches + + +def perform_process(process, batch): + logging.debug(f'Check coordinator files for batch {batch}.') + # init coordinator files + lock_file = Path(gw_coordinator_path) / (batch + gw_lock_file_suffix) + processed_file = Path(gw_coordinator_path) / (batch + gw_processed_file_suffix) + + if lock_file.exists(): + # remove strugglers + if lock_file.stat().st_mtime < time.time() - gw_lock_timeout: + logging.debug(f'Lock file {lock_file} is expired.') + lock_file.unlink() + else: + logging.debug(f'Batch {batch} is locked.') + return + + if processed_file.exists(): + logging.debug(f'Batch {batch} is processed.') + return + + logging.debug(f'Locking batch {batch}.') + lock_file.touch() + + # processing files with custom process + logging.info(f'Processing batch {batch}.') + try: + target_files = process(batch, ${component_inputs}) + except Exception as e: + # Remove lock file before raising the error + lock_file.unlink() + raise e + + # optional verify target files + if target_files is not None: + if isinstance(target_files, str): + target_files = [target_files] + for target_file in target_files: + assert os.path.exists(target_file), f'Target file {target_file} does not exist for batch {batch}.' + else: + logging.info(f'Cannot verify batch {batch} (target files not provided).') + + logging.info(f'Finished Batch {batch}.') + processed_file.touch() + + # Remove lock file + lock_file.unlink() + + +def process_wrapper(sub_process, pre_process=None, post_process=None): + delay = random.randint(1, 60) + logging.info(f'Staggering start, waiting for {delay} seconds') + time.sleep(delay) + + # Init coordinator dir + Path(gw_coordinator_path).mkdir(exist_ok=True, parents=True) + + # run preprocessing + if pre_process is not None: + perform_process(pre_process, 'preprocess') + + # wait until preprocessing is finished + processed_file = Path(gw_coordinator_path) / ('preprocess' + gw_processed_file_suffix) + while not processed_file.exists(): + logging.info(f'Waiting for preprocessing to finish.') + time.sleep(60) + + # get batches + if gw_batch_file is not None and os.path.isfile(gw_batch_file): + batches = load_batches_from_file(gw_batch_file) + elif gw_file_path_pattern is not None and gw_group_by is not None: + batches = identify_batches_from_pattern(gw_file_path_pattern, gw_group_by) + else: + raise ValueError("Cannot identify batches. " + "Provide valid gw_batch_file or gw_file_path_pattern and gw_group_by.") + + # Iterate over all batches + for batch in batches: + perform_process(sub_process, batch) + + # Check if all batches are processed + processed_status = [(Path(gw_coordinator_path) / (batch + gw_processed_file_suffix)).exists() for batch in batches] + lock_status = [(Path(gw_coordinator_path) / (batch + gw_lock_file_suffix)).exists() for batch in batches] + if all(processed_status): + if post_process is not None: + # run postprocessing + perform_process(post_process, 'postprocess') + + logging.info('Finished all processes.') + else: + logging.info(f'Finished current process. 
Status batches: ' + f'{sum(processed_status)} processed / {sum(lock_status)} locked / {len(processed_status)} total') + + +if __name__ == '__main__': + process_wrapper(${component_process}, ${component_pre_process}, ${component_post_process}) diff --git a/src/templates/gw_component_setup_code.py b/src/templates/gw_component_setup_code.py new file mode 100644 index 00000000..205f07ac --- /dev/null +++ b/src/templates/gw_component_setup_code.py @@ -0,0 +1,17 @@ +import os +import re +import sys +import logging + +# get parameters from args +parameters = list(filter( + lambda s: s.find('=') > -1 and bool(re.match(r'[A-Za-z0-9_]*=[.\/A-Za-z0-9]*', s)), + sys.argv + )) + +# set parameters to env variables +for parameter in parameters: + variable = parameter.split('=')[0] + value = parameter.split('=', 1)[-1] + logging.debug(f'Parameter: {variable} = "{value}"') + os.environ[variable] = value \ No newline at end of file From 94ae7e36397f719de97c5a0626a4a53ac710e8e1 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Sat, 16 Sep 2023 10:40:21 +0200 Subject: [PATCH 063/177] Removed nb compiler because of nb converter --- src/__init__.py | 0 src/c3/generate_kfp_component.ipynb | 442 ---------------------------- src/c3/notebook.py | 66 ----- 3 files changed, 508 deletions(-) create mode 100644 src/__init__.py delete mode 100644 src/c3/generate_kfp_component.ipynb delete mode 100644 src/c3/notebook.py diff --git a/src/__init__.py b/src/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/src/c3/generate_kfp_component.ipynb b/src/c3/generate_kfp_component.ipynb deleted file mode 100644 index 7bf640cb..00000000 --- a/src/c3/generate_kfp_component.ipynb +++ /dev/null @@ -1,442 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": null, - "id": "2fc91001", - "metadata": {}, - "outputs": [], - "source": [ - "from notebook import Notebook\n", - "import os\n", - "import shutil\n", - "from string import Template\n", - "import sys\n", - "from io import StringIO\n", - "from enum import Enum\n", - "import logging\n", - "import re" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0a0b99ea-4fdd-4f65-bb3d-1057c5eb5c05", - "metadata": {}, - "outputs": [], - "source": [ - "root = logging.getLogger()\n", - "root.setLevel('INFO')\n", - "\n", - "handler = logging.StreamHandler(sys.stdout)\n", - "handler.setLevel('INFO')\n", - "formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')\n", - "handler.setFormatter(formatter)\n", - "root.addHandler(handler)\n", - "\n", - "\n", - "parameters = list(\n", - " map(lambda s: re.sub('$', '\"', s),\n", - " map(\n", - " lambda s: s.replace('=', '=\"'),\n", - " filter(\n", - " lambda s: s.find('=') > -1 and bool(re.match(r'[A-Za-z0-9_]*=[.\\/A-Za-z0-9]*', s)),\n", - " sys.argv\n", - " )\n", - " )))\n", - "\n", - "logging.info('Logging parameters: ' + ''.join(parameters))\n", - "for parameter in parameters:\n", - " logging.info('Parameter: ' + parameter)\n", - " exec(parameter)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "141b9a18-c302-4360-bfec-ac2c5b29fad9", - "metadata": {}, - "outputs": [], - "source": [ - "#version=\"0.1n\"\n", - "#notebook_path = '../component-library/input/input-url.ipynb'" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "055a924f-ae8f-40c4-ab5c-91f136cee8ab", - "metadata": {}, - "outputs": [], - "source": [ - "nb = Notebook(notebook_path)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": 
"40670316-3b8f-41a5-99b5-12c9162627f1", - "metadata": {}, - "outputs": [], - "source": [ - "name = nb.get_name()\n", - "description = nb.get_description() + \" CLAIMED v\"+ version\n", - "inputs = nb.get_inputs()\n", - "outputs = nb.get_outputs()\n", - "requirements = nb.get_requirements()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "92c0f904-8afd-4130-8c27-ac955fb61266", - "metadata": {}, - "outputs": [], - "source": [ - "print(name)\n", - "print(description)\n", - "print(inputs)\n", - "print(outputs)\n", - "print(requirements)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0604b83b-88c5-4fa7-a2bb-b452d80e2a61", - "metadata": {}, - "outputs": [], - "source": [ - "!echo {notebook_path}" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "id": "e271b688-b307-4bc9-803c-4c8bed0761ae", - "metadata": {}, - "outputs": [], - "source": [ - "def check_variable(var_name):\n", - " return var_name in locals() or var_name in globals()\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5c0fcb61-43e3-4711-919c-078da0a02e72", - "metadata": {}, - "outputs": [], - "source": [ - "#target_code = notebook_path.replace('.ipynb','.py').split('/')[-1:][0]\n", - "\n", - "target_code = notebook_path.split('/')[-1:][0]\n", - "shutil.copy(notebook_path,target_code)\n", - "if check_variable('additional_files'):\n", - " if additional_files.startswith('['):\n", - " additional_files_path = 'additional_files_path'\n", - " if not os.path.exists(additional_files_path):\n", - " os.makedirs(additional_files_path)\n", - " additional_files_local = additional_files_path\n", - " additional_files=additional_files[1:-1].split(',')\n", - " print('Additional files to add to container:')\n", - " for additional_file in additional_files:\n", - " print(additional_file)\n", - " shutil.copy(additional_file, additional_files_local)\n", - " print(os.listdir(additional_files_path))\n", - " else:\n", - " additional_files_local = additional_files.split('/')[-1:][0]\n", - " shutil.copy(additional_files,additional_files_local)\n", - "else:\n", - " additional_files_local=target_code #hack\n", - " additional_files_path = None" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f67a914e-70b2-4ab6-ab37-121195b59ec4", - "metadata": {}, - "outputs": [], - "source": [ - "import re\n", - "\n", - "file = target_code\n", - "\n", - "# they can be also raw string and regex\n", - "textToSearch = r'!pip' \n", - "textToReplace = '#!pip'\n", - "\n", - "# read and replace\n", - "with open(file, 'r') as fd:\n", - " text, counter = re.subn(textToSearch, textToReplace, fd.read(), re.I)\n", - "\n", - "# check if there is at least a match\n", - "if counter > 0:\n", - " # edit the file\n", - " with open(file, 'w') as fd:\n", - " fd.write(text)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fa5c5b41-1637-456b-9193-fdd7c680c490", - "metadata": {}, - "outputs": [], - "source": [ - "requirements_docker = list(map(lambda s: 'RUN '+s, requirements))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "21beb828-c09b-42c0-9b89-94a143d55a03", - "metadata": {}, - "outputs": [], - "source": [ - "requirements_docker = '\\n'.join(requirements_docker)\n", - "docker_file = f\"\"\"\n", - "FROM registry.access.redhat.com/ubi8/python-39 \n", - "USER root\n", - "RUN dnf install -y java-11-openjdk\n", - "USER default\n", - "RUN pip install ipython==8.6.0 nbformat==5.7.0\n", - "{requirements_docker}\n", - "ADD {additional_files_local} 
/opt/app-root/src/\n", - "ADD {target_code} /opt/app-root/src/\n", - "USER root\n", - "RUN chmod -R 777 /opt/app-root/src/\n", - "USER default\n", - "CMD [\"ipython\", \"/opt/app-root/src/{target_code}\"]\n", - "\"\"\"\n", - "with open(\"Dockerfile\", \"w\") as text_file:\n", - " text_file.write(docker_file)\n", - "!cat Dockerfile" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "db44dd53-ee2f-497a-b9a0-e92cfcfd7ef7", - "metadata": { - "scrolled": true, - "tags": [] - }, - "outputs": [], - "source": [ - "!docker build --platform=linux/amd64 -t `echo claimed-{name}:{version}` .\n", - "!docker tag `echo claimed-{name}:{version}` `echo {repository}/claimed-{name}:{version}`\n", - "!docker tag `echo claimed-{name}:{version}` `echo {repository}/claimed-{name}:latest`\n", - "!docker push `echo {repository}/claimed-{name}:{version}`\n", - "!docker push `echo {repository}/claimed-{name}:latest`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8747a2f1-ed15-41ea-b864-d4482e751eb5", - "metadata": {}, - "outputs": [], - "source": [ - "parameter_type = Enum('parameter_type', ['INPUT', 'OUTPUT'])\n", - "\n", - "def get_component_interface(parameters, type : parameter_type):\n", - " template_string = str()\n", - " for parameter_name, parameter_options in parameters.items():\n", - " default = ''\n", - " if parameter_options['default'] is not None and type == parameter_type.INPUT:\n", - " default = f\", default: {parameter_options['default']}\"\n", - " template_string += f\"- {{name: {parameter_name}, type: {parameter_options['type']}, description: {parameter_options['description']}{default}}}\"\n", - " template_string += '\\n'\n", - " return template_string" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "27608ff6-60b7-4921-b360-4cafb6d8b11d", - "metadata": {}, - "outputs": [], - "source": [ - "def get_output_name():\n", - " for output_key, output_value in outputs.items():\n", - " return output_key" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9b44cadc-b5e7-4cef-8ebd-f3bcf5a56328", - "metadata": {}, - "outputs": [], - "source": [ - "\n", - "def get_input_for_implementation():\n", - " with StringIO() as inputs_str:\n", - " for input_key, input_value in inputs.items():\n", - " t = Template(\" - {inputValue: $name}\")\n", - " print(t.substitute(name=input_key), file=inputs_str)\n", - " return inputs_str.getvalue() " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "dd6c276c-1e4d-4212-a7b8-469a0310adea", - "metadata": {}, - "outputs": [], - "source": [ - "def get_parameter_list():\n", - " return_value = str()\n", - " index = 0\n", - " for output_key, output_value in outputs.items():\n", - " return_value = return_value + output_key + '=\"${' + str(index) + '}\" '\n", - " index = index + 1\n", - " for input_key, input_value in inputs.items():\n", - " return_value = return_value + input_key + '=\"${' + str(index) + '}\" '\n", - " index = index + 1\n", - " return return_value " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9168a5bd-76a5-4ae4-baf9-018397fa1d80", - "metadata": {}, - "outputs": [], - "source": [ - "\n", - "t = Template('''name: $name\n", - "description: $description\n", - "\n", - "inputs:\n", - "$inputs\n", - "\n", - "implementation:\n", - " container:\n", - " image: $container_uri:$version\n", - " command:\n", - " - sh\n", - " - -ec\n", - " - |\n", - " ipython $call\n", - "$input_for_implementation''')\n", - "yaml = t.substitute(\n", - " name=name,\n", - 
" description=description,\n", - " inputs=get_component_interface(inputs, parameter_type.INPUT),\n", - " container_uri=f\"{repository}/claimed-{name}\",\n", - " version=version,\n", - " outputPath=get_output_name(),\n", - " input_for_implementation=get_input_for_implementation(),\n", - " call=f'./{target_code} {get_parameter_list()}' \n", - " )\n", - "print(yaml)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "320e6d14-9584-4154-a225-437f6026fe4b", - "metadata": {}, - "outputs": [], - "source": [ - "target_yaml_path = notebook_path.replace('.ipynb','.yaml')\n", - "\n", - "with open(target_yaml_path, \"w\") as text_file:\n", - " text_file.write(yaml)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "83115456", - "metadata": {}, - "outputs": [], - "source": [ - "env_entries = []\n", - "for input_key, _ in inputs.items():\n", - " env_entry = f\" - name: {input_key}\\n value: value_of_{input_key}\"\n", - " env_entries.append(env_entry)\n", - " env_entries.append('\\n')\n", - "env_entries.pop(-1)\n", - "env_entries = ''.join(env_entries)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1d40b282", - "metadata": {}, - "outputs": [], - "source": [ - "job_yaml = f'''apiVersion: batch/v1\n", - "kind: Job\n", - "metadata:\n", - " name: {name}\n", - "spec:\n", - " template:\n", - " spec:\n", - " containers:\n", - " - name: {name}\n", - " image: {repository}/claimed-{name}:{version}\n", - " command: [\"/opt/app-root/bin/ipython\",\"/opt/app-root/src/{target_code}\"]\n", - " env:\n", - "{env_entries}\n", - " restartPolicy: OnFailure'''\n", - "\n", - "\n", - "print(job_yaml)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "52a6bafb", - "metadata": {}, - "outputs": [], - "source": [ - "target_job_yaml_path = notebook_path.replace('.ipynb','.job.yaml')\n", - "\n", - "with open(target_job_yaml_path, \"w\") as text_file:\n", - " text_file.write(job_yaml)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e36aa496", - "metadata": {}, - "outputs": [], - "source": [ - "# remove local files\n", - "os.remove(target_code)\n", - "os.remove('Dockerfile')\n", - "if additional_files_path is not None:\n", - " shutil.rmtree(additional_files_path, ignore_errors=True)" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.4" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/src/c3/notebook.py b/src/c3/notebook.py deleted file mode 100644 index 5f9a8f62..00000000 --- a/src/c3/notebook.py +++ /dev/null @@ -1,66 +0,0 @@ -import json -import re -from parser import ContentParser - -class Notebook(): - def __init__(self, path): - self.path = path - with open(path) as json_file: - self.notebook = json.load(json_file) - self.name = self.notebook['cells'][0]['source'][0].replace('#', '').replace('_', '-').strip() - self.description = self.notebook['cells'][1]['source'][0] - self.envs = self._get_env_vars() - - def _get_env_vars(self): - cp = ContentParser() - env_names = cp.parse(self.path)['env_vars'] - return_value = dict() - for env_name in env_names: - comment_line = str() - for line in self.notebook['cells'][4]['source']: - if re.search("[\"']" + env_name + "[\"']", 
line): - assert '#' in comment_line, "comment line didn't contain #" - assert ',' not in comment_line, "comment line contains ," - - if "int(" in line: - type = 'Integer' - elif "float(" in line: - type = 'Float' - elif "bool(" in line: - type = 'Boolean' - else: - type = 'String' - if ',' in line: - default=line.split(',')[1].split(')')[0] - else: - default = None - return_value[env_name]={ - 'description': comment_line.replace('#', '').strip(), - 'type': type, - 'default': default} - comment_line = line - return return_value - - def get_requirements(self): - requirements = [] - for cell in self.notebook['cells']: - for cell_content in cell['source']: - pattern = r"(![ ]*pip[ ]*install[ ]*)([A-Za-z=0-9.\-: ]*)" - result = re.findall(pattern,cell_content) - if len(result) == 1: - requirements.append((result[0][0]+ ' ' +result[0][1])[1:]) - return requirements - - def get_name(self): - return self.name - - def get_description(self): - return self.description - - def get_inputs(self): - return { key:value for (key,value) in self.envs.items() if not key.startswith('output_') } - - def get_outputs(self): - return { key:value for (key,value) in self.envs.items() if key.startswith('output_') } - - From f459f43110e66a9acab8ad6e3118b455894bfb53 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Mon, 9 Oct 2023 15:43:23 +0200 Subject: [PATCH 064/177] Updated create_operator and create_grid_wrapper --- ...grid_wrapper.py => create_grid_wrapper.py} | 48 +++++--- ...te_kfp_component.py => create_operator.py} | 106 +++++++++++------- src/c3/parser.py | 8 +- src/c3/pythonscript.py | 11 +- src/templates/__init__.py | 4 +- src/templates/grid_wrapper_template.py | 38 +++++-- src/templates/gw_component_setup_code.py | 2 +- src/templates/kfp_component_template.yaml | 15 ++- 8 files changed, 145 insertions(+), 87 deletions(-) rename src/c3/{generate_grid_wrapper.py => create_grid_wrapper.py} (82%) rename src/c3/{generate_kfp_component.py => create_operator.py} (69%) diff --git a/src/c3/generate_grid_wrapper.py b/src/c3/create_grid_wrapper.py similarity index 82% rename from src/c3/generate_grid_wrapper.py rename to src/c3/create_grid_wrapper.py index 175779a8..0efb910d 100644 --- a/src/c3/generate_grid_wrapper.py +++ b/src/c3/create_grid_wrapper.py @@ -2,10 +2,11 @@ import os import argparse import sys +from string import Template from pythonscript import Pythonscript from notebook_converter import convert_notebook -from generate_kfp_component import generate_component -from src.templates import grid_wrapper_template, gw_component_setup_code +from create_operator import create_operator +from templates import grid_wrapper_template, gw_component_setup_code, dockerfile_template def wrap_component(component_path, @@ -33,6 +34,8 @@ def wrap_component(component_path, # Write edited code to file grid_wrapper_file_path = os.path.join(os.path.dirname(component_path), f'gw_{component_name}.py') + # remove 'component_' from gw path + grid_wrapper_file_path = grid_wrapper_file_path.replace('component_', '') with open(grid_wrapper_file_path, 'w') as f: f.write(grid_wrapper_code) @@ -107,7 +110,6 @@ def edit_component_code(file_path): def apply_grid_wrapper(file_path, component_process, component_pre_process, component_post_process, *args, **kwargs): - assert file_path.endswith('.py') or file_path.endswith('.ipynb'), \ "Please provide a component file path to a python script or notebook." 
@@ -115,15 +117,16 @@ def apply_grid_wrapper(file_path, component_process, component_pre_process, comp description, interface, inputs, dependencies = get_component_elements(file_path) - component_elements = dict(component_path=file_path, - component_description=description, - component_dependencies=dependencies, - component_interface=interface, - component_inputs=inputs, - component_process=component_process, - component_pre_process=component_pre_process, - component_post_process=component_post_process, - ) + component_elements = dict( + component_path=file_path, + component_description=description, + component_dependencies=dependencies, + component_interface=interface, + component_inputs=inputs, + component_process=component_process, + component_pre_process=component_pre_process, + component_post_process=component_post_process, + ) logging.debug('Wrap component with parameters:') for component, value in component_elements.items(): @@ -137,7 +140,7 @@ def apply_grid_wrapper(file_path, component_process, component_pre_process, comp if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('-f', '--file_path', type=str, required=True, - help = 'Path to python script or notebook') + help='Path to python script or notebook') parser.add_argument('-p', '--component_process', type=str, required=True, help='Name of the component sub process that is executed for each batch.') parser.add_argument('-pre', '--component_pre_process', type=str, @@ -152,6 +155,8 @@ def apply_grid_wrapper(file_path, component_process, component_pre_process, comp parser.add_argument('-a', '--additional_files', type=str, help='Comma-separated list of paths to additional files to include in the container image') parser.add_argument('-l', '--log_level', type=str, default='INFO') + parser.add_argument('--dockerfile_template_path', type=str, default='', + help='Path to custom dockerfile template') args = parser.parse_args() @@ -178,9 +183,18 @@ def apply_grid_wrapper(file_path, component_process, component_pre_process, comp else: args.additional_files = f'[{args.additional_files},{component_path}]' - generate_component(file_path=grid_wrapper_file_path, - repository=args.repository, - version=args.version, - additional_files=args.additional_files) + # Update dockerfile template if specified + if args.dockerfile_template_path != '': + logging.info(f'Uses custom dockerfile template from {args.dockerfile_template_path}') + with open(args.dockerfile_template_path, 'r') as f: + dockerfile_template = Template(f.read()) + + create_operator( + file_path=grid_wrapper_file_path, + repository=args.repository, + version=args.version, + dockerfile_template=dockerfile_template, + additional_files=args.additional_files + ) # TODO: Delete component_path? 
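Both scripts now accept `--dockerfile_template_path`. A custom template may reuse the placeholders of the default `src/templates/dockerfile_template`; a slimmed-down sketch (base image choice illustrative) could look like:

```Dockerfile
FROM registry.access.redhat.com/ubi8/python-39
${requirements_docker}
ADD ${target_code} /opt/app-root/src/
ADD ${additional_files_local} /opt/app-root/src/
CMD ["python", "/opt/app-root/src/${target_code}"]
```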
diff --git a/src/c3/generate_kfp_component.py b/src/c3/create_operator.py similarity index 69% rename from src/c3/generate_kfp_component.py rename to src/c3/create_operator.py index e4eedb94..f9813c96 100644 --- a/src/c3/generate_kfp_component.py +++ b/src/c3/create_operator.py @@ -6,16 +6,19 @@ import argparse from string import Template from io import StringIO -from enum import Enum from pythonscript import Pythonscript from notebook_converter import convert_notebook -from src.templates import component_setup_code, dockerfile_template, kfp_component_template, kubernetes_job_template +from templates import component_setup_code, dockerfile_template, kfp_component_template, kubernetes_job_template CLAIMED_VERSION = 'V0.1' -def generate_component(file_path: str, repository: str, version: str, additional_files: str = None): - +def create_operator(file_path: str, + repository: str, + version: str, + dockerfile_template: str, + additional_files: str = None, + ): logging.info('Parameters: ') logging.info('file_path: ' + file_path) logging.info('repository: ' + repository) @@ -29,7 +32,7 @@ def generate_component(file_path: str, repository: str, version: str, additional target_code = file_path.split('/')[-1] if file_path == target_code: # use temp file for processing - target_code = 'copy_' + target_code + target_code = 'claimed_' + target_code # Copy file to current working directory shutil.copy(file_path, target_code) @@ -52,14 +55,17 @@ def generate_component(file_path: str, repository: str, version: str, additional outputs = py.get_outputs() requirements = py.get_requirements() else: - print('Please provide a file_path to a jupyter notebook or python script.') - raise NotImplementedError - - print(name) - print(description) - print(inputs) - print(outputs) - print(requirements) + raise NotImplementedError('Please provide a file_path to a jupyter notebook or python script.') + + # Strip 'claimed-' from name of copied temp file + if name.startswith('claimed-'): + name = name[8:] + + logging.info('Operator name: ' + name) + logging.info('Description:: ' + description) + logging.info('Inputs: ' + str(inputs)) + logging.info('Outputs: ' + str(outputs)) + logging.info('Requirements: ' + str(requirements)) if additional_files is not None: if additional_files.startswith('['): @@ -105,36 +111,35 @@ def generate_component(file_path: str, repository: str, version: str, additional additional_files_local=additional_files_local, ) + logging.info('Create Dockerfile') with open("Dockerfile", "w") as text_file: text_file.write(docker_file) - os.system('cat Dockerfile') + logging.info(f'Build and push image to {repository}/claimed-{name}:{version}') os.system(f'docker build --platform=linux/amd64 -t `echo claimed-{name}:{version}` .') os.system(f'docker tag `echo claimed-{name}:{version}` `echo {repository}/claimed-{name}:{version}`') os.system(f'docker tag `echo claimed-{name}:{version}` `echo {repository}/claimed-{name}:latest`') os.system(f'docker push `echo {repository}/claimed-{name}:latest`') os.system(f'docker push `echo {repository}/claimed-{name}:{version}`') - parameter_type = Enum('parameter_type', ['INPUT', 'OUTPUT']) - - def get_component_interface(parameters, type: parameter_type): + def get_component_interface(parameters): template_string = str() for parameter_name, parameter_options in parameters.items(): - default = '' - if parameter_options['default'] is not None and type == parameter_type.INPUT: - default = f", default: {parameter_options['default']}" - template_string += f"- {{name: 
{parameter_name}, type: {parameter_options['type']}, description: {parameter_options['description']}{default}}}" - template_string += '\n' + template_string += f'- {{name: {parameter_name}, type: {parameter_options["type"]}, description: "{parameter_options["description"]}"' + if parameter_options['default'] is not None: + template_string += f', default: {parameter_options["default"]}' + template_string += '}\n' return template_string def get_output_name(): for output_key, output_value in outputs.items(): return output_key + # TODO: Review implementation def get_input_for_implementation(): + t = Template(" - {inputValue: $name}") with StringIO() as inputs_str: for input_key, input_value in inputs.items(): - t = Template(" - {inputValue: $name}") print(t.substitute(name=input_key), file=inputs_str) return inputs_str.getvalue() @@ -152,26 +157,34 @@ def get_parameter_list(): yaml = kfp_component_template.substitute( name=name, description=description, - inputs=get_component_interface(inputs, parameter_type.INPUT), - container_uri=f"{repository}/claimed-{name}", + repository=repository, version=version, - outputPath=get_output_name(), - input_for_implementation=get_input_for_implementation(), + inputs=get_component_interface(inputs), + outputs=get_component_interface(outputs), call=f'./{target_code} {get_parameter_list()}', + input_for_implementation=get_input_for_implementation(), ) - print(yaml) + logging.debug('KubeFlow component yaml:') + logging.debug(yaml) target_yaml_path = file_path.replace('.ipynb', '.yaml').replace('.py', '.yaml') + logging.debug(f' Write KubeFlow component yaml to {target_yaml_path}') with open(target_yaml_path, "w") as text_file: text_file.write(yaml) # get environment entries + # TODO: Make it similar to the kfp code env_entries = [] for input_key, _ in inputs.items(): env_entry = f" - name: {input_key}\n value: value_of_{input_key}" env_entries.append(env_entry) env_entries.append('\n') + for output_key, _ in outputs.items(): + env_entry = f" - name: {output_key}\n value: value_of_{output_key}" + env_entries.append(env_entry) + env_entries.append('\n') + # TODO: Is it possible that a component has no inputs? if len(env_entries) != 0: env_entries.pop(-1) @@ -185,12 +198,15 @@ def get_parameter_list(): env_entries=env_entries, ) - print(job_yaml) + logging.debug('Kubernetes job yaml:') + logging.debug(job_yaml) target_job_yaml_path = file_path.replace('.ipynb', '.job.yaml').replace('.py', '.job.yaml') + logging.info(f'Write kubernetes job yaml to {target_job_yaml_path}') with open(target_job_yaml_path, "w") as text_file: text_file.write(job_yaml) + logging.info(f'Remove local files') # remove temporary files if file_path != target_code: os.remove(target_code) @@ -200,19 +216,18 @@ def get_parameter_list(): if __name__ == '__main__': - parser = argparse.ArgumentParser() - parser.add_argument('--file_path', type=str, required=True, + parser.add_argument('-f', '--file_path', type=str, required=True, help='Path to python script or notebook') - parser.add_argument('--repository', type=str, required=True, + parser.add_argument('-r', '--repository', type=str, required=True, help='Container registry address, e.g. 
docker.io/')
-    parser.add_argument('--version', type=str, required=True,
+    parser.add_argument('-v', '--version', type=str, required=True,
                         help='Image version')
-    parser.add_argument('--additional_files', type=str,
+    parser.add_argument('-a', '--additional_files', type=str,
                         help='Comma-separated list of paths to additional files to include in the container image')
-    parser.add_argument('--log_level', type=str, default='INFO')
-
-
+    parser.add_argument('-l', '--log_level', type=str, default='INFO')
+    parser.add_argument('--dockerfile_template_path', type=str, default='',
+                        help='Path to custom dockerfile template')
     args = parser.parse_args()
 
     # Init logging
@@ -224,7 +239,16 @@ def get_parameter_list():
     handler.setLevel(args.log_level)
     root.addHandler(handler)
 
-    generate_component(file_path=args.file_path,
-                       repository=args.repository,
-                       version=args.version,
-                       additional_files=args.additional_files)
+    # Update dockerfile template if specified
+    if args.dockerfile_template_path != '':
+        logging.info(f'Using custom dockerfile template from {args.dockerfile_template_path}')
+        with open(args.dockerfile_template_path, 'r') as f:
+            dockerfile_template = Template(f.read())
+
+    create_operator(
+        file_path=args.file_path,
+        repository=args.repository,
+        version=args.version,
+        dockerfile_template=dockerfile_template,
+        additional_files=args.additional_files
+    )
diff --git a/src/c3/parser.py b/src/c3/parser.py
index 8a130d8f..18fee616 100644
--- a/src/c3/parser.py
+++ b/src/c3/parser.py
@@ -15,10 +15,11 @@
 #
 
 import os
-import nbformat
 import re
 
-from traitlets.config import LoggingConfigurable
+# TODO: Do we need LoggingConfigurable
+# from traitlets.config import LoggingConfigurable
+LoggingConfigurable = object
 
 from typing import TypeVar, List, Dict
 
@@ -63,13 +64,14 @@ def read_next_code_chunk(self) -> List[str]:
 class NotebookReader(FileReader):
     def __init__(self, filepath: str):
         super().__init__(filepath)
+        import nbformat
 
         with open(self._filepath) as f:
             self._notebook = nbformat.read(f, as_version=4)
             self._language = None
 
             try:
-                self._language = self._notebook['metadata']['kernelspec']['language'].lower()
+                self._language = self._notebook['metadata']['language_info']['name'].lower()
             except KeyError:
                 self.log.warning(f'No language metadata found in {self._filepath}')
 
diff --git a/src/c3/pythonscript.py b/src/c3/pythonscript.py
index 0de400c3..cf20044d 100644
--- a/src/c3/pythonscript.py
+++ b/src/c3/pythonscript.py
@@ -13,7 +13,7 @@ def __init__(self, path, function_name: str = None):
             self.script = f.read()
         self.name = os.path.basename(path)[:-3].replace('_', '-')
 
-        assert '"""' in self.script, 'Please provide a description of the operator inside the first doc string.'
+        assert '"""' in self.script, 'Please provide a description of the operator in the first doc string.'
self.description = self.script.split('"""')[1].strip() self.envs = self._get_env_vars() @@ -31,11 +31,6 @@ def _get_env_vars(self): comment_line = '' if comment_line == '': logging.info(f'Interface: No description for variable {env_name} provided.') - if ',' in comment_line: - logging.info( - f"Interface: comment line for variable {env_name} contains commas which will be deleted.") - comment_line = comment_line.replace(',', '') - if "int(" in line: type = 'Integer' elif "float(" in line: @@ -45,11 +40,11 @@ def _get_env_vars(self): else: type = 'String' if ',' in line: - default = line.split(',')[1].split(')')[0] + default = line.split(',', 1)[1].rstrip(') ').strip().replace("\"", "\'") else: default = None return_value[env_name] = { - 'description': comment_line.replace('#', '').strip(), + 'description': comment_line.replace('#', '').replace("\"", "\'").strip(), 'type': type, 'default': default } diff --git a/src/templates/__init__.py b/src/templates/__init__.py index e80048fb..c575d341 100644 --- a/src/templates/__init__.py +++ b/src/templates/__init__.py @@ -15,10 +15,10 @@ template_path = Path(os.path.dirname(__file__)) with open(template_path / COMPONENT_SETUP_CODE, 'r') as f: - component_setup_code = Template(f.read()) + component_setup_code = f.read() with open(template_path / GW_COMPONENT_SETUP_CODE, 'r') as f: - gw_component_setup_code = Template(f.read()) + gw_component_setup_code = f.read() with open(template_path / DOCKERFILE_FILE, 'r') as f: dockerfile_template = Template(f.read()) diff --git a/src/templates/grid_wrapper_template.py b/src/templates/grid_wrapper_template.py index a69d52aa..54d29597 100644 --- a/src/templates/grid_wrapper_template.py +++ b/src/templates/grid_wrapper_template.py @@ -31,6 +31,8 @@ gw_lock_file_suffix = os.environ.get('gw_lock_file_suffix', '.lock') # processed file suffix gw_processed_file_suffix = os.environ.get('gw_lock_file_suffix', '.processed') +# error file suffix +gw_error_file_suffix = os.environ.get('gw_error_file_suffix', '.err') # timeout in seconds to remove lock file from struggling job (default 1 hour) gw_lock_timeout = int(os.environ.get('gw_lock_timeout', 3600)) @@ -71,7 +73,8 @@ def identify_batches_from_pattern(file_path_patterns, group_by): # get batches by applying the group by function to all file paths for path_string in all_files: - exec('part = path_string' + group_by) + part = eval('str(path_string)' + group_by, {"group_by": group_by, "path_string": path_string}) + assert part != '', f'Could not extract batch with path_string {path_string} and group_by {group_by}' batches.add(part) logging.info(f'Identified {len(batches)} batches') @@ -85,6 +88,7 @@ def perform_process(process, batch): logging.debug(f'Check coordinator files for batch {batch}.') # init coordinator files lock_file = Path(gw_coordinator_path) / (batch + gw_lock_file_suffix) + error_file = Path(gw_coordinator_path) / (batch + gw_error_file_suffix) processed_file = Path(gw_coordinator_path) / (batch + gw_processed_file_suffix) if lock_file.exists(): @@ -100,24 +104,34 @@ def perform_process(process, batch): logging.debug(f'Batch {batch} is processed.') return + if error_file.exists(): + logging.debug(f'Batch {batch} has error.') + return + logging.debug(f'Locking batch {batch}.') + lock_file.parent.mkdir(parents=True, exist_ok=True) lock_file.touch() # processing files with custom process logging.info(f'Processing batch {batch}.') try: target_files = process(batch, ${component_inputs}) - except Exception as e: - # Remove lock file before raising the error + 
except Exception as err: + logging.error(f'{type(err).__name__} in batch {batch}: {err}') + # Write error to file + with open(error_file, 'w') as f: + f.write(f"{type(err).__name__} in batch {batch}: {err}") lock_file.unlink() - raise e + logging.error(f'Continue processing.') + return # optional verify target files if target_files is not None: if isinstance(target_files, str): target_files = [target_files] for target_file in target_files: - assert os.path.exists(target_file), f'Target file {target_file} does not exist for batch {batch}.' + if not os.path.exists(target_file): + logging.error(f'Target file {target_file} does not exist for batch {batch}.') else: logging.info(f'Cannot verify batch {batch} (target files not provided).') @@ -142,7 +156,11 @@ def process_wrapper(sub_process, pre_process=None, post_process=None): # wait until preprocessing is finished processed_file = Path(gw_coordinator_path) / ('preprocess' + gw_processed_file_suffix) + error_file = Path(gw_coordinator_path) / ('preprocess' + gw_error_file_suffix) while not processed_file.exists(): + if error_file.exists(): + logging.error('Error in preprocessing. See error file in coordinator path.') + exit(1) logging.info(f'Waiting for preprocessing to finish.') time.sleep(60) @@ -163,15 +181,17 @@ def process_wrapper(sub_process, pre_process=None, post_process=None): processed_status = [(Path(gw_coordinator_path) / (batch + gw_processed_file_suffix)).exists() for batch in batches] lock_status = [(Path(gw_coordinator_path) / (batch + gw_lock_file_suffix)).exists() for batch in batches] if all(processed_status): - if post_process is not None: - # run postprocessing - perform_process(post_process, 'postprocess') - logging.info('Finished all processes.') else: logging.info(f'Finished current process. Status batches: ' f'{sum(processed_status)} processed / {sum(lock_status)} locked / {len(processed_status)} total') + # Check for errors + error_status = [(Path(gw_coordinator_path) / (batch + gw_error_file_suffix)).exists() + for batch in list(batches) + ['preprocess', 'postprocess']] + if sum(error_status): + logging.error(f'Found {sum(error_status)} errors. 
See error files in coordinator path.')
+
 
 
 if __name__ == '__main__':
     process_wrapper(${component_process}, ${component_pre_process}, ${component_post_process})
diff --git a/src/templates/gw_component_setup_code.py b/src/templates/gw_component_setup_code.py
index 205f07ac..e8b67a7b 100644
--- a/src/templates/gw_component_setup_code.py
+++ b/src/templates/gw_component_setup_code.py
@@ -14,4 +14,4 @@
     variable = parameter.split('=')[0]
     value = parameter.split('=', 1)[-1]
     logging.debug(f'Parameter: {variable} = "{value}"')
-    os.environ[variable] = value
\ No newline at end of file
+    os.environ[variable] = value
diff --git a/src/templates/kfp_component_template.yaml b/src/templates/kfp_component_template.yaml
index be12ebb7..56d8c74b 100644
--- a/src/templates/kfp_component_template.yaml
+++ b/src/templates/kfp_component_template.yaml
@@ -1,15 +1,18 @@
-name: $name
-description: $description
+name: ${name}
+description: ${description}
 
 inputs:
-$inputs
+${inputs}
+
+outputs:
+${outputs}
 
 implementation:
     container:
-        image: $container_uri:$version
+        image: ${repository}/claimed-${name}:${version}
         command:
         - sh
         - -ec
         - |
-          python $call
-$input_for_implementation
\ No newline at end of file
+          python ${call}
+${input_for_implementation}
\ No newline at end of file

From 42b356acf21313ec18723eb216e188fe24d2bf96 Mon Sep 17 00:00:00 2001
From: Benedikt Blumenstiel
Date: Tue, 10 Oct 2023 13:37:52 +0200
Subject: [PATCH 065/177] Removed pre- and post-process from grid_wrapper

---
 src/c3/create_grid_wrapper.py          | 16 ++----------
 src/templates/grid_wrapper_template.py | 34 ++++++--------------------
 2 files changed, 10 insertions(+), 40 deletions(-)

diff --git a/src/c3/create_grid_wrapper.py b/src/c3/create_grid_wrapper.py
index 0efb910d..183cc998 100644
--- a/src/c3/create_grid_wrapper.py
+++ b/src/c3/create_grid_wrapper.py
@@ -15,8 +15,6 @@ def wrap_component(component_path,
                    component_interface,
                    component_inputs,
                    component_process,
-                   component_pre_process,
-                   component_post_process,
                    ):
     # get component name from path
     component_name = os.path.splitext(os.path.basename(component_path))[0]
@@ -28,8 +26,6 @@ def wrap_component(component_path,
         component_inputs=component_inputs,
         component_interface=component_interface,
         component_process=component_process,
-        component_pre_process=component_pre_process,
-        component_post_process=component_post_process,
     )
 
     # Write edited code to file
@@ -108,8 +104,7 @@ def edit_component_code(file_path):
     return target_file
 
 
-def apply_grid_wrapper(file_path, component_process, component_pre_process, component_post_process,
-                       *args, **kwargs):
+def apply_grid_wrapper(file_path, component_process, *args, **kwargs):
     assert file_path.endswith('.py') or file_path.endswith('.ipynb'), \
         "Please provide a component file path to a python script or notebook."
@@ -123,9 +118,7 @@ def apply_grid_wrapper(file_path, component_process, component_pre_process, comp component_dependencies=dependencies, component_interface=interface, component_inputs=inputs, - component_process=component_process, - component_pre_process=component_pre_process, - component_post_process=component_post_process, + component_process=component_process ) logging.debug('Wrap component with parameters:') @@ -143,11 +136,6 @@ def apply_grid_wrapper(file_path, component_process, component_pre_process, comp help='Path to python script or notebook') parser.add_argument('-p', '--component_process', type=str, required=True, help='Name of the component sub process that is executed for each batch.') - parser.add_argument('-pre', '--component_pre_process', type=str, - help='Name of the component pre process which is executed once before parallelization.') - parser.add_argument('-post', '--component_post_process', type=str, - help='Name of the component post process which is executed once after parallelization.') - parser.add_argument('-r', '--repository', type=str, help='Container registry address, e.g. docker.io/') parser.add_argument('-v', '--version', type=str, diff --git a/src/templates/grid_wrapper_template.py b/src/templates/grid_wrapper_template.py index 54d29597..bfe4009a 100644 --- a/src/templates/grid_wrapper_template.py +++ b/src/templates/grid_wrapper_template.py @@ -142,7 +142,7 @@ def perform_process(process, batch): lock_file.unlink() -def process_wrapper(sub_process, pre_process=None, post_process=None): +def process_wrapper(sub_process): delay = random.randint(1, 60) logging.info(f'Staggering start, waiting for {delay} seconds') time.sleep(delay) @@ -150,20 +150,6 @@ def process_wrapper(sub_process, pre_process=None, post_process=None): # Init coordinator dir Path(gw_coordinator_path).mkdir(exist_ok=True, parents=True) - # run preprocessing - if pre_process is not None: - perform_process(pre_process, 'preprocess') - - # wait until preprocessing is finished - processed_file = Path(gw_coordinator_path) / ('preprocess' + gw_processed_file_suffix) - error_file = Path(gw_coordinator_path) / ('preprocess' + gw_error_file_suffix) - while not processed_file.exists(): - if error_file.exists(): - logging.error('Error in preprocessing. See error file in coordinator path.') - exit(1) - logging.info(f'Waiting for preprocessing to finish.') - time.sleep(60) - # get batches if gw_batch_file is not None and os.path.isfile(gw_batch_file): batches = load_batches_from_file(gw_batch_file) @@ -177,21 +163,17 @@ def process_wrapper(sub_process, pre_process=None, post_process=None): for batch in batches: perform_process(sub_process, batch) - # Check if all batches are processed + # Check and log status of batches processed_status = [(Path(gw_coordinator_path) / (batch + gw_processed_file_suffix)).exists() for batch in batches] lock_status = [(Path(gw_coordinator_path) / (batch + gw_lock_file_suffix)).exists() for batch in batches] - if all(processed_status): - logging.info('Finished all processes.') - else: - logging.info(f'Finished current process. Status batches: ' - f'{sum(processed_status)} processed / {sum(lock_status)} locked / {len(processed_status)} total') + error_status = [(Path(gw_coordinator_path) / (batch + gw_error_file_suffix)).exists() for batch in batches] + + logging.info(f'Finished current process. 
Status batches: '
+                 f'{sum(processed_status)} processed / {sum(lock_status)} locked / {sum(error_status)} errors / {len(processed_status)} total')
 
-        # Check for errors
-        error_status = [(Path(gw_coordinator_path) / (batch + gw_error_file_suffix)).exists()
-                        for batch in list(batches) + ['preprocess', 'postprocess']]
     if sum(error_status):
-        logging.error(f'Found {sum(error_status)} errors. See error files in coordinator path.')
+        logging.error(f'Found errors. See error files in coordinator path.')
 
 
 if __name__ == '__main__':
-    process_wrapper(${component_process}, ${component_pre_process}, ${component_post_process})
+    process_wrapper(${component_process})

From dc64611bcd9d8a362ed5c7a9d27ff33a0c0db28b Mon Sep 17 00:00:00 2001
From: Benedikt Blumenstiel
Date: Tue, 10 Oct 2023 18:11:10 +0200
Subject: [PATCH 066/177] Added cos grid wrapper

---
 src/c3/create_grid_wrapper.py              |  15 +-
 src/templates/cos_grid_wrapper_template.py | 293 +++++++++++++++++++++
 2 files changed, 303 insertions(+), 5 deletions(-)
 create mode 100644 src/templates/cos_grid_wrapper_template.py

diff --git a/src/c3/create_grid_wrapper.py b/src/c3/create_grid_wrapper.py
index 183cc998..feb7a369 100644
--- a/src/c3/create_grid_wrapper.py
+++ b/src/c3/create_grid_wrapper.py
@@ -6,7 +6,7 @@
 from pythonscript import Pythonscript
 from notebook_converter import convert_notebook
 from create_operator import create_operator
-from templates import grid_wrapper_template, gw_component_setup_code, dockerfile_template
+from templates import grid_wrapper_template, cos_grid_wrapper_template, gw_component_setup_code, dockerfile_template
 
 
 def wrap_component(component_path,
@@ -15,11 +15,13 @@ def wrap_component(component_path,
                    component_interface,
                    component_inputs,
                    component_process,
+                   cos,
                    ):
     # get component name from path
     component_name = os.path.splitext(os.path.basename(component_path))[0]
 
-    grid_wrapper_code = grid_wrapper_template.substitute(
+    gw_template = cos_grid_wrapper_template if cos else grid_wrapper_template
+    grid_wrapper_code = gw_template.substitute(
         component_name=component_name,
         component_description=component_description,
         component_dependencies=component_dependencies,
@@ -29,7 +31,8 @@ def wrap_component(component_path,
     )
 
     # Write edited code to file
-    grid_wrapper_file_path = os.path.join(os.path.dirname(component_path), f'gw_{component_name}.py')
+    grid_wrapper_file = f'cgw_{component_name}.py' if cos else f'gw_{component_name}.py'
+    grid_wrapper_file_path = os.path.join(os.path.dirname(component_path), grid_wrapper_file)
     # remove 'component_' from gw path
     grid_wrapper_file_path = grid_wrapper_file_path.replace('component_', '')
     with open(grid_wrapper_file_path, 'w') as f:
@@ -104,7 +107,7 @@ def edit_component_code(file_path):
     return target_file
 
 
-def apply_grid_wrapper(file_path, component_process, *args, **kwargs):
+def apply_grid_wrapper(file_path, component_process, cos, *args, **kwargs):
     assert file_path.endswith('.py') or file_path.endswith('.ipynb'), \
         "Please provide a component file path to a python script or notebook."
@@ -126,7 +129,7 @@ def apply_grid_wrapper(file_path, component_process, *args, **kwargs): logging.debug(component + ':\n' + str(value) + '\n') logging.info('Wrap component') - grid_wrapper_file_path = wrap_component(**component_elements) + grid_wrapper_file_path = wrap_component(cos=cos, **component_elements) return grid_wrapper_file_path, file_path @@ -136,6 +139,8 @@ def apply_grid_wrapper(file_path, component_process, *args, **kwargs): help='Path to python script or notebook') parser.add_argument('-p', '--component_process', type=str, required=True, help='Name of the component sub process that is executed for each batch.') + parser.add_argument('-cos', action=argparse.BooleanOptionalAction, default=False, + help='Creates a grid wrapper for processing COS files') parser.add_argument('-r', '--repository', type=str, help='Container registry address, e.g. docker.io/') parser.add_argument('-v', '--version', type=str, diff --git a/src/templates/cos_grid_wrapper_template.py b/src/templates/cos_grid_wrapper_template.py new file mode 100644 index 00000000..32b2af70 --- /dev/null +++ b/src/templates/cos_grid_wrapper_template.py @@ -0,0 +1,293 @@ +""" +${component_name} got wrapped by cos_grid_wrapper, which wraps any CLAIMED component and implements the generic grid computing pattern for cos files https://romeokienzler.medium.com/the-generic-grid-computing-pattern-transforms-any-sequential-workflow-step-into-a-transient-grid-c7f3ca7459c8 + +CLAIMED component description: ${component_description} +""" + +# pip install s3fs +# component dependencies +# ${component_dependencies} + +import os +import json +import random +import logging +import shutil +import time +import glob +import s3fs +from datetime import datetime +from pathlib import Path + +# import component code +from ${component_name} import * + + +# File containing batches. Provided as a comma-separated list of strings or keys in a json dict. All batch file names must contain the batch name. +gw_batch_file = os.environ.get('gw_batch_file', None) +# file path pattern like your/path/**/*.tif. Multiple patterns can be separated with commas. It is ignored if gw_batch_file is provided. +gw_file_path_pattern = os.environ.get('gw_file_path_pattern', None) +# pattern for grouping file paths into batches like ".split('.')[-2]". It is ignored if gw_batch_file is provided. 
+gw_group_by = os.environ.get('gw_group_by', None) + +# comma-separated list of additional cos files to copy +gw_additional_source_files = os.environ.get('gw_additional_source_files', '') +# download source cos files to local input path +gw_local_input_path = os.environ.get('gw_local_input_path', 'input') +# upload local target files to target cos path +gw_local_target_path = os.environ.get('gw_local_target_path', 'target') + +# cos source_access_key_id +gw_source_access_key_id = os.environ.get('gw_source_access_key_id') +# cos source_secret_access_key +gw_source_secret_access_key = os.environ.get('gw_source_secret_access_key') +# cos source_endpoint +gw_source_endpoint = os.environ.get('gw_source_endpoint') +# cos source_bucket +gw_source_bucket = os.environ.get('gw_source_bucket') + +# cos target_access_key_id (uses source s3 if not provided) +gw_target_access_key_id = os.environ.get('gw_target_access_key_id', None) +# cos target_secret_access_key (uses source s3 if not provided) +gw_target_secret_access_key = os.environ.get('gw_target_secret_access_key', None) +# cos target_endpoint (uses source s3 if not provided) +gw_target_endpoint = os.environ.get('gw_target_endpoint', None) +# cos target_bucket (uses source s3 if not provided) +gw_target_bucket = os.environ.get('gw_target_bucket', None) +# cos target_path +gw_target_path = os.environ.get('gw_target_path') + +# cos coordinator_access_key_id (uses source s3 if not provided) +gw_coordinator_access_key_id = os.environ.get('gw_coordinator_access_key_id', None) +# cos coordinator_secret_access_key (uses source s3 if not provided) +gw_coordinator_secret_access_key = os.environ.get('gw_coordinator_secret_access_key', None) +# cos coordinator_endpoint (uses source s3 if not provided) +gw_coordinator_endpoint = os.environ.get('gw_coordinator_endpoint', None) +# cos coordinator_bucket (uses source s3 if not provided) +gw_coordinator_bucket = os.environ.get('gw_coordinator_bucket', None) +# cos path to grid wrapper coordinator directory +gw_coordinator_path = os.environ.get('gw_coordinator_path') +# lock file suffix +gw_lock_file_suffix = os.environ.get('gw_lock_file_suffix', '.lock') +# processed file suffix +gw_processed_file_suffix = os.environ.get('gw_lock_file_suffix', '.processed') +# error file suffix +gw_error_file_suffix = os.environ.get('gw_error_file_suffix', '.err') +# timeout in seconds to remove lock file from struggling job (default 1 hour) +gw_lock_timeout = int(os.environ.get('gw_lock_timeout', 3600)) + + +# component interface +${component_interface} + +# init s3 +s3source = s3fs.S3FileSystem( + anon=False, + key=gw_source_access_key_id, + secret=gw_source_secret_access_key, + client_kwargs={'endpoint_url': gw_source_endpoint}) + +if gw_target_endpoint is not None: + s3target = s3fs.S3FileSystem( + anon=False, + key=gw_target_access_key_id, + secret=gw_target_secret_access_key, + client_kwargs={'endpoint_url': gw_target_endpoint}) +else: + logging.debug('Using source bucket as target bucket.') + gw_target_bucket = gw_source_bucket + s3target = s3source + +if gw_coordinator_bucket is not None: + s3coordinator = s3fs.S3FileSystem( + anon=False, + key=gw_coordinator_access_key_id, + secret=gw_coordinator_secret_access_key, + client_kwargs={'endpoint_url': gw_coordinator_endpoint}) +else: + logging.debug('Using source bucket as coordinator bucket.') + gw_coordinator_bucket = gw_source_bucket + s3coordinator = s3source + +def load_batches_from_file(batch_file): + if batch_file.endswith('.json'): + # load batches from keys of a 
json file + logging.info(f'Loading batches from json file: {batch_file}') + # TODO: Test open + with s3source.open(f'{gw_source_bucket}/{batch_file}', 'r') as f: + batch_dict = json.load(f) + batches = batch_dict.keys() + + else: + # Load batches from comma-separated txt file + logging.info(f'Loading comma-separated batch strings from file: {batch_file}') + # TODO: Test open + with s3source.open(f'{gw_source_bucket}/{batch_file}', 'r') as f: + batch_string = f.read() + batches = [b.strip() for b in batch_string.split(',')] + + logging.info(f'Loaded {len(batches)} batches') + logging.debug(f'List of batches: {batches}') + assert len(batches) > 0, f"batch_file {batch_file} has no batches." + return batches + + +def get_files_from_pattern(file_path_patterns): + logging.info(f'Start identifying files') + all_files = [] + + # Iterate over comma-separated paths + for file_path_pattern in file_path_patterns.split(','): + logging.info(f'Get file paths from pattern: {file_path_pattern}') + files = s3source.glob(f'{gw_source_bucket}/{file_path_pattern.strip()}') + assert len(files) > 0, f"Found no files with file_path_pattern {file_path_pattern}." + all_files.extend(files) + logging.info(f'Found {len(all_files)} cos files') + return all_files + +def identify_batches_from_pattern(file_path_patterns, group_by): + logging.info(f'Start identifying files and batches') + batches = set() + all_files = get_files_from_pattern(file_path_patterns) + + # get batches by applying the group by function to all file paths + for path_string in all_files: + part = eval('str(path_string)' + group_by, {"group_by": group_by, "path_string": path_string}) + assert part != '', f'Could not extract batch with path_string {path_string} and group_by {group_by}' + batches.add(part) + + logging.info(f'Identified {len(batches)} batches') + logging.debug(f'List of batches: {batches}') + + return batches, all_files + + +def perform_process(process, batch, cos_files): + logging.debug(f'Check coordinator files for batch {batch}.') + # init coordinator files + lock_file = Path(gw_coordinator_path) / (batch + gw_lock_file_suffix) + error_file = Path(gw_coordinator_path) / (batch + gw_error_file_suffix) + processed_file = Path(gw_coordinator_path) / (batch + gw_processed_file_suffix) + + # TODO: Check if lock_file etc. must be string + if s3coordinator.exists(lock_file): + # remove strugglers + last_modified = s3coordinator.info(lock_file)['LastModified'] + # TODO: Check and use function from time instead of datetime + if datetime.now(last_modified.tzinfo) < time.time() - gw_lock_timeout: + logging.info(f'Lock file {lock_file} is expired.') + s3coordinator.rm(lock_file) + else: + logging.debug(f'Batch {batch} is locked.') + return + + if s3coordinator.exists(processed_file): + logging.debug(f'Batch {batch} is processed.') + return + + if s3coordinator.exists(error_file): + logging.debug(f'Batch {batch} has error.') + return + + logging.debug(f'Locking batch {batch}.') + s3coordinator.makedirs(lock_file.parent, exist_ok=True) + s3coordinator.touch(lock_file) + logging.info(f'Processing batch {batch}.') + + # Create input and target directories + input_path = Path(gw_local_input_path) + target_path = Path(gw_local_target_path) + assert not input_path.exists(), (f'gw_local_input_path ({gw_local_input_path}) already exists. ' + f'Please provide a new input path.') + assert not target_path.exists(), (f'gw_local_target_path ({gw_local_target_path}) already exists. 
' + f'Please provide a new target path.') + input_path.mkdir(parents=True) + target_path.mkdir(parents=True) + + # Download cos files to local input folder + batch_fileset = list(filter(lambda file: batch in file, cos_files)) + if gw_additional_source_files != '': + additional_source_files = [f.strip() for f in gw_additional_source_files.split(',')] + batch_fileset.extend(additional_source_files) + logging.info(f'Downloading {len(batch_fileset)} files from COS') + for cos_file in batch_fileset: + local_file = str(input_path / cos_file.split('/', 1)[-1]) + logging.debug(f'Downloading {cos_file} to {local_file}') + s3source.get(cos_file, local_file) + + # processing files with custom process + try: + target_files = process(batch, ${component_inputs}) + except Exception as err: + logging.error(f'{type(err).__name__} in batch {batch}: {err}') + # Write error to file + with open(error_file, 'w') as f: + f.write(f"{type(err).__name__} in batch {batch}: {err}") + lock_file.unlink() + logging.error(f'Continue processing.') + return + + # optional verify target files + if target_files is not None: + if isinstance(target_files, str): + target_files = [target_files] + for target_file in target_files: + if not os.path.exists(target_file): + logging.error(f'Target file {target_file} does not exist for batch {batch}.') + else: + logging.info(f'Cannot verify batch {batch} (target files not provided). Using files in target_path.') + target_files = glob.glob(target_path) + + logging.info(f'Uploading {len(target_files)} target files to COS.') + target_path_depth = len(target_path.split('/')) + for local_file in target_files: + cos_file = f'{gw_target_bucket}/{gw_target_path}/{local_file.split("/", target_path_depth)[-1]}' + logging.debug(f'Uploading {local_file} to {cos_file}') + s3target.put(local_file, cos_file) + + logging.info(f'Remove local input and target files.') + shutil.rmtree(input_path) + shutil.rmtree(target_path) + + logging.info(f'Finished Batch {batch}.') + s3coordinator.touch(processed_file) + # Remove lock file + s3coordinator.rm(lock_file) + + +def process_wrapper(sub_process): + delay = random.randint(1, 60) + logging.info(f'Staggering start, waiting for {delay} seconds') + time.sleep(delay) + + # Init coordinator dir + Path(gw_coordinator_path).mkdir(exist_ok=True, parents=True) + + # get batches + if gw_batch_file is not None and os.path.isfile(gw_batch_file): + batches = load_batches_from_file(gw_batch_file) + cos_files = get_files_from_pattern(gw_file_path_pattern) + elif gw_file_path_pattern is not None and gw_group_by is not None: + batches, cos_files = identify_batches_from_pattern(gw_file_path_pattern, gw_group_by) + else: + raise ValueError("Cannot identify batches. " + "Provide valid gw_batch_file or gw_file_path_pattern and gw_group_by.") + + # Iterate over all batches + for batch in batches: + perform_process(sub_process, batch, cos_files) + + # Check and log status of batches + processed_status = [(Path(gw_coordinator_path) / (batch + gw_processed_file_suffix)).exists() for batch in batches] + lock_status = [(Path(gw_coordinator_path) / (batch + gw_lock_file_suffix)).exists() for batch in batches] + error_status = [(Path(gw_coordinator_path) / (batch + gw_error_file_suffix)).exists() for batch in batches] + + logging.info(f'Finished current process. Status batches: ' + f'{sum(processed_status)} processed / {sum(lock_status)} locked / {sum(error_status)} errors / {len(processed_status)} total') + + if sum(error_status): + logging.error(f'Found errors. 
See error files in coordinator path.') + + +if __name__ == '__main__': + process_wrapper(${component_process}) From a5d810a4576256e5931696740d58a52e9b967d10 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Tue, 10 Oct 2023 18:45:03 +0200 Subject: [PATCH 067/177] fixed additional files error --- src/c3/create_operator.py | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/src/c3/create_operator.py b/src/c3/create_operator.py index f9813c96..48e16c39 100644 --- a/src/c3/create_operator.py +++ b/src/c3/create_operator.py @@ -8,6 +8,9 @@ from io import StringIO from pythonscript import Pythonscript from notebook_converter import convert_notebook + +# Update sys path to load templates +sys.path.append(os.path.dirname(os.path.dirname(os.path.realpath(__file__)))) from templates import component_setup_code, dockerfile_template, kfp_component_template, kubernetes_job_template CLAIMED_VERSION = 'V0.1' @@ -81,11 +84,11 @@ def create_operator(file_path: str, print(os.listdir(additional_files_path)) else: additional_files_local = additional_files.split('/')[-1:][0] - shutil.copy(additional_files, additional_files_local) - # ensure the original file is not deleted later if additional_files != additional_files_local: + shutil.copy(additional_files, additional_files_local) additional_files_path = additional_files_local else: + # ensure the original file is not deleted later additional_files_path = None else: additional_files_local = target_code # hack From 6787e917cc02d96343126d978ae855fc017fd057 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Thu, 12 Oct 2023 10:46:04 +0200 Subject: [PATCH 068/177] Updated error handling in grid wrapper and fixed cos processing --- src/templates/__init__.py | 4 ++ src/templates/cos_grid_wrapper_template.py | 66 +++++++++++++--------- src/templates/grid_wrapper_template.py | 39 +++++++++---- 3 files changed, 70 insertions(+), 39 deletions(-) diff --git a/src/templates/__init__.py b/src/templates/__init__.py index c575d341..626002de 100644 --- a/src/templates/__init__.py +++ b/src/templates/__init__.py @@ -10,6 +10,7 @@ KFP_COMPONENT_FILE = 'kfp_component_template.yaml' KUBERNETES_JOB_FILE = 'kubernetes_job_template.job.yaml' GRID_WRAPPER_FILE = 'grid_wrapper_template.py' +COS_GRID_WRAPPER_FILE = 'cos_grid_wrapper_template.py' # load templates template_path = Path(os.path.dirname(__file__)) @@ -31,3 +32,6 @@ with open(template_path / GRID_WRAPPER_FILE, 'r') as f: grid_wrapper_template = Template(f.read()) + +with open(template_path / COS_GRID_WRAPPER_FILE, 'r') as f: + cos_grid_wrapper_template = Template(f.read()) diff --git a/src/templates/cos_grid_wrapper_template.py b/src/templates/cos_grid_wrapper_template.py index 32b2af70..d5ac5f62 100644 --- a/src/templates/cos_grid_wrapper_template.py +++ b/src/templates/cos_grid_wrapper_template.py @@ -75,6 +75,8 @@ gw_error_file_suffix = os.environ.get('gw_error_file_suffix', '.err') # timeout in seconds to remove lock file from struggling job (default 1 hour) gw_lock_timeout = int(os.environ.get('gw_lock_timeout', 3600)) +# ignore error files and rerun batches with errors +gw_ignore_error_files = bool(os.environ.get('gw_ignore_error_files', False)) # component interface @@ -113,16 +115,14 @@ def load_batches_from_file(batch_file): if batch_file.endswith('.json'): # load batches from keys of a json file logging.info(f'Loading batches from json file: {batch_file}') - # TODO: Test open - with s3source.open(f'{gw_source_bucket}/{batch_file}', 'r') as f: + with 
s3source.open(Path(gw_source_bucket) / batch_file, 'r') as f: batch_dict = json.load(f) batches = batch_dict.keys() else: # Load batches from comma-separated txt file logging.info(f'Loading comma-separated batch strings from file: {batch_file}') - # TODO: Test open - with s3source.open(f'{gw_source_bucket}/{batch_file}', 'r') as f: + with s3source.open(Path(gw_source_bucket) / batch_file, 'r') as f: batch_string = f.read() batches = [b.strip() for b in batch_string.split(',')] @@ -139,7 +139,7 @@ def get_files_from_pattern(file_path_patterns): # Iterate over comma-separated paths for file_path_pattern in file_path_patterns.split(','): logging.info(f'Get file paths from pattern: {file_path_pattern}') - files = s3source.glob(f'{gw_source_bucket}/{file_path_pattern.strip()}') + files = s3source.glob(str(Path(gw_source_bucket) / file_path_pattern.strip())) assert len(files) > 0, f"Found no files with file_path_pattern {file_path_pattern}." all_files.extend(files) logging.info(f'Found {len(all_files)} cos files') @@ -165,16 +165,15 @@ def identify_batches_from_pattern(file_path_patterns, group_by): def perform_process(process, batch, cos_files): logging.debug(f'Check coordinator files for batch {batch}.') # init coordinator files - lock_file = Path(gw_coordinator_path) / (batch + gw_lock_file_suffix) - error_file = Path(gw_coordinator_path) / (batch + gw_error_file_suffix) - processed_file = Path(gw_coordinator_path) / (batch + gw_processed_file_suffix) + coordinator_dir = Path(gw_coordinator_bucket) / gw_coordinator_path + lock_file = str(coordinator_dir / (batch + gw_lock_file_suffix)) + processed_file = str(coordinator_dir / (batch + gw_processed_file_suffix)) + error_file = str(coordinator_dir / (batch + gw_error_file_suffix)) - # TODO: Check if lock_file etc. must be string if s3coordinator.exists(lock_file): # remove strugglers last_modified = s3coordinator.info(lock_file)['LastModified'] - # TODO: Check and use function from time instead of datetime - if datetime.now(last_modified.tzinfo) < time.time() - gw_lock_timeout: + if (datetime.now(last_modified.tzinfo) - last_modified).total_seconds() > gw_lock_timeout: logging.info(f'Lock file {lock_file} is expired.') s3coordinator.rm(lock_file) else: @@ -186,11 +185,13 @@ def perform_process(process, batch, cos_files): return if s3coordinator.exists(error_file): - logging.debug(f'Batch {batch} has error.') - return + if gw_ignore_error_files: + logging.info(f'Ignoring previous error in batch {batch} and rerun.') + else: + logging.debug(f'Batch {batch} has error.') + return logging.debug(f'Locking batch {batch}.') - s3coordinator.makedirs(lock_file.parent, exist_ok=True) s3coordinator.touch(lock_file) logging.info(f'Processing batch {batch}.') @@ -221,7 +222,7 @@ def perform_process(process, batch, cos_files): except Exception as err: logging.error(f'{type(err).__name__} in batch {batch}: {err}') # Write error to file - with open(error_file, 'w') as f: + with s3coordinator.open(error_file, 'w') as f: f.write(f"{type(err).__name__} in batch {batch}: {err}") lock_file.unlink() logging.error(f'Continue processing.') @@ -234,16 +235,18 @@ def perform_process(process, batch, cos_files): for target_file in target_files: if not os.path.exists(target_file): logging.error(f'Target file {target_file} does not exist for batch {batch}.') + if any([not t.starts_with(gw_local_target_path) for t in target_files]): + logging.warning('Some target files are not in target path. 
Only files in target path are uploaded.')
     else:
         logging.info(f'Cannot verify batch {batch} (target files not provided). Using files in target_path.')
-        target_files = glob.glob(target_path)
 
-    logging.info(f'Uploading {len(target_files)} target files to COS.')
-    target_path_depth = len(target_path.split('/'))
-    for local_file in target_files:
-        cos_file = f'{gw_target_bucket}/{gw_target_path}/{local_file.split("/", target_path_depth)[-1]}'
+    # upload files in target path
+    local_target_files = list(target_path.glob('*'))
+    logging.info(f'Uploading {len(local_target_files)} target files to COS.')
+    for local_file in local_target_files:
+        cos_file = Path(gw_target_bucket) / gw_target_path / local_file.relative_to(target_path)
         logging.debug(f'Uploading {local_file} to {cos_file}')
-        s3target.put(local_file, cos_file)
+        s3target.put(str(local_file), str(cos_file))
 
     logging.info(f'Remove local input and target files.')
     shutil.rmtree(input_path)
     shutil.rmtree(target_path)
 
     logging.info(f'Finished Batch {batch}.')
     s3coordinator.touch(processed_file)
     # Remove lock file
-    s3coordinator.rm(lock_file)
+    if s3coordinator.exists(lock_file):
+        s3coordinator.rm(lock_file)
+    else:
+        logging.warning(f'Lock file {lock_file} was removed by another process. '
+                        f'Consider increasing gw_lock_timeout (currently {gw_lock_timeout}s) to avoid repeated processing.')
 
 
 def process_wrapper(sub_process):
@@ -261,7 +268,8 @@ def process_wrapper(sub_process):
     time.sleep(delay)
 
     # Init coordinator dir
-    Path(gw_coordinator_path).mkdir(exist_ok=True, parents=True)
+    coordinator_dir = Path(gw_coordinator_bucket) / gw_coordinator_path
+    s3coordinator.makedirs(coordinator_dir, exist_ok=True)
 
     # get batches
     if gw_batch_file is not None and os.path.isfile(gw_batch_file):
@@ -278,15 +286,19 @@ def process_wrapper(sub_process):
         perform_process(sub_process, batch, cos_files)
 
     # Check and log status of batches
-    processed_status = [(Path(gw_coordinator_path) / (batch + gw_processed_file_suffix)).exists() for batch in batches]
-    lock_status = [(Path(gw_coordinator_path) / (batch + gw_lock_file_suffix)).exists() for batch in batches]
-    error_status = [(Path(gw_coordinator_path) / (batch + gw_error_file_suffix)).exists() for batch in batches]
+    processed_status = [s3coordinator.exists(coordinator_dir / (batch + gw_processed_file_suffix)) for batch in batches]
+    lock_status = [s3coordinator.exists(coordinator_dir / (batch + gw_lock_file_suffix)) for batch in batches]
+    error_status = [s3coordinator.exists(coordinator_dir / (batch + gw_error_file_suffix)) for batch in batches]
 
     logging.info(f'Finished current process. Status batches: '
                  f'{sum(processed_status)} processed / {sum(lock_status)} locked / {sum(error_status)} errors / {len(processed_status)} total')
 
     if sum(error_status):
-        logging.error(f'Found errors. See error files in coordinator path.')
+        logging.error(f'Found errors! 
Resolve errors and rerun operator with gw_ignore_error_files=True.') + # print all error messages + for error_file in s3coordinator.glob(str(coordinator_dir / ('**/*' + gw_error_file_suffix))): + with s3coordinator.open(error_file, 'r') as f: + logging.error(f.read()) if __name__ == '__main__': diff --git a/src/templates/grid_wrapper_template.py b/src/templates/grid_wrapper_template.py index bfe4009a..fd9b989e 100644 --- a/src/templates/grid_wrapper_template.py +++ b/src/templates/grid_wrapper_template.py @@ -35,6 +35,8 @@ gw_error_file_suffix = os.environ.get('gw_error_file_suffix', '.err') # timeout in seconds to remove lock file from struggling job (default 1 hour) gw_lock_timeout = int(os.environ.get('gw_lock_timeout', 3600)) +# ignore error files and rerun batches with errors +gw_ignore_error_files = bool(os.environ.get('gw_ignore_error_files', False)) # component interface ${component_interface} @@ -68,8 +70,9 @@ def identify_batches_from_pattern(file_path_patterns, group_by): # Iterate over comma-separated paths for file_path_pattern in file_path_patterns.split(','): logging.info(f'Get file paths from pattern: {file_path_pattern}') - all_files.extend(glob.glob(file_path_pattern.strip())) - assert len(all_files) > 0, f"Found no files with file_path_patterns {file_path_patterns}." + files = glob.glob(file_path_pattern.strip()) + assert len(files) > 0, f"Found no files with file_path_pattern {file_path_pattern}." + all_files.extend(files) # get batches by applying the group by function to all file paths for path_string in all_files: @@ -79,8 +82,7 @@ def identify_batches_from_pattern(file_path_patterns, group_by): logging.info(f'Identified {len(batches)} batches') logging.debug(f'List of batches: {batches}') - assert len(all_files) > 0, (f"Found batches with group_by {group_by}. " - f"Identified {len(all_files)} files, e.g., {all_files[:10]}.") + return batches @@ -105,8 +107,11 @@ def perform_process(process, batch): return if error_file.exists(): - logging.debug(f'Batch {batch} has error.') - return + if gw_ignore_error_files: + logging.info(f'Ignoring previous error in batch {batch} and rerun.') + else: + logging.debug(f'Batch {batch} has error.') + return logging.debug(f'Locking batch {batch}.') lock_file.parent.mkdir(parents=True, exist_ok=True) @@ -139,7 +144,12 @@ def perform_process(process, batch): processed_file.touch() # Remove lock file - lock_file.unlink() + if lock_file.exists(): + lock_file.unlink() + else: + logging.warning(f'Lock file {lock_file} was removed by another process. 
'
+                        f'Consider increasing gw_lock_timeout (currently {gw_lock_timeout}s) to avoid repeated processing.')
+
 
 
 def process_wrapper(sub_process):
@@ -148,7 +158,8 @@ def process_wrapper(sub_process):
     time.sleep(delay)
 
     # Init coordinator dir
-    Path(gw_coordinator_path).mkdir(exist_ok=True, parents=True)
+    coordinator_dir = Path(gw_coordinator_path)
+    coordinator_dir.mkdir(exist_ok=True, parents=True)
 
     # get batches
     if gw_batch_file is not None and os.path.isfile(gw_batch_file):
@@ -164,15 +175,19 @@ def process_wrapper(sub_process):
         perform_process(sub_process, batch)
 
     # Check and log status of batches
-    processed_status = [(Path(gw_coordinator_path) / (batch + gw_processed_file_suffix)).exists() for batch in batches]
-    lock_status = [(Path(gw_coordinator_path) / (batch + gw_lock_file_suffix)).exists() for batch in batches]
-    error_status = [(Path(gw_coordinator_path) / (batch + gw_error_file_suffix)).exists() for batch in batches]
+    processed_status = [(coordinator_dir / (batch + gw_processed_file_suffix)).exists() for batch in batches]
+    lock_status = [(coordinator_dir / (batch + gw_lock_file_suffix)).exists() for batch in batches]
+    error_status = [(coordinator_dir / (batch + gw_error_file_suffix)).exists() for batch in batches]
 
     logging.info(f'Finished current process. Status batches: '
                  f'{sum(processed_status)} processed / {sum(lock_status)} locked / {sum(error_status)} errors / {len(processed_status)} total')
 
     if sum(error_status):
-        logging.error(f'Found errors. See error files in coordinator path.')
+        logging.error(f'Found errors! Resolve errors and rerun operator with gw_ignore_error_files=True.')
+        # print all error messages
+        for error_file in coordinator_dir.glob('**/*' + gw_error_file_suffix):
+            with open(error_file, 'r') as f:
+                logging.error(f.read())
 
 
 if __name__ == '__main__':

From b41be587c6a3ad78536417c8d15cc073a937a6c9 Mon Sep 17 00:00:00 2001
From: Benedikt Blumenstiel
Date: Fri, 13 Oct 2023 13:25:49 +0200
Subject: [PATCH 069/177] Fixed typo in cos grid wrapper

---
 src/templates/cos_grid_wrapper_template.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/templates/cos_grid_wrapper_template.py b/src/templates/cos_grid_wrapper_template.py
index d5ac5f62..684e4f6c 100644
--- a/src/templates/cos_grid_wrapper_template.py
+++ b/src/templates/cos_grid_wrapper_template.py
@@ -224,7 +224,7 @@ def perform_process(process, batch, cos_files):
         # Write error to file
         with s3coordinator.open(error_file, 'w') as f:
             f.write(f"{type(err).__name__} in batch {batch}: {err}")
-        lock_file.unlink()
+        s3coordinator.rm(lock_file)
         logging.error(f'Continue processing.')
         return
 
@@ -235,7 +235,7 @@ def perform_process(process, batch, cos_files):
         for target_file in target_files:
             if not os.path.exists(target_file):
                 logging.error(f'Target file {target_file} does not exist for batch {batch}.')
-            if any([not t.starts_with(gw_local_target_path) for t in target_files]):
+            if any([not str(t).startswith(gw_local_target_path) for t in target_files]):
                 logging.warning('Some target files are not in target path. Only files in target path are uploaded.')
         else:
             logging.info(f'Cannot verify batch {batch} (target files not provided). 
Using files in target_path.')

From 018ed1acec229f391353449798f37e5fd3490731 Mon Sep 17 00:00:00 2001
From: Benedikt Blumenstiel
Date: Mon, 16 Oct 2023 18:08:45 +0200
Subject: [PATCH 070/177] Updated GettingStarted.md

---
 GettingStarted.md | 556 +++++++++++++++++++++++++++++++++++++++++++++-
 README.md         |   8 +-
 2 files changed, 550 insertions(+), 14 deletions(-)

diff --git a/GettingStarted.md b/GettingStarted.md
index a26691e6..05bab158 100644
--- a/GettingStarted.md
+++ b/GettingStarted.md
@@ -1,37 +1,569 @@
-# Getting started with CLAIMED
+# Getting Started with CLAIMED

The [CLAIMED framework](https://github.com/claimed-framework) enables ease-of-use development and deployment of cloud native data processing applications on Kubernetes using operators and workflows.

A central tool of CLAIMED is the **Claimed Component Compiler (C3)**, which creates a docker image with all dependencies, pushes the container to a registry, and creates a kubernetes-job.yaml as well as a kubeflow-pipeline-component.yaml.
This page explains how to apply operators, combine them into workflows, and how to build them yourself using C3.

## Content

**[1. Apply operators](#1-apply-operators)**

**[2. Operator library](#2-operator-library)**

**[3. Create workflows](#3-create-workflows)**

**[4. Create operators](#4-create-operators)**

**[5. Create grid wrapper](#5-create-grid-wrapper)**

---

## 1. Apply operators

An operator is a single processing step, such as a kubernetes job. You can run an operator via the [CLAIMED CLI](https://github.com/claimed-framework/cli), use it in [workflows](#3-create-workflows), or deploy a kubernetes job using the `job.yaml`, which is explained in the following.

### 1.1 Specify the job

First, update the variable values in the `job.yaml`.
You can delete a variable to use its default value, if one is defined.
The default values are listed in the KubeFlow component `yaml` file.

#### Secrets

You can use key-value secrets for passing credentials to the job. Save the secrets to the cluster and replace the `value: ...` entry with the following pattern in the `job.yaml` (`<variable>`, `<secret-name>`, and `<secret-key>` are placeholders):

```yaml
  containers:
    env:
    - name: <variable>
      valueFrom:
        secretKeyRef:
          name: <secret-name>
          key: <secret-key>

# Example for an access key
  containers:
    env:
    - name: access_key_id
      valueFrom:
        secretKeyRef:
          name: cos-secret
          key: access_key_id
```
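Such a secret can be created once per cluster, for example with `kubectl create secret generic cos-secret --from-literal=access_key_id=<value>`, or programmatically. The following is a minimal sketch using the official Kubernetes Python client (`pip install kubernetes`); the namespace and the literal value are placeholders:

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig (after logging into the cluster).
config.load_kube_config()

# Create the `cos-secret` referenced in the example above; the namespace and
# the literal value are placeholders for this sketch.
secret = client.V1Secret(
    metadata=client.V1ObjectMeta(name='cos-secret'),
    string_data={'access_key_id': '<your-access-key-id>'},
)
client.CoreV1Api().create_namespaced_secret(namespace='<project>', body=secret)
```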
#### Container registry

If the container image is saved in a non-public registry, add an image pull secret to the container specs. Check `image: ...` in the `job.yaml` to find the location of the container image. If it includes a non-public registry like icr.io, you need to provide the image pull secret at the end of the file:

```yaml
  spec:
    containers:
    - name: example-script
      image: icr.io/namespace/claimed-example-script:0.1
      ...:
  imagePullSecrets:
  - name: <image-pull-secret>
```

#### Storage

You can provide access to a Kubernetes/OpenShift persistent volume by specifying it in the `job.yaml`.
OpenShift clusters require specifying the security context on the pod/template spec level.
You get the group ID for the volume from your administrator.
You can use `/opt/app-root/src/<mount_dir>` to mount the volume inside the working directory of the pod.

```yaml
  spec:
    containers:
      ...:
      volumeMounts:
      - name: data
        mountPath: /opt/app-root/src/<mount_dir>
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop:
          - ALL
    volumes:
    - name: data
      persistentVolumeClaim:
        claimName: <claim-name>
    securityContext:
      supplementalGroups: [<group-id>]
```

#### Error handling

If a pod fails, it is restarted by the job until it finishes successfully. You can specify the error handling in the `job.yaml`.
First, `backoffLimit` limits the number of restarts (default: 5). Second, `restartPolicy` defines whether a failed pod is restarted (`OnFailure`) or whether a new pod is created while the failed pod is kept in its error state (`Never`).

```yaml
spec:
  backoffLimit: 1
  template:
    spec:
      ...:
      restartPolicy: Never
```

#### Example

The following is an exemplary `example_script.job.yaml` that includes an `imagePullSecret` and mounts a persistent volume claim from a cluster.
Variables that are not defined use their default values.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-script
spec:
  template:
    spec:
      containers:
      - name: example-script
        image: docker.io/user/claimed-example-script:0.1
        command: ["/opt/app-root/bin/ipython","/opt/app-root/src/example_script.py"]
        env:
        - name: input_path
          value: "data/"
        - name: num_values
          value: "5"
        volumeMounts:
        - name: pvc-data
          mountPath: /opt/app-root/src/data/
      volumes:
      - name: pvc-data
        persistentVolumeClaim:
          claimName: pvc-name
      restartPolicy: OnFailure
      imagePullSecrets:
      - name: user-pull-secret
```


### 1.2 Cluster CLI login

You can start jobs with the `kubectl` (Kubernetes) or `oc` (OpenShift) CLI. If you're using Kubernetes, the login procedure includes multiple steps, which are detailed in the [Kubernetes docs](https://kubernetes.io/docs/tasks/access-application-cluster/access-cluster/).

Logging into an OpenShift cluster is easier. You can use a token, which you can generate via the browser UI, or your username. You might want to add `--insecure-skip-tls-verify` if certificate errors occur.

```sh
# Login via token (Browser login > Your name > Copy login command > Display token)
oc login --token=<token> --server=<server> --insecure-skip-tls-verify

# Login via user name
oc login -u <username>

# Optional: Change default project
oc project <project>
```

### 1.3 Start and manage jobs

After specifying the `job.yaml` and logging into the cluster, you can start or stop a job via the CLI. If you're using an OpenShift cluster, you simply replace `kubectl` with `oc` in the commands.

```sh
# start job
kubectl apply -f <name>.job.yaml

# kill job
kubectl delete -f <name>.job.yaml
```

The job creates a pod which is accessible via the browser UI or via the CLI using the standard kubectl commands.
```sh
# list all pods in the current project
kubectl get pods

# get logs of a pod
kubectl logs -f <pod-name>

# pod description
kubectl describe pod <pod-name>
```
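Jobs can also be submitted and monitored programmatically. The following is a minimal sketch using the official Kubernetes Python client; it assumes the `example_script.job.yaml` from above, a kubeconfig login, and a placeholder namespace:

```python
import yaml
from kubernetes import client, config

# Load credentials from the local kubeconfig (equivalent to being logged in
# for kubectl/oc).
config.load_kube_config()

# Submit the job.yaml, mirroring `kubectl apply -f example_script.job.yaml`.
with open('example_script.job.yaml', 'r') as f:
    job = yaml.safe_load(f)

batch = client.BatchV1Api()
batch.create_namespaced_job(namespace='<project>', body=job)

# Check the job status, similar to `kubectl describe`.
status = batch.read_namespaced_job_status(name='example-script', namespace='<project>')
print(status.status)
```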
---

## 2. Operator library

Reusable code is a key idea of CLAIMED, and operator libraries make it easy to share single processing steps. Because each operator includes a docker image with specified dependencies, operators can be easily reused in different workflows.

Public operators are accessible from the [CLAIMED component library](https://github.com/claimed-framework/component-library/tree/main/component-library).

---

## 3. Create workflows

Multiple operators can be combined into a workflow, e.g., a KubeFlow pipeline. To this end, C3 creates `.yaml` files which define KFP components. After initializing your operators, you can combine them in a pipeline function.

```python
# pip install kfp

import kfp.components as comp
import kfp
import kfp.dsl as dsl

# initialize operator from yaml file
file_op = comp.load_component_from_file('<operator>.yaml')
# initialize operator from remote file
web_op = comp.load_component_from_url('https://raw.githubusercontent.com/claimed-framework/component-library/main/component-library/<operator>.yaml')

@dsl.pipeline(
    name="my_pipeline",
    description="Description",
)
def my_pipeline(
    parameter1: str = "value",
    parameter2: int = 1,
    parameter3: str = "value",
):
    step1 = file_op(
        parameter1=parameter1,
        parameter2=parameter2,
    )

    step2 = web_op(
        parameter1=parameter1,
        parameter3=parameter3,
    )

kfp.compiler.Compiler().compile(pipeline_func=my_pipeline, package_path='my_pipeline.yaml')
```

When running the script, the KFP compiler generates a `.yaml` file which can be uploaded to the KubeFlow UI to start the pipeline.
Alternatively, you can run the pipeline with the SDK client, see the [KubeFlow Docs](https://www.kubeflow.org/docs/components/pipelines/v1/sdk/build-pipeline/) for details.

If you're using an OpenShift cluster, you might want to use the Tekton compiler.

```python
# pip install kfp-tekton

from kfp_tekton.compiler import TektonCompiler

TektonCompiler().compile(pipeline_func=my_pipeline, package_path='my_pipeline.yaml')
```
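As mentioned above, a compiled pipeline can also be started with the SDK client instead of uploading the `yaml` manually. A minimal sketch; the host URL and the argument values are placeholders:

```python
import kfp

# Connect to the KubeFlow Pipelines endpoint; the host URL is a placeholder.
client = kfp.Client(host='https://<kubeflow-host>/pipeline')

# Start a run from the compiled pipeline package.
client.create_run_from_pipeline_package(
    'my_pipeline.yaml',
    arguments={'parameter1': 'value', 'parameter2': 1},
)
```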
+
+If you're using an OpenShift cluster, you might want to use the Tekton compiler.
+
+```python
+# pip install kfp-tekton
+
+from kfp_tekton.compiler import TektonCompiler
+
+TektonCompiler().compile(pipeline_func=my_pipeline, package_path='my_pipeline.yaml')
+```
+
+---
+
+## 4. Create operators
+
+### 4.1 Download C3
+
+Download the [C3 repository](https://github.com/claimed-framework/c3) and install the dependencies with:
+
+```sh
+git clone https://github.com/claimed-framework/c3.git
+cd c3
+pip install -e src
+```
+
+This documentation describes the functionality of the `dev` branch, which currently differs significantly from the `main` branch. Run the following to pull the dev branch:
+```sh
+git checkout dev
+git pull
+```
+
+
 ### 4.2 C3 requirements
 
 Your operator script has to follow certain requirements to be processed by C3. Currently supported are python scripts and ipython notebooks.
 
 #### Python scripts
 
-- The operator name is the python file:`your_operator_name.py`
+- The operator name is the python file: `my_operator_name.py` -> `claimed-my-operator-name`
 - The operator description is the first doc string in the script: `"""Operator description"""`
-- You need to provide the required pip packages in comments starting: `# pip install <packages>`
+- The required pip packages are listed in comments starting with pip install: `# pip install <packages>`
-- The interface is defined by environment variables `your_parameter = os.getenv('your_parameter')`. Output variables start with `output_`.
+- The interface is defined by environment variables `my_parameter = os.getenv('my_parameter')`. Output variables start with `output_`.
 - You can cast a specific type by wrapping `os.getenv()` with `int()`, `float()`, `bool()`. The default type is string. Only these four types are currently supported. You can use `None` as a default value but not pass the `NoneType` via the `job.yaml`.
 
 #### iPython notebooks
 
-- The operator name is the notebook file:`your_operator_name.ipynb`
+- The operator name is the notebook file: `my_operator_name.ipynb` -> `claimed-my-operator-name`
 - The notebook is converted to a python script before creating the operator by merging all cells.
 - Markdown cells are converted into doc strings. Shell commands with `!...` are converted into `os.system()`.
 - The requirements of python scripts apply to the notebook code (The operator description can be a markdown cell).
 
-## Compile an operator with C3
+#### Example
+
+The following is an example python script `example_script.py` that can be compiled by C3.
+
+```py
+"""
+This is the operator description.
+The file name becomes the operator name.
+"""
-With a running Docker engine and your operator script matching the C3 requirements, you can execute the C3 compiler by running `generate_kfp_component.py`:
+
+# Add dependencies by comments starting with "pip install".
+# You can add multiple comments if the packages require a specific order.
+# pip install numpy
+
+import os
+import logging
+import numpy as np
+
+# A comment one line above os.getenv is the description of this variable.
+input_path = os.getenv('input_path')
+
+# You can cast a specific type with int(), float(), or bool().
+num_values = int(os.getenv('num_values', 5))
+
+# Output parameters start with "output_".
+output_path = os.getenv('output_path', None)
+
+# Output parameters are used for pipelines and are not configurable in single jobs. Use "target_" instead.
+target_path = os.getenv('target_path', None)
+
+
+def my_function(n_random):
+    """
+    The compiler only includes the first doc string. This text is not included.
+    """
+    random_values = np.random.randn(n_random)
+    # You can use logging in operators.
+    # C3 adds a logger and a parameter log_level (default: 'INFO') to the operator.
+    logging.info(f'Random values: {random_values}')
+
+
+if __name__ == '__main__':
+    my_function(num_values)
+
+```
+
+### 4.3 Docker engine
+C3 requires a running Docker engine to build the container image. A popular app is [Docker Desktop](https://www.docker.com/products/docker-desktop/). However, Docker Desktop requires a licence for commercial usage in companies. An open-source alternative is [Rancher Desktop](https://rancherdesktop.io) (macOS/Windows/Linux), which includes the Docker engine and a UI. A CLI alternative for macOS and Linux is [Colima](https://github.com/abiosoft/colima), which creates a Linux VM for Docker.
+
+```sh
+# Install Colima with homebrew
+brew install docker docker-compose colima
+
+# Start docker VM
+colima start
+
+# Stop docker VM
+colima stop
+```
+
+### 4.4 Container registry
+
+C3 creates a container image for the operator, which has to be stored in a container registry. A simple solution for non-commercial usage is Docker Hub, but it offers only limited private usage.
+Alternatives to a professional plan from Docker Hub are the [IBM Cloud registry](https://www.ibm.com/products/container-registry) or [Amazon ECR](https://aws.amazon.com/ecr/).
+
+After starting the Docker engine, you need to log in to the registry with docker.
+
+```sh
+docker login -u <user> -p <password_or_api_key> <registry>/<namespace>
+```
+
+### 4.5 Compile an operator with C3
+
+With a running Docker engine and your operator script matching the C3 requirements, you can execute the C3 compiler by running `create_operator.py`:
 
 ```sh
-python <path_to_c3>/src/c3/generate_kfp_component.py --file_path "<name>.py" --version "X.X" --repository "us.icr.io/<namespace>" --additional_files "[file1,file2]"
+python <path_to_c3>/src/c3/create_operator.py --file_path "<name>.py" --version "X.X" --repository "<registry>/<namespace>" --additional_files "[file1,file2]"
 ```
 
 The `file_path` can point to a python script or an ipython notebook. It is recommended to increase the `version` with every compilation as clusters pull images of a specific version from the cache if you used the image before.
 `additional_files` is an optional parameter and must include all files you're using in your operator script. The additional files are placed within the same directory as the operator script.
+C3 generates the container image that is pushed to the registry, a `.yaml` file for KubeFlow, and a `.job.yaml` that can be directly used as described above.
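+
+For instance, compiling `example_script.py` from above might look like this (the C3 path, version, and namespace are placeholders):
+
+```sh
+python <path_to_c3>/src/c3/create_operator.py --file_path "example_script.py" --version "0.1" --repository "docker.io/<namespace>"
+
+# C3 writes example_script.yaml and example_script.job.yaml next to the script;
+# the job file can be started directly as described in section 1
+kubectl apply -f example_script.job.yaml
+```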
+
+---
+
+## 5. Create grid wrapper
+
+You can use grid computing to parallelize an operator.
+Grid computing requires that the code is parallelizable, e.g., by processing different files.
+To coordinate the parallel work, the code gets wrapped by a coordinator script: the grid wrapper.
+
+### 5.1 C3 grid computing requirements
+
+You can use the same code for the grid wrapper as for an operator by adding an extra function which is passed to C3.
+The grid wrapper executes this function for each batch and passes specific parameters to the function:
+The first parameter is the batch name, followed by all variables defined in the operator interface.
+You need to adapt the variables based on the batch, e.g., by adding the batch name to input and output paths.
+
+```python
+def grid_process(batch, parameter1, parameter2, *args, **kwargs):
+    # update operator parameters based on batch name
+    parameter1 = parameter1 + batch
+    parameter2 = os.path.join(parameter2, batch)
+
+    # execute operator code with adapted parameters
+    my_function(parameter1, parameter2)
+```
+
+You might want to add `*args, **kwargs` to avoid errors if not all interface variables are used.
+Note that the operator script is imported by the grid wrapper script. Therefore, all top-level code in the script is executed.
+It is recommended to avoid top-level execution code or to use a main block if the script is also used as a single operator.
+
+```python
+if __name__ == '__main__':
+    my_function(parameter1, parameter2)
+```
+
+### 5.2 Compile a grid wrapper with C3
+
+The compilation is similar to an operator. Additionally, the name of the grid process is passed to `create_grid_wrapper.py`.
+
+```sh
+python <path_to_c3>/src/c3/create_grid_wrapper.py --file_path "<name>.py" --process "grid_process" -v "X.X" -r "<registry>/<namespace>" -a "[file1,file2]"
+```
+
+C3 also includes a grid computing pattern for Cloud Object Storage (COS). You can create a COS grid wrapper by adding a `-cos` flag.
+The COS grid wrapper downloads all files of a batch to local storage, computes the process, and uploads the target files to COS.
+
+The created files include a `gw_<name>.py` file that includes the generated code for the grid wrapper (`cgw_<name>.py` for the COS version).
+Similar to an operator, `gw_<name>.yaml` and `gw_<name>.job.yaml` are created.
+
+
+### 5.3 Apply grid wrappers
+
+The grid wrapper uses coordinator files to split up the batch processes between different pods.
+Therefore, each pod needs access to a shared persistent volume, see [storage](#storage).
+Alternatively, you can use the COS grid wrapper, which uses a coordinator path in COS.
+
+The grid wrapper includes specific variables in the `job.yaml` that define the batches and some coordination settings.
+
+First, you can define the list of batches in a file and pass `gw_batch_file` to the grid wrapper.
+You can use either a txt file with a comma-separated list of strings or a json file with the keys being the batches.
+Alternatively, the batches can be defined by a file name pattern via `gw_file_path_pattern` and `gw_group_by`.
+You can provide multiple patterns via a comma-separated list, and the patterns can include wildcards like `*` or `?` to find all relevant files.
+`gw_group_by` is code that extracts the batch from a file name by merging the file name string with the code string and passing it to `eval()`.
+Assume we have the file names `file-from-batch-42-metadata.json` and `second_file-42-image.png`.
+The code `gw_group_by = ".split('-')[-2]"` extracts the batch `42` from both files.
+You can also use something like `"[-15:-10]"` or `".split('/')[-1].split('.')[0]"`.
+`gw_group_by` is ignored if you provide `gw_batch_file`.
+Be aware that the file names need to include the batch name if you are using `gw_group_by` or the COS version (because files are downloaded based on the batch).
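+
+The following minimal sketch illustrates this merge-and-`eval()` mechanic (the variable names are illustrative; the actual logic lives in the generated wrapper code):
+
+```python
+file_name = 'file-from-batch-42-metadata.json'
+gw_group_by = ".split('-')[-2]"
+
+# the wrapper concatenates the file name expression with the code string
+batch = eval('file_name' + gw_group_by)
+print(batch)  # -> '42'
+```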
+
+Second, you need to define `gw_coordinator_path` and optionally other coordinator variables.
+The `gw_coordinator_path` is a path to a persistent and shared directory that is used by the pods to lock batches and mark them as processed.
+`gw_lock_file_suffix` and similar variables are the suffixes for coordinator files (default: `.lock`, `.processed`, and `.err`).
+`gw_lock_timeout` defines the time in seconds until other pods remove the `.lock` file from batches that might be struggling (default: `3600`).
+You need to increase `gw_lock_timeout` to avoid duplicate processing if batch processes run for a very long time.
+By default, pods skip batches with `.err` files. You can set `gw_ignore_error_files` to `True` after you have fixed the error.
+
+Lastly, you want to set the number of parallel pods by adding `parallelism: <number_of_pods>` to the `job.yaml`.
+
+```yaml
+spec:
+  parallelism: 10
+```
+
+In KubeFlow pipelines, you can call the grid wrapper multiple times via a `for` loop. Note that the following step needs to wait for all parallel processes to finish.
+
+```python
+process_parallel_instances = 10
+
+@dsl.pipeline(...)
+def preprocessing_val_pipeline(...):
+    step1 = first_op()
+    step3 = following_op()
+    for i in range(process_parallel_instances):
+        step2 = grid_wrapper_op(...)
+        step2.after(step1)
+        # called for every instance so that step3 waits for all of them
+        step3.after(step2)
+```
+
+If you're using the COS grid wrapper, further variables are required.
+You can provide a comma-separated list of additional files that should be downloaded from COS using `gw_additional_source_files`.
+All batch files and additional files are downloaded to an input directory, defined via `gw_local_input_path` (default: `input`).
+Similarly, all files in `gw_local_target_path` are uploaded to COS after the batch processing (default: `target`).
+
+Furthermore, `gw_source_access_key_id`, `gw_source_secret_access_key`, `gw_source_endpoint`, and `gw_source_bucket` define the COS bucket of the source files.
+You can specify other buckets for the coordinator and target files.
+If these buckets are the same as the source bucket, you only need to provide `gw_target_path` and `gw_coordinator_path` and remove the other variables from the `job.yaml`.
+It is recommended to use [secrets](#secrets) for the access key and secret.
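+
+As a minimal sketch, a matching secret for the COS example below could be created like this (the values are placeholders; see [secrets](#secrets) for details):
+
+```sh
+kubectl create secret generic cos-secret \
+  --from-literal=access_key_id=<access_key_id> \
+  --from-literal=secret_access_key=<secret_access_key>
+```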
+
+
+#### Local example
+
+The local grid wrapper requires a shared persistent volume for coordination, like the PVC in the following example.
+
+```yaml
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: gw-my-operator
+spec:
+  parallelism: 10
+  template:
+    spec:
+      containers:
+        - name: gw-my-operator
+          image: us.icr.io/geodn/claimed-gw-my-operator:0.01
+          command: ["/opt/app-root/bin/python","/opt/app-root/src/claimed_gw_my_operator.py"]
+          env:
+            - name: gw_batch_file
+              value: "data/schedule.json"
+            - name: gw_coordinator_path
+              value: 'gw_coordinator'
+            - name: my_operator_data_path
+              value: 'data/*'
+            - name: my_operator_target_path
+              value: 'data/output/'
+            - name: my_operator_parameter
+              value: "100"
+          volumeMounts:
+            - name: pvc-data
+              mountPath: /opt/app-root/src/data/
+      volumes:
+        - name: pvc-data
+          persistentVolumeClaim:
+            claimName: pvc-name
+      restartPolicy: Never
+      imagePullSecrets:
+        - name: image-pull-secret
+```
+
+#### COS example
+
+The COS grid wrapper uses a COS bucket for downloading and uploading the batch data and for coordination.
+
+```yaml
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: cgw-my-operator
+spec:
+  parallelism: 10
+  template:
+    spec:
+      containers:
+        - name: cgw-my-operator
+          image: us.icr.io/geodn/claimed-cgw-my-operator:0.01
+          command: ["/opt/app-root/bin/python","/opt/app-root/src/claimed_cgw_my_operator.py"]
+          env:
+            - name: gw_file_path_pattern
+              value: 'data/*'
+            - name: gw_group_by
+              value: '[-10:-4]'
+            - name: gw_source_access_key_id
+              valueFrom:
+                secretKeyRef:
+                  name: cos-secret
+                  key: access_key_id
+            - name: gw_source_secret_access_key
+              valueFrom:
+                secretKeyRef:
+                  name: cos-secret
+                  key: secret_access_key
+            - name: gw_source_endpoint
+              value: 'https://s3.cloud-object-storage.cloud'
+            - name: gw_source_bucket
+              value: 'my-bucket'
+            - name: gw_target_path
+              value: 'cos_results'
+            - name: gw_coordinator_path
+              value: 'gw_coordinator'
+            - name: my_operator_data_path
+              value: 'input'
+            - name: my_operator_target_path
+              value: 'target'
+            - name: my_operator_parameter
+              value: "100"
+      restartPolicy: Never
+      imagePullSecrets:
+        - name: image-pull-secret
+```
\ No newline at end of file
diff --git a/README.md b/README.md
index 48648ca5..8e4297d4 100644
--- a/README.md
+++ b/README.md
@@ -36,14 +36,18 @@ pip install -e src
 Just run the following command with your python script or notebook:
 
 ```sh
-python <path_to_c3>/src/c3/generate_kfp_component.py --file_path "<name>.py" --version "X.X" --repository "us.icr.io/<namespace>" --additional_files "[file1,file2]"
+python <path_to_c3>/src/c3/create_operator.py --file_path "<name>.py" --version "X.X" --repository "<registry>/<namespace>" --additional_files "[file1,file2]"
 ```
 
-Your code include certain requirements which are explained in [Getting Started](GettingStarted.md).
+Your code needs to follow certain requirements which are explained in [Getting Started](GettingStarted.md).
 
 ## Getting Help
 
+```sh
+python src/c3/create_operator.py --help
+```
+
 We welcome your questions, ideas, and feedback. Please create an [issue](https://github.com/claimed-framework/component-library/issues) or a [discussion thread](https://github.com/claimed-framework/component-library/discussions).
 
 Please see [VULNERABILITIES.md](VULNERABILITIES.md) for reporting vulnerabilities.
From 66da4343d01fcea604d7d4f77d140c0775d190fd Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Mon, 16 Oct 2023 18:09:11 +0200 Subject: [PATCH 071/177] Remove grid wrapper component --- src/c3/create_grid_wrapper.py | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/src/c3/create_grid_wrapper.py b/src/c3/create_grid_wrapper.py index feb7a369..46e26319 100644 --- a/src/c3/create_grid_wrapper.py +++ b/src/c3/create_grid_wrapper.py @@ -190,4 +190,5 @@ def apply_grid_wrapper(file_path, component_process, cos, *args, **kwargs): additional_files=args.additional_files ) - # TODO: Delete component_path? + logging.info('Remove local component file') + os.remove(component_path) From 7ff076fadda8a3bd15825ea768d47f797fabeb6a Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Tue, 17 Oct 2023 16:24:33 +0200 Subject: [PATCH 072/177] Add auto-incremental image version --- src/c3/create_grid_wrapper.py | 8 +-- src/c3/create_operator.py | 12 ++-- src/c3/notebook_converter.py | 46 ---------------- src/c3/utils.py | 101 ++++++++++++++++++++++++++++++++++ 4 files changed, 113 insertions(+), 54 deletions(-) delete mode 100644 src/c3/notebook_converter.py create mode 100644 src/c3/utils.py diff --git a/src/c3/create_grid_wrapper.py b/src/c3/create_grid_wrapper.py index 46e26319..63e1711f 100644 --- a/src/c3/create_grid_wrapper.py +++ b/src/c3/create_grid_wrapper.py @@ -4,7 +4,7 @@ import sys from string import Template from pythonscript import Pythonscript -from notebook_converter import convert_notebook +from utils import convert_notebook from create_operator import create_operator from templates import grid_wrapper_template, cos_grid_wrapper_template, gw_component_setup_code, dockerfile_template @@ -141,9 +141,9 @@ def apply_grid_wrapper(file_path, component_process, cos, *args, **kwargs): help='Name of the component sub process that is executed for each batch.') parser.add_argument('-cos', action=argparse.BooleanOptionalAction, default=False, help='Creates a grid wrapper for processing COS files') - parser.add_argument('-r', '--repository', type=str, + parser.add_argument('-r', '--repository', type=str, default=None, help='Container registry address, e.g. 
docker.io/') - parser.add_argument('-v', '--version', type=str, + parser.add_argument('-v', '--version', type=str, default=None, help='Image version') parser.add_argument('-a', '--additional_files', type=str, help='Comma-separated list of paths to additional files to include in the container image') @@ -164,7 +164,7 @@ def apply_grid_wrapper(file_path, component_process, cos, *args, **kwargs): grid_wrapper_file_path, component_path = apply_grid_wrapper(**vars(args)) - if args.repository is not None and args.version is not None: + if args.repository is not None: logging.info('Generate CLAIMED operator for grid wrapper') # Add component path and init file path to additional_files diff --git a/src/c3/create_operator.py b/src/c3/create_operator.py index 48e16c39..1dc988c7 100644 --- a/src/c3/create_operator.py +++ b/src/c3/create_operator.py @@ -7,7 +7,7 @@ from string import Template from io import StringIO from pythonscript import Pythonscript -from notebook_converter import convert_notebook +from utils import convert_notebook, get_image_version # Update sys path to load templates sys.path.append(os.path.dirname(os.path.dirname(os.path.realpath(__file__)))) @@ -25,7 +25,7 @@ def create_operator(file_path: str, logging.info('Parameters: ') logging.info('file_path: ' + file_path) logging.info('repository: ' + repository) - logging.info('version: ' + version) + logging.info('version: ' + str(version)) logging.info('additional_files: ' + str(additional_files)) if file_path.endswith('.ipynb'): @@ -117,6 +117,10 @@ def create_operator(file_path: str, logging.info('Create Dockerfile') with open("Dockerfile", "w") as text_file: text_file.write(docker_file) + + if version is None: + # auto increase version based on registered images + version = get_image_version(repository, name) logging.info(f'Build and push image to {repository}/claimed-{name}:{version}') os.system(f'docker build --platform=linux/amd64 -t `echo claimed-{name}:{version}` .') @@ -224,8 +228,8 @@ def get_parameter_list(): help='Path to python script or notebook') parser.add_argument('-r', '--repository', type=str, required=True, help='Container registry address, e.g. docker.io/') - parser.add_argument('-v', '--version', type=str, required=True, - help='Image version') + parser.add_argument('-v', '--version', type=str, default=None, + help='Image version. Increases the version numer of image:latest if not provided.') parser.add_argument('-a', '--additional_files', type=str, help='Comma-separated list of paths to additional files to include in the container image') parser.add_argument('-l', '--log_level', type=str, default='INFO') diff --git a/src/c3/notebook_converter.py b/src/c3/notebook_converter.py deleted file mode 100644 index 9bcd2b06..00000000 --- a/src/c3/notebook_converter.py +++ /dev/null @@ -1,46 +0,0 @@ - -import json -import logging -import os - - -def convert_notebook(path): - with open(path) as json_file: - notebook = json.load(json_file) - - # backwards compatibility - if notebook['cells'][0]['cell_type'] == 'markdown' and notebook['cells'][1]['cell_type'] == 'markdown': - logging.info('Merge first two markdown cells. 
File name is used as operator name, not first markdown cell.') - notebook['cells'][1]['source'] = notebook['cells'][0]['source'] + ['\n'] + notebook['cells'][1]['source'] - notebook['cells'] = notebook['cells'][1:] - - code_lines = [] - for cell in notebook['cells']: - if cell['cell_type'] == 'markdown': - # add markdown as doc string - code_lines.extend(['"""\n'] + [f'{line}' for line in cell['source']] + ['\n"""']) - elif cell['cell_type'] == 'code': - for line in cell['source']: - if line.strip().startswith('!'): - # convert sh scripts - if line.strip().startswith('!pip'): - # change pip install to comment - code_lines.append(line.replace('!pip', '# pip', 1)) - else: - # change sh command to os.system() - logging.info(f'Replace shell command with os.system() ({line})') - code_lines.append(line.replace('!', 'os.system(', 1).replace('\n', ')\n')) - else: - # add code - code_lines.append(line) - # add line break after cell - code_lines.append('\n') - code = ''.join(code_lines) - - py_path = path.split('/')[-1].replace('.ipynb', '.py') - - assert not os.path.exists(py_path), f"File {py_path} already exist. Cannot convert notebook." - with open(py_path, 'w') as py_file: - py_file.write(code) - - return py_path diff --git a/src/c3/utils.py b/src/c3/utils.py new file mode 100644 index 00000000..7586a208 --- /dev/null +++ b/src/c3/utils.py @@ -0,0 +1,101 @@ +import os +import logging +import json +import subprocess + + +def convert_notebook(path): + with open(path) as json_file: + notebook = json.load(json_file) + + # backwards compatibility + if notebook['cells'][0]['cell_type'] == 'markdown' and notebook['cells'][1]['cell_type'] == 'markdown': + logging.info('Merge first two markdown cells. File name is used as operator name, not first markdown cell.') + notebook['cells'][1]['source'] = notebook['cells'][0]['source'] + ['\n'] + notebook['cells'][1]['source'] + notebook['cells'] = notebook['cells'][1:] + + code_lines = [] + for cell in notebook['cells']: + if cell['cell_type'] == 'markdown': + # add markdown as doc string + code_lines.extend(['"""\n'] + [f'{line}' for line in cell['source']] + ['\n"""']) + elif cell['cell_type'] == 'code': + for line in cell['source']: + if line.strip().startswith('!'): + # convert sh scripts + if line.strip().startswith('!pip'): + # change pip install to comment + code_lines.append(line.replace('!pip', '# pip', 1)) + else: + # change sh command to os.system() + logging.info(f'Replace shell command with os.system() ({line})') + code_lines.append(line.replace('!', 'os.system(', 1).replace('\n', ')\n')) + else: + # add code + code_lines.append(line) + # add line break after cell + code_lines.append('\n') + code = ''.join(code_lines) + + py_path = path.split('/')[-1].replace('.ipynb', '.py') + + assert not os.path.exists(py_path), f"File {py_path} already exist. Cannot convert notebook." + with open(py_path, 'w') as py_file: + py_file.write(code) + + return py_path + + +def increase_image_version(last_version): + try: + # increase last version value by 1 + version = last_version.split('.') + version[-1] = str(int(version[-1]) + 1) + version = '.'.join(version) + except: + # fails if a string value was used for the last tag + version = last_version + '.1' + logging.debug(f'Failed to increase last value, adding .1') + pass + logging.info(f'Using version {version} based on latest tag ({last_version}).') + return version + + +def get_image_version(repository, name): + """ + Get current version of the image from the registry and increase the version by 1. 
+ Default to 0.1.1 if no image is found in the registry. + """ + logging.debug(f'Get image version from registry.') + # list images + image_list = subprocess.run( + ['docker', 'image', 'ls', f'{repository}/claimed-{name}'], + stdout=subprocess.PIPE + ).stdout.decode('utf-8') + # get list of image tags + image_tags = [line.split()[1] for line in image_list.splitlines()][1:] + # filter latest and none + image_tags = [t for t in image_tags if t not in ['latest', '']] + logging.debug(f'Image tags: {image_tags}') + + def check_only_numbers(test_str): + return set(test_str) <= set('.0123456789') + + if len(image_tags) == 0: + # default version + version = '0.1.1' + logging.info(f'Using default version {version}. No prior image tag found for {repository}/claimed-{name}.') + + elif not check_only_numbers(image_tags[0]): + # increase last version + version = increase_image_version(image_tags[0]) + logging.info(f'Using version {version} based on last version {image_tags[0]}.') + + else: + # find the highest numerical version + image_tags = list(filter(check_only_numbers, image_tags)) + image_tags.sort(key=lambda s: list(map(int, s.split('.')))) + version = increase_image_version(image_tags[-1]) + logging.info(f'Using version {version} based on highest previous version {image_tags[-1]}.') + + return version From c7dfe6fa7cc62caada45fa301da987b966e22e82 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Tue, 17 Oct 2023 17:16:32 +0200 Subject: [PATCH 073/177] Replaced os.system with subprocess.run --- src/c3/create_grid_wrapper.py | 5 +-- src/c3/create_operator.py | 59 ++++++++++++++++++++++++----------- 2 files changed, 43 insertions(+), 21 deletions(-) diff --git a/src/c3/create_grid_wrapper.py b/src/c3/create_grid_wrapper.py index 63e1711f..fd15578e 100644 --- a/src/c3/create_grid_wrapper.py +++ b/src/c3/create_grid_wrapper.py @@ -157,7 +157,7 @@ def apply_grid_wrapper(file_path, component_process, cos, *args, **kwargs): root = logging.getLogger() root.setLevel(args.log_level) handler = logging.StreamHandler(sys.stdout) - formatter = logging.Formatter('%(name)s - %(levelname)s - %(message)s') + formatter = logging.Formatter('%(levelname)s - %(message)s') handler.setFormatter(formatter) handler.setLevel(args.log_level) root.addHandler(handler) @@ -187,7 +187,8 @@ def apply_grid_wrapper(file_path, component_process, cos, *args, **kwargs): repository=args.repository, version=args.version, dockerfile_template=dockerfile_template, - additional_files=args.additional_files + additional_files=args.additional_files, + log_level=args.log_level, ) logging.info('Remove local component file') diff --git a/src/c3/create_operator.py b/src/c3/create_operator.py index 1dc988c7..e330baba 100644 --- a/src/c3/create_operator.py +++ b/src/c3/create_operator.py @@ -4,6 +4,7 @@ import logging import shutil import argparse +import subprocess from string import Template from io import StringIO from pythonscript import Pythonscript @@ -21,6 +22,7 @@ def create_operator(file_path: str, version: str, dockerfile_template: str, additional_files: str = None, + log_level='INFO', ): logging.info('Parameters: ') logging.info('file_path: ' + file_path) @@ -93,17 +95,6 @@ def create_operator(file_path: str, else: additional_files_local = target_code # hack additional_files_path = None - file = target_code - - # read and replace '!pip' in notebooks - with open(file, 'r') as fd: - text, counter = re.subn(r'!pip', '#!pip', fd.read(), re.I) - - # check if there is at least a match - if counter > 0: - # edit the file - with 
open(file, 'w') as fd: - fd.write(text) requirements_docker = list(map(lambda s: 'RUN ' + s, requirements)) requirements_docker = '\n'.join(requirements_docker) @@ -122,12 +113,41 @@ def create_operator(file_path: str, # auto increase version based on registered images version = get_image_version(repository, name) - logging.info(f'Build and push image to {repository}/claimed-{name}:{version}') - os.system(f'docker build --platform=linux/amd64 -t `echo claimed-{name}:{version}` .') - os.system(f'docker tag `echo claimed-{name}:{version}` `echo {repository}/claimed-{name}:{version}`') - os.system(f'docker tag `echo claimed-{name}:{version}` `echo {repository}/claimed-{name}:latest`') - os.system(f'docker push `echo {repository}/claimed-{name}:latest`') - os.system(f'docker push `echo {repository}/claimed-{name}:{version}`') + logging.info(f'Building container image claimed-{name}:{version}') + try: + subprocess.run( + ['docker', 'build', '--platform', 'linux/amd64', '-t', f'claimed-{name}:{version}', '.'], + stdout=None if log_level == 'DEBUG' else subprocess.PIPE, check=True, + ) + logging.debug(f'Tagging images with "latest" and "{version}"') + subprocess.run( + ['docker', 'tag', f'claimed-{name}:{version}', f'{repository}/claimed-{name}:{version}'], + stdout=None if log_level == 'DEBUG' else subprocess.PIPE, check=True, + ) + subprocess.run( + ['docker', 'tag', f'claimed-{name}:{version}', f'{repository}/claimed-{name}:latest'], + stdout=None if log_level == 'DEBUG' else subprocess.PIPE, check=True, + ) + logging.info('Successfully built image') + except: + logging.error(f'Failed to build image with docker.') + pass + + logging.info(f'Pushing images to registry {repository}') + try: + subprocess.run( + ['docker', 'push', f'{repository}/claimed-{name}:latest'], + stdout=None if log_level == 'DEBUG' else subprocess.PIPE, check=True, + ) + subprocess.run( + ['docker', 'push', f'{repository}/claimed-{name}:{version}'], + stdout=None if log_level == 'DEBUG' else subprocess.PIPE, check=True, + ) + logging.info('Successfully pushed image to registry') + except: + logging.error(f'Could not push images to namespace {repository}. 
' + f'Please check if docker is logged in or select a namespace with access.') + pass def get_component_interface(parameters): template_string = str() @@ -241,7 +261,7 @@ def get_parameter_list(): root = logging.getLogger() root.setLevel(args.log_level) handler = logging.StreamHandler(sys.stdout) - formatter = logging.Formatter('%(name)s - %(levelname)s - %(message)s') + formatter = logging.Formatter('%(levelname)s - %(message)s') handler.setFormatter(formatter) handler.setLevel(args.log_level) root.addHandler(handler) @@ -257,5 +277,6 @@ def get_parameter_list(): repository=args.repository, version=args.version, dockerfile_template=dockerfile_template, - additional_files=args.additional_files + additional_files=args.additional_files, + log_level=args.log_level, ) From b209cef304acd736404ca431bc111c1f563a3eda Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Tue, 17 Oct 2023 18:33:51 +0200 Subject: [PATCH 074/177] Updated code for additional_files and kfp yaml --- src/c3/create_operator.py | 128 +++++++++------------- src/templates/dockerfile_template | 2 +- src/templates/kfp_component_template.yaml | 2 +- 3 files changed, 51 insertions(+), 81 deletions(-) diff --git a/src/c3/create_operator.py b/src/c3/create_operator.py index e330baba..635940be 100644 --- a/src/c3/create_operator.py +++ b/src/c3/create_operator.py @@ -33,13 +33,15 @@ def create_operator(file_path: str, if file_path.endswith('.ipynb'): logging.info('Convert notebook to python script') target_code = convert_notebook(file_path) - else: + elif file_path.endswith('.py'): target_code = file_path.split('/')[-1] if file_path == target_code: # use temp file for processing target_code = 'claimed_' + target_code # Copy file to current working directory shutil.copy(file_path, target_code) + else: + raise NotImplementedError('Please provide a file_path to a jupyter notebook or python script.') if target_code.endswith('.py'): # Add code for logging and cli parameters to the beginning of the script @@ -49,8 +51,7 @@ def create_operator(file_path: str, with open(target_code, 'w') as f: f.write(script) - # getting parameter from the script - if target_code.endswith('.py'): + # getting parameter from the script py = Pythonscript(target_code) name = py.get_name() # convert description into a string with a single line @@ -60,7 +61,7 @@ def create_operator(file_path: str, outputs = py.get_outputs() requirements = py.get_requirements() else: - raise NotImplementedError('Please provide a file_path to a jupyter notebook or python script.') + raise NotImplementedError('C3 currently only supports jupyter notebook or python script.') # Strip 'claimed-' from name of copied temp file if name.startswith('claimed-'): @@ -73,27 +74,18 @@ def create_operator(file_path: str, logging.info('Requirements: ' + str(requirements)) if additional_files is not None: - if additional_files.startswith('['): - additional_files_path = 'additional_files_path' - if not os.path.exists(additional_files_path): - os.makedirs(additional_files_path) - additional_files_local = additional_files_path - additional_files = additional_files[1:-1].split(',') - print('Additional files to add to container:') - for additional_file in additional_files: - print(additional_file) - shutil.copy(additional_file, additional_files_local) - print(os.listdir(additional_files_path)) - else: - additional_files_local = additional_files.split('/')[-1:][0] - if additional_files != additional_files_local: - shutil.copy(additional_files, additional_files_local) - additional_files_path = 
additional_files_local - else: - # ensure the original file is not deleted later - additional_files_path = None + additional_files_path = 'additional_files_path' + while os.path.exists(additional_files_path): + # ensures using a new directory + additional_files_path += '_temp' + logging.debug(f'Create dir for additional files {additional_files_path}') + os.makedirs(additional_files_path) + # Strip [] from backward compatibility + additional_files = additional_files.strip('[]').split(',') + for additional_file in additional_files: + shutil.copy(additional_file.strip(), additional_files_path) + logging.info(f'Selected additional files: {os.listdir(additional_files_path)}') else: - additional_files_local = target_code # hack additional_files_path = None requirements_docker = list(map(lambda s: 'RUN ' + s, requirements)) @@ -102,7 +94,7 @@ def create_operator(file_path: str, docker_file = dockerfile_template.substitute( requirements_docker=requirements_docker, target_code=target_code, - additional_files_local=additional_files_local, + additional_files_path=additional_files_path or target_code, ) logging.info('Create Dockerfile') @@ -150,72 +142,51 @@ def create_operator(file_path: str, pass def get_component_interface(parameters): - template_string = str() - for parameter_name, parameter_options in parameters.items(): - template_string += f'- {{name: {parameter_name}, type: {parameter_options["type"]}, description: "{parameter_options["description"]}"' - if parameter_options['default'] is not None: - template_string += f', default: {parameter_options["default"]}' - template_string += '}\n' - return template_string - - def get_output_name(): - for output_key, output_value in outputs.items(): - return output_key - - # TODO: Review implementation - def get_input_for_implementation(): - t = Template(" - {inputValue: $name}") - with StringIO() as inputs_str: - for input_key, input_value in inputs.items(): - print(t.substitute(name=input_key), file=inputs_str) - return inputs_str.getvalue() - - def get_parameter_list(): - return_value = str() - index = 0 - for output_key, output_value in outputs.items(): - return_value = return_value + output_key + '="${' + str(index) + '}" ' - index = index + 1 - for input_key, input_value in inputs.items(): - return_value = return_value + input_key + '="${' + str(index) + '}" ' - index = index + 1 - return return_value + return_string = str() + for name, options in parameters.items(): + return_string += f'- {{name: {name}, type: {options["type"]}, description: "{options["description"]}"' + if options['default'] is not None: + if not options["default"].startswith("'"): + options["default"] = f"'{options['default']}'" + return_string += f', default: {options["default"]}' + return_string += '}\n' + return return_string + inputs_list = get_component_interface(inputs) + outputs_list = get_component_interface(outputs) + + parameter_list = str() + for index, key in enumerate(list(inputs.keys()) + list(outputs.keys())): + parameter_list += f'{key}="${{{index}}}" ' + + parameter_values = str() + for input_key in inputs.keys(): + parameter_values += f" - {{inputValue: {input_key}}}\n" + for input_key in outputs.keys(): + parameter_values += f" - {{outputPath: {input_key}}}\n" yaml = kfp_component_template.substitute( name=name, description=description, repository=repository, version=version, - inputs=get_component_interface(inputs), - outputs=get_component_interface(outputs), - call=f'./{target_code} {get_parameter_list()}', - 
input_for_implementation=get_input_for_implementation(), + inputs=inputs_list, + outputs=outputs_list, + call=f'./{target_code} {parameter_list}', + parameter_values=parameter_values, ) - logging.debug('KubeFlow component yaml:') - logging.debug(yaml) + logging.debug('KubeFlow component yaml:\n' + yaml) target_yaml_path = file_path.replace('.ipynb', '.yaml').replace('.py', '.yaml') - logging.debug(f' Write KubeFlow component yaml to {target_yaml_path}') + logging.info(f'Write KubeFlow component yaml to {target_yaml_path}') with open(target_yaml_path, "w") as text_file: text_file.write(yaml) # get environment entries - # TODO: Make it similar to the kfp code - env_entries = [] - for input_key, _ in inputs.items(): - env_entry = f" - name: {input_key}\n value: value_of_{input_key}" - env_entries.append(env_entry) - env_entries.append('\n') - for output_key, _ in outputs.items(): - env_entry = f" - name: {output_key}\n value: value_of_{output_key}" - env_entries.append(env_entry) - env_entries.append('\n') - - # TODO: Is it possible that a component has no inputs? - if len(env_entries) != 0: - env_entries.pop(-1) - env_entries = ''.join(env_entries) + env_entries = str() + for key in list(inputs.keys()) + list(outputs.keys()): + env_entries += f" - name: {key}\n value: value_of_{key}\n" + env_entries = env_entries.rstrip() job_yaml = kubernetes_job_template.substitute( name=name, @@ -225,8 +196,7 @@ def get_parameter_list(): env_entries=env_entries, ) - logging.debug('Kubernetes job yaml:') - logging.debug(job_yaml) + logging.debug('Kubernetes job yaml:\n' + job_yaml) target_job_yaml_path = file_path.replace('.ipynb', '.job.yaml').replace('.py', '.job.yaml') logging.info(f'Write kubernetes job yaml to {target_job_yaml_path}') diff --git a/src/templates/dockerfile_template b/src/templates/dockerfile_template index dfc1134d..c82a7435 100644 --- a/src/templates/dockerfile_template +++ b/src/templates/dockerfile_template @@ -4,7 +4,7 @@ RUN dnf install -y java-11-openjdk USER default ${requirements_docker} ADD ${target_code} /opt/app-root/src/ -ADD ${additional_files_local} /opt/app-root/src/ +ADD ${additional_files_path} /opt/app-root/src/ USER root RUN chmod -R 777 /opt/app-root/src/ USER default diff --git a/src/templates/kfp_component_template.yaml b/src/templates/kfp_component_template.yaml index 56d8c74b..c0c53569 100644 --- a/src/templates/kfp_component_template.yaml +++ b/src/templates/kfp_component_template.yaml @@ -15,4 +15,4 @@ implementation: - -ec - | python ${call} -${input_for_implementation} \ No newline at end of file +${parameter_values} \ No newline at end of file From 9e6bab77076b091dd16655f3ce0f9b612e3d77a3 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Wed, 18 Oct 2023 10:05:16 +0200 Subject: [PATCH 075/177] Moved file_path and additional_files to positional arguments --- GettingStarted.md | 19 +++++++++++------- src/c3/create_grid_wrapper.py | 16 +++++---------- src/c3/create_operator.py | 38 +++++++++++++++++------------------ 3 files changed, 35 insertions(+), 38 deletions(-) diff --git a/GettingStarted.md b/GettingStarted.md index 05bab158..9c8c0b99 100644 --- a/GettingStarted.md +++ b/GettingStarted.md @@ -360,11 +360,16 @@ docker login -u -p / With a running Docker engine and your operator script matching the C3 requirements, you can execute the C3 compiler by running `create_operator.py`: ```sh -python /src/c3/create_operator.py --file_path ".py" --version "X.X" --repository "/" --additional_files "[file1,file2]" +python /src/c3/create_operator.py ".py" "" 
"" --repository "/" ``` -The `file_path` can point to a python script or an ipython notebook. It is recommended to increase the `version` with every compilation as clusters pull images of a specific version from the cache if you used the image before. -`additional_files` is an optional parameter and must include all files your using in your operator script. The additional files are placed within the same directory as the operator script. +The first positional argument is the path to the python script or the ipython notebook. Optional, you can provide additional files that are copied to the container images with in all following parameters. The additional files are placed within the same directory as the operator script. +C3 automatically increases the version of the container image (default: "0.1") but you can set the version with `--version` or `-v`. You need to provide the repository with `--repository` or `-r`. + +View all arguments by running: +```sh +python /src/c3/create_grid_wrapper.py --help +``` C3 generates the container image that is pushed to the registry, a `.yaml` file for KubeFlow, and a `.job.yaml` that can be directly used as described above. @@ -404,14 +409,14 @@ if __name__ == '__main__': ### 5.2 Compile a grid wrapper with C3 -The compilation is similar to an operator. Additionally, the name of the grid process is passed to `create_grid_wrapper.py`. +The compilation is similar to an operator. Additionally, the name of the grid process is passed to `create_grid_wrapper.py` using `--process` or `-p`. ```sh -python /src/c3/create_grid_wrapper.py --file_path ".py" --process "grid_process" -v "X.X" -r "/" -a "[file1,file2]" +python /src/c3/create_grid_wrapper.py ".py" "" "" --process "grid_process" -r "/" ``` -C3 also includes a grid computing pattern for Cloud Object Storage (COS). You can create a COS grid wrapper by adding a `-cos` flag. -The COS grid wrapper downloads all files of a batch to local storage, compute the process, and uploads the target files to COS. +C3 also includes a grid computing pattern for Cloud Object Storage (COS). You can create a COS grid wrapper by adding a `--cos` flag. +The COS grid wrapper downloads all files of a batch to local storage, compute the process, and uploads the target files to COS. The created files include a `gw_.py` file that includes the generated code for the grid wrapper (`cgw_.py` for the COS version). Similar to an operator, `gw_.yaml` and `gw_.job.yaml` are created. 
diff --git a/src/c3/create_grid_wrapper.py b/src/c3/create_grid_wrapper.py index fd15578e..d3f51f31 100644 --- a/src/c3/create_grid_wrapper.py +++ b/src/c3/create_grid_wrapper.py @@ -135,18 +135,18 @@ def apply_grid_wrapper(file_path, component_process, cos, *args, **kwargs): if __name__ == '__main__': parser = argparse.ArgumentParser() - parser.add_argument('-f', '--file_path', type=str, required=True, + parser.add_argument('file_path', type=str, help='Path to python script or notebook') + parser.add_argument('additional_files', type=str, nargs='*', + help='List of paths to additional files to include in the container image') parser.add_argument('-p', '--component_process', type=str, required=True, help='Name of the component sub process that is executed for each batch.') - parser.add_argument('-cos', action=argparse.BooleanOptionalAction, default=False, + parser.add_argument('--cos', action=argparse.BooleanOptionalAction, default=False, help='Creates a grid wrapper for processing COS files') parser.add_argument('-r', '--repository', type=str, default=None, help='Container registry address, e.g. docker.io/') parser.add_argument('-v', '--version', type=str, default=None, help='Image version') - parser.add_argument('-a', '--additional_files', type=str, - help='Comma-separated list of paths to additional files to include in the container image') parser.add_argument('-l', '--log_level', type=str, default='INFO') parser.add_argument('--dockerfile_template_path', type=str, default='', help='Path to custom dockerfile template') @@ -168,13 +168,7 @@ def apply_grid_wrapper(file_path, component_process, cos, *args, **kwargs): logging.info('Generate CLAIMED operator for grid wrapper') # Add component path and init file path to additional_files - if args.additional_files is None: - args.additional_files = component_path - else: - if args.additional_files.startswith('['): - args.additional_files = f'{args.additional_files[:-1]},{component_path}]' - else: - args.additional_files = f'[{args.additional_files},{component_path}]' + args.additional_files.append(component_path) # Update dockerfile template if specified if args.dockerfile_template_path != '': diff --git a/src/c3/create_operator.py b/src/c3/create_operator.py index 635940be..0015086f 100644 --- a/src/c3/create_operator.py +++ b/src/c3/create_operator.py @@ -73,20 +73,18 @@ def create_operator(file_path: str, logging.info('Outputs: ' + str(outputs)) logging.info('Requirements: ' + str(requirements)) - if additional_files is not None: - additional_files_path = 'additional_files_path' - while os.path.exists(additional_files_path): - # ensures using a new directory - additional_files_path += '_temp' - logging.debug(f'Create dir for additional files {additional_files_path}') - os.makedirs(additional_files_path) - # Strip [] from backward compatibility - additional_files = additional_files.strip('[]').split(',') - for additional_file in additional_files: - shutil.copy(additional_file.strip(), additional_files_path) - logging.info(f'Selected additional files: {os.listdir(additional_files_path)}') - else: - additional_files_path = None + # copy all additional files to temporary folder + additional_files_path = 'additional_files_path' + while os.path.exists(additional_files_path): + # ensures using a new directory + additional_files_path += '_temp' + logging.debug(f'Create dir for additional files {additional_files_path}') + os.makedirs(additional_files_path) + for additional_file in additional_files: + assert os.path.isfile(additional_file), \ + 
f"Could not find file at {additional_file}. Please provide only files as additional parameters." + shutil.copy(additional_file, additional_files_path) + logging.info(f'Selected additional files: {os.listdir(additional_files_path)}') requirements_docker = list(map(lambda s: 'RUN ' + s, requirements)) requirements_docker = '\n'.join(requirements_docker) @@ -94,7 +92,7 @@ def create_operator(file_path: str, docker_file = dockerfile_template.substitute( requirements_docker=requirements_docker, target_code=target_code, - additional_files_path=additional_files_path or target_code, + additional_files_path=additional_files_path, ) logging.info('Create Dockerfile') @@ -214,14 +212,14 @@ def get_component_interface(parameters): if __name__ == '__main__': parser = argparse.ArgumentParser() - parser.add_argument('-f', '--file_path', type=str, required=True, + parser.add_argument('FILE_PATH', type=str, help='Path to python script or notebook') + parser.add_argument('ADDITIONAL_FILES', type=str, nargs='*', + help='Paths to additional files to include in the container image') parser.add_argument('-r', '--repository', type=str, required=True, help='Container registry address, e.g. docker.io/') parser.add_argument('-v', '--version', type=str, default=None, help='Image version. Increases the version numer of image:latest if not provided.') - parser.add_argument('-a', '--additional_files', type=str, - help='Comma-separated list of paths to additional files to include in the container image') parser.add_argument('-l', '--log_level', type=str, default='INFO') parser.add_argument('--dockerfile_template_path', type=str, default='', help='Path to custom dockerfile template') @@ -243,10 +241,10 @@ def get_component_interface(parameters): dockerfile_template = Template(f.read()) create_operator( - file_path=args.file_path, + file_path=args.FILE_PATH, repository=args.repository, version=args.version, dockerfile_template=dockerfile_template, - additional_files=args.additional_files, + additional_files=args.ADDITIONAL_FILES, log_level=args.log_level, ) From 9870f1509e0a78ef1de7adb26cd17dc91b2ca9f2 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Wed, 18 Oct 2023 10:05:29 +0200 Subject: [PATCH 076/177] Changed default version to 0.1 --- src/c3/utils.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/c3/utils.py b/src/c3/utils.py index 7586a208..1e94555d 100644 --- a/src/c3/utils.py +++ b/src/c3/utils.py @@ -83,7 +83,7 @@ def check_only_numbers(test_str): if len(image_tags) == 0: # default version - version = '0.1.1' + version = '0.1' logging.info(f'Using default version {version}. 
No prior image tag found for {repository}/claimed-{name}.') elif not check_only_numbers(image_tags[0]): From 33f72953872dd37ac18afec1fe442411859ef5bc Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Wed, 18 Oct 2023 14:29:00 +0200 Subject: [PATCH 077/177] Moved templates to c3 dir --- src/{ => c3}/templates/__init__.py | 0 src/{ => c3}/templates/component_setup_code.py | 0 src/{ => c3}/templates/cos_grid_wrapper_template.py | 0 src/{ => c3}/templates/dockerfile_template | 0 src/{ => c3}/templates/grid_wrapper_template.py | 0 src/{ => c3}/templates/gw_component_setup_code.py | 0 src/{ => c3}/templates/kfp_component_template.yaml | 0 src/{ => c3}/templates/kubernetes_job_template.job.yaml | 0 8 files changed, 0 insertions(+), 0 deletions(-) rename src/{ => c3}/templates/__init__.py (100%) rename src/{ => c3}/templates/component_setup_code.py (100%) rename src/{ => c3}/templates/cos_grid_wrapper_template.py (100%) rename src/{ => c3}/templates/dockerfile_template (100%) rename src/{ => c3}/templates/grid_wrapper_template.py (100%) rename src/{ => c3}/templates/gw_component_setup_code.py (100%) rename src/{ => c3}/templates/kfp_component_template.yaml (100%) rename src/{ => c3}/templates/kubernetes_job_template.job.yaml (100%) diff --git a/src/templates/__init__.py b/src/c3/templates/__init__.py similarity index 100% rename from src/templates/__init__.py rename to src/c3/templates/__init__.py diff --git a/src/templates/component_setup_code.py b/src/c3/templates/component_setup_code.py similarity index 100% rename from src/templates/component_setup_code.py rename to src/c3/templates/component_setup_code.py diff --git a/src/templates/cos_grid_wrapper_template.py b/src/c3/templates/cos_grid_wrapper_template.py similarity index 100% rename from src/templates/cos_grid_wrapper_template.py rename to src/c3/templates/cos_grid_wrapper_template.py diff --git a/src/templates/dockerfile_template b/src/c3/templates/dockerfile_template similarity index 100% rename from src/templates/dockerfile_template rename to src/c3/templates/dockerfile_template diff --git a/src/templates/grid_wrapper_template.py b/src/c3/templates/grid_wrapper_template.py similarity index 100% rename from src/templates/grid_wrapper_template.py rename to src/c3/templates/grid_wrapper_template.py diff --git a/src/templates/gw_component_setup_code.py b/src/c3/templates/gw_component_setup_code.py similarity index 100% rename from src/templates/gw_component_setup_code.py rename to src/c3/templates/gw_component_setup_code.py diff --git a/src/templates/kfp_component_template.yaml b/src/c3/templates/kfp_component_template.yaml similarity index 100% rename from src/templates/kfp_component_template.yaml rename to src/c3/templates/kfp_component_template.yaml diff --git a/src/templates/kubernetes_job_template.job.yaml b/src/c3/templates/kubernetes_job_template.job.yaml similarity index 100% rename from src/templates/kubernetes_job_template.job.yaml rename to src/c3/templates/kubernetes_job_template.job.yaml From 2991d4d8d6f56048d0fd7edbf9e8ebdf4d98c165 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Wed, 18 Oct 2023 14:29:40 +0200 Subject: [PATCH 078/177] Removed old files --- src/build/lib/c3/compiler.py | 13 - src/c3/compiler.py | 13 - src/c3/create_component_library.ipynb | 749 -------------------------- src/setup.py | 27 - 4 files changed, 802 deletions(-) delete mode 100644 src/build/lib/c3/compiler.py delete mode 100644 src/c3/compiler.py delete mode 100644 src/c3/create_component_library.ipynb delete mode 100644 
src/setup.py diff --git a/src/build/lib/c3/compiler.py b/src/build/lib/c3/compiler.py deleted file mode 100644 index 88809def..00000000 --- a/src/build/lib/c3/compiler.py +++ /dev/null @@ -1,13 +0,0 @@ -import subprocess - -def main(): - try: - #output = subprocess.check_output('pwd', shell=True, universal_newlines=True) - output = subprocess.check_output('ipython generate_kfp_component.ipynb', shell=True, universal_newlines=True) - print(output) - except subprocess.CalledProcessError as e: - print(f"Error executing command: {e}") - - -if __name__ == "__main__": - main() \ No newline at end of file diff --git a/src/c3/compiler.py b/src/c3/compiler.py deleted file mode 100644 index 88809def..00000000 --- a/src/c3/compiler.py +++ /dev/null @@ -1,13 +0,0 @@ -import subprocess - -def main(): - try: - #output = subprocess.check_output('pwd', shell=True, universal_newlines=True) - output = subprocess.check_output('ipython generate_kfp_component.ipynb', shell=True, universal_newlines=True) - print(output) - except subprocess.CalledProcessError as e: - print(f"Error executing command: {e}") - - -if __name__ == "__main__": - main() \ No newline at end of file diff --git a/src/c3/create_component_library.ipynb b/src/c3/create_component_library.ipynb deleted file mode 100644 index 4322b520..00000000 --- a/src/c3/create_component_library.ipynb +++ /dev/null @@ -1,749 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": 3, - "id": "9c7ce914", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Requirement already satisfied: ipython in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (8.14.0)\n", - "Requirement already satisfied: nbformat in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (5.9.1)\n", - "Requirement already satisfied: backcall in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from ipython) (0.2.0)\n", - "Requirement already satisfied: decorator in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from ipython) (5.1.1)\n", - "Requirement already satisfied: jedi>=0.16 in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from ipython) (0.18.2)\n", - "Requirement already satisfied: matplotlib-inline in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from ipython) (0.1.6)\n", - "Requirement already satisfied: pickleshare in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from ipython) (0.7.5)\n", - "Requirement already satisfied: prompt-toolkit!=3.0.37,<3.1.0,>=3.0.30 in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from ipython) (3.0.39)\n", - "Requirement already satisfied: pygments>=2.4.0 in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from ipython) (2.15.1)\n", - "Requirement already satisfied: stack-data in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from ipython) (0.6.2)\n", - "Requirement already satisfied: traitlets>=5 in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from ipython) (5.9.0)\n", - "Requirement already satisfied: pexpect>4.3 in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from ipython) (4.8.0)\n", - "Requirement already satisfied: fastjsonschema in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from nbformat) (2.17.1)\n", - "Requirement already satisfied: jsonschema>=2.6 in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from nbformat) (4.17.3)\n", - "Requirement 
already satisfied: jupyter-core in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from nbformat) (5.3.1)\n", - "Requirement already satisfied: parso<0.9.0,>=0.8.0 in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from jedi>=0.16->ipython) (0.8.3)\n", - "Requirement already satisfied: attrs>=17.4.0 in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from jsonschema>=2.6->nbformat) (23.1.0)\n", - "Requirement already satisfied: pyrsistent!=0.17.0,!=0.17.1,!=0.17.2,>=0.14.0 in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from jsonschema>=2.6->nbformat) (0.19.3)\n", - "Requirement already satisfied: ptyprocess>=0.5 in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from pexpect>4.3->ipython) (0.7.0)\n", - "Requirement already satisfied: wcwidth in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from prompt-toolkit!=3.0.37,<3.1.0,>=3.0.30->ipython) (0.2.6)\n", - "Requirement already satisfied: platformdirs>=2.5 in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from jupyter-core->nbformat) (3.8.1)\n", - "Requirement already satisfied: executing>=1.2.0 in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from stack-data->ipython) (1.2.0)\n", - "Requirement already satisfied: asttokens>=2.1.0 in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from stack-data->ipython) (2.2.1)\n", - "Requirement already satisfied: pure-eval in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from stack-data->ipython) (0.2.2)\n", - "Requirement already satisfied: six in /home/romeokienzler/gitco/c3/.venv/lib/python3.10/site-packages (from asttokens>=2.1.0->stack-data->ipython) (1.16.0)\n", - "\n", - "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.1.2\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.2\u001b[0m\n", - "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n" - ] - } - ], - "source": [ - "!pip install ipython nbformat" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a972c366-03a3-4d79-b917-01592f594eac", - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "import shutil" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "728ce188-e84a-4e5b-953b-cff00c95d8d8", - "metadata": {}, - "outputs": [], - "source": [ - "os.scandir('../component-library/')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "528e6ac7-efd8-4b5e-9da5-9cc971b9b4b9", - "metadata": { - "scrolled": true, - "tags": [] - }, - "outputs": [], - "source": [ - "%%bash\n", - "export version=0.1i\n", - "for file in `find ../component-library/ -name \"*.ipynb\" |grep -vi test |grep -v checkpoints`\n", - "do \n", - " ipython generate_kfp_component.ipynb $file $version 2>> log.txt >> log.txt\n", - " echo \"Status:\"$file:$? 
>> log.txt\n", - "done" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5303c806-b63a-4392-b97f-5bb962ae8f4e", - "metadata": { - "scrolled": true, - "tags": [] - }, - "outputs": [], - "source": [ - "%%bash\n", - "export version=0.2n\n", - "ipython generate_kfp_component.ipynb ../../component-library/component-library/input/input-url.ipynb $version\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ced8585b-2208-4b77-b4ea-c629da4c5834", - "metadata": { - "scrolled": true, - "tags": [] - }, - "outputs": [], - "source": [ - "%%bash\n", - "export version=0.2m\n", - "ipython generate_kfp_component.ipynb ../component-library/transform/spark-json-to-parquet.ipynb $version\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fe9ecd44-7ba2-4077-918b-5b369e6da32c", - "metadata": { - "scrolled": true, - "tags": [] - }, - "outputs": [], - "source": [ - "%%bash\n", - "export version=0.2n\n", - "ipython generate_kfp_component.ipynb ../component-library/output/upload-to-cos.ipynb $version\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 55, - "id": "2b7a758f-a293-4fa2-8e3f-a0e8557369b9", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "2023-07-20 19:24:35,109 - root - INFO - Logging parameters: notebook_path=\"../../../component-library/component-library/util/util-cos.ipynb\"version=\"0.32\"repository=\"docker.io/romeokienzler\"\n", - "2023-07-20 19:24:35,110 - root - INFO - Parameter: notebook_path=\"../../../component-library/component-library/util/util-cos.ipynb\"\n", - "2023-07-20 19:24:35,110 - root - INFO - Parameter: version=\"0.32\"\n", - "2023-07-20 19:24:35,110 - root - INFO - Parameter: repository=\"docker.io/romeokienzler\"\n", - "util-cos\n", - "This component provides COS utility functions (e.g. 
creating a bucket, listing contents of a bucket)\n", - " CLAIMED v0.32\n", - "{'access_key_id': {'description': 'access key id', 'type': 'String', 'default': None}, 'secret_access_key': {'description': 'secret access key', 'type': 'String', 'default': None}, 'endpoint': {'description': 'cos/s3 endpoint', 'type': 'String', 'default': None}, 'bucket_name': {'description': 'cos bucket name', 'type': 'String', 'default': None}, 'path': {'description': 'path', 'type': 'String', 'default': \"''\"}, 'source': {'description': 'source in case of uploads', 'type': 'String', 'default': \" ''\"}, 'target': {'description': 'target in case of downloads', 'type': 'String', 'default': \" ''\"}, 'recursive': {'description': 'recursive', 'type': 'Boolean', 'default': \"'False'\"}, 'operation': {'description': 'operation (mkdir|ls|find|get|put|rm|sync_to_cos|sync_to_local|glob)', 'type': 'String', 'default': None}, 'log_level': {'description': 'log level', 'type': 'String', 'default': \" 'INFO'\"}}\n", - "{}\n", - "['pip install aiobotocore botocore s3fs']\n", - "../../../component-library/component-library/util/util-cos.ipynb\n", - "\n", - "FROM registry.access.redhat.com/ubi8/python-39 \n", - "USER root\n", - "RUN dnf install -y java-11-openjdk\n", - "USER default\n", - "RUN pip install ipython==8.6.0 nbformat==5.7.0\n", - "RUN pip install aiobotocore botocore s3fs\n", - "ADD util-cos.ipynb /opt/app-root/src/\n", - "ADD util-cos.ipynb /opt/app-root/src/\n", - "USER root\n", - "RUN chmod -R 777 /opt/app-root/src/\n", - "USER default\n", - "CMD [\"ipython\", \"/opt/app-root/src/util-cos.ipynb\"]\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "#1 [internal] load .dockerignore\n", - "#1 transferring context:\n", - "#1 transferring context: 2B done\n", - "#1 DONE 0.0s\n", - "\n", - "#2 [internal] load build definition from Dockerfile\n", - "#2 transferring dockerfile: 482B done\n", - "#2 DONE 0.1s\n", - "\n", - "#3 [internal] load metadata for registry.access.redhat.com/ubi8/python-39:latest\n", - "#3 DONE 0.0s\n", - "\n", - "#4 [internal] load build context\n", - "#4 DONE 0.0s\n", - "\n", - "#5 [1/7] FROM registry.access.redhat.com/ubi8/python-39\n", - "#5 DONE 0.0s\n", - "\n", - "#4 [internal] load build context\n", - "#4 transferring context: 8.74kB done\n", - "#4 DONE 0.0s\n", - "\n", - "#6 [3/7] RUN pip install ipython==8.6.0 nbformat==5.7.0\n", - "#6 CACHED\n", - "\n", - "#7 [2/7] RUN dnf install -y java-11-openjdk\n", - "#7 CACHED\n", - "\n", - "#8 [4/7] RUN pip install aiobotocore botocore s3fs\n", - "#8 CACHED\n", - "\n", - "#9 [5/7] ADD util-cos.ipynb /opt/app-root/src/\n", - "#9 DONE 0.2s\n", - "\n", - "#10 [6/7] ADD util-cos.ipynb /opt/app-root/src/\n", - "#10 DONE 0.2s\n", - "\n", - "#11 [7/7] RUN chmod -R 777 /opt/app-root/src/\n", - "#11 DONE 0.7s\n", - "\n", - "#12 exporting to image\n", - "#12 exporting layers\n", - "#12 exporting layers 5.9s done\n", - "#12 writing image sha256:43801d41bd756a495aa85fd975efa79e1bc06fe857fcbcc813fe8a73a2b5bf6c done\n", - "#12 naming to docker.io/library/claimed-util-cos:0.32 0.0s done\n", - "#12 DONE 5.9s\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "The push refers to repository [docker.io/romeokienzler/claimed-util-cos]\n", - "44f526e9989f: Preparing\n", - "5f70bf18a086: Preparing\n", - "e2c24570dcf1: Preparing\n", - "a38988a799dd: Preparing\n", - "151382e656b8: Preparing\n", - "2d89553fcdef: Preparing\n", - "3568498d40ea: Preparing\n", - "e813c91400f3: Preparing\n", - "fb6a7cccdb84: Preparing\n", - 
"b51194abfc91: Preparing\n", - "3568498d40ea: Waiting\n", - "e813c91400f3: Waiting\n", - "fb6a7cccdb84: Waiting\n", - "2d89553fcdef: Waiting\n", - "5f70bf18a086: Layer already exists\n", - "151382e656b8: Layer already exists\n", - "a38988a799dd: Layer already exists\n", - "3568498d40ea: Layer already exists\n", - "2d89553fcdef: Layer already exists\n", - "e813c91400f3: Layer already exists\n", - "b51194abfc91: Layer already exists\n", - "fb6a7cccdb84: Layer already exists\n", - "44f526e9989f: Pushed\n", - "e2c24570dcf1: Pushed\n", - "0.32: digest: sha256:d3d22874c39ff5273a50b0216a0e72049291ce953521497f3a1c76b6e21713b7 size: 2425\n", - "The push refers to repository [docker.io/romeokienzler/claimed-util-cos]\n", - "44f526e9989f: Preparing\n", - "5f70bf18a086: Preparing\n", - "e2c24570dcf1: Preparing\n", - "a38988a799dd: Preparing\n", - "151382e656b8: Preparing\n", - "2d89553fcdef: Preparing\n", - "3568498d40ea: Preparing\n", - "e813c91400f3: Preparing\n", - "fb6a7cccdb84: Preparing\n", - "b51194abfc91: Preparing\n", - "3568498d40ea: Waiting\n", - "e813c91400f3: Waiting\n", - "fb6a7cccdb84: Waiting\n", - "b51194abfc91: Waiting\n", - "2d89553fcdef: Waiting\n", - "e2c24570dcf1: Layer already exists\n", - "5f70bf18a086: Layer already exists\n", - "151382e656b8: Layer already exists\n", - "a38988a799dd: Layer already exists\n", - "44f526e9989f: Layer already exists\n", - "2d89553fcdef: Layer already exists\n", - "e813c91400f3: Layer already exists\n", - "3568498d40ea: Layer already exists\n", - "fb6a7cccdb84: Layer already exists\n", - "b51194abfc91: Layer already exists\n", - "latest: digest: sha256:d3d22874c39ff5273a50b0216a0e72049291ce953521497f3a1c76b6e21713b7 size: 2425\n", - "name: util-cos\n", - "description: This component provides COS utility functions (e.g. 
creating a bucket, listing contents of a bucket)\n", - " CLAIMED v0.32\n", - "\n", - "inputs:\n", - "- {name: access_key_id, type: String, description: access key id}\n", - "- {name: secret_access_key, type: String, description: secret access key}\n", - "- {name: endpoint, type: String, description: cos/s3 endpoint}\n", - "- {name: bucket_name, type: String, description: cos bucket name}\n", - "- {name: path, type: String, description: path, default: ''}\n", - "- {name: source, type: String, description: source in case of uploads, default: ''}\n", - "- {name: target, type: String, description: target in case of downloads, default: ''}\n", - "- {name: recursive, type: Boolean, description: recursive, default: 'False'}\n", - "- {name: operation, type: String, description: operation (mkdir|ls|find|get|put|rm|sync_to_cos|sync_to_local|glob)}\n", - "- {name: log_level, type: String, description: log level, default: 'INFO'}\n", - "\n", - "\n", - "implementation:\n", - " container:\n", - " image: romeokienzler/claimed-util-cos:0.32\n", - " command:\n", - " - sh\n", - " - -ec\n", - " - |\n", - " ipython ./util-cos.ipynb access_key_id=\"$0\" secret_access_key=\"$1\" endpoint=\"$2\" bucket_name=\"$3\" path=\"$4\" source=\"$5\" target=\"$6\" recursive=\"$7\" operation=\"$8\" log_level=\"$9\" \n", - " - {inputValue: access_key_id}\n", - " - {inputValue: secret_access_key}\n", - " - {inputValue: endpoint}\n", - " - {inputValue: bucket_name}\n", - " - {inputValue: path}\n", - " - {inputValue: source}\n", - " - {inputValue: target}\n", - " - {inputValue: recursive}\n", - " - {inputValue: operation}\n", - " - {inputValue: log_level}\n", - "\n", - "apiVersion: batch/v1\n", - "kind: Job\n", - "metadata:\n", - " name: util-cos\n", - "spec:\n", - " template:\n", - " spec:\n", - " containers:\n", - " - name: util-cos\n", - " image: docker.io/romeokienzler/claimed-util-cos:0.32\n", - " command: [\"/opt/app-root/bin/ipython\",\"/opt/app-root/src/util-cos.ipynb\"]\n", - " env:\n", - " - name: access_key_id\n", - " value: value_of_access_key_id\n", - " - name: secret_access_key\n", - " value: value_of_secret_access_key\n", - " - name: endpoint\n", - " value: value_of_endpoint\n", - " - name: bucket_name\n", - " value: value_of_bucket_name\n", - " - name: path\n", - " value: value_of_path\n", - " - name: source\n", - " value: value_of_source\n", - " - name: target\n", - " value: value_of_target\n", - " - name: recursive\n", - " value: value_of_recursive\n", - " - name: operation\n", - " value: value_of_operation\n", - " - name: log_level\n", - " value: value_of_log_level\n", - " restartPolicy: OnFailure\n" - ] - } - ], - "source": [ - "%%bash\n", - "export version=0.32\n", - "ipython generate_kfp_component.ipynb notebook_path=../../../component-library/component-library/util/util-cos.ipynb version=$version repository=docker.io/romeokienzler" - ] - }, - { - "cell_type": "code", - "execution_count": 50, - "id": "dc39195e", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "2023-07-20 18:27:17,441 - root - INFO - Logging parameters: repository=\"docker.io/romeokienzler\"\n", - "2023-07-20 18:27:17,441 - root - INFO - Parameter: repository=\"docker.io/romeokienzler\"\n", - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n", - "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)\n", - "Cell \u001b[0;32mIn[1], line 1\u001b[0m\n", - "\u001b[0;32m----> 1\u001b[0m nb \u001b[38;5;241m=\u001b[39m 
Notebook(\u001b[43mnotebook_path\u001b[49m)\n", - "\n", - "\u001b[0;31mNameError\u001b[0m: name 'notebook_path' is not defined\n" - ] - }, - { - "ename": "CalledProcessError", - "evalue": "Command 'b'export version=0.34\\nipython generate_kfp_component.ipynb ../../workflows-and-operators/operators/hls_remove_clouds.ipynb $version repository=docker.io/romeokienzler\\n'' returned non-zero exit status 1.", - "output_type": "error", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mCalledProcessError\u001b[0m Traceback (most recent call last)", - "Cell \u001b[0;32mIn[50], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m get_ipython()\u001b[39m.\u001b[39;49mrun_cell_magic(\u001b[39m'\u001b[39;49m\u001b[39mbash\u001b[39;49m\u001b[39m'\u001b[39;49m, \u001b[39m'\u001b[39;49m\u001b[39m'\u001b[39;49m, \u001b[39m'\u001b[39;49m\u001b[39mexport version=0.34\u001b[39;49m\u001b[39m\\n\u001b[39;49;00m\u001b[39mipython generate_kfp_component.ipynb ../../workflows-and-operators/operators/hls_remove_clouds.ipynb $version repository=docker.io/romeokienzler\u001b[39;49m\u001b[39m\\n\u001b[39;49;00m\u001b[39m'\u001b[39;49m)\n", - "File \u001b[0;32m~/gitco/c3/.venv/lib64/python3.10/site-packages/IPython/core/interactiveshell.py:2478\u001b[0m, in \u001b[0;36mInteractiveShell.run_cell_magic\u001b[0;34m(self, magic_name, line, cell)\u001b[0m\n\u001b[1;32m 2476\u001b[0m \u001b[39mwith\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mbuiltin_trap:\n\u001b[1;32m 2477\u001b[0m args \u001b[39m=\u001b[39m (magic_arg_s, cell)\n\u001b[0;32m-> 2478\u001b[0m result \u001b[39m=\u001b[39m fn(\u001b[39m*\u001b[39;49margs, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwargs)\n\u001b[1;32m 2480\u001b[0m \u001b[39m# The code below prevents the output from being displayed\u001b[39;00m\n\u001b[1;32m 2481\u001b[0m \u001b[39m# when using magics with decodator @output_can_be_silenced\u001b[39;00m\n\u001b[1;32m 2482\u001b[0m \u001b[39m# when the last Python token in the expression is a ';'.\u001b[39;00m\n\u001b[1;32m 2483\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39mgetattr\u001b[39m(fn, magic\u001b[39m.\u001b[39mMAGIC_OUTPUT_CAN_BE_SILENCED, \u001b[39mFalse\u001b[39;00m):\n", - "File \u001b[0;32m~/gitco/c3/.venv/lib64/python3.10/site-packages/IPython/core/magics/script.py:154\u001b[0m, in \u001b[0;36mScriptMagics._make_script_magic..named_script_magic\u001b[0;34m(line, cell)\u001b[0m\n\u001b[1;32m 152\u001b[0m \u001b[39melse\u001b[39;00m:\n\u001b[1;32m 153\u001b[0m line \u001b[39m=\u001b[39m script\n\u001b[0;32m--> 154\u001b[0m \u001b[39mreturn\u001b[39;00m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mshebang(line, cell)\n", - "File \u001b[0;32m~/gitco/c3/.venv/lib64/python3.10/site-packages/IPython/core/magics/script.py:314\u001b[0m, in \u001b[0;36mScriptMagics.shebang\u001b[0;34m(self, line, cell)\u001b[0m\n\u001b[1;32m 309\u001b[0m \u001b[39mif\u001b[39;00m args\u001b[39m.\u001b[39mraise_error \u001b[39mand\u001b[39;00m p\u001b[39m.\u001b[39mreturncode \u001b[39m!=\u001b[39m \u001b[39m0\u001b[39m:\n\u001b[1;32m 310\u001b[0m \u001b[39m# If we get here and p.returncode is still None, we must have\u001b[39;00m\n\u001b[1;32m 311\u001b[0m \u001b[39m# killed it but not yet seen its return code. We don't wait for it,\u001b[39;00m\n\u001b[1;32m 312\u001b[0m \u001b[39m# in case it's stuck in uninterruptible sleep. 
-9 = SIGKILL\u001b[39;00m\n\u001b[1;32m 313\u001b[0m rc \u001b[39m=\u001b[39m p\u001b[39m.\u001b[39mreturncode \u001b[39mor\u001b[39;00m \u001b[39m-\u001b[39m\u001b[39m9\u001b[39m\n\u001b[0;32m--> 314\u001b[0m \u001b[39mraise\u001b[39;00m CalledProcessError(rc, cell)\n", - "\u001b[0;31mCalledProcessError\u001b[0m: Command 'b'export version=0.34\\nipython generate_kfp_component.ipynb ../../workflows-and-operators/operators/hls_remove_clouds.ipynb $version repository=docker.io/romeokienzler\\n'' returned non-zero exit status 1." - ] - } - ], - "source": [ - "%%bash\n", - "export version=0.34\n", - "ipython generate_kfp_component.ipynb ../../workflows-and-operators/operators/hls_remove_clouds.ipynb $version repository=docker.io/romeokienzler\n" - ] - }, - { - "cell_type": "code", - "execution_count": 62, - "id": "1df26cfb", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "2023-07-21 11:11:41,617 - root - INFO - Logging parameters: notebook_path=\"../../../workflows-and-operators/operators/cgw-hls-remove-clouds.ipynb\"version=\"0.30\"repository=\"docker.io/romeokienzler\"additionl_files=\"../../../workflows-and-operators/operators/hls_remove_clouds.ipynb\"\n", - "2023-07-21 11:11:41,618 - root - INFO - Parameter: notebook_path=\"../../../workflows-and-operators/operators/cgw-hls-remove-clouds.ipynb\"\n", - "2023-07-21 11:11:41,618 - root - INFO - Parameter: version=\"0.30\"\n", - "2023-07-21 11:11:41,618 - root - INFO - Parameter: repository=\"docker.io/romeokienzler\"\n", - "2023-07-21 11:11:41,619 - root - INFO - Parameter: additionl_files=\"../../../workflows-and-operators/operators/hls_remove_clouds.ipynb\"\n", - "ccgw-hls-remove-clouds\n", - "hls_remove_clouds got wrappted by cos_grid_wrapper, which wraps any CLAIMED component and implements the generic grid computing pattern https://romeokienzler.medium.com/the-generic-grid-computing-pattern-transforms-any-sequential-workflow-step-into-a-transient-grid-c7f3ca7459c8 \n", - " CLAIMED v0.30\n", - "{'cgw_source_path': {'description': 'cos path to get job (files) from (including bucket)', 'type': 'String', 'default': None}, 'cgw_source_access_key_id': {'description': 'cgw_source_access_key_id', 'type': 'String', 'default': None}, 'cgw_source_secret_access_key': {'description': 'source_secret_access_key', 'type': 'String', 'default': None}, 'cgw_source_endpoint': {'description': 'source_endpoint', 'type': 'String', 'default': None}, 'cgw_target_access_key_id': {'description': 'cgw_target_access_key_id', 'type': 'String', 'default': None}, 'cgw_target_secret_access_key': {'description': 'cgw_target_secret_access_key', 'type': 'String', 'default': None}, 'cgw_target_endpoint': {'description': 'cgw_target_endpoint', 'type': 'String', 'default': None}, 'cgw_target_path': {'description': 'cgw_target_path (including bucket)', 'type': 'String', 'default': None}, 'cgw_lock_file_suffix': {'description': 'lock file suffix', 'type': 'String', 'default': \" '.lock'\"}, 'cgw_processed_file_suffix': {'description': 'processed file suffix', 'type': 'String', 'default': \" '.processed'\"}, 'cgw_log_level': {'description': 'log level', 'type': 'String', 'default': \" 'INFO'\"}, 'cgw_lock_timeout': {'description': 'timeout in seconds to remove lock file from struggling job (default 1 hour)', 'type': 'Integer', 'default': ' 60*60'}, 'cgw_group_by': {'description': 'group files which need to be processed together', 'type': 'String', 'default': ' None'}, 'satellite': {'description': 'satellite', 'type': 'String', 
'default': \"'HLS.L30'\"}}\n", - "{}\n", - "['pip install xarray matplotlib geopandas rioxarray numpy shapely rasterio pyproj ipython dask distributed jinja2 bokeh ipython nbformat aiobotocore botocore s3fs']\n", - "../../../workflows-and-operators/operators/cgw-hls-remove-clouds.ipynb\n", - "\n", - "FROM registry.access.redhat.com/ubi8/python-39 \n", - "USER root\n", - "RUN dnf install -y java-11-openjdk\n", - "USER default\n", - "RUN pip install ipython==8.6.0 nbformat==5.7.0\n", - "RUN pip install xarray matplotlib geopandas rioxarray numpy shapely rasterio pyproj ipython dask distributed jinja2 bokeh ipython nbformat aiobotocore botocore s3fs\n", - "ADD hls_remove_clouds.ipynb /opt/app-root/src/\n", - "ADD cgw-hls-remove-clouds.ipynb /opt/app-root/src/\n", - "USER root\n", - "RUN chmod -R 777 /opt/app-root/src/\n", - "USER default\n", - "CMD [\"ipython\", \"/opt/app-root/src/cgw-hls-remove-clouds.ipynb\"]\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "#1 [internal] load build definition from Dockerfile\n", - "#1 transferring dockerfile: 640B done\n", - "#1 DONE 0.0s\n", - "\n", - "#2 [internal] load .dockerignore\n", - "#2 transferring context: 2B done\n", - "#2 DONE 0.0s\n", - "\n", - "#3 [internal] load metadata for registry.access.redhat.com/ubi8/python-39:latest\n", - "#3 DONE 0.0s\n", - "\n", - "#4 [1/7] FROM registry.access.redhat.com/ubi8/python-39\n", - "#4 DONE 0.0s\n", - "\n", - "#5 [internal] load build context\n", - "#5 transferring context: 25.49kB done\n", - "#5 DONE 0.0s\n", - "\n", - "#6 [4/7] RUN pip install xarray matplotlib geopandas rioxarray numpy shapely rasterio pyproj ipython dask distributed jinja2 bokeh ipython nbformat aiobotocore botocore s3fs\n", - "#6 CACHED\n", - "\n", - "#7 [2/7] RUN dnf install -y java-11-openjdk\n", - "#7 CACHED\n", - "\n", - "#8 [3/7] RUN pip install ipython==8.6.0 nbformat==5.7.0\n", - "#8 CACHED\n", - "\n", - "#9 [5/7] ADD hls_remove_clouds.ipynb /opt/app-root/src/\n", - "#9 CACHED\n", - "\n", - "#10 [6/7] ADD cgw-hls-remove-clouds.ipynb /opt/app-root/src/\n", - "#10 DONE 0.2s\n", - "\n", - "#11 [7/7] RUN chmod -R 777 /opt/app-root/src/\n", - "#11 DONE 0.7s\n", - "\n", - "#12 exporting to image\n", - "#12 exporting layers\n", - "#12 exporting layers 7.0s done\n", - "#12 writing image sha256:2d5e087b09230262740a48e74c38b7204cb730091a81a0c7c8a2d8f7ef35a6eb done\n", - "#12 naming to docker.io/library/claimed-ccgw-hls-remove-clouds:0.30 0.0s done\n", - "#12 DONE 7.0s\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "The push refers to repository [docker.io/romeokienzler/claimed-ccgw-hls-remove-clouds]\n", - "8c83b3d3eb0f: Preparing\n", - "f01eaaffe9ea: Preparing\n", - "a43fa6775580: Preparing\n", - "b638398d394b: Preparing\n", - "151382e656b8: Preparing\n", - "2d89553fcdef: Preparing\n", - "3568498d40ea: Preparing\n", - "e813c91400f3: Preparing\n", - "fb6a7cccdb84: Preparing\n", - "b51194abfc91: Preparing\n", - "2d89553fcdef: Waiting\n", - "3568498d40ea: Waiting\n", - "fb6a7cccdb84: Waiting\n", - "151382e656b8: Layer already exists\n", - "a43fa6775580: Layer already exists\n", - "b638398d394b: Layer already exists\n", - "e813c91400f3: Layer already exists\n", - "2d89553fcdef: Layer already exists\n", - "3568498d40ea: Layer already exists\n", - "b51194abfc91: Layer already exists\n", - "fb6a7cccdb84: Layer already exists\n", - "f01eaaffe9ea: Pushed\n", - "8c83b3d3eb0f: Pushed\n", - "0.30: digest: sha256:61041d44168883c1b2ff07abb249832ce6eb6457f099e6e2e20faaca60f74cf3 size: 
2428\n", - "The push refers to repository [docker.io/romeokienzler/claimed-ccgw-hls-remove-clouds]\n", - "8c83b3d3eb0f: Preparing\n", - "f01eaaffe9ea: Preparing\n", - "a43fa6775580: Preparing\n", - "b638398d394b: Preparing\n", - "151382e656b8: Preparing\n", - "2d89553fcdef: Preparing\n", - "3568498d40ea: Preparing\n", - "e813c91400f3: Preparing\n", - "fb6a7cccdb84: Preparing\n", - "b51194abfc91: Preparing\n", - "e813c91400f3: Waiting\n", - "2d89553fcdef: Waiting\n", - "3568498d40ea: Waiting\n", - "fb6a7cccdb84: Waiting\n", - "b51194abfc91: Waiting\n", - "a43fa6775580: Layer already exists\n", - "f01eaaffe9ea: Layer already exists\n", - "151382e656b8: Layer already exists\n", - "b638398d394b: Layer already exists\n", - "8c83b3d3eb0f: Layer already exists\n", - "3568498d40ea: Layer already exists\n", - "2d89553fcdef: Layer already exists\n", - "fb6a7cccdb84: Layer already exists\n", - "e813c91400f3: Layer already exists\n", - "b51194abfc91: Layer already exists\n", - "latest: digest: sha256:61041d44168883c1b2ff07abb249832ce6eb6457f099e6e2e20faaca60f74cf3 size: 2428\n", - "name: ccgw-hls-remove-clouds\n", - "description: hls_remove_clouds got wrappted by cos_grid_wrapper, which wraps any CLAIMED component and implements the generic grid computing pattern https://romeokienzler.medium.com/the-generic-grid-computing-pattern-transforms-any-sequential-workflow-step-into-a-transient-grid-c7f3ca7459c8 \n", - " CLAIMED v0.30\n", - "\n", - "inputs:\n", - "- {name: cgw_source_path, type: String, description: cos path to get job (files) from (including bucket)}\n", - "- {name: cgw_source_access_key_id, type: String, description: cgw_source_access_key_id}\n", - "- {name: cgw_source_secret_access_key, type: String, description: source_secret_access_key}\n", - "- {name: cgw_source_endpoint, type: String, description: source_endpoint}\n", - "- {name: cgw_target_access_key_id, type: String, description: cgw_target_access_key_id}\n", - "- {name: cgw_target_secret_access_key, type: String, description: cgw_target_secret_access_key}\n", - "- {name: cgw_target_endpoint, type: String, description: cgw_target_endpoint}\n", - "- {name: cgw_target_path, type: String, description: cgw_target_path (including bucket)}\n", - "- {name: cgw_lock_file_suffix, type: String, description: lock file suffix, default: '.lock'}\n", - "- {name: cgw_processed_file_suffix, type: String, description: processed file suffix, default: '.processed'}\n", - "- {name: cgw_log_level, type: String, description: log level, default: 'INFO'}\n", - "- {name: cgw_lock_timeout, type: Integer, description: timeout in seconds to remove lock file from struggling job (default 1 hour), default: 60*60}\n", - "- {name: cgw_group_by, type: String, description: group files which need to be processed together, default: None}\n", - "- {name: satellite, type: String, description: satellite, default: 'HLS.L30'}\n", - "\n", - "\n", - "implementation:\n", - " container:\n", - " image: romeokienzler/claimed-ccgw-hls-remove-clouds:0.30\n", - " command:\n", - " - sh\n", - " - -ec\n", - " - |\n", - " ipython ./cgw-hls-remove-clouds.ipynb cgw_source_path=\"$0\" cgw_source_access_key_id=\"$1\" cgw_source_secret_access_key=\"$2\" cgw_source_endpoint=\"$3\" cgw_target_access_key_id=\"$4\" cgw_target_secret_access_key=\"$5\" cgw_target_endpoint=\"$6\" cgw_target_path=\"$7\" cgw_lock_file_suffix=\"$8\" cgw_processed_file_suffix=\"$9\" cgw_log_level=\"$10\" cgw_lock_timeout=\"$11\" cgw_group_by=\"$12\" satellite=\"$13\" \n", - " - {inputValue: cgw_source_path}\n", - " - 
{inputValue: cgw_source_access_key_id}\n", - " - {inputValue: cgw_source_secret_access_key}\n", - " - {inputValue: cgw_source_endpoint}\n", - " - {inputValue: cgw_target_access_key_id}\n", - " - {inputValue: cgw_target_secret_access_key}\n", - " - {inputValue: cgw_target_endpoint}\n", - " - {inputValue: cgw_target_path}\n", - " - {inputValue: cgw_lock_file_suffix}\n", - " - {inputValue: cgw_processed_file_suffix}\n", - " - {inputValue: cgw_log_level}\n", - " - {inputValue: cgw_lock_timeout}\n", - " - {inputValue: cgw_group_by}\n", - " - {inputValue: satellite}\n", - "\n", - "apiVersion: batch/v1\n", - "kind: Job\n", - "metadata:\n", - " name: ccgw-hls-remove-clouds\n", - "spec:\n", - " template:\n", - " spec:\n", - " containers:\n", - " - name: ccgw-hls-remove-clouds\n", - " image: docker.io/romeokienzler/claimed-ccgw-hls-remove-clouds:0.30\n", - " command: [\"/opt/app-root/bin/ipython\",\"/opt/app-root/src/cgw-hls-remove-clouds.ipynb\"]\n", - " env:\n", - " - name: cgw_source_path\n", - " value: value_of_cgw_source_path\n", - " - name: cgw_source_access_key_id\n", - " value: value_of_cgw_source_access_key_id\n", - " - name: cgw_source_secret_access_key\n", - " value: value_of_cgw_source_secret_access_key\n", - " - name: cgw_source_endpoint\n", - " value: value_of_cgw_source_endpoint\n", - " - name: cgw_target_access_key_id\n", - " value: value_of_cgw_target_access_key_id\n", - " - name: cgw_target_secret_access_key\n", - " value: value_of_cgw_target_secret_access_key\n", - " - name: cgw_target_endpoint\n", - " value: value_of_cgw_target_endpoint\n", - " - name: cgw_target_path\n", - " value: value_of_cgw_target_path\n", - " - name: cgw_lock_file_suffix\n", - " value: value_of_cgw_lock_file_suffix\n", - " - name: cgw_processed_file_suffix\n", - " value: value_of_cgw_processed_file_suffix\n", - " - name: cgw_log_level\n", - " value: value_of_cgw_log_level\n", - " - name: cgw_lock_timeout\n", - " value: value_of_cgw_lock_timeout\n", - " - name: cgw_group_by\n", - " value: value_of_cgw_group_by\n", - " - name: satellite\n", - " value: value_of_satellite\n", - " restartPolicy: OnFailure\n" - ] - } - ], - "source": [ - "%%bash\n", - "export version=0.30\n", - "ipython generate_kfp_component.ipynb notebook_path=../../../workflows-and-operators/operators/cgw-hls-remove-clouds.ipynb version=$version repository=docker.io/romeokienzler additionl_files=../../../workflows-and-operators/operators/hls_remove_clouds.ipynb" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "id": "d1fa8a81", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "build README.md src your_package_name.egg-info\n" - ] - } - ], - "source": [ - "!ls ../../\n" - ] - }, - { - "cell_type": "code", - "execution_count": 64, - "id": "70241e41", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "2023-07-21 17:32:01,822 - root - INFO - Logging parameters: notebook_path=\"../../workflows-and-operators/operators/planetdownloader.ipynb\"version=\"0.36\"repository=\"us.icr.io/geodn\"\n", - "2023-07-21 17:32:01,823 - root - INFO - Parameter: notebook_path=\"../../workflows-and-operators/operators/planetdownloader.ipynb\"\n", - "2023-07-21 17:32:01,823 - root - INFO - Parameter: version=\"0.36\"\n", - "2023-07-21 17:32:01,823 - root - INFO - Parameter: repository=\"us.icr.io/geodn\"\n", - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n", - "\u001b[0;31mFileNotFoundError\u001b[0m Traceback 
(most recent call last)\n", - "Cell \u001b[0;32mIn[1], line 1\u001b[0m\n", - "\u001b[0;32m----> 1\u001b[0m nb \u001b[38;5;241m=\u001b[39m \u001b[43mNotebook\u001b[49m\u001b[43m(\u001b[49m\u001b[43mnotebook_path\u001b[49m\u001b[43m)\u001b[49m\n", - "\n", - "File \u001b[0;32m~/gitco/c3/src/c3/notebook.py:8\u001b[0m, in \u001b[0;36mNotebook.__init__\u001b[0;34m(self, path)\u001b[0m\n", - "\u001b[1;32m 6\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21m__init__\u001b[39m(\u001b[38;5;28mself\u001b[39m, path):\n", - "\u001b[1;32m 7\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mpath \u001b[38;5;241m=\u001b[39m path\n", - "\u001b[0;32m----> 8\u001b[0m \u001b[38;5;28;01mwith\u001b[39;00m \u001b[38;5;28;43mopen\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43mpath\u001b[49m\u001b[43m)\u001b[49m \u001b[38;5;28;01mas\u001b[39;00m json_file:\n", - "\u001b[1;32m 9\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mnotebook \u001b[38;5;241m=\u001b[39m json\u001b[38;5;241m.\u001b[39mload(json_file)\n", - "\u001b[1;32m 10\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mname \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mnotebook[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mcells\u001b[39m\u001b[38;5;124m'\u001b[39m][\u001b[38;5;241m0\u001b[39m][\u001b[38;5;124m'\u001b[39m\u001b[38;5;124msource\u001b[39m\u001b[38;5;124m'\u001b[39m][\u001b[38;5;241m0\u001b[39m]\u001b[38;5;241m.\u001b[39mreplace(\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m#\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;124m'\u001b[39m\u001b[38;5;124m'\u001b[39m)\u001b[38;5;241m.\u001b[39mstrip()\n", - "\n", - "\u001b[0;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: '../../workflows-and-operators/operators/planetdownloader.ipynb'\n" - ] - }, - { - "ename": "CalledProcessError", - "evalue": "Command 'b'export version=0.36\\nipython generate_kfp_component.ipynb notebook_path=../../workflows-and-operators/operators/planetdownloader.ipynb version=$version repository=us.icr.io/geodn\\n'' returned non-zero exit status 1.", - "output_type": "error", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mCalledProcessError\u001b[0m Traceback (most recent call last)", - "Cell \u001b[0;32mIn[64], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m get_ipython()\u001b[39m.\u001b[39;49mrun_cell_magic(\u001b[39m'\u001b[39;49m\u001b[39mbash\u001b[39;49m\u001b[39m'\u001b[39;49m, \u001b[39m'\u001b[39;49m\u001b[39m'\u001b[39;49m, \u001b[39m'\u001b[39;49m\u001b[39mexport version=0.36\u001b[39;49m\u001b[39m\\n\u001b[39;49;00m\u001b[39mipython generate_kfp_component.ipynb notebook_path=../../workflows-and-operators/operators/planetdownloader.ipynb version=$version repository=us.icr.io/geodn\u001b[39;49m\u001b[39m\\n\u001b[39;49;00m\u001b[39m'\u001b[39;49m)\n", - "File \u001b[0;32m~/gitco/c3/.venv/lib64/python3.10/site-packages/IPython/core/interactiveshell.py:2478\u001b[0m, in \u001b[0;36mInteractiveShell.run_cell_magic\u001b[0;34m(self, magic_name, line, cell)\u001b[0m\n\u001b[1;32m 2476\u001b[0m \u001b[39mwith\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mbuiltin_trap:\n\u001b[1;32m 2477\u001b[0m args \u001b[39m=\u001b[39m (magic_arg_s, cell)\n\u001b[0;32m-> 2478\u001b[0m result \u001b[39m=\u001b[39m fn(\u001b[39m*\u001b[39;49margs, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwargs)\n\u001b[1;32m 2480\u001b[0m \u001b[39m# The code below prevents the output from 
being displayed\u001b[39;00m\n\u001b[1;32m 2481\u001b[0m \u001b[39m# when using magics with decodator @output_can_be_silenced\u001b[39;00m\n\u001b[1;32m 2482\u001b[0m \u001b[39m# when the last Python token in the expression is a ';'.\u001b[39;00m\n\u001b[1;32m 2483\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39mgetattr\u001b[39m(fn, magic\u001b[39m.\u001b[39mMAGIC_OUTPUT_CAN_BE_SILENCED, \u001b[39mFalse\u001b[39;00m):\n", - "File \u001b[0;32m~/gitco/c3/.venv/lib64/python3.10/site-packages/IPython/core/magics/script.py:154\u001b[0m, in \u001b[0;36mScriptMagics._make_script_magic..named_script_magic\u001b[0;34m(line, cell)\u001b[0m\n\u001b[1;32m 152\u001b[0m \u001b[39melse\u001b[39;00m:\n\u001b[1;32m 153\u001b[0m line \u001b[39m=\u001b[39m script\n\u001b[0;32m--> 154\u001b[0m \u001b[39mreturn\u001b[39;00m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mshebang(line, cell)\n", - "File \u001b[0;32m~/gitco/c3/.venv/lib64/python3.10/site-packages/IPython/core/magics/script.py:314\u001b[0m, in \u001b[0;36mScriptMagics.shebang\u001b[0;34m(self, line, cell)\u001b[0m\n\u001b[1;32m 309\u001b[0m \u001b[39mif\u001b[39;00m args\u001b[39m.\u001b[39mraise_error \u001b[39mand\u001b[39;00m p\u001b[39m.\u001b[39mreturncode \u001b[39m!=\u001b[39m \u001b[39m0\u001b[39m:\n\u001b[1;32m 310\u001b[0m \u001b[39m# If we get here and p.returncode is still None, we must have\u001b[39;00m\n\u001b[1;32m 311\u001b[0m \u001b[39m# killed it but not yet seen its return code. We don't wait for it,\u001b[39;00m\n\u001b[1;32m 312\u001b[0m \u001b[39m# in case it's stuck in uninterruptible sleep. -9 = SIGKILL\u001b[39;00m\n\u001b[1;32m 313\u001b[0m rc \u001b[39m=\u001b[39m p\u001b[39m.\u001b[39mreturncode \u001b[39mor\u001b[39;00m \u001b[39m-\u001b[39m\u001b[39m9\u001b[39m\n\u001b[0;32m--> 314\u001b[0m \u001b[39mraise\u001b[39;00m CalledProcessError(rc, cell)\n", - "\u001b[0;31mCalledProcessError\u001b[0m: Command 'b'export version=0.36\\nipython generate_kfp_component.ipynb notebook_path=../../workflows-and-operators/operators/planetdownloader.ipynb version=$version repository=us.icr.io/geodn\\n'' returned non-zero exit status 1." 
- ] - } - ], - "source": [ - "%%bash\n", - "export version=0.36\n", - "ipython generate_kfp_component.ipynb notebook_path=../../../workflows-and-operators/operators/planetdownloader.ipynb version=$version repository=us.icr.io/geodn\n" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "id": "7da81684", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "/home/romeokienzler/gitco/c3/src/c3\n" - ] - } - ], - "source": [ - "!pwd\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5c156cb8", - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.11" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/src/setup.py b/src/setup.py deleted file mode 100644 index 67226c0a..00000000 --- a/src/setup.py +++ /dev/null @@ -1,27 +0,0 @@ -from setuptools import setup, find_packages - -setup( - name='c3', - version='0.1.0', - author='The CLAIMED authors', - author_email='your@email.com', - description='Description of your package', - url='https://github.com/yourusername/your-package-name', - packages=find_packages(), - entry_points={ - 'console_scripts': [ - 'c3 = c3.compiler:main' - ] - }, - package_data={ - 'c3': ['./c3/generate_kfp_component.ipynb'], - }, - install_requires=[ - 'ipython', - ], - classifiers=[ - 'License :: OSI Approved :: MIT License', - 'Programming Language :: Python :: 3', - 'Operating System :: OS Independent', - ], -) From 29ed35185f5a3398c1404f042168cc504494dbd9 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Wed, 18 Oct 2023 14:30:12 +0200 Subject: [PATCH 079/177] Added pypi package code --- pyproject.toml | 38 +++++++++++++++++++ src/build/lib/c3/__init__.py | 0 ..._grid_wrapper.py => create_gridwrapper.py} | 15 +++++--- src/c3/create_operator.py | 24 ++++++------ src/c3/pythonscript.py | 2 +- 5 files changed, 62 insertions(+), 17 deletions(-) create mode 100644 pyproject.toml delete mode 100644 src/build/lib/c3/__init__.py rename src/c3/{create_grid_wrapper.py => create_gridwrapper.py} (96%) diff --git a/pyproject.toml b/pyproject.toml new file mode 100644 index 00000000..11e2e410 --- /dev/null +++ b/pyproject.toml @@ -0,0 +1,38 @@ +[build-system] +requires = ["setuptools>=61.0"] +build-backend = "setuptools.build_meta" + +[project] +name = "claimed-c3" +version = "0.2.4" +authors = [ + { name="The CLAIMED authors" }, +] +maintainers = [ + { name="Romeo Kienzler", email="romeo.kienzler1@ibm.com" }, + { name="Benedikt Blumenstiel", email="benedikt.blumenstiel@ibm.com" }, +] +description = "The CLAIMED component compiler (C3) generates container images, KFP components, and Kubernetes jobs." 
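+# A minimal usage sketch once this package is published (the console entry
+# points are declared under [project.scripts] further down in this file):
+#
+#   pip install claimed-c3
+#   create_operator --help
+#   create_gridwrapper --help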
+readme = "README.md" +requires-python = ">=3.7" +license = {file = "LICENSE.txt"} +keywords = ["CLAIMED", "compiler", "KubeFlow", "Kubernetes"] +classifiers = [ + "Programming Language :: Python :: 3", + "License :: OSI Approved :: Apache Software License", + "Operating System :: OS Independent", +] + +[project.urls] +"Homepage" = "https://github.com/claimed-framework/c3" +"Bug Tracker" = "https://github.com/claimed-framework/c3/issues" + +[project.scripts] +create_operator = "c3.create_operator:main" +create_gridwrapper = "c3.create_gridwrapper:main" + +[tool.setuptools.packages.find] +where = ["src"] + +[tool.setuptools.package-data] +"c3.templates" = ["*"] diff --git a/src/build/lib/c3/__init__.py b/src/build/lib/c3/__init__.py deleted file mode 100644 index e69de29b..00000000 diff --git a/src/c3/create_grid_wrapper.py b/src/c3/create_gridwrapper.py similarity index 96% rename from src/c3/create_grid_wrapper.py rename to src/c3/create_gridwrapper.py index d3f51f31..e4a5ae79 100644 --- a/src/c3/create_grid_wrapper.py +++ b/src/c3/create_gridwrapper.py @@ -1,12 +1,13 @@ + import logging import os import argparse import sys from string import Template -from pythonscript import Pythonscript -from utils import convert_notebook -from create_operator import create_operator -from templates import grid_wrapper_template, cos_grid_wrapper_template, gw_component_setup_code, dockerfile_template +from c3.pythonscript import Pythonscript +from c3.utils import convert_notebook +from c3.create_operator import create_operator +from c3.templates import grid_wrapper_template, cos_grid_wrapper_template, gw_component_setup_code, dockerfile_template def wrap_component(component_path, @@ -133,7 +134,7 @@ def apply_grid_wrapper(file_path, component_process, cos, *args, **kwargs): return grid_wrapper_file_path, file_path -if __name__ == '__main__': +def main(): parser = argparse.ArgumentParser() parser.add_argument('file_path', type=str, help='Path to python script or notebook') @@ -187,3 +188,7 @@ def apply_grid_wrapper(file_path, component_process, cos, *args, **kwargs): logging.info('Remove local component file') os.remove(component_path) + + +if __name__ == '__main__': + main() diff --git a/src/c3/create_operator.py b/src/c3/create_operator.py index 0015086f..fd7fa4fb 100644 --- a/src/c3/create_operator.py +++ b/src/c3/create_operator.py @@ -1,18 +1,14 @@ + import os import sys -import re import logging import shutil import argparse import subprocess from string import Template -from io import StringIO -from pythonscript import Pythonscript -from utils import convert_notebook, get_image_version - -# Update sys path to load templates -sys.path.append(os.path.dirname(os.path.dirname(os.path.realpath(__file__)))) -from templates import component_setup_code, dockerfile_template, kfp_component_template, kubernetes_job_template +from c3.pythonscript import Pythonscript +from c3.utils import convert_notebook, get_image_version +from c3.templates import component_setup_code, dockerfile_template, kfp_component_template, kubernetes_job_template CLAIMED_VERSION = 'V0.1' @@ -210,7 +206,7 @@ def get_component_interface(parameters): shutil.rmtree(additional_files_path, ignore_errors=True) -if __name__ == '__main__': +def main(): parser = argparse.ArgumentParser() parser.add_argument('FILE_PATH', type=str, help='Path to python script or notebook') @@ -238,13 +234,19 @@ def get_component_interface(parameters): if args.dockerfile_template_path != '': logging.info(f'Uses custom dockerfile template from 
{args.dockerfile_template_path}') with open(args.dockerfile_template_path, 'r') as f: - dockerfile_template = Template(f.read()) + _dockerfile_template = Template(f.read()) + else: + _dockerfile_template = dockerfile_template create_operator( file_path=args.FILE_PATH, repository=args.repository, version=args.version, - dockerfile_template=dockerfile_template, + dockerfile_template=_dockerfile_template, additional_files=args.ADDITIONAL_FILES, log_level=args.log_level, ) + + +if __name__ == '__main__': + main() diff --git a/src/c3/pythonscript.py b/src/c3/pythonscript.py index cf20044d..8d2ab0dd 100644 --- a/src/c3/pythonscript.py +++ b/src/c3/pythonscript.py @@ -2,7 +2,7 @@ import logging import os import re -from parser import ContentParser +from c3.parser import ContentParser class Pythonscript: From 42b672972e2e470de8e64d59062d391912ff335c Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Wed, 18 Oct 2023 14:31:27 +0200 Subject: [PATCH 080/177] Update ReadMe and add LICENSE file --- LICENSE.txt | 202 ++++++++++++++++++++++++++++++++++++++++++++++++++++ README.md | 17 ++--- 2 files changed, 209 insertions(+), 10 deletions(-) create mode 100644 LICENSE.txt diff --git a/LICENSE.txt b/LICENSE.txt new file mode 100644 index 00000000..d6456956 --- /dev/null +++ b/LICENSE.txt @@ -0,0 +1,202 @@ + + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. 
+ + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. Grant of Patent License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. 
You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative Works, in at least one + of the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. The contents + of the NOTICE file are for informational purposes only and + do not modify the License. You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. + + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. Limitation of Liability. 
In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. + + END OF TERMS AND CONDITIONS + + APPENDIX: How to apply the Apache License to your work. + + To apply the Apache License to your work, attach the following + boilerplate notice, with the fields enclosed by brackets "[]" + replaced with your own identifying information. (Don't include + the brackets!) The text should be enclosed in the appropriate + comment syntax for the file format. We also recommend that a + file or class name and description of purpose be included on the + same "printed page" as the copyright notice for easier + identification within third-party archives. + + Copyright [yyyy] [name of copyright owner] + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. diff --git a/README.md b/README.md index 8e4297d4..03071781 100644 --- a/README.md +++ b/README.md @@ -6,10 +6,11 @@ # C3 - the CLAIMED Component Compiler **TL;DR** -- takes arbitrary assets (Jupyter notebooks, python/R/shell/SQL scripts) as input +- takes arbitrary assets (Jupyter notebooks, python scripts) as input - automatically creates container images and pushes to container registries - automatically installs all required dependencies into the container image - creates KubeFlow Pipeline components (target workflow execution engines are pluggable) +- creates Kubernetes job configs for execution on Kubernetes/Openshift clusters - can be triggered from CICD pipelines @@ -24,28 +25,24 @@ To learn more on how this library works in practice, please have a look at the f ### Install -Download the code from https://github.com/claimed-framework/c3/tree/main and install the package. 
-
-```sh
-git clone claimed-framework/c3
-cd c3
-pip install -e src
+```sh
+pip install claimed-c3
 ```
 
 ### Usage
 
 Just run the following command with your python script or notebook:
 
 ```sh
-python /src/c3/create_operator.py --file_path "<your_script>.py" --version "X.X" --repository "<registry>/<namespace>" --additional_files "[file1,file2]"
+create_operator --repository "<registry>/<namespace>" "<your_script>.py"
 ```
 
-Your code needs to follow certain requirements which are explained in [Getting Started](GettingStarted.md).
+Your code needs to follow certain requirements which are explained in [Getting Started](https://github.com/claimed-framework/c3/blob/main/GettingStarted.md).
 
 ## Getting Help
 
 ```sh
-python src/c3/create_operator.py --help
+create_operator --help
 ```
 
 We welcome your questions, ideas, and feedback. Please create an [issue](https://github.com/claimed-framework/component-library/issues) or a [discussion thread](https://github.com/claimed-framework/component-library/discussions).
@@ -56,4 +53,4 @@ Interested in helping make CLAIMED better? We encourage you to take a look at our
 [Contributing](CONTRIBUTING.md) page.
 
 ## License
-This software is released under Apache License v2.0
+This software is released under Apache License v2.0.

From 4000d3483cb9557786903caa7cbab9c6c004ee60 Mon Sep 17 00:00:00 2001
From: Benedikt Blumenstiel <64090593+blumenstiel@users.noreply.github.com>
Date: Wed, 18 Oct 2023 16:49:44 +0200
Subject: [PATCH 081/177] Create python-publish.yml

---
 .github/workflows/python-publish.yml | 39 ++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)
 create mode 100644 .github/workflows/python-publish.yml

diff --git a/.github/workflows/python-publish.yml b/.github/workflows/python-publish.yml
new file mode 100644
index 00000000..bdaab28a
--- /dev/null
+++ b/.github/workflows/python-publish.yml
@@ -0,0 +1,39 @@
+# This workflow will upload a Python Package using Twine when a release is created
+# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python#publishing-to-package-registries
+
+# This workflow uses actions that are not certified by GitHub.
+# They are provided by a third-party and are governed by
+# separate terms of service, privacy policy, and support
+# documentation.
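+#
+# For a local dry run of the publish job defined below, a rough equivalent is
+# sketched here, assuming a PyPI API token exported as PYPI_API_TOKEN (twine
+# stands in for the pypa/gh-action-pypi-publish action used in the job):
+#
+#   python -m pip install --upgrade pip build twine
+#   python -m build
+#   python -m twine upload -u __token__ -p "$PYPI_API_TOKEN" dist/*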
+
+name: Upload Python Package
+
+on:
+  release:
+    types: [published]
+
+permissions:
+  contents: read
+
+jobs:
+  deploy:
+
+    runs-on: ubuntu-latest
+
+    steps:
+    - uses: actions/checkout@v3
+    - name: Set up Python
+      uses: actions/setup-python@v3
+      with:
+        python-version: '3.x'
+    - name: Install dependencies
+      run: |
+        python -m pip install --upgrade pip
+        pip install build
+    - name: Build package
+      run: python -m build
+    - name: Publish package
+      uses: pypa/gh-action-pypi-publish@27b31702a0e7fc50959f5ad993c78deac1bdfc29
+      with:
+        user: __token__
+        password: ${{ secrets.PYPI_API_TOKEN }}

From 0baf79fc079ac4052fabaf08d22f2d9431640831 Mon Sep 17 00:00:00 2001
From: Benedikt Blumenstiel
Date: Wed, 18 Oct 2023 17:27:06 +0200
Subject: [PATCH 082/177] Updated examples

---
 examples/operator_example.py  |  59 ++++++++++++++++++
 examples/operator_template.py | 112 ----------------------------------
 examples/pipeline_example.py  |  54 ++++++++++++++++
 3 files changed, 113 insertions(+), 112 deletions(-)
 create mode 100644 examples/operator_example.py
 delete mode 100644 examples/operator_template.py
 create mode 100644 examples/pipeline_example.py

diff --git a/examples/operator_example.py b/examples/operator_example.py
new file mode 100644
index 00000000..c1c3a86f
--- /dev/null
+++ b/examples/operator_example.py
@@ -0,0 +1,59 @@
+# TODO: Rename the file to the desired operator name.
+"""
+TODO: Update the description of the operator in the first doc string.
+This is the operator description.
+The file name becomes the operator name.
+"""
+
+# TODO: Update the required pip packages.
+# pip install numpy
+
+import os
+import logging
+import numpy as np
+
+# TODO: Add the operator interface.
+# A comment one line above os.getenv is the description of this variable.
+input_path = os.getenv('input_path')
+
+# Optionally, you can set a default value.
+with_default = os.getenv('with_default', 'default_value')
+
+# You can cast a specific type with int(), float(), or bool().
+num_values = int(os.getenv('num_values', 5))
+
+# Output paths start with "output_".
+output_path = os.getenv('output_path', None)
+
+
+# You can call a function from an additional file (must be in the same directory) or add your code here.
+def main(num_values, *args, **kwargs):
+    # TODO: Add your code.
+    random_values = np.random.rand(num_values)
+    # C3 adds setup code to your script which initializes the logging.
+    # You can just use logging.debug(), logging.info(), logging.warning() in your code.
+    logging.info(f'Random values: {random_values}')
+
+
+# It is recommended to use a main block to avoid unexpected code execution.
+if __name__ == '__main__':
+    main(num_values)
+
+
+# TODO: Add a grid process if you want to parallelize your code.
+def grid_process(batch, input_path, with_default, num_values, output_path):
+    """
+    A process for the c3 grid wrapper. The process gets the batch name as the first positional argument,
+    followed by all interface variables. This is only possible if the code can be processed in parallel,
+    e.g., by splitting up input files.
+ """ + + # You might need to update the variables based on the batch + input_path += batch + '*.json' + output_path += batch + 'data.csv' + + # Execute the processing with adjusted variables + main(num_values, input_path, output_path) + + # optionally return a string or list with output files + return output_path \ No newline at end of file diff --git a/examples/operator_template.py b/examples/operator_template.py deleted file mode 100644 index a88adfac..00000000 --- a/examples/operator_template.py +++ /dev/null @@ -1,112 +0,0 @@ -# TODO: Rename the file to the desired operator name. -""" -# TODO: Update the description of the operator. -This is a template for an operator that read files from COS, processes them, and saves the results to COS. -You can create a container image and KubeFlow job with C3. -""" - -# TODO: Update the required pip packages. -# pip install xarray s3fs - -import os -import logging -import sys -import re -import s3fs -import xarray as xr - -# TODO: Add the operator interface. -# You can use os.environ["name"], os.getenv("name"), or os.environ.get("name"). -# The default type is string. You can also use int, float, and bool values with type casting. -# Optionally, you can set a default value like in the following. -# string example description with default value -string_example = os.getenv('string_example', 'default_value') -# int example description -int_example = int(os.getenv('int_example', 10)) -# float example description -float_example = float(os.getenv('float_example', 0.1)) -# bool example description -bool_example = bool(os.getenv('bool_example', False)) - -# # # Exemplary interface for processing COS files # # # - -# glob pattern for all zarr files to process (e.g. path/to/files/**/*.zarr) -file_path_pattern = os.getenv('file_path_pattern') -# directory for the output files -target_dir = os.getenv('target_dir') -# access_key_id -access_key_id = os.getenv('access_key_id') -# secret_access_key -secret_access_key = os.getenv('secret_access_key') -# endpoint -endpoint = os.getenv('endpoint') -# bucket -bucket = os.getenv('bucket') -# set log level -log_level = os.getenv('log_level', "INFO") - -# Init logging -root = logging.getLogger() -root.setLevel(log_level) - -handler = logging.StreamHandler(sys.stdout) -handler.setLevel(log_level) -formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s') -handler.setFormatter(formatter) -root.addHandler(handler) - -logging.basicConfig(level=logging.CRITICAL) - -# get arguments from the command (C3 passes all arguments in the form '=') -parameters = list( - map(lambda s: re.sub('$', '"', s), - map( - lambda s: s.replace('=', '="'), - filter( - lambda s: s.find('=') > -1 and bool(re.match(r'[A-Za-z0-9_]*=[.\/A-Za-z0-9]*', s)), - sys.argv - ) - ))) - -# set values from command arguments -for parameter in parameters: - logging.info('Parameter: ' + parameter) - exec(parameter) - -# TODO: You might want to add type casting after the exec(parameter). -# C3 will added this automatically in the future, but it not implemented yet. -# type casting -int_example = int(int_example) -float_example = float(float_example) -bool_example = bool(bool_example) - - -# TODO: Add your code. -# You can just call a function from an additional file (must be in the same directory) or add your code here. 
-# Example code for processing COS files based on a file pattern -def main(): - # init s3 - s3 = s3fs.S3FileSystem( - anon=False, - key=access_key_id, - secret=secret_access_key, - client_kwargs={'endpoint_url': endpoint}) - - # get file paths from a glob pattern, e.g., path/to/files/**/*.zarr - file_paths = s3.glob(os.path.join(bucket, file_path_pattern)) - - for file_path in file_paths: - # open a zarr file from COS as xarray dataset - ds = xr.open_zarr(s3fs.S3Map(root=f's3://{file_path}', s3=s3)) - - # TODO: do something with the dataset - processed_ds = ds - - # write processed dataset to s3 - # TODO: edit how to save the processed data - target_path = os.path.join(bucket, target_dir, os.path.basename(file_path)) - processed_ds.to_zarr(s3fs.S3Map(root=f's3://{target_path}', s3=s3)) - - -if __name__ == '__main__': - main() diff --git a/examples/pipeline_example.py b/examples/pipeline_example.py new file mode 100644 index 00000000..4b772158 --- /dev/null +++ b/examples/pipeline_example.py @@ -0,0 +1,54 @@ +""" +# TODO: Update description +Tekton pipeline for with the following steps: +1. Step 1 +""" + +# TODO: Install kfp +# pip install kfp +# pip install kfp-tekton + +import kfp +import kfp.dsl as dsl +import kfp.components as comp +from kfp_tekton.compiler import TektonCompiler + +# TODO: Add your pipeline components based on the kfp yaml file from CLAIMED +# initialize operator from yaml file +component_op = comp.load_component_from_file('.yaml') +# initialize operator from remote file +web_op = comp.load_component_from_url('https://raw.githubusercontent.com/claimed-framework/component-library/main/component-library/.yaml') + + +# TODO: Update pipeline description, function name, and parameters +pipeline_name = 'my_pipeline' +# Pipeline function +@dsl.pipeline( + name=pipeline_name, + description="Pipeline description" +) +def my_pipeline( + parameter1: str = "default_value", + parameter2: str = "default_value", +): + # TODO: Add the components and the required parameters + step1 = component_op( + parameter1=parameter1, + ) + step2 = web_op( + parameter2=parameter2, + ) + + # TODO: You can call multiple steps and created the dependencies + step2.after(step1) + +# TODO: Update pipeline function +# Kubernetes +kfp.compiler.Compiler().compile(pipeline_func=my_pipeline, package_path=f'{pipeline_name}.yaml') +# OpenShift with Tekton +TektonCompiler().compile(my_pipeline, f'{pipeline_name}.yaml') + +print(f'Saved pipeline in {pipeline_name}.yaml') + +# TODO: Run script with python +# TODO: Upload the yaml to KubeFlow From 689fab7fcd23fd8d4ea63df36453807fc32bdff6 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Wed, 18 Oct 2023 17:27:38 +0200 Subject: [PATCH 083/177] Changed package version --- pyproject.toml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/pyproject.toml b/pyproject.toml index 11e2e410..86f5d62b 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta" [project] name = "claimed-c3" -version = "0.2.4" +version = "0.2.1" authors = [ { name="The CLAIMED authors" }, ] From 2d2500187a73059deefeeb6a43663ab6c6917ada Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Wed, 18 Oct 2023 17:27:52 +0200 Subject: [PATCH 084/177] Updated GettingStarted.md --- GettingStarted.md | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/GettingStarted.md b/GettingStarted.md index 9c8c0b99..5279bb7e 100644 --- a/GettingStarted.md +++ b/GettingStarted.md @@ -274,7 +274,7 @@ Your operator script has 
to follow certain requirements to be processed by C3. C - The operator name is the python file: `my_operator_name.py` -> `claimed-my-operator-name` - The operator description is the first doc string in the script: `"""Operator description"""` - The required pip packages are listed in comments starting with pip install: `# pip install ` -- The interface is defined by environment variables `my_parameter = os.getenv('my_parameter')`. Output variables start with `output_`. +- The interface is defined by environment variables `my_parameter = os.getenv('my_parameter')`. Output paths start with `output_`. Note that operators cannot return values but always have to save outputs in files. - You can cast a specific type by wrapping `os.getenv()` with `int()`, `float()`, `bool()`. The default type is string. Only these four types are currently supported. You can use `None` as a default value but not pass the `NoneType` via the `job.yaml`. #### iPython notebooks @@ -308,12 +308,9 @@ input_path = os.getenv('input_path') # You can cast a specific type with int(), float(), or bool(). num_values = int(os.getenv('num_values', 5)) -# Output parameters are starting with "output_" +# Output paths are starting with "output_". output_path = os.getenv('output_path', None) -# Output parameters are used for pipelines and are not configurable in single jobs. Use "target_" instead. -target_path = os.getenv('target_path', None) - def my_function(n_random): """ From 59c8051149d04fcb8a4fb7535866df1d07f78a66 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Thu, 19 Oct 2023 10:32:12 +0200 Subject: [PATCH 085/177] Updated GettingStarted.md --- GettingStarted.md | 14 ++------------ 1 file changed, 2 insertions(+), 12 deletions(-) diff --git a/GettingStarted.md b/GettingStarted.md index 5279bb7e..a5b98997 100644 --- a/GettingStarted.md +++ b/GettingStarted.md @@ -250,21 +250,11 @@ TektonCompiler().compile(pipeline_func=my_pipeline, package_path='my_pipeline.ya ### 4.1 Download C3 -Download the [C3 repository](https://github.com/claimed-framework/c3) and install the dependencies with: - +You can install C3 via pip: ```sh -git clone claimed-framework/c3 -cd c3 -pip install -e src +pip install claimed-c3 ``` -This documentation describes the functionality of the `dev` branch, which currently differs significantly from the `main` branch. Run the following to pull the dev branch: -```sh -git checkout dev -git pull -``` - - ### 4.2 C3 requirements Your operator script has to follow certain requirements to be processed by C3. Currently supported are python scripts and ipython notebooks. 
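Putting the requirements above together, a minimal operator could look like the following sketch (a hypothetical example for illustration only: the CSV use case, the variable names, and the pandas calls are not part of the repository):

```python
"""Computes summary statistics for a CSV file."""

# pip install pandas

import os
import pandas as pd

# path to the input CSV file
input_path = os.getenv('input_path')

# number of rows to read (cast to int; default 100)
num_rows = int(os.getenv('num_rows', 100))

# output path for the statistics; output paths start with "output_"
output_path = os.getenv('output_path', 'stats.csv')

if __name__ == '__main__':
    # operators do not return values; outputs are always saved to files
    df = pd.read_csv(input_path, nrows=num_rows)
    df.describe().to_csv(output_path)
```

Saved as `my_stats_operator.py`, this file would become the operator `claimed-my-stats-operator`, with the first doc string as its description.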
From b4b887224b9116098b06d5827c73bdf9f74d7681 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Thu, 19 Oct 2023 14:17:19 +0200 Subject: [PATCH 086/177] Added tests --- tests/example_notebook.ipynb | 114 +++++++++++++++++++++++++++++++++++ tests/example_script.py | 39 ++++++++++++ tests/test_compiler.py | 101 +++++++++++++++++++++++++++++++ 3 files changed, 254 insertions(+) create mode 100644 tests/example_notebook.ipynb create mode 100644 tests/example_script.py create mode 100644 tests/test_compiler.py diff --git a/tests/example_notebook.ipynb b/tests/example_notebook.ipynb new file mode 100644 index 00000000..da2883d5 --- /dev/null +++ b/tests/example_notebook.ipynb @@ -0,0 +1,114 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Test description" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install numpy\n", + "\n", + "! pip install pandas" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "hello\n" + ] + } + ], + "source": [ + "%%bash\n", + "echo hello\n", + "echo world" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# A comment one line above os.getenv is the description of this variable.\n", + "input_path = os.getenv('input_path')\n", + "\n", + "# You can change the type by using int(), float(), or bool().\n", + "batch_size = int(os.getenv('batch_size', 16))\n", + "\n", + "# The commas in the previous comment are deleted because the yaml file requires descriptions without commas.\n", + "debug = bool(os.getenv('debug', False))\n", + "\n", + "# Output parameters are starting with \"output_\"\n", + "output_path = os.getenv('output_path')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def your_function(*args):\n", + " \"\"\"\n", + " The compiler only includes the first doc string. Therefore, this text is not included.\n", + " \"\"\"\n", + "\n", + " print(args)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "your_function(input_path, batch_size, debug, output_path)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.18" + }, + "orig_nbformat": 4 + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/tests/example_script.py b/tests/example_script.py new file mode 100644 index 00000000..c8bc616c --- /dev/null +++ b/tests/example_script.py @@ -0,0 +1,39 @@ +""" +This is the operator description. +""" + +# pip install numpy + +#!pip install pandas + +import os +import numpy as np + +# A comment one line above os.getenv is the description of this variable. +input_path = os.getenv('input_path') + +# type casting to int(), float(), or bool() +batch_size = int(os.getenv('batch_size', 16)) + +# Commas in the previous comment are deleted because the yaml file requires descriptions without commas. 
+debug = bool(os.getenv('debug', False))
+
+# Output parameters start with "output_"
+output_path = os.getenv('output_path')
+
+
+def main(*args):
+    """
+    The compiler only includes the first doc string. This text should not be included.
+    """
+    _ = np.random.randn(5)
+
+    print(args)
+
+def process(batch, *args):
+    # process function for grid wrapper
+    print('Execute batch:', batch)
+    main(batch, *args, input_path, batch_size, debug, output_path)
+
+if __name__ == '__main__':
+    main(input_path, batch_size, debug, output_path)
diff --git a/tests/test_compiler.py b/tests/test_compiler.py
new file mode 100644
index 00000000..83d08915
--- /dev/null
+++ b/tests/test_compiler.py
@@ -0,0 +1,101 @@
+import os
+import subprocess
+import pytest
+from typing import Any, Dict, List, Optional, Union
+from pathlib import Path
+from src.c3.utils import convert_notebook, increase_image_version, get_image_version
+from src.c3.pythonscript import Pythonscript
+
+TEST_NOTEBOOK_PATH = 'example_notebook.ipynb'
+TEST_SCRIPT_PATH = 'example_script.py'
+DUMMY_REPO = 'test'
+
+test_convert_notebook_input = [
+    (
+        TEST_NOTEBOOK_PATH,
+        ['input_path', 'batch_size', 'debug', 'output_path']
+    )
+]
+
+@pytest.mark.parametrize(
+    "notebook_path, env_values",
+    test_convert_notebook_input,
+)
+def test_convert_notebook(
+    notebook_path: str,
+    env_values: List,
+):
+    # convert notebook
+    script_path = convert_notebook(notebook_path)
+
+    assert os.path.isfile(script_path), f"Error! No file {script_path}"
+
+    # check that the script runs without errors
+    for env in env_values:
+        os.environ[env] = '0'
+    subprocess.run(['python', script_path], check=True)
+
+    # check if converted script is processable for create_operator
+    py = Pythonscript(script_path)
+    name = py.get_name()
+    assert isinstance(name, str), "Name is not a string."
+    description = py.get_description()
+    assert isinstance(description, str), "Description is not a string."
+    inputs = py.get_inputs()
+    assert isinstance(inputs, dict), "Inputs is not a dict."
+    outputs = py.get_outputs()
+    assert isinstance(outputs, dict), "Outputs is not a dict."
+    requirements = py.get_requirements()
+    assert isinstance(requirements, list), "Requirements is not a list."
+
+    # remove temporary file
+    os.remove(script_path)
+
+
+test_increase_version_input = [
+    ('0.1', '0.2'),
+    ('2.1.13', '2.1.14'),
+    ('0.1beta', '0.1beta.1'),
+    ('0.1beta.1', '0.1beta.2'),
+]
+
+
+@pytest.mark.parametrize(
+    "last_version, expected_version",
+    test_increase_version_input,
+)
+def test_create_operator(
+    last_version: str,
+    expected_version: str,
+):
+    new_version = increase_image_version(last_version)
+    assert new_version == expected_version, \
+        f"Mismatch between new version {new_version} and expected version {expected_version}"
+
+
+test_create_operator_input = [
+    (
+        TEST_NOTEBOOK_PATH,
+        DUMMY_REPO,
+        [],
+    ),
+    (
+        TEST_SCRIPT_PATH,
+        DUMMY_REPO,
+        [TEST_NOTEBOOK_PATH],
+    )
+]
+@pytest.mark.parametrize(
+    "file_path, repository, args",
+    test_create_operator_input,
+)
+def test_create_operator(
+    file_path: str,
+    repository: str,
+    args: List,
+):
+    subprocess.run(['python', '../src/c3/create_operator.py', file_path, *args, '-r', repository], check=True)
+
+    file = Path(file_path)
+    file.with_suffix('.yaml').unlink()
+    file.with_suffix('.job.yaml').unlink()

From 9d5312ce8cdee44596b98f7d595650c1dd836804 Mon Sep 17 00:00:00 2001
From: Benedikt Blumenstiel
Date: Thu, 19 Oct 2023 14:19:07 +0200
Subject: [PATCH 087/177] Rename scripts to c3.create...
and minor fixes
---
 GettingStarted.md            | 75 ++++++++++++++++++++----------------
 README.md                    |  4 +-
 examples/operator_example.py |  6 +--
 pyproject.toml               | 10 ++---
 src/c3/create_operator.py    |  1 +
 src/c3/utils.py              | 10 ++++-
 6 files changed, 61 insertions(+), 45 deletions(-)

diff --git a/GettingStarted.md b/GettingStarted.md
index a5b98997..43309ab2 100644
--- a/GettingStarted.md
+++ b/GettingStarted.md
@@ -144,7 +144,7 @@ spec:
 
 ### 1.2 Cluster CLI login
 
-You can start jobs via with the `kubectl` (Kubernetes) or `oc` (OpenShift) CLI. If your using Kubernetes, the login procedure includes multiple steps which are detailed in the [Kubernetes docs](https://kubernetes.io/docs/tasks/access-application-cluster/access-cluster/).
+You can start jobs with the `kubectl` (Kubernetes) or `oc` (OpenShift) CLI. If you're using Kubernetes, the login procedure includes multiple steps, which are detailed in the [Kubernetes docs](https://kubernetes.io/docs/tasks/access-application-cluster/access-cluster/).
 
 Logging into an OpenShift cluster is easier. You can use a token, which you can generate via the browser UI, or your username. You might want to add `--insecure-skip-tls-verify` when errors occur.
 
@@ -187,10 +187,14 @@ kubectl describe pod
 
 ## 2. Operator library
 
-Reusable code is a key idea of CLAIMED and operator libraries make it easier to share single processing steps. Because each operator includes a docker image with specified dependencies, operators can be easily reused in different workflows.
+Reusable code is a key idea of CLAIMED, and operator libraries make it easier to share single processing steps.
+Because each operator includes a docker image with specified dependencies, operators can be easily reused in different workflows.
 
 Public operators are accessible from the [CLAIMED component library](https://github.com/claimed-framework/component-library/tree/main/component-library).
 
+You can run a public operator locally using [claimed-cli](https://github.com/claimed-framework/cli), or copy the Kubernetes job.yaml file to run the operator on a Kubernetes/OpenShift cluster.
+You can also use the operators in workflows as explained in the next section.
+
 ---
 
 ## 3. Create workflows
@@ -227,6 +231,8 @@ def my_pipeline(
         parameter1=parameter1,
         parameter3=parameter3,
     )
+
+    step2.after(step1)
 
 kfp.compiler.Compiler().compile(pipeline_func=my_pipeline, package_path='my_pipeline.yaml')
 ```
@@ -234,7 +240,7 @@
 When running the script, the KFP compiler generates a `.yaml` file which can be uploaded to the KubeFlow UI to start the pipeline. Alternatively, you can run the pipeline with the SDK client; see the [KubeFlow Docs](https://www.kubeflow.org/docs/components/pipelines/v1/sdk/build-pipeline/) for details.
 
-If your using an OpenShift cluster, your might want to use the tekton compiler.
+If you're using an OpenShift cluster, you might want to use the Tekton compiler.
 
 ```python
 # pip install kfp-tekton
@@ -347,15 +353,16 @@ docker login -u -p /
 
 With a running Docker engine and your operator script matching the C3 requirements, you can execute the C3 compiler by running `create_operator.py`:
 ```sh
-python /src/c3/create_operator.py ".py" "" "" --repository "/"
+c3.create_operator.py ".py" "" "" --repository "/"
 ```
 The first positional argument is the path to the python script or the ipython notebook. Optionally, you can provide additional files in all following positional arguments; they are copied into the container image.
 The additional files are placed within the same directory as the operator script.
-C3 automatically increases the version of the container image (default: "0.1") but you can set the version with `--version` or `-v`. You need to provide the repository with `--repository` or `-r`.
+C3 automatically increases the version of the container image (default: "0.1"), but you can set the version with `--version` or `-v`. You need to provide the repository with `--repository` or `-r`.
+If you don't have access to the repository, C3 still creates the docker image and the other files, but the image is not pushed to the registry and cannot be used on clusters.
 
 View all arguments by running:
 ```sh
-python /src/c3/create_grid_wrapper.py --help
+c3.create_operator --help
 ```
 
 C3 generates the container image that is pushed to the registry, a `.yaml` file for KubeFlow, and a `.job.yaml` that can be directly used as described above.
 
@@ -372,22 +379,22 @@ Therefore, the code gets wrapped by a coordinator script: The grid wrapper.
 
 You can use the same code for the grid wrapper as for an operator by adding an extra function, which is passed to C3.
 The grid wrapper executes this function for each batch and passes specific parameters to the function:
-The first parameter is the batch name, followed by all variables defined in the operator interface.
-You need to adapt the variables based on the batch, e.g., by adding the batch name to input and output paths.
+The first parameter is the batch id, followed by all variables defined in the operator interface.
+You need to adapt the variables based on the batch, e.g., by adding the batch id to input and output paths.
 
 ```python
-def grid_process(batch, parameter1, parameter2, *args, **kwargs):
-    # update operator parameters based on batch name
-    parameter1 = parameter1 + batch
-    parameter2 = os.path.join(parameter2, batch)
+def grid_process(batch_id, parameter1, parameter2, *args, **kwargs):
+    # update operator parameters based on batch id
+    parameter1 = parameter1 + batch_id
+    parameter2 = os.path.join(parameter2, batch_id)
 
     # execute operator code with adapted parameters
     my_function(parameter1, parameter2)
 ```
 
-You might want to add `*args, **kwargs` to avoid errors, if not all interface variables are used.
+You might want to add `*args, **kwargs` to avoid errors if not all interface variables are used in the grid process.
 
 Note that the operator script is imported by the grid wrapper script. Therefore, all code in the script is executed.
-It is recommended to avoid executions in the code or to use a main block if the script is also used as a single operator.
+It is recommended to avoid executions in the code and to use a main block if the script is also used as a single operator.
 
 ```python
 if __name__ == '__main__':
@@ -399,11 +406,12 @@ if __name__ == '__main__':
 
 The compilation is similar to an operator. Additionally, the name of the grid process is passed to `create_grid_wrapper.py` using `--process` or `-p`.
 ```sh
-python /src/c3/create_grid_wrapper.py ".py" "" "" --process "grid_process" -r "/"
+c3.create_gridwrapper ".py" "" "" --process "grid_process" -r "/"
 ```
 
 C3 also includes a grid computing pattern for Cloud Object Storage (COS). You can create a COS grid wrapper by adding a `--cos` flag.
-The COS grid wrapper downloads all files of a batch to local storage, compute the process, and uploads the target files to COS.
+The COS grid wrapper downloads all files of a batch to local storage, computes the process, and uploads the output files to COS.
+Note that the COS grid wrapper requires the file paths to include the batch id so that the batches can be identified; see details in the next subsection.
 
 The created files include a `gw_.py` file that contains the generated code for the grid wrapper (`cgw_.py` for the COS version).
 Similar to an operator, `gw_.yaml` and `gw_.job.yaml` are created.
 
@@ -415,18 +423,19 @@ The grid wrapper uses coordinator files to split up the batch processes between
 Therefore, each pod needs access to a shared persistent volume, see [storage](#storage).
 Alternatively, you can use the COS grid wrapper, which uses a coordinator path in COS.
 
-The grid wrapper includes specific variables in the `job.yaml`, that define the batches and some coordination settings.
+The grid wrapper adds specific variables to the `job.yaml` that define the batches and some coordination settings.
 
 First, you can define the list of batches in a file and pass `gw_batch_file` to the grid wrapper.
-You can use either a txt file with a comma-separated list if strings or a json file with the keys being the batches.
-Alternatively, the batches can be defined by a file name pattern via `gw_file_path_pattern` and `gw_group_by`.
+You can use either a txt file with a comma-separated list of strings or a json file with the keys being the batch ids.
+Alternatively, the batch ids can be defined by a file name pattern via `gw_file_path_pattern` and `gw_group_by`.
 You can provide multiple patterns via a comma-separated list, and the patterns can include wildcards like `*` or `?` to find all relevant files.
 `gw_group_by` is code that extracts the batch id from a file name by merging the file name string with the code string and passing it to `eval()`; a short sketch of this mechanism is shown further below.
 Assume we have the file names `file-from-batch-42-metadata.json` and `second_file-42-image.png`.
 The code `gw_group_by = ".split('-')[-2]"` extracts the batch `42` from both files. You can also use something like `"[-15:-10]"` or `".split('/')[-1].split('.')[0]"`.
 `gw_group_by` is ignored if you provide `gw_batch_file`.
-Be aware that the file names need to include the batch name if you are using `gw_group_by` or the COS version (because files are downloaded based on the batch).
+Be aware that the file names need to include the batch id if you are using `gw_group_by` or the COS version
+(because files are downloaded based on a match with the batch id).
 
 Second, you need to define `gw_coordinator_path` and optionally other coordinator variables.
 The `gw_coordinator_path` is a path to a persistent and shared directory that is used by the pods to lock batches and mark them as processed.
 You need to increase `gw_lock_timeout` to avoid batches being processed multiple times if batch processes run very long.
 By default, pods skip batches with `.err` files. You can set `gw_ignore_error_files` to `True` after you have fixed the error.
 
+If you're using the COS grid wrapper, further variables are required.
+You can provide a comma-separated list of additional files that should be downloaded from COS using `gw_additional_source_files`.
+All batch files and additional files are downloaded to an input directory, defined via `gw_local_input_path` (default: `input`).
+Similarly, all files in `gw_local_target_path` are uploaded to COS after the batch processing (default: `target`).
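+
+To make the `gw_group_by` mechanism above concrete, here is a small illustrative sketch (the file names are hypothetical; it only mirrors the eval-based merging described earlier):
+
+```python
+file_names = ['file-from-batch-42-metadata.json', 'second_file-42-image.png']
+gw_group_by = ".split('-')[-2]"
+
+# the file name string is merged with the code string and evaluated
+batch_ids = {eval(f"'{name}'" + gw_group_by) for name in file_names}
+print(batch_ids)  # {'42'}
+```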
+ +Furthermore, `gw_source_access_key_id`, `gw_source_secret_access_key`, `gw_source_endpoint`, and `gw_source_bucket` define the COS bucket to the source files. +You can specify other buckets for the coordinator and target files. +If the buckets are similar to the source bucket, you just need to provide `gw_target_path` and `gw_coordinator_path` and remove the other variables from the `job.yaml`. +It is recommended to use [secrets](#secrets) for the access key and secret. + Lastly, you want to add the number of parallel pods by adding `parallelism : ` to the `job.yaml`. ```yaml @@ -451,6 +470,7 @@ process_parallel_instances = 10 def preprocessing_val_pipeline(...): step1 = first_op() step3 = following_op() + for i in range(process_parallel_instances): step2 = grid_wrapper_op(...) @@ -458,20 +478,9 @@ def preprocessing_val_pipeline(...): step3.after(step2) ``` -If your using the COS grid wrapper, further variables are required. -You can provide a comma-separated list of additional files that should be downloaded COS using `gw_additional_source_files`. -All batch files and additional files are download to an input directory, defined via `gw_local_input_path` (default: `input`). -Similar, all files in `gw_local_target_path` are uploaded to COS after the batch processing (default: `target`). - -Furthermore, `gw_source_access_key_id`, `gw_source_secret_access_key`, `gw_source_endpoint`, and `gw_source_bucket` define the COS bucket to the source files. -You can specify other buckets for the coordinator and target files. -If the buckets are similar to the source bucket, you just need to provide `gw_target_path` and `gw_coordinator_path` and remove the other variables from the `job.yaml`. -It is recommended to use [secrets](#secrets) for the access key and secret. - - #### Local example -The local grid wrapper requires a local storage for coordination like the pvc in the following example. +The local grid wrapper requires a local storage for coordination like the PVC in the following example. ```yaml apiVersion: batch/v1 diff --git a/README.md b/README.md index 03071781..8fa42522 100644 --- a/README.md +++ b/README.md @@ -33,7 +33,7 @@ pip install claimed-c3 Just run the following command with your python script or notebook: ```sh -create_operator --repository "/" ".py" +c3.create_operator --repository "/" ".py" ``` Your code needs to follow certain requirements which are explained in [Getting Started](https://github.com/claimed-framework/c3/blob/main/GettingStarted.md). @@ -42,7 +42,7 @@ Your code needs to follow certain requirements which are explained in [Getting S ## Getting Help ```sh -create_operator --help +c3.create_operator --help ``` We welcome your questions, ideas, and feedback. Please create an [issue](https://github.com/claimed-framework/component-library/issues) or a [discussion thread](https://github.com/claimed-framework/component-library/discussions). diff --git a/examples/operator_example.py b/examples/operator_example.py index c1c3a86f..9751af6c 100644 --- a/examples/operator_example.py +++ b/examples/operator_example.py @@ -41,7 +41,7 @@ def main(num_values, *args, **kwargs): # TODO: Add a grid process if you want to parallelize your code. -def grid_process(batch, input_path, with_default, num_values, output_path): +def grid_process(batch_id, input_path, with_default, num_values, output_path): """ A process for the c3 grid wrapper. The process gets the batch name as the first positional argument, followed by all interface variables. 
This is only possible if the code can be processed in parallel, @@ -49,8 +49,8 @@ def grid_process(batch, input_path, with_default, num_values, output_path): """ # You might need to update the variables based on the batch - input_path += batch + '*.json' - output_path += batch + 'data.csv' + input_path += batch_id + '*.json' + output_path += batch_id + '_data.csv' # Execute the processing with adjusted variables main(num_values, input_path, output_path) diff --git a/pyproject.toml b/pyproject.toml index 86f5d62b..8319c6f9 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -6,11 +6,11 @@ build-backend = "setuptools.build_meta" name = "claimed-c3" version = "0.2.1" authors = [ - { name="The CLAIMED authors" }, + { name="The CLAIMED authors", email="claimed-framework@proton.me"}, ] maintainers = [ - { name="Romeo Kienzler", email="romeo.kienzler1@ibm.com" }, - { name="Benedikt Blumenstiel", email="benedikt.blumenstiel@ibm.com" }, + { name="Romeo Kienzler", email="claimed-framework@proton.me"}, + { name="Benedikt Blumenstiel"}, ] description = "The CLAIMED component compiler (C3) generates container images, KFP components, and Kubernetes jobs." readme = "README.md" @@ -28,8 +28,8 @@ classifiers = [ "Bug Tracker" = "https://github.com/claimed-framework/c3/issues" [project.scripts] -create_operator = "c3.create_operator:main" -create_gridwrapper = "c3.create_gridwrapper:main" +c3.create_operator = "c3.create_operator:main" +c3.create_gridwrapper = "c3.create_gridwrapper:main" [tool.setuptools.packages.find] where = ["src"] diff --git a/src/c3/create_operator.py b/src/c3/create_operator.py index fd7fa4fb..2efdf483 100644 --- a/src/c3/create_operator.py +++ b/src/c3/create_operator.py @@ -26,6 +26,7 @@ def create_operator(file_path: str, logging.info('version: ' + str(version)) logging.info('additional_files: ' + str(additional_files)) + # TODO: add argument for running ipython instead of python within the container if file_path.endswith('.ipynb'): logging.info('Convert notebook to python script') target_code = convert_notebook(file_path) diff --git a/src/c3/utils.py b/src/c3/utils.py index 1e94555d..f9fc08e3 100644 --- a/src/c3/utils.py +++ b/src/c3/utils.py @@ -1,10 +1,12 @@ import os import logging import json +import re import subprocess def convert_notebook(path): + # TODO: switch to nbconvert long-term (need to replace pip install) with open(path) as json_file: notebook = json.load(json_file) @@ -19,13 +21,17 @@ def convert_notebook(path): if cell['cell_type'] == 'markdown': # add markdown as doc string code_lines.extend(['"""\n'] + [f'{line}' for line in cell['source']] + ['\n"""']) + elif cell['cell_type'] == 'code' and cell['source'][0].startswith('%%bash'): + code_lines.append('os.system("""') + code_lines.extend(cell['source'][1:]) + code_lines.append('""")') elif cell['cell_type'] == 'code': for line in cell['source']: if line.strip().startswith('!'): # convert sh scripts - if line.strip().startswith('!pip'): + if re.search('![ ]*pip', line): # change pip install to comment - code_lines.append(line.replace('!pip', '# pip', 1)) + code_lines.append(re.sub('![ ]*pip', '# pip', line)) else: # change sh command to os.system() logging.info(f'Replace shell command with os.system() ({line})') From 661d281e773b4a57fc686c6c03de0c6bbf96323c Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Mon, 23 Oct 2023 09:11:11 +0200 Subject: [PATCH 088/177] Fixed auto-increase version and added test --- src/c3/utils.py | 65 +++++++++++++++++++++++++++++++++++------- tests/test_compiler.py | 22 
+++++++++++++- 2 files changed, 76 insertions(+), 11 deletions(-) diff --git a/src/c3/utils.py b/src/c3/utils.py index f9fc08e3..06baf25c 100644 --- a/src/c3/utils.py +++ b/src/c3/utils.py @@ -67,21 +67,66 @@ def increase_image_version(last_version): return version -def get_image_version(repository, name): - """ - Get current version of the image from the registry and increase the version by 1. - Default to 0.1.1 if no image is found in the registry. - """ - logging.debug(f'Get image version from registry.') +def pull_docker_image_tags(image): + logging.warning("The current implementation can only query local docker images. " + "Please use an argument '-v ' to avoid duplicates.") # list images - image_list = subprocess.run( - ['docker', 'image', 'ls', f'{repository}/claimed-{name}'], + output = subprocess.run( + ['docker', 'image', 'ls', image], + stdout=subprocess.PIPE + ).stdout.decode('utf-8') + try: + # remove header + image_list = output.splitlines()[1:] + # get list of image tags + image_tags = [line.split()[1] for line in image_list] + except: + image_tags = [] + logging.error(f"Could not load image tags from 'docker image ls' output: {output}") + pass + + # filter latest and none + image_tags = [t for t in image_tags if t not in ['latest', '']] + return image_tags + + +def pull_icr_image_tags(image): + # list images from icr + output = subprocess.run( + ['ibmcloud', 'cr', 'images', '--restrict', image.split('icr.io/', 1)[1]], stdout=subprocess.PIPE ).stdout.decode('utf-8') - # get list of image tags - image_tags = [line.split()[1] for line in image_list.splitlines()][1:] + + try: + # remove header and final status + image_list = output.splitlines()[3:-2] + # get list of image tags + image_tags = [line.split()[1] for line in image_list] + except: + image_tags = [] + logging.error(f"Could not load image tags from 'ibmcloud cr images' output: {output}") + pass + # filter latest and none image_tags = [t for t in image_tags if t not in ['latest', '']] + return image_tags + + +def get_image_version(repository, name): + """ + Get current version of the image from the registry and increase the version by 1. + Defaults to 0.1 if no image is found in the registry. 
+    """
+    logging.debug(f'Get image version from registry.')
+    if 'docker.io' in repository:
+        logging.debug('Get image tags from docker.')
+        image_tags = pull_docker_image_tags(f'{repository}/claimed-{name}')
+    elif 'icr.io' in repository:
+        logging.debug('Get image tags from ibmcloud container registry.')
+        image_tags = pull_icr_image_tags(f'{repository}/claimed-{name}')
+    else:
+        logging.warning('Unrecognised container registry, using docker to query image tags.')
+        image_tags = pull_docker_image_tags(f'{repository}/claimed-{name}')
     logging.debug(f'Image tags: {image_tags}')
 
     def check_only_numbers(test_str):
diff --git a/tests/test_compiler.py b/tests/test_compiler.py
index 83d08915..703b1eb8 100644
--- a/tests/test_compiler.py
+++ b/tests/test_compiler.py
@@ -52,6 +52,26 @@ def test_convert_notebook(
     os.remove(script_path)
 
 
+test_get_remote_version_input = [
+    ('us.icr.io/geodn', 'sleep',),
+    ('docker.io/romeokienzler', 'predict-image-endpoint',),
+]
+
+
+@pytest.mark.parametrize(
+    "repository, name",
+    test_get_remote_version_input,
+)
+def test_get_remote_version(
+    repository: str,
+    name: str,
+):
+    # testing icr.io requires 'ibmcloud login'
+    version = get_image_version(repository, name)
+    assert version != '0.1', \
+        f"get_image_version returns the default version 0.1"
+
+
 test_increase_version_input = [
     ('0.1', '0.2'),
     ('2.1.13', '2.1.14'),
@@ -64,7 +84,7 @@ def test_convert_notebook(
     "last_version, expected_version",
     test_increase_version_input,
 )
-def test_create_operator(
+def test_increase_version(
     last_version: str,
     expected_version: str,
 ):

From a323dd8f683b6094ad1c819780d2e3cc2620d61c Mon Sep 17 00:00:00 2001
From: Benedikt Blumenstiel
Date: Mon, 23 Oct 2023 10:57:49 +0200
Subject: [PATCH 089/177] Replaced dot with underscore in script name

---
 GettingStarted.md | 6 +++---
 README.md         | 4 ++--
 pyproject.toml    | 4 ++--
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/GettingStarted.md b/GettingStarted.md
index 43309ab2..80c5aef8 100644
--- a/GettingStarted.md
+++ b/GettingStarted.md
@@ -353,7 +353,7 @@ docker login -u -p /
 
 With a running Docker engine and your operator script matching the C3 requirements, you can execute the C3 compiler by running `create_operator.py`:
 ```sh
-c3.create_operator.py ".py" "" "" --repository "/"
+c3_create_operator.py ".py" "" "" --repository "/"
 ```
 The first positional argument is the path to the python script or the ipython notebook. Optionally, you can provide additional files in all following positional arguments; they are copied into the container image.
@@ -362,7 +362,7 @@ If you don't have access to the repository, C3 still creates the docker image an
 
 View all arguments by running:
 ```sh
-c3.create_operator --help
+c3_create_operator --help
 ```
@@ -406,7 +406,7 @@ if __name__ == '__main__':
 
 The compilation is similar to an operator. Additionally, the name of the grid process is passed to `create_grid_wrapper.py` using `--process` or `-p`.
 ```sh
-c3.create_gridwrapper ".py" "" "" --process "grid_process" -r "/"
+c3_create_gridwrapper ".py" "" "" --process "grid_process" -r "/"
 ```
 
 C3 also includes a grid computing pattern for Cloud Object Storage (COS). You can create a COS grid wrapper by adding a `--cos` flag.
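For illustration, the renamed console script can be combined with the `--cos` flag mentioned above; the script name and registry below are placeholders:

```sh
# wrap a hypothetical my_script.py as a COS grid wrapper and build/push the image
c3_create_gridwrapper "my_script.py" --process "grid_process" --cos -r "docker.io/my-namespace"
```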
diff --git a/README.md b/README.md index 8fa42522..f4dc29b0 100644 --- a/README.md +++ b/README.md @@ -33,7 +33,7 @@ pip install claimed-c3 Just run the following command with your python script or notebook: ```sh -c3.create_operator --repository "/" ".py" +c3_create_operator --repository "/" ".py" ``` Your code needs to follow certain requirements which are explained in [Getting Started](https://github.com/claimed-framework/c3/blob/main/GettingStarted.md). @@ -42,7 +42,7 @@ Your code needs to follow certain requirements which are explained in [Getting S ## Getting Help ```sh -c3.create_operator --help +c3_create_operator --help ``` We welcome your questions, ideas, and feedback. Please create an [issue](https://github.com/claimed-framework/component-library/issues) or a [discussion thread](https://github.com/claimed-framework/component-library/discussions). diff --git a/pyproject.toml b/pyproject.toml index 8319c6f9..be3b7b95 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -28,8 +28,8 @@ classifiers = [ "Bug Tracker" = "https://github.com/claimed-framework/c3/issues" [project.scripts] -c3.create_operator = "c3.create_operator:main" -c3.create_gridwrapper = "c3.create_gridwrapper:main" +c3_create_operator = "c3.create_operator:main" +c3_create_gridwrapper = "c3.create_gridwrapper:main" [tool.setuptools.packages.find] where = ["src"] From 3122d9cb08cfa75834cd979c21dc71faf4270fba Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Mon, 23 Oct 2023 11:24:53 +0200 Subject: [PATCH 090/177] Added automated release version and version arg --- pyproject.toml | 5 +++-- src/c3/create_gridwrapper.py | 6 ++++++ src/c3/create_operator.py | 6 ++++++ 3 files changed, 15 insertions(+), 2 deletions(-) diff --git a/pyproject.toml b/pyproject.toml index be3b7b95..1f366a3b 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -1,10 +1,11 @@ [build-system] -requires = ["setuptools>=61.0"] +requires = ["setuptools>=61.0", "setuptools_scm[toml]>=6.2"] build-backend = "setuptools.build_meta" +[tool.setuptools_scm] + [project] name = "claimed-c3" -version = "0.2.1" authors = [ { name="The CLAIMED authors", email="claimed-framework@proton.me"}, ] diff --git a/src/c3/create_gridwrapper.py b/src/c3/create_gridwrapper.py index e4a5ae79..b7659aca 100644 --- a/src/c3/create_gridwrapper.py +++ b/src/c3/create_gridwrapper.py @@ -4,6 +4,7 @@ import argparse import sys from string import Template +from importlib.metadata import version from c3.pythonscript import Pythonscript from c3.utils import convert_notebook from c3.create_operator import create_operator @@ -149,11 +150,16 @@ def main(): parser.add_argument('-v', '--version', type=str, default=None, help='Image version') parser.add_argument('-l', '--log_level', type=str, default='INFO') + parser.add_argument("-v", "--version", action="store_true") parser.add_argument('--dockerfile_template_path', type=str, default='', help='Path to custom dockerfile template') args = parser.parse_args() + if args.version: + print(version("claimed-c3")) + sys.exit() + # Init logging root = logging.getLogger() root.setLevel(args.log_level) diff --git a/src/c3/create_operator.py b/src/c3/create_operator.py index 2efdf483..02486a54 100644 --- a/src/c3/create_operator.py +++ b/src/c3/create_operator.py @@ -6,6 +6,7 @@ import argparse import subprocess from string import Template +from importlib.metadata import version from c3.pythonscript import Pythonscript from c3.utils import convert_notebook, get_image_version from c3.templates import component_setup_code, dockerfile_template, 
kfp_component_template, kubernetes_job_template @@ -218,10 +219,15 @@ def main(): parser.add_argument('-v', '--version', type=str, default=None, help='Image version. Increases the version numer of image:latest if not provided.') parser.add_argument('-l', '--log_level', type=str, default='INFO') + parser.add_argument("-v", "--version", action="store_true") parser.add_argument('--dockerfile_template_path', type=str, default='', help='Path to custom dockerfile template') args = parser.parse_args() + if args.version: + print(version("claimed-c3")) + sys.exit() + # Init logging root = logging.getLogger() root.setLevel(args.log_level) From 2924370857a540014f61029f62375216bfbfe9e7 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Mon, 23 Oct 2023 11:35:21 +0200 Subject: [PATCH 091/177] Fix automated release version --- pyproject.toml | 2 ++ 1 file changed, 2 insertions(+) diff --git a/pyproject.toml b/pyproject.toml index 1f366a3b..bd0af70a 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -3,9 +3,11 @@ requires = ["setuptools>=61.0", "setuptools_scm[toml]>=6.2"] build-backend = "setuptools.build_meta" [tool.setuptools_scm] +version_file = "src/c3/_version.py" [project] name = "claimed-c3" +dynamic = ["version"] authors = [ { name="The CLAIMED authors", email="claimed-framework@proton.me"}, ] From 21dda4fccb67a900507e8507ec731aeabf469ed0 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Mon, 23 Oct 2023 11:50:17 +0200 Subject: [PATCH 092/177] Removed package version arg --- src/c3/create_gridwrapper.py | 10 ++-------- src/c3/create_operator.py | 9 ++------- 2 files changed, 4 insertions(+), 15 deletions(-) diff --git a/src/c3/create_gridwrapper.py b/src/c3/create_gridwrapper.py index b7659aca..5e5a56e2 100644 --- a/src/c3/create_gridwrapper.py +++ b/src/c3/create_gridwrapper.py @@ -146,20 +146,14 @@ def main(): parser.add_argument('--cos', action=argparse.BooleanOptionalAction, default=False, help='Creates a grid wrapper for processing COS files') parser.add_argument('-r', '--repository', type=str, default=None, - help='Container registry address, e.g. docker.io/') + help='Container registry address, e.g. docker.io/') parser.add_argument('-v', '--version', type=str, default=None, - help='Image version') + help='Container image version. Auto-increases the version number if not provided (default 0.1)') parser.add_argument('-l', '--log_level', type=str, default='INFO') - parser.add_argument("-v", "--version", action="store_true") parser.add_argument('--dockerfile_template_path', type=str, default='', help='Path to custom dockerfile template') - args = parser.parse_args() - if args.version: - print(version("claimed-c3")) - sys.exit() - # Init logging root = logging.getLogger() root.setLevel(args.log_level) diff --git a/src/c3/create_operator.py b/src/c3/create_operator.py index 02486a54..e5b4e48a 100644 --- a/src/c3/create_operator.py +++ b/src/c3/create_operator.py @@ -215,19 +215,14 @@ def main(): parser.add_argument('ADDITIONAL_FILES', type=str, nargs='*', help='Paths to additional files to include in the container image') parser.add_argument('-r', '--repository', type=str, required=True, - help='Container registry address, e.g. docker.io/') + help='Container registry address, e.g. docker.io/') parser.add_argument('-v', '--version', type=str, default=None, - help='Image version. Increases the version numer of image:latest if not provided.') + help='Container image version. 
Auto-increases the version number if not provided (default 0.1)') parser.add_argument('-l', '--log_level', type=str, default='INFO') - parser.add_argument("-v", "--version", action="store_true") parser.add_argument('--dockerfile_template_path', type=str, default='', help='Path to custom dockerfile template') args = parser.parse_args() - if args.version: - print(version("claimed-c3")) - sys.exit() - # Init logging root = logging.getLogger() root.setLevel(args.log_level) From cb5a74a4e74bfb64aa078b855ef1b2024c2c20db Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Mon, 23 Oct 2023 17:37:18 +0200 Subject: [PATCH 093/177] fix gridwrapper --- src/c3/create_gridwrapper.py | 7 ++++--- src/c3/create_operator.py | 1 - src/c3/utils.py | 2 +- 3 files changed, 5 insertions(+), 5 deletions(-) diff --git a/src/c3/create_gridwrapper.py b/src/c3/create_gridwrapper.py index 5e5a56e2..70d7eb57 100644 --- a/src/c3/create_gridwrapper.py +++ b/src/c3/create_gridwrapper.py @@ -4,7 +4,6 @@ import argparse import sys from string import Template -from importlib.metadata import version from c3.pythonscript import Pythonscript from c3.utils import convert_notebook from c3.create_operator import create_operator @@ -175,13 +174,15 @@ def main(): if args.dockerfile_template_path != '': logging.info(f'Uses custom dockerfile template from {args.dockerfile_template_path}') with open(args.dockerfile_template_path, 'r') as f: - dockerfile_template = Template(f.read()) + _dockerfile_template = Template(f.read()) + else: + _dockerfile_template = dockerfile_template create_operator( file_path=grid_wrapper_file_path, repository=args.repository, version=args.version, - dockerfile_template=dockerfile_template, + dockerfile_template=_dockerfile_template, additional_files=args.additional_files, log_level=args.log_level, ) diff --git a/src/c3/create_operator.py b/src/c3/create_operator.py index e5b4e48a..f7a0c845 100644 --- a/src/c3/create_operator.py +++ b/src/c3/create_operator.py @@ -6,7 +6,6 @@ import argparse import subprocess from string import Template -from importlib.metadata import version from c3.pythonscript import Pythonscript from c3.utils import convert_notebook, get_image_version from c3.templates import component_setup_code, dockerfile_template, kfp_component_template, kubernetes_job_template diff --git a/src/c3/utils.py b/src/c3/utils.py index 06baf25c..c5e0dd64 100644 --- a/src/c3/utils.py +++ b/src/c3/utils.py @@ -63,13 +63,13 @@ def increase_image_version(last_version): version = last_version + '.1' logging.debug(f'Failed to increase last value, adding .1') pass - logging.info(f'Using version {version} based on latest tag ({last_version}).') return version def pull_docker_image_tags(image): logging.warning("The current implementation can only query local docker images. 
" "Please use an argument '-v ' to avoid duplicates.") + # TODO: Add script for reading image tags from docker hub # list images output = subprocess.run( ['docker', 'image', 'ls', image], From 2e1a42d637565686b462656430366722807b7abb Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Mon, 23 Oct 2023 17:49:39 +0200 Subject: [PATCH 094/177] Added tests for create_gridwrapper --- tests/example_script.py | 2 ++ tests/test_compiler.py | 37 +++++++++++++++++++++++++++++++++++++ 2 files changed, 39 insertions(+) diff --git a/tests/example_script.py b/tests/example_script.py index c8bc616c..1815c02d 100644 --- a/tests/example_script.py +++ b/tests/example_script.py @@ -30,10 +30,12 @@ def main(*args): print(args) + def process(batch, *args): # process function for grid wrapper print('Execute batch:', batch) main(batch, *args, input_path, batch_size, debug, output_path) + if __name__ == '__main__': main(input_path, batch_size, debug, output_path) diff --git a/tests/test_compiler.py b/tests/test_compiler.py index 703b1eb8..c94121cd 100644 --- a/tests/test_compiler.py +++ b/tests/test_compiler.py @@ -119,3 +119,40 @@ def test_create_operator( file = Path(file_path) file.with_suffix('.yaml').unlink() file.with_suffix('.job.yaml').unlink() + # TODO: Add tests for the created container image + + +test_create_gridwrapper_input = [ + ( + TEST_NOTEBOOK_PATH, + DUMMY_REPO, + 'your_function', + [], + ), + ( + TEST_SCRIPT_PATH, + DUMMY_REPO, + 'process', + [TEST_NOTEBOOK_PATH], + ) +] +@pytest.mark.parametrize( + "file_path, repository, process, args", + test_create_gridwrapper_input, +) +def test_create_gridwrapper( + file_path: str, + repository: str, + process: str, + args: List, +): + subprocess.run(['python', '../src/c3/create_gridwrapper.py', file_path, *args, + '-r', repository, '-p', process], check=True) + + file = Path(file_path) + gw_file = file.parent / f'gw_{file.stem}.py' + + gw_file.with_suffix('.yaml').unlink() + gw_file.with_suffix('.job.yaml').unlink() + gw_file.unlink() + # TODO: Add tests for the created container image \ No newline at end of file From a3b5e2e03d5253472ea4baf30afd4603e34a842a Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Tue, 24 Oct 2023 09:28:03 +0200 Subject: [PATCH 095/177] fix #19 --- src/c3/utils.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/c3/utils.py b/src/c3/utils.py index c5e0dd64..0cce95ea 100644 --- a/src/c3/utils.py +++ b/src/c3/utils.py @@ -35,7 +35,7 @@ def convert_notebook(path): else: # change sh command to os.system() logging.info(f'Replace shell command with os.system() ({line})') - code_lines.append(line.replace('!', 'os.system(', 1).replace('\n', ')\n')) + code_lines.append(line.replace('!', "os.system('", 1).replace('\n', "')\n")) else: # add code code_lines.append(line) From 37f365f18dc55b24bf71ade05d09e3a17d470800 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Tue, 24 Oct 2023 10:25:11 +0200 Subject: [PATCH 096/177] Added nbconvert and ipython --- pyproject.toml | 5 +++ src/c3/parser.py | 4 +-- src/c3/templates/dockerfile_template | 1 + src/c3/utils.py | 46 ++++++++++------------------ 4 files changed, 23 insertions(+), 33 deletions(-) diff --git a/pyproject.toml b/pyproject.toml index bd0af70a..3511e9c1 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -25,6 +25,11 @@ classifiers = [ "License :: OSI Approved :: Apache Software License", "Operating System :: OS Independent", ] +dependencies = [ + 'nbconvert >= 7.9.2', + 'ipython >= 8.16.1', + 'traitlets >= 5.11.2', +] [project.urls] "Homepage" = 
"https://github.com/claimed-framework/c3" diff --git a/src/c3/parser.py b/src/c3/parser.py index 18fee616..8fb4035f 100644 --- a/src/c3/parser.py +++ b/src/c3/parser.py @@ -17,9 +17,7 @@ import os import re -# TODO: Do we need LoggingConfigurable -# from traitlets.config import LoggingConfigurable -LoggingConfigurable = object +from traitlets.config import LoggingConfigurable from typing import TypeVar, List, Dict diff --git a/src/c3/templates/dockerfile_template b/src/c3/templates/dockerfile_template index c82a7435..be059532 100644 --- a/src/c3/templates/dockerfile_template +++ b/src/c3/templates/dockerfile_template @@ -2,6 +2,7 @@ FROM registry.access.redhat.com/ubi8/python-39 USER root RUN dnf install -y java-11-openjdk USER default +RUN pip install ipython ${requirements_docker} ADD ${target_code} /opt/app-root/src/ ADD ${additional_files_path} /opt/app-root/src/ diff --git a/src/c3/utils.py b/src/c3/utils.py index 0cce95ea..651a0dc4 100644 --- a/src/c3/utils.py +++ b/src/c3/utils.py @@ -1,47 +1,33 @@ import os import logging -import json +import nbformat import re import subprocess +from nbconvert.exporters import PythonExporter def convert_notebook(path): - # TODO: switch to nbconvert long-term (need to replace pip install) - with open(path) as json_file: - notebook = json.load(json_file) + notebook = nbformat.read(path, as_version=4) - # backwards compatibility + # backwards compatibility (v0.1 description was included in second cell, merge first two markdown cells) if notebook['cells'][0]['cell_type'] == 'markdown' and notebook['cells'][1]['cell_type'] == 'markdown': logging.info('Merge first two markdown cells. File name is used as operator name, not first markdown cell.') - notebook['cells'][1]['source'] = notebook['cells'][0]['source'] + ['\n'] + notebook['cells'][1]['source'] + notebook['cells'][1]['source'] = notebook['cells'][0]['source'] + '\n' + notebook['cells'][1]['source'] notebook['cells'] = notebook['cells'][1:] - code_lines = [] for cell in notebook['cells']: if cell['cell_type'] == 'markdown': - # add markdown as doc string - code_lines.extend(['"""\n'] + [f'{line}' for line in cell['source']] + ['\n"""']) - elif cell['cell_type'] == 'code' and cell['source'][0].startswith('%%bash'): - code_lines.append('os.system("""') - code_lines.extend(cell['source'][1:]) - code_lines.append('""")') - elif cell['cell_type'] == 'code': - for line in cell['source']: - if line.strip().startswith('!'): - # convert sh scripts - if re.search('![ ]*pip', line): - # change pip install to comment - code_lines.append(re.sub('![ ]*pip', '# pip', line)) - else: - # change sh command to os.system() - logging.info(f'Replace shell command with os.system() ({line})') - code_lines.append(line.replace('!', "os.system('", 1).replace('\n', "')\n")) - else: - # add code - code_lines.append(line) - # add line break after cell - code_lines.append('\n') - code = ''.join(code_lines) + # convert markdown to doc string + cell['cell_type'] = 'code' + cell['source'] = '"""\n' + cell['source'] + '\n"""' + cell['outputs'] = [] + cell['execution_count'] = 0 + if cell['cell_type'] == 'code' and re.search('![ ]*pip', cell['source']): + # replace !pip with #pip + cell['source'] = re.sub('![ ]*pip[ ]*install', '# pip install', cell['source']) + + # convert tp python script + (code, _) = PythonExporter().from_notebook_node(notebook) py_path = path.split('/')[-1].replace('.ipynb', '.py') From cd18edfa83f6e5c284f2bef0f8c6d1970f25ce9a Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Tue, 24 Oct 2023 13:36:29 
+0200 Subject: [PATCH 097/177] Fixed default value from env variables Signed-off-by: Benedikt Blumenstiel --- src/c3/create_operator.py | 4 ++-- src/c3/parser.py | 3 +-- src/c3/pythonscript.py | 15 ++++++++++----- tests/example_script.py | 4 ++-- 4 files changed, 15 insertions(+), 11 deletions(-) diff --git a/src/c3/create_operator.py b/src/c3/create_operator.py index f7a0c845..7668fd1c 100644 --- a/src/c3/create_operator.py +++ b/src/c3/create_operator.py @@ -141,8 +141,8 @@ def get_component_interface(parameters): for name, options in parameters.items(): return_string += f'- {{name: {name}, type: {options["type"]}, description: "{options["description"]}"' if options['default'] is not None: - if not options["default"].startswith("'"): - options["default"] = f"'{options['default']}'" + if not options["default"].startswith('"'): + options["default"] = f'"{options["default"]}"' return_string += f', default: {options["default"]}' return_string += '}\n' return return_string diff --git a/src/c3/parser.py b/src/c3/parser.py index 8fb4035f..8bfd1ffa 100644 --- a/src/c3/parser.py +++ b/src/c3/parser.py @@ -123,8 +123,7 @@ def search_expressions(self) -> Dict[str, List]: # Second regex matches envvar assignments that use os.getenv("name", "value") with ow w/o default provided # Third regex matches envvar assignments that use os.environ.get("name", "value") with or w/o default provided # Both name and value are captured if possible - envs = [r"os\.environ\[[\"']([a-zA-Z_]+[A-Za-z0-9_]*)[\"']\](?:\s*=(?:\s*[\"'](.[^\"']*)?[\"'])?)*", - r"os\.getenv\([\"']([a-zA-Z_]+[A-Za-z0-9_]*)[\"'](?:\s*\,\s*[\"'](.[^\"']*)?[\"'])?", + envs = [r"os\.getenv\([\"']([a-zA-Z_]+[A-Za-z0-9_]*)[\"'](?:\s*\,\s*[\"'](.[^\"']*)?[\"'])?", r"os\.environ\.get\([\"']([a-zA-Z_]+[A-Za-z0-9_]*)[\"'](?:\s*\,(?:\s*[\"'](.[^\"']*)?[\"'])?)*"] regex_dict["env_vars"] = envs return regex_dict diff --git a/src/c3/pythonscript.py b/src/c3/pythonscript.py index 8d2ab0dd..11fa48ee 100644 --- a/src/c3/pythonscript.py +++ b/src/c3/pythonscript.py @@ -31,16 +31,21 @@ def _get_env_vars(self): comment_line = '' if comment_line == '': logging.info(f'Interface: No description for variable {env_name} provided.') - if "int(" in line: + if re.search(r'=\s*int\(\s*os', line): type = 'Integer' - elif "float(" in line: + elif re.search(r'=\s*float\(\s*os', line): type = 'Float' - elif "bool(" in line: + elif re.search(r'=\s*bool\(\s*os', line): type = 'Boolean' else: type = 'String' - if ',' in line: - default = line.split(',', 1)[1].rstrip(') ').strip().replace("\"", "\'") + # get default value + if ',' in line and type == 'String': + # extract default string value with regex and replace " with ' to avoid errors + default = re.search(r",\s*(['\"].*?['\"])\)", line).group(1)[1:-1].replace("\"", "\'") + elif ',' in line: + # extract int, float, bool + default = re.search(r",\s*(.*?)\)", line).group(1) else: default = None return_value[env_name] = { diff --git a/tests/example_script.py b/tests/example_script.py index 1815c02d..32d48bd7 100644 --- a/tests/example_script.py +++ b/tests/example_script.py @@ -10,10 +10,10 @@ import numpy as np # A comment one line above os.getenv is the description of this variable. 
-input_path = os.getenv('input_path') +input_path = os.environ.get('input_path', 'test') # ('not this') # type casting to int(), float(), or bool() -batch_size = int(os.getenv('batch_size', 16)) +batch_size = int(os.environ.get('batch_size', 16)) # (not this) # Commas in the previous comment are deleted because the yaml file requires descriptions without commas. debug = bool(os.getenv('debug', False)) From bf9e25f2b029e0718d645e8ad03b92efa9b28f5f Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Tue, 24 Oct 2023 13:57:25 +0200 Subject: [PATCH 098/177] Fixed None default value Signed-off-by: Benedikt Blumenstiel --- src/c3/pythonscript.py | 10 +++++----- tests/example_script.py | 2 +- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/src/c3/pythonscript.py b/src/c3/pythonscript.py index 11fa48ee..1d4c551c 100644 --- a/src/c3/pythonscript.py +++ b/src/c3/pythonscript.py @@ -40,12 +40,12 @@ def _get_env_vars(self): else: type = 'String' # get default value - if ',' in line and type == 'String': - # extract default string value with regex and replace " with ' to avoid errors - default = re.search(r",\s*(['\"].*?['\"])\)", line).group(1)[1:-1].replace("\"", "\'") - elif ',' in line: + if ',' in line: # extract int, float, bool - default = re.search(r",\s*(.*?)\)", line).group(1) + default = re.search(r",\s*(.*?)\s*\)", line).group(1) + if type == 'String' and default != 'None': + # Process string default value + default = default[1:-1].replace("\"", "\'") else: default = None return_value[env_name] = { diff --git a/tests/example_script.py b/tests/example_script.py index 32d48bd7..4de5f042 100644 --- a/tests/example_script.py +++ b/tests/example_script.py @@ -10,7 +10,7 @@ import numpy as np # A comment one line above os.getenv is the description of this variable. 
-input_path = os.environ.get('input_path', 'test') # ('not this') +input_path = os.environ.get('input_path', None ) # ('not this') # type casting to int(), float(), or bool() batch_size = int(os.environ.get('batch_size', 16)) # (not this) From 646819754f84221befe45b4ce56c88a02e40a76e Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Tue, 24 Oct 2023 14:00:18 +0200 Subject: [PATCH 099/177] Fix default value Signed-off-by: Benedikt Blumenstiel --- src/c3/pythonscript.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/c3/pythonscript.py b/src/c3/pythonscript.py index 1d4c551c..98e6f73b 100644 --- a/src/c3/pythonscript.py +++ b/src/c3/pythonscript.py @@ -40,7 +40,7 @@ def _get_env_vars(self): else: type = 'String' # get default value - if ',' in line: + if re.search(r"\(.*,.*\)", line): # extract int, float, bool default = re.search(r",\s*(.*?)\s*\)", line).group(1) if type == 'String' and default != 'None': From ff92da8f05e7e63d859b742ebeb94fa97ab90e50 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Tue, 24 Oct 2023 14:17:29 +0200 Subject: [PATCH 100/177] Stop processing after docker error Signed-off-by: Benedikt Blumenstiel --- src/c3/create_gridwrapper.py | 2 ++ src/c3/create_operator.py | 50 ++++++++++++++++++++---------------- tests/test_compiler.py | 5 ++-- 3 files changed, 33 insertions(+), 24 deletions(-) diff --git a/src/c3/create_gridwrapper.py b/src/c3/create_gridwrapper.py index 70d7eb57..2dfcd729 100644 --- a/src/c3/create_gridwrapper.py +++ b/src/c3/create_gridwrapper.py @@ -151,6 +151,7 @@ def main(): parser.add_argument('-l', '--log_level', type=str, default='INFO') parser.add_argument('--dockerfile_template_path', type=str, default='', help='Path to custom dockerfile template') + parser.add_argument('--test_mode', action='store_true') args = parser.parse_args() # Init logging @@ -185,6 +186,7 @@ def main(): dockerfile_template=_dockerfile_template, additional_files=args.additional_files, log_level=args.log_level, + test_mode=args.test_mode, ) logging.info('Remove local component file') diff --git a/src/c3/create_operator.py b/src/c3/create_operator.py index 7668fd1c..46647d3f 100644 --- a/src/c3/create_operator.py +++ b/src/c3/create_operator.py @@ -19,6 +19,7 @@ def create_operator(file_path: str, dockerfile_template: str, additional_files: str = None, log_level='INFO', + test_mode=False, ): logging.info('Parameters: ') logging.info('file_path: ' + file_path) @@ -101,24 +102,20 @@ def create_operator(file_path: str, version = get_image_version(repository, name) logging.info(f'Building container image claimed-{name}:{version}') - try: - subprocess.run( - ['docker', 'build', '--platform', 'linux/amd64', '-t', f'claimed-{name}:{version}', '.'], - stdout=None if log_level == 'DEBUG' else subprocess.PIPE, check=True, - ) - logging.debug(f'Tagging images with "latest" and "{version}"') - subprocess.run( - ['docker', 'tag', f'claimed-{name}:{version}', f'{repository}/claimed-{name}:{version}'], - stdout=None if log_level == 'DEBUG' else subprocess.PIPE, check=True, - ) - subprocess.run( - ['docker', 'tag', f'claimed-{name}:{version}', f'{repository}/claimed-{name}:latest'], - stdout=None if log_level == 'DEBUG' else subprocess.PIPE, check=True, - ) - logging.info('Successfully built image') - except: - logging.error(f'Failed to build image with docker.') - pass + subprocess.run( + ['docker', 'build', '--platform', 'linux/amd64', '-t', f'claimed-{name}:{version}', '.'], + stdout=None if log_level == 'DEBUG' else subprocess.PIPE, check=True, + ) 
+ logging.debug(f'Tagging images with "latest" and "{version}"') + subprocess.run( + ['docker', 'tag', f'claimed-{name}:{version}', f'{repository}/claimed-{name}:{version}'], + stdout=None if log_level == 'DEBUG' else subprocess.PIPE, check=True, + ) + subprocess.run( + ['docker', 'tag', f'claimed-{name}:{version}', f'{repository}/claimed-{name}:latest'], + stdout=None if log_level == 'DEBUG' else subprocess.PIPE, check=True, + ) + logging.info('Successfully built image') logging.info(f'Pushing images to registry {repository}') try: @@ -131,10 +128,18 @@ def create_operator(file_path: str, stdout=None if log_level == 'DEBUG' else subprocess.PIPE, check=True, ) logging.info('Successfully pushed image to registry') - except: + except Exception as err: logging.error(f'Could not push images to namespace {repository}. ' f'Please check if docker is logged in or select a namespace with access.') - pass + if test_mode: + logging.info('Continue processing (test mode).') + pass + else: + if file_path != target_code: + os.remove(target_code) + os.remove('Dockerfile') + shutil.rmtree(additional_files_path, ignore_errors=True) + raise err def get_component_interface(parameters): return_string = str() @@ -203,8 +208,7 @@ def get_component_interface(parameters): if file_path != target_code: os.remove(target_code) os.remove('Dockerfile') - if additional_files_path is not None: - shutil.rmtree(additional_files_path, ignore_errors=True) + shutil.rmtree(additional_files_path, ignore_errors=True) def main(): @@ -220,6 +224,7 @@ def main(): parser.add_argument('-l', '--log_level', type=str, default='INFO') parser.add_argument('--dockerfile_template_path', type=str, default='', help='Path to custom dockerfile template') + parser.add_argument('--test_mode', action='store_true') args = parser.parse_args() # Init logging @@ -246,6 +251,7 @@ def main(): dockerfile_template=_dockerfile_template, additional_files=args.ADDITIONAL_FILES, log_level=args.log_level, + test_mode=args.test_mode, ) diff --git a/tests/test_compiler.py b/tests/test_compiler.py index c94121cd..5a352d2a 100644 --- a/tests/test_compiler.py +++ b/tests/test_compiler.py @@ -114,7 +114,8 @@ def test_create_operator( repository: str, args: List, ): - subprocess.run(['python', '../src/c3/create_operator.py', file_path, *args, '-r', repository], check=True) + subprocess.run(['python', '../src/c3/create_operator.py', file_path, *args, '-r', repository, '--test_mode'], + check=True) file = Path(file_path) file.with_suffix('.yaml').unlink() @@ -147,7 +148,7 @@ def test_create_gridwrapper( args: List, ): subprocess.run(['python', '../src/c3/create_gridwrapper.py', file_path, *args, - '-r', repository, '-p', process], check=True) + '-r', repository, '-p', process, '--test_mode'], check=True) file = Path(file_path) gw_file = file.parent / f'gw_{file.stem}.py' From be7d71a3c9d18eea94b6853eab264c61946b2bfa Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Mon, 6 Nov 2023 09:43:01 +0100 Subject: [PATCH 101/177] add / improve examples --- examples/operator_example.ipynb | 81 +++++++++++++++++++++++++++++++++ examples/operator_example.py | 4 +- 2 files changed, 83 insertions(+), 2 deletions(-) create mode 100644 examples/operator_example.ipynb diff --git a/examples/operator_example.ipynb b/examples/operator_example.ipynb new file mode 100644 index 00000000..c32380ad --- /dev/null +++ b/examples/operator_example.ipynb @@ -0,0 +1,81 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# operator_example" + ] + }, + { + "cell_type": 
"markdown", + "metadata": {}, + "source": [ + "Please update the description of the operator in this markdown cell" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# add any requirements (Hint: pip install -r requirements.txt is supported as well)\n", + "# commenting out the pip install command is supported as well\n", + "#!pip install numpy" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import logging\n", + "import numpy as np" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# TODO: Add the operator interface.\n", + "# A comment one line above os.getenv is the description of this variable.\n", + "input_path = os.getenv('input_path')\n", + "\n", + "# If you specify a default value, this parameter gets marked as optional\n", + "with_default = os.getenv('with_default', 'default_value')\n", + "\n", + "# You can cast to a specific type with int(), float(), or bool() - this type information propagates down to the execution engines (e.g., Kubeflow)\n", + "num_values = int(os.getenv('num_values', 5))\n", + "\n", + "# Output paths are starting with \"output_\".\n", + "output_path = os.getenv('output_path', None)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# TODO: Add your code.\n", + "random_values = np.random.rand(num_values)\n", + "\n", + "# C3 adds setup code to your notebook which initalize the logging.\n", + "# You can just use logging.debug(), logging.info(), logging.warning() in your code.\n", + "logging.info(f'Random values: {random_values}')" + ] + } + ], + "metadata": { + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/examples/operator_example.py b/examples/operator_example.py index 9751af6c..abff15c9 100644 --- a/examples/operator_example.py +++ b/examples/operator_example.py @@ -16,10 +16,10 @@ # A comment one line above os.getenv is the description of this variable. input_path = os.getenv('input_path') -# Optionally, you can set a default values. +# If you specify a default value, this parameter gets marked as optional with_default = os.getenv('with_default', 'default_value') -# You can cast a specific type with int(), float(), or bool(). +# You can cast to a specific type with int(), float(), or bool() - this type information propagates down to the execution engines (e.g., Kubeflow) num_values = int(os.getenv('num_values', 5)) # Output paths are starting with "output_". 
From a68b826bcf9af4fca608229f435d619d468ada2a Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Thu, 9 Nov 2023 15:51:14 +0100 Subject: [PATCH 102/177] Changed grid wrapper timeout to 3 hours --- src/c3/templates/cos_grid_wrapper_template.py | 4 ++-- src/c3/templates/grid_wrapper_template.py | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/src/c3/templates/cos_grid_wrapper_template.py b/src/c3/templates/cos_grid_wrapper_template.py index 684e4f6c..83281a53 100644 --- a/src/c3/templates/cos_grid_wrapper_template.py +++ b/src/c3/templates/cos_grid_wrapper_template.py @@ -73,8 +73,8 @@ gw_processed_file_suffix = os.environ.get('gw_lock_file_suffix', '.processed') # error file suffix gw_error_file_suffix = os.environ.get('gw_error_file_suffix', '.err') -# timeout in seconds to remove lock file from struggling job (default 1 hour) -gw_lock_timeout = int(os.environ.get('gw_lock_timeout', 3600)) +# timeout in seconds to remove lock file from struggling job (default 3 hours) +gw_lock_timeout = int(os.environ.get('gw_lock_timeout', 10800)) # ignore error files and rerun batches with errors gw_ignore_error_files = bool(os.environ.get('gw_ignore_error_files', False)) diff --git a/src/c3/templates/grid_wrapper_template.py b/src/c3/templates/grid_wrapper_template.py index fd9b989e..f4fd6188 100644 --- a/src/c3/templates/grid_wrapper_template.py +++ b/src/c3/templates/grid_wrapper_template.py @@ -33,8 +33,8 @@ gw_processed_file_suffix = os.environ.get('gw_lock_file_suffix', '.processed') # error file suffix gw_error_file_suffix = os.environ.get('gw_error_file_suffix', '.err') -# timeout in seconds to remove lock file from struggling job (default 1 hour) -gw_lock_timeout = int(os.environ.get('gw_lock_timeout', 3600)) +# timeout in seconds to remove lock file from struggling job (default 3 hours) +gw_lock_timeout = int(os.environ.get('gw_lock_timeout', 10800)) # ignore error files and rerun batches with errors gw_ignore_error_files = bool(os.environ.get('gw_ignore_error_files', False)) From 10bc3bfbd7a1dbe482ff03b02ea23808c7d205f0 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Thu, 9 Nov 2023 21:11:55 +0100 Subject: [PATCH 103/177] Added R script compiler --- GettingStarted.md | 21 ++++- README.md | 6 +- pyproject.toml | 3 +- src/c3/create_gridwrapper.py | 8 +- src/c3/create_operator.py | 87 ++++++++++++++----- src/c3/parser.py | 15 ++-- src/c3/pythonscript.py | 22 +++-- src/c3/rscript.py | 80 +++++++++++++++++ src/c3/templates/R_dockerfile_template | 9 ++ src/c3/templates/__init__.py | 20 +++-- src/c3/templates/component_setup_code.R | 14 +++ src/c3/templates/kfp_component_template.yaml | 2 +- .../kubernetes_job_template.job.yaml | 2 +- ...le_template => python_dockerfile_template} | 5 +- src/c3/utils.py | 3 + src/setup.py | 10 +++ tests/example_rscript.R | 17 ++++ tests/example_script.py | 4 +- tests/test_compiler.py | 23 +++-- 19 files changed, 283 insertions(+), 68 deletions(-) create mode 100644 src/c3/rscript.py create mode 100644 src/c3/templates/R_dockerfile_template create mode 100644 src/c3/templates/component_setup_code.R rename src/c3/templates/{dockerfile_template => python_dockerfile_template} (68%) create mode 100644 src/setup.py create mode 100644 tests/example_rscript.R diff --git a/GettingStarted.md b/GettingStarted.md index 80c5aef8..8b922405 100644 --- a/GettingStarted.md +++ b/GettingStarted.md @@ -170,6 +170,7 @@ kubectl apply -f .job.yaml # kill job kubectl delete -f .job.yaml ``` +Note that calling `kubectl apply` two times can lead to an 
error because jobs have unique names. If a job with the same name is running, you might need to kill the job before restarting it. The job creates a pod which is accessible via the browser UI or via CLI using the standard kubectl commands. ```sh @@ -273,13 +274,25 @@ Your operator script has to follow certain requirements to be processed by C3. C - The interface is defined by environment variables `my_parameter = os.getenv('my_parameter')`. Output paths start with `output_`. Note that operators cannot return values but always have to save outputs in files. - You can cast a specific type by wrapping `os.getenv()` with `int()`, `float()`, `bool()`. The default type is string. Only these four types are currently supported. You can use `None` as a default value but not pass the `NoneType` via the `job.yaml`. +You can optionally install additional tools with `dnf` by adding a comment `# dnf <package>`. + #### iPython notebooks - The operator name is the notebook file: `my_operator_name.ipynb` -> `claimed-my-operator-name` -- The notebook is converted to a python script before creating the operator by merging all cells. -- Markdown cells are converted into doc strings. Shell commands with `!...` are converted into `os.system()`. +- The notebook is converted by `nbconvert` to a python script before creating the operator by merging all cells. +- Markdown cells are converted into doc strings. Shell commands with `!...` are converted into `get_ipython().run_line_magic()`. - The requirements of python scripts apply to the notebook code (The operator description can be a markdown cell). +#### R scripts + +- The operator name is the R file: `my_operator_name.R` -> `claimed-my-operator-name` +- The operator description currently defaults to the operator name (R scripts do not yet support a custom description). +- The required R packages are installed with: `install.packages(<package>, repos=<repository>)` +- The interface is defined by environment variables `my_parameter <- Sys.getenv('my_parameter', 'optional_default_value')` (see the parsing sketch below). Output paths start with `output_`. Note that operators cannot return values but always have to save outputs in files. +- You can cast a specific type by wrapping `Sys.getenv()` with `as.numeric()` or `as.logical()`. The default type is string. Only these three types are currently supported. You can use `NULL` as a default value but not pass `NULL` via the `job.yaml`. + +You can optionally install additional tools with `apt` by adding a comment `# apt <package>`. + #### Example The following is an example python script `example_script.py` that can be compiled by C3. @@ -362,7 +375,7 @@ If you don't have access to the repository, C3 still creates the docker image an View all arguments by running: ```sh -c3_create_operator --help +c3_create_operator --help ``` C3 generates the container image that is pushed to the registry, a `.yaml` file for KubeFlow, and a `.job.yaml` that can be directly used as described above. @@ -401,6 +414,8 @@ if __name__ == '__main__': my_function(parameter1, parameter2) ``` +Note that grid computing is currently not implemented for R scripts. + ### 5.2 Compile a grid wrapper with C3 The compilation is similar to an operator. Additionally, the name of the grid process is passed to `create_grid_wrapper.py` using `--process` or `-p`.
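The `Sys.getenv` interface lines described above are picked up by the regex that the new `RScriptParser` introduces later in this patch. A quick check of how that pattern captures the variable name and the optional default (pattern copied as added in this patch):

```python
import re

# env-var pattern from RScriptParser.search_expressions() in this patch
R_ENV = r".*Sys\.getenv\([\"']*([a-zA-Z_]+[A-Za-z0-9_]*)[\"']*(?:\s*\,\s*[\"']?([A-Za-z0-9_]*)?[\"']?)?\).*"

for line in ["name <- Sys.getenv('name')",
             "number <- as.numeric(Sys.getenv('number', 10))"]:
    m = re.match(R_ENV, line)
    print(m.group(1), repr(m.group(2)))
# prints: name None
#         number '10'
```

Note that the default capture is restricted to `[A-Za-z0-9_]*` at this point; later patches in this series loosen it to allow arbitrary default strings.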
diff --git a/README.md b/README.md index f4dc29b0..a27c4752 100644 --- a/README.md +++ b/README.md @@ -6,7 +6,7 @@ # C3 - the CLAIMED Component Compiler **TL;DR** -- takes arbitrary assets (Jupyter notebooks, python scripts) as input +- takes arbitrary assets (Jupyter notebooks, python scripts, R scripts) as input - automatically creates container images and pushes to container registries - automatically installs all required dependencies into the container image - creates KubeFlow Pipeline components (target workflow execution engines are pluggable) @@ -26,14 +26,14 @@ To learn more on how this library works in practice, please have a look at the f ### Install ```sh -pip install claimed-c3 +pip install claimed ``` ### Usage Just run the following command with your python script or notebook: ```sh -c3_create_operator --repository "/" ".py" +c3_create_operator ".py" --repository "/" ``` Your code needs to follow certain requirements which are explained in [Getting Started](https://github.com/claimed-framework/c3/blob/main/GettingStarted.md). diff --git a/pyproject.toml b/pyproject.toml index 3511e9c1..5ed2abb6 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -7,7 +7,8 @@ version_file = "src/c3/_version.py" [project] name = "claimed-c3" -dynamic = ["version"] +# dynamic = ["version"] +version = "0.2.6" authors = [ { name="The CLAIMED authors", email="claimed-framework@proton.me"}, ] diff --git a/src/c3/create_gridwrapper.py b/src/c3/create_gridwrapper.py index 2dfcd729..eb7e5a72 100644 --- a/src/c3/create_gridwrapper.py +++ b/src/c3/create_gridwrapper.py @@ -7,7 +7,7 @@ from c3.pythonscript import Pythonscript from c3.utils import convert_notebook from c3.create_operator import create_operator -from c3.templates import grid_wrapper_template, cos_grid_wrapper_template, gw_component_setup_code, dockerfile_template +from c3.templates import grid_wrapper_template, cos_grid_wrapper_template, gw_component_setup_code def wrap_component(component_path, @@ -175,15 +175,15 @@ def main(): if args.dockerfile_template_path != '': logging.info(f'Uses custom dockerfile template from {args.dockerfile_template_path}') with open(args.dockerfile_template_path, 'r') as f: - _dockerfile_template = Template(f.read()) + custom_dockerfile_template = Template(f.read()) else: - _dockerfile_template = dockerfile_template + custom_dockerfile_template = None create_operator( file_path=grid_wrapper_file_path, repository=args.repository, version=args.version, - dockerfile_template=_dockerfile_template, + custom_dockerfile_template=custom_dockerfile_template, additional_files=args.additional_files, log_level=args.log_level, test_mode=args.test_mode, diff --git a/src/c3/create_operator.py b/src/c3/create_operator.py index 46647d3f..11372ddd 100644 --- a/src/c3/create_operator.py +++ b/src/c3/create_operator.py @@ -5,10 +5,15 @@ import shutil import argparse import subprocess +from pathlib import Path from string import Template +from typing import Optional from c3.pythonscript import Pythonscript +from c3.rscript import Rscript from c3.utils import convert_notebook, get_image_version -from c3.templates import component_setup_code, dockerfile_template, kfp_component_template, kubernetes_job_template +from c3.templates import (python_component_setup_code, r_component_setup_code, + python_dockerfile_template, r_dockerfile_template, + kfp_component_template, kubernetes_job_template, ) CLAIMED_VERSION = 'V0.1' @@ -16,7 +21,7 @@ def create_operator(file_path: str, repository: str, version: str, - dockerfile_template: str, + 
custom_dockerfile_template: Optional[Template], additional_files: str = None, log_level='INFO', test_mode=False, @@ -31,6 +36,7 @@ def create_operator(file_path: str, if file_path.endswith('.ipynb'): logging.info('Convert notebook to python script') target_code = convert_notebook(file_path) + command = '/opt/app-root/bin/ipython' elif file_path.endswith('.py'): target_code = file_path.split('/')[-1] if file_path == target_code: @@ -38,29 +44,49 @@ def create_operator(file_path: str, target_code = 'claimed_' + target_code # Copy file to current working directory shutil.copy(file_path, target_code) + command = '/opt/app-root/bin/python' + elif file_path.lower().endswith('.r'): + target_code = file_path.split('/')[-1] + if file_path == target_code: + # use temp file for processing + target_code = 'claimed_' + target_code + # Copy file to current working directory + shutil.copy(file_path, target_code) + command = 'Rscript' else: - raise NotImplementedError('Please provide a file_path to a jupyter notebook or python script.') + raise NotImplementedError('Please provide a file_path to a jupyter notebook, python script, or R script.') if target_code.endswith('.py'): # Add code for logging and cli parameters to the beginning of the script with open(target_code, 'r') as f: script = f.read() - script = component_setup_code + script + script = python_component_setup_code + script with open(target_code, 'w') as f: f.write(script) # getting parameter from the script - py = Pythonscript(target_code) - name = py.get_name() - # convert description into a string with a single line - description = ('"' + py.get_description().replace('\n', ' ').replace('"', '\'') + - ' – CLAIMED ' + CLAIMED_VERSION + '"') - inputs = py.get_inputs() - outputs = py.get_outputs() - requirements = py.get_requirements() + script_data = Pythonscript(target_code) + dockerfile_template = custom_dockerfile_template or python_dockerfile_template + elif target_code.lower().endswith('.r'): + # Add code for logging and cli parameters to the beginning of the script + with open(target_code, 'r') as f: + script = f.read() + script = r_component_setup_code + script + with open(target_code, 'w') as f: + f.write(script) + # getting parameter from the script + script_data = Rscript(target_code) + dockerfile_template = custom_dockerfile_template or r_dockerfile_template else: - raise NotImplementedError('C3 currently only supports jupyter notebook or python script.') - + raise NotImplementedError('C3 currently only supports jupyter notebooks, python scripts, and R scripts.') + + name = script_data.get_name() + # convert description into a string with a single line + description = ('"' + script_data.get_description().replace('\n', ' ').replace('"', '\'') + + ' – CLAIMED ' + CLAIMED_VERSION + '"') + inputs = script_data.get_inputs() + outputs = script_data.get_outputs() + requirements = script_data.get_requirements() # Strip 'claimed-' from name of copied temp file if name.startswith('claimed-'): name = name[8:] @@ -91,6 +117,7 @@ def create_operator(file_path: str, requirements_docker=requirements_docker, target_code=target_code, additional_files_path=additional_files_path, + command=os.path.basename(command) ) logging.info('Create Dockerfile') @@ -102,10 +129,19 @@ def create_operator(file_path: str, version = get_image_version(repository, name) logging.info(f'Building container image claimed-{name}:{version}') - subprocess.run( - ['docker', 'build', '--platform', 'linux/amd64', '-t', f'claimed-{name}:{version}', '.'], - stdout=None if log_level 
== 'DEBUG' else subprocess.PIPE, check=True, - ) + try: + subprocess.run( + ['docker', 'build', '--platform', 'linux/amd64', '-t', f'claimed-{name}:{version}', '.'], + stdout=None if log_level == 'DEBUG' else subprocess.PIPE, check=True, + ) + except Exception as err: + # remove temp files + if file_path != target_code: + os.remove(target_code) + os.remove('Dockerfile') + shutil.rmtree(additional_files_path, ignore_errors=True) + raise err + logging.debug(f'Tagging images with "latest" and "{version}"') subprocess.run( ['docker', 'tag', f'claimed-{name}:{version}', f'{repository}/claimed-{name}:{version}'], @@ -135,6 +171,7 @@ def create_operator(file_path: str, logging.info('Continue processing (test mode).') pass else: + # remove temp files if file_path != target_code: os.remove(target_code) os.remove('Dockerfile') @@ -164,6 +201,7 @@ def get_component_interface(parameters): for input_key in outputs.keys(): parameter_values += f" - {{outputPath: {input_key}}}\n" + # TODO: Check call and command in kfp pipeline for R script yaml = kfp_component_template.substitute( name=name, description=description, @@ -171,12 +209,12 @@ def get_component_interface(parameters): version=version, inputs=inputs_list, outputs=outputs_list, - call=f'./{target_code} {parameter_list}', + call=f'{os.path.basename(command)} ./{target_code} {parameter_list}', parameter_values=parameter_values, ) logging.debug('KubeFlow component yaml:\n' + yaml) - target_yaml_path = file_path.replace('.ipynb', '.yaml').replace('.py', '.yaml') + target_yaml_path = str(Path(file_path).with_suffix('.yaml')) logging.info(f'Write KubeFlow component yaml to {target_yaml_path}') with open(target_yaml_path, "w") as text_file: @@ -194,10 +232,11 @@ def get_component_interface(parameters): version=version, target_code=target_code, env_entries=env_entries, + command=command, ) logging.debug('Kubernetes job yaml:\n' + job_yaml) - target_job_yaml_path = file_path.replace('.ipynb', '.job.yaml').replace('.py', '.job.yaml') + target_job_yaml_path = str(Path(file_path).with_suffix('.job.yaml')) logging.info(f'Write kubernetes job yaml to {target_job_yaml_path}') with open(target_job_yaml_path, "w") as text_file: @@ -240,15 +279,15 @@ def main(): if args.dockerfile_template_path != '': logging.info(f'Uses custom dockerfile template from {args.dockerfile_template_path}') with open(args.dockerfile_template_path, 'r') as f: - _dockerfile_template = Template(f.read()) + custom_dockerfile_template = Template(f.read()) else: - _dockerfile_template = dockerfile_template + custom_dockerfile_template = None create_operator( file_path=args.FILE_PATH, repository=args.repository, version=args.version, - dockerfile_template=_dockerfile_template, + custom_dockerfile_template=custom_dockerfile_template, additional_files=args.ADDITIONAL_FILES, log_level=args.log_level, test_mode=args.test_mode, diff --git a/src/c3/parser.py b/src/c3/parser.py index 8bfd1ffa..da9d0ecf 100644 --- a/src/c3/parser.py +++ b/src/c3/parser.py @@ -41,7 +41,7 @@ def filepath(self): @property def language(self) -> str: - file_extension = os.path.splitext(self._filepath)[-1] + file_extension = os.path.splitext(self._filepath)[-1].lower() if file_extension == '.py': return 'python' elif file_extension == '.r': @@ -119,9 +119,8 @@ def search_expressions(self) -> Dict[str, List]: # TODO: add more key:list-of-regex pairs to parse for additional resources regex_dict = dict() - # First regex matches envvar assignments of form os.environ["name"] = value w or w/o value provided - # Second regex 
matches envvar assignments that use os.getenv("name", "value") with ow w/o default provided - # Third regex matches envvar assignments that use os.environ.get("name", "value") with or w/o default provided + # First regex matches envvar assignments that use os.getenv("name", "value") with ow w/o default provided + # Second regex matches envvar assignments that use os.environ.get("name", "value") with or w/o default provided # Both name and value are captured if possible envs = [r"os\.getenv\([\"']([a-zA-Z_]+[A-Za-z0-9_]*)[\"'](?:\s*\,\s*[\"'](.[^\"']*)?[\"'])?", r"os\.environ\.get\([\"']([a-zA-Z_]+[A-Za-z0-9_]*)[\"'](?:\s*\,(?:\s*[\"'](.[^\"']*)?[\"'])?)*"] @@ -131,12 +130,10 @@ def search_expressions(self) -> Dict[str, List]: class RScriptParser(ScriptParser): def search_expressions(self) -> Dict[str, List]: - # TODO: add more key:list-of-regex pairs to parse for additional resources regex_dict = dict() - # Tests for matches of the form Sys.setenv("key" = "value") - envs = [r"Sys\.setenv\([\"']*([a-zA-Z_]+[A-Za-z0-9_]*)[\"']*\s*=\s*[\"']*(.[^\"']*)?[\"']*\)", - r"Sys\.getenv\([\"']*([a-zA-Z_]+[A-Za-z0-9_]*)[\"']*\)(.)*"] + # Tests for matches of the form: var <- Sys.getenv("key", "optional default") + envs = [r".*Sys\.getenv\([\"']*([a-zA-Z_]+[A-Za-z0-9_]*)[\"']*(?:\s*\,\s*[\"']?([A-Za-z0-9_]*)?[\"']?)?\).*"] regex_dict["env_vars"] = envs return regex_dict @@ -188,7 +185,7 @@ def _get_reader(self, filepath: str): if file_extension == '.ipynb': return NotebookReader(filepath) - elif file_extension in ['.py', '.r']: + elif file_extension.lower() in ['.py', '.r']: return FileReader(filepath) else: raise ValueError(f'File type {file_extension} is not supported.') diff --git a/src/c3/pythonscript.py b/src/c3/pythonscript.py index 98e6f73b..c0211dc4 100644 --- a/src/c3/pythonscript.py +++ b/src/c3/pythonscript.py @@ -1,4 +1,4 @@ -import json + import logging import os import re @@ -6,15 +6,18 @@ class Pythonscript: - def __init__(self, path, function_name: str = None): + def __init__(self, path): self.path = path with open(path, 'r') as f: self.script = f.read() self.name = os.path.basename(path)[:-3].replace('_', '-') - assert '"""' in self.script, 'Please provide a description of the operator in the first doc string.' 
- self.description = self.script.split('"""')[1].strip() + if '"""' not in self.script: + logging.warning('Please provide a description of the operator in the first doc string.') + self.description = self.name + else: + self.description = self.script.split('"""')[1].strip() self.envs = self._get_env_vars() def _get_env_vars(self): @@ -59,8 +62,17 @@ def _get_env_vars(self): def get_requirements(self): requirements = [] + # Add dnf install + for line in self.script.split('\n'): + if re.search(r'[\s#]*dnf\s*[A-Za-z0-9_-]*', line): + if '-y' not in line: + # Adding default repo + line += ' -y' + requirements.append(line.replace('#', '').strip()) + + # Add pip install + pattern = r"([ ]*pip[ ]*install[ ]*)([A-Za-z=0-9.\-: ]*)" for line in self.script.split('\n'): - pattern = r"([ ]*pip[ ]*install[ ]*)([A-Za-z=0-9.\-: ]*)" result = re.findall(pattern, line) if len(result) == 1: requirements.append((result[0][0].strip() + ' ' + result[0][1].strip())) diff --git a/src/c3/rscript.py b/src/c3/rscript.py new file mode 100644 index 00000000..1ae513c7 --- /dev/null +++ b/src/c3/rscript.py @@ -0,0 +1,80 @@ + +import logging +import os +import re +from c3.parser import ContentParser + + +class Rscript: + def __init__(self, path): + + self.path = path + with open(path, 'r') as f: + self.script = f.read() + + self.name = os.path.basename(path)[:-2].replace('_', '-') + # TODO: Currently does not support a description + self.description = self.name + self.envs = self._get_env_vars() + + def _get_env_vars(self): + cp = ContentParser() + env_names = cp.parse(self.path)['env_vars'] + return_value = dict() + for env_name, default in env_names.items(): + comment_line = str() + for line in self.script.split('\n'): + if re.search("[\"']" + env_name + "[\"']", line): + # Check the description for current variable + if not comment_line.strip().startswith('#'): + # previous line was no description, reset comment_line. 
+ comment_line = '' + if comment_line == '': + logging.info(f'Interface: No description for variable {env_name} provided.') + if re.search(r'=\s*as.numeric\(\s*os', line): + type = 'Float' # double in R + elif re.search(r'=\s*bool\(\s*os', line): + type = 'Boolean' # logical in R + else: + type = 'String' # character in R + + return_value[env_name] = { + 'description': comment_line.replace('#', '').replace("\"", "\'").strip(), + 'type': type, + 'default': default + } + break + comment_line = line + return return_value + + def get_requirements(self): + requirements = [] + # Add apt install commands + for line in self.script.split('\n'): + if re.search(r'[\s#]*apt\s*[A-Za-z0-9_-]*', line): + if '-y' not in line: + # Adding default repo + line += ' -y' + requirements.append(line.replace('#', '').strip()) + + # Add Rscript install.packages commands + for line in self.script.split('\n'): + if re.search(r'[\s#]*install\.packages\(.*\)', line): + if 'http://' not in line: + # Adding default repo + line = line.rstrip(') ') + ", repos='http://cran.us.r-project.org')" + command = f"Rscript -e \"{line.replace('#', '').strip()}\"" + requirements.append(command) + return requirements + + def get_name(self): + return self.name + + def get_description(self): + return self.description + + def get_inputs(self): + return {key: value for (key, value) in self.envs.items() if not key.startswith('output_')} + + def get_outputs(self): + return {key: value for (key, value) in self.envs.items() if key.startswith('output_')} diff --git a/src/c3/templates/R_dockerfile_template b/src/c3/templates/R_dockerfile_template new file mode 100644 index 00000000..3921e170 --- /dev/null +++ b/src/c3/templates/R_dockerfile_template @@ -0,0 +1,9 @@ +FROM r-base:4.3.2 +USER root +RUN apt update +${requirements_docker} +ADD ${target_code} /home/docker/ +ADD ${additional_files_path} /home/docker/ +RUN chmod -R 777 /home/docker/ +USER docker +CMD ["${command}", "/home/docker/${target_code}"] \ No newline at end of file diff --git a/src/c3/templates/__init__.py b/src/c3/templates/__init__.py index 626002de..d7602ffb 100644 --- a/src/c3/templates/__init__.py +++ b/src/c3/templates/__init__.py @@ -4,9 +4,11 @@ from pathlib import Path # template file names -COMPONENT_SETUP_CODE = 'component_setup_code.py' +PYTHON_COMPONENT_SETUP_CODE = 'component_setup_code.py' +R_COMPONENT_SETUP_CODE = 'component_setup_code.R' GW_COMPONENT_SETUP_CODE = 'gw_component_setup_code.py' -DOCKERFILE_FILE = 'dockerfile_template' +PYTHON_DOCKERFILE_FILE = 'python_dockerfile_template' +R_DOCKERFILE_FILE = 'R_dockerfile_template' KFP_COMPONENT_FILE = 'kfp_component_template.yaml' KUBERNETES_JOB_FILE = 'kubernetes_job_template.job.yaml' GRID_WRAPPER_FILE = 'grid_wrapper_template.py' @@ -15,14 +17,20 @@ # load templates template_path = Path(os.path.dirname(__file__)) -with open(template_path / COMPONENT_SETUP_CODE, 'r') as f: - component_setup_code = f.read() +with open(template_path / PYTHON_COMPONENT_SETUP_CODE, 'r') as f: + python_component_setup_code = f.read() + +with open(template_path / R_COMPONENT_SETUP_CODE, 'r') as f: + r_component_setup_code = f.read() with open(template_path / GW_COMPONENT_SETUP_CODE, 'r') as f: gw_component_setup_code = f.read() -with open(template_path / DOCKERFILE_FILE, 'r') as f: - dockerfile_template = Template(f.read()) +with open(template_path / PYTHON_DOCKERFILE_FILE, 'r') as f: + python_dockerfile_template = Template(f.read()) + +with open(template_path / R_DOCKERFILE_FILE, 'r') as f: + r_dockerfile_template = 
Template(f.read()) with open(template_path / KFP_COMPONENT_FILE, 'r') as f: kfp_component_template = Template(f.read()) diff --git a/src/c3/templates/component_setup_code.R b/src/c3/templates/component_setup_code.R new file mode 100644 index 00000000..bedd266e --- /dev/null +++ b/src/c3/templates/component_setup_code.R @@ -0,0 +1,14 @@ + +args = commandArgs(trailingOnly=TRUE) + +for (parameter in args) { + key_value <- unlist(strsplit(parameter, split="=")) + if (length(key_value) == 2) { + print(parameter) + key <- key_value[1] + value <- key_value[2] + Sys.setenv(key = value) + } else { + print(paste('Could not find key value pair for argument ', parameter)) + } +} diff --git a/src/c3/templates/kfp_component_template.yaml b/src/c3/templates/kfp_component_template.yaml index c0c53569..2a2b108b 100644 --- a/src/c3/templates/kfp_component_template.yaml +++ b/src/c3/templates/kfp_component_template.yaml @@ -14,5 +14,5 @@ implementation: - sh - -ec - | - python ${call} + ${call} ${parameter_values} \ No newline at end of file diff --git a/src/c3/templates/kubernetes_job_template.job.yaml b/src/c3/templates/kubernetes_job_template.job.yaml index f5210a4d..22ceda55 100644 --- a/src/c3/templates/kubernetes_job_template.job.yaml +++ b/src/c3/templates/kubernetes_job_template.job.yaml @@ -8,7 +8,7 @@ spec: containers: - name: ${name} image: ${repository}/claimed-${name}:${version} - command: ["/opt/app-root/bin/python","/opt/app-root/src/${target_code}"] + command: [${command},"/opt/app-root/src/${target_code}"] env: ${env_entries} restartPolicy: OnFailure diff --git a/src/c3/templates/dockerfile_template b/src/c3/templates/python_dockerfile_template similarity index 68% rename from src/c3/templates/dockerfile_template rename to src/c3/templates/python_dockerfile_template index be059532..e1000ab7 100644 --- a/src/c3/templates/dockerfile_template +++ b/src/c3/templates/python_dockerfile_template @@ -1,12 +1,9 @@ FROM registry.access.redhat.com/ubi8/python-39 USER root -RUN dnf install -y java-11-openjdk -USER default RUN pip install ipython ${requirements_docker} ADD ${target_code} /opt/app-root/src/ ADD ${additional_files_path} /opt/app-root/src/ -USER root RUN chmod -R 777 /opt/app-root/src/ USER default -CMD ["python", "/opt/app-root/src/${target_code}"] \ No newline at end of file +CMD ["${command}", "/opt/app-root/src/${target_code}"] \ No newline at end of file diff --git a/src/c3/utils.py b/src/c3/utils.py index 651a0dc4..74df1b88 100644 --- a/src/c3/utils.py +++ b/src/c3/utils.py @@ -29,6 +29,9 @@ def convert_notebook(path): # convert tp python script (code, _) = PythonExporter().from_notebook_node(notebook) + # add import get_ipython + code = 'from IPython import get_ipython \n' + code + py_path = path.split('/')[-1].replace('.ipynb', '.py') assert not os.path.exists(py_path), f"File {py_path} already exist. Cannot convert notebook." 
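The `get_ipython` import added above is needed because `nbconvert`'s `PythonExporter` keeps IPython constructs, shell lines and magics, in the exported source as `get_ipython()` calls; this is also why the Dockerfile template now installs `ipython` and runs notebooks with it. A small way to observe the effect (assuming `nbformat` and `nbconvert` are installed):

```python
import nbformat
from nbconvert.exporters import PythonExporter

# build a one-cell notebook containing a shell command
nb = nbformat.v4.new_notebook(cells=[nbformat.v4.new_code_cell("!pip install numpy")])
code, _ = PythonExporter().from_notebook_node(nb)
print(code)  # the '!' line comes out as a get_ipython() call
```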
diff --git a/src/setup.py b/src/setup.py new file mode 100644 index 00000000..aa3a8d25 --- /dev/null +++ b/src/setup.py @@ -0,0 +1,10 @@ +from setuptools import setup, find_packages + +setup( + name='c3', + packages=find_packages(), + install_requires=[ + 'ipython', + 'nbconvert', + ], +) diff --git a/tests/example_rscript.R b/tests/example_rscript.R new file mode 100644 index 00000000..4f4e19d9 --- /dev/null +++ b/tests/example_rscript.R @@ -0,0 +1,17 @@ +# Reading env variables + +name <- Sys.getenv('name') + +default <- Sys.getenv('default', "default") + +number <- as.numeric(Sys.getenv('number', 10)) + +print(paste("hello", name)) + +print(number) + +# apt install libgdal-dev + +# Install packages +install.packages('readr') +library(readr) diff --git a/tests/example_script.py b/tests/example_script.py index 4de5f042..bd86dc28 100644 --- a/tests/example_script.py +++ b/tests/example_script.py @@ -6,11 +6,13 @@ #!pip install pandas +# dnf update + import os import numpy as np # A comment one line above os.getenv is the description of this variable. -input_path = os.environ.get('input_path', None ) # ('not this') +input_path = os.environ.get('input_path', None) # ('not this') # type casting to int(), float(), or bool() batch_size = int(os.environ.get('batch_size', 16)) # (not this) diff --git a/tests/test_compiler.py b/tests/test_compiler.py index 5a352d2a..3d801db4 100644 --- a/tests/test_compiler.py +++ b/tests/test_compiler.py @@ -8,6 +8,7 @@ TEST_NOTEBOOK_PATH = 'example_notebook.ipynb' TEST_SCRIPT_PATH = 'example_script.py' +TEST_RSCRIPT_PATH = 'example_rscript.R' DUMMY_REPO = 'test' test_convert_notebook_input = [ @@ -69,7 +70,7 @@ def test_get_remote_version( # testing icr.io requires 'ibmcloud login' version = get_image_version(repository, name) assert version != '0.1', \ - f"get_image_version retruns default version 0.1" + f"get_image_version returns default version 0.1" test_increase_version_input = [ @@ -94,6 +95,11 @@ def test_increase_version( test_create_operator_input = [ + ( + TEST_RSCRIPT_PATH, + DUMMY_REPO, + [], + ), ( TEST_NOTEBOOK_PATH, DUMMY_REPO, @@ -103,7 +109,7 @@ def test_increase_version( TEST_SCRIPT_PATH, DUMMY_REPO, [TEST_NOTEBOOK_PATH], - ) + ), ] @pytest.mark.parametrize( "file_path, repository, args", @@ -114,13 +120,16 @@ def test_create_operator( repository: str, args: List, ): - subprocess.run(['python', '../src/c3/create_operator.py', file_path, *args, '-r', repository, '--test_mode'], + subprocess.run(['python', '../src/c3/create_operator.py', file_path, *args, '-r', repository, + '--test_mode', '-v', 'test', '--log_level', 'DEBUG'], check=True) file = Path(file_path) file.with_suffix('.yaml').unlink() file.with_suffix('.job.yaml').unlink() - # TODO: Add tests for the created container image + image_name = f"{repository}/claimed-{file_path.rsplit('.')[0].replace('_', '-')}:test" + subprocess.run(['docker', 'run', image_name], + check=True) test_create_gridwrapper_input = [ @@ -148,7 +157,7 @@ def test_create_gridwrapper( args: List, ): subprocess.run(['python', '../src/c3/create_gridwrapper.py', file_path, *args, - '-r', repository, '-p', process, '--test_mode'], check=True) + '-r', repository, '-p', process, '--test_mode', '-v', 'test', '--log_level', 'DEBUG'], check=True) file = Path(file_path) gw_file = file.parent / f'gw_{file.stem}.py' @@ -156,4 +165,6 @@ def test_create_gridwrapper( gw_file.with_suffix('.yaml').unlink() gw_file.with_suffix('.job.yaml').unlink() gw_file.unlink() - # TODO: Add tests for the created container image \ No newline at 
end of file + image_name = f"{repository}/claimed-gw-{file_path.rsplit('.')[0].replace('_', '-')}:test" + subprocess.run(['docker', 'run', image_name], + check=True) \ No newline at end of file From 23eee301de305fec081f870a90c7b5da9c06656d Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Thu, 9 Nov 2023 21:46:07 +0100 Subject: [PATCH 104/177] Fixed working dir error --- GettingStarted.md | 2 +- pyproject.toml | 2 +- src/c3/create_operator.py | 7 ++++++- src/c3/templates/R_dockerfile_template | 9 +++++---- src/c3/templates/kubernetes_job_template.job.yaml | 2 +- src/c3/templates/python_dockerfile_template | 8 ++++---- 6 files changed, 18 insertions(+), 12 deletions(-) diff --git a/GettingStarted.md b/GettingStarted.md index 8b922405..1eb89a2c 100644 --- a/GettingStarted.md +++ b/GettingStarted.md @@ -259,7 +259,7 @@ TektonCompiler().compile(pipeline_func=my_pipeline, package_path='my_pipeline.ya You can install C3 via pip: ```sh -pip install claimed-c3 +pip install claimed ``` ### 4.2 C3 requirements diff --git a/pyproject.toml b/pyproject.toml index 5ed2abb6..bd27bdab 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -8,7 +8,7 @@ version_file = "src/c3/_version.py" [project] name = "claimed-c3" # dynamic = ["version"] -version = "0.2.6" +version = "0.2.7" authors = [ { name="The CLAIMED authors", email="claimed-framework@proton.me"}, ] diff --git a/src/c3/create_operator.py b/src/c3/create_operator.py index 11372ddd..671c2abb 100644 --- a/src/c3/create_operator.py +++ b/src/c3/create_operator.py @@ -37,6 +37,7 @@ def create_operator(file_path: str, logging.info('Convert notebook to python script') target_code = convert_notebook(file_path) command = '/opt/app-root/bin/ipython' + working_dir = '/opt/app-root/src/' elif file_path.endswith('.py'): target_code = file_path.split('/')[-1] if file_path == target_code: @@ -45,6 +46,7 @@ def create_operator(file_path: str, # Copy file to current working directory shutil.copy(file_path, target_code) command = '/opt/app-root/bin/python' + working_dir = '/opt/app-root/src/' elif file_path.lower().endswith('.r'): target_code = file_path.split('/')[-1] if file_path == target_code: @@ -53,6 +55,7 @@ def create_operator(file_path: str, # Copy file to current working directory shutil.copy(file_path, target_code) command = 'Rscript' + working_dir = '/home/docker/' else: raise NotImplementedError('Please provide a file_path to a jupyter notebook, python script, or R script.') @@ -117,7 +120,8 @@ def create_operator(file_path: str, requirements_docker=requirements_docker, target_code=target_code, additional_files_path=additional_files_path, - command=os.path.basename(command) + working_dir=working_dir, + command=os.path.basename(command), ) logging.info('Create Dockerfile') @@ -233,6 +237,7 @@ def get_component_interface(parameters): target_code=target_code, env_entries=env_entries, command=command, + working_dir=working_dir, ) logging.debug('Kubernetes job yaml:\n' + job_yaml) diff --git a/src/c3/templates/R_dockerfile_template b/src/c3/templates/R_dockerfile_template index 3921e170..f0b22633 100644 --- a/src/c3/templates/R_dockerfile_template +++ b/src/c3/templates/R_dockerfile_template @@ -2,8 +2,9 @@ FROM r-base:4.3.2 USER root RUN apt update ${requirements_docker} -ADD ${target_code} /home/docker/ -ADD ${additional_files_path} /home/docker/ -RUN chmod -R 777 /home/docker/ +ADD ${target_code} ${working_dir} +ADD ${additional_files_path} ${working_dir} +RUN chmod -R 777 ${working_dir} +RUN chmod -R 777 /usr/local/lib/R/ USER docker -CMD 
["${command}", "/home/docker/${target_code}"] \ No newline at end of file +CMD ["${command}", "${working_dir}${target_code}"] \ No newline at end of file diff --git a/src/c3/templates/kubernetes_job_template.job.yaml b/src/c3/templates/kubernetes_job_template.job.yaml index 22ceda55..e2a65339 100644 --- a/src/c3/templates/kubernetes_job_template.job.yaml +++ b/src/c3/templates/kubernetes_job_template.job.yaml @@ -8,7 +8,7 @@ spec: containers: - name: ${name} image: ${repository}/claimed-${name}:${version} - command: [${command},"/opt/app-root/src/${target_code}"] + command: ["${command}","${working_dir}${target_code}"] env: ${env_entries} restartPolicy: OnFailure diff --git a/src/c3/templates/python_dockerfile_template b/src/c3/templates/python_dockerfile_template index e1000ab7..f0e23d04 100644 --- a/src/c3/templates/python_dockerfile_template +++ b/src/c3/templates/python_dockerfile_template @@ -2,8 +2,8 @@ FROM registry.access.redhat.com/ubi8/python-39 USER root RUN pip install ipython ${requirements_docker} -ADD ${target_code} /opt/app-root/src/ -ADD ${additional_files_path} /opt/app-root/src/ -RUN chmod -R 777 /opt/app-root/src/ +ADD ${target_code} ${working_dir} +ADD ${additional_files_path} ${working_dir} +RUN chmod -R 777 ${working_dir} USER default -CMD ["${command}", "/opt/app-root/src/${target_code}"] \ No newline at end of file +CMD ["${command}", "${working_dir}${target_code}"] \ No newline at end of file From d552167958d4167b711140ba6f29ea1243075fdb Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Tue, 14 Nov 2023 16:18:10 +0100 Subject: [PATCH 105/177] Fixed some errors --- examples/example_rscript.R | 17 +++++++++++++++++ src/c3/create_operator.py | 6 +++++- src/c3/parser.py | 2 +- src/c3/pythonscript.py | 2 +- src/c3/rscript.py | 2 +- src/c3/utils.py | 3 ++- 6 files changed, 27 insertions(+), 5 deletions(-) create mode 100644 examples/example_rscript.R diff --git a/examples/example_rscript.R b/examples/example_rscript.R new file mode 100644 index 00000000..4f4e19d9 --- /dev/null +++ b/examples/example_rscript.R @@ -0,0 +1,17 @@ +# Reading env variables + +name <- Sys.getenv('name') + +default <- Sys.getenv('default', "default") + +number <- as.numeric(Sys.getenv('number', 10)) + +print(paste("hello", name)) + +print(number) + +# apt install libgdal-dev + +# Install packages +install.packages('readr') +library(readr) diff --git a/src/c3/create_operator.py b/src/c3/create_operator.py index 671c2abb..bb5c6868 100644 --- a/src/c3/create_operator.py +++ b/src/c3/create_operator.py @@ -25,6 +25,7 @@ def create_operator(file_path: str, additional_files: str = None, log_level='INFO', test_mode=False, + no_cache=False, ): logging.info('Parameters: ') logging.info('file_path: ' + file_path) @@ -135,7 +136,8 @@ def create_operator(file_path: str, logging.info(f'Building container image claimed-{name}:{version}') try: subprocess.run( - ['docker', 'build', '--platform', 'linux/amd64', '-t', f'claimed-{name}:{version}', '.'], + ['docker', 'build', '--platform', 'linux/amd64', '-t', f'claimed-{name}:{version}', '.', + '--no-cache' if no_cache else ''], stdout=None if log_level == 'DEBUG' else subprocess.PIPE, check=True, ) except Exception as err: @@ -269,6 +271,7 @@ def main(): parser.add_argument('--dockerfile_template_path', type=str, default='', help='Path to custom dockerfile template') parser.add_argument('--test_mode', action='store_true') + parser.add_argument('--no-cache', action='store_true') args = parser.parse_args() # Init logging @@ -296,6 +299,7 @@ def main(): 
additional_files=args.ADDITIONAL_FILES, log_level=args.log_level, test_mode=args.test_mode, + no_cache=args.no_cache, ) diff --git a/src/c3/parser.py b/src/c3/parser.py index da9d0ecf..18fbed0b 100644 --- a/src/c3/parser.py +++ b/src/c3/parser.py @@ -133,7 +133,7 @@ def search_expressions(self) -> Dict[str, List]: regex_dict = dict() # Tests for matches of the form: var <- Sys.getenv("key", "optional default") - envs = [r".*Sys\.getenv\([\"']*([a-zA-Z_]+[A-Za-z0-9_]*)[\"']*(?:\s*\,\s*[\"']?([A-Za-z0-9_]*)?[\"']?)?\).*"] + envs = [r".*Sys\.getenv\([\"']*([a-zA-Z_]+[A-Za-z0-9_]*)[\"']*(?:\s*\,\s*[\"']?(.*)?[\"']?)?\).*"] regex_dict["env_vars"] = envs return regex_dict diff --git a/src/c3/pythonscript.py b/src/c3/pythonscript.py index c0211dc4..34046153 100644 --- a/src/c3/pythonscript.py +++ b/src/c3/pythonscript.py @@ -12,7 +12,7 @@ def __init__(self, path): with open(path, 'r') as f: self.script = f.read() - self.name = os.path.basename(path)[:-3].replace('_', '-') + self.name = os.path.basename(path)[:-3].replace('_', '-').lower() if '"""' not in self.script: logging.warning('Please provide a description of the operator in the first doc string.') self.description = self.name diff --git a/src/c3/rscript.py b/src/c3/rscript.py index 1ae513c7..6405fb02 100644 --- a/src/c3/rscript.py +++ b/src/c3/rscript.py @@ -12,7 +12,7 @@ def __init__(self, path): with open(path, 'r') as f: self.script = f.read() - self.name = os.path.basename(path)[:-2].replace('_', '-') + self.name = os.path.basename(path)[:-2].replace('_', '-').lower() # TODO: Currently does not support a description self.description = self.name self.envs = self._get_env_vars() diff --git a/src/c3/utils.py b/src/c3/utils.py index 74df1b88..020506f2 100644 --- a/src/c3/utils.py +++ b/src/c3/utils.py @@ -87,13 +87,14 @@ def pull_icr_image_tags(image): ).stdout.decode('utf-8') try: + assert 'You have no images in the namespaces' not in output # remove header and final status image_list = output.splitlines()[3:-2] # get list of image tags image_tags = [line.split()[1] for line in image_list] except: image_tags = [] - logging.error(f"Could not load image tags from 'ibmcloud cr images' output: {output}") + logging.warning(f"Could not load image tags from 'ibmcloud cr images' output: {output}") pass # filter latest and none From 15a6b98d7baedec375f2c3b8243b995a9c7fc5cd Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Tue, 14 Nov 2023 16:53:18 +0100 Subject: [PATCH 106/177] Fixed no-cache --- src/c3/create_operator.py | 33 ++++++++++++++++----------------- 1 file changed, 16 insertions(+), 17 deletions(-) diff --git a/src/c3/create_operator.py b/src/c3/create_operator.py index bb5c6868..5fa5bab2 100644 --- a/src/c3/create_operator.py +++ b/src/c3/create_operator.py @@ -136,9 +136,18 @@ def create_operator(file_path: str, logging.info(f'Building container image claimed-{name}:{version}') try: subprocess.run( - ['docker', 'build', '--platform', 'linux/amd64', '-t', f'claimed-{name}:{version}', '.', - '--no-cache' if no_cache else ''], - stdout=None if log_level == 'DEBUG' else subprocess.PIPE, check=True, + f"docker build --platform linux/amd64 -t claimed-{name}:{version} . 
{'--no-cache' if no_cache else ''}", + stdout=None if log_level == 'DEBUG' else subprocess.PIPE, check=True, shell=True + ) + + logging.debug(f'Tagging images with "latest" and "{version}"') + subprocess.run( + f"docker tag claimed-{name}:{version} {repository}/claimed-{name}:{version}", + stdout=None if log_level == 'DEBUG' else subprocess.PIPE, check=True, shell=True, + ) + subprocess.run( + f"docker tag claimed-{name}:{version} {repository}/claimed-{name}:latest", + stdout=None if log_level == 'DEBUG' else subprocess.PIPE, check=True, shell=True, ) except Exception as err: # remove temp files @@ -147,27 +156,17 @@ def create_operator(file_path: str, os.remove('Dockerfile') shutil.rmtree(additional_files_path, ignore_errors=True) raise err - - logging.debug(f'Tagging images with "latest" and "{version}"') - subprocess.run( - ['docker', 'tag', f'claimed-{name}:{version}', f'{repository}/claimed-{name}:{version}'], - stdout=None if log_level == 'DEBUG' else subprocess.PIPE, check=True, - ) - subprocess.run( - ['docker', 'tag', f'claimed-{name}:{version}', f'{repository}/claimed-{name}:latest'], - stdout=None if log_level == 'DEBUG' else subprocess.PIPE, check=True, - ) logging.info('Successfully built image') logging.info(f'Pushing images to registry {repository}') try: subprocess.run( - ['docker', 'push', f'{repository}/claimed-{name}:latest'], - stdout=None if log_level == 'DEBUG' else subprocess.PIPE, check=True, + f"docker push {repository}/claimed-{name}:latest", + stdout=None if log_level == 'DEBUG' else subprocess.PIPE, check=True, shell=True, ) subprocess.run( - ['docker', 'push', f'{repository}/claimed-{name}:{version}'], - stdout=None if log_level == 'DEBUG' else subprocess.PIPE, check=True, + f"docker push {repository}/claimed-{name}:{version}", + stdout=None if log_level == 'DEBUG' else subprocess.PIPE, check=True, shell=True, ) logging.info('Successfully pushed image to registry') except Exception as err: From 3054311ac42fd20099b92fb70cc0917a5fc411a8 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Tue, 14 Nov 2023 16:53:34 +0100 Subject: [PATCH 107/177] Fixed default values for py scripts --- pyproject.toml | 4 ++-- src/c3/parser.py | 12 ++++++++---- src/c3/pythonscript.py | 11 +---------- 3 files changed, 11 insertions(+), 16 deletions(-) diff --git a/pyproject.toml b/pyproject.toml index bd27bdab..445566a6 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -7,8 +7,8 @@ version_file = "src/c3/_version.py" [project] name = "claimed-c3" -# dynamic = ["version"] -version = "0.2.7" +dynamic = ["version"] +# version = "0.2.9" authors = [ { name="The CLAIMED authors", email="claimed-framework@proton.me"}, ] diff --git a/src/c3/parser.py b/src/c3/parser.py index 18fbed0b..ab28224d 100644 --- a/src/c3/parser.py +++ b/src/c3/parser.py @@ -122,8 +122,8 @@ def search_expressions(self) -> Dict[str, List]: # First regex matches envvar assignments that use os.getenv("name", "value") with ow w/o default provided # Second regex matches envvar assignments that use os.environ.get("name", "value") with or w/o default provided # Both name and value are captured if possible - envs = [r"os\.getenv\([\"']([a-zA-Z_]+[A-Za-z0-9_]*)[\"'](?:\s*\,\s*[\"'](.[^\"']*)?[\"'])?", - r"os\.environ\.get\([\"']([a-zA-Z_]+[A-Za-z0-9_]*)[\"'](?:\s*\,(?:\s*[\"'](.[^\"']*)?[\"'])?)*"] + envs = [r"os\.getenv\([\"']([a-zA-Z_]+[A-Za-z0-9_]*)[\"']*(?:\s*\,\s*[\"']?(.[^#]*)?[\"']?)?\).*", + r"os\.environ\.get\([\"']([a-zA-Z_]+[A-Za-z0-9_]*)[\"']*(?:\s*\,\s*[\"']?(.[^#]*)?[\"']?)?\).*"] regex_dict["env_vars"] = 
envs
 
         return regex_dict
@@ -133,7 +133,7 @@ def search_expressions(self) -> Dict[str, List]:
         regex_dict = dict()
 
         # Tests for matches of the form: var <- Sys.getenv("key", "optional default")
-        envs = [r".*Sys\.getenv\([\"']*([a-zA-Z_]+[A-Za-z0-9_]*)[\"']*(?:\s*\,\s*[\"']?(.*)?[\"']?)?\).*"]
+        envs = [r".*Sys\.getenv\([\"']*([a-zA-Z_]+[A-Za-z0-9_]*)[\"']*(?:\s*\,\s*[\"']?(.[^#]*)?[\"']?)?\).*"]
         regex_dict["env_vars"] = envs
 
         return regex_dict
@@ -160,7 +160,11 @@ def parse(self, filepath: str) -> dict:
                 matches = parser.parse_environment_variables(line)
                 for key, match in matches:
                     if key == "env_vars":
-                        properties[key][match.group(1)] = match.group(2)
+                        default_value = match.group(2)
+                        if default_value:
+                            # The default value match can end with an additional ', ", or ) which is removed
+                            default_value = re.sub(r"['\")]?$", '', default_value, count=1)
+                        properties[key][match.group(1)] = default_value
                     else:
                         properties[key].append(match.group(1))
 
diff --git a/src/c3/pythonscript.py b/src/c3/pythonscript.py
index 34046153..a5b1bd3c 100644
--- a/src/c3/pythonscript.py
+++ b/src/c3/pythonscript.py
@@ -24,7 +24,7 @@ def _get_env_vars(self):
         cp = ContentParser()
         env_names = cp.parse(self.path)['env_vars']
         return_value = dict()
-        for env_name in env_names:
+        for env_name, default in env_names.items():
             comment_line = str()
             for line in self.script.split('\n'):
                 if re.search("[\"']" + env_name + "[\"']", line):
@@ -42,15 +42,6 @@ def _get_env_vars(self):
                     type = 'Boolean'
                 else:
                     type = 'String'
-                # get default value
-                if re.search(r"\(.*,.*\)", line):
-                    # extract int, float, bool
-                    default = re.search(r",\s*(.*?)\s*\)", line).group(1)
-                    if type == 'String' and default != 'None':
-                        # Process string default value
-                        default = default[1:-1].replace("\"", "\'")
-                else:
-                    default = None
                 return_value[env_name] = {
                     'description': comment_line.replace('#', '').replace("\"", "\'").strip(),
                     'type': type,
From a5dd7adbf1452a558cf943336774b08fc551ed83 Mon Sep 17 00:00:00 2001
From: Benedikt Blumenstiel
Date: Wed, 15 Nov 2023 10:26:25 +0100
Subject: [PATCH 108/177] Added CWL, fixed some issues, created functions for
 application files

Signed-off-by: Benedikt Blumenstiel
---
 GettingStarted.md                           |  65 +++++-
 src/c3/create_operator.py                   | 247 ++++++++++++--------
 src/c3/pythonscript.py                      |   4 +-
 src/c3/templates/__init__.py                |   4 +
 src/c3/templates/cwl_component_template.cwl |  15 ++
 src/c3/templates/python_dockerfile_template |   4 +-
 tests/test_compiler.py                      |  12 +-
 7 files changed, 246 insertions(+), 105 deletions(-)
 create mode 100644 src/c3/templates/cwl_component_template.cwl

diff --git a/GettingStarted.md b/GettingStarted.md
index 1eb89a2c..e8bf9132 100644
--- a/GettingStarted.md
+++ b/GettingStarted.md
@@ -21,7 +21,13 @@ This page explains how to apply operators, combine them to workflows, and how to
 
 ## 1. Apply operators
 
-An operator is a single processing step such as a kubernetes job. You can run an operator via [CLAIMED CLI](https://github.com/claimed-framework/cli), use them in [workflows](#3-create-workflows), or deploy a kubernetes job using the `job.yaml` which is explained in the following.
+An operator is a single processing step. You can run the script locally with the [CLAIMED CLI](https://github.com/claimed-framework/cli) using the following command:
+```shell
+claimed --component <repository>/claimed-<name>:<version> --<parameter1> <value1> --<parameter2> <value2> ...
+```
+
+Besides the CLAIMED CLI, you can use an operator in [workflows](#3-create-workflows) or deploy a kubernetes job using the `job.yaml`, which is explained in the following.
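+
+For example, a hypothetical operator `claimed-my-operator` pushed to `docker.io/<username>` could be run with:
+```shell
+claimed --component docker.io/<username>/claimed-my-operator:0.1 --input_path data.csv --batch_size 16
+```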
+
 
 ### 1.1 Specify the job
 
@@ -200,7 +206,11 @@ You can also use the operators in workflows as explained in the next section.
 
 ## 3. Create workflows
 
-Multiple operators can be combined to a workflow, e.g., a KubeFlow pipeline. Therefore, C3 creates `.yaml` files which define a KFP component. After initializing your operators, you can combine them in a pipeline function.
+Multiple operators can be combined into a workflow, e.g., a KubeFlow pipeline or a CWL workflow. To this end, C3 creates `.yaml` files which define a KFP component and `.cwl` files for a CWL step.
+
+### KubeFlow Pipeline
+
+After initializing your operators, you can combine them in a pipeline function:
 
 ```python
 # pip install kfp
@@ -251,6 +261,50 @@ from kfp_tekton.compiler import TektonCompiler
 TektonCompiler().compile(pipeline_func=my_pipeline, package_path='my_pipeline.yaml')
 ```
 
+### CWL workflows
+
+You can run workflows locally with CWL. This requires the cwltool package:
+```shell
+pip install cwltool
+```
+
+You can create a CWL workflow by combining multiple CWL steps:
+
+```text
+cwlVersion: v1.0
+class: Workflow
+
+inputs:
+  parameter1: string
+  parameter2: string
+  parameter3: string
+
+outputs:
+  parameter4: string
+
+steps:
+  <operator1>.cwl:
+    run: ./path/to/<operator1>.cwl
+    in:
+      parameter1: parameter1
+      parameter2: parameter2
+    out:
+      parameter3: parameter3
+  <operator2>.cwl:
+    run: ./path/to/<operator2>.cwl
+    in:
+      parameter3: parameter3
+    out:
+      parameter4: parameter4
+```
+
+If a workflow or component does not have outputs, use `outputs: []`.
+
+Run the CWL workflow in your terminal with:
+```shell
+cwltool <workflow>.cwl --parameter1 <value1> --parameter2 <value2> --parameter3 <value3> --parameter4 <value4>
+```
+
 ---
 ## 4. Create operators
 
@@ -276,12 +330,17 @@ Your operator script has to follow certain requirements to be processed by C3. C
 
 You can optionally install future tools with `dnf` by adding a comment `# dnf <package>`.
 
+If you want to install packages from a `requirements.txt` file, you need to consider two steps:
+First, you need to include the file as an additional file in the c3 command.
+Second, the Dockerfile is executed from the root directory while the files are placed in the working directory.
+Therefore, use the command `pip install -r /opt/app-root/src/requirements.txt`.
+
 #### iPython notebooks
 
 - The operator name is the notebook file: `my_operator_name.ipynb` -> `claimed-my-operator-name`
 - The notebook is converted by `nbconvert` to a python script by merging all cells before the operator is created.
 - Markdown cells are converted into doc strings. Shell commands with `!...` are converted into `get_ipython().run_line_magic()`.
-- The requirements of python scripts apply to the notebook code (The operator description can be a markdown cell).
+- The requirements of python scripts apply to the notebook code (the operator description can be the first markdown cell).
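+
+As an illustration, a minimal Python operator that follows the conventions above could look like this (operator, parameter, and package names are examples only):
+
+```python
+"""Counts the rows of a CSV file."""
+
+# pip install pandas
+
+import os
+import pandas as pd
+
+# path to the input CSV file
+input_path = os.getenv('input_path')
+
+# number of rows to skip (cast to int)
+skip_rows = int(os.getenv('skip_rows', 0))
+
+# output paths start with 'output_'
+output_path = os.getenv('output_path', 'row_count.txt')
+
+df = pd.read_csv(input_path, skiprows=skip_rows)
+with open(output_path, 'w') as f:
+    f.write(str(len(df)))
+```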
#### R scripts diff --git a/src/c3/create_operator.py b/src/c3/create_operator.py index 5fa5bab2..d8217de1 100644 --- a/src/c3/create_operator.py +++ b/src/c3/create_operator.py @@ -13,11 +13,148 @@ from c3.utils import convert_notebook, get_image_version from c3.templates import (python_component_setup_code, r_component_setup_code, python_dockerfile_template, r_dockerfile_template, - kfp_component_template, kubernetes_job_template, ) + kfp_component_template, kubernetes_job_template, cwl_component_template) CLAIMED_VERSION = 'V0.1' +def create_dockerfile(dockerfile_template, requirements, target_code, additional_files_path, working_dir, command): + requirements_docker = list(map(lambda s: 'RUN ' + s, requirements)) + requirements_docker = '\n'.join(requirements_docker) + + docker_file = dockerfile_template.substitute( + requirements_docker=requirements_docker, + target_code=target_code, + additional_files_path=additional_files_path, + working_dir=working_dir, + command=os.path.basename(command), + ) + + logging.info('Create Dockerfile') + with open("Dockerfile", "w") as text_file: + text_file.write(docker_file) + + +def create_kfp_component(name, description, repository, version, command, target_code, file_path, inputs, outputs): + def get_component_interface(parameters): + return_string = str() + for name, options in parameters.items(): + return_string += f'- {{name: {name}, type: {options["type"]}, description: "{options["description"]}"' + if options['default'] is not None: + if not options["default"].startswith('"'): + options["default"] = f'"{options["default"]}"' + return_string += f', default: {options["default"]}' + return_string += '}\n' + return return_string + inputs_list = get_component_interface(inputs) + outputs_list = get_component_interface(outputs) + + parameter_list = str() + for index, key in enumerate(list(inputs.keys()) + list(outputs.keys())): + parameter_list += f'{key}="${{{index}}}" ' + + parameter_values = str() + for input_key in inputs.keys(): + parameter_values += f" - {{inputValue: {input_key}}}\n" + for output_key in outputs.keys(): + parameter_values += f" - {{outputPath: {output_key}}}\n" + + yaml = kfp_component_template.substitute( + name=name, + description=description, + repository=repository, + version=version, + inputs=inputs_list, + outputs=outputs_list, + call=f'{os.path.basename(command)} ./{target_code} {parameter_list}', + parameter_values=parameter_values, + ) + + logging.debug('KubeFlow component yaml:\n' + yaml) + target_yaml_path = str(Path(file_path).with_suffix('.yaml')) + + logging.info(f'Write KubeFlow component yaml to {target_yaml_path}') + with open(target_yaml_path, "w") as text_file: + text_file.write(yaml) + + +def create_kubernetes_job(name, repository, version, target_code, command, working_dir, file_path, inputs, outputs): + # get environment entries + env_entries = str() + for key in list(inputs.keys()) + list(outputs.keys()): + env_entries += f" - name: {key}\n value: value_of_{key}\n" + env_entries = env_entries.rstrip() + + job_yaml = kubernetes_job_template.substitute( + name=name, + repository=repository, + version=version, + target_code=target_code, + env_entries=env_entries, + command=command, + working_dir=working_dir, + ) + + logging.debug('Kubernetes job yaml:\n' + job_yaml) + target_job_yaml_path = str(Path(file_path).with_suffix('.job.yaml')) + + logging.info(f'Write kubernetes job yaml to {target_job_yaml_path}') + with open(target_job_yaml_path, "w") as text_file: + text_file.write(job_yaml) + + +def 
create_cwl_component(name, repository, version, file_path, inputs, outputs): + # get environment entries + i = 1 + input_envs = str() + for input, options in inputs.items(): + i += 1 + input_envs += (f" {input}:\n type: string\n default: {options['default']}\n " + f"inputBinding:\n position: {i}\n prefix: --{input}\n") + + if len(outputs) == 0: + output_envs = '[]' + else: + output_envs = '\n' + for output, options in outputs.items(): + i += 1 + output_envs += (f" {output}:\n type: string\n default: {options['default']}\n " + f"inputBinding:\n position: {i}\n prefix: --{output}\n") + + cwl = cwl_component_template.substitute( + name=name, + repository=repository, + version=version, + inputs=input_envs, + outputs=output_envs, + ) + + logging.debug('CWL component:\n' + cwl) + target_cwl_path = str(Path(file_path).with_suffix('.cwl')) + + logging.info(f'Write cwl component to {target_cwl_path}') + with open(target_cwl_path, "w") as text_file: + text_file.write(cwl) + + +def print_claimed_command(name, repository, version, inputs, outputs): + claimed_command = f"claimed --component {repository}/claimed-{name}:{version}" + for input, options in inputs.items(): + claimed_command += f" --{input} {options['default']}" + for output, options in outputs.items(): + claimed_command += f" --{output} {options['default']}" + logging.info(f'Run operators locally with claimed-cli:\n{claimed_command}') + + +def remove_temporary_files(file_path, target_code, additional_files_path): + logging.info(f'Remove local files') + # remove temporary files + if file_path != target_code: + os.remove(target_code) + os.remove('Dockerfile') + shutil.rmtree(additional_files_path, ignore_errors=True) + + def create_operator(file_path: str, repository: str, version: str, @@ -33,12 +170,12 @@ def create_operator(file_path: str, logging.info('version: ' + str(version)) logging.info('additional_files: ' + str(additional_files)) - # TODO: add argument for running ipython instead of python within the container if file_path.endswith('.ipynb'): logging.info('Convert notebook to python script') target_code = convert_notebook(file_path) command = '/opt/app-root/bin/ipython' working_dir = '/opt/app-root/src/' + elif file_path.endswith('.py'): target_code = file_path.split('/')[-1] if file_path == target_code: @@ -48,6 +185,7 @@ def create_operator(file_path: str, shutil.copy(file_path, target_code) command = '/opt/app-root/bin/python' working_dir = '/opt/app-root/src/' + elif file_path.lower().endswith('.r'): target_code = file_path.split('/')[-1] if file_path == target_code: @@ -71,6 +209,7 @@ def create_operator(file_path: str, # getting parameter from the script script_data = Pythonscript(target_code) dockerfile_template = custom_dockerfile_template or python_dockerfile_template + elif target_code.lower().endswith('.r'): # Add code for logging and cli parameters to the beginning of the script with open(target_code, 'r') as f: @@ -114,32 +253,21 @@ def create_operator(file_path: str, shutil.copy(additional_file, additional_files_path) logging.info(f'Selected additional files: {os.listdir(additional_files_path)}') - requirements_docker = list(map(lambda s: 'RUN ' + s, requirements)) - requirements_docker = '\n'.join(requirements_docker) - - docker_file = dockerfile_template.substitute( - requirements_docker=requirements_docker, - target_code=target_code, - additional_files_path=additional_files_path, - working_dir=working_dir, - command=os.path.basename(command), - ) + create_dockerfile(dockerfile_template, requirements, target_code, 
additional_files_path, working_dir, command) - logging.info('Create Dockerfile') - with open("Dockerfile", "w") as text_file: - text_file.write(docker_file) - if version is None: # auto increase version based on registered images version = get_image_version(repository, name) logging.info(f'Building container image claimed-{name}:{version}') try: + # Run docker build subprocess.run( f"docker build --platform linux/amd64 -t claimed-{name}:{version} . {'--no-cache' if no_cache else ''}", stdout=None if log_level == 'DEBUG' else subprocess.PIPE, check=True, shell=True ) + # Run docker tag logging.debug(f'Tagging images with "latest" and "{version}"') subprocess.run( f"docker tag claimed-{name}:{version} {repository}/claimed-{name}:{version}", @@ -150,16 +278,13 @@ def create_operator(file_path: str, stdout=None if log_level == 'DEBUG' else subprocess.PIPE, check=True, shell=True, ) except Exception as err: - # remove temp files - if file_path != target_code: - os.remove(target_code) - os.remove('Dockerfile') - shutil.rmtree(additional_files_path, ignore_errors=True) + remove_temporary_files(file_path, target_code, additional_files_path) raise err logging.info('Successfully built image') logging.info(f'Pushing images to registry {repository}') try: + # Run docker push subprocess.run( f"docker push {repository}/claimed-{name}:latest", stdout=None if log_level == 'DEBUG' else subprocess.PIPE, check=True, shell=True, @@ -176,84 +301,20 @@ def create_operator(file_path: str, logging.info('Continue processing (test mode).') pass else: - # remove temp files - if file_path != target_code: - os.remove(target_code) - os.remove('Dockerfile') - shutil.rmtree(additional_files_path, ignore_errors=True) + remove_temporary_files(file_path, target_code, additional_files_path) raise err - def get_component_interface(parameters): - return_string = str() - for name, options in parameters.items(): - return_string += f'- {{name: {name}, type: {options["type"]}, description: "{options["description"]}"' - if options['default'] is not None: - if not options["default"].startswith('"'): - options["default"] = f'"{options["default"]}"' - return_string += f', default: {options["default"]}' - return_string += '}\n' - return return_string - inputs_list = get_component_interface(inputs) - outputs_list = get_component_interface(outputs) - - parameter_list = str() - for index, key in enumerate(list(inputs.keys()) + list(outputs.keys())): - parameter_list += f'{key}="${{{index}}}" ' - - parameter_values = str() - for input_key in inputs.keys(): - parameter_values += f" - {{inputValue: {input_key}}}\n" - for input_key in outputs.keys(): - parameter_values += f" - {{outputPath: {input_key}}}\n" - - # TODO: Check call and command in kfp pipeline for R script - yaml = kfp_component_template.substitute( - name=name, - description=description, - repository=repository, - version=version, - inputs=inputs_list, - outputs=outputs_list, - call=f'{os.path.basename(command)} ./{target_code} {parameter_list}', - parameter_values=parameter_values, - ) + # Create application scripts + create_kfp_component(name, description, repository, version, command, target_code, file_path, inputs, outputs) - logging.debug('KubeFlow component yaml:\n' + yaml) - target_yaml_path = str(Path(file_path).with_suffix('.yaml')) + create_kubernetes_job(name, repository, version, target_code, command, working_dir, file_path, inputs, outputs) - logging.info(f'Write KubeFlow component yaml to {target_yaml_path}') - with open(target_yaml_path, "w") as text_file: - 
text_file.write(yaml) + create_cwl_component(name, repository, version, file_path, inputs, outputs) - # get environment entries - env_entries = str() - for key in list(inputs.keys()) + list(outputs.keys()): - env_entries += f" - name: {key}\n value: value_of_{key}\n" - env_entries = env_entries.rstrip() + print_claimed_command(name, repository, version, inputs, outputs) - job_yaml = kubernetes_job_template.substitute( - name=name, - repository=repository, - version=version, - target_code=target_code, - env_entries=env_entries, - command=command, - working_dir=working_dir, - ) - - logging.debug('Kubernetes job yaml:\n' + job_yaml) - target_job_yaml_path = str(Path(file_path).with_suffix('.job.yaml')) - - logging.info(f'Write kubernetes job yaml to {target_job_yaml_path}') - with open(target_job_yaml_path, "w") as text_file: - text_file.write(job_yaml) - - logging.info(f'Remove local files') - # remove temporary files - if file_path != target_code: - os.remove(target_code) - os.remove('Dockerfile') - shutil.rmtree(additional_files_path, ignore_errors=True) + # Remove temp files + remove_temporary_files(file_path, target_code, additional_files_path) def main(): diff --git a/src/c3/pythonscript.py b/src/c3/pythonscript.py index a5b1bd3c..d289c420 100644 --- a/src/c3/pythonscript.py +++ b/src/c3/pythonscript.py @@ -55,14 +55,14 @@ def get_requirements(self): requirements = [] # Add dnf install for line in self.script.split('\n'): - if re.search(r'[\s#]*dnf\s*[A-Za-z0-9_-]*', line): + if re.search(r'[\s#]*dnf\s*.[^#]*', line): if '-y' not in line: # Adding default repo line += ' -y' requirements.append(line.replace('#', '').strip()) # Add pip install - pattern = r"([ ]*pip[ ]*install[ ]*)([A-Za-z=0-9.\-: ]*)" + pattern = r"([ ]*pip[ ]*install[ ]*)(.[^#]*)" for line in self.script.split('\n'): result = re.findall(pattern, line) if len(result) == 1: diff --git a/src/c3/templates/__init__.py b/src/c3/templates/__init__.py index d7602ffb..1e276976 100644 --- a/src/c3/templates/__init__.py +++ b/src/c3/templates/__init__.py @@ -11,6 +11,7 @@ R_DOCKERFILE_FILE = 'R_dockerfile_template' KFP_COMPONENT_FILE = 'kfp_component_template.yaml' KUBERNETES_JOB_FILE = 'kubernetes_job_template.job.yaml' +CWL_COMPONENT_FILE = 'cwl_component_template.cwl' GRID_WRAPPER_FILE = 'grid_wrapper_template.py' COS_GRID_WRAPPER_FILE = 'cos_grid_wrapper_template.py' @@ -38,6 +39,9 @@ with open(template_path / KUBERNETES_JOB_FILE, 'r') as f: kubernetes_job_template = Template(f.read()) +with open(template_path / CWL_COMPONENT_FILE, 'r') as f: + cwl_component_template = Template(f.read()) + with open(template_path / GRID_WRAPPER_FILE, 'r') as f: grid_wrapper_template = Template(f.read()) diff --git a/src/c3/templates/cwl_component_template.cwl b/src/c3/templates/cwl_component_template.cwl new file mode 100644 index 00000000..fc4c5a6f --- /dev/null +++ b/src/c3/templates/cwl_component_template.cwl @@ -0,0 +1,15 @@ +cwlVersion: v1.2 +class: CommandLineTool + +baseCommand: "claimed" + +inputs: + component: + type: string + default: ${repository}/claimed-${name}:${version} + inputBinding: + position: 1 + prefix: --component +${inputs} + +outputs: ${outputs} \ No newline at end of file diff --git a/src/c3/templates/python_dockerfile_template b/src/c3/templates/python_dockerfile_template index f0e23d04..3c3b4128 100644 --- a/src/c3/templates/python_dockerfile_template +++ b/src/c3/templates/python_dockerfile_template @@ -1,9 +1,9 @@ FROM registry.access.redhat.com/ubi8/python-39 USER root -RUN pip install ipython 
-${requirements_docker} ADD ${target_code} ${working_dir} ADD ${additional_files_path} ${working_dir} +RUN pip install ipython +${requirements_docker} RUN chmod -R 777 ${working_dir} USER default CMD ["${command}", "${working_dir}${target_code}"] \ No newline at end of file diff --git a/tests/test_compiler.py b/tests/test_compiler.py index 3d801db4..a407d2e1 100644 --- a/tests/test_compiler.py +++ b/tests/test_compiler.py @@ -96,19 +96,19 @@ def test_increase_version( test_create_operator_input = [ ( - TEST_RSCRIPT_PATH, + TEST_SCRIPT_PATH, DUMMY_REPO, - [], + [TEST_NOTEBOOK_PATH], ), ( - TEST_NOTEBOOK_PATH, + TEST_RSCRIPT_PATH, DUMMY_REPO, [], ), ( - TEST_SCRIPT_PATH, + TEST_NOTEBOOK_PATH, DUMMY_REPO, - [TEST_NOTEBOOK_PATH], + [], ), ] @pytest.mark.parametrize( @@ -127,6 +127,7 @@ def test_create_operator( file = Path(file_path) file.with_suffix('.yaml').unlink() file.with_suffix('.job.yaml').unlink() + file.with_suffix('.cwl').unlink() image_name = f"{repository}/claimed-{file_path.rsplit('.')[0].replace('_', '-')}:test" subprocess.run(['docker', 'run', image_name], check=True) @@ -164,6 +165,7 @@ def test_create_gridwrapper( gw_file.with_suffix('.yaml').unlink() gw_file.with_suffix('.job.yaml').unlink() + gw_file.with_suffix('.cwl').unlink() gw_file.unlink() image_name = f"{repository}/claimed-gw-{file_path.rsplit('.')[0].replace('_', '-')}:test" subprocess.run(['docker', 'run', image_name], From dcd5135a00230ee13a7d281bdec7747e2da8b163 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Fri, 17 Nov 2023 14:40:56 +0100 Subject: [PATCH 109/177] Fixed outputs, changed working dir Signed-off-by: Benedikt Blumenstiel --- GettingStarted.md | 13 +++++-------- examples/example_rscript.R | 4 +--- src/c3/create_operator.py | 5 +---- src/c3/pythonscript.py | 7 +++++-- src/c3/rscript.py | 6 ++++-- src/c3/templates/R_dockerfile_template | 3 ++- src/c3/templates/cwl_component_template.cwl | 3 ++- src/c3/templates/python_dockerfile_template | 3 ++- 8 files changed, 22 insertions(+), 22 deletions(-) diff --git a/GettingStarted.md b/GettingStarted.md index e8bf9132..10f9aeec 100644 --- a/GettingStarted.md +++ b/GettingStarted.md @@ -278,9 +278,8 @@ inputs: parameter1: string parameter2: string parameter3: string - -outputs: parameter4: string +outputs: [] steps: .cwl: @@ -288,17 +287,15 @@ steps: in: parameter1: parameter1 parameter2: parameter2 - out: parameter3: parameter3 + out: [] .cwl: run: ./path/to/.cwl in: parameter3: parameter3 - out: parameter4: parameter4 -``` - -If a workflow or component does not have outputs, use `outputs: []`. + out: [] +``` Run the CWL workflow in your terminal with: ```shell @@ -350,7 +347,7 @@ Therefore, use the command `pip install -r /opt/app-root/src/requirements.txt`. - The interface is defined by environment variables `my_parameter <- Sys.getenv('my_parameter', 'optional_default_value')`. Output paths start with `output_`. Note that operators cannot return values but always have to save outputs in files. - You can cast a specific type by wrapping `Sys.getenv()` with `as.numeric()` or `as.logical()`. The default type is string. Only these three types are currently supported. You can use `NULL` as a default value but not pass `NULL` via the `job.yaml`. -You can optionally install future tools with `apt` by adding a comment `# apt ` +You can optionally install future tools with `apt` by adding a comment `# apt `. 
#### Example diff --git a/examples/example_rscript.R b/examples/example_rscript.R index 4f4e19d9..05bc0745 100644 --- a/examples/example_rscript.R +++ b/examples/example_rscript.R @@ -1,6 +1,6 @@ # Reading env variables -name <- Sys.getenv('name') +name <- Sys.getenv('name', 'world') default <- Sys.getenv('default', "default") @@ -10,8 +10,6 @@ print(paste("hello", name)) print(number) -# apt install libgdal-dev - # Install packages install.packages('readr') library(readr) diff --git a/src/c3/create_operator.py b/src/c3/create_operator.py index d8217de1..fad63ead 100644 --- a/src/c3/create_operator.py +++ b/src/c3/create_operator.py @@ -112,10 +112,7 @@ def create_cwl_component(name, repository, version, file_path, inputs, outputs): input_envs += (f" {input}:\n type: string\n default: {options['default']}\n " f"inputBinding:\n position: {i}\n prefix: --{input}\n") - if len(outputs) == 0: - output_envs = '[]' - else: - output_envs = '\n' + output_envs = '\n' for output, options in outputs.items(): i += 1 output_envs += (f" {output}:\n type: string\n default: {options['default']}\n " diff --git a/src/c3/pythonscript.py b/src/c3/pythonscript.py index d289c420..bb9e8397 100644 --- a/src/c3/pythonscript.py +++ b/src/c3/pythonscript.py @@ -76,7 +76,10 @@ def get_description(self): return self.description def get_inputs(self): - return {key: value for (key, value) in self.envs.items() if not key.startswith('output_')} + return self.envs.items() + # return {key: value for (key, value) in self.envs.items() if not key.startswith('output_')} def get_outputs(self): - return {key: value for (key, value) in self.envs.items() if key.startswith('output_')} + # TODO: Test Kubeflow outputs. Does not fit current usage. Maybe use os.setenv() + return {} + # return {key: value for (key, value) in self.envs.items() if key.startswith('output_')} diff --git a/src/c3/rscript.py b/src/c3/rscript.py index 6405fb02..4b029cfa 100644 --- a/src/c3/rscript.py +++ b/src/c3/rscript.py @@ -74,7 +74,9 @@ def get_description(self): return self.description def get_inputs(self): - return {key: value for (key, value) in self.envs.items() if not key.startswith('output_')} + # return {key: value for (key, value) in self.envs.items() if not key.startswith('output_')} + return self.envs.items() def get_outputs(self): - return {key: value for (key, value) in self.envs.items() if key.startswith('output_')} + # return {key: value for (key, value) in self.envs.items() if key.startswith('output_')} + return {} diff --git a/src/c3/templates/R_dockerfile_template b/src/c3/templates/R_dockerfile_template index f0b22633..a2f59e69 100644 --- a/src/c3/templates/R_dockerfile_template +++ b/src/c3/templates/R_dockerfile_template @@ -7,4 +7,5 @@ ADD ${additional_files_path} ${working_dir} RUN chmod -R 777 ${working_dir} RUN chmod -R 777 /usr/local/lib/R/ USER docker -CMD ["${command}", "${working_dir}${target_code}"] \ No newline at end of file +WORKDIR "${working_dir}" +CMD ["${command}", "${target_code}"] \ No newline at end of file diff --git a/src/c3/templates/cwl_component_template.cwl b/src/c3/templates/cwl_component_template.cwl index fc4c5a6f..5b01e333 100644 --- a/src/c3/templates/cwl_component_template.cwl +++ b/src/c3/templates/cwl_component_template.cwl @@ -11,5 +11,6 @@ inputs: position: 1 prefix: --component ${inputs} +${outputs} -outputs: ${outputs} \ No newline at end of file +outputs: [] \ No newline at end of file diff --git a/src/c3/templates/python_dockerfile_template b/src/c3/templates/python_dockerfile_template index 
3c3b4128..ec05fd2a 100644 --- a/src/c3/templates/python_dockerfile_template +++ b/src/c3/templates/python_dockerfile_template @@ -6,4 +6,5 @@ RUN pip install ipython ${requirements_docker} RUN chmod -R 777 ${working_dir} USER default -CMD ["${command}", "${working_dir}${target_code}"] \ No newline at end of file +WORKDIR "${working_dir}" +CMD ["${command}", "${target_code}"] \ No newline at end of file From 00890a22366b5551cf9e456b9a496f7674721b94 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Fri, 17 Nov 2023 17:19:03 +0100 Subject: [PATCH 110/177] Fixed error Signed-off-by: Benedikt Blumenstiel --- src/c3/pythonscript.py | 2 +- src/c3/rscript.py | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/src/c3/pythonscript.py b/src/c3/pythonscript.py index bb9e8397..027ccfcc 100644 --- a/src/c3/pythonscript.py +++ b/src/c3/pythonscript.py @@ -76,7 +76,7 @@ def get_description(self): return self.description def get_inputs(self): - return self.envs.items() + return self.envs # return {key: value for (key, value) in self.envs.items() if not key.startswith('output_')} def get_outputs(self): diff --git a/src/c3/rscript.py b/src/c3/rscript.py index 4b029cfa..3820b927 100644 --- a/src/c3/rscript.py +++ b/src/c3/rscript.py @@ -75,7 +75,7 @@ def get_description(self): def get_inputs(self): # return {key: value for (key, value) in self.envs.items() if not key.startswith('output_')} - return self.envs.items() + return self.envs def get_outputs(self): # return {key: value for (key, value) in self.envs.items() if key.startswith('output_')} From b0b714358115f41eedaf3f42cc533502a6abfbed Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Wed, 22 Nov 2023 12:59:47 +0100 Subject: [PATCH 111/177] add generated artifacts and cwl workflow example --- examples/operator_example.cwl | 47 ++++++++++++++++++++++++++++++ examples/operator_example.job.yaml | 26 +++++++++++++++++ examples/operator_example.yaml | 27 +++++++++++++++++ examples/workflow_example.cwl | 29 ++++++++++++++++++ 4 files changed, 129 insertions(+) create mode 100644 examples/operator_example.cwl create mode 100644 examples/operator_example.job.yaml create mode 100644 examples/operator_example.yaml create mode 100644 examples/workflow_example.cwl diff --git a/examples/operator_example.cwl b/examples/operator_example.cwl new file mode 100644 index 00000000..ed1a5f1b --- /dev/null +++ b/examples/operator_example.cwl @@ -0,0 +1,47 @@ +cwlVersion: v1.2 +class: CommandLineTool + +baseCommand: "claimed" + +inputs: + component: + type: string + default: us.ico.io/geodn/claimed-operator-example:0.2 + inputBinding: + position: 1 + prefix: --component + log_level: + type: string + default: "INFO" + inputBinding: + position: 2 + prefix: --log_level + input_path: + type: string + default: None + inputBinding: + position: 3 + prefix: --input_path + with_default: + type: string + default: "default_value" + inputBinding: + position: 4 + prefix: --with_default + num_values: + type: string + default: "5" + inputBinding: + position: 5 + prefix: --num_values + output_path: + type: string + default: "None" + inputBinding: + position: 6 + prefix: --output_path + + + + +outputs: [] \ No newline at end of file diff --git a/examples/operator_example.job.yaml b/examples/operator_example.job.yaml new file mode 100644 index 00000000..3a1d8366 --- /dev/null +++ b/examples/operator_example.job.yaml @@ -0,0 +1,26 @@ +apiVersion: batch/v1 +kind: Job +metadata: + name: operator-example +spec: + template: + spec: + containers: + - name: operator-example + 
image: us.ico.io/geodn/claimed-operator-example:0.2 + workingDir: /opt/app-root/src/ + command: ["/opt/app-root/bin/python","claimed_operator_example.py"] + env: + - name: log_level + value: value_of_log_level + - name: input_path + value: value_of_input_path + - name: with_default + value: value_of_with_default + - name: num_values + value: value_of_num_values + - name: output_path + value: value_of_output_path + restartPolicy: OnFailure + imagePullSecrets: + - name: image_pull_secret \ No newline at end of file diff --git a/examples/operator_example.yaml b/examples/operator_example.yaml new file mode 100644 index 00000000..7fb4b75f --- /dev/null +++ b/examples/operator_example.yaml @@ -0,0 +1,27 @@ +name: operator-example +description: "TODO: Update the description of the operator in the first doc string. This is the operator description. The file name becomes the operator name. – CLAIMED V0.1" + +inputs: +- {name: log_level, type: String, description: "update log level", default: "INFO"} +- {name: input_path, type: String, description: "A comment one line above os.getenv is the description of this variable."} +- {name: with_default, type: String, description: "If you specify a default value, this parameter gets marked as optional", default: "default_value"} +- {name: num_values, type: Integer, description: "You can cast to a specific type with int(), float(), or bool() - this type information propagates down to the execution engines (e.g., Kubeflow)", default: "5"} +- {name: output_path, type: String, description: "Output paths are starting with 'output_'.", default: "None"} + + +outputs: + + +implementation: + container: + image: us.ico.io/geodn/claimed-operator-example:0.2 + command: + - sh + - -ec + - | + python ./claimed_operator_example.py log_level="${0}" input_path="${1}" with_default="${2}" num_values="${3}" output_path="${4}" + - {inputValue: log_level} + - {inputValue: input_path} + - {inputValue: with_default} + - {inputValue: num_values} + - {inputValue: output_path} diff --git a/examples/workflow_example.cwl b/examples/workflow_example.cwl new file mode 100644 index 00000000..40bf2a09 --- /dev/null +++ b/examples/workflow_example.cwl @@ -0,0 +1,29 @@ +#!/usr/bin/env cwl-runner + + +cwlVersion: v1.2 + +# What type of CWL process we have in this document. 
+#class: CommandLineTool + +class: Workflow + +inputs: + num_values: string + + +outputs: [] + +steps: + example1: + run: operator_example.cwl + in: + num_values: num_values + out: [] + + example2: + run: operator_example.cwl + in: + num_values: num_values + out: [] + From 42091769cab5a7a78b3e8c9d4ab8e93e14aeca11 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Thu, 23 Nov 2023 16:17:09 +0100 Subject: [PATCH 112/177] Added output paths Signed-off-by: Benedikt Blumenstiel --- pyproject.toml | 5 ++- src/c3/create_operator.py | 44 ++++++++++--------- src/c3/parser.py | 23 +++++----- src/c3/pythonscript.py | 21 +++++---- src/c3/rscript.py | 20 ++++++--- src/c3/templates/component_setup_code.R | 2 +- src/c3/templates/cwl_component_template.cwl | 3 +- .../kubernetes_job_template.job.yaml | 3 +- tests/example_script.py | 2 + 9 files changed, 71 insertions(+), 52 deletions(-) diff --git a/pyproject.toml b/pyproject.toml index 445566a6..9be99509 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -7,8 +7,9 @@ version_file = "src/c3/_version.py" [project] name = "claimed-c3" -dynamic = ["version"] -# version = "0.2.9" +# dynamic = ["version"] +# test pypi version: +version = "0.2.15" authors = [ { name="The CLAIMED authors", email="claimed-framework@proton.me"}, ] diff --git a/src/c3/create_operator.py b/src/c3/create_operator.py index fad63ead..a4a5357e 100644 --- a/src/c3/create_operator.py +++ b/src/c3/create_operator.py @@ -36,18 +36,19 @@ def create_dockerfile(dockerfile_template, requirements, target_code, additional def create_kfp_component(name, description, repository, version, command, target_code, file_path, inputs, outputs): - def get_component_interface(parameters): - return_string = str() - for name, options in parameters.items(): - return_string += f'- {{name: {name}, type: {options["type"]}, description: "{options["description"]}"' - if options['default'] is not None: - if not options["default"].startswith('"'): - options["default"] = f'"{options["default"]}"' - return_string += f', default: {options["default"]}' - return_string += '}\n' - return return_string - inputs_list = get_component_interface(inputs) - outputs_list = get_component_interface(outputs) + + inputs_list = str() + for name, options in inputs.items(): + inputs_list += f'- {{name: {name}, type: {options["type"]}, description: "{options["description"]}"' + if options['default'] is not None: + if not options["default"].startswith('"'): + options["default"] = f'"{options["default"]}"' + inputs_list += f', default: {options["default"]}' + inputs_list += '}\n' + + outputs_list = str() + for name, options in outputs.items(): + outputs_list += f'- {{name: {name}, type: String, description: "{options["description"]}"}}\n' parameter_list = str() for index, key in enumerate(list(inputs.keys()) + list(outputs.keys())): @@ -78,10 +79,10 @@ def get_component_interface(parameters): text_file.write(yaml) -def create_kubernetes_job(name, repository, version, target_code, command, working_dir, file_path, inputs, outputs): +def create_kubernetes_job(name, repository, version, target_code, command, working_dir, file_path, inputs): # get environment entries env_entries = str() - for key in list(inputs.keys()) + list(outputs.keys()): + for key in list(inputs.keys()): env_entries += f" - name: {key}\n value: value_of_{key}\n" env_entries = env_entries.rstrip() @@ -112,10 +113,13 @@ def create_cwl_component(name, repository, version, file_path, inputs, outputs): input_envs += (f" {input}:\n type: string\n default: 
{options['default']}\n " f"inputBinding:\n position: {i}\n prefix: --{input}\n") - output_envs = '\n' + if len(outputs) == 0: + output_envs = '[]' + else: + output_envs = '\n' for output, options in outputs.items(): i += 1 - output_envs += (f" {output}:\n type: string\n default: {options['default']}\n " + output_envs += (f" {output}:\n type: string\n " f"inputBinding:\n position: {i}\n prefix: --{output}\n") cwl = cwl_component_template.substitute( @@ -134,12 +138,10 @@ def create_cwl_component(name, repository, version, file_path, inputs, outputs): text_file.write(cwl) -def print_claimed_command(name, repository, version, inputs, outputs): +def print_claimed_command(name, repository, version, inputs): claimed_command = f"claimed --component {repository}/claimed-{name}:{version}" for input, options in inputs.items(): claimed_command += f" --{input} {options['default']}" - for output, options in outputs.items(): - claimed_command += f" --{output} {options['default']}" logging.info(f'Run operators locally with claimed-cli:\n{claimed_command}') @@ -304,11 +306,11 @@ def create_operator(file_path: str, # Create application scripts create_kfp_component(name, description, repository, version, command, target_code, file_path, inputs, outputs) - create_kubernetes_job(name, repository, version, target_code, command, working_dir, file_path, inputs, outputs) + create_kubernetes_job(name, repository, version, target_code, command, working_dir, file_path, inputs) create_cwl_component(name, repository, version, file_path, inputs, outputs) - print_claimed_command(name, repository, version, inputs, outputs) + print_claimed_command(name, repository, version, inputs) # Remove temp files remove_temporary_files(file_path, target_code, additional_files_path) diff --git a/src/c3/parser.py b/src/c3/parser.py index ab28224d..0405eb22 100644 --- a/src/c3/parser.py +++ b/src/c3/parser.py @@ -116,25 +116,28 @@ def parse_environment_variables(self, line): class PythonScriptParser(ScriptParser): def search_expressions(self) -> Dict[str, List]: - # TODO: add more key:list-of-regex pairs to parse for additional resources - regex_dict = dict() - # First regex matches envvar assignments that use os.getenv("name", "value") with ow w/o default provided # Second regex matches envvar assignments that use os.environ.get("name", "value") with or w/o default provided # Both name and value are captured if possible - envs = [r"os\.getenv\([\"']([a-zA-Z_]+[A-Za-z0-9_]*)[\"']*(?:\s*\,\s*[\"']?(.[^#]*)?[\"']?)?\).*", + inputs = [r"os\.getenv\([\"']([a-zA-Z_]+[A-Za-z0-9_]*)[\"']*(?:\s*\,\s*[\"']?(.[^#]*)?[\"']?)?\).*", r"os\.environ\.get\([\"']([a-zA-Z_]+[A-Za-z0-9_]*)[\"']*(?:\s*\,\s*[\"']?(.[^#]*)?[\"']?)?\).*"] - regex_dict["env_vars"] = envs + # regex matches setting envvars assignments that use + outputs = [r"\s*os\.environ\[[\"']([a-zA-Z_]+[A-Za-z0-9_]*)[\"']].*"] + + regex_dict = dict(inputs=inputs, outputs=outputs) return regex_dict class RScriptParser(ScriptParser): def search_expressions(self) -> Dict[str, List]: - regex_dict = dict() + # Tests for matches of the form: var <- Sys.getenv("key", "optional default") - envs = [r".*Sys\.getenv\([\"']*([a-zA-Z_]+[A-Za-z0-9_]*)[\"']*(?:\s*\,\s*[\"']?(.[^#]*)?[\"']?)?\).*"] - regex_dict["env_vars"] = envs + inputs = [r".*Sys\.getenv\([\"']*([a-zA-Z_]+[A-Za-z0-9_]*)[\"']*(?:\s*\,\s*[\"']?(.[^#]*)?[\"']?)?\).*"] + # Tests for matches of the form: var <- Sys.getenv("key", "optional default") + outputs = 
[r"\s*Sys\.setenv\([\"']*([a-zA-Z_]+[A-Za-z0-9_]*)[\"']*(?:\s*\,\s*[\"']?(.[^#]*)?[\"']?)?\).*"] + + regex_dict = dict(inputs=inputs, outputs=outputs) return regex_dict @@ -147,7 +150,7 @@ class ContentParser(LoggingConfigurable): def parse(self, filepath: str) -> dict: """Returns a model dictionary of all the regex matches for each key in the regex dictionary""" - properties = {"env_vars": {}, "inputs": [], "outputs": []} + properties = {"inputs": {}, "outputs": []} reader = self._get_reader(filepath) parser = self._get_parser(reader.language) @@ -159,7 +162,7 @@ def parse(self, filepath: str) -> dict: for line in chunk: matches = parser.parse_environment_variables(line) for key, match in matches: - if key == "env_vars": + if key == "inputs": default_value = match.group(2) if default_value: # The default value match can end with a additional ', ", or ) which is removed diff --git a/src/c3/pythonscript.py b/src/c3/pythonscript.py index 027ccfcc..b06591fc 100644 --- a/src/c3/pythonscript.py +++ b/src/c3/pythonscript.py @@ -18,11 +18,12 @@ def __init__(self, path): self.description = self.name else: self.description = self.script.split('"""')[1].strip() - self.envs = self._get_env_vars() + self.inputs = self._get_input_vars() + self.outputs = self._get_output_vars() - def _get_env_vars(self): + def _get_input_vars(self): cp = ContentParser() - env_names = cp.parse(self.path)['env_vars'] + env_names = cp.parse(self.path)['inputs'] return_value = dict() for env_name, default in env_names.items(): comment_line = str() @@ -51,6 +52,13 @@ def _get_env_vars(self): comment_line = line return return_value + def _get_output_vars(self): + cp = ContentParser() + output_names = cp.parse(self.path)['outputs'] + # TODO: Does not check for description + return_value = {name: {'description': 'output path'} for name in output_names} + return return_value + def get_requirements(self): requirements = [] # Add dnf install @@ -76,10 +84,7 @@ def get_description(self): return self.description def get_inputs(self): - return self.envs - # return {key: value for (key, value) in self.envs.items() if not key.startswith('output_')} + return self.inputs def get_outputs(self): - # TODO: Test Kubeflow outputs. Does not fit current usage. 
Maybe use os.setenv() - return {} - # return {key: value for (key, value) in self.envs.items() if key.startswith('output_')} + return self.outputs diff --git a/src/c3/rscript.py b/src/c3/rscript.py index 3820b927..144ca15a 100644 --- a/src/c3/rscript.py +++ b/src/c3/rscript.py @@ -15,11 +15,12 @@ def __init__(self, path): self.name = os.path.basename(path)[:-2].replace('_', '-').lower() # TODO: Currently does not support a description self.description = self.name - self.envs = self._get_env_vars() + self.inputs = self._get_input_vars() + self.outputs = self._get_output_vars() - def _get_env_vars(self): + def _get_input_vars(self): cp = ContentParser() - env_names = cp.parse(self.path)['env_vars'] + env_names = cp.parse(self.path)['inputs'] return_value = dict() for env_name, default in env_names.items(): comment_line = str() @@ -47,6 +48,13 @@ def _get_env_vars(self): comment_line = line return return_value + def _get_output_vars(self): + cp = ContentParser() + output_names = cp.parse(self.path)['outputs'] + # TODO: Does not check for description + return_value = {name: {'description': 'output path'} for name in output_names} + return return_value + def get_requirements(self): requirements = [] # Add apt install commands @@ -74,9 +82,7 @@ def get_description(self): return self.description def get_inputs(self): - # return {key: value for (key, value) in self.envs.items() if not key.startswith('output_')} - return self.envs + return self.inputs def get_outputs(self): - # return {key: value for (key, value) in self.envs.items() if key.startswith('output_')} - return {} + return self.outputs diff --git a/src/c3/templates/component_setup_code.R b/src/c3/templates/component_setup_code.R index bedd266e..daa3f847 100644 --- a/src/c3/templates/component_setup_code.R +++ b/src/c3/templates/component_setup_code.R @@ -7,7 +7,7 @@ for (parameter in args) { print(parameter) key <- key_value[1] value <- key_value[2] - Sys.setenv(key = value) + eval(parse(text=paste0('Sys.setenv(',key,'="',value,'")'))) } else { print(paste('Could not find key value pair for argument ', parameter)) } diff --git a/src/c3/templates/cwl_component_template.cwl b/src/c3/templates/cwl_component_template.cwl index 5b01e333..f5106075 100644 --- a/src/c3/templates/cwl_component_template.cwl +++ b/src/c3/templates/cwl_component_template.cwl @@ -11,6 +11,5 @@ inputs: position: 1 prefix: --component ${inputs} -${outputs} -outputs: [] \ No newline at end of file +outputs: ${outputs} diff --git a/src/c3/templates/kubernetes_job_template.job.yaml b/src/c3/templates/kubernetes_job_template.job.yaml index e2a65339..c730e168 100644 --- a/src/c3/templates/kubernetes_job_template.job.yaml +++ b/src/c3/templates/kubernetes_job_template.job.yaml @@ -8,7 +8,8 @@ spec: containers: - name: ${name} image: ${repository}/claimed-${name}:${version} - command: ["${command}","${working_dir}${target_code}"] + workingDir: ${working_dir} + command: ["${command}","${target_code}"] env: ${env_entries} restartPolicy: OnFailure diff --git a/tests/example_script.py b/tests/example_script.py index bd86dc28..103635d9 100644 --- a/tests/example_script.py +++ b/tests/example_script.py @@ -30,6 +30,8 @@ def main(*args): """ _ = np.random.randn(5) + os.environ['test_output'] = 'test' + print(args) From a3a2975e758d417c37ef7d5cf5e746c07e3697d7 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Thu, 23 Nov 2023 16:17:19 +0100 Subject: [PATCH 113/177] Updated GettingStarted.md Signed-off-by: Benedikt Blumenstiel --- GettingStarted.md | 95 
++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 93 insertions(+), 2 deletions(-)

diff --git a/GettingStarted.md b/GettingStarted.md
index 10f9aeec..c0502026 100644
--- a/GettingStarted.md
+++ b/GettingStarted.md
@@ -261,6 +261,95 @@ from kfp_tekton.compiler import TektonCompiler
 TektonCompiler().compile(pipeline_func=my_pipeline, package_path='my_pipeline.yaml')
 ```
 
+If you are using another tekton version, you can use the following code to save an adjusted yaml file for version `v1beta1`:
+
+```python
+# pip install kfp-tekton pyyaml
+
+import yaml
+from kfp_tekton.compiler import TektonCompiler
+
+# Read dict to update apiVersion
+_, pipeline_dict = TektonCompiler().prepare_workflow(my_pipeline)
+pipeline_dict['apiVersion'] = 'tekton.dev/v1beta1'
+# write pipeline to yaml
+with open('my_pipeline.yaml', 'w') as f:
+    yaml.dump(pipeline_dict, f)
+```
+
+#### Timeout in KubeFlow Tekton
+
+The default timeout in a KFP tekton pipeline is set to 60 minutes. The default value can be changed in the tekton config by the [administrators](https://tekton.dev/docs/pipelines/pipelineruns/#configuring-a-failure-timeout). Otherwise, you can update the timeout in the yaml with the following code:
+
+```python
+# Read dict to update apiVersion and timeouts
+_, pipeline_dict = TektonCompiler().prepare_workflow(my_pipeline)
+pipeline_dict['spec']['timeouts'] = {'pipeline': "0"}  # 0 = no timeout
+# write pipeline to yaml
+with open('my_pipeline.yaml', 'w') as f:
+    yaml.dump(pipeline_dict, f)
+```
+
+#### Shared volumes
+
+Data is not shared by default between different steps.
+You can add a volume to each step for data sharing.
+First, you create a PersistentVolumeClaim (PVC) in the Kubernetes project that is running KubeFlow.
+If you want to run multiple steps in parallel, this PVC must support ReadWriteMany, otherwise ReadWriteOnce is sufficient.
+Next, you can mount this PVC to each step with the following code:
+
+```python
+mount_folder = "/opt/app-root/src/"
+
+# Init the KFP component
+step = my_kfp_op(...)
+
+step.add_pvolumes({mount_folder: dsl.PipelineVolume(pvc='<pvc_name>')})
+```
+
+You can include the working directory in the mount path to use relative paths (`/opt/app-root/src/` for python and `/home/docker/` for R).
+Otherwise, you can use absolute paths in your scripts/variables `/<mount_folder>/<path>...`.
+
+#### Secrets
+
+You can use key-value secrets in KubeFlow as well to avoid publishing sensitive information in pod configs and logs.
+You can add the secrets in the Kubernetes project that is running KubeFlow.
+Then, you can add secrets to a specific step in the pipeline with the following code:
+
+```python
+from kubernetes.client import V1EnvVar, V1EnvVarSource, V1SecretKeySelector
+
+# Init the KFP component
+step = my_kfp_op(...)
+
+# Add a secret as env variable
+secret_env_var = V1EnvVar(
+    name='<secret_variable>',
+    value_from=V1EnvVarSource(secret_key_ref=V1SecretKeySelector(name='<secret_name>', key='<secret_key>')
+))
+step.add_env_variable(secret_env_var)
+```
+
+The secret will be set as an env variable and loaded by the common C3 interface.
+Therefore, it is important that KubeFlow does not overwrite this env variable.
+You need to adjust the command in the KFP component yaml by deleting the variable:
+```yaml
+# Original command with secret_variable
+command:
+  ...
+  python ./<your_script>.py log_level="${0}" <secret_variable>="${1}" other_variable="${2}" ...
+  ...
+
+# Adjusted command
+command:
+  ...
+  python ./<your_script>.py log_level="${0}" other_variable="${2}" ...
+  ...
+```
+Further, it is important that the variable has a default value and is optional
+(you can simply add `default: ""` to the variable in the KFP component yaml without recompiling your script).
+
+
 ### CWL workflows
 
 You can run workflows locally with CWL. This requires the cwltool package:
@@ -322,8 +411,9 @@ Your operator script has to follow certain requirements to be processed by C3. C
 - The operator name is the python file: `my_operator_name.py` -> `claimed-my-operator-name`
 - The operator description is the first doc string in the script: `"""Operator description"""`
 - The required pip packages are listed in comments starting with pip install: `# pip install <package>`
-- The interface is defined by environment variables `my_parameter = os.getenv('my_parameter')`. Output paths start with `output_`. Note that operators cannot return values but always have to save outputs in files.
+- The interface is defined by environment variables `my_parameter = os.getenv('my_parameter')`.
 - You can cast a specific type by wrapping `os.getenv()` with `int()`, `float()`, `bool()`. The default type is string. Only these four types are currently supported. You can use `None` as a default value but not pass the `NoneType` via the `job.yaml`.
+- Output paths for KubeFlow can be defined with `os.environ['my_output_parameter'] = ...`. Note that operators cannot return values but always have to save outputs in files (see the sketch at the end of this section).
 
 You can optionally install future tools with `dnf` by adding a comment `# dnf <package>`.
 
@@ -344,8 +434,9 @@ Therefore, use the command `pip install -r /opt/app-root/src/requirements.txt`.
 - The operator name is the R file: `my_operator_name.R` -> `claimed-my-operator-name`
 - The operator description is currently fixed to `"R script"`.
 - The required R packages are installed with: `install.packages(<package>, repos=<repository>)`
-- The interface is defined by environment variables `my_parameter <- Sys.getenv('my_parameter', 'optional_default_value')`. Output paths start with `output_`. Note that operators cannot return values but always have to save outputs in files.
+- The interface is defined by environment variables `my_parameter <- Sys.getenv('my_parameter', 'optional_default_value')`.
 - You can cast a specific type by wrapping `Sys.getenv()` with `as.numeric()` or `as.logical()`. The default type is string. Only these three types are currently supported. You can use `NULL` as a default value but not pass `NULL` via the `job.yaml`.
+- Output paths for KubeFlow can be defined with `Sys.setenv()`. Note that operators cannot return values but always have to save outputs in files.
 
 You can optionally install future tools with `apt` by adding a comment `# apt <package>`.
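+
+Tying these rules together, a minimal sketch of the new output convention in a Python operator (names are examples; this is the pattern referenced in the Python section above):
+
+```python
+import os
+
+# input parameter with a default value
+data_dir = os.getenv('data_dir', 'data/')
+
+# register the result file as a KubeFlow output path
+os.environ['output_file'] = data_dir + 'result.csv'
+```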
From dd212a935a4f116709cbe07901f5235239e706d6 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Thu, 23 Nov 2023 16:21:54 +0100 Subject: [PATCH 114/177] Reset dynamic pypi version Signed-off-by: Benedikt Blumenstiel --- pyproject.toml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/pyproject.toml b/pyproject.toml index 9be99509..e5778350 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -7,9 +7,9 @@ version_file = "src/c3/_version.py" [project] name = "claimed-c3" -# dynamic = ["version"] +dynamic = ["version"] # test pypi version: -version = "0.2.15" +# version = "0.2.15" authors = [ { name="The CLAIMED authors", email="claimed-framework@proton.me"}, ] From 8d0dc8f67abe0e73744d034cae0b38899544f8ef Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Thu, 23 Nov 2023 21:54:23 +0100 Subject: [PATCH 115/177] add 'ask for stars' comment --- GettingStarted.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/GettingStarted.md b/GettingStarted.md index c0502026..baa1d4a5 100644 --- a/GettingStarted.md +++ b/GettingStarted.md @@ -5,6 +5,9 @@ The [CLAIMED framework](https://github.com/claimed-framework) enables ease-of-us A central tool of CLAIMED is the **Claimed Component Compiler (C3)** which creates a docker image with all dependencies, pushes the container to a registry, and creates a kubernetes-job.yaml as well as a kubeflow-pipeline-component.yaml. This page explains how to apply operators, combine them to workflows, and how to build them yourself using C3. +If you like CLAIMED, just give us a [star](https://github.com/claimed-framework/component-library) on our [main project](https://github.com/claimed-framework/component-library). + + ## Content **[1. Apply operators](#1-apply-operators)** @@ -729,4 +732,4 @@ spec: restartPolicy: Never imagePullSecrets: - name: image-pull-secret -``` \ No newline at end of file +``` From fd5ffdff2e1ea5449f29f5ec0c3cfab5a163d06b Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Mon, 15 Jan 2024 15:09:36 +0100 Subject: [PATCH 116/177] Fix interface issue in grid wrapper Signed-off-by: Benedikt Blumenstiel --- src/c3/create_gridwrapper.py | 33 ++++++++++++++++++++------------- src/c3/pythonscript.py | 7 +++++-- tests/example_script.py | 5 ++--- tests/test_compiler.py | 16 ++++++++-------- 4 files changed, 35 insertions(+), 26 deletions(-) diff --git a/src/c3/create_gridwrapper.py b/src/c3/create_gridwrapper.py index eb7e5a72..0a903fda 100644 --- a/src/c3/create_gridwrapper.py +++ b/src/c3/create_gridwrapper.py @@ -53,21 +53,23 @@ def get_component_elements(file_path): outputs = py.get_outputs() dependencies = py.get_requirements() - # combine inputs and outputs - interface_values = {} - interface_values.update(inputs) - interface_values.update(outputs) - # combine dependencies list dependencies = '\n# '.join(dependencies) - # generate interface code from inputs and outputs + # generate interface code from inputs interface = '' type_to_func = {'String': '', 'Boolean': 'bool', 'Integer': 'int', 'Float': 'float'} - for variable, d in interface_values.items(): + for variable, d in inputs.items(): interface += f"# {d['description']}\n" + if d['type'] == 'String' and d['default'] is not None and d['default'][0] not in '\'\"': + # Add quotation marks + d['default'] = "'" + d['default'] + "'" interface += f"component_{variable} = {type_to_func[d['type']]}(os.getenv('{variable}', {d['default']}))\n" + # TODO: Implement output interface + if len(outputs) > 0: + logging.warning('Found output paths in the 
component code which is currently not supported.') + # generate kwargs for the subprocesses process_inputs = ', '.join([f'{i}=component_{i}' for i in inputs.keys()]) # use log level from grid wrapper @@ -108,7 +110,7 @@ def edit_component_code(file_path): return target_file -def apply_grid_wrapper(file_path, component_process, cos, *args, **kwargs): +def apply_grid_wrapper(file_path, component_process, cos): assert file_path.endswith('.py') or file_path.endswith('.ipynb'), \ "Please provide a component file path to a python script or notebook." @@ -136,9 +138,9 @@ def apply_grid_wrapper(file_path, component_process, cos, *args, **kwargs): def main(): parser = argparse.ArgumentParser() - parser.add_argument('file_path', type=str, + parser.add_argument('FILE_PATH', type=str, help='Path to python script or notebook') - parser.add_argument('additional_files', type=str, nargs='*', + parser.add_argument('ADDITIONAL_FILES', type=str, nargs='*', help='List of paths to additional files to include in the container image') parser.add_argument('-p', '--component_process', type=str, required=True, help='Name of the component sub process that is executed for each batch.') @@ -152,6 +154,7 @@ def main(): parser.add_argument('--dockerfile_template_path', type=str, default='', help='Path to custom dockerfile template') parser.add_argument('--test_mode', action='store_true') + parser.add_argument('--no-cache', action='store_true') args = parser.parse_args() # Init logging @@ -163,13 +166,17 @@ def main(): handler.setLevel(args.log_level) root.addHandler(handler) - grid_wrapper_file_path, component_path = apply_grid_wrapper(**vars(args)) + grid_wrapper_file_path, component_path = apply_grid_wrapper( + file_path=args.FILE_PATH, + component_process=args.component_process, + cos=args.cos, + ) if args.repository is not None: logging.info('Generate CLAIMED operator for grid wrapper') # Add component path and init file path to additional_files - args.additional_files.append(component_path) + args.ADDITIONAL_FILES.append(component_path) # Update dockerfile template if specified if args.dockerfile_template_path != '': @@ -184,7 +191,7 @@ def main(): repository=args.repository, version=args.version, custom_dockerfile_template=custom_dockerfile_template, - additional_files=args.additional_files, + additional_files=args.ADDITIONAL_FILES, log_level=args.log_level, test_mode=args.test_mode, ) diff --git a/src/c3/pythonscript.py b/src/c3/pythonscript.py index b06591fc..05cabf34 100644 --- a/src/c3/pythonscript.py +++ b/src/c3/pythonscript.py @@ -55,8 +55,11 @@ def _get_input_vars(self): def _get_output_vars(self): cp = ContentParser() output_names = cp.parse(self.path)['outputs'] - # TODO: Does not check for description - return_value = {name: {'description': 'output path'} for name in output_names} + # TODO: Does not check for description code + return_value = {name: { + 'description': f'Output path for {name}', + 'type': 'String', + } for name in output_names} return return_value def get_requirements(self): diff --git a/tests/example_script.py b/tests/example_script.py index 103635d9..6af2556c 100644 --- a/tests/example_script.py +++ b/tests/example_script.py @@ -12,7 +12,7 @@ import numpy as np # A comment one line above os.getenv is the description of this variable. 
-input_path = os.environ.get('input_path', None) # ('not this') +input_path = os.environ.get('input_path') # ('not this') # type casting to int(), float(), or bool() batch_size = int(os.environ.get('batch_size', 16)) # (not this) @@ -20,8 +20,7 @@ # Commas in the previous comment are deleted because the yaml file requires descriptions without commas. debug = bool(os.getenv('debug', False)) -# Output parameters are starting with "output_" -output_path = os.getenv('output_path') +output_path = os.getenv('output_path', 'default_value') def main(*args): diff --git a/tests/test_compiler.py b/tests/test_compiler.py index a407d2e1..1018605d 100644 --- a/tests/test_compiler.py +++ b/tests/test_compiler.py @@ -134,18 +134,18 @@ def test_create_operator( test_create_gridwrapper_input = [ + ( + TEST_SCRIPT_PATH, + DUMMY_REPO, + 'process', + [TEST_NOTEBOOK_PATH], + ), ( TEST_NOTEBOOK_PATH, DUMMY_REPO, 'your_function', [], ), - ( - TEST_SCRIPT_PATH, - DUMMY_REPO, - 'process', - [TEST_NOTEBOOK_PATH], - ) ] @pytest.mark.parametrize( "file_path, repository, process, args", @@ -168,5 +168,5 @@ def test_create_gridwrapper( gw_file.with_suffix('.cwl').unlink() gw_file.unlink() image_name = f"{repository}/claimed-gw-{file_path.rsplit('.')[0].replace('_', '-')}:test" - subprocess.run(['docker', 'run', image_name], - check=True) \ No newline at end of file + # TODO: Modify subprocess call to test grid wrapper + # subprocess.run(['docker', 'run', image_name], check=True) From 47c9107c7364186b1438fef80ad34ba342f7ccc0 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Mon, 15 Jan 2024 15:11:13 +0100 Subject: [PATCH 117/177] Avoid overwriting job yaml #15 Signed-off-by: Benedikt Blumenstiel --- src/c3/create_gridwrapper.py | 6 ++++ src/c3/create_operator.py | 55 ++++++++++++++++++++++++++++++++++++ 2 files changed, 61 insertions(+) diff --git a/src/c3/create_gridwrapper.py b/src/c3/create_gridwrapper.py index 0a903fda..5cbcbd9c 100644 --- a/src/c3/create_gridwrapper.py +++ b/src/c3/create_gridwrapper.py @@ -150,6 +150,9 @@ def main(): help='Container registry address, e.g. docker.io/') parser.add_argument('-v', '--version', type=str, default=None, help='Container image version. 
Auto-increases the version number if not provided (default 0.1)') + parser.add_argument('--rename', type=str, nargs='?', default=None, const='', + help='Rename existing yaml files (argument without value leads to modified_{file name})') + parser.add_argument('--overwrite', action='store_true', help='Overwrite existing yaml files') parser.add_argument('-l', '--log_level', type=str, default='INFO') parser.add_argument('--dockerfile_template_path', type=str, default='', help='Path to custom dockerfile template') @@ -194,6 +197,9 @@ def main(): additional_files=args.ADDITIONAL_FILES, log_level=args.log_level, test_mode=args.test_mode, + no_cache=args.no_cache, + overwrite_files=args.overwrite, + rename_files=args.rename, ) logging.info('Remove local component file') diff --git a/src/c3/create_operator.py b/src/c3/create_operator.py index a4a5357e..f800da62 100644 --- a/src/c3/create_operator.py +++ b/src/c3/create_operator.py @@ -138,6 +138,47 @@ def create_cwl_component(name, repository, version, file_path, inputs, outputs): text_file.write(cwl) +def check_existing_files(file_path, rename_files, overwrite_files): + if rename_files is None and overwrite_files: + # Overwrite potential files + return + + target_job_yaml_path = Path(file_path).with_suffix('.job.yaml') + + # Check for existing job yaml + if target_job_yaml_path.is_file(): + if rename_files is None: + # Ask user + rename_files = input(f'Found modified job.yaml at {target_job_yaml_path}. ' + f'C3 will rename the modified file to modified_{target_job_yaml_path.name}.\n' + f'ENTER to continue, write N for overwrite, ' + f'or provide a custom name for the modified file.') + if rename_files.lower() == 'n': + # Overwrite file + return + elif rename_files.strip() == '': + # Default file name + new_file_name = 'modified_' + Path(file_path).name + else: + # Rename to custom name + new_file_name = rename_files + + modified_path = (target_job_yaml_path.parent / new_file_name).with_suffix('.job.yaml') + # Check if modified path exists and potentially overwrite + if modified_path.exists(): + if overwrite_files: + logging.info(f'Overwriting modified path {modified_path}.') + else: + overwrite = input(f'Modified path {modified_path} already exists. ENTER to overwrite the file.') + if overwrite != '': + logging.error(f'Abort creating operator. Please rename file manually and rerun the script.') + raise FileExistsError + + os.rename(str(target_job_yaml_path), str(modified_path)) + logging.info(f'Renamed Kubernetes job file to {modified_path}') + # TODO: Should we check other files too? Currently assuming no modification for yaml and cwl. 
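Condensed to its decision flow, the conflict handling added here behaves like the following prompt-free sketch (the function name and flag handling are illustrative; the patch additionally asks the user interactively when neither `--rename` nor `--overwrite` is set):

```python
from pathlib import Path

def backup_existing_job_yaml(file_path, rename=None, overwrite=False):
    """Sketch: preserve an existing job.yaml before C3 regenerates it."""
    target = Path(file_path).with_suffix('.job.yaml')
    if not target.is_file() or (rename is None and overwrite):
        return  # nothing to preserve, or the caller chose to overwrite
    backup_name = rename or 'modified_' + Path(file_path).name
    backup = (target.parent / backup_name).with_suffix('.job.yaml')
    target.rename(backup)
```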
+ + def print_claimed_command(name, repository, version, inputs): claimed_command = f"claimed --component {repository}/claimed-{name}:{version}" for input, options in inputs.items(): @@ -162,6 +203,8 @@ def create_operator(file_path: str, log_level='INFO', test_mode=False, no_cache=False, + rename_files=None, + overwrite_files=False, ): logging.info('Parameters: ') logging.info('file_path: ' + file_path) @@ -303,6 +346,13 @@ def create_operator(file_path: str, remove_temporary_files(file_path, target_code, additional_files_path) raise err + # Check for existing files and optionally modify them before overwriting + try: + check_existing_files(file_path, rename_files, overwrite_files) + except Exception as err: + remove_temporary_files(file_path, target_code, additional_files_path) + raise err + # Create application scripts create_kfp_component(name, description, repository, version, command, target_code, file_path, inputs, outputs) @@ -326,6 +376,9 @@ def main(): help='Container registry address, e.g. docker.io/') parser.add_argument('-v', '--version', type=str, default=None, help='Container image version. Auto-increases the version number if not provided (default 0.1)') + parser.add_argument('--rename', type=str, nargs='?', default=None, const='', + help='Rename existing yaml files (argument without value leads to modified_{file name})') + parser.add_argument('--overwrite', action='store_true', help='Overwrite existing yaml files') parser.add_argument('-l', '--log_level', type=str, default='INFO') parser.add_argument('--dockerfile_template_path', type=str, default='', help='Path to custom dockerfile template') @@ -359,6 +412,8 @@ def main(): log_level=args.log_level, test_mode=args.test_mode, no_cache=args.no_cache, + overwrite_files=args.overwrite, + rename_files=args.rename, ) From 8e72391a1de6278798874d03ed85d9c7785bddbf Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Mon, 15 Jan 2024 15:18:14 +0100 Subject: [PATCH 118/177] Add home dir to requirements.txt #27 Signed-off-by: Benedikt Blumenstiel --- src/c3/create_operator.py | 3 +++ 1 file changed, 3 insertions(+) diff --git a/src/c3/create_operator.py b/src/c3/create_operator.py index f800da62..dbf60533 100644 --- a/src/c3/create_operator.py +++ b/src/c3/create_operator.py @@ -19,6 +19,9 @@ def create_dockerfile(dockerfile_template, requirements, target_code, additional_files_path, working_dir, command): + # Add missing home directory to the command `pip install -r ~/requirements.txt` + requirements = [r if '~/' in r else r.replace('-r ', '-r ~/') for r in requirements] + # TODO: add optional requirements.txt to additional files if missing and -r in requirements requirements_docker = list(map(lambda s: 'RUN ' + s, requirements)) requirements_docker = '\n'.join(requirements_docker) From 736f8409299fe519460ba6893fa4232ca3772ccd Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Mon, 15 Jan 2024 17:44:33 +0100 Subject: [PATCH 119/177] Updated additional files and include wildcards Signed-off-by: Benedikt Blumenstiel --- src/c3/create_operator.py | 112 +++++++++++------- src/c3/parser.py | 2 +- src/c3/templates/R_dockerfile_template | 6 +- src/c3/templates/kfp_component_template.yaml | 2 +- .../kubernetes_job_template.job.yaml | 2 +- src/c3/templates/python_dockerfile_template | 6 +- 6 files changed, 81 insertions(+), 49 deletions(-) diff --git a/src/c3/create_operator.py b/src/c3/create_operator.py index dbf60533..27524300 100644 --- a/src/c3/create_operator.py +++ b/src/c3/create_operator.py @@ -5,6 +5,8 @@ import shutil 
import argparse import subprocess +import glob +import re from pathlib import Path from string import Template from typing import Optional @@ -18,17 +20,31 @@ CLAIMED_VERSION = 'V0.1' -def create_dockerfile(dockerfile_template, requirements, target_code, additional_files_path, working_dir, command): - # Add missing home directory to the command `pip install -r ~/requirements.txt` - requirements = [r if '~/' in r else r.replace('-r ', '-r ~/') for r in requirements] - # TODO: add optional requirements.txt to additional files if missing and -r in requirements +def create_dockerfile(dockerfile_template, requirements, target_code, target_dir, additional_files, working_dir, command): + # Check for requirements file + for i in range(len(requirements)): + if '-r ' in requirements[i]: + r_file_search = re.search('-r ~?\/?([A-Za-z0-9\/]*\.txt)', requirements[i]) + if len(r_file_search.groups()): + # Get file from regex + requirements_file = r_file_search.groups()[0] + if requirements_file not in additional_files and os.path.isfile(requirements_file): + # Add missing requirements text file to additional files + additional_files.append(r_file_search.groups()[0]) + if '/' not in requirements[i]: + # Add missing home directory to the command `pip install -r ~/requirements.txt` + requirements[i] = requirements[i].replace('-r ', '-r ~/') + requirements_docker = list(map(lambda s: 'RUN ' + s, requirements)) requirements_docker = '\n'.join(requirements_docker) + additional_files_docker = list(map(lambda s: f"ADD {s} {working_dir}{s}", additional_files)) + additional_files_docker = '\n'.join(additional_files_docker) docker_file = dockerfile_template.substitute( requirements_docker=requirements_docker, target_code=target_code, - additional_files_path=additional_files_path, + target_dir=target_dir, + additional_files_docker=additional_files_docker, working_dir=working_dir, command=os.path.basename(command), ) @@ -36,9 +52,10 @@ def create_dockerfile(dockerfile_template, requirements, target_code, additional logging.info('Create Dockerfile') with open("Dockerfile", "w") as text_file: text_file.write(docker_file) + logging.debug('Dockerfile:\n' + docker_file) -def create_kfp_component(name, description, repository, version, command, target_code, file_path, inputs, outputs): +def create_kfp_component(name, description, repository, version, command, target_code, target_dir, file_path, inputs, outputs): inputs_list = str() for name, options in inputs.items(): @@ -70,7 +87,10 @@ def create_kfp_component(name, description, repository, version, command, target version=version, inputs=inputs_list, outputs=outputs_list, - call=f'{os.path.basename(command)} ./{target_code} {parameter_list}', + command=os.path.basename(command), + target_dir=target_dir, + target_code=target_code, + parameter_list=parameter_list, parameter_values=parameter_values, ) @@ -82,7 +102,7 @@ def create_kfp_component(name, description, repository, version, command, target text_file.write(yaml) -def create_kubernetes_job(name, repository, version, target_code, command, working_dir, file_path, inputs): +def create_kubernetes_job(name, repository, version, target_code, target_dir, command, working_dir, file_path, inputs): # get environment entries env_entries = str() for key in list(inputs.keys()): @@ -94,6 +114,7 @@ def create_kubernetes_job(name, repository, version, target_code, command, worki repository=repository, version=version, target_code=target_code, + target_dir=target_dir, env_entries=env_entries, command=command, 
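        # Illustration (assumed values, not patch code): with
        # command='/opt/app-root/bin/python', target_dir='ops/' and
        # target_code='claimed_my_op.py', the substituted job command renders
        # as ["/opt/app-root/bin/python", "ops/claimed_my_op.py"].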
working_dir=working_dir, @@ -189,13 +210,13 @@ def print_claimed_command(name, repository, version, inputs): logging.info(f'Run operators locally with claimed-cli:\n{claimed_command}') -def remove_temporary_files(file_path, target_code, additional_files_path): +def remove_temporary_files(file_path, target_code): logging.info(f'Remove local files') # remove temporary files if file_path != target_code: os.remove(target_code) - os.remove('Dockerfile') - shutil.rmtree(additional_files_path, ignore_errors=True) + if os.path.isfile('Dockerfile'): + os.remove('Dockerfile') def create_operator(file_path: str, @@ -222,20 +243,16 @@ def create_operator(file_path: str, working_dir = '/opt/app-root/src/' elif file_path.endswith('.py'): - target_code = file_path.split('/')[-1] - if file_path == target_code: - # use temp file for processing - target_code = 'claimed_' + target_code + # use temp file for processing + target_code = 'claimed_' + os.path.basename(file_path) # Copy file to current working directory shutil.copy(file_path, target_code) command = '/opt/app-root/bin/python' working_dir = '/opt/app-root/src/' elif file_path.lower().endswith('.r'): - target_code = file_path.split('/')[-1] - if file_path == target_code: - # use temp file for processing - target_code = 'claimed_' + target_code + # use temp file for processing + target_code = 'claimed_' + os.path.basename(file_path) # Copy file to current working directory shutil.copy(file_path, target_code) command = 'Rscript' @@ -278,27 +295,41 @@ def create_operator(file_path: str, # Strip 'claimed-' from name of copied temp file if name.startswith('claimed-'): name = name[8:] + target_dir = os.path.dirname(file_path) + # Check that the main file is within the cwd + if '../' in target_dir: + raise PermissionError(f"Forbidden path outside the docker build context: {target_dir}. " + f"Change the current working directory to include the file.") + elif target_dir != '': + target_dir += '/' logging.info('Operator name: ' + name) logging.info('Description:: ' + description) logging.info('Inputs: ' + str(inputs)) logging.info('Outputs: ' + str(outputs)) logging.info('Requirements: ' + str(requirements)) - - # copy all additional files to temporary folder - additional_files_path = 'additional_files_path' - while os.path.exists(additional_files_path): - # ensures using a new directory - additional_files_path += '_temp' - logging.debug(f'Create dir for additional files {additional_files_path}') - os.makedirs(additional_files_path) - for additional_file in additional_files: - assert os.path.isfile(additional_file), \ - f"Could not find file at {additional_file}. Please provide only files as additional parameters." - shutil.copy(additional_file, additional_files_path) - logging.info(f'Selected additional files: {os.listdir(additional_files_path)}') - - create_dockerfile(dockerfile_template, requirements, target_code, additional_files_path, working_dir, command) + logging.debug(f'Target code: {target_code}') + logging.debug(f'Target directory: {target_dir}') + + # Load all additional files + logging.debug('Looking for additional files:') + additional_files_found = [] + for file_pattern in additional_files: + if '../' in file_pattern: + # Check that additional file are within the cwd + raise PermissionError(f"Forbidden path outside the docker build context: {file_pattern}. 
" + f"Change the current working directory to include all additional files.") + # Include files based on wildcards + files_found = glob.glob(file_pattern) + if len(files_found) == 0: + raise FileNotFoundError(f'No additional files for path {file_pattern}.') + additional_files_found.extend(files_found) + logging.debug(f'Searched for "{file_pattern}". Found {", ".join(files_found)}') + logging.info(f'Found {len(additional_files_found)} additional files and directories:\n' + f'{", ".join(additional_files_found)}') + + create_dockerfile(dockerfile_template, requirements, target_code, target_dir, additional_files_found, working_dir, + command) if version is None: # auto increase version based on registered images @@ -323,7 +354,7 @@ def create_operator(file_path: str, stdout=None if log_level == 'DEBUG' else subprocess.PIPE, check=True, shell=True, ) except Exception as err: - remove_temporary_files(file_path, target_code, additional_files_path) + remove_temporary_files(file_path, target_code) raise err logging.info('Successfully built image') @@ -346,27 +377,28 @@ def create_operator(file_path: str, logging.info('Continue processing (test mode).') pass else: - remove_temporary_files(file_path, target_code, additional_files_path) + remove_temporary_files(file_path, target_code) raise err # Check for existing files and optionally modify them before overwriting try: check_existing_files(file_path, rename_files, overwrite_files) except Exception as err: - remove_temporary_files(file_path, target_code, additional_files_path) + remove_temporary_files(file_path, target_code) raise err # Create application scripts - create_kfp_component(name, description, repository, version, command, target_code, file_path, inputs, outputs) + create_kfp_component(name, description, repository, version, command, target_code, target_dir, file_path, inputs, + outputs) - create_kubernetes_job(name, repository, version, target_code, command, working_dir, file_path, inputs) + create_kubernetes_job(name, repository, version, target_code, target_dir, command, working_dir, file_path, inputs) create_cwl_component(name, repository, version, file_path, inputs, outputs) print_claimed_command(name, repository, version, inputs) # Remove temp files - remove_temporary_files(file_path, target_code, additional_files_path) + remove_temporary_files(file_path, target_code) def main(): diff --git a/src/c3/parser.py b/src/c3/parser.py index 0405eb22..524d5409 100644 --- a/src/c3/parser.py +++ b/src/c3/parser.py @@ -165,7 +165,7 @@ def parse(self, filepath: str) -> dict: if key == "inputs": default_value = match.group(2) if default_value: - # The default value match can end with a additional ', ", or ) which is removed + # The default value match can end with an additional ', ", or ) which is removed default_value = re.sub(r"['\")]?$", '', default_value, count=1) properties[key][match.group(1)] = default_value else: diff --git a/src/c3/templates/R_dockerfile_template b/src/c3/templates/R_dockerfile_template index a2f59e69..5d7d09e5 100644 --- a/src/c3/templates/R_dockerfile_template +++ b/src/c3/templates/R_dockerfile_template @@ -2,10 +2,10 @@ FROM r-base:4.3.2 USER root RUN apt update ${requirements_docker} -ADD ${target_code} ${working_dir} -ADD ${additional_files_path} ${working_dir} +ADD ${target_code} ${working_dir}${target_dir} +${additional_files_docker} RUN chmod -R 777 ${working_dir} RUN chmod -R 777 /usr/local/lib/R/ USER docker WORKDIR "${working_dir}" -CMD ["${command}", "${target_code}"] \ No newline at end of file +CMD 
["${command}", "${target_dir}${target_code}"] \ No newline at end of file diff --git a/src/c3/templates/kfp_component_template.yaml b/src/c3/templates/kfp_component_template.yaml index 2a2b108b..d5031586 100644 --- a/src/c3/templates/kfp_component_template.yaml +++ b/src/c3/templates/kfp_component_template.yaml @@ -14,5 +14,5 @@ implementation: - sh - -ec - | - ${call} + ${command} ./${target_dir}${target_code} ${parameter_list} ${parameter_values} \ No newline at end of file diff --git a/src/c3/templates/kubernetes_job_template.job.yaml b/src/c3/templates/kubernetes_job_template.job.yaml index c730e168..413c417d 100644 --- a/src/c3/templates/kubernetes_job_template.job.yaml +++ b/src/c3/templates/kubernetes_job_template.job.yaml @@ -9,7 +9,7 @@ spec: - name: ${name} image: ${repository}/claimed-${name}:${version} workingDir: ${working_dir} - command: ["${command}","${target_code}"] + command: ["${command}","${target_dir}${target_code}"] env: ${env_entries} restartPolicy: OnFailure diff --git a/src/c3/templates/python_dockerfile_template b/src/c3/templates/python_dockerfile_template index ec05fd2a..7c79ef70 100644 --- a/src/c3/templates/python_dockerfile_template +++ b/src/c3/templates/python_dockerfile_template @@ -1,10 +1,10 @@ FROM registry.access.redhat.com/ubi8/python-39 USER root -ADD ${target_code} ${working_dir} -ADD ${additional_files_path} ${working_dir} +ADD ${target_code} ${working_dir}${target_dir} +${additional_files_docker} RUN pip install ipython ${requirements_docker} RUN chmod -R 777 ${working_dir} USER default WORKDIR "${working_dir}" -CMD ["${command}", "${target_code}"] \ No newline at end of file +CMD ["${command}", "${target_dir}${target_code}"] \ No newline at end of file From 6205b4f5a632284857df791fd6b3e80496cf4765 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Mon, 15 Jan 2024 18:33:10 +0100 Subject: [PATCH 120/177] Minor fixes Signed-off-by: Benedikt Blumenstiel --- src/c3/create_operator.py | 13 ++++++------- src/c3/pythonscript.py | 4 ++-- 2 files changed, 8 insertions(+), 9 deletions(-) diff --git a/src/c3/create_operator.py b/src/c3/create_operator.py index 27524300..ed9d4e6a 100644 --- a/src/c3/create_operator.py +++ b/src/c3/create_operator.py @@ -173,14 +173,13 @@ def check_existing_files(file_path, rename_files, overwrite_files): if target_job_yaml_path.is_file(): if rename_files is None: # Ask user - rename_files = input(f'Found modified job.yaml at {target_job_yaml_path}. ' - f'C3 will rename the modified file to modified_{target_job_yaml_path.name}.\n' - f'ENTER to continue, write N for overwrite, ' - f'or provide a custom name for the modified file.') - if rename_files.lower() == 'n': + rename_files = input(f'\nFound a existing Kubernetes job file at {target_job_yaml_path}.\n' + f'ENTER to overwrite the file, write Y to rename the file to ' + f'modified_{target_job_yaml_path.name}, or provide a custom name:\n') + if rename_files.strip() == '': # Overwrite file return - elif rename_files.strip() == '': + elif rename_files.lower() == 'y': # Default file name new_file_name = 'modified_' + Path(file_path).name else: @@ -325,7 +324,7 @@ def create_operator(file_path: str, raise FileNotFoundError(f'No additional files for path {file_pattern}.') additional_files_found.extend(files_found) logging.debug(f'Searched for "{file_pattern}". 
Found {", ".join(files_found)}')
-    logging.info(f'Found {len(additional_files_found)} additional files and directories:\n'
+    logging.info(f'Found {len(additional_files_found)} additional files and directories\n'
                  f'{", ".join(additional_files_found)}')
 
     create_dockerfile(dockerfile_template, requirements, target_code, target_dir, additional_files_found, working_dir,
diff --git a/src/c3/pythonscript.py b/src/c3/pythonscript.py
index 05cabf34..7e4bc298 100644
--- a/src/c3/pythonscript.py
+++ b/src/c3/pythonscript.py
@@ -73,11 +73,11 @@ def get_requirements(self):
             requirements.append(line.replace('#', '').strip())
 
         # Add pip install
-        pattern = r"([ ]*pip[ ]*install[ ]*)(.[^#]*)"
+        pattern = r"^[# ]*(pip[ ]*install)[ ]*(.[^#]*)"
         for line in self.script.split('\n'):
             result = re.findall(pattern, line)
             if len(result) == 1:
-                requirements.append((result[0][0].strip() + ' ' + result[0][1].strip()))
+                requirements.append((result[0][0] + ' ' + result[0][1].strip()))
         return requirements

From dfaa1c94824b33459e70f4e7e0fc98ef9021b4dd Mon Sep 17 00:00:00 2001
From: Benedikt Blumenstiel
Date: Mon, 15 Jan 2024 18:42:36 +0100
Subject: [PATCH 121/177] Add DEBUG note

Signed-off-by: Benedikt Blumenstiel
---
 src/c3/create_operator.py | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/c3/create_operator.py b/src/c3/create_operator.py
index ed9d4e6a..6660d2c2 100644
--- a/src/c3/create_operator.py
+++ b/src/c3/create_operator.py
@@ -354,6 +354,7 @@ def create_operator(file_path: str,
         )
     except Exception as err:
         remove_temporary_files(file_path, target_code)
+        logging.error('Docker build failed. Consider running C3 with `--log_level DEBUG` to see the docker build logs.')
         raise err

From 76d5ef661f17bc7a467a6f74c6b377d9d90c4460 Mon Sep 17 00:00:00 2001
From: Benedikt Blumenstiel <64090593+blumenstiel@users.noreply.github.com>
Date: Thu, 25 Jan 2024 17:57:04 +0100
Subject: [PATCH 122/177] Update GettingStarted.md

Specified additional files
---
 GettingStarted.md | 20 +++++++-------------
 1 file changed, 7 insertions(+), 13 deletions(-)

diff --git a/GettingStarted.md b/GettingStarted.md
index baa1d4a5..95aba746 100644
--- a/GettingStarted.md
+++ b/GettingStarted.md
@@ -413,18 +413,13 @@ Your operator script has to follow certain requirements to be processed by C3. C
 
 - The operator name is the python file: `my_operator_name.py` -> `claimed-my-operator-name`
 - The operator description is the first doc string in the script: `"""Operator description"""`
-- The required pip packages are listed in comments starting with pip install: `# pip install `
+- The required pip packages are listed in comments starting with pip install: `# pip install ` or `# pip install -r ~/requirements.txt`
 - The interface is defined by environment variables `my_parameter = os.getenv('my_parameter')`.
 - You can cast a specific type by wrapping `os.getenv()` with `int()`, `float()`, `bool()`. The default type is string. Only these four types are currently supported. You can use `None` as a default value but not pass the `NoneType` via the `job.yaml`.
 - Output paths for KubeFlow can be defined with `os.environ['my_output_parameter'] = ...`. Note that operators cannot return values but always have to save outputs in files.
 
 You can optionally install future tools with `dnf` by adding a comment `# dnf `.
 
-If you want to install a `requirements.txt` file you need to consider two steps:
-Second, the Dockerfile is executed from root while the files are placed in the working directory. -Therefore, use the command `pip install -r /opt/app-root/src/requirements.txt`. - #### iPython notebooks - The operator name is the notebook file: `my_operator_name.ipynb` -> `claimed-my-operator-name` @@ -516,19 +511,18 @@ docker login -u -p / With a running Docker engine and your operator script matching the C3 requirements, you can execute the C3 compiler by running `create_operator.py`: ```sh -c3_create_operator.py ".py" "" "" --repository "/" +c3_create_operator --repository "/" ".py" "" "" ``` -The first positional argument is the path to the python script or the ipython notebook. Optional, you can provide additional files that are copied to the container images with in all following parameters. The additional files are placed within the same directory as the operator script. -C3 automatically increases the version of the container image (default: "0.1") but you can set the version with `--version` or `-v`. You need to provide the repository with `--repository` or `-r`. -If you don't have access to the repository, C3 still creates the docker image and the other files but the images is not pushed to the registry and cannot be used on clusters. +You need to provide the repository with `--repository` or `-r`. You can specify the version of the container image (default: "0.1") with `--version` or `-v`. +The first positional argument is the path to the python script or the ipython notebook. Optional, you can define additional files that are copied to the container images in the following positinal arguments. You can use wildcards for additional files. E.g., `*` would copy all files in the current directory to the container image. (Hidden files and directories must be specified. Be aware of `data/` folders and others before including all files.) View all arguments by running: ```sh c3_create_operator --help ``` -C3 generates the container image that is pushed to the registry, a `.yaml` file for KubeFlow, and a `.job.yaml` that can be directly used as described above. +C3 generates the container image that is pushed to the registry, a `.yaml` file for KubeFlow, a `.job.yaml` for Kubernetes, and a `.cwl` file for CWL. --- @@ -571,7 +565,7 @@ Note that the grid computing is currently not implemented for R scripts. The compilation is similar to an operator. Additionally, the name of the grid process is passed to `create_grid_wrapper.py` using `--process` or `-p`. ```sh -c3_create_gridwrapper ".py" "" "" --process "grid_process" -r "/" +c3_create_gridwrapper -r "/" --process "grid_process" ".py" "" "" ``` C3 also includes a grid computing pattern for Cloud Object Storage (COS). You can create a COS grid wrapper by adding a `--cos` flag. @@ -579,7 +573,7 @@ The COS grid wrapper downloads all files of a batch to local storage, compute th Note that the COS grid wrapper requires the file paths to include the batch id to be identified, see details in the next subsection. The created files include a `gw_.py` file that includes the generated code for the grid wrapper (`cgw_.py` for the COS version). -Similar to an operator, `gw_.yaml` and `gw_.job.yaml` are created. +Similar to an operator, `gw_.yaml`, `gw_.cwl`, and `gw_.job.yaml` are created. 
### 5.3 Apply grid wrappers From 302b167d9ac1e0d0546ef856d96f1d72b9f9a601 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel <64090593+blumenstiel@users.noreply.github.com> Date: Thu, 1 Feb 2024 12:17:35 +0100 Subject: [PATCH 123/177] Update GettingStarted.md --- GettingStarted.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/GettingStarted.md b/GettingStarted.md index 95aba746..009e3dfe 100644 --- a/GettingStarted.md +++ b/GettingStarted.md @@ -516,13 +516,15 @@ c3_create_operator --repository "/" ".p You need to provide the repository with `--repository` or `-r`. You can specify the version of the container image (default: "0.1") with `--version` or `-v`. The first positional argument is the path to the python script or the ipython notebook. Optional, you can define additional files that are copied to the container images in the following positinal arguments. You can use wildcards for additional files. E.g., `*` would copy all files in the current directory to the container image. (Hidden files and directories must be specified. Be aware of `data/` folders and others before including all files.) +Note,that the docker build messages are suppressed by default. If you want to display the docker logs, you can add `--log_level DEBUG`. View all arguments by running: ```sh c3_create_operator --help ``` -C3 generates the container image that is pushed to the registry, a `.yaml` file for KubeFlow, a `.job.yaml` for Kubernetes, and a `.cwl` file for CWL. +C3 generates the container image that is pushed to the registry, a `.yaml` file for KubeFlow, a `.job.yaml` for Kubernetes, and a `.cwl` file for CWL. + --- From 04b6d619b708bb8c0768bf2eb615277741d9efc0 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Thu, 1 Feb 2024 14:09:27 +0100 Subject: [PATCH 124/177] Added --skip-logging argument Signed-off-by: Benedikt Blumenstiel --- src/c3/create_gridwrapper.py | 14 +++++++++----- src/c3/create_operator.py | 16 ++++++++++++---- src/c3/templates/__init__.py | 6 +++--- ...ode.py => component_setup_code_wo_logging.py} | 0 4 files changed, 24 insertions(+), 12 deletions(-) rename src/c3/templates/{gw_component_setup_code.py => component_setup_code_wo_logging.py} (100%) diff --git a/src/c3/create_gridwrapper.py b/src/c3/create_gridwrapper.py index 5cbcbd9c..166e06e7 100644 --- a/src/c3/create_gridwrapper.py +++ b/src/c3/create_gridwrapper.py @@ -7,7 +7,7 @@ from c3.pythonscript import Pythonscript from c3.utils import convert_notebook from c3.create_operator import create_operator -from c3.templates import grid_wrapper_template, cos_grid_wrapper_template, gw_component_setup_code +from c3.templates import grid_wrapper_template, cos_grid_wrapper_template, component_setup_code_wo_logging def wrap_component(component_path, @@ -88,14 +88,14 @@ def edit_component_code(file_path): file_name = os.path.basename(file_path) else: # write edited code to different file - target_file = os.path.join(os.path.dirname(file_path), 'component_' + file_name) + target_file = os.path.join(os.path.dirname(file_path), 'component_' + file_name.replace('-', '_')) target_file_name = os.path.basename(target_file) with open(file_path, 'r') as f: script = f.read() # Add code for logging and cli parameters to the beginning of the script - script = gw_component_setup_code + script + script = component_setup_code_wo_logging + script # replace old filename with new file name script = script.replace(file_name, target_file_name) with open(target_file, 'w') as f: @@ -156,8 +156,11 @@ def main(): 
parser.add_argument('-l', '--log_level', type=str, default='INFO') parser.add_argument('--dockerfile_template_path', type=str, default='', help='Path to custom dockerfile template') - parser.add_argument('--test_mode', action='store_true') - parser.add_argument('--no-cache', action='store_true') + parser.add_argument('--test_mode', action='store_true', + help='Continue processing after docker errors.') + parser.add_argument('--no-cache', action='store_true', help='Not using cache for docker build.') + parser.add_argument('--skip-logging', action='store_true', + help='Exclude logging code from component setup code') args = parser.parse_args() # Init logging @@ -200,6 +203,7 @@ def main(): no_cache=args.no_cache, overwrite_files=args.overwrite, rename_files=args.rename, + skip_logging=args.skip_logging, ) logging.info('Remove local component file') diff --git a/src/c3/create_operator.py b/src/c3/create_operator.py index 6660d2c2..a970bbdd 100644 --- a/src/c3/create_operator.py +++ b/src/c3/create_operator.py @@ -13,7 +13,7 @@ from c3.pythonscript import Pythonscript from c3.rscript import Rscript from c3.utils import convert_notebook, get_image_version -from c3.templates import (python_component_setup_code, r_component_setup_code, +from c3.templates import (python_component_setup_code, component_setup_code_wo_logging, r_component_setup_code, python_dockerfile_template, r_dockerfile_template, kfp_component_template, kubernetes_job_template, cwl_component_template) @@ -228,6 +228,7 @@ def create_operator(file_path: str, no_cache=False, rename_files=None, overwrite_files=False, + skip_logging=False, ): logging.info('Parameters: ') logging.info('file_path: ' + file_path) @@ -263,7 +264,10 @@ def create_operator(file_path: str, # Add code for logging and cli parameters to the beginning of the script with open(target_code, 'r') as f: script = f.read() - script = python_component_setup_code + script + if skip_logging: + script = component_setup_code_wo_logging + script + else: + script = python_component_setup_code + script with open(target_code, 'w') as f: f.write(script) @@ -417,8 +421,11 @@ def main(): parser.add_argument('-l', '--log_level', type=str, default='INFO') parser.add_argument('--dockerfile_template_path', type=str, default='', help='Path to custom dockerfile template') - parser.add_argument('--test_mode', action='store_true') - parser.add_argument('--no-cache', action='store_true') + parser.add_argument('--test_mode', action='store_true', + help='Continue processing after docker errors.') + parser.add_argument('--no-cache', action='store_true', help='Not using cache for docker build.') + parser.add_argument('--skip-logging', action='store_true', + help='Exclude logging code from component setup code') args = parser.parse_args() # Init logging @@ -449,6 +456,7 @@ def main(): no_cache=args.no_cache, overwrite_files=args.overwrite, rename_files=args.rename, + skip_logging=args.skip_logging, ) diff --git a/src/c3/templates/__init__.py b/src/c3/templates/__init__.py index 1e276976..5ec64b6d 100644 --- a/src/c3/templates/__init__.py +++ b/src/c3/templates/__init__.py @@ -6,7 +6,7 @@ # template file names PYTHON_COMPONENT_SETUP_CODE = 'component_setup_code.py' R_COMPONENT_SETUP_CODE = 'component_setup_code.R' -GW_COMPONENT_SETUP_CODE = 'gw_component_setup_code.py' +PYTHON_COMPONENT_SETUP_CODE_WO_LOGGING = 'component_setup_code_wo_logging.py' PYTHON_DOCKERFILE_FILE = 'python_dockerfile_template' R_DOCKERFILE_FILE = 'R_dockerfile_template' KFP_COMPONENT_FILE = 'kfp_component_template.yaml' @@ 
-24,8 +24,8 @@ with open(template_path / R_COMPONENT_SETUP_CODE, 'r') as f: r_component_setup_code = f.read() -with open(template_path / GW_COMPONENT_SETUP_CODE, 'r') as f: - gw_component_setup_code = f.read() +with open(template_path / PYTHON_COMPONENT_SETUP_CODE_WO_LOGGING, 'r') as f: + component_setup_code_wo_logging = f.read() with open(template_path / PYTHON_DOCKERFILE_FILE, 'r') as f: python_dockerfile_template = Template(f.read()) diff --git a/src/c3/templates/gw_component_setup_code.py b/src/c3/templates/component_setup_code_wo_logging.py similarity index 100% rename from src/c3/templates/gw_component_setup_code.py rename to src/c3/templates/component_setup_code_wo_logging.py From bff4f9d1d5e84f2fd0fa58e25d3166f9c510940f Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Thu, 1 Feb 2024 14:09:50 +0100 Subject: [PATCH 125/177] Fixed cwl type error Signed-off-by: Benedikt Blumenstiel --- src/c3/create_operator.py | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/src/c3/create_operator.py b/src/c3/create_operator.py index a970bbdd..6bfb1cd6 100644 --- a/src/c3/create_operator.py +++ b/src/c3/create_operator.py @@ -129,12 +129,16 @@ def create_kubernetes_job(name, repository, version, target_code, target_dir, co def create_cwl_component(name, repository, version, file_path, inputs, outputs): + type_dict = {'String': 'string', 'Integer': 'int', 'Float': 'float', 'Boolean': 'bool'} # get environment entries i = 1 input_envs = str() for input, options in inputs.items(): i += 1 - input_envs += (f" {input}:\n type: string\n default: {options['default']}\n " + # Convert string default value to CWL types + default_value = options['default'] if options['type'] == 'String' and options['default'] != '"None"' \ + else options['default'].strip('"\'') + input_envs += (f" {input}:\n type: {type_dict[options['type']]}\n default: {default_value}\n " f"inputBinding:\n position: {i}\n prefix: --{input}\n") if len(outputs) == 0: From 2ecc666dcecacf9af7799ebd924a595d99807af0 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Thu, 1 Feb 2024 14:10:23 +0100 Subject: [PATCH 126/177] Minor fixes Signed-off-by: Benedikt Blumenstiel --- src/c3/utils.py | 2 +- tests/test_compiler.py | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/src/c3/utils.py b/src/c3/utils.py index 020506f2..b1cc4e99 100644 --- a/src/c3/utils.py +++ b/src/c3/utils.py @@ -32,7 +32,7 @@ def convert_notebook(path): # add import get_ipython code = 'from IPython import get_ipython \n' + code - py_path = path.split('/')[-1].replace('.ipynb', '.py') + py_path = path.split('/')[-1].replace('.ipynb', '.py').replace('-', '_') assert not os.path.exists(py_path), f"File {py_path} already exist. Cannot convert notebook." 
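    # Illustration (assumed example, not patch code): the added .replace('-', '_')
    # keeps the converted notebook importable, since dashes are invalid in Python
    # module names:
    #   'nb/my-op.ipynb'.split('/')[-1].replace('.ipynb', '.py').replace('-', '_')
    #   -> 'my_op.py'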
with open(py_path, 'w') as py_file: diff --git a/tests/test_compiler.py b/tests/test_compiler.py index 1018605d..4907cd1e 100644 --- a/tests/test_compiler.py +++ b/tests/test_compiler.py @@ -121,7 +121,7 @@ def test_create_operator( args: List, ): subprocess.run(['python', '../src/c3/create_operator.py', file_path, *args, '-r', repository, - '--test_mode', '-v', 'test', '--log_level', 'DEBUG'], + '--test_mode', '-v', 'test', '--log_level', 'DEBUG', '--overwrite'], check=True) file = Path(file_path) From 72b7cd73430e86a37b0f936c54aa753f2c7e1772 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Thu, 1 Feb 2024 15:09:32 +0100 Subject: [PATCH 127/177] Added ipython version for notebooks Signed-off-by: Benedikt Blumenstiel --- src/c3/create_operator.py | 56 ++++++++++++---------- src/c3/notebook.py | 98 +++++++++++++++++++++++++++++++++++++++ src/c3/parser.py | 33 +++++++------ 3 files changed, 146 insertions(+), 41 deletions(-) create mode 100644 src/c3/notebook.py diff --git a/src/c3/create_operator.py b/src/c3/create_operator.py index 6bfb1cd6..3c5d72a8 100644 --- a/src/c3/create_operator.py +++ b/src/c3/create_operator.py @@ -7,10 +7,12 @@ import subprocess import glob import re +import json from pathlib import Path from string import Template from typing import Optional from c3.pythonscript import Pythonscript +from c3.notebook import Notebook from c3.rscript import Rscript from c3.utils import convert_notebook, get_image_version from c3.templates import (python_component_setup_code, component_setup_code_wo_logging, r_component_setup_code, @@ -240,31 +242,11 @@ def create_operator(file_path: str, logging.info('version: ' + str(version)) logging.info('additional_files: ' + str(additional_files)) - if file_path.endswith('.ipynb'): - logging.info('Convert notebook to python script') - target_code = convert_notebook(file_path) - command = '/opt/app-root/bin/ipython' - working_dir = '/opt/app-root/src/' - - elif file_path.endswith('.py'): - # use temp file for processing - target_code = 'claimed_' + os.path.basename(file_path) - # Copy file to current working directory - shutil.copy(file_path, target_code) - command = '/opt/app-root/bin/python' - working_dir = '/opt/app-root/src/' - - elif file_path.lower().endswith('.r'): + if file_path.endswith('.py'): # use temp file for processing target_code = 'claimed_' + os.path.basename(file_path) # Copy file to current working directory shutil.copy(file_path, target_code) - command = 'Rscript' - working_dir = '/home/docker/' - else: - raise NotImplementedError('Please provide a file_path to a jupyter notebook, python script, or R script.') - - if target_code.endswith('.py'): # Add code for logging and cli parameters to the beginning of the script with open(target_code, 'r') as f: script = f.read() @@ -274,12 +256,36 @@ def create_operator(file_path: str, script = python_component_setup_code + script with open(target_code, 'w') as f: f.write(script) - # getting parameter from the script script_data = Pythonscript(target_code) dockerfile_template = custom_dockerfile_template or python_dockerfile_template + command = '/opt/app-root/bin/python' + working_dir = '/opt/app-root/src/' + + elif file_path.endswith('.ipynb'): + # use temp file for processing + target_code = 'claimed_' + os.path.basename(file_path) + # Copy file to current working directory + shutil.copy(file_path, target_code) + with open(target_code, 'r') as json_file: + notebook = json.load(json_file) + # Add code for logging and cli parameters to the beginning of the notebook + 
notebook['cells'].insert(0, { + 'cell_type': 'code', 'execution_count': None, 'metadata': {}, 'outputs': [], + 'source': component_setup_code_wo_logging if skip_logging else python_component_setup_code}) + with open(target_code, 'w') as json_file: + json.dump(notebook, json_file) + # getting parameter from the script + script_data = Notebook(target_code) + dockerfile_template = custom_dockerfile_template or python_dockerfile_template + command = '/opt/app-root/bin/ipython' + working_dir = '/opt/app-root/src/' - elif target_code.lower().endswith('.r'): + elif file_path.lower().endswith('.r'): + # use temp file for processing + target_code = 'claimed_' + os.path.basename(file_path) + # Copy file to current working directory + shutil.copy(file_path, target_code) # Add code for logging and cli parameters to the beginning of the script with open(target_code, 'r') as f: script = f.read() @@ -289,8 +295,10 @@ def create_operator(file_path: str, # getting parameter from the script script_data = Rscript(target_code) dockerfile_template = custom_dockerfile_template or r_dockerfile_template + command = 'Rscript' + working_dir = '/home/docker/' else: - raise NotImplementedError('C3 currently only supports jupyter notebooks, python scripts, and R scripts.') + raise NotImplementedError('Please provide a file_path to a jupyter notebook, python script, or R script.') name = script_data.get_name() # convert description into a string with a single line diff --git a/src/c3/notebook.py b/src/c3/notebook.py new file mode 100644 index 00000000..73c1602c --- /dev/null +++ b/src/c3/notebook.py @@ -0,0 +1,98 @@ +import json +import re +import os +import logging +from c3.parser import ContentParser, NotebookReader + + +class Notebook(): + def __init__(self, path): + self.path = path + with open(path) as json_file: + self.notebook = json.load(json_file) + + self.name = os.path.basename(path)[:-6].replace('_', '-').lower() + + if self.notebook['cells'][1]['cell_type'] == self.notebook['cells'][2]['cell_type'] == 'markdown': + # backwards compatibility (v0.1 description was included in second cell, merge first two markdown cells) + logging.info('Merge first two markdown cells for description. ' + 'The file name is used as the operator name, not the first markdown cell.') + self.description = self.notebook['cells'][1]['source'][0] + '\n' + self.notebook['cells'][2]['source'][0] + else: + # Using second cell because first cell was added for setup code + self.description = self.notebook['cells'][1]['source'][0] + + self.inputs = self._get_input_vars() + self.outputs = self._get_output_vars() + + def _get_input_vars(self): + cp = ContentParser() + env_names = cp.parse(self.path)['inputs'] + return_value = dict() + notebook_code_lines = list(NotebookReader(self.path).read_next_code_line()) + for env_name, default in env_names.items(): + comment_line = str() + for line in notebook_code_lines: + if re.search("[\"']" + env_name + "[\"']", line): + if not comment_line.strip().startswith('#'): + # previous line was no description, reset comment_line. 
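+                    # Illustration (assumed inputs, not patch code) of what the
+                    # cast-detection regexes match:
+                    #   "n = int(os.getenv('n', 1))"   -> r'=\s*int\(\s*os'   -> Integer
+                    #   "x = float(os.getenv('x'))"    -> r'=\s*float\(\s*os' -> Float
+                    #   "b = bool(os.getenv('b', 0))"  -> r'=\s*bool\(\s*os'  -> Boolean
+                    #   any other assignment is typed as String.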
+ comment_line = '' + if comment_line == '': + logging.info(f'Interface: No description for variable {env_name} provided.') + if re.search(r'=\s*int\(\s*os', line): + type = 'Integer' + elif re.search(r'=\s*float\(\s*os', line): + type = 'Float' + elif re.search(r'=\s*bool\(\s*os', line): + type = 'Boolean' + else: + type = 'String' + return_value[env_name] = { + 'description': comment_line.replace('#', '').replace("\"", "\'").strip(), + 'type': type, + 'default': default + } + break + comment_line = line + return return_value + + def _get_output_vars(self): + cp = ContentParser() + output_names = cp.parse(self.path)['outputs'] + # TODO: Does not check for description code + return_value = {name: { + 'description': f'Output path for {name}', + 'type': 'String', + } for name in output_names} + return return_value + + def get_requirements(self): + requirements = [] + notebook_code_lines = list(NotebookReader(self.path).read_next_code_line()) + # Add dnf install + for line in notebook_code_lines: + if re.search(r'[\s#]*dnf\s*.[^#]*', line): + if '-y' not in line: + # Adding default repo + line += ' -y' + requirements.append(line.replace('#', '').strip()) + + # Add pip install + pattern = r"^[# ]*(pip[ ]*install)[ ]*(.[^#]*)" + for line in notebook_code_lines: + result = re.findall(pattern, line) + if len(result) == 1: + requirements.append((result[0][0] + ' ' + result[0][1].strip())) + return requirements + + def get_name(self): + return self.name + + def get_description(self): + return self.description + + def get_inputs(self): + return self.inputs + + def get_outputs(self): + return self.outputs diff --git a/src/c3/parser.py b/src/c3/parser.py index 524d5409..1be4307d 100644 --- a/src/c3/parser.py +++ b/src/c3/parser.py @@ -49,14 +49,14 @@ def language(self) -> str: else: return None - def read_next_code_chunk(self) -> List[str]: + def read_next_code_line(self) -> List[str]: """ Implements a generator for lines of code in the specified filepath. Subclasses may override if explicit line-by-line parsing is not feasible, e.g. with Notebooks. 
""" with open(self._filepath) as f: for line in f: - yield [line.strip()] + yield line.strip() class NotebookReader(FileReader): @@ -79,10 +79,11 @@ def __init__(self, filepath: str): def language(self) -> str: return self._language - def read_next_code_chunk(self) -> List[str]: + def read_next_code_line(self) -> List[str]: for cell in self._notebook.cells: if cell.source and cell.cell_type == "code": - yield cell.source.split('\n') + for line in cell.source.split('\n'): + yield line class ScriptParser(): @@ -157,19 +158,17 @@ def parse(self, filepath: str) -> dict: if not parser: return properties - for chunk in reader.read_next_code_chunk(): - if chunk: - for line in chunk: - matches = parser.parse_environment_variables(line) - for key, match in matches: - if key == "inputs": - default_value = match.group(2) - if default_value: - # The default value match can end with an additional ', ", or ) which is removed - default_value = re.sub(r"['\")]?$", '', default_value, count=1) - properties[key][match.group(1)] = default_value - else: - properties[key].append(match.group(1)) + for line in reader.read_next_code_line(): + matches = parser.parse_environment_variables(line) + for key, match in matches: + if key == "inputs": + default_value = match.group(2) + if default_value: + # The default value match can end with an additional ', ", or ) which is removed + default_value = re.sub(r"['\")]?$", '', default_value, count=1) + properties[key][match.group(1)] = default_value + else: + properties[key].append(match.group(1)) return properties From a92900b3c0f877e518df1d3685b744ce4754277e Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Thu, 1 Feb 2024 15:15:27 +0100 Subject: [PATCH 128/177] Add grid_process default value Signed-off-by: Benedikt Blumenstiel --- src/c3/create_gridwrapper.py | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/src/c3/create_gridwrapper.py b/src/c3/create_gridwrapper.py index 166e06e7..1560036b 100644 --- a/src/c3/create_gridwrapper.py +++ b/src/c3/create_gridwrapper.py @@ -79,7 +79,7 @@ def get_component_elements(file_path): # Adding code -def edit_component_code(file_path): +def edit_component_code(file_path, component_process): file_name = os.path.basename(file_path) if file_path.endswith('.ipynb'): logging.info('Convert notebook to python script') @@ -94,6 +94,8 @@ def edit_component_code(file_path): with open(file_path, 'r') as f: script = f.read() + assert component_process in script, (f'Did not find the grid process {component_process} in the script. ' + f'Please provide the grid process in the arguments `-p `.') # Add code for logging and cli parameters to the beginning of the script script = component_setup_code_wo_logging + script # replace old filename with new file name @@ -114,7 +116,7 @@ def apply_grid_wrapper(file_path, component_process, cos): assert file_path.endswith('.py') or file_path.endswith('.ipynb'), \ "Please provide a component file path to a python script or notebook." 
- file_path = edit_component_code(file_path) + file_path = edit_component_code(file_path, component_process) description, interface, inputs, dependencies = get_component_elements(file_path) @@ -142,7 +144,7 @@ def main(): help='Path to python script or notebook') parser.add_argument('ADDITIONAL_FILES', type=str, nargs='*', help='List of paths to additional files to include in the container image') - parser.add_argument('-p', '--component_process', type=str, required=True, + parser.add_argument('-p', '--component_process', type=str, default='grid_process', help='Name of the component sub process that is executed for each batch.') parser.add_argument('--cos', action=argparse.BooleanOptionalAction, default=False, help='Creates a grid wrapper for processing COS files') From 8e2ab604f5ee2852f7f8a844e4996e3458ae118e Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Thu, 1 Feb 2024 15:42:25 +0100 Subject: [PATCH 129/177] Make repository optional and renamed test_mode to local_mode Signed-off-by: Benedikt Blumenstiel --- src/c3/create_gridwrapper.py | 61 ++++++++++++++-------------- src/c3/create_operator.py | 77 +++++++++++++++++++----------------- 2 files changed, 71 insertions(+), 67 deletions(-) diff --git a/src/c3/create_gridwrapper.py b/src/c3/create_gridwrapper.py index 1560036b..d91fcdbf 100644 --- a/src/c3/create_gridwrapper.py +++ b/src/c3/create_gridwrapper.py @@ -158,7 +158,7 @@ def main(): parser.add_argument('-l', '--log_level', type=str, default='INFO') parser.add_argument('--dockerfile_template_path', type=str, default='', help='Path to custom dockerfile template') - parser.add_argument('--test_mode', action='store_true', + parser.add_argument('--local_mode', action='store_true', help='Continue processing after docker errors.') parser.add_argument('--no-cache', action='store_true', help='Not using cache for docker build.') parser.add_argument('--skip-logging', action='store_true', @@ -180,36 +180,35 @@ def main(): cos=args.cos, ) - if args.repository is not None: - logging.info('Generate CLAIMED operator for grid wrapper') - - # Add component path and init file path to additional_files - args.ADDITIONAL_FILES.append(component_path) - - # Update dockerfile template if specified - if args.dockerfile_template_path != '': - logging.info(f'Uses custom dockerfile template from {args.dockerfile_template_path}') - with open(args.dockerfile_template_path, 'r') as f: - custom_dockerfile_template = Template(f.read()) - else: - custom_dockerfile_template = None - - create_operator( - file_path=grid_wrapper_file_path, - repository=args.repository, - version=args.version, - custom_dockerfile_template=custom_dockerfile_template, - additional_files=args.ADDITIONAL_FILES, - log_level=args.log_level, - test_mode=args.test_mode, - no_cache=args.no_cache, - overwrite_files=args.overwrite, - rename_files=args.rename, - skip_logging=args.skip_logging, - ) - - logging.info('Remove local component file') - os.remove(component_path) + logging.info('Generate CLAIMED operator for grid wrapper') + + # Add component path and init file path to additional_files + args.ADDITIONAL_FILES.append(component_path) + + # Update dockerfile template if specified + if args.dockerfile_template_path != '': + logging.info(f'Uses custom dockerfile template from {args.dockerfile_template_path}') + with open(args.dockerfile_template_path, 'r') as f: + custom_dockerfile_template = Template(f.read()) + else: + custom_dockerfile_template = None + + create_operator( + file_path=grid_wrapper_file_path, + 
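        # Illustration (not patch code): when args.repository is None,
        # create_operator now switches to local_mode automatically -- the image
        # is still built and tagged claimed-<name>:<version> locally, but
        # tagging for and pushing to a registry are skipped.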
repository=args.repository, + version=args.version, + custom_dockerfile_template=custom_dockerfile_template, + additional_files=args.ADDITIONAL_FILES, + log_level=args.log_level, + local_mode=args.local_mode, + no_cache=args.no_cache, + overwrite_files=args.overwrite, + rename_files=args.rename, + skip_logging=args.skip_logging, + ) + + logging.info('Remove local component file') + os.remove(component_path) if __name__ == '__main__': diff --git a/src/c3/create_operator.py b/src/c3/create_operator.py index 3c5d72a8..15a5344b 100644 --- a/src/c3/create_operator.py +++ b/src/c3/create_operator.py @@ -230,7 +230,7 @@ def create_operator(file_path: str, custom_dockerfile_template: Optional[Template], additional_files: str = None, log_level='INFO', - test_mode=False, + local_mode=False, no_cache=False, rename_files=None, overwrite_files=False, @@ -238,7 +238,7 @@ def create_operator(file_path: str, ): logging.info('Parameters: ') logging.info('file_path: ' + file_path) - logging.info('repository: ' + repository) + logging.info('repository: ' + str(repository)) logging.info('version: ' + str(version)) logging.info('additional_files: ' + str(additional_files)) @@ -350,6 +350,12 @@ def create_operator(file_path: str, # auto increase version based on registered images version = get_image_version(repository, name) + if repository is None: + if not local_mode: + logging.warning('No repository provided. The container image is only saved locally. Add `-r ` ' + 'to push the image to a container registry or run `--local_mode` to suppress this warning.') + local_mode = True + logging.info(f'Building container image claimed-{name}:{version}') try: # Run docker build @@ -357,42 +363,41 @@ def create_operator(file_path: str, f"docker build --platform linux/amd64 -t claimed-{name}:{version} . {'--no-cache' if no_cache else ''}", stdout=None if log_level == 'DEBUG' else subprocess.PIPE, check=True, shell=True ) - - # Run docker tag - logging.debug(f'Tagging images with "latest" and "{version}"') - subprocess.run( - f"docker tag claimed-{name}:{version} {repository}/claimed-{name}:{version}", - stdout=None if log_level == 'DEBUG' else subprocess.PIPE, check=True, shell=True, - ) - subprocess.run( - f"docker tag claimed-{name}:{version} {repository}/claimed-{name}:latest", - stdout=None if log_level == 'DEBUG' else subprocess.PIPE, check=True, shell=True, - ) + if repository is not None: + # Run docker tag + logging.debug(f'Tagging images with "latest" and "{version}"') + subprocess.run( + f"docker tag claimed-{name}:{version} {repository}/claimed-{name}:{version}", + stdout=None if log_level == 'DEBUG' else subprocess.PIPE, check=True, shell=True, + ) + subprocess.run( + f"docker tag claimed-{name}:{version} {repository}/claimed-{name}:latest", + stdout=None if log_level == 'DEBUG' else subprocess.PIPE, check=True, shell=True, + ) except Exception as err: remove_temporary_files(file_path, target_code) logging.error('Docker build failed. 
Consider running C3 with `--log_level DEBUG` to see the docker build logs.') raise err - logging.info('Successfully built image') + logging.info(f'Successfully built image claimed-{name}:{version}') - logging.info(f'Pushing images to registry {repository}') - try: - # Run docker push - subprocess.run( - f"docker push {repository}/claimed-{name}:latest", - stdout=None if log_level == 'DEBUG' else subprocess.PIPE, check=True, shell=True, - ) - subprocess.run( - f"docker push {repository}/claimed-{name}:{version}", - stdout=None if log_level == 'DEBUG' else subprocess.PIPE, check=True, shell=True, - ) - logging.info('Successfully pushed image to registry') - except Exception as err: - logging.error(f'Could not push images to namespace {repository}. ' - f'Please check if docker is logged in or select a namespace with access.') - if test_mode: - logging.info('Continue processing (test mode).') - pass - else: + if local_mode: + logging.info(f'No repository provided, skip docker push.') + else: + logging.info(f'Pushing images to registry {repository}') + try: + # Run docker push + subprocess.run( + f"docker push {repository}/claimed-{name}:latest", + stdout=None if log_level == 'DEBUG' else subprocess.PIPE, check=True, shell=True, + ) + subprocess.run( + f"docker push {repository}/claimed-{name}:{version}", + stdout=None if log_level == 'DEBUG' else subprocess.PIPE, check=True, shell=True, + ) + logging.info('Successfully pushed image to registry') + except Exception as err: + logging.error(f'Could not push images to namespace {repository}. ' + f'Please check if docker is logged in or select a namespace with access.') remove_temporary_files(file_path, target_code) raise err @@ -423,7 +428,7 @@ def main(): help='Path to python script or notebook') parser.add_argument('ADDITIONAL_FILES', type=str, nargs='*', help='Paths to additional files to include in the container image') - parser.add_argument('-r', '--repository', type=str, required=True, + parser.add_argument('-r', '--repository', type=str, default=None, help='Container registry address, e.g. docker.io/') parser.add_argument('-v', '--version', type=str, default=None, help='Container image version. 
Auto-increases the version number if not provided (default 0.1)') @@ -433,7 +438,7 @@ def main(): parser.add_argument('-l', '--log_level', type=str, default='INFO') parser.add_argument('--dockerfile_template_path', type=str, default='', help='Path to custom dockerfile template') - parser.add_argument('--test_mode', action='store_true', + parser.add_argument('--local_mode', action='store_true', help='Continue processing after docker errors.') parser.add_argument('--no-cache', action='store_true', help='Not using cache for docker build.') parser.add_argument('--skip-logging', action='store_true', @@ -464,7 +469,7 @@ def main(): custom_dockerfile_template=custom_dockerfile_template, additional_files=args.ADDITIONAL_FILES, log_level=args.log_level, - test_mode=args.test_mode, + local_mode=args.local_mode, no_cache=args.no_cache, overwrite_files=args.overwrite, rename_files=args.rename, From 3f6501df4c4c6c4b10ec9df95eac925a740db0f2 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Thu, 1 Feb 2024 16:16:43 +0100 Subject: [PATCH 130/177] Added --keep-generated-files Signed-off-by: Benedikt Blumenstiel --- src/c3/create_gridwrapper.py | 84 +++++++++++++++++++++--------------- src/c3/create_operator.py | 26 +++++++---- src/c3/utils.py | 4 ++ tests/test_compiler.py | 9 ++-- 4 files changed, 74 insertions(+), 49 deletions(-) diff --git a/src/c3/create_gridwrapper.py b/src/c3/create_gridwrapper.py index d91fcdbf..158fd8ff 100644 --- a/src/c3/create_gridwrapper.py +++ b/src/c3/create_gridwrapper.py @@ -163,6 +163,9 @@ def main(): parser.add_argument('--no-cache', action='store_true', help='Not using cache for docker build.') parser.add_argument('--skip-logging', action='store_true', help='Exclude logging code from component setup code') + parser.add_argument('--keep-generated-files', action='store_true', + help='Do not delete temporary generated files.') + args = parser.parse_args() # Init logging @@ -174,41 +177,52 @@ def main(): handler.setLevel(args.log_level) root.addHandler(handler) - grid_wrapper_file_path, component_path = apply_grid_wrapper( - file_path=args.FILE_PATH, - component_process=args.component_process, - cos=args.cos, - ) - - logging.info('Generate CLAIMED operator for grid wrapper') - - # Add component path and init file path to additional_files - args.ADDITIONAL_FILES.append(component_path) - - # Update dockerfile template if specified - if args.dockerfile_template_path != '': - logging.info(f'Uses custom dockerfile template from {args.dockerfile_template_path}') - with open(args.dockerfile_template_path, 'r') as f: - custom_dockerfile_template = Template(f.read()) - else: - custom_dockerfile_template = None - - create_operator( - file_path=grid_wrapper_file_path, - repository=args.repository, - version=args.version, - custom_dockerfile_template=custom_dockerfile_template, - additional_files=args.ADDITIONAL_FILES, - log_level=args.log_level, - local_mode=args.local_mode, - no_cache=args.no_cache, - overwrite_files=args.overwrite, - rename_files=args.rename, - skip_logging=args.skip_logging, - ) - - logging.info('Remove local component file') - os.remove(component_path) + grid_wrapper_file_path = component_path = '' + try: + grid_wrapper_file_path, component_path = apply_grid_wrapper( + file_path=args.FILE_PATH, + component_process=args.component_process, + cos=args.cos, + ) + + logging.info('Generate CLAIMED operator for grid wrapper') + + # Add component path and init file path to additional_files + args.ADDITIONAL_FILES.append(component_path) + + # Update dockerfile 
template if specified + if args.dockerfile_template_path != '': + logging.info(f'Uses custom dockerfile template from {args.dockerfile_template_path}') + with open(args.dockerfile_template_path, 'r') as f: + custom_dockerfile_template = Template(f.read()) + else: + custom_dockerfile_template = None + + create_operator( + file_path=grid_wrapper_file_path, + repository=args.repository, + version=args.version, + custom_dockerfile_template=custom_dockerfile_template, + additional_files=args.ADDITIONAL_FILES, + log_level=args.log_level, + local_mode=args.local_mode, + no_cache=args.no_cache, + overwrite_files=args.overwrite, + rename_files=args.rename, + skip_logging=args.skip_logging, + keep_generated_files=args.keep_generated_files, + ) + except Exception as err: + logging.error('Error while generating CLAIMED grid wrapper. ' + 'Consider using `--log_level DEBUG` and `--keep-generated-files` for debugging.') + raise err + finally: + if not args.keep_generated_files: + logging.info('Remove local component file and grid wrapper code.') + if os.path.isfile(grid_wrapper_file_path): + os.remove(grid_wrapper_file_path) + if os.path.isfile(component_path): + os.remove(component_path) if __name__ == '__main__': diff --git a/src/c3/create_operator.py b/src/c3/create_operator.py index 15a5344b..18df7fdd 100644 --- a/src/c3/create_operator.py +++ b/src/c3/create_operator.py @@ -235,12 +235,13 @@ def create_operator(file_path: str, rename_files=None, overwrite_files=False, skip_logging=False, + keep_generated_files=False, ): logging.info('Parameters: ') logging.info('file_path: ' + file_path) logging.info('repository: ' + str(repository)) logging.info('version: ' + str(version)) - logging.info('additional_files: ' + str(additional_files)) + logging.info('additional_files: ' + '; '.join(additional_files)) if file_path.endswith('.py'): # use temp file for processing @@ -319,10 +320,10 @@ def create_operator(file_path: str, target_dir += '/' logging.info('Operator name: ' + name) - logging.info('Description:: ' + description) - logging.info('Inputs: ' + str(inputs)) - logging.info('Outputs: ' + str(outputs)) - logging.info('Requirements: ' + str(requirements)) + logging.info('Description: ' + description) + logging.info('Inputs:\n' + ('\n'.join([f'{k}: {v}' for k, v in inputs.items()]))) + logging.info('Outputs:\n' + ('\n'.join([f'{k}: {v}' for k, v in outputs.items()]))) + logging.info('Requirements: ' + '; '.join(requirements)) logging.debug(f'Target code: {target_code}') logging.debug(f'Target directory: {target_dir}') @@ -375,8 +376,9 @@ def create_operator(file_path: str, stdout=None if log_level == 'DEBUG' else subprocess.PIPE, check=True, shell=True, ) except Exception as err: - remove_temporary_files(file_path, target_code) logging.error('Docker build failed. Consider running C3 with `--log_level DEBUG` to see the docker build logs.') + if not keep_generated_files: + remove_temporary_files(file_path, target_code) raise err logging.info(f'Successfully built image claimed-{name}:{version}') @@ -398,14 +400,16 @@ def create_operator(file_path: str, except Exception as err: logging.error(f'Could not push images to namespace {repository}. 
' f'Please check if docker is logged in or select a namespace with access.') - remove_temporary_files(file_path, target_code) + if not keep_generated_files: + remove_temporary_files(file_path, target_code) raise err # Check for existing files and optionally modify them before overwriting try: check_existing_files(file_path, rename_files, overwrite_files) except Exception as err: - remove_temporary_files(file_path, target_code) + if not keep_generated_files: + remove_temporary_files(file_path, target_code) raise err # Create application scripts @@ -419,7 +423,8 @@ def create_operator(file_path: str, print_claimed_command(name, repository, version, inputs) # Remove temp files - remove_temporary_files(file_path, target_code) + if not keep_generated_files: + remove_temporary_files(file_path, target_code) def main(): @@ -443,6 +448,8 @@ def main(): parser.add_argument('--no-cache', action='store_true', help='Not using cache for docker build.') parser.add_argument('--skip-logging', action='store_true', help='Exclude logging code from component setup code') + parser.add_argument('--keep-generated-files', action='store_true', + help='Do not delete temporary generated files.') args = parser.parse_args() # Init logging @@ -474,6 +481,7 @@ def main(): overwrite_files=args.overwrite, rename_files=args.rename, skip_logging=args.skip_logging, + keep_generated_files=args.keep_generated_files, ) diff --git a/src/c3/utils.py b/src/c3/utils.py index b1cc4e99..0bbe5442 100644 --- a/src/c3/utils.py +++ b/src/c3/utils.py @@ -107,6 +107,10 @@ def get_image_version(repository, name): Get current version of the image from the registry and increase the version by 1. Defaults to 0.1 if no image is found in the registry. """ + if repository is None: + logging.debug('Using 0.1 as local version.') + return '0.1' + logging.debug(f'Get image version from registry.') if 'docker.io' in repository: logging.debug('Get image tags from docker.') diff --git a/tests/test_compiler.py b/tests/test_compiler.py index 4907cd1e..29cdff69 100644 --- a/tests/test_compiler.py +++ b/tests/test_compiler.py @@ -121,7 +121,7 @@ def test_create_operator( args: List, ): subprocess.run(['python', '../src/c3/create_operator.py', file_path, *args, '-r', repository, - '--test_mode', '-v', 'test', '--log_level', 'DEBUG', '--overwrite'], + '--local_mode', '-v', 'test', '--log_level', 'DEBUG', '--overwrite'], check=True) file = Path(file_path) @@ -136,7 +136,7 @@ def test_create_operator( test_create_gridwrapper_input = [ ( TEST_SCRIPT_PATH, - DUMMY_REPO, + None, 'process', [TEST_NOTEBOOK_PATH], ), @@ -157,8 +157,8 @@ def test_create_gridwrapper( process: str, args: List, ): - subprocess.run(['python', '../src/c3/create_gridwrapper.py', file_path, *args, - '-r', repository, '-p', process, '--test_mode', '-v', 'test', '--log_level', 'DEBUG'], check=True) + subprocess.run(['python', '../src/c3/create_gridwrapper.py', file_path, *args, '--overwrite', + '-p', process, '--local_mode', '-v', 'test', '--log_level', 'DEBUG'], check=True) file = Path(file_path) gw_file = file.parent / f'gw_{file.stem}.py' @@ -166,7 +166,6 @@ def test_create_gridwrapper( gw_file.with_suffix('.yaml').unlink() gw_file.with_suffix('.job.yaml').unlink() gw_file.with_suffix('.cwl').unlink() - gw_file.unlink() image_name = f"{repository}/claimed-gw-{file_path.rsplit('.')[0].replace('_', '-')}:test" # TODO: Modify subprocess call to test grid wrapper # subprocess.run(['docker', 'run', image_name], check=True) From aad0e3727822f8344aee93a617f20622fcb4da81 Mon Sep 17 00:00:00 2001 
From: Romeo Kienzler Date: Tue, 13 Feb 2024 23:46:01 +0100 Subject: [PATCH 131/177] add nbformat requirement to python docker template --- src/c3/templates/python_dockerfile_template | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/src/c3/templates/python_dockerfile_template b/src/c3/templates/python_dockerfile_template index 7c79ef70..024986a0 100644 --- a/src/c3/templates/python_dockerfile_template +++ b/src/c3/templates/python_dockerfile_template @@ -2,9 +2,9 @@ FROM registry.access.redhat.com/ubi8/python-39 USER root ADD ${target_code} ${working_dir}${target_dir} ${additional_files_docker} -RUN pip install ipython +RUN pip install ipython nbformat ${requirements_docker} RUN chmod -R 777 ${working_dir} USER default WORKDIR "${working_dir}" -CMD ["${command}", "${target_dir}${target_code}"] \ No newline at end of file +CMD ["${command}", "${target_dir}${target_code}"] From 1d64c5a6e22b70a2e58c545f2194fa1439cc6585 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Tue, 20 Feb 2024 10:14:59 +0100 Subject: [PATCH 132/177] Fixed pip installs for notebook and minor changes Signed-off-by: Benedikt Blumenstiel --- src/c3/notebook.py | 4 ++-- src/c3/pythonscript.py | 4 ++-- src/c3/rscript.py | 2 +- src/c3/templates/python_dockerfile_template | 1 + 4 files changed, 6 insertions(+), 5 deletions(-) diff --git a/src/c3/notebook.py b/src/c3/notebook.py index 73c1602c..1a5a25bb 100644 --- a/src/c3/notebook.py +++ b/src/c3/notebook.py @@ -38,7 +38,7 @@ def _get_input_vars(self): # previous line was no description, reset comment_line. comment_line = '' if comment_line == '': - logging.info(f'Interface: No description for variable {env_name} provided.') + logging.debug(f'Interface: No description for variable {env_name} provided.') if re.search(r'=\s*int\(\s*os', line): type = 'Integer' elif re.search(r'=\s*float\(\s*os', line): @@ -78,7 +78,7 @@ def get_requirements(self): requirements.append(line.replace('#', '').strip()) # Add pip install - pattern = r"^[# ]*(pip[ ]*install)[ ]*(.[^#]*)" + pattern = r"^[# !]*(pip[ ]*install)[ ]*(.[^#]*)" for line in notebook_code_lines: result = re.findall(pattern, line) if len(result) == 1: diff --git a/src/c3/pythonscript.py b/src/c3/pythonscript.py index 7e4bc298..9deea18d 100644 --- a/src/c3/pythonscript.py +++ b/src/c3/pythonscript.py @@ -34,7 +34,7 @@ def _get_input_vars(self): # previous line was no description, reset comment_line. comment_line = '' if comment_line == '': - logging.info(f'Interface: No description for variable {env_name} provided.') + logging.debug(f'Interface: No description for variable {env_name} provided.') if re.search(r'=\s*int\(\s*os', line): type = 'Integer' elif re.search(r'=\s*float\(\s*os', line): @@ -73,7 +73,7 @@ def get_requirements(self): requirements.append(line.replace('#', '').strip()) # Add pip install - pattern = r"^[# ]*(pip[ ]*install)[ ]*(.[^#]*)" + pattern = r"^[# !]*(pip[ ]*install)[ ]*(.[^#]*)" for line in self.script.split('\n'): result = re.findall(pattern, line) if len(result) == 1: diff --git a/src/c3/rscript.py b/src/c3/rscript.py index 144ca15a..9e6cc93e 100644 --- a/src/c3/rscript.py +++ b/src/c3/rscript.py @@ -31,7 +31,7 @@ def _get_input_vars(self): # previous line was no description, reset comment_line. 
comment_line = '' if comment_line == '': - logging.info(f'Interface: No description for variable {env_name} provided.') + logging.debug(f'Interface: No description for variable {env_name} provided.') if re.search(r'=\s*as.numeric\(\s*os', line): type = 'Float' # double in R elif re.search(r'=\s*bool\(\s*os', line): diff --git a/src/c3/templates/python_dockerfile_template b/src/c3/templates/python_dockerfile_template index 7c79ef70..f8079a21 100644 --- a/src/c3/templates/python_dockerfile_template +++ b/src/c3/templates/python_dockerfile_template @@ -2,6 +2,7 @@ FROM registry.access.redhat.com/ubi8/python-39 USER root ADD ${target_code} ${working_dir}${target_dir} ${additional_files_docker} +RUN pip install --upgrade pip RUN pip install ipython ${requirements_docker} RUN chmod -R 777 ${working_dir} From dd01481ebe1aacfe91a68b03b34e6c81ea9fd1d2 Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Tue, 20 Feb 2024 23:12:47 +0100 Subject: [PATCH 133/177] add csv support --- src/c3/templates/grid_wrapper_template.py | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/src/c3/templates/grid_wrapper_template.py b/src/c3/templates/grid_wrapper_template.py index f4fd6188..d759bd1b 100644 --- a/src/c3/templates/grid_wrapper_template.py +++ b/src/c3/templates/grid_wrapper_template.py @@ -14,6 +14,7 @@ import time import glob from pathlib import Path +import pandas as pd # import component code from ${component_name} import * @@ -49,6 +50,12 @@ def load_batches_from_file(batch_file): batch_dict = json.load(f) batches = batch_dict.keys() + elif batch_file.endswith('.csv'): + # load batches from keys of a csv file + logging.info(f'Loading batches from csv file: {batch_file}') + df = pd.read_csv(batch_file, header='infer') + batches = df['filename'].to_list() + else: # Load batches from comma-separated txt file logging.info(f'Loading comma-separated batch strings from file: {batch_file}') From 42874965705f5ab92ac238137edf35323cab2ab7 Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Thu, 22 Feb 2024 13:39:37 +0100 Subject: [PATCH 134/177] add pandas as requirements --- pyproject.toml | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/pyproject.toml b/pyproject.toml index e5778350..8ae226eb 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -17,7 +17,7 @@ maintainers = [ { name="Romeo Kienzler", email="claimed-framework@proton.me"}, { name="Benedikt Blumenstiel"}, ] -description = "The CLAIMED component compiler (C3) generates container images, KFP components, and Kubernetes jobs." 
+description = "The CLAIMED component compiler (C3) generates container images, KFP components, Kubernetes jobs, CWL tasks, and CLI applications."
 readme = "README.md"
 requires-python = ">=3.7"
 license = {file = "LICENSE.txt"}
@@ -31,6 +31,7 @@ dependencies = [
     'nbconvert >= 7.9.2',
     'ipython >= 8.16.1',
     'traitlets >= 5.11.2',
+    'pandas',
 ]

 [project.urls]

From 7f5544678bad0e9bab9b4ab346cd8eac6e5815db Mon Sep 17 00:00:00 2001
From: Romeo Kienzler
Date: Thu, 22 Feb 2024 15:53:33 +0100
Subject: [PATCH 135/177] add csv support, move to cos connection strings

---
 src/c3/templates/cos_grid_wrapper_template.py | 92 ++++++++++---------
 1 file changed, 51 insertions(+), 41 deletions(-)

diff --git a/src/c3/templates/cos_grid_wrapper_template.py b/src/c3/templates/cos_grid_wrapper_template.py
index 83281a53..20b62c2b 100644
--- a/src/c3/templates/cos_grid_wrapper_template.py
+++ b/src/c3/templates/cos_grid_wrapper_template.py
@@ -18,6 +18,8 @@
 import s3fs
 from datetime import datetime
 from pathlib import Path
+import pandas as pd
+

 # import component code
 from ${component_name} import *

@@ -37,36 +39,34 @@
 # upload local target files to target cos path
 gw_local_target_path = os.environ.get('gw_local_target_path', 'target')

-# cos source_access_key_id
-gw_source_access_key_id = os.environ.get('gw_source_access_key_id')
-# cos source_secret_access_key
-gw_source_secret_access_key = os.environ.get('gw_source_secret_access_key')
-# cos source_endpoint
-gw_source_endpoint = os.environ.get('gw_source_endpoint')
-# cos source_bucket
-gw_source_bucket = os.environ.get('gw_source_bucket')
-
-# cos target_access_key_id (uses source s3 if not provided)
-gw_target_access_key_id = os.environ.get('gw_target_access_key_id', None)
-# cos target_secret_access_key (uses source s3 if not provided)
-gw_target_secret_access_key = os.environ.get('gw_target_secret_access_key', None)
-# cos target_endpoint (uses source s3 if not provided)
-gw_target_endpoint = os.environ.get('gw_target_endpoint', None)
-# cos target_bucket (uses source s3 if not provided)
-gw_target_bucket = os.environ.get('gw_target_bucket', None)
-# cos target_path
-gw_target_path = os.environ.get('gw_target_path')
-
-# cos coordinator_access_key_id (uses source s3 if not provided)
-gw_coordinator_access_key_id = os.environ.get('gw_coordinator_access_key_id', None)
-# cos coordinator_secret_access_key (uses source s3 if not provided)
-gw_coordinator_secret_access_key = os.environ.get('gw_coordinator_secret_access_key', None)
-# cos coordinator_endpoint (uses source s3 if not provided)
-gw_coordinator_endpoint = os.environ.get('gw_coordinator_endpoint', None)
-# cos coordinator_bucket (uses source s3 if not provided)
-gw_coordinator_bucket = os.environ.get('gw_coordinator_bucket', None)
-# cos path to grid wrapper coordinator directory
-gw_coordinator_path = os.environ.get('gw_coordinator_path')
+
+def explode_connection_string(cs):
+    if cs is None:
+        return (None, None, None, None)
+    if cs.startswith('cos') or cs.startswith('s3'):
+        buffer = cs.split('://')[1]
+        access_key_id = buffer.split('@')[0].split(':')[0]
+        secret_access_key = buffer.split('@')[0].split(':')[1]
+        endpoint = buffer.split('@')[1].split('/')[0]
+        path = '/'.join(buffer.split('@')[1].split('/')[1:])
+        return (access_key_id, secret_access_key, endpoint, path)
+    else:
+        return (None, None, None, cs)
+    # TODO: consider treating cs as a secret name and grab the connection string from kubernetes
+
+
+# cos gw_source_connection
+gw_source_connection = os.environ.get('gw_source_connection')
+(gw_source_access_key_id, gw_source_secret_access_key, gw_source_endpoint, gw_source_path) = explode_connection_string(gw_source_connection)
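+# example (illustration only, values made up): 's3://my_key:my_secret@s3.eu-de.example.com/my-bucket/prefix'
+# explodes into access_key_id='my_key', secret_access_key='my_secret',
+# endpoint='s3.eu-de.example.com', path='my-bucket/prefix'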
+
+# cos gw_target_connection
+gw_target_connection = os.environ.get('gw_target_connection')
+(gw_target_access_key_id, gw_target_secret_access_key, gw_target_endpoint, gw_target_path) = explode_connection_string(gw_target_connection)
+
+# cos gw_coordinator_connection
+
+gw_coordinator_connection = os.environ.get('gw_coordinator_connection')
+(gw_coordinator_access_key_id, gw_coordinator_secret_access_key, gw_coordinator_endpoint, gw_coordinator_path) = explode_connection_string(gw_coordinator_connection)
+
 # lock file suffix
 gw_lock_file_suffix = os.environ.get('gw_lock_file_suffix', '.lock')
 # processed file suffix
@@ -78,6 +78,9 @@
 # ignore error files and rerun batches with errors
 gw_ignore_error_files = bool(os.environ.get('gw_ignore_error_files', False))

+# maximal wait time in seconds for the staggered start
+gw_max_time_wait_staggering = int(os.environ.get('gw_max_time_wait_staggering', 60))
+
 # component interface
 ${component_interface}

@@ -89,18 +92,18 @@
     secret=gw_source_secret_access_key,
     client_kwargs={'endpoint_url': gw_source_endpoint})

-if gw_target_endpoint is not None:
+if gw_target_connection is not None:
     s3target = s3fs.S3FileSystem(
         anon=False,
         key=gw_target_access_key_id,
         secret=gw_target_secret_access_key,
         client_kwargs={'endpoint_url': gw_target_endpoint})
 else:
-    logging.debug('Using source bucket as target bucket.')
-    gw_target_bucket = gw_source_bucket
+    logging.debug('Using source path as target path.')
+    gw_target_path = gw_source_path
     s3target = s3source

-if gw_coordinator_bucket is not None:
+if gw_coordinator_connection is not None:
     s3coordinator = s3fs.S3FileSystem(
         anon=False,
         key=gw_coordinator_access_key_id,
@@ -108,21 +111,28 @@
         client_kwargs={'endpoint_url': gw_coordinator_endpoint})
 else:
     logging.debug('Using source bucket as coordinator bucket.')
-    gw_coordinator_bucket = gw_source_bucket
+    gw_coordinator_path = gw_source_path
     s3coordinator = s3source

 def load_batches_from_file(batch_file):
     if batch_file.endswith('.json'):
         # load batches from keys of a json file
         logging.info(f'Loading batches from json file: {batch_file}')
-        with s3source.open(Path(gw_source_bucket) / batch_file, 'r') as f:
+        with s3source.open(Path(gw_source_path) / batch_file, 'r') as f:
             batch_dict = json.load(f)
         batches = batch_dict.keys()

+    elif batch_file.endswith('.csv'):
+        # load batches from the 'filename' column of a csv file
+        logging.info(f'Loading batches from csv file: {batch_file}')
+        s3source.get(str(Path(gw_source_path) / batch_file), batch_file)
+        df = pd.read_csv(batch_file, header='infer')
+        batches = df['filename'].to_list()
+
     else:
         # Load batches from comma-separated txt file
         logging.info(f'Loading comma-separated batch strings from file: {batch_file}')
-        with s3source.open(Path(gw_source_bucket) / batch_file, 'r') as f:
+        with s3source.open(Path(gw_source_path) / batch_file, 'r') as f:
             batch_string = f.read()
         batches = [b.strip() for b in batch_string.split(',')]

@@ -139,7 +149,7 @@ def get_files_from_pattern(file_path_patterns):
     # Iterate over comma-separated paths
     for file_path_pattern in file_path_patterns.split(','):
         logging.info(f'Get file paths from pattern: {file_path_pattern}')
-        files = s3source.glob(str(Path(gw_source_bucket) / file_path_pattern.strip()))
+        files = s3source.glob(str(Path(gw_source_path) / file_path_pattern.strip()))
         assert len(files) > 0, f"Found no files with file_path_pattern {file_path_pattern}."
        all_files.extend(files)

     logging.info(f'Found {len(all_files)} cos files')
@@ -165,7 +175,7 @@ def identify_batches_from_pattern(file_path_patterns, group_by):
 def perform_process(process, batch, cos_files):
     logging.debug(f'Check coordinator files for batch {batch}.')
     # init coordinator files
-    coordinator_dir = Path(gw_coordinator_bucket) / gw_coordinator_path
+    coordinator_dir = Path(gw_coordinator_path)
     lock_file = str(coordinator_dir / (batch + gw_lock_file_suffix))
     processed_file = str(coordinator_dir / (batch + gw_processed_file_suffix))
     error_file = str(coordinator_dir / (batch + gw_error_file_suffix))
@@ -244,7 +254,7 @@ def perform_process(process, batch, cos_files):
         local_target_files = list(target_path.glob('*'))
         logging.info(f'Uploading {len(local_target_files)} target files to COS.')
         for local_file in local_target_files:
-            cos_file = Path(gw_target_bucket) / gw_target_path / local_file.relative_to(target_path)
+            cos_file = Path(gw_target_path) / local_file.relative_to(target_path)
             logging.debug(f'Uploading {local_file} to {cos_file}')
             s3target.put(str(local_file), str(cos_file))
@@ -268,7 +278,7 @@ def process_wrapper(sub_process):
         time.sleep(delay)

     # Init coordinator dir
-    coordinator_dir = Path(gw_coordinator_bucket) / gw_coordinator_path
+    coordinator_dir = Path(gw_coordinator_path)
     s3coordinator.makedirs(coordinator_dir, exist_ok=True)

     # get batches
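For illustration only (all endpoint, bucket, and credential values are made up): with the connection-string scheme introduced in this patch, a cos grid wrapper run could be configured through environment variables along these lines:

    import os

    # scheme: [cos|s3]://access_key_id:secret_access_key@endpoint/path
    os.environ['gw_source_connection'] = 's3://key:secret@s3.example.com/source-bucket/input'
    os.environ['gw_target_connection'] = 's3://key:secret@s3.example.com/target-bucket/output'
    os.environ['gw_coordinator_connection'] = 's3://key:secret@s3.example.com/target-bucket/coordinator'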
From 96950e17a879506e46a44956c30310b261d63918 Mon Sep 17 00:00:00 2001
From: Romeo Kienzler
Date: Thu, 22 Feb 2024 23:02:22 +0100
Subject: [PATCH 139/177] Update cos_grid_wrapper_template.py

Signed-off-by: Romeo Kienzler
---
 src/c3/templates/cos_grid_wrapper_template.py | 1 -
 1 file changed, 1 deletion(-)

diff --git a/src/c3/templates/cos_grid_wrapper_template.py b/src/c3/templates/cos_grid_wrapper_template.py
index 20b62c2b..c7f9966f 100644
--- a/src/c3/templates/cos_grid_wrapper_template.py
+++ b/src/c3/templates/cos_grid_wrapper_template.py
@@ -63,7 +63,6 @@
 (gw_target_access_key_id, gw_target_secret_access_key, gw_target_endpoint, gw_target_path) = explode_connection_string(gw_target_connection)

 # cos gw_coordinator_connection
-
 gw_coordinator_connection = os.environ.get('gw_coordinator_connection')
 (gw_coordinator_access_key_id, gw_coordinator_secret_access_key, gw_coordinator_endpoint, gw_coordinator_path) = explode_connection_string(gw_coordinator_connection)

From 48f8aa58996187e337d89f9409594c6674476c3b Mon Sep 17 00:00:00 2001
From: Romeo Kienzler
Date: Thu, 22 Feb 2024 23:03:13 +0100
Subject: [PATCH 140/177] Update __init__.py

Signed-off-by: Romeo Kienzler
---
 src/c3/templates/__init__.py | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/src/c3/templates/__init__.py b/src/c3/templates/__init__.py
index 5ec64b6d..85394c29 100644
--- a/src/c3/templates/__init__.py
+++ b/src/c3/templates/__init__.py
@@ -14,6 +14,7 @@
 CWL_COMPONENT_FILE = 'cwl_component_template.cwl'
 GRID_WRAPPER_FILE = 'grid_wrapper_template.py'
 COS_GRID_WRAPPER_FILE = 'cos_grid_wrapper_template.py'
+S3KV_GRID_WRAPPER_FILE = 's3kv_grid_wrapper_template.py'

 # load templates
 template_path = Path(os.path.dirname(__file__))
@@ -47,3 +48,6 @@

 with open(template_path / COS_GRID_WRAPPER_FILE, 'r') as f:
     cos_grid_wrapper_template = Template(f.read())
+
+with open(template_path / S3KV_GRID_WRAPPER_FILE, 'r') as f:
+    s3kv_grid_wrapper_template = Template(f.read())

From e0875f29cae4a6f79e5cbed2d25e90b70f798879 Mon Sep 17 00:00:00 2001
From: Romeo Kienzler
Date: Thu, 22 Feb 2024 23:03:43 +0100
Subject: [PATCH 141/177] Update grid_wrapper_template.py

Signed-off-by: Romeo Kienzler
---
 src/c3/templates/grid_wrapper_template.py | 88 ++++++++++++++++++-----
 1 file changed, 69 insertions(+), 19 deletions(-)

diff --git a/src/c3/templates/grid_wrapper_template.py b/src/c3/templates/grid_wrapper_template.py
index d759bd1b..71a0ae33 100644
--- a/src/c3/templates/grid_wrapper_template.py
+++ b/src/c3/templates/grid_wrapper_template.py
@@ -15,13 +15,34 @@
 import glob
 from pathlib import Path
 import pandas as pd
+import s3fs
+

 # import component code
 from ${component_name} import *


-# File with batches. Provided as a comma-separated list of strings or keys in a json dict.
+def explode_connection_string(cs):
+    if cs is None:
+        return (None, None, None, None)
+    if cs.startswith('cos') or cs.startswith('s3'):
+        buffer = cs.split('://')[1]
+        access_key_id = buffer.split('@')[0].split(':')[0]
+        secret_access_key = buffer.split('@')[0].split(':')[1]
+        endpoint = buffer.split('@')[1].split('/')[0]
+        path = '/'.join(buffer.split('@')[1].split('/')[1:])
+        return (access_key_id, secret_access_key, endpoint, path)
+    else:
+        return (None, None, None, cs)
+    # TODO: consider treating cs as a secret name and grab the connection string from kubernetes
+
+
+# File with batches. Provided as a comma-separated list of strings, keys in a json dict, or a single-column csv with a 'filename' header. Either a local path or a connection string like [cos|s3]://access_key_id:secret_access_key@endpoint/path
 gw_batch_file = os.environ.get('gw_batch_file', None)
+(gw_batch_file_access_key_id, gw_batch_secret_access_key, gw_batch_endpoint, gw_batch_file) = explode_connection_string(gw_batch_file)
+

 # file path pattern like your/path/**/*.tif. Multiple patterns can be separated with commas. Is ignored if gw_batch_file is provided.
 gw_file_path_pattern = os.environ.get('gw_file_path_pattern', None)
 # pattern for grouping file paths into batches like ".split('.')[-1]". Is ignored if gw_batch_file is provided.
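For illustration, the three batch file formats this template accepts (contents made up):

    batches.json    {"batch_1": {}, "batch_2": {}}         -> batches are the json keys
    batches.csv     a 'filename' column, one batch per row  -> batches are the column values
    batches.txt     batch_1, batch_2, batch_3               -> comma-separated strings

The gw_group_by pattern is appended to str(path) and evaluated, so e.g. ".split('/')[-1]" would group files by their basename.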
@@ -43,25 +64,54 @@ ${component_interface} def load_batches_from_file(batch_file): - if batch_file.endswith('.json'): - # load batches from keys of a json file - logging.info(f'Loading batches from json file: {batch_file}') - with open(batch_file, 'r') as f: - batch_dict = json.load(f) - batches = batch_dict.keys() - - elif batch_file.endswith('.csv'): - # load batches from keys of a csv file - logging.info(f'Loading batches from csv file: {batch_file}') - df = pd.read_csv(batch_file, header='infer') - batches = df['filename'].to_list() + if gw_batch_file_access_key_id is not None: + s3source = s3fs.S3FileSystem( + anon=False, + key=gw_batch_file_access_key_id, + secret=gw_batch_secret_access_key, + client_kwargs={'endpoint_url': gw_batch_endpoint}) + + + if batch_file.endswith('.json'): + # load batches from keys of a json file + logging.info(f'Loading batches from json file: {batch_file}') + with s3source.open(gw_batch_file, 'r') as f: + batch_dict = json.load(f) + batches = batch_dict.keys() + + elif batch_file.endswith('.csv'): + # load batches from keys of a csv file + logging.info(f'Loading batches from csv file: {batch_file}') + s3source.get(batch_file, batch_file) + df = pd.read_csv(batch_file, header='infer') + batches = df['filename'].to_list() + else: + # Load batches from comma-separated txt file + logging.info(f'Loading comma-separated batch strings from file: {batch_file}') + with s3source.open(gw_batch_file, 'r') as f: + batch_string = f.read() + batches = [b.strip() for b in batch_string.split(',')] else: - # Load batches from comma-separated txt file - logging.info(f'Loading comma-separated batch strings from file: {batch_file}') - with open(batch_file, 'r') as f: - batch_string = f.read() - batches = [b.strip() for b in batch_string.split(',')] + if batch_file.endswith('.json'): + # load batches from keys of a json file + logging.info(f'Loading batches from json file: {batch_file}') + with open(batch_file, 'r') as f: + batch_dict = json.load(f) + batches = batch_dict.keys() + + elif batch_file.endswith('.csv'): + # load batches from keys of a csv file + logging.info(f'Loading batches from csv file: {batch_file}') + df = pd.read_csv(batch_file, header='infer') + batches = df['filename'].to_list() + + else: + # Load batches from comma-separated txt file + logging.info(f'Loading comma-separated batch strings from file: {batch_file}') + with open(batch_file, 'r') as f: + batch_string = f.read() + batches = [b.strip() for b in batch_string.split(',')] logging.info(f'Loaded {len(batches)} batches') logging.debug(f'List of batches: {batches}') @@ -198,4 +248,4 @@ def process_wrapper(sub_process): if __name__ == '__main__': - process_wrapper(${component_process}) + process_wrapper(${component_process}) From 5678f7514066403f8ccfe592ce38db550e5462bb Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Thu, 22 Feb 2024 23:05:05 +0100 Subject: [PATCH 142/177] Create s3kv_grid_wrapper_template.py Signed-off-by: Romeo Kienzler --- .../templates/s3kv_grid_wrapper_template.py | 251 ++++++++++++++++++ 1 file changed, 251 insertions(+) create mode 100644 src/c3/templates/s3kv_grid_wrapper_template.py diff --git a/src/c3/templates/s3kv_grid_wrapper_template.py b/src/c3/templates/s3kv_grid_wrapper_template.py new file mode 100644 index 00000000..71a0ae33 --- /dev/null +++ b/src/c3/templates/s3kv_grid_wrapper_template.py @@ -0,0 +1,251 @@ +""" +${component_name} got wrapped by grid_wrapper, which wraps any CLAIMED component and implements the generic grid computing pattern 
https://romeokienzler.medium.com/the-generic-grid-computing-pattern-transforms-any-sequential-workflow-step-into-a-transient-grid-c7f3ca7459c8
+
+CLAIMED component description: ${component_description}
+"""
+
+# component dependencies
+# ${component_dependencies}
+
+import os
+import json
+import random
+import logging
+import time
+import glob
+from pathlib import Path
+import pandas as pd
+import s3fs
+
+
+# import component code
+from ${component_name} import *
+
+
+def explode_connection_string(cs):
+    if cs is None:
+        return (None, None, None, None)
+    if cs.startswith('cos') or cs.startswith('s3'):
+        buffer = cs.split('://')[1]
+        access_key_id = buffer.split('@')[0].split(':')[0]
+        secret_access_key = buffer.split('@')[0].split(':')[1]
+        endpoint = buffer.split('@')[1].split('/')[0]
+        path = '/'.join(buffer.split('@')[1].split('/')[1:])
+        return (access_key_id, secret_access_key, endpoint, path)
+    else:
+        return (None, None, None, cs)
+    # TODO: consider treating cs as a secret name and grab the connection string from kubernetes
+
+
+# File with batches. Provided as a comma-separated list of strings, keys in a json dict, or a single-column csv with a 'filename' header. Either a local path or a connection string like [cos|s3]://access_key_id:secret_access_key@endpoint/path
+gw_batch_file = os.environ.get('gw_batch_file', None)
+(gw_batch_file_access_key_id, gw_batch_secret_access_key, gw_batch_endpoint, gw_batch_file) = explode_connection_string(gw_batch_file)
+
+
+# file path pattern like your/path/**/*.tif. Multiple patterns can be separated with commas. Is ignored if gw_batch_file is provided.
+gw_file_path_pattern = os.environ.get('gw_file_path_pattern', None)
+# pattern for grouping file paths into batches like ".split('.')[-1]". Is ignored if gw_batch_file is provided.
+gw_group_by = os.environ.get('gw_group_by', None)
+# path to grid wrapper coordinator directory
+gw_coordinator_path = os.environ.get('gw_coordinator_path')
+# lock file suffix
+gw_lock_file_suffix = os.environ.get('gw_lock_file_suffix', '.lock')
+# processed file suffix
+gw_processed_file_suffix = os.environ.get('gw_processed_file_suffix', '.processed')
+# error file suffix
+gw_error_file_suffix = os.environ.get('gw_error_file_suffix', '.err')
+# timeout in seconds to remove lock file from struggling job (default 3 hours)
+gw_lock_timeout = int(os.environ.get('gw_lock_timeout', 10800))
+# ignore error files and rerun batches with errors
+gw_ignore_error_files = bool(os.environ.get('gw_ignore_error_files', False))
+
+# component interface
+${component_interface}
+
+def load_batches_from_file(batch_file):
+    if gw_batch_file_access_key_id is not None:
+        s3source = s3fs.S3FileSystem(
+            anon=False,
+            key=gw_batch_file_access_key_id,
+            secret=gw_batch_secret_access_key,
+            client_kwargs={'endpoint_url': gw_batch_endpoint})
+
+        if batch_file.endswith('.json'):
+            # load batches from keys of a json file
+            logging.info(f'Loading batches from json file: {batch_file}')
+            with s3source.open(gw_batch_file, 'r') as f:
+                batch_dict = json.load(f)
+            batches = batch_dict.keys()
+
+        elif batch_file.endswith('.csv'):
+            # load batches from the 'filename' column of a csv file
+            logging.info(f'Loading batches from csv file: {batch_file}')
+            s3source.get(batch_file, batch_file)
+            df = pd.read_csv(batch_file, header='infer')
+            batches = df['filename'].to_list()
+
+        else:
+            # Load batches from comma-separated txt file
+            logging.info(f'Loading comma-separated batch strings from file: {batch_file}')
+            with s3source.open(gw_batch_file, 'r') as f:
+                batch_string = f.read()
+            batches = [b.strip() for b in batch_string.split(',')]
+    else:
+        if
batch_file.endswith('.json'): + # load batches from keys of a json file + logging.info(f'Loading batches from json file: {batch_file}') + with open(batch_file, 'r') as f: + batch_dict = json.load(f) + batches = batch_dict.keys() + + elif batch_file.endswith('.csv'): + # load batches from keys of a csv file + logging.info(f'Loading batches from csv file: {batch_file}') + df = pd.read_csv(batch_file, header='infer') + batches = df['filename'].to_list() + + else: + # Load batches from comma-separated txt file + logging.info(f'Loading comma-separated batch strings from file: {batch_file}') + with open(batch_file, 'r') as f: + batch_string = f.read() + batches = [b.strip() for b in batch_string.split(',')] + + logging.info(f'Loaded {len(batches)} batches') + logging.debug(f'List of batches: {batches}') + assert len(batches) > 0, f"batch_file {batch_file} has no batches." + return batches + + +def identify_batches_from_pattern(file_path_patterns, group_by): + logging.info(f'Start identifying files and batches') + batches = set() + all_files = [] + + # Iterate over comma-separated paths + for file_path_pattern in file_path_patterns.split(','): + logging.info(f'Get file paths from pattern: {file_path_pattern}') + files = glob.glob(file_path_pattern.strip()) + assert len(files) > 0, f"Found no files with file_path_pattern {file_path_pattern}." + all_files.extend(files) + + # get batches by applying the group by function to all file paths + for path_string in all_files: + part = eval('str(path_string)' + group_by, {"group_by": group_by, "path_string": path_string}) + assert part != '', f'Could not extract batch with path_string {path_string} and group_by {group_by}' + batches.add(part) + + logging.info(f'Identified {len(batches)} batches') + logging.debug(f'List of batches: {batches}') + + return batches + + +def perform_process(process, batch): + logging.debug(f'Check coordinator files for batch {batch}.') + # init coordinator files + lock_file = Path(gw_coordinator_path) / (batch + gw_lock_file_suffix) + error_file = Path(gw_coordinator_path) / (batch + gw_error_file_suffix) + processed_file = Path(gw_coordinator_path) / (batch + gw_processed_file_suffix) + + if lock_file.exists(): + # remove strugglers + if lock_file.stat().st_mtime < time.time() - gw_lock_timeout: + logging.debug(f'Lock file {lock_file} is expired.') + lock_file.unlink() + else: + logging.debug(f'Batch {batch} is locked.') + return + + if processed_file.exists(): + logging.debug(f'Batch {batch} is processed.') + return + + if error_file.exists(): + if gw_ignore_error_files: + logging.info(f'Ignoring previous error in batch {batch} and rerun.') + else: + logging.debug(f'Batch {batch} has error.') + return + + logging.debug(f'Locking batch {batch}.') + lock_file.parent.mkdir(parents=True, exist_ok=True) + lock_file.touch() + + # processing files with custom process + logging.info(f'Processing batch {batch}.') + try: + target_files = process(batch, ${component_inputs}) + except Exception as err: + logging.error(f'{type(err).__name__} in batch {batch}: {err}') + # Write error to file + with open(error_file, 'w') as f: + f.write(f"{type(err).__name__} in batch {batch}: {err}") + lock_file.unlink() + logging.error(f'Continue processing.') + return + + # optional verify target files + if target_files is not None: + if isinstance(target_files, str): + target_files = [target_files] + for target_file in target_files: + if not os.path.exists(target_file): + logging.error(f'Target file {target_file} does not exist for batch {batch}.') + 
else:
+        logging.info(f'Cannot verify batch {batch} (target files not provided).')
+
+    logging.info(f'Finished batch {batch}.')
+    processed_file.touch()
+
+    # Remove lock file
+    if lock_file.exists():
+        lock_file.unlink()
+    else:
+        logging.warning(f'Lock file {lock_file} was removed by another process. '
+                        f'Consider increasing gw_lock_timeout (currently {gw_lock_timeout}s) to avoid repeated processing.')
+
+
+def process_wrapper(sub_process):
+    delay = random.randint(1, 60)
+    logging.info(f'Staggering start, waiting for {delay} seconds')
+    time.sleep(delay)
+
+    # Init coordinator dir
+    coordinator_dir = Path(gw_coordinator_path)
+    coordinator_dir.mkdir(exist_ok=True, parents=True)
+
+    # get batches
+    if gw_batch_file is not None and (gw_batch_file_access_key_id is not None or os.path.isfile(gw_batch_file)):
+        batches = load_batches_from_file(gw_batch_file)
+    elif gw_file_path_pattern is not None and gw_group_by is not None:
+        batches = identify_batches_from_pattern(gw_file_path_pattern, gw_group_by)
+    else:
+        raise ValueError("Cannot identify batches. "
+                         "Provide valid gw_batch_file or gw_file_path_pattern and gw_group_by.")
+
+    # Iterate over all batches
+    for batch in batches:
+        perform_process(sub_process, batch)
+
+    # Check and log status of batches
+    processed_status = [(coordinator_dir / (batch + gw_processed_file_suffix)).exists() for batch in batches]
+    lock_status = [(coordinator_dir / (batch + gw_lock_file_suffix)).exists() for batch in batches]
+    error_status = [(coordinator_dir / (batch + gw_error_file_suffix)).exists() for batch in batches]
+
+    logging.info(f'Finished current process. Status batches: '
+                 f'{sum(processed_status)} processed / {sum(lock_status)} locked / {sum(error_status)} errors / {len(processed_status)} total')
+
+    if sum(error_status):
+        logging.error(f'Found errors!
Resolve errors and rerun operator with gw_ignore_error_files=True.') + # print all error messages + for error_file in coordinator_dir.glob('**/*' + gw_error_file_suffix): + with open(error_file, 'r') as f: + logging.error(f.read()) + + +if __name__ == '__main__': + process_wrapper(${component_process}) From 186eaba2b758a52c22a966d2b8d8197b0c183976 Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Thu, 22 Feb 2024 23:07:01 +0100 Subject: [PATCH 143/177] Update create_gridwrapper.py Signed-off-by: Romeo Kienzler --- src/c3/create_gridwrapper.py | 31 ++++++++++++++++++++++--------- 1 file changed, 22 insertions(+), 9 deletions(-) diff --git a/src/c3/create_gridwrapper.py b/src/c3/create_gridwrapper.py index 158fd8ff..f82120ff 100644 --- a/src/c3/create_gridwrapper.py +++ b/src/c3/create_gridwrapper.py @@ -7,7 +7,8 @@ from c3.pythonscript import Pythonscript from c3.utils import convert_notebook from c3.create_operator import create_operator -from c3.templates import grid_wrapper_template, cos_grid_wrapper_template, component_setup_code_wo_logging +from c3.templates import component_setup_code_wo_logging +import c3 def wrap_component(component_path, @@ -16,12 +17,24 @@ def wrap_component(component_path, component_interface, component_inputs, component_process, - cos, + backend, ): # get component name from path component_name = os.path.splitext(os.path.basename(component_path))[0] - gw_template = cos_grid_wrapper_template if cos else grid_wrapper_template + logging.info(f'Using backend: {backend}') + + + backends = { + 'cos_grid_wrapper' : c3.templates.cos_grid_wrapper_template, + 'grid_wrapper' : c3.templates.grid_wrapper_template, + 's3kv_grid_wrapper': c3.templates.s3kv_grid_wrapper_template, + } + gw_template = backends.get(backend) + + logging.debug(f'Using backend template: {gw_template}') + + grid_wrapper_code = gw_template.substitute( component_name=component_name, component_description=component_description, @@ -32,7 +45,7 @@ def wrap_component(component_path, ) # Write edited code to file - grid_wrapper_file = f'cgw_{component_name}.py' if cos else f'gw_{component_name}.py' + grid_wrapper_file = 'grid_wrapper.py' grid_wrapper_file_path = os.path.join(os.path.dirname(component_path), grid_wrapper_file) # remove 'component_' from gw path grid_wrapper_file_path = grid_wrapper_file_path.replace('component_', '') @@ -112,7 +125,7 @@ def edit_component_code(file_path, component_process): return target_file -def apply_grid_wrapper(file_path, component_process, cos): +def apply_grid_wrapper(file_path, component_process, backend): assert file_path.endswith('.py') or file_path.endswith('.ipynb'), \ "Please provide a component file path to a python script or notebook." 
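For illustration, a hypothetical invocation that selects the plain filesystem backend instead of the new s3kv default (paths made up):

    python src/c3/create_gridwrapper.py components/component_process.py -p grid_process -b grid_wrapper --local_mode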
@@ -134,7 +147,7 @@ def apply_grid_wrapper(file_path, component_process, cos):
         logging.debug(component + ':\n' + str(value) + '\n')

     logging.info('Wrap component')
-    grid_wrapper_file_path = wrap_component(cos=cos, **component_elements)
+    grid_wrapper_file_path = wrap_component(backend=backend, **component_elements)
     return grid_wrapper_file_path, file_path

@@ -146,8 +159,8 @@ def main():
         help='List of paths to additional files to include in the container image')
     parser.add_argument('-p', '--component_process', type=str, default='grid_process',
                         help='Name of the component sub process that is executed for each batch.')
-    parser.add_argument('--cos', action=argparse.BooleanOptionalAction, default=False,
-                        help='Creates a grid wrapper for processing COS files')
+    parser.add_argument('-b', '--backend', type=str, default='s3kv_grid_wrapper',
+                        help='Define backend. Default: s3kv_grid_wrapper. Others: grid_wrapper, cos_grid_wrapper')
     parser.add_argument('-r', '--repository', type=str, default=None,
                         help='Container registry address, e.g. docker.io/')
     parser.add_argument('-v', '--version', type=str, default=None,

From b765e8324960cb325d48e33c0583e3f002fdf1ea Mon Sep 17 00:00:00 2001
From: Romeo Kienzler
Date: Fri, 23 Feb 2024 08:56:33 +0100
Subject: [PATCH 144/177] fix constant name issue

---
 src/c3/create_gridwrapper.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/c3/create_gridwrapper.py b/src/c3/create_gridwrapper.py
index f82120ff..c5317a47 100644
--- a/src/c3/create_gridwrapper.py
+++ b/src/c3/create_gridwrapper.py
@@ -45,7 +45,7 @@ def wrap_component(component_path,
     )

     # Write edited code to file
-    grid_wrapper_file = 'grid_wrapper.py'
+    grid_wrapper_file = f'gw_{component_name}.py'
     grid_wrapper_file_path = os.path.join(os.path.dirname(component_path), grid_wrapper_file)
     # remove 'component_' from gw path
     grid_wrapper_file_path = grid_wrapper_file_path.replace('component_', '')

From cc1d66c6f414764284bc2373d77b53d263821d9b Mon Sep 17 00:00:00 2001
From: Romeo Kienzler
Date: Fri, 1 Mar 2024 11:46:36 +0100
Subject: [PATCH 145/177] Update create_gridwrapper.py

From 34911a5b65ef1625729a8c27349c67547739a703 Mon Sep 17 00:00:00 2001
From: Romeo Kienzler
Date: Fri, 1 Mar 2024 11:47:46 +0100
Subject: [PATCH 146/177] Update s3kv_grid_wrapper_template.py

---
 .../templates/s3kv_grid_wrapper_template.py   | 701 +++++++++++++-----
 1 file changed, 531 insertions(+), 170 deletions(-)

diff --git a/src/c3/templates/s3kv_grid_wrapper_template.py b/src/c3/templates/s3kv_grid_wrapper_template.py
index 71a0ae33..4d794227 100644
--- a/src/c3/templates/s3kv_grid_wrapper_template.py
+++ b/src/c3/templates/s3kv_grid_wrapper_template.py
@@ -16,20 +16,498 @@
 from pathlib import Path
 import pandas as pd
 import s3fs
+from hashlib import sha256
+

 # import component code
 from ${component_name} import *

+#------------------REMOVE once pip install for s3kv is fixed
+import os
+import time
+from datetime import datetime
+import shutil
+import boto3
+import json
+
+
+class S3KV:
+    def __init__(self, s3_endpoint_url: str, bucket_name: str,
+                 aws_access_key_id: str = None, aws_secret_access_key: str = None, enable_local_cache=True):
+        """
+        Initializes the S3KV object with the given S3 bucket and AWS credentials.
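+
+        Example (illustrative values):
+            db = S3KV('https://s3.example.com', 'my-bucket', 'my_key', 'my_secret')
+            db.add('config', {'answer': 42})
+            db.get('config')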
+ + :param s3_endpoint_url: The s3 endpoint. + :param bucket_name: The name of the S3 bucket to use for storing the key-value data. + :param aws_access_key_id: (Optional) AWS access key ID. + :param aws_secret_access_key: (Optional) AWS secret access key. + """ + self.bucket_name = bucket_name + self.enable_local_cache = enable_local_cache + self.s3_client = boto3.client( + 's3', + endpoint_url=s3_endpoint_url, + aws_access_key_id=aws_access_key_id, + aws_secret_access_key=aws_secret_access_key + ) + + if not os.path.exists('/tmp/s3kv_cache'): + os.makedirs('/tmp/s3kv_cache') + + def _get_object_key(self, key: str) -> str: + """ + Constructs the S3 object key for the given key. + + :param key: The key used to access the value in the S3 bucket. + :return: The S3 object key for the given key. + """ + return f"s3kv/{key}.json" + + def cache_all_keys(self): + """ + Saves all keys to the local /tmp directory as they are being added. + """ + keys = self.list_keys() + for key in keys: + value = self.get(key) + if value is not None: + with open(f'/tmp/s3kv_cache/{key}.json', 'w') as f: + json.dump(value, f) + + def get_from_cache(self, key: str) -> dict: + """ + Retrieves a key from the local cache if present, and clears old cache entries. + + :param key: The key to retrieve from the cache. + :return: The value associated with the given key if present in the cache, else None. + """ + self.clear_old_cache() + cache_path = f'/tmp/s3kv_cache/{key}.json' + if os.path.exists(cache_path): + with open(cache_path, 'r') as f: + return json.load(f) + else: + return None + + + def add(self, key: str, value: dict, metadata: dict = None): + """ + Adds a new key-value pair to the S3KV database, caches it locally, and sends metadata to Elasticsearch. + + :param key: The key to be added. + :param value: The value corresponding to the key. + :param metadata: (Optional) Metadata associated with the data (will be sent to Elasticsearch). + """ + s3_object_key = self._get_object_key(key) + serialized_value = json.dumps(value) + self.s3_client.put_object(Bucket=self.bucket_name, Key=s3_object_key, Body=serialized_value) + + with open(f'/tmp/s3kv_cache/{key}.json', 'w') as f: + json.dump(value, f) + + + + def delete(self, key: str): + """ + Deletes a key-value pair from the S3KV database. + + :param key: The key to be deleted. + """ + s3_object_key = self._get_object_key(key) + self.s3_client.delete_object(Bucket=self.bucket_name, Key=s3_object_key) + + cache_path = f'/tmp/s3kv_cache/{key}.json' + if os.path.exists(cache_path): + os.remove(cache_path) + + + def get(self, key: str, default: dict = None) -> dict: + """ + Retrieves the value associated with the given key from the S3KV database. + + :param key: The key whose value is to be retrieved. + :param default: (Optional) The default value to return if the key does not exist. + :return: The value associated with the given key, or the default value if the key does not exist. + """ + s3_object_key = self._get_object_key(key) + try: + response = self.s3_client.get_object(Bucket=self.bucket_name, Key=s3_object_key) + value = response['Body'].read() + return json.loads(value) + except self.s3_client.exceptions.NoSuchKey: + return default + + + def list_keys(self) -> list: + """ + Lists all the keys in the S3KV database. + + :return: A list of all keys in the database. 
+ """ + response = self.s3_client.list_objects_v2(Bucket=self.bucket_name, Prefix="") + keys = [obj['Key'][5:-5] for obj in response.get('Contents', []) if obj['Key'].endswith('.json')] + return keys + + + def clear_cache(self): + """ + Clears the local cache by removing all cached JSON files. + """ + cache_directory = '/tmp/s3kv_cache' + if os.path.exists(cache_directory): + shutil.rmtree(cache_directory) + os.makedirs('/tmp/s3kv_cache') + + + def clear_old_cache(self, max_days: int = 7): + """ + Clears the cache for keys that have been in the cache for longer than a specific number of days. + + :param max_days: The maximum number of days a key can stay in the cache before being cleared. + """ + cache_directory = '/tmp/s3kv_cache' + current_time = time.time() + + if os.path.exists(cache_directory): + for filename in os.listdir(cache_directory): + file_path = os.path.join(cache_directory, filename) + if os.path.isfile(file_path): + file_age = current_time - os.path.getmtime(file_path) + if file_age > max_days * 86400: # Convert days to seconds + os.remove(file_path) + + + def clear_cache_for_key(self, key: str): + """ + Clears the local cache for a specific key in the S3KV database. + + :param key: The key for which to clear the local cache. + """ + cache_path = f'/tmp/s3kv_cache/{key}.json' + if os.path.exists(cache_path): + os.remove(cache_path) + + + def key_exists(self, key: str) -> bool: + """ + Checks if a key exists in the S3KV database. + + :param key: The key to check. + :return: True if the key exists, False otherwise. + """ + s3_object_key = self._get_object_key(key) + try: + self.s3_client.head_object(Bucket=self.bucket_name, Key=s3_object_key) + return True + except Exception as e: + # Return false even if response is unauthorized or similar + return False + + + def list_keys_with_prefix(self, prefix: str) -> list: + """ + Lists all the keys in the S3KV database that have a specific prefix. + + :param prefix: The prefix to filter the keys. + :return: A list of keys in the database that have the specified prefix. + """ + response = self.s3_client.list_objects_v2(Bucket=self.bucket_name, Prefix=prefix) + keys = [obj['Key'][5:-5] for obj in response.get('Contents', []) if obj['Key'].endswith('.json')] + return keys + + + def copy_key(self, source_key: str, destination_key: str): + """ + Copies the value of one key to another key in the S3KV database. + + :param source_key: The key whose value will be copied. + :param destination_key: The key to which the value will be copied. + """ + source_s3_object_key = self._get_object_key(source_key) + destination_s3_object_key = self._get_object_key(destination_key) + + response = self.s3_client.get_object(Bucket=self.bucket_name, Key=source_s3_object_key) + value = response['Body'].read() + + self.s3_client.put_object(Bucket=self.bucket_name, Key=destination_s3_object_key, Body=value) + + # Copy the key in the local cache if it exists + source_cache_path = f'/tmp/s3kv_cache/{source_key}.json' + destination_cache_path = f'/tmp/s3kv_cache/{destination_key}.json' + if os.path.exists(source_cache_path): + shutil.copy(source_cache_path, destination_cache_path) + + + def get_key_size(self, key: str) -> int: + """ + Gets the size (file size) of a key in the S3KV database. + + :param key: The key whose size will be retrieved. + :return: The size (file size) of the key in bytes, or 0 if the key does not exist. 
+ """ + s3_object_key = self._get_object_key(key) + try: + response = self.s3_client.head_object(Bucket=self.bucket_name, Key=s3_object_key) + return response['ContentLength'] + except self.s3_client.exceptions.NoSuchKey: + return 0 + + + def get_key_last_updated_time(self, key: str) -> float: + """ + Gets the last updated time of a key in the S3KV database. + + :param key: The key whose last updated time will be retrieved. + :return: The last updated time of the key as a floating-point timestamp, or 0 if the key does not exist. + """ + s3_object_key = self._get_object_key(key) + try: + response = self.s3_client.head_object(Bucket=self.bucket_name, Key=s3_object_key) + last_modified = response['LastModified'] + st = time.mktime(last_modified.timetuple()) + + return datetime.fromtimestamp(st) + + except self.s3_client.exceptions.NoSuchKey: + return 0 + + + def set_bucket_policy(self): + """ + Sets a bucket policy to grant read and write access to specific keys used by the S3KV library. + """ + policy = { + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "S3KVReadWriteAccess", + "Effect": "Allow", + "Principal": { + "AWS": "*" + }, + "Action": [ + "s3:GetObject", + "s3:PutObject" + ], + "Resource": f"arn:aws:s3:::{self.bucket_name}/s3kv/*" + } + ] + } + + policy_json = json.dumps(policy) + self.s3_client.put_bucket_policy(Bucket=self.bucket_name, Policy=policy_json) + + + def tag_key(self, key: str, tags: dict): + """ + Tags a key in the S3KV database with the provided tags. + + :param key: The key to be tagged. + :param tags: A dictionary containing the tags to be added to the key. + For example, {'TagKey1': 'TagValue1', 'TagKey2': 'TagValue2'} + """ + s3_object_key = self._get_object_key(key) + + # Convert the tags dictionary to a format compatible with the `put_object_tagging` method + tagging = {'TagSet': [{'Key': k, 'Value': v} for k, v in tags.items()]} + + # Apply the tags to the object + self.s3_client.put_object_tagging(Bucket=self.bucket_name, Key=s3_object_key, Tagging=tagging) + + + def tag_keys_with_prefix(self, prefix: str, tags: dict): + """ + Tags all keys in the S3KV database with the provided prefix with the specified tags. + + :param prefix: The prefix of the keys to be tagged. + :param tags: A dictionary containing the tags to be added to the keys. + For example, {'TagKey1': 'TagValue1', 'TagKey2': 'TagValue2'} + """ + keys_to_tag = self.list_keys_with_prefix(prefix) + + for key in keys_to_tag: + self.tag_key(key, tags) + + + def merge_keys(self, source_keys: list, destination_key: str): + """ + Merges the values of source keys into the value of the destination key in the S3KV database. + + :param source_keys: A list of source keys whose values will be merged. + :param destination_key: The key whose value will be updated by merging the source values. 
+ """ + destination_s3_object_key = self._get_object_key(destination_key) + + # Initialize an empty dictionary for the destination value + destination_value = {} + + # Retrieve and merge values from source keys + for source_key in source_keys: + source_value = self.get(source_key) + if source_value: + destination_value.update(source_value) + + # Update the destination value in the S3 bucket + serialized_value = json.dumps(destination_value) + self.s3_client.put_object(Bucket=self.bucket_name, Key=destination_s3_object_key, Body=serialized_value) + + # Update the value in the local cache if it exists + destination_cache_path = f'/tmp/s3kv_cache/{destination_key}.json' + with open(destination_cache_path, 'w') as f: + json.dump(destination_value, f) + + + + def find_keys_by_tag_value(self, tag_key: str, tag_value: str) -> list: + """ + Finds keys in the S3KV database based on the value of a specific tag. + + :param tag_key: The tag key to search for. + :param tag_value: The tag value to search for. + :return: A list of keys that have the specified tag key with the specified value. + """ + response = self.s3_client.list_objects_v2(Bucket=self.bucket_name, Prefix="s3kv/") + keys_with_tag = [] + + for obj in response.get('Contents', []): + s3_object_key = obj['Key'] + tags = self.get_tags(s3_object_key) + if tags and tag_key in tags and tags[tag_key] == tag_value: + keys_with_tag.append(s3_object_key[5:-5]) # Extract the key name + + return keys_with_tag + + def get_tags(self, s3_object_key: str) -> dict: + """ + Gets the tags of an object in the S3KV database. + + :param s3_object_key: The S3 object key whose tags will be retrieved. + :return: A dictionary containing the tags of the object. + """ + response = self.s3_client.get_object_tagging(Bucket=self.bucket_name, Key=s3_object_key) + tags = {} + for tag in response.get('TagSet', []): + tags[tag['Key']] = tag['Value'] + return tags + + + + def place_retention_lock(self, key: str, retention_days: int): + """ + Places a retention lock on a key in the S3KV database for the specified number of days. + + :param key: The key to place the retention lock on. + :param retention_days: The number of days to lock the key for. + """ + s3_object_key = self._get_object_key(key) + print(s3_object_key) + + retention_period = retention_days * 24 * 60 * 60 # Convert days to seconds + + self.s3_client.put_object_retention( + Bucket=self.bucket_name, + Key=s3_object_key, + Retention={ + 'Mode': 'GOVERNANCE', + 'RetainUntilDate': int(time.time()) + retention_period + } + ) + + + def remove_retention_lock(self, key: str): + """ + Removes the retention lock from a key in the S3KV database. + + :param key: The key to remove the retention lock from. + """ + s3_object_key = self._get_object_key(key) + + self.s3_client.put_object_retention( + Bucket=self.bucket_name, + Key=s3_object_key, + BypassGovernanceRetention=True, + Retention={ + + } + ) + + + def delete_by_tag(self, tag_key: str, tag_value: str): + """ + Deletes keys in the S3KV database based on a specific tag. + + :param tag_key: The tag key to match for deletion. + :param tag_value: The tag value to match for deletion. + """ + keys_to_delete = self.find_keys_by_tag_value(tag_key, tag_value) + + for key in keys_to_delete: + self.delete(key) + + + def apply_legal_hold(self, key: str): + """ + Applies a legal hold on a key in the S3KV database. + + :param key: The key on which to apply the legal hold. 
+ """ + s3_object_key = self._get_object_key(key) + + self.s3_client.put_object_legal_hold( + Bucket=self.bucket_name, + Key=s3_object_key, + LegalHold={ + 'Status': 'ON' + } + ) + + -explode_connection_string(cs): + + + def is_legal_hold_applied(self, key: str) -> bool: + """ + Checks if a key in the S3KV database is under legal hold. + + :param key: The key to check for legal hold. + :return: True if the key is under legal hold, False otherwise. + """ + s3_object_key = self._get_object_key(key) + + response = self.s3_client.get_object_legal_hold(Bucket=self.bucket_name, Key=s3_object_key) + + legal_hold_status = response.get('LegalHold', {}).get('Status') + return legal_hold_status == 'ON' + + + def release_legal_hold(self, key: str): + """ + Releases a key from legal hold in the S3KV database. + + :param key: The key to release from legal hold. + """ + s3_object_key = self._get_object_key(key) + + self.s3_client.put_object_legal_hold( + Bucket=self.bucket_name, + Key=s3_object_key, + LegalHold={ + 'Status': 'OFF' + } + ) + +#----------------------------------------------------------- + + +def explode_connection_string(cs): if cs is None: return None if cs.startswith('cos') or cs.startswith('s3'): buffer=cs.split('://')[1] access_key_id=buffer.split('@')[0].split(':')[0] secret_access_key=buffer.split('@')[0].split(':')[1] - endpoint=buffer.split('@')[1].split('/')[0] + endpoint=f"https://{buffer.split('@')[1].split('/')[0]}" path='/'.join(buffer.split('@')[1].split('/')[1:]) return (access_key_id, secret_access_key, endpoint, path) else: @@ -40,78 +518,32 @@ # File with batches. Provided as a comma-separated list of strings, keys in a json dict or single column CSV with 'filename' has header. Either local path as [cos|s3]://user:pw@endpoint/path gw_batch_file = os.environ.get('gw_batch_file', None) -(gw_batch_file_access_key_id, gw_batch_secret_access_key, gw_batch_endpoint, gw_batch_file) = explode_connection_string(gw_batch_file): - - -# file path pattern like your/path/**/*.tif. Multiple patterns can be separated with commas. Is ignored if gw_batch_file is provided. -gw_file_path_pattern = os.environ.get('gw_file_path_pattern', None) -# pattern for grouping file paths into batches like ".split('.')[-1]". Is ignored if gw_batch_file is provided. 
-gw_group_by = os.environ.get('gw_group_by', None) -# path to grid wrapper coordinator directory -gw_coordinator_path = os.environ.get('gw_coordinator_path') -# lock file suffix -gw_lock_file_suffix = os.environ.get('gw_lock_file_suffix', '.lock') -# processed file suffix -gw_processed_file_suffix = os.environ.get('gw_lock_file_suffix', '.processed') -# error file suffix -gw_error_file_suffix = os.environ.get('gw_error_file_suffix', '.err') -# timeout in seconds to remove lock file from struggling job (default 3 hours) -gw_lock_timeout = int(os.environ.get('gw_lock_timeout', 10800)) -# ignore error files and rerun batches with errors -gw_ignore_error_files = bool(os.environ.get('gw_ignore_error_files', False)) +(gw_batch_file_access_key_id, gw_batch_secret_access_key, gw_batch_endpoint, gw_batch_file) = explode_connection_string(gw_batch_file) + +# cos gw_coordinator_connection +gw_coordinator_connection = os.environ.get('gw_coordinator_connection') +(gw_coordinator_access_key_id, gw_coordinator_secret_access_key, gw_coordinator_endpoint, gw_coordinator_path) = explode_connection_string(gw_coordinator_connection) + +# maximal wait time for staggering start +gw_max_time_wait_staggering = int(os.environ.get('gw_max_time_wait_staggering',60)) # component interface -${component_interface} +#${component_interface} def load_batches_from_file(batch_file): - if gw_batch_file_access_key_id is not None: - s3source = s3fs.S3FileSystem( - anon=False, - key=gw_batch_file_access_key_id, - secret=gw_batch_secret_access_key, - client_kwargs={'endpoint_url': gw_batch_endpoint}) - - - if batch_file.endswith('.json'): - # load batches from keys of a json file - logging.info(f'Loading batches from json file: {batch_file}') - with s3source.open(gw_batch_file, 'r') as f: - batch_dict = json.load(f) - batches = batch_dict.keys() - - elif batch_file.endswith('.csv'): - # load batches from keys of a csv file - logging.info(f'Loading batches from csv file: {batch_file}') - s3source.get(batch_file, batch_file) - df = pd.read_csv(batch_file, header='infer') - batches = df['filename'].to_list() + s3source = s3fs.S3FileSystem( + anon=False, + key=gw_batch_file_access_key_id, + secret=gw_batch_secret_access_key, + client_kwargs={'endpoint_url': gw_batch_endpoint}) + + # load batches from keys of a csv file + logging.info(f'Loading batches from csv file: {batch_file}') + s3source.get(batch_file, batch_file) + df = pd.read_csv(batch_file, header='infer') + batches = df['filename'].to_list() - else: - # Load batches from comma-separated txt file - logging.info(f'Loading comma-separated batch strings from file: {batch_file}') - with s3source.open(gw_batch_file, 'r') as f: - batch_string = f.read() - batches = [b.strip() for b in batch_string.split(',')] - else: - if batch_file.endswith('.json'): - # load batches from keys of a json file - logging.info(f'Loading batches from json file: {batch_file}') - with open(batch_file, 'r') as f: - batch_dict = json.load(f) - batches = batch_dict.keys() - - elif batch_file.endswith('.csv'): - # load batches from keys of a csv file - logging.info(f'Loading batches from csv file: {batch_file}') - df = pd.read_csv(batch_file, header='infer') - batches = df['filename'].to_list() - else: - # Load batches from comma-separated txt file - logging.info(f'Loading comma-separated batch strings from file: {batch_file}') - with open(batch_file, 'r') as f: - batch_string = f.read() - batches = [b.strip() for b in batch_string.split(',')] logging.info(f'Loaded {len(batches)} batches') 
logging.debug(f'List of batches: {batches}') @@ -119,133 +551,62 @@ def load_batches_from_file(batch_file): return batches -def identify_batches_from_pattern(file_path_patterns, group_by): - logging.info(f'Start identifying files and batches') - batches = set() - all_files = [] - - # Iterate over comma-separated paths - for file_path_pattern in file_path_patterns.split(','): - logging.info(f'Get file paths from pattern: {file_path_pattern}') - files = glob.glob(file_path_pattern.strip()) - assert len(files) > 0, f"Found no files with file_path_pattern {file_path_pattern}." - all_files.extend(files) - - # get batches by applying the group by function to all file paths - for path_string in all_files: - part = eval('str(path_string)' + group_by, {"group_by": group_by, "path_string": path_string}) - assert part != '', f'Could not extract batch with path_string {path_string} and group_by {group_by}' - batches.add(part) - - logging.info(f'Identified {len(batches)} batches') - logging.debug(f'List of batches: {batches}') - - return batches - - -def perform_process(process, batch): +def perform_process(process, batch, coordinator): logging.debug(f'Check coordinator files for batch {batch}.') - # init coordinator files - lock_file = Path(gw_coordinator_path) / (batch + gw_lock_file_suffix) - error_file = Path(gw_coordinator_path) / (batch + gw_error_file_suffix) - processed_file = Path(gw_coordinator_path) / (batch + gw_processed_file_suffix) - - if lock_file.exists(): - # remove strugglers - if lock_file.stat().st_mtime < time.time() - gw_lock_timeout: - logging.debug(f'Lock file {lock_file} is expired.') - lock_file.unlink() - else: - logging.debug(f'Batch {batch} is locked.') - return - if processed_file.exists(): - logging.debug(f'Batch {batch} is processed.') - return + batch_id = sha256(batch.encode('utf-8')).hexdigest() # ensure no special characters break cos + logging.info(f'Generating {batch_id} for {batch}') - if error_file.exists(): - if gw_ignore_error_files: - logging.info(f'Ignoring previous error in batch {batch} and rerun.') + if coordinator.key_exists(batch_id): + if coordinator.get(batch_id) == 'locked': + logging.debug(f'Batch {batch_id} is locked') + return + elif coordinator.get(batch_id) == 'processed': + logging.debug(f'Batch {batch_id} is processed') + return else: - logging.debug(f'Batch {batch} has error.') + logging.debug(f'Batch {batch_id} is failed') return - logging.debug(f'Locking batch {batch}.') - lock_file.parent.mkdir(parents=True, exist_ok=True) - lock_file.touch() + + logging.debug(f'Locking batch {batch_id}.') + coordinator.add(batch_id,'locked') # processing files with custom process - logging.info(f'Processing batch {batch}.') + logging.info(f'Processing batch {batch_id}.') try: - target_files = process(batch, ${component_inputs}) + process(batch, ${component_inputs}) except Exception as err: - logging.error(f'{type(err).__name__} in batch {batch}: {err}') - # Write error to file - with open(error_file, 'w') as f: - f.write(f"{type(err).__name__} in batch {batch}: {err}") - lock_file.unlink() + logging.error(f'{type(err).__name__} in batch {batch_id}: {err}') + coordinator.add(batch_id,f"{type(err).__name__} in batch {batch_id}: {err}") logging.error(f'Continue processing.') return - # optional verify target files - if target_files is not None: - if isinstance(target_files, str): - target_files = [target_files] - for target_file in target_files: - if not os.path.exists(target_file): - logging.error(f'Target file {target_file} does not exist for batch 
{batch}.') - else: - logging.info(f'Cannot verify batch {batch} (target files not provided).') - - logging.info(f'Finished Batch {batch}.') - processed_file.touch() - - # Remove lock file - if lock_file.exists(): - lock_file.unlink() - else: - logging.warning(f'Lock file {lock_file} was removed by another process. ' - f'Consider increasing gw_lock_timeout (currently {gw_lock_timeout}s) to repeated processing.') - + logging.info(f'Finished Batch {batch_id}.') + coordinator.add(batch_id,'processed') def process_wrapper(sub_process): - delay = random.randint(1, 60) + delay = random.randint(0, gw_max_time_wait_staggering) logging.info(f'Staggering start, waiting for {delay} seconds') time.sleep(delay) - # Init coordinator dir - coordinator_dir = Path(gw_coordinator_path) - coordinator_dir.mkdir(exist_ok=True, parents=True) + # Init coordinator + coordinator = S3KV(gw_coordinator_endpoint, + gw_coordinator_path, + gw_coordinator_access_key_id, gw_coordinator_secret_access_key, + enable_local_cache=False) + # get batches - if gw_batch_file is not None and os.path.isfile(gw_batch_file): - batches = load_batches_from_file(gw_batch_file) - elif gw_file_path_pattern is not None and gw_group_by is not None: - batches = identify_batches_from_pattern(gw_file_path_pattern, gw_group_by) - else: - raise ValueError("Cannot identify batches. " - "Provide valid gw_batch_file or gw_file_path_pattern and gw_group_by.") + batches = load_batches_from_file(gw_batch_file) # Iterate over all batches for batch in batches: - perform_process(sub_process, batch) - - # Check and log status of batches - processed_status = [(coordinator_dir / (batch + gw_processed_file_suffix)).exists() for batch in batches] - lock_status = [(coordinator_dir / (batch + gw_lock_file_suffix)).exists() for batch in batches] - error_status = [(coordinator_dir / (batch + gw_error_file_suffix)).exists() for batch in batches] - - logging.info(f'Finished current process. Status batches: ' - f'{sum(processed_status)} processed / {sum(lock_status)} locked / {sum(error_status)} errors / {len(processed_status)} total') + perform_process(sub_process, batch, coordinator) - if sum(error_status): - logging.error(f'Found errors! 
Resolve errors and rerun operator with gw_ignore_error_files=True.') - # print all error messages - for error_file in coordinator_dir.glob('**/*' + gw_error_file_suffix): - with open(error_file, 'r') as f: - logging.error(f.read()) + if __name__ == '__main__': - process_wrapper(${component_process}) + process_wrapper(${component_process}) From 0d391c03fc769b1ff4ac6e69858441ac14c25ec2 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Tue, 5 Mar 2024 16:41:08 +0100 Subject: [PATCH 147/177] Fix cos batch file for cos grid wrapper Signed-off-by: Benedikt Blumenstiel --- src/c3/templates/cos_grid_wrapper_template.py | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/src/c3/templates/cos_grid_wrapper_template.py b/src/c3/templates/cos_grid_wrapper_template.py index 83281a53..a7ea8341 100644 --- a/src/c3/templates/cos_grid_wrapper_template.py +++ b/src/c3/templates/cos_grid_wrapper_template.py @@ -272,9 +272,16 @@ def process_wrapper(sub_process): s3coordinator.makedirs(coordinator_dir, exist_ok=True) # get batches - if gw_batch_file is not None and os.path.isfile(gw_batch_file): + cos_gw_batch_file = str(Path(gw_source_bucket) / gw_batch_file) + if (gw_batch_file is not None and (os.path.isfile(gw_batch_file) or s3source.exists(cos_gw_batch_file))): + if not os.path.isfile(gw_batch_file): + # Download batch file + s3source.get(cos_gw_batch_file, gw_batch_file) batches = load_batches_from_file(gw_batch_file) - cos_files = get_files_from_pattern(gw_file_path_pattern) + if gw_file_path_pattern: + cos_files = get_files_from_pattern(gw_file_path_pattern) + else: + cos_files = [] elif gw_file_path_pattern is not None and gw_group_by is not None: batches, cos_files = identify_batches_from_pattern(gw_file_path_pattern, gw_group_by) else: From 63538a455f377dd239a0d032349f66729fe3a33e Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Wed, 6 Mar 2024 18:25:30 +0100 Subject: [PATCH 148/177] Fix cos grid wrapper Signed-off-by: Benedikt Blumenstiel --- src/c3/templates/cos_grid_wrapper_template.py | 48 +++++++++++++------ 1 file changed, 33 insertions(+), 15 deletions(-) diff --git a/src/c3/templates/cos_grid_wrapper_template.py b/src/c3/templates/cos_grid_wrapper_template.py index 4a8be983..f2a958a9 100644 --- a/src/c3/templates/cos_grid_wrapper_template.py +++ b/src/c3/templates/cos_grid_wrapper_template.py @@ -27,6 +27,8 @@ # File containing batches. Provided as a comma-separated list of strings or keys in a json dict. All batch file names must contain the batch name. gw_batch_file = os.environ.get('gw_batch_file', None) +# Optional column name for a csv batch file (default: 'filename') +gw_batch_file_col_name = os.environ.get('gw_batch_file_col_name', 'filename') # file path pattern like your/path/**/*.tif. Multiple patterns can be separated with commas. It is ignored if gw_batch_file is provided. gw_file_path_pattern = os.environ.get('gw_file_path_pattern', None) # pattern for grouping file paths into batches like ".split('.')[-2]". It is ignored if gw_batch_file is provided. 
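For orientation, the three batch-file shapes accepted by load_batches_from_file can be produced as follows (a sketch with invented batch names, not part of the patch):

    import json
    import pandas as pd

    # json: the dict keys become the batch names
    with open('batches.json', 'w') as f:
        json.dump({'batch_a': {}, 'batch_b': {}}, f)

    # csv: the column named by gw_batch_file_col_name (default 'filename') is read
    pd.DataFrame({'filename': ['batch_a', 'batch_b']}).to_csv('batches.csv', index=False)

    # txt: a comma-separated list of batch names
    with open('batches.txt', 'w') as f:
        f.write('batch_a, batch_b')
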
@@ -40,8 +42,10 @@ gw_local_target_path = os.environ.get('gw_local_target_path', 'target') -explode_connection_string(cs): - if cs.startswith('cos') or cs.startswith('s3'): +def explode_connection_string(cs): + if cs is None: + return None, None, None, None + elif cs.startswith('cos') or cs.startswith('s3'): buffer=cs.split('://')[1] access_key_id=buffer.split('@')[0].split(':')[0] secret_access_key=buffer.split('@')[0].split(':')[1] @@ -49,14 +53,14 @@ path='/'.join(buffer.split('@')[1].split('/')[1:]) return (access_key_id, secret_access_key, endpoint, path) else: - None # TODO consider cs as secret and grab connection string from kubernetes + raise NotImplementedError # cos gw_source_connection gw_source_connection = os.environ.get('gw_source_connection') -(gw_source_access_key_id, gw_source_secret_access_key, gw_source_endpoint, gw_source_bucket) = explode_connection_string(gw_source_connection) +(gw_source_access_key_id, gw_source_secret_access_key, gw_source_endpoint, gw_source_path) = explode_connection_string(gw_source_connection) # cos gw_target_connection gw_target_connection = os.environ.get('gw_target_connection') @@ -78,7 +82,7 @@ gw_ignore_error_files = bool(os.environ.get('gw_ignore_error_files', False)) # maximal wait time for staggering start -gw_max_time_wait_staggering = int(os.environ.get('gw_max_time_wait_staggering',60)) +gw_max_time_wait_staggering = int(os.environ.get('gw_max_time_wait_staggering', 60)) # component interface @@ -91,12 +95,15 @@ secret=gw_source_secret_access_key, client_kwargs={'endpoint_url': gw_source_endpoint}) +gw_source_path = Path(gw_source_path) + if gw_target_connection is not None: s3target = s3fs.S3FileSystem( anon=False, key=gw_target_access_key_id, secret=gw_target_secret_access_key, client_kwargs={'endpoint_url': gw_target_endpoint}) + gw_target_path = Path(gw_target_path) else: logging.debug('Using source path as target path.') gw_target_path = gw_source_path @@ -108,6 +115,7 @@ key=gw_coordinator_access_key_id, secret=gw_coordinator_secret_access_key, client_kwargs={'endpoint_url': gw_coordinator_endpoint}) + gw_coordinator_path = Path(gw_coordinator_path) else: logging.debug('Using source bucket as coordinator bucket.') gw_coordinator_path = gw_source_path @@ -117,23 +125,29 @@ def load_batches_from_file(batch_file): if batch_file.endswith('.json'): # load batches from keys of a json file logging.info(f'Loading batches from json file: {batch_file}') - with s3source.open(gw_source_path / batch_file, 'r') as f: + with open(batch_file, 'r') as f: batch_dict = json.load(f) batches = batch_dict.keys() elif batch_file.endswith('.csv'): # load batches from keys of a csv file logging.info(f'Loading batches from csv file: {batch_file}') - s3source.get(gw_source_path / batch_file, batch_file) df = pd.read_csv(batch_file, header='infer') - batches = df['filename'].to_list() + assert gw_batch_file_col_name in df.columns, \ + f'gw_batch_file_col_name {gw_batch_file_col_name} not in columns of batch file {batch_file}' + batches = df[gw_batch_file_col_name].to_list() - else: + elif batch_file.endswith('.txt'): # Load batches from comma-separated txt file logging.info(f'Loading comma-separated batch strings from file: {batch_file}') - with s3source.open(gw_source_path / batch_file, 'r') as f: + with open(batch_file, 'r') as f: batch_string = f.read() batches = [b.strip() for b in batch_string.split(',')] + else: + raise ValueError(f'C3 only supports batch files of type ' + f'json (batches = dict keys), ' + f'csv (batches = column values), or ' + 
f'txt (batches = comma-separated list).')
 
     logging.info(f'Loaded {len(batches)} batches')
     logging.debug(f'List of batches: {batches}')
@@ -148,8 +162,9 @@ def get_files_from_pattern(file_path_patterns):
     # Iterate over comma-separated paths
     for file_path_pattern in file_path_patterns.split(','):
         logging.info(f'Get file paths from pattern: {file_path_pattern}')
-        files = s3source.glob(str(Path(gw_source_path) / file_path_pattern.strip()))
-        assert len(files) > 0, f"Found no files with file_path_pattern {file_path_pattern}."
+        files = s3source.glob(str(gw_source_path / file_path_pattern.strip()))
+        if len(files) == 0:
+            logging.warning(f"Found no files with file_path_pattern {file_path_pattern}.")
         all_files.extend(files)
     logging.info(f'Found {len(all_files)} cos files')
     return all_files
@@ -229,7 +244,7 @@ def perform_process(process, batch, cos_files):
     try:
         target_files = process(batch, ${component_inputs})
     except Exception as err:
-        logging.error(f'{type(err).__name__} in batch {batch}: {err}')
+        logging.exception(err)
         # Write error to file
         with s3coordinator.open(error_file, 'w') as f:
             f.write(f"{type(err).__name__} in batch {batch}: {err}")
@@ -281,7 +296,7 @@ def process_wrapper(sub_process):
     s3coordinator.makedirs(coordinator_dir, exist_ok=True)
 
     # get batches
-    cos_gw_batch_file = str(Path(gw_source_bucket) / gw_batch_file)
+    cos_gw_batch_file = str(gw_source_path / gw_batch_file)
     if (gw_batch_file is not None and (os.path.isfile(gw_batch_file) or s3source.exists(cos_gw_batch_file))):
         if not os.path.isfile(gw_batch_file):
             # Download batch file
@@ -290,12 +305,15 @@ def process_wrapper(sub_process):
         if gw_file_path_pattern:
             cos_files = get_files_from_pattern(gw_file_path_pattern)
         else:
+            logging.warning('gw_file_path_pattern is not provided. '
+                            'Grid wrapper expects the wrapped operator to handle COS files instead of the automatic download and upload.')
             cos_files = []
     elif gw_file_path_pattern is not None and gw_group_by is not None:
         batches, cos_files = identify_batches_from_pattern(gw_file_path_pattern, gw_group_by)
     else:
         raise ValueError("Cannot identify batches. 
" - "Provide valid gw_batch_file or gw_file_path_pattern and gw_group_by.") + "Provide valid gw_batch_file (local path or path within source bucket) " + "or gw_file_path_pattern and gw_group_by.") # Iterate over all batches for batch in batches: From 6b4df415ab170c7cc71d06d1fef489273466d52d Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Fri, 8 Mar 2024 13:44:12 +0100 Subject: [PATCH 149/177] Fix grid wrapper Signed-off-by: Benedikt Blumenstiel --- src/c3/create_gridwrapper.py | 11 ++- src/c3/templates/grid_wrapper_template.py | 88 +++++------------------ 2 files changed, 23 insertions(+), 76 deletions(-) diff --git a/src/c3/create_gridwrapper.py b/src/c3/create_gridwrapper.py index c5317a47..fef0e545 100644 --- a/src/c3/create_gridwrapper.py +++ b/src/c3/create_gridwrapper.py @@ -1,4 +1,3 @@ - import logging import os import argparse @@ -24,17 +23,15 @@ def wrap_component(component_path, logging.info(f'Using backend: {backend}') - backends = { - 'cos_grid_wrapper' : c3.templates.cos_grid_wrapper_template, - 'grid_wrapper' : c3.templates.grid_wrapper_template, - 's3kv_grid_wrapper': c3.templates.s3kv_grid_wrapper_template, + 'cos_grid_wrapper': c3.templates.cos_grid_wrapper_template, + 'grid_wrapper': c3.templates.grid_wrapper_template, + 's3kv_grid_wrapper': c3.templates.s3kv_grid_wrapper_template, } gw_template = backends.get(backend) logging.debug(f'Using backend template: {gw_template}') - grid_wrapper_code = gw_template.substitute( component_name=component_name, component_description=component_description, @@ -159,7 +156,7 @@ def main(): help='List of paths to additional files to include in the container image') parser.add_argument('-p', '--component_process', type=str, default='grid_process', help='Name of the component sub process that is executed for each batch.') - parser.add_argument('-b', '--backend', type=str, default='s3kv_grid_wrapper', + parser.add_argument('-b', '--backend', type=str, default='grid_wrapper', help='Define backend. Default: s3kv_grid_wrapper. Others: grid_wrapper, cos_grid_wrapper') parser.add_argument('-r', '--repository', type=str, default=None, help='Container registry address, e.g. docker.io/') diff --git a/src/c3/templates/grid_wrapper_template.py b/src/c3/templates/grid_wrapper_template.py index 7aa9a58e..5eed2d06 100644 --- a/src/c3/templates/grid_wrapper_template.py +++ b/src/c3/templates/grid_wrapper_template.py @@ -21,27 +21,8 @@ from ${component_name} import * -explode_connection_string(cs): - if cs is None: - return None - if cs.startswith('cos') or cs.startswith('s3'): - buffer=cs.split('://')[1] - access_key_id=buffer.split('@')[0].split(':')[0] - secret_access_key=buffer.split('@')[0].split(':')[1] - endpoint=buffer.split('@')[1].split('/')[0] - path='/'.join(buffer.split('@')[1].split('/')[1:]) - return (access_key_id, secret_access_key, endpoint, path) - else: - return (None, None, None, cs) - # TODO consider cs as secret and grab connection string from kubernetes - - - -# File with batches. Provided as a comma-separated list of strings, keys in a json dict or single column CSV with 'filename' has header. Either local path as [cos|s3]://user:pw@endpoint/path +# File with batches. Provided as a comma-separated list of strings, keys in a json dict or single column CSV with 'filename' has header. gw_batch_file = os.environ.get('gw_batch_file', None) -(gw_batch_file_access_key_id, gw_batch_secret_access_key, gw_batch_endpoint, gw_batch_file) = explode_connection_string(gw_batch_file): - - # file path pattern like your/path/**/*.tif. 
Multiple patterns can be separated with commas. Is ignored if gw_batch_file is provided. gw_file_path_pattern = os.environ.get('gw_file_path_pattern', None) # pattern for grouping file paths into batches like ".split('.')[-1]". Is ignored if gw_batch_file is provided. @@ -63,56 +44,25 @@ ${component_interface} def load_batches_from_file(batch_file): - if gw_batch_file_access_key_id is not None: - s3source = s3fs.S3FileSystem( - anon=False, - key=gw_batch_file_access_key_id, - secret=gw_batch_secret_access_key, - client_kwargs={'endpoint_url': gw_batch_endpoint}) - + if batch_file.endswith('.json'): + # load batches from keys of a json file + logging.info(f'Loading batches from json file: {batch_file}') + with open(batch_file, 'r') as f: + batch_dict = json.load(f) + batches = batch_dict.keys() + + elif batch_file.endswith('.csv'): + # load batches from keys of a csv file + logging.info(f'Loading batches from csv file: {batch_file}') + df = pd.read_csv(batch_file, header='infer') + batches = df['filename'].to_list() - if batch_file.endswith('.json'): - # load batches from keys of a json file - logging.info(f'Loading batches from json file: {batch_file}') - with s3source.open(gw_batch_file, 'r') as f: - batch_dict = json.load(f) - batches = batch_dict.keys() - - elif batch_file.endswith('.csv'): - # load batches from keys of a csv file - logging.info(f'Loading batches from csv file: {batch_file}') - s3source.get(batch_file, batch_file) - df = pd.read_csv(batch_file, header='infer') - batches = df['filename'].to_list() - - - else: - # Load batches from comma-separated txt file - logging.info(f'Loading comma-separated batch strings from file: {batch_file}') - with s3source.open(gw_batch_file, 'r') as f: - batch_string = f.read() - batches = [b.strip() for b in batch_string.split(',')] - else: - if batch_file.endswith('.json'): - # load batches from keys of a json file - logging.info(f'Loading batches from json file: {batch_file}') - with open(batch_file, 'r') as f: - batch_dict = json.load(f) - batches = batch_dict.keys() - - elif batch_file.endswith('.csv'): - # load batches from keys of a csv file - logging.info(f'Loading batches from csv file: {batch_file}') - df = pd.read_csv(batch_file, header='infer') - batches = df['filename'].to_list() - - else: - # Load batches from comma-separated txt file - logging.info(f'Loading comma-separated batch strings from file: {batch_file}') - with open(batch_file, 'r') as f: - batch_string = f.read() - batches = [b.strip() for b in batch_string.split(',')] + # Load batches from comma-separated txt file + logging.info(f'Loading comma-separated batch strings from file: {batch_file}') + with open(batch_file, 'r') as f: + batch_string = f.read() + batches = [b.strip() for b in batch_string.split(',')] logging.info(f'Loaded {len(batches)} batches') logging.debug(f'List of batches: {batches}') @@ -180,7 +130,7 @@ def perform_process(process, batch): try: target_files = process(batch, ${component_inputs}) except Exception as err: - logging.error(f'{type(err).__name__} in batch {batch}: {err}') + logging.exception(err) # Write error to file with open(error_file, 'w') as f: f.write(f"{type(err).__name__} in batch {batch}: {err}") From 692cd1a8a1ee035e31f7b1e9995fa047780b789b Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Fri, 8 Mar 2024 14:45:58 +0100 Subject: [PATCH 150/177] Fix inputs error Signed-off-by: Benedikt Blumenstiel --- src/c3/pythonscript.py | 3 +++ 1 file changed, 3 insertions(+) diff --git a/src/c3/pythonscript.py b/src/c3/pythonscript.py 
index 9deea18d..eeed2226 100644 --- a/src/c3/pythonscript.py +++ b/src/c3/pythonscript.py @@ -37,10 +37,13 @@ def _get_input_vars(self): logging.debug(f'Interface: No description for variable {env_name} provided.') if re.search(r'=\s*int\(\s*os', line): type = 'Integer' + default = default.strip('\"\'') elif re.search(r'=\s*float\(\s*os', line): type = 'Float' + default = default.strip('\"\'') elif re.search(r'=\s*bool\(\s*os', line): type = 'Boolean' + default = default.strip('\"\'') else: type = 'String' return_value[env_name] = { From 36a7b7d07e136abd25639d42ecb92819da2e8db4 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Fri, 8 Mar 2024 14:46:15 +0100 Subject: [PATCH 151/177] Updated grid wrapper Signed-off-by: Benedikt Blumenstiel --- src/c3/create_gridwrapper.py | 12 ++++++++---- src/c3/templates/cos_grid_wrapper_template.py | 3 +-- src/c3/templates/grid_wrapper_template.py | 4 +++- 3 files changed, 12 insertions(+), 7 deletions(-) diff --git a/src/c3/create_gridwrapper.py b/src/c3/create_gridwrapper.py index fef0e545..0369345e 100644 --- a/src/c3/create_gridwrapper.py +++ b/src/c3/create_gridwrapper.py @@ -24,8 +24,11 @@ def wrap_component(component_path, logging.info(f'Using backend: {backend}') backends = { - 'cos_grid_wrapper': c3.templates.cos_grid_wrapper_template, + 'local': c3.templates.grid_wrapper_template, + 'cos': c3.templates.cos_grid_wrapper_template, + 's3kv': c3.templates.s3kv_grid_wrapper_template, 'grid_wrapper': c3.templates.grid_wrapper_template, + 'cos_grid_wrapper': c3.templates.cos_grid_wrapper_template, 's3kv_grid_wrapper': c3.templates.s3kv_grid_wrapper_template, } gw_template = backends.get(backend) @@ -71,7 +74,8 @@ def get_component_elements(file_path): type_to_func = {'String': '', 'Boolean': 'bool', 'Integer': 'int', 'Float': 'float'} for variable, d in inputs.items(): interface += f"# {d['description']}\n" - if d['type'] == 'String' and d['default'] is not None and d['default'][0] not in '\'\"': + if (d['type'] == 'String' and d['default'] is not None and + (d['default'] == '' or d['default'][0] not in '\'\"')): # Add quotation marks d['default'] = "'" + d['default'] + "'" interface += f"component_{variable} = {type_to_func[d['type']]}(os.getenv('{variable}', {d['default']}))\n" @@ -156,8 +160,8 @@ def main(): help='List of paths to additional files to include in the container image') parser.add_argument('-p', '--component_process', type=str, default='grid_process', help='Name of the component sub process that is executed for each batch.') - parser.add_argument('-b', '--backend', type=str, default='grid_wrapper', - help='Define backend. Default: s3kv_grid_wrapper. Others: grid_wrapper, cos_grid_wrapper') + parser.add_argument('-b', '--backend', type=str, default='local', + help='Define backend. Default: local. Others: cos, s3kv') parser.add_argument('-r', '--repository', type=str, default=None, help='Container registry address, e.g. 
docker.io/') parser.add_argument('-v', '--version', type=str, default=None, diff --git a/src/c3/templates/cos_grid_wrapper_template.py b/src/c3/templates/cos_grid_wrapper_template.py index f2a958a9..85640802 100644 --- a/src/c3/templates/cos_grid_wrapper_template.py +++ b/src/c3/templates/cos_grid_wrapper_template.py @@ -80,7 +80,6 @@ def explode_connection_string(cs): gw_lock_timeout = int(os.environ.get('gw_lock_timeout', 10800)) # ignore error files and rerun batches with errors gw_ignore_error_files = bool(os.environ.get('gw_ignore_error_files', False)) - # maximal wait time for staggering start gw_max_time_wait_staggering = int(os.environ.get('gw_max_time_wait_staggering', 60)) @@ -287,7 +286,7 @@ def perform_process(process, batch, cos_files): def process_wrapper(sub_process): - delay = random.randint(1, 60) + delay = random.randint(0, gw_max_time_wait_staggering) logging.info(f'Staggering start, waiting for {delay} seconds') time.sleep(delay) diff --git a/src/c3/templates/grid_wrapper_template.py b/src/c3/templates/grid_wrapper_template.py index 5eed2d06..7fc4d78f 100644 --- a/src/c3/templates/grid_wrapper_template.py +++ b/src/c3/templates/grid_wrapper_template.py @@ -39,6 +39,8 @@ gw_lock_timeout = int(os.environ.get('gw_lock_timeout', 10800)) # ignore error files and rerun batches with errors gw_ignore_error_files = bool(os.environ.get('gw_ignore_error_files', False)) +# maximal wait time for staggering start +gw_max_time_wait_staggering = int(os.environ.get('gw_max_time_wait_staggering', 60)) # component interface ${component_interface} @@ -161,7 +163,7 @@ def perform_process(process, batch): def process_wrapper(sub_process): - delay = random.randint(1, 60) + delay = random.randint(0, gw_max_time_wait_staggering) logging.info(f'Staggering start, waiting for {delay} seconds') time.sleep(delay) From b288ccb7f120e052c1e962670d90ce5f7169d421 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Fri, 8 Mar 2024 14:46:39 +0100 Subject: [PATCH 152/177] Updated tests Signed-off-by: Benedikt Blumenstiel --- tests/example_script.py | 2 +- tests/test_compiler.py | 12 +++++++----- 2 files changed, 8 insertions(+), 6 deletions(-) diff --git a/tests/example_script.py b/tests/example_script.py index 6af2556c..8d709e14 100644 --- a/tests/example_script.py +++ b/tests/example_script.py @@ -12,7 +12,7 @@ import numpy as np # A comment one line above os.getenv is the description of this variable. 
-input_path = os.environ.get('input_path') # ('not this') +input_path = os.environ.get('input_path', '') # ('not this') # type casting to int(), float(), or bool() batch_size = int(os.environ.get('batch_size', 16)) # (not this) diff --git a/tests/test_compiler.py b/tests/test_compiler.py index 29cdff69..b3fd2772 100644 --- a/tests/test_compiler.py +++ b/tests/test_compiler.py @@ -136,24 +136,26 @@ def test_create_operator( test_create_gridwrapper_input = [ ( TEST_SCRIPT_PATH, - None, 'process', [TEST_NOTEBOOK_PATH], ), + ( + TEST_SCRIPT_PATH, + 'process', + [TEST_NOTEBOOK_PATH, '--backend', 'cos'], + ), ( TEST_NOTEBOOK_PATH, - DUMMY_REPO, 'your_function', [], ), ] @pytest.mark.parametrize( - "file_path, repository, process, args", + "file_path, process, args", test_create_gridwrapper_input, ) def test_create_gridwrapper( file_path: str, - repository: str, process: str, args: List, ): @@ -166,6 +168,6 @@ def test_create_gridwrapper( gw_file.with_suffix('.yaml').unlink() gw_file.with_suffix('.job.yaml').unlink() gw_file.with_suffix('.cwl').unlink() - image_name = f"{repository}/claimed-gw-{file_path.rsplit('.')[0].replace('_', '-')}:test" + image_name = f"claimed-gw-{file_path.rsplit('.')[0].replace('_', '-')}:test" # TODO: Modify subprocess call to test grid wrapper # subprocess.run(['docker', 'run', image_name], check=True) From c3d718cf8008f98a9c590913d81743f130f992f3 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Fri, 8 Mar 2024 16:24:42 +0100 Subject: [PATCH 153/177] Added s3 batch file connection Signed-off-by: Benedikt Blumenstiel --- src/c3/templates/cos_grid_wrapper_template.py | 52 ++++++++++++------- 1 file changed, 33 insertions(+), 19 deletions(-) diff --git a/src/c3/templates/cos_grid_wrapper_template.py b/src/c3/templates/cos_grid_wrapper_template.py index 85640802..15f82747 100644 --- a/src/c3/templates/cos_grid_wrapper_template.py +++ b/src/c3/templates/cos_grid_wrapper_template.py @@ -25,8 +25,24 @@ from ${component_name} import * +def explode_connection_string(cs): + if cs is None: + return None, None, None, None + elif cs.startswith('cos') or cs.startswith('s3'): + buffer=cs.split('://', 1)[1] + access_key_id=buffer.split('@')[0].split(':')[0] + secret_access_key=buffer.split('@')[0].split(':')[1] + endpoint = f"https://{buffer.split('@')[1].split('/')[0]}" + path=buffer.split('@')[1].split('/', 1)[1] + return (access_key_id, secret_access_key, endpoint, path) + else: + return (None, None, None, cs) + # TODO consider cs as secret and grab connection string from kubernetes + + # File containing batches. Provided as a comma-separated list of strings or keys in a json dict. All batch file names must contain the batch name. gw_batch_file = os.environ.get('gw_batch_file', None) +(gw_batch_file_access_key_id, gw_batch_file_secret_access_key, gw_batch_file_endpoint, gw_batch_file) = explode_connection_string(gw_batch_file) # Optional column name for a csv batch file (default: 'filename') gw_batch_file_col_name = os.environ.get('gw_batch_file_col_name', 'filename') # file path pattern like your/path/**/*.tif. Multiple patterns can be separated with commas. It is ignored if gw_batch_file is provided. 
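To make the parsing in the hunk above concrete: a connection string in the documented [cos|s3]://user:pw@endpoint/path form decomposes as follows (invented credentials and host, shown for illustration only):

    cs = 's3://ACCESS_KEY:SECRET_KEY@s3.example.com/my-bucket/batches/batches.csv'
    explode_connection_string(cs)
    # -> ('ACCESS_KEY', 'SECRET_KEY', 'https://s3.example.com',
    #     'my-bucket/batches/batches.csv')

Note that this split-based parsing assumes the access key and secret contain no ':' or '@' characters.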
@@ -41,23 +57,6 @@ # upload local target files to target cos path gw_local_target_path = os.environ.get('gw_local_target_path', 'target') - -def explode_connection_string(cs): - if cs is None: - return None, None, None, None - elif cs.startswith('cos') or cs.startswith('s3'): - buffer=cs.split('://')[1] - access_key_id=buffer.split('@')[0].split(':')[0] - secret_access_key=buffer.split('@')[0].split(':')[1] - endpoint=buffer.split('@')[1].split('/')[0] - path='/'.join(buffer.split('@')[1].split('/')[1:]) - return (access_key_id, secret_access_key, endpoint, path) - else: - # TODO consider cs as secret and grab connection string from kubernetes - raise NotImplementedError - - - # cos gw_source_connection gw_source_connection = os.environ.get('gw_source_connection') (gw_source_access_key_id, gw_source_secret_access_key, gw_source_endpoint, gw_source_path) = explode_connection_string(gw_source_connection) @@ -120,6 +119,18 @@ def explode_connection_string(cs): gw_coordinator_path = gw_source_path s3coordinator = s3source +if gw_batch_file_access_key_id is not None: + s3batch_file = s3fs.S3FileSystem( + anon=False, + key=gw_batch_file_access_key_id, + secret=gw_batch_file_secret_access_key, + client_kwargs={'endpoint_url': gw_batch_file_endpoint}) +else: + logging.debug('Loading batch file from source s3.') + s3batch_file = s3source + gw_batch_file = str(gw_source_path / gw_batch_file) + + def load_batches_from_file(batch_file): if batch_file.endswith('.json'): # load batches from keys of a json file @@ -298,8 +309,11 @@ def process_wrapper(sub_process): cos_gw_batch_file = str(gw_source_path / gw_batch_file) if (gw_batch_file is not None and (os.path.isfile(gw_batch_file) or s3source.exists(cos_gw_batch_file))): if not os.path.isfile(gw_batch_file): - # Download batch file - s3source.get(cos_gw_batch_file, gw_batch_file) + # Download batch file from s3 + if s3batch_file.exists(gw_batch_file): + s3batch_file.get(gw_batch_file, gw_batch_file) + else: + s3batch_file.get(str(gw_source_path / gw_batch_file), gw_batch_file) batches = load_batches_from_file(gw_batch_file) if gw_file_path_pattern: cos_files = get_files_from_pattern(gw_file_path_pattern) From f518331a4d4264cf47a885bfac486469eb55513f Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Fri, 8 Mar 2024 16:30:23 +0100 Subject: [PATCH 154/177] Moved cos gw to legacy gw Signed-off-by: Benedikt Blumenstiel --- src/c3/create_gridwrapper.py | 4 +++- src/c3/templates/__init__.py | 4 ++++ ...rapper_template.py => legacy_cos_grid_wrapper_template.py} | 0 3 files changed, 7 insertions(+), 1 deletion(-) rename src/c3/templates/{cos_grid_wrapper_template.py => legacy_cos_grid_wrapper_template.py} (100%) diff --git a/src/c3/create_gridwrapper.py b/src/c3/create_gridwrapper.py index 0369345e..c457a9a9 100644 --- a/src/c3/create_gridwrapper.py +++ b/src/c3/create_gridwrapper.py @@ -26,9 +26,11 @@ def wrap_component(component_path, backends = { 'local': c3.templates.grid_wrapper_template, 'cos': c3.templates.cos_grid_wrapper_template, + 'legacy_cos': c3.templates.legacy_cos_grid_wrapper_template, 's3kv': c3.templates.s3kv_grid_wrapper_template, 'grid_wrapper': c3.templates.grid_wrapper_template, 'cos_grid_wrapper': c3.templates.cos_grid_wrapper_template, + 'legacy_cos_grid_wrapper': c3.templates.legacy_cos_grid_wrapper_template, 's3kv_grid_wrapper': c3.templates.s3kv_grid_wrapper_template, } gw_template = backends.get(backend) @@ -161,7 +163,7 @@ def main(): parser.add_argument('-p', '--component_process', type=str, default='grid_process', 
help='Name of the component sub process that is executed for each batch.') parser.add_argument('-b', '--backend', type=str, default='local', - help='Define backend. Default: local. Others: cos, s3kv') + help='Define backend. Default: local. Others: cos, s3kv, legacy_cos (with automatic file download/upload)') parser.add_argument('-r', '--repository', type=str, default=None, help='Container registry address, e.g. docker.io/') parser.add_argument('-v', '--version', type=str, default=None, diff --git a/src/c3/templates/__init__.py b/src/c3/templates/__init__.py index 85394c29..f761b5d1 100644 --- a/src/c3/templates/__init__.py +++ b/src/c3/templates/__init__.py @@ -14,6 +14,7 @@ CWL_COMPONENT_FILE = 'cwl_component_template.cwl' GRID_WRAPPER_FILE = 'grid_wrapper_template.py' COS_GRID_WRAPPER_FILE = 'cos_grid_wrapper_template.py' +LEGACY_COS_GRID_WRAPPER_FILE = 'legacy_cos_grid_wrapper_template.py' S3KV_GRID_WRAPPER_FILE = 's3kv_grid_wrapper_template.py' # load templates @@ -49,5 +50,8 @@ with open(template_path / COS_GRID_WRAPPER_FILE, 'r') as f: cos_grid_wrapper_template = Template(f.read()) +with open(template_path / LEGACY_COS_GRID_WRAPPER_FILE, 'r') as f: + legacy_cos_grid_wrapper_template = Template(f.read()) + with open(template_path / S3KV_GRID_WRAPPER_FILE, 'r') as f: s3kv_grid_wrapper_template = Template(f.read()) diff --git a/src/c3/templates/cos_grid_wrapper_template.py b/src/c3/templates/legacy_cos_grid_wrapper_template.py similarity index 100% rename from src/c3/templates/cos_grid_wrapper_template.py rename to src/c3/templates/legacy_cos_grid_wrapper_template.py From 260a385bebd9de439b08b19a3c6a24384dd9a3d2 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Fri, 8 Mar 2024 17:01:44 +0100 Subject: [PATCH 155/177] Unified functionality for cos and s3kv grid wrapper Signed-off-by: Benedikt Blumenstiel --- src/c3/templates/cos_grid_wrapper_template.py | 218 ++++++++++++++++++ .../legacy_cos_grid_wrapper_template.py | 2 +- .../templates/s3kv_grid_wrapper_template.py | 64 +++-- 3 files changed, 266 insertions(+), 18 deletions(-) create mode 100644 src/c3/templates/cos_grid_wrapper_template.py diff --git a/src/c3/templates/cos_grid_wrapper_template.py b/src/c3/templates/cos_grid_wrapper_template.py new file mode 100644 index 00000000..7526929d --- /dev/null +++ b/src/c3/templates/cos_grid_wrapper_template.py @@ -0,0 +1,218 @@ +""" +${component_name} got wrapped by cos_grid_wrapper, which wraps any CLAIMED component and implements the generic grid computing pattern for cos files https://romeokienzler.medium.com/the-generic-grid-computing-pattern-transforms-any-sequential-workflow-step-into-a-transient-grid-c7f3ca7459c8 + +CLAIMED component description: ${component_description} +""" + +# pip install s3fs +# component dependencies +# ${component_dependencies} + +import os +import json +import random +import logging +import shutil +import time +import glob +import s3fs +from datetime import datetime +from pathlib import Path +import pandas as pd + + +# import component code +from ${component_name} import * + + +def explode_connection_string(cs): + if cs is None: + return None, None, None, None + elif cs.startswith('cos') or cs.startswith('s3'): + buffer=cs.split('://', 1)[1] + access_key_id=buffer.split('@')[0].split(':')[0] + secret_access_key=buffer.split('@')[0].split(':')[1] + endpoint = f"https://{buffer.split('@')[1].split('/')[0]}" + path=buffer.split('@')[1].split('/', 1)[1] + return (access_key_id, secret_access_key, endpoint, path) + else: + return (None, None, None, cs) + # 
TODO consider cs as secret and grab connection string from kubernetes
+
+
+# File containing batches. Provided as a comma-separated list of strings or keys in a json dict. All batch file names must contain the batch name.
+gw_batch_file = os.environ.get('gw_batch_file', None)
+(gw_batch_file_access_key_id, gw_batch_file_secret_access_key, gw_batch_file_endpoint, gw_batch_file) = explode_connection_string(gw_batch_file)
+# Optional column name for a csv batch file (default: 'filename')
+gw_batch_file_col_name = os.environ.get('gw_batch_file_col_name', 'filename')
+# cos gw_coordinator_connection
+gw_coordinator_connection = os.environ.get('gw_coordinator_connection')
+(gw_coordinator_access_key_id, gw_coordinator_secret_access_key, gw_coordinator_endpoint, gw_coordinator_path) = explode_connection_string(gw_coordinator_connection)
+# timeout in seconds to remove lock file from struggling job (default 3 hours)
+gw_lock_timeout = int(os.environ.get('gw_lock_timeout', 10800))
+# ignore error files and rerun batches with errors
+gw_ignore_error_files = bool(os.environ.get('gw_ignore_error_files', False))
+# maximal wait time for staggering start
+gw_max_time_wait_staggering = int(os.environ.get('gw_max_time_wait_staggering', 60))
+
+# coordinator file suffix
+suffix_lock = '.lock'
+suffix_processed = '.processed'
+suffix_error = '.err'
+
+# component interface
+${component_interface}
+
+# init s3
+s3coordinator = s3fs.S3FileSystem(
+    anon=False,
+    key=gw_coordinator_access_key_id,
+    secret=gw_coordinator_secret_access_key,
+    client_kwargs={'endpoint_url': gw_coordinator_endpoint})
+gw_coordinator_path = Path(gw_coordinator_path)
+
+if gw_batch_file_access_key_id is not None:
+    s3batch_file = s3fs.S3FileSystem(
+        anon=False,
+        key=gw_batch_file_access_key_id,
+        secret=gw_batch_file_secret_access_key,
+        client_kwargs={'endpoint_url': gw_batch_file_endpoint})
+else:
+    logging.debug('Loading batch file from coordinator s3.')
+    s3batch_file = s3coordinator
+    # TODO: Fix this somewhere else
+    # Prefix the batch file with the coordinator bucket (first path component)
+    gw_batch_file = str(Path(gw_coordinator_path.parts[0]) / gw_batch_file)
+
+
+def load_batches_from_file(batch_file):
+    if batch_file.endswith('.json'):
+        # load batches from keys of a json file
+        logging.info(f'Loading batches from json file: {batch_file}')
+        with open(batch_file, 'r') as f:
+            batch_dict = json.load(f)
+        batches = batch_dict.keys()
+
+    elif batch_file.endswith('.csv'):
+        # load batches from keys of a csv file
+        logging.info(f'Loading batches from csv file: {batch_file}')
+        df = pd.read_csv(batch_file, header='infer')
+        assert gw_batch_file_col_name in df.columns, \
+            f'gw_batch_file_col_name {gw_batch_file_col_name} not in columns of batch file {batch_file}'
+        batches = df[gw_batch_file_col_name].to_list()
+
+    elif batch_file.endswith('.txt'):
+        # Load batches from comma-separated txt file
+        logging.info(f'Loading comma-separated batch strings from file: {batch_file}')
+        with open(batch_file, 'r') as f:
+            batch_string = f.read()
+        batches = [b.strip() for b in batch_string.split(',')]
+    else:
+        raise ValueError(f'C3 only supports batch files of type '
+                         f'json (batches = dict keys), '
+                         f'csv (batches = column values), or '
+                         f'txt (batches = comma-separated list).')
+
+    logging.info(f'Loaded {len(batches)} batches')
+    logging.debug(f'List of batches: {batches}')
+    assert len(batches) > 0, f"batch_file {batch_file} has no batches."
+ return batches + + +def perform_process(process, batch): + logging.debug(f'Check coordinator files for batch {batch}.') + # init coordinator files + lock_file = str(gw_coordinator_path / (batch + suffix_lock)) + processed_file = str(gw_coordinator_path / (batch + suffix_processed)) + error_file = str(gw_coordinator_path / (batch + suffix_error)) + + if s3coordinator.exists(lock_file): + # remove strugglers + last_modified = s3coordinator.info(lock_file)['LastModified'] + if (datetime.now(last_modified.tzinfo) - last_modified).total_seconds() > gw_lock_timeout: + logging.info(f'Lock file {lock_file} is expired.') + s3coordinator.rm(lock_file) + else: + logging.debug(f'Batch {batch} is locked.') + return + + if s3coordinator.exists(processed_file): + logging.debug(f'Batch {batch} is processed.') + return + + if s3coordinator.exists(error_file): + if gw_ignore_error_files: + logging.info(f'Ignoring previous error in batch {batch} and rerun.') + else: + logging.debug(f'Batch {batch} has error.') + return + + logging.debug(f'Locking batch {batch}.') + s3coordinator.touch(lock_file) + + # processing files with custom process + logging.info(f'Processing batch {batch}.') + try: + target_files = process(batch, ${component_inputs}) + except Exception as err: + logging.exception(err) + # Write error to file + with s3coordinator.open(error_file, 'w') as f: + f.write(f"{type(err).__name__} in batch {batch}: {err}") + s3coordinator.rm(lock_file) + logging.error(f'Continue processing.') + return + + logging.info(f'Finished Batch {batch}.') + s3coordinator.touch(processed_file) + # Remove lock file + if s3coordinator.exists(lock_file): + s3coordinator.rm(lock_file) + else: + logging.warning(f'Lock file {lock_file} was removed by another process. ' + f'Consider increasing gw_lock_timeout to avoid repeated processing (currently {gw_lock_timeout}s).') + + +def process_wrapper(sub_process): + delay = random.randint(0, gw_max_time_wait_staggering) + logging.info(f'Staggering start, waiting for {delay} seconds') + time.sleep(delay) + + # Init coordinator dir + s3coordinator.makedirs(gw_coordinator_path, exist_ok=True) + + # download batch file + if not os.path.isfile(gw_batch_file): + cos_gw_batch_file = str(gw_coordinator_path.split([0]) / gw_batch_file) + # Download batch file from s3 + if s3batch_file.exists(cos_gw_batch_file): + s3batch_file.get(gw_batch_file, gw_batch_file) + else: + raise ValueError("Cannot identify batches. Provide valid gw_batch_file " + "(local path, path within coordinator bucket, or s3 connection to batch file).") + + # get batches + batches = load_batches_from_file(gw_batch_file) + + # Iterate over all batches + for batch in batches: + perform_process(sub_process, batch) + + # Check and log status of batches + processed_status = sum(s3coordinator.exists(gw_coordinator_path / (batch + suffix_processed)) for batch in batches) + lock_status = sum(s3coordinator.exists(gw_coordinator_path / (batch + suffix_lock)) for batch in batches) + error_status = sum(s3coordinator.exists(gw_coordinator_path / (batch + suffix_error)) for batch in batches) + + logging.info(f'Finished current process. Status batches: ' + f'{processed_status} processed / {lock_status} locked / {error_status} errors / {len(batches)} total') + + if error_status: + logging.error(f'Found errors! 
Resolve errors and rerun operator with gw_ignore_error_files=True.') + # print all error messages + for error_file in s3coordinator.glob(str(gw_coordinator_path / ('**/*' + suffix_error))): + with s3coordinator.open(error_file, 'r') as f: + logging.error(f.read()) + + +if __name__ == '__main__': + process_wrapper(${component_process}) diff --git a/src/c3/templates/legacy_cos_grid_wrapper_template.py b/src/c3/templates/legacy_cos_grid_wrapper_template.py index 15f82747..16f56f59 100644 --- a/src/c3/templates/legacy_cos_grid_wrapper_template.py +++ b/src/c3/templates/legacy_cos_grid_wrapper_template.py @@ -67,7 +67,7 @@ def explode_connection_string(cs): # cos gw_coordinator_connection gw_coordinator_connection = os.environ.get('gw_coordinator_connection') -(gw_coordinator_access_key_id, gw_coordinator_secret_access_key, gw_coordinator_endpoint, gw_coordinator_path) = explode_connection_string(gw_target_connection) +(gw_coordinator_access_key_id, gw_coordinator_secret_access_key, gw_coordinator_endpoint, gw_coordinator_path) = explode_connection_string(gw_coordinator_connection) # lock file suffix gw_lock_file_suffix = os.environ.get('gw_lock_file_suffix', '.lock') diff --git a/src/c3/templates/s3kv_grid_wrapper_template.py b/src/c3/templates/s3kv_grid_wrapper_template.py index 4d794227..21cae458 100644 --- a/src/c3/templates/s3kv_grid_wrapper_template.py +++ b/src/c3/templates/s3kv_grid_wrapper_template.py @@ -500,15 +500,15 @@ def release_legal_hold(self, key: str): #----------------------------------------------------------- -def explode_connection_string(cs): +def explode_connection_string(cs) if cs is None: - return None + return None, None, None, None if cs.startswith('cos') or cs.startswith('s3'): buffer=cs.split('://')[1] access_key_id=buffer.split('@')[0].split(':')[0] secret_access_key=buffer.split('@')[0].split(':')[1] endpoint=f"https://{buffer.split('@')[1].split('/')[0]}" - path='/'.join(buffer.split('@')[1].split('/')[1:]) + path=buffer.split('@')[1].split('/', 1)[1] return (access_key_id, secret_access_key, endpoint, path) else: return (None, None, None, cs) @@ -518,7 +518,9 @@ def explode_connection_string(cs): # File with batches. Provided as a comma-separated list of strings, keys in a json dict or single column CSV with 'filename' has header. 
Either local path as [cos|s3]://user:pw@endpoint/path gw_batch_file = os.environ.get('gw_batch_file', None) -(gw_batch_file_access_key_id, gw_batch_secret_access_key, gw_batch_endpoint, gw_batch_file) = explode_connection_string(gw_batch_file) +(gw_batch_file_access_key_id, gw_batch_file_secret_access_key, gw_batch_file_endpoint, gw_batch_file) = explode_connection_string(gw_batch_file) +# Optional column name for a csv batch file (default: 'filename') +gw_batch_file_col_name = os.environ.get('gw_batch_file_col_name', 'filename') # cos gw_coordinator_connection gw_coordinator_connection = os.environ.get('gw_coordinator_connection') @@ -531,19 +533,40 @@ def explode_connection_string(cs): #${component_interface} def load_batches_from_file(batch_file): - s3source = s3fs.S3FileSystem( + # Download batch file from s3 + s3_batch_file = s3fs.S3FileSystem( anon=False, key=gw_batch_file_access_key_id, - secret=gw_batch_secret_access_key, - client_kwargs={'endpoint_url': gw_batch_endpoint}) - - # load batches from keys of a csv file - logging.info(f'Loading batches from csv file: {batch_file}') - s3source.get(batch_file, batch_file) - df = pd.read_csv(batch_file, header='infer') - batches = df['filename'].to_list() - - + secret=gw_batch_file_secret_access_key, + client_kwargs={'endpoint_url': gw_batch_file_endpoint}) + s3_batch_file.get(batch_file, batch_file) + + if batch_file.endswith('.json'): + # load batches from keys of a json file + logging.info(f'Loading batches from json file: {batch_file}') + with open(batch_file, 'r') as f: + batch_dict = json.load(f) + batches = batch_dict.keys() + + elif batch_file.endswith('.csv'): + # load batches from keys of a csv file + logging.info(f'Loading batches from csv file: {batch_file}') + df = pd.read_csv(batch_file, header='infer') + assert gw_batch_file_col_name in df.columns, \ + f'gw_batch_file_col_name {gw_batch_file_col_name} not in columns of batch file {batch_file}' + batches = df[gw_batch_file_col_name].to_list() + + elif batch_file.endswith('.txt'): + # Load batches from comma-separated txt file + logging.info(f'Loading comma-separated batch strings from file: {batch_file}') + with open(batch_file, 'r') as f: + batch_string = f.read() + batches = [b.strip() for b in batch_string.split(',')] + else: + raise ValueError(f'C3 only supports batch files of type ' + f'json (batches = dict keys), ' + f'csv (batches = column values), or ' + f'txt (batches = comma-seperated list).') logging.info(f'Loaded {len(batches)} batches') logging.debug(f'List of batches: {batches}') @@ -577,7 +600,7 @@ def perform_process(process, batch, coordinator): try: process(batch, ${component_inputs}) except Exception as err: - logging.error(f'{type(err).__name__} in batch {batch_id}: {err}') + logging.exception(err) coordinator.add(batch_id,f"{type(err).__name__} in batch {batch_id}: {err}") logging.error(f'Continue processing.') return @@ -605,7 +628,14 @@ def process_wrapper(sub_process): for batch in batches: perform_process(sub_process, batch, coordinator) - + # Check and log status of batches + processed_status = sum(coordinator.get(batch_id) == 'processed' for batch_id in batches) + lock_status = sum(coordinator.get(batch_id) == 'locked' for batch_id in batches) + exists_status = sum(coordinator.key_exists(batch_id) for batch_id in batches) + error_status = exists_status - processed_status - lock_status + + logging.info(f'Finished current process. 
Status batches: ' + f'{processed_status} processed / {lock_status} locked / {error_status} errors / {len(batches)} total') if __name__ == '__main__': From 749284fd6f620a6b1b444c0e83ec6ff288791dfe Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Fri, 8 Mar 2024 17:05:32 +0100 Subject: [PATCH 156/177] Fix cos grid wrapper Signed-off-by: Benedikt Blumenstiel --- src/c3/templates/cos_grid_wrapper_template.py | 23 +++++++++---------- 1 file changed, 11 insertions(+), 12 deletions(-) diff --git a/src/c3/templates/cos_grid_wrapper_template.py b/src/c3/templates/cos_grid_wrapper_template.py index 7526929d..c237a102 100644 --- a/src/c3/templates/cos_grid_wrapper_template.py +++ b/src/c3/templates/cos_grid_wrapper_template.py @@ -63,7 +63,7 @@ def explode_connection_string(cs): # component interface ${component_interface} -# init s3 +# Init s3 s3coordinator = s3fs.S3FileSystem( anon=False, key=gw_coordinator_access_key_id, @@ -80,21 +80,18 @@ def explode_connection_string(cs): else: logging.debug('Loading batch file from source s3.') s3batch_file = s3coordinator - # TODO: Fix this somewhere else - # Adding coordinator bucket - gw_batch_file = str(gw_coordinator_path.split('/')[0] / gw_batch_file) def load_batches_from_file(batch_file): if batch_file.endswith('.json'): - # load batches from keys of a json file + # Load batches from keys of a json file logging.info(f'Loading batches from json file: {batch_file}') with open(batch_file, 'r') as f: batch_dict = json.load(f) batches = batch_dict.keys() elif batch_file.endswith('.csv'): - # load batches from keys of a csv file + # Load batches from keys of a csv file logging.info(f'Loading batches from csv file: {batch_file}') df = pd.read_csv(batch_file, header='infer') assert gw_batch_file_col_name in df.columns, \ @@ -121,13 +118,13 @@ def load_batches_from_file(batch_file): def perform_process(process, batch): logging.debug(f'Check coordinator files for batch {batch}.') - # init coordinator files + # Init coordinator files lock_file = str(gw_coordinator_path / (batch + suffix_lock)) processed_file = str(gw_coordinator_path / (batch + suffix_processed)) error_file = str(gw_coordinator_path / (batch + suffix_error)) if s3coordinator.exists(lock_file): - # remove strugglers + # Remove strugglers last_modified = s3coordinator.info(lock_file)['LastModified'] if (datetime.now(last_modified.tzinfo) - last_modified).total_seconds() > gw_lock_timeout: logging.info(f'Lock file {lock_file} is expired.') @@ -181,17 +178,19 @@ def process_wrapper(sub_process): # Init coordinator dir s3coordinator.makedirs(gw_coordinator_path, exist_ok=True) - # download batch file + # Download batch file + if s3batch_file.exists(gw_batch_file): + s3batch_file.get(gw_batch_file, gw_batch_file) if not os.path.isfile(gw_batch_file): + # Download batch file from s3 coordinator cos_gw_batch_file = str(gw_coordinator_path.split([0]) / gw_batch_file) - # Download batch file from s3 if s3batch_file.exists(cos_gw_batch_file): s3batch_file.get(gw_batch_file, gw_batch_file) else: raise ValueError("Cannot identify batches. Provide valid gw_batch_file " "(local path, path within coordinator bucket, or s3 connection to batch file).") - # get batches + # Get batches batches = load_batches_from_file(gw_batch_file) # Iterate over all batches @@ -208,7 +207,7 @@ def process_wrapper(sub_process): if error_status: logging.error(f'Found errors! 
Resolve errors and rerun operator with gw_ignore_error_files=True.')
-                    # print all error messages
+                    # Print all error messages
                     for error_file in s3coordinator.glob(str(gw_coordinator_path / ('**/*' + suffix_error))):
                         with s3coordinator.open(error_file, 'r') as f:
                             logging.error(f.read())


From 1f1b4245318ddf64fadda8fdabeaadc297e1bc61 Mon Sep 17 00:00:00 2001
From: Benedikt Blumenstiel
Date: Fri, 8 Mar 2024 17:40:53 +0100
Subject: [PATCH 157/177] Updated GettingStarted

Signed-off-by: Benedikt Blumenstiel

---
 GettingStarted.md | 54 ++++++++++++++++++----------------------------
 1 file changed, 20 insertions(+), 34 deletions(-)

diff --git a/GettingStarted.md b/GettingStarted.md
index 009e3dfe..050f0990 100644
--- a/GettingStarted.md
+++ b/GettingStarted.md
@@ -553,7 +553,7 @@ def grid_process(batch_id, parameter1, parameter2, *args, **kwargs):

 You might want to add `*args, **kwargs` to avoid errors if not all interface variables are used in the grid process.

 Note that the operator script is imported by the grid wrapper script. Therefore, all code in the script is executed.
-It is recommended to avoid executions in the code and to use a main block if the script is also used as a single operator.
+If the script is also used as a single operator, it is recommended to check for `__main__` to avoid executions when the code is imported by the grid wrapper.

 ```python
 if __name__ == '__main__':
@@ -564,17 +564,18 @@ Note that the grid computing is currently not implemented for R scripts.

 ### 5.2 Compile a grid wrapper with C3

 The compilation is similar to an operator. Additionally, the name of the grid process is passed to `create_gridwrapper.py` using `--process` or `-p` (default: `"grid_process"`)
 and a backend for the coordinator is selected with `--backend` or `-b` (default: `"local"`).

 ```sh
 c3_create_gridwrapper -r "<registry>/<namespace>" -p "grid_process" -b "local" "<script>.py" "<version>" "<additional_files>"
 ```

 C3 supports three backends for the coordination: coordinator files on a shared local storage (`"local"`), on COS (`"cos"`), or a key-value storage on S3 (`"s3kv"`).

 Note that the backend `"legacy_cos"` also handles downloading and uploading files from COS. We removed this functionality to simplify the grid wrapper.

 The grid wrapper creates a temporary file `gw_<script>.py` which is copied to the container image and then deleted.
 Similar to an operator, `gw_<script>.yaml`, `gw_<script>.cwl`, and `gw_<script>.job.yaml` are created.

@@ -582,38 +583,23 @@ Similar to an operator, `gw_<script>.yaml`, `gw_<script>.cwl`, and `gw_<script>.job.yaml` are created.

 The grid wrapper uses coordinator files to split up the batch processes between different pods. Therefore, each pod needs access to a shared persistent volume, see [storage](#storage).
-Alternatively, you can use the COS grid wrapper which uses a coordinator path in COS. 
+Alternatively, you can use the COS or S3kv grid wrapper, which uses a coordinator in S3.

The grid wrapper adds specific variables to the `job.yaml` that define the batches and some coordination settings.
First, you can define the list of batch ids in a file and pass `gw_batch_file` to the grid wrapper.
You can use either a `txt` file with a comma-separated list of strings, a `json` file with the keys being the batch ids, or a `csv` file with `gw_batch_file_col_name` being the column with the batch ids.
`gw_batch_file` can be a local path, a path within the coordinator bucket, or a COS connection to a file (`cos://<access_key_id>:<secret_access_key>@<endpoint>/<bucket>/<file>`).

Second, you need to define a `gw_coordinator_path` or `gw_coordinator_connection`.
The `gw_coordinator_path` is used in the `local` version. It is a path to a persistent and shared directory that is used by the pods to lock batches and mark them as processed.
`gw_coordinator_connection` is used in the `cos` and `s3kv` versions. It defines a connection to a directory on COS: `cos://<access_key_id>:<secret_access_key>@<endpoint>/<bucket>/<path>`.
The coordinator uses files with specific suffixes: `.lock`, `.processed`, and `.err`.
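For illustration, the relevant entries of a `job.yaml` could look like the following sketch (all values are placeholders, not defaults):

```yaml
env:
  - name: gw_batch_file
    value: "batches.csv"
  - name: gw_coordinator_connection
    value: "cos://<access_key_id>:<secret_access_key>@<endpoint>/<bucket>/coordinator_dir"
```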
+`gw_lock_timeout` defines the time in seconds until other pods remove the `.lock` file from batches that might be struggling (default `10800`).
If your processes run very long, you can increase `gw_lock_timeout` to avoid duplicated processing of batches.
By default, pods skip batches with `.err` files. You can set `gw_ignore_error_files` to `True` after you have fixed the error.

The grid wrapper currently does not support [secrets](#secrets) for the access key and secret within a connection.

Lastly, set the number of parallel pods by adding `parallelism: <number_of_pods>` to the `job.yaml`.


From 1ef40c0727a0eca5fadcb7558d95142cae00b763 Mon Sep 17 00:00:00 2001
From: Benedikt Blumenstiel
Date: Fri, 8 Mar 2024 17:41:02 +0100
Subject: [PATCH 158/177] Grid wrapper fixes

Signed-off-by: Benedikt Blumenstiel

---
 src/c3/templates/cos_grid_wrapper_template.py |  2 +-
 src/c3/templates/grid_wrapper_template.py     | 67 ++++++++++---------
 .../legacy_cos_grid_wrapper_template.py       |  2 +-
 .../templates/s3kv_grid_wrapper_template.py   |  3 +-
 4 files changed, 38 insertions(+), 36 deletions(-)

diff --git a/src/c3/templates/cos_grid_wrapper_template.py b/src/c3/templates/cos_grid_wrapper_template.py
index c237a102..30fa86d9 100644
--- a/src/c3/templates/cos_grid_wrapper_template.py
+++ b/src/c3/templates/cos_grid_wrapper_template.py
@@ -4,7 +4,7 @@
 CLAIMED component description: ${component_description}
 """

-# pip install s3fs
+# pip install s3fs pandas
 # component dependencies
 # ${component_dependencies}

diff --git a/src/c3/templates/grid_wrapper_template.py b/src/c3/templates/grid_wrapper_template.py
index 7fc4d78f..9a418be7 100644
--- a/src/c3/templates/grid_wrapper_template.py
+++ b/src/c3/templates/grid_wrapper_template.py
@@ -4,6 +4,8 @@
 CLAIMED component description: ${component_description}
 """

+# pip install pandas
+
 # component dependencies
 # ${component_dependencies}

@@ -15,7 +17,6 @@
 import glob
 from pathlib import Path
 import pandas as pd
-import s3fs

 # import component code
 from ${component_name} import *

@@ -23,18 +24,16 @@

 # File with batches. Provided as a comma-separated list of strings, keys in a json dict, or a single-column CSV with 'filename' as header.
 gw_batch_file = os.environ.get('gw_batch_file', None)
+# Optional column name for a csv batch file (default: 'filename')
+gw_batch_file_col_name = os.environ.get('gw_batch_file_col_name', 'filename')
 # file path pattern like your/path/**/*.tif. Multiple patterns can be separated with commas. Is ignored if gw_batch_file is provided.
 gw_file_path_pattern = os.environ.get('gw_file_path_pattern', None)
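 # pattern for grouping file paths into batches, e.g. ".split('-')[-2]" maps 'file-from-batch-42-metadata.json' to batch id '42'. Is ignored if gw_batch_file is provided.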
gw_group_by = os.environ.get('gw_group_by', None) # path to grid wrapper coordinator directory gw_coordinator_path = os.environ.get('gw_coordinator_path') -# lock file suffix -gw_lock_file_suffix = os.environ.get('gw_lock_file_suffix', '.lock') -# processed file suffix -gw_processed_file_suffix = os.environ.get('gw_lock_file_suffix', '.processed') -# error file suffix -gw_error_file_suffix = os.environ.get('gw_error_file_suffix', '.err') +gw_coordinator_path = Path(gw_coordinator_path) + # timeout in seconds to remove lock file from struggling job (default 3 hours) gw_lock_timeout = int(os.environ.get('gw_lock_timeout', 10800)) # ignore error files and rerun batches with errors @@ -42,29 +41,41 @@ # maximal wait time for staggering start gw_max_time_wait_staggering = int(os.environ.get('gw_max_time_wait_staggering', 60)) +# coordinator file suffix +suffix_lock = '.lock' +suffix_processed = '.processed' +suffix_error = '.err' + # component interface ${component_interface} def load_batches_from_file(batch_file): if batch_file.endswith('.json'): - # load batches from keys of a json file + # Load batches from keys of a json file logging.info(f'Loading batches from json file: {batch_file}') with open(batch_file, 'r') as f: batch_dict = json.load(f) batches = batch_dict.keys() elif batch_file.endswith('.csv'): - # load batches from keys of a csv file + # Load batches from keys of a csv file logging.info(f'Loading batches from csv file: {batch_file}') df = pd.read_csv(batch_file, header='infer') - batches = df['filename'].to_list() + assert gw_batch_file_col_name in df.columns, \ + f'gw_batch_file_col_name {gw_batch_file_col_name} not in columns of batch file {batch_file}' + batches = df[gw_batch_file_col_name].to_list() - else: + elif batch_file.endswith('.txt'): # Load batches from comma-separated txt file logging.info(f'Loading comma-separated batch strings from file: {batch_file}') with open(batch_file, 'r') as f: batch_string = f.read() batches = [b.strip() for b in batch_string.split(',')] + else: + raise ValueError(f'C3 only supports batch files of type ' + f'json (batches = dict keys), ' + f'csv (batches = column values), or ' + f'txt (batches = comma-seperated list).') logging.info(f'Loaded {len(batches)} batches') logging.debug(f'List of batches: {batches}') @@ -99,9 +110,9 @@ def identify_batches_from_pattern(file_path_patterns, group_by): def perform_process(process, batch): logging.debug(f'Check coordinator files for batch {batch}.') # init coordinator files - lock_file = Path(gw_coordinator_path) / (batch + gw_lock_file_suffix) - error_file = Path(gw_coordinator_path) / (batch + gw_error_file_suffix) - processed_file = Path(gw_coordinator_path) / (batch + gw_processed_file_suffix) + lock_file = gw_coordinator_path / (batch + suffix_lock) + error_file = gw_coordinator_path / (batch + suffix_error) + processed_file = gw_coordinator_path / (batch + suffix_processed) if lock_file.exists(): # remove strugglers @@ -140,16 +151,6 @@ def perform_process(process, batch): logging.error(f'Continue processing.') return - # optional verify target files - if target_files is not None: - if isinstance(target_files, str): - target_files = [target_files] - for target_file in target_files: - if not os.path.exists(target_file): - logging.error(f'Target file {target_file} does not exist for batch {batch}.') - else: - logging.info(f'Cannot verify batch {batch} (target files not provided).') - logging.info(f'Finished Batch {batch}.') processed_file.touch() @@ -158,7 +159,7 @@ def perform_process(process, 
batch): lock_file.unlink() else: logging.warning(f'Lock file {lock_file} was removed by another process. ' - f'Consider increasing gw_lock_timeout (currently {gw_lock_timeout}s) to repeated processing.') + f'Consider increasing gw_lock_timeout to avoid repeated processing (currently {gw_lock_timeout}s).') @@ -168,13 +169,13 @@ def process_wrapper(sub_process): time.sleep(delay) # Init coordinator dir - coordinator_dir = Path(gw_coordinator_path) - coordinator_dir.mkdir(exist_ok=True, parents=True) + gw_coordinator_path.mkdir(exist_ok=True, parents=True) # get batches if gw_batch_file is not None and os.path.isfile(gw_batch_file): batches = load_batches_from_file(gw_batch_file) elif gw_file_path_pattern is not None and gw_group_by is not None: + logging.warning("gw_file_path_pattern and gw_group_by are legacy and might be removed in a future release.") batches = identify_batches_from_pattern(gw_file_path_pattern, gw_group_by) else: raise ValueError("Cannot identify batches. " @@ -185,17 +186,17 @@ def process_wrapper(sub_process): perform_process(sub_process, batch) # Check and log status of batches - processed_status = [(coordinator_dir / (batch + gw_processed_file_suffix)).exists() for batch in batches] - lock_status = [(coordinator_dir / (batch + gw_lock_file_suffix)).exists() for batch in batches] - error_status = [(coordinator_dir / (batch + gw_error_file_suffix)).exists() for batch in batches] + processed_status = sum((gw_coordinator_path / (batch + suffix_processed)).exists() for batch in batches) + lock_status = sum((gw_coordinator_path / (batch + suffix_lock)).exists() for batch in batches) + error_status = sum((gw_coordinator_path / (batch + suffix_error)).exists() for batch in batches) logging.info(f'Finished current process. Status batches: ' - f'{sum(processed_status)} processed / {sum(lock_status)} locked / {sum(error_status)} errors / {len(processed_status)} total') + f'{processed_status} processed / {lock_status} locked / {error_status} errors / {len(batches)} total') - if sum(error_status): + if error_status: logging.error(f'Found errors! 
Resolve errors and rerun operator with gw_ignore_error_files=True.') # print all error messages - for error_file in coordinator_dir.glob('**/*' + gw_error_file_suffix): + for error_file in gw_coordinator_path.glob('**/*' + suffix_error): with open(error_file, 'r') as f: logging.error(f.read()) diff --git a/src/c3/templates/legacy_cos_grid_wrapper_template.py b/src/c3/templates/legacy_cos_grid_wrapper_template.py index 16f56f59..f68a2094 100644 --- a/src/c3/templates/legacy_cos_grid_wrapper_template.py +++ b/src/c3/templates/legacy_cos_grid_wrapper_template.py @@ -4,7 +4,7 @@ CLAIMED component description: ${component_description} """ -# pip install s3fs +# pip install s3fs pandas # component dependencies # ${component_dependencies} diff --git a/src/c3/templates/s3kv_grid_wrapper_template.py b/src/c3/templates/s3kv_grid_wrapper_template.py index 21cae458..799be82b 100644 --- a/src/c3/templates/s3kv_grid_wrapper_template.py +++ b/src/c3/templates/s3kv_grid_wrapper_template.py @@ -4,6 +4,7 @@ CLAIMED component description: ${component_description} """ +# pip install s3fs boto3 pandas # component dependencies # ${component_dependencies} @@ -500,7 +501,7 @@ def release_legal_hold(self, key: str): #----------------------------------------------------------- -def explode_connection_string(cs) +def explode_connection_string(cs): if cs is None: return None, None, None, None if cs.startswith('cos') or cs.startswith('s3'): From c6160db5bb861d454cc2b3e96e60da78b019e3d5 Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Mon, 25 Mar 2024 09:44:49 +0100 Subject: [PATCH 159/177] add operator utils and tests --- src/c3/operator_utils.py | 43 ++++++++++++++++++++++++++++++++++++ tests/test_operator_utils.py | 9 ++++++++ 2 files changed, 52 insertions(+) create mode 100644 src/c3/operator_utils.py create mode 100644 tests/test_operator_utils.py diff --git a/src/c3/operator_utils.py b/src/c3/operator_utils.py new file mode 100644 index 00000000..5f524872 --- /dev/null +++ b/src/c3/operator_utils.py @@ -0,0 +1,43 @@ +import contextlib +import logging +import os + +# converts string in form [cos|s3]://access_key_id:secret_access_key@endpoint/bucket/path to +# access_key_id, secret_access_key, endpoint, path - path includes bucket name +def explode_connection_string(cs): + if cs is None: + return None + if cs.startswith('cos') or cs.startswith('s3'): + buffer=cs.split('://')[1] + access_key_id=buffer.split('@')[0].split(':')[0] + secret_access_key=buffer.split('@')[0].split(':')[1] + endpoint=f"https://{buffer.split('@')[1].split('/')[0]}" + path='/'.join(buffer.split('@')[1].split('/')[1:]) + return (access_key_id, secret_access_key, endpoint, path) + else: + return (None, None, None, cs) + # TODO consider cs as secret and grab connection string from kubernetes + + +def run_and_log(cos_conn, log_folder, task_id, command_array): + log_root_name = time.time() + job_id = ('-').join(command_array).replace('/','-') # TODO get a unique job id + job_id = re.sub(r'[^a-zA-Z0-9]', '-', job_id) + task_id = re.sub(r'[^a-zA-Z0-9]', '-', task_id) + std_out_log_name = f'{job_id}-{task_id}-{log_root_name}-stdout.log' + std_err_log_name = f'{job_id}-{task_id}-{log_root_name}-stderr.log' + with open(std_out_log_name,'w') as so: + with open(std_err_log_name,'w') as se: + with contextlib.redirect_stdout(so): + with contextlib.redirect_stderr(se): + logging.info('-----INVOKING TASK-----------------------------------') + logging.info(f'Task ID: {task_id}') + logging.info(f'Command: {command_array}') + result = 
subprocess.run(command_array, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, env=os.environ.copy()) + output = result.stdout.decode('utf-8') + logging.info("Output:", output) + logging.info("Return code:", result.returncode) + cos_conn.put(std_out_log_name,os.path.join(log_folder,std_out_log_name)) + cos_conn.put(std_err_log_name,os.path.join(log_folder,std_err_log_name)) + os.remove(std_out_log_name) + os.remove(std_err_log_name) \ No newline at end of file diff --git a/tests/test_operator_utils.py b/tests/test_operator_utils.py new file mode 100644 index 00000000..67d1f4b6 --- /dev/null +++ b/tests/test_operator_utils.py @@ -0,0 +1,9 @@ +from c3.operator_utils import explode_connection_string + + +def test_explode_connection_string(): + (ac, sc, ep, p) = explode_connection_string('cos://DF)S)DFU8:!#$%^*(){}[]"><@s3.us-east.cloud-object-storage.appdomain.cloud/claimed-test/ds=335/dl=50254/dt=20220101/tm=000000/lvl=0/gh=0/S1A_IW_GRDH_1SDV_20220101T090715_20220101T090740_041265_04E78F_73F0_VH.cog') + assert ac=='DF)S)DFU8' + assert sc=='!#$%^*(){}[]"><' + assert ep=='https://s3.us-east.cloud-object-storage.appdomain.cloud' + assert p=='claimed-test/ds=335/dl=50254/dt=20220101/tm=000000/lvl=0/gh=0/S1A_IW_GRDH_1SDV_20220101T090715_20220101T090740_041265_04E78F_73F0_VH.cog' \ No newline at end of file From e08161aa8a16db74dc97557e6c9714a4b6135d7c Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Wed, 27 Mar 2024 17:16:12 +0100 Subject: [PATCH 160/177] Added plattfrom arg and added docker buildx Signed-off-by: Benedikt Blumenstiel --- src/c3/create_gridwrapper.py | 3 +++ src/c3/create_operator.py | 16 +++++++++++++++- 2 files changed, 18 insertions(+), 1 deletion(-) diff --git a/src/c3/create_gridwrapper.py b/src/c3/create_gridwrapper.py index c457a9a9..49e29f91 100644 --- a/src/c3/create_gridwrapper.py +++ b/src/c3/create_gridwrapper.py @@ -181,6 +181,8 @@ def main(): help='Exclude logging code from component setup code') parser.add_argument('--keep-generated-files', action='store_true', help='Do not delete temporary generated files.') + parser.add_argument('--platform', type=str, default='linux/amd64', + help='Select image platform, default is linux/amd64. Alternativly, select linux/arm64".') args = parser.parse_args() @@ -227,6 +229,7 @@ def main(): rename_files=args.rename, skip_logging=args.skip_logging, keep_generated_files=args.keep_generated_files, + platform=args.platform, ) except Exception as err: logging.error('Error while generating CLAIMED grid wrapper. ' diff --git a/src/c3/create_operator.py b/src/c3/create_operator.py index 18df7fdd..eb3e3d6f 100644 --- a/src/c3/create_operator.py +++ b/src/c3/create_operator.py @@ -236,6 +236,7 @@ def create_operator(file_path: str, overwrite_files=False, skip_logging=False, keep_generated_files=False, + platform='linux/amd64', ): logging.info('Parameters: ') logging.info('file_path: ' + file_path) @@ -356,12 +357,21 @@ def create_operator(file_path: str, logging.warning('No repository provided. The container image is only saved locally. Add `-r ` ' 'to push the image to a container registry or run `--local_mode` to suppress this warning.') local_mode = True + repository = 'local' + + if subprocess.run('docker buildx', shell=True, stdout=subprocess.PIPE).returncode == 0: + # Using docker buildx + logging.debug('Using docker buildx') + build_command = 'docker buildx build' + else: + logging.debug('Using docker build. 
Consider installing docker-buildx.') + build_command = 'docker build' logging.info(f'Building container image claimed-{name}:{version}') try: # Run docker build subprocess.run( - f"docker build --platform linux/amd64 -t claimed-{name}:{version} . {'--no-cache' if no_cache else ''}", + f"{build_command} --platform {platform} -t claimed-{name}:{version} . {'--no-cache' if no_cache else ''}", stdout=None if log_level == 'DEBUG' else subprocess.PIPE, check=True, shell=True ) if repository is not None: @@ -450,6 +460,9 @@ def main(): help='Exclude logging code from component setup code') parser.add_argument('--keep-generated-files', action='store_true', help='Do not delete temporary generated files.') + parser.add_argument('--platform', type=str, default='linux/amd64', + help='Select image platform, default is linux/amd64. Alternativly, select linux/arm64".') + args = parser.parse_args() # Init logging @@ -482,6 +495,7 @@ def main(): rename_files=args.rename, skip_logging=args.skip_logging, keep_generated_files=args.keep_generated_files, + platform=args.platform, ) From e43f8f84c812728ac3343e5937651af8824b54a4 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Thu, 28 Mar 2024 14:43:52 +0100 Subject: [PATCH 161/177] Fix kfp yaml name Signed-off-by: Benedikt Blumenstiel --- src/c3/create_operator.py | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/src/c3/create_operator.py b/src/c3/create_operator.py index eb3e3d6f..8dcab385 100644 --- a/src/c3/create_operator.py +++ b/src/c3/create_operator.py @@ -60,8 +60,8 @@ def create_dockerfile(dockerfile_template, requirements, target_code, target_dir def create_kfp_component(name, description, repository, version, command, target_code, target_dir, file_path, inputs, outputs): inputs_list = str() - for name, options in inputs.items(): - inputs_list += f'- {{name: {name}, type: {options["type"]}, description: "{options["description"]}"' + for input, options in inputs.items(): + inputs_list += f'- {{name: {input}, type: {options["type"]}, description: "{options["description"]}"' if options['default'] is not None: if not options["default"].startswith('"'): options["default"] = f'"{options["default"]}"' @@ -69,8 +69,8 @@ def create_kfp_component(name, description, repository, version, command, target inputs_list += '}\n' outputs_list = str() - for name, options in outputs.items(): - outputs_list += f'- {{name: {name}, type: String, description: "{options["description"]}"}}\n' + for output, options in outputs.items(): + outputs_list += f'- {{name: {output}, type: String, description: "{options["description"]}"}}\n' parameter_list = str() for index, key in enumerate(list(inputs.keys()) + list(outputs.keys())): From 1ad3e1869cbb15cfe4b1f44b4d4b3b5cd173b176 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Tue, 2 Apr 2024 13:09:19 +0200 Subject: [PATCH 162/177] Fix non-alpha chars in requirements file name Signed-off-by: Benedikt Blumenstiel --- src/c3/create_operator.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/c3/create_operator.py b/src/c3/create_operator.py index 8dcab385..043908fc 100644 --- a/src/c3/create_operator.py +++ b/src/c3/create_operator.py @@ -26,7 +26,7 @@ def create_dockerfile(dockerfile_template, requirements, target_code, target_dir # Check for requirements file for i in range(len(requirements)): if '-r ' in requirements[i]: - r_file_search = re.search('-r ~?\/?([A-Za-z0-9\/]*\.txt)', requirements[i]) + r_file_search = re.search('-r ~?\/?([^\s]*\.txt)', requirements[i]) if 
len(r_file_search.groups()): # Get file from regex requirements_file = r_file_search.groups()[0] From 21659c7e9daa10f4245724e4190c47989b9d423c Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Mon, 8 Apr 2024 16:02:54 +0200 Subject: [PATCH 163/177] Added argument dockerfile Signed-off-by: Benedikt Blumenstiel --- src/c3/create_gridwrapper.py | 3 +++ src/c3/create_operator.py | 19 ++++++++++++------- 2 files changed, 15 insertions(+), 7 deletions(-) diff --git a/src/c3/create_gridwrapper.py b/src/c3/create_gridwrapper.py index 49e29f91..123e310f 100644 --- a/src/c3/create_gridwrapper.py +++ b/src/c3/create_gridwrapper.py @@ -174,6 +174,8 @@ def main(): parser.add_argument('-l', '--log_level', type=str, default='INFO') parser.add_argument('--dockerfile_template_path', type=str, default='', help='Path to custom dockerfile template') + parser.add_argument('--dockerfile', type=str, default='dockerfile.generated', + help='Name or path of the generated dockerfile.') parser.add_argument('--local_mode', action='store_true', help='Continue processing after docker errors.') parser.add_argument('--no-cache', action='store_true', help='Not using cache for docker build.') @@ -230,6 +232,7 @@ def main(): skip_logging=args.skip_logging, keep_generated_files=args.keep_generated_files, platform=args.platform, + dockerfile=args.dockerfile, ) except Exception as err: logging.error('Error while generating CLAIMED grid wrapper. ' diff --git a/src/c3/create_operator.py b/src/c3/create_operator.py index 043908fc..0ebda329 100644 --- a/src/c3/create_operator.py +++ b/src/c3/create_operator.py @@ -22,7 +22,8 @@ CLAIMED_VERSION = 'V0.1' -def create_dockerfile(dockerfile_template, requirements, target_code, target_dir, additional_files, working_dir, command): +def create_dockerfile(dockerfile_template, dockerfile, requirements, target_code, target_dir, additional_files, + working_dir, command): # Check for requirements file for i in range(len(requirements)): if '-r ' in requirements[i]: @@ -52,9 +53,9 @@ def create_dockerfile(dockerfile_template, requirements, target_code, target_dir ) logging.info('Create Dockerfile') - with open("Dockerfile", "w") as text_file: + with open(dockerfile, "w") as text_file: text_file.write(docker_file) - logging.debug('Dockerfile:\n' + docker_file) + logging.debug(f'{dockerfile}:\n' + docker_file) def create_kfp_component(name, description, repository, version, command, target_code, target_dir, file_path, inputs, outputs): @@ -237,6 +238,7 @@ def create_operator(file_path: str, skip_logging=False, keep_generated_files=False, platform='linux/amd64', + dockerfile='Dockerfile.generated', ): logging.info('Parameters: ') logging.info('file_path: ' + file_path) @@ -345,8 +347,8 @@ def create_operator(file_path: str, logging.info(f'Found {len(additional_files_found)} additional files and directories\n' f'{", ".join(additional_files_found)}') - create_dockerfile(dockerfile_template, requirements, target_code, target_dir, additional_files_found, working_dir, - command) + create_dockerfile(dockerfile_template, dockerfile, requirements, target_code, target_dir, additional_files_found, + working_dir, command) if version is None: # auto increase version based on registered images @@ -362,10 +364,10 @@ def create_operator(file_path: str, if subprocess.run('docker buildx', shell=True, stdout=subprocess.PIPE).returncode == 0: # Using docker buildx logging.debug('Using docker buildx') - build_command = 'docker buildx build' + build_command = f'docker buildx build -f {dockerfile}' else: 
logging.debug('Using docker build. Consider installing docker-buildx.') - build_command = 'docker build' + build_command = f'docker build -f {dockerfile}' logging.info(f'Building container image claimed-{name}:{version}') try: @@ -453,6 +455,8 @@ def main(): parser.add_argument('-l', '--log_level', type=str, default='INFO') parser.add_argument('--dockerfile_template_path', type=str, default='', help='Path to custom dockerfile template') + parser.add_argument('--dockerfile', type=str, default='dockerfile.generated', + help='Name or path of the generated dockerfile.') parser.add_argument('--local_mode', action='store_true', help='Continue processing after docker errors.') parser.add_argument('--no-cache', action='store_true', help='Not using cache for docker build.') @@ -496,6 +500,7 @@ def main(): skip_logging=args.skip_logging, keep_generated_files=args.keep_generated_files, platform=args.platform, + dockerfile=args.dockerfile, ) From ab746e1e4cfafe5dc7704918aa0653be8f576a0d Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Mon, 8 Apr 2024 17:03:45 +0200 Subject: [PATCH 164/177] Fixed typo Signed-off-by: Benedikt Blumenstiel --- src/c3/create_gridwrapper.py | 2 +- src/c3/create_operator.py | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/src/c3/create_gridwrapper.py b/src/c3/create_gridwrapper.py index 123e310f..f8bf5478 100644 --- a/src/c3/create_gridwrapper.py +++ b/src/c3/create_gridwrapper.py @@ -174,7 +174,7 @@ def main(): parser.add_argument('-l', '--log_level', type=str, default='INFO') parser.add_argument('--dockerfile_template_path', type=str, default='', help='Path to custom dockerfile template') - parser.add_argument('--dockerfile', type=str, default='dockerfile.generated', + parser.add_argument('--dockerfile', type=str, default='Dockerfile.generated', help='Name or path of the generated dockerfile.') parser.add_argument('--local_mode', action='store_true', help='Continue processing after docker errors.') diff --git a/src/c3/create_operator.py b/src/c3/create_operator.py index 0ebda329..430b9a1f 100644 --- a/src/c3/create_operator.py +++ b/src/c3/create_operator.py @@ -455,7 +455,7 @@ def main(): parser.add_argument('-l', '--log_level', type=str, default='INFO') parser.add_argument('--dockerfile_template_path', type=str, default='', help='Path to custom dockerfile template') - parser.add_argument('--dockerfile', type=str, default='dockerfile.generated', + parser.add_argument('--dockerfile', type=str, default='Dockerfile.generated', help='Name or path of the generated dockerfile.') parser.add_argument('--local_mode', action='store_true', help='Continue processing after docker errors.') From 555f088a986aa1cdbf61ac7cae04eb62d3519f79 Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Thu, 20 Jun 2024 14:09:09 +0200 Subject: [PATCH 165/177] add containerless operator builder --- pyproject.toml | 1 + src/c3/create_containerless_operator.py | 65 +++++++++++++++++++++++++ 2 files changed, 66 insertions(+) create mode 100644 src/c3/create_containerless_operator.py diff --git a/pyproject.toml b/pyproject.toml index 8ae226eb..d8440d0c 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -40,6 +40,7 @@ dependencies = [ [project.scripts] c3_create_operator = "c3.create_operator:main" +c3_create_containerless_operator = "c3.create_containerless_operator:main" c3_create_gridwrapper = "c3.create_gridwrapper:main" [tool.setuptools.packages.find] diff --git a/src/c3/create_containerless_operator.py b/src/c3/create_containerless_operator.py new file mode 100644 index 
00000000..25de085b --- /dev/null +++ b/src/c3/create_containerless_operator.py @@ -0,0 +1,65 @@ +import argparse +import os +import sys +import logging +import subprocess +import re + +def create_containerless_operator( + file_path, + version, + ): + + logging.debug(f'Called create_containerless_operator with {file_path}') + + filename, file_extension = os.path.splitext(file_path) + + if file_extension != '.py': + raise NotImplementedError('Containerless operators currenly only support python scripts') + + all_pip_packages_found = '' + with open(file_path, 'r') as file: + for line in file: + if re.search('pip ', line): + pip_packages = re.sub('[#, ,!]*pip[ ]*install[ ]*', '', line) + logging.debug(f'PIP packages found: {pip_packages}') + all_pip_packages_found += (f' {pip_packages}') + logging.info(f'all PIP packages found: {all_pip_packages_found}') + + + subprocess.run(';'.join(['rm -Rf claimedenv','python -m venv claimedenv', + 'source ./claimedenv/bin/activate', + f'pip install {all_pip_packages_found.strip()}', + 'pip list', + f'zip -r {filename}.zip {file_path} claimedenv', + 'rm -Rf claimedenv']), shell=True) + + + +def main(): + parser = argparse.ArgumentParser() + parser.add_argument('FILE_PATH', type=str, + help='Path to python script or notebook') + parser.add_argument('ADDITIONAL_FILES', type=str, nargs='*', default=None, + help='Paths to additional files to include in the container image') + parser.add_argument('-v', '--version', type=str, default=None, + help='Container image version. Auto-increases the version number if not provided (default 0.1)') + parser.add_argument('-l', '--log_level', type=str, default='INFO') + args = parser.parse_args() + + # Init logging + root = logging.getLogger() + root.setLevel(args.log_level) + handler = logging.StreamHandler(sys.stdout) + formatter = logging.Formatter('%(levelname)s - %(message)s') + handler.setFormatter(formatter) + handler.setLevel(args.log_level) + root.addHandler(handler) + + create_containerless_operator( + file_path=args.FILE_PATH, + version=args.version, + ) + +if __name__ == '__main__': + main() From 4bf6687b5aae88798dd36b42eff1ed81698aa87f Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Thu, 20 Jun 2024 23:04:47 +0200 Subject: [PATCH 166/177] create cwl task for containerless operator --- src/c3/create_containerless_operator.py | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/src/c3/create_containerless_operator.py b/src/c3/create_containerless_operator.py index 25de085b..3b6ee4ed 100644 --- a/src/c3/create_containerless_operator.py +++ b/src/c3/create_containerless_operator.py @@ -4,13 +4,19 @@ import logging import subprocess import re +from c3.create_operator import create_cwl_component +from c3.pythonscript import Pythonscript + def create_containerless_operator( file_path, version, ): - logging.debug(f'Called create_containerless_operator with {file_path}') + if version is None: + version = 'latest' + + logging.debug(f'Called create_containerless_operator {version} with {file_path}') filename, file_extension = os.path.splitext(file_path) @@ -26,7 +32,6 @@ def create_containerless_operator( all_pip_packages_found += (f' {pip_packages}') logging.info(f'all PIP packages found: {all_pip_packages_found}') - subprocess.run(';'.join(['rm -Rf claimedenv','python -m venv claimedenv', 'source ./claimedenv/bin/activate', f'pip install {all_pip_packages_found.strip()}', @@ -34,6 +39,13 @@ def create_containerless_operator( f'zip -r {filename}.zip {file_path} claimedenv', 'rm -Rf 
claimedenv']), shell=True) + script_data = Pythonscript(file_path) + inputs = script_data.get_inputs() + outputs = script_data.get_outputs() + + create_cwl_component(filename, "containerless", version, file_path, inputs, outputs) + + def main(): From 944b1d987845721654145a3304113aac01f618c4 Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Fri, 21 Jun 2024 14:30:55 +0200 Subject: [PATCH 167/177] add version information of generated containerless zip --- src/c3/create_containerless_operator.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/c3/create_containerless_operator.py b/src/c3/create_containerless_operator.py index 3b6ee4ed..4135f846 100644 --- a/src/c3/create_containerless_operator.py +++ b/src/c3/create_containerless_operator.py @@ -36,7 +36,7 @@ def create_containerless_operator( 'source ./claimedenv/bin/activate', f'pip install {all_pip_packages_found.strip()}', 'pip list', - f'zip -r {filename}.zip {file_path} claimedenv', + f'zip -r claimed-{filename}:{version}.zip {file_path} claimedenv', 'rm -Rf claimedenv']), shell=True) script_data = Pythonscript(file_path) From 0245adf717f50c00f0656e4788fc033678479cbe Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Wed, 26 Jun 2024 16:42:20 +0200 Subject: [PATCH 168/177] finish c3 containerless operator generation --- src/c3/create_containerless_operator.py | 25 +++++++++++++++++++++---- 1 file changed, 21 insertions(+), 4 deletions(-) diff --git a/src/c3/create_containerless_operator.py b/src/c3/create_containerless_operator.py index 4135f846..b72f48d2 100644 --- a/src/c3/create_containerless_operator.py +++ b/src/c3/create_containerless_operator.py @@ -6,11 +6,12 @@ import re from c3.create_operator import create_cwl_component from c3.pythonscript import Pythonscript - +from c3.templates import component_setup_code_wo_logging, python_component_setup_code def create_containerless_operator( file_path, version, + skip_logging = False ): if version is None: @@ -32,13 +33,29 @@ def create_containerless_operator( all_pip_packages_found += (f' {pip_packages}') logging.info(f'all PIP packages found: {all_pip_packages_found}') + + # prepend init code to script + target_code = 'runnable.py' + + if os.path.exists(target_code): + os.remove(target_code) + + with open(file_path, 'r') as f: + script = f.read() + if skip_logging: + script = component_setup_code_wo_logging + script + else: + script = python_component_setup_code + script + with open(target_code, 'w') as f: + f.write(script) + subprocess.run(';'.join(['rm -Rf claimedenv','python -m venv claimedenv', 'source ./claimedenv/bin/activate', f'pip install {all_pip_packages_found.strip()}', 'pip list', - f'zip -r claimed-{filename}:{version}.zip {file_path} claimedenv', - 'rm -Rf claimedenv']), shell=True) - + f'zip -r claimed-{filename}:{version}.zip {target_code} claimedenv', + 'rm -Rf claimedenv', + f'rm {target_code}']), shell=True) script_data = Pythonscript(file_path) inputs = script_data.get_inputs() outputs = script_data.get_outputs() From bd1ea198b959be2763786cc42aca69cd567f6809 Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Thu, 8 Aug 2024 11:08:33 +0200 Subject: [PATCH 169/177] Moved operator code in dockerfile template Signed-off-by: Benedikt Blumenstiel --- src/c3/templates/python_dockerfile_template | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/c3/templates/python_dockerfile_template b/src/c3/templates/python_dockerfile_template index 24180668..81c265e8 100644 --- a/src/c3/templates/python_dockerfile_template +++ 
b/src/c3/templates/python_dockerfile_template @@ -1,10 +1,10 @@ FROM registry.access.redhat.com/ubi8/python-39 USER root -ADD ${target_code} ${working_dir}${target_dir} ${additional_files_docker} RUN pip install --upgrade pip RUN pip install ipython nbformat ${requirements_docker} +ADD ${target_code} ${working_dir}${target_dir} RUN chmod -R 777 ${working_dir} USER default WORKDIR "${working_dir}" From 017282ec9673a0e7f41e0b409f15577676705c3a Mon Sep 17 00:00:00 2001 From: Benedikt Blumenstiel Date: Fri, 13 Dec 2024 13:44:04 +0100 Subject: [PATCH 170/177] Added image_version arg --- src/c3/create_gridwrapper.py | 3 +++ src/c3/create_operator.py | 21 +++++++++++++++++++-- src/c3/templates/R_dockerfile_template | 2 +- src/c3/templates/python_dockerfile_template | 2 +- 4 files changed, 24 insertions(+), 4 deletions(-) diff --git a/src/c3/create_gridwrapper.py b/src/c3/create_gridwrapper.py index f8bf5478..b9f641f3 100644 --- a/src/c3/create_gridwrapper.py +++ b/src/c3/create_gridwrapper.py @@ -185,6 +185,8 @@ def main(): help='Do not delete temporary generated files.') parser.add_argument('--platform', type=str, default='linux/amd64', help='Select image platform, default is linux/amd64. Alternativly, select linux/arm64".') + parser.add_argument('--image_version', type=str, default='python3.12', + help='Select python or R version (defaults to python3.12).') args = parser.parse_args() @@ -233,6 +235,7 @@ def main(): keep_generated_files=args.keep_generated_files, platform=args.platform, dockerfile=args.dockerfile, + image_version=args.image_version, ) except Exception as err: logging.error('Error while generating CLAIMED grid wrapper. ' diff --git a/src/c3/create_operator.py b/src/c3/create_operator.py index 430b9a1f..0e2bb738 100644 --- a/src/c3/create_operator.py +++ b/src/c3/create_operator.py @@ -23,7 +23,7 @@ def create_dockerfile(dockerfile_template, dockerfile, requirements, target_code, target_dir, additional_files, - working_dir, command): + working_dir, command, image_version): # Check for requirements file for i in range(len(requirements)): if '-r ' in requirements[i]: @@ -43,7 +43,20 @@ def create_dockerfile(dockerfile_template, dockerfile, requirements, target_code additional_files_docker = list(map(lambda s: f"ADD {s} {working_dir}{s}", additional_files)) additional_files_docker = '\n'.join(additional_files_docker) + # Select base image + if 'python' in command: + base_image = f"registry.access.redhat.com/ubi8/python-{image_version.strip('python').replace('.', '')}" + elif command == 'Rscript': + if 'python' in image_version: + # Using default R version + image_version = 'R4.3.2' + base_image = f"r-base:{image_version.strip('Rr:')}" + else: + raise ValueError(f'Unrecognized command {command}') + logging.info(f'Using base image {base_image}') + docker_file = dockerfile_template.substitute( + base_image=base_image, requirements_docker=requirements_docker, target_code=target_code, target_dir=target_dir, @@ -239,6 +252,7 @@ def create_operator(file_path: str, keep_generated_files=False, platform='linux/amd64', dockerfile='Dockerfile.generated', + image_version='python3.12', ): logging.info('Parameters: ') logging.info('file_path: ' + file_path) @@ -348,7 +362,7 @@ def create_operator(file_path: str, f'{", ".join(additional_files_found)}') create_dockerfile(dockerfile_template, dockerfile, requirements, target_code, target_dir, additional_files_found, - working_dir, command) + working_dir, command, image_version) if version is None: # auto increase version based on registered images @@ 
-466,6 +480,8 @@ def main(): help='Do not delete temporary generated files.') parser.add_argument('--platform', type=str, default='linux/amd64', help='Select image platform, default is linux/amd64. Alternativly, select linux/arm64".') + parser.add_argument('--image_version', type=str, default='python3.12', + help='Select python or R version (defaults to python3.12).') args = parser.parse_args() @@ -501,6 +517,7 @@ def main(): keep_generated_files=args.keep_generated_files, platform=args.platform, dockerfile=args.dockerfile, + image_version=args.image_version, ) diff --git a/src/c3/templates/R_dockerfile_template b/src/c3/templates/R_dockerfile_template index 5d7d09e5..e60449e5 100644 --- a/src/c3/templates/R_dockerfile_template +++ b/src/c3/templates/R_dockerfile_template @@ -1,4 +1,4 @@ -FROM r-base:4.3.2 +FROM ${base_image} USER root RUN apt update ${requirements_docker} diff --git a/src/c3/templates/python_dockerfile_template b/src/c3/templates/python_dockerfile_template index 81c265e8..d4498650 100644 --- a/src/c3/templates/python_dockerfile_template +++ b/src/c3/templates/python_dockerfile_template @@ -1,4 +1,4 @@ -FROM registry.access.redhat.com/ubi8/python-39 +FROM ${base_image} USER root ${additional_files_docker} RUN pip install --upgrade pip From bebb6f525e11d052c5b7b5fab86c33731f7423d3 Mon Sep 17 00:00:00 2001 From: Romeo Kienzler Date: Fri, 21 Feb 2025 21:56:59 +0100 Subject: [PATCH 171/177] add simple grid wrapper --- examples/gw_simple_grid_wrapper_example.cwl | 33 ++++++ .../gw_simple_grid_wrapper_example.job.yaml | 22 ++++ examples/gw_simple_grid_wrapper_example.py | 105 ++++++++++++++++++ examples/gw_simple_grid_wrapper_example.yaml | 23 ++++ examples/simple_grid_wrapper_example.py | 6 + examples/simple_grid_wrapper_source/1.txt | 1 + examples/simple_grid_wrapper_source/2.txt | 1 + examples/simple_grid_wrapper_source/3.txt | 1 + .../1.txt | 1 + .../2.txt | 1 + .../3.txt | 1 + .../1.PROCESSED.txt | 1 + .../2.PROCESSED.txt | 1 + .../3.PROCESSED.txt | 1 + src/c3/create_gridwrapper.py | 1 + src/c3/templates/__init__.py | 4 + .../templates/simple_grid_wrapper_template.py | 105 ++++++++++++++++++ 17 files changed, 308 insertions(+) create mode 100644 examples/gw_simple_grid_wrapper_example.cwl create mode 100644 examples/gw_simple_grid_wrapper_example.job.yaml create mode 100644 examples/gw_simple_grid_wrapper_example.py create mode 100644 examples/gw_simple_grid_wrapper_example.yaml create mode 100644 examples/simple_grid_wrapper_example.py create mode 100644 examples/simple_grid_wrapper_source/1.txt create mode 100644 examples/simple_grid_wrapper_source/2.txt create mode 100644 examples/simple_grid_wrapper_source/3.txt create mode 100644 examples/simple_grid_wrapper_source_and_target/1.txt create mode 100644 examples/simple_grid_wrapper_source_and_target/2.txt create mode 100644 examples/simple_grid_wrapper_source_and_target/3.txt create mode 100644 examples/simple_grid_wrapper_target/1.PROCESSED.txt create mode 100644 examples/simple_grid_wrapper_target/2.PROCESSED.txt create mode 100644 examples/simple_grid_wrapper_target/3.PROCESSED.txt create mode 100644 src/c3/templates/simple_grid_wrapper_template.py diff --git a/examples/gw_simple_grid_wrapper_example.cwl b/examples/gw_simple_grid_wrapper_example.cwl new file mode 100644 index 00000000..9ae91cab --- /dev/null +++ b/examples/gw_simple_grid_wrapper_example.cwl @@ -0,0 +1,33 @@ +cwlVersion: v1.2 +class: CommandLineTool + +baseCommand: "claimed" + +inputs: + component: + type: string + default: 
local/claimed-gw-simple-grid-wrapper-example:0.1 + inputBinding: + position: 1 + prefix: --component + log_level: + type: string + default: "INFO" + inputBinding: + position: 2 + prefix: --log_level + sgw_source_folder: + type: string + default: None + inputBinding: + position: 3 + prefix: --sgw_source_folder + sgw_target_folder: + type: string + default: "sgw_source_folder" + inputBinding: + position: 4 + prefix: --sgw_target_folder + + +outputs: [] diff --git a/examples/gw_simple_grid_wrapper_example.job.yaml b/examples/gw_simple_grid_wrapper_example.job.yaml new file mode 100644 index 00000000..cbb8e7e4 --- /dev/null +++ b/examples/gw_simple_grid_wrapper_example.job.yaml @@ -0,0 +1,22 @@ +apiVersion: batch/v1 +kind: Job +metadata: + name: gw-simple-grid-wrapper-example +spec: + template: + spec: + containers: + - name: gw-simple-grid-wrapper-example + image: local/claimed-gw-simple-grid-wrapper-example:0.1 + workingDir: /opt/app-root/src/ + command: ["/opt/app-root/bin/python","claimed_gw_simple_grid_wrapper_example.py"] + env: + - name: log_level + value: value_of_log_level + - name: sgw_source_folder + value: value_of_sgw_source_folder + - name: sgw_target_folder + value: value_of_sgw_target_folder + restartPolicy: OnFailure + imagePullSecrets: + - name: image_pull_secret \ No newline at end of file diff --git a/examples/gw_simple_grid_wrapper_example.py b/examples/gw_simple_grid_wrapper_example.py new file mode 100644 index 00000000..1560e48b --- /dev/null +++ b/examples/gw_simple_grid_wrapper_example.py @@ -0,0 +1,105 @@ +""" +component_simple_grid_wrapper_example got wrapped by grid_wrapper, which wraps any CLAIMED component and implements the generic grid computing pattern https://romeokienzler.medium.com/the-generic-grid-computing-pattern-transforms-any-sequential-workflow-step-into-a-transient-grid-c7f3ca7459c8 +This simple grid wrapper just scans a folder and for each file the grid_process function is called. Locking is achieved the following way: +Given source file1.ext is processed, simple_grid_wrapper creates files in the target_directory following the pattern file1.{STATUS}.ext where STATUS in: +LOCKED +PROCESSED +FAILED + + +CLAIMED component description: component-simple-grid-wrapper-example +""" + +# pip install pandas + +# component dependencies +# + +import os +import json +import random +import logging +import time +import glob +from pathlib import Path +import pandas as pd + +# import component code +from component_simple_grid_wrapper_example import * + + +#folder containing input data in single files +sgw_source_folder = os.environ.get('sgw_source_folder') + +#folder to store the output data in single files. Default: sgw_source_folder, in case sgw_source_folder==sgw_target_folder, files containing .LOCKED., .PROCESSED., .FAILED. 
are ignored +sgw_target_folder = os.environ.get('sgw_target_folder', sgw_source_folder) + +# component interface + + +def get_next_batch(): + files = os.listdir(sgw_source_folder) + if sgw_source_folder == sgw_target_folder: + files = [ + f for f in files + if not any(keyword in f for keyword in ["LOCKED", "PROCESSED", "FAILED"]) + ] + + # Filter files and check if corresponding target file exists + filtered_files = [] + for file in files: + file_name, file_ext = os.path.splitext(file) + + # Create target file names with LOCKED, PROCESSED, FAILED extensions + target_file_locked = f"{file_name}.LOCKED{file_ext}" + target_file_processed = f"{file_name}.PROCESSED{file_ext}" + target_file_failed = f"{file_name}.FAILED{file_ext}" + + # Check if any of the target files exists + if not any( + os.path.exists(os.path.join(sgw_target_folder, target_file)) + for target_file in [target_file_locked, target_file_processed, target_file_failed] + ): + filtered_files.append(file) + + if filtered_files: + return random.choice(filtered_files) + else: + return None + + +def process_wrapper(sub_process): + sgw_target_folder_path = Path(sgw_target_folder) + sgw_target_folder_path.mkdir(exist_ok=True, parents=True) + + while True: + file_to_process = get_next_batch() + logging.info(f"Processing batch: {file_to_process}") + if file_to_process is None: + break + + file_name = Path(file_to_process).stem + file_ext = Path(file_to_process).suffix + locked_file = sgw_target_folder+f"/{file_name}.LOCKED{file_ext}" + locked_file_path = Path(locked_file) + + try: + locked_file_path.touch() + sub_process(sgw_source_folder +'/'+ file_to_process, locked_file) + processed_file = sgw_target_folder+f"/{file_name}.PROCESSED{file_ext}" + locked_file_path.rename(processed_file) + + except Exception as e: + failed_file = sgw_target_folder+f"/{file_name}.FAILED{file_ext}" + locked_file_path.rename(failed_file) + + with open(failed_file, 'w') as f: + f.write(f"Exception occurred: {str(e)}\n") + + logging.error(f"Processing failed for {file_to_process}: {str(e)}") + + logging.info("Finished processing all batches.") + + +if __name__ == '__main__': + process_wrapper(grid_process) diff --git a/examples/gw_simple_grid_wrapper_example.yaml b/examples/gw_simple_grid_wrapper_example.yaml new file mode 100644 index 00000000..da527cdd --- /dev/null +++ b/examples/gw_simple_grid_wrapper_example.yaml @@ -0,0 +1,23 @@ +name: gw-simple-grid-wrapper-example +description: "component_simple_grid_wrapper_example got wrapped by grid_wrapper, which wraps any CLAIMED component and implements the generic grid computing pattern https://romeokienzler.medium.com/the-generic-grid-computing-pattern-transforms-any-sequential-workflow-step-into-a-transient-grid-c7f3ca7459c8 This simple grid wrapper just scans a folder and for each file the grid_process function is called. Locking is achieved the following way: Given source file1.ext is processed, simple_grid_wrapper creates files in the target_directory following the pattern file1.{STATUS}.ext where STATUS in: LOCKED PROCESSED FAILED CLAIMED component description: component-simple-grid-wrapper-example – CLAIMED V0.1" + +inputs: +- {name: log_level, type: String, description: "update log level", default: "INFO"} +- {name: sgw_source_folder, type: String, description: "folder containing input data in single files"} +- {name: sgw_target_folder, type: String, description: "folder to store the output data in single files. 
Default: sgw_source_folder, in case sgw_source_folder==sgw_target_folder, files containing .LOCKED., .PROCESSED., .FAILED. are ignored", default: "sgw_source_folder"} + + +outputs: + + +implementation: + container: + image: local/claimed-gw-simple-grid-wrapper-example:0.1 + command: + - sh + - -ec + - | + python ./claimed_gw_simple_grid_wrapper_example.py log_level="${0}" sgw_source_folder="${1}" sgw_target_folder="${2}" + - {inputValue: log_level} + - {inputValue: sgw_source_folder} + - {inputValue: sgw_target_folder} diff --git a/examples/simple_grid_wrapper_example.py b/examples/simple_grid_wrapper_example.py new file mode 100644 index 00000000..5a2e23b0 --- /dev/null +++ b/examples/simple_grid_wrapper_example.py @@ -0,0 +1,6 @@ +# append processed to each line +def grid_process(source_file, target_file): + with open(source_file, 'r') as src, open(target_file, 'w') as tgt: + for line in src: + processed_line = line.strip() + ' processed\n' + tgt.write(processed_line) \ No newline at end of file diff --git a/examples/simple_grid_wrapper_source/1.txt b/examples/simple_grid_wrapper_source/1.txt new file mode 100644 index 00000000..9daeafb9 --- /dev/null +++ b/examples/simple_grid_wrapper_source/1.txt @@ -0,0 +1 @@ +test diff --git a/examples/simple_grid_wrapper_source/2.txt b/examples/simple_grid_wrapper_source/2.txt new file mode 100644 index 00000000..9daeafb9 --- /dev/null +++ b/examples/simple_grid_wrapper_source/2.txt @@ -0,0 +1 @@ +test diff --git a/examples/simple_grid_wrapper_source/3.txt b/examples/simple_grid_wrapper_source/3.txt new file mode 100644 index 00000000..9daeafb9 --- /dev/null +++ b/examples/simple_grid_wrapper_source/3.txt @@ -0,0 +1 @@ +test diff --git a/examples/simple_grid_wrapper_source_and_target/1.txt b/examples/simple_grid_wrapper_source_and_target/1.txt new file mode 100644 index 00000000..9daeafb9 --- /dev/null +++ b/examples/simple_grid_wrapper_source_and_target/1.txt @@ -0,0 +1 @@ +test diff --git a/examples/simple_grid_wrapper_source_and_target/2.txt b/examples/simple_grid_wrapper_source_and_target/2.txt new file mode 100644 index 00000000..9daeafb9 --- /dev/null +++ b/examples/simple_grid_wrapper_source_and_target/2.txt @@ -0,0 +1 @@ +test diff --git a/examples/simple_grid_wrapper_source_and_target/3.txt b/examples/simple_grid_wrapper_source_and_target/3.txt new file mode 100644 index 00000000..9daeafb9 --- /dev/null +++ b/examples/simple_grid_wrapper_source_and_target/3.txt @@ -0,0 +1 @@ +test diff --git a/examples/simple_grid_wrapper_target/1.PROCESSED.txt b/examples/simple_grid_wrapper_target/1.PROCESSED.txt new file mode 100644 index 00000000..a7ebda21 --- /dev/null +++ b/examples/simple_grid_wrapper_target/1.PROCESSED.txt @@ -0,0 +1 @@ +test processed diff --git a/examples/simple_grid_wrapper_target/2.PROCESSED.txt b/examples/simple_grid_wrapper_target/2.PROCESSED.txt new file mode 100644 index 00000000..a7ebda21 --- /dev/null +++ b/examples/simple_grid_wrapper_target/2.PROCESSED.txt @@ -0,0 +1 @@ +test processed diff --git a/examples/simple_grid_wrapper_target/3.PROCESSED.txt b/examples/simple_grid_wrapper_target/3.PROCESSED.txt new file mode 100644 index 00000000..a7ebda21 --- /dev/null +++ b/examples/simple_grid_wrapper_target/3.PROCESSED.txt @@ -0,0 +1 @@ +test processed diff --git a/src/c3/create_gridwrapper.py b/src/c3/create_gridwrapper.py index b9f641f3..76a21e69 100644 --- a/src/c3/create_gridwrapper.py +++ b/src/c3/create_gridwrapper.py @@ -32,6 +32,7 @@ def wrap_component(component_path, 'cos_grid_wrapper': 
c3.templates.cos_grid_wrapper_template, 'legacy_cos_grid_wrapper': c3.templates.legacy_cos_grid_wrapper_template, 's3kv_grid_wrapper': c3.templates.s3kv_grid_wrapper_template, + 'simple_grid_wrapper': c3.templates.simple_grid_wrapper_template, } gw_template = backends.get(backend) diff --git a/src/c3/templates/__init__.py b/src/c3/templates/__init__.py index f761b5d1..d1422114 100644 --- a/src/c3/templates/__init__.py +++ b/src/c3/templates/__init__.py @@ -16,6 +16,7 @@ COS_GRID_WRAPPER_FILE = 'cos_grid_wrapper_template.py' LEGACY_COS_GRID_WRAPPER_FILE = 'legacy_cos_grid_wrapper_template.py' S3KV_GRID_WRAPPER_FILE = 's3kv_grid_wrapper_template.py' +SIMPLE_GRID_WRAPPER_FILE = 'simple_grid_wrapper_template.py' # load templates template_path = Path(os.path.dirname(__file__)) @@ -55,3 +56,6 @@ with open(template_path / S3KV_GRID_WRAPPER_FILE, 'r') as f: s3kv_grid_wrapper_template = Template(f.read()) + +with open(template_path / SIMPLE_GRID_WRAPPER_FILE, 'r') as f: + simple_grid_wrapper_template = Template(f.read()) diff --git a/src/c3/templates/simple_grid_wrapper_template.py b/src/c3/templates/simple_grid_wrapper_template.py new file mode 100644 index 00000000..66908801 --- /dev/null +++ b/src/c3/templates/simple_grid_wrapper_template.py @@ -0,0 +1,105 @@ +""" +${component_name} got wrapped by grid_wrapper, which wraps any CLAIMED component and implements the generic grid computing pattern https://romeokienzler.medium.com/the-generic-grid-computing-pattern-transforms-any-sequential-workflow-step-into-a-transient-grid-c7f3ca7459c8 +This simple grid wrapper just scans a folder and for each file the grid_process function is called. Locking is achieved the following way: +Given source file1.ext is processed, simple_grid_wrapper creates files in the target_directory following the pattern file1.{STATUS}.ext where STATUS in: +LOCKED +PROCESSED +FAILED + + +CLAIMED component description: ${component_description} +""" + +# pip install pandas + +# component dependencies +# ${component_dependencies} + +import os +import json +import random +import logging +import time +import glob +from pathlib import Path +import pandas as pd + +# import component code +from ${component_name} import * + + +#folder containing input data in single files +sgw_source_folder = os.environ.get('sgw_source_folder') + +#folder to store the output data in single files. Default: sgw_source_folder, in case sgw_source_folder==sgw_target_folder, files containing .LOCKED., .PROCESSED., .FAILED. 
are ignored
+sgw_target_folder = os.environ.get('sgw_target_folder', sgw_source_folder)
+
+# component interface
+${component_interface}
+
+def get_next_batch():
+    files = os.listdir(sgw_source_folder)
+    if sgw_source_folder == sgw_target_folder:
+        files = [
+            f for f in files
+            if not any(keyword in f for keyword in ["LOCKED", "PROCESSED", "FAILED"])
+        ]
+
+    # Filter files and check if corresponding target file exists
+    filtered_files = []
+    for file in files:
+        file_name, file_ext = os.path.splitext(file)
+
+        # Create target file names with LOCKED, PROCESSED, FAILED extensions
+        target_file_locked = f"{file_name}.LOCKED{file_ext}"
+        target_file_processed = f"{file_name}.PROCESSED{file_ext}"
+        target_file_failed = f"{file_name}.FAILED{file_ext}"
+
+        # Check if any of the target files exists
+        if not any(
+            os.path.exists(os.path.join(sgw_target_folder, target_file))
+            for target_file in [target_file_locked, target_file_processed, target_file_failed]
+        ):
+            filtered_files.append(file)
+
+    if filtered_files:
+        return random.choice(filtered_files)
+    else:
+        return None
+
+
+def process_wrapper(sub_process):
+    sgw_target_folder_path = Path(sgw_target_folder)
+    sgw_target_folder_path.mkdir(exist_ok=True, parents=True)
+
+    while True:
+        file_to_process = get_next_batch()
+        logging.info(f"Processing batch: {file_to_process}")
+        if file_to_process is None:
+            break
+
+        file_name = Path(file_to_process).stem
+        file_ext = Path(file_to_process).suffix
+        locked_file = sgw_target_folder+f"/{file_name}.LOCKED{file_ext}"
+        locked_file_path = Path(locked_file)
+
+        try:
+            locked_file_path.touch()
+            sub_process(sgw_source_folder +'/'+ file_to_process, locked_file)
+            processed_file = sgw_target_folder+f"/{file_name}.PROCESSED{file_ext}"
+            locked_file_path.rename(processed_file)
+
+        except Exception as e:
+            failed_file = sgw_target_folder+f"/{file_name}.FAILED{file_ext}"
+            locked_file_path.rename(failed_file)
+
+            with open(failed_file, 'w') as f:
+                f.write(f"Exception occurred: {str(e)}\n")
+
+            logging.error(f"Processing failed for {file_to_process}: {str(e)}")
+
+    logging.info("Finished processing all batches.")
+
+
+if __name__ == '__main__':
+    process_wrapper(${component_process})

From d1361447d34a31a34386fe7c161532b32ade1042 Mon Sep 17 00:00:00 2001
From: Romeo Kienzler <5694071+romeokienzler@users.noreply.github.com>
Date: Fri, 21 Feb 2025 21:09:12 +0000
Subject: [PATCH 172/177] Update GettingStarted.md

---
 GettingStarted.md | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/GettingStarted.md b/GettingStarted.md
index 050f0990..f9ac986b 100644
--- a/GettingStarted.md
+++ b/GettingStarted.md
@@ -715,3 +715,16 @@ spec:
       imagePullSecrets:
         - name: image-pull-secret
 ```
+
+### 5.4 Simple Grid Wrapper
+Although CLAIMED grid wrappers with their different coordinator plugins are very powerful, they can also be overwhelming. We therefore created the simple_grid_wrapper plugin, which lets you point as many parallel workers as you like at a directory of files. Each worker picks files at random, and the wrapper ensures that only one worker processes any given file. Once a file has been processed, its result is renamed to original_file_name.PROCESSED.ext. Please have a look at the examples folder to create your own simple grid wrapper; the sketch below shows the component contract.
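+
+A component for this wrapper is a plain Python file exposing a `grid_process(source_file, target_file)` function. The generated wrapper imports your file and calls `grid_process` once per unclaimed input file; `target_file` initially points to the `.LOCKED` marker in the target folder, and the wrapper renames it to `.PROCESSED` on success or `.FAILED` on error. The example component shipped in this repository, `examples/simple_grid_wrapper_example.py`, is reproduced here for reference:
+
+```python
+# append processed to each line
+def grid_process(source_file, target_file):
+    with open(source_file, 'r') as src, open(target_file, 'w') as tgt:
+        for line in src:
+            processed_line = line.strip() + ' processed\n'
+            tgt.write(processed_line)
+```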
+Here are the commands, given you are in the examples folder of this repository:
+
+```
+(pip install claimed c3)
+c3_create_gridwrapper simple_grid_wrapper_example.py -b simple_grid_wrapper
+export CLAIMED_DATA_PATH=/path/to/your/c3/examples
+claimed --component local/claimed-gw-simple-grid-wrapper-example:0.1 --log_level "INFO" --sgw_source_folder /opt/app-root/src/data/simple_grid_wrapper_source --sgw_target_folder /opt/app-root/src/data/simple_grid_wrapper_target
+
+# you can also store the results in the source folder
+claimed --component local/claimed-gw-simple-grid-wrapper-example:0.1 --log_level "INFO" --sgw_source_folder /opt/app-root/src/data/simple_grid_wrapper_source_and_target --sgw_target_folder /opt/app-root/src/data/simple_grid_wrapper_source_and_target
+```

From 0a747eebe7c0449033fd006a194434c857044270 Mon Sep 17 00:00:00 2001
From: Romeo Kienzler <5694071+romeokienzler@users.noreply.github.com>
Date: Fri, 21 Mar 2025 16:12:30 +0000
Subject: [PATCH 173/177] Update GettingStarted.md

---
 GettingStarted.md | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/GettingStarted.md b/GettingStarted.md
index f9ac986b..8ce11fbe 100644
--- a/GettingStarted.md
+++ b/GettingStarted.md
@@ -525,6 +525,13 @@ c3_create_operator --help
 ```
 
 C3 generates the container image that is pushed to the registry, a `.yaml` file for KubeFlow, a `.job.yaml` for Kubernetes, and a `.cwl` file for CWL.
 
+### 4.6 CLAIMED Containerless Operators
+CLAIMED containerless operators let you execute scripts as fully functional workflow components without the need for traditional containerization.
+
+After installing the CLAIMED component compiler via `pip install claimed c3`, you can compile a script into a containerless operator just as you would for containerized targets such as Docker, Kubernetes (jobs, pods, deployments), Kubeflow, or Apache Airflow.
+
+Using the command `c3_create_containerless_operator my_script.py`, your script is transformed into a standalone, executable operator. An example of a containerless operator can be found in the [containerless-bootstrap repository](https://github.com/claimed-framework/containerless-bootstrap). These operators run seamlessly through the `claimed` CLI; simply replace the container registry path with the `containerless` prefix. For instance, running `claimed --component containerless/claimed-util-cos:latest --cos_connection cos://access_key_id:secret_access_key@s3.us-east.cloud-object-storage.appdomain.cloud/some_bucket/some_path --operation put --local_path some_file.zip` performs cloud object storage operations with the `claimed-util-cos` operator without requiring a container runtime. This approach significantly reduces overhead and speeds up execution while remaining compatible with established workflow orchestration frameworks.
+
 ---

From 6315b2edc7da60b6753902a76511dc3498255919 Mon Sep 17 00:00:00 2001
From: Romeo Kienzler <5694071+romeokienzler@users.noreply.github.com>
Date: Wed, 2 Jul 2025 09:10:49 +0000
Subject: [PATCH 174/177] add EU funding notice

Signed-off-by: Romeo Kienzler

---
 README.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/README.md b/README.md
index a27c4752..68c5fd10 100644
--- a/README.md
+++ b/README.md
@@ -52,5 +52,9 @@ Please see [VULNERABILITIES.md](VULNERABILITIES.md) for reporting vulnerabilitie

 Interested in helping make CLAIMED better? We encourage you to take a look at our [Contributing](CONTRIBUTING.md) page.
+## Credits + +CLAIMED is supported by the EU’s Horizon Europe program under Grant Agreement number 101131841 and also received funding from the Swiss State Secretariat for Education, Research and Innovation (SERI) and the UK Research and Innovation (UKRI). + ## License This software is released under Apache License v2.0. From d3deed4b75e53261ce5f3152eb948cabd033e456 Mon Sep 17 00:00:00 2001 From: Duc Kieu Date: Tue, 19 Aug 2025 16:46:28 +0200 Subject: [PATCH 175/177] folder grid wrapper Signed-off-by: Duc Kieu --- GettingStarted.md | 9 ++ .../folder_grid_wrapper_example.py | 10 ++ .../folder_grid_wrapper_source/folder_1/1.txt | 1 + .../folder_grid_wrapper_source/folder_1/2.txt | 1 + .../folder_grid_wrapper_source/folder_1/3.txt | 1 + .../folder_grid_wrapper_source/folder_2/1.txt | 1 + .../folder_grid_wrapper_source/folder_2/2.txt | 1 + .../folder_grid_wrapper_source/folder_2/3.txt | 1 + .../folder_grid_wrapper_source/folder_3/1.txt | 1 + .../folder_grid_wrapper_source/folder_3/2.txt | 1 + .../folder_grid_wrapper_source/folder_3/3.txt | 1 + .../folder_1.PROCESSED/1.txt | 1 + .../folder_1.PROCESSED/2.txt | 1 + .../folder_1.PROCESSED/3.txt | 1 + .../folder_2.PROCESSED/1.txt | 1 + .../folder_2.PROCESSED/2.txt | 1 + .../folder_2.PROCESSED/3.txt | 1 + .../folder_3.PROCESSED/1.txt | 1 + .../folder_3.PROCESSED/2.txt | 1 + .../folder_3.PROCESSED/3.txt | 1 + src/c3/create_gridwrapper.py | 1 + src/c3/templates/__init__.py | 5 + .../templates/folder_grid_wrapper_template.py | 137 ++++++++++++++++++ 23 files changed, 180 insertions(+) create mode 100644 examples/folder_grid_wrapper_example/folder_grid_wrapper_example.py create mode 100644 examples/folder_grid_wrapper_example/folder_grid_wrapper_source/folder_1/1.txt create mode 100644 examples/folder_grid_wrapper_example/folder_grid_wrapper_source/folder_1/2.txt create mode 100644 examples/folder_grid_wrapper_example/folder_grid_wrapper_source/folder_1/3.txt create mode 100644 examples/folder_grid_wrapper_example/folder_grid_wrapper_source/folder_2/1.txt create mode 100644 examples/folder_grid_wrapper_example/folder_grid_wrapper_source/folder_2/2.txt create mode 100644 examples/folder_grid_wrapper_example/folder_grid_wrapper_source/folder_2/3.txt create mode 100644 examples/folder_grid_wrapper_example/folder_grid_wrapper_source/folder_3/1.txt create mode 100644 examples/folder_grid_wrapper_example/folder_grid_wrapper_source/folder_3/2.txt create mode 100644 examples/folder_grid_wrapper_example/folder_grid_wrapper_source/folder_3/3.txt create mode 100644 examples/folder_grid_wrapper_example/folder_grid_wrapper_target/folder_1.PROCESSED/1.txt create mode 100644 examples/folder_grid_wrapper_example/folder_grid_wrapper_target/folder_1.PROCESSED/2.txt create mode 100644 examples/folder_grid_wrapper_example/folder_grid_wrapper_target/folder_1.PROCESSED/3.txt create mode 100644 examples/folder_grid_wrapper_example/folder_grid_wrapper_target/folder_2.PROCESSED/1.txt create mode 100644 examples/folder_grid_wrapper_example/folder_grid_wrapper_target/folder_2.PROCESSED/2.txt create mode 100644 examples/folder_grid_wrapper_example/folder_grid_wrapper_target/folder_2.PROCESSED/3.txt create mode 100644 examples/folder_grid_wrapper_example/folder_grid_wrapper_target/folder_3.PROCESSED/1.txt create mode 100644 examples/folder_grid_wrapper_example/folder_grid_wrapper_target/folder_3.PROCESSED/2.txt create mode 100644 examples/folder_grid_wrapper_example/folder_grid_wrapper_target/folder_3.PROCESSED/3.txt create mode 100644 
src/c3/templates/folder_grid_wrapper_template.py

diff --git a/GettingStarted.md b/GettingStarted.md
index 8ce11fbe..0b9eb826 100644
--- a/GettingStarted.md
+++ b/GettingStarted.md
@@ -735,3 +735,12 @@ claimed --component local/claimed-gw-simple-grid-wrapper-example:0.1 --log_level
 
 # you can also store the results in the source folder
 claimed --component local/claimed-gw-simple-grid-wrapper-example:0.1 --log_level "INFO" --sgw_source_folder /opt/app-root/src/data/simple_grid_wrapper_source_and_target --sgw_target_folder /opt/app-root/src/data/simple_grid_wrapper_source_and_target
 ```
+
+### 5.5 Folder Grid Wrapper
+It works exactly like the simple grid wrapper, but locks whole folders instead of individual files.
+Here are the commands, given you are in the examples/folder_grid_wrapper_example folder of this repository:
+```
+c3_create_gridwrapper folder_grid_wrapper_example.py -b folder_grid_wrapper
+export CLAIMED_DATA_PATH=/path/to/your/c3/examples
+claimed --component local/claimed-gw-folder-grid-wrapper-example:0.1 --log_level "INFO" --sgw_source_folder /opt/app-root/src/data/folder_grid_wrapper_source --sgw_target_folder /opt/app-root/src/data/folder_grid_wrapper_target
+```
\ No newline at end of file
diff --git a/examples/folder_grid_wrapper_example/folder_grid_wrapper_example.py b/examples/folder_grid_wrapper_example/folder_grid_wrapper_example.py
new file mode 100644
index 00000000..03a9d79c
--- /dev/null
+++ b/examples/folder_grid_wrapper_example/folder_grid_wrapper_example.py
@@ -0,0 +1,10 @@
+from pathlib import Path
+
+def grid_process(source_folder: str, target_folder: str) -> None:
+    src_dir = Path(source_folder)
+    tgt_dir = Path(target_folder)
+
+    for src_file in sorted(src_dir.glob("*.txt")):
+        text = src_file.read_text(encoding="utf-8")
+        updated = text.replace("test", "test processed")
+        (tgt_dir / src_file.name).write_text(updated, encoding="utf-8")
\ No newline at end of file
diff --git a/examples/folder_grid_wrapper_example/folder_grid_wrapper_source/folder_1/1.txt b/examples/folder_grid_wrapper_example/folder_grid_wrapper_source/folder_1/1.txt
new file mode 100644
index 00000000..30d74d25
--- /dev/null
+++ b/examples/folder_grid_wrapper_example/folder_grid_wrapper_source/folder_1/1.txt
@@ -0,0 +1 @@
+test
\ No newline at end of file
diff --git a/examples/folder_grid_wrapper_example/folder_grid_wrapper_source/folder_1/2.txt b/examples/folder_grid_wrapper_example/folder_grid_wrapper_source/folder_1/2.txt
new file mode 100644
index 00000000..30d74d25
--- /dev/null
+++ b/examples/folder_grid_wrapper_example/folder_grid_wrapper_source/folder_1/2.txt
@@ -0,0 +1 @@
+test
\ No newline at end of file
diff --git a/examples/folder_grid_wrapper_example/folder_grid_wrapper_source/folder_1/3.txt b/examples/folder_grid_wrapper_example/folder_grid_wrapper_source/folder_1/3.txt
new file mode 100644
index 00000000..30d74d25
--- /dev/null
+++ b/examples/folder_grid_wrapper_example/folder_grid_wrapper_source/folder_1/3.txt
@@ -0,0 +1 @@
+test
\ No newline at end of file
diff --git a/examples/folder_grid_wrapper_example/folder_grid_wrapper_source/folder_2/1.txt b/examples/folder_grid_wrapper_example/folder_grid_wrapper_source/folder_2/1.txt
new file mode 100644
index 00000000..30d74d25
--- /dev/null
+++ b/examples/folder_grid_wrapper_example/folder_grid_wrapper_source/folder_2/1.txt
@@ -0,0 +1 @@
+test
\ No newline at end of file
diff --git a/examples/folder_grid_wrapper_example/folder_grid_wrapper_source/folder_2/2.txt b/examples/folder_grid_wrapper_example/folder_grid_wrapper_source/folder_2/2.txt
new file mode 100644 index 00000000..30d74d25 --- /dev/null +++ b/examples/folder_grid_wrapper_example/folder_grid_wrapper_source/folder_2/2.txt @@ -0,0 +1 @@ +test \ No newline at end of file diff --git a/examples/folder_grid_wrapper_example/folder_grid_wrapper_source/folder_2/3.txt b/examples/folder_grid_wrapper_example/folder_grid_wrapper_source/folder_2/3.txt new file mode 100644 index 00000000..30d74d25 --- /dev/null +++ b/examples/folder_grid_wrapper_example/folder_grid_wrapper_source/folder_2/3.txt @@ -0,0 +1 @@ +test \ No newline at end of file diff --git a/examples/folder_grid_wrapper_example/folder_grid_wrapper_source/folder_3/1.txt b/examples/folder_grid_wrapper_example/folder_grid_wrapper_source/folder_3/1.txt new file mode 100644 index 00000000..30d74d25 --- /dev/null +++ b/examples/folder_grid_wrapper_example/folder_grid_wrapper_source/folder_3/1.txt @@ -0,0 +1 @@ +test \ No newline at end of file diff --git a/examples/folder_grid_wrapper_example/folder_grid_wrapper_source/folder_3/2.txt b/examples/folder_grid_wrapper_example/folder_grid_wrapper_source/folder_3/2.txt new file mode 100644 index 00000000..30d74d25 --- /dev/null +++ b/examples/folder_grid_wrapper_example/folder_grid_wrapper_source/folder_3/2.txt @@ -0,0 +1 @@ +test \ No newline at end of file diff --git a/examples/folder_grid_wrapper_example/folder_grid_wrapper_source/folder_3/3.txt b/examples/folder_grid_wrapper_example/folder_grid_wrapper_source/folder_3/3.txt new file mode 100644 index 00000000..30d74d25 --- /dev/null +++ b/examples/folder_grid_wrapper_example/folder_grid_wrapper_source/folder_3/3.txt @@ -0,0 +1 @@ +test \ No newline at end of file diff --git a/examples/folder_grid_wrapper_example/folder_grid_wrapper_target/folder_1.PROCESSED/1.txt b/examples/folder_grid_wrapper_example/folder_grid_wrapper_target/folder_1.PROCESSED/1.txt new file mode 100644 index 00000000..02b2ffde --- /dev/null +++ b/examples/folder_grid_wrapper_example/folder_grid_wrapper_target/folder_1.PROCESSED/1.txt @@ -0,0 +1 @@ +test processed \ No newline at end of file diff --git a/examples/folder_grid_wrapper_example/folder_grid_wrapper_target/folder_1.PROCESSED/2.txt b/examples/folder_grid_wrapper_example/folder_grid_wrapper_target/folder_1.PROCESSED/2.txt new file mode 100644 index 00000000..02b2ffde --- /dev/null +++ b/examples/folder_grid_wrapper_example/folder_grid_wrapper_target/folder_1.PROCESSED/2.txt @@ -0,0 +1 @@ +test processed \ No newline at end of file diff --git a/examples/folder_grid_wrapper_example/folder_grid_wrapper_target/folder_1.PROCESSED/3.txt b/examples/folder_grid_wrapper_example/folder_grid_wrapper_target/folder_1.PROCESSED/3.txt new file mode 100644 index 00000000..02b2ffde --- /dev/null +++ b/examples/folder_grid_wrapper_example/folder_grid_wrapper_target/folder_1.PROCESSED/3.txt @@ -0,0 +1 @@ +test processed \ No newline at end of file diff --git a/examples/folder_grid_wrapper_example/folder_grid_wrapper_target/folder_2.PROCESSED/1.txt b/examples/folder_grid_wrapper_example/folder_grid_wrapper_target/folder_2.PROCESSED/1.txt new file mode 100644 index 00000000..02b2ffde --- /dev/null +++ b/examples/folder_grid_wrapper_example/folder_grid_wrapper_target/folder_2.PROCESSED/1.txt @@ -0,0 +1 @@ +test processed \ No newline at end of file diff --git a/examples/folder_grid_wrapper_example/folder_grid_wrapper_target/folder_2.PROCESSED/2.txt b/examples/folder_grid_wrapper_example/folder_grid_wrapper_target/folder_2.PROCESSED/2.txt new file mode 100644 index 00000000..02b2ffde --- /dev/null +++ 
b/examples/folder_grid_wrapper_example/folder_grid_wrapper_target/folder_2.PROCESSED/2.txt @@ -0,0 +1 @@ +test processed \ No newline at end of file diff --git a/examples/folder_grid_wrapper_example/folder_grid_wrapper_target/folder_2.PROCESSED/3.txt b/examples/folder_grid_wrapper_example/folder_grid_wrapper_target/folder_2.PROCESSED/3.txt new file mode 100644 index 00000000..02b2ffde --- /dev/null +++ b/examples/folder_grid_wrapper_example/folder_grid_wrapper_target/folder_2.PROCESSED/3.txt @@ -0,0 +1 @@ +test processed \ No newline at end of file diff --git a/examples/folder_grid_wrapper_example/folder_grid_wrapper_target/folder_3.PROCESSED/1.txt b/examples/folder_grid_wrapper_example/folder_grid_wrapper_target/folder_3.PROCESSED/1.txt new file mode 100644 index 00000000..02b2ffde --- /dev/null +++ b/examples/folder_grid_wrapper_example/folder_grid_wrapper_target/folder_3.PROCESSED/1.txt @@ -0,0 +1 @@ +test processed \ No newline at end of file diff --git a/examples/folder_grid_wrapper_example/folder_grid_wrapper_target/folder_3.PROCESSED/2.txt b/examples/folder_grid_wrapper_example/folder_grid_wrapper_target/folder_3.PROCESSED/2.txt new file mode 100644 index 00000000..02b2ffde --- /dev/null +++ b/examples/folder_grid_wrapper_example/folder_grid_wrapper_target/folder_3.PROCESSED/2.txt @@ -0,0 +1 @@ +test processed \ No newline at end of file diff --git a/examples/folder_grid_wrapper_example/folder_grid_wrapper_target/folder_3.PROCESSED/3.txt b/examples/folder_grid_wrapper_example/folder_grid_wrapper_target/folder_3.PROCESSED/3.txt new file mode 100644 index 00000000..02b2ffde --- /dev/null +++ b/examples/folder_grid_wrapper_example/folder_grid_wrapper_target/folder_3.PROCESSED/3.txt @@ -0,0 +1 @@ +test processed \ No newline at end of file diff --git a/src/c3/create_gridwrapper.py b/src/c3/create_gridwrapper.py index 76a21e69..e8184ea3 100644 --- a/src/c3/create_gridwrapper.py +++ b/src/c3/create_gridwrapper.py @@ -33,6 +33,7 @@ def wrap_component(component_path, 'legacy_cos_grid_wrapper': c3.templates.legacy_cos_grid_wrapper_template, 's3kv_grid_wrapper': c3.templates.s3kv_grid_wrapper_template, 'simple_grid_wrapper': c3.templates.simple_grid_wrapper_template, + 'folder_grid_wrapper': c3.templates.folder_grid_wrapper_template, } gw_template = backends.get(backend) diff --git a/src/c3/templates/__init__.py b/src/c3/templates/__init__.py index d1422114..94a3b13f 100644 --- a/src/c3/templates/__init__.py +++ b/src/c3/templates/__init__.py @@ -17,6 +17,7 @@ LEGACY_COS_GRID_WRAPPER_FILE = 'legacy_cos_grid_wrapper_template.py' S3KV_GRID_WRAPPER_FILE = 's3kv_grid_wrapper_template.py' SIMPLE_GRID_WRAPPER_FILE = 'simple_grid_wrapper_template.py' +FOLDER_GRID_WRAPPER_FILE = 'folder_grid_wrapper_template.py' # load templates template_path = Path(os.path.dirname(__file__)) @@ -59,3 +60,7 @@ with open(template_path / SIMPLE_GRID_WRAPPER_FILE, 'r') as f: simple_grid_wrapper_template = Template(f.read()) + +with open(template_path / FOLDER_GRID_WRAPPER_FILE, 'r') as f: + folder_grid_wrapper_template = Template(f.read()) + \ No newline at end of file diff --git a/src/c3/templates/folder_grid_wrapper_template.py b/src/c3/templates/folder_grid_wrapper_template.py new file mode 100644 index 00000000..900ace74 --- /dev/null +++ b/src/c3/templates/folder_grid_wrapper_template.py @@ -0,0 +1,137 @@ +""" +${component_name} got wrapped by folder_grid_wrapper, which wraps any CLAIMED component and implements folder-level locking. 
+This folder grid wrapper scans immediate subdirectories of sgw_source_folder and for each folder the ${component_process} function is called once. +Locking is achieved by creating files in the target directory using the pattern .{STATUS} where STATUS in: +LOCKED +PROCESSED +FAILED + + +CLAIMED component description: ${component_description} +""" + +# pip install pandas + +# component dependencies +# ${component_dependencies} + +import os +import json +import random +import logging +from pathlib import Path +import pandas as pd + +# import component code +from ${component_name} import * + +# folder containing input data in single files or subfolders +sgw_source_folder = os.environ.get('sgw_source_folder') + +# folder to store the output markers and results +# Default: sgw_source_folder. If equal, entries containing LOCKED or PROCESSED or FAILED are ignored. +sgw_target_folder = os.environ.get('sgw_target_folder', sgw_source_folder) + +# component interface +${component_interface} + +def _marker_paths(entry_name: str, is_dir: bool): + """Return (LOCKED, PROCESSED, FAILED) marker paths for a file or a folder.""" + tgt = Path(sgw_target_folder) + if is_dir: + # folder markers are directories + return ( + tgt / f"{entry_name}.LOCKED", + tgt / f"{entry_name}.PROCESSED", + tgt / f"{entry_name}.FAILED", + ) + # file markers are files + base, ext = os.path.splitext(entry_name) + return ( + tgt / f"{base}.LOCKED{ext}", + tgt / f"{base}.PROCESSED{ext}", + tgt / f"{base}.FAILED{ext}", + ) + +def _claimed_any(locked, processed, failed) -> bool: + return locked.exists() or processed.exists() or failed.exists() + +def get_next_batch(): + """Pick a random unclaimed entry from source, supporting files and folders.""" + filtered = [] + with os.scandir(sgw_source_folder) as it: + for e in it: + name = e.name + + # If source and target are the same, skip marker entries + if sgw_source_folder == sgw_target_folder and ( + "LOCKED" in name or "PROCESSED" in name or "FAILED" in name + ): + continue + + locked, processed, failed = _marker_paths(name, e.is_dir()) + if not _claimed_any(locked, processed, failed): + filtered.append((name, e.is_dir())) + + if filtered: + return random.choice(filtered) # (name, is_dir) + return None + +def _try_acquire_lock(name: str, is_dir: bool): + """Create the LOCKED marker atomically and return its Path, or None if already claimed.""" + locked, _, _ = _marker_paths(name, is_dir) + try: + if is_dir: + # atomic directory creation is a good folder lock + locked.mkdir() + else: + # atomic file creation + fd = os.open(str(locked), os.O_CREAT | os.O_EXCL | os.O_WRONLY) + os.close(fd) + return locked + except FileExistsError: + return None + +def process_wrapper(sub_process): + sgw_target_folder_path = Path(sgw_target_folder) + sgw_target_folder_path.mkdir(exist_ok=True, parents=True) + + while True: + nxt = get_next_batch() + if nxt is None: + break + + entry_name, is_dir = nxt + src_path = str(Path(sgw_source_folder) / entry_name) + locked, processed, failed = _marker_paths(entry_name, is_dir) + logging.info(f"Processing: {src_path}") + + # Acquire the lock. If we lose the race, pick another entry. + lock_path = _try_acquire_lock(entry_name, is_dir) + if lock_path is None: + continue + + try: + # Call user component. For folders, src_path points to the folder. + # The second argument remains the marker path, same as before. 
+ sub_process(src_path, str(lock_path)) + + # Success marker + lock_path.rename(processed) + + except Exception as e: + # Failure marker + lock_path.rename(failed) + if is_dir: + # Put the error message inside the FAILED directory + errfile = Path(failed) / "error.txt" + errfile.write_text(f"Exception occurred: {str(e)}\n", encoding="utf-8") + else: + # For files, FAILED is itself a file; overwrite with the error text + Path(failed).write_text(f"Exception occurred: {str(e)}\n", encoding="utf-8") + logging.error(f"Processing failed for {src_path}: {str(e)}") + + logging.info("Finished processing all batches.") + +if __name__ == '__main__': + process_wrapper(${component_process}) \ No newline at end of file From 0356320c21328a9ca4ade3063ea530fad6c0d412 Mon Sep 17 00:00:00 2001 From: Duc Kieu Date: Tue, 19 Aug 2025 16:51:50 +0200 Subject: [PATCH 176/177] add c3_create_gridwrapper ouput for folder_grid_wrapper Signed-off-by: Duc Kieu --- .../gw_folder_grid_wrapper_example.cwl | 33 +++++++++++++++++++ .../gw_folder_grid_wrapper_example.job.yaml | 22 +++++++++++++ .../gw_folder_grid_wrapper_example.yaml | 23 +++++++++++++ 3 files changed, 78 insertions(+) create mode 100644 examples/folder_grid_wrapper_example/gw_folder_grid_wrapper_example.cwl create mode 100644 examples/folder_grid_wrapper_example/gw_folder_grid_wrapper_example.job.yaml create mode 100644 examples/folder_grid_wrapper_example/gw_folder_grid_wrapper_example.yaml diff --git a/examples/folder_grid_wrapper_example/gw_folder_grid_wrapper_example.cwl b/examples/folder_grid_wrapper_example/gw_folder_grid_wrapper_example.cwl new file mode 100644 index 00000000..d7571613 --- /dev/null +++ b/examples/folder_grid_wrapper_example/gw_folder_grid_wrapper_example.cwl @@ -0,0 +1,33 @@ +cwlVersion: v1.2 +class: CommandLineTool + +baseCommand: "claimed" + +inputs: + component: + type: string + default: local/claimed-gw-folder-grid-wrapper-example:0.1 + inputBinding: + position: 1 + prefix: --component + log_level: + type: string + default: "INFO" + inputBinding: + position: 2 + prefix: --log_level + sgw_source_folder: + type: string + default: None + inputBinding: + position: 3 + prefix: --sgw_source_folder + sgw_target_folder: + type: string + default: "sgw_source_folder" + inputBinding: + position: 4 + prefix: --sgw_target_folder + + +outputs: [] diff --git a/examples/folder_grid_wrapper_example/gw_folder_grid_wrapper_example.job.yaml b/examples/folder_grid_wrapper_example/gw_folder_grid_wrapper_example.job.yaml new file mode 100644 index 00000000..1260f050 --- /dev/null +++ b/examples/folder_grid_wrapper_example/gw_folder_grid_wrapper_example.job.yaml @@ -0,0 +1,22 @@ +apiVersion: batch/v1 +kind: Job +metadata: + name: gw-folder-grid-wrapper-example +spec: + template: + spec: + containers: + - name: gw-folder-grid-wrapper-example + image: local/claimed-gw-folder-grid-wrapper-example:0.1 + workingDir: /opt/app-root/src/ + command: ["/opt/app-root/bin/python","examples/folder_grid_wrapper_example/claimed_gw_folder_grid_wrapper_example.py"] + env: + - name: log_level + value: value_of_log_level + - name: sgw_source_folder + value: value_of_sgw_source_folder + - name: sgw_target_folder + value: value_of_sgw_target_folder + restartPolicy: OnFailure + imagePullSecrets: + - name: image_pull_secret \ No newline at end of file diff --git a/examples/folder_grid_wrapper_example/gw_folder_grid_wrapper_example.yaml b/examples/folder_grid_wrapper_example/gw_folder_grid_wrapper_example.yaml new file mode 100644 index 00000000..c56a36c7 --- /dev/null +++ 
b/examples/folder_grid_wrapper_example/gw_folder_grid_wrapper_example.yaml @@ -0,0 +1,23 @@ +name: gw-folder-grid-wrapper-example +description: "component_folder_grid_wrapper_example got wrapped by folder_grid_wrapper, which wraps any CLAIMED component and implements folder-level locking. This folder grid wrapper scans immediate subdirectories of sgw_source_folder and for each folder the grid_process function is called once. Locking is achieved by creating files in the target directory using the pattern .{STATUS} where STATUS in: LOCKED PROCESSED FAILED CLAIMED component description: component-folder-grid-wrapper-example – CLAIMED V0.1" + +inputs: +- {name: log_level, type: String, description: "update log level", default: "INFO"} +- {name: sgw_source_folder, type: String, description: "folder containing input data in single files or subfolders"} +- {name: sgw_target_folder, type: String, description: "Default: sgw_source_folder. If equal, entries containing LOCKED or PROCESSED or FAILED are ignored.", default: "sgw_source_folder"} + + +outputs: + + +implementation: + container: + image: local/claimed-gw-folder-grid-wrapper-example:0.1 + command: + - sh + - -ec + - | + python ./examples/folder_grid_wrapper_example/claimed_gw_folder_grid_wrapper_example.py log_level="${0}" sgw_source_folder="${1}" sgw_target_folder="${2}" + - {inputValue: log_level} + - {inputValue: sgw_source_folder} + - {inputValue: sgw_target_folder} From fa90db6db24ae51a8ec8d94f654ce223bfc1ea0e Mon Sep 17 00:00:00 2001 From: Duc Kieu Date: Fri, 22 Aug 2025 13:57:56 +0200 Subject: [PATCH 177/177] improve documentation for folder grid wrapper Signed-off-by: Duc Kieu --- GettingStarted.md | 13 ++++++++++++- 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/GettingStarted.md b/GettingStarted.md index 0b9eb826..92762c9c 100644 --- a/GettingStarted.md +++ b/GettingStarted.md @@ -743,4 +743,15 @@ Here are the commands, given you are in the examples/folder_grid_wrapper_example c3_create_gridwrapper folder_grid_wrapper_example.py -b folder_grid_wrapper export CLAIMED_DATA_PATH=/path/to/your/c3/examples claimed --component local/claimed-gw-folder-grid-wrapper-example:0.1 --log_level "INFO" --sgw_source_folder /opt/app-root/src/data/folder_grid_wrapper_source --sgw_target_folder /opt/app-root/src/data/folder_grid_wrapper_target -``` \ No newline at end of file +``` +CLAIMED_DATA_PATH specifies the root directory that contains both the source and target folders used by the folder grid wrapper. +For example, if +``` +CLAIMED_DATA_PATH=/c3/examples/folder_grid_wrapper_example +``` +then the directory structure should look like this: +``` +/c3/examples/folder_grid_wrapper_example/ +├── folder_grid_wrapper_source/ +├── folder_grid_wrapper_target/ +```
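+
+As with the simple grid wrapper, it is safe to point many parallel workers at the same source directory because claiming an entry is atomic. The sketch below condenses the lock-acquisition logic from `src/c3/templates/folder_grid_wrapper_template.py`; the marker naming is simplified for illustration (the real template also derives file-marker names from the file extension):
+
+```python
+import os
+from pathlib import Path
+
+def try_acquire_lock(target_folder: Path, name: str, is_dir: bool):
+    """Atomically create the LOCKED marker; return its path, or None if another worker won the race."""
+    locked = target_folder / f"{name}.LOCKED"
+    try:
+        if is_dir:
+            locked.mkdir()  # mkdir() raises FileExistsError if the marker directory already exists
+        else:
+            fd = os.open(str(locked), os.O_CREAT | os.O_EXCL | os.O_WRONLY)
+            os.close(fd)    # O_CREAT | O_EXCL raises FileExistsError if the marker file already exists
+        return locked
+    except FileExistsError:
+        return None
+```
+
+Because both operations either succeed exactly once or raise `FileExistsError`, two workers can never claim the same entry; a worker that loses the race simply picks another unclaimed folder.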