Commit b028346
fix: Use server side defaults for Train definitions (#177)
Issue:
We [do not](https://github.com/aws-controllers-k8s/sagemaker-controller/blob/3996ba037349bb0b65088341afdfd03cf657e09c/generator.yaml#L326) late-initialize some parameters in TrainingJobDefinitions as we do in the TrainingJob definition. As a result, the controller requeues indefinitely unless every parameter is explicitly specified: the server sends back the default values it uses, and those never match the fields left unset in the spec.
Description of changes:
`pkg/resource/hyper_parameter_tuning_job/custom_delta.go` - Sets some parameters to their server side default.
`pkg/resource/hyper_parameter_tuning_job/testdata/v1alpha1/readone/observed/completed_variation.yaml` - Modified unit test.
CRD I used to test:
```
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: HyperParameterTuningJob
metadata:
  name: 2022-10-31-hpo-3
spec:
  hyperParameterTuningJobName: 2022-10-31-hpo-3
  hyperParameterTuningJobConfig:
    strategy: Bayesian
    resourceLimits:
      maxNumberOfTrainingJobs: 2
      maxParallelTrainingJobs: 1
    trainingJobEarlyStoppingType: Auto
  trainingJobDefinitions:
  - staticHyperParameters:
      base_score: '0.5'
    definitionName: training-job-for-hpo
    algorithmSpecification:
      trainingImage: 433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost:1
      trainingInputMode: File
    roleARN: <arn>
    tuningObjective:
      type_: Minimize
      metricName: validation:error
    hyperParameterRanges:
      integerParameterRanges:
      - name: num_round
        minValue: '10'
        maxValue: '20'
        scalingType: Linear
      continuousParameterRanges:
      - name: gamma
        minValue: '0'
        maxValue: '5'
        scalingType: Linear
    inputDataConfig:
    - channelName: train
      dataSource:
        s3DataSource:
          s3DataType: S3Prefix
          s3URI: <train>
          s3DataDistributionType: FullyReplicated
      contentType: text/libsvm
      compressionType: None
      recordWrapperType: None
      inputMode: File
    - channelName: validation
      dataSource:
        s3DataSource:
          s3DataType: S3Prefix
          s3URI: <validation>
          s3DataDistributionType: FullyReplicated
      contentType: text/libsvm
      compressionType: None
      recordWrapperType: None
      inputMode: File
    outputDataConfig:
      s3OutputPath: <output>
    resourceConfig:
      instanceType: ml.m5.large
      instanceCount: 1
      volumeSizeInGB: 25
    stoppingCondition:
      maxRuntimeInSeconds: 3600
    enableNetworkIsolation: true
    enableInterContainerTrafficEncryption: false
  tags:
  - key: algorithm
    value: xgboost
  - key: environment
    value: testing
  - key: customer
    value: test-user
```
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

1 parent 3996ba0 commit b028346
2 files changed: +31 -1
- pkg/resource/hyper_parameter_tuning_job
- testdata/v1alpha1/readone/observed

Lines changed: 31 additions & 0 deletions
New lines 35-65 added.
Lines changed: 0 additions & 1 deletion
Original line 112 removed.