
Conversation

@zanderfriz

What this PR does: Introduces a proposal for a Crossplane provider for the cortex project to declaratively manage Cortex Alertmanager and Ruler configurations through Kubernetes Custom Resources.

Which issue(s) this PR fixes: N/A
Checklist

  • [N/A] Tests updated
  • Documentation added
  • [N/A] CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@dosubot dosubot bot added component/alertmanager component/rules Bits & bobs todo with rules and alerts: the ruler, config service etc. labels Nov 3, 2025
@friedrichg
Member

Thanks! Please follow https://github.com/cortexproject/cortex/pull/7085/checks?check_run_id=54406852290 to fix the DCO

@zanderfriz zanderfriz force-pushed the proposal-crossplane-provider branch from 3f0d67e to 2f35332 on November 4, 2025 19:12
@friedrichg
Member

@zanderfriz please rebase so that CI passes on the PR. We made some changes in GitHub Actions.

@friedrichg
Member

I am in support of this proposal.

I have two requests before merging this as accepted:

  • Let's put this in a separate repo inside cortexproject, where the selected maintainers will be able to keep this component updated.
  • We need two maintainers for this (I can't be a maintainer, sorry). I am expecting you will be one of the maintainers. Can you find one person to help you with this?

@alolita

alolita commented Nov 18, 2025

+1 on making sure there are at least two maintainers for this provider component.

I support a separate repo within the project.

@SungJin1212
Member

+1

Signed-off-by: afrisvold <afrisvold@apple.com>
@zanderfriz zanderfriz force-pushed the proposal-crossplane-provider branch from 2f35332 to 40fdd1f on November 20, 2025 19:26
@zanderfriz
Author

After discussing with @devopsjedi, he said he would be happy to be a maintainer on this project.

@devopsjedi

> After discussing with @devopsjedi, he said he would be happy to be a maintainer on this project.

Agreed, excited to support this effort!

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Nov 21, 2025

#### TenantConfig

The TenantConfig CRD manages connection details and authentication for a specific Cortex tenant:
Contributor


Why connection and auth only? And how will the tenant config be consumed by Cortex?

Author


TenantConfig is the configuration the Crossplane provider uses to connect to the Cortex instance as a tenant. It is not for configuring a tenant on Cortex as the Cortex administrator.
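For illustration, here is a rough sketch of what a TenantConfig could look like; the field names under `forProvider` (`address`, `tenantID`, `credentialsSecretRef`) are assumptions, not the final schema:

```yaml
# Hypothetical sketch only; the forProvider field names are assumptions.
apiVersion: config.cortexmetrics.io/v1alpha1
kind: TenantConfig
metadata:
  name: production-tenant
spec:
  forProvider:
    # Cortex endpoint the provider connects to for this tenant (assumed field).
    address: http://cortex-gateway.cortex.svc:8080
    # Tenant ID sent as the X-Scope-OrgID header (assumed field).
    tenantID: team-a
    # Optional credentials pulled from a Kubernetes Secret (assumed field).
    credentialsSecretRef:
      name: cortex-tenant-token
      namespace: observability
      key: token
  providerConfigRef:
    name: cortex-config
```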


@forestsword forestsword left a comment


I had written a long-winded version of this comment but realized it was all just a matter of my internal organization's organization. In short, we can't use this version of the operator because we can't run Crossplane. We're an observability team and do not have the responsibility to run something that can also provision S3 buckets.

Also, neither the Prometheus nor the OpenTelemetry operator, two workhorses of our observability infrastructure, requires that we run Crossplane; why should Cortex?

Don't get me wrong, I don't want to trash the idea of Crossplane. It's better for the Cortex community to have a Crossplane provider than nothing. But we won't be able to use it where I work, and that makes me sad.

Comment on lines +134 to +135
providerConfigRef:
name: cortex-config


What is this referring to? Is it Crossplane-specific?

Author


Yes, this references the Crossplane ProviderConfig for the Cortex provider.
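For reference, this is the typical shape of a Crossplane ProviderConfig; the API group `cortex.crossplane.io` is an assumption for this provider:

```yaml
# Sketch of a typical Crossplane ProviderConfig; the API group is assumed.
apiVersion: cortex.crossplane.io/v1beta1
kind: ProviderConfig
metadata:
  name: cortex-config        # matched by providerConfigRef.name above
spec:
  credentials:
    source: Secret
    secretRef:
      namespace: crossplane-system
      name: cortex-credentials
      key: credentials
```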

Comment on lines +167 to +168
tenantConfigRef:
name: production-tenant


We run multiple clusters, and it would be helpful to be able to specify multiple clusters that rules should be deployed to. Otherwise we'd need a RuleGroup per Cortex cluster.

Author


This is a use case where Crossplane can shine. The provider's job is to provide the primitive objects you need. As the platform owner, you can create an XRD/Composition that fits your use case. For example:

Create the XRD:

apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
  name: xsharedrulegroups.cortex.platform.example.com
spec:
  group: cortex.platform.example.com
  names:
    kind: XSharedRuleGroup
    plural: xsharedrulegroups
  claimNames:
    kind: SharedRuleGroup  # What users create
    plural: sharedrulegroups
  versions:
    - name: v1alpha1
      served: true
      referenceable: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                tenantRefs:
                  type: array
                  description: "List of TenantConfigs to apply rules to"
                  items:
                    type: object
                    properties:
                      name:
                        type: string
                      namespace:
                        type: string
                    required: [name]
                namespace:
                  type: string
                  description: "Cortex rules namespace"
                groupName:
                  type: string
                  description: "Rule group name"
                interval:
                  type: string
                  default: "1m"
                rules:
                  type: array
                  description: "Alert/recording rules"
                  # ... same schema as RuleGroup.spec.forProvider.rules
              required:
                - tenantRefs
                - namespace
                - groupName
                - rules

Create a Composition with a function:

apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: sharedrulegroup-fanout
spec:
  compositeTypeRef:
    apiVersion: cortex.platform.example.com/v1alpha1
    kind: XSharedRuleGroup
  mode: Pipeline
  pipeline:
    - step: fan-out-to-tenants
      functionRef:
        name: function-go-templating
      input:
        apiVersion: gotemplating.fn.crossplane.io/v1beta1
        kind: GoTemplate
        source: Inline
        inline:
          template: |
            {{- range $i, $tenant := .observed.composite.resource.spec.tenantRefs }}
            ---
            apiVersion: config.cortexmetrics.io/v1alpha1
            kind: RuleGroup
            metadata:
              name: {{ $.observed.composite.resource.metadata.name }}-{{ $tenant.name }}
              annotations:
                crossplane.io/composition-resource-name: rulegroup-{{ $tenant.name }}
            spec:
              forProvider:
                tenantConfigRef:
                  name: {{ $tenant.name }}
                  {{- if $tenant.namespace }}
                  namespace: {{ $tenant.namespace }}
                  {{- end }}
                namespace: {{ $.observed.composite.resource.spec.namespace }}
                groupName: {{ $.observed.composite.resource.spec.groupName }}
                {{- if $.observed.composite.resource.spec.interval }}
                interval: {{ $.observed.composite.resource.spec.interval }}
                {{- end }}
                rules: {{ toJson $.observed.composite.resource.spec.rules }}
            {{- end }}

Then create your SharedRuleGroup:

apiVersion: cortex.platform.example.com/v1alpha1
kind: SharedRuleGroup
metadata:
  name: platform-cpu-alerts
  namespace: platform-team
spec:
  tenantRefs:
    - name: team-a-tenant
    - name: team-b-tenant
    - name: team-c-tenant

  namespace: "monitoring"
  groupName: "cpu-alerts"
  interval: "30s"

  rules:
    - alert: HighCPUUsage
      expr: 'rate(cpu_usage[5m]) > 0.8'
      for: "5m"
      labels:
        severity: warning
      annotations:
        summary: "CPU usage above 80% for 5 minutes"

This is the way most providers are written. For example, provider-aws ships with primitives like VPC, Subnet, and Instance, which map 1:1 to the AWS API, but it does not ship with an XCluster that composes multiple primitives to define a Kubernetes cluster in Crossplane.

The RuleGroup CRD manages Prometheus alerting and recording rules within a Cortex namespace:

```yaml
apiVersion: config.cortexmetrics.io/v1alpha1


I've heard some talk of people wanting a Prometheus-operator-compatible API for Cortex CRDs. Would that be a goal here?

Author


That is not a goal and IMHO would bring a lot of unnecessary complexity. If you have a proposal for how this could be done, I'd be open to hearing more.

- Applies necessary changes via HTTP API calls
- Updates resource status with current state and any errors

2. **External Resource Identification**: Resources are identified using:


I think it's possible (obviously not ideal, but I've seen a lot of mistakes in my life) that you could have the same alerts defined on two clusters in the exact same namespace, and without further identifying attributes they would conflict with each other. Each operator would try to take control. I think it might be necessary to provide additional identifying attributes to prevent conflicts like this. For instance, each operator would be passed k8s.cluster.name at start as an identifying attribute, and resources would be saved in Cortex like k8s.cluster.name/k8s.namespace.name/resource. Wdyt?


Each object in the Cortex API needs a unique ID, and this scheme is a good example. The encoding scheme used to map clusters/tenants/namespaces to objects in Cortex is something we could document as a best practice. It would not be explicitly enforced within the provider.
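As a hedged illustration of that convention (documented best practice only, nothing the provider would enforce), a RuleGroup could embed the source cluster in the Cortex rules namespace:

```yaml
apiVersion: config.cortexmetrics.io/v1alpha1
kind: RuleGroup
metadata:
  name: cpu-alerts
spec:
  forProvider:
    tenantConfigRef:
      name: production-tenant
    # Encode the source cluster into the Cortex namespace so two clusters
    # defining the same group never collide (convention only, not enforced).
    namespace: "prod-us-east-1/monitoring"
    groupName: "cpu-alerts"
```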

Member

@friedrichg friedrichg Dec 1, 2025


I am not very familiar with Crossplane, but I think you can hit this problem with any Crossplane provider. For example, if you use s3.aws.crossplane.io to define an S3 bucket with the same name in the same region from two different Kubernetes clusters, conflicts will appear.

I think one way to solve this problem is to use Crossplane Compositions so that the tenant config is constructed from the namespace name and the Kubernetes cluster name.
https://docs.crossplane.io/latest/composition/compositions/
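A rough sketch of that idea using a classic resources-mode Composition patch; the composite kind and its `spec.clusterName`/`spec.namespace` fields are assumptions, not part of the proposal:

```yaml
# Illustrative only: assumes an XR exposing spec.clusterName and spec.namespace.
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: rulegroup-per-cluster
spec:
  compositeTypeRef:
    apiVersion: cortex.platform.example.com/v1alpha1
    kind: XClusterRuleGroup
  resources:
    - name: rulegroup
      base:
        apiVersion: config.cortexmetrics.io/v1alpha1
        kind: RuleGroup
        spec:
          forProvider:
            tenantConfigRef:
              name: production-tenant
      patches:
        # Build the Cortex namespace as "<cluster>/<k8s namespace>" so the same
        # group name defined in two clusters maps to two distinct objects.
        - type: CombineFromComposite
          combine:
            variables:
              - fromFieldPath: spec.clusterName
              - fromFieldPath: spec.namespace
            strategy: string
            string:
              fmt: "%s/%s"
          toFieldPath: spec.forProvider.namespace
```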

Author


This is a good callout if you are running multiple operators. In general, Crossplane operators run in an admin/central cluster, not the managed/edge clusters. That said, I will create an issue to address it.

**Comparison**:
- **Pros**: Direct control over implementation, no external dependencies
- **Cons**:
- Requires building and maintaining complex controller infrastructure


I'm not sure what complex infrastructure would be required for a classic k8s operator other than running the operator and setting it up with the API server. Running Crossplane is more complex from my perspective, especially because its feature set extends way beyond just Cortex. Could you provide an example?

Author


While this is a Crossplane provider, it can work as a standalone Kubernetes operator. Crossplane providers are essentially highly opinionated operators that allow interaction with the Crossplane ecosystem (specifically XRDs). If we made our own operator, we'd have to define our own opinions. Here we get to reuse a lot of libraries, best practices, etc. that the Crossplane community has already put a lot of thought into.

- **Pros**: Direct control over implementation, no external dependencies
- **Cons**:
- Requires building and maintaining complex controller infrastructure
- No composition or configuration management capabilities


I don't understand this. I don't see this as the responsibility of an operator; it's the 'deployment delivery' tech, like Helm or Tanka, that does this. Could you provide an example of how the provider would do this?

Author


This references the Crossplane concept of Compositions, where a Crossplane admin team can create high-level Compositions so that users need only minimal configuration. I gave an example of how this could be useful above, when you asked how to apply the same rules across multiple tenants.

- **Cons**:
- Requires building and maintaining complex controller infrastructure
- No composition or configuration management capabilities
- Limited reusability across different Kubernetes clusters


I disagree: not everyone can or will use Crossplane, but everyone can run a classic operator IMO.

Author


Feel free to run this as an operator without running the full Crossplane system. It works as a standalone operator.

- Requires building and maintaining complex controller infrastructure
- No composition or configuration management capabilities
- Limited reusability across different Kubernetes clusters
- Missing advanced features like external secret management


Could you provide an example? We'd be delivering secrets via the External Secrets Operator from Vault. We would only need to reference the secret as described in the CRDs above.

Author


Technically, anything Crossplane does you could implement yourself in an operator. Some things, like cross-namespace secretRefs, come built in with the Crossplane runtime library. Good comment.
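For example, the crossplane-runtime secret selector carries its own namespace field, so a managed resource can reference a Secret outside its own namespace; the `credentialsSecretRef` field name below is an assumption for this provider:

```yaml
# Fragment of a hypothetical TenantConfig spec; the field name is illustrative.
spec:
  forProvider:
    credentialsSecretRef:              # crossplane-runtime style secret selector
      namespace: vault-synced-secrets  # may differ from the resource's namespace
      name: cortex-tenant-token
      key: token
```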

- No composition or configuration management capabilities
- Limited reusability across different Kubernetes clusters
- Missing advanced features like external secret management
- Significant development and maintenance overhead


This is subjective. There are years of experience out there running and writing k8s operators, from OpenTelemetry to Prometheus as examples. Crossplane is much younger and not a given. Kubebuilder, for all its limitations, does provide relief from much of the plumbing.

Author

@zanderfriz zanderfriz Dec 1, 2025


This goes back to us being able to use the development best practices and tools already written by the Crossplane project. I wound up using the xp-provider-gen repository to stub out most of my provider. It uses the build/test best practices from the Crossplane project, and I got to focus on the business logic of interacting with Cortex.

@friedrichg friedrichg requested a review from CharlieTLe December 1, 2025 20:18
@zanderfriz
Author

> I had written a long-winded version of this comment but realized it was all just a matter of my internal organization's organization. In short, we can't use this version of the operator because we can't run Crossplane. We're an observability team and do not have the responsibility to run something that can also provision S3 buckets.
>
> Also, neither the Prometheus nor the OpenTelemetry operator, two workhorses of our observability infrastructure, requires that we run Crossplane; why should Cortex?
>
> Don't get me wrong, I don't want to trash the idea of Crossplane. It's better for the Cortex community to have a Crossplane provider than nothing. But we won't be able to use it where I work, and that makes me sad.

@forestsword I really appreciate the feedback. Technically you don't need to run providers like provider-aws that enable the deployment of S3 buckets and other resources; you could run Crossplane with the Cortex provider as the only one installed. That being said, internal organization policies are just that. I'd encourage you to try running the provider as a standalone operator. You can also save yourself some copy-paste by using kustomize to manage your TenantConfig, RuleGroup, and AlertmanagerConfig objects, which would let you easily share a base RuleGroup between clusters.
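Something like this rough kustomize layout (names are illustrative) would let a base RuleGroup be shared while each cluster overlay swaps in its own tenant:

```yaml
# base/kustomization.yaml - shared RuleGroup definition
resources:
  - rulegroup.yaml
```

```yaml
# overlays/cluster-a/kustomization.yaml - per-cluster tenant wiring
resources:
  - ../../base
patches:
  - target:
      kind: RuleGroup
    patch: |-
      - op: replace
        path: /spec/forProvider/tenantConfigRef/name
        value: cluster-a-tenant
```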
