Install Kubeflow Operators
This page describes how to deploy Kubeflow-related operators in Alauda AI 2.3 and later.
Starting in v26.3.0 (Alauda AI v2.3), Kubeflow ships as OLM Helm Operators rather than the earlier Cluster Plugin form factor. Installation is performed from the ACP OperatorHub instead of Cluster Plugins. The operators wrap the upstream kubeflow/manifests 26.03 charts.
Supported operators:
- kfbase-operator: Kubeflow base components, including authentication and authorization, the central dashboard, Notebooks, PVC Viewer, TensorBoards, Volumes, Model Registry UI, KServe Endpoints UI, and the Model Catalog API service. Owns the KubeflowBase CR. Architecture: amd64, arm64.
- kfp-operator: Kubeflow Pipelines (KFP runtime 2.16.0). Owns the KubeflowPipelines CR. Architecture: amd64 only (Kubeflow Pipelines is x86_64-only per Alauda AI v2.3 supported configurations).
- kubeflow-trainer-operator: Kubeflow Trainer v2 (controller-manager 2.1.0 + JobSet 0.10.1). Owns the KubeflowTrainer CR. Architecture: amd64, arm64. Replaces the deprecated kftraining plugin.
- model-registry-operator: Kubeflow Model Registry Operator (unchanged form factor).
Note: The kftraining Cluster Plugin (Kubeflow Training Operator v1) was deprecated in earlier versions and has been retired in v26.3.0. Use kubeflow-trainer-operator (Trainer v2) instead.
Environment Preparation
Before you begin, make sure the following prerequisites are met:
- An ACP environment is available and running.
- Alauda AI is already deployed. Alauda AI 2.3 or later is required for the v26.3.0 operator set.
- Alauda Build of KServe is installed.
- ASM is deployed in the business cluster where Kubeflow will run. If ASM is not already installed, deploy it before continuing. ASM v1 is deprecated. Use ASM v2 whenever possible.
- The LWS plugin, Alauda Build of LeaderWorkerSet, is installed if you plan to deploy kubeflow-trainer-operator.
- The oauth2-proxy plugin is configured as described below.
Configure Dex Redirection
Note: Configure the platform access URL for Dex redirection before installing the kfbase-operator and creating its KubeflowBase CR. This step may update the platform CA certificate. If the certificate changes after you configure oauth2-proxy, the oauth2-proxy configuration may fail.
In Administrator > System Settings > Platform Parameters, click Edit next to Platform Access URLs and add a redirect URL in the format https://<your-kubeflow-domain>, for example https://kubeflow.example.com.
<your-kubeflow-domain> must match spec.global.kubeflowHost in the KubeflowBase CR.
Configure the oauth2-proxy Plugin
Get the platform Dex CA certificate for use later in the Global cluster:
Configure ASM v1 (Deprecated)
In the global cluster, or in ACP Platform Management > Resource Management, update the ServiceMesh resource and add the following content under spec.
Note: If spec.values.pilot.jwksResolverExtraRootCA is already configured, update only spec.meshConfig.extensionProviders. Add new entries without deleting the existing ones.
Configure ASM v2
Note: If any ASM v1 webhooks are still present, delete them first. Otherwise Kubeflow authentication may fail.
In ACP, go to Administrator > MarketPlace > OperatorHub, find Alauda Service Mesh v2, open the All Instances tab, locate the instance of type Istio such as default, click Update, and add the following content under spec:
Component Onboarding
Download the operator bundle packages for the following operators and upload them with violet. The bundles register the operators with the ACP OperatorHub.
- kfbase-operator: Kubeflow base functionality (owns the KubeflowBase CR).
- kfp-operator: Kubeflow Pipelines (owns the KubeflowPipelines CR). amd64-only.
- kubeflow-trainer-operator: Kubeflow Trainer v2 (owns the KubeflowTrainer CR). Replaces the deprecated kftraining.
- model-registry-operator: Kubeflow Model Registry Operator.
Deployment Steps
1. Deploy kfbase-operator (Kubeflow Base)
In Administrator > MarketPlace > OperatorHub, find the kfbase-operator and click Install. Then open the All Instances tab and create a KubeflowBase CR with spec.global.kubeflowHost set to your Kubeflow domain. Wait for the deployment to finish.
After deployment:
- In Administrator > System Settings > Platform Parameters, verify that Platform Access URLs contains an address in the format https://<your-kubeflow-domain>, where <your-kubeflow-domain> matches spec.global.kubeflowHost in the KubeflowBase CR.
- Configure DNS resolution, or add a local hosts entry, so that <your-kubeflow-domain> resolves to the IP address assigned to the gateway. To look up the gateway, run kubectl -n istio-system get gateway kubeflow-external-gateway.
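The DNS or hosts step above needs the gateway's address. The following is a minimal sketch, assuming kubeflow-external-gateway is a Kubernetes Gateway API resource that reports its address under .status.addresses. If the gateway in your cluster is a different kind, the jsonpath returns nothing and the address has to be read elsewhere. The function names are illustrative and not part of the product.

```shell
#!/usr/bin/env bash
# Assumption: the Gateway reports its address at .status.addresses[0].value
# (Gateway API convention). Adjust the jsonpath if your cluster differs.
gateway_address() {
  kubectl -n istio-system get gateway kubeflow-external-gateway \
    -o jsonpath='{.status.addresses[0].value}'
}

# Print a hosts-file line that maps the gateway address to the Kubeflow domain.
# $1 is the domain, which must equal spec.global.kubeflowHost.
hosts_entry() {
  local host="$1" addr
  addr="$(gateway_address)"
  if [ -z "$addr" ]; then
    echo "no address reported for kubeflow-external-gateway" >&2
    return 1
  fi
  printf '%s %s\n' "$addr" "$host"
}
```

For a quick local test, hosts_entry kubeflow.example.com prints a line that can be appended to the hosts file of the machine you browse from. A DNS record pointing at the same address is the better choice for shared environments.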
After deployment, the Kubeflow entry appears under Tools in Alauda AI.
For upgrade-specific actions, see Upgrade Kubeflow Operators.
2. Create a Kubeflow User Namespace and Bind a User
Before a user signs in to Kubeflow for the first time, bind the ACP user to a namespace. The following example creates namespace kubeflow-admin-cpaas-io and assigns admin@cpaas.io as the owner.
Note: If this Profile resource was already created during Alauda AI deployment, you can skip this step.
Note: You may need to lower the Pod Security Admission level of the user namespace before creating Notebook instances and similar workloads.
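The Profile described in this step can be sketched as follows, assuming the upstream Kubeflow Profile API (kubeflow.org/v1), in which the Profile name becomes the namespace name and spec.owner identifies the user. The helper names are illustrative. The Pod Security Admission helper assumes baseline as the target level, which this page does not specify, so pick the level your workloads actually need.

```shell
#!/usr/bin/env bash
# Render a Kubeflow Profile manifest.
# $1 = namespace (also the Profile name), $2 = ACP user who becomes the owner.
render_profile() {
  local ns="$1" user="$2"
  cat <<EOF
apiVersion: kubeflow.org/v1
kind: Profile
metadata:
  name: ${ns}
spec:
  owner:
    kind: User
    name: ${user}
EOF
}

# Lower the Pod Security Admission enforce level on a namespace.
# $1 = namespace, $2 = level (assumed default: baseline).
relax_psa() {
  kubectl label namespace "$1" \
    "pod-security.kubernetes.io/enforce=${2:-baseline}" --overwrite
}
```

With the values from this step, render_profile kubeflow-admin-cpaas-io admin@cpaas.io | kubectl apply -f - creates the Profile, and relax_psa kubeflow-admin-cpaas-io adjusts the namespace label afterwards if Notebook pods are rejected.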
3. Bind a User to an Existing Namespace
If Alauda AI was already deployed and the namespace kubeflow-admin-cpaas-io already exists, the Profile may also already exist. If the namespace still does not appear in Kubeflow, create the following resources to bind the account to the namespace:
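As a hedged illustration of such a binding, the sketch below renders a RoleBinding modeled on upstream Kubeflow's manual contributor binding. The naming pattern, the role and user annotations, and the kubeflow-edit ClusterRole are upstream conventions and are assumptions here, so the resources this platform expects may differ. Upstream also pairs the RoleBinding with an Istio AuthorizationPolicy whose principals depend on the mesh setup; that part is not sketched.

```shell
#!/usr/bin/env bash
# Lowercase the user name and replace every character outside [a-z0-9] with "-".
safe_name() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | sed 's/[^a-z0-9]/-/g'
}

# Render a RoleBinding that grants a user access to an existing namespace.
# $1 = namespace, $2 = user, $3 = role suffix (assumed default: edit).
render_binding() {
  local ns="$1" user="$2" role="${3:-edit}"
  cat <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: user-$(safe_name "$user")-clusterrole-${role}
  namespace: ${ns}
  annotations:
    role: ${role}
    user: ${user}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kubeflow-${role}
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: User
    name: ${user}
EOF
}
```

Review the rendered output from render_binding kubeflow-admin-cpaas-io admin@cpaas.io before piping it to kubectl apply -f -, and confirm that the referenced ClusterRole exists in your cluster.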
4. Deploy kfp-operator (Kubeflow Pipelines)
In Administrator > MarketPlace > OperatorHub, find the kfp-operator and click Install. Then create a KubeflowPipelines CR in the operator instance namespace. After the CR reconciles, KFP runtime components are deployed and pipeline-related features become available in the Kubeflow UI.
Note: kfp-operator is amd64-only. Do not install it on arm64-only clusters.
Note: Pipeline-related features become available in the Kubeflow UI only after KubeflowPipelines is reconciled.
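Because kfp-operator is amd64-only, it can help to check node architectures before installing it. The following is a minimal sketch that reads the standard kubernetes.io/arch node label. The function names are illustrative.

```shell
#!/usr/bin/env bash
# List the CPU architecture label of every node, one per line.
node_architectures() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.labels.kubernetes\.io/arch}{"\n"}{end}'
}

# Succeed only if at least one node reports amd64.
has_amd64_node() {
  node_architectures | grep -x 'amd64' >/dev/null
}
```

A typical use is has_amd64_node || echo "no amd64 nodes: skip kfp-operator". On mixed clusters the check passes as soon as one amd64 node exists, so it does not by itself guarantee that pipeline pods are scheduled onto those nodes.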
5. Deploy Kubeflow Model Registry
In Administrator > MarketPlace > OperatorHub, find Model Registry Operator and click Install.
After the operator is installed, open the All Instances tab and create a ModelRegistry instance in the user's namespace.
Note: Create the instance in a namespace that is already bound to a Kubeflow Profile. Otherwise the Model Registry UI is not displayed.
When creating the instance, configure the following fields as needed:
- Name: Name of the Model Registry instance.
- Namespace: Namespace where the instance will run. This must be a namespace that is already bound to a Kubeflow Profile.
- MySQL Storage Class: Storage class used for Model Registry metadata, for example standard.
- MySQL Storage Size: Storage size for the metadata database. The default is 10Gi.
- DisplayName: Display name of the Model Registry instance.
- Description: Short description of the instance.
Note: After the instance starts, refresh the Model Registry entry in the Kubeflow left navigation to see the new instance. Before the first instance is created, the Model Registry page is empty.
Note: The Model Registry instance restricts network requests from other namespaces. To allow additional namespaces, edit the AuthorizationPolicy for the instance, for example kubectl -n <your-namespace> edit authorizationpolicy <model-registry-name>, and update the policy according to the Istio documentation.
Note: You can deploy multiple Model Registry instances in different namespaces. Each instance is independent.
6. Deploy kubeflow-trainer-operator (Kubeflow Trainer v2)
Note: If the deprecated kftraining Cluster Plugin is still installed (from a pre-v26.3.0 cluster), uninstall it before installing kubeflow-trainer-operator.
Note: Install the LWS plugin before deploying kubeflow-trainer-operator, because LWS is a dependency of kubeflow-trainer-operator.
Note: v26.3.0 of kubeflow-trainer-operator aligns with upstream kubeflow/manifests 26.03 and ships Trainer v2.1.0 + JobSet v0.10.1. For clusters where an OLM CatalogSource already advertises a higher trainer version (>=2.2.0), install with installPlanApproval: Manual and startingCSV: kubeflow-trainer-operator.v2.1.0 to prevent OLM from auto-upgrading past the 26.03 pin.
In Administrator > MarketPlace > OperatorHub, find kubeflow-trainer-operator and click Install. If you need version pinning, select Manual install-plan approval. Then open the All Instances tab and create a KubeflowTrainer CR with JobSet enabled.
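If you install through OLM resources rather than the UI, the version pin described in the note can be expressed as a Subscription. The following is a minimal sketch, assuming the OLM package name equals kubeflow-trainer-operator. The namespace, channel, and CatalogSource values are placeholders that this page does not specify and that must come from your cluster.

```shell
#!/usr/bin/env bash
# Render an OLM Subscription pinned to the 26.03 trainer CSV.
# $1 = install namespace, $2 = channel, $3 = CatalogSource name,
# $4 = CatalogSource namespace. All four are cluster-specific placeholders.
render_trainer_subscription() {
  local ns="$1" channel="$2" catalog="$3" catalog_ns="$4"
  cat <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: kubeflow-trainer-operator
  namespace: ${ns}
spec:
  # Assumption: the package name matches the operator name.
  name: kubeflow-trainer-operator
  channel: ${channel}
  source: ${catalog}
  sourceNamespace: ${catalog_ns}
  installPlanApproval: Manual
  startingCSV: kubeflow-trainer-operator.v2.1.0
EOF
}
```

With Manual approval, OLM creates an InstallPlan and waits. Inspect it with kubectl -n <namespace> get installplan and approve only the plan that targets kubeflow-trainer-operator.v2.1.0, leaving later plans unapproved to hold the pin.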