splunk-ai-operator

Custom Resource Guide

The Splunk AI Operator provides a collection of custom resources you can use to manage Splunk AI Platform deployments in your Kubernetes cluster.

For examples of how to use these custom resources, see Configuring Splunk Enterprise Deployments.

Metadata Parameters

All Kubernetes resources include a metadata section. Use it to name a specific instance of the resource and to choose the namespace it resides in:

Key Type Description
name string A name that uniquely identifies this instance of the resource.
namespace string The namespace in which the instance is created. The namespace must already exist.

If you do not provide a namespace, the namespace from your current kubectl context will be used.

apiVersion: ai.splunk.com/v1
kind: AIPlatform
metadata:
  name: example
  namespace: test
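Because the namespace must exist before the resource is created, create it first. This sketch assumes the test namespace from the example above and a local manifest file named aiplatform.yaml:

```shell
# Create the namespace referenced in metadata.namespace
kubectl create namespace test

# Then apply the manifest (filename is an example)
kubectl apply -f aiplatform.yaml
```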

AI Platform Spec Parameters

apiVersion: ai.splunk.com/v1
kind: AIPlatform
metadata:
  name: example
  labels:
    app.kubernetes.io/name: splunk-ai-platform-example
    app.kubernetes.io/instance: example
    app.kubernetes.io/version: 0.1.0
spec:
  objectStorage:
    path: "s3://my-ai-bucket"
    region: "us-west-2"
    secretRef: s3-secret
  serviceAccountName: "ai-platform-sa"
  features:
    - name: "saia"
      serviceAccountName: "saia-sa"
      version: "0.1.0"
  workerGroupConfig:
    serviceAccountName: "ray-worker-sa"
    imageRegistry: "123456789012.dkr.ecr.us-west-2.amazonaws.com/ray/ray-worker-gpu"  
  sidecars:
    envoy: true
    otel: true
    prometheusOperator: true
  certificateRef: "platform-issuer"
  clusterDomain: "cluster.local"
  images:
    saiaImage: "splunkai/saia:latest"
    weaviateImage: "docker.io/weaviate:latest"
    rayHeadGroupImage: "rayproject/ray-head:latest"
    rayWorkerGroupImage: "rayproject/ray-worker:latest"
  defaultAcceleratorType: "L40S"
  splunkConfiguration:
    crName: "splunk-standalone"
    crNamespace: "default"
    secretRef:
        name: "splunk-secret"
        namespace: "default"
    endpoint: "https://splunk.default.svc.cluster.local:8089"
    # Optional, if not using secretRef
    # token: "splunk-token"
  # Persistent storage for Weaviate vector database
  storage:
    vectorDB:
      # Option 1: Use existing PVC
      # pvcName: "my-existing-pvc"

      # Option 2: Create dynamic PVC (recommended)
      size: "100Gi"
      storageClassName: "gp3"  # Use appropriate StorageClass

  # Scheduling for GPU workloads (Ray workers)
  gpuScheduler:
    nodeSelector:
      node.kubernetes.io/instance-type: "g5.2xlarge"
    tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"

  # Scheduling for CPU workloads (Ray head, Weaviate)
  cpuScheduler:
    nodeSelector:
      workload-type: "cpu"
    tolerations: []

  # External access via Ingress (optional)
  ingress:
    enabled: true
    className: "nginx"  # or "alb", "traefik", etc.
    annotations:
      cert-manager.io/cluster-issuer: "letsencrypt-prod"
      nginx.ingress.kubernetes.io/ssl-redirect: "true"
    hosts:
      - host: "ai.example.com"
        paths:
          - path: "/"
            pathType: "Prefix"
    tls:
      - hosts:
          - "ai.example.com"
        secretName: "ai-platform-tls"

  # mTLS certificates for secure communication (optional)
  mtls:
    enabled: true
    termination: "operator"  # Operator manages certificates
    secretName: "ai-platform-mtls"
    issuerRef:
      name: "ca-issuer"
      kind: "ClusterIssuer"
    dnsNames:
      - "saia.default.svc.cluster.local"

The AIPlatform resource provides the following Spec configuration parameters:

Key Type Description
objectStorage object Required. S3/GCS/Azure storage configuration for model artifacts. See Service Artifacts Storage
serviceAccountName string Kubernetes Service Account name. Used for IAM roles (IRSA on AWS) to access cloud resources
features array List of AI features to enable (e.g., saia for Splunk AI Assistant)
defaultAcceleratorType string GPU type for AI workloads (e.g., nvidia-tesla-t4, nvidia-a100, L40S)
gpuInstanceType string GPU instance type for Ray worker groups (e.g., g6.24xlarge, p4d.24xlarge)
workerGroupConfig object Ray worker node configuration (service account, image registry)
sidecars object Enable/disable sidecars: envoy, otel, prometheusOperator
clusterDomain string Kubernetes cluster domain suffix. Default: cluster.local
images object Container image overrides for Ray head/worker, SAIA, Weaviate
certificateRef string References a cert-manager Certificate or Issuer for mTLS
splunkConfiguration object Connection details for Splunk Enterprise instance
storage object Persistent storage for Weaviate vector database. See Storage Configuration
gpuScheduler object Node selectors, affinity, tolerations for GPU workloads
cpuScheduler object Node selectors, affinity, tolerations for CPU workloads (head, Weaviate)
ingress object External access configuration. Exposes AI services via HTTP/HTTPS. See Ingress Usage
mtls object mTLS/TLS certificates managed by cert-manager for secure service communication
serviceTemplate object Template used to create Kubernetes services for platform components
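Per the table above, only objectStorage is required; everything else shown in the full example is optional. A minimal manifest, reusing the bucket and secret names from the earlier example, might look like:

```yaml
apiVersion: ai.splunk.com/v1
kind: AIPlatform
metadata:
  name: example
  namespace: test
spec:
  # Required: object storage for model artifacts
  objectStorage:
    path: "s3://my-ai-bucket"
    region: "us-west-2"
    secretRef: s3-secret
```

Optional parameters such as features, storage, and ingress can then be layered on as needed.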

AI Service Spec Parameters

The AIService CR is created automatically by the AIPlatform controller, so you do not create or configure an AIService CR on your own; it has no user-facing spec parameters.
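You can still inspect the generated resource. This sketch assumes the resource's plural name is aiservices (check with kubectl api-resources if it differs):

```shell
# List AIService resources created by the AIPlatform controller
kubectl get aiservices -n <namespace>

# Inspect one in detail
kubectl describe aiservice <name> -n <namespace>
```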

Monitoring Your AI Platform

Check Status

View the overall status of your AI Platform:

# View status conditions
kubectl get aiplatform <name> -n <namespace> -o jsonpath='{.status.conditions}' | jq .

# Check if platform is ready
kubectl get aiplatform <name> -n <namespace> -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'
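For scripts or CI pipelines, kubectl wait can block on the same Ready condition instead of polling:

```shell
# Wait up to 10 minutes for the Ready condition to become True
kubectl wait aiplatform/<name> -n <namespace> \
  --for=condition=Ready --timeout=600s
```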

Key Status Conditions: Ready reports overall platform readiness; IngressReady reports whether the Ingress has been provisioned (when ingress is enabled).

View Events

See what’s happening with your deployment:

# Watch all events
kubectl get events -n <namespace> --watch --field-selector involvedObject.name=<name>

# See recent events
kubectl describe aiplatform <name> -n <namespace> | grep -A 20 Events:

# Filter specific event types
kubectl get events -n <namespace> --field-selector reason=RayServiceReady
kubectl get events -n <namespace> --field-selector reason=PlatformDegraded

For more details on events and troubleshooting, see Error Handling and Events.

Quick Health Check

# One-liner to check if platform is ready
kubectl get aiplatform <name> -n <namespace> -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
# Output: True (ready) or False (not ready)

# Get Ray service name for accessing inference API
kubectl get aiplatform <name> -n <namespace> -o jsonpath='{.status.rayServiceName}'

# Get Weaviate service name
kubectl get aiplatform <name> -n <namespace> -o jsonpath='{.status.vectorDbServiceName}'

# Get Ingress address (if enabled)
kubectl get aiplatform <name> -n <namespace> -o jsonpath='{.status.conditions[?(@.type=="IngressReady")].message}'
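The one-liners above can be combined into a small health-check script; the script name and argument order are illustrative:

```shell
#!/usr/bin/env sh
# Usage: ./check.sh <name> <namespace>
NAME=$1; NS=$2

READY=$(kubectl get aiplatform "$NAME" -n "$NS" \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')

if [ "$READY" = "True" ]; then
  echo "Platform ready"
  echo "Ray service: $(kubectl get aiplatform "$NAME" -n "$NS" \
    -o jsonpath='{.status.rayServiceName}')"
  echo "Vector DB:   $(kubectl get aiplatform "$NAME" -n "$NS" \
    -o jsonpath='{.status.vectorDbServiceName}')"
else
  echo "Platform not ready (Ready=$READY)"
  # Show recent events for troubleshooting
  kubectl get events -n "$NS" \
    --field-selector involvedObject.name="$NAME" | tail -n 5
fi
```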