This guide helps you understand what’s happening with your AI Platform deployments using Kubernetes events and status conditions.
# Check overall status
kubectl get aiplatform <name> -n <namespace>
# Get detailed readiness
kubectl get aiplatform <name> -n <namespace> -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'
If status is "True", your platform is ready.
If status is "False", check the message field to see what is wrong.
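The True/False check above can be scripted. The following is a minimal sketch, not part of the operator: the function names `interpret_ready` and `check_ready` are hypothetical, and it assumes the Ready condition exposes `status` and `message` fields as described above.

```shell
# Map the Ready condition's status (and optional message) to a summary line.
# Returns 0 for "True", 1 for "False", 2 when the condition is absent.
interpret_ready() {
  case "$1" in
    True)  echo "ready"; return 0 ;;
    False) echo "not ready: ${2:-no message reported}"; return 1 ;;
    *)     echo "unknown: Ready condition not reported yet"; return 2 ;;
  esac
}

# Hypothetical wrapper: fetch status and message for one AIPlatform resource.
# $1 = resource name, $2 = namespace
check_ready() {
  name="$1"; namespace="$2"
  base='{.status.conditions[?(@.type=="Ready")]'
  status=$(kubectl get aiplatform "$name" -n "$namespace" -o jsonpath="${base}.status}")
  message=$(kubectl get aiplatform "$name" -n "$namespace" -o jsonpath="${base}.message}")
  interpret_ready "$status" "$message"
}
```

Calling `check_ready my-platform my-namespace` prints one summary line and sets the exit code, which makes it usable in CI gates. If the resource reports conditions in the standard Kubernetes shape, `kubectl wait --for=condition=Ready` with a `--timeout` should achieve a similar result without a script.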
# Watch events in real-time
kubectl get events -n <namespace> --watch --field-selector involvedObject.name=<name>
# See recent events
kubectl describe aiplatform <name> -n <namespace> | tail -30
Your AI Platform tracks several health indicators:
| Condition | What It Means | When It’s False |
|---|---|---|
| Ready | Everything is working | One or more components have issues |
| RayServiceReady | Ray cluster is operational | Ray is starting, upgrading, or failed |
| RayClusterReady | Ray pods are running | Pods are pending, failing, or not enough replicas |
| RayServeRouteReady | AI inference API is available | Applications failed to deploy or endpoints not ready |
| WeaviateDatabaseReady | Vector database is running | Weaviate pods are not ready |
| IngressReady | External access is configured | Ingress hasn’t received an address yet |
# Check if Ray is ready
kubectl get aiplatform <name> -n <namespace> \
-o jsonpath='{.status.conditions[?(@.type=="RayServiceReady")]}'
# Check if Weaviate is ready
kubectl get aiplatform <name> -n <namespace> \
-o jsonpath='{.status.conditions[?(@.type=="WeaviateDatabaseReady")]}'
# Check if external access is ready
kubectl get aiplatform <name> -n <namespace> \
-o jsonpath='{.status.conditions[?(@.type=="IngressReady")]}'
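Rather than running one command per component, the same query can be looped over every condition type. This is a sketch under assumptions: `condition_jsonpath` and `print_all_conditions` are hypothetical helper names, and the list of types is copied from the conditions table in this guide.

```shell
# Build the jsonpath expression that selects one condition's status.
condition_jsonpath() {
  printf '{.status.conditions[?(@.type=="%s")].status}' "$1"
}

# Print one line per condition type: the type, then its reported status.
# $1 = resource name, $2 = namespace
print_all_conditions() {
  name="$1"; namespace="$2"
  for type in Ready RayServiceReady RayClusterReady RayServeRouteReady \
              WeaviateDatabaseReady IngressReady; do
    status=$(kubectl get aiplatform "$name" -n "$namespace" \
      -o jsonpath="$(condition_jsonpath "$type")")
    # An empty result means the condition has not been reported yet.
    printf '%-24s %s\n' "$type" "${status:-not-reported}"
  done
}
```

Run it as `print_all_conditions my-platform my-namespace`. It issues one API call per condition, which is simple but slower than a single query on large clusters.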
Events tell you what’s happening as your platform deploys and runs.
These indicate successful operations:
| Event | Meaning |
|---|---|
| RayServiceCreated | Ray cluster was created successfully |
| RayServiceReady | Ray cluster is now operational |
| RayClusterReady | All Ray pods are running |
| RayServeReady | AI inference endpoints are available |
| WeaviateCreated | Vector database was created |
| WeaviateReady | Vector database is operational |
| IngressCreated | External access was configured |
| IngressReady | External URL is now available |
| PlatformReady | Everything is working! |
These indicate problems that need investigation:
| Event | What’s Wrong | What To Do |
|---|---|---|
| PlatformDegraded | One or more components failing | Check the message to see which components |
| RayServiceNotReady | Ray cluster is unhealthy | Check Ray pods and logs |
| RayApplicationErrors | AI models failed to load | Check application logs and model paths |
| RayClusterNotReady | Ray pods are failing | Check pod status and events |
| WeaviateNotReady | Vector database is failing | Check Weaviate pod status |
| IngressNotReady | External access lost | Check Ingress controller |
Check what’s failing:
kubectl get aiplatform <name> -n <namespace> -o jsonpath='{.status.conditions}' | jq '.[] | select(.status=="False")'
This shows all components that aren’t ready yet.
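If `jq` is not installed, a similar filter can be built from kubectl's own jsonpath `range` output and `awk`. This is a sketch only: `failing_conditions` and `list_failing` are hypothetical names, and it assumes each condition carries `type`, `status`, and `message` fields.

```shell
# Read tab-separated "type<TAB>status<TAB>message" rows on stdin and keep
# only those whose status column is exactly "False".
failing_conditions() {
  awk -F '\t' '$2 == "False" { printf "%s: %s\n", $1, $3 }'
}

# Hypothetical wrapper: $1 = resource name, $2 = namespace
list_failing() {
  kubectl get aiplatform "$1" -n "$2" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}' \
    | failing_conditions
}
```

`list_failing my-platform my-namespace` prints one `Type: message` line per failing condition and prints nothing when every condition is True.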
Check recent events:
kubectl get events -n <namespace> --field-selector involvedObject.name=<name> --sort-by='.lastTimestamp' | tail -20
Symptoms:
- RayApplicationErrors event
- RayServeRouteReady is False

Check which models are failing:
# View detailed error messages
kubectl get aiplatform <name> -n <namespace> \
-o jsonpath='{.status.conditions[?(@.type=="RayServeRouteReady")].message}'
# Check Ray Serve logs
kubectl logs -l ray.io/cluster=<name> -n <namespace> | grep -i error
Common causes:
- objectStorage.path

Symptoms:
- WeaviateNotReady event
- WeaviateDatabaseReady is False

Check Weaviate status:
# Check StatefulSet
kubectl get statefulset <name>-weaviate -n <namespace>
# Check pods
kubectl get pods -l app=<name>-weaviate -n <namespace>
# Check logs
kubectl logs <name>-weaviate-0 -n <namespace>
Common causes:
Symptoms:
- IngressReady is False

Check Ingress status:
# View Ingress resource
kubectl get ingress <name> -n <namespace>
# Check if address is assigned
kubectl describe ingress <name> -n <namespace>
# Check Ingress controller logs (this example assumes ingress-nginx in its default namespace)
kubectl logs -n ingress-nginx deployment/ingress-nginx-controller
Common causes:
When AI models fail to load, you’ll see detailed errors:
# View application errors
kubectl get events -n <namespace> \
--field-selector involvedObject.name=<name>,reason=RayApplicationErrors
# Check specific application logs
kubectl logs -l ray.io/node-type=worker -n <namespace> | grep <app-name>
Example error messages:
- FileNotFoundError: model_artifacts/my-model/model.bin → Check S3 path
- CUDA_VISIBLE_DEVICES is set to empty string → GPU configuration issue
- RuntimeError: CUDA out of memory → Increase GPU resources

# View Weaviate errors
kubectl get events -n <namespace> \
--field-selector involvedObject.name=<name>,reason=WeaviateNotReady
# Check Weaviate logs
kubectl logs <name>-weaviate-0 -n <namespace>
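The example Ray error messages earlier in this section each map to a suggested action, and that mapping can be applied to log output automatically. The sketch below is illustrative only: `suggest_fix` and `annotate_errors` are hypothetical helpers, and the patterns cover only the three example messages from this guide.

```shell
# Return the suggested action for one log line, based on the example
# error messages documented in this guide.
suggest_fix() {
  case "$1" in
    *FileNotFoundError*)    echo "Check S3 path" ;;
    *CUDA_VISIBLE_DEVICES*) echo "GPU configuration issue" ;;
    *"CUDA out of memory"*) echo "Increase GPU resources" ;;
    *)                      echo "No known hint; inspect the full logs" ;;
  esac
}

# Read log lines on stdin and append the suggested action to each one.
annotate_errors() {
  while IFS= read -r line; do
    printf '%s -> %s\n' "$line" "$(suggest_fix "$line")"
  done
}
```

A possible use is to pipe filtered worker logs through it, for example `kubectl logs -l ray.io/node-type=worker -n "$NAMESPACE" | grep -i error | annotate_errors`, where `NAMESPACE` is a variable you set yourself.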
Sometimes individual pods fail:
# List all pods
kubectl get pods -n <namespace> -l ai.splunk.com/platform=<name>
# Check failing pods
kubectl describe pod <pod-name> -n <namespace>
# View pod logs
kubectl logs <pod-name> -n <namespace>
During deployment, you’ll typically see events in this order:
1. RayServiceCreating
2. RayServiceCreated
3. WeaviateCreating
4. WeaviateCreated
5. IngressCreating (if enabled)
6. IngressCreated (if enabled)
7. RayClusterReady - Pods are running
8. WeaviateReady - Database is running
9. RayServiceReady - Ray is operational
10. RayServeReady - AI inference ready
11. IngressReady (if enabled) - External access available
12. PlatformReady - Everything is operational

Monitor Warning events to catch problems early:
# List Warning events
kubectl get events -n <namespace> --field-selector type=Warning
# Watch for specific problems
kubectl get events -n <namespace> --watch --field-selector reason=PlatformDegraded
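To see which problems occur most often, Warning events can be tallied by reason. This is a minimal sketch: `count_by_reason` and `warning_summary` are hypothetical names, and it assumes event reasons contain no spaces, as is the case for the reasons listed in this guide.

```shell
# Read one event reason per line on stdin and print "reason count",
# most frequent first.
count_by_reason() {
  sort | uniq -c | sort -rn | awk '{ print $2, $1 }'
}

# Hypothetical wrapper: $1 = namespace
warning_summary() {
  kubectl get events -n "$1" --field-selector type=Warning \
    -o jsonpath='{range .items[*]}{.reason}{"\n"}{end}' | count_by_reason
}
```

`warning_summary my-namespace` prints one line per reason. Note that it counts event objects, not the repeat count that Kubernetes stores on an aggregated event.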
Export events to your monitoring system:
Prometheus:
# Example PromQL query (assumes an event exporter that exposes a kube_event_count metric)
rate(kube_event_count{type="Warning",involved_object_kind="AIPlatform"}[5m]) > 0
Splunk: Configure the Splunk operator to forward events to your Splunk instance.
If you’re still stuck:
# Save all relevant information
kubectl get aiplatform <name> -n <namespace> -o yaml > aiplatform.yaml
kubectl get events -n <namespace> > events.txt
kubectl get pods -n <namespace> > pods.txt
kubectl logs <pod-name> -n <namespace> > pod-logs.txt
# Save the operator logs
kubectl logs -n splunk-ai-operator-system \
deployment/splunk-ai-operator-controller-manager > operator-logs.txt
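The collection commands above can be wrapped in one function that writes everything into a single directory, which is easier to attach to a support request. This is a sketch under assumptions: `collect_diagnostics` and the default directory name are hypothetical, and per-pod logs are left out because the pod to inspect varies by failure.

```shell
# Collect the resource, events, pod list, and operator logs into one directory.
# $1 = resource name, $2 = namespace, $3 = output directory (optional)
collect_diagnostics() {
  name="$1"; namespace="$2"; out="${3:-aiplatform-diagnostics}"
  mkdir -p "$out" || return 1
  kubectl get aiplatform "$name" -n "$namespace" -o yaml > "$out/aiplatform.yaml"
  kubectl get events -n "$namespace" > "$out/events.txt"
  kubectl get pods -n "$namespace" > "$out/pods.txt"
  kubectl logs -n splunk-ai-operator-system \
    deployment/splunk-ai-operator-controller-manager > "$out/operator-logs.txt"
  # Print the directory so callers can archive it.
  echo "$out"
}
```

For example, `tar czf diagnostics.tgz "$(collect_diagnostics my-platform my-namespace)"` produces a single archive. Review the files before sharing them, since events and logs can include internal hostnames or paths.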
Use Status Conditions to understand current state:
kubectl get aiplatform <name> -n <namespace> -o jsonpath='{.status.conditions}'
Use Events to understand what happened:
kubectl get events -n <namespace> --field-selector involvedObject.name=<name>
Use Logs for detailed debugging:
kubectl logs <pod-name> -n <namespace>
For more technical details about the event system, see Event Coverage and Event Strategy.