Beyond InferenceService Readiness: 5 GitOps Failure Modes That Break KServe Deployments

A sequel to my KServe readiness post: five GitOps control-plane failure modes with exact terminal output, diagnostics, and repeatable fixes for ArgoCD + KServe stacks.

This post is a follow-up to my earlier KServe piece on endpoint readiness:

👉 Why Your KServe InferenceService Won't Become Ready: Four Production Failures and Fixes

That article focused on why an InferenceService may not become Ready. This one zooms out to a broader question: what breaks when the GitOps control plane itself is unstable?

Most GitOps + AI serving tutorials still focus on the happy path — install ArgoCD, apply KServe, deploy an InferenceService, done. But in real platform work, the happy path is the easy part. The hard part is when your app is OutOfSync, the webhook has no endpoints, and everything looks healthy except the thing you actually need.

The platform context

The stack under discussion:

- ArgoCD — GitOps reconciliation
- KServe — model serving (InferenceService, runtimes)
- Knative + Kourier — serving networking
- Kyverno — policy guardrails
- Backstage — self-service PR generation

The root ArgoCD Application points the cluster at the repo's infrastructure directory:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: neuroscale-infrastructure
  namespace: argocd
spec:
  source:
    repoURL: https://github.com/sodiq-code/neuroscale-platform.git
    targetRevision: main
    path: infrastructure/apps
```

This post covers the five failure modes that repeatedly broke KServe deployments in a real production-grade platform build, with exact terminal output, root causes, and the fixes that worked. All failures come from hands-on implementation work documented here:

Project repo: github.com/sodiq-code/neuroscale-platform

Failure Mode 1: Webhook Has No Endpoints — Sync Fails Cluster-Wide

Time lost: ~1 hour | Impact: all InferenceService operations blocked

Symptom

ArgoCD syncs child apps and hits this:

```
$ kubectl -n argocd describe application ai-model-alpha
...
Message: admission webhook "inferenceservice.kserve-webhook-server.validator.webhook"
denied the request: Internal error occurred: no endpoints available for
service "kserve-webhook-server-service"
```

Meanwhile the KServe controller pod shows only 1 of 2 containers ready:

```
$ kubectl -n kserve get pods
NAME                                        READY   STATUS
kserve-controller-manager-8d7c5b9f4-xr2lm   1/2     Running

$ kubectl -n kserve describe pod kserve-controller-manager-8d7c5b9f4-xr2lm
...
kube-rbac-proxy:
  State:    Waiting
  Reason:   ImagePullBackOff
  Image:    gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1
Events:
  Warning  Failed  kubelet  Failed to pull image: unexpected status code 403 Forbidden
```

Root Cause

The kube-rbac-proxy sidecar inside kserve-controller-manager was pulling from gcr.io/kubebuilder/ — a registry that restricted access in late 2025. The manager container was healthy, but because the sidecar was not running the pod never became fully ready, so the webhook Service had no endpoints to serve admission requests. Result: every InferenceService apply or update was blocked cluster-wide.

Fix

Remove the sidecar via a Kustomize strategic merge patch:

```yaml
# infrastructure/serving-stack/patches/kserve-controller-kube-rbac-proxy-image.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kserve-controller-manager
  namespace: kserve
spec:
  template:
    spec:
      containers:
        - name: kube-rbac-proxy
          $patch: delete
```

Verify webhook endpoints are restored after re-sync:

```
$ kubectl -n kserve get endpoints kserve-webhook-server-service
NAME                            ENDPOINTS         AGE
kserve-webhook-server-service   10.42.0.23:9443   45s
```

Lesson

When webhook endpoints are missing, your app YAML is never the real problem. Diagnose the controller first. An external registry access change can silently kill your entire admission layer cluster-wide with no obvious error in the app itself.

Failure Mode 2: CRD Deleted by a Misapplied Patch — All Endpoints Gone Instantly

Time lost: 4 minutes recovery | Impact: SEV-1 equivalent — all InferenceServices deleted

Symptom

All InferenceService objects disappeared silently:

```
$ kubectl -n default get inferenceservices
No resources found in default namespace.

$ kubectl -n argocd get application demo-iris-2
NAME          SYNC STATUS   HEALTH STATUS
demo-iris-2   OutOfSync     Missing
```

Root Cause

A Kustomize patch file named remove-inferenceservice-crd.yaml was mistakenly applied directly with kubectl apply -f instead of being used as a build-time patch inside kustomization.yaml. The file contained a $patch: delete directive:

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: inferenceservices.serving.kserve.io
$patch: delete
```

When applied directly, it deleted the actual CRD from Kubernetes. When a CRD is deleted, Kubernetes immediately garbage-collects every custom resource of that type. Every InferenceService was gone within seconds.

Fix

Restore the CRD immediately:

```bash
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.12.1/kserve.yaml

kubectl wait crd/inferenceservices.serving.kserve.io \
  --for=condition=Established --timeout=60s

kubectl -n argocd patch application demo-iris-2 \
  --type merge \
  -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'
```

Lesson

$patch: delete in a Kustomize file is a build-time instruction — it tells kustomize build to omit that resource from its output. It must never be applied directly with kubectl apply -f. Ambiguous filenames like remove-inferenceservice-crd.yaml are dangerous footguns. In a production cluster with 50 deployed models this would be a full SEV-1.
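This lesson can also be enforced mechanically. Below is a minimal pre-commit/CI sketch, not from the repo: the function names and file layout are hypothetical, and it flags any file containing a `$patch: delete` directive that is about to be passed directly to `kubectl apply -f`.

```python
# Guard sketch (hypothetical helper names): refuse to let files containing
# "$patch: delete" be applied directly. Such a file is a kustomize
# build-time patch; applying it with `kubectl apply -f` deletes the live
# resource instead.

DELETE_DIRECTIVE = "$patch: delete"

def find_delete_patches(files: dict) -> list:
    """Return names of YAML files whose content contains a $patch: delete directive."""
    return sorted(
        name for name, body in files.items()
        if name.endswith((".yaml", ".yml")) and DELETE_DIRECTIVE in body
    )

def check_apply_targets(apply_targets: list, files: dict) -> list:
    """Return files that are both delete-patches and direct kubectl apply targets."""
    risky = set(find_delete_patches(files))
    return sorted(t for t in apply_targets if t in risky)

if __name__ == "__main__":
    # Illustrative repo contents keyed by path.
    files = {
        "patches/remove-inferenceservice-crd.yaml": (
            "apiVersion: apiextensions.k8s.io/v1\n"
            "kind: CustomResourceDefinition\n"
            "metadata:\n  name: inferenceservices.serving.kserve.io\n"
            "$patch: delete\n"
        ),
        "apps/demo-iris-2/inference-service.yaml": "kind: InferenceService\n",
    }
    # CI would collect the file list passed to `kubectl apply -f`.
    blocked = check_apply_targets(list(files), files)
    print(blocked)  # → ['patches/remove-inferenceservice-crd.yaml']
```

In a real pipeline this would run against the rendered file list before any apply step, failing the job when `blocked` is non-empty.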
⚠️ Rule: any file containing $patch: delete must only ever be referenced inside a kustomization.yaml patches block, never applied directly.

Failure Mode 3: Permanent OutOfSync Due to Label Key Mismatch

Time lost: 2 weeks undetected | Impact: CI was green while policy enforcement was silently broken

Symptom

A PR is merged, ArgoCD syncs, but the InferenceService stays OutOfSync/Degraded:

```
$ kubectl -n argocd get application demo-iris-2
NAME          SYNC STATUS   HEALTH STATUS
demo-iris-2   OutOfSync     Degraded
```

Kyverno denies the resource at admission:

```
Error from server: error when creating "STDIN": admission webhook
"clusterpolice.kyverno.svc" denied the request: resource
InferenceService/default/test-model was blocked due to the following policies

require-standard-labels-inferenceservice:
  check-owner-and-cost-center-on-isvc: 'validation error: InferenceService
  resources must set metadata.labels.owner and metadata.labels.cost-center.
  rule check-owner-and-cost-center-on-isvc failed at path /metadata/labels/cost-center/'
```

But the label is present in the manifest:

```
$ kubectl -n default get inferenceservice demo-iris-2 \
    -o jsonpath='{.metadata.labels}' | python3 -m json.tool
{
    "owner": "platform-team",
    "costCenter": "ai-platform"
}
```

Root Cause

costCenter (camelCase) and cost-center (kebab-case) are completely different Kubernetes label keys. The Backstage template skeleton was generating costCenter; the Kyverno policy required cost-center. CI appeared to pass, so the mismatch only surfaced at admission time. Additionally, kyverno-cli apply exits with code 0 even when policy violations are found, and the CI step was checking $? rather than ${PIPESTATUS[0]}, so the step appeared green while enforcement was completely broken for two weeks.

Fix

Standardize on kebab-case throughout (the Kubernetes convention):

```yaml
# Backstage template skeleton
# apps/${{ values.name }}/inference-service.yaml
labels:
  owner: platform-team
  cost-center: ai-platform   # was: costCenter
```

Fix the CI Kyverno check to catch actual violations:

```bash
set +e
docker run --rm -v "$PWD:/work" -w /work ghcr.io/kyverno/kyverno-cli:v1.12.5 \
  apply infrastructure/kyverno/policies/*.yaml \
  --resource "${app_files[@]}" \
  2>&1 | tee /tmp/kyverno-output.txt
kyverno_exit="${PIPESTATUS[0]}"
set -e

if [ "${kyverno_exit}" -ne 0 ] \
  || grep -qE "^FAIL" /tmp/kyverno-output.txt \
  || grep -qE "fail: [1-9][0-9]*" /tmp/kyverno-output.txt; then
  echo "Kyverno policy violations detected. Failing CI."
  exit 1
fi
```

$? captures the exit code of tee, not kyverno. ${PIPESTATUS[0]} captures kyverno's actual exit code.

Lesson

"Guardrails exist" and "guardrails enforce" are different states. The most dangerous failure mode for a policy system is silent false positives — everything looks green while nothing is actually being enforced.

Failure Mode 4: Kyverno Install Breaks ArgoCD Reconciliation Loop

Time lost: 2–5 minutes per cluster | Impact: all ArgoCD apps enter Unknown state

Symptom

After adding Kyverno to the platform, previously healthy apps enter Unknown state:

```
$ kubectl -n argocd get applications
NAME                        SYNC STATUS   HEALTH STATUS
neuroscale-infrastructure   Synced        Healthy
serving-stack               Unknown       Unknown     # was Healthy 10 minutes ago
policy-guardrails           Synced        Healthy

$ kubectl -n argocd describe application serving-stack
...
Message: rpc error: code = Unavailable desc = connection refused
```

Root Cause

Kyverno installs its own ValidatingWebhookConfiguration and MutatingWebhookConfiguration during install. While Kyverno is initializing, the webhook configurations are registered but point to endpoints that do not exist yet. During this window, any kubectl apply operation — including ArgoCD's sync reconciliation loop — times out waiting for a response from a not-yet-running webhook server. This cascades into the ArgoCD repo-server losing its connection.
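One way to narrow this initialization race is to make install ordering explicit, so the policy engine is fully healthy before dependent apps sync. A sketch using ArgoCD sync-wave annotations on the child Applications (the app names match this platform; the wave numbers are illustrative, not from the repo):

```yaml
# Ordering sketch: ArgoCD syncs lower waves first and waits for them to be
# healthy before starting the next wave within a sync of the parent app.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: policy-guardrails            # the Kyverno install
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "-1"   # sync before everything else
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: serving-stack                # KServe, Knative, Kourier
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "0"    # only after wave -1 is healthy
```

This does not remove the webhook-registration window entirely, but it keeps the window from overlapping with syncs of the apps the webhooks would intercept.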
Fix

Add a Kyverno webhookAnnotations ConfigMap patch to suppress automatic webhook registration during the initialization window:

```yaml
# infrastructure/kyverno/kustomization.yaml
patches:
  - target:
      kind: ConfigMap
      name: kyverno
    patch: |-
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: kyverno
        namespace: kyverno
      data:
        webhookAnnotations: "{}"
```

After Kyverno reaches Running state, force a hard refresh:

```bash
kubectl -n argocd patch application serving-stack \
  --type merge \
  -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'
```

Lesson

Adding a policy engine to an existing cluster disrupts all other ArgoCD-managed applications during the install window. In production, this requires a maintenance window or a canary install strategy. Kyverno must be fully healthy before any other component syncs.

Failure Mode 5: Stale Admission Webhook Silently Blocks All Deployments

Time lost: 30+ minutes | Impact: all Deployments in the namespace silently blocked

Symptom

After fixing the repo-server, apps sync but Deployments never appear:

```
$ kubectl get applications -n argocd
NAME                        SYNC STATUS   HEALTH STATUS
neuroscale-infrastructure   Synced        Healthy
test-app                    Synced        Progressing   # stuck

$ kubectl get deploy -n default
No resources found in default namespace.
```

ArgoCD shows the Deployment as "synced" but it does not exist — a contradiction. Checking conditions:

```
$ kubectl -n argocd get application test-app -o yaml | grep -A 20 conditions
conditions:
- message: 'Failed sync attempt: one or more objects failed to apply, reason:
    Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io":
    failed to call webhook: Post
    "https://ingress-nginx-controller-admission.ingress-nginx.svc:443/...":
    dial tcp 10.96.x.x:443: connect: connection refused'
  type: SyncError
```

Root Cause

A ValidatingWebhookConfiguration from a previous cluster experiment was still registered but pointing to a service that no longer existed. Kubernetes admission webhooks are cluster-scoped, so the stale ingress-nginx webhook was intercepting every resource creation attempt and failing it — and the error only appears in ArgoCD events, not on the Deployment itself.

Fix

```bash
# Discover stale webhooks
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations

# Delete the stale one
kubectl delete validatingwebhookconfiguration ingress-nginx-admission

# Force ArgoCD to retry
kubectl -n argocd patch application test-app \
  --type merge \
  -p '{"metadata":{"annotations":{"argocd.argoproj.io/refresh":"hard"}}}'
```

Verify:

```
$ kubectl get deploy -n default
NAME         READY   UP-TO-DATE   AVAILABLE   AGE
nginx-test   1/1     1            1           23s
```

Lesson

A stale webhook from a previous workload silently blocks all resource creation in the affected namespace for hours without any obvious error message. The admission error only appears in ArgoCD event logs, not on the resource itself. Always check for stale webhooks before blaming manifests.

The Triage Sequence That Saves Hours

When a KServe app is failing in ArgoCD, run this exact order before touching any manifest:

```bash
# 1. Environment gate — if this fails, stop and fix the environment first
kubectl get nodes
kubectl -n argocd get applications

# 2. Control-plane health
kubectl -n kserve get deploy,pods,svc,endpoints
kubectl get crd | grep serving.kserve.io

# 3. Controller logs
kubectl -n kserve logs deploy/kserve-controller-manager --tail=100

# 4. Webhook availability
kubectl -n kserve get endpoints kserve-webhook-server-service

# 5. Stale webhooks
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations

# 6. App-level sync error detail
kubectl -n argocd get application <app-name> -o yaml | grep -A 20 conditions
```

Only after every step above passes should you edit app manifests.

Why This Matters for Platform Teams

A platform is credible when it supports both:

- Self-service delivery — the Golden Path works
- Self-service recovery — failures are understandable and fixable without a platform expert

Most teams build the first and postpone the second. That creates operational debt fast. The fix is not more dashboards. It is better failure-model documentation, tighter GitOps guardrails, and the discipline to document what breaks — not just what works. A platform is not "done" when the happy path works. It's done when the failure path is understandable and recoverable.

What I Would Improve Next

- Pre-merge CI assertions for probe and resource fields in rendered manifests
- Explicit dependency ordering using ArgoCD sync waves to prevent Kyverno install disruption
- Conformance checks for Helm dependency values nesting to catch silently ignored overrides
- Policy test fixtures that verify both pass and fail cases in CI
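One guardrail worth wiring into CI today, sketched here with stdlib-only string checks (the helper names are hypothetical and a real pipeline would parse YAML properly): fail pre-merge when a rendered manifest omits resource requests or probes.

```python
# CI assertion sketch (hypothetical helpers): verify that rendered
# manifests declare probes and resource requests before merge. Simple
# substring checks keep the example dependency-free.

REQUIRED_KEYS = ("readinessProbe:", "livenessProbe:", "resources:")

def missing_manifest_keys(rendered: str) -> list:
    """Return the required keys that do not appear in the rendered manifest."""
    return [key for key in REQUIRED_KEYS if key not in rendered]

def assert_manifest_complete(name: str, rendered: str) -> None:
    """Exit non-zero (failing the CI job) if required fields are missing."""
    missing = missing_manifest_keys(rendered)
    if missing:
        raise SystemExit(f"{name}: missing required fields: {', '.join(missing)}")

if __name__ == "__main__":
    good = """
    spec:
      containers:
        - name: predictor
          resources:
            requests: {cpu: 100m, memory: 256Mi}
          readinessProbe:
            httpGet: {path: /healthz, port: 8080}
          livenessProbe:
            httpGet: {path: /healthz, port: 8080}
    """
    bad = "spec:\n  containers:\n    - name: predictor\n"
    print(missing_manifest_keys(good))  # → []
    print(missing_manifest_keys(bad))   # → all three required keys
```

Run against every rendered manifest in the PR, this catches the "deployed but unprotected" case before admission control ever sees it.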
Jimoh Sodiq Bolaji | Platform Engineer | Technical Content Engineer | Abuja, Nigeria


See Also

- docs/REALITY_CHECK_MILESTONE_1_GITOPS_SPINE.md — ArgoCD spine failures with exact terminal output
- docs/REALITY_CHECK_MILESTONE_2_KSERVE_SERVING.md — the kube-rbac-proxy failure in full detail
- docs/REALITY_CHECK_MILESTONE_4_GUARDRAILS.md — the Kyverno CI false-green and the ${PIPESTATUS[0]} fix
- infrastructure/INCIDENT_BACKSTAGE_CRASHLOOP_RCA.md — full incident postmortem with 12-section RCA