Agent issues
Agent stuck in Pending
An agent stays in Pending when Kubernetes cannot start its pods. Begin by finding the pods and reading their events: first check which pods exist (or don’t), then read the pod events for scheduling or image-pull failures.
Common causes and fixes:
- Image not found or inaccessible: verify the image name and tag exist in your registry. Check imagePullPolicy in the spec; Always forces a fresh pull on every start, which can surface auth errors.
- Missing image pull secret: for private registries, create a pull secret and reference it in the Agent spec under spec.runtime.spec.imagePullSecrets.
- Insufficient cluster resources: the scheduler cannot place the pod if no node has enough free CPU or memory. Run kubectl describe nodes to compare allocatable capacity with current requests.
- Unresolvable model or tool reference: if spec.model.name references a Model that does not exist or is in the wrong namespace, the operator holds the agent in Pending. Run kubectl describe agent <name> and look in the status.conditions array for a ModelResolved condition with status False.
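The first two steps above (listing the pods and reading their events) can be sketched with standard kubectl commands. The helper names and their arguments are hypothetical. How the operator names or labels agent pods is not stated here, so the sketch lists every pod in the namespace instead of filtering by label.

```shell
# Hypothetical helpers; pass your own namespace and pod name.

# List all pods in a namespace with node placement, so a missing or
# Pending agent pod is easy to spot.
list_pods() {
  kubectl get pods -n "$1" -o wide
}

# Show one pod's details and events (FailedScheduling, ErrImagePull, ...).
pod_events() {
  kubectl describe pod "$2" -n "$1"
}

# List the namespace's events sorted by timestamp, most recent last.
namespace_events() {
  kubectl get events -n "$1" --sort-by=.lastTimestamp
}
```

For example, list_pods my-namespace followed by pod_events my-namespace my-agent-pod, where both names are placeholders.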
Agent pods are crashing (CrashLoopBackOff)
CrashLoopBackOff means the container starts but exits immediately, and Kubernetes keeps restarting it with increasing back-off delays. Start by reading the logs from the most recent (crashed) pod.
Common causes and fixes:
- Missing API key secret: the most frequent cause. If your agent reads an environment variable backed by a secretKeyRef, verify that the Secret exists and that the key name is correct.
- Missing or misconfigured environment variables: cross-check every env entry in the Agent spec against what your application code expects.
- Resource limits too tight: if the container is killed with exit code 137, it was most likely OOM-killed. Increase resources.limits.memory in the Agent spec.
- Failing health probe: if the liveness probe fires before the application finishes starting up, Kubernetes kills the container repeatedly. Increase initialDelaySeconds on the livenessProbe.
- Application startup error: read the full log output carefully. Stack traces or missing dependency errors appear here.
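The log and Secret checks above can be sketched as follows. The helper names are hypothetical and the namespace, pod, and Secret names are whatever applies to your cluster. The commands themselves are standard kubectl.

```shell
# Hypothetical helpers; arguments are <namespace> <name>.

# Logs from the previous (crashed) container instance of a pod.
crashed_logs() {
  kubectl logs "$2" -n "$1" --previous
}

# Reason the last container instance terminated, e.g. OOMKilled or Error.
last_exit_reason() {
  kubectl get pod "$2" -n "$1" \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
}

# Key names and sizes stored in a Secret, without printing the values.
secret_keys() {
  kubectl describe secret "$2" -n "$1"
}
```

An OOMKilled result from last_exit_reason points at the memory limit, while a missing key in the secret_keys output points at the secretKeyRef.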
Agent service not reachable
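The checks in this section map onto a handful of standard kubectl commands. This is a rough sketch with hypothetical helper names. The Service name and port numbers are assumptions you must replace, since the name the operator gives the agent's Service is not stated here.

```shell
# Hypothetical helpers; replace every argument with your own values.

# Current state of the Agent resource as reported by the operator.
agent_state() {
  kubectl get agent "$2" -n "$1"
}

# Services and their endpoints; <none> under ENDPOINTS means no pod is Ready.
service_endpoints() {
  kubectl get service,endpoints -n "$1"
}

# Forward a local port to a Service port: <namespace> <service> <local> <remote>.
forward_service() {
  kubectl port-forward -n "$1" "service/$2" "$3:$4"
}

# Readiness probe configured on the pod's containers.
readiness_probe() {
  kubectl get pod "$2" -n "$1" -o jsonpath='{.spec.containers[*].readinessProbe}'
}
```

While forward_service is running, send a request to localhost on the chosen local port from a second terminal.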
If an agent is Running but you cannot send it requests, work through the network path from the outside in:
- Verify that the agent has reached the Running phase.
- Confirm that the Service exists and has endpoints.
- Test connectivity with a port-forward, which bypasses Ingress and NetworkPolicies.
If the port-forward works but direct cluster traffic does not, a NetworkPolicy is most likely blocking the path. Review any NetworkPolicy resources that select your agent’s pods and ensure your client’s namespace or IP range is permitted in the ingress rules.
If the port-forward also fails, the readiness probe is likely reporting unhealthy. Check the pod events and the probe configuration.
Model and provider issues
ModelProvider not ready
A ModelProvider that stays not-ready prevents every Model that references it from resolving, which in turn blocks any Agent that references those models. Inspect the provider’s conditions, verify that the Secret referenced by apiKeySecretRef exists, and check that the key name in the Secret matches apiKeySecretRef.key.
Common causes and fixes:
- Secret does not exist: create it with kubectl create secret generic <name> --from-literal=api-key=<value>.
- Wrong key name: the key field in apiKeySecretRef must exactly match a key in the Secret’s data map.
- Invalid API key: decode the stored value and confirm it is valid by testing it directly against the provider’s API outside of Kubernetes.
- Wrong namespace: Secrets are namespaced. Make sure the Secret and the ModelProvider are in the same namespace.
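A minimal sketch of the provider and Secret checks above. The helper names are hypothetical, and the resource name modelprovider is an assumption based on the kind name; confirm the actual name with kubectl api-resources.

```shell
# Hypothetical helpers; replace the arguments with your own values.

# Conditions and events of a ModelProvider: <namespace> <provider>.
# "modelprovider" is assumed to be the resource name for the kind.
provider_conditions() {
  kubectl describe modelprovider "$2" -n "$1"
}

# Base64-encoded value of one Secret key: <namespace> <secret> <key>.
# Prints nothing useful if the key is absent from the data map.
secret_value_b64() {
  kubectl get secret "$2" -n "$1" -o "go-template={{index .data \"$3\"}}"
}
```

To read the stored key, pipe the result through base64 --decode, for example secret_value_b64 my-namespace my-secret api-key | base64 --decode. This prints the API key to the terminal, so avoid doing it on a shared screen.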
Model not found error in agent
This error appears in agent logs or in the ModelResolved status condition when the operator cannot locate the referenced Model resource. Verify that the Model exists, check whether it is ready, and confirm that the ModelProvider that backs it is also ready.
Common causes and fixes:
- Namespace mismatch: if the Model is in a different namespace from the Agent, you must set spec.model.namespace in the Agent spec. The operator does not search other namespaces by default.
- Typo in the model name: spec.model.name is case-sensitive and must exactly match the Model resource’s metadata.name.
- Provider not ready: a Model cannot become ready until its ModelProvider is ready. Fix the provider first.
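To compare what the Agent asks for with what exists, something like the following may help. The helper names are hypothetical, and model is assumed to be the resource name for the Model kind. The field paths spec.model.name and spec.model.namespace come from the text above.

```shell
# Hypothetical helpers; replace the arguments with your own values.

# Print the Agent's model reference as <namespace>/<name>.
# The part before the slash is empty when spec.model.namespace is unset.
agent_model_ref() {
  kubectl get agent "$2" -n "$1" \
    -o jsonpath='{.spec.model.namespace}/{.spec.model.name}'
}

# List Model resources in every namespace to spot a namespace mismatch or typo.
# "model" is assumed to be the resource name for the kind.
find_models() {
  kubectl get model --all-namespaces
}
```

If the name printed by agent_model_ref does not appear in the find_models output with exactly the same spelling and case, the reference cannot resolve.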
Tool issues
AgentTool not working
When an agent logs errors about a tool or the tool produces no results, start with the tool’s own status and then test the underlying endpoint directly. Inspect the tool’s conditions first, then test the tool’s endpoint from inside the cluster, which avoids false negatives caused by your local network.
Common causes and fixes:
- Invalid OpenAPI spec: if the Validated condition is False, the operator could not parse the spec. Use an online validator (for example, editor.swagger.io) to check the spec at the path returned by endpointPath.
- Wrong endpoint path: confirm the service actually serves its OpenAPI document at the path you specified in openApiSchema.endpointPath.
- ConfigMap key missing: if you use openApiSchema.valueFrom, verify that the ConfigMap exists and contains the expected key.
- Internal service not reachable: for serviceRef tools, confirm that the referenced Service exists and is in the correct namespace.
- Timeout too short: if the backing API is slow to respond, the agent may be hitting the default 30-second timeout before the response arrives. Increase timeoutSeconds in the tool spec.
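The status check and the in-cluster endpoint test above could look roughly like this. The helper names, the pod name tool-probe, and the resource name agenttool are assumptions. The curlimages/curl image is one common choice for a throwaway curl pod and is not something this guide prescribes.

```shell
# Hypothetical helpers; replace the arguments with your own values.

# Conditions and events of an AgentTool: <namespace> <tool>.
# "agenttool" is assumed to be the resource name for the kind.
tool_conditions() {
  kubectl describe agenttool "$2" -n "$1"
}

# Fetch a URL from a temporary pod inside the cluster: <namespace> <url>.
# The pod is deleted when curl exits; -m 10 caps the request at 10 seconds.
probe_from_cluster() {
  kubectl run tool-probe -n "$1" --rm -i --restart=Never \
    --image=curlimages/curl --command -- curl -sS -m 10 "$2"
}

# Keys present in a ConfigMap: <namespace> <configmap>.
configmap_keys() {
  kubectl describe configmap "$2" -n "$1"
}
```

Pass probe_from_cluster the same URL the tool is configured with, so the request follows the path the agent uses.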
Tool timeout errors
Tool timeouts typically mean the upstream service is slow or unreachable from within the cluster. Work through the following:
- Increase the timeout in the AgentTool spec and apply the change with a patch.
- Test connectivity from an agent pod.
- Check NetworkPolicies: if the agent namespace has egress restrictions, verify that traffic to the tool endpoint’s IP range and port is explicitly allowed. DNS (port 53, UDP and TCP) must also be permitted so that service names resolve.
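A sketch of those three steps follows. It makes two assumptions: that timeoutSeconds sits directly under spec in the AgentTool resource, and that the agent image ships curl. If either does not hold, adjust the patch path or use a temporary debug pod instead. The helper names are hypothetical.

```shell
# Hypothetical helpers; replace the arguments with your own values.

# Set the tool timeout with a merge patch: <namespace> <tool> <seconds>.
# Assumes the field lives at spec.timeoutSeconds.
set_tool_timeout() {
  kubectl patch agenttool "$2" -n "$1" --type merge \
    -p "{\"spec\":{\"timeoutSeconds\":$3}}"
}

# Request a URL from inside an agent pod: <namespace> <pod> <url>.
# Assumes curl is available in the agent container image.
probe_from_pod() {
  kubectl exec "$2" -n "$1" -- curl -sS -m 10 "$3"
}

# NetworkPolicies in the agent namespace that may restrict egress.
egress_policies() {
  kubectl get networkpolicy -n "$1"
}
```

For example, set_tool_timeout my-namespace my-tool 60 raises the timeout to 60 seconds under the stated assumption about the field path.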
Operator issues
Operator pod not running
If the operator is not running, no CRD reconciliation occurs, meaning new agents won’t start and existing ones won’t be updated. Check the operator pod status, describe the pod for events (image pull errors, OOM, etc.), and read the operator logs for startup errors.
If the pod is in ImagePullBackOff, your cluster cannot pull the operator image from ghcr.io. Check your cluster’s egress rules and any image pull secrets configured via the Helm images.pullSecrets value.
If the pod is in CrashLoopBackOff, pass the --previous flag to kubectl logs to retrieve logs from the container that just crashed.
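The operator checks above might be scripted as below. This assumes the operator runs as a Deployment. The namespace and Deployment name are not given in this guide, so both are passed in as arguments, and the helper names are hypothetical.

```shell
# Hypothetical helpers; arguments are <namespace> <deployment>.

# Ready and available replica counts for the operator Deployment.
operator_status() {
  kubectl get deployment "$2" -n "$1"
}

# Last 100 log lines from one of the Deployment's pods.
operator_logs() {
  kubectl logs "deployment/$2" -n "$1" --tail=100
}

# Warning events in the operator namespace (image pulls, OOM kills, ...).
operator_warnings() {
  kubectl get events -n "$1" --field-selector type=Warning
}
```

If operator_status reports zero ready replicas, the warnings usually narrow the cause down before you need the logs.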
CRDs not found (no kind 'Agent' found)
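Two quick checks cover most cases described in this section: which context kubectl is using, and which Flokoa CRDs the cluster has. The sketch below uses hypothetical helper names and filters on flokoa.ai, the domain that appears in the CRD name agents.agent.flokoa.ai later in this guide.

```shell
# Hypothetical helpers built on standard kubectl commands.

# Name of the kubeconfig context kubectl is currently using.
current_context() {
  kubectl config current-context
}

# Installed CRDs whose name contains flokoa.ai.
# No output (and a non-zero exit status) means none are installed.
flokoa_crds() {
  kubectl get crds -o name | grep 'flokoa\.ai'
}
```

Run current_context first: if it names the wrong cluster, the CRDs may be present on the cluster you meant to target.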
This error from kubectl means the CRDs were never installed, were accidentally deleted, or kubectl is targeting the wrong Kubernetes context. Verify which CRDs are present, then work through the following:
- Re-apply the install bundle to restore any missing CRDs.
- If you installed via Helm and CRDs are missing, they may have been removed during a helm uninstall. Re-install with crds.install=true (the default).
- Check your kubeconfig context: if you manage multiple clusters, confirm you are pointing at the right one.
Getting more help
If you have worked through the items above and still cannot resolve your issue, the following resources are available:
GitHub Repository
Browse the source code, read open issues, and check the changelog for recent fixes that may address your problem.
File an Issue
Open a new GitHub issue. Include the output of kubectl describe agent <name>, relevant pod logs, and your Flokoa version (kubectl get crds agents.agent.flokoa.ai -o jsonpath='{.metadata.annotations}').