kubectl, scrape Prometheus metrics from a dedicated port, instrument your agent code with the OpenTelemetry SDK for end-to-end distributed tracing, and interrogate real-time health through the status.conditions array on every Agent resource. This page covers each approach in turn.
Logs
Flokoa labels every pod it creates with flokoa.ai/agent=<name>, so you can target logs by agent name regardless of how many replicas are running.
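As a rough illustration of selecting logs by that label, the sketch below builds a kubectl logs invocation for every replica of one agent. The helper name `agent_logs_command`, the `default` namespace, and the agent name `support-bot` are assumptions made for the example; only the label key comes from the text above.

```python
import shlex


def agent_logs_command(agent, namespace="default", tail=100, follow=False):
    """Build a kubectl argv list that selects every replica of one agent."""
    cmd = [
        "kubectl", "logs",
        f"--namespace={namespace}",
        f"--selector=flokoa.ai/agent={agent}",  # label applied by Flokoa
        "--prefix",                             # prefix each line with its pod name
        f"--tail={tail}",                       # last N lines per pod
    ]
    if follow:
        cmd.append("--follow")                  # keep streaming new lines
    return cmd


def agent_logs_shell(agent, **kwargs):
    """Render the same command as a copy-pasteable shell string."""
    return shlex.join(agent_logs_command(agent, **kwargs))
```

Calling `agent_logs_shell("support-bot", follow=True)` yields a command you can paste into a terminal, or the argv list can be passed to `subprocess.run`. When following many replicas at once, kubectl limits concurrent log streams, so you may also need its `--max-log-requests` flag.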
Logs also help explain why an agent entered the Failed phase or why a model reference could not be resolved.
Metrics
Expose a dedicated metrics port from your agent container so that Prometheus (or any compatible scraper) can collect custom metrics alongside standard Kubernetes resource metrics. Declare the metrics port in your Agent spec alongside the main HTTP port.
The operator serves its own metrics on port 8443 (HTTPS) in the flokoa-system namespace. You can enable a Prometheus ServiceMonitor for the operator by setting controller.metrics.serviceMonitor.enabled=true in the Helm values.
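On the agent side, a dedicated metrics port only needs to answer GET /metrics with the Prometheus text exposition format. The following is a minimal standard-library sketch of that idea, not Flokoa's own mechanism: the metric names, the port number 9090, and the helper names are all assumptions, and in a real agent a library such as prometheus_client would normally take the place of the hand-written renderer.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters; a real agent would update these as it works.
COUNTERS = {"agent_tasks_total": 0, "agent_tool_calls_total": 0}


def render_metrics(counters):
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(counters.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the current counter values on /metrics and nothing else."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(COUNTERS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve_metrics(port=9090):
    """Block forever serving /metrics; 9090 is an assumed port number."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Because `serve_metrics` blocks, an agent would typically start it in a daemon thread so the main HTTP port keeps serving requests. Whatever port you choose here must match the metrics port declared in the Agent spec.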
Monitor live CPU and memory consumption for all replicas of an agent:
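One way to get a per-agent total is to sum the rows that kubectl top prints for the agent's pods. The sketch below assumes the output of a command along the lines of `kubectl top pods -l flokoa.ai/agent=<name> --no-headers`, where each row is a pod name, a CPU quantity, and a memory quantity; the function names and the sample pod names are hypothetical.

```python
def parse_cpu_millicores(text):
    """Convert a kubectl CPU quantity such as '250m' or '2' to millicores."""
    if text.endswith("m"):
        return int(text[:-1])
    return int(float(text) * 1000)


# Conversion factors from each binary suffix to MiB.
_MEMORY_UNITS = {"Ki": 1 / 1024, "Mi": 1, "Gi": 1024}


def parse_memory_mib(text):
    """Convert a kubectl memory quantity such as '128Mi' or '1Gi' to MiB."""
    for suffix, factor in _MEMORY_UNITS.items():
        if text.endswith(suffix):
            return float(text[: -len(suffix)]) * factor
    return float(text) / (1024 * 1024)  # no suffix: plain bytes


def summarize_top(output):
    """Sum CPU and memory over the rows of `kubectl top pods --no-headers`."""
    total_cpu, total_mem, pods = 0, 0.0, 0
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        total_cpu += parse_cpu_millicores(fields[1])
        total_mem += parse_memory_mib(fields[2])
        pods += 1
    return {"pods": pods, "cpu_millicores": total_cpu, "memory_mib": total_mem}
```

Feeding it the captured command output gives one dictionary per agent, which is convenient for quick comparisons against the resource requests and limits set on the agent. Note that kubectl top depends on the cluster's metrics API being available.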
Distributed Tracing with OpenTelemetry
Flokoa integrates with the OpenTelemetry ecosystem to give you end-to-end trace visibility across A2A request spans and framework-level spans such as LLM calls and tool invocations.
Install the tracing extra
Add the tracing optional dependency to your agent’s Python environment. This pulls in the OpenTelemetry SDK and the pydantic-ai instrumentation package.
Configure the OTEL exporter via environment variables
Set the standard OpenTelemetry environment variables in your Agent spec. The flokoa run CLI automatically initialises the tracer provider at startup when the tracing extra is installed, so no code changes are required.
Deploy an OpenTelemetry Collector (if needed)
Point OTEL_EXPORTER_OTLP_ENDPOINT at an existing collector in your cluster, or deploy the OpenTelemetry Operator to manage collectors as CRDs. The collector can forward traces to any backend, such as Jaeger, Tempo, Honeycomb, or Datadog.
When the tracing extra is installed, traces automatically include:
- A2A request spans: one span per incoming agent task, including input and output metadata
- Framework-level spans: for pydantic-ai, each LLM call and tool invocation is captured as a child span
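To make the exporter configuration step more concrete, the sketch below assembles the standard OpenTelemetry environment variables as Kubernetes-style name/value entries. The variable names are the standard OTEL ones; the helper name `otel_env`, the collector address used in the usage note, and the assumption that the Agent spec accepts env entries in this shape are illustrative only.

```python
def otel_env(service_name, endpoint, protocol="grpc", sample_ratio=1.0):
    """Return Kubernetes-style env entries for the standard OTEL variables."""
    if not 0.0 <= sample_ratio <= 1.0:
        raise ValueError("sample_ratio must be between 0 and 1")
    values = {
        "OTEL_SERVICE_NAME": service_name,
        "OTEL_EXPORTER_OTLP_ENDPOINT": endpoint,
        "OTEL_EXPORTER_OTLP_PROTOCOL": protocol,  # "grpc" or "http/protobuf"
        "OTEL_TRACES_SAMPLER": "parentbased_traceidratio",
        "OTEL_TRACES_SAMPLER_ARG": str(sample_ratio),
    }
    # Kubernetes expects env values as strings in a list of name/value pairs.
    return [{"name": name, "value": value} for name, value in values.items()]
```

For example, `otel_env("support-bot", "http://otel-collector.observability:4317", sample_ratio=0.25)` produces entries that sample a quarter of new traces while still honouring the sampling decision of a parent span. Port 4317 is the conventional OTLP gRPC port; the hostname is hypothetical.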
Status Conditions
The operator maintains a status.conditions array on every Agent resource that provides a machine-readable summary of the agent’s health. Each condition has the following fields:
| Field | Description |
|---|---|
| type | The condition name, e.g. Ready or ModelResolved |
| status | "True", "False", or "Unknown" |
| reason | A short CamelCase code, e.g. DeploymentAvailable |
| message | A human-readable explanation of the current state |
| lastTransitionTime | ISO 8601 timestamp of the most recent status change |
- Ready: True when the agent’s Deployment has the desired number of available replicas and the Service is reachable.
- ModelResolved: True when the referenced Model and its ModelProvider have been found, validated, and injected into the agent’s environment.
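Because the conditions are machine-readable, a health check can be scripted against them. The sketch below works on an Agent resource already loaded as a dictionary, for instance from JSON output of kubectl; the function names are hypothetical, and the field names follow the table above.

```python
def condition_map(agent):
    """Index status.conditions by condition type."""
    conditions = agent.get("status", {}).get("conditions", [])
    return {cond["type"]: cond for cond in conditions}


def unhealthy_conditions(agent, expected=("Ready", "ModelResolved")):
    """Return (type, reason, message) for each expected condition that is not "True"."""
    found = condition_map(agent)
    problems = []
    for ctype in expected:
        cond = found.get(ctype)
        if cond is None:
            # The operator has not reported this condition yet.
            problems.append((ctype, "Missing", "condition not reported yet"))
        elif cond.get("status") != "True":
            # Covers both "False" and "Unknown".
            problems.append((ctype, cond.get("reason", ""), cond.get("message", "")))
    return problems
```

An empty list means every expected condition is "True". Anything else gives you the reason and message to surface in an alert or a CI gate, without having to parse free-form log text.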
Agent Status Phase
The status.phase field provides a coarse-grained summary of the agent’s state. The three possible values are:
Pending
The operator has created the Deployment but pods are still being scheduled or the container image is being pulled.
Running
At least one pod is running and the Service endpoint is available. The agent is ready to receive traffic.
Failed
The Deployment failed to reach a healthy state. Inspect status.conditions and pod events for the root cause.
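The three phases lend themselves to a simple wait loop in deployment scripts: keep polling while the agent is Pending, stop on Running, and fail fast on Failed. The sketch below assumes you supply `fetch_phase`, a hypothetical callable that returns the current status.phase string (for example by shelling out to kubectl or calling the Kubernetes API); the timeout and interval defaults are arbitrary.

```python
import time


def wait_for_running(fetch_phase, timeout=120.0, interval=2.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll until the phase is Running; raise on Failed or on timeout."""
    deadline = clock() + timeout
    while True:
        phase = fetch_phase()
        if phase == "Running":
            return
        if phase == "Failed":
            raise RuntimeError(
                "agent entered Failed; inspect status.conditions and pod events"
            )
        if clock() >= deadline:
            raise TimeoutError(f"agent still {phase!r} after {timeout}s")
        sleep(interval)  # still Pending (or not yet reported): try again
```

The `sleep` and `clock` parameters exist only so the loop can be exercised in tests without real delays. In normal use you would call `wait_for_running(my_fetch)` and let the defaults apply.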