# k8s-resource-inspector (kri)

Analyzes Kubernetes workload resource utilization by combining ArgoCD Application CRs with Prometheus metrics. Classifies pod behavior, emits right-sizing recommendations, and can open PRs to apply them.
## Installation

```sh
go install github.com/LiveViewTech/platform-lab/tools/k8s-resource-inspector@latest
```

Or build from source:

```sh
go build -o ~/go/bin/kri ./tools/k8s-resource-inspector/
```
## Configuration

`~/.kri/config.yaml`:

```yaml
clusters:
  - argocd_cluster: in-cluster   # matches spec.destination.name in the ArgoCD Application CR
    prometheus: http://localhost:9090

# Optional: namespace where ArgoCD Application CRs live (default: "argocd")
argocd_namespace: argocd

# Optional: floor values for recommendations (defaults shown)
minimums:
  cpu_millicores: 10
  memory_mi: 16

# Optional: git committer identity for kri-authored commits (defaults shown)
git:
  author_name: kri
  author_email: kri@noreply.local

# Optional: GitHub settings (defaults shown)
github:
  base_branch: main
  api_url: https://api.github.com   # override for GitHub Enterprise Server
```
## Usage

### Inspect

```sh
kri inspect [flags]
```

Flags:

```
--window string        Observation window for Prometheus queries (default "7d")
--confidence float     Minimum confidence threshold for recommendations (default 0.8)
--findings-only        Only show workloads with recommendations or HPA warnings
--app string           Filter to a single ArgoCD application by name
-n, --namespace string Filter to a specific namespace
-o, --output string    Output format: table (default) or json
--kubeconfig string    Path to kubeconfig (defaults to KUBECONFIG env / ~/.kube/config)
--context string       Kubeconfig context to use
--config string        Path to kri config file (default ~/.kri/config.yaml)
```
### Plan and apply

kri can open one PR per app containing a `values-resources.yaml` file with the recommended resource changes. See the Apply workflow section below for setup.

Two-step (recommended):

```sh
kri plan             # generates kri-plan.yaml
# edit kri-plan.yaml — set apply: false to skip an app, adjust values
kri apply            # prints summary, prompts for confirmation, opens PRs
kri apply --dry-run  # shows what would happen without opening PRs
```

One-shot:

```sh
kri apply --all           # runs inspect, prints table + findings, prompts, opens PRs
kri apply --all --dry-run
```

`kri plan` flags:

```
--window string       Observation window (default "7d")
--confidence float    Confidence threshold (default 0.8)
--dir string          Directory to write kri-plan.yaml (default: current directory)
```

`kri apply` flags:

```
--all                 Run inspect pipeline instead of reading kri-plan.yaml
--dry-run             Show what would be applied without opening PRs
--dir string          Directory to read kri-plan.yaml from (default: current directory)
--window string       Observation window (only with --all, default "7d")
--confidence float    Confidence threshold (only with --all, default 0.8)
```

`GITHUB_TOKEN` must be set in the environment when running `kri apply` (not required for `--dry-run`).
## Apply workflow

kri writes resource changes to a separate `values-resources.yaml` file alongside each app's main `values.yaml`. This keeps kri's changes isolated and avoids YAML round-trip issues with the main config.

One-time ArgoCD setup: add `values-resources.yaml` to your AppSet's `valueFiles` list with `ignoreMissingValueFiles: true`. The file is picked up automatically once kri creates it; before that, ArgoCD silently ignores the missing file.

```yaml
# In your ApplicationSet template
spec:
  source:
    helm:
      valueFiles:
        - values.yaml
        - values-resources.yaml
      ignoreMissingValueFiles: true
```

Values file format (`values-resources.yaml` — do not edit manually):

Single-container apps:

```yaml
# Generated by kri — do not edit manually
resources:
  requests:
    cpu: 10m
    memory: 16Mi
  limits:
    cpu: 10m
    memory: 16Mi
```

Multi-container apps use a `containers:` list with per-container `resources` blocks.

Note: when `requests == limits` (Guaranteed QoS), kri updates both together to preserve the QoS class.
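For multi-container apps, the generated file might look like the following sketch. The container names and values are hypothetical; the `containers:` list shape follows the note above, but the exact keys kri emits are an assumption, so treat a real generated file as the source of truth.

```yaml
# Generated by kri — do not edit manually
# Hypothetical multi-container layout (assumed schema)
containers:
  - name: app
    resources:
      requests:
        cpu: 50m
        memory: 128Mi
      limits:
        cpu: 50m
        memory: 128Mi
  - name: sidecar
    resources:
      requests:
        cpu: 10m
        memory: 16Mi
```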
## Output

### Table columns

| Column | Description |
|---|---|
| APP | ArgoCD Application CR name (`metadata.name`). Typically matches the Helm release name unless `spec.source.helm.releaseName` is set explicitly. |
| CLUSTER | ArgoCD destination cluster name (`spec.destination.name`) |
| NAMESPACE | Pod namespace |
| POD | Pod name |
| CONTAINER | Container name |
| CPU_REQ | CPU request from kube-state-metrics |
| CPU_P95 | p95 CPU usage over the observation window |
| CPU_P99 | p99 CPU usage over the observation window |
| MEM_REQ | Memory request from kube-state-metrics |
| MEM_P95 | p95 memory working set over the observation window |
| MEM_P99 | p99 memory working set over the observation window |
| MEM/LIM | p99 memory as a percentage of the memory limit |
| BEHAVIOR | Classified behavior (see below) |
| CONF | Classification confidence |
| HPA | HPA validation result: `-`, `OK`, `WARN`, or `ERROR` |
| REC | Recommendation flag: `-` (none), `ok` (within tolerance), `YES` (actionable), `hold` |

Rows flagged with WARN/ERROR in HPA or YES in REC are expanded in the Findings block printed below the table.
### Behavior classes

| Class | Meaning |
|---|---|
| STATIC | Low, stable utilization — good candidate for right-sizing |
| SPIKY | High p99/p50 ratio — bursting workload |
| GROWTH | Sustained memory trend upward toward the limit |
| RUNAWAY | Memory p99 at or near limit — OOM risk |
| MIXED | Pods within the same workload disagree — investigate divergence before acting |
| UNKNOWN | Insufficient data |

Classification thresholds:

- RUNAWAY: mem p99 ≥ 90% of limit
- SPIKY: CPU p99/p50 ≥ 2.0 or mem p99/p50 ≥ 1.8
- GROWTH: trend > 1% of mem p50/hr AND mem p99 ≥ 30% of limit (pods well within their limit are not classified GROWTH, to avoid trend noise on idle workloads)
- STATIC: CPU p99/p50 < 1.5 AND mem p99/p50 < 1.3 AND flat trend
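The thresholds above might compose roughly like this minimal sketch. The precedence order (most severe first) and the "flat trend" cutoff are assumptions, not kri's documented behavior, and MIXED is omitted because it compares pods within a workload rather than classifying a single series.

```python
def classify(cpu_ratio, mem_ratio, mem_p99, mem_limit, trend, mem_p50):
    """Sketch of the behavior thresholds (precedence order is an assumption).

    cpu_ratio / mem_ratio: p99/p50 over the observation window
    trend: memory slope in bytes/hour; the "flat" cutoff of 1% of p50/hr
           is an assumption mirroring the GROWTH threshold
    """
    flat = abs(trend) <= 0.01 * mem_p50
    if mem_p99 >= 0.90 * mem_limit:
        return "RUNAWAY"    # p99 at or near the memory limit: OOM risk
    if cpu_ratio >= 2.0 or mem_ratio >= 1.8:
        return "SPIKY"      # high p99/p50 ratio: bursting workload
    if trend > 0.01 * mem_p50 and mem_p99 >= 0.30 * mem_limit:
        return "GROWTH"     # sustained upward trend, already near the limit
    if cpu_ratio < 1.5 and mem_ratio < 1.3 and flat:
        return "STATIC"     # low, stable utilization
    return "UNKNOWN"        # no class matched / insufficient signal
```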
Recommendations add headroom above observed p99: +20% for CPU (rounded up to nearest 10m), +30% for memory (rounded up to nearest Mi). A change is only emitted when the recommended value differs from the current request by more than 10%.
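The headroom, rounding, floor, and tolerance rules can be sketched as follows. The rules themselves are taken from this README (including the `minimums` floors from the configuration section); the function names are illustrative, not kri's API.

```python
import math

def round_up(value, step):
    """Round value up to the nearest multiple of step."""
    return math.ceil(value / step) * step

def recommend_cpu(p99_millicores, floor_millicores=10):
    # +20% headroom, rounded up to the nearest 10m, clamped to the configured floor
    return max(round_up(p99_millicores * 1.2, 10), floor_millicores)

def recommend_memory(p99_mi, floor_mi=16):
    # +30% headroom, rounded up to the nearest Mi, clamped to the configured floor
    return max(round_up(p99_mi * 1.3, 1), floor_mi)

def should_emit(recommended, current_request):
    # Only emit a change when it moves the request by more than 10%
    return abs(recommended - current_request) > 0.10 * current_request
```

For example, a container with CPU p99 of 35m gets 35 × 1.2 = 42, rounded up to 50m; a memory p99 of 5Mi would compute to 7Mi but is clamped to the 16Mi floor.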
### HPA validation

| Check | Condition | Severity |
|---|---|---|
| CPU request missing | HPA targets CPU but no CPU request set | ERROR |
| Memory request missing | HPA targets memory but no memory request set | ERROR |
| Target utilization too high | HPA target % above p95 actual utilization | WARN |
| Target utilization too low | HPA target % well below p50 | WARN |
| Min replicas too low | minReplicas = 1 on a SPIKY workload | WARN |
| Max replicas too low | maxReplicas hit in Prometheus history | WARN |
| Scaling metric mismatch | CPU HPA on a memory-bound workload | WARN |
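As an illustration of why the two missing-request checks are ERRORs, here is a sketch of that validation. The data shapes and function name are hypothetical; kri's actual implementation is not shown here. Utilization-based HPA targets are computed as usage / request, so they are meaningless when the request is absent.

```python
def check_missing_requests(hpa_targets, requests):
    """Return ERROR findings for HPA-targeted resources with no request set.

    hpa_targets: resource names the HPA scales on, e.g. {"cpu", "memory"}
    requests:    requests set on the container, e.g. {"cpu": "100m"}
    """
    findings = []
    for resource in sorted(hpa_targets):
        if resource not in requests:
            # Without a request, usage/request utilization is undefined
            findings.append(("ERROR", f"HPA targets {resource} but no {resource} request set"))
    return findings
```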
## Data sources

- Pod inventory: `kube_pod_container_resource_requests`, `kube_pod_container_resource_limits`, `kube_pod_status_phase` from kube-state-metrics
- Workload resolution: `kube_pod_owner` (pod → ReplicaSet) + `kube_replicaset_owner` (ReplicaSet → Deployment) chain
- CPU usage: `rate(container_cpu_usage_seconds_total[5m])` quantiles via `quantile_over_time`
- Memory usage: `container_memory_working_set_bytes` quantiles via `quantile_over_time`
- Memory trend: `deriv(container_memory_working_set_bytes[window]) * 3600` (bytes/hour)
- HPA config: `autoscaling/v2` HorizontalPodAutoscaler resources via the Kubernetes API
- Values files: Helm values read directly from git via shallow clone
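The pod → ReplicaSet → Deployment owner chain can be sketched like this. The two-hop lookup matches the `kube_pod_owner` / `kube_replicaset_owner` chain described above; the dict-based inputs are a simplification of real Prometheus query results.

```python
def resolve_workload(pod, pod_owner, replicaset_owner):
    """Resolve a pod to its top-level workload via kube-state-metrics owner labels.

    pod_owner:        pod name        -> (owner_kind, owner_name), from kube_pod_owner
    replicaset_owner: replicaset name -> (owner_kind, owner_name), from kube_replicaset_owner
    """
    kind, name = pod_owner.get(pod, ("<none>", pod))
    if kind == "ReplicaSet":
        # Second hop: ReplicaSet -> Deployment
        kind, name = replicaset_owner.get(name, (kind, name))
    return kind, name

# Hypothetical sample data in the shape of the two owner metrics
pods = {"web-7d9f4c-abcde": ("ReplicaSet", "web-7d9f4c")}
replicasets = {"web-7d9f4c": ("Deployment", "web")}
```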
## kri-operator

kri's logic is also automated by kri-operator, a Kubernetes operator that runs the inspect → plan → apply workflow on a schedule and posts rollback-diagnosis reports to Slack. The CLI remains fully functional as a developer and debugging interface to the same underlying logic.