Mastering Node Readiness Controller: Ensuring Node Health in Kubernetes
The Node Readiness Controller exists to solve a critical problem in Kubernetes: ensuring that workloads are only placed on nodes that meet specific infrastructure requirements. Traditional readiness checks can fall short, especially during node bootstrapping. This controller enhances the readiness guarantee by dynamically managing taints based on custom health signals, thus preventing workloads from being scheduled on nodes that are not yet ready.
At its core, the Node Readiness Controller revolves around the NodeReadinessRule (NRR) API. This allows you to define declarative gates for your nodes. You can set it up in two operational modes: 'continuous enforcement' for ongoing checks or 'bootstrap-only enforcement' for one-time initialization steps. The controller reacts to Node Conditions, which means it doesn't perform health checks itself but relies on existing conditions to determine readiness. For example, you can create a rule that specifies a condition type like 'cniplugin.example.net/NetworkReady' and requires its status to be 'True'. If the condition is not met, the controller applies a taint, such as 'readiness.k8s.io/acme.com/network-unavailable', with an effect of 'NoSchedule' to prevent scheduling on that node.
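To make the condition-to-taint relationship concrete, here is a rough sketch of what a matching worker node might look like while the rule described above is still unsatisfied. It uses only standard core v1 Node fields (`spec.taints`, `status.conditions`). The node name, `reason`, and `message` values are hypothetical, and the component that reports the condition is an assumption (something other than the controller, such as a CNI agent or node monitoring tool).

```yaml
# Illustrative fragment of a Node object, not output captured from a real cluster.
apiVersion: v1
kind: Node
metadata:
  name: worker-1                         # hypothetical node name
  labels:
    node-role.kubernetes.io/worker: ""   # matches the rule's nodeSelector
spec:
  taints:
    # Taint applied by the controller while the condition is not met
    - key: "readiness.k8s.io/acme.com/network-unavailable"
      value: "pending"
      effect: "NoSchedule"
status:
  conditions:
    # Condition reported by some other component; the controller only reads it
    - type: "cniplugin.example.net/NetworkReady"
      status: "False"
      reason: "CNINotInitialized"                    # hypothetical value
      message: "CNI plugin has not finished setup"   # hypothetical value
```

Once the condition's status becomes "True", the rule is satisfied and the taint is no longer needed, at which point the node can accept regular workloads.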
In production, deploying new readiness rules across a fleet of nodes carries risk. A rule with a mistyped condition type or an overly broad node selector can taint healthy nodes, and a 'NoSchedule' taint blocks every new pod that does not tolerate it. Dry run mode helps here, allowing you to simulate the impact of your rules before any taints are applied. Remember, this controller is set to be available starting February 3, 2026, so plan your upgrades accordingly.
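One implication of the taint worth checking before rollout: whatever component makes the node ready has to be able to run on the tainted node itself. The sketch below shows a DaemonSet that tolerates the example taint using the standard Kubernetes `tolerations` field. The DaemonSet name, labels, namespace, and image are placeholders, and whether your networking agent is deployed this way is an assumption.

```yaml
# Illustrative only: names, namespace, and image are hypothetical placeholders.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-cni-agent
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: example-cni-agent
  template:
    metadata:
      labels:
        app: example-cni-agent
    spec:
      tolerations:
        # Tolerate the readiness taint so the agent can start on not-yet-ready nodes
        - key: "readiness.k8s.io/acme.com/network-unavailable"
          operator: "Exists"      # matches the taint regardless of its value
          effect: "NoSchedule"
      containers:
        - name: agent
          image: registry.example.com/cni-agent:1.0   # placeholder image
```

Without a toleration like this, the pod responsible for setting the condition to "True" could itself be kept off the node, leaving the taint in place indefinitely.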
Key takeaways
- Define a NodeReadinessRule (NRR) to set custom readiness gates for your nodes.
- Choose between 'continuous enforcement' and 'bootstrap-only enforcement' based on your needs.
- Use dry run mode to simulate the impact of a rule before taints are applied to your nodes.
- The controller reacts to Node Conditions; it does not perform health checks itself.
- Be cautious when deploying new readiness rules across a fleet.
Why it matters
In production, ensuring that workloads are only scheduled on fully prepared nodes can significantly reduce downtime and improve application reliability. The Node Readiness Controller helps maintain this readiness throughout the node's lifecycle.
Code examples
apiVersion: readiness.node.x-k8s.io/v1alpha1
kind: NodeReadinessRule
metadata:
  name: network-readiness-rule
spec:
  # Condition(s) that must hold before the node is considered ready
  conditions:
    - type: "cniplugin.example.net/NetworkReady"
      requiredStatus: "True"
  # Taint applied while the condition is not met
  taint:
    key: "readiness.k8s.io/acme.com/network-unavailable"
    effect: "NoSchedule"
    value: "pending"
  # One-time gate during node initialization
  enforcementMode: "bootstrap-only"
  # Apply the rule to worker nodes only
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""

When NOT to use this
The official docs don't call out specific anti-patterns here. Use your judgment based on your scale and requirements.