Microk8s IP Address Exhaustion - Troubleshooting Guide

Problem Statement

A microk8s Kubernetes cluster experienced a complete outage: all pods were stuck in ContainerCreating, PodInitializing, or Init:0/1 and could not start. None of the pods had an IP address assigned (<none> in the IP column), and the cluster had been in this state for an extended period (some pods showed an age of 124+ days).

Initial Symptoms

39 pods across multiple namespaces all failing to start
All pods showing IP: <none> when queried with kubectl get pods -o wide
Error messages in events: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox... plugin type="flannel" name="flannel-plugin" failed (add): failed to delegate add: failed to allocate for range 0: no IP addresses available in range set: 10.244.69.1-10.244.69.254

Initial Status Assessment

Commands Run for Initial Diagnosis

# Check microk8s status
ssh virt-infra-prod "microk8s status"

# Check node status
ssh virt-infra-prod "microk8s kubectl get nodes"

# Check all pods across namespaces
ssh virt-infra-prod "microk8s kubectl get pods --all-namespaces -o wide"

# Check recent events
ssh virt-infra-prod "microk8s kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -20"

# Check microk8s service logs
ssh virt-infra-prod "systemctl status snap.microk8s.daemon-kubelite.service"

Initial Findings

Microk8s service was running normally
Node status showed Ready
All 39 pods had no IP addresses
Consistent error: no IP addresses available in range set: 10.244.69.1-10.244.69.254
Flannel subnet configuration: 10.244.69.1/24 (254 available IPs)

Investigation Process

Investigation 1: Check Container Runtime Status

What was checked:

# Try to count running containers
ssh virt-infra-prod "/snap/bin/microk8s.ctr --namespace k8s.io containers list 2>/dev/null | wc -l"

Result: 0 containers running

What this told us: No containers were actually running, yet IPs were supposedly exhausted. This suggested leaked/orphaned resources.

Investigation 2: Examine Network Interfaces

What was checked:

# Look for CNI network interfaces
ssh virt-infra-prod "ip addr show | grep -E '^[0-9]+:|inet ' | grep -A1 cni"

# Check cni0 bridge status
ssh virt-infra-prod "ip link show cni0 && ip addr show cni0"

# Count veth interfaces
ssh virt-infra-prod "ip link show | grep 'veth.*master cni0' | wc -l"

Result: Found 253 orphaned veth interfaces attached to the cni0 bridge

What this told us: These orphaned virtual ethernet interfaces were holding IP addresses but not associated with any running containers. This is a major clue pointing to leaked network resources.
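
The veth count above can also be taken per bridge, which helps when a host carries more than one bridge. The following is a minimal sketch, not part of the original investigation: `count_bridge_veths` is a hypothetical helper name, and it assumes the one-line-per-interface format printed by `ip -o link show`.

```shell
# Count veth interfaces enslaved to a given bridge.
# Reads `ip -o link show` output on stdin; the bridge name defaults to cni0.
# (count_bridge_veths is an illustrative helper, not a microk8s tool.)
count_bridge_veths() (
  bridge="${1:-cni0}"
  awk -v br="$bridge" '
    # Field 2 is the interface name, e.g. "veth0a1b2c3d@if3:".
    $2 ~ /^veth/ {
      for (i = 3; i < NF; i++)
        if ($i == "master" && $(i + 1) == br) { n++; break }
    }
    END { print n + 0 }'
)
```

On the host it could be used as `ip -o link show | count_bridge_veths cni0`.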

Investigation 3: Check Flannel Configuration

What was checked:

# Check flannel subnet configuration
ssh virt-infra-prod "cat /var/snap/microk8s/common/run/flannel/subnet.env"

# Check if flanneld is running
ssh virt-infra-prod "ps aux | grep -i flannel | grep -v grep"

# Check CNI configuration
ssh virt-infra-prod "sudo cat /var/snap/microk8s/current/args/cni-network/*.conflist"

Result:

- Flannel configured for subnet 10.244.69.1/24 (254 IPs)
- Flanneld daemon running normally
- CNI configuration using flannel plugin with delegation to bridge/host-local IPAM

What this told us: Configuration was correct, so the issue was in the state/allocation tracking, not the configuration.
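
To make the "254 IPs" figure reproducible, the sketch below derives the usable host count from the prefix length in `subnet.env`. `flannel_subnet_capacity` is a hypothetical helper, and it assumes the file contains a line of the form `FLANNEL_SUBNET=<ip>/<prefix>`. This is plain subnet arithmetic; host-local typically also skips the gateway address held by the bridge, so the number available to pods may be one lower.

```shell
# Print the number of usable host addresses in FLANNEL_SUBNET.
# $1: path to a flannel subnet.env file.
# (flannel_subnet_capacity is an illustrative helper name.)
flannel_subnet_capacity() (
  env_file="$1"
  # Extract the prefix length, e.g. 24 from FLANNEL_SUBNET=10.244.69.1/24.
  prefix=$(sed -n 's/^FLANNEL_SUBNET=.*\/\([0-9][0-9]*\)$/\1/p' "$env_file" | head -n 1)
  [ -n "$prefix" ] || { echo "FLANNEL_SUBNET not found in $env_file" >&2; exit 1; }
  # Total addresses minus the network and broadcast addresses.
  echo $(( (1 << (32 - prefix)) - 2 ))
)
```

For example, `sudo cat /var/snap/microk8s/common/run/flannel/subnet.env > /tmp/subnet.env` followed by `flannel_subnet_capacity /tmp/subnet.env`.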

Investigation 4: Check IPAM (IP Address Management) State - First Location

What was checked:

# Check flannel dataDir
ssh virt-infra-prod "sudo ls -la /var/snap/microk8s/common/var/lib/cni/flannel/"

# Count files in flannel dataDir
ssh virt-infra-prod "sudo ls /var/snap/microk8s/common/var/lib/cni/flannel | wc -l"

Result: Found 506 stale flannel plugin state files in /var/snap/microk8s/common/var/lib/cni/flannel/

What this told us: The flannel plugin was maintaining state about container network configurations, and these files were never cleaned up when containers were removed.

Investigation 5: First Cleanup Attempt (Partial Success)

What was tried:

# Stop microk8s
ssh virt-infra-prod "sudo microk8s stop"

# Remove orphaned veth interfaces
ssh virt-infra-prod "sudo bash -c 'for veth in \$(ip link show | grep -oP \"veth[a-f0-9]+(?=@)\"); do ip link delete \$veth 2>/dev/null; done'"

# Clear flannel dataDir
ssh virt-infra-prod "sudo rm -rf /var/snap/microk8s/common/var/lib/cni/flannel/*"

# Restart microk8s
ssh virt-infra-prod "sudo microk8s start"
ssh virt-infra-prod "/snap/bin/microk8s status --wait-ready"

# Wait and check pods
sleep 30
ssh virt-infra-prod "/snap/bin/microk8s kubectl get pods --all-namespaces -o wide"

Result: FAILED - Pods still showed no IPs and the same error messages

What this told us: There was another location where IPAM state was being persisted that we hadn't found yet.

Investigation 6: Check System-Wide CNI Directories

What was checked:

# Check for default CNI networks directory
ssh virt-infra-prod "sudo ls /var/lib/cni/networks/ 2>/dev/null"

# Count files in the system CNI directory
ssh virt-infra-prod "sudo ls /var/lib/cni/networks/microk8s-flannel-network/ | wc -l"

# Show sample files
ssh virt-infra-prod "sudo ls /var/lib/cni/networks/microk8s-flannel-network/ | head -10"

Result: Found 508 additional stale IPAM files in /var/lib/cni/networks/microk8s-flannel-network/ containing allocations from an OLD subnet range (10.1.54.x)

What this told us: This was the smoking gun! The host-local IPAM plugin (used by flannel's delegate) was writing to the system default location /var/lib/cni/networks/ and contained very old allocations from a previous network configuration. This state was preventing new allocations even though the subnet had changed to 10.244.69.0/24.
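
A quick way to see which subnets the allocation files belong to is to group the file names by their first three octets. This is a sketch under the assumption that each allocation file is named after the IP it reserves (as the sample listing suggests); `summarize_ipam_dir` is a hypothetical helper name.

```shell
# Group allocation files in a host-local state directory by /24 prefix.
# $1: state directory. Prints "<prefix>.x <count>" per prefix.
# (summarize_ipam_dir is an illustrative helper name.)
summarize_ipam_dir() (
  dir="$1"
  ls -1 "$dir" 2>/dev/null \
    | grep -E '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$' \
    | cut -d. -f1-3 \
    | sort \
    | uniq -c \
    | awk '{ print $2 ".x " $1 }'
)
```

Run from a root shell, `summarize_ipam_dir /var/lib/cni/networks/microk8s-flannel-network` would show at a glance whether entries from a prefix other than the current one are present.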

Investigation 7: Examine Flannel Delegate Configuration

What was checked:

# Look at generated delegate config from flannel
ssh virt-infra-prod "sudo find /var/snap/microk8s/common/var/lib/cni/flannel -type f | head -3 | xargs -I {} bash -c 'echo \"File: {}\" && sudo cat {}'"

Result: Found the generated CNI configuration:

{
  "cniVersion":"0.3.1",
  "hairpinMode":true,
  "ipMasq":false,
  "ipam":{
    "ranges":[[{"subnet":"10.244.69.0/24"}]],
    "routes":[{"dst":"10.244.0.0/16"}],
    "type":"host-local"
  },
  "isDefaultGateway":true,
  "isGateway":true,
  "mtu":1450,
  "name":"microk8s-flannel-network",
  "type":"bridge"
}

What this told us: The flannel plugin generates a delegate configuration using host-local IPAM, but critically, no dataDir was specified in the IPAM configuration. This means host-local uses its default location: /var/lib/cni/networks/<network-name>/. This is where the old stale files were living and blocking new allocations.
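
The rule described above (explicit dataDir if present, otherwise /var/lib/cni/networks, with the network name appended) can be sketched as a small lookup. `host_local_state_dir` is a hypothetical helper; it uses grep and sed rather than a JSON parser, so it is only a rough aid for simple configs like the one shown and assumes the first `"name"` key in the file is the network name.

```shell
# Print the directory where host-local would keep state for a delegate config.
# $1: path to a CNI delegate config (JSON).
# (host_local_state_dir is an illustrative helper, not part of any CNI tooling.)
host_local_state_dir() (
  conf="$1"
  # Return the first string value found for key $1 in the config text.
  json_string() {
    grep -o "\"$1\"[[:space:]]*:[[:space:]]*\"[^\"]*\"" "$conf" \
      | head -n 1 \
      | sed 's/^"[^"]*"[[:space:]]*:[[:space:]]*"\(.*\)"$/\1/'
  }
  name=$(json_string name)
  data_dir=$(json_string dataDir)
  [ -n "$name" ] || { echo "no network name in $conf" >&2; exit 1; }
  # Fall back to the host-local default when no dataDir is configured.
  echo "${data_dir:-/var/lib/cni/networks}/$name"
)
```

For the config shown above this prints /var/lib/cni/networks/microk8s-flannel-network, the directory examined in Investigation 6.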

Root Cause Analysis

The problem was caused by three layers of leaked network resources:

253 orphaned veth interfaces - Virtual ethernet interfaces that remained attached to the cni0 bridge after containers were removed
506 stale files in flannel dataDir (/var/snap/microk8s/common/var/lib/cni/flannel/) - Flannel plugin state files
508 stale IPAM allocations (/var/lib/cni/networks/microk8s-flannel-network/) - host-local IPAM state from an old subnet configuration (10.1.54.x range)

The third issue was the critical one: the host-local IPAM plugin checks for existing allocations in its state directory before allocating new IPs. Even though the subnet had changed from 10.1.54.0/24 to 10.244.69.0/24, the presence of 508 allocation files caused the IPAM to believe the IP space was exhausted.

Combined, these left the host-local IPAM with no address it considered free in the 10.244.69.0/24 subnet, preventing any new pods from starting.
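
Stale allocations can be identified rather than inferred by comparing each allocation file against the set of live container IDs. The sketch below assumes the first line of each allocation file holds the owning container ID (possibly followed by a carriage return) and that `lock` and `last_reserved_ip.*` are bookkeeping files; verify both on your host. `find_stale_allocations` is a hypothetical helper name.

```shell
# List allocation files whose recorded container ID is not in a list of live IDs.
# $1: host-local state directory
# $2: file containing one live container ID per line
# (find_stale_allocations is an illustrative helper name.)
find_stale_allocations() (
  dir="$1"
  live_ids="$2"
  for f in "$dir"/*; do
    [ -f "$f" ] || continue
    case "$(basename "$f")" in
      lock|last_reserved_ip.*) continue ;;   # bookkeeping, not allocations
    esac
    # Assumed layout: container ID on the first line; drop any trailing CR.
    id=$(head -n 1 "$f" | tr -d '\r')
    if ! grep -q -x -F -- "$id" "$live_ids"; then
      basename "$f"
    fi
  done
)
```

The list of live IDs has to come from the container runtime, for example the output of the `microk8s.ctr --namespace k8s.io containers list` command used in Investigation 1, reduced to the ID column.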

Solution - Step by Step

Prerequisites

SSH access to the microk8s host with sudo privileges
The user must be in the microk8s group (verify with groups)

Complete Fix Procedure

# Step 1: Connect to the microk8s host
ssh <microk8s-host>

# Step 2: Stop microk8s to prevent new allocations during cleanup
sudo microk8s stop
# Expected output: "Stopped."

# Step 3: Delete all orphaned veth interfaces
sudo bash -c 'for veth in $(ip link show | grep -oP "veth[a-f0-9]+(?=@)"); do ip link delete $veth 2>/dev/null; done'

# Step 4: Verify veth interfaces are removed
ip link show | grep veth | wc -l
# Expected output: 0

# Step 5: Clear flannel dataDir IPAM state
sudo rm -rf /var/snap/microk8s/common/var/lib/cni/flannel/*

# Step 6: Verify flannel dataDir is empty
sudo ls /var/snap/microk8s/common/var/lib/cni/flannel/ | wc -l
# Expected output: 0

# Step 7: Clear system CNI directory IPAM state (THE CRITICAL FIX)
sudo rm -rf /var/lib/cni/networks/microk8s-flannel-network/*

# Step 8: Verify system CNI directory is empty
sudo ls /var/lib/cni/networks/microk8s-flannel-network/ | wc -l
# Expected output: 0

# Step 9: Start microk8s
sudo microk8s start

# Step 10: Wait for microk8s to be ready
/snap/bin/microk8s status --wait-ready
# Expected output: "microk8s is running"

# Step 11: Wait for pods to initialize (30-60 seconds)
sleep 45

# Step 12: Verify pods are getting IPs and starting
/snap/bin/microk8s kubectl get pods --all-namespaces -o wide | grep -E "NAMESPACE|Running|10\.244\."

# Step 13: Check running pod count (--no-headers keeps the header row out of the count)
/snap/bin/microk8s kubectl get pods --all-namespaces --no-headers --field-selector=status.phase=Running | wc -l

# Step 14: Verify new IP allocations are working
sudo ls /var/lib/cni/networks/microk8s-flannel-network/ | wc -l
# Expected output: Should match or slightly exceed the number of running pods
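
Steps 5 and 7 delete state irreversibly. A more cautious variant, sketched here as an optional addition rather than part of the procedure that was actually run, archives a directory before emptying it so the stale files can still be inspected afterwards. `backup_and_clear` and the backup location are hypothetical.

```shell
# Archive a state directory into a backup location, then remove its contents.
# $1: directory to clear
# $2: directory that receives the tarball (created if missing)
# (backup_and_clear is an illustrative helper name.)
backup_and_clear() (
  src="$1"
  backup_root="$2"
  [ -d "$src" ] || { echo "not a directory: $src" >&2; exit 1; }
  mkdir -p "$backup_root" || exit 1
  archive="$backup_root/$(basename "$src")-$(date +%Y%m%d-%H%M%S).tar.gz"
  # Delete only after the archive has been written successfully.
  tar -czf "$archive" -C "$src" . || exit 1
  find "$src" -mindepth 1 -delete
  echo "$archive"
)
```

From a root shell, with microk8s stopped, this could stand in for the `rm -rf` commands, for example `backup_and_clear /var/lib/cni/networks/microk8s-flannel-network /root/cni-backup`.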

What Exactly Fixed the Problem

The critical fix was Step 7: Clearing the stale IPAM allocation files from /var/lib/cni/networks/microk8s-flannel-network/.

This directory contained 508 IP allocation files from a previous subnet configuration (10.1.54.x). The host-local IPAM plugin reads this directory to determine which IPs are allocated, and the presence of these old files caused it to believe the current 10.244.69.0/24 subnet was exhausted, even though:

- The subnet had changed
- The old allocations were for a different IP range
- No actual containers were using those IPs

By removing these stale files, the IPAM plugin could start fresh with the current subnet and properly allocate IPs from the 10.244.69.1-254 range.

The other cleanup steps (removing veth interfaces and clearing flannel dataDir) were important for complete system hygiene but were not the root cause of the IP exhaustion.

Verification and Results

After applying the fix:

- Pods began receiving IP addresses in the correct 10.244.69.x range
- 28+ pods transitioned to Running state within 2 minutes
- New veth interfaces created properly (one per pod)
- Network connectivity restored
- IPAM allocations working correctly (40 IPs allocated for ~39 pods)

Prevention and Best Practices

How This Happens

This issue typically occurs when:

1. Pods/containers are forcefully terminated or the node crashes without proper cleanup
2. The CNI plugin fails to remove network resources during pod deletion
3. Network configuration changes (subnet changes) without clearing old IPAM state
4. Long-running clusters accumulate leaked resources over time

Prevention Measures

Regular monitoring:

# Monitor orphaned veth interfaces
ip link show | grep veth | wc -l

# Monitor IPAM allocation count vs running pods
sudo ls /var/lib/cni/networks/microk8s-flannel-network/ | wc -l
microk8s kubectl get pods --all-namespaces --no-headers --field-selector=status.phase=Running | wc -l
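
The two counts are only useful when compared, so a periodic check might wrap them in a threshold test. The following sketch takes the counts as arguments so it can be fed from the commands above; `check_ipam_leak` and the default slack of 5 are assumptions chosen for illustration, not values from this incident.

```shell
# Compare IPAM allocations with running pods and report OK or WARN.
# $1: number of allocation files
# $2: number of running pods
# $3: tolerated difference (default 5, an arbitrary illustrative value)
# (check_ipam_leak is an illustrative helper name.)
check_ipam_leak() (
  allocated="$1"
  running="$2"
  slack="${3:-5}"
  if [ "$allocated" -gt $(( running + slack )) ]; then
    echo "WARN allocated=$allocated running=$running"
    exit 1
  fi
  echo "OK allocated=$allocated running=$running"
)
```

With the numbers from this incident, `check_ipam_leak 508 39` reports WARN and returns a non-zero status, while `check_ipam_leak 40 39` reports OK. Pods that use the host network do not consume an allocation, so the two counts are not expected to match exactly.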

Before subnet changes: Always clear IPAM state when changing network configuration:

sudo microk8s stop
sudo rm -rf /var/lib/cni/networks/microk8s-flannel-network/*
sudo rm -rf /var/snap/microk8s/common/var/lib/cni/flannel/*
# Update network configuration
sudo microk8s start

Troubleshooting Tips

If the fix doesn't work immediately:

Check for additional CNI state locations:

sudo find /var -name "*cni*" -type d 2>/dev/null
sudo find /var/snap/microk8s -name "*networks*" -type d 2>/dev/null

Check containerd logs for other errors:

sudo journalctl -u snap.microk8s.daemon-containerd -n 100 --no-pager

Verify flannel daemon is running:

ps aux | grep -i flannel | grep -v grep
systemctl status snap.microk8s.daemon-flanneld

Check for disk space issues:
```
df -h
```
Verify network connectivity:
```
ip route
ping -c 3 8.8.8.8
```

If pods still won't start after cleanup:

Check individual pod logs:

microk8s kubectl describe pod <pod-name> -n <namespace>
microk8s kubectl logs <pod-name> -n <namespace>

Check for image pull issues (common after long downtime):

microk8s kubectl get events --all-namespaces | grep -i "image\|pull"

Summary

This was a case of IP address exhaustion due to leaked IPAM state, specifically stale allocation files in the host-local IPAM directory that survived a subnet change. The investigation required checking multiple layers of the networking stack (containers → veth interfaces → IPAM state → CNI configuration) to identify all three locations where resources were leaked. The definitive fix was clearing the system CNI IPAM directory (/var/lib/cni/networks/microk8s-flannel-network/), which allowed the IPAM plugin to start fresh and properly allocate IPs from the current subnet range.