The Kubernetes ecosystem has been working hard on supporting IPv6 over the last few years, and kOps is no different.
There are two ways we have been exploring:
- Running with a private subnet with Pod IPs behind NAT.
- Running with a public subnet with fully routable Pod IPs.
Both of these more or less work on AWS, but not without caveats.
Regardless of which mode is used, the VPC needs IPv6 enabled, and each instance needs an allocated IPv6 address that is added to its Node object. This is all handled by kOps and the Cloud Controller Manager.
A cluster with private IPv6 addresses is relatively simple to set up. As with IPv4, the cluster is configured with one flat IPv6 CIDR, and the CNI takes care of configuring routes and tunnelling between the instances, masquerading traffic destined for external IPs, and so on.
You can configure the Cluster spec directly to use IPv6, but kOps also provides the
--ipv6 flag to simplify the configuration.
Running with private IPv6 addresses is nice for testing how well Kubernetes and its components work with IPv6, but the true advantages come when the IPs are publicly routable. Removing NAT, tunnelling, and overlay networking gives a performance boost in itself, but you can also do things such as having a cloud load balancer target Pods directly instead of going through NodePorts and bouncing off kube-proxy.
kOps supports public IPs on AWS by assigning an IPv6 prefix to each Node's primary interface and using this prefix as the Node's Pod CIDR.
This means any CNI that supports Kubernetes IPAM (and most do) can support publicly routable IPv6 addresses.
In order to run in this mode, just add
spec.podCIDRFromCloud: true to the Cluster spec.
    $ kgp -o wide
    NAME                                      READY   STATUS    RESTARTS   AGE   IP                                       NODE                                          NOMINATED NODE   READINESS GATES
    aws-cloud-controller-manager-rm9bf        1/1     Running   0          16h   172.20.52.202                            ip-172-20-52-202.eu-west-1.compute.internal   <none>           <none>
    cert-manager-58c7f89d46-5ttmx             1/1     Running   0          16h   2a05:d018:4ea:8101:ba62::f4c8            ip-172-20-52-202.eu-west-1.compute.internal   <none>           <none>
    cert-manager-cainjector-5998558479-lvvsr  1/1     Running   0          16h   2a05:d018:4ea:8101:ba62::6d33            ip-172-20-52-202.eu-west-1.compute.internal   <none>           <none>
    cert-manager-webhook-756bb49f7d-f4pfh     1/1     Running   0          16h   2a05:d018:4ea:8101:ba62::2cdc            ip-172-20-52-202.eu-west-1.compute.internal   <none>           <none>
    cilium-7mjbl                              1/1     Running   0          16h   2a05:d018:4ea:8103:6f5a:dc57:f7b7:b73a   ip-172-20-97-249.eu-west-1.compute.internal   <none>           <none>
    cilium-operator-677b9469b7-8pndm          1/1     Running   0          16h   172.20.52.202                            ip-172-20-52-202.eu-west-1.compute.internal   <none>           <none>
    cilium-psxfs                              1/1     Running   0          16h   2a05:d018:4ea:8101:2cc1:f30c:f885:6e6f   ip-172-20-54-232.eu-west-1.compute.internal   <none>           <none>
    cilium-wq6xg                              1/1     Running   0          16h   2a05:d018:4ea:8102:ccc:bcce:24de:4840    ip-172-20-81-228.eu-west-1.compute.internal   <none>           <none>
(Yes, some Pods with
hostNetwork: true have IPv4 addresses here. The reason is that such Pods receive the IP the Node had at the time they started, which in the case of the control plane was IPv4, as the Node came up before the Cloud Controller Manager assigned it an IPv6 address.)
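To make the prefix-as-Pod-CIDR idea concrete, here is a minimal Python sketch that checks which of the addresses in the listing above fall inside one Node's prefix. The /80 prefix length and the prefix itself are assumptions inferred from the Pod IPs shown for ip-172-20-52-202, not values read from AWS, and `in_pod_cidr` is a hypothetical helper.

```python
import ipaddress

# Assumption: the Node received an /80 prefix, inferred from the Pod IPs
# listed for ip-172-20-52-202. It is not read from the cloud API here.
node_pod_cidr = ipaddress.ip_network("2a05:d018:4ea:8101:ba62::/80")

# Pod IPs copied from the listing above (cert-manager Pods on that Node).
pod_ips = [
    "2a05:d018:4ea:8101:ba62::f4c8",
    "2a05:d018:4ea:8101:ba62::6d33",
    "2a05:d018:4ea:8101:ba62::2cdc",
]


def in_pod_cidr(ip, cidr):
    """Return True if ip belongs to cidr (same IP family and inside the range)."""
    addr = ipaddress.ip_address(ip)
    return addr.version == cidr.version and addr in cidr


for ip in pod_ips:
    print(ip, in_pod_cidr(ip, node_pod_cidr))

# A host-network Pod uses the Node's own address, which sits in the same
# subnet here but outside the delegated prefix.
print(in_pod_cidr("2a05:d018:4ea:8101:2cc1:f30c:f885:6e6f", node_pod_cidr))
```

Because the Pod CIDR is an ordinary IPv6 network, any IPAM that can hand out addresses from a per-Node CIDR can use it unchanged.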
So the big question is: how mature is running IPv6 clusters on AWS?
Not very. Yet.
Taking the simpler private IP mode first, we found several issues with how components decide which IP to use. For example, metrics-server will pick the first IP on the Node object regardless of what the Pod IP is, so the ordering of the Node IPs matters. CNIs still show behavior that suggests IPv6 support is not that well tested yet. For example, Cilium struggles with routing, as described in this 18-month-old issue.
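The ordering problem can be illustrated with a small Python sketch. This is not metrics-server's actual code; both functions and the example addresses are hypothetical, and only show why "take the first address" and "take the address matching the Pod's IP family" can disagree.

```python
import ipaddress


def first_address(node_addresses):
    """Naive selection: use whatever address is listed first on the Node."""
    return node_addresses[0]


def address_matching_family(node_addresses, pod_ip):
    """Pick the first Node address in the same IP family as the Pod IP."""
    family = ipaddress.ip_address(pod_ip).version
    for addr in node_addresses:
        if ipaddress.ip_address(addr).version == family:
            return addr
    return None  # the Node has no address in the Pod's family


# Illustrative values: a Node listing its IPv4 address before its IPv6 one.
node_addresses = ["172.20.52.202", "2a05:d018:4ea:8101:2cc1:f30c:f885:6e6f"]
pod_ip = "2a05:d018:4ea:8101:ba62::f4c8"

print(first_address(node_addresses))
print(address_matching_family(node_addresses, pod_ip))
```

With the IPv4 address listed first, the naive selection hands an IPv4 target to a component that itself only has an IPv6 address.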
For public IPs, there are some additional problems. On most Linux distros, the
accept_ra=2 sysctl must be set on the correct interfaces. Since the interface name depends on the distro and instance type, this is a bit tricky. On Ubuntu, this is not needed because systemd has taken over a lot of the kernel's responsibilities in this area. That said, systemd is not without bugs: when single-address DHCPv6 is mixed with prefix delegation, DHCPv6 breaks. Hopefully this fix will make it into Ubuntu soon. Cilium works around this issue, but all other CNIs lose Node connectivity about 5 minutes after kOps configuration has finished.
Then there are various important apps that do not understand IPv6 well. Many will try to talk to the IPv4 metadata API, for example. If you are lucky, the application uses a recent version of the AWS SDK so you can set
One of the benefits I mentioned above was using Pods as targets for load balancers. This is a feature that AWS Load Balancer Controller supports. But alas! AWS has two endpoints for the EC2 API. A single-stack IPv4 endpoint at
ec2.<region>.amazonaws.com and a dual-stack one at https://api.ec2.eu-west-1.aws. The SDK will use the former unless configured in code to use something else, and this is not currently possible with the controller. There is a pull request for this, but that only brings you to the next component. And if you want to use Cluster Autoscaler, you are also out of luck, because AWS doesn't provide a dual-stack endpoint for the autoscaling API at all.
Even if IPv6 worked perfectly at the cluster level, and AWS provided dual-stack endpoints for all its APIs, you would probably still need to talk to other resources that only provide IPv4 addresses. In order to reach those, AWS would have to provide DNS64/NAT64, which allows resources with single-stack IPv6 addresses to talk to resources with single-stack IPv4 addresses.
Hopefully support for this will be available soon.
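For readers unfamiliar with DNS64/NAT64, the following Python sketch shows the address mapping at its core: a DNS64 resolver embeds the IPv4 address of an IPv4-only destination in the low 32 bits of an IPv6 prefix, and the NAT64 gateway reverses that mapping when forwarding. The sketch assumes the well-known prefix 64:ff9b::/96 from RFC 6052; the prefix a given provider would use, and the function names, are assumptions.

```python
import ipaddress

# Assumption: the RFC 6052 well-known prefix. A deployment may use another /96.
WELL_KNOWN_PREFIX = ipaddress.ip_network("64:ff9b::/96")


def synthesize_nat64(ipv4, prefix=WELL_KNOWN_PREFIX):
    """Embed an IPv4 address in the low 32 bits of a /96 NAT64 prefix."""
    if prefix.prefixlen != 96:
        raise ValueError("this sketch only handles /96 prefixes")
    v4 = ipaddress.IPv4Address(ipv4)
    return ipaddress.IPv6Address(int(prefix.network_address) | int(v4))


def extract_ipv4(ipv6, prefix=WELL_KNOWN_PREFIX):
    """Recover the embedded IPv4 address, as a NAT64 gateway would."""
    addr = ipaddress.IPv6Address(ipv6)
    if addr not in prefix:
        raise ValueError("address is not inside the NAT64 prefix")
    return ipaddress.IPv4Address(int(addr) & 0xFFFFFFFF)


synthesized = synthesize_nat64("192.0.2.33")
print(synthesized)                     # 64:ff9b::c000:221
print(extract_ipv4(str(synthesized)))  # 192.0.2.33
```

An IPv6-only Pod would simply connect to the synthesized address; the translation back to IPv4 happens entirely in the network.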