Kubernetes with IPv6 on AWS

#aws #kubernetes #ipv6

The Kubernetes ecosystem has been working hard on supporting IPv6 over the last few years, and kOps is no different.
There are two ways we have been exploring:

Running with a private subnet, with Pod IPs behind NAT.
Running with a public subnet with fully routable Pod IPs.

Both of these more or less work on AWS, but not without caveats.

Configuring the cluster

Regardless of which mode is used, the VPC needs IPv6 enabled, and each instance needs an allocated IPv6 address that is added to its Node object. This is all handled by kOps and the Cloud Controller Manager.

Private IPs

A cluster with private IPv6 addresses is relatively simple to set up. As with IPv4, the cluster is configured with one flat IPv6 CIDR, and the CNI takes care of configuring routes and tunnelling between the instances, masquerading traffic destined for external IPs, and so on.

You can configure the Cluster spec directly to use IPv6, but kOps also provides the --ipv6 flag to simplify the configuration.

Public IPs

Running with private IPv6 addresses is nice for testing how well Kubernetes and its components work with IPv6, but the true advantages come when the IPs are publicly routable. Removing NAT, tunnelling, and overlay networking gives a performance boost in itself, but you can also do things such as having cloud load balancers target Pods directly instead of going through NodePorts and bouncing off kube-proxy.

kOps supports public IPs on AWS by assigning an IPv6 prefix to each Node's primary interface and using this prefix as the Node's Pod CIDR.

This means any CNI that supports Kubernetes IPAM (and most do) can support publicly routable IPv6 addresses.

In order to run in this mode, just add spec.podCIDRFromCloud: true to the Cluster spec.

$ kgp -o wide
NAME                                                                  READY   STATUS    RESTARTS   AGE   IP                                       NODE                                          NOMINATED NODE   READINESS GATES
aws-cloud-controller-manager-rm9bf                                    1/1     Running   0          16h   172.20.52.202                            ip-172-20-52-202.eu-west-1.compute.internal   <none>           <none>
cert-manager-58c7f89d46-5ttmx                                         1/1     Running   0          16h   2a05:d018:4ea:8101:ba62::f4c8            ip-172-20-52-202.eu-west-1.compute.internal   <none>           <none>
cert-manager-cainjector-5998558479-lvvsr                              1/1     Running   0          16h   2a05:d018:4ea:8101:ba62::6d33            ip-172-20-52-202.eu-west-1.compute.internal   <none>           <none>
cert-manager-webhook-756bb49f7d-f4pfh                                 1/1     Running   0          16h   2a05:d018:4ea:8101:ba62::2cdc            ip-172-20-52-202.eu-west-1.compute.internal   <none>           <none>
cilium-7mjbl                                                          1/1     Running   0          16h   2a05:d018:4ea:8103:6f5a:dc57:f7b7:b73a   ip-172-20-97-249.eu-west-1.compute.internal   <none>           <none>
cilium-operator-677b9469b7-8pndm                                      1/1     Running   0          16h   172.20.52.202                            ip-172-20-52-202.eu-west-1.compute.internal   <none>           <none>
cilium-psxfs                                                          1/1     Running   0          16h   2a05:d018:4ea:8101:2cc1:f30c:f885:6e6f   ip-172-20-54-232.eu-west-1.compute.internal   <none>           <none>
cilium-wq6xg                                                          1/1     Running   0          16h   2a05:d018:4ea:8102:ccc:bcce:24de:4840    ip-172-20-81-228.eu-west-1.compute.internal   <none>           <none>

(Yes, some Pods with hostNetwork: true have IPv4 addresses here. Such Pods receive the IP that the Node had when they started, and on the control plane that was an IPv4 address because the Node came up before Cloud Controller Manager assigned it an IPv6 address.)
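
To make the relationship between the Node prefix and the Pod IPs in the listing above concrete, here is a minimal sketch using Python's standard ipaddress module. The /80 prefix length and the /64 subnet are assumptions inferred from the addresses in the listing, not values read from the cluster, and classify is a hypothetical helper name.

```python
import ipaddress

# Assumption: the Node's delegated prefix is a /80, inferred from the
# cert-manager Pod IPs in the listing above.
node_pod_cidr = ipaddress.ip_network("2a05:d018:4ea:8101:ba62::/80")
# Assumption: the VPC subnet is a /64, inferred from the same listing.
subnet_cidr = ipaddress.ip_network("2a05:d018:4ea:8101::/64")


def classify(ip):
    """Say whether an address comes from the Pod CIDR, the subnet, or neither."""
    addr = ipaddress.ip_address(ip)
    if addr in node_pod_cidr:
        return "pod-cidr"
    if addr in subnet_cidr:
        return "subnet"
    return "other"


addresses = {
    "cert-manager-58c7f89d46-5ttmx": "2a05:d018:4ea:8101:ba62::f4c8",
    "cilium-psxfs": "2a05:d018:4ea:8101:2cc1:f30c:f885:6e6f",
    "cilium-operator-677b9469b7-8pndm": "172.20.52.202",
}

for name, ip in addresses.items():
    print(f"{name}: {classify(ip)}")
```

Under these assumptions, the cert-manager Pod falls inside the delegated prefix, the host-network Cilium Pod carries an address from the subnet itself, and the IPv4 address matches neither.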

Can I use this in production?

So the big question is how mature is running IPv6 clusters on AWS?

Not very. Yet.

Taking the simpler private IP mode first, we found several issues with how components decide which IP to use. For example, metrics-server will pick the first IP on the Node object regardless of what the Pod IP is, so the ordering of the Node IPs matters. CNIs still show behavior that suggests they are not that well-tested with IPv6 yet. For example, Cilium struggles with routing issues in this 18-month-old issue.
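
The ordering problem can be illustrated with a small sketch. This is not the metrics-server code; it only contrasts "take the first address" with a family-aware choice. The function names and the addresses (taken from the IPv6 documentation range) are hypothetical.

```python
import ipaddress


def first_address(node_addresses):
    """Naive selection: use whichever address is listed first."""
    return node_addresses[0]


def address_for_family(node_addresses, pod_ip):
    """Return the first Node address in the same IP family as the Pod IP."""
    wanted = ipaddress.ip_address(pod_ip).version
    for candidate in node_addresses:
        if ipaddress.ip_address(candidate).version == wanted:
            return candidate
    return None  # the Node has no address in that family


# Hypothetical Node object with the IPv4 address listed first.
node_addresses = ["172.20.52.202", "2001:db8:0:1::10"]
pod_ip = "2001:db8:0:1:ba62::f4c8"

print(first_address(node_addresses))
print(address_for_family(node_addresses, pod_ip))
```

With the IPv4 address listed first, the naive choice returns an address an IPv6-only Pod cannot reach, while the family-aware choice returns the IPv6 one.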

For public IPs, there are some additional problems. On most Linux distros, the accept_ra=2 sysctl must be set on the correct interfaces, and since the interface name depends on the distro and instance type, this is a bit tricky. On Ubuntu, this is not needed because systemd has taken over a lot of the kernel's responsibilities in this area. systemd is not without bugs though: when single-address DHCPv6 is mixed with prefix delegation, DHCPv6 breaks. Hopefully this fix will make it into Ubuntu soon. Cilium works around this issue, but all other CNIs lose Node connectivity about 5 minutes after kOps configuration has finished.

Then there are various important apps that do not understand IPv6 well. Many will try to talk to the IPv4 metadata API, for example. If you are lucky, the application uses a new enough version of the AWS SDK that you can set AWS_EC2_METADATA_SERVICE_ENDPOINT_MODE=IPv6.
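
As a rough illustration of what that environment variable changes, the sketch below maps the endpoint mode to the metadata service address. It is not SDK code: imds_endpoint is a hypothetical helper, and the case-insensitive matching and the IPv4 default are assumptions made for the example. The two addresses are the documented IPv4 and IPv6 instance metadata endpoints.

```python
import os

# Instance metadata endpoints for each mode.
IMDS_ENDPOINTS = {
    "ipv4": "http://169.254.169.254",
    "ipv6": "http://[fd00:ec2::254]",
}


def imds_endpoint(environ=os.environ):
    """Pick the metadata endpoint from the endpoint-mode variable."""
    # Assumption for this sketch: default to IPv4 when the variable is unset.
    mode = environ.get("AWS_EC2_METADATA_SERVICE_ENDPOINT_MODE", "IPv4")
    try:
        return IMDS_ENDPOINTS[mode.lower()]
    except KeyError:
        raise ValueError(f"unsupported endpoint mode: {mode!r}") from None


print(imds_endpoint({"AWS_EC2_METADATA_SERVICE_ENDPOINT_MODE": "IPv6"}))
```

An application that hard-codes the IPv4 address never consults the variable, which is why older SDK versions fail on IPv6-only Pods.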

One of the benefits I mentioned above was using Pods as targets for load balancers. This is a feature that AWS Load Balancer Controller supports. But alas! AWS has two endpoints for the EC2 API: a single-stack IPv4 endpoint at ec2.<region>.amazonaws.com and a dual-stack one at api.ec2.<region>.aws (for example, api.ec2.eu-west-1.aws). The SDK will use the former unless it is configured in code to use something else, and the controller does not currently offer a way to do that. There is a pull request for this, but that only brings you to the next component. And if you want to use Cluster Autoscaler, you are also out of luck, because AWS doesn't provide a dual-stack endpoint for the autoscaling API at all.

Even if IPv6 worked perfectly at the cluster level, and AWS provided dual-stack endpoints for all of its APIs, you would probably still need to talk to other resources that only provide IPv4 addresses. In order to reach those, AWS would have to provide DNS64/NAT64, which allows resources with single-stack IPv6 addresses to talk to resources with single-stack IPv4 addresses.

Hopefully support for this will be available soon.
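
For readers unfamiliar with DNS64/NAT64, the address mapping it relies on is simple. The sketch below embeds an IPv4 address in the well-known 64:ff9b::/96 prefix from RFC 6052, which is what a DNS64 resolver does when it synthesizes an AAAA record, and reverses the mapping the way a NAT64 gateway would. The function names are hypothetical, and a real deployment may use a network-specific prefix instead.

```python
import ipaddress

# Well-known NAT64 prefix defined in RFC 6052.
WELL_KNOWN_PREFIX = ipaddress.ip_network("64:ff9b::/96")


def synthesize(ipv4):
    """Embed an IPv4 address in the low 32 bits of the NAT64 prefix."""
    v4 = ipaddress.IPv4Address(ipv4)
    return ipaddress.IPv6Address(int(WELL_KNOWN_PREFIX.network_address) | int(v4))


def extract(ipv6):
    """Recover the IPv4 address from a synthesized IPv6 address."""
    v6 = ipaddress.IPv6Address(ipv6)
    if v6 not in WELL_KNOWN_PREFIX:
        raise ValueError(f"{v6} is not inside {WELL_KNOWN_PREFIX}")
    return ipaddress.IPv4Address(int(v6) & 0xFFFFFFFF)


synthesized = synthesize("192.0.2.33")
print(synthesized)           # 64:ff9b::c000:221
print(extract(synthesized))  # 192.0.2.33
```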
