Fun with EKS and multiarch
Getting started, but not from scratch
So, unless you’ve been living under a rock, I would guess the emergence of Arm processors in the cloud has been of interest to you. Better price/performance, availability on all three major clouds, and a roadmap going forward are all touted as the future.
But obviously, it’s not without pitfalls or engineering effort. What do you think would happen if you swapped all your k8s nodes today with arm64 instead of amd64?
The recommended gradual process is usually built on affinity/anti-affinity or on taints and tolerations:
One way is to taint the node with its architecture, then slowly add tolerations to all the pods.
The other is to steer the pods to the right architecture with affinity and anti-affinity rules, letting things settle even if some pods error for a while.
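For reference, the taints-and-tolerations variant looks roughly like this (the taint key and value here are illustrative, not from any real cluster): taint the arm64 nodes with something like `kubectl taint nodes <node> arch=arm64:NoSchedule`, then add a matching toleration to each workload as you verify it runs on arm64:

```yaml
# Illustrative pod-spec fragment: tolerate the (hypothetical) arch=arm64 taint
# so this workload may schedule onto the tainted arm64 nodes.
tolerations:
- key: "arch"
  operator: "Equal"
  value: "arm64"
  effect: "NoSchedule"
```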
But there is another, much quicker way. No taints. No tolerations. No affinity and anti-affinity — and still non-destructive.
Enter QEMU, playing its role as a Linux user mode emulator.
User mode emulation, a quick primer
Using binfmt_misc, qemu registers ‘new’ binary formats for the kernel to understand and installs itself as their execution wrapper.
Normally, you would see something like this:
```
bash: ./program: cannot execute binary file: Exec format error

$ file program
program: ELF 32-bit LSB executable, ARM, EABI4 version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.16, not stripped
```
Obviously, if not assisted, an x86_64 kernel/machine can’t do much with an arm/aarch binary. It’s not the same language, so to speak.
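You can see the mismatch directly in the ELF header: the two-byte e_machine field at offset 18 names the target architecture. A minimal sketch (arch_of is a hypothetical helper of mine, not a standard tool; the numeric values come from the ELF specification, and od here assumes a little-endian host):

```sh
# Read the e_machine field (2 bytes, little-endian, at offset 18) of an
# ELF binary and map a few well-known values to architecture names.
arch_of() {
  m=$(od -An -tu2 -j18 -N2 "$1" | tr -d ' ')
  case "$m" in
    40)  echo arm ;;      # EM_ARM
    62)  echo x86_64 ;;   # EM_X86_64
    183) echo aarch64 ;;  # EM_AARCH64
    *)   echo "unknown ($m)" ;;
  esac
}

arch_of /bin/sh
```

binfmt_misc does essentially the same thing: it matches the leading bytes of the file against registered magic patterns.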
But qemu can emulate the environment sufficiently for such a binary to run. When you register qemu with the linux kernel’s binfmt_misc facility, instead of trying to directly execute the file, the kernel first checks whether the file’s format matches a registered one. On locating a valid binfmt registration, the appropriate qemu user-mode emulator is invoked and the executable runs. It may still have issues loading dynamic libraries and the like, if they are missing.
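For the curious, the registration itself is a single write to binfmt_misc, in the form `:name:type:offset:magic:mask:interpreter:flags`. A sketch of the string for aarch64 (this mirrors what qemu-binfmt-conf.sh generates, with the F flag being what --persistent turns on, but treat the exact bytes as illustrative):

```sh
# magic pins '\x7fELF', 64-bit little-endian, e_type EXEC/DYN (masked),
# and e_machine 0xb7 (aarch64); the kernel parses the \xNN escapes itself.
REG=':qemu-aarch64:M::\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\xb7\x00:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff\xff:/usr/bin/qemu-aarch64-static:F'
printf '%s\n' "$REG"

# To apply it (as root, with binfmt_misc mounted):
#   echo "$REG" > /proc/sys/fs/binfmt_misc/register
```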
Now, as you may recall — containers are processes. When you run a container, sooner or later the container runtime will simply call ‘exec’ on the entrypoint, spawning a process.
So what happens if that process is not a native binary, or eventually calls one that isn’t? With qemu it’s seamless. The kernel can’t even ‘tell’ it’s not running a native binary. It ‘just works’, regardless of your container runtime.
But slower. Definitely and noticeably slower.
So, when would this approach be recommended?
- You have many containers in your cluster, some of them third-party or otherwise hard to track architecture availability for (otherwise, you may feel comfortable just replacing everything).
- You know most of them will run natively (otherwise you’re in for a world of performance pain, making this a moot exercise).
- You’re willing to trade some performance degradation for a quicker and safer migration.
Also, if you’re setting up a cluster for embedded testing and the like, you get all sorts of embedded architectures for free. For certain embedded targets, the host itself is so slow that a server emulating it is much, much faster.
Making the magic happen
For today’s magic we need two things:
- The QEMU static binary packages. The dynamic ones won’t do!
- A script to load binfmt_misc and register the executors
The qemu package is available on Amazon Linux 2 and will even register itself, though only after a reboot. Since we don’t want a reboot, we’ll also use a qemu registration script, available courtesy of the qemu project.
I had issues with the dynamically linked qemu (this was painful enough to debug as it was), and things worked better with the statically linked version. Using the static binaries just requires adding the --qemu-suffix=-static parameter to the script. You will also need --persistent to make sure emulation carries into other kernel namespaces; otherwise containers won’t get the benefit of qemu.
Time to add this to our user data script:
```sh
yum -y install qemu-user-static
wget https://raw.githubusercontent.com/qemu/qemu/e75941331e4cdc05878119e08635ace437aae721/scripts/qemu-binfmt-conf.sh
chmod u+x qemu-binfmt-conf.sh
./qemu-binfmt-conf.sh --qemu-path=/usr/bin --qemu-suffix=-static --persistent=yes
```
And presto! As a cluster node starts up, it installs and configures qemu, with the configuration surviving a reboot too.
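If you want to verify the registration on a node, each registered format appears as a file under /proc/sys/fs/binfmt_misc (the path assumes the default mount; the helper name is mine). A guarded check, since the entry only exists where registration succeeded:

```sh
# Check whether the qemu aarch64 handler is registered on this machine;
# guarded so it also runs cleanly where binfmt_misc isn't set up.
show_qemu_binfmt() {
  entry=/proc/sys/fs/binfmt_misc/qemu-aarch64
  if [ -r "$entry" ]; then
    cat "$entry"   # prints 'enabled', the interpreter path, flags, and the magic/mask
  else
    echo "qemu-aarch64 not registered here"
  fi
}

show_qemu_binfmt
```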
Let’s test this by running an arm64 workload on an x86_64 machine:
```
➜ ~ kubectl exec -n laminar mycontainer-69c45b944c-99gf8 -- uname -a
Linux mycontainer-69c45b944c-99gf8 5.4.209-116.363.amzn2.x86_64 #1 SMP Wed Aug 10 21:19:18 UTC 2022 aarch64 GNU/Linux
```
As you can see, the kernel release string carries the host’s x86_64 architecture, while the machine architecture is reported as aarch64, the emulated one.
The opposite works too: an amd64 workload under arm64:
```
✗ kubectl exec -n laminar mycontainer-c69f4b4c6-4b9kq -- uname -a
Linux mycontainer-c69f4b4c6-4b9kq 5.4.209-116.363.amzn2.aarch64 #1 SMP Wed Aug 10 21:19:18 UTC 2022 x86_64 GNU/Linux
```
Wonderful. Now we can upgrade our cluster to Graviton instances and it will keep on humming even if by chance we still have a few x86_64-only containers.