Making Every Cycle Count in the Fight Against COVID

I have asthma, and the quicker we determine the right set biochemical properties of SARS-CoV-2 needed to develop an antiviral or vaccine at mass scale, the quicker I go back to some semblance of normal life. I’ll do quite a bit if I can to help the cause, which is why I’ve gone fairly deep on optimizing how I can increase my Folding@Home throughput. What follows is a whirlwind tour of the best ways I’ve found to increase folding performance, and by extension make my htop graphs look like this:

12 CPU folding threads and 1 GPU folding thread optimally scheduled

Folding@Home is a distributed computing project that simulates biological proteins as their atoms interact with each other. This involves a great number of floating point and vector operations. We’ve discussed this in detail on our previous podcast episode, “Fold, Baby, Fold.”

The high level strategies for increasing throughput include:

Optimizing thread scheduling
Reducing cache contention
Increasing instructions retired per cycle
Maximizing memory and I/O performance relative to NUMA layout
Eliminating wasteful overhead

Overclock Everything

This first point almost goes without saying: but overclock your CPU, your memory, and your GPU wherever possible without overheating or losing stability. (There’s an entirely separate post to be written about my heat management journey to date).

CPU Isolation

The rest of this post is gonna assume you’re in this to win this and willing to sacrifice nearly all of your computing cycles to the viral alchemical overlords who manage the big work unit in the sky. Me? I have 12 physical cores and 24 logical cores to work with on my Threadripper 1920x.

My normal usermode processes get 1 core: core 0. Why core 0? There are multiple IRQs that can’t be moved from that core, and it’s gonna have more context switches than the average core to begin with. Everything else is gonna be managed by me. I use the kernel boot arg “isolcpus=1-23”. I also set “nohz_full=1-23” to prevent the “scheduler tick” from running, which supposedly helps reduce context switches.

NUMA NUMA

It’s no longer 2004 and viral Europop pantomime videos are dead. It’s 2020 and consumer CPUs are letting some abstractions run a little closer to the physical layout of cores on die. Some of your cores might have priority access to one memory channel over another. Various devices on the PCI bus are also assigned priority access to certain cores. You can run the lstopo utility to check out your own configuration. Here’s mine:

Pay special attention to L3 cache layout and PCI device priority

In order for each node to properly be allocated half of the RAM in my system, I had to move one of my two DIMMs to the other side of the motherboard. This was discovered only after searching through random forum posts. There’s no great documentation here from AMD!

NUMA node #1 has access to the GPU on the PCI bus, so any threads managing GPU folding will be assigned to a core in that node.

Pinning the GPU Core

Folding@Home can give you work units to stress the CUDA cores on your overpriced GPU. You deserve to get the most out of your investment. In order to move data around all those CUDA cores, there’s one usermode thread that needs as much processing power as it can get. I pin my threads to logical cores 11 and 23 (both on NUMA node 1). Nothing else will run on those two logical cores (one physical core; logical core number modulo 12 is the physical core number for my Threadripper).

If you pin cores as above, that’s the single best way to improve GPU folding performance without tweaking a single GPU setting. You can do some stuff with renicing processes, but I couldn’t tell you how much that does compared to just giving a whole physical core to GPU folding.

CPU Pinning

Hyperthreading is a convenient white lie. You can try and run two CPU folding threads on one physical core, but unless you have very specialized hardware chances are the floating point unit in each core is going to be a bottleneck that prevents you from actually getting double the performance. Here’s what the backend instruction pipeline looks for the Zen architecture, which is what each of the cores on my TR1920x is based on:

You get a max of four 128-bit floating point operations per cycle. AVX256 instructions are 256 bits wide. That’s your bottleneck. Empirically, I’m not getting much better folding throughput per core by running 1 CPU thread versus 2. Your mileage may vary, but I’ve stuck with only scheduling 1 folding thread 1 CPU core per CPU core except where necessary.

Specifically, of logical cores 0-23, CPU folding currently occurs on core 8 and cores 12-22. I chose core 8 because it does not share an L3 cache with any cores that do GPU folding. I would like to keep L3 cache pressure as low as reasonably possible for those cores. This means physical core 8 (logical cores 8 and 20) has two CPU folding threads scheduled while every other physical core has at most 1. This, in my experience, has been a better arrangement than turning off hyperthreading/SMT in the BIOS settings.

Again, you can do some stuff with scheduler hints but again I can’t tell you if it’s worth it.

numactl

Linux processes can have hints for how to allocate new memory pages to a process relative to the the NUMA layout of a system. The one we want for CPU folding is “localalloc” which says that physical memory must be provided from the local NUMA node of the calling thread. This helps to ensure optimal memory performance. The easiest way to set this for a process is to use the numactl command.

Bypassing Needless Syscalls

If you are a writing a multithreaded application, one way you can have one thread wait for another to complete an action is to check as often as possible if that action is done yet. Another is to say “I’m gonna let some other thread do some work, wake me when it’s done and I’ll do another check.” The latter is what happens when you call the “sched_yield” syscall. Folding at Home CPU cores call that a lot. Probably to be nice to other processes (this is meant to be run in the background after all).

Do 11% of cycles really need to be spent in entry_SYSCALL_64?

The calls are initiated from userland via the sched_yield libc function which is a wrapper for the syscall. Because sched_yield is a dynamically loaded symbol, we can hook the loading with an appropriate LD_PRELOAD setting and force all calls to that function to immediately return 0 without ever yielding to the kernel. This empirically boosts folding throughput by a noticeable amount once you have threads pinned appropriately.

Disable Spectre Mitigations

Cycles spent on preventing Spectre attacks are cycles not spent folding proteins. There can be a non-trivial number of these cycles. The image above showed something like 7% of cycles spent in __x86_indirect_thunk_rax which is a Spectre mitigation construct.

Get rid of them by setting the “mitigations=off” kernel boot argument. Does this argument affect microcode, or do I need to downgrade microcode to fully disable mitigations? I don’t know–the documentation kinda sucks and I haven’t been able to find out!

Keep Your House in Order

Put kernel threads and IRQs on unused cores.

Keep your other usermode threads on core 0 unless you need more parallelism.

Putting It Together

Folding@Home distributes binary files named “FahCore_a7” and “FahCore_22” which nearly all folding work units are processed by. If you rename one to something like “FahCore_a7.orig” and replace the file on disk where “FahCore_a7” originally was with an executable shell script, you can run shell commands on the start of each new work unit. For example, you can set how the folding processes run without needing to poll for created processes. This also allows for LD_PRELOADing libraries into subprocesses.

Additionally, I run a shell script on a cron job set for every 10 minutes that ensures folding processes are properly pinned. A script moves kernel threads and IRQs to unused cores every half hour.

All of this is public at https://github.com/mitchellharper12/folding-scripts

Useful Tuning Tools:

cat for writing things to sysfs
taskset for pinning specific threads to specified CPUs
cset/cpuset for creating meta constructs for managing sets of threads and cores and memory nodes
numactl for setting NUMA flags for processes
renice for changing thread scheduler priority
ionice for changing io priority for threads
chrt for changing which part of the scheduler subsystem is used to manage a given thread
perf stat for getting periodic instructions per cycle data (I like to run watch -n 1 perf stat -C 8,12-22 sleep 1 in a root shell)
perf record for capturing trace data (don’t sleep on the -g flag for collecting stack traces)
perf report for displaying the data from perf record
Reading the perf examples blog post
htop for point-in-time visualization of workload distribution across cores (explanation of colors and symbols)

Applications to Fuzzing and Red Teaming

If you have a spare fuzzing rig or password cracker, running Folding@Home and optimizing thread scheduling is a great way to learn about how your kernel scheduler works. This can help you learn how to schedule threads for your workloads in order to maximize your iterations per second.

Additionally, this might give you some leverage to run untrusted workloads in a VM or container to mitigate Spectre without needing to take the performance hit of kernel mitigations (note we make no claims or warranties on this point).

The Linux perf tools allow for sampling of the behavior of a target thread without attaching a debugger.

Optimize for Throughput

The number 1 most predictive statistic for how well your optimizations are working is “the change over time in how long it takes the Folding@Home logs to update percent complete.” I’ve been running tail -F /var/lib/fahclient/log.txt and counting the average delta between timestamp updates for a given work unit. There are other stats you can try and optimize for, like instructions retired per cycle as reported by perf stat, but that can be misleading if you start over-optimizing for that. Note that when you restart the Folding@Home client, the first reported update in percentage complete needs to be ignored (a work unit can be checkpointed in between percentage updates).

Although these techniques were developed on a Threadripper, they apply to all the Intel Core series laptops I have scattered around my apartment running Folding@Home. You’ll know you’re making progress when the amount of time red bars are visible on htop CPU graphs significantly changes.

A Parting Message

Finally, you can help Symbol Crash help protein researchers by registering your client with folding team 244374. Remember that if you register for a user passkey you are eligible for a quick return bonus.

This post is a bit terse, but that’s because half the fun is building a mental model of how all the pieces fit together! Here’s a thread documenting some of my intermediate progress. The intro to our DMA special has some backstory on this whole effort.

If you have any follow up questions or want to rave about your own performance, you can call the Hacker HelpLine at (206) 486-6272 or send an email to podcast@symbolcrash.com to be featured on an upcoming episode of Hack the Planet!