<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Scott's Blog</title><link>https://scottjg.com/</link><description>Recent content on Scott's Blog</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Tue, 05 May 2026 10:17:25 -0800</lastBuildDate><atom:link href="https://scottjg.com/index.xml" rel="self" type="application/rss+xml"/><item><title>RTX 5090 + M4 MacBook Air: Can it Game?</title><link>https://scottjg.com/posts/2026-05-05-egpu-mac-gaming/</link><pubDate>Tue, 05 May 2026 10:17:25 -0800</pubDate><guid>https://scottjg.com/posts/2026-05-05-egpu-mac-gaming/</guid><description>&lt;p&gt;What if you could strap a full desktop GPU to your MacBook Air? Turns out, you can.&lt;/p&gt;
&lt;p style="font-size: 0.75em; color: var(--secondary);"&gt;Just a quick FTC required note: When you buy through my links, I may earn a commission.&lt;/p&gt;
&lt;h1 id="never-tell-me-the-odds"&gt;Never tell me the odds&lt;/h1&gt;
&lt;p&gt;As much as I hate to admit it, step one in most of my projects now is to ask AI about it. Maybe it&amp;rsquo;ll tell me something I don&amp;rsquo;t know.&lt;/p&gt;
&lt;p&gt;&lt;img alt="chatgpt says no" loading="lazy" src="https://scottjg.com/posts/2026-05-05-egpu-mac-gaming/never-tell-me-the-odds.png"&gt;&lt;/p&gt;
&lt;p&gt;Fortunately, borderline-impractical is kind of my thing.&lt;/p&gt;
&lt;h2 id="whats-a-thunderbolt-egpu"&gt;What&amp;rsquo;s a Thunderbolt eGPU?&lt;/h2&gt;
&lt;p&gt;Ok, so the plan is to plug a big PC gaming GPU, an NVIDIA RTX 5090, into my M4 MacBook Air. To do that, we plug it into a Thunderbolt dock which adapts PCIe to Thunderbolt, and we plug that into a USB-C port.&lt;/p&gt;
&lt;p&gt;Thunderbolt tunnels PCIe over a USB-C cable, so from the computer&amp;rsquo;s perspective a Thunderbolt device really is a PCIe device, not a USB one. You get 4 PCIe lanes at up to 40Gbps on Thunderbolt 4, with a small performance penalty for the tunneling. USB4 includes the same PCIe tunneling as an optional feature, so some non-Thunderbolt USB4 ports can do this too. You can use this to plug a GPU into a laptop with a compatible port.&lt;/p&gt;
&lt;figure&gt;
 &lt;img loading="lazy" src="thunderbolt-in-gpu-out.jpg"
 alt="Thunderbolt from the laptop plugs into the GPU dock. The GPU plugs into the monitor via DisplayPort. Shortly after this was taken, I broke this dock."/&gt; &lt;figcaption&gt;
 &lt;p&gt;Thunderbolt from the laptop plugs into the &lt;a href='https://amzn.to/3R2JWRe'&gt;GPU dock&lt;/a&gt;. The GPU plugs into the monitor via DisplayPort. Shortly after this was taken, I broke this dock.&lt;/p&gt;
 &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;From the computer&amp;rsquo;s perspective, the device looks more or less like a slightly slower PCIe device, so you can usually use the same drivers you&amp;rsquo;d normally use for those devices. eGPUs work pretty much out of the box on Linux and Windows. &lt;a href="https://scottjg.com/posts/2026-01-08-crappy-computer-showdown/"&gt;It&amp;rsquo;s even possible to use one on a Raspberry Pi&lt;/a&gt; (albeit with &lt;a href="https://en.wikipedia.org/wiki/PCI_Express#PCI_Express_OCuLink"&gt;OCuLink&lt;/a&gt;, not Thunderbolt).&lt;/p&gt;
&lt;p&gt;The first hurdle is that macOS does not ship with drivers for NVIDIA or AMD GPUs on Apple Silicon.&lt;/p&gt;
&lt;h2 id="what-about-tinygrad"&gt;What about tinygrad?&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://x.com/__tinygrad__/status/2039213719155310736"&gt;tinygrad recently released their own macOS eGPU drivers&lt;/a&gt;. It&amp;rsquo;s a whole new AI stack with its own open source driver pipeline for NVIDIA and AMD hardware.&lt;/p&gt;
&lt;p&gt;Sadly, if your main objective is to run AI inference or play games, tinygrad probably isn&amp;rsquo;t the solution you&amp;rsquo;re looking for. &lt;a href="https://www.youtube.com/watch?v=C4KWsmezXm4"&gt;This video by YouTuber Alex Ziskind&lt;/a&gt; shows that using an eGPU via tinygrad for inference is about &lt;strong&gt;10 times slower&lt;/strong&gt; than running native Metal inference directly on an M4 Pro without an eGPU. You can only use the tinygrad eGPU driver with the tinygrad stack, not for anything else. It also has very limited support for different AI models.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Graph showing tinygrad inference 10x slower than using native Metal inference on the same computer" loading="lazy" src="https://scottjg.com/posts/2026-05-05-egpu-mac-gaming/its-10x-slower.png"&gt;&lt;/p&gt;
&lt;p&gt;Getting NVIDIA PTX code running on the GPU is one thing. Writing a full general-purpose display driver that works with arbitrary software is a significantly harder problem. So for now, &lt;strong&gt;what can you actually do with an eGPU and a Mac?&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id="the-existing-linux-driver"&gt;The existing Linux driver&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://asahilinux.org"&gt;Linux can run on Apple Silicon Macs now&lt;/a&gt;. Regrettably, at this time, the Linux kernel does not support Thunderbolt on Apple Silicon (only internal devices and USB3). But&amp;hellip;&lt;/p&gt;
&lt;p&gt;You can run Linux in a 64-bit ARM VM on a macOS host. macOS supports Thunderbolt devices. Linux supports NVIDIA GPUs. Let&amp;rsquo;s put the pieces together and pass through the GPU into the Linux VM.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Diagram explaining PCI passthrough" loading="lazy" src="https://scottjg.com/posts/2026-05-05-egpu-mac-gaming/diagram1.png"&gt;&lt;/p&gt;
&lt;p&gt;At a high level, we&amp;rsquo;re just going to put the GPU in the Linux VM. The VM is the same architecture as the Mac host (arm64), so performance should be comparable. Of course, the devil is in the details.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;There is no driver for NVIDIA cards on ARM64 Windows. That&amp;rsquo;s why we use Linux.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;For a quick video demo of the result, take a look:
&lt;video src="https://pkg.scottjg.com/blog/mac-egpu-demo.mov#t=0.1" controls playsinline preload="metadata"&gt;&lt;/video&gt;&lt;/p&gt;
&lt;p&gt;In the rest of the post, I&amp;rsquo;ll go through the long and winding road of getting this to actually work. If you just want to see screenshots and benchmarks, you can probably skip to the &lt;a href="#benchmarks"&gt;benchmark section&lt;/a&gt;.&lt;/p&gt;
&lt;h1 id="engineering-pci-passthrough-on-macos"&gt;Engineering PCI Passthrough on macOS&lt;/h1&gt;
&lt;h2 id="pci-device-basics"&gt;PCI device basics&lt;/h2&gt;
&lt;p&gt;Let&amp;rsquo;s look at two things we need working for the VM to talk to the PCI device:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;PCI BARs (Base Address Registers)&lt;/strong&gt; - Each PCI device communicates through chunks of memory that the computer can read from and write to. There&amp;rsquo;s basically a reserved region of the computer&amp;rsquo;s address space for each device, and the BARs say where those regions live. Those memory regions have to be mirrored into the VM for PCI passthrough to work.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;DMA (Direct Memory Access)&lt;/strong&gt; - This is how the device reads and writes your computer&amp;rsquo;s memory directly. Instead of having the CPU burn cycles copying data to and from the device, the device moves the data itself. For a GPU, it might be used to copy textures directly from the computer&amp;rsquo;s memory into its own video memory.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
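&lt;p&gt;For a sense of what a BAR actually encodes, here&amp;rsquo;s a small sketch that decodes the flag bits of a BAR register and works out a region&amp;rsquo;s size from the standard sizing probe (write all ones, read the value back). The bit layout follows the PCI specification; the struct and function names are made up for illustration, and the example values are invented rather than read from real hardware.&lt;/p&gt;

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Decoded view of a PCI BAR register (names are hypothetical). */
typedef struct {
    bool is_io;         /* bit 0: 1 = I/O space, 0 = memory space */
    bool is_64bit;      /* bits 2:1 == 0b10: BAR spans two registers */
    bool prefetchable;  /* bit 3: reads have no side effects */
    uint64_t base;      /* address with the flag bits masked off */
} bar_info_t;

/* bar_lo is the BAR register itself; bar_hi is the following register,
 * which only matters when the BAR is 64-bit. */
bar_info_t decode_bar(uint32_t bar_lo, uint32_t bar_hi)
{
    bar_info_t info = {0};
    info.is_io = (bar_lo & 0x1u) != 0;
    if (info.is_io) {
        info.base = bar_lo & ~0x3u;   /* I/O BARs use bits 1:0 as flags */
        return info;
    }
    info.is_64bit = ((bar_lo >> 1) & 0x3u) == 0x2u;
    info.prefetchable = (bar_lo & 0x8u) != 0;
    info.base = bar_lo & ~0xFu;       /* memory BARs use bits 3:0 as flags */
    if (info.is_64bit)
        info.base |= (uint64_t)bar_hi << 32;
    return info;
}

/* Size of a 32-bit memory BAR, given the value read back after writing
 * 0xFFFFFFFF to it. Returns 0 if the BAR is not implemented. */
uint32_t bar_size_from_probe(uint32_t readback)
{
    uint32_t mask = readback & ~0xFu;
    return mask == 0 ? 0 : ~mask + 1u;
}

/* Usage example with invented register values. */
void decode_bar_example(void)
{
    bar_info_t b = decode_bar(0x0000000Cu, 0x00000060u);
    assert(!b.is_io && b.is_64bit && b.prefetchable);
    assert(b.base == 0x6000000000ULL);
    /* A readback of 0xFF000008 describes a 16 MiB region. */
    assert(bar_size_from_probe(0xFF000008u) == 16u * 1024u * 1024u);
}
```

&lt;p&gt;The prefetchable flag (bit 3) matters later on, because it decides whether a mapping can safely use a write-combining cache mode.&lt;/p&gt;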
&lt;h2 id="mapping-pci-bars"&gt;Mapping PCI BARs&lt;/h2&gt;
&lt;p&gt;When QEMU starts a VM, it sets up the guest&amp;rsquo;s memory layout. For normal RAM, this boils down to a call to &lt;a href="https://github.com/qemu/qemu/blob/master/accel/hvf/hvf-all.c#L81"&gt;hvf_set_phys_mem()&lt;/a&gt; in QEMU, which uses the &lt;code&gt;Hypervisor.framework&lt;/code&gt; method:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;hv_vm_map&lt;/span&gt;(mem, guest_physical_address, size, HV_MEMORY_READ &lt;span style="color:#f92672"&gt;|&lt;/span&gt; HV_MEMORY_WRITE &lt;span style="color:#f92672"&gt;|&lt;/span&gt; HV_MEMORY_EXEC);
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Next, we connect to the host PCIDriverKit driver and ask to map the memory from the PCI device into our process. (I&amp;rsquo;m leaving the driver-side code out for now, but it&amp;rsquo;s very similar boilerplate.)&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;// map BAR0 into the current process and set `addr` to the location
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;// where it was mapped
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;mach_vm_address_t&lt;/span&gt; addr &lt;span style="color:#f92672"&gt;=&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;mach_vm_size_t&lt;/span&gt; size &lt;span style="color:#f92672"&gt;=&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;IOConnectMapMemory64&lt;/span&gt;(driverConnection, &lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;, &lt;span style="color:#a6e22e"&gt;mach_task_self&lt;/span&gt;(), &lt;span style="color:#f92672"&gt;&amp;amp;&lt;/span&gt;addr, &lt;span style="color:#f92672"&gt;&amp;amp;&lt;/span&gt;size, kIOMapAnywhere);
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Ok, so then we have &lt;code&gt;addr&lt;/code&gt;, which now points to the BAR0 memory that we can access directly in our process. At this point you can just read and write stuff to it, like any other piece of memory.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;volatile&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;uint32_t&lt;/span&gt; &lt;span style="color:#f92672"&gt;*&lt;/span&gt;bar0 &lt;span style="color:#f92672"&gt;=&lt;/span&gt; (&lt;span style="color:#66d9ef"&gt;volatile&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;uint32_t&lt;/span&gt; &lt;span style="color:#f92672"&gt;*&lt;/span&gt;)addr;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;printf&lt;/span&gt;(&lt;span style="color:#e6db74"&gt;&amp;#34;BAR0[0] = 0x%x&lt;/span&gt;&lt;span style="color:#ae81ff"&gt;\n&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt;, bar0[&lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;]);
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;// this would output: BAR0[0] = 0x1b2000a1
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;// which is a device-specific constant that describes my RTX 5090
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;//
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;// BAR0[0] is the BOOT_0 register. The fields break down as:
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;// arch = 0x1b → GB20x (Blackwell) GPU family
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;// impl = 0x2 → GB202 die (RTX 5090)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;// major_rev = 0xa → stepping A
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;// minor_rev = 0x1 → revision 1 (together: stepping A1)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Now we just make sure QEMU calls &lt;code&gt;hvf_set_phys_mem()&lt;/code&gt; for our device memory, and we can map that into the guest. When guest code touches that mapping, it talks directly to the GPU with minimal host overhead. This is the best case for performance. At least, in theory.&lt;/p&gt;
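&lt;p&gt;As an aside, the BOOT_0 breakdown in the comments above can be reproduced with a few shifts and masks. This is only a sketch: the bit positions are inferred from the example value and the field list above, and &lt;code&gt;boot0_fields_t&lt;/code&gt;, &lt;code&gt;decode_boot0()&lt;/code&gt; and &lt;code&gt;boot0_chipset()&lt;/code&gt; are hypothetical names, not taken from any NVIDIA header.&lt;/p&gt;

```c
#include <assert.h>
#include <stdint.h>

/* Fields of the BOOT_0 register. The bit positions are an assumption
 * inferred from the example value 0x1b2000a1, not from vendor docs. */
typedef struct {
    uint8_t arch;       /* assumed bits 28:24 */
    uint8_t impl;       /* assumed bits 23:20 */
    uint8_t major_rev;  /* assumed bits 7:4 */
    uint8_t minor_rev;  /* assumed bits 3:0 */
} boot0_fields_t;

boot0_fields_t decode_boot0(uint32_t boot0)
{
    boot0_fields_t f;
    f.arch      = (uint8_t)((boot0 >> 24) & 0x1Fu);
    f.impl      = (uint8_t)((boot0 >> 20) & 0xFu);
    f.major_rev = (uint8_t)((boot0 >> 4) & 0xFu);
    f.minor_rev = (uint8_t)(boot0 & 0xFu);
    return f;
}

/* arch and impl read together as one chip id, e.g. 0x1b2. */
uint32_t boot0_chipset(uint32_t boot0)
{
    return (boot0 >> 20) & 0x1FFu;
}

/* Usage example with the value printed in the snippet above. */
void decode_boot0_example(void)
{
    boot0_fields_t f = decode_boot0(0x1b2000a1u);
    assert(f.arch == 0x1b && f.impl == 0x2);
    assert(f.major_rev == 0xa && f.minor_rev == 0x1);
    assert(boot0_chipset(0x1b2000a1u) == 0x1b2);
}
```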
&lt;p&gt;In practice, as soon as the VM touched the PCI BAR memory, the host kernel crashed.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of the Problem Report dialog" loading="lazy" src="https://scottjg.com/posts/2026-05-05-egpu-mac-gaming/panic.png"&gt;&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;ve never experienced this before, it&amp;rsquo;s disorienting. Your entire computer will hang, and because the trackpad feedback is controlled by software, suddenly the trackpad will no longer click. The dogs and cats in your neighborhood start howling. Pictures fall off the walls of your house. Eventually your computer will reboot, and you will be presented with this dialog.&lt;/p&gt;
&lt;p&gt;Ok, so we can&amp;rsquo;t map device memory directly, but we have other tricks up our sleeve. We can trap every access to the memory, exit the guest back into QEMU, and have QEMU forward each read or write to the device. That keeps behavior correct, but it&amp;rsquo;s brutally slow, because every BAR access now costs a full VM exit. For many workloads that isn&amp;rsquo;t the bottleneck, since most of the performance-sensitive work is DMA, but some paths still care how fast you can push commands through the BAR.&lt;/p&gt;
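&lt;p&gt;To make the trap-and-forward idea concrete, here&amp;rsquo;s a rough sketch of what handling one trapped access could look like. It assumes the standard AArch64 data-abort syndrome layout (the ISV, SAS, SRT and WnR fields) and uses a plain buffer in place of a real BAR mapping. &lt;code&gt;decode_data_abort()&lt;/code&gt; and &lt;code&gt;forward_mmio()&lt;/code&gt; are hypothetical helpers for illustration, not QEMU functions, and the sketch ignores sign-extending loads and assumes a little-endian host.&lt;/p&gt;

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The parts of an AArch64 data-abort syndrome that matter for MMIO. */
typedef struct {
    bool valid;     /* ISV, bit 24: the fields below are usable */
    bool is_write;  /* WnR, bit 6: 1 = write, 0 = read */
    unsigned size;  /* access size in bytes, from SAS (bits 23:22) */
    unsigned reg;   /* SRT, bits 20:16: guest register number */
} mmio_access_t;

mmio_access_t decode_data_abort(uint64_t syndrome)
{
    mmio_access_t a;
    a.valid    = ((syndrome >> 24) & 1u) != 0;
    a.is_write = ((syndrome >> 6) & 1u) != 0;
    a.size     = 1u << ((syndrome >> 22) & 3u);
    a.reg      = (unsigned)((syndrome >> 16) & 0x1Fu);
    return a;
}

/* Forward one trapped access to `bar`, a host mapping of the device BAR.
 * regs[] stands in for the guest's X0..X30; register 31 is the zero
 * register, so it reads as 0 and loads into it are discarded. */
void forward_mmio(volatile uint8_t *bar, uint64_t offset,
                  mmio_access_t a, uint64_t regs[31])
{
    volatile uint8_t *p = bar + offset;
    assert(a.valid && a.size <= 8);
    if (a.is_write) {
        uint64_t v = a.reg < 31 ? regs[a.reg] : 0;
        switch (a.size) {
        case 1:  *p = (uint8_t)v; break;
        case 2:  *(volatile uint16_t *)p = (uint16_t)v; break;
        case 4:  *(volatile uint32_t *)p = (uint32_t)v; break;
        default: *(volatile uint64_t *)p = v; break;
        }
    } else {
        uint64_t v;
        switch (a.size) {
        case 1:  v = *p; break;
        case 2:  v = *(volatile uint16_t *)p; break;
        case 4:  v = *(volatile uint32_t *)p; break;
        default: v = *(volatile uint64_t *)p; break;
        }
        if (a.reg < 31)
            regs[a.reg] = v;  /* zero-extended into the guest register */
    }
}

/* Usage example: a 4-byte store from w2, then an 8-byte load into x1,
 * both against a fake 16-byte "BAR". */
void forward_mmio_example(void)
{
    _Alignas(8) uint8_t fake_bar[16] = {0};
    uint64_t regs[31] = {0};
    regs[2] = 0x1b2000a1u;
    forward_mmio(fake_bar, 0, decode_data_abort(0x93820047u), regs);
    forward_mmio(fake_bar, 0, decode_data_abort(0x93C18007u), regs);
    assert(regs[1] == 0x1b2000a1u);
}
```

&lt;p&gt;In a real hypervisor loop the syndrome and the faulting address would come from the vCPU exit information, and the program counter has to be advanced past the trapped instruction before resuming. Every one of these round trips is a full guest exit, which is where the slowness comes from.&lt;/p&gt;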
&lt;p&gt;I started preparing a bug report for Apple and wrote a small reproduction (well, AI-assisted) to demonstrate the issue:&lt;/p&gt;
&lt;div class="code-scroll"&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;#include&lt;/span&gt; &lt;span style="color:#75715e"&gt;&amp;lt;Hypervisor/Hypervisor.h&amp;gt;&lt;/span&gt;&lt;span style="color:#75715e"&gt;
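&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;#include&lt;/span&gt; &lt;span style="color:#75715e"&gt;&amp;lt;IOKit/IOKitLib.h&amp;gt;&lt;/span&gt;&lt;span style="color:#75715e"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;#include&lt;/span&gt; &lt;span style="color:#75715e"&gt;&amp;lt;IOKit/IOMapTypes.h&amp;gt;&lt;/span&gt;&lt;span style="color:#75715e"&gt;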
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;#include&lt;/span&gt; &lt;span style="color:#75715e"&gt;&amp;lt;libkern/OSCacheControl.h&amp;gt;&lt;/span&gt;&lt;span style="color:#75715e"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;#include&lt;/span&gt; &lt;span style="color:#75715e"&gt;&amp;lt;stdlib.h&amp;gt;&lt;/span&gt;&lt;span style="color:#75715e"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;#include&lt;/span&gt; &lt;span style="color:#75715e"&gt;&amp;lt;string.h&amp;gt;&lt;/span&gt;&lt;span style="color:#75715e"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;#include&lt;/span&gt; &lt;span style="color:#75715e"&gt;&amp;lt;unistd.h&amp;gt;&lt;/span&gt;&lt;span style="color:#75715e"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;#define FAIL(code) do { result-&amp;gt;status = (code); goto cleanup; } while (0)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;#define HV_CHECK(expr, code) do { \
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt; if ((expr) != HV_SUCCESS) FAIL(code); \
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;} while (0)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;#define PREFETCHABLE_MASK 0x08
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;#define SELECTOR_GET_BAR_INFO 10
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;#define GUEST_CODE_IPA 0x4000ULL
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;#define GUEST_BAR_IPA 0x10000000ULL
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;static&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;const&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;uint32_t&lt;/span&gt; prog_read[] &lt;span style="color:#f92672"&gt;=&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#ae81ff"&gt;0xf9400001&lt;/span&gt;, &lt;span style="color:#75715e"&gt;/* ldr x1, [x0] */&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#ae81ff"&gt;0xd4000002&lt;/span&gt;, &lt;span style="color:#75715e"&gt;/* hvc #0 */&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#ae81ff"&gt;0xd4200000&lt;/span&gt;, &lt;span style="color:#75715e"&gt;/* brk #0 */&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;};
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;int&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;vfio_guest_bar_touch_run&lt;/span&gt;(&lt;span style="color:#66d9ef"&gt;io_connect_t&lt;/span&gt; connection, &lt;span style="color:#66d9ef"&gt;uint8_t&lt;/span&gt; bar,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; VFIOGuestBarTouchResult &lt;span style="color:#f92672"&gt;*&lt;/span&gt;result)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;{
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;size_t&lt;/span&gt; page &lt;span style="color:#f92672"&gt;=&lt;/span&gt; (&lt;span style="color:#66d9ef"&gt;size_t&lt;/span&gt;)&lt;span style="color:#a6e22e"&gt;sysconf&lt;/span&gt;(_SC_PAGESIZE);
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;void&lt;/span&gt; &lt;span style="color:#f92672"&gt;*&lt;/span&gt;code &lt;span style="color:#f92672"&gt;=&lt;/span&gt; NULL;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;bool&lt;/span&gt; vm_up &lt;span style="color:#f92672"&gt;=&lt;/span&gt; false, vcpu_up &lt;span style="color:#f92672"&gt;=&lt;/span&gt; false, bar_mapped &lt;span style="color:#f92672"&gt;=&lt;/span&gt; false;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;hv_vcpu_t&lt;/span&gt; vcpu &lt;span style="color:#f92672"&gt;=&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;hv_vcpu_exit_t&lt;/span&gt; &lt;span style="color:#f92672"&gt;*&lt;/span&gt;exit_info &lt;span style="color:#f92672"&gt;=&lt;/span&gt; NULL;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;mach_vm_address_t&lt;/span&gt; bar_addr &lt;span style="color:#f92672"&gt;=&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;mach_vm_size_t&lt;/span&gt; bar_size &lt;span style="color:#f92672"&gt;=&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;memset&lt;/span&gt;(result, &lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;, &lt;span style="color:#66d9ef"&gt;sizeof&lt;/span&gt;(&lt;span style="color:#f92672"&gt;*&lt;/span&gt;result));
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;uint64_t&lt;/span&gt; bar_in[&lt;span style="color:#ae81ff"&gt;1&lt;/span&gt;] &lt;span style="color:#f92672"&gt;=&lt;/span&gt; { bar };
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;uint64_t&lt;/span&gt; bar_out[&lt;span style="color:#ae81ff"&gt;3&lt;/span&gt;] &lt;span style="color:#f92672"&gt;=&lt;/span&gt; {&lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;};
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;uint32_t&lt;/span&gt; bar_cnt &lt;span style="color:#f92672"&gt;=&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;3&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;if&lt;/span&gt; (&lt;span style="color:#a6e22e"&gt;IOConnectCallMethod&lt;/span&gt;(connection, SELECTOR_GET_BAR_INFO,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; bar_in, &lt;span style="color:#ae81ff"&gt;1&lt;/span&gt;, NULL, &lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; bar_out, &lt;span style="color:#f92672"&gt;&amp;amp;&lt;/span&gt;bar_cnt, NULL, NULL) &lt;span style="color:#f92672"&gt;!=&lt;/span&gt; KERN_SUCCESS) {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;FAIL&lt;/span&gt;(VFIO_GUEST_BAR_TOUCH_MAP_BAR_FAILED);
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; result&lt;span style="color:#f92672"&gt;-&amp;gt;&lt;/span&gt;barType &lt;span style="color:#f92672"&gt;=&lt;/span&gt; (&lt;span style="color:#66d9ef"&gt;uint8_t&lt;/span&gt;)bar_out[&lt;span style="color:#ae81ff"&gt;2&lt;/span&gt;];
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; IOOptionBits opts &lt;span style="color:#f92672"&gt;=&lt;/span&gt; kIOMapAnywhere;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;if&lt;/span&gt; (result&lt;span style="color:#f92672"&gt;-&amp;gt;&lt;/span&gt;barType &lt;span style="color:#f92672"&gt;&amp;amp;&lt;/span&gt; PREFETCHABLE_MASK)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; opts &lt;span style="color:#f92672"&gt;|=&lt;/span&gt; kIOMapWriteCombineCache;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;if&lt;/span&gt; (&lt;span style="color:#a6e22e"&gt;IOConnectMapMemory64&lt;/span&gt;(connection, &lt;span style="color:#ae81ff"&gt;1u&lt;/span&gt; &lt;span style="color:#f92672"&gt;+&lt;/span&gt; bar, mach_task_self_,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#f92672"&gt;&amp;amp;&lt;/span&gt;bar_addr, &lt;span style="color:#f92672"&gt;&amp;amp;&lt;/span&gt;bar_size, opts) &lt;span style="color:#f92672"&gt;!=&lt;/span&gt; KERN_SUCCESS) {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;FAIL&lt;/span&gt;(VFIO_GUEST_BAR_TOUCH_MAP_BAR_FAILED);
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; bar_mapped &lt;span style="color:#f92672"&gt;=&lt;/span&gt; true;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; result&lt;span style="color:#f92672"&gt;-&amp;gt;&lt;/span&gt;hostBARAddress &lt;span style="color:#f92672"&gt;=&lt;/span&gt; (&lt;span style="color:#66d9ef"&gt;uint64_t&lt;/span&gt;)bar_addr;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; result&lt;span style="color:#f92672"&gt;-&amp;gt;&lt;/span&gt;mappedSize &lt;span style="color:#f92672"&gt;=&lt;/span&gt; (&lt;span style="color:#66d9ef"&gt;uint64_t&lt;/span&gt;)bar_size;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;if&lt;/span&gt; (page &lt;span style="color:#f92672"&gt;==&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;0&lt;/span&gt; &lt;span style="color:#f92672"&gt;||&lt;/span&gt; (bar_size &lt;span style="color:#f92672"&gt;%&lt;/span&gt; page) &lt;span style="color:#f92672"&gt;!=&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;FAIL&lt;/span&gt;(VFIO_GUEST_BAR_TOUCH_MAP_BAR_FAILED);
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;if&lt;/span&gt; (&lt;span style="color:#a6e22e"&gt;posix_memalign&lt;/span&gt;(&lt;span style="color:#f92672"&gt;&amp;amp;&lt;/span&gt;code, page, page))
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;FAIL&lt;/span&gt;(VFIO_GUEST_BAR_TOUCH_ALLOC_FAILED);
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;memset&lt;/span&gt;(code, &lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;, page);
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;memcpy&lt;/span&gt;(code, prog_read, &lt;span style="color:#66d9ef"&gt;sizeof&lt;/span&gt;(prog_read));
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;sys_icache_invalidate&lt;/span&gt;(code, page);
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;HV_CHECK&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;hv_vm_create&lt;/span&gt;(NULL), VFIO_GUEST_BAR_TOUCH_HV_VM_CREATE_FAILED);
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; vm_up &lt;span style="color:#f92672"&gt;=&lt;/span&gt; true;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;HV_CHECK&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;hv_vm_map&lt;/span&gt;(code, GUEST_CODE_IPA, page,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; HV_MEMORY_READ &lt;span style="color:#f92672"&gt;|&lt;/span&gt; HV_MEMORY_WRITE &lt;span style="color:#f92672"&gt;|&lt;/span&gt; HV_MEMORY_EXEC),
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; VFIO_GUEST_BAR_TOUCH_HV_MAP_CODE_FAILED);
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;HV_CHECK&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;hv_vm_map&lt;/span&gt;((&lt;span style="color:#66d9ef"&gt;void&lt;/span&gt; &lt;span style="color:#f92672"&gt;*&lt;/span&gt;)(&lt;span style="color:#66d9ef"&gt;uintptr_t&lt;/span&gt;)bar_addr, GUEST_BAR_IPA,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; (&lt;span style="color:#66d9ef"&gt;size_t&lt;/span&gt;)bar_size,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; HV_MEMORY_READ &lt;span style="color:#f92672"&gt;|&lt;/span&gt; HV_MEMORY_WRITE),
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; VFIO_GUEST_BAR_TOUCH_HV_MAP_BAR_FAILED);
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;HV_CHECK&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;hv_vcpu_create&lt;/span&gt;(&lt;span style="color:#f92672"&gt;&amp;amp;&lt;/span&gt;vcpu, &lt;span style="color:#f92672"&gt;&amp;amp;&lt;/span&gt;exit_info, NULL),
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; VFIO_GUEST_BAR_TOUCH_HV_VCPU_CREATE_FAILED);
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; vcpu_up &lt;span style="color:#f92672"&gt;=&lt;/span&gt; true;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;hv_vcpu_set_reg&lt;/span&gt;(vcpu, HV_REG_PC, GUEST_CODE_IPA);
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;hv_vcpu_set_reg&lt;/span&gt;(vcpu, HV_REG_X0, GUEST_BAR_IPA);
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;hv_vcpu_set_reg&lt;/span&gt;(vcpu, HV_REG_CPSR, &lt;span style="color:#ae81ff"&gt;0x3c5&lt;/span&gt;);
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;HV_CHECK&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;hv_vcpu_run&lt;/span&gt;(vcpu), VFIO_GUEST_BAR_TOUCH_HV_VCPU_RUN_FAILED);
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; result&lt;span style="color:#f92672"&gt;-&amp;gt;&lt;/span&gt;exitReason &lt;span style="color:#f92672"&gt;=&lt;/span&gt; exit_info&lt;span style="color:#f92672"&gt;-&amp;gt;&lt;/span&gt;reason;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; result&lt;span style="color:#f92672"&gt;-&amp;gt;&lt;/span&gt;syndrome &lt;span style="color:#f92672"&gt;=&lt;/span&gt; exit_info&lt;span style="color:#f92672"&gt;-&amp;gt;&lt;/span&gt;exception.syndrome;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; result&lt;span style="color:#f92672"&gt;-&amp;gt;&lt;/span&gt;virtualAddress &lt;span style="color:#f92672"&gt;=&lt;/span&gt; exit_info&lt;span style="color:#f92672"&gt;-&amp;gt;&lt;/span&gt;exception.virtual_address;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; result&lt;span style="color:#f92672"&gt;-&amp;gt;&lt;/span&gt;physicalAddress &lt;span style="color:#f92672"&gt;=&lt;/span&gt; exit_info&lt;span style="color:#f92672"&gt;-&amp;gt;&lt;/span&gt;exception.physical_address;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;hv_vcpu_get_reg&lt;/span&gt;(vcpu, HV_REG_PC, &lt;span style="color:#f92672"&gt;&amp;amp;&lt;/span&gt;result&lt;span style="color:#f92672"&gt;-&amp;gt;&lt;/span&gt;programCounter);
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;hv_vcpu_get_reg&lt;/span&gt;(vcpu, HV_REG_X0, &lt;span style="color:#f92672"&gt;&amp;amp;&lt;/span&gt;result&lt;span style="color:#f92672"&gt;-&amp;gt;&lt;/span&gt;x0);
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;hv_vcpu_get_reg&lt;/span&gt;(vcpu, HV_REG_X1, &lt;span style="color:#f92672"&gt;&amp;amp;&lt;/span&gt;result&lt;span style="color:#f92672"&gt;-&amp;gt;&lt;/span&gt;x1);
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;cleanup:
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;if&lt;/span&gt; (vcpu_up) &lt;span style="color:#a6e22e"&gt;hv_vcpu_destroy&lt;/span&gt;(vcpu);
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;if&lt;/span&gt; (vm_up) &lt;span style="color:#a6e22e"&gt;hv_vm_destroy&lt;/span&gt;();
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;if&lt;/span&gt; (bar_mapped) &lt;span style="color:#a6e22e"&gt;IOConnectUnmapMemory64&lt;/span&gt;(connection, &lt;span style="color:#ae81ff"&gt;1u&lt;/span&gt; &lt;span style="color:#f92672"&gt;+&lt;/span&gt; bar,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; mach_task_self_, bar_addr);
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;free&lt;/span&gt;(code);
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;return&lt;/span&gt; result&lt;span style="color:#f92672"&gt;-&amp;gt;&lt;/span&gt;status;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In ~100 lines of C, you can spin up a VM, map the device BAR into the guest, and run code that touches it. I&amp;rsquo;m still not sure whether that was more frustrating or encouraging, but that version ran without crashing, while QEMU was still panicking the host. I was stumped for a while. Was it the guest page tables? Was the BAR colliding with guest RAM in some subtle way? Why were the dogs and cats still howling?&lt;/p&gt;
&lt;p&gt;Eventually, in my desperation, I asked an AI coding assistant to compare my sample and QEMU. It immediately flagged that my mapping used &lt;code&gt;HV_MEMORY_READ | HV_MEMORY_WRITE&lt;/code&gt; while QEMU used &lt;code&gt;HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC&lt;/code&gt;. Alas, bested again by AI. Not even silly blog projects are safe anymore (mostly kidding).&lt;/p&gt;
&lt;p&gt;The workaround in QEMU was a small change:&lt;/p&gt;
&lt;div class="code-scroll"&gt;&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-diff" data-lang="diff"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;diff --git a/accel/hvf/hvf-all.c b/accel/hvf/hvf-all.c
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;index 5f357c6d19..76cec4655b 100644
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;--- a/accel/hvf/hvf-all.c
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;+++ b/accel/hvf/hvf-all.c
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;@@ -114,7 +114,15 @@ static void hvf_set_phys_mem(MemoryRegionSection *section, bool add)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; return;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;- flags = HV_MEMORY_READ | HV_MEMORY_EXEC | (writable ? HV_MEMORY_WRITE : 0);
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;+ flags = HV_MEMORY_READ | (writable ? HV_MEMORY_WRITE : 0);
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;+ /*
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;+ * Leave RAM-device/MMIO mappings RW-only: on macOS, accessing them through
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;+ * executable HVF mappings can panic the host kernel. Ordinary guest RAM
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;+ * still needs EXEC.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;+ */
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;+ if (!memory_region_is_ram_device(area)) {
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;+ flags |= HV_MEMORY_EXEC;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;+ }
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; mem = memory_region_get_ram_ptr(area) + section-&amp;gt;offset_within_region;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; trace_hvf_vm_map(gpa, size, mem, flags,
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It works, but it&amp;rsquo;s not perfect. ARM has several flavors of device memory (the &lt;a href="https://developer.arm.com/documentation/102376/0200/Device-memory/Sub-types-of-Device"&gt;Device-nGnRnE/nGnRE/nGRE/GRE&lt;/a&gt; family), with different rules for whether writes can be gathered, reordered, or acknowledged early. It&amp;rsquo;s roughly analogous to x86 &lt;a href="https://en.wikipedia.org/wiki/Write_combining"&gt;write-combining&lt;/a&gt; on the most permissive end.&lt;/p&gt;
&lt;p&gt;On real hardware, the prefetchable BARs on my GPU are supposed to allow gathering, which makes them several times faster for bulk writes than BAR0. But &lt;code&gt;hv_vm_map()&lt;/code&gt; has no flags to configure this, so every device mapping ends up as the strictest nGnRnE. There&amp;rsquo;s nothing we can do about it, and it&amp;rsquo;s still ~30x faster than trapping every access, but it makes writing the BAR ~10x slower than it would be normally.&lt;/p&gt;
&lt;h2 id="dma"&gt;DMA&lt;/h2&gt;
&lt;p&gt;This was by far the sketchiest part of the project. To start, let&amp;rsquo;s go over how this works on a PC running Linux with VM PCI-passthrough, and then we&amp;rsquo;ll compare to our challenge on macOS.&lt;/p&gt;
&lt;p&gt;When there&amp;rsquo;s just a computer talking to a device (no VM involved), they can talk together directly. The PC will tell the device &amp;ldquo;hey I got that DMA buffer ready at this memory address&amp;rdquo; and the device can access that memory directly (AKA DMA). Easy.&lt;/p&gt;
&lt;p&gt;When a VM is involved, it&amp;rsquo;s more complicated. Guest physical addresses don&amp;rsquo;t correspond to host physical addresses. The VM&amp;rsquo;s RAM is just some chunk of host memory allocated wherever it was available. So if the guest tells the device &amp;ldquo;DMA into 0x00000000,&amp;rdquo; the device will happily scribble over whatever actually lives there on the host. The simplest fix has two parts:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Virtual_memory#Pinned_pages"&gt;Pin&lt;/a&gt; all guest memory so it can&amp;rsquo;t be paged out while the device might touch it.&lt;/li&gt;
&lt;li&gt;Put a hardware unit called the &lt;a href="https://en.wikipedia.org/wiki/Input%E2%80%93output_memory_management_unit"&gt;IOMMU&lt;/a&gt; between the device and host memory. The hypervisor programs it with the guest → host translations, and every DMA request from the device gets remapped on the fly.&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="pci-pt-mem-diagram--egpu-mac-gaming pci-pt-mem-diagram--egpu-mac-gaming-iommu"&gt;
&lt;div class="diagram"&gt;
&lt;div class="arrow"&gt;
&lt;div class="label"&gt;DMA Request:&lt;br&gt;Read/Write&lt;br&gt;0x00000000&lt;/div&gt;
&lt;div class="line"&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="box iommu-box"&gt;
&lt;div class="box-title"&gt;IOMMU&lt;/div&gt;
&lt;div class="iommu-table-title"&gt;Translation Table&lt;/div&gt;
&lt;div class="translation-table"&gt;
&lt;div class="row"&gt;
&lt;div class="header-cell"&gt;Guest Address Range&lt;/div&gt;
&lt;div class="header-cell"&gt;Host Physical Range&lt;/div&gt;
&lt;/div&gt;
&lt;div class="row"&gt;
&lt;div class="data-cell"&gt;0x00000000 -&lt;br&gt;0x80000000&lt;/div&gt;
&lt;div class="data-cell arrow-cell"&gt;0x20000000 -&lt;br&gt;0xA0000000&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="arrow green"&gt;
&lt;div class="label"&gt;Translated to:&lt;br&gt;0x20000000&lt;/div&gt;
&lt;div class="line"&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="box small-host-box"&gt;
&lt;div class="box-title"&gt;Host Physical Memory&lt;/div&gt;
&lt;div class="inner-box"&gt;0x20000000 -&lt;br&gt;0xA0000000&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This is a blunt solution. The guest doesn&amp;rsquo;t have to do anything special, but the host has to keep all guest RAM pinned. There are more advanced approaches (like a virtual IOMMU), but they&amp;rsquo;re outside the scope of this post.&lt;/p&gt;
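&lt;p&gt;To make the translation step concrete, here&amp;rsquo;s a minimal sketch of the lookup an IOMMU performs, using the single range from the diagram. The struct, table, and function names are made up for illustration; real IOMMUs walk multi-level page tables in hardware rather than scanning an array.&lt;/p&gt;

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One contiguous guest range and the host range it maps to. */
struct iommu_entry {
    uint64_t guest_base;
    uint64_t host_base;
    uint64_t size;
};

/* Hypothetical table matching the diagram: 2GB of guest RAM placed at
 * host physical address 0x20000000. */
static const struct iommu_entry iommu_table[] = {
    { 0x00000000ull, 0x20000000ull, 0x80000000ull },
};

/*
 * Translate a device-visible (guest physical) address into a host physical
 * address. Returns false when no entry covers the address, which is where a
 * real IOMMU would fault the access instead of letting it through.
 */
static bool iommu_translate(uint64_t guest_addr, uint64_t *host_addr)
{
    assert(host_addr != NULL);
    for (size_t i = 0; i < sizeof(iommu_table) / sizeof(iommu_table[0]); i++) {
        const struct iommu_entry *e = &iommu_table[i];
        if (guest_addr >= e->guest_base &&
            guest_addr - e->guest_base < e->size) {
            *host_addr = e->host_base + (guest_addr - e->guest_base);
            return true;
        }
    }
    return false;
}
```

&lt;p&gt;With this table, a DMA to guest address 0x00000000 lands at host 0x20000000, and anything at or above 0x80000000 is rejected.&lt;/p&gt;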
&lt;h3 id="dma-on-apple-silicon"&gt;DMA on Apple Silicon&lt;/h3&gt;
&lt;p&gt;On Apple Silicon, there&amp;rsquo;s a hardware unit called &lt;abbr title="Device Address Resolution Table"&gt;DART&lt;/abbr&gt; that&amp;rsquo;s more or less equivalent to an IOMMU. It&amp;rsquo;s not specific to VMs; it also acts as a security boundary, preventing devices from accessing arbitrary host memory. Ideally we&amp;rsquo;d just use DART the same way Linux uses the IOMMU in the simple case above.&lt;/p&gt;
&lt;p&gt;Unfortunately, DART (at least via PCIDriverKit for Thunderbolt devices) has some hard constraints:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;~1.5GB mapping limit.&lt;/strong&gt; A VM with 1.5GB of RAM can technically boot, but CUDA runs out of memory and any modern game needs 8–16GB.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;~64k mapping cap.&lt;/strong&gt; With many small DMA buffers the mapping table fills up.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No address or alignment control.&lt;/strong&gt; PCIDriverKit assigns mapped addresses for you. You can&amp;rsquo;t pick them, or specify alignment constraints. This rules out a virtual IOMMU, which requires the guest to choose its own DMA addresses.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The 1.5GB ceiling was the biggest initial blocker. I tried a few workarounds: pre-mapping ranges where I guessed DMAs might land (obviously didn&amp;rsquo;t work), and using a &lt;a href="https://lwn.net/Articles/841916/"&gt;&lt;code&gt;restricted-dma-pool&lt;/code&gt;&lt;/a&gt; device tree attribute to force all DMA through a pre-allocated region. The restricted pool approach actually works for simpler devices, but GPU drivers are too weird to fit into that model. (If you&amp;rsquo;re curious about the specifics, there&amp;rsquo;s &lt;a href="https://lore.kernel.org/qemu-devel/DHNKXKNWFI3S.P78ERGSFU3RQ@gmail.com/"&gt;a qemu-devel thread&lt;/a&gt; where I discuss it.)&lt;/p&gt;
&lt;h3 id="apple-dma-pci"&gt;apple-dma-pci&lt;/h3&gt;
&lt;p&gt;I ended up designing a new virtual PCI device in QEMU called &lt;code&gt;apple-dma-pci&lt;/code&gt;. It gets inserted into the VM alongside the passed-through GPU, and a companion kernel driver in the guest intercepts the NVIDIA driver&amp;rsquo;s DMA mapping calls. The solution is, frankly, a very upsetting hack, but it works.&lt;/p&gt;
&lt;p&gt;Because mappings are created on demand per DMA request and torn down when the buffer is freed, far less memory has to be mapped at once. Only the working set of &lt;em&gt;live&lt;/em&gt; DMA buffers has to fit in our 1.5GB limit, as opposed to the entirety of guest memory.&lt;/p&gt;
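&lt;p&gt;As a rough illustration of why this helps, here&amp;rsquo;s a sketch of the bookkeeping involved. It only tracks how much is mapped right now against the two DART limits; the limit values are the approximate figures from above, and the names are hypothetical rather than anything from QEMU or the driver.&lt;/p&gt;

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Approximate limits quoted in the post; the exact values are assumptions. */
#define DART_MAX_BYTES    (1536ull * 1024 * 1024)  /* ~1.5GB */
#define DART_MAX_MAPPINGS 65536u                   /* ~64k   */

struct dart_budget {
    uint64_t live_bytes;     /* bytes mapped right now */
    uint32_t live_mappings;  /* mappings that exist right now */
};

/* Account for a new mapping; refuse it if either limit would be exceeded. */
static bool budget_map(struct dart_budget *b, uint64_t size)
{
    assert(b != NULL && size > 0);
    if (b->live_mappings >= DART_MAX_MAPPINGS ||
        size > DART_MAX_BYTES - b->live_bytes)
        return false;
    b->live_bytes += size;
    b->live_mappings++;
    return true;
}

/* Release a mapping when the guest frees or unmaps its DMA buffer. */
static void budget_unmap(struct dart_budget *b, uint64_t size)
{
    assert(b != NULL && b->live_mappings > 0 && b->live_bytes >= size);
    b->live_bytes -= size;
    b->live_mappings--;
}
```

&lt;p&gt;The guest can have far more than 1.5GB of RAM as long as the buffers mapped at the same moment stay under both limits; unmapping gives the budget back.&lt;/p&gt;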
&lt;p&gt;The guest driver is loaded early (via an &lt;code&gt;/etc/modules-load.d/&lt;/code&gt; config), so it can find the GPU at probe time and swap in custom DMA ops before the NVIDIA driver touches it:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;static&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;struct&lt;/span&gt; dma_map_ops apple_dma_ops &lt;span style="color:#f92672"&gt;=&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; .map_page &lt;span style="color:#f92672"&gt;=&lt;/span&gt; apple_dma_map_page,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; .unmap_page &lt;span style="color:#f92672"&gt;=&lt;/span&gt; apple_dma_unmap_page,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; .map_sg &lt;span style="color:#f92672"&gt;=&lt;/span&gt; apple_dma_map_sg,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; .unmap_sg &lt;span style="color:#f92672"&gt;=&lt;/span&gt; apple_dma_unmap_sg,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; .alloc &lt;span style="color:#f92672"&gt;=&lt;/span&gt; apple_dma_alloc,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; .free &lt;span style="color:#f92672"&gt;=&lt;/span&gt; apple_dma_free,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;};
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;static&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;int&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;apple_dma_pci_probe&lt;/span&gt;(&lt;span style="color:#66d9ef"&gt;struct&lt;/span&gt; pci_dev &lt;span style="color:#f92672"&gt;*&lt;/span&gt;pdev,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;const&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;struct&lt;/span&gt; pci_device_id &lt;span style="color:#f92672"&gt;*&lt;/span&gt;id)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;{
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;struct&lt;/span&gt; pci_dev &lt;span style="color:#f92672"&gt;*&lt;/span&gt;gpu &lt;span style="color:#f92672"&gt;=&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;pci_get_device&lt;/span&gt;(PCI_VENDOR_NVIDIA,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; PCI_ANY_ID, NULL);
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;if&lt;/span&gt; (&lt;span style="color:#f92672"&gt;!&lt;/span&gt;gpu)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;return&lt;/span&gt; &lt;span style="color:#f92672"&gt;-&lt;/span&gt;ENODEV;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;set_dma_ops&lt;/span&gt;(&lt;span style="color:#f92672"&gt;&amp;amp;&lt;/span&gt;gpu&lt;span style="color:#f92672"&gt;-&amp;gt;&lt;/span&gt;dev, &lt;span style="color:#f92672"&gt;&amp;amp;&lt;/span&gt;apple_dma_ops);
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;pci_dev_put&lt;/span&gt;(gpu);
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;return&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Each of the custom ops is a thin wrapper. It marshals its arguments into a small request, writes it into memory for the &lt;code&gt;apple-dma-pci&lt;/code&gt; virtual BAR, kicks a doorbell register, and waits for a reply. On the host side, QEMU picks up the request, hands it off to the PCIDriverKit driver, which performs the actual DART mapping, and the resulting DMA address gets written back to guest memory. The NVIDIA driver shouldn&amp;rsquo;t know the difference.&lt;/p&gt;
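&lt;p&gt;The request/doorbell round trip can be sketched roughly like this. The request layout, opcodes, and function names are invented for the example (the real &lt;code&gt;apple-dma-pci&lt;/code&gt; ABI may look quite different), and the host is replaced with a stub that hands out addresses from a counter so the whole thing runs in one process.&lt;/p&gt;

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical request layout, not the real apple-dma-pci ABI. */
enum { DMA_OP_MAP = 1, DMA_OP_UNMAP = 2 };
enum { DMA_STATUS_PENDING = 0, DMA_STATUS_OK = 1, DMA_STATUS_ERROR = 2 };

struct dma_request {
    uint32_t op;          /* DMA_OP_* */
    uint32_t status;      /* DMA_STATUS_*, written by the host */
    uint64_t guest_addr;  /* guest physical address of the buffer */
    uint64_t size;        /* length in bytes */
    uint64_t dma_addr;    /* reply: device-visible address */
};

/* Stand-in for the host side (QEMU plus the DriverKit driver). A real host
 * would ask DART for a mapping; this stub just bumps a counter. */
static uint64_t fake_next_dma_addr = 0x100000000ull;

static void host_handle_doorbell(struct dma_request *req)
{
    if (req->op == DMA_OP_MAP && req->size > 0) {
        req->dma_addr = fake_next_dma_addr;
        /* Round each mapping up to a 16KB granule. */
        fake_next_dma_addr += (req->size + 0x3fffull) & ~0x3fffull;
        req->status = DMA_STATUS_OK;
    } else if (req->op == DMA_OP_UNMAP) {
        req->status = DMA_STATUS_OK;
    } else {
        req->status = DMA_STATUS_ERROR;
    }
}

/* Guest side: marshal the arguments, ring the doorbell, read the reply.
 * Returns the DMA address, or 0 on failure. */
static uint64_t guest_dma_map(struct dma_request *shared,
                              uint64_t guest_addr, uint64_t size)
{
    assert(shared != NULL);
    shared->op = DMA_OP_MAP;
    shared->status = DMA_STATUS_PENDING;
    shared->guest_addr = guest_addr;
    shared->size = size;
    shared->dma_addr = 0;

    /* In the VM this would be an MMIO write that traps out to QEMU. */
    host_handle_doorbell(shared);

    return shared->status == DMA_STATUS_OK ? shared->dma_addr : 0;
}
```

&lt;p&gt;The important property is that the address comes back from the host: the guest never gets to choose it, which is exactly the constraint PCIDriverKit imposes.&lt;/p&gt;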
&lt;div class="pci-pt-mem-diagram--egpu-mac-gaming dma-flow-diagram"&gt;
&lt;div class="dma-flow-stack"&gt;
&lt;div class="dma-flow-section dma-flow-guest"&gt;
&lt;div class="dma-flow-section-label"&gt;Linux VM (Guest)&lt;/div&gt;
&lt;div class="dma-flow-step"&gt;
&lt;div class="dma-flow-box dma-flow-nvidia"&gt;NVIDIA Driver&lt;/div&gt;
&lt;div class="dma-flow-arrow-down"&gt;&lt;div class="dma-flow-arrow-shaft"&gt;&lt;/div&gt;&lt;div class="dma-flow-arrow-label"&gt;dma_map_page()&lt;/div&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="dma-flow-step"&gt;
&lt;div class="dma-flow-box dma-flow-custom"&gt;apple_dma_ops handler&lt;/div&gt;
&lt;div class="dma-flow-arrow-down"&gt;&lt;div class="dma-flow-arrow-shaft"&gt;&lt;/div&gt;&lt;div class="dma-flow-arrow-label"&gt;virtual PCI BAR write&lt;/div&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="dma-flow-step"&gt;
&lt;div class="dma-flow-box dma-flow-vdev"&gt;apple-dma-pci &lt;span class="dma-flow-tag"&gt;virtual device&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="dma-flow-boundary"&gt;&lt;span&gt;VM exit&lt;/span&gt;&lt;/div&gt;
&lt;div class="dma-flow-section dma-flow-host"&gt;
&lt;div class="dma-flow-section-label"&gt;macOS Host&lt;/div&gt;
&lt;div class="dma-flow-step"&gt;
&lt;div class="dma-flow-box dma-flow-qemu"&gt;QEMU&lt;/div&gt;
&lt;div class="dma-flow-arrow-down"&gt;&lt;div class="dma-flow-arrow-shaft"&gt;&lt;/div&gt;&lt;div class="dma-flow-arrow-label"&gt;IOConnectCallMethod()&lt;/div&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="dma-flow-step"&gt;
&lt;div class="dma-flow-box dma-flow-driverkit"&gt;PCIDriverKit driver&lt;/div&gt;
&lt;div class="dma-flow-arrow-down"&gt;&lt;div class="dma-flow-arrow-shaft"&gt;&lt;/div&gt;&lt;div class="dma-flow-arrow-label"&gt;IODMACommand&lt;/div&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="dma-flow-step"&gt;
&lt;div class="dma-flow-box dma-flow-dart"&gt;DART &lt;span class="dma-flow-tag"&gt;hardware&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="dma-flow-return"&gt;
&lt;div class="dma-flow-return-line"&gt;&lt;/div&gt;
&lt;div class="dma-flow-return-label"&gt;mapped address returned back up the stack&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;h3 id="nvidia-alignment-quirk"&gt;NVIDIA alignment quirk&lt;/h3&gt;
&lt;p&gt;It didn&amp;rsquo;t immediately work well, though. While the driver initially loaded and initialized the card, I was greeted with this fun kernel log message as soon as I attempted to run a CUDA workload:&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;[ 456.194883] NVRM: nvAssertOkFailedNoLog: Assertion failed: The offset passed is not valid [NV_ERR_INVALID_OFFSET] (0x00000037) returned from pRmApi-&amp;gt;Alloc(pRmApi, device-&amp;gt;session-&amp;gt;handle, isSystemMemory ? device-&amp;gt;handle : device-&amp;gt;subhandle, &amp;amp;physHandle, isSystemMemory ? NV01_MEMORY_SYSTEM : NV01_MEMORY_LOCAL_USER, &amp;amp;memAllocParams, sizeof(memAllocParams)) @ nv_gpu_ops.c:4972
[ 456.371282] NVRM: GPU0 nvAssertFailedNoLog: Assertion failed: 0 == (physAddr &amp;amp; (RM_PAGE_SIZE_HUGE - 1)) @ mem_mgr_gm107.c:1312
[ 456.372020] NVRM: nvAssertOkFailedNoLog: Assertion failed: The offset passed is not valid [NV_ERR_INVALID_OFFSET] (0x00000037) returned from pRmApi-&amp;gt;Alloc(pRmApi, device-&amp;gt;session-&amp;gt;handle, isSystemMemory ? device-&amp;gt;handle : device-&amp;gt;subhandle, &amp;amp;physHandle, isSystemMemory ? NV01_MEMORY_SYSTEM : NV01_MEMORY_LOCAL_USER, &amp;amp;memAllocParams, sizeof(memAllocParams)) @ nv_gpu_ops.c:4972
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;If you recall the earlier DMA section, we noted that we can&amp;rsquo;t control the alignment of DMA-mapped buffers. Bummer. At this point, I dug into the driver to try to see if there was something simple we could patch.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s the relevant segment:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;if&lt;/span&gt; (type &lt;span style="color:#f92672"&gt;==&lt;/span&gt; UVM_RM_MEM_TYPE_SYS) {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;if&lt;/span&gt; (size &lt;span style="color:#f92672"&gt;&amp;gt;=&lt;/span&gt; UVM_PAGE_SIZE_2M)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; alloc_info.pageSize &lt;span style="color:#f92672"&gt;=&lt;/span&gt; UVM_PAGE_SIZE_2M;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;else&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;if&lt;/span&gt; (size &lt;span style="color:#f92672"&gt;&amp;gt;=&lt;/span&gt; UVM_PAGE_SIZE_64K)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; alloc_info.pageSize &lt;span style="color:#f92672"&gt;=&lt;/span&gt; UVM_PAGE_SIZE_64K;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; status &lt;span style="color:#f92672"&gt;=&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;uvm_rm_locked_call&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;nvUvmInterfaceMemoryAllocSys&lt;/span&gt;(gpu&lt;span style="color:#f92672"&gt;-&amp;gt;&lt;/span&gt;rm_address_space, size, &lt;span style="color:#f92672"&gt;&amp;amp;&lt;/span&gt;gpu_va, &lt;span style="color:#f92672"&gt;&amp;amp;&lt;/span&gt;alloc_info));
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#75715e"&gt;// TODO: Bug 5042223
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;if&lt;/span&gt; (status &lt;span style="color:#f92672"&gt;==&lt;/span&gt; NV_ERR_NO_MEMORY &lt;span style="color:#f92672"&gt;&amp;amp;&amp;amp;&lt;/span&gt; size &lt;span style="color:#f92672"&gt;&amp;gt;=&lt;/span&gt; UVM_PAGE_SIZE_64K) {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#a6e22e"&gt;UVM_ERR_PRINT&lt;/span&gt;(&lt;span style="color:#e6db74"&gt;&amp;#34;nvUvmInterfaceMemoryAllocSys alloc failed with big page size, retry with default page size&lt;/span&gt;&lt;span style="color:#ae81ff"&gt;\n&lt;/span&gt;&lt;span style="color:#e6db74"&gt;&amp;#34;&lt;/span&gt;);
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; alloc_info.pageSize &lt;span style="color:#f92672"&gt;=&lt;/span&gt; UVM_PAGE_SIZE_DEFAULT;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; status &lt;span style="color:#f92672"&gt;=&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;uvm_rm_locked_call&lt;/span&gt;(&lt;span style="color:#a6e22e"&gt;nvUvmInterfaceMemoryAllocSys&lt;/span&gt;(gpu&lt;span style="color:#f92672"&gt;-&amp;gt;&lt;/span&gt;rm_address_space, size, &lt;span style="color:#f92672"&gt;&amp;amp;&lt;/span&gt;gpu_va, &lt;span style="color:#f92672"&gt;&amp;amp;&lt;/span&gt;alloc_info));
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;By adding more debug logging in the module, I could see it was a 16MB allocation of type &lt;code&gt;UVM_RM_MEM_TYPE_SYS&lt;/code&gt;. So, it uses the largest (2MB) page size. Ironically, there is already a workaround here when the allocation fails. It&amp;rsquo;ll just try again with a smaller page size. It just doesn&amp;rsquo;t take into account the different error code for alignment (&lt;code&gt;NV_ERR_INVALID_OFFSET&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;So&amp;hellip; if we expand the status check to include this new error, it will fall back to reallocating the memory, and everything works!&lt;/p&gt;
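&lt;p&gt;Here&amp;rsquo;s a self-contained sketch of that widened fallback, with a fake allocator standing in for the driver&amp;rsquo;s RM call. The status codes and page-size constants are local stand-ins, not the driver&amp;rsquo;s real definitions, and 4KB is an assumed value for the default page size.&lt;/p&gt;

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the driver's status codes (not the real definitions). */
enum alloc_status { ALLOC_OK, ALLOC_NO_MEMORY, ALLOC_INVALID_OFFSET };

#define PAGE_4K   0x1000ull    /* assumed default page size */
#define PAGE_64K  0x10000ull
#define PAGE_2M   0x200000ull

/* Fake allocator: the backing address is fixed by the caller, the way DART
 * picks it for us, and a page size is rejected unless the address happens
 * to be aligned to it. */
static enum alloc_status fake_alloc_sys(uint64_t phys_addr, uint64_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    if ((phys_addr & (page_size - 1)) != 0)
        return ALLOC_INVALID_OFFSET;
    return ALLOC_OK;
}

/* Pick the biggest page size that fits, then retry with the default page
 * size on either failure code. Reports the page size that was finally used. */
static enum alloc_status alloc_with_fallback(uint64_t phys_addr, uint64_t size,
                                             uint64_t *used_page_size)
{
    uint64_t page_size = PAGE_4K;
    enum alloc_status status;

    assert(used_page_size != NULL);
    if (size >= PAGE_2M)
        page_size = PAGE_2M;
    else if (size >= PAGE_64K)
        page_size = PAGE_64K;

    status = fake_alloc_sys(phys_addr, page_size);
    if ((status == ALLOC_NO_MEMORY || status == ALLOC_INVALID_OFFSET) &&
        size >= PAGE_64K) {
        page_size = PAGE_4K;
        status = fake_alloc_sys(phys_addr, page_size);
    }
    *used_page_size = page_size;
    return status;
}
```

&lt;p&gt;A 16MB buffer at a 16KB-aligned address fails the 2MB attempt with the alignment error, then succeeds on the retry with small pages.&lt;/p&gt;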
&lt;p&gt;Ok, but having to patch the driver is really annoying. Whenever a new driver release comes out, you have to patch it again. I guess I could maintain a fork of the driver, and automatically generate a parallel set of packages that my system could run from, but I was wondering if there was a way to make the existing mainline driver work.&lt;/p&gt;
&lt;p&gt;What if we could hot-patch the call to &lt;code&gt;nvUvmInterfaceMemoryAllocSys()&lt;/code&gt;, and make it always use a smaller page size?&lt;/p&gt;
&lt;p&gt;The Linux kernel&amp;rsquo;s &lt;a href="https://www.kernel.org/doc/html/latest/trace/kprobes.html"&gt;kprobes&lt;/a&gt; feature lets you attach a handler to the entry of any kernel function. Inside the handler, you get the CPU registers at that point, which means you can inspect or &lt;em&gt;mutate&lt;/em&gt; the function&amp;rsquo;s arguments before it runs. A simplified version of the patch looks like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;/* Must match the driver&amp;#39;s internal layout */&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;struct&lt;/span&gt; uvm_alloc_info {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#75715e"&gt;/* ... other fields ... */&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; u64 pageSize;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#75715e"&gt;/* ... other fields ... */&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;};
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;/* nvUvmInterfaceMemoryAllocSys(address_space, size, gpu_va_out, alloc_info)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt; * arm64 ABI: args are in x0, x1, x2, x3 — alloc_info is the 4th arg. */&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;static&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;int&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;pre_alloc_sys&lt;/span&gt;(&lt;span style="color:#66d9ef"&gt;struct&lt;/span&gt; kprobe &lt;span style="color:#f92672"&gt;*&lt;/span&gt;p, &lt;span style="color:#66d9ef"&gt;struct&lt;/span&gt; pt_regs &lt;span style="color:#f92672"&gt;*&lt;/span&gt;regs)
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;{
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;struct&lt;/span&gt; uvm_alloc_info &lt;span style="color:#f92672"&gt;*&lt;/span&gt;info &lt;span style="color:#f92672"&gt;=&lt;/span&gt; (&lt;span style="color:#66d9ef"&gt;void&lt;/span&gt; &lt;span style="color:#f92672"&gt;*&lt;/span&gt;)regs&lt;span style="color:#f92672"&gt;-&amp;gt;&lt;/span&gt;regs[&lt;span style="color:#ae81ff"&gt;3&lt;/span&gt;];
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#75715e"&gt;/* Force smaller pages so DART&amp;#39;s 16KB mappings always satisfy alignment. */&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; info&lt;span style="color:#f92672"&gt;-&amp;gt;&lt;/span&gt;pageSize &lt;span style="color:#f92672"&gt;=&lt;/span&gt; UVM_PAGE_SIZE_DEFAULT;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; &lt;span style="color:#66d9ef"&gt;return&lt;/span&gt; &lt;span style="color:#ae81ff"&gt;0&lt;/span&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;static&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;struct&lt;/span&gt; kprobe kp &lt;span style="color:#f92672"&gt;=&lt;/span&gt; {
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; .symbol_name &lt;span style="color:#f92672"&gt;=&lt;/span&gt; &lt;span style="color:#e6db74"&gt;&amp;#34;nvUvmInterfaceMemoryAllocSys&amp;#34;&lt;/span&gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; .pre_handler &lt;span style="color:#f92672"&gt;=&lt;/span&gt; pre_alloc_sys,
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;};
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;static&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;int&lt;/span&gt; __init &lt;span style="color:#a6e22e"&gt;patch_init&lt;/span&gt;(&lt;span style="color:#66d9ef"&gt;void&lt;/span&gt;) { &lt;span style="color:#66d9ef"&gt;return&lt;/span&gt; &lt;span style="color:#a6e22e"&gt;register_kprobe&lt;/span&gt;(&lt;span style="color:#f92672"&gt;&amp;amp;&lt;/span&gt;kp); }
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#66d9ef"&gt;static&lt;/span&gt; &lt;span style="color:#66d9ef"&gt;void&lt;/span&gt; __exit &lt;span style="color:#a6e22e"&gt;patch_exit&lt;/span&gt;(&lt;span style="color:#66d9ef"&gt;void&lt;/span&gt;) { &lt;span style="color:#a6e22e"&gt;unregister_kprobe&lt;/span&gt;(&lt;span style="color:#f92672"&gt;&amp;amp;&lt;/span&gt;kp); }
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Loading this kernel module would turn every call to &lt;code&gt;nvUvmInterfaceMemoryAllocSys()&lt;/code&gt; into one that requests the default (small) page size, with no changes to the NVIDIA driver itself.&lt;/p&gt;
&lt;p&gt;I suspect other drivers may need similar fixes for various types of &amp;ldquo;quirks,&amp;rdquo; and since we already load a driver for &lt;code&gt;apple-dma-pci&lt;/code&gt;, I added a quirks section of code that applies patches like this one automatically.&lt;/p&gt;
&lt;p&gt;At this point, the NVIDIA driver works for simple workloads.&lt;/p&gt;
&lt;h3 id="coalescing-mappings"&gt;Coalescing mappings&lt;/h3&gt;
&lt;p&gt;Great! Now the driver works and basic workloads seem to run. Unfortunately, if you really crank up the settings in games, the driver starts creating tons of tiny mappings that blow past the total ~64k mapping count limit. Recall that we mentioned &lt;a href="#dma-on-apple-silicon"&gt;earlier&lt;/a&gt; that this could be a problem. Initially I thought the driver patch we were applying might be making it worse, but it turns out that&amp;rsquo;s not the case. We hit the limit either way.&lt;/p&gt;
&lt;p&gt;I had gotten this far, and I wasn&amp;rsquo;t ready to just give up. After logging all the mappings and looking at the distribution, it seemed like 90%+ of the mappings were 4kB. They weren&amp;rsquo;t often contiguous, so it wasn&amp;rsquo;t obvious that we could join them to reduce overall map counts, but they did appear in clusters.&lt;/p&gt;
&lt;p&gt;I came up with a scheme to look at memory in terms of larger clusters. We&amp;rsquo;d divide all guest memory into fixed-size regions (say, 256kB). When the driver asks to map a 4kB buffer, we map the whole 256kB cluster it falls inside, and any later allocations that land in the same cluster reuse that mapping.&lt;/p&gt;
&lt;div class="pci-pt-mem-diagram--egpu-mac-gaming map-coalesce-diagram"&gt;
&lt;div class="diagram"&gt;
&lt;div class="memstack"&gt;
&lt;div class="memstack-title"&gt;Individual mappings&lt;/div&gt;
&lt;div class="memstack-subtitle"&gt;one 4kB map per buffer&lt;/div&gt;
&lt;div class="memstack-strip memstack-strip--fine"&gt;
&lt;div class="memstack-group"&gt;
&lt;div class="memstack-row mapped"&gt;&lt;span&gt;4kB&lt;/span&gt;&lt;/div&gt;
&lt;div class="memstack-row"&gt;&lt;/div&gt;
&lt;div class="memstack-row mapped"&gt;&lt;span&gt;4kB&lt;/span&gt;&lt;/div&gt;
&lt;div class="memstack-row mapped"&gt;&lt;span&gt;4kB&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="memstack-group"&gt;
&lt;div class="memstack-row"&gt;&lt;/div&gt;
&lt;div class="memstack-row"&gt;&lt;/div&gt;
&lt;div class="memstack-row"&gt;&lt;/div&gt;
&lt;div class="memstack-row"&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="memstack-group"&gt;
&lt;div class="memstack-row mapped"&gt;&lt;span&gt;4kB&lt;/span&gt;&lt;/div&gt;
&lt;div class="memstack-row mapped"&gt;&lt;span&gt;4kB&lt;/span&gt;&lt;/div&gt;
&lt;div class="memstack-row mapped"&gt;&lt;span&gt;4kB&lt;/span&gt;&lt;/div&gt;
&lt;div class="memstack-row mapped"&gt;&lt;span&gt;4kB&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="memstack-group"&gt;
&lt;div class="memstack-row mapped"&gt;&lt;span&gt;4kB&lt;/span&gt;&lt;/div&gt;
&lt;div class="memstack-row"&gt;&lt;/div&gt;
&lt;div class="memstack-row mapped"&gt;&lt;span&gt;4kB&lt;/span&gt;&lt;/div&gt;
&lt;div class="memstack-row mapped"&gt;&lt;span&gt;4kB&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="memstack-count"&gt;&lt;strong&gt;10&lt;/strong&gt; mappings&lt;/div&gt;
&lt;/div&gt;
&lt;div class="arrow green map-coalesce-arrow"&gt;
&lt;div class="label"&gt;Group into&lt;br&gt;256kB clusters&lt;/div&gt;
&lt;div class="line"&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class="memstack"&gt;
&lt;div class="memstack-title"&gt;Clustered mappings&lt;/div&gt;
&lt;div class="memstack-subtitle"&gt;one 256kB map shared by many buffers&lt;/div&gt;
&lt;div class="memstack-strip memstack-strip--coarse"&gt;
&lt;div class="memstack-cluster mapped"&gt;
&lt;span class="memstack-cluster-size"&gt;256kB&lt;/span&gt;
&lt;span class="memstack-cluster-dots"&gt;&lt;i&gt;&lt;/i&gt;&lt;i&gt;&lt;/i&gt;&lt;i&gt;&lt;/i&gt;&lt;/span&gt;
&lt;/div&gt;
&lt;div class="memstack-cluster"&gt;&lt;/div&gt;
&lt;div class="memstack-cluster mapped"&gt;
&lt;span class="memstack-cluster-size"&gt;256kB&lt;/span&gt;
&lt;span class="memstack-cluster-dots"&gt;&lt;i&gt;&lt;/i&gt;&lt;i&gt;&lt;/i&gt;&lt;i&gt;&lt;/i&gt;&lt;i&gt;&lt;/i&gt;&lt;/span&gt;
&lt;/div&gt;
&lt;div class="memstack-cluster mapped"&gt;
&lt;span class="memstack-cluster-size"&gt;256kB&lt;/span&gt;
&lt;span class="memstack-cluster-dots"&gt;&lt;i&gt;&lt;/i&gt;&lt;i&gt;&lt;/i&gt;&lt;i&gt;&lt;/i&gt;&lt;/span&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class="memstack-count"&gt;&lt;strong&gt;3&lt;/strong&gt; mappings&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;There are a few edge cases worth mentioning. A buffer that straddles two clusters doesn&amp;rsquo;t fit neatly into this scheme, so in that case we just fall back to mapping it directly, outside the cluster allocator. In practice that only happens for a small fraction of allocations. Cluster lifetime is handled via reference counting. When the last live 4kB buffer inside a cluster is freed, the cluster is automatically unmapped, so we&amp;rsquo;re not holding on to DART mappings any longer than we need to.&lt;/p&gt;
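&lt;p&gt;To make the bookkeeping concrete, here is a minimal user-space sketch of the clustering idea. It is not the actual driver code: the fixed-size table, the linear scan, and the names &lt;code&gt;map_buffer&lt;/code&gt;, &lt;code&gt;unmap_buffer&lt;/code&gt;, and &lt;code&gt;host_maps&lt;/code&gt; (a counter standing in for real DART mappings) are all assumptions made for illustration.&lt;/p&gt;

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CLUSTER_SIZE (256u * 1024u)  /* cluster size used as the example in the text */
#define MAX_CLUSTERS 1024            /* illustrative table size (assumption) */

struct cluster {
    uint64_t base;      /* cluster-aligned guest address */
    unsigned refcount;  /* live buffers inside this cluster; 0 means the slot is free */
};

static struct cluster table[MAX_CLUSTERS];
static unsigned host_maps;  /* hypothetical stand-in for DART mappings currently held */

/* Map one buffer. Returns 0 on success, -1 if the cluster table is full. */
static int map_buffer(uint64_t addr, uint64_t len)
{
    assert(len > 0);
    uint64_t first = addr / CLUSTER_SIZE;
    uint64_t last = (addr + len - 1) / CLUSTER_SIZE;
    struct cluster *free_slot = NULL;

    if (first != last) {      /* straddles two clusters: map it directly */
        host_maps++;
        return 0;
    }
    for (int i = 0; i < MAX_CLUSTERS; i++) {
        if (table[i].refcount && table[i].base == first * CLUSTER_SIZE) {
            table[i].refcount++;   /* cluster already mapped: reuse it */
            return 0;
        }
        if (!table[i].refcount && !free_slot)
            free_slot = &table[i];
    }
    if (!free_slot)
        return -1;
    free_slot->base = first * CLUSTER_SIZE;
    free_slot->refcount = 1;
    host_maps++;              /* one mapping covers the whole cluster */
    return 0;
}

/* Release one buffer previously passed to map_buffer(). */
static void unmap_buffer(uint64_t addr, uint64_t len)
{
    assert(len > 0);
    uint64_t first = addr / CLUSTER_SIZE;
    uint64_t last = (addr + len - 1) / CLUSTER_SIZE;

    if (first != last) {      /* was mapped directly */
        host_maps--;
        return;
    }
    for (int i = 0; i < MAX_CLUSTERS; i++) {
        if (table[i].refcount && table[i].base == first * CLUSTER_SIZE) {
            if (--table[i].refcount == 0)
                host_maps--;  /* last live buffer freed: drop the cluster mapping */
            return;
        }
    }
}
```

&lt;p&gt;Two 4kB buffers that land in the same 256kB cluster cost one mapping instead of two, and that mapping goes away only when the second buffer is freed. A real implementation would presumably use a hash table or tree instead of a linear scan.&lt;/p&gt;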
&lt;p&gt;The scheme does end up mapping slightly more total memory than strictly necessary. Each cluster has some &amp;ldquo;slop&amp;rdquo; bytes that aren&amp;rsquo;t actually backing any live buffer. Thankfully, in practice, we still stay under the ~1.5GB mapping ceiling for workloads I tested. What really mattered was cutting the mapping &lt;em&gt;count&lt;/em&gt; back under 64k. In the workloads I tried, the number of live mappings dropped by roughly 4x, which was more than enough headroom to run demanding games at the highest settings.&lt;/p&gt;
&lt;p&gt;This isn&amp;rsquo;t a performance-sensitive path. Mappings mostly happen while games are loading, not while they&amp;rsquo;re running, but the change does speed things up anyway. The clustering happens purely in the guest driver, so we now call into the host less often to create DART mappings.&lt;/p&gt;
&lt;h2 id="other-performance-concerns"&gt;Other performance concerns&lt;/h2&gt;
&lt;p&gt;Now, with a relatively stable base for the PCI passthrough part of the project, I pivoted to looking at overall VM performance. Progress is not always a straight line, but for the sake of artistic license (yes, this blog is my art), I&amp;rsquo;m just going to include the things that weren&amp;rsquo;t dead ends.&lt;/p&gt;
&lt;h3 id="scheduling"&gt;Scheduling&lt;/h3&gt;
&lt;p&gt;When testing things out, I noticed performance was very inconsistent, and often pretty slow. Benchmark scores were swinging wildly, sometimes coming in 50% slower at random. I am a little embarrassed by how long it took me to figure this one out, but it turns out that QEMU doesn&amp;rsquo;t set any priority for the vCPU threads. I&amp;rsquo;m not sure if the macOS scheduler was deprioritizing them because they look like background threads, or if it was just bad luck, but the VM simply wasn&amp;rsquo;t getting much CPU time during a lot of my benchmark runs.&lt;/p&gt;
&lt;p&gt;There are a bunch of different APIs for manipulating process and thread priority in macOS. I tried a few and settled on these, which seemed to meaningfully help. I patched them into QEMU so that each vCPU thread raises its own priority when it starts:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-patch" data-lang="patch"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;diff --git a/accel/hvf/hvf-accel-ops.c b/accel/hvf/hvf-accel-ops.c
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;index b74a5779c3..5b6337cd17 100644
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;--- a/accel/hvf/hvf-accel-ops.c
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;+++ b/accel/hvf/hvf-accel-ops.c
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;@@ -162,6 +162,18 @@ static void *hvf_cpu_thread_fn(void *arg)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; rcu_register_thread();
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;+ pthread_set_qos_class_self_np(QOS_CLASS_USER_INTERACTIVE, 0);
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;+
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;+ {
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;+ struct sched_param param;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;+ param.sched_priority = sched_get_priority_max(SCHED_RR);
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;+ pthread_setschedparam(pthread_self(), SCHED_RR, &amp;amp;param);
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;+ }
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;+
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; bql_lock();
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; qemu_thread_get_self(cpu-&amp;gt;thread);
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="total-store-ordering"&gt;Total store ordering&lt;/h3&gt;
&lt;p&gt;The VM I&amp;rsquo;ve been describing is arm64 Linux, but almost no shipping games run on ARM natively. They&amp;rsquo;re x86-64 Windows binaries. To actually play them, you stack &lt;a href="https://en.wikipedia.org/wiki/Proton_(software)"&gt;Proton&lt;/a&gt; (Valve&amp;rsquo;s WINE fork, to implement the Windows API) on top of &lt;a href="https://fex-emu.com"&gt;FEX-Emu&lt;/a&gt; (JIT x86-64 to aarch64). That translation layer has one non-obvious concern: x86 and ARM have different &lt;a href="https://en.wikipedia.org/wiki/Memory_ordering"&gt;memory ordering&lt;/a&gt; rules. x86 uses &lt;strong&gt;Total Store Ordering&lt;/strong&gt; (TSO), where stores from one core become visible to others in program order. ARM&amp;rsquo;s model is much weaker and can reorder almost anything without explicit barriers. A lot of code relies on plain loads and stores between threads, so emulating x86 on ARM either needs expensive barriers everywhere, or things crash in subtle ways.&lt;/p&gt;
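&lt;p&gt;The classic way to see the difference is the message-passing pattern: one thread writes some data and then sets a flag, while another waits for the flag and reads the data. The sketch below is portable C11 with POSIX threads, not FEX code, and &lt;code&gt;run_message_passing&lt;/code&gt; is a name made up for this example. The release/acquire pair is what the source has to spell out to be correct on ARM. x86 hardware keeps the two stores in order by itself (although a C compiler is still free to reorder plain accesses).&lt;/p&gt;

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int payload;       /* plain data, written before the flag */
static atomic_int ready;  /* flag the consumer polls */

static void *producer(void *arg)
{
    (void)arg;
    payload = 42;
    /* Release: the payload store cannot be reordered after this store.
     * On ARM this becomes a store-release instruction or a barrier. */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

/* Runs one producer/consumer round and returns the value the consumer read. */
static int run_message_passing(void)
{
    pthread_t t;

    payload = 0;
    atomic_store(&ready, 0);

    int rc = pthread_create(&t, NULL, producer, NULL);
    assert(rc == 0);
    (void)rc;

    /* Acquire: pairs with the release above, so the payload is visible
     * once the flag is observed. */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;

    int seen = payload;
    pthread_join(t, NULL);
    return seen;
}
```

&lt;p&gt;Roughly speaking, an emulator translating an x86 binary cannot tell which plain loads and stores play the role of the flag, so without hardware TSO it has to treat every one of them conservatively.&lt;/p&gt;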
&lt;p&gt;Apple Silicon has an escape hatch: a per-thread hardware &lt;strong&gt;TSO mode&lt;/strong&gt;. Flip bit 1 of &lt;a href="https://developer.arm.com/documentation/111107/2026-03/AArch64-Registers/ACTLR-EL1--Auxiliary-Control-Register--EL1-"&gt;&lt;code&gt;ACTLR_EL1&lt;/code&gt;&lt;/a&gt; on a vCPU, and every load and store on that thread follows x86-style ordering, no barriers required. Apple exposes this through &lt;code&gt;Hypervisor.framework&lt;/code&gt; on macOS 15+. On the Linux side, Hector Martin &lt;a href="https://lore.kernel.org/lkml/20240411-tso-v1-0-754f11abfbff@marcan.st/"&gt;posted a Linux kernel patch series&lt;/a&gt; in 2024 that adds &lt;code&gt;PR_SET_MEM_MODEL&lt;/code&gt; prctls to flip the bit per-thread. Upstream never merged it, but Asahi&amp;rsquo;s kernel carries it.&lt;/p&gt;
&lt;p&gt;QEMU doesn&amp;rsquo;t expose TSO either, but &lt;a href="https://mac.getutm.app"&gt;UTM&lt;/a&gt; (a fork of QEMU with a macOS GUI) carries a patch that enables it for the whole vCPU. I have cribbed it for my testing:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"&gt;&lt;code class="language-diff" data-lang="diff"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;diff --cc target/arm/hvf/hvf.c
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;index f5d7221845,ebae2886d3..0000000000
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#f92672"&gt;--- a/target/arm/hvf/hvf.c
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;+++ b/target/arm/hvf/hvf.c
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#75715e"&gt;@@@ -1358,8 -1352,21 +1358,23 @@@ int hvf_arch_init_vcpu(CPUState *cpu
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; arm_cpu-&amp;gt;isar.idregs[ID_AA64MMFR0_EL1_IDX]);
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; assert_hvf_ok(ret);
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;+ if (__builtin_available(macOS 15, *)) {
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;+ uint64_t actlr;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;+ ret = hv_vcpu_get_sys_reg(cpu-&amp;gt;accel-&amp;gt;fd,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;+ HV_SYS_REG_ACTLR_EL1, &amp;amp;actlr);
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;+ assert_hvf_ok(ret);
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;+ actlr |= (1 &amp;lt;&amp;lt; 1); /* Apple TSO enable bit */
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;+ ret = hv_vcpu_set_sys_reg(cpu-&amp;gt;accel-&amp;gt;fd,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;+ HV_SYS_REG_ACTLR_EL1, actlr);
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;+ assert_hvf_ok(ret);
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;+ } else {
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;+ error_report(&amp;#34;HVF TSO mode requires macOS 15 or later&amp;#34;);
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;+ return -ENOTSUP;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;&lt;span style="color:#a6e22e"&gt;+ }
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; aarch64_add_sme_properties(OBJECT(cpu));
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt; return 0;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;With that bit set, I can disable the FEX-Emu software TSO emulation. Toggling the software TSO emulation has clear side effects. If you turn it off without the hardware bit, Geekbench 6 crashes partway through.&lt;/p&gt;
&lt;p&gt;I eventually realized FEX can detect the TSO flag automatically if you run a kernel with Hector&amp;rsquo;s patch (the default in Asahi Linux). His patch adds a &lt;code&gt;prctl()&lt;/code&gt; that lets a thread query the CPU&amp;rsquo;s TSO state and set the flag for itself. The kernel then saves and restores the bit on context switches, per-thread.&lt;/p&gt;
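&lt;p&gt;For reference, the userspace side of that interface looks roughly like the sketch below. Treat it as an illustration under assumptions: the constant values are an assumption based on the posted patch series, they are not in upstream kernel headers, and they should be checked against the kernel you actually run. &lt;code&gt;try_set_mem_model&lt;/code&gt; is a hypothetical helper, not a function from FEX.&lt;/p&gt;

```c
#include <assert.h>
#include <sys/prctl.h>

/* Assumed values from the unmerged patch series; not in upstream headers. */
#ifndef PR_GET_MEM_MODEL
#define PR_GET_MEM_MODEL 0x6d4d444c
#endif
#ifndef PR_SET_MEM_MODEL
#define PR_SET_MEM_MODEL 0x4d4d444c
#define PR_SET_MEM_MODEL_DEFAULT 0
#define PR_SET_MEM_MODEL_TSO 1
#endif

/*
 * Ask the kernel to switch the calling thread to the given memory model.
 * Returns 1 if the kernel accepted the request and reports that model back,
 * 0 otherwise (for example on a kernel without the patch, where the unknown
 * prctl option simply fails).
 */
static int try_set_mem_model(unsigned long model)
{
    assert(model == PR_SET_MEM_MODEL_DEFAULT || model == PR_SET_MEM_MODEL_TSO);

    if (prctl(PR_SET_MEM_MODEL, model, 0UL, 0UL, 0UL) != 0)
        return 0;
    return prctl(PR_GET_MEM_MODEL, 0UL, 0UL, 0UL, 0UL) == (int)model;
}
```

&lt;p&gt;A caller would try &lt;code&gt;try_set_mem_model(PR_SET_MEM_MODEL_TSO)&lt;/code&gt; once at startup and keep software TSO emulation enabled whenever it returns 0.&lt;/p&gt;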
&lt;p&gt;I wanted to avoid requiring a custom kernel. Implementing this purely with kprobes would mean patching code that runs on every context switch, which felt risky. But detecting the TSO bit from inside the kernel is easy, so I added another &amp;ldquo;quirk&amp;rdquo; to the &lt;code&gt;apple-dma-pci&lt;/code&gt; driver, similar to the one earlier. When it sees the TSO bit set (QEMU sets it before the VM boots), it installs a kprobe on &lt;code&gt;prctl()&lt;/code&gt; that mimics the getter/setter from Hector&amp;rsquo;s patch. The setter is really a no-op, but FEX can still query the flag and switch itself to the faster mode that skips TSO emulation.&lt;/p&gt;
&lt;p&gt;Below, you can see the performance implications of the different layers of virtualization/emulation:&lt;/p&gt;
&lt;div class="chart-container" style="position: relative; height: 400px; width: 100%; margin-top: var(--content-gap); margin-bottom: var(--content-gap);"&gt;
 &lt;canvas id="vm-benchmark"&gt;&lt;/canvas&gt;
&lt;/div&gt;

&lt;script&gt;
(function() {
 const init = () =&gt; ChartManager.createChart('vm-benchmark', 
{
 type: 'bar',
 data: {
 labels: [
 'Host (macOS, M4)',
 'VM (arm64 native, TSO off)',
 'VM (arm64 native, TSO on)',
 'VM (FEX x86-64, TSO on)',
 'VM (FEX x86-64, TSO emulated)'
 ],
 datasets: [{
 label: 'Single-Core',
 data: [3630, 3479, 3341, 1729, 1491],
 }, {
 label: 'Multi-Core',
 data: [14339, 13089, 12601, 7107, 6272],
 }]
 },
 options: {
 indexAxis: 'y',
 plugins: {
 title: {
 display: true,
 text: 'Geekbench 6 CPU Scores: Host vs VM vs FEX'
 }
 },
 scales: {
 y: {
 ticks: {
 font: function(context) {
 const label = String(context.tick.label || '');
 const emphasized = ['Host (macOS, M4)', 'VM (FEX x86-64, TSO on)'];
 const isBold = emphasized.includes(label);
 return { weight: isBold ? 700 : 400, size: isBold ? 13 : 11 };
 }
 }
 }
 }
 }
}
);
 if (document.readyState === 'loading') {
 document.addEventListener('DOMContentLoaded', init);
 } else {
 init();
 }
})();
&lt;/script&gt;

&lt;p&gt;For the rest of the post, we&amp;rsquo;ll focus on two numbers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Host performance&lt;/strong&gt; - This tells you the best performance you can expect from the CPU, without any emulation layers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Guest performance under FEX with CPU TSO on&lt;/strong&gt; - This tells you the best performance you can expect from the CPU when we try to play x86 games. Trouble is, you can see this is around &lt;em&gt;50% less&lt;/em&gt; than native host performance.&lt;/li&gt;
&lt;/ul&gt;
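&lt;p&gt;The &amp;ldquo;around 50%&amp;rdquo; figure comes straight from the chart. As a quick sanity check, this small helper (&lt;code&gt;percent_slower&lt;/code&gt; is a hypothetical name, not from any library) computes how far a score falls below a baseline.&lt;/p&gt;

```c
#include <assert.h>

/* Percentage by which `score` falls short of `baseline`. */
static double percent_slower(double baseline, double score)
{
    assert(baseline > 0.0);
    return (baseline - score) / baseline * 100.0;
}
```

&lt;p&gt;Plugging in the Geekbench 6 scores above, &lt;code&gt;percent_slower(3630, 1729)&lt;/code&gt; is about 52% for single-core and &lt;code&gt;percent_slower(14339, 7107)&lt;/code&gt; is about 50% for multi-core.&lt;/p&gt;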
&lt;h1 id="benchmarks"&gt;Benchmarks&lt;/h1&gt;
&lt;p&gt;If you&amp;rsquo;re a software engineer, hopefully you enjoyed the story so far. If you&amp;rsquo;re not, then this is where you probably want to start.&lt;/p&gt;
&lt;p&gt;We&amp;rsquo;re going to answer the question you probably never asked: If you hook an RTX 5090 to the MacBook Air, does it make games perform better? What about AI inference?&lt;/p&gt;
&lt;h2 id="cpu-comparison"&gt;CPU comparison&lt;/h2&gt;
&lt;p&gt;Gaming performance isn&amp;rsquo;t just about how fast your GPU can run; it also depends on your CPU. Let&amp;rsquo;s start with a CPU benchmark comparison, just to level-set what the machine is capable of.&lt;/p&gt;
&lt;div class="chart-container" style="position: relative; height: 450px; width: 100%; margin-top: var(--content-gap); margin-bottom: var(--content-gap);"&gt;
 &lt;canvas id="cpu-comparison-chart"&gt;&lt;/canvas&gt;
&lt;/div&gt;

&lt;script&gt;
(function() {
 const init = () =&gt; ChartManager.createChart('cpu-comparison-chart', 
{
 type: 'bar',
 data: {
 labels: [
 'M5 Max MacBook Pro 14" (macOS)',
 'M4 Max Mac Studio (macOS)',
 'M4 MacBook Air (macOS)',
 'M5 Max MacBook Pro 14" (Linux VM + FEX)',
 'Older gaming PC (i5-12600K)',
 'M4 Max Mac Studio (Linux VM + FEX)',
 '2020 MacBook Pro (i7-1068NG7)',
 'M4 MacBook Air (Linux VM + FEX)'
 ],
 datasets: [{
 label: 'Single-Core',
 data: [4309, 4114, 3630, 1934, 2443, 1892, 1851, 1734],
 }, {
 label: 'Multi-Core',
 data: [29708, 26756, 14339, 13330, 12719, 11679, 5696, 4713],
 }]
 },
 options: {
 indexAxis: 'y',
 plugins: {
 title: {
 display: true,
 text: 'Geekbench 6 CPU Scores'
 }
 },
 scales: {
 y: {
 ticks: {
 font: function(context) {
 const label = String(context.tick.label || '');
 const emphasized = ['Linux VM + FEX'];
 const isBold = emphasized.some(e =&gt; label.includes(e));
 return { weight: isBold ? 700 : 400, size: isBold ? 13 : 11 };
 }
 }
 }
 }
 }
}
);
 if (document.readyState === 'loading') {
 document.addEventListener('DOMContentLoaded', init);
 } else {
 init();
 }
})();
&lt;/script&gt;

&lt;p&gt;If you look at the chart, you can see Apple Silicon Macs are extremely fast. They&amp;rsquo;re some of the fastest consumer CPUs you can buy, and they do it at a fraction of the power consumption of comparable Intel chips.&lt;/p&gt;
&lt;p&gt;The catch is that emulating x86 through FEX costs us roughly half of that performance right off the bat. The M4 MacBook Air is hit especially hard: under FEX it scores lower than a 2020 Intel-based MacBook Pro. The M5 Max MacBook Pro, on the other hand, is in decent shape. Its multi-core score under FEX lands pretty close to my older gaming PC, which is no slouch.&lt;/p&gt;
&lt;h2 id="cyberpunk-2077"&gt;Cyberpunk 2077&lt;/h2&gt;
&lt;p&gt;&lt;img alt="Screenshot of Cyberpunk 2077" loading="lazy" src="https://scottjg.com/posts/2026-05-05-egpu-mac-gaming/cyberpunk.jpg"&gt;&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s run Cyberpunk 2077 across six setups:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;M4 Air running native macOS&lt;/li&gt;
&lt;li&gt;M4 Air with the eGPU through an ARM Linux VM, using FEX to emulate x86&lt;/li&gt;
&lt;li&gt;2020 Intel MacBook Pro running Linux natively (no VM, no FEX) with the same eGPU&lt;/li&gt;
&lt;li&gt;M5 Max MacBook Pro running native macOS&lt;/li&gt;
&lt;li&gt;M5 Max with the eGPU through an ARM Linux VM&lt;/li&gt;
&lt;li&gt;Older gaming PC (i5-12600K) with the same RTX 5090 plugged in over native PCIe&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;ldquo;+ Framegen&amp;rdquo; uses &lt;a href="https://en.wikipedia.org/wiki/Deep_Learning_Super_Sampling"&gt;DLSS&lt;/a&gt; 4x for the eGPU/native PCIe configurations and FSR 2x for the native macOS configurations (DLSS is NVIDIA-only, and isn&amp;rsquo;t available on macOS).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3 id="720p-low"&gt;720p Low&lt;/h3&gt;
&lt;div class="chart-container" style="position: relative; height: 280px; width: 100%; margin-top: var(--content-gap); margin-bottom: var(--content-gap);"&gt;
 &lt;canvas id="cyberpunk-720p-chart"&gt;&lt;/canvas&gt;
&lt;/div&gt;

&lt;script&gt;
(function() {
 const init = () =&gt; ChartManager.createChart('cyberpunk-720p-chart', 
{
 type: 'bar',
 data: {
 labels: [
 'M4 Air (native)',
 'M4 Air + eGPU',
 '2020 MBP + eGPU (i7-1068NG7)',
 'M5 Max (native)',
 'M5 Max + eGPU',
 'Gaming PC (i5-12600K, native PCIe)'
 ],
 datasets: [
 { label: '720p Low', data: [60.96, 48.83, 45.08, 200.37, 73.34, 179.67] }
 ]
 },
 options: {
 indexAxis: 'y',
 plugins: {
 title: {
 display: true,
 text: 'Cyberpunk 2077 — 720p Low Average FPS'
 }
 },
 scales: {
 x: {
 beginAtZero: true,
 title: {
 display: true,
 text: 'Average FPS'
 }
 },
 y: {
 ticks: {
 font: function(context) {
 const label = String(context.tick.label || '');
 const emphasized = ['M4 Air + eGPU', 'M5 Max + eGPU'];
 const isBold = emphasized.some(e =&gt; label.includes(e));
 return { weight: isBold ? 700 : 400, size: isBold ? 13 : 11 };
 }
 }
 }
 }
 }
}
);
 if (document.readyState === 'loading') {
 document.addEventListener('DOMContentLoaded', init);
 } else {
 init();
 }
})();
&lt;/script&gt;

&lt;p&gt;At 720p Low, the GPU barely breaks a sweat, so the CPU and emulation/virtualization overhead dominate.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;M5 Max on native macOS wins with 200fps.&lt;/strong&gt; With such a powerful CPU, and unshackled from emulation/virtualization layers, perf is excellent.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The gaming PC follows at 180fps.&lt;/strong&gt; With this little load on the GPU, the desktop CPU isn&amp;rsquo;t able to pull as far ahead as you&amp;rsquo;d expect.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;M4 Air native (61fps) actually beats M4 Air + eGPU (49fps).&lt;/strong&gt; At 720p Low the integrated GPU has enough headroom, and we save the cost of FEX + virtualization. This is the only resolution where going native pays off on the M4 Air.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;M5 Max + eGPU (73fps) is far slower than M5 Max native.&lt;/strong&gt; Hamstrung by emulation/virtualization overhead, the same machine runs about 2.7x slower.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="1080p"&gt;1080p&lt;/h3&gt;
&lt;div class="chart-container" style="position: relative; height: 500px; width: 100%; margin-top: var(--content-gap); margin-bottom: var(--content-gap);"&gt;
 &lt;canvas id="cyberpunk-1080p-chart"&gt;&lt;/canvas&gt;
&lt;/div&gt;

&lt;script&gt;
(function() {
 const init = () =&gt; ChartManager.createChart('cyberpunk-1080p-chart', 
{
 type: 'bar',
 data: {
 labels: [
 'M4 Air (native)',
 'M4 Air + eGPU',
 '2020 MBP + eGPU (i7-1068NG7)',
 'M5 Max (native)',
 'M5 Max + eGPU',
 'Gaming PC (i5-12600K, native PCIe)'
 ],
 datasets: [
 { label: 'RT Ultra', data: [7.16, 30.32, 30.71, 58.78, null, 104.97] },
 { label: 'High', data: [19.30, 41.60, 39.02, 131.05, 67.55, 161.47] },
 { label: 'RT Ultra + Framegen', data: [12.55, 118.82, 118.63, 87.76, 173.50, 406.95] }
 ]
 },
 options: {
 indexAxis: 'y',
 plugins: {
 title: {
 display: true,
 text: 'Cyberpunk 2077 — 1080p Average FPS'
 }
 },
 scales: {
 x: {
 beginAtZero: true,
 title: {
 display: true,
 text: 'Average FPS'
 }
 },
 y: {
 ticks: {
 font: function(context) {
 const label = String(context.tick.label || '');
 const emphasized = ['M4 Air + eGPU', 'M5 Max + eGPU'];
 const isBold = emphasized.some(e =&gt; label.includes(e));
 return { weight: isBold ? 700 : 400, size: isBold ? 13 : 11 };
 }
 }
 }
 }
 }
}
);
 if (document.readyState === 'loading') {
 document.addEventListener('DOMContentLoaded', init);
 } else {
 init();
 }
})();
&lt;/script&gt;

&lt;p&gt;At 1080p the GPU starts to matter more, but the integrated GPUs can still hang on at the lighter settings.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The gaming PC is the clear winner across the board.&lt;/strong&gt; 161fps at High, 105fps at RT Ultra, and a wild 407fps with DLSS framegen.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;M5 Max on native macOS crushes 1080p High at 131fps without any eGPU at all.&lt;/strong&gt; It also handles RT Ultra at 59fps, or 88fps with FSR framegen. If you&amp;rsquo;re OK playing at 1080p, you&amp;rsquo;re already set without any of this project.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;At 1080p High, M5 Max native (131fps) is faster than M5 Max + eGPU (68fps).&lt;/strong&gt; Same story as 720p. The integrated GPU is plenty for this resolution, and the emulation/virtualization overhead hurts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;M4 Air on native macOS is still unplayable at RT Ultra (7fps).&lt;/strong&gt; FSR framegen only lifts it to 13fps, which is no better in practice.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;M4 Air + eGPU brings the M4 Air to 30fps at RT Ultra and 119fps with framegen.&lt;/strong&gt; Big improvement over native.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The 2020 MBP + eGPU performs comparably to the M4 Air + eGPU.&lt;/strong&gt; Same GPU, similar effective CPU once you account for FEX.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="4k"&gt;4K&lt;/h3&gt;
&lt;div class="chart-container" style="position: relative; height: 500px; width: 100%; margin-top: var(--content-gap); margin-bottom: var(--content-gap);"&gt;
 &lt;canvas id="cyberpunk-4k-chart"&gt;&lt;/canvas&gt;
&lt;/div&gt;

&lt;script&gt;
(function() {
 const init = () =&gt; ChartManager.createChart('cyberpunk-4k-chart', 
{
 type: 'bar',
 data: {
 labels: [
 'M4 Air (native)',
 'M4 Air + eGPU',
 '2020 MBP + eGPU (i7-1068NG7)',
 'M5 Max (native)',
 'M5 Max + eGPU',
 'Gaming PC (i5-12600K, native PCIe)'
 ],
 datasets: [
 { label: 'RT Ultra', data: [3.42, 27.15, 30.59, 25.05, 47.03, 99.88] },
 { label: 'High', data: [6.34, 44.43, 39.70, 41.45, 62.57, 149.20] },
 { label: 'RT Ultra + Framegen', data: [6.23, 111.00, 119.41, 42.26, 145.38, 282.52] }
 ]
 },
 options: {
 indexAxis: 'y',
 plugins: {
 title: {
 display: true,
 text: 'Cyberpunk 2077 — 4K Average FPS'
 }
 },
 scales: {
 x: {
 beginAtZero: true,
 title: {
 display: true,
 text: 'Average FPS'
 }
 },
 y: {
 ticks: {
 font: function(context) {
 const label = String(context.tick.label || '');
 const emphasized = ['M4 Air + eGPU', 'M5 Max + eGPU'];
 const isBold = emphasized.some(e =&gt; label.includes(e));
 return { weight: isBold ? 700 : 400, size: isBold ? 13 : 11 };
 }
 }
 }
 }
 }
}
);
 if (document.readyState === 'loading') {
 document.addEventListener('DOMContentLoaded', init);
 } else {
 init();
 }
})();
&lt;/script&gt;

&lt;p&gt;At 4K, the GPU becomes the bottleneck.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The gaming PC dominates again: 100fps at RT Ultra and 283fps with DLSS framegen.&lt;/strong&gt; It helps to have no overhead from Thunderbolt on the GPU, and no virtualization/emulation penalty.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;M5 Max native manages 25fps at 4K RT Ultra, or 42fps with FSR framegen.&lt;/strong&gt; Borderline playable. Even without an eGPU, the integrated GPU on the M5 Max can almost get you there at 4K with ray tracing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;M5 Max + eGPU is solidly playable at 47fps on 4K RT Ultra, and 145fps with framegen.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;M4 Air native is hopeless at 4K.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;M4 Air + eGPU brings the same machine to 27fps at RT Ultra and 111fps with DLSS framegen.&lt;/strong&gt; This is the most dramatic example of what attaching the GPU does. From completely unplayable to totally playable at 4K.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;2020 MBP + eGPU is again comparable to M4 Air + eGPU.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="takeaways"&gt;Takeaways&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;At 720p, native beats eGPU.&lt;/strong&gt; When the GPU isn&amp;rsquo;t the bottleneck, the FEX + virtualization overhead matters more than the GPU upgrade.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;At higher resolutions, the eGPU is essential for the M4 Air.&lt;/strong&gt; It takes Cyberpunk from &amp;ldquo;completely unplayable&amp;rdquo; (~3fps at 4K RT Ultra) to &amp;ldquo;totally playable&amp;rdquo; (27fps, or 111fps with framegen).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The M5 Max with eGPU is roughly 30–70% faster than the M4 Air with the same eGPU.&lt;/strong&gt; Since the GPU is identical, this is mostly the M5 Max&amp;rsquo;s CPU advantage showing through.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The gaming PC with native PCIe is still roughly 2–2.5x faster than the M5 Max + eGPU.&lt;/strong&gt; Thunderbolt + virtualization + x86 emulation costs you a lot compared to a native PC. There&amp;rsquo;s just no way around it.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The M5 Max&amp;rsquo;s integrated GPU is genuinely impressive.&lt;/strong&gt; Without an eGPU at all, it can hit 131fps at 1080p High and 88fps at 1080p RT Ultra (with FSR framegen).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I even tested the game on an Apple Pro Display XDR, which let me run it at 6K resolution. On the M5 Max + eGPU with framegen, even at 6K with RT Ultra you can get an average of 128fps. I played the game in this setup and it didn&amp;rsquo;t feel laggy or stuttery. Without the eGPU, even with framegen you&amp;rsquo;re talking about ~20fps.&lt;/p&gt;
&lt;h2 id="gravitymark"&gt;GravityMark&lt;/h2&gt;
&lt;p&gt;This was the only GPU graphics benchmark I could find that didn&amp;rsquo;t use much CPU, which makes it useful for isolating the cost of the Thunderbolt link, along with any Apple-specific overhead we added with the GPU passthrough solution.&lt;/p&gt;
&lt;div class="chart-container" style="position: relative; height: 400px; width: 100%; margin-top: var(--content-gap); margin-bottom: var(--content-gap);"&gt;
 &lt;canvas id="gravitymark-chart"&gt;&lt;/canvas&gt;
&lt;/div&gt;

&lt;script&gt;
(function() {
 const init = () =&gt; ChartManager.createChart('gravitymark-chart', 
{
 type: 'bar',
 data: {
 labels: [
 'Gaming PC (i5-12600K, native PCIe)',
 '2020 MBP + eGPU (i7-1068NG7)',
 'Gaming PC + eGPU (i5-12600K, Thunderbolt)',
 'M4 MacBook Air + eGPU'
 ],
 datasets: [
 { label: '1080p', data: [542.0, 439.4, 429.8, 372.2] },
 { label: '4K', data: [395.3, 331.0, 311.7, 294.1] }
 ]
 },
 options: {
 indexAxis: 'y',
 plugins: {
 title: {
 display: true,
 text: 'GravityMark — Average FPS (Vulkan, defaults)'
 }
 },
 scales: {
 x: {
 beginAtZero: true,
 title: {
 display: true,
 text: 'Average FPS'
 }
 }
 }
 }
}
);
 if (document.readyState === 'loading') {
 document.addEventListener('DOMContentLoaded', init);
 } else {
 init();
 }
})();
&lt;/script&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Going through Thunderbolt costs about 20% of GPU performance on this benchmark.&lt;/strong&gt; You can see this in the gaming PC results: the same machine with the same GPU loses ~20% of its frames when you move the card from a PCIe slot to a Thunderbolt connection. That&amp;rsquo;s the cost of tunneling PCIe over Thunderbolt.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Thunderbolt performance across devices isn&amp;rsquo;t fully predictable.&lt;/strong&gt; I was expecting the eGPU to be fastest on the gaming PC and slowest on the M4 MacBook Air. The MacBook Air is indeed the slowest, but the 2020 Intel MacBook Pro actually edges out the gaming PC over Thunderbolt by ~2% at 1080p and ~6% at 4K. Different hardware topologies can shake out in unexpected ways. I found even the Thunderbolt cable could move the number by a few percent.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The M4 MacBook Air comes out on the bottom.&lt;/strong&gt; There&amp;rsquo;s a long list of things that could be investigated to close the gap: interrupt latency, resizable BAR support, BAR access latency (the Apple &lt;code&gt;Hypervisor.framework&lt;/code&gt; issue from earlier). For now, at 1080p the M4 Air with the eGPU is about 13% slower than the same GPU running over Thunderbolt on the gaming PC, and ~31% slower than the GPU plugged in over native PCIe. At 4K the gaps narrow to ~6% and ~26%.&lt;/li&gt;
&lt;/ul&gt;
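&lt;p&gt;The percentages in the list above fall straight out of the chart data. Here is a minimal sketch of that arithmetic; the &lt;code&gt;pct_slower&lt;/code&gt; helper and the dictionary keys are names made up for illustration, and the FPS values are copied from the chart.&lt;/p&gt;

```python
# Average FPS from the GravityMark chart above: (1080p, 4K).
FPS = {
    "pc_native": (542.0, 395.3),    # gaming PC, GPU in a PCIe slot
    "mbp_2020_tb": (439.4, 331.0),  # 2020 Intel MacBook Pro over Thunderbolt
    "pc_tb": (429.8, 311.7),        # gaming PC over Thunderbolt
    "m4_air_tb": (372.2, 294.1),    # M4 MacBook Air over Thunderbolt
}


def pct_slower(slow, fast):
    """Percent of frames lost going from `fast` to `slow` (hypothetical helper)."""
    return (1.0 - slow / fast) * 100.0


for i, res in enumerate(("1080p", "4K")):
    tunnel = pct_slower(FPS["pc_tb"][i], FPS["pc_native"][i])
    air_vs_tb = pct_slower(FPS["m4_air_tb"][i], FPS["pc_tb"][i])
    air_vs_native = pct_slower(FPS["m4_air_tb"][i], FPS["pc_native"][i])
    print(f"{res}: Thunderbolt tunnel {tunnel:.1f}%, "
          f"Air vs PC over Thunderbolt {air_vs_tb:.1f}%, "
          f"Air vs native PCIe {air_vs_native:.1f}%")
```

&lt;p&gt;Running this on both resolutions also shows how much the gap depends on resolution: the M4 Air trails the gaming PC over Thunderbolt by a smaller margin at 4K than at 1080p.&lt;/p&gt;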
&lt;h2 id="shadow-of-the-tomb-raider"&gt;Shadow of the Tomb Raider&lt;/h2&gt;
&lt;p&gt;&lt;img alt="Screenshot of Shadow of the Tomb Raider benchmark" loading="lazy" src="https://scottjg.com/posts/2026-05-05-egpu-mac-gaming/tomb-raider.jpg"&gt;&lt;/p&gt;
&lt;p&gt;Another game that has a native macOS port. Same three setups: M4 Air native, M4 Air with eGPU through a Linux VM, and the gaming PC with the GPU plugged in over native PCIe.&lt;/p&gt;
&lt;div class="chart-container" style="position: relative; height: 320px; width: 100%; margin-top: var(--content-gap); margin-bottom: var(--content-gap);"&gt;
 &lt;canvas id="tomb-raider-chart"&gt;&lt;/canvas&gt;
&lt;/div&gt;

&lt;script&gt;
(function() {
 const init = () =&gt; ChartManager.createChart('tomb-raider-chart', 
{
 type: 'bar',
 data: {
 labels: [
 'M4 Air (native)',
 'M4 Air + eGPU',
 'Gaming PC (i5-12600K, native PCIe)'
 ],
 datasets: [
 { label: '4K High', data: [8, 40, 95] },
 { label: '1080p High', data: [26, 42, 108] }
 ]
 },
 options: {
 indexAxis: 'y',
 plugins: {
 title: {
 display: true,
 text: 'Shadow of the Tomb Raider — Average FPS'
 }
 },
 scales: {
 x: {
 beginAtZero: true,
 title: {
 display: true,
 text: 'Average FPS'
 }
 }
 }
 }
}
);
 if (document.readyState === 'loading') {
 document.addEventListener('DOMContentLoaded', init);
 } else {
 init();
 }
})();
&lt;/script&gt;

&lt;p&gt;The eGPU takes the M4 Air from unplayable at 4K (8fps native) to actually playable (40fps), and from borderline at 1080p (26fps) to comfortably above 30 (42fps). Interestingly, 1080p and 4K with the eGPU come out almost identical (42 vs 40fps) — the bottleneck is the CPU under FEX, not the GPU, so dropping the resolution doesn&amp;rsquo;t help. The gaming PC, with no FEX in the way, is roughly 2.5x faster at both resolutions.&lt;/p&gt;
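&lt;p&gt;One rough way to see the CPU bottleneck is to ask how much of the 1080p frame rate each setup keeps when the pixel count quadruples. This is only a sketch using the chart numbers; &lt;code&gt;fps_retained&lt;/code&gt; is a hypothetical helper, not something from the benchmark tooling.&lt;/p&gt;

```python
# Average FPS from the Shadow of the Tomb Raider chart: (1080p High, 4K High).
RESULTS = {
    "M4 Air (native)": (26, 8),
    "M4 Air + eGPU": (42, 40),
    "Gaming PC (native PCIe)": (108, 95),
}

# 4K renders four times as many pixels per frame as 1080p.
PIXEL_RATIO = (3840 * 2160) / (1920 * 1080)


def fps_retained(fps_1080p, fps_4k):
    """Fraction of the 1080p frame rate that survives the jump to 4K."""
    return fps_4k / fps_1080p


for name, (low_res, high_res) in RESULTS.items():
    kept = fps_retained(low_res, high_res)
    print(f"{name}: keeps {kept:.0%} of its 1080p frame rate "
          f"at {PIXEL_RATIO:.0f}x the pixels")
```

&lt;p&gt;A GPU-bound setup loses most of its frame rate when the resolution goes up, which is what the native M4 Air does. The eGPU setup keeps nearly all of it, which is what you would expect when the GPU is waiting on the CPU.&lt;/p&gt;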
&lt;h2 id="horizon-zero-dawn-remastered"&gt;Horizon Zero Dawn Remastered&lt;/h2&gt;
&lt;p&gt;This was the only game I tried where it bumped up into the total mappable DMA memory limit we discussed earlier. Even at 720p in the lowest settings, I couldn&amp;rsquo;t start the benchmark. It wanted more than 1.5GB of memory mapped at once.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Horizon Zero Dawn with an error overlay" loading="lazy" src="https://scottjg.com/posts/2026-05-05-egpu-mac-gaming/horizon-error.jpg"&gt;&lt;/p&gt;
&lt;p&gt;This illustrates one of the major platform issues that prevent this setup from working well.&lt;/p&gt;
&lt;h2 id="doom-2016"&gt;Doom (2016)&lt;/h2&gt;
&lt;p&gt;Doom uses an older id Tech game engine that was very reliant on OpenGL. A decade ago, if you were willing to make your game run on OpenGL as opposed to just DirectX, it was easier to port your game to other platforms, since DirectX was proprietary (Microsoft) and OpenGL was the &amp;ldquo;open&amp;rdquo; standard.&lt;/p&gt;
&lt;p&gt;Because OpenGL is not well-supported anymore on macOS, the game is completely unplayable there, even with &lt;a href="https://www.codeweavers.com/crossover"&gt;CrossOver&lt;/a&gt;. Ironically, it plays totally fine on a Windows PC, but this is a game you literally can&amp;rsquo;t play on Mac without this eGPU setup.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of Doom (2016)" loading="lazy" src="https://scottjg.com/posts/2026-05-05-egpu-mac-gaming/doom-2016.jpg"&gt;&lt;/p&gt;
&lt;p&gt;Sure enough, it works! There is no in-game benchmark, so I didn&amp;rsquo;t do an exhaustive test, but you can see the in-game performance metrics showing 49fps. The game felt pretty playable to me. The framerate varied, going as high as 60fps, but it always stayed above 30fps. As you can see on the performance overlay, CPU is the bottleneck, as usual.&lt;/p&gt;
&lt;h2 id="can-it-run-crysis"&gt;Can it run Crysis?&lt;/h2&gt;
&lt;p&gt;&lt;img alt="It can run crysis" loading="lazy" src="https://scottjg.com/posts/2026-05-05-egpu-mac-gaming/crysis.jpg"&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m glad you asked. I tested Crysis Remastered at 1080p with two profiles: the bundled &lt;code&gt;veryhigh.cfg&lt;/code&gt; preset, and the famous &amp;ldquo;Can it run Crysis?&amp;rdquo; preset that ships in the remaster as &lt;code&gt;canitrunspec.cfg&lt;/code&gt;.&lt;/p&gt;
&lt;div class="chart-container" style="position: relative; height: 280px; width: 100%; margin-top: var(--content-gap); margin-bottom: var(--content-gap);"&gt;
 &lt;canvas id="crysis-chart"&gt;&lt;/canvas&gt;
&lt;/div&gt;

&lt;script&gt;
(function() {
 const init = () =&gt; ChartManager.createChart('crysis-chart', 
{
 type: 'bar',
 data: {
 labels: [
 'M4 Air + eGPU',
 'Gaming PC (i5-12600K, native PCIe)'
 ],
 datasets: [
 { label: '1080p — veryhigh', data: [29.94, 124.91] },
 { label: '1080p — Can It Run Crysis', data: [23.16, 90.45] }
 ]
 },
 options: {
 indexAxis: 'y',
 plugins: {
 title: {
 display: true,
 text: 'Crysis Remastered — Average FPS'
 }
 },
 scales: {
 x: {
 beginAtZero: true,
 title: {
 display: true,
 text: 'Average FPS'
 }
 }
 }
 }
}
);
 if (document.readyState === 'loading') {
 document.addEventListener('DOMContentLoaded', init);
 } else {
 init();
 }
})();
&lt;/script&gt;

&lt;p&gt;Of course, the M4 Air doesn&amp;rsquo;t really stack up to the gaming PC. Crysis is very dependent on single-threaded CPU performance. In 2007, we didn&amp;rsquo;t have Threadrippers with 96 cores, so it wasn&amp;rsquo;t as big a deal. Now that we lose so much performance from the emulation layers, it really stings to know we have so much CPU capacity available, and most of it doesn&amp;rsquo;t even get used.&lt;/p&gt;
&lt;p&gt;The gaming PC ultimately gets roughly 4x the framerate, but the M4 MacBook Air can indeed run Crysis at playable framerates.&lt;/p&gt;
&lt;h2 id="ai-inference"&gt;AI Inference&lt;/h2&gt;
&lt;p&gt;Games aren&amp;rsquo;t the only thing you can do with a GPU. Let&amp;rsquo;s try using some local LLMs.&lt;/p&gt;
&lt;h3 id="qwen-36"&gt;Qwen 3.6&lt;/h3&gt;
&lt;p&gt;&lt;a href="https://huggingface.co/Qwen"&gt;Qwen&lt;/a&gt; is one of the more popular &amp;ldquo;open weight&amp;rdquo; large language models. It&amp;rsquo;s developed by Alibaba Cloud.&lt;/p&gt;
&lt;p&gt;We&amp;rsquo;re testing the 35B-parameter mixture-of-experts version, which uses 3B active parameters. If you&amp;rsquo;ve dug into the local LLM landscape at all, you probably know that there are a zillion quantized versions of these models. The full-sized models will rarely fit on a normal consumer GPU, but if you round the weights down to lower precision, they can work with reduced accuracy.&lt;/p&gt;
&lt;p&gt;We&amp;rsquo;re not trying to really measure how accurate the models are; we&amp;rsquo;re focusing on getting the fastest possible version running on each platform. So, we use 4-bit &amp;ldquo;quants&amp;rdquo; of the model. On NVIDIA GPUs, that means running the NVFP4 versions via vLLM, and on Apple Silicon, we use 4-bit MLX quants with vllm-mlx. The benchmark runs are orchestrated with &lt;a href="https://github.com/johnwlambert/llama-benchy"&gt;&lt;code&gt;llama-benchy&lt;/code&gt;&lt;/a&gt;. So, it&amp;rsquo;s not exactly identical, but I think it&amp;rsquo;s the best apples-to-apples comparison that I could set up.&lt;/p&gt;
&lt;p&gt;The two metrics worth comparing are &lt;strong&gt;token generation speed&lt;/strong&gt; (how fast the model spits out new tokens once it&amp;rsquo;s started) and &lt;strong&gt;time to first token&lt;/strong&gt; (how long you wait after pressing enter before anything appears).&lt;/p&gt;
&lt;div class="chart-container" style="position: relative; height: 450px; width: 100%; margin-top: var(--content-gap); margin-bottom: var(--content-gap);"&gt;
 &lt;canvas id="qwen-tg-chart"&gt;&lt;/canvas&gt;
&lt;/div&gt;

&lt;script&gt;
(function() {
 const init = () =&gt; ChartManager.createChart('qwen-tg-chart', 
{
 type: 'bar',
 data: {
 labels: [
 'M4 Air (native, MLX)',
 'M4 Max Mac Studio (native, MLX)',
 'M5 Max MacBook Pro (native, MLX)',
 '2020 MBP + eGPU (vLLM)',
 'Gaming PC (native PCIe, vLLM)',
 'M4 Air + eGPU (vLLM)'
 ],
 datasets: [
 { label: 'after 512-token prompt', data: [23.5, 71.6, 124.9, 153.0, 169.5, 154.7], backgroundColor: '#bbdefb' },
 { label: 'after 1024-token prompt', data: [23.7, 70.5, 121.4, 153.3, 169.3, 153.5], backgroundColor: '#64b5f6' },
 { label: 'after 2048-token prompt', data: [22.3, 70.0, 116.5, 153.5, 169.7, 153.2], backgroundColor: '#1976d2' },
 { label: 'after 4096-token prompt', data: [20.8, 68.9, 114.4, 154.2, 170.0, 156.9], backgroundColor: '#0d47a1' }
 ]
 },
 options: {
 indexAxis: 'y',
 plugins: {
 title: {
 display: true,
 text: 'Qwen 3.6 — Token Generation Speed (single-stream)'
 }
 },
 scales: {
 x: {
 beginAtZero: true,
 title: {
 display: true,
 text: 'Tokens/sec'
 }
 }
 }
 }
}
);
 if (document.readyState === 'loading') {
 document.addEventListener('DOMContentLoaded', init);
 } else {
 init();
 }
})();
&lt;/script&gt;

&lt;p&gt;Here you can see a few things:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Size of the prompt does not meaningfully affect the speed.&lt;/strong&gt; Token generation is mostly constrained by memory bandwidth, not compute speed: the active weights have to be streamed through the GPU for each new token generated.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Thunderbolt eGPU performance is pretty similar to PCIe.&lt;/strong&gt; Most of the processing in this step will take place on the card, not on the computer. We do still lose about 9% performance in the eGPU configuration.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;NVIDIA RTX 5090 is 6.5x faster than the M4 Air, 2.1x faster than the M4 Max Mac Studio, and 1.2x faster than the M5 Max MacBook Pro.&lt;/strong&gt; The card uses considerably more power than all of the Macs combined, so it&amp;rsquo;s not a fair fight. That said, if you strap this GPU to your Mac, it really helps performance.&lt;/li&gt;
&lt;/ol&gt;
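&lt;p&gt;The ratios in the list above depend a little on which 5090 configuration you take as the baseline. As a small sketch (the &lt;code&gt;speedup&lt;/code&gt; helper is a name invented here; the values are the 512-token-prompt numbers from the chart), you can recompute them for any pair:&lt;/p&gt;

```python
# Tokens/sec after a 512-token prompt, from the chart above.
TOK_PER_SEC = {
    "M4 Air (MLX)": 23.5,
    "M4 Max Mac Studio (MLX)": 71.6,
    "M5 Max MacBook Pro (MLX)": 124.9,
    "2020 MBP + eGPU": 153.0,
    "Gaming PC (native PCIe)": 169.5,
    "M4 Air + eGPU": 154.7,
}


def speedup(fast, slow):
    """How many times faster `fast` generates tokens than `slow`."""
    return TOK_PER_SEC[fast] / TOK_PER_SEC[slow]


# Cost of the Thunderbolt + VM path relative to the GPU in a PCIe slot.
egpu_loss = 1.0 - speedup("M4 Air + eGPU", "Gaming PC (native PCIe)")
print(f"M4 Air + eGPU vs native PCIe: {egpu_loss:.1%} slower")

for baseline in ("2020 MBP + eGPU", "M4 Air + eGPU", "Gaming PC (native PCIe)"):
    for mac in ("M4 Air (MLX)", "M4 Max Mac Studio (MLX)", "M5 Max MacBook Pro (MLX)"):
        print(f"{baseline} is {speedup(baseline, mac):.1f}x faster than {mac}")
```

&lt;p&gt;The Thunderbolt configurations land close to the figures quoted above, while the 5090 in a native PCIe slot pulls a bit further ahead of each Mac.&lt;/p&gt;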
&lt;div class="chart-container" style="position: relative; height: 450px; width: 100%; margin-top: var(--content-gap); margin-bottom: var(--content-gap);"&gt;
 &lt;canvas id="qwen-ttft-chart"&gt;&lt;/canvas&gt;
&lt;/div&gt;

&lt;script&gt;
(function() {
 const init = () =&gt; ChartManager.createChart('qwen-ttft-chart', 
{
 type: 'bar',
 data: {
 labels: [
 'M4 Air (native, MLX)',
 'M4 Max Mac Studio (native, MLX)',
 'M5 Max MacBook Pro (native, MLX)',
 '2020 MBP + eGPU (vLLM)',
 'Gaming PC (native PCIe, vLLM)',
 'M4 Air + eGPU (vLLM)'
 ],
 datasets: [
 { label: '512-token prompt', data: [2011, 383, 283, 92, 56, 59], backgroundColor: '#bbdefb' },
 { label: '1024-token prompt', data: [3430, 638, 306, 100, 56, 53], backgroundColor: '#64b5f6' },
 { label: '2048-token prompt', data: [8003, 2278, 521, 96, 77, 78], backgroundColor: '#1976d2' },
 { label: '4096-token prompt', data: [17019, 2397, 1125, 147, 244, 142], backgroundColor: '#0d47a1' }
 ]
 },
 options: {
 indexAxis: 'y',
 plugins: {
 title: {
 display: true,
 text: 'Qwen 3.6 — Time to First Token vs Prompt Length'
 }
 },
 scales: {
 x: {
 type: 'logarithmic',
 title: {
 display: true,
 text: 'TTFT (ms, lower is better, log scale)'
 }
 }
 }
 }
}
);
 if (document.readyState === 'loading') {
 document.addEventListener('DOMContentLoaded', init);
 } else {
 init();
 }
})();
&lt;/script&gt;

&lt;p&gt;Here you can see the big issue with Macs: the prompt processing (aka &amp;ldquo;prefill&amp;rdquo;) speed. It just gets worse and worse, the longer the prompt gets. At a 4K-token prompt, which doesn&amp;rsquo;t seem very long, it takes &lt;strong&gt;17 seconds&lt;/strong&gt; for the M4 MacBook Air to process it before we even start generating a response. Meanwhile, if you strap the eGPU to it, it&amp;rsquo;ll only take about 140ms. It&amp;rsquo;s &lt;strong&gt;120x faster&lt;/strong&gt;. In a long-running chat, a real system would use a KV cache to avoid re-processing what you&amp;rsquo;ve already talked about in previous turns, but ultimately if you ever have to give a bunch of data to your LLM at once, it has to process that, and it&amp;rsquo;s going to be slow on a Mac.&lt;/p&gt;
&lt;p&gt;While Macs can have good memory bandwidth performance, the prefill stage is compute-bound. The 5090 just has way more processing power than any of the Macs. Hate to see it.&lt;/p&gt;
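&lt;p&gt;Another way to read the TTFT chart is as a prefill rate in prompt tokens per second. This sketch treats all of the time to first token as prompt processing, which slightly understates the real prefill rate because TTFT also includes fixed request overhead; &lt;code&gt;prefill_tokens_per_sec&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
# Time to first token (ms) for a 4096-token prompt, from the Qwen chart above.
PROMPT_TOKENS = 4096
TTFT_MS = {
    "M4 Air (MLX)": 17019,
    "M4 Max Mac Studio (MLX)": 2397,
    "M5 Max MacBook Pro (MLX)": 1125,
    "M4 Air + eGPU (vLLM)": 142,
}


def prefill_tokens_per_sec(prompt_tokens, ttft_ms):
    """Approximate prefill rate, assuming TTFT is dominated by prompt processing."""
    return prompt_tokens / (ttft_ms / 1000.0)


for name, ms in TTFT_MS.items():
    rate = prefill_tokens_per_sec(PROMPT_TOKENS, ms)
    print(f"{name}: about {rate:,.0f} prompt tokens/sec")

ratio = TTFT_MS["M4 Air (MLX)"] / TTFT_MS["M4 Air + eGPU (vLLM)"]
print(f"eGPU prefill speedup on the M4 Air: about {ratio:.0f}x")
```

&lt;p&gt;That works out to roughly 240 prompt tokens per second on the M4 Air by itself, versus tens of thousands per second once the 5090 is attached.&lt;/p&gt;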
&lt;p&gt;The other axis worth looking at is &lt;strong&gt;concurrency&lt;/strong&gt;: how much extra total throughput do you get if you serve more than one request at a time? This is what matters if you&amp;rsquo;re hosting an LLM for a small team or running a batch job, rather than chatting with it solo.&lt;/p&gt;
&lt;div class="chart-container" style="position: relative; height: 410px; width: 100%; margin-top: var(--content-gap); margin-bottom: var(--content-gap);"&gt;
 &lt;canvas id="qwen-concurrency-chart"&gt;&lt;/canvas&gt;
&lt;/div&gt;

&lt;script&gt;
(function() {
 const init = () =&gt; ChartManager.createChart('qwen-concurrency-chart', 
{
 type: 'bar',
 data: {
 labels: [
 'M4 Air (native, MLX)',
 'M4 Max Mac Studio (native, MLX)',
 'M5 Max MacBook Pro (native, MLX)',
 '2020 MBP + eGPU (vLLM)',
 'Gaming PC (native PCIe, vLLM)',
 'M4 Air + eGPU (vLLM)'
 ],
 datasets: [
 { label: '1 concurrent request', data: [23.5, 71.6, 124.9, 153.0, 169.5, 154.7], backgroundColor: '#bbdefb' },
 { label: '2 concurrent requests', data: [26.9, 97.8, 170.7, 242.5, 269.1, 247.5], backgroundColor: '#1976d2' },
 { label: '4 concurrent requests', data: [null, 143.7, 206.9, 475.8, 516.4, 454.2], backgroundColor: '#0d47a1' }
 ]
 },
 options: {
 indexAxis: 'y',
 plugins: {
 title: {
 display: true,
 text: 'Qwen 3.6 — Total Throughput vs Concurrency'
 }
 },
 scales: {
 x: {
 beginAtZero: true,
 title: {
 display: true,
 text: 'Total tokens/sec (across all concurrent requests)'
 }
 }
 }
 }
}
);
 if (document.readyState === 'loading') {
 document.addEventListener('DOMContentLoaded', init);
 } else {
 init();
 }
})();
&lt;/script&gt;

&lt;p&gt;The 5090 configurations all scale close to linearly. If you go from 1 concurrent request to 4 concurrent requests, you get ~3x the throughput. That&amp;rsquo;s because the 5090 has such a massive quantity of compute on the card. If you don&amp;rsquo;t batch multiple requests, most of it just sits idle, so concurrent requests simply put more of the card to work. The bigger Apple Silicon Macs scale more poorly: the M4 Max Mac Studio gets ~2x with 4 concurrent requests, and the M5 Max MacBook Pro only ~1.7x. The M4 Air&amp;rsquo;s integrated GPU is so bottlenecked on memory bandwidth that even the 2 concurrent request measurements are noisy enough to be barely distinguishable from a single request.&lt;/p&gt;
&lt;p&gt;This is the other reason a discrete GPU is interesting beyond just raw speed. It unlocks much better batching headroom in these tests. The Apple Silicon MLX runs here look optimized for low-power single-user inference, not high-throughput serving.&lt;/p&gt;
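&lt;p&gt;To put a number on &amp;ldquo;batching headroom&amp;rdquo;, you can divide the measured speedup by the number of concurrent requests. The sketch below does that for the Qwen results; the &lt;code&gt;scaling&lt;/code&gt; helper and the idea of reporting an efficiency figure are illustrative additions, while the throughput values come from the chart.&lt;/p&gt;

```python
# Total tokens/sec at 1 and 4 concurrent requests, from the Qwen chart above.
THROUGHPUT = {
    "M4 Max Mac Studio (MLX)": (71.6, 143.7),
    "M5 Max MacBook Pro (MLX)": (124.9, 206.9),
    "2020 MBP + eGPU (vLLM)": (153.0, 475.8),
    "Gaming PC (native PCIe, vLLM)": (169.5, 516.4),
    "M4 Air + eGPU (vLLM)": (154.7, 454.2),
}


def scaling(single, batched, n_requests):
    """Return (speedup, efficiency); efficiency 1.0 would be perfectly linear."""
    speedup = batched / single
    return speedup, speedup / n_requests


for name, (one, four) in THROUGHPUT.items():
    speedup, efficiency = scaling(one, four, 4)
    print(f"{name}: {speedup:.2f}x at 4 requests ({efficiency:.0%} of linear)")
```

&lt;p&gt;By this measure the 5090 setups keep roughly three quarters of ideal linear scaling at 4 requests, while the two Macs keep half or less.&lt;/p&gt;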
&lt;h3 id="gemma-4"&gt;Gemma 4&lt;/h3&gt;
&lt;p&gt;Gemma 4 31B is a useful contrast to Qwen 3.6. It&amp;rsquo;s designed by Google, and it&amp;rsquo;s a &lt;em&gt;dense&lt;/em&gt; 31B model rather than a sparse mixture-of-experts, so every token has to flow through all 31 billion parameters instead of just the 3B that Qwen activates per token. That makes it a much heavier workload, roughly 10x more compute per token. It&amp;rsquo;s a stress test for whether the platforms can keep up.&lt;/p&gt;
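&lt;p&gt;The &amp;ldquo;roughly 10x&amp;rdquo; figure follows from a common rule of thumb: a forward pass costs about 2 floating-point operations (one multiply, one add) per active weight per token. The sketch below applies that rule; it is an approximation that ignores attention over the context, and &lt;code&gt;forward_flops_per_token&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
def forward_flops_per_token(active_params):
    """Rule-of-thumb forward-pass cost: about 2 FLOPs per active weight."""
    return 2 * active_params


qwen = forward_flops_per_token(3e9)    # ~3B active parameters (mixture-of-experts)
gemma = forward_flops_per_token(31e9)  # all 31B parameters (dense)

print(f"Qwen 3.6:  ~{qwen / 1e9:.0f} GFLOPs per token")
print(f"Gemma 4:   ~{gemma / 1e9:.0f} GFLOPs per token")
print(f"Gemma / Qwen compute per token: ~{gemma / qwen:.1f}x")
```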
&lt;p&gt;I left the M4 Air&amp;rsquo;s integrated GPU out of these charts because performance with Gemma 4 was way below the useful range in my testing. I couldn&amp;rsquo;t get more than 2 or 3 tokens per second with it.&lt;/p&gt;
&lt;div class="chart-container" style="position: relative; height: 410px; width: 100%; margin-top: var(--content-gap); margin-bottom: var(--content-gap);"&gt;
 &lt;canvas id="gemma-tg-chart"&gt;&lt;/canvas&gt;
&lt;/div&gt;

&lt;script&gt;
(function() {
 const init = () =&gt; ChartManager.createChart('gemma-tg-chart', 
{
 type: 'bar',
 data: {
 labels: [
 'M4 Max Mac Studio (native, MLX)',
 'M5 Max MacBook Pro (native, MLX)',
 '2020 MBP + eGPU (vLLM)',
 'Gaming PC (native PCIe, vLLM)',
 'M4 Air + eGPU (vLLM)'
 ],
 datasets: [
 { label: 'after 512-token prompt', data: [23.8, 26.7, 50.9, 52.0, 50.8], backgroundColor: '#bbdefb' },
 { label: 'after 1024-token prompt', data: [22.5, 23.7, 50.5, 51.6, 50.3], backgroundColor: '#64b5f6' },
 { label: 'after 2048-token prompt', data: [21.9, 22.9, 50.2, 51.3, 50.0], backgroundColor: '#1976d2' },
 { label: 'after 4096-token prompt', data: [21.5, 22.6, 49.5, 50.6, 49.2], backgroundColor: '#0d47a1' }
 ]
 },
 options: {
 indexAxis: 'y',
 plugins: {
 title: {
 display: true,
 text: 'Gemma 4 31B — Token Generation Speed (single-stream)'
 }
 },
 scales: {
 x: {
 beginAtZero: true,
 title: {
 display: true,
 text: 'Tokens/sec'
 }
 }
 }
 }
}
);
 if (document.readyState === 'loading') {
 document.addEventListener('DOMContentLoaded', init);
 } else {
 init();
 }
})();
&lt;/script&gt;

&lt;p&gt;All three vLLM-backed setups land within a few percent of each other at ~50 t/s. That&amp;rsquo;s roughly a 3x slowdown compared to Qwen 3.6, which makes sense because Gemma activates ~10x more parameters per token. Both Macs drop hard: the M4 Max Mac Studio falls to ~22-24 t/s, about a third of what it managed on Qwen, and the M5 Max MacBook Pro falls to ~23-27 t/s, about a fifth.&lt;/p&gt;
&lt;div class="chart-container" style="position: relative; height: 410px; width: 100%; margin-top: var(--content-gap); margin-bottom: var(--content-gap);"&gt;
 &lt;canvas id="gemma-ttft-chart"&gt;&lt;/canvas&gt;
&lt;/div&gt;

&lt;script&gt;
(function() {
 const init = () =&gt; ChartManager.createChart('gemma-ttft-chart', 
{
 type: 'bar',
 data: {
 labels: [
 'M4 Max Mac Studio (native, MLX)',
 'M5 Max MacBook Pro (native, MLX)',
 '2020 MBP + eGPU (vLLM)',
 'Gaming PC (native PCIe, vLLM)',
 'M4 Air + eGPU (vLLM)'
 ],
 datasets: [
 { label: '512-token prompt', data: [2726, 748, null, 66, 73], backgroundColor: '#bbdefb' },
 { label: '1024-token prompt', data: [5458, 1912, 110, 98, 89], backgroundColor: '#64b5f6' },
 { label: '2048-token prompt', data: [10719, 3947, 184, 178, 180], backgroundColor: '#1976d2' },
 { label: '4096-token prompt', data: [21266, 7520, 368, 348, 361], backgroundColor: '#0d47a1' }
 ]
 },
 options: {
 indexAxis: 'y',
 plugins: {
 title: {
 display: true,
 text: 'Gemma 4 31B — Time to First Token vs Prompt Length'
 }
 },
 scales: {
 x: {
 type: 'logarithmic',
 title: {
 display: true,
 text: 'TTFT (ms, lower is better, log scale)'
 }
 }
 }
 }
}
);
 if (document.readyState === 'loading') {
 document.addEventListener('DOMContentLoaded', init);
 } else {
 init();
 }
})();
&lt;/script&gt;

&lt;p&gt;Same pattern here as with Qwen. The M4 Max just gets slower and slower, taking up to 21 seconds to parse a 4K-token prompt. The M5 Max cuts that to about 7.5 seconds, but the RTX 5090 is always under 400ms. The gap between native MLX and the eGPU is even wider here than it was for Qwen.&lt;/p&gt;
&lt;div class="chart-container" style="position: relative; height: 370px; width: 100%; margin-top: var(--content-gap); margin-bottom: var(--content-gap);"&gt;
 &lt;canvas id="gemma-concurrency-chart"&gt;&lt;/canvas&gt;
&lt;/div&gt;

&lt;script&gt;
(function() {
 const init = () =&gt; ChartManager.createChart('gemma-concurrency-chart', 
{
 type: 'bar',
 data: {
 labels: [
 'M4 Max Mac Studio (native, MLX)',
 'M5 Max MacBook Pro (native, MLX)',
 '2020 MBP + eGPU (vLLM)',
 'Gaming PC (native PCIe, vLLM)',
 'M4 Air + eGPU (vLLM)'
 ],
 datasets: [
 { label: '1 concurrent request', data: [23.8, 26.7, 50.9, 52.0, 50.8], backgroundColor: '#bbdefb' },
 { label: '2 concurrent requests', data: [41.0, 41.8, 100.9, 101.1, 97.5], backgroundColor: '#1976d2' },
 { label: '4 concurrent requests', data: [46.4, 46.8, 189.2, 178.1, 184.3], backgroundColor: '#0d47a1' }
 ]
 },
 options: {
 indexAxis: 'y',
 plugins: {
 title: {
 display: true,
 text: 'Gemma 4 31B — Total Throughput vs Concurrency'
 }
 },
 scales: {
 x: {
 beginAtZero: true,
 title: {
 display: true,
 text: 'Total tokens/sec (across all concurrent requests)'
 }
 }
 }
 }
}
);
 if (document.readyState === 'loading') {
 document.addEventListener('DOMContentLoaded', init);
 } else {
 init();
 }
})();
&lt;/script&gt;

&lt;p&gt;Concurrency scaling is even more dramatic for Gemma. The vLLM setups hit ~3.5x throughput with 4 concurrent requests (vs ~3x for Qwen). Both the M4 Max Mac Studio and the M5 Max MacBook Pro scale to ~1.6-1.7x at 2 concurrent requests but barely budge from there to 4 requests (~1.8-1.9x), suggesting they&amp;rsquo;re already saturating whatever batching the MLX backend can do at that point.&lt;/p&gt;
&lt;h1 id="can-i-run-this"&gt;Can I run this?&lt;/h1&gt;
&lt;p&gt;I wish the answer were &amp;ldquo;download this thing, and you&amp;rsquo;re good to go&amp;rdquo; but, alas, it&amp;rsquo;s not that simple. This project requires a special entitlement from Apple. I&amp;rsquo;ve requested it, and heard they may be open to granting it, but I have not yet heard back, and I&amp;rsquo;m told that the wait time could be months.&lt;/p&gt;
&lt;p&gt;In the meantime, you can build your own version of the driver (the signing cert account needs to have your Mac in it, but you don&amp;rsquo;t need to disable SIP or use the reduced security mode). Then you can load it. If you wanna try it, you can &lt;a href="https://github.com/scottjg/qemu-vfio-apple"&gt;grab the code here&lt;/a&gt;. The launcher that it comes with will automatically download a prebuilt Ubuntu image that has the special &lt;code&gt;apple_dma&lt;/code&gt; driver installed. If you want to run your own Linux distro, you&amp;rsquo;ll have to manually install that into your VM for the passthrough part to work.&lt;/p&gt;
&lt;p&gt;I would also warn that the stability of the whole thing is not the greatest. FEX has a bug right now that means &lt;a href="https://github.com/FEX-Emu/FEX/issues/5336"&gt;Steam often crashes in a loop&lt;/a&gt;. For whatever reason, it seems worse in this setup. Even when it&amp;rsquo;s working, it can take minutes to start certain games. The DMA mapping limits mean that sometimes the mappings fragment over time, and you can run out of space to run new games. You then have to halt the Linux VM, unplug/replug the GPU, just to clear all the DMA mappings and try again.&lt;/p&gt;
&lt;p&gt;The most reliable thing you can do with this setup is AI, and I&amp;rsquo;d say that works really well. If you&amp;rsquo;re one of those weirdos who wants an OpenClaw setup with a local LLM, you could use the same Mac for your iMessage bot as the AI server.&lt;/p&gt;
&lt;p&gt;I am also working with upstream QEMU to try and integrate my patches. TBD if that will work out, but ideally this would end up in the mainstream distributions of QEMU like &lt;a href="https://mac.getutm.app"&gt;UTM&lt;/a&gt; so this can be something that just works out of the box.&lt;/p&gt;
&lt;h2 id="get-notified"&gt;Get notified&lt;/h2&gt;
&lt;p&gt;I can keep you posted when this gets easier to install. Just subscribe below:&lt;/p&gt;
&lt;form
 action="https://buttondown.com/api/emails/embed-subscribe/scottjg"
 method="post"
 class="bd-subscribe"
 target="popupwindow"
 onsubmit="window.open('https://buttondown.com/scottjg', 'popupwindow')"
&gt;
 &lt;label for="bd-email"&gt;Email address&lt;/label&gt;
 &lt;div class="bd-row"&gt;
 &lt;input type="email" name="email" id="bd-email" placeholder="you@example.com" required /&gt;
 &lt;button type="submit"&gt;Subscribe&lt;/button&gt;
 &lt;/div&gt;
&lt;/form&gt;
&lt;h1 id="conclusion"&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;So: can it game?&lt;/p&gt;
&lt;p&gt;Yes, with enough elbow grease. A virtual DMA device, kprobes patching the NVIDIA driver, hardware TSO mode, a pretty big QEMU patch, a mapping coalescer to stay under DART&amp;rsquo;s 64k cap&amp;hellip; and at the end of all that, a MacBook Air really does run Cyberpunk, Crysis, and Doom on an RTX 5090 in a Linux VM. Was it a ridiculous project? Also yes.&lt;/p&gt;
&lt;p&gt;That said, a &amp;ldquo;real&amp;rdquo; PC with the same GPU in a normal PCIe slot is 2-4x faster, depending on the game. There are just a lot of layers here that hurt performance: FEX translating x86 to ARM, Proton translating Windows to Linux, BAR writes paying for &lt;code&gt;hv_vm_map()&lt;/code&gt;&amp;rsquo;s strict device-memory ordering. Not to mention the occasional game (Horizon Zero Dawn) that blows past DART&amp;rsquo;s mapping limits and refuses to start at all.&lt;/p&gt;
&lt;p&gt;The more useful finding was how well AI inference worked. CUDA runs natively on arm64 Linux. Prefill time on the M4 Air drops by ~100x, single-stream Qwen token generation goes from ~22 to ~155 tok/s, and concurrency actually scales. Strapping a 600W eGPU to a 22W laptop ends up beating the M4 Max Mac Studio at this workload. Is local inference actually interesting outside of the realm of hobbyist weirdos? Still unclear.&lt;/p&gt;
&lt;p&gt;If Linux could gain support for Thunderbolt on Apple Silicon, it would collapse a lot of the issues: no more BAR latency penalty, no more DMA limits, no more VM overhead, etc. Maybe that&amp;rsquo;ll happen at some point.&lt;/p&gt;
&lt;p&gt;For now, I&amp;rsquo;d say this is firmly a &amp;ldquo;look what&amp;rsquo;s possible&amp;rdquo; project, not a &amp;ldquo;look what you should buy&amp;rdquo; project. I wouldn&amp;rsquo;t be surprised if subsequent generations of Macs cross the threshold where the speed improvements outpace the cost of the emulation layers. I also think in the future we&amp;rsquo;re going to see more ARM64 native games. If the games were native, the Mac could probably outpace my gaming PC.&lt;/p&gt;
&lt;h2 id="follow-on"&gt;Follow-on&lt;/h2&gt;
&lt;p&gt;A few things that might be interesting to follow up on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Release an easily-installable version if/when Apple grants the entitlements.&lt;/li&gt;
&lt;li&gt;Test a &lt;a href="https://amzn.to/4cQTZ4v"&gt;Thunderbolt 5 eGPU dock&lt;/a&gt; (I only tested TB4). Gigabyte even sells &lt;a href="https://amzn.to/4uw2vfb"&gt;one with an integrated 5090&lt;/a&gt; now. M5 supports TB5.&lt;/li&gt;
&lt;li&gt;Retest if Apple fixes the HVF API for the BAR mapping issue.&lt;/li&gt;
&lt;li&gt;Measure interrupt latency and see if there are any tricks that might help with that.&lt;/li&gt;
&lt;li&gt;Profile GravityMark to see specifically what it&amp;rsquo;s doing to try and figure out what makes it slower.&lt;/li&gt;
&lt;li&gt;Continue working with upstream QEMU to try and integrate this work.&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 id="credits"&gt;Credits&lt;/h1&gt;
&lt;p&gt;Just wanted to thank a few people who helped me out:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aneesiqbal.ai"&gt;Anees Iqbal&lt;/a&gt; mentioned that he was able to get the entitlement from Apple, which inspired me to try the project. I went a different technical direction than he originally suggested, but it was a cool idea.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://connor.town"&gt;Connor Sears&lt;/a&gt; let me test out some things on his maxed-out M5 Max MacBook Pro.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nakajima.github.io"&gt;Pat Nakajima&lt;/a&gt; ran some AI inference benchmarks on his M4 Max Mac Studio for me and also did some editing on the post. He insisted that I give him an &amp;ldquo;executive producer credit.&amp;rdquo; So I need to say that &lt;strong&gt;this blog post was executive-produced by Pat Nakajima&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://patrickbgibson.com"&gt;Patrick Gibson&lt;/a&gt; ran some benchmarks on his Mac Studio for me.&lt;/li&gt;
&lt;li&gt;Mohamed Mediouni did some initial review for me on the qemu-devel mailing list and provided some general feedback on the approach, which was very helpful. It&amp;rsquo;s such a weird niche project that getting any advice is tough.&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>RTX 5090 + Raspberry Pi: Can it Game?</title><link>https://scottjg.com/posts/2026-01-08-crappy-computer-showdown/</link><pubDate>Thu, 08 Jan 2026 11:22:32 -0800</pubDate><guid>https://scottjg.com/posts/2026-01-08-crappy-computer-showdown/</guid><description>&lt;p&gt;It turns out, &lt;a href="https://www.jeffgeerling.com/blog/2025/big-gpus-dont-need-big-pcs/"&gt;you can attach an external GPU to a Raspberry Pi 5&lt;/a&gt;. So my natural first question is, can I game on it? Let&amp;rsquo;s try it out and compare it with some similar computers.&lt;/p&gt;
&lt;p style="font-size: 0.75em; color: var(--secondary);"&gt;Just a quick FTC required note: When you buy through my links, I may earn a commission.&lt;/p&gt;
&lt;p&gt;For the showdown of crappy gaming computers, we&amp;rsquo;ll see which of these handles gaming best:&lt;/p&gt;
&lt;h3 id="beelink-mini-s13"&gt;&lt;a href="https://amzn.to/4qh5EOe"&gt;Beelink MINI-S13&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;CPU:&lt;/strong&gt; 4-core Intel N150 @ 3.6GHz&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RAM:&lt;/strong&gt; 16GB DDR4&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PCIe:&lt;/strong&gt; M.2 Gen3 x4&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="image-grid" style="display: flex; gap: 1rem; margin-bottom: var(--content-gap); flex-wrap: wrap;"&gt;&lt;img src="beelink-top.jpg" alt="beelink-top.jpg" style="flex: 1; min-width: 250px; max-width: 100%; height: auto; border-radius: 4px; margin: 0" /&gt;&lt;img src="beelink-bottom.jpg" alt="beelink-bottom.jpg" style="flex: 1; min-width: 250px; max-width: 100%; height: auto; border-radius: 4px; margin: 0" /&gt;&lt;/div&gt;

&lt;p&gt;More powerful than the Raspberry Pi 5, but at a similar price point. It also has a potential advantage for running games, since it&amp;rsquo;s not ARM-based.&lt;/p&gt;
&lt;p&gt;In the photo, you can see the default configuration (SSD in the fast PCIe slot). For this experiment, I&amp;rsquo;ll move it into the slower (x1) slot and plug the eGPU into the faster (x4) slot.&lt;/p&gt;
&lt;h3 id="radxa-rock-5b"&gt;&lt;a href="https://amzn.to/4qf4uCU"&gt;Radxa ROCK 5B&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;CPU:&lt;/strong&gt; 8-core RK3588 (4× Cortex-A76 @ 2.4GHz + 4× Cortex-A55 @ 1.8GHz)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RAM:&lt;/strong&gt; 16GB LPDDR4X&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PCIe:&lt;/strong&gt; M.2 Gen3 x4&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Also:&lt;/strong&gt; &lt;a href="https://amzn.to/4qwrXjf"&gt;Aftermarket heat sink &amp;amp; fan combo&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="image-grid" style="display: flex; gap: 1rem; margin-bottom: var(--content-gap); flex-wrap: wrap;"&gt;&lt;img src="rock5b-top.jpg" alt="rock5b-top.jpg" style="flex: 1; min-width: 250px; max-width: 100%; height: auto; border-radius: 4px; margin: 0" /&gt;&lt;img src="rock5b-bottom.jpg" alt="rock5b-bottom.jpg" style="flex: 1; min-width: 250px; max-width: 100%; height: auto; border-radius: 4px; margin: 0" /&gt;&lt;/div&gt;

&lt;p&gt;Pretty comparable to the Raspberry Pi 5 (it&amp;rsquo;s ARM), but the extra cores give it a little more horsepower. The faster PCIe slot is also included on-board. Since the PCIe slot will be taken for the GPU, we&amp;rsquo;ll just use a &lt;a href="https://amzn.to/4pzyybl"&gt;USB SSD&lt;/a&gt; for both ARM boards.&lt;/p&gt;
&lt;h3 id="raspberry-pi-5"&gt;&lt;a href="https://amzn.to/4jxuw1r"&gt;Raspberry Pi 5&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;CPU:&lt;/strong&gt; 4-core BCM2712 (Cortex-A76 @ 2.4GHz)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RAM:&lt;/strong&gt; 16GB LPDDR4X&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PCIe:&lt;/strong&gt; M.2 Gen2 x1 (via &lt;a href="https://amzn.to/49vdJb2"&gt;NVMe HAT&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Also:&lt;/strong&gt; &lt;a href="https://amzn.to/4sCeFmI"&gt;Aftermarket heat sink &amp;amp; fan combo&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="image-grid" style="display: flex; gap: 1rem; margin-bottom: var(--content-gap); flex-wrap: wrap;"&gt;&lt;img src="rpi5-top.jpg" alt="rpi5-top.jpg" style="flex: 1; min-width: 250px; max-width: 100%; height: auto; border-radius: 4px; margin: 0" /&gt;&lt;img src="rpi5-bottom.jpg" alt="rpi5-bottom.jpg" style="flex: 1; min-width: 250px; max-width: 100%; height: auto; border-radius: 4px; margin: 0" /&gt;&lt;/div&gt;

&lt;p&gt;This is why we&amp;rsquo;re all here. It&amp;rsquo;s the quintessential hobbyist SBC. Unfortunately it&amp;rsquo;s the most challenged: fewer cores, and significantly less PCIe bandwidth. The Pi 5&amp;rsquo;s Gen2 x1 slot provides ~500 MB/s, compared to ~4,000 MB/s on the Gen3 x4 slots of the other machines, an 8x difference.&lt;/p&gt;
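&lt;p&gt;If you want to see where those bandwidth numbers come from, here is a rough sketch. It assumes the standard PCIe line rates and encodings (5 GT/s with 8b/10b for Gen2, 8 GT/s with 128b/130b for Gen3) and ignores all other protocol overhead, so treat the results as ceilings. The function and variable names are made up for illustration.&lt;/p&gt;

```javascript
// Approximate PCIe bandwidth per direction, accounting only for line encoding.
// Transfer rates and encodings are the standard values for each generation.
const GENERATIONS = {
  2: { gigatransfersPerSec: 5, payloadBits: 8, lineBits: 10 },    // 8b/10b
  3: { gigatransfersPerSec: 8, payloadBits: 128, lineBits: 130 }, // 128b/130b
};

// Hypothetical helper: returns megabytes per second for a given link.
function pcieBandwidthMBps(generation, lanes) {
  const gen = GENERATIONS[generation];
  const bitsPerSec = gen.gigatransfersPerSec * 1e9 * gen.payloadBits / gen.lineBits;
  return (bitsPerSec / 8 / 1e6) * lanes;
}

const pi5 = pcieBandwidthMBps(2, 1);    // Raspberry Pi 5: Gen2 x1
const others = pcieBandwidthMBps(3, 4); // Beelink and ROCK 5B: Gen3 x4

console.log(Math.round(pi5));           // 500
console.log(Math.round(others));        // 3938
console.log((others / pi5).toFixed(1)); // 7.9
```

&lt;p&gt;So the 8x figure is a slight round-up of roughly 7.9x.&lt;/p&gt;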
&lt;h3 id="egpu"&gt;eGPU&lt;/h3&gt;
&lt;p&gt;We will be using a relatively inexpensive &lt;a href="https://amzn.to/3LuqWIV"&gt;OCuLink dock&lt;/a&gt; to pair with our very expensive GPU. If you&amp;rsquo;re not familiar with the technology, it&amp;rsquo;s basically a PCIe extension cord to let you plug a graphics card into a computer that wouldn&amp;rsquo;t normally fit one. The dock is powered externally by a &lt;a href="https://www.amazon.com/quiet-Certification-semi-Passive-Technology-Overclocked/dp/B0FBX9VS3B?th=1&amp;amp;linkCode=ll1&amp;amp;tag=scottjg-20&amp;amp;linkId=7741030a2c13875241c115fadf12ab8e&amp;amp;language=en_US&amp;amp;ref_=as_li_ss_tl"&gt;separate power supply&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For this experiment, we&amp;rsquo;re using an NVIDIA RTX 5090 Founders Edition (32GB VRAM).&lt;/p&gt;
&lt;p&gt;&lt;img alt="OCuLink eGPU dock" loading="lazy" src="https://scottjg.com/posts/2026-01-08-crappy-computer-showdown/egpu-oculink.jpg"&gt;&lt;/p&gt;
&lt;p&gt;The OCuLink cable plugs into an M.2 card that we&amp;rsquo;ll insert into each machine as we test it.&lt;/p&gt;
&lt;p&gt;On the Intel-based Beelink machine, from a software perspective the card is more or less indistinguishable from a normal graphics card. We can just install the normal NVIDIA drivers.&lt;/p&gt;
&lt;p&gt;The ARM-based computers we&amp;rsquo;re testing have various quirks (lack of DMA coherence, memory alignment requirements, etc.) that make them incompatible with most GPU drivers out of the box. Luckily, &lt;a href="https://github.com/mariobalanica"&gt;@mariobalanica&lt;/a&gt; wrote some patches that allow the drivers to work on these systems. NVIDIA already had some workarounds in the user-space part of their drivers for Ampere-based systems for memory alignment issues, so some of that gets inherited here.&lt;/p&gt;
&lt;p&gt;I have packaged the patched drivers for Ubuntu and Fedora &lt;a href="https://github.com/scottjg/nvidia-armsbc"&gt;here&lt;/a&gt;, if you&amp;rsquo;d like to try this yourself.&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;ve gotten this far and simply don&amp;rsquo;t believe this actually works, here&amp;rsquo;s a screenshot:
&lt;img alt="Raspberry Pi 5 GPU screenshot" loading="lazy" src="https://scottjg.com/posts/2026-01-08-crappy-computer-showdown/pi5-gpu-screenshot.jpg"&gt;&lt;/p&gt;
&lt;h2 id="cpu-performance"&gt;CPU Performance&lt;/h2&gt;
&lt;p&gt;Before we get into the games, let&amp;rsquo;s take a look at how these machines compare.&lt;/p&gt;
&lt;div class="chart-container" style="position: relative; height: 400px; width: 100%; margin-top: var(--content-gap); margin-bottom: var(--content-gap);"&gt;
 &lt;canvas id="benchmark"&gt;&lt;/canvas&gt;
&lt;/div&gt;

&lt;script&gt;
(function() {
 const init = () =&gt; ChartManager.createChart('benchmark', 
{
 type: 'bar',
 data: {
 labels: [
 '2018 Mac Mini (i7-8700B)',
 'Beelink (N150)',
 '2014 Mac Mini (i5-4278U)',
 'ROCK 5B (native)',
 'Raspberry Pi 5 (native)',
 '2008 Intel Core 2 Quad Q9650',
 'ROCK 5B (FEX)',
 'Raspberry Pi 5 (FEX)'
 ],
 datasets: [{
 label: 'Single-Core',
 data: [1472, 1282, 1007, 812, 762, 379, 362, 360],
 }, {
 label: 'Multi-Core',
 data: [4705, 3215, 1968, 2964, 1722, 1064, 1377, 1018],
 }]
 },
 options: {
 indexAxis: 'y',
 plugins: {
 title: {
 display: true,
 text: 'Geekbench 6 CPU Scores of Crappy Computers'
 }
 },
 scales: {
 y: {
 ticks: {
 font: function(context) {
 const label = context.tick.label;
 const emphasized = ['Beelink', 'ROCK 5B', 'Raspberry Pi'];
 const isBold = emphasized.some(e =&gt; label.includes(e));
 return { weight: isBold ? 'bold' : 'normal' };
 }
 }
 }
 }
 }
}
);
 if (document.readyState === 'loading') {
 document.addEventListener('DOMContentLoaded', init);
 } else {
 init();
 }
})();
&lt;/script&gt;

&lt;p&gt;Most PC games are built for x86 CPUs (Intel and AMD). If we want to play them on ARM, we&amp;rsquo;ll have to use FEX, an emulator that runs x86 code on ARM. The graph shows not only the native performance of the machines, but also the significantly degraded performance under FEX. To be fair, FEX is an incredible feat of engineering, but all emulation comes at a cost.&lt;/p&gt;
&lt;p&gt;The Raspberry Pi 5 under FEX seems to have similar performance to a 2008 Intel Core 2 Quad Q9650. Not very promising. That said, gamers usually say that, for most games, it&amp;rsquo;s OK to skimp on CPU a bit as long as you have a good GPU. We will definitely be testing that line of thinking.&lt;/p&gt;
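&lt;p&gt;To put a number on that cost, here is a small sketch that works out how much of the native Geekbench score each ARM board keeps under FEX. The scores are copied from the chart above; the helper name is just something I made up for this example.&lt;/p&gt;

```javascript
// Geekbench 6 scores copied from the chart above.
const scores = {
  'ROCK 5B':        { native: { single: 812, multi: 2964 }, fex: { single: 362, multi: 1377 } },
  'Raspberry Pi 5': { native: { single: 762, multi: 1722 }, fex: { single: 360, multi: 1018 } },
};

// Hypothetical helper: percentage of the native score retained under FEX.
function retainedPercent(nativeScore, fexScore) {
  return Math.round((fexScore / nativeScore) * 100);
}

for (const [machine, s] of Object.entries(scores)) {
  const single = retainedPercent(s.native.single, s.fex.single);
  const multi = retainedPercent(s.native.multi, s.fex.multi);
  console.log(machine + ': ' + single + '% single-core, ' + multi + '% multi-core');
}
// ROCK 5B: 45% single-core, 46% multi-core
// Raspberry Pi 5: 47% single-core, 59% multi-core
```

&lt;p&gt;In other words, in this benchmark both boards lose roughly half of their CPU performance to emulation.&lt;/p&gt;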
&lt;h2 id="games"&gt;Games&lt;/h2&gt;
&lt;p&gt;I tried to find games with built-in benchmarks that also worked under FEX and Steam&amp;rsquo;s Proton compatibility layer, favoring games with lighter CPU requirements. It turns out this is actually not a huge list. Here are a handful that I tried:&lt;/p&gt;
&lt;h3 id="cyberpunk-2077-2020"&gt;Cyberpunk 2077 (2020)&lt;/h3&gt;
&lt;p&gt;&lt;img alt="Screenshot of Cyberpunk 2077 running on the Raspberry Pi" loading="lazy" src="https://scottjg.com/posts/2026-01-08-crappy-computer-showdown/cyberpunk-rpi5-ultra.jpg"&gt;&lt;/p&gt;
&lt;p&gt;Yes, believe it or not, &lt;code&gt;vkcube&lt;/code&gt; is not the only thing you can run in this configuration. Through the maze of compatibility layers (FEX, WINE/Proton, DXVK, etc.), you too can run Cyberpunk 2077 on your Raspberry Pi 5. The screenshot above is running at 1080p with Ultra Raytracing quality settings.&lt;/p&gt;
&lt;div class="chart-container" style="position: relative; height: 400px; width: 100%; margin-top: var(--content-gap); margin-bottom: var(--content-gap);"&gt;
 &lt;canvas id="cyberpunk"&gt;&lt;/canvas&gt;
&lt;/div&gt;

&lt;script&gt;
(function() {
 const init = () =&gt; ChartManager.createChart('cyberpunk', 
{
 type: 'bar',
 data: {
 labels: ['Beelink (Linux)', 'Beelink (Windows)', 'ROCK 5B', 'Raspberry Pi 5'],
 datasets: [{
 label: '4K RT Ultra',
 data: [28.45, 27.29, 15.00, 9.37],
 }, {
 label: '1080p RT Ultra',
 data: [26.63, 25.44, 14.74, 8.96],
 }, {
 label: '1080p Low',
 data: [47.49, 50.03, 18.28, 15.86],
 }, {
 label: '720p Low',
 data: [47.53, 49.74, 22.17, 16.35],
 }]
 },
 options: {
 plugins: {
 title: {
 display: true,
 text: 'Cyberpunk 2077 - Average FPS'
 }
 },
 scales: {
 y: {
 title: {
 display: true,
 text: 'FPS'
 }
 }
 }
 }
}
);
 if (document.readyState === 'loading') {
 document.addEventListener('DOMContentLoaded', init);
 } else {
 init();
 }
})();
&lt;/script&gt;

&lt;p&gt;The game is playable on the Beelink machine with some lower settings. Since it&amp;rsquo;s an Intel machine, I also tested the game on Windows for comparison. It&amp;rsquo;s often suggested that Linux gaming can be faster even with all the compatibility layers. That holds at the RT Ultra settings here, where Linux is about 1 FPS ahead, but not at the low settings, where Windows leads by 2-3 FPS.&lt;/p&gt;
&lt;p&gt;These games become heavily CPU-bound on these lower-spec CPUs. I think on a normal gaming PC, it wouldn&amp;rsquo;t matter as much, but every cycle starts to count here, and not all the abstractions provided by WINE are zero cost.&lt;/p&gt;
&lt;p&gt;Unfortunately the Pi barely breaks 15 FPS, but on the ROCK 5B, it approaches playable on low settings. Granted, not sure how fun that would be at 22 FPS.&lt;/p&gt;
&lt;h3 id="doom-the-dark-ages-2025"&gt;Doom: The Dark Ages (2025)&lt;/h3&gt;
&lt;p&gt;This game doesn&amp;rsquo;t run under FEX, so I didn&amp;rsquo;t collect full benchmarks here. The anti-cheat stuff is too weird and doesn&amp;rsquo;t get properly emulated.&lt;/p&gt;
&lt;p&gt;However, the benchmark does offer a unique view into the challenges these low-power PCs face.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Doom: The Dark Ages benchmark" loading="lazy" src="https://scottjg.com/posts/2026-01-08-crappy-computer-showdown/doom-benchmark.jpg"&gt;&lt;/p&gt;
&lt;p&gt;You can see it running on the Beelink here. The GPU is absolutely shredding through the Ultra quality frames at 4K resolution, but the CPU is really struggling. You can see the GPU is able to process almost 90 FPS, but because of the bottleneck at the CPU, the overall frame rate can&amp;rsquo;t break 30 FPS. That&amp;rsquo;s the main challenge here.&lt;/p&gt;
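&lt;p&gt;Here is a deliberately simplified way to think about that bottleneck. It assumes the CPU and GPU work on frames in a pipeline, so whichever stage takes longer per frame sets the pace. The inputs are the approximate figures from the screenshot (about 30 FPS for the CPU, about 90 FPS for the GPU), and the function names are hypothetical.&lt;/p&gt;

```javascript
// Simplified model: the slower of the two stages determines the frame rate.
function effectiveFps(cpuFps, gpuFps) {
  const cpuFrameMs = 1000 / cpuFps; // time the CPU needs to prepare one frame
  const gpuFrameMs = 1000 / gpuFps; // time the GPU needs to render one frame
  return 1000 / Math.max(cpuFrameMs, gpuFrameMs);
}

// Fraction of the time the GPU is actually busy in this model.
function gpuBusyFraction(cpuFps, gpuFps) {
  return effectiveFps(cpuFps, gpuFps) / gpuFps;
}

// Approximate figures from the benchmark screenshot above.
console.log(Math.round(effectiveFps(30, 90)));   // 30
console.log(gpuBusyFraction(30, 90).toFixed(2)); // 0.33
```

&lt;p&gt;Under this model the 5090 sits idle for about two thirds of every frame, waiting on the CPU.&lt;/p&gt;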
&lt;h3 id="alien-isolation-2014"&gt;Alien: Isolation (2014)&lt;/h3&gt;
&lt;p&gt;My next thought was, maybe if we jump back a decade, we can have better luck. This game actually ships with a Linux port. Unfortunately the Linux port doesn&amp;rsquo;t include the built-in benchmark tool, so I ran it under Proton/WINE. I also found that DXVK caused every game from this point onward to crash immediately on the ARM hosts, so I ran the games with &lt;code&gt;PROTON_USE_WINED3D=1&lt;/code&gt; to fall back to the OpenGL renderer.&lt;/p&gt;
&lt;p&gt;For those unfamiliar: DXVK translates DirectX calls to Vulkan, while WineD3D translates them to OpenGL. On ARM, the GPU driver&amp;rsquo;s Vulkan implementation apparently has issues under FEX that its OpenGL implementation avoids. Something to keep in mind if you&amp;rsquo;re trying to replicate this.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of Alien: Isolation running the benchmark" loading="lazy" src="https://scottjg.com/posts/2026-01-08-crappy-computer-showdown/alien-isolation-rpi5.jpg"&gt;&lt;/p&gt;
&lt;p&gt;Honestly, not the best looking game by modern standards, even on Ultra settings. It does have some cool lighting effects, at least. I admit I have never played this game for real, so I can&amp;rsquo;t vouch for it being fun or not. I just ran the benchmark tool.&lt;/p&gt;
&lt;div class="chart-container" style="position: relative; height: 350px; width: 100%; margin-top: var(--content-gap); margin-bottom: var(--content-gap);"&gt;
 &lt;canvas id="alien"&gt;&lt;/canvas&gt;
&lt;/div&gt;

&lt;script&gt;
(function() {
 const init = () =&gt; ChartManager.createChart('alien', 
{
 type: 'bar',
 data: {
 labels: ['Beelink (Linux)', 'Beelink (Windows)', 'ROCK 5B', 'Raspberry Pi 5'],
 datasets: [{
 label: 'Average FPS',
 data: [156.96, 166.95, 23.53, 15.44],
 }, {
 label: 'Max FPS',
 data: [252.27, 231.05, 54.44, 45.51],
 }]
 },
 options: {
 plugins: {
 title: {
 display: true,
 text: 'Alien: Isolation - 1080p Ultra'
 }
 },
 scales: {
 y: {
 title: {
 display: true,
 text: 'FPS'
 }
 }
 }
 }
}
);
 if (document.readyState === 'loading') {
 document.addEventListener('DOMContentLoaded', init);
 } else {
 init();
 }
})();
&lt;/script&gt;

&lt;p&gt;I initially tested this game on the Beelink and thought it looked promising. Relatively low CPU usage. It seems like it is playable on the ROCK 5B with an average 23 FPS. Not sure about the Pi though, at only 15 FPS.&lt;/p&gt;
&lt;h3 id="hitman-absolution-2012"&gt;Hitman: Absolution (2012)&lt;/h3&gt;
&lt;p&gt;OK, OK. So we already know the performance of the Pi is on par with a PC from 2008, so I figured, let&amp;rsquo;s go back a couple more years.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Hitman Absolution Benchmark" loading="lazy" src="https://scottjg.com/posts/2026-01-08-crappy-computer-showdown/hitman-absolution.jpg"&gt;&lt;/p&gt;
&lt;p&gt;Couldn&amp;rsquo;t get the windowed mode to work right on this one, but I swear it&amp;rsquo;s running on the Raspberry Pi 5. You can probably tell from the FPS counter.&lt;/p&gt;
&lt;div class="chart-container" style="position: relative; height: 400px; width: 100%; margin-top: var(--content-gap); margin-bottom: var(--content-gap);"&gt;
 &lt;canvas id="hitman"&gt;&lt;/canvas&gt;
&lt;/div&gt;

&lt;script&gt;
(function() {
 const init = () =&gt; ChartManager.createChart('hitman', 
{
 type: 'bar',
 data: {
 labels: ['Beelink (Linux)', 'Beelink (Windows)', 'ROCK 5B', 'Raspberry Pi 5'],
 datasets: [{
 label: '4K Ultra',
 data: [47.05, 63.8, 8.53, 6.67],
 }, {
 label: '1080p Ultra',
 data: [57.56, 65.2, 8.34, 6.34],
 }, {
 label: '1080p Low',
 data: [62.42, 69.84, 8.95, 7.13],
 }, {
 label: '720p Lowest',
 data: [61.44, 70.27, 8.88, 7.22],
 }]
 },
 options: {
 plugins: {
 title: {
 display: true,
 text: 'Hitman: Absolution - Average FPS'
 }
 },
 scales: {
 y: {
 title: {
 display: true,
 text: 'FPS'
 }
 }
 }
 }
}
);
 if (document.readyState === 'loading') {
 document.addEventListener('DOMContentLoaded', init);
 } else {
 init();
 }
})();
&lt;/script&gt;

&lt;p&gt;I would say the performance makes it basically unusable on these ARM machines.&lt;/p&gt;
&lt;p&gt;That said, the Beelink really shines here. Windows perf is way ahead of Linux on this one too. More than playable on both, though.&lt;/p&gt;
&lt;p&gt;I was actually a little puzzled by this one. It seems like it shouldn&amp;rsquo;t be this bad on the ARM hosts. This feels like a performance bug, but it&amp;rsquo;s hard to say where in the stack it might be. Oh well.&lt;/p&gt;
&lt;h3 id="just-cause-2-demo-2010"&gt;Just Cause 2 Demo (2010)&lt;/h3&gt;
&lt;p&gt;OK, so let&amp;rsquo;s go back &lt;em&gt;another&lt;/em&gt; couple years. This demo was free, thankfully.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Just Cause 2 Benchmark" loading="lazy" src="https://scottjg.com/posts/2026-01-08-crappy-computer-showdown/just-cause-2-rpi5.jpg"&gt;&lt;/p&gt;
&lt;p&gt;So remember earlier when I said I had to disable DXVK for these games to run on ARM? On Intel Windows, I actually had to add DXVK, because without it the game crashed immediately on launch. Weird.&lt;/p&gt;
&lt;div class="chart-container" style="position: relative; height: 400px; width: 100%; margin-top: var(--content-gap); margin-bottom: var(--content-gap);"&gt;
 &lt;canvas id="justcause2"&gt;&lt;/canvas&gt;
&lt;/div&gt;

&lt;script&gt;
(function() {
 const init = () =&gt; ChartManager.createChart('justcause2', 
{
 type: 'bar',
 data: {
 labels: ['Beelink (Linux)', 'Beelink (Windows)', 'ROCK 5B', 'Raspberry Pi 5'],
 datasets: [{
 label: '4K Defaults',
 data: [118.04, 160.02, 27.24, 23.83],
 }, {
 label: '1080p Defaults',
 data: [120.61, 161.27, 26.21, 25.25],
 }, {
 label: '1080p Low',
 data: [128.97, 177.43, 39.51, 38.96],
 }, {
 label: '720p Low',
 data: [129.94, 172.44, 39.47, 39.08],
 }]
 },
 options: {
 plugins: {
 title: {
 display: true,
 text: 'Just Cause 2 - Average FPS'
 }
 },
 scales: {
 y: {
 title: {
 display: true,
 text: 'FPS'
 }
 }
 }
 }
}
);
 if (document.readyState === 'loading') {
 document.addEventListener('DOMContentLoaded', init);
 } else {
 init();
 }
})();
&lt;/script&gt;

&lt;p&gt;Nearly 40 FPS average on a Raspberry Pi 5 at low settings. 2010 is our year! Windows still dominates here. The Beelink&amp;rsquo;s Linux vs Windows comparison is more apples-to-apples now, since both are using DXVK.&lt;/p&gt;
&lt;h3 id="portal-2-2011"&gt;Portal 2 (2011)&lt;/h3&gt;
&lt;p&gt;After I had run all of these, I was curious to try Portal 2. Valve maintains Proton and funds FEX development. You&amp;rsquo;d think they maybe would have optimized them for their own games. It&amp;rsquo;s also old enough that it&amp;rsquo;s in the sweet spot of potentially being playable on the Pi.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Portal 2" loading="lazy" src="https://scottjg.com/posts/2026-01-08-crappy-computer-showdown/portal2-rpi5.jpg"&gt;&lt;/p&gt;
&lt;p&gt;Sadly, Portal 2 does not ship with a built-in benchmark. However, it does have a &lt;code&gt;timedemo&lt;/code&gt; feature where you can record yourself playing and then play it back as a benchmark. I picked a random level, recorded it, and then ran it on the test systems. Since there is a native Linux version, I benchmarked that alongside the Proton/WINE version.&lt;/p&gt;
&lt;div class="chart-container" style="position: relative; height: 400px; width: 100%; margin-top: var(--content-gap); margin-bottom: var(--content-gap);"&gt;
 &lt;canvas id="portal2"&gt;&lt;/canvas&gt;
&lt;/div&gt;

&lt;script&gt;
(function() {
 const init = () =&gt; ChartManager.createChart('portal2', 
{
 type: 'bar',
 data: {
 labels: ['RPi 5 (Linux)', 'RPi 5 (WINE)', 'ROCK 5B (Linux)', 'ROCK 5B (WINE)', 'Beelink (Linux)', 'Beelink (WINE)', 'Beelink (Windows)'],
 datasets: [{
 label: '4K Defaults',
 data: [67.3, 43.5, 84.9, 50.3, 298, 143.6, 183.9],
 }, {
 label: '1080p Defaults',
 data: [66.5, 46.2, 82.8, 50.2, 300, 182, 199.5],
 }]
 },
 options: {
 indexAxis: 'y',
 plugins: {
 title: {
 display: true,
 text: 'Portal 2 - Average FPS'
 }
 },
 scales: {
 x: {
 title: {
 display: true,
 text: 'FPS'
 }
 }
 }
 }
}
);
 if (document.readyState === 'loading') {
 document.addEventListener('DOMContentLoaded', init);
 } else {
 init();
 }
})();
&lt;/script&gt;

&lt;p&gt;So, now that we have a native Linux port to compare with, Linux totally leaves Windows in the dust (finally). Most importantly, the Raspberry Pi 5 can play the native port at 4K resolution at 67 FPS, comfortably above 60.&lt;/p&gt;
&lt;p&gt;So I can now say with a straight face that it&amp;rsquo;s possible to use the Raspberry Pi 5 to game in 4K, admittedly strapped to a GPU that&amp;rsquo;s worth roughly 10x the price of the Pi. In all seriousness, probably any lower-end GPU would work here. Clearly we&amp;rsquo;re not using the 5090 to its full potential anyway.&lt;/p&gt;
&lt;h2 id="power-usage"&gt;Power Usage&lt;/h2&gt;
&lt;p&gt;These machines are also known for their low power draw. For a gaming computer, I&amp;rsquo;m not sure how much that matters, since you can just turn it off when you&amp;rsquo;re not using it. That said, a gaming PC CPU could use 20-50W while completely idle.&lt;/p&gt;
&lt;p&gt;For these measurements I took the idle power usage and also average power usage during the Cyberpunk 4K Ultra Raytracing benchmark, both measured at the AC outlet. This covers only the computer itself, not the GPU (which has its own power supply), since the CPU side is what we&amp;rsquo;re really comparing here.&lt;/p&gt;
&lt;div class="chart-container" style="position: relative; height: 300px; width: 100%; margin-top: var(--content-gap); margin-bottom: var(--content-gap);"&gt;
 &lt;canvas id="power"&gt;&lt;/canvas&gt;
&lt;/div&gt;

&lt;script&gt;
(function() {
 const init = () =&gt; ChartManager.createChart('power', 
{
 type: 'bar',
 data: {
 labels: ['Raspberry Pi 5', 'ROCK 5B', 'Beelink'],
 datasets: [{
 label: 'Idle (W)',
 data: [4.5, 5.9, 9],
 }, {
 label: 'Benchmark Avg (W)',
 data: [8.95, 12.04, 28.41],
 }]
 },
 options: {
 plugins: {
 title: {
 display: true,
 text: 'Power Usage (CPU only, measured at plug)'
 }
 },
 scales: {
 y: {
 title: {
 display: true,
 text: 'Watts'
 }
 }
 }
 }
}
);
 if (document.readyState === 'loading') {
 document.addEventListener('DOMContentLoaded', init);
 } else {
 init();
 }
})();
&lt;/script&gt;

&lt;p&gt;The Pi 5 sips power at under 9W even under load, while the Beelink pulls almost 30W during the benchmark. One way of looking at it is that the Beelink is much faster in games and its power draw scales roughly in proportion: it uses about 3x the power of the Pi 5 and delivers about 3x the frame rate in this benchmark.&lt;/p&gt;
&lt;p&gt;Another way to look at it is that if the ARM-based machines weren&amp;rsquo;t mired in emulating x86, they would probably have considerably better performance on a per-watt basis than the Intel CPU.&lt;/p&gt;
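&lt;p&gt;As a rough sanity check, here is a sketch that divides the Cyberpunk 4K RT Ultra frame rates by the benchmark power draw from the charts above. The Beelink figure uses its Linux result, and the wattage covers only the host computer, not the GPU, so this is a loose comparison rather than a real efficiency measurement. The names are made up for the example.&lt;/p&gt;

```javascript
// Cyberpunk 4K RT Ultra average FPS and benchmark power draw, from the charts.
// Power is the host computer only; the GPU has its own power supply.
const results = [
  { machine: 'Raspberry Pi 5', fps: 9.37, watts: 8.95 },
  { machine: 'ROCK 5B', fps: 15.00, watts: 12.04 },
  { machine: 'Beelink', fps: 28.45, watts: 28.41 }, // Linux result
];

// Hypothetical helper: frames per second delivered per watt of host power.
function fpsPerWatt(result) {
  return result.fps / result.watts;
}

for (const r of results) {
  console.log(r.machine + ': ' + fpsPerWatt(r).toFixed(2) + ' FPS/W');
}
// Raspberry Pi 5: 1.05 FPS/W
// ROCK 5B: 1.25 FPS/W
// Beelink: 1.00 FPS/W
```

&lt;p&gt;Even while paying the FEX tax, the ARM boards come out about even or slightly ahead per watt in this one benchmark.&lt;/p&gt;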
&lt;h2 id="conclusion"&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;So, can you game on a Raspberry Pi 5 with an RTX 5090? I guess, technically, yes. Would you want to? Probably not.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Modern games (2020+):&lt;/strong&gt; Most likely unplayable. The CPU perf degradation under FEX is brutal. Even playing on the lowest 720p settings, Cyberpunk barely hits 16 FPS average on the Pi 5.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;2010-era games:&lt;/strong&gt; If you&amp;rsquo;re trying to play older games, you can probably get away with it. You also probably do not need a graphics card as powerful as the 5090.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The Beelink is the clear winner if you actually want to game. It&amp;rsquo;s still terrible, but it&amp;rsquo;s cheap, runs x86 natively, and with the right settings, it can hit around 50 FPS or more in every game I benchmarked. Windows outperformed Linux in most of the WINE/Proton tests, so you&amp;rsquo;re probably better off just installing Windows on it.&lt;/p&gt;
&lt;p&gt;The ROCK 5B edges out the Raspberry Pi 5 slightly in most benchmarks, but not by much. The extra cores and PCIe bandwidth don&amp;rsquo;t seem to matter as much as the raw performance lost to FEX emulation. That said, it does take some games from painfully slow to borderline playable.&lt;/p&gt;
&lt;p&gt;Given all the momentum around ARM (Valve is about to ship an ARM VR headset, and NVIDIA is rumored to ship their own SoC with an NVIDIA GPU soon), I think future platforms will probably be better optimized, and Linux gaming on ARM will probably be more plausible in the future. Sadly, I don&amp;rsquo;t recommend strapping your super expensive graphics card to a cheap SBC for now. Unless it&amp;rsquo;s just for a fun blog post.&lt;/p&gt;</description></item></channel></rss>