Lightweight Kernel Isolation with Virtualization and VM Functions

Vikram Narayanan  
University of California, Irvine

Yongzhe Huang  
Pennsylvania State University

Gang Tan  
Pennsylvania State University

Trent Jaeger  
Pennsylvania State University

Anton Burtsev  
University of California, Irvine

Abstract

Commodity operating systems execute core kernel subsystems in a single address space along with hundreds of dynamically loaded extensions and device drivers. Lack of isolation within the kernel implies that a vulnerability in any of the kernel subsystems or device drivers opens a way to mount a successful attack on the entire kernel.

Historically, isolation within the kernel remained prohibitive due to the high cost of hardware isolation primitives. Recent CPUs, however, bring a new set of mechanisms. Extended page-table (EPT) switching with VM functions and memory protection keys (MPKs) provide memory isolation and invocations across boundaries of protection domains with overheads comparable to system calls. Unfortunately, neither MPKs nor EPT switching provide architectural support for isolation of privileged ring 0 kernel code, i.e., control of privileged instructions and well-defined entry points to securely restore state of the system on transition between isolated domains.

Our work develops a collection of techniques for lightweight isolation of privileged kernel code. To control execution of privileged instructions, we rely on a minimal hypervisor that transparently deprivileges the system into a non-root VT-x guest. We develop a new isolation boundary that leverages extended page table (EPT) switching with the VMFUNC instruction. We define a set of invariants that allows us to isolate kernel components in the face of an intricate execution model of the kernel, e.g., provide isolation of preemptable, concurrent interrupt handlers. To minimize overheads of virtualization, we develop support for exitless interrupt delivery across isolated domains. We evaluate our approach by developing isolated versions of several device drivers in the Linux kernel.

## 1 Introduction

Despite many arguments for running kernel subsystems in separate protection domains over the years, commodity operating systems remain monolithic. Today, the lack of isolation within the operating system kernel is one of the main factors undermining its security. While the core kernel is relatively stable, the number of kernel extensions and device drivers is growing with every hardware generation (a modern Linux kernel contains around 8,867 device drivers [3], with around 80-130 drivers running on a typical system). Developed by third party vendors that often have an incomplete understanding of the kernel programming and security idioms, kernel extensions and device drivers are a primary source of vulnerabilities in the kernel [7, 17]. While modern kernels deploy a number of security mechanisms to protect their execution, e.g., stack canaries [20], address space randomization (ASLR) [49], data execution prevention (DEP) [88], superuser-mode execution and access prevention [19, 28], a large fraction of vulnerabilities remains exploitable. Even advanced defense mechanisms like code pointer integrity (CPI) [2, 57] and safe stacks [16] that are starting to make their way into the mainstream kernels remain subject to data-only attacks that become practical when combined with automated attack generation tools [54, 93]. Lack of isolation within the kernel implies that a vulnerability in any of the kernel subsystems creates an opportunity for an attack on the entire kernel.

Unfortunately, introducing isolation in a modern kernel is hard. The emergence of sub-microsecond 40-100Gbps network interfaces [66] and low-latency non-volatile PCIe-attached storage pushed modern kernels to support I/O system calls and device drivers capable of operating with latencies of low thousands of cycles. Minimal cycle budgets put strict requirements on the overheads of isolation solutions. For decades the only two isolation mechanisms exposed by commodity x86 CPUs were segmentation and paging. Segmentation was deprecated when the x86 architecture moved from 32-bit to 64-bit addressing mode. On modern machines with recently introduced tagged TLBs, a carefully optimized page-based inter-process communication mechanism requires 952 cycles for a cross-domain invocation [4]. With two domain crossings on the network transmission path, traditional page-based isolation solutions [83, 84] result in prohibitive overheads on modern systems.

Today, the landscape of isolation solutions is starting to change with the emergence of new hardware isolation primitives. VM function extended page-table (EPT) switching and memory protection keys (MPKs) provide support for memory isolation and cross-domain invocations with overheads comparable to system calls [45, 51, 62, 68]. Unfortunately, neither MPKs nor EPT switching implement architectural support for isolation of privileged ring 0 kernel code, the code that runs with superuser privileges and can easily escape such isolation by accessing a wide range of privileged CPU instructions. Traditionally, to control execution of privileged instructions, isolation of kernel subsystems requires an exit into ring 3 [6, 13, 22, 24, 29, 35, 36, 44, 55, 76, 85, 89, 92]. The change of the privilege level, however, can incur 0.35-6.3x overhead for relatively lightweight VMFUNC and MPK based isolation techniques.

Our work on Lightweight Virtualized Domains (LVDs) develops new mechanisms for isolation of privileged kernel code through a combination of hardware-assisted virtualization and EPT switching. First, to control execution of privileged instructions without requiring a change of the privilege level, we execute the system under control of a minimal late-launch hypervisor. When the isolated subsystem is loaded into the kernel we transparently deprivilege the system into a non-root VT-x guest. Effectively, we trade the cost of changing the privilege level on cross-domain invocations for exits into the hypervisor on execution of privileged instructions. We demonstrate that for I/O intensive workloads that pose the most demanding requirements on the cost of the isolation mechanisms this tradeoff is justified: the exits caused by privileged instructions are much less frequent compared to the number of crossings of isolation boundaries (Section 5).

Second, to protect the state of the system, we develop a new isolation boundary that leverages extended page tables (EPTs) and EPT switching with the VMFUNC instruction. VMFUNC allows the VT-x guest to change EPT mappings, and hence change the view of all accessible memory with a single instruction that takes 109-147 cycles [45, 56, 68]. Several projects explore the use of VMFUNC for isolation of code within user programs [45, 63], implementation of microkernel processes [68], and protecting legacy systems against the Meltdown speculative execution attack [51]. In contrast to previous work, we develop techniques for enforcing isolation of kernel subsystems. Kernel subsystems, e.g., device drivers, adhere to a complex execution model of the kernel, i.e., they run in the context of user and kernel threads, interrupts, and bottom-half software IRQs. Most driver code (including interrupt handlers) runs on multiple CPUs and is fully reentrant, i.e., it runs with interrupts enabled, calls back into the kernel, and can yield execution. VMFUNC does not provide architectural support for protecting the state of the system upon crossing boundaries of isolated domains. We define a set of security invariants and develop a collection of mechanisms that allow us to retain isolation in a complex execution environment of the kernel.

Finally, to minimize overheads introduced by running the system as a non-root VT-x guest, we develop support for exitless interrupt delivery. Even though modern device drivers use interrupt coalescing and polling, interrupt exits can be a major source of overhead in virtualized environments [39, 86]. We develop a collection of new mechanisms that allow us to establish correct state of the system while handling an interrupt in a potentially untrusted state inside an isolated domain and without exiting into the hypervisor.

To evaluate practicality of our approach, we isolate several performance-critical device drivers of the Linux kernel. In general, irrespective of isolation boundary, isolation of kernel code is challenging as it cuts through a network of data and control flow dependencies in a complex, feature-rich system [72, 84]. To isolate device drivers, we leverage an existing device driver isolation framework, LXDs [72]. We evaluate overheads of LVDs on software-only network and NVMe block drivers, and on the 10Gbps Intel Ixgbe network driver.

We argue that our work—a practical, lightweight isolation boundary that supports isolation of kernel code without breaking its execution model—makes another step towards enabling isolation as a first-class abstraction in modern operating system kernels. Our isolation mechanisms can be implemented either as a loadable late-launch hypervisor that transparently provides isolation for a native non-virtualized system, or as a set of hypervisor extensions that enable isolation of kernel code in a virtualized environment. While we utilize EPTs for memory isolation, we argue that our techniques—control over privileged instructions, secure state saving, and exitless interrupts—are general and can be applied to other isolation mechanisms, for example MPK.

## 2 Background and Motivation

Historically, two factors shape the landscape of in-kernel isolation: the availability of hardware isolation mechanisms, and the complexity of decomposing existing shared-memory kernel code.

### 2.1 Isolation Mechanisms and Overheads

**Segmentation and paging** For decades the only two isolation mechanisms exposed by commodity x86 CPUs were segmentation and paging. Segmentation was demonstrated as a low-overhead isolation mechanism by the pioneering work on the L4 microkernel [60]. Unfortunately, segmentation was deprecated when the x86 architecture moved from 32-bit to 64-bit addressing mode. On modern machines with recently introduced tagged TLBs, a carefully optimized page-based isolation mechanism requires 952 cycles for a minimal cross-domain invocation [4] (ironically, the cost of the context switch is growing over the years [25]). With two domain crossings on the network transmission path, page-based isolation solutions like Nooks [84] would introduce an overhead of more than 72%. Less optimized approaches like SIDE [83] that rely on a page fault to detect cross-domain access in addition to a page table switch would result in a 2x slowdown.

**Cache-coherent cross-core invocations** With the commoditization of multi-core CPUs, multiple systems suggested cross-core invocations for acceleration of system calls [82] and cross-domain invocations [8, 50, 72]. Although faster than address space switches, cross-core invocations are still expensive. A minimal call/reply invocation requires four cache-line transactions [72] each taking 109-400 cycles [21, 70, 71] depending on whether the line is transferred between the cores of the same socket or over a cross-socket link. Hence the whole call/reply invocation takes 448-1988 cycles [72]. More importantly, during the cross-core invocation, the caller core has to wait for the reply from the callee core. At this point, two cores are involved in the cross-domain invocation, but the caller core is wasting time constantly checking for the reply in a tight loop. Asynchronous runtimes like LXDs [72] and AC [43] provide a way to utilize the caller core by performing a lightweight context switch to another asynchronous thread. Unfortunately, exploiting asynchrony is hard: kernel queues are often short and overheads of creating and joining asynchronous threads add up quickly. Overall, cross-core isolation achieves acceptable overhead (12-18% for isolated 10Gbps network drivers) but at the cost of additional cores [72].

**Memory Protection Keys (MPKs)** Recent Intel CPUs introduced memory protection keys (MPKs) to provide fine-grained isolation within a single address space by tagging each page with a 4-bit protection key in the page table entry. A special register, pkru, holds the current protection key. A read or write access to a page is allowed only if the value of the pkru register matches the tag of the page. Crossing between protection domains is performed by writing a new tag value into the pkru register, which is a fast operation taking 20-26 cycles [45, 74]. However, several challenges complicate the use of MPKs for isolating kernel code. First, the 4-bit protection keys are interpreted by the CPU only for user-accessible pages, i.e., the page table entries with the “user” bit set. It is possible to map the kernel pages of an isolated subsystem as user-accessible, but additional measures have to be taken to protect the isolated kernel code from user accesses through either a page table switch [40] (which is expensive), or MPKs themselves. The reliance on MPKs requires either binary rewriting of all user applications [87] (which in general is undecidable [45]), or dynamic validation of all wrpkru instructions with hardware breakpoints [45] (which can also accumulate significant overhead). Second, the isolation of the kernel code requires careful handling of privileged instructions, e.g., updates of control and segment registers. In turn, this requires one of three approaches: exiting into privilege level 3 (which can be done with an overhead of a system call), compile-time or load-time binary rewriting of all privileged instructions (which becomes challenging in light of possible control-flow attacks), or executing the system as a non-root VT-x guest, which requires the techniques developed in this work.
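The permission check described above can be sketched as a small model. This is an illustrative simplification, not code from the paper: on Intel CPUs the protection-key rights register (PKRU) holds two bits per key, an access-disable bit and a write-disable bit, and the 4-bit key in the page table entry selects which pair applies. The helper `pkru_for_domain` is a hypothetical name introduced here to show how a domain switch could be expressed as a single register value.

```python
# Simplified model of the PKRU permission check for user-accessible pages.
# For protection key k, bit 2*k of PKRU is "access disable" (AD) and
# bit 2*k + 1 is "write disable" (WD).

NUM_KEYS = 16  # a 4-bit key in the page table entry selects one of 16 keys


def pkru_allows(pkru: int, key: int, write: bool) -> bool:
    """Return True if a data access to a page tagged with `key` is permitted."""
    if not 0 <= key < NUM_KEYS:
        raise ValueError("protection key must fit in 4 bits")
    access_disabled = (pkru >> (2 * key)) & 1
    write_disabled = (pkru >> (2 * key + 1)) & 1
    if access_disabled:
        return False
    if write and write_disabled:
        return False
    return True


def pkru_for_domain(allowed_key: int) -> int:
    """Hypothetical helper: a PKRU value that disables every key except one."""
    pkru = 0
    for key in range(NUM_KEYS):
        if key != allowed_key:
            pkru |= 1 << (2 * key)  # set AD for all other keys
    return pkru
```

Under this model, crossing into a domain amounts to loading `pkru_for_domain(k)`, which is why any code able to write the register must be validated or rewritten.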

**Extended Page Table switching with VM functions** EPTP switching via the VMFUNC instruction is another new hardware mechanism in Intel CPUs that enables a virtual machine guest to change the root of the extended page table (EPT) by reloading it with one of a set of values preconfigured by the hypervisor. VMFUNC allows the guest to change EPT mappings, and hence change the boundaries of a protection domain, with a single instruction that takes 109-147 cycles [45, 56, 68]. Compared with MPKs, EPT switching does not require exits into ring 3, binary rewriting, or validation of VMFUNC instructions for isolation of privileged kernel code: all sensitive state can be protected by the hypervisor through construction of non-overlapping address spaces [62] (we describe details of isolating privileged ring 0 code in Section 3.3). Several projects explore the use of VMFUNC for isolation of user programs [45, 63], addressing speculative execution attacks [51], and implementing fast microkernel IPC [68]. LVDs extend VMFUNC-based solutions with support for isolation of privileged kernel code and isolation invariants in the face of fast exitless interrupts.

**Software fault isolation (SFI) and MPX** Software fault isolation (SFI) allows enforcing isolation boundaries in software without relying on hardware protection [91]. XFI [26], BGI [14], and LXFI [64] apply SFI for isolation of kernel modules in Windows [14, 26] and Linux [64] kernels. LXFI [64] saturates a 1Gbps network adapter for TCP connections, but results in a 2.2-3.7x higher CPU utilization (UDP throughput drops by 30%). On modern network and storage interfaces, an increase in CPU utilization will likely result in a proportional drop in performance. Recent implementations of SFI, e.g., MemSentry [56], rely on Intel Memory Protection Extensions (MPX), a set of architectural extensions that provide support for bounds checking in hardware, to accelerate bounds checks. Nevertheless, the overhead of MPX-based SFI remains high: on a CPU-bound workload, NginX experiences a 30% slowdown [87]. Moreover, additional control flow enforcement mechanisms [65, 81, 94] are required to secure SFI in the face of control-flow attacks (such mechanisms will result in additional overhead).

### 2.2 Complexity of decomposition

**Clean slate designs** Representing one side of the isolation spectrum, microkernel projects develop kernel subsystems and device drivers from scratch [8–10, 27, 30, 37, 46–48, 52, 53, 61]. Engineered to run in isolation, microkernel drivers synchronize their state via explicit messages or cross-domain invocations. To assist isolated development, microkernels typically rely on interface definition languages (IDLs) [23, 38, 42] to generate caller and callee stubs and message dispatch loops. Unfortunately, clean slate device driver development requires a large engineering effort that also negates decades of work aimed at improving reliability and security of the kernel code.

**Device driver frameworks and VMs** More practical strategies for isolating parts of the kernel are device driver frameworks and virtualized environments that provide a backward compatible execution environment for the isolated code [6, 13, 22, 24, 29, 35, 36, 44, 55, 76, 85, 89, 92]. While requiring less effort compared to re-writing device drivers from scratch, development of a backward compatible driver execution environment is still a large effort. Outside of several self-contained device driver frameworks, e.g., IOKit in MacOS [59], device drivers rely on a broad collection of kernel functions that range from memory allocation to specialized subsystem-specific helpers.

Alternatively, an unmodified device driver can execute inside a complete copy of the kernel running on top of a virtual machine monitor [11, 12, 31, 33, 58, 73, 79]. Unfortunately, a virtualized kernel extends the driver execution environment with multiple software layers, e.g., interrupt handling, thread scheduling, context-switching, memory management, etc. These layers introduce overheads of tens of thousands of cycles on the critical data-path of the isolated driver, and provide a large attack surface.

**Backward compatible code isolation** SawMill [34] was probably the first system aiming at development of in-kernel isolation mechanisms for isolation of unmodified kernel code. SawMill relied on the Flick IDL [23] for communication with isolated subsystems; hence, isolation required re-implementation of all interfaces. Nooks developed a framework for isolating Linux kernel device drivers into separate protection domains [84]. Nooks maintained and synchronized private copies of kernel objects between the kernel and the isolated driver; however, the synchronization code had to be developed manually. Nooks’ successors, Decaf [77] and Microdrivers [32], developed static analysis techniques [77] to generate synchronization glue code directly from the kernel source. Wahbe et al. [91] and later XFI [26] and BGI [14] relied on SFI to isolate kernel extensions but were not capable of handling the semantically rich boundary between the isolated subsystem and the kernel. LXFI [64] extended previous SFI approaches with support for explicit, fine-grained policies that control access to all data structures shared between the isolated subsystem and the kernel. Conceptually, LXFI’s policies serve the same goal as projections in LXDs [72] (which we use in this work): they are designed to control the access of an isolated subsystem to a specific subset of kernel objects.

## 3 LVDs Architecture

LVDs are designed to block an adversary who discovers an exploitable vulnerability in one of the kernel subsystems from attacking the rest of the kernel, i.e., gaining additional privileges by reading kernel data structures or code, hijacking control flow, or overwriting sensitive kernel objects.

Similar to prior work, LXFI [64], LVDs aim to enforce 1) **data structure safety**, i.e., the isolated driver can only read and write a well-defined subset of objects and their fields that are required for the driver to function (effectively we enforce least privilege [80]), 2) **data structure integrity** [64], i.e., the isolated driver cannot change pointers used by the kernel or the types of objects referenced by those pointers, and 3) **function call integrity** [64], i.e., the isolated code a) can only invoke a well-defined subset of kernel functions and pass legitimate pointers to objects they “own” as arguments, and b) cannot trick the kernel into invocation of an unsafe function pointer registered as part of the driver interface.

Ensuring that an isolation mechanism achieves these goals for complex kernel subsystems like device drivers is challenging. Despite many advances in the modularity of modern kernels, device drivers interact with the kernel through a web of functions and data structures. While device drivers expose a well-defined interface to the kernel (a collection of function pointers that implement the driver’s interface), each driver relies on thousands of helper functions exported by the kernel (a typical driver imports 110-200 functions) that often have deep call graphs.

**Threat model** We assume a powerful adversary that has full control over the isolated subsystem (its memory, CPU register state, and control flow). Specifically, an attacker can make up cross-domain invocations with any arguments, attempt to read and write CPU registers, try accessing hardware interfaces, and trigger interrupts. We trust that attacks will not originate from the kernel domain. While LVDs can detect denial of service attacks, we leave efficient handling of driver restart for future work. Also, while LVDs provide a least-privileged isolation boundary and block trivial Iago-style attacks [15], e.g., an isolated driver cannot return a rogue pointer to the kernel, we leave complete analysis of the feasibility to construct an arbitrary computation (e.g., to overwrite sensitive kernel data structures like page tables) in the kernel by modifying shared objects, passing or returning values to and from cross-domain invocations, etc., as future work. Finally, speculative execution and side channel attacks are out of scope of this work as well.

### 3.1 Overview of the LVDs Architecture

LVDs utilize hardware-assisted virtualization for isolation and control of privileged instructions inside isolated domains (Section 4). We execute the system under control of a minimal late-launch hypervisor that transparently demotes the system into a non-root VT-x guest right before it loads the first isolated subsystem (Figure 1, 4). Specifically, we leverage a modified version of the Bareflank [1] hypervisor, which is loaded as a kernel module and pushes the system into VT-x non-root mode by creating a virtual-machine control structure (VMCS) and a hierarchy of per-CPU extended page tables. The hypervisor remains transparent to the monolithic kernel, i.e., all exceptions and interrupts are delivered directly to the demoted kernel through the original kernel interrupt descriptor table (IDT). The demoted kernel can access the entire physical memory and I/O regions via the one-to-one mappings in the EPT.

LVDs run as a collection of isolated domains managed by a small kernel module that exposes an interface of a microkernel to the isolated domains (Figure 1, 3). When a new isolated driver is created, the microkernel module creates a new EPT (EPT_I) that maps physical addresses of the driver domain. Upon cross-domain invocation the VMFUNC instruction switches between the kernel's EPT (EPT_K) and EPT_I (we discuss details of our implementation below in Section 3.3).

### 3.2 Device Driver Isolation

Isolation of kernel code requires analyzing all driver dependencies, deciding the cut between the driver and the kernel, and providing mechanisms for cross-domain calls and secure synchronization of data structures that are no longer shared between the isolated subsystems. LVDs rely on the LXD decomposition framework [72] that includes an interface definition language (IDL) for specifying the interface between kernel modules and generating code for synchronizing the hierarchies of data structures across isolated subsystems.

In LXDs, isolated subsystems do not share any state that might break isolation guarantees, e.g., pointers, indexes into memory buffers, etc. Each isolated subsystem maintains a private copy of each kernel object. To support synchronization of object hierarchies across domains, the IDL provides the mechanism of projections that describe how objects are marshaled across domains. A projection explicitly defines a subset of fields of a data structure that is synchronized upon domain invocation (hence, defining how a data structure is projected into another domain).

Definitions of cross-domain invocations can take projections as arguments. Passed as an argument, a projection grants another domain a right to access a specific object, i.e., synchronize a subset of object’s fields described by the projection. LXDs rely on the idea of capabilities that is similar to object capability languages [67, 69], where capabilities are unforgeable cross-domain object references. The IDL generates the code to reflect the capability “grant” operation by inserting an entry in a capability address space, CSpace, the data structure that links capabilities to actual data structures. The capability itself is an opaque number that has no meaning outside of a specific CSpace. Projections, therefore, define the minimal set of objects and their fields accessible to another domain. As projections may define pointers to other projections, LXDs provide a way to synchronize hierarchies of objects. Finally, the IDL provides a way to define remote procedure calls specifying all functions accessible across the isolation boundary.
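To make the interplay of projections and capabilities concrete, the following toy model may help. It is only a sketch under stated assumptions: the class and function names (`Projection`, `CSpace`, `marshal`, `sync_into`) are hypothetical and do not reproduce the LXDs IDL or its generated glue code, and objects are represented as plain dictionaries.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Projection:
    """Names the subset of an object's fields that may cross the boundary."""
    type_name: str
    fields: tuple


class CSpace:
    """Per-domain table linking opaque capability numbers to local objects."""

    def __init__(self):
        self._slots = {}
        self._next = count(1)

    def grant(self, obj) -> int:
        """Insert an object and return an opaque number naming it."""
        cap = next(self._next)
        self._slots[cap] = obj
        return cap

    def lookup(self, cap: int):
        """Resolve a capability; numbers from other CSpaces are meaningless."""
        if cap not in self._slots:
            raise PermissionError(f"capability {cap} is not valid in this CSpace")
        return self._slots[cap]


def marshal(obj: dict, proj: Projection) -> dict:
    """Copy only the projected fields; everything else stays private."""
    return {name: obj[name] for name in proj.fields}


def sync_into(private_copy: dict, message: dict, proj: Projection) -> None:
    """Update the receiver's private copy with the projected fields only."""
    for name in proj.fields:
        private_copy[name] = message[name]
```

In this model the driver never sees a kernel pointer: it receives a capability number, resolves it in its own CSpace, and observes only the fields listed in the projection.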

LXDs provide a backward-compatible execution environment capable of executing unmodified device drivers (Figure 1, 5). Inside the isolated driver, LXDs provide: 1) the glue code generated by the IDL compiler that implements marshaling and synchronization of objects (Figure 1, 6), and 2) a minimal library that provides common utility functions compatible with the non-isolated kernel (Figure 1, 7), i.e., memory management, common utilities like memcpy(), and a collection of functions to interface with the microkernel, e.g., capability management, debugging, etc. To ensure that the legacy, non-decomposed kernel can communicate with an isolated driver, a layer of synchronization glue code is used on the kernel side (Figure 1, 1).

### 3.3 Lightweight Isolation with VMFUNC

VMFUNC is a machine instruction available in recent Intel CPUs that allows a non-root VT-x guest to switch the root of the extended page table (EPT) to one of a set of pre-configured EPT pointers, thus changing the entire view of accessible memory. To enable EPT switching, the hypervisor configures a table of possible EPT pointers. A non-root guest can freely invoke VMFUNC at any privilege level and select the active EPT by choosing it from the EPT table. Immediately after the switch, all guest physical addresses (GPAs) are translated to host-physical addresses (HPAs) through the new EPT. As EPTs support TLB tagging (virtual processor identifiers (VPIDs)), the VMFUNC instruction is fast (Section 5).

**Isolation with EPTs** Lightweight EPT switching allows for a conceptually simple isolation approach. We create two EPTs that map disjoint subsets of machine pages isolating the address spaces of mistrusting domains. To switch between the address spaces, a call-gate page with the VMFUNC instruction is mapped in both EPTs. This straightforward approach however requires a range of careful design decisions to ensure security of the isolation boundary.
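The two-EPT arrangement can be illustrated with a toy model. This sketch makes several assumptions: EPTs are flat dictionaries from guest-physical pages to host-physical pages, the class `Cpu` and the helper `build_epts` are hypothetical names, and all addresses are invented for illustration.

```python
PAGE = 0x1000


class Cpu:
    """Toy model of a non-root guest CPU with a hypervisor-configured EPT list."""

    def __init__(self, ept_list):
        self.ept_list = ept_list  # index -> {guest-physical page: host-physical page}
        self.active = 0           # index of the currently active EPT

    def vmfunc(self, index: int) -> None:
        """Model of EPTP switching: select another preconfigured EPT."""
        if not 0 <= index < len(self.ept_list):
            raise RuntimeError("VM exit: invalid EPT index")
        self.active = index

    def translate(self, gpa: int) -> int:
        """Translate a guest-physical address through the active EPT."""
        page, offset = gpa & ~(PAGE - 1), gpa & (PAGE - 1)
        ept = self.ept_list[self.active]
        if page not in ept:
            raise MemoryError(f"EPT violation at GPA {gpa:#x}")
        return ept[page] + offset


def build_epts(kernel_pages: dict, driver_pages: dict, gate: tuple) -> list:
    """Two EPTs over disjoint machine pages, sharing only the call-gate page."""
    if set(kernel_pages.values()) & set(driver_pages.values()):
        raise ValueError("domains must map disjoint machine pages")
    gate_gpa, gate_hpa = gate
    ept_k = dict(kernel_pages)
    ept_i = dict(driver_pages)
    ept_k[gate_gpa] = gate_hpa
    ept_i[gate_gpa] = gate_hpa
    return [ept_k, ept_i]
```

After `vmfunc(1)` the call-gate page still translates, so execution can continue, while every kernel-only page faults.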

## 4 Enforcing Isolation

In contrast to traditional privilege transition mechanisms, e.g., interrupts, system call instructions, and VT-x entry/exit transitions, VMFUNC provides no support for entering an isolated domain at a predefined entry point. The next instruction after the VMFUNC executes with the memory rights of another domain. The cross-domain invocation mechanisms, however, must ensure that transition is safe, i.e., all possible VMFUNC invocations lead to a set of well defined entry points in the kernel, and the kernel can securely restore its state.

**Safety of the VMFUNC instructions** By subverting the control flow inside an isolated domain, an attacker can potentially find executable byte sequences that form a valid VMFUNC instruction. If the virtual address next after the VMFUNC instruction is mapped in the address space of another domain, an attacker can escape the isolation boundary. Two possible approaches to prevent such attacks are to: 1) ensure that virtual address spaces across isolated domains are not overlapping [62], or 2) ensure that no sequences of executable bytes can form a valid VMFUNC instruction [56, 68, 90]. Inspired by ERIM [87], SkyBridge [68] relies on scanning and rewriting executable space of the program to ensure that no byte sequences form valid VMFUNC instructions. In the case of LVDs, the attack surface for preventing unsafe VMFUNC instructions expands into user applications, i.e., any user program in the system can invoke a VMFUNC instruction triggering a switch into the isolated device driver. In the face of dynamic code compilation, program rewriting [87] requires a large TCB with a large attack surface. We therefore choose the memory isolation approach similar to SeCage [62]. Specifically, we enforce the following invariant:

**Inv 1.** Virtual address spaces of isolated domains, kernel, and user processes do not overlap.

This invariant ensures that if an isolated domain or a user process invokes a self-prepared VMFUNC instruction anywhere in its address space, the next instruction after the VMFUNC causes a page fault.
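A checker for Inv 1 can be sketched as an interval-overlap test over the virtual ranges of all domains. The function name `find_overlaps` and the tuple layout are assumptions made for illustration; the paper does not describe how, or whether, such a check is implemented.

```python
def find_overlaps(regions):
    """Return pairs of owners whose half-open [start, end) virtual ranges overlap.

    `regions` is a list of (owner, start, end) tuples.
    """
    ordered = sorted(regions, key=lambda r: r[1])
    overlaps = []
    for i, (owner_a, _start_a, end_a) in enumerate(ordered):
        for owner_b, start_b, _end_b in ordered[i + 1:]:
            if start_b >= end_a:
                break  # sorted by start: no later range can overlap this one
            if owner_a != owner_b:
                overlaps.append((owner_a, owner_b))
    return overlaps
```

An empty result means that an address following a stray VMFUNC in one domain is never mapped in another, which is the property the invariant relies on.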

**Locking the LVD’s address space** To maintain Inv 1, we have to ensure that isolated domains cannot modify the layout of their address space, or specifically:

**Inv 2.** Isolated domains have read-only access to their page table.

This is challenging: isolated subsystems run in ring 0 and have privileges to change their page tables. It is possible to map all pages of the page table as read-only in the EPT of the isolated domain. While this ensures that the domain’s page table hierarchy cannot be modified, it also causes a prohibitive number of VT-x exits when the CPU tries to update the dirty and accessed bits in the page table of the isolated driver [51]. We therefore employ a technique similar to EPTI [51] and map all the physical pages of the page table as read-only in the leaf entries of the guest page table (Figure 2). That is, all virtual addresses that point into the pages of the page table have only the read permission. At the same time, the pages that contain the page table itself are mapped with write permissions in the EPT. This way the CPU can access pages of the page table and update accessed and dirty bits without causing an exit.

The following design allows us to avoid modifications to the read-only page table inside the LVD. We create a large virtual address space when the LVD starts, i.e., create a page table that maps guest virtual pages to guest physical pages. Initially, these guest physical pages are not backed by real host physical pages. We never update the LVD’s page table afterwards. Instead, we allocate host physical pages and update the EPT mappings to map these pages into guest physical addresses already mapped by the read-only page table.
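The lazy-backing scheme can be modeled in a few lines. This is a toy sketch, not the authors' implementation: the class `LvdAddressSpace` and its methods are hypothetical, page tables are flat dictionaries, and the addresses in any usage are invented.

```python
PAGE = 0x1000


class LvdAddressSpace:
    """Toy model: a fixed guest page table backed lazily through the EPT."""

    def __init__(self, gva_base: int, gpa_base: int, num_pages: int):
        # Built once at LVD start and never modified afterwards (Inv 2).
        self.page_table = {
            gva_base + i * PAGE: gpa_base + i * PAGE for i in range(num_pages)
        }
        self.ept = {}  # GPA page -> HPA page, updated only by trusted code

    def back_page(self, gva: int, hpa: int) -> None:
        """Trusted side: give an already-mapped virtual page real memory."""
        gpa = self.page_table[gva]  # the guest page table is only read
        self.ept[gpa] = hpa         # only the EPT changes

    def translate(self, gva: int) -> int:
        """Walk GVA -> GPA -> HPA; fail if the GPA has no backing page yet."""
        page, offset = gva & ~(PAGE - 1), gva & (PAGE - 1)
        gpa = self.page_table[page]
        if gpa not in self.ept:
            raise MemoryError(f"EPT violation: GPA {gpa:#x} has no backing page")
        return self.ept[gpa] + offset
```

Memory allocation for the LVD thus becomes an EPT update performed outside the domain, so the guest page table can stay read-only for its whole lifetime.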

**CR3 remapping for EPT switch** While by itself the VMFUNC instruction does not change the root of the page table hierarchy, i.e., the CR3 register on x86 CPUs, the ability to switch EPTs, i.e., the GPA to HPA mappings, opens the possibility to change the guest page table too. The advantage of such design is the ability to execute non-isolated kernel and isolated drivers on independent virtual address spaces and page table hierarchies. Since individual processes and kernel threads execute on different page tables, we need to ensure that for each new process that tries to enter an LVD the physical address of the process’ root of the page table, i.e., the physical address pointed by the CR3, is mapped to the HPA page that contains the root of the page table of the LVD’s address space (Figure 3). We enforce the following invariant:

**Inv 3.** Physical address spaces of isolated domains and the kernel must not overlap.

This guarantees that the physical address that is used for the root of the page table inside the kernel is not occupied inside the isolated domain, and hence can be remapped into the HPA page that contains the root of the page table inside the isolated domain.
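The remapping step and the role of Inv 3 can be sketched as follows. This is an illustrative model under stated assumptions: EPT$_I$ is represented as a dictionary from GPA page to HPA page, address ranges are half-open tuples, and the function names `ranges_overlap` and `remap_cr3` are hypothetical.

```python
def ranges_overlap(a, b):
    """True if half-open GPA ranges a=(start, end) and b=(start, end) intersect."""
    return a[0] < b[1] and b[0] < a[1]


def remap_cr3(ept_lvd, kernel_range, lvd_range, process_cr3_gpa, lvd_root_hpa):
    """Map the GPA held in a process' CR3 to the LVD's root page-table page.

    ept_lvd models EPT_I as a dict (GPA page -> HPA page).
    """
    # Inv 3: guest-physical ranges of the kernel and the LVD are disjoint.
    if ranges_overlap(kernel_range, lvd_range):
        raise ValueError("Inv 3 violated: physical address spaces overlap")
    # The CR3 value lies in the kernel's range, so Inv 3 implies that the
    # LVD does not use this GPA for anything else.
    if not (kernel_range[0] <= process_cr3_gpa < kernel_range[1]):
        raise ValueError("CR3 does not point into the kernel's range")
    ept_lvd[process_cr3_gpa] = lvd_root_hpa
    return ept_lvd
```

After the call, a page walk that starts from the unchanged CR3 value while EPT$_I$ is active lands on the root of the LVD’s page table.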

**Protecting sensitive state** Isolated subsystems execute with ring 0 privileges. Hence they can read and alter sensitive hardware registers, e.g., re-load the root of the page table hierarchy by changing the CR3 register. To ensure isolation, we enforce the following invariant:

**Inv 4.** Access to sensitive state is mediated by the hypervisor.

To implement Inv 4, we configure the guest VM to exit into the hypervisor on the following instructions that access the sensitive state: 1) stores to control registers (cr0, cr3, cr4), 2) stores to extended control register (xcr0), 3) reads and writes of model specific registers (MSRs), 4) reads and writes of I/O ports, 5) access to debug registers, and 6) loads and stores of GDT, LDT, IDT, and TR registers. Inside the hypervisor we validate if the exit happens from the non-isolated kernel and emulate the exit-causing instruction.
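The exit policy can be summarized in a short sketch. The enumeration mirrors the six instruction classes listed above; the names `ExitReason` and `handle_exit` are hypothetical, and terminating the LVD when the exit originates from an isolated domain is an assumption made for this illustration.

```python
from enum import Enum, auto


class ExitReason(Enum):
    CR_WRITE = auto()     # stores to cr0, cr3, cr4
    XCR_WRITE = auto()    # stores to xcr0
    MSR_ACCESS = auto()   # reads and writes of MSRs
    IO_PORT = auto()      # reads and writes of I/O ports
    DEBUG_REG = auto()    # access to debug registers
    DESC_TABLE = auto()   # loads and stores of GDT, LDT, IDT, TR


KERNEL_EPT, LVD_EPT = 0, 1


def handle_exit(reason, active_ept):
    """Decide what the hypervisor does with a sensitive-instruction exit."""
    if not isinstance(reason, ExitReason):
        raise TypeError("unknown exit reason")
    if active_ept == KERNEL_EPT:
        # Exit from the non-isolated kernel: emulate the instruction.
        return "emulate"
    # Assumption: an exit from an isolated domain is treated as a violation.
    return "terminate_lvd"
```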

**Restoring kernel state** When the execution re-enters the kernel from the isolated domain, e.g., either returning from the domain invocation, or entering the kernel with a call from an isolated subsystem, the kernel cannot trust any of the general, segment, or floating-point registers.

**Inv 5.** General, segment, and extended-state (x87 FPU, SSE, AVX, etc.) registers are saved and restored on domain crossings.

As we cannot trust any general registers, upon entering the kernel we restore the kernel’s stack pointer from a trusted location in memory and then use the stack to restore other registers. We rely on the fact that the isolated driver cannot modify kernel’s address space (ensured by Inv 2 and Inv 4). We create a special page, vmfunc_state_page, which stores the pointer to the kernel stack of the current thread right before it enters the LVD. The entry-exit trampoline code uses the stack to save and restore the state of the thread.

LVDs are multi-threaded and re-entrant. While we do not allow context switches inside LVDs, the same LVD can run simultaneously on multiple CPUs. We therefore create a private copy of vmfunc_state_page on each CPU. Linux uses the gs register to implement per-CPU data structures (on each CPU gs specifies a different base for the segment that stores local CPU variables). As we cannot trust the value of gs on entry into the kernel from an LVD, we create a per-CPU mapping for the vmfunc_state_page in EPT$_K$. This ensures that on different CPUs, the vmfunc_state_page is mapped by a different machine page and hence holds local CPU state.
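The per-CPU mapping can be pictured as one EPT$_K$ view per CPU in which the same guest physical address resolves to a different host page. The sketch below is a toy model: `STATE_PAGE_GPA`, `host_memory`, and the helper functions are hypothetical names, and each page holds only the saved kernel stack pointer.

```python
STATE_PAGE_GPA = 0x5000  # hypothetical fixed guest-physical address

host_memory = {}  # HPA page -> saved kernel stack pointer


def build_per_cpu_epts(num_cpus, first_hpa):
    """One EPT_K view per CPU: same GPA, distinct host page."""
    return [{STATE_PAGE_GPA: first_hpa + cpu} for cpu in range(num_cpus)]


def save_kernel_stack(epts, cpu, stack_ptr):
    """Record the kernel stack pointer right before entering the LVD."""
    host_memory[epts[cpu][STATE_PAGE_GPA]] = stack_ptr


def restore_kernel_stack(epts, cpu):
    """On re-entry, read the stack pointer without relying on gs."""
    return host_memory[epts[cpu][STATE_PAGE_GPA]]
```

Because the lookup depends only on which EPT view the CPU uses, the trampoline finds the correct per-CPU state at a fixed address even if an LVD has tampered with gs.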

**Stacks and multi-threading** Any thread in the system can enter an isolated domain either as part of the system call that invokes a function of an isolated subsystem, or as part of an interrupt handler implemented inside an LVD. Every time the thread enters an isolated domain we allocate a new stack for execution of the thread inside the LVD. We use a lock-free allocator that relies on a per-CPU pool of pre-allocated stacks inside each LVD. From inside the LVD the thread can invoke a kernel function that in turn can re-enter the isolated domain. To prevent allocation of a new stack, we maintain a counter to count nested invocations of the isolated subsystem.
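The interaction between the stack pool and the nesting counter can be sketched as follows. This is a single-threaded illustration with hypothetical names (`LvdThreadState`, `PerCpuStackPool`); it does not model concurrency, since the lock-freedom in the design comes from each CPU owning its pool.

```python
class LvdThreadState:
    """Per-thread bookkeeping for one LVD."""

    def __init__(self):
        self.stack = None   # stack used inside the LVD, if any
        self.nesting = 0    # number of nested invocations of the LVD


class PerCpuStackPool:
    """Pool of pre-allocated LVD stacks owned by a single CPU."""

    def __init__(self, stacks):
        self.free = list(stacks)

    def enter(self, thread):
        # Only the outermost entry takes a stack from the pool.
        if thread.nesting == 0:
            thread.stack = self.free.pop()
        thread.nesting += 1
        return thread.stack

    def leave(self, thread):
        thread.nesting -= 1
        # The stack returns to the pool when the outermost call exits.
        if thread.nesting == 0:
            self.free.append(thread.stack)
            thread.stack = None
```

A nested re-entry (kernel function called from the LVD that calls back into the LVD) therefore continues on the stack that the outermost invocation allocated.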

4.1 Exitless Interrupt Handling

Historically, lack of hardware support for fine-grained assignment of interrupts across VMs and hypervisor required multiple exits into the hypervisor on the interrupt path [39, 86]. ELI [39] and DID [86] developed mechanisms for exitless delivery of interrupts for hardware-assisted VMs. We develop an exitless scheme that allows LVDs to handle interrupts even when execution is preempted inside an isolated domain.

At a high level, LVDs allow delivery of interrupts through the interrupt descriptor table (IDT) of the non-isolated kernel. The IDT is mapped inside both kernel and isolated domains. When an interrupt is delivered we switch back to the kernel EPT early in the interrupt handler. To ensure that interrupt delivery is possible, we map the IDT, global descriptor table (GDT), task-state segment (TSS), and interrupt handler trampolines in both EPT$_K$ and EPT$_I$. Upon an interrupt transition the hardware takes the normal interrupt delivery path, i.e., saves the state of the currently executing thread on the stack, locates the interrupt handler through the IDT, and jumps to it. The interrupt handler trampoline checks if the execution is still inside the LVD, and performs a VMFUNC transition back to the kernel if required.

While conceptually simple, the exitless interrupt delivery scheme requires careful design in the face of possible isolation attacks.

**Interrupt Stack Table (IST)** Both the non-isolated kernel and LVDs execute with the privileges of ring 0. As the privilege level does not change during the interrupt transition, the traditional interrupt path does not require a change of the stack, i.e., the hardware saves the trap frame on the stack pointed to by the current stack pointer. This opens a possibility for a straightforward attack: an LVD can configure the stack pointer to point to writable kernel memory in the kernel domain, and perform a VMFUNC transition back into the kernel through one of the trampoline pages. VMFUNC is a long-running instruction, and interrupts are often delivered right after the VMFUNC instruction completes\(^1\). The interrupt will be delivered inside the kernel domain and hence will overwrite the kernel memory pointed to by the stack pointer register configured by the isolated domain.

To prevent this attack, and to make sure that an interrupt is always executed on a valid stack, we rely on the Interrupt Stack Table (IST) [5]. The IST allows one to configure the interrupt handler to always switch to a preconfigured new stack even if the privilege level remains unchanged. Each IDT entry has a 3-bit field to specify one of the seven available IST stacks. Linux already uses ISTs for NMIs, double-fault, debug, and machine-check exceptions.
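For reference, in the x86-64 gate descriptor layout the IST index occupies a 3-bit field at bits 32 to 34 of the low quadword; a value of zero means that no IST stack is used, and values one through seven select a stack. The sketch below packs and unpacks that field; the helper names `set_ist` and `get_ist` are hypothetical.

```python
IST_SHIFT = 32   # bit position of the IST field in the low 8 bytes of a gate
IST_MASK = 0x7   # 3-bit field: 0 = no IST stack, 1..7 = IST stack index


def set_ist(gate_low, ist_index):
    """Return the low quadword of a gate descriptor with the IST field set."""
    if not 0 <= ist_index <= IST_MASK:
        raise ValueError("IST index must fit in 3 bits")
    cleared = gate_low & ~(IST_MASK << IST_SHIFT)
    return cleared | (ist_index << IST_SHIFT)


def get_ist(gate_low):
    """Extract the IST index from the low quadword of a gate descriptor."""
    return (gate_low >> IST_SHIFT) & IST_MASK
```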

\(^1\)We empirically confirmed this with perf, a profiler tool that relies on frequent interrupts from the hardware performance counter interface.

![Figure 4. Data-structures involved in interrupt transition](image)
4.2 VMFUNC Isolation Attacks and Defenses

Due to its unusual semantics, VMFUNC opens possibility for a series of non-traditional attacks that we discuss below.

**Rogue VMFUNC transitions**  A compromised LVD can use one of the available VMFUNC instruction instances to perform a rogue transition into the kernel, e.g., try to enter the kernel via the kernel exit trampoline. We insert a check right after each VMFUNC instruction to verify that the ECX register, which is used as an index to choose the active EPT, is set to the correct value, i.e., zero or one based on the direction. If we detect a violation, we abort execution by exiting into the hypervisor, which upcalls into the kernel and triggers termination of the LVD.

**Asynchronous write into the IST stack from another CPU**  An LVD can try to write into the IST stack of another core that executes an interrupt handler at the same time, hence crashing or confusing the kernel. We ensure that all IST stacks are private to a core, i.e., are allocated for each core and are mapped only on the core that uses them.

**Interrupt injection attack**  As LVDs run in ring 0, they can invoke the INT instruction, injecting an interrupt into the kernel. We disable synchronous interrupts 0-31 originating from isolated domains, i.e., in the interrupt handler we check if the handler is executing on EPT$_I$ and if so terminate the LVD. Note that legitimate asynchronous interrupts can preempt the LVD, and hence it is impossible to say whether interrupt injection is happening without inspecting the instruction that was executing right when the interrupt happened. While we do not implement this defense at the moment, we suggest a periodic inspection of the LVD’s instruction if the frequency of a specific interrupt exceeds an expected threshold.

**Interrupt IPI from another core running LVD**  An LVD running on another core can try to inject an inter-processor interrupt (IPI), implementing a flavor of the interrupt injection attack. We protect against this attack by making sure that the APIC I/O pages used to send inter-processor interrupts are not mapped inside LVDs.

**VMFUNC to a non-existent EPT entry**  An LVD can try to VMFUNC into a non-existent EPT list entry. We configure the EPTP list page to have an invalid pointer to make sure that such transition causes an exit into the hypervisor. The hypervisor then delivers an upcall exception to the kernel which terminates the LVD.

**VMFUNC from one LVD into another**  One LVD can try to VMFUNC into another LVD. We configure the EPTP list page to have only two entries at every moment of time: the EPT of the kernel (entry zero), and the EPT of the LVD that is about to be invoked (entry one). Since the EPTP list is mapped inside the kernel, we re-load the second entry when the kernel is about to invoke a specific LVD.
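The EPTP-list defenses described above (the ECX check, invalid unused entries, and a single LVD entry at a time) can be modeled together. This is a toy sketch with hypothetical names; the only hardware fact it relies on is that the EPTP list is a 4 KB page of 512 eight-byte entries.

```python
EPTP_LIST_SIZE = 512   # a 4 KB page holding 8-byte EPTP entries
KERNEL_INDEX, LVD_INDEX = 0, 1


class EptpList:
    """Toy model of the EPTP list page; None marks an invalid entry."""

    def __init__(self, kernel_eptp):
        self.entries = [None] * EPTP_LIST_SIZE
        self.entries[KERNEL_INDEX] = kernel_eptp

    def prepare_invoke(self, lvd_eptp):
        # The kernel installs the EPT of the LVD it is about to invoke,
        # so no other LVD is reachable through the list.
        self.entries[LVD_INDEX] = lvd_eptp

    def vmfunc(self, ecx):
        """Return the selected EPTP, or 'vm_exit' for an invalid index."""
        if not 0 <= ecx < EPTP_LIST_SIZE or self.entries[ecx] is None:
            return "vm_exit"
        return self.entries[ecx]


def check_after_vmfunc(ecx, expected_index):
    """Check placed right after each VMFUNC instruction instance."""
    return "ok" if ecx == expected_index else "abort"
```

With LVD A installed in entry one, a VMFUNC with any index above one exits to the hypervisor, and LVD B is unreachable because its EPTP is not in the list at all.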

**Kernel stack exhaustion attack**  An LVD might try to arrange a valid control flow to force the kernel to call the LVD over and over again until the kernel stack is exhausted. To prevent this loop we check that the kernel stack is above the threshold every time we enter the LVD.

**LVD never exits**  LVDs can disable interrupts by clearing the interrupt flag in the EFLAGS register. An LVD can then never return control to the kernel. We configure a non-conditional VM preemption timer that passes control to the hypervisor periodically. The hypervisor checks if the kernel is making progress by checking an entry in the vmfunc_state_page.

**LVD re-enables interrupts**  An LVD can re-enable interrupts by setting the interrupt flag in the EFLAGS register. The kernel might then receive an interrupt in an interrupt-disabled state and as a result crash or corrupt sensitive kernel state. We make a practical assumption that under normal conditions the isolated subsystem should not re-enable interrupts. We save the interrupt flag when we enter an LVD and check the state of the flag in the interrupt handler. If the interrupt originates while inside the LVD and interrupts were disabled before entering the isolated domain, we signal the attack. We also check the state of the interrupt flag every time we exit from the LVD, and signal the attack if it does not match the saved value.
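The two interrupt-flag checks can be written down compactly. The sketch assumes only that the interrupt-enable flag is bit 9 of EFLAGS; the function names are hypothetical.

```python
EFLAGS_IF = 1 << 9  # interrupt-enable flag, bit 9 of EFLAGS


def save_if_on_entry(eflags):
    """Record whether interrupts were enabled when entering the LVD."""
    return bool(eflags & EFLAGS_IF)


def interrupt_is_attack(saved_if, inside_lvd):
    """An interrupt inside the LVD is suspicious if it was entered with
    interrupts disabled: the LVD must have re-enabled them."""
    return inside_lvd and not saved_if


def exit_is_attack(saved_if, eflags_on_exit):
    """On exit from the LVD the flag must match the value saved on entry."""
    return bool(eflags_on_exit & EFLAGS_IF) != saved_if
```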

5 Evaluation

We conduct all experiments in the openly-available CloudLab cloud infrastructure testbed [18] (we make all experiments available via an open CloudLab [78] profile that automatically instantiates software environment used in this section). Our experiments utilize two CloudLab c220g2 servers configured with two Intel E5-2660 v3 10-core Haswell CPUs running at 2.60 GHz, 160 GB RAM, and a dual-port Intel X520 10Gb NIC. All machines run 64-bit Ubuntu 18.04 Linux with kernel version 4.8.4. In all experiments we disable hyper-threading, turbo boost, and frequency scaling to reduce the variance in benchmarking.

5.1 VMFUNC Domain-Crossings

To understand the overheads of the VMFUNC-based isolation we conduct a series of experiments aimed at measuring overheads of VMFUNC instructions, and VMFUNC-based cross-domain invocations (Table 1). In all tests we run 10 million iterations and measure the latency in cycles with the RDTSC and RDTSCP instructions. Further, to avoid flushing the cached EPT translations, we enable support for virtual processor identifiers (VPIDs). On our hardware, a single invocation of the VMFUNC instruction takes 169 cycles. To put this number in perspective, we measure the overhead of a null system call to be 140 cycles.

---

CloudLab profile is available at https://www.cloudlab.us/p/lvds/lvd-linux
LVDS source code is available at https://mars-research.github.io/lvds
To understand the benefits of VMFUNC-based cross-domain invocations over traditional page-based address-space switches, we compare LVDs’ cross-domain calls with the synchronous IPC mechanism implemented by the seL4 microkernel [25]. We choose seL4 as it implements the fastest synchronous IPC across several modern microkernels [68]. To defend against Meltdown attacks, seL4 provides support for a page-table-based kernel isolation mechanism similar to KPTI [41]. However, this mechanism negatively affects IPC performance due to an additional reload of the page table root pointer. Since recent Intel CPUs address Meltdown attacks in hardware, we configure seL4 without these mitigations. With tagged TLBs seL4’s call/reply IPC takes 834 cycles (Table 1). LVDs’ VMFUNC-based call/reply invocation requires 396 cycles.
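As a quick worked example, the cycle counts from Table 1 can be turned into ratios and wall-clock times. The conversion assumes the 2.60 GHz clock rate of the test machines; the variable and function names are illustrative.

```python
CPU_HZ = 2.60e9  # clock rate of the test machines

# Cycle counts from Table 1.
TABLE1 = {"vmfunc": 169, "syscall": 140, "sel4_ipc": 834, "lvd_call_reply": 396}


def cycles_to_ns(cycles, hz=CPU_HZ):
    """Convert a cycle count to nanoseconds at the given clock rate."""
    return cycles / hz * 1e9


# How much faster an LVD call/reply is than seL4's call/reply IPC.
speedup_over_sel4 = TABLE1["sel4_ipc"] / TABLE1["lvd_call_reply"]
# Cost of a bare VMFUNC relative to a null system call.
vmfunc_vs_syscall = TABLE1["vmfunc"] / TABLE1["syscall"]
```

Under these numbers an LVD call/reply is roughly 2.1 times faster than the seL4 IPC, and a bare VMFUNC costs about 1.2 times a null system call.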

### 5.1.1 Overheads of Running Under a Hypervisor

LVDs execute the system under control of a hypervisor resulting in two kinds of overheads: 1) overheads due to virtualization, i.e., EPT translation layer, and 2) exits to the hypervisor caused by the need to protect sensitive instructions that can potentially break isolation if executed by an LVD.

**Sensitive instructions** We first conduct a collection of experiments aimed at measuring the cost of individual VM-exits that are required to mediate execution of sensitive instructions (Table 2). On average an exit into the hypervisor takes 1171-1240 cycles. To reduce the number of exits due to updates of the cr3 register, we implement an LRU cache and maintain the list of three target cr3 values.
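The effect of the cr3 cache can be illustrated with a small LRU model. The sketch assumes that a write of a cached value does not exit, while a miss exits so that the hypervisor can install the new value and evict the least recently used one; the class name `Cr3TargetCache` is hypothetical.

```python
from collections import OrderedDict


class Cr3TargetCache:
    """LRU list of cr3 values that can be loaded without a VM exit."""

    def __init__(self, capacity=3):
        self.capacity = capacity
        self.targets = OrderedDict()  # oldest entry first

    def write_cr3(self, value):
        """Return True if this cr3 write causes a VM exit."""
        if value in self.targets:
            # Hit: mark as most recently used, no exit needed.
            self.targets.move_to_end(value)
            return False
        # Miss: exit into the hypervisor, which installs the new value
        # and evicts the least recently used one when the list is full.
        if len(self.targets) >= self.capacity:
            self.targets.popitem(last=False)
        self.targets[value] = True
        return True
```

A workload that alternates between at most three address spaces exits only on the first write of each cr3 value, which is consistent with the small gap between native and virtualized cr3 writes in Table 2.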

**Whole-system benchmarks** We evaluate the impact of virtualization and LVD-specific kernel modifications by running a collection of Phoronix benchmarks [75]. The Phoronix suite provides a large library of whole-system benchmarks; we use a set of benchmarks that characterize both whole-system performance and stress specific subsystems. The whole-system benchmarks include apache (measures sustained requests/second; 100 concurrent requests); nginx (measures sustained requests/second; 500 concurrent requests); pybench (tests basic, low-level functions of Python); phpbench (large numbers of simple tests against the PHP interpreter). Subsystem-specific benchmarks include dbench (file system calls to test disk performance, varying the number of clients); postmark (transactions on 500 small files (5–512 KB) simultaneously); povray (3D ray tracing); sysbench (performs CPU and memory tests); gnupg (encryption time with GnuPG).

Figure 5 shows the performance of the virtualized LVD kernel relative to the performance of an unmodified Linux running on bare metal. Apache, nginx, dbench (with more than 12 clients), and sysbench incur 2–5% slowdown relative to the unmodified Linux system. All other benchmarks stay within 1% of the performance of the bare-metal system.

**Breakdown of VM exits** To better understand the reasons for possible performance degradation due to virtualization, we collect the number and the nature of VM-exits for Phoronix benchmarks. In our tests Phoronix benchmarks experience from 465 (phpbench) to 155862 (nginx) VM-exits per second. The two most frequent exit reasons are 1) access to MSRs required for programming the APIC timer (ia32_tsc_deadline) and updating the base address of the FS segment during a context switch (ia32_fs_base), and 2) access to control registers.

### 5.2 Overheads of isolation

To evaluate the overheads of isolation we developed several isolated device drivers in the Linux kernel. Specifically, we developed isolated versions of 1) a software-only "null" network driver (nullnet), 2) an Intel 82599 10Gbps Ethernet driver (ixgbe), and 3) a software-only "null" block NVMe driver (nullblock). Neither nullnet nor nullblock is connected to a real hardware device. Instead they emulate infinitely fast devices in software. The software-only drivers allow us to stress overheads of isolation without any artificial hardware limits. Both network and storage layers are kernel subsystems with the tightest performance budgets; hence we choose them for evaluating overheads of our isolation mechanisms.

---

**Table 1.** Cost of VMFUNC-based cross-domain invocations.

<table>
<thead>
<tr>
<th>Operation</th>
<th>Cycles</th>
</tr>
</thead>
<tbody>
<tr>
<td>VMFUNC instruction</td>
<td>169</td>
</tr>
<tr>
<td>System call</td>
<td>140</td>
</tr>
<tr>
<td>seL4’s call/reply invocation</td>
<td>834</td>
</tr>
<tr>
<td>VMFUNC-based call/reply invocation</td>
<td>396</td>
</tr>
</tbody>
</table>

**Table 2.** Cost of hypervisor exits (cycles).

<table>
<thead>
<tr>
<th>Operation</th>
<th>Native</th>
<th>Virtualized</th>
</tr>
</thead>
<tbody>
<tr>
<td>write MSR</td>
<td>127</td>
<td>1367</td>
</tr>
<tr>
<td>out instruction</td>
<td>4213</td>
<td>5384</td>
</tr>
<tr>
<td>write cr3</td>
<td>130</td>
<td>143</td>
</tr>
</tbody>
</table>

---

**Figure 5.** Phoronix benchmarks.

**Figure 6.** Null net Tx IOPS (K)

To evaluate the isolated nullnet and ixgbe drivers, we use the iperf2 benchmark, which measures the transmit bandwidth for MTU-sized packets, and vary the number of threads from 1 to 20. We report total packet transmission I/O requests per second (IOPS) across all CPUs (Figure 6).

In our first experiment we change nullnet to perform only one crossing between the kernel and the driver for sending each packet (nullnet−1, Figure 6). This synthetic configuration allows us to analyze overheads of isolation in the ideal scenario of a device driver that requires only one crossing on the device I/O path. With one application thread the non-isolated driver achieves 968K IOPS (i.e., on average, a well-optimized network send path takes only 2680 cycles to submit an MTU-sized packet from the user process to the network interface). The isolated driver (nullnet−1) achieves 876K IOPS (91% of the non-isolated performance), and on average requires 2960 cycles to submit one packet. Since the iperf application uses floating-point operations, we incur additional overhead due to saving and restoring of FPU registers when we jump between the kernel and the isolated domain. On 20 threads the isolated driver achieves 91% of the performance of the non-isolated driver.
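The per-packet cycle budgets quoted above follow from the IOPS figures and the clock rate. A minimal sketch of that arithmetic, assuming a single core running at 2.60 GHz (the function name is illustrative):

```python
CPU_HZ = 2.60e9  # clock rate of the test machines


def cycles_per_request(iops, hz=CPU_HZ):
    """Average cycles one core spends per request at the given rate."""
    return hz / iops


native = cycles_per_request(968_000)     # non-isolated nullnet
isolated = cycles_per_request(876_000)   # nullnet-1, one crossing per packet
crossing_cost = isolated - native        # rough per-packet isolation cost
```

The results land within a few cycles of the rounded values in the text, and the difference of roughly 280 cycles is a rough estimate of what one domain crossing adds to each packet.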

In our second experiment, we run the isolated nullnet driver in its default configuration (i.e., perform two domain crossings per packet transmission). In a single-threaded test, the isolated driver achieves 65% of the performance of the non-isolated driver. Finally, we measure the overhead of saving and restoring the processor extended state, i.e., floating-point, SSE, and AVX registers. Not all programs use extended state registers, and those that do not can benefit from faster domain crossings. The Linux kernel dynamically tracks if extended state was used, hence we save and restore it only when needed. We disable saving and restoring extended state for the iperf nullnet benchmarks (nullnet−2–nofpu, Figure 6). Without extended state on a single core the isolated driver achieves 72% of the performance of the non-isolated driver.

5.2.2 Ixgbe Device Driver

To measure performance of the isolated ixgbe driver, we configure an iperf2 test with a varying number of iperf threads ranging from 1 to 20 (Figure 7). On our system, even two application threads saturate a 10Gbps network adapter. Configured with one iperf thread, on the transmission path using an MTU-sized packet, the isolated ixgbe achieves 95.7% of the performance of the non-isolated driver. This difference disappears as we add more iperf clients. For two or more threads, both drivers saturate the network interface and show a nearly identical throughput. On the receive path, the isolated driver is 10% slower for one application thread. With a higher number of application threads, the isolated driver is within 1% to 11% of the performance of the native driver (Figure 8).

To measure the end-to-end latency, we rely on the UDP request-response test implemented by the netperf benchmarking tool. The UDP RR test measures the number of round-trip request-response transactions per second, i.e., the client sends a 64-byte UDP packet and waits for the response from the server. The native driver achieves 29633 transactions per second (which equals a round-trip latency of 33.7µs); the isolated driver is 8% (2.8µs) slower with 27333 transactions per second (round-trip latency of 36.5µs).
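Since the test keeps one transaction in flight at a time, the round-trip latency is the inverse of the transaction rate. The following sketch reproduces the latency figures from the reported rates; the function name is illustrative.

```python
def round_trip_us(transactions_per_second):
    """Round-trip latency in microseconds for a one-at-a-time RR test."""
    return 1e6 / transactions_per_second


native_us = round_trip_us(29633)
isolated_us = round_trip_us(27333)
added_latency_us = isolated_us - native_us
# Relative drop in transaction rate for the isolated driver.
rate_drop = 1 - 27333 / 29633
```

The computed latencies agree with the reported ones to within a tenth of a microsecond, and the isolated driver adds about 2.8µs per round trip.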

5.2.3 Multi-queue block device driver

In our block device experiments, we use the fio benchmark to generate I/O requests. To set an optimal baseline for our evaluation, we chose the configuration parameters that give us the lowest-latency path to the driver, so that overheads of isolation are more pronounced. We use fio’s libaio engine to overlap I/O submissions, and bypass the page cache by setting the direct I/O flag to ensure raw device performance. We vary the number of fio threads involved in the test from 1 to 20 and use a block size of 512B and I/O submission queue lengths of 1 and 16 (Figure 9). We submit a set of requests at once, i.e., either 1 or 16, and also poll for the same number of completions. Since the nullblock driver does not interact with an actual storage medium, reads and writes perform the same, so we use read I/O requests in all experiments.

For 512 byte requests on a single thread and submission queue length of one the isolated driver achieves 337K IOPS compared to the 559K IOPS for the native (Figure 9). The isolated nullblock driver goes through three cross-domain calls on the I/O path and hence it incurs higher overhead compared
5.2.4 Exitless Interrupt Delivery

To understand the overheads of exitless interrupt delivery, we measure latencies introduced on the interrupt path by virtualization and by LVDs’ isolation mechanisms, specifically execution on IST stacks and VMFUNC domain crossings when an interrupt is delivered while inside an isolated domain. To eliminate the overheads introduced by general layers of interrupt processing in the Linux kernel, we register a minimal interrupt handler that acknowledges the interrupt right above the machine-specific interrupt processing layer. Our tests invoke the int instruction in the following system configurations, which we run on bare metal and on top of the hypervisor (Table 3): 1) in a kernel module loaded inside an unmodified vanilla Linux kernel, 2) inside a kernel module in the LVD kernel, and 3) inside an LVD. In both vanilla and LVD kernels virtualization itself introduces only a minimal overhead to the interrupt processing path (13 and 14 cycles respectively). The LVD kernel executes all interrupts and exceptions on IST stacks, and thus pays the additional price of switching to an IST stack while entering the interrupt handler (59–60 cycles). An interrupt inside an isolated domain introduces an overhead of 380 cycles due to two VMFUNC transitions required to exit and re-enter the LVD. In all three tests (nullnet, nullblock, and ixgbe) 11% of interrupts are delivered while inside an LVD.
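The overheads quoted in this paragraph are differences between rows of Table 3. The following sketch spells out which rows are subtracted; the dictionary keys are shorthand labels for the table rows, not names from the LVD code.

```python
# Cycle counts from Table 3.
TABLE3 = {
    "vanilla": 607,
    "vanilla_virt": 620,
    "lvd": 666,
    "lvd_virt_kernel": 680,
    "lvd_virt_isolated": 1060,
}


def interrupt_overheads(t):
    """Break the interrupt-path cost into its three sources."""
    return {
        # Virtualized minus bare metal, for the vanilla and LVD kernels.
        "virtualization": (t["vanilla_virt"] - t["vanilla"],
                           t["lvd_virt_kernel"] - t["lvd"]),
        # LVD kernel minus vanilla kernel, bare metal and virtualized.
        "ist_switch": (t["lvd"] - t["vanilla"],
                       t["lvd_virt_kernel"] - t["vanilla_virt"]),
        # Interrupt taken inside an LVD minus interrupt taken in the kernel.
        "lvd_crossing": t["lvd_virt_isolated"] - t["lvd_virt_kernel"],
    }
```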

LVDs trade the cost of changing the privilege level on cross-domain invocations for exits into the hypervisor on execution of privileged instructions. To justify this choice, we measure the number of VM exits and compare it with the number of VMFUNC transitions for I/O intensive workloads (nullnet, nullblock, and ixgbe tests) (Table 4). For the iperf test that stresses performance of the isolated ixgbe driver, we recorded a total of 15235 VM exits, a number that is three orders of magnitude smaller than the number of cross-domain transitions (27 x 10^6).

### Table 3. Overhead of interrupt delivery.

<table>
<thead>
<tr>
<th>Kernel/hypervisor setup</th>
<th>Cycles</th>
</tr>
</thead>
<tbody>
<tr>
<td>Vanilla kernel</td>
<td>607</td>
</tr>
<tr>
<td>Vanilla kernel (virtualized)</td>
<td>620</td>
</tr>
<tr>
<td>LVD kernel</td>
<td>666</td>
</tr>
<tr>
<td>LVD kernel (virtualized, inside kernel)</td>
<td>680</td>
</tr>
<tr>
<td>LVD kernel (virtualized, inside isolated domain)</td>
<td>1060</td>
</tr>
</tbody>
</table>

### Table 4. VMFUNC crossings vs number of exits

<table>
<thead>
<tr>
<th>Transitions/exits</th>
<th>nullnet</th>
<th>ixgbe</th>
<th>nullblock</th>
</tr>
</thead>
<tbody>
<tr>
<td>VMFUNC</td>
<td>41 x 10^6</td>
<td>27 x 10^6</td>
<td>33 x 10^6</td>
</tr>
<tr>
<td>VM-Exit</td>
<td>14074</td>
<td>15235</td>
<td>25789</td>
</tr>
</tbody>
</table>

6 Conclusions

Over the last four decades operating systems gravitated towards a monolithic kernel architecture. However, the availability of new low-overhead hardware isolation mechanisms in recent CPUs brings a promise to enable kernels that employ fine-grained isolation of kernel subsystems and device drivers. Our work on LVDs develops new mechanisms for isolation of kernel code. We demonstrate how hardware-assisted virtualization can be used for controlling execution of privileged instructions and define a set of invariants that allows us to isolate kernel subsystems in the face of an intricate execution model of the kernel, e.g., provide isolation of preemptable, concurrent interrupt handlers. While our work utilizes EPTs for memory isolation, we argue that our techniques can be combined with other architectural mechanisms, e.g., MPK, a direction we plan to explore in the future.

Acknowledgments

We would like to thank OSDI 2019, ASPLOS 2019, and VEE 2020 reviewers for numerous insights helping us to improve this work. We are further grateful to the Utah CloudLab team for the patience with accommodating our countless requests and outstanding technical support. This research is supported in part by the National Science Foundation under Grant Number 1840197.

References


[31] Keir Fraser, Steven Hand, Rolf Neugebauer, Ian Pratt, Andrew Warfield, and Mark Williamson. Safe hardware access with the Xen virtual machine monitor. In 1st Workshop on Operating System and Architecture Support for the on demand IT Infrastructure (OASIS), 2004.


[44] Andreas Haeberlen, Jochen Liedtke, Yoonho Park, Lars Reuther, and Volkmar Uhlig. Stub-code performance is becoming important. In


[75] Phoronix Test Suite: An automated, open-source testing framework. https://lwn.net/Articles/662953/


