ARMv8 Virtual Memory Architecture

Lately I've been working on a side-project involving a specialized operating system running on an Arm Cortex-A class CPU, supporting the ARMv8 architecture (aka "aarch64").

I spent a bunch of time learning about the virtual memory architecture in particular. The official documentation on it is very thorough but also very dense, so this article is primarily intended as a more accessible overview for my own reference, but I'm posting it here in case it's useful to others.

I'll start with some general background information on virtual memory that is applicable to many different CPU architectures, before discussing the specifics of the ARMv8 virtual memory model, which is also called VMSAv8.

What is Virtual Memory Anyway?

The broad idea of virtual memory is to allow an operating system programmer to change how memory addresses will be interpreted by the CPU, typically so that different processes can be given access to different regions of memory and can thus be isolated from one another.

A virtual memory system can be thought of as essentially a big lookup table, where the CPU can look up a virtual address and find a corresponding physical address. Some CPU architectures, including the Arm architectures, also use this lookup table to capture other metadata about memory such as access permissions (e.g. read-only vs. writable) and cacheability.

Sometimes the idea of "swapping" is also discussed under the topic of virtual memory, which refers to mechanisms where the operating system can make it appear that there is more RAM available by using a disk as temporary storage for data that's not currently being used. It's true that the virtual memory system is an important part of the implementation of "swapping", but I tend to think of swapping as an application of virtual memory rather than as part of its definition, and so I won't be discussing that in much detail here.

Virtual Memory Terminology

A key tradeoff in a virtual memory model is the granularity of the address lookup table. The most specific level of mapping is called a "page" and consists of a range of consecutive physical memory addresses that can only be exposed together as a range of consecutive virtual memory addresses of the same size.

Pages also have alignment, with essentially the same meaning as for data types in high-level programming languages. Taking both the page size and the alignment gives a fixed number of low-order address bits which must always match between virtual and physical memory addresses. The remaining higher-order address bits are what is captured in the virtual memory lookup table.
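To make the split between "looked up" and "passed through" bits concrete, here is a minimal Python sketch. It assumes 4 kiB pages and a flat, single-level dictionary standing in for the lookup table; the names split_address, translate, and page_map are hypothetical and only for illustration.

```python
PAGE_SIZE = 4096  # assumed 4 kiB pages, purely for illustration
OFFSET_BITS = PAGE_SIZE.bit_length() - 1  # 12 low-order bits pass through unchanged


def split_address(addr):
    """Split an address into (page number, offset within the page)."""
    return addr >> OFFSET_BITS, addr & (PAGE_SIZE - 1)


def translate(virtual_addr, page_map):
    """Translate using a flat {virtual page: physical page} dictionary.

    A missing page raises KeyError, which stands in for a fault.
    """
    virtual_page, offset = split_address(virtual_addr)
    return (page_map[virtual_page] << OFFSET_BITS) | offset
```

With page_map = {0x1: 0x80}, the virtual address 0x1234 becomes 0x80234: the page number changes but the low 12 bits are carried over as-is.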

The conceptual lookup table is typically actually multiple levels of table in practice, with each level corresponding to a different range of bits in the virtual address space. Each of these tables is called a page table, and depending on the virtual address size the CPU might need to consult a series of four or more page tables before finally finding the full physical address corresponding to a virtual address.

Some CPU architectures, including ARMv8, include support for hardware-assisted virtualization, where multiple operating systems can run concurrently on the same CPU cores. In that case the operating systems themselves are essentially another level of independent processes that need to be isolated from one another, and so a CPU architecture can support multiple translation stages. For example, ARMv8 supports two stages, where the first stage translates from virtual addresses to intermediate physical addresses, and then the second stage translates from there to true physical addresses.

This article won't discuss multiple translation stages in much detail because I'm writing an operating system rather than a hypervisor, and so I'm concerned only with the first stage of translation.

AArch32 vs. AArch64

The A-series (application-oriented, rather than microcontroller-oriented) profiles of Arm architecture are currently in transition from the historical 32-bit architectures, retroactively named AArch32, to a new 64-bit architecture called AArch64.

These two architectures each have their own virtual memory system. The concepts are similar between the two, but they are incompatible in the details. This article is concerned only with the AArch64 subset, named VMSAv8-64 in the architecture reference manual. The most recent Arm Cortex-A implementations at the time of writing this support AArch32 only in user mode — requiring operating systems to run in the AArch64 state — and so VMSAv8-32 is only really of historical interest at this point. (The OS is responsible for the virtual memory mappings used in unprivileged mode, and so the 64-bit model applies even when userspace is 32-bit.)

The remainder of this article will discuss only VMSAv8-64 details.

Translation Granule Sizes

ARMv8 describes three different "granule sizes" for translation, which essentially means selecting between three different possible page sizes. Practical implementations of the architecture often support only a subset of these, but the documented architecture describes the following:

Granule Size	Page Size	Max. Entries per Page Table
4KB	4 kiB	512
16KB	16 kiB	2,048
64KB	64 kiB	8,192

Page tables are stored in memory themselves, and each page table has a maximum size of one page. Each page table entry is 64-bit, or eight bytes.

Choosing a larger granule size allows covering more memory with each translation table and thus fewer and smaller translation tables in total, but also increases the minimum size of memory block that can be allocated for a particular process.
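The numbers in the table follow directly from dividing the page size by the eight-byte entry size. The following Python sketch (with hypothetical function names) shows that arithmetic, along with how much memory a single final-level table can map when every entry points at one page.

```python
ENTRY_SIZE = 8  # bytes per page table entry (64 bits)


def entries_per_table(granule_bytes):
    """Number of entries that fit in a page table occupying one full page."""
    return granule_bytes // ENTRY_SIZE


def bytes_mapped_by_last_level_table(granule_bytes):
    """Memory covered by one full table whose entries each map one page."""
    return entries_per_table(granule_bytes) * granule_bytes
```

For the three granule sizes this gives 2 MiB, 32 MiB, and 512 MiB per final-level table, which is the tradeoff described above: bigger pages mean fewer tables, but coarser allocation.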

The operating system can choose from the available granule sizes by writing to the Translation Control Register, named TCR_EL1. This register specifies some other related settings too, which all together define how the CPU will interpret the page table data:

T0SZ/T1SZ: The size of the virtual address space, specified as the number of high-order bits that are excluded from consideration. For example, setting T0SZ to 31 means that only the low 33 bits of addresses are significant, and all of the other bits ought to be set to zero in a valid address.
Reducing the size of the address space reduces the number and size of the page tables required for the translation. Any address outside of the configured range is automatically invalid.
IRGN0/ORGN0/SH0/IRGN1/ORGN1/SH1: Shareability and cacheability metadata for the memory containing the page tables themselves.
TG0/TG1: Granule size, as described above.
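As a rough sketch of how these fields combine, the following Python function packs the size and granule fields into a register value. The bit positions and the TG0/TG1 encodings are not given in this article; they are taken from the architecture reference manual as best understood here and should be double-checked before use. The function name and parameters are hypothetical, and the cacheability and shareability fields are simply left at zero.

```python
# Assumed encodings from the architecture reference manual. Note that TG0 and
# TG1 use different encodings for the same granule sizes.
TG0_ENCODING = {4096: 0b00, 65536: 0b01, 16384: 0b10}
TG1_ENCODING = {16384: 0b01, 4096: 0b10, 65536: 0b11}


def tcr_el1(low_va_bits, low_granule, high_va_bits, high_granule):
    """Build a TCR_EL1 value containing only the size and granule fields."""
    t0sz = 64 - low_va_bits   # number of excluded high-order bits, low half
    t1sz = 64 - high_va_bits  # same for the high half
    for size in (t0sz, t1sz):
        if not 0 <= size < 64:
            raise ValueError("virtual address size out of range")
    value = t0sz                              # T0SZ, assumed bits 5:0
    value |= TG0_ENCODING[low_granule] << 14   # TG0, assumed bits 15:14
    value |= t1sz << 16                        # T1SZ, assumed bits 21:16
    value |= TG1_ENCODING[high_granule] << 30  # TG1, assumed bits 31:30
    return value
```

For example, tcr_el1(33, 65536, 33, 65536) sets both T0SZ and T1SZ to 31, matching the 33-bit example above.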

Translation Table Walks

The process of visiting one or more page tables to translate a virtual address into a physical address is called walking the page tables.

To start this process, the CPU needs to know the address of the first page table to use. At the first translation stage, controlled by the operating system, it's conventional to split the memory space into a low part for the unprivileged program and a high part for the operating system itself, and VMSAv8 encourages that design by offering entirely separate virtual memory controls for addresses above vs. below the middle of the address space.

The registers TTBR0_EL1 and TTBR1_EL1 contain the initial page table addresses for the low half and the high half of the virtual memory space respectively. The previous section discussed TCR_EL1 and its fields for choosing granule size and address size, and those settings are also independently-selectable for the low and high halves of the memory space, which means that e.g. the operating system can choose a 64KB granule size for its own pages while unprivileged code is using the smaller 4KB granule size in the lower half of the memory space.
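The choice between the two registers can be sketched as follows, assuming plain 64-bit addresses and ignoring optional features such as address tagging. The function name is hypothetical; the low region is taken to start at zero and the high region to end at the top of the 64-bit space, with their sizes given by T0SZ and T1SZ.

```python
def select_ttbr(virtual_addr, t0sz, t1sz):
    """Return which base register serves a 64-bit address, or None.

    None means the address falls in the gap between the two regions,
    where no translation is possible.
    """
    low_limit = 1 << (64 - t0sz)                # first address past the low region
    high_base = (1 << 64) - (1 << (64 - t1sz))  # first address of the high region
    if virtual_addr < low_limit:
        return "TTBR0_EL1"
    if virtual_addr >= high_base:
        return "TTBR1_EL1"
    return None
```

With both sizes set to 31 (33-bit regions), 0x1000 selects TTBR0_EL1, 0xFFFFFFFFFFFF0000 selects TTBR1_EL1, and 0x200000000 lies in the gap.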

The translation table walk (aka page table walk) splits the virtual address into up to five parts, all but one of which correspond to a level of address translation, and thus to a particular page table.

The number of bits "consumed" by each level depends on the number of entries in each page table, which is determined by the granule size:

Granule Size	Max. Entries per Page Table	Address bits per level
4KB	512	9 bits
16KB	2,048	11 bits
64KB	8,192	13 bits

The page size also determines the number of low-order address bits "left over" at the end of the translation process, which are therefore taken verbatim as corresponding bits in the physical address:

4 kiB pages: 12 bits
16 kiB pages: 14 bits
64 kiB pages: 16 bits

Taking these two parameters together, the virtual address bits map to translation levels as follows:

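Level	4KB	16KB	64KB
0	47:39	47	(not used)
1	38:30	46:36	47:42
2	29:21	35:25	41:29
3	20:12	24:14	28:16
-	11:0	13:0	15:0

The 64KB granule has no level 0 table, so its walk starts at level 1 at the earliest. Bits 51:48 only come into play with the optional 52-bit virtual address extension, in which case level 1 covers 51:42.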

The configured virtual address size decides how many of the virtual address bits are considered significant. Excluded address bits are not considered during the page table walk, and so shortening the address size can remove some levels from consideration altogether.

For example, if the granule size is 64KB and the virtual address size is 33 bits, then:

Levels 0 and 1 are skipped altogether, and so the TTBRx_EL1 register points to a level 2 page table.
The level 2 page table covers bits 32:29 — four bits — and so the page table has sixteen entries.
The level 3 page table covers bits 28:16 with 8,192 entries.
The low 16 bits select a byte in the resulting 64 kiB page.
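The same calculation can be written as a short Python sketch. It works upward from the page offset, giving each level as many bits as one table can index until the configured address size is used up. The function names are hypothetical, and the sketch ignores the architecture's limits on which sizes and starting levels are actually permitted.

```python
def level_layout(granule_bytes, va_bits):
    """Return [(level, high_bit, low_bit), ...] for the levels actually walked."""
    offset_bits = granule_bytes.bit_length() - 1  # bits selecting a byte in the page
    bits_per_level = offset_bits - 3              # log2(granule / 8-byte entries)
    layout = []
    low = offset_bits
    level = 3
    while low < va_bits:
        high = min(low + bits_per_level, va_bits) - 1
        layout.append((level, high, low))
        low = high + 1
        level -= 1
    layout.reverse()  # the first level consulted comes first
    return layout


def table_index(virtual_addr, high, low):
    """Extract the index into one level's table from bits high:low."""
    return (virtual_addr >> low) & ((1 << (high - low + 1)) - 1)
```

Calling level_layout(65536, 33) returns [(2, 32, 29), (3, 28, 16)], matching the example above. For the address 0x123456789 the level 2 index is 9, the level 3 index is 0x345, and the remaining 0x6789 is the offset into the page.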

Page Table Entries

Each entry in a page table is eight bytes long, and can be in one of the following formats:

Invalid entry: represents unmapped addresses, access to which will always cause an exception.
Table entry: contains the address of the table to use at the next level.
Block entry: contains a physical address and specifies that all remaining virtual address bits should be used directly in the physical address, skipping all remaining levels.
Page entry: contains a physical address of the final page.

Table entries are valid at levels 0 through 2, and block entries at levels 1 and 2. Only page entries are valid at level 3.

Any entry whose low-order bit is zero is an invalid entry. The operating system can store any data in the remaining bits, such as information about where some data has been stored on disk as part of a swap partition or swap file, but any such data format is decided by the operating system rather than by the CPU architecture.

A table entry has the following format:

Bits	Content
1:0	Type Specifier `0b11`
11:2	Ignored
47:12	Next Table Address
51:48	Reserved; set to zero
58:52	Ignored
59	PXNTable: Hierarchical privileged execute never
60	XNTable: Hierarchical execute never
62:61	APTable: Hierarchical access permissions
63	NSTable: Hierarchical "not secure" flag

The PXNTable, XNTable, APTable, and NSTable fields impose various constraints on all results accessed through this entry. These all echo fields on the leaf entries (page and block), and so we'll discuss them further in that context.
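As an illustration of the table entry layout, here is a small Python sketch that pulls the documented fields out of a 64-bit entry. It assumes the 4KB granule layout shown in the table above, and the function name and dictionary keys are hypothetical.

```python
def decode_table_entry(entry):
    """Split a table entry into its fields (4KB granule layout assumed)."""
    if entry & 0b11 != 0b11:
        raise ValueError("not a table entry: low bits are not 0b11")
    return {
        "next_table": entry & 0x0000FFFFFFFFF000,  # bits 47:12, already an address
        "pxn_table": (entry >> 59) & 1,
        "xn_table": (entry >> 60) & 1,
        "ap_table": (entry >> 61) & 0b11,
        "ns_table": (entry >> 63) & 1,
    }
```

Because the next-level table is page-aligned, masking bits 47:12 yields its physical address directly, with no shifting needed.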

Page entries and block entries both have a similar structure, because they both represent leaf entries that end the traversal. The page entry format, which is valid only at level 3, is as follows:

Bits	Content
1:0	Type Specifier `0b11`
4:2	AttrIdx: Memory attributes index
5	NS: Non-secure flag
7:6	AP[2:1]: Access permissions
9:8	SH: Shareability
10	AF: Access Flag
11	nG: Non-global flag
47:12	Final page physical address
50:48	Reserved; set to zero
51	DBM: Dirty bit modifier flag
52	Contiguous flag
53	PXN: Privileged execute never
54	UXN: Unprivileged execute never
63:55	Ignored (but some bits used by extensions)
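The page entry layout above can be assembled field by field, as in this Python sketch. It assumes the 4KB granule with a 48-bit physical address, and leaves the NS, DBM, and contiguous bits at zero. The function name and parameter names are hypothetical.

```python
def make_page_entry(phys_addr, attr_idx, ap, sh, af=1, ng=0, pxn=0, uxn=0):
    """Assemble a level 3 page entry (4KB granule, 48-bit output address)."""
    if phys_addr & 0xFFF or phys_addr >> 48:
        raise ValueError("physical address must be 4 kiB aligned and fit in 48 bits")
    entry = 0b11                     # type specifier for a page entry
    entry |= (attr_idx & 0b111) << 2  # AttrIdx, bits 4:2
    entry |= (ap & 0b11) << 6         # AP[2:1], bits 7:6
    entry |= (sh & 0b11) << 8         # SH, bits 9:8
    entry |= (af & 1) << 10           # AF, bit 10
    entry |= (ng & 1) << 11           # nG, bit 11
    entry |= phys_addr                # bits 47:12 hold the page address as-is
    entry |= (pxn & 1) << 53          # PXN, bit 53
    entry |= (uxn & 1) << 54          # UXN, bit 54
    return entry
```

For example, make_page_entry(0x40000000, attr_idx=1, ap=0b01, sh=0, ng=1, pxn=1) yields 0x0020000040000C47.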

The block entry format, valid in levels 1 and 2, is largely the same. The most significant difference is that the low-order bits are set to 0b01 to distinguish block entries from table entries. Otherwise, a block entry functions essentially as a page entry for a significantly larger page -- a "block" -- assigning a single set of attributes to a larger consecutive range of addresses.
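Putting the type specifiers together, a walk can tell the entry kinds apart from just the two low-order bits and the current level. The following Python sketch shows one way to express that; the function name is hypothetical, and it does not check whether a block entry is actually permitted at the given level for the configured granule.

```python
def classify_entry(entry, level):
    """Classify a page table entry by its low two bits and its level."""
    low_bits = entry & 0b11
    if low_bits & 1 == 0:
        return "invalid"  # low-order bit clear: unmapped
    if level == 3:
        # Only 0b11 is meaningful at the last level; 0b01 is treated as invalid.
        return "page" if low_bits == 0b11 else "invalid"
    return "table" if low_bits == 0b11 else "block"
```

Note that 0b11 means "table" at levels 0 through 2 but "page" at level 3, which is why the level has to be passed in alongside the entry.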

The final page physical address bits shown above apply to the 4KB granule size, and this overall structure is describing only the base specification where the resulting address is 48-bit. Increasing the granule size changes the size of this field.

For example, with a 64KB granule size the translation walk only needs to determine bits 16 and above, and so in the base specification the physical address occupies only bits 47:16, and thus bits 15:12 must be zero. If the ARMv8.2-LPA extension is implemented and active, bits 15:12 are repurposed as physical address bits 51:48, thereby allowing a 52-bit output address.
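The 64KB case can be sketched in Python as follows. The function name and the lpa parameter are hypothetical; the sketch simply follows the bit assignments described above.

```python
def output_address_64k(entry, lpa=False):
    """Extract the output address from a 64KB-granule page entry."""
    addr = entry & 0x0000FFFFFFFF0000  # entry bits 47:16 are address bits 47:16
    if lpa:
        # With the large physical address extension, entry bits 15:12
        # carry physical address bits 51:48.
        addr |= ((entry >> 12) & 0xF) << 48
    return addr
```

For an entry of 0x000012345678A003, the base interpretation gives 0x123456780000, while the extended interpretation gives 0xA123456780000.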

The AttrIdx field is a 3-bit index into the lookup table of eight memory attribute entries stored in register MAIR_EL1. The memory attributes specify memory type (normal vs. device memory) and caching constraints.
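The indirection through MAIR_EL1 can be sketched as follows: the register holds eight one-byte attribute encodings, and AttrIdx selects one of them. The three encodings named here are not from this article; they are common values as best recalled from the architecture reference manual and should be verified there. The function names are hypothetical.

```python
# Assumed attribute encodings; verify against the architecture reference manual.
DEVICE_NGNRNE = 0x00         # strictest device memory type
NORMAL_NON_CACHEABLE = 0x44  # normal memory, not cached
NORMAL_WRITE_BACK = 0xFF     # normal memory, write-back cacheable


def mair_el1(attrs):
    """Pack up to eight attribute bytes into a MAIR_EL1 value, slot 0 first."""
    if len(attrs) > 8:
        raise ValueError("MAIR_EL1 has only eight attribute slots")
    value = 0
    for index, attr in enumerate(attrs):
        value |= (attr & 0xFF) << (8 * index)
    return value


def attr_for_index(mair, attr_idx):
    """Look up the attribute byte that a given AttrIdx selects."""
    return (mair >> (8 * attr_idx)) & 0xFF
```

An operating system typically fills in a handful of slots once at boot, and then each page entry only needs to carry the small index.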

The non-secure flag (NS) is relevant only when using the "TrustZone" features, which are not interesting for my goals and so I won't discuss that further here.

The access permissions (AP) field specifies the data access permissions for both privileged and unprivileged load/store instructions. The high bit selects between read/write or read-only access, and the low bit specifies whether unprivileged access is allowed, giving the following combinations:

AP[2:1] with `PSTATE.PAN` clear	Privileged	Unprivileged
`0b00`	Read/write	None
`0b01`	Read/write	Read/write
`0b10`	Read-only	None
`0b11`	Read-only	Read-only

It's not possible to deny privileged access while allowing unprivileged access at the table entry level, but the PSTATE register has a global Privileged Access Never (PAN) flag, which redefines the table to have the following meanings, making privileged and unprivileged access mutually exclusive:

AP[2:1] with `PSTATE.PAN` set	Privileged	Unprivileged
`0b00`	Read/write	None
`0b01`	None	Read/write
`0b10`	Read-only	None
`0b11`	None	Read-only

Configuring the processor in this way can reduce the risk of the operating system being tricked into acting on user data when it was supposed to be acting on privileged data. The operating system can still access user data regions explicitly using the unprivileged load/store instructions, such as when copying data from process memory into OS memory during a system call.
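Both tables can be captured in one small function, sketched here in Python. The function name and parameters are hypothetical. An unprivileged load/store instruction issued by the operating system would be modelled by passing privileged=False.

```python
def data_access_allowed(ap, privileged, write, pan=False):
    """Check a data access against AP[2:1], following the two tables above."""
    read_only = bool(ap & 0b10)        # high bit: read-only rather than read/write
    unprivileged_ok = bool(ap & 0b01)  # low bit: unprivileged access permitted
    if write and read_only:
        return False
    if privileged:
        # With PAN set, privileged code loses access to pages that
        # unprivileged code is able to access.
        return not (pan and unprivileged_ok)
    return unprivileged_ok
```

For instance, with AP[2:1] set to 0b01, a privileged read is allowed when PAN is clear but denied when PAN is set, while unprivileged access is unaffected.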

The Access Flag (AF) must be set for any access to the target page to succeed. If not set, any access will raise an exception.

The non-global flag (nG) indicates whether a page is specific to one address space. When it is clear the page is global: it is visible in all address spaces, and thus changing the active address space (e.g. when switching between processes) should not disturb any entries in the translation lookaside buffer (TLB) for this page. When it is set, TLB entries for the page are tagged with the current address space ID and only match while that address space is active. The TLB is a cache of translation results, so that the CPU can amortize the cost of a translation walk over multiple accesses.

I'm going to skip over the dirty bit modifier and contiguous flags here, because I'm not using them in my current project and so I've not studied them in any detail.

The two "execute never" flags PXN and UXN prevent execution of instructions in privileged and unprivileged contexts respectively. In the base specification, execution access and data access are independent, but it's typical to arrange for write access and execution access to be mutually exclusive to make it harder for an attacker to introduce new code into a running process. Setting the WXN flag in register SCTLR_EL1 makes the CPU enforce that convention: any region that is writable at a particular access level is treated as execute never at that level regardless of the PXN/UXN bits.

Memory Aborts

Unless a particular address matches a valid table entry with suitable attributes, a memory access will raise an exception. There are various types of exceptions that can arise, which are broadly described as "memory aborts" in the technical reference manual.

For the sake of this article I'll focus on the subset known as "MMU faults", since those are the ones that result directly from the contents of the page tables.

As with all exceptions, the exception type can be determined by analyzing the Exception Syndrome Register, ESR_EL1.

The following are the fault types you're most likely to encounter:

Permission Fault: The access was not permitted by the AP, PXN, or UXN fields of the leaf table entry, or by the APTable, PXNTable, and XNTable fields of a table entry at an earlier level.
The FAR_EL1 register might contain the address whose access was denied, if the relevant flag is set in ESR_EL1.
Translation Fault: The translation walk encountered an entry that is somehow invalid, such as having its low bit set to zero or reserved bits not set to the documented reserved value.
A translation fault is also raised when accessing addresses that don't fit into the range determined by the virtual address size. For example, if TCR_EL1 specifies that the low address area has a 32-bit virtual address size then 0xffffffff would be the highest valid address in that area, and so accessing 0x100000000, or any other address in the low half of the memory space greater than the maximum, would raise a translation fault at level zero.
Access Flag Fault: The leaf table entry (either page or block) had the access flag (AF) set to zero.
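A fault handler usually begins by picking these cases apart from the ESR_EL1 value. The Python sketch below shows the general shape. The field positions (exception class in bits 31:26, fault status code in bits 5:0) and the specific code values are not given in this article; they are as best recalled from the architecture reference manual and should be checked there. The function name and dictionary keys are hypothetical.

```python
# Assumed exception class values for aborts; verify against the manual.
ABORT_CLASSES = {
    0x20: "instruction abort from a lower exception level",
    0x21: "instruction abort from the same exception level",
    0x24: "data abort from a lower exception level",
    0x25: "data abort from the same exception level",
}

# Assumed upper four bits of the fault status code; the low two bits give the level.
FAULT_KINDS = {0b0001: "translation", 0b0010: "access flag", 0b0011: "permission"}


def decode_mmu_fault(esr):
    """Return a description of an MMU fault, or None for anything else."""
    exception_class = (esr >> 26) & 0x3F
    if exception_class not in ABORT_CLASSES:
        return None
    status = esr & 0x3F
    kind = FAULT_KINDS.get(status >> 2)
    if kind is None:
        return None  # some other kind of abort, not covered here
    return {
        "class": ABORT_CLASSES[exception_class],
        "kind": kind,
        "level": status & 0b11,
    }
```

The level reported is the translation level at which the walk stopped, which helps to narrow down which page table holds the offending entry.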

The situations that cause translation faults and access flag faults prevent an entry from being added to the translation lookaside buffer (TLB), and so the TLB contents cannot cause those fault types. However, permission faults are raised for valid page table entries which don't grant the required access type, and so it's actually the TLB contents that cause such exceptions, rather than the page table contents directly. An unexplained permission fault for a seemingly-correct page table entry can therefore be caused by incorrect TLB management, causing the TLB contents to be out of sync with the underlying page table memory.

TLB Maintenance

If the operating system changes the page table base addresses or contents after the MMU has been used, the relevant entries in the translation lookaside buffer must be invalidated. There are some machine instructions dedicated to this purpose.

All of the TLB maintenance instructions invalidate some subset of the buffer, at different levels of granularity. In Arm assembly language, these instructions all share the mnemonic TLBI, which is then followed by a keyword which specifies what subset of the buffer should be invalidated.

The general form, then, is:

TLBI <operation>, <Xt>

Where Xt represents a register containing an operand, assuming that the selected operation requires one. Some operations do not require an operand, in which case the comma and the register name must be omitted.

The <operation> keyword has the structure <range><type><level><sharability>, where

<range> is either R or omitted. If present, the operand represents a range of addresses. If omitted, the operand represents a single address.
<type> specifies what type of entries are being invalidated, and therefore what the register operand represents and whether it should be present.
I won't cover all of the possibilities here, but the main interesting ones for my purposes are:
- ALL: All translations.
- ASID: All translations with a specific address space ID. <level> must specify EL1 in this case.
- VA: Translations with the specified virtual address for a specific address space ID.
- VAA: Translations with the specified virtual address in any address space.
<level> specifies the exception level whose translation regime the invalidation applies to. Since this article is concerned only with virtual memory at the operating system level, this should always select EL1, which is specified as E1.
<sharability> specifies how widely the invalidation is broadcast. If omitted, it applies only to the TLB of the core ("processing element") executing the instruction. Specifying IS (inner shareable) or OS (outer shareable) also applies it to the other cores in the same inner or outer shareable domain, which matters when several cores share the same page tables.

In my specific project I'm only really using the following two forms:

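TLBI ALLE1: Invalidate everything that's in the translation regime controlled by my operating system. (This form can only be executed from EL2 or above; code running at EL1 itself would use TLBI VMALLE1 for the same purpose.)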
TLBI VAAE1, Xt: Invalidate the translations for addresses in a specific virtual page, whose address is encoded in bits 43:0 of the data in Xt. Note that these bits encode a page address, and so encode bits 55:12 of the virtual address.
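The operand encoding for the by-address form is easy to get wrong, since the register holds a page number rather than an address. Here is a minimal Python sketch of the conversion, with a hypothetical function name; the actual invalidation still has to be issued from assembly.

```python
def tlbi_va_operand(virtual_addr):
    """Encode a virtual address as the register operand for a by-address TLBI.

    Bits 55:12 of the address are placed in bits 43:0 of the operand.
    """
    return (virtual_addr >> 12) & ((1 << 44) - 1)
```

For example, the address 0xFFFF000012345678 encodes as 0xFF000012345: the low 12 bits (the offset within the page) and the top 8 bits are dropped.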

TLBI instructions should typically be followed by a DSB (data synchronization barrier) instruction, to ensure that the effects of the invalidation will be visible to all subsequent memory accesses.

To correctly preserve various other abstractions offered by the CPU cores, a number of different kinds of changes require following a "break-before-make" sequence, where the relevant table entry is first set to be invalid, and then set to the new valid value. The overall sequence would therefore be:

Overwrite the previously-valid table entry with an invalid one.
Use a TLBI instruction followed by a DSB instruction.
Write the new valid entry into the page table.
Use an additional DSB instruction to ensure that write is visible to subsequent code.
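The reason for this ordering is easier to see in a toy model. The following Python sketch is only a simulation, not real MMU code: a dictionary stands in for the page table, a second dictionary stands in for the TLB, and the class and method names are hypothetical. The barrier steps have no equivalent here and appear only as comments.

```python
class ToyMMU:
    """A toy model with a page table and a TLB that caches its entries."""

    def __init__(self):
        self.table = {}  # virtual page -> physical page; absent means invalid
        self.tlb = {}    # cached translation results

    def translate(self, virtual_page):
        """Translate via the TLB, walking the table only on a miss."""
        if virtual_page not in self.tlb:
            # A KeyError here stands in for a translation fault.
            self.tlb[virtual_page] = self.table[virtual_page]
        return self.tlb[virtual_page]

    def invalidate(self, virtual_page):
        """Stands in for a TLBI instruction followed by a DSB."""
        self.tlb.pop(virtual_page, None)

    def remap(self, virtual_page, new_physical_page):
        """Change a mapping using the break-before-make sequence."""
        self.table.pop(virtual_page, None)           # 1. make the entry invalid
        self.invalidate(virtual_page)                # 2. TLBI, then DSB
        self.table[virtual_page] = new_physical_page  # 3. write the new entry
        # 4. real hardware needs another DSB here
```

In this model, writing directly to table after a translation has been cached leaves translate returning the old result, which is the stale-TLB situation the sequence exists to prevent. Going through remap instead always yields the new mapping.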

The list of situations where a break-before-make is required is quite lengthy and so I won't reproduce it here, but as a mental shortcut I think of the following broad situations:

If the page table change would cause any virtual address to read as a different result than before, or if either the old or the new address is writable and therefore could potentially change concurrently with the page table update.
If changing the level at which any address is resolved, such as replacing a block entry with a table entry.
If changing an entry from global to not-global or vice-versa, to avoid potentially creating both local and global TLB entries for the same address.
If the underlying memory would now be accessed in a different way, such as a different memory type or cacheability.

The overall goal is to ensure that there cannot be two competing views of the same virtual address space captured in the TLB at the same time.