Guest RAM is managed as individual pages but allocated from the host OS in chunks for reasons of portability and efficiency. To minimize the memory footprint, all tracking structures must be as small as possible without incurring unnecessary performance penalties.
The allocation chunks have a fixed size, defined at compile time by the GMM_CHUNK_SIZE #define.
Each chunk is given a unique ID. Each page also has a unique ID. The relationship between the two IDs is that the page ID is derived from the chunk ID and the page's index within the chunk.
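As a minimal sketch of this relationship (the shift value and helper names below are illustrative assumptions, not the real GMM definitions), the page ID embeds the chunk ID in its upper bits and the in-chunk page index in its lower bits:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch only: with 1 MB chunks and 4 KB pages there are
 * 256 pages per chunk, so 8 bits suffice for the in-chunk page index.
 * The shift and helper names are assumptions, not the real GMM names. */
#define GMM_CHUNKID_SHIFT_SKETCH 8 /* log2(GMM_CHUNK_SIZE / PAGE_SIZE) */

static uint32_t gmmSketchPageId(uint32_t idChunk, uint32_t iPage)
{
    return (idChunk << GMM_CHUNKID_SHIFT_SKETCH) | iPage;
}

static uint32_t gmmSketchPageIdToChunkId(uint32_t idPage)
{
    return idPage >> GMM_CHUNKID_SHIFT_SKETCH;
}
```

This keeps a page ID self-describing: the chunk it belongs to and its slot within that chunk fall out of simple shifts and masks, with no extra lookup table.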
The physical address of each page in an allocation chunk is maintained by the RTR0MEMOBJ and can be obtained using RTR0MemObjGetPagePhysAddr. There is no need to duplicate this information (it would cost 8 bytes per page if we did).
So what do we need to track per page? Most importantly we need to know which state the page is in:
For the page replacement operations (sharing, defragmenting, and freeing) to be reasonably efficient, private pages need to be associated with a particular page in a particular VM.
Tracking the usage of shared pages is impractical and expensive, so we'll settle for a reference counting system instead.
Free pages will be chained on LIFOs.
On 64-bit systems we will use a 64-bit bitfield per page, while on 32-bit systems a 32-bit bitfield will have to suffice because of address space limitations. The GMMPAGE structure shows the details.
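As a hedged illustration of what such a 64-bit bitfield could look like (the field names and widths here are assumptions, not the real GMMPAGE layout), each of the three states gets its own view of the same 64-bit word, discriminated by a small state field:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 64-bit per-page record: a 2-bit state selector plus
 * state-specific payload.  Field names and widths are illustrative,
 * not the actual GMMPAGE definition. */
typedef union GMMPAGESKETCH
{
    struct
    {
        uint64_t u62Payload : 62; /* interpreted according to the state */
        uint64_t u2State    : 2;  /* private, shared or free */
    } Common;
    struct
    {
        uint64_t pfn         : 36; /* host page frame number */
        uint64_t idGuestPage : 26; /* which guest page in which VM it backs */
        uint64_t u2State     : 2;
    } Private;
    struct
    {
        uint64_t cRefs   : 62; /* reference count for shared pages */
        uint64_t u2State : 2;
    } Shared;
    struct
    {
        uint64_t idNext  : 62; /* next page in the free LIFO chain */
        uint64_t u2State : 2;
    } Free;
} GMMPAGESKETCH;
```

The union keeps the per-page cost at exactly one 64-bit word while still recording the per-state information listed above: the VM/page association for private pages, a reference count for shared pages, and a chain link for free pages.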
The first approach is to manage the free pages in two sets, depending on whether they are mainly intended for the allocation of shared or private pages. In the initial implementation there will be almost no mixing of shared and private pages within the same chunk (only when we are really stressed for memory), but once we implement forking of VMs and have to deal with lots of COW pages it will start getting interesting.
The sets are lists of chunks with approximately the same number of free pages. Say the chunk size is 1 MB, meaning 256 pages, and a set consists of 16 lists. The first list will then contain the chunks with 1-15 free pages, the second covers 16-31, and so on. Chunks are moved between the lists as pages are freed or allocated.
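A minimal sketch of the list selection, assuming the 1 MB / 256-page / 16-list configuration above (the function name is made up for illustration):

```c
#include <assert.h>

/* Map a chunk's free-page count to its list within a set.  With 256
 * pages per chunk and 16 lists, each list covers a band of 16 counts:
 * list 0 holds chunks with 1-15 free pages (a chunk with 0 free pages
 * has nothing to offer), list 1 holds 16-31, and so on. */
static unsigned gmmSketchSelectFreeList(unsigned cFreePages)
{
    unsigned iList = cFreePages / 16;
    return iList < 16 ? iList : 15; /* a fully free chunk (256) lands in the last list */
}
```

This makes the allocator's search cheap: to find a chunk with at least N free pages it can start at list N/16 and walk upwards, and moving a chunk between lists after an allocation or free is a constant-time unlink/relink.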
On Windows the per-page RTR0MEMOBJ cost is 32 bits on 32-bit Windows and 64 bits on 64-bit Windows (a PFN_NUMBER in the MDL); so, 64 bits per page. The cost on Linux is identical, but there it is because of sizeof(struct page *).
In legacy mode, where the page source is locked user pages rather than ring-0 allocations, a page can only be allocated by the VM that locked it. We will make no attempt at implementing page sharing on these systems, just do enough to make it all work.
There are some challenges here; the main ones are configurability and security. Should we, for instance, permit anyone to request 100% memory commitment? Who should be allowed to make runtime adjustments to the configuration? And how do we prevent these settings from being lost when the last VM process exits? The solution is probably an optional root daemon that keeps VMMR0.r0 in memory and enables the security measures.
The preliminary guess is that we will have to try to allocate memory as close as possible to the CPUs the VM is executing on (the EMT and additional CPU threads), which means it is mostly about allocation and sharing policies. Both the scheduler and the allocator interface will need to supply some NUMA info, and we will need a way to calculate access costs.