Building a VM

flowchart LR

subgraph xenopsd: VM_build micro-op
    direction LR

    VM_build(VM_build)
        --> VM.build(VM.build)
            --> VM.build_domain(VM.build_domain)
                --> VM.build_domain_exn(VM.build_domain_exn)
                    --> Domain.build(Domain.build)
end

click VM_build "
https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/lib/xenops_server.ml#L2255-L2271" _blank
click VM.build "
https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/xc/xenops_server_xen.ml#L2290-L2291" _blank
click VM.build_domain "
https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/xc/xenops_server_xen.ml#L2250-L2288" _blank
click VM.build_domain_exn "
https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/xc/xenops_server_xen.ml#L2024-L2248" _blank
click Domain.build "
https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/xc/domain.ml#L1111-L1210" _blank

Walk-through documents for the VM_build phase:

  • VM_build μ-op

    Overview of the VM_build μ-op (runs after the VM_create μ-op created the domain).

  • Domain.build

    Prepare the build of a VM: Wait for scrubbing, do NUMA placement, run xenguest.

  • xenguest

    Perform building VMs: Allocate and populate the domain's system memory.

Subsections of Building a VM

VM_build micro-op

Overview

On Xen, Xenctrl.domain_create creates an empty domain and returns the domain ID (domid) of the new domain to xenopsd.

In the build phase, the xenguest program is called to create the system memory layout of the domain, set the vCPU affinity, and more.

The VM_build micro-op collects the VM build parameters and calls VM.build, which calls VM.build_domain, which calls VM.build_domain_exn, which calls Domain.build:

flowchart LR

subgraph xenopsd: VM_build micro-op
    direction LR

    VM_build(VM_build)
        --> VM.build(VM.build)
            --> VM.build_domain(VM.build_domain)
                --> VM.build_domain_exn(VM.build_domain_exn)
                    --> Domain.build(Domain.build)
end

click VM_build "
https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/lib/xenops_server.ml#L2255-L2271" _blank
click VM.build "
https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/xc/xenops_server_xen.ml#L2290-L2291" _blank
click VM.build_domain "
https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/xc/xenops_server_xen.ml#L2250-L2288" _blank
click VM.build_domain_exn "
https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/xc/xenops_server_xen.ml#L2024-L2248" _blank
click Domain.build "
https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/xc/domain.ml#L1111-L1210" _blank

The function VM.build_domain_exn must:

  1. Run pygrub (or eliloader) to extract the kernel and initrd, if necessary

  2. Call Domain.build to

    • optionally run NUMA placement and
    • invoke xenguest to set up the domain memory.

    See the walk-through of the Domain.build function for more details on this phase.

  3. Apply the cpuid configuration

  4. Store the current domain configuration on disk – it’s important to know the difference between the configuration you started with and the configuration you would use after a reboot, because some properties (such as maximum memory and vCPUs) are fixed on create.

Domain.build

Overview

flowchart LR
subgraph xenopsd VM_build[
  xenopsd thread pool with two VM_build micro#8209;ops:
  During parallel VM_start, many threads run this in parallel!
]
direction LR
build_domain_exn[
  VM.build_domain_exn
  from thread pool Thread #1
]  --> Domain.build
Domain.build --> build_pre
build_pre --> wait_xen_free_mem
build_pre -->|if NUMA/Best_effort| numa_placement
Domain.build --> xenguest[Invoke xenguest]
click Domain.build "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L1111-L1210" _blank
click build_domain_exn "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/xenops_server_xen.ml#L2222-L2225" _blank
click wait_xen_free_mem "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L236-L272" _blank
click numa_placement "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L862-L897" _blank
click build_pre "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L899-L964" _blank
click xenguest "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L1139-L1146" _blank

build_domain_exn2[
  VM.build_domain_exn
  from thread pool Thread #2]  --> Domain.build2[Domain.build]
Domain.build2 --> build_pre2[build_pre]
build_pre2 --> wait_xen_free_mem2[wait_xen_free_mem]
build_pre2 -->|if NUMA/Best_effort| numa_placement2[numa_placement]
Domain.build2 --> xenguest2[Invoke xenguest]
click Domain.build2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L1111-L1210" _blank
click build_domain_exn2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/xenops_server_xen.ml#L2222-L2225" _blank
click wait_xen_free_mem2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L236-L272" _blank
click numa_placement2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L862-L897" _blank
click build_pre2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L899-L964" _blank
click xenguest2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L1139-L1146" _blank
end

VM.build_domain_exn calls Domain.build, which calls:

  • build_pre to prepare the build of the VM:
    • If the xe configuration option numa_placement is set to Best_effort, it invokes the NUMA placement algorithm.
  • xenguest to invoke the xenguest program that sets up the domain’s system memory.

build_pre: Prepare building the VM

Domain.build calls build_pre (which is also used for VM restore) to:

  1. Call wait_xen_free_mem to wait, if necessary, for the Xen memory scrubber to catch up reclaiming memory (see the sketch after this list). It

    1. calls Xenctrl.physinfo, which returns:
      • free_pages - the free and already scrubbed pages (available)
      • scrub_pages - the not yet scrubbed pages (not yet available)
    2. repeats this check, as long as free_pages is lower than the number of required pages, until a timeout is reached
      • unless scrub_pages is 0 (no scrubbing left to do, so waiting would not help)

    Note: free_pages is system-wide memory, not memory specific to a NUMA node. Because this check is not NUMA-aware, it is not sufficient to prevent the VM from being spread over all NUMA nodes in case of a temporary node-specific memory shortage. It is planned to resolve this issue by claiming NUMA node memory during NUMA placement.

  2. Call the hypercall to set the timer mode

  3. Call the hypercall to set the number of vCPUs

  4. Call the numa_placement function as described in the NUMA feature description when the xe configuration option numa_placement is set to Best_effort (except when the VM has a hard CPU affinity).

    match !Xenops_server.numa_placement with
    | Any ->
        ()
    | Best_effort ->
        log_reraise (Printf.sprintf "NUMA placement") (fun () ->
            if has_hard_affinity then
              D.debug "VM has hard affinity set, skipping NUMA optimization"
            else
              numa_placement domid ~vcpus
                ~memory:(Int64.mul memory.xen_max_mib 1048576L)
        )
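
Returning to step 1 above: as an illustration of the check that wait_xen_free_mem performs, here is a minimal C sketch written directly against the libxenctrl API that Xenctrl.physinfo wraps. It is not the xenopsd code; the required amount of memory, the polling interval, and the retry limit are assumptions made for the example.

/* Illustrative sketch only: poll Xen's physinfo until enough scrubbed memory
 * is free, scrubbing has finished, or a retry limit is reached.
 * Build against libxenctrl, e.g.: gcc wait_free_mem.c -lxenctrl */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <xenctrl.h>

static bool wait_xen_free_mem(xc_interface *xch, uint64_t required_pages)
{
    /* The retry limit and the 250 ms interval are illustrative assumptions. */
    for (int attempt = 0; attempt < 120; attempt++) {
        xc_physinfo_t info = { 0 };

        if (xc_physinfo(xch, &info))
            return false;                 /* hypercall failed */
        if (info.free_pages >= required_pages)
            return true;                  /* enough scrubbed memory is free */
        if (info.scrub_pages == 0)
            return false;                 /* nothing left to scrub: waiting won't help */
        usleep(250000);                   /* let the scrubber catch up */
    }
    return false;                         /* timed out */
}

int main(void)
{
    xc_interface *xch = xc_interface_open(NULL, NULL, 0);

    if (!xch)
        return 1;
    /* Example: wait until 2 GiB worth of 4 KiB pages are free and scrubbed. */
    printf("enough free memory: %d\n",
           wait_xen_free_mem(xch, (2ULL << 30) / 4096));
    xc_interface_close(xch);
    return 0;
}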

NUMA placement

build_pre passes the domid, the number of vCPUs and xen_max_mib to the numa_placement function to run the algorithm to find the best NUMA placement.

When it returns a NUMA node to use, it calls the Xen hypercalls to set the vCPU affinity to this NUMA node:

  let vm = NUMARequest.make ~memory ~vcpus in
  let nodea =
    match !numa_resources with
    | None ->
        Array.of_list nodes
    | Some a ->
        Array.map2 NUMAResource.min_memory (Array.of_list nodes) a
  in
  numa_resources := Some nodea ;
  Softaffinity.plan ~vm host nodea

With Xen’s auto_node_affinity feature enabled (the default), setting the vCPU affinity causes the Xen hypervisor to derive the domain’s NUMA node affinity from it, so that memory allocations are aligned with the vCPU affinity of the domain.

Summary: This tells the hypervisor that memory allocations for this domain should preferably be served from this NUMA node.
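
At the hypercall level, this step can be illustrated with the following C sketch. It is illustrative only (xenopsd performs it through its OCaml Xenctrl bindings): it gives every vCPU a soft affinity covering the CPUs of the chosen node; deriving that CPU range from the host topology is left to the caller here.

/* Illustrative sketch: give all vCPUs of a domain a soft affinity covering
 * the CPUs of one NUMA node. */
#include <stdint.h>
#include <stdlib.h>
#include <xenctrl.h>

static int set_soft_affinity(xc_interface *xch, uint32_t domid, int vcpus,
                             int first_cpu, int ncpus)
{
    xc_cpumap_t soft = xc_cpumap_alloc(xch);   /* zeroed bitmap of host CPUs */
    int rc = -1;

    if (!soft)
        return rc;
    for (int cpu = first_cpu; cpu < first_cpu + ncpus; cpu++)
        soft[cpu / 8] |= 1 << (cpu % 8);       /* CPUs of the chosen NUMA node */

    rc = 0;
    for (int vcpu = 0; vcpu < vcpus && rc == 0; vcpu++)
        /* Soft affinity only: with auto_node_affinity enabled (the default),
         * Xen derives the domain's node_affinity for memory allocation from it. */
        rc = xc_vcpu_setaffinity(xch, domid, vcpu, NULL, soft,
                                 XEN_VCPUAFFINITY_SOFT);
    free(soft);
    return rc;
}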

Invoke the xenguest program

With the preparation in build_pre completed, Domain.build calls the xenguest function to invoke the xenguest program to build the domain.

Notes on future design improvements

The Xen domain feature flag domain->auto_node_affinity can be disabled by calling xc_domain_node_setaffinity() to set a specific NUMA node affinity in special cases:

This can be used, for example, when there might not be enough memory on the preferred NUMA node, and there are other NUMA nodes (in the same CPU package) to use (reference).
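
As a reference for that possible future use, a C sketch of such a call is shown below. It is illustrative only and is not what xenopsd does today; the fixed-size node bitmap is a simplification.

/* Illustrative sketch only: restrict a domain's NUMA node affinity to one
 * node with xc_domain_node_setaffinity(). The fixed-size nodemap is a
 * simplification; real code would size it from the host's node count. */
#include <stdint.h>
#include <string.h>
#include <xenctrl.h>

static int set_node_affinity(xc_interface *xch, uint32_t domid, unsigned int node)
{
    uint8_t nodemap[8];                       /* bitmap of NUMA nodes (assumed size) */

    memset(nodemap, 0, sizeof(nodemap));
    nodemap[node / 8] |= 1 << (node % 8);     /* allow memory only from this node */

    /* Note: this disables the domain's auto_node_affinity behaviour. */
    return xc_domain_node_setaffinity(xch, domid, nodemap);
}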

xenguest

Introduction

xenguest is called by the xenopsd Domain.build function to perform the build phase for new VMs, which is part of the xenopsd VM_build micro-op:

Domain.build calls xenguest (during boot storms, many run in parallel to accelerate boot storm completion), and during migration, emu-manager also calls xenguest:

flowchart
subgraph "xenopsd & emu-manager call xenguest:"
direction LR
xenopsd1(Domain.build for VM #1) --> xenguest1(xenguest for #1)
xenopsd2(<tt>emu-manager</tt> for VM #2) --> xenguest2(xenguest for #2)
xenguest1 --> libxenguest(libxenguest)
xenguest2 --> libxenguest2(libxenguest)
click xenopsd1 "../Domain.build/index.html"
click xenopsd2 "../Domain.build/index.html"
click xenguest1 "https://github.com/xenserver/xen.pg/blob/XS-8/patches/xenguest.patch" _blank
click xenguest2 "https://github.com/xenserver/xen.pg/blob/XS-8/patches/xenguest.patch" _blank
click libxenguest "https://github.com/xen-project/xen/tree/master/tools/libs/guest" _blank
click libxenguest2 "https://github.com/xen-project/xen/tree/master/tools/libs/guest" _blank
libxenguest --> Xen(Xen<br>Hypercalls,<p>e.g.:<p><tt>XENMEM<p>populate<p>physmap)
libxenguest2 --> Xen
end

Historical heritage

xenguest was created as a separate program due to issues with libxenguest:

  • It wasn’t thread-safe: this is fixed, but it still uses a per-call global struct.
  • It had an incompatible licence: it is now licensed under the LGPL.

Those issues have been addressed, but we still shell out to xenguest. It is currently carried in the patch queue for the Xen hypervisor packages, but could become an individual package once planned changes to the Xen hypercalls are stabilised.

Over time, xenguest evolved to build more of the initial domain state.

Details

The details of the invocation of xenguest, the build modes, and the VM memory setup are described in these child pages:

  • Invocation

    Invocation of xenguest and the interfaces used for it

  • Build Modes

    Description of the xenguest build modes (HVM, PVH, PV) with focus on HVM

  • Memory Setup

    Creation and allocation of the boot memory layout of VMs

Subsections of xenguest

Invocation

Interface to xenguest

xenopsd passes this information to xenguest (for migration, using emu-manager):

  • The domain type, using the command-line option --mode <type>_build
  • The domid of the created empty domain
  • The amount of system memory of the domain
  • A number of other parameters that are domain-specific

xenopsd uses the Xenstore to provide platform data:

  • the statically configured vCPU hard-affinity, in case the domain has a VCPUs-mask
  • the vCPU credit2 weight/cap parameters
  • whether the NX bit is exposed
  • whether the viridian CPUID leaf is exposed
  • whether the system has PAE or not
  • whether the system has ACPI or not
  • whether the system has nested HVM or not
  • whether the system has an HPET or not

When called to build a domain, xenguest reads those and builds the VM accordingly.
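
To illustrate the shape of this interface, the following C sketch reads one of these platform keys with libxenstore. The exact Xenstore path layout used here is an assumption for the example; xenguest’s get_flags() is the authoritative consumer of these keys.

/* Illustrative sketch: read a platform key for a domain from the Xenstore
 * using libxenstore. The path layout is an assumption for illustration. */
#include <stdio.h>
#include <stdlib.h>
#include <xenstore.h>

static char *read_platform_key(struct xs_handle *xsh, int domid, const char *key)
{
    char path[256];
    unsigned int len;

    snprintf(path, sizeof(path), "/local/domain/%d/platform/%s", domid, key);
    return xs_read(xsh, XBT_NULL, path, &len);   /* NULL if the key is absent */
}

int main(void)
{
    struct xs_handle *xsh = xs_open(0);
    if (!xsh)
        return 1;

    char *viridian = read_platform_key(xsh, 1, "viridian");  /* domid 1 as an example */
    printf("viridian: %s\n", viridian ? viridian : "<not set>");

    free(viridian);
    xs_close(xsh);
    return 0;
}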

Parameters of the VM build modes

flowchart LR

xenguest_main("
    <tt>xenguest
    --mode hvm_build
    /
    --mode pvh_build
    /
    --mode pv_build
</tt>+<tt>
domid
mem_max_mib
mem_start_mib
image
store_port
store_domid
console_port
console_domid")
    -->  do_hvm_build("<tt>do_hvm_build()</tt> for HVM
    ") & do_pvh_build("<tt>do_pvh_build()</tt> for PVH")
         --> stub_xc_hvm_build("<tt>stub_xc_hvm_build()")

xenguest_main --> do_pv_build(<tt>do_pv_build()</tt> for PV) -->
    stub_xc_pv_build("<tt>stub_xc_pv_build()")

click do_pv_build
"https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L575-L594" _blank
click do_hvm_build
"https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L596-L615" _blank
click do_pvh_build
"https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L617-L640" _blank
click stub_xc_hvm_build
"https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L2329-L2436" _blank

Build Modes

Invocation of the HVM build mode

flowchart LR

xenguest_main("
    <tt>xenguest
    --mode hvm_build
    /
    --mode pvh_build
    /
    --mode pv_build
</tt>+<tt>
domid
mem_max_mib
mem_start_mib
image
store_port
store_domid
console_port
console_domid")
    -->  do_hvm_build("<tt>do_hvm_build()</tt> for HVM
    ") & do_pvh_build("<tt>do_pvh_build()</tt> for PVH")
         --> stub_xc_hvm_build("<tt>stub_xc_hvm_build()")

xenguest_main --> do_pv_build(<tt>do_pv_build()</tt> for PV) -->
    stub_xc_pv_build("<tt>stub_xc_pv_build()")

click do_pv_build
"https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L575-L594" _blank
click do_hvm_build
"https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L596-L615" _blank
click do_pvh_build
"https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L617-L640" _blank
click stub_xc_hvm_build
"https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L2329-L2436" _blank

Walk-through of the HVM build mode

The domain build functions stub_xc_hvm_build() and stub_xc_pv_build() call these functions:

  • get_flags() to get the platform data from the Xenstore for filling out the fields of struct flags and struct xc_dom_image.
  • configure_vcpus(), which uses the platform data from the Xenstore (see the sketch below):
    • When platform/vcpu/<vcpu-num>/affinity is set, it sets the vCPU affinity. By default, this also sets the domain’s node_affinity mask (NUMA nodes), which configures get_free_buddy() to prefer memory allocations from this NUMA node_affinity mask.
    • If platform/vcpu/weight is set, it sets the domain’s scheduling weight.
    • If platform/vcpu/cap is set, it sets the domain’s scheduling cap (% CPU time).
  • <domain_type>_build_setup_mem() to call xc_dom_boot_mem_init().
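
As a rough illustration of the hypercalls behind configure_vcpus() (a hedged sketch, not the code from the xenguest patch), the following sets a hard vCPU affinity and the credit-scheduler weight/cap for a domain; the CPU and the values would come from the platform/vcpu/* keys listed above.

/* Illustrative sketch of the hypercalls behind configure_vcpus():
 * a hard vCPU affinity plus credit-scheduler weight/cap for a domain.
 * The CPU and the values used here are placeholders. */
#include <stdint.h>
#include <stdlib.h>
#include <xenctrl.h>

static int configure_vcpus_sketch(xc_interface *xch, uint32_t domid,
                                  int vcpu, int pcpu,
                                  uint16_t weight, uint16_t cap)
{
    struct xen_domctl_sched_credit sdom = { .weight = weight, .cap = cap };
    xc_cpumap_t hard = xc_cpumap_alloc(xch);
    int rc;

    if (!hard)
        return -1;
    hard[pcpu / 8] |= 1 << (pcpu % 8);     /* pin the vCPU to one physical CPU */

    /* Hard affinity, as set from platform/vcpu/<n>/affinity; by default this
     * also updates the domain's node_affinity (see the call graph below). */
    rc = xc_vcpu_setaffinity(xch, domid, vcpu, hard, NULL,
                             XEN_VCPUAFFINITY_HARD);
    free(hard);
    if (rc)
        return rc;

    /* Scheduling weight/cap, as set from platform/vcpu/weight and /cap */
    return xc_sched_credit_domain_set(xch, domid, &sdom);
}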

Call graph of do_hvm_build() with emphasis on information flow:

flowchart TD

do_hvm_build("<tt>do_hvm_build()</tt> for HVM")
    --> stub_xc_hvm_build("<tt>stub_xc_hvm_build()</tt>")

get_flags("<tt>get_flags()</tt>") --"VM platform_data from XenStore"
    --> stub_xc_hvm_build

stub_xc_hvm_build
    --> configure_vcpus("configure_vcpus()")

configure_vcpus --"When<br><tt>platform/
                   vcpu/%d/affinity</tt><br>is set"
    --> xc_vcpu_setaffinity

configure_vcpus --"When<br><tt>platform/
                   vcpu/cap</tt><br>or<tt>
                   vcpu/weight</tt><br>is set"
    --> xc_sched_credit_domain_set

stub_xc_hvm_build
    --"struct xc_dom_image, mem_start_mib, mem_max_mib"
        --> hvm_build_setup_mem("hvm_build_setup_mem()")
            -- "struct xc_dom_image
                with
                optional vmemranges"
                    -->  xc_dom_boot_mem_init

subgraph libxenguest
    xc_dom_boot_mem_init("xc_dom_boot_mem_init()")
        -- "struct xc_dom_image
            with
            optional vmemranges" -->
                meminit_hvm("meminit_hvm()")
                    -- page_size(1GB,2M,4k, memflags: e.g. exact) -->
                        xc_domain_populate_physmap("xc_domain_populate_physmap()")
end

subgraph direct xenguest hypercalls
    xc_vcpu_setaffinity("xc_vcpu_setaffinity()")
    --> vcpu_set_affinity("vcpu_set_affinity()")
        --> domain_update_node_aff("domain_update_node_aff()")
            -- "if auto_node_affinity
                is on (default)"--> auto_node_affinity(Update dom->node_affinity)

    xc_sched_credit_domain_set("xc_sched_credit_domain_set()")
end

click do_hvm_build
"https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L596-L615" _blank
click xc_vcpu_setaffinity "../../../../../lib/xenctrl/xc_vcpu_setaffinity/index.html" _blank
click vcpu_set_affinity
"https://github.com/xen-project/xen/blob/e16acd806/xen/common/sched/core.c#L1353-L1393" _blank
click domain_update_node_aff
"https://github.com/xen-project/xen/blob/e16acd806/xen/common/sched/core.c#L1809-L1876" _blank
click stub_xc_hvm_build
"https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L2329-L2436" _blank
click hvm_build_setup_mem
"https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L2002-L2219" _blank
click get_flags
"https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L1164-L1288" _blank
click configure_vcpus
"https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L1297" _blank
click xc_dom_boot_mem_init
"https://github.com/xen-project/xen/blob/e16acd806/tools/libs/guest/xg_dom_boot.c#L110-L125"
click meminit_hvm
"https://github.com/xen-project/xen/blob/e16acd806/tools/libs/guest/xg_dom_x86.c#L1348-L1648"
click xc_domain_populate_physmap
"../../../../../lib/xenctrl/xc_domain_populate_physmap/index.html" _blank
click auto_node_affinity
"../../../../../lib/xenctrl/xc_domain_node_setaffinity/index.html#flowchart-in-relation-to-xc_set_vcpu_affinity" _blank

The function hvm_build_setup_mem()

For HVM domains, hvm_build_setup_mem() is responsible for deriving the memory layout of the new domain, allocating the required memory, and populating it for the new domain. It must:

  1. Derive the e820 memory layout of the system memory of the domain including memory holes depending on PCI passthrough and vGPU flags.
  2. Load the BIOS/UEFI firmware images
  3. Store the final MMIO hole parameters in the Xenstore
  4. Call the libxenguest function xc_dom_boot_mem_init() (see below)
  5. Call construct_cpuid_policy() to apply the CPUID featureset policy

It starts this by:

  • Getting struct xc_dom_image, mem_max_mib, and mem_start_mib.
  • Calculating the start and size of the lower ranges of the domain’s memory maps, taking memory holes for I/O into account (e.g. mmio_size and mmio_start).
  • Calculating lowmem_end and highmem_end (sketched below).
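
The split of guest RAM around the MMIO hole can be sketched as follows. This is a simplified, self-contained illustration of the usual HVM layout calculation (RAM below the hole stays low, the remainder is relocated above 4 GiB), not the exact code of hvm_build_setup_mem(); the sizes are example values.

/* Simplified illustration: split guest RAM around the MMIO hole below 4 GiB.
 * mem_size and mmio_size are in bytes; the values here are examples only. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t mem_size   = 6ULL << 30;          /* e.g. 6 GiB of guest RAM */
    uint64_t mmio_size  = 1ULL << 30;          /* e.g. 1 GiB MMIO hole */
    uint64_t mmio_start = (1ULL << 32) - mmio_size;

    uint64_t lowmem_end  = mem_size;           /* end of RAM below 4 GiB */
    uint64_t highmem_end = 0;                  /* end of RAM above 4 GiB */

    if (lowmem_end > mmio_start) {             /* RAM would overlap the hole: */
        highmem_end = (1ULL << 32) + (lowmem_end - mmio_start); /* move it above 4 GiB */
        lowmem_end  = mmio_start;
    }

    printf("lowmem_end=%#" PRIx64 " highmem_end=%#" PRIx64 "\n",
           lowmem_end, highmem_end);
    return 0;
}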

It then calls xc_dom_boot_mem_init():

The function xc_dom_boot_mem_init()

hvm_build_setup_mem() calls xc_dom_boot_mem_init() to allocate and populate the domain’s system memory:

flowchart LR
subgraph xenguest
hvm_build_setup_mem[hvm_build_setup_mem#40;#41;]
end
subgraph libxenguest
hvm_build_setup_mem --vmemranges--> xc_dom_boot_mem_init[xc_dom_boot_mem_init#40;#41;]
xc_dom_boot_mem_init -->|vmemranges| meminit_hvm[meminit_hvm#40;#41;]
click xc_dom_boot_mem_init "https://github.com/xen-project/xen/blob/39c45c/tools/libs/guest/xg_dom_boot.c#L110-L126" _blank
click meminit_hvm "https://github.com/xen-project/xen/blob/39c45c/tools/libs/guest/xg_dom_x86.c#L1348-L1648" _blank
end

Apart from error handling and tracing, it is only a wrapper that calls the architecture-specific meminit() hook for the domain type:

rc = dom->arch_hooks->meminit(dom);

For HVM domains, it calls meminit_hvm(), which loops over the vmemranges of the domain and maps the system RAM of the guest from the Xen hypervisor heap. Its goals are:

  • Attempt to allocate 1GB superpages when possible
  • Fall back to 2MB pages when 1GB allocation failed
  • Fall back to 4k pages when both failed

It uses xc_domain_populate_physmap() to perform memory allocation and to map the allocated memory to the system RAM ranges of the domain.
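
A hedged C sketch of this allocation strategy is shown below. It is a simplified illustration of the fallback order (1 GiB extents, then 2 MiB, then 4 KiB pages) using xc_domain_populate_physmap(); the real meminit_hvm() additionally handles vmemranges, memory holes, and allocation flags.

/* Simplified illustration of the superpage fallback when populating a
 * contiguous range of guest frames: try 1 GiB extents first, then 2 MiB,
 * then single 4 KiB pages, via xc_domain_populate_physmap(). */
#include <stdint.h>
#include <xenctrl.h>

#define ORDER_1GB 18U   /* 2^18 pages of 4 KiB = 1 GiB */
#define ORDER_2MB  9U   /* 2^9  pages of 4 KiB = 2 MiB */

static int populate_range(xc_interface *xch, uint32_t domid,
                          xen_pfn_t start_pfn, unsigned long nr_pages)
{
    const unsigned int orders[] = { ORDER_1GB, ORDER_2MB, 0 };
    xen_pfn_t pfn = start_pfn, end = start_pfn + nr_pages;

    for (unsigned int i = 0; i < 3 && pfn < end; i++) {
        unsigned long extent_pages = 1UL << orders[i];

        /* Allocate one aligned extent at a time; on the first failure, fall
         * through to the next smaller extent order for the remaining range. */
        while (pfn + extent_pages <= end && (pfn & (extent_pages - 1)) == 0) {
            xen_pfn_t base_gfn = pfn;   /* first guest frame of this extent */

            if (xc_domain_populate_physmap(xch, domid, 1, orders[i], 0,
                                           &base_gfn) != 1)
                break;                  /* out of memory at this order */
            pfn += extent_pages;
        }
    }
    return pfn == end ? 0 : -1;         /* -1: range only partially populated */
}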

For more details on the VM build step involving xenguest and the Xen side, see: https://wiki.xenproject.org/wiki/Walkthrough:_VM_build_using_xenguest

Memory Setup

HVM boot memory setup

For HVM domains, hvm_build_setup_mem() is responsible for deriving the memory layout of the new domain, allocating the required memory, and populating it for the new domain. It must:

  1. Derive the e820 memory layout of the system memory of the domain including memory holes depending on PCI passthrough and vGPU flags.
  2. Load the BIOS/UEFI firmware images
  3. Store the final MMIO hole parameters in the Xenstore
  4. Call the libxenguest function xc_dom_boot_mem_init() (see below)
  5. Call construct_cpuid_policy() to apply the CPUID featureset policy

It starts this by:

  • Getting struct xc_dom_image, mem_max_mib, and mem_start_mib.
  • Calculating the start and size of the lower ranges of the domain’s memory maps, taking memory holes for I/O into account (e.g. mmio_size and mmio_start).
  • Calculating lowmem_end and highmem_end.

Calling into libxenguest for the bootmem setup

hvm_build_setup_mem() then calls the libxenguest function xc_dom_boot_mem_init() to set up the boot memory of domains.

The xl CLI also uses this function to set up the boot memory of domains. It constructs the memory layout of the domain, then allocates and populates the domain’s main system memory using calls to xc_domain_populate_physmap().

For more details on the VM build step involving xenguest and the Xen side, see https://wiki.xenproject.org/wiki/Walkthrough:_VM_build_using_xenguest