Libraries

  • libxenctrl

    Xen Control library for controlling the Xen hypervisor

  • libxenguest

    Xen Guest library for building Xen Guest domains

  • Xen

    Insights into Xen hypercall functions exposed to the toolstack

Subsections of Libraries

Subsections of libxenctrl

xc_domain_claim_pages()

Purpose

The purpose of xc_domain_claim_pages() is to attempt to stake a claim on an amount of memory for a given domain, which guarantees that memory allocations up to the claimed amount will succeed.

The domain can still attempt to allocate beyond the claim, but such allocations are not guaranteed to succeed and will fail once the domain’s memory reaches its max_mem value.

The aim is to stake a claim for a domain on a quantity of pages of system RAM without assigning specific page frames. The hypercall performs only arithmetic, so it is very fast and need not be preempted. Thereby, it sidesteps time-of-check-time-of-use races for memory allocation.

Usage notes

xc_domain_claim_pages() returns 0 if the Xen page allocator has atomically and successfully claimed the requested number of pages, else non-zero.
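
For illustration, a minimal sketch of handling this return value (assuming an xc_interface handle opened with xc_interface_open(); this is not the xenopsd code):

#include <xenctrl.h>

/* Sketch: stake a claim for a new domain before populating its memory.
 * nr_pages is the planned system memory of the domain in 4k pages. */
int claim_boot_memory(xc_interface *xch, uint32_t domid, unsigned long nr_pages)
{
    int rc = xc_domain_claim_pages(xch, domid, nr_pages);

    if (rc)
        return rc;  /* Claim failed: not enough unclaimed free memory. */

    /* From here on, allocating up to nr_pages for this domain is expected
     * to succeed (subject to the domain's max_mem limit). */
    return 0;
}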

Info

Errors returned by xc_domain_claim_pages() must be handled as they are a normal result of the xenopsd thread-pool claiming and starting many VMs in parallel during a boot storm scenario.

Warning

This is especially important when staking claims on NUMA nodes using an updated version of this function. In that case, the only option of the calling worker thread is to adapt to the NUMA boot storm: attempt to find a different NUMA node for claiming the memory and try again.

Design notes

For reference, see the hypercall comment Xen hypervisor header: xen/include/public/memory.h

The design of the backing hypercall in Xen is as follows:

  • A domain can only have one claim.
  • Subsequent calls to the backing hypercall update the claim.
  • The domain ID is the key of the claim.
  • By killing the domain, the claim is also released.
  • When sufficient memory has been allocated to resolve the claim, the claim silently expires.
  • Depending on the given size argument, the remaining stake of the domain can be set initially, updated to the given amount, or reset.
  • Claiming zero pages effectively resets any outstanding claim and is always successful.

Users

To set up the boot memory of a new domain, the libxenguest function xc_dom_boot_mem_init() relies on this design:

The memory setup of xenguest and that of the xl CLI (via libxl) both use it: libxl actively uses it to stake an initial claim before allocating memory.

Note

The functions meminit_hvm() and meminit_pv() used by xc_dom_boot_mem_init() always destroy any remaining claim before they return. Thus, after libxenguest has completed allocating and populating the physical memory of the domain, no domain has a remaining claim!

Warning

From this point onwards, no memory allocation is guaranteed! While swapping memory between domains can be expected to always succeed, allocating a new page after freeing another can fail at any time, unless a privileged domain stakes a claim for such allocations beforehand!

Implementation

The Xen upstream memory claims code is implemented to work as follows:

When a domain with an outstanding claim allocates memory, Xen reduces the remaining claim accordingly.

When memory of a domain with an outstanding claim is freed, the freed amount increases the remaining claim.

Note: This detail may have to change for implementing NUMA claims

Xen does not know whether a page was allocated using a NUMA node claim. Therefore, it cannot know whether it would be legal to increment the stake of a NUMA node claim when freeing pages.

Info

The smallest possible change to achieve a similar effect would be to add a field to the Xen hypervisor’s domain struct. It would be used for remembering the last total NUMA node claim. With it, freeing memory from a NUMA node could attempt to increase the outstanding claim on the claimed NUMA node, but only if this amount is available and not claimed.
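
As a purely hypothetical illustration of this idea (no such field or type exists in upstream Xen; all names below are invented for illustration), the extra per-domain bookkeeping could look like this:

/* Hypothetical only: bookkeeping for remembering the last NUMA node claim,
 * as suggested above. */
struct numa_claim_hint {
    unsigned long last_claim_pages;  /* size of the last total NUMA node claim */
    unsigned int  last_claim_node;   /* NUMA node the claim was staked on */
};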

Management of claims

  • The stake is centrally managed by the Xen hypervisor using a Hypercall.
  • Claims are not reflected in the amount of free memory reported by Xen.

Reporting of claims

  • xl claims reports the outstanding claims of the domains:
    Sample output of xl claims:
    Name         ID   Mem VCPUs      State   Time(s)  Claimed
    Domain-0      0  2656     8     r-----  957418.2     0
  • xl info reports the host-wide outstanding claims:
    Sample output from xl info | grep outstanding:
    outstanding_claims     : 0

Tracking of claims

Xen only tracks:

  • the outstanding claims of each domain and
  • the outstanding host-wide claims.

Claiming zero pages effectively cancels the domain’s outstanding claim and is always successful.

Info
  • Allocations for outstanding claims are expected to always be successful.
  • But each successful allocation reduces the domain’s outstanding claim.
  • Freeing memory of the domain increases the domain’s claim again:
    • But, once a domain has consumed its claim, the claim is reset.
    • When the claim is reset, freed memory is no longer added to the outstanding claim!
    • The domain would have to stake a new claim to have guaranteed spare memory again.

The domain’s max_mem value is used to deny memory allocation

If an allocation would cause the domain to exceed its max_mem value, it will always fail.
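
A toolstack therefore sets the domain’s maximum memory before claiming or allocating. A sketch of this ordering, assuming the standard libxenctrl call xc_domain_setmaxmem(), which takes the limit in KiB:

#include <xenctrl.h>

/* Sketch: raise max_mem first, then stake the claim, so that later
 * allocations up to the claimed amount are not denied by the max_mem check. */
int prepare_domain_memory(xc_interface *xch, uint32_t domid, unsigned long nr_pages)
{
    int rc = xc_domain_setmaxmem(xch, domid, (uint64_t)nr_pages * 4 /* KiB */);

    if (rc)
        return rc;

    return xc_domain_claim_pages(xch, domid, nr_pages);
}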

Hypercall API

Function signature of the libXenCtrl function to call the Xen hypercall:

long xc_memory_op(libxc_handle, XENMEM_claim_pages, struct xen_memory_reservation *)

For this call, struct xen_memory_reservation is populated as follows:

struct xen_memory_reservation {
    .nr_extents   = nr_pages, /* number of pages to claim */
    .extent_order = 0,        /* an order 0 means: 4k pages, only 0 is allowed */
    .mem_flags    = 0,        /* no flags, only 0 is allowed (at the moment) */
    .domid        = domid     /* numerical domain ID of the domain */
};
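
Based on the prototype shown above (the exact xc_memory_op() prototype can differ between Xen versions), the claim could be issued as sketched below; this paraphrases what xc_domain_claim_pages() does rather than copying it:

#include <xenctrl.h>

/* Sketch only: issue XENMEM_claim_pages directly via the memory hypercall
 * wrapper, following the prototype documented above. */
static long claim_via_memory_op(xc_interface *xch, uint32_t domid,
                                unsigned long nr_pages)
{
    struct xen_memory_reservation reservation = {
        .nr_extents   = nr_pages, /* number of 4k pages to claim */
        .extent_order = 0,        /* only order 0 (4k pages) is allowed */
        .mem_flags    = 0,        /* only 0 is allowed (at the moment) */
        .domid        = domid,    /* domain staking the claim */
    };

    return xc_memory_op(xch, XENMEM_claim_pages, &reservation);
}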

Current users

It is used by libxenguest, which is used at least by:

libxl and the xl CLI

The xl CLI uses claims actively: by default, it has libxl pass the struct xc_dom_image with its claim_enabled field set.

The functions dispatched by xc_dom_boot_mem_init() then attempt to claim the boot memory using xc_domain_claim_pages(). They also (unconditionally) destroy any open claim upon return.

This means that in case the claim fails, xl avoids:

  • The effort of allocating the memory, thereby not blocking it for other domains.
  • The effort of potentially needing to scrub the memory after the build failure.

Updates: Improved NUMA memory allocation

Enablement of a NUMA node claim instead of a host-wide claim

With a proposed update of xc_domain_claim_pages() for NUMA node claims, a node argument is added.

It can be XC_NUMA_NO_NODE for defining a host-wide claim or a NUMA node for staking a claim on one NUMA node.
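
A hypothetical sketch of what such an updated call and its use by a xenopsd worker thread could look like (the final name and signature are not settled):

/* Hypothetical: proposed NUMA-aware variant; XC_NUMA_NO_NODE keeps the
 * current host-wide behaviour. */
int xc_domain_claim_pages(xc_interface *xch, uint32_t domid,
                          unsigned long nr_pages, unsigned int node);

/* Hypothetical caller (e.g. a xenopsd worker thread): */
static int claim_on_pnode(xc_interface *xch, uint32_t domid,
                          unsigned long nr_pages, unsigned int pnode)
{
    int rc = xc_domain_claim_pages(xch, domid, nr_pages, pnode);

    if (rc)
        /* NUMA boot storm: the caller may retry on a different NUMA node. */
        return rc;

    return 0;
}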

This update does not change the foundational design of memory claims of the Xen hypervisor where a claim is defined as a single claim for the domain.

For improved support for NUMA, xenopsd is updated to call an updated version of this function for the domain.

It reserves NUMA node memory before xenguest is called, and a new pnode argument is added to xenguest.

It sets the NUMA node for memory allocations by the xenguest boot memory setup xc_dom_boot_mem_init(), which also sets the exact flag.

The exact flag forces get_free_buddy() to fail if it could not find scrubbed memory for the given pnode, which causes alloc_heap_pages() to re-run get_free_buddy() with the flag to fall back to dirty, not yet scrubbed memory.

alloc_heap_pages() then checks each 4k page for the need to scrub and scrubs those before returning them.

This is expected to improve two issues where memory could be spread over all NUMA nodes: in case of parallel boot, it prevents one NUMA node from becoming the target of parallel memory allocations that cannot fit on it, and in case a huge amount of dirty memory has to be scrubbed before a domain can start or restart, it likewise avoids spreading the domain’s memory over all NUMA nodes.

NUMA grant tables and ballooning

This is not seen as an issue as grant tables and I/O pages are usually not as frequently used as regular system memory of domains, but this potential issue remains:

In discussions, it was said that Windows PV drivers unmap and free memory for grant tables to Xen and then re-allocate memory for those grant tables.

xenopsd may want to try to stake a very small claim for the domain on the NUMA node of the domain so that Xen can increase this claim when the PV drivers free this memory and re-use the resulting claimed amount for allocating the grant tables.

This would ensure that the grant tables are then allocated on the local NUMA node of the domain, avoiding remote memory accesses when accessing the grant tables from inside the domain.

Note: In case the corresponding backend process in Dom0 is running on another NUMA node, it would access the domain’s grant tables from a remote NUMA node, but this would enable a future improvement for Dom0, where it could prefer to run the corresponding backend process on the same or a neighbouring NUMA node.

xc_domain_node_setaffinity()

xc_domain_node_setaffinity() controls the NUMA node affinity of a domain, but it only updates the Xen hypervisor domain’s d->node_affinity mask. This mask is read by the Xen memory allocator as the 2nd preference for the NUMA node to allocate memory from for this domain.

Preferences of the Xen memory allocator:
  1. A NUMA node passed to the allocator directly takes precedence, if present.
  2. Then, if the allocation is for a domain, its node_affinity mask is tried.
  3. Finally, it falls back to spread the pages over all remaining NUMA nodes.

As this call has no practical effect on the Xen scheduler, vCPU affinities need to be set separately anyway.

The domain’s auto_node_affinity flag is enabled by default by Xen. This means that when setting vCPU affinities, Xen updates the d->node_affinity mask to consist of the NUMA nodes to which its vCPUs have affinity to.

See xc_vcpu_setaffinity() for more information on how d->auto_node_affinity is used to set the NUMA node affinity.

Thus, so far, there is no obvious need to call xc_domain_node_setaffinity() when building a domain.

Setting the NUMA node affinity using this call can be used, for example, when there might not be enough memory on the preferred NUMA node, but there are other NUMA nodes that have enough free memory to be used for the system memory of the domain.
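
A minimal sketch of such a call (assuming xc_nodemap_alloc() from xenctrl.h; error handling reduced to the essentials):

#include <stdlib.h>
#include <xenctrl.h>

/* Sketch: restrict the domain's NUMA node affinity to nodes 0 and 1. */
int set_node_affinity_example(xc_interface *xch, uint32_t domid)
{
    xc_nodemap_t nodemap = xc_nodemap_alloc(xch);
    int rc;

    if (!nodemap)
        return -1;

    nodemap[0] = 0x03;  /* bit 0 = NUMA node 0, bit 1 = NUMA node 1 */

    /* Updates d->node_affinity and disables d->auto_node_affinity in Xen. */
    rc = xc_domain_node_setaffinity(xch, domid, nodemap);

    free(nodemap);
    return rc;
}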

In terms of future NUMA design, it might be even more favourable to have a strategy in xenguest where in such cases, the superpages of the preferred node are used first and a fallback to neighbouring NUMA nodes only happens to the extent necessary.

Likely, the future allocation strategy should be passed to xenguest using Xenstore like the other platform parameters for the VM.

Walk-through of xc_domain_node_setaffinity()

classDiagram
class `xc_domain_node_setaffinity()` {
    +xch: xc_interface #42;
    +domid: uint32_t
    +nodemap: xc_nodemap_t
    0(on success)
    -EINVAL(if a node in the nodemask is not online)
}
click `xc_domain_node_setaffinity()` href "
https://github.com/xen-project/xen/blob/master/tools/libs/ctrl/xc_domain.c#L122-L158"

`xc_domain_node_setaffinity()` --> `Xen hypercall: do_domctl()`
`xc_domain_node_setaffinity()` <-- `Xen hypercall: do_domctl()`
class `Xen hypercall: do_domctl()` {
    Calls domain_set_node_affinity#40;#41; and returns its return value
    Passes: domain (struct domain *, looked up using the domid)
    Passes: new_affinity (nodemask, converted from xc_nodemap_t)
}
click `Xen hypercall: do_domctl()` href "
https://github.com/xen-project/xen/blob/master/xen/common/domctl.c#L516-L525"

`Xen hypercall: do_domctl()` --> `domain_set_node_affinity()`
`Xen hypercall: do_domctl()` <-- `domain_set_node_affinity()`
class `domain_set_node_affinity()` {
    domain: struct domain
    new_affinity: nodemask
    0(on success, the domain's node_affinity is updated)
    -EINVAL(if a node in the nodemask is not online)
}
click `domain_set_node_affinity()` href "
https://github.com/xen-project/xen/blob/master/xen/common/domain.c#L943-L970"

domain_set_node_affinity()

This function implements the functionality of xc_domain_node_setaffinity to set the NUMA affinity of a domain as described above. If the new_affinity does not intersect the node_online_map, it returns -EINVAL. Otherwise, the result is a success, and it returns 0.

When the new_affinity is a specific set of NUMA nodes, it updates the NUMA node_affinity of the domain to these nodes and disables d->auto_node_affinity for this domain. With d->auto_node_affinity disabled, xc_vcpu_setaffinity() no longer updates the NUMA affinity of this domain.

If new_affinity has all bits set, it re-enables d->auto_node_affinity for this domain and calls domain_update_node_aff() to re-set the domain’s node_affinity mask to the NUMA nodes of the current hard and soft affinity of the domain’s online vCPUs.

Flowchart in relation to xc_vcpu_setaffinity()

The effect of domain_set_node_affinity() can be seen more clearly on this flowchart, which shows how xc_vcpu_setaffinity() is currently used to set the NUMA affinity of a new domain, but also shows how domain_set_node_affinity() relates to it:

In the flowchart, two code paths are set in bold:

  • Show the path when Host.numa_affinity_policy is the default (off) in xenopsd.
  • Show the default path of xc_vcpu_setaffinity(XEN_VCPUAFFINITY_SOFT) in Xen when the domain’s auto_node_affinity flag is enabled (the default), to show how, in this default case, the vCPU affinity update also updates the domain’s node_affinity.

xenguest uses the Xenstore to read the static domain configuration that it needs to build the domain.

flowchart TD

subgraph VM.create["xenopsd VM.create"]

    %% Is xe vCPU-params:mask= set? If yes, write to Xenstore:

    is_xe_vCPUparams_mask_set?{"

            Is
            <tt>xe vCPU-params:mask=</tt>
            set? Example: <tt>1,2,3</tt>
            (Is used to enable vCPU<br>hard-affinity)

        "} --"yes"--> set_hard_affinity("Write hard-affinity to XenStore:
                        <tt>platform/vcpu/#domid/affinity</tt>
                        (xenguest will read this and other configuration data
                         from Xenstore)")

end

subgraph VM.build["xenopsd VM.build"]

    %% Labels of the decision nodes

    is_Host.numa_affinity_policy_set?{
        Is<p><tt>Host.numa_affinity_policy</tt><p>set?}
    has_hard_affinity?{
        Is hard-affinity configured in <p><tt>platform/vcpu/#domid/affinity</tt>?}

    %% Connections from VM.create:
    set_hard_affinity --> is_Host.numa_affinity_policy_set?
    is_xe_vCPUparams_mask_set? == "no"==> is_Host.numa_affinity_policy_set?

    %% The Subgraph itself:

    %% Check Host.numa_affinity_policy

    is_Host.numa_affinity_policy_set?

        %% If Host.numa_affinity_policy is "best_effort":

        -- Host.numa_affinity_policy is<p><tt>best_effort -->

            %% If has_hard_affinity is set, skip numa_placement:

            has_hard_affinity?
                --"yes"-->exec_xenguest

            %% If has_hard_affinity is not set, run numa_placement:

            has_hard_affinity?
                --"no"-->numa_placement-->exec_xenguest

        %% If Host.numa_affinity_policy is off (default, for now),
        %% skip NUMA placement:

        is_Host.numa_affinity_policy_set?
            =="default: disabled"==>
            exec_xenguest
end

%% xenguest subgraph

subgraph xenguest

    exec_xenguest

        ==> stub_xc_hvm_build("<tt>stub_xc_hvm_build()")

            ==> configure_vcpus("<tt>configure_vcpus()")

                %% Decision
                ==> set_hard_affinity?{"
                        Is <tt>platform/<br>vcpu/#domid/affinity</tt>
                        set?"}

end

%% do_domctl Hypercalls

numa_placement
    --Set the NUMA placement using soft-affinity-->
    XEN_VCPUAFFINITY_SOFT("<tt>xc_vcpu_setaffinity(SOFT)")
        ==> do_domctl

set_hard_affinity?
    --yes-->
    XEN_VCPUAFFINITY_HARD("<tt>xc_vcpu_setaffinity(HARD)")
        --> do_domctl

xc_domain_node_setaffinity("<tt>xc_domain_node_setaffinity()</tt>
                            and
                            <tt>xc_domain_node_getaffinity()")
                                <--> do_domctl

%% Xen subgraph

subgraph xen[Xen Hypervisor]

    subgraph domain_update_node_affinity["domain_update_node_affinity()"]
        domain_update_node_aff("<tt>domain_update_node_aff()")
        ==> check_auto_node{"Is domain's<br><tt>auto_node_affinity</tt><br>enabled?"}
          =="yes (default)"==>set_node_affinity_from_vcpu_affinities("
            Calculate the domain's <tt>node_affinity</tt> mask from vCPU affinity
            (used for further NUMA memory allocation for the domain)")
    end

    do_domctl{"do_domctl()<br>op->cmd=?"}
        ==XEN_DOMCTL_setvcpuaffinity==>
            vcpu_set_affinity("<tt>vcpu_set_affinity()</tt><br>set the vCPU affinity")
                ==>domain_update_node_aff
    do_domctl
        --XEN_DOMCTL_setnodeaffinity (not used currently)
            -->is_new_affinity_all_nodes?

    subgraph  domain_set_node_affinity["domain_set_node_affinity()"]

        is_new_affinity_all_nodes?{new_affinity<br>is #34;all#34;?}

            --is #34;all#34;

                --> enable_auto_node_affinity("<tt>auto_node_affinity=1")
                    --> domain_update_node_aff

        is_new_affinity_all_nodes?

            --not #34;all#34;

                --> disable_auto_node_affinity("<tt>auto_node_affinity=0")
                    --> domain_update_node_aff
    end

%% setting and getting the struct domain's node_affinity:

disable_auto_node_affinity
    --node_affinity=new_affinity-->
        domain_node_affinity

set_node_affinity_from_vcpu_affinities
    ==> domain_node_affinity@{ shape: bow-rect,label: "domain:&nbsp;node_affinity" }
        --XEN_DOMCTL_getnodeaffinity--> do_domctl

end
click is_Host.numa_affinity_policy_set?
"https://github.com/xapi-project/xen-api/blob/90ef043c1f3a3bc20f1c5d3ccaaf6affadc07983/ocaml/xenopsd/xc/domain.ml#L951-L962"
click numa_placement
"https://github.com/xapi-project/xen-api/blob/90ef043c/ocaml/xenopsd/xc/domain.ml#L862-L897"
click stub_xc_hvm_build
"https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L2329-L2436" _blank
click get_flags
"https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L1164-L1288" _blank
click do_domctl
"https://github.com/xen-project/xen/blob/7cf163879/xen/common/domctl.c#L282-L894" _blank
click domain_set_node_affinity
"https://github.com/xen-project/xen/blob/7cf163879/xen/common/domain.c#L943-L970" _blank
click configure_vcpus
"https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L1297-L1348" _blank
click set_hard_affinity?
"https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L1305-L1326" _blank
click xc_vcpu_setaffinity
"https://github.com/xen-project/xen/blob/7cf16387/tools/libs/ctrl/xc_domain.c#L199-L250" _blank
click vcpu_set_affinity
"https://github.com/xen-project/xen/blob/7cf16387/xen/common/sched/core.c#L1353-L1393" _blank
click domain_update_node_aff
"https://github.com/xen-project/xen/blob/7cf16387/xen/common/sched/core.c#L1809-L1876" _blank
click check_auto_node
"https://github.com/xen-project/xen/blob/7cf16387/xen/common/sched/core.c#L1840-L1870" _blank
click set_node_affinity_from_vcpu_affinities
"https://github.com/xen-project/xen/blob/7cf16387/xen/common/sched/core.c#L1867-L1869" _blank

xc_domain_node_setaffinity() can be used to set the domain’s node_affinity (which is normally set by xc_vcpu_setaffinity()) to different NUMA nodes.

No effect on the Xen scheduler

Currently, the node affinity does not affect the Xen scheduler: in case d->node_affinity is set before vCPU creation, the initial pCPU of a new vCPU is the first pCPU of the first NUMA node in the domain’s node_affinity. This is further changed when one or more cpupools are set up. As this is only the initial pCPU of the vCPU, this alone does not change the scheduling of the Xen Credit scheduler, as it reschedules vCPUs to other pCPUs.

Notes on future design improvements

It may be possible to call it before vCPUs are created

When done early, before vCPU creation, some domain-related data structures could be allocated using the domain’s d->node_affinity NUMA node mask.

With further changes in Xen and xenopsd, Xen could allocate the vCPU structs on the affine NUMA nodes of the domain.

For this, xenopsd would have to call xc_domain_node_setaffinity() before vCPU creation, after having decided the domain’s NUMA placement, preferably including claiming the required memory for the domain to ensure that the domain will be populated from the same NUMA node(s).

This call cannot influence the past: the xenopsd VM_create micro-op calls Xenctrl.domain_create, which currently creates the domain’s data structures before NUMA placement is done.

Improving Xenctrl.domain_create to pass a NUMA node for allocating the hypervisor’s data structures (e.g. vCPU structs) of the domain would require changes to the Xen hypervisor and to the xenopsd VM_create micro-op.

xc_domain_populate_physmap()

Overview

xenguest uses xc_domain_populate_physmap() and xc_domain_populate_physmap_exact() to populate a Xen domain’s physical memory map: both call the XENMEM_populate_physmap hypercall.

xc_domain_populate_physmap_exact() additionally fails unless all requested extents could be populated; an exact NUMA node request is passed through the mem_flags argument instead. This is a very simplified overview of the hypercall:
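
For illustration, a sketch of populating a small range of 4k pages with an exact NUMA node request in the memory flags (XENMEMF_exact_node() is from the public xen/memory.h; error handling is omitted):

#include <xenctrl.h>

/* Sketch: populate 512 contiguous 4k pages starting at guest frame gpfn,
 * restricted to NUMA node 0 (a 2MB extent would use extent_order 9 instead). */
int populate_on_node0(xc_interface *xch, uint32_t domid, xen_pfn_t gpfn)
{
    xen_pfn_t extents[512];
    unsigned int memflags = XENMEMF_exact_node(0);

    for (unsigned long i = 0; i < 512; i++)
        extents[i] = gpfn + i;  /* one extent per array entry */

    /* Fails unless all 512 extents could be populated. */
    return xc_domain_populate_physmap_exact(xch, domid, 512, 0,
                                            memflags, extents);
}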

flowchart LR

subgraph hypercall handlers
    populate_physmap("<tt>populate_physmap()</tt>
    One call for each memory
                      range (extent)")
end


subgraph "Xen buddy allocator:"

    populate_physmap
        --> alloc_domheap_pages("<tt>alloc_domheap_pages()</tt>
            Assign allocated pages to
            the domain")

    alloc_domheap_pages
        --> alloc_heap_pages("<tt>alloc_heap_pages()</tt>
                        If needed: split high-order
                        pages into smaller buddies,
                        and scrub dirty pages")
            --> get_free_buddy("<tt>get_free_buddy()</tt>
            If requested: Allocate from a
            preferred/exact NUMA node
            and/or from
            unscrubbed memory
            ")

end

click populate_physmap
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L159-L314
" _blank

click alloc_domheap_pages
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L2641-L2697
" _blank

click get_free_buddy
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L855-L958
" _blank

click alloc_heap_pages
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L967-L1116
" _blank

memory_op(XENMEM_populate_physmap)

It calls construct_memop_from_reservation() to convert the hypercall arguments for allocating pages from struct xen_memory_reservation to struct memop_args and passes them to populate_physmap():

construct_memop_from_reservation()

It populates struct memop_args using the hypercall arguments. It:

  • Copies extent_start, nr_extents, and extent_order.
  • Populates memop_args->memflags using the given mem_flags.

Converting a vNODE to a pNODE for vNUMA

When a vNUMA vnode is passed using XENMEMF_vnode, and domain->vnuma and domain->vnuma->nr_vnodes are set, and the vnode (virtual NUMA node) maps to a pnode (physical NUMA node), it also:

  • Populates the pnode in the memflags of the struct memop_args
  • and sets a XENMEMF_exact_node_request in them as well.

propagate_node()

If no vNUMA node is passed, construct_memop_from_reservation() calls propagate_node() to propagate the NUMA node and XENMEMF_exact_node_request for use in Xen.
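
A caller-side sketch of how the vNUMA case above is encoded in the reservation’s mem_flags, using the public XENMEMF_* macros from xen/memory.h (this is an assumption about typical usage, not a copy of the Xen-internal code):

#include <xenctrl.h>

/* Sketch: request allocation for virtual NUMA node 'vnode' of the domain.
 * Xen converts the vnode to the physical node as described above; the exact
 * flag makes the allocation fail rather than fall back to another node. */
unsigned int vnuma_mem_flags(unsigned int vnode)
{
    return XENMEMF_vnode | XENMEMF_node(vnode) | XENMEMF_exact_node_request;
}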

populate_physmap()

It handles hypercall preemption and resumption after a preemption, and keeps track of the already populated pages.

For each range (extent), it runs an iteration of the allocation loop. It passes the

  • struct domain
  • page order
  • count of remaining pages to populate
  • and the converted memflags

to alloc_domheap_pages():

alloc_domheap_pages()

It calls alloc_heap_pages() and on success, assigns the allocated pages to the domain.

alloc_heap_pages()

It calls get_free_buddy() to allocate a page at the most suitable place: when no pages of the requested size are free, it splits larger superpages into pages of the requested size.

get_free_buddy()

It finds memory based on the given flags and domain and returns its page struct:

  • Optionally prefer to allocate from a passed NUMA node
  • Optionally allocate from the domain’s next affine NUMA node (round-robin)
  • Optionally return if the preferred NUMA allocation did not succeed
  • Optionally allocate from not-yet scrubbed memory
  • Optionally allocate from the given range of memory zones
  • Fall back to allocating from the next NUMA node on the system (round-robin)

For details, see the get_free_buddy() reference.

Full flowchart

flowchart TD

subgraph XenCtrl
xc_domain_populate_physmap["<tt>xc_domain_populate_physmap()"]
xc_domain_populate_physmap_exact["<tt>xc_domain_populate_physmap_exact()"]
end

subgraph Xen

%% sub-subgraph from memory_op() to populate_node() and back

xc_domain_populate_physmap & xc_domain_populate_physmap_exact
<--reservation,<br>and for preempt:<br>nr_start/nr_done-->
memory_op("<tt>memory_op(XENMEM_populate_physmap)")

memory_op
    --struct xen_memory_reservation-->
        construct_memop_from_reservation("<tt>construct_memop_from_reservation()")
            --struct<br>xen_memory_reservation->mem_flags-->
                propagate_node("<tt>propagate_node()")
            --_struct<br>memop_args->memflags_-->
        construct_memop_from_reservation
    --_struct memop_args_-->
memory_op<--struct memop_args *:
            struct domain *,
            List of extent base addrs,
            Number of extents,
            Size of each extent (extent_order),
            Allocation flags(memflags)-->
    populate_physmap[["<tt>populate_physmap()"]]
        <-.domain, extent base addrs, extent size, memflags, nr_start and nr_done.->
        populate_physmap_loop--if memflags & MEMF_populate_on_demand -->guest_physmap_mark_populate_on_demand("
            <tt>guest_physmap_mark_populate_on_demand()")
        populate_physmap_loop@{ label: "While extents to populate,
                and not asked to preempt,
                for each extent left to do:", shape: notch-pent }
            --domain, order, memflags-->
            alloc_domheap_pages("<tt>alloc_domheap_pages()")
              --zone_lo, zone_hi, order, memflags, domain-->
                alloc_heap_pages
                    --zone_lo, zone_hi, order, memflags, domain-->
                        get_free_buddy("<tt>get_free_buddy()")
                    --_page_info_
                -->alloc_heap_pages
                    --if no page-->
                        no_scrub("<tt>get_free_buddy(MEMF_no_scrub)</tt>
                            (honored only when order==0)")
                    --_dirty 4k page_
                -->alloc_heap_pages
                    <--_dirty 4k page_-->
                        scrub_one_page("<tt>scrub_one_page()")
                alloc_heap_pages("<tt>alloc_heap_pages()</tt>
                        (also splits higher-order pages
                         into smaller buddies if needed)")
              --_page_info_
            -->alloc_domheap_pages
                --page_info, order, domain, memflags-->assign_page("<tt>assign_page()")
                    assign_page
                        --page_info, nr_mfns, domain, memflags-->
                            assign_pages("<tt>assign_pages()")
                            --domain, nr_mfns-->
                                domain_adjust_tot_pages("<tt>domain_adjust_tot_pages()")
            alloc_domheap_pages
            --_page_info_-->
        populate_physmap_loop
            --page(gpfn, mfn, extent_order)-->
                guest_physmap_add_page("<tt>guest_physmap_add_page()")

populate_physmap--nr_done, preempted-->memory_op
end

click memory_op
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L1409-L1425
" _blank

click construct_memop_from_reservation
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L1022-L1071
" _blank

click propagate_node
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L524-L547
" _blank

click populate_physmap
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L159-L314
" _blank

click populate_physmap_loop
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L197-L304
" _blank

click guest_physmap_mark_populate_on_demand
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L210-220
" _blank

click guest_physmap_add_page
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L296
" _blank

click alloc_domheap_pages
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L2641-L2697
" _blank

click alloc_heap_pages
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L967-L1116
" _blank

click get_free_buddy
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L855-L958
" _blank

click assign_page
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L2540-L2633
" _blank

click assign_pages
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L2635-L2639
" _blank

xc_vcpu_setaffinity()

Introduction

In the Xen hypervisor, each vCPU has:

  • A soft affinity. This is the list of pCPUs where a vCPU prefers to run:

    This can be used to make vCPUs prefer to run on a set of pCPUs, for example the pCPUs of a NUMA node, but in case those are already busy, the Credit scheduler can still ignore the soft affinity. A typical use case for this are NUMA machines, where the soft affinity for the vCPUs of a domain should be set equal to the pCPUs of the NUMA node where the domain’s memory shall be placed.

    See the description of the NUMA feature for more details.

  • A hard affinity, also known as pinning. This is the list of pCPUs where a vCPU is allowed to run.

    Hard affinity is currently not used for NUMA placement, but can be configured manually for a given domain, either using xe VCPUs-params:mask= or the API.

    For example, the vCPU’s pinning can be configured for a VM with:1

    xe vm-param-set uuid=<template_uuid> vCPUs-params:mask=1,2,3

    There are also host-level guest_VCPUs_params which are used by host-cpu-tune to exclusively pin Dom0 and guests (i.e. so that their pCPUs never overlap). Note: This isn’t currently supported by the NUMA code: it could result in the NUMA placement picking a node that has reduced capacity or is unavailable due to the host mask that host-cpu-tune has set.

Purpose

The libxenctrl library call xc_vcpu_setaffinity() controls the pCPU affinity of the given vCPU.

xenguest uses it when building domains if xenopsd added vCPU affinity information to the XenStore platform data path platform/vcpu/#domid/affinity of the domain.
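
A sketch of an xc_vcpu_setaffinity() call (assuming the current libxenctrl prototype with separate hard and soft cpumaps and xc_cpumap_alloc(); minimal error handling). Here the soft affinity is set, as done for NUMA placement:

#include <stdlib.h>
#include <xenctrl.h>

/* Sketch: set the soft affinity of vCPU 0 to pCPUs 0-7 (for example the
 * pCPUs of the NUMA node chosen by NUMA placement). */
int set_soft_affinity_example(xc_interface *xch, uint32_t domid)
{
    xc_cpumap_t hard = xc_cpumap_alloc(xch);
    xc_cpumap_t soft = xc_cpumap_alloc(xch);
    int rc = -1;

    if (hard && soft) {
        soft[0] = 0xff;  /* bits 0-7: the pCPUs of the placed NUMA node */

        /* Only the soft affinity is applied here; with auto_node_affinity
         * enabled, Xen also updates the domain's node_affinity from it. */
        rc = xc_vcpu_setaffinity(xch, domid, 0, hard, soft,
                                 XEN_VCPUAFFINITY_SOFT);
    }

    free(hard);
    free(soft);
    return rc;
}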

Updating the NUMA node affinity of a domain

Besides that, xc_vcpu_setaffinity() can also modify the NUMA node affinity of the Xen domain when it sets the affinity of a vCPU:

When Xen creates a domain, it enables the domain’s d->auto_node_affinity feature flag.

When it is enabled, setting the vCPU affinity also updates the NUMA node affinity which is used for memory allocations for the domain:

Simplified flowchart

flowchart TD
subgraph libxenctrl
    xc_vcpu_setaffinity("<tt>xc_vcpu_setaffinity()")--hypercall-->xen
end
subgraph xen[Xen Hypervisor]
direction LR
vcpu_set_affinity("<tt>vcpu_set_affinity()</tt><br>set the vCPU affinity")
    -->check_auto_node{"Is the domain's<br><tt>auto_node_affinity</tt><br>enabled?"}
        --"yes<br>(default)"-->
            auto_node_affinity("Set the<br>domain's<br><tt>node_affinity</tt>
            mask as well<br>(used for further<br>NUMA memory<br>allocation)")

click xc_vcpu_setaffinity
"https://github.com/xen-project/xen/blob/7cf16387/tools/libs/ctrl/xc_domain.c#L199-L250" _blank
click vcpu_set_affinity
"https://github.com/xen-project/xen/blob/7cf16387/xen/common/sched/core.c#L1353-L1393" _blank
click domain_update_node_aff
"https://github.com/xen-project/xen/blob/7cf16387/xen/common/sched/core.c#L1809-L1876" _blank
click check_auto_node
"https://github.com/xen-project/xen/blob/7cf16387/xen/common/sched/core.c#L1840-L1870" _blank
click auto_node_affinity
"https://github.com/xen-project/xen/blob/7cf16387/xen/common/sched/core.c#L1867-L1869" _blank
end

Current use by xenopsd and xenguest

When Host.numa_affinity_policy is set to best_effort, xenopsd attempts NUMA node placement when building new VMs and instructs xenguest to set the vCPU affinity of the domain.

With the domain’s auto_node_affinity flag enabled by default in Xen, this automatically also sets the d->node_affinity mask of the domain.

This then causes the Xen memory allocator to prefer the NUMA nodes in the d->node_affinity NUMA node mask when allocating memory.

That is, (for completeness) unless Xen’s allocation function alloc_heap_pages() receives a specific NUMA node in its memflags argument when called.

See xc_domain_node_setaffinity() for more information about another way to set the node_affinity NUMA node mask of Xen domains and more depth on how it is used in Xen.

Flowchart of its current use for NUMA affinity

In the flowchart, two code paths are set in bold:

  • Show the path when Host.numa_affinity_policy is the default (off) in xenopsd.
  • Show the default path of xc_vcpu_setaffinity(XEN_VCPUAFFINITY_SOFT) in Xen when the domain’s auto_node_affinity flag is enabled (the default), to show how, in this default case, the vCPU affinity update also updates the domain’s node_affinity.

xenguest uses the Xenstore to read the static domain configuration that it needs to build the domain.

flowchart TD

subgraph VM.create["xenopsd VM.create"]

    %% Is xe vCPU-params:mask= set? If yes, write to Xenstore:

    is_xe_vCPUparams_mask_set?{"

            Is
            <tt>xe vCPU-params:mask=</tt>
            set? Example: <tt>1,2,3</tt>
            (Is used to enable vCPU<br>hard-affinity)

        "} --"yes"--> set_hard_affinity("Write hard-affinity to XenStore:
                        <tt>platform/vcpu/#domid/affinity</tt>
                        (xenguest will read this and other configuration data
                         from Xenstore)")

end

subgraph VM.build["xenopsd VM.build"]

    %% Labels of the decision nodes

    is_Host.numa_affinity_policy_set?{
        Is<p><tt>Host.numa_affinity_policy</tt><p>set?}
    has_hard_affinity?{
        Is hard-affinity configured in <p><tt>platform/vcpu/#domid/affinity</tt>?}

    %% Connections from VM.create:
    set_hard_affinity --> is_Host.numa_affinity_policy_set?
    is_xe_vCPUparams_mask_set? == "no"==> is_Host.numa_affinity_policy_set?

    %% The Subgraph itself:

    %% Check Host.numa_affinity_policy

    is_Host.numa_affinity_policy_set?

        %% If Host.numa_affinity_policy is "best_effort":

        -- Host.numa_affinity_policy is<p><tt>best_effort -->

            %% If has_hard_affinity is set, skip numa_placement:

            has_hard_affinity?
                --"yes"-->exec_xenguest

            %% If has_hard_affinity is not set, run numa_placement:

            has_hard_affinity?
                --"no"-->numa_placement-->exec_xenguest

        %% If Host.numa_affinity_policy is off (default, for now),
        %% skip NUMA placement:

        is_Host.numa_affinity_policy_set?
            =="default: disabled"==>
            exec_xenguest
end

%% xenguest subgraph

subgraph xenguest

    exec_xenguest

        ==> stub_xc_hvm_build("<tt>stub_xc_hvm_build()")

            ==> configure_vcpus("<tt>configure_vcpus()")

                %% Decision
                ==> set_hard_affinity?{"
                        Is <tt>platform/<br>vcpu/#domid/affinity</tt>
                        set?"}

end

%% do_domctl Hypercalls

numa_placement
    --Set the NUMA placement using soft-affinity-->
    XEN_VCPUAFFINITY_SOFT("<tt>xc_vcpu_setaffinity(SOFT)")
        ==> do_domctl

set_hard_affinity?
    --yes-->
    XEN_VCPUAFFINITY_HARD("<tt>xc_vcpu_setaffinity(HARD)")
        --> do_domctl

xc_domain_node_setaffinity("<tt>xc_domain_node_setaffinity()</tt>
                            and
                            <tt>xc_domain_node_getaffinity()")
                                <--> do_domctl

%% Xen subgraph

subgraph xen[Xen Hypervisor]

    subgraph domain_update_node_affinity["domain_update_node_affinity()"]
        domain_update_node_aff("<tt>domain_update_node_aff()")
        ==> check_auto_node{"Is domain's<br><tt>auto_node_affinity</tt><br>enabled?"}
          =="yes (default)"==>set_node_affinity_from_vcpu_affinities("
            Calculate the domain's <tt>node_affinity</tt> mask from vCPU affinity
            (used for further NUMA memory allocation for the domain)")
    end

    do_domctl{"do_domctl()<br>op->cmd=?"}
        ==XEN_DOMCTL_setvcpuaffinity==>
            vcpu_set_affinity("<tt>vcpu_set_affinity()</tt><br>set the vCPU affinity")
                ==>domain_update_node_aff
    do_domctl
        --XEN_DOMCTL_setnodeaffinity (not used currently)
            -->is_new_affinity_all_nodes?

    subgraph  domain_set_node_affinity["domain_set_node_affinity()"]

        is_new_affinity_all_nodes?{new_affinity<br>is #34;all#34;?}

            --is #34;all#34;

                --> enable_auto_node_affinity("<tt>auto_node_affinity=1")
                    --> domain_update_node_aff

        is_new_affinity_all_nodes?

            --not #34;all#34;

                --> disable_auto_node_affinity("<tt>auto_node_affinity=0")
                    --> domain_update_node_aff
    end

%% setting and getting the struct domain's node_affinity:

disable_auto_node_affinity
    --node_affinity=new_affinity-->
        domain_node_affinity

set_node_affinity_from_vcpu_affinities
    ==> domain_node_affinity@{ shape: bow-rect,label: "domain:&nbsp;node_affinity" }
        --XEN_DOMCTL_getnodeaffinity--> do_domctl

end
click is_Host.numa_affinity_policy_set?
"https://github.com/xapi-project/xen-api/blob/90ef043c1f3a3bc20f1c5d3ccaaf6affadc07983/ocaml/xenopsd/xc/domain.ml#L951-L962"
click numa_placement
"https://github.com/xapi-project/xen-api/blob/90ef043c/ocaml/xenopsd/xc/domain.ml#L862-L897"
click stub_xc_hvm_build
"https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L2329-L2436" _blank
click get_flags
"https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L1164-L1288" _blank
click do_domctl
"https://github.com/xen-project/xen/blob/7cf163879/xen/common/domctl.c#L282-L894" _blank
click domain_set_node_affinity
"https://github.com/xen-project/xen/blob/7cf163879/xen/common/domain.c#L943-L970" _blank
click configure_vcpus
"https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L1297-L1348" _blank
click set_hard_affinity?
"https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L1305-L1326" _blank
click xc_vcpu_setaffinity
"https://github.com/xen-project/xen/blob/7cf16387/tools/libs/ctrl/xc_domain.c#L199-L250" _blank
click vcpu_set_affinity
"https://github.com/xen-project/xen/blob/7cf16387/xen/common/sched/core.c#L1353-L1393" _blank
click domain_update_node_aff
"https://github.com/xen-project/xen/blob/7cf16387/xen/common/sched/core.c#L1809-L1876" _blank
click check_auto_node
"https://github.com/xen-project/xen/blob/7cf16387/xen/common/sched/core.c#L1840-L1870" _blank
click set_node_affinity_from_vcpu_affinities
"https://github.com/xen-project/xen/blob/7cf16387/xen/common/sched/core.c#L1867-L1869" _blank

  1. The VM parameter VCPUs-params:mask is documented in the official XAPI user documentation. ↩︎

libxenguest

Introduction

libxenguest is a library written in C, provided for the Xen Hypervisor in Dom0.

For example, it is used as the low-level interface for building Xen Guest domains.

Its source is located in the folder tools/libs/guest of the Xen repository.

Responsibilities

Allocating the boot memory for new & migrated VMs

One important responsibility of libxenguest is creating the memory layout of new and migrated VMs.

The boot memory setup of xenguest and of libxl (used by the xl CLI command) calls xc_dom_boot_mem_init(), which dispatches the call to meminit_hvm() or meminit_pv(); these lay out, allocate, and populate the boot memory of domains.

Functions

Subsections of libxenguest

xc_dom_boot_mem_init()

VM boot memory setup

xenguest’s hvm_build_setup_mem() as well as libxl (used by the xl CLI) call xc_dom_boot_mem_init() to allocate and populate the domain’s system memory for booting it:

flowchart LR

subgraph libxl / xl CLI
    libxl__build_dom("libxl__build_dom()")
end

subgraph xenguest
    hvm_build_setup_mem("hvm_build_setup_mem()")
end

subgraph libxenctrl
    xc_domain_populate_physmap("One call for each memory range&nbsp;(extent):
    xc_domain_populate_physmap()
    xc_domain_populate_physmap()
    xc_domain_populate_physmap()")
end

subgraph libxenguest

    hvm_build_setup_mem & libxl__build_dom
        --> xc_dom_boot_mem_init("xc_dom_boot_mem_init()")

    xc_dom_boot_mem_init
        --> meminit_hvm("meminit_hvm()") & meminit_pv("meminit_pv()")
            --> xc_domain_populate_physmap
end

click xc_dom_boot_mem_init
"https://github.com/xen-project/xen/blob/39c45c/tools/libs/guest/xg_dom_boot.c#L110-L126
" _blank

click meminit_hvm
"https://github.com/xen-project/xen/blob/39c45c/tools/libs/guest/xg_dom_x86.c#L1348-L1648
" _blank

click meminit_pv
"https://github.com/xen-project/xen/blob/de0254b9/tools/libs/guest/xg_dom_x86.c#L1183-L1333
" _blank

The allocation strategies of the called functions are:

Strategy of the libxenguest meminit functions

  • Attempt to allocate 1GB superpages when possible
  • Fall back to 2MB pages when 1GB allocation failed
  • Fall back to 4k pages when both failed

They use xc_domain_populate_physmap() to perform memory allocation and to map the allocated memory to the system RAM ranges of the domain.
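
A simplified sketch of this fallback strategy (this is not the meminit_hvm()/meminit_pv() code; the real code also aligns extents and loops over the whole memory range):

#include <xenctrl.h>

/* Sketch: try one extent as a 1GB superpage (order 18), then 2MB (order 9),
 * then a plain 4k page (order 0). */
int populate_with_fallback(xc_interface *xch, uint32_t domid,
                           xen_pfn_t gpfn, unsigned int memflags)
{
    static const unsigned int orders[] = { 18 /* 1GB */, 9 /* 2MB */, 0 /* 4k */ };

    for (unsigned int i = 0; i < 3; i++) {
        xen_pfn_t extent = gpfn;  /* first guest frame of the extent */
        int done = xc_domain_populate_physmap(xch, domid, 1, orders[i],
                                              memflags, &extent);

        if (done == 1)
            return 0;  /* this order worked; the real code continues with it */
    }

    return -1;  /* even 4k allocation failed */
}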

Strategy of xc_domain_populate_physmap()

xc_domain_populate_physmap() calls the XENMEM_populate_physmap command of the Xen memory hypercall.

For a more detailed walk-through of the inner workings of this hypercall, see the reference on xc_domain_populate_physmap().

For more details on the VM build step involving xenguest and Xen side see: https://wiki.xenproject.org/wiki/Walkthrough:_VM_build_using_xenguest

Xen

  • get_free_buddy()

    Find free memory based on the given flags and optionally, a domain

Subsections of Xen

get_free_buddy()

Overview

get_free_buddy() is called from alloc_heap_pages() to find a page at the most suitable place for a memory allocation.

It finds memory depending on the given flags and domain:

  • Optionally prefer to allocate from a passed NUMA node
  • Optionally allocate from the domain’s next affine NUMA node (round-robin)
  • Optionally return if the preferred NUMA allocation did not succeed
  • Optionally allocate from not-yet scrubbed memory
  • Optionally allocate from the given range of memory zones
  • Fall back to allocating from the next NUMA node on the system (round-robin)

Input parameters

  • struct domain
  • Zones to allocate from (zone_hi until zone_lo)
  • Page order (size of the page)
    • populate_physmap() starts with 1GB pages and falls back to 2MB and 4k pages.

Allocation strategy

Its first attempt is to find a page of matching page order on the requested NUMA node(s).

If this is not successful, it resorts to breaking higher page orders, and if that fails too, it lowers the zone until zone_lo.

By default, it does not use unscrubbed pages, but when the memflags contain MEMF_no_scrub, it uses check_and_stop_scrub(pg) on 4k pages instead of breaking higher-order pages.

If this fails, it checks if other NUMA nodes shall be tried.

Exact NUMA allocation (on request, e.g. for vNUMA)

For example for vNUMA domains, the calling functions pass one specific NUMA node, and they would also set MEMF_exact_node to make sure that memory is specifically only allocated from this NUMA node.

If no NUMA node was passed or the allocation from it failed, and MEMF_exact_node was not set in memflags, the function falls back to the first fallback, NUMA-affine allocation.

NUMA-affine allocation

For local NUMA memory allocation, the domain should have one or more NUMA nodes in its struct domain->node_affinity field when this function is called.

This happens as part of NUMA placement, which writes the planned vCPU affinity of the domain’s vCPUs to the XenStore. xenguest reads it to update the vCPU affinities of the domain’s vCPUs in Xen, which in turn, by default (when domain->auto_node_affinity is active), also updates the struct domain->node_affinity field.

Note: In case it contains multiple NUMA nodes, this step allocates from the next NUMA node after the previous NUMA node the domain allocated from in a round-robin way.

Otherwise, the function falls back to host-wide round-robin allocation.

Host-wide round-robin allocation

When the domain’s node_affinity is not defined or did not succeed and MEMF_exact_node was not passed in memflags, all remaining NUMA nodes are attempted in a round-robin way: Each subsequent call uses the next NUMA node after the previous node that the domain allocated memory from.

Flowchart

This flowchart shows an overview of the decision chain of get_free_buddy()

flowchart TD

alloc_round_robin
  --No free memory on the host-->
    Failure

node_affinity_exact
  --No free memory<br>on the Domain's
    node_affinity nodes:<br>Abort exact allocation-->
      Failure

get_free_buddy["get_free_buddy()"]
  -->MEMF_node{memflags<br>&<br>MEMF_node?}
    --Yes-->
      try_MEMF_node{Alloc
                    from
                    node}
        --Success: page-->
          Success
      try_MEMF_node
        --No free memory on the node-->
          MEMF_exact{memflags
                     &
                     MEMF_exact?}
            --"No"-->
              node_affinity_set{NUMA affinity set?}
                --  Domain->node_affinity
                    is not set: Fall back to
                    round-robin allocation
                      --> alloc_round_robin

          MEMF_exact
            --Yes:
              As there is not enough
              free memory on the
              exact NUMA node(s):
              Abort exact allocation
                -->Failure

  MEMF_node
    --No NUMA node in memflags-->
      node_affinity_set{domain-><br>node_affinity<br>set?}
        --Set-->
          node_affinity{Alloc from<br>node_affinity<br>nodes}
            --No free memory on
              the node_affinity nodes
              Check if exact request-->
                node_affinity_exact{memflags<br>&<br>MEMF_exact?}
                  --Not exact: Fall back to<br>round-robin allocation-->
                    alloc_round_robin

    node_affinity--Success: page-->Success

    alloc_round_robin{" Fall back to
                        round-robin
                        allocation"}
                        --Success: page-->
                          Success(Success: Return the page)

click get_free_buddy
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L855-L1116
" _blank