Libraries
- libxenctrl: Xen Control library for controlling the Xen hypervisor
- libxenguest: Xen Guest library for building Xen Guest domains
- Xen: Insights into Xen hypercall functions exposed to the toolstack
- xc_domain_claim_pages(): Stake a claim for further memory for a domain, and release it too
- xc_domain_node_setaffinity(): Set a Xen domain's NUMA node affinity for memory allocations
- xc_domain_populate_physmap(): Populate a Xen domain's physical memory map
- xc_vcpu_setaffinity(): Set a Xen vCPU's pCPU affinity and the domain's NUMA node affinity
The purpose of xc_domain_claim_pages() is to stake a claim on an amount of memory for a given domain which guarantees that memory allocations for the claimed amount will be successful. The domain can still attempt to allocate beyond the claim, but those allocations are not guaranteed to succeed and will fail once the domain's memory usage reaches its max_mem value.
The aim is to stake a claim for a domain on a quantity of pages of system RAM without assigning specific page frames. The hypercall performs only arithmetic, so it is very fast and does not need to be preempted. Thereby, it sidesteps time-of-check/time-of-use races for memory allocation.
xc_domain_claim_pages()
returns 0 if the Xen page allocator has atomically
and successfully claimed the requested number of pages, else non-zero.
Errors returned by xc_domain_claim_pages() must be handled: they are a normal result of the xenopsd thread-pool claiming and starting many VMs in parallel during a boot storm scenario. This is especially important when staking claims on NUMA nodes using an updated version of this function. In that case, the only option for the calling worker thread is to adapt to the NUMA boot storm: attempt to find a different NUMA node for claiming the memory and try again.
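For illustration, a minimal sketch of such a retry loop. It assumes a hypothetical NUMA-aware variant, here called xc_domain_claim_pages_on_node(); the upstream xc_domain_claim_pages() takes no node argument:

```c
#include <xenctrl.h>

/* Hypothetical prototype of the proposed NUMA-aware claim variant.
 * It is NOT part of upstream libxenctrl. */
int xc_domain_claim_pages_on_node(xc_interface *xch, uint32_t domid,
                                  unsigned long nr_pages, unsigned int node);

/* Sketch: during a boot storm, adapt to a full NUMA node by retrying the
 * claim on other candidate nodes before giving up. */
int claim_with_numa_fallback(xc_interface *xch, uint32_t domid,
                             unsigned long nr_pages,
                             const unsigned int *candidate_nodes,
                             unsigned int nr_candidates)
{
    for (unsigned int i = 0; i < nr_candidates; i++)
        if (xc_domain_claim_pages_on_node(xch, domid, nr_pages,
                                          candidate_nodes[i]) == 0)
            return 0;            /* claim staked on candidate_nodes[i] */

    return -1;                   /* no candidate node can satisfy the claim right now */
}
```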
For reference, see the comment on the hypercall in the Xen hypervisor header xen/include/public/memory.h. The design of the backing hypercall in Xen is as follows:
To set up the boot memory of a new domain, the libxenguest function xc_dom_boot_mem_init() relies on this design: the boot memory setup of xenguest, and of the xl CLI via libxl, use it. libxl actively uses it to stake an initial claim before allocating memory.
The functions meminit_hvm() and meminit_pv() used by xc_dom_boot_mem_init() always destroy any remaining claim before they return. Thus, after libxenguest has completed allocating and populating the physical memory of the domain, no domain has a remaining claim.
From this point onwards, no memory allocation is guaranteed. While swapping memory between domains can be expected to always succeed, allocating a new page after freeing another one can fail, unless a privileged domain stakes a claim for such allocations beforehand.
The Xen upstream memory claims code works as follows: while a domain has an outstanding claim, Xen updates the remaining claim accordingly when allocating memory for the domain. When memory of a domain that still has a stake from a claim is freed, the freed amount of memory increases the stake again. However, Xen does not know whether a page was allocated using a NUMA node claim. Therefore, it cannot know whether it would be legal to increment a NUMA node claim when freeing pages.
The smallest possible change to achieve a similar effect would be to add a field to the Xen hypervisor's domain struct for remembering the last total NUMA node claim. With it, freeing memory from a NUMA node could attempt to increase the outstanding claim on the claimed NUMA node, but only if this amount is still available and not claimed otherwise.
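A minimal sketch of this idea, using assumed field names (upstream Xen only tracks the host-wide claim in outstanding_pages):

```c
/* Hypothetical sketch (not upstream Xen): per-domain state that would be
 * needed so that freeing pages can re-increase a NUMA node claim.
 * The claim_node* field names are assumptions for illustration. */
typedef unsigned int nodeid_t;          /* stand-in for Xen's nodeid_t */

struct claim_state {
    unsigned long outstanding_pages;    /* upstream: claimed but not yet allocated pages */
    nodeid_t      claim_node;           /* assumed: NUMA node the claim was staked on */
    unsigned long claim_node_pages;     /* assumed: last total claim on that node */
};
```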
xl claims reports the outstanding claims of the domains:

    xl claims
    Name                 ID   Mem VCPUs      State   Time(s)  Claimed
    Domain-0              0  2656     8     r-----  957418.2        0

xl info reports the host-wide outstanding claims:

    xl info | grep outstanding
    outstanding_claims     : 0
Xen only tracks the claim amounts, not specific page frames. Further notes:
- Claiming zero pages effectively cancels the domain's outstanding claim and is always successful.
- The max_mem value is used to deny memory allocations: if an allocation would cause the domain to exceed its max_mem value, it will always fail.
Function signature of the libxenctrl function to call the Xen hypercall:

    long xc_memory_op(libxc_handle, XENMEM_claim_pages, struct xen_memory_reservation *)

struct xen_memory_reservation is defined as:

    struct xen_memory_reservation {
        .nr_extents   = nr_pages, /* number of pages to claim */
        .extent_order = 0,        /* an order of 0 means: 4k pages, only 0 is allowed */
        .mem_flags    = 0,        /* no flags, only 0 is allowed (at the moment) */
        .domid        = domid     /* numerical domain ID of the domain */
    };
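For illustration, a minimal sketch of how a toolstack component might use the libxenctrl wrapper around this hypercall, following the error-handling and zero-page-cancel behaviour described above:

```c
#include <xenctrl.h>

/* Sketch: stake a claim before populating the domain's memory and cancel it
 * afterwards.  xc_domain_claim_pages() wraps XENMEM_claim_pages; a claim of
 * zero pages cancels the domain's outstanding claim. */
int claim_boot_memory(xc_interface *xch, uint32_t domid, unsigned long nr_pages)
{
    if (xc_domain_claim_pages(xch, domid, nr_pages))
        return -1;   /* normal during a boot storm: the caller must handle this */

    /* ... allocate and populate the domain's memory here ... */

    /* Cancel any remaining claim (claiming zero pages always succeeds). */
    xc_domain_claim_pages(xch, domid, 0);
    return 0;
}
```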
This hypercall is used by libxenguest, which in turn is used at least by libxl and the xl CLI. The xl CLI uses claims actively: by default, it enables claims, so libxl passes the struct xc_dom_image with its claim_enabled field set. The functions dispatched by xc_dom_boot_mem_init() then attempt to claim the boot memory using xc_domain_claim_pages(). They also (unconditionally) destroy any open claim upon return. This means that in case the claim fails, xl avoids:
With a proposed update of xc_domain_claim_pages() for NUMA node claims, a node argument is added. It can be XC_NUMA_NO_NODE for defining a host-wide claim, or a NUMA node for staking a claim on one NUMA node. This update does not change the foundational design of memory claims in the Xen hypervisor, where a claim is defined as a single claim for the domain.
For improved NUMA support, xenopsd is updated to call an updated version of this function for the domain. It reserves NUMA node memory before xenguest is called, and a new pnode argument is added to xenguest. It sets the NUMA node for memory allocations by the xenguest boot memory setup, xc_dom_boot_mem_init(), which also sets the exact flag.
The exact flag forces get_free_buddy() to fail if it cannot find scrubbed memory for the given pnode, which causes alloc_heap_pages() to re-run get_free_buddy() with the flag to fall back to dirty, not yet scrubbed memory. alloc_heap_pages() then checks each 4k page for the need to scrub and scrubs those pages before returning them.
This is expected to mitigate two issues with parallel boots: memory potentially spreading over all NUMA nodes because one NUMA node becomes the target of parallel memory allocations which cannot fit on it, and memory spreading over all NUMA nodes in case a huge amount of dirty memory needs to be scrubbed before a domain can start or restart.
This is not seen as an issue, as grant tables and I/O pages are usually not as frequently used as the regular system memory of domains, but this potential issue remains: in discussions, it was said that Windows PV drivers unmap and free memory for grant tables to Xen and then re-allocate memory for those grant tables.
xenopsd
may want to try to stake a very small claim for the domain on the
NUMA node of the domain so that Xen can increase this claim when the PV drivers
free
this memory and re-use the resulting claimed amount for allocating
the grant tables.
This would ensure that the grant tables are then allocated on the local NUMA node of the domain, avoiding remote memory accesses when accessing the grant tables from inside the domain.
Note: In case the corresponding backend process in Dom0 is running on another NUMA node, it would access the domain's grant tables from a remote NUMA node, but this would enable a future improvement for Dom0, where it could prefer to run the corresponding backend process on the same or a neighbouring NUMA node.
xc_domain_node_setaffinity() controls the NUMA node affinity of a domain, but it only updates the Xen hypervisor domain's d->node_affinity mask. This mask is read by the Xen memory allocator as the second preference for the NUMA node to allocate memory from for this domain: when an allocation does not request a specific NUMA node, the node_affinity mask is tried. As this call has no practical effect on the Xen scheduler, vCPU affinities need to be set separately anyway.
The domain's auto_node_affinity flag is enabled by default by Xen. This means that when setting vCPU affinities, Xen updates the d->node_affinity mask to consist of the NUMA nodes to which its vCPUs have affinity.
See xc_vcpu_setaffinity() for more information
on how d->auto_node_affinity
is used to set the NUMA node affinity.
Thus, so far, there is no obvious need to call xc_domain_node_setaffinity()
when building a domain.
Setting the NUMA node affinity using this call can be useful, for example, when there might not be enough memory on the preferred NUMA node, but other NUMA nodes have enough free memory to be used for the system memory of the domain.
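As an illustration, a sketch of how a toolstack component could use this call, assuming the libxenctrl helper xc_nodemap_alloc() and a byte-wise bitmap layout for xc_nodemap_t (error handling kept minimal):

```c
#include <stdlib.h>
#include <xenctrl.h>

/* Sketch: restrict a domain's NUMA node affinity for memory allocations to a
 * single node.  This updates d->node_affinity and disables
 * d->auto_node_affinity in Xen; it fails if the node is not online. */
int set_domain_memory_node(xc_interface *xch, uint32_t domid, unsigned int node)
{
    xc_nodemap_t nodemap = xc_nodemap_alloc(xch);   /* zeroed bitmap of all nodes */
    int rc;

    if (!nodemap)
        return -1;

    nodemap[node / 8] |= 1 << (node % 8);           /* select the single NUMA node */

    rc = xc_domain_node_setaffinity(xch, domid, nodemap);

    free(nodemap);
    return rc;
}
```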
In terms of future NUMA design, it might be even more favourable to
have a strategy in xenguest
where in such cases, the superpages
of the preferred node are used first and a fallback to neighbouring
NUMA nodes only happens to the extent necessary.
Likely, the future allocation strategy should be passed to xenguest
using Xenstore like the other platform parameters for the VM.
classDiagram class `xc_domain_node_setaffinity()` { +xch: xc_interface #42; +domid: uint32_t +nodemap: xc_nodemap_t 0(on success) -EINVAL(if a node in the nodemask is not online) } click `xc_domain_node_setaffinity()` href " https://github.com/xen-project/xen/blob/master/tools/libs/ctrl/xc_domain.c#L122-L158" `xc_domain_node_setaffinity()` --> `Xen hypercall: do_domctl()` `xc_domain_node_setaffinity()` <-- `Xen hypercall: do_domctl()` class `Xen hypercall: do_domctl()` { Calls domain_set_node_affinity#40;#41; and returns its return value Passes: domain (struct domain *, looked up using the domid) Passes: new_affinity (modemask, converted from xc_nodemap_t) } click `Xen hypercall: do_domctl()` href " https://github.com/xen-project/xen/blob/master/xen/common/domctl.c#L516-L525" `Xen hypercall: do_domctl()` --> `domain_set_node_affinity()` `Xen hypercall: do_domctl()` <-- `domain_set_node_affinity()` class `domain_set_node_affinity()` { domain: struct domain new_affinity: nodemask 0(on success, the domain's node_affinity is updated) -EINVAL(if a node in the nodemask is not online) } click `domain_set_node_affinity()` href " https://github.com/xen-project/xen/blob/master/xen/common/domain.c#L943-L970"
This function implements the functionality of xc_domain_node_setaffinity to set the NUMA affinity of a domain as described above. If the new_affinity does not intersect the node_online_map, it returns -EINVAL. Otherwise, it succeeds and returns 0.
When the new_affinity
is a specific set of NUMA nodes, it updates the NUMA
node_affinity
of the domain to these nodes and disables d->auto_node_affinity
for this domain. With d->auto_node_affinity
disabled,
xc_vcpu_setaffinity() no longer updates the NUMA affinity
of this domain.
If new_affinity
has all bits set, it re-enables the d->auto_node_affinity
for this domain and calls
domain_update_node_aff()
to re-set the domain’s node_affinity
mask to the NUMA nodes of the current hard and soft affinity of the domain's online vCPUs.
The effect of domain_set_node_affinity()
can be seen more clearly on this
flowchart which shows how xc_vcpu_setaffinity() is currently used to set
is currently used to set
the NUMA affinity of a new domain, but also shows how domain_set_node_affinity()
relates to it:
In the flowchart, two code paths are set in bold:
- The case where Host.numa_affinity_policy is the default (off) in xenopsd.
- xc_vcpu_setaffinity(XEN_VCPUAFFINITY_SOFT) in Xen, when the domain's auto_node_affinity flag is enabled (the default), to show how the vCPU affinity update also updates the domain's node_affinity in this default case.

xenguest uses the Xenstore to read the static domain configuration that it needs to build the domain.
flowchart TD subgraph VM.create["xenopsd VM.create"] %% Is xe vCPU-params:mask= set? If yes, write to Xenstore: is_xe_vCPUparams_mask_set?{" Is <tt>xe vCPU-params:mask=</tt> set? Example: <tt>1,2,3</tt> (Is used to enable vCPU<br>hard-affinity) "} --"yes"--> set_hard_affinity("Write hard-affinity to XenStore: <tt>platform/vcpu/#domid/affinity</tt> (xenguest will read this and other configuration data from Xenstore)") end subgraph VM.build["xenopsd VM.build"] %% Labels of the decision nodes is_Host.numa_affinity_policy_set?{ Is<p><tt>Host.numa_affinity_policy</tt><p>set?} has_hard_affinity?{ Is hard-affinity configured in <p><tt>platform/vcpu/#domid/affinity</tt>?} %% Connections from VM.create: set_hard_affinity --> is_Host.numa_affinity_policy_set? is_xe_vCPUparams_mask_set? == "no"==> is_Host.numa_affinity_policy_set? %% The Subgraph itself: %% Check Host.numa_affinity_policy is_Host.numa_affinity_policy_set? %% If Host.numa_affinity_policy is "best_effort": -- Host.numa_affinity_policy is<p><tt>best_effort --> %% If has_hard_affinity is set, skip numa_placement: has_hard_affinity? --"yes"-->exec_xenguest %% If has_hard_affinity is not set, run numa_placement: has_hard_affinity? --"no"-->numa_placement-->exec_xenguest %% If Host.numa_affinity_policy is off (default, for now), %% skip NUMA placement: is_Host.numa_affinity_policy_set? =="default: disabled"==> exec_xenguest end %% xenguest subgraph subgraph xenguest exec_xenguest ==> stub_xc_hvm_build("<tt>stub_xc_hvm_build()") ==> configure_vcpus("<tT>configure_vcpus()") %% Decision ==> set_hard_affinity?{" Is <tt>platform/<br>vcpu/#domid/affinity</tt> set?"} end %% do_domctl Hypercalls numa_placement --Set the NUMA placement using soft-affinity--> XEN_VCPUAFFINITY_SOFT("<tt>xc_vcpu_setaffinity(SOFT)") ==> do_domctl set_hard_affinity? --yes--> XEN_VCPUAFFINITY_HARD("<tt>xc_vcpu_setaffinity(HARD)") --> do_domctl xc_domain_node_setaffinity("<tt>xc_domain_node_setaffinity()</tt> and <tt>xc_domain_node_getaffinity()") <--> do_domctl %% Xen subgraph subgraph xen[Xen Hypervisor] subgraph domain_update_node_affinity["domain_update_node_affinity()"] domain_update_node_aff("<tt>domain_update_node_aff()") ==> check_auto_node{"Is domain's<br><tt>auto_node_affinity</tt><br>enabled?"} =="yes (default)"==>set_node_affinity_from_vcpu_affinities(" Calculate the domain's <tt>node_affinity</tt> mask from vCPU affinity (used for further NUMA memory allocation for the domain)") end do_domctl{"do_domctl()<br>op->cmd=?"} ==XEN_DOMCTL_setvcpuaffinity==> vcpu_set_affinity("<tt>vcpu_set_affinity()</tt><br>set the vCPU affinity") ==>domain_update_node_aff do_domctl --XEN_DOMCTL_setnodeaffinity (not used currently) -->is_new_affinity_all_nodes? subgraph domain_set_node_affinity["domain_set_node_affinity()"] is_new_affinity_all_nodes?{new_affinity<br>is #34;all#34;?} --is #34;all#34; --> enable_auto_node_affinity("<tt>auto_node_affinity=1") --> domain_update_node_aff is_new_affinity_all_nodes? --not #34;all#34; --> disable_auto_node_affinity("<tt>auto_node_affinity=0") --> domain_update_node_aff end %% setting and getting the struct domain's node_affinity: disable_auto_node_affinity --node_affinity=new_affinity--> domain_node_affinity set_node_affinity_from_vcpu_affinities ==> domain_node_affinity@{ shape: bow-rect,label: "domain: node_affinity" } --XEN_DOMCTL_getnodeaffinity--> do_domctl end click is_Host.numa_affinity_policy_set? 
"https://github.com/xapi-project/xen-api/blob/90ef043c1f3a3bc20f1c5d3ccaaf6affadc07983/ocaml/xenopsd/xc/domain.ml#L951-L962" click numa_placement "https://github.com/xapi-project/xen-api/blob/90ef043c/ocaml/xenopsd/xc/domain.ml#L862-L897" click stub_xc_hvm_build "https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L2329-L2436" _blank click get_flags "https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L1164-L1288" _blank click do_domctl "https://github.com/xen-project/xen/blob/7cf163879/xen/common/domctl.c#L282-L894" _blank click domain_set_node_affinity "https://github.com/xen-project/xen/blob/7cf163879/xen/common/domain.c#L943-L970" _blank click configure_vcpus "https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L1297-L1348" _blank click set_hard_affinity? "https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L1305-L1326" _blank click xc_vcpu_setaffinity "https://github.com/xen-project/xen/blob/7cf16387/tools/libs/ctrl/xc_domain.c#L199-L250" _blank click vcpu_set_affinity "https://github.com/xen-project/xen/blob/7cf16387/xen/common/sched/core.c#L1353-L1393" _blank click domain_update_node_aff "https://github.com/xen-project/xen/blob/7cf16387/xen/common/sched/core.c#L1809-L1876" _blank click check_auto_node "https://github.com/xen-project/xen/blob/7cf16387/xen/common/sched/core.c#L1840-L1870" _blank click set_node_affinity_from_vcpu_affinities "https://github.com/xen-project/xen/blob/7cf16387/xen/common/sched/core.c#L1867-L1869" _blank
xc_domain_node_setaffinity can be used to set the domain's node_affinity (which is normally set by xc_vcpu_setaffinity) to different NUMA nodes. Currently, the node affinity does not affect the Xen scheduler:
In case d->node_affinity
would be set before vCPU creation, the initial pCPU
of the new vCPU is the first pCPU of the first NUMA node in the domain’s
node_affinity
. This is further changed when one or more cpupools are set up. As this is only the initial pCPU of the vCPU, this alone does not change the scheduling of the Xen Credit scheduler, as it reschedules vCPUs to other pCPUs.
When done early, before vCPU creation, some domain-related data structures
could be allocated using the domain’s d->node_affinity
NUMA node mask.
With further changes in Xen and xenopsd
, Xen could allocate the vCPU structs
on the affine NUMA nodes of the domain.
For this, xenopsd
would have to call xc_domain_node_setaffinity()
before vCPU creation, after having decided the domain’s NUMA placement,
preferably including claiming the required memory for the domain to ensure
that the domain will be populated from the same NUMA node(s).
This call cannot influence the past: the xenopsd VM_create micro-op calls Xenctrl.domain_create, which currently creates the domain's data structures before numa_placement is done.
Improving Xenctrl.domain_create
to pass a NUMA node
for allocating the Hypervisor’s data structures (e.g. vCPU)
of the domain would require changes
to the Xen hypervisor and the xenopsd VM_create micro-op.
xenguest uses xc_domain_populate_physmap() and xc_domain_populate_physmap_exact() to populate a Xen domain's physical memory map. Both call the XENMEM_populate_physmap hypercall. xc_domain_populate_physmap_exact() also sets the "exact" flag for allocating memory only on the given NUMA node.
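For illustration, a hedged sketch of how a caller might populate a range of guest frames from one NUMA node, assuming the XENMEMF_exact_node() flag from xen/include/public/memory.h:

```c
#include <stdlib.h>
#include <xenctrl.h>

/* Sketch: populate nr_pages 4k guest frames starting at gpfn_start and request
 * that Xen allocates them only from the given physical NUMA node (pnode). */
int populate_from_node(xc_interface *xch, uint32_t domid,
                       xen_pfn_t gpfn_start, unsigned long nr_pages,
                       unsigned int pnode)
{
    xen_pfn_t *extents = malloc(nr_pages * sizeof(*extents));
    int rc;

    if (!extents)
        return -1;

    for (unsigned long i = 0; i < nr_pages; i++)
        extents[i] = gpfn_start + i;            /* one 4k extent per guest frame */

    /* extent_order 0 means 4k pages; the "exact" variant fails unless all
     * extents could be populated. */
    rc = xc_domain_populate_physmap_exact(xch, domid, nr_pages, 0,
                                          XENMEMF_exact_node(pnode), extents);
    free(extents);
    return rc;
}
```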
This is a very simplified overview of the hypercall:
flowchart LR subgraph hypercall handlers populate_physmap("<tt>populate_physmap()</tt> One call for each memory range (extent)") end subgraph "Xen buddy allocator:" populate_physmap --> alloc_domheap_pages("<tt>alloc_domheap_pages()</tt> Assign allocated pages to the domain") alloc_domheap_pages --> alloc_heap_pages("<tt>alloc_heap_pages()</tt> If needed: split high-order pages into smaller buddies, and scrub dirty pages") --> get_free_buddy("<tt>get_free_buddy()</tt> If reqested: Allocate from a preferred/exact NUMA node and/or from unscrubbed memory ") end click populate_physmap "https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L159-L314 " _blank click alloc_domheap_pages "https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L2641-L2697 " _blank click get_free_buddy "https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L855-L958 " _blank click alloc_heap_pages "https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L967-L1116 " _blank
It calls
construct_memop_from_reservation()
to convert the arguments for allocating a page from
struct xen_memory_reservation
to struct memop_args
and
passes
it to
populate_physmap():
It populates struct memop_args using the hypercall arguments. It:
- Copies extent_start, nr_extents, and extent_order.
- Sets memop_args->memflags using the given mem_flags.

When a vNUMA vnode is passed using XENMEMF_vnode, and domain->vnuma and domain->vnuma->nr_vnodes are set, and the vnode (virtual NUMA node) maps to a pnode (physical NUMA node), it also:
- Sets the pnode in the memflags of the struct memop_args
- Sets XENMEMF_exact_node_request in them as well.

If no vNUMA node is passed, construct_memop_from_reservation() calls propagate_node() to propagate the NUMA node and XENMEMF_exact_node_request for use in Xen.
It handles hypercall preemption and resumption after a preemption, and keeps track of the already populated pages. For each range (extent), it runs an iteration of the allocation loop: it passes the struct domain and the memflags to alloc_domheap_pages(), which calls alloc_heap_pages() and, on success, assigns the allocated pages to the domain.
It calls get_free_buddy() to allocate a page at the most suitable place: when no pages of the requested size are free, it splits larger superpages into pages of the requested size. get_free_buddy() finds memory based on the flags and the domain and returns its page struct. For details, see the section below on how get_free_buddy() finds memory based on the flags and domain.
flowchart TD subgraph XenCtrl xc_domain_populate_physmap["<tt>xc_domain_populate_physmap()"] xc_domain_populate_physmap_exact["<tt>xc_domain_populate_physmap_exact()"] end subgraph Xen %% sub-subgraph from memory_op() to populate_node() and back xc_domain_populate_physmap & xc_domain_populate_physmap_exact <--reservation,<br>and for preempt:<br>nr_start/nr_done--> memory_op("<tt>memory_op(XENMEM_populate_physmap)") memory_op --struct xen_memory_reservation--> construct_memop_from_reservation("<tt>construct_memop_from_reservation()") --struct<br>xen_memory_reservation->mem_flags--> propagate_node("<tt>propagate_node()") --_struct<br>memop_args->memflags_--> construct_memop_from_reservation --_struct memop_args_--> memory_op<--struct memop_args *: struct domain *, List of extent base addrs, Number of extents, Size of each extent (extent_order), Allocation flags(memflags)--> populate_physmap[["<tt>populate_physmap()"]] <-.domain, extent base addrs, extent size, memflags, nr_start and nr_done.-> populate_physmap_loop--if memflags & MEMF_populate_on_demand -->guest_physmap_mark_populate_on_demand(" <tt>guest_physmap_mark_populate_on_demand()") populate_physmap_loop@{ label: "While extents to populate, and not asked to preempt, for each extent left to do:", shape: notch-pent } --domain, order, memflags--> alloc_domheap_pages("<tt>alloc_domheap_pages()") --zone_lo, zone_hi, order, memflags, domain--> alloc_heap_pages --zone_lo, zone_hi, order, memflags, domain--> get_free_buddy("<tt>get_free_buddy()") --_page_info_ -->alloc_heap_pages --if no page--> no_scrub("<tt>get_free_buddy(MEMF_no_scrub)</tt> (honored only when order==0)") --_dirty 4k page_ -->alloc_heap_pages <--_dirty 4k page_--> scrub_one_page("<tt>scrub_one_page()") alloc_heap_pages("<tt>alloc_heap_pages()</tt> (also splits higher-order pages into smaller buddies if needed)") --_page_info_ -->alloc_domheap_pages --page_info, order, domain, memflags-->assign_page("<tt>assign_page()") assign_page --page_info, nr_mfns, domain, memflags--> assign_pages("<tt>assign_pages()") --domain, nr_mfns--> domain_adjust_tot_pages("<tt>domain_adjust_tot_pages()") alloc_domheap_pages --_page_info_--> populate_physmap_loop --page(gpfn, mfn, extent_order)--> guest_physmap_add_page("<tt>guest_physmap_add_page()") populate_physmap--nr_done, preempted-->memory_op end click memory_op "https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L1409-L1425 " _blank click construct_memop_from_reservation "https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L1022-L1071 " _blank click propagate_node "https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L524-L547 " _blank click populate_physmap "https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L159-L314 " _blank click populate_physmap_loop "https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L197-L304 " _blank click guest_physmap_mark_populate_on_demand "https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L210-220 " _blank click guest_physmap_add_page "https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L296 " _blank click alloc_domheap_pages "https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L2641-L2697 " _blank click alloc_heap_pages "https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L967-L1116 " _blank click get_free_buddy "https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L855-L958 " _blank click assign_page 
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L2540-L2633 " _blank click assign_pages "https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L2635-L2639 " _blank
In the Xen hypervisor, each vCPU has:
- A soft affinity: the list of pCPUs where a vCPU prefers to run. This can be used to make vCPUs prefer to run on a set of pCPUs, for example the pCPUs of a NUMA node, but in case those are already busy, the Credit scheduler can still ignore the soft-affinity. A typical use case for this are NUMA machines, where the soft affinity for the vCPUs of a domain should be set equal to the pCPUs of the NUMA node where the domain's memory shall be placed. See the description of the NUMA feature for more details.
- A hard affinity, also known as pinning: the list of pCPUs where a vCPU is allowed to run.
Hard affinity is currently not used for NUMA placement, but can be configured
manually for a given domain, either using xe VCPUs-params:mask=
or the API.
For example, the vCPUs' pinning can be configured for a VM with:[^1]

    xe vm-param-set uuid=<template_uuid> VCPUs-params:mask=1,2,3
There are also host-level guest_VCPUs_params
which are used by
host-cpu-tune
to exclusively pin Dom0 and guests (i.e. their pCPUs never overlap). Note: This isn't currently supported by the NUMA code: it could happen that the NUMA placement picks a node that has reduced capacity or is unavailable due to the host mask that host-cpu-tune has set.
The libxenctrl library call xc_vcpu_setaffinity() controls the pCPU affinity of the given vCPU. xenguest uses it when building domains if xenopsd added vCPU affinity information to the XenStore platform data path platform/vcpu/#domid/affinity of the domain. Besides that, xc_vcpu_setaffinity() can also modify the NUMA node affinity of the Xen domain:
When Xen creates a domain, it enables the domain’s d->auto_node_affinity
feature flag.
When it is enabled, setting the vCPU affinity also updates the NUMA node affinity which is used for memory allocations for the domain:
flowchart TD subgraph libxenctrl xc_vcpu_setaffinity("<tt>xc_vcpu_setaffinity()")--hypercall-->xen end subgraph xen[Xen Hypervisor] direction LR vcpu_set_affinity("<tt>vcpu_set_affinity()</tt><br>set the vCPU affinity") -->check_auto_node{"Is the domain's<br><tt>auto_node_affinity</tt><br>enabled?"} --"yes<br>(default)"--> auto_node_affinity("Set the<br>domain's<br><tt>node_affinity</tt> mask as well<br>(used for further<br>NUMA memory<br>allocation)") click xc_vcpu_setaffinity "https://github.com/xen-project/xen/blob/7cf16387/tools/libs/ctrl/xc_domain.c#L199-L250" _blank click vcpu_set_affinity "https://github.com/xen-project/xen/blob/7cf16387/xen/common/sched/core.c#L1353-L1393" _blank click domain_update_node_aff "https://github.com/xen-project/xen/blob/7cf16387/xen/common/sched/core.c#L1809-L1876" _blank click check_auto_node "https://github.com/xen-project/xen/blob/7cf16387/xen/common/sched/core.c#L1840-L1870" _blank click auto_node_affinity "https://github.com/xen-project/xen/blob/7cf16387/xen/common/sched/core.c#L1867-L1869" _blank end
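A sketch of how such a soft-affinity update could be issued from a toolstack component, assuming the libxenctrl signature that takes separate hard and soft cpumaps plus a flags argument:

```c
#include <stdlib.h>
#include <xenctrl.h>

/* Sketch: set a vCPU's soft affinity to a set of pCPUs (for example, the pCPUs
 * of the NUMA node chosen for the domain's memory).  With the domain's
 * auto_node_affinity flag enabled (the default), Xen also recomputes the
 * domain's node_affinity mask from this soft affinity. */
int set_vcpu_soft_affinity(xc_interface *xch, uint32_t domid, int vcpu,
                           const unsigned int *pcpus, unsigned int nr_pcpus)
{
    xc_cpumap_t hard = xc_cpumap_alloc(xch);   /* returned by Xen, unused here */
    xc_cpumap_t soft = xc_cpumap_alloc(xch);   /* zeroed bitmap of all pCPUs */
    int rc = -1;

    if (hard && soft) {
        for (unsigned int i = 0; i < nr_pcpus; i++)
            soft[pcpus[i] / 8] |= 1 << (pcpus[i] % 8);

        rc = xc_vcpu_setaffinity(xch, domid, vcpu, hard, soft,
                                 XEN_VCPUAFFINITY_SOFT);
    }

    free(hard);
    free(soft);
    return rc;
}
```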
When Host.numa_affinity_policy
is set to
best_effort,
xenopsd attempts NUMA node placement
when building new VMs and instructs
xenguest
to set the vCPU affinity of the domain.
With the domain’s auto_node_affinity
flag enabled by default in Xen,
this automatically also sets the d->node_affinity
mask of the domain.
This then causes the Xen memory allocator to prefer the NUMA nodes in the
d->node_affinity
NUMA node mask when allocating memory.
That is, (for completeness) unless Xen’s allocation function
alloc_heap_pages()
receives a specific NUMA node in its memflags
argument when called.
See xc_domain_node_setaffinity() for more
information about another way to set the node_affinity
NUMA node mask
of Xen domains and more depth on how it is used in Xen.
In the flowchart, two code paths are set in bold:
- The case where Host.numa_affinity_policy is the default (off) in xenopsd.
- xc_vcpu_setaffinity(XEN_VCPUAFFINITY_SOFT) in Xen, when the domain's auto_node_affinity flag is enabled (the default), to show how the vCPU affinity update also updates the domain's node_affinity in this default case.

xenguest uses the Xenstore to read the static domain configuration that it needs to build the domain.
flowchart TD subgraph VM.create["xenopsd VM.create"] %% Is xe vCPU-params:mask= set? If yes, write to Xenstore: is_xe_vCPUparams_mask_set?{" Is <tt>xe vCPU-params:mask=</tt> set? Example: <tt>1,2,3</tt> (Is used to enable vCPU<br>hard-affinity) "} --"yes"--> set_hard_affinity("Write hard-affinity to XenStore: <tt>platform/vcpu/#domid/affinity</tt> (xenguest will read this and other configuration data from Xenstore)") end subgraph VM.build["xenopsd VM.build"] %% Labels of the decision nodes is_Host.numa_affinity_policy_set?{ Is<p><tt>Host.numa_affinity_policy</tt><p>set?} has_hard_affinity?{ Is hard-affinity configured in <p><tt>platform/vcpu/#domid/affinity</tt>?} %% Connections from VM.create: set_hard_affinity --> is_Host.numa_affinity_policy_set? is_xe_vCPUparams_mask_set? == "no"==> is_Host.numa_affinity_policy_set? %% The Subgraph itself: %% Check Host.numa_affinity_policy is_Host.numa_affinity_policy_set? %% If Host.numa_affinity_policy is "best_effort": -- Host.numa_affinity_policy is<p><tt>best_effort --> %% If has_hard_affinity is set, skip numa_placement: has_hard_affinity? --"yes"-->exec_xenguest %% If has_hard_affinity is not set, run numa_placement: has_hard_affinity? --"no"-->numa_placement-->exec_xenguest %% If Host.numa_affinity_policy is off (default, for now), %% skip NUMA placement: is_Host.numa_affinity_policy_set? =="default: disabled"==> exec_xenguest end %% xenguest subgraph subgraph xenguest exec_xenguest ==> stub_xc_hvm_build("<tt>stub_xc_hvm_build()") ==> configure_vcpus("<tT>configure_vcpus()") %% Decision ==> set_hard_affinity?{" Is <tt>platform/<br>vcpu/#domid/affinity</tt> set?"} end %% do_domctl Hypercalls numa_placement --Set the NUMA placement using soft-affinity--> XEN_VCPUAFFINITY_SOFT("<tt>xc_vcpu_setaffinity(SOFT)") ==> do_domctl set_hard_affinity? --yes--> XEN_VCPUAFFINITY_HARD("<tt>xc_vcpu_setaffinity(HARD)") --> do_domctl xc_domain_node_setaffinity("<tt>xc_domain_node_setaffinity()</tt> and <tt>xc_domain_node_getaffinity()") <--> do_domctl %% Xen subgraph subgraph xen[Xen Hypervisor] subgraph domain_update_node_affinity["domain_update_node_affinity()"] domain_update_node_aff("<tt>domain_update_node_aff()") ==> check_auto_node{"Is domain's<br><tt>auto_node_affinity</tt><br>enabled?"} =="yes (default)"==>set_node_affinity_from_vcpu_affinities(" Calculate the domain's <tt>node_affinity</tt> mask from vCPU affinity (used for further NUMA memory allocation for the domain)") end do_domctl{"do_domctl()<br>op->cmd=?"} ==XEN_DOMCTL_setvcpuaffinity==> vcpu_set_affinity("<tt>vcpu_set_affinity()</tt><br>set the vCPU affinity") ==>domain_update_node_aff do_domctl --XEN_DOMCTL_setnodeaffinity (not used currently) -->is_new_affinity_all_nodes? subgraph domain_set_node_affinity["domain_set_node_affinity()"] is_new_affinity_all_nodes?{new_affinity<br>is #34;all#34;?} --is #34;all#34; --> enable_auto_node_affinity("<tt>auto_node_affinity=1") --> domain_update_node_aff is_new_affinity_all_nodes? --not #34;all#34; --> disable_auto_node_affinity("<tt>auto_node_affinity=0") --> domain_update_node_aff end %% setting and getting the struct domain's node_affinity: disable_auto_node_affinity --node_affinity=new_affinity--> domain_node_affinity set_node_affinity_from_vcpu_affinities ==> domain_node_affinity@{ shape: bow-rect,label: "domain: node_affinity" } --XEN_DOMCTL_getnodeaffinity--> do_domctl end click is_Host.numa_affinity_policy_set? 
"https://github.com/xapi-project/xen-api/blob/90ef043c1f3a3bc20f1c5d3ccaaf6affadc07983/ocaml/xenopsd/xc/domain.ml#L951-L962" click numa_placement "https://github.com/xapi-project/xen-api/blob/90ef043c/ocaml/xenopsd/xc/domain.ml#L862-L897" click stub_xc_hvm_build "https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L2329-L2436" _blank click get_flags "https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L1164-L1288" _blank click do_domctl "https://github.com/xen-project/xen/blob/7cf163879/xen/common/domctl.c#L282-L894" _blank click domain_set_node_affinity "https://github.com/xen-project/xen/blob/7cf163879/xen/common/domain.c#L943-L970" _blank click configure_vcpus "https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L1297-L1348" _blank click set_hard_affinity? "https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L1305-L1326" _blank click xc_vcpu_setaffinity "https://github.com/xen-project/xen/blob/7cf16387/tools/libs/ctrl/xc_domain.c#L199-L250" _blank click vcpu_set_affinity "https://github.com/xen-project/xen/blob/7cf16387/xen/common/sched/core.c#L1353-L1393" _blank click domain_update_node_aff "https://github.com/xen-project/xen/blob/7cf16387/xen/common/sched/core.c#L1809-L1876" _blank click check_auto_node "https://github.com/xen-project/xen/blob/7cf16387/xen/common/sched/core.c#L1840-L1870" _blank click set_node_affinity_from_vcpu_affinities "https://github.com/xen-project/xen/blob/7cf16387/xen/common/sched/core.c#L1867-L1869" _blank
[^1]: The VM parameter VCPUs-params:mask is documented in the official XAPI user documentation.
libxenguest
is a library written in C provided for the Xen Hypervisor in Dom0.
For example, it is used as the low-level interface for building Xen Guest domains.
Its source is located in the folder tools/libs/guest of the Xen repository.
One important responsibility of libxenguest
is creating the memory layout
of new and migrated VMs.
The boot memory setup of xenguest and of libxl (used by the xl CLI command) calls xc_dom_boot_mem_init(), which dispatches the call to meminit_hvm() and meminit_pv(). These lay out, allocate, and populate the boot memory of domains.
VM boot memory setup by calling meminit_hvm() or meminit_pv()
xenguest's hvm_build_setup_mem(), as well as libxl and the xl CLI, call xc_dom_boot_mem_init() to allocate and populate the domain's system memory for booting it:
flowchart LR subgraph libxl / xl CLI libxl__build_dom("libxl__build_dom()") end subgraph xenguest hvm_build_setup_mem("hvm_build_setup_mem()") end subgraph libxenctrl xc_domain_populate_physmap("One call for each memory range (extent): xc_domain_populate_physmap() xc_domain_populate_physmap() xc_domain_populate_physmap()") end subgraph libxenguest hvm_build_setup_mem & libxl__build_dom --> xc_dom_boot_mem_init("xc_dom_boot_mem_init()") xc_dom_boot_mem_init --> meminit_hvm("meminit_hvm()") & meminit_pv("meminit_pv()") --> xc_domain_populate_physmap end click xc_dom_boot_mem_init "https://github.com/xen-project/xen/blob/39c45c/tools/libs/guest/xg_dom_boot.c#L110-L126 " _blank click meminit_hvm "https://github.com/xen-project/xen/blob/39c45c/tools/libs/guest/xg_dom_x86.c#L1348-L1648 " _blank click meminit_pv "https://github.com/xen-project/xen/blob/de0254b9/tools/libs/guest/xg_dom_x86.c#L1183-L1333 " _blank
The allocation strategy of the called functions is: they use xc_domain_populate_physmap() to perform memory allocation and to map the allocated memory to the system RAM ranges of the domain.
xc_domain_populate_physmap()
calls the XENMEM_populate_physmap
command of the Xen memory hypercall.
For a more detailed walk-through of the inner workings of this hypercall, see the reference on xc_domain_populate_physmap().
For more details on the VM build step involving xenguest
and Xen side see:
https://wiki.xenproject.org/wiki/Walkthrough:_VM_build_using_xenguest
Find free memory based on the given flags and optionally, a domain
get_free_buddy() is called from alloc_heap_pages() to find a page at the most suitable place for a memory allocation.
It finds memory depending on the given flags and domain:
- the struct domain
- the memory zone (from zone_hi down to zone_lo)

Its first attempt is to find a page of matching page order on the requested NUMA node(s).
If this is not successful, it tries breaking higher page orders, and if that fails too, it lowers the zone until zone_lo. It does not attempt to use unscrubbed pages, but when memflags contains MEMF_no_scrub, it uses check_and_stop_scrub(pg) on 4k pages instead, to prevent breaking higher-order pages.
If this fails, it checks if other NUMA nodes shall be tried.
For example for vNUMA domains, the calling functions pass one specific
NUMA node, and they would also set MEMF_exact_node
to make sure that
memory is specifically only allocated from this NUMA node.
If no NUMA node was passed or the allocation from it failed, and
MEMF_exact_node
was not set in memflags
, the function falls
back to the first fallback, NUMA-affine allocation.
For local NUMA memory allocation, the domain should have one or more NUMA nodes in its struct domain->node_affinity field when this function is called. This happens as part of NUMA placement, which writes the planned vCPU affinity of the domain's vCPUs to the XenStore. xenguest reads it to update the vCPU affinities of the domain's vCPUs in Xen, which in turn, by default (when domain->auto_node_affinity is active), also updates the struct domain->node_affinity field.
Note: In case it contains multiple NUMA nodes, this step allocates from the next NUMA node after the previous NUMA node the domain allocated from, in a round-robin way.
Otherwise, the function falls back to host-wide round-robin allocation.
When the domain's node_affinity is not defined, or allocating from it did not succeed and MEMF_exact_node was not passed in memflags, all remaining NUMA nodes are attempted in a round-robin way: each subsequent call uses the next NUMA node after the previous node that the domain allocated memory from.
This flowchart shows an overview of the decision chain of get_free_buddy()
flowchart TD alloc_round_robin --No free memory on the host--> Failure node_affinity_exact --No free memory<br>on the Domain's node_affinity nodes:<br>Abort exact allocation--> Failure get_free_buddy["get_free_buddy()"] -->MEMF_node{memflags<br>&<br>MEMF_node?} --Yes--> try_MEMF_node{Alloc from node} --Success: page--> Success try_MEMF_node --No free memory on the node--> MEMF_exact{memflags & MEMF_exact?} --"No"--> node_affinity_set{NUMA affinity set?} -- Domain->node_affinity is not set: Fall back to round-robin allocation --> alloc_round_robin MEMF_exact --Yes: As there is not enough free memory on the exact NUMA node(s): Abort exact allocation -->Failure MEMF_node --No NUMA node in memflags--> node_affinity_set{domain-><br>node_affinity<br>set?} --Set--> node_affinity{Alloc from<br>node_affinity<br>nodes} --No free memory on the node_affinity nodes Check if exact request--> node_affinity_exact{memflags<br>&<br>MEMF_exact?} --Not exact: Fall back to<br>round-robin allocation--> alloc_round_robin node_affinity--Success: page-->Success alloc_round_robin{" Fall back to round-robin allocation"} --Success: page--> Success(Success: Return the page) click get_free_buddy "https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L855-L1116 " _blank
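The same decision chain, condensed into a standalone sketch (illustrative only, not the Xen implementation; the boolean parameters stand in for the real memflags and allocator state):

```c
#include <stdbool.h>

/* Illustrative condensation of get_free_buddy()'s decision chain. */
typedef enum { ALLOC_FAILURE, ALLOC_SUCCESS } alloc_outcome;

alloc_outcome get_free_buddy_decision(bool memflags_request_node, bool memflags_exact,
                                      bool requested_node_has_free,
                                      bool node_affinity_set,
                                      bool affinity_nodes_have_free,
                                      bool any_node_has_free)
{
    if (memflags_request_node) {              /* a specific NUMA node was requested */
        if (requested_node_has_free)
            return ALLOC_SUCCESS;             /* page taken from the requested node */
        if (memflags_exact)
            return ALLOC_FAILURE;             /* MEMF_exact_node: no fallback allowed */
    }

    if (node_affinity_set) {                  /* 1st fallback: d->node_affinity nodes */
        if (affinity_nodes_have_free)
            return ALLOC_SUCCESS;
        if (memflags_exact)
            return ALLOC_FAILURE;             /* exact request: abort instead of falling back */
    }

    /* 2nd fallback: host-wide round-robin over the remaining NUMA nodes. */
    return any_node_has_free ? ALLOC_SUCCESS : ALLOC_FAILURE;
}
```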