Subsections of libxenctrl
xc_domain_claim_pages()
Purpose
The purpose of xc_domain_claim_pages()
is to attempt to stake a claim on an amount of memory for a given domain
which guarantees that memory allocations for the claimed amount will succeed.
The domain can still attempt to allocate beyond the claim, but such allocations
are not guaranteed to succeed and will fail if the domain’s memory reaches its
max_mem
value.
The aim is to stake a claim for a domain on a quantity of pages
of system RAM, without assigning specific page frames.
The hypercall performs only arithmetic, so it is very fast and does not need to be preempted.
Thereby, it sidesteps any time-of-check/time-of-use races for memory allocation.
Usage notes
xc_domain_claim_pages()
returns 0 if the Xen page allocator has atomically
and successfully claimed the requested number of pages, else non-zero.
Info
Errors returned by xc_domain_claim_pages()
must be handled as they are a
normal result of the xenopsd
thread-pool claiming and starting many VMs
in parallel during a boot storm scenario.
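For illustration, a minimal sketch of such error handling, assuming the xenctrl.h prototype
int xc_domain_claim_pages(xc_interface *xch, uint32_t domid, unsigned long nr_pages);
the fallback policy hinted at in the comment is entirely up to the caller:

```c
#include <stdio.h>
#include <xenctrl.h>

/*
 * Minimal sketch (not the xenopsd implementation): stake a claim for the
 * domain's boot memory and surface a failure to the caller instead of
 * assuming that memory will be available later.
 */
static int claim_boot_memory(xc_interface *xch, uint32_t domid,
                             unsigned long nr_pages)
{
    int rc = xc_domain_claim_pages(xch, domid, nr_pages);

    if (rc != 0) {
        /* A failed claim is a normal outcome during a boot storm:
         * the caller has to back off, retry elsewhere, or report it. */
        fprintf(stderr, "claiming %lu pages for domain %u failed\n",
                nr_pages, domid);
    }
    return rc;
}
```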
Warning
This is especially important when staking claims on NUMA nodes using an updated
version of this function. In this case, the only option for the calling worker
thread is to adapt to the NUMA boot storm:
attempt to find a different NUMA node for claiming the memory and try again.
Design notes
For reference, see the hypercall comment in the Xen hypervisor header:
xen/include/public/memory.h
The design of the backing hypercall in Xen is as follows:
- A domain can only have one claim.
- Subsequent calls to the backing hypercall update the existing claim.
- The domain ID is the key of the claim.
- By killing the domain, the claim is also released.
- When sufficient memory has been allocated to resolve the claim,
the claim silently expires.
- Depending on the given size argument, the remaining claim of the domain
can be set initially, updated to the given amount, or reset.
- Claiming zero pages effectively resets any outstanding claim and
is always successful.
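As a consequence of the last two points, a caller can release an outstanding claim
explicitly by claiming zero pages. A minimal sketch, assuming the same
xc_domain_claim_pages() prototype as in the example above:

```c
#include <xenctrl.h>

/* Release any outstanding claim of the domain: claiming zero pages
 * resets the claim and is documented to always succeed. */
static void release_claim(xc_interface *xch, uint32_t domid)
{
    (void)xc_domain_claim_pages(xch, domid, 0);
}
```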
Users
To set up the boot memory of a new domain, the libxenguest function
xc_dom_boot_mem_init() relies on this design:
it is used by the memory setup of xenguest
and by the xl
CLI using libxl.
libxl
actively uses it to stake an initial claim before allocating memory.
Note
The functions
meminit_hvm()
and meminit_pv()
used by
xc_dom_boot_mem_init()
always destroy any remaining claim before they return.
Thus, after libxenguest has completed allocating and populating
the physical memory of the domain, no domain has a remaining claim!
Warning
From this point onwards, no memory allocation is guaranteed!
While swapping memory between domains can be expected to always succeed,
allocating a new page after freeing another one can fail at any time unless a
privileged domain stakes a claim for such allocations beforehand!
Implementation
The Xen upstream memory claims code is implemented to work as follows:
When allocating memory while a domain has an outstanding claim,
Xen reduces the remaining claim accordingly.
When memory of a domain that still holds a claim is freed,
the freed amount increases the domain’s outstanding claim again.
Note: This detail may have to change for implementing NUMA claims:
Xen doesn’t know if a page was allocated using a NUMA node claim.
Therefore, it cannot know if it would be legal to increment a NUMA
node claim when freeing pages.
Info
The smallest possible change to achieve a similar effect would be
to add a field to the Xen hypervisor’s domain struct.
It would be used for remembering the last total NUMA node claim.
With it, freeing memory from a NUMA node could attempt to increase
the outstanding claim on the claimed NUMA node, but only if this
amount is still available and not already claimed.
Management of claims
- The stake is centrally managed by the Xen hypervisor using a
Hypercall.
- Claims are not reflected in the amount of free memory reported by Xen.
Reporting of claims
xl claims
reports the outstanding claims of the domains:
Sample output of xl claims:
Name ID Mem VCPUs State Time(s) Claimed
Domain-0 0 2656 8 r----- 957418.2 0
xl info
reports the host-wide outstanding claims:
Sample output from xl info | grep outstanding:
Tracking of claims
Xen only tracks:
- the outstanding claims of each domain and
- the outstanding host-wide claims.
Claiming zero pages effectively cancels the domain’s outstanding claim
and is always successful.
Info
- Allocations for outstanding claims are expected to always be successful.
- But each allocation reduces the domain’s outstanding claim.
- Freeing memory of the domain increases the domain’s claim again:
- But, once a domain has fully consumed its claim, the claim is reset.
- When the claim is reset, freed memory is no longer added to the outstanding claims!
- The domain would have to stake a new claim to have guaranteed spare memory again.
The domain’s max_mem
value is used to deny memory allocation:
If an allocation would cause the domain to exceed its max_mem
value, it will always fail.
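For illustration, a sketch of how a toolstack could combine the two limits, assuming the
xenctrl.h prototypes int xc_domain_setmaxmem(xc_interface *xch, uint32_t domid, uint64_t max_memkb)
and int xc_domain_claim_pages(xc_interface *xch, uint32_t domid, unsigned long nr_pages);
the ordering shown is only an example policy:

```c
#include <stdint.h>
#include <xenctrl.h>

/* Set the domain's max_mem limit, then stake a claim for its boot memory.
 * Allocations beyond max_mem fail even while a claim is still outstanding. */
static int limit_and_claim(xc_interface *xch, uint32_t domid,
                           uint64_t max_memkb, unsigned long boot_pages)
{
    int rc = xc_domain_setmaxmem(xch, domid, max_memkb);

    if (rc == 0)
        rc = xc_domain_claim_pages(xch, domid, boot_pages);
    return rc;
}
```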
Hypercall API
Function signature of the libXenCtrl function to call the Xen hypercall:
long xc_memory_op(libxc_handle, XENMEM_claim_pages, struct xen_memory_reservation *)
struct xen_memory_reservation
is populated for this call as follows:
struct xen_memory_reservation {
.nr_extents = nr_pages, /* number of pages to claim */
.extent_order = 0, /* an order 0 means: 4k pages, only 0 is allowed */
.mem_flags = 0, /* no flags, only 0 is allowed (at the moment) */
.domid = domid /* numerical domain ID of the domain */
};
Current users
It is used by libxenguest, which is used at least by:
libxl and the xl CLI
The xl
CLI uses claims actively: by default, it enables libxl
to pass
the struct xc_dom_image
with its claim_enabled
field set.
The functions dispatched by xc_dom_boot_mem_init()
then attempt to claim the boot memory using xc_domain_claim_pages().
They also (unconditionally) destroy any open claim upon return.
This means that in case the claim fails, xl
avoids:
- The effort of allocating the memory, thereby not blocking it for other domains.
- The effort of potentially needing to scrub the memory after the build failure.
Updates: Improved NUMA memory allocation
Enablement of a NUMA node claim instead of a host-wide claim
With a
proposed update
of xc_domain_claim_pages()
for NUMA node claims, a node argument is added.
It can be XC_NUMA_NO_NODE
for defining a host-wide claim or a NUMA node
for staking a claim on one NUMA node.
This update does not change the foundational design of memory claims of
the Xen hypervisor where a claim is defined as a single claim for the domain.
For improved support for NUMA, xenopsd
is updated to call an updated version of this function for the domain.
It reserves NUMA node memory before xenguest
is called, and a new pnode
argument is added to xenguest.
This argument sets the NUMA node for memory allocations by the xenguest
boot memory setup
xc_dom_boot_mem_init(), which also
sets the exact
flag.
The exact
flag forces get_free_buddy() to fail
if it cannot find scrubbed memory for the given pnode,
which causes
alloc_heap_pages()
to re-run get_free_buddy() with the flag to
fall back to dirty, not yet scrubbed memory.
alloc_heap_pages() then
checks each 4k page for the need to scrub and scrubs those before returning them.
This is expected to improve two issues: memory potentially being spread over
all NUMA nodes during parallel boots (by preventing one NUMA node from becoming
the target of parallel memory allocations which cannot fit on it), and memory
being spread over all NUMA nodes when a huge amount of dirty memory needs to
be scrubbed before a domain can start or restart.
NUMA grant tables and ballooning
This is not seen as an issue as grant tables and I/O pages are usually not as
frequently used as regular system memory of domains, but this potential issue
remains:
In discussions, it was said that Windows PV drivers unmap
and free
memory
for grant tables to Xen and then re-allocate memory for those grant tables.
xenopsd
may want to try to stake a very small claim for the domain on the
NUMA node of the domain so that Xen can increase this claim when the PV drivers
free
this memory and re-use the resulting claimed amount for allocating
the grant tables.
This would ensure that the grant tables are then allocated on the local NUMA
node of the domain, avoiding remote memory accesses when accessing the grant
tables from inside the domain.
Note: In case the corresponding backend process in Dom0 is running on another
NUMA node, it would access the domain’s grant tables from a remote NUMA node,
but this would enable a future improvement for Dom0, where it could prefer to
run the corresponding backend process on the same or a neighbouring NUMA node.
xc_domain_node_setaffinity()
xc_domain_node_setaffinity()
controls the NUMA node affinity of a domain,
but it only updates the Xen hypervisor domain’s d->node_affinity
mask.
This mask is read by the Xen memory allocator as the 2nd preference for the
NUMA node to allocate memory from for this domain.
Preferences of the Xen memory allocator:
- A NUMA node passed to the allocator directly takes precedence, if present.
- Then, if the allocation is for a domain, its
node_affinity
mask is tried.
- Finally, it falls back to spreading the pages over all remaining NUMA nodes.
As this call has no practical effect on the Xen scheduler, vCPU affinities
need to be set separately anyway.
The domain’s auto_node_affinity
flag is enabled by default by Xen. This means
that when setting vCPU affinities, Xen updates the d->node_affinity
mask
to consist of the NUMA nodes to which its vCPUs have affinity.
See xc_vcpu_setaffinity() for more information
on how d->auto_node_affinity
is used to set the NUMA node affinity.
Thus, so far, there is no obvious need to call xc_domain_node_setaffinity()
when building a domain.
Setting the NUMA node affinity using this call can be useful,
for example, when there might not be enough memory on the
preferred NUMA node, but there are other NUMA nodes that have
enough free memory to be used for the system memory of the domain.
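For illustration, a minimal sketch of such a call, assuming the xenctrl.h prototype
int xc_domain_node_setaffinity(xc_interface *xch, uint32_t domid, xc_nodemap_t nodemap),
the xc_nodemap_alloc() helper, and that xc_nodemap_t is a byte array with one bit per
NUMA node; steering the domain to a single node is only an example policy:

```c
#include <stdlib.h>
#include <xenctrl.h>

/* Prefer one NUMA node for the domain's future memory allocations. */
static int set_domain_numa_node(xc_interface *xch, uint32_t domid,
                                unsigned int node)
{
    int rc = -1;
    xc_nodemap_t nodemap = xc_nodemap_alloc(xch); /* zeroed node bitmap */

    if (!nodemap)
        return -1;

    nodemap[node / 8] |= 1 << (node % 8); /* mark the preferred pNode */

    /* Updates d->node_affinity and disables d->auto_node_affinity, so
     * later vCPU affinity changes no longer override this setting. */
    rc = xc_domain_node_setaffinity(xch, domid, nodemap);

    free(nodemap);
    return rc;
}
```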
In terms of future NUMA design, it might be even more favourable to
have a strategy in xenguest
where in such cases, the superpages
of the preferred node are used first and a fallback to neighbouring
NUMA nodes only happens to the extent necessary.
Likely, the future allocation strategy should be passed to xenguest
using Xenstore like the other platform parameters for the VM.
Walk-through of xc_domain_node_setaffinity()
classDiagram
class `xc_domain_node_setaffinity()` {
+xch: xc_interface #42;
+domid: uint32_t
+nodemap: xc_nodemap_t
0(on success)
-EINVAL(if a node in the nodemask is not online)
}
click `xc_domain_node_setaffinity()` href "
https://github.com/xen-project/xen/blob/master/tools/libs/ctrl/xc_domain.c#L122-L158"
`xc_domain_node_setaffinity()` --> `Xen hypercall: do_domctl()`
`xc_domain_node_setaffinity()` <-- `Xen hypercall: do_domctl()`
class `Xen hypercall: do_domctl()` {
Calls domain_set_node_affinity#40;#41; and returns its return value
Passes: domain (struct domain *, looked up using the domid)
Passes: new_affinity (nodemask, converted from xc_nodemap_t)
}
click `Xen hypercall: do_domctl()` href "
https://github.com/xen-project/xen/blob/master/xen/common/domctl.c#L516-L525"
`Xen hypercall: do_domctl()` --> `domain_set_node_affinity()`
`Xen hypercall: do_domctl()` <-- `domain_set_node_affinity()`
class `domain_set_node_affinity()` {
domain: struct domain
new_affinity: nodemask
0(on success, the domain's node_affinity is updated)
-EINVAL(if a node in the nodemask is not online)
}
click `domain_set_node_affinity()` href "
https://github.com/xen-project/xen/blob/master/xen/common/domain.c#L943-L970"
domain_set_node_affinity()
This function implements the functionality of xc_domain_node_setaffinity
to set the NUMA affinity of a domain as described above.
If the new_affinity does not intersect the node_online_map,
it returns -EINVAL.
Otherwise, the result is a success, and it returns 0.
When the new_affinity
is a specific set of NUMA nodes, it updates the NUMA
node_affinity
of the domain to these nodes and disables d->auto_node_affinity
for this domain. With d->auto_node_affinity
disabled,
xc_vcpu_setaffinity() no longer updates the NUMA affinity
of this domain.
If new_affinity
has all bits set, it re-enables the d->auto_node_affinity
for this domain and calls
domain_update_node_aff()
to re-set the domain’s node_affinity
mask to the NUMA nodes of the current
hard and soft affinities of the domain’s online vCPUs.
Flowchart in relation to xc_vcpu_setaffinity()
The effect of domain_set_node_affinity()
can be seen more clearly on this
flowchart, which shows how xc_vcpu_setaffinity()
is currently used to set
the NUMA affinity of a new domain, but also shows how domain_set_node_affinity()
relates to it:
In the flowchart, two code paths are set in bold:
- The path when
Host.numa_affinity_policy
is the default (off) in xenopsd.
- The default path of
xc_vcpu_setaffinity(XEN_VCPUAFFINITY_SOFT)
in Xen,
when the domain’s auto_node_affinity
flag is enabled (default), showing
how the vCPU affinity update also updates the domain’s node_affinity
in this default case.
xenguest uses the Xenstore
to read the static domain configuration that it needs to build the domain.
flowchart TD
subgraph VM.create["xenopsd VM.create"]
%% Is xe vCPU-params:mask= set? If yes, write to Xenstore:
is_xe_vCPUparams_mask_set?{"
Is
<tt>xe vCPU-params:mask=</tt>
set? Example: <tt>1,2,3</tt>
(Is used to enable vCPU<br>hard-affinity)
"} --"yes"--> set_hard_affinity("Write hard-affinity to XenStore:
<tt>platform/vcpu/#domid/affinity</tt>
(xenguest will read this and other configuration data
from Xenstore)")
end
subgraph VM.build["xenopsd VM.build"]
%% Labels of the decision nodes
is_Host.numa_affinity_policy_set?{
Is<p><tt>Host.numa_affinity_policy</tt><p>set?}
has_hard_affinity?{
Is hard-affinity configured in <p><tt>platform/vcpu/#domid/affinity</tt>?}
%% Connections from VM.create:
set_hard_affinity --> is_Host.numa_affinity_policy_set?
is_xe_vCPUparams_mask_set? == "no"==> is_Host.numa_affinity_policy_set?
%% The Subgraph itself:
%% Check Host.numa_affinity_policy
is_Host.numa_affinity_policy_set?
%% If Host.numa_affinity_policy is "best_effort":
-- Host.numa_affinity_policy is<p><tt>best_effort -->
%% If has_hard_affinity is set, skip numa_placement:
has_hard_affinity?
--"yes"-->exec_xenguest
%% If has_hard_affinity is not set, run numa_placement:
has_hard_affinity?
--"no"-->numa_placement-->exec_xenguest
%% If Host.numa_affinity_policy is off (default, for now),
%% skip NUMA placement:
is_Host.numa_affinity_policy_set?
=="default: disabled"==>
exec_xenguest
end
%% xenguest subgraph
subgraph xenguest
exec_xenguest
==> stub_xc_hvm_build("<tt>stub_xc_hvm_build()")
==> configure_vcpus("<tt>configure_vcpus()")
%% Decision
==> set_hard_affinity?{"
Is <tt>platform/<br>vcpu/#domid/affinity</tt>
set?"}
end
%% do_domctl Hypercalls
numa_placement
--Set the NUMA placement using soft-affinity-->
XEN_VCPUAFFINITY_SOFT("<tt>xc_vcpu_setaffinity(SOFT)")
==> do_domctl
set_hard_affinity?
--yes-->
XEN_VCPUAFFINITY_HARD("<tt>xc_vcpu_setaffinity(HARD)")
--> do_domctl
xc_domain_node_setaffinity("<tt>xc_domain_node_setaffinity()</tt>
and
<tt>xc_domain_node_getaffinity()")
<--> do_domctl
%% Xen subgraph
subgraph xen[Xen Hypervisor]
subgraph domain_update_node_affinity["domain_update_node_affinity()"]
domain_update_node_aff("<tt>domain_update_node_aff()")
==> check_auto_node{"Is domain's<br><tt>auto_node_affinity</tt><br>enabled?"}
=="yes (default)"==>set_node_affinity_from_vcpu_affinities("
Calculate the domain's <tt>node_affinity</tt> mask from vCPU affinity
(used for further NUMA memory allocation for the domain)")
end
do_domctl{"do_domctl()<br>op->cmd=?"}
==XEN_DOMCTL_setvcpuaffinity==>
vcpu_set_affinity("<tt>vcpu_set_affinity()</tt><br>set the vCPU affinity")
==>domain_update_node_aff
do_domctl
--XEN_DOMCTL_setnodeaffinity (not used currently)
-->is_new_affinity_all_nodes?
subgraph domain_set_node_affinity["domain_set_node_affinity()"]
is_new_affinity_all_nodes?{new_affinity<br>is #34;all#34;?}
--is #34;all#34;
--> enable_auto_node_affinity("<tt>auto_node_affinity=1")
--> domain_update_node_aff
is_new_affinity_all_nodes?
--not #34;all#34;
--> disable_auto_node_affinity("<tt>auto_node_affinity=0")
--> domain_update_node_aff
end
%% setting and getting the struct domain's node_affinity:
disable_auto_node_affinity
--node_affinity=new_affinity-->
domain_node_affinity
set_node_affinity_from_vcpu_affinities
==> domain_node_affinity@{ shape: bow-rect,label: "domain: node_affinity" }
--XEN_DOMCTL_getnodeaffinity--> do_domctl
end
click is_Host.numa_affinity_policy_set?
"https://github.com/xapi-project/xen-api/blob/90ef043c1f3a3bc20f1c5d3ccaaf6affadc07983/ocaml/xenopsd/xc/domain.ml#L951-L962"
click numa_placement
"https://github.com/xapi-project/xen-api/blob/90ef043c/ocaml/xenopsd/xc/domain.ml#L862-L897"
click stub_xc_hvm_build
"https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L2329-L2436" _blank
click get_flags
"https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L1164-L1288" _blank
click do_domctl
"https://github.com/xen-project/xen/blob/7cf163879/xen/common/domctl.c#L282-L894" _blank
click domain_set_node_affinity
"https://github.com/xen-project/xen/blob/7cf163879/xen/common/domain.c#L943-L970" _blank
click configure_vcpus
"https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L1297-L1348" _blank
click set_hard_affinity?
"https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L1305-L1326" _blank
click xc_vcpu_setaffinity
"https://github.com/xen-project/xen/blob/7cf16387/tools/libs/ctrl/xc_domain.c#L199-L250" _blank
click vcpu_set_affinity
"https://github.com/xen-project/xen/blob/7cf16387/xen/common/sched/core.c#L1353-L1393" _blank
click domain_update_node_aff
"https://github.com/xen-project/xen/blob/7cf16387/xen/common/sched/core.c#L1809-L1876" _blank
click check_auto_node
"https://github.com/xen-project/xen/blob/7cf16387/xen/common/sched/core.c#L1840-L1870" _blank
click set_node_affinity_from_vcpu_affinities
"https://github.com/xen-project/xen/blob/7cf16387/xen/common/sched/core.c#L1867-L1869" _blank
xc_domain_node_setaffinity
can be used to set the domain’s node_affinity
(which is normally set by xc_vcpu_setaffinity
) to different NUMA nodes.
No effect on the Xen scheduler
Currently, the node affinity does not affect the Xen scheduler:
In case d->node_affinity
would be set before vCPU creation, the initial pCPU
of the new vCPU is the first pCPU of the first NUMA node in the domain’s
node_affinity.
This is further changed when one or more cpupools
are set up.
As this is only the initial pCPU of the vCPU, this alone does not change the
scheduling of the Xen Credit scheduler, as it reschedules vCPUs to other pCPUs.
Notes on future design improvements
It may be possible to call it before vCPUs are created
When done early, before vCPU creation, some domain-related data structures
could be allocated using the domain’s d->node_affinity
NUMA node mask.
With further changes in Xen and xenopsd,
Xen could allocate the vCPU structs
on the affine NUMA nodes of the domain.
For this, xenopsd
would have to call xc_domain_node_setaffinity()
before vCPU creation, after having decided the domain’s NUMA placement,
preferably including claiming the required memory for the domain to ensure
that the domain will be populated from the same NUMA node(s).
This call cannot influence the past: The xenopsd
VM_create
micro-op calls Xenctrl.domain_create.
It currently creates
the domain’s data structures before numa_placement
is done.
Improving Xenctrl.domain_create
to pass a NUMA node
for allocating the Hypervisor’s data structures (e.g. vCPU)
of the domain would require changes
to the Xen hypervisor and to the xenopsd
VM_create
micro-op.
xc_domain_populate_physmap()
Overview
xenguest uses
xc_domain_populate_physmap() and xc_domain_populate_physmap_exact()
to populate a Xen domain’s physical memory map:
Both call the XENMEM_populate_physmap
hypercall.
xc_domain_populate_physmap_exact()
also sets the “exact” flag
for allocating memory only on the given NUMA node.
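For illustration, a minimal sketch that populates a single 4k page at a given guest pfn,
assuming the xenctrl.h prototype int xc_domain_populate_physmap_exact(xc_interface *xch,
uint32_t domid, unsigned long nr_extents, unsigned int extent_order, unsigned int mem_flags,
xen_pfn_t *extent_start); requesting a specific NUMA node via mem_flags is shown only
as a commented assumption:

```c
#include <xenctrl.h>

/* Populate one 4k page of guest-physical memory at gpfn for the domain. */
static int populate_one_page(xc_interface *xch, uint32_t domid,
                             xen_pfn_t gpfn)
{
    xen_pfn_t extent = gpfn;
    unsigned int mem_flags = 0;

    /* To steer the allocation to a physical NUMA node, flags such as
     *   XENMEMF_node(pnode) | XENMEMF_exact_node_request
     * from the Xen public memory.h could be passed here instead. */

    /* nr_extents = 1, extent_order = 0: a single 4k page */
    return xc_domain_populate_physmap_exact(xch, domid, 1, 0,
                                            mem_flags, &extent);
}
```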
This is a very simplified overview of the hypercall:
flowchart LR
subgraph hypercall handlers
populate_physmap("<tt>populate_physmap()</tt>
One call for each memory
range (extent)")
end
subgraph "Xen buddy allocator:"
populate_physmap
--> alloc_domheap_pages("<tt>alloc_domheap_pages()</tt>
Assign allocated pages to
the domain")
alloc_domheap_pages
--> alloc_heap_pages("<tt>alloc_heap_pages()</tt>
If needed: split high-order
pages into smaller buddies,
and scrub dirty pages")
--> get_free_buddy("<tt>get_free_buddy()</tt>
If requested: Allocate from a
preferred/exact NUMA node
and/or from
unscrubbed memory
")
end
click populate_physmap
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L159-L314
" _blank
click alloc_domheap_pages
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L2641-L2697
" _blank
click get_free_buddy
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L855-L958
" _blank
click alloc_heap_pages
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L967-L1116
" _blank
memory_op(XENMEM_populate_physmap)
It calls
construct_memop_from_reservation()
to convert the arguments for allocating a page from
struct xen_memory_reservation
to struct memop_args
and
passes
it to
populate_physmap():
construct_memop_from_reservation()
It populates struct memop_args
using the
hypercall arguments. It:
- Copies
extent_start,
nr_extents,
and extent_order.
- Populates
memop_args->memflags
using the given mem_flags.
Converting a vNODE to a pNODE for vNUMA
When a vNUMA vnode is passed using XENMEMF_vnode
, and domain->vnuma
and domain->vnuma->nr_vnodes
are set, and the
vnode
(virtual NUMA node) maps to a pnode
(physical NUMA node), it also:
- Populates the
pnode
in the memflags
of the struct memop_args
- and sets a
XENMEMF_exact_node_request
in them as well.
propagate_node()
If no vNUMA node is passed, construct_memop_from_reservation()
calls
propagate_node()
to propagate the NUMA node and XENMEMF_exact_node_request
for use in Xen.
populate_physmap()
It handles hypercall preemption and resumption after a preemption, and keeps track of
the already populated pages.
For each range (extent), it runs an iteration of the
allocation loop.
It
passes
- the struct domain
- the page order
- the count of remaining pages to populate
- and the converted memflags
to
alloc_domheap_pages():
alloc_domheap_pages()
It
calls
alloc_heap_pages()
and on success, assigns the allocated pages to the domain.
alloc_heap_pages()
It
calls
get_free_buddy() to allocate a page at the most suitable place:
When no pages of the requested size are free,
it splits larger superpages into pages of the requested size.
get_free_buddy()
It finds memory based on the flags and domain and returns its page struct:
- Optionally prefer to allocate from a passed NUMA node
- Optionally allocate from the domain’s next affine NUMA node (round-robin)
- Optionally fail and return if the preferred NUMA allocation did not succeed
- Optionally allocate from not-yet-scrubbed memory
- Optionally allocate from the given range of memory zones
- Fall back to allocating from the next NUMA node on the system (round-robin)
For details, see the full flowchart below.
Full flowchart
flowchart TD
subgraph XenCtrl
xc_domain_populate_physmap["<tt>xc_domain_populate_physmap()"]
xc_domain_populate_physmap_exact["<tt>xc_domain_populate_physmap_exact()"]
end
subgraph Xen
%% sub-subgraph from memory_op() to populate_node() and back
xc_domain_populate_physmap & xc_domain_populate_physmap_exact
<--reservation,<br>and for preempt:<br>nr_start/nr_done-->
memory_op("<tt>memory_op(XENMEM_populate_physmap)")
memory_op
--struct xen_memory_reservation-->
construct_memop_from_reservation("<tt>construct_memop_from_reservation()")
--struct<br>xen_memory_reservation->mem_flags-->
propagate_node("<tt>propagate_node()")
--_struct<br>memop_args->memflags_-->
construct_memop_from_reservation
--_struct memop_args_-->
memory_op<--struct memop_args *:
struct domain *,
List of extent base addrs,
Number of extents,
Size of each extent (extent_order),
Allocation flags(memflags)-->
populate_physmap[["<tt>populate_physmap()"]]
<-.domain, extent base addrs, extent size, memflags, nr_start and nr_done.->
populate_physmap_loop--if memflags & MEMF_populate_on_demand -->guest_physmap_mark_populate_on_demand("
<tt>guest_physmap_mark_populate_on_demand()")
populate_physmap_loop@{ label: "While extents to populate,
and not asked to preempt,
for each extent left to do:", shape: notch-pent }
--domain, order, memflags-->
alloc_domheap_pages("<tt>alloc_domheap_pages()")
--zone_lo, zone_hi, order, memflags, domain-->
alloc_heap_pages
--zone_lo, zone_hi, order, memflags, domain-->
get_free_buddy("<tt>get_free_buddy()")
--_page_info_
-->alloc_heap_pages
--if no page-->
no_scrub("<tt>get_free_buddy(MEMF_no_scrub)</tt>
(honored only when order==0)")
--_dirty 4k page_
-->alloc_heap_pages
<--_dirty 4k page_-->
scrub_one_page("<tt>scrub_one_page()")
alloc_heap_pages("<tt>alloc_heap_pages()</tt>
(also splits higher-order pages
into smaller buddies if needed)")
--_page_info_
-->alloc_domheap_pages
--page_info, order, domain, memflags-->assign_page("<tt>assign_page()")
assign_page
--page_info, nr_mfns, domain, memflags-->
assign_pages("<tt>assign_pages()")
--domain, nr_mfns-->
domain_adjust_tot_pages("<tt>domain_adjust_tot_pages()")
alloc_domheap_pages
--_page_info_-->
populate_physmap_loop
--page(gpfn, mfn, extent_order)-->
guest_physmap_add_page("<tt>guest_physmap_add_page()")
populate_physmap--nr_done, preempted-->memory_op
end
click memory_op
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L1409-L1425
" _blank
click construct_memop_from_reservation
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L1022-L1071
" _blank
click propagate_node
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L524-L547
" _blank
click populate_physmap
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L159-L314
" _blank
click populate_physmap_loop
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/memory.c#L197-L304
" _blank
click guest_physmap_mark_populate_on_demand
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L210-220
" _blank
click guest_physmap_add_page
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L296
" _blank
click alloc_domheap_pages
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L2641-L2697
" _blank
click alloc_heap_pages
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L967-L1116
" _blank
click get_free_buddy
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L855-L958
" _blank
click assign_page
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L2540-L2633
" _blank
click assign_pages
"https://github.com/xen-project/xen/blob/e16acd80/xen/common/page_alloc.c#L2635-L2639
" _blank
xc_vcpu_setaffinity()
Introduction
In the Xen hypervisor, each vCPU has:
A soft affinity: the list of pCPUs where a vCPU prefers to run.
It can be used to make vCPUs prefer to run on a set of pCPUs,
for example the pCPUs of a NUMA node, but in case those are already busy,
the Credit scheduler can still ignore the soft-affinity.
A typical use case for this is NUMA machines, where the soft affinity
for the vCPUs of a domain should be set equal to the pCPUs of the NUMA node
where the domain’s memory shall be placed.
See the description of the NUMA feature
for more details.
A hard affinity, also known as pinning: the list of pCPUs where a vCPU is allowed to run.
Hard affinity is currently not used for NUMA placement, but can be configured
manually for a given domain, either using xe VCPUs-params:mask=
or the API.
For example, the vCPU’s pinning can be configured for a VM with:
xe vm-param-set uuid=<template_uuid> vCPUs-params:mask=1,2,3
There are also host-level guest_VCPUs_params
which are used by
host-cpu-tune
to exclusively pin Dom0 and guests (i.e. so that their
pCPUs never overlap). Note: This isn’t currently supported by the
NUMA code: It could happen that the NUMA placement picks a node that
has reduced capacity or is unavailable due to the host mask that
host-cpu-tune
has set.
Purpose
The libxenctrl library call xc_vcpu_setaffinity()
controls the pCPU affinity of the given vCPU.
xenguest
uses it when building domains if
xenopsd
added vCPU affinity information to the XenStore platform data path
platform/vcpu/#domid/affinity
of the domain.
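For illustration, a minimal sketch of setting a vCPU’s soft affinity, assuming the
xenctrl.h prototype int xc_vcpu_setaffinity(xc_interface *xch, uint32_t domid, int vcpu,
xc_cpumap_t cpumap_hard_inout, xc_cpumap_t cpumap_soft_inout, uint32_t flags), the
xc_cpumap_alloc() helper, and the XEN_VCPUAFFINITY_SOFT flag from the Xen public headers;
treating the hard map purely as an in/out buffer here is an assumption of this sketch:

```c
#include <stdlib.h>
#include <xenctrl.h>

/* Set the soft affinity of one vCPU to a set of pCPUs, for example the
 * pCPUs of the NUMA node chosen for the domain's memory. */
static int set_vcpu_soft_affinity(xc_interface *xch, uint32_t domid,
                                  int vcpu, const unsigned int *pcpus,
                                  unsigned int nr_pcpus)
{
    int rc = -1;
    unsigned int i;
    xc_cpumap_t hard = xc_cpumap_alloc(xch); /* returned effective map */
    xc_cpumap_t soft = xc_cpumap_alloc(xch); /* requested soft affinity */

    if (!hard || !soft)
        goto out;

    for (i = 0; i < nr_pcpus; i++)
        soft[pcpus[i] / 8] |= 1 << (pcpus[i] % 8);

    /* Only the soft affinity is applied here. With the domain's
     * auto_node_affinity flag enabled (the default), Xen also updates
     * the domain's node_affinity mask from the vCPU affinities. */
    rc = xc_vcpu_setaffinity(xch, domid, vcpu, hard, soft,
                             XEN_VCPUAFFINITY_SOFT);
out:
    free(hard);
    free(soft);
    return rc;
}
```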
Updating the NUMA node affinity of a domain
Besides that, xc_vcpu_setaffinity()
can also modify the NUMA node
affinity of the Xen domain:
When Xen creates a domain, it enables the domain’s d->auto_node_affinity
feature flag.
When it is enabled, setting the vCPU affinity also updates the NUMA node
affinity which is used for memory allocations for the domain:
Simplified flowchart
flowchart TD
subgraph libxenctrl
xc_vcpu_setaffinity("<tt>xc_vcpu_setaffinity()")--hypercall-->xen
end
subgraph xen[Xen Hypervisor]
direction LR
vcpu_set_affinity("<tt>vcpu_set_affinity()</tt><br>set the vCPU affinity")
-->check_auto_node{"Is the domain's<br><tt>auto_node_affinity</tt><br>enabled?"}
--"yes<br>(default)"-->
auto_node_affinity("Set the<br>domain's<br><tt>node_affinity</tt>
mask as well<br>(used for further<br>NUMA memory<br>allocation)")
click xc_vcpu_setaffinity
"https://github.com/xen-project/xen/blob/7cf16387/tools/libs/ctrl/xc_domain.c#L199-L250" _blank
click vcpu_set_affinity
"https://github.com/xen-project/xen/blob/7cf16387/xen/common/sched/core.c#L1353-L1393" _blank
click domain_update_node_aff
"https://github.com/xen-project/xen/blob/7cf16387/xen/common/sched/core.c#L1809-L1876" _blank
click check_auto_node
"https://github.com/xen-project/xen/blob/7cf16387/xen/common/sched/core.c#L1840-L1870" _blank
click auto_node_affinity
"https://github.com/xen-project/xen/blob/7cf16387/xen/common/sched/core.c#L1867-L1869" _blank
end
Current use by xenopsd and xenguest
When Host.numa_affinity_policy
is set to
best_effort,
xenopsd attempts NUMA node placement
when building new VMs and instructs
xenguest
to set the vCPU affinity of the domain.
With the domain’s auto_node_affinity
flag enabled by default in Xen,
this automatically also sets the d->node_affinity
mask of the domain.
This then causes the Xen memory allocator to prefer the NUMA nodes in the
d->node_affinity
NUMA node mask when allocating memory.
That is, (for completeness) unless Xen’s allocation function
alloc_heap_pages()
receives a specific NUMA node in its memflags
argument when called.
See xc_domain_node_setaffinity() for more
information about another way to set the node_affinity
NUMA node mask
of Xen domains and more depth on how it is used in Xen.
Flowchart of its current use for NUMA affinity
In the flowchart, two code paths are set in bold:
- The path when
Host.numa_affinity_policy
is the default (off) in xenopsd.
- The default path of
xc_vcpu_setaffinity(XEN_VCPUAFFINITY_SOFT)
in Xen,
when the domain’s auto_node_affinity
flag is enabled (default), showing
how the vCPU affinity update also updates the domain’s node_affinity
in this default case.
xenguest uses the Xenstore
to read the static domain configuration that it needs to build the domain.
flowchart TD
subgraph VM.create["xenopsd VM.create"]
%% Is xe vCPU-params:mask= set? If yes, write to Xenstore:
is_xe_vCPUparams_mask_set?{"
Is
<tt>xe vCPU-params:mask=</tt>
set? Example: <tt>1,2,3</tt>
(Is used to enable vCPU<br>hard-affinity)
"} --"yes"--> set_hard_affinity("Write hard-affinity to XenStore:
<tt>platform/vcpu/#domid/affinity</tt>
(xenguest will read this and other configuration data
from Xenstore)")
end
subgraph VM.build["xenopsd VM.build"]
%% Labels of the decision nodes
is_Host.numa_affinity_policy_set?{
Is<p><tt>Host.numa_affinity_policy</tt><p>set?}
has_hard_affinity?{
Is hard-affinity configured in <p><tt>platform/vcpu/#domid/affinity</tt>?}
%% Connections from VM.create:
set_hard_affinity --> is_Host.numa_affinity_policy_set?
is_xe_vCPUparams_mask_set? == "no"==> is_Host.numa_affinity_policy_set?
%% The Subgraph itself:
%% Check Host.numa_affinity_policy
is_Host.numa_affinity_policy_set?
%% If Host.numa_affinity_policy is "best_effort":
-- Host.numa_affinity_policy is<p><tt>best_effort -->
%% If has_hard_affinity is set, skip numa_placement:
has_hard_affinity?
--"yes"-->exec_xenguest
%% If has_hard_affinity is not set, run numa_placement:
has_hard_affinity?
--"no"-->numa_placement-->exec_xenguest
%% If Host.numa_affinity_policy is off (default, for now),
%% skip NUMA placement:
is_Host.numa_affinity_policy_set?
=="default: disabled"==>
exec_xenguest
end
%% xenguest subgraph
subgraph xenguest
exec_xenguest
==> stub_xc_hvm_build("<tt>stub_xc_hvm_build()")
==> configure_vcpus("<tt>configure_vcpus()")
%% Decision
==> set_hard_affinity?{"
Is <tt>platform/<br>vcpu/#domid/affinity</tt>
set?"}
end
%% do_domctl Hypercalls
numa_placement
--Set the NUMA placement using soft-affinity-->
XEN_VCPUAFFINITY_SOFT("<tt>xc_vcpu_setaffinity(SOFT)")
==> do_domctl
set_hard_affinity?
--yes-->
XEN_VCPUAFFINITY_HARD("<tt>xc_vcpu_setaffinity(HARD)")
--> do_domctl
xc_domain_node_setaffinity("<tt>xc_domain_node_setaffinity()</tt>
and
<tt>xc_domain_node_getaffinity()")
<--> do_domctl
%% Xen subgraph
subgraph xen[Xen Hypervisor]
subgraph domain_update_node_affinity["domain_update_node_affinity()"]
domain_update_node_aff("<tt>domain_update_node_aff()")
==> check_auto_node{"Is domain's<br><tt>auto_node_affinity</tt><br>enabled?"}
=="yes (default)"==>set_node_affinity_from_vcpu_affinities("
Calculate the domain's <tt>node_affinity</tt> mask from vCPU affinity
(used for further NUMA memory allocation for the domain)")
end
do_domctl{"do_domctl()<br>op->cmd=?"}
==XEN_DOMCTL_setvcpuaffinity==>
vcpu_set_affinity("<tt>vcpu_set_affinity()</tt><br>set the vCPU affinity")
==>domain_update_node_aff
do_domctl
--XEN_DOMCTL_setnodeaffinity (not used currently)
-->is_new_affinity_all_nodes?
subgraph domain_set_node_affinity["domain_set_node_affinity()"]
is_new_affinity_all_nodes?{new_affinity<br>is #34;all#34;?}
--is #34;all#34;
--> enable_auto_node_affinity("<tt>auto_node_affinity=1")
--> domain_update_node_aff
is_new_affinity_all_nodes?
--not #34;all#34;
--> disable_auto_node_affinity("<tt>auto_node_affinity=0")
--> domain_update_node_aff
end
%% setting and getting the struct domain's node_affinity:
disable_auto_node_affinity
--node_affinity=new_affinity-->
domain_node_affinity
set_node_affinity_from_vcpu_affinities
==> domain_node_affinity@{ shape: bow-rect,label: "domain: node_affinity" }
--XEN_DOMCTL_getnodeaffinity--> do_domctl
end
click is_Host.numa_affinity_policy_set?
"https://github.com/xapi-project/xen-api/blob/90ef043c1f3a3bc20f1c5d3ccaaf6affadc07983/ocaml/xenopsd/xc/domain.ml#L951-L962"
click numa_placement
"https://github.com/xapi-project/xen-api/blob/90ef043c/ocaml/xenopsd/xc/domain.ml#L862-L897"
click stub_xc_hvm_build
"https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L2329-L2436" _blank
click get_flags
"https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L1164-L1288" _blank
click do_domctl
"https://github.com/xen-project/xen/blob/7cf163879/xen/common/domctl.c#L282-L894" _blank
click domain_set_node_affinity
"https://github.com/xen-project/xen/blob/7cf163879/xen/common/domain.c#L943-L970" _blank
click configure_vcpus
"https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L1297-L1348" _blank
click set_hard_affinity?
"https://github.com/xenserver/xen.pg/blob/65c0438b/patches/xenguest.patch#L1305-L1326" _blank
click xc_vcpu_setaffinity
"https://github.com/xen-project/xen/blob/7cf16387/tools/libs/ctrl/xc_domain.c#L199-L250" _blank
click vcpu_set_affinity
"https://github.com/xen-project/xen/blob/7cf16387/xen/common/sched/core.c#L1353-L1393" _blank
click domain_update_node_aff
"https://github.com/xen-project/xen/blob/7cf16387/xen/common/sched/core.c#L1809-L1876" _blank
click check_auto_node
"https://github.com/xen-project/xen/blob/7cf16387/xen/common/sched/core.c#L1840-L1870" _blank
click set_node_affinity_from_vcpu_affinities
"https://github.com/xen-project/xen/blob/7cf16387/xen/common/sched/core.c#L1867-L1869" _blank