Xenopsd instances run on a host and manage VMs on behalf of clients. This picture shows 3 different Xenopsd instances: 2 named "xenopsd-xc" and 1 named "xenopsd-xenlight".
Each instance is responsible for managing a disjoint set of VMs. Clients should never ask more than one Xenopsd to manage the same VM. Managing a VM means: - handling start/shutdown/suspend/resume/migrate/reboot - allowing devices (disks, nics, PCI cards, vCPUs etc) to be manipulated - providing updates to clients when things change (reboots, console becomes available, guest agent says something etc). For a full list of features, consult the features list.
Each Xenopsd instance has a unique name on the host. A typical name is - org.xen.xcp.xenops.classic - org.xen.xcp.xenops.xenlight
A higher-level tool, such as xapi will associate VMs with individual Xenopsd names.
Running multiple Xenopsds is necessary because - The virtual hardware supported by different technologies (libxc, libxl, qemu) is expected to be different. We can guarantee the virtual hardware is stable across a rolling upgrade by running the VM on the old Xenopsd. We can then switch Xenopsds later over a VM reboot when the VM admin is happy with it. If the VM admin is unhappy then we can reboot back to the original Xenopsd again. - The suspend/resume/migrate image formats will differ across technologies (again libxc vs libxl) and it will be more reliable to avoid switching technology over a migrate. - In the future different security domains may have different Xenopsd instances providing even stronger isolation guarantees between domains than is possible today.
Communication with Xenopsd is handled through a Xapi-global library: xcp-idl. This library supports - message framing: by default using HTTP but a binary framing format is available - message encoding: by default we use JSON but XML is also available - RPCs over Unix domain sockets and persistent queues.
This library allows the communication details to be changed without having to change all the Xapi clients and servers.
Xenopsd has a number of "backends" which perform the low-level VM operations such as (on Xen) "create domain" "hotplug disk" "destroy domain". These backends contain all the hypervisor-specific code including - connecting to Xenstore - opening the libxc /proc/xen/privcmd interface - initialising libxl contexts
The following diagram shows the internal structure of Xenopsd:
At the top of the diagram two client RPC have been sent: one to start a VM and the other to fetch the latest events. The RPCs are all defined in xcp-idl/xen/xenops_interface.ml. The RPCs are received by the Xenops_server module and decomposed into "micro-ops" (labelled "μ op"). These micro ops represent actions like
Each of these micro-ops is represented by a function call in a "backend plugin" interface. The micro-ops are enqueued in queues, one queue per VM. There is a thread pool (whose size can be changed dynamically by the admin) which pulls micro-ops from the VM queues and calls the corresponding backend function.
The active backend (there can only be one backend per Xenopsd instance) executes the micro-ops. The Xenops_server_xen backend in the picture above talks to libxc, libxl and qemu to create and destroy domains. The backend also talks to other Xapi services, in particular
Xenopsd backends are also responsible for monitoring running VMs. In the Xenops_server_xen backend this is done by watching Xenstore for
When such an event happens (for example: @releaseDomain sent when a domain requests a reboot) the corresponding operation does not happen inline. Instead the event is rebroadcast upwards to Xenops_server as a signal ("for example: VM id needs some attention") and a "VM_stat" micro-op is queued in the appropriate queue. Xenopsd does not allow operations to run on the same VM in parallel and enforces this by:
The event takes the form "VM id needs some attention" and not "VM id needs to be rebooted" because, by the time the queue is flushed, the VM may well now be in a different state. Perhaps rather than being rebooted it now needs to be shutdown; or perhaps the domain is now in a good state because the reboot has already happened. The signals sent by the backend to the Xenops_server are a bit like event channel notifications in the Xen ring protocols: they are requests to ask someone to perform work, the don't themselves describe the work that needs to be done.
An implication of this design is that it should always be possible to answer the question, "what operation should be performed to get the VM into a valid state?". If an operation is cancelled half-way through or if Xenopsd is suddenly restarted, it will ask the question about all the VMs and perform the necessary operations. The operations must be designed carefully to make this work. For example if Xenopsd is restarted half-way through starting a VM, it must be obvious on restart that the VM should either be forcibly shutdown or rebooted to make it a valid state again. Note: we don't demand that operations are performed as transactions; we only demand that the state they leave the system be "sensible" in the sense that the admin will recognise it and be able to continue their work.
Sometimes this can be achieved through careful ordering of side-effects within the operations, taking advantage of artifacts of the system such as:
In the absense of "tells" from the system, operations are expected to journal their intentions and support restart after failure.
There are three categories of metadata associated with VMs:
VM and VmExtra metadata is stored by Xenopsd in the domain 0 filesystem, in a simple directory hierarchy.