LVHD is a block-based storage system built on top of Xapi and LVM. LVHD disks are represented as LVM LVs with vhd-format data inside. When a disk is snapshotted, the LVM LV is "deflated" to the minimum possible size, just big enough to store the current vhd data. All other disks are stored "inflated", i.e. consuming the maximum amount of storage space. This proposal describes how we could add dynamic thin-provisioning to LVHD, so that disks consume only the space their data actually requires.
All VM disk writes are channeled through `tapdisk3`, which keeps track of how much space remains reserved in the LVM LV. When the free space drops below a "low-water mark" (configurable via a host config file), `tapdisk3` opens a connection to a "local allocator" process and requests more space asynchronously. If `tapdisk3` notices the free space approaching zero, it should start to slow I/O in order to give the local allocator more time. Eventually, if `tapdisk3` runs out of space before the local allocator can satisfy the request, guest I/O will block. Note that Windows VMs will start to crash if guest I/O blocks for more than 70s.
Every host has a "local allocator" daemon which manages a host-wide pool of blocks (represented by an LVM LV) and provides them to `tapdisk3` on demand. When it receives a request, the local allocator decides which blocks to provide from its local free pool, writes to the journal and then reloads the device mapper table to extend the LV. The journal is replayed on the SRmaster when any VDI on the host is deactivated. The local allocator also has a "low-water mark" (configurable via a host config file) and will request additional blocks from the SRmaster when it is running low.
Consider what will happen if a host fails when HA is disabled: the host's local journal cannot be replayed automatically, so the blocks it had reserved remain unaccounted for in the LVM metadata and its VMs stay locked until the administrator declares the host dead by hand (see the walk-through below). Therefore we recommend that users enable HA and only disable it for short periods of time. Note that, unlike other thin-provisioning implementations, we will allow HA to be disabled.
When a host calls SMAPI `sr_attach`, it will attach two host-local LVs:

- `host-<uuid>-free`: the free blocks cached on the host.
- `host-<uuid>-journal`: a sequence of block allocation records describing where the free blocks have been allocated.

On `sr_attach` and `sr_detach` the journal should be replayed and then emptied. The journal replay code must be run on the SRmaster since this host is the only one with read/write access to the LVM metadata. For ease of debugging and troubleshooting, we should create command-line tools to dump and replay the journal.
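For illustration, attaching the host-local LVs could amount to activating them with `lvchange`; a minimal sketch, assuming the volume group name is passed in by the SM backend:

```python
import subprocess

def attach_host_local_lvs(vg, host_uuid):
    """Activate the two host-local LVs during sr_attach (illustrative)."""
    for suffix in ("free", "journal"):
        lv = "%s/host-%s-%s" % (vg, host_uuid, suffix)
        # lvchange -ay makes the LV's device node available on this host.
        subprocess.check_call(["lvchange", "-ay", lv])
```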
The local allocator process should export RRD datasources over shared memory named:

- `sr_<SR uuid>_<host uuid>_free`: the number of free blocks in the local cache
- `sr_<SR uuid>_<host uuid>_allocations`: a counter of the number of times the local cache had to be refilled from the SRmaster

The admin should examine the allocations counter in particular: if the rate of allocations is too high, the local host's allocation quantum should be increased. For a particular workload, the allocation quantum should be increased just enough to prevent any allocations being necessary during the HA timeout period (see the sizing sketch below).
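A back-of-the-envelope sketch of that sizing rule; every number below is an illustrative assumption, not a recommended default:

```python
# The allocation quantum must cover the blocks a host can dirty during one HA
# timeout period, so that no refill from the SRmaster is needed while a failed
# master is being recovered.

peak_write_rate = 100 * 1024 * 1024   # bytes/s of fresh block allocation on this host (assumed)
ha_timeout = 120                      # seconds the pool may run without an SRmaster (assumed)
extent_size = 4 * 1024 * 1024         # LVM extent size in bytes (assumed 4 MiB default)

required_bytes = peak_write_rate * ha_timeout
required_extents = -(-required_bytes // extent_size)   # ceiling division

print(f"allocation quantum >= {required_bytes / 2**30:.1f} GiB ({required_extents} extents)")
```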
`tapdisk3` will be modified to:

- read a low-water mark value from a config file `/etc/tapdisk3.conf`
- read a second, lower threshold at which to start slowing guest I/O from `/etc/tapdisk3.conf`
- read the path of the local allocator's Unix domain socket from `/etc/tapdisk3.conf`
- when the space remaining in the LV drops below the low-water mark, connect to the socket and send an extend request asynchronously
- when the remaining space approaches zero, slow guest I/O to give the local allocator more time to satisfy the request (see the sketch below)
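A minimal sketch of the watermark logic, written in Python for brevity (the real change is in `tapdisk3`'s C code, and the threshold values and socket path below are assumptions, not a proposed file format):

```python
# Hypothetical values read from /etc/tapdisk3.conf.
LOW_WATER_MARK = 256 * 1024 * 1024       # request more space below this many free bytes
VERY_LOW_WATER_MARK = 32 * 1024 * 1024   # start slowing guest I/O below this
ALLOCATOR_SOCKET = "/var/run/sr-allocator.sock"   # where extend requests are sent (see below)

def on_free_space_change(free_bytes, extend_requested, request_extend, throttle):
    """Decide what to do as the reserved space in the LV shrinks.

    request_extend and throttle stand in for the real tapdisk3 actions; the
    extend request itself is asynchronous so guest I/O continues meanwhile."""
    if free_bytes < LOW_WATER_MARK and not extend_requested:
        request_extend()
        extend_requested = True
    if free_bytes < VERY_LOW_WATER_MARK:
        throttle()   # buy the local allocator more time before I/O would block
    return extend_requested
```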
The extend request has the following format:

| Octet offsets | Name | Description |
|---|---|---|
| 0,1 | tl | Total length (including this field) of message (in network byte order) |
| 2 | type | The value '0' indicating an extend request |
| 3 | nl | The length of the LV name in octets, including NULL terminator |
| 4,...,4+nl-1 | name | The LV name |
| 4+nl,...,12+nl-1 | vdi_size | The virtual size of the logical VDI (in network byte order) |
| 12+nl,...,20+nl-1 | lv_size | The current size of the LV (in network byte order) |
| 20+nl,...,28+nl-1 | cur_size | The current size of the vhd metadata (in network byte order) |
The response is a single byte value "0" which is a signal to re-examine the LV size. The request will block indefinitely until it succeeds. The request will block for a long time if, for example, the SR has genuinely run out of space, or the SRmaster has failed while HA is disabled so the local free pool cannot be refilled.
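To make the framing concrete, a sketch of the client side of the protocol (`tapdisk3` itself is C; Python is used here for brevity, the socket path is illustrative, and the type and response bytes are assumed to be numeric values rather than ASCII characters):

```python
import socket
import struct

def encode_extend_request(lv_name, vdi_size, lv_size, cur_size):
    """Pack an extend request according to the table above (network byte order)."""
    name = lv_name.encode() + b"\x00"                            # LV name including NULL terminator
    header = struct.pack("!HBB", 28 + len(name), 0, len(name))   # tl, type = 0, nl
    sizes = struct.pack("!QQQ", vdi_size, lv_size, cur_size)     # three 64-bit sizes
    return header + name + sizes

def request_extend(socket_path, lv_name, vdi_size, lv_size, cur_size):
    """Send the request and block until the single-byte response arrives."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(encode_extend_request(lv_name, vdi_size, lv_size, cur_size))
        sock.recv(1)   # "0": re-examine the LV size
```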
There is one local allocator process per attached SR. The process will be spawned by the SM `sr_attach` call, and sent a shutdown message by the `sr_detach` call.

The `sr_attach` call shall spawn the local allocator, passing `--listen <path>` where `<path>` is a name for the local Unix domain socket. When the local allocator process starts up it will read the host local journal so that it knows which free blocks have already been handed out but not yet replayed into the LVM metadata.

The `sr_detach` call shall send a shutdown request to the local allocator over the Unix domain socket.
The shutdown request has the following format:
| Octet offsets | Name | Description |
|---|---|---|
| 0,1 | tl | Total length (including this field) of message (in network byte order) |
| 2 | type | The value '1' indicating a shutdown request |
There is no response to the shutdown request. The local allocator will terminate as soon as it is able.
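A correspondingly minimal sketch of the local allocator's accept loop under the same assumptions (one request per connection, numeric type bytes, no error handling):

```python
import os
import socket
import struct

def recv_exactly(conn, n):
    data = b""
    while len(data) < n:
        chunk = conn.recv(n - len(data))
        if not chunk:
            raise EOFError("connection closed mid-message")
        data += chunk
    return data

def serve(socket_path, handle_extend):
    """Accept connections on the Unix domain socket and dispatch requests.

    handle_extend(lv_name, vdi_size, lv_size, cur_size) performs the allocation
    and returns once the device mapper table has been reloaded."""
    if os.path.exists(socket_path):
        os.unlink(socket_path)
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(socket_path)
    server.listen(5)
    while True:
        conn, _ = server.accept()
        with conn:
            tl, msg_type = struct.unpack("!HB", recv_exactly(conn, 3))
            if msg_type == 1:                    # shutdown request: no response
                return
            body = recv_exactly(conn, tl - 3)    # rest of the extend request
            nl = body[0]
            lv_name = body[1:1 + nl].rstrip(b"\x00").decode()
            vdi_size, lv_size, cur_size = struct.unpack("!QQQ", body[1 + nl:1 + nl + 24])
            handle_extend(lv_name, vdi_size, lv_size, cur_size)
            conn.sendall(b"\x00")                # signal tapdisk3 to re-examine the LV size
```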
When the local allocator receives an extend request it will examine the device mapper tables of the local free block LV and choose the first free blocks, up to the "vdi-allocation-quantum" configured in `/etc/tapdisk-allocator.conf`. The local allocator will append an entry to the host local journal recording this choice of blocks (always using unambiguous physical block addresses). Once the journal entry is committed, the local allocator will reload the device mapper tables of the `tapdisk3` device and then reply to `tapdisk3`.
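A sketch of that ordering, assuming `dmsetup` is used to reload the tables; since the journal format is still a TODO, the journal handling below is an opaque placeholder:

```python
import os
import subprocess

def commit_extension(journal_path, journal_entry, dm_device, new_table):
    """Illustrative ordering for servicing an extend request.

    journal_path/journal_entry stand in for the (TODO) journal format;
    dm_device and new_table are the tapdisk3 device and its extended table."""
    # 1. Record the chosen blocks durably *before* changing any device mapper
    #    state: a crash between the two steps can then be repaired by replaying
    #    the journal on the SRmaster.
    with open(journal_path, "a") as journal:
        journal.write(journal_entry + "\n")
        journal.flush()
        os.fsync(journal.fileno())

    # 2. Extend the tapdisk3 device: load the longer table, then resume.
    subprocess.check_call(["dmsetup", "suspend", dm_device])
    subprocess.check_call(["dmsetup", "load", dm_device, "--table", new_table])
    subprocess.check_call(["dmsetup", "resume", dm_device])

    # 3. Only after both steps does the caller send the single-byte reply to
    #    tapdisk3 (see the accept loop sketched earlier).
```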
TODO: describe the journal format
TODO: describe the journal replay tool here
The SRmaster allocator is a XenAPI host plugin `lvhd-allocator`. The local host allocator calls the SRmaster allocator when it is running low on free blocks on the host. The SRmaster allocator will perform an LVM resize of the host's local free block LV.
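A sketch of what the plugin could look like under the usual XenAPI host plugin conventions; the argument names, the `refill` entry point and the resize amount are assumptions:

```python
#!/usr/bin/env python
# Sketch of /etc/xapi.d/plugins/lvhd-allocator: grow a host's free block LV.
# The real plugin must also serialise with other LVM metadata updates, which
# only the SRmaster may perform.
import subprocess
import XenAPIPlugin

def refill(session, args):
    vg = args["vg"]                                # volume group backing the SR
    host_uuid = args["host_uuid"]                  # host whose free pool is low
    extra_mb = int(args.get("extra_mb", "1024"))   # TODO: choose a sensible default
    lv = "%s/host-%s-free" % (vg, host_uuid)
    subprocess.check_call(["lvextend", "-L", "+%dM" % extra_mb, lv])
    return "extended %s by %d MiB" % (lv, extra_mb)

if __name__ == "__main__":
    XenAPIPlugin.dispatch({"refill": refill})
```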
TODO: what should the default resize amount be?
The role of the membership monitor is to ensure that when a host permanently leaves the pool, whether because it has failed or because it has been forgotten, its local journal is replayed and its host-local LVs are cleaned up. We shall:

- create a `host-pre-declare-dead` script to replay the journal (sketched below)
- modify `Host.declare_dead` to call `host-pre-declare-dead` before the VMs are unlocked
- add a `host-pre-forget` hook type which will be called just before a Host is forgotten
- create a `host-pre-forget` script to destroy the host's local LVs
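A sketch of the `host-pre-declare-dead` hook script; the replay tool name (`lvhd-journal-replay`) and the way the dead host's UUID is passed to the script are hypothetical, since the journal replay tool is still a TODO above:

```python
#!/usr/bin/env python
# Sketch of an /etc/xapi.d/host-pre-declare-dead/ hook script (path assumed).
import subprocess
import sys

def main():
    host_uuid = sys.argv[1]   # UUID of the host being declared dead (assumed calling convention)
    # Replay the dead host's journal into the LVM metadata on the SRmaster and
    # empty it, so its reserved blocks are accounted for before xapi unlocks
    # and restarts the VMs.
    subprocess.check_call(["lvhd-journal-replay", "--host", host_uuid])   # hypothetical tool

if __name__ == "__main__":
    main()
```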
Rolling upgrade should work in the usual way. As soon as the pool master has been upgraded, hosts will be able to use thin provisioning when new VDIs are attached. A VM suspend/resume/reboot or migrate will be needed to turn on thin provisioning for existing running VMs.
A pool may be safely downgraded to a previous version without thin provisioning provided that storage is unplugged cleanly so that journals are replayed. We should document how the journal replay tool works so people can work around problems for themselves. If journals are not replayed then VM disks will be corrupted.
If HA is enabled:

- `xhad` elects a new master if necessary
- `xhad` tells Xapi which hosts are alive and which have failed
- Xapi runs the `host-pre-declare-dead` scripts for every failed host
- the `host-pre-declare-dead` scripts replay the host local journals and update the LVM metadata on the SRmaster
- Xapi unlocks the VMs and restarts them on new hosts.

If HA is not enabled:

- the admin tells Xapi which hosts have failed with `xe host-declare-dead`
- Xapi runs the `host-pre-declare-dead` scripts for every failed host
- the `host-pre-declare-dead` scripts replay the host local journals and update the LVM metadata on the SRmaster
- Xapi unlocks the VMs

Dm-thin also uses 2 local LVs: one for the "thin pool" and one for the metadata. After replaying our journal we could potentially delete our host local LVs and switch over to dm-thin.