Blueberry customer ABC wanted to upgrade their application server hosting facility, and also achieve a greater level of redundancy after experiencing problems with server stability.
At the time, VMware did not have the high-availability solutions that it has today, so Blueberry investigated whether there was some way to use VMware to achieve server redundancy.
The solution adopted was to use two Linux host machines running the DRBD network replication software. DRBD allows a partition on one Linux machine to be continually replicated to a second machine: every disk write that takes place on the primary server is also written to the disk on the secondary server.
Blueberry’s approach was to set up a VM partition on the primary server, which is then replicated by DRBD to the secondary server, which has exactly the same configuration. In the event of a hardware failure on the primary server, the VM images could simply be restarted on the secondary. The virtual machines would lose any data in memory, so they would effectively behave as if they had suffered a hard reboot.
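To make the failover behaviour concrete, here is a minimal toy model in Python. It is not DRBD and does not reflect how DRBD is implemented; the `Host` and `ReplicatedPair` names are hypothetical, and the model assumes fully synchronous replication, in which a write counts as done only once both disks hold it. It shows only the property described above: disk contents survive a primary failure, while anything held in VM memory does not.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Toy stand-in for one Linux host (hypothetical, for illustration only)."""
    name: str
    disk: dict = field(default_factory=dict)    # block number -> data (replicated)
    memory: dict = field(default_factory=dict)  # VM RAM (never replicated)


class ReplicatedPair:
    """Mirrors every disk write from the primary host to the secondary host."""

    def __init__(self, primary: Host, secondary: Host) -> None:
        self.primary = primary
        self.secondary = secondary

    def write_block(self, block: int, data: bytes) -> None:
        # Assumption: synchronous replication, so both copies are updated
        # before the write is treated as complete.
        self.primary.disk[block] = data
        self.secondary.disk[block] = data

    def fail_over(self) -> Host:
        # The primary is lost. The VMs restart on the secondary from its
        # disk copy; nothing that was only in RAM is available.
        survivor = self.secondary
        survivor.memory = {}
        return survivor


pair = ReplicatedPair(Host("primary"), Host("secondary"))
pair.primary.memory["session"] = "unsaved work"  # exists only in RAM
pair.write_block(0, b"vm image data")            # mirrored to both disks

survivor = pair.fail_over()
print(survivor.name, survivor.disk, survivor.memory)
```

After the simulated failure, the surviving host still has block 0 on disk but an empty memory, which is the "hard reboot" behaviour the VMs would see.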
In addition, we configured a nightly backup of the complete VM. This was achieved by suspending the VM on the primary and flushing the disk. DRBD was then temporarily disconnected, and the backup image on the secondary server was mounted and copied to a backup file. DRBD was then reconnected. The net effect is a nightly copy of a suspended VM image (so data in memory is also included) on the secondary server, with only 2 minutes of downtime for each VM.
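The ordering of the backup steps matters, so it may help to see it as a small Python sketch. This is not the script we ran: each step is passed in as a plain callable, and all of the names (`nightly_backup`, `suspend_vm`, and so on) are hypothetical placeholders for whatever VM and DRBD commands a real deployment would call. The text above does not say exactly when the VM is resumed; the sketch assumes it is resumed as soon as replication is disconnected, which would be consistent with the short downtime.

```python
from typing import Callable

Step = Callable[[], None]


def nightly_backup(
    suspend_vm: Step,
    flush_disk: Step,
    disconnect_replication: Step,
    resume_vm: Step,
    copy_image_on_secondary: Step,
    reconnect_replication: Step,
) -> None:
    """Run the backup steps in order, undoing the disruptive ones on failure."""
    suspend_vm()
    try:
        flush_disk()
        disconnect_replication()
    finally:
        # Assumption: the VM resumes here, so its downtime covers only the
        # suspend, flush and disconnect steps, not the image copy.
        resume_vm()
    try:
        copy_image_on_secondary()
    finally:
        # Always reconnect, even if the copy fails, so replication resumes.
        reconnect_replication()


# Dry run: record the order of the steps instead of touching real systems.
log = []


def step(name: str) -> Step:
    return lambda: log.append(name)


nightly_backup(
    suspend_vm=step("suspend"),
    flush_disk=step("flush"),
    disconnect_replication=step("disconnect"),
    resume_vm=step("resume"),
    copy_image_on_secondary=step("copy"),
    reconnect_replication=step("reconnect"),
)
print(log)
```

The two `try`/`finally` blocks are a design choice of the sketch: a failed copy should never leave the VM suspended or the two servers disconnected.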
This system has now been deployed for many years, and has run well during this time. At one point, a hardware failure on the primary server required the VMs to be started on the secondary, and the system worked as designed. The customer was pleased.
However, we did find some limitations with this approach. DRBD’s requirement for a dedicated partition made disk space management very time-consuming, as changing the partition size meant copying 100 GB of VMs to a different place. On our systems, which used standard SATA drives to reduce cost, DRBD also had a significant performance impact.
In addition, technology does not stand still, and VMware itself now has solutions that deliver high availability for VM images. However, VMware’s approach seems to require that all VM images be hosted on a common SAN, which is both expensive and introduces a new single point of failure (the SAN).