High-Availability Storage Cluster

Synology HA Storage Cluster

We are building a High-Availability (HA) Storage Cluster to complement our Proxmox HA Server Cluster. Synology has a nice HA solution that we can use for this. To use Synology’s HA solution, one must have the following:

  • Two identical Synology NAS devices (we are using a pair of rack-mounted Synology RS1221+ units)
  • Both NAS devices must have identical memory and disk configurations.
  • Both NAS devices must have at least two network interfaces available (we are using dual 10 GbE network cards in both of our NAS devices)

The two NAS devices work in an active/standby configuration and present a single IP interface for access to storage and administration.

Synology HA Documentation

Synology provides good documentation for their HA system. Here are some useful links:

The video above provides a good overview of Synology HA and how to configure it.

Storage Cluster Hardware

Synology RS1221+ NAS

We are using a pair of Synology RS1221+ rack-mounted NAS servers. Each one is configured with the following hardware options:

Networking

Our Proxmox Cluster will connect to our HA Storage Cluster via Ethernet. The virtual disks for the VMs and LXC containers in the Proxmox Cluster will be stored on the HA Storage Cluster, so maximizing the speed of these connections and minimizing their latency is important to our workloads' overall performance.

Each node in our Proxmox Cluster has a dedicated high-speed connection (25 GbE for pve1, 10 GbE for pve2 and pve3) to a dedicated Storage VLAN. These connections are made through a UniFi switch, an Enterprise XG 24. The switch is backed by a large UPS that provides battery power for our Networking Rack.

Ubiquiti EnterpriseXG 24 Switch

We took this approach to minimize latency: all of the cluster's storage traffic is handled by a single switch.

Ideally, we would have a pair of these switches and redundant connections to our Proxmox and HA Storage clusters to maximize reliability. While this would be a nice enhancement, we have chosen to use a single switch for cost reasons.

The NAS drives in our HA Storage Cluster are configured to provide an interface directly on our Storage VLAN. This approach ensures that the nodes in our Proxmox cluster can access the HA Storage Cluster directly without a routing hop through our firewall. We also set the MTU for this network to 9000 (Jumbo Frames) to minimize packet overhead.
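
One way to confirm that jumbo frames work end to end is to send a ping that is exactly one MTU in size with fragmentation prohibited. The sketch below works out the largest ICMP payload that fits in a 9000-byte MTU and prints the matching Linux ping command. The target address is a placeholder, and the command is printed rather than run.

```shell
#!/bin/sh
# Largest ICMP echo payload that fits in a single frame at a given MTU.
MTU=9000
IPV4_HEADER=20   # bytes, assuming no IP options
ICMP_HEADER=8    # bytes
PAYLOAD=$(( MTU - IPV4_HEADER - ICMP_HEADER ))

# -M do sets the don't-fragment flag (Linux iputils ping), so the ping
# fails if any device on the path has a smaller MTU.
echo "ping -M do -c 3 -s ${PAYLOAD} <storage NAS IP>"
```

If the printed command succeeds from each Proxmox node, every device on the path accepts 9000-byte frames. If it fails while a default-size ping succeeds, an MTU mismatch somewhere on the path is the likely cause.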

Storage Design

Each Synology RS1221+ in our cluster has eight 960 GB Enterprise SSDs. The performance of the resulting storage system is important as we will be storing the disks for the VMs and LXCs in our Proxmox Cluster on our HA Storage System. The following are the criteria we used to select a storage pool configuration:

  • Performance – we want to be able to saturate the 10 GbE interfaces to our HA Storage Cluster
  • Reliability – we want to be protected against single-drive failures. We will keep spare drives and use backups to manage the chance of simultaneous multiple-drive failures.
  • Storage Capacity – we want to use the available SSD storage capacity efficiently.

We considered using either a RAID-10 or RAID-5 configuration.

Storage Devices – 960 GB Enterprise SSDs

Toshiba 960 GB SSD Performance

Our SSD drives are enterprise models with good throughput and IO/s (IOPs) performance.

960 GB SSD Reliability Features

They also offer desirable reliability characteristics, including good write endurance and MTBF numbers. The drives include sudden power-off protection to maintain data integrity during a power failure that our UPS system cannot cover.

Performance Comparison – RAID-10 vs. RAID-5

We used a RAID performance calculator to estimate the performance of our storage system. Based on actual runtime data from our VMs and LXCs running in Proxmox, our IO workload is almost entirely dominated by write operations. This is probably because read caching on our servers handles most read operations from memory.

The first option we considered was RAID-10. The estimated performance for this configuration is shown below.

RAID-10 Throughput Performance

As you can see, this configuration’s throughput will more than saturate our 10 GbE connections to our HA Storage Cluster.

The next option we considered was RAID-5. The estimated performance for this configuration is shown below.

RAID-5 Throughput Performance

As you can see, write performance takes a substantial hit due to the need to generate and store parity data each time storage is written. Even so, the RAID-5 configuration should still be able to saturate our 10 GbE connections to the Storage Cluster.

The result is that the RAID-10 and RAID-5 configurations will provide the same performance level given our 10 GbE connections to our Storage Cluster.

Capacity Comparison – RAID-10 vs. RAID-5

The next step in our design process was to compare the usable storage capacity between RAID-10 and RAID-5 using Synology’s RAID Calculator.

RAID-10 vs. RAID-5 Usable Storage Capacity

Not surprisingly, the RAID-5 configuration provides far more usable storage than the RAID-10 configuration: with eight drives, RAID-5 keeps seven drives' worth of capacity while RAID-10 keeps four, or roughly 1.75 times as much.
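
The capacity difference follows directly from the drive count. The arithmetic below is a minimal sketch for our eight 960 GB drives, assuming a single RAID group per NAS and ignoring filesystem and system overhead, so the figures will be somewhat higher than what a vendor calculator reports.

```shell
#!/bin/sh
DRIVES=8
DRIVE_GB=960

# RAID-10 mirrors every drive, so half of the raw capacity is usable.
RAID10_GB=$(( DRIVES / 2 * DRIVE_GB ))
# RAID-5 spends one drive's worth of capacity on parity.
RAID5_GB=$(( (DRIVES - 1) * DRIVE_GB ))

echo "RAID-10 usable: ${RAID10_GB} GB"
echo "RAID-5 usable:  ${RAID5_GB} GB"
```

This gives 3840 GB for RAID-10 and 6720 GB for RAID-5 before overhead.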

Chosen Configuration

We decided to format our SSDs as a Btrfs storage pool configured as RAID-5. We chose RAID-5 for the following reasons:

  • A good balance between write performance and reliability
  • Efficient use of available SSD storage space
  • Acceptable overall reliability (single disk failures) given the following:
    • Our storage pools are fully redundant between the primary and secondary NAS pools
    • We run regular automatic snapshots, replications, and backups via Synology’s Hyper Backup as well as server-side backups via Proxmox Backup Server.

The following shows the expected IO/s (IOPs) for our storage system.

RAID-5 IOPs Performance

This level of performance should be more than adequate for our three-node cluster’s workload.

Dataset / Share Configuration

The final dataset format that we will use for our vdisks is TBD at this point. We plan to test the performance of both iSCSI LUNs and NFS shares. If these perform roughly the same for our workloads, we will use NFS to gain better support for snapshots and replication features. At present, we are using an NFS dataset to store our vdisks.

HA Configuration

Configuring the pair of RS1221+ NAS servers for HA was straightforward. The secondary NAS needs only minimal configuration to make its storage and network settings match the primary NAS. The process that enables HA on the primary NAS will overwrite all of the settings on the secondary NAS.

Here are the steps that we used to do this.

  • Install all of the upgrades and SSDs in both units
  • Connect both units to our network and install an ethernet connection between the two units for heartbeats and synchronization
  • Install DSM on each unit and set a static IP address for the network-facing ethernet connections (we do not set IPs for the heartbeat connections; Synology HA takes care of this)
  • Configure the network interfaces on both units to provide direct interfaces to our Storage VLAN (see the previous section)
  • Make sure that the MTU settings are identical on each unit. This includes the MTU setting for unused ethernet interfaces. We had to edit the /etc/synoinfo.conf file on each unit to set the MTU values for the inactive interfaces.
  • Ensure both units are running up-to-date versions of the DSM software
  • Configure the pair for HA (see the documentation above)
  • Complete the configuration of the cluster pair, including –
    • Shares
    • Backups
    • Snapshots and Replication
    • Install Apps
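
The MTU check in the steps above is easier if every interface is listed at once, unused ones included. The following is a minimal sketch that reads the values from sysfs; it assumes SSH access to each NAS and that DSM exposes the standard Linux /sys/class/net tree. The list_mtus function name is ours.

```shell
#!/bin/sh
# Print "<interface> <mtu>" for every network interface found under the
# given sysfs directory (defaults to the standard Linux location).
list_mtus() {
    net_dir="${1:-/sys/class/net}"
    for iface in "$net_dir"/*; do
        [ -r "$iface/mtu" ] || continue
        printf '%s %s\n' "$(basename "$iface")" "$(cat "$iface/mtu")"
    done
}

list_mtus
```

Running this on both units and comparing the two listings shows any interface whose MTU differs before HA is enabled.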

The following shows the completed configuration of our HA Storage Cluster.

Completed HA Cluster Configuration

The cluster uses a single IP address to present a GUI that configures and manages the primary and secondary NAS units as if they were a single NAS. The same IP address always points to the active NAS for file sharing and iSCSI I/O operations.

Voting Server

A voting server avoids split-brain scenarios where both units in the HA cluster try to act as the master. Any server that is always accessible via ping to both NAS drives in the cluster can serve as a Voting Server. We used the gateway for the Storage VLAN where the cluster is connected for this purpose.

Performance Benchmarking

We used the ATTO Disk Benchmarking Tool to perform benchmark tests on the complete HA cluster. The benchmarks were run from an M2 Mac Mini running macOS, which used an SMB share to access the Storage Cluster over a 10 GbE connection on the Storage VLAN.

Storage Cluster Benchmark Configuration

The following are the benchmark results –

Storage Cluster Throughput Benchmarks

The Storage Cluster’s performance is quite good, and the 10 GbE connection is saturated for 128 KB writes and larger. The slightly lower read throughput results from a combination of our SSDs’ write performance and the additional latency on writes due to the need to copy data from the primary NAS storage to the secondary NAS.

Storage Cluster IOPs Benchmarks

IO/s (IOPs) performance is important for the virtual disks of VMs and LXC containers, as they frequently perform smaller writes.

We also ran benchmarks from a VM running Windows 10 in our Proxmox Cluster. These benchmarks benefit from a number of caching and compression features in our architecture, including:

  • Write Caching with the Windows 10 OS
  • Write Caching with the iSCSI vdisk driver in Proxmox
  • Write Caching on the NAS drives in our Storage Cluster

Windows VM Disk Benchmarks

The overall performance figures for the Windows VM benchmark exceed the capacity of the 10 GbE connections to the Storage Cluster and are quite good. Also, the IOPs performance is close to the specified maximum performance values for the RS1221+ NAS.

Windows VM IOPs Benchmarks

Failure Testing

The following scenarios were tested under a full workload –

  • Manual Switch between Active and Standby NAS devices
  • Simulate a network failure by disconnecting the primary NAS ethernet cable.
  • Simulate active NAS failure by pulling power from the primary NAS.
  • Simulate a disk failure by pulling a disk from the primary NAS pool.

In all cases, our system failed over within 30 seconds and continued handling the workload without error.

Server Cluster

Proxmox Cluster Configuration

Our server cluster consists of three servers. Our approach was to pair one high-capacity server (a Dell R740 dual-socket machine) with two smaller Supermicro servers.

Node | Model                 | CPU                                   | RAM    | Storage           | OOB Mgmt. | Network
pve1 | Dell R740             | 2 x Xeon Gold 6154 3.0 GHz (36 cores) | 768 GB | 16 x 3.84 TB SSDs | iDRAC     | 2 x 10 GbE, 2 x 25 GbE
pve2 | Supermicro 5018D-FN4T | Xeon D-1540 2.0 GHz (8 cores)         | 128 GB | 2 x 7.68 TB SSDs  | IPMI      | 2 x 1 GbE, 4 x 10 GbE
pve3 | Supermicro 5018D-FN4T | Xeon D-1540 2.0 GHz (8 cores)         | 128 GB | 2 x 7.68 TB SSDs  | IPMI      | 2 x 1 GbE, 4 x 10 GbE

Cluster Servers

This approach allows us to handle most of our workloads on the high-capacity server, gain the advantages of high availability, and move workloads to the smaller servers to prevent downtime during maintenance activities.

Server Networking Configuration

All three servers in our cluster have similar networking interfaces consisting of:

  • An OOB management interface (iDRAC or IPMI)
  • Two low-speed ports (1 GbE or 10 GbE)
  • Two high-speed ports (10 GbE or 25 GbE)
  • PVE2 and PVE3 each have an additional two high-speed ports (10 GbE) via an add-on NIC

The following table shows the interfaces on our three servers and how they are mapped to the various functions available via a standard set of bridges on each server.

Cluster Node      | OOB Mgmt.   | PVE Mgmt.     | Low-Speed Svcs.         | High-Speed Svcs.        | Storage Svcs.
pve1 (R740)       | 1 GbE iDRAC | 10 GbE Port 1 | 10 GbE Port 2           | 25 GbE Port 1           | 25 GbE Port 2
pve2 (5018D-FN4T) | 1 GbE IPMI  | 10 GbE Port 1 | 1 GbE Ports 1 & 2 (LAG) | 10 GbE Port 3 & 4 (LAG) | 10 GbE Port 2
pve3 (5018D-FN4T) | 1 GbE IPMI  | 10 GbE Port 1 | HS Svcs (LAG)           | 10 GbE Port 3 & 4 (LAG) | 10 GbE Port 2

Each machine uses a combination of interfaces and bridges to realize a standard networking setup. PVE2 and PVE3 also utilize LACP bonds to provide higher capacity for the low-speed and high-speed service bridges.

You can see how we configured the LACP Bond interfaces in this video.

Network Bonding on Proxmox

We must add specific routes to ensure the separate Storage VLAN is used for Virtual Disk I/O. This is done by adjusting the vmbr3 bridge stanza in /etc/network/interfaces.
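
As a rough illustration of this kind of adjustment, the sketch below builds an on-link route for the storage subnet and prints it as post-up and pre-down hook lines for the vmbr3 stanza. The 192.168.100.0/24 subnet and the use of post-up hooks are assumptions made for the example, not a copy of our configuration.

```shell
#!/bin/sh
# Hypothetical values for illustration; substitute your own.
BRIDGE="vmbr3"
STORAGE_SUBNET="192.168.100.0/24"   # assumed Storage VLAN subnet

# An on-link route (no "via" gateway) sends storage traffic straight out
# of the bridge instead of through the default gateway.
ROUTE_ADD="ip route replace ${STORAGE_SUBNET} dev ${BRIDGE}"
ROUTE_DEL="ip route del ${STORAGE_SUBNET} dev ${BRIDGE}"

# Print the hook lines that would go inside the bridge stanza.
printf '    post-up %s\n' "$ROUTE_ADD"
printf '    pre-down %s\n' "$ROUTE_DEL"
```

The script only prints the two lines and changes nothing on the host. An explicit route like this matters when the bridge address uses a /32 mask, because the kernel then does not create a subnet route on its own.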

Finally, use the IP address the target NAS uses in the Storage VLAN when configuring the NFS share for PVE-storage. This ensures that the dedicated Storage VLAN will be used for Virtual Disk I/O by all nodes in our Proxmox Cluster. We ran

# traceroute <storage NAS IP>

from each of our servers to confirm that we have a direct LAN connection to PVE-Storage that does not go through our router.
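
The same check can be scripted by counting hops in the traceroute output: a direct LAN connection shows exactly one hop. This is a minimal sketch; the count_hops helper is ours and the address in the canned example is a placeholder.

```shell
#!/bin/sh
# Count the hops in traceroute output read from standard input.
# Hop lines begin with a hop number; the header line does not.
count_hops() {
    awk '$1 ~ /^[0-9]+$/ { n++ } END { print n + 0 }'
}

# Example with canned output for a directly connected host.
printf '%s\n' \
    'traceroute to 192.168.100.10 (192.168.100.10), 30 hops max, 60 byte packets' \
    ' 1  192.168.100.10 (192.168.100.10)  0.212 ms  0.187 ms  0.175 ms' \
    | count_hops
```

On a live node the equivalent would be traceroute -n <storage NAS IP> | count_hops. Any result greater than 1 means the traffic is passing through a router.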

Cluster Setup

We are currently running a three-server Proxmox cluster. Our servers consist of:

  • A Dell R740 Server
  • Two Supermicro 5018D-FN4T Servers

The first step was to prepare each server in the cluster as follows:

  • Install and configure Proxmox
  • Set up a standard networking configuration
  • Confirm that all servers can ping the shared storage NAS using the storage VLAN

We used the procedure in the following video to set up and configure our cluster.

The first step was to use the pve1 server to create a cluster. Next, we added the other servers to the cluster. If there are problems with connecting to shared stores, check the following:

  • Is the Storage VLAN connection using an address like 192.168.100.<srv>/32?
  • Is there a direct route for VLAN 1000 (Storage) that does not use the router? Check via traceroute <storage-addr>
  • Is the target NAS drive sitting on the Storage VLAN with multiple gateways enabled?
  • Can you ping the storage server from a shell on each Proxmox node?

Backups

For backups to work correctly, we need to modify the Proxmox /etc/vzdump.conf file to set the tmpdir to /var/tmp/ as follows:

# vzdump default settings

tmpdir:  /var/tmp/
#tmpdir: DIR
#dumpdir: DIR
...

This causes vzdump to create its temporary files under /var/tmp/ on the Proxmox node when building backup archives, for all backups.

We later upgraded to Proxmox Backup Server. You can see how PBS was installed and configured here.

NFS Backup Mount

We set up an NFS backup mount on one of our NAS drives to store Proxmox backups.

An NFS share was set up on NAS-5 as follows:

  • Share PVE-backups (/volume2/PVE-backups)
  • Used the default Management Network

A Storage volume was configured in Proxmox to use for backups as follows:

NAS-5 NFS Share for PVE Backups

A Note About DNS Load

Proxmox constantly does DNS lookups on the servers associated with NFS and other mounted filesystems, which can result in very high transaction loads on our DNS servers. To avoid this problem, we replaced the server domain names with the associated IP addresses. Note that this cannot be done for the virtual mount for the Proxmox Backup Server, as PBS uses a certificate to validate the domain name used to access it. These adjustments can be made by editing the storage configuration file at /etc/pve/storage.cfg on any node in the cluster (changes in this file are synced for all nodes).
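
To find the entries that still use a domain name, the server lines in the storage configuration can be filtered for values that are not IPv4 addresses. This is a minimal sketch; the list_named_servers helper is ours, and it assumes each storage entry names its host on a line beginning with the server keyword.

```shell
#!/bin/sh
# Print the value of every "server" line that is not a dotted IPv4 address.
list_named_servers() {
    awk '$1 == "server" && $2 !~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ { print $2 }' "$1"
}

# Only inspect the real file when it exists (i.e. on a Proxmox node).
CFG=/etc/pve/storage.cfg
if [ -r "$CFG" ]; then
    list_named_servers "$CFG"
fi
```

The Proxmox Backup Server entry is expected to appear in this list, since it has to keep its domain name.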

NFS Virtual Disk Mount

We also created an NFS share for VM and LXC virtual disk storage. The volume chosen provides high-speed SSD storage on a dedicated Storage VLAN.

Global Backup Job

A Datacenter level backup job was set up to run daily at 1 am for all VMs and containers as follows (this was later replaced with Proxmox Backup Server backups as explained here):

Proxmox Backup Job

The following retention policy was used:

Proxmox Backup Retention Policy

Node File Backups

We installed the Proxmox Backup Client on each of our server nodes and created a cron-scheduled script that backs up the files on each node to our Proxmox Backup Server each day. The following video explains how to install and configure the PBS client.

For the installation to work properly, the locations of the PBS repository and access credentials must be set in both the script and the login bash shell. We also need to create a cron job to run the backup script daily.
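
A backup script along these lines might look like the sketch below. The repository string, password, and archive name are hypothetical placeholders. PBS_REPOSITORY and PBS_PASSWORD are the environment variables that proxmox-backup-client reads for its target and credentials.

```shell
#!/bin/sh
# Hypothetical repository and credentials; substitute your own.
# Format: <user>@<realm>@<server>:<datastore>
export PBS_REPOSITORY='backup@pbs@pbs.example.com:node-files'
export PBS_PASSWORD='change-me'

# Back up the node's root filesystem as a pxar archive.
backup_node() {
    proxmox-backup-client backup root.pxar:/
}

# Only run where the client is actually installed.
if command -v proxmox-backup-client >/dev/null 2>&1; then
    backup_node
fi
```

A crontab entry such as 0 2 * * * /usr/local/bin/pve-node-backup.sh (path and time are illustrative) would run it daily. Because the script holds a credential, it should be readable only by root.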

Setup SSL Certificates

We use the procedure in the video below to set up signed SSL certificates for our three server nodes and the Proxmox Backup server.

This approach uses a Let’s Encrypt DNS-01 challenge via Cloudflare DNS to authenticate with Let’s Encrypt and obtain a signed certificate for each server node in the cluster and for PBS.

Setup SSH Keys

A public/private key pair is created and set up for Proxmox VE and all VMs and LXC to ensure secure SSH access. The following procedure is used to do this. The public keys are installed on each server using the ssh-copy-id username@host command.

High Availability (HA)

Proxmox can support automatic failover (High Availability) of VMs and Containers to any node in a cluster. The steps to configure this are:

  • Move the virtual disks for all VMs and LXC containers to shared storage. In our case, this is PVE-storage. Note that our TrueNAS VM must run on pve1 as it uses disks that are only available on pve1.
  • Enable HA for all VMs and LXCs (except TrueNAS)
  • Set up an HA group to govern where the VMs and LXC containers migrate to if a node fails

Cluster Failover Configuration – VMs & LXCs

We generally run all of our workloads on pve1 since it is the highest-performance, highest-capacity node in our cluster. Should this node fail, we want the pve1 workload to be distributed evenly between the pve2 and pve3 nodes. We can do this by setting up an HA Failover Group as follows:

HA Failover Group Configuration

The nofailback option is set so workloads don’t automatically migrate back to pve1 when we manually migrate them to other nodes to support maintenance operations.
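
The same kind of group can also be created from the command line on Proxmox releases that use HA groups. The sketch below only prints the ha-manager commands. The group name, node priorities, and VM ID are hypothetical and are not taken from our configuration.

```shell
#!/bin/sh
# Hypothetical group name and priorities; higher numbers are preferred.
GROUP="pve-failover"
NODES="pve1:2,pve2:1,pve3:1"

# Printed rather than executed so the sketch is safe to run anywhere.
echo "ha-manager groupadd ${GROUP} --nodes ${NODES} --nofailback 1"
# Attach a guest to the group (VM ID 100 is a placeholder).
echo "ha-manager add vm:100 --group ${GROUP}"
```

Giving pve2 and pve3 the same priority lets the HA manager spread recovered guests across both nodes.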