We had a recent issue with Zerto backups that took some time to remedy. There was a combination of issues that exposed the problem, and here is a run down of what happened.
We had a customer with about 2TB of VM’s replicating via Zerto. We wanted to provide backup copies using the Zerto backup capability. Keep in mind Zerto is primarily a disaster recovery product and not a backup product (read more about that here: Zerto Backup Overview). The replication piece worked flawlessly, but we were trying to create longer-term backups of virtual machines using Zerto’s backup mechanism which is different from Zerto replication.
Zerto performs a backup by writing all of the VM’s within a VPG to a disk target. It’s a full copy, not incremental, so it’s a large backup every time it runs, especially if it’s a VPG holding a lot of VMs. We originally used a 1Gigabit network to transfer this data, but quickly learned we need to upgrade to 10Gigabit to accommodate these frequent large transfers.
However, we found that most of the time the backup would randomly fail. The failure message was:
“Backup Protection Group ‘VPG Name’. Failure. Failed: Either a user or the system aborted the job.”
To resolve the issue we had opened up several support cases with Zerto, upgraded from version 3.5 to v4, implemented 10Gigabit, put the backup repository directly on the Zerto Manager server.
After opening several cases with Zerto we finally had a Zerto support engineer thoroughly review the Zerto logs. They found there were frequent disconnection events. With this information we explored the site-to-site VPN configuration and found there were minor mismatches in the IPSEC configurations on each side of the VPN which were causing very brief disconnections. These disconnections were causing the backup to fail. Lesson learned: It’s important to ensure the VPN end-points are 100% the same. We use VMware vShield to establish the VPN connections and vShield doesn’t provide a lot of flexibility to change VPN settings, so we had to change the customer’s VPN configuration to match the vShield configuration.
Even though we seemed to have solved the issue by fixing the VPN settings, we asked Zerto if there was any way to make sure the backup process ran even if there was a connection problem. They shared with us a tidbit of information that has enabled us to achieve 100% backup success:
There is a tweak that can be implemented in the ZVM which will allow the backup to continue in the event of a disconnection, but there’s a drawback to this in that the ZVM’s will remain disconnected until the backup completes. As of now, there’s no way to both let the backup continue and the ZVM’s reconnect. So there is a drawback, but for this customer it was acceptable to risk a window of time that replication would stop to make a good backup. In our case we made the backup on Sunday when RPO wasn’t as critical, and even then the replication only halts if there is a disconnection between the sites which became even more rare since we fixed the VPN configuration.
The tweak:
On the Recovery (target) ZVM, open the file C:\Program Files (x86)\Zerto\Zerto Virtual Replication\tweaks.txt (may be in another drive, depending on install)
In that file, insert the following string (on a new line if the file is not empty)
t_skipClearBlockingLine = 1
Save and close the file, then restart the Zerto Virtual Manager and Zerto Virtual Backup Appliance services
Now, when you run a backup, either scheduled or manual, any ZVM <-> ZVM disconnection events should not cause the backup to stop.
I hope this helps someone else!