Layered Protection

1 dead drive + 2 corrupt filesystems + 1 dead drive enclosure + 1 full startup volume (all on one server) = (almost) no data loss

It was potentially a nightmare scenario, one that I didn't fully comprehend at first because I was on the road. Just before leaving on a week-long engagement, I noticed that the mirror set on the server that holds our media files had degraded. After some quick checking, the devices looked fine, so I started the rebuild. About this time, I noticed that the filesystem on our server-based Time Machine volume was corrupt. Nice. Time to head to the airport. I wasn't too concerned because all the important data is also backed up through CrashPlan Pro, and some of it through BRU as well.

The next day, I VPN in to see how the mirror rebuild is going (400+ GB of data on a 640 GB mirror). The server is running very slowly and Disk Utility is claiming 90+ days to rebuild. "That's strange," I think, but I've rebuilt enough RAIDs to know that it often overestimates. However, during the week, the estimate just gets longer and longer, reaching a peak of 174 days.

Clearly something is up. I start getting email notifications from the server that the startup disk is full. Normally there's 25 to 30 GB of free space on the startup disk. Then I start to get emails from the Drobo Pro. Now I'm a little concerned. Our Time Machine backups go to the Time Machine volume. BRU backs up to its own single disk. CrashPlan Pro (CPP) backs up to the Drobo Pro. Our second backup server for CPP is wedged because it is out of disk space.

So, at this point:

  • Time Machine is offline because the destination volume can't be mounted.
  • CPP's repository might be lost due to a failure in the Drobo Pro.
  • We don't use BRU to back up much more than the servers.

Now I start to sweat a little.

 

Fortunately, the Drobo Pro, which was configured for dual-disk (RAID 6 style) redundancy, announces that it is rebuilding. Some time later, it announces that it is done. No data has been lost on the Drobo. Whew! One of the Western Digital 2 TB Green drives had failed (this is no surprise). There's still the matter of the full startup volume. My guess is that this is virtual memory use from the RAID rebuild. I'd rather the rebuild finish (still estimating 170+ days on Thursday), so I leave it until I return.

When the trip ends, I get back to the office to assess. The Time Machine volume still won't mount and Disk Warrior is unable to repair it on the server. The Drobo Pro is fine as is all its data, but the WD drive needs to be replaced. The culprit on the full startup disk was BRU. The external eSATA drive caddy containing the BRU target disk had failed. BRU was happily backing up to the same path, which now pointed to the startup disk rather than to the external volume. BRU helpfully recreated the missing path components.
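The BRU failure mode above (a backup job writing to a path that silently falls back to the startup disk when the external volume is missing) can be guarded against with a check before the job runs. The following is a minimal sketch, not anything BRU itself provides. The function name and the `/Volumes/BRU_Backups` path are hypothetical, and it assumes that comparing filesystem device IDs is enough to tell the external volume apart from the startup disk.

```python
import os


def on_separate_volume(target, startup="/"):
    """Return True only if `target` exists and lives on a different
    filesystem than `startup` (the startup disk by default)."""
    # A missing directory usually means the volume is not mounted.
    # Do not create it: that is how backups end up on the startup disk.
    if not os.path.isdir(target):
        return False
    # Each mounted filesystem reports its own device ID. If the IDs match,
    # `target` is just a folder on the startup disk.
    return os.stat(target).st_dev != os.stat(startup).st_dev


# Hypothetical usage in a wrapper script that launches the backup job:
#
#     if not on_separate_volume("/Volumes/BRU_Backups"):
#         raise SystemExit("Backup volume not mounted; refusing to run.")
```

The key design choice is that the wrapper refuses to run rather than recreating missing path components, so a dead enclosure produces a failed backup and an alert instead of a full startup disk.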

I decide to replace the Time Machine volume with a larger eSATA disk and worry later about whether it was a hardware or filesystem failure. Having figured out that the BRU disk was offline because of the enclosure, I replace that, too, folding in the backups that had been placed on the startup disk and freeing up that space (via a copy to the Drobo Pro, which has plenty of capacity at the moment).

Once I had determined that CPP and the Drobo Pro had our important data protected, the decision regarding the Time Machine volume was easy. Time Machine is mostly about Rapid Return to Service for us. CPP backs up every 15 minutes and does so whenever we have Internet access. Time Machine runs every hour, but only when we are in the office. Time Machine provides a way for us to get back up and running faster. CrashPlan Pro is what we rely on to protect our data.

Later, I was able to clean the Time Machine volume using Disk Warrior on another computer. However, the sparse bundle disk images used by Time Machine were hopelessly munged. That's fine. Technically, we lost some data, but nothing that was essential. We lost some operating system history, but in practice it is very unlikely that we would want to restore old OS files. Everything is back under Time Machine, CrashPlan Pro, and BRU protection again. The Drobo Pro has rebuilt and the failed drive has been replaced.

I haven't mentioned that during this same week-or-so window, two clients also had corrupted filesystems, and my notebook computer's filesystem went corrupt as well. One of those clients had more than 1 TB of data on the volume that was not backed up. Fortunately, Disk Warrior was able to repair their filesystem. Apparently there was an ion storm or something going on.

The moral of the story here is "layered protection." If Time Machine had been our only backup, I would have been unprotected for the duration of my trip and would have lost backup history. If the Drobo had failed to protect our data, CrashPlan Pro would have been useless. Any one of these events could have resulted in significant data loss, but by having layered protection, we did not lose anything important. Losing a drive, a drive enclosure, and corrupting two filesystems while traveling doesn't have to be catastrophic.