It was a beautiful Friday afternoon and the start of the long weekend was only hours away. The boys in the lab all had plans of one sort or another, but then Bob walked in, and RAID away I knew someone's plans would be changing.
Bob Giannoulis, the head of our Client Services, had just spent 20 minutes discussing a huge data loss situation with a potential customer. The customer, one of North America's largest providers of programmable digital mobile content, had lost a critical server. The live content, stored on and constantly refreshed from a big old server, sat in a huge data centre just a few kilometres from our lab. Their product was used by everyone from government and recreational complexes to big businesses and even Vegas casinos.
Unfortunately, most of these clients had performance contracts that penalized any delay or loss of access to online content. Consequently, our client was under extreme stress as several of his larger clients were threatening to terminate their contracts, pursue legal action, or both.
The client was running a RAID 5 setup with 14 hard drives. On Wednesday evening, during a thunderstorm, the RAID server went down momentarily with a single failed hard drive, but thanks to the redundancy of RAID 5 the server recovered and continued to operate, albeit in a degraded and much slower state. An automatic email message informed our client's IT people about the hard drive failure and the fact that the spare drive was also down. Within an hour or two their tech arrived to replace the defective hard drive and start the 7-8 hour RAID rebuilding process.
About an hour into the rebuild, things started to go bad! A second hard drive began to physically crash and within minutes the rebuild aborted mid-process. The IT technician, already driving home, gets a second email message concerning the catastrophic failure of the RAID server and quickly returns to the data centre. The server is now totally down and none of their billboards can “call home”. No worries, he’ll just swap out the failed drive and restart the rebuild process. But the rebuild won’t start, likely because there are now two disk drives down. More hands are called in, and with sparse options they decide to restore a backup of the RAID server to a new set of drives and use one of the drives from that backup set to rebuild the downed RAID array. As Friday morning arrives, the IT team’s latest attempts are deemed a failure and the pressure to get the server up is not letting up.
I think it’s worth noting that RAID of any type is NOT a replacement for a good backup. Yes, the R in RAID stands for Redundant, but the redundancy is very delicate and often short lived. You see, the redundancy in RAID 5 is based on being able to replicate the data from ONE failed hard drive, but once one drive fails, the odds greatly increase that a second drive will fail shortly afterward. It makes sense when you consider that most RAID arrays are built with identical drives, bought on the same day from the same supplier. They are semi-clones of each other, so as a critical component begins to fail on one drive, it’s just as likely to begin failing on another. Add in the extra stress of rebuilding the complete array and it’s pretty clear that disaster is lurking just around the corner.
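If you like seeing things in code, here’s a tiny sketch of why that single-drive limit exists. It’s a toy example, not anything from our lab: RAID 5 parity is just the XOR of the data blocks in each stripe, so one missing block can always be rebuilt from the rest, but a second missing block in the same stripe leaves nothing to rebuild from.

```python
# Toy illustration of RAID 5 parity (hypothetical example, not our lab tooling).
# The parity block is the XOR of the data blocks in each stripe, so any ONE
# missing block can be rebuilt by XOR-ing together everything that survives.

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

# One stripe spread across four drives: three data blocks plus one parity block.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks([d0, d1, d2])        # written to the fourth drive

# One drive (holding d1) fails: XOR the survivors with the parity to rebuild it.
rebuilt = xor_blocks([d0, d2, parity])
assert rebuilt == d1                     # single failure: fully recoverable

# If a second drive in the same stripe fails, the XOR of what remains is d1 ^ d2,
# which cannot be split back into d1 and d2 -- that stripe's data is simply gone.
```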
This is about the spot where Memofix Data Recovery Services comes back into the picture … just after 4 pm on the Friday of the Canada Day long weekend, a panic-stricken gentleman and a young nerdy guy walk into our facility and hand-deliver their RAID server to us. They look tired, frazzled and beaten. This is our first challenge … to give them hope! We know that if the data they need is still there, and if they didn’t do too much to the RAID array, we should be able to get it back! The client’s nerdy young guy is their lead IT tech; he’s mentally quick and thankfully has a solid memory of each step taken in their own efforts to resurrect the RAID server. Our tech and the nerdy young guy are quickly sharing the details of the situation, and it’s not long before our confidence and positivity lift the spirits of our once saddened clients. We are given approval to do whatever we can to get their data back by Tuesday morning.
With no further ado, we are left alone in our lab to try and resurrect this fallen giant of a storage device. It consists of fourteen 300 GB SAS hard drives plus a few extra hard drives still to be sorted out. As always, we begin by creating duplicate images of all the working drives before turning our attention to the failed ones. Often when a RAID reports a hard drive as down or offline it’s not actually dead or defective, so we first diagnose the two drives to see if either is accessible. We are hoping that if we can get just one of the two drives working we’ll be able to rebuild the RAID array.
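For anyone wondering what “imaging” means in practice: it’s a sector-by-sector, read-only copy of each drive into an image file, with unreadable regions logged and skipped rather than pounded with retries. The sketch below is only an illustration of that principle, with placeholder paths; the real work is done with dedicated imaging hardware and tools.

```python
# Simplified, hypothetical sketch of sector-by-sector drive imaging.
# Real recovery work relies on dedicated imaging hardware and tools; this only
# shows the principle: read the source drive, never write to it, and log any
# unreadable regions instead of hammering them with retries.

CHUNK = 128 * 1024  # read in 128 KiB chunks

def image_drive(source_path, image_path, log_path):
    bad_ranges = []
    with open(source_path, "rb", buffering=0) as src, open(image_path, "wb") as dst:
        offset = 0
        while True:
            try:
                data = src.read(CHUNK)
            except OSError:
                # Unreadable region: pad with zeros, note it, and skip ahead.
                bad_ranges.append((offset, offset + CHUNK))
                data = b"\x00" * CHUNK
                src.seek(offset + CHUNK)
            if not data:
                break
            dst.write(data)
            offset += len(data)
    with open(log_path, "w") as log:
        for start, end in bad_ranges:
            log.write(f"unreadable bytes {start}-{end}\n")

# Example call with placeholder paths:
# image_drive("/dev/sdb", "drive02.img", "drive02_bad_sectors.log")
```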
Briefly applying power to the first drive produces a horrible screeching noise that we’ve heard way too often. This drive is likely badly crashed, and a quick inspection under class 100 cleanroom conditions confirms it. There is a series of deep rings across the top disk surface and most likely similar damage below. This drive will not be recoverable. Yikes!
Diagnosing the second drive, we are relieved to find it in a functional state with no painful noises to be heard. The drive’s built-in SMART monitoring system has flagged problems, but the drive is still accessible. We quickly determine that the SMART errors are caused by one of the heads being defective and unable to read. Inspecting the hard drive’s three disks under class 100 cleanroom conditions shows no visible or apparent disk damage, and removing the heads and examining them under a microscope shows no damage either. This is good.
Checking our database of parts drives shows we have three of this exact model to use for parts, and it isn’t long before our best cleanroom technician has the drive happily, if slowly, imaging along. It’s almost 8 pm now and the imaging will likely take 5-6 hours. We decide to use this time to fuel up on some food and get some rest. The plan is for our best RAID file system specialist to return to the lab on Saturday morning and, assuming the imaging is complete, continue with the recovery.
Our tech arrives at 8:30 am and, as expected, all the imaging has completed. There are a few read errors on the drive that required new heads, but otherwise everything looks good. Now the task is to determine the original RAID configuration: the proper drive order, the stripe size and the parity scheme. Sometimes the configuration is easy to determine, but due to the repeated attempts to self-fix the array we have our work cut out for us. By 1 pm we are fairly certain we have the correct parameters and we attempt to mount the RAID array virtually. But something is not right, and we spend another four hours before we successfully mount the RAID 5 array virtually.
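For the technically curious, “mounting the array virtually” means reassembling the stripes from the drive images entirely in software, without ever touching the original drives. The sketch below shows the bare idea for one common layout (left-symmetric parity rotation), with made-up parameters and file names; in practice you end up testing many candidate drive orders, stripe sizes and layouts, and, as in this job, filling in the one unrecoverable drive from parity as in the XOR example earlier.

```python
# Very reduced sketch of "virtually" reassembling a RAID 5 set from drive
# images, assuming a left-symmetric parity rotation. All names and parameters
# here are placeholders; a real case means testing many candidate drive
# orders, stripe sizes and layouts.
import os

STRIPE = 64 * 1024                                   # candidate stripe size
IMAGES = [f"drive{i:02d}.img" for i in range(14)]    # candidate drive order

def assemble(images, stripe, out_path):
    n = len(images)
    files = [open(path, "rb") for path in images]
    rows = min(os.path.getsize(path) for path in images) // stripe
    with open(out_path, "wb") as out:
        for row in range(rows):
            parity_drive = (n - 1 - (row % n)) % n   # parity rotates backward
            for d in range(n - 1):                   # n-1 data chunks per row
                drive = (parity_drive + 1 + d) % n   # data starts after parity
                files[drive].seek(row * stripe)
                out.write(files[drive].read(stripe))
    for f in files:
        f.close()

# assemble(IMAGES, STRIPE, "virtual_volume.img")
```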
With the RAID 5 array now properly set up, we can mount the client’s DATA volume. But nothing is straightforward here, and our attempts to copy off the data using the DATA volume’s own NTFS file system prove futile, especially for many of the identified critical files. Our tech has spent the better part of the day working on the world’s biggest jigsaw puzzle and he’s getting mentally tired, so we decide to call it quits for the day.
Sunday morning arrives and our RAID specialist digs into the new problem. There appears to have been some overwriting or corruption of the volume’s MFT, or Master File Table, which acts as an index pointing to the location of every file. Several more hours are spent searching through the RAID array for matching fragments of the MFT that we can use to repair the damaged areas. We are successful in finding the final missing remnants buried deep within some temporary system files. The MFT is repaired, the DATA volume mounted, and the copying of the data begins. Our tech now heads home but continues to monitor the case remotely. The copy completes during the holiday Monday and he tests a large sample of the data to ensure it loads error free. We can now declare this a success, and a quick call to the client has them meeting our tech at our lab to pick up the data. If smiles were dollars we would be rich!
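A little background for the curious: MFT records are small fixed-size structures, typically 1,024 bytes, that start with the ASCII signature “FILE”, which makes it practical to scan the raw volume image for stray copies and fragments of the table. The sketch below shows only that signature scan, with a placeholder file name; the actual repair also means validating the records and merging the good fragments back into place.

```python
# Hypothetical sketch of hunting for MFT record fragments in a raw volume
# image. NTFS MFT records are typically 1024 bytes and begin with the ASCII
# signature "FILE" ("BAAD" marks records NTFS itself flagged as damaged).
# The file name is a placeholder; real repair work also validates the records
# and stitches the usable ones back into a consistent table.

RECORD_SIZE = 1024

def find_mft_records(image_path):
    hits = []
    with open(image_path, "rb") as img:
        offset = 0
        while True:
            record = img.read(RECORD_SIZE)
            if len(record) < 4:
                break
            signature = record[:4]
            if signature in (b"FILE", b"BAAD"):
                hits.append((offset, signature.decode("ascii")))
            offset += RECORD_SIZE
    return hits

# for offset, sig in find_mft_records("virtual_volume.img"):
#     print(f"{sig} record at byte offset {offset}")
```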
As a service provider of last resort, we tend to see the worst scenarios, and this was no different. Nobody likes to call in outside help, especially if your reputation is built on data security and you have a potential data loss situation on your hands. But if you find yourself in a similar spot, do the right thing and bring in the professionals. At the very least we will stop any further damage by capturing protected images of all the hard drives before any more modifications are made.
Memofix provides users with a 24/7/365 cell number for contacting us after hours or on weekends. If disaster should strike at an inopportune time, please call Andrew at (416) 459-1331 to arrange an emergency intervention.
* note: the actual business description has been altered to protect our client’s identity