Multiple times in my career I’ve come across mdadm RAID sets (RAID 1+0, 5, 6, etc.) in various environments (e.g. CentOS/Debian boxes, Synology/QNAP NASes) that appear simply unable to handle a failing disk. By that I mean a disk that isn’t totally dead, but has tens of thousands of bad sectors and struggles to complete I/O; it’s still sort of working. The kernel log is typically full of UNC errors.
Sometimes SMART will identify the disk as failing; other times there are no symptoms beyond slow I/O.
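For context, this is roughly how I end up diagnosing it when a box starts misbehaving (device names here are just placeholders, not a specific setup):

```
# Look for unrecovered read errors (UNC) from the ATA layer in the kernel log
dmesg | grep -i unc

# Check the array status for degraded or failed members
cat /proc/mdstat

# Inspect SMART health and attributes on the suspect disk
# (Reallocated_Sector_Ct, Current_Pending_Sector, etc.)
smartctl -a /dev/sdb
```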
The slow I/O causes the entire system to freeze up: connecting via ssh takes forever, running commands over ssh takes forever, and the web GUI (if it’s a NAS) usually stops responding. That lasts until I disconnect the disk or purposely “fail” it out of the array, at which point things go back to “normal”, or at least as normal as they can be with a degraded array.
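To be concrete, by “fail the disk out of the array” I mean something along these lines (again, the array and member names are placeholders):

```
# Mark the flaky member as failed so md stops sending I/O to it...
mdadm /dev/md0 --fail /dev/sdb1

# ...then remove it from the array entirely
mdadm /dev/md0 --remove /dev/sdb1

# Confirm the array is now running degraded
cat /proc/mdstat
```

The moment the failed member is out, the machine becomes responsive again.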
I’m just wondering: if a disk is taking so long to read from or write to, why not just knock it out of the array, drop a message in the log, and keep going? Letting the whole system grind to a halt because one disk is flaky seems to nullify one of the main benefits of RAID (fault tolerance, i.e. the ability to keep running when a disk fails). I can understand that in a single-disk scenario (e.g. your system has a single SATA disk that can no longer execute reads/writes properly) this is catastrophic, but in a RAID set, especially the fault-tolerant “personalities”, it seems not only annoying but also contrary to common sense.
Is there a very good reason the default behavior of mdadm is to basically cripple the box until someone remotes in and fixes it manually?