Please approach this question with a sense of humor, and don't just vote against it because it is a bad idea. Sometimes (very rarely!) a user is totally fine with data loss and just needs help loading his gun! After all, ZFS offers benefits that go beyond data integrity, and I would rather use it than ext4 on my faulty drives.
If you're the type of sysadmin who reads this with a wry smile, remembering the moment you lost data by doing exactly this, this question is for you.
I am running a pool on some USB drives on a non-critical server with non-critical data, and I do not care if it gets corrupted. I am trying to configure it so that ZFS does not forcibly remove USB drives when they hit checksum errors (just like ext4 or FAT handle this scenario, by not noticing or caring about data loss).
For readers who arrive here through Google trying to fix their ZFS pool: do not try anything described in this question or its answers, you will lose your data.
Because the ZFS police love to shout at people who use USB drives or have any other non-standard configuration: for the sake of this discussion, assume the data is cat videos that I have backed up in 32 other physically remote places on 128 redundant SSDs. I fully acknowledge that I will lose 100% of my data on this pool, unrecoverably (many times), if I try to do this.
I address this question to people who are curious about just how bad an environment ZFS is capable of running in (people who like to push systems to their breaking points and beyond).
So here is the configuration:
- HP EliteDesk server running FreeNAS-11.2-U5
- 2x WD Elements 8TB drives connected via USB 3.0
- an unreliable power environment: the server and the drives are often reset / disconnected without warning. (Yes, I have a UPS; no, I do not want to use it. I want to break this server, didn't you read the disclaimer?)
- a mirror pool hdd with the two drives (with failmode=continue set)
- one drive is stable: even after multiple reboots and forced disconnections, it never seems to report checksum errors or any other problem in ZFS
- one drive is unreliable, with occasional checksum errors during normal operation (even when it is not unexpectedly disconnected). The errors appear to be unrelated to the bad power environment, as it will run fine for more than 10 hours and then suddenly get kicked out of the pool due to checksum failures
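For reference, the pool setup listed above could be reproduced from a shell roughly as follows. This is only a sketch and not the exact commands used: FreeNAS normally creates pools through its web UI, the device names da3 and da4 in the usage line are assumptions (only da4 shows up in the logs), and the ZPOOL variable is a hypothetical hook that allows a dry run with echo.

```shell
#!/bin/sh
# Sketch: create a two-way mirror and set failmode=continue on it.
ZPOOL="${ZPOOL:-zpool}"   # hypothetical override, e.g. ZPOOL=echo for a dry run

# setup_mirror POOL DISK_A DISK_B
setup_mirror() {
    pool="$1"; disk_a="$2"; disk_b="$3"
    # Two-way mirror across both USB drives.
    "$ZPOOL" create "$pool" mirror "$disk_a" "$disk_b" || return 1
    # failmode=continue: on catastrophic pool failure, return EIO to new
    # writes instead of blocking all I/O (the default is failmode=wait).
    "$ZPOOL" set failmode=continue "$pool" || return 1
    # Print the property so the setting can be confirmed.
    "$ZPOOL" get failmode "$pool"
}
```

A dry run such as `ZPOOL=echo setup_mirror hdd da3 da4` only prints the arguments that would be passed to zpool, which is a safe way to check the sequence before touching real disks.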
I have confirmed that the unreliable drive's errors are caused by a software or hardware problem with the USB bus on the server, not by a bad cable or a physical problem with the drive. I confirmed this by connecting it to my MacBook, whose USB ports are known to be good, zeroing the drive, then writing random data across the whole drive and verifying it (done 3 times, 100% success each time). The drive is almost new, with no SMART indicators below 100% health. However, even if the drive were failing gradually and losing a few bits here and there, I'm fine with that.
Here is the problem:
When the faulty drive has checksum errors, ZFS removes it from the pool. Unfortunately, FreeNAS does not let me re-add it to the pool without either physically rebooting, or unplugging and re-plugging both the USB cable and the drive's power supply. This means I cannot script the re-adding process or do it remotely without rebooting the whole server; I would have to be physically present to unplug things, or have an internet-connected Arduino and a relay wired to both cables.
I have already done a fair bit of research into whether this kind of thing is possible, and it has been difficult because every time I find a relevant thread, the data integrity police step in and convince the asker to abandon their unreliable setup instead of ignoring the errors or working around them. I am resorting to asking here because I have not been able to find documentation or other answers on how to achieve this.
- turning off checksums completely with zfs set checksum=off hdd. I have not done this yet because, ideally, I would like to keep checksumming so I know when the drive is misbehaving; I just want to ignore the failures
- a flag that keeps checksumming but ignores checksum errors / attempts to repair them without removing the drive from the pool
- a ZFS flag that raises the maximum number of checksum errors allowed before the drive is removed (currently, the drive gets booted after approximately 13 errors)
- a FreeBSD / FreeNAS command that lets me force the device back online after it was removed, without having to reboot the entire server
- a FreeBSD / FreeNAS kernel option to force this drive to never be removed
- a FreeBSD sysctl option that magically fixes the USB bus problem causing errors / timeouts on this drive only (unlikely)
- a ZFS on Linux option that does the same (I'd be willing to move these drives to my Ubuntu box if I knew it was possible there)
- running zpool clear hdd in a loop every 500 ms to clear checksum errors before they reach the threshold
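The zpool clear loop from the list above could look something like the sketch below. It only illustrates the mechanics; whether clearing the counters this often actually keeps the drive from being kicked out is the open question here, and has not been tested. The ZPOOL variable and the optional COUNT argument are hypothetical additions to make dry runs possible.

```shell
#!/bin/sh
# Sketch of the "zpool clear in a loop" idea.
ZPOOL="${ZPOOL:-zpool}"   # hypothetical override, e.g. ZPOOL=echo for a dry run

# clear_loop POOL INTERVAL [COUNT]
# Runs `zpool clear POOL` every INTERVAL seconds.
# COUNT=0 (the default) means loop forever.
clear_loop() {
    pool="$1"; interval="$2"; count="${3:-0}"
    i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        # Reset the pool's error counters; a failing call is ignored so
        # that a temporarily missing device does not stop the loop.
        "$ZPOOL" clear "$pool" || true
        i=$((i + 1))
        # Fractional seconds work with FreeBSD and GNU sleep, but are
        # not guaranteed by POSIX.
        sleep "$interval"
    done
}
```

With that in place, `clear_loop hdd 0.5` would match the 500 ms interval described above, and `ZPOOL=echo clear_loop hdd 0 3` prints three dry-run invocations and exits.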
I'm really trying to avoid having to fall back to ext4 or another file system that does not forcibly remove drives after USB errors, because I want all the other ZFS features such as snapshots, datasets, send / recv, etc. I am simply trying to disable the data integrity enforcement.
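For completeness, the bluntest option mentioned above (turning checksums off on the pool's root dataset) would be a sketch like this. It has not been run on this pool, and the ZFS variable is a hypothetical hook for dry runs.

```shell
#!/bin/sh
# Sketch: inspect and then disable the checksum property on a dataset.
ZFS="${ZFS:-zfs}"   # hypothetical override, e.g. ZFS=echo for a dry run

# disable_checksums DATASET
disable_checksums() {
    dataset="$1"
    # Show the current value first, so the change is easy to revert.
    "$ZFS" get checksum "$dataset" || return 1
    # Only blocks written after this change are affected; existing
    # blocks keep the checksum they were written with.
    "$ZFS" set checksum=off "$dataset"
}
```

Because the property only applies to newly written data, blocks already on the drives would still be verified on read, so this alone may not stop errors on existing data from being counted.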
This is the dmesg output every time the drive misbehaves and is removed:
Jul  7 04:10:35 freenas-lemon ZFS: vdev state changed, pool_guid=13427464797767151426 vdev_guid=11823196300981694957
Jul  7 04:10:35 freenas-lemon ugen0.8: at usbus0 (disconnected)
Jul  7 04:10:35 freenas-lemon umass4: at uhub2, port 20, addr 7 (disconnected)
Jul  7 04:10:35 freenas-lemon da4 at umass-sim4 bus 4 scbus7 target 0 lun 0
Jul  7 04:10:35 freenas-lemon da4: s/n 5641474A4D56574C detached
Jul  7 04:10:35 freenas-lemon (da4:umass-sim4:4:0:0): Periph destroyed
Jul  7 04:10:35 freenas-lemon umass4: detached
Jul  7 04:10:46 freenas-lemon usbd_req_re_enumerate: addr=9, set address failed! (USB_ERR_IOERROR, ignored)
Jul  7 04:10:52 freenas-lemon usbd_setup_device_desc: getting device descriptor at addr 9 failed, USB_ERR_TIMEOUT
Jul  7 04:10:52 freenas-lemon usbd_req_re_enumerate: addr=9, set address failed! (USB_ERR_IOERROR, ignored)
Jul  7 04:10:58 freenas-lemon usbd_setup_device_desc: getting device descriptor at addr 9 failed, USB_ERR_TIMEOUT
Jul  7 04:10:58 freenas-lemon usb_alloc_device: Failure selecting configuration index 0:USB_ERR_TIMEOUT, port 20, addr 9 (ignored)
Jul  7 04:10:58 freenas-lemon ugen0.8: at usbus0
Jul  7 04:10:58 freenas-lemon ugen0.8: at usbus0 (disconnected)
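To see how often this happens, the ZFS state-change lines in a log like the one above can be tallied per vdev GUID. The helper below is a hypothetical sketch that reads log lines on stdin; the function name is made up, and the pattern tolerates optional spaces around the equals sign.

```shell
#!/bin/sh
# Sketch: count log lines mentioning each vdev_guid.
# Reads log text on stdin, prints "COUNT GUID" per distinct GUID.
vdev_events() {
    # Pull out just the "vdev_guid=<digits>" fragments.
    grep -Eo 'vdev_guid *= *[0-9]+' |
        # Normalize away any spaces around "=".
        tr -d ' ' |
        # Group identical GUIDs and count them.
        sort |
        uniq -c |
        # Strip the "vdev_guid=" prefix, leaving count and GUID.
        awk '{ sub(/^vdev_guid=/, "", $2); print $1, $2 }'
}
```

It can be fed from `dmesg | vdev_events` or from a saved log file; the GUIDs it prints can then be matched against the pool's member devices.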