It's never a good sign when fsck leaves 55 Gbytes worth of files in /lost+found
I guess it doesn't help to have the file system on RAID when the logical file system structures break. Seems mostly to be backups that are affected, though; those should repair themselves on the next backup run. That, and a couple of Git repositories that lost some objects and needed to be recloned from checkouts/backups.
The file system corruption runs deeper. After a couple of hours, the system bailed with
EXT4-fs error (device md0): ext4_lookup:1856: inode #120979457: comm updatedb.plocat: iget: bad extra_isize 9234 (inode size 256)
Aborting journal on device md0-8.
EXT4-fs error (device md0): ext4_journal_check_start:83: comm rs:main Q:Reg: Detected aborted journal
EXT4-fs error (device md0): ext4_journal_check_start:83: comm systemd-journal: Detected aborted journal
EXT4-fs (md0): Remounting filesystem read-only
which most definitely does not look good :-/
Turns out I took the wrong memory chip out. Now I have a combination of chips (including some older, smaller ones) that earned me a "PASS" banner.
I am still going to rebuild the file system, since I'm afraid there are deeper corruptions, but that will have to wait a little while. I've bought a new disk to replace the one that has been running nonstop since 2016 (the other one in the RAID is from 2019, so practically still brand-new).
Turns out it's the 2019 drive that has the uncorrectable errors, while the 2016 drive is fine. Looks like I need to replace it sooner rather than later.
The 6+ hours of overnight memtest gave me a PASS on the new memory configuration, so now I am *copying* the files over to a new hard disk (configured as a single-disk RAID1, to be extended with the disk I'm copying *from* afterwards).
Copying 1.5 Tbyte of files does take a while (5 hours and counting so far), but since I don't trust the metadata of the old filesystem, I am not going to risk just having Linux RAID mirror it over.
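For the record, the degraded-array trick looks roughly like this; the device names below are placeholders, not my actual ones:

    # Create the new array with the second member "missing", so it runs degraded for now
    # (assuming the new disk's partition is /dev/sdc1 and the new array becomes /dev/md1)
    mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc1 missing
    mkfs.ext4 /dev/md1
    mount /dev/md1 /mnt/newfs
    # Copy the data through the filesystem layer instead of mirroring blocks,
    # so the suspect metadata of the old fs isn't carried over
    rsync -aHAX /mnt/oldfs/ /mnt/newfs/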
I need to clean up the filesystem; it is heavy with backups of backups of old root filesystem copies, which in turn held backups of old machines with backups on them...
The file copying finished eventually. The tricky part was switching to booting from the new RAID instead of the old one, and getting grub to load the correct kernel and pass it the correct root. After a few rounds of booting from a USB image (#Ventoy ftw) and updating grub inside a chroot, it eventually worked.
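The chroot dance went roughly like this, from memory (the mount points and the array name are placeholders):

    # Booted the USB stick, then:
    mount /dev/md1 /mnt
    mount --bind /dev /mnt/dev
    mount --bind /proc /mnt/proc
    mount --bind /sys /mnt/sys
    chroot /mnt
    update-grub    # regenerate /boot/grub/grub.cfg with the new root device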
Then I zeroed out the old RAID device and ran mdadm with --grow and --raid-devices=2 to sync the new filesystem back onto the old drive.
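In commands, approximately (again with placeholder names: old array /dev/md0 on /dev/sdb1, new array /dev/md1):

    mdadm --stop /dev/md0                     # retire the old array
    mdadm --zero-superblock /dev/sdb1         # wipe its RAID metadata
    mdadm /dev/md1 --add /dev/sdb1            # hand the old disk to the new array
    mdadm --grow /dev/md1 --raid-devices=2    # and let it resync onto it
    cat /proc/mdstat                          # watch the rebuild crawl along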
Took all day, and syncing those 2 terabytes takes a while longer still.
Let's hope this works out now...
One power outage later, I've learned that #update-grub isn't enough when booting in #EFI mode and changing the partition that holds the grub menu configuration. One also needs to run #grub-install to update the grub.cfg stub on the EFI partition so that it points to the real grub.cfg.
https://wiki.debian.org/GrubEFIReinstall
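For my own future reference, the missing step is something along these lines, run from inside the chroot with the EFI system partition mounted at /boot/efi (the bootloader id is my assumption for a Debian-ish system):

    grub-install --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=debian
    update-grub    # then regenerate the menu as usual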
And of course, the power outage had to happen just as I had left for Germany and Spiel. Good thing the rest of the family wasn't home either...
And some folks simply use the old, tried and trusted ZFS with a (license-)compatible OS.
A little history:
https://klarasystems.com/articles/history-of-zfs-part-1-the-birth-of-zfs/
…Since nobody asked.
Ext3 is what I used most on Linux because it was there by default. Like now I run with what OpenBSD comes with by default: the very simple, even older and more trusted FFS (Fast File System).
@Nihils Yes, defaults matter (ext4 now more than ext3). RHEL defaults to xfs, for some reason (and since it has worked at $DAYJOB, we kept xfs after switching to Ubuntu, even if the installer touts ext4).
Back to my failing hard disks: I have now realized that doing backups with rsync --delete isn't that good when the source is a slowly degrading file system. Seems I lost some digital photos from 10+ years ago.
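Note to self: something along these lines would at least have kept the clobbered files around instead of silently propagating the rot (the paths are made up):

    # Anything rsync would delete or overwrite on the backup side gets moved
    # into a dated side directory instead of being thrown away for good
    rsync -a --delete \
          --backup --backup-dir=/backup/attic/$(date +%F) \
          /srv/photos/ /backup/photos/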