Without changing anything related to grub, the kernel, or dracut, my system no longer boots on its own after I added another disk to an existing bootable single-disk md RAID (Linux RAID) configuration. The boot log looks like this:
...
Autoassembling MD Raid
...
No root device, sleeping forever
Background:
My system has been running for several years with two md RAID (Linux RAID) partitions on a single drive: one for the /boot filesystem and one for an encrypted LVM volume where everything else lives, including the root filesystem.
When I originally built the RAID partitions, I only had one HDD. I anticipated getting another one in the future, so I set up two MD partitions as single-drive RAID1 devices.
Well, I finally got another disk and added it. This process couldn't have gone more smoothly. I added the drive to my computer, partitioned it into two MD partitions, and added the appropriate partition to the corresponding existing RAID device. Immediately the RAID devices started synchronizing to the new drive. Both sync operations completed successfully and the devices were "clean."
The next time I rebooted, I got the error shown above. dracut couldn't find the root partition.
*Okay, here's the part I don't understand*
When I boot with kernel option "rdshell" I can easily assemble both md devices by hand:
mdadm --assemble /dev/md0 <dev1> <dev2>
mdadm --assemble /dev/md1 <dev1> <dev2>
And that's all it takes. I just exit the dracut shell, and the system boots just fine. It prompts me to unlock the encrypted volume and boots up.
So, why can't dracut assemble the RAID devices by itself? Here are some ideas.
1) outdated mdadm.conf: the mdadm.conf in the old initramfs image still shows num-devices=1
2) metadata version 0.9: some people suggest there is an ambiguity regarding the superblock with v0.9, whereas version 1.0 does not have this problem.
3) mdadm.conf: something else is wrong in mdadm.conf.
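For idea 1, here is a rough way to see what device count a given mdadm.conf claims for each array. This is only a sketch: conf_num_devices is a made-up helper name, and it assumes the ARRAY lines use the usual "ARRAY <device> key=value ..." layout. The copy of mdadm.conf inside the initramfs would have to be unpacked from the image first.

```shell
# conf_num_devices FILE
# Hypothetical helper: prints "<array> <count>" for every ARRAY line in FILE
# that carries a num-devices= keyword. Lines without the keyword are skipped.
conf_num_devices() {
    awk '$1 == "ARRAY" {
        for (i = 3; i <= NF; i++)
            if ($i ~ /^num-devices=/) {
                sub(/^num-devices=/, "", $i)
                print $2, $i
            }
    }' "$1"
}
```

Running conf_num_devices /etc/mdadm.conf and comparing the result with the "Raid Devices" line of mdadm --detail /dev/md0 on the running system should show whether the config has fallen behind the array.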
I can't tell what dracut is doing when it says "Autoassembling MD Raid". It's obviously not doing it correctly. But, I don't know why. The UUID of the devices hasn't changed, the kernel boot parameters haven't changed. The only thing that has changed is the number of devices in the RAID device. But, I don't know why that would affect assembly. All the partitions are correctly labeled id=fd, Linux raid autodetect.
---------- Post added at 03:21 PM ---------- Previous post was at 03:19 PM ----------
Hmm, naturally, found this right after posting. This is *exactly* my problem.
http://forums.fedoraforum.org/showthread.php?t=255964
But, removing the redundant disk isn't really a good solution. I'm going to see if updating the mdadm.conf with the right number of devices will work.
---------- Post added at 04:04 PM ---------- Previous post was at 03:21 PM ----------
Yes, just setting 'num-devices=2' (it was =1, previously) in mdadm.conf *in the initramfs image* fixed everything.
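For anyone hitting the same thing, the edit amounts to something like the following. It is a sketch, not the exact steps taken here: set_num_devices is a name made up for illustration, and it rewrites every num-devices= it sees, so it only suits the case where all arrays have the same device count.

```shell
# set_num_devices COUNT
# Hypothetical filter: reads mdadm.conf text on stdin and writes it to stdout
# with every num-devices=<n> replaced by num-devices=COUNT.
set_num_devices() {
    sed -E "s/num-devices=[0-9]+/num-devices=$1/g"
}
```

For example, set_num_devices 2 < /etc/mdadm.conf > /tmp/mdadm.conf.new writes a corrected copy that can be reviewed before it replaces the original. Regenerating the image with dracut --force should then pick up the updated file, assuming the image is built from /etc/mdadm.conf.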
But, from what I can tell, this option isn't even required. I see recommendations to use the output of 'mdadm --examine --scan' for your mdadm.conf. That doesn't seem exactly right, but maybe for the ARRAY lines, at least.
I'm going to simply remove the num-devices= option and see what happens, but it might be a while before I reboot again. I'll try to remember to update this post with the results.
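A sketch of what dropping the option could look like when generating ARRAY lines from the scan output. strip_num_devices is a hypothetical name, and whether assembly at boot is happy without the keyword is exactly the part that has not been tested yet.

```shell
# strip_num_devices
# Hypothetical filter: removes any num-devices=<n> keyword (and the whitespace
# before it) from the lines on stdin, leaving everything else untouched.
strip_num_devices() {
    sed -E 's/[[:space:]]+num-devices=[0-9]+//g'
}
```

Used as mdadm --examine --scan | strip_num_devices, this prints ARRAY lines that identify each array by UUID without pinning the number of member devices. If the scan output on a given mdadm version has no num-devices= keyword in the first place, the filter passes the lines through unchanged.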