Tools: Ultimate Guide: How I Spent a Day Trying to Recover a Crashed OpenStack Environment — And What I Learned

The Problem

The Environment

Step 1 — Diagnosing the Problem

Step 2 — The Filesystem Corruption

Step 3 — Recovery Attempts in initramfs

Activating LVM Volumes

Running e2fsck with Backup Superblock

Extending the LVM Volume

Rewriting the Superblock

Creating Swap to Help with OOM

Step 4 — The OOM Problem

Step 5 — Attempting to Boot from Live ISO

Step 6 — UEFI Shell to the Rescue (Partially)

Step 7 — The Final Verdict

What Should Have Been Done Differently

Before the Incident

During the Incident

Tools You Need Available

Key Commands Reference

Conclusion

A real-world incident report for engineers dealing with filesystem corruption on production Linux servers

The Problem

It started with a simple complaint: our company's OpenStack Horizon portal was unreachable. The browser returned ERR_CONNECTION_TIMED_OUT. No warning, no gradual degradation, just gone.

We had two physical HPE ProLiant DL380 Gen10 servers running the environment, accessible only via the HP iLO 5 remote console. No physical access. No one near the data centre. Just me, a browser, and an iLO HTML5 console.

This is the story of what happened, what we tried, what failed, and what every engineer should know before they find themselves in the same situation.

Step 1: Diagnosing the Problem

The first thing I noticed was that pinging the servers returned "Destination host unreachable" even over the VPN. This ruled out a simple service crash: something was fundamentally wrong at the OS level.

Opening the iLO console for the controller node revealed that the server was stuck in a BusyBox initramfs emergency shell. The critical errors were "UNEXPECTED INCONSISTENCY: RUN fsck MANUALLY" and "The root filesystem requires a manual fsck".

Lesson #1: Always check the iLO/IPMI console first. The OS may be completely down while the management interface is still accessible.

Step 2: The Filesystem Corruption

The root filesystem was on an LVM logical volume. The initramfs had tried to run an automatic fsck and failed. The errors pointed to superblock corruption, journal corruption, and thousands of corrupted inodes.

The size mismatch error was particularly telling: the superblock recorded a filesystem of 288358400 blocks, while the device provided only 285474816 blocks.

Lesson #2: A filesystem size larger than the physical device usually means the LVM volume was shrunk without first shrinking the filesystem, or the superblock was corrupted during an unclean shutdown.

Step 3: Recovery Attempts in initramfs

The initramfs environment is extremely limited. Here is what we tried and the results:

Activating LVM Volumes

✅ Running vgchange -ay worked and activated all volume groups.

Running e2fsck with Backup Superblock

⚠️ Running e2fsck -y against the backup superblock at block 32768 started working but kept getting killed by the OOM (Out of Memory) killer, because initramfs has very limited RAM available for processes.

Extending the LVM Volume

✅ Running lvextend through the lvm wrapper with -l +100%FREE successfully extended the volume to cover the size the filesystem expected.

Rewriting the Superblock

✅ Running mke2fs -S rewrote the superblock. e2fsck then started making real progress fixing inodes. Note that mke2fs -S is a last-resort option: it rewrites the superblock and group descriptors, and if its options differ from those used when the filesystem was created it can corrupt the filesystem further.
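To put the Step 2 size mismatch in concrete terms, the two block counts from the e2fsck output can be converted to sizes. This is a minimal sketch, assuming 4096-byte blocks (the value passed to mke2fs -b in this article; the error output itself does not state the block size). The function name blocks_to_gib is hypothetical.

```shell
#!/bin/sh
# Convert a block count to whole GiB.
# Assumption: 4096-byte blocks unless a second argument says otherwise.
blocks_to_gib() {
    blocks=$1
    block_size=${2:-4096}
    echo $(( blocks * block_size / 1073741824 ))
}

fs_blocks=288358400    # "The filesystem size is ... blocks"
dev_blocks=285474816   # "The physical size of the device is ... blocks"
shortfall=$(( fs_blocks - dev_blocks ))

echo "filesystem expects: $(blocks_to_gib "$fs_blocks") GiB"
echo "device provides:    $(blocks_to_gib "$dev_blocks") GiB"
echo "shortfall:          $shortfall blocks ($(blocks_to_gib "$shortfall") GiB)"
```

Under that assumption the filesystem expected 1100 GiB on a 1089 GiB device, a shortfall of 2883584 blocks or 11 GiB. A gap that large and that round is more consistent with a resized volume than with a few flipped bits in the superblock, though the numbers alone cannot prove either cause.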
Creating Swap to Help with OOM

❌ We created a 4 GiB swap file with dd and mkswap, but swapon is not available in initramfs. This is a critical limitation. In any case, / in initramfs lives in RAM, so a swap file created there would have consumed the memory it was meant to supplement.

Lesson #3: The initramfs environment is missing many essential tools, including swapon, resize2fs, tune2fs, debugfs, and a standalone lvextend (LVM commands are only reachable through the lvm wrapper, as in lvm lvextend). Plan for this limitation before you need it.

Step 4: The OOM Problem

Every time e2fsck got deep into repairing the large volume, the kernel OOM killer terminated it with "Out of memory: Killed process (e2fsck)".

The server had significant RAM, but initramfs was only making a small portion available for user processes. Without swap, e2fsck couldn't complete the repair.

Lesson #4: For large filesystems (500GB+), e2fsck requires significant RAM. Always ensure swap is available before running fsck on large volumes. If you're in initramfs without swap, you need a different approach.

Step 5: Attempting to Boot from Live ISO

We tried to boot Ubuntu 20.04 Live Server from an ISO mounted via iLO Virtual Media. This would have given us a full Ubuntu environment with all tools. The attempt failed: the ISO could be attached, but the server's UEFI never offered the virtual CD-ROM as a boot device.

Lesson #5: Test your iLO Virtual Media boot process BEFORE you need it in an emergency. Know whether your server's UEFI will boot from iLO virtual media and in what order.

Step 6: UEFI Shell to the Rescue (Partially)

We discovered the HPE Embedded UEFI Shell under:

System Utilities → Embedded Applications → Embedded UEFI Shell

From there we could launch the GRUB bootloader directly by switching to fs0:, changing to EFI\ubuntu, and running shimx64.efi. This gave us access to the GRUB menu and boot parameter editing. We modified the boot parameters to skip fsck by adding fsck.mode=skip to the linux line.

Unfortunately the filesystem was too corrupted to mount even with fsck skipped.

Lesson #6: The HPE Embedded UEFI Shell is a powerful recovery tool. Learn how to use it. It can launch bootloaders directly from the EFI partition without needing a working boot order.

Step 7: The Final Verdict

After extensive repair attempts, the final error was "EXT4-fs error: inode #2: special inode unallocated", followed by "get root inode failed" and "mount failed".

Inode #2 is the root directory inode, the most critical inode in any ext4 filesystem. When it is destroyed, the filesystem cannot be mounted without specialist data recovery tools.

Lesson #7: If inode #2 is corrupted and e2fsck cannot complete, you will most likely need either a backup restore or professional data recovery. In our case, no amount of e2fsck fixed the destroyed root inode.

Conclusion

Filesystem corruption at the inode level is one of the most serious failures a Linux system administrator can face. The key takeaways from this incident cover backups, recovery tooling, the management interface, large filesystems, and documentation.

If you find yourself in a similar situation, I hope this article saves you some of the hours I spent learning these lessons the hard way.
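The memory requirement from Lesson #4 can be checked before e2fsck is started rather than discovered through the OOM killer. The following is a rough sketch, assuming a Linux /proc/meminfo and treating the roughly 8 GB figure suggested in this article as an 8 GiB threshold. The function names available_kb and enough_memory_for_fsck are hypothetical, and the threshold is a rule of thumb, not a guarantee that e2fsck will fit.

```shell
#!/bin/sh
# Sum MemAvailable and SwapFree (both reported in kB) from a meminfo-style file.
available_kb() {
    total=0
    while read -r key value _unit; do
        case $key in
            MemAvailable:|SwapFree:) total=$(( total + value )) ;;
        esac
    done < "${1:-/proc/meminfo}"
    echo "$total"
}

# Succeed only if at least min_gib GiB of RAM plus swap is available.
enough_memory_for_fsck() {
    min_gib=$1
    meminfo=${2:-/proc/meminfo}
    [ "$(available_kb "$meminfo")" -ge $(( min_gib * 1024 * 1024 )) ]
}

# Only run the live check where /proc/meminfo exists.
if [ -r /proc/meminfo ]; then
    if enough_memory_for_fsck 8; then
        echo "OK: at least 8 GiB of RAM plus swap available"
    else
        echo "WARNING: under 8 GiB available; e2fsck may be OOM-killed" >&2
    fi
fi
```

The sketch uses only shell built-ins so that it has a chance of running in a minimal shell, but whether a given initramfs exposes MemAvailable depends on the kernel, so treat a result of 0 as "unknown" rather than "no memory".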


Commands and console output, by step

The Problem (browser error):

    ERR_CONNECTION_TIMED_OUT

Step 1 (ping result and console errors):

    Destination host unreachable

    UNEXPECTED INCONSISTENCY: RUN fsck MANUALLY
    Failure: File system check of the root filesystem failed
    The root filesystem requires a manual fsck

Step 2 (size mismatch):

    The filesystem size is 288358400 blocks
    The physical size of the device is 285474816 blocks
    Either the superblock or the partition table is likely to be corrupt

Activating LVM Volumes:

    vgchange -ay

Running e2fsck with Backup Superblock:

    e2fsck -y -b 32768 /dev/mapper/<your-lv-name>

Extending the LVM Volume:

    lvm lvextend -l +100%FREE /dev/<vg-name>/<lv-name>

Rewriting the Superblock:

    mke2fs -S -b 4096 /dev/mapper/<your-lv-name>

Creating Swap to Help with OOM:

    dd if=/dev/zero of=/swapfile bs=1048576 count=4096
    mkswap /swapfile
    # swapon /swapfile: NOT AVAILABLE in initramfs

Step 4 (OOM killer message):

    Out of memory: Killed process (e2fsck)

Step 6 (launching GRUB from the UEFI Shell, then the edited kernel line):

    fs0:
    cd EFI\ubuntu
    shimx64.efi

    linux /vmlinuz-<version>-generic root=/dev/mapper/<lv-name> ro fsck.mode=skip

Step 7 (final error):

    EXT4-fs error: inode #2: special inode unallocated
    get root inode failed
    mount failed

Key Commands Reference

    # Activate LVM volumes from initramfs
    vgchange -ay

    # List mapper devices
    ls /dev/mapper/

    # Find backup superblocks
    dumpe2fs /dev/mapper/<device> | grep -i superblock

    # Run fsck with backup superblock
    e2fsck -y -b 32768 /dev/mapper/<device>

    # Extend LVM volume (using lvm wrapper in initramfs)
    lvm lvextend -l +100%FREE /dev/<vg-name>/<lv-name>

    # Rewrite superblock and group descriptors (last resort: the options must
    # match the original mkfs, otherwise the filesystem can be corrupted further)
    mke2fs -S -b 4096 /dev/mapper/<device>

    # Create swap file
    dd if=/dev/zero of=/swapfile bs=1048576 count=4096
    mkswap /swapfile
    swapon /swapfile  # (not available in initramfs)

    # Mount filesystem read-only
    mount -o ro /dev/mapper/<device> /mnt

    # Chroot into recovered system
    chroot /mnt /bin/bash

The Environment

- Controller Node: HPE ProLiant DL380 Gen10 (12-core)
- Compute Node: HPE ProLiant DL380 Gen10 (10-core)
- OS: Ubuntu 22.04 LTS
- Storage: LVM on top of hardware RAID (HPE Smart Array P408i-a)
- Access: HP iLO 5 remote console (HTML5)
- VPN: FortiClient VPN required to reach internal network

Step 2: what the errors pointed to

- Superblock corruption: the filesystem size recorded in the superblock was larger than the actual LVM volume
- Journal corruption: e2fsck could not set superblock flags
- Thousands of corrupted inodes: invalid flags, bad extended attributes, wrong inode sizes

Step 5: challenges booting from the live ISO

- iLO Virtual Media URL-based ISO streaming was too slow
- Local ISO file mounting via iLO HTML5 console worked better
- The ISO was detected as a Virtual CD-ROM by the kernel
- However, the server's UEFI boot order did not include the virtual CD-ROM
- The virtual CD-ROM did not appear in the UEFI one-time boot menu

What Should Have Been Done Differently

Before the Incident

- Regular backups: snapshots of the LVM volume or VM-level backups
- Monitoring: disk health monitoring (smartctl), filesystem error monitoring
- Documentation: record all credentials, architecture diagrams, and recovery procedures
- Test recovery: periodically test that backups can actually be restored
- Swap space: ensure servers have adequate swap configured

During the Incident

- Boot from live media first: don't spend hours in initramfs; boot from a live USB or ISO with full tools straight away
- Create swap immediately: before running e2fsck on large volumes, ensure swap is available
- Try another backup superblock: if 32768 doesn't work, try 98304 or 163840
- Document every command: keep a log of everything you try

Tools You Need Available

- A bootable Ubuntu Live USB drive (or ISO ready for iLO virtual media)
- resize2fs, tune2fs, debugfs: not available in initramfs
- swapon: not available in initramfs
- Adequate RAM (at least 8 GB free) for e2fsck on large volumes

Conclusion: key takeaways

- Backups are not optional: this entire incident would have been resolved in minutes with a good backup
- Know your recovery tools: understand the limitations of initramfs before you need it
- iLO/IPMI is your lifeline: invest time in learning your server's management interface
- Large filesystems need special care: e2fsck on a 1TB+ volume needs RAM, swap, and time
- Document everything: credentials, architecture, and recovery procedures must be documented and accessible
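The backup superblock numbers suggested under "During the Incident" (32768, 98304, 163840) are not arbitrary. With the sparse_super layout, ext4 keeps backups in block group 1 and in groups whose numbers are powers of 3, 5, and 7. The sketch below lists the candidates for a device of a given size, assuming 4096-byte blocks and therefore 32768 blocks per group. The function name backup_superblocks is hypothetical, and on a real system the output of dumpe2fs remains the authoritative source.

```shell
#!/bin/sh
# Print candidate backup superblock locations for an ext4 filesystem.
# Assumptions: sparse_super layout, 4096-byte blocks, 32768 blocks per group.
backup_superblocks() {
    total_blocks=$1
    blocks_per_group=32768
    group_count=$(( (total_blocks + blocks_per_group - 1) / blocks_per_group ))
    {
        # Group 1 always holds a backup if it exists.
        [ "$group_count" -gt 1 ] && echo 1
        # Remaining backups sit in groups that are powers of 3, 5, and 7.
        for base in 3 5 7; do
            g=$base
            while [ "$g" -lt "$group_count" ]; do
                echo "$g"
                g=$(( g * base ))
            done
        done
    } | sort -n | while read -r group; do
        echo $(( group * blocks_per_group ))
    done
}

# First five candidates for the 285474816-block device from this incident.
backup_superblocks 285474816 | head -n 5
```

For the device in this incident the first five candidates are 32768, 98304, 163840, 229376, and 294912, so there were further backups to try with e2fsck -b beyond the three listed above.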