Tools: Ultimate Guide: How I Spent a Day Trying to Recover a Crashed OpenStack Environment — And What I Learned
The Problem
The Environment
Step 1 — Diagnosing the Problem
Step 2 — The Filesystem Corruption
Step 3 — Recovery Attempts in initramfs
Activating LVM Volumes
Running e2fsck with Backup Superblock
Extending the LVM Volume
Rewriting the Superblock
Creating Swap to Help with OOM
Step 4 — The OOM Problem
Step 5 — Attempting to Boot from Live ISO
Step 6 — UEFI Shell to the Rescue (Partially)
Step 7 — The Final Verdict
What Should Have Been Done Differently
Before the Incident
During the Incident
Tools You Need Available
Key Commands Reference
Conclusion

A real-world incident report for engineers dealing with filesystem corruption on production Linux servers

The Problem

It started with a simple complaint: our company's OpenStack Horizon portal was unreachable. The browser returned ERR_CONNECTION_TIMED_OUT. No warning, no gradual degradation, just gone.

The Environment

We had two physical HPE ProLiant DL380 Gen10 servers running the environment, accessible only via HP iLO 5 remote console. No physical access. No one near the data centre. Just me, a browser, and an iLO HTML5 console.

This is the story of what happened, what we tried, what failed, and what every engineer should know before they find themselves in the same situation.

Step 1 — Diagnosing the Problem

The first thing I noticed was that pinging the servers returned Destination host unreachable even on VPN. This ruled out a simple service crash: something was fundamentally wrong at the OS level.

Opening the iLO console for the controller node revealed the server was stuck in a BusyBox initramfs emergency shell with the following critical errors:

Lesson #1: Always check iLO/IPMI console first. The OS may be completely down while the management interface is still accessible.

Step 2 — The Filesystem Corruption

The root filesystem was on an LVM logical volume. The initramfs had tried to run an automatic fsck and failed. The errors pointed to:

The size mismatch error was particularly telling:

Lesson #2: A filesystem size larger than the physical device usually means the LVM volume was shrunk without first shrinking the filesystem, or the superblock was corrupted during an unclean shutdown.

Step 3 — Recovery Attempts in initramfs

The initramfs environment is extremely limited. Here is what we tried and the results:

Activating LVM Volumes

✅ This worked and activated all volume groups.

Running e2fsck with Backup Superblock

⚠️ This started working but kept getting killed by the OOM (Out of Memory) killer because initramfs has very limited RAM available for processes.

Extending the LVM Volume

✅ This successfully extended the volume to match what the filesystem expected.

Rewriting the Superblock

✅ The superblock was rewritten. e2fsck then started making real progress fixing inodes.
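The backup-superblock attempt above depends on knowing where ext4 keeps its spare superblock copies. As a rough sketch only, assuming the filesystem was created with the default sparse_super layout, 4 KiB blocks and 32768 blocks per group (none of which the incident notes confirm), the candidate block numbers can be estimated like this. The function names and the 500 GiB volume size are illustrative and do not come from the original incident.

```python
def backup_superblock_groups(group_count):
    """Block groups expected to hold a backup superblock under sparse_super.

    Group 0 holds the primary superblock; backups sit in group 1 and in
    every group whose number is a power of 3, 5 or 7.
    """
    groups = set()
    if group_count > 1:
        groups.add(1)
    for base in (3, 5, 7):
        group = base
        while group < group_count:
            groups.add(group)
            group *= base
    return sorted(groups)


def backup_superblock_blocks(total_blocks, blocks_per_group=32768, first_data_block=0):
    """Block numbers of the expected backup superblocks.

    Defaults assume 4 KiB blocks (32768 blocks per group, first data
    block 0). These are assumptions, not values read from the disk.
    """
    usable_blocks = total_blocks - first_data_block
    group_count = -(-usable_blocks // blocks_per_group)  # ceiling division
    return [
        group * blocks_per_group + first_data_block
        for group in backup_superblock_groups(group_count)
    ]


# Illustrative size only: a 500 GiB volume with 4 KiB blocks.
total_blocks = 500 * 1024**3 // 4096
candidates = backup_superblock_blocks(total_blocks)
print(candidates[:5])  # [32768, 98304, 163840, 229376, 294912]
```

Treat the output as a list of candidates to try, not a guarantee. On a machine with the full e2fsprogs tools, a dry run with `mke2fs -n` against the device (it writes nothing) should report the locations for that filesystem's real geometry, and one of those values is what gets passed to `e2fsck -b`.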
Creating Swap to Help with OOM

❌ swapon is not available in initramfs. This is a critical limitation.

Lesson #3: The initramfs environment is missing many essential tools including swapon, resize2fs, tune2fs, debugfs, and lvextend. Plan for this limitation before you need it.

Step 4 — The OOM Problem

Every time e2fsck got deep into repairing the large volume, the kernel OOM killer terminated it:

The server had significant RAM but initramfs was only making a small portion available for user processes. Without swap, e2fsck couldn't complete the repair.

Lesson #4: For large filesystems (500 GB+), e2fsck requires significant RAM. Always ensure swap is available before running fsck on large volumes. If you're in initramfs without swap, you need a different approach.

Step 5 — Attempting to Boot from Live ISO

We tried to boot Ubuntu 20.04 Live Server from an ISO mounted via iLO Virtual Media. This would have given us a full Ubuntu environment with all tools. The challenges we encountered:

Lesson #5: Test your iLO Virtual Media boot process BEFORE you need it in an emergency. Know whether your server's UEFI will boot from iLO virtual media and in what order.

Step 6 — UEFI Shell to the Rescue (Partially)

We discovered the HPE Embedded UEFI Shell under:
System Utilities → Embedded Applications → Embedded UEFI Shell

From there we could launch the GRUB bootloader directly:

This gave us access to the GRUB menu and boot parameter editing. We modified the boot parameters to skip fsck:

Unfortunately the filesystem was too corrupted to mount even with fsck skipped.

Lesson #6: The HPE Embedded UEFI Shell is a powerful recovery tool. Learn how to use it. It can launch bootloaders directly from the EFI partition without needing a working boot order.

Step 7 — The Final Verdict

After extensive repair attempts, the final error was:

Inode #2 is the root directory inode, the most critical inode in any ext4 filesystem. When it is damaged beyond what e2fsck can rebuild, the filesystem cannot be mounted without specialist data recovery tools.

Lesson #7: If inode #2 is corrupted beyond repair, you need either a backup restore or professional data recovery. In our case, no amount of e2fsck fixed the destroyed root inode.

Conclusion

Filesystem corruption at the inode level is one of the most serious failures a Linux system administrator can face. The key takeaways from this incident are:

If you find yourself in a similar situation, I hope this article saves you some of the hours I spent learning these lessons the hard way.

If this article helped you, please clap and share. If you have questions or have been through a similar experience, leave a comment below.

Tags: #Linux #OpenStack #SysAdmin #DevOps #DisasterRecovery #Ubuntu #LVM #Filesystem #HPE #iLO