Wednesday, November 16, 2011

Recover from a missing kernel: The Problem

You read that right. A missing kernel. Although this sounds terminal, the fix was fortunately simple enough. If you are in a jam, the solution is here. But the journey to that point is a cautionary tale of "a little knowledge is a dangerous thing". This is a long post.
A novice sysadmin in a small company had problems with CentOS VMs on VMware ESX version 3. I had set it up for them a few years earlier and had maintained it for a while after that. I recommended that management send their sysadmin for training on VMware administration, even if it wasn't for certification. They agreed in principle but never did anything about it. Don't get me wrong. The guy was smart. He shadowed my work, understood what I was doing and knew to ask questions when he didn't. He was not formally trained but was experienced in administering services (e.g. Samba, printing), so I think it was a normal progression for him to take on work more closely related to installation and configuration.

The VMware config - Linux Kernel Dance
I had not heard from them for a while when I began getting calls about "network problems". A quick look and I figured out that the VMs running their DHCP and DNS servers had frozen up (if you are wondering, the Magic SysRq key, a.k.a. Ctrl-Alt-SysRq followed by R-E-I-S-U-B, which is "BUSIER" spelled backwards, works in the VMI console). Apparently the VMs were running out of resources, with the CPU hitting and staying at 100% utilization. There weren't many VMs on the server and, this being Linux, I knew I could cram in more than they were currently running. A closer look revealed that the cause was vmware-tools not being loaded. It wasn't being loaded because the sysadmin had updated the kernel but not reconfigured vmware-tools. This had been going on for some time, despite the message at bootup warning him about it.
I call this the Linux vmware-config dance. For reasons known only to VMware, Linux is a second-class citizen. Even though VMware ESX and ESXi and their flagship product vSphere run on a Linux kernel, Linux support comes second in everything. The all-powerful VMware Machine Interface (VMI) client is Windows only. Don't point to that pathetic web-based management system. On Linux, we can start and stop servers, but console access is broken or, at best, works sporadically. We can't even create a VM using it. It's better with VMware Server Free (previously GSX). Its web interface provides full access, but it requires support for an outdated, insecure SSL version, which has to be enabled by hand in Firefox's configuration.
The vmware-tools service gives the kernel optimized access to memory, disk and network. If it's not running, the VM can't do things like share memory with other VMs. Basically, it'll run slower and eat up more memory. And apparently, if you run it long enough, some resource gets gobbled up bit by bit without being properly released. The kicker is that since reconfiguring vmware-tools affects network access, you can't do it remotely via ssh. It must be done at the console, either through the web interface (VMware Server Free only) or through the VMI for the paid stuff.
VMware requires vmware-tools to be reconfigured every time there is a kernel update. The updated kernel needs to be loaded first, so a reboot is required. If you use a Red Hat or SUSE kernel, the matching vmware-tools modules will load fine. If they don't, the configuration step will recompile the modules, so you will need gcc, glibc and at least the kernel headers. Depending on the distro, you may need to install the kernel sources to get the headers. It's also good to restart the server after reconfiguring and reloading vmware-tools, to test whether there are knock-on effects on other modules. So to recap: restart the machine to load the new kernel, reconfigure vmware-tools, and restart again to cleanly load and test vmware-tools and the other modules. Now multiply that by the number of VMs you have and you are looking at a lot of time spent doing this.
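A quick way to tell whether a VM still owes you a round of the dance is to check whether the vmware-tools modules exist for the kernel that is actually running. The following is only a rough sketch in Python: the helper name `modules_missing` is made up for illustration, and the module names `vmxnet` and `vmmemctl` and the `/lib/modules` layout are assumptions that may differ with your vmware-tools version and distro.

```python
import os
import platform


def modules_missing(module_names, kernel_release=None, modules_root="/lib/modules"):
    """Return the module names that have no .ko file under the given kernel's module tree.

    Hypothetical helper: it only looks for files, it does not load anything.
    """
    release = kernel_release or platform.release()  # same string as `uname -r`
    tree = os.path.join(modules_root, release)
    found = set()
    # os.walk yields nothing if the tree does not exist, so every module is reported missing
    for _dirpath, _dirs, files in os.walk(tree):
        for name in files:
            for module in module_names:
                if name == module + ".ko" or name.startswith(module + ".ko."):
                    found.add(module)
    return [m for m in module_names if m not in found]


if __name__ == "__main__":
    # Assumed module names; adjust to whatever your vmware-tools version installs.
    missing = modules_missing(["vmxnet", "vmmemctl"])
    if missing:
        print("No modules for the running kernel:", ", ".join(missing))
        print("Time to reconfigure vmware-tools from the console.")
    else:
        print("vmware-tools modules are present for", platform.release())
```

Run on each VM after a kernel update and reboot, this at least tells you which machines still need reconfiguring before they start eating resources.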

So this young sysadmin had enabled the centosplus repo for some reason. As a result, the system was loading centosplus kernels instead of the base kernels. This required some modules to be recompiled when reconfiguring vmware-tools. That adds even more time, so the best thing to do is not to use centosplus kernels unless you really have to. To do this, you must install yum-priorities. Once it is installed, you can decide which repositories have preference. The sysadmin did not take this step.
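With yum-priorities, the preference is expressed as a `priority=N` line in each repo section, where a lower number wins. As a rough sketch of what that looks like, here is a small Python helper that adds those lines to the text of a .repo file. The function name `set_repo_priorities`, the sample repo text and the chosen numbers are assumptions for illustration; note also that `configparser` drops comments when it rewrites the file, so treat this as a sketch rather than a tool to run blindly on /etc/yum.repos.d.

```python
import configparser
import io

# Made-up sample, trimmed down from what a CentOS repo file might contain.
SAMPLE_REPO = """\
[base]
name=CentOS-$releasever - Base
enabled=1

[centosplus]
name=CentOS-$releasever - Plus
enabled=1
"""


def set_repo_priorities(repo_text, priorities):
    """Return repo_text with a priority= line set in each listed section.

    priorities maps repo id -> number; lower numbers take precedence.
    Repo ids that are not in the file are ignored.
    """
    parser = configparser.ConfigParser(interpolation=None)  # leave $releasever and % alone
    parser.read_string(repo_text)
    for repo_id, priority in priorities.items():
        if parser.has_section(repo_id):
            parser.set(repo_id, "priority", str(priority))
    out = io.StringIO()
    parser.write(out, space_around_delimiters=False)  # keep the key=value style
    return out.getvalue()


if __name__ == "__main__":
    # base beats centosplus, so packages that exist in both come from base
    print(set_repo_priorities(SAMPLE_REPO, {"base": 1, "centosplus": 2}))
```

With base ranked above centosplus, a package that exists in both repos, such as the kernel, should come from base, while packages found only in centosplus remain available.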

yum-priorities: The Ounce of Prevention
The update was huge because he hadn't done one for some time. When he ran it, he excluded the kernel and glibc because he didn't have enough time to reboot and reconfigure vmware-tools. Now, when a server hasn't been updated for some time, the best thing to do is to update rpm* and yum* first. Update the tools you are going to use for the update, and make sure they are up to scratch. He also ran with the --skip-broken option on, hoping to exclude packages that had problems.
So he ran the big update. The update completed and he rebooted to ensure the latest libraries and services were loaded. When he did, grub couldn't find the kernel image. The update must have tripped over some dependency and removed the centosplus kernel package in the process. It was thorough enough to remove the previous kernel entries in grub as well.
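The tools-first order can be sketched as two yum runs. This Python snippet only builds and prints the command lines, it does not run them. The function name `staged_update_commands` and its parameters are invented for illustration; `--exclude` and `--skip-broken` are the yum options corresponding to what the sysadmin did, shown here as optional switches and not as a recommendation.

```python
import shlex


def staged_update_commands(hold_back=(), skip_broken=False):
    """Build the two yum invocations: packaging tools first, then everything else.

    hold_back is a sequence of package globs to exclude from the second run.
    """
    tools_first = ["yum", "update", "rpm*", "yum*"]
    everything_else = ["yum", "update"]
    everything_else += ["--exclude=" + pattern for pattern in hold_back]
    if skip_broken:
        everything_else.append("--skip-broken")
    return [tools_first, everything_else]


if __name__ == "__main__":
    # Reproduces the sysadmin's exclusions on top of the tools-first order.
    for cmd in staged_update_commands(hold_back=("kernel*", "glibc*"), skip_broken=True):
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

Keeping the commands as lists means they can be handed to `subprocess.run` without a shell, so the globs reach yum unexpanded.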
So I got the call, basically "emergency, come now". My first instinct was to boot off a live CD. Of all things, the only live CD he had was an AVG Live CD. Good enough, because it would let me see what the damage was. I booted it up and mounted the boot partition. I scp-ed a kernel over from another server, but since he was in the middle of updating, the other servers didn't have a kernel that matched some of the libraries. Nonetheless, I rewrote the grub configuration to load that kernel, because I knew I didn't need to get everything going, just enough to run yum and get the latest kernel. It booted OK, but the kernel couldn't load the network interface module. Remember the vmware-config dance? The kernel module for the network interface was built specifically for one kernel version. The kernel I was loading was too different from the one used to compile the VM's network interface module.
Crap. 
