
post Big VMware ESX Bug

August 14th, 2008 @ 8:34 am

Filed under: Servers, Virtualization

If you are not aware yet, a major bug has been revealed in ESX 3.5 and ESXi 3.5 Update 2.  Apparently, the beta was coded to expire on August 12, 2008, and this code was not removed from the actual release.  Details are available in the VMware Knowledge Base and This Topic on their forums.  You might also check out This Post on Matthew Marlow’s blog for more information.

On the morning of the 12th, I was greeted with several errors like this one in the logs for our ESX cluster:

VMware finally released the patch really late Tuesday night, which unfortunately kept me up most of the night getting our cluster patched.  It involved setting back the clock on all of the hosts so VMotion would work, manually migrating VMs off of a host, putting it into maintenance mode, applying the patch via the command line, then migrating the VMs back.

VMware is one of my favorite companies and is known for delivering rock-solid, enterprise-class products, so it really disappoints me that they would let something like this slip through the cracks.  Imagine how many hundreds of thousands (maybe millions?) of VMs this affected.  They do seem to be committed to fixing their mistake and making things right.  You can check out the Letter From Their CEO for more info.

post I Love VMware!

June 4th, 2008 @ 11:06 am

Filed under: Servers, Virtualization

Over the Memorial Day holiday, I had a “VMware Upgrade Party.” I’m not sure it was really a “Party” since I was the lone attendee, but I’ll call it one anyway. :-) I got all of our ESX servers upgraded to the latest build of 3.5, as well as VirtualCenter to the latest 2.5 build. I also added our fourth ESX server, which is the diskless, boot-from-SAN box I talked about a couple of weeks ago.

I was a little bit hesitant to put the diskless box in production since a bug in the QLogic HBA firmware required me to run their beta or “Limited Release” firmware in order to do Jumbo Frames. So far though, it has been rock solid.

Below, you can see our current VMware environment. I get more and more excited about this every day. I can now have a new machine online in less than 20 minutes without adding any physical hardware. Awesome! We currently have one stand-alone ESX server, jfbc-ecc-esx03, that runs our virtual desktops. The other three servers are in an HA cluster sporting a total of 32GHz of CPU resources and 52GB of RAM. I like it! I hope to be able to add VMotion and Distributed Resource Scheduling later this year so we can more effectively manage our host resources.

post Successful SAN and VMware Upgrades

May 28th, 2008 @ 12:04 am

While everyone was away for the holiday Monday, I took the opportunity to upgrade our SAN and ESX servers.  Everything went surprisingly well.

What was really impressive is how fast the EqualLogic SAN reboots.  The firmware upgrade was the first reboot since it was installed.  They claimed you could reboot it “live” without causing any problems with the servers, but I had never tested that theory until now.  I was sending it a ping every second during the entire process.  I dropped a total of 12 pings during the reboot and the servers never knew the storage had just rebooted.  Pretty impressive!  Check this out (I did it from home, hence the 12-15ms latency):

I also migrated all of our ESX servers from version 3.0.2 to 3.5.  For some reason, the HA agent had to be reconfigured on a couple of them, and the ESX firewall decided to block outbound iSCSI traffic on every box after the upgrade.  Other than that, the ESX upgrades went great!

Our first diskless ESX server is now online as well.  The QLogic HBA initially wouldn’t connect to our SAN using jumbo frames.  QLogic’s response was to send me their “Beta” or “Limited Release” firmware, which scares me a little.  I have several production VMs running on that host with no issues though.  I hope to do some benchmarks on VMware Server vs ESX with software iSCSI vs ESX with hardware iSCSI.  Stay tuned for details on that!

I love it when a project goes as planned!

post Runaway Clock in Virtual Linux Servers

May 18th, 2008 @ 2:45 pm

Filed under: Servers, Virtualization

If you run any Linux guests under VMware, you’ve probably had issues with the clock in the VM drifting or just totally running away.

The Linux clock works by counting timer interrupts. In older kernels, this was usually done at a rate of 100Hz, or 100 times per second. Beginning with the 2.6 kernel, the interrupt timer is now set at 1000Hz, so interrupts are counted 10 times as often.

Because VMware divides the host up into “time slots” for each guest OS, interrupts are often missed in the guest machines, depending on the system load. The more often the guest kernel counts interrupts, the more apparent these “missed” interrupts become, and the result is clock skew in the guest machine. VMware Tools has the ability to sync the guest clock with the host, but this only occurs once per minute, and it can only advance the clock; it can’t slow it down. Generally, the VMware Tools clock sync alone is not enough.

Here are the steps needed to keep the clock skew under control (these apply to VMware Server running on a Linux host - in my case, CentOS). The guest OS changes will also apply to ESX:

  • VMware Server needs to be told what clock speed the CPU(s) run at. This can be found by running “cat /proc/cpuinfo”, which will return all kinds of information about the CPUs, including the clock speed. You’ll need to edit /etc/vmware/config and add the following lines, where host.cpukHz is the host CPU speed in kHz (2.8GHz in my example below):

    host.cpukHz = 2800000
    host.noTSC = TRUE
    ptsc.noTSC = TRUE

  • VMware Tools needs to be installed in the guest OS. VMware provides instructions on how to install VMware Tools in a Linux guest here.
  • VMware Tools time synchronization needs to be enabled. This is done by editing the VMX file in the virtual machine directory and adding the following line:

    tools.syncTime = "TRUE"

    Note that the host should use NTP to sync to an outside time source, while NTP should be disabled in each guest.

  • Now, we need to lower the interrupt frequency in the guest kernel. Generally, this will require installing the kernel source, changing the CONFIG_HZ parameter to 100 (for a rate of 100Hz), and then recompiling the kernel. CentOS has made this easy for us by releasing a “VM Optimized” kernel for CentOS 5. Although perfectly stable, this kernel is presently in the “Testing” repository. Here’s how to install the VM kernel using yum on a CentOS 5 system. Add the “Testing” repo as follows:

    cd /etc/yum.repos.d
    wget http://dev.centos.org/centos/5/CentOS-Testing.repo

    Now, install the VM Optimized kernel:

    yum --enablerepo=c5-testing install kernel-vm kernel-vm-devel

  • Now, we need to make sure Grub is set to boot the new kernel, and also add the “clock=pit” parameter to the kernel boot options. We do that by editing /etc/grub.conf and making the following changes:

    default=0

    Where “0” is the first kernel listed. If the VM kernel is not the first item, you’ll need to adjust the value accordingly. For example, if it’s second in the list, you’d use “default=1”.

    Now, add the clock=pit parameter to the kernel boot options. That section of the grub.conf file will look something like this:

    title CentOS (2.6.18-53.1.19.el5)
            root (hd0,0)
            kernel /vmlinuz-2.6.18-53.1.19.el5 ro root=LABEL=/ clock=pit
            initrd /initrd-2.6.18-53.1.19.el5.img
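Two of the values in the steps above are easy to get wrong by hand: the host.cpukHz number and the index used for “default=”. Here is a minimal sketch of how they could be derived, assuming a Linux host whose /proc/cpuinfo has a “cpu MHz” line and a grub.conf with one “title” line per entry. The function names cpu_khz and grub_index are my own, not part of VMware or GRUB.

```shell
#!/bin/sh
# Hypothetical helpers; both read their input on stdin so they are easy to try out.

# Convert the first "cpu MHz" line of /proc/cpuinfo into whole kHz.
# Caveat: with CPU frequency scaling enabled, this line can show the current
# speed rather than the rated speed, so sanity-check the result.
cpu_khz() {
    awk -F: '/^cpu MHz/ { printf "%d\n", $2 * 1000 + 0.5; exit }'
}

# Print the zero-based position of the first grub.conf "title" line that
# contains the literal text given as $1. Prints nothing if there is no match.
grub_index() {
    awk -v pat="$1" '/^[ \t]*title[ \t]/ {
        if (index($0, pat)) { print n + 0; exit }
        n++
    }'
}

# Example usage (uncomment to run against the real files):
# echo "host.cpukHz = $(cpu_khz < /proc/cpuinfo)"
# echo "default=$(grub_index '2.6.18-53.1.19.el5' < /etc/grub.conf)"
```

The number printed by grub_index is what goes after “default=”; if it prints nothing, the text you passed doesn’t match any title line.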

Once all of the above changes are made, reboot the guest, and you should see significantly better clock performance. I had some VMs where the time would drift by hours, and after making these changes, they stay within a few seconds.
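After the reboot, it’s worth confirming that the guest really is running a 100Hz kernel with clock=pit. A minimal sketch, assuming the distribution installs the kernel build config as /boot/config-<version> (CentOS does); the function names kernel_hz and uses_pit_clock are my own.

```shell
#!/bin/sh
# Hypothetical helpers; both read their input on stdin.

# Print the tick rate (CONFIG_HZ) from a kernel build config.
kernel_hz() {
    awk -F= '/^CONFIG_HZ=/ { print $2; exit }'
}

# Succeed if the kernel command line contains the clock=pit option.
uses_pit_clock() {
    grep -qw 'clock=pit'
}

# Example usage inside the guest (uncomment to run):
# echo "Tick rate: $(kernel_hz < "/boot/config-$(uname -r)") Hz"
# uses_pit_clock < /proc/cmdline && echo "clock=pit is active"
```

If the tick rate still comes back as 1000, the guest most likely booted the stock kernel, so check the “default=” value in grub.conf again.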

post Diskless ESX Server

May 13th, 2008 @ 1:12 pm

Filed under: Servers, Virtualization

We have had a VMware ESX cluster for a while now, but last night I put together our first diskless ESX server. I’m excited about this because it eliminates a failure point from the environment - the local disks in the servers. I’m using QLogic iSCSI HBAs and booting from a 10GB volume on our EqualLogic SAN.

I got everything configured and tested last night. Today, it gets racked and added to our cluster in VirtualCenter. Here are a few pictures:

No disks :-) The machine on the bottom and the Mac are a test environment for our upcoming Windows 2008 and Mac OS X Leopard deployment. The ProCurve switch is just for testing on the workbench. Once racked, the server will be attached to our Cisco 6500 core switch.

It doesn’t even know there are no disks (boots up really fast too)

VI Client showing specs of new machine - 8 x 2.5GHz cores and 20GB of RAM - lots of horsepower :-)

Its home once I rearrange a few things tonight. The 4 machines at the top are our current ESX cluster. The disk array just underneath is for disk-based backup. The SAN is in another rack.

post Need More ESX Servers

April 10th, 2008 @ 12:56 am

Filed under: Virtualization

I enabled HA on our ESX cluster today and was greeted by a nice big warning message that I don’t have enough resources to satisfy the HA requirements. So, it looks like if one host dies, I don’t have enough RAM/CPU to run all of those VMs elsewhere.

We do have licenses for a couple more CPUs than we are actually using currently, so I guess it’s time to add another physical server to the cluster.
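The idea behind that warning can be sketched as a simple “can we lose the biggest host?” check. This is only a rough approximation with my own function name; VMware HA’s real admission control works differently (it reasons about per-VM reservations), so treat it as back-of-the-envelope math, not what VirtualCenter computes.

```shell
#!/bin/sh
# Hypothetical helper: survives_host_loss DEMAND CAP1 CAP2 ...
# Succeeds if the hosts left after losing the largest one can still cover
# DEMAND. All values are whole numbers in the same unit (e.g. GB of RAM).
survives_host_loss() {
    demand=$1
    shift
    total=0
    largest=0
    for cap in "$@"; do
        total=$((total + cap))
        if [ "$cap" -gt "$largest" ]; then
            largest=$cap
        fi
    done
    [ $((total - largest)) -ge "$demand" ]
}

# Example usage with made-up numbers (VMs need 40GB; hosts have 16, 16, 20GB):
# survives_host_loss 40 16 16 20 || echo "Not enough headroom for a host failure"
```

Run it once with RAM figures and once with CPU figures; if either check fails, the cluster can’t absorb the loss of its largest host.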
