
Equallogic Auto-Snapshot Manager for VMware

September 13th, 2008 @ 2:03 am

For a couple of months now, I’ve been hearing about the upcoming 4.0 firmware and Auto-Snapshot Manager, VMware edition for the Equallogic PS series SAN. This new snapshot provider would allow us to coordinate snapshots between VirtualCenter and the SAN, and, according to Dell/Equallogic, allow easy restoration of a single virtual machine from the SAN-based snapshots.

I had the opportunity on Thursday night to watch a pre-recorded demo as well as to attend a live webinar on Friday morning. I must say that after these demos and seeing exactly what this product does, I am stuck somewhere between excitement and disappointment. It’s a really cool concept, but I believe it still needs a lot of polishing, especially on the recovery side of things.

There are lots of awesome features in the new snapshot provider. We will have the ability to automatically, in a single click or scheduled task, trigger an ESX snapshot, including memory dump, then snapshot the SAN volume, followed by removing the ESX snapshot. This eliminates the journaling effect and associated performance hit and disk requirements of the ESX snapshots. This is all handled through a nice web interface, and the VirtualCenter folder tree is carried over, allowing snapshot schedules to be applied to groups of Virtual Machines.

There are, however, some catches. Only the selected VMs are triggered for ESX snapshots, but the entire SAN volume, which may contain many other VMs, is snapshotted. This makes the ability to group VMs using VirtualCenter folders less than useful. Let’s say I have four VMs split between two volumes and want one machine on each volume to be snapshotted every 12 hours. Then, I want the other machine on each volume to be snapshotted every 24 hours. In this scenario, I will actually end up with, at the SAN level, two snapshots per day of both entire volumes and all the VMs on them, since the entire volume is snapshotted every time. So, in my opinion, snapshotting VMs by any grouping other than an entire SAN volume isn’t going to be practical without a lot of wasted disk space.

On the recovery side, I think there is a lot of room for improvement. It is very easy to revert an entire volume and all the VMs it contains. Beyond that, restoring a single VM, for example, becomes a somewhat lengthy process. Basically, it involves going back to the Equallogic Group Manager, setting the snapshot online, going to ESX and mounting the snapshot as a new volume, deleting the damaged VM, copying it manually from the snapshot to the production volume, adding it to inventory, booting it up, and then unmounting the snapshot. Alternatively, the VM can be booted from the snapshot volume, then migrated back to the production volume using Storage VMotion. Storage VMotion, however, requires accessing the ESX command line.

It is my hope that, in a future release, Dell will automate some of the recovery process using the VMware APIs. Currently, there are lots of improvements in creating the snapshot, but no real change in the process of recovering a VM.

I am looking forward to getting the Auto-Snapshot Manager, VMware edition installed in our environment and actually seeing it in action in a production environment. Expect another post in the future with more details once I actually get this up and running.

Big VMware ESX Bug

August 14th, 2008 @ 8:34 am

Filed under: Servers, Virtualization

If you are not aware yet, a major bug has been revealed in ESX 3.5 and ESXi 3.5 Update 2. Apparently, the beta was coded to expire on August 12, 2008, and this code was never removed from the actual release. Details are available in the VMware Knowledge Base and This Topic on their forums. You might also check out This Post on Matthew Marlow’s blog for more information.

On the morning of the 12th, I was greeted with several errors like this one in the logs for our ESX cluster:

VMware finally released the patch really late Tuesday night, which unfortunately kept me up most of the night getting our cluster patched. It involved setting back the clock on all of the hosts so VMotion would work, manually migrating VMs off of a host, putting the host into maintenance mode, applying the patch via the command line, then migrating the VMs back.

VMware is one of my favorite companies and is known for delivering rock-solid, enterprise-class products, so it really disappoints me that they would let something like this slip through the cracks. Imagine how many hundreds of thousands (maybe millions?) of VMs this affected. They do seem to be committed to fixing their mistake and making things right. You can check out the Letter From Their CEO for more info.

I Love VMware!

June 4th, 2008 @ 11:06 am

Filed under: Servers, Virtualization

Over the Memorial Day holiday, I had a “VMware Upgrade Party.” I’m not sure it was really a “Party” since I was the lone attendee, but I’ll call it one anyway. :-) I got all of our ESX servers upgraded to the latest build of 3.5, and VirtualCenter upgraded to the latest 2.5 build. I also added our fourth ESX server, which is the diskless, boot-from-SAN box I talked about a couple of weeks ago.

I was a little bit hesitant to put the diskless box in production since a bug in the QLogic HBA firmware required me to run their beta or “Limited Release” firmware in order to do Jumbo Frames. So far though, it has been rock solid.

Below, you can see our current VMware environment. I get more and more excited about this every day. I can now have a new machine online in less than 20 minutes without adding any physical hardware. Awesome! We currently have one stand-alone ESX server, jfbc-ecc-esx03, that runs our virtual desktops. The other three servers are in an HA cluster sporting a total of 32GHz of CPU resources and 52GB of RAM. I like it! I hope to be able to add VMotion and Distributed Resource Scheduling later this year so we can more effectively manage our host resources.

Successful SAN and VMware Upgrades

May 28th, 2008 @ 12:04 am

While everyone was away for the holiday Monday, I took the opportunity to upgrade our SAN and ESX servers.  Everything went surprisingly well.

What was really impressive is how fast the Equallogic SAN reboots. The firmware upgrade was the first reboot since it was installed. They claimed you could reboot it “live” without causing any problems with the servers, but I had never tested that theory until now. I was sending it one ping per second during the entire process. I dropped a total of 12 pings during the reboot and the servers never knew the storage had just rebooted. Pretty impressive! Check this out (I did it from home, hence the 12-15ms latency):

I also migrated all of our ESX servers from version 3.0.2 to 3.5.  For some reason, the HA agent had to be reconfigured on a couple of them, and the ESX firewall decided to block outbound iSCSI traffic on every box after the upgrade.  Other than that, the ESX upgrades went great!

Our first diskless ESX server is now online as well. The QLogic HBA initially wouldn’t connect to our SAN using jumbo frames. QLogic’s response was to send me their “Beta” or “Limited Release” firmware, which scares me a little. I have several production VMs running on that host with no issues though. I hope to do some benchmarks on VMware Server vs ESX with software iSCSI vs ESX with hardware iSCSI. Stay tuned for details on that!

I love it when a project goes as planned!

Runaway Clock in Virtual Linux Servers

May 18th, 2008 @ 2:45 pm

Filed under: Servers, Virtualization

If you run any Linux guests under VMware, you’ve probably had issues with the clock in the VM drifting or just totally running away.

The Linux clock works by counting timer interrupts. In older kernels, this was usually done at a rate of 100Hz, or 100 times per second. Beginning with the 2.6 kernel, the interrupt timer is now set at 1000Hz, so interrupts are counted 10 times as often.

Because VMware divides the host up into “time slots” for each guest OS, interrupts are often missed in the guest machines, depending on the system load. The more often the guest kernel counts interrupts, the more apparent these “missed” interrupts become, and the result is clock skew in the guest machine. VMware Tools has the ability to sync the guest clock with the host, but this only occurs once per minute, and it can only advance the clock; it can’t slow it down. Generally, the VMware Tools clock sync alone is not enough.
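
Before changing anything, it can help to confirm which tick rate a guest kernel was actually built with. Here’s a minimal sketch, assuming the distribution keeps the kernel build config under /boot (CentOS does). The kernel_hz function name is made up for illustration; it isn’t part of any VMware or CentOS tool.

```shell
# Print the timer interrupt frequency (in Hz) a kernel was built with,
# read from a kernel build config file such as /boot/config-<version>.
# With no argument, it looks at the config of the running kernel.
# Assumption: the build config is installed under /boot.
kernel_hz() {
    sed -n 's/^CONFIG_HZ=\([0-9][0-9]*\)$/\1/p' "${1:-/boot/config-$(uname -r)}"
}

# Usage inside the guest:
#   kernel_hz                          # running kernel
#   kernel_hz /boot/config-<version>   # some other installed kernel
```

A result of 1000 means the guest counts ten times as many interrupts as a 100Hz kernel, which is the case the steps below are meant to address.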

Here are the steps needed to keep the clock skew under control (these apply to VMware Server running on a Linux host; in my case, CentOS). The guest OS changes also apply to ESX:

  • VMware Server needs to be told what clock speed the CPU(s) run at. This can be found by running “cat /proc/cpuinfo”, which will return all kinds of information about the CPUs, including the clock speed. You’ll need to edit /etc/vmware/config and add the following lines, where host.cpukHz is the host CPU speed in kHz (2.8GHz in my example below):

    host.cpukHz = 2800000
    host.noTSC = TRUE
    ptsc.noTSC = TRUE

  • VMware Tools needs to be installed in the guest OS. VMware provides instructions on how to install VMware Tools in a Linux guest here.
  • VMware Tools time synchronization needs to be enabled. This is done by editing the VMX file in the virtual machine directory and adding the following line:

    tools.syncTime = "TRUE"

    Note that the host should use NTP to sync to an outside time source, while NTP should be disabled in each guest.

  • Now, we need to lower the interrupt frequency in the guest kernel. Generally, this will require installing the kernel source, modifying the CONFIG_HZ parameter to a rate of 100Hz, and then recompiling the kernel. CentOS has made this easy for us by releasing a “VM Optimized” kernel for CentOS 5. Although perfectly stable, this kernel is presently in the “Testing” repository. Here’s how to install the VM kernel using yum on a CentOS 5 system.

    Add the “Testing” repo as follows:

    cd /etc/yum.repos.d
    wget http://dev.centos.org/centos/5/CentOS-Testing.repo

    Now, install the VM Optimized kernel:

    yum --enablerepo=c5-testing install kernel-vm kernel-vm-devel

  • Now, we need to make sure Grub is set to boot the new kernel, and also add the “clock=pit” parameter to the kernel boot options. We do that by editing /etc/grub.conf and making the following changes:

    default=0

    Where “0” is the first kernel listed. If the VM kernel is not the first item, you’ll need to adjust the value accordingly. For example, if it’s second in the list, you’d use “default=1”.

    Now, add the clock=pit parameter to the kernel boot options. That section of the grub.conf file will look something like this:

    title CentOS (2.6.18-53.1.19.el5)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-53.1.19.el5 ro root=LABEL=/ clock=pit
        initrd /initrd-2.6.18-53.1.19.el5.img

Once all of the above changes are made, reboot the guest, and you should see significantly better clock performance. I had some VMs where the time would drift by hours, and after making these changes, they stay within a few seconds.
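
Two of the steps above are easy to get wrong by hand: converting the CPU speed to kHz, and remembering clock=pit on every kernel line. Here’s a rough sketch of two helpers for double-checking them. The function names are invented for illustration. Also keep in mind that the “cpu MHz” value in /proc/cpuinfo can change with CPU frequency scaling, so treat the result as a starting point and compare it against the rated speed of the processor.

```shell
# Print the first "cpu MHz" value from a cpuinfo-style file, converted to kHz,
# which is the unit host.cpukHz expects. Defaults to /proc/cpuinfo.
cpu_khz() {
    awk -F: '/^cpu MHz/ { printf "%.0f\n", $2 * 1000; exit }' "${1:-/proc/cpuinfo}"
}

# List the kernel lines in a GRUB config that do NOT carry clock=pit.
# Prints nothing if every kernel line already has it. Defaults to /etc/grub.conf.
kernels_missing_clock_pit() {
    grep -E '^[[:space:]]*kernel[[:space:]]' "${1:-/etc/grub.conf}" | grep -v 'clock=pit'
}

# Usage:
#   cpu_khz                      # on the host; e.g. a 2.8GHz CPU gives 2800000
#   kernels_missing_clock_pit    # in the guest, after editing grub.conf
```

The first helper is meant for the VMware Server host and the second for the Linux guest.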

Diskless ESX Server

May 13th, 2008 @ 1:12 pm

Filed under: Servers, Virtualization

We have had a VMware ESX cluster for a while now, but last night I put together our first diskless ESX server. I’m excited about this because it eliminates a failure point from the environment: the local disks in the servers. I’m using QLogic iSCSI HBAs and booting from a 10GB volume on our Equallogic SAN.

I got everything configured and tested last night. Today, it gets racked and added to our cluster in VirtualCenter. Here are a few pictures:

No disks :-) The machine on the bottom and the Mac are a test environment for our upcoming Windows 2008 and Mac OS X Leopard deployment. The ProCurve switch is just for testing on the workbench. Once racked, the server will be attached to our Cisco 6500 core switch.

It doesn’t even know there are no disks (boots up really fast too)

VI Client showing specs of new machine - 8 x 2.5GHz cores and 20GB of RAM - lots of horsepower :-)

Its home once I rearrange a few things tonight. The 4 machines at the top are our current ESX cluster. The disk array just underneath is for disk-based backup. The SAN is in another rack.

Need More ESX Servers

April 10th, 2008 @ 12:56 am

Filed under: Virtualization

I enabled HA on our ESX cluster today and was greeted by a nice big warning message that I don’t have enough resources to satisfy the HA requirements. So, it looks like if one host dies, I don’t have enough RAM/CPU to run all of its VMs elsewhere.

We do have licenses for a couple more CPUs than we are actually using currently, so I guess it’s time to add another physical server to the cluster.
