
post Equallogic Auto-Snapshot Manager for VMware

September 13th, 2008 @ 2:03 am

For a couple of months now, I’ve been hearing about the upcoming 4.0 firmware and Auto-Snapshot Manager, VMware edition for the Equallogic PS series SAN. This new snapshot provider would allow us to coordinate snapshots between VirtualCenter and the SAN, and, according to Dell/Equallogic, allow easy restoration of a single Virtual Machine from the SAN-based snapshots.

I had the opportunity on Thursday night to watch a pre-recorded demo as well as to attend a live webinar on Friday morning. I must say that after these demos and seeing exactly what this product does, I am stuck somewhere between excitement and disappointment. It’s a really cool concept, but I believe it still needs a lot of polishing, especially on the recovery side of things.

There are lots of awesome features in the new snapshot provider. We will have the ability to automatically, in a single click or scheduled task, trigger an ESX snapshot, including memory dump, then snapshot the SAN volume, followed by removing the ESX snapshot. This eliminates the journaling effect and associated performance hit and disk requirements of the ESX snapshots. This is all handled through a nice web interface, and the VirtualCenter folder tree is carried over, allowing snapshot schedules to be applied to groups of Virtual Machines.

There are, however, some catches. Only the selected VM’s are triggered for ESX snapshots, but the entire SAN volume, which may contain many other VM’s, is snapshotted. This makes the ability to group VM’s using VirtualCenter folders less than useful. Let’s say I have four VM’s split between two volumes and want one machine on each volume to be snapshotted every 12 hours. Then, I want the other machine on each volume to be snapshotted every 24 hours. In this scenario, I will actually end up with, at the SAN level, two snapshots per day of both entire volumes and all the VM’s, since the entire volume is snapshotted. So, in my opinion, snapshotting VM’s by any grouping other than an entire SAN volume isn’t going to be practical without a lot of wasted disk space.

On the recovery side, I think there is a lot of room for improvement. It is very easy to revert an entire volume and all the VM’s it contains. Beyond that, restoring a single VM, for example, becomes a somewhat lengthy process. Basically, it involves going back to the Equallogic Group Manager, setting the snapshot online, going to ESX and mounting the snapshot as a new volume, deleting the damaged VM, copying it manually from the snapshot to the production volume, adding it to inventory, booting it up, and then unmounting the snapshot. Alternately, the VM can be booted from the snapshot volume, then migrated back to the production volume using Storage VMotion. Storage VMotion, however, requires accessing the ESX command line.

It is my hope that, in a future release, Dell will automate some of the recovery process using the VMware API’s. Currently, there are lots of improvements in creating the snapshot, but no real change in the process of recovering a VM.

I am looking forward to getting the Auto-Snapshot Manager, VMware edition installed in our environment and actually seeing it in action in a production environment. Expect another post in the future with more details once I actually get this up and running.

post Big VMware ESX Bug

August 14th, 2008 @ 8:34 am

Filed under: Servers, Virtualization

If you are not aware yet, a major bug has been revealed in ESX 3.5 and ESXi 3.5 Update 2.  Apparently, the beta was coded to expire on August 12, 2008 and this code was never removed from the actual release.  Details are available in the VMware Knowledge Base and This Topic on their forums.  You might also check out This Post on Matthew Marlow’s blog for more information.

On the morning of the 12th, I was greeted with several errors like this one in the logs for our ESX cluster:

VMware finally released the patch really late Tuesday night, which, unfortunately, kept me up most of the night getting our cluster patched.  It involved setting back the clock on all of the hosts so VMotion would work, manually migrating VM’s off of a host, going into maintenance mode, applying the patch via the command line, then migrating the VM’s back.

VMware is one of my favorite companies and is known for delivering rock-solid, enterprise-class products, so it really disappoints me that they would let something like this slip through the cracks.  Imagine how many hundreds of thousands (maybe millions?) of VM’s this affected.  They do seem to be committed to fixing their mistake and making things right.  You can check out the Letter From Their CEO for more info.

post Odd Exchange Issue

July 2nd, 2008 @ 2:12 am

Filed under: Email, Servers

I’m seeing an interesting, and seemingly recurring, issue with one particular user’s Exchange calendar.  The complaint is always that when a new appointment is added to the calendar, it disappears a minute or two later.  The event log on the Exchange Server always shows these errors:

Calendaring agent failed with error code 0x80040215 while saving appointment.

Calendaring agent failed to update the free/busy cache during an appointment save or delete operation.

Calendaring agent failed in message save notification with error 0x800703eb on XXXX@jfbc.org: /Calendar/test.EML.

I’ve tested, researched, tried to reproduce, etc, etc and am at a loss as to what’s causing this.  The only thing I can think of is maybe it’s related to syncing the user’s Palm Pilot.  Maybe the Palm device is allowing something to be input that Exchange doesn’t like?

I have figured out that there is some corrupt record that exists in the mailbox that’s causing this.  Moving the mailbox to another database and then back clears things up for a while, then it pops back up.  I’m curious if anyone else out there has seen anything like this?

post I Love VMware!

June 4th, 2008 @ 11:06 am

Filed under: Servers, Virtualization

Over the Memorial Day holiday, I had a “VMware Upgrade Party.” I’m not sure it was really a “Party” since I was the lone attendee, but I’ll call it one anyway. :-) I got all of our ESX servers upgraded to the latest build of 3.5, as well as VirtualCenter to the latest 2.5 build. I also added our fourth ESX server, which is the diskless, boot from SAN box I talked about a couple of weeks ago.

I was a little bit hesitant to put the diskless box in production since a bug in the QLogic HBA firmware required me to run their beta or “Limited Release” firmware in order to do Jumbo Frames. So far though, it has been rock solid.

Below, you can see our current VMware environment. I get more and more excited about this every day. I can now have a new machine online in less than 20 minutes without adding any physical hardware. Awesome! We currently have one stand-alone ESX server, jfbc-ecc-esx03, that runs our virtual desktops. The other three servers are in an HA cluster sporting a total of 32GHz of CPU resources and 52GB of RAM. I like it! I hope to be able to add VMotion and Distributed Resource Scheduling later this year so we can more effectively manage our host resources.

post Successful SAN and VMware Upgrades

May 28th, 2008 @ 12:04 am

While everyone was away for the holiday Monday, I took the opportunity to upgrade our SAN and ESX servers.  Everything went surprisingly well.

What was really impressive is how fast the Equallogic SAN reboots.  The firmware upgrade was the first reboot since it was installed.  They claimed you could reboot it “live” without causing any problems with the servers, but I had never tested that theory until now.  I was sending it a ping every second during the entire process.  I dropped a total of 12 pings during the reboot and the servers never knew the storage had just rebooted.  Pretty impressive!  Check this out (I did it from home, hence the 12-15ms latency):
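For anyone who wants to repeat this kind of test, here is a rough sketch. It is not necessarily how the test above was run. It assumes a Linux or macOS ping that prints the usual “packets transmitted ... received” summary line; lost_pings is a made-up helper name and 192.0.2.10 is a placeholder address, not our SAN.

```sh
# lost_pings: read ping output on stdin and print how many probes got no reply.
# Assumes the summary line looks like one of:
#   "300 packets transmitted, 288 received, ..."          (Linux iputils)
#   "300 packets transmitted, 288 packets received, ..."  (macOS/BSD)
lost_pings() {
    awk '/packets transmitted/ {
        tx = $1
        for (i = 2; i <= NF; i++) {
            if ($i == "received,") {
                # Step back over the word "packets" on BSD-style output.
                rx = ($(i - 1) == "packets") ? $(i - 2) : $(i - 1)
            }
        }
        print tx - rx
    }'
}

# Example (placeholder address): one probe per second for five minutes.
#   ping -i 1 -c 300 192.0.2.10 | lost_pings
```

The helper only looks at the summary line, so it reports how many pings were lost, not when they were lost.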

I also migrated all of our ESX servers from version 3.0.2 to 3.5.  For some reason, the HA agent had to be reconfigured on a couple of them, and the ESX firewall decided to block outbound iSCSI traffic on every box after the upgrade.  Other than that, the ESX upgrades went great!
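For reference, on the ESX 3.x service console the software iSCSI client can be allowed back through the firewall with esxcfg-firewall. The sketch below is one way to do that, assuming the host uses the software initiator rather than a hardware HBA; open_sw_iscsi is just an illustrative helper name.

```sh
# open_sw_iscsi: allow outbound software-iSCSI traffic on an ESX 3.x host.
# Meant to be run as root on the service console; returns 1 anywhere else.
open_sw_iscsi() {
    if ! command -v esxcfg-firewall >/dev/null 2>&1; then
        echo "esxcfg-firewall not found (not an ESX service console?)" >&2
        return 1
    fi
    # swISCSIClient is the firewall service name for the software initiator.
    # Enable it, then query it to show the resulting state.
    esxcfg-firewall -e swISCSIClient && esxcfg-firewall -q swISCSIClient
}
```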

Our first diskless ESX server is now online also.  The QLogic HBA initially wouldn’t connect to our SAN using jumbo frames.  QLogic’s response was to send me their “Beta” or “Limited Release” firmware, which scares me a little.  I have several production VM’s running on that host with no issues though.  I hope to do some benchmarks on VMware Server vs ESX with software iSCSI vs ESX with hardware iSCSI.  Stay tuned for details on that!

I love it when a project goes as planned!

post Networking for iSCSI

May 25th, 2008 @ 3:34 am

Filed under: Networking, Servers, Storage

I’ve received several comments and questions on my post from a few days ago, “iSCSI Slow? I Think Not.”  The network hardware is critical for peak iSCSI performance.  I think a brief follow-up with some details on our network configuration is in order.

We are using a Cisco Catalyst 6506 switch at the core of our network, which handles all of our iSCSI traffic.  The current configuration looks like this:

  • (1)  WS-X6K-SUP2-MSFC2 with PFC2 supervisor module
  • (2)  WS-X6148A-GE-TX gigabit modules (connects all server and iSCSI devices)
  • (1)  WS-X6414-GBIC fiber module (backbone to all of our IDF’s)

All SAN ports are configured for Jumbo Frames and Flow Control.
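To put a rough number on why jumbo frames are worth the trouble for iSCSI, here is a back-of-the-envelope sketch. The figures are assumptions, not measurements from our network: a 9000-byte jumbo MTU, 40 bytes of IPv4 + TCP headers with no options, and 38 bytes of Ethernet overhead per frame. It ignores iSCSI’s own headers, and payload_efficiency is a made-up helper name.

```sh
# payload_efficiency MTU: percentage of on-the-wire bytes that carry TCP payload.
# Assumptions: 40 bytes of IPv4 + TCP headers (no options) inside the MTU, and
# 38 bytes of Ethernet overhead per frame (preamble, header, FCS, inter-frame gap).
payload_efficiency() {
    awk -v mtu="$1" 'BEGIN { printf "%.2f\n", (mtu - 40) / (mtu + 38) * 100 }'
}

payload_efficiency 1500   # standard frames, prints 94.93
payload_efficiency 9000   # jumbo frames, prints 99.14
```

The bandwidth gain is only a few percent. The larger benefit is usually that the same amount of data takes roughly one sixth as many frames, so hosts and switches have far fewer packets to process.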

The servers are HP DL360 G5’s with NC360T NICs.  I just deployed a new ESX server with a QLogic iSCSI HBA, but I don’t really have any benchmarks on that yet.  I’ll post some details on that once I run some benchmarks.  I’m interested in whether there will be a big performance increase over the ESX software iSCSI initiator.

post Got My New Xserve

May 23rd, 2008 @ 2:00 pm

Filed under: Macs, Servers

Our new Xserve arrived yesterday.  I got all the initial configuration done and got it racked.  Apple definitely makes some “Pretty” servers.

Over the next few weeks, I’ll be getting Open Directory and Update Services configured and rolled out to all of our Mac workstations.  At some point, we’ll also be installing Final Cut Server.  I’ll be posting updates as we get all of this configured.  In the meantime, here’s a few pics:

post Runaway Clock in Virtual Linux Servers

May 18th, 2008 @ 2:45 pm

Filed under: Servers, Virtualization

If you run any Linux guests under VMware, you’ve probably had issues with the clock in the VM drifting or just totally running away.

The Linux clock works by counting timer interrupts. In older kernels, this was usually done at a rate of 100Hz, or 100 times per second. Beginning with the 2.6 kernel, the interrupt timer is now set at 1000Hz, so interrupts are counted 10 times as often.

Because VMware divides the host up into “time slots” for each guest OS, interrupts are often missed in the guest machines, depending on the system load. The more often the guest kernel counts interrupts, the more apparent these “missed” interrupts become, and the result is clock skew in the guest machine. VMware Tools has the ability to sync the guest clock with the host, but this only occurs once per minute, and it can only advance the clock; it can’t slow it down. Generally, the VMware Tools clock sync alone is not enough.
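Here is a toy calculation of why the tick rate matters. It is a simplified model for illustration only, not how VMware actually accounts for timer interrupts: it assumes that while a guest is not scheduled, at most one pending tick survives and any further ticks that fell due are lost. lost_ms is a made-up helper name.

```sh
# lost_ms GAP_MS HZ: toy estimate of guest time lost while the guest is not
# scheduled for GAP_MS milliseconds, with a timer running at HZ ticks/second.
# Assumption (illustration only): at most one pending tick survives the gap,
# so every additional tick that fell due during it is simply lost.
lost_ms() {
    gap_ms=$1
    hz=$2
    due=$(( gap_ms * hz / 1000 ))   # ticks that fell due during the gap
    if [ "$due" -gt 1 ]; then
        lost=$(( due - 1 ))
    else
        lost=0
    fi
    echo $(( lost * 1000 / hz ))    # each tick is worth 1000/HZ milliseconds
}

lost_ms 25 100    # 2 ticks due, 1 lost: prints 10
lost_ms 25 1000   # 25 ticks due, 24 lost: prints 24
```

Under these assumptions, the same 25 ms gap costs a 1000Hz guest more than twice as much time as a 100Hz guest, and a short 5 ms gap costs the 100Hz guest nothing at all.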

Here are the steps that are needed in order to keep the clock skew under control (these apply to VMware Server running on a Linux host - in my case, CentOS). The guest OS changes will also apply to ESX:

  • VMware Server needs to be told what clock speed the CPU(s) run at. This can be found by running “cat /proc/cpuinfo”, which will return all kinds of information about the CPU’s, including the clock speed. You’ll need to edit /etc/vmware/config and add the following lines, where host.cpukHz is the host CPU speed in kHz (2.8GHz in my example below):

    host.cpukHz = 2800000
    host.noTSC = TRUE
    ptsc.noTSC = TRUE

  • VMware Tools needs to be installed in the guest OS. VMware provides instructions on how to install VMware Tools in a Linux guest here.
  • VMware Tools time synchronization needs to be enabled. This is done by editing the VMX file in the virtual machine directory and adding the following line:

    tools.syncTime = "TRUE"

    Note that the host should use NTP to sync to an outside time source, while NTP should be disabled in each guest.

  • Now, we need to lower the interrupt frequency in the guest kernel. Generally, this will require installing the kernel source, modifying the CONFIG_HZ parameter to a rate of 100Hz, and then recompiling the kernel. CentOS has made this easy for us by releasing a “VM Optimized” kernel for CentOS 5. Although perfectly stable, this kernel is presently in the “Testing” repository. Here’s how to install the VM Kernel using yum in a CentOS 5 system:

    Add the “Testing” repo as follows:

    cd /etc/yum.repos.d
    wget http://dev.centos.org/centos/5/CentOS-Testing.repo

    Now, install the VM Optimized kernel:

    yum --enablerepo=c5-testing install kernel-vm kernel-vm-devel

  • Now, we need to make sure Grub is set to boot the new kernel, and also add the “clock=pit” parameter to the kernel boot options. We do that by editing /etc/grub.conf and making the following changes:

    default=0

    Where “0” is the first kernel listed. If the VM Kernel is not the first item, you’ll need to adjust the value accordingly. For example, if it’s second in the list, you’d use “default=1”.

    Now, add the clock=pit parameter to the kernel boot options. That section of the grub.conf file will look something like this:

    title CentOS (2.6.18-53.1.19.el5)
    root (hd0,0)
    kernel /vmlinuz-2.6.18-53.1.19.el5 ro root=LABEL=/ clock=pit
    initrd /initrd-2.6.18-53.1.19.el5.img
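If you have a lot of guests to touch, the grub.conf edit above can be scripted. This is only a sketch: add_clock_pit is a made-up helper name, it tags every kernel line in the file (not just the VM kernel), and it writes to stdout instead of editing the file in place so the result can be reviewed first.

```sh
# add_clock_pit: copy a grub.conf from stdin to stdout, appending "clock=pit"
# to every kernel line that does not already have it.
add_clock_pit() {
    awk '/^[ \t]*kernel[ \t]/ && !/clock=pit/ {
        sub(/[ \t]*$/, "")          # drop trailing whitespace
        $0 = $0 " clock=pit"
    }
    { print }'
}

# Example: write the result to a scratch file and review it before copying back.
#   add_clock_pit < /etc/grub.conf > /tmp/grub.conf.new
#   diff /etc/grub.conf /tmp/grub.conf.new
```

Running it a second time changes nothing, since lines that already contain clock=pit are left alone.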

Once all of the above changes are made, reboot the guest, and you should see significantly better clock performance. I had some VM’s where the time would drift by hours, and after making these changes, they stay within a few seconds.
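Going back to the first step, the host.cpukHz value can be pulled out of /proc/cpuinfo rather than typed by hand. A minimal sketch, with cpukhz_line as a made-up helper name: it takes the first “cpu MHz” entry and multiplies by 1000. One caveat is that on hosts with CPU frequency scaling, “cpu MHz” may show the current speed rather than the rated one, so check the number before using it.

```sh
# cpukhz_line: read /proc/cpuinfo-style text on stdin and print a host.cpukHz
# line based on the first "cpu MHz" entry (MHz * 1000, rounded to whole kHz).
cpukhz_line() {
    awk -F: '/^cpu MHz/ { printf "host.cpukHz = %.0f\n", $2 * 1000; exit }'
}

# Example on the VMware Server host:
#   cpukhz_line < /proc/cpuinfo
```

For a 2.8GHz CPU reporting “cpu MHz : 2800.000”, this prints host.cpukHz = 2800000, matching the example above.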

post ActiveSync + ISA Server

May 16th, 2008 @ 5:53 pm

Filed under: Email, Security, Servers

We worked for a while yesterday to get Bob’s Windows Mobile phone to sync with Exchange, without much luck (Bob just joined our IT team - welcome Bob!).  Bob is our first user with Windows Mobile.  Everyone else uses Blackberry devices.

We use an ISA 2006 server in the DMZ with RADIUS authentication as a front-end server to Exchange.  I initially added the Microsoft-Server-ActiveSync virtual directory to the list of paths in the existing ISA rule.  We got errors about not having the correct privileges to do ActiveSync, which we obviously did have.  After messing with this for a little while, I realized I needed to create a separate rule for the ActiveSync path and place it above my OWA redirect rule.  I have a rule that allows the user to type in just http://webmail.jfbc.org and get automatically redirected to https://webmail.jfbc.org/owa.  It seems that this rule was also redirecting the ActiveSync directory.  Here’s what the “Correct” setup looks like in ISA server:

Apparently, that wasn’t the only issue.  Next problem: It kept complaining about an incorrect username or password.  Obviously, the username and password were correct.  Some monitoring in ISA server revealed the authentication didn’t seem to be happening.  All of the requests were marked as “anonymous.”

You won’t believe how simple this was.  On the handheld, there are 3 boxes: username, password, and domain.  We run split DNS, with JFBC.ORG as the internal domain name, so that’s what we entered.  Turns out that ISA server wants the NetBIOS name instead, which is simply JFBC.  It’s amazing how something so simple can create such a big issue.
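As for the rule-order problem from earlier in this post: ISA works down its rule list and applies the first rule that matches, so a broad redirect rule sitting above a narrower one will swallow the narrower one’s traffic. The sketch below is a toy model of that first-match behavior, not ISA’s real matching logic. The patterns and action names are invented for illustration, and first_match is a made-up helper.

```sh
# first_match PATH RULE...: print the action of the first rule whose pattern
# matches PATH. Each RULE is "glob-pattern=action"; rules are tried in order.
first_match() {
    path=$1
    shift
    for rule in "$@"; do
        pattern=${rule%%=*}
        action=${rule#*=}
        case $path in
            $pattern) echo "$action"; return 0 ;;
        esac
    done
    echo "no-match"
}

# Redirect rule on top: the ActiveSync request is caught by the catch-all.
first_match /Microsoft-Server-ActiveSync/x '/*=owa-redirect' '/Microsoft-Server-ActiveSync/*=activesync'
# ActiveSync rule on top: the request reaches the right rule.
first_match /Microsoft-Server-ActiveSync/x '/Microsoft-Server-ActiveSync/*=activesync' '/*=owa-redirect'
```

The first call prints owa-redirect and the second prints activesync. That is the same effect as moving the ActiveSync rule above the OWA redirect rule.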

post Open Directory - Here We Come

May 14th, 2008 @ 4:15 am

Filed under: Macs, Servers, Strategy

Currently, our network at JFBC is about 15% Mac. One of the big ongoing projects I’ve been working on is better integration and management of the growing number of Macs in our environment. We currently leverage Active Directory for single sign-on, but, beyond that, there are no real management tools in place for Macs.

Some things are possible by extending the Active Directory schema to add some of the Apple-specific LDAP attributes. However, this moves the AD environment into a somewhat “unsupported” configuration and still doesn’t provide for full control when it comes to Mac management.

The best way to fully manage the Mac clients (including centralized update management and general settings such as appearance, shortcuts, scripting, etc.) is through the use of Apple’s Open Directory system. There was definitely some effort put forth on Apple’s part here, because Open Directory can fully integrate with Active Directory. Basically, AD gets used for authentication, then AD users and groups can be linked to OD groups. Specific management settings can then be applied to the OD groups.

I’ve just ordered a new Apple Xserve to handle this task, which should arrive next week. I’m excited about being able to take integration and management of our Mac environment to the next level.

Other Mac stuff on my radar:

  • OS X Leopard deployment (Jonathan has agreed to be my next victim, er, beta tester).
  • Office 2008 deployment.
  • Possible Final Cut Server implementation (Already briefly discussed with our media team, will be exploring this further, including storage requirements).
  • Migration of our closed circuit TV announcements from PowerPoint on Windows to Keynote on Mac (Currently working with our communications team on this).

Expect lots of Mac related posts in the coming weeks/months!
