January 22, 2010 - Matthijs Haverink

vSphere bug can bring down entire cluster (not fixed in Update 1).

Last month I was on-site at one of my customers and ran into a major problem in their vSphere environment. Suddenly about 150 virtual servers, running on 16 hosts in 2 clusters (on the same SAN), went into jabber-mode. They froze, which made them unreachable for pings and other traffic. Each freeze lasted anywhere from 1 to 30 seconds, after which the VMs came back online. This seemed to happen in groups of about 10 to 30 VMs at a time.

Pretty soon we saw in the logs that a VMFS LUN had been removed by one of the SAN administrators. This LUN was still attached to all the ESX hosts in that cluster but was not “in use”, meaning there were no VMs running on it.

Of course, removing a LUN at SAN level before detaching it from the ESX hosts is not the right way to do this, but what I just described still shouldn’t have happened!

The quick fix was a storage rescan, which removed the link to the unavailable LUN; after a couple of minutes the environment was stable again.

At first I couldn’t believe that the LUN removal was the cause of our problems, but in the end I got reactions on Twitter from people confirming the same problems in the same situation. Also, after investigation, VMware confirmed that this was a known “feature” of VMware vSphere.

Now you might think: just upgrade to Update 1 and all problems will probably be fixed. NOT! VMware confirmed that this issue is not fixed in Update 1 and that no patch will be released unless more customers experience this problem.

To some extent I understand this reaction: the problem only occurs when you don’t handle your VMware environment as VMware prescribes. But when you see what it can do to your production environment, I would like to see this issue fixed a.s.a.p.

I also noticed that I’m not the only one who has experienced this problem ( http://communities.vmware.com/thread/251536 ), so I’m curious whether VMware will pick this up before Update 2 or not.

For now, my advice: triple-check every storage change you make when it concerns vSphere!
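One way to make that triple-check concrete is a small pre-removal check that refuses to green-light a LUN while any host still sees it, even if no VM runs on it (which was exactly our situation). The sketch below is only an illustration: the `Host` and `VM` classes, the `removal_blockers` function and the host, VM and LUN names are all made up for this example, and in practice the inventory would have to come from your own tooling rather than a hard-coded list.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    # Hypothetical inventory record: an ESX host and the LUNs presented to it.
    name: str
    presented_luns: set = field(default_factory=set)


@dataclass
class VM:
    # Hypothetical inventory record: a VM and the LUN backing its datastore.
    name: str
    lun: str


def removal_blockers(lun, hosts, vms):
    """Return the reasons why `lun` should not yet be removed on the SAN."""
    blockers = []
    # A VM on the LUN is the obvious blocker.
    for vm in vms:
        if vm.lun == lun:
            blockers.append(f"VM {vm.name} still runs on {lun}")
    # An "unused" LUN that is still presented to a host is a blocker too.
    for host in hosts:
        if lun in host.presented_luns:
            blockers.append(f"host {host.name} still sees {lun}")
    return blockers


# Made-up example inventory.
hosts = [Host("esx01", {"lun-a", "lun-b"}), Host("esx02", {"lun-a"})]
vms = [VM("web01", "lun-a")]

# lun-b has no VMs on it, but esx01 still sees it, so it is not safe to remove.
for reason in removal_blockers("lun-b", hosts, vms):
    print(reason)
```

An empty result means nothing in the inventory references the LUN anymore; anything else is a reason to go back to the hosts first before the SAN administrator touches it.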

Virtual Infrastructure bug / down / error / issue / jabber / LUN / Removal / SAN / update 1 / VMFS / vSphere /

Comments

  • Ben Conrad says:

    This has been fixed in ESX400-200912401-BG , see KB 1015084 & 1016626. The wording in 1016626 is a bit confusing though.

    • Ben, thanks for your reply and the info.

      I had this post in my drafts for a while, so I was somewhat afraid the info was outdated. The scary thing is that we reported this case to VMware support 2 months ago and never got a solution from them, so it seems there’s some room for improvement there …

  • I don’t really see this as a bug. I am not surprised at the reaction of the system after removing storage in an unclean fashion. Can you blame VMware for bad decisions?

    • Hi Eric,

      Thank you for your reaction, but I think this is a non-discussion. When the impact can be this big, it is VMware’s responsibility to limit the possible damage. And by releasing a fix before Update 2, VMware seems to share that opinion.

  • Grant says:

    I’ve actually done this intentionally once, because the general rule with our storage is to un-present a LUN from the systems first and then reclaim the space. Why? Because if we delete it via the host there’s no way back. But I learned that with VMware this is a bad thing to do. Our system didn’t have these issues, but it was definitely unhappy. I had to get our storage admin to re-present the LUN so I could delete it cleanly.

    That’s why I want my own SAN 🙂