My nightmare experience with the new Dell Management Pack 3.1 A01

The new management pack has a bunch of new rules and most importantly doesn’t cause the WMI errors like previous versions.

Download: My Custom Dell MP without SNMP

After importing the new Management Pack I immediately started getting errors on my Root Management Server and Management Server.  Not the good kind of errors you would expect (like errors telling you there is a hardware issue), but the kind of errors that tell you that your SCOM infrastructure is now hosed!

Here are some of the errors I received:
*****************************************************************************************************************************
Data Warehouse failed to enumerate database components to be deployed. Failed to enumerate Data Warehouse components for deployment. The operation will be retried.
Exception ‘SqlException’: Timeout expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
One or more workflows were affected by this.
Workflow name: Microsoft.SystemCenter.DataWarehouse.Deployment.Component
*****************************************************************************************************************************
Data Warehouse managed object type synchronization process failed to write data to the Data Warehouse database. Failed to store data in the Data Warehouse. The operation will be retried.
Exception ‘SqlException’: Timeout expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
One or more workflows were affected by this.
Workflow name: Microsoft.SystemCenter.DataWarehouse.Synchronization.TypedManagedEntity
*****************************************************************************************************************************
Event ID: 31569

Description:
Report deployment process failed to request management pack list from Data Warehouse. The operation will be retried.
Exception ‘SqlException’: Timeout expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.

Workflow name: Microsoft.SystemCenter.DataWarehouse.Deployment.Report

****************************************************************************************************************************
Health Service Unloaded System Rule(s)
Alert raised by monitor when system rules have been unloaded by the Health Service.
Event ID:      4000
A monitoring host is unresponsive or has crashed.  The status code for the host failure was 2164195371.
*****************************************************************************************************************************

 

After doing some digging I found there was a fix for the Data Warehouse errors: installing KB954643 resolved them immediately.

The only thing left was Event ID 4000: A monitoring host is unresponsive or has crashed.  The status code for the host failure was 2164195371.  So how do we fix these errors?  I found KB951526.  The problem was that I had already installed this fix and was still having the issue.
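Error and status codes are often quoted in hexadecimal rather than decimal, so when searching for a code like this one it can help to have both forms on hand. A minimal sketch of the conversion, assuming the value is an unsigned 32-bit number (the function name is just for illustration):

```python
def to_hex32(code):
    """Format an unsigned 32-bit status code as an 8-digit hex string."""
    return f"0x{code & 0xFFFFFFFF:08X}"

# Status code from the Event ID 4000 description above
print(to_hex32(2164195371))  # prints 0x80FF002B
```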

So now how do we fix the problem?  The old management pack didn’t cause these issues, so why does the new one?
So I decided to export both the old and new management packs using Borris’s PowerShell script to see what changed in this new version of the Dell MP.
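For a quick first pass over two exported management packs, something like the following sketch can list the workflows that exist only in the newer file. It assumes the exports are unsealed XML in which discoveries, rules and unit monitors appear as `Discovery`, `Rule` and `UnitMonitor` elements with an `ID` attribute. The two miniature management packs and every ID in them are made up for illustration and are not taken from the Dell MP.

```python
import xml.etree.ElementTree as ET

# Element names assumed from the unsealed MP XML layout; adjust if yours differ.
WORKFLOW_TAGS = ("Discovery", "Rule", "UnitMonitor")

def workflow_ids(mp_xml):
    """Return a set of 'Tag:ID' strings for every workflow element found."""
    root = ET.fromstring(mp_xml)
    found = set()
    for tag in WORKFLOW_TAGS:
        for element in root.iter(tag):
            element_id = element.get("ID")
            if element_id:
                found.add(f"{tag}:{element_id}")
    return found

def added_workflows(old_xml, new_xml):
    """Return the sorted workflows present in new_xml but not in old_xml."""
    return sorted(workflow_ids(new_xml) - workflow_ids(old_xml))

# Hypothetical miniature exports, standing in for the real old and new MPs.
OLD_MP = """<ManagementPack><Monitoring>
  <Discoveries><Discovery ID="Example.Server.Discovery" Enabled="true" /></Discoveries>
  <Rules><Rule ID="Example.Server.EventRule" Enabled="true" /></Rules>
</Monitoring></ManagementPack>"""

NEW_MP = """<ManagementPack><Monitoring>
  <Discoveries>
    <Discovery ID="Example.Server.Discovery" Enabled="true" />
    <Discovery ID="Example.SNMP.Discovery" Enabled="true" />
  </Discoveries>
  <Rules>
    <Rule ID="Example.Server.EventRule" Enabled="true" />
    <Rule ID="Example.SNMP.TrapRule" Enabled="true" />
  </Rules>
</Monitoring></ManagementPack>"""

for name in added_workflows(OLD_MP, NEW_MP):
    print(name)
```

For real exports you would read each file from disk (for example with `ET.parse(path).getroot()`) instead of using inline strings. A list like this only shows what is new; it is no substitute for reviewing the workflows themselves.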

After evaluating the two management packs in the Authoring Console it was apparent that the new management pack contains tons of new SNMP monitors that I don’t need.  Also, from talking to people in the community, my understanding is that the SNMP provider is not very scalable or robust.  So I set all of the SNMP-related discoveries to disabled and removed all rules related to SNMP.  After doing that and re-importing the management pack, the 4000 errors are gone and everything seems to be working much better.  I have posted my customized management pack at the beginning of this post.
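I made these changes by hand in the Authoring Console, but the same idea can be sketched as a script against the exported XML. This is a rough illustration only. It assumes that SNMP-related workflows can be recognized by "SNMP" appearing in their `ID`, that discoveries carry an `Enabled` attribute, and that display strings and knowledge articles point at a workflow through an `ElementID` attribute. The sample XML and all IDs in it are hypothetical.

```python
import xml.etree.ElementTree as ET

def strip_snmp(mp_xml, marker="SNMP"):
    """Disable discoveries and remove rules whose ID contains the marker."""
    root = ET.fromstring(mp_xml)
    needle = marker.lower()

    # Disable matching discoveries rather than deleting them.
    for discovery in root.iter("Discovery"):
        if needle in discovery.get("ID", "").lower():
            discovery.set("Enabled", "false")

    # Remove matching rules and remember their IDs.
    removed = set()
    for rules in root.iter("Rules"):
        for rule in list(rules):
            if rule.tag == "Rule" and needle in rule.get("ID", "").lower():
                removed.add(rule.get("ID"))
                rules.remove(rule)

    # Drop elements that still refer to a removed rule (assumed ElementID link).
    for parent in list(root.iter()):
        for child in list(parent):
            if child.get("ElementID") in removed:
                parent.remove(child)

    return ET.tostring(root, encoding="unicode")

# Hypothetical miniature management pack for demonstration.
SAMPLE_MP = """<ManagementPack>
  <Monitoring>
    <Discoveries>
      <Discovery ID="Example.SNMP.Discovery" Enabled="true" />
      <Discovery ID="Example.Server.Discovery" Enabled="true" />
    </Discoveries>
    <Rules>
      <Rule ID="Example.SNMP.TrapRule" Enabled="true" />
      <Rule ID="Example.Server.EventRule" Enabled="true" />
    </Rules>
  </Monitoring>
  <LanguagePacks>
    <LanguagePack ID="ENU">
      <DisplayStrings>
        <DisplayString ElementID="Example.SNMP.TrapRule"><Name>Trap</Name></DisplayString>
        <DisplayString ElementID="Example.Server.EventRule"><Name>Event</Name></DisplayString>
      </DisplayStrings>
    </LanguagePack>
  </LanguagePacks>
</ManagementPack>"""

print(strip_snmp(SAMPLE_MP))
```

Matching on the ID is a blunt instrument, so review what it catches before importing anything. Other elements that reference a removed rule (overrides, for example) would also need cleaning up or the import is likely to fail, and it is worth keeping a copy of the original file.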

3 Responses to My nightmare experience with the new Dell Management Pack 3.1 A01

  1. Jim September 24, 2008 at 2:21 pm #

    Already have that t-shirt…

  2. Marnix Wolf October 1, 2008 at 3:15 am #

    Hi Tim.

    Great job you’ve done! I have had an issue at a customer’s site where EventID 5300 happened on the RMS. With this event the HealthService stalls (on the RMS!) and the RMS state turns to grey. So no flow of information, no more monitoring. With the help of Microsoft Support we were able to find the culprit: the Dell Management Pack. Because we log everything we do, we were able to backtrack to the first time EventID 5300 was logged. It was a couple of days after we had imported many new MPs, one of which was the updated MP of Dell…

    After having removed this MP and the DRACs as well, SCOM hasn’t shown any EventID 5300 anymore.

    So all I can say is that I concur with your story about the new Dell MP hosing the RMS.

    Kind regards,
    Marnix Wolf

  3. Pete Zerger October 24, 2008 at 7:54 am #

    Tim,

    The 4000 error is a known issue resolved by KB951526, and the fix is also contained in a couple of newer hotfixes. You have to be sure your DLL versions are actually updated when you apply the fixes, as we have seen a couple of cases where this didn’t happen. I would guess this is the case in your situation, as that error definitely seems to be resolved by this fix.

    See http://www.systemcenterforum.org/critical-hotfixes-for-opsmgr-and-essentials-2007-sp1-and-how-to-verify-youre-up-to-date/
