Monday, June 22, 2015

Health Service Failure Runbook



Every time I apply a SCOM 2012 R2 update I go to the best source on the web, Kevin Holman’s Step by Step series.



Here is my own experience: the issues I encountered and, of course, the solution.

After running all SCOM server-related updates as per Kevin’s document, I was ready to update the managed agents waiting in the Pending Management pane.

Half updated with no issues, but the other half (200+ agents) generated the following error:


The Agent Management Operation Agent Install failed for remote computer xxxx
Install account: xxx\ScomAction
Error Code: 8007041D
Error Description: The service did not respond to the start or control request in a timely fashion.
Microsoft Installer Error Description:
For more information, see Windows Installer log file "C:\Program Files\Microsoft System Center 2012 R2\Operations Manager\Server\AgentManagement\AgentLogs\AgentInstall.LOG
C:\Program Files\Microsoft System Center 2012 R2\Operations Manager\Server\AgentManagement\AgentLogs\AgentPatch.LOG
C:\Program Files\Microsoft System Center 2012 R2\Operations Manager\Server\AgentManagement\AgentLogs\MOMAgentMgmt.log" on the Management Server.




The key words here are in the error description: “The service did not respond to the start or control request in a timely fashion.”
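As a side note, the error code itself says the same thing as the description. A minimal sketch, assuming the standard HRESULT bit layout (1 severity bit, an 11-bit facility and a 16-bit code), shows that 8007041D is a wrapped Win32 error 1053, which is the service request timeout. The function name `decode_hresult` is only an illustration and is not part of SCOM.

```python
def decode_hresult(code_hex: str) -> dict:
    """Split a 32-bit HRESULT into its severity, facility and code fields."""
    value = int(code_hex, 16)
    return {
        "failure": bool(value >> 31),       # top bit set means the call failed
        "facility": (value >> 16) & 0x7FF,  # 7 is FACILITY_WIN32
        "code": value & 0xFFFF,             # the underlying Win32 error number
    }


# Error code copied from the agent install failure above.
print(decode_hresult("8007041D"))
```

Facility 7 with code 1053 means the Windows Service Control Manager gave up waiting for the service to start, which matches the stopped HealthService described next.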

When I opened the services console on an affected agent, the Microsoft Monitoring Agent service (HealthService) was indeed stopped.

That, of course, generated a “Health Service Heartbeat Failure” alert, and if the agent was a Cluster or Domain Controller, a plethora of other Critical alerts came with it.

The funny part is that SCOM had successfully updated the agent but failed to restart the Health Service. That presented a challenge, since nothing can be done from within the SCOM console to resuscitate the now greyed-out agent.

A quick solution is to remotely start the Microsoft Monitoring Agent service, but doing that by hand is impractical on a population of 400+ agents.
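For a handful of servers, the manual fix can be scripted. The sketch below is not what I used in the Runbook; it simply wraps the Windows `sc.exe` utility, assuming the script runs on a Windows machine under an account with rights to control services on the target. The server name in the usage note is hypothetical.

```python
import subprocess

# Service name of the Microsoft Monitoring Agent, as shown in the services console.
SERVICE_NAME = "HealthService"


def build_start_command(computer: str, service: str = SERVICE_NAME) -> list:
    """Build the sc.exe command line that starts a service on a remote computer."""
    return ["sc.exe", f"\\\\{computer}", "start", service]


def start_remote_service(computer: str, timeout: int = 60) -> bool:
    """Ask the remote Service Control Manager to start the agent service.

    A zero exit code only means the start request was accepted; it does not
    prove that the service reached the Running state.
    """
    result = subprocess.run(
        build_start_command(computer),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode == 0
```

Calling `start_remote_service("SERVER01")` once per affected server works, but someone still has to collect the server names and re-run it after every failed update, which is why I automated the whole thing instead.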



The Solution:


I created a SCORCH 2012 R2 Runbook to start the Microsoft Monitoring Agent service.




Under the hood:

The Runbook listens for the ‘Health Service Heartbeat Failure’ alert.
It pings the server to ensure it has not been shut down or rebooted.
If the ping fails, an Information alert is created, mainly so it won’t interfere with the ‘Failed to Connect to Computer’ Critical alert that is generated immediately after.
If the ping is successful, we pass the information to the next activity, which starts the Microsoft Monitoring Agent service.
As the last step, we close the ‘Health Service Heartbeat Failure’ alert and write ‘Closed by SCORCH’ in Custom Field 1 as a success stamp.
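The branching in the steps above can be sketched in a few lines. This is only an illustration of the logic, not the Runbook itself: the real thing is built from Orchestrator activities, and every function name here is hypothetical. The ping, service start and alert actions are passed in as callables so the flow can be read (and tried) without a SCOM environment. Leaving the alert open when the service fails to start is my assumption for the sketch.

```python
from typing import Callable


def handle_heartbeat_failure(
    computer: str,
    ping: Callable[[str], bool],
    start_agent: Callable[[str], bool],
    raise_info_alert: Callable[[str], None],
    close_alert: Callable[[str, str], None],
) -> str:
    """Decide what to do with one 'Health Service Heartbeat Failure' alert."""
    if not ping(computer):
        # Server is down or rebooting: record it as Information only and
        # let the 'Failed to Connect to Computer' alert do its job.
        raise_info_alert(computer)
        return "unreachable"
    if not start_agent(computer):
        # Assumption: keep the alert open so a person can look at it.
        return "start_failed"
    # Stamp the alert so it is clear the automation closed it.
    close_alert(computer, "Closed by SCORCH")
    return "closed"


# Dry run with stub actions in place of real ping/service/alert calls.
log = []
outcome = handle_heartbeat_failure(
    "SERVER01",
    ping=lambda c: True,
    start_agent=lambda c: True,
    raise_info_alert=lambda c: log.append(("info", c)),
    close_alert=lambda c, stamp: log.append(("closed", c, stamp)),
)
print(outcome, log)
```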



Disclaimer: 
All software and information is provided “AS IS” with no warranties. Use at your own risk! Please test it in a Lab environment first!





