Fixing troubled agents

Sometimes agents either will not “talk” to the management server upon initial installation, and sometimes an agent can get unhealthy long after working fine.  Agent health is an ongoing task of any OpsMgr Admin’s life.

This post in NOT an “end to end” manual of all the factors that influence agent health…. but that is something I am working on for a later time.  There are so many factors in an agent’s ability to communicate and work as expected.  A few key areas that commonly affect this are:

  • DNS name resolution (Agent to MS, and MS to Agent)
  • DNS domain membership (disjointed)
  • DNS suffix search order
  • Kerberos connectivity
  • Kerberos SPN’s accessible
  • Firewalls blocking 5723
  • Firewalls blocking access to AD for authentication
  • Packet loss
  • Invalid or old registry entries
  • Missing registry entries
  • Corrupt registry
  • Default agent action accounts locked down/out (HSLockdown)
  • HealthService Certificate configuration issues.
  • Hotfixes required for OS Compatibility
  • Management Server rejecting the agent

 

How do you detect agent issues from the console?  The problem might be that they are not showing up in the console at all!  Perhaps they might be a manual install that never shows up in Pending Actions?  Or a push deployment, that stays stuck in Pending actions and never shows up under “Agent Managed”.  Or even one that does show up under “Agent Managed” but never shows as being monitored… returning agent version data, etc.

 

One of the BEST things you can do when faced with an agent health issue… if to look on the agent, in the OperationsManager event log.  This is a fairly verbose log that will almost always give you a good hint as to the trouble with the agent.  That is ALWAYS one of my first steps in troubleshooting.

 

Another way of examining Agent health – is by the built in views in OpsMgr.  In the console – there is a view – Located at the following:

 

image

 

 

This view is important – because it gives us a perspective of the agent from two different points:

1.  The perspective of the agent monitors running on the agent, measuring its own “health”.

2.  The perspective of the “Health Service Watcher” which is the agent being monitored from a Management Server”.

 

If any of these are red or yellow – that is an excellent place to start.  This should be an area that your level 1 support for Operations manager checks DAILY.  We should never have a high number of agents that are not green here.  If they aren’t – this is indicative of an unhealthy environment, or the admin team not adhering to best practices (such as keeping up with hotfixes, using maintenance mode correctly, etc…

Use Health Explorer on these views – to drill down into exactly what is causing the Agent, or Health Service Watcher state to be unhealthy.

 

Now…. the following are some general steps to take to “fix” broken agents.  These are not in definitive order.  The order of steps really comes down to what you find when looking at the logs after taking these steps.

 

  • Start the HealthService on the agent.  You might find the HealthService is just not running.  This should not be common or systemic.  Consider enabling the recovery for this condition to restart the HealthService on Heartbeat failure.  However – if this is systemic – it is indicative of something causing your HealthService to restart too frequently, or administrators stopping SCOM.  Look in the OpsMgr event log for verification.

 

  • Bounce the HealthService on the agent.  Sometimes this is all that is needed to resolve an agent issue.  Look in the OpsMgr event log after a HealthService restart, to make sure it is clean with no errors.

 

  • Clear the HealthService queue and config (manually).  This is done by stopping the HealthService.  Then deleting the “\Program Files\System Center Operations Manager 2007\Health Service State” folder.  Then start the HealthService.  This removes the agent config file, and the agent queue files.  The agent starts up with no configuration, so it will resort to the registry to determine what management server to talk to.  From the registry – it will find out if it is AD integrated, or a fixed management server to talk to if not.  This is located at HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Agent Management Groups\PROD1\Parent Health Services\ location, in the \<#>\NetworkName string value.  The agent will contact the management server – request config, receive config, download the appropriate management packs, apply them, run the discoveries, send up discovery data, and repeat the cycle for a little while.  This is very much what happens on a new agent during initial deployment.

 

  • Clear the HealthService queue and config (from the console).  When looking at the above view (or any state view or discovered inventory view which targets the HealthService or Agent class) there is a task in the actions pane – “Flush Health Service State and Cache”.  This will perform a very similar action to that above…. as a console task.  This will only work on an agent that is somewhat responsive…. if it does not work you need to perform this manually as the agent is really broken from communication with the management server.  This task will never complete, and will not return success – because the task breaks off from itself as the queue is flushed.

 

  • “Repair” the agent from the console.  This is done from the Administration pane – Agent Managed.  You should not run a repair on any AD-integrated agent – as this will break the AD integration and assign it to the management server that ran the repair action.  A “repair” technically just reinstalls the agent in a push fashion, just like an initial agent deployment.  It will also apply/reapply any agent related hotfixes in the management server’s \Program Files\System Center Operations Manager 2007\AgentManagement\ directories.

 

  • Reinstall the agent (manually).  This would be for manual installs or when push/repair is not possible.  This section is where the combination of options gets a little tricky.  When you are at this point… where you have given up, I find just going all the way with a brute force reinstall is the best way.  This means performing the following steps:
    • Uninstall the agent via add/remove programs.
    • Run the Operations Manager Cleanup Tool CleanMom.exe or CleanMOM64.exe.  This is designed to make sure that the service, files, and all registry entires are removed.
    • Ensure that the agent’s folder is removed at:  \Program Files\System Center Operations Manager 2007\
    • Ensure that the following registry keys are deleted:
      • HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager
      • HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\HealthService
    • Reboot the agent machine (if possible)
    • Delete the agent from Agent Managed in the OpsMgr console.  This will allow a new HealthService ID to be detected and is sometimes a required step to get an agent to work properly, although not always required.
    • Now that the agent is gone cleanly from both OpsMgr console and the agent Operating System…. manually reinstall the agent.  Keep it simple – install it using a named management server/management group, and use Local System for the agent action account (these will remove any common issues with a low priv domain account, and AD integration if used)  If it works correctly – you can always reinstall again using low priv or AD integration.
    • Remember to import certificats at this point if you are using those on the individual agent.
    • As always – look in the OperationsManager event log…. this will tell you if it connected, and is working, or if there is a connectivity issue.

 

To summarize…. there are many things that can cause an agent issue, and many methods to troubleshoot.  However – to summarize at a very general level, my typical steps are:

  1. Review OpsMgr event log on agent
  2. Bounce HealthService
  3. Bounce HealthService clearing \Health Service State folder.
  4. Complete brute force reinstall of the agent.

If it an external issue is causing the issue (DNS, Kerberos, Firewall) then these steps likely will not help you…. but those should be available from the OpsMgr event log.

Advertisements
This entry was posted in Misc Stuff, SCOM Agents and tagged . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s