Any time we see newly created work items stuck with a “New” status, it generally means that the Service Manager workflows are not processing or are processing slowly. Monitoring the “Minutes Behind” of each workflow can be a useful method of troubleshooting.
The following web page has good troubleshooting tips and a SQL query that can be used to display the “Minutes Behind” of each workflow. A similar query, written for SQL Server Management Studio, is shown below:
-- Use ServiceManager
-- Select Name, is_broker_enabled from sys.databases Where name = 'ServiceManager'
-- The line above checks is_broker_enabled, which must be 1 or some functionality will not run. Confirm is_broker_enabled is set to 1.
-- The select above is commented out because it is not directly related to the purpose of this blog posting.
-- SubscriptionStatus.sql -- Workflow / subscription status
Use ServiceManager
DECLARE @MaxState INT, @MaxStateDate Datetime, @Delta INT, @Language nvarchar(3)
SET @Delta = 0
SET @Language = 'ENU'
SET @MaxState = (
    SELECT MAX(EntityTransactionLogId)
    FROM EntityChangeLog WITH(NOLOCK)
)
SET @MaxStateDate = (
    SELECT TimeAdded
    FROM EntityTransactionLog
    WHERE EntityTransactionLogId = @MaxState
)
SELECT
    LT.LTValue AS 'Display Name',
    S.State AS 'Current Workflow Watermark',
    @MaxState AS 'Current Transaction Log Watermark',
    DATEDIFF(mi, (SELECT TimeAdded
                  FROM EntityTransactionLog WITH(NOLOCK)
                  WHERE EntityTransactionLogId = S.State), @MaxStateDate) AS 'Minutes Behind',
    S.EventCount,
    S.LastNonZeroEventCount,
    R.RuleName AS 'MP Rule Name',
    MT.TypeName AS 'Source Class Name',
    S.LastModified AS 'Rule Last Modified',
    S.IsPeriodicQueryEvent AS 'Is Periodic Query Subscription', -- Note: 1 means it is a periodic query subscription
    R.RuleEnabled AS 'Rule Enabled', -- Note: 4 means the rule is enabled
    R.RuleID
FROM CmdbInstanceSubscriptionState AS S WITH(NOLOCK)
    LEFT OUTER JOIN Rules AS R
        ON S.RuleId = R.RuleId
    LEFT OUTER JOIN ManagedType AS MT
        ON S.TypeId = MT.ManagedTypeId
    LEFT OUTER JOIN LocalizedText AS LT
        ON R.RuleId = LT.MPElementId
WHERE
    S.State <= @MaxState - @Delta
    AND R.RuleEnabled <> 0
    AND LT.LTStringType = 1
    AND LT.LanguageCode = @Language
    AND S.IsPeriodicQueryEvent = 0
/* To look at a specific workflow, uncomment one of the following */
-- AND LT.LTValue LIKE '%Test%'
-- AND S.RuleId='1D74409B-B2D9-8C45-6702-AB8C94AA0694' -- aka Display Name="New Change Request Workflow"
ORDER BY S.State Asc
Troubleshooting Workflow Performance and Delays
We run the above SQL query several times, waiting a few minutes between executions, to see how the “Minutes Behind” value changes for each workflow. We scroll to the bottom of the list to see how many workflows are in the normal range on each execution; 2 minutes or less is normal. The questions to answer are:
- Is the “Minutes Behind” static?
- Is the “Minutes Behind” static for only a few workflows? If so, the workflow may be disabled, a management pack override may be disabling it even though it shows up as enabled, or it may be a custom workflow that is not working properly.
- Is the “Minutes Behind” continuously increasing for all workflows or only for some of them?
- Are all the workflows impacted (more than 2 minutes behind)?
- Is the “Minutes Behind” continuously increasing, or does it go down on occasion?
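The questions above can be turned into a small classification helper. The following is a minimal sketch, not part of the original query: the function name `Get-MinutesBehindTrend` and its labels are hypothetical, and it assumes you have noted the “Minutes Behind” value of one workflow from several successive executions of the SQL query.

```powershell
# Hypothetical helper (not part of Service Manager): classifies successive
# "Minutes Behind" readings of ONE workflow, oldest reading first.
function Get-MinutesBehindTrend {
    param(
        [Parameter(Mandatory)]
        [int[]]$Samples,

        # 2 minutes or less is treated as normal, as described above.
        [int]$NormalThreshold = 2
    )

    $max = ($Samples | Measure-Object -Maximum).Maximum
    if ($max -le $NormalThreshold) { return 'Normal' }

    $wentDown = $false
    $wentUp   = $false
    for ($i = 1; $i -lt $Samples.Count; $i++) {
        if ($Samples[$i] -lt $Samples[$i - 1]) { $wentDown = $true }
        if ($Samples[$i] -gt $Samples[$i - 1]) { $wentUp   = $true }
    }

    if ($wentDown) { return 'Fluctuating' }   # goes down on occasion
    if ($wentUp)   { return 'Increasing' }    # never goes down, keeps growing
    return 'Static'                           # same value on every execution
}

Get-MinutesBehindTrend -Samples 45, 45, 45   # Static
Get-MinutesBehindTrend -Samples 45, 52, 58   # Increasing
Get-MinutesBehindTrend -Samples 45, 30, 38   # Fluctuating
Get-MinutesBehindTrend -Samples 0, 1, 2      # Normal
```

A workflow that comes back as Fluctuating points to the load-related troubleshooting on the linked web page, while Static or Increasing across nearly all workflows is the situation this blog addresses.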
The solution in this blog is intended for cases where 98% or more of the workflows show a “Minutes Behind” value that is static or continuously increasing over time. If the “Minutes Behind” goes up and down as you execute the SQL query over and over, the troubleshooting steps in the web link above, “Troubleshooting Workflow Performance and Delays”, are more appropriate. Below is the list of common issues and solutions that we see from time to time on the Microsoft support lines when 98% or more of the workflows show a “Minutes Behind” value that is static or continuously increasing over time:
LIST OF ISSUES / SOLUTIONS:
– Most of the time the issue is resolved within a minute by stopping the System Center services on the primary management server, deleting the “Health Service State” folder, and then restarting the services.
There are probably several causes; however, the most common is that the SQL server was restarted and the Service Manager services timed out trying to reach it. The following steps can be used to reduce the time it takes to stop the services, delete the “Health Service State” subfolder, and restart the services. The best way to prevent this problem is to put a process in place to stop the Service Manager services before applying updates to the SQL server, and any other time the Service Manager SQL server is restarted. Once the SQL service has been up and running for 5 minutes, restart the Service Manager services.
## Ideal stopping order:
Stop-Service HealthService ; Stop-Service OMCFG; Stop-Service OMSDK
Get-Service HealthService,omcfg,omsdk;
## You can use the following to open the Service Manager folder
## From the Service Manager folder delete or rename the "Health Service State" subfolder
$SMFolder = (Get-ItemProperty "HKLM:\SOFTWARE\Microsoft\System Center\2010\Common\Setup").InstallDirectory
Start $SMFolder
## Ideal starting order (reversed from stopping)
Start-Service OMSDK ; Start-Service OMCFG; Start-Service HealthService
Get-Service HealthService,omcfg,omsdk;(Get-date).ToString()
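If you prefer to rename the subfolder from PowerShell instead of from the Explorer window, something along these lines could work. This is only a sketch under stated assumptions: both function names are hypothetical, the three services must already be stopped, and the folder path you pass in is the Service Manager install directory read from the registry as in the script above.

```powershell
# Hypothetical helper: builds a timestamped backup name for a folder.
function Get-BackupFolderName {
    param(
        [Parameter(Mandatory)][string]$Name,
        [datetime]$Timestamp = (Get-Date)
    )
    '{0}.old-{1:yyyyMMdd-HHmmss}' -f $Name, $Timestamp
}

# Hypothetical wrapper: renames "Health Service State" under the given
# Service Manager folder. Assumes HealthService, OMCFG and OMSDK are stopped.
function Rename-HealthServiceState {
    param([Parameter(Mandatory)][string]$SMFolder)

    $state = Join-Path -Path $SMFolder -ChildPath 'Health Service State'
    if (-not (Test-Path -LiteralPath $state)) {
        Write-Warning "Folder not found: $state"
        return
    }

    $newName = Get-BackupFolderName -Name 'Health Service State'
    Rename-Item -LiteralPath $state -NewName $newName
    Write-Output "Renamed to $newName"
}

# Usage (after stopping the services, before starting them again):
#   Rename-HealthServiceState -SMFolder $SMFolder
```

Renaming rather than deleting keeps the old folder around in case you want to inspect it later; it can be removed once the workflows are caught up.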
– The “Microsoft Monitoring Agent” in Control Panel should not have any management server listed on the Service Manager primary management server, or other Service Manager management servers. If you have a server listed in the “Microsoft Monitoring Agent Properties” it should be removed and the option “Automatically update management group assignments from AD DS” should be unchecked.
The guide for the Microsoft System Center Management Pack for System Center Service Manager says, under “Mandatory Configuration” on page 6:
“…You should also ensure that the Service Manager management servers are configured for agentless monitoring…”
I have seen customers run the SCOM agent locally anyway. Sometimes it works for a long time, and then comes the hair pulling. Do not be tempted: running the SCOM agent locally will, on rare occasions, cause unexpected behavior.
If the “Minutes Behind” is not changing or increasing over time
The items below are unlikely to help if only some workflows are behind, or if the “Minutes Behind” for the workflows goes down and up. It is normal to have workflows at 0, 1, or 2 minutes. If the “Minutes Behind” goes down at times, there is likely a SQL load issue, as mentioned in “Troubleshooting Workflow Performance and Delays”. If the “Minutes Behind” is static or increasing over time, please review the possible solutions below:
– The workflows only run from the Service Manager primary management server. Execute the following SQL query against the ServiceManager database and confirm the primary management server name. Is the server up and running?
-- Display the primary management server
Use ServiceManager
select DisplayName, [PrincipalName]
from [MTV_Computer]
where [BaseManagedEntityId] = (
    SELECT ScopedInstanceId
    FROM dbo.[ScopedInstanceTargetClass]
    WHERE [ManagedTypeId] = dbo.fn_ManagedTypeId_MicrosoftSystemCenterWorkflowTarget()
)
– Are the services running on the primary management server? From an elevated PowerShell prompt:
PS C:\> Get-Service HealthService,omcfg,omsdk

Status   Name               DisplayName
------   ----               -----------
Running  HealthService      Microsoft Monitoring Agent
Running  omcfg              System Center Management Configuration
Running  omsdk              System Center Data Access Service
– The “HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\MOMBins\Value1” registry value is required to connect to the SQL database. The encryption key in Value1 must also match both the SQL Server database it was generated from and the management server's FQDN. This means that neither the computer name of the Service Manager management server nor the domain it belongs to can be changed.
– Is the primary management server listed in the MT_HealthService table in SQL?
-- Display Service Manager management servers
Use ServiceManager
Select * from MT_HealthService
Is the primary management server listed in MT_HealthService? If not, the Windows Computer object for the primary Service Manager management server was deleted. This is rare; however, customers sometimes accidentally delete the Windows Computer object for the management server using PowerShell or via the Service Manager Console under “All Windows Computers”. If it was deleted via the GUI, it should still exist until the items are cleared from “Deleted Items” in the Service Manager Console. If it is missing, promote a secondary Service Manager management server to primary management server. If no management servers are present in the MT_HealthService SQL table, the ServiceManager database must be restored and existing tickets have to be recreated. Attempting to restore just the MT_HealthService table will not work. The Microsoft development team has confirmed that when the Windows Computer object of a Service Manager management server is deleted, many other interrelated changes occur in the ServiceManager database, which is why the whole database must be restored.
– If the password of the workflow account has been changed, even if it has been changed back, reset the password in the Service Manager Console to see if that corrects the workflow problem.
Reset / retype the stored password of the Service Manager Workflow account using the following steps:
Service Manager Console > Administration > Administration > Security > Run as Accounts
Then double-click the account and type in the password.
– Service Account Authentication problem or SCSM workflow account authentication problem:
Log Name: Operations Manager
Source: HealthService
Event ID: 7000
Level: Error
Description:
The Health Service could not log on the RunAs account CONTOSO\SvcMgrWork for management group ServiceMgmtGroup. The error is The user name or password is incorrect.(1326L). This will prevent the health service from monitoring or performing actions using this RunAs account.

Log Name: Operations Manager
Source: HealthService
Event ID: 7000
Level: Error
Description:
The Health Service cannot verify the future validity of the RunAs account CONTOSO\SvcMgrWork for management group ServiceMgmtGroup. The error is The user name or password is incorrect.(1326L).
The causes vary: the account has been deleted from Active Directory, the password has expired, the account is disabled, or the clocks of the systems differ by more than 5 minutes, causing Kerberos authentication failures. From the Service Manager primary management server, you can run the following from an elevated PowerShell prompt against the System event log; it might confirm a Kerberos problem. You may need to re-enter the SCSM workflow account under Service Manager Console > Administration > Administration > Security > User Roles.
Get-WinEvent -Logname system | ?{$_.Message -like "*KRB_AP_ERR_MODIFIED*"}
The following can be used on different systems to determine whether the time difference is near the 5-minute limit, replacing DomainControllerServerName with the name of your DC:
w32tm.exe /stripchart /computer:DomainControllerServerName
– Check if the PID of the HealthService service is changing often. This would indicate that the service is crashing and then restarting.
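One way to watch the PID over time is sketched below. The two function names are hypothetical, and the sample count and interval are arbitrary defaults; the sketch reads the process ID from the `Win32_Service` class through `Get-CimInstance`, so it has to run on the primary management server itself.

```powershell
# Hypothetical helper: counts how often the PID changed between
# consecutive samples. More than zero suggests the service restarted.
function Get-PidChangeCount {
    param([Parameter(Mandatory)][int[]]$ProcessIds)

    $changes = 0
    for ($i = 1; $i -lt $ProcessIds.Count; $i++) {
        if ($ProcessIds[$i] -ne $ProcessIds[$i - 1]) { $changes++ }
    }
    $changes
}

# Hypothetical wrapper: samples the HealthService PID at a fixed interval
# and returns the number of PID changes seen. A PID of 0 means the
# service was not running when that sample was taken.
function Watch-HealthServicePid {
    param(
        [int]$SampleCount = 5,
        [int]$IntervalSeconds = 60
    )

    $samples = @()
    for ($n = 1; $n -le $SampleCount; $n++) {
        $svc = Get-CimInstance -ClassName Win32_Service -Filter "Name='HealthService'"
        $samples += [int]$svc.ProcessId
        Write-Host ('{0:HH:mm:ss}  PID {1}' -f (Get-Date), $svc.ProcessId)
        if ($n -lt $SampleCount) { Start-Sleep -Seconds $IntervalSeconds }
    }
    Get-PidChangeCount -ProcessIds $samples
}

# Usage on the primary management server:
#   Watch-HealthServicePid -SampleCount 10 -IntervalSeconds 30
```

A result of 0 over a reasonable observation window means the PID stayed the same for every sample, so the service was not restarting during that time.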
Lastly, if your workflows start running properly, with “Minutes Behind” at 0, 1, or 2 minutes, then new work items should work as expected. In some cases, previous work items may need to have their status reset with PowerShell.
Search keywords:
Workitem status not updating
Workitem stuck on new
Workitem status not changing
Incident status not changing
Service Request status not changing
Change Request status not changing
- Austin Mack, Sr. Support Escalation Engineer, Microsoft