Check For Non Running Distribution Agents
Today I am going to share with you a SQL Script that can help monitor the status on your distribution agents in a snapshot/transactional replication topology.
Despite their being some built-in monitoring for replication agents in the Replication Monitor, one that seems to be missing is checking for the agents updating the distributor. Sure there’s ones for agents shutting down for whatever reason, but sometimes latency can be reported as “excellent” for a subscriber despite not having updated the distributor for some time. And because it has not updated the distributor, there’s no calcs for transactional latency, which means that whilst on the surface everything is fine, you are slowly heading towards a real headache with replication.
And, despite the SQL Replication Agent being set to RETRY for 4085 Years, occasionally we find that agents either succeed successfully and stop, of just stop after failures. This in turn affects the cleanup job, and this in turn affects replication to subscribers, and so on until replication becomes a real headache to fix. So to alleviate this, below is a job that will check for any agents that have not updated the distributor_history table in the distributor database for the past 15 minutes. It will then send an email out of all agents not updated, and also log a message and all affected agents to the event viewer with a custom error number. Add this script to a job and run every 15 minutes. You may want to fine-tune the frequency; eg you could run the job every 15 minutes but check for any agents that have not updated in an hour.
-- Email settings DECLARE @profile_name VARCHAR(100) = '_MailProfile' DECLARE @strsubject VARCHAR(100) DECLARE @tableHTML NVARCHAR(max) DECLARE @EmailRecipients VARCHAR(max) = '[email protected]' CREATE TABLE #agentHistory ( ServerName NVARCHAR (64) ,Publication NVARCHAR (64) ,AgentName NVARCHAR(512) ,LastMsgTime DATETIME ) ;WITH CTE ( server_name,publication,agent_name,last_update) AS ( SELECT srvr.srvname ,agent.publication ,agent.NAME ,MAX(history.TIME) FROM MSdistribution_agents agent INNER JOIN msdistribution_history history ON history.agent_id = agent.id INNER JOIN sys.sysservers srvr on srvr.srvid = agent.subscriber_id GROUP BY srvr.srvname,agent.publication,agent.NAME ) INSERT INTO #agentHistory SELECT server_name,publication,agent_name,last_update from CTE WHERE last_update < dateadd(MINUTE, - 30, getdate()) IF EXISTS ( SELECT * FROM #agentHistory ) BEGIN SELECT @strsubject = 'Replication Agent Alert ' + convert(VARCHAR(17), getdate(), 113) + ' ***' SET @tableHTML = N'<H1>Replication Agent has not Logged a Message</H1>' + N'<h3>One or more agents have not logged an update in the past hour</h3>' + N'<table border="1">' + N'<tr><th>Server Name</th>'+ N'<th>Publication</th>'+ N'<th>Agent Name</th>' +N'<th>Last Updated Time</th>' + '</tr>' + cast(( SELECT td = AH.ServerName ,'' ,td = AH.Publication ,'' ,td = AH.AgentName ,'' ,td = AH.LastMsgTime ,'' FROM #agentHistory AH FOR XML path('tr') ,type ) AS NVARCHAR(max)) + N'</table>' + CHAR (10) + CHAR (13) + '<p> In order to resolve this, log onto the server, start up SQL Server and find the job under "SQL Server Agent -> Jobs". If job is stopped, right click and select "start job at step". The job will start running. Do not wait for it to complete as the job runs continuously. If the same jobs alerts again raise the issue with the DBA Team. ' EXEC msdb.dbo.sp_send_dbmail @recipients = @EmailRecipients ,@subject = @strsubject ,@body = @tableHTML ,@body_format = 'HTML' ,@profile_name = @profile_name END DECLARE @COUNT INT DECLARE @PostLog BIT = 0 DECLARE @@SERVER_NAME VARCHAR (128) DECLARE @@AGENTNAME VARCHAR (128) = '' DECLARE @@PUBLICATION VARCHAR (128) = '' DECLARE @@MESSAGE varchar(2000) DECLARE @@MSG_ARRAY varchar (2000) = '' SELECT @COUNT = COUNT (*) FROM #agentHistory SELECT @PostLog = CASE WHEN @COUNT > 0 THEN 1 ELSE 0 END WHILE @COUNT > 0 BEGIN SELECT @@AGENTNAME = MAX (Ah.AgentName) FROM #agentHistory AH SELECT @@MESSAGE = 'Agents have not logged a message. Log on to server, open SQL Server Management Studio and start job if not running. If job continues to alert that it has stopped then contact DBA Team. If job is running then monitor and close ticket when agents no longer alert.' SELECT @@MSG_ARRAY =@@AGENTNAME + ' '+ CHAR(10) + CHAR(13) + @@MSG_ARRAY + ' ' SET @COUNT = @COUNT - 1 DELETE FROM #agentHistory where AgentName = @@AGENTNAME SELECT @@MESSAGE = @@MESSAGE + CHAR(10) + CHAR(13) +@@MSG_ARRAY END IF @PostLog = 1 BEGIN USE master EXEC xp_logevent 68320, @@MESSAGE, error END DROP TABLE #agentHistory