When it comes to monitoring replication, using the Replication Monitor.exe gives you a good view of what is happening now. But unless you have someone watching Replication Monitor 24/7, it's generally not a good idea to rely on it. In fact, its best to avoid running Replication Monitor too often. Instead, you can use the information stored in your distribution database to get information such as agents hitting latency thresholds and a raise in undistributed transactions. Using the information stored in the distribution database ensures that monitoring is centralised. (I'm aware that there is a built in check for replication exceeding a latency threshold, however I have seen a high number of undistributed commands where in fact there were none, and false positives can be frustrating.)
Simple checks like canary tables do sound easy enough, but only if your replication topology is small: if you have hundreds of subscribers and dozens of publications, this approach can be come cumbersome to manage. I also really like this solution to monitor agents statuses and latency posted on Simple Talk, but again there is the question of scale and manageability, not to mention relying on OPENROWSET connections to get the status of subscribers.
But at any rate, high latency is only one way of monitoring replication. As mentioned, checking for high levels of undistributed commands shows that there is an issue with replication. High latency and high levels of undistributed commands are not mutually exclusive; you could have a normal level of undistributed commands but poor latency, which suggests a network issue or maybe the subscribers are struggling to consume the commands because of high load on their servers.
To ensure you're only alerted when there is a genuine issue which requires intervention, you need to get baselines of your replication topology. And to get good baselines for undistributed transactions, I found it necessary to store different baselines of undistributed commands across publications, articles, and across the entire distribution database. These queries were run just before the “distribution clean up” job was ran so that we got a good idea of our workload. The results were stored in tables on tempdb. Then it was just a case of setting threshold alerts in a custom SQL job at a level where we needed to be alerted that we were under considerable load, and a further threshold for investigating the undistributed commands urgently.
How long you measure the baselines for determines on your workload, as you may have a heavy “month end” time where replication is under considerable load.
The first query breaks down the undistributed commands by article and publication per hour
--breakdown by article and publication by hour SELECT a.article, CAST (t.entry_time as date) as StartDate, DATEPART (Hour,t.entry_time) as OnHour, COUNT (1) FROM MSrepl_commands cmds INNER JOIN MSarticles a ON a.article_id = cmds.article_id INNER JOIN MSpublications p ON p.publication_id = a.publication_id INNER JOIN MSrepl_transactions t ON cmds.xact_seqno = t.xact_seqno GROUP BY a.article ,CAST (t.entry_time as date) ,DATEPART (Hour,t.entry_time) ORDER BY DATEPART (Hour,t.entry_time)
This query gets the total count of the undistributed commands in the entire distribution database
--breakdown by hour of entire distribution database SELECT CAST (t.entry_time as date) as StartDate, DATEPART (Hour,t.entry_time) as OnHour, COUNT (1) FROM MSrepl_commands cmds INNER JOIN MSarticles a ON a.article_id = cmds.article_id INNER JOIN MSrepl_transactions t ON cmds.xact_seqno = t.xact_seqno GROUP BY DATEPART (Hour,t.entry_time) ,CAST (t.entry_time as date) ,DATEPART (Hour,t.entry_time) ORDER BY DATEPART (Hour,t.entry_time)
The query below breaks down the undistributed commands by publication.
--by publication SELECT p.publisher_db ,p.publication ,COUNT (1) FROM MSrepl_commands cmds INNER JOIN MSarticles a ON a.article_id = cmds.article_id INNER JOIN MSrepl_transactions t ON cmds.xact_seqno = t.xact_seqno INNER JOIN MSpublications p on p.publication_id = a.publication_id GROUP BY p.publisher_db, p.publication
Remember that the “distribution clean up” job and the frequency which you run that can have an impact on your baselines, particularly if you have set the retention period for distributed commands to 0. Also, sometimes the “distribution clean up” job can fail as it is chosen as the deadlock victim when the job is running whilst many distribution agents are reading the log. It's worth noting that this is also a sign that replication is under load. Worse still is if the clean up job is running for a long time: If you do ever notice that the clean up job has been running for a long time (say, greater than 5 minutes) and the number of undistributed commands are high, then it's best to stop the clean up job and let the commands be distributed out before letting the job run again.