Tag Archives: error

SQL Holmes: The Case of the Fist-fighting Log Readers

And the Case of the Fist-Fighting Log Readers

Every once in a blue moon, you run across a problem that no one has seen before. Sometimes you can’t find anything all. Sometimes, you can only find unanswered forum questions. Sometimes, you find the worst thing imaginable: a forum post with someone replying to their own question with “Nevermind. Fixed it.” (HOW DID YOU FIX IT YODARULES1971?!?)

We had a similar experience a while back. Allow me to take you through it.

It started with a single, innocuous alert from one of our SQL Server Replication distributors:

DESCRIPTION: Replication-Replication Transaction-Log Reader Subsystem: agent SOMESERVER-SomeDB-6 failed. The process could not execute 'sp_replcmds' on 'SOMESERVER'.

Note: If you don’t have SQL Server Alerts set up on your instances, then you really really should.

We had two databases in this instance that were replicated. One was chugging along just fine, the other was giving the old log reader chestnut:

The process could not execute 'sp_replcmds' on 'SOMESERVER'. (Source: MSSQL_REPL, Error number: MSSQL_REPL20011) Get help: http://help/MSSQL_REPL20011

Experience has taught me that the most likely issues are the following:
1. Somebody changed something (Troubleshooting 101)
2. The owners of the databases involved in replication are wrong (usually not ‘sa’)
3. A database trigger somewhere was trying to make a change cross-database.

“I saw some smoke coming from the barn publication.”  -SQL Replication Monitor

Number 1 had happened, as we had deployed schema changes just prior to everything going south. Two and three however, did not. We could see the error in the replication log. It was trying to parse a Primary Key change that we had made. There was nothing obviously wrong with the command, so we moved on.

And Then We Tried to Fix It

After a few hours trying to troubleshoot, restarting the log readers, crying for our mommies, yadda yadda, the call was made that we would just reinitialize the publication. Ye ole’ replication wreck-n-restart.

It didn’t work. The publication would not reinitialize, failing with the same log reader error as before. This make sense as the log reader is shared among all the publications on an instance. It just couldn’t get over the command that couldn’t be parsed.

With few options left, we did the no-no and reset the log reader.

EXEC sp_replrestart

This is an internal command, run on the publisher, that is used when you need to restore a transactionally replicated database. It basically resets the LSN on the distributor to the highest value on the publisher, for our purposes skipping over the LSN/command that was causing the log reader’s issues.

This is going to be my costume for Halloween next year.

And it worked. Mostly. Replication started to flow again, but we needed to re-initialize the subscribers properly. Re-init was necessary as we’d just skipped all transactions that had accumulated between the time the issue occurred and the current time.

Yes, But What Does It MEAN?!?!

So, we had “resolved” the issue, but we still had no idea what the “issue” really was. After we had righted the ship, applied all dropped indexes at the subscriber we circled back around to determine root cause and why no one else seemed to have any idea what would cause this.

We went to our Junior DBA (Google) and got a pretty narrow range of responses, including the ones I listed before. No one seemed to be having the same issue as us, even with similar errors. So, we had to start at the bottom and work to the top.

Combing through the log reader error logs, one error was different than the others:

Cannot find an object ID for the replication system table 'cdc.change_tables'. Verify that the system table exists and is accessible by querying it directly. If it does exist, stop and restart the Log Reader Agent; if it does not exist, drop and reconfigure replication. (Source: MSSQLServer, Error number: 18807)

Well, that’s different. CDC is enabled on this database and used on quite a few tables. We checked, and the table cdc.change_tables existed; we wouldn’t get very far in life without it. The next step was to look at the CDC error log which returns the last 64 errors encountered.

SELECT * FROM sys.dm_cdc_errors

And in the sys.dm_cdc_errors table, we found this error:

Log scan process failed in processing a ddl log record. Refer to previous errors in the current session to identify the cause and correct any associated problems.

That seems pretty familiar. It was preceded by the following three errors:

Invalid length parameter passed to the RIGHT function.
Log Scan process failed in processing log records. Refer to previous errors in the current session to identify the cause and correct any associated problems.
Log scan process failed in processing a ddl log record. Refer to previous errors in the current session to identify the cause and correct any associated problems.

Different CDC, but I could see why you would be confused.

This points pretty squarely to CDC as the culprit. It uses a log reader with replication to replicate commands for capturing in CDC. Apparently, if it gets hung up, replication panics and just starts punching itself in the face. It’s an interesting design choice.

Way #938,308,121 To Break Replication

A few weeks later, when deploying more schema changes the issue resurfaced. Replication dive-bombed, we got the sames alert and everything came to a halt. We simply went in and blew away CDC for the entire database. This may not be a choice for your environment, but we needed to nuke it from orbit, just be sure.

USE SomeDB
GO
EXEC sys.sp_cdc_disable_db
GO

And after executing sys.sp_cdc_disable_db, the issue went away. Success?

We haven’t re-implemented CDC on this database yet, as there are a lot more decisions to be made business-wise, so we don’t know if there’s something internally broken or what. At some point we will need to turn it back on, but when and to what degree (or fallout) isn’t clear.

What is clear is is that something in CDC goofed and it took replication down with it. Moral of the story is: If replication is having issues, make sure CDC isn’t having issues as well. Also, use CDC sparingly. Don’t just throw it on every table you have “just ’cause.”

-CJ Julius

A Funny Thing Happened on my way to enable Management Data Warehouse

Three Job Schedules walk into a bar…
uh-oh
uh-oh.

Management Data Warehouse is a neat tool for collecting data on your instances with regards to performance, disk usage, and anything else related to SQL Server you might want to know. I’ve recently been rolling it out to various servers in test environments to get a good handle on all that it can do, and maybe more importantly, how it’s going to affect performance.

When I was satisfied that it was working the way I wanted in these environments, I started to push it to what you could call QA. Immediately I was hit with litany of errors, all of which said basically the same thing:

Caught error#: 14684, Level: 16, State: 1, in Procedure: sp_syscollector_create_collection_set, Line: 203, with Message: Caught error#: 14262, Level: 16, State: 1, in Procedure: sp_syscollector_create_collection_set, Line: 80, with Message: The specified @schedule_name ('CollectorSchedule_Every_30min') does not exist. (Microsoft SQL Server, Error: 14684)

Um, what? Of course it doesn’t exist, I’m installing it now. Just to verify, I did some poking around, and I found that it was indeed attempting to add a schedule that didn’t exist (as you would expect) to the collection set. After a while my Jr DBA (read: Google) turned up this article on removing MDW which says, in no uncertain terms:

The fact is that, while these schedules may look like they were created by MDW, they are created by default in a default installation of SQL Server 2008 or above. You can delete these easily, if you really want to, and if you are sure you will never enable MDW again. I would still suggest scripting the schedules and saving those scripts to your file system or source control so that you can recover them in the future.

I took a look back on the TEST instances where I was previously successful and found these schedules existed both on instances where MDW had been installed, and in places where it had not. These were part of the defaut SQL Server install and had been manually removed. Whoops.

As far as I know, there isn’t any direct way to script out schedules, so I could port them to the boxes where they had been removed. To get around this, I added them to the collector set jobs and then scripted that. Then I was able to change a few variables (and remove the schedule_uid) so that this:

EXEC @ReturnCode = msdb.dbo.sp_add_jobschedule @job_id=@jobId, @name=N'CollectorSchedule_Every_15min',
@enabled=0,
@freq_type=4,
@freq_interval=1,
@freq_subday_type=4,
@freq_subday_interval=15,
@freq_relative_interval=0,
@freq_recurrence_factor=0,
@active_start_date=20080709,
@active_end_date=99991231,
@active_start_time=0,
@active_end_time=235959,
@schedule_uid=N'SOME-GUID-HERE'

became this:

EXEC msdb.dbo.sp_add_schedule @schedule_name=N'CollectorSchedule_Every_15min',
@enabled=0,
@freq_type=4,
@freq_interval=1,
@freq_subday_type=4,
@freq_subday_interval=15,
@freq_relative_interval=0,
@freq_recurrence_factor=0,
@active_start_date=20080709,
@active_end_date=99991231,
@active_start_time=0,
@active_end_time=235959
GO

I ran these add_schedules on the QA boxes, went through the Configure Management Data Warehouse Wizard again, and tada!

Green means good.
Green is good. Green is life.

For your convenience, I’ve scripted out the four schedules I needed to get it going below. These are not from the most current version of SQL Server (2016 as of this writing), and are not all of the schedules included out of the box, but these let me get MDW running so I could add my own custom schedules for the remainder.

The moral of this story: Don’t remove MDW schedules unless you really mean it, for all time, forever and ever, pinky swear.

The four ‘core’ schedules I needed, for your convenience:

EXEC msdb.dbo.sp_add_schedule @schedule_name=N'CollectorSchedule_Every_6h',
@enabled=0,
@freq_type=4,
@freq_interval=1,
@freq_subday_type=8,
@freq_subday_interval=6,
@freq_relative_interval=0,
@freq_recurrence_factor=0,
@active_start_date=20080709,
@active_end_date=99991231,
@active_start_time=0,
@active_end_time=235959
GO

EXEC msdb.dbo.sp_add_schedule @schedule_name=N'CollectorSchedule_Every_30min',
@enabled=0,
@freq_type=4,
@freq_interval=1,
@freq_subday_type=4,
@freq_subday_interval=30,
@freq_relative_interval=0,
@freq_recurrence_factor=0,
@active_start_date=20080709,
@active_end_date=99991231,
@active_start_time=0,
@active_end_time=235959
GO

EXEC msdb.dbo.sp_add_schedule @schedule_name=N'CollectorSchedule_Every_15min',
@enabled=0,
@freq_type=4,
@freq_interval=1,
@freq_subday_type=4,
@freq_subday_interval=15,
@freq_relative_interval=0,
@freq_recurrence_factor=0,
@active_start_date=20080709,
@active_end_date=99991231,
@active_start_time=0,
@active_end_time=235959
GO

EXEC msdb.dbo.sp_add_schedule @schedule_name=N'RunAsSQLAgentServiceStartSchedule',
@enabled=1,
@freq_type=64,
@freq_interval=0,
@freq_subday_type=0,
@freq_subday_interval=0,
@freq_relative_interval=0,
@freq_recurrence_factor=0,
@active_start_date=20080709,
@active_end_date=99991231,
@active_start_time=0,
@active_end_time=235959
GO

If one of these schedules gets created twice you can look up the job id or use SSMS to ‘Pick’ the schedule for the duplicated schedule (the id is in the first column). And run this to remove it:

--EXEC msdb.dbo.sp_delete_schedule @schedule_id = 9999

-CJ Julius