2012-08-13

We have two production SQL Servers running SQL Server 2005 SP4 with cumulative update 3. Both servers run on identical physical machines: Dell PowerEdge R815 with 4 x 12-core CPUs and 512GB (yes, GB) of RAM, with 10Gb iSCSI SAN-connected drives for all SQL databases and logs. The OS is Microsoft Windows Server 2008 R2 Enterprise Edition with all service packs and Windows updates. The OS drive is a RAID 5 array of 3 x 72GB 2.5" 15K SAS drives. The SAN is a Dell EqualLogic 6510 with 48 x 10K SAS 3.5" drives, configured in RAID 50 and sliced into various LUNs for the two SQL Servers; it is also shared with an Exchange machine and several VMware servers.

We have over 20 databases, 11 of which are mirrored in high-availability mode (synchronous, with automatic failover) using a witness server. The witness is a lower-powered machine running a SQL Server instance used for nothing other than providing witness services. The largest mirrored database is 450GB and generates around 100-300 IOPS. Database Mirroring Monitor reports a current send rate of around 100KB to 10MB per second and a mirror commit overhead of (typically) 0 milliseconds. The mirror server has no problem keeping up with the principal.
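For context, the role, synchronization state, and witness status of every mirrored database can be pulled from the sys.database_mirroring catalog view; a minimal sketch:

    -- One row per mirrored database: role, state, safety level, witness status, and timeout.
    SELECT DB_NAME(database_id) AS database_name,
           mirroring_role_desc,
           mirroring_state_desc,
           mirroring_safety_level_desc,
           mirroring_witness_state_desc,
           mirroring_connection_timeout
    FROM sys.database_mirroring
    WHERE mirroring_guid IS NOT NULL;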

We are consistently experiencing mirroring failovers. Sometimes a single database fails over; other times almost all databases fail over simultaneously. For instance, last night 10 of the 11 databases failed over, and the remaining database stayed accessible until I manually failed it over.

I have gone through several troubleshooting steps to attempt to identify the problem, but have so far not been able to resolve the issue:

1) The machines came with a Broadcom BCM5709C NetXtreme II 4-port Gigabit network adapter, which we initially used as the primary network connection. We have since installed an Intel(R) PRO/1000 PT Dual Port Server Adapter in both machines to eliminate the NIC as the issue.

2) All databases have an automatic full backup nightly, along with log backups for the databases involved in mirroring. Log file usage is monitored and rarely gets above 15% used. The log file for the main database is 125GB, consisting of 159 virtual log files that range in size from 511MB to 1GB. TempDB is on its own LUN and consists of 24 x 2GB files.
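The VLF layout can be checked with DBCC LOGINFO, which returns one row per virtual log file; a minimal sketch, assuming the main database is the [Data] database named in the error messages below:

    USE [Data];
    -- Each row describes one virtual log file (VLF); the row count is the VLF count
    -- and FileSize is the size of each VLF in bytes.
    DBCC LOGINFO;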

3) SQL Server log on the witness shows no errors other than: The mirroring connection to "TCP://SQL02.DOMAIN.INET:5022" has timed out for database "Data" after 30 seconds without a response. Check the service and network connections.

The SQL Server logs on the principal and mirror servers show messages relating to mirroring:

The mirroring connection to "TCP://SQL01.DOMAIN.INET:5022" has timed out for database "Data" after 30 seconds without a response. Check the service and network connections.

The mirrored database "Data" is changing roles from "PRINCIPAL" to "MIRROR" due to Role Syncronization. (Synchronization is misspelled here on purpose, since that is precisely how the actual message is displayed.)

The mirrored database "Data" is changing roles from "PRINCIPAL" to "MIRROR" due to Failover.

The mirrored database "Data" is changing roles from "MIRROR" to "PRINCIPAL" due to Failover from partner.

The SQL Server services continue to run and network connections seem to stay up. We consistently have between 500 and 2,500 sessions connected to each server (primarily robotic applications that connect to Service Broker queues in a single database).
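Session counts like those can be confirmed with a quick DMV query; a minimal sketch:

    -- Count user sessions per host and application to see what is holding the 500-2,500 connections.
    SELECT host_name,
           program_name,
           COUNT(*) AS session_count
    FROM sys.dm_exec_sessions
    WHERE is_user_process = 1
    GROUP BY host_name, program_name
    ORDER BY session_count DESC;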

4) TCP Chimney Offload, RSS, and the related TCP offload features are disabled on both servers using netsh.

5) I have run the SQL Server 2005 Best Practices Analyzer against both machines and found nothing other than the very occasional error 833 in the Application event log, none of which coincide with the failover events:

SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [F:\Data.MDF] in database [Data] (9). The OS file handle is 0x00000000000010A0. The offset of the latest long I/O is: 0x000007d4b10000.
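Since error 833 points at slow storage rather than the network, per-file I/O stalls are worth correlating with the failover times; a minimal sketch using sys.dm_io_virtual_file_stats:

    -- Cumulative I/O counts and stall time per database file since the instance started.
    -- High io_stall_*_ms relative to num_of_reads/num_of_writes suggests slow storage.
    SELECT DB_NAME(vfs.database_id) AS database_name,
           mf.physical_name,
           vfs.num_of_reads,
           vfs.num_of_writes,
           vfs.io_stall_read_ms,
           vfs.io_stall_write_ms
    FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    JOIN sys.master_files AS mf
      ON mf.database_id = vfs.database_id
     AND mf.file_id = vfs.file_id
    ORDER BY vfs.io_stall_read_ms + vfs.io_stall_write_ms DESC;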

6) Occasionally we see "The client was unable to reuse a session with SPID XXX, which had been reset for connection pooling. This error may have been caused by an earlier operation failing. Check the error logs for failed operations immediately before this error message." generated by both servers. There appear to be no "earlier" messages that indicate any issue.

7) Occasionally Database Mail writes an error to the Application event log:

Exception Type: Microsoft.SqlServer.Management.SqlIMail.Server.Common.BaseException
Message: There was an error on the connection. Reason: Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding., connection parameters: Server Name: MGSQL02, Database Name: msdb
Data: System.Collections.ListDictionaryInternal
TargetSite: Void OpenConnection(Microsoft.SqlServer.Management.Common.SqlConnectionInfo)
HelpLink: NULL
Source: DatabaseMailEngine

StackTrace Information:
at Microsoft.SqlServer.Management.SqlIMail.Server.DataAccess.ConnectionManager.OpenConnection(SqlConnectionInfo ci)
at Microsoft.SqlServer.Management.SqlIMail.Server.DataAccess.DataAccessAdapter.OpenConnection(String dbServerName, String dbName, String userName, String password)
at Microsoft.SqlServer.Management.SqlIMail.IMailProcess.QueueItemProcesser.ProcessQueueItems(String dbName, String dbServerName, Int32 lifetimeMinimumSec, LogLevel loggingLevel)

I believe the timeouts are causing the failovers. What could cause these timeouts? Obviously an actual network issue such as a bad cable or a bad switch could cause packet loss and therefore a timeout, but what else could? Blocking? If msdb, or some other system database, had an I/O timeout, could that cause a mirroring failover?
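To rule blocking in or out at the moment a timeout fires, a snapshot of the current requests and their blockers can be captured (and logged to a table on a schedule if needed); a minimal sketch:

    -- Snapshot of running requests, the sessions blocking them, and what they are waiting on.
    SELECT r.session_id,
           r.blocking_session_id,
           r.wait_type,
           r.wait_time,
           r.status,
           DB_NAME(r.database_id) AS database_name,
           t.text AS sql_text
    FROM sys.dm_exec_requests AS r
    OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
    WHERE r.session_id <> @@SPID;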

Thanks for any advice!

For reference, I have also reviewed the output of the following commands and queries:

netsh interface tcp show global

netsh interface ipv4 show dynamicportrange tcp

SELECT name, value_in_use FROM sys.configurations

A while ago, I manually changed the mirroring_connection_timeout value for all mirrored databases to 30 seconds to try to remediate the problem; this has simply increased the amount of time between failover events. With the timeout at its default of 10 seconds, we see many more failovers.
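For reference, this timeout is set per database on the principal with ALTER DATABASE; a minimal sketch using the [Data] database named above:

    -- Raise the mirroring partner timeout from the default 10 seconds to 30 seconds.
    -- Run on the principal; the new value is visible in sys.database_mirroring.mirroring_connection_timeout.
    ALTER DATABASE [Data] SET PARTNER TIMEOUT 30;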

A comment asked me to ensure IPsec is disabled, so I have also reviewed the IPsec configuration of the operating system using several netsh commands.

UPDATE: 2012-12-20

We've now moved our production systems to SQL Server 2012. We've been running it since the morning of December 17th, and so far there have been no failovers. However, a couple of days without a failover is well within what we saw on the 2005-based systems, so it is too early to call the problem solved.

In an effort to document the performance of our new systems, I've been looking at sys.dm_os_wait_stats more carefully and noticed DBMIRROR_DBM_EVENT, which is an undocumented wait type. Graham Kent at Microsoft has an interesting article on troubleshooting unexpected failovers involving this wait type. I shall recap his findings here:

The customer was experiencing a huge blocking chain built on a high volume OLTP database where all the head blockers were waiting on DBMIRROR_DBM_EVENT. Here is the sequence of events I went through:

Review the blocking chain itself - no help here, as all we can see is that we're waiting on DBMIRROR_DBM_EVENT.

Review the source for the undocumented wait type. Obviously you can't do this outside of MS, but I can say that at the time of writing this wait type represents the wait used when the principal is waiting for the mirror to harden an LSN, meaning that the transaction it's part of cannot commit. This immediately points quite specifically to the problem that the principal cannot commit transactions as it's waiting on the mirror. Now we need to investigate why the mirror is not committing transactions or why the principal doesn't know whether it is.

Review the msdb system tables

(a) Look at the [backupset] table to see if the size of the logs produced at the time of the problem was significantly higher than normal. If they were exceptionally large, it may be that the mirror was flooded with transactions and simply could not keep up with the volume. This is why Books Online will sometimes tell you to disable mirroring if you need to do an exceptionally large logged operation such as an index rebuild (reference for why this is at http://technet.microsoft.com/en-us/library/cc917681.aspx). Here I used T-SQL along the following lines.
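A sketch of such a query against msdb.dbo.backupset; the database name and date range below are placeholders, not values from the original article:

    -- Log backup history for the mirrored database around the incident window.
    -- Unusually large backup_size values would indicate a flood of log the mirror had to absorb.
    SELECT bs.database_name,
           bs.backup_start_date,
           bs.backup_finish_date,
           bs.backup_size
    FROM msdb.dbo.backupset AS bs
    WHERE bs.type = 'L'                        -- 'L' = transaction log backup
      AND bs.database_name = N'Data'           -- placeholder: the affected database
      AND bs.backup_start_date >= '20120812'   -- placeholder: start of the incident window
    ORDER BY bs.backup_start_date;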

(b) Secondly, I looked at the data in the [dbm_monitor_data] table in msdb. The key here is to locate the timeframe in which we had a problem and then see whether we were experiencing significant changes in any of the following: the mirroring status, the send queue, the redo queue, the send rate, and the redo rate.

These are all indicators, similar to part (a), that might show a component or piece of architecture that wasn't responding. For example, if the send queue suddenly starts to grow but the redo queue doesn't, it would imply that the principal cannot send the log records to the mirror, so you'd want to look at connectivity, or at the Service Broker queues dealing with the actual transmissions.

In this particular scenario we noted that all the counters appeared to have strange values, in that there were log backups going on of normal sizes, but there were no status changes, 0 send queue, 0 redo queue, a flat send rate and a flat redo rate. This is very strange as it implies that the DBM Monitor could not record any values from anywhere over the problem period.
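For anyone wanting to pull the same history, the documented msdb.dbo.sp_dbmmonitorresults procedure returns the rows that Database Mirroring Monitor records, including the send queue (unsent_log) and redo queue (unrestored_log) figures mentioned above; a minimal sketch:

    -- Return recent Database Mirroring Monitor history for the [Data] database.
    -- Second argument: amount of history to return (0 = most recent row only).
    -- Third argument: 0 = do not force a status-table refresh before returning rows.
    EXEC msdb.dbo.sp_dbmmonitorresults N'Data', 0, 0;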

Review the SQL Server error logs. In this case there were no errors or information messages whatsoever, but in other scenarios such as this it's very common for errors in the 1400 range to be reported, examples of which you can find elsewhere in my other mirroring blogs, such as this Error 1413 example.

Review the default trace files – in this scenario I was not provided the default traces, however they are fantastic sources of DBM problem information, as they record state change events on all the partners. This is documented here:

Database Mirroring State Change Event Class

This often gives you a great picture of scenarios such as when network connectivity failed between one or all of the partners and then what the state of the partnership became afterwards.
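Reading those state change events back out of the default trace can be done with fn_trace_gettable, joining to sys.trace_events by name so the event ID doesn't have to be hard-coded; a minimal sketch:

    -- Pull Database Mirroring State Change events from the default trace files.
    DECLARE @path NVARCHAR(260);
    SELECT @path = [path] FROM sys.traces WHERE is_default = 1;

    SELECT t.StartTime,
           t.DatabaseName,
           t.TextData           -- describes the old and new mirroring states
    FROM sys.fn_trace_gettable(@path, DEFAULT) AS t
    JOIN sys.trace_events AS e
      ON e.trace_event_id = t.EventClass
    WHERE e.name = N'Database Mirroring State Change'
    ORDER BY t.StartTime;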

CONCLUSIONS:

In this particular scenario I’m currently missing 2 key points of data, but apart from that I can still make a reasonable hypothesis based on the above information. We can certainly say that the blocking was caused by the fact that DBM was enabled, due to the blockers all waiting on the DBMIRROR_DBM_EVENT wait type. Since we know we didn’t flood the mirror with a large logged operation and that this deployment normally runs happily in this mode, we can exclude unusually large operations. This means that we have 3 potential candidates at this stage:

1) Hardware problems on the connectivity between some or all of the partners.

2) CPU exhaustion on the mirror server – simply unable to keep up with the redo – the CPU exhaustion could itself be from a process outside of SQL Server or outside of this mirror partnership.

3) A problem with the mirroring code itself (we’d really need some memory dumps to confirm this, though).

Based upon experience I’d suspect 1 or 2, but I always keep an open mind about 3 as well; we’re trying to collect some more data now to look at this problem in more detail.
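For completeness, cumulative waits on this wait type are easy to keep an eye on through sys.dm_os_wait_stats; a minimal sketch:

    -- Cumulative DBMIRROR_DBM_EVENT waits since the last instance restart (or stats clear).
    SELECT wait_type,
           waiting_tasks_count,
           wait_time_ms,
           max_wait_time_ms
    FROM sys.dm_os_wait_stats
    WHERE wait_type = N'DBMIRROR_DBM_EVENT';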
