
I recently came across the following issue, and found very little documentation online about it, so thought i'd add it here to help others. Following the installation of some windows updates security patches onto a passive SQL cluster node, the cluster service refused to start. Event id 7031, 1073, 1173, 1123 were all logged in the system event log.
Event Type: Warning
Event Source: ClusSvc
Event Category: Node Mgr
Event ID: 1123
Date: 01/11/2010
Time: 13:21:56
User: N/A
Computer: PASSIVENODE
Description:
The node lost communication with cluster node 'ACTIVENODE' on network 'Heartbeat (Left)'.
--------------------------------
Event Type: Error
Event Source: ClusSvc
Event Category: Membership Mgr
Event ID: 1173
Date: 01/11/2010
Time: 13:22:39
User: N/A
Computer: PASSIVENODE
Description:
Cluster service is shutting down because the membership engine detected a membership event while trying to join the server cluster. Shutting down is the normal response to this type of event. Cluster service will restart per the Service Manager's recovery actions.
--------------------------------
Event Type: Error
Event Source: ClusSvc
Event Category: Startup/Shutdown
Event ID: 1073
Date: 01/11/2010
Time: 13:22:39
User: N/A
Computer: PASSIVENODE
Description:
Cluster service was halted to prevent an inconsistency within the server cluster. The error code was 5890.
After numerous network traces and diagnostics i discovered that one of the windows updates previously installed (KB97546) had updated a file named MSV1_0.dll on the passive cluster, this had caused a version mismatch and was the cause of the problems. The active node had a file version 5.2.3790.4587 where as the passive node had 5.2.3790.3959.
Before discovering the mismatch we had already evicted the passive node from the cluster, so i can't say whether just uninstalling KB975467 from the passive node would resolve the issues. Most likely, it will.
Below was my full set of troubleshooting/resolution steps.
- Suspected a Network Issue so collected Network Monitor Trace for Heartbeat NIC - came out clean.
- Evicted NODE B from cluster and ran, cluster.exe node /forcecleanup
- Tried adding NODE B in cluster but it failed to start Cluster Service during the join process.
- Confirmed following registry entries on both the nodes were the same,
Verified that HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Lsa\LMCompatibilityLevel and it is set to 2. Verified that HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Lsa\lmcompatibilitylevel and restrictanonymous are the same on all nodes.
Verified that HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Lsa\MSV1_0\ntlmminclientsec and ntlmminserversec are the same on all nodes.
- Verified the Cluster Service Account Password used was correct and not expired.
- Checked HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Cluster Server and Clusterinstallationstate was set to 1 on passive node and it was set to 2 on active node.
- Confirmed the subnet mask for all the interfaces in cluster on both the nodes were configured correctly.
- Investigated installed KB's and noticed that there was a mismatch in the file version of MSV1_0.dll between the cluster nodes.
- Tried replacing the .dll file from NODE A but that did not help.
- Uninstalled KB 975467 from NODE B and the file version was reverted to 5.2.3790.4530.
- Successfully re added the passive node into the cluster and restarted the cluster service.
Hope this helps someone else out!