Hi Diddly Doodly readers. Michael Rendino here again with a follow-up to my “Basic Network Capture Methods” blog, this time to share some best practices for collecting network captures when troubleshooting. As you may have guessed, one of my favorite tools, thanks to my years in networking support, is the network capture. It can provide a wealth of information about exactly what was happening when systems were trying (and possibly failing) to communicate. I don’t really concern myself with the tool used, be it Network Monitor, Wireshark, Message Analyzer, Sniffer or anything else. My biggest point to stress is what I mentioned previously – the capture shows the communication on the network. The key takeaway is that collecting a trace from a single point doesn’t give you the full picture.
While I will take a single-sided trace over no trace at all, the best scenario is to get one from every point involved in the transaction. With something like SharePoint, that can be a number of machines – the client running the browser, the web front end, the SQL back end and multiple domain controllers. Getting captures from every location sounds like a daunting task, but I would rather have too much data than too little. To add to that point, please don’t apply a capture filter unless absolutely necessary! By capturing only the data between two select points, you could be omitting a critical piece of information.
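If you are wondering how to grab an unfiltered capture on each of those boxes without installing anything, one option on Windows 7 / Server 2008 R2 and later is the built-in netsh trace command. Treat this as a rough sketch; the file path, size limit and the idea of running it on every machine are placeholders, so adjust to your environment:
netsh trace start capture=yes tracefile=C:\traces\wfe.etl maxsize=2048 filemode=circular
(reproduce the problem)
netsh trace stop
The resulting .etl file should open in Network Monitor 3.4 or Message Analyzer for the kind of analysis shown below. Run the same commands on the client, the WFE, the SQL server and the domain controllers so you end up with every side of the conversation.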
Following is a perfect example of both of these points. I was engaged to troubleshoot an issue that was described as a problem with a SharePoint web front end talking to the SQL server. I got the captures from the two servers, which fortunately were not filtered. Going only by the problem description, I would typically have opened the capture from the SQL box and applied an ipv4.Address filter for the web front end’s IP (ipv4.Address being the Network Monitor property; it would be ip.addr for you Wireshark fans) to locate the traffic from that box. In fact, I did that to start and saw that all traffic to and from the WFE appeared completely normal.
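For reference, here are the display filters that come up in this walkthrough, in both Network Monitor and Wireshark syntax (the 10.0.0.20 address is just a stand-in for the real WFE IP):
Network Monitor: ipv4.Address == 10.0.0.20     Wireshark: ip.addr == 10.0.0.20
Network Monitor: dns or nbtns                  Wireshark: dns or nbns
Network Monitor: Kerberosv5                    Wireshark: kerberos
With the address filter applied to the SQL-side capture, the conversation between the two servers looked like this: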
9873 9:37:54 AM WFE 49346 (0xC0C2) SQL 1433 (0x599) TCP TCP:Flags=...A...., PayloadLen=0, Seq=3198716784, Ack=438404416, Win=510
10093 9:37:55 AM WFE 49346 (0xC0C2) SQL 1433 (0x599) TDS TDS:RPCRequest, SPID = 0, PacketID = 1, Flags=...AP..., PayloadLen=201, Seq=3198716784 - 3198716985, Ack=438404416, Win=510
10094 9:37:55 AM SQL 1433 (0x599) WFE 49346 (0xC0C2) TDS TDS:Response, SPID = 117, PacketID = 1, Flags=...AP..., SrcPort=1433, DstPort=49346, PayloadLen=61, Seq=438404416 - 438404477, Ack=3198716985, Win=255
10188 9:37:55 AM WFE 49346 (0xC0C2) SQL 1433 (0x599) TCP TCP:Flags=...A...., PayloadLen=0, Seq=3198716985, Ack=438404477, Win=509
To me, it looked like clean SQL traffic, moving quickly and without errors. All good, so I needed to look elsewhere. To move on, it’s important to know what else has to happen when SharePoint is in use. Besides the SQL traffic, the WFE also has to communicate with the client, perform name resolution and talk to a domain controller. I first applied the filter “dns or nbtns” (again, this was Network Monitor, although I typically use multiple tools for my analysis) and everything looked clean there too. I then moved on to examine the authentication traffic. I applied the filter “Kerberosv5” and lo and behold, the issue jumped right out at me. Appearing over and over in the trace was this:
97 9:38:46 AM 0.0000000 WFE 52882 (0xCE92) DC 88 (0x58) TCP TCP:Flags=......S., SrcPort=52882, DstPort=Kerberos(88), PayloadLen=0, Seq=2542638417, Ack=0, Win=8192 ( Negotiating scale factor 0x8 ) = 8192
98 9:38:46 AM 0.0004965 DC 88 (0x58) WFE 52882 (0xCE92) TCP TCP:Flags=...A..S., SrcPort=Kerberos(88), DstPort=52882, PayloadLen=0, Seq=4098142762, Ack=2542638418, Win=65535 ( Negotiated scale factor 0x1 ) = 131070
99 9:38:46 AM 0.0000200 WFE 52882 (0xCE92) DC 88 (0x58) TCP TCP:Flags=...A...., SrcPort=52882, DstPort=Kerberos(88), PayloadLen=0, Seq=2542638418, Ack=4098142763, Win=513 (scale factor 0x8) = 131328
100 9:38:46 AM 0.0000599 WFE 52882 (0xCE92) DC 88 (0x58) KerberosV5 KerberosV5:AS Request Cname: farmsvc Realm: CONTOSO.COM Sname: krbtgt/CONTOSO.COM
102 9:38:46 AM 0.0022497 DC 88 (0x58) WFE 52882 (0xCE92) KerberosV5 KerberosV5:KRB_ERROR - KDC_ERR_CLIENT_REVOKED (18)
KRB_ERROR - KDC_ERR_CLIENT_REVOKED means that the client account has been locked out. We checked Active Directory and sure enough, the account used for the WFE service was locked. We then learned that they had recently changed the password for that service account, which resulted in said lockout. One thing to note about Network Monitor (and you can do this with Wireshark as well) is that I had a color rule highlighting all Kerberos traffic in green, so it stood out quickly.
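As a side note, if you want to confirm a lockout like that quickly, a minimal sketch using the Active Directory PowerShell module would look something like this (farmsvc is the account name from the trace above; substitute your own service account):
Import-Module ActiveDirectory
Get-ADUser farmsvc -Properties LockedOut | Select-Object Name, LockedOut
Unlock-ADAccount -Identity farmsvc
Of course, if the password was changed recently, make sure every service and application pool that uses the account has the new password, or it will just lock out again.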
So what did we learn? We know that if the trace had been taken only from the SQL server, I wouldn’t have found the issue. We also know that if the WFE trace had been filtered to include just the SQL traffic, or just the SQL and client traffic, I wouldn’t have found it either. Remember, more is better! Even if I get gigabytes of captures, I can always parse them or break them into smaller, bite-sized (no pun intended) chunks for faster filtering.
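For example, the editcap utility that ships with Wireshark can split a big capture into pieces; here is a rough sketch, where the file names and the 100000-packet chunk size are just placeholders:
editcap -c 100000 wfe_full.pcapng wfe_chunk.pcapng
That spits out a series of numbered wfe_chunk files, each of which opens and filters much faster than the original. Happy tracing!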