Chapter 13

🕵️‍♂️ Troubleshooting - Detective Mode

By Sys-Metrics · 60 min chapter

🎯 Meet the Network Detective

If all the previous chapters taught you how to build and configure networks, this final chapter teaches you how to be a network detective who solves mysteries when things go wrong. Every network professional spends significant time troubleshooting, and the best ones approach problems systematically like skilled investigators gathering clues and testing theories.

🎯 Chapter Goals: Master systematic troubleshooting methodology, learn OSI layer-by-layer problem solving, use essential diagnostic tools, identify common network problems, and develop the detective mindset needed to solve any network mystery!

🔍 The Detective's Methodology

Great network troubleshooting isn't about memorizing solutions; it's about developing a systematic approach that works for any problem. Think like a detective investigating a case:

The Scientific Method for Networks

1. Define the Problem: What exactly is broken? Get specific symptoms from users.
2. Gather Information: Collect facts, error messages, and environmental details.
3. Form Hypothesis: Based on symptoms, what's the most likely cause?
4. Test Theory: Use tools and commands to verify your hypothesis.
5. Implement Solution: Fix the problem based on the confirmed root cause.
6. Verify and Document: Confirm the fix works and document it for future reference.

The Detective's Questions

What exactly is the problem?

"Internet doesn't work" vs "Can't reach www.google.com from Sales VLAN"

When did it start?

Recent changes often reveal root causes

Who is affected?

Single user, department, or entire network?

What changed recently?

New equipment, configuration changes, software updates?

Can you reproduce it?

Consistent problems vs intermittent issues

Information Gathering Techniques

User Report: "The Internet is down!"
👂 Listen carefully - don't assume you understand the problem
❓ Ask specific questions: Which websites? What error messages?
👀 Observe directly - see the problem with your own eyes
📱 Test from multiple devices and locations
📊 Check monitoring systems and logs
🎯 Define the actual problem: "Can't resolve DNS names"

Common Troubleshooting Mistakes

Mistake: Jumping to conclusions
Assuming you know the problem without investigation
Better Approach:
✓ Always verify symptoms first
✓ Test your assumptions
✓ Follow the evidence, not hunches
✓ Consider multiple possible causes
Mistake: Random configuration changes
Changing things without understanding the impact
Better Approach:
✓ Identify root cause before making changes
✓ Change one thing at a time
✓ Document what you change
✓ Have a rollback plan
🧠 Detective Tip: The simplest explanation is usually the correct one (Occam's Razor), but always verify it with evidence!

🏗️ OSI Layer Troubleshooting Approach

The OSI model isn't just academic theory; it's your troubleshooting roadmap. Start at the physical layer and work your way up, or start at the application layer and work down:

Bottom-Up Approach (Physical to Application)

Layer 1 - Physical
Cables, power, LEDs, hardware
Layer 2 - Data Link
Switch ports, VLANs, MAC addresses
Layer 3 - Network
IP addresses, routing, subnets
Layer 4+ - Transport/Application
Ports, services, applications

Layer 1 - Physical Layer Detective Work

Visual Inspection

Check cables, connections, power LEDs, port status lights

Cable Testing

Use cable testers for copper, light meters for fiber

Port Status

Check interface status: up/up, up/down, down/down

Environmental

Temperature, humidity, electrical interference

Layer 1 Troubleshooting Commands
Switch# show interfaces status
Port      Name       Status        Vlan  Duplex  Speed  Type
Fa0/1     PC1        connected     1     a-full  a-100  10/100BaseTX
Fa0/2                notconnect    1     auto    auto   10/100BaseTX
Fa0/3     Server     err-disabled  1     auto    auto   10/100BaseTX
Router# show interfaces fastethernet 0/0
FastEthernet0/0 is up, line protocol is down
Hardware is AmdFE, address is 0013.197b.5004
MTU 1500 bytes, BW 100000 Kbit, DLY 100 usec,
reliability 255/255, txload 1/255, rxload 1/255
Encapsulation ARPA, loopback not set
Keepalive set (10 sec)
Full-duplex, 100Mb/s, media type is RJ45
input flow-control is off, output flow-control is unsupported
ARP type: ARPA, ARP Timeout 04:00:00
Last input 00:00:05, output 00:00:01, output hang never
Last clearing of "show interface" counters never
Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
Queueing strategy: fifo
Output queue: 0/40 (size/max)
5 minute input rate 0 bits/sec, 0 packets/sec
5 minute output rate 0 bits/sec, 0 packets/sec
0 packets input, 0 bytes, 0 no buffer
Received 0 broadcasts, 0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 output errors, 0 collisions, 0 interface resets
0 unknown protocol drops
0 babbles, 0 late collision, 0 deferred
0 lost carrier, 0 no carrier, 0 PAUSE input
0 PAUSE output
0 output buffer failures, 0 output buffers swapped out
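The first line of `show interfaces` output carries most of the Layer 1 versus Layer 2 verdict. As a rough sketch of how that line can be read, the Python below maps the two states to a first place to look. The `DIAGNOSIS` table and the `diagnose` function are hypothetical helpers written for this chapter, not part of any Cisco tool, and the hints are deliberately coarse.

```python
import re

# Hypothetical lookup table: (interface state, line protocol state) -> first place to look.
DIAGNOSIS = {
    ("up", "up"): "Layers 1 and 2 look healthy; continue at Layer 3",
    ("up", "down"): "Layer 1 is up; suspect Layer 2 (encapsulation or keepalive problems)",
    ("down", "down"): "Suspect Layer 1 (cable, power, or the device at the far end)",
    ("administratively down", "down"): "Interface is shut down; check for a 'shutdown' command",
}

# Matches lines such as "FastEthernet0/0 is up, line protocol is down".
STATUS_LINE = re.compile(
    r"^(?P<name>\S+) is (?P<intf>administratively down|up|down), "
    r"line protocol is (?P<proto>up|down)"
)


def diagnose(status_line: str) -> str:
    """Return a troubleshooting hint for one interface status line."""
    match = STATUS_LINE.match(status_line.strip())
    if match is None:
        raise ValueError(f"not an interface status line: {status_line!r}")
    key = (match["intf"], match["proto"])
    hint = DIAGNOSIS.get(key, "Unrecognized state; inspect the interface manually")
    return f"{match['name']}: {hint}"


print(diagnose("FastEthernet0/0 is up, line protocol is down"))
```

Feeding it the status line from the router output above points the investigation at Layer 2 rather than at the cabling.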

Layer 2 - Data Link Detective Work

MAC Address Tables

Check if devices are learning MAC addresses

VLAN Configuration

Verify VLAN assignments and trunk configurations

Spanning Tree

Check for loops and blocked ports

Switch Port Security

Look for security violations

Layer 2 Troubleshooting Commands
Switch# show mac address-table
          Mac Address Table
-------------------------------------------
Vlan    Mac Address       Type        Ports
----    -----------       --------    -----
   1    0001.0002.0003    DYNAMIC     Fa0/1
   1    0004.0005.0006    DYNAMIC     Fa0/2
  10    0007.0008.0009    DYNAMIC     Fa0/3
Total Mac Addresses for this criterion: 3
Switch# show vlan brief
VLAN Name                             Status    Ports
---- -------------------------------- --------- -------------------------------
1    default                          active    Fa0/1, Fa0/2, Fa0/4, Fa0/5
                                                Fa0/6, Fa0/7, Fa0/8, Fa0/9
10   Sales                            active    Fa0/3
20   Engineering                      active    Fa0/10, Fa0/11
Switch# show spanning-tree
VLAN0001
  Spanning tree enabled protocol ieee
  Root ID    Priority    32769
             Address     0019.e86a.6f00
             This bridge is the root
             Hello Time   2 sec  Max Age 20 sec  Forward Delay 15 sec
  Bridge ID  Priority    32769  (priority 32768 sys-id-ext 1)
             Address     0019.e86a.6f00
             Hello Time   2 sec  Max Age 20 sec  Forward Delay 15 sec
             Aging Time  300 sec
Interface           Role Sts Cost      Prio.Nbr Type
------------------- ---- --- --------- -------- --------------------------------
Fa0/1               Desg FWD 19        128.1    P2p
Fa0/2               Desg FWD 19        128.2    P2p
Fa0/24              Desg FWD 19        128.24   P2p
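When a MAC address table is long, it helps to search captured output instead of scanning it by eye. The sketch below assumes you have pasted the text of `show mac address-table` into a string. `find_mac` is a hypothetical helper, and it assumes each entry has exactly four whitespace-separated fields (VLAN, MAC, type, port), as in the sample above.

```python
from typing import Optional, Tuple

# Sample text copied from the output above; real tables will be longer.
SAMPLE_TABLE = """\
Vlan Mac Address Type Ports
---- ----------- -------- -----
1 0001.0002.0003 DYNAMIC Fa0/1
1 0004.0005.0006 DYNAMIC Fa0/2
10 0007.0008.0009 DYNAMIC Fa0/3
"""


def find_mac(table_output: str, mac: str) -> Optional[Tuple[int, str]]:
    """Return (vlan, port) for a MAC address, or None if it was not learned."""
    wanted = mac.lower()
    for line in table_output.splitlines():
        fields = line.split()
        # Entry rows have four fields; the header row splits into five and is skipped.
        if len(fields) == 4 and fields[1].lower() == wanted:
            return int(fields[0]), fields[3]
    return None


print(find_mac(SAMPLE_TABLE, "0007.0008.0009"))
```

A result of `None` for a device that should be online suggests the switch has not seen any frames from it, which points back toward Layer 1 or the port configuration.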

Layer 3 - Network Layer Detective Work

IP Addressing

Verify correct IP addresses and subnet masks

Routing Tables

Check if routes exist to destination networks

Default Gateways

Ensure devices know how to reach other networks

ARP Tables

Verify IP-to-MAC address resolution

Layer 3 Troubleshooting Commands
Router# show ip route
Codes: C - connected, S - static, R - RIP, M - mobile, B - BGP
       D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area
Gateway of last resort is 203.0.113.1 to network 0.0.0.0
C    192.168.10.0/24 is directly connected, FastEthernet0/0
C    192.168.20.0/24 is directly connected, FastEthernet0/1
S*   0.0.0.0/0 [1/0] via 203.0.113.1
Router# show arp
Protocol  Address          Age (min)  Hardware Addr   Type   Interface
Internet  192.168.10.1            -   0013.197b.5004  ARPA   FastEthernet0/0
Internet  192.168.10.100          5   0001.0002.0003  ARPA   FastEthernet0/0
Internet  192.168.20.1            -   0013.197b.5005  ARPA   FastEthernet0/1
PC# ipconfig
IP Address. . . . . . . . . . . : 192.168.10.100
Subnet Mask . . . . . . . . . . : 255.255.255.0
Default Gateway . . . . . . . . : 192.168.10.1
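One quick Layer 3 sanity check is whether the default gateway sits inside the host's own subnet. A gateway outside that subnet cannot be reached directly. The sketch below uses Python's standard `ipaddress` module. The function name `gateway_on_same_subnet` is made up for this example, and the addresses are the ones from the `ipconfig` output above.

```python
import ipaddress


def gateway_on_same_subnet(host_ip: str, netmask: str, gateway_ip: str) -> bool:
    """Return True if the gateway address falls inside the host's subnet."""
    # ip_interface accepts "address/netmask" and derives the containing network.
    interface = ipaddress.ip_interface(f"{host_ip}/{netmask}")
    return ipaddress.ip_address(gateway_ip) in interface.network


# Values taken from the ipconfig output above.
print(gateway_on_same_subnet("192.168.10.100", "255.255.255.0", "192.168.10.1"))
# A mistyped gateway in another subnet fails the check.
print(gateway_on_same_subnet("192.168.10.100", "255.255.255.0", "192.168.20.1"))
```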

Top-Down vs Bottom-Up Decision

🔺 Bottom-Up (Physical First)

  • Use when: Complete connectivity failure
  • Symptoms: No lights, no link, interface down
  • Advantage: Catches fundamental issues first
  • Example: "Nothing works at all"

🔻 Top-Down (Application First)

  • Use when: Specific application issues
  • Symptoms: Some things work, others don't
  • Advantage: Faster for service-specific problems
  • Example: "Email works but web browsing doesn't"

🛠️ Essential Detective Tools

Ping - The Network's Heartbeat Monitor

Ping is like checking someone's pulse: it tells you if the network path is alive and how healthy it is.

Basic Ping Tests
PC# ping 192.168.10.1
Pinging 192.168.10.1 with 32 bytes of data:
Reply from 192.168.10.1: bytes=32 time=1ms TTL=255
Reply from 192.168.10.1: bytes=32 time=1ms TTL=255
Reply from 192.168.10.1: bytes=32 time=1ms TTL=255
Reply from 192.168.10.1: bytes=32 time=1ms TTL=255
Ping statistics for 192.168.10.1:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 1ms, Maximum = 1ms, Average = 1ms
Router# ping 8.8.8.8 source fastethernet 0/0
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 8.8.8.8, timeout is 2 seconds:
Packet sent with a source address of 192.168.10.1
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 20/25/32 ms
Router# ping 192.168.20.100 repeat 10 size 1500
Type escape sequence to abort.
Sending 10, 1500-byte ICMP Echos to 192.168.20.100, timeout is 2 seconds:
!!!!!!!!!!
Success rate is 100 percent (10/10), round-trip min/avg/max = 1/2/4 ms

Ping Response Analysis

! (Exclamation)
Success - packet received
. (Period)
Timeout - no response
U
Destination unreachable
N
Network unreachable
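The result characters above can be tallied mechanically when a long ping run scrolls past. The sketch below counts each code in a result string such as `!!!.!` and computes the success rate. `summarize_ping` is a hypothetical helper, and it knows only the four codes listed in this chapter; anything else is counted as "other".

```python
# Result codes as listed in the table above; other characters are counted as "other".
PING_CODES = {
    "!": "reply",
    ".": "timeout",
    "U": "destination unreachable",
    "N": "network unreachable",
}


def summarize_ping(result: str) -> dict:
    """Count each result code in a Cisco-style ping result string."""
    result = result.strip()
    counts = {label: 0 for label in PING_CODES.values()}
    counts["other"] = 0
    for char in result:
        counts[PING_CODES.get(char, "other")] += 1
    sent = len(result)
    # Success rate is the share of probes that produced a reply.
    counts["success_percent"] = round(100 * counts["reply"] / sent) if sent else 0
    return counts


print(summarize_ping("!!!.!"))
```

A string that is mostly `!` with scattered `.` characters suggests intermittent loss, while a run of `U` means some device is actively reporting that it cannot deliver the packet.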

Traceroute - The Path Detective

Traceroute shows the exact path packets take through the network, like a GPS showing your route:

Traceroute Examples
PC# tracert google.com
Tracing route to google.com [172.217.9.46]
over a maximum of 30 hops:
1 1 ms 1 ms 1 ms 192.168.10.1
2 15 ms 12 ms 18 ms 203.0.113.1
3 25 ms 22 ms 28 ms 10.1.1.1
4 35 ms 32 ms 38 ms 172.217.9.46
Trace complete.
Router# traceroute 8.8.8.8
Type escape sequence to abort.
Tracing the route to 8.8.8.8
1 203.0.113.1 16 msec 12 msec 16 msec
2 10.1.1.1 20 msec 18 msec 22 msec
3 8.8.8.8 28 msec * 32 msec

Telnet and SSH - The Connection Testers

Testing Port Connectivity
PC# telnet 192.168.10.100 80
Trying 192.168.10.100...
Connected to 192.168.10.100.
Escape character is '^]'.
Router# telnet 192.168.20.10 23
Trying 192.168.20.10 ... Open
User Access Verification
Password:
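Many workstations no longer ship with a telnet client, so the same "does the port answer?" test is often scripted. The sketch below does it with Python's standard `socket` module. `tcp_port_open` is a hypothetical helper name and the 3-second default timeout is an arbitrary choice.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port completes within the timeout."""
    try:
        # create_connection performs the TCP three-way handshake, like telnet does.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `tcp_port_open("192.168.10.100", 80)` mirrors the `telnet 192.168.10.100 80` test above. A `False` result does not say why the connection failed; a closed port, a filtering ACL, and a routing problem all look the same from here, so follow up with ping and traceroute.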

Show Commands - The Information Gatherers

Essential Show Commands Toolkit:

Interface Status:
show interfaces [interface]
show ip interface brief
show interfaces status

Routing Information:
show ip route
show ip route [network]
show ip protocols

Layer 2 Information:
show mac address-table
show vlan brief
show spanning-tree

System Information:
show version
show running-config
show startup-config

Troubleshooting Specific:
show arp
show cdp neighbors
show log

Debug Commands - The Live Investigation Tools

Debug Commands (Use Carefully!)
Router# debug ip packet
IP packet debugging is on
*Sep 17 14:23:15.123: IP: s=192.168.10.100 (FastEthernet0/0), d=8.8.8.8 (Serial0/0/0), len 84, forward
*Sep 17 14:23:15.127: IP: s=8.8.8.8 (Serial0/0/0), d=192.168.10.100 (FastEthernet0/0), len 84, forward
Router# undebug all
All possible debugging has been turned off
⚠️ Debug Warning: Debug commands generate lots of output and can overwhelm routers. Always use "undebug all" when finished, and avoid on production systems during peak hours!

🔧 Common Network Problems and Solutions

Connectivity Problems

Problem: Complete loss of connectivity
User can't reach anything on the network
Detective Investigation:
✓ Check physical layer: cables, power, port LEDs
✓ Verify IP configuration: address, mask, gateway
✓ Test local connectivity: ping default gateway
✓ Check switch port configuration and status
✓ Verify VLAN assignment and trunk configuration
Problem: Can ping by IP but not by name
Network connectivity works but name resolution fails
DNS Investigation:
✓ Check DNS server configuration on client
✓ Test DNS server reachability (ping DNS server IP)
✓ Use nslookup to test name resolution
✓ Check DNS server functionality and configuration
✓ Verify firewall/ACL not blocking DNS traffic (port 53)
Problem: Intermittent connectivity issues
Network works sometimes but not others
Intermittent Problem Analysis:
✓ Check for spanning tree topology changes
✓ Monitor interface utilization and errors
✓ Look for duplex mismatches
✓ Check for IP address conflicts
✓ Monitor logs for patterns or error messages
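The "can ping by IP but not by name" case comes down to combining two observations: does the name resolve, and does the IP answer? The sketch below encodes that decision. `classify_name_problem` is a hypothetical helper, and the `resolver` parameter is an assumption added so the logic can be exercised without a live DNS server. By default it uses the operating system's resolver through `socket.gethostbyname`.

```python
import socket


def classify_name_problem(hostname, ip_reachable, resolver=socket.gethostbyname):
    """Combine a name lookup with a known ping-by-IP result into a first diagnosis."""
    try:
        address = resolver(hostname)
    except OSError:
        # socket.gaierror (lookup failure) is a subclass of OSError.
        if ip_reachable:
            return "DNS problem: the IP answers but the name does not resolve"
        return "Name lookup failed and the IP is unreachable: check connectivity first"
    return f"Name resolves to {address}: look beyond DNS"
```

Calling `classify_name_problem("www.google.com", ip_reachable=True)` on the affected client gives a one-line starting point. It does not replace `nslookup`, which also shows which DNS server answered.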

Performance Problems

Problem: Network is slow
Connections work but performance is poor
Performance Investigation:
✓ Check interface utilization (show interfaces)
✓ Look for input/output errors and collisions
✓ Verify duplex settings (full vs half duplex)
✓ Check for broadcast storms or excessive traffic
✓ Monitor CPU and memory usage on network devices
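Interface utilization can be estimated from two fields in `show interfaces`: the configured bandwidth (`BW 100000 Kbit`) and the 5 minute input or output rate in bits/sec. The `txload`/`rxload` values express similar information as a fraction of 255. The arithmetic is sketched below; both function names are made up for this example, and the 45 Mbit/s rate is an invented figure rather than a value from the output earlier in the chapter.

```python
def utilization_percent(rate_bps: int, bandwidth_kbit: int) -> float:
    """Utilization as a percentage: observed bit rate divided by interface bandwidth."""
    if bandwidth_kbit <= 0:
        raise ValueError("bandwidth must be positive")
    return 100.0 * rate_bps / (bandwidth_kbit * 1000)


def load_percent(load: str) -> float:
    """Convert a load value such as '1/255' from show interfaces to a percentage."""
    numerator, denominator = load.split("/")
    return 100.0 * int(numerator) / int(denominator)


# Hypothetical reading: 45 Mbit/s on a 100 Mbit/s interface.
print(utilization_percent(45_000_000, 100_000))
print(round(load_percent("1/255"), 2))
```

What counts as "too busy" depends on your baseline, so compare the result with what is normal for that link before blaming utilization.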

VLAN and Switching Problems

Problem: Devices in same VLAN can't communicate
VLAN devices isolated from each other
VLAN Troubleshooting:
✓ Verify VLAN exists and is active
✓ Check port VLAN assignments
✓ Confirm trunk ports allow the VLAN
✓ Check spanning tree state for the VLAN's ports
✓ Look for ports that spanning tree has placed in a blocking state

Routing Problems

Problem: Can't reach remote networks
Local network works but remote destinations fail
Routing Investigation:
✓ Check routing table for destination network
✓ Verify default route exists for unknown destinations
✓ Test next-hop router reachability
✓ Check routing protocol configuration and neighbors
✓ Verify return path exists (routing is bidirectional)
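Checking the routing table "for the destination network" means finding the most specific route that contains the destination address, with the default route as the fallback. The sketch below illustrates that longest-prefix-match idea with Python's `ipaddress` module. The `ROUTES` list is transcribed from the sample `show ip route` output earlier in the chapter, and `lookup` is a hypothetical helper, not how a router implements forwarding.

```python
import ipaddress

# Routes transcribed from the sample "show ip route" output in this chapter.
ROUTES = [
    ("192.168.10.0/24", "FastEthernet0/0"),
    ("192.168.20.0/24", "FastEthernet0/1"),
    ("0.0.0.0/0", "via 203.0.113.1"),
]


def lookup(destination, routes=ROUTES):
    """Return the exit for the most specific matching route, or None if no route matches."""
    dest = ipaddress.ip_address(destination)
    candidates = []
    for prefix, exit_point in routes:
        network = ipaddress.ip_network(prefix)
        if dest in network:
            candidates.append((network.prefixlen, exit_point))
    if not candidates:
        return None
    # The longest prefix (largest prefix length) wins.
    return max(candidates, key=lambda item: item[0])[1]


print(lookup("192.168.20.100"))
print(lookup("8.8.8.8"))
```

Removing the default route from the list makes `lookup("8.8.8.8")` return `None`, which is the scripted equivalent of a missing gateway of last resort.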

Security and Access Problems

Problem: Can reach server but can't access service
Network connectivity exists but application fails
Application/Security Investigation:
✓ Test port connectivity with telnet
✓ Check ACLs blocking specific traffic
✓ Verify NAT configuration for port forwarding
✓ Check firewall rules on server and client
✓ Verify service is running on server

📋 Systematic Troubleshooting Checklists

Layer 1 Physical Checklist

🔌 Physical Layer Investigation:
□ Check all cable connections are secure
□ Verify power to all network devices
□ Check LED status lights on devices
□ Test cables with cable tester if available
□ Look for physical damage to cables
□ Check interface status: up/up, up/down, down/down
□ Verify correct cable type (straight vs crossover)
□ Check for environmental issues (heat, interference)
□ Confirm port is not administratively shutdown
□ Test with known good cable and port

Layer 2 Data Link Checklist

🔗 Data Link Layer Investigation:
□ Check MAC address table for learned addresses
□ Verify VLAN configuration and assignments
□ Check trunk port configuration and allowed VLANs
□ Verify spanning tree status and port states
□ Look for spanning tree topology changes
□ Check for duplex mismatches
□ Monitor for excessive collisions or errors
□ Verify switch port security settings
□ Check for err-disabled ports
□ Test with different switch port if available

Layer 3 Network Checklist

🌐 Network Layer Investigation:
□ Verify IP address and subnet mask configuration
□ Check default gateway configuration
□ Test connectivity to default gateway
□ Verify routing table has routes to destinations
□ Check ARP table for IP-to-MAC resolution
□ Test routing protocol neighbor relationships
□ Verify no IP address conflicts exist
□ Check NAT translations if applicable
□ Test end-to-end connectivity with ping
□ Use traceroute to verify packet path

Application Layer Checklist

📱 Application Layer Investigation:
□ Test DNS name resolution
□ Check DHCP configuration and leases
□ Verify application services are running
□ Test port connectivity with telnet
□ Check ACLs and firewall rules
□ Verify application-specific settings
□ Check for application-layer timeouts
□ Test with different client applications
□ Monitor application logs for errors
□ Verify user credentials and permissions

Documentation and Follow-up

📝 Problem Resolution Documentation:
□ Document problem symptoms and timeline
□ Record troubleshooting steps taken
□ Document root cause identified
□ Record solution implemented
□ Test solution thoroughly
□ Document any configuration changes
□ Update network documentation
□ Create knowledge base entry
□ Inform affected users of resolution
□ Schedule follow-up to confirm stability

🛠️ Hands-On Troubleshooting Labs

Lab 1: Mystery Network - Physical Layer Issues

  1. Scenario Setup:
    • Create network with intentional physical problems
    • Disconnect cables, power off devices, wrong cable types
    • Mix various Layer 1 issues in same topology
    • Don't tell students what's broken
  2. Detective Mission:
    • Students must systematically check all physical components
    • Use show commands to identify interface states
    • Practice proper cable testing procedures
    • Document findings and solutions
  3. Skills Developed:
    • Physical troubleshooting methodology
    • Interface status interpretation
    • Cable and connectivity testing
    • Systematic problem isolation

Lab 2: VLAN Mystery - Layer 2 Chaos

  1. Complex Setup:
    • Multi-switch network with multiple VLANs
    • Introduce VLAN misconfigurations
    • Break trunk configurations
    • Create spanning tree problems
  2. Investigation Required:
    • Devices in same VLAN can't communicate
    • Some trunks not passing all VLANs
    • Intermittent connectivity issues
    • Some ports in wrong VLANs
  3. Advanced Challenges:
    • Use multiple troubleshooting approaches
    • Practice Layer 2 show commands
    • Understand spanning tree impact
    • Fix problems without breaking working parts

Lab 3: Routing Riddle - Multi-Network Mayhem

  1. Enterprise Scenario:
    • Multi-router network with multiple subnets
    • Mix static routes and dynamic routing
    • Introduce routing table problems
    • Create reachability issues
  2. Problems to Solve:
    • Some networks unreachable
    • Asymmetric routing issues
    • Missing default routes
    • Routing protocol neighbor problems
  3. Master Detective Skills:
    • Use ping and traceroute effectively
    • Analyze routing tables systematically
    • Test bidirectional connectivity
    • Verify routing protocol operation

Lab 4: The Ultimate Challenge - Everything Broken

  1. Real-World Chaos:
    • Large network with problems at every layer
    • Mix physical, VLAN, routing, and service issues
    • Time pressure simulation
    • Multiple simultaneous problems
  2. Professional Scenario:
    • Act like network is down in production
    • Prioritize problems by business impact
    • Work systematically under pressure
    • Document everything for post-incident review
  3. Master Level Skills:
    • Rapid problem triage and prioritization
    • Systematic troubleshooting under pressure
    • Effective use of all diagnostic tools
    • Professional problem documentation
🎯 Detective Graduation: Successfully complete all four labs and you'll be ready to solve any network mystery in the real world!

⚡ Professional Troubleshooting Best Practices

The Professional Detective Mindset

Stay Calm Under Pressure

Panicking leads to mistakes and overlooked solutions

Be Methodical

Systematic approaches find problems faster than random guessing

Document Everything

Notes help others and create knowledge for future problems

Think Before Acting

Understand the impact of changes before making them

Communication During Incidents

Set Expectations

Tell users and management realistic timeframes

Provide Updates

Communicate regularly, even when there is no progress to report

Escalate Appropriately

Know when to get help from senior engineers

Document Timeline

Track problem start, actions taken, and resolution

Change Management During Troubleshooting

Challenge: Making changes safely during outages
Need to fix problems without making things worse
Safe Change Practices:
✓ Always have a rollback plan before making changes
✓ Test changes in lab environment when possible
✓ Change one thing at a time
✓ Document every change made
✓ Get approval for significant changes

Tools and Resources Management

Physical Tools

Cable testers, tone generators, laptops with network tools

Software Tools

Network monitoring, protocol analyzers, documentation systems

Knowledge Resources

Vendor documentation, internal runbooks, online communities

Emergency Contacts

Vendor support, escalation contacts, key personnel

Monitoring and Proactive Maintenance

Baseline Monitoring

Know what normal looks like for your network

Alerting Systems

Set up monitoring to catch problems early

Regular Maintenance

Preventive maintenance prevents many problems

Capacity Planning

Monitor growth and plan for future needs

Learning from Problems

Post-Incident Review Process:
1. Document what happened and when
2. Identify root cause and contributing factors
3. Review response time and effectiveness
4. Identify lessons learned
5. Create action items to prevent recurrence
6. Update documentation and procedures
7. Share knowledge with team
8. Follow up on action items

Building Your Detective Toolkit

Essential Commands for Your Troubleshooting Toolkit
Quick Status Check:
show ip interface brief
show interfaces status
show ip route summary
show version

Connectivity Testing:
ping [destination]
traceroute [destination]
telnet [ip] [port]

Layer 2 Analysis:
show mac address-table
show vlan brief
show spanning-tree summary
show cdp neighbors

Layer 3 Analysis:
show ip route
show arp
show ip protocols

Security and Services:
show access-lists
show ip nat translations
show ip dhcp binding
show hosts

📖 Chapter Summary

  • Systematic Approach: Use methodical troubleshooting process, not random guessing
  • OSI Model Guide: Troubleshoot layer by layer, bottom-up or top-down
  • Essential Tools: Ping, traceroute, show commands, and debug tools
  • Physical First: For complete failures, check cables, power, and interface status before anything else
  • Layer 2 Issues: VLANs, spanning tree, MAC tables, and switch configuration
  • Layer 3 Problems: IP addressing, routing tables, and gateway configuration
  • Documentation: Record problems, solutions, and lessons learned
  • Professional Skills: Communication, change management, and continuous learning
🎯 Detective Mastery Achieved! You now have the skills and mindset to solve any network mystery. Remember: the best troubleshooters are made through practice, patience, and persistence!

📝 Troubleshooting Mastery Quiz

1. What are the six steps of systematic troubleshooting? Define problem, gather information, form hypothesis, test theory, implement solution, verify and document

2. When should you use bottom-up vs top-down troubleshooting? Bottom-up for complete failures (physical issues), top-down for specific application problems

3. What does "FastEthernet0/0 is up, line protocol is down" indicate? Layer 1 (physical) is working but Layer 2 (data link) has problems like wrong encapsulation or keepalive issues

4. How do you test if a specific TCP port is open on a server? Use telnet [server-ip] [port-number] to test port connectivity

5. What's the first thing to check when users report "internet is down"? Get specific symptoms, test from multiple locations, check if it's DNS, connectivity, or application specific

6. What does a traceroute showing * * * mean? That hop is not responding, could be firewall blocking ICMP or device configured not to respond

7. Why should you use "undebug all" after troubleshooting? Debug commands generate high CPU usage and lots of output that can overwhelm the router

8. What's the most important rule when making changes during troubleshooting? Always have a rollback plan and change one thing at a time while documenting everything
