Structured Troubleshooting

Having failed the CCIE R&S Lab Troubleshoot section, I decided to make a write-up on how troubleshooting should be approached. Not just for the lab, but in general.

When troubleshooting you are usually under huge time pressure, because something is down. People can’t work and the business is losing money. For me, keeping calm and having a structured approach is key to efficiently finding the root cause and fixing the issue causing the outage.

Strategy

Always verify the issue! Never trust things are broken the way they are reported!

Once verified, take a deep breath and think about what can cause the problem at hand. Look at the topology and make a plan of attack. This should take a maximum of 1-2 minutes (for the lab at least). In the lab you’ll be facing approximately 10 tickets: two of them worth 4 points each, the rest worth 2 points each. Your aim should be to lose a maximum of 4 points to do well on the TS section.

Before beginning to solve tickets, be sure to read the guidelines/restrictions. These are global restrictions and apply to all tickets.
Tickets themselves can have local restrictions that override the global restrictions! Do not violate any restrictions. Work around them, or you will lose important points! If you’re unable to work around them, move on!

You must read the entire ticket before jumping to the CLI. I say this because you can easily miss a restriction mentioned at the bottom of a ticket and lose both points and time!

Keep track of time! Do not spend more than 10 minutes per ticket. Make a note and move on to the next ticket if you are not making progress.

Also, keep track of which tickets you have done and which remain. Dependencies between tickets can exist, though they may not be apparent if you solve the tickets in order.

Always verify all tickets before leaving the TS section of the lab. Make sure a solution you made in a later ticket did not break a previous one.

Also, for the CCIE lab you must match the outputs they give you! Match the correct path for traceroutes, for example. You are not required to match things like timers, MPLS labels, CEF load-sharing decisions, etc.

Keeping It Simple

In general, three things can cause reachability issues:

  1. Routing and Switching
    • A transit path
  2. Transit filtering
  3. End point filtering

Filtering can be:

  • Access Lists
    • VACLs
    • RACLs
    • PACLs
  • Policing
  • PBR
  • CoPP/CPPr
  • Port-Security
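
A quick, hedged sketch of how most of these filters can be spotted on a device (the interface name is just a placeholder):

sh ip interface GigabitEthernet0/1 | include access list
sh ip access-lists
sh vlan filter
sh policy-map control-plane
sh port-security

These won’t catch everything, but they quickly reveal interface ACLs, VACLs, CoPP policies and port-security on the box you are logged into.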

Finally, miscellaneous features can cause connectivity issues:

  • NAT
  • Kron
  • EEM
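
When none of the obvious filters turn up anything, it is worth checking these features explicitly. A sketch using standard IOS show commands:

sh ip nat translations
sh kron schedule
sh event manager policy registered

If NAT translations, kron occurrences or registered EEM policies show up unexpectedly, they are good candidates for the root cause.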

Execution

Once you have verified the ticket/problem and created a plan for how to approach finding the root cause, begin the structured troubleshooting:

  • Traceroute from the source
    • Look at the routing table on the hop where you lose reachability
  • Work your way, hop by hop, towards the destination, fixing any routing/switching issues along the path.
    • Make sure to follow the path they give you (TE)
  • If routing looks good, time to look at filtering.
    • Turn on logging
    • debug ip icmp
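
As a sketch, the workflow above might look like this (addresses and interface names are placeholders). From the source router:

traceroute 10.2.2.2 source Loopback0

Then, on the hop where reachability is lost:

sh ip route 10.2.2.2
terminal monitor
debug ip icmp

If the route is present but the debug shows unreachables being generated, a filter along the path becomes the prime suspect.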

You might be able to catch a packet filter with debug ip icmp. Otherwise you might want to quickly look at which problematic features could be configured on the routers along the path (including source and destination routers) using this command:

sh run | in policy|police|nat|group|kron|manager|filter

Once you find an issue, you should solve it using the least amount of change. Always go for the simplest solution. Be careful not to violate any restrictions for the lab.

Finally verify the problem has been fixed before moving on to the next ticket. When done with all 10 tickets, verify them all!

Disable debugging and logging and save your configurations before ending the TS section.
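
For example, on each device you touched:

undebug all
copy running-config startup-config

On most IOS versions the short forms u all and wr work as well.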