I have logging, monitoring, and metrics set up pretty extensively, and it's still super annoying to troubleshoot. For instance, one of the microservices couldn't communicate with another in certain situations. I had three instances running with a load balancer in between (Kubernetes). To troubleshoot, you first need to figure out which service is having the issue, then which instance, then when and how. After that, replicating it is nearly impossible because you can't control which instance gets hit, etc.
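
A minimal sketch of one way to shorten the "which instance, then when" steps, assuming the services are in Python and use the standard `logging` module (not necessarily the setup described above): stamp every log line with the instance name and a per-request ID. In Kubernetes the pod's hostname defaults to the pod name, so `socket.gethostname()` is used for the instance. The names `request_id_var`, `InstanceContextFilter`, `build_logger`, and `handle_request` are hypothetical, made up for illustration.

```python
import contextvars
import logging
import socket
import uuid

# Hypothetical: holds the ID of the request currently being handled.
request_id_var = contextvars.ContextVar("request_id", default="-")


class InstanceContextFilter(logging.Filter):
    """Attach the instance name and current request ID to every record."""

    def __init__(self, instance=None):
        super().__init__()
        # Assumption: inside a pod, the hostname is the pod name.
        self.instance = instance or socket.gethostname()

    def filter(self, record):
        record.instance = self.instance
        record.request_id = request_id_var.get()
        return True  # never drop the record, only annotate it


def build_logger(name, stream, instance=None):
    """Create a logger whose lines include instance and request ID."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(instance)s %(request_id)s %(levelname)s %(message)s"
    ))
    handler.addFilter(InstanceContextFilter(instance))
    logger.addHandler(handler)
    return logger


def handle_request(logger, incoming_request_id=None):
    """Reuse the caller's request ID if one was forwarded, else mint one."""
    token = request_id_var.set(incoming_request_id or uuid.uuid4().hex)
    try:
        logger.info("calling downstream service")
        return request_id_var.get()
    finally:
        request_id_var.reset(token)
```

If each service forwards the same ID to the next one (for example in a request header), searching the aggregated logs for that ID should show which instance of each service handled the failing call. It doesn't fix the replication problem, since the load balancer still picks the instance.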