Every developer has experienced the stress of troubleshooting an issue in production. The process usually starts with a customer complaint or an exceeded threshold alert. You immediately jump into “fix it mode” and start searching through application logs trying to find any data that might hint at the underlying cause. Then you trace through the code for a while and realize that the information you need to debug isn’t logged, so you add more logic to capture variables, and redeploy. This ends up being a vicious, repeated cycle -- and you still haven’t identified the root issue.
Earlier this week at Google Cloud Platform Live, we launched the beta release of Cloud Debugger, which makes it easier to troubleshoot applications in production. Now you can simply pick a line of code and set a watchpoint, and the debugger will return local variables and a full stack trace from the next request that executes that line on any replica of your service. There is zero setup time, no complex configuration, and no performance impact noticeable to your users.

Back when developers were building client applications that ran on a single thread on a single processor on a single machine, it was much easier to troubleshoot what was going on. Developers would still start with a problem and stare at the code, but they could reproduce the problem, set a breakpoint, inspect the stack and local variables, and quickly find the solution. Cloud Debugger brings this productive style of debugging to modern cloud production troubleshooting.
So why is this style of debugging so hard in the cloud? First, cloud-based services often have highly interdependent systems. Stopping one process to debug it changes the overall system, which may make the problem harder to reproduce. Second, cloud services are often replicated across many virtual machines, so it is impossible to know on which one to set a breakpoint. Finally, by definition, this is production traffic; you can’t just stop a service in production and give multiple customers a bad experience. The good news is we have solved each of these problems with Cloud Debugger.

After you set a watchpoint on the line in question, Cloud Debugger simultaneously debugs all instances of your service in production, whether that is a single instance or 10,000 replicas. Cloud Debugger watches execution on all instances, and as soon as one hits the condition, the debugger stops watching on all other instances.
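
The "first replica to hit the line wins" behavior described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not how Cloud Debugger is implemented and not its API: the `Watchpoint` class, `run_replicas` helper, and the `OrderService.java:42` location are all hypothetical, and threads stand in for replicas of a service.

```python
import threading


class Watchpoint:
    """Hypothetical watchpoint shared by all simulated replicas (not a real API)."""

    def __init__(self, location):
        self.location = location      # e.g. "File.java:line"; illustrative only
        self._lock = threading.Lock()
        self._active = True
        self.snapshot = None

    def try_capture(self, replica_id, local_vars):
        """Record a snapshot if this is the first hit; return True only for that hit."""
        if not self._active:          # fast path: someone else already captured
            return False
        with self._lock:
            if not self._active:      # re-check under the lock to avoid a race
                return False
            self._active = False      # disarm the watchpoint for every other replica
            self.snapshot = {"replica": replica_id, "locals": dict(local_vars)}
            return True


def handle_request(replica_id, watchpoint, results):
    """Simulated request handler that passes through the watched line."""
    order_total = replica_id * 10     # stand-in for real request state
    results[replica_id] = watchpoint.try_capture(
        replica_id, {"order_total": order_total}
    )


def run_replicas(count):
    """Run `count` simulated replicas that all race to hit the same watchpoint."""
    watchpoint = Watchpoint("OrderService.java:42")  # hypothetical location
    results = [False] * count
    threads = [
        threading.Thread(target=handle_request, args=(i, watchpoint, results))
        for i in range(count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return watchpoint, results
```

However many replicas run, exactly one of them records a snapshot, and every other replica sees the watchpoint as already disarmed and continues without capturing anything.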

When the watchpoint is hit, the locals and stack are returned. Cloud Debugger does not halt the thread, process, or service it is debugging. Instead, the debugger briefly pauses execution at the watched line, snapshots the stack and local variables, then returns execution to the normal flow. The overhead is small and bounded:

  • Zero overhead on services without active debugging
  • Less than 40 microseconds for having an active debugging session
  • Less than 10 milliseconds to capture the stack and locals
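
The capture-and-continue idea (snapshot the stack and locals at one line, then let the request finish normally) can be approximated in a few lines. The sketch below uses Python's standard `sys.settrace` hook purely for illustration; Cloud Debugger targets Java and its internal mechanism is not described here, so `run_with_snapshot` and `compute_total` are hypothetical names, and this toy version makes no claim about matching the overhead figures above.

```python
import sys
import traceback


def run_with_snapshot(func, line_offset, *args):
    """Call func(*args); capture locals and stack once when the watched line is reached.

    line_offset counts lines from the `def` line of func (0 is the def itself).
    Execution is never halted: the function runs to completion and returns normally.
    """
    target_code = func.__code__
    target_line = target_code.co_firstlineno + line_offset
    snapshot = {}

    def local_trace(frame, event, arg):
        # Fires just before each line of the watched function executes.
        if event == "line" and frame.f_lineno == target_line and not snapshot:
            snapshot["locals"] = dict(frame.f_locals)  # copy, so later changes don't leak in
            snapshot["stack"] = [f.name for f in traceback.extract_stack(frame)]
        return local_trace

    def global_trace(frame, event, arg):
        # Only trace line events inside the function that holds the watchpoint.
        if frame.f_code is target_code:
            return local_trace
        return None

    previous = sys.gettrace()
    sys.settrace(global_trace)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # restore whatever tracing was active before
    return result, snapshot


def compute_total(prices, tax_rate):  # offset 0
    subtotal = sum(prices)            # offset 1
    tax = subtotal * tax_rate         # offset 2
    return subtotal + tax             # offset 3
```

Calling `run_with_snapshot(compute_total, 3, [10, 20], 0.5)` returns the function's normal result together with a snapshot taken just before the `return` line runs, so the caller of `compute_total` never observes a stopped request.
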

Ready to get started? Try it for yourself -- there’s no setup required. All you need is a Java Managed VM-based project with its source code in Cloud Repository or in a connected GitHub or Bitbucket repo. Stay tuned for support for other programming languages and environments. We’d love to hear your feedback directly and will be monitoring Stack Overflow.

Happy debugging.

- Posted by Brad Abrams, Group Product Manager