SRE Troubleshooting: Tools

Site reliability engineers (SREs) are typically good problem solvers. They need to think logically to identify problems, correct them, and prevent them from happening again. In this course, you'll explore several built-in and open-source troubleshooting tools SREs can use to resolve system issues. You'll start by examining logging and whitebox and blackbox monitoring, the techniques used to monitor system events. You'll then work with the built-in Windows troubleshooting tools, namely Event Viewer, Resource Monitor, and System Information. Next, you'll use Google Cloud Dataflow to process logs before outlining the purpose and benefits of the StatsD standard and the /api/search endpoint. Lastly, you'll identify how Google's Dapper is used for troubleshooting distributed systems and how the open-standards tool Prometheus is used for instrumenting software and exposing metrics.