Monday, June 06, 2011

Continuous profiling at Google

"Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers" (PDF) has some fascinating details on how Google does profiling and looks for performance problems.

From the paper:
GWP collects daily profiles from several thousand applications running on thousands of servers .... At any moment, profiling occurs only on a small subset of all machines in the fleet, and event-based sampling is used at the machine level .... The system has been actively profiling nearly all machines at Google for several years.
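To make that concrete: there are two levels of sampling here, which machines get profiled at all, and which events get recorded on a profiled machine. Here is a minimal sketch of what that fleet-level loop might look like. To be clear, this is my illustration, not Google's code; the machine list, sampling fraction, sample period, and the use of perf over ssh are all assumptions (the paper describes collectors built on OProfile):

    import random
    import subprocess

    # Hypothetical parameters; the paper does not publish GWP's actual rates.
    FLEET = ["machine%d.example.com" % i for i in range(20000)]
    MACHINE_FRACTION = 0.0025   # profile a small subset of machines at any moment
    SAMPLE_PERIOD = 100003      # hardware events between samples (event-based sampling)
    DURATION_SECS = 30          # short profiling window per chosen machine

    def collect_profiles():
        # Level 1: pick a small random subset of the fleet to profile right now.
        chosen = random.sample(FLEET, int(len(FLEET) * MACHINE_FRACTION))
        for machine in chosen:
            # Level 2: event-based sampling on the chosen machine, taking one
            # sample every SAMPLE_PERIOD CPU cycles for a short window.
            # perf stands in for whatever per-machine profiler is deployed.
            subprocess.run(
                ["ssh", machine, "perf", "record", "-a", "-e", "cycles",
                 "-c", str(SAMPLE_PERIOD), "--", "sleep", str(DURATION_SECS)],
                check=False)

The key design choice is that randomizing over machines and sampling within a machine compound, so coverage of the whole fleet builds up over days while no single machine pays much of a cost at any moment.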

Application owners won't tolerate latency degradations of more than a few percent .... We measure the event-based profiling overhead ... to ensure the overhead is always less than a few percent. The aggregated profiling overhead is negligible -- less than 0.01 percent.
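The arithmetic behind that last sentence is worth spelling out: when only a small fraction of machines is being profiled at any moment, the fleet-wide overhead is roughly the per-machine overhead scaled by that fraction. The inputs below are illustrative assumptions of mine, not figures from the paper, which only reports the bound:

    # Illustrative numbers only; the paper reports the bound, not these inputs.
    per_machine_overhead = 0.04    # at most a few percent while a machine is profiled
    fraction_profiled = 0.0025     # assumed share of the fleet profiled at any moment

    aggregate_overhead = per_machine_overhead * fraction_profiled
    print("%.4f%%" % (aggregate_overhead * 100))   # 0.0100%, the order the paper reports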

GWP profiles revealed that the zlib library accounted for nearly 5 percent of all CPU cycles consumed ... [which] motivated an effort to ... evaluate compression alternatives ... Given the Google fleet's scale, a single percent improvement on a core routine could potentially save significant money per year. Unsurprisingly, the new informal metric, "dollar amount per performance change," has become popular among Google engineers.
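The "dollar amount per performance change" metric is really just a unit conversion from CPU share to spend. With made-up fleet numbers (the paper gives none), the back-of-envelope looks like this:

    # All inputs except the ~5% zlib share are hypothetical assumptions.
    annual_fleet_cpu_cost = 1_000_000_000   # $/year of fleet CPU spend, assumed
    routine_cpu_share = 0.05                # zlib's ~5% of all cycles (from the paper)
    speedup = 0.20                          # assume an alternative is 20% cheaper

    dollars_saved = annual_fleet_cpu_cost * routine_cpu_share * speedup
    print("$%s per year" % format(int(dollars_saved), ","))   # $10,000,000 under these assumptions

Even with conservative inputs, a routine consuming a few percent of all cycles turns small relative improvements into large absolute savings, which is exactly why the metric caught on.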

GWP profiles provide performance insights for cloud applications. Users can see how cloud applications are actually consuming machine resources and how the picture evolves over time ... Infrastructure teams can see the big picture of how their software stacks are being used ... Always-on profiling ... collects a representative sample of ... [performance] over time. Application developers often are surprised ... when browsing GWP results ... [and find problems] they couldn't have easily located without the aggregated GWP results.

Although application developers already mapped major applications to their best [hardware] through manual assignment, we've measured 10 to 15 percent potential improvements in most cases. Similarly ... GWP data ... [can] identify how to colocate multiple applications on a single machine [optimally].

One thing I love about this work is how measurement provided visibility and motivated people. Just by making it easy for everyone to see how much money a code change could save, engineers started aggressively going after high-value optimizations and measuring themselves on "dollar amount per performance change".

For more color on some of the impressive performance work done at Google, please see my earlier post, "Jeff Dean keynote at WSDM 2009".
