
Online Performance Auditing: Using Hot Optimizations Without Getting Burned

[HPM][PLDI][JIT][JVM][not yet finished reading]

@article{1134010,
author = {Jeremy Lau and Matthew Arnold and Michael Hind and Brad Calder},
title = {Online performance auditing: using hot optimizations without getting burned},
journal = {SIGPLAN Not.},
volume = {41},
number = {6},
year = {2006},
issn = {0362-1340},
pages = {239--251},
doi = {http://doi.acm.org/10.1145/1133255.1134010},
publisher = {ACM},
address = {New York, NY, USA},
}

Abstract

  • Although optimizations typically perform well on average, they often have unpredictable impact on running time, sometimes degrading performance significantly.
  • an online framework for evaluating the effectiveness of optimizations, enabling an online system to automatically identify and correct performance anomalies that occur at runtime.

Introduction

  • a low-overhead, completely automatic online system
  • an offline feasibility study demonstrating the robustness of the statistical technique used in the framework

Background

  Why Modeling Is Not Enough

  • Examples of compiler optimizations that have used explicit models are inlining, instruction selection, instruction scheduling, and register allocation
  • Implicit models are more prevalent than explicit models.
  • Most optimizations that use profiling information (often referred to as feedback-directed optimization, or FDO) still employ a model, whether explicit or implicit.
  • they generally do not measure program performance directly
  • e.g., optimizations are often designed assuming that branch predictors will predict forward and backward branches in a certain way
  • one way to address the shortcomings of performance models is to refine the models to more accurately predict performance, but we believe this is an impossible, never-ending task.
  • with the current trend of adding more levels of complexity and virtualization to all levels of the execution stack, the problem will only become more difficult as true performance becomes ever harder to predict

  Examples to Motivate Searching the Optimization Space

  • evidence that demonstrates the difficulty of modeling the performance impact of optimizations

Experimental Setup

  • modified IBM's J9 JIT compiler to read the hardware cycle counter on method entry and exit, measuring the cycles spent in a given method (a minimal timing sketch follows this list)
    • this infrastructure allows us to evaluate the impact of optimizations on individual methods in the application
  • to evaluate code quality, these experiments measure steady-state performance
    • i.e., a configuration of the benchmark that excludes the early phase of execution during which all (dynamic) compilation occurs.
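
  A minimal sketch of per-method entry/exit timing. The paper reads the hardware cycle counter inside a modified J9 JIT; plain Java has no direct RDTSC access, so System.nanoTime() stands in here, and the MethodTimer name is a hypothetical illustration, not the paper's infrastructure.

    import java.util.ArrayList;
    import java.util.List;

    // Records the time spent in one code region, one sample per invocation.
    final class MethodTimer {
        private final List<Long> samples = new ArrayList<>();

        // Wraps one invocation of the timed region: "entry" and "exit" readings.
        long timeInvocation(Runnable body) {
            long start = System.nanoTime();            // method entry
            body.run();
            long elapsed = System.nanoTime() - start;  // method exit
            samples.add(elapsed);
            return elapsed;
        }

        List<Long> samples() { return samples; }
    }

  Usage would be something like new MethodTimer().timeInvocation(() -> hotMethod()), collecting one sample per call.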

Method Inlining

  • method inlining has a fairly clear cost-benefit model
    • the potential benefit of inlining is the performance gained by removing call overhead and optimizing the callee in the context of the caller.
    • the potential cost of inlining is a decrease in instruction cache locality and increased pressure on downstream optimizations such as register allocation (both sides are illustrated in the sketch below).
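
  An illustration only (hypothetical code, not from the paper) of what the cost-benefit model weighs: inlining removes the call and lets the callee's body be optimized together with the caller, at the price of a larger method body.

    // Illustrative Java example of the inlining trade-off.
    final class InliningExample {
        private static int square(int x) { return x * x; }

        // Before inlining: one call per iteration, and square() is compiled
        // without knowledge of the caller's context.
        static int sumOfSquaresCall(int n) {
            int sum = 0;
            for (int i = 0; i < n; i++) sum += square(i);
            return sum;
        }

        // After inlining: the call overhead is gone and the callee's body can
        // be optimized with the surrounding loop; the cost is a bigger method
        // body, which can hurt i-cache locality and register allocation.
        static int sumOfSquaresInlined(int n) {
            int sum = 0;
            for (int i = 0; i < n; i++) sum += i * i;
            return sum;
        }
    }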

The Performance Auditor

  • Performance Auditor, a framework that enables online performance evaluation
    • the framework tracks the bottom-line performance of multiple implementations of a region of code to determine which one is fastest.
    • this work uses method boundaries to delimit code regions and focuses on comparing only two versions of each method at a time
    • we refer to the process of identifying the faster method variant as a performance bakeoff
      • the component that requests and uses the bakeoff results is called the client
  • client of our framework is responsible for
    • providing the alternative optimized implementations of the code region to be compared
    • deciding what action to take with the outcome of the bakeoff
  • our solution has 2 components
    • a technique to collect execution times of an implementation of a code region
      • the client provides N optimized versions of the code region to evaluate
      • the key idea is to randomly select one of these implementations whenever the code region is entered, and record the amount of time spent
    • a statistical mechanism to determine which set of timings is fastest
      • the executions most likely occur with different program state (parameters and/or memory values)
      • this implies that the different implementations of this code region could have taken different paths and operated on different data during their executions.
      • this also relates to how an algorithm may be selected based on its input values.
      • to calculate our confidence metric, the mean and variance are computed for each set of timings (see the sketch after this list)
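
  A hedged sketch of such a confidence metric, assuming a normal approximation to the difference of the two sample means (the paper computes a confidence from the means and variances of the two timing sets; the exact statistic is not reproduced in these notes, and the class and method names here are hypothetical).

    // Given two sets of timings, estimate the confidence that version A is
    // faster than version B using a z-statistic on the difference of means.
    final class BakeoffConfidence {
        static double mean(double[] xs) {
            double s = 0;
            for (double x : xs) s += x;
            return s / xs.length;
        }

        static double variance(double[] xs, double mu) {
            double s = 0;
            for (double x : xs) s += (x - mu) * (x - mu);
            return s / (xs.length - 1);
        }

        // Returns an approximation of P(mean time of A < mean time of B).
        static double confidenceAFaster(double[] a, double[] b) {
            double ma = mean(a), mb = mean(b);
            double se = Math.sqrt(variance(a, ma) / a.length + variance(b, mb) / b.length);
            if (se == 0) return 0.5;                 // no variation observed yet
            return phi((mb - ma) / se);
        }

        // Standard normal CDF via Phi(z) = 0.5 * (1 + erf(z / sqrt(2))),
        // with an Abramowitz-Stegun polynomial approximation of erf.
        private static double phi(double z) {
            double x = Math.abs(z) / Math.sqrt(2.0);
            double t = 1.0 / (1.0 + 0.47047 * x);
            double erf = 1.0 - (0.3480242 * t - 0.0958798 * t * t + 0.7478556 * t * t * t)
                    * Math.exp(-x * x);
            return z >= 0 ? 0.5 * (1.0 + erf) : 0.5 * (1.0 - erf);
        }
    }

  A client might, for example, declare a winner once confidenceAFaster crosses a threshold such as 0.95 and keep sampling otherwise.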

Methodology

  • using a development version of IBM's J9 VM and its high-performance optimizing JIT compiler
  • 20 benchmarks (SPECjvm98, SPECjbb2000, DaCapo)
  • hot methods are defined as methods that consume enough cycles to be selected by J9 for aggressive feedback-directed optimizations
  • timings use the Pentium 4's read time-stamp counter (RDTSC) instruction

Offline Convergence Study

  Experimental Setup

  • our goal is to determine how many samples are needed to accurately detect a performance difference between two compilations of a method
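
  A hedged sketch of how such a convergence check could look: feed timing samples for two compilations in pairs and report the first sample count at which a simple z-test on the difference of means crosses a threshold. The Welford-style online update and the names are illustrative assumptions, not the paper's implementation.

    // Returns the number of sample pairs consumed before the difference of
    // means is judged significant (|z| >= zThreshold), or -1 if it never is.
    final class ConvergenceStudy {
        static int samplesToConverge(double[] a, double[] b, double zThreshold) {
            double meanA = 0, m2A = 0, meanB = 0, m2B = 0;
            int pairs = Math.min(a.length, b.length);
            for (int n = 1; n <= pairs; n++) {
                // Welford's online update of mean and sum of squared deviations.
                double dA = a[n - 1] - meanA;
                meanA += dA / n;
                m2A += dA * (a[n - 1] - meanA);
                double dB = b[n - 1] - meanB;
                meanB += dB / n;
                m2B += dB * (b[n - 1] - meanB);
                if (n < 2) continue;                  // need two samples for a variance
                double se = Math.sqrt(m2A / (n - 1) / n + m2B / (n - 1) / n);
                if (se == 0) continue;
                double z = Math.abs(meanA - meanB) / se;
                if (z >= zThreshold) return n;        // converged at this sample count
            }
            return -1;
        }
    }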

  Fixed Number of Samples

  • prior online systems that performed a bakeoff between different compiler optimizations took either one or two timing samples for each optimized version of the code

  Statistical Approach

  Using Yield Points for Timing

Online Performance Auditor

  Adaptive Optimization in Virtual Machines

  • VMs use the JIT compiler's multiple optimization levels to trade off the cost of high-level optimizations against their benefit
  • for feedback-directed optimization, a method is typically compiled twice: once to insert instrumentation that gathers detailed profile information about the method, and a second time to take advantage of the profile information after the method has run for some duration

  Method Dispatcher

  • we must intercept the method dispatch for the method being timed (M) so that execution can jump to one of its implementations at random, as in the sketch below
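
  A Java-level stand-in for the dispatcher idea (the real system intercepts dispatch inside the VM; the interface-based indirection, class names, and use of System.nanoTime() here are assumptions for illustration): on every entry to the audited method, pick a variant at random, run it, and file the elapsed time under that variant.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ThreadLocalRandom;
    import java.util.function.IntUnaryOperator;

    // Randomly dispatches each call of the audited method M to one of its
    // optimized variants and keeps a separate timing bucket per variant.
    final class RandomizingDispatcher {
        private final IntUnaryOperator[] variants;   // the N optimized versions of M
        private final List<List<Long>> timings;      // one bucket of samples per variant

        RandomizingDispatcher(IntUnaryOperator... variants) {
            this.variants = variants;
            this.timings = new ArrayList<>();
            for (int i = 0; i < variants.length; i++) timings.add(new ArrayList<>());
        }

        int invoke(int arg) {
            int pick = ThreadLocalRandom.current().nextInt(variants.length);
            long start = System.nanoTime();
            int result = variants[pick].applyAsInt(arg);
            timings.get(pick).add(System.nanoTime() - start);
            return result;
        }

        List<Long> timingsFor(int variant) { return timings.get(variant); }
    }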

  Data Processing

  Overhead

  Inlining Client: Full Online Optimization

  • we modified the recompilation logic to perform a bakeoff for these hot methods after they have been instrumented
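
  A hedged sketch of that client logic, assuming hypothetical Compiler and CompiledMethod interfaces and a 0.95 confidence threshold (none of these names come from the paper): once the auditor reports the bakeoff outcome for a hot method, recompile it with or without aggressive inlining and install the winner.

    // Client of the Performance Auditor: acts on the bakeoff result for a hot method.
    final class InliningClient {
        interface CompiledMethod { void install(); }
        interface Compiler { CompiledMethod compile(String method, boolean aggressiveInlining); }

        private final Compiler compiler;

        InliningClient(Compiler compiler) { this.compiler = compiler; }

        // confAggressiveFaster: the auditor's confidence that the aggressively
        // inlined version was the faster one during the bakeoff.
        void onBakeoffFinished(String method, double confAggressiveFaster) {
            boolean useAggressive = confAggressiveFaster >= 0.95;   // hypothetical threshold
            CompiledMethod winner = compiler.compile(method, useAggressive);
            winner.install();   // keep the win, or correct the performance anomaly
        }
    }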

Discussion

  • The approach taken by this paper - comparing the performance of multiple versions of a region of code without attempting to hold inputs and program state constant - is often initially dismissed as infeasible.
  • One technical challenge was engineering the system to ensure that the collection and processing of data did not skew the results
  • one source of noise is timings polluted by VM activity, such as JIT compilation or garbage collection; these timings can be identified and eliminated fairly easily (one possible filter is sketched below).
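
  One possible filtering mechanism, sketched under the assumption that GC pollution is detected via the standard GarbageCollectorMXBean collection counts (the paper only states that such samples can be identified and eliminated; this particular approach and the class name are not from the paper):

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;
    import java.util.OptionalLong;

    // Discards a timing sample if a garbage collection occurred while it was taken.
    final class NoiseFilteredTimer {
        private static long totalGcCount() {
            long total = 0;
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                long c = gc.getCollectionCount();
                if (c > 0) total += c;    // -1 means "undefined" for this collector
            }
            return total;
        }

        // Returns the elapsed nanoseconds, or empty if GC polluted the sample.
        static OptionalLong timeOnce(Runnable body) {
            long gcBefore = totalGcCount();
            long start = System.nanoTime();
            body.run();
            long elapsed = System.nanoTime() - start;
            return totalGcCount() == gcBefore ? OptionalLong.of(elapsed) : OptionalLong.empty();
        }
    }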

Related Work

  Adaptive Compilation with Empirical Search

  Dynamic Optimization Using Heuristics and Feedback Directed Profiling

  Offline Empirical Search

  Other Techniques for Improving Optimization Decisions

Conclusions