282 lines
13 KiB
Markdown
282 lines
13 KiB
Markdown
|
|
# Tracing 101
|
||
|
|
*This page provides a birds-eye view of performance analysis.
|
||
|
|
The aim is to orient people who have no idea what "tracing" is.*
|
||
|
|
|
||
|
|
## Introduction to...
|
||
|
|
### Performance
|
||
|
|
Performance analysis is concerned with making software run *better*.
|
||
|
|
The definition of *better* varies widely and depends on the situation.
|
||
|
|
Examples include:
|
||
|
|
* performing the same work using fewer resources (CPU, memory,
|
||
|
|
network, battery, etc.)
|
||
|
|
* increasing utilization of available resources
|
||
|
|
* identifying and eliminating unnecessary work altogether
|
||
|
|
|
||
|
|
Much of the difficulty in improving performance comes from
|
||
|
|
identifying the root cause of performance issues. Modern software systems are
|
||
|
|
complicated, having a lot of components and a web of cross-interactions.
|
||
|
|
Techniques which help engineers understand the execution of a system
|
||
|
|
and pinpoint issues that are critical.
|
||
|
|
|
||
|
|
**Tracing** and **profiling** are two such widely-used techniques for
|
||
|
|
performance analysis. **Perfetto** is an open-source suite of tools, combining
|
||
|
|
tracing and profiling to give users powerful insights into their system.
|
||
|
|
|
||
|
|
### Tracing
|
||
|
|
**Tracing** involves collecting highly detailed data about the execution
|
||
|
|
of a system. A single continuous session of recording is called a trace file
|
||
|
|
or **trace** for short.
|
||
|
|
|
||
|
|
Traces contain enough detail to fully reconstruct the timeline of events.
|
||
|
|
They often include low-level kernel events like scheduler context switches,
|
||
|
|
thread wakeups, syscalls, etc. With the "right" trace, reproduction of a
|
||
|
|
performance bug is not needed as the trace provides all necessary context.
|
||
|
|
|
||
|
|
Application code is also **instrumented** in areas of the program which are
|
||
|
|
considered to be *important*. This instrumentation keeps track of what the
|
||
|
|
program was doing over time (e.g. which functions were being run, or how long
|
||
|
|
each call took) and context about the execution (e.g. what were the parameters
|
||
|
|
to a function call, or why was a function run).
|
||
|
|
|
||
|
|
The level of detail in traces makes it impractical to read traces directly
|
||
|
|
like a log file in all but the simplest cases. Instead, a combination of
|
||
|
|
**trace analysis** libraries and **trace viewers** are used. Trace analysis
|
||
|
|
libraries provide a way for users to extract and summarize trace events in
|
||
|
|
a programmatic manner. Trace viewers visualize the events in a trace on a
|
||
|
|
timeline which give users a graphical view of what their system was doing
|
||
|
|
over time.
|
||
|
|
|
||
|
|
#### Logging vs tracing
|
||
|
|
A good intuition is that logging is to functional testing what
|
||
|
|
tracing is to performance analysis. Tracing is, in a sense, "structured"
|
||
|
|
logging: instead of having arbitrary strings emitted from parts of the system,
|
||
|
|
tracing reflects the detailed state of a system in a structured way to allow
|
||
|
|
reconstruction of the timeline of events.
|
||
|
|
|
||
|
|
Moreover, tracing frameworks (like Perfetto) place heavy emphasis
|
||
|
|
on having minimal overhead. This is essential so that the framework
|
||
|
|
does not significantly disrupt whatever is being measured: modern frameworks
|
||
|
|
are fast enough that they can measure execution at the nanosecond level
|
||
|
|
without significantly impacting the execution speed of the program.
|
||
|
|
|
||
|
|
*Small aside: theoretically, tracing frameworks are powerful enough to act as
|
||
|
|
a logging system as well. However, the utilization of each in practice is
|
||
|
|
different enough that the two tend to be separate.*
|
||
|
|
|
||
|
|
#### Metrics vs tracing
|
||
|
|
Metrics are numerical values which track the performance of a system over time.
|
||
|
|
Usually metrics map to high-level concepts. Examples of metrics include: CPU
|
||
|
|
usage, memory usage, network bandwidth, etc. Metrics are collected directly from
|
||
|
|
the app or operating system while the program is running.
|
||
|
|
|
||
|
|
After glimpsing the power of tracing, a natural question arises: why bother
|
||
|
|
with high level metrics at all? Why not instead just use use tracing and
|
||
|
|
compute metrics on resulting traces? In some settings, this may indeed be the
|
||
|
|
right approach. In local and lab situations using **trace-based metrics**,
|
||
|
|
where metrics are computed from traces instead of collecting them directly,
|
||
|
|
is a powerful approach. If a metric regresses, it's easy to open the trace
|
||
|
|
to root cause why that happened.
|
||
|
|
|
||
|
|
However, trace-based metrics are not a universal solution. When running in
|
||
|
|
production, the heavyweight nature of traces can make it impractical to collect
|
||
|
|
them 24/7. Computing a metric with a trace can take megabytes of data vs bytes
|
||
|
|
for direct metric collection.
|
||
|
|
|
||
|
|
Using metrics is the right choice when you want to understand the performance
|
||
|
|
of a system over time but do not want to or can not pay the cost of collecting
|
||
|
|
traces. In these situations, traces should be used as a **root-causing** tool.
|
||
|
|
When your metrics show there is a problem, targeted tracing can be rolled out
|
||
|
|
to understand why the regression may have happened.
|
||
|
|
|
||
|
|
### Profiling
|
||
|
|
**Profiling** involves sampling some usage of a resource by
|
||
|
|
a program. A single continuous session of recording is known as a **profile**.
|
||
|
|
|
||
|
|
Each sample collects the function callstack (i.e. the line of code along with
|
||
|
|
all calling functions). Generally this information is aggregated across the
|
||
|
|
profile. For each seen callstack, the aggregation gives the percentage of usage
|
||
|
|
of the resource by that callstack. By far the most common types of profiling are
|
||
|
|
**memory profiling** and **CPU profiling**.
|
||
|
|
|
||
|
|
Memory profiling is used to understand which parts of a program are allocating
|
||
|
|
memory on the heap. The profiler generally hooks into `malloc` (and `free`)
|
||
|
|
calls of a native (C/C++/Rust/etc.) program to sample the callstacks
|
||
|
|
calling `malloc`. Information about how many bytes were allocated is also
|
||
|
|
retained. CPU profiling is used for understanding where the program is
|
||
|
|
spending CPU time. The profiler captures the callstack running on a CPU
|
||
|
|
over time. Generally this is done periodically (e.g. every 50ms), but can be
|
||
|
|
also be done when certain events happen in the operating system.
|
||
|
|
|
||
|
|
#### Profiling vs tracing
|
||
|
|
There are two main questions for comparing profiling and tracing:
|
||
|
|
1. Why profile my program statistically when I can just trace *everything*?
|
||
|
|
2. Why use tracing to reconstruct the timeline of events when profiling gives me
|
||
|
|
the exact line of code using the most resources?
|
||
|
|
|
||
|
|
##### When to use profiling over tracing
|
||
|
|
Traces cannot feasibly capture execution of extreme high frequency
|
||
|
|
events e.g. every function call. Profiling tools fill this niche: by
|
||
|
|
sampling, they can significantly cut down on how much information they store.
|
||
|
|
The statistical nature of profilers are rarely a problem; the sampling
|
||
|
|
algorithms for profilers are specifically designed to capture data which is
|
||
|
|
highly representative of the real resource use.
|
||
|
|
|
||
|
|
*Aside: a handful of very specialized tracing tools exist which
|
||
|
|
can capture every function call (e.g.
|
||
|
|
[magic-trace](https://github.com/janestreet/magic-trace)) but they output
|
||
|
|
*gigabytes* of data every second which make them impractical for anything
|
||
|
|
beyond investigating tiny snippets of code. They also generally have higher
|
||
|
|
overhead than general purpose tracing tools.*
|
||
|
|
|
||
|
|
##### When to use tracing over profiling
|
||
|
|
While profilers give callstacks where resources are being used, they lack
|
||
|
|
information about *why* that happened. For example, why was malloc being called
|
||
|
|
by function *foo()* so many times? All they say is *foo()* allocated X bytes
|
||
|
|
over Y calls to `malloc`. Traces are excellent at providing this exact context:
|
||
|
|
application instrumentation and low-level kernel events together provide
|
||
|
|
deep insight into why code was run in the first place.
|
||
|
|
|
||
|
|
NOTE: Perfetto supports collecting, analyzing and visualizing both profiles
|
||
|
|
and traces at the same time so you can have the best of both worlds!
|
||
|
|
|
||
|
|
## Perfetto
|
||
|
|
Perfetto is a suite of tools for performance analysis of software. Its purpose
|
||
|
|
is to empower engineers to understand where resources are being used by their
|
||
|
|
systems. It helps identify the changes they can make to improve performance
|
||
|
|
and verify the impact of those changes.
|
||
|
|
|
||
|
|
NOTE: In Perfetto, since profiles and traces can be collected simultaneously,
|
||
|
|
we call everything a "trace" even if it may contain (only) profiling data
|
||
|
|
inside.
|
||
|
|
|
||
|
|
### Recording traces
|
||
|
|
Perfetto is highly configurable when it comes to recording traces. There are
|
||
|
|
literally hundreds of knobs which can be tweaked to control what data is
|
||
|
|
collected, how it should be collected, how much information a trace should
|
||
|
|
contain etc.
|
||
|
|
|
||
|
|
[Record traces on Linux quickstart](/docs/quickstart/linux-tracing.md) is
|
||
|
|
a good place to start if you're unfamiliar with Perfetto. For Android
|
||
|
|
developers,
|
||
|
|
[Record traces on Android quickstart](/docs/quickstart/android-tracing.md) will
|
||
|
|
be more applicable. The [trace configuration](/docs/concepts/config.md) page
|
||
|
|
is also useful to consult as a reference.
|
||
|
|
|
||
|
|
The following sub-sections give an overview of various points worth considering
|
||
|
|
when recording Perfetto traces.
|
||
|
|
|
||
|
|
#### Kernel tracing
|
||
|
|
Perfetto integrates closely with the Linux kernel's
|
||
|
|
[ftrace](https://www.kernel.org/doc/Documentation/trace/ftrace.txt) tracing
|
||
|
|
system to record kernel events (e.g. scheduling, syscalls, wakeups). The
|
||
|
|
[scheduling](/docs/data-sources/cpu-scheduling.md),
|
||
|
|
[syscall](/docs/data-sources/syscalls.md) and
|
||
|
|
[CPU frequency](/docs/data-sources/cpu-freq.md) data source pages give
|
||
|
|
examples of configuring ftrace collection.
|
||
|
|
|
||
|
|
Natively supported ftrace events can be found in the fields of
|
||
|
|
[this proto message](/docs/reference/trace-packet-proto.autogen#FtraceEvent).
|
||
|
|
Perfetto also supports collecting ftrace events it does not natively understand
|
||
|
|
(i.e. it does not have a protobuf message for) as a
|
||
|
|
["generic"](/docs/reference/trace-packet-proto.autogen#GenericFtraceEvent)
|
||
|
|
events. These events are encoded as key-value pairs, similar to a JSON
|
||
|
|
dictionary.
|
||
|
|
|
||
|
|
It is strongly discouraged to rely on generic events for production use cases:
|
||
|
|
the inefficient encoding causes trace size bloat and the
|
||
|
|
[trace processor](/docs/analysis/trace-processor.md) cannot parse them
|
||
|
|
meaningfully. Instead, support should be added for parsing important ftrace
|
||
|
|
events to Perfetto:
|
||
|
|
[here](/docs/contributing/common-tasks.md#add-a-new-ftrace-event) is a simple
|
||
|
|
set of steps to follow which are found.
|
||
|
|
|
||
|
|
#### Instrumentation with Perfetto SDK
|
||
|
|
Perfetto has a [C++ SDK](https://perfetto.dev/docs/instrumentation/tracing-sdk)
|
||
|
|
which can be used to instrument programs to emit tracing events. The SDK is
|
||
|
|
designed to be very low-overhead and is distributed in an "amalgamated" form
|
||
|
|
of a one `.cc` and one `.h` file, making it easy to integrate in any build
|
||
|
|
system.
|
||
|
|
|
||
|
|
A C SDK is under active development and should be available for general
|
||
|
|
usage by Q2 2023. See [this doc](https://bit.ly/perfetto-c) for details (note
|
||
|
|
viewing this doc requires being a member of
|
||
|
|
[this group](https://groups.google.com/forum/#!forum/perfetto-dev))
|
||
|
|
|
||
|
|
A Java/Kotlin SDK for Android (as a
|
||
|
|
[JetPack library](https://developer.android.com/jetpack/androidx)).
|
||
|
|
This is under development but there is no set timescale for when an official
|
||
|
|
release will happen.
|
||
|
|
|
||
|
|
##### android.os.Trace (atrace) vs Perfetto SDK
|
||
|
|
NOTE: This section is only relevant for Android platform developers or Android
|
||
|
|
app developers with tracing experience. Other readers can safely skip this
|
||
|
|
section.
|
||
|
|
|
||
|
|
Perfetto has significant advantages over atrace. Some of the biggest advantages
|
||
|
|
include:
|
||
|
|
* performance: tracing to Perfetto from system/app code requires just a memory
|
||
|
|
write which is far faster than the syscall latency imposed by atrace. This
|
||
|
|
generally makes Perfetto anywhere from 3-4x faster than atrace
|
||
|
|
* features: atrace's API is extremely limited, lacking support for debug
|
||
|
|
arguments, custom clocks, flow events. Perfetto has a far richer API allowing
|
||
|
|
natural representation of data-flow.
|
||
|
|
* trace size: Perfetto supports various features (delta encoded timestamps,
|
||
|
|
interned strings, protobuf encoding) which vastly reduce to size of trace
|
||
|
|
files.
|
||
|
|
|
||
|
|
Unfortunately, there are also some downsides:
|
||
|
|
* dedicated thread: a thread dedicated to Perfetto is necessary for every
|
||
|
|
process which wants to trace to Perfetto.
|
||
|
|
* wakeups on tracing start: currently, when tracing starts, every process
|
||
|
|
registered for tracing is woken up which significantly limits how many
|
||
|
|
processes can be traced. This limitation should be removed in coming quarters.
|
||
|
|
|
||
|
|
For now, the recommendation from the Perfetto team is to continue utilizing
|
||
|
|
atrace for most usecases: if you think you have a usecase which would benefit
|
||
|
|
from the SDK, please reach out to the team directly. By mid-2023, significant
|
||
|
|
progress should be made addressing the limitations of the current SDK allowing
|
||
|
|
more widespread adoption of the SDK.
|
||
|
|
|
||
|
|
<!--
|
||
|
|
TODO(lalitm): write the remainder of the doc using the following template
|
||
|
|
|
||
|
|
#### Native heap profiling
|
||
|
|
|
||
|
|
#### Java heap graphs
|
||
|
|
|
||
|
|
#### Callstack sampling
|
||
|
|
|
||
|
|
|
||
|
|
#### Flight recorder tracing
|
||
|
|
TODO(lalitm): write this.
|
||
|
|
|
||
|
|
##### Field tracing
|
||
|
|
TODO(lalitm): write this.
|
||
|
|
|
||
|
|
#### Clock sync
|
||
|
|
TODO(lalitm): write this.
|
||
|
|
|
||
|
|
|
||
|
|
#### Analysis
|
||
|
|
TODO(lalitm): write this.
|
||
|
|
* Trace processing
|
||
|
|
* UI
|
||
|
|
* httpd mode
|
||
|
|
* metrics
|
||
|
|
* Python
|
||
|
|
|
||
|
|
|
||
|
|
The remainder of this
|
||
|
|
page will focus on the applications of Perfetto to solve various performance
|
||
|
|
related problems.
|
||
|
|
|
||
|
|
## Solving problems with Perfetto
|
||
|
|
TODO(lalitm): write this.
|
||
|
|
* When to look into callstack sampling
|
||
|
|
* When to use memory profiling
|
||
|
|
* When to look at scheduling latency
|
||
|
|
|
||
|
|
|
||
|
|
TODO(lalitm): write this.
|
||
|
|
|
||
|
|
-->
|