# Batch Trace Processor

This document describes the overall design of Batch Trace Processor and
aids in integrating it into other systems.

## Motivation

The Perfetto trace processor is the de-facto way to perform analysis on a
single trace. Using the
[trace processor Python API](/docs/analysis/trace-processor#python-api),
traces can be queried interactively, plots made from those results etc.

While queries on a single trace are useful when debugging a specific problem
in that trace or in the very early stages of understanding a domain, it soon
becomes limiting. One trace is unlikely to be representative of the entire
population and it's easy to overfit queries, i.e. spend a lot of effort on
breaking down a problem in that trace while neglecting other, more common
issues in the population.

Because of this, what we actually want is to be able to query many traces
(usually on the order of 250-10000+) and identify the patterns which show up
in a significant fraction of them. This ensures that time is being spent on
issues which are affecting user experience instead of just a random problem
which happened to show up in the trace.

One low-effort option for solving this problem is simply to ask people to use
utilities like [Executors](https://docs.python.org/3/library/concurrent.futures.html#executor-objects)
with the Python API to load multiple traces and query them in parallel, as
sketched after the list below. Unfortunately, there are several downsides to
this approach:

* Every user has to reinvent the wheel every time they want to query multiple
  traces. Over time, there would likely be a proliferation of slightly modified
  code copied from place to place.
* While parallelising queries on multiple traces on a single machine is
  straightforward, one day we may want to shard trace processing across
  multiple machines. Once this happens, the complexity of the code would rise
  significantly to the point where a central implementation becomes a
  necessity. Because of this, it's better to have the API in place before
  engineers start building their own custom solutions.
* A big aim for the Perfetto team these days is to make trace analysis more
  accessible to reduce the number of places where we need to be in the loop.
  Having a well-supported API for an important use case like bulk trace
  analysis directly helps with this.
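
For illustration, the "roll your own" version of this might look roughly like
the sketch below; the trace paths and query are placeholders, and the
`TraceProcessor` calls are based on the existing Python API rather than
anything specific to this design.

```python
# A minimal sketch of the DIY approach using concurrent.futures with the
# existing trace processor Python API; paths and the query are placeholders.
from concurrent.futures import ThreadPoolExecutor

from perfetto.trace_processor import TraceProcessor

TRACE_PATHS = ['/traces/a.pftrace', '/traces/b.pftrace']


def query_trace(path):
  # Load one trace, run one query and return the result as a dataframe.
  tp = TraceProcessor(trace=path)
  try:
    return tp.query('SELECT ts, dur FROM slice').as_pandas_dataframe()
  finally:
    tp.close()


# Load and query all traces in parallel on the local machine.
with ThreadPoolExecutor() as executor:
  results = list(executor.map(query_trace, TRACE_PATHS))
```

Every team would end up maintaining some variant of this boilerplate, which is
exactly the duplication described in the first bullet above.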

While we've so far discussed querying traces, the experience of loading traces
from different sources should be just as good. This has historically been a big
reason why the Python API has not gained as much adoption as we would have
liked.

Especially internally in Google, we should not be relying on engineers knowing
where traces live on the network filesystem and the directory layout. Instead,
they should simply be able to specify the data source (i.e. lab, testing
population) and some parameters (e.g. build id, date, kernel version) that
traces should match, and traces meeting these criteria should be found and
loaded.

Putting all this together, we want to build a library which can:

* Interactively query ~1000+ traces in O(s) (for simple queries)
* Expose full SQL expressiveness from trace processor
* Load traces from many sources with minimal ceremony. This should include
  Google-internal sources: e.g. lab runs and internal testing populations
* Integrate with data analysis libraries for easy charting and visualization
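
To make these goals concrete, a hypothetical usage sketch is shown below. The
module path, the `BatchTraceProcessor` class and the `query_and_flatten`
method are illustrative assumptions about what such an API could look like,
not a finalized interface.

```python
# Hypothetical usage sketch; all names here are assumptions.
from perfetto.batch_trace_processor.api import BatchTraceProcessor

# A trace URI selecting traces by source and parameters (illustrative).
btp = BatchTraceProcessor('lab:?build_id=1234&date=2023-01-01')

# Run the query on every matching trace in parallel and flatten the results
# into a single dataframe with per-trace metadata columns added.
df = btp.query_and_flatten('SELECT ts, dur FROM slice')

# 'device_name' is assumed to be a metadata column added by the resolver.
print(df.groupby('device_name').dur.describe())

btp.close()
```
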
## Design Highlights

In this section, we briefly discuss some of the most impactful design decisions
taken when building batch trace processor and the reasons behind them.

### Language

The choice of language is pretty straightforward. Python is already the go-to
language for data analysis in a wide variety of domains and our problem is not
unique enough to warrant making a different decision. Moreover, another point
in its favour is the existence of the Python API for trace processor. This
further eases the implementation as we do not have to start from scratch.

The main downside of choosing Python is performance but, given that all the
data crunching happens in C++ inside trace processor, this is not a big factor.

### Trace URIs and Resolvers

[Trace URIs](/docs/analysis/batch-trace-processor#trace-uris)
are an elegant solution to the problem of loading traces from a diverse range
of public and internal sources. As with web URIs, the idea with trace URIs is
to describe both the protocol (i.e. the source) from which traces should be
fetched and the arguments (i.e. query parameters) which the traces should
match.

Batch trace processor should integrate tightly with trace URIs and their
resolvers. Users should be able to pass either just the URI (which is really
just a string for maximum flexibility) or a resolver object which can yield a
list of trace file paths.

To handle URI strings, there should be some mechanism of "registering"
resolvers to make them eligible to resolve a certain "protocol". By default,
we should provide a resolver which handles the filesystem. We should ensure
that the resolver design is such that resolvers can be closed source while the
rest of batch trace processor is open.

Along with the job of yielding a list of traces, resolvers should also be
responsible for creating metadata for each trace: these are different pieces
of information about the trace that the user might be interested in, e.g. OS
version, device name, collection date etc. The metadata can then be used when
"flattening" results across many traces, as discussed below.
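
As a concrete illustration, a resolver and its registration might look roughly
like the sketch below; the registry, class and method names are hypothetical
and only meant to show the shape of the idea: a URI protocol maps to a
resolver, which yields trace paths together with per-trace metadata.

```python
# Hypothetical sketch of the resolver concept; all names are illustrative
# assumptions, not the real batch trace processor API.
RESOLVER_REGISTRY = {}


def register_resolver(protocol, resolver_cls):
  # Associates a URI protocol (e.g. 'lab') with a resolver class.
  RESOLVER_REGISTRY[protocol] = resolver_cls


class LabResolver:
  """Resolves hypothetical 'lab:' URIs into trace paths plus metadata."""

  def __init__(self, uri):
    self.uri = uri  # e.g. 'lab:?build_id=1234&kernel_version=5.15'

  def resolve(self):
    # A real implementation would query the lab backend; the paths and
    # metadata values below are purely illustrative.
    return [
        ('/nfs/lab/trace_1.pftrace',
         {'os_version': '13', 'device_name': 'deviceA'}),
        ('/nfs/lab/trace_2.pftrace',
         {'os_version': '14', 'device_name': 'deviceB'}),
    ]


# With this registration in place, a URI string such as 'lab:?build_id=1234'
# could be turned into LabResolver('lab:?build_id=1234').
register_resolver('lab', LabResolver)
```
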
### Persisting loaded traces

Optimizing the loading of traces is critical for the O(s) query performance
we want out of batch trace processor. Traces are often accessed over the
network, meaning fetching their contents has a high latency. Traces also take
at least a few seconds to parse, eating up the budget for O(s) before we even
get to the running time of queries.

To address this issue, we take the decision to keep all traces fully loaded in
memory in trace processor instances. That way, instead of loading them on every
query/set of queries, we can issue queries directly.

For the moment, we restrict the loading and querying of traces to a single
machine. While querying n traces is "embarrassingly parallel" and shards
perfectly across multiple machines, introducing distributed systems to any
solution simply makes everything more complicated. The move to multiple
machines is explored further in the "Future plans" section.

### Flattening query results

The naive way to return the result of querying n traces is a list of n
elements, with each element being the result for a single trace. However,
after performing several case-study performance investigations using batch
trace processor, it became clear that this obvious answer was not the most
convenient for the end user.

Instead, a pattern which proved very useful was to "flatten" the results into
a single table containing the results from all the traces. However, simply
flattening causes us to lose the information about which trace a row
originated from. We can deal with this by allowing resolvers to silently add
columns with the metadata for each trace.

So suppose we query three traces with:

```SELECT ts, dur FROM slice```

Then the flattening operation might do something like the following behind the
scenes.
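
The sketch below illustrates the idea with made-up values: each per-trace
result gets the resolver's metadata appended as extra columns before
everything is concatenated into one table. Pandas and all of the specific
values here are assumptions used purely for illustration.

```python
# Hypothetical sketch of flattening; the dataframes and metadata values are
# made up, and Pandas is used purely for illustration.
import pandas as pd

# Per-trace results of running "SELECT ts, dur FROM slice" (illustrative).
per_trace_results = {
    'lab:/trace_1': pd.DataFrame({'ts': [100, 200], 'dur': [10, 20]}),
    'lab:/trace_2': pd.DataFrame({'ts': [150], 'dur': [5]}),
    'lab:/trace_3': pd.DataFrame({'ts': [300, 400], 'dur': [7, 9]}),
}

# Per-trace metadata created by the resolver (column names are assumptions).
metadata = {
    'lab:/trace_1': {'os_version': '13', 'device_name': 'deviceA'},
    'lab:/trace_2': {'os_version': '13', 'device_name': 'deviceB'},
    'lab:/trace_3': {'os_version': '14', 'device_name': 'deviceA'},
}

flattened = []
for uri, df in per_trace_results.items():
  df = df.copy()
  for column, value in metadata[uri].items():
    df[column] = value  # silently added metadata columns
  flattened.append(df)

# One table containing the rows from all traces plus their metadata.
result = pd.concat(flattened, ignore_index=True)
print(result)
```

Each row of the flattened table still records which trace it came from via
those metadata columns, so no information is lost by flattening.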
## Integration points

Batch trace processor needs to be open source yet allow deep integration with
Google-internal tooling. Because of this, there are various integration points
built into the design to allow closed-source components to be slotted in place
of the default, open source ones.

The first point is the formalization of the idea of "platform" code. Ever since
the beginning of the Python API, there was always a need for code internally to
run slightly differently to open source code. For example, the Google-internal
Python distribution does not use Pip, instead packaging dependencies into a
single binary. The notion of a "platform" loosely existed to abstract these
sorts of differences but this was very ad-hoc. As part of the batch trace
processor implementation, this has been retroactively formalized.

Resolvers are another big point of pluggability. By allowing registration of
a "protocol" for each internal trace source (e.g. lab, testing population), we
allow for trace loading to be neatly abstracted.

Finally, for batch trace processor specifically, we abstract the creation of
thread pools for loading traces and running queries. The parallelism and memory
available to programs internally often does not correspond 1:1 with the
available CPUs/memory on the system: internal APIs need to be accessed to find
out this information.
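
A platform hook for this might look roughly like the sketch below; the class
and method names are hypothetical, and an internal platform would override
them to consult internal APIs rather than `os.cpu_count()`.

```python
# Hypothetical sketch of the platform abstraction for executor creation;
# names are assumptions, not the real implementation.
import os
from concurrent.futures import ThreadPoolExecutor


class OpenSourcePlatform:
  """Default platform: assumes the whole machine is available."""

  def create_load_executor(self):
    # Thread pool used for fetching and parsing traces.
    return ThreadPoolExecutor(max_workers=os.cpu_count())

  def create_query_executor(self):
    # Thread pool used for running queries across loaded traces.
    return ThreadPoolExecutor(max_workers=os.cpu_count())


# An internal platform could subclass this and size the pools based on the
# CPU/memory actually granted to the job rather than the machine's totals.
```
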
## Future plans

One common problem when running batch trace processor is that we are
constrained by a single machine and so can only load O(1000) traces.
For rare problems, there might only be a handful of traces matching a given
pattern even in such a large sample.

A way around this would be to build a "no trace limit" mode. The idea here
is that you would develop queries as usual with batch trace processor
operating on O(1000) traces with O(s) performance. Once the queries are
relatively finalized, we could then "switch" the mode of batch trace processor
to operate closer to a "MapReduce"-style pipeline which works over O(10000)+
traces, loading O(n cpus) traces at any one time.

This allows us to retain the quick iteration speed while developing queries
while also allowing for large-scale analysis without needing to move code to a
pipeline model. However, this approach does not really resolve the root cause
of the problem, which is that we are restricted to a single machine.

The "ideal" solution here is, as mentioned above, to shard batch trace
processor across >1 machine. When querying traces, each trace is entirely
independent of any other, so parallelising across multiple machines yields
very close to perfect gains in performance at little cost.

This would, however, be quite a complex undertaking. We would need to design
the API in a way that allows for pluggable integration with various compute
platforms (e.g. GCP, Google internal, your custom infra). Even restricting
ourselves to just Google infra and leaving others open for contribution,
internal infra's ideal workload does not match the approach of "have a bunch
of machines tied to one user waiting for their input". There would need to be
significant research and design work before going here but it would likely be
worthwhile.