When we talk about performance for the web, a number of familiar questions may come to mind:

  • Why does this page take so long to load?
  • How can I optimize my JavaScript to be faster?
  • If I make some changes to this code, will that make this app slower?

I’ve been working on making these types of questions easier to answer for Gaia, the UI layer for Firefox OS, a completely web-centric mobile device OS. Writing performant web pages for the desktop has its own idiosyncrasies, and writing native applications using web technologies takes the challenge up an order of magnitude. I want to introduce the challenges I’ve faced in making performance an easier topic to tackle in Firefox OS, as well as document my solutions and expose holes in the web’s APIs that need to be filled.

From now on, I’ll refer to web pages, documents, and the like as applications, and while web “documents” typically don’t need the performance attention I’m going to give here, the same techniques could still apply.

Fixing the lifecycle of applications

A common question I get asked regarding Firefox OS applications:

How long did the app take to load?

Tough question, as we can’t be sure we are speaking the same language. Based on UX and my own research at Mozilla, I’ve tried to adopt this definition for determining the time it takes an application to load:

The amount of time it takes to load an application is measured from the moment a user initiates a request for the application to the moment the application appears ready to interact with.

On mobile devices, this is generally from the time the user taps an icon to launch an app until the app appears visually loaded, i.e. it looks like the user can start interacting with the application. Some of this time is delegated to the OS to get the application to launch, which is outside the control of the application in question, but the bulk of the loading time should be within the app.

So window load, right?

I hate to say it, but with SPAs, Ajax, script loaders, and deferred execution, window load doesn’t hold much meaning anymore. If we could merely measure the time it takes to hit load, our work would be easy. Unfortunately, there is no way to infer the moment an application is visually loaded in a deterministic way for everyone. Instead we rely on the applications themselves to imply it for us.

For Firefox OS, I helped develop a series of conventional moments that are relevant to almost every application for implying its loading lifecycle (also documented on MDN):

navigation loaded

The application designates that its core chrome or navigation interface exists in the DOM and has been marked as ready to be displayed, e.g. the element is not set to display: none or hidden by other means.

navigation interactive

The application designates that the core chrome or navigation interface has its events bound and is ready for user interaction.

visually loaded

The application designates that it is visually loaded, e.g. the “above-the-fold” content exists in the DOM and has been marked as ready to be displayed, again not set to display: none or hidden by other means.

content interactive

The application designates that it has bound the events for the minimum set of functionality to allow the user to interact with the “above-the-fold” content made available at visually loaded.

fully loaded

The application has been completely loaded, i.e. any relevant “below-the-fold” content and functionality has been injected into the DOM, been marked visible, and is ready for user interaction. Any required startup background processing is complete and should exist in a stable state barring any further user interaction.

The important moment is visually loaded, as this correlates with what the user perceives as something “being ready”. As an added bonus, using this metric pairs nicely with camera-based performance verifications.

Denoting moments

With a clearly-defined application launch lifecycle, we can denote these moments with the User Timing API, available in Firefox OS starting with v2.2:

window.performance.mark( string markName )

Specifically during a startup:

performance.mark('navigationLoaded');
performance.mark('navigationInteractive');
performance.mark('visuallyLoaded');
performance.mark('contentInteractive');
performance.mark('fullyLoaded');
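As a purely hypothetical sketch (the commented steps are placeholders, and the right placement depends entirely on an app’s architecture), the dispatch points during startup might look something like this:

window.addEventListener('DOMContentLoaded', function () {
  // ...render the core chrome/navigation into the DOM...
  performance.mark('navigationLoaded');

  // ...bind the navigation event handlers...
  performance.mark('navigationInteractive');

  // ...render and reveal the above-the-fold content...
  performance.mark('visuallyLoaded');

  // ...bind events for the above-the-fold content...
  performance.mark('contentInteractive');

  // Defer below-the-fold content and background processing
  setTimeout(function () {
    // ...inject remaining content and finish startup work...
    performance.mark('fullyLoaded');
  });
});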

You can even use the measure() method to create a measurement between the current moment and an earlier mark, or even between two other marks:

// Denote point of interaction
performance.mark('tapOnButton');

loadUI();

// Capture time from now (sectionLoaded) to tapOnButton
performance.measure('sectionLoaded', 'tapOnButton');

And fetching these performance metrics is pretty straightforward with getEntries, getEntriesByName, or getEntriesByType, which fetch a collection of the entries. The purpose of this article isn’t to cover the usage of User Timing in depth, so a quick sketch of reading the marks back will have to suffice before I move on.
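For example, assuming the marks from the startup sequence above have been recorded:

// Fetch every entry recorded so far
var allEntries = performance.getEntries();

// Fetch only marks, or a specific mark by name
var marks = performance.getEntriesByType('mark');
var visuallyLoaded = performance.getEntriesByName('visuallyLoaded')[0];

// Each entry exposes a name, entryType, startTime, and duration
console.log(visuallyLoaded.name, visuallyLoaded.startTime);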

Armed with the moment an application is visually loaded, we know how long it took the application to load because we can just compare it to…oh, wait, no. We don’t know the moment of user intent to launch.

While desktop sites may be able to easily procure the moment at which a request was initiated, doing this on Firefox OS isn’t as simple. In order to launch an application, a user will typically tap an icon on the Homescreen. The Homescreen lives in a process separate from the app being launched, and we can’t communicate performance marks between them.

Solving problems with Raptor

Without the APIs or interaction mechanisms present in the platform to overcome this and other difficulties, we build tools to help us out. This is how the Raptor performance testing tool originated. With it we can gather metrics from Gaia applications and answer the performance questions we have.

Raptor was built with a few goals in mind:

  1. Performance test Firefox OS without affecting performance. We shouldn’t need polyfills, test code, or hackery to get realistic performance metrics.
  2. Utilize web APIs as much as possible, filling in gaps through other means as necessary.
  3. Be flexible enough to cater to the many different architectural styles of applications.
  4. Be extensible for performance testing scenarios outside the norm.

Problem: Determining moment of user intent to launch

Given two independent applications, Homescreen and any other user application, how can we create a performance marker in one and compare it in another? Even if we could send our performance mark from one app to another, the values are incomparable. According to the High Resolution Time specification, the values produced are monotonically increasing numbers measured from each page context’s own time origin. They represent the amount of time passed since that origin, not an absolute moment in time.
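For illustration, two independent processes each measure time from their own origin, so their raw values mean nothing to one another (the example values here are hypothetical):

// In the Homescreen process
performance.now(); // e.g. 80081.169720 — ms since the Homescreen's time origin

// In the Clock process
performance.now(); // e.g. 1074.739956 — ms since the Clock's (much later) time origin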

The first shortcoming of our performance APIs is that there is no way to relate a performance mark in one application to one in another. Raptor overcomes this rather crudely: log parsing.

Yes, you read that correctly. Every time Gecko receives a performance mark, it logs a message (i.e. to adb logcat) and Raptor streams and parses the log looking for these log markers. A typical log entry looks something like this (we will decipher it later):

I/PerformanceTiming( 6118): Performance Entry: clock.gaiamobile.org|mark|visuallyLoaded|1074.739956|0.000000|1434771805380

The important thing to notice in this log entry is its origin: clock.gaiamobile.org, or the Clock app; here the Clock app created its visually loaded marker. In the case of the Homescreen, however, we want to create a marker intended for a different context altogether. This requires some additional metadata to go along with the marker, but unfortunately the API does not yet have that ability. In Gaia, we have adopted an @ convention to override the context of a marker, and with it, let’s mark the moment of app launch as determined by when the user taps the icon:

performance.mark(`appLaunch@${appOrigin}`)

Launching the Clock from the Homescreen and dispatching this marker, we get the following log entry:

I/PerformanceTiming( 5582): Performance Entry: verticalhome.gaiamobile.org|mark|appLaunch@clock.gaiamobile.org|80081.169720|0.000000|1434771804212

Whenever Raptor sees this @ convention, it changes the context of the marker accordingly.
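To make this concrete, here is a rough, hypothetical sketch of that parsing step (not Raptor’s actual implementation), based on the pipe-delimited format of the log entries above:

// Hypothetical sketch; field order taken from the log entries shown above:
// origin|entryType|name|startTime|duration|epoch
function parsePerformanceEntry(line) {
  var payload = line.split('Performance Entry: ')[1];
  var fields = payload.split('|');

  var entry = {
    context: fields[0],
    entryType: fields[1],
    name: fields[2],
    startTime: parseFloat(fields[3]),
    duration: parseFloat(fields[4]),
    epoch: parseInt(fields[5], 10)
  };

  // The @ convention: "appLaunch@clock.gaiamobile.org" re-assigns the entry
  // from the Homescreen's context to the Clock app's context
  var atIndex = entry.name.indexOf('@');
  if (atIndex !== -1) {
    entry.context = entry.name.slice(atIndex + 1);
    entry.name = entry.name.slice(0, atIndex);
  }

  return entry;
}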

Problem: Incomparable numbers

The second shortcoming of our performance APIs is the incomparability of performance marks across processes. Using performance.mark() in two separate apps will not produce any meaningful numbers that we can compare to determine a length of time, because their values do not share a common absolute time reference point. Fortunately, there is an absolute time reference that all JS can access: the Unix epoch.

Calling Date.now() at any given moment returns the number of milliseconds that have elapsed since January 1, 1970. Raptor had to make an important trade-off here: abandon the precision of high-resolution time for the comparability of the Unix epoch. Looking at the previous log entry, let’s break down what is output. Notice the correlation of certain pieces to their User Timing counterparts:

  • Log level and tag: I/PerformanceTiming
  • Process ID: 5582
  • Base context: verticalhome.gaiamobile.org
  • Entry type: mark, but could be measure
  • Entry name: appLaunch@clock.gaiamobile.org, the @ convention overriding the mark’s context
  • Start time: 80081.169720
  • Duration: 0.000000, this is a mark, not a measure
  • Epoch: 1434771804212

For every performance mark and measure, Gecko also captures the epoch of the mark, and we can use this to compare times from across processes.
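As a concrete illustration using the two log entries above, the Homescreen’s appLaunch mark and the Clock’s visuallyLoaded mark can be compared directly through their epochs:

// Epoch values taken from the two log entries shown earlier
var appLaunchEpoch = 1434771804212;      // appLaunch@clock.gaiamobile.org (logged by the Homescreen)
var visuallyLoadedEpoch = 1434771805380; // visuallyLoaded (logged by the Clock app)

// Both share the Unix epoch as their reference point, so the difference is meaningful
var launchTime = visuallyLoadedEpoch - appLaunchEpoch; // 1168 ms from tap to visually loaded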

Pros and Cons

Everything is a game of tradeoffs, and performance testing with Raptor is no exception:

  • We trade high-resolution times for millisecond resolution so we may compare numbers across processes.
  • We trade JavaScript APIs for log parsing so we can access data without injecting custom logic into every application which would affect its performance.
  • We currently trade a high-level interaction API, i.e. Marionette, for low-level interactions using Orangutan behind the scenes. While this lets us have transparent events for the platform, it also makes writing rich tests difficult. There are plans to improve this in the future by adding Marionette integration.

Log Parsing?

You may be a person who believes log parsing is evil, and to a certain extent I would agree with you. While I do wish for every solution to be solvable using a performance API, unfortunately this just doesn’t exist yet. This is just one reason why projects like Firefox OS are important for pushing the web forward: we find use cases in which the web has not been fully fleshed out, poke holes and find what is missing, and ultimately improve APIs for everyone by pushing for these holes to be filled with standards. Log parsing is Raptor’s stop-gap until the web catches up.

Raptor Flow

Raptor is a Node.js module built into the Gaia project that enables you to run performance tests against a device or emulator. Once you have the project dependencies installed, running performance tests from the Gaia directory is straightforward:

  1. Install the Raptor profile on the device; this configures various settings to assist with performance testing. Note that installing this profile will reset Gaia, so keep that in mind if you have any particular settings stored.
    make raptor

  2. Choose a test to run. Currently tests are stored in tests/raptor in the Gaia tree, so some manual discovery is needed. There are plans to improve the command-line API soon.

  3. Run the test. For example, you can performance test the cold launch of the Clock app using the following command, specifying the number of runs to launch it:

    APP=clock RUNS=5 node tests/raptor/launch_test

  4. Observe the console output. At the end of the test, you will be given a table of test results with some statistics about the performance runs completed. Example:

[Cold Launch: Clock Results] Results for clock.gaiamobile.org

Metric                           Mean    Median  Min     Max     StdDev p95
-------------------------------- ------- ------- ------- ------- ------ -------
coldlaunch.navigationLoaded      214.100 212.000 176.000 269.000 19.693 247.000
coldlaunch.navigationInteractive 245.433 242.000 216.000 310.000 19.944 274.000
coldlaunch.visuallyLoaded        798.433 810.500 674.000 967.000 71.869 922.000
coldlaunch.contentInteractive    798.733 810.500 675.000 967.000 71.730 922.000
coldlaunch.fullyLoaded           802.133 813.500 682.000 969.000 72.036 928.000
coldlaunch.rss                   10.850  10.800  10.600  11.300  0.180  11.200
coldlaunch.uss                   0.000   0.000   0.000   0.000   0.000  n/a
coldlaunch.pss                   6.190   6.200   5.900   6.400   0.114  6.300

Having access to raw performance data is helpful for getting a quick glance at how long something takes, or to determine if a change you made causes a number to increase, but it’s not very helpful for capturing changes over time. Raptor has two methods for visualizing performance data over time to improve this area.

Official Metrics

At raptor.mozilla.org, we have dashboards for persisting the values of performance metrics over time. In our automation infrastructure, we execute performance tests against devices for every new build generated by mozilla-central or b2g-inbound (note that the source of builds could change in the future). Right now this is limited to Flames running at 319MB of memory, but there are plans to expand to different memory configurations and additional device types in the very near future. When automation receives a new build, we run our battery of performance tests against the devices, capturing numbers such as application launch time and memory at fullyLoaded, reboot duration, and power current. These numbers are stored and visualized many times per day, varying based on the commits for the day.

Looking at these graphs, you can drill down into specific apps, focus or expand your time query, and do advanced query manipulation to gain insight into performance. Watching trends over time, you can even use it to pick out regressions that snuck into Firefox OS.

Local visualization

The very same visualization tool and backend used by raptor.mozilla.org is also available as a Docker image. After you run Raptor tests locally, the data is reported to your own visualization dashboard based on those local metrics. There are some additional prerequisites for local visualization, so be sure to read the docs on MDN to get started.

Building pretty graphs that display metrics is all well and good, but finding trends in data or signal within noise can be difficult. Graphs help us understand data and make it accessible for others to easily communicate around the topic, but using them for finding regressions in performance is reactive; we should be proactive about keeping things fast.

Regression hunting on CI

Rob Wood has been doing incredible work in our pre-commit continuous integration efforts surrounding the detection of performance regressions in prospective commits. With every pull request to the Gaia repository, our automation runs the Raptor performance tests against the target branch with and without the patch applied. After a certain number of iterations for statistical accuracy, we have the ability to reject patches from landing in Gaia if a regression is too severe. For scalability purposes we use emulators to run these tests, which has its own inherent drawbacks, such as greater variability in the metrics reported. This variability limits the precision with which we can detect regressions.

Regression hunting in automation

Luckily, we have post-commit automation in place to run performance tests against real devices, and this is where the dashboards receive their data. Using the excellent Python tool from Will Lachance, we query our historical data daily, attempting to discover any smaller regressions that could have crept into Firefox OS in the previous seven days. Any performance anomalies found are promptly reported to Bugzilla, and relevant bug component watchers are notified.

Recap and next steps

Raptor, combined with User Timing, has given us the know-how to ask questions about the performance of Gaia and receive accurate answers. In the future we plan on improving the API of the tool and adding higher-level interactions. Raptor should also be able to work more seamlessly with third-party applications, something that is not easily doable right now.

Raptor has been an exciting tool to build, while at the same time helping us drive the web forward in the realm of performance. We plan on using it to keep Firefox OS fast, and being proactive about protecting its performance.


Originally published at eliperelman.com.