2013-11-25

Earlier this summer, we released a major update to our Android app. It shipped with a revamped UI, a new robust networking layer, and plenty of awesome. Our users loved the new experience and responded with a lot of positive feedback. It was a great release, but we weren't entirely happy with how smooth the UI was. Scrolling the feed felt choppy, especially when scrolling quickly.

So let’s talk about UI performance.

While building the new release, we didn't focus on optimizing at each step. Bottlenecks often occur in unexpected places, and in the wise words of Donald Knuth, “Premature optimization is the root of all evil.” But now that we'd shipped, it was time to go the extra mile and make the app as smooth as possible. We identified a few key areas to optimize individually, with the hope that together they would add up to the best experience we could provide. Specifically, we wanted to:

Flatten view hierarchies.

Reduce overdraw.

Make fewer potentially blocking calls on the main UI thread.

Avoid system garbage collection as much as possible.

As we explored our hypotheses, we found that documentation on performance profiling tools and debugging methods is relatively sparse. We stumbled through learning which tools are best for which jobs, but eventually we were able to leverage certain techniques and tools to effectively identify and fix performance bottlenecks. Hopefully by publishing our process we can help others debug their own performance issues.

Measure

For lit displays, the optimal frame rate is considered to be around 60 frames per second; above that, there isn't much of a perceivable difference. This means we should spend at most 1/60 s ≈ 16.7 ms serving each frame. To get a sense of how well our app was doing before optimizations, we needed data. ADB lets us use the app and then dump graphics performance information to the terminal:

adb shell dumpsys gfxinfo <package-name>

This outputs columns of information about how long it took to draw, process, and execute each frame of your app over the last 128 frames. We were concerned with how smoothly our feed scrolled, so we scrolled for a bit and then turned the output into a graph for easy visualization:



Note that in recent versions of Android, this graph can be generated and overlaid directly on screen in real time by setting 'Profile GPU rendering' to 'On screen as bars' in Developer Options.

So we're usually just barely meeting our <16ms goal, but sometimes we spend up to 60ms and miss up to 4 frames because something is holding us back in either processing or drawing those frames. This gives us a baseline for measuring our optimizations.

Flatter View Hierarchies

Layout happens in two passes on Android: measure and layout. Each pass is a top-down recursive traversal. In the measure pass, parent views push size constraints down to each of their children in the view tree, and each child View must compute its own exact dimensions while respecting those constraints. Those measurements are then used in the layout pass, where each parent positions its children within its bounds. For deep, complicated view trees, these traversals are expensive, yet they need to happen many times per second. To optimize here, we reduced the depth of our tree by removing Views with only one child and collapsing nested LinearLayouts into single RelativeLayouts. This saved around 4ms/frame on average. Ideally we'll create custom ViewGroups for each row in our ListView that leverage the consistencies in our views to calculate child measurements directly and avoid some traversals altogether.
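As a rough illustration of that custom-ViewGroup idea, here's a minimal sketch (the class name, children, and hardcoded avatar size are hypothetical, not our actual row): because the row knows exactly what its two children are, onMeasure() and onLayout() can place them directly with simple arithmetic, with no nested measure/layout passes.

import android.content.Context;
import android.view.View;
import android.view.ViewGroup;

public class StoryRowView extends ViewGroup {
    private static final int AVATAR_SIZE = 96; // hypothetical fixed size in pixels

    private final View mAvatar;
    private final View mTitle;

    public StoryRowView(Context context, View avatar, View title) {
        super(context);
        mAvatar = avatar;
        mTitle = title;
        addView(mAvatar);
        addView(mTitle);
    }

    @Override
    protected void onMeasure(int widthMeasureSpec, int heightMeasureSpec) {
        int width = MeasureSpec.getSize(widthMeasureSpec);
        // The avatar is a known fixed square, so measure it exactly once.
        int exactAvatar = MeasureSpec.makeMeasureSpec(AVATAR_SIZE, MeasureSpec.EXACTLY);
        mAvatar.measure(exactAvatar, exactAvatar);
        // The title gets whatever horizontal space remains, capped to the row height.
        mTitle.measure(
                MeasureSpec.makeMeasureSpec(width - AVATAR_SIZE, MeasureSpec.EXACTLY),
                MeasureSpec.makeMeasureSpec(AVATAR_SIZE, MeasureSpec.AT_MOST));
        setMeasuredDimension(width, AVATAR_SIZE);
    }

    @Override
    protected void onLayout(boolean changed, int l, int t, int r, int b) {
        // Position both children directly; no nested layout passes.
        mAvatar.layout(0, 0, AVATAR_SIZE, AVATAR_SIZE);
        mTitle.layout(AVATAR_SIZE, 0,
                AVATAR_SIZE + mTitle.getMeasuredWidth(), mTitle.getMeasuredHeight());
    }
}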

The SDK’s Hierarchy Viewer is very helpful for visualizing the depth of your tree-structure and identifying slow-to-draw Views.


Less Overdraw

Usually related to deep view hierarchies is overdraw: drawing the same x-y pixel multiple times in a single draw cycle, which is wasteful and should be avoided when possible. Android has a nifty overdraw viewer, accessible through Developer Options, that overlays areas with color depending on how many times the GPU draws over them in each draw cycle.

transparent = no overdraw

blue = 1 layer

green = 2 layers

light-red = 3 layers

dark red = you’re doing it wrong

Here's what ours looked like before:



Not as bad as some other apps, but we were still drawing the pixels on our feed at least 3 times over on every pass for no reason. We removed some ViewGroups and looked for places in our layout XML where we set a white background attribute even though a parent had already painted a white background, and removed those. After profiling again, our average is down by ~6ms/frame (though there's a lot of variance since it's hard to profile consistently; nonetheless, it's a clear improvement). Here's the overdraw after:

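As a side note, one more common source of a full redundant layer (a general Android technique, not something specific to our feed) is the theme's window background being painted beneath an already-opaque root layout. A minimal sketch, assuming a hypothetical Activity and layout name:

import android.app.Activity;
import android.os.Bundle;

public class FeedActivity extends Activity {
    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        // Assumes R.layout.activity_feed's root view paints an opaque background.
        setContentView(R.layout.activity_feed);
        // The theme's window background would be drawn underneath it anyway, so
        // drop it and save one full-screen layer of overdraw per frame.
        getWindow().setBackgroundDrawable(null);
    }
}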
Fewer Blocking Calls On Main Thread

This can be hard to inspect for, but the Systrace tool in the SDK's /tools directory can help start the investigation. After running the script and using your app for a few seconds, it generates an interactive HTML document that profiles calls on each running thread.

In this particular section of the timeline, we see that a touch event triggers a call to render something on screen, but at least some of the draw calls get mis-scheduled because the touch event triggered work that keeps a draw from finishing. That doesn't tell us much about what is doing what to slow what down, but it at least tells us that something is doing something to slow something down. From there, we can use the DDMS method profiling tool, TraceView, which outputs a wealth of data about how much time is spent in each executed method over a given time period. To get an idea of how long we're spending drawing a frame, we can look at the function ViewRootImpl.draw(). Highlighting that function gives us a visual sense of how long it takes by underlining its execution time in the main thread timeline (the small gray underlines in the main thread's method profile).
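As an aside, the same kind of method trace can be captured programmatically by bracketing a suspect code path with android.os.Debug calls; here's a small sketch (the helper class and trace file name are ours, chosen for illustration):

import android.os.Debug;

public class TraceHelper {
    // Wraps a suspect chunk of work in a method trace that TraceView can open.
    public static void traceSuspect(Runnable suspectWork) {
        Debug.startMethodTracing("feed_scroll"); // writes feed_scroll.trace to external storage
        try {
            suspectWork.run(); // e.g. the work done while flinging the feed
        } finally {
            Debug.stopMethodTracing(); // pull the .trace file and open it in TraceView
        }
    }
}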

The trace tells us that on average, draw takes 26ms to complete, longer than our 16ms threshold, but that average is heavily skewed by occasional very slow frames (>100ms); the median is closer to 17ms. More interesting than knowing how long it takes to draw is knowing what prevents calls to onDraw() in the first place, or what happens in the larger gaps between some of those underlines. By choosing a large block in one of those gaps and following the parent trace upstream, we find that most of the time between draws is spent in our StoryViewHelper.bindView(). Which makes sense; bindView() should prep a view with information for draw. So looking at that method, we see:

bindView() seems to consistently take around 19ms to complete, costing us more than a full frame budget between certain frames. This is a little odd; bindView() is supposed to be a lightweight method that preps the view for presentation. We see that 53% of the time we spend in bindView() is in a top-level method, Util.getGmtTime(). This is the code we use to compute how long ago a story item was posted, e.g. ‘5m ago’. It becomes a bottleneck because every time a new list item comes into view, we create a Calendar instance, two Date instances, and a TimeZone instance. When scrolling quickly this can happen many times a second, and Calendar instances are particularly expensive to create. To optimize, we can compute these values on the same background threads where we insert or update stories in our local database, and then let bindView() use the cursor information directly instead of running this calculation every time a story comes on screen.
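A minimal sketch of that approach, assuming the expensive part is parsing the server's GMT timestamp string into date objects (the class, method names, and date format here are hypothetical):

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

public class StoryTimestamps {
    // Created once and used only on the single database thread
    // (SimpleDateFormat is not thread-safe).
    private static final SimpleDateFormat GMT_FORMAT =
            new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss", Locale.US);
    static {
        GMT_FORMAT.setTimeZone(TimeZone.getTimeZone("GMT"));
    }

    // Runs on the background thread that inserts/updates stories: parse the
    // server timestamp once and store plain epoch millis in the stories table.
    public static long parseCreatedAt(String gmtTimestamp) throws ParseException {
        return GMT_FORMAT.parse(gmtTimestamp).getTime();
    }

    // bindView() can now derive '5m ago' with cheap arithmetic on the long it
    // reads from the cursor, instead of allocating a Calendar, two Dates, and
    // a TimeZone per row.
    public static String timeAgo(long createdAtMillis) {
        long minutes = (System.currentTimeMillis() - createdAtMillis) / 60000L;
        return minutes + "m ago";
    }
}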

Removing that calculation brings bindView() time down nearly 63% on average to around 7ms.

Avoid GC

This was a beefy one. After struggling to figure out why there were still random 100-200ms pauses while scrolling our feed, the answer eventually presented itself in the system logs.

It turns out that after every update/paginate request for our feed, we invalidate parts of our ListView. This releases many references and requires many new allocations. If the concurrent garbage collector hasn't made its rounds yet and we need more memory than is available, the system is forced to stop all running threads (including main!) and sweep up unreachable memory blocks. After investigation, we found two root causes of excessive garbage collection:

constantly using more heap memory than we need

invalidating our ListView too many times

We can use the Eclipse Memory Analyzer Tool to get a quick overview of a standard heap dump (obtained and converted using DDMS and the SDK hprof converter).

A heap dump can be an overwhelming amount of information, since it shows every memory allocation at a specific point in time, but the Eclipse Memory Analyzer Tool does a lot to organize it. It can show a Dominator Tree that provides a succinct overview of your application's heap object graph, listing root-level dominator objects by size based on each node's retained set. This helps quickly identify large consumers. The VM will not garbage collect an object while there is a reference to it from outside the heap; such objects are considered garbage collection roots. We can investigate why certain allocations exist by tracing them back to their GC roots.

Looking for the GC root of mVenmoContactsManager

We notice that our largest allocation, mVenmoContactsManager, is loaded by a system class loader. This might be the wrong scope for that object. We can explore further by listing all incoming references to our contact manager object:

Drilldown into the single incoming reference to our contact manager object

We find that it's being stored as a static variable in our Application class, meaning that mVenmoContactsManager is always held by our live app instance regardless of whether it's being actively used. This object retains many ArrayLists of many objects of 8- to 64-bit primitives, amounting to over 2MB of retained heap at all times. That's an inefficiency that can cause the system to invoke more of those nasty main-thread-blocking GC runs. It would be best to remove the static reference and keep the user's contacts in persistent local database storage.
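To illustrate the remedy (with hypothetical class and method names, not our actual code): a static field on the Application class is effectively reachable from a GC root for the life of the process, so one fix is to hand the object out lazily and allow it to be released, keeping the authoritative copy in the database.

import android.app.Application;

public class VenmoApplication extends Application {
    // Anti-pattern: a static reference pins ~2MB of contact lists forever.
    // private static ContactsManager sContactsManager = new ContactsManager();

    private ContactsManager mContactsManager;

    // Load on demand; the authoritative copy lives in the local database.
    public synchronized ContactsManager getContactsManager() {
        if (mContactsManager == null) {
            mContactsManager = ContactsManager.loadFromDatabase(this);
        }
        return mContactsManager;
    }

    // Once callers drop their references, the manager becomes collectible.
    public synchronized void releaseContactsManager() {
        mContactsManager = null;
    }

    // Hypothetical stand-in for the real manager; details elided.
    public static class ContactsManager {
        static ContactsManager loadFromDatabase(Application app) {
            return new ContactsManager();
        }
    }
}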

We also realized that duplicate requests were being made on each refresh of the feed, meaning that duplicate responses would unnecessarily invalidate our ListViews and trigger many more memory allocations than necessary. To remedy this, we added a layer of protection against re-processing requests by passing each request through a state machine and only executing those that had not already been filled during the current session. And behold! Garbage collection runs much less frequently, and the buttery new experience is liberated from enormous 100ms pauses.
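A stripped-down sketch of that deduplication layer, assuming each request can be identified by a string key (the class and its API are hypothetical):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RequestGate {
    private enum State { IN_FLIGHT, FILLED }

    private final Map<String, State> mRequests = new ConcurrentHashMap<String, State>();

    // Returns true only for the first caller with this key; duplicates are
    // dropped, so a repeated feed refresh can't invalidate the ListView again.
    public boolean tryStart(String requestKey) {
        return mRequests.putIfAbsent(requestKey, State.IN_FLIGHT) == null;
    }

    // Mark the request as filled for the rest of the session.
    public void markFilled(String requestKey) {
        mRequests.put(requestKey, State.FILLED);
    }
}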

Conclusion

Performance tuning is difficult because every app has its own specific problems. What's important is that there are benchmarks to aim for and many SDK tools to help you investigate and remove bottlenecks in your UI performance. The difference between our original, slightly choppy feed and our new buttery-smooth feed is phenomenally stark, and we got there with about a hundred lines of code modification; the bulk of the optimization process was spent justifying those modifications. Being able to leverage and act on the information given by wonderful diagnostic tools like the ones described here will do a lot to pull the Android ecosystem to a more performant state. This was our process, and we hope others will find it useful in their own applications.

Profile before optimizations

Profile after optimizations
