2013-10-17

Over the last few years as a Mozilla Release Engineer, I have completed hundreds of bugs and dozens of large, complex cross-team projects. Over time, I have learned from my mistakes and developed a mental list of steps that I follow to complete cross-team projects as effectively as possible and on good terms with the developers involved. I have never attempted to write this down, so I am excited to see what we can glean from this blog post. It may save you from having to build these habits through trial and error. Feel free to point out places where I have not figured everything out yet, or where I am wrong.

This post is geared mainly towards Release Engineering and people who work closely with us; however, it may also help you understand what a release engineer has to take into consideration before your jobs run visibly on tbpl.mozilla.org.

I would like to use a recent project as an example, namely the Android x86 emulator test infrastructure project. It has taken more than two months to get 80% of this project completed. We are currently blocked on issues external to Mozilla, namely the instability of the emulator itself.

You should expect the following sections in this post:

Tips

Checklist

Context of the Android x86 project

Sequence of events of the Android x86 project  

Tips

Document as much as possible on the bug

this makes it easier for external people to follow along

Communicate often with the people you're working with

Make it clear what blocks who

Make it clear what you're working on

Set expectations

Report back when you're not meeting expectations

Ask for help!

If you can pick one of many tasks, ask the dev which one would benefit them the most

Mention when you can’t work on the project for a period of time (e.g. PTO, buildduty, release)

Communicate major slow downs or change of plans to stakeholders on both sides of the project

Your own managers

The managers of the other developer

File bugs in order to clarify the scope, the dependency and the ownership

Make sure that the dependencies are logical (rather than just adding all bugs to the tracking bugs)

Meet the members of the other parties through a video call as soon as possible if you have not met them before, especially if the project is rather complex

It makes it so much easier to understand/know each other and work together

You have the opportunity to read non-verbal communication

You become a human being in the eyes of each other, rather than IRC nicknames

Not necessary if you have worked together often and/or the project is very simple

If there’s confusion and/or conflict, schedule another video meeting

Restate in your own words what you are taking away with you from the bugs and email conversations

This allows the other developer to debug your understanding

Keep your word

It builds trust

Do not try to force artificial deadlines

You ruin the trust you have gained

Consult the team when in doubt of which approach to use for a big problem

Do not ask for help if you have not even tried for yourself

This builds a reputation that you won’t try to take other people’s time for granted

If you get stuck; ask for help

Restate what you’re trying to solve and why it is important

Checklist

NOTE: These questions do not apply to every project we take on, but many of them apply when setting up a new platform.

have you read all bugs with regard to the project?

have you written down the questions that you need to ask the other team?

on which machines is this going to run?

do we have enough capacity?

who else should know about this project? have you got them in the loop?

IT? (more machines, method of deployment of artifacts)

A-team?

Sheriffs?

Your manager(s)?

Your own team?

have the developers verified that their scripts run as expected on one of our machines?

their local machine does not count

loan them a machine if needed

what happens when we run a job twice on the same machine?

what artifacts do we need to clobber?

test files and application files always get clobbered
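As a sketch of what such a clobber step might look like (the directory names here are hypothetical, not the actual paths our infrastructure uses):

```python
import shutil
from pathlib import Path

# Hypothetical per-job artifact directories on a test machine; the real
# paths depend on how the job unpacks its test and application files.
CLOBBER_DIRS = ["tests", "application"]

def clobber(workdir: str) -> list:
    """Remove per-job artifacts so a second run starts from a clean slate."""
    removed = []
    for name in CLOBBER_DIRS:
        path = Path(workdir) / name
        if path.exists():
            shutil.rmtree(path)
            removed.append(name)
    return removed
```

Anything left behind by a previous run that is not clobbered is a potential source of intermittent failures the second time around.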

has the developer *recently* run *all* types of jobs required for the project?

not just a handful; must be all and recently

can the developer set up the job to run multiple types in a row and always get the expected results?

this is a new question that I have not asked in the past; however, it might help spot instability issues early

for instance, we might have been able to catch the QEMU issue on the emulators earlier. Instead, we found it after two months, when we started running the emulator jobs at scale

which artifacts will we need to deploy?

e.g. the android sdk

e.g. the android emulator template definitions

how are you going to distribute the artifacts?

through puppet?

from tooltool?

from in-tree?

will we need to build it on the build machines and upload it to ftp?
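For the tooltool route, artifacts are pinned by size and content digest in a manifest. A minimal sketch of generating such a record (the field names follow tooltool's convention, but treat the helper itself as illustrative):

```python
import hashlib
import json
from pathlib import Path

def manifest_entry(path: str) -> dict:
    """Build a tooltool-style record that pins an artifact by content digest."""
    data = Path(path).read_bytes()
    return {
        "size": len(data),
        "digest": hashlib.sha512(data).hexdigest(),
        "algorithm": "sha512",
        "filename": Path(path).name,
    }

def write_manifest(paths, out="manifest.tt"):
    """A manifest is just a JSON list of these records."""
    Path(out).write_text(json.dumps([manifest_entry(p) for p in paths], indent=2))
```

Pinning by digest means the job fails loudly if the artifact changes out from under you, instead of silently running against a different binary.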

what level of privacy do those artifacts need?

public/behind LDAP/VPN only

how often will the artifacts need to be recreated or updated?

do we have documentation about it?

what are the expectations? deadlines?

where is the source code?

put the source on the bug, or link to the public repo

how long does each test suite take to run?

this is important to know in order to help capacity planning as well as planning how much we need to chunk the suites
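The chunking arithmetic itself is simple: divide a suite's total runtime by the longest acceptable job duration (the 60-minute cap below is a made-up number, not a real policy):

```python
import math

def chunks_needed(suite_minutes: int, max_job_minutes: int = 60) -> int:
    """How many chunks to split a suite into so each job fits in the window."""
    return max(1, math.ceil(suite_minutes / max_job_minutes))
```

For example, a 150-minute suite with a 60-minute cap needs 3 chunks; the per-chunk runtimes then feed directly into how many machines you need for the expected job load.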

have you *manually* reproduced the steps specified by the developer?

this will generally be your highest priority and will initially be on your critical path

unless you have figured *all* of these out with the developer, you may regret not doing so

how does the machine need to be set up? when was the last time that this set up was done?

request *recently* *verified* step-by-step instructions on a *clean* machine

have you verified the setup steps as instructed by the developer?

have you made the action items clear to each other?

have you found a new blocker? have you discussed it with the developer, and do they understand why it is a blocker?

are the blockers clearly filed or specified on the bug?

is there anything about this project that is particularly different from the way we run other projects?

e.g. running four emulators with four different test suites was clearly new

if so, notify those people that might be affected

I hope this is useful when tackling new Release Engineering projects.

Feel free to read the following case study or skip it completely as it is very long!

regards,

Armen

##############################

NOTE: The following two sections can feel *very* long/boring, as the bug ended up with more than 200 comments and I had to file many, many bugs.

Context of the Android x86 project

First of all, I would like to set the context of this project: I had just come back from three weeks of holidays and was not mentally prepared to take on an unexpected, large project. I had to catch up with my intern (whose internship was ending shortly), I was trying to cover for a co-worker who was away for four weeks, I had been looking forward to working on a different, more exciting project instead, and the number of unforeseen interruptions following my return was very, very high. This is important to keep in mind, as you will notice in this blog post that I made some drastic requests of my managers in order to meet expectations.

Sequence of events of the Android x86 project

NOTE: As I did an analysis of my work, I could see where I made mistakes or missed the opportunity to ask the right questions. Unfortunately, these oversights delayed the whole project further down the road.

Quote from gbrown after reading this post: “I thought it was really helpful meeting on Vidyo when we did, and I found your frequent in-bug status updates very useful and reassuring.”

gbrown (a-team developer) filed on 2013-07-17 the bug:

some preliminary scripts had been developed by that time

they were not yet attached to the bug
