Over the last few years as a Mozilla Release Engineer, I have resolved hundreds of bugs and completed dozens of large and complex cross-team projects. As time has gone by, I have learned from my mistakes and developed a mental list of steps that I follow in order to complete cross-team projects as effectively as possible and on good terms with the developers involved. I have never attempted to write this down, so I am excited to see what we can glean from this blog post. This may save you from having to build these habits through trial and error over time. Feel free to point out places where I have not figured everything out yet, or where I am wrong.
This post is probably geared towards Release Engineering and people who work closely with us. However, it may also be valuable for understanding what a release engineer has to take into consideration before your jobs run visibly on tbpl.mozilla.org.
I would like to use a recent project as an example, namely the Android x86 emulator test infrastructure project. It has taken more than two months to get 80% of this project completed. We are currently blocked on issues external to Mozilla: the emulator itself is unstable.
You should expect the following sections in this post:
Tips
Checklist
Context of the Android x86 project
Sequence of events of the Android x86 project
Tips
Document as much as possible on the bug
this makes it easier for external people to follow along
Communicate often with the people you're working with
Make it clear what blocks who
Make it clear what you're working on
Set expectations
Report back when you're not meeting expectations
Ask for help!
If you can pick one of many tasks, ask the dev which one would benefit him/her the most
Mention when you can’t work on the project for a period of time (e.g. PTO, buildduty, release)
Communicate major slowdowns or changes of plan to stakeholders on both sides of the project
Your own managers
The managers of the other developer
File bugs in order to clarify the scope, the dependency and the ownership
Make sure that the dependencies are logical (rather than just adding all bugs to the tracking bugs)
Meet the members of the other parties through a video call as soon as possible if you have not met them before, especially if the project is rather complex
It makes it so much easier to understand/know each other and work together
You would have the opportunity to read the non-verbal communication
You become a human being in the eyes of each other, rather than IRC nicknames
Not necessary if you have worked together often and/or the project is very simple
If there’s confusion and/or conflict, schedule another video meeting
Restate in your own words what you are taking away with you from the bugs and email conversations
This allows the other developer to debug your understanding
Keep your word
It builds trust
Do not try to force artificial deadlines
You ruin the trust you have gained
Consult the team when in doubt of which approach to use for a big problem
Do not ask for help if you have not even tried for yourself
This builds a reputation that you do not take other people’s time for granted
If you get stuck, ask for help
Restate what you’re trying to solve and why it is important
Checklist
NOTE: These questions do not apply to every project we take on; however, many of them apply when setting up a new platform.
have you read all bugs with regard to the project?
have you written down the questions that you need to ask the other team?
on which machines is this going to run?
do we have enough capacity?
who else should know about this project? have you got them in the loop?
IT? (more machines, method of deployment of artifacts)
A-team?
Sheriffs?
Your manager(s)?
Your own team?
has the developer verified that their scripts run as expected on one of our machines?
their local machine does not count
loan them a machine if needed
what happens when we run a job twice on the same machine?
what artifacts do we need to clobber?
test files and application files always get clobbered
has the developer *recently* run *all* types of jobs required for the project?
not just a handful; must be all and recently
can the developer set up the job to run multiple types in a row and always get the expected results?
this is a new question that I have not asked in the past; however, it might help to spot instability issues early
for instance, we might have been able to catch the QEMU issue on the emulators earlier. Instead, we found it after two months, when we started running the emulator jobs at scale
which artifacts will we need to deploy?
e.g. the android sdk
e.g. the android emulator template definitions
how are you going to distribute the artifacts?
through puppet?
from tooltool?
from in-tree?
will we need to build it on the build machines and upload it to ftp?
what privacy do those artifacts need?
public/behind LDAP/VPN only
how often will the artifacts need to be recreated or updated?
do we have documentation about it?
what are the expectations? deadlines?
where is the source code?
put it on the bug or a link to the public repo
how long does each test suite take to run?
this is important to know in order to help capacity planning as well as planning how much we need to chunk the suites
have you *manually* reproduced the steps specified by the developer?
this will generally be your highest priority and will initially be on your critical path
if you have not figured out *all* of these steps with the developer, you may regret it later
how does the machine need to be set up? when was the last time that this setup was done?
request *recently* *verified* step-by-step instructions on a *clean* machine
have you verified the setup steps as instructed by the developer?
have you made the action items clear to each other?
have you found a new blocker? have you discussed it with the developer and do they understand why it is a blocker?
are the blockers clearly filed or specified on the bug?
is there anything particularly different about this project compared to the way we run other projects?
e.g. running four emulators with four different test suites was clearly new
if so, notify those people that might be affected
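As a rough illustration of the "how long does each test suite take to run?" question, here is a minimal Python sketch of the arithmetic behind chunking and capacity planning. The suite durations, the 30-minute target and the 40 runs per day are made-up numbers, and the function names are mine; this is not how our infrastructure actually computes anything. It also ignores the fixed setup cost that every extra chunk adds.

```python
import math


def chunks_needed(suite_minutes, target_minutes):
    """Smallest number of equal chunks that keeps each chunk at or under the target."""
    return max(1, math.ceil(suite_minutes / target_minutes))


def daily_machine_hours(suite_minutes_by_name, runs_per_day):
    """Total machine time per day if every suite runs once per run (e.g. per push)."""
    total_minutes = sum(suite_minutes_by_name.values())
    return total_minutes * runs_per_day / 60


# Hypothetical measurements, in minutes, of a full unchunked run of each suite.
suites = {"mochitest": 120, "reftest": 90, "xpcshell": 25}
TARGET_MINUTES = 30  # assumed upper bound for a single job

plan = {name: chunks_needed(minutes, TARGET_MINUTES) for name, minutes in suites.items()}

for name, count in plan.items():
    print(f"{name}: {count} chunk(s)")
print(f"machine-hours per day at 40 runs: {daily_machine_hours(suites, 40):.1f}")
```

Comparing the machine-hours figure against the number of machines in the pool gives a first answer to "do we have enough capacity?" before any job is enabled.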
I hope this is useful when tackling new Release Engineering projects.
Feel free to read the following case study or skip it completely as it is very long!
regards,
Armen
##############################
NOTE: The following two sections can feel *very* long/boring, as the bug ended up having more than 200 comments and I had to file many many bugs.
Context of the Android x86 project
First of all, I would like to set the context of this project: I had just come back from three weeks of holidays, I was not mentally prepared to take on an unexpected and large project, I had to catch up with my intern (his internship was ending shortly), I was trying to cover for a co-worker who was away for four weeks, I had been looking forward to working on a different, more exciting project instead, and the number of unforeseen interruptions following my return was very high. This is important to keep in mind, as you will notice in this blog post that I made some drastic requests of my managers in order to meet expectations.
Sequence of events of the Android x86 project
NOTE: As I did an analysis of my work, I could see where I made mistakes or missed the opportunity to ask the right questions. Unfortunately, these oversights delayed the whole project further down the road.
Quote from gbrown after reading this post: “I thought it was really helpful meeting on Vidyo when we did, and I found your frequent in-bug status updates very useful and re-assuring.”
gbrown (A-team developer) filed the bug on 2013-07-17:
some preliminary scripts had been developed by that time
they were not yet attached to the bug
<span style="background-color: transparent; color: black; font-size: 15px; font-s