Another of the GSoC project areas I have offered to mentor involves automatically and recursively building Java projects.
Why is this important?
Recently, I decided to start using travis-ci to automatically build some of my Java projects.
One of the projects, JMXetric, depends on another, gmetric4j. When travis-ci builds JMXetric, it needs to look in Maven repositories to find gmetric4j (and also remotetea / oncrpc.jar and junit). The alternative to fetching things from repositories is to stop using Maven and ship the binary JARs of dependencies in the JMXetric repository itself, which is not desirable for various reasons.
Therefore, I submitted gmetric4j to the Maven Central repository using the Sonatype Nexus service. One discovery I made in this process disturbed me: Sonatype allows me to sign and upload my binary JAR, but they don't build it from source themselves, so users of the JAR have no way to know that it really is built from the source I publish on GitHub.
In fact, as anybody who works with Java knows, there is no shortage of Java projects out there that have a mix of binary dependencies in their source tree, sometimes without any easy way to find the dependency sources. HermesJMS is one popular example that is crippled by the inclusion of some JARs that are binary-only.
No silver bullet, but there is hope
Although there are now tens of thousands of JAR libraries out there in repositories and projects that depend on them (and their transitive dependencies and build tools and such), there is some hope:
Many JARs provide a -sources JAR including source. This doesn't include all of the build artifacts of a true source package or source tarball; it just provides a subset of the source for use with a debugger.
Many Maven pom.xml files now include metadata about where the source is located (typically in the scm element)
With that in mind, I'm hopeful that a system could be developed to scrape some of these data sources to find some source code and properly build some subset of the thousands of JARs available in the Maven Central Repository.
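As a rough illustration of what scraping these two data sources could look like, here is a minimal Python sketch. It reads the scm element from a pom.xml and works out where a -sources JAR would sit in a standard Maven repository layout. The function names, the sample POM and the org.example coordinates are all hypothetical, and a real POM may inherit its scm details from a parent POM, which this sketch does not follow.

```python
import xml.etree.ElementTree as ET

# Hypothetical POM used only for illustration; the coordinates and URLs are made up.
SAMPLE_POM = """
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <groupId>org.example</groupId>
  <artifactId>demo-lib</artifactId>
  <version>1.0</version>
  <scm>
    <url>https://example.org/demo-lib</url>
    <connection>scm:git:https://example.org/demo-lib.git</connection>
  </scm>
</project>
"""


def source_hints(pom_xml):
    """Return the source-location hints found in the POM's <scm> element."""
    root = ET.fromstring(pom_xml)
    # Most POMs declare the Maven namespace, but some older ones do not,
    # so reuse whatever namespace prefix the root element carries.
    ns = root.tag[: root.tag.index("}") + 1] if root.tag.startswith("{") else ""
    hints = {}
    scm = root.find(ns + "scm")
    if scm is not None:
        for field in ("url", "connection", "developerConnection"):
            value = scm.findtext(ns + field)
            if value and value.strip():
                hints[field] = value.strip()
    return hints


def sources_jar_path(group_id, artifact_id, version):
    """Relative path of the -sources JAR in a standard Maven repository layout."""
    directory = "/".join([group_id.replace(".", "/"), artifact_id, version])
    return "{}/{}-{}-sources.jar".format(directory, artifact_id, version)


if __name__ == "__main__":
    print(source_hints(SAMPLE_POM))
    print(sources_jar_path("org.example", "demo-lib", "1.0"))
```

An empty result from source_hints is itself useful information: it marks an artifact whose source would have to be located by hand.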
But why bother if you can't completely succeed?
One recent post on maven-users suggested that because nobody would ever be able to build 100% of JARs from source, the project is already doomed to failure.
Personally, I feel it is quite the opposite: by failing to build 100% of JARs from source, the project will help to pinpoint those hierarchies of JARs that are not really free software. That increases pressure on their publishers to adhere to the standards that people reasonably expect for source distribution, or at least provides a red flag to help dependent projects stop using them.
On top of that, the confirmation of true free-software status for many other JARs will make it safer for people to rely on them, package them and distribute them in various ways.
Dumping a gazillion new Java packages into Debian
Just to clear up one point: being able to automatically build JARs from source (or a chain of dependencies involving hundreds of them) doesn't mean they will be automatically uploaded to the official Debian archive by a Debian Developer (DD).
Having this view of the source will make it easier for a DD to look at a set of JARs and decide if they are suitable for packaging, but there would still be some manual oversight involved. The tool would simply take out some of the tedious manual steps (where possible) and give the DD the ability to spot traps (JARs without real source hidden in the dependency hierarchy) much more quickly.
How would it work?
The project - whether completed under GSoC or by other means - would probably be broken down into a few discrete components. Furthermore, it would utilize some existing tools where possible. All of this makes it easier for a student to focus on and complete some subset of the work even if the whole thing is quite daunting.
Here are some of the ideas for the architecture of the solution and the different components to be used or developed:
The data set:
A database schema, tracking each binary artifact, the source repository location (e.g. Git or SVN URL), source tarball location, source JAR availability and dependency relationships (including versions)
A local Maven repository - only containing JARs that we have built locally from some source
A set of Git repositories to mirror the upstream repositories of projects that need to be tweaked.
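To make the database part of the data set more concrete, here is one possible shape for it, sketched with Python's built-in sqlite3 module. The table names, column names and helper functions are assumptions made for illustration, not a settled design, and the org.example coordinates and version numbers are placeholders.

```python
import sqlite3

# Hypothetical schema: one row per binary artifact (including its version),
# plus a table of dependency relationships between those rows.
SCHEMA = """
CREATE TABLE artifact (
    id                 INTEGER PRIMARY KEY,
    group_id           TEXT NOT NULL,
    artifact_id        TEXT NOT NULL,
    version            TEXT NOT NULL,
    scm_url            TEXT,     -- Git or SVN URL, if known
    source_tarball_url TEXT,     -- source tarball location, if known
    has_sources_jar    INTEGER NOT NULL DEFAULT 0,
    UNIQUE (group_id, artifact_id, version)
);
CREATE TABLE dependency (
    dependent_id  INTEGER NOT NULL REFERENCES artifact(id),
    dependency_id INTEGER NOT NULL REFERENCES artifact(id),
    PRIMARY KEY (dependent_id, dependency_id)
);
"""


def open_db(path=":memory:"):
    """Open (or create) the tracking database and install the schema."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_artifact(conn, group_id, artifact_id, version, scm_url=None,
                 source_tarball_url=None, has_sources_jar=False):
    """Insert one artifact and return its row id."""
    cur = conn.execute(
        "INSERT INTO artifact (group_id, artifact_id, version, scm_url,"
        " source_tarball_url, has_sources_jar) VALUES (?, ?, ?, ?, ?, ?)",
        (group_id, artifact_id, version, scm_url, source_tarball_url,
         int(has_sources_jar)),
    )
    return cur.lastrowid


def add_dependency(conn, dependent_id, dependency_id):
    """Record that one artifact depends on another."""
    conn.execute("INSERT INTO dependency VALUES (?, ?)",
                 (dependent_id, dependency_id))


def dependencies_of(conn, dependent_id):
    """List (group_id, artifact_id, version) for each direct dependency."""
    rows = conn.execute(
        "SELECT a.group_id, a.artifact_id, a.version"
        " FROM dependency AS d JOIN artifact AS a ON a.id = d.dependency_id"
        " WHERE d.dependent_id = ?"
        " ORDER BY a.group_id, a.artifact_id, a.version",
        (dependent_id,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = open_db()
    jmxetric = add_artifact(conn, "org.example", "jmxetric", "1.0")
    gmetric4j = add_artifact(conn, "org.example", "gmetric4j", "1.0",
                             scm_url="https://example.org/gmetric4j.git")
    add_dependency(conn, jmxetric, gmetric4j)
    print(dependencies_of(conn, jmxetric))
```

Keeping the version in the artifact row means two versions of the same library are tracked as separate artifacts, each with its own source location and dependency edges.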
Tool set:
A web interface or command line tool would be necessary for a user to kick-start the process by specifying some artifact they want to build
There would need to be a script that tries to work out all the possible ways to get the source for an artifact (e.g. by looking for a Git URL in the pom.xml from the Maven Central repository). This script would be able to do other things, like identifying the existence of -sources JARs which may or may not be sufficient to build the artifact.
A script would need to be created for testing the artifact's source tarball or repository for binary artifacts (e.g. copies of junit.jar). Whenever such things were found, the script would mirror the repository into our local Git repositories and create a branch with binaries removed. A record of the binaries would be added to the local database so we can symlink them from a trusted source when building.
A script would need to be created for testing whether the project includes a recognised build system (such as build.xml for Ant or pom.xml for Maven). For projects without such artifacts, the script would need to generate a template build.xml and store it in a local clone of the repository.
Jenkins would be used to build the JARs. A script would need to be created to build the Jenkins job config file for the artifact, pointing Jenkins to the upstream Git or the local Git repository depending upon the situation.
If the project is a Maven or Ivy project, then there are likely to be attempts to find dependencies during the build process. Running under Jenkins, these tools would be configured in such a way that they only look to the local repository and use dependencies that we have already built. If the build fails during dependency resolution, this is where the recursive process would kick off: the attempt to find each missing dependency would be logged to a queue, and the requests in this queue would each be handled by restarting the whole process again at the beginning. Each of these requests would also be logged to the database.
Sometimes, the system would be unable to proceed (e.g. because there are no clues about source locations in a given pom.xml). A user interface would need to be constructed to show a list of artifacts with exceptions and allow the user to manually locate the source and supply the URLs. The system would then continue iterating with this new data.
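The recursive part of this process can be sketched in a few lines of Python. This is only a toy model under stated assumptions: has_source and try_build are hypothetical stand-ins for the real source-location and Jenkins build steps, the retry limit is an arbitrary guard against circular dependencies, and the dependency data is invented around the artifact names mentioned earlier.

```python
from collections import deque


def build_all(target, has_source, try_build, max_attempts=10):
    """Build target and, recursively, whatever it turns out to depend on.

    has_source(artifact) -> bool: could we locate source for it?
    try_build(artifact, built) -> set of dependencies still missing
                                  (an empty set means the build succeeded).
    Returns (built, stuck): artifacts built in order, and those needing a human.
    """
    queue = deque([target])
    built, stuck, attempts = [], set(), {}
    while queue:
        artifact = queue.popleft()
        if artifact in built or artifact in stuck:
            continue
        attempts[artifact] = attempts.get(artifact, 0) + 1
        # No source, or too many retries (e.g. a dependency cycle): give up
        # and leave the artifact for manual attention.
        if attempts[artifact] > max_attempts or not has_source(artifact):
            stuck.add(artifact)
            continue
        missing = try_build(artifact, set(built))
        if not missing:
            built.append(artifact)
        elif missing & stuck:
            # A dependency cannot be built, so this artifact cannot be either.
            stuck.add(artifact)
        else:
            # Queue each missing dependency, then retry this artifact later.
            queue.extend(sorted(missing))
            queue.append(artifact)
    return built, stuck


# Invented dependency data, for illustration only.
DEPS = {
    "jmxetric": {"gmetric4j", "junit"},
    "gmetric4j": {"oncrpc"},
    "oncrpc": set(),
    "junit": set(),
}


def fake_build(artifact, built):
    """Pretend build: it fails on any dependency that is not built yet."""
    return DEPS[artifact] - built


if __name__ == "__main__":
    print(build_all("jmxetric", lambda a: True, fake_build))
    # Same run, but pretending no source can be found for oncrpc.
    print(build_all("jmxetric", lambda a: a != "oncrpc", fake_build))
```

The stuck set corresponds to the list of exceptions a user would work through in the interface described above. Once a source URL has been supplied for one of them, the same loop can simply be run again.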
Reporting: we already know that for some JARs, we will simply fail to make any progress and we are not going to lose any sleep over that. The important thing is to provide accurate reports to help people make decisions that may involve working around those JARs in future:
Reports on licensing: for what percentage of projects could we determine the license from the pom.xml? Can we spot any license mismatch in the dependency hierarchy?
Tools: which build tools in the chain of dependencies don't provide any source code? Are they optional tools (such as code quality analysis) that we can skip in the build process (e.g. by producing a mock version of the tool or plugin)?
Which non-free/sourceless JARs are most widely depended upon by other projects in the free Java ecosystem? Can we make a list of the top 10 or 20?
Abandonware: can we detect JARs that haven't been updated for an extended period of time, with no activity in the source repository? For these projects in particular, it is a really good idea to make backups of the source repositories (or mirrors of their web sites and source download directories) in case they disappear altogether.
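As an example of how one of these reports might be computed, the following Python sketch ranks sourceless JARs by the number of projects that depend on them directly. The input format (pairs of project and dependency, plus a lookup of which artifacts have source) and all of the names in the sample data are assumptions. A real report would read from the database and would probably count transitive dependents as well.

```python
from collections import Counter


def top_sourceless(dependencies, has_source, limit=10):
    """Rank dependencies lacking source by number of direct dependents.

    dependencies: iterable of (project, dependency) pairs
    has_source:   dict mapping artifact name -> True when source was found
    """
    counts = Counter()
    # set() makes sure a repeated pair is only counted once.
    for project, dependency in set(dependencies):
        if not has_source.get(dependency, False):
            counts[dependency] += 1
    # Most depended-upon first; ties are broken alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    edges = [
        ("app-a", "closed-lib"),
        ("app-b", "closed-lib"),
        ("app-a", "open-lib"),
        ("app-b", "other-closed"),
    ]
    for name, dependents in top_sourceless(edges, {"open-lib": True}):
        print(name, dependents)
```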