Alvinashcraft.newsblur.com

Comparing Open XML Documents without Automating Word

2016-06-14

alvinashcraft
shared this story
from Microsoft MVP Award Program Blog.

Editor’s note: The following post was written by Office Development MVP Eric White as part of our Technical Tuesday series.

You know the drill. You write a long technical article or blog post for some publication or other. Before actually publishing, you must have this article reviewed by a number of technical experts. So you send it to all of these reviewers, and after some bothering and blithering, you finally get back eight or ten Word documents that have tracked revisions, and your job is to incorporate changes from each reviewer into the final document. You have two options – you can open one document at a time, incorporating changes from each reviewer into your final document. Or if you have two large monitors, you can open all eight reviewed documents on one screen, neatly arranging the windows so that you can see them all simultaneously, and then you incorporate changes from each reviewer into your final document. Either way, the process of incorporating revisions from multiple reviewers is a pain.

This workflow of reviewing is an integral part of most modern publishing systems. These days, many modern publishing systems are based on Open XML. After all, there is no content editor in the entire history of the computing industry that is as widely used as Word. This is for good reason. Word is the most polished, capable content editor that has ever existed, and everyone knows how to use Word, and Open XML has been the default file format for Word, since Office 2007. If you use files that have the DOCX extension, you are using Open XML, which has been an ISO standard since late 2008.

Yet most publishing systems do not address the needs of authors to incorporate revisions from multiple reviewers. They are left to their own devices.

Sometimes correctly incorporating revisions from reviewers isn’t something that is just nice-to-have. Sometimes it has legal and political implications. You might be part of an international organization that publishes papers that have a broad impact across many countries, and your reviewers may send revisions and feedback that, if not properly incorporated, will cause no end of trouble to you, your boss, and your organization.

Therefore, when we get documents back from reviewers, we must extremely carefully incorporate their revisions, making sure that revision tracking was not turned off, and that no unauthorized changes make it into the final document that we will publish.

Therefore, we need to carefully compare the reviewed documents to the original, making sure that each and every change is accounted for. Good revisions must be reflected in the published document, and incorrect revisions must not be.

We could use Word to compare these documents. Word has the capability, under the Review tab on the ribbon, to compare two documents and produce a new document that contains tracked revisions. However, in a publishing system (typically implemented on a server), automating Word to compare documents is not a good option. There are stability and licensing issues associated with automating Word on a server. See the KB article Considerations for server-side Automation of Office.

However, given that we can use the Open-Xml-Sdk and Open-Xml-PowerTools to crack open and query Open XML documents, we can compare two Open XML documents without invoking the use of Word. I recently implemented a new module (WmlComparer) for Open-Xml-PowerTools, which enables you to compare two WordprocessingML documents, producing a new DOCX that contains just the revisions in content. It contains functionality that enables you to iterate through all of the revisions in the compared document, seeing exactly what has changed. It also contains functionality that enables you to merge revisions from a number of reviewers, producing a new DOCX where all revisions from all reviewers are incorporated into a consolidated document, which then helps authors to incorporate all revisions into the final document. In this blog post, we’ll see how to use the API to do this. But before we can merge revisions from multiple reviewers into a document that consolidates them, we must accurately compare the contents of two DOCX files.

Comparing the Contents of Two DOCX Files

When we contemplate comparing the contents of two DOCX files, it obviously brings the question, what exactly is the content of a DOCX file?

Simply put, if you can select something in the document, then that is content of the document. This means all characters, words, and sentences in the main part of the document are content. Images are content. Smart art is content, as well as math formulas. One interesting thing to note is that you can select the marker for an end note or foot note, so the markers (and the contents of the end note and foot note) are content.

But styling, in and of itself, is not content. This makes sense. When comparing two documents to see their changes in content, we don’t care whether the reviewer either intentionally or inadvertently changed the styling of the document.

Bookmarks are not content. We can’t ‘select’ a bookmark in the document. And this also makes sense – we don’t care whether the reviewer changed, inserted, or deleted a bookmark.

There are other examples of artifacts that are (or are not) content, but in general, we can use this litmus test to determine what is the actual content of the document.

The Algorithm

When I first embarked on the project to compare two Open XML WordprocessingML files, and compare their contents, my first thought was: How hard can this be? After all, the Longest-Common-Subsequence (LCS) algorithm has been known for years. It is not a complex algorithm. If you want to write this algorithm to compare two text files, you can write the code in a day or two.

But as it turns out, the algorithm for comparing two DOCX files is significantly impacted because there are three types of content in a DOCX file:

Paragraphs and text in the main document part

Tables, which contain rows, which contain cells, which can contain paragraphs (and embedded tables)

Text boxes, which can contain paragraphs and tables

Because of the difference in nature between these three types of content, we must highly modify the LCS algorithm to account for this. For instance, if the user deletes a row, then we want to note that the row was deleted, and not just that the text in the cells was deleted.

Coming up with an algorithm to successfully compare two DOCX files has been one of the most fun programming projects that I have encountered in recent years. There is a good algorithm to compare two DOCX files, but it is not the LCS algorithm that you can find documented on Wikipedia. It is a highly modified version of that algorithm, that when run, produces a DOCX that contains the revisions that you would expect. If text or a paragraph is deleted, then we can properly see those revisions. If a row in a table is deleted, then that is properly recorded. The algorithm is a recursive, iterative algorithm that produces the desired revision tracking marks for text boxes and cells within tables.

The key to the algorithm is grouping. We group all words into paragraphs. We group the contents of a table together, and within the table, we group the contents of each row together. Within each row, we group the contents of each cell together. The LCS algorithm then operates on these groups. If we find a group that is changed, then we ‘flatten’ the group to its sub groups, and then re-run the algorithm on these sub-groups. In the case of text boxes and tables, we may need to iterate and flatten several times until we determine exactly what has changed in the content of the document.

The first thing the algorithm does is to convert the entire contents of the ‘before’ and ‘after’ document into a linear list of content ‘atoms’, in other words, the smallest piece of content in the document, which in most cases is a single character. As it converts the document to content atoms, it records the necessary information so that at the end of the comparison process, we can reassemble the entire document from these atoms. In the process of reassembling the document, we create the revision tracking markup. At the end of the process, we have a valid Open XML WordprocessingML document that contains the exact, precise differences between the documents.

One key to picking a document apart into its atomic units, and reassembling them back into a valid WordprocessingML document is to assign a unique ID to each and every element in the document, in the form of a GUID. This enables the algorithm to find out what run, paragraph, table, row, and cell each character is part of.

In the process of creating the comparison atoms and groups, the algorithm computes an SHA1 hash for every atom and group. This means that to compare an atom to another atom, or to compare a group to another group, the algorithm only needs to compare the SHA1 hash that has already been computed for that atom or group. This simplifies and speeds up the algorithm significantly.

Running the Code

If you want to see this code in action, you can get the most recent version of Open-Xml-PowerTools, and run the WmlComparer01 example. Its actual use could not be simpler. You can find out how to ‘git’ this code from GitHub at the Open XML Installation Center at EricWhite.com.

The entire sample code to use WmlComparer is:

As I said, the use of this module is very simple. Let’s look at the sample documents for the WmlComparer01 example.

The ‘before’ source document looks like this:

The ‘after’ source document:

And when we look at the document that WmlComparer produces:

We covered a lot of material to discuss the algorithm to produce the delta between two DOCX files, but in the end, it produces what we want. A DOCX that contains revisions between the two documents.

The code to get the list of revisions is also simple:

When we run the example, we see the following output:

Revision text: deleted

Consolidating Revisions from Multiple Reviewers

Now that we have the capability to accurately compare the contents of two Open XML documents, we can do a neat trick – we can consolidate documents from multiple reviewers into a single document.

We have an ‘original’ document that contains the following paragraph.

We can use the following code to consolidate revisions from multiple documents into a single document:

When we look at the consolidated document, it looks like this:

Document Formats are FUN

Open XML enables to do things to word-processing documents that were not possible before the advent of open document formats. Being about to compare documents precisely enables server-side scenarios that we could not and should not implement using Word automation. Integration of this code to consolidate revisions enables smoother workflows for authors using advanced publishing systems that are based on Microsoft Word.

For more information about the Open-Xml-Sdk, go to the GitHub repository for the open source version of the Open-Xml-Sdk.

You can find the Open-Xml-PowerTools on GitHub as well.

You can find lots of information about Open XML development on my personal blog web site at http://www.ericwhite.com/. In particular, go to the Open-Xml-PowerTools Developer Center on my blog for lots of screen-casts and other content about Open XML.

About the author

Eric is an independent consultant doing .NET and JavaScript development. He started working with Open XML in 2007, before Office 2007 was released, while working at Microsoft. His major project these days is the Open-Xml-PowerTools on GitHub, which contains example code and guidance for implementing a variety of common Open XML developer scenarios, including transformation of DOCX to HTML/CSS, transformation of HTML/CSS to DOCX, splitting and combining DOCX and PPTX files, high performance data-driven DOCX generation, among many other features. In addition, he is the author of the Open-Xml-Sdk-JavaScript, which enables you to query, create, and modify Open XML documents using JavaScript in the browser and node.js. He specializes in Open XML, Open XML SDK, JavaScript, Office client development, and SharePoint 2010 development, with experience with many other technologies.