2014-04-17

While we often hear about privacy concerns with storing data in the cloud, such as on Dropbox, one thing we take for granted is data integrity: files stored in the cloud are not altered in any way unless the user actually modifies them online.  For example, if a user syncs a spreadsheet file with Google Docs, the file stored on Google Drive should be an exact byte-for-byte match with the original until the user modifies either the cloud file in Google Docs or the locally stored file in a spreadsheet application.  In fact, many consumers go as far as trusting the cloud as their only backup.

Microsoft OneDrive for Business (formerly SkyDrive Pro) is Microsoft’s workplace equivalent of OneDrive and comes bundled with most Office 365 subscriptions.  It is designed to give the business control over employees’ data stored within the synced folders.  However, unlike with the consumer version of OneDrive, we found out by accident that what gets synced to the cloud is generally not the same as what gets synced back from the cloud, even when no one has touched the files online or elsewhere.

When OneDrive got stuck in an endless loop of trying to sync a few files, and the issue returned after I cleared its cache as instructed on Microsoft’s discussion forum, I stopped syncing the OneDrive folder and backed it up.  I then deleted the original synced folder and got OneDrive to start syncing it again, so it would get a fresh copy from the cloud.  To check whether any files had been damaged by the earlier syncing issue, I used a utility called MD5summer to create MD5 hashes of the backup’s contents and repeated the process for the freshly synced folder.  To my surprise, the vast majority of the files showed ‘Checksum did not match’.  Surely most of my files had not become corrupt?
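For readers who would rather script this comparison than use MD5summer, the following is a minimal Python sketch of the same idea: hash every file in the backup and in the freshly synced folder, then list the files whose MD5 hashes differ.  The function names and the two folder arguments are hypothetical and chosen for illustration; this is not how MD5summer itself works internally.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_tree(root):
    """Map each file's path (relative to root) to its MD5 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): md5_of_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_trees(backup_root, synced_root):
    """Return (changed, missing, extra) relative paths, each sorted.

    changed: present in both folders but with different hashes
    missing: present only in the backup
    extra:   present only in the freshly synced folder
    """
    before = hash_tree(backup_root)
    after = hash_tree(synced_root)
    changed = sorted(n for n in before if n in after and before[n] != after[n])
    missing = sorted(n for n in before if n not in after)
    extra = sorted(n for n in after if n not in before)
    return changed, missing, extra
```

Calling compare_trees with the backup folder and the re-synced folder would return the equivalent of the ‘Checksum did not match’ list as its first value.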

I then started opening various files that failed the MD5 check, but could not find any obvious damage to any of them.  That was until I noticed several PHP files from a website theme that had also failed the MD5 check.  When I compared them side by side in Notepad++, I noticed straight away a few pieces of code injected into the header that clearly could not have been caused by any form of data corruption.  I knew for sure that neither I nor anyone else had made these changes, as the theme files came from a former website CMS package, so I set out to find what was modifying them.

To check whether OneDrive for Business was the culprit, I created a handful of mostly empty files of the types I frequently use and handwrote a simple PHP file and HTML file in Notepad++, so any modifications would clearly stand out.  I then used MD5summer to create MD5 hashes and placed these files in a folder for OneDrive for Business to sync.  A few hours later, I booted my laptop, which also has OneDrive for Business installed, and a moment later this folder appeared.  I then ran MD5summer and this is what I got:


The following highlighted in red is what OneDrive for Business injected into the HTML file:


While ‘uuid’ stands for universally unique identifier, the code “C2F41010-65B3-11d1-A29F-00AA00C14882” is the same in every PHP and HTML file it modified, including files belonging to other users.  Even though this modification does not make the file traceable, it is obviously going to be a nuisance for web developers who use OneDrive for Business to sync web files with each other, especially handwritten files where they don’t expect extra code to be added.
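Because the injected identifier is identical in every affected web file, a plain text search is enough to find files that have been through the sync.  The sketch below is one hypothetical way to do this in Python; the function name is illustrative, and including ‘.htm’ alongside the ‘.php’ and ‘.html’ extensions is an assumption, as we only tested PHP and HTML files.

```python
from pathlib import Path

# Identifier quoted in the article; it was the same in every modified web file.
MARKER = "C2F41010-65B3-11d1-A29F-00AA00C14882"


def files_with_marker(root, suffixes=(".php", ".html", ".htm")):
    """Return sorted relative paths of web files under root containing MARKER."""
    root = Path(root)
    hits = []
    for path in root.rglob("*"):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        # Decode leniently so files in unexpected encodings do not abort the scan.
        text = path.read_bytes().decode("utf-8", errors="replace")
        # Compare case-insensitively in case the hex digits differ in case.
        if MARKER.lower() in text.lower():
            hits.append(path.relative_to(root).as_posix())
    return sorted(hits)
```

Running files_with_marker over a website folder would list the PHP and HTML files that need cleaning before they are deployed or compared against their originals.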

As for Word, Excel and Publisher files (‘docx’, ‘xlsx’ and ‘pub’ file extensions), these grew by about 8KB.  Unlike the web files, these Microsoft Office files had what appears to be uniquely identifiable code added, potentially making it possible to match them to a company and possibly even to a specific user’s account.  To get an idea of what was added, I used 7-Zip to extract the content of the Word file before and after syncing.  Two ‘.rels’ files and one XML file had been modified, and three folders with files had been added: ‘customXml’ containing six XML files, a ‘_rels’ folder inside it containing three ‘.rels’ files, and a ‘[trash]’ folder containing a ‘0000.dat’ file.  In the ‘docProps’ folder, a file named ‘custom.xml’ contains a property with a ‘ContentTypeId’ name attribute holding a unique ID.
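The same before-and-after comparison can be scripted, since ‘docx’ and ‘xlsx’ files are ZIP packages.  The following is a rough sketch using Python’s standard zipfile module; the function names are hypothetical, and it covers only the ZIP-based formats, as zipfile is not expected to open ‘pub’ files even though 7-Zip can.

```python
import zipfile


def archive_members(path):
    """Map each member name in a ZIP-based file to its stored CRC-32."""
    with zipfile.ZipFile(path) as archive:
        return {info.filename: info.CRC for info in archive.infolist()}


def diff_archives(original_path, synced_path):
    """Return (added, removed, modified) member names, each sorted.

    A member counts as modified when it exists in both archives
    but its CRC-32 differs.
    """
    before = archive_members(original_path)
    after = archive_members(synced_path)
    added = sorted(set(after) - set(before))
    removed = sorted(set(before) - set(after))
    modified = sorted(
        name for name in before.keys() & after.keys()
        if before[name] != after[name]
    )
    return added, removed, modified
```

For the Word file described above, the added list would be expected to contain the ‘customXml’ and ‘[trash]’ entries, and the modified list the changed ‘.rels’ and XML files.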

When I used 7-Zip to look inside the two Microsoft Publisher files, the synced file had a ‘MsoDataStore’ folder added to it, containing three folders with gibberish names and two XML files inside each.  Inside I found the same ContentTypeId code as in the Word file, and while the two matched each other, the code was different from the one in files I compared from other users.

Even though OneDrive for Business modified these files, it left the ‘Date Modified’ attribute of every file unchanged, so to an unsuspecting user who just checks when the files were modified, they appear untouched.  For example, the Word file shows a modified time of ‘16:14:14’ for both the original and the synced file, even though the file sizes are clearly different.  The only files that remain untouched are those placed in the synced folder on the original computer, so even users who check the files they place in a synced folder would not know anything is being modified unless they physically took those files to another computer with the matching synced folder and compared them.
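This particular pattern, where the content changes but the modification time does not, can also be tested for directly.  The sketch below is a hypothetical helper that flags a pair of files whose timestamps match to the second while their contents differ; comparing at one-second resolution is an assumption made here for illustration.

```python
import hashlib
import os


def silently_changed(original_path, synced_path):
    """Return True if the contents differ while the modification times match."""
    original_stat = os.stat(original_path)
    synced_stat = os.stat(synced_path)
    # Compare timestamps at one-second resolution (an assumption of this sketch).
    same_mtime = int(original_stat.st_mtime) == int(synced_stat.st_mtime)
    with open(original_path, "rb") as first, open(synced_path, "rb") as second:
        same_content = (
            hashlib.md5(first.read()).digest()
            == hashlib.md5(second.read()).digest()
        )
    return same_mtime and not same_content
```

A result of True for a file pair means that checking ‘Date Modified’ alone would not have revealed the change.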

What this means is that people who use OneDrive for Business or SharePoint need to be very careful about what they sync with it, especially those handling third-party data, because of confidentiality issues.  For example, if an employee needs to transfer confidential files that absolutely must not be touched between their laptop and PC and decides to do so through a synced folder in OneDrive for Business, those files will end up being modified without the user’s knowledge.  This could have severe consequences if, say, a file is used as evidence in a court case.  How do you prove that the company did not intentionally modify it?

Based on Myce testing, we found that the consumer version of OneDrive (formerly SkyDrive) does not appear to modify any files, whether synced with the desktop product or through the web interface.  We also tested BitTorrent Sync and found that it does not modify any files either, even when testing a 1GB folder with a wide range of file types.

Got any questions on cloud syncing, backup or storage?  Please discuss them in our File Sharing forum.
