2016-09-20

Introduction

My personal mission is to teach the next generation of analytics workers how to achieve better data comprehension.

In this article, I discuss the topic of data comprehension in some detail, although my coverage of the topic is borderline philosophical. I try to delineate the various components of work that are needed to achieve improved data comprehension.

Preamble

Before you read this article, I offer the following advice.

The thoughts collected here took me a long time to develop. The material might not be perfectly clear the first time you read it, or maybe even the second time. Some of it is still not clear to me, even after revising it many times and using 3,000 words to explain myself! Depending upon your experience level, you might or might not appreciate the meaning and impact of what I am trying to say.

I’m not trying to insult anyone with this statement, but I have only arrived at these thoughts in the past 5 years, and it is very hard for me to articulate the insights I have discovered on this topic. Additionally, unless you have tried some of the things I have tried, you probably cannot fully appreciate some of the things I have learned. I now realize that I should not penalize someone for not knowing about things that I have tried, failed at, and sometimes succeeded in completing. It is only a matter of time before these lessons become clear to people as they work through their careers in the field of data analytics.

I spent the first 20 years of my career doing predictive analytics in scientific fields, followed by 10 years of business analytics of various types. It was only after a quarter-century of serious number crunching that these concepts began to materialize in my brain. This development was partly a function of my needing to gain the experience required to form the insights. The insights were also only possible after some new-generation software tools were developed. I hope that I can explain my insights and beliefs in this article so that I can help others see what I now understand and enjoy.



My friend Melinda gave this to me the other day as a gift. I really liked it because it reminds me of why I love to write about data analytics and why I love doing my job.

Background

Writing an article like this doesn’t just happen overnight. It takes a few decades of diligent, dedicated work experience and continuous learning to formulate the concepts included within. It also helps if your experience spans multiple types of quantitative disciplines.

I learned the fundamentals of data and how to work with data a long time ago as a student and then as a computational scientist. Working with data is now as natural as taking a breath of air. Eventually, if you work enough years and are exposed to a wide variety of problems, data simply becomes a part of your vocabulary and working with it becomes automatic.

I now believe we are moving into a time where data has now become a part of our “language of life.” To more completely understand what I mean by that statement, you need to wait until I finish another article that I have been writing for the past couple of months. The title of that piece will be a shortened version of this quote:



The article I have been writing for a couple of months on this topic originally popped into my head, completely written. Completing it has been a challenge, but I hope it will be worth the effort when it is done.

Since I believe that data is now an emerging form of language, we need to teach that language to our quantitative workers. If they can’t speak the language of data, they cannot achieve sufficient data comprehension. This matters a lot because we need people who can quickly achieve data comprehension on a consistent basis, and I want these workers to do it in a much shorter time than it took me.

To summarize, when you learn to speak “the language of life” through data, you no longer spend much time thinking about the characteristics of the data you are using. When you look at a data set for the first time, you automatically know what to do with it. You don’t have to think about the data types and what can be done with them, for example. Working with data becomes as natural as speaking to your friend or sending a text message. Your brain simply does what is necessary to move and transform the data to achieve the desired result(s).

Achieving Data Comprehension in Modern Analytics Projects

Data comprehension is the end game for any quantitative study. Comprehension is what happens after the data preparation is complete, the data has been visualized, and you understand how the final data set answers the questions and problems you were originally working to solve. Achieving data comprehension requires that you possess a varied set of skills, which you apply to reach your objectives.

In the old days, many quantitative studies were done with one primary data file that contained all the information necessary to solve the problem. As time passed and data proliferation began, many studies required the use of multiple data files. Data files also grew in size and complexity and exploded into a wide array of formats that were designed to accomplish different goals and objectives.

For these reasons, it became necessary for practitioners of modern-day analytics to be able to rapidly connect, comprehend, and conjoin data from different sources. Because so many different data sources are now in use, practitioners have to be nimble, they must be data agnostic, and they must have tools that allow them to reach out and grab whatever data is needed to answer the questions that have been posed to them.

Sometimes I long for the days when I was able to read and write pure binary files. Working with pure binary files was maximally efficient and fast, and it was not subject to proprietary restrictions. I wonder why companies had to make everything so much more complex by designing their own systems for storing the data, hiding the instructions for how to do so, and then making us use their tools just to read our own data! I could continue this line of reasoning into the realm of self-descriptive file formats, but I’ll save that for another day.
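To make this concrete, here is a minimal sketch, in Python with NumPy and an entirely hypothetical file name and layout, of how simple writing and reading a raw binary file can be when you control the format yourself:

    # A minimal sketch of raw binary I/O (hypothetical file name and layout).
    import numpy as np

    # Write 1,000 double-precision values straight to disk, with no metadata.
    values = np.linspace(0.0, 1.0, 1000)
    values.tofile("readings.bin")

    # Read them back; you must already know the dtype and shape, which is
    # the trade-off of a pure binary format.
    restored = np.fromfile("readings.bin", dtype=np.float64)
    assert np.allclose(values, restored)

The simplicity cuts both ways: nothing in the file describes itself, so the reader has to carry that knowledge, which is exactly the gap that self-descriptive formats were meant to fill.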

Since many jobs now require the use of multiple data sources, the complexity of the job is inherently higher than it was in the past. Because the data sources generally did not originate from data models that were harmoniously developed, there are typically a lot of things that have to be done to the sources to get them to communicate properly. This is necessary to uncover the system behaviors and quantitative results that are hidden within the combined data files.

Therefore, a lot of data work is required in the initial stages of many advanced analytics projects. Many people refer to this type of work as ETL (Extract, Transform, and Load) operations. Many people have jobs as ETL developers, and they spend a lot of time moving data between computer systems and/or files. A lot of these activities also include making modifications to the data as it is moved.
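To make the term concrete, here is a minimal ETL sketch, in Python with pandas and hypothetical file and column names: extract from one source, apply a minor transformation along the way, and load the result somewhere else.

    # A minimal ETL sketch (hypothetical file and column names).
    import pandas as pd

    # Extract: read the raw source file.
    orders = pd.read_csv("raw_orders.csv")

    # Transform: a minor change along the way, e.g. standardize a date column.
    orders["order_date"] = pd.to_datetime(orders["order_date"])

    # Load: write the cleaned data to its destination.
    orders.to_csv("clean_orders.csv", index=False)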

In contrast, I look at this segment of work as much more than just moving data. Taken by itself, ETL implies that data is being moved in and out of various locations, maybe with some minor changes along the way. However, when I complete this segment of work, what is actually being done is much more elaborate, fun, and challenging than simply moving data and making minor changes.

I see this work as much more complex, and as such, it requires a diverse skill set to complete properly. This skill set is what young workers in analytics need so that they can achieve data comprehension. The methods I have discovered and use on a daily basis for doing this type of work are what I want to teach to young workers.

The 80/20 Rule of Analytics

I have heard many times about the 80/20 rule in analytics. This rule suggests that about 80% of the work on a project happens in the beginning, during the data preparation phase. After the data preparation/building phase is complete, the remaining 20% of the time is used for data visualization and analysis.

Generally, I agree with this and have experienced it myself many times. In fact, I have never once had a job in which the data was perfectly presented to me at the beginning of a project. Again, I repeat, it has never happened once.

Therefore, if 80% of the time needed to do a modern-day analytics job is related to data preparation, then I believe it is necessary to teach our workers how to prepare the data! By properly preparing the data, we can achieve rapid and insightful data comprehension.

Components of Data Comprehension

I’m going to discuss the two parts of achieving data comprehension. First, I discuss the final 20% of the work, which includes data visualization. After that, I will cover the first 80% of the job, because this is where the most work is needed to train our youngest workers.

The Final 20% of the Work

Let’s start by discussing the final 20% of the work, even though it seems backwards to do so. Although this component represents only 20% of the time, this is where the complete picture of data comprehension occurs. Its value cannot be overstated, and it is achieved principally by visualizing the data.

It is imperative that the data visualization software used is capable of producing the types of charts and figures needed to achieve data comprehension. It doesn’t help anyone if the output produced by the software is not aligned with the data being visualized. The software must be easy to configure, easy to learn and use, and be capable of rapidly analyzing/visualizing the final data set created for your project.

Data visualization combines artistic and scientific skills to elucidate important aspects of the data. People who can effectively visualize data typically have spent a lot of time practicing their craft and have a passion for turning numbers into pictures. For people to become skilled in this endeavor, they must practice a lot with the software that they want to use to draw the pictures. You will know that you have practiced a lot when people begin accusing you of simply “playing with pictures” while you are doing your job.

In simple cases, data visualization by itself can lead to sufficient data comprehension to solve the problem at hand. In more complex cases, the data preparation steps have to be properly executed to create the final data set that can lead to data comprehension.
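As a simple illustration of that first case, here is a minimal sketch, in Python with pandas and matplotlib and a hypothetical monthly sales file, of the kind of quick chart that can sometimes deliver comprehension on its own:

    # A minimal visualization sketch (hypothetical file and column names).
    import pandas as pd
    import matplotlib.pyplot as plt

    sales = pd.read_csv("monthly_sales.csv", parse_dates=["month"])

    # One quick line chart is sometimes all it takes to see the trend.
    sales.plot(x="month", y="revenue", kind="line", title="Monthly Revenue")
    plt.tight_layout()
    plt.show()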

For the most advanced problems, data science techniques may have to be used, either during the data preparation phase or as an iterative component in the data visualization part of the job. Sometimes an iterative approach is used to cycle between performing model simulations and visualizing the results. Being able to understand the algorithms and quantitative approaches used to manipulate and transform the data is also a special skill to possess. An even more difficult skill to master is being able to communicate these topics, including the mathematical details, to non-quantitative people!

When you achieve great data comprehension, the figures/charts/dashboards that you create will be able to tell the story of the data set you interrogated. You should refine the output you create to succinctly explain the most important findings you have uncovered. I do believe that keeping things simple is a guiding principle that will help you deliver content that leads to better data comprehension for you and your audience.

The Initial 80% of the Work

In the types of projects I’m thinking and writing about in this article, multiple data sources are brought together to gain data comprehension. For many of these cases, knowledge of data models is not usually that important because I receive data sets that are already constructed. In most cases, I do not have to know about the decisions that were made to originally create the data sets.

I do need to know some information, however, such as whether or not the data has been pre-aggregated, or whether it is a row-based, full-level-of-detail data set.
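The distinction matters because it determines what questions the data can still answer. Here is a minimal sketch, in Python with pandas and hypothetical columns, of the difference between a row-level data set and a pre-aggregated one:

    # Row-level (full detail) versus pre-aggregated data (hypothetical example).
    import pandas as pd

    # Row-level: one row per individual transaction.
    detail = pd.DataFrame({
        "store": ["A", "A", "B", "B", "B"],
        "amount": [10.0, 12.5, 7.0, 3.5, 9.0],
    })

    # Pre-aggregated: one row per store, already summed.
    summary = detail.groupby("store", as_index=False)["amount"].sum()

    # From the detail you can still compute counts, averages, or medians;
    # from the summary alone, those questions can no longer be answered.
    print(detail["amount"].median())
    print(summary)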

If you are lucky, you will receive a good data dictionary along with the data. Understanding the definitions of each variable is a very important part of the process when you are striving to achieve data comprehension. What amazes me, however, is how often database developers can deliver a data model but cannot deliver the simplest of data dictionaries! This is like handing a kid an erector set and asking them to build you a skyscraper, without defining a skyscraper or giving them any instructions on how to assemble the parts.

Data blending and transformation is where the real magic happens. This is where the various pieces of data, in the form of distinct data sets, come together to form the final data set that will allow you to achieve data comprehension.

The bulk of this work involves connecting the various pieces of data together. Another part of the work involves applying logic and business rules and creating new variables that are needed to achieve comprehension. You might also have to operate on the data to perform conversions, aggregations, or non-quantitative transformations such as data reshaping operations. You might have to introduce a series of files along the way that contain information needed to achieve full comprehension. All of this work is what makes complete data comprehension possible.
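Here is a minimal sketch of what this stage can look like, in Python with pandas and hypothetical files and columns: joining two sources on a common key, applying a business rule to create a new variable, and then aggregating and reshaping the result.

    # A minimal data blending and transformation sketch (hypothetical data).
    import pandas as pd

    # Two distinct sources that must be conjoined.
    sales = pd.read_csv("sales.csv")          # store, month, revenue
    stores = pd.read_csv("store_master.csv")  # store, region

    # Connect the pieces on their common key.
    combined = sales.merge(stores, on="store", how="left")

    # Apply a business rule and create a new variable.
    combined["high_value"] = combined["revenue"] > 100_000

    # Aggregate and reshape: regions as rows, months as columns.
    final = combined.pivot_table(
        index="region", columns="month", values="revenue", aggfunc="sum"
    )
    print(final)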

What is really going on during this stage is a combination of ETL, computer programming, application of logic and rules, testing, debugging, and visualizing the road you are taking to go from point A to point Z. The tools you choose to do this can vary widely and some of them make it much easier than others to accomplish these tasks.

The process used in complex situations is to solve a series of challenges and to string those successes together, back to back. The incremental advancement of a project happens in starts and stops. If you try to achieve the Full Monty at the beginning, you will likely not be successful. You will crash and burn. You have to learn to be patient when you are trying to achieve data comprehension in complex situations.

When you think about it, everything we do involves planning and a transformation of energy. We wake up, take a shower, get dressed, drive to work, have lunch, work some more, drive home, etc. When we take a trip, we plan, pack the car, buy gas, do the driving, get to the destination, and do what we are going to do. Successfully working with data to achieve data comprehension requires this same approach.

We are conditioned to think ahead, to visualize what we are going to be doing that day, and then we execute the plan. If we didn’t do this, we would be walking around randomly, not knowing what we want to accomplish. If I didn’t do this type of planning, I wouldn’t be writing this article right now as I roll down Interstate 75 in a double-decker Megabus, traveling from Knoxville, TN to Atlanta, GA.



This picture is very deep on many levels. Just as I was flying down a superhighway in a bus while taking this picture, life is flying by me at warp speed. In this picture, the trees are a blur and I’m illuminated, but in real life, time is a blur and my thoughts are illuminated in my head. I feel like I need every moment I have to teach others to do the things I can now do so easily. That is my mission.

Achieving Improved Data Comprehension

To achieve improved data comprehension, we need to plan for, obtain, and then transform information (i.e., data). This work requires us to transform energy and information, and these transformations are needed in modern-day analytics because of the proliferation of data that we are now trying to work with. To uncover and comprehend the important insights and stories that lie dormant and hidden in the data, we must transform the information from its original state to the final state that gives us the insights we seek.

When we strive to achieve data comprehension, what we are really trying to do is solve problems. The problems are posed by people who realize we are not operating optimally. Companies hire us to work for them so that we can help them achieve better performance. People who have the courage to admit that we are not operating optimally are generally viewed as thought leaders.

Think of Bill Gates, Steve Jobs, Elon Musk, and other technical innovators who realized we needed better ways of doing things. These people disrupted the status quo to build better things that allow us to work more easily, with more capabilities, and with better comprehension. I have been very lucky to have lived through these transformational times, experiencing the huge changes that these innovators have delivered to us.

We have now moved through space and time to a point where we are able to quantify our degree of non-optimal performance. We are now empowered to quantitatively improve our performance by adjusting the ways we conduct our work and the tools we use to do the work needed to achieve better data comprehension.

In other words, we are now pretty well equipped to achieve data comprehension. I don’t think I could have made and believed this statement 10 years ago. The reasons for this are twofold: the software wasn’t available and/or mature enough, and my skills and experience were insufficient (even after 20 years of working with data). I now believe I am fully engaged in the battle of achieving better data comprehension, and I have the tools and armaments necessary to be successful in this endeavor.

Upcoming in Part 2

In my next article, I am going to explain how I regularly achieve improved data comprehension. I am going to talk specifically about the tools I use and how and why I choose to use them. If you have gotten this far in this article, thanks for sticking with me, and I promise to deliver some things you can use in your working life.

Thanks for reading.
