Planet.python.org

Continuum Analytics News: Taking the Wheel: How Open Source is Driving Data Science

2016-05-27


Posted Friday, May 27, 2016

Travis Oliphant

Chief Executive Officer & Co-Founder

The world is a big, exciting place—and thanks to cutting-edge technology, we now have amazing ways to explore its many facets. Today, self-driving cars, bullet trains and even private rocket ships allow humans to travel anywhere faster, more safely and more efficiently than ever before.

But technology's impact on our exploratory abilities isn't just limited to transportation: it's also revolutionizing how we navigate the Data Science landscape. More companies are moving toward Open Data Science and the open source technology that underlies it. As a result, we now have an amazing new fleet of vehicles for our data-related excursions.

We're no longer constrained to the single railroad track or state highway of a proprietary analytics product. We can use hundreds of freely available open source libraries for any need: web scraping, ingesting and cleaning data, visualization, predictive analytics, report generation, online integration and more. With these tools, any corner of the Data Science map—astrophysics, financial services, public policy, you name it—can be reached nimbly and efficiently.

But even in this climate of innovation, nobody can afford to completely abandon previous solutions, and traditional approaches remain viable. Fortunately, graceful interoperability is one of the hallmarks of Open Data Science. In appropriate scenarios, it accommodates the blending of legacy code or proprietary products with open source solutions. After all, sometimes taking the train is necessary and even preferable.

Regardless of which technology teams use, the open nature of Open Data Science allows you to travel across the data terrain in a way that is transparent and accessible for all participants.

Data Science in Overdrive

Let's take a look at six specific ways Open Data Science is propelling analytics for small and large teams.

1. Community. Open Data Science prioritizes inclusivity; community involvement is a big reason that open source software has boomed in recent years. Communities can test out new software faster and more thoroughly than any one vendor, accelerating innovation and remediation of any bugs.

Today, GitHub, the code hosting service, is home to more than 5 million open source projects and thousands of distinct communities. One such community is conda-forge, a group of developers who build infrastructure and packages for conda, a general, cross-platform and cross-language package manager with a large and growing number of data science packages available. Considering that Python is the most popular language in computer science classrooms at U.S. universities, open source communities will only continue to grow.

2. Innovation. The Open Data Science movement recognizes that no one software vendor has all the answers. Instead, it embraces the large—and growing—community of bright minds that are constantly working to build new solutions to age-old challenges.

Because of its adherence to free or low-cost technologies, non-restrictive licensing and shareable code, Open Data Science offers developers unparalleled flexibility to experiment and create innovative software.

One example of the innovation that is possible with Open Data Science is taxcalc, an Open Source Policy Modeling Center project publicly available via TaxBrain. Using open source software, the project brought developers from around the globe together to create a new kind of tax policy analysis. This software has the computational power to process the equivalent of more than 120 million tax returns, yet is easy to use and accessible to private citizens, policy professionals and journalists alike.

3. Inclusiveness. The Open Data Science movement unites dozens of different technologies and languages under a single umbrella. Data science is a team sport, and the Open Data Science movement recognizes that complex projects require a multitude of tools and approaches.

This is why Open Data Science brings together leading open source data science tools under a single roof. It welcomes languages ranging from Python and R to FORTRAN and it provides a common base for data scientists, business analysts and domain experts like economists or biologists.

What's more, it can integrate legacy code or enterprise projects with newly developed code, allowing teams to take the most expedient path to solve their challenges. For example, with the conda package management system, developers can create conda packages from legacy code, allowing integration into a custom analytics platform with newer open source code. In fact, libraries like SciPy already leverage highly optimized legacy FORTRAN code.

4. Visualizations. Visualization has come a long way in the last decade, but many visualization technologies have been focused on reporting and static dashboards. Open Data Science, however, has unveiled intelligent web apps that offer rich, browser-based interactive visualizations, such as those produced with Bokeh. Visualizations empower data scientists and business executives to explore their data, revealing subtle nuances and hidden patterns.

One visualization solution, Anaconda's Datashader library, is a Big Data visualizer that plays to the strengths of the human visual system. The Datashader library—alongside the Bokeh visualization library—offers a clever solution to the problem of plotting an enormous number of points in a relatively limited number of pixels.
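The idea of fitting far more points than pixels into a plot can be sketched in a few lines. What follows is not Datashader's API or its implementation; it is a plain-Python illustration that assumes each pixel simply stores a count of the points that land in it. The function name `aggregate_counts`, the grid size and the random sample data are all hypothetical choices made for the example.

```python
import random


def aggregate_counts(xs, ys, width, height, x_range, y_range):
    """Bin points into a width x height grid of per-pixel counts."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # skip points outside the plotted area
        # Map the data coordinate to a pixel index; clamp the upper edge
        # so a point exactly on the maximum falls in the last pixel.
        col = min(int((x - x_min) / (x_max - x_min) * width), width - 1)
        row = min(int((y - y_min) / (y_max - y_min) * height), height - 1)
        grid[row][col] += 1
    return grid


# Illustrative data: 100,000 random points reduced to a 40 x 20 grid.
random.seed(0)
n_points = 100_000
xs = [random.gauss(0.0, 1.0) for _ in range(n_points)]
ys = [random.gauss(0.0, 1.0) for _ in range(n_points)]
grid = aggregate_counts(
    xs, ys, width=40, height=20, x_range=(-4.0, 4.0), y_range=(-4.0, 4.0)
)
total_binned = sum(sum(row) for row in grid)
```

However many points go in, the result is always one number per pixel, so the cost of drawing depends on the size of the image rather than the size of the dataset. Mapping each count to a color is then a separate step.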

Another choice for data scientists is the D3 JavaScript library, which greatly expanded the number of visual tools available for data. With wrappers for Python and other languages, D3 has prompted a real renaissance in data visualization.

5. Deep Learning. One of the hottest branches of data science is deep learning, a sub-segment of machine learning based on algorithms that work to model data abstractions using a multitude of processing layers. Open source technology, such as that embraced by Open Data Science, is critical to its expansion and improvement.
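To make "a multitude of processing layers" concrete, here is a minimal sketch of data passing through three stacked layers. It uses plain Python rather than any of the frameworks discussed in this article, and the weights are arbitrary illustrative numbers, not a trained model. The function names `dense` and `forward` are hypothetical.

```python
import math


def dense(inputs, weights, biases):
    """One fully connected layer: weighted sums followed by tanh."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Feed the inputs through each layer in turn."""
    activations = inputs
    for weights, biases in layers:
        activations = dense(activations, weights, biases)
    return activations


# Arbitrary (untrained) weights, chosen only to show the shapes involved.
layers = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),  # 2 -> 3 units
    ([[0.2, -0.5, 0.3], [0.7, 0.1, -0.4]], [0.05, -0.05]),       # 3 -> 2 units
    ([[1.0, -1.0]], [0.0]),                                       # 2 -> 1 unit
]
output = forward([1.0, 2.0], layers)
```

Each layer transforms the previous layer's output, which is how deeper layers come to represent more abstract features of the input. Real frameworks add the training step that adjusts the weights, along with efficient array and GPU execution.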

Some of the new entrants to the field, all of which are now open source, are Google's TensorFlow project, Microsoft's Computational Network Toolkit (CNTK), Amazon's Deep Scalable Sparse Tensor Network Engine (DSSTNE), Facebook's Torch framework and Nervana's Neon. These products enter a field with many existing participants, such as Theano, whose Lasagne extension allows easy construction of deep learning models, and Berkeley's Caffe, an open deep learning framework.

These are only some of the most interesting frameworks. There are many others, a testament to the lively and burgeoning Open Data Science community and its commitment to innovation and idea sharing, which in turn enable further innovation.

6. Interoperability. Traditional, proprietary data science tools typically integrate well only with their own suite. They’re either closed to outside tools or provide inferior, slow methods of integration. Open Data Science, by contrast, rejects these restrictions, instead allowing diverse tools to cooperate and interact in ever more closely connected ways.

For example, Anaconda includes open source distributions of the Python and R languages, which interoperate very well, enabling data scientists to use the technologies that make sense for them. A business analyst might, for instance, start with Excel, then work with predictive models in R and later fire up Tableau for data visualizations. Interoperable tools speed analysis, eliminate the need for switching between multiple toolsets and improve collaboration.

It's clear that open source tools will lead the charge towards innovation in Data Science, and many of the top technology companies are moving in this direction. IBM, Microsoft, Google, Facebook, Amazon and others are all joining the Open Data Science revolution, making their technology available with APIs and open source code. This benefits technology companies and individual developers, as it empowers a motivated user base to improve code, create new software and use existing technologies in new contexts.

That's the power of open source software and inclusive Open Data Science platforms like Anaconda. Thankfully, today's user-friendly languages—like Python—make joining this new future easier than ever.

If you're considering open source for your next data project, now’s the time to grab the wheel. Join the Open Data Science movement and shift your analyses into overdrive.