2015-03-27

In the third part of this series (Why SQL Part 3 - Simple optimization makes life simpler) I explained that by letting the database perform the optimisations at source, where the data is located, it is possible to share tuning improvements across a wide spectrum of front-end systems (BI reports, dashboards, applications etc.). The ability of the database to “own” the optimization and processing characteristics of how a SQL statement is executed is an important element in the success of SQL as a data analysis language. In the whitepaper that forms the basis for this series of blog posts this section is quite brief. This was done for reasons that will become obvious as you work through this blog post.

Now that we have looked at the powerful framework that supports SQL and the simplified optimization layer we can move on to look at how SQL has evolved over time to support ever more sophisticated features and functions and compare this with some of the newer open source frameworks.

Big Data Analysis and Open Source

All languages evolve over time. The open source community has a long history of making new functions and optimizations available on a regular basis; however, each function is created and managed in isolation and there are no guarantees that the code will work in every environment. It is not uncommon for developers to spend a considerable amount of time debugging open source functions to make them work with their specific data set. Many optimizations are not transparent but require complete code rewrites. For example, on big data platforms Spark is the new framework for running jobs on Hadoop. Spark is designed to run a wider range of workloads and does so much faster compared with MapReduce, but it uses a different language: while MapReduce is written in Java, Spark uses Scala. In fact, the whole Spark ecosystem uses different tools and utilities compared with MapReduce. There is an interesting article on the ADTMag blog that was published a few days ago (March 20). It’s called “Hadoop and Spark: Friends or Foes?”

Spark is optimised for multi-pass workflows that are based around data mining and machine learning workloads as well as interactive, real-time style data analysis. This is a small subset of the typical big data workloads and it is likely that most pure open-source projects will need to implement a variety of approaches to cover the full range of business requirements. This means using MapReduce, Spark, Hive, Spark SQL, Mahout, MLlib, GraphX, Giraph and many other languages and tools to deliver a complete information management architecture. The key complication with this approach is the enforced migration from one language/approach to the next “hot thing” as projects lose and gain attention within the open source community.

In the case of MapReduce and Spark, there is no automatic conversion process from MapReduce code to Spark code, so developers are forced to rewrite all their application code to take advantage of the improved performance offered by Spark. This creates all sorts of risks within a project and builds frustration within the business because while developers are rewriting existing code they are not working on new features. Continuous evolution is clearly part of open source community-led projects, but I would propose that what is missing is “intelligent” continuous evolution. In today’s fast changing business environment there is simply not the time, resources or budget to keep rewriting applications to take advantage of new optimisations, features and functions.

If we examine the analytical capabilities that most business users need to analyze their data sets, it is clear that a rich set of features is required. In the past I think it has been acceptable to use separate analytical engines to run certain types of analysis. Most big data projects have a tendency to treat different types of analysis as separate workloads. There is no joined-up analysis. Therefore, it is not uncommon to see a number of different data engines, languages and tools incorporated within a project to support the diverse analytical needs of the business users. This works really well until someone asks a question that requires cherry-picking analytical functions and features from across a range of engines and languages to solve a specific business problem.

Supporting Analytical Mashups

A good example of this is the analysis of spatial data sets. Today, more and more businesses are building and developing data sets that contain geospatial elements and just focusing on spatially aware analytical functions limits the value that can be extracted from these data sets. It is clear that both applications and business users are becoming more sophisticated. They both need the ability to run analytical mash-ups where different types of analysis are merged together to paint a very rich picture. A good example of this was shown at OpenWorld in 2013 where we demonstrated a fraud application that used both pattern matching and spatial analytics to flag, in real-time, potentially fraudulent transactions.



The blog post discussing this demo is here. So what do you do if you need a single framework that is regularly updated with new features and functions and a framework that allows you to build analytical mash-ups using a broad palette of functions? This might come as a surprise but that framework already exists and you are probably already using it without realising quite how far and wide its analytical capabilities stretch….Yup, it is good old SQL.

SQL - intelligent evolution

As an analytical language SQL has stayed relevant over the decades because, even though its core is grounded in universal data processing techniques, it can be extended with new processing techniques and new calculations. Oracle has a long history of embedding sophisticated SQL-based analytics within the Oracle Database. One of the most important aspects of SQL is that its feature set continues to expand while maintaining backward compatibility.

Over almost 40 years Oracle has invested heavily in optimizing the execution of SQL code within its database engine. The evolution of features such as parallel execution, Oracle Optimizer and the smart storage services offered by Exadata are automatically inherited and consumed by existing SQL statements precisely because developers do not have to specify how and where a query is executed. SQL code that was written in 1985 will still run today, even in Oracle Database 12c.
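
To make this concrete, here is the kind of statement a developer might have written back then. It is only a sketch using hypothetical ORDERS and CUSTOMERS tables, but a query of this shape still runs unchanged today and silently picks up every optimisation added since:

```sql
-- Hypothetical 1985-style query: simple joins, aggregation and sorting only.
-- Table and column names are illustrative.
SELECT c.region,
       SUM(o.order_value) AS total_sales
FROM   orders o,
       customers c
WHERE  o.customer_id = c.customer_id
GROUP  BY c.region
ORDER  BY total_sales DESC;
```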



An important aspect of the on-going success of SQL as a language is the speed of innovation. As the analysis of big data has evolved, the way data is being analysed has also evolved. Below is a historical database-timeline showing how the SQL language has evolved through each release of the Oracle Database:



As you can see from this timeline, window functions, which are now a key analytical feature in the analysis of big data, were first introduced in Oracle 8i and many developers use them to manage complex big data requirements. Oracle 10g introduced the SQL Model clause, which provides a spreadsheet-like, what-if modeling framework aimed at business users, allowing them to interact with both the rows and columns of their data set to create new records within a given set. Oracle Database 12c includes a number of new big data related analytical functions which share a common structure and syntax with existing SQL functions. The above image only shows the key functions that I am responsible for as a product manager and is obviously not a complete list of all the in-database analytical features in the Oracle Database.
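
To give a flavour of these features, here is a minimal sketch (against a hypothetical SALES table and a hypothetical SALES_SUMMARY table) showing a couple of window functions and a simple SQL Model rule:

```sql
-- Window functions: running totals and rankings without self-joins.
-- The SALES table and its columns are illustrative.
SELECT region,
       sale_date,
       amount,
       SUM(amount) OVER (PARTITION BY region ORDER BY sale_date)  AS running_total,
       RANK()      OVER (PARTITION BY region ORDER BY amount DESC) AS rank_in_region
FROM   sales;

-- SQL Model clause: spreadsheet-style rules over rows and columns.
-- Here a hypothetical SALES_SUMMARY table is extended with a projected year.
SELECT country, year, sales
FROM   sales_summary
MODEL
  PARTITION BY (country)
  DIMENSION BY (year)
  MEASURES    (sales)
  RULES (
    sales[2016] = sales[2015] * 1.05   -- create a projected row per country
  );
```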

SQL - Rich set of in-database analytics

Oracle’s strategy for delivering analytics offers you both depth and breadth of analysis. There is a wide range of analytical features supporting many different data types, allowing users to explore and layer their analysis. Given that we all use the Oracle Database every day as part of our jobs, it is incredibly easy to forget just how much analytical power has been added to the Oracle Database over time. Just in case you are wondering, I have put together a simple diagram below that provides an overview of the rich analytical ecosystem inside the Oracle Database:

Let’s quickly look at a few of the key analytical capabilities that are now part of Oracle’s in-database analytics:

SQL data mining with Oracle Advanced Analytics

Data mining is the practice of automatically searching large stores of data to discover patterns and trends that go beyond simple analysis. Data mining uses sophisticated mathematical algorithms to segment the data and evaluate the probability of future events. Oracle Advanced Analytics (OAA) provides powerful data mining functionality as native SQL functions within the Oracle Database. You can use industry standard SQL APIs to build next-generation applications that deliver real-time results and predictions. Oracle’s approach to in-database analytics means the data mining workflow runs inside the Oracle Database and the results also remain in the Oracle Database.

This means that costly and time-consuming data movement is eliminated, information latency is minimized and security is maintained. Additionally, Oracle data mining models can be included in SQL queries and embedded in business applications to offer improved business intelligence and deeper insight. You can use OAA to analyze customer demographics and buying patterns, evaluate the success of marketing campaigns, use predictive analytics to enrich customer profiles, find anomalies within a data set by searching for outliers and analyze warranty claims for fraudulent activity.
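
As a hedged illustration of how these data mining results surface directly in SQL, the sketch below assumes a classification model called CHURN_MODEL has already been built with the in-database data mining APIs and scores a hypothetical CUSTOMERS table:

```sql
-- Assumes a classification model CHURN_MODEL already exists in the database
-- (for example, built via the DBMS_DATA_MINING package) and a hypothetical
-- CUSTOMERS table whose columns match the model's inputs.
SELECT cust_id,
       PREDICTION(churn_model USING *)             AS predicted_churn,
       PREDICTION_PROBABILITY(churn_model USING *) AS churn_probability
FROM   customers
WHERE  PREDICTION_PROBABILITY(churn_model USING *) > 0.8
ORDER  BY churn_probability DESC;
```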

Analyzing text with Oracle Text Mining

Oracle’s data mining features support the analysis of unstructured data. It has been estimated that as much as 85% of enterprise data falls into this category. Extracting meaningful information from this unstructured textual data can be critical to the success of many big data and data warehouse projects.

Textual data must undergo a transformation process before it can be mined. Once the data has been properly transformed using the built-in data mining and related transformations, it can be used for building, testing, or scoring data mining models. Oracle’s text mining capabilities rely heavily on core in-database features that are part of OAA and this is another great example of the power of joined-up analytics where analytical features are layered one-on-top-of-the-other.

Text mining is increasingly being used within big data projects to help extract meaning from social media streams, perform e-discovery on emails and text documents and analyze insurance claims forms to search for possible patterns of fraudulent activities.
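
As a simple illustration, the sketch below assumes a hypothetical INSURANCE_CLAIMS table with an Oracle Text (CONTEXT) index on its CLAIM_TEXT column, so free-text claims can be searched and ranked directly from SQL:

```sql
-- Assumes INSURANCE_CLAIMS.CLAIM_TEXT has an Oracle Text CONTEXT index.
-- CONTAINS filters on the text query; SCORE(1) returns its relevance score.
SELECT claim_id,
       SCORE(1) AS relevance
FROM   insurance_claims
WHERE  CONTAINS(claim_text, 'water AND damage', 1) > 0
ORDER  BY SCORE(1) DESC;
```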

Geospatial analytics with Oracle Spatial

Location intelligence is now a must-have feature in big data and data warehouse projects. The spatial features in the Oracle Database include high-performance, advanced analysis for the complete range of geospatial requirements. The geospatial features support complex geographic information systems (GIS) applications and location-aware business intelligence applications. Spatial analytics can help you gain deeper insight into regional differences within your sales teams, evaluate sales catchment areas looking for overlaps, identify possible cannibalisation across areas, monitor the impact of new outlets and evaluate catchment areas for new stores.
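
A minimal sketch of this kind of location-aware query is shown below; the STORES and CUSTOMERS tables, their SDO_GEOMETRY columns and the 10 km radius are all illustrative assumptions:

```sql
-- Assumes STORES.LOCATION and CUSTOMERS.LOCATION are SDO_GEOMETRY columns
-- and that STORES.LOCATION has a spatial index.
SELECT s.store_id,
       s.store_name
FROM   stores    s,
       customers c
WHERE  c.cust_id = 101
AND    SDO_WITHIN_DISTANCE(
         s.location,
         c.location,
         'distance=10 unit=KM') = 'TRUE';
```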

Association analytics with Oracle Graph

With the growth of big data and the focus on unstructured data sources, I expect that you are looking to integrate semantic models and analysis into your business intelligence platforms. Semantic analysis has many use cases including governance-compliance-risk management and reporting, threat analysis, social network analysis, and recommendations.

Oracle provides native support for managing and processing semantic data based on open standards using XML, RDF, OWL, SPARQL and SQL. The integration of this storage model inside the database provides semantic developers with a scalable & secure platform to support a broad spectrum of semantic driven applications that can integrate both semantic and relational data sets to create a broader and more intuitive analytical experience for all your business users.

Typical use cases for semantic analysis include the following: threat analysis, asset tracking, bio-pathway analysis, interrogation of patient records, bio-surveillance, fraud and compliance management, social network and activity analysis and content management within games.
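
The sketch below gives a feel for how this looks in practice; it assumes a hypothetical RDF model called SOCIAL_NET and an example namespace, and uses the SEM_MATCH table function to evaluate a SPARQL-style graph pattern from within a SQL query:

```sql
-- Assumes an RDF model called SOCIAL_NET and the example namespace below.
-- SEM_MATCH evaluates the graph pattern and returns the matches as rows,
-- so the result can be joined, filtered and aggregated like any other table.
SELECT person, follower
FROM   TABLE(SEM_MATCH(
         '{ ?person :followedBy ?follower }',
         SEM_Models('social_net'),
         null,
         SEM_Aliases(SEM_Alias('', 'http://example.org/social/')),
         null));
```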

Multi-dimensional analytics

Multidimensional models organize data sets into easily understood business concepts such as dimensions and measures. A multidimensional view of a data set lets you slice, dice and pivot data that is subject to complex hierarchical relationships, complex aggregation rules and/or sophisticated calculations. Oracle OLAP is a world-class multidimensional analytic engine embedded within the Oracle Database. It provides centralized management of data and multidimensional business rules in a secure, scalable and enterprise-ready platform. It supports a rich set of sophisticated calculations and aggregation rules which are encapsulated within the definition of the cube. This means that you can issue simple SQL commands to access sophisticated calculations such as time-series calculations, financial models, forecasts, allocations, and regressions.

Multi-dimensional cubes can even help you model organisational changes within territories or across product ranges, model new products based on established products, forecast using multiple scenarios, support budgeting and planning based on complex allocation rules, create a unified definition of calculated measures (key measures such as KPIs, ratios, comparisons, shares and trends) and deploy aggregation rules that are too sophisticated for normal SQL-based operations.
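
As a rough sketch of what “simple SQL commands” means here, the query below assumes a cube whose measures (including cube-defined calculated measures) are exposed through auto-generated relational views called SALES_VIEW and TIME_VIEW; the exact view and column names depend entirely on how the cube is defined:

```sql
-- Assumes a SALES cube exposed through the relational view SALES_VIEW and a
-- TIME dimension exposed through TIME_VIEW; SALES_PRIOR_PERIOD and
-- SALES_PCT_CHG_PRIOR_PERIOD are calculated measures defined inside the cube.
SELECT t.long_description          AS period,
       f.sales,
       f.sales_prior_period,
       f.sales_pct_chg_prior_period
FROM   sales_view f
JOIN   time_view  t ON f.time = t.dim_key
WHERE  t.level_name = 'CALENDAR_QUARTER';
```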

Unstructured analytics

XML is an extremely popular way to persist and exchange business-critical information. Many industry segments have developed XML-based standards for representing information. These standards typically utilize XML Schema, a W3C standard for defining the structure and content of an XML file. XML-based standards can be found in healthcare, financial services, manufacturing, publishing, law enforcement and the public sector.

Oracle XML DB is a high-performance, native XML storage and retrieval technology that is delivered as a part of all versions of Oracle Database. Oracle XML DB allows an organization to manage its XML content using the same trusted technology that it uses to manage its relational data. Being inside the Oracle Database allows XML DB to leverage other core database features such as Oracle Text to provide flexible and rich search capabilities. Oracle XML DB provides tight integration between SQL and XML data sets, which allows developers and business users to run SQL queries across both relational and XML data to deliver deeper insight.
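
A minimal sketch of this SQL/XML integration is shown below; it assumes a hypothetical PURCHASE_ORDERS table with an XMLType column called PO_DOC and uses XMLTABLE to project repeating line items into relational rows:

```sql
-- Assumes a PURCHASE_ORDERS table with an XMLType column PO_DOC.
-- XMLTABLE shreds the repeating LineItem elements into rows that can be
-- joined to ordinary relational tables in the same query.
SELECT po.po_number,
       li.part_number,
       li.quantity
FROM   purchase_orders po,
       XMLTABLE('/PurchaseOrder/LineItems/LineItem'
                PASSING po.po_doc
                COLUMNS part_number VARCHAR2(30) PATH 'Part/@Id',
                        quantity    NUMBER       PATH 'Quantity') li;
```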

If that is not sufficient then we also allow you to build your own algorithms that can be incorporated alongside the existing in-database functions.

Building your own analytical logic

An analytical gateway

Table functions allow SQL developers to extend the analytical capabilities of the Oracle Database by writing custom functions that return a result set which can be treated just like a relational table. This means a table function can be referenced just like a database table by using it in the FROM clause of a query or as a column name in the SELECT list of a query. Alternatively the rows can be assigned to a PL/SQL collection variable to provide added flexibility when building certain types of analytical workflows, e.g. extract-load-transform workflows.

This approach helps to greatly simplify the application code layer because application developers can refer to a simple table, which encapsulates all the required analytical processing. This means that the developer does not need to know which features and functions have been implemented or how they have been implemented. For example, embedding the new SQL pattern-matching feature within a table function offers the following benefits to developers:

Use of pipelining to stream results in real-time

Removes the need to create intermediate staging tables between processes

Improves query response time within workflows by returning rows as they are produced

Reduces the memory overhead for workflows because object caches do not need to materialize entire collections

Parallel execution to improve performance

Delivers multi-threaded, concurrent execution

Support for a range of application programming languages: C++/C and Java

Streaming data using Table Functions

Table functions can return data using a pipeline, i.e. data is returned to the calling process while the table function is still processing. The returned data behaves like a stream that can be fetched on demand, which makes it possible to use a table function like a virtual table. The need to stream results is usually associated with operational applications where a response needs to be fed into a workflow or GUI in real time. Pipelining enables a table function to return rows faster and can reduce the memory required within the cache, making it ideal for supporting real-time requirements.
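
To show the overall shape of this, here is a minimal, illustrative sketch of a pipelined table function; the object type, the source SALES table and the filtering rule are all assumptions made purely for the example:

```sql
-- Schema-level types describing the rows the function streams back.
CREATE OR REPLACE TYPE sales_row_t AS OBJECT (
  region      VARCHAR2(30),
  total_sales NUMBER
);
/
CREATE OR REPLACE TYPE sales_tab_t AS TABLE OF sales_row_t;
/

-- A pipelined table function: each PIPE ROW call streams a row back to the
-- caller immediately, so the full result set is never materialised.
CREATE OR REPLACE FUNCTION regional_sales (p_min_total IN NUMBER)
  RETURN sales_tab_t PIPELINED
IS
BEGIN
  FOR r IN (SELECT region, SUM(amount) AS total_sales
            FROM   sales
            GROUP  BY region) LOOP
    IF r.total_sales >= p_min_total THEN
      PIPE ROW (sales_row_t(r.region, r.total_sales));
    END IF;
  END LOOP;
  RETURN;
END regional_sales;
/

-- The function is then queried exactly like a table:
SELECT * FROM TABLE(regional_sales(100000));
```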

SQL - Continuous evolution…

A great example of the speed at which SQL can evolve to support new analytical requirements is the growing need for approximate answers. What many organizations have now realized is that in some cases the level of precision they need from their analytics can be reduced – i.e. good enough is in fact the perfect answer. For example, for the question 'how many distinct visitors came to our website last month?' it may not be necessary to have a precise answer; an approximate answer that is, say, within 1% of the actual value would be sufficient. Oracle has already added this feature to the latest release of Database 12c by implementing HyperLogLog algorithms in SQL for ‘approximate count distinct’ operations (APPROX_COUNT_DISTINCT).
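
A quick, illustrative comparison against a hypothetical PAGE_VIEWS table shows how little the SQL changes:

```sql
-- Exact versus approximate distinct counts over a hypothetical PAGE_VIEWS table.
SELECT COUNT(DISTINCT visitor_id)        AS exact_visitors,
       APPROX_COUNT_DISTINCT(visitor_id) AS approx_visitors
FROM   page_views
WHERE  view_date >= DATE '2015-02-01'
AND    view_date <  DATE '2015-03-01';
```

Swapping the function name is the only change to the query, which is exactly the kind of intelligent, incremental evolution described above.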

As part of its on-going investment Oracle has released in-memory processing to further improve the performance of SQL queries. The management and lifecycle of the in-memory data structures and the use of different types of compression depending on the type of operations (OLTP-centric vs. Analytics-centric) are transparent to the queries. As with Exadata, all existing queries can automatically inherit the benefits of this new feature.

Let’s return to our code from 1985. Hopefully, we can all agree that this SELECT statement will still run, even in Database 12c. There is, of course, one major difference: it will run faster, with no changes to the code to use different features and without forcing the optimizer to choose a different plan through the use of hints. Same code, just much faster. This links back to the previous blog post on simplified transparent optimizations.

The release of Oracle’s Exadata platform delivered huge performance gains even when moving existing SQL code and running it unchanged. The storage level optimizations such as storage indexes, hybrid columnar compression and smart scans, which were incorporated into the Exadata Storage Server, are completely transparent to the SQL code. This has allowed every application, every SQL statement to run unchanged on this platform and reap the benefits of running on an engineered and optimised platform. Below is a comparison of the transparent in-database features that have been incorporated into the Oracle Database since 1985 allowing code from 1985 to run unchanged but significantly faster on today’s engineered systems. SQL allows developers to incrementally adopt new capabilities as they emerge without having to throw away existing code.

Summary

Apologies for the length of this post and congratulations if you made it to the summary. As I stated earlier, in the whitepaper I decided to keep the section on the evolution of SQL relatively short and assume that everyone knows about all the in-database analytical features that are included with Database 12c. As part of this post I decided to outline all the key analytical features supported by the Oracle Database just to provide you with some insight into the sorts of analytical mash-ups that are possible. Most importantly, what is possible is continually expanding as we add more analytical features to the database. SQL will continue to evolve but it will do so in an intelligent way.

In the next post I am going to explore the importance of the ANSI standard. It is because of its adherence to the ANSI standard that SQL has emerged as the most attractive language for big data discovery and analysis.
