2015-08-07

This is the first in a series of blog posts that will explore the capabilities of the newly released Oracle R Advanced Analytics for Hadoop 2.5.0 (ORAAH), part of Oracle Big Data Connectors, which includes two new algorithm implementations that can take advantage of an Apache Spark cluster for significant performance gains in Model Build and Scoring time. These algorithms are a redesigned version of the Multi-Layer Perceptron Neural Networks (orch.neural) and a brand new implementation of a Logistic Regression model (orch.glm2).

Through large-scale benchmarks we are going to see the improvements in performance that the new custom algorithms bring to enterprise Data Science when running on top of a Hadoop Cluster with an available Apache Spark infrastructure.

In this first part, we are going to compare only model build performance and feasibility of the new algorithms against the same algorithms running on Map-Reduce, and we are not going to be concerned with model quality or precision. Model scoring, quality and precision are going to be part of a future Blog.

The Documentation on the new Components can be found in the product itself (with help and sample code), and also on the ORAAH 2.5.0 Documentation Tab on OTN.

Hardware and Software used for testing

As a test bed, we are using an Oracle Big Data Appliance X3-2 cluster with 6 nodes.  Each node has two Intel® Xeon® X5675 6-core processors (3.07 GHz), for a total of 12 cores (24 threads), and 96 GB of RAM per node.

The BDA nodes run Oracle Enterprise Linux release 6.5 and Cloudera's Hadoop Distribution (CDH) 5.3.0, which includes Apache Spark release 1.2.0.

Each node also is running Oracle R Distribution release 3.1.1 and Oracle R Advanced Analytics for Hadoop release 2.5.0.

Dataset used for Testing

For the test we are going to use a classic Dataset that consists of Arrival and Departure information of all major Airports in the USA.  The data is available online in different formats.  The most used one contains 123 million records and has been used for many benchmarks, originally cited by the American Statistical Association for their Data Expo 2009.  We have augmented the data available in that file by downloading additional months of data from the official Bureau of Transportation Statistics website.  Our starting point is going to be this new dataset that contains 159 million records and has information up to September 2014.

For smaller tests, we created subsets of this dataset with 1, 10 and 100 million records.  We also created a 1 billion-record dataset by appending the 159 million-record data to itself repeatedly until we reached 1 billion records.

Connecting to a Spark Cluster

In release 2.5.0 we are introducing a new set of R commands that will allow the Data Scientist to request the creation of a Spark Context, in either YARN or Standalone modes.

For this release, the Spark Context is exclusively used for accelerating the creation of Levels and Factor variables, the Model Matrix, the final solution to the Logistic and Neural Networks models themselves, and Scoring (in the case of GLM).

The new commands are highlighted below:
spark.connect(master, name = NULL, memory = NULL,  dfs.namenode = NULL)

spark.connect() requires loading the ORCH library first to read the configuration of the Hadoop Cluster.

The “master” variable can be specified as either “yarn-client” to use YARN for Resource Allocation, or the direct reference to a Spark Master service and port, in which case it will use Spark in Standalone Mode.
The “name” variable is optional, and it helps centralize logging of the Session on the Spark Master.  By default, the Application name showing on the Spark Master is “ORCH”.
The “memory” field indicates the amount of memory per Spark Worker to dedicate to this Spark Context.

Finally, “dfs.namenode” points to the Apache HDFS NameNode server, so that the Spark Context can exchange information with HDFS.

In summary, to establish a Spark connection, one could do:

> spark.connect("yarn-client", memory="2g", dfs.namenode=”my.namenode.server.com")

Conversely, to disconnect the Session after the work is done, you can use spark.disconnect() without options.

The command spark.connected() checks the status of the current Session and contains the information of the connection to the Server. It is automatically called by the new algorithms to check for a valid connection.
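
As an illustration, a minimal session life cycle could look like the sketch below (the server names are placeholders and the memory setting is only an example):

> # Connect to YARN, giving the Application a name for the Spark Master logs
> spark.connect("yarn-client", name="ORAAH-tests", memory="4g",
+               dfs.namenode="my.namenode.server.com")
> # Verify the details of the current Spark connection
> spark.connected()
> # ... run the Spark-enabled ORAAH algorithms here ...
> # Close the Spark Context when the work is done
> spark.disconnect()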

ORAAH 2.5.0 introduces support for loading data to Spark cache from an HDFS file via the function hdfs.toRDD(). ORAAH dfs.id objects were also extended to support both data residing in HDFS and in Spark memory, and allow the user to cache the HDFS data to an RDD object for use with the new algorithms.
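
For example, a dataset already attached from HDFS could be cached into Spark memory along the lines of the sketch below (the path is a placeholder, and the exact call signature should be checked against the ORAAH 2.5.0 documentation):

> # Attach an HDFS CSV dataset and cache it into Spark memory as an RDD
> ont_dfs <- hdfs.attach("/user/oracle/ontime_1bi")
> ont_rdd <- hdfs.toRDD(ont_dfs)
> # The resulting dfs.id object can then be passed to orch.glm2() or orch.neural()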

For all the examples used in this Blog, we started the R Session and loaded ORCH as follows:

Oracle Distribution of R version 3.1.1 (--) -- "Sock it to Me"
Copyright (C) The R Foundation for Statistical Computing
Platform: x86_64-unknown-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

You are using Oracle's distribution of R. Please contact
Oracle Support for any problems you encounter with this
distribution.

[Workspace loaded from ~/.RData]

> library(ORCH)
Loading required package: OREbase

Attaching package: ‘OREbase’

The following objects are masked from ‘package:base’:

    cbind, data.frame, eval, interaction, order, paste, pmax, pmin, rbind, table

Loading required package: OREstats
Loading required package: MASS
Loading required package: ORCHcore
Loading required package: rJava
Oracle R Connector for Hadoop 2.5.0 (rev. 307)
Info: using native C base64 encoding implementation
Info: Hadoop distribution is Cloudera's CDH v5.3.0
Info: using auto-detected ORCH HAL v4.2
Info: HDFS workdir is set to "/user/oracle"
Warning: mapReduce checks are skipped due to "ORCH_MAPRED_CHECK"=FALSE
Warning: HDFS checks are skipped due to "ORCH_HDFS_CHECK"=FALSE
Info: Hadoop 2.5.0-cdh5.3.0 is up
Info: Sqoop 1.4.5-cdh5.3.0 is up
Info: OLH 3.3.0 is up
Info: loaded ORCH core Java library "orch-core-2.5.0-mr2.jar"
Loading required package: ORCHstats
>
> # Spark Context Creation
> spark.connect(master="spark://my.spark.server:7077", memory="24G", dfs.namenode="my.dfs.namenode.server")
>

In this case, we are requesting the usage of 24 GB of RAM per node in Standalone Mode. Since our BDA has 6 nodes, the total RAM assigned to our Spark Context is 144 GB, which can be verified in the Spark Master screen shown below.


GLM – Logistic Regression

In this release, because of a totally overhauled computation engine, we created a new function called orch.glm2() that runs the Logistic Regression model exclusively on Apache Spark.  The input data expected by the algorithm is an ORAAH dfs.id object, which can be an HDFS CSV dataset, a Hive table made compatible by using the hdfs.fromHive() command, or an HDFS CSV dataset that has been cached into Apache Spark as an RDD object using the command hdfs.toRDD().

A simple example of the new algorithm running on the ONTIME dataset with 1 billion records is shown below. The objective of the Test Model is the prediction of Cancelled Flights. The new model requires factor variables to be indicated with F() in the formula, and the default (and only family available in this release) is binomial().

The R code and the output below assume that the connection to the Spark Cluster has already been established.

> # Attaches the HDFS file for use within R

> ont1bi <- hdfs.attach("/user/oracle/ontime_1bi")

> # Checks the size of the Dataset

> hdfs.dim(ont1bi)

[1] 1000000000         30

> # Testing the GLM Logistic Regression Model on Spark
> # Formula definition: Cancelled flights (0 or 1) based on other attributes

> form_oraah_glm2 <- CANCELLED ~ DISTANCE + ORIGIN + DEST + F(YEAR) + F(MONTH) +
+   F(DAYOFMONTH) + F(DAYOFWEEK)

> # ORAAH GLM2 Computation from HDFS data (computing factor levels on its own)

> system.time(m_spark_glm <- orch.glm2(formula=form_oraah_glm2, ont1bi))
ORCH GLM: processed 6 factor variables, 25.806 sec
ORCH GLM: created model matrix, 100128 partitions, 32.871 sec
ORCH GLM: iter  1,  deviance   1.38433414089348300E+09,  elapsed time 9.582 sec
ORCH GLM: iter  2,  deviance   3.39315388583931150E+08,  elapsed time 9.213 sec
ORCH GLM: iter  3,  deviance   2.06855738812683250E+08,  elapsed time 9.218 sec
ORCH GLM: iter  4,  deviance   1.75868100359263200E+08,  elapsed time 9.104 sec
ORCH GLM: iter  5,  deviance   1.70023181759611580E+08,  elapsed time 9.132 sec
ORCH GLM: iter  6,  deviance   1.69476890425481350E+08,  elapsed time 9.124 sec
ORCH GLM: iter  7,  deviance   1.69467586045954760E+08,  elapsed time 9.077 sec
ORCH GLM: iter  8,  deviance   1.69467574351380850E+08,  elapsed time 9.164 sec
user  system elapsed
84.107   5.606 143.591

> # Shows the general features of the GLM Model
> summary(m_spark_glm)
Length Class  Mode
coefficients   846    -none- numeric
deviance         1    -none- numeric
solutionStatus   1    -none- character
nIterations      1    -none- numeric
formula          1    -none- character
factorLevels     6    -none- list

A sample benchmark against the same model running on Map-Reduce is illustrated below.  The Map-Reduce models used the call orch.glm(formula, dfs.id, family=binomial()), and used as.factor() in the formula.
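
For reference, the Map-Reduce run used a call along these lines (a sketch based on the description above, reusing the ont1bi object attached earlier):

> # Map-Reduce GLM Logistic Regression, with as.factor() instead of F()
> form_oraah_glm <- CANCELLED ~ DISTANCE + ORIGIN + DEST + as.factor(YEAR) +
+   as.factor(MONTH) + as.factor(DAYOFMONTH) + as.factor(DAYOFWEEK)
> m_mr_glm <- orch.glm(form_oraah_glm, ont1bi, family=binomial())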



We can see that the Spark-based GLM2 delivers a large performance advantage over the same model executing in Map-Reduce.

Later in this Blog we are going to see the performance of the Spark-based GLM Logistic Regression on 1 billion records.

Linear Model with Neural Networks

For the MLP Neural Networks model, the same algorithm was adapted to execute using Spark caching.  The exact same code and function call will recognize whether there is a connection to a Spark Context and, if so, will execute the computations using it.

In this case, the code for both the Map-Reduce and the Spark-based executions is exactly the same, with the exception of the spark.connect() call that is required for the Spark-based version to kick in.

The objective of the Test Model is the prediction of Arrival Delays of Flights in minutes, so the model class is a Regression Model. The R code used to run the benchmarks and the output is below, and it assumes that the connection to the Spark Cluster is already done.

> # Attaches the HDFS file for use within R
> ont1bi <- hdfs.attach("/user/oracle/ontime_1bi")

> # Checks the size of the Dataset
> hdfs.dim(ont1bi)
[1] 1000000000         30

> # Testing Neural Model on Spark
> # Formula definition: Arrival Delay based on other attributes

> form_oraah_neu <- ARRDELAY ~ DISTANCE + ORIGIN + DEST + as.factor(MONTH) +
+   as.factor(YEAR) + as.factor(DAYOFMONTH) + as.factor(DAYOFWEEK)

> # Compute Factor Levels from HDFS data
> system.time(xlev <- orch.getXlevels(form_oraah_neu, dfs.dat = ont1bi))
user  system elapsed
17.717   1.348  50.495

> # Compute and Cache the Model Matrix from HDFS data, passing factor levels
> system.time(Mod_Mat <- orch.prepare.model.matrix(form_oraah_neu, dfs.dat = ont1bi,xlev=xlev))
user  system elapsed
17.933   1.524  95.624

> # Compute Neural Model from RDD cached Model Matrix
> system.time(mod_neu <- orch.neural(formula=form_oraah_neu, dfs.dat=Mod_Mat, xlev=xlev, trace=T))
Unconstrained Nonlinear Optimization
L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno)
Iter           Objective Value   Grad Norm        Step   Evals
1   5.08900838381104858E+11   2.988E+12   4.186E-16       2
2   5.08899723803646790E+11   2.987E+12   1.000E-04       3
3   5.08788839748061768E+11   2.958E+12   1.000E-02       3
4   5.07751213455999573E+11   2.662E+12   1.000E-01       4
5   5.05395855303159180E+11   1.820E+12   3.162E-01       1
6   5.03327619811536194E+11   2.517E+09   1.000E+00       1
7   5.03327608118144775E+11   2.517E+09   1.000E+00       6
8   4.98952182330299011E+11   1.270E+12   1.000E+00       1
9   4.95737805642779968E+11   1.504E+12   1.000E+00       1
10   4.93293224063758362E+11   8.360E+11   1.000E+00       1
11   4.92873433106044373E+11   1.989E+11   1.000E+00       1
12   4.92843500119498352E+11   9.659E+09   1.000E+00       1
13   4.92843044802041565E+11   6.888E+08   1.000E+00       1
Solution status             Optimal (objMinProgress)
Number of L-BFGS iterations 13
Number of objective evals   27
Objective value             4.92843e+11
Gradient norm               6.88777e+08
user  system elapsed
43.635   4.186  63.319

> # Checks the general information of the Neural Network Model
> mod_neu
Number of input units      845
Number of output units     1
Number of hidden layers    0
Objective value            4.928430E+11
Solution status            Optimal (objMinProgress)
Output layer               number of neurons 1, activation 'linear'
Optimization solver        L-BFGS
Scale Hessian inverse      1
Number of L-BFGS updates   20
> mod_neu$nObjEvaluations
[1] 27
> mod_neu$nWeights
[1] 846

A sample benchmark against the same models running on Map-Reduce is illustrated below.  The Map-Reduce models used the exact same orch.neural() calls as the Spark-based ones, with only the Spark connection as the difference.


We can clearly see that the larger the dataset, the larger the difference in speeds of the Spark-based computation compared to the Map-Reduce ones, reducing the times from many hours to a few minutes.

This new level of performance makes it possible to run much larger problems and test several models on 1 billion records, something that previously took half a day just to run one model.

Logistic and Deep Neural Networks with 1 billion records

To prove that it is now feasible not only to run Logistic and Linear Model Neural Networks on large scale datasets, but also complex Multi-Layer Neural Network Models, we decided to test the same 1 billion record dataset against several different architectures.

These tests were done to check for performance and feasibility of these types of models, and not for comparison of precision or quality, which will be part of a different Blog.

The default activation function for all Multi-Layer Neural Network models was used, which is the bipolar sigmoid function, and the default output layer activation was also used, which is the linear function.

As a reminder, the number of weights we need to compute for a Neural Network is as follows:

The generic formulation for the number of weights to be computed is then:

Total Number of Weights = SUM over all Layers, from the First Hidden Layer to the Output Layer, of [(Number of Inputs into the Layer + 1) * Number of Neurons in the Layer]

In the simple example, we had [(3 inputs + 1 bias) * 2 neurons] + [(2 neurons + 1 bias) * 1 output ] = 8 + 3 = 11 weights

In our tests for the Simple Neural Network model (Linear Model), using the same formula, we can see that we were computing 846 weights, because it is using 845 inputs plus the Bias.

Thus, to calculate the number of weights necessary for the Deep Multi-layer Neural Networks that we are about to Test below, we have the following:

MLP 3 Layers (50,25,10) => [(845+1)*50]+[(50+1)*25]+[(25+1)*10]+[(10+1)*1] = 43,846 weights

MLP 4 Layers (100,50,25,10) => [(845+1)*100]+[(100+1)*50]+[(50+1)*25]+[(25+1)*10]+[(10+1)*1] = 91,196 weights

MLP 5 Layers (200,100,50,25,10) => [(845+1)*200]+[(200+1)*100]+[(100+1)*50]+[(50+1)*25]+ [(25+1)*10]+[(10+1)*1] = 195,896 weights
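
The same arithmetic can be scripted in R. The small helper below is not part of ORAAH; it is just a sketch that applies the formula above to an arbitrary vector of hidden layer sizes, assuming one bias term per layer and a single output neuron:

# Total number of weights for a fully-connected network:
# sum over layers of (inputs into the layer + 1 bias) * neurons in the layer
count_weights <- function(n_inputs, hiddenSizes = c(), n_outputs = 1) {
  layer_sizes <- c(n_inputs, hiddenSizes, n_outputs)
  sum((head(layer_sizes, -1) + 1) * tail(layer_sizes, -1))
}

count_weights(845, c(50, 25, 10))            # 43846
count_weights(845, c(100, 50, 25, 10))       # 91196
count_weights(845, c(200, 100, 50, 25, 10))  # 195896

With no hidden layers, the same helper returns the 846 weights of the Linear Model above.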

The times required to compute the GLM Logistic Regression Model that predicts the Flight Cancellations on 1 billion records are included just as an illustration of the performance of the new Spark-based algorithms.

The Neural Network Models are all predicting Arrival Delay of Flights, so they are either Linear Models (the first one, with no Hidden Layers) or Non-linear Models using the bipolar sigmoid activation function (the Multi-Layer ones).

This demonstrates that the capability of building very complex and deep networks is available with ORAAH, making it possible to build networks with hundreds of thousands or millions of weights for more complex problems.

Not only that, but a Logistic Model can be computed on 1 billion records in less than 2 and a half minutes, and a Linear Neural Model in almost 3 minutes.

The R Output Listings of the Logistic Regression computation and of the MLP Neural Networks are below.

> # Spark Context Creation
> spark.connect(master="spark://my.spark.server:7077", memory="24G",dfs.namenode="my.dfs.namenode.server")

> # Attaches the HDFS file for use with ORAAH
> ont1bi <- hdfs.attach("/user/oracle/ontime_1bi")

> # Checks the size of the Dataset
> hdfs.dim(ont1bi)
[1] 1000000000         30

GLM - Logistic Regression

> # Testing GLM Logistic Regression on Spark
> # Formula definition: Cancellation of Flights in relation to other attributes
> form_oraah_glm2 <- CANCELLED ~ DISTANCE + ORIGIN + DEST + F(YEAR) + F(MONTH) +
+   F(DAYOFMONTH) + F(DAYOFWEEK)

> # ORAAH GLM2 Computation from RDD cached data (computing factor levels on its own)
> system.time(m_spark_glm <- orch.glm2(formula=form_oraah_glm2, ont1bi))
ORCH GLM: processed 6 factor variables, 25.806 sec
ORCH GLM: created model matrix, 100128 partitions, 32.871 sec
ORCH GLM: iter  1,  deviance   1.38433414089348300E+09,  elapsed time 9.582 sec
ORCH GLM: iter  2,  deviance   3.39315388583931150E+08,  elapsed time 9.213 sec
ORCH GLM: iter  3,  deviance   2.06855738812683250E+08,  elapsed time 9.218 sec
ORCH GLM: iter  4,  deviance   1.75868100359263200E+08,  elapsed time 9.104 sec
ORCH GLM: iter  5,  deviance   1.70023181759611580E+08,  elapsed time 9.132 sec
ORCH GLM: iter  6,  deviance   1.69476890425481350E+08,  elapsed time 9.124 sec
ORCH GLM: iter  7,  deviance   1.69467586045954760E+08,  elapsed time 9.077 sec
ORCH GLM: iter  8,  deviance   1.69467574351380850E+08,  elapsed time 9.164 sec
user  system elapsed
84.107   5.606 143.591

> # Checks the general information of the GLM Model
> summary(m_spark_glm)
Length Class  Mode
coefficients   846    -none- numeric
deviance         1    -none- numeric
solutionStatus   1    -none- character
nIterations      1    -none- numeric
formula          1    -none- character
factorLevels     6    -none- list

Neural Networks - Initial Steps

For the Neural Models, we have to add the times for computing the Factor Levels plus the time for creating the Model Matrix to the total elapsed time of the Model computation itself.  For example, the 3-layer model below takes 48.765 + 92.953 + 1975.947 ≈ 2,118 seconds (about 35 minutes) of total elapsed time.

> # Testing Neural Model on Spark
> # Formula definition
> form_oraah_neu <- ARRDELAY ~ DISTANCE + ORIGIN + DEST + as.factor(MONTH) +
+   as.factor(YEAR) + as.factor(DAYOFMONTH) + as.factor(DAYOFWEEK)
>
> # Compute Factor Levels from HDFS data
> system.time(xlev <- orch.getXlevels(form_oraah_neu, dfs.dat = ont1bi))
user  system elapsed
12.598   1.431  48.765

>
> # Compute and Cache the Model Matrix from cached RDD data
> system.time(Mod_Mat <- orch.prepare.model.matrix(form_oraah_neu, dfs.dat = ont1bi,xlev=xlev))
user  system elapsed
9.032   0.960  92.953

Neural Networks Model with 3 Layers of Neurons

> # Compute DEEP Neural Model from RDD cached Model Matrix (passing xlevels)
> # Three Layers, with 50, 25 and 10 neurons respectively.

> system.time(mod_neu <- orch.neural(formula=form_oraah_neu, dfs.dat=Mod_Mat,
+                                    xlev=xlev, hiddenSizes=c(50,25,10),trace=T))
Unconstrained Nonlinear Optimization
L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno)
Iter           Objective Value   Grad Norm        Step   Evals
0   5.12100202340115967E+11   5.816E+09   1.000E+00       1
1   4.94849165811250305E+11   2.730E+08   1.719E-10       1
2   4.94849149028958862E+11   2.729E+08   1.000E-04       3
3   4.94848409777413513E+11   2.702E+08   1.000E-02       3
4   4.94841423640935242E+11   2.437E+08   1.000E-01       4
5   4.94825372589270386E+11   1.677E+08   3.162E-01       1
6   4.94810879175052673E+11   1.538E+07   1.000E+00       1
7   4.94810854064597107E+11   1.431E+07   1.000E+00       1
Solution status             Optimal (objMinProgress)
Number of L-BFGS iterations 7
Number of objective evals   15
Objective value             4.94811e+11
Gradient norm               1.43127e+07
user   system  elapsed
91.024    8.476 1975.947

>
> # Checks the general information of the Neural Network Model
> mod_neu
Number of input units      845
Number of output units     1
Number of hidden layers    3
Objective value            4.948109E+11
Solution status            Optimal (objMinProgress)
Hidden layer [1]           number of neurons 50, activation 'bSigmoid'
Hidden layer [2]           number of neurons 25, activation 'bSigmoid'
Hidden layer [3]           number of neurons 10, activation 'bSigmoid'
Output layer               number of neurons 1, activation 'linear'
Optimization solver        L-BFGS
Scale Hessian inverse      1
Number of L-BFGS updates   20
> mod_neu$nObjEvaluations
[1] 15
> mod_neu$nWeights
[1] 43846
>

Neural Networks Model with 4 Layers of Neurons

> # Compute DEEP Neural Model from RDD cached Model Matrix (passing xlevels)
> # Four Layers, with 100, 50, 25 and 10 neurons respectively.

> system.time(mod_neu <- orch.neural(formula=form_oraah_neu, dfs.dat=Mod_Mat,
+                                    xlev=xlev, hiddenSizes=c(100,50,25,10),trace=T))
Unconstrained Nonlinear Optimization
L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno)
Iter           Objective Value   Grad Norm        Step   Evals
0   5.15274440087001343E+11   7.092E+10   1.000E+00       1
1   5.10168177067538818E+11   2.939E+10   1.410E-11       1
2   5.10086354184862549E+11   5.467E+09   1.000E-02       2
3   5.10063808510261475E+11   5.463E+09   1.000E-01       4
4   5.07663007172408386E+11   5.014E+09   3.162E-01       1
5   4.97115989230861267E+11   2.124E+09   1.000E+00       1
6   4.94859162124700928E+11   3.085E+08   1.000E+00       1
7   4.94810727630636963E+11   2.117E+07   1.000E+00       1
8   4.94810490064279846E+11   7.036E+06   1.000E+00       1
Solution status             Optimal (objMinProgress)
Number of L-BFGS iterations 8
Number of objective evals   13
Objective value             4.9481e+11
Gradient norm               7.0363e+06
user   system  elapsed
166.169   19.697 6467.703
>
> # Checks the general information of the Neural Network Model
> mod_neu
Number of input units      845
Number of output units     1
Number of hidden layers    4
Objective value            4.948105E+11
Solution status            Optimal (objMinProgress)
Hidden layer [1]           number of neurons 100, activation 'bSigmoid'
Hidden layer [2]           number of neurons 50, activation 'bSigmoid'
Hidden layer [3]           number of neurons 25, activation 'bSigmoid'
Hidden layer [4]           number of neurons 10, activation 'bSigmoid'
Output layer               number of neurons 1, activation 'linear'
Optimization solver        L-BFGS
Scale Hessian inverse      1
Number of L-BFGS updates   20
> mod_neu$nObjEvaluations
[1] 13
> mod_neu$nWeights
[1] 91196

Neural Networks Model with 5 Layers of Neurons

> # Compute DEEP Neural Model from RDD cached Model Matrix (passing xlevels)
> # Five Layers, with 200, 100, 50, 25 and 10 neurons respectively.

> system.time(mod_neu <- orch.neural(formula=form_oraah_neu, dfs.dat=Mod_Mat,
+                                    xlev=xlev, hiddenSizes=c(200,100,50,25,10),trace=T))

Unconstrained Nonlinear Optimization
L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno)
Iter           Objective Value   Grad Norm        Step   Evals
0   5.14697806831633850E+11   6.238E+09   1.000E+00       1
......
......

6   4.94837221890043518E+11   2.293E+08   1.000E+00  1
7   4.94810299190365112E+11   9.268E+06   1.000E+00       1
8   4.94810277908935242E+11   8.855E+06   1.000E+00       1
Solution status             Optimal (objMinProgress)
Number of L-BFGS iterations 8
Number of objective evals   16
Objective value             4.9481e+11
Gradient norm               8.85457e+06
user    system   elapsed
498.002    90.940 30473.421
>
> # Checks the general information of the Neural Network Model
> mod_neu
Number of input units      845
Number of output units     1
Number of hidden layers    5
Objective value            4.948103E+11
Solution status            Optimal (objMinProgress)
Hidden layer [1]           number of neurons 200, activation 'bSigmoid'
Hidden layer [2]           number of neurons 100, activation 'bSigmoid'
Hidden layer [3]           number of neurons 50, activation 'bSigmoid'
Hidden layer [4]           number of neurons 25, activation 'bSigmoid'
Hidden layer [5]           number of neurons 10, activation 'bSigmoid'
Output layer               number of neurons 1, activation 'linear'
Optimization solver        L-BFGS
Scale Hessian inverse      1
Number of L-BFGS updates   20
> mod_neu$nObjEvaluations
[1] 16
> mod_neu$nWeights
[1] 195896

Best Practices on logging level for using Apache Spark with ORAAH

It is important to note that Apache Spark’s logging is verbose by default, so after a few tests with different settings it might be useful to turn down the level of logging, something a System Administrator typically will do by editing the file $SPARK_HOME/etc/log4j.properties.

By default, that file is going to look something like this:

# cat $SPARK_HOME/etc/log4j.properties

# Set everything to be logged to the console
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Settings to quiet third party logs that are too verbose
log4j.logger.org.eclipse.jetty=INFO
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO

A typical full log provides the information shown below. It is most useful for the first tests and diagnostics, but it can produce too much output when running the Models themselves.

> # Creates the Spark Context. Because the Memory setting is not specified,
> # the default of 1 GB of RAM per Spark Worker is used
> spark.connect("yarn-client", dfs.namenode="my.hdfs.namenode")
15/02/18 13:05:44 WARN SparkConf:
SPARK_JAVA_OPTS was detected (set to '-Djava.library.path=/usr/lib64/R/lib'). This is deprecated in Spark 1.0+.
Please instead use:
- ./spark-submit with conf/spark-defaults.conf to set defaults for an application
- ./spark-submit with --driver-java-options to set -X options for a driver
- spark.executor.extraJavaOptions to set -X options for executors
- SPARK_DAEMON_JAVA_OPTS to set java options for standalone daemons (master or worker)
15/02/18 13:05:44 WARN SparkConf: Setting 'spark.executor.extraJavaOptions' to '-Djava.library.path=/usr/lib64/R/lib' as a work-around.
15/02/18 13:05:44 WARN SparkConf: Setting 'spark.driver.extraJavaOptions' to '-Djava.library.path=/usr/lib64/R/lib' as a work-around.
15/02/18 13:05:44 INFO SecurityManager: Changing view acls to: oracle
15/02/18 13:05:44 INFO SecurityManager: Changing modify acls to: oracle
15/02/18 13:05:44 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(oracle); users with modify permissions: Set(oracle)
15/02/18 13:05:44 INFO Slf4jLogger: Slf4jLogger started
15/02/18 13:05:44 INFO Remoting: Starting remoting
15/02/18 13:05:45 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@my.spark.master:35936]
15/02/18 13:05:45 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@my.spark.master:35936]
15/02/18 13:05:46 INFO SparkContext: Added JAR /u01/app/oracle/product/12.2.0/dbhome_1/R/library/ORCHcore/java/orch-core-2.4.1-mr2.jar at http://1.1.1.1:11491/jars/orch-core-2.4.1-mr2.jar with timestamp 1424264746100
15/02/18 13:05:46 INFO SparkContext: Added JAR /u01/app/oracle/product/12.2.0/dbhome_1/R/library/ORCHcore/java/orch-bdanalytics-core-2.4.1-mr2.jar at http://1.1.1.1:11491/jars/orch-bdanalytics-core-2.4.1-mr2.jar with timestamp 1424264746101
15/02/18 13:05:46 INFO RMProxy: Connecting to ResourceManager at my.hdfs.namenode/10.153.107.85:8032
Utils: Successfully started service 'sparkDriver' on port 35936.
SparkEnv: Registering MapOutputTracker
SparkEnv: Registering BlockManagerMaster
DiskBlockManager: Created local directory at /tmp/spark-local-
MemoryStore: MemoryStore started with capacity 265.1 MB
HttpFileServer: HTTP File server directory is /tmp/spark-7c65075f-850c-
HttpServer: Starting HTTP Server
Utils: Successfully started service 'HTTP file server' on port 11491.
Utils: Successfully started service 'SparkUI' on port 4040.
SparkUI: Started SparkUI at http://my.hdfs.namenode:4040
15/02/18 13:05:46 INFO Client: Requesting a new application from cluster with 1 NodeManagers
15/02/18 13:05:46 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
15/02/18 13:05:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/02/18 13:05:46 INFO Client: Uploading resource file:/opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/lib/spark/lib/spark-assembly.jar -> hdfs://my.hdfs.namenode:8020/user/oracle/.sparkStaging/application_1423701785613_0009/spark-assembly.jar
15/02/18 13:05:47 INFO Client: Setting up the launch environment for our AM container
15/02/18 13:05:47 INFO SecurityManager: Changing view acls to: oracle
15/02/18 13:05:47 INFO SecurityManager: Changing modify acls to: oracle
15/02/18 13:05:47 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(oracle); users with modify permissions: Set(oracle)
15/02/18 13:05:46 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB
15/02/18 13:05:46 INFO Client: Setting up container launch context for our AM
15/02/18 13:05:46 INFO Client: Preparing resources for our AM container
15/02/18 13:05:47 INFO Client: Submitting application 9 to ResourceManager
15/02/18 13:05:47 INFO YarnClientImpl: Submitted application application_1423701785613_0009
15/02/18 13:05:48 INFO Client: Application report for application_1423701785613_0009 (state: ACCEPTED)
15/02/18 13:05:48 INFO Client:
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: root.oracle
     start time: 1424264747559
     final status: UNDEFINED
     tracking URL: http://my.hdfs.namenode:8088/proxy/application_1423701785613_0009/
     user: oracle
15/02/18 13:05:49 INFO Client: Application report for application_1423701785613_0009 (state: ACCEPTED)
15/02/18 13:05:50 INFO Client: Application report for application_1423701785613_0009 (state: ACCEPTED)

Please note that all those warnings are expected, and may vary depending on the release of Spark used.

With the console logging option in the log4j.properties settings lowered from INFO to WARN, the request for a Spark Context returns the following:

# cat $SPARK_HOME/etc/log4j.properties

# Set everything to be logged to the console
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Settings to quiet third party logs that are too verbose
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO

Now the R Log is going to show only a few details about the Spark Connection.

> # Creates the Spark Context. Because the Memory setting is not specified,
> # the default of 1 GB of RAM per Spark Worker is used
> spark.connect(master="yarn-client", dfs.namenode="my.hdfs.server")
15/04/09 23:32:11 WARN SparkConf:
SPARK_JAVA_OPTS was detected (set to '-Djava.library.path=/usr/lib64/R/lib').
This is deprecated in Spark 1.0+.

Please instead use:
- ./spark-submit with conf/spark-defaults.conf to set defaults for an application
- ./spark-submit with --driver-java-options to set -X options for a driver
- spark.executor.extraJavaOptions to set -X options for executors
- SPARK_DAEMON_JAVA_OPTS to set java options for standalone daemons (master or worker)

15/04/09 23:32:11 WARN SparkConf: Setting 'spark.executor.extraJavaOptions' to '-Djava.library.path=/usr/lib64/R/lib' as a work-around.
15/04/09 23:32:11 WARN SparkConf: Setting 'spark.driver.extraJavaOptions' to '-Djava.library.path=/usr/lib64/R/lib' as a work-around.
15/04/09 23:32:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Finally, with the Console logging option in the log4j.properties file set to ERROR instead of INFO or WARN, the request for a Spark Context would return nothing in case of success:

# cat $SPARK_HOME/etc/log4j.properties

# Set everything to be logged to the console
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Settings to quiet third party logs that are too verbose
log4j.logger.org.eclipse.jetty=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO

This time no message is returned to the R Session, because we requested feedback only in case of an error:

> # Creates the Spark Context. Because the Memory setting is not specified,
> # the default of 1 GB of RAM per Spark Worker is used
> spark.connect(master="yarn-client", dfs.namenode="my.hdfs.server")
>

In summary, it is practical to start any Project with full logging, but it is a good idea to bring the logging level down to WARN or ERROR once the environment has been verified to work correctly and the settings are stable.
