Geekswithblogs.net

PMML – Predictive Model Markup Language

2014-10-08

Originally posted on: http://geekswithblogs.net/JoshReuben/archive/2014/10/09/pmml--predictive-model-markup-language.aspx

PMML Overview

An XML standard managed by
the Data Mining Group (www.dmg.org), whose members include IBM,
Microsoft, Oracle, SAS, SPSS, NCR, SAP, KXEN, Magnify, MINEit,
& StatSoft.

Predictive Model Markup
Language (PMML) is an XML markup language to describe statistical
and data mining models.

PMML describes the inputs to
data mining models, the transformations used to prepare data
prior to data mining, and the parameters which define the models
themselves.

It is the most widely
deployed data mining standard.

PMML is complementary to
many other data mining standards. Its XML interchange format is
supported by several other standards, such as XML for Analysis.

PMML provides a way for
applications to define statistical and data mining models and to
share models between PMML compliant applications.

PMML provides applications a
vendor-independent method of defining models so that proprietary
issues and incompatibilities are no longer a barrier to the
exchange of models between applications.

It allows users to develop
models within one vendor's application, and use other vendors'
applications to visualize, analyze, evaluate or otherwise use the
models. The exchange of models between compliant applications is now
straightforward.

One or more mining models
can be contained in a PMML XML document.

The PMML statistics subset
provides a basic framework for representing univariate statistics,
such as mean, min, max, counts, standard deviation & frequency

PMML is a standard for XML
documents which express trained instances of analytic models.

PMML supports the following
Model classes:

Association Rules

Decision Trees

Center-Based &
Distribution-Based Clustering

Regression

General Regression

Neural Networks

Naive Bayes

Sequences

PMML document structure:

<?xml version="1.0"?>
<!DOCTYPE PMML PUBLIC "PMML 2.0"
  "http://www.dmg.org/v2-0/pmml_v2_0.dtd">
<PMML version="2.0">
  ...
</PMML>

The root element of a PMML
document must have type PMML.

A PMML document can contain
zero or more models. A document without a model can be used to carry
the initial metadata before an actual model is computed, but it is
not meant to be useful for a PMML consumer.
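To make the "zero or more models" rule concrete, here is a rough Python sketch that lists the models found directly under the root element. The element names in MODEL_TAGS are my assumption based on the model classes listed above, the sample document is made up for illustration, and the sketch assumes a document without an XML namespace.

```python
import xml.etree.ElementTree as ET

# Assumed element names for the model classes listed earlier; adjust them
# for the PMML version you actually consume.
MODEL_TAGS = {
    "AssociationModel", "TreeModel", "ClusteringModel", "RegressionModel",
    "GeneralRegressionModel", "NeuralNetwork", "NaiveBayesModel",
    "SequenceModel",
}

def list_models(pmml_text):
    """Return (tag, modelName) for every model directly under <PMML>."""
    root = ET.fromstring(pmml_text)
    if root.tag != "PMML":
        raise ValueError("root element must be PMML, got %r" % root.tag)
    return [(child.tag, child.get("modelName"))
            for child in root if child.tag in MODEL_TAGS]

# Hypothetical document, not taken from the specification.
SAMPLE = """
<PMML version="2.0">
  <Header copyright="example"/>
  <DataDictionary numberOfFields="0"/>
  <RegressionModel modelName="m1" functionName="regression"/>
</PMML>
"""

print(list_models(SAMPLE))
```

A document that only carries metadata simply yields an empty list.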

The element <MiningBuildTask>
can contain any XML value describing the configuration of the
training run that produced the model instance - the natural
container for task specifications as defined by other mining
standards, e.g., in SQL or .NET.

The fields in the
<DataDictionary>
and in the <TransformationDictionary>
elements are identified by unique names. Other elements in the
models can refer to these fields by name, so multiple models in
one PMML document can share the same fields defined in these
dictionary elements.

Certain types of PMML models
such as neural networks or logistic regression can be used for
different purposes - some instances implement prediction of numeric
values, while others can be used for classification according to the
functionName
attribute which specifies the mining function.

A Model element has the
following attributes:

modelName
- identifies the model with a unique name in the context of the
PMML file.

functionName
and algorithmName
provide informational descriptions of the nature of the mining
model, e.g., whether it is intended to be used for clustering or
for classification.

Basic data types and
entities: NUMBER,
INT-NUMBER, REAL-NUMBER, PROB-NUMBER
(a real number between 0.0 & 1.0) & PERCENTAGE-NUMBER

The types <Array>,
<NUM-ARRAY>, <REAL-ARRAY> & <STRING-ARRAY>
are defined as container structures which implement arrays of
numbers and strings in a fairly compact way:

<Array n="3" type="int">
  1 22 3
</Array>

<Array n="3" type="string">
  ab "a b" "with \"quotes\" "
</Array>
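A consumer has to split that compact content back into items. The following Python sketch shows one way to do it, assuming items are separated by whitespace, strings containing blanks are wrapped in double quotes, and a backslash escapes a quote inside such a string. The function name and its parameters are mine, not part of PMML.

```python
import re

# A double-quoted item (with backslash escapes) or a bare run of non-blanks.
_TOKEN = re.compile(r'"((?:[^"\\]|\\.)*)"|(\S+)')

def parse_pmml_array(text, kind, n=None):
    """Split the text content of an <Array> into a Python list."""
    items = []
    for match in _TOKEN.finditer(text):
        if match.group(1) is not None:
            # Quoted item: drop the quotes and unescape \" to ".
            items.append(match.group(1).replace('\\"', '"'))
        else:
            items.append(match.group(2))
    if n is not None and len(items) != n:
        raise ValueError("expected %d items, found %d" % (n, len(items)))
    if kind == "int":
        return [int(item) for item in items]
    if kind == "real":
        return [float(item) for item in items]
    return items
```

Passing the n attribute is optional here; when given, it acts as the same kind of consistency check the format intends.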

PMML Header Information

Header:
The top level tag that marks the beginning of the header
information.

copyright:
This attribute contains the copyright information for this model.

description:
A free-text description of the model.

Application:
This element describes the software application that generated the
model.

name:
The name of the application that generated the model.

version:
The version of the application that generated this model.

Annotation:
Document modification history is embedded here.

Timestamp:
This element records a model creation timestamp.

PMML Data Dictionary

The data dictionary contains
definitions for fields as used in mining models. It specifies the
types and value ranges. These definitions are assumed to be
independent of specific data sets as used for training or scoring a
specific model.

A data dictionary can be
shared by multiple models. Statistics and other information related
to the training set are stored within a model.

The value numberOfFields
is the number of fields which are defined in the content of
<DataDictionary>;
this number can be added for consistency checks. The name of a data
field must be unique in the data dictionary. The displayName
is a string which may be used by applications to refer to that
field.

The fields are separated
into different types depending on which operations are defined on
the values; this is defined by the attribute optype.
Categorical fields have the operator "=", ordinal fields
have an additional "<", and continuous fields also
have arithmetic operators. Cyclic fields have a distance measure
which takes into account that the maximal value and minimal value
are close together.

The optional attribute
'taxonomy'
refers to a hierarchy of values and is only applicable to
categorical fields.

The content of a DataField
defines the set of values which are considered to be valid – the
mining model will categorize a value as valid, invalid or missing

If a categorical
or ordinal
field contains at least one Value element where the value of
property is 'valid' or unspecified, then the set of Value elements
completely defines the set of valid values. Otherwise any value is
valid by default.

The element Interval
defines a range of numeric values. The attributes leftMargin
and rightMargin
are optional, but at least one of them must be defined. If a margin is
missing, then -infinity (for leftMargin) or +infinity (for
rightMargin) is assumed.
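The margin rule can be sketched in a few lines of Python. This is only an illustration: the function name is mine, and I assume both bounds are inclusive, which a real consumer would take from the interval definition rather than hard-code.

```python
import math

def in_interval(x, left_margin=None, right_margin=None):
    """Check x against an Interval; a missing margin means infinity."""
    if left_margin is None and right_margin is None:
        raise ValueError("at least one margin must be defined")
    low = -math.inf if left_margin is None else left_margin
    high = math.inf if right_margin is None else right_margin
    # Assumption: both bounds are treated as inclusive in this sketch.
    return low <= x <= high
```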

PMML Mining Schema

Each model contains one
mining schema which lists fields as used in that model - This is a
subset of the fields as defined in the data dictionary.

While the mining schema
contains information that is specific to a certain model, the data
dictionary contains data definitions which do not vary per model.

The main purpose of the
mining schema is to list the fields which a user has to provide in
order to apply the model.

The
usageType attribute
can have the following values:

active:
field used as input (independent field).

predicted:
field whose value is predicted by the model.

supplementary:
field holding additional descriptive information.

The
outliers attribute
can have the following values:

asIs:
field values treated at face value.

asMissingValues:
outlier values are treated as if they were missing.

asExtremeValues:
outlier values are changed to a specific high or low value defined
in MiningField.

name:
symbolic name of field, must refer to a field in the data
dictionary.

highValue
and lowValue:
for outliers

missingValueReplacement:
If this attribute is specified then a missing input value is
automatically replaced by the given value. That is, the model
itself works as if the given value was found in the original
input.

missingValueTreatment:
informational only.
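How these MiningField attributes interact is easier to see in code. The sketch below is a minimal Python illustration, not a reference implementation: the field is represented as a plain dict keyed by the attribute names above, and applying outlier handling before missing-value replacement is my assumption about the order.

```python
def prepare_value(value, field):
    """Apply outlier handling, then missing-value replacement.

    `field` is a dict holding MiningField attributes; None stands for a
    missing input value.
    """
    low, high = field.get("lowValue"), field.get("highValue")
    outliers = field.get("outliers", "asIs")
    if value is not None and outliers != "asIs":
        too_low = low is not None and value < low
        too_high = high is not None and value > high
        if outliers == "asMissingValues" and (too_low or too_high):
            value = None  # treat the outlier as if it were missing
        elif outliers == "asExtremeValues":
            if too_low:
                value = low
            elif too_high:
                value = high
    if value is None:
        # Stays None when no replacement value is configured.
        value = field.get("missingValueReplacement")
    return value
```

For example, with outliers="asExtremeValues", lowValue=0 and highValue=100, an input of 150 is scored as 100.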

PMML Data flow

PMML defines a variety of
specific mining models such as for tree classification, neural
networks, regression, etc.

There are also definitions which
are common to all models, in order to describe the input data
itself and generic transformations which can be applied to the
input data before the model itself is evaluated.

The <DataDictionary>
element describes
the data 'as is', that is, the raw input data. It refers to the
original data and defines how the mining model interprets the data,
e.g., as categorical or numerical.
The <MiningSchema>
element defines an interface to the user of PMML models, listing
all fields which are used as input to the computations in the
mining model. The MiningSchema also defines which values are
regarded as outliers, which weighting is applied to a field, e.g.,
for clustering. Input fields as specified in the MiningSchema refer
to fields in the data dictionary but not to derived fields because
a user of a model is not required to perform the normalizations.

Various transformations are
defined, such as normalization of numbers to a range [0..1] or
discretization of continuous fields, which convert the original
values to internal values as they are required by the mining model,
such as an input neuron of a network model. The mining model may
internally require further derived values that depend on the input
values defined in the transformations block. The transformations
cover expressions that were generated by a mining technique. A
complete mining project usually needs many other preprocessing
steps which may have to be defined manually, and PMML does not
provide a complete language for this full preprocessing. These
data preparation steps must be performed before feeding the values
into a PMML consumer.

If a PMML document contains
multiple models then sharing definitions of normalizations could
save space in the document. That's the same idea as for having a
common data dictionary. Note, the normalizations may still differ
between models, i.e., different models may refer to different sets
of derived fields.

A derived value, defined by
a normalization, can be input for another transformation. E.g. a
neural network model could have a linear normalization defined on a
log-transformed input field 'income'.
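The 'income' case can be sketched in Python as two chained steps: a log transform followed by a piecewise-linear normalization. The knot values and function names below are hypothetical, and extrapolating beyond the outermost knots along the nearest segment is an assumption of this sketch.

```python
import math
from bisect import bisect_right

def linear_norm(x, knots):
    """Piecewise-linear map through (orig, norm) pairs sorted by orig."""
    origs = [orig for orig, _ in knots]
    # Pick the segment containing x; values outside the knots reuse the
    # first or last segment (assumption: linear extrapolation).
    i = min(max(bisect_right(origs, x), 1), len(knots) - 1)
    (x0, y0), (x1, y1) = knots[i - 1], knots[i]
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0)

# Hypothetical knots defined on log10(income), not on income itself.
INCOME_KNOTS = [(3.0, 0.0), (6.0, 1.0)]

def normalized_log_income(income):
    """Derived value of a derived value: log first, then normalize."""
    return linear_norm(math.log10(income), INCOME_KNOTS)
```

The second step never sees the raw income, only the derived log value, which is the chaining described above.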

The specific definitions of
models such as tree classification or neural network may refer to
fields listed in the MiningSchema or to derived fields which can be
computed from the MiningSchema-fields (incl. transitive closure).

The statistics and the
specific model can refer to fields in the MiningSchema but also to
transformed fields. If there is a replacement value defined for
missing values, the statistics refer to the values before the
missing values are replaced.

The output of a model always
depends on the specific kind of model, and the final result, such
as a predicted class and a probability, are computed from the
output of the model.

If a neural network is used
for predicting numeric values then the output value of the network
usually needs to be denormalized into the original domain of
values, which can use the same kind of transformation types - The
PMML consumer system will automatically compute the inverse
mapping.
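One way a consumer might compute that inverse mapping is to swap the (orig, norm) pairs of a piecewise-linear normalization and interpolate again. The Python sketch below assumes the normalization is strictly increasing; the target knots and function names are hypothetical.

```python
from bisect import bisect_right

def piecewise_linear(x, knots):
    """Interpolate through (input, output) pairs sorted by input."""
    xs = [a for a, _ in knots]
    i = min(max(bisect_right(xs, x), 1), len(knots) - 1)
    (x0, y0), (x1, y1) = knots[i - 1], knots[i]
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0)

def denormalize(y, knots):
    """Map a normalized value back to the original domain.

    Assumes the norm values increase strictly with the orig values.
    """
    inverse = sorted((norm, orig) for orig, norm in knots)
    return piecewise_linear(y, inverse)

# Hypothetical normalization of a numeric target field.
TARGET_KNOTS = [(0.0, 0.0), (1000.0, 0.5), (5000.0, 1.0)]
```

With these knots, a network output of 0.75 corresponds to 3000 in the original domain, and normalizing 3000 again gives 0.75.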

PMML Transformation Dictionary & Derived Values

At various places the mining
models use simple functions in order to map user data to values
that are easier to use in the specific model. For example, neural
networks internally work with numbers, usually in the range from
0 to 1. Numeric input data are mapped to the range [0..1], and
categorical fields are mapped to series of 0/1 indicators.

PMML defines 4 kinds of
simple data transformations:

Normalization:
map values to numbers, the input can be continuous or discrete.

Discretization:
map continuous values to discrete values.

Value mapping:
map discrete values to discrete values.

Aggregation:
summarize or collect groups of values, e.g. compute average.
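Three of these kinds are small enough to sketch in Python. The function names, the bin layout (lower bound inclusive, upper bound exclusive) and the sample categories are illustrative assumptions, not PMML definitions.

```python
def fan_out(value, categories):
    """Normalization of a discrete field: one 0/1 indicator per category."""
    return [1.0 if value == category else 0.0 for category in categories]

def discretize(x, bins, default=None):
    """Discretization: map a continuous value to the label of its bin.

    `bins` is a list of (low, high, label) with low inclusive and high
    exclusive (an assumption of this sketch).
    """
    for low, high, label in bins:
        if low <= x < high:
            return label
    return default

def map_value(value, table, default=None):
    """Value mapping: translate one discrete value into another."""
    return table.get(value, default)
```

For instance, fan_out("urban", ["suburban", "urban", "rural"]) produces the indicator series a neural network would receive for a 'domicile' field.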

The transformations in PMML
do not cover the full set of preprocessing functions which may be
needed to collect and prepare the data for mining, as there are too
many variations of preprocessing expressions. Instead, the PMML
transformations represent expressions that are created
automatically by a mining system.

PMML Conformance

PMML intends to enable
application portability, sharing, and reuse of analytic models
produced by a variety of tools.

Conformance must therefore
be specified from both producer and consumer perspectives.

Applications need ways to
specify what kinds of analytic models they can use, and modeling
tools need ways to specify what kinds of analytic models they
produce.

A PMML document is what gets
produced by a modeling tool to specify a trained analytic model and
is what an application uses to deploy that model.

Satisfying conformance rules
ensures that a model definition document is syntactically correct
and consistent with the specification, and that such a model will be
applied in ways which are valid.

PMML Regression

A RegressionModel
defines three types of regression models: linear, polynomial, and
logistic regression. The modelType
attribute indicates the type of regression used.

Linear and
stepwise-polynomial regression are designed for numeric dependent
variables having a continuous spectrum of values. These models
should contain exactly one regression table. The attributes
normalizationMethod
and targetCategory
are not used in that case.

Logistic regression is
designed for categorical dependent variables. These models should
contain exactly one regression table for each targetCategory.
The normalizationMethod
describes whether/how the prediction is converted into a
probability.

p is the predicted value and
is normally interpreted as the confidence or the probability of an
individual belonging to the category of interest, as defined by
targetCategory.
There can be multiple regression equations. A confidence value for
a category j can be computed by the softmax or simplemax functions

The <RegressionModel>
element is the root element of an XML regression model, and
contains the following attributes:

modelName:
This is a unique identifier specifying the name of the regression
model.

functionName:
Can be regression or classification.

algorithmName:
Can be any string describing the algorithm that was used while
creating the model.

modelType:
Specifies the type of a regression model. This information is used
to select the appropriate mathematical formulas during the scoring
phase. The supported regression algorithms are linearRegression,
polynomialRegression,
& logisticRegression.

targetFieldName:
The name of the target field (also called response variable).

The
<RegressionTable> element
represents a table that lists the values of all predictors or
independent variables. If the model is used to predict a numerical
field, then there is only one RegressionTable
and the attribute targetCategory
may be missing. If the model is used to predict a categorical
field, then there are two or more RegressionTables
and each one must have the attribute targetCategory
defined with a unique value.

The
<NumericPredictor> subelement
defines a numeric independent variable. The list of valid
attributes comprises the name of the variable, the exponent to be
used, and the coefficient by which the values of this variable must
be multiplied. If the independent variable contains missing values,
the mean attribute is used to replace the missing values with the
mean value.

The
<CategoricalPredictor> subelement
defines a categorical independent variable. The list of attributes
comprises the name of the variable, the value
attribute, and the coefficient by which the values of this variable
must be multiplied.

To do a regression analysis
with categorical values, some means must be applied to enable
calculations. If the specified value of an independent variable
occurs, the term variable_name(value)
is replaced with 1. Thus the coefficient is multiplied by 1.
If the value does not occur,
the term variable_name(value)
is replaced with 0 so that the product coefficient
× variable_name(value) yields
0. Consequently, the product is ignored in the ongoing analysis. If
the input value is missing then variable_name(v) yields 0 for any
'v'.

E.g. a linear regression analysis PMML model:

<RegressionModel
    functionName="regression"
    modelName="Sample for linear regression"
    modelType="linearRegression"
    targetFieldName="number of claims">
  <RegressionTable intercept="132.37">
    <NumericPredictor name="age" exponent="1" coefficient="7.1"/>
    <NumericPredictor name="salary" exponent="1" coefficient="0.01"/>
    <CategoricalPredictor name="car location" value="carpark" coefficient="41.1"/>
    <CategoricalPredictor name="car location" value="street" coefficient="325.03"/>
  </RegressionTable>
</RegressionModel>
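To show how a consumer might score a record with this model, here is a minimal Python sketch that parses the fragment with the standard library and evaluates the regression table. The score function and the sample record are mine; the sketch ignores the mean attribute for missing numeric values and treats a missing categorical value as contributing 0, as described above.

```python
import xml.etree.ElementTree as ET

PMML_FRAGMENT = """
<RegressionModel functionName="regression"
                 modelName="Sample for linear regression"
                 modelType="linearRegression"
                 targetFieldName="number of claims">
  <RegressionTable intercept="132.37">
    <NumericPredictor name="age" exponent="1" coefficient="7.1"/>
    <NumericPredictor name="salary" exponent="1" coefficient="0.01"/>
    <CategoricalPredictor name="car location" value="carpark" coefficient="41.1"/>
    <CategoricalPredictor name="car location" value="street" coefficient="325.03"/>
  </RegressionTable>
</RegressionModel>
"""

def score(table, record):
    """Evaluate one RegressionTable for a record given as a dict."""
    total = float(table.get("intercept"))
    for predictor in table.findall("NumericPredictor"):
        x = record[predictor.get("name")]
        exponent = int(predictor.get("exponent", "1"))
        total += float(predictor.get("coefficient")) * x ** exponent
    for predictor in table.findall("CategoricalPredictor"):
        # The indicator term is 1 when the value matches, otherwise 0.
        if record.get(predictor.get("name")) == predictor.get("value"):
            total += float(predictor.get("coefficient"))
    return total

table = ET.fromstring(PMML_FRAGMENT).find("RegressionTable")
record = {"age": 30, "salary": 1000, "car location": "street"}
print(score(table, record))
```

For this record the sum is 132.37 + 7.1 × 30 + 0.01 × 1000 + 325.03, i.e. roughly 680.4.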

PMML Neural Network Models for Backpropagation

PMML models each neuron
as receiving one or more input values, each coming via a network
connection, and sending only one output value. All incoming
connections for a certain neuron are contained in the corresponding
<Neuron> element. Each connection Con
stores the ID of the node it comes from and the weight. A bias weight
coefficient may be stored as an attribute of the <Neuron>
element.

All neurons in the network
are assumed to have the same (default) activation function,
although each individual neuron may have its own activation and
threshold that override the default.

NeuralInput defines
how input fields are normalized so that the values can be processed
in the neural network. For example, string values must be encoded
as numeric values.

NeuralOutput
defines how the output of the neural network must be interpreted.

NN-NEURON-ID is
a string which uniquely identifies a neuron within a model (not
within a document).

An input neuron represents
the normalized value for an input field using the normalization
elements <NormContinuous>
and <NormDiscrete>.
A numeric input field is usually mapped to a single input neuron
while a categorical input field is usually mapped to a set of input
neurons using some fan-out function.

Restrictions: A numeric
input field, or a pair consisting of a categorical input field and an
input value, must not appear more than once in the input layer.

Neuron contains an
identifier which must be unique in all layers, its attribute
threshold has default value 0. If no activationFunction is given
then the default activationFunction
of the NeuralNetwork element applies.

The attribute 'bias'
implicitly defines a connection to a bias unit where the unit's
value is 1.0 and the weight is the value of 'bias'

Weighted connections between
neural net nodes are represented by Con
elements, which
are always part of a Neuron and define the connections coming into
that parent element.
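Putting the Con elements and the bias together, the output of a single neuron can be sketched in Python as below. The data layout (a list of (from, weight) pairs and a dict of activations keyed by neuron ID) is my own choice, and the logistic function stands in for whichever activation function the network declares.

```python
import math

def logistic(z):
    """Logistic activation: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(connections, activations, bias=0.0, activation=logistic):
    """Compute one neuron's output.

    connections: list of (from_id, weight) pairs, one per Con element.
    activations: dict mapping neuron IDs to already computed values.
    The bias behaves like a connection from a unit whose value is 1.0.
    """
    z = bias * 1.0 + sum(weight * activations[src] for src, weight in connections)
    return activation(z)
```

Evaluating the layers in order and storing each result in the activations dict is enough to run a plain feed-forward network.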

The neuron identified by
'from'
may be part of any layer.

NN-NEURON-IDs of all nodes
must be unique across the combined set of NeuralInput
and Neuron
nodes. The 'from'
attributes of connections
and NeuralOutputs
refer to these identifiers.

In parallel to input
neurons, there are output neurons which are connected to input
fields via some normalization.

While the activation of an
input neuron is defined by the value of the corresponding input
field, the activation of an output neuron is computed by the
activation function, and thus an output neuron is defined by a
'Neuron'.

In networks with supervised
learning the computed activation of the output neurons is compared
with the normalized values of the corresponding target fields

The difference between the
neuron's activation and the normalized target field determines the
prediction error.

For scoring, the
normalization for the target field is used to denormalize the
predicted value in the output neuron. Therefore, each instance of
'Neuron'
which represents an output neuron is additionally connected to a
normalized field. Note that the scoring procedure must apply the
inverse of the normalization in order to map the neuron activation
to a value in the original domain.

For neural value prediction
with backpropagation, the output layer contains a single neuron;
its activation is denormalized, giving the predicted value.

For neural classification
with backpropagation, the output layer contains one or more
neurons. The neuron with maximal activation determines the
predicted class label. If there is no unique neuron with maximal
activation then the predicted value is undefined.
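The classification rule, including the tie case, fits in a few lines of Python. Representing the output layer as a dict from class label to activation is an assumption of this sketch, and None stands for "undefined".

```python
def predicted_class(output_activations):
    """Return the label of the unique maximal activation, else None.

    output_activations: dict mapping class label -> neuron activation.
    """
    best = max(output_activations.values())
    winners = [label for label, value in output_activations.items()
               if value == best]
    return winners[0] if len(winners) == 1 else None
```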

Backward connections from
level N to level M with M <= N, connections between
non-adjacent layers, and variable values for activationFunction
per Neuron require extensions.
e.g.

<?xml version="1.0" ?>
<PMML version="2.0">
  <Header copyright="DMG.org"/>
  <DataDictionary numberOfFields="5">
    <DataField name="gender" optype="categorical">
      <Value value=" female"/>
      <Value value=" male"/>
    </DataField>
    <DataField name="no of claims" optype="categorical">
      <Value value=" 0"/>
      <Value value=" 1"/>
      <Value value=" 3"/>
      <Value value=" > 3"/>
      <Value value=" 2"/>
    </DataField>
    <DataField name="domicile" optype="categorical">
      <Value value="suburban"/>
      <Value value=" urban"/>
      <Value value=" rural"/>
    </DataField>
    <DataField name="age of car" optype="continuous"/>
    <DataField name="amount of claims" optype="continuous"/>
  </DataDictionary>
  <NeuralNetwork modelName="Neural Insurance"
                 functionName="regression"
                 activationFunction="logistic">
    <MiningSchema>
      <MiningField name="gender"/>
      <MiningField name="no of claims"/>
      <MiningField name="domicile"/>
      <MiningField name="age of car"/>
      <MiningField name="amount of claims" usageType="predicted"/>
    </MiningSchema>
    <NeuralInputs>
      <NeuralInput id="0">
        <DerivedField>
          <NormContinuous field="age of car">
            <LinearNorm orig="0.01" norm="0"/>
            <LinearNorm orig="3.07897" norm="0.5"/>
            <LinearNorm orig="11.44" norm="1"/>
          </NormContinuous>
        </DerivedField>
      </NeuralInput>
      <NeuralInput id="1">
        <DerivedField>
          <NormDiscrete field="gender" value="male"/>
        </DerivedField>
      </NeuralInput>
      <NeuralInput id="2">
        <DerivedField>
          <NormDiscrete field="no of claims" value="0"/>
        </DerivedField>
      </NeuralInput>
      <NeuralInput id="3">
        <DerivedField>
          <NormDiscrete field="no of claims" value="1"/>
        </DerivedField>
      </NeuralInput>
      <NeuralInput id="4">
        <DerivedField>
          <NormDiscrete field="no of claims" value="3"/>
        </DerivedField>
      </NeuralInput>
      <NeuralInput id="5">
        <DerivedField>
          <NormDiscrete field="no of claims" value="> 3"/>
        </DerivedField>
      </NeuralInput>
      <NeuralInput id="6">
        <DerivedField>
<NormDiscr