Wednesday, March 31, 2010

Here is a good example of how Machine Learning algorithms can be used in the Insurance Industry. This is a competition I entered with one of my colleagues, Chandra Lakkaraju.

TITLE: Loss Development Challenge
DATE: March 31, 2010

Introduction
The accident year, development year, number of claims, and loss amounts describe processes that develop over time and are characterized by random fluctuations. These processes can be represented by functions that can be used to forecast future, unknown development costs. Multiple approaches can be followed to model the triangles. In this paper, three machine-learning approaches are applied to the data.
Problem Definition
The problem this paper investigates is the prediction of future loss-development costs given two independent triangles of data (Paid and Paid-Case). Specifically, the task is to predict what is known as the diagonal for the next valuation period and for the 50th period; the 50th period is known as the ultimate loss development for a particular accident year. The only information provided is the accident year, the number of claims, and the loss development for a number of years in time-series format. The conjecture is that modeling can be used to predict future loss-development costs. Specifically, Linear Regression, Neural Network, and M5P models will be used to make predictions. The predicted results will be evaluated against unpublished values by a committee to determine how accurate the proposed models turned out to be. The criterion used by the committee to compare results from multiple competitors is the least sum of squared errors and sum of absolute errors.
Basic Assumptions
Benefit-level changes and inflationary (CPI) effects are already incorporated in the dataset provided, so no further adjustments to the disclosed costs are required.
Modeling and Data Transformation
The models are based on machine-learning techniques that process the link-ratio distribution obtained from the original loss-development costs to predict the missing link ratios. The predicted link ratios form the basis for computing the target loss amounts. The Paid and Paid-Case triangles are treated in a similar fashion because the machine learning is based on recognizing the underlying patterns.
Methodology
The loss development is decomposed into two primary effects, age and momentum. The age and momentum effects are computed by fitting a regression model across rows and columns, respectively. The regression model is a simple average of three machine-learning models: Linear Regression (LR), Neural Networks (NN), and M5 model trees (M5P). The link ratios from the triangle are fed into these three separate models, the missing link ratios are computed, and a simple average of the three model outputs is taken as the LDF for each data point. The mean LDF is used as the tail factor and is attached at the 19th report to compute the ultimate. A minimal sketch of this pipeline is shown below.
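The following sketch illustrates the two computational steps described above, assuming the cumulative triangle is held in a 2-D array with NaN for unknown cells; the array names and the placeholder model predictions are assumptions made for the illustration, not part of the original submission.

import numpy as np

def age_to_age_link_ratios(triangle):
    # triangle[i, j] = cumulative loss for accident year i at development
    # age j; future (unknown) cells are NaN, so the corresponding ratios
    # come out as NaN as well.
    return triangle[:, 1:] / triangle[:, :-1]

def averaged_ldf(pred_lr, pred_nn, pred_m5p):
    # Simple average of the three models' predicted link ratios,
    # used as the LDF for each missing data point.
    return (pred_lr + pred_nn + pred_m5p) / 3.0

The mean of these averaged LDFs would then serve as the tail factor attached at the 19th report.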
Machine learning approaches
Machine-learning models make it possible to use non-linear approximations of the parameters in the functions that relate accident years to future development losses. To approximate these parameters, the coefficients of the function decomposition are obtained from the input–output data pairs, a chosen model structure, and systematic learning rules. Once trained, the machine-learning model becomes a parametric description of the function. Learning a general principle from the set of specific training examples provided is achieved by trying out different model structures and the related parameters. For this paper, out of several viable methods considered, Artificial Neural Networks (ANN) and M5 model trees, specifically the M5P implementation, are applied to the data. Each model produces goodness-of-fit statistics for comparison purposes.
Artificial neural network
The ANN is the most widely used machine-learning model. ANN is a broad term covering a large variety of network architectures, the most common of which is the multi-layer perceptron (MLP), the one selected for this paper. Such a network is trained by the so-called error-back-propagation method, a specialized version of gradient-based optimization. In an MLP, each target vector z is an unknown function f of the input vector x, as in Equation 1.

z=f(x)
Equation 1 - MLP

The task of the network is to learn the function f. The network includes a set of parameters (the weight vector), the values of which are varied so that the function computed by the network is as close as possible to f. The weight parameters are determined by training (calibrating) the ANN on the training data set.
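A minimal sketch of the forward pass such a network computes, assuming one hidden layer of sigmoid nodes feeding a single linear output node (the structure reported in the Findings below); the weight arrays are placeholders, and Weka's internal attribute normalization is ignored.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, hidden_weights, hidden_bias, output_weights, output_bias):
    # Approximation of z = f(x): sigmoid hidden nodes followed by a
    # linear output node. Training adjusts the weight arrays so that
    # this output tracks the target values.
    hidden = sigmoid(hidden_weights @ x + hidden_bias)
    return output_weights @ hidden + output_bias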

Modular approach and the M5 model trees
A complex modeling problem can be solved by dividing it into a number of simple tasks and combining their solutions. The input space can be divided into a number of subspaces, or regions, for each of which a separate specialized model is built. In machine learning such models are often called experts, or modules, and a combination of experts is called a committee machine. Committee machines fall into two major categories: (1) static (represented by ensemble averaging and boosting), where the responses of the experts are combined by a mechanism that does not involve the input signal, e.g., using fixed weights; and (2) dynamic, where the experts are combined using weighting schemes that depend on the input vector.

The category of dynamic committee machines can be split further into two groups: (2a) statistically driven approaches with “soft” splits of the input space, represented by mixtures of experts, and (2b) methods that do not combine the outputs of different experts or modules but explicitly use only one of them, the most appropriate one (a particular case in which the weights of the other experts are zero).

Contrary to mixture models, methods of this group use “hard” (i.e., yes–no) splits of the input space into regions, progressively narrowing those regions. Each expert is trained individually on the subset of instances contained in its region, and finally the output of only one specialized expert is taken into consideration. The result is a hierarchy, a tree (often a binary one), with splitting rules in the non-terminal nodes and the expert models in the leaves. Such models can be called hierarchical (or tree-like) modular models (HiMM).
Models in HiMMs could be of any type, for example linear regression or ANNs.
For solving numerical prediction (regression) problems, there are a number of splitting methods based on the idea of a decision tree:
· If a leaf is associated with the average output value of the instances sorted down to it (a zero-order model), then the overall approach is called a regression tree, resulting in numerical constants (zero-order models) in the leaves.
· If it is desirable to have regression functions of the input variables in the leaves, then two approaches are typically used: the MARS (multivariate adaptive regression splines) algorithm and the M5 model tree algorithm.

The M5 algorithm is rooted in the following principle: split the parameter space into areas (subspaces) and build in each of them a local, specialized linear regression model. The splitting in a model tree follows the idea used in building a decision tree, but instead of class labels it has linear regression functions at the leaves, which can predict continuous numeric attributes. Model trees generalize the concept of regression trees, which have constant values at their leaves; with linear models at the leaves, a model tree is analogous to a piecewise linear function (and hence non-linear overall). The major advantages of model trees over regression trees are that model trees are much smaller, the decision strength is clear, and the regression functions do not normally involve many variables.

Tree-based models are constructed by a divide-and-conquer method. The set T is either associated with a leaf, or some test is chosen that splits T into subsets corresponding to the test outcomes, and the same process is applied recursively to the subsets. The splitting criterion for the M5 model tree algorithm treats the standard deviation of the class values that reach a node as a measure of the error at that node, and calculates the expected reduction in this error as a result of testing each attribute at that node. The standard deviation reduction (SDR) is computed as shown in Equation 2, where T is the set of examples that reach the node and T1, T2, … are the sets that result from splitting the node on the chosen attribute (in the case of a multi-way split). The splitting process terminates if the output values of all the instances that reach the node vary only slightly or only a few instances remain.


SDR = sd(T) − Σi (|Ti| / |T|) × sd(Ti)
Equation 2 - SDR
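A minimal sketch of Equation 2, assuming the class values are held in NumPy arrays; the split on accYear in the usage comment is only an example, mirroring the splits that appear in the fitted trees below.

import numpy as np

def sdr(parent_values, child_value_sets):
    # Standard deviation reduction (Equation 2): sd(T) minus the
    # size-weighted standard deviations of the subsets T_i.
    n = len(parent_values)
    weighted = sum(len(t_i) / n * np.std(t_i) for t_i in child_value_sets)
    return np.std(parent_values) - weighted

# Example: evaluate a binary split of link ratios y on accident year x:
# sdr(y, [y[x <= 2002.5], y[x > 2002.5]])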


After examining all possible splits (that is, the attributes and the possible split values), M5 chooses the one that maximizes the expected error reduction. Splitting ceases when the class values of all the instances that reach a node vary only slightly, or when only a few instances remain. This relentless division often produces over-elaborate structures that must be pruned back, for instance by replacing a sub-tree with a leaf. In the final stage, a smoothing process is performed to compensate for the sharp discontinuities that would otherwise occur between adjacent linear models at the leaves of the pruned tree, particularly for models constructed from a smaller number of training examples. In smoothing, the adjacent linear equations are updated so that the predicted outputs for neighboring input vectors corresponding to different equations become close in value.
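The smoothing step is commonly described by the blending rule sketched below, assuming the usual M5 formulation with a smoothing constant of 15; that constant is an assumption from the standard description, not a value stated in this paper.

def smooth(p, q, n, k=15.0):
    # Blend the prediction p passed up from the node below with the
    # prediction q of the linear model at the current node; n is the
    # number of training instances reaching the lower node and k is
    # the smoothing constant (15 in the usual M5 description -- assumed).
    return (n * p + k * q) / (n + k)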
Linear Regression Models
Linear regression models the relationship between the dependent variable y and one or more independent variables. The parameters are estimated from the data, and the focus is on the conditional probability distribution of y given the independent variables. In this paper, the dependent variable is the expected claim amount and the purpose of the model is to make future claim predictions. The approach used for fitting the model is the least-squares approach: the best fit minimizes the sum of squared residuals, where a residual is the difference between an observed value and the value predicted by the linear regression model. Other fitting approaches, such as alternative loss functions or ridge regression, could have been used; thus, least squares and linear regression are closely linked but not synonymous.
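A minimal sketch of the least-squares fit and residual calculation, assuming the attributes from the Findings section (accYear, numClaims) are held in NumPy arrays; this is an illustration, not the Weka run used in the paper.

import numpy as np

def fit_link_ratio_ols(acc_year, num_claims, link_ratio):
    # Ordinary least squares: LinkRatio ~ accYear + numClaims + intercept.
    # Returns the coefficients and the residuals (observed minus fitted).
    X = np.column_stack([acc_year, num_claims, np.ones_like(acc_year)])
    coeffs, *_ = np.linalg.lstsq(X, link_ratio, rcond=None)
    residuals = link_ratio - X @ coeffs
    return coeffs, residuals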


Findings

DATA: AgePaidLR
Scheme: LinearRegression
Instances: 188
Attributes: 4
· accYear
· devYear
· numClaims
· LinkRatio

Linear Regression Model:

LinkRatio = 0.0166 * accYear + 0 * numClaims - 31.8819
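Applied as-is, the fitted equation reduces to a function of accYear alone, since numClaims carries a zero coefficient; a small sketch follows (the accident year 2005 used in the comment is only an example input).

def predict_link_ratio_paid(acc_year, num_claims=0):
    # Fitted linear regression from the AgePaidLR run above; the
    # numClaims term has a zero coefficient and drops out.
    return 0.0166 * acc_year + 0 * num_claims - 31.8819

# Example: predict_link_ratio_paid(2005) is roughly 1.40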


STATISTICS

· Correlation coefficient 0.6071
· Mean absolute error 0.0678
· Root mean squared error 0.1049
· Relative absolute error 86.6075 %
· Root relative squared error 79.4622 %
· Total Number of Instances 170

Scheme: MultilayerPerceptron Neural Network
Instances: 188
Attributes: 4
· accYear
· devYear
· numClaims
· LinkRatio
Linear Node 0
Inputs Weights
Threshold 1.6706022241965304
Node 1 1.1888781347915334
Node 2 -2.645600688006365

Sigmoid Node 1
Inputs Weights
Threshold -3.7217940290073224
Attrib accYear 1.6944608988382996
Attrib devYear 0.6773386587792524
Attrib numClaims 0.8555831948782261

Sigmoid Node 2
Inputs Weights
Threshold 13.465319633555458
Attrib accYear -13.905862469377947
Attrib devYear 0.4610582822165239
Attrib numClaims 0.41560553773801107

Class
Input
Node 0

STATISTICS

· Correlation coefficient 0.9545
· Mean absolute error 0.0203
· Root mean squared error 0.0446
· Relative absolute error 25.9256 %
· Root relative squared error 33.7803 %
· Total Number of Instances 170
· Ignored Class Unknown Instances 18

Scheme: M5P Tree
Instances: 188
Attributes: 4
· accYear
· devYear
· numClaims
· LinkRatio

M5 pruned model tree:
(using smoothed linear models)

accYear <= 2002.5 :
|   accYear <= 1999.5 : LM1 (93/4.855%)
|   accYear > 1999.5 : LM2 (33/8.658%)
accYear > 2002.5 :
|   accYear <= 2005.5 : LM3 (33/26.421%)
|   accYear > 2005.5 : LM4 (11/110.124%)

LM num: 1
LinkRatio =
0.0021 * accYear
- 0 * numClaims
- 3.0716

LM num: 2
LinkRatio =
0.0064 * accYear
- 0 * numClaims
- 11.6721

LM num: 3
LinkRatio =
0.0729 * accYear
- 0 * numClaims
- 144.8916

LM num: 4
LinkRatio =
0.0669 * accYear
- 0 * numClaims
- 132.7683

Number of Rules : 4
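For clarity, a minimal sketch of how the pruned tree above would be applied: route an observation through the accYear splits and evaluate the linear model at the matching leaf (numClaims has a zero coefficient in every leaf); this only restates the published output and ignores any internal Weka preprocessing.

def predict_link_ratio_m5p(acc_year, num_claims=0):
    # Piecewise linear prediction from the pruned M5P tree above.
    if acc_year <= 2002.5:
        if acc_year <= 1999.5:
            return 0.0021 * acc_year - 3.0716    # LM1
        return 0.0064 * acc_year - 11.6721       # LM2
    if acc_year <= 2005.5:
        return 0.0729 * acc_year - 144.8916      # LM3
    return 0.0669 * acc_year - 132.7683          # LM4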

STATISTICS

· Correlation coefficient 0.9399
· Mean absolute error 0.0211
· Root mean squared error 0.0501
· Relative absolute error 26.9433 %
· Root relative squared error 37.9528 %
· Total Number of Instances 170
· Ignored Class Unknown Instances 18


DATA: AgePaidCaseLR

Scheme: LinearRegression
Instances: 188
Attributes: 4
· accYear
· devYear
· numClaims
· LinkRatio



Linear Regression Model

LinkRatio = 0.0043 * accYear - 7.6494

STATISTICS

· Correlation coefficient 0.6044
· Mean absolute error 0.0186
· Root mean squared error 0.0273
· Relative absolute error 83.3631 %
· Root relative squared error 79.6664 %
· Total Number of Instances 170
· Ignored Class Unknown Instances 18

Scheme: MultilayerPerceptron Neural Network
Instances: 188
Attributes: 4
· accYear
· devYear
· numClaims
· LinkRatio

Linear Node 0
Inputs Weights
Threshold 0.41446534397100615
Node 1 -0.8086509022397707
Node 2 -1.2996723162806572

Sigmoid Node 1
Inputs Weights
Threshold -2.747034927505303
Attrib accYear -0.06784776781376296
Attrib devYear -0.6265880851849212
Attrib numClaims -1.39432161574802

Sigmoid Node 2
Inputs Weights
Threshold 6.305947676391236
Attrib accYear -7.490361498177734
Attrib devYear -0.13654758765016786
Attrib numClaims 0.5267805300501266
Class
Input
Node 0


STATISTICS

· Correlation coefficient 0.8096
· Mean absolute error 0.0138
· Root mean squared error 0.0215
· Relative absolute error 61.9053 %
· Root relative squared error 62.7793 %
· Total Number of Instances 170
· Ignored Class Unknown Instances 18

Scheme: M5P Tree
Instances: 188
Attributes: 4
· accYear
· devYear
· numClaims
· LinkRatio


M5 pruned model tree:
(using smoothed linear models)

accYear <= 2002.5 : LM1 (126/25.297%)
accYear > 2002.5 : LM2 (44/108.504%)

LM num: 1
LinkRatio =
0.0012 * accYear
+ 0 * numClaims
- 1.3895

LM num: 2
LinkRatio = 0.0194 * accYear - 37.79

Number of Rules : 2

STATISTICS

· Correlation coefficient 0.8018
· Mean absolute error 0.012
· Root mean squared error 0.0207
· Relative absolute error 53.588 %
· Root relative squared error 60.308 %
· Total Number of Instances 170
· Ignored Class Unknown Instances 18


Conclusions

The missing loss amounts are computed using the age-to-age LDFs and CDFs. The predicted values of the link ratios from the models are shown in Figures 1 and 2 and, for easier viewing, are included in the attached spreadsheet along with the final results.
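For reference, a minimal sketch of the roll-forward from age-to-age LDFs to ultimates, assuming a cumulative triangle with NaN for unknown cells and one LDF per development age; the array layout and the tail-factor handling are assumptions made for the illustration.

import numpy as np

def ultimates_from_ldfs(triangle, ldfs, tail_factor=1.0):
    # triangle[i, j]: cumulative loss for accident year i at age j
    # (NaN for future cells); ldfs[j]: age-to-age factor from age j to j+1.
    # The CDF for an accident year is the product of the remaining LDFs
    # times the tail factor; the ultimate is the latest observed value
    # times that CDF.
    n_years = triangle.shape[0]
    ultimates = np.empty(n_years)
    for i in range(n_years):
        last_age = np.where(~np.isnan(triangle[i]))[0].max()
        cdf = np.prod(ldfs[last_age:]) * tail_factor
        ultimates[i] = triangle[i, last_age] * cdf
    return ultimates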

The resulting triangles can be made available upon request.

Thursday, March 25, 2010

Why a blog about optimizing business decisions

There is no doubt that data is being accumulated at an accelerated pace with no end in sight.
How else could the market be willing to pay a P/E of 35 for shares in EMC Corporation?
There are companies that have positioned themselves to exploit this market niche, and some of them, for example the SAS Institute and IBM (SPSS and Cognos), have done very well judging by their market penetration: 33 and 14 percent, respectively.
But is it possible that this segment is going to grow exponentially?
My answer is "yes". The reason is simple. The Internet has leveled the playing field, and now any company anywhere can compete. This intense competition is bound to put more pressure on CEOs to produce better decisions with very little room for mistakes. Therefore, CEOs will be looking for decisions that are well grounded and validated by experts, where some of those experts are no longer human. No, we are not talking about ET experts but about "machine learning algorithms" that are capable of analyzing mountains of data to find patterns and outliers not easily detectable with more conventional approaches such as those offered by statistics. This new breed of experts will require a new kind of expertise in organizations that is most likely not there today. This blog is designed for such a purpose: Q&A to help optimize decisions.

The proposed approach for answering questions is to engage the right tools for this type of problem.
These new challenges often require the application of heuristic modeling techniques with their roots in Operations Research and Artificial Intelligence. The challenge lies not in finding the tools, nor the computational capacity to tackle such problems, but rather in finding the right place to ask questions. I hope this blog is that place.