Mohegan SkunkWorks

Sun, 01 Nov 2009 10:30:37 EST

Toy Problem: Simple One Dimensional Least Squares Learner.

In chapter two of Hastie, Tibshirani and Friedman's 'The Elements of Statistical Learning' the authors discuss the use of least-squares regression to construct a data classifier for linearly separable data.

A set of training data together with the least-squares method is used to construct a hyper-plane in the data space. The classification of a data point depends on which side of the hyper-plane it falls on.

The example in Hastie uses two data classes in a two-dimensional parameter space. I didn't grok the example immediately, and I thought it would be helpful to construct my own much simpler example by staying in one dimension and using a simple normal distribution. The rest of this post describes the details.

Data Classes and Least Squares

My data points are generated by two normal distributions with means on the interval [0,1].

Class 1 is labeled -1 and class 2 is labeled 1. In my example, class 1 will always have the smaller mean, and hence will lie to the left of class 2. As in Hastie, linear regression is used to find a linear classifier for a set of training data drawn from these two classes.
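To make this concrete, here is a minimal sketch (my own, not the linked script) of how such training data might be generated in R. The means 0.35 and 0.65 and standard deviation 0.15 match the example used further down; the variable names are mine.

  # Sketch: n training points per class.
  # Class 1 ~ N(0.35, 0.15) labeled -1, class 2 ~ N(0.65, 0.15) labeled +1.
  n  <- 20
  x1 <- rnorm(n, mean = 0.35, sd = 0.15)
  x2 <- rnorm(n, mean = 0.65, sd = 0.15)
  x  <- c(x1, x2)                  # inputs, (mostly) on [0,1]
  y  <- c(rep(-1, n), rep(1, n))   # responses: -1 for class 1, +1 for class 2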

The data space is one-dimensional and the classification boundary will be a point on the interval [0,1], somewhere between the means of the two data classes. For a one-dimensional data space, linear regression models the classification response function as a line: y = Mx + Q

The least-squares method (LSM) finds the slope M and intercept Q such that the responses generated by the model have the smallest 'residual sum of squared errors' (RSS). The RSS is the sum of the squares of the differences between the observed responses and the responses predicted by the model :

RSS = SUM ( (y_obs - y_calc) ^2 )

x represents the input and is generated by the normal distribution associated with one of the two data classes. y represents the response and takes only the values -1 or 1, depending on whether x was generated by the first or second class distribution respectively.

The observed classifications, y_obs, can have only one of two values : -1 or 1. The values generated by the LSM, y_calc, range from Q (for x = 0) to M + Q (for x = 1).

The LSM separates the data space into two pieces, one for each class. In this case the data space is the line segment [0,1], and the 'hyper-plane' separating that space is the point where the linear regression line crosses the data space, yielding a classification boundary of :

 S = -Q / M 

The LSM estimate of the classifier is then :

x < S -> class 1  
x > S -> class 2  
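Putting this together in R: a minimal sketch of the fit and the resulting classifier, assuming the x and y vectors from the sketch above (the variable and function names are mine, not the script's):

  fit <- lm(y ~ x)                 # least-squares fit of the labels on the inputs
  Q   <- coef(fit)[1]              # intercept
  M   <- coef(fit)[2]              # slope
  rss <- sum(residuals(fit)^2)     # the RSS that the fit minimizes
  S   <- -Q / M                    # boundary: where M*x + Q crosses zero
  classify <- function(z) ifelse(z < S, -1, 1)   # class 1 below S, class 2 above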

A simple R script

I've put together a quick R script implementing these ideas.

This [R script](http://github.com/fons/blog-code/blob/master/1d-classification-toy/sl-1d-regression.R) has two main functions : train and sim. The script is loaded at the R command prompt like so:

source('sl-1d-regression.R') 

The train function is used to generate a graph showing two data classes as well as the line generated by the LSM separating both classes. The example discussed below was generated using train as follows :

> train(20, 0.35, 0.15, 0.65, 0.15)  
  coefficients :  -1.916889  3.973720  
  classifier :  0.4824  
  prob of misclassification of class 1 :  0.16  
  prob of misclassification of class 2 :  0.16  
 

The sim function returns a vector of Monte Carlo simulations of the separation boundary. For example, here are 10 values of the boundary generated by sim :

> sim(10, 0.35, 0.15, 0.65, 0.15)  
  (Intercept) (Intercept) (Intercept) (Intercept) (Intercept) (Intercept)  
   0.4495022   0.5185030   0.4978533   0.5887321   0.4547277   0.5029925  
  (Intercept) (Intercept) (Intercept) (Intercept)  
   0.4792027   0.5665940   0.4952061   0.5337139 
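In essence, a function like sim repeats the least-squares fit on freshly generated training data and collects the boundary from each run. Here is a rough sketch of that idea (my own reconstruction with assumed argument names, not necessarily how sl-1d-regression.R implements it):

  # Hypothetical sketch: repeat the fit 'runs' times and return the boundaries.
  sim_sketch <- function(runs, m1, s1, m2, s2, n = 20) {
    replicate(runs, {
      x   <- c(rnorm(n, m1, s1), rnorm(n, m2, s2))
      y   <- c(rep(-1, n), rep(1, n))
      fit <- lm(y ~ x)
      -coef(fit)[1] / coef(fit)[2]          # boundary -Q/M for this run
    })
  }
  sim_sketch(10, 0.35, 0.15, 0.65, 0.15)    # ten boundary estimates

A named coefficient carried through the arithmetic like this would also explain why each value in the output above is labeled (Intercept).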

The following plots the density function for 400 values of the classification boundary, for the same class parameters as the train example above.

plot(density(sim(400, 0.35, 0.15, 0.65, 0.15)), xlim=c(0,1), xlab="", ylab="")  

Obviously, since these are Monte Carlo simulations, subsequent runs will differ, because the random values generated by rnorm are different each time the function is run.
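If you want a repeatable run, one option is the standard R idiom of fixing the random seed before calling sim (the seed value here is arbitrary):

  set.seed(42)                       # any fixed seed makes the run reproducible
  sim(10, 0.35, 0.15, 0.65, 0.15)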

Classification Results

This graph shows two classes and the resulting least-squares model. Both distributions have the same standard deviation of 0.15. One class, marked with red inverted triangles, has a mean of 0.35. The other class, marked with blue squares, has a mean of 0.65. The size of each training class is 20.

Each class shows up twice: once as points on the x-axis, which is the data space for this problem, and a second time as the data points (x, y) used in the linear regression. Obviously the red triangles all have y = -1, and the blue squares all have y = 1.

As you can see, there is some overlap between the two data sets, because of the variance of the normal distribution.

The green line represents the best fit of the data points according to the least-squares model (LSM). The classification boundary is the point where the green line crosses the x-axis. As you can see that's somewhere around 0.5.

You would expect 0.5 to be a good estimate of the classification boundary S because the normal distribution is symmetric around the mean and both classes are equi-distant from 0.5.

The R-script generates the probability of misclassification defined as the probability of ending up on the wrong side of the estimated classification boundary :

 S_est = 0.5 * (mean_1 + mean_2)  
 
 P_mis_classified = Prob( x > S_est )  for class 1  
                  = Prob( x < S_est )  for class 2  

This is obviously determined in large part by the variance (for given mean), and for this example the probability is around 16 % for both classes.
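Since both classes are normal, these probabilities follow directly from pnorm. For the example parameters (means 0.35 and 0.65, standard deviation 0.15, so S_est = 0.5):

  S_est <- 0.5 * (0.35 + 0.65)
  pnorm(S_est, mean = 0.35, sd = 0.15, lower.tail = FALSE)  # P( x > S_est ) for class 1, about 0.16
  pnorm(S_est, mean = 0.65, sd = 0.15)                      # P( x < S_est ) for class 2, about 0.16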

The sim function can be used to generate a whole set of estimates of the classification boundary. This amounts to sampling the distribution of the estimator, and by the law of large numbers the mean of this sample should be very close to the true boundary. Here are a few typical runs :

 > mean(sim(400, 0.35, 0.15, 0.65, 0.15))  
  [1] 0.4977804  
> mean(sim(400, 0.35, 0.15, 0.65, 0.15))  
  [1] 0.4990484  
> mean(sim(400, 0.35, 0.15, 0.65, 0.15))  
  [1] 0.5024554 

The R function mean computes the average of the vector of simulation results.

This graph shows the density of the sampled classification-boundary estimates (the sampling distribution whose mean we just computed). It shows two distributions: one for a standard deviation of 0.15, in blue, and one for a standard deviation of 0.35, in red.

Notice how wide the distribution for sd = 0.35 is around 0.5. It turns out that about 33 % of your data will be misclassified, because there is significant overlap between the two classes.
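A plot like that can be produced by overlaying two density estimates; for example, a sketch using the script's sim function (the sample size and colors here are my choice):

  b15 <- sim(400, 0.35, 0.15, 0.65, 0.15)   # boundary estimates for sd = 0.15
  b35 <- sim(400, 0.35, 0.35, 0.65, 0.35)   # boundary estimates for sd = 0.35
  plot(density(b15), col = "blue", xlim = c(0, 1), xlab = "", ylab = "", main = "")
  lines(density(b35), col = "red")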

This graph is generated using a standard deviation of 0.35 for both classes and with 50 data points in each class. As you can see there's significant overlap between the two classes. In fact, it would be hard to just visually pick out a good classification boundary.