2019-01-18

We are given some data points, which we call our training set. Then we get a new data point for which we want to predict the corresponding value in a way that follows the same patterns as our training set. Our task is therefore to find a model which can capture/learn these underlying patterns (aka relationships) in the training set and apply them to new points to make accurate predictions.

Some model types can capture/learn only a specific type of pattern (e.g. Linear Regression), while others are more powerful and can learn any type of pattern (e.g. Neural Networks). Unfortunately, the more powerful models will learn not only the useful patterns but also the noise in our data. Usually it's best to pick the simplest model type that is still able to learn the most important patterns in your data, because the simpler the model, the smaller the chance that it learns noise.

Linear Regression is a simple model type which assumes that the relationship between $x$ and $y$ is linear. Because of the added bias term the relationship is, strictly speaking, affine and not linear. But you can apply a neat trick to make it linear again.

**For scalar valued $x$:**

$$y = wx + b$$

**For vector valued $x$:**

$$y = w^Tx + b$$

**Trick: affine => linear**

By adding a new entry $1$ to every data point we can absorb the bias term ($b$) and make the relationship between $x$ and $y$ linear: with $\tilde{x} = [1, x_1, \dots, x_D]^T$ and $\tilde{w} = [b, w_1, \dots, w_D]^T$ we get $y = \tilde{w}^T\tilde{x}$.
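A minimal numpy sketch of this trick, with made-up numbers, might look as follows:

```python
import numpy as np

# Hypothetical 2-D data points (rows), a weight vector and a bias term b.
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])        # shape (N, D)
w = np.array([0.5, -1.0])         # shape (D,)
b = 2.0

y_affine = X @ w + b              # affine model: y = Xw + b

# Absorb the bias: prepend a constant 1 to every data point
# and prepend b to the weight vector.
X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])   # shape (N, D+1)
w_tilde = np.concatenate([[b], w])                   # shape (D+1,)

y_linear = X_tilde @ w_tilde      # purely linear model, same predictions

assert np.allclose(y_affine, y_linear)
```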

Here are three examples of how the relationship between $x$ and $y$ might look. The relationship can be **exactly linear**, **roughly linear** or **not linear at all**.

We can fit a linear regression model in any of these cases, but the predictions won't make much sense when the data doesn't follow a linear relationship.

- You define a function which tells you how good your current estimate of the best parameters $w$ is. This function is usually called the **error function** or **loss function**.
- You minimize the error function by changing your parameter values $w$. When people say they fit a model, they refer to this step of minimizing an error function. For some model types and error functions there exists a closed form solution for $w$ which minimizes the error function; for others you need to apply other (often iterative) minimization methods, e.g. for logistic regression.

The vectors and matrices have the following dimensions:

- $x_i$: $D \times 1$
- $w$: $D \times 1$
- $y_i$: $1 \times 1$
- $y$: $N \times 1$
- $X$: $N \times D$

In order to minimize the error function we take the gradient with respect to $w$, set it to zero and solve for $w$.

$$\nabla_w \,(Xw-y)^T(Xw-y) = 2X^T(Xw-y) \overset{!}{=} 0 \quad\Rightarrow\quad w = (X^TX)^{-1}X^Ty$$
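As a sanity check, the closed form least squares solution $w = (X^TX)^{-1}X^Ty$ can be sketched in numpy on toy data (the data and noise level are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = X w_true + noise, with N=100 points in D=2 dimensions.
N, D = 100, 2
X = rng.normal(size=(N, D))
w_true = np.array([2.0, -3.0])
y = X @ w_true + 0.1 * rng.normal(size=N)

# Closed-form least squares: solve the normal equations X^T X w = X^T y.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

assert np.allclose(w_hat, w_true, atol=0.1)
```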

There exists no closed form solution for the least absolute error because of the absolute value in the equation. That's why an iterative approach like Gradient Descent is required, which won't be introduced in detail at this point.
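A minimal subgradient descent sketch for the least absolute error might look like this (step size and iteration count are arbitrary choices, not tuned values):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 200, 2
X = rng.normal(size=(N, D))
w_true = np.array([1.0, -2.0])
y = X @ w_true + 0.05 * rng.normal(size=N)

# Minimize E(w) = sum_i |x_i^T w - y_i| by subgradient descent.
# The subgradient of |r| is sign(r), so dE/dw = X^T sign(Xw - y).
w = np.zeros(D)
lr = 0.001
for _ in range(2000):
    residual = X @ w - y
    w -= lr * X.T @ np.sign(residual)

assert np.allclose(w, w_true, atol=0.15)
```

With a fixed step size the iterates only oscillate around the minimizer instead of converging exactly, which is good enough for this illustration.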

For the first plot (above) the least square error was used to fit the model, for the plot below the least absolute error was used. The results are pretty much the same.

In case the relationship between $x$ and $y$ is not linear, you have two options. You can either switch to a completely new type of model which is able to capture nonlinear relationships, or you can transform your data into a new space where the relationship between your transformed data and $y$ is linear again. The function which transforms your data is generally called a **basis function**.

**Example 1:**

Let's assume we have a quadratic relationship between $x$ and $y$ of the form $y = wx^2 + b$. If we transform our data to a new space (again just a 1D space) with the new axis $\tilde{x} = x^2$, then the relationship between $\tilde{x}$ and $y$ will be linear again, as you can see in the plot below.

**Example 2:**

Usually you don't know to which space you should transform your data in order to make the relationship linear again. To overcome this issue you can transform your data to a very high dimensional space and let your model decide which transformations to use.

For example, assume the true relationship between $x$ and $y$ is some polynomial, but you don't know its form in advance. Now you transform your data to a high dimensional space with the new axes $x$, $x^2$, $x^3$, $x^4$, $x^5$.

This works well in theory, as your model can always decide not to use a transformation by setting the corresponding model parameter to $0$.

**Example 3:**

Using this technique you can fit your model to data which only approximately follows a polynomial relationship, for example data which was generated from a polynomial with some added noise. If we use the same transformation technique up to an exponent of 5, we get the following model when we fit it with a least square error.
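A sketch of this basis function technique in numpy; the cubic ground truth and noise level here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: a cubic relationship plus noise (hypothetical coefficients).
x = rng.uniform(-2, 2, size=100)
y = 1.5 * x - 0.5 * x**3 + 0.1 * rng.normal(size=100)

# Basis functions: map scalar x to [1, x, x^2, ..., x^5].
degree = 5
Phi = np.vander(x, degree + 1, increasing=True)   # shape (N, 6)

# Ordinary least squares in the transformed space.
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# The fit should track the data closely (residuals near the noise level).
rms = np.sqrt(np.mean((Phi @ w - y) ** 2))
assert rms < 0.2
```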

You can see in the plot below that our model captures the data pretty well and it allows us to make good predictions.

**Example 4:**

So far it seems best to transform the data to as high dimensional a space as possible and let your model figure out which transformations to choose. But there is a drawback. Assume that your data was generated from a quadratic relationship with some added noise. As before, let's transform our data up to an exponent of 40 and see what happens.

In red you see the fitted model which uses the transformed high dimensional data, and in green our desired model. This phenomenon is called overfitting. Our model is too complex and uses the extra power (high dimensional features, e.g. $x^{40}$) to follow not only the major underlying quadratic pattern but also the noise.

In conclusion, our model is **not** able to figure out by itself which transformations to use. When we provide higher dimensional features, our model will **not** ignore the ones which are unnecessary for describing the major underlying pattern, but rather use them to also capture noise in our data. We somehow need to find the sweet spot of complexity where our model is powerful enough to capture the important patterns but not powerful enough to also capture the noise. There is one technique called **Regularization** which can help us with that.

Regularization describes methods which prevent our model from overfitting by reducing the complexity of the model. This is quite a general definition, but it provides you with the right intuition for future, more complex models where regularization can occur in many different ways, not just by modifying the error function, which is what we will do in a second for linear regression.

$$E(w) = (Xw-y)^T(Xw-y) + \lambda\lVert w\rVert_2^2$$

L2 regularization adds a new term to the error function which prevents the parameters $w$ from becoming too large. It uses the squared euclidean distance/squared L2 norm to do so. Let's again minimize the error function with respect to $w$:

$$\nabla_w \left[(Xw-y)^T(Xw-y) + \lambda\lVert w\rVert_2^2\right] = 2X^T(Xw-y) + 2\lambda w \overset{!}{=} 0 \quad\Rightarrow\quad w = (X^TX + \lambda I)^{-1}X^Ty$$
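The resulting closed form for the L2 regularized (ridge) solution, $w = (X^TX + \lambda I)^{-1}X^Ty$, can be sketched like this on made-up data:

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 50, 3
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, 0.0, -1.0]) + 0.1 * rng.normal(size=N)

lam = 10.0  # regularization strength, an arbitrary choice here

# Ridge regression: w = (X^T X + lam*I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# The penalty shrinks the parameter vector towards zero.
assert np.linalg.norm(w_ridge) < np.linalg.norm(w_ols)
```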

$$E(w) = (Xw-y)^T(Xw-y) + \lambda\lVert w\rVert_1$$

L1 regularization also adds a new term to the error function, which does pretty much the same thing as the L2 regularization but uses the L1 norm (not squared) instead.

Let's look at an example similar to the previous one. Our data was again generated from a polynomial with some added noise; I played around with the coefficients to make the plots look nice. Here are the results.

Firstly, you can see that a model (red line) with regularization becomes smoother compared to a model without regularization. Secondly, you can see that L1 regularization leads to an even "smoother" model than L2 regularization. Why is this so? The squared L2 norm doesn't penalize all parameter values in the same way. A large parameter incurs a huge penalty because of the quadratic term, but a parameter that is already close to zero incurs almost no penalty, again because of the quadratic term, so there is little incentive to push it all the way down to zero. This is different for the L1 norm, which uses the absolute value of the parameter. That's why L1 regularization can push parameters $w$ down to exactly zero, whereas L2 regularization will make them small but not zero.
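To make this difference concrete, here is a one-parameter toy calculation. The two closed forms below are standard results for a single weight fitted to a target value $a$, not something derived above:

```python
import numpy as np

# One-parameter illustration: fit w to a target a under each penalty.
# L2: minimize (w - a)^2 + lam * w^2   ->  w = a / (1 + lam)   (shrinks, never 0)
# L1: minimize (w - a)^2 + lam * |w|   ->  soft-thresholding   (exactly 0 if |a| <= lam/2)
def l2_solution(a, lam):
    return a / (1 + lam)

def l1_solution(a, lam):
    return np.sign(a) * max(abs(a) - lam / 2, 0)

lam = 1.0
small, large = 0.3, 5.0

assert l2_solution(small, lam) != 0   # L2 only shrinks the small weight
assert l1_solution(small, lam) == 0   # L1 snaps the small weight to exactly zero
assert l1_solution(large, lam) != 0   # large weights survive, shifted by lam/2
```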

The following picture provides another intuitive way to grasp what regularization for linear regression does. Essentially regularization offers a way to shrink the complexity of our model without changing its structure.

Now we come to the fun part where we will explore the connection between a regularized non probabilistic model and a probabilistic model with certain assumptions.

Our probabilistic model:

$$y = w^Tx + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$$

As the noise $\epsilon$ is the only probabilistic term, it follows directly:

$$p(y \mid x, w, \sigma^2) = \mathcal{N}(y \mid w^Tx, \sigma^2)$$

It's a bit confusing at first to read these formulas. In our case we interpret $p(y \mid x, w, \sigma^2)$ not as a function of $y$ but as a function over $w$ and $\sigma^2$. This interpretation of the formula is generally called a likelihood function. The integral over $w$ and $\sigma^2$ doesn't sum up to 1; that's why it is not a probability distribution. If we instead assume $w$ and $\sigma^2$ are given/fixed and look at it as a function of $y$, then it is a proper probability distribution and the integral over $y$ does sum up to 1.

We are now interested in finding the unknown parameters $w$ and $\sigma^2$. In order to do so, we have three options.

We assume that our data is independent and identically distributed (aka i.i.d.), meaning every data point was generated using the same distribution and all data points are independent of each other. With this assumption we can compute the likelihood function

$$p(y \mid X, w, \sigma^2) = \prod_{i=1}^{N} p(y_i \mid x_i, w, \sigma^2)$$

and then look for the values of $w$ and $\sigma^2$ which give us the most likely description of our observed data points.

Our task:

$$w_{ML},\, \sigma^2_{ML} = \underset{w,\,\sigma^2}{\arg\max}\; p(y \mid X, w, \sigma^2)$$

You might think that we will run into issues when we try to optimize over two variables simultaneously. Generally that can be an issue, but not in this case, as the solution for $w$ doesn't depend on $\sigma^2$.

We made use of a simple trick: we took the log of the whole expression, knowing that it doesn't change the position of the maximum. In order to find the maximum, we take the gradient with respect to $w$, set it to zero and solve for $w$.

You can see that $w_{ML}$ is independent of $\sigma^2$, and also that the result is the same as for the non probabilistic least square case with no regularization:

$$w_{ML} = (X^TX)^{-1}X^Ty$$

Let's go on and derive the solution for $\sigma^2$. Same procedure: we take the gradient with respect to $\sigma^2$, set it to zero and solve for $\sigma^2$:

$$\sigma^2_{ML} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - w_{ML}^Tx_i\right)^2$$

This is a satisfying result because it is just the sample average of the squared deviations between what our model predicts and what the training data actually are. It is exactly analogous to the maximum likelihood estimate of the variance of a univariate Gaussian distribution.
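A quick numerical sketch of both maximum likelihood estimates on synthetic data (the true parameters here are made up):

```python
import numpy as np

rng = np.random.default_rng(4)
N, D = 10000, 2
X = rng.normal(size=(N, D))
sigma_true = 0.5
y = X @ np.array([1.0, 2.0]) + sigma_true * rng.normal(size=N)

# Maximum likelihood estimates under Gaussian noise:
w_ml = np.linalg.solve(X.T @ X, X.T @ y)    # same as least squares
sigma2_ml = np.mean((y - X @ w_ml) ** 2)    # average squared residual

# With N=10000 points the estimate is close to the true noise variance.
assert abs(sigma2_ml - sigma_true**2) < 0.02
```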

Let's quickly recap. The solution for $w$ with the probabilistic approach (+ our assumptions) and the Maximum Likelihood Estimate leads to exactly the same solution as for the non probabilistic model. This means that by using the least squared error function we implicitly assume that any noise we have has a mean of zero and is symmetric around the mean. We can't conclude that the noise strictly follows a normal distribution, as we would get the same result if the noise were e.g. uniformly distributed with a mean of 0. But at least we got some better intuition about what's going on behind the scenes. Let's go a step further and see if we can also find a corresponding probabilistic assumption for the regularization term.

NOT QUITE SURE YET

Bayes formula:

$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}$$

Translated to our model:

$$p(w, \sigma^2 \mid y, X) = \frac{p(y \mid X, w, \sigma^2)\, p(w, \sigma^2)}{p(y \mid X)}$$

The evidence term $p(y \mid X)$ doesn't depend on $w$ or $\sigma^2$ and thus doesn't affect the maximum of the posterior distribution $p(w, \sigma^2 \mid y, X)$. So in order to find the maximum of the posterior distribution we only need the likelihood function $p(y \mid X, w, \sigma^2)$ and the prior distribution $p(w, \sigma^2)$. We already know how to get the likelihood function from the previous step, but how do we get the prior?

Eventually we need to make some hard assumptions about $w$ and $\sigma^2$. Let's discuss our options. We could assume that both $w$ and $\sigma^2$ are independent random variables which follow certain distributions that we can freely choose. But then not much progress is made: we now need to come up with good values for the parameters of those distributions. We could again assume that these parameters follow certain distributions to take uncertainty into account, but you see that we are trapped in a loop. At some point we have to make a hard assumption and set a parameter to a fixed value. It's good practice to pick a reasonable fixed value when you can't make good assumptions about the distribution of a parameter anymore. In our case we will assume $\sigma^2$ is not a random variable but a fixed scalar, and that $w$ is Gaussian distributed with variance $\tau^2$, thus $w \sim \mathcal{N}(0, \tau^2 I)$. Our prior simplifies to $p(w)$.

Now we can calculate the maximum a posteriori estimate for our only remaining random variable $w$.

Now we set the gradient with respect to $w$ to zero and solve for $w$:

$$w_{MAP} = \left(X^TX + \frac{\sigma^2}{\tau^2} I\right)^{-1} X^Ty$$

Here we get another insight. If you go back to the solution for the optimal $w$ when we used a non probabilistic model with L2 regularization, you see that the result is exactly the same as for the probabilistic model if we assume $\lambda = \frac{\sigma^2}{\tau^2}$. Apparently, using a **Gaussian prior** corresponds to an added L2 regularization term in our error function. If you play around a bit with different prior distributions you can find out that a **Laplacian prior** corresponds to L1 regularization.
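One way to convince yourself of this correspondence is to maximize the log posterior numerically and compare against the ridge closed form with $\lambda = \sigma^2/\tau^2$. All the concrete values below are made up:

```python
import numpy as np

rng = np.random.default_rng(5)
N, D = 100, 2
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, -1.0]) + 0.3 * rng.normal(size=N)

sigma2 = 0.3**2    # assumed known noise variance
tau2 = 0.5         # hypothetical prior variance on w

# Ridge solution with lambda = sigma^2 / tau^2 ...
lam = sigma2 / tau2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# ... versus maximizing the log posterior by gradient ascent:
# log p(w | y, X)  is, up to a constant,
#   -1/(2*sigma2) * ||Xw - y||^2  -  1/(2*tau2) * ||w||^2
w = np.zeros(D)
for _ in range(5000):
    grad = -(X.T @ (X @ w - y)) / sigma2 - w / tau2
    w += 1e-4 * grad

# Both routes land on the same parameter vector.
assert np.allclose(w, w_ridge, atol=1e-5)
```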

When I first learnt about these correspondences I was pretty amazed!

So far we only did so called **point estimates** for $w$. This can work out nicely, but we are losing information about the uncertainty of our estimate for $w$. If we want to keep the information about the uncertainty, we need to compute or estimate the full posterior distribution. I won't go into more detail here; we will discuss this further in the next post, "Probabilistic Inference - behind the scenes".