# Prediction begins with guesswork

Machine learning is the process of applying mathematical methods to find patterns in data. Since mathematics is a way of describing the real world, let's return to the real world and build some intuition by analogy.

Imagine a foam whiteboard in front of us, with several blue pushpins arranged on it. There seems to be some regularity to their placement, and we want to find the pattern.

The pushpins on the whiteboard are our **data**. As shown in the figure above, is there a method (a **mathematical algorithm**) to find the pattern (a **model** that explains it)? Since we don't know how, let's guess!

I picked up two sticks and held them against the whiteboard, trying to express the pattern in the data with a stick. I placed them casually, as shown below:

Both seem to capture the pattern of the blue pushpins to some extent, so the question is: which one represents it better, the green (dotted) line or the red (solid) line?

# Loss function (cost function)

Good and bad are subjective judgments, and subjective feelings are unreliable; we need an objective way to measure them. It is natural to say that the representation with the smallest error is the best. So we introduce a way to quantify the error: the least squares method.

**Least squares method**: an error-statistics method that minimizes the sum of squared errors ("squares" here means the errors are squared).

$$ SE = \sum (y_{pred} - y_{true})^2 $$

The idea of least squares is that we use `predicted - actual` to represent the error at a single point, then add the `squares` of these errors together to represent the overall error. (The advantage of squaring is that it handles negative values; using the sum of absolute values would not be impossible either.) We use this final value to represent the loss (cost), and a function that computes the loss is called the loss function (cost function).
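As a quick sketch of the idea (the three points and the two candidate lines here are hypothetical, not the article's pushpins):

```python
import numpy as np

# Hypothetical data points standing in for the pushpins
x = np.array([0.5, 1.0, 1.5])
y_true = np.array([3.8, 4.8, 6.9])

def sse(w, b):
    """Sum of squared errors of the line y = w*x + b: the loss."""
    y_pred = w * x + b
    return float(np.sum((y_pred - y_true) ** 2))

print(sse(3, 2))  # loss of the line y = 3x + 2
print(sse(4, 4))  # loss of the line y = 4x + 4 (larger, so a worse fit on these points)
```

Whichever candidate yields the smaller sum of squared errors is, by this measure, the better model.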

As shown in the figure above, the distance from each blue dot to the line is the error we plug into the formula. Although the two lines look similar, the calculated loss of the red solid line (`y=3x+2`) is 27.03, while that of the green dotted line (`y=4x+4`) is 29.54, so the red model is clearly better than the green one.

So, is there a model that represents the data even better than the red solid line? And is there a way to find it?

# Gradient descent

Since we can use 3 and 4 as the coefficient of x, we can certainly try other numbers. We express this relationship with the following formula:

$$ y = wx + b $$

where x and y are known. We constantly adjust `w` (the **weight**) and `b` (the **bias**), plug them into the loss function, and look for its minimum; this process is **gradient descent**.

We sweep the weight `w` from -50 to 50, pick the bias `b` at random, and plug each pair into the loss function to compute the error between prediction and actual values, obtaining the following curve:

**Note that this figure is a curve of loss plotted against weight; our model itself is still a straight line.**

In the figure above we can find the minimum, at a weight of about 5. This is the position where the loss is smallest, and where our model best represents the pattern in the data.
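The sweep described above can be sketched as a brute-force search over `w` (a minimal sketch; the generated data and the fixed bias are illustrative assumptions, so the minimizing weight differs from the figure's):

```python
import numpy as np

# Illustrative data: points scattered around y = 3x + 4
np.random.seed(42)
X = 2 * np.random.rand(10)
y = 4 + 3 * X + np.random.randn(10)

b = 4.0  # fix the bias for this sketch; sweep only the weight
ws = np.arange(-50, 50, 0.1)
losses = [np.sum((w * X + b - y) ** 2) for w in ws]

# The weight where the loss curve bottoms out
best_w = ws[int(np.argmin(losses))]
print("weight with the smallest loss:", round(best_w, 1))
```

Plotting `losses` against `ws` reproduces the bowl-shaped curve: the bottom of the bowl is the best weight.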

For our purposes, the gradient can be thought of as a derivative, and the process of gradient descent is a process of repeated differentiation.

# Learning rate (step)

The process of repeatedly adjusting the weight and bias to find the minimum of the loss function is the process of using gradient descent to fit the data and find the best model. Now that we have a solution, shouldn't we consider how to improve efficiency and reach the lowest point quickly?

Imagine being lost in thick fog on a mountain: all you can feel is the slope of the ground under your feet, and one strategy for reaching the foot of the mountain quickly is to walk downhill in the steepest direction. An important parameter in gradient descent is the size of each step, the **step size** (**learning rate**). If the step size is too small, the algorithm needs many iterations to converge; if it is too large, you may step right across the valley, causing the algorithm to diverge, with the loss growing larger and larger.

Step size set too small:

Step size set too large:

Step size set appropriately:

The step size cannot be learned by the algorithm itself; it must be specified from outside.

Parameters like this, which the algorithm cannot learn and which must be set manually, are called **hyperparameters**.
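To make the role of the step size concrete, here is a minimal gradient-descent sketch (the generated data, the step values, and the `gradient_descent` helper are illustrative assumptions, not the article's code):

```python
import numpy as np

np.random.seed(42)
X = 2 * np.random.rand(10)
y = 4 + 3 * X + np.random.randn(10)

def gradient_descent(lr, steps):
    """Fit y = w*x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(X)
    for _ in range(steps):
        err = w * X + b - y
        w -= lr * (2 / n) * np.sum(err * X)  # partial derivative of MSE w.r.t. w
        b -= lr * (2 / n) * np.sum(err)      # partial derivative of MSE w.r.t. b
    return w, b

# A reasonable step size converges close to the data's true slope (3) and intercept (4)
w, b = gradient_descent(lr=0.1, steps=1000)
print(round(w, 2), round(b, 2))

# A step size that is too large makes the loss explode instead of shrinking
w_bad, b_bad = gradient_descent(lr=1.0, steps=20)
print(np.sum((w_bad * X + b_bad - y) ** 2))
```

Rerunning with different `lr` values shows the three regimes from the figures: slow crawling, divergence, and quick convergence.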

# Linear regression

In the end we found a **linear** model to explain the relationship between the independent variable x and the dependent variable y; this is **linear regression**. The name comes from the observation that things tend to develop toward some kind of average; this **trend** is called regression, which is why regression is mostly used for prediction.

In the figure above, the red line is the best model we fitted. On this model we can read off the predictions at x = 2.2, 2.6, and 2.8, corresponding to the three red points in the figure.

This is also the basic significance of linear regression.

# Code practice

Prepare data:

```python
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
X = 2 * np.random.rand(10)
y = 4 + 3 * X + np.random.randn(10)
plt.plot(X, y, "bo")
plt.xlabel("$X$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.axis([0, 2, 0, 15])
plt.show()
```

Draw the two lines `y=3x+2` and `y=4x+4`:

```python
plt.plot(X, y, "bo")
plt.plot(X, 3*X+2, "r-", lw=5, label="y=3x+2")
plt.plot(X, 4*X+4, "g:", lw=5, label="y=4x+4")
plt.xlabel("$X$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.axis([0, 2, 0, 15])
plt.legend(loc="upper left")
plt.show()
```

Calculate and compare the losses of the two lines `y=3x+2` and `y=4x+4`:

```python
fig, ax_list = plt.subplots(nrows=1, ncols=2, figsize=(20, 10))

# Left panel: y = 3x + 2
ax_list[0].plot(X, y, "bo")
ax_list[0].plot(X, 3*X+2, "r-", lw=5, label="y=3x+2")
loss = 0
for i in range(10):
    # Draw the vertical error segment and accumulate the squared error
    ax_list[0].plot([X[i], X[i]], [y[i], 3*X[i]+2], color="grey")
    loss = loss + np.square(3*X[i]+2 - y[i])
ax_list[0].axis([0, 2, 0, 15])
ax_list[0].legend(loc="upper left")
ax_list[0].title.set_text("loss=%s" % loss)

# Right panel: y = 4x + 4
ax_list[1].plot(X, y, "bo")
ax_list[1].plot(X, 4*X+4, "g:", lw=5, label="y=4x+4")
loss = 0
for i in range(10):
    ax_list[1].plot([X[i], X[i]], [y[i], 4*X[i]+4], color="grey")
    loss = loss + np.square(4*X[i]+4 - y[i])
ax_list[1].axis([0, 2, 0, 15])
ax_list[1].legend(loc="upper left")
ax_list[1].title.set_text("loss=%s" % loss)

fig.subplots_adjust(wspace=0.1, hspace=0.5)
fig.suptitle("Calculate loss", fontsize=16)
```

Train the model and predict:

```python
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X.reshape(-1, 1), y.reshape(-1, 1))

# Predict at three new points
X_test = [[2.2], [2.6], [2.8]]
y_test = lr.predict(X_test)

# Sample the fitted line over a wider range
X_pred = 3 * np.random.rand(100, 1)
y_pred = lr.predict(X_pred)

plt.scatter(X, y, c="b", label="real")
plt.plot(X_test, y_test, "r", label="predicted point", marker=".", ms=20)
plt.plot(X_pred, y_pred, "r-", label="predicted")
plt.xlabel("$X$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.axis([0, 3, 0, 15])
plt.legend(loc="upper left")

loss = 0
for i in range(10):
    loss = loss + np.square(y[i] - lr.predict([[X[i]]]))
plt.title("loss=%s" % loss)
plt.show()
```

# Other regression

How should we really understand regression? Large-scale statistics show that smaller-than-average beans tend to produce larger offspring, while larger-than-average beans tend to produce smaller offspring: the new individuals trend toward the average bean size, and this trend is regression. The linear regression discussed in this article is a technique applied to **prediction**; in this sense, regression is often contrasted with classification.

Linear regression, logistic regression, polynomial regression, stepwise regression, ridge regression, lasso regression, and elastic net regression are the most commonly used regression techniques. Here I only give a brief overview of them, so that we have the context to explore them in depth when we actually need to. **Trying to exhaust all knowledge will only drag you into fatigue.**

| name | explanation | formula |
|---|---|---|
| Linear regression | Models the relationship between independent and dependent variables with a linear model | $y = wx + b$ |
| Logistic regression | Models the probability of specific categories, used for binary classification | $y = \frac{1}{1+e^{-x}}$ |
| Polynomial regression | Models the relationship between x and y as an nth-degree polynomial in x | $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_m x^m + \varepsilon$ |
| Stepwise regression | Introduces variables into the model one by one to find those with the greatest effect on the model | |
| Lasso regression | Produces sparse weights, eliminating unimportant features; MSE + L1 norm | $J(\theta) = MSE(\theta) + \alpha \sum \lvert \theta \rvert$, where the larger $\alpha$ is, the smaller the weights |
| Ridge regression | Regularized linear regression that constrains the weights to prevent overfitting; MSE + L2 norm | $J(\theta) = MSE(\theta) + \alpha \frac{1}{2} \sum \theta^2$, where the larger $\alpha$ is, the smaller the weights |
| Elastic net | A compromise between ridge regression and lasso regression | $J(\theta) = MSE(\theta) + \gamma\alpha \sum \lvert \theta \rvert + \frac{1-\gamma}{2}\alpha \sum \theta^2$, where $\gamma$ is between 0 and 1: close to 0 leans toward ridge regression, close to 1 toward lasso regression |
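As a quick taste of the regularized variants in the table, here is a sketch using scikit-learn (the generated data and the `alpha`/`l1_ratio` values are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Illustrative data: points scattered around y = 3x + 4
np.random.seed(42)
X = 2 * np.random.rand(50, 1)
y = 4 + 3 * X[:, 0] + np.random.randn(50)

# Same data, three penalties: L2 (ridge), L1 (lasso), and a mix (elastic net)
for model in (Ridge(alpha=1.0), Lasso(alpha=0.1), ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)
    print(type(model).__name__, model.coef_, model.intercept_)
```

Increasing `alpha` shrinks the learned weights toward zero, which is exactly the "larger $\alpha$, smaller weights" behavior noted in the table.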