Polynomial Regression From Scratch in Python

Polynomial regression is an improved version of linear regression. If you know linear regression, it will be simple for you. If not, I will explain the formulas in this article. There are other, more advanced and more efficient machine learning algorithms out there, but it is still a good idea to learn linear-based regression techniques, because they are simple, fast, and built on very well-known formulas. They may not work well on a complex set of data, though.

Polynomial Regression Formula

Linear regression performs well only if there is a linear correlation between the input variables and the output variable. As I mentioned before, polynomial regression is built on linear regression. If you need a refresher on linear regression, here is the link to linear regression:

Polynomial regression can capture the relationship between the input features and the output variable even when that relationship is not linear. It starts from the same formula as linear regression:

Y = BX + C

I am sure we all learned this formula in school. For linear regression, we use symbols like this:

$$Y = \theta_0 + \theta_1 X$$

Here, we get X and Y from the dataset. X is the input feature and Y is the output variable. The theta values are initialized randomly.

For polynomial regression, the formula becomes:

$$Y = \theta_0 + \theta_1 X + \theta_2 X^2 + \theta_3 X^3 + \dots + \theta_n X^n$$

We are adding more terms here. We take the same input feature and raise it to different powers to create more features. That way, our algorithm can fit the data more closely.

The powers do not have to be 2, 3, or 4. They could be 1/2, 1/3, or 1/4 as well. Then the formula will look like this:

$$Y = \theta_0 + \theta_1 X + \theta_2 X^{1/2} + \theta_3 X^{1/3} + \dots + \theta_n X^{1/n}$$
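As a minimal sketch of how extra features are built from a single input, the snippet below raises one column to a list of powers, either integer or fractional. The helper name `polynomial_features` and the sample values are assumptions made for illustration only; they are not part of the tutorial's own code.

```python
import numpy as np

def polynomial_features(x, powers):
    """Return a matrix with a bias column of 1s followed by x raised to each power."""
    x = np.asarray(x, dtype=float)
    columns = [np.ones_like(x)] + [x ** p for p in powers]
    return np.column_stack(columns)

# Hypothetical input values chosen so the fractional powers are easy to read
levels = np.array([1.0, 4.0, 9.0])

# Integer powers: columns are 1, X, X^2, X^3
features_int = polynomial_features(levels, [1, 2, 3])

# Fractional powers: columns are 1, X, X^(1/2)
features_frac = polynomial_features(levels, [1, 0.5])

print(features_int)
print(features_frac)
```

Each row is one training example and each column is one term of the formula, so every theta multiplies exactly one column.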

Cost Function and Gradient Descent

The cost function gives an idea of how far the predicted values (the hypothesis) are from the actual values. The formula is:

$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_i - y_i\right)^2$$

Here $h_i$ is the hypothesis (the predicted value) for the i-th training example, $y_i$ is the actual output value, and $m$ is the number of training examples.

This equation may look complicated, but it does a simple calculation. First, take the difference between the hypothesis and the original output variable. Square it to eliminate negative values. Then sum the squares over all training examples and divide by 2 times the number of training examples.
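To make the calculation concrete, here is a small worked example of the squared-error cost described above. The function name `squared_error_cost` and the three sample predictions are made up for illustration.

```python
import numpy as np

def squared_error_cost(predictions, targets):
    """Sum of squared differences divided by 2m, where m is the number of examples."""
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    m = len(targets)
    return np.sum((predictions - targets) ** 2) / (2 * m)

# Differences are 1, 0 and 2; their squares sum to 5; 5 / (2 * 3) = 0.8333...
example_cost = squared_error_cost([2.0, 4.0, 7.0], [1.0, 4.0, 5.0])
print(example_cost)
```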

What is gradient descent? It fine-tunes our randomly initialized theta values so that the cost goes down. I am not going into the differential calculus here. If you take the partial derivative of the cost function with respect to each theta, you can derive this update formula:

$$\theta_j := \theta_j - \frac{\alpha}{m}\sum_{i=1}^{m}\left(h_i - y_i\right)X_{ij}$$

Here $X_{ij}$ is the value of feature $j$ for the i-th training example, with $X_{i0} = 1$ for the bias term, and alpha is the learning rate. You choose the value of alpha.
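The update rule can be sketched as a single vectorized step in NumPy. This is an illustrative sketch, not the implementation used later in the article: the function name `gradient_step` and the tiny dataset are assumptions.

```python
import numpy as np

def gradient_step(X, y, theta, alpha):
    """Apply one gradient descent update to every theta at the same time."""
    m = len(y)
    errors = X @ theta - y          # h_i - y_i for every training example
    gradient = X.T @ errors / m     # one partial derivative per theta
    return theta - alpha * gradient

# Hypothetical data: the first column is the bias column of 1s
X_demo = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y_demo = np.array([1.0, 3.0, 5.0])

theta_demo = gradient_step(X_demo, y_demo, np.zeros(2), alpha=0.1)
print(theta_demo)
```

Starting from zeros, every error is negative (the predictions are too low), so the step moves both thetas upward.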

Python Implementation of Polynomial Regression

Here is the step-by-step implementation of polynomial regression.

1. We will use a simple dummy dataset for this example that lists salaries for different positions. Import the dataset:

import pandas as pd
import numpy as np
df = pd.read_csv('position_salaries.csv')
df.head()

2. Add the bias column for theta 0. This column contains only 1s, because multiplying a number by 1 leaves it unchanged, so theta 0 acts as the constant term.

df = pd.concat([pd.Series(1, index=df.index, name='00'), df], axis=1)
df.head()

3. Delete the ‘Position’ column, because it contains strings and the algorithm cannot work with strings. We have the ‘Level’ column to represent the positions.

df = df.drop(columns='Position')

4. Define the input variable X and the output variable y. In this example, ‘Level’ is the input feature and ‘Salary’ is the output variable. We want to predict the salary for each level.

y = df['Salary']
X = df.drop(columns = 'Salary')
X.head()

5. Raise the ‘Level’ column to the second and third powers to make the ‘Level1’ and ‘Level2’ columns.

X['Level1'] = X['Level']**2
X['Level2'] = X['Level']**3
X.head()

6. Now, normalize the data. Divide each column by the maximum value of that column, so that the values of each column range from 0 to 1. Gradient descent can work without normalization, but it may then need a much smaller learning rate, and normalization helps it converge faster. Also, calculate m, the number of rows in the dataset.

m = len(X)
X = X/X.max()
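To see what dividing by the column maximum does, here is a small sketch on a made-up three-row frame. The names `toy`, `col_max`, and `new_row` are hypothetical. Keeping the column maxima around is one possible way to scale a new input the same way later; the original article does not cover that step.

```python
import pandas as pd

# Made-up frame with the same kind of columns as X: bias, Level, Level squared
toy = pd.DataFrame({"00": [1, 1, 1], "Level": [1, 2, 4], "Level1": [1, 4, 16]})

col_max = toy.max()      # per-column maximum: 1, 4, 16
scaled = toy / col_max   # every column now has a maximum of exactly 1

# A new input must be divided by the same maxima before it is used for prediction
new_row = pd.Series({"00": 1, "Level": 3, "Level1": 9}) / col_max

print(scaled)
print(new_row)
```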

7. Define the hypothesis function. It uses X and theta to predict ‘y’.

def hypothesis(X, theta):
    # Multiply each column by its theta, then add up the terms in each row
    y1 = theta*X
    return np.sum(y1, axis=1)
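The hypothesis is a row-by-row weighted sum, which is the same thing as a matrix-vector product. The sketch below checks that equivalence on a made-up two-row frame; `X_small` and `theta_small` are illustrative names and values only.

```python
import numpy as np
import pandas as pd

X_small = pd.DataFrame({"00": [1.0, 1.0], "Level": [0.5, 1.0], "Level1": [0.25, 1.0]})
theta_small = np.array([2.0, 3.0, 4.0])

# Same computation as in the hypothesis function: scale each column, then sum each row
row_sums = np.sum(theta_small * X_small, axis=1)

# Equivalent matrix-vector product on the underlying NumPy array
dot_product = X_small.to_numpy() @ theta_small

print(row_sums.tolist())     # first row: 2*1 + 3*0.5 + 4*0.25 = 4.5
print(dot_product.tolist())
```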

8. Define the cost function, using the cost formula described earlier:

def cost(X, y, theta):
    y1 = hypothesis(X, theta)
    # Sum of squared errors divided by 2m (the square root is not part of the cost)
    return np.sum((y1-y)**2)/(2*m)

9. Write the function for gradient descent. We keep updating the theta values for the given number of epochs so that the cost moves toward its minimum. In each iteration, we also record the cost for later analysis.

def gradientDescent(X, y, theta, alpha, epoch):
    J = []  # cost recorded after each epoch
    k = 0
    while k < epoch:
        # Predictions are computed once per epoch, so all thetas use the same errors
        y1 = hypothesis(X, theta)
        for c in range(0, len(X.columns)):
            theta[c] = theta[c] - alpha*sum((y1-y)* X.iloc[:, c])/m
        j = cost(X, y, theta)
        J.append(j)
        k += 1
    return J, theta
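For comparison, here is a vectorized sketch of the same loop that works on plain NumPy arrays. It is an alternative formulation, not the article's code: the function name, the synthetic data, and the decision to copy theta rather than update it in place are all assumptions made for this example.

```python
import numpy as np

def gradient_descent_vectorized(X, y, theta, alpha, epochs):
    """Batch gradient descent on the squared-error cost; returns (costs, theta)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    theta = np.array(theta, dtype=float)  # copy, so the caller's array is not modified
    m = len(y)
    costs = []
    for _ in range(epochs):
        errors = X @ theta - y
        theta = theta - alpha * (X.T @ errors) / m
        costs.append(np.sum((X @ theta - y) ** 2) / (2 * m))
    return costs, theta

# Synthetic, already-scaled data: y = 1 + 2 * x^2 with columns 1, x, x^2
x = np.linspace(0.1, 1.0, 10)
X_syn = np.column_stack([np.ones_like(x), x, x ** 2])
y_syn = 1.0 + 2.0 * x ** 2

costs, theta_fit = gradient_descent_vectorized(X_syn, y_syn, np.zeros(3), alpha=0.05, epochs=700)
print(costs[0], costs[-1])
print(theta_fit)
```

The matrix product `X.T @ errors` computes the sum over training examples for every theta at once, which replaces the inner loop over columns.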

10. All the functions are defined. Now, initialize theta. I am initializing it as an array of zeros, but you can use random values instead. I am choosing alpha as 0.05 and I will update the theta values for 700 epochs.

theta = np.array([0.0]*len(X.columns))
J, theta = gradientDescent(X, y, theta, 0.05, 700)

11. We now have our final theta values, as well as the cost in each iteration. Let’s predict the salaries using the final theta.

y_hat = hypothesis(X, theta)
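One way to sanity-check thetas found by gradient descent is to compare them with a direct least-squares solution. The sketch below uses `np.linalg.lstsq` on synthetic data so that it runs on its own. The data and variable names are assumptions, and this check is not part of the original walkthrough.

```python
import numpy as np

# Synthetic data generated from known coefficients: y = 1 + 0*x + 2*x^2
x = np.linspace(0.1, 1.0, 10)
X_syn = np.column_stack([np.ones_like(x), x, x ** 2])
y_syn = 1.0 + 2.0 * x ** 2

# Least-squares solution of X @ theta = y, with no iterations or learning rate
theta_ls, residuals, rank, singular_values = np.linalg.lstsq(X_syn, y_syn, rcond=None)

y_hat_ls = X_syn @ theta_ls
print(theta_ls)
```

If gradient descent has converged, its thetas and predictions should be close to these. A large gap suggests that more epochs or a different learning rate are needed.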

12. Now plot the original salary and our predicted salary against the levels.

%matplotlib inline
import matplotlib.pyplot as plt
plt.figure()
plt.scatter(x=X['Level'],y= y)
plt.scatter(x=X['Level'], y=y_hat)
plt.show()

Our prediction does not exactly follow the salary trend, but it is close. Linear regression can only return a straight line, whereas polynomial regression can produce a curved line like this one. Even when the data do not follow such a smooth curve, polynomial regression can learn more complex trends as well.

13. Let’s plot the cost we calculated in each epoch of our gradient descent function.

plt.figure()
plt.scatter(x=list(range(len(J))), y=J)
plt.show()

The cost fell drastically at the beginning and then decreased slowly. When gradient descent is working well, the cost should keep going down until it converges. Please feel free to try it with a different number of epochs and different learning rates (alpha).

Here is the dataset: salary_data

Follow this link for the full working code: Polynomial Regression

Translated from: https://towardsdatascience.com/polynomial-regression-from-scratch-in-python-1f34a3a5f373