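Implementing a Multiple Linear Regression Model with pandas and numpy

Introduction

While studying multiple linear regression, I tried writing the following program.

It reads a CSV dataset of roughly 20,000 rows, selects the useful columns, cleans and processes them, and finally builds a multiple linear regression model.

Below is the docstring I wrote for the program (in English):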

                    '''
                    author: ky0ha
                    title: A classifier by multiple linear regression
                    arguments:
                        fpath: a string, the location of the data file
                        fmodel: a string, the format of the data file (default is 'csv')
                        isdropna: a bool, whether to drop NaN values from the data; True drops them, False keeps them (default is True)
                        droplist: a list of strings, the names of the characteristics (columns) to delete from the data
                        smatrix: a matrix of shape (n, 2), used to convert string values to numbers, for example:
                                [["characteristic1", {"string1": value1}],
                                ["characteristic2", {"string2": value2}]] means that string1 in characteristic1 is changed to value1
                        n: an int, the number of rows of data used for training
                        characteristic_list: a list of strings, the names of the characteristics
                        mark: a string, the mark of the data
                    return: an array, the parameter matrix, which can be used directly in the estimate function
                    '''

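It outlines the meaning of each parameter and of the return value:

fpath is a string holding the absolute path of the data file
fmodel is a string giving the file extension of the data file; the default is csv
isdropna is a boolean that decides whether missing values in the data file are dropped: True drops them, False keeps them
droplist is a list of strings naming the columns to delete from the data
smatrix is a matrix with n rows and 2 columns, used to replace string values in the table with numeric values. Its content has the form [["characteristic1", {"string1": value1}],...]: the first item is the name of the column to change, the second is a dictionary whose keys are strings and whose values are numbers. The effect is that every "string1" in column "characteristic1" becomes value1
n is an integer, the number of rows of the table used as the training set
characteristic_list is a list holding the names of all feature columns
mark is a string, the mark of the data; in the code below it is the name of the column used as the target y
The return value is a parameter array, that is, the model's parameter table

Library imports: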

                    import pandas as pd
                    import numpy as np
                    from math import sqrt
                    from matplotlib import pyplot as plt

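Data cleaning

The drop method removes the irrelevant columns. If isdropna is set, dropna then discards every row that contains a missing value. Finally, reset_index(drop=True) discards the old row index and renumbers the remaining rows from 0.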

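                    # f is the DataFrame read from the data file
                    f.drop(["RowNumber","CustomerId","Surname","HasCrCard"], axis=1, inplace=True)
                    if isdropna:
                        # drop every row that contains at least one missing value
                        f.dropna(axis=0, how='any', inplace=True)
                    # renumber the remaining rows from 0
                    f = f.reset_index(drop=True)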
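
To make the effect of these three steps concrete, here is a minimal sketch on a small made-up table. The values and the columns Age and Balance are assumptions for illustration only; they are not taken from the original dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the dropped column names mirror the text above.
f = pd.DataFrame({
    "RowNumber": [1, 2, 3, 4],
    "CustomerId": [101, 102, 103, 104],
    "Surname": ["A", "B", "C", "D"],
    "HasCrCard": [1, 0, 1, 1],
    "Age": [42.0, np.nan, 39.0, 50.0],
    "Balance": [0.0, 83807.86, 159660.8, np.nan],
})

# Remove the columns that are not used by the model.
f = f.drop(columns=["RowNumber", "CustomerId", "Surname", "HasCrCard"])
# Remove every row that contains at least one missing value.
f = f.dropna(axis=0, how="any")
# Discard the old row labels and renumber the surviving rows from 0.
f = f.reset_index(drop=True)

print(f)
```

Without the final reset_index call the surviving rows would keep their original labels (0 and 2 here), which is easy to trip over in later positional slicing.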

Calling map(dict) on a given column replaces every string value in that column with a number, according to the key-value pairs of the dictionary.

                    dic = {"France": 1, "Spain": 2, "Germany": 3}
                    f["Geography"] = f["Geography"].map(dic)
                    dicf = {"Female": 0, "Male": 1}
                    f["Gender"] = f["Gender"].map(dicf)
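
The two hard-coded mappings above could presumably be driven by the smatrix argument described in the docstring. The following is only a sketch of that idea: apply_smatrix is a hypothetical helper name, not a function from the original program, and the toy table is made up.

```python
import pandas as pd

def apply_smatrix(df, smatrix):
    """Hypothetical helper: replace strings with numbers, one column per smatrix row."""
    out = df.copy()
    for column, mapping in smatrix:
        # Series.map looks each value up in the dictionary;
        # values that are not keys of the dictionary become NaN.
        out[column] = out[column].map(mapping)
    return out

smatrix = [
    ["Geography", {"France": 1, "Spain": 2, "Germany": 3}],
    ["Gender", {"Female": 0, "Male": 1}],
]

# Made-up rows for illustration.
toy = pd.DataFrame({
    "Geography": ["France", "Germany", "Spain"],
    "Gender": ["Female", "Male", "Male"],
})

encoded = apply_smatrix(toy, smatrix)
print(encoded)
```

Because unmapped strings turn into NaN, it may be worth checking encoded.isna().sum() after the mapping to catch a misspelled or unexpected category.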

Extract the data and store it as a list of arrays, one per feature column

                    CL = []
                    for i in characteristic_list:
                        CL.append(np.array(list(f.loc[:,i]))[:n])
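
As a possible alternative, the same extraction can be done in one step by selecting all feature columns at once. This is a sketch on made-up data (the columns Age, Balance and Exited are assumptions), shown next to the loop version to check that both give the same numbers.

```python
import numpy as np
import pandas as pd

# Made-up table; column names are assumptions for illustration.
toy = pd.DataFrame({
    "Age": [42, 41, 39, 50],
    "Balance": [0.0, 83807.86, 159660.8, 1000.0],
    "Exited": [1, 0, 1, 0],
})
characteristic_list = ["Age", "Balance"]
n = 3  # number of rows used for training

# Loop version, as in the text: one 1-D array of length n per feature.
CL = []
for i in characteristic_list:
    CL.append(np.array(list(toy.loc[:, i]))[:n])

# One-step version: select the feature columns, convert, keep the first n rows.
X = toy.loc[:, characteristic_list].to_numpy()[:n]

print(X.shape)
print(np.allclose(np.column_stack(CL), X))
```

The one-step version already has the (n, number of features) layout that the model-building step needs, so the later column-by-column concatenation could be replaced by a single np.c_ call.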

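Building the model

The extracted feature columns are concatenated with a newly built all-ones vector (which accounts for the intercept term) to form X_b, and the target column named by mark is stored as a separate vector in the variable y. Using the least squares method, the parameter matrix theta_best that minimizes the loss function is then computed from X_b and y.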
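
For reference, the closed-form least squares solution can be derived as follows. This is the standard derivation, written with the symbols $X_b$, $y$ and $\theta$ from this section, and it assumes that $X_b^{T} X_b$ is invertible.

```latex
\begin{aligned}
J(\theta) &= \lVert X_b\,\theta - y \rVert^{2} \\
\nabla_{\theta} J(\theta) &= 2\,X_b^{T}\,(X_b\,\theta - y) = 0 \\
X_b^{T} X_b\,\theta &= X_b^{T} y \\
\theta_{\text{best}} &= (X_b^{T} X_b)^{-1}\,X_b^{T}\,y
\end{aligned}
```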

                    def getTheta(X_b, y):
                        # least squares solution: (X_b^T X_b)^-1 X_b^T y
                        return np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
                    
                    X_b = np.ones((n,1))
                    y = np.array(list(f.loc[:,mark]))[:n]
                    for i in CL:
                        X_b = np.c_[X_b, i]
                    # plug the X_b matrix in to obtain the parameter matrix theta_best
                    theta_best = getTheta(X_b,y)
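
One way to gain some confidence in this computation is to run it on synthetic data whose true parameters are known. The sketch below does that; the data, the name get_theta and the comparison against np.linalg.lstsq are additions for illustration, not part of the original program.

```python
import numpy as np

def get_theta(X_b, y):
    # Same closed-form least squares formula as getTheta above.
    return np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)

# Synthetic data with known parameters (assumed values, for illustration).
rng = np.random.default_rng(0)
n = 200
features = rng.normal(size=(n, 2))
true_theta = np.array([4.0, 3.0, -2.0])  # intercept, then one weight per feature

# Prepend the all-ones column so the first parameter acts as the intercept.
X_b = np.c_[np.ones((n, 1)), features]
y = X_b.dot(true_theta) + rng.normal(scale=0.1, size=n)

theta_best = get_theta(X_b, y)

# NumPy's least squares solver should give the same answer without
# forming an explicit matrix inverse.
theta_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)

print(theta_best)
print(theta_lstsq)
```

Both results should land close to true_theta. When feature columns are strongly correlated, the explicit inverse can become numerically unstable, and a solver such as np.linalg.lstsq is usually the safer choice.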

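Summary

The whole computation was built up from scratch, step by step, and the same method can be applied to any simple or multiple linear regression. The main approach is to use pandas to clean and extract the data, use numpy to build the matrices and do the calculation, obtain the parameter matrix theta_best with the least squares method, and finally use the parameter matrix to test the model.

The model testing part is not included in this document because it is fairly long, and after a full day of writing I am a bit tired... I may add it gradually later, thanks for your understanding!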
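
Until that part is written up, here is a rough sketch of what a test step could look like. It is not the author's test code: the helpers predict and rmse are hypothetical, the data is synthetic, and the choice of root mean squared error on held-out rows is an assumption.

```python
from math import sqrt

import numpy as np

def predict(theta, features):
    # Hypothetical helper: prepend the all-ones column, then apply the parameters.
    X_b = np.c_[np.ones((features.shape[0], 1)), features]
    return X_b.dot(theta)

def rmse(y_true, y_pred):
    # Root mean squared error between targets and predictions.
    return sqrt(np.mean((y_true - y_pred) ** 2))

# Synthetic data (assumed values): 300 rows, two features, small noise.
rng = np.random.default_rng(1)
features = rng.normal(size=(300, 2))
y = 1.0 + features.dot(np.array([2.0, -1.0])) + rng.normal(scale=0.1, size=300)

# Train on the first n rows, as in the text, and test on the remaining rows.
n = 200
X_train = np.c_[np.ones((n, 1)), features[:n]]
theta_best = np.linalg.inv(X_train.T.dot(X_train)).dot(X_train.T).dot(y[:n])

test_error = rmse(y[n:], predict(theta_best, features[n:]))
print(test_error)
```

Evaluating on rows that were not used for training gives a more honest picture of the model than measuring the error on the training rows themselves.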
