Hung-yi Lee Machine Learning Notes - Week 1

I built a simple DNN to solve a regression problem and learned a lot along the way.

LHY_ML_HW1_Regression

I got reacquainted with Colab along the way; it is quite pleasant to use.

Baseline

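Running the provided Sample Code as-is is enough to pass the Baseline. This gives students a real choice: those who want to coast still learn the most basic workflow and earn some points, while anyone who wants a higher score has to optimize further on their own.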

Medium

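The Sample Code also gives a hint: reducing the number of input features can help the model learn. Here I use the first 40 columns plus column 57 and column 75 as the training features, which gives better results.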
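As a minimal sketch of this column selection, assuming the features are already in a 2D array with the id and target columns removed and that the column numbers are 0-based (the array `data` below is random placeholder data, not the real dataset):

```python
import numpy as np

# Placeholder data standing in for the real feature matrix (assumption):
# 5 rows, 93 feature columns, with the id and target columns already removed.
rng = np.random.default_rng(0)
data = rng.random((5, 93))

# First 40 columns plus columns 57 and 75 (0-based indices).
feats = list(range(40)) + [57, 75]
selected = data[:, feats]

print(selected.shape)  # (5, 42)
```

The same index list would need to be applied to the training, validation, and test features so that all three share one input dimension.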

Strong

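The feature reduction in the previous step was a bit too aggressive, so a natural next move is to add features back. Following what others have done, I made improvements in the following areas:

Feature selection

Use sklearn's f_regression to find the most influential features. Sample code: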

import pandas as pd
import numpy as np

data = pd.read_csv('/kaggle/input/ml2021spring-hw1/covid.train.csv')
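x = data[data.columns[1:94]]  # data.columns holds the column labels; the slice 1:94 picks the feature columns (positions 1-93), and position 94 below is the target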
y = data[data.columns[94]]

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression

# min-max normalize each feature column to [0, 1]
x = (x - x.min()) / (x.max() - x.min())

bestfeatures = SelectKBest(score_func=f_regression, k=5)
fit = bestfeatures.fit(x,y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(x.columns)
#concat two dataframes for better visualization 
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score']  #naming the dataframe columns
print(featureScores.nlargest(15,'Score'))  #print 15 best features

The docstring of this class from the scikit-learn source is attached below for reference.

class SelectKBest(_BaseFilter):
    """Select features according to the k highest scores.
    Read more in the :ref:`User Guide <univariate_feature_selection>`.
    Parameters
    ----------
    score_func : callable
        Function taking two arrays X and y, and returning a pair of arrays
        (scores, pvalues) or a single array with scores.
        Default is f_classif (see below "See also"). The default function only works with classification tasks.
    k : int or "all", optional, default=10
        Number of top features to select.
        The "all" option bypasses selection, for use in a parameter search.
    Attributes
    ----------
    scores_ : array-like, shape=(n_features,)
        Scores of features.
    pvalues_ : array-like, shape=(n_features,)
        p-values of feature scores, None if `score_func` returned only scores.
    Notes
    -----
    Ties between features with equal scores will be broken in an unspecified
    way.
    See also
    --------
    f_classif: ANOVA F-value between label/feature for classification tasks.
    mutual_info_classif: Mutual information for a discrete target.
    chi2: Chi-squared stats of non-negative features for classification tasks.
    f_regression: F-value between label/feature for regression tasks.
    mutual_info_regression: Mutual information for a continuous target.
    SelectPercentile: Select features based on percentile of the highest scores.
    SelectFpr: Select features based on a false positive rate test.
    SelectFdr: Select features based on an estimated false discovery rate.
    SelectFwe: Select features based on family-wise error rate.
    GenericUnivariateSelect: Univariate feature selector with configurable mode.
    """
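To turn the scores into an actual column selection, one option is `get_support(indices=True)` on the fitted selector. The sketch below uses synthetic data (an assumption, chosen so that only columns 2 and 7 matter) rather than the homework CSV:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (assumption): only columns 2 and 7 influence the target.
rng = np.random.default_rng(0)
X = rng.random((200, 10))
y = 3.0 * X[:, 2] - 2.0 * X[:, 7] + 0.01 * rng.standard_normal(200)

selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)

# Indices of the k selected columns, in ascending column order.
top_idx = selector.get_support(indices=True)
X_selected = selector.transform(X)

print(top_idx)
print(X_selected.shape)
```

On the real data, the resulting index list could replace a hand-picked column list when slicing the feature matrix.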

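Network architecture improvements

Use BN (Batch Normalization), which helps prevent overfitting and also speeds up training
Use Dropout to reduce overfitting
Switch the activation function to LeakyReLU
Add L2 regularization to the loss function
Switch to a different optimizer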
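The changes listed above could be combined roughly as follows. This is only a sketch: the class name `Net`, the layer sizes, the dropout rate, the choice of Adam, and the `weight_decay` value are all illustrative assumptions, not the settings actually used for the homework.

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    """Hypothetical regression network; layer sizes are illustrative."""

    def __init__(self, input_dim, hidden_dim=64, p_drop=0.2):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),   # BN over the hidden features
            nn.LeakyReLU(),               # LeakyReLU instead of ReLU
            nn.Dropout(p_drop),           # randomly zero activations in training
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x):
        return self.layers(x).squeeze(1)  # (batch, 1) -> (batch,)

model = Net(input_dim=14)
# weight_decay adds an L2-style penalty on the parameters inside the optimizer.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# One training step on random placeholder data.
x = torch.randn(8, 14)
y = torch.randn(8)
loss = nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Remember to call `model.eval()` before validation or testing, so that Dropout is disabled and BN uses its running statistics.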
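Summary of methods for preventing overfitting
L1 regularization
- Makes the parameter matrix sparse (some parameters become exactly 0), which acts as feature selection: important features are kept and unimportant ones are dropped, and this helps prevent overfitting.
L2 regularization
- Shrinks the parameter values, which effectively suppresses overfitting.
Batch Normalization
- During training, BN ties all samples in a mini-batch together, so the network no longer produces a deterministic output for a single training sample: the output for each sample depends on all the other samples in the same batch. This acts like an indirect form of data augmentation and helps prevent overfitting.
- BN also speeds up training. During training it normalizes the features of each layer to mean 0 and variance 1, so every layer sees a similar input distribution. In a 2D contour plot the loss contours become closer to circles, which speeds up the convergence of gradient descent. Normalizing to mean 0 and variance 1 also moves the values out of the saturated region of the activation function into the non-saturated region, so the gradients are larger and convergence is faster; this also mitigates vanishing and exploding gradients.
Dropout
- Randomly drops part of the network, so each training step effectively uses a new, thinned network. Each thinned network learns different local features, and together they cover the local features of the data well. However the test data vary, the network still works as long as those local features are present (I have seen them all), so generalization is more robust than learning features with a single full network.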
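The training-time behavior of BN and Dropout described above can be sketched in NumPy. This is a simplified illustration: it omits BN's learnable scale and shift and its running statistics, and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_norm_train(x, eps=1e-5):
    """Normalize each feature column using the statistics of the current batch."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

def dropout_train(x, p=0.5):
    """Inverted dropout: zero each entry with probability p, rescale the rest."""
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

# A batch with mean 5 and std 3 ends up with roughly mean 0 and std 1 per column.
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))
x_bn = batch_norm_train(x)
print(x_bn.mean(axis=0).round(6), x_bn.std(axis=0).round(3))

# Dropout keeps the expected activation unchanged thanks to the rescaling.
x_drop = dropout_train(np.ones((1000, 4)), p=0.5)
print(x_drop.mean())  # close to 1.0 in expectation
```

Because `batch_norm_train` uses the batch mean and variance, the normalized value of one sample changes when the other samples in the batch change, which is the coupling between samples mentioned above.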
Summary of activation functions

TODO

Sigmoid function

tanh function

ReLU function

Leaky ReLU function

ELU (Exponential Linear Units) function

MaxOut function
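Until the notes for each function are filled in, here is a rough NumPy sketch of the standard definitions of the functions listed above. The default values `negative_slope=0.01` and `alpha=1.0` and the `maxout` argument shapes are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, negative_slope=0.01):
    return np.where(x > 0, x, negative_slope * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def maxout(x, W, b):
    """Max over k affine pieces. x: (n, d), W: (k, d, m), b: (k, m) -> (n, m)."""
    return np.max(np.einsum("nd,kdm->knm", x, W) + b[:, None, :], axis=0)

x = np.array([-2.0, 0.0, 2.0])
for f in (sigmoid, tanh, relu, leaky_relu, elu):
    print(f.__name__, f(x))
```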

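How to choose

L1 and L2 in detail

A typical optimization problem only considers how well the model fits the data and ignores the complexity of the model itself. We therefore introduce a regularization term that describes model complexity, so the optimization objective becomes the sum of the loss and the regularization term.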
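In symbols, with $L(w)$ the data-fitting loss, $\Omega(w)$ the regularization term, and $\lambda \ge 0$ a weighting coefficient (this notation is an assumption introduced here for illustration), the objective can be written as:

$$
\min_{w} \; L(w) + \lambda\,\Omega(w), \qquad
\Omega_{L1}(w) = \lVert w \rVert_1 = \sum_i |w_i|, \qquad
\Omega_{L2}(w) = \lVert w \rVert_2^2 = \sum_i w_i^2
$$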

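The figure referenced here is an illustration from the Watermelon Book (Zhou Zhihua's Machine Learning), used to explain the respective characteristics of $L1$ and $L2$ regularization.

To begin with, $L1$ regularization is suited to making the parameters sparse, while $L2$ regularization is suited to making the parameters dense and close to 0.

Formula-based analyses are already plentiful, so here I look at it from the perspective of contour lines.

In the figure, $w_1$ and $w_2$ are the model's two parameters. The figure contains three families of contour lines; parameters on the same contour line share the same error, the same $L2$ norm, or the same $L1$ norm, respectively. The optimization goal is to find the parameters that minimize the sum of the norm and the error.

For the $L1$ norm, the contour lines are diamonds, so the point where they meet the error contours is more likely to lie on a coordinate axis. At such a point one of the parameters is 0, which is how sparsity arises.

For the $L2$ norm, the contour lines are circles, so the point where they meet the error contours has no preferred position and sparsity does not arise. However, because that point lies on a circle, the parameter values tend to be similar in magnitude, so the parameters end up dense, smooth, and close to 0.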
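The contrast can also be checked numerically in a simplified setting. Assume the error is a separable quadratic, $\frac{1}{2}(w - a)^2$ per coordinate, where $a$ is the unregularized optimum (this toy setup and the function names are assumptions for illustration). Both penalized problems then have closed-form solutions:

```python
import numpy as np

def l1_solution(a, lam):
    """argmin_w 0.5*(w - a)**2 + lam*|w|  (soft-thresholding)."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def l2_solution(a, lam):
    """argmin_w 0.5*(w - a)**2 + 0.5*lam*w**2  (uniform shrinkage)."""
    return a / (1.0 + lam)

a = np.array([3.0, 0.4, -0.2, -1.5])  # unregularized optimum per coordinate
lam = 0.5

print("L1:", l1_solution(a, lam))  # small coordinates become exactly 0
print("L2:", l2_solution(a, lam))  # every coordinate shrinks, none becomes 0
```

With the $L1$ penalty, any coordinate whose magnitude is below `lam` is set exactly to zero, while the $L2$ penalty scales every coordinate by the same factor and leaves them all nonzero.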

Below is the PyTorch implementation.

regularization_loss = 0
for param in model.parameters():
    # regularization_loss += torch.sum(torch.abs(param))  # L1 regularization
    regularization_loss += torch.sum(param ** 2)  # L2 regularization
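For context, a self-contained sketch of how such a penalty might be added to the training loss is shown below. The stand-in model, the random data, the coefficient `lam`, and the helper names `l1_penalty` and `l2_penalty` are all assumptions for illustration.

```python
import torch
import torch.nn as nn

def l2_penalty(model):
    """Sum of squared entries over all parameters of the model."""
    return sum(torch.sum(p ** 2) for p in model.parameters())

def l1_penalty(model):
    """Sum of absolute values over all parameters of the model."""
    return sum(torch.sum(torch.abs(p)) for p in model.parameters())

model = nn.Linear(4, 1)          # stand-in model (assumption)
x, y = torch.randn(16, 4), torch.randn(16, 1)
lam = 1e-4                        # hypothetical regularization strength

mse = nn.functional.mse_loss(model(x), y)
loss = mse + lam * l2_penalty(model)
loss.backward()                   # gradients include the penalty term
```

Note that looping over `model.parameters()` penalizes biases as well as weights; whether that is desirable is a design choice.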

Optimizer summary

https://zhuanlan.zhihu.com/p/58236906

Backpropagation algorithm

todo