#+TITLE: MachineLearning

#+STARTUP: overview

Math for ML

Self Entropy

When one event happens, how much information is present, i.e. how meaningful it is.
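For reference, the standard definition of self-information (log base 2 gives bits):

\begin{equation}
I(x) = -\log p(x)
\end{equation}

The rarer the event, the larger its information content; the information entropy below is the expectation of this quantity.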

Information Entropy

Cross Entropy

Conditional Entropy

Given that Y has already occurred, the remaining entropy (uncertainty) of X.
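Written out (the standard identity):

\begin{equation}
H(X|Y) = \sum_{y} p(y)\, H(X|Y=y) = -\sum_{x,y} p(x,y) \log p(x|y)
\end{equation}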

Gini

Gini index (基尼系数), a measure of impurity.

KL

KL divergence: measures the difference between two distributions p and q.
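The standard definition, and its relation to the cross entropy above:

\begin{equation}
D_{KL}(p\,\|\,q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}, \qquad H(p, q) = H(p) + D_{KL}(p\,\|\,q)
\end{equation}

Note that it is not symmetric in p and q.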

Matrix

Conjugate transpose Matrix

Conjugate transpose matrix (共轭转置矩阵): conjugating first and then transposing, or transposing first and then conjugating, gives the same result.

Normal Matrix

A normal matrix (正规矩阵) commutes with its conjugate transpose, A Aᴴ = Aᴴ A, and can be unitarily diagonalized.

Unitary Matrix

A unitary matrix is a normal matrix for which the product with its conjugate transpose equals the identity: U Uᴴ = Uᴴ U = I.

orthogonal matrix

A real unitary matrix:

its transpose is its inverse, Qᵀ Q = Q Qᵀ = I.

Hermitian matrix

Conjugate symmetric (埃尔米特矩阵 / 厄米矩阵): A = Aᴴ, i.e. elements mirrored across the diagonal are complex conjugates of each other.

Eigenvalue decomposition (Eigenwertzerlegung)

Compute the eigenvalues and eigenvectors of a matrix (eigenvalue decomposition). The matrix must be square; if it is not, first convert it: A = a*a.T

import numpy as np
a = np.mat([[1,2,3,4],[1,1,1,1]])
A = a*a.T   # a is not square, so form the square matrix A = a a^T first
B, U = np.linalg.eig(A)
print("eigenvalues of A : ")
print(B)
print("singular values of a (the square roots of the eigenvalues of A) : ")
print(np.sqrt(B))
print("eigenvectors of A : ")
print(U)

SVD Singular value decomposition

For a general matrix that is not square, singular value decomposition can be used directly:

import numpy as np
a = np.mat([[1,2,3,4],[1,1,1,1]])
U, B, Vt = np.linalg.svd(a)
print("left singular vectors : ")
print(U)
print("singular values of a : ")
print(B)
print("right singular vectors (transposed) : ")
print(Vt)

Bayes's Rule

if x and y are independent: p(x, y) = p(x)p(y)

Conditional probability:
conditional probability = joint probability / marginal probability, i.e. P(A|B) = P(A, B) / P(B).
Prior and posterior probabilities are both conditional probabilities; the one conditioned on what is already known is the prior.

Law of total probability

Bayes' formula

Bayes' formula + law of total probability + conditional probability:
\begin{eqnarray*}
P(A|B) &=& \frac{P(B|A)P(A)}{P(B)} \\
       &=& \frac{P(B|A)P(A)}{\sum_{i=1}^{n} P(B, A_{i})} \\
       &=& \frac{P(B|A)P(A)}{\sum_{i=1}^{n} P(B|A_{i})P(A_{i})}
\end{eqnarray*}

Covariance Matrix

For i = 1, ..., n, x_i is a random variable that follows a Gaussian distribution.

Set x̄_i to be the mean of x_i.

Covariance matrix:

\begin{equation}
\Sigma = \mathbb{E}\left[
  \left(
  \begin{array}{c}
          x_{1}-\bar{x}_{1} \\
          x_{2}-\bar{x}_{2} \\
          x_{3}-\bar{x}_{3} \\
          \vdots            \\
          x_{n}-\bar{x}_{n}
 \end{array}
 \right)
  \left(
  \begin{array}{ccccc}
          x_{1}-\bar{x}_{1} &
          x_{2}-\bar{x}_{2} &
          x_{3}-\bar{x}_{3} &
          \cdots            &
          x_{n}-\bar{x}_{n}
  \end{array}
  \right)
\right]
\end{equation}

The diagonal entries are the variances of the individual variables; the off-diagonal entries are the covariances between pairs of variables.
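A quick numpy check of this definition (a minimal sketch with arbitrary toy data; np.cov with bias=True matches the 1/N outer-product average):

import numpy as np

# three variables observed 100 times
rng = np.random.default_rng(0)
data = rng.normal(size=(3, 100))

# manual centered outer-product average (1/N normalization)
centered = data - data.mean(axis=1, keepdims=True)
sigma_manual = centered @ centered.T / data.shape[1]

# numpy's covariance matrix (bias=True uses the same 1/N normalization)
sigma_np = np.cov(data, bias=True)
print(np.allclose(sigma_manual, sigma_np))  # True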

Regularization

The L regularization term penalizes the weights of the network; it is added to the loss as a scalar.


N: number of examples. M: number of features (dimensions).
L0: counts the non-zero weights in the network.
L1: the sum of the absolute values of all weights; pushes the network toward sparse weight vectors.
L2: the sum of the squares of all weights; pushes the network toward small weights.
w: the weights w_1, w_2, ...
q = 1: L1 regularization; q = 2: L2 regularization.
λ: the regularization coefficient.
The contours of the regularizer are concentric around w = 0: a diamond for q = 1, a circle for q = 2.


To minimize the loss, the contour of the data term (part 1) just touches the contour of the regularization term (part 2). For L1 the touching point tends to lie on a coordinate axis, so small weight components are set exactly to 0.
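Putting the two parts together, the regularized loss has the usual form (λ is the regularization coefficient, q selects the norm):

\begin{equation}
\tilde{E}(w) = E_{D}(w) + \lambda \sum_{j=1}^{M} |w_{j}|^{q}
\end{equation}

with q = 1 for L1 regularization and q = 2 for L2 regularization.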

Error: Bias-Variance Trade-off

Error = Bias² + Variance + noise

Bias: underfitting; influenced by how the model is applied and what is observed.
Variance: overfitting; stability, determined by how the model is constructed.
Noise: the inherent difficulty of the problem.


Multivariate Gaussian distribution

See the linked article on Zhihu (知乎).
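For reference, the density that the code below implements:

\begin{equation}
\mathcal{N}(x|\mu,\Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right)
\end{equation}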


import numpy as np

def gaussian(x, mean, cov):
    dim = np.shape(cov)[0]  # dimensionality
    # a small multiple of the identity is added to avoid a singular covariance matrix
    covdet = np.linalg.det(cov + np.eye(dim)*0.01)  # determinant of the covariance matrix
    covinv = np.linalg.inv(cov + np.eye(dim)*0.01)  # inverse of the covariance matrix
    xdiff = x - mean
    # probability density
    prob = 1.0/np.power(2*np.pi, 1.0*dim/2)/np.sqrt(np.abs(covdet))*np.exp(-1.0/2*np.dot(np.dot(xdiff, covinv), xdiff))
    return prob

Mahalanobis distance

The transform used by the Mahalanobis distance: with the eigendecomposition Σ = U Λ Uᵀ, map x to y = Λ^{-1/2} Uᵀ (x − μ).

In the new coordinates, U is the rotation of the transform, Λ^{-1/2} rescales the basis, and Uᵀ(x − μ) is the projection onto it; afterwards, in the new coordinates, the data follows a standard, uncorrelated multivariate Gaussian distribution.
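A minimal numpy sketch of this idea: the Mahalanobis distance equals the Euclidean distance after the whitening transform (covariance matrix and points below are arbitrary example values):

import numpy as np

# arbitrary example covariance and points, chosen only for illustration
Sigma = np.array([[4.0, 1.5],
                  [1.5, 1.0]])
mu = np.array([1.0, 2.0])
x = np.array([3.0, 1.0])

# Mahalanobis distance directly from the definition
d_direct = np.sqrt((x - mu) @ np.linalg.inv(Sigma) @ (x - mu))

# same distance via the whitening transform y = Lambda^{-1/2} U^T (x - mu)
lam, U = np.linalg.eigh(Sigma)          # Sigma = U diag(lam) U^T
y = np.diag(1.0 / np.sqrt(lam)) @ U.T @ (x - mu)
d_whitened = np.linalg.norm(y)

print(d_direct, d_whitened)             # both values agree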

K-fold Cross Validation


N: total number of examples. K: number of folds. m: number of examples per fold.

Large K: small bias, but tends toward overfitting and high variance. Small K: large bias (underfitting), low variance.
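A minimal sklearn sketch (the toy data and the LinearRegression estimator are just placeholders):

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

# toy regression data: N examples, 1 feature
X = np.arange(100).reshape(-1, 1).astype(float)
y = 5 * X.ravel() + 20 + np.random.randint(0, 3, size=100)

# K = 5 folds: each fold of m = N/K examples is used once as the validation set
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=kf)
print(scores, scores.mean())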


confusion matrix


                     Actually True    Actually False

Predict Positive     TP               FP
Predict Negative     FN               TN


True Positive Rate (recall)

namely TPR = TP / (TP + FN); the y axis of the ROC curve

False Positive Rate

namely FPR = FP / (FP + TN); the x axis of the ROC curve

ROC Receiver Operating Characteristic

Under an acceptable x value (FPR = 1 − specificity), we want the best possible y value (sensitivity/TPR). Moving from one end of the curve to the other corresponds to classifying everything as positive through to classifying everything as negative.

AUC Area under the Curve

The larger the area, the better the method; we use this metric to choose between methods.

F1

harmonic mean of recall and precision: F1 = 2 · precision · recall / (precision + recall)

Ordinal attributes: high/medium/low → (3, 2, 1). Non-ordinal (categorical) attributes: pig/dog/sheep → one-hot (1,0,0), (0,1,0), (0,0,1).

Macro Micro Average

This is for multi-class classification: first compute, for each class, a one-vs-rest confusion matrix as in binary classification.

Macro Average

Micro Average
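A small sklearn sketch of the difference between the two averages (toy labels chosen arbitrarily):

from sklearn.metrics import f1_score

# toy multi-class labels (3 classes)
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 2, 2, 2, 2, 0, 2]

# macro: compute F1 per class, then average the per-class scores
print(f1_score(y_true, y_pred, average="macro"))
# micro: pool TP/FP/FN over all classes, then compute a single F1
print(f1_score(y_true, y_pred, average="micro"))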

Jacobian matrix

The transformation from spherical to Cartesian coordinates is given by the function F: ℝ⁺ × [0, π] × [0, 2π) → ℝ³ with components x = r sin θ cos φ, y = r sin θ sin φ, z = r cos θ. The Jacobian matrix of this coordinate transformation is

\begin{equation}
J_{F}(r,\theta,\varphi) =
  \left(
  \begin{array}{ccc}
    \sin\theta\cos\varphi & r\cos\theta\cos\varphi & -r\sin\theta\sin\varphi \\
    \sin\theta\sin\varphi & r\cos\theta\sin\varphi &  r\sin\theta\cos\varphi \\
    \cos\theta            & -r\sin\theta           &  0
  \end{array}
  \right)
\end{equation}

and its Jacobian determinant is r² sin θ. Since dV = dx dy dz, under this change of variables the volume element becomes dV = r² sin θ dr dθ dφ.

Pre-pruning and post-pruning

Pre-pruning: after the split attribute is chosen on the training set by maximum information gain, check the split with a validation set; if the validation entropy increases (the generalization gets worse), reject the split and make the node a leaf.

Post-pruning: after the tree is complete, check every internal node with the validation set from bottom to top; if removing a split gives lower entropy, remove that split.

Continuous and missing attribute values

Discretizing continuous values: sort the n values of the attribute, take the midpoints of the n−1 intervals as candidate thresholds, scan all candidates, and use the one with maximum information gain for a binary split.

Missing values: take the subset of examples where the attribute is not missing, compute the information gain on it weighted by the corresponding proportion, and proceed as before. If the chosen split attribute has missing values, the split is not applied to those samples directly; instead each such sample is copied into every child subset.

activation function

activation

The output lies in the real numbers or in some interval and changes continuously; the output value can be compared directly with the true value.

Sigmoid

Derivative: σ'(x) = σ(x)(1 − σ(x))

ReLU

LeakyReLU

Tanh

Derivative: tanh'(x) = 1 − tanh²(x)

MSE

Derivative (for L = ½(a − y)²): ∂L/∂a = a − y

Information entropy losses

softmax

The probabilities over all classes sum to 1. Derivative (with a = softmax(z)): ∂a_i/∂z_k = a_i(1 − a_i) if k = i, and −a_i a_k if k ≠ i.

cross entropy

When computing the cross entropy, it is usually used together with the softmax function.

For one-hot coding: when the predicted probability of the true class is 1, the prediction is correct and the cross entropy is 0. Derivative:

Using the softmax derivative above and splitting into the cases k = i and k ≠ i gives ∂L/∂z_i = a_i − y_i.

This is why cross entropy trains very fast: if a is far away from y, the update is simply their difference.
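A minimal numpy sketch that checks the ∂L/∂z = a − y result numerically (toy logits and a one-hot target chosen arbitrarily):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # stabilized softmax
    return e / e.sum()

def cross_entropy(a, y):
    return -np.sum(y * np.log(a))

z = np.array([1.0, -0.5, 2.0])
y = np.array([0.0, 0.0, 1.0])        # one-hot target

a = softmax(z)
analytic = a - y                      # gradient of CE(softmax(z), y) w.r.t. z

# numerical gradient via central differences for comparison
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    numeric[i] = (cross_entropy(softmax(zp), y) - cross_entropy(softmax(zm), y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))   # True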

example bagging

Sample with replacement several times, and train multiple trees in parallel, each on a different bootstrap sample.


More trees: the bias stays the same, the variance is reduced toward a limit.


Therefore, start with a strong model that has small bias and large variance.

Boosting

Fixed data set; models are trained sequentially (in series) and are strongly correlated with each other; to prevent overfitting, use weak models.

Gradient descent

When there are many data points, the normal-equation method becomes computationally very expensive, so gradient descent is used more often.

sklearn API

import numpy as np
import random
from sklearn import linear_model
testsize = 5

x = np.array([a for a in range(100)])
onesx = np.ones(x.shape)
X = np.c_[x, 2*x, onesx]
y = np.array([a*5 + 20 + random.randint(0,3) for a in range(100)])
print("the X shape is {}, and y shape is {}".format(X.shape, y.shape))

# Sklearn API
reg = linear_model.LinearRegression()
model = reg.fit(X,y)
print("Sklearn: the weith is {}, and the intercept is {}".format(model.coef_[:-1] ,model.intercept_))
print("the predect of 3 ele is {}".format(model.predict(np.c_[np.arange(testsize), np.arange(testsize),np.ones(testsize)])))


# manual
def featureNormalize(X):
    # standardize each column to zero mean and unit variance (defined here but not used below)
    (m, n) = X.shape
    X_norm = X.copy()
    mu = np.zeros(n)
    sigma = np.zeros(n)
    for i in range(n):
        mu[i] = np.mean(X[:, i])
        sigma[i] = np.std(X[:, i])
        X_norm[:, i] = (X_norm[:, i] - mu[i]) / sigma[i]
    return X_norm
def computeCost(X, y, theta):
    return np.sum((np.dot(X,theta) -y)**2)/(2*len(y));

def gradientDescent(X, y, theta, alpha, num_iters):
    m = len(y)
    J_history = np.zeros(num_iters);
    theta_len = len(theta);
    for num_iter in range(num_iters):
        theta = theta - (alpha/m)*np.dot(X.T,(np.dot(X,theta).reshape(-1)-y))
        J_history[num_iter] = computeCost(X, y, theta)
    return theta, J_history

alpha = 0.0001
num_iters = 400000
theta = np.zeros(2+1)
theta, J_history = gradientDescent(X, y, theta, alpha, num_iters)
print("Greadient decent: the weight is {}, and the intercept is {}".format(theta[:-1],theta[-1]))
print("the predect of 3 ele is {}".format(np.dot(np.c_[np.arange(testsize), np.arange(testsize),np.ones(testsize)], theta)))

Ordinary Least Squares (OLS)

Normal equation

The normal equation can be derived by maximum likelihood with a multivariate Gaussian model; it solves for the weights and the bias together:

But a column of ones has to be appended to both the train and the test data; whether it goes in front or at the end is somewhat arbitrary.

For now, putting it at the end means the variables and the parameters can be accessed as needed.

Loss function: take the derivative and set it to 0, which gives w = (XᵀX)⁻¹ Xᵀ y; this requires XᵀX to be invertible.

Normal equation in code:

import numpy as np
import random
from sklearn import linear_model
testsize = 5

x = np.array([a for a in range(100)])
onesx = np.ones(x.shape)
X = np.c_[x,onesx]
y = np.array([a*5 + 20 + random.randint(0,3) for a in range(100)])
print("the X shape is {}, and y shape is {}".format(X.shape, y.shape))

# weight = np.dot(np.dot(np.linalg.pinv(np.dot(X.T, X)), X.T), y)
weight = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)

print("OLS : the weight is{}, and the bais is {} ".format(weight[:-1], weight[-1]))
print("the predect of 5 ele is {}".format(np.dot(np.c_[np.arange(testsize),np.ones(testsize)], weight)))


It also works with multiple variables:
import numpy as np
import random
from sklearn import linear_model
testsize = 5

x = np.array([a for a in range(100)])
onesx = np.ones(x.shape)
X = np.c_[onesx, x, 2*x]
y = np.array([a*5 + 20 + random.randint(0,3) for a in range(100)])
print("the X shape is {}, and y shape is {}".format(X.shape, y.shape))

# ordinary  least squares (正规化方法)
weight = np.dot(np.dot(np.linalg.pinv(np.dot(X.T, X)), X.T), y)
print("OLS : the weight is{}, and the bais is {} ".format(weight[:-1], weight[-1]))
print("the predect of 5 ele is {}".format(np.dot(np.c_[np.arange(testsize), np.arange(testsize),np.ones(testsize)], weight)))


Lasso regression

we take the loss function from OLS and add the L1 regularization term (the Manhattan norm); putting it all together, the Lasso regression loss is L(w) = ‖y − Xw‖² + λ‖w‖₁

the minimum should lie at the intersection of the contours of the first term, whose center is the OLS solution, and the constraint region of the second term, the Lasso regularization term.

This drives many feature coefficients exactly to 0.

Ridge regression

we take the loss function from OLS and add the L2 regularization term (the squared Euclidean norm); putting it all together, the Ridge regression loss is L(w) = ‖y − Xw‖² + λ‖w‖₂²

the minimum should lie at the intersection of the contours of the first term, whose center is the OLS solution, and the constraint region of the second term, the Ridge regularization term.

This shrinks the feature coefficients to be as small as possible (but not exactly 0).

Elastic Net regression

combines Lasso regression and Ridge regression (both the L1 and the L2 penalty)
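A minimal sklearn sketch comparing the three penalties on toy data similar to the examples above (the alpha values and the noise feature are arbitrary choices):

import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

x = np.arange(100, dtype=float)
X = np.c_[x, 2 * x, np.random.rand(100)]      # collinear feature plus a noise feature
y = 5 * x + 20

for model in (Lasso(alpha=1.0, max_iter=50000),
              Ridge(alpha=1.0),
              ElasticNet(alpha=1.0, l1_ratio=0.5, max_iter=50000)):
    model.fit(X, y)
    # Lasso tends to zero out coefficients, Ridge only shrinks them
    print(type(model).__name__, np.round(model.coef_, 3), round(model.intercept_, 3))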

Decision List

(f1,v1), (f2,v2), ..., (fr,vr): each fi is a boolean term (a conjunction of literals), vi ∈ {0,1}, and the last term fr is always true. The list can be viewed as an extended if/else chain: the first fi that matches determines the output vi.

for 0 < k < n, k-CNF and k-DNF are properly contained in k-decision lists

linear regression

Linear Discriminant Analysis

Fisher's linear discriminant

The input consists of samples from classes j = 0, 1, with N_j samples in each class.

Fisher's linear discriminant with Kernel method

Numerator: the projected between-class scatter. Denominator: the projected within-class scatter.

Probabilistic Generative Model

Use Bayes' theorem to compute the probability of each possible class, and take the class with the largest value.

Two-class case

can be solved with the logistic sigmoid function

Multi-class case

can be solved with the softmax function

Probabilistic Discriminative Model

Better predictive performance if the assumptions about the class-conditional distributions are not correct.

Solved in the same way as the generative model, likewise with two-class and multi-class versions, but this class of problem is framed as logistic regression; see logistic regression.

Principal Component Analysis

PCA Algorithms

Form linear combinations of the original data coordinates to build a new basis, such that the projection of the data onto the discarded basis directions is minimized; those dimensions are then removed (compressed away). A numpy sketch follows the list below.

  1. Center the data.
  2. Compute the covariance matrix of the data over all features.
  3. If the matrix is square, eigendecompose it directly.
  4. If it is not square, multiply it by its transpose first and then eigendecompose; note that the resulting eigenvalues must then be square-rooted.
  5. If it is not square, SVD can also be applied directly.
  6. Keep the leading dimensions that are needed; the remaining ones are compressed away.
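A minimal numpy sketch of these steps (arbitrary toy 2-D data, keeping one component):

import numpy as np

rng = np.random.default_rng(0)
# toy data: 200 samples, 2 correlated features
x1 = rng.normal(size=200)
data = np.c_[x1, 2 * x1 + 0.3 * rng.normal(size=200)]

# 1. center the data
centered = data - data.mean(axis=0)
# 2. covariance matrix over the features
cov = np.cov(centered, rowvar=False)
# 3. the covariance matrix is square, so eigendecompose it directly
eigvals, eigvecs = np.linalg.eigh(cov)
# 6. keep the leading component (largest eigenvalue), compress the rest
order = np.argsort(eigvals)[::-1]
W = eigvecs[:, order[:1]]
projected = centered @ W
print(eigvals[order], projected.shape)       # explained variances, (200, 1)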

Probabilistic generative model for PCA

State the probabilistic generative model underlying Probabilistic PCA with a K-dimensional latent space and D-dimensional observations x. Define all three random variables and their distributions.

Hidden variable z in K dimensions for probabilistic generative PCA: we can take z to follow a standard Gaussian distribution, p(z) = N(z | 0, I)

observation variable x in D dimensions given z: p(x|z) = N(x | Wz + μ, σ²I)

So, the marginal is p(x) = N(x | μ, WWᵀ + σ²I)

K-Nearest Neighbor :classification:

This is based on the idea that instances of the same class are close to each other.

Algorithms

Select the k nearest neighbors of the new instance and label it with the majority class among those k neighbors.
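A minimal sklearn sketch (toy points and k = 3 chosen arbitrarily):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# toy 2-D points from two classes
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
# each new instance gets the majority label among its 3 nearest neighbors
print(knn.predict([[2, 2], [7, 9]]))          # -> [0 1]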

Decision tree :classification:

to find the "most informative feature"

Algorithms

On the training set, choose the split attribute by maximum information gain, then repeat with the remaining attributes within each subset. Information gain = entropy − conditional entropy.

For all remaining features, find the one with the largest Gain(D, a) and use that feature as the splitting criterion; repeat this over and over.
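A small numpy sketch of Gain(D, a) = entropy − conditional entropy after splitting on a (toy binary labels and one categorical feature, chosen arbitrarily):

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# toy data: feature a with values 0/1, binary class labels
a      = np.array([0, 0, 0, 0, 1, 1, 1, 1])
labels = np.array([1, 1, 1, 0, 0, 0, 0, 1])

gain = entropy(labels)
for v in np.unique(a):
    subset = labels[a == v]
    gain -= len(subset) / len(labels) * entropy(subset)   # weighted conditional entropy
print(gain)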

Random Forest :classification:

combining multiple decision trees into a single classifier. Random Forest = Decision Tree + Bagging + random features


Deeper trees: bias is reduced. More features considered per split: bias is reduced.


algorithms on decision tree

1, randomize the examples (bootstrap sampling) for training
2, randomize the features considered for training

Naive Bayes :classification:

The likelihood can be calculated with Multinomial Naive Bayes or Gaussian Naive Bayes; the latter is better for continuous features.

Assume that all attributes are conditionally independent.

The score for a class occurring under some conditions equals
the product of the conditional probabilities of each condition given that class, multiplied by the overall probability of that class.
Then compute, under the same conditions, the score for the class not occurring, and normalize.

Finally pick whichever is larger (watch out for attribute conditions with probability zero caused by insufficient samples).
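A minimal sklearn sketch with Gaussian Naive Bayes on toy continuous features (the data values are arbitrary):

import numpy as np
from sklearn.naive_bayes import GaussianNB

# toy continuous features, two classes
X = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [6.0, 9.0], [6.5, 8.5], [5.5, 9.5]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = GaussianNB()
clf.fit(X, y)
# normalized per-class posterior probabilities, then the argmax
print(clf.predict_proba([[1.1, 2.1]]))
print(clf.predict([[1.1, 2.1]]))              # -> [0]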

logistic regression :classification:

Odds

Odds = p / (1 − p): the ratio of the probability that x happens to the probability that it does not happen.

fit the logit with linear regression: ln(p / (1 − p)) = wᵀx + b

Odds ratio

Odds ratio of a feature: e^{w_i}

if the odds ratio equals 2, then when feature x increases by one, the odds are multiplied by 2.

if the odds ratio is bigger than one, the odds increase as x increases; if the odds ratio is smaller than one, the odds decrease as x increases

algorithms

Logistic regression tries to estimate the logarithm of the odds that an instance belongs to a class,
i.e. ln(p / (1 − p)), which is nothing else but the logarithm of the odds that the instance is of that class.

Inverting this gives the same form as the sigmoid function: p = σ(wᵀx + b). Use the logistic sigmoid for binary classification and check whether the probability exceeds one half.
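A minimal sklearn sketch; the fitted coefficient is the log-odds weight, so exp(coef_) gives the odds ratio discussed above (toy data chosen arbitrarily):

import numpy as np
from sklearn.linear_model import LogisticRegression

# toy 1-D data: class 1 becomes more likely as x grows
X = np.array([[0], [1], [2], [3], [4], [5], [6], [7]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)
print("weight (log-odds per unit x):", clf.coef_[0][0])
print("odds ratio:", np.exp(clf.coef_[0][0]))
print("P(class=1 | x=3.5):", clf.predict_proba([[3.5]])[0, 1])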

Support Vector Machine :classification:

without soft margin

For separating the points, by decision theory: points one unit from the separating hyperplane are labeled y = +1 on one side and y = −1 on the other. Maximizing the distance between the + points and the − points is equivalent to minimizing ½‖w‖², together with the constraints yᵢ(wᵀxᵢ + b) ≥ 1. Form the Lagrangian L, set the partial derivatives of L with respect to w and b to 0, and substitute back into L to obtain the dual problem.

with soft margin

When the data cannot be separated perfectly by a hyperplane, some points may be allowed to enter the empty margin region (the region from −1 to 1).


slack variable ϵᵢ ≥ 0; C controls the trade-off between slack and margin; ϵᵢ > 1 if the point is misclassified


this L satisfies the KKT conditions and can be solved.


correctly classified    a = 0        ϵ = 0
on the margin           0 < a < C    ϵ = 0
violates the margin     a = C        ϵ > 0
misclassified                        ϵ > 1


kernel function

map the data into a higher-dimensional space in which it becomes separable
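A minimal sklearn sketch: an RBF kernel separates data that is not linearly separable in the original space (toy ring-shaped data, parameters chosen arbitrarily):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# toy data: class 0 inside a ring, class 1 outside (not linearly separable)
r = np.r_[rng.uniform(0, 1, 100), rng.uniform(2, 3, 100)]
phi = rng.uniform(0, 2 * np.pi, 200)
X = np.c_[r * np.cos(phi), r * np.sin(phi)]
y = np.r_[np.zeros(100), np.ones(100)]

clf = SVC(kernel="rbf", C=1.0)                # soft margin controlled by C
clf.fit(X, y)
print(clf.score(X, y))                        # close to 1.0 on this toy data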

Neural network :classification:

Backpropagation

Perceptron

Backward correction with respect to x, through the perceptron's transfer function; since the perceptron has no activation function, the derivative is taken directly.

Multi-layer neural network

For a multi-layer neural network there is an activation function between the layers and a loss function at the output.

For a multi-layer network the gradient has to be computed layer by layer; the w in each factor is the weight of the corresponding layer, and the chain rule propagates from the final loss L step by step back to each w.
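A minimal numpy sketch of backpropagation through one hidden layer (sigmoid activations, ½(a − y)² loss; the shapes and learning rate are arbitrary choices):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))                   # input
y = np.array([[1.0]])                         # target
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(1, 3))
lr = 0.1

for step in range(1000):
    # forward pass
    h = sigmoid(W1 @ x)                       # hidden layer
    a = sigmoid(W2 @ h)                       # output
    # backward pass: chain rule from the loss back to each layer's weights
    delta2 = (a - y) * a * (1 - a)            # dL/dz2 for L = 0.5*(a-y)^2
    delta1 = (W2.T @ delta2) * h * (1 - h)    # propagate through W2 and the sigmoid
    W2 -= lr * delta2 @ h.T
    W1 -= lr * delta1 @ x.T

print(a.item())                               # close to the target 1.0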

K-Means :Cluster:

K-means is an example of centroid-based clustering. We can determine the cluster of any instance as the one with the nearest centroid.

Algorithm

Input sample set D: x_1, x_2, ..., x_m
number of clusters k,
maximum number of iterations N,
desired output: clusters C_1, C_2, ..., C_k

Randomly initialize k cluster centers and give them distinct labels
for i = 1, 2, .., N:
    compute the distance of every point to every center, label each point with its nearest center, and partition all of X this way
    for all points with the same label, update the center as their arithmetic mean; repeat the previous step until nothing changes or the iteration limit is reached
import random
import numpy as np
import matplotlib.pyplot as plt

# toy data: 100 noisy copies of three cluster centers
b = []
for i in range(100):
    a = np.array(list([(20,50),(30,10),(60,30)]))
    for j in range(a.shape[0]):
        for k in range(a.shape[1]):
            a[j][k] += random.randint(0,30)
        b.append(a[j])

b = np.array(b)
plt.plot(b[:,0], b[:,1], 'ro')
plt.title("toy data")
plt.show()


# sklearn API
from sklearn.cluster import KMeans
y_pred = KMeans(n_clusters=3, random_state=9).fit_predict(b)
plt.scatter(b[:, 0], b[:, 1], c=y_pred)
plt.title("toy data with sklearn API")
plt.show()

# manual
def findClosestCentroids(X, centroids):
    # assign every point to the index of its nearest centroid
    distance = np.zeros((len(X), len(centroids)))
    for i in range(len(X)):
        for j in range(len(centroids)):
            distance[i, j] = np.linalg.norm(X[i, :] - centroids[j, :])
    return np.argmin(distance, axis=1)

def computeCentroids(X, idx, K):
    # move each centroid to the mean of the points assigned to it
    centroids = np.zeros((K, X.shape[1]))
    for i in range(K):
        centroids[i, :] = np.mean(X[idx == i], axis=0)
    return centroids


def runkMeans(X, K, max_iters):
    # initialize the centroids with K distinct random points from X
    indexs = np.random.choice(np.array(range(len(X))), K, replace=False)
    centroids = X[indexs]
    for max_iter in range(max_iters):
        idx = findClosestCentroids(X, centroids)
        centroids = computeCentroids(X, idx, K)
        for i in range(K):
            plt.scatter(X[idx == i, 0], X[idx == i, 1])
        plt.scatter(centroids[:, 0], centroids[:, 1], c='r')
        plt.title("toy data with manual {} time".format(max_iter))
        plt.show()

K = 3
max_iters = 3
runkMeans(b, K, max_iters)

select the best k

based on domain knowledge:

if the clusters are not internally similar, increase k; if similar objects end up in different clusters, decrease k

Visualizations

this gives us an intuitive view

Within-Sum-of-Squares

minimizing the intra-cluster variance

Problems

1, k -Means is sensitive to the initial clusters.

2, An unsuitable value of k may lead to bad results.

3, All features must have a similar range of values, otherwise features with large ranges dominate the distance.

4, Because the cluster assignment is based on the distance, clusters tend to be round.

EM algorithms for Gaussian Mixture model :Cluster:

This concept is called distribution-based clustering. Each instance is assigned to the most likely cluster, and the initialization uses random mean values and random covariance matrices at first.

Selecting k

Use the Bayesian Information Criterion (BIC) to choose the number of clusters K.
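A minimal sklearn sketch: fit GaussianMixture for several K and pick the one with the lowest BIC (toy blob data, centers chosen arbitrarily):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# toy data: three Gaussian blobs
X = np.vstack([rng.normal(loc, 0.5, size=(100, 2)) for loc in ([0, 0], [5, 5], [0, 5])])

bics = {}
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bics[k] = gmm.bic(X)
print(min(bics, key=bics.get), bics)          # usually K = 3 on this data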

Algorithm

Initialize π_k, μ_k, Σ_k for k = 1, ..., K.

E step: compute the responsibilities γ(z_nk) given the current parameters

γ(z_nk) describes the probability that example n belongs to component k.

M step: update π, μ, Σ given the responsibilities, according to the derivative of $\log p(X|\pi, \mu, \Sigma) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_{k}\, \mathcal{N}(x_{n}|\mu_{k}, \Sigma_{k})$ with respect to each parameter:

cluster number: $N_{k} = \sum_{n=1}^{N} \gamma(z_{nk})$

cluster means: $\mu_{k} = \frac{1}{N_{k}} \sum_{n=1}^{N} \gamma(z_{nk})\, x_{n}$

cluster covariances: $\Sigma_{k} = \frac{1}{N_{k}} \sum_{n=1}^{N} \gamma(z_{nk})\, (x_{n}-\mu_{k})(x_{n}-\mu_{k})^{T}$

cluster priors: $\pi_{k} = \frac{N_{k}}{N}$

problem

1, This is also sensitive to the initialization.

2, An unsuitable value of k may lead to bad results.

3, clusters tend to be round or elliptical, which is still not suitable for half-moon shapes

DBSCAN :Cluster:

This is density-based clustering

concepts: ϵ and minPts

neighbors(x) = {x' ∈ X: d(x, x') <= ϵ}

neighbors(x) is dense if |neighbors(x)| >= minPts

core(X) = {x ∈ X: |neighbors(x)| >= minPts} are called core points; every dense neighborhood contains at least one core point

Algorithm

1, randomly select one core point as the seed of the first cluster
2, grow the first cluster with its neighbors
3, grow again through the core points among those neighbors, again and again
4, select another core point as the next cluster and repeat
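A minimal sklearn sketch of this procedure (eps and min_samples are arbitrary choices for this toy data):

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# two dense blobs plus a few scattered noise points
X = np.vstack([rng.normal([0, 0], 0.3, size=(50, 2)),
               rng.normal([5, 5], 0.3, size=(50, 2)),
               rng.uniform(-2, 8, size=(5, 2))])

labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
print(set(labels))                            # clusters 0, 1 and -1 for noise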

selecting ϵ and minPts

pick the value at the point of largest curvature (the sharpest change in the distance plot) for different ϵ and minPts

problems

1, difficult selection of ϵ and minPts
2, clusters of different densities are a problem
3, sensitive to feature scaling

Single Linkage Clustering :Cluster:

hierarchical clustering

Algorithm

1, start with every instance as its own cluster
2, grow the merge distance to merge clusters until only one cluster remains

problems

1, large storage requirements, poor scalability
2, noise is not handled and leads to small clusters
3, sensitive to feature scaling