#+TITLE: MachineLearning

#+STARTUP: overview

Math for ML

Self Entropy

When one event happens, how much information is present, i.e. how meaningful it is.
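For reference, the standard definition of self-information (log base 2 gives bits):

\begin{equation}
I(x) = -\log p(x)
\end{equation}

The rarer the event, the larger its information content; the information entropy below is the expectation of this quantity.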

Information Entropy

Cross Entropy

Conditional Entropy

Given that Y has already occurred, the remaining entropy (uncertainty) of X.
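Written out (the standard identity):

\begin{equation}
H(X|Y) = \sum_{y} p(y)\, H(X|Y=y) = -\sum_{x,y} p(x,y) \log p(x|y)
\end{equation}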

Gini

Gini index (基尼系数), a measure of impurity.

KL

KL divergence: measures the difference between two distributions p and q.
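The standard definition, and its relation to the cross entropy above:

\begin{equation}
D_{KL}(p\,\|\,q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}, \qquad H(p, q) = H(p) + D_{KL}(p\,\|\,q)
\end{equation}

Note that it is not symmetric in p and q.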

Matrix

Conjugate transpose Matrix

Conjugate transpose matrix (共轭转置矩阵): conjugating first and then transposing, or transposing first and then conjugating, gives the same result.

Normal Matrix

A normal matrix (正规矩阵) commutes with its conjugate transpose, A Aᴴ = Aᴴ A, and can be unitarily diagonalized.

Unitary Matrix

A unitary matrix is a normal matrix for which the product with its conjugate transpose equals the identity: U Uᴴ = Uᴴ U = I.

orthogonal matrix

A real unitary matrix:

its transpose is its inverse, Qᵀ Q = Q Qᵀ = I.

Hermitian matrix

Conjugate symmetric (埃尔米特矩阵 / 厄米矩阵): A = Aᴴ, i.e. elements mirrored across the diagonal are complex conjugates of each other.

Eigenvalue decomposition (Eigenwertzerlegung)

Compute the eigenvalues and eigenvectors of a matrix (eigenvalue decomposition). The matrix must be square; if it is not, first convert it: A = a*a.T

import numpy as np
a = np.mat([[1,2,3,4],[1,1,1,1]])
A = a*a.T   # a is not square, so form the square matrix A = a a^T first
B, U = np.linalg.eig(A)
print("eigenvalues of A : ")
print(B)
print("singular values of a (the square roots of the eigenvalues of A) : ")
print(np.sqrt(B))
print("eigenvectors of A : ")
print(U)

SVD Singular value decomposition

For a general matrix that is not square, singular value decomposition can be used directly:

import numpy as np
a = np.mat([[1,2,3,4],[1,1,1,1]])
U, B, Vt = np.linalg.svd(a)
print("left singular vectors : ")
print(U)
print("singular values of a : ")
print(B)
print("right singular vectors (transposed) : ")
print(Vt)

Bayes's Rule

if x and y are independent: p(x, y) = p(x)p(y)

Conditional probability:
conditional probability = joint probability / marginal probability, i.e. P(A|B) = P(A, B) / P(B).
Prior and posterior probabilities are both conditional probabilities; the one conditioned on what is already known is the prior.

Law of total probability

Bayes' formula

Bayes' formula + law of total probability + conditional probability:
\begin{eqnarray*}
P(A|B) &=& \frac{P(B|A)P(A)}{P(B)} \\
       &=& \frac{P(B|A)P(A)}{\sum_{i=1}^{n} P(B, A_{i})} \\
       &=& \frac{P(B|A)P(A)}{\sum_{i=1}^{n} P(B|A_{i})P(A_{i})}
\end{eqnarray*}

Covariance Matrix

For i = 1, ..., n, x_i is a random variable that follows a Gaussian distribution.

Set x̄_i to be the mean of x_i.

Covariance matrix:

\begin{equation}
\Sigma = \mathbb{E}\left[
  \left(
  \begin{array}{c}
          x_{1}-\bar{x}_{1} \\
          x_{2}-\bar{x}_{2} \\
          x_{3}-\bar{x}_{3} \\
          \vdots            \\
          x_{n}-\bar{x}_{n}
 \end{array}
 \right)
  \left(
  \begin{array}{ccccc}
          x_{1}-\bar{x}_{1} &
          x_{2}-\bar{x}_{2} &
          x_{3}-\bar{x}_{3} &
          \cdots            &
          x_{n}-\bar{x}_{n}
  \end{array}
  \right)
\right]
\end{equation}

The diagonal entries are the variances of the individual variables; the off-diagonal entries are the covariances between pairs of variables.
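A quick numpy check of this definition (a minimal sketch with arbitrary toy data; np.cov with bias=True matches the 1/N outer-product average):

import numpy as np

# three variables observed 100 times
rng = np.random.default_rng(0)
data = rng.normal(size=(3, 100))

# manual centered outer-product average (1/N normalization)
centered = data - data.mean(axis=1, keepdims=True)
sigma_manual = centered @ centered.T / data.shape[1]

# numpy's covariance matrix (bias=True uses the same 1/N normalization)
sigma_np = np.cov(data, bias=True)
print(np.allclose(sigma_manual, sigma_np))  # True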

Regularization

The L regularization term penalizes the weights of the network; it is added to the loss as a scalar.


N: number of examples. M: number of features (dimensions).
L0: counts the non-zero weights in the network.
L1: the sum of the absolute values of all weights; pushes the network toward sparse weight vectors.
L2: the sum of the squares of all weights; pushes the network toward small weights.
w: the weights w_1, w_2, ...
q = 1: L1 regularization; q = 2: L2 regularization.
λ: the regularization coefficient.
The contours of the regularizer are concentric around w = 0: a diamond for q = 1, a circle for q = 2.


To minimize the loss, the contour of the data term (part 1) just touches the contour of the regularization term (part 2). For L1 the touching point tends to lie on a coordinate axis, so small weight components are set exactly to 0.
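Putting the two parts together, the regularized loss has the usual form (λ is the regularization coefficient, q selects the norm):

\begin{equation}
\tilde{E}(w) = E_{D}(w) + \lambda \sum_{j=1}^{M} |w_{j}|^{q}
\end{equation}

with q = 1 for L1 regularization and q = 2 for L2 regularization.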

Error: Bias-Variance Trade-off

Error = Bias² + Variance + noise

Bias: underfitting; influenced by how the model is applied and what is observed.
Variance: overfitting; stability, determined by how the model is constructed.
Noise: the inherent difficulty of the problem.


Multivariate Gaussian distribution

See the linked article on Zhihu (知乎).
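For reference, the density that the code below implements:

\begin{equation}
\mathcal{N}(x|\mu,\Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right)
\end{equation}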


import numpy as np

def gaussian(x, mean, cov):
    dim = np.shape(cov)[0]  # dimensionality
    # a small multiple of the identity is added to avoid a singular covariance matrix
    covdet = np.linalg.det(cov + np.eye(dim)*0.01)  # determinant of the covariance matrix
    covinv = np.linalg.inv(cov + np.eye(dim)*0.01)  # inverse of the covariance matrix
    xdiff = x - mean
    # probability density
    prob = 1.0/np.power(2*np.pi, 1.0*dim/2)/np.sqrt(np.abs(covdet))*np.exp(-1.0/2*np.dot(np.dot(xdiff, covinv), xdiff))
    return prob

Mahalanobis distance

The transform used by the Mahalanobis distance: with the eigendecomposition Σ = U Λ Uᵀ, map x to y = Λ^{-1/2} Uᵀ (x − μ).

In the new coordinates, U is the rotation of the transform, Λ^{-1/2} rescales the basis, and Uᵀ(x − μ) is the projection onto it; afterwards, in the new coordinates, the data follows a standard, uncorrelated multivariate Gaussian distribution.
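A minimal numpy sketch of this idea: the Mahalanobis distance equals the Euclidean distance after the whitening transform (covariance matrix and points below are arbitrary example values):

import numpy as np

# arbitrary example covariance and points, chosen only for illustration
Sigma = np.array([[4.0, 1.5],
                  [1.5, 1.0]])
mu = np.array([1.0, 2.0])
x = np.array([3.0, 1.0])

# Mahalanobis distance directly from the definition
d_direct = np.sqrt((x - mu) @ np.linalg.inv(Sigma) @ (x - mu))

# same distance via the whitening transform y = Lambda^{-1/2} U^T (x - mu)
lam, U = np.linalg.eigh(Sigma)          # Sigma = U diag(lam) U^T
y = np.diag(1.0 / np.sqrt(lam)) @ U.T @ (x - mu)
d_whitened = np.linalg.norm(y)

print(d_direct, d_whitened)             # both values agree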

K-fold Cross Validation


N: total number of examples. K: number of folds. m: number of examples per fold.

Large K: small bias, but tends toward overfitting and high variance. Small K: large bias (underfitting), low variance.
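A minimal sklearn sketch (the toy data and the LinearRegression estimator are just placeholders):

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

# toy regression data: N examples, 1 feature
X = np.arange(100).reshape(-1, 1).astype(float)
y = 5 * X.ravel() + 20 + np.random.randint(0, 3, size=100)

# K = 5 folds: each fold of m = N/K examples is used once as the validation set
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=kf)
print(scores, scores.mean())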


confusion matrix


                     Actually True    Actually False

Predict Positive     TP               FP
Predict Negative     FN               TN


True Positive Rate (recall)

namely TPR = TP / (TP + FN); the y axis of the ROC curve

False Positive Rate

namely FPR = FP / (FP + TN); the x axis of the ROC curve

ROC Receiver Operating Characteristic

Under an acceptable x value (FPR = 1 − specificity), we want the best possible y value (sensitivity/TPR). Moving from one end of the curve to the other corresponds to classifying everything as positive through to classifying everything as negative.

AUC Area under the Curve

The larger the area, the better the method; we use this metric to choose between methods.

F1

harmonic mean of recall and precision: F1 = 2 · precision · recall / (precision + recall)

Ordinal attributes: high/medium/low → (3, 2, 1). Non-ordinal (categorical) attributes: pig/dog/sheep → one-hot (1,0,0), (0,1,0), (0,0,1).

Macro Micro Average

This is for multi-class classification: first compute, for each class, a one-vs-rest confusion matrix as in binary classification.

Macro Average

Micro Average
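A small sklearn sketch of the difference between the two averages (toy labels chosen arbitrarily):

from sklearn.metrics import f1_score

# toy multi-class labels (3 classes)
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 2, 2, 2, 2, 0, 2]

# macro: compute F1 per class, then average the per-class scores
print(f1_score(y_true, y_pred, average="macro"))
# micro: pool TP/FP/FN over all classes, then compute a single F1
print(f1_score(y_true, y_pred, average="micro"))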

Jacobian matrix

The transformation from spherical to Cartesian coordinates is given by the function F: ℝ⁺ × [0, π] × [0, 2π) → ℝ³ with components x = r sin θ cos φ, y = r sin θ sin φ, z = r cos θ. The Jacobian matrix of this coordinate transformation is

\begin{equation}
J_{F}(r,\theta,\varphi) =
  \left(
  \begin{array}{ccc}
    \sin\theta\cos\varphi & r\cos\theta\cos\varphi & -r\sin\theta\sin\varphi \\
    \sin\theta\sin\varphi & r\cos\theta\sin\varphi &  r\sin\theta\cos\varphi \\
    \cos\theta            & -r\sin\theta           &  0
  \end{array}
  \right)
\end{equation}

and its Jacobian determinant is r² sin θ. Since dV = dx dy dz, under this change of variables the volume element becomes dV = r² sin θ dr dθ dφ.

Pre-pruning and post-pruning

Pre-pruning: after the split attribute is chosen on the training set by maximum information gain, check the split with a validation set; if the validation entropy increases (the generalization gets worse), reject the split and make the node a leaf.

Post-pruning: after the tree is complete, check every internal node with the validation set from bottom to top; if removing a split gives lower entropy, remove that split.

Continuous and missing attribute values

Discretizing continuous values: sort the n values of the attribute, take the midpoints of the n−1 intervals as candidate thresholds, scan all candidates, and use the one with maximum information gain for a binary split.

Missing values: take the subset of examples where the attribute is not missing, compute the information gain on it weighted by the corresponding proportion, and proceed as before. If the chosen split attribute has missing values, the split is not applied to those samples directly; instead each such sample is copied into every child subset.

activation function

activation

The output lies in the real numbers or in some interval and changes continuously; the output value can be compared directly with the true value.

Sigmoid

Derivative: σ'(x) = σ(x)(1 − σ(x))

ReLU

LeakyReLU

Tanh

Derivative: tanh'(x) = 1 − tanh²(x)

MSE

Derivative (for L = ½(a − y)²): ∂L/∂a = a − y

Information entropy losses

softmax

The probabilities over all classes sum to 1. Derivative (with a = softmax(z)): ∂a_i/∂z_k = a_i(1 − a_i) if k = i, and −a_i a_k if k ≠ i.

cross entropy

When computing the cross entropy, it is usually used together with the softmax function.

For one-hot coding: when the predicted probability of the true class is 1, the prediction is correct and the cross entropy is 0. Derivative:

Using the softmax derivative above and splitting into the cases k = i and k ≠ i gives ∂L/∂z_i = a_i − y_i.

This is why cross entropy trains very fast: if a is far away from y, the update is simply their difference.
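A minimal numpy sketch that checks the ∂L/∂z = a − y result numerically (toy logits and a one-hot target chosen arbitrarily):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # stabilized softmax
    return e / e.sum()

def cross_entropy(a, y):
    return -np.sum(y * np.log(a))

z = np.array([1.0, -0.5, 2.0])
y = np.array([0.0, 0.0, 1.0])        # one-hot target

a = softmax(z)
analytic = a - y                      # gradient of CE(softmax(z), y) w.r.t. z

# numerical gradient via central differences for comparison
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    numeric[i] = (cross_entropy(softmax(zp), y) - cross_entropy(softmax(zm), y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))   # True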

example bagging

Sample with replacement several times, and train multiple trees in parallel, each on a different bootstrap sample.


More trees: the bias stays the same, the variance is reduced toward a limit.


Therefore, start with a strong model that has small bias and large variance.

Boosting

Fixed data set; models are trained sequentially (in series) and are strongly correlated with each other; to prevent overfitting, use weak models.

Gradient descent

When there are many data points, the normal-equation method becomes computationally very expensive, so gradient descent is used more often.

sklearn API

import numpy as np
import random
from sklearn import linear_model
testsize = 5

x = np.array([a for a in range(100)])
onesx = np.ones(x.shape)
X = np.c_[x, 2*x, onesx]
y = np.array([a*5 + 20 + random.randint(0,3) for a in range(100)])
print("the X shape is {}, and y shape is {}".format(X.shape, y.shape))

# Sklearn API
reg = linear_model.LinearRegression()
model = reg.fit(X,y)
print("Sklearn: the weith is {}, and the intercept is {}".format(model.coef_[:-1] ,model.intercept_))
print("the predect of 3 ele is {}".format(model.predict(np.c_[np.arange(testsize), np.arange(testsize),np.ones(testsize)])))


# manual
def featureNormalize(X):
    # standardize each column to zero mean and unit variance (defined here but not used below)
    (m, n) = X.shape
    X_norm = X.copy()
    mu = np.zeros(n)
    sigma = np.zeros(n)
    for i in range(n):
        mu[i] = np.mean(X[:, i])
        sigma[i] = np.std(X[:, i])
        X_norm[:, i] = (X_norm[:, i] - mu[i]) / sigma[i]
    return X_norm
def computeCost(X, y, theta):
    return np.sum((np.dot(X,theta) -y)**2)/(2*len(y));

def gradientDescent(X, y, theta, alpha, num_iters):
    m = len(y)
    J_history = np.zeros(num_iters);
    theta_len = len(theta);
    for num_iter in range(num_iters):
        theta = theta - (alpha/m)*np.dot(X.T,(np.dot(X,theta).reshape(-1)-y))
        J_history[num_iter] = computeCost(X, y, theta)
    return theta, J_history

alpha = 0.0001
num_iters = 400000
theta = np.zeros(2+1)
theta, J_history = gradientDescent(X, y, theta, alpha, num_iters)
print("Greadient decent: the weight is {}, and the intercept is {}".format(theta[:-1],theta[-1]))
print("the predect of 3 ele is {}".format(np.dot(np.c_[np.arange(testsize), np.arange(testsize),np.ones(testsize)], theta)))

Ordinary Least Squares (OLS)

Normal equation

The normal equation can be derived by maximum likelihood with a multivariate Gaussian model; it solves for the weights and the bias together:

But a column of ones has to be appended to both the train and the test data; whether it goes in front or at the end is somewhat arbitrary.

For now, putting it at the end means the variables and the parameters can be accessed as needed.

Loss function: take the derivative and set it to 0, which gives w = (XᵀX)⁻¹ Xᵀ y; this requires XᵀX to be invertible.

Normal equation in code:

import numpy as np
import random
from sklearn import linear_model
testsize = 5

x = np.array([a for a in range(100)])
onesx = np.ones(x.shape)
X = np.c_[x,onesx]
y = np.array([a*5 + 20 + random.randint(0,3) for a in range(100)])
print("the X shape is {}, and y shape is {}".format(X.shape, y.shape))

# weight = np.dot(np.dot(np.linalg.pinv(np.dot(X.T, X)), X.T), y)
weight = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)

print("OLS : the weight is{}, and the bais is {} ".format(weight[:-1], weight[-1]))
print("the predect of 5 ele is {}".format(np.dot(np.c_[np.arange(testsize),np.ones(testsize)], weight)))


It also works with multiple variables:
import numpy as np
import random
from sklearn import linear_model
testsize = 5

x = np.array([a for a in range(100)])
onesx = np.ones(x.shape)
X = np.c_[onesx, x, 2*x]
y = np.array([a*5 + 20 + random.randint(0,3) for a in range(100)])
print("the X shape is {}, and y shape is {}".format(X.shape, y.shape))

# ordinary  least squares (正规化方法)
weight = np.dot(np.dot(np.linalg.pinv(np.dot(X.T, X)), X.T), y)
print("OLS : the weight is{}, and the bais is {} ".format(weight[:-1], weight[-1]))
print("the predect of 5 ele is {}".format(np.dot(np.c_[np.arange(testsize), np.arange(testsize),np.ones(testsize)], weight)))


Lasso regression

we take the loss function from OLS and add the L1 regularization term (the Manhattan norm); putting it all together, the Lasso regression loss is L(w) = ‖y − Xw‖² + λ‖w‖₁

the minimum should lie at the intersection of the contours of the first term, whose center is the OLS solution, and the constraint region of the second term, the Lasso regularization term.

This drives many feature coefficients exactly to 0.

Ridge regression

we take the loss function from OLS and add the L2 regularization term (the squared Euclidean norm); putting it all together, the Ridge regression loss is L(w) = ‖y − Xw‖² + λ‖w‖₂²

the minimum should lie at the intersection of the contours of the first term, whose center is the OLS solution, and the constraint region of the second term, the Ridge regularization term.

This shrinks the feature coefficients to be as small as possible (but not exactly 0).

Elastic Net regression

combines Lasso regression and Ridge regression (both the L1 and the L2 penalty)
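A minimal sklearn sketch comparing the three penalties on toy data similar to the examples above (the alpha values and the noise feature are arbitrary choices):

import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

x = np.arange(100, dtype=float)
X = np.c_[x, 2 * x, np.random.rand(100)]      # collinear feature plus a noise feature
y = 5 * x + 20

for model in (Lasso(alpha=1.0, max_iter=50000),
              Ridge(alpha=1.0),
              ElasticNet(alpha=1.0, l1_ratio=0.5, max_iter=50000)):
    model.fit(X, y)
    # Lasso tends to zero out coefficients, Ridge only shrinks them
    print(type(model).__name__, np.round(model.coef_, 3), round(model.intercept_, 3))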

Decision List

(f1,v1), (f2,v2), ..., (fr,vr): each fi is a boolean term (a conjunction of literals), vi ∈ {0,1}, and the last term fr is always true. The list can be viewed as an extended if/else chain: the first fi that matches determines the output vi.

for 0 < k < n, k-CNF and k-DNF are properly contained in k-decision lists

linear regression

Linear Discriminant Analysis

Fisher's linear discriminant

The input consists of samples from classes j = 0, 1, with N_j samples in each class.

Fisher's linear discriminant with Kernel method

Numerator: the projected between-class scatter. Denominator: the projected within-class scatter.

Probabilistic Generative Model

Use Bayes' theorem to compute the probability of each possible class, and take the class with the largest value.

Two-class case

can be solved with the logistic sigmoid function

Multi-class case

can be solved with the softmax function

Probabilistic Discriminative Model

Better predictive performance if the assumptions about the class-conditional distributions are not correct.

Solved in the same way as the generative model, likewise with two-class and multi-class versions, but this class of problem is framed as logistic regression; see logistic regression.

Principal Component Analysis

PCA Algorithms

Form linear combinations of the original data coordinates to build a new basis, such that the projection of the data onto the discarded basis directions is minimized; those dimensions are then removed (compressed away). A numpy sketch follows the list below.

  1. Center the data.
  2. Compute the covariance matrix of the data over all features.
  3. If the matrix is square, eigendecompose it directly.
  4. If it is not square, multiply it by its transpose first and then eigendecompose; note that the resulting eigenvalues must then be square-rooted.
  5. If it is not square, SVD can also be applied directly.
  6. Keep the leading dimensions that are needed; the remaining ones are compressed away.
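A minimal numpy sketch of these steps (arbitrary toy 2-D data, keeping one component):

import numpy as np

rng = np.random.default_rng(0)
# toy data: 200 samples, 2 correlated features
x1 = rng.normal(size=200)
data = np.c_[x1, 2 * x1 + 0.3 * rng.normal(size=200)]

# 1. center the data
centered = data - data.mean(axis=0)
# 2. covariance matrix over the features
cov = np.cov(centered, rowvar=False)
# 3. the covariance matrix is square, so eigendecompose it directly
eigvals, eigvecs = np.linalg.eigh(cov)
# 6. keep the leading component (largest eigenvalue), compress the rest
order = np.argsort(eigvals)[::-1]
W = eigvecs[:, order[:1]]
projected = centered @ W
print(eigvals[order], projected.shape)       # explained variances, (200, 1)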

Probabilistic generative model for PCA

State the probabilistic generative model underlying Probabilistic PCA with a K-dimensional latent space and D-dimensional observations x. Define all three random variables and their distributions.

Hidden variable z in K dimensions for probabilistic generative PCA: we can take z to follow a standard Gaussian distribution, p(z) = N(z | 0, I)

observation variable x in D dimensions given z: p(x|z) = N(x | Wz + μ, σ²I)

So, the marginal is p(x) = N(x | μ, WWᵀ + σ²I)

K-Nearest Neighbor :classification:

This is based on the idea that instances of the same class are close to each other.

Algorithms

Select the k nearest neighbors of the new instance and label it with the majority class among those k neighbors.
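A minimal sklearn sketch (toy points and k = 3 chosen arbitrarily):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# toy 2-D points from two classes
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
# each new instance gets the majority label among its 3 nearest neighbors
print(knn.predict([[2, 2], [7, 9]]))          # -> [0 1]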

Decision tree :classification:

to find the "most informative feature"

Algorithms

On the training set, choose the split attribute by maximum information gain, then repeat with the remaining attributes within each subset. Information gain = entropy − conditional entropy.

For all remaining features, find the one with the largest Gain(D, a) and use that feature as the splitting criterion; repeat this over and over.
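A small numpy sketch of Gain(D, a) = entropy − conditional entropy after splitting on a (toy binary labels and one categorical feature, chosen arbitrarily):

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# toy data: feature a with values 0/1, binary class labels
a      = np.array([0, 0, 0, 0, 1, 1, 1, 1])
labels = np.array([1, 1, 1, 0, 0, 0, 0, 1])

gain = entropy(labels)
for v in np.unique(a):
    subset = labels[a == v]
    gain -= len(subset) / len(labels) * entropy(subset)   # weighted conditional entropy
print(gain)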

Random Forest :classification:

combining multiple decision trees into a single classifier. Random Forest = Decision Tree + Bagging + random features


Deeper trees: bias is reduced. More features considered per split: bias is reduced.


algorithms on decision tree

1, randomize the examples (bootstrap sampling) for training
2, randomize the features considered for training

Naive Bayes :classification:

The likelihood can be calculated with Multinomial Naive Bayes or Gaussian Naive Bayes; the latter is better for continuous features.

Assume that all attributes are conditionally independent.

The score for a class occurring under some conditions equals
the product of the conditional probabilities of each condition given that class, multiplied by the overall probability of that class.
Then compute, under the same conditions, the score for the class not occurring, and normalize.

Finally pick whichever is larger (watch out for attribute conditions with probability zero caused by insufficient samples).
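A minimal sklearn sketch with Gaussian Naive Bayes on toy continuous features (the data values are arbitrary):

import numpy as np
from sklearn.naive_bayes import GaussianNB

# toy continuous features, two classes
X = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [6.0, 9.0], [6.5, 8.5], [5.5, 9.5]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = GaussianNB()
clf.fit(X, y)
# normalized per-class posterior probabilities, then the argmax
print(clf.predict_proba([[1.1, 2.1]]))
print(clf.predict([[1.1, 2.1]]))              # -> [0]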

logistic regression :classification:

Odds

Odds = p / (1 − p): the ratio of the probability that x happens to the probability that it does not happen.

fit the logit with linear regression: ln(p / (1 − p)) = wᵀx + b

Odds ratio

Odds ratio of a feature: e^{w_i}

if the odds ratio equals 2, then when feature x increases by one, the odds are multiplied by 2.

if the odds ratio is bigger than one, the odds increase as x increases; if the odds ratio is smaller than one, the odds decrease as x increases

algorithms

Logistic regression tries to estimate the logarithm of the odds that an instance belongs to a class,
i.e. ln(p / (1 − p)), which is nothing else but the logarithm of the odds that the instance is of that class.

Inverting this gives the same form as the sigmoid function: p = σ(wᵀx + b). Use the logistic sigmoid for binary classification and check whether the probability exceeds one half.
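A minimal sklearn sketch; the fitted coefficient is the log-odds weight, so exp(coef_) gives the odds ratio discussed above (toy data chosen arbitrarily):

import numpy as np
from sklearn.linear_model import LogisticRegression

# toy 1-D data: class 1 becomes more likely as x grows
X = np.array([[0], [1], [2], [3], [4], [5], [6], [7]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)
print("weight (log-odds per unit x):", clf.coef_[0][0])
print("odds ratio:", np.exp(clf.coef_[0][0]))
print("P(class=1 | x=3.5):", clf.predict_proba([[3.5]])[0, 1])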

Support Vector Machine :classification:

without soft margin

For separating the points, by decision theory: points one unit from the separating hyperplane are labeled y = +1 on one side and y = −1 on the other. Maximizing the distance between the + points and the − points is equivalent to minimizing ½‖w‖², together with the constraints yᵢ(wᵀxᵢ + b) ≥ 1. Form the Lagrangian L, set the partial derivatives of L with respect to w and b to 0, and substitute back into L to obtain the dual problem.

with soft margin

When the data cannot be separated perfectly by a hyperplane, some points may be allowed to enter the empty margin region (the region from −1 to 1).


slack variable ϵᵢ ≥ 0; C controls the trade-off between slack and margin; ϵᵢ > 1 if the point is misclassified


this L satisfies the KKT conditions and can be solved.


correctly classified    a = 0        ϵ = 0
on the margin           0 < a < C    ϵ = 0
violates the margin     a = C        ϵ > 0
misclassified                        ϵ > 1


kernel function

map the data into a higher-dimensional space in which it becomes separable
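A minimal sklearn sketch: an RBF kernel separates data that is not linearly separable in the original space (toy ring-shaped data, parameters chosen arbitrarily):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# toy data: class 0 inside a ring, class 1 outside (not linearly separable)
r = np.r_[rng.uniform(0, 1, 100), rng.uniform(2, 3, 100)]
phi = rng.uniform(0, 2 * np.pi, 200)
X = np.c_[r * np.cos(phi), r * np.sin(phi)]
y = np.r_[np.zeros(100), np.ones(100)]

clf = SVC(kernel="rbf", C=1.0)                # soft margin controlled by C
clf.fit(X, y)
print(clf.score(X, y))                        # close to 1.0 on this toy data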

Neural network :classification:

Backpropagation

Perceptron

Backward correction with respect to x, through the perceptron's transfer function; since the perceptron has no activation function, the derivative is taken directly.

Multi-layer neural network

For a multi-layer neural network there is an activation function between the layers and a loss function at the output.

For a multi-layer network the gradient has to be computed layer by layer; the w in each factor is the weight of the corresponding layer, and the chain rule propagates from the final loss L step by step back to each w.
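A minimal numpy sketch of backpropagation through one hidden layer (sigmoid activations, ½(a − y)² loss; the shapes and learning rate are arbitrary choices):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))                   # input
y = np.array([[1.0]])                         # target
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(1, 3))
lr = 0.1

for step in range(1000):
    # forward pass
    h = sigmoid(W1 @ x)                       # hidden layer
    a = sigmoid(W2 @ h)                       # output
    # backward pass: chain rule from the loss back to each layer's weights
    delta2 = (a - y) * a * (1 - a)            # dL/dz2 for L = 0.5*(a-y)^2
    delta1 = (W2.T @ delta2) * h * (1 - h)    # propagate through W2 and the sigmoid
    W2 -= lr * delta2 @ h.T
    W1 -= lr * delta1 @ x.T

print(a.item())                               # close to the target 1.0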

K-Means :Cluster:

K-means is an example of centroid-based clustering. We can determine the cluster of any instance as the one with the nearest centroid.

Algorithm

Input sample set D: x_1, x_2, ..., x_m
number of clusters k,
maximum number of iterations N,
desired output: clusters C_1, C_2, ..., C_k

Randomly initialize k cluster centers and give them distinct labels
for i = 1, 2, .., N:
    compute the distance of every point to every center, label each point with its nearest center, and partition all of X this way
    for all points with the same label, update the center as their arithmetic mean; repeat the previous step until nothing changes or the iteration limit is reached
import random
import numpy as np
import matplotlib.pyplot as plt

# toy data: 100 noisy copies of three cluster centers
b = []
for i in range(100):
    a = np.array(list([(20,50),(30,10),(60,30)]))
    for j in range(a.shape[0]):
        for k in range(a.shape[1]):
            a[j][k] += random.randint(0,30)
        b.append(a[j])

b = np.array(b)
plt.plot(b[:,0], b[:,1], 'ro')
plt.title("toy data")
plt.show()


# sklearn API
from sklearn.cluster import KMeans
y_pred = KMeans(n_clusters=3, random_state=9).fit_predict(b)
plt.scatter(b[:, 0], b[:, 1], c=y_pred)
plt.title("toy data with sklearn API")
plt.show()

# manual
def findClosestCentroids(X, centroids):
    # assign every point to the index of its nearest centroid
    distance = np.zeros((len(X), len(centroids)))
    for i in range(len(X)):
        for j in range(len(centroids)):
            distance[i, j] = np.linalg.norm(X[i, :] - centroids[j, :])
    return np.argmin(distance, axis=1)

def computeCentroids(X, idx, K):
    # move each centroid to the mean of the points assigned to it
    centroids = np.zeros((K, X.shape[1]))
    for i in range(K):
        centroids[i, :] = np.mean(X[idx == i], axis=0)
    return centroids


def runkMeans(X, K, max_iters):
    # initialize the centroids with K distinct random points from X
    indexs = np.random.choice(np.array(range(len(X))), K, replace=False)
    centroids = X[indexs]
    for max_iter in range(max_iters):
        idx = findClosestCentroids(X, centroids)
        centroids = computeCentroids(X, idx, K)
        for i in range(K):
            plt.scatter(X[idx == i, 0], X[idx == i, 1])
        plt.scatter(centroids[:, 0], centroids[:, 1], c='r')
        plt.title("toy data with manual {} time".format(max_iter))
        plt.show()

K = 3
max_iters = 3
runkMeans(b, K, max_iters)

select the best k

based on domain knowledge:

if the clusters are not internally similar, increase k; if similar objects end up in different clusters, decrease k

Visualizations

this gives us an intuitive view

Within-Sum-of-Squares

minimizing the intra-cluster variance

Problems

1, k -Means is sensitive to the initial clusters.

2, An unsuitable value of k may lead to bad results.

3, All features must have a similar range of values, otherwise features with large ranges dominate the distance.

4, Because the cluster assignment is based on the distance, clusters tend to be round.

EM algorithms for Gaussian Mixture model :Cluster:

This concept is called distribution-based clustering. Each instance is assigned to the most likely cluster, and the initialization uses random mean values and random covariance matrices at first.

Selecting k

Use the Bayesian Information Criterion (BIC) to choose the number of clusters K.
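A minimal sklearn sketch: fit GaussianMixture for several K and pick the one with the lowest BIC (toy blob data, centers chosen arbitrarily):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# toy data: three Gaussian blobs
X = np.vstack([rng.normal(loc, 0.5, size=(100, 2)) for loc in ([0, 0], [5, 5], [0, 5])])

bics = {}
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bics[k] = gmm.bic(X)
print(min(bics, key=bics.get), bics)          # usually K = 3 on this data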

Algorithm

Initialize π_k, μ_k, Σ_k for k = 1, ..., K.

E step: compute the responsibilities γ(z_nk) given the current parameters

γ(z_nk) describes the probability that example n belongs to component k.

M step: update π, μ, Σ given the responsibilities, according to the derivative of $\log p(X|\pi, \mu, \Sigma) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_{k}\, \mathcal{N}(x_{n}|\mu_{k}, \Sigma_{k})$ with respect to each parameter:

cluster number: $N_{k} = \sum_{n=1}^{N} \gamma(z_{nk})$

cluster means: $\mu_{k} = \frac{1}{N_{k}} \sum_{n=1}^{N} \gamma(z_{nk})\, x_{n}$

cluster covariances: $\Sigma_{k} = \frac{1}{N_{k}} \sum_{n=1}^{N} \gamma(z_{nk})\, (x_{n}-\mu_{k})(x_{n}-\mu_{k})^{T}$

cluster priors: $\pi_{k} = \frac{N_{k}}{N}$

problem

1, This is also sensitive to the initialization.

2, An unsuitable value of k may lead to bad results.

3, clusters tend to be round or elliptical, which is still not suitable for half-moon shapes

DBSCAN :Cluster:

This is density-based clustering

concepts: ϵ and minPts

neighbors(x) = {x' ∈ X: d(x, x') <= ϵ}

neighbors(x) is dense if |neighbors(x)| >= minPts

core(X) = {x ∈ X: |neighbors(x)| >= minPts} are called core points; every dense neighborhood contains at least one core point

Algorithm

1, randomly select one core point as the seed of the first cluster
2, grow the first cluster with its neighbors
3, grow again through the core points among those neighbors, again and again
4, select another core point as the next cluster and repeat
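A minimal sklearn sketch of this procedure (eps and min_samples are arbitrary choices for this toy data):

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# two dense blobs plus a few scattered noise points
X = np.vstack([rng.normal([0, 0], 0.3, size=(50, 2)),
               rng.normal([5, 5], 0.3, size=(50, 2)),
               rng.uniform(-2, 8, size=(5, 2))])

labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
print(set(labels))                            # clusters 0, 1 and -1 for noise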

selecting ϵ and minPts

pick the value at the point of largest curvature (the sharpest change in the distance plot) for different ϵ and minPts

problems

1, difficult selection of ϵ and minPts
2, clusters of different densities are a problem
3, sensitive to feature scaling

Single Linkage Clustering :Cluster:

hierarchical clustering

Algorithm

1, start with every instance as its own cluster
2, grow the merge distance to merge clusters until only one cluster remains

problems

1, large storage requirements, poor scalability
2, noise is not handled and leads to small clusters
3, sensitive to feature scaling