title: DeepLearning
#+STARTUP: overview
Batch Size Problem
Batch Gradient Descent
 uses all training data as input for each update
 the loss curve is quite smooth
 requires a lot of computational power
Stochastic Gradient Descent
 uses only one data sample per update
 the cost fluctuates during training
 the weights are updated quickly
Minibatch Gradient Descent
 uses only a subset of all data points in each update
big batch size
Advantages: the weight updates follow a more accurate gradient direction, with less oscillation. Disadvantages: memory explosion, and the risk of getting stuck in a local minimum.
small batch size
Advantages: more frequent weight updates, more chances to escape local minima. Disadvantages: harder to converge.
Learning rate
too small: converges slowly or gets stuck in a local minimum; too large: oscillation or divergence
Second order optimizer
SGD with momentum
learning rate decay
PyTorch
automatic differentiation
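A minimal sketch of automatic differentiation with PyTorch autograd; the tensor values and the example function are illustrative.
#+BEGIN_SRC python
import torch

# Tensor that tracks gradients.
x = torch.tensor([2.0, 3.0], requires_grad=True)

# Build a small computation graph: y = sum(x^2 + 3x).
y = (x ** 2 + 3 * x).sum()

# Backpropagation fills x.grad with dy/dx = 2x + 3.
y.backward()
print(x.grad)  # tensor([7., 9.])
#+END_SRC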
Regularization
Multilayer perceptrons
Perceptron
A perceptron solves linearly separable problems. Universal Approximation Theorem: one hidden layer is enough to approximate any continuous function to an arbitrary degree of accuracy.
Perceptron Learning Algorithm (PLA): $\omega \leftarrow 0$; iterate over the training examples until convergence: $\hat{y}_{i} \leftarrow \omega^{T} x_{i}$, $e \leftarrow y_{i} - \hat{y}_{i}$, $\omega \leftarrow \omega + e \cdot x_{i}$
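A minimal sketch of the PLA update above (with a sign activation for the prediction, which the notes leave implicit); the data layout and the epoch limit are assumptions.
#+BEGIN_SRC python
import numpy as np

def perceptron_learning(X, y, max_epochs=100):
    """PLA sketch: X already contains a bias column, labels y are in {-1, +1}."""
    w = np.zeros(X.shape[1])              # w <- 0
    for _ in range(max_epochs):           # iterate until convergence (or epoch limit)
        converged = True
        for x_i, y_i in zip(X, y):
            y_hat = np.sign(w @ x_i)      # y_hat <- sign(w^T x_i)
            e = y_i - y_hat               # e <- y_i - y_hat
            if e != 0:
                w = w + e * x_i           # w <- w + e * x_i
                converged = False
        if converged:
            break
    return w
#+END_SRC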
Gradient Descent
Advanced gradient descent
Gradient-based optimization neither always finds the absolute minimum, nor does it always take the optimal descent direction.
Second-order optimizers
Theoretically better performance, but generally too expensive and does not work well with minibatches.
SGD with momentum
$v \leftarrow \beta v + \eta \nabla_{w} L$, $w \leftarrow w - v$, with $\beta \in [0,1]$
too small: gets stuck in local minima or saddle points; too large: overshoots the optimum or spirals around it
AdaGrad
adaptive gradient algorithm (DL~03MLP~.pdf/12)
RMSProp
root mean square propagation (DL~03MLP~.pdf/13)
Adam
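A hedged sketch of how the optimizers above are typically instantiated with torch.optim; the model and all hyperparameter values are placeholders.
#+BEGIN_SRC python
import torch

model = torch.nn.Linear(10, 2)  # placeholder model

# SGD with momentum: v <- beta*v + eta*grad_w L, w <- w - v
opt_sgd = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
# AdaGrad: per-parameter learning rates from accumulated squared gradients
opt_adagrad = torch.optim.Adagrad(model.parameters(), lr=1e-2)
# RMSProp: exponentially decaying average of squared gradients
opt_rmsprop = torch.optim.RMSprop(model.parameters(), lr=1e-3, alpha=0.99)
# Adam: combines momentum with RMSProp-style second moments
opt_adam = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
#+END_SRC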
Regularization of MLP
Dropout
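A minimal sketch of dropout as a regularizer; the layer sizes and the dropout probability are illustrative.
#+BEGIN_SRC python
import torch.nn as nn

# nn.Dropout randomly zeroes activations with probability p during training,
# which discourages co-adaptation of hidden units; it is a no-op in eval mode.
mlp = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(256, 10),
)
#+END_SRC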
Early stopping
Activation functions
Sigmoids vanish for large positive and negative inputs. Alternatives: Rectified Linear Unit (ReLU), LeakyReLU, Exponential Linear Unit (ELU), absolute value activation.
Babysitting the training of a neural network
 check that the initial loss makes sense
 overfit the model to 100% training accuracy on a small sample, such as a few minibatches
 adjust the initialization and learning rate
 find a learning rate that reduces the loss significantly
 roughly train for a few epochs with learning rates around the value from the previous step
 use the best options from the previous step and train longer
 watch the loss curves
 check whether learning rate decay is needed
 compare training vs. validation accuracy (overfitting / underfitting); if necessary, go back to step 5
 Early stopping idea: stop training when the generalization error increases
Hyperparameter search
 grid search and random search
 Multiply the learning rate by N when you increase the batch size by a factor of N
Data Augmentation
 data artifacts
 flips
 crops and scales
 randomize color
 rotate
 advanced data augmentation
 Mixup: take a linear combination of the inputs and targets of two training samples (see the sketch after this list)
 CutMix: mix patches of the two inputs; the target is a linear combination weighted according to the patch area ratio
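A minimal sketch of Mixup as described above; the Beta parameter and the one-hot target format are assumptions.
#+BEGIN_SRC python
import torch

def mixup(x, y, alpha=0.2):
    """Linear combination of pairs of training samples (inputs and one-hot targets)."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(x.size(0))       # pair each sample with a random partner
    x_mix = lam * x + (1 - lam) * x[perm]  # mix the inputs
    y_mix = lam * y + (1 - lam) * y[perm]  # mix the targets with the same weight
    return x_mix, y_mix
#+END_SRC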
CNN
Modern CNN architectures tend to use strided convolutions instead of max pooling.
Output size
valid: $M = \lfloor \frac{N - k}{s} \rfloor + 1$; padding: $M = \lfloor \frac{N - k + 2p}{s} \rfloor + 1$; N: input size, M: output size, p: padding, k: kernel size, s: stride
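The output-size formula as a small helper function (a sketch; the arguments mirror the symbols above).
#+BEGIN_SRC python
def conv_output_size(n, k, s=1, p=0):
    """M = floor((N - k + 2p) / s) + 1"""
    return (n - k + 2 * p) // s + 1

print(conv_output_size(32, 3, s=1, p=0))  # 30 ("valid")
print(conv_output_size(32, 3, s=1, p=1))  # 32 ("same")
#+END_SRC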
Receptive field
$RF = 1 + \sum_{l=1}^{L} (k_{l} - 1) \cdot s$, recursively $RF_{i} = (RF_{i+1} - 1) \cdot s + k$
Compute from the layer of interest back to the input layer, and set the RF of the current layer to 1.
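A sketch of the recursive receptive-field computation above; layers are given as (kernel size, stride) pairs from the input up to the layer of interest.
#+BEGIN_SRC python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) from the input to the layer of interest."""
    rf = 1                        # RF of the current layer is set to 1
    for k, s in reversed(layers):
        rf = (rf - 1) * s + k     # RF_i = (RF_{i+1} - 1) * s + k
    return rf

# Example: two 3x3 convolutions with stride 1 give a receptive field of 5.
print(receptive_field([(3, 1), (3, 1)]))  # 5
#+END_SRC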
Zero Padding
valid: without padding, the output shape shrinks by k−1; same: with padding, the shape stays the same
pooling
max pooling keeps only the maximum of each block; average pooling uses the average of each block
Feature abstraction
The layers extract image features; the convolution kernel parameters are finally determined through backpropagation to obtain the final features.
Normalization
Against vanishing/exploding gradients, the activations are normalized over the examples in a layer:
$\mu_{j} = \frac{1}{N} \sum_{i=1}^{N} x_{i,j}$
$\sigma_{j}^{2} = \frac{1}{N} \sum_{i=1}^{N} (x_{i,j} - \mu_{j})^{2}$
$\hat{x}_{i,j} = \frac{x_{i,j} - \mu_{j}}{\sqrt{\sigma_{j}^{2} + \epsilon}}$
$y_{i,j} = \gamma_{j} \hat{x}_{i,j} + \beta_{j}$
 Batch Normalization: normalizes each channel over the batch
 Layer Normalization: normalizes each sample over its channels
 Instance Normalization: normalizes each sample and each channel separately
 Group Normalization: normalizes groups of channels within each sample (see the module sketch after this list)
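The four normalization variants map onto PyTorch modules roughly as follows (a sketch; the channel count and spatial size are illustrative).
#+BEGIN_SRC python
import torch.nn as nn

C = 64  # number of channels (illustrative)

bn = nn.BatchNorm2d(C)          # each channel, statistics over the batch (and spatial dims)
ln = nn.LayerNorm([C, 32, 32])  # each sample, statistics over all given dims
inorm = nn.InstanceNorm2d(C)    # each sample and each channel separately
gn = nn.GroupNorm(num_groups=8, num_channels=C)  # groups of channels within each sample
#+END_SRC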
regular convolution
Depthwise separable convolution
 Depthwise Convolution: channel-wise convolution (one filter per input channel)
 Pointwise Convolution: 1×1 convolution across channels (see the sketch after this list)
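A minimal sketch of a depthwise separable convolution built from the two parts above; the padding and kernel size are illustrative.
#+BEGIN_SRC python
import torch.nn as nn

def depthwise_separable_conv(in_ch, out_ch, kernel_size=3):
    """Depthwise (channel-wise) convolution followed by a 1x1 pointwise convolution."""
    return nn.Sequential(
        # depthwise: one filter per input channel (groups=in_ch)
        nn.Conv2d(in_ch, in_ch, kernel_size, padding=kernel_size // 2, groups=in_ch),
        # pointwise: 1x1 convolution mixes information across channels
        nn.Conv2d(in_ch, out_ch, kernel_size=1),
    )
#+END_SRC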
learning rate decay
decay schedule
Linear Warmup
start with a small learning rate, increase it quickly, then decay it slowly; this helps cope with bad initialization
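A sketch of linear warmup followed by a decay using a LambdaLR scheduler; the step counts and the linear decay shape are assumptions.
#+BEGIN_SRC python
import torch

model = torch.nn.Linear(10, 2)  # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=0.1)

warmup_steps, total_steps = 500, 10000  # illustrative values

def lr_lambda(step):
    if step < warmup_steps:
        # linear warmup from ~0 up to the base learning rate
        return (step + 1) / warmup_steps
    # afterwards decay linearly towards 0
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
# per training step: opt.step() then sched.step()
#+END_SRC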
Model ensembling
 train N models and take the average
 take N snapshots during training
ResNet
skip connections mitigate the vanishing and exploding gradient problem
GAN (Generative Adversarial Networks)
Implicit density
Step 1: Generator fixed, Discriminator updated
 randomly initialize G and D
 feed inputs $z_{1}, z_{2}, z_{3}, z_{4}, \dots \sim Z$ from a known distribution into G to get rough outputs $z'_{1}, z'_{2}, z'_{3}, z'_{4}, \dots \sim Z'$
 feed the rough outputs and real images $x_{1}, x_{2}, x_{3}, x_{4}, \dots \sim X$ into D
 train D to classify them and update D: maximize $V = \frac{1}{m} \sum_{i=1}^{m} \log D(x_{i})$ towards 1 and minimize $\frac{1}{m} \sum_{i=1}^{m} \log D(G(z_{i}))$ towards 0, i.e. $\max_{d} \big[ \mathbb{E}_{x \sim \mathrm{data}} \log D_{d}(x) + \mathbb{E}_{z \sim p(z)} \log(1 - D_{d}(G_{g}(z))) \big]$
Step 2: Discriminator fixed, Generator updated
 fix D and feed new inputs from the known distribution into G
 get rough outputs again, pass them to D, and let D score them
 train G to obtain a better score:
maximize $V = \frac{1}{m} \sum_{i=1}^{m} \log D(G(z_{i})) = \frac{1}{m} \sum_{i=1}^{m} \log D(z'_{i})$ towards 1, i.e. $\min_{g} \big[ \mathbb{E}_{z \sim p(z)} \log(1 - D_{d}(G_{g}(z))) \big]$
This is just like training a normal neural network by minimizing the cross entropy.
summary
$\min_{g} \max_{d} \big[ \mathbb{E}_{x \sim \mathrm{data}} \log D_{d}(x) + \mathbb{E}_{z \sim p(z)} \log(1 - D_{d}(G_{g}(z))) \big]$
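A hedged sketch of the two alternating steps above (D update with G fixed, then G update); G, D, the optimizers, and the latent dimension are placeholders, and D is assumed to output one probability per sample.
#+BEGIN_SRC python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_G, opt_D, real, z_dim=100):
    """One alternating update following the min-max objective above."""
    bs = real.size(0)
    ones, zeros = torch.ones(bs, 1), torch.zeros(bs, 1)

    # Step 1: update D so that D(x) -> 1 and D(G(z)) -> 0 (G is kept fixed via detach).
    fake = G(torch.randn(bs, z_dim)).detach()
    loss_D = F.binary_cross_entropy(D(real), ones) + \
             F.binary_cross_entropy(D(fake), zeros)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Step 2: update G so that D(G(z)) -> 1 (non-saturating form of the G loss).
    loss_G = F.binary_cross_entropy(D(G(torch.randn(bs, z_dim))), ones)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
#+END_SRC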
Autoencoder
input data $\to$ features $\to$ predicted labels; predicted labels + classifier $\to$ loss function
VAE (Variational Autoencoders)
Optimizes a variational lower bound on the likelihood (approximate density). Searches for a latent representation, reducing the dimensionality to capture meaningful factors in the data. x: examples, z: latent variables. $p_{\theta}(x) = \int p_{\theta}(z)\, p_{\theta}(x \mid z)\, dz$ with a simple Gaussian prior; the posterior $p(z \mid x) = \frac{p(x \mid z)\, p(z)}{p(x)}$ is approximated by an encoder neural network $q(z \mid x)$.
$\log p(x) = \mathbb{E}_{z \sim q(z \mid x)} \log p(x \mid z) - KL[q(z \mid x)\,\|\,p(z)] + KL[q(z \mid x)\,\|\,p(z \mid x)]$
 the first term is the reconstruction term, evaluated by the decoder network
 the KL term between the Gaussian encoder distribution and the latent prior makes the approximate posterior close to the prior
 the last term, $KL[q(z \mid x)\,\|\,p(z \mid x)]$, measures the mismatch between q and the true posterior and is $\geq 0$, so the maximized data likelihood $\log p(x)$ only has a lower bound (the first two terms)
Intractability of p(x): we want $p(z \mid x)$, but it is too difficult to compute, so we use $q(z \mid x)$ as an approximation:
$KL[q(z \mid x)\,\|\,p(z \mid x)] = \int q(z \mid x) \log \frac{q(z \mid x)}{p(z \mid x)}\, dz = \int q(z \mid x) \log \frac{q(z \mid x)\, p(x)}{p(x \mid z)\, p(z)}\, dz = \int q(z \mid x) \log q(z \mid x)\, dz + \int q(z \mid x) \log p(x)\, dz - \int q(z \mid x) \log p(x \mid z)\, dz - \int q(z \mid x) \log p(z)\, dz = \log p(x) + KL[q(z \mid x)\,\|\,p(z)] - \mathbb{E}_{z \sim q(z \mid x)} \log p(x \mid z)$
For the VAE, z is randomly sampled from a standard Gaussian.
PixelCNN (Autoregressive models)
Explicit density: optimizes the exact likelihood (chain rule), but training is slow. Maximize the likelihood $p(x) = \prod_{i=1}^{n} p(x_{i} \mid x_{1}, x_{2}, x_{3}, \dots, x_{i-1})$. Masked convolutions: kernel weights for pixels in the future are set to 0.
Diffusion Model
image to noise: forward process; noise to image: reverse process
RNN
This is many-to-one: $h(t) = \tanh(A \cdot [h(t-1), x(t)]^{T})$, where $A$ with shape $[\mathrm{shape}(h),\ \mathrm{shape}(h) + \mathrm{shape}(x)]$ is shared across all steps. For prediction there is no big difference between using only $h(t)$ or the concatenation of $h(1), \dots, h(t)$.
LSTM
several gates; outputs use element-wise products; can be stacked and bidirectional

Forget gate: $f_{t} = \sigma(W_{f} \cdot [h_{t-1}, x_{t}])$
Input gate: $i_{t} = \sigma(W_{i} \cdot [h_{t-1}, x_{t}])$
New value: $n_{t} = \tanh(W_{n} \cdot [h_{t-1}, x_{t}])$
Output gate: $o_{t} = \sigma(W_{o} \cdot [h_{t-1}, x_{t}])$
Cell state update: $C_{t} = f_{t} \otimes C_{t-1} + i_{t} \otimes n_{t}$
Output: $h_{t} = o_{t} \otimes \tanh(C_{t})$
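The gate equations above as a minimal per-step sketch (biases omitted, as in the notes; the weight names mirror the formulas and their shapes are assumptions).
#+BEGIN_SRC python
import torch

def lstm_cell(x_t, h_prev, c_prev, W_f, W_i, W_n, W_o):
    """One LSTM step; each W has shape (hidden, hidden + input)."""
    hx = torch.cat([h_prev, x_t], dim=-1)   # [h_{t-1}, x_t]
    f_t = torch.sigmoid(hx @ W_f.T)         # forget gate
    i_t = torch.sigmoid(hx @ W_i.T)         # input gate
    n_t = torch.tanh(hx @ W_n.T)            # new candidate value
    o_t = torch.sigmoid(hx @ W_o.T)         # output gate
    c_t = f_t * c_prev + i_t * n_t          # cell state update
    h_t = o_t * torch.tanh(c_t)             # hidden output
    return h_t, c_t
#+END_SRC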
Semi-supervised learning
Train the model jointly on labeled and unlabeled data: $L = L_{S} + \mu(t) L_{\mu}$, i.e. supervised loss plus a time-dependent weight times the unsupervised loss $L_{\mu}$.
Consistency-loss SSL
Consider the consistency loss on all examples between the student model and the teacher model.
 train the student model on labeled data as usual
 apply different augmentation methods (scaling, rotation, ...) to each unlabeled sample
 pass the augmented views (x', x'') of the same sample x to the student and teacher models
 minimize the consistency loss between both outputs, $L_{\mu} = ||f(x') - g(x'')||_{2}$
 update the teacher weights as a moving average of the student weights, $\Theta'_{t} = \alpha \Theta'_{t-1} + (1 - \alpha) \Theta_{t}$ (see the sketch after this list)
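A sketch of the consistency loss and the teacher update from the steps above; the student/teacher models and alpha are placeholders, and the loss is written in its squared form.
#+BEGIN_SRC python
import torch

def consistency_loss(student, teacher, x_aug1, x_aug2):
    """L_u between student f(x') and teacher g(x'') on two views of the same sample."""
    with torch.no_grad():
        target = teacher(x_aug2)   # the teacher sees the second augmented view
    return ((student(x_aug1) - target) ** 2).mean()

@torch.no_grad()
def update_teacher(student, teacher, alpha=0.99):
    """EMA update: theta_teacher <- alpha * theta_teacher + (1 - alpha) * theta_student."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(alpha).add_(p_s, alpha=1 - alpha)
#+END_SRC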
Pseudo-label based SSL
 train the teacher model on labeled data as usual
 use the well-trained teacher model to predict labels for the unlabeled data
 take over the confident predictions (above a threshold) as new labeled data
 train the student model on the original and the newly labeled data
 pass the student model weights to the teacher model and predict all data again
Weakly-supervised learning
use simpler and cheaper labels for training
 Classification: hashtags
 Object detection: image tags
 Semantic Segmentation: scribble annotations
Self-supervised learning
pretrain a model without supervision on a large unlabeled dataset, then fine-tune it on a small labeled dataset
Colorization
Reference frame embedding $f_{i}$ with color $c_{i}$; target frame embedding $f_{j}$ with predicted color $y_{j}$: $A_{ij} = \frac{\exp(f_{i}^{T} f_{j})}{\sum_{k} \exp(f_{k}^{T} f_{j})}$, $y_{j} = \sum_{i} A_{ij} c_{i}$
Context Prediction
the image is divided into patches; predict the relative position of the patches.
 gaps between patches and jittered patch locations prevent trivial low-level cues
 chromatic aberration can let the network predict the absolute position of patches (an unwanted shortcut)
Contrastive Learning
Contrastive Predictive Coding (CPC)
Idea: learn to predict future embeddings linearly, $z_{t+k} = W_{k} c_{t}$. A mean-squared-error loss is not helpful, because an all-zero encoding would give a perfect loss; instead, a contrastive loss keeps positive examples close and negative examples distant.
SimCLR
Maximize agreement between the representations of two augmented views; good contrastive learning needs many negative examples.
 MoCo: $\theta_{k} \leftarrow m \theta_{k} + (1 - m) \theta_{q}$; decouples the batch size from the (large) number of negative examples, at the cost of a more complex setup
 BYOL: no need for negative examples
Cross-modal contrastive learning (CLIP)
Semantic segmentation
methods
 in a CNN model, replace the last fully connected layer with a 1×1 convolution layer
 finally, upsample to the original size
 the output has shape original width × original height × number of classes, with one-hot coding
 loss function: pixel-wise cross entropy,
$-\frac{1}{N} \sum_{ij} \sum_{k} t_{ij,k} \log p_{ij,k}$
 an imbalanced background does not work well for target prediction; use a balanced loss function
 weight factor inversely proportional to the class frequency
 Dice loss (see the sketch after this list)
 Focal loss
 upsampling: nearest neighbor interpolation, transposed convolutions
 upsampling combined with the corresponding pooling layer
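A minimal sketch of the (soft) Dice loss mentioned in the list above, for the binary case; probability maps and target masks of shape (batch, H, W) are assumed.
#+BEGIN_SRC python
import torch

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss: 1 - 2|P*T| / (|P| + |T|), averaged over the batch."""
    pred, target = pred.flatten(1), target.flatten(1)
    inter = (pred * target).sum(dim=1)
    union = pred.sum(dim=1) + target.sum(dim=1)
    return 1 - ((2 * inter + eps) / (union + eps)).mean()
#+END_SRC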
FCN
transposed convolutions can cause checkerboard artifacts; this can be avoided by using fixed upsampling (nearest neighbor)
U-Net
 Contraction: extract semantic information
 Expansion: produce the detailed segmentation
 Skip connections: copy high-resolution information into the decoder
DeepLab
combines feature representations at multiple scales; atrous convolution: dilates the filter by inserting implicit zeros between kernel elements
Object Detection
Predict the Bounding box and predict the class
Two-stage methods
Faster R-CNN:
 Loss = $∑$ classification loss + $∑$ region proposal loss
 RoI pooling: maps each region proposal onto fixed-size convolutional features (C×H×W)
Single-stage methods
vary the mask sizes and predict everything at once
Instance Segmentation
Segment individual objects from the image; instance segmentation is invariant under relabeling of the instances.
 Proposal-based instance segmentation: perform object detection first, then predict a mask for each instance within its bounding box; L = classification loss + region proposal loss (bounding box) + mask loss
 Proposal-free instance segmentation: predict intermediate representations,
 foreground prediction
 boundary prediction
image to image
 colorization
 super resolution
 Denoising