title: DeepLearning

Batch Size Problem

Batch Gradient descent

  • uses all data as input for every update
  • the loss curve is quite smooth
  • needs a lot of computing power

Stochastic gradient descent

  • uses only one data sample per update
  • the cost fluctuates during training
  • the weights are updated quickly

Mini-batch gradient descent

  • uses only a subset of all data points in each update
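The three variants differ only in how many samples feed each update. A minimal sketch, assuming a toy one-dimensional linear regression with a mean squared error loss (the function name, data, and hyperparameters are illustrative, not from the notes):

```python
import random

def minibatch_gd(xs, ys, batch_size, lr=0.05, epochs=500, seed=0):
    """Fit y = w*x + b by gradient descent on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)  # new random batches every epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            # Gradient of the loss over the current batch only.
            gw = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            gb = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * gw
            b -= lr * gb
    return w, b

xs = [i / 10 for i in range(20)]   # toy inputs 0.0 ... 1.9
ys = [2 * x + 1 for x in xs]       # noise-free targets, true w=2, b=1
w, b = minibatch_gd(xs, ys, batch_size=4)
```

Setting batch_size=len(xs) gives batch gradient descent, and batch_size=1 gives stochastic gradient descent.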

Big batch size

Advantages: the weight updates follow a more accurate descent direction, with less oscillation. Disadvantages: memory explosion, and training can get stuck in a local minimum.

Small batch size

Advantages: more weight updates per epoch and a better chance of escaping local minima. Disadvantages: harder to converge.

Learning rate

  • too small: converges slowly or gets stuck in a local minimum
  • too large: oscillation or divergence

Second order optimizer

SGD with momentum

learning rate decay


automatic differentiation


Multi-layer perceptrons


Universal Approximation Theorem

  • A perceptron solves linearly separable problems.
  • One hidden layer is enough to approximate any continuous function to an arbitrary degree of accuracy.

Perceptron Learning Algorithm (PLA): iterate over the training examples until convergence

Gradient Descent

Advanced gradient descent

Gradient-based optimization does not always find the global minimum, nor does it always move in the optimal direction.

Second-order optimizers

Theoretically better performance, but generally too expensive, and they do not work well with mini-batches.

SGD with momentum

  • momentum too small: gets stuck in local minima or saddle points
  • momentum too large: overshoots the optimum or spirals around it
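A minimal sketch of the momentum update, assuming the common formulation with a velocity term v and a momentum coefficient beta (names and default values are illustrative, not taken from the notes):

```python
def sgd_momentum(grad, x0, lr=0.1, beta=0.9, steps=200):
    """Minimize a 1-D function given its gradient, using SGD with momentum."""
    x, v = x0, 0.0
    for _ in range(steps):
        v = beta * v - lr * grad(x)  # velocity accumulates past gradients
        x = x + v                    # move along the velocity
    return x

# Toy objective f(x) = (x - 3)^2 with gradient 2 * (x - 3).
x_min = sgd_momentum(lambda x: 2 * (x - 3), x0=0.0)
```

With beta=0 this reduces to plain gradient descent. A beta close to 1 keeps more of the past velocity, which is what makes overshooting possible.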


Adaptive gradient algorithm, AdaGrad (DL~03MLP~.pdf/12)


Root mean square propagation, RMSProp (DL~03MLP~.pdf/13)
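The referenced slides are not reproduced here. As a rough sketch, the commonly used scalar forms of the two updates look like this (function names and default hyperparameters are assumptions for illustration):

```python
import math

def adagrad_step(x, g, cache, lr=0.5, eps=1e-8):
    """AdaGrad: divide by the root of the sum of all past squared gradients."""
    cache = cache + g * g
    return x - lr * g / (math.sqrt(cache) + eps), cache

def rmsprop_step(x, g, cache, lr=0.01, decay=0.9, eps=1e-8):
    """RMSProp: same idea, but with an exponentially decaying average."""
    cache = decay * cache + (1 - decay) * g * g
    return x - lr * g / (math.sqrt(cache) + eps), cache

x_ada, cache_ada = adagrad_step(1.0, 2.0, 0.0)
x_rms, cache_rms = rmsprop_step(1.0, 2.0, 0.0)
```

The AdaGrad cache only grows, so its effective learning rate keeps shrinking. RMSProp forgets old gradients, so its step size can recover.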


Regularization of MLP


Early stopping

Activation function

  • Sigmoid: the gradient vanishes for large positive and negative inputs
  • Rectified Linear Unit (ReLU)
  • LeakyReLU
  • Exponential Linear Unit (ELU)
  • Absolute value activation
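The listed activations can be written as scalar functions. This is a sketch: the alpha defaults are common choices, not values from the notes.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # approaches 0 for large |x|: the vanishing gradient

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x  # small slope for negative inputs

def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def abs_activation(x):
    return abs(x)
```

For example, sigmoid_grad(0.0) is 0.25, its maximum, while sigmoid_grad(10.0) is already below 1e-4.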

Babysitting training neural network

  1. check that the initial loss makes sense
  2. overfit the model to 100% accuracy on a small sample, such as a few mini-batches
    • adjust the initialization and learning rate
  3. find a learning rate that reduces the loss significantly
  4. train roughly for a few epochs with learning rates near the one from the previous step
  5. use the best options from the previous step and train longer
  6. watch the loss curves
    • check whether learning rate decay is needed
    • compare train and validation accuracy for overfitting or underfitting, then go back to step 5
  7. early stopping: stop training when the generalization error increases

Hyperparameter search

  • grid search and random search
  • multiply the learning rate by N when you increase the batch size by a factor of N

Data Augmentation

  • data artifacts
    • flips
    • crops and scales
    • randomize color
    • rotate
  • advanced data augmentation
    • Mixup: take a linear combination of the inputs and targets of two training samples
    • CutMix: mix patches of the two inputs; the target is a linear combination weighted by the patch ratio
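Mixup can be sketched on flat lists. The mixing weight lam is passed in explicitly here. Drawing it from a Beta distribution (for example with random.betavariate) is a common choice, but that is an assumption and not something the notes state.

```python
def mixup(x1, y1, x2, y2, lam):
    """Blend two samples and their one-hot targets with the same weight lam."""
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y

# Two toy "images" with two pixels each, and one-hot targets for two classes.
x_mix, y_mix = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], lam=0.75)
```

CutMix follows the same pattern for the target, with lam set to the fraction of the image area taken from the first sample.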


Modern CNN architectures tend to use strided convolutions instead of max pooling.

Output size

  • valid (no padding): M = floor((N - k) / s) + 1
  • with padding: M = floor((N + 2p - k) / s) + 1
  • N: input size, M: output size, p: padding, k: kernel size, s: stride size
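A small helper for the output size along one spatial dimension, using the symbols defined above. It assumes the standard relation M = floor((N + 2p - k) / s) + 1, and the function name is illustrative.

```python
def conv_output_size(n, k, s=1, p=0):
    """Output size M for input size n, kernel size k, stride s, padding p."""
    return (n + 2 * p - k) // s + 1

valid_size = conv_output_size(32, 5)             # no padding: shrinks by k - 1
same_size = conv_output_size(32, 3, s=1, p=1)    # padding keeps the size
strided_size = conv_output_size(224, 7, s=2, p=3)
```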

Receptive field

Work backwards from the layer whose receptive field you want to calculate down to the input layer, and set the receptive field (RF) of that starting layer to 1.
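The backward calculation can be sketched with the usual recursion RF_prev = (RF - 1) * stride + kernel. The layer description as (kernel, stride) pairs is an assumption made for this example.

```python
def receptive_field(layers):
    """layers: (kernel_size, stride) pairs ordered from the input to the target layer."""
    rf = 1  # a single unit in the layer of interest
    for k, s in reversed(layers):
        rf = (rf - 1) * s + k  # grow the field while stepping towards the input
    return rf

two_convs = receptive_field([(3, 1), (3, 1)])
conv_pool_conv = receptive_field([(3, 1), (2, 2), (3, 1)])
```

Two stacked 3x3 convolutions see a 5x5 input region, which is why they are often used in place of a single 5x5 kernel.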

Zero Padding

  • valid: without padding, the shape is reduced by k-1
  • same: with padding, the shape stays the same


  • max pooling: keeps only the maximum value of each block
  • average pooling: uses the average of each block

Feature extraction

The layer extracts image features. The convolution kernel parameters are determined through backpropagation to obtain the final features.


Against vanishing/exploding gradients: the activations in each layer are normalized

  • Batch Normalization: normalizes each channel, across the batch
  • Layer Normalization: normalizes each sample, across all channels
  • Instance Normalization: normalizes each sample and each channel separately
  • Group Normalization: normalizes groups of channels within each sample

regular convolution

Depthwise separable convolution

  • Depthwise convolution: channel-wise
  • Pointwise convolution: 1x1 convolution
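One way to see the benefit is to count weights. This sketch ignores biases and assumes square k x k kernels. The function names are illustrative.

```python
def conv_params(c_in, c_out, k):
    """Weights of a regular convolution: every output channel sees every input channel."""
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k):
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 convolution that mixes the channels
    return depthwise + pointwise

regular = conv_params(64, 128, 3)
separable = depthwise_separable_params(64, 128, 3)
```

For 64 input channels, 128 output channels and a 3x3 kernel, this gives 73728 weights for the regular convolution and 8768 for the separable one.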

learning rate decay

Decay schedule

Linear warmup

Start with a small learning rate, increase it very fast, then decay it slowly. This helps to deal with bad initialization.
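A minimal sketch of such a schedule, assuming linear warmup followed by linear decay to zero. The decay shape and all default values are assumptions. Cosine or step decay would slot in the same way.

```python
def lr_schedule(step, base_lr=0.1, warmup_steps=100, total_steps=1000):
    """Learning rate at a given step: fast linear warmup, then slow linear decay."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * (1.0 - progress)

rates = [lr_schedule(s) for s in (0, 99, 550, 1000)]
```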

Model ensembling

  • train N models and take the average
  • take N snapshots during training


Vanishing gradient and exploding gradient

GAN (Generative Adversarial Networks)

Implicit density

Step 1: generator fixed, discriminator updated

  • randomly initialize G and D
  • feed inputs from a known distribution to G to get rough outputs
  • feed the rough outputs and real images (examples) to D
  • train D to classify them with a score, pushing real images towards 1 and generated images towards 0, and update D

Step 2: discriminator fixed, generator updated

  • fix D, feed new inputs from the known distribution to G

  • get rough outputs again, pass them to D, and evaluate them with a score

  • train G to get a better score, pushing D's score for generated images towards 1

    This is just like training a normal neural network by minimizing cross-entropy.
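The two steps can be written as binary cross-entropy losses on the discriminator's score. This is a sketch for a single real and a single generated sample. The generator loss shown is the common "push the score towards 1" variant described above, and the function names are illustrative.

```python
import math

def bce(pred, target, eps=1e-12):
    """Binary cross-entropy for one prediction in (0, 1)."""
    return -(target * math.log(pred + eps) + (1 - target) * math.log(1 - pred + eps))

def discriminator_loss(d_real, d_fake):
    # Step 1: real images are labeled 1, generated images are labeled 0.
    return bce(d_real, 1.0) + bce(d_fake, 0.0)

def generator_loss(d_fake):
    # Step 2: D is fixed; G wants its outputs to be scored as real (label 1).
    return bce(d_fake, 1.0)

undecided = discriminator_loss(0.5, 0.5)  # D cannot tell real from generated
```

When D outputs 0.5 for everything, the discriminator loss is 2 * ln 2, about 1.386.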



Pipeline: input data → features → classifier → predicted labels → loss function

VAE (Variational Autoencoders)

Approximate density: optimizes a variational lower bound on the likelihood. Searches for a latent representation, reducing the dimensionality to capture meaningful factors in the data.

  • x: examples
  • z: latent variables, with a simple Gaussian prior
  • encoder network
  • decoder network
  • KL term between the Gaussian encoder and the prior on z: it keeps the approximate posterior distribution close to the prior.
  • the mismatch between q and p, KL[q(z|x)||p(z|x)], is never negative, so the maximized data likelihood can only be bounded from below

p(x) is intractable. We want the true posterior p(z|x), but it is too difficult to compute, so we use q(z|x) as an approximation.

To generate with a VAE, we randomly sample z from the standard normal Gaussian.
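Two pieces of the VAE objective can be sketched without a network. This assumes a diagonal Gaussian encoder that outputs a mean and a log-variance per latent dimension, which is a common parameterization but not one spelled out in the notes.

```python
import math
import random

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL[N(mu, sigma^2) || N(0, 1)], summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), so gradients can reach mu and sigma."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

rng = random.Random(0)
z = reparameterize([0.0, 1.0], [0.0, 0.0], rng)
kl_at_prior = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
kl_shifted = kl_to_standard_normal([1.0], [0.0])
```

The KL term is zero exactly when the encoder output equals the prior, and grows as the mean or variance moves away from it.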

PixelCNN (Autoregressive models)

  • explicit density: maximizes the exact likelihood, factorized with the chain rule
  • sampling is slow, because pixels are generated one at a time
  • masked convolutions: kernel weights on pixels in the future are set to 0
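A sketch of such a mask for a k x k kernel in raster-scan order. The split into types "A" (center excluded) and "B" (center included) follows the usual PixelCNN convention. It is an assumption here, since the notes do not mention it.

```python
def causal_mask(k, mask_type="A"):
    """Return a k x k mask with 1 for already generated pixels and 0 for future ones."""
    center = k // 2
    mask = [[0] * k for _ in range(k)]
    for r in range(k):
        for c in range(k):
            # Rows above the center, or pixels to the left in the center row.
            if r < center or (r == center and c < center):
                mask[r][c] = 1
    if mask_type == "B":
        mask[center][center] = 1  # later layers may also see the current position
    return mask

mask_a = causal_mask(3, "A")
mask_b = causal_mask(3, "B")
```

Multiplying the kernel element-wise by this mask before each convolution keeps the model from looking at pixels it has not generated yet.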

Diffusion Model

  • image to noise: forward process
  • noise to image: backward process


This is many-to-one. The matrix A, of shape [shape(h), shape(h)+shape(x)], is shared by all steps. There is no big difference between predicting from h(t) alone and predicting from concat(h(1), h(t)).


LSTM: many gates, whose outputs are combined by element-wise products. Variants: stacked, bidirectional.

  • Forget Gate:

  • Input Gate:

  • New Value:

  • Output Gate:

  • input of C :

  • output of C :
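The gate formulas are not preserved in these notes. As a rough sketch, the standard LSTM step with a scalar input and scalar state looks like this (the params layout and gate names are hypothetical, chosen to match the bullets above):

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step; params maps a gate name to (w_h, w_x, bias)."""
    def pre(name):
        w_h, w_x, b = params[name]
        return w_h * h_prev + w_x * x + b

    f = sigmoid(pre("forget"))   # forget gate: how much of the old cell state to keep
    i = sigmoid(pre("input"))    # input gate: how much of the new value to write
    g = math.tanh(pre("new"))    # new value (candidate)
    o = sigmoid(pre("output"))   # output gate: how much of the cell state to expose
    c = f * c_prev + i * g       # new cell state C
    h = o * math.tanh(c)         # new hidden state
    return h, c

zero_params = {name: (0.0, 0.0, 0.0) for name in ("forget", "input", "new", "output")}
h, c = lstm_step(1.0, 0.0, 2.0, zero_params)
```

With all weights at zero every gate outputs 0.5, so the cell state is simply halved.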

Semi-supervised learning

Train the model jointly on labeled and unlabeled data. Loss = supervised loss + time-dependent weight * unsupervised loss.

Consistency loss SSL

Consider the consistency loss on all examples between the student model and the teacher model.

  1. train the student model with labeled data as usual
  2. apply different augmentation methods (scale, rotate...) to each unlabeled sample to get augmented views
  3. pass augmented views (x', x'') of the same sample (x) to the student and teacher models
  4. minimize the consistency loss between both outputs
  5. update the weights of the teacher model
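The notes do not say how the teacher is updated. One widely used choice is an exponential moving average (EMA) of the student weights, as in Mean Teacher. The sketch below assumes that choice and a mean squared error consistency loss. Both are assumptions.

```python
def consistency_loss(p_student, p_teacher):
    """Mean squared difference between the two models' outputs for the same sample."""
    return sum((a - b) ** 2 for a, b in zip(p_student, p_teacher)) / len(p_student)

def ema_update(teacher, student, alpha=0.99):
    """Move each teacher weight a small step towards the student weight."""
    return [alpha * t + (1 - alpha) * s for t, s in zip(teacher, student)]

loss = consistency_loss([0.8, 0.2], [0.6, 0.4])
new_teacher = ema_update([1.0, 0.0], [0.0, 1.0], alpha=0.9)
```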

Pseudo-label based SSL

  1. train the teacher model with labeled data as usual
  2. use the well-trained teacher model to predict the unlabeled data
  3. take the confident predictions (above a threshold) as new labeled data
  4. train the student model with the original and the new labeled data
  5. pass the student model weights to the teacher model, and predict all data again
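Step 3 can be sketched as a simple filter over the teacher's predicted class probabilities. The threshold value and function name are illustrative.

```python
def select_pseudo_labels(probs, threshold=0.95):
    """Return (sample_index, predicted_class) for confident predictions only."""
    selected = []
    for idx, p in enumerate(probs):
        confidence = max(p)
        if confidence >= threshold:
            selected.append((idx, p.index(confidence)))
    return selected

teacher_probs = [[0.98, 0.02], [0.60, 0.40], [0.03, 0.97]]
pseudo = select_pseudo_labels(teacher_probs)
```

Here the second sample is dropped because the teacher is not confident enough. The other two join the labeled set with classes 0 and 1.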

Weakly-supervised learning

Use simpler and cheaper labels for training

  • Classification: hashtags
  • Object detection: image tags
  • Semantic segmentation: scribble annotations

Self-supervised learning

Pre-train a model without labels on a large unlabeled dataset, then fine-tune it on a small labeled dataset


Reference frame with its colors; target frame for which the colors are predicted.

Context Prediction

The picture is divided into patches; predict the relative position of the patches.

  • gaps between patches, jittered patch locations
  • chromatic aberration: the network can use it to predict the absolute position of patches

Contrastive Learning

Contrastive Predictive Coding (CPC)

  • Idea: learn to predict future embeddings linearly.
  • Loss: mean squared error is not helpful, because an encoding of 0 everywhere gives a perfect loss. Use a contrastive loss instead, where positive examples are close and negative examples are distant.
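A sketch of a contrastive loss of this kind (InfoNCE style) on plain lists. It scores the prediction against one positive and several negatives with a dot product and applies a softmax cross-entropy. The dot-product scoring is a simplifying assumption.

```python
import math

def info_nce(pred, positive, negatives):
    """Cross-entropy of picking the positive among positive + negatives."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    logits = [dot(pred, positive)] + [dot(pred, n) for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]

collapsed = info_nce([0.0, 0.0], [0.0, 0.0], [[0.0, 0.0]] * 3)
separated = info_nce([1.0, 0.0], [5.0, 0.0], [[-5.0, 0.0]])
```

Unlike mean squared error, the all-zero encoding is no longer a perfect solution: with three negatives it scores ln 4, about 1.386, while a well separated encoding gets close to 0.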


Maximize agreement between the representations of two views. Good contrastive learning needs many negative examples.

  • MoCo: decouples the batch size from the large number of negative examples, at the cost of a more complex model
  • BYOL: no need for negative examples

Cross-modal contrastive learning (CLIP)

Semantic segmentation


  • in a CNN model, replace the last fully connected layer with a 1x1 convolution layer
  • finally, upsample to the original size
  • output: original width * original height * number of classes, with one-hot coding
  • loss function: pixel-wise cross-entropy
    • an imbalanced background does not work well for target prediction, so use a balanced loss function
    • weight factor r inversely proportional to the class frequency
    • Dice loss
    • Focal loss
  • upsampling: nearest neighbor interpolation, transposed convolutions
  • upsampling combined with the corresponding pooling layer
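Two of the balanced-loss ideas can be sketched on flattened masks. The exact normalization of the class weights and the smoothing constant eps are assumptions. Several variants are in use.

```python
def inverse_frequency_weights(pixel_counts):
    """Per-class weights inversely proportional to how often each class occurs."""
    total = sum(pixel_counts)
    n_classes = len(pixel_counts)
    return [total / (n_classes * c) for c in pixel_counts]

def dice_loss(pred, target, eps=1e-6):
    """1 - Dice coefficient for one class; pred and target are flat lists in [0, 1]."""
    intersection = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * intersection + eps) / (sum(pred) + sum(target) + eps)

weights = inverse_frequency_weights([90, 10])  # 90% background, 10% target
perfect = dice_loss([1, 0, 1, 0], [1, 0, 1, 0])
disjoint = dice_loss([1, 0], [0, 1])
```

The rare class gets the larger weight. The Dice loss depends only on the overlap with the target, so a large background does not dominate it.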


Transposed convolutions can cause artifacts, which can be avoided by using fixed upsampling (nearest neighbor).


  • Contraction: extract semantic information
  • Expansion: produce detailed segmentation
  • Skip connection: copy high-resolution information into decoder

DeepLab

  • combines feature representations at multiple scales
  • atrous convolution: dilate the filter by inserting implicit zeros between the kernel elements
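Atrous (dilated) convolution can be pictured by writing the implicit zeros out explicitly for a 1-D kernel. This is a sketch. Real implementations skip the zeros instead of storing them.

```python
def dilate_kernel(kernel, rate):
    """Insert rate - 1 zeros between neighboring kernel elements."""
    out = []
    for i, w in enumerate(kernel):
        out.append(w)
        if i < len(kernel) - 1:
            out.extend([0] * (rate - 1))
    return out

def effective_kernel_size(k, rate):
    return k + (k - 1) * (rate - 1)

dilated = dilate_kernel([1, 2, 3], rate=2)
```

A 3-element kernel with rate 2 covers 5 input positions while still having only 3 weights.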

Object Detection

Predict the Bounding box and predict the class

Two-stage method

Faster R-CNN:

  • Loss = classification loss + region proposal loss
  • RoI pooling: every object is mapped to RoI convolutional features (C*H*W) for the region proposal

Single-stage method

Change the mask size, and predict everything at once

Instance Segmentation

Segment individual objects from the image. Instance segmentation is invariant under relabeling.

  • Proposal-based instance segmentation: perform object detection first, then predict the mask of each instance within its bounding box. L = classification loss + region proposal loss (bounding box) + mask loss
  • Proposal-free instance segmentation: predict intermediate representations
    • foreground prediction
    • boundary prediction

image to image

  • colorization
  • super resolution
  • Denoising