Learning
Linear Classification
Logistic Regression
Useful for classification problems.
Cross-Entropy Loss
With regularization:
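As a rough illustration (not necessarily the exact formulation above), a numpy sketch of the L2-regularized cross-entropy objective and its gradient for binary labels $y \in \{0,1\}$; `lam` and the toy data are made up:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(w, X, y, lam=1e-2):
    """Cross-entropy loss with L2 regularization for binary labels y in {0, 1}."""
    p = sigmoid(X @ w)                      # predicted P(y=1 | x)
    ce = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    return ce + lam * np.sum(w ** 2)

def logistic_grad(w, X, y, lam=1e-2):
    """Gradient of the regularized cross-entropy loss."""
    p = sigmoid(X @ w)
    return X.T @ (p - y) / len(y) + 2 * lam * w

# one gradient-descent step on toy data
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y = np.array([1.0, 0.0, 1.0])
w = np.zeros(2)
w -= 0.1 * logistic_grad(w, X, y)
```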
How to deal with multiclass problems?
Softmax Regression
Normalizes multiple outputs in a probability vector.
Cross-Entropy Loss
This loss is convex, but many different parameter settings produce the same outputs, so regularization is indispensable to keep the weights from diverging.
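A hedged numpy sketch of softmax regression with the multiclass cross-entropy loss; the L2 term is what pins down one of the many equivalent solutions. The names `W`, `lam`, and the toy data are illustrative:

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax: normalizes scores into a probability vector."""
    Z = Z - Z.max(axis=1, keepdims=True)    # subtract max for numerical stability
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def softmax_ce_loss(W, X, y, lam=1e-2):
    """Cross-entropy loss for integer labels y in {0, ..., C-1}, plus L2 regularization."""
    P = softmax(X @ W)                      # shape (n, C)
    n = len(y)
    ce = -np.mean(np.log(P[np.arange(n), y] + 1e-12))
    return ce + lam * np.sum(W ** 2)

X = np.random.randn(5, 3)
y = np.array([0, 2, 1, 2, 0])
W = np.zeros((3, 3))
print(softmax_ce_loss(W, X, y))             # ~log(3) for uniform predictions
```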
Support Vector Machine (SVM)
Soft-SVM (Hinge Loss)
Define Hinge Loss
For the linear hypothesis:
Theorem: Soft-SVM is equivalent to a Regularized Risk Minimization:
This means that the SVM's margin-maximization term is essentially a regularization term.
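A minimal sketch of the soft-SVM objective in its regularized-risk form, assuming labels $y \in \{-1, +1\}$; `lam`, the learning rate, and the toy data are placeholders:

```python
import numpy as np

def hinge_loss(w, b, X, y):
    """Average hinge loss max(0, 1 - y * (w^T x + b)) for labels y in {-1, +1}."""
    margins = y * (X @ w + b)
    return np.mean(np.maximum(0.0, 1.0 - margins))

def soft_svm_objective(w, b, X, y, lam=1e-2):
    """Regularized risk: hinge loss + lam * ||w||^2 (the 'max-margin' term)."""
    return hinge_loss(w, b, X, y) + lam * np.dot(w, w)

# one subgradient step on toy data
X = np.array([[2.0, 1.0], [-1.0, -1.5], [1.5, 0.5]])
y = np.array([1.0, -1.0, 1.0])
w, b, lam, lr = np.zeros(2), 0.0, 1e-2, 0.1
active = (y * (X @ w + b)) < 1.0            # points violating the margin
grad_w = -(y[active][:, None] * X[active]).sum(axis=0) / len(y) + 2 * lam * w
grad_b = -y[active].sum() / len(y)
w -= lr * grad_w
b -= lr * grad_b
print(soft_svm_objective(w, b, X, y))
```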
Kernel Soft-SVM
The basis function $\Phi(x)$ can often be replaced by a kernel function $k(x_1, x_2)$.
Polynomial Kernel: efficient computation in $O(d)$
Construct new kernel functions from existing kernel functions:
For any function $g: \mathcal X \rightarrow \mathbb R$
Apply Representer theorem:
- $\alpha_j$ is the weight of each reference point $\color{red}{x_j}$ to the prediction of $\color{red}{x_i}$.
- It is actually the primal form written with kernel functions (see the sketch below).
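A sketch of prediction in the representer-theorem form, with a polynomial kernel as one possible choice of $k$; in practice `alpha` would come from training the kernel soft-SVM, here it is just made up:

```python
import numpy as np

def poly_kernel(x1, x2, degree=3, c=1.0):
    """Polynomial kernel k(x1, x2) = (x1 . x2 + c)^degree, computed in O(d)."""
    return (np.dot(x1, x2) + c) ** degree

def kernel_predict(x_new, X_train, alpha, b=0.0, kernel=poly_kernel):
    """Representer-theorem form: f(x) = sum_j alpha_j * k(x_j, x) + b."""
    return sum(a * kernel(xj, x_new) for a, xj in zip(alpha, X_train)) + b

X_train = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
alpha = np.array([0.5, 0.3, -0.8])          # weight of each reference point x_j
print(np.sign(kernel_predict(np.array([0.5, 0.5]), X_train, alpha)))
```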
Decision Tree
Criterion:
- More balanced
- More pure
Misclassification error (not used very frequently):
Use Entropy to measure purity:
Gini Index:
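A small sketch comparing the three criteria on a class-probability vector `p` (the vector is made up):

```python
import numpy as np

def misclassification_error(p):
    """1 - max_k p_k (rarely used as a splitting criterion)."""
    return 1.0 - np.max(p)

def entropy(p):
    """Shannon entropy -sum p_k log2 p_k; 0 for a pure node."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(p):
    """Gini index 1 - sum p_k^2."""
    return 1.0 - np.sum(p ** 2)

p = np.array([0.8, 0.2])
print(misclassification_error(p), entropy(p), gini(p))
# 0.2, ~0.722, 0.32 -- all are 0 for a pure node and maximal for a uniform one
```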
ID3 Algorithm
Multilayer Perceptrons (MLP)
MLP for XOR
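As one concrete illustration (weights set by hand, not the network from the lecture), a 2-unit ReLU MLP that computes XOR via $\mathrm{ReLU}(x_1+x_2) - 2\,\mathrm{ReLU}(x_1+x_2-1)$:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def xor_mlp(x):
    """Hand-crafted 2-2-1 ReLU MLP computing XOR on binary inputs."""
    W1 = np.array([[1.0, 1.0],
                   [1.0, 1.0]])             # both hidden units see x1 + x2
    b1 = np.array([0.0, -1.0])
    W2 = np.array([1.0, -2.0])              # output = h1 - 2 * h2
    h = relu(W1 @ x + b1)
    return W2 @ h

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor_mlp(np.array(x, dtype=float)))   # 0, 1, 1, 0
```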
Activation
Loss Functions
Entropy
Relative-entropy
Cross-entropy
Relationship:
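A quick numerical check of the relationship $H(p,q) = H(p) + D_{KL}(p\|q)$ on two made-up discrete distributions:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

entropy = -np.sum(p * np.log(p))            # H(p)
cross_entropy = -np.sum(p * np.log(q))      # H(p, q)
kl = np.sum(p * np.log(p / q))              # D_KL(p || q), relative entropy

print(np.isclose(cross_entropy, entropy + kl))   # True
```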
Softmax in the output layer
Cross-entropy loss:
Cost function:
Gradient-Based Training
Forward Propagation: compute the activations & the objective $J(\theta)$
Backward Propagation: update the parameters in all layers
Learning Rate Decay
Exponential decay strategy:
1/t decay strategy:
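A sketch of the two decay strategies as functions of the epoch $t$; `lr0` and the decay constants `k` are illustrative hyperparameters:

```python
import numpy as np

def exponential_decay(lr0, t, k=0.1):
    """Exponential decay: lr_t = lr0 * exp(-k * t)."""
    return lr0 * np.exp(-k * t)

def one_over_t_decay(lr0, t, k=1.0):
    """1/t decay: lr_t = lr0 / (1 + k * t)."""
    return lr0 / (1.0 + k * t)

for t in range(5):
    print(t, exponential_decay(0.1, t), one_over_t_decay(0.1, t))
```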
Weight Decay
L2 regularization:
L1:
Usually left untuned.
Weight Initialization
Xavier initialization
(Linear activations)
Avoids exploding or vanishing gradients.
He initialization
(ReLU activations)
Because ReLU discards half of the information (the negative activations), the variance is doubled to compensate.
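A sketch of both initializations for a fully connected layer (Gaussian variants); `fan_in`, `fan_out`, and the seed are illustrative:

```python
import numpy as np

def xavier_init(fan_in, fan_out, rng=np.random.default_rng(0)):
    """Xavier/Glorot: variance 2 / (fan_in + fan_out), suited to linear/tanh activations."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def he_init(fan_in, fan_out, rng=np.random.default_rng(0)):
    """He: variance 2 / fan_in, compensating for ReLU zeroing half the activations."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W1 = xavier_init(256, 128)
W2 = he_init(256, 128)
print(W1.std(), W2.std())                   # He weights have the larger spread
```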
Convolutional Neural Network (CNN)
Convolution Kernel
Stride
Padding
Pooling
Batch Normalization
Normalization is done over the corresponding channel across the N images in a batch.
It improves training stability, so the learning rate can be set a bit larger while still guaranteeing convergence.
- Data ensembling
- Parameter ensembling
- Model ensembling
ResNet
The last layer is Global Average Pooling: 7×7×2048 -> 1×1×2048
Recurrent Neural Network (RNN)
Idea for Sequence Modeling
Local Dependency
Parameter Sharing
RNN
Go deeper
Standard Architectures
- RNNs can represent unbounded temporal dependencies
- RNNs encode histories of words into a fixed size hidden vector
- Parameter size does not grow with the length of dependencies
- RNNs are hard to learn long range dependencies present in data
LSTM
Multihead, shared bottom.
Gradient flow highway: remember history very well.
NIPS 2015 Highway Network.
Training Strategies
Shift in Training & Inference
Use Scheduled Sampling to solve this
Problem: gradient explosion caused by repeated multiplication through time.
Solution: Gradient Clipping
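A sketch of clipping by global norm, one common form of gradient clipping; the threshold `max_norm` is an illustrative hyperparameter:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads]

grads = [np.array([30.0, 40.0]), np.array([0.0])]       # global norm = 50
clipped = clip_by_global_norm(grads, max_norm=5.0)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))    # ~5.0
```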
Variational Dropout
Layer Normalization
BN: Easy to compare between channels
LN: Easy to compare between samples
For image tasks, the channels are generally considered to have equal status, so BN is usually used.
Transformer
Uses attention to replace the recurrent state space.
Attention
Multi-Head Attention
Sparse?
$W^o$ to maintain shape and jointly attend to information from different representation subspaces.
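A numpy sketch of single-head scaled dot-product attention, the block each head computes before $W^o$ recombines them (no masking, no batching; the shapes are made up):

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=-1, keepdims=True)
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)         # (n_q, n_k) similarity matrix
    weights = softmax(scores)               # each query attends over all keys
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))                 # 4 query positions, d_k = 8
K = rng.normal(size=(6, 8))                 # 6 key/value positions
V = rng.normal(size=(6, 16))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 16)
```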
FFN
Position-wise FFN (similar to multiple convolution kernels in a CNN; parameters are shared across every word position.)
Positional Encoding
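A sketch of the sinusoidal positional encoding, assuming an even `d_model` (sin on even dimensions, cos on odd):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(seq_len)[:, None]                       # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                    # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

print(positional_encoding(seq_len=50, d_model=64).shape)    # (50, 64)
```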
Reasoning
Reasoning (Probabilistic) = Modeling + Inference
Modeling:
- Bayesian Networks
- Markov random fields
Inference:
- Elimination methods
- Latent variable models
- Variational methods
- Sampling methods - hard to learn!
Bayesian Network
Variable Elimination
Used to compute marginal distributions of probabilities.
In general this is an NP-hard problem.
For a Markov chain the complexity is $O(nk^2)$; for a general graph it is $O(k^{n-1})$; if every node is known to have at most $m$ parents, the complexity is $O(nk^{m-1})$.
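A sketch of the $O(nk^2)$ chain case: computing the marginal $P(x_n)$ by eliminating one variable at a time instead of summing over all $k^n$ joint configurations; the chain length and CPTs are made up:

```python
import numpy as np

def chain_marginal(p0, transitions):
    """Compute P(x_n) for a Markov chain by eliminating x_0, x_1, ... in order.

    p0: initial distribution, shape (k,)
    transitions: list of CPTs P(x_{t+1} | x_t), each shape (k, k) with rows summing to 1.
    Each elimination step is a (k,)-by-(k, k) product, so the total cost is O(n k^2).
    """
    msg = p0
    for T in transitions:
        msg = msg @ T                       # sum_{x_t} P(x_t) P(x_{t+1} | x_t)
    return msg

p0 = np.array([0.5, 0.3, 0.2])
T = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])
print(chain_marginal(p0, [T] * 4))          # P(x_4), sums to 1
```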
Message Passing
Reuse the computation from $P(Y|E=e)$ when calculating another probability $P(Y_1|E_1=e_1)$.
"$\propto$" means we only need the relative values of the probabilities, since the final probabilities can be recovered by normalization.
MAP requires finding the maximum of the probability distribution.
Sum and max are both aggregation operations and both distribute over products, so replacing sum with max gives the second kind of Message Passing:
Bayes Approach
MLE method
If the parameters of the probabilistic model are known, we speak of probability; if they are unknown, of statistical inference.
Estimating the parameters of a Gaussian distribution:
The variance estimate is biased, so the sum of squared deviations is usually scaled by $1/(n-1)$ instead of $1/n$.
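A small sketch of the Gaussian MLE and the bias-corrected variance (dividing by $n-1$ instead of $n$); the sample is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=1000)

mu_mle = np.mean(x)                          # MLE of the mean
var_mle = np.mean((x - mu_mle) ** 2)         # MLE of the variance (divides by n, biased)
var_unbiased = np.sum((x - mu_mle) ** 2) / (len(x) - 1)   # corrected: divide by n - 1

print(mu_mle, var_mle, var_unbiased)         # np.var(x, ddof=1) gives the same correction
```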
Bayes Decision Rule
For regression problems, a Gaussian noise assumption can be used:
This yields the least-squares estimate.
MLE is MAP with a uniform (equal) prior.
In machine-learning terms, MAP can be read as: model = data + prior.
Prior information shows up in machine learning as regularization:
L2-norm regularization is exactly MAP estimation of the parameters under a Gaussian prior on the model parameters.
This is also why regularization tends to prevent overfitting: the Gaussian prior prefers the model parameters to stay simple (small).
Bayesian Model Averaging
Interpretation: model ensembling.
Discriminative Models
The theory above is enough to explain how discriminative models work.
Generative Models
Naive Bayes Classifier
Model $Y$ as a Bernoulli distribution with parameters $p(y=1)$ and $p(y=-1)$.
Conditional independence: each feature dimension is independent given the label $y$.
Laplace smoothing for zero-count cases:
For a dataset with continuous features: discretize them, or use another model based on a different assumption.
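A sketch of a binary-feature Naive Bayes classifier with Laplace (add-one) smoothing; the toy dataset and names are made up:

```python
import numpy as np

def fit_naive_bayes(X, y, alpha=1.0):
    """X: binary feature matrix (n, d); y: labels in {0, 1}; alpha: Laplace smoothing count."""
    priors, likelihoods = {}, {}
    for c in (0, 1):
        Xc = X[y == c]
        priors[c] = len(Xc) / len(X)                        # P(y = c)
        # P(x_j = 1 | y = c) with add-alpha smoothing to avoid zero counts
        likelihoods[c] = (Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha)
    return priors, likelihoods

def predict(x, priors, likelihoods):
    """Pick the class maximizing log P(y=c) + sum_j log P(x_j | y=c)."""
    scores = {}
    for c in (0, 1):
        p = likelihoods[c]
        scores[c] = np.log(priors[c]) + np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    return max(scores, key=scores.get)

X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 0], [0, 1, 0]])
y = np.array([1, 1, 0, 0])
priors, likelihoods = fit_naive_bayes(X, y)
print(predict(np.array([1, 0, 1]), priors, likelihoods))    # 1
```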
Gaussian Discriminant Analysis
This is a generative model! Although it is used for classification, its modeling is designed generatively.
For datasets with all-continuous features:
Use a parametric distribution to represent $P(X=x|Y=y)$.
A common assumption in classification:
- We always assume that the data points in a class form a cluster.
Still model $p(Y=y)$ as Bernoulli distribution.
Note the shared parameters $\Sigma$ for the positive and negative classes.
Use MLE to find the best solution:
Then:
Discriminative vs. Generative
Mixture Models and EM
Gaussian Mixture Model
A generative model, with stronger assumptions than Logistic Regression.
Sample dataset from GMM:
Compute log-likelihood:
Intractable! Use the EM method to estimate the parameters.
$z$ is latent variable.
Expectation Maximization
Learning Problem:
find the MLE
Inference Problem:
Given $x$, find the conditional distribution of $z$:
EM method is for both problems!
It is hard to maximize the marginal likelihood directly:
but the complete-data log-likelihood is typically easy:
If we had a distribution $q(z)$ for $z$:
We have Evidence Lower Bound (ELBO):
Now we optimize the ELBO iteratively:
The math background for ELBO:
We get back an equality for the marginal likelihood:
Evidence = ELBO + KL-Divergence
In the E-step, to maximize the ELBO without changing $\theta$, we have to make the KL term zero. Thus $q^*(z)=p(z|x,\theta)$.
In the M-step, we find the $\theta$ that maximizes the ELBO.
In MAP case:
For GMM, E-step:
M-step:
Recommended initialization:
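Putting the E-step and M-step together, a compact numpy sketch of EM for a 1-D GMM; the starting values below are only a rough guess, not the recommended initialization:

```python
import numpy as np

def em_step(x, pi, mu, sigma2):
    """One EM iteration for a 1-D Gaussian mixture with K components.

    x: data, shape (n,); pi, mu, sigma2: mixture weights, means, variances, each shape (K,).
    """
    # E-step: responsibilities gamma[n, k] = P(z = k | x_n, theta)
    log_pdf = -0.5 * ((x[:, None] - mu[None, :]) ** 2 / sigma2[None, :]
                      + np.log(2 * np.pi * sigma2[None, :]))
    gamma = pi[None, :] * np.exp(log_pdf)
    gamma /= gamma.sum(axis=1, keepdims=True)

    # M-step: re-estimate parameters with responsibility-weighted averages
    Nk = gamma.sum(axis=0)
    pi_new = Nk / len(x)
    mu_new = (gamma * x[:, None]).sum(axis=0) / Nk
    sigma2_new = (gamma * (x[:, None] - mu_new[None, :]) ** 2).sum(axis=0) / Nk
    return pi_new, mu_new, sigma2_new

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 0.5, 100)])
pi, mu, sigma2 = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(20):
    pi, mu, sigma2 = em_step(x, pi, mu, sigma2)
print(pi, mu, sigma2)                       # recovers roughly (2/3, 1/3), (-2, 3), (1, 0.25)
```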
Variational Methods:
Note: the E-step computes the posterior of the latent variable (when it can be computed), because that is the $q(z)$ which maximizes the likelihood, i.e. the ELBO. When it cannot be computed, approximate it with variational methods.
$q(z)$ is neither the prior nor the posterior; it is just our estimate of the latent variable's distribution.
Probabilistic Topic Models
Dirichlet-Multinomial Model
Beta Distribution:
Dirichlet-Multinomial Model: the multi-dimensional generalization of the Beta distribution
Conjugate prior:
Admixture:
Latent Dirichlet Allocation (LDA):
Probabilistic Graphical Models:
Maximum Likelihood Estimation
To learn the parameters $\alpha, \eta$, use the EM method:
In E-step, calculate
However, the denominator is intractable:
This problem arises for general Bayesian models. We can use Variational Methods or Markov Chain Monte Carlo to solve it.
Variational Methods
Use the mean-field assumption in LDA:
Variational Autoencoders (VAE)
Reparameterization Trick:
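A sketch of the trick in numpy: instead of sampling $z \sim \mathcal N(\mu, \sigma^2)$ directly (which blocks gradients), sample $\epsilon \sim \mathcal N(0, I)$ and set $z = \mu + \sigma\epsilon$, so gradients can flow through $\mu$ and $\log\sigma^2$ (a real VAE would do this inside an autodiff framework):

```python
import numpy as np

def reparameterize(mu, log_var, rng=np.random.default_rng(0)):
    """Draw z = mu + sigma * eps with eps ~ N(0, I); mu and log_var come from the encoder."""
    eps = rng.normal(size=mu.shape)         # the noise is the only source of randomness
    sigma = np.exp(0.5 * log_var)
    return mu + sigma * eps                 # differentiable w.r.t. mu and log_var

mu = np.array([0.0, 1.0])
log_var = np.array([0.0, -2.0])
print(reparameterize(mu, log_var))
```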