Policy Gradient
A weighted gradient method: each action's log-probability gradient is weighted by its return/advantage.
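In symbols (a standard form; the weight $\hat{A}_t$ can be the return-to-go or an advantage estimate):

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\Big]$$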
A2C
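In A2C the weight is an advantage estimated with a learned critic, e.g. the one-step estimate:

$$\hat{A}_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$$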
Model-Based RL
If we know the dynamics of the process, planning reduces to an optimization over action sequences:
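A standard way to write this objective (assumed notation, deterministic dynamics $f$):

$$a_1, \dots, a_T = \arg\max_{a_1, \dots, a_T} \sum_{t=1}^{T} r(s_t, a_t) \quad \text{s.t.} \quad s_{t+1} = f(s_t, a_t)$$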
Objective in a Stochastic World
When the dynamics are stochastic, we maximize the expected return under the distribution of trajectories that the chosen action sequence induces.
Committing to the whole sequence in advance is suboptimal: it cannot react to which states are actually reached.
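In symbols (assumed notation):

$$a_1, \dots, a_T = \arg\max_{a_1, \dots, a_T} \; \mathbb{E}\Big[\sum_{t=1}^{T} r(s_t, a_t)\Big], \qquad s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)$$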
Open Loop Planning
Guess and check (Random Shooting)
- Sample action sequences uniformly in the action space
- Evaluate each sequence under the model and keep the best (sketched below)
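A minimal random-shooting sketch, assuming known (or learned) `dynamics(s, a)` and `reward(s, a)` functions; all names are illustrative:

```python
import numpy as np

def random_shooting(dynamics, reward, s0, horizon=15, n_candidates=1000, action_dim=2):
    """Guess and check: sample random action sequences, keep the best one."""
    best_return, best_actions = -np.inf, None
    for _ in range(n_candidates):
        # Pick an action sequence uniformly in the action space (here [-1, 1]^d).
        actions = np.random.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, total = s0, 0.0
        for a in actions:                      # roll the sequence through the model
            total += reward(s, a)
            s = dynamics(s, a)
        if total > best_return:
            best_return, best_actions = total, actions
    return best_actions
```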
Better: Cross Entropy Method
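A minimal CEM sketch under the same assumed `dynamics`/`reward` interface: iteratively refit a Gaussian over action sequences to the elite samples.

```python
import numpy as np

def cem_plan(dynamics, reward, s0, horizon=15, pop=500, elite_frac=0.1, iters=5, action_dim=2):
    """Cross-entropy method planner: sample, rank, refit to the elites."""
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    n_elite = int(pop * elite_frac)
    for _ in range(iters):
        candidates = mean + std * np.random.randn(pop, horizon, action_dim)
        returns = np.empty(pop)
        for i, actions in enumerate(candidates):
            s, total = s0, 0.0
            for a in actions:                  # evaluate each sequence under the model
                total += reward(s, a)
                s = dynamics(s, a)
            returns[i] = total
        elite = candidates[np.argsort(returns)[-n_elite:]]   # best sequences
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean  # the refined mean sequence is the plan
```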
Example: MCTS
Trajectory Optimization with Derivatives
LQR: Linear Quadratic Regulator
The cost matrices Q and R must be symmetric and positive (semi-)definite. If not, the quadratic cost is unbounded below and the optimizer can drive the result to negative infinity.
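The finite-horizon LQR problem being referred to (assumed notation: linear dynamics $A, B$ and quadratic costs $Q, R$):

$$\min_{a_{1:T}} \; \sum_{t=1}^{T} \tfrac{1}{2}\Big(s_t^\top Q\, s_t + a_t^\top R\, a_t\Big) \quad \text{s.t.} \quad s_{t+1} = A s_t + B a_t$$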
Value Iteration
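The notes give no detail here; below is a generic tabular value-iteration sketch on a known model (the shapes of `P` and `R` are assumptions):

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-6):
    """Tabular value iteration. P: transitions, shape (S, A, S); R: rewards, shape (S, A)."""
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
        Q = R + gamma * (P @ V)            # shape (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1) # optimal values and greedy policy
        V = V_new
```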
Model-Free RL
What if the model is not known?
Model-based RL:
- Run a base policy to collect a dataset
- Learn a dynamics model from the dataset
- Plan through the dynamics model to get actions
- Execute the planned actions and add the results to the dataset
Model predictive control (MPC)
- Run a base policy to collect a dataset
- Learn a dynamics model from the dataset
- Plan through the dynamics model to get actions
- Only execute the first planned action, then replan
- Append the $(s, a, s^\prime)$ transition to the dataset (see the sketch below)
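A minimal MPC sketch; the helper functions (`fit_dynamics`, `plan`, `collect_random_data`) and the classic gym-style env API are assumptions:

```python
def mpc_loop(env, fit_dynamics, plan, collect_random_data, n_iters=10, steps_per_iter=200):
    """MPC loop: replan every step but execute only the first planned action.

    The vanilla model-based loop would instead execute the whole planned sequence
    before replanning; everything else is the same.
    """
    dataset = collect_random_data(env)        # base policy seeds the dataset
    model = None
    for _ in range(n_iters):
        model = fit_dynamics(dataset)         # learn dynamics from all data so far
        s = env.reset()
        for _ in range(steps_per_iter):
            actions = plan(model, s)          # plan a full sequence through the model...
            a = actions[0]                    # ...but execute only the first action
            s_next, r, done, _ = env.step(a)
            dataset.append((s, a, s_next))    # append (s, a, s') to the dataset
            s = env.reset() if done else s_next
    return model, dataset
```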
Model-based RL with a policy
Why model-based RL with a learned model?
- Data efficiency
- No need to act in the real world
- Multi-task: the same world model can be reused across tasks
But these methods can be unstable and have worse asymptotic performance, for two reasons:
1. The model is biased toward the optimistic side, so the planned actions overfit to (exploit) the learned model
2. If the trajectory is long, model errors accumulate
To resolve 1, use uncertainty: optimize the expected reward under model uncertainty rather than the reward predicted by a single (possibly wrong) model.
Two types of uncertainty:
- Aleatoric (statistical): the environment itself is random (e.g. dice)
- Epistemic (model uncertainty): we are uncertain about the true function
Output entropy alone cannot tell the two apart; we need to measure the epistemic uncertainty specifically.
How to measure?
Training fits the model to the collected data by maximizing $\log p(D \mid \theta)$ over $\theta$.
Can we instead measure $\log p(\theta \mid D)$? That posterior captures the model uncertainty,
but it is generally intractable.
Model Ensemble!
Train multiple models and check whether they agree with each other. The models must differ; in practice, random initialization and SGD noise are enough to make them different.
- For each rollout, sample one model from the ensemble and use it to generate actions
- Evaluate the reward
- Add the data to the dataset and update the policy (see the sketch below)
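A sketch of reading ensemble disagreement as epistemic uncertainty and of sampling one ensemble member per rollout; all names are assumptions:

```python
import numpy as np

def epistemic_disagreement(models, s, a):
    """Disagreement across learned dynamics models is a proxy for epistemic
    uncertainty: diverging predictions signal an unfamiliar (s, a)."""
    preds = np.stack([m(s, a) for m in models])   # shape (n_models, state_dim)
    return preds.std(axis=0).mean()               # high value = high epistemic uncertainty

def sample_model_rollout(models, reward, s0, policy, horizon=10, rng=np.random):
    """Each rollout samples one model from the ensemble, as in the loop above."""
    model = models[rng.randint(len(models))]
    s, total, transitions = s0, 0.0, []
    for _ in range(horizon):
        a = policy(s)
        s_next = model(s, a)
        total += reward(s, a)
        transitions.append((s, a, s_next))
        s = s_next
    return total, transitions
```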
To resolve 2 (long rollouts accumulate error), use short model rollouts starting from states in the real data.
Combine the real and model-generated data to improve the policy.
Example: Dyna-style MBRL (sketched below)
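A minimal Dyna-style sketch; all interfaces (`model(s, a) -> (s_next, r)`, the buffer methods, `update_policy`) are assumptions:

```python
def dyna_update(real_buffer, model, policy, update_policy, rollout_len=3, n_rollouts=64):
    """Dyna-style update with short model rollouts branched from real states.

    Short rollouts keep accumulated model error small; real and model-generated
    data are mixed for the policy update.
    """
    model_buffer = []
    for _ in range(n_rollouts):
        s = real_buffer.sample_state()        # branch from a state actually visited
        for _ in range(rollout_len):          # only a few imagined steps
            a = policy(s)
            s_next, r = model(s, a)
            model_buffer.append((s, a, r, s_next))
            s = s_next
    # combine real and model data (sample_batch() assumed to return a list)
    update_policy(real_buffer.sample_batch() + model_buffer)
```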
Bayesian neural networks are another way to estimate the model (epistemic) uncertainty.
Value-Equivalent Model
You don't have to simulate the world exactly; it is enough that the model predicts values that are approximately the same as the real ones.
Train this with a mean-squared-error loss on the values.
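One way to write such a loss (a sketch with assumed notation; $\hat f$ is the learned model rolled forward from $s_t$, $V$ the value function):

$$\mathcal{L} = \mathbb{E}\Big[\big(V(\hat{s}_{t+k}) - V(s_{t+k})\big)^2\Big], \qquad \hat{s}_{t+k} \text{ obtained by rolling } \hat f \text{ forward from } s_t$$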
Model-Based RL with Images
Imitation Learning
Accumulated Error and Covariate Shift
DAgger:
- Train a policy on human data $D$
- Run the policy to collect a dataset $D_\pi$
- Ask a human to label $D_\pi$ with actions $a_t$
- Aggregate: $D \leftarrow D \cup D_\pi$ (see the sketch below)
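A minimal DAgger sketch; the `expert_label` and `train_policy` helpers and the classic gym-style env API are assumptions:

```python
def dagger(env, human_data, expert_label, train_policy, n_iters=5, steps=1000):
    """DAgger: run the learned policy, let the expert relabel the visited states,
    aggregate, and retrain."""
    D = list(human_data)                      # start from human demonstrations
    policy = train_policy(D)
    for _ in range(n_iters):
        visited, s = [], env.reset()
        for _ in range(steps):                # run the current policy
            a = policy(s)
            s_next, _, done, _ = env.step(a)
            visited.append(s)                 # record the states the policy visits
            s = env.reset() if done else s_next
        D += [(s, expert_label(s)) for s in visited]   # expert labels, then aggregate
        policy = train_policy(D)              # retrain on the aggregated dataset
    return policy
```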
Techniques: Dataset Resampling / Reweighting
Techniques: Pre-Trained Models to extract representations
MSE regression outputs the mean, while a cross-entropy (categorical) output gives a probability distribution. If the demonstrations go left 50% of the time and right 50%, MSE averages them into "go forward".
Leverage Demonstrations in DRL
DQfD: Deep Q-Learning from Demonstrations
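A sketch of the combined DQfD loss ($a_E$ is the demonstrator's action, $\ell$ a margin that is 0 when $a = a_E$ and positive otherwise, $\lambda_i$ weights; $J_{DQ}$, $J_n$, $J_{L2}$ are the 1-step TD, n-step TD, and L2 regularization terms):

$$J(Q) = J_{DQ}(Q) + \lambda_1 J_n(Q) + \lambda_2 J_E(Q) + \lambda_3 J_{L2}(Q), \qquad J_E(Q) = \max_{a}\big[Q(s,a) + \ell(a_E, a)\big] - Q(s, a_E)$$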