Introduction
Probability Space
A probability space is a triple $(\Omega, \mathcal{F}, \mathbf{P})$, comprising the following three elements:
1. Sample space $\Omega$: the set of all possible outcomes of an experiment
2. $\sigma$-algebra (or $\sigma$-field) $\mathcal F$: a collection of subsets of $\Omega$
3. Probability measure $\mathbf P$: a function that assigns a nonnegative probability to every set in the $\sigma$-algebra $\mathcal F$
Sample space
Mutually exclusive: distinct outcomes cannot occur simultaneously.
Collectively exhaustive: every possible result is included.
$\sigma$-algebra
not unique
Three requirements:
Borel field
used to measure intervals
When $\Omega$ is continuous ($\mathbb R$, for example), the Borel field is useful.
“Minimum” $\sigma$-algebra means that deleting any element of $\mathcal B(\mathbf R)$ would violate the requirements.
Uncountable
decimal numbers between 0 and 1 are uncountable.
Probability measures
Nonnegativity $P(A)\ge0, \forall A \in \mathcal{ F}$
Normalization $P(\emptyset)=0, P(\Omega)=1$
Countable additivity: if $A_1, A_2, \dots$ are disjoint events in $\mathcal F$, then $P(A_1\cup A_2\cup \dots)=P(A_1)+P(A_2)+\dots$
- They are the axioms of probability.
- Probability is a mapping from the $\sigma$-algebra to a real number between 0 and 1, which intuitively specifies the “likelihood” of any event.
- There exist non-measurable sets, on which we cannot define a probability measure.
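The axioms can be checked directly on a small finite model. Below is a minimal Python sketch, assuming a fair six-sided die with the power set of $\Omega$ as the $\sigma$-algebra (the example is illustrative, not from the notes):

```python
from fractions import Fraction

# A fair six-sided die: omega is the sample space, P the uniform measure
# on the power-set sigma-algebra.
omega = frozenset({1, 2, 3, 4, 5, 6})

def P(event):
    """Uniform probability measure: P(A) = |A| / |omega|."""
    return Fraction(len(set(event) & omega), len(omega))

assert P(set()) == 0 and P(omega) == 1   # normalization
A, B = {1, 2}, {5, 6}                    # disjoint events
assert P(A) >= 0 and P(B) >= 0           # nonnegativity
assert P(A | B) == P(A) + P(B)           # additivity for disjoint events
```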
Discrete models
Continuous Models
Probability = Area
Some properties of Probability measure
Conditional Probability
- If $P(B)=0$, $P(A|B)$ is undefined.
- For a fixed event $B$, $P(\cdot|B)$ can be verified as a legitimate probability measure on the new universe: $P(A|B)\ge 0$, $P(\Omega|B)=1$, $P(A_1\cup A_2\cup\dots|B)=P(A_1|B)+P(A_2|B)+\dots$
- For a discrete uniform law, $P(A|B)=\frac{\text{ \# of elements of }A\cap B}{\text{total \# of elements of }B}$
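The counting formula can be checked on a uniform discrete model. A minimal sketch, assuming two fair dice with illustrative events:

```python
from fractions import Fraction
from itertools import product

# Uniform discrete model: all 36 outcomes of two fair dice.
omega = list(product(range(1, 7), repeat=2))
A = {w for w in omega if sum(w) == 8}   # the sum of the dice is 8
B = {w for w in omega if w[0] == 3}     # the first die shows 3

# P(A|B) = (# of elements of A ∩ B) / (total # of elements of B)
p_A_given_B = Fraction(len(A & B), len(B))
assert p_A_given_B == Fraction(1, 6)    # only (3, 5) lies in A ∩ B
```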
Total probability theorem
Let $A_1, …, A_n$ be disjoint events that form a partition of the sample space and assume that $P(A_i)>0$ for all $i$. Then for any event B, we have
Remark
- The definition of partition is that $\cup_{i=1}^n A_i = \Omega, A_i\cap A_j = \emptyset, \forall i\ne j$
- The probability of B is a weighted average of its conditional probability under each scenario
- Each scenario is weighted according to its prior probability
- Useful when $P(B|A_i)$ is known or easy to derive
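The weighted-average structure of the theorem can be sketched numerically. The three-scenario priors and conditionals below are hypothetical:

```python
from fractions import Fraction

# Hypothetical three-scenario model: priors P(A_i) partition the sample space,
# and the conditional probabilities P(B|A_i) are assumed known.
prior = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]           # P(A_i)
p_B_given_A = [Fraction(1, 10), Fraction(3, 10), Fraction(6, 10)]  # P(B|A_i)

assert sum(prior) == 1   # the A_i form a partition

# Total probability theorem: P(B) = sum_i P(A_i) P(B|A_i),
# a weighted average of the conditional probabilities.
p_B = sum(pa * pb for pa, pb in zip(prior, p_B_given_A))
assert p_B == Fraction(1, 4)
```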
Inference and Bayes’ rule
Let $A_1, \dots, A_n$ be disjoint events that form a partition of the sample space and assume that $P(A_i) \gt 0$ for all $i$. Then for any event $B$ such that $P(B)\gt 0$, we have
Remarks
- Relates conditional probabilities of the form $P(A_i|B)$ with conditional probabilities of the form $P(B|A_i)$
- often used in inference: effect $B$ $\leftrightarrow$ cause $A_i$
The meaning of $P(A_i|B)$ in the Bayesian view: the belief in $A_i$ is revised when we observe the effect $B$. If the cause and the effect are closely linked ($P(B|A_i) > P(B|A_i^c)$), then the belief in $A_i$ is strengthened by the observation of the effect $B$ ($P(A_i|B) > P(A_i)$). This can be derived from Bayes’ rule through a simple calculation. If $P(A_i|B)=P(A_i)$, then $B$ provides no information on $A_i$.
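This belief revision can be sketched with a small numerical example; the diagnostic-test numbers below are entirely hypothetical (cause $A$ = "condition present", effect $B$ = "test positive"):

```python
from fractions import Fraction

# Hypothetical numbers: prior, sensitivity, and false-positive rate.
p_A = Fraction(1, 100)            # prior P(A)
p_B_given_A = Fraction(95, 100)   # P(B|A)
p_B_given_Ac = Fraction(5, 100)   # P(B|A^c)

# Total probability for P(B), then Bayes' rule for the posterior.
p_B = p_A * p_B_given_A + (1 - p_A) * p_B_given_Ac
p_A_given_B = p_A * p_B_given_A / p_B

# Since P(B|A) > P(B|A^c), observing B strengthens the belief in A.
assert p_A_given_B > p_A
print(p_A_given_B)   # 19/118, about 0.16
```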
Independence
Independence of two events
Events A and B are called independent if
Remarks
- Occurrence of B provides no information about A’s occurrence
- Equivalence due to $P(A\cap B) = P(B)\cdot P(A|B)$
- Symmetric with respect to $A$ and $B$.
- applies even if $P(B) = 0$
- implies $P(B|A) = P(B)$ and $P(A|B^c) = P(A)$
- Does not imply that A and B are disjoint; indeed, quite the opposite!
- Two disjoint events with positive probabilities are never independent! ($P(A\cap B) = 0$, but $P(A)\cdot P(B)\ne 0$)
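The contrast between independence and disjointness can be verified on two fair dice (the events are chosen for illustration):

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))  # two fair dice

def P(event):
    return Fraction(len(event), len(omega))

A = {w for w in omega if w[0] <= 2}   # first die shows 1 or 2
B = {w for w in omega if w[1] == 6}   # second die shows 6
# A and B are independent yet not disjoint.
assert P(A & B) == P(A) * P(B) and A & B

C = {w for w in omega if w[0] == 1}   # first die shows 1
D = {w for w in omega if w[0] == 2}   # first die shows 2
# C and D are disjoint with positive probabilities, hence not independent.
assert not (C & D) and P(C & D) != P(C) * P(D)
```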
Conditional independence
Definition
Events $A_1, A_2, \dots, A_n$ are called independent if:
for any distinct indices $i, j, \dots, q$ chosen from $\lbrace 1, \dots, n\rbrace$.
Pairwise independence does not imply independence.
Discrete Random Variables
A Random Variable is neither random, nor a variable.
Definition
We care about the probability that $X \le x$ instead of $X = x$ for the sake of generality.
Random variables
Given a probability space $(\Omega, \mathcal F, P)$, a random variable is a function $X: \Omega \rightarrow \mathbb R$ with the property that $\lbrace\omega \in \Omega: X(\omega) \le x\rbrace \in \mathcal F$ for each $x\in \mathbb R$. Such a function $X$ is said to be $\mathcal F$-measurable.
Probability Mass Function(PMF)
Bernoulli PMF:
Binomial PMF: $p_X(k)=\binom{n}{k}p^k(1-p)^{n-k}$
Geometric PMF: $p_X(k)=(1-p)^{k-1}p$
Poisson PMF: $p_X(k)=e^{-\lambda}\frac{\lambda^k}{k!}$. Note: $\sum_{k=0}^\infty e^{-\lambda}\frac{\lambda^k}{k!}=e^{-\lambda}e^\lambda=1$
If $y=g(x)$, $p_Y(y)=\sum_{\lbrace x|g(x)=y \rbrace} p_X(x)$.
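These PMFs, and the derived-PMF formula, can be sanity-checked numerically. A sketch with illustrative parameter values (the geometric and Poisson sums are truncated at negligible tails, and the parity function $g$ is a hypothetical example):

```python
from math import comb, exp, factorial, isclose

n, p, lam = 10, 0.3, 2.0

binomial = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
geometric = [(1 - p) ** (k - 1) * p for k in range(1, 200)]
poisson = [exp(-lam) * lam**k / factorial(k) for k in range(50)]

# Each PMF sums to 1 (up to the truncated, negligible tails).
for pmf in (binomial, geometric, poisson):
    assert isclose(sum(pmf), 1.0)

# Derived PMF of Y = g(X): p_Y(y) = sum of p_X(x) over {x : g(x) = y};
# here g is a hypothetical parity function applied to a Binomial RV.
g = lambda x: x % 2
p_Y = {y: sum(q for k, q in enumerate(binomial) if g(k) == y) for y in (0, 1)}
assert isclose(p_Y[0] + p_Y[1], 1.0)
```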
Expectation and Variance
Expectation
Note: we assume that the sum converges.
Properties:
Variance
Standard deviation: $\sigma_X=\sqrt{\text{var}(X)}$
Properties:
Bernoulli RV
Discrete Uniform RV
Poisson RV
Conditional
Total expectation theorem
$A_1, \dots, A_n$ is a partition of sample space
We derive the expectation and variance using the results above.
Geometric PMF example
However, the Geometric has a memoryless property.
Thus,
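The memoryless property $P(X = n + k \mid X > n) = P(X = k)$ can be verified directly from the geometric PMF; the parameter values below are illustrative:

```python
from math import isclose

# Memorylessness of the geometric PMF with parameter p.
p, n = 0.25, 4
pmf = lambda k: (1 - p) ** (k - 1) * p   # P(X = k), k = 1, 2, ...
tail = lambda m: (1 - p) ** m            # P(X > m)

for k in range(1, 20):
    cond = pmf(n + k) / tail(n)          # P(X = n + k | X > n)
    assert isclose(cond, pmf(k))         # conditioning on X > n "restarts" X
```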
Multiple discrete random variables
Joint PMFs
Marginal PMF
Conditional PMF
Functions of multiple RVs
Expectations
linearity
Let’s calculate the mean of a Binomial RV.
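The linearity argument can be checked against the direct PMF computation; the parameter values are illustrative:

```python
from math import comb, isclose

# Mean of a Binomial(n, p) RV computed two ways: directly from the PMF,
# and by linearity as the sum of n Bernoulli(p) means.
n, p = 12, 0.35

direct = sum(k * comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1))
by_linearity = n * p   # E[X_1 + ... + X_n] = n E[X_1] = np

assert isclose(direct, by_linearity)   # both give np = 4.2
```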
Independence
Independence
if $X$ and $Y$ are independent:
Conditional independence
Continuous Random Variables
Probability Density Function
- $f_X(x)\ge 0\text{ for all }x$
- $\int_{-\infty}^\infty f_X(x)\mathrm dx = 1$
- If $\delta$ is very small, then $P(x \le X \le x+\delta)\approx f_X(x) \cdot \delta$
- For any subset $B$ of the real line, $P(X\in B) = \int_B f_X(x)\mathrm d x$.
Expectation
Assuming that the integral is well defined. The Cauchy distribution (density $\frac{1}{\pi(1+x^2)}$) doesn’t have an expectation, since $\frac{x}{1+x^2}$ is not absolutely integrable.
Variance
Uniform RV
Properties:
Common Example for PDF
Exponential Random Variable
Cumulative Distribution Functions
Properties
Examples for CDF
Geometric CDF
Exponential CDF
The exponential distribution is memoryless, like the geometric:
The relationship:
Normal Random Variables
The Gaussian is convenient: the sum of two independent Gaussian RVs is again Gaussian, and with a large number of samples the distribution of a sum is close to Gaussian (central limit theorem).
The Standard Normal Random Variable
Normal(Gaussian)
The CDF $\Phi(y)$ of a normal random variable cannot be expressed in closed form; we use the standard normal table to look up its values.
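In code, `math.erf` can play the role of the printed table, since $\Phi(y) = \frac{1}{2}\left(1 + \operatorname{erf}(y/\sqrt 2)\right)$; the numbers below are illustrative:

```python
from math import erf, sqrt, isclose

def Phi(y):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(y / sqrt(2.0)))

# For X ~ N(mu, sigma^2), standardize first: P(X <= c) = Phi((c - mu) / sigma).
mu, sigma, c = 5.0, 2.0, 7.0
prob = Phi((c - mu) / sigma)

assert isclose(Phi(0.0), 0.5)
assert abs(prob - 0.8413) < 1e-4   # Phi(1.0), matching the standard normal table
```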
Multiple Continuous Random Variables
Joint PDFs
Two continuous RVs X and Y, associated with the same experiment, are jointly continuous if they can be described by a joint PDF $f_{X, Y}$, where $f_{X, Y}$ is a nonnegative function that satisfies
for every subset B of the two-dimensional plane. In particular, when B is of the form $B = \lbrace(x, y)\mid a\le x \le b, c\le y \le d\rbrace$, we have
Normalization
Interpretation(Small rectangle)
Marginal PDF
Joint CDF
If X and Y are two RVs associated with the same experiment, then the joint CDF of X and Y is the function
Conversely
Expectations
If g is linear, of the form of $g(x, y) = ax + by + c$, then
X and Y are called independent if
Conditional and Independence
Conditional PDFs
Let X and Y be continuous RVs with joint PDF $f_{X, Y}$. For any $y$ with $f_Y(y) \gt 0$, the conditional PDF of X given Y = y is defined by
Discrete case:
By analogy, for fixed $y$ we would like:
But $\lbrace Y = y\rbrace$ is a zero-probability event.
Let $B = \lbrace y\le Y \le y + \epsilon\rbrace$, for small $\epsilon > 0$. Then
Taking the limit as $\epsilon \rightarrow 0$ defines the conditional PDF even though the conditioning event has zero probability.
Conditional Expectation
The conditional expectation of X given that A has happened is defined by
For a function g, we have
Total expectation theorem
Let $A_1, A_2, \dots, A_n$ be disjoint events that form a partition of the sample space $\Omega$, with $P(A_i)\gt 0$ for all $i$. Then
Conditional Expectation
The conditional expectation of X given that $Y = y$ has happened is defined by
For a function g, we have
Total expectation theorem
Independence
Two continuous RVs $X$ and $Y$ are independent if and only if
Independence is the same as the condition
If $X$ and $Y$ are independent, then
The continuous Bayes’s rule
Based on the normalization property $\int_{-\infty}^\infty f_{X|Y}(x|y)\mathrm dx = 1$,
Derived distributions and Entropy
Derived Distribution
If we only want the expectation $E[g(X)]$, there is no need to calculate the PDF of $g(X)$: we can integrate $g$ directly against $f_X$.
But sometimes we want the PDF $f_Y$ of $Y = g(X)$, where $Y$ is a new RV.
Principal Method
Two-step procedure for the calculation of the PDF of a function $Y=g(X)$ of a continuous RV $X$
- Calculate the CDF $F_Y$ of $Y$: $F_Y(y) = P(Y \le y)$
- Differentiate $F_Y$ to obtain the PDF $f_Y$ of $Y$: $f_Y(y) = \frac{\mathrm d F_Y}{\mathrm d y}(y)$
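The two-step procedure can be sketched for a concrete (illustrative) choice, $Y = X^2$ with $X \sim \text{Uniform}(0, 1)$:

```python
import random
from math import sqrt, isclose

# Step 1: F_Y(y) = P(X^2 <= y) = P(X <= sqrt(y)) = sqrt(y), for 0 <= y <= 1.
# Step 2: f_Y(y) = dF_Y/dy = 1 / (2 sqrt(y)).
F_Y = lambda y: sqrt(y)
f_Y = lambda y: 1.0 / (2.0 * sqrt(y))

# Check step 2 with a numerical derivative of F_Y.
y, h = 0.4, 1e-6
assert isclose((F_Y(y + h) - F_Y(y - h)) / (2 * h), f_Y(y), rel_tol=1e-6)

# Check step 1 against a Monte Carlo estimate of P(Y <= y).
random.seed(0)
samples = [random.random() ** 2 for _ in range(200_000)]
empirical = sum(s <= y for s in samples) / len(samples)
assert abs(empirical - F_Y(y)) < 0.01
```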
The PDF of $Y=aX + b$
Suppose $a>0$ and $b$ are constants.
If $X$ is Normal, then $Y = aX + b$ is also Normal.
Suppose X is normal with mean $\mu$ and variance $\sigma^2$. Then
The PDF of a strictly monotonic function
Suppose $g$ is a strictly monotonic function and that for some function $h$ and all $x$ in the range of $X$ we have
Assume that $h$ is differentiable.
Then the PDF of $Y = g(X)$ is given by
Entropy
Definition
Discrete case
Let $X$ be a discrete RV defined on probability space $(\Omega, \mathcal F, P)$. The entropy of $X$ is defined by
Continuous case
Let $X$ be a continuous RV defined on probability space $(\Omega, \mathcal F, P)$. The differential entropy of $X$ is defined by
Remarks
- a special expectation of a random variable
- a measure of uncertainty in a random experiment
- the larger the entropy, the more uncertain the experiment
- For a deterministic event, the entropy is zero
- The base of the logarithm can be different. Changing the base of the logarithm is equivalent to multiplying the entropy by a constant.
- With base 2, we say that the entropy is in units of bits
- With base e, we say that the entropy is in units of nats
- The basis of information theory
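The remarks above can be sketched directly from the definition; the example distributions are illustrative:

```python
from math import e, log, isclose

def entropy(pmf, base=2.0):
    """H(X) = -sum_k p_k log(p_k), with terms p_k = 0 contributing 0."""
    return -sum(p * log(p, base) for p in pmf if p > 0)

fair = [0.25] * 4
assert isclose(entropy(fair), 2.0)    # uniform 4-way choice: 2 bits
assert entropy([1.0]) == 0.0          # deterministic event: zero entropy

# Changing the base only rescales the entropy: H_nats = H_bits * ln 2.
assert isclose(entropy(fair, base=e), 2.0 * log(2))
```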
Maximum entropy distributions
- Maximum entropy distributions: distributions with maximum entropy under some constraints
- Gaussian, exponential, and uniform distributions are all maximum entropy distributions under certain conditions
- Why study maximum entropy distributions? They are the most random distributions, reflecting the maximum uncertainty about the quantity of interest
Definition
Discrete Case
X can take a finite number of values $x_1, x_2, \dots, x_n$, satisfying $p_X(x_k) = p_k$.
We have the following optimization problem:
Solution
Applying the Lagrange multiplier method, we have
Note that the above is true for all $k$. So we have
Continuous Case 1
$X \in (-\infty, \infty)$.
With constraints on the mean and variance,
we have the following optimization problem:
In detail,
Solving the above problem, we have Gaussian distribution with mean $\mu$ and variance $\sigma^2$.
Solution
For all measurable functions $g$, we have
Therefore,
$G(t)$ reaches its maximum at $t = 0$.
Then apply the Lagrange multiplier method, we have
Differentiate $\overline{G}(t)$ with respect to $t$ and set the derivative to zero.
Continuous Case 2
$X \in [0, \infty)$.
With a constraint on the mean only, we have the following optimization problem:
In detail,
Solving the above problem, we have exponential distribution with parameter $\lambda$.
Continuous Case 3
$X \in [a, b]$.
With no constraint, we have the unconstrained optimization problem:
In detail,
Solving the above problem, we have uniform distribution within $[a, b]$.
Convolution, covariance, correlation, and conditional expectation
Convolution
Discrete case
PMF $p_W$ is the convolution of PMFs $p_X$ and $p_Y$.
The distribution of $X+Y$
Mechanics:
- Put the PMFs on top of each other
- Flip the PMF of $Y$
- Shift the flipped PMF by $w$ (to the right if $w>0$)
- Cross-multiply and add
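These mechanics reduce to a double sum, which can be sketched for PMFs stored as dicts; the two-dice example is illustrative:

```python
from math import isclose

def convolve(p_X, p_Y):
    """PMF of W = X + Y for independent discrete X and Y given as {value: prob}."""
    p_W = {}
    for x, px in p_X.items():
        for y, py in p_Y.items():
            # Cross-multiply and add, accumulating mass at each total w = x + y.
            p_W[x + y] = p_W.get(x + y, 0.0) + px * py
    return p_W

die = {k: 1 / 6 for k in range(1, 7)}
p_sum = convolve(die, die)               # PMF of the sum of two fair dice

assert isclose(sum(p_sum.values()), 1.0)
assert isclose(p_sum[7], 6 / 36)         # 7 is the most likely total
```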
Continuous Case
Sum of 2 independent normal RVs
which is constant on an ellipse (a circle if $\sigma_x = \sigma_y$).
$W$ is Normal.
Mean = 0, Variance = $\sigma_x^2 + \sigma_y^2$
Same argument for nonzero mean case.
The difference of two independent RVs
$X$ and $Y$ are independent exponential RVs with parameter $\lambda$.
Fix some $z\ge 0$ and note that $f_Y(x-z)$ is nonzero when $x\ge z$.
The answer for the case $z\le 0$:
The first equality holds by symmetry.
Covariance and Correlation
Definition
The covariance of two RVs $X$ and $Y$, denoted by $\text{cov}(X, Y)$, is defined by
or,
$X$ and $Y$ are uncorrelated if $\text{cov}(X, Y) = 0$.
Zero mean case $\text{cov}(X, Y) = E[XY]$
Properties
Variance of the sum of RVs
In particular,
Correlation coefficient
The correlation coefficient $\rho(X, Y)$ of two RVs $X$ and $Y$ that have nonzero variance is defined as
- $-1 \le \rho \le 1$
- $|\rho| = 1 \Leftrightarrow (X-E[X]) = c(Y-E[Y])$
- Independent $\Rightarrow \rho = 0$ (the converse is not true)
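A sample-based sketch of these quantities; the linear model $Y = 2X + \text{noise}$ and all parameter values are illustrative:

```python
import random
from math import sqrt

def cov(xs, ys):
    """Sample estimate of cov(X, Y) = E[XY] - E[X]E[Y]."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum(x * y for x, y in zip(xs, ys)) / n - mx * my

def rho(xs, ys):
    """Sample correlation coefficient."""
    return cov(xs, ys) / sqrt(cov(xs, xs) * cov(ys, ys))

random.seed(1)
x = [random.gauss(0, 1) for _ in range(100_000)]
y = [2 * xi + random.gauss(0, 1) for xi in x]      # correlated with X
z = [random.gauss(0, 1) for _ in range(100_000)]   # independent of X

assert 0.85 < rho(x, y) < 0.95   # theory: 2 / sqrt(5) ≈ 0.894
assert abs(rho(x, z)) < 0.02     # independence gives rho ≈ 0
```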
Conditional expected value
Conditional expectation
Definition
$E[X|Y=y]$ is a function of $y$.
Law of iterated expectations
Conditional expectation as an estimator
Denote the conditional expectation
as an estimator of $X$ given $Y$, and the estimation error
is a RV.
Properties of the estimator:
Unbiased
For any possible $Y=y$:
By the law of iterated expectations
Uncorrelated
Since $X = \hat{X} + \tilde{X}$, the variance of X can be decomposed as
Conditional variance
here comes the law of total variance:
The total variability is the average variability within sections plus the variability between sections.
Law of iterated expectations
Conditional variance
Law of total variance
Transforms and sum of a random number of random variables
The transform associated with a RV $X$ is a function $M_X(s)$ of a scalar parameter $s$, defined by
Remarks
- a function of $s$, rather than a number
- not necessarily defined for all (complex) s
- always well defined for $\Re(s)=0$
- compared with Laplace transform
Properties
Sanity Checks
Linear operation
Expected Values
since
Example
$X$ is a Poisson RV with parameter $\lambda$
$X$ is an exponential RV with parameter $\lambda$
$Y$ is a standard normal RV,
Consider $X = \sigma Y + \mu$
Inversion of transforms
Inversion Property
The transform $M_X(s)$ associated with a RV $X$ uniquely determines the CDF of $X$, assuming that $M_X(s)$ is finite for all $s$ in some interval $[-a, a]$, where $a$ is a positive number.
Example:
The probability $P(X = k)$ is found by reading the coefficient of the term $e^{ks}$:
Transform of Mixture of Distributions
Let $X_1,\dotsb, X_n$ be continuous RVs with PDFs $f_{X_1}, \dotsb, f_{X_n}$.
The value $y$ of RV $Y$ is generated as follows: an index $i$ is chosen with a corresponding probability $p_i$, and $y$ is taken to be the value of $X_i$. Then,
Sum of independent RVs
Let $X$ and $Y$ be independent RVs, and let $Z = X + Y$. The transform associated with $Z$ is
Since
Generalization:
A collection of independent RVs: $X_1, \dotsb, X_n$, $Z = X_1 + \dotsb + X_n$ ,
Example
Let $X_1, \dotsb, X_n$ be independent Bernoulli RVs with a common parameter $p$:
$Z = X_1 + \dotsb + X_n$ is binomial with parameters n and p:
Let $X$ and $Y$ be independent Poisson RVs with means $\lambda$ and $\mu$, and let $Z = X + Y$. Then $Z$ is still Poisson with mean $\lambda + \mu$.
Let $X$ and $Y$ be independent Gaussian RVs with means $\mu_x$ and $\mu_y$, and variances $\sigma_x^2, \sigma_y^2$. And let $Z = X + Y$. Then $Z$ is still Gaussian with mean $\mu_x + \mu_y$ and variance $\sigma_x^2 + \sigma_y^2$
Consider
where $N$ is a RV that takes integer values, and $X_1, \dotsb, X_N$ are identically distributed RVs.
Assume that $N, X_1, \dotsb$ are independent.
For the variance,
So,
For transform,
Summary on Properties
Consider the sum
where $N$ is a RV that takes integer values, and $X_1, X_2, \dotsb$ are identically distributed RVs. Assume that $N$, $X_1, X_2, \dotsb$ are independent.
Example
Assume that $N$ and $X_i$ are both geometrically distributed with parameters $p$ and $q$ respectively. All of these RVs are independent. $Y = X_1 + \dotsb + X_N$
$Y$ is also geometrically distributed, with parameter $pq$.
Weak law of large numbers
Markov inequality
If a RV $X$ can only take nonnegative values, then
Intuition: If a nonnegative RV has a small mean, then the probability that it takes a large value must be small.
Fix a positive number $a$,
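The bound $P(X \ge a) \le E[X]/a$ can be checked numerically; the choice $X \sim \text{Exponential}(1)$ (so $E[X] = 1$) is illustrative:

```python
import random

# Numerical check of the Markov inequality for a nonnegative RV.
random.seed(2)
samples = [random.expovariate(1.0) for _ in range(100_000)]
mean = sum(samples) / len(samples)   # estimate of E[X] = 1

for a in (1.0, 2.0, 5.0):
    tail = sum(s >= a for s in samples) / len(samples)   # P(X >= a)
    assert tail <= mean / a   # e.g. a = 5: tail ≈ e^-5 ≈ 0.007 <= 0.2
```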
Chebyshev’s Inequality
If $X$ is a RV with mean $\mu$ and variance $\sigma^2$, then
Intuition: If a RV has small variance, then the probability that it takes a value far from its mean is also small.
Upper bounds based on $\sigma^2$:
Chernoff inequality
If a RV $X$ has MGF $M_X(s)$, then
or, for $s \ge 0$
for $s \lt 0$
proof: for $s \ge 0$
Weak law of large numbers
Let $X_1, X_2, \dots$ be independent identically distributed (i.i.d.) RVs with finite mean $\mu$ and variance $\sigma^2$
Applying the Chebyshev inequality, we get:
For large $n$, the bulk of the distribution of $M_n$ is concentrated near $\mu$
Theorem
Let $X_1, X_2, \dots$ be independent identically distributed (i.i.d.) RVs with finite mean $\mu$ and variance $\sigma^2$. For every $\epsilon \gt 0$, we have
$M_n$ converges in probability to $\mu$.
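The concentration of $M_n$ near $\mu$ can be seen in a short simulation; the choice $X_i \sim \text{Uniform}(0,1)$ (so $\mu = 0.5$) and the tolerance $\epsilon = 0.05$ are illustrative:

```python
import random

# WLLN sketch: the frequency of |M_n - mu| >= eps shrinks as n grows.
random.seed(3)
mu, eps, trials = 0.5, 0.05, 2000

def freq_far(n):
    """Empirical frequency of the deviation event over many independent runs."""
    count = 0
    for _ in range(trials):
        m_n = sum(random.random() for _ in range(n)) / n   # sample mean M_n
        count += abs(m_n - mu) >= eps
    return count / trials

f10, f400 = freq_far(10), freq_far(400)
assert f400 < f10   # the deviation probability decreases with n
```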
Convergence “in Probability”
Theorem: Convergence in Probability
Let $\lbrace Y_n\rbrace$ (or $Y_1, Y_2, \dots$) be a sequence of RVs (not necessarily independent), and let $a$ be a real number. We say that the sequence $Y_n$ converges to $a$ in probability, if for every $\epsilon \gt 0$, we have
(Almost) all of the PMF/PDF of $Y_n$ eventually gets concentrated arbitrarily close to $a$.
Many types of convergence
Deterministic limits: $\lim_{n\rightarrow \infty} a_n = a$
Convergence in probability: $X_n\stackrel P{\rightarrow} X$
(WLLN)
Convergence in Distribution: $X_n \stackrel{D}{\rightarrow} X$
for all points $x$ at which the function $F_X(x) = P(X\le x)$ is continuous.
(CLT)
Convergence with probability $1$(almost surely): $X_n \stackrel{\text{a.s.}}{\rightarrow} X$
or
Lemma:
i.o. stands for infinitely often
(SLLN)
Convergence in Mean/in Norm: $X_n \stackrel{r}{\rightarrow}X$
if $E[X_n^r] \lt \infty$ for all $n$ and
Relations:
The converse assertions fail in general!
The relation between “almost surely” and “in r-th mean” is complicated. There exist sequences that converge almost surely but not in mean, and sequences that converge in mean but not almost surely!
Central Limit Theorem
Theorem
Let $X_1, X_2, \dots$ be i.i.d. RVs with mean $\mu$ and variance $\sigma^2$. Let
Then
The CDF of $Z_n$ converges to the normal CDF (convergence in distribution)
Normal Approximation Based on the Central Limit Theorem
Let $S_n = X_1 + \dotsb + X_n$, where the $X_i$ are $\text{i.i.d.}$ RVs with mean $\mu$ and variance $\sigma^2$. If $n$ is large, the probability $P(S_n \le c)$ can be approximated by
treating $S_n$ as if it were normal, according to the following procedure.
- Calculate the mean $n\mu$ and the variance $n\sigma^2$ of $S_n$
- Calculate the normalized value $z = (c - n\mu)/(\sigma\sqrt{n})$
- Use the approximation
where $\Phi(z)$ is available from the standard normal CDF.
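The three-step procedure can be sketched and compared against a Monte Carlo estimate; the choice $X_i \sim \text{Uniform}(0, 1)$ (so $\mu = 1/2$, $\sigma^2 = 1/12$) and all numbers are illustrative:

```python
import random
from math import erf, sqrt

n, mu, sigma2, c = 48, 0.5, 1 / 12, 26.0

# Steps 1-2: mean n*mu = 24, variance n*sigma^2 = 4, normalized value z.
z = (c - n * mu) / sqrt(n * sigma2)          # z = 1.0 here
# Step 3: P(S_n <= c) ≈ Phi(z), via math.erf.
approx = 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Monte Carlo estimate of the exact probability, for comparison.
random.seed(4)
trials = 50_000
hits = sum(sum(random.random() for _ in range(n)) <= c for _ in range(trials))
assert abs(hits / trials - approx) < 0.01    # Phi(1.0) ≈ 0.8413
```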
Proof
Suppose that $X_1, X_2, \dots$ have mean zero.
Suppose that the transform $M_X(s)$ has a second order Taylor series expansion around $s=0$,
where $a = M_X(0) = 1, b = M_X’(0) = E[X] = 0, c = \frac{1}{2}M_X’’(0) = \frac{\sigma^2}{2}$
Then
As $n\rightarrow \infty$,
Approximation of the binomial:
(De Moivre–Laplace approximation to the binomial)
The Strong Law of Large Numbers
Theorem
Let $X_1, X_2, \dots$ be i.i.d. RVs with mean $\mu$.
Borel-Cantelli lemma & Bernoulli Process
Limit of set sequence
If the upper limit equals the lower limit, the limit of the set sequence exists.
The upper limit can also be denoted as
Borel-Cantelli Lemma
Let $\lbrace A_n, n = 1, 2, \dotsb\rbrace$ be a sequence of events, then
Let $\lbrace A_n, n = 1, 2, \dotsb\rbrace$ be a sequence of independent events, then
Stochastic process
A stochastic process is a mathematical model of a probabilistic experiment that evolves in time and generates a sequence of
numerical values.
- Bernoulli process (memoryless, discrete time)
- Poisson process (memoryless, continuous time)
The Bernoulli Process
A Bernoulli process is a sequence of independent Bernoulli trials, each with probability of success $p$.
Independence property: For any given time $n$, the sequence of $X_{n + 1}, X_{n + 2}, \dots$ is also a Bernoulli process, and is independent from $X_1, \dots, X_n$
Memoryless property: Let $n$ be a given time and let $\overline T$ be the time of the first success after
time $n$. Then $\overline T − n$ has a geometric distribution with parameter $p$,
and is independent of the RVs $X_1, \dots , X_n$.
Interarrival times
Denote the time of the $k$th success as $Y_k$ and the $k$th interarrival time as $T_k$.
$T_k$ represents the number of trials following the $(k - 1)$th success up to and including the next success.
Note that
Alternative description of the Bernoulli process:
- Start with a sequence of independent geometric RVs $T_1, T_2, \dots$ with common parameter $p$, and let these stand for the interarrival times.
- Record a success at times $T_1$, $T_1 + T_2$, etc.
Splitting of a Bernoulli process
Whenever there is an arrival, we choose to either keep it (with probability $q$), or to discard it (with probability $1 − q$).
Both the process of arrivals that are kept and the process of discarded arrivals are Bernoulli processes, with success probability $pq$ and $p(1 − q)$, respectively, at each time.
Merging of a Bernoulli process
In a reverse situation, we start with two independent Bernoulli processes (with parameters $p$ and $q$ respectively). An arrival is
recorded in the merged process if and only if there is an arrival in at least one of the two original processes.
The merged process is Bernoulli, with success probability $p+q−pq$ at each time step.
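The merged rate $p + q - pq$ can be checked by simulation; the parameter values are illustrative:

```python
import random
from math import isclose

# Merge two independent Bernoulli processes: the merged process records an
# arrival iff at least one of the original processes has one at that step.
random.seed(5)
p, q, steps = 0.3, 0.2, 200_000

# '|' (not 'or') so both draws are always made, keeping the processes independent.
merged = sum((random.random() < p) | (random.random() < q) for _ in range(steps))
rate = merged / steps

assert isclose(rate, p + q - p * q, abs_tol=0.01)   # theory: 0.44
```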
The Poisson Process
Definition
An arrival process is called a Poisson process with rate $λ$ if it has the following properties:
Time homogeneity
Independence
Numbers of arrivals in disjoint time intervals are independent.
Small interval probabilities
Bernoulli/Poisson Relation
In a short time interval $\delta$
For binomial PMF $p_S(k;n,p)$,
PMF of Number of Arrivals $N$
Time $T$ of the first arrival
Memoryless property: the time to the next arrival is independent of the past.
Interarrival times
We also denote the time of the $k$th success as $Y_k$, and the $k$th interarrival time as $T_k$. That is,
Note that
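The interarrival description can be simulated: drawing independent $\text{Exponential}(\lambda)$ interarrival times makes the number of arrivals in $[0, t]$ Poisson with mean $\lambda t$. A sketch with illustrative parameter values:

```python
import random
from math import exp, isclose

random.seed(6)
lam, t, trials = 2.0, 3.0, 100_000

def arrivals_by(horizon):
    """Count arrivals in [0, horizon] using exponential interarrival times."""
    count, clock = 0, 0.0
    while True:
        clock += random.expovariate(lam)   # next interarrival time T_k
        if clock > horizon:
            return count
        count += 1

counts = [arrivals_by(t) for _ in range(trials)]
mean_count = sum(counts) / trials
p0 = sum(c == 0 for c in counts) / trials

assert abs(mean_count - lam * t) < 0.05            # E[N] = lambda * t = 6
assert isclose(p0, exp(-lam * t), abs_tol=0.005)   # P(N = 0) = e^{-lambda t}
```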