Presented by Dr. Tommi Jaakkola, MIT, Nov 2 2021
Notes by Emily Liu
Bayesian networks use directed acyclic graphs to describe how a probability distribution factors into smaller components. A directed edge from node $i$ to node $j$ indicates that the distribution of variable $j$ is conditioned on variable $i$ (i.e., $i$ is a parent of $j$); the joint distribution then factors into a product of conditionals, one per variable given its parents.
Some Bayesian networks will have boxes subscripted with a constant (let's say $N$) drawn around a group of nodes. These boxes, called plates, indicate that the enclosed variables are repeated, i.e., sampled i.i.d. $N$ times from the same distribution.
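As a concrete illustration (standard factorization, not specific to this lecture): for a chain-structured graph $x \rightarrow y \rightarrow z$, the joint distribution factors as $$ P(x, y, z) = P(x)\, P(y \mid x)\, P(z \mid y), $$ and in general $P(x_1, \dots, x_n) = \prod_i P(x_i \mid \mathrm{pa}(x_i))$, where $\mathrm{pa}(x_i)$ denotes the parents of $x_i$ in the graph.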
Bayesian networks can be applied to describe many problems. A few are listed below.
Consider a matrix in which only some of the entries are known and we would like to fill in the missing ones (matrix completion). To simplify, we treat each row and each column as having its own latent feature vector, each drawn from a prior distribution. In our matrix, the known values form the dataset from which predictions are made; the missing entries are what we want to predict. The distribution of each entry is then conditioned on the features of its row and column. In this formulation, you would only need to train the row and column feature distributions rather than a separate parameter for every entry; even so, this remains a difficult task.
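The notes leave the exact parameterization open; the sketch below is a minimal, assumed instantiation (probabilistic matrix factorization with Gaussian feature vectors and Gaussian noise) just to make the generative picture concrete. All names, dimensions, and prior choices are illustrative.

```python
import numpy as np

# Illustrative sketch (assumed parameterization, not necessarily the lecture's exact model):
# each row i and column j gets a latent feature vector drawn from a Gaussian prior,
# and an observed entry x_ij is generated from the pair (u_i, v_j).
rng = np.random.default_rng(0)
n_rows, n_cols, d, sigma = 100, 50, 5, 0.1   # hypothetical sizes

U = rng.normal(0.0, 1.0, size=(n_rows, d))   # row features u_i
V = rng.normal(0.0, 1.0, size=(n_cols, d))   # column features v_j

# Each entry's distribution is conditioned on its row and column features:
# x_ij ~ N(u_i^T v_j, sigma^2)
X = U @ V.T + rng.normal(0.0, sigma, size=(n_rows, n_cols))

# Only a subset of entries is observed; the rest are the targets to predict.
mask = rng.random((n_rows, n_cols)) < 0.2    # True = observed
observed = np.where(mask, X, np.nan)
```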
Sometimes, clusters remain the same across tasks, but the proportion of examples in each cluster changes. We need to model the variability of the mixing proportions across tasks; for task $t$, $$ P(x \mid t) = \sum_{z=1}^{k} \pi_{tz}\, N(x; \mu_z, \Sigma_z) $$
The probability distribution can be constructed as follows:
- For each task $t = 1, \dots, T$:
    - $\pi_t \sim$ Dir($\alpha_1, ..., \alpha_k$)
    - For $i = 1...N$:
        - $z_{it} \sim$ Cat($\pi_{t1}, ..., \pi_{tk}$)
        - $x_{it} \sim N(\mu_{z_{it}}, \Sigma_{z_{it}})$
Where Dir is the Dirichlet distribution and Cat is the categorical distribution.
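A minimal sketch of sampling from this generative process (cluster means and covariances are shared across tasks; the dimensions and hyperparameter values below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, k, d = 3, 200, 4, 2                 # tasks, points per task, clusters, dimension
alpha = np.ones(k)                        # Dirichlet hyperparameters alpha_1..alpha_k

# Cluster parameters shared across all tasks (illustrative values).
mu = rng.normal(0.0, 5.0, size=(k, d))
Sigma = np.stack([np.eye(d)] * k)

data = []
for t in range(T):
    pi_t = rng.dirichlet(alpha)                        # pi_t ~ Dir(alpha_1..alpha_k)
    z = rng.choice(k, size=N, p=pi_t)                  # z_it ~ Cat(pi_t1..pi_tk)
    x = np.array([rng.multivariate_normal(mu[c], Sigma[c]) for c in z])
    data.append((z, x))                                # x_it ~ N(mu_{z_it}, Sigma_{z_it})
```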
Similar analysis can be applied to topic models, where you draw from a categorical distribution over words instead of a normal distribution. (In this case, the observations are words $w$ from a vocabulary $W$, and the mixture components correspond to topics.)
- For each document $t = 1, \dots, T$:
    - $\theta_t \sim$ Dir($\alpha_1, ..., \alpha_k$)
    - For $i = 1...N$:
        - $z_{i} \sim$ Cat($\theta_{t1}, ..., \theta_{tk}$)
        - $w_{i} \sim$ Cat($\{\beta_{w \mid z_i}\}_{w \in W}$)
The probability of all the words in a single document is then $$ P(w_1, \dots, w_N) = \int P(\theta \mid \alpha) \prod_{i=1}^{N}\left(\sum_{z_i=1}^{k} \theta_{z_i}\, \beta_{w_i \mid z_i}\right) d\theta $$
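This integral has no closed form in general, but as a sanity check it can be estimated by Monte Carlo: draw samples of $\theta$ from the Dirichlet prior and average the product term inside the integral. The topic count, vocabulary size, and $\beta$ values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
k, V = 3, 1000                              # topics and vocabulary size (illustrative)
alpha = np.full(k, 0.5)                     # Dirichlet hyperparameters
beta = rng.dirichlet(np.ones(V), size=k)    # beta[z, w] = P(word w | topic z)
words = rng.integers(0, V, size=50)         # a toy document w_1..w_N (word indices)

# Monte Carlo estimate of
# P(w_1..w_N) = E_{theta ~ Dir(alpha)}[ prod_i sum_z theta_z * beta[z, w_i] ]
S = 10_000
thetas = rng.dirichlet(alpha, size=S)       # S samples of theta, shape (S, k)
per_word = thetas @ beta[:, words]          # (S, N): sum_z theta_z * beta[z, w_i]
log_prod = np.log(per_word).sum(axis=1)     # log prod_i (...) for each sample
m = log_prod.max()                          # log-mean-exp for numerical stability
log_p_words = m + np.log(np.mean(np.exp(log_prod - m)))
```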
Consider, in the context of LDA topic identification, a single document with words $w_1, \dots, w_N$. In order to use EM to estimate the parameters, we need the posterior over the hidden variables ($\theta$ and $z_1, \dots, z_N$) given the observed words. We do this via the evidence lower bound (ELBO): rather than computing the exact posterior, we maximize a variational lower bound on the log-likelihood. Working with the exact posterior is not tractable, which is why an approximation to $Q$ is required.
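For reference, the standard form of the bound (stated here in the notation above; this is textbook variational inference rather than a verbatim quote of the lecture): for any distribution $Q$ over the hidden variables of a document, $$ \log P(w_1, \dots, w_N) \;\ge\; E_{Q}\left[\log P(w_1, \dots, w_N, \theta, z_1, \dots, z_N)\right] - E_{Q}\left[\log Q(\theta, z_1, \dots, z_N)\right], $$ and the mean-field approximation restricts $Q$ to the factored form $Q(\theta, z_1, \dots, z_N) = q(\theta) \prod_{i=1}^{N} q(z_i)$, which is what makes maximizing the bound feasible.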
- Bayesian networks describe complex probability distributions
- Plates in Bayesian networks describe repeated iid sampling from the same probability distribution
- Bayesian networks can be used to model matrix completion by treating the features of the columns and rows as distributions. However, this is still a difficult task.
- Bayesian networks can be used to model multi-task clustering and topic models in a similar fashion.
- The variational (evidence) lower bound, the ELBO, is maximized via an EM-style algorithm; however, to make this feasible we need to approximate the posterior $Q$ with a mean-field approximation.