PGM Reading Notes, Excerpt (12)

As a wrap-up of the inference part, we organize here the inference problems for four common models in machine learning (HMM, LDS, MEMM, and CRF). Fortunately, all of them admit tractable algorithms, sparing us the annoyance of approximate inference.

HMM

The HMM in the usual sense is a generative model for discrete states and discrete observations. It consists of

  • the prior distribution over states \pi_k (in the derivations below we can fold it into the transition probabilities);
  • the transition probabilities \tau (k, k'), a distribution over k';
  • the emission probabilities e (k, v), a distribution over v.

The implicit assumptions of this model are

  • Markovian property: \Pr (s_t \mid s_1, \ldots, s_{t - 1}) = \Pr (s_t \mid s_{t - 1});
  • time-invariance: \Pr (s_t = k' \mid s_{t - 1} = k) = \tau (k, k') does not depend on t.

The joint distribution is therefore

\displaystyle \Pr (o_{1, \ldots, t}, s_{0, \ldots, t}) = \Pr(s_0)\prod_{i = 1}^t \Pr (s_i \mid s_{i - 1}) \Pr (o_i \mid s_i),

where \Pr(s_0) = 1 (s_0 is a dummy initial state) and can thus be omitted. Below we discuss message passing, belief update, and some common inference problems on this model.
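
To fix notation, here is a minimal numpy sketch (the toy numbers and all names are mine, not from the book) that stores \pi, \tau, e as arrays and evaluates this joint probability for a single state/observation sequence:

```python
import numpy as np

# A toy HMM with K = 2 states and V = 3 observation symbols.
# pi[k] = Pr(s_1 = k), tau[k, kp] = Pr(s_i = kp | s_{i-1} = k),
# e[k, v] = Pr(o_i = v | s_i = k); every row is a distribution.
pi = np.array([0.6, 0.4])
tau = np.array([[0.7, 0.3],
                [0.2, 0.8]])
e = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])

def joint_prob(states, obs):
    """Pr(o_{1..t}, s_{1..t}) = pi(s_1) e(s_1, o_1) prod_i tau(s_{i-1}, s_i) e(s_i, o_i)."""
    p = pi[states[0]] * e[states[0], obs[0]]
    for prev, cur, o in zip(states[:-1], states[1:], obs[1:]):
        p *= tau[prev, cur] * e[cur, o]
    return p

print(joint_prob([0, 1, 1], [2, 0, 1]))
```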

Message passing requires building a cluster graph, which here is in fact a clique tree. Its vertices are the cliques C_i, each grouping o_i, s_{i-1}, and s_i together, for i = 1, \ldots, t; the corresponding potentials are \psi_i (s_i, s_{i-1}, o_i) = \Pr (s_i \mid s_{i - 1}) \Pr (o_i \mid s_i), i = 1, \ldots, t. We can then compute the forward messages,

\displaystyle\delta_{1\to 2} (s_1) = \sum_{o_1} \psi_1 (s_1, s_0, o_1), \quad \delta_{i \to i + 1} (s_i) = \sum_{o_i} \sum_{s_{i-1}} \delta_{i - 1\to i} (s_{i - 1}) \psi_i (s_i, s_{i - 1}, o_i)

where i = 2, \ldots, t - 1. The backward messages are

\displaystyle\delta_{t\to t-1} (s_{t - 1}) = \sum_{s_t} \sum_{o_t} \psi_t (s_t, s_{t - 1}, o_t) , \quad \delta_{i\to i-1} (s_{i - 1}) = \sum_{s_i} \sum_{o_i} \delta_{i+1 \to i} (s_i) \psi_i (s_i, s_{i - 1}, o_i)

where i = t - 1, \ldots, 2. If we examine these messages carefully, we find that the forward message is in fact a marginal distribution,

\displaystyle \delta_{1\to 2} (s_1) = \sum_{o_1} \psi_1 (s_1, s_0, o_1) = \sum_{o_1} \Pr (s_1\mid s_0) \Pr (o_1 \mid s_1) = \Pr (s_1) = \pi_{s_1}

and we can keep substituting into the subsequent messages,

\displaystyle\delta_{i\to i + 1} (s_i) = \sum_{s_{i - 1}} \sum_{o_i} \delta_{i - 1 \to i} (s_{i - 1}) \psi_i (s_i, s_{i - 1}, o_i)= \sum_{s_{i - 1}} \sum_{o_i} \Pr (s_{i - 1}) \Pr (s_i\mid s_{i - 1}) \Pr (o_i \mid s_i) = \Pr(s_i)

If the observations are given, i.e., o_{1, \ldots, i} is known and clamped rather than summed out, what we obtain is instead \Pr (s_i, o_{1, \ldots, i}). As for the backward messages,

\displaystyle\delta_{t\to t-1} (s_{t - 1}) = \sum_{s_t} \sum_{o_t} \psi_t (s_t, s_{t - 1}, o_t) = \sum_{s_t} \sum_{o_t} \Pr (s_t \mid s_{t - 1}) \Pr (o_t \mid s_t) \equiv 1

and substituting into the subsequent messages gives

\displaystyle\delta_{i\to i-1} (s_{i - 1}) = \sum_{s_i} \sum_{o_i} \delta_{i+1 \to i} (s_i) \psi_i (s_i, s_{i - 1}, o_i) = \sum_{s_i} \delta_{i+1 \to i} (s_i) \Pr (s_i \mid s_{i - 1}) \equiv 1

all constants. If o_{i, \ldots, t} is given and clamped, the message instead yields \Pr (o_{i, \ldots, t} \mid s_{i - 1}).
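
A minimal sketch of these sum-product messages with the observations clamped, reusing the toy parameters above (all names are mine):

```python
import numpy as np

def forward_backward(pi, tau, e, obs):
    """Sum-product messages with observations clamped (0-indexed).
    alpha[i, k] = Pr(state_i = k, obs_{0..i})       -- forward message
    beta[i, k]  = Pr(obs_{i+1..t-1} | state_i = k)  -- backward message
    """
    t, K = len(obs), len(pi)
    alpha = np.zeros((t, K))
    beta = np.ones((t, K))
    alpha[0] = pi * e[:, obs[0]]
    for i in range(1, t):
        alpha[i] = (alpha[i - 1] @ tau) * e[:, obs[i]]
    for i in range(t - 2, -1, -1):
        beta[i] = tau @ (e[:, obs[i + 1]] * beta[i + 1])
    return alpha, beta

pi = np.array([0.6, 0.4])
tau = np.array([[0.7, 0.3], [0.2, 0.8]])
e = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
alpha, beta = forward_backward(pi, tau, e, [2, 0, 1])
print(alpha[-1].sum())  # the likelihood Pr(o_{1..t})
# alpha[i] * beta[i], normalized, is Pr(state_i | all observations).
```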

For MAP-type queries we need the max-product algorithm; the forward messages are then (writing e_k = \max_v e(k, v))

\displaystyle \delta_{1\to 2} (s_1) = \max_{o_1} \psi_1 (s_1, s_0, o_1) = \pi_{s_1} e_{s_1}

\displaystyle \delta_{i \to i + 1} (s_i) = \max_{o_i} \max_{s_{i-1}} \delta_{i - 1\to i} (s_{i - 1}) \psi_i (s_i, s_{i - 1}, o_i) = e_{s_i} \max_k \delta_{i-1\to i} (k) \tau(k, s_i)

and the backward messages are

\displaystyle \delta_{t\to t-1} (s_{t - 1}) = \max_{s_t} \max_{o_t} \psi_t (s_t, s_{t - 1}, o_t) = \max_k \tau(s_{t - 1}, k) e_k

\displaystyle \delta_{i\to i-1} (s_{i - 1}) = \max_{s_i} \max_{o_i} \delta_{i+1 \to i} (s_i) \psi_i (s_i, s_{i - 1}, o_i) = \max_k \delta_{i+1 \to i} (k) \tau (s_{i - 1}, k) e_k
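
In practice one usually runs max-product with the observations clamped, which is the Viterbi decoder; a sketch under that assumption (names mine). The unobserved variant above would simply replace e[:, obs[i]] with the row maxima e_k = \max_v e(k, v):

```python
import numpy as np

def viterbi(pi, tau, e, obs):
    """Max-product forward pass with backpointers; returns the MAP state path."""
    t, K = len(obs), len(pi)
    delta = np.zeros((t, K))            # delta[i, k]: best joint score ending in k
    back = np.zeros((t, K), dtype=int)  # argmax over the previous state
    delta[0] = pi * e[:, obs[0]]
    for i in range(1, t):
        scores = delta[i - 1][:, None] * tau  # scores[k, kp] = delta(k) tau(k, kp)
        back[i] = scores.argmax(axis=0)
        delta[i] = scores.max(axis=0) * e[:, obs[i]]
    path = [int(delta[-1].argmax())]          # best final state, then trace back
    for i in range(t - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    return path[::-1]
```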

For belief update, the beliefs are the marginal distributions over o_i, s_i, s_{i - 1}:

\displaystyle \beta_1 = \psi_1 \delta_{2\to 1} = \Pr (s_0)\Pr (s_1 \mid s_0) \Pr (o_1\mid s_1), \quad \beta_i = \psi_i\delta_{i-1\to i} \delta_{i+1\to i} = \psi_i \Pr (s_{i-1}) = \Pr (s_{i-1}, s_i, o_i), \quad \beta_t = \psi_t \delta_{t-1\to t} = \psi_t \Pr (s_{t-1}) = \Pr (s_{t - 1}, s_t, o_t)

and the corresponding sepset belief (the belief-update message) is

\displaystyle \mu_{i, i+1} (s_i) = \delta_{i\to i + 1} (s_i) \delta_{i + 1\to i} (s_i) = \Pr (s_i)

The MAP-type counterparts can be derived analogously. Thus, for filtering, \Pr (s_i \mid o_{1, \ldots, i}) \propto \Pr (s_i, o_{1, \ldots, i}) is obtained by normalizing the forward message, while prediction uses the probability

\displaystyle \Pr (s_{i + 1} \mid o_{1, \ldots, i}) = \sum_{s_i} \Pr (s_{i + 1}, s_i \mid o_{1, \ldots, i}) = \sum_{s_i} \Pr (s_{i + 1} \mid s_i) \Pr (s_i \mid o_{1, \ldots, i}) = \sum_k \tau (k, s_{i + 1}) \Pr (s_i = k \mid o_{1, \ldots, i})

which is already normalized. Smoothing, \Pr (s_i \mid o_{1, \ldots, t}), follows by normalizing the sepset beliefs above. Decoding requires \arg\max \Pr (s_{0, \ldots, t} \mid o_{1, \ldots, t}), which is essentially \arg\max \Pr (s_{0, \ldots, t}, o_{1, \ldots, t}); the two MAP-type message passes above directly yield two algorithms for it.
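
Filtering and one-step prediction reduce to a few lines on top of the forward messages; a sketch (the function name is mine):

```python
import numpy as np

def filter_predict(pi, tau, e, obs):
    """Return Pr(state_i | obs so far) for each step, plus the
    one-step prediction Pr(state_{t+1} | all observations)."""
    f = pi * e[:, obs[0]]
    f = f / f.sum()                    # Pr(s_1 | o_1)
    filtered = [f]
    for o in obs[1:]:
        f = (f @ tau) * e[:, o]
        f = f / f.sum()                # normalize the forward message
        filtered.append(f)
    prediction = filtered[-1] @ tau    # sum_k Pr(s_t = k | o_{1..t}) tau(k, .)
    return filtered, prediction
```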

LDS

An LDS has the same graph structure as an HMM, but both the states and the observations are continuous, so they are commonly modeled with Gaussians.

\displaystyle\Pr (o_{1, \ldots, t}, s_{0, \ldots, t})=\Pr (s_0) \prod_{i = 1}^{t} \Pr (s_i \mid s_{i - 1}) \Pr(o_i \mid s_i)

where

\displaystyle\Pr(s_0)\sim N(\cdot \mid\mu_0, \Gamma_0), \quad \Pr(s_i \mid s_{i - 1}) \sim N (\cdot \mid A s_{i - 1}, \Gamma), \quad \Pr (o_i \mid s_i) \sim N (\cdot \mid C s_i, \Sigma)

An equivalent way to describe these relations is with additive noise,

\displaystyle s_0 = \mu_0 + \epsilon_0, \epsilon_0 \sim N (\cdot \mid 0, \Gamma_0), \quad s_i = A s_{i - 1} + \epsilon_i, \epsilon_i \sim N (\cdot \mid 0, \Gamma), \quad o_i = C s_i + \xi_i, \xi_i \sim N (\cdot \mid 0, \Sigma).
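
The additive-noise form translates directly into a sampler; a minimal sketch with made-up dimensions and parameter values (everything here is a placeholder, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy LDS: 2-D state, 1-D observation; all parameter values are made up.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
C = np.array([[1.0, 0.0]])
mu0 = np.zeros(2)
Gamma0 = np.eye(2)
Gamma = 0.01 * np.eye(2)
Sigma = 0.1 * np.eye(1)

def simulate(t):
    """Draw (s_0, ..., s_t) and (o_1, ..., o_t) from the generative model."""
    s = mu0 + rng.multivariate_normal(np.zeros(2), Gamma0)
    states, obs = [s], []
    for _ in range(t):
        s = A @ s + rng.multivariate_normal(np.zeros(2), Gamma)
        o = C @ s + rng.multivariate_normal(np.zeros(1), Sigma)
        states.append(s)
        obs.append(o)
    return states, obs
```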

The clique tree is the same as before; the forward messages are

\displaystyle \delta_{1\to 2} (s_1) = \iint N (s_0 \mid \mu_0, \Gamma_0) N (s_1 \mid As_0, \Gamma) N (o_1 \mid C s_1, \Sigma) \, \mathrm{d} s_0 \mathrm{d} o_1 = N(s_1 \mid A \mu_0, A\Gamma_0A^\top + \Gamma) = \Pr (s_1)

\displaystyle \delta_{i \to i + 1} (s_i) = \iint \delta_{i-1\to i} (s_{i - 1}) N (s_i \mid As_{i - 1}, \Gamma) N (o_i \mid C s_i, \Sigma) \, \mathrm{d} s_{i - 1} \mathrm{d} o_i = N(s_i \mid A \mu_{i - 1}, A\Gamma_{i - 1}A^\top + \Gamma) = \Pr (s_i),

where \mathbb{E}\, s_{i - 1} = \mu_{i - 1} and \mathrm{var}\, s_{i - 1} = \Gamma_{i - 1}; the backward messages are again identically 1. For MAP-type queries, the forward message is

\displaystyle \delta_{1\to 2} (s_1) = \max_{s_0} N (s_0 \mid \mu_0, \Gamma_0) N (s_1 \mid A s_0, \Gamma) N (o_1 \mid Cs_1, \Sigma)

Taking the negative logarithm and keeping only the terms involving s_0, the optimization over s_0 is

\displaystyle \min_{s_0}\; s_0^\top (\Gamma_0^{-1} + A^\top\Gamma^{-1} A) s_0 - 2 (\mu_0^\top \Gamma_0^{-1} + s_1^\top \Gamma^{-1} A)s_0

whose solution is

\displaystyle s_0^\star = (\Gamma_0^{-1} + A^\top\Gamma^{-1} A)^{-1} (\Gamma_0^{-1} \mu_0 + A^\top \Gamma^{-1} s_1)

This is a linear function of s_1, so the overall procedure is roughly: solve the quadratic in s_{i - 1} to express s_{i - 1} as a linear function of s_i, substitute this back to obtain the message over s_i, which is again a quadratic, and keep passing it forward. Solving the final equation for s_t and then back-substituting recovers all the remaining hidden variables.
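
Each elimination step is thus a single linear solve; a small sketch of the s_0 step above (the function name and shapes are mine):

```python
import numpy as np

def map_s0(A, Gamma0, Gamma, mu0, s1):
    """s_0* = (Gamma0^{-1} + A^T Gamma^{-1} A)^{-1} (Gamma0^{-1} mu0 + A^T Gamma^{-1} s1)."""
    P0 = np.linalg.inv(Gamma0)             # prior precision
    Pt = A.T @ np.linalg.inv(Gamma) @ A    # precision contributed by the transition
    rhs = P0 @ mu0 + A.T @ np.linalg.solve(Gamma, s1)
    return np.linalg.solve(P0 + Pt, rhs)   # linear in s1, as claimed
```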

The beliefs are

\displaystyle \beta_1 (s_0, s_1, o_1) = \psi_1 \delta_{2 \to 1} = N (s_0 \mid \mu_0, \Gamma_0) N (s_1 \mid A s_0, \Gamma) N (o_1 \mid C s_1, \Sigma)

\displaystyle\beta_i(s_{i-1}, s_i, o_i) = \psi_i\delta_{i-1\to i}\delta_{i+1\to i}

and the remaining beliefs follow similarly.

For the filtering problem, \Pr (s_i \mid o_{1, \ldots, i}) given o_1, \ldots, o_i can be computed with the forward messages,

\displaystyle \Pr (s_1 \mid o_1) \propto \int \Pr (s_0) \Pr (s_1\mid s_0) \Pr(o_1 \mid s_1) \, \mathrm{d} s_0 = N (s_1 \mid A\mu_0, A\Gamma_0 A^\top + \Gamma) N (o_1 \mid Cs_1, \Sigma) \propto N (s_1 \mid \xi_1, \Lambda_1)

where, writing \mu_1 = A\mu_0 and \Gamma_1 = A\Gamma_0 A^\top + \Gamma for the predicted mean and covariance of s_1,

\displaystyle \Lambda_1 = (\Gamma_1^{-1} + C^\top \Sigma^{-1} C)^{-1}, \quad \xi_1= \Lambda_1 (\Gamma_1^{-1} \mu_1 + C^\top \Sigma^{-1} o_1)

\displaystyle \Pr (s_i \mid o_{1,\ldots, i}) \propto\int \Pr (s_{i - 1} \mid o_{1, \ldots, i - 1}) \Pr (s_i\mid s_{i - 1}) \Pr(o_i \mid s_i)\,\mathrm{d} s_{i - 1} = N (s_i \mid A\xi_{i - 1}, A\Lambda_{i - 1} A^\top + \Gamma) N (o_i \mid Cs_i, \Sigma) \propto N (s_i \mid \xi_i, \Lambda_i),

where, with the predicted quantities \hat\mu_i = A\xi_{i - 1} and \hat\Gamma_i = A\Lambda_{i - 1} A^\top + \Gamma,

\displaystyle \Lambda_i = (\hat\Gamma_i^{-1} + C^\top \Sigma^{-1} C)^{-1}, \quad \xi_i = \Lambda_i (\hat\Gamma_i^{-1} \hat\mu_i + C^\top \Sigma^{-1} o_i).

Setting \Lambda_0 = \Gamma_0 and \xi_0 = \mu_0 (so that \hat\mu_1 = \mu_1 and \hat\Gamma_1 = \Gamma_1), the whole computation can be stated in this unified form.

For the prediction problem, given o_1, \ldots, o_i, \Pr (s_{i + 1} \mid o_{1, \ldots, i}) can be computed from the filtering result:

\displaystyle \Pr (s_{i + 1} \mid o_{1, \ldots, i}) = \int \Pr (s_{i + 1} \mid s_i) \Pr (s_i \mid o_{1, \ldots, i})\, \mathrm{d} s_i = \int N (s_{i + 1} \mid A s_i, \Gamma) N (s_i \mid \xi_i, \Lambda_i) \,\mathrm{d} s_i = N (s_{i + 1} \mid A \xi_i, A\Lambda_iA^\top + \Gamma).
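
Put together, the filtering recursion and the prediction formula are the Kalman filter; a minimal sketch of the (\xi_i, \Lambda_i) recursion above in its information form (names mine):

```python
import numpy as np

def kalman_filter(A, C, Gamma, Sigma, mu0, Gamma0, observations):
    """Return (xi_i, Lambda_i), the mean and covariance of Pr(s_i | o_{1..i})."""
    xi, Lam = mu0, Gamma0                # xi_0 = mu_0, Lambda_0 = Gamma_0
    Sigma_inv = np.linalg.inv(Sigma)
    out = []
    for o in observations:
        mu_hat = A @ xi                      # predicted mean  A xi_{i-1}
        G_hat = A @ Lam @ A.T + Gamma        # predicted cov   A Lam A^T + Gamma
        Lam = np.linalg.inv(np.linalg.inv(G_hat) + C.T @ Sigma_inv @ C)
        xi = Lam @ (np.linalg.solve(G_hat, mu_hat) + C.T @ Sigma_inv @ o)
        out.append((xi, Lam))
    return out

# One-step prediction from the last filtered estimate:
# Pr(s_{t+1} | o_{1..t}) = N(A xi_t, A Lam_t A^T + Gamma).
```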

MEMM

We model \Pr (s_i \mid o_i) directly with a maximum-entropy (ME) model, but to introduce contextual dependence we split this into several ME models \Pr_{s_{i - 1}} (s_i \mid o_i); that is, the previous state determines the parameters of the ME model used at the current step. The conditional likelihood is then

\displaystyle \Pr(s_{0, \ldots, t} \mid o_{1, \ldots, t}) = \prod_{i = 1}^t \Pr_{s_{i - 1}} (s_i\mid o_i).

The assumptions here are

  • Markovian property: \Pr (s_i \mid s_{0, \ldots, i - 1}, o_{1, \ldots, i}) = \Pr (s_i \mid s_{i - 1}, o_i);
  • ME assumption: \Pr (s_i \mid s_{i - 1}, o_i) = \Pr_{s_{i - 1}} (s_i \mid o_i) = p_{s_{i - 1}} (s_i \mid o_i).

We use the same cluster graph as for the HMM; the forward messages are

\displaystyle \delta_{1\to 2} (s_1) = p_{s_0} (s_1 \mid o_1), \quad \delta_{i\to i+1} (s_i) = \sum_{s_{i-1}} \delta_{i-1 \to i} (s_{i - 1}) \psi_i = \sum_{s_{i - 1}} \delta_{i-1 \to i} (s_{i - 1}) p_{s_{i-1}} (s_i \mid o_i).

The backward messages are

\displaystyle \delta_{t\to t-1} (s_{t - 1}) = \sum_{s_t} p_{s_{t - 1}} (s_t \mid o_t), \quad \delta_{i\to i-1} (s_{i - 1}) = \sum_{s_i} \delta_{i + 1\to i} (s_i) p_{s_{i - 1}} (s_i \mid o_i).

Max-product message passing merely replaces the sums with maximizations. In belief propagation, the beliefs are

\displaystyle \beta_1 (s_0, s_1, o_1) = \psi_1 \delta_{2 \to 1} = \Pr (s_0, s_1 \mid o_{1, \ldots, t}), \quad \beta_i (s_{i - 1}, s_i, o_i) = \psi_i \delta_{i + 1 \to i} \delta_{i - 1\to i} = \Pr (s_{i - 1}, s_i \mid o_{1, \ldots, t}), \quad \beta_t (s_{t - 1}, s_t, o_t) = \psi_t \delta_{t - 1\to t} = \Pr (s_{t - 1}, s_t \mid o_{1, \ldots, t})

and the sepset belief is

\displaystyle\mu_{i, i + 1} (s_i) = \delta_{i\to i+ 1} (s_i) \delta_{i+1 \to i} (s_i) =\Pr (s_i \mid o_{1, \ldots, t})

Its filtering, prediction, and decoding algorithms are exactly the same as for the HMM.
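
A sketch of the MEMM forward recursion: since every factor is already a conditional distribution p_{s_{i-1}}(s_i \mid o_i), there is nothing to sum over the observations. Here I assume each step's conditionals are given as a (K, K) matrix P[k, k'] = p_k(s_i = k' \mid o_i) that some fitted ME model would supply:

```python
import numpy as np

def memm_forward(P_list, s0=0):
    """P_list[i][k, kp] = p_k(state_{i+1} = kp | obs_{i+1}), already conditioned
    on the observations; returns Pr(state_i | obs so far) for each step."""
    f = P_list[0][s0]        # delta_{1->2}(s_1) = p_{s_0}(s_1 | o_1)
    msgs = [f]
    for P in P_list[1:]:
        f = f @ P            # sum_{s_{i-1}} f(s_{i-1}) p_{s_{i-1}}(s_i | o_i)
        msgs.append(f)
    return msgs

# Because each row of P is normalized, the messages stay normalized:
# filtering needs no extra normalization step here.
```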

CRF

Its assumptions are

  • Markovian property, as in the MEMM;
  • invariant factors: for each transition we introduce a log-linear representation, \phi_i (s_{i - 1}, s_i, o_i) = \exp\left( \sum_j \lambda_j f_j(s_{i}, o_i, s_{i - 1})\right), where the f_j are the so-called features.

Messages, beliefs, and so on can be defined as before. If we need the log-likelihood, we must evaluate the partition function; computing it with the forward messages,

\displaystyle Z (o_{1, \ldots, t}) = \sum_{s_{0, \ldots, t}} \prod_{i = 1}^t \exp\left( \sum_j \lambda_j f_j (s_i, o_i, s_{i - 1})\right),

avoids the exponential number of summands. When computing the gradient,

\displaystyle \frac{\partial \mathcal{L}}{\partial \lambda_j} = \sum_\text{over samples} \sum_{i = 1}^t \left( f_j (s_i, o_i, s_{i - 1}) - \langle f_j (s_i, o_i, s_{i - 1})\rangle \right),

where the expectation requires \Pr (s_i, s_{i - 1} \mid o_{1, \ldots, t}), which is exactly the belief.
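
The forward recursion for Z is the HMM one with unnormalized factors, so it is best run in log space; a sketch (I assume the log-factor matrices have been precomputed from the features and weights):

```python
import numpy as np

def crf_log_partition(log_phis):
    """log Z(o_{1..t}) via the forward recursion over states.
    log_phis[i][k, kp] = sum_j lambda_j f_j(s_{i+1} = kp, o_{i+1}, s_i = k);
    log_phis[0] is indexed by the dummy initial state s_0 = 0."""
    f = log_phis[0][0]                  # log forward message over s_1
    for lp in log_phis[1:]:
        m = f[:, None] + lp             # add the previous message to each factor
        c = m.max(axis=0)               # column-wise log-sum-exp over s_{i-1}
        f = c + np.log(np.exp(m - c).sum(axis=0))
    return np.logaddexp.reduce(f)       # log sum over the final state

# The expected counts <f_j> in the gradient come from the pairwise beliefs
# Pr(s_{i-1}, s_i | o_{1..t}), i.e., the same forward plus backward messages.
```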
