Source-LDA on Earth

17 Jun 2021 Note

Notes on Source-LDA, written up for Monday's sharing session.

When perplexity is used to evaluate a language model, it can disagree with human intuitive judgment.

Perplexity is not strongly correlated to human judgment

[Chang09] have shown that, surprisingly, predictive likelihood (or equivalently, perplexity) and human judgment are often not correlated, and even sometimes slightly anti-correlated.

They ran a large scale experiment on the Amazon Mechanical Turk platform. For each topic, they took the top five words (ordered by frequency p(w | k) = φ_{k,w}) of that topic and added a random sixth word. Then, they presented these lists of six words to participants asking them to identify the intruder word.

If every participant could identify the intruder, then we could conclude that the topic is good at describing an idea. If, on the other hand, many people identified one of the topic's top five words as the intruder, it means that they could not see the logic in the association of words, and we can conclude the topic was not good enough.

It’s important to understand what this experiment is proving. The result proves that, given a topic, the five words that have the largest frequency p(w | k) = φ_{k,w} within their topic are usually not good at describing one coherent idea; at least not good enough to be able to recognize an intruder.
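
To make the setup concrete, here is a minimal sketch of how such an intruder item could be built from a fitted topic-word matrix (my own illustration, not the authors' code; `phi` and `vocab` are made-up stand-ins for a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for a fitted model: phi is a (K, V) topic-word probability matrix, vocab lists the V words.
K, V = 10, 1000
phi = rng.dirichlet(np.full(V, 0.01), size=K)
vocab = [f"word{i}" for i in range(V)]

def intruder_item(phi, vocab, topic, rng):
    """Top five words of `topic` plus one word that is probable under a different topic."""
    top5 = np.argsort(phi[topic])[::-1][:5]
    other = rng.integers(phi.shape[0] - 1)
    other = other + (other >= topic)                 # any topic except `topic`
    candidates = np.argsort(phi[other])[::-1][:10]
    intruder = next(w for w in candidates if w not in set(top5))
    words = [vocab[w] for w in top5] + [vocab[intruder]]
    rng.shuffle(words)
    return words, vocab[intruder]

words, intruder = intruder_item(phi, vocab, topic=0, rng=rng)
print(words, "| intruder:", intruder)
```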

First, let's walk through paper [1]!

Abstract

Topic modeling has increasingly attracted interests from researchers. Common methods of topic modeling usually produce a collection of unlabeled topics where each topic is depicted by a distribution of words. Associating semantic meaning with these word distributions is not always straightforward. Traditionally, this task is left to human interpretation. Manually labeling the topics is unfortunately not always easy, as topics generated by unsupervised learning methods do not necessarily align well with our prior knowledge in the subject domains.

Currently, two approaches to solve this issue exist. The first is a post-processing procedure that assigns each topic with a label from the prior knowledge base that is semantically closest to the word distribution of the topic. The second is a supervised topic modeling approach that restricts the topics to a predefined set whose word distributions are provided beforehand.

Neither approach is ideal, as the former may produce labels that do not accurately describe the word distributions, and the latter lacks the ability to detect unknown topics that are crucial to enrich our knowledge base.

Our goal in this paper is to introduce a semisupervised Latent Dirichlet allocation (LDA) model, Source-LDA, which incorporates prior knowledge to guide the topic modeling process to improve both the quality of the resulting topics and of the topic labeling. We accomplish this by integrating existing labeled knowledge sources representing known potential topics into a probabilistic topic model. These knowledge sources are translated into a distribution and used to set the hyperparameters of the Dirichlet generated distribution over words. This approach ensures that the topic inference process is consistent with existing knowledge, and simultaneously, allows for discovery of new topics. The results show improved topic generation and increased accuracy in topic labeling when compared to those obtained using various labeling approaches based off LDA.

I. INTRODUCTION

Existing topic modeling is often based off Latent Dirichlet allocation (LDA)[1] and involves analyzing a given corpus to produce a distribution over words for each latent topic and a distribution over latent topics for each document. The distributions representing topics are often useful and generally representative of a linguistic topic. Unfortunately, assigning labels to these topics is often left to manual interpretation.

Identifying topic labels is useful in summarizing a set of words with a single label. For example, words such as pencil, laptop, ruler, eraser, and book can be mapped to the label “School Supplies.” Adding descriptive semantics to each topic can help people, especially those without domain knowledge, to understand topics obtained by topic modeling.

A motivating application of accurate topic labeling is to develop summarization systems for primary care physicians, who are faced with the challenges of being inundated with too much data for a patient and too little time to comprehend it all [2]. The labels can be used to more appropriately and quickly give an overview, or a summary, of patient’s medical history, leading to better outcomes for the patient. This added information can bring significant value to the field of clinical informatics which already utilizes topic modeling without labeling [3]–[5].

Existing approaches in labeling topics usually do their fitting of labels to topics after completion of the unsupervised topic modeling process. A topic produced by this approach may not always match well with any semantic concepts and would therefore be difficult to categorize with a single label. These problems are best illustrated via a simple case study.

1) Case Study: Suppose a corpus of a news source that consists of two articles is given by documents d1 and d2 each with three words:

d1 - pencil, pencil, umpire

d2 - ruler, ruler, baseball

LDA (with the traditionally used collapsed Gibbs sampler, standard hyperparameters and the number of topics (K) set as two) would output different results for different runs due to the inherent stochastic nature. It is very possible to obtain the following result of topic assignments:

d1 - pencil^1, pencil^1, umpire^2

d2 - ruler^2, ruler^2, baseball^1

But these assignments to topics differ from the ideal solution, which involves knowing the context of the topics from which these words come. If the topic modeling were to incorporate prior knowledge about the topics “School Supplies” and “Baseball”, then a topic modeling process would more likely generate the ideal topic assignments of:

d1 - pencil^1, pencil^1, umpire^2

d2 - ruler^1, ruler^1, baseball^2

and assign a label of “School Supplies” to topic 1 and “Baseball” to topic 2. Furthermore it is advantageous to incorporate this prior knowledge during the topic modeling process. Consider the following table displaying four different mapping techniques of the first result using the Wikipedia articles of “School Supplies” and “Baseball” as the prior knowledge:

Technique       Topic 1     Topic 2
JS Divergence   Baseball    Baseball
TF-IDF/CS       (same)      (same)
Counting        Baseball    Baseball
PMI             (same)      (same)

Applying this labeling after topic modeling can lead to problems with the topics themselves. This is not so much a problem of the mapping techniques but of the topics used as input. By separating the topics during inference, this problem of combining different semantic topics can be avoided.
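
The shared mechanism behind these post-hoc mapping techniques can be sketched as follows (my own illustration, not code from the paper): compare each topic's word distribution with the word distribution of each candidate knowledge-source article and pick the closest label, here using JS divergence; the "article" texts below are tiny placeholders.

```python
import numpy as np
from collections import Counter

def word_dist(text, vocab):
    """Empirical word distribution of `text` restricted to `vocab` (with tiny smoothing)."""
    counts = Counter(w for w in text.lower().split() if w in vocab)
    p = np.array([counts[w] for w in vocab], dtype=float) + 1e-9
    return p / p.sum()

def js_divergence(p, q):
    m = 0.5 * (p + q)
    return 0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m))

vocab = ["pencil", "ruler", "eraser", "umpire", "baseball", "pitcher"]

# Tiny placeholder stand-ins for the Wikipedia articles used as prior knowledge.
sources = {
    "School Supplies": "pencil eraser ruler pencil eraser notebook",
    "Baseball": "baseball umpire pitcher baseball inning umpire",
}

# An illustrative topic-word distribution as LDA might produce it.
topic = np.array([0.05, 0.05, 0.02, 0.40, 0.40, 0.08])

label = min(sources, key=lambda s: js_divergence(topic, word_dist(sources[s], vocab)))
print("assigned label:", label)      # -> Baseball
```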

To overcome this problem, one may take a supervised approach that incorporates such prior knowledge into the topic modeling process to improve the quality of topic assignments and more effectively label topics. However, existing supervised approaches [6]–[8] are either too lenient or too strict. For example, in the Concept-topic model (CTM) [6], a multinomial distribution is placed over known concepts with associated word sets. This pioneering approach does integrate prior knowledge, but does not take into account word distributions. For example if a document is generated about the topic “School Supplies” it is much more probable to see the word “pencil” than the word “compass” even though both words may be associated with the topic “School Supplies”. This technique also requires some supervision which requires manually inputting preexisting concepts and their bags of words.

Another approach given by Hansen et al. as explicit Dirichlet allocation [7] incorporates a preexisting distribution based off Wikipedia but does not allow for variance from the Wikipedia distribution. This approach fulfills the goal of incorporating prior knowledge with their distributions but requires the topic in the generated corpus to strictly follow the Wikipedia word distributions.

To address these limitations, we propose the Source-LDA model which is a balance between these two approaches. The goal is to allow for simultaneous discovery of both known and unknown topics. Given a collection of known topics and their word distributions, Source-LDA is able to identify the subset of these topics that appear in a given corpus. It allows some variance in word distributions to the extent that it optimizes the topic modeling. A summary of the contributions of this work are:

1) We propose a novel technique to topic modeling in a semi-supervised fashion that takes into account preexisting topic distributions.

2) We show how to find the appropriate topics in a corpus given an input set that contains a subset of the topics used to generate a corpus.

3) We explain how to make use of prior knowledge sources. In particular, we show how to use Wikipedia articles to form word distributions.

4) We introduce an approach that allows for variance from an input topic to the latent topic discovered during the topic modeling process.

The rest of this paper is organized as follows: In Section 2, we give a brief introduction to the LDA algorithm and the Dirichlet distribution. A more detailed description of the Source-LDA algorithm is presented in Section 3. In Section 4, the algorithm is used and evaluated under various metrics. Related literature is highlighted in Section 5. Section 6 gives the conclusions of this paper.

For reproducible research, we make all of our code available online.

https://github.com/ucla-scai/Source-LDA

II. PRELIMINARIES

A. Dirichlet Distribution

The Dirichlet distribution is a distribution over probability mass functions with a specific number of atoms and is commonly used in Bayesian models. A property of the Dirichlet that is often used in inference of Bayesian models is conjugacy to the multinomial distribution. This allows for the posterior of a random variable with a multinomial likelihood and a Dirichlet prior to also be a Dirichlet distribution.

The parameters are given as a vector denoted by α. The probability density function for a given probability mass function (PMF) θ and parameter vector α of length J is defined as:

$$ f(\theta, \alpha) = \frac{\Gamma\left(\sum_{i=1}^{J} \alpha_i\right)}{\prod_{i=1}^{J} \Gamma(\alpha_i)} \prod_{i=1}^{J} \theta_i^{\alpha_i - 1} $$

A sample from the Dirichlet distribution produces a PMF that is parameterized by α. The choice of a particular set of α values influences the outcome of the generated PMF. If all α values are the same (symmetric parameter), as α approaches 0, the probability will be concentrated on a smaller set of atoms. As α approaches infinity, the PMF will become the uniform distribution. If all αi are natural numbers then each individual αi can be thought of as the “virtual” count for the ith value [9].
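
A quick numerical illustration of the symmetric-α behaviour described above, using numpy's Dirichlet sampler (my own sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
J = 5  # number of atoms

for alpha in (0.01, 1.0, 100.0):
    theta = rng.dirichlet(np.full(J, alpha))
    print(f"alpha={alpha:>6}: {np.round(theta, 3)}")

# Typical output pattern:
#   alpha=0.01  -> mass concentrated on one or two atoms, e.g. [0. 0. 0.999 0. 0.001]
#   alpha=1.0   -> an arbitrary PMF
#   alpha=100.0 -> close to uniform, e.g. [0.2 0.21 0.19 0.2 0.2]
```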

B. Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is the basis for many existing probabilistic topic models, and the framework for the approach presented by this paper. Since we enhance the LDA model in our proposed approach it is worth giving a brief overview of the algorithm and model of LDA. LDA is a hierarchical Bayes model which utilizes Dirichlet priors to estimate the intractable latent variables of the model. At a high level, LDA is based on a generative model in which each word of an input document from a corpus is chosen by first selecting a topic that corresponds to that word and then selecting the word from a topic-to-word distribution. Each topic-to-word distribution and document-to-topic distribution is drawn from its respective Dirichlet distribution. The formal definition of the generative algorithm over a corpus is:

  1. For each of the K topics φ_k:
  2.   Choose φ_k ∼ Dir(β)
  3. For each of the D documents d:
  4.   Choose N_d ∼ Poisson(ξ)
  5.   Choose θ_d ∼ Dir(α)
  6.   For each of the N_d words w_{n,d}:
  7.     Choose z_{n,d} ∼ Multinomial(θ_d)
  8.     Choose w_{n,d} ∼ Multinomial(φ_{z_{n,d}})
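
Read as code, the generative story translates into something like the following sketch (illustrative only; K, D, V, ξ, α, β are made-up values):

```python
import numpy as np

rng = np.random.default_rng(2)
K, D, V = 3, 4, 20          # topics, documents, vocabulary size
alpha, beta, xi = 0.5, 0.1, 8

phi = rng.dirichlet(np.full(V, beta), size=K)       # steps 1-2: one word distribution per topic
corpus = []
for d in range(D):                                  # step 3
    N_d = max(1, rng.poisson(xi))                   # step 4
    theta_d = rng.dirichlet(np.full(K, alpha))      # step 5
    doc = []
    for n in range(N_d):                            # step 6
        z = rng.choice(K, p=theta_d)                # step 7: topic of this word
        w = rng.choice(V, p=phi[z])                 # step 8: the word itself
        doc.append(w)
    corpus.append(doc)

print(corpus[0])   # word ids of the first generated document
```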

From the generative algorithm, the resultant Bayes model is shown in Figure 1(a). Bayes' law is used to infer the latent θ distribution, φ distribution, and z:

$$ P(\theta, \phi, z \mid w, \alpha, \beta) = \frac{p(\theta, \phi, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)} $$

Unfortunately the exact computation of this equation is intractable. Hence, it must be approximated with techniques such as expectation-maximization [1], Gibbs sampling, or collapsed Gibbs sampling [10].

III. PROPOSED APPROACH

Source-LDA is an extension of the LDA generative model. In Source-LDA, after a known set of topics is determined, an initial word-to-topic distribution is generated from the corresponding Wikipedia articles. The desideratum is to enhance existing LDA topic modeling by integrating prior knowledge into the topic modeling process. The relevant terms and concepts used in the following discussion are defined below.

Definition 1 (Knowledge source): A knowledge source is a collection of documents that are focused on describing a set of concepts. For example, the knowledge sources used in our experiments are Wikipedia articles that describe the categories we select from the Reuters dataset.

Definition 2 (Source Distribution): The source distribution is a discrete probability distribution over the words of a document describing a topic. The probability mass function is given by

$$ p(w_i) = \frac{n_{w_i}}{\sum_{j=1}^{G} n_{w_j}} $$

where W is the set of all words in the document, G = |W|, and nwi is the number of times word wi appears in the document.

Definition 3 (Source Hyperparameters): For a given document in a knowledge source, the knowledge source hyperparameters are defined by the vector (X_1, X_2, . . . , X_V) where X_i = n_{w_i} + ε and ε is a very small positive number that allows for non-zero probability draws from the Dirichlet distribution. V is the size of the vocabulary of the corpus for which we are topic modeling, and n_{w_i} is the number of times the word w_i from the corpus vocabulary appears in the knowledge source document.
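
Following Definitions 2 and 3, turning a knowledge-source document into Dirichlet hyperparameters over the corpus vocabulary can be sketched like this (my own rendering; the "Wikipedia" text is a placeholder):

```python
import numpy as np
from collections import Counter

def source_hyperparameters(source_text, corpus_vocab, eps=1e-6):
    """X_i = n_{w_i} + eps for every word w_i in the corpus vocabulary (Definition 3)."""
    counts = Counter(source_text.lower().split())
    return np.array([counts[w] + eps for w in corpus_vocab])

corpus_vocab = ["pencil", "ruler", "eraser", "umpire", "baseball"]
wiki_school_supplies = "pencil eraser ruler pencil notebook eraser pencil"   # placeholder article text

X = source_hyperparameters(wiki_school_supplies, corpus_vocab)
print(X)            # e.g. [3.000001 1.000001 2.000001 1e-06 1e-06]
print(X / X.sum())  # the source distribution of Definition 2, restricted to the corpus vocabulary

phi_k = np.random.default_rng(3).dirichlet(X)   # a draw phi_k ~ Dirichlet(X) concentrates around the source distribution
```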

We detail three approaches to capture the intent of Source-LDA. The first approach is a simple enhancement to the LDA model that allows for influencing the topic distributions, but suffers from needing more user intervention. The second approach allows for the mixing of unknown topics, and the third approach combines the previous two approaches. It moves toward a complete solution to topic modeling based off prior knowledge sources.

C. Source-LDA

By using the counts as hyperparameters, the resultant φ distribution will take on the shape of the word distribution derived from the knowledge source. However, this might be at odds with the aim of enhancing existing topic modeling. With the goal to influence the φ distribution, it is entirely plausible to have divergence between the two distributions. In other words, φ may not need to strictly follow the corresponding knowledge source distribution.

1) Variance from the source distribution: To allow for this relaxation, another parameter λ is introduced into the model which is used to allow for a higher deviance from the source distribution. To obtain this variance each source hyperparameter will be raised to a power of λ. Thus as λ approaches 0 each hyperparameter will approach 1 and the subsequent Dirichlet draw will allow all discrete distributions with equal probability. As λ approaches 1 the Dirichlet draw will be tightly conformed to the source distribution.

The addition of λ changes the existing generative model only slightly and allows for a variance for each individual δi, which frees us from an overly restrictive binding to the associated knowledge source distribution. The λ parameter acts as a measure of how much divergence is allowed for a given modeled topic from the knowledge source distribution. Figure 3 shows how the JS Divergence changes with changes to the λ parameter.

[Figure 3: JS divergence as a function of the λ parameter (figure not reproduced)]
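
A small experiment in the spirit of Figure 3 (my own sketch, not the paper's code): raise made-up source hyperparameters to the power λ and measure how far Dirichlet draws wander from the source distribution.

```python
import numpy as np

rng = np.random.default_rng(4)

def js_divergence(p, q, eps=1e-12):
    p, q = p + eps, q + eps
    m = 0.5 * (p + q)
    return 0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m))

# Made-up source hyperparameters (word counts from a knowledge-source article).
X = np.array([120.0, 60.0, 30.0, 15.0, 5.0, 1.0])
source = X / X.sum()

for lam in (0.0, 0.25, 0.5, 0.75, 1.0):
    draws = rng.dirichlet(X ** lam, size=200)
    mean_js = np.mean([js_divergence(d, source) for d in draws])
    print(f"lambda={lam:4.2f}  mean JS divergence to source ≈ {mean_js:.3f}")

# As lambda -> 0 every hyperparameter -> 1 (a flat Dirichlet), so draws stray far from the
# source distribution; as lambda -> 1 the draws conform tightly to the source distribution.
```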

With the introduction of λ as an input parameter, the new topic model has the advantage of allowing variance and also leaves the collapsed Gibbs sampling equation unchanged. However this also requires a uniform variance from the knowledge base distribution for all latent topics. This can be a problem if the corpus was generated with some topics influenced strongly while others less so. To solve this we can introduce λ as a hidden parameter of the model.

2) Approximating λ: In the ideal situation λ will be close to 1 for most knowledge-based latent topics, with the flexibility to deviate as required by the data. For this we assume a Gaussian prior over λ with mean set to µ. The variance then becomes a modeled parameter that conceptually can be thought of as how much variance from the knowledge source distribution we wish to allow in our topic model. In assuming a Gaussian prior for λ, we must integrate λ out of the collapsed Gibbs sampling equations (only the probability of w_i under topic j is shown; the probability of topic j in document d is unchanged and omitted).

[Equation 4: the collapsed Gibbs sampling probability of w_i under topic j with λ integrated out (not transcribed)]

Unfortunately closed form expressions for these integrals are hard to obtain and so they must be approximated numerically during sampling.

Another problem arises in that the change of λ is not on par with the change of the Gaussian distribution, as can be seen in Figure 3. To make the changes of λ more in line with those expected from the Gaussian PDF, we must map each individual λ value in the range 0 to 1 to a value which produces a change in the JS divergence in a linear fashion. We approximate a function g(x) with a linear derivative, shown in Figure 4. The approach taken to approximate g(x) is linear interpolation of a large number of aggregated samples for each point taken in the range 0 to 1. Our collapsed Gibbs sampling equations then become:

[Equation 5: the collapsed Gibbs sampling equation using the mapping g(λ) (not transcribed)]
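
One way to approximate the g(x) mapping numerically is sketched below (my own reading of the linear-interpolation idea, not the authors' implementation): estimate the mean JS divergence D(λ) on a grid of λ values, then invert that curve so that equal steps in x correspond to roughly equal steps in divergence.

```python
import numpy as np

rng = np.random.default_rng(5)

def mean_js(X, lam, n=500, eps=1e-12):
    """Mean JS divergence between Dirichlet(X**lam) draws and the source distribution."""
    source = X / X.sum()
    p = rng.dirichlet(X ** lam, size=n) + eps
    q = source + eps
    m = 0.5 * (p + q)
    js = 0.5 * np.sum(p * np.log(p / m), axis=1) + 0.5 * np.sum(q * np.log(q / m), axis=1)
    return js.mean()

X = np.array([120.0, 60.0, 30.0, 15.0, 5.0, 1.0])    # made-up source hyperparameters

grid = np.linspace(0.0, 1.0, 21)
D = np.array([mean_js(X, lam) for lam in grid])      # D(lambda), decreasing (up to sampling noise)

def g(x):
    """Map x in [0, 1] to the lambda whose divergence sits at fraction x between D(0) and D(1)."""
    target = D[0] + x * (D[-1] - D[0])               # equally spaced divergence levels
    return np.interp(target, D[::-1], grid[::-1])    # assumes the estimated curve is monotone

print(g(0.0), g(0.5), g(1.0))                        # roughly 0, the "halfway" lambda, and 1
```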

3) Superset Topic Reduction: A third problem involves knowing the right mixture of known topics and unknown topics. It is also entirely possible that many known topics may not be used by the generative model. Our desire to leave the model as unsupervised as possible calls for input that is a superset of the actual generative topic selection in order to avoid manual topic selection. In the case of modeling only a specific number of topics over the corpus, the problem then becomes how to choose which knowledge source latent topics to allow in the model vs. how many unlabeled topics to allow.

The goal then is to allow a superset of knowledge source topics as input and then, during inference, to select the best subset of these together with a mixture of unknown topics, where the total number of unlabeled topics is given as input K. The approach given is to use a mixture of K unlabeled topics alongside the labeled knowledge source topics. The total number of topics then becomes T. During the inference we eliminate topics which are not assigned to any documents. At the end of the sampling phase we can then use a clustering algorithm (such as k-means using JS divergence) to further reduce the modeled topics and give a total of K topics. As described further in the experimental section, with the goal of capturing topics that occur frequently in the corpus, topics not appearing in a sufficient number of documents were eliminated.
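
The document-frequency pruning step, for example, might look like this (my own illustration; `z_assignments` and the threshold are assumptions):

```python
import numpy as np

def prune_rare_topics(z_assignments, n_topics, min_doc_fraction=0.75):
    """Keep only topics assigned to at least `min_doc_fraction` of the documents.

    `z_assignments[d]` holds the topic ids sampled for the words of document d.
    """
    n_docs = len(z_assignments)
    doc_freq = np.zeros(n_topics, dtype=int)
    for doc in z_assignments:
        for t in set(doc):
            doc_freq[t] += 1
    return [t for t in range(n_topics) if doc_freq[t] >= min_doc_fraction * n_docs]

# Toy example: 4 documents, 5 candidate topics; topic 4 is never used, topic 3 appears in one document.
z = [[0, 0, 1], [1, 1, 2], [0, 2, 2], [3, 0, 1]]
print(prune_rare_topics(z, n_topics=5))      # -> [0, 1]
```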

5) Input determination: Determining the necessary parameters and inputs to LDA is an established research area [21], but since the proposed model introduces additional input requirements, a brief overview will be given about how to best set the parameters and determine the knowledge source.

a) Parameter selection: To determine the appropriate parameters, techniques utilizing log likelihood have previously been established [10]. Since these approaches generally require held-out data and are a function of the φ, θ, and α variables, the introduction of λ and σ does not change their original equations. For example, the perplexity calculations used for Source-LDA are based off of importance sampling [22], or latent variable estimation via Gibbs sampling [23]. Importance sampling is only a function of φ given by Equation 4, and estimation via Gibbs sampling can be made using Equation 4 together with a second equation (not transcribed here) in which z̃, w̃, and ñ represent the corresponding variables in the test document set.
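
For intuition only, here is the generic perplexity computation from point estimates of φ and θ; this is neither the paper's Equation 4 nor its importance-sampling estimator, just the quantity being measured (in practice the held-out θ would itself have to be estimated, e.g. by Gibbs sampling):

```python
import numpy as np

def perplexity(docs, phi, theta):
    """Held-out perplexity from point estimates: p(w | d) = sum_k theta[d, k] * phi[k, w]."""
    log_lik, n_tokens = 0.0, 0
    for d, doc in enumerate(docs):
        for w in doc:
            log_lik += np.log(theta[d] @ phi[:, w])
            n_tokens += 1
    return np.exp(-log_lik / n_tokens)

# Toy example: 2 topics over a 5-word vocabulary, 2 held-out documents (word ids).
phi = np.array([[0.4, 0.3, 0.25, 0.03, 0.02],
                [0.02, 0.03, 0.05, 0.45, 0.45]])
theta = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
print(round(perplexity([[0, 1, 1, 2], [3, 4, 4]], phi, theta), 2))
```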

b) Knowledge source selection: Source-LDA is designed to be used only with a corpus which has a known superset of topics that comprise a large portion of the tokens. An example of such a case is a corpus consisting of clinical patient notes. Since there are extensive knowledge sources covering essentially all medical topics, Source-LDA can be useful in discovering and labeling these existing topics. In cases where it is not so easy to collect a superset of topics, traditional approaches may be more useful.

LDA & pLSA

Bag-of-Words Model

LDA is a bag-of-words model. A bag-of-words model only considers whether a word occurs, not the order in which words occur. In contrast, an n-gram model does take word order into account.

We generalize pLSA by replacing the fixed document-specific topic distribution with a Dirichlet prior.

The generative process for each word w_j (from a vocabulary of size V) in document d_i is described below.

The pLSA Model

The unigram model has no notion of a topic. When we humans write an article, the article is about some particular topic, though a small fraction of the words may touch on other topics. pLSA therefore assumes the following generative process for a document (a code sketch follows the list):

  1. There are two kinds of dice: doc-topic dice, each with K faces, one topic number per face; and topic-word dice, each with V faces, one word per face.

  2. There are K topic-word dice, numbered from 1 to K.

  3. Before generating a document, first manufacture a doc-topic die specific to that document, then repeat the following process to generate each word of the document:

3.1 Roll the doc-topic die and obtain a topic number z;

3.2 Pick the topic-word die numbered z among the K dice, roll it, and obtain a word.
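
In code, the dice metaphor reads roughly as follows (a sketch with made-up sizes; in pLSA both kinds of dice are fixed parameters, not draws from a prior):

```python
import numpy as np

rng = np.random.default_rng(6)
K, V = 3, 20                                           # number of topics and vocabulary size

# The K topic-word dice (arbitrary fixed PMFs here; in pLSA these are model parameters).
topic_word_dice = rng.dirichlet(np.full(V, 0.1), size=K)

def generate_plsa_document(doc_topic_die, n_words):
    """doc_topic_die is the fixed doc-topic die manufactured for this document (step 3)."""
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=doc_topic_die)             # 3.1 roll the doc-topic die -> topic number z
        w = rng.choice(V, p=topic_word_dice[z])        # 3.2 roll topic-word die number z -> a word
        words.append(w)
    return words

print(generate_plsa_document(np.array([0.5, 0.3, 0.2]), n_words=10))
```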

The difference between pLSA and LDA: first, let's look at how pLSA and LDA each generate a document. In pLSA, a document is generated as follows:

  1. Choose a document d_i with probability p(d_i);
  2. From the topic distribution of the chosen document d_i, choose a latent topic z_k with probability p(z_k | d_i);
  3. From the word distribution of the chosen topic z_k, choose a word w_j with probability p(w_j | z_k).

In LDA, a document is generated as follows:

  1. Choose a document d_i with prior probability p(d_i);
  2. Sample the topic distribution θ_i of document d_i from a Dirichlet distribution with hyperparameter α;
  3. Sample the topic z_{i,j} of the j-th word of document d_i from the multinomial distribution θ_i;
  4. Sample the word distribution φ_{z_{i,j}} corresponding to topic z_{i,j} from a Dirichlet distribution with hyperparameter β;
  5. Finally, sample the word w_{i,j} from the multinomial distribution φ_{z_{i,j}}.

As we can see, on top of pLSA, LDA adds a Dirichlet prior to the topic distribution and another to the word distribution.

LDA Training: following the formulas in the previous subsection, we have two goals:

  1. Estimate the model parameters φ (the topic-word distributions) and θ (the document-topic distributions);
  2. For a newly arriving document, compute its topic distribution.

The training process (sketched in code below):

  1. For every word w in every document of the corpus, randomly assign a topic number z;
  2. Rescan the corpus; for each word w, resample its topic using the Gibbs sampling formula and update the assignment in the corpus;
  3. Repeat step 2 until Gibbs sampling converges;
  4. Count the topic-word co-occurrence frequency matrix of the corpus; this matrix is the LDA model. From it we can compute every p(word | topic), which gives the model parameters φ, i.e. the K topic-word dice. The per-document parameters θ can also be computed during training: once Gibbs sampling has converged, counting the topic frequencies within each document gives every p(topic | doc), and hence each document's θ. Since these θ parameters are tied to the individual training documents and are of no use for understanding new documents, they usually do not need to be kept when the trained LDA model is stored. In practice, we average the results of several iterations after Gibbs sampling has converged to estimate the parameters, which gives a higher-quality model.
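
A compact collapsed Gibbs sampler for plain LDA matching the procedure above (a didactic sketch on a made-up toy corpus, not production code):

```python
import numpy as np

def train_lda_gibbs(docs, K, V, alpha=0.5, beta=0.1, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    nkw = np.zeros((K, V))            # topic-word counts
    ndk = np.zeros((len(docs), K))    # doc-topic counts
    nk = np.zeros(K)                  # words per topic
    z = []                            # step 1: random topic assignments
    for d, doc in enumerate(docs):
        zd = rng.integers(K, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            nkw[t, w] += 1; ndk[d, t] += 1; nk[t] += 1
    for _ in range(n_iter):           # steps 2-3: rescan until (hopefully) converged
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                t = z[d][n]           # remove the current assignment from the counts
                nkw[t, w] -= 1; ndk[d, t] -= 1; nk[t] -= 1
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                t = rng.choice(K, p=p / p.sum())
                z[d][n] = t           # resample and restore counts
                nkw[t, w] += 1; ndk[d, t] += 1; nk[t] += 1
    # step 4: point estimates from a single post-convergence sample
    # (the text above suggests averaging several samples for a better estimate)
    phi = (nkw + beta) / (nk[:, None] + V * beta)                            # p(word | topic)
    theta = (ndk + alpha) / (ndk.sum(axis=1, keepdims=True) + K * alpha)     # p(topic | doc)
    return phi, theta

# Toy corpus over a 5-word vocabulary: 0=pencil 1=ruler 2=eraser 3=umpire 4=baseball
docs = [[0, 0, 1, 2], [3, 4, 4, 3], [0, 2, 1], [4, 3, 4]]
phi, theta = train_lda_gibbs(docs, K=2, V=5)
print(np.round(phi, 2)); print(np.round(theta, 2))
```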

LDA Inference: once we have a trained LDA model, for a newly arriving document doc we treat the topic-word part of the Gibbs sampling formula as stable and unchanging, provided by the model learned from the training corpus, so during sampling we only need to estimate the topic distribution of this document. The concrete algorithm is as follows:

  1. For every word w in the current document, randomly initialize a topic number z;
  2. Using the Gibbs sampling formula, resample the topic of each word w;
  3. Repeat the above process until Gibbs sampling converges;
  4. Count the topic distribution of the document; that distribution is the document's topic distribution θ.
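
And the corresponding inference step with φ held fixed (again a sketch; here `phi` is a toy stand-in for a trained model such as the sampler above would produce):

```python
import numpy as np

def infer_theta(new_doc, phi, alpha=0.5, n_iter=100, seed=0):
    """Estimate p(topic | new_doc) with phi held fixed (trained model)."""
    rng = np.random.default_rng(seed)
    K = phi.shape[0]
    z = rng.integers(K, size=len(new_doc))        # 1. random initial topics
    nk = np.bincount(z, minlength=K).astype(float)
    for _ in range(n_iter):                       # 2-3. resample until converged
        for n, w in enumerate(new_doc):
            nk[z[n]] -= 1
            p = (nk + alpha) * phi[:, w]          # phi part is fixed, only theta is estimated
            z[n] = rng.choice(K, p=p / p.sum())
            nk[z[n]] += 1
    return (nk + alpha) / (len(new_doc) + K * alpha)   # 4. topic distribution of the document

# Example with a toy "trained" phi over the same 5-word vocabulary.
phi = np.array([[0.4, 0.3, 0.25, 0.03, 0.02],
                [0.02, 0.03, 0.05, 0.45, 0.45]])
print(np.round(infer_theta([0, 1, 2, 0], phi), 2))     # leans toward topic 0
```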

Tips: interviewers who understand LDA often ask candidates how the number of topics in LDA should be determined.

In LDA there is no fixed optimal number of topics. The number of topics must be set before training; based on the results of training, the practitioner then tunes it by hand, optimizing the number of topics and, in turn, the text classification results.

Reference - This post

[1] Wood, Justin, et al. “Source-LDA: Enhancing probabilistic topic models using prior knowledge sources.” 2017 IEEE 33rd International Conference on Data Engineering (ICDE). IEEE, 2017.

[2] LDA主题模型简介 (Introduction to the LDA topic model). https://www.jianshu.com/p/24b1bca1629f

[3] 一文详解LDA主题模型 (A detailed explanation of the LDA topic model). https://zhuanlan.zhihu.com/p/31470216
