Reuters Newswire Analysis

25 Jun 2021 Note

This is an experiment reproducing the evaluation results from the Source-LDA paper.

From Paper

To show the type of topics discovered by Source-LDA we run the model on an existing dataset. This collection contains documents from the Reuters newswire from 1987. The dataset contains 21,578 articles spanning a large set of categories. One important feature of the dataset is a set of given categories that we can use for our topic labeling. These include broad categories such as shipping, interest rates, and trade, as well as more refined categories such as rubber, zinc, and coffee. We chose to apply our topic labeling method to this dataset because the Reuters dataset is widely used for information retrieval and text categorization applications. Due to its widespread use, it considerably aids in comparing our results to other studies. Additionally, because it contains distinct categories that we can use as our known set of topics, we can easily demonstrate the viability of our model.

1) Experimental Setup: Source-LDA, LDA, and CTM were run against the Reuters-21578 newswire collection. Since EDA neither discovers new topics nor updates the word distributions of the input topics, it is not included in this experiment. From the original 21,578-document corpus, a subset of 2,000 documents was selected. The Source-LDA and CTM supplementary distributions were generated by first obtaining a list of topics from the Reuters-21578 dataset.
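
For reference, a minimal sketch of this setup using NLTK's copy of the collection. Note that NLTK ships the ModApte split (10,788 documents) of Reuters-21578 rather than all 21,578 articles, and the random seed and sampling method below are assumptions, since the paper does not say how the 2,000-document subset was chosen:

```python
import random

import nltk

nltk.download("reuters")           # fetches the corpus on first run
from nltk.corpus import reuters

random.seed(0)                     # assumed seed; the paper gives none

# File IDs look like "training/9865" or "test/14826".
all_docs = reuters.fileids()
subset = random.sample(all_docs, 2000)

# The given category labels (coffee, zinc, ship, ...) provide the topic
# list used to build the Source-LDA / CTM knowledge source.
topics = reuters.categories()
print(len(topics), "candidate topic labels")
```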

Next, for each topic, the corresponding Wikipedia article was crawled and its word occurrences were counted, forming the topic's word distribution. Querying Wikipedia yielded 80 distinct topics as the superset for the knowledge source. Of these 80 crawled topics, only 49 appear in the 2,000-document corpus. This represents the ideal conditions for applying Source-LDA: a corpus in which a significant portion of the tokens is generated from a subset of a larger, relatively easy-to-obtain topic set.
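
A sketch of the crawl-and-count step for one topic, assuming the plain-text extract endpoint of the public MediaWiki API (the paper does not describe its crawler, and `wikipedia_word_counts` is a hypothetical helper name):

```python
import re
from collections import Counter

import requests

def wikipedia_word_counts(title):
    """Word-frequency distribution of one article's plain text.
    (Hypothetical helper; the paper does not describe its crawler.)"""
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "prop": "extracts",
            "explaintext": 1,      # plain text, no HTML markup
            "format": "json",
            "titles": title,
        },
        timeout=30,
    )
    pages = resp.json()["query"]["pages"]
    text = next(iter(pages.values())).get("extract", "")
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

# e.g. the word distribution for the "Coffee" topic
print(wikipedia_word_counts("Coffee").most_common(10))
```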

For all models, a symmetric Dirichlet parameter of 50/T (where T is the number of topics) was used for α, and 200/V (where V is the size of the vocabulary) for β. For Source-LDA, µ and σ were set by experimentally searching for a local minimum in perplexity, which was reached at µ = 0.7 and σ = 0.3. The bag of words used in the CTM was taken from the top 10,000 words by frequency for each topic. The models showed good convergence after 1,000 iterations. After sampling was complete for LDA, the resulting topic-to-word distribution was mapped to labels using an information retrieval (IR) approach: the cosine similarity between documents represented as term frequency-inverse document frequency (TF-IDF) vectors and TF-IDF-weighted query vectors formed from the top 10 words of each topic.
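
The prior settings and the IR labeling step can be sketched with gensim and scikit-learn. Two assumptions here: gensim's variational LdaModel stands in for the sampler used in the paper, and the TF-IDF document vectors are taken to be the Wikipedia source articles, so each discovered topic is labeled with its nearest known topic. `label_lda_topics` is a hypothetical helper, and `num_topics=49` is an assumed T matching the 49 source topics present in the subset:

```python
from gensim import corpora
from gensim.models import LdaModel
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def label_lda_topics(docs, source_texts, source_labels, num_topics=49):
    """docs: tokenized corpus documents (lists of words);
    source_texts/source_labels: one Wikipedia article and name per topic."""
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]
    T, V = num_topics, len(dictionary)

    # Symmetric Dirichlet priors from the paper: alpha = 50/T, beta = 200/V.
    lda = LdaModel(corpus, id2word=dictionary, num_topics=T,
                   alpha=50.0 / T, eta=200.0 / V, iterations=1000)

    # TF-IDF space fit on the knowledge-source articles.
    vectorizer = TfidfVectorizer()
    source_vecs = vectorizer.fit_transform(source_texts)

    labels = []
    for t in range(T):
        # Query vector from the topic's top 10 words, weighted by TF-IDF.
        top_words = [w for w, _ in lda.show_topic(t, topn=10)]
        query_vec = vectorizer.transform([" ".join(top_words)])
        sims = cosine_similarity(query_vec, source_vecs)[0]
        labels.append(source_labels[sims.argmax()])
    return labels
```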

Comments