1 Unsupervised methods
1.1 Statistical methods (TF, TF-IDF, YAKE)
1.2 Graph-based methods (TextRank, SingleRank, TopicRank, PositionRank)
In information retrieval, tf–idf (also TF*IDF, TFIDF, TF–IDF, or Tf–idf), short for term frequency–inverse document frequency, is a measure of importance of a word to a document in a collection or corpus, adjusted for the fact that some words appear more frequently in general. It was often used as a weighting factor in searches of information retrieval, text mining, and user modeling. A survey conducted in 2015 showed that 83% of text-based recommender systems in digital libraries used tf–idf.
Variations of the tf–idf weighting scheme were often used by search engines as a central tool in scoring and ranking a document’s relevance given a user query.
One of the simplest ranking functions is computed by summing the tf–idf for each query term; many more sophisticated ranking functions are variants of this simple model.
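The simple ranking function described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `tf_idf_scores`, the whitespace tokenizer, the relative-frequency tf, and the unsmoothed `log(N / df)` idf are all assumptions made here; real systems use many weighting variants.

```python
import math
from collections import Counter

def tf_idf_scores(query, docs):
    """Score each document by summing tf-idf over the query terms."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if df[term] == 0 or not tokens:
                continue  # term absent from the collection, or empty document
            tf = counts[term] / len(tokens)    # relative term frequency
            idf = math.log(n_docs / df[term])  # unsmoothed inverse document frequency
            score += tf * idf
        scores.append(score)
    return scores

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]
scores = tf_idf_scores("cat mat", docs)
# Document indices ordered from most to least relevant.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

The first document matches both query terms, and "mat" is rarer than "cat" across the collection, so it contributes more to the score. The third document contains neither term exactly (no stemming is applied in this sketch) and scores zero.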
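An n-gram is a contiguous sequence of N items, such as words or characters, taken from text or speech. The basic idea is to slide a window of size N over the text, byte by byte, producing a sequence of byte fragments that each have length N.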
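The following function generates word-level n-grams by creating tuples of consecutive words:
def generate_ngrams(text, n):
    # Split on whitespace to get word tokens.
    tokens = text.split()
    # Slide a window of size n over the tokens; the result is empty when len(tokens) < n.
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return ngrams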
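The same sliding-window idea also applies below the word level, as the earlier description of byte fragments suggests. The sketch below is an illustration only; the names `char_ngrams` and `byte_ngrams` and the choice of UTF-8 as the default encoding are assumptions introduced here.

```python
def char_ngrams(text, n):
    """Slide a window of size n over the characters of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def byte_ngrams(text, n, encoding="utf-8"):
    """Slide a window of size n over the encoded bytes of text."""
    # Assumption: UTF-8 encoding; a multi-byte character spans several windows.
    data = text.encode(encoding)
    return [data[i:i + n] for i in range(len(data) - n + 1)]

bigrams = char_ngrams("ngram", 2)   # character-level windows
raw = byte_ngrams("中", 2)           # one CJK character is three bytes in UTF-8
```

Character and byte windows differ for non-ASCII text: a single Chinese character yields one character unigram but two overlapping byte bigrams under UTF-8.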
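N-gram language models rely on the Markov chain assumption: the current word depends only on a limited number of preceding words. There is therefore no need to trace back to the very first word, which greatly shortens the expression for a sentence's probability.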
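Written out, the shortening looks like the following. The notation is introduced here for illustration and is not from the source: a sentence is assumed to be the words w_1, ..., w_m, and count(.) denotes the number of occurrences in a training corpus. The first line is the exact chain rule, the second applies the Markov assumption with a window of n - 1 preceding words, and the third is the usual maximum-likelihood estimate for the bigram case (n = 2).

```latex
\begin{align}
P(w_1, \dots, w_m) &= \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1}) \\
                   &\approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1}) \\
P(w_i \mid w_{i-1}) &\approx \frac{\operatorname{count}(w_{i-1}\, w_i)}{\operatorname{count}(w_{i-1})}
\end{align}
```

Each factor now conditions on at most n - 1 words rather than the whole history, so the probabilities can be estimated from counts of short word sequences.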
N-grams in NLP are used for:
Capturing Context and Semantics: N-grams help us understand how words work together in a sentence. By analyzing small word combinations, they provide insight into the meaning and flow of language, making text interpretation more accurate.
Improving Language Models: In tools like translation systems or voice assistants, N-grams help create models that can better guess what comes next in a sentence, leading to more natural and accurate responses.
Enhancing Text Prediction: N-grams are widely used in predictive typing. By analyzing the words you've already typed, they help suggest what you're likely to type next, making writing faster and more intuitive.
Information Retrieval: When searching for information, N-grams help find and rank documents by recognizing important word patterns. This makes search engines more effective at delivering relevant results.
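The text-prediction use case in the list above can be illustrated with a small bigram model. This is a toy sketch under stated assumptions: the names `build_bigram_model` and `suggest_next`, the three-sentence corpus, and the whitespace tokenizer are all hypothetical, and production predictive keyboards use far larger models with smoothing.

```python
from collections import Counter, defaultdict

def build_bigram_model(sentences):
    """Map each word to a Counter of the words observed right after it."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        # Pair every token with its immediate successor.
        for prev, nxt in zip(tokens, tokens[1:]):
            model[prev][nxt] += 1
    return model

def suggest_next(model, word, k=3):
    """Return up to k of the most frequent followers of word."""
    followers = model.get(word.lower())  # .get avoids creating an empty entry
    if not followers:
        return []
    return [w for w, _ in followers.most_common(k)]

corpus = [
    "i like green tea",
    "i like black coffee",
    "i drink green tea",
]
model = build_bigram_model(corpus)
suggestions = suggest_next(model, "i")
```

Here "like" follows "i" twice and "drink" once, so "like" is suggested first. A word never seen in the corpus returns an empty list, which is where a real system would fall back to smoothing or a shorter context.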
[2] N-gram in NLP - https://www.geeksforgeeks.org/nlp/n-gram-in-nlp/ https://zhuanlan.zhihu.com/p/32829048