There are many NLP libraries, such as Natural Language Toolkit (NLTK), TextBlob, CoreNLP, Gensim, and spaCy. There are also many ways to summarize text; I will show the simplest one, a three-sentence summary built without any NLP library.
We only need two standard-library modules: re for splitting the text and heapq for picking the top-scoring sentences.
import re
import heapq
We will summarize the following 600-word block of text:
“ Up to the 1980s, most natural language processing systems were based on complex sets of hand-written rules. Starting in the late 1980s, however, there was a revolution in natural language processing with the introduction of machine learning algorithms for language processing. This was due to both the steady increase in computational power (see Moore’s law) and the gradual lessening of the dominance of Chomskyan theories of linguistics (e.g. transformational grammar), whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing.[3] Some of the earliest-used machine learning algorithms, such as decision trees, produced systems of hard if-then rules similar to existing hand-written rules. However, part-of-speech tagging introduced the use of hidden Markov models to natural language processing, and increasingly, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to the features making up the input data. The cache language models upon which many speech recognition systems now rely are examples of such statistical models. Such models are generally more robust when given unfamiliar input, especially input that contains errors (as is very common for real-world data), and produce more reliable results when integrated into a larger system comprising multiple subtasks.Many of the notable early successes occurred in the field of machine translation, due especially to work at IBM Research, where successively more complicated statistical models were developed. These systems were able to take advantage of existing multilingual textual corpora that had been produced by the Parliament of Canada and the European Union as a result of laws calling for the translation of all governmental proceedings into all official languages of the corresponding systems of government. 
However, most other systems depended on corpora specifically developed for the tasks implemented by these systems, which was (and often continues to be) a major limitation in the success of these systems. As a result, a great deal of research has gone into methods of more effectively learning from limited amounts of data. Recent research has increasingly focused on unsupervised and semi-supervised learning algorithms. Such algorithms can learn from data that has not been hand-annotated with the desired answers or using a combination of annotated and non-annotated data. Generally, this task is much more difficult than supervised learning, and typically produces less accurate results for a given amount of input data. However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the World Wide Web), which can often make up for the inferior results if the algorithm used has a low enough time complexity to be practical.In the 2010s, representation learning and deep neural network-style machine learning methods became widespread in natural language processing, due in part to a flurry of results showing that such techniques[4][5] can achieve state-of-the-art results in many natural language tasks, for example in language modeling,[6] parsing,[7][8] and many others. Popular techniques include the use of word embeddings to capture semantic properties of words, and an increase in end-to-end learning of a higher-level task (e.g., question answering) instead of relying on a pipeline of separate intermediate tasks (e.g., part-of-speech tagging and dependency parsing). In some areas, this shift has entailed substantial changes in how NLP systems are designed, such that deep neural network-based approaches may be viewed as a new paradigm distinct from statistical natural language processing. 
For instance, the term neural machine translation (NMT) emphasizes the fact that deep learning-based approaches to machine translation directly learn sequence-to-sequence transformations, obviating the need for intermediate steps such as word alignment and language modeling that was used in statistical machine translation (SMT).”
Load the text into a string:
text = "Text you want to summarize"  # paste the block of text above here
Now split the text into sentences using the following pattern:
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)
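To see what this pattern does, here is a small sketch on a made-up string (the sample text is an assumption, not the article's data). It shows that the terminating punctuation is consumed by the split, that a trailing empty string appears when the text ends with a terminator, and that abbreviations such as "e.g." are cut apart as well.

```python
import re

# Same pattern as in the article: optional spaces, one sentence terminator,
# optional closing quotes or brackets, optional spaces.
SENTENCE_BREAK = r' *[\.\?!][\'"\)\]]* *'

sample = 'Rules came first. Then came statistics, e.g. HMMs! What next?'
parts = re.split(SENTENCE_BREAK, sample)
print(parts)
# ['Rules came first', 'Then came statistics, e', 'g', 'HMMs', 'What next', '']

# Dropping empty strings is a cheap cleanup step.
sentences = [part for part in parts if part]
print(len(sentences))  # 5
```

For a short summary this rough splitting is usually good enough, but it is worth knowing that the pieces are not always real sentences.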
For preprocessing, convert all letters to lowercase and split the text into words (word_tokenize):
clean_text = text.lower()
word_tokenize = clean_text.split()
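One thing to keep in mind: str.split() only cuts on whitespace, so punctuation stays attached and "processing." and "processing" become different tokens. Below is a minimal sketch of one way to handle that, assuming you want to trim leading and trailing punctuation. This step is not part of the original recipe, and the helper name tokenize is hypothetical.

```python
import string


def tokenize(text):
    """Lowercase the text, split on whitespace, and trim punctuation from each token."""
    tokens = []
    for raw in text.lower().split():
        token = raw.strip(string.punctuation)
        if token:  # skip tokens that consisted only of punctuation
            tokens.append(token)
    return tokens


sample = "Language processing, again: processing."
plain = sample.lower().split()
print(plain)             # ['language', 'processing,', 'again:', 'processing.']
print(tokenize(sample))  # ['language', 'processing', 'again', 'processing']
```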
Next, exclude the stop words. Stop-word lists for many languages are available from Countwordsfree: https://countwordsfree.com/stopwords
stop_words = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"]
Now that all the English stop words are in a list (you can load any other language the same way), count how often each remaining word occurs and store the counts in a dictionary:
word2count = {}
for word in word_tokenize:
    if word not in stop_words:
        if word not in word2count:
            word2count[word] = 1   # first occurrence
        else:
            word2count[word] += 1  # repeated occurrence
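The same counting can be written more compactly with collections.Counter from the standard library. This is a sketch with a shortened stop-word set and a made-up token list (both are assumptions for illustration); keeping the stop words in a set also makes each membership test constant-time on average.

```python
from collections import Counter

# Illustrative inputs, not the article's data.
stop_words = {"the", "of", "a", "in", "and"}
word_tokenize = "the history of language models and the rise of language processing".split()

# Count every token that is not a stop word.
word2count = Counter(word for word in word_tokenize if word not in stop_words)

print(word2count["language"])  # 2
print("the" in word2count)     # False
```

A Counter behaves like a dictionary, so the rest of the steps work with it unchanged.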
Using the counts in word2count, score the sentences:
sent2score = {}
for sentence in sentences:
    # Only sentences of 10 to 27 words are candidates for the summary.
    if 9 < len(sentence.split(' ')) < 28:
        for word in sentence.split():
            # The lookup is case-sensitive; word2count holds lowercased tokens.
            if word in word2count:
                if sentence not in sent2score:
                    sent2score[sentence] = word2count[word]
                else:
                    sent2score[sentence] += word2count[word]
For the summary, a source sentence may contain at most 27 words and at least 10.
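The scoring step can also be written as a small function. The sketch below is a variation, not the article's exact code: it lowercases each sentence word before the lookup (the word counts were built from lowercased text), and the name score_sentences and its min_words and max_words parameters are hypothetical names for the limits of 10 and 27 words.

```python
def score_sentences(sentences, word2count, min_words=10, max_words=27):
    """Sum the counts of the known words in each sentence of acceptable length."""
    sent2score = {}
    for sentence in sentences:
        words = sentence.split()
        if not (min_words <= len(words) <= max_words):
            continue  # too short or too long for the summary
        score = sum(word2count.get(word.lower(), 0) for word in words)
        if score > 0:  # keep only sentences that contain a counted word
            sent2score[sentence] = score
    return sent2score


# Tiny illustration with made-up counts and relaxed length limits.
counts = {"language": 3, "models": 2}
demo = ["Language models are everywhere", "Hi"]
scores = score_sentences(demo, counts, min_words=3, max_words=10)
print(scores)  # {'Language models are everywhere': 5}
```

Exposing the limits as parameters makes it easy to experiment with shorter or longer candidate sentences.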
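Then build a weighted histogram by dividing every count by the largest one. Note that, as written, this step runs after the sentences have been scored, so it does not change the ranking:
max_count = max(word2count.values())  # compute the maximum once, before modifying values
for key in word2count:
    word2count[key] = word2count[key] / max_count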
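As a quick worked example of the normalization, with made-up counts: the most frequent word ends up with weight 1.0 and every other word gets a fraction of it. The maximum is taken once up front, because recomputing it inside the loop would use a different divisor as soon as the largest entry has been scaled down to 1.0.

```python
# Made-up counts for illustration.
word2count = {"language": 4, "models": 2, "translation": 1}

max_count = max(word2count.values())  # 4
weights = {word: count / max_count for word, count in word2count.items()}
print(weights)  # {'language': 1.0, 'models': 0.5, 'translation': 0.25}
```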
Now simply pick the three highest-scoring sentences and look at the result:
best_three_sentences = heapq.nlargest(3, sent2score, key=sent2score.get)
print(*best_three_sentences)
“Starting in the late 1980s, however, there was a revolution in natural language processing with the introduction of machine learning algorithms for language processing The cache language models upon which many speech recognition systems now rely are examples of such statistical models [3] Some of the earliest-used machine learning algorithms, such as decision trees, produced systems of hard if-then rules similar to existing hand-written rules.”
The 600-word text has been reduced to 65 words.
Full code
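For readers who have not used heapq.nlargest with a dictionary: iterating over a dict yields its keys, and key=sent2score.get ranks those keys by their values. Here is a small sketch with made-up scores. The re-sorting into document order at the end is an optional extra, not something the original recipe does.

```python
import heapq

# Made-up scores; dict insertion order stands in for the order in the text.
sent2score = {
    "first sentence": 21,
    "second sentence": 30,
    "third sentence": 7,
}

top_two = heapq.nlargest(2, sent2score, key=sent2score.get)
print(top_two)  # ['second sentence', 'first sentence']

# The result is ordered by score. To read the summary in document order,
# one option is to re-sort the winners by their position in the text.
order = list(sent2score)  # dicts preserve insertion order
in_text_order = sorted(top_two, key=order.index)
print(in_text_order)  # ['first sentence', 'second sentence']
```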
import re
import heapq

text = "Text you want to summarize"
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)
clean_text = text.lower()
word_tokenize = clean_text.split()
# English stop words
stop_words = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"]
# count every word that is not a stop word
word2count = {}
for word in word_tokenize:
    if word not in stop_words:
        if word not in word2count:
            word2count[word] = 1
        else:
            word2count[word] += 1
# score sentences of 10 to 27 words by the counts of the words they contain
sent2score = {}
for sentence in sentences:
    if 9 < len(sentence.split(' ')) < 28:
        for word in sentence.split():
            if word in word2count:
                if sentence not in sent2score:
                    sent2score[sentence] = word2count[word]
                else:
                    sent2score[sentence] += word2count[word]
# weighted histogram
max_count = max(word2count.values())
for key in word2count:
    word2count[key] = word2count[key] / max_count
best_three_sentences = heapq.nlargest(3, sent2score, key=sent2score.get)
print(*best_three_sentences)
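If you want to reuse the recipe, it can be wrapped in one function. This is a hedged sketch rather than the article's code: the name summarize, its parameters, and the shortened STOP_WORDS set are all assumptions, and sentence words are lowercased before lookup so that capitalized words also count.

```python
import heapq
import re
from collections import Counter

# Shortened stop-word set for illustration; use the full list in practice.
STOP_WORDS = frozenset({"a", "an", "the", "of", "in", "and", "is", "are", "to", "for"})


def summarize(text, n_sentences=3, min_words=10, max_words=27, stop_words=STOP_WORDS):
    """Return the n highest-scoring sentences of text, ranked by word frequency."""
    # Split into rough sentences and drop empty pieces.
    sentences = [s for s in re.split(r' *[\.\?!][\'"\)\]]* *', text) if s]
    # Frequency of every lowercased token that is not a stop word.
    word2count = Counter(w for w in text.lower().split() if w not in stop_words)
    sent2score = {}
    for sentence in sentences:
        words = sentence.split()
        if min_words <= len(words) <= max_words:
            # A Counter returns 0 for unknown words, so no membership test is needed.
            sent2score[sentence] = sum(word2count[w.lower()] for w in words)
    return heapq.nlargest(n_sentences, sent2score, key=sent2score.get)


demo = "Cats sleep. Cats purr and cats play. Dogs bark."
result = summarize(demo, n_sentences=1, min_words=2, max_words=10)
print(result)  # ['Cats purr and cats play']
```

The length limits are relaxed in the demo call only because the sample sentences are very short; the defaults match the 10 to 27 word window used above.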
There are many methods for summarizing text. This one is the simplest.
Thanks for reading!
Translated from the article by Sajid Hasan Sifat: Create Text Summary Using Python Without NLP Libraries