I'm trying to write a program using sumy, an automatic summarization library for Python.
I want to specify a maximum number of characters so that the summary stays under that length, but I don't know how to do it. I would appreciate your help.
The program is almost identical to the one on the site I referred to.
Corresponding source code

    import re

    from janome.analyzer import Analyzer
    from janome.charfilter import UnicodeNormalizeCharFilter, RegexReplaceCharFilter
    from janome.tokenizer import Tokenizer as JanomeTokenizer  # aliased: sumy also has a Tokenizer
    from janome.tokenfilter import POSKeepFilter, ExtractAttributeFilter
    from sumy.parsers.plaintext import PlaintextParser
    from sumy.nlp.tokenizers import Tokenizer
    from sumy.summarizers.lex_rank import LexRankSummarizer


    def fn_start_document_summarize():
        file = r"file path"
        with open(file, encoding='utf-8') as f:
            contents = f.read()  # read the whole file as one string

        # Split into sentences: each match runs up to and including a "。"
        text = re.findall("[^。]+。?", contents.replace('\n', ''))
        print(text)

        # Morphological analysis: normalize the text, strip brackets and
        # punctuation, keep only nouns, adjectives, adverbs and verbs
        # (janome expects the Japanese POS names), and take each base form
        char_filters = [UnicodeNormalizeCharFilter(),
                        RegexReplaceCharFilter(r'[(\)「」、。]', '')]
        token_filters = [POSKeepFilter(['名詞', '形容詞', '副詞', '動詞']),
                         ExtractAttributeFilter('base_form')]
        analyzer = Analyzer(
            char_filters=char_filters,
            # janome's first argument is a user dictionary, not a language name
            tokenizer=JanomeTokenizer(),
            token_filters=token_filters
        )

        # For each sentence, join the extracted base forms with spaces
        corpus = [' '.join(analyzer.analyze(sentence)) + '。' for sentence in text]
        print("corpus =", corpus)
        print("corpus_len =", len(corpus))

        # Document summarization: re-tokenize the joined base-form string
        parser = PlaintextParser.from_string(''.join(corpus), Tokenizer('japanese'))
        summarizer = LexRankSummarizer()
        summarizer.stop_words = [' ']

        # The key points of a document are said to be 20% to 30% of it,
        # so sentences_count is set to 30% of the sentences
        sentences_count = int(len(corpus) / 10 * 3)
        summary = summarizer(document=parser.document, sentences_count=sentences_count)

        print("contents_len =", len(contents))
        print(sentences_count)
        print('Document summary result')
        for sentence in summary:
            # Map each summarized (base-form) sentence back to its original text
            print(text[corpus.index(str(sentence))])


    if __name__ == '__main__':
        fn_start_document_summarize()

Supplement (FW/tool version, etc.)
I am using PyCharm.
Answer # 1
I read the sumy source in its GitHub repository.
Neither LexRankSummarizer, which the code you posted (written by someone else) uses, nor any of the other summarizers has a function to "specify the maximum number of characters".
All you can do is specify the number of lines (number of sentences) in the summary with sentences_count.
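Since sentences_count is the only control available, one possible workaround is to trim the output yourself after summarizing. The following is only a minimal sketch and not part of sumy: fit_to_max_chars is a hypothetical helper, the sample sentences are made up, and it assumes you have already collected the summary sentences as plain strings.

```python
def fit_to_max_chars(sentences, max_chars):
    """Keep sentences, in order, while the running total stays within max_chars."""
    kept = []
    total = 0
    for sentence in sentences:
        length = len(sentence)
        if total + length > max_chars:
            continue  # this sentence would exceed the budget, so skip it
        kept.append(sentence)
        total += length
    return kept


# Made-up stand-ins for the strings printed in the summary loop of the question's code.
candidates = ["今日は晴れです。", "明日は雨が降るでしょう。", "傘を持ってください。"]
short_summary = fit_to_max_chars(candidates, max_chars=20)
print("".join(short_summary))
```

In the code from the question, you could collect text[corpus.index(str(sentence))] for each sentence into a list instead of printing it, then pass that list to the helper. Note that this skips any sentence that does not fit and keeps checking the later ones; if you would rather stop at the first sentence that overflows, replace continue with break.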