
I'm trying to create a program using sumy, an automatic summarization library for Python.
I want to specify a maximum number of characters so that the summary stays under that length, but I don't know how to do it. I would appreciate your help.

The program is based on (and almost identical to) the code from this site I referred to.

Corresponding source code
from janome.analyzer import Analyzer
from janome.charfilter import UnicodeNormalizeCharFilter, RegexReplaceCharFilter
from janome.tokenizer import Tokenizer as JanomeTokenizer  # renamed so it doesn't clash with sumy's Tokenizer
from janome.tokenfilter import POSKeepFilter, ExtractAttributeFilter
import re
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

def fn_start_document_summarize():
    file = r"file path"
    with open(file, encoding='utf-8') as f:
        contents = f.readlines()  # read the text in the file
    contents = ''.join(contents)  # join the lines into a single string

    # Split the text into sentences: after stripping newlines, each element
    # runs up to (and includes) the next "。".
    # re.findall(pattern, target)
    text = re.findall("[^。]+。?", contents.replace('\n', ''))
    print(text)

    # Morphological analysis (split into words, assign parts of speech)
    tokenizer = JanomeTokenizer('japanese')
    char_filters = [UnicodeNormalizeCharFilter(),
                    RegexReplaceCharFilter(r'[(\)「」、。]', '')]
    # POSKeepFilter keeps only the listed parts of speech (janome's POS tags
    # are Japanese: 名詞 = noun, 形容詞 = adjective, 副詞 = adverb, 動詞 = verb);
    # ExtractAttributeFilter then extracts each kept token's base form.
    token_filters = [POSKeepFilter(['名詞', '形容詞', '副詞', '動詞']),
                     ExtractAttributeFilter('base_form')]
    analyzer = Analyzer(
        char_filters=char_filters,
        tokenizer=tokenizer,
        token_filters=token_filters
    )

    # For each sentence in text: join the extracted base forms with spaces.
    corpus = [' '.join(analyzer.analyze(sentence)) + u'。' for sentence in text]
    print("corpus =", corpus)
    print("corpus_len =", len(corpus))

    # Run the document summarization.
    # PlaintextParser reads a document from a string (or file) and splits the
    # space-joined base forms back into words.
    parser = PlaintextParser.from_string(''.join(corpus), Tokenizer('japanese'))

    # Extract about 30% of the original document with LexRank.
    summarizer = LexRankSummarizer()
    summarizer.stop_words = [' ']
    # The important part of a document is said to be 20-30%, so
    # sentences_count is set to 30% here.
    summary = summarizer(document=parser.document,
                         sentences_count=int(len(corpus) / 10 * 3))
    print("contents_len =", len(contents))
    print(int(len(corpus) / 10 * 3))
    print(u'Document summary result')
    for sentence in summary:
        # Look up the original sentence for each summary sentence.
        print(text[corpus.index(sentence.__str__())])

if __name__ == '__main__':
    fn_start_document_summarize()
Supplementary information (framework/tool versions, etc.)

I am using pycharm.
Python 3.8.5
OS: windows10

  • Answer #1

    Read the sumy source in the GitHub repository.

    Neither the LexRankSummarizer used in the code you posted (written by someone else) nor any of the other summarizers has a function to "specify a maximum number of characters". The sentences_count parameter only specifies the number of sentences (lines) in the summary.
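
    Since sentences_count is the only length control sumy exposes, a practical workaround is to request more sentences than you need and then trim the result to a character budget yourself. Below is a minimal sketch of such a trimming step; the `truncate_summary` helper and the 200-character limit are my own illustration, not part of sumy:

    ```python
    # Sketch: cap a summary by total characters instead of sentence count.
    # sumy only supports sentences_count, so we over-request sentences and
    # keep the longest prefix that still fits within the character budget.

    def truncate_summary(sentences, max_chars):
        """Return the longest prefix of `sentences` whose combined length
        does not exceed `max_chars` characters."""
        result = []
        total = 0
        for s in sentences:
            if total + len(s) > max_chars:
                break  # adding this sentence would exceed the budget
            result.append(s)
            total += len(s)
        return result
    ```

    With the code in your question, that would look like requesting every sentence, `summary = summarizer(document=parser.document, sentences_count=len(corpus))`, and then trimming with `capped = truncate_summary([str(s) for s in summary], 200)`. Note this keeps whole sentences, so the result can fall short of the limit but never exceeds it.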
