
Currently, I extract only the text part of the article data stored in multiple JSON files and count, for every word that occurs, how many articles it appears in.

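import fileinput
import json
from collections import Counter

import numpy as np

# t is the tokenizer object created elsewhere in the program
articles = []
l = []           # unique words collected on each pass of the inner loop
wordset = set()  # set of all words that occur
for line in fileinput.input(JSON_FILE):
    wordcounter = Counter()
    json_obj = json.loads(line)
    json_obj1 = json_obj['text']
    for s in json_obj1:
        tokens = t.tokenize(s)
        base_forms = [tk.base_form for tk in tokens]
        wordset.update(base_forms)
        wordcounter.update(base_forms)
        unique_arr = np.unique(base_forms)
        l.append(unique_arr)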

This is part of the program. The first for loop reads one article at a time from JSON_FILE and takes out only its 'text' field.

The inner for loop then splits each article into words. The words of all articles are added to wordset, and the words of a single article are appended to the list l.

wordset = Counter(wordset)
for W in l:
    wordset.update(W)
wordset_sorted = sorted(wordset.items(), key=lambda x: x[1])
print(wordset)

Finally, wordset is compared against l to count the number of articles in which each word occurs.

However, in this part of the program above,

for s in json_obj1:

the data assigned to s is not a whole article but one of the pieces separated by ','. Part of the output is shown below.

Part of the printed json_obj1 ↓

['Koei Tecmo Games has released the latest information on the tactical action "Musou OROCHI3" for PlayStation 4/Nintendo Switch, scheduled to be released on September 27th (Steam version is October 16th) did. ',' This time, the character introduction video of the new character "Perseus" has been released. CV is served by Satoshi Shimono. In addition, it has become clear that "Seki Gin" is the target of "Deification" in which the key character of the story changes to a special figure. In addition, two kinds of key items "Godware" that gained the power of God have been released. ',' Olympus hero. A half-god born between the god Zeus and the person Danae. He killed Medusa and became revered as a hero. Worried about Zeus' intervention in the human world, he stood up to prevent confusion in the world. ',' Eight people who are the key to the story will become "the deified" that changes to the figure that possesses the power of God. This time, the deification of "Sekigin Kaoru (CV: Saori Mikami)" has been decided. ',' It is very well-developed and shows talent for the martial arts learned with the intention of self-defense. It is a tremendous monster, but the person is not aware. ',' A winged magic shoe possessed by the guardian deity Hermes. Those who wear it gain the speed to overtake the wind. ',' One of the mysterious treasures created by the hermit. It has the power to shake the earth and blow away the mountains. ',' An additional schedule for the ongoing store experience meeting has been decided. "Musou OROCHI3" special can badge "will be randomly distributed to those who tried. The number is limited and it will end as soon as it is gone. ',' [In-store experience meeting additional schedule] ',' A collaboration with a restaurant "KOEI TECMO CAFE&DINING" in Ikebukuro, Tokyo will be held in early September. You can enjoy in-store decorations and limited menus after "Musou OROCHI3". Detailed period and reservation method will be released later. 
',' © Koei Tecmo Games All rights reserved. ']

If I assign this to s with the for statement in this state, I get:

[In-store experience meeting additional schedule]
A collaboration with the restaurant "KOEI TECMO CAFE&DINING" in Ikebukuro, Tokyo will be held in early September. You can enjoy in-store decorations and limited menus after "Musou OROCHI3". Detailed period and reservation method will be released later.© Koei Tecmo Games All rights reserved.


That is, the text is split at each ','. The content of json_obj1 is divided into the parts separated by ',', and each part is assigned to s in turn.

In this state I cannot check which words appear in each article. (Currently the words are examined per fragment separated by ',' rather than per article.) How should I change the code?

I tried to remove the ',' with the strip function, but when the result was assigned to s, it was broken down into single characters.
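The character-by-character result can be reproduced with a small sketch. The sample strings are made up for illustration and are not taken from the real data: a for loop over a list yields one element per iteration, whereas a for loop over a single string yields one character per iteration.

```python
# Made-up sample standing in for json_obj['text'] (a list of fragments).
text_parts = ["First sentence. ", "Second sentence."]

# Looping over a list yields one fragment per iteration.
from_list = [s for s in text_parts]

# Once the fragments are merged into one string,
# looping over that string yields single characters.
merged = "".join(text_parts)
from_string = [s for s in merged]

print(from_list)
print(from_string[:5])
```

So any step that turns the list into one string has to be followed by using that string directly, not by looping over it.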

Addendum

The problem with s: the content of json_obj1 shown above is read using ',' as a delimiter. I want each article to be read as one whole article. Taking the output above as an example,

KOEI TECMO GAMES has released the latest information on the "Musou OROCHI3" tactical action for PlayStation 4/Nintendo Switch, which is scheduled to be released on September 27th (October 16th for the Steam version). This time, the character introduction video of the new character "Perseus" was released. CV is served by Satoshi Shimono. In addition, it has become clear that "Seki Gin" is the target of "Deification" in which the key character of the story changes to a special figure. In addition, two kinds of key items "Godware" that gained the power of God have been released. Olympos hero. A half-god born between the god Zeus and the person Danae. He killed Medusa and became revered as a hero. Worried about Zeus' intervention in the human world, he stood up to prevent confusion in the world. Eight people who are the key to the story will undergo a "deification" that transforms into a figure with the power of God. This time, the deification of "Sekigin Kaoru (CV: Saori Mikami)" has been decided. He is very skillful and demonstrates his talent for martial arts he learned with the intention of self-defense. It is a tremendous monster, but the person is not aware. "The magical shoe with wings possessed by the guardian god Hermes. Those who wear it gain the speed to overtake the wind. One of the mysterious treasures created by the hermit. It has the power to shake the earth and blow the mountains." An additional schedule for the store experience meeting has been decided. "Musou OROCHI3" special can badge "will be randomly distributed to those who tried. The number is limited and it will end as soon as it is gone. [In-store experience meeting additional schedule] Collaboration with a restaurant "KOEI TECMO CAFE&DINING" in Ikebukuro, Tokyo will be held in early September. You can enjoy in-store decorations and limited menus after "Musou OROCHI3". Detailed period and reservation method will be released later. © KOEI TECMO GAMES All rights reserved.

This is the form I am aiming for.

This is the output result of the current program.

There are 710 articles, yet many words have counts exceeding that number, so I suspected that the result was wrong. When I looked into it, I found that, as described in the first part of the question, the strings assigned to s are not whole articles, and I think that is why such numbers are displayed.

That is all for the supplement. If anything is unclear, please ask.
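A minimal sketch of how a count can exceed the number of articles, using made-up data and a whitespace split as a hypothetical stand-in for the real tokenizer: when unique words are collected per fragment, a word is counted once for every fragment that contains it, not once per article.

```python
from collections import Counter

# Two made-up articles, each stored as a list of fragments (like json_obj['text']).
articles = [
    ["the game is out", "the video is out"],
    ["the cafe opens", "the menu is limited"],
]

def tokenize(text):
    # Stand-in for t.tokenize(); the real program uses its own tokenizer.
    return text.split()

# Counting once per fragment: a word scores a point for each fragment it is in.
per_fragment = Counter()
for parts in articles:
    for s in parts:
        per_fragment.update(set(tokenize(s)))

# Counting once per article: join the fragments first, then take unique words.
per_article = Counter()
for parts in articles:
    per_article.update(set(tokenize(" ".join(parts))))

print(per_fragment["the"])  # 4, larger than the number of articles (2)
print(per_article["the"])   # 2
```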

  • Answer # 1

    The commas separate the elements of the list.
    Do you mean that you want to combine the elements?
    To turn a list of strings into a single string, you can join them.

    >>> json_obj1 = ['Hello', 'World', '!']
    >>> ''.join(json_obj1)
    'HelloWorld!'
    >>> ' '.join(json_obj1)
    'Hello World !'

    If you want a one-element list, wrap the result in square brackets.

    >>> [' '.join(json_obj1)]
    ['Hello World !']

    Reassign the joined result to the variable.

    json_obj1 = [''.join(json_obj1)]

    In that case the for statement that follows always loops exactly once, so you can drop the list and remove the loop.

    json_obj1 = ''.join(json_obj['text'])
    tokens = t.tokenize(json_obj1)
    base_forms = [tk.base_form for tk in tokens]
    wordset.update(base_forms)
    wordcounter.update(base_forms)
    unique_arr = np.unique(base_forms)
    l.append(unique_arr)
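Putting the join together with per-article counting could look like the sketch below. It is only an illustration under stated assumptions: the input lines are made up, `tokenize` is a hypothetical whitespace-based stand-in for `t.tokenize` plus the `base_form` extraction, and a `set` is used in place of `np.unique` to keep each word once per article.

```python
import json
from collections import Counter

# Made-up JSON Lines input; each line is one article with a 'text' list.
lines = [
    '{"text": ["alpha beta ", "beta gamma"]}',
    '{"text": ["beta ", "delta"]}',
]

def tokenize(text):
    # Hypothetical stand-in for t.tokenize() and the base_form extraction.
    return text.split()

doc_freq = Counter()  # word -> number of articles that contain it
for line in lines:
    article = "".join(json.loads(line)["text"])  # one string per article
    doc_freq.update(set(tokenize(article)))      # count each word once per article

# Print words in ascending order of the number of articles they appear in.
for word, n in sorted(doc_freq.items(), key=lambda x: x[1]):
    print(word, n)
```

With counting done this way, no word can have a count larger than the number of articles read.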
