Currently, I extract only the text part from article data stored in multiple JSON files and, for every word that appears, check how many articles it appears in.
import fileinput
import json
from collections import Counter

import numpy as np

# JSON_FILE and the tokenizer t are defined elsewhere in the program.
l = []  # one array of unique words per processed text
wordset = set()  # set of all words that occur
for line in fileinput.input(JSON_FILE):
    wordcounter = Counter()
    json_obj = json.loads(line)
    json_obj1 = json_obj['text']
    for s in json_obj1:
        tokens = t.tokenize(s)
        base_forms = [tk.base_form for tk in tokens]
        wordset.update(base_forms)
        wordcounter.update(base_forms)
        unique_arr = np.unique(base_forms)
        l.append(unique_arr)
This is part of the program. The first for statement reads one article at a time from JSON_FILE and takes out only its 'text' field.
The following for statement breaks each article into words. Here, the words of all articles are stored in wordset, and the words of one article are appended to the list l.
wordset = Counter(wordset)
for W in l:
    wordset.update(W)
wordset_sorted = sorted(wordset.items(), key=lambda x: x[1])
print(wordset)
Finally, wordset and l are compared to count the number of articles in which each word occurs.
However, at the line
    for s in json_obj1:
in the program above, the data assigned to s is not a whole article but a piece of it, split at each ','. Part of the output is shown below.
Part of json_obj1 printed ↓
['Koei Tecmo Games has released the latest information on the tactical action "Musou OROCHI3" for PlayStation 4/Nintendo Switch, scheduled to be released on September 27th (Steam version is October 16th) did. ',' This time, the character introduction video of the new character "Perseus" has been released. CV is served by Satoshi Shimono. In addition, it has become clear that "Seki Gin" is the target of "Deification" in which the key character of the story changes to a special figure. In addition, two kinds of key items "Godware" that gained the power of God have been released. ',' Olympus hero. A half-god born between the god Zeus and the person Danae. He killed Medusa and became revered as a hero. Worried about Zeus' intervention in the human world, he stood up to prevent confusion in the world. ',' Eight people who are the key to the story will become "the deified" that changes to the figure that possesses the power of God. This time, the deification of "Sekigin Kaoru (CV: Saori Mikami)" has been decided. ',' It is very well-developed and shows talent for the martial arts learned with the intention of self-defense. It is a tremendous monster, but the person is not aware. ',' A winged magic shoe possessed by the guardian deity Hermes. Those who wear it gain the speed to overtake the wind. ',' One of the mysterious treasures created by the hermit. It has the power to shake the earth and blow away the mountains. ',' An additional schedule for the ongoing store experience meeting has been decided. "Musou OROCHI3" special can badge "will be randomly distributed to those who tried. The number is limited and it will end as soon as it is gone. ',' [In-store experience meeting additional schedule] ',' A collaboration with a restaurant "KOEI TECMO CAFE&DINING" in Ikebukuro, Tokyo will be held in early September. You can enjoy in-store decorations and limited menus after "Musou OROCHI3". Detailed period and reservation method will be released later. 
',' © Koei Tecmo Games All rights reserved. ']
If I assign these to s with the for statement in this state, I get:
[In-store experience meeting additional schedule]
A collaboration with the restaurant "KOEI TECMO CAFE&DINING" in Ikebukuro, Tokyo will be held in early September. You can enjoy in-store decorations and limited menus after "Musou OROCHI3". Detailed period and reservation method will be released later.© Koei Tecmo Games All rights reserved.
so the text is split at each ','. In other words, the printed json_obj1 is divided into the parts separated by ',' and each part is assigned to s in turn.
In this state, I cannot check which words appear in each article. (Currently, words are examined per fragment separated by ',' rather than per article.) How should I change this?
I tried to remove the ',' with the strip function, but then the loop assigned one character at a time to s.
Addendum
The problem with s: the output of json_obj1 above is read with ',' as the delimiter. I want each article to be read as a single article. Taking the output above as an example,
KOEI TECMO GAMES has released the latest information on the "Musou OROCHI3" tactical action for PlayStation 4/Nintendo Switch, which is scheduled to be released on September 27th (October 16th for the Steam version). This time, the character introduction video of the new character "Perseus" was released. CV is served by Satoshi Shimono. In addition, it has become clear that "Seki Gin" is the target of "Deification" in which the key character of the story changes to a special figure. In addition, two kinds of key items "Godware" that gained the power of God have been released. Olympos hero. A half-god born between the god Zeus and the person Danae. He killed Medusa and became revered as a hero. Worried about Zeus' intervention in the human world, he stood up to prevent confusion in the world. Eight people who are the key to the story will undergo a "deification" that transforms into a figure with the power of God. This time, the deification of "Sekigin Kaoru (CV: Saori Mikami)" has been decided. He is very skillful and demonstrates his talent for martial arts he learned with the intention of self-defense. It is a tremendous monster, but the person is not aware. "The magical shoe with wings possessed by the guardian god Hermes. Those who wear it gain the speed to overtake the wind. One of the mysterious treasures created by the hermit. It has the power to shake the earth and blow the mountains." An additional schedule for the store experience meeting has been decided. "Musou OROCHI3" special can badge "will be randomly distributed to those who tried. The number is limited and it will end as soon as it is gone. [In-store experience meeting additional schedule] Collaboration with a restaurant "KOEI TECMO CAFE&DINING" in Ikebukuro, Tokyo will be held in early September. You can enjoy in-store decorations and limited menus after "Musou OROCHI3". Detailed period and reservation method will be released later. © KOEI TECMO GAMES All rights reserved.
I am aiming for output like this.
This is the output result of the current program.
The number of articles is 710, yet many words have counts exceeding that number, so the result looks wrong. When I looked into it, I realized that, as described in the first question, the string assigned to s is not a whole article, which may be why such numbers appear.
That is all for the addendum. If anything is unclear when answering, please let me know.
Answer # 1
json_obj1 is a list, and the ',' you see are just the separators between its elements.
Does that mean you want to combine the elements?
If you want to turn a list of strings into a single string, you can join them with str.join.
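A minimal sketch of the join, assuming json_obj['text'] is a list of strings like the one printed in the question. The sample sentences are shortened and the name paragraphs is only illustrative:

```python
# Illustrative stand-in for json_obj['text']: one article as a list of paragraphs.
paragraphs = [
    'Koei Tecmo Games has released the latest information. ',
    'This time, the character introduction video has been released. ',
    '© Koei Tecmo Games All rights reserved. ',
]

# str.join concatenates the list elements into one string.
# The empty separator is used because each paragraph already ends with a space.
article_text = ''.join(paragraphs)
print(article_text)
```

After this, article_text holds the whole article as one string, with no ',' between the former elements.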
If you want a one-element list, wrap the result in square brackets.
Reassign the combined result to the variable.
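As a rough illustration (the sample strings are made up), wrapping the joined string in square brackets keeps the existing for statement working, whereas looping over a bare string yields single characters, which is probably what happened in the strip attempt:

```python
# Illustrative stand-in for json_obj['text'].
json_obj1 = ['first paragraph. ', 'second paragraph. ']

# Reassign: a list whose only element is the whole article.
json_obj1 = [''.join(json_obj1)]

pieces = [s for s in json_obj1]   # what the existing loop would see
print(len(pieces))                # 1
print(pieces[0])                  # first paragraph. second paragraph.

# By contrast, iterating over a plain string gives one character at a time.
chars = [c for c in 'abc']
print(chars)                      # ['a', 'b', 'c']
```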
In that case, the for statement that immediately follows always loops exactly once, so you can drop the list and remove the loop altogether.
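Putting it together, here is a sketch of the loop without the inner for statement. It is not the original program: base_forms_of is a hypothetical stand-in for the question's tokenizer (t.tokenize plus base_form), replaced here by a simple whitespace split so the example runs on its own, and sample_lines stands in for the lines of JSON_FILE.

```python
import json
from collections import Counter


def base_forms_of(text):
    # Hypothetical stand-in for the question's tokenizer; swap in
    # [tk.base_form for tk in t.tokenize(text)] in the real program.
    return text.lower().split()


def document_frequency(lines):
    """Count, for each word, the number of articles it appears in."""
    doc_freq = Counter()
    for line in lines:
        json_obj = json.loads(line)
        article_text = ''.join(json_obj['text'])  # one string per article
        # A set counts each word at most once per article.
        doc_freq.update(set(base_forms_of(article_text)))
    return doc_freq


# Made-up data: two articles, one JSON object per line.
sample_lines = [
    json.dumps({'text': ['the hero ', 'the video ']}),
    json.dumps({'text': ['the badge ']}),
]
df = document_frequency(sample_lines)
print(df['the'])   # 2
print(df['hero'])  # 1
```

Because every article contributes at most 1 to each word, no count can exceed the number of articles, which addresses the counts above 710 mentioned in the question.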