I am trying to extract specific parts of speech with MeCab as part of studying machine learning.
I was not satisfied with the results of the default dictionary, so I installed the NEologd dictionary.
For some words the output is not what I expect, and this causes an error in later processing.
Please give me advice on how to make it work correctly.
If the "-Ochasen" option is used, it works correctly, so the problem can be avoided that way.
I am asking because, if there is an underlying problem, I would like to deal with it as early as possible.
An extra list follows the return value (see below).
Applicable source code
import MeCab
tagger = MeCab.Tagger(r"-d ..\dic\ipadic-neologd")
sentence = "most popular"
tagger.parse("")
tagger.parse(sentence)
#>>> Most popular noun,proper noun,general,*,*,*,1st most popular,Ichibanninki,Ichibanninki,[:_:3726 3689 7806]
#>>> particle,linkage,*,*,*,*,no,no
・ Works correctly with the -Ochasen option.
・ Checked other cases:
>>> Two types of nouns,proper nouns,general,*,*,*,two types,Nissui,Nissui,[:_:2635 2609 8281]
>>> Noun,proper noun,general,*,*,*,1st,Ichivante,Ichivante,[:_:1817 1799 8281]
>>> First Floor Noun,Proper Noun,Personal Name,Surname,*,*,First Floor,Ikkai,Ikkai
It seems to occur with some words that contain kanji numerals.
Python 3.6.6 | Anaconda custom (64-bit)
The NEologd dictionary was installed with reference to here.
Answer #1
An entry in seed/neologd-quantity-infreq-dict-seed.20170224.csv.xz is registered with exactly this feature string, so from MeCab's point of view the output is normal. (MeCab simply prints the features stored in the dictionary as they are.)
% xzgrep -n 2nd seed/neologd-quantity-infreq-dict-seed.20170224.csv.xz
199271:2nd,1288,1288,1234,Noun,Proper noun,General,*,*,*,2nd,Nivante,Nivante,[:_:1246 1234 8281]
I think you can rewrite the NEologd seed as you like and then rebuild the dictionary.
(By the way, I assume NEologd was built with the -install_infreq_quantity option. Otherwise, I don't think this seed would be used.)
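As one possible way to rewrite the seed, the sketch below drops a trailing "[:_:...]" field from a single seed CSV line. The function name strip_extra_field and the regular expression are illustrative assumptions, not part of NEologd, and I have not checked that a dictionary rebuilt from lines edited this way behaves as intended.

```python
import re

# Assumption: the extra field is always the last column and looks like
# "[:_:1246 1234 8281]" (digits separated by spaces), as in the xzgrep output.
EXTRA_FIELD = re.compile(r",\[:_:[0-9 ]*\]\s*$")


def strip_extra_field(csv_line):
    """Return the seed line without a trailing "[:_:...]" column, if it has one."""
    return EXTRA_FIELD.sub("", csv_line.rstrip("\n"))


example = ("2nd,1288,1288,1234,Noun,Proper noun,General,*,*,*,"
           "2nd,Nivante,Nivante,[:_:1246 1234 8281]")
cleaned = strip_extra_field(example)
print(cleaned)
```

Lines that do not end with such a field are returned unchanged, so the function could be applied to every line of a decompressed seed file.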
There is probably no page where someone explains this in a way that would be helpful.
I think you have to read the help shown by install-mecab-ipadic-neologd -h (i.e. its Usage) and judge for yourself what you need.
What I did was confirm that the entry is not included by default, confirm that it is included with the -a option, and then search for which file under "seed/" contains it and test which option was likely to correspond to that file.
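The search step can also be done from Python with the standard lzma module. This is a rough sketch: find_entries is a hypothetical helper name, and unlike xzgrep (which matches anywhere in the line) it only matches when the first column is exactly equal to the given surface form.

```python
import lzma


def find_entries(seed_path, surface):
    """Yield (line_number, line) for seed entries whose first column equals surface."""
    # Seed files are xz-compressed CSV; open them as UTF-8 text.
    with lzma.open(seed_path, mode="rt", encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            if line.split(",", 1)[0] == surface:
                yield number, line.rstrip("\n")
```

Calling list(find_entries(path, word)) for each file under "seed/" shows which seed file an entry comes from, which then hints at the install option that pulls that file in.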
Another way is to control the output format of MeCab.
You can also fix the program by rewriting dicrc, or by specifying with command-line options (the Tagger initialization parameter in Python) a format that says which features should be output.
An example for -Ochasen is listed at the bottom of the page, so I think it will be helpful.
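A minimal sketch of the command-line variant, assuming MeCab's -F (node format) and -E (end-of-sentence format) options with the %m (surface) and %f[...] (selected features) placeholders. The helper names build_tagger_args and make_tagger are made up for illustration. The default of seven features is an assumption: asking for an index that an entry does not have may cause an error, and unknown words in IPA-style dictionaries tend to carry fewer features than known ones.

```python
def build_tagger_args(dic_dir, n_features=7):
    """Build a Tagger argument string that prints only the first n_features features."""
    indices = ",".join(str(i) for i in range(n_features))
    # MeCab expands the literal two-character sequences \t and \n itself,
    # so raw strings are used here; real tab or newline characters would
    # split the argument string.
    node_format = r"%m\t%f[" + indices + r"]\n"
    return "-d {} -F{} -E{}".format(dic_dir, node_format, r"EOS\n")


def make_tagger(dic_dir, n_features=7):
    """Create a MeCab.Tagger with the restricted output format."""
    import MeCab  # imported here so the helper above works without MeCab installed
    return MeCab.Tagger(build_tagger_args(dic_dir, n_features))


print(build_tagger_args(r"..\dic\ipadic-neologd"))
```

With a format like this, the trailing "[:_:...]" field is simply never printed, so later processing sees a fixed number of columns without the dictionary having to be rebuilt.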