
I am trying to extract specific parts of speech with MeCab as part of studying machine learning.
I was not satisfied with the results from the default dictionary, so I installed the NEologd dictionary.
For some words the output is not what I expect, and this causes an error in later processing.
Please advise me on how to make it work correctly.

With the option "-Ochasen" the output is correct, so the problem can be worked around.
I am posting this because, if something is actually wrong, I would like to deal with it early.

Error message

There is no error message as such; an extra bracketed list is appended to the returned features (see below).

Applicable source code
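import MeCab

# Raw string so the backslashes in the Windows path are kept literally
tagger = MeCab.Tagger(r"-d ..\dic\ipadic-neologd")
sentence = "most popular"
tagger.parse("")  # initial empty parse, kept from the original code
print(tagger.parse(sentence))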
#>>>Most popular noun, proper noun, general, *, *, *, 1st most popular, Ichibanninki, Ichibanninki, [: _: 3726 3689 7806]
#>>>particle, linkage, *, *, *, *, no, no
What I tried

・ It works correctly with the option -Ochasen.

・ Checked other cases
>>>Two types of nouns, proper nouns, general, *, *, *, two types, Nissui, Nissui, [: _: 2635 2609 8281]
>>>Noun, proper noun, general, *, *, *, 1st, Ichivante, Ichivante, [: _: 1817 1799 8281]
>>>First Floor Noun, Proper Noun, Personal Name, Surname, *, *, First Floor, Ikkai, Ikkai
It seems to occur with some words that contain kanji numerals.

Supplemental information (FW/tool version etc.)

Windows10 64bit
Python 3.6.6 | Anaconda custom (64-bit)

The NEologd dictionary was installed by following the guide linked here.

  • Answer #1

    NEologd's seed/neologd-quantity-infreq-dict-seed.20170224.csv.xz contains entries that are registered with exactly this kind of feature, so from MeCab's point of view the output is normal. (MeCab outputs the dictionary features verbatim.)

    % xzgrep -n 2nd seed/neologd-quantity-infreq-dict-seed.20170224.csv.xz
    199271: 2nd, 1288, 1288, 1234, Noun, Proper noun, General, *, *, *, 2nd, Nivante, Nivante, [: _: 1246 1234 8281]
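Because the features are printed as registered, code that splits a parse line on commas sees one more field for these entries. The following is only a rough sketch of that effect, assuming the default output layout of "surface, tab, comma-separated features" and the nine feature columns of a standard IPA-dictionary entry. The helper name split_mecab_line is hypothetical, and the sample line reuses the English glosses from the seed entry above rather than the real Japanese text.

```python
# Number of feature columns in a standard IPA-dictionary entry
# (assumption: POS and three subcategories, conjugation type,
# conjugation form, base form, reading, pronunciation).
IPADIC_FEATURE_COUNT = 9


def split_mecab_line(line, feature_count=IPADIC_FEATURE_COUNT):
    """Split one line of default MeCab output into (surface, features, extras).

    Hypothetical helper: `features` holds the first `feature_count` fields,
    `extras` holds whatever the dictionary entry registered beyond them.
    """
    surface, _, feature_str = line.rstrip("\n").partition("\t")
    fields = feature_str.split(",")
    return surface, fields[:feature_count], fields[feature_count:]


# Sample line modelled on the seed entry above (glosses, not real output).
sample = "2nd\tNoun,Proper noun,General,*,*,*,2nd,Nivante,Nivante,[:_:1246 1234 8281]"
surface, features, extras = split_mecab_line(sample)
print(surface)                # 2nd
print(len(features), extras)  # 9 ['[:_:1246 1234 8281]']
```

Later processing that only reads `features` would not be affected by the extra field, though whether that is acceptable depends on what the downstream code expects.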

    I think you could rewrite NEologd's seed as you like and then rebuild the dictionary.
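One possible way to do such a rewrite is to cut every seed row down to the 13 columns of a standard mecab-ipadic entry (surface, left context ID, right context ID, cost, and nine features). This is a sketch under that assumption, not something NEologd documents. The function names are hypothetical, the sample row reuses the glossed entry shown above, and I have not checked whether NEologd relies on the extra column elsewhere.

```python
import csv
import io

# Assumption: a standard mecab-ipadic CSV row has 13 columns
# (surface, left context ID, right context ID, cost, 9 features).
IPADIC_COLUMN_COUNT = 13


def trim_seed_rows(lines, column_count=IPADIC_COLUMN_COUNT):
    """Yield each non-empty seed row cut down to its first `column_count` columns."""
    for row in csv.reader(lines):
        if row:  # csv.reader returns [] for blank lines; skip them
            yield row[:column_count]


def rows_to_csv(rows):
    """Serialize rows back to CSV text with Unix line endings."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Sample row modelled on the seed entry shown above (glosses, not the real text).
seed_lines = [
    "2nd,1288,1288,1234,Noun,Proper noun,General,*,*,*,2nd,Nivante,Nivante,[:_:1246 1234 8281]\n"
]
trimmed = list(trim_seed_rows(seed_lines))
print(rows_to_csv(trimmed), end="")
```

The real seed is xz-compressed, so the lines would have to come from something like `lzma.open(path, "rt", encoding="utf-8")`, and the dictionary would need to be rebuilt afterwards for the change to take effect.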

    (By the way, I assume NEologd was built with the -a option or --install_infreq_quantity. Otherwise, I don't think this seed would be used.)


    Appendix

    There is probably no page (where someone explains this) that would be helpful.
    See the help shown by install-mecab-ipadic-neologd -h (i.e. its Usage); I think you have to judge for yourself what you need.

    What I did was: confirm that the entry is not included by default, confirm that it is included with the -a option, then search the files under "seed/" to find which one contains it and test which option was likely to correspond.


    Additional
    Another way is to control the output format of mecab.
    http://taku910.github.io/mecab/format.html

    You can also deal with it by rewriting dicrc, or by specifying "output feature number N in this format" with command-line options (the Tagger initialization parameter in Python).
    An example for -Ochasen is listed at the bottom of that page, so I think it will be helpful.
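As a rough illustration of the command-line route, the sketch below builds an option string from the `--node-format` and `--eos-format` options and the `%m` (surface) and `%f[N]` (N-th feature) placeholders described on the format page. The function name build_tagger_options is hypothetical, the dictionary path is the one from the question, and the default of seven features (part of speech through base form) is my own choice. The MeCab call itself is left as a comment so the snippet runs without MeCab installed.

```python
def build_tagger_options(dic_dir, feature_indices=range(7)):
    """Build a MeCab option string that prints only the chosen feature columns.

    Hypothetical helper: each index i becomes a %f[i] placeholder.
    """
    features = ",".join("%f[{}]".format(i) for i in feature_indices)
    # Backslashes stay literal in the option string; the assumption is that
    # MeCab itself expands \t and \n inside format strings.
    node_format = r"%m\t" + features + r"\n"
    return "-d {} --node-format={} --eos-format=EOS\\n".format(dic_dir, node_format)


options = build_tagger_options(r"..\dic\ipadic-neologd", feature_indices=[0, 1])
print(options)

# With MeCab installed, the string would be passed to the Tagger, e.g.:
# import MeCab
# tagger = MeCab.Tagger(options)
# print(tagger.parse(sentence))
```

With a format like this, the extra bracketed field is never printed because only the listed feature indices are requested. As far as I know, unknown words can have fewer features than dictionary entries, so asking for the reading or pronunciation columns may also need an `--unk-format`.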