For sentence classification problems in natural language processing, I understand that the training data and test data have to be kept separate, but in many cases the samples are not of fixed length.
Doing everything without splitting first would be trivially easy, but leakage would certainly occur, so I think the data needs to be split at least before the dictionary is built.
How do you split variable-length input data?

For fixed-length data, sklearn's train_test_split works right away, so
the obvious idea I can come up with is to convert everything to lists and pad the shorter samples with some character up to the length of the longest one, making them all fixed length.
Please let me know whether this is an efficient way to do it.

Specifically, for data like this:

text = [
    ['aaa', 'bbb', 'cccc'],
    ['xx'],
    ['xxxx', 'bb', 'eeee', 'bbbb'],
]

I would like to ask how to write the splitting process so that the samples stay paired with their labels.
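For example, something like the following is roughly what I have in mind (the labels list and the '<pad>' token are placeholders I made up for illustration):

from sklearn.model_selection import train_test_split

text = [
    ['aaa', 'bbb', 'cccc'],
    ['xx'],
    ['xxxx', 'bb', 'eeee', 'bbbb'],
]
labels = [0, 1, 0]  # one made-up label per sample

# Pad every sample with a dummy token up to the length of the longest one.
max_len = max(len(tokens) for tokens in text)
padded = [tokens + ['<pad>'] * (max_len - len(tokens)) for tokens in text]

# Split samples and labels together so they stay aligned.
X_train, X_test, y_train, y_test = train_test_split(
    padded, labels, test_size=0.33, random_state=0
)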

My question may be a bit vague, but if anything is unclear, I will answer any questions.

  • Answer # 1

    The length of the document doesn't matter.

    It depends on what method you use: one-hot representations, bag of words, n-grams, topic models, distributed representations (embeddings), and so on.

    One approach is to convert the whole dataset with an unsupervised feature extraction method and then split it with train_test_split etc. That is usually sufficient.
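    For example, a minimal sketch assuming a bag-of-words representation with sklearn's CountVectorizer (the documents and labels here are made up):

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.model_selection import train_test_split

        docs = ['aaa bbb cccc', 'xx', 'xxxx bb eeee bbbb']
        labels = [0, 1, 0]

        # Unsupervised feature extraction on the whole corpus:
        # every document becomes a fixed-length bag-of-words vector.
        vectorizer = CountVectorizer()
        X = vectorizer.fit_transform(docs)

        # The data is now fixed length, so splitting is straightforward.
        X_train, X_test, y_train, y_test = train_test_split(
            X, labels, test_size=0.33, random_state=0
        )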

    If the feature extraction is somewhat complex and you dislike the possibility that a little leakage could occur at that stage, you can instead split into training and test data while nothing has been processed yet, and build the feature extractor and model from the training data only.
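    A sketch of that stricter variant, again with made-up documents: split first, then fit the vectorizer on the training portion only.

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.model_selection import train_test_split

        docs = ['aaa bbb cccc', 'xx', 'xxxx bb eeee bbbb']
        labels = [0, 1, 0]

        # Split the raw, unprocessed documents first.
        docs_train, docs_test, y_train, y_test = train_test_split(
            docs, labels, test_size=0.33, random_state=0
        )

        # Build the dictionary from the training data only, so nothing
        # from the test set leaks into the feature extraction.
        vectorizer = CountVectorizer()
        X_train = vectorizer.fit_transform(docs_train)
        X_test = vectorizer.transform(docs_test)  # unseen words are simply ignored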

  • Answer # 2

    I haven't tried it, but can't you just split the data as-is even if it isn't fixed length?

    If that doesn't work, how about preparing an index array with one entry per sample and splitting that instead?
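    A rough sketch of both ideas (the token lists are from the question, the labels are made up):

        import numpy as np
        from sklearn.model_selection import train_test_split

        text = [
            ['aaa', 'bbb', 'cccc'],
            ['xx'],
            ['xxxx', 'bb', 'eeee', 'bbbb'],
        ]
        labels = [0, 1, 0]

        # train_test_split accepts plain Python lists, so variable-length
        # samples can be split as-is, together with their labels.
        X_train, X_test, y_train, y_test = train_test_split(
            text, labels, test_size=0.33, random_state=0
        )

        # Alternatively, split an index array and use it to slice any
        # number of parallel structures afterwards.
        idx_train, idx_test = train_test_split(
            np.arange(len(text)), test_size=0.33, random_state=0
        )
        X_train = [text[i] for i in idx_train]
        y_train = [labels[i] for i in idx_train]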

  • Answer # 3

    Depending on what you want to do, why not run train_test_split after converting all the data to a one-hot representation? In that case the split is performed after the data has been converted to fixed length, so I think it is in line with what you want.
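    A minimal sketch of that idea, assuming a multi-hot encoding over the vocabulary via sklearn's MultiLabelBinarizer counts as the one-hot style representation (the labels are made up):

        from sklearn.preprocessing import MultiLabelBinarizer
        from sklearn.model_selection import train_test_split

        text = [
            ['aaa', 'bbb', 'cccc'],
            ['xx'],
            ['xxxx', 'bb', 'eeee', 'bbbb'],
        ]
        labels = [0, 1, 0]

        # Each token list becomes a fixed-length binary vector over the
        # full vocabulary (one column per distinct token).
        mlb = MultiLabelBinarizer()
        X = mlb.fit_transform(text)

        # The data is now fixed length, so the split works as usual.
        X_train, X_test, y_train, y_test = train_test_split(
            X, labels, test_size=0.33, random_state=0
        )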