In sentence classification problems in natural language processing, the data is normally split into training data and test data, but in many cases the samples are not of fixed length.
Skipping the split would make things very easy, but leakage would certainly occur, so I think the data has to be split at least before the dictionary (vocabulary) is built.
How do you split variable-length input data?
For fixed-length data, sklearn's train_test_split works right away, so
the approach that comes to mind is to convert the data to a list and pad every sample with filler characters up to the length of the longest one.
Please let me know whether this is an efficient approach, or whether there is a better one.
For example, given data like
text = [ ['aaa', 'bbb', 'cccc'], ['xx'], ['xxxx', 'bb', 'eeee', 'bbbb'], ]
I would like to know how to write the splitting step, including the labels.
I realize the question is a bit vague, so if anything is unclear, please ask and I'll answer.
Answer # 1
The length of the document doesn't matter.
What matters is the feature representation you use: one-hot encoding, bag of words, n-grams, topic models, distributed representations (embeddings), and so on.
One approach is to transform the whole dataset with an unsupervised feature extraction method and then split it with train_test_split or similar. That is usually enough.
Feature extraction can be fairly complex, though, so if you want to rule out even slight leakage at that stage, you can split the raw, unprocessed data into training data and test data first, and then build the dictionary and the model from the training data only.
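The split-first variant described above can be sketched as follows. This is only a minimal illustration under assumptions: the fourth document and the `labels` values are made up, and `CountVectorizer` stands in for whichever feature extractor you actually use.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data: tokenized documents of different lengths, with made-up labels.
texts = [
    ["aaa", "bbb", "cccc"],
    ["xx"],
    ["xxxx", "bb", "eeee", "bbbb"],
    ["aaa", "xx"],
]
labels = [0, 1, 0, 1]

# 1. Split the raw documents first, before any vocabulary exists.
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0
)

# 2. Build the vocabulary from the training documents only.
#    The callable analyzer tells CountVectorizer the input is already tokenized.
vectorizer = CountVectorizer(analyzer=lambda doc: doc)
X_train = vectorizer.fit_transform(train_texts)

# 3. Reuse the fitted vocabulary on the test documents.
#    Tokens that never appeared in the training data are simply ignored.
X_test = vectorizer.transform(test_texts)

print(X_train.shape, X_test.shape)
```

Both matrices end up with the same number of columns because the test documents are encoded with the vocabulary learned from the training documents.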
Answer # 2
I haven't tried it, but can't train_test_split split the data even if it isn't fixed length?
If that doesn't work, how about preparing an index array with one entry per sample and splitting that instead?
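Both ideas in this answer can be tried in a few lines. The sketch below assumes the same kind of toy data as the question, with a made-up fourth document and made-up `labels`; it passes the ragged lists to `train_test_split` as they are, and then shows the index-array fallback.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: documents of different lengths plus made-up labels.
texts = [
    ["aaa", "bbb", "cccc"],
    ["xx"],
    ["xxxx", "bb", "eeee", "bbbb"],
    ["aaa", "xx"],
]
labels = [0, 1, 0, 1]

# Option 1: pass the variable-length lists directly.
# train_test_split only selects whole samples, so their lengths do not matter.
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=0
)

# Option 2: split an index array and use it to pick the samples.
indices = np.arange(len(texts))
train_idx, test_idx = train_test_split(indices, test_size=0.5, random_state=0)
train_texts_by_idx = [texts[i] for i in train_idx]
test_texts_by_idx = [texts[i] for i in test_idx]
y_train_by_idx = [labels[i] for i in train_idx]
y_test_by_idx = [labels[i] for i in test_idx]

print(train_texts, y_train)
print(train_idx, test_idx)
```

The index-based version is handy when the samples live in several parallel structures, since one pair of index arrays can be reused for all of them.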
Answer # 3
It depends on what you want to do, but why not run train_test_split after converting all the data to a one-hot representation? The data is then fixed length by the time it is split, so I think this matches what you are asking for.
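As a rough sketch of this suggestion, each document can be turned into a fixed-length 0/1 vector over the vocabulary and then split. The choices here are assumptions rather than part of the answer: `MultiLabelBinarizer` is normally used for label sets and is borrowed as a simple encoder, and the fourth document and the `labels` are made up. Note that the vocabulary is built from all the data, which is the kind of slight leakage the question wants to avoid.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data: documents of different lengths plus made-up labels.
texts = [
    ["aaa", "bbb", "cccc"],
    ["xx"],
    ["xxxx", "bb", "eeee", "bbbb"],
    ["aaa", "xx"],
]
labels = [0, 1, 0, 1]

# Encode every document as a 0/1 vector with one column per distinct token.
# All rows have the same width regardless of the original document length.
encoder = MultiLabelBinarizer()
X = encoder.fit_transform(texts)

# The encoded matrix is fixed length, so it can be split like any other array.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0
)

print(encoder.classes_)
print(X_train.shape, X_test.shape)
```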