
In my DataFrame there are three columns whose values look like "Fukuoka | Fukuoka".
As pre-processing, I split each of them with .str.split('|'), extract only the first element,
and join the result back to the original DataFrame.

When I try to join the second such column back to the original DataFrame,
the number of rows blows up so much that the result is unusable.

The original DataFrame is:

RangeIndex: 8557 entries, 0 to 8556
Data columns (total 57 columns):
A 8557 non-null object
B 8557 non-null object
C 1241 non-null float64
D 1241 non-null object
E 1241 non-null object
F 1241 non-null object
G 1093 non-null object
H 1225 non-null object
I 4362 non-null object
J 4362 non-null float64
K 4362 non-null float64

...and so on, for 57 columns in total.

Relevant source code
import pandas as pd
import numpy as np
import datetime as dt
import codecs

with codecs.open(r"G:\attribute_November.csv",
                 "r", "Shift-JIS", "ignore") as file:
    df = pd.read_table(file, delimiter=",", dtype={'user name': object})
df.head()
df.dropna(subset=['Hotel name'], inplace=True)

# split "Fukuoka | Fukuoka"-style values and keep only the first element
date_birth = df['Birth date'].str.split('|', expand=True)
a = date_birth[[0]]
a.rename(columns={0: 'Birth date'}, inplace=True)

# merging this back multiplies the number of rows
name = df['Hotel name'].str.split('|', expand=True)
b = name[[0]]
b.rename(columns={0: 'Hotel name'}, inplace=True)
'''
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1241 entries, 3 to 8555
Data columns (total 1 columns):
Hotel name    1241 non-null object
dtypes: object(1)
memory usage: 14.5+ KB
'''
live = df['Residence'].str.split('|', expand=True)
c = live[[0]]
c.rename(columns={0: 'Residence'}, inplace=True)
data = pd.merge(df, a, on='Birth date', how='left')
data.info()
'''
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1950 entries, 0 to 1949
Data columns (total 17 columns):
A 1950 non-null object
B 1950 non-null object
C 1950 non-null float64
Hotel name 1950 non-null object
D 1950 non-null object
E 1950 non-null object
Residence 1522 non-null object
Birth date 1694 non-null object
F 581 non-null object
G 581 non-null float64
H 581 non-null float64
I 1950 non-null object
J 581 non-null object
K 431 non-null object
L 411 non-null object
M 581 non-null object
N 581 non-null object
# no problem here
'''
# this second merge multiplies the number of rows
VV = pd.merge(data, b, on='Hotel name', how='left')
VV.info()
'''
<class 'pandas.core.frame.DataFrame'>
Int64Index: 642509 entries, 0 to 642508
Data columns (total 17 columns):
A 642509 non-null object
B 642509 non-null object
C 642509 non-null float64
Hotel name 642509 non-null object
D 642509 non-null object
E 642509 non-null object
Residence 574292 non-null object
Birth date 640989 non-null object
F 200390 non-null object
G 200390 non-null float64
H 200390 non-null float64
I 642509 non-null object
J 200390 non-null object
K 146738 non-null object
L 139382 non-null object
M 200390 non-null object
N 200390 non-null object
dtypes: float64(3), object(14)
memory usage: 53.9+ MB
Somehow the number of rows has exploded
'''

I tried pd.merge_asof, how='inner', and so on, but none of them led to a solution.

  • Answer #1

    If the key values are duplicated, merging multiplies the number of rows.

    For example, merging with a = [Fukuoka, Fukuoka] and b = [Fukuoka, Fukuoka] as the key,
    every combination a[0]*b[0], a[0]*b[1], a[1]*b[0], a[1]*b[1] matches,
    so [Fukuoka, Fukuoka, Fukuoka, Fukuoka] is returned.
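
    As a minimal sketch (the two frames below are invented just to illustrate the point), a left merge on a key that is duplicated on both sides returns the full cross product of the matching rows:

    import pandas as pd

    left = pd.DataFrame({'key': ['Fukuoka', 'Fukuoka'], 'x': [1, 2]})
    right = pd.DataFrame({'key': ['Fukuoka', 'Fukuoka'], 'y': [3, 4]})

    # every left row matches every right row with the same key,
    # so 2 x 2 = 4 rows come back instead of 2
    merged = pd.merge(left, right, on='key', how='left')
    print(len(merged))  # 4

    With many rows sharing the same hotel name, this is how the second merge in the question balloons to 642509 rows.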

    I think a plain assignment can do what you want instead of a merge; how about this?

    df['Hotel name'] = df['Hotel name'].str.split('|').str.get(0)
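
    Applied the same way to the other two columns (column names taken from the question's code), the row count of df stays unchanged, because the operation is element-wise rather than a join:

    df['Birth date'] = df['Birth date'].str.split('|').str.get(0)
    df['Residence'] = df['Residence'].str.split('|').str.get(0)
    # len(df) is unchanged: there is no merge, so duplicated values cannot multiply rows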