
I'm writing a script that reads a CSV file exported from Excel and converts a column to datetime with pandas' pd.to_datetime. However, because the way the file is saved in Excel changed, the date notation varies between files, as shown in File 1 and File 2 below.

I would like to branch the to_datetime call with if statements according to each date format. What should I use to determine which format a file is in?

I considered checking len(df['month_day'][0]), but that would misclassify the file if another, similar format came in, so I'm looking for a check that reliably identifies the current format. (Along those lines, I would also like to know whether there is a way to convert without branching on an if statement at all.)


File 1

month_day hour_minute value
4月1日 8:40:00 1
4月1日 8:40:10 3

File 2

month_day hour_minute value
2020/10/1 8:55 1
2020/10/1 8:56 2

* File 2 has no seconds, but it is fine to treat them as 00 seconds.


# Process to read File 1
df = pd.read_csv('file1.csv')
if ...:  # I want an accurate check here (no misclassification even if other formats come in)
    df.index = pd.to_datetime(df['month_day'] + '/' + df['hour_minute'], format='%Y/%m/%d/%H:%M:%S')
elif ...:
    df.index = pd.to_datetime(...)
  • Answer #1

    As shown below, passing errors='coerce' to to_datetime sets the value to NaT (missing) whenever a row does not match the expected format. You can then use fillna to retry the next format only on the rows that are still missing, and repeat as needed. Because this works row by row, it also handles data whose format differs from line to line, as well as three or more formats.

    import io
    import pandas as pd

    # Sample data; columns are space-separated here and converted to CSV below
    file1 = '''
    month_day hour_minute value
    4月1日 8:40:00 1
    4月1日 8:40:10 3
    '''
    file2 = '''
    month_day hour_minute value
    2020/10/1 8:55 1
    2020/10/1 8:56 2
    '''
    df = pd.read_csv(io.StringIO(file1.replace(' ', ',')), skipinitialspace=True)
    #df = pd.read_csv(io.StringIO(file2.replace(' ', ',')), skipinitialspace=True)
    year = 2020  # File 1 has no year, so supply one
    # First attempt: let pandas parse the File 2 style; other rows become NaT
    df['datetime'] = pd.to_datetime(
        df['month_day'] + ' ' + df['hour_minute'], errors='coerce')
    # Second attempt: fill only the NaT rows using the File 1 style
    df['datetime'] = df['datetime'].fillna(pd.to_datetime(
        f'{year}年' + df['month_day'] + ' ' + df['hour_minute'],
        format='%Y年%m月%d日 %H:%M:%S', errors='coerce'))
    print(df)
    #   month_day hour_minute  value            datetime
    # 0      4月1日     8:40:00      1 2020-04-01 08:40:00
    # 1      4月1日     8:40:10      3 2020-04-01 08:40:10
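
To extend this to three or more formats, the same coerce-then-fillna step can be wrapped in a loop. The following is only a sketch under assumptions: `parse_with_fallbacks`, `samples`, and the `formats` list are hypothetical names and values that are not in the original post, and the year is assumed to have been prepended already for File 1 style rows. It also raises an error when a row matches none of the listed formats, so an unexpected new format is reported rather than silently misclassified.

```python
import pandas as pd


def parse_with_fallbacks(text, formats):
    """Parse a Series of strings, trying each strptime format in order.

    Hypothetical helper: each later format only fills rows that are still NaT.
    """
    result = pd.to_datetime(text, format=formats[0], errors="coerce")
    for fmt in formats[1:]:
        # Rows already parsed are kept; only NaT rows take the new value.
        result = result.fillna(pd.to_datetime(text, format=fmt, errors="coerce"))
    unparsed = text[result.isna()]
    if not unparsed.empty:
        # Fail loudly instead of guessing when no format matched.
        raise ValueError(f"no known format matched: {unparsed.tolist()}")
    return result


# Assumed sample values mixing the File 1 style (year prepended) and File 2 style.
samples = pd.Series(["2020年4月1日 8:40:00", "2020/10/1 8:55", "2020/10/1 8:56:30"])
formats = ["%Y年%m月%d日 %H:%M:%S", "%Y/%m/%d %H:%M:%S", "%Y/%m/%d %H:%M"]
parsed = parse_with_fallbacks(samples, formats)
print(parsed)
```

Listing the format with seconds before the one without keeps the order of attempts explicit, and supporting a new notation later only means appending one more entry to `formats`.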