I want to achieve

Fix the logically invalid timestamp values (*1) in bulk and load the data into the DB.

Current status

I have a CSV file with 500,000 rows and 4 columns that I want to load into a DB (mysql).
The load fails with an error because some of the timestamp values are logically invalid (*1).
(Note that the timestamp column is the primary key, so duplicate values are not allowed.)

Error message
ERROR: date/time field value out of range: "1990-12-00 84:12:20.4+0900"
HINT: Perhaps you need a different "datestyle" setting.

After checking, it seems that this error occurs in mysql10 or higher whenever a datetime value is logically invalid (*1).

Therefore, I would like to cleanse the data (*2).
I could not come up with a way to do it, so I am asking here.
If you have any questions or need more information, please ask. Thank you.

*1
For example, a value such as [2000-00-00T28:47:20.4+900], where the hour exceeds 24 or the month or day is 0.
*2
For now it is enough that the data can be loaded into the DB, so please read "data cleansing" here as "replacing each logically invalid time with a random value".

For reference, I will describe what I tried below.

What I tried

Bulk replacement with valid values using Python's pandas module
  1. Load the CSV data into a DataFrame using pandas.
  2. Find the invalid timestamp values with a regular expression and replace them all at once with the replace function.
    【result】
    The replacement itself worked, but every invalid value was replaced with the same value, so the timestamp column contained duplicates and the DB raised an error.
Assigning random, valid values using Python's pandas module
  1. Load the CSV data into a DataFrame (= df1) using pandas.
  2. Find the invalid timestamp values with a regular expression and replace them all at once with a placeholder string such as "temp_time".
  3. Use the pandas contains function to extract the rows containing "temp_time".
  4. Create a DataFrame (= df_tmp) with as many rows as were extracted, and fill it with random timestamp values.
  5. Assign with "df1 = df_tmp".
    【result】
    Python runs without any error, but the values of df_tmp are not assigned, and the timestamp column of df1 still contains temp_time.
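
One possible reason step 5 has no visible effect: assignments between pandas objects align on index labels, and df_tmp is indexed 0..n-1 while the placeholder rows in df1 sit at other labels. The following is a minimal sketch of writing the values in place with a boolean mask instead. The column name "ts" and the sample rows are assumptions made for illustration, and the replacement timestamps are sequential here only to keep the example short.

```python
import pandas as pd

# Hypothetical data: the column name "ts" and these rows are assumptions.
df1 = pd.DataFrame(
    {
        "ts": ["temp_time", "1990-12-02 12:12:20.4+0900", "temp_time"],
        "value": [1, 2, 3],
    }
)

# Boolean mask of the rows that still hold the placeholder.
mask = df1["ts"].str.contains("temp_time")

# One replacement timestamp per masked row.
df_tmp = pd.DataFrame(
    {
        "ts": pd.date_range(
            start="1970-01-01",
            periods=int(mask.sum()),
            freq=pd.Timedelta(hours=1),
        )
    }
)

# Write into df1 in place. Passing a plain list sidesteps index alignment:
# df_tmp is labelled 0..n-1, which does not match the masked rows of df1.
df1.loc[mask, "ts"] = df_tmp["ts"].to_list()

# Count the rows that still contain the placeholder.
remaining = int(df1["ts"].astype(str).str.contains("temp_time").sum())
```

After this, `remaining` should be 0 and the untouched row keeps its original string.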


  • Answer #1

    Please understand that data cleansing here is "replacement of logically incorrect time with a random value".

    If you want to replace invalid dates with random, non-duplicate dates, you can do the following:

    1. Call pd.to_datetime(df[date column name], errors="coerce") to turn the date in each invalid row into NaT (Not a Time).
    2. Create the values to fill in with pd.date_range(), starting from a value that is unlikely to appear in the data, such as 1970.
    3. Assign the created dates to the rows where the value is NaT.

    import pandas as pd

    df = pd.DataFrame(
        {
            "Date": [
                "1990-12-01 84:12:20.4+0900",
                "1990-12-02 12:12:20.4+0900",
                "1990-12-03 13:12:20.4+0900",
                "1991-12-01 84:12:20.4+0900",
                "2001-12-01 84:12:20.4+0900",
            ],
            "Value": [1, 2, 3, 4, 5],
        }
    )
    # Flag the invalid rows: dates that cannot be parsed become NaT.
    is_invalid = pd.to_datetime(df["Date"], errors="coerce").isna()
    # Create non-overlapping dates, one per invalid row, one hour apart.
    # A Timedelta step is used instead of the "H" alias, which newer pandas versions deprecate.
    fill_values = pd.date_range(
        start="1970-01-01", periods=int(is_invalid.sum()), freq=pd.Timedelta(hours=1)
    )
    # Assign those values to the invalid rows.
    df.loc[is_invalid, "Date"] = fill_values.to_list()
    print(df)
    assert not df["Date"].duplicated().any()  # Check for duplicates

                             Date  Value
    0         1970-01-01 00:00:00      1
    1  1990-12-02 12:12:20.4+0900      2
    2  1990-12-03 13:12:20.4+0900      3
    3         1970-01-01 01:00:00      4
    4         1970-01-01 02:00:00      5
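
The fill values above are evenly spaced rather than random. If random values are really wanted, one option is to draw random whole seconds from a range and reject anything already used. The sketch below is only an illustration under stated assumptions: the helper name `random_unique_timestamps`, the 1970 to 1980 range, and the `seed` parameter are all hypothetical and do not come from the question or the answer.

```python
import numpy as np
import pandas as pd


def random_unique_timestamps(n, taken, start="1970-01-01", end="1980-01-01", seed=None):
    """Return n distinct random timestamps (whole seconds) in [start, end)
    that do not appear in `taken`. Hypothetical helper for illustration."""
    rng = np.random.default_rng(seed)
    lo = pd.Timestamp(start).value // 10**9  # seconds since the epoch
    hi = pd.Timestamp(end).value // 10**9
    if n > hi - lo:
        raise ValueError("range too small for the requested number of values")
    seen = set(taken)
    picked = []
    while len(picked) < n:
        # Draw as many candidates as are still missing; skip used ones.
        for sec in rng.integers(lo, hi, size=n - len(picked)):
            ts = pd.Timestamp(int(sec), unit="s")
            if ts not in seen:
                seen.add(ts)
                picked.append(ts)
    return picked


df = pd.DataFrame(
    {
        "Date": [
            "1990-12-01 84:12:20.4+0900",
            "1990-12-02 12:12:20.4+0900",
            "2001-12-01 84:12:20.4+0900",
        ],
        "Value": [1, 2, 3],
    }
)
parsed = pd.to_datetime(df["Date"], errors="coerce")
is_invalid = parsed.isna()
# Fill only the invalid rows; valid rows keep their original strings.
df.loc[is_invalid, "Date"] = random_unique_timestamps(
    int(is_invalid.sum()), taken=parsed.dropna(), seed=0
)
```

Rejection sampling keeps the example simple; it stays fast as long as the range holds far more seconds than the number of values requested, which is the case for 500,000 rows in a ten-year range.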