
I want to remove duplicates from a text file, comparing only the part of each line that comes before the comma.
In the text file handled here, each line is a string of the form **,********, and there is always a comma. The conditions are as follows.

  1. Regardless of whether there are duplicates, the string before the comma is not written when exporting to another file.
  2. If the string before the comma is a duplicate, the entire line is not written to the other file.

I managed to write the code below, which skips lines that duplicate an entire earlier line, but I could not write code that satisfies the conditions above. I would appreciate any answers.

lines_seen = set()
outfile = open("******.txt", "w")
for line in open("*****.txt", "r"):
    if line not in lines_seen:
        outfile.write(line)
        lines_seen.add(line)
outfile.close()
  • Answer #1

    The code below assumes that the input file is encoded in UTF-8.

    lines_seen = set()
    outfile = open("out.txt", "w")
    for line in open("in.txt", "r", encoding="utf-8"):
        try:
            # The question says a line "always has a comma", which does not guarantee exactly one, so split only at the first comma.
            key, data = line.split(',', 1)
        except ValueError:
            print("--- Read a line without a comma. Skipping it and continuing. ---")
            continue
        if key not in lines_seen:
            outfile.write(data)
            lines_seen.add(key)
    outfile.close()
    print("Processing completed.")
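
Condition 2 in the question can be read in two ways: keep the first line for each key and drop later repeats (what the answer above does), or drop every line whose key appears more than once. The sketch below separates the logic from the file handling so that both readings can be compared. The function names `split_key`, `keep_first`, and `drop_all_duplicated`, and the sample data, are hypothetical and are not part of the original question or answer.

```python
from collections import Counter


def split_key(line):
    """Return (key, data) split at the first comma, or None if there is no comma."""
    key, sep, data = line.partition(",")
    return (key, data) if sep else None


def keep_first(lines):
    """Reading A: keep the data of the first line seen for each key."""
    seen = set()
    kept = []
    for line in lines:
        parts = split_key(line)
        if parts is None:
            continue  # no comma: skip the line, as the answer above does
        key, data = parts
        if key not in seen:
            seen.add(key)
            kept.append(data)
    return kept


def drop_all_duplicated(lines):
    """Reading B: drop every line whose key occurs more than once."""
    pairs = [p for p in map(split_key, lines) if p is not None]
    counts = Counter(key for key, _ in pairs)
    return [data for key, data in pairs if counts[key] == 1]


# Hypothetical sample data, for illustration only.
sample = ["a,1\n", "b,2\n", "a,3\n", "no comma here\n", "c,4,5\n"]
print(keep_first(sample))           # ['1\n', '2\n', '4,5\n']
print(drop_all_duplicated(sample))  # ['2\n', '4,5\n']
```

To use either function with files, pass an open file object as `lines` and write the result with `outfile.writelines(...)`. Reading B needs the key counts before it can decide anything, so it holds all lines in memory, which may not suit very large files.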