I want to remove duplicate words from each row of a CSV file with a Linux command.
My idea is to loop over the lines, split each one on commas with something like cut -d, and delete any word that appears more than once in that line.
Specifically, I want to convert input like this:

kingyo,panda,pig,pig
neko,inu,sakana,penguin
sea,see,sea,mountain
taro,taro,taro1,taro2
kanji,hiragana,katakana,eigo,kanji

into this:

kingyo,panda,pig
neko,inu,sakana,penguin
sea,see,mountain
taro,taro1,taro2
kanji,hiragana,katakana,eigo

Are there any solutions?

[Addendum]

Thank you all for your answers.

In addition, I need to delete the numbers attached to the end of each word, so I used datamash transpose as shown below to process one line at a time.
The original file is given as $1 and the destination file name as $2.

while read -r row; do
    echo "$row" |
    sed -e 's/,/\t/g' |            # comma => tab
    datamash transpose |           # transpose
    # Remove 1 or 2 digits at the end of line
    sed -e 's/[0-9]*$//' |
    sed -e 's/[0-9][0-9]*$//' |
    sort -u |                      # remove duplicate lines
    tr '\n' ',' |                  # turn the newlines back into commas
    sed 's/^,//g' >> "$2"          # remove comma at beginning of line
done < "$1"

I would be grateful if you could give me your opinion if there is something wrong with this method.

P.S. This data was produced by using cut to extract the fifth and later columns of the original data. After processing, I need to merge it back into the original file, but when I merge with the paste command, several rows and columns end up crammed into a single cell in a number of places. So instead of cutting columns out of the original data, I would like to apply sed (or a similar command) only to the fields from a given column onward, while leaving the rest of the original data untouched. Is there a good way to do this?

  • Answer # 1

    My idea is the same as take88's answer, but

    if you don't declare the working hash with my, it persists across lines, so a word that reappears on a later line gets dropped by mistake.

    Also, using the -F, and -a options

    gives a simpler one-liner.

    So:

    perl -F, -anle 'my %x; print join(",", grep { !$x{$_}++ } @F)' file.csv

    Regarding the additional question

    I think transposing the whole matrix is overkill when all you want is to strip the trailing digits from each field.
    With sed, you can use s/[0-9]*,/,/g;s/[0-9]*$//;.
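As a quick check of those two substitutions, the sketch below wraps them in a shell function (strip_digits is a hypothetical name) and feeds it one of the sample rows. The first expression removes digits that sit directly before a comma, and the second removes digits at the end of the line.

```shell
# Strip trailing digits from every comma-separated field on each line.
strip_digits() {
  sed -e 's/[0-9]*,/,/g' -e 's/[0-9]*$//'
}

printf 'taro,taro,taro1,taro2\n' | strip_digits   # prints "taro,taro,taro,taro"
```

Note that a field consisting only of digits would be emptied by this, so it is best applied only to the word columns.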

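    Here is a script that also handles the additional requirement (process only the fifth and later columns and keep the rest as they are). As expected, it got unwieldy as a one-liner, so I made it a script file.

    $ cat coluniq.pl
    while (<>) {
      chomp;
      my %d;                            # unique words of this row only
      my @F = split(/,/, $_);
      my ($from, $to) = (4, $#F);       # process from the 5th field to the last
      foreach my $x (@F[$from .. $to]) {
        $x =~ s/[0-9]+$//;              # strip trailing digits
        $d{$x} = 0;
      }
      print join(",", @F[0 .. ($from-1)], keys %d), "\n";
    }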
    $ cat in.csv
    1,1,1,1,kingyo,panda,pig,pig
    1,1,1,1,neko,inu,sakana,penguin
    1,1,1,1,sea,see,sea,mountain
    1,1,1,1,taro,taro,taro1,taro2
    $ perl ~/work/coluniq.pl in.csv
    1,1,1,1,panda,kingyo,pig
    1,1,1,1,inu,penguin,sakana,neko
    1,1,1,1,see,mountain,sea

  • Answer # 2

    I tried with Perl's one-liner.

    $ cat file.csv
    kingyo,panda,pig,pig
    neko,inu,sakana,penguin
    sea,see,sea,mountain
    taro,taro,taro1,taro2
    kanji,hiragana,katakana,eigo,kanji
    $ perl -nle 'print join ",", grep { !$buf{$_}++ } split ",", $_;' file.csv
    kingyo,panda,pig
    neko,inu,sakana,penguin
    sea,see,mountain
    taro,taro1,taro2
    kanji,hiragana,katakana,eigo

  • Answer # 3

    I started writing this with awk, but it looked like it would get rather long, so I gave up on that and used sed.

    cat<


    Since the same word may appear more than twice, loop with the t command until no duplicate remains.
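The answer's own command did not survive in this copy, so the following is only a sketch of the technique it describes, not the original code. It assumes a sed that supports -E (GNU and BSD sed both do); the function name dedupe_sed and the label again are made up. Each pass removes one later copy of a repeated field, and t jumps back to the label for as long as a substitution succeeded.

```shell
# Remove repeated comma-separated fields on each line, keeping the first copy.
dedupe_sed() {
  sed -E \
    -e 's/.*/,&,/' \
    -e ':again' \
    -e 's/,([^,]+),(.*,)?\1,/,\1,\2/' \
    -e 't again' \
    -e 's/^,//' \
    -e 's/,$//'
}

printf 'sea,see,sea,mountain\n' | dedupe_sed   # prints "sea,see,mountain"
```

The line is first wrapped in commas so that every field is delimited on both sides, which keeps taro from matching inside taro1. The two final substitutions remove those helper commas again.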

  • Answer # 4

    For a change of pace, I tried doing it in Perl using nothing but a regular expression.

    $ perl -ple 's/(,?)([^,]+)(?{$` !~ $2 ? $1.$2 : ""})/$^R/g' file.csv

  • Answer # 5

    count=1
    while read -r row; do
      echo "Executing commands in $count th row ..."
      echo "..."
      echo "..."
      count=$((count + 1))
        echo "$row" |
        cut -d, -f 5- |                # column 5 onward
        nkf -X --overwrite |           # change half-width kana to full-width kana
        sed -e 's/,/\t/g' |            # comma => tab
        datamash transpose |           # transpose
        sort -u |                      # remove duplicate lines
        tr '\n' ',' |                  # turn the newlines back into commas
        sed 's/^"//g' |                # delete quotation mark at the beginning of the line
        sed 's/^,//g' >> "$2"          # remove comma at beginning of line
      echo >> "$2"
    done < "$1"