How to separate the fields in a column (into their own named columns) and then remove the repeated words from each entry

July 8, 2021

I have a TSV file with 9 variables, as follows:
> seqnames  start   endwidth    strand  metadata    X.10logMacsq    annotation  distanceToTSS
The metadata column contains information I'd like to analyse, but first I need to split its entries and put each one into its own column (with a header). The metadata looks like this (first row):
ID=SRX067411;Name=H3K27ac%20(@%20HMEC);Title=GSM733660:%20Bernstein%20HMEC%20H3K27ac;Cell%20group=Breast;<br>source_name=HMEC;biomaterial_provider=Lonza;lab=Broad;lab%20description=Bernstein%20-%20Broad%20Institute;datatype=ChipSeq;datatype%20description=Chromatin%20IP%20Sequencing;cell%20organism=human;cell%20description=mammary%20epithelial%20cells;cell%20karyotype=normal;cell%20lineage=ectoderm;cell%20sex=U;antibody%20antibodydescription=rabbit%20polyclonal.%20Antibody%20Target:%20H3K27ac;
This column has 27 entries in total per row (not all are shown here). My plan is to write them all into their own columns first and then delete the ones I don't need. Once each one has a descriptive column header, I can also strip the key names from the values (e.g. ID=SRX would become just SRX, and so on).
Sample input file (first row):
seqnames    start   end width   strand  metadata    X.10logMacsq    annotation  geneChr geneStart   geneEnd geneLength  geneStrand  geneId  distanceToTSS
chr2    1711333 1711568 236 *   ID=SRX067411;Name=H3K27ac%20(@%20HMEC);Title=GSM733660:%20Bernstein%20HMEC%20H3K27ac;Cell%20group=Breast;<br>source_name=HMEC;biomaterial_provider=Lonza;lab=Broad;lab%20description=Bernstein%20-%20Broad%20Institute;datatype=ChipSeq;datatype%20description=Chromatin%20IP%20Sequencing; 447 Intron (uc002qxa.3/7837, intron 1 of 22)    1   1635659 1748291 112633  2   7837    36723
Can anyone help me with this or offer some suggestions? I'm still quite new to Bash and not yet familiar with the commands.
So far I've only managed to clean up the file with:
cut --complement -f 9-14 hisHMECanno.tsv | sed 's/%20/ /g' > hisHMECannoFilt.tsv
(The original file had some unnecessary columns, which I have just removed.)
I then tried using awk to split the entries into tab-separated columns, but to no avail.

The following perl script uses the Text::CSV module to read the TSV file and output correctly formatted TSV data.

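It automatically quotes fields where necessary, and it uses Text::CSV's undef_str setting to output undefined metadata fields as a quoted empty string "" (with commented-out examples showing how to print them as N/A or -- instead).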

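At most one of these 3 lines should be uncommented; the others should be deleted or commented out. If you just want those fields to be empty, delete or comment out all three lines.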

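I recommend putting something in those undefined fields, because it makes it easier to post-process this script's output with other tools that may treat two or more consecutive tabs (i.e. an empty field) as a single tab. For example, both awk and perl do this by default unless you tell them not to by explicitly setting the field separator to a single tab instead of the default "any amount of whitespace".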
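To make the point about empty fields concrete, here is a small sketch (the three-field sample line is made up for illustration) comparing awk's default field splitting with an explicit single-tab separator:

```shell
#!/bin/sh
# A made-up three-field TSV line whose middle field is empty.
line=$(printf 'chr2\t\t447')

# Default splitting: any run of blanks and tabs counts as one separator,
# so the empty middle field silently disappears.
default_count=$(printf '%s\n' "$line" | awk '{ print NF }')

# Explicit single-tab separator: the empty field is kept.
tab_count=$(printf '%s\n' "$line" | awk -F'\t' '{ print NF }')

echo "default: $default_count, tab-separated: $tab_count"
```

With the default separator awk reports 2 fields; with `-F'\t'` it reports 3.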

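Text::CSV is packaged for Debian and related distributions as libtext-csv-perl (the pure-perl version) and libtext-csv-xs-perl (the faster, compiled C module). Install it with apt install libtext-csv-perl. Other distributions probably package it too. Otherwise, use cpan.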
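Before installing anything, it may be worth checking whether the module is already available. A minimal sketch, assuming perl is on your PATH (the `status` variable is only there for illustration):

```shell
#!/bin/sh
# Ask perl to load the module and do nothing else. perl exits non-zero
# if Text::CSV cannot be found (this branch is also taken if perl
# itself is not installed).
if perl -MText::CSV -e 1 2>/dev/null; then
    status="installed"
else
    status="missing"
fi
echo "Text::CSV is $status"
```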

#!/usr/bin/perl

use strict;
use warnings;
use Text::CSV qw(csv);

my $csv=Text::CSV->new({sep_char => "\t", quote_space => 0});

# optional: define how to print undefined fields
#$csv->undef_str ('--');
#$csv->undef_str ('N/A');
$csv->undef_str ('""');

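# get header line, split into an arrayref called $cols
my $cols = $csv->getline(*ARGV);

# get first data row, extract headers & data from metadata field
my $row = $csv->getline(*ARGV);

# NOTE: index 4 below (and in every later $$row[4] and splice call)
# assumes metadata is the fifth column, as in the sample output further
# down. If "end" and "width" are separate columns in your file, metadata
# is the sixth column, so use index 5 instead.
#
# The following line assumes that the metadata in the FIRST data row
# contains ALL of the metadata fields in the exact order you want them
# included in the output.
#
my $md_headers = extract_metadata_headers($$row[4]);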
#
# If this is not the case, then delete the extract_metadata_headers
# subroutine and define the metadata fields manually with something
# like:
#
#my $md_headers = [
#  'ID', 'Name', 'Title', 'Cell group', 'source_name',
#  'biomaterial_provider', 'lab', 'lab description', 'datatype',
#  'datatype description', 'cell organism', 'cell description',
#  'cell karyotype', 'cell lineage', 'cell sex',
#  'antibody antibodydescription'
#];
# This defines both the extra metadata headers **and** the order
# that they will be included in each output row.

# extract the data from the metadata field
my $md_data = extract_metadata($$row[4]);

# replace the metadata header in $cols aref with the md headers
splice @$cols,4,1,@$md_headers;

# replace the metadata field in $row aref with the md fields
splice @$row,4,1,@$md_data;

# print the updated header line and the first row of data
$csv->say(*STDOUT,$cols);
$csv->say(*STDOUT,$row);

# main loop: extract and print the rest of the data
while (my $row = $csv->getline(*ARGV)) {
 my $md_data = extract_metadata($$row[4]);
 splice @$row,4,1,@$md_data;

 $csv->say(*STDOUT,$row);
}

###
### subroutines
###

# return an arrayref of the metadata keys, in the order they appear
sub extract_metadata_headers {
 my $md = clean_metadata(shift);
 my @metadata = split /;/, $md;
 my @headers=();

 foreach (@metadata) {
   next if m/^\s*$/; # skip empty metadata
   # limit of 2: split on the first '=' only, so values may contain '='
   my ($key,$val) = split /=/, $_, 2;
   push @headers, $key;
 };

 return \@headers;
};

# return an arrayref of the metadata values, ordered as in $md_headers;
# keys missing from this row come back as undef
sub extract_metadata {
 my $md = clean_metadata(shift);
 my @metadata = split /;/, $md;
 my %data=();

 foreach (@metadata) {
   next if m/^\s*$/; # skip empty metadata
   my ($key,$val) = split /=/, $_, 2;
   $data{$key} = $val;
 };

 return [@data{@$md_headers}];
};

sub clean_metadata {
   my $md = shift;
   # decode %-encoded spaces etc. (any two hex digits, e.g. %20 or %2C)
   $md =~ s/%([0-9A-Fa-f]{2})/chr hex $1/eg;
   $md =~ s/<[^>]*>//g;            # remove HTML crap like <br>
   return $md;
};

Save the script as, e.g., process-tsv.pl, make it executable with chmod +x process-tsv.pl, and give it a filename argument when you run it. For example:

$ ./process-tsv.pl filename.tsv

It will produce output like this on standard output:

$ ./process-tsv.pl input.tsv
seqnames        start   endwidth        strand  ID      Name    Title   Cell group      source_name     biomaterial_provider    lab     lab description datatype        datatype description    cell organism   cell description      cell karyotype   cell lineage    cell sex        antibody antibodydescription    X.10logMacsq    annotation      distanceToTSS
seq1    1       10      X       SRX067411       H3K27ac (@ HMEC)        GSM733660: Bernstein HMEC H3K27ac       Breast  HMEC    Lonza   Broad   Bernstein - Broad Institute     ChipSeq Chromatin IP Sequencing human   mammary epithelial cells       normal  ectoderm        U       rabbit polyclonal. Antibody Target: H3K27ac     x10     annot   dist
seq2    2       20      Y       SRX067411       H3K27ac (@ HMEC)        GSM733660: Bernstein HMEC H3K27ac       ""      ""      Lonza   Broad   Bernstein - Broad Institute     ChipSeq Chromatin IP Sequencing human   mammary epithelial cells       normal  ectoderm        U       ""      Y10     annot2  dist2

Of course, you can redirect the output to a file in the shell:

./process-tsv.pl input.tsv > output.tsv
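Since the question mentions deleting unwanted columns once everything has been split out, one way to post-process the output is to select columns by header name. The following is only a sketch: `keep_columns` is a hypothetical helper name, the sample rows are made up, and it assumes plain tab-separated input with no tabs or newlines embedded inside fields.

```shell
#!/bin/sh
# keep_columns NAME[,NAME...]: read TSV on stdin and print only the
# named columns, in the order requested. Hypothetical helper, not part
# of the perl script above.
keep_columns() {
    awk -F'\t' -v OFS='\t' -v wanted="$1" '
        NR == 1 {
            n = split(wanted, names, ",")
            # Map each header name to its column position.
            for (i = 1; i <= NF; i++) pos[$i] = i
            for (j = 1; j <= n; j++) {
                if (!(names[j] in pos)) {
                    printf "no such column: %s\n", names[j] > "/dev/stderr"
                    exit 1
                }
                idx[j] = pos[names[j]]
            }
        }
        {
            # Rebuild the line from the selected columns only.
            line = $(idx[1])
            for (j = 2; j <= n; j++) line = line OFS $(idx[j])
            print line
        }
    '
}

# Made-up two-line sample standing in for the real output.
printf 'seqnames\tID\tlab\tCell group\nchr2\tSRX067411\tBroad\tBreast\n' |
    keep_columns 'seqnames,Cell group,ID'
```

With the real data this could be run as `./process-tsv.pl input.tsv | keep_columns 'seqnames,start,ID,lab'`. Setting the field separator to a single tab keeps empty fields in place, and placeholders such as `""` pass through unchanged as ordinary text.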

Source: https://unix.stackexchange.com/questions/657235
