How to split the fields of a column into their own named columns, then remove the repeated words from each entry
I have a tsv file with 9 variables, like this:
> seqnames start endwidth strand metadata X.10logMacsq annotation distanceToTSS
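The metadata column contains information I want to analyze, but first I need to split its entries and put them into their own columns (with headers). The metadata looks like this (first row):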
ID=SRX067411;Name=H3K27ac%20(@%20HMEC);Title=GSM733660:%20Bernstein%20HMEC%20H3K27ac;Cell%20group=Breast;<br>source_name=HMEC;biomaterial_provider=Lonza;lab=Broad;lab%20description=Bernstein%20-%20Broad%20Institute;datatype=ChipSeq;datatype%20description=Chromatin%20IP%20Sequencing;cell%20organism=human;cell%20description=mammary%20epithelial%20cells;cell%20karyotype=normal;cell%20lineage=ectoderm;cell%20sex=U;antibody%20antibodydescription=rabbit%20polyclonal.%20Antibody%20Target:%20H3K27ac;
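This column has 27 entries in total per row (not all shown here), but I thought I would first write them all into their own columns and then delete the ones I don't need. Once each one has a descriptive column header, I can also strip the key names from the values (e.g. ID=SRX would become just SRX, and so on).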
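To make the goal concrete, here is a minimal sketch of the intended transformation for a single metadata string: split on `;`, then split each `key=value` pair on the first `=`. The shortened `md` value is taken from the example above; the variable names and the choice of `sed`, `tr`, and `awk` are assumptions made for illustration only.

```shell
# Shortened metadata string from the example above (assumed input for this sketch).
md='ID=SRX067411;Name=H3K27ac%20(@%20HMEC);Cell%20group=Breast;'

# Decode %20, put each key=value pair on its own line, then print
# "key<TAB>value", splitting only on the first "=".
pairs=$(printf '%s\n' "$md" |
  sed 's/%20/ /g' |
  tr ';' '\n' |
  awk 'NF { i = index($0, "="); printf "%s\t%s\n", substr($0, 1, i - 1), substr($0, i + 1) }')

printf '%s\n' "$pairs"
```

Each key would become a column header and each value the content of that column for the row.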
Sample file input (header and first row):
seqnames start end width strand metadata X.10logMacsq annotation geneChr geneStart geneEnd geneLength geneStrand geneId distanceToTSS
chr2 1711333 1711568 236 * ID=SRX067411;Name=H3K27ac%20(@%20HMEC);Title=GSM733660:%20Bernstein%20HMEC%20H3K27ac;Cell%20group=Breast;<br>source_name=HMEC;biomaterial_provider=Lonza;lab=Broad;lab%20description=Bernstein%20-%20Broad%20Institute;datatype=ChipSeq;datatype%20description=Chromatin%20IP%20Sequencing; 447 Intron (uc002qxa.3/7837, intron 1 of 22) 1 1635659 1748291 112633 2 7837 36723
Can anyone help me with this or offer some advice? I'm quite new to Bash and not yet familiar with the commands.
So far I have only managed to clean up the file with:
cut --complement -f 9-14 hisHMECanno.tsv | sed 's/%20/ /g' > hisHMECannoFilt.tsv
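(The original file had some unnecessary columns, which I just removed.)
Then I tried to use awk to separate the entries into tab-separated columns, but to no avail.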
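The following perl script uses the Text::CSV module to read the TSV file and output correctly formatted TSV data.
It automatically quotes fields where needed, and uses Text::CSV's undef_str setting to output undefined metadata fields as a quoted empty string "" (with commented-out examples showing how to print them as N/A or -- instead). At most one of these three lines should be uncommented; the others should be deleted or commented out. If you just want those fields to be empty, delete or comment out all three lines.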
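I recommend putting something in those undefined fields, because it makes it easier to post-process this script's output with other tools that may treat two or more consecutive tabs (i.e. an empty field) as a single tab. Both awk and perl do this by default, unless you tell them not to by explicitly setting the field separator to a single tab instead of the default "any amount of whitespace".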
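The difference is easy to see with a line that contains an empty field. The sample line below is made up for illustration; only standard `awk` field-splitting behaviour is assumed.

```shell
# A made-up three-field line whose middle field is empty: a<TAB><TAB>c
line=$(printf 'a\t\tc')

# Default splitting: a run of whitespace counts as one separator,
# so the empty field disappears.
default_nf=$(printf '%s\n' "$line" | awk '{ print NF }')

# Explicit single-tab separator: the empty field is preserved.
tab_nf=$(printf '%s\n' "$line" | awk -F'\t' '{ print NF }')

echo "default: $default_nf fields, tab-separated: $tab_nf fields"
```

With a placeholder such as `N/A` in the middle field, both invocations report three fields, which is why filling undefined fields makes the output more forgiving.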
Text::CSV is packaged for debian and related distros as libtext-csv-perl (the pure-perl version) and libtext-csv-xs-perl (the faster compiled C module). Install it with apt install libtext-csv-perl. Other distros probably package it too. Otherwise, use cpan.

#!/usr/bin/perl

use strict;
use Text::CSV qw(csv);

my $csv = Text::CSV->new({sep_char => "\t", quote_space => 0});

# optional: define how to print undefined fields
#$csv->undef_str ('--');
#$csv->undef_str ('N/A');
$csv->undef_str ('""');

# 0-based index of the metadata column. 4 matches the header quoted at
# the top of the question; use 5 if "end" and "width" are separate
# columns, as they are in the sample input row.
my $md_col = 4;

# get header line, split into an arrayref called $cols
my $cols = $csv->getline(*ARGV);

# get first data row, extract headers & data from metadata field
my $row = $csv->getline(*ARGV);

# The following line assumes that the metadata in the FIRST data row
# contains ALL of the metadata fields in the exact order you want them
# included in the output.
#
my $md_headers = extract_metadata_headers($$row[$md_col]);
#
# If this is not the case, then delete the extract_metadata_headers
# subroutine and define the metadata fields manually with something
# like:
#
#my $md_headers = [
#  'ID', 'Name', 'Title', 'Cell group', 'source_name',
#  'biomaterial_provider', 'lab', 'lab description', 'datatype',
#  'datatype description', 'cell organism', 'cell description',
#  'cell karyotype', 'cell lineage', 'cell sex',
#  'antibody antibodydescription'
#];

# This defines both the extra metadata headers **and** the order
# that they will be included in each output row.

# extract the data from the metadata field
my $md_data = extract_metadata($$row[$md_col]);

# replace the metadata header in $cols aref with the md headers
splice @$cols, $md_col, 1, @$md_headers;

# replace the metadata field in $row aref with the md fields
splice @$row, $md_col, 1, @$md_data;

# print the updated header line and the first row of data
$csv->say(*STDOUT, $cols);
$csv->say(*STDOUT, $row);

# main loop: extract and print the rest of the data
while (my $row = $csv->getline(*ARGV)) {
  my $md_data = extract_metadata($$row[$md_col]);
  splice @$row, $md_col, 1, @$md_data;
  $csv->say(*STDOUT, $row);
}

###
### subroutines
###

sub extract_metadata_headers {
  my $md = clean_metadata(shift);
  my @metadata = split /;/, $md;
  my @headers = ();

  foreach (@metadata) {
    next if m/^\s*$/;   # skip empty metadata
    # limit of 2: split on the first "=" only, so values may contain "="
    my ($key, $val) = split /=/, $_, 2;
    push @headers, $key;
  }
  return \@headers;
}

sub extract_metadata {
  my $md = clean_metadata(shift);
  my @metadata = split /;/, $md;
  my %data = ();

  foreach (@metadata) {
    next if m/^\s*$/;   # skip empty metadata
    my ($key, $val) = split /=/, $_, 2;
    $data{$key} = $val;
  }
  # hash slice: values in header order, undef for missing keys
  return [@data{@$md_headers}];
}

sub clean_metadata {
  my $md = shift;
  # decode %-encoded characters such as %20 (any two hex digits)
  $md =~ s/%([0-9A-Fa-f]{2})/chr hex $1/eg;
  $md =~ s/<[^>]*>//g;  # remove HTML crap like <br>
  return $md;
}
Save it as, e.g., process-tsv.pl, make it executable with chmod +x process-tsv.pl, and give it a filename argument when you run it. For example:
$ ./process-tsv.pl filename.tsv
It will write output like this to standard output:
$ ./process-tsv.pl input.tsv
seqnames start endwidth strand ID Name Title Cell group source_name biomaterial_provider lab lab description datatype datatype description cell organism cell description cell karyotype cell lineage cell sex antibody antibodydescription X.10logMacsq annotation distanceToTSS
seq1 1 10 X SRX067411 H3K27ac (@ HMEC) GSM733660: Bernstein HMEC H3K27ac Breast HMEC Lonza Broad Bernstein - Broad Institute ChipSeq Chromatin IP Sequencing human mammary epithelial cells normal ectoderm U rabbit polyclonal. Antibody Target: H3K27ac x10 annot dist
seq2 2 20 Y SRX067411 H3K27ac (@ HMEC) GSM733660: Bernstein HMEC H3K27ac "" "" Lonza Broad Bernstein - Broad Institute ChipSeq Chromatin IP Sequencing human mammary epithelial cells normal ectoderm U "" Y10 annot2 dist2
Of course, you can redirect the output to a file in the shell:
./process-tsv.pl input.tsv > output.tsv
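Once the metadata fields have their own headers, unwanted columns can be dropped by name in a later step. The following is a minimal sketch rather than part of the script above: the small temporary file stands in for the script's output and contains invented sample data, and the `keep` list of column names is an assumption. It also assumes every name in `keep` exists in the header.

```shell
# Invented sample standing in for the script's output (tab-separated).
tsv=$(mktemp)
printf 'seqnames\tID\tlab\tdatatype\n'      >  "$tsv"
printf 'chr2\tSRX067411\tBroad\tChipSeq\n' >> "$tsv"

# Keep only the named columns, in the order given in "keep".
# FS and OFS are set to a single tab so empty fields are not collapsed.
result=$(awk -F'\t' -v OFS='\t' -v keep='seqnames,ID,datatype' '
  NR == 1 {
    n = split(keep, want, ",")
    for (i = 1; i <= NF; i++) pos[$i] = i   # header name to column number
  }
  {
    out = ""
    for (j = 1; j <= n; j++) {
      sep = (j > 1 ? OFS : "")
      out = out sep $(pos[want[j]])
    }
    print out
  }
' "$tsv")

printf '%s\n' "$result"
rm -f "$tsv"
```

Selecting by header name rather than by column number keeps the command valid even if the number of metadata fields changes between files.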