Perl
如何使用 perl 刪除 gff3 文件中的重疊區域?
我正在嘗試從 9 列 gff3 文件中刪除重疊區域。
**Input file:** scaffold591 Source gene 3322458 3376057 0.41 - . ID=g24007 scaffold591 Source transcript 3322458 3376057 0.41 - . ID=g24007.t1;Parent=g24007 scaffold591 Source transcription_end_site 3322458 3322458 . - . Parent=g24007.t1 scaffold591 Source gene 3322500 3346055 0.41 - . ID=g24007 scaffold591 Source transcript 3322500 3346055 0.41 - . ID=g24007.t1;Parent=g24007 scaffold591 Source transcription_end_site 3322500 3322500 . - . Parent=g24007.t1 scaffold591 Source gene 3377307 3513095 0.46 + . ID=g24008 scaffold591 Source transcript 3377307 3513095 0.41 + . ID=g24008.t1;Parent=g24008 scaffold591 Source transcription_end_site 3377307 3377307 . + . Parent=g24008.t1
在這裡,我試圖只比較具有相同鏈的“基因”的行,即“-”或“+”(第 7 列)。
例如第 1 行和第 4 行。
scaffold591 Source gene 3322458 3376057 0.41 - . ID=g24007 scaffold591 Source gene 3322500 3346055 0.41 - . ID=g24007
它們是來自相同支架和相同“-”鏈(第 7 列)的“基因”。第 4 行座標(第 4 列和第 5 列)位於第 1 行座標的範圍內。在這種情況下,我的程式碼應該刪除重疊的第 4 行並保留具有更大範圍的第 1 行。
**My expected output:** scaffold591 Source gene 3322458 3376057 0.41 - . ID=g24007 scaffold591 Source transcript 3322458 3376057 0.41 - . ID=g24007.t1;Parent=g24007 scaffold591 Source transcription_end_site 3322458 3322458 . - . Parent=g24007.t1 scaffold591 Source gene 3377307 3513095 0.46 + . ID=g24008 scaffold591 Source transcript 3377307 3513095 0.41 + . ID=g24007.t1;Parent=g24008 scaffold591 Source transcription_end_site 3377307 3377307 . + . Parent=g24008.t1
我的程式碼兩次列印 row1 及其以下行
**My code:** #!/usr/bin/perl use warnings; use strict; open (IN, "<scaffold_sample.txt"); #open (OUT, ">output.txt"); my $previous_seqid = ""; my $previous_strand; my $previous_start; my $previous_end; my @gff; my @tmp; my @tmp2; my @transcripts; while (<IN>) { chomp; @gff = split ("\t",$_); if ($gff[2] eq "gene") { #print "yes"."\n"; if($gff[0] eq $previous_seqid && $gff[6] eq $previous_strand) { if($gff[3] < $previous_end && $gff[4] < $previous_end) { @tmp2 = @tmp; $previous_seqid = $tmp2[0]; $previous_strand = $tmp2[6]; $previous_start = $tmp2[3]; $previous_end = $tmp2[4]; } else { @gff=@tmp; print join "\t",@gff; print "\n"; $previous_seqid = $gff[0]; $previous_strand = $gff[6]; $previous_start = $gff[3]; $previous_end = $gff[4]; } } else { @tmp = @gff; $previous_seqid = $tmp[0]; $previous_strand = $tmp[6]; $previous_start = $tmp[3]; $previous_end = $tmp[4]; } print join "\t",@tmp2; print "\n"; } else { print join "\t",@gff; print "\n"; } } close (IN);
@Jesvin 試試這個。希望它會有所幫助。
#!/usr/local/perl use strict; use warnings; use Data::Dumper; open( IN, "<gene_sorted.txt" ); open(OUT, ">genes_out.txt"); my $prev_scaffold = ""; my $prev_strand = ""; my @data; while (<IN>) { $_ =~ s/\n|\r//g; my @tmp = split( /\t/, $_ ); if ( $prev_scaffold . "#" . $prev_strand ne $tmp[0] . "#" . $tmp[6] ) { $prev_scaffold = $tmp[0]; $prev_strand = $tmp[6]; check( \@data ); undef @data; push( @data, $_ ); } else { push( @data, $_ ); } } close IN; check( \@data ); sub check { my $prev_start = 0; my $prev_end = 0; my @array = @{ $_[0] }; foreach (@array) { #print $_."\n"; ( my $scaffold, my $source, my $annotation, my $start, my $end, my $score, my $strand, my $column, my $id ) = split( /\t/, $_ ); if ( $start > $prev_start && $end > $prev_end ) { $prev_start = $start; $prev_end = $end; #print $start. " > " . $end . "\n"; print OUT $_."\n"; } } }