Perl

如何使用 perl 刪除 gff3 文件中的重疊區域?

  • November 7, 2017

我正在嘗試從 9 列 gff3 文件中刪除重疊區域。

**Input file:**
scaffold591 Source gene 3322458 3376057 0.41 - . ID=g24007
scaffold591 Source transcript 3322458 3376057 0.41 - . ID=g24007.t1;Parent=g24007
scaffold591 Source transcription_end_site 3322458 3322458 . - . Parent=g24007.t1
scaffold591 Source gene 3322500 3346055 0.41 - . ID=g24007
scaffold591 Source transcript 3322500 3346055 0.41 - . ID=g24007.t1;Parent=g24007
scaffold591 Source transcription_end_site 3322500 3322500 . - . Parent=g24007.t1
scaffold591 Source gene 3377307 3513095 0.46 + . ID=g24008
scaffold591 Source transcript 3377307 3513095 0.41 + . ID=g24008.t1;Parent=g24008
scaffold591 Source transcription_end_site 3377307 3377307 . + . Parent=g24008.t1

在這裡,我試圖只比較具有相同鏈的“基因”的行,即“-”或“+”(第 7 列)。

例如第 1 行和第 4 行。

scaffold591 Source gene 3322458 3376057 0.41 - . ID=g24007
scaffold591 Source gene 3322500 3346055 0.41 - . ID=g24007

它們是來自相同支架和相同“-”鏈(第 7 列)的“基因”。第 4 行座標(第 4 列和第 5 列)位於第 1 行座標的範圍內。在這種情況下,我的程式碼應該刪除重疊的第 4 行並保留具有更大範圍的第 1 行。

**My expected output:**
scaffold591 Source gene 3322458 3376057 0.41 - . ID=g24007
scaffold591 Source transcript 3322458 3376057 0.41 - . ID=g24007.t1;Parent=g24007
scaffold591 Source transcription_end_site 3322458 3322458 . - . Parent=g24007.t1
scaffold591 Source gene 3377307 3513095 0.46 + . ID=g24008
scaffold591 Source transcript 3377307 3513095 0.41 + . ID=g24007.t1;Parent=g24008
scaffold591 Source transcription_end_site 3377307 3377307 . + . Parent=g24008.t1

我的程式碼兩次列印 row1 及其以下行

**My code:**
#!/usr/bin/perl
use warnings;
use strict;

open (IN, "<scaffold_sample.txt");
#open (OUT, ">output.txt");
my $previous_seqid = "";
my $previous_strand;
my $previous_start;
my $previous_end;
my @gff;
my @tmp;
my @tmp2;
my @transcripts;
while (<IN>)
{ 
   chomp;
   @gff = split ("\t",$_);

  if ($gff[2] eq "gene")
   {
       #print "yes"."\n";
       if($gff[0] eq $previous_seqid && $gff[6] eq $previous_strand)
       {
           if($gff[3] < $previous_end && $gff[4] < $previous_end)
           {
               @tmp2 = @tmp;
               $previous_seqid = $tmp2[0];
               $previous_strand = $tmp2[6];
               $previous_start = $tmp2[3];
               $previous_end = $tmp2[4];
           }
           else
           {
               @gff=@tmp;
               print join "\t",@gff;
               print "\n";
               $previous_seqid = $gff[0];
               $previous_strand = $gff[6];
               $previous_start = $gff[3];
               $previous_end = $gff[4];

           }
       }
       else
       {
           @tmp = @gff;
           $previous_seqid = $tmp[0];
           $previous_strand = $tmp[6];
           $previous_start = $tmp[3];
           $previous_end = $tmp[4];
       }
       print join "\t",@tmp2;
       print "\n"; 
   }
   else
   {
       print join "\t",@gff;
       print "\n";
   }

}
close (IN);

@Jesvin 試試這個。希望它會有所幫助。

#!/usr/local/perl

use strict;
use warnings;
use Data::Dumper;

open( IN, "<gene_sorted.txt" );
open(OUT, ">genes_out.txt");
my $prev_scaffold = "";
my $prev_strand   = "";
my @data;

while (<IN>) {
$_ =~ s/\n|\r//g;

my @tmp = split( /\t/, $_ );
if ( $prev_scaffold . "#" . $prev_strand ne $tmp[0] . "#" . $tmp[6] ) {
$prev_scaffold = $tmp[0];
$prev_strand   = $tmp[6];
check( \@data );
undef @data;
push( @data, $_ );
}
else 
{
push( @data, $_ );
}

}
close IN;

check( \@data );

sub check {
my $prev_start = 0;
my $prev_end   = 0;
my @array      = @{ $_[0] };

foreach (@array) {

#print $_."\n";
(
my $scaffold,
my $source,
my $annotation,
my $start,
my $end,
my $score,
my $strand,
my $column,
my $id
) = split( /\t/, $_ );


if ( $start > $prev_start && $end > $prev_end ) {
$prev_start = $start;
$prev_end   = $end;
#print $start. " > " . $end . "\n";
print OUT $_."\n";
}   
} 
}

引用自:https://unix.stackexchange.com/questions/403075