Perl

如何使用來自另一個文件的 ID 從 gff3 文件中提取數據?

  • November 5, 2017

我有一個包含多個 ID 的文件

File 1:
g24007
g51692

和一個gff3文件如下

File2:
   # start gene g24007
   scaffold591 method  gene    3322458 3376057 0.41    -   .   ID=g24007
   scaffold591 method  transcript  3322458 3376057 0.41    -   .   ID=g24007.t1;Parent=g24007
   scaffold591 method  transcription_end_site  3322458 3322458 .   -   .   Parent=g24007.t1
   scaffold591 method  CDS 3323084 3323326 1   -   0   ID=g24007.t1.cds;Parent=g24007.t1
   # coding sequence = [atggaaaaagctaaagatggcgaagagagcccaagtgaggcatctcctccagcccaggtggggcttgaaaatatccctg
   # cgacggtgtctggggaggagggccagctgctgtatcacgaggagactatcgatcttggtggagacgagtttgggtctgaagagaatgaggaaccctca
   --
   # end gene g24007
   # start gene g20000
   scaffold591 method  gene    3322458 3376057 0.41    -   .   ID=g20000
   scaffold591 method  transcript  3322458 3376057 0.41    -   .   ID=g20000.t1;Parent=g20000
   ffold591    method  intron  3356166 3369049 1   -   .   Parent=g20000.t1
   scaffold591 method  CDS 3323084 3323326 1   -   0   ID=g20000.t1.cds;Parent=g20000.t1
   # coding sequence = [atggaaaaagctaaagatggcgaagagagcccaagtgaggcatctcctccagcccaggtggggcttgaaaatatccctg
   --
   # end gene g20000

在這裡,我試圖從 file1 映射 Id,並從 file2 中提取相應的數據,即位於“起始基因”和“結束基因”之間,同時我想從我想要的輸出中排除“編碼序列”。

Expected output:
# start gene g24007
scaffold591 method  gene    3322458 3376057 0.41    -   .   ID=g24007
scaffold591 method  transcript  3322458 3376057 0.41    -   .   ID=g24007.t1;Parent=g24007
scaffold591 method  transcription_end_site  3322458 3322458 .   -   .   Parent=g24007.t1
scaffold591 method  CDS 3323084 3323326 1   -   0   ID=g24007.t1.cds;Parent=g24007.t1
# end gene g24007

我一直在嘗試使用 perl。

My code:
use strict;
use warnings;
use Data::Dumper;

my $file1 = 'IDs.txt';
open FILE1, "<", $file1 or die $!;
my $file2 = 'gff3.txt';

open FILE2, "<", $file2 or die $!;
my %id;
my @array;

while(<FILE1>)
{
   $id{$_} = 1;
}
#print Dumper \%id;

my $gene_id = 0;
while (<FILE2>)
{
   if($_ !~ /^#/)
   {
       @array = split(/\t/,$_);
       $array[8] =~ s/ID=//g;
       if($id{$_})
       {
           print $_, @array;
       }
   }
}
close FILE1;
close FILE2;

@Hari 沒有查看您的預期輸出,我嘗試使用標準 gff3 文件。但是,我的腳本不會列印“#start gene”和“#end gene”行。希望這對你有幫助

Code:
#!/usr/local/perl

use strict;
use warnings;

my $file1 = $ARGV[0];
my $file2 = $ARGV[1];
my $output_file = $ARGV[2];

my %id;
my $ctr = 0;
open(IN, $file1);
while(<IN>)
{

   $_ =~ s/\n|\r//g;
   $ctr++;
   $id{$_} = $ctr;
}
close IN;

open(IN, $file2);
open(OUT, ">".$output_file);
while(<IN>)
{
   $_ =~ s/\n|\r//g;

   if($_ !~ /^#/)
   {
       my @tmp = split(/\t/, $_);

       if($tmp[8] =~ /ID=g(\d+)/)
       {
           my $gene_id = "g".$1;
           if(exists $id{$gene_id})
           {
               print OUT $_."\n";
           }
       }
       elsif($tmp[8] =~ /Parent=g(\d+)\.t(\d+)/)
       {
           my $gene_id = "g".$1;
           if(exists $id{$gene_id})
           {
               print OUT $_."\n";
           }
       }
   }
}
close IN;
close OUT;

引用自:https://unix.stackexchange.com/questions/402620