Perl
如何使用來自另一個文件的 ID 從 gff3 文件中提取數據?
我有一個包含多個 ID 的文件
File 1: g24007 g51692
和一個gff3文件如下
File2: # start gene g24007 scaffold591 method gene 3322458 3376057 0.41 - . ID=g24007 scaffold591 method transcript 3322458 3376057 0.41 - . ID=g24007.t1;Parent=g24007 scaffold591 method transcription_end_site 3322458 3322458 . - . Parent=g24007.t1 scaffold591 method CDS 3323084 3323326 1 - 0 ID=g24007.t1.cds;Parent=g24007.t1 # coding sequence = [atggaaaaagctaaagatggcgaagagagcccaagtgaggcatctcctccagcccaggtggggcttgaaaatatccctg # cgacggtgtctggggaggagggccagctgctgtatcacgaggagactatcgatcttggtggagacgagtttgggtctgaagagaatgaggaaccctca -- # end gene g24007 # start gene g20000 scaffold591 method gene 3322458 3376057 0.41 - . ID=g20000 scaffold591 method transcript 3322458 3376057 0.41 - . ID=g20000.t1;Parent=g20000 ffold591 method intron 3356166 3369049 1 - . Parent=g20000.t1 scaffold591 method CDS 3323084 3323326 1 - 0 ID=g20000.t1.cds;Parent=g20000.t1 # coding sequence = [atggaaaaagctaaagatggcgaagagagcccaagtgaggcatctcctccagcccaggtggggcttgaaaatatccctg -- # end gene g20000
在這裡,我試圖從 file1 映射 Id,並從 file2 中提取相應的數據,即位於“起始基因”和“結束基因”之間,同時我想從我想要的輸出中排除“編碼序列”。
Expected output: # start gene g24007 scaffold591 method gene 3322458 3376057 0.41 - . ID=g24007 scaffold591 method transcript 3322458 3376057 0.41 - . ID=g24007.t1;Parent=g24007 scaffold591 method transcription_end_site 3322458 3322458 . - . Parent=g24007.t1 scaffold591 method CDS 3323084 3323326 1 - 0 ID=g24007.t1.cds;Parent=g24007.t1 # end gene g24007
我一直在嘗試使用 perl。
My code: use strict; use warnings; use Data::Dumper; my $file1 = 'IDs.txt'; open FILE1, "<", $file1 or die $!; my $file2 = 'gff3.txt'; open FILE2, "<", $file2 or die $!; my %id; my @array; while(<FILE1>) { $id{$_} = 1; } #print Dumper \%id; my $gene_id = 0; while (<FILE2>) { if($_ !~ /^#/) { @array = split(/\t/,$_); $array[8] =~ s/ID=//g; if($id{$_}) { print $_, @array; } } } close FILE1; close FILE2;
@Hari 沒有查看您的預期輸出,我嘗試使用標準 gff3 文件。但是,我的腳本不會列印“#start gene”和“#end gene”行。希望這對你有幫助
Code: #!/usr/local/perl use strict; use warnings; my $file1 = $ARGV[0]; my $file2 = $ARGV[1]; my $output_file = $ARGV[2]; my %id; my $ctr = 0; open(IN, $file1); while(<IN>) { $_ =~ s/\n|\r//g; $ctr++; $id{$_} = $ctr; } close IN; open(IN, $file2); open(OUT, ">".$output_file); while(<IN>) { $_ =~ s/\n|\r//g; if($_ !~ /^#/) { my @tmp = split(/\t/, $_); if($tmp[8] =~ /ID=g(\d+)/) { my $gene_id = "g".$1; if(exists $id{$gene_id}) { print OUT $_."\n"; } } elsif($tmp[8] =~ /Parent=g(\d+)\.t(\d+)/) { my $gene_id = "g".$1; if(exists $id{$gene_id}) { print OUT $_."\n"; } } } } close IN; close OUT;