Shell-Script

Count occurrences of patterns from an input file in a large file

  • November 8, 2017

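I have a list of universities in a text file, and in a separate file I have a list of publications with their affiliations. I want to write a script that checks how many times a publication is repeated and counts how many times universities have collaborated. My data looks like this; "p1" is the title of the paper and "Affiliation" is the institution that published it.

Example:

Data

UID, Affiliation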

p1    "ADPRI, S"
p1    "ADPRI, S"
p2    "ADPRI, S"
p2    "AAC&S, H"
p3    "AAC&S, H"
p3    "HU, USA" 
p3    "Penn, USA"
p4    "AAC&S, H"  
p5    "AAC&S, H"  
p6    "AAC&S, H"  
p7    "AAC&S, H"  
p8    "AU, A"  
p9    "AECI, A"  
p10   "AECI, A" 
p10   "AECI, A" 

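In the data above, paper "p2" is linked to "ADPRI, S" and "AAC&S, H".

Likewise, "p3" is associated with the universities "AAC&S, H", "HU, USA" and "Penn, USA".

So my script should produce a file giving the number of collaborations between each pair of universities. For the data above, it would be

Expected output: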

College_A       College_B       Collaborated
 ADPRI, S       AAC&S, H            2
 HU, USA        Penn, USA           1
 ....
 ....
and so on for all the colleges.

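I used the sort and uniq commands on column 2 to get the number of universities, which gave a list of 797 universities, and my database holds more than 20000 papers published by them. My data also contains a lot of spaces and special characters.

PS: The data is tab-separated; I also have the same data as CSV.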
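As an illustration of the layout described above, the tab-separated data can be read with Python's standard csv module, which removes the surrounding double quotes and leaves the commas inside a name alone. This is only a sketch: the function name `read_affiliations` and the inline sample are assumptions made for the example, not part of the original data set. It reproduces the sort/uniq count of distinct universities.

```python
import csv
import io

def read_affiliations(stream):
    """Return (uid, affiliation) pairs from tab-separated text."""
    rows = []
    for record in csv.reader(stream, delimiter="\t"):
        if len(record) < 2:
            continue  # skip blank or malformed lines
        uid = record[0].strip()
        # Collapse runs of whitespace and trim stray spaces around the name.
        affiliation = " ".join(record[1].split())
        rows.append((uid, affiliation))
    return rows

# Hypothetical sample in the same shape as the question's data.
sample = 'p2\t"ADPRI, S"\np2\t"AAC&S, H"\np3\t"AAC&S, H"  \np3\t"HU, USA" \n'
rows = read_affiliations(io.StringIO(sample))
universities = sorted({aff for _, aff in rows})
print(len(universities), universities)
```

Normalizing whitespace before comparing names matters here, because a trailing space would otherwise make the same university count twice.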

A gawk solution.

Usage: ./program.awk input.txt

Additionally, you can run ./program.awk input.txt | column -t -s $'\t' to get a neatly aligned display if the alignment is lost.

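#!/usr/bin/gawk -f
# Requires gawk 4.0 or later: aff_arr[i][j] is an array of arrays.
# Rows that share a UID are assumed to be adjacent in the input.

# Count every ordered pair of distinct colleges seen for the current
# paper, then reset the per-paper set.
function pub_to_aff(    i, j) {
    for (i in pub_arr) {
        for (j in pub_arr) {
            if (i != j)
                aff_arr[i][j]++
        }
    }
    delete pub_arr
}

BEGIN {
    OFS = "\t"
    FS = "\t"
}

# A new UID starts: flush the colleges collected for the previous paper.
$1 != prev_uid {
    prev_uid = $1
    pub_to_aff()
}
{
    # Using the college as a key removes duplicates within one paper.
    pub_arr[$2] = 1
}

END {
    pub_to_aff()    # flush the last paper
    print "College_A", "College_B", "Collaborated"

    for (i in aff_arr) {
        for (j in aff_arr[i]) {
            print i, j, aff_arr[i][j]
        }
    }
}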
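The script above counts ordered pairs, so each collaboration is reported twice, once in each direction. If one row per pair is preferred, one possible approach is to group the affiliations by paper and count unordered pairs. The Python sketch below assumes the same tab-separated input; `count_collaborations` is a hypothetical helper name, and unlike the awk script it does not need rows with the same UID to be adjacent.

```python
from collections import Counter, defaultdict
from itertools import combinations

def count_collaborations(lines):
    """Count, for each unordered pair of colleges, the papers they share."""
    colleges_by_paper = defaultdict(set)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        uid, college = line.rstrip("\n").split("\t", 1)
        # Trim, drop the surrounding quotes, collapse inner whitespace.
        college = " ".join(college.strip().strip('"').split())
        colleges_by_paper[uid].add(college)

    pairs = Counter()
    for colleges in colleges_by_paper.values():
        # sorted() gives each pair a canonical order, so (A, B) == (B, A).
        pairs.update(combinations(sorted(colleges), 2))
    return pairs

# Hypothetical sample in the same shape as the demonstration input.
sample = [
    'p2\t"ADPRI, S"', 'p2\t"AAC&S, H"',
    'p3\t"AAC&S, H"', 'p3\t"ADPRI, S"', 'p3\t"HU, USA"', 'p3\t"Penn, USA"',
    'p4\t"AAC&S, H"', 'p4\t"ADPRI, S"',
    'p5\t"AAC&S, H"',
]
for (a, b), n in sorted(count_collaborations(sample).items()):
    print(a, b, n, sep="\t")
```

A paper with a single affiliation yields no pairs, so it contributes nothing to the counts.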

Input - two lines were added for demonstration, under p3 and p4.

p1  "ADPRI, S"
p1  "ADPRI, S"
p2  "ADPRI, S"
p2  "AAC&S, H"
p3  "AAC&S, H"
p3  "ADPRI, S"
p3  "HU, USA"
p3  "Penn, USA"
p4  "AAC&S, H"
p4  "ADPRI, S"
p5  "AAC&S, H"
p6  "AAC&S, H"
p7  "AAC&S, H"
p8  "AU, A"
p9  "AECI, A"
p10 "AECI, A"
p10 "AECI, A"

Output

College_A   College_B   Collaborated
"AAC&S, H"  "HU, USA"   1
"AAC&S, H"  "Penn, USA" 1
"AAC&S, H"  "ADPRI, S"  3
"HU, USA"   "AAC&S, H"  1
"HU, USA"   "Penn, USA" 1
"HU, USA"   "ADPRI, S"  1
"Penn, USA" "AAC&S, H"  1
"Penn, USA" "HU, USA"   1
"Penn, USA" "ADPRI, S"  1
"ADPRI, S"  "AAC&S, H"  3
"ADPRI, S"  "HU, USA"   1
"ADPRI, S"  "Penn, USA" 1

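Edit - tested with real data.

Input - I kept only part of your sample.txt and changed a few lines to show the script working. Note that if the input file contains no collaborating universities, the script prints just one line: the header.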

WOS:000355337800046 "ACHARYA NARENDRA DEV COLL, NEW DELHI"
WOS:000355337800046 "ACHARYA NARENDRA DEV COLL, NEW DELHI"
WOS:000355337800046 "ACHARYA PRAFULLA CHANDRA COLL. KOLKATA"
WOS:000328700900001 "ACHARYA PRAFULLA CHANDRA COLL. KOLKATA"
WOS:000338233800012 "ADAMAS INST TECHNOL, KOLKATA"
WOS:000338233800012 "ADARSH MAHAVIDYALAYA DHAMANGAON RAILWAY, AMRAVATI"
WOS:000349637600009 "ADARSH MAHAVIDYALAYA DHAMANGAON RAILWAY, AMRAVATI"
WOS:000314892400031 "ADITYA INST TECHNOL & MANAGEMENT, TEKKALI"

Command used: ./program.awk sample.txt | column -t -s $'\t'

Output

College_A                                            College_B                                            Collaborated
"ADAMAS INST TECHNOL, KOLKATA"                       "ADARSH MAHAVIDYALAYA DHAMANGAON RAILWAY, AMRAVATI"  1
"ACHARYA NARENDRA DEV COLL, NEW DELHI"               "ACHARYA PRAFULLA CHANDRA COLL. KOLKATA"             1
"ACHARYA PRAFULLA CHANDRA COLL. KOLKATA"             "ACHARYA NARENDRA DEV COLL, NEW DELHI"               1
"ADARSH MAHAVIDYALAYA DHAMANGAON RAILWAY, AMRAVATI"  "ADAMAS INST TECHNOL, KOLKATA"                       1

Using Perl:

#!/usr/bin/env perl

use strict;
use warnings;

use List::MoreUtils qw(uniq);
use Set::Intersection;

my ( %papers, @colleges );

# Read tab-separated lines: the UID, then the quoted college name.
while (<>) {
    chomp;
    my ( $paper, $college ) = m/(\S+)\t"(.+)"/;
    next unless defined $college;    # skip lines that do not match

    # normalize college names
    $college =~ s/\s+/ /g;
    $college =~ s/^\s+//;
    $college =~ s/\s+$//;

    $papers{$college} //= [];
    push @{ $papers{$college} }, $paper;
}

# Sort and deduplicate each college's papers, as -preordered expects.
@colleges = sort keys %papers;
for my $college (@colleges) {
    $papers{$college} = [ uniq sort @{ $papers{$college} } ];
}

print qq(College_A\tCollege_B\tCollaborated\n);

# Compare every unordered pair of colleges once.
for ( my $i = 0 ; $i < @colleges - 1 ; $i++ ) {
    for ( my $j = $i + 1 ; $j < @colleges ; $j++ ) {
        my $collaborations = scalar get_intersection(
            { -preordered => 1 },
            $papers{ $colleges[$i] },
            $papers{ $colleges[$j] }
        );
        print $colleges[$i], "\t", $colleges[$j], "\t", $collaborations, "\n"
          if ($collaborations);
    }
}

Using Python:

#!/usr/bin/env python

from __future__ import print_function

import re
import sys
from collections import defaultdict

# Map each college to the set of papers it appears on.
papers = defaultdict(set)
for line in sys.stdin:
    if not line.strip():
        continue  # skip blank lines
    paper, college = line.rstrip("\n").split("\t", 1)
    # Trim first so trailing spaces do not hide the closing quote.
    college = college.strip()
    college = re.sub(r'^"|"$', '', college)
    college = re.sub(r'\s+', ' ', college).strip()
    papers[college].add(paper)

colleges = sorted(papers.keys())

print("College_A\tCollege_B\tCollaborated")
# Compare every unordered pair of colleges once.
for i in range(len(colleges) - 1):
    for j in range(i + 1, len(colleges)):
        collaborations = len(papers[colleges[i]].intersection(papers[colleges[j]]))
        if collaborations:
            print("%s\t%s\t%d" % (colleges[i], colleges[j], collaborations))

Source: https://unix.stackexchange.com/questions/402097