Shell-Script
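Count occurrences of patterns from an input file matched against a large file
I have a list of colleges in a text file, and in a separate file I have a list of publications with their affiliations. I want to write a script that checks how many times a publication is repeated and counts how many times colleges collaborated. My data looks like the following; "p1" is the title of the paper and "Affiliation" is the college that published it.
Example:
Data
UID, Affiliation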
    p1     "ADPRI, S"
    p1     "ADPRI, S"
    p2     "ADPRI, S"
    p2     "AAC&S, H"
    p3     "AAC&S, H"
    p3     "HU, USA"
    p3     "Penn, USA"
    p4     "AAC&S, H"
    p5     "AAC&S, H"
    p6     "AAC&S, H"
    p7     "AAC&S, H"
    p8     "AU, A"
    p9     "AECI, A"
    p10    "AECI, A"
    p10    "AECI, A"
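In the data above, paper "p2" is linked to both "ADPRI, S" and "AAC&S, H".
Similarly, "p3" is associated with the colleges "AAC&S, H", "HU, USA" and "Penn, USA".
So my script should produce a file giving the number of collaborations between each pair of colleges. For the data above, it would be:
Desired output: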
    College_A    College_B    Collaborated
    ADPRI, S     AAC&S, H     2
    HU, USA      Penn, USA    1
    ....         ....         and so on for all the colleges
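I used the sort and uniq commands on column 2 to get the number of colleges: the list has 797 colleges, and my database holds more than 20,000 papers published by them. My data also contains many spaces and special characters.
PS: The data is tab-separated, and I also have the same data as CSV.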
gawk
solution. Usage:
./program.awk input.txt
Additionally, you can run the following:
./program.awk input.txt | column -t -s $'\t'
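This gives a neatly aligned display if the tab alignment is lost.

    #!/usr/bin/awk -f
    # Requires gawk 4.0 or later: aff_arr[i][j] is an array of arrays.
    # The input must be grouped by UID (all rows of one paper adjacent).

    # Count every ordered pair of distinct affiliations of the paper
    # just finished, then reset the per-paper set.
    function pub_to_aff(    i, j) {
        for (i in pub_arr) {
            for (j in pub_arr) {
                if (i != j)
                    aff_arr[i][j]++;
            }
        }
        delete pub_arr;
    }

    BEGIN {
        OFS = "\t";
        FS = "\t";
    }

    # A new UID begins: flush the previous paper's affiliations.
    $1 != prev_uid {
        prev_uid = $1;
        pub_to_aff();
    }

    # Using the affiliation as a key removes duplicates within a paper.
    {
        pub_arr[$2] = 1;
    }

    END {
        pub_to_aff();
        print "College_A", "College_B", "Collaborated";
        for (i in aff_arr) {
            for (j in aff_arr[i]) {
                print i, j, aff_arr[i][j];
            }
        }
    }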
Input (two lines were added for demonstration, one for p3 and one for p4):

    p1     "ADPRI, S"
    p1     "ADPRI, S"
    p2     "ADPRI, S"
    p2     "AAC&S, H"
    p3     "AAC&S, H"
    p3     "ADPRI, S"
    p3     "HU, USA"
    p3     "Penn, USA"
    p4     "AAC&S, H"
    p4     "ADPRI, S"
    p5     "AAC&S, H"
    p6     "AAC&S, H"
    p7     "AAC&S, H"
    p8     "AU, A"
    p9     "AECI, A"
    p10    "AECI, A"
    p10    "AECI, A"
Output

    College_A      College_B      Collaborated
    "AAC&S, H"     "HU, USA"      1
    "AAC&S, H"     "Penn, USA"    1
    "AAC&S, H"     "ADPRI, S"     3
    "HU, USA"      "AAC&S, H"     1
    "HU, USA"      "Penn, USA"    1
    "HU, USA"      "ADPRI, S"     1
    "Penn, USA"    "AAC&S, H"     1
    "Penn, USA"    "HU, USA"      1
    "Penn, USA"    "ADPRI, S"     1
    "ADPRI, S"     "AAC&S, H"     3
    "ADPRI, S"     "HU, USA"      1
    "ADPRI, S"     "Penn, USA"    1
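The output above lists every pair twice, once in each direction, with the same count both times. If one row per unordered pair is preferred, the rows can be folded together afterwards. The following is only a minimal Python sketch, assuming the script's tab-separated output is available as a list of lines; the function name `fold_symmetric_pairs` and the sample rows are made up for illustration.

```python
def fold_symmetric_pairs(lines):
    """Keep one row per unordered college pair from tab-separated output."""
    seen = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        a, b, count = line.split("\t")
        if a == "College_A":  # skip the header row
            continue
        # (A, B) and (B, A) map to the same sorted key; both carry the same count.
        seen[tuple(sorted((a, b)))] = int(count)
    return [(a, b, n) for (a, b), n in sorted(seen.items())]


# Hypothetical sample: a few rows in the same shape as the output above.
rows = [
    'College_A\tCollege_B\tCollaborated',
    '"AAC&S, H"\t"ADPRI, S"\t3',
    '"ADPRI, S"\t"AAC&S, H"\t3',
    '"HU, USA"\t"Penn, USA"\t1',
    '"Penn, USA"\t"HU, USA"\t1',
]
for a, b, n in fold_symmetric_pairs(rows):
    print(a, b, n, sep="\t")
```

Sorting the two names before using them as a key is what makes the direction irrelevant; the header has to be printed again separately if it is needed.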
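Edit: test with real data.
Input: I kept only part of your sample.txt and changed a few lines to show that the script works. Note that if the input file contains no collaborating colleges, the script outputs only one line, the header.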
    WOS:000355337800046    "ACHARYA NARENDRA DEV COLL, NEW DELHI"
    WOS:000355337800046    "ACHARYA NARENDRA DEV COLL, NEW DELHI"
    WOS:000355337800046    "ACHARYA PRAFULLA CHANDRA COLL. KOLKATA"
    WOS:000328700900001    "ACHARYA PRAFULLA CHANDRA COLL. KOLKATA"
    WOS:000338233800012    "ADAMAS INST TECHNOL, KOLKATA"
    WOS:000338233800012    "ADARSH MAHAVIDYALAYA DHAMANGAON RAILWAY, AMRAVATI"
    WOS:000349637600009    "ADARSH MAHAVIDYALAYA DHAMANGAON RAILWAY, AMRAVATI"
    WOS:000314892400031    "ADITYA INST TECHNOL & MANAGEMENT, TEKKALI"
Command used:
./program.awk sample.txt | column -t -s $'\t'
Output

    College_A                                            College_B                                            Collaborated
    "ADAMAS INST TECHNOL, KOLKATA"                       "ADARSH MAHAVIDYALAYA DHAMANGAON RAILWAY, AMRAVATI"  1
    "ACHARYA NARENDRA DEV COLL, NEW DELHI"               "ACHARYA PRAFULLA CHANDRA COLL. KOLKATA"             1
    "ACHARYA PRAFULLA CHANDRA COLL. KOLKATA"             "ACHARYA NARENDRA DEV COLL, NEW DELHI"               1
    "ADARSH MAHAVIDYALAYA DHAMANGAON RAILWAY, AMRAVATI"  "ADAMAS INST TECHNOL, KOLKATA"                       1
Using Perl:

    #!/usr/bin/env perl

    use strict;
    use warnings;

    use List::MoreUtils qw(uniq);
    use Set::Intersection;

    my ( %papers, @colleges );

    while (<>) {
        chomp;
        my ( $paper, $college ) = m/(\S+)\t"(.+)"/g;
        next unless defined $college;    # skip lines that do not match

        # normalize college names: collapse and trim whitespace
        $college =~ s/\s+/ /g;
        $college =~ s/^\s+//;
        $college =~ s/\s+$//;

        $papers{$college} //= [];
        push @{ $papers{$college} }, $paper;
    }

    @colleges = sort keys %papers;

    # keep a sorted, duplicate-free list of papers per college
    for my $college (@colleges) {
        $papers{$college} = [ uniq sort @{ $papers{$college} } ];
    }

    print qq(College_A\tCollege_B\tCollaborated\n);

    for ( my $i = 0 ; $i < @colleges - 1 ; $i++ ) {
        for ( my $j = $i + 1 ; $j < @colleges ; $j++ ) {
            # call in list context, then count the shared papers
            my @shared = get_intersection(
                { -preordered => 1 },
                $papers{ $colleges[$i] },
                $papers{ $colleges[$j] }
            );
            my $collaborations = @shared;
            print $colleges[$i], "\t", $colleges[$j], "\t", $collaborations, "\n"
              if ($collaborations);
        }
    }
Using Python (reads the tab-separated data from standard input):

    #!/usr/bin/env python

    from __future__ import print_function

    import re
    import sys
    from collections import defaultdict

    # college name -> set of papers it is affiliated with
    papers = defaultdict(set)

    for line in sys.stdin:
        paper, college = line.split("\t")

        # strip the surrounding quotes, then normalize whitespace
        college = re.sub(r'^"|"$', '', college)
        college = re.sub(r'\s+', ' ', college)
        college = re.sub(r'^\s+|\s+$', '', college)

        papers[college].add(paper)

    colleges = sorted(papers.keys())

    print("College_A\tCollege_B\tCollaborated")

    # compare each unordered pair of colleges once
    for i in range(len(colleges) - 1):
        for j in range(i + 1, len(colleges)):
            collaborations = len(papers[colleges[i]].intersection(papers[colleges[j]]))
            if collaborations:
                print("%s\t%s\t%d" % (colleges[i], colleges[j], collaborations))
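The script above intersects the paper sets of every pair of colleges. A different way to reach the same counts is to group colleges by paper and count the pairs within each paper, which is closer to what the gawk version does. This is only a sketch of that alternative, not part of the original answer: the function name `count_collaborations` and the sample lines are assumptions made for illustration, and lines without exactly two tab-separated fields are silently skipped.

```python
from collections import Counter, defaultdict
from itertools import combinations


def count_collaborations(lines):
    """Count, for each unordered pair of colleges, the papers they share."""
    colleges_by_paper = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # assumption: ignore blank or malformed lines
        paper, college = fields
        # Strip the surrounding quotes and collapse runs of whitespace.
        college = " ".join(college.strip().strip('"').split())
        colleges_by_paper[paper].add(college)

    pairs = Counter()
    for colleges in colleges_by_paper.values():
        # sorted() gives each pair a fixed order, so (A, B) is never stored as (B, A).
        for a, b in combinations(sorted(colleges), 2):
            pairs[(a, b)] += 1
    return pairs


# Hypothetical sample in the same shape as the question's data.
sample = [
    'p2\t"ADPRI, S"',
    'p2\t"AAC&S, H"',
    'p3\t"AAC&S, H"',
    'p3\t"ADPRI, S"',
    'p3\t"HU, USA"',
    'p4\t"AAC&S, H"',
    'p4\t"ADPRI, S"',
]
for (a, b), n in sorted(count_collaborations(sample).items()):
    print(a, b, n, sep="\t")
```

Because the per-paper sets drop duplicate rows, a paper listed twice for the same college is still counted once, and the work grows with the number of colleges per paper rather than with the square of the total number of colleges.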