How to look up the values for index IDs in a large data file?

January 2, 2014

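I have a data file (data_array.txt in the example) and an index file, index.txt. I want to extract the rows of the data file whose ID also appears in the index file and store them in a new file, Out.txt. For IDs that have no row in the data file, I also want to put NA in Out.txt. I know how to do this for one column, but my data has more than 1000 columns (numbered 1 to 1344). Could you help me write a script that does this faster? My data file, index IDs, and proposed output are shown below.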
data_array.txt
Id  1   2   3   .   .   1344
1   10  20  30  .   .   -1
2   20  30  40  .   .   -2
3   30  40  50  .   .   -3
4   40  50  60  .   .   -4
6   60  60  70  .   .   -5
8   80  70  80  .   .   -6
10  100 80  90  .   .   -7
index.txt
Id
1
2
8
9
10
The desired output is
Out.txt
Id  1   2   3   .   .   1344
1   10  20  30  .   .   -1
2   20  30  40  .   .   -2
8   80  70  80  .   .   -6
9   NA  NA  NA          NA
10  100 80  90  .   .   -7

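Here is a small awk script I came up with that should find the lines matching your indexes. Just put it in a file (e.g. lookup.awk) and run it as shown further down:

lookup.awk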

BEGIN {
       # read the lookup IDs from the command line (-v indexes="...") and put them in an array
       split(indexes, index_array, " ");
}

NR==1 {
       # number of data columns = fields on the first line minus the Id column (only used for NA printing)
       nr_of_fields = NF-1;
}

# For every line in your data file do the following
{
       # check if the first field matches a value in the index array
       for (var in index_array) {
               if ($1 == index_array[var]) {
                       # when a match is found print the line and remove the value from the index array
                       print $0;
                       delete index_array[var];
                       next;
               }
       }
}

END {
       # after all matching lines are found, print "NA" lines for the indexes that are still in the array
       for (var in index_array) {
               # print the ID itself, then one NA per data column
               printf "%s", index_array[var];
               for (i=1; i<=nr_of_fields; i++) {
                       printf "  NA";
               }
               printf "\n";
       }
}

You can then run it like this:

$ awk -f ./lookup.awk -v indexes="1 2 3 4 5 6 7 8 9 10" data.txt | sort -n
1   10  20  30  .   .   -1
2   20  30  40  .   .   -2
3   30  40  50  .   .   -3
4   40  50  60  .   .   -4
5  NA  NA  NA  NA  NA  NA
6   60  60  70  .   .   -5
7  NA  NA  NA  NA  NA  NA
8   80  70  80  .   .   -6
9  NA  NA  NA  NA  NA  NA
10  100 80  90  .   .   -7
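
The question keeps the IDs in index.txt, whereas the command above takes them as a space-separated string. One way to bridge the two is sketched below. It assumes index.txt has a single header line (Id) followed by one ID per line, as in the question; the function name ids_from_index is hypothetical and not part of the original answer.

```shell
# ids_from_index FILE
# Print the IDs listed in FILE as one space-separated line,
# skipping the first line (assumed to be the "Id" header).
ids_from_index() {
    tail -n +2 "$1" | tr '\n' ' '
}
```

With the function defined, the call would become: awk -f ./lookup.awk -v indexes="$(ids_from_index index.txt)" data_array.txt | sort -n > Out.txt. The trailing space left by tr is harmless, because split() with a " " separator ignores leading and trailing whitespace.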

Note that this awk script does not print the rows in the same order as your index; that would need some extra logic.
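
As a rough idea of what that extra logic could look like, here is a sketch that reads the index file first and then prints the rows in index order. It is not the script from the answer above: the wrapper name lookup_ordered and the array names are made up for illustration, and it assumes both files start with a header line, use whitespace-separated fields, and that the index file is not empty (the NR == FNR test relies on that). It also replaces the inner loop over the index array with a single hash lookup per data line, which may help with large files.

```shell
# lookup_ordered INDEX_FILE DATA_FILE
# Print the data header, then one line per ID in INDEX_FILE, in index order.
# IDs without a row in DATA_FILE get NA in every data column.
lookup_ordered() {
    awk '
    # First file (the index): remember each ID and the order it appeared in.
    NR == FNR {
        if (FNR > 1 && NF) { order[++n] = $1; wanted[$1] = 1 }
        next
    }
    # Second file (the data): print the header and take the column count from it.
    FNR == 1 { print; ncols = NF - 1; next }
    # Keep only rows whose ID is in the index; "in" is a hash lookup, not a loop.
    $1 in wanted { row[$1] = $0 }
    END {
        for (i = 1; i <= n; i++) {
            id = order[i]
            if (id in row) {
                print row[id]
            } else {
                printf "%s", id
                for (j = 1; j <= ncols; j++) printf "  NA"
                printf "\n"
            }
        }
    }' "$1" "$2"
}
```

It would be called as lookup_ordered index.txt data_array.txt > Out.txt. Only the rows that match an ID are held in memory until the end, so the memory cost grows with the size of the index rather than with the size of the data file.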

Source: https://unix.stackexchange.com/questions/107537
