Awk

AWK: Insert target terms from a dictionary after source terms on randomly selected lines

  • February 1, 2022

Note: I have already asked a similar question, AWK: Quick way to insert target words after an source term, and I am a beginner with AWK.

This question is about inserting multiple target terms after their source terms on randomly selected lines.

With this AWK code snippet

awk '(NR==FNR){a[$1];next}
   FNR in a { gsub(/\<source term\>/,"& target term") }
    1
   ' <(shuf -n 5 -i 1-$(wc -l < file)) file

I wanted to insert the target term after the source term on 5 randomly selected lines of file.

For example: I have a bilingual dictionary dict that contains source terms on the left and target terms on the right, like this:

apple     : Apfel
banana    : Banane
raspberry : Himbeere

My file consists of the following lines:

I love the Raspberry Pi.
The monkey loves eating a banana.
Who wants an apple pi?
Apple pen... pineapple pen... pen-pineapple-apple-pen!
The banana is tasty and healthy.
An apple a day keeps the doctor away.
Which fruit is tastes better: raspberry or strawberry?

Suppose lines 1, 3, 5, 4 and 7 are randomly selected for the first word, apple. The output for apple would then look like this:

I love the Raspberry Pi.
The monkey loves eating a banana.
Who wants an apple Apfel pi?
Apple Apfel pen... pineapple pen... pen-pineapple-apple-pen!
The banana is tasty and healthy.
An apple a day keeps the doctor away.
Which fruit is tastes better: raspberry or strawberry?

Then another 5 random lines (3, 3, 5, 6, 7) would be selected for the word banana:

I love the Raspberry Pi.
The monkey loves eating a banana.
Who wants an apple Apfel pi?
Apple Apfel pen... pineapple pen... pen-pineapple-apple-pen!
The banana Banane is tasty and healthy.
An apple a day keeps the doctor away.
Which fruit is tastes better: raspberry or strawberry?

The same is done for all other entries until the last entry in dict has been matched.

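I want to select 5 random lines. If these lines contain a complete source term such as apple, I want to match whole words only (terms like "pineapple" should be ignored). If a line contains the source term twice, e.g. apple, then I want to insert the target term Apfel after each occurrence. Matching should be case-insensitive, so that I can also match source terms like apple and Apple.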

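My question: how can I rewrite the code snippet above so that it uses the dictionary dict, selects random lines of file, and inserts the target term after the source term?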

Here is how to use awk to randomly select 5 line numbers from an input file (using wc first to count the lines):

$ awk -v numLines="$(wc -l < file)" 'BEGIN{srand(); for (i=1; i<=5; i++) print int(1+rand()*numLines)}'
7
2
88
13
18
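Note that rand() samples with replacement, so the same line number can come up more than once, whereas the shuf -n 5 in the question produces distinct numbers. If distinct lines are wanted (an assumption, since the question does not say so explicitly), a minimal sketch could look like the following. The function name pick_distinct_lines and its two arguments are made up for illustration.

```shell
# Hypothetical helper: print n distinct random line numbers in 1..numLines.
# $1 = number of lines in the file, $2 = how many line numbers to pick.
pick_distinct_lines() {
    awk -v numLines="$1" -v n="$2" '
    BEGIN {
        srand()
        if (n > numLines) n = numLines   # cannot pick more distinct lines than exist
        while (count < n) {
            lineNr = int(1 + rand() * numLines)
            if (!(lineNr in seen)) {     # skip numbers that were already picked
                seen[lineNr]
                print lineNr
                count++
            }
        }
    }'
}

# Example call for the 7-line sample file:
pick_distinct_lines 7 5
```

The retry loop is fine when n is small compared to numLines; for picking most of a large file, a shuffle would probably be the better choice.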

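Now all you have to do is take my previous answer and, for every "old" string read in the ARGIND==1 block, generate 5 line numbers as shown above and populate an array that maps each generated line number to the "old" strings associated with it. Then, while reading the final input file, check whether the current line number is in the array and, if it is, loop over the "old" strings stored for that line number, doing the gsub() as in my previous answer.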

Using GNU awk for ARGIND, IGNORECASE, word boundaries, arrays of arrays, and the \s shorthand for [[:space:]]:

$ cat tst.sh
#!/usr/bin/env bash

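awk -v numLines="$(wc -l < file)" '
   BEGIN {
       FS = "\\s*:\\s*"      # split dict lines on the colon and surrounding spaces
       IGNORECASE = 1        # match Apple as well as apple
       srand()
   }
   ARGIND == 1 {             # first file: dict
       old = "\\<" $1 "\\>"  # source term as a whole word
       new = "& " $2         # matched text followed by the target term
       for (i=1; i<=5; i++) {
           lineNr = int(1+rand()*numLines)
           map[lineNr][old] = new
       }
       next
   }
   FNR in map {              # second file: a line picked for at least one term
       for ( old in map[FNR] ) {
           new = map[FNR][old]
           gsub(old,new)
       }
   }
   { print }
' dict file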
$ ./tst.sh
I love the Raspberry Pi.
The monkey loves eating a banana Banane.
Who wants an apple Apfel pi?
Apple Apfel pen... pineapple pen... pen-pineapple-apple Apfel-pen!
The banana Banane is tasty and healthy.
An apple a day keeps the doctor away.
Which fruit is tastes better: raspberry Himbeere or strawberry?

Quoted from: https://unix.stackexchange.com/questions/688689