Awk

How to parse lines where the number after a certain string is greater than a threshold?

  • June 23, 2020

I have a file that looks like this (list_20.txt):

[{"d_prime":"0.475425","variation1":"rs909776","r2":"0.057940","variation2":"rs16991816","population_name":"1000GENOMES:phase_3:KHV"}]
[{"r2":"0.057940","variation1":"rs909776","d_prime":"0.475425","population_name":"1000GENOMES:phase_3:KHV","variation2":"rs16991819"}]
[{"variation1":"rs909776","r2":"0.078476","d_prime":"0.546491","population_name":"1000GENOMES:phase_3:KHV","variation2":"rs8114269"}]
[{"population_name":"1000GENOMES:phase_3:KHV","variation2":"rs8114269","r2":"0.073418","variation1":"rs6130034","d_prime":"0.528588"}]
[{"population_name":"1000GENOMES:phase_3:KHV","variation2":"rs1201686","r2":"0.060239","variation1":"rs3746539","d_prime":"0.271891"}]
[{"variation2":"rs1201686","population_name":"1000GENOMES:phase_3:KHV","d_prime":"0.280262","r2":"0.058212","variation1":"rs2144011"}]
[{"population_name":"1000GENOMES:phase_3:KHV","variation2":"rs10485662","r2":"0.058826","variation1":"rs844808","d_prime":"0.423639"}]
[{"variation2":"rs6065565","population_name":"1000GENOMES:phase_3:KHV","d_prime":"0.638509","r2":"0.110749","variation1":"rs6139746"}]
[{"r2":"0.110749","variation1":"rs6139746","d_prime":"0.638509","population_name":"1000GENOMES:phase_3:KHV","variation2":"rs6072936"}]
[{"population_name":"1000GENOMES:phase_3:KHV","variation2":"rs6065562","variation1":"rs6139746","r2":"0.091021","d_prime":"0.606214"}]
[{"variation1":"rs6139746","r2":"0.910749","d_prime":"0.638509","population_name":"1000GENOMES:phase_3:KHV","variation2":"rs6072937"}]
...

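I want to extract only the lines where the value after "r2":" is greater than 0.7 and less than or equal to 1.

In this example, the expected output is just this line: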

[{"variation1":"rs6139746","r2":"0.910749","d_prime":"0.638509","population_name":"1000GENOMES:phase_3:KHV","variation2":"rs6072937"}]

I tried this:

awk '$NF >= 0.8 && $NF <1 {print $0}' list_20.txt  > 20.out

But I got an empty file. This command is also not specific to the string of interest: "r2":"
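A likely reason the attempt prints nothing: the lines contain no whitespace, so with awk's default field splitting each line is a single field and $NF is the whole line. That text does not look like a number, so the comparisons are done as string comparisons rather than numeric ones. The sketch below illustrates this with a shortened copy of the last sample line; the -F'"' variant is only an illustration of splitting on double quotes and is not taken from the answers.

```shell
# Shortened copy of the last sample line (no whitespace anywhere).
line='[{"variation1":"rs6139746","r2":"0.910749","d_prime":"0.638509"}]'

# Default splitting: the whole line is one field, so NF is 1 and $NF is $0.
printf '%s\n' "$line" | awk '{ print NF }'

# Splitting on double quotes: the key r2 is one field, the ":" separator is
# the next, and the value is the field after that.
printf '%s\n' "$line" |
  awk -F'"' '{ for (i = 1; i <= NF; i++) if ($i == "r2") print $(i + 2) }'
```

The first command prints 1 and the second prints 0.910749.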

Since this looks like JSON, let's use a command-line JSON parser:

$ jq '.[] | select((.r2|tonumber) > 0.7 and (.r2|tonumber) <= 1)' file
{
 "variation1": "rs6139746",
 "r2": "0.910749",
 "d_prime": "0.638509",
 "population_name": "1000GENOMES:phase_3:KHV",
 "variation2": "rs6072937"
}

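We have to convert the value of the r2 key from a string to a proper number with tonumber, but other than that, it is a simple select() filter.

We can shorten it slightly, or at least avoid converting each number twice: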

jq '.[] | (.r2|tonumber) as $r2 | select($r2 > 0.7 and $r2 <= 1)' file

If you want the result in the same format as the input, use

$ jq -c '.[] | (.r2|tonumber) as $r2 | select($r2 > 0.7 and $r2 <= 1) | [.]' file
[{"variation1":"rs6139746","r2":"0.910749","d_prime":"0.638509","population_name":"1000GENOMES:phase_3:KHV","variation2":"rs6072937"}]

That is, use -c to request "compact output", and use [.] to create an array from each result extracted by the select() filter.

Using awk:

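awk 'match($0, /"r2":"[^"]+"/) {
 # The match starts at the opening quote of "r2": skip the 6 characters
 # of "r2":" and drop the closing quote to isolate the value.
 t = substr($0, RSTART+6, RLENGTH-7)
 # t+0 forces a numeric comparison.
 f = 0.7<t+0 && t+0<=1
 if ( f ) print
}' list_20.txt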
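
The same idea can be wrapped in a small shell function so the bounds are not hard-coded. This is only a sketch: the function name filter_r2, the variable names lo and hi, and the shortened sample lines with made-up identifiers are assumptions for illustration, not part of the original answer.

```shell
# Hypothetical helper: filter_r2 LOW HIGH [FILE...]
# Prints lines whose "r2" value v satisfies LOW < v <= HIGH.
# Reads standard input when no file is given.
filter_r2() {
  lo=$1 hi=$2
  shift 2
  awk -v lo="$lo" -v hi="$hi" '
    match($0, /"r2":"[^"]+"/) {
      # Skip the 6 characters of "r2":" and drop the closing quote;
      # adding 0 turns the extracted text into a number.
      v = substr($0, RSTART + 6, RLENGTH - 7) + 0
      if (lo < v && v <= hi) print
    }' "$@"
}

# Two shortened sample lines (illustrative identifiers).
printf '%s\n' \
  '[{"variation1":"rsA","r2":"0.057940","d_prime":"0.47"}]' \
  '[{"variation1":"rsB","r2":"0.910749","d_prime":"0.63"}]' |
  filter_r2 0.7 1
```

With these two sample lines, only the rsB line is printed. On the real data the call would be filter_r2 0.7 1 list_20.txt.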

You can also do this in perl:

perl -lne '
 print if /"r2":"(.*?)"/ and 0.7<$1 && $1<=1;
' list_20.txt

We look for the string r2 in quotes and capture what follows it. Then we apply the range check and print the line if the value is within range.

Source: https://unix.stackexchange.com/questions/594483