Awk
How do I extract lines where the number following a given string is above a threshold?
I have a file (list_20.txt) that looks like this:
[{"d_prime":"0.475425","variation1":"rs909776","r2":"0.057940","variation2":"rs16991816","population_name":"1000GENOMES:phase_3:KHV"}]
[{"r2":"0.057940","variation1":"rs909776","d_prime":"0.475425","population_name":"1000GENOMES:phase_3:KHV","variation2":"rs16991819"}]
[{"variation1":"rs909776","r2":"0.078476","d_prime":"0.546491","population_name":"1000GENOMES:phase_3:KHV","variation2":"rs8114269"}]
[{"population_name":"1000GENOMES:phase_3:KHV","variation2":"rs8114269","r2":"0.073418","variation1":"rs6130034","d_prime":"0.528588"}]
[{"population_name":"1000GENOMES:phase_3:KHV","variation2":"rs1201686","r2":"0.060239","variation1":"rs3746539","d_prime":"0.271891"}]
[{"variation2":"rs1201686","population_name":"1000GENOMES:phase_3:KHV","d_prime":"0.280262","r2":"0.058212","variation1":"rs2144011"}]
[{"population_name":"1000GENOMES:phase_3:KHV","variation2":"rs10485662","r2":"0.058826","variation1":"rs844808","d_prime":"0.423639"}]
[{"variation2":"rs6065565","population_name":"1000GENOMES:phase_3:KHV","d_prime":"0.638509","r2":"0.110749","variation1":"rs6139746"}]
[{"r2":"0.110749","variation1":"rs6139746","d_prime":"0.638509","population_name":"1000GENOMES:phase_3:KHV","variation2":"rs6072936"}]
[{"population_name":"1000GENOMES:phase_3:KHV","variation2":"rs6065562","variation1":"rs6139746","r2":"0.091021","d_prime":"0.606214"}]
[{"variation1":"rs6139746","r2":"0.910749","d_prime":"0.638509","population_name":"1000GENOMES:phase_3:KHV","variation2":"rs6072937"}]
...
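I want to extract only the lines whose value after "r2":" is greater than 0.7 and less than or equal to 1.
In this example, the expected output is just this line: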
[{"variation1":"rs6139746","r2":"0.910749","d_prime":"0.638509","population_name":"1000GENOMES:phase_3:KHV","variation2":"rs6072937"}]
I tried this:
awk '$NF >= 0.8 && $NF <1 {print $0}' list_20.txt > 20.out
But I got an empty file. This command is also not specific to the string of interest, "r2":".
Since this looks like JSON, let's use a command-line JSON parser:
$ jq '.[] | select((.r2|tonumber) > 0.7 and (.r2|tonumber) <= 1)' file
{
  "variation1": "rs6139746",
  "r2": "0.910749",
  "d_prime": "0.638509",
  "population_name": "1000GENOMES:phase_3:KHV",
  "variation2": "rs6072937"
}
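We have to convert the value of the r2 key from a string to a proper number with tonumber, but other than that, it is a simple filter using select(). We can shorten it a bit, or at least avoid converting each number twice, with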
jq '.[] | (.r2|tonumber) as $r2 | select($r2 > 0.7 and $r2 <= 1)' file
If you want the result in the same format as the input, use
$ jq -c '.[] | (.r2|tonumber) as $r2 | select($r2 > 0.7 and $r2 <= 1) | [.]' file
[{"variation1":"rs6139746","r2":"0.910749","d_prime":"0.638509","population_name":"1000GENOMES:phase_3:KHV","variation2":"rs6072937"}]
That is, use -c to request "compact output", and use [.] to wrap each result that passes the select() filter in an array.
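If the thresholds should not be hard-coded in the filter, they can be passed in from the shell. The following is a minimal sketch, assuming jq is installed; the function name filter_r2 and its argument order are hypothetical and not part of the original answer. It relies on jq's --argjson option so that the bounds arrive as numbers rather than strings.

```shell
# Hypothetical helper: filter_r2 LO HI FILE
# Prints, in compact form, each record whose r2 lies in the range (LO, HI].
filter_r2() {
    # --argjson passes LO and HI to the filter as JSON numbers ($lo, $hi).
    jq -c --argjson lo "$1" --argjson hi "$2" \
        '.[] | (.r2|tonumber) as $r2 | select($r2 > $lo and $r2 <= $hi) | [.]' "$3"
}
```

It would be called as filter_r2 0.7 1 list_20.txt.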
Using awk:
awk 'match($0, /"r2":"[^"]+"/) {
    t = substr($0, RSTART+6, RLENGTH-7)
    f = 0.7<t+0 && t+0<=1
    if ( f ) print
}' list_20.txt
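The RSTART+6 and RLENGTH-7 offsets are easy to get wrong, so it may help to print just the extracted value before applying the range check. A small sketch of that check follows; the function name r2_values is hypothetical and used only for illustration.

```shell
# Hypothetical helper: print only the r2 value extracted from each line.
# Reads the named files, or standard input if none are given.
r2_values() {
    awk 'match($0, /"r2":"[^"]+"/) {
        # The prefix "r2":" is 6 characters, so the value starts at RSTART+6.
        # Removing that prefix and the closing quote leaves RLENGTH-7 characters.
        print substr($0, RSTART+6, RLENGTH-7)
    }' "$@"
}
```

Running r2_values list_20.txt should list one bare number per matching line; lines without an "r2" key are skipped because match() returns 0 for them.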
You can also do this in perl:
perl -lne ' print if /"r2":"(.*?)"/ and 0.7<$1 && $1<=1; ' list_20.txt
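We look for the string r2 in quotes and capture the value that follows it. The range check is then applied to the captured value, and the line is printed if the value falls within the range.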