Text-Processing

Arithmetic operations between 2 files to produce a series of new files using awk

  • May 15, 2021

I have a tab-delimited model input file that I want to vary for an ensemble analysis. It has this format:

cat input.txt
#############################################   
###  Parameter file for the program ### 
#############################################   
### GENERAL PARAMETERS
4   /* nbout # Number of outputs */ 
46  /* numesp # Number of species */
0.05    /* p # light incidence param (diff through turbid medium) */
0.1357158   0.2446549   0.3535940   0.4992873   0.6449806   0.6957850   0.7465893   0.8130218   0.8794543   0.9397271   1.0000000   0.9397271   0.8794543   0.8078294   0.7362045   0.6899817   0.6437589   0.5989616   0.5541642   0.4617186   0.3692730   0.3633708   0.3574686   0.2426215   /* normalized daily light course (from 7am to 7pm, with a half-hour time-step */
1   /* vox_la_max. The max voxel leaf area. */
0   /* l_growth_scheme. 0 = top down; 1 = random; 2 = homogeneous; 3 = bottom up */
0.1 /* knockout_max. Parameter controlling the extent to which lianas can knock out trees */
0.05    /* shed_prob. With this probability, the liana is completely shed from the voxel. */

### Species description                                 
****    Nmass   LMA wsg dmax    hmax    ah  tmax    seedmass    Fregdistgr  Pmass   g1  s_liana
Alvaradoa_amorphoides   0.0214  74.775  0.584   0.5 24.44   0.892   1   0.0078  40  0.00145 3.77    0
Annona_reticulata   0.0350  74.529  0.503   0.5 24.44   0.892   1   0.2392  40  0.00142 3.77    0
Brosimum_alicastrum 0.0201  104.281 0.760   0.5 17.31   0.117   1   1.2486  40  0.00097 3.77    0

### Climate (input environment)
25.47447    26.02723    26.87827    27.58436    26.95839    25.63987    25.61669    25.26543    24.99990    24.10808    24.71997    24.67287    /*Temperature in degree C*/

I have another tab-delimited file of multipliers, drawn from a distribution, in this format:

cat multipliers.txt
2   3   4
3   2   2
4   3   3

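I am trying to multiply 3 specific input fields by the multipliers, producing a series of new input files, one per row of multipliers (3 in this example), while leaving the rest of the input file unchanged. Here I want to multiply vox_la_max, knockout_max and shed_prob by 2, 3 and 4 respectively for the first file, by 3, 2 and 2 for the second file, and by 4, 3 and 3 for the third file. That would give me 3 new files like these: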

cat input1.txt
#############################################   
###  Parameter file for the program ### 
#############################################   
### GENERAL PARAMETERS
4   /* nbout # Number of outputs */ 
46  /* numesp # Number of species */
0.05    /* p # light incidence param (diff through turbid medium) */
0.1357158   0.2446549   0.3535940   0.4992873   0.6449806   0.6957850   0.7465893   0.8130218   0.8794543   0.9397271   1.0000000   0.9397271   0.8794543   0.8078294   0.7362045   0.6899817   0.6437589   0.5989616   0.5541642   0.4617186   0.3692730   0.3633708   0.3574686   0.2426215   /* normalized daily light course (from 7am to 7pm, with a half-hour time-step */
2   /* vox_la_max. The max voxel leaf area. */
0   /* l_growth_scheme. 0 = top down; 1 = random; 2 = homogeneous; 3 = bottom up */
0.3 /* knockout_max. Parameter controlling the extent to which lianas can knock out trees */
0.2 /* shed_prob. With this probability, the liana is completely shed from the voxel. */

### Species description                                 
****    Nmass   LMA wsg dmax    hmax    ah  tmax    seedmass    Fregdistgr  Pmass   g1  s_liana
Alvaradoa_amorphoides   0.0214  74.775  0.584   0.5 24.44   0.892   1   0.0078  40  0.00145 3.77    0
Annona_reticulata   0.0350  74.529  0.503   0.5 24.44   0.892   1   0.2392  40  0.00142 3.77    0
Brosimum_alicastrum 0.0201  104.281 0.760   0.5 17.31   0.117   1   1.2486  40  0.00097 3.77    0

### Climate (input environment)
25.47447    26.02723    26.87827    27.58436    26.95839    25.63987    25.61669    25.26543    24.99990    24.10808    24.71997    24.67287    /*Temperature in degree C*/

cat input2.txt
#############################################   
###  Parameter file for the program ### 
#############################################   
### GENERAL PARAMETERS
4   /* nbout # Number of outputs */ 
46  /* numesp # Number of species */
0.05    /* p # light incidence param (diff through turbid medium) */
0.1357158   0.2446549   0.3535940   0.4992873   0.6449806   0.6957850   0.7465893   0.8130218   0.8794543   0.9397271   1.0000000   0.9397271   0.8794543   0.8078294   0.7362045   0.6899817   0.6437589   0.5989616   0.5541642   0.4617186   0.3692730   0.3633708   0.3574686   0.2426215   /* normalized daily light course (from 7am to 7pm, with a half-hour time-step */
3   /* vox_la_max. The max voxel leaf area. */
0   /* l_growth_scheme. 0 = top down; 1 = random; 2 = homogeneous; 3 = bottom up */
0.2 /* knockout_max. Parameter controlling the extent to which lianas can knock out trees */
0.1 /* shed_prob. With this probability, the liana is completely shed from the voxel. */

### Species description                                 
****    Nmass   LMA wsg dmax    hmax    ah  tmax    seedmass    Fregdistgr  Pmass   g1  s_liana
Alvaradoa_amorphoides   0.0214  74.775  0.584   0.5 24.44   0.892   1   0.0078  40  0.00145 3.77    0
Annona_reticulata   0.0350  74.529  0.503   0.5 24.44   0.892   1   0.2392  40  0.00142 3.77    0
Brosimum_alicastrum 0.0201  104.281 0.760   0.5 17.31   0.117   1   1.2486  40  0.00097 3.77    0

### Climate (input environment)
25.47447    26.02723    26.87827    27.58436    26.95839    25.63987    25.61669    25.26543    24.99990    24.10808    24.71997    24.67287    /*Temperature in degree C*/

cat input3.txt
#############################################   
###  Parameter file for the program ### 
#############################################   
### GENERAL PARAMETERS
4   /* nbout # Number of outputs */ 
46  /* numesp # Number of species */
0.05    /* p # light incidence param (diff through turbid medium) */
0.1357158   0.2446549   0.3535940   0.4992873   0.6449806   0.6957850   0.7465893   0.8130218   0.8794543   0.9397271   1.0000000   0.9397271   0.8794543   0.8078294   0.7362045   0.6899817   0.6437589   0.5989616   0.5541642   0.4617186   0.3692730   0.3633708   0.3574686   0.2426215   /* normalized daily light course (from 7am to 7pm, with a half-hour time-step */
4   /* vox_la_max. The max voxel leaf area. */
0   /* l_growth_scheme. 0 = top down; 1 = random; 2 = homogeneous; 3 = bottom up */
0.3 /* knockout_max. Parameter controlling the extent to which lianas can knock out trees */
0.15    /* shed_prob. With this probability, the liana is completely shed from the voxel. */

### Species description                                 
****    Nmass   LMA wsg dmax    hmax    ah  tmax    seedmass    Fregdistgr  Pmass   g1  s_liana
Alvaradoa_amorphoides   0.0214  74.775  0.584   0.5 24.44   0.892   1   0.0078  40  0.00145 3.77    0
Annona_reticulata   0.0350  74.529  0.503   0.5 24.44   0.892   1   0.2392  40  0.00142 3.77    0
Brosimum_alicastrum 0.0201  104.281 0.760   0.5 17.31   0.117   1   1.2486  40  0.00097 3.77    0

### Climate (input environment)
25.47447    26.02723    26.87827    27.58436    26.95839    25.63987    25.61669    25.26543    24.99990    24.10808    24.71997    24.67287    /*Temperature in degree C*/

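I think I should be using awk, but so far I have only managed to vary one parameter at a time, using a single column of the multipliers file, and I need to vary all 3 parameters at once. What kind of script could I set up to generate these outputs?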

TL;DR: a compact awk script, hard-coded for your example:

NR != FNR {
   out = "out" FNR ".txt"
   printf "" > out
   for (l=m=1; l <= nl; l++)
       printf tmpl[l] ORS, l in vals ? $(m++)*vals[l] : 0 >> out
   close(out)
   next
}

{
   gsub(/%/, "%%")
# here is the regex that selects the fields by their name
   if ($3 ~ /^(vox_la_max|knockout_max|shed_prob)[^[:alnum:]_]*$/) {
       vals[NR] = $1
       sub(/^[0-9]+(\.[0-9]+)?/, OFMT)
   }
   tmpl[NR] = $0; nl++
}

Use it as:

LC_NUMERIC=C awk -f script input.txt multipliers.txt

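It produces files named outX.txt, where X is the line number of the corresponding row in multipliers.txt (out1.txt, out2.txt, and so on).

The LC_NUMERIC=C bit is needed if your locale would otherwise use a comma instead of a dot as the decimal separator for floating-point values.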

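For simplicity, I made a few assumptions that seemed reasonable:

  • The wanted input fields are always values standing alone on their line, and the adjacent comment gives the field's name as a single word, which must be separated from the /* by whitespace (at least one blank)
  • No two fields have the same name
  • Floating-point values are written with digits and (possibly) one dot only, i.e. no exponents or other scientific notation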
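
As a minimal sketch of the template idea these assumptions support: the leading number of a wanted line is swapped for awk's OFMT conversion ("%.6g" by default), and the whole line is later used as a printf format. The sample line, its literal % sign and the multiplier 3 are made up for illustration.

```shell
# Hypothetical one-line demo of the "line as printf template" trick.
templated=$(
    printf '0.1\t/* knockout_max. 50%% of trees */\n' |
    LC_NUMERIC=C awk -v mult=3 '{
        gsub(/%/, "%%")                   # protect literal % signs from printf
        val = $1                          # remember the original value
        sub(/^[0-9]+(\.[0-9]+)?/, OFMT)   # leading number -> "%.6g" placeholder
        printf $0 ORS, val * mult         # fill the placeholder with value * multiplier
    }'
)
printf '%s\n' "$templated"   # 0.3<TAB>/* knockout_max. 50% of trees */
```

Everything after the leading number, including the tab and the comment, passes through unchanged, which is why the rest of the file survives intact.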

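The same script as above, but verbose, commented, and extended to allow:

  • specifying the wanted fields arbitrarily by line number
  • specifying the wanted fields arbitrarily by name, as given in the comment on each field's input line
  • naming the output files automatically after the input file, which may have one extension (e.g. .txt) and whose path (if any) must not contain dots; in other words, it is best to run the script from the directory containing the input file

# some preparations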
BEGIN {
   # output files named as the input file name
   split(ARGV[1], f, ".")
   outpfx = f[1]
   # remember wanted fields specified on command line as comma-separated line numbers
   if (nums) {
       # split variable "nums" on comma into helper array "r"
       n = split(nums, r, ",")
       # loop over helper array to build final array, thus indexed by wanted line numbers
       while (n) rows[r[n--]]
   }
}

# here we operate on multipliers file
NR != FNR {
   # output file name for this set of multipliers
   out = outpfx FNR ".txt"
   # create/overwrite this output file
   printf "" > out
   # loop over template lines scanned from input file
   for (linenum = multnum = 1; linenum <= numlines; linenum++)
       # use the template line as printf format string to consume values to be multiplied (if any)
       printf tmpl[linenum] ORS, linenum in wanted_values ? $(multnum++)*wanted_values[linenum] : 0 >> out
   close(out)
   next
}

# here we scan the input file to build a template for printf
{
   # escape existing % chars as we are going to leverage printfs own format string which is %-based
   gsub(/%/, "%%")
   # on specified line numbers or named fields:
   if (NR in rows || names && match($3, "^("names")[^[:alnum:]_]*$")) {
       # remember this value
       wanted_values[NR] = $1
       # replace the original value with the printfs conversion specification for floating-point values
       # it will be used by printf later on while processing the multipliers file
       sub(/^[0-9]+(\.[0-9]+)?/, OFMT)
   }
   # remember this whole line as a template
   tmpl[NR] = $0; numlines++
}
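
The output prefix computed in the BEGIN block can be checked in isolation. This is only a sketch: the helper name prefix_of and the two paths are hypothetical, and no files need to exist because a BEGIN-only awk program reads no input.

```shell
# Sketch: how split(ARGV[1], f, ".") derives the output prefix.
prefix_of() {
    awk 'BEGIN { split(ARGV[1], f, "."); print f[1] }' "$1"
}
plain=$(prefix_of input.txt)           # text before the first dot: "input"
dotted=$(prefix_of ./data/input.txt)   # the leading "." comes first, so the prefix is empty
printf '[%s] [%s]\n' "$plain" "$dotted"   # [input] []
```

With the second path the output files would end up named 1.txt, 2.txt, and so on, which is why the path must not contain dots.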

Use it like this:

# specify fields by their line numbers, each separated by a comma
# (in the abbreviated input.txt shown in the question, the three fields are on lines 9, 11 and 12)
LC_NUMERIC=C awk -f script -v nums=36,38,39 input.txt multipliers.txt
# or specify fields by their names, each separated by the | character (NOTE it's a regexp)
LC_NUMERIC=C awk -f script -v names='vox_la_max|knockout_max|shed_prob' input.txt multipliers.txt
# or also use both ways of specifying fields
LC_NUMERIC=C awk -f script -v nums=15,112,234,71,5 -v names='vox_la_max|numesp' input.txt multipliers.txt

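If you specify more fields than there are multipliers, the excess fields become 0, because a missing multiplier is an empty field and awk treats it as 0 in the multiplication.

If you specify fewer fields than there are multipliers, the excess multipliers are simply ignored.

In any case, fields always consume multipliers in the order of the line numbers on which they appear, i.e. the first wanted field encountered in the input file consumes the first multiplier, regardless of how you specified that field.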
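
These rules can be tried on a throwaway example. The sketch below condenses the script's logic and prints to standard output instead of writing files; the field names alpha, beta and gamma, their values, and the scratch files are all hypothetical, and the character class is written as an explicit range here.

```shell
# Hypothetical scratch files; names and values are made up for this demo.
dir=$(mktemp -d)
printf '10\t/* alpha. first */\n20\t/* beta. second */\n30\t/* gamma. third */\n' > "$dir/in.txt"
printf '2\t5\n' > "$dir/mult.txt"    # only two multipliers for three wanted fields

# Fields are named in reverse order on purpose: file order still decides.
result=$(LC_NUMERIC=C awk -v names='gamma|beta|alpha' '
NR != FNR {                          # second file: one row of multipliers
    for (l = m = 1; l <= nl; l++)
        printf tmpl[l] ORS, (l in vals) ? $(m++) * vals[l] : 0
    next
}
{                                    # first file: build the template
    gsub(/%/, "%%")
    if (match($3, "^(" names ")[^A-Za-z0-9_]*$")) {
        vals[NR] = $1
        sub(/^[0-9]+(\.[0-9]+)?/, OFMT)
    }
    tmpl[NR] = $0; nl++
}' "$dir/in.txt" "$dir/mult.txt" | cut -f1 | paste -sd' ' -)
echo "$result"    # 20 100 0
rm -r "$dir"
```

alpha (line 1) takes the first multiplier and beta (line 2) the second, even though they were named last; gamma finds no multiplier left and becomes 0.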

Source: https://unix.stackexchange.com/questions/640543