Awk

Escaping nested double quotes in a broken CSV file

  • September 11, 2020

I have a large, broken "CSV" file with many nested double quotes. For example:

123,"I wonder how to escape "these" quotes with backslashes.",123,456
456,"I wonder how to escape "these" quotes with backslashes.",456,789

Any idea how to fix this?

Updated with a real example:

n9sih438,4994fa72322,PMC,Rapid Identification of Malaria Vaccine Candidates Based on alpha-Helical Coiled Coil Protein Motif,10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"To identify malaria antigens for vaccine development, we selected alpha-helical coiled coil domains of proteins predicted to be present in the parasite erythrocytic stage. The corresponding synthetic peptides are expected to mimic structurally "native" epitopes. Indeed the 95 chemically synthesized peptides were all specifically recognized by human immune sera, though at various prevalence. Peptide specific antibodies were obtained both by affinity-purification from malaria immune sera and by immunization of mice. These antibodies did not show significant cross reactions, i.e., they were specific for the original peptide, reacted with native parasite proteins in infected erythrocytes and several were active in inhibiting in vitro parasite growth. Circular dichroism studies indicated that the selected peptides assumed partial or high alpha-helical content. Thus, we demonstrate that the bioinformatics/chemical synthesis approach described here can lead to the rapid identification of molecules which target biologically active antibodies, thus identifying suitable vaccine candidates. This strategy can be, in principle, extended to vaccine discovery in a wide range of other pathogens.",2007-07-25

Nested double quotes can occur in the "title" field (the 4th field) and the "abstract" field (the 9th field).

I created a sample input file in which every line has 10 fields, and fields 4 and 9 may or may not be quoted:

$ cat file
n9sih438,4994fa72322,PMC,here is an unquoted string,10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,here is an unquoted string,2007-07-25
n9sih438,4994fa72322,PMC,"here is a,",string,","within,", quotes.",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,here is an unquoted string,2007-07-25
n9sih438,4994fa72322,PMC,here is an unquoted string,10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is a,",string,","within,", quotes.",2007-07-25
n9sih438,4994fa72322,PMC,"here is a,",string,","within,", quotes.",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is a,",string,","within,", quotes.",2007-07-25

Then I wrote this script (using GNU awk for the 3rd argument to match()) to identify which type each input line is and then modify the quoted fields accordingly:

$ cat tst.awk
BEGIN { FS=OFS="," }
{
   # The 4th and 9th fields may or may not be quoted so we are looking
   # for one of these patterns of fields:
   #
   #    1,2,3,4,5,6,7,8,9,10           - type A
   #    1,2,3,"4",5,6,7,8,9,10         - type B
   #    1,2,3,4,5,6,7,8,"9",10         - type C
   #    1,2,3,"4",5,6,7,8,"9",10       - type D
   #
   # If we can determine which type of record we have then we can
   # identify the fields.

   delete f
   if ( match($0,/^(([^",]+,){9}[^",]+)$/,a) ) {
       type = "A"
       split(a[0],f)
   }
   else if ( match($0,/^(([^",]+,){3})(".*"),(([^",]+,){5}[^",]+)$/,a) ) {
       type = "B"
       split(a[1],f)
       f[4] = a[3]
       split(a[4],tmp)
       for (i in tmp) {
           f[4+i] = tmp[i]
       }
   }
   else if ( match($0,/^(([^",]+,){8})(".*"),([^",]+)$/,a) ) {
       type = "C"
       split(a[1],f)
       f[9] = a[3]
       f[10] = a[4]
   }
   else if ( match($0,/^(([^",]+,){3})(".*"),(([^",]+,){4})(".*"),([^",]+)$/,a) ) {
       type = "D"
       split(a[1],f)
       f[4] = a[3]
       split(a[4],tmp)
       for (i in tmp) {
           f[4+i] = tmp[i]
       }
       f[9] = a[6]
       f[10] = a[7]
   }
   else {
       type = "Unknown"
       split($0,f)
       printf "Warning, could not classify file \"%s\", line %d: %s\n", FILENAME, FNR, $0 | "cat>&2"
   }

   # Uncomment the following lines to see what the above is doing:
   #print ORS "################" ORS "Type " type ":\t" $0
   #for (i=1; i in f; i++) {
       #print i, "<" f[i] ">"
   #}

   gsub(/^"|"$/,"",f[4])
   gsub(/"/,"\"\"",f[4])
   f[4] = "\"" f[4] "\""

   gsub(/^"|"$/,"",f[9])
   gsub(/"/,"\"\"",f[9])
   f[9] = "\"" f[9] "\""

   $0 = ""
   for (i in f) {
       $i = f[i]
   }
   print
}
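
As a reduced illustration of the same classify-then-rebuild idea, here is a minimal sketch for the 4-field lines from the question, assuming only the 2nd field can be quoted. It is not part of tst.awk: fix_line is a hypothetical helper name, and it uses RSTART and RLENGTH instead of the gawk-only 3rd argument to match(), so it should run with any POSIX awk.

```sh
# Hypothetical helper (not part of tst.awk): repair one 4-field line whose
# 2nd field may be quoted, using only POSIX awk features.
fix_line() {
    printf '%s\n' "$1" | awk '
    {
        if ($0 ~ /^[^",]+,".*",[^",]+,[^",]+$/) {
            match($0, /^[^",]+,/)              # field 1 plus its comma
            head = substr($0, 1, RLENGTH)
            rest = substr($0, RLENGTH + 1)
            match(rest, /,[^",]+,[^",]+$/)     # the two trailing fields
            tail = substr(rest, RSTART)
            mid  = substr(rest, 2, RSTART - 3) # quoted field minus outer quotes
            gsub(/"/, "\"\"", mid)             # double the nested quotes
            print head "\"" mid "\"" tail
        } else {
            print                              # leave other shapes untouched
        }
    }'
}

fix_line '123,"I wonder how to escape "these" quotes with backslashes.",123,456'
```

For the sample line this prints 123,"I wonder how to escape ""these"" quotes with backslashes.",123,456. The trailing fields contain neither commas nor quotes, so the second match() can only start at the second-to-last comma, which is what lets the greedy quoted field in the middle be isolated.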


$ awk -f tst.awk file
n9sih438,4994fa72322,PMC,"here is an unquoted string",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is an unquoted string",2007-07-25
n9sih438,4994fa72322,PMC,"here is a,"",string,"",""within,"", quotes.",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is an unquoted string",2007-07-25
n9sih438,4994fa72322,PMC,"here is an unquoted string",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is a,"",string,"",""within,"", quotes.",2007-07-25
n9sih438,4994fa72322,PMC,"here is a,"",string,"",""within,"", quotes.",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is a,"",string,"",""within,"", quotes.",2007-07-25

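The output always quotes the 2 fields that may be quoted in the input; if you don't like that, changing it is a simple tweak left as an exercise. I also used the more conventional way to "escape" double quotes in CSV, which is to double them (""). If you prefer backslash escapes (\"), change the replacement string in the gsub() calls that double the quotes. For more information on using awk on CSV and on the CSV "standards", see https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk .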
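
To make the two escaping styles concrete, here is a minimal sketch of the quote-normalising step on its own. The helper name quote_field and its style argument are assumptions made for illustration; the gsub() calls mirror the ones applied to f[4] and f[9] in tst.awk, with a backslash variant added.

```sh
# Hypothetical helper: quote one field, escaping nested quotes either by
# doubling them (style "double") or with a backslash (style "backslash").
quote_field() {
    printf '%s\n' "$1" | awk -v style="$2" '
    {
        gsub(/^"|"$/, "")                  # drop the outer quotes if present
        if (style == "backslash")
            gsub(/"/, "\\\"")              # " becomes \"
        else
            gsub(/"/, "\"\"")              # " becomes ""
        print "\"" $0 "\""                 # re-wrap the field in quotes
    }'
}

quote_field '"structurally "native" epitopes"' double
quote_field '"structurally "native" epitopes"' backslash
```

The first call should yield "structurally ""native"" epitopes" and the second "structurally \"native\" epitopes". Doubling is the form most CSV readers accept by default, so the backslash style only makes sense if the consumer of the file is known to expect it.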

Source: https://unix.stackexchange.com/questions/608991