Text-Processing

Converting bibliographic references for use with LaTeX

  • February 13, 2020

I received a long Word document that I have to port to LaTeX. In the document, all citations appear in the classic author-year form. Something like

Lorem ipsum dolor (Sit, 1998) amet, consectetur adipiscing (Slit 2000, Sed and So 2002, Eiusmod et al. 1976).
Tempor incididunt ut labore et dolore magna aliqua (Ut et al. 1312)

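Each citation needs to be replaced by the proper citation key, as it appears in the bib reference list. In other words, the text should be translated to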

Lorem ipsum dolor \cite{sit1998} amet, consectetur adipiscing \cite{slit2000,sed2002,eiusmod1976}.
Tempor incididunt ut labore et dolore magna aliqua \cite{ut1312}

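This means:

  • extracting all strings in parentheses that consist of names and a year
  • stripping from each string the spaces, the second names (everything after the first name), and the capital letters
  • using the resulting string to form the new \cite{string}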
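
To make the three steps concrete for a single citation, here is a minimal sketch. The function name cite_key is made up for illustration, and the sketch assumes the first word is the first author's surname and the last word is the year. It strips every non-letter from the name, so a hyphenated surname such as "Smith-Jones" would become "smithjones".

```shell
# cite_key is a hypothetical helper, not part of any existing tool.
# Usage: cite_key "Sed and So 2002"
cite_key() {
    printf '%s\n' "$1" | awk '{
        # Default field splitting already discards surrounding whitespace.
        author = $1                     # first word: first author surname
        year   = $NF                    # last word: the year
        gsub(/[^a-zA-Z]/, "", author)   # drop commas, dots and similar
        gsub(/[^0-9]/, "", year)        # keep only the digits
        print tolower(author) year
    }'
}

cite_key "Sed and So 2002"   # prints sed2002
cite_key "Sit, 1998"         # prints sit1998
```

This only handles one citation at a time; finding the parenthesized groups and splitting multi-citation lists is a separate problem.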

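I know this may be a fairly complex task. I was wondering whether someone has already written a script for this specific task. Alternatively, partial suggestions are welcome too. I am currently working on macOS.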

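The following awk program should work. It looks for ( ... ) elements in each line and checks whether they match the pattern "author(s), year" or "author(s)1 year1, author(s)2 year2, …". If so, it creates a citation command and replaces the ( ... ) group; otherwise it leaves the group as is.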

#!/usr/bin/awk -f


# This small function creates an 'authorYYYY'-style string from
# separate author and year fields. We split the "author" field
# additionally at each space in order to strip leading/trailing
# whitespace and further authors.
function contract(author, year)
{
   split(author,auth_fields," ");
   auth=tolower(auth_fields[1]);
   return sprintf("%s%4d",auth,year);
}



# This function checks if two strings correspond to "author name(s)" and
# "year", respectively.
function check_entry(string1, string2)
{
   if (string1 ~ /^ *([[:alpha:].-]+ *)+$/ && string2 ~ /^ *[[:digit:]]{4} *$/) return 1;
   return 0;
}




# This function creates a 'citation' command from a raw element. If the
# raw element does not conform to the reference syntax of 'author, year' or
# 'author1 year1,author2 year2, ...', we should leave it "as is", and return
# a "0" as indicator.
function create_cite(raw_elem)
{
   cite_argument=""

   # Split at ','. The single elements are either name(list) and year,
   # or space-separated name(list)-year statements.
   n_fields=split(raw_elem,sgl_elem,",");

   if (n_fields == 2 && check_entry(sgl_elem[1],sgl_elem[2]))
   {
       cite_argument=contract(sgl_elem[1],sgl_elem[2]);
   }
   else
   {
       for (k=1; k<=n_fields; k++)
       {
           n_subfield=split(sgl_elem[k],subfield," ");

           if (check_entry(subfield[1],subfield[n_subfield]))
           {
               new_elem=contract(subfield[1],subfield[n_subfield]);
               if (cite_argument == "")
               {
                   cite_argument=new_elem;
               }
               else
               {
                   cite_argument=sprintf("%s,%s",cite_argument,new_elem);
               }
           }
           else
           {
               return 0;
           }
       }
   }

   # An empty group such as '()' yields no citation keys; leave it "as is".
   if (cite_argument == "") return 0;


   cite=sprintf("\\cite{%s}",cite_argument);
   return cite;
}




# Actual program
# For each line, create a "working copy" so we can replace '(...)' pairs
# already processed with different text (here: 'X ... Y'); otherwise 'sub'
# would always stumble across the same opening parentheses.
# For each '( ... )' found, check if it fits the pattern. If so, we replace
# it with a 'cite' command; otherwise we leave it as it is.

{
   working_copy=$0;

   # Allow for unmatched ')' at the beginning of the line:
   # if a ')' was found before the first '(', mark it as processed
   i=index(working_copy,"(");
   j=index(working_copy,")");
   if (i>0 && j>0 && j<i) sub(/\)/,"Y",working_copy);

   while (i=index(working_copy,"("))
   {
       sub(/\(/,"X",working_copy); # mark this '(' as "already processed"

       j=index(working_copy,")");
       if (!j)
       {
           continue;
       }
       sub(/\)/,"Y",working_copy); # mark this ')', too


       elem=substr(working_copy,i+1,j-i-1);

       replacement=create_cite(elem);
       if (replacement != "0")
       {
           elem="\\(" elem "\\)"
           sub(elem,replacement);
       }

   }
   print $0;
}

Call the program with

~$ awk -f transform_citation.awk input.tex

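Note that the program expects the input to be "reasonably" well-formed, i.e. all parentheses on a line should be matched in pairs (although one unmatched closing parenthesis at the beginning of a line is allowed, and unmatched opening parentheses are ignored).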
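
If you want to find the lines that might trip up the program before running it, a rough pre-check along these lines could help. It is only a sketch: check_parens is a made-up name, and it merely compares the number of opening and closing parentheses per line, so it will not notice a pair in the wrong order such as ") (", and it will also flag the leading unmatched ")" that the program tolerates.

```shell
# check_parens is a hypothetical helper name.
# Prints "line N: <text>" for every line whose counts of "(" and ")" differ.
# Reads the named files, or standard input when called without arguments.
check_parens() {
    awk '{
        line = $0
        opened = gsub(/\(/, "", line)   # gsub returns the number of replacements
        closed = gsub(/\)/, "", line)
        if (opened != closed) printf "line %d: %s\n", NR, $0
    }' "$@"
}

# Example call (assumes input.tex exists):
#   check_parens input.tex
```

Lines reported this way can be fixed by hand, or at least checked after the conversion.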

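Also note that some of the syntax above (the [[:alpha:]] and [[:digit:]] character classes and the {4} repetition) requires GNU awk. To port the program to other implementations, replace

if (string1 ~ /^ *([[:alpha:].-]+ *)+$/ && string2 ~ /^ *[[:digit:]]{4} *$/) return 1;

with

if (string1 ~ /^ *([a-zA-Z.-]+ *)+$/ && string2 ~ /^ *[0123456789][0123456789][0123456789][0123456789] *$/) return 1;

and make sure that you have set the collation locale to C.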
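
If you are not sure which variant your awk needs, a quick probe like the following may tell you. This is a sketch under assumptions: awk_has_intervals is a made-up name, and it only tests the {4} repetition with whatever "awk" is first in your PATH. An awk without support typically treats {4} as literal characters, so the match simply fails.

```shell
# awk_has_intervals is a hypothetical helper name.
# Prints "yes" if the default awk understands {n} repetition, "no" otherwise.
awk_has_intervals() {
    if printf '1998\n' | LC_ALL=C awk '/^[0-9]{4}$/ { found = 1 } END { exit !found }'
    then
        echo yes
    else
        echo no
    fi
}

awk_has_intervals

# Setting the collation locale to C for a single run could look like this
# (assumes the script and input file names used above):
#   LC_ALL=C awk -f transform_citation.awk input.tex
```

Prefixing the command with LC_ALL=C sets the locale for that one invocation only, so your shell's usual locale settings stay untouched.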

Source: https://unix.stackexchange.com/questions/565754