Sort a list numerically
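I have a text list with the following structure (every line of each entry starts with a tab, there are no empty lines between those lines, and there is one empty line between entries):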
    292G.- La Ilíada (tomo I) ; Collection one (volume 3) ; Homer ; http://www.ataun.eus/BIBLIOTECAGRATUITA/Cl%C3%A1sicos%20en%20Espa%C3%B1ol/Homero/Iliada.pdf
    - I have to download more ancient greek texts.
    - Another note line.

    293G.- El Ingenioso Hidalgo "Don Quijote" De La Mancha ; Collection one (volume 1) ; Miguel de Cervantes ; http://www.daemcopiapo.cl/Biblioteca/Archivos/7_6253.pdf
    - Masterpiece.

    294G.- Crimen y castigo ; Collection one (volume 4) ; Fiódor Dostoyevski ; http://www.ataun.eus/BIBLIOTECAGRATUITA/Cl%C3%A1sicos%20en%20Espa%C3%B1ol/Fedor%20Dostoiewski/Crimen%20y%20castigo.pdf
    - Russian masterpiece.

    295G.- La isla del tesoro ; Collection one (volume 2) ; Robert Louis Stevenson ; https://www.biblioteca.org.ar/libros/130864.pdf
    - I read this one as a kid.
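From position 292G onward, Collection one continues for more than 100 volumes. I want those 100 volumes sorted by volume number (which is found in the second field). The expected output is: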
    292G.- El Ingenioso Hidalgo "Don Quijote" De La Mancha ; Collection one (volume 1) ; Miguel de Cervantes ; http://www.daemcopiapo.cl/Biblioteca/Archivos/7_6253.pdf
    - Masterpiece.

    293G.- La isla del tesoro ; Collection one (volume 2) ; Robert Louis Stevenson ; https://www.biblioteca.org.ar/libros/130864.pdf
    - I read this one as a kid.

    294G.- La Ilíada (tomo I) ; Collection one (volume 3) ; Homer ; http://www.ataun.eus/BIBLIOTECAGRATUITA/Cl%C3%A1sicos%20en%20Espa%C3%B1ol/Homero/Iliada.pdf
    - I have to download more ancient greek texts.
    - Another note line.

    295G.- Crimen y castigo ; Collection one (volume 4) ; Fiódor Dostoyevski ; http://www.ataun.eus/BIBLIOTECAGRATUITA/Cl%C3%A1sicos%20en%20Espa%C3%B1ol/Fedor%20Dostoiewski/Crimen%20y%20castigo.pdf
    - Russian masterpiece.
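Note that titles can contain characters and strings such as a double quote ("), ( and ), but never a semicolon (;), which is used only as a separator. I guess sort has the answer here, but it is beyond my newbie skills.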
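This (using GNU awk for the third argument to match(), sorted_in, and FPAT) sorts only the part you want (i.e. collection "one" with sequence number 292 or greater), can handle titles that contain any characters or strings, including ;, (, ) or (volume <N>), and outputs the sorted part in its original position among the unsorted surrounding parts:

    $ cat tst.awk
    BEGIN {
        RS = ""                                # paragraph mode: one record per entry
        ORS = "\n\n"
        FPAT = "[^;]*(\"[^\"]*\")*[^;]*"       # fields may contain a quoted ";"
        tgtColl = "one"
        begSeqNr = 292
        maxSeqs = 100
    }
    match($2,/Collection (.*) \(volume ([0-9]+)\)/,a) {
        coll = a[1]
        volNr = a[2]
        seqNr = $1+0
    }
    (coll == tgtColl) && (seqNr >= begSeqNr) && (++seqCnt <= maxSeqs) {
        vols[volNr] = $0                       # buffer the records to be sorted
        next
    }
    {
        prtVols()                              # flush the sorted block first
        print
    }
    END { prtVols() }

    function prtVols(    volNr, seqNr, vol) {
        PROCINFO["sorted_in"] = "@ind_num_asc" # visit volumes in numeric order
        seqNr = begSeqNr
        for (volNr in vols) {
            vol = vols[volNr]
            sub(/[0-9]+/,seqNr++,vol)          # renumber the sequence prefix
            print vol
        }
        delete vols
    }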
For example, given this input, which is modified from the sunny-day case in the question to add a few useful test cases:
    $ cat file
    100G.- some earlier collection ; Collection zero (volume 1) ; Robert Louis Stevenson ; https://www.biblioteca.org.ar/libros/130864.pdf
    - TEST earlier collection ID

    200G.- right collection, too early sequence number; Collection one (volume 6) ; Homer ; http://www.ataun.eus/BIBLIOTECAGRATUITA/Cl%C3%A1sicos%20en%20Espa%C3%B1ol/Homero/Iliada.pdf
    - TEST earlier sequence number

    292G.- La Ilíada ; Collection one (volume 3) ; Homer ; http://www.ataun.eus/BIBLIOTECAGRATUITA/Cl%C3%A1sicos%20en%20Espa%C3%B1ol/Homero/Iliada.pdf
    - I have to download more ancient greek texts.
    - Another note line.

    293G.- El Quijote ; Collection one (volume 1) ; Miguel de Cervantes ; http://www.daemcopiapo.cl/Biblioteca/Archivos/7_6253.pdf
    - Masterpiece.

    294G.- Crimen y castigo ; Collection one (volume 4) ; Fiódor Dostoyevski ; http://www.ataun.eus/BIBLIOTECAGRATUITA/Cl%C3%A1sicos%20en%20Espa%C3%B1ol/Fedor%20Dostoiewski/Crimen%20y%20castigo.pdf
    - Russian masterpiece.

    295G.- "Kill Bill; Bury Him (volume 2)" ; Collection one (volume 5) ; Tarantino ; https://www.biblioteca.org.ar/libros/130864.pdf
    - TEST quoted title with sparator chars and target string

    296G.- La isla del tesoro ; Collection one (volume 2) ; Robert Louis Stevenson ; https://www.biblioteca.org.ar/libros/130864.pdf
    - I read this one as a kid.

    300G.- some later collection ; Collection twenty-three (volume 2) ; Robert Louis Stevenson ; https://www.biblioteca.org.ar/libros/130864.pdf
    - TEST later collecion ID
it will output:
    $ awk -f tst.awk file
    100G.- some earlier collection ; Collection zero (volume 1) ; Robert Louis Stevenson ; https://www.biblioteca.org.ar/libros/130864.pdf
    - TEST earlier collection ID

    200G.- right collection, too early sequence number; Collection one (volume 6) ; Homer ; http://www.ataun.eus/BIBLIOTECAGRATUITA/Cl%C3%A1sicos%20en%20Espa%C3%B1ol/Homero/Iliada.pdf
    - TEST earlier sequence number

    292G.- El Quijote ; Collection one (volume 1) ; Miguel de Cervantes ; http://www.daemcopiapo.cl/Biblioteca/Archivos/7_6253.pdf
    - Masterpiece.

    293G.- La isla del tesoro ; Collection one (volume 2) ; Robert Louis Stevenson ; https://www.biblioteca.org.ar/libros/130864.pdf
    - I read this one as a kid.

    294G.- La Ilíada ; Collection one (volume 3) ; Homer ; http://www.ataun.eus/BIBLIOTECAGRATUITA/Cl%C3%A1sicos%20en%20Espa%C3%B1ol/Homero/Iliada.pdf
    - I have to download more ancient greek texts.
    - Another note line.

    295G.- Crimen y castigo ; Collection one (volume 4) ; Fiódor Dostoyevski ; http://www.ataun.eus/BIBLIOTECAGRATUITA/Cl%C3%A1sicos%20en%20Espa%C3%B1ol/Fedor%20Dostoiewski/Crimen%20y%20castigo.pdf
    - Russian masterpiece.

    296G.- "Kill Bill; Bury Him (volume 2)" ; Collection one (volume 5) ; Tarantino ; https://www.biblioteca.org.ar/libros/130864.pdf
    - TEST quoted title with sparator chars and target string

    300G.- some later collection ; Collection twenty-three (volume 2) ; Robert Louis Stevenson ; https://www.biblioteca.org.ar/libros/130864.pdf
    - TEST later collecion ID
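For readers who do not have GNU awk at hand, the buffer, sort and renumber idea can be roughly sketched in Python. This is only an illustration under assumptions, not a port of the script: sort_block and its parameters coll, beg_seq and max_seqs are hypothetical names mirroring tgtColl, begSeqNr and maxSeqs, and instead of FPAT it simply drops double-quoted text before looking for the collection and volume.

```python
import re

# Matches the "Collection <name> (volume <N>)" part of a record.
COLL_RE = re.compile(r'Collection (.+?) \(volume (\d+)\)')


def sort_block(text, coll="one", beg_seq=292, max_seqs=100):
    """Sort the matching run of records by volume and renumber it in place."""
    out, pending, taken = [], {}, 0

    def flush():
        # Emit buffered records in ascending volume order, renumbered from beg_seq.
        for offset, vol in enumerate(sorted(pending)):
            renumbered = re.sub(r'\d+', str(beg_seq + offset), pending[vol], count=1)
            out.append(renumbered)
        pending.clear()

    # Records are separated by one or more blank lines.
    for record in re.split(r'\n{2,}', text.strip('\n')):
        # Assumption: dropping quoted text is enough to keep a quoted title
        # such as "X (volume 2)" from being mistaken for the volume field.
        m = COLL_RE.search(re.sub(r'"[^"]*"', '', record))
        seq = re.match(r'\s*(\d+)', record)
        if (m and seq and m.group(1) == coll
                and int(seq.group(1)) >= beg_seq and taken < max_seqs):
            pending[int(m.group(2))] = record
            taken += 1
        else:
            flush()
            out.append(record)
    flush()
    return '\n\n'.join(out) + '\n'
```

Calling sort_block on the contents of the sample file should leave the 100G, 200G and 300G entries where they are and reorder only the entries in between.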
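Since ; is the field separator, any ; that appears in a title must be inside double quotes, whether it is quoted on its own, as in Kill Bill";" Bury Him, or as part of a fully quoted title, as in the example above. No other characters or strings in a title need any special handling. If you really want all of collection one rather than only the entries from a given sequence number onward, or vice versa, that is a trivial tweak: just don't test one or the other. Likewise, if you want every entry from the given begSeqNr onward sorted rather than only 100 of them, remove the test that involves seqCnt, and if you don't want the surrounding collections/sequences printed, just get rid of the standalone print.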
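The quoting rule can be illustrated with a small Python sketch. It is not the FPAT mechanism itself, only a regular expression with a similar effect, and split_fields is a hypothetical helper name. It assumes double quotes are balanced within a record, and it drops empty fields rather than preserving them.

```python
import re

# A field is a run of characters that are neither ';' nor '"',
# mixed with complete "double-quoted" chunks (which may contain ';').
FIELD = re.compile(r'(?:[^;"]|"[^"]*")+')


def split_fields(record):
    """Split on ';' except where the ';' sits inside double quotes."""
    return [field.strip() for field in FIELD.findall(record)]


fields = split_fields(
    '295G.- "Kill Bill; Bury Him (volume 2)" ; Collection one (volume 5) ; Tarantino'
)
print(fields[0])  # 295G.- "Kill Bill; Bury Him (volume 2)"
print(fields[1])  # Collection one (volume 5)
```

The same pattern also copes with a semicolon quoted on its own, so Kill Bill";" Bury Him stays in one field.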
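You didn't state an explicit language requirement, so here is a quick-and-dirty solution in Python 3.8. I'm sure others can come up with something better, but this should be enough.
The code assumes the text is in a file named list.txt in the current directory, and it creates a new file named new-list.txt.
It also doesn't handle the missing space in "-La isla del tesoro".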
    import re

    booklist = []
    bookcount = 0
    entry = ''
    line_numbers = []


    # Find and return the volume number for a book
    def get_volnum(book):
        volstring = re.search(r'\(volume (\d+)\)', book)
        return volstring.group(1)


    # Read the file into a list of lines
    with open('list.txt', 'r', encoding='utf-8') as f:
        doc = f.readlines()

    # Group each book into a single string and append it to booklist
    for line in doc:
        # If the line begins with three digits followed by 'G.', start a new entry.
        if re.match(r"\d\d\dG\.", line):
            # Remember the sequence number
            line_numbers.append(line.split('G.', 1)[0])
            # Add the previous entry to booklist (without the three digits and 'G.')
            if bookcount > 0:
                booklist.append(entry.split('G.', 1)[1])
            entry = line
            bookcount += 1
        # If the line begins with '- ', append it to the current entry.
        if line.startswith('- '):
            entry += line

    # Append the last entry
    booklist.append(entry.split('G.', 1)[1])

    # Make a list (booktable) of [volnum, book] pairs
    booktable = [[get_volnum(book), book] for book in booklist]

    # Sort that list by volnum (index 0 of each item of booktable)
    booktable.sort(key=lambda x: int(x[0]))
    line_numbers.sort()

    # Write the result to a file, with an empty line after each entry
    with open("new-list.txt", "w", encoding='utf-8') as f:
        for b in booktable:
            f.write(line_numbers.pop(0) + 'G.' + b[1])
            f.write('\n')
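One caveat worth spelling out: the script assumes three-digit sequence numbers, and line_numbers.sort() compares them as strings. That is fine for 292 and up, but a list that mixes two-digit and three-digit numbers would need a numeric key. A minimal sketch of the difference, using made-up entries and a hypothetical ENTRY_START pattern:

```python
import re

# Any number of digits followed by a literal 'G.' (hypothetical pattern).
ENTRY_START = re.compile(r'(\d+)G\.')

first_lines = [
    '100G.- B ; Collection one (volume 2)',
    '99G.- A ; Collection one (volume 1)',
]
numbers = [ENTRY_START.match(line).group(1) for line in first_lines]

print(sorted(numbers))           # ['100', '99']  string order: '1' sorts before '9'
print(sorted(numbers, key=int))  # ['99', '100']  numeric order
```

The volume numbers in the script are already sorted numerically, because the key converts each one with int().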