Filenames
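How to sort a list of filenames (txt file) by substrings of the filename/path at several levels. Special challenge: two types of filename conventions
I want to sort the following list of filenames/paths.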
    L1_Data/level1/192027/LC08_L1TP_192027_20201126_20210316_01_T1 DONE
    L1_Data/level1/192028/LC08_L1TP_192028_20201126_20210316_01_T1 DONE
    L1_Data/level1/192029/LC08_L1TP_192029_20201126_20210316_01_T1 DONE
    L1_Data/level1/191027/LE07_L1TP_191027_20201127_20201223_01_T1 DONE
    L1_Data/level1/191029/LE07_L1TP_191029_20201127_20201223_01_T1 DONE
    L1_Data/level1/192027/LC08_L1TP_192027_20201212_20210313_01_T1 QUEUED
    L1_Data/level1/191028/LE07_L1TP_191028_20201213_20210108_01_T1 DONE
    L1_Data/level1/191029/LE07_L1TP_191029_20201213_20210108_01_T1 DONE
    L1_Data/level1/191027/LC08_L1TP_191027_20201221_20210310_01_T1 DONE
    L1_Data/level1/T32TQS/S2B_MSIL1C_20200101T100319_N0208_R122_T32TQS_20200101T110654.SAFE DONE
    L1_Data/level1/T32TQR/S2B_MSIL1C_20200101T100319_N0208_R122_T32TQR_20200101T110654.SAFE QUEUED
    L1_Data/level1/T33TUL/S2B_MSIL1C_20200101T100319_N0208_R122_T33TUL_20200101T110654.SAFE DONE
    L1_Data/level1/T33TUM/S2B_MSIL1C_20200101T100319_N0208_R122_T33TUM_20200101T110654.SAFE DONE
    L1_Data/level1/T32TQS/S2A_MSIL1C_20200102T102421_N0208_R065_T32TQS_20200102T105534.SAFE DONE
    L1_Data/level1/T33TUL/S2B_MSIL1C_20200104T101319_N0208_R022_T33TUL_20200104T121239.SAFE DONE
    L1_Data/level1/T32TQR/S2B_MSIL1C_20200104T101319_N0208_R022_T32TQR_20200104T121239.SAFE QUEUED
    L1_Data/level1/T32TQS/S2A_MSIL1C_20200106T100401_N0208_R122_T32TQS_20200106T103423.SAFE DONE
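Each line contains a filename (including its path) and its job status (QUEUED/DONE). Each filename carries information about the satellite image data, such as satellite type, recording date and footprint.
Now I want to reorder the list according to the following priorities:
- Job status: QUEUED first. This step is not a problem for me, but the solution for the following steps has to be combined with it (a more detailed description of my problem follows after the next image):
- Satellite type (S2A = Sentinel 2A; S2B = Sentinel 2B; LC08 = Landsat 8; LE07 = Landsat 7): S2A/B first (regardless of A or B), then LC08, then LE07. In other words, I want to distinguish between Sentinel 2, Landsat 8 and Landsat 7, but not between Sentinel 2A and Sentinel 2B.
- Recording date, ascending
- Footprint, ascending
The image below shows the positions of the relevant substrings, followed by a description of my problem.
Apart from having only a very basic understanding of the sort command, my particular problems are:
- a) addressing the substrings correctly, given
- b) two different filename types (conventions),
- c) that the underscore cannot simply be used as a delimiter, because Sentinel filenames contain six underscores and Landsat filenames contain seven, and on top of that the order of the substrings differs between the two,
- d) that, unfortunately, the order S2A/B before LC08 before LE07 is not alphabetical, and
- e) treating the S2A and S2B satellites as one unit. This could of course be solved by addressing only S2, but because that is just two characters there is some risk of confusing it with other parts of the full filename string (the real list is much longer and is updated from time to time, so it may contain a 'false' S2 in other or later lines).
In the end, the reordered list should look like this: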
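Points c) and e) can be made concrete by splitting only the basename on underscores and anchoring the satellite test to the first part. This is a minimal sketch rather than a full solution: the helper name `show_fields` and the label `sentinel2` are hypothetical, and the two sample names are taken from the list above.

```shell
#!/bin/sh
# Hypothetical helper: print satellite, date and footprint for one basename,
# picking the underscore-separated parts that apply to its naming convention.
show_fields() {
  printf '%s\n' "$1" | awk -F_ '
    # Anchor the satellite test to the whole first part so that an "S2"
    # elsewhere in the string cannot match by accident.
    $1 ~ /^S2[AB]$/     { print "sentinel2", substr($3, 1, 8), $6; next }
    $1 ~ /^L(C08|E07)$/ { print $1, $4, $3; next }
                        { print "unknown", $1 }
  '
}

show_fields 'S2B_MSIL1C_20200101T100319_N0208_R122_T32TQR_20200101T110654.SAFE'
show_fields 'LC08_L1TP_192027_20201212_20210313_01_T1'
```

For the Sentinel name the date sits in part 3 and the footprint in part 6, while for the Landsat name they sit in parts 4 and 3, which is why a single fixed `sort -t_ -k...` key cannot serve both conventions.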
    L1_Data/level1/T32TQR/S2B_MSIL1C_20200101T100319_N0208_R122_T32TQR_20200101T110654.SAFE QUEUED
    L1_Data/level1/T32TQR/S2B_MSIL1C_20200104T101319_N0208_R022_T32TQR_20200104T121239.SAFE QUEUED
    L1_Data/level1/192027/LC08_L1TP_192027_20201212_20210313_01_T1 QUEUED
    L1_Data/level1/T32TQS/S2B_MSIL1C_20200101T100319_N0208_R122_T32TQS_20200101T110654.SAFE DONE
    L1_Data/level1/T33TUL/S2B_MSIL1C_20200101T100319_N0208_R122_T33TUL_20200101T110654.SAFE DONE
    L1_Data/level1/T33TUM/S2B_MSIL1C_20200101T100319_N0208_R122_T33TUM_20200101T110654.SAFE DONE
    L1_Data/level1/T32TQS/S2A_MSIL1C_20200102T102421_N0208_R065_T32TQS_20200102T105534.SAFE DONE
    L1_Data/level1/T33TUL/S2B_MSIL1C_20200104T101319_N0208_R022_T33TUL_20200104T121239.SAFE DONE
    L1_Data/level1/T32TQS/S2A_MSIL1C_20200106T100401_N0208_R122_T32TQS_20200106T103423.SAFE DONE
    L1_Data/level1/192027/LC08_L1TP_192027_20201126_20210316_01_T1 DONE
    L1_Data/level1/192028/LC08_L1TP_192028_20201126_20210316_01_T1 DONE
    L1_Data/level1/192029/LC08_L1TP_192029_20201126_20210316_01_T1 DONE
    L1_Data/level1/191028/LE07_L1TP_191028_20201213_20210108_01_T1 DONE
    L1_Data/level1/191029/LE07_L1TP_191029_20201213_20210108_01_T1 DONE
    L1_Data/level1/191027/LE07_L1TP_191027_20201127_20201223_01_T1 DONE
    L1_Data/level1/191029/LE07_L1TP_191029_20201127_20201223_01_T1 DONE
Can anyone help me?
Using awk, sort and cut:

    awk -F'[/ ]' -v OFS='\t' '
    {
      status = $NF                   # job status is the last field
      split($(NF-1), parts, "_")     # split the filename into array `parts`
      if (parts[1] == "S2A" || parts[1] == "S2B") {
        type = 1
      } else if (parts[1] == "LC08") {
        type = 2
      } else if (parts[1] == "LE07") {
        type = 3
      } else {
        # report on stderr so the message does not end up in the sorted list
        print "error, got unknown type " parts[1] > "/dev/stderr"
        exit 1
      }
      date      = (type == 1 ? substr(parts[3], 1, 8) : parts[4])
      footprint = (type == 1 ? parts[6] : parts[3])
      print status, type, date, footprint, $0
    }
    ' file | sort -k1,1r -k2,2n -k3,3 -k4,4 | cut -f5-
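The idea is to extract the job status, satellite type, recording date and footprint from each record and store them in four variables, with the type replaced by a number that defines the custom order.
These four variables are then printed tab-separated with the original record appended, the output is sorted as required, and the first four fields are removed again with cut. Output: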
    L1_Data/level1/T32TQR/S2B_MSIL1C_20200101T100319_N0208_R122_T32TQR_20200101T110654.SAFE QUEUED
    L1_Data/level1/T32TQR/S2B_MSIL1C_20200104T101319_N0208_R022_T32TQR_20200104T121239.SAFE QUEUED
    L1_Data/level1/192027/LC08_L1TP_192027_20201212_20210313_01_T1 QUEUED
    L1_Data/level1/T32TQS/S2B_MSIL1C_20200101T100319_N0208_R122_T32TQS_20200101T110654.SAFE DONE
    L1_Data/level1/T33TUL/S2B_MSIL1C_20200101T100319_N0208_R122_T33TUL_20200101T110654.SAFE DONE
    L1_Data/level1/T33TUM/S2B_MSIL1C_20200101T100319_N0208_R122_T33TUM_20200101T110654.SAFE DONE
    L1_Data/level1/T32TQS/S2A_MSIL1C_20200102T102421_N0208_R065_T32TQS_20200102T105534.SAFE DONE
    L1_Data/level1/T33TUL/S2B_MSIL1C_20200104T101319_N0208_R022_T33TUL_20200104T121239.SAFE DONE
    L1_Data/level1/T32TQS/S2A_MSIL1C_20200106T100401_N0208_R122_T32TQS_20200106T103423.SAFE DONE
    L1_Data/level1/192027/LC08_L1TP_192027_20201126_20210316_01_T1 DONE
    L1_Data/level1/192028/LC08_L1TP_192028_20201126_20210316_01_T1 DONE
    L1_Data/level1/192029/LC08_L1TP_192029_20201126_20210316_01_T1 DONE
    L1_Data/level1/191027/LC08_L1TP_191027_20201221_20210310_01_T1 DONE
    L1_Data/level1/191027/LE07_L1TP_191027_20201127_20201223_01_T1 DONE
    L1_Data/level1/191029/LE07_L1TP_191029_20201127_20201223_01_T1 DONE
    L1_Data/level1/191028/LE07_L1TP_191028_20201213_20210108_01_T1 DONE
    L1_Data/level1/191029/LE07_L1TP_191029_20201213_20210108_01_T1 DONE
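To see what sort actually receives, the decorate step can be run on its own and the keys kept instead of being cut away. The following is a sketch under assumptions, not the answer's exact code: the function name `decorate` and the variable `fp` are hypothetical, unknown prefixes are silently skipped here rather than reported, and the two sample records are copied from the question.

```shell
#!/bin/sh
# Decorate step only: prefix each record with its four sort keys
# (status, numeric type, date, footprint), separated by tabs.
decorate() {
  awk -F'[/ ]' -v OFS='\t' '
    {
      split($(NF-1), parts, "_")
      if (parts[1] ~ /^S2[AB]$/)   { type = 1; date = substr(parts[3], 1, 8); fp = parts[6] }
      else if (parts[1] == "LC08") { type = 2; date = parts[4]; fp = parts[3] }
      else if (parts[1] == "LE07") { type = 3; date = parts[4]; fp = parts[3] }
      else next
      print $NF, type, date, fp, $0
    }'
}

# Sort on the keys as in the pipeline above, then keep only the keys.
keys=$(printf '%s\n' \
  'L1_Data/level1/191027/LE07_L1TP_191027_20201127_20201223_01_T1 DONE' \
  'L1_Data/level1/T32TQR/S2B_MSIL1C_20200101T100319_N0208_R122_T32TQR_20200101T110654.SAFE QUEUED' \
  | decorate | sort -k1,1r -k2,2n -k3,3 -k4,4 | cut -f1-4)
printf '%s\n' "$keys"
```

With the keys in fixed columns, the two naming conventions no longer matter to sort: the QUEUED Sentinel record (type 1) comes out ahead of the DONE Landsat 7 record (type 3).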
The problem is that the sort fields are not in the same columns on every line.
I'd use perl here for maximum flexibility. This is "custom_sort.pl":
    #!/usr/bin/env perl
    use strict;
    use warnings;

    my @data;
    while (<>) {
        # capture the fields of an "L" (Landsat) satellite
        if (/.*\/(L...)_.*?_(\d+)_(\d+)\S+\s+(.*)/) {
            push @data, [$_, $4, $1, $3, $2];
        }
        # capture the fields of an "S" (Sentinel) satellite
        elsif (/.*\/(S..)_.*?_(\d{8}).*?_.*?_.*?_(.*?)_\S+\s+(.*)/) {
            push @data, [$_, $4, $1, $2, $3];
        }
    }

    sub mysort {
        -($a->[1] cmp $b->[1])                   # job status, descending
            || cmp_satellite($a->[2], $b->[2])   # satellite
            || $a->[3] <=> $b->[3]               # recording date
            || $a->[4] cmp $b->[4]               # footprint
    }

    sub cmp_satellite {
        my ($x, $y) = @_;
        my $x_is_s = $x =~ /^S/;
        my $y_is_s = $y =~ /^S/;
        return  0 if $x_is_s && $y_is_s;   # S2A and S2B rank equally
        return -1 if $x_is_s;
        return +1 if $y_is_s;
        return $x cmp $y;                  # LC08 sorts before LE07
    }

    print $_->[0] for sort mysort @data;
Run it with:

    perl custom_sort.pl file