Filenames

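How to sort a list of filenames (in a txt file) by substrings of the filenames/paths, using several sort levels. Special challenge: two types of filename convention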

  • July 7, 2021

I would like to sort the following list of filenames/paths.

L1_Data/level1/192027/LC08_L1TP_192027_20201126_20210316_01_T1 DONE
L1_Data/level1/192028/LC08_L1TP_192028_20201126_20210316_01_T1 DONE
L1_Data/level1/192029/LC08_L1TP_192029_20201126_20210316_01_T1 DONE
L1_Data/level1/191027/LE07_L1TP_191027_20201127_20201223_01_T1 DONE
L1_Data/level1/191029/LE07_L1TP_191029_20201127_20201223_01_T1 DONE
L1_Data/level1/192027/LC08_L1TP_192027_20201212_20210313_01_T1 QUEUED
L1_Data/level1/191028/LE07_L1TP_191028_20201213_20210108_01_T1 DONE
L1_Data/level1/191029/LE07_L1TP_191029_20201213_20210108_01_T1 DONE
L1_Data/level1/191027/LC08_L1TP_191027_20201221_20210310_01_T1 DONE
L1_Data/level1/T32TQS/S2B_MSIL1C_20200101T100319_N0208_R122_T32TQS_20200101T110654.SAFE DONE
L1_Data/level1/T32TQR/S2B_MSIL1C_20200101T100319_N0208_R122_T32TQR_20200101T110654.SAFE QUEUED
L1_Data/level1/T33TUL/S2B_MSIL1C_20200101T100319_N0208_R122_T33TUL_20200101T110654.SAFE DONE
L1_Data/level1/T33TUM/S2B_MSIL1C_20200101T100319_N0208_R122_T33TUM_20200101T110654.SAFE DONE
L1_Data/level1/T32TQS/S2A_MSIL1C_20200102T102421_N0208_R065_T32TQS_20200102T105534.SAFE DONE
L1_Data/level1/T33TUL/S2B_MSIL1C_20200104T101319_N0208_R022_T33TUL_20200104T121239.SAFE DONE
L1_Data/level1/T32TQR/S2B_MSIL1C_20200104T101319_N0208_R022_T32TQR_20200104T121239.SAFE QUEUED
L1_Data/level1/T32TQS/S2A_MSIL1C_20200106T100401_N0208_R122_T32TQS_20200106T103423.SAFE DONE

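Each line contains a filename (including its path) and its job status (QUEUED/DONE). Each filename carries information about the satellite image data, such as the satellite type, the recording date and the footprint.

Now I would like to reorder the list according to the following priorities:

  1. Job status -> QUEUED first. On its own this step is not a problem for me, but a solution for the following steps has to be combined with it (a more detailed description of my problem follows after the picture below):
  2. Satellite type (S2A=Sentinel 2A; S2B=Sentinel 2B; LC08=Landsat 8; LE07=Landsat 7) -> S2A/B first (regardless of whether it is A or B), then LC08, then LE07. In other words: I want to distinguish between Sentinel 2, Landsat 8 and Landsat 7, but not between Sentinel 2A and Sentinel 2B.
  3. Recording date, ascending
  4. Footprint, ascending

The following picture shows the positions of the relevant substrings; the description of my problem comes after it.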

[Image not preserved: positions of the relevant substrings in the two filename types]

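Besides having only a very basic understanding of the sort command, my particular problems are:

  • a) addressing the substrings correctly, in
  • b) two different filename types (conventions),
  • c) underscores cannot be used as separators, because there are five underscores in the Sentinel filenames and six in the Landsat ones, and on top of that the sequence of substrings differs between the two.
  • d) unfortunately, the order S2A/B before LC08 before LE07 is not alphabetical, and
  • e) treating the S2A and S2B satellites as one unit. This could of course be solved by addressing only S2, but since that is just two characters, there is some risk of confusing it with other parts of the whole filename string (the real list is much longer and is updated from time to time, so other or later lines might contain a 'false' S2).

In the end, the reordered list should look like this: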

L1_Data/level1/T32TQR/S2B_MSIL1C_20200101T100319_N0208_R122_T32TQR_20200101T110654.SAFE QUEUED
L1_Data/level1/T32TQR/S2B_MSIL1C_20200104T101319_N0208_R022_T32TQR_20200104T121239.SAFE QUEUED
L1_Data/level1/192027/LC08_L1TP_192027_20201212_20210313_01_T1 QUEUED
L1_Data/level1/T32TQS/S2B_MSIL1C_20200101T100319_N0208_R122_T32TQS_20200101T110654.SAFE DONE
L1_Data/level1/T33TUL/S2B_MSIL1C_20200101T100319_N0208_R122_T33TUL_20200101T110654.SAFE DONE
L1_Data/level1/T33TUM/S2B_MSIL1C_20200101T100319_N0208_R122_T33TUM_20200101T110654.SAFE DONE
L1_Data/level1/T32TQS/S2A_MSIL1C_20200102T102421_N0208_R065_T32TQS_20200102T105534.SAFE DONE
L1_Data/level1/T33TUL/S2B_MSIL1C_20200104T101319_N0208_R022_T33TUL_20200104T121239.SAFE DONE
L1_Data/level1/T32TQS/S2A_MSIL1C_20200106T100401_N0208_R122_T32TQS_20200106T103423.SAFE DONE
L1_Data/level1/192027/LC08_L1TP_192027_20201126_20210316_01_T1 DONE
L1_Data/level1/192028/LC08_L1TP_192028_20201126_20210316_01_T1 DONE
L1_Data/level1/192029/LC08_L1TP_192029_20201126_20210316_01_T1 DONE
L1_Data/level1/191028/LE07_L1TP_191028_20201213_20210108_01_T1 DONE
L1_Data/level1/191029/LE07_L1TP_191029_20201213_20210108_01_T1 DONE
L1_Data/level1/191027/LE07_L1TP_191027_20201127_20201223_01_T1 DONE
L1_Data/level1/191029/LE07_L1TP_191029_20201127_20201223_01_T1 DONE

Can anyone help me?

Using awk, sort and cut:

awk -F'[/ ]' -v OFS='\t' '
{
 status=$NF # the job status is the last field

 split($(NF-1), parts, "_") # split the filename on "_" into array `parts`

 # map the satellite to a number that encodes the custom order
 if (parts[1]=="S2A" || parts[1]=="S2B") type=1
 else if (parts[1]=="LC08") type=2
 else if (parts[1]=="LE07") type=3
 else { print "error, got unknown type " parts[1] > "/dev/stderr"; exit 1 }

 # date and footprint sit at different positions in the two conventions
 date=(type==1 ? substr(parts[3], 1, 8) : parts[4])
 footprint=(type==1 ? parts[6] : parts[3])

 print status, type, date, footprint, $0
}
' file | sort -t "$(printf '\t')" -k1,1r -k2,2n -k3,3 -k4,4 | cut -f5-

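The idea is to extract the job status, satellite type, recording date and footprint from each record and store them in four variables, with the type replaced by a number so that it defines the custom order.

These four variables are then printed tab-separated, followed by the original record. The output is sorted as required, and finally the first four fields are removed with cut.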
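To make the sort and cut stages easier to follow in isolation, here is a minimal sketch that feeds them four hand-written, already decorated records. The record names (record-LC08 and so on) are placeholders invented for this illustration, and passing the tab explicitly with -t is an assumption of the sketch, chosen so that the space inside a real record can never be taken for a field separator.

```shell
#!/bin/sh
# Hypothetical decorated records: status, type, date, footprint, record.
tab=$(printf '\t')

sorted=$(
  printf '%s\t%s\t%s\t%s\t%s\n' \
    DONE   3 20201127 191027 'record-LE07' \
    DONE   1 20200102 T32TQS 'record-S2A' \
    QUEUED 2 20201212 192027 'record-LC08' \
    DONE   1 20200101 T33TUL 'record-S2B' |
  # key 1: status, reversed so that QUEUED precedes DONE
  # key 2: numeric satellite rank; keys 3 and 4: date and footprint, ascending
  sort -t "$tab" -k1,1r -k2,2n -k3,3 -k4,4 |
  # drop the four helper fields again, keeping only the record
  cut -f5-
)

printf '%s\n' "$sorted"
```

This prints record-LC08, record-S2B, record-S2A and record-LE07, one per line: the only QUEUED record comes first, and within DONE the rank (1 before 3) wins over the date.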

Output:

L1_Data/level1/T32TQR/S2B_MSIL1C_20200101T100319_N0208_R122_T32TQR_20200101T110654.SAFE QUEUED
L1_Data/level1/T32TQR/S2B_MSIL1C_20200104T101319_N0208_R022_T32TQR_20200104T121239.SAFE QUEUED
L1_Data/level1/192027/LC08_L1TP_192027_20201212_20210313_01_T1 QUEUED
L1_Data/level1/T32TQS/S2B_MSIL1C_20200101T100319_N0208_R122_T32TQS_20200101T110654.SAFE DONE
L1_Data/level1/T33TUL/S2B_MSIL1C_20200101T100319_N0208_R122_T33TUL_20200101T110654.SAFE DONE
L1_Data/level1/T33TUM/S2B_MSIL1C_20200101T100319_N0208_R122_T33TUM_20200101T110654.SAFE DONE
L1_Data/level1/T32TQS/S2A_MSIL1C_20200102T102421_N0208_R065_T32TQS_20200102T105534.SAFE DONE
L1_Data/level1/T33TUL/S2B_MSIL1C_20200104T101319_N0208_R022_T33TUL_20200104T121239.SAFE DONE
L1_Data/level1/T32TQS/S2A_MSIL1C_20200106T100401_N0208_R122_T32TQS_20200106T103423.SAFE DONE
L1_Data/level1/192027/LC08_L1TP_192027_20201126_20210316_01_T1 DONE
L1_Data/level1/192028/LC08_L1TP_192028_20201126_20210316_01_T1 DONE
L1_Data/level1/192029/LC08_L1TP_192029_20201126_20210316_01_T1 DONE
L1_Data/level1/191027/LC08_L1TP_191027_20201221_20210310_01_T1 DONE
L1_Data/level1/191027/LE07_L1TP_191027_20201127_20201223_01_T1 DONE
L1_Data/level1/191029/LE07_L1TP_191029_20201127_20201223_01_T1 DONE
L1_Data/level1/191028/LE07_L1TP_191028_20201213_20210108_01_T1 DONE
L1_Data/level1/191029/LE07_L1TP_191029_20201213_20210108_01_T1 DONE

The problem is that the sort fields are not in the same columns on every line.
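A quick way to see the mismatch is to split one basename of each kind on "_" with cut. This is only an illustration of the field positions, using two names from the list in the question; the variable names are made up for the example.

```shell
#!/bin/sh
# One basename per convention, copied from the list in the question.
landsat='LC08_L1TP_192027_20201126_20210316_01_T1'
sentinel='S2B_MSIL1C_20200101T100319_N0208_R122_T32TQR_20200101T110654.SAFE'

# Landsat: the footprint is field 3 and the recording date is field 4.
landsat_keys=$(printf '%s\n' "$landsat" | cut -d_ -f3,4)

# Sentinel: the recording date (with a time suffix) is field 3,
# while the footprint is field 6.
sentinel_keys=$(printf '%s\n' "$sentinel" | cut -d_ -f3,6)

printf '%s\n%s\n' "$landsat_keys" "$sentinel_keys"
```

The first line of output is 192027_20201126 and the second is 20200101T100319_T32TQR, so a single fixed pair of sort key positions cannot serve both conventions.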

I would use perl here for maximum flexibility. This is "custom_sort.pl":

#!/usr/bin/env perl
use strict;
use warnings;

my @data;
while (<>) {
   # capture the fields of an "L" satellite: name, footprint, date, status
   if (/.*\/(L...)_.*?_(\d+)_(\d+)\S+\s+(.*)/) {
       push @data, [$_, $4, $1, $3, $2];
   }
   # capture the fields of an "S" satellite: name, date, footprint, status
   elsif (/.*\/(S..)_.*?_(\d{8}).*?_.*?_.*?_(.*?)_\S+\s+(.*)/) {
       push @data, [$_, $4, $1, $2, $3];
   }
}

sub mysort {
   -($a->[1] cmp $b->[1])              # work status, descending
   || cmp_satellite($a->[2], $b->[2])  # satellite
   || $a->[3] <=> $b->[3]              # record date
   || $a->[4] cmp $b->[4]              # footprint
}
sub cmp_satellite {
   my ($x, $y) = @_;
   my ($x_is_s, $y_is_s) = map { /^S/ ? 1 : 0 } $x, $y;
   return 0  if $x_is_s && $y_is_s;    # S2A and S2B rank equally, so the date decides
   return -1 if $x_is_s;
   return +1 if $y_is_s;
   return $x cmp $y;                   # "LC08" sorts before "LE07"
}

print $_->[0] for sort mysort @data;

Run it with:

perl custom_sort.pl file

Quoted from: https://unix.stackexchange.com/questions/657339