Shell-Script
How can I sample without replacement in a script that uses shuf to extract 200 random characters?
I have this script that extracts 200 random characters from a set:
#!/usr/bin/bash
n=$(stat -c "%s" newfile.txt)
r=$(shuf -i1-"$((n-200+1))" -n1)
< newfile.txt tail -c+"$r" | head -c200
for N in {1..10000}; do bash extraction_200pb.sh; done > output.txt
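As an aside, the 10000 separate script invocations can be collapsed into a single shuf call, which avoids forking stat and shuf once per sample. A minimal sketch (the function name draw_many and its parameters are mine, not from the post; this still samples with replacement, exactly like the loop above):

```shell
# Sketch only: draw_many and its arguments are illustrative names,
# not part of the original script.
draw_many () {
    local f=$1 count=$2 lth=$3 n
    n=$(stat -c "%s" "$f")
    # -r lets shuf repeat offsets, i.e. sampling WITH replacement,
    # matching the behaviour of the original 10000-iteration loop.
    shuf -r -i1-"$((n - lth + 1))" -n"$count" |
    while read -r r; do
        tail -c+"$r" "$f" | head -c"$lth"
        echo
    done
}
# e.g.: draw_many newfile.txt 10000 200 > output.txt
```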
I know shuf is very powerful, but I would like to add sampling without replacement. That is, each 200-character extraction should have only one chance of being selected. The output should look like this:
>1
GAACTCTACCAAAAGGTATGTTGCTTTCACAAAAAGCTGCATTCGATCATGTGTATAATCTAGCAAAACTAGTAGGAGGAGCAAAATACCCCGAAATTGTTGCTGCTCAGGCAATGCACGAATCAAACTACCTAGATCCTAGGACTAATAGTGTTTATAATGCCACAAATAGAACTAATGCTTTCGGTCAAACTGGTGAC
>2
GCCTACCGCATAAAACAGCATCACCGCCACGGCTTCAGGGTATTCTCCAATGGCAAAGGCTCCCATGGTCGCGATGGACATTAAGAGAAATTCAGTAAAGAAATCTCCATTTAGAATACTTTTGAATCCTTCTTTTATCACCGGAAAACCAACTGGGAGATAGGCCACAATGTACCAACCTACTCGCACCCAATCTGTAA
>3
GCACGTGTCACCGTCAGCATCGCGGCAGCGGAACGGGTCACCCGGATTGCTGTCGGGACCATCGTTTACGCCGTCATTGTCGTTATCGGGATCGCCCGGATTACAAATGCCGTCGCCATCGACGTCGTTACCGTCGTTCGCGGCATCGGGGAAGCCGGCACCGGCGGCACAGTCATCGCAACCGTCGCCATCGGCATCGA
>4
GCGTTCGAAGCAATTGCACGAGACCCAAACAACGAATTGCTGGTTGTTGAACTGGAAAACTCTCTAGTCGGAATGCTTCAAATTACTTATATTCCCTACCTGACACATATTGGCAGTTGGCGTTGTCTTATAGAAGGTGTTCGAATCCATAGTGACTATCGTGGACGAGGTTTTGGTGAGCAAATGTTCGCACATGCGAT
>5
GTTTAAGACTAACAGCAATCTGTAAGGACATAGGTGCTGGAGTTGAGGTTAGTCTGGAAGATATGATCTGGGCAGAGAAATTGTCCAAAGCAAACACCGCAGCAAGAGGTATGCTAAACACAGCAAGAAGAATAAGTAATGATCCTACTGATTCTTTTCTGAATGAGTTGAATATAGGAGACCCCGACTCAACTCATCAT
The input file is a ~8G file that looks like this:
CCAAGATCGCTGGTTGGCGAATCAATTTCATAAACGCCTACGCTTTCAAGGAACGTGTTAAGAATGTTCT GGCCGAGTTCCTTATGAGACGTTTCGCGTCCCTTAAATCGAATAACGACACGAACCTTGTCGCCGTCATT AAGAAAACCCTTTGCCTTCTTGGCCTTAATCTGAATATCACGGGTGTCCGTTACAGGTCGCAACTGGATT TCCTTGACTTCAGAAACAGACTTACGTGAATTCTTCTTGATTTCTTTCTGACGCTTTTCATTTTCATACT GGAACTTGCCGTAATCAATGATCTTACAAACAGGAATATCACCCTTATCAGAGATCAATACCAAATCAAG TTCGGCATCAAAAGCGCGATCAAGTGCGTCTTCAATGTCGAGGACCGTTGTTTCTTCACCGTCAACCAAA CGAATTGTGGAGGACTTGATGTCGTCTCGGGTACTAATTTTATTCACGTATATGTTACTCCTTATGTTGT
Any help would be appreciated. Thanks in advance.
Here is an implementation of a solution in awk. The test data is 8 GB of pseudo-random hex digits (actually, hex conversions of about 12 man pages, repeated 3300 times). It has about 11 million lines, averaging 725 bytes per line.
This is a timed execution:
Paul--) ls -l tdHuge.txt
-rw-r--r-- 1 paul paul 8006529300 Dec 24 22:38 tdHuge.txt
Paul--) ./rndSelect
inFile ./tdHuge.txt; Size 8006529300; Count 10000; Lth 200; maxIter 50; Db 1;
Iteration   1 needs  10000
Iteration   2 needs   2712
Overlap   9561: 7663038508 to 7663038657
Iteration   3 needs    728
Iteration   4 needs    195
Iteration   5 needs     50
Iteration   6 needs     11
Iteration   7 needs      2
Required 7 iterations
Reporting 10000 samples

real    2m3.326s
user    0m3.496s
sys     0m10.340s
Paul--) wc Result.txt
  20000   20000 2068894 Result.txt
Paul--) head -n 8 Result.txt | cut -c 1-40
>1
5706C69636174656420696E666F726D6174696F6
>2
20737472696E672028696E207768696368206361
>3
20646F6573206E6F742067657420612068617264
>4
647320616E642073746F7265732E204966207468
Paul--) tail -n 8 Result.txt | cut -c 1-40
>9997
6F7374207369676E69666963616E7420646F7562
>9998
7472696E676F702D73747261746567793D616C67
>9999
865726520736F6D652066696C6573206D7573742
>10000
5726E65642E205768656E20746865202D66206F7
Paul--)
It needs to iterate because it makes random probes into the file. If a probe overlaps an adjacent probe or a newline, it is discarded and a small new batch of probes is made (a probe that crosses a newline comes back shorter than the sample length, and is caught by the length check). With an average line length of 725 bytes and a sample length of 200, almost 30% of probes land too close to the end of a line to be acceptable. We don't know the average line length of the real data; longer lines would improve the success rate.
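The "almost 30%" figure can be sanity-checked with a one-liner: if probe start offsets are uniform, a probe is rejected whenever it starts within the last Lth-1 bytes of a line, so the expected rejection rate is roughly (Lth-1)/averageLineLength:

```shell
# Rough rejection-rate estimate; 725 is the average line length of the
# test data described above, and offsets are assumed uniform.
awk 'BEGIN { Lth = 200; avg = 725; printf "%.1f%%\n", 100 * (Lth - 1) / avg }'
# prints 27.4%
```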
We also don't know whether header lines are still present in the file (as mentioned in the related earlier question of 4 Dec 2020). But if every header line is shorter than the 200-character sample length, header lines will be discarded anyway (a happy accident).
The code is mostly GNU awk (with minimal bash) and has some comments. There is plenty of residual debugging, which can be hidden by setting Db=0 in the options.
#! /bin/bash
#.. Select random non-overlapping character groups from a file.

export LC_ALL="C"

#.. These are the optional values that will need to be edited.
#.. Command options could be used to set these from script arguments.
inFile="./tdHuge.txt"
outFile="./Result.txt"

Count=10000     #.. Number of samples.
Lth=200         #.. Length of each sample.
maxIter=50      #.. Prevents excessive attempts.

Size="$( stat -c "%s" "${inFile}" )"
Seed="$( date '+%N' )"
Db=1

#.. Extracts random non-overlapping character groups from a file.
Selector () {

    local Awk='
#.. Seed the random number generation, and show the args being used.
BEGIN {
    NIL = ""; NL = "\n"; SQ = "\047";
    srand (Seed % PROCINFO["pid"]);
    if (Db) printf ("inFile %s; Size %d; Count %d; Lth %d; maxIter %s; Db %s;\n",
        inFile, Size, Count, Lth, maxIter, Db);
    fmtCmd = "dd bs=%d count=1 if=%s iflag=skip_bytes skip=%d status=none";
}
#.. Constructs an array of random file offsets, replacing overlaps.
#.. Existing offsets are indexed from 1 to Count, deleting overlaps.
#.. Additions are indexed from -1 down to -N to avoid clashes.
function Offsets (N, Local, Iter, nSeek, Seek, Text, j) {

    while (N > 0 && Iter < maxIter) {
        ++Iter;
        if (Db) printf ("Iteration %3d needs %6d\n", Iter, N);
        for (j = 1; j <= N; ++j) {
            Seek[-j] = int ((Size - Lth) * rand());
            Text[Seek[-j]] = getSample( Seek[-j], Lth);
            if (Db7) printf ("Added %10d: %s\n", Seek[-j], Text[Seek[-j]]);
        }

        #.. Reindex in numeric order for overlap checking.
        nSeek = asort (Seek);
        if (Db7) for (j in Seek) printf ("%6d: %10d\n", j, Seek[j]);

        #.. Discard offsets that overlap the next selection.
        N = 0;
        for (j = 1; j < nSeek; ++j) {
            if (Seek[j] + Lth > Seek[j+1]) {
                if (Db) printf ("Overlap %6d: %10d to %10d\n",
                    j, Seek[j], Seek[j+1]);
                ++N; delete Text[Seek[j]]; delete Seek[j];
            } else if (length (Text[Seek[j]]) < Lth) {
                if (Db7) printf ("Short %6d: %10d\n", j, Seek[j]);
                ++N; delete Text[Seek[j]]; delete Seek[j];
            }
        }
    }

    if (Iter >= maxIter) {
        printf ("Failed with overlaps after %d iterations\n", Iter);
    } else {
        printf ("Required %d iterations\n", Iter);
        Samples( nSeek, Seek, Text);
    }
}
#.. Returns n bytes from the input file from position p.
function getSample (p, n, Local, cmd, tx) {
    cmd = sprintf (fmtCmd, n, SQ inFile SQ, p);
    if (Db7) printf ("cmd :%s:\n", cmd);
    cmd | getline tx; close (cmd);
    return (tx);
}
#.. Send samples to the output file.
function Samples (nSeek, Seek, Text, Local, j) {
    printf ("Reporting %d samples\n", nSeek);
    for (j = 1; j <= nSeek; ++j) {
        printf (">%d\n%s\n", j, Text[Seek[j]]) > outFile;
    }
    close (outFile);
}
END { Offsets( Count); }
'
    echo | awk -v Size="${Size}" -v inFile="${inFile}" \
        -v outFile="${outFile}" -v Count="${Count}" -v Lth="${Lth}" \
        -v maxIter="${maxIter}" \
        -v Db="${Db}" -v Seed="${Seed}" -f <( printf '%s' "${Awk}" )
}

#.. Test.
time Selector
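For comparison, overlap rejection can be avoided entirely by sampling on a fixed grid: partition the file into non-overlapping Lth-byte blocks and let shuf pick block indices without replacement, so no probe can ever collide with another. A sketch (the name sample_grid and its arguments are illustrative, not from the script above; the trade-off is that samples can only start on block boundaries, and a block may still span a newline):

```shell
# Sketch: grid-aligned sampling without replacement.
# sample_grid FILE COUNT LTH -- names are illustrative, not from the post.
sample_grid () {
    local f=$1 count=$2 lth=$3 n i=0
    n=$(stat -c "%s" "$f")
    # Without -r, shuf emits each block index at most once,
    # so no two samples can overlap.
    shuf -i0-"$((n / lth - 1))" -n"$count" |
    while read -r b; do
        printf '>%d\n' "$((++i))"
        dd if="$f" bs="$lth" skip="$b" count=1 status=none
        printf '\n'
    done
}
# e.g.: sample_grid tdHuge.txt 10000 200 > Result.txt
```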