Shell-Script

How do I sample without replacement in a script that uses shuf to extract 200 random characters?

  • December 25, 2020

I have this script, which extracts 200 random characters from a file:

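#!/usr/bin/bash
# extraction_200pb.sh: print 200 bytes starting at one random offset.
n=$(stat -c "%s" newfile.txt)
r=$(shuf -i1-"$((n-200+1))" -n1)
< newfile.txt tail -c+"$r" | head -c200

# Run from the shell to collect 10000 samples:
for N in {1..10000}; do bash extraction_200pb.sh; done > output.txt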

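I know shuf is very powerful, but I want the sampling to be done without replacement. That is, each 200-character stretch should have only one chance of being selected.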

The output should look like this:

>1     
GAACTCTACCAAAAGGTATGTTGCTTTCACAAAAAGCTGCATTCGATCATGTGTATAATCTAGCAAAACTAGTAGGAGGAGCAAAATACCCCGAAATTGTTGCTGCTCAGGCAATGCACGAATCAAACTACCTAGATCCTAGG
ACTAATAGTGTTTATAATGCCACAAATAGAACTAATGCTTTCGGTCAAACTGGTGAC
>2     
GCCTACCGCATAAAACAGCATCACCGCCACGGCTTCAGGGTATTCTCCAATGGCAAAGGCTCCCATGGTCGCGATGGACATTAAGAGAAATTCAGTAAAGAAATCTCCATTTAGAATACTTTTGAATCCTTCTTTTATCACCG
GAAAACCAACTGGGAGATAGGCCACAATGTACCAACCTACTCGCACCCAATCTGTAA
>3     
GCACGTGTCACCGTCAGCATCGCGGCAGCGGAACGGGTCACCCGGATTGCTGTCGGGACCATCGTTTACGCCGTCATTGTCGTTATCGGGATCGCCCGGATTACAAATGCCGTCGCCATCGACGTCGTTACCGTCGTTCGCGG
CATCGGGGAAGCCGGCACCGGCGGCACAGTCATCGCAACCGTCGCCATCGGCATCGA
>4     
GCGTTCGAAGCAATTGCACGAGACCCAAACAACGAATTGCTGGTTGTTGAACTGGAAAACTCTCTAGTCGGAATGCTTCAAATTACTTATATTCCCTACCTGACACATATTGGCAGTTGGCGTTGTCTTATAGAAGGTGTTCG
AATCCATAGTGACTATCGTGGACGAGGTTTTGGTGAGCAAATGTTCGCACATGCGAT
>5     
GTTTAAGACTAACAGCAATCTGTAAGGACATAGGTGCTGGAGTTGAGGTTAGTCTGGAAGATATGATCTGGGCAGAGAAATTGTCCAAAGCAAACACCGCAGCAAGAGGTATGCTAAACACAGCAAGAAGAATAAGTAATGAT
CCTACTGATTCTTTTCTGAATGAGTTGAATATAGGAGACCCCGACTCAACTCATCAT

The input file is a ~8G file that looks like this:

CCAAGATCGCTGGTTGGCGAATCAATTTCATAAACGCCTACGCTTTCAAGGAACGTGTTAAGAATGTTCT
GGCCGAGTTCCTTATGAGACGTTTCGCGTCCCTTAAATCGAATAACGACACGAACCTTGTCGCCGTCATT
AAGAAAACCCTTTGCCTTCTTGGCCTTAATCTGAATATCACGGGTGTCCGTTACAGGTCGCAACTGGATT
TCCTTGACTTCAGAAACAGACTTACGTGAATTCTTCTTGATTTCTTTCTGACGCTTTTCATTTTCATACT
GGAACTTGCCGTAATCAATGATCTTACAAACAGGAATATCACCCTTATCAGAGATCAATACCAAATCAAG
TTCGGCATCAAAAGCGCGATCAAGTGCGTCTTCAATGTCGAGGACCGTTGTTTCTTCACCGTCAACCAAA
CGAATTGTGGAGGACTTGATGTCGTCTCGGGTACTAATTTTATTCACGTATATGTTACTCCTTATGTTGT

Any help would be greatly appreciated. Thanks in advance.

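Here is an implementation of a solution in awk. The data is 8GB of pseudo-random hex digits (actually a hex conversion of about 12 man pages, repeated 3300 times). It has about 11 million lines, averaging 725 bytes per line.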

This is a timed run.

Paul--) ls -l tdHuge.txt
-rw-r--r-- 1 paul paul 8006529300 Dec 24 22:38 tdHuge.txt
Paul--) ./rndSelect
inFile ./tdHuge.txt; Size 8006529300; Count 10000; Lth 200; maxIter 50; Db 1;
Iteration   1 needs  10000
Iteration   2 needs   2712
Overlap   9561: 7663038508 to 7663038657
Iteration   3 needs    728
Iteration   4 needs    195
Iteration   5 needs     50
Iteration   6 needs     11
Iteration   7 needs      2
Required 7 iterations
Reporting 10000 samples

real    2m3.326s
user    0m3.496s
sys 0m10.340s
Paul--) wc Result.txt
 20000   20000 2068894 Result.txt
Paul--) head -n 8 Result.txt | cut -c 1-40
>1
5706C69636174656420696E666F726D6174696F6
>2
20737472696E672028696E207768696368206361
>3
20646F6573206E6F742067657420612068617264
>4
647320616E642073746F7265732E204966207468
Paul--) tail -n 8 Result.txt | cut -c 1-40
>9997
6F7374207369676E69666963616E7420646F7562
>9998
7472696E676F702D73747261746567793D616C67
>9999
865726520736F6D652066696C6573206D7573742
>10000
5726E65642E205768656E20746865202D66206F7
Paul--) 

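It has to iterate because it probes the file at random offsets. If a probe overlaps a neighbouring probe or a newline, it is discarded and a smaller batch of new probes is made. With an average line length of 725 bytes and a required sample length of 200, almost 30% of probes land too close to the end of a line to be accepted. We do not know the average line length of the real data; longer lines would improve the success rate.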
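The iteration counts in the timed run are roughly what a fixed rejection rate predicts. The following is only a rough sketch: it assumes every line is exactly 725 bytes plus a newline (the real lines merely average 725) and it ignores probe-to-probe overlaps. Under those assumptions a probe fails when it starts within the last 200 positions of a 726-byte record. The function name expected_retries is hypothetical.

```shell
#!/bin/bash
#.. Hypothetical helper, not part of the original script.
#.. Assumes fixed-length lines and ignores overlaps between probes.
expected_retries () {
    #.. Args: sample count, sample length, line length (without newline).
    awk -v Count="$1" -v Lth="$2" -v LineLen="$3" 'BEGIN {
        #.. A probe is short if it starts in the last Lth bytes of a
        #.. record (LineLen data bytes plus one newline).
        p = Lth / (LineLen + 1);
        need = Count;
        for (iter = 1; need >= 1; ++iter) {
            printf ("Iteration %3d needs %6d\n", iter, need);
            need = int (need * p + 0.5);
        }
    }'
}

expected_retries 10000 200 725
```

For the 10000-sample run this predicts 2755 probes in the second pass, against the 2712 seen in the timed run. Some difference is expected because the real line lengths vary.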

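We also do not know whether header lines are still present in the file (as described in the earlier related question of December 4, 2020). However, if every header line is shorter than the sample length of 200, header lines will be discarded anyway (a happy accident).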
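The reason short lines drop out is that each probe keeps only the text up to the first newline, and anything shorter than the sample length is rejected. A small sketch of that effect, using a made-up two-line file and the tail/head pair from the question in place of dd (the names probe_length, demo and hdr are hypothetical):

```shell
#!/bin/bash
#.. Hypothetical helper: length of the first line a probe would return.
probe_length () {
    #.. Args: file, byte offset (0-based), sample length.
    #.. awk keeps only the first line, as a single getline would.
    tail -c +"$(( $2 + 1 ))" "$1" | head -c "$3" |
        awk 'NR == 1 { n = length ($0) } END { print n + 0 }'
}

demo="$(mktemp)"
hdr='>header shorter than the sample'
{
    printf '%s\n' "${hdr}"
    printf 'ACGT%.0s' {1..100}; printf '\n'     #.. One 400-byte data line.
} > "${demo}"

#.. Probe starting in the header: shorter than 200, so it would be rejected.
probe_length "${demo}" 0 200
#.. Probe starting on the data line: a full 200 characters.
probe_length "${demo}" "$(( ${#hdr} + 1 ))" 200
rm -f "${demo}"
```

A header longer than the sample length would not be filtered this way, so this is a side effect rather than a guarantee.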

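The code is mostly GNU/awk (with minimal bash) and has some comments. There is a lot of residual debugging, which can be hidden by setting Db=0 in the options.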
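The overlap test in the listing depends on sorting the offsets, so that each one only has to be compared with its successor. A minimal standalone sketch of that idea, using sort -n and plain awk instead of gawk's asort (the function name drop_overlaps is hypothetical, and the offsets are made up):

```shell
#!/bin/bash
#.. Hypothetical helper: reads offsets on stdin, prints the ones kept.
drop_overlaps () {
    #.. Arg: sample length.
    sort -n | awk -v Lth="$1" '
        #.. Hold each offset back until its successor is known, and
        #.. drop it if its sample would run into that successor.
        NR > 1 { if (prev + Lth <= $1) print prev; }
        { prev = $1; }
        #.. The last offset has no successor to collide with.
        END { if (NR > 0) print prev; }
    '
}

#.. 1000 is dropped because 1000 + 200 runs past 1150.
printf '%s\n' 1000 50 1150 400 | drop_overlaps 200
```

As in the listing, the earlier offset of an overlapping pair is the one discarded, and a replacement is drawn in the next pass.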

#! /bin/bash

#.. Select random non-overlapping character groups from a file.

export LC_ALL="C"

#.. These are the optional values that will need to be edited.
#.. Command options could be used to set these from script arguments.

inFile="./tdHuge.txt"
outFile="./Result.txt"
Count=10000     #.. Number of samples.
Lth=200         #.. Length of each sample.
maxIter=50      #.. Prevents excessive attempts.

Size="$( stat -c "%s" "${inFile}" )"
Seed="$( date '+%N' )"
Db=1

#.. Extracts random non-overlapping character groups from a file.

Selector () {

   local Awk='
#.. Seed the random number generation, and show the args being used.
BEGIN {
   NIL = ""; NL = "\n"; SQ = "\047";
   srand (Seed % PROCINFO["pid"]);
   if (Db) printf ("inFile %s; Size %d; Count %d; Lth %d; maxIter %s; Db %s;\n",
       inFile, Size, Count, Lth, maxIter, Db);
   fmtCmd = "dd bs=%d count=1 if=%s iflag=skip_bytes skip=%d status=none";
}
#.. Constructs an array of random file offsets, replacing overlaps.
#.. Existing offsets are indexed from 1 to Count, deleting overlaps.
#.. Additions are indexed from -1 down to -N to avoid clashes.

function Offsets (N, Local, Iter, nSeek, Seek, Text, j) {

   while (N > 0 && Iter < maxIter) {
       ++Iter;
       if (Db) printf ("Iteration %3d needs %6d\n", Iter, N);

       for (j = 1; j <= N; ++j) {
           Seek[-j] = int ((Size - Lth) * rand());
           Text[Seek[-j]] = getSample( Seek[-j], Lth);
           if (Db7) printf ("Added %10d: %s\n", Seek[-j], Text[Seek[-j]]);
       }
       #.. Reindex in numeric order for overlap checking.
       nSeek = asort (Seek);
       if (Db7) for (j in Seek) printf ("%6d: %10d\n", j, Seek[j]);

       #.. Discard offsets that overlap the next selection, or are short.
       #.. The last offset has no successor, so only its length is checked.
       N = 0; for (j = 1; j <= nSeek; ++j) {
           if (j < nSeek && Seek[j] + Lth > Seek[j+1]) {
               if (Db) printf ("Overlap %6d: %10d to %10d\n",
                   j, Seek[j], Seek[j+1]);
               ++N; delete Text[Seek[j]]; delete Seek[j];
           } else if (length (Text[Seek[j]]) < Lth) {
               if (Db7) printf ("Short   %6d: %10d\n",
                   j, Seek[j]);
               ++N; delete Text[Seek[j]]; delete Seek[j];
           }
       }
   }
   if (Iter >= maxIter) {
       printf ("Failed with overlaps after %d iterations\n", Iter);
   } else {
       printf ("Required %d iterations\n", Iter);
       Samples( nSeek, Seek, Text);
   }
}
#.. Returns n bytes from the input file from position p.
function getSample (p, n, Local, cmd, tx) {

   cmd = sprintf (fmtCmd, n, SQ inFile SQ, p);
   if (Db7) printf ("cmd :%s:\n", cmd);
   cmd | getline tx; close (cmd);
   return (tx);
}
#.. Send samples to the output file.
function Samples (nSeek, Seek, Text, Local, j) {

   printf ("Reporting %d samples\n", nSeek);
   for (j = 1; j <= nSeek; ++j) {
       printf (">%d\n%s\n", j, Text[Seek[j]]) > outFile;
   }
   close (outFile);
}
END { Offsets( Count); }
'
   echo | awk -v Size="${Size}" -v inFile="${inFile}" \
       -v outFile="${outFile}" -v Count="${Count}" -v Lth="${Lth}" \
       -v maxIter="${maxIter}" \
       -v Db="${Db}" -v Seed="${Seed}" -f <( printf '%s' "${Awk}" )
}

#.. Test.

   time Selector
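
A quick sanity check on the output can be sketched as follows. It assumes the two-line layout shown in the timed run (a ">N" header followed by one sample line) and only checks numbering and length, not that the samples are non-overlapping. The function name check_samples is hypothetical.

```shell
#!/bin/bash
#.. Hypothetical checker, not part of the original script.
check_samples () {
    #.. Args: result file, expected sample length.
    awk -v Lth="$2" '
        #.. Odd lines must be the headers >1, >2, ... in order.
        NR % 2 == 1 { if ($0 != (">" ((NR + 1) / 2))) ++bad; next; }
        #.. Even lines must be samples of exactly Lth characters.
        length ($0) != Lth { ++bad; }
        END {
            printf ("%d samples, %d bad lines\n", NR / 2, bad);
            exit (bad > 0);
        }
    ' "$1"
}

if [ -f ./Result.txt ]; then
    check_samples ./Result.txt 200
fi
```

The function returns a non-zero status when any line fails, so it can be used in a pipeline or a test script.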

Source: https://unix.stackexchange.com/questions/625284