Text-Formatting

How to reformat KEGG Mapper output in Linux?

  • October 23, 2018

I need to reformat the KEGG Reconstruct Pathway output. In file1 I have something like this:

00550 Peptidoglycan biosynthesis (2)

K01000

K02563

00511 Other glycan degradation (8) K01190   K01191

K01192

K01201

K01227

K12309

In file2 I need something like this:

00550 Peptidoglycan biosynthesis (2)   K01000   K02563
00511 Other glycan degradation (6)   K01190   K01191   K01192   K01201   K01227   K12309

How can I reformat it in Linux or Python?

Thanks

How far would this get you:

awk '
!NF             {next                                                   # skip empty lines
                }
/^[0-9]+ /      {sub (/\([0-9]*\)/, "(" CNT ")", PRT)                   # pathway header line (leading number):
                                                                        # correct the count in the buffered record
                 if (PRT) print PRT                                     # print the buffer (empty before the first header)
                 PRT = ""                                               # empty it after printing
                 CNT = gsub (/K[0-9]+/, "&") - 1                        # count the "K..." IDs on this line, minus 1
                                                                        # to offset the increment below
                }
                {PRT = sprintf ("%s%s%s", PRT, PRT?" ":"", $0)          # append this line to the buffer
                 CNT++                                                  # increment the "K..." count
                }
END             {sub (/\([0-9]*\)/, "(" CNT ")", PRT)                   # correct the count of the last record
                 print PRT
                }
' file1
00550 Peptidoglycan biosynthesis (2) K01000 K02563
00511 Other glycan degradation (6) K01190   K01191 K01192 K01201 K01227 K12309
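
Since the question also allows Python, here is a minimal sketch of the same idea, not a tested drop-in tool. It assumes that a pathway header always starts with a numeric map ID followed by a space, that every KO identifier is "K" plus exactly five digits, and that the count to correct is the first parenthesized number on the header line. The function name `reformat` and the three-space separator are illustrative choices, picked to match the file2 layout shown in the question.

```python
import re
import sys

# Assumption: a KO identifier is "K" followed by exactly five digits.
KO_RE = re.compile(r"\bK\d{5}\b")
# Assumption: a pathway header starts with a numeric map ID and a space.
HEADER_RE = re.compile(r"^\d+ ")
# Assumption: the count to fix is the first "(digits)" group on the header.
COUNT_RE = re.compile(r"\(\d*\)")


def reformat(lines):
    """Return one output line per pathway: header, corrected count, KO IDs."""
    records = []  # each record is [header_text, [ko, ko, ...]]
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if HEADER_RE.match(line):
            # A header may already carry some KO IDs on the same line.
            first_ko = KO_RE.search(line)
            header = line[:first_ko.start()].rstrip() if first_ko else line
            records.append([header, KO_RE.findall(line)])
        elif records:
            # Continuation line: attach its KO IDs to the current pathway.
            records[-1][1].extend(KO_RE.findall(line))

    output = []
    for header, kos in records:
        # Replace the reported count with the number of KO IDs actually found.
        header = COUNT_RE.sub(f"({len(kos)})", header, count=1)
        output.append("   ".join([header] + kos))
    return output


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        print("\n".join(reformat(handle)))
```

Saved under a hypothetical name such as reformat_kegg.py, it could be run as `python3 reformat_kegg.py file1 > file2`. Unlike the awk version, it rebuilds each line from the extracted IDs, so the spacing between identifiers comes out uniform.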

Quoted from: https://unix.stackexchange.com/questions/477268