Python

Python: pandas DataFrame with multiple columns

  • December 1, 2020

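I am writing a Python script that loops over N .sdf files: it builds a list of them with glob, performs some calculations for each file, and then stores the results in a pandas DataFrame. Assuming I calculate 4 different properties per file, the expected output for 1000 files is a DataFrame with 5 columns and 1000 rows. Here is a sample of the code: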

# make a list of all .sdf files present in the data folder:
dirlist = [os.path.basename(p) for p in glob.glob('data' + '/*.sdf')]

# create an empty DataFrame with 5 columns:
# name of the file, value of variable p, value of ac, value of don, value of wt
df = pd.DataFrame(columns=["key", "p", "ac", "don", "wt"])

# for each sdf file get its name and calculate 4 different properties: p, ac, don, wt
for sdf in dirlist:
    sdf_name = sdf.rsplit(".", 1)[0]
    # set a name of the file
    key = f'{sdf_name}'
    mol = open(sdf, 'rb')
    # --- do some specific calculations --
    p = MolLogP(mol)  # coeff conc-perm
    ac = CalcNumLipinskiHBA(mol)
    don = CalcNumLipinskiHBD(mol)
    wt = MolWt(mol)
    # add one line to DF in the following order : ["key", "p", "ac", "don", "wt"]
    df[key] = [p, ac, don, wt]

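The problem is in the last line of the script: I need to gather all the calculated values into a single row and append it to the DataFrame together with the name of the processed file. In the end, for 1000 processed SDF files, my DataFrame should contain 5 columns and 1000 rows.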
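To illustrate the difference, here is a minimal sketch of appending one row per file. `df[key] = [...]` assigns a column named `key`, whereas `df.loc[len(df)] = [...]` adds a row at the next integer position. The names and numbers in `records` are hypothetical placeholders standing in for the real per-file calculations.

```python
import pandas as pd

df = pd.DataFrame(columns=["key", "p", "ac", "don", "wt"])

# hypothetical placeholder values standing in for the real per-file calculations
records = [
    ("mol_a", 1.2, 3, 1, 180.2),
    ("mol_b", 0.4, 5, 2, 250.3),
]

for key, p, ac, don, wt in records:
    # df[key] = [...] would create a new COLUMN called key;
    # .loc with the next integer label appends a ROW instead
    df.loc[len(df)] = [key, p, ac, don, wt]

print(df.shape)
```

Growing a DataFrame row by row this way is fine for a thousand files, but it gets slow for much larger inputs; collecting the rows first and building the DataFrame once is usually preferable.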

import glob
import os

import pandas as pd

# make a list of all .sdf files present in the data folder:
dirlist = [os.path.basename(p) for p in glob.glob('data' + '/*.sdf')]

# collect one Series per file: name of the file, then p, ac, don, wt
holder = []
for sdf in dirlist:
    # file name without the extension
    key = sdf.rsplit(".", 1)[0]
    mol = open(sdf, 'rb')
    # --- do some specific calculations --
    p = MolLogP(mol)  # coeff conc-perm
    ac = CalcNumLipinskiHBA(mol)
    don = CalcNumLipinskiHBD(mol)
    wt = MolWt(mol)
    # one record in the following order: ["key", "p", "ac", "don", "wt"]
    holder.append(pd.Series([key, p, ac, don, wt]))

# axis=1 puts each Series in its own column, so transpose to get one row per file
df = pd.concat(holder, axis=1).T
df.rename(columns={0: "key", 1: "p", 2: "ac", 3: "don", 4: "wt"}, inplace=True)
print(df)
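An alternative to concatenating Series is to collect one dict per file and build the DataFrame once at the end. The sketch below is only an illustration: `describe` is a hypothetical stand-in for the real property calculations, and the file names are made up so the example runs without any .sdf files.

```python
import pandas as pd


def describe(name):
    """Hypothetical stand-in for the real property calculations."""
    return {
        "key": name,
        "p": len(name) * 0.5,
        "ac": len(name),
        "don": 1,
        "wt": 100.0 + len(name),
    }


# made-up file names; in the real script these would come from glob
file_names = ["aspirin.sdf", "caffeine.sdf", "ibuprofen.sdf"]

rows = []
for file_name in file_names:
    key = file_name.rsplit(".", 1)[0]  # strip the extension
    rows.append(describe(key))

# each dict becomes one row; columns= fixes the column order
df = pd.DataFrame(rows, columns=["key", "p", "ac", "don", "wt"])
print(df)
```

With this layout no transpose or rename step is needed, because the dict keys already name the columns.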

Source: https://unix.stackexchange.com/questions/622341