Query CSV files like SQL

June 24, 2020

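This is apparently a popular interview question:
There are two CSV files containing dinosaur data. We need to query them to return the dinosaurs that satisfy certain conditions.
There are two options: use only Unix command-line tools (cut, paste, sed, awk), or use a scripting language such as Python, but without additional modules such as q, fsql, csvkit, etc.
file1.csv: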
NAME,LEG_LENGTH,DIET
Hadrosaurus,1.2,herbivore
Struthiomimus,0.92,omnivore
Velociraptor,1.0,carnivore
Triceratops,0.87,herbivore
Euoplocephalus,1.6,herbivore
Stegosaurus,1.40,herbivore
Tyrannosaurus Rex,2.5,carnivore
file2.csv:
NAME,STRIDE_LENGTH,STANCE
Euoplocephalus,1.87,quadrupedal
Stegosaurus,1.90,quadrupedal
Tyrannosaurus Rex,5.76,bipedal
Hadrosaurus,1.4,bipedal
Deinonychus,1.21,bipedal
Struthiomimus,1.34,bipedal
Velociraptor,2.72,bipedal
Use the formula:
speed = ((STRIDE_LENGTH / LEG_LENGTH) - 1) * SQRT(LEG_LENGTH * g)
where
g = 9.8 m/s^2
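As a quick sanity check of the formula, here is a minimal Python sketch that evaluates it for one row. The function name `speed` and the constant `G` are illustrative, and the sketch assumes the lengths are in units consistent with g (the files do not state them).

```python
from math import sqrt

G = 9.8  # gravitational acceleration in m/s^2, as given above


def speed(stride_length: float, leg_length: float) -> float:
    """Evaluate the formula above (lengths assumed consistent with G)."""
    return ((stride_length / leg_length) - 1) * sqrt(leg_length * G)


# Velociraptor: LEG_LENGTH 1.0 (file1.csv), STRIDE_LENGTH 2.72 (file2.csv)
velociraptor_speed = speed(2.72, 1.0)
print(round(velociraptor_speed, 2))  # 5.38
```

Note that a stride equal to the leg length gives a speed of exactly zero, and a shorter stride gives a negative value.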
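Write a program that reads the CSV files and prints only the names of the bipedal dinosaurs, sorted by speed from fastest to slowest.
In SQL, this is simple: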
select f2.name from
file1 f1 join file2 f2 on f1.name = f2.name
where f2.stance = 'bipedal'
order by (f2.stride_length/f1.leg_length - 1)*pow(f1.leg_length*9.8,0.5) desc
How can this be done in Bash or Python?

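You mention that this is an interview question. If I were asked this in an interview, I would ask about the constraints: why do we have them, what is allowed, what is not allowed, and why. With each question I would try to tie the constraint back to a reason it might exist in a business setting, to really understand what is going on here.
I would also ask about the origin of the animal speed formula, but only because my physical-science background is stronger than my life-science background and I am curious about it.
As an interviewer, I would really want to hear that there are standard tools for CSV parsing. I would especially want to hear that parsing and munging from scratch with scripts or command-line utilities is a poorer choice than using standard tools such as pandas and csv.
Stack Exchange is not well suited to this kind of iterative Q&A, so I will post an answer in Python, one that I would offer in an interview only after really understanding the business problem.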
# Assume it's OK to import sqrt, otherwise the spirit of the problem isn't understood.
from math import sqrt

# Read data into dictionary.
dino_dict = dict()
for filename in ['file1.csv','file2.csv']:
   with open(filename) as f:
       # Read the first line as the CSV headers/labels.
       labels = f.readline().strip().split(',')

       # Read the data lines, skipping any blank ones.
       for line in f:
           if not line.strip():
               continue
           values = line.strip().split(',')

           # For each line insert the data in the dict.
           for label, value in zip(labels, values):
               if label == "NAME":
                   dino_name = value
                   if dino_name not in dino_dict:
                       dino_dict[dino_name] = dict() # New dino.
               else:
                   dino_dict[dino_name][label] = value # New attribute.

# Calculate speed and insert into dictionary.
for dino_stats in dino_dict.values():
   try:
       stride_length = float(dino_stats['STRIDE_LENGTH'])
       leg_length = float(dino_stats['LEG_LENGTH'])
   except KeyError:
       continue
   
   dino_stats["SPEED"] = ((stride_length / leg_length) - 1) * sqrt(leg_length * 9.8)
   
# Make a list of dinos with their speeds.
bipedal_dinos_with_speed = list()
for dino_name, dino_stats in dino_dict.items():
   if dino_stats.get('STANCE') == 'bipedal':
       if 'SPEED' in dino_stats:
           bipedal_dinos_with_speed.append((dino_name, dino_stats['SPEED']))

# Sort the list by speed, fastest first, and print the dino names.
print([dino_name for dino_name, _ in sorted(bipedal_dinos_with_speed, key=lambda x: x[1], reverse=True)])
# Output: ['Tyrannosaurus Rex', 'Velociraptor', 'Struthiomimus', 'Hadrosaurus']
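For comparison, here is a hedged sketch of the same join using the standard-library csv module instead of splitting lines by hand. Whether importing csv is allowed under the interview's constraints is an assumption. The helper names `read_by_name` and `bipedal_by_speed` are illustrative, and the two files are inlined as strings only so the example runs on its own.

```python
import csv
import io
from math import sqrt

# Inline copies of the two files so the sketch runs without touching disk.
FILE1 = """NAME,LEG_LENGTH,DIET
Hadrosaurus,1.2,herbivore
Struthiomimus,0.92,omnivore
Velociraptor,1.0,carnivore
Triceratops,0.87,herbivore
Euoplocephalus,1.6,herbivore
Stegosaurus,1.40,herbivore
Tyrannosaurus Rex,2.5,carnivore
"""

FILE2 = """NAME,STRIDE_LENGTH,STANCE
Euoplocephalus,1.87,quadrupedal
Stegosaurus,1.90,quadrupedal
Tyrannosaurus Rex,5.76,bipedal
Hadrosaurus,1.4,bipedal
Deinonychus,1.21,bipedal
Struthiomimus,1.34,bipedal
Velociraptor,2.72,bipedal
"""


def read_by_name(text):
    """Return {NAME: row_dict} for CSV text that starts with a header row."""
    return {row["NAME"]: row for row in csv.DictReader(io.StringIO(text))}


def bipedal_by_speed(file1_text, file2_text, g=9.8):
    """Names of bipedal dinos present in both files, fastest first."""
    legs = read_by_name(file1_text)
    strides = read_by_name(file2_text)
    speeds = {}
    # Inner join: keep only names that appear in both files.
    for name in legs.keys() & strides.keys():
        if strides[name]["STANCE"] != "bipedal":
            continue
        leg = float(legs[name]["LEG_LENGTH"])
        stride = float(strides[name]["STRIDE_LENGTH"])
        speeds[name] = (stride / leg - 1) * sqrt(leg * g)
    return sorted(speeds, key=speeds.get, reverse=True)


names = bipedal_by_speed(FILE1, FILE2)
print(names)
```

Deinonychus drops out because it has no row in file1.csv, which matches the inner join in the SQL version of the question.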

Some tools have been created to serve this purpose. Here is an example:

$ csvq 'select * from cities'
+------------+-------------+----------+
|    name    |  population |  country |
+------------+-------------+----------+
| warsaw     |  1700000    |  poland  |
| ciechanowo |  46000      |  poland  |
| berlin     |  3500000    |  germany |
+------------+-------------+----------+

$ csvq 'insert into cities values("dallas", 1, "america")'
1 record inserted on "C:\\cities.csv".
Commit: file "C:\\cities.csv" is updated.

https://github.com/mithrandie/csvq
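csvq is a separate binary. As a rough analogue that stays inside Python's standard library, the sketch below loads the dinosaur CSV data into an in-memory SQLite database and runs a query shaped like the one in the question. This is not how csvq works internally. The helper `load_csv` and the SQL function name `py_sqrt` are hypothetical names chosen for this example, and the files are inlined as strings so the block is self-contained.

```python
import csv
import io
import sqlite3
from math import sqrt

FILE1 = """NAME,LEG_LENGTH,DIET
Hadrosaurus,1.2,herbivore
Struthiomimus,0.92,omnivore
Velociraptor,1.0,carnivore
Triceratops,0.87,herbivore
Euoplocephalus,1.6,herbivore
Stegosaurus,1.40,herbivore
Tyrannosaurus Rex,2.5,carnivore
"""

FILE2 = """NAME,STRIDE_LENGTH,STANCE
Euoplocephalus,1.87,quadrupedal
Stegosaurus,1.90,quadrupedal
Tyrannosaurus Rex,5.76,bipedal
Hadrosaurus,1.4,bipedal
Deinonychus,1.21,bipedal
Struthiomimus,1.34,bipedal
Velociraptor,2.72,bipedal
"""


def load_csv(conn, table, text):
    """Create `table` from CSV text, using the header row as column names."""
    header, *data = csv.reader(io.StringIO(text))
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)


conn = sqlite3.connect(":memory:")
# Register a square-root function explicitly rather than relying on
# SQLite's optional built-in math functions being compiled in.
conn.create_function("py_sqrt", 1, sqrt)
load_csv(conn, "file1", FILE1)
load_csv(conn, "file2", FILE2)

# Values are loaded as text, so they are cast to REAL before arithmetic.
QUERY = """
SELECT f2.NAME
FROM file1 AS f1 JOIN file2 AS f2 ON f1.NAME = f2.NAME
WHERE f2.STANCE = 'bipedal'
ORDER BY (CAST(f2.STRIDE_LENGTH AS REAL) / CAST(f1.LEG_LENGTH AS REAL) - 1)
         * py_sqrt(CAST(f1.LEG_LENGTH AS REAL) * 9.8) DESC
"""
rows = [row[0] for row in conn.execute(QUERY)]
print(rows)
conn.close()
```

The table and column names are interpolated into the SQL text, so this is only suitable for trusted header rows like the ones here.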

Source: https://unix.stackexchange.com/questions/593939
