Common Tips for Processing CSV Files with Python Pandas

2022-06-08 18:00:49

Reading a CSV File with Pandas

import pandas as pd

df = pd.read_csv(file_path, encoding='GB2312')
df.info()  # info() prints the summary itself and returns None

Note: pandas reads CSV files as UTF-8 by default, so a Chinese CSV saved in another encoding such as GB2312 raises an error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in position 2: invalid continuation byte

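Changing the encoding to GB2312 fixes this. Alternatively, encoding='unicode_escape' suppresses the decode error, but it does not decode Chinese text correctly, so specifying the file's real encoding is preferable: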

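# Option 1: specify the file's actual encoding
df = pd.read_csv(file_path, encoding='GB2312')
# Option 2: suppress the decode error (Chinese text will not be decoded correctly)
df = pd.read_csv(file_path, encoding='unicode_escape')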
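When the encoding of a file is not known in advance, one option is to try a short list of candidate encodings in order. The following is a minimal sketch, not part of the original article: the helper name read_csv_with_fallback and the candidate list are assumptions chosen for illustration (gb18030 is used because it is a superset of GB2312 and GBK).

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "gb18030")):
    """Hypothetical helper: try each encoding in turn, return (DataFrame, encoding_used)."""
    if not encodings:
        raise ValueError("encodings must not be empty")
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError as err:
            last_error = err  # remember the failure and try the next candidate
    raise last_error


# Demo: write a small GB2312-encoded CSV to a temporary file, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_gb2312.csv")
    with open(demo_path, "w", encoding="gb2312") as f:
        f.write("类别,数量\n对照,3\n")
    demo_df, used_encoding = read_csv_with_fallback(demo_path)

print(used_encoding)
print(demo_df)
```

The order of the candidates matters: UTF-8 is tried first because text in a GB-family encoding almost always fails UTF-8 decoding, while the reverse is not reliable.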

df.info() displays basic information about the DataFrame, for example:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3840 entries, 0 to 3839
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 實驗時間批次 3840 non-null object
1 物鏡倍數 3840 non-null object
2 板子編號 3840 non-null object
3 板子編號及物鏡倍數 3840 non-null object
4 圖名稱 3840 non-null object
5 細胞型別 3840 non-null object
6 板子孔位置 3840 non-null object
7 孔拍攝位置 3840 non-null int64
8 細胞培養基 3840 non-null object
9 細胞培養時間（小時） 3840 non-null int64
10 擾動類別 3840 non-null object
11 擾動處理時間（小時） 3840 non-null int64
12 擾動處理濃度（ug/ml） 3840 non-null float64
13 標註啟用(1/0) 3840 non-null int64
14 unique 3840 non-null object
15 tvt 3840 non-null int64
dtypes: float64(1), int64(5), object(10)
memory usage: 480.1+ KB
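Because df.info() writes its summary directly and returns None, capturing the summary as text (for a log file, say) needs the buf parameter. A small sketch with made-up data; the DataFrame df_demo and its columns are illustrative only:

```python
import io

import pandas as pd

# Made-up two-row DataFrame standing in for the real data
df_demo = pd.DataFrame({"plate": ["plate1", "plate2"], "hours": [24, 48]})

buffer = io.StringIO()
result = df_demo.info(buf=buffer)  # writes into the buffer instead of stdout
info_text = buffer.getvalue()

print(result)     # info() itself returns None
print(info_text)  # the captured summary
```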

Counting Occurrences of Column Values

Use df[column_name].value_counts(), for example df["擾動類別"].value_counts():

df["擾動類別"].value_counts()

Output:

coated OKT3 720
OKT3 720
coated OKT3+anti-CD28 576
DMSO 336
anti-CD28 288
PBS 288
Nivo 288
Pemb 288
empty 192
coated OKT3 + anti-CD28 144
Name: 擾動類別, dtype: int64
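The output above lists "coated OKT3+anti-CD28" and "coated OKT3 + anti-CD28" as separate values that differ only in the spaces around "+". If the two are meant to be the same category (an assumption, since the article does not say), the labels can be normalized before counting. The sketch below uses a small made-up Series and also shows normalize=True, which returns proportions instead of counts:

```python
import pandas as pd

# Made-up sample containing both spellings of the combined treatment
labels = pd.Series(
    ["OKT3", "coated OKT3+anti-CD28", "coated OKT3 + anti-CD28", "OKT3", "DMSO"],
    name="擾動類別",
)

# Remove optional whitespace around "+" so both spellings collapse into one
normalized = labels.str.replace(r"\s*\+\s*", "+", regex=True)

counts = normalized.value_counts()                # absolute counts
shares = normalized.value_counts(normalize=True)  # proportions that sum to 1

print(counts)
print(shares)
```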

Plot the value_counts() result directly as a bar chart; see Pandas - Chart Visualization:

import matplotlib.pyplot as plt
%matplotlib inline

plt.close("all")
plt.figure(figsize=(20, 8))
df["擾動類別"].value_counts().plot(kind="bar")
# plt.xticks(rotation='vertical', fontsize=10)
plt.show()

[Figure: bar chart of the 擾動類別 value counts]
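The %matplotlib inline line in the code above is an IPython/Jupyter magic and is a syntax error in a plain Python script. A script-friendly variant might look like the sketch below. It assumes the non-interactive Agg backend and writes to a temporary file; the three counts are copied from the output shown earlier and the file name is hypothetical.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import pandas as pd

# A few counts taken from the value_counts() output shown earlier
counts = pd.Series({"OKT3": 720, "DMSO": 336, "PBS": 288})

fig, ax = plt.subplots(figsize=(10, 4))
counts.plot(kind="bar", ax=ax)         # draw onto an explicit Axes object
ax.set_ylabel("count")
ax.tick_params(axis="x", rotation=45)  # tilt the category labels
fig.tight_layout()

out_path = os.path.join(tempfile.gettempdir(), "value_counts_demo.png")  # hypothetical file name
fig.savefig(out_path)
plt.close(fig)
```

Passing ax explicitly avoids relying on the "current figure" state, which makes the code easier to reuse when several charts are drawn in one script.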

Filtering Rows by Column Value

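Use df.loc[condition] to select the rows that match a column value. Assign the result to a new variable to work only with the filtered rows, or write it to a CSV file.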

df_plate1 = df.loc[df["板子編號"] == "plate1"]
df_plate1.info()
# df.loc[df["板子編號"] == "plate1"].to_csv("batch3_IOStrain_klasses_utf8_plate1.csv")  # save as a CSV file

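Note: the DataFrame used inside the condition and the DataFrame being indexed must be the same (more precisely, the boolean Series must cover the index of the indexed object); otherwise pandas raises: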

pandas loc IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).

Output of df_plate1.info(): the row count drops from 3840 to 1280.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1280 entries, 0 to 1279
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 實驗時間批次 1280 non-null object
1 物鏡倍數 1280 non-null object
2 板子編號 1280 non-null object
3 板子編號及物鏡倍數 1280 non-null object
4 圖名稱 1280 non-null object
5 細胞型別 1280 non-null object
6 板子孔位置 1280 non-null object
7 孔拍攝位置 1280 non-null int64
8 細胞培養基 1280 non-null object
9 細胞培養時間（小時） 1280 non-null int64
10 擾動類別 1280 non-null object
11 擾動處理時間（小時） 1280 non-null int64
12 擾動處理濃度（ug/ml） 1280 non-null float64
13 標註啟用(1/0) 1280 non-null int64
14 unique 1280 non-null object
15 tvt 1280 non-null int64
dtypes: float64(1), int64(5), object(10)
memory usage: 170.0+ KB
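The alignment note above can be reproduced on a tiny made-up DataFrame. In this sketch (column names and values are illustrative), a mask built from the filtered subset is applied to the full DataFrame, so the mask lacks some of the rows being indexed and pandas refuses to align it. The exception is caught generically here only to avoid depending on where IndexingError lives in a given pandas version.

```python
import pandas as pd

df_all = pd.DataFrame(
    {"plate": ["plate1", "plate1", "plate2", "plate2"], "label": [0, 1, 0, 1]}
)
df_p1 = df_all.loc[df_all["plate"] == "plate1"]  # rows 0 and 1 only

# Mismatch: the mask comes from df_p1 (2 rows) but indexes df_all (4 rows)
try:
    df_all.loc[df_p1["label"] == 0]
    error_name = None
except Exception as err:  # broad catch for the demo only
    error_name = type(err).__name__
print(error_name)

# Correct: build the mask from the same DataFrame that is being indexed
df_p1_lb0 = df_p1.loc[df_p1["label"] == 0]

# Alternative: combine both conditions on the original DataFrame
df_combined = df_all.loc[(df_all["plate"] == "plate1") & (df_all["label"] == 0)]
print(df_p1_lb0.equals(df_combined))
```

Each condition in the combined form needs its own parentheses, because & binds more tightly than == in Python.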

Iterating over Rows

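Loop with for idx, row in df_plate1_lb0.iterrows(): and read individual values with row["column_name"]. The names df_plate1_lb0, img_format, and plate1_img_folder are defined elsewhere and are not shown in this article: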

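import os
import cv2

for idx, row in df_plate1_lb0.iterrows():
    img_name = row["圖名稱"]
    # Fill in the image name now and keep a "{}" placeholder for the image index
    img_ch_format = img_format.format(img_name, "{}")
    for i in range(1, 7):  # image indices 1 to 6
        img_path = os.path.join(plate1_img_folder, img_ch_format.format(i))
        img = cv2.imread(img_path)  # returns None if the file cannot be read
        print('[Info] img shape: {}'.format(img.shape))
    break  # stop after the first row; this is only a demonstration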

Output:

[Info] img shape: (1080, 1080, 3)
[Info] img shape: (1080, 1080, 3)
[Info] img shape: (1080, 1080, 3)
[Info] img shape: (1080, 1080, 3)
[Info] img shape: (1080, 1080, 3)
[Info] img shape: (1080, 1080, 3)
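The two-step str.format call in the loop above may be easier to follow on made-up data. In the sketch below, the pattern "{}_ch{}.png", the folder name, and the image names are all hypothetical, since the article does not show the real values. The first format call fills in the image name and re-inserts a literal "{}", and the second call fills in the index.

```python
import os

import pandas as pd

df_demo = pd.DataFrame({"圖名稱": ["A01_s1", "A01_s2"]})  # made-up image names
img_format = "{}_ch{}.png"  # hypothetical file name pattern
img_folder = "images"       # hypothetical folder

paths = []
for idx, row in df_demo.iterrows():
    # "A01_s1" + "{}" -> "A01_s1_ch{}.png": name filled in, index still open
    img_ch_format = img_format.format(row["圖名稱"], "{}")
    for i in range(1, 3):
        paths.append(os.path.join(img_folder, img_ch_format.format(i)))

for p in paths:
    print(p)
```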

Plotting a Histogram (Bar Chart)

Build a dictionary of gray-level counts after removing the background color:

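import collections

import cv2
import numpy as np

# Remove the background color: treat the most frequent gray level as background
pix_bkg = np.argmax(np.bincount(img_gray.ravel()))
img_gray = np.where(img_gray <= pix_bkg + 2, 0, img_gray)
img_gray = img_gray.astype(np.uint8)

# Compute the 256-bin histogram as a flat array
hist = cv2.calcHist([img_gray], [0], None, [256], [0, 256])
hist = hist.ravel()

# Convert to a dictionary: gray level -> pixel count
hist_dict = collections.defaultdict(int)
for i, v in enumerate(hist):
    hist_dict[i] += int(v)

# All background pixels were mapped to 0, so the count at 0 is very large;
# zero it out to see the rest of the distribution
hist_dict[0] = 0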
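The background-removal step can be checked on a small synthetic image. The sketch below is an illustration rather than the article's code: it swaps cv2.calcHist for np.bincount with minlength=256 so that it runs with NumPy alone, and the 8x8 image and its pixel values are made up.

```python
import numpy as np

# Synthetic 8x8 grayscale image: background level 10, a 2x3 bright patch at 50,
# and one near-background noise pixel at 12
img_gray = np.full((8, 8), 10, dtype=np.uint8)
img_gray[2:4, 2:5] = 50
img_gray[5, 5] = 12

# The most frequent gray level is taken as the background
pix_bkg = int(np.argmax(np.bincount(img_gray.ravel())))

# Map the background and anything within +2 of it to 0
cleaned = np.where(img_gray <= pix_bkg + 2, 0, img_gray).astype(np.uint8)

# 256-bin histogram: index = gray level, value = pixel count
hist = np.bincount(cleaned.ravel(), minlength=256)
hist_dict = {level: int(count) for level, count in enumerate(hist)}
hist_dict[0] = 0  # drop the background bin so it does not dominate the plot

print(pix_bkg)
print({k: v for k, v in hist_dict.items() if v > 0})
```

The +2 tolerance absorbs pixels that are only slightly brighter than the background, which is why the noise pixel at 12 disappears along with the background at 10.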

Plot the bar chart:

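plt.subplots: creates the figure and its subplots; figsize sets the figure size and facecolor the background color
ax.set_title: sets the title
ax.bar: takes the x-axis values and the y-axis values (bar heights)
ax.set_xticks: sets the x-axis tick positions
plt.savefig: saves the image
plt.show: displays the figure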

fig, ax = plt.subplots(1, 1, figsize=(10, 8), facecolor='white')
ax.set_title('channel {}'.format(ci))
n_bins = 100
ax.bar(range(n_bins+1), [hist_dict.get(xtick, 0) for xtick in range(n_bins+1)])
ax.set_xticks(range(0, n_bins, 5))

plt.savefig(res_path)
plt.show()

Result:

[Figure: histogram bar chart of the gray-level counts]
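The plotting code above depends on hist_dict, ci, and res_path from earlier steps that are not shown in full. A self-contained sketch with made-up counts follows; the dictionary contents, the channel number in the title, and the output file name are assumptions, and the Agg backend is used so that it runs without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt

hist_dict = {20: 5, 21: 9, 22: 4, 60: 2}  # made-up gray level -> pixel count
n_bins = 100

xs = list(range(n_bins + 1))
heights = [hist_dict.get(x, 0) for x in xs]  # missing gray levels count as 0

fig, ax = plt.subplots(1, 1, figsize=(10, 8), facecolor="white")
ax.set_title("channel {}".format(1))    # hypothetical channel number
bars = ax.bar(xs, heights)
ax.set_xticks(range(0, n_bins + 1, 5))  # a tick every 5 gray levels, including 100

res_path = os.path.join(tempfile.gettempdir(), "hist_demo.png")  # hypothetical file name
fig.savefig(res_path)
plt.close(fig)
```

Using hist_dict.get(x, 0) means a plain dict works as well as a defaultdict, and looking up a missing key does not insert it.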
