Taking on Python 実践 データ加工/可視化 100本ノック: Knock 49

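Knock 49: Visualize the people who passed through

In this knock we count the people who passed through. The code from the earlier knocks is repeated first so that the listing runs on its own.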

from glob import glob
import pandas as pd
files = glob('data/person_count_1sec/out_0001/*.csv')
files.sort()

data=[]
for iii in files:
    tmp = pd.read_csv(iii,parse_dates=[2])
    data.append(tmp)

data = pd.concat(data,ignore_index=True)

# Knock 43
data['receive_date']= data['receive_time'].dt.date

# Knock 44
data['dayofweek'] = data['receive_time'].dt.dayofweek
data['day_name']  = data['receive_time'].dt.day_name()


# Knock 45
import datetime as dtt
data_extract = data.loc[(data['receive_time']>=dtt.datetime(2021,1,20))&
                        (data['receive_time']<dtt.datetime(2021,1,23))].copy()

# Knock 46
# 's' is the seconds alias; recent pandas versions deprecate the uppercase 'S'
data_extract['receive_time_sec'] = data_extract['receive_time'].dt.round('s')
# print(data_extract.head())

# print(len(data_extract))
# print(len(data_extract['receive_time_sec'].unique()))

dupli_data = data_extract[data_extract['receive_time_sec'].duplicated(keep=False)]
# print(dupli_data)

# Switch from round to floor so each timestamp is truncated to its own second
data_extract['receive_time_sec'] = data_extract['receive_time'].dt.floor('s')
# print(len(data_extract))
# print(len(data_extract['receive_time_sec'].unique()))

data_extract = data_extract.drop_duplicates(subset=['receive_time_sec'])
#print(len(data_extract))

# Knock 47
#print(pd.date_range('2021-01-15','2021-01-16',freq='s'))

min_receive = data_extract['receive_time_sec'].min()
max_receive = data_extract['receive_time_sec'].max()

# Build a gap-free one-second time axis between the first and last reading
date1 = pd.date_range(min_receive,max_receive,freq='s')

base_data = pd.DataFrame({'receive_time_sec':date1})
data_base_extract = pd.merge(base_data,data_extract,on='receive_time_sec',how='left')

# Knock 48
data_base_extract.sort_values('receive_time_sec',inplace=True)
# ffill() replaces fillna(method='ffill'), which recent pandas versions deprecate
data_base_extract = data_base_extract.ffill()

# Knock 49
data_analytics = data_base_extract[['receive_time_sec','in1','out1']].copy()
print(data_analytics.head())

 

Output

     receive_time_sec      in1     out1
0 2021-01-20 00:00:40  12109.0  11302.0
1 2021-01-20 00:00:41  12109.0  11302.0
2 2021-01-20 00:00:42  12109.0  11302.0
3 2021-01-20 00:00:43  12109.0  11302.0
4 2021-01-20 00:00:44  12109.0  11302.0

 

Next, we create the data from one second earlier. Calling shift(1) produces a copy of the data moved down by one row.

 
data_before_1sec = data_analytics.shift(1)
print(data_before_1sec.head())

 

Output

     receive_time_sec      in1     out1
0                 NaT      NaN      NaN
1 2021-01-20 00:00:40  12109.0  11302.0
2 2021-01-20 00:00:41  12109.0  11302.0
3 2021-01-20 00:00:42  12109.0  11302.0
4 2021-01-20 00:00:43  12109.0  11302.0

The data now starts at index 1, and row 0 holds missing values (NaT / NaN).
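
To look at shift in isolation, here is a minimal sketch on a made-up three-row frame. The name `toy` and its values are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Hypothetical toy data: a cumulative counter sampled once per second.
toy = pd.DataFrame({
    'receive_time_sec': pd.to_datetime(['2021-01-20 00:00:40',
                                        '2021-01-20 00:00:41',
                                        '2021-01-20 00:00:42']),
    'in1': [100.0, 100.0, 103.0],
})

# shift(1) moves every value down one row; the index labels stay as they are.
toy_before_1sec = toy.shift(1)

# Row 0 has no predecessor, so it is filled with NaT / NaN.
print(toy_before_1sec)
```

Because the index is untouched, row n of the shifted frame lines up with row n of the original and holds the value from row n-1.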

 

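We join data_analytics and data_before_1sec with concat.

The shifted frame still carries the original column names, so before joining we rename its columns to receive_time_sec_b1sec, in1_b1sec, and out1_b1sec to tell them apart.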


data_before_1sec.columns=['receive_time_sec_b1sec','in1_b1sec','out1_b1sec']
data_analytics = pd.concat([data_analytics,data_before_1sec],axis=1)
print(data_analytics.head())

 

 

Output

     receive_time_sec      in1     out1 receive_time_sec_b1sec  in1_b1sec  out1_b1sec
0 2021-01-20 00:00:40  12109.0  11302.0                    NaT        NaN         NaN
1 2021-01-20 00:00:41  12109.0  11302.0    2021-01-20 00:00:40    12109.0     11302.0
2 2021-01-20 00:00:42  12109.0  11302.0    2021-01-20 00:00:41    12109.0     11302.0
3 2021-01-20 00:00:43  12109.0  11302.0    2021-01-20 00:00:42    12109.0     11302.0
4 2021-01-20 00:00:44  12109.0  11302.0    2021-01-20 00:00:43    12109.0     11302.0
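
As an alternative to typing each new column name by hand, the renaming can probably be done with `add_suffix`. The sketch below uses a hypothetical toy frame (`toy` and its values are assumptions for illustration) and is not the approach the listing above takes.

```python
import pandas as pd

# Hypothetical toy counters, one row per second.
toy = pd.DataFrame({'in1': [100.0, 100.0, 103.0],
                    'out1': [90.0, 91.0, 91.0]})

# add_suffix appends the same suffix to every column label of the shifted copy.
toy_before_1sec = toy.shift(1).add_suffix('_b1sec')

# axis=1 places the two frames side by side, aligned on the row index.
toy_joined = pd.concat([toy, toy_before_1sec], axis=1)
print(toy_joined.columns.tolist())
```

With this, the suffixed names follow the original names automatically, which avoids a mismatch if columns are added later.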

 

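We now compute how many people came in and how many went out each second. in1 and out1 are cumulative counts, so subtracting the value from one second earlier gives the number of people in that second.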

data_analytics['in1_calc']=data_analytics['in1']-data_analytics['in1_b1sec']
data_analytics['out1_calc']=data_analytics['out1']-data_analytics['out1_b1sec']
print(data_analytics.head())

 

Output

     receive_time_sec      in1     out1 receive_time_sec_b1sec  in1_b1sec  out1_b1sec  in1_calc  out1_calc
0 2021-01-20 00:00:40  12109.0  11302.0                    NaT        NaN         NaN       NaN        NaN
1 2021-01-20 00:00:41  12109.0  11302.0    2021-01-20 00:00:40    12109.0     11302.0       0.0        0.0
2 2021-01-20 00:00:42  12109.0  11302.0    2021-01-20 00:00:41    12109.0     11302.0       0.0        0.0
3 2021-01-20 00:00:43  12109.0  11302.0    2021-01-20 00:00:42    12109.0     11302.0       0.0        0.0
4 2021-01-20 00:00:44  12109.0  11302.0    2021-01-20 00:00:43    12109.0     11302.0       0.0        0.0
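
The subtraction against the shifted column is the same operation that `Series.diff()` performs. A small sketch on made-up numbers (the frame `toy` is an assumption for illustration):

```python
import pandas as pd

# Hypothetical cumulative counter: nobody, then 3 people, then 1 person.
toy = pd.DataFrame({'in1': [100.0, 100.0, 103.0, 104.0]})

# Manual version, as in the text: current value minus the value one row earlier.
manual = toy['in1'] - toy['in1'].shift(1)

# diff() computes the same row-to-row difference in one call.
via_diff = toy['in1'].diff()

print(via_diff.tolist())
```

Both give NaN for the first row, since it has no predecessor, followed by 0.0, 3.0, and 1.0. Keeping the explicit `_b1sec` columns, as the text does, has the advantage that the intermediate values can be inspected.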

 

At one-second granularity almost no change is visible, so we label each row with its hour and plot the data by hour.

data_analytics['date_hour'] = data_analytics['receive_time_sec'].dt.strftime('%Y%m%d%H')

print(data_analytics.head())

 

strftime converts each timestamp to a string; the format '%Y%m%d%H' keeps the year, month, day, and hour.

 

Output

     receive_time_sec      in1     out1 receive_time_sec_b1sec  in1_b1sec  out1_b1sec  in1_calc  out1_calc   date_hour
0 2021-01-20 00:00:40  12109.0  11302.0                    NaT        NaN         NaN       NaN        NaN  2021012000
1 2021-01-20 00:00:41  12109.0  11302.0    2021-01-20 00:00:40    12109.0     11302.0       0.0        0.0  2021012000
2 2021-01-20 00:00:42  12109.0  11302.0    2021-01-20 00:00:41    12109.0     11302.0       0.0        0.0  2021012000
3 2021-01-20 00:00:43  12109.0  11302.0    2021-01-20 00:00:42    12109.0     11302.0       0.0        0.0  2021012000
4 2021-01-20 00:00:44  12109.0  11302.0    2021-01-20 00:00:43    12109.0     11302.0       0.0        0.0  2021012000
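
To make the effect of the format string concrete, here is a minimal sketch with three made-up timestamps (the values are assumptions chosen to straddle an hour boundary):

```python
import pandas as pd

# Hypothetical timestamps: two on either side of 01:00, one on the next day.
times = pd.Series(pd.to_datetime(['2021-01-20 00:59:59',
                                  '2021-01-20 01:00:00',
                                  '2021-01-21 13:30:15']))

# Minutes and seconds are dropped, so every row within the same hour
# receives the same key.
date_hour = times.dt.strftime('%Y%m%d%H')
print(date_hour.tolist())
```

This prints `['2021012000', '2021012001', '2021012113']`. Because each field is zero-padded to a fixed width, sorting these strings also sorts them chronologically.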

 

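We create a line chart. The per-second counts are summed per hour with groupby, then reshaped with melt so that in1_calc and out1_calc can be drawn as two lines.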

 
viz_data = data_analytics[['date_hour','in1_calc','out1_calc']].groupby('date_hour',as_index=False).sum()
viz_data = pd.melt(viz_data,id_vars='date_hour',value_vars=['in1_calc','out1_calc'])
 
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(15,10))
plt.xticks(rotation=90)
sns.lineplot(x=viz_data['date_hour'],y=viz_data['value'],hue=viz_data['variable'])
plt.show()

 

 

Output

[Figure: visualization of the people count]
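
The groupby and melt steps that feed the chart can be followed on a tiny example. This is a sketch on made-up numbers (the frame `toy` is an assumption) and it leaves out the plotting itself.

```python
import pandas as pd

# Hypothetical per-second counts that already carry an hour key.
toy = pd.DataFrame({
    'date_hour': ['2021012000', '2021012000', '2021012001'],
    'in1_calc':  [1.0, 2.0, 4.0],
    'out1_calc': [0.0, 1.0, 3.0],
})

# Sum the per-second counts within each hour (wide format: one row per hour).
hourly = toy.groupby('date_hour', as_index=False).sum()

# melt stacks the two count columns into 'variable' / 'value' pairs
# (long format: one row per hour and series).
long_form = pd.melt(hourly, id_vars='date_hour',
                    value_vars=['in1_calc', 'out1_calc'])
print(long_form)
```

The long format has one `value` column plus a `variable` column naming the series, which is the shape that the `hue` argument of `sns.lineplot` uses to draw one line per series.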

 

 

 

 

 

 

 

 

 

 

 

 

 

 
