How can I speed up this data processing?

Source: 1-2 Course Introduction

yscyber

2020-05-11

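Hello teacher, I have two datasets with the structures shown below.

[Image: Dataset 1]

[Image: Dataset 2]

For my task (I need to filter out listening records where a user's play count within one hour is abnormal), I now need to aggregate the data into the following format:

user_id, song_id, artist_id, gmt_create, date, play_count

My current code is as follows: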

import pandas as pd

# The two arguments are the DataFrames obtained by reading the two datasets with Pandas read_csv.
# unix_time_to_normal_time_one is my own helper, defined elsewhere (not shown here).
def fun_user_action_data_set_aggregation_one(df_user_action, df_song):
    # Final result: columns=[0-'user_id', 1-'song_id', 2-'artist_id', 3-'unix_timestamp', 4-'date', 5-'play_count']
    temp1 = df_user_action.loc[df_user_action['action_type'] == 1]
    # size() returns a Series keyed by (user_id, song_id, gmt_create); as_index=False is dropped
    # because newer pandas versions would return a DataFrame instead and break the loop below
    group_result = temp1.groupby(by=['user_id', 'song_id', 'gmt_create']).size().to_dict()
    list_result = []
    for key in group_result.keys():
        if int(group_result[key]) <= 30:
            # Looks up the artist by scanning df_song once per group
            artist_id = df_song.loc[df_song['song_id'] == str(key[1])]['artist_id'].head(1).tolist()[0]
            elem = {'user_id': str(key[0]), 'song_id': str(key[1]), 'artist_id': artist_id,
                    'unix_timestamp': int(key[2]), 'date': unix_time_to_normal_time_one(int(key[2])),
                    'play_count': int(group_result[key])}
            list_result.append(elem)
    df_result = pd.DataFrame(list_result)
    df_result.to_csv('..\\middle\\p1_2_one.csv', index=False, header=False, encoding='UTF-8')
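
The loop above scans `df_song` once for every group to find the artist, which is probably where most of the time goes. For comparison, here is a minimal vectorized sketch of the same aggregation. It is an illustration under stated assumptions, not a definitive answer: the function name `aggregate_play_counts` and the `max_plays` parameter are hypothetical, `gmt_create` is assumed to hold Unix seconds, and the `%Y%m%d` date format (in UTC) is only a guess standing in for the `unix_time_to_normal_time_one` helper, which is not shown in the post.

```python
import pandas as pd


def aggregate_play_counts(df_user_action, df_song, max_plays=30):
    """Hypothetical vectorized version: count plays per group, then join artist_id."""
    # Keep only play actions (action_type == 1, as in the original code).
    plays = df_user_action.loc[df_user_action["action_type"] == 1]

    # Count rows per (user_id, song_id, gmt_create); reset_index turns the
    # resulting Series into a DataFrame with a play_count column.
    counts = (
        plays.groupby(["user_id", "song_id", "gmt_create"])
        .size()
        .reset_index(name="play_count")
    )

    # Drop groups above the threshold, mirroring the `<= 30` check.
    counts = counts.loc[counts["play_count"] <= max_plays]

    # One artist per song: keep the first row for each song_id, like head(1).
    song_to_artist = df_song.drop_duplicates("song_id")[["song_id", "artist_id"]]

    # A single join replaces the per-group scan of df_song.
    # Songs missing from df_song get a missing artist_id instead of raising.
    result = counts.merge(song_to_artist, on="song_id", how="left")

    # Assumption: gmt_create is Unix seconds; the date format is a guess.
    result = result.rename(columns={"gmt_create": "unix_timestamp"})
    result["date"] = pd.to_datetime(result["unix_timestamp"], unit="s").dt.strftime("%Y%m%d")

    return result[
        ["user_id", "song_id", "artist_id", "unix_timestamp", "date", "play_count"]
    ]
```

The returned DataFrame can then be written out with `to_csv` exactly as in the original function. One behavioral difference to keep in mind: a song that is absent from `df_song` ends up with a missing `artist_id` here, whereas the loop version would raise an `IndexError`.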