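The first assignment is due on May 13. We already know what it involves, and in last Sunday's live class the teacher worked through the whole thing from start to finish. The point of the assignment is to learn how to speed up Python programs, specifically by writing the code in Cython. According to the in-class demo, merely declaring the data types of the variables can make the code more than ten thousand times faster, which blew my mind! (To be honest, doing the homework after watching the teacher's live session, whatever I write feels like copying the answer...)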

1. Data and file overview

1.1 Data

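The data size follows the assignment requirements: x and y each have 100,000 rows, y takes only the values 0 and 1, and x, following the teacher's in-class code, takes only the ten integers 0 to 9.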
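As a quick illustration, data of this shape can be generated as below. This is only a sketch: the helper name `make_data` and the fixed seed are my own assumptions for reproducibility, not part of the assignment.

```python
import numpy as np
import pandas as pd


def make_data(n=100_000, seed=0):
    """Build a DataFrame with a binary target y and a categorical feature x in 0..9."""
    rng = np.random.default_rng(seed)  # the seed is an assumption, added for reproducibility
    y = rng.integers(0, 2, size=n)     # upper bound is exclusive, so y is 0 or 1
    x = rng.integers(0, 10, size=n)    # x is one of the ten integers 0..9
    return pd.DataFrame({"y": y, "x": x})


data = make_data()
print(data.shape)
print(sorted(data["x"].unique()), sorted(data["y"].unique()))
```

The column order (y first, then x) matches the DataFrame built in main.py.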

1.2 Files

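Before running there are three files: setup.py, target_encoding.pyx, and main.py. Once all the code is written, run python setup.py install in a terminal to build and install the extension (you need a C++ compiler for this, of course; at first I hadn't even installed Visual Studio...), then run python main.py.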

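setup.py simply follows the hello example code from the course slides. main.py is mostly adapted from what the teacher produced in class: it defines a main function that calls each method and prints how long each one took. target_encoding.pyx is the important part: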

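  1. target_mean_v1 is the version the teacher gave in class, based on pandas groupby; it is the slowest.
  2. target_mean_v2 is my own version. It builds a (10, 2) table that counts, for each value of x, how many rows have each value of y: rows index the x categories, columns index the y categories, and each cell holds a count. Because it does far less work than target_mean_v1, it is much faster; over a few runs the speedup was roughly several tens of times.
  3. target_mean_v3 uses exactly the same logic as target_mean_v2; the difference is that it uses Cython syntax to convert x and y into C arrays. The idea is that simple, but compilation kept failing in various ways, which is of course down to my own unfamiliarity. For example:
    1. After changing the code, you have to rerun python setup.py install in the terminal to rebuild before the change takes effect.
    2. Cython is strict about array element types: if the two sides of an assignment disagree (say short on one side and float on the other), it raises an error, so they must be made consistent. NumPy dtypes also map onto C types (np.int16 corresponds to short).
    3. ...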
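To make the count-table idea concrete, here is a small NumPy sketch of it. It is not the code from the assignment: the function name `target_mean_counts` and its `n_x`/`n_y` parameters are hypothetical, and it assumes y is 0/1, x is an integer in 0..n_x-1, and every x group has at least two rows (otherwise the denominator is zero).

```python
import numpy as np
import pandas as pd


def target_mean_counts(data, y_name, x_name, n_x=10, n_y=2):
    """Leave-one-out mean of y within each x group, via an (n_x, n_y) count table."""
    x = data[x_name].to_numpy()
    y = data[y_name].to_numpy()
    counts = np.zeros((n_x, n_y), dtype=np.int64)
    np.add.at(counts, (x, y), 1)     # counts[a, b] = number of rows with x == a and y == b
    group_size = counts.sum(axis=1)  # number of rows per x value
    ones = counts[:, 1]              # number of rows per x value with y == 1
    # Exclude row i itself: drop its own y from the numerator and 1 from the denominator.
    return (ones[x] - y) / (group_size[x] - 1)


# Tiny check against the definition, computed row by row with pandas.
df = pd.DataFrame({"y": [1, 0, 1, 1, 0, 0], "x": [0, 0, 0, 1, 1, 1]})
expected = [
    df[(df.index != i) & (df["x"] == df.loc[i, "x"])]["y"].mean()
    for i in df.index
]
result = target_mean_counts(df, "y", "x")
print(result)
print(np.allclose(result, expected))
```

The table is filled in one pass over the data and every row's value is then a lookup, which is why this approach does so much less work than regrouping the whole DataFrame for each row.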

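2. Results

[Image: image.png]

target_mean_v1 goes without saying: it takes very long, far more than 1 minute. target_mean_v2 is several tens of times faster than target_mean_v1, and target_mean_v3 is several tens of times faster again than target_mean_v2. Both target_mean_v2 and target_mean_v3 meet the assignment requirement of running in under 1 minute.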
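When comparing versions like this, it helps to check that the fast version returns the same values as the slow one before trusting the speedup. Below is a rough sketch of such a check. The helper `timed` and the two stand-in functions `slow_sum` and `fast_sum` are hypothetical placeholders; in practice you would pass in the functions from target_encoding instead.

```python
import time

import numpy as np


def timed(func, *args, **kwargs):
    """Run func once and return (result, elapsed seconds)."""
    start = time.perf_counter()  # monotonic, high-resolution clock
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in functions (hypothetical), used only to demonstrate the helper.
def slow_sum(values):
    total = 0.0
    for v in values:
        total += v
    return total


def fast_sum(values):
    return float(np.sum(values))


values = np.arange(100_000, dtype=np.float64)
r_slow, t_slow = timed(slow_sum, values)
r_fast, t_fast = timed(fast_sum, values)
print("results agree:", np.isclose(r_slow, r_fast))
print("speedup: %.1fx" % (t_slow / max(t_fast, 1e-12)))
```

A single run is noisy, so the printed speedup should be read as an order of magnitude rather than a precise figure.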

Appendix: Code

### setup.py

```python
from distutils.core import setup, Extension
from Cython.Build import cythonize
import numpy

compile_flags = ['-std=c++11', '-fopenmp']
linker_flags = ['-fopenmp']

module = Extension('target_encoding',
                   ['target_encoding.pyx'],
                   language='c++',
                   include_dirs=[numpy.get_include()],  # This helps to create numpy
                   extra_compile_args=compile_flags,
                   extra_link_args=linker_flags)

setup(
    name='target_encoding',
    ext_modules=cythonize(module),
    gdb_debug=True  # This is extremely dangerous; Set it to False in production.
)
```

### target_encoding.pyx

```python
# distutils: language=c++
import numpy as np
cimport numpy as np


def say_hello_to(name):
    print("Hello %s!" % name)


cpdef convert_demo(matrix):
    cdef np.ndarray[double, ndim=2, mode='fortran'] arg = np.asfortranarray(matrix, dtype=np.float64)
    return arg


def target_mean_v1(data, y_name, x_name):
    # Baseline from class: for every row, regroup all the other rows with pandas.
    result = np.zeros(data.shape[0])
    for i in range(data.shape[0]):
        groupby_result = data[data.index != i].groupby([x_name], as_index=False).agg(['mean', 'count'])
        result[i] = groupby_result.loc[groupby_result.index == data.loc[i, x_name], (y_name, 'mean')]
    return result


def target_mean_v2(data, y_name, x_name):
    # tmp[a, b] = number of rows with x == a and y == b (column 0 of data is y, column 1 is x).
    tmp = np.zeros([10, 2])
    for i in range(data.shape[0]):
        tmp[data.iloc[i, 1], data.iloc[i, 0]] += 1
    result = []
    for i in range(data.shape[0]):
        # Leave-one-out mean of y: (ones in the group - this row's y) / (group size - 1).
        result.append(
            (tmp[data.iloc[i, 1], 1] - data.iloc[i, 0]) / (tmp[data.iloc[i, 1]].sum() - 1)
        )
    return result


cpdef target_mean_v3(data, y_name, x_name):
    # Earlier attempt, kept for reference:
    # cdef np.ndarray[double, ndim=2, mode='c'] data2 = np.asfortranarray(data, dtype=np.float64)
    # tmp = np.zeros([10, 2])
    # for i in range(len(data2)):
    #     if tmp[data2[i][1], data2[i][0]]:
    #         tmp[data2[i][1], data2[i][0]] += 1
    #     else:
    #         tmp[data2[i][1], data2[i][0]] = 1
    # result = []
    # for i in range(len(data2)):
    #     result.append(
    #         (tmp[data2[i][1], data2[i][0]] - 1) / (tmp[data2[i][1]].sum() - 1)
    #     )
    # return result
    cdef int row = data.shape[0]
    cdef int i  # typed loop index, so the loops below compile to plain C loops
    # np.int16 corresponds to C short; the declared type and the dtype must agree.
    cdef np.ndarray[short] x = np.asfortranarray(data[x_name], dtype=np.int16)
    cdef np.ndarray[short] y = np.asfortranarray(data[y_name], dtype=np.int16)
    # int16 counts are enough here: each cell stays far below 32767 for 100,000 rows.
    cdef np.ndarray[short, ndim=2, mode='fortran'] tmp = np.asfortranarray(np.zeros([10, 2]), dtype=np.int16)
    cdef np.ndarray[double] result = np.asfortranarray(np.zeros(data.shape[0]), dtype=np.float64)
    for i in range(row):
        tmp[x[i], y[i]] += 1
    for i in range(row):
        # Same leave-one-out formula as target_mean_v2; the cast forces floating-point division.
        result[i] = (tmp[x[i], 1] - y[i]) / <double>(tmp[x[i], 0] + tmp[x[i], 1] - 1)
    return result
```

### main.py

```python
import target_encoding
import numpy as np
import pandas as pd
import time


def main():
    n = 100000
    x = np.random.randint(0, high=10, size=[n, 1])
    y = np.random.randint(0, high=2, size=[n, 1])
    data = pd.DataFrame(np.concatenate((y, x), axis=1), columns=['y', 'x'])

    start1 = time.time()
    result1 = target_encoding.target_mean_v1(data, 'y', 'x')
    end1 = time.time()

    start2 = time.time()
    result2 = target_encoding.target_mean_v2(data, 'y', 'x')
    end2 = time.time()

    start3 = time.time()
    result3 = target_encoding.target_mean_v3(data, 'y', 'x')
    end3 = time.time()

    print('target_mean_v1 elapsed seconds:', end1 - start1)
    print('target_mean_v2 elapsed seconds:', end2 - start2)
    print('target_mean_v3 elapsed seconds:', end3 - start3)


if __name__ == '__main__':
    main()
```