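3.2 The OpenCL Platform Model

An OpenCL platform consists of a host processor and one or more OpenCL devices. The platform model defines the roles of the host and the devices, and provides an abstract hardware model for devices. A device is divided into one or more compute units, and each compute unit is further divided into one or more processing elements (PEs). Figure 3.1 shows this relationship.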

[Image: OpenCL platform model]

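Figure 3.1 An OpenCL platform with multiple compute devices. Each compute device contains one or more compute units, and each compute unit consists of one or more processing elements (PEs). A system can have several platforms at the same time; for example, one system may have both an AMD platform and an Intel platform.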

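The platform model is central to application development because it is what makes OpenCL code portable across OpenCL-capable systems. Even a single system can expose several different OpenCL platforms, and different applications can use different ones. The platform-model API lets an OpenCL application discover the available platforms and compute devices, choose among them, and then run on the ones it chose.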

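An application uses the OpenCL runtime API to select the platform supplied by a particular vendor. A platform can only target and interact with the devices that its vendor's implementation exposes. For example, if the platform from company A is selected, a GPU from company B cannot be used through it. The hardware itself, however, does not have to be made by the platform vendor. For example, the AMD and Intel implementations can use x86 CPUs from other companies as devices.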

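When programmers write OpenCL C code, the device architecture is abstracted into the platform model, and each vendor only has to map this abstract architecture onto its physical hardware. The platform model defines a device as a set of compute units, each of which operates independently. Each compute unit is in turn divided into processing elements. Figure 3.1 shows this hierarchy. As an example, the AMD Radeon R9 290X graphics card (the device) contains 44 vector processors (compute units). Each compute unit consists of four 16-lane SIMD engines, for a total of 64 SIMD lanes (processing elements) per compute unit. Each SIMD lane on the Radeon R9 290X executes one scalar instruction, so the whole GPU can execute 44 x 16 x 4 = 2816 instructions at a time.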
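The arithmetic in the Radeon R9 290X example can be checked with a small model of the device / compute unit / processing element hierarchy. This is only an illustrative sketch: the struct, its field names, and the helper functions are invented here and are not OpenCL types or API calls. The figures (44 compute units, four SIMD engines per unit, 16 lanes per engine) are the ones quoted above.

```c
#include <assert.h>
#include <stdio.h>

/* Illustrative model of the platform hierarchy; these are not OpenCL types. */
struct device_model {
    const char *name;
    unsigned compute_units;     /* compute units per device */
    unsigned simd_engines;      /* SIMD engines per compute unit */
    unsigned lanes_per_engine;  /* lanes (processing elements) per SIMD engine */
};

/* Processing elements inside one compute unit. */
unsigned pes_per_compute_unit(const struct device_model *d)
{
    assert(d != NULL);
    return d->simd_engines * d->lanes_per_engine;
}

/* Processing elements across the whole device. */
unsigned total_pes(const struct device_model *d)
{
    assert(d != NULL);
    return d->compute_units * pes_per_compute_unit(d);
}

/* Figures quoted in the text for the Radeon R9 290X. */
const struct device_model r9_290x = { "AMD Radeon R9 290X", 44, 4, 16 };

/* Print the per-unit and per-device totals for one device. */
void print_summary(const struct device_model *d)
{
    printf("%s: %u PEs per compute unit, %u PEs in total\n",
           d->name, pes_per_compute_unit(d), total_pes(d));
}
```

Calling print_summary(&r9_290x) from a main function prints "AMD Radeon R9 290X: 64 PEs per compute unit, 2816 PEs in total", which matches the 64 SIMD lanes per compute unit and the 2816 simultaneous instructions given in the text.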

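3.2.1 Platforms and Devices

The clGetPlatformIDs() API discovers the set of OpenCL platforms available on a given system. An OpenCL program typically calls it twice. The first call passes NULL for the platforms argument and a pointer for num_platforms, which receives the number of platforms available on the system. The programmer then allocates space (pointed to by platforms) for that many platform objects of type cl_platform_id. The second call passes platforms to retrieve that many platform objects. Once the platforms have been discovered, clGetPlatformInfo() can be used to query each one, for example for its vendor, in order to decide which platform to run the OpenCL program on. clGetPlatformIDs() has to be called before the other API calls; the vector addition source code in Section 3.6 shows this in context.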

  cl_int
  clGetPlatformIDs(
      cl_uint num_entries,
      cl_platform_id *platforms,
      cl_uint *num_platforms)
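The two-call pattern (count, allocate, fill) can be sketched without an OpenCL SDK by substituting a stub for the real entry point. Everything prefixed with mock_ below is hypothetical and exists only so the sketch builds on its own: the stub merely copies the parameter order of clGetPlatformIDs() and pretends that two platforms are installed. In real code the typedefs come from <CL/cl.h> and the stub is replaced by clGetPlatformIDs() itself.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in types so the sketch builds without an OpenCL SDK.
   Real code includes <CL/cl.h> and uses cl_int, cl_uint and cl_platform_id. */
typedef int          mock_int;
typedef unsigned int mock_uint;
typedef const char  *mock_platform_id;

/* Hypothetical stub that mirrors the parameter order of clGetPlatformIDs().
   It pretends that two platforms are installed. */
mock_int mock_get_platform_ids(mock_uint num_entries,
                               mock_platform_id *platforms,
                               mock_uint *num_platforms)
{
    static const char *const installed[] = { "Vendor A", "Vendor B" };
    const mock_uint count = 2;

    if (platforms != NULL) {
        for (mock_uint i = 0; i < num_entries && i < count; ++i)
            platforms[i] = installed[i];
    }
    if (num_platforms != NULL)
        *num_platforms = count;
    return 0; /* plays the role of CL_SUCCESS */
}

/* Count, allocate, fill. Returns a malloc'd array the caller must free,
   or NULL if no platform is found or a step fails. */
mock_platform_id *list_platforms(mock_uint *count_out)
{
    mock_uint n = 0;
    mock_platform_id *ids;

    assert(count_out != NULL);
    *count_out = 0;

    /* First call: no output array, only ask how many platforms exist. */
    if (mock_get_platform_ids(0, NULL, &n) != 0 || n == 0)
        return NULL;

    /* Allocate room for exactly that many handles. */
    ids = malloc(n * sizeof *ids);
    if (ids == NULL)
        return NULL;

    /* Second call: pass the array and its capacity to receive the handles. */
    if (mock_get_platform_ids(n, ids, NULL) != 0) {
        free(ids);
        return NULL;
    }
    *count_out = n;
    return ids;
}
```

Checking the return code of both calls, as done here, is a design choice of this sketch; it keeps a failed count query from turning into a zero-sized allocation.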

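Once a platform has been chosen, the next step is to query the devices available on it. The clGetDeviceIDs() API does this, and it is used in much the same way as clGetPlatformIDs(). clGetDeviceIDs() takes a platform object and a device type as additional arguments, but obtaining devices still takes three simple steps: first, query the number of devices; second, allocate space for that many device objects; third, retrieve the device objects and select the desired device. The device_type argument restricts the query to GPUs (CL_DEVICE_TYPE_GPU), to CPUs (CL_DEVICE_TYPE_CPU), or to all devices (CL_DEVICE_TYPE_ALL), among other options. One of these values must be passed to clGetDeviceIDs(). Like clGetPlatformInfo() for platforms, the clGetDeviceInfo() API retrieves the name, type, and vendor of each device.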

  cl_int
  clGetDeviceIDs(
      cl_platform_id platform,
      cl_device_type device_type,
      cl_uint num_entries,
      cl_device_id *devices,
      cl_uint *num_devices)
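The device_type argument is a bit mask, so several types can be combined with a bitwise OR. The sketch below imitates that filtering on a made-up device table. The DEMO_ macros repeat the bit values that the Khronos cl.h header assigns to CL_DEVICE_TYPE_CPU, CL_DEVICE_TYPE_GPU, and CL_DEVICE_TYPE_ALL; the struct, the table, and demo_get_devices() are hypothetical stand-ins and not part of OpenCL. Unlike this stub, the real clGetDeviceIDs() reports the count through num_devices and returns CL_DEVICE_NOT_FOUND when nothing matches.

```c
#include <assert.h>
#include <stddef.h>

/* Bit values as defined in the Khronos cl.h header; real code includes
   <CL/cl.h> and uses the CL_DEVICE_TYPE_* macros directly. */
#define DEMO_DEVICE_TYPE_CPU (1u << 1)
#define DEMO_DEVICE_TYPE_GPU (1u << 2)
#define DEMO_DEVICE_TYPE_ALL 0xFFFFFFFFu

/* Hypothetical device record; not an OpenCL type. */
struct demo_device {
    const char  *name;
    unsigned int type;
};

/* Made-up device table for a single platform. */
const struct demo_device demo_devices[] = {
    { "integrated GPU", DEMO_DEVICE_TYPE_GPU },
    { "host CPU",       DEMO_DEVICE_TYPE_CPU },
};

/* Copies up to num_entries matching devices into out (which may be NULL
   when num_entries is 0) and returns how many devices match type_mask. */
unsigned int demo_get_devices(unsigned int type_mask,
                              unsigned int num_entries,
                              const struct demo_device **out)
{
    const size_t total = sizeof demo_devices / sizeof demo_devices[0];
    unsigned int matches = 0;

    assert(out != NULL || num_entries == 0);

    for (size_t i = 0; i < total; ++i) {
        if ((demo_devices[i].type & type_mask) == 0)
            continue;                   /* this device type was not requested */
        if (out != NULL && matches < num_entries)
            out[matches] = &demo_devices[i];
        ++matches;
    }
    return matches;
}
```

A caller follows the three steps from the text: call once with num_entries set to 0 and out set to NULL to get the count, allocate an array of that size, then call again with the array to receive the matching devices.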

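The AMD Accelerated Parallel Processing Software Development Kit (APP SDK) includes a program called clinfo, which uses clGetPlatformInfo() and clGetDeviceInfo() to report the platforms and devices on a system. Hardware details such as memory size and bus bandwidth can also be obtained with it. Before moving on to other OpenCL features, let us pause to look at the output of clinfo, shown in Figure 3.2.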

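Translator's note: the clinfo output on the translator's machine shows that the translator and the original book used different versions of the AMD APP SDK. Judging from the output, the original book appears to have omitted some of the hardware details.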

  Number of platforms: 3
  Platform Profile: FULL_PROFILE
  Platform Version: OpenCL 1.2 CUDA 8.0.0
  Platform Name: NVIDIA CUDA
  Platform Vendor: NVIDIA Corporation
  Platform Extensions:
  cl_khr_global_int32_base_atomics
  cl_khr_global_int32_extended_atomics
  cl_khr_local_int32_base_atomics
  cl_khr_local_int32_extended_atomics
  cl_khr_fp64
  cl_khr_byte_addressable_store
  cl_khr_icd cl_khr_gl_sharing
  cl_nv_compiler_options
  cl_nv_device_attribute_query
  cl_nv_pragma_unroll
  cl_nv_d3d10_sharing
  cl_khr_d3d10_sharing
  cl_nv_d3d11_sharing
  cl_nv_copy_opts
  Platform Profile: FULL_PROFILE
  Platform Version: OpenCL 1.2
  Platform Name: Intel(R) OpenCL
  Platform Vendor: Intel(R) Corporation
  Platform Extensions:
  cl_intel_dx9_media_sharing
  cl_khr_3d_image_writes
  cl_khr_byte_addressable_store
  cl_khr_d3d11_sharing
  cl_khr_depth_images
  cl_khr_dx9_media_sharing
  cl_khr_gl_sharing
  cl_khr_global_int32_base_atomics
  cl_khr_global_int32_extended_atomics
  cl_khr_icd cl_khr_local_int32_base_atomics
  cl_khr_local_int32_extended_atomics
  cl_khr_spir
  Platform Profile: FULL_PROFILE
  Platform Version: OpenCL 2.0 AMD-APP (1800.8)
  Platform Name: AMD Accelerated Parallel Processing
  Platform Vendor: Advanced Micro Devices, Inc.
  Platform Extensions:
  cl_khr_icd
  cl_khr_d3d10_sharing
  cl_khr_d3d11_sharing
  cl_khr_dx9_media_sharing
  cl_amd_event_callback
  cl_amd_offline_devices
  Platform Name: NVIDIA CUDA
  Number of devices: 1
  Device Type: CL_DEVICE_TYPE_GPU
  Vendor ID: 10deh
  Max compute units: 4
  Max work items dimensions: 3
  Max work items[0]: 1024
  Max work items[1]: 1024
  Max work items[2]: 64
  Max work group size: 1024
  Preferred vector width char: 1
  Preferred vector width short: 1
  Preferred vector width int: 1
  Preferred vector width long: 1
  Preferred vector width float: 1
  Preferred vector width double: 1
  Native vector width char: 1
  Native vector width short: 1
  Native vector width int: 1
  Native vector width long: 1
  Native vector width float: 1
  Native vector width double: 1
  Max clock frequency: 862Mhz
  Address bits: 64
  Max memory allocation: 536870912
  Image support: Yes
  Max number of images read arguments: 256
  Max number of images write arguments: 16
  Max image 2D width: 16384
  Max image 2D height: 16384
  Max image 3D width: 4096
  Max image 3D height: 4096
  Max image 3D depth: 4096
  Max samplers within kernel: 32
  Max size of kernel argument: 4352
  Alignment (bits) of base address: 4096
  Minimum alignment (bytes) for any datatype: 128
  Single precision floating point capability
  Denorms: Yes
  Quiet NaNs: Yes
  Round to nearest even: Yes
  Round to zero: Yes
  Round to +ve and infinity: Yes
  IEEE754-2008 fused multiply-add: Yes
  Cache type: Read/Write
  Cache line size: 128
  Cache size: 65536
  Global memory size: 2147483648
  Constant buffer size: 65536
  Max number of constant args: 9
  Local memory type: Scratchpad
  Local memory size: 49152
  Kernel Preferred work group size multiple: 32
  Error correction support: 0
  Unified memory for Host and Device: 0
  Profiling timer resolution: 1000
  Device endianess: Little
  Available: Yes
  Compiler available: Yes
  Execution capabilities:
  Execute OpenCL kernels: Yes
  Execute native function: No
  Queue on Host properties:
  Out-of-Order: Yes
  Profiling : Yes
  Platform ID: 000002D3A374DC10
  Name: GeForce GTX 765M
  Vendor: NVIDIA Corporation
  Device OpenCL C version: OpenCL C 1.2
  Driver version: 375.95
  Profile: FULL_PROFILE
  Version: OpenCL 1.2 CUDA
  Extensions:
  cl_khr_global_int32_base_atomics
  cl_khr_global_int32_extended_atomics
  cl_khr_local_int32_base_atomics
  cl_khr_local_int32_extended_atomics
  cl_khr_fp64
  cl_khr_byte_addressable_store
  cl_khr_icd
  cl_khr_gl_sharing
  cl_nv_compiler_options
  cl_nv_device_attribute_query
  cl_nv_pragma_unroll
  cl_nv_d3d10_sharing
  cl_khr_d3d10_sharing
  cl_nv_d3d11_sharing
  cl_nv_copy_opts
  Platform Name: Intel(R) OpenCL
  Number of devices: 2
  Device Type: CL_DEVICE_TYPE_GPU
  Vendor ID: 8086h
  Max compute units: 20
  Max work items dimensions: 3
  Max work items[0]: 512
  Max work items[1]: 512
  Max work items[2]: 512
  Max work group size: 512
  Preferred vector width char: 1
  Preferred vector width short: 1
  Preferred vector width int: 1
  Preferred vector width long: 1
  Preferred vector width float: 1
  Preferred vector width double: 0
  Native vector width char: 1
  Native vector width short: 1
  Native vector width int: 1
  Native vector width long: 1
  Native vector width float: 1
  Native vector width double: 0
  Max clock frequency: 1150Mhz
  Address bits: 64
  Max memory allocation: 427189862
  Image support: Yes
  Max number of images read arguments: 128
  Max number of images write arguments: 128
  Max image 2D width: 16384
  Max image 2D height: 16384
  Max image 3D width: 2048
  Max image 3D height: 2048
  Max image 3D depth: 2048
  Max samplers within kernel: 16
  Max size of kernel argument: 1024
  Alignment (bits) of base address: 1024
  Minimum alignment (bytes) for any datatype: 128
  Single precision floating point capability
  Denorms: No
  Quiet NaNs: Yes
  Round to nearest even: Yes
  Round to zero: Yes
  Round to +ve and infinity: Yes
  IEEE754-2008 fused multiply-add: No
  Cache type: Read/Write
  Cache line size: 64
  Cache size: 262144
  Global memory size: 1708759450
  Constant buffer size: 65536
  Max number of constant args: 8
  Local memory type: Scratchpad
  Local memory size: 65536
  Kernel Preferred work group size multiple: 32
  Error correction support: 0
  Unified memory for Host and Device: 1
  Profiling timer resolution: 80
  Device endianess: Little
  Available: Yes
  Compiler available: Yes
  Execution capabilities:
  Execute OpenCL kernels: Yes
  Execute native function: No
  Queue on Host properties:
  Out-of-Order: No
  Profiling : Yes
  Platform ID: 000002D3A374C760
  Name: Intel(R) HD Graphics 4600
  Vendor: Intel(R) Corporation
  Device OpenCL C version: OpenCL C 1.2
  Driver version: 20.19.15.4531
  Profile: FULL_PROFILE
  Version: OpenCL 1.2
  Extensions:
  cl_intel_accelerator
  cl_intel_advanced_motion_estimation
  cl_intel_ctz
  cl_intel_d3d11_nv12_media_sharing
  cl_intel_dx9_media_sharing
  cl_intel_motion_estimation
  cl_intel_simultaneous_sharing
  cl_intel_subgroups
  cl_khr_3d_image_writes
  cl_khr_byte_addressable_store
  cl_khr_d3d10_sharing
  cl_khr_d3d11_sharing
  cl_khr_depth_images
  cl_khr_dx9_media_sharing
  cl_khr_gl_depth_images
  cl_khr_gl_event
  cl_khr_gl_msaa_sharing
  cl_khr_global_int32_base_atomics
  cl_khr_global_int32_extended_atomics
  cl_khr_gl_sharing
  cl_khr_icd
  cl_khr_image2d_from_buffer
  cl_khr_local_int32_base_atomics
  cl_khr_local_int32_extended_atomics
  cl_khr_spir
  Device Type: CL_DEVICE_TYPE_CPU
  Vendor ID: 8086h
  Max compute units: 8
  Max work items dimensions: 3
  Max work items[0]: 8192
  Max work items[1]: 8192
  Max work items[2]: 8192
  Max work group size: 8192
  Preferred vector width char: 1
  Preferred vector width short: 1
  Preferred vector width int: 1
  Preferred vector width long: 1
  Preferred vector width float: 1
  Preferred vector width double: 1
  Native vector width char: 32
  Native vector width short: 16
  Native vector width int: 8
  Native vector width long: 4
  Native vector width float: 8
  Native vector width double: 4
  Max clock frequency: 2400Mhz
  Address bits: 64
  Max memory allocation: 2126515200
  Image support: Yes
  Max number of images read arguments: 480
  Max number of images write arguments: 480
  Max image 2D width: 16384
  Max image 2D height: 16384
  Max image 3D width: 2048
  Max image 3D height: 2048
  Max image 3D depth: 2048
  Max samplers within kernel: 480
  Max size of kernel argument: 3840
  Alignment (bits) of base address: 1024
  Minimum alignment (bytes) for any datatype: 128
  Single precision floating point capability
  Denorms: Yes
  Quiet NaNs: Yes
  Round to nearest even: Yes
  Round to zero: No
  Round to +ve and infinity: No
  IEEE754-2008 fused multiply-add: No
  Cache type: Read/Write
  Cache line size: 64
  Cache size: 262144
  Global memory size: 8506060800
  Constant buffer size: 131072
  Max number of constant args: 480
  Local memory type: Global
  Local memory size: 32768
  Kernel Preferred work group size multiple: 128
  Error correction support: 0
  Unified memory for Host and Device: 1
  Profiling timer resolution: 427
  Device endianess: Little
  Available: Yes
  Compiler available: Yes
  Execution capabilities:
  Execute OpenCL kernels: Yes
  Execute native function: Yes
  Queue on Host properties:
  Out-of-Order: Yes
  Profiling : Yes
  Platform ID: 000002D3A374C760
  Name: Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz
  Vendor: Intel(R) Corporation
  Device OpenCL C version: OpenCL C 1.2
  Driver version: 5.2.0.10094
  Profile: FULL_PROFILE
  Version: OpenCL 1.2 (Build 10094)
  Extensions:
  cl_khr_icd
  cl_khr_global_int32_base_atomics
  cl_khr_global_int32_extended_atomics
  cl_khr_local_int32_base_atomics
  cl_khr_local_int32_extended_atomics
  cl_khr_byte_addressable_store
  cl_khr_depth_images
  cl_khr_3d_image_writes
  cl_intel_exec_by_local_thread
  cl_khr_spir
  cl_khr_dx9_media_sharing
  cl_intel_dx9_media_sharing
  cl_khr_d3d11_sharing
  cl_khr_gl_sharing
  cl_khr_fp64
  Platform Name: AMD Accelerated Parallel Processing
  Number of devices: 1
  Device Type: CL_DEVICE_TYPE_CPU
  Vendor ID: 1002h
  Board name:
  Max compute units: 8
  Max work items dimensions: 3
  Max work items[0]: 1024
  Max work items[1]: 1024
  Max work items[2]: 1024
  Max work group size: 1024
  Preferred vector width char: 16
  Preferred vector width short: 8
  Preferred vector width int: 4
  Preferred vector width long: 2
  Preferred vector width float: 8
  Preferred vector width double: 4
  Native vector width char: 16
  Native vector width short: 8
  Native vector width int: 4
  Native vector width long: 2
  Native vector width float: 8
  Native vector width double: 4
  Max clock frequency: 2394Mhz
  Address bits: 64
  Max memory allocation: 2147483648
  Image support: Yes
  Max number of images read arguments: 128
  Max number of images write arguments: 64
  Max image 2D width: 8192
  Max image 2D height: 8192
  Max image 3D width: 2048
  Max image 3D height: 2048
  Max image 3D depth: 2048
  Max samplers within kernel: 16
  Max size of kernel argument: 4096
  Alignment (bits) of base address: 1024
  Minimum alignment (bytes) for any datatype: 128
  Single precision floating point capability
  Denorms: Yes
  Quiet NaNs: Yes
  Round to nearest even: Yes
  Round to zero: Yes
  Round to +ve and infinity: Yes
  IEEE754-2008 fused multiply-add: Yes
  Cache type: Read/Write
  Cache line size: 64
  Cache size: 32768
  Global memory size: 8506060800
  Constant buffer size: 65536
  Max number of constant args: 8
  Local memory type: Global
  Local memory size: 32768
  Max pipe arguments: 16
  Max pipe active reservations: 16
  Max pipe packet size: 2147483648
  Max global variable size: 1879048192
  Max global variable preferred total size: 1879048192
  Max read/write image args: 64
  Max on device events: 0
  Queue on device max size: 0
  Max on device queues: 0
  Queue on device preferred size: 0
  SVM capabilities:
  Coarse grain buffer: No
  Fine grain buffer: No
  Fine grain system: No
  Atomics: No
  Preferred platform atomic alignment: 0
  Preferred global atomic alignment: 0
  Preferred local atomic alignment: 0
  Kernel Preferred work group size multiple: 1
  Error correction support: 0
  Unified memory for Host and Device: 1
  Profiling timer resolution: 427
  Device endianess: Little
  Available: Yes
  Compiler available: Yes
  Execution capabilities:
  Execute OpenCL kernels: Yes
  Execute native function: Yes
  Queue on Host properties:
  Out-of-Order: No
  Profiling : Yes
  Queue on Device properties:
  Out-of-Order: No
  Profiling : No
  Platform ID: 00007FFB80F36D30
  Name: Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz
  Vendor: GenuineIntel
  Device OpenCL C version: OpenCL C 1.2
  Driver version: 1800.8 (sse2,avx)
  Profile: FULL_PROFILE
  Version: OpenCL 1.2 AMD-APP (1800.8)
  Extensions:
  cl_khr_fp64
  cl_amd_fp64
  cl_khr_global_int32_base_atomics
  cl_khr_global_int32_extended_atomics
  cl_khr_local_int32_base_atomics
  cl_khr_local_int32_extended_atomics
  cl_khr_int64_base_atomics
  cl_khr_int64_extended_atomics
  cl_khr_3d_image_writes
  cl_khr_byte_addressable_store
  cl_khr_gl_sharing
  cl_ext_device_fission
  cl_amd_device_attribute_query
  cl_amd_vec3
  cl_amd_printf
  cl_amd_media_ops
  cl_amd_media_ops2
  cl_amd_popcnt
  cl_khr_d3d10_sharing
  cl_khr_spir
  cl_khr_gl_event

clinfo output from the original book

  Number of platforms: 1
  Platform Profile: FULL_PROFILE
  Platform Version: OpenCL 2.0 AMD-APP (1642.5)
  Platform Name: AMD Accelerated Parallel Processing
  Platform Vendor: Advanced Micro Devices, Inc.
  Platform Extensions:
  cl_khr_icd
  cl_khr_d3d10_sharing
  cl_khr_icd
  cl_amd_event_callback
  cl_amd_offline_devices
  Platform Name: AMD Accelerated Parallel Processing
  Number of devices: 2
  Vendor ID: 1002h
  Device Type: CL_DEVICE_TYPE_GPU
  Board name: AMD Radeon R9 200 Series
  Device Topology: PCI[B#1, D#0, F#0]
  Max compute units: 40
  Max work group size: 256
  Native vector width int: 1
  Max clock frequency: 1000Mhz
  Max memory allocation: 2505572352
  Image support: Yes
  Max image 3D width: 2048
  Cache line size: 64
  Global memory size: 3901751296
  Platform ID: 0x7f54fb22cfd0
  Name: Hawaii
  Vendor: Advanced Micro Devices, Inc.
  Device OpenCL C version: OpenCL C 2.0
  Driver version: 1642.5(VM)
  Profile: FULL_PROFILE
  Version: OpenCL 2.0 AMD-APP (1642.5)
  Extensions:
  cl_khr_fp64_cl_amd_fp64
  cl_khr_global_int32_base_atomics
  cl_khr_global_int32_extended_atomics
  cl_khr_local_int32_base_atomics
  Device Type: CL_DEVICE_TYPE_CPU
  Vendor ID: 1002h
  Board name:
  Max compute units: 8
  Max work items dimensions: 3
  Max work items[0]: 1024
  Max work items[1]: 1024
  Name: AMD FX(tm)-8120 Eight-Core Processor
  Vendor: AuthenticAMD
  Device OpenCL C version: OpenCL C 1.2
  Driver version: 1642.5(sse2, avx, fma4)
  Profile: FULL_PROFILE
  Version: OpenCL 1.2 (Build 10094)

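Figure 3.2 Some of the OpenCL platform and device information printed by the clinfo program. The AMD platform has two devices (one CPU and one GPU). All of this information can be queried through the platform API.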