TimeLoop: Principles and Usage

Purpose: map a neural network onto the hardware and find the dataflow with the best data reuse.
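Timeloop takes three inputs: an architecture description, a workload ("problem") description, and either a fixed mapping (timeloop-model) or mapping constraints for the mapper to search under (timeloop-mapper). As a concrete anchor, here is a minimal sketch of a 1D-convolution problem description in the style the exercises use; the dimension names, projections, and sizes are illustrative assumptions, not a copy of any exercise file.

problem:
  shape:
    name: Conv1D
    dimensions: [ R, P ]                # R: filter width, P: output width
    data-spaces:
      - name: Weights
        projection: [ [ [R] ] ]
      - name: Inputs
        projection: [ [ [R], [P] ] ]    # input index = r + p (stride 1)
      - name: Outputs
        projection: [ [ [P] ] ]
        read-write: True                # partial sums are read back and updated
  instance:
    R: 3
    P: 16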

Principles

image-20230316120500047

How to Use

00-model-conv1d-1level

image-20230215163226554

From the log file timeloop-model.accelergy.log we can see that, after Timeloop generates a mapping, it automatically invokes Accelergy, which evaluates the input architecture with plug-ins such as CACTI.
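For reference, the architecture description handed to Accelergy looks roughly like the sketch below; the component names, classes, and attribute values are assumptions for a small design, not the exercise's actual file. Accelergy maps each component class to an energy estimator (the CACTI plug-in for SRAM-style buffers, for example) and returns per-access energies that Timeloop then multiplies by its computed access counts.

architecture:
  version: 0.3
  subtree:
    - name: system
      local:
        - name: MainMemory            # backing store
          class: DRAM
          attributes:
            width: 64
            word-bits: 8
        - name: Buffer                # on-chip scratchpad, modelled as an SRAM
          class: SRAM
          attributes:
            depth: 64
            width: 8
            word-bits: 8
        - name: MACC                  # multiply-accumulate unit
          class: intmac
          attributes:
            datawidth: 8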

image-20230215162913799

The Output read and write counts differ; presumably this is because the first MAC for each output does not need to read a partial sum and only writes the result back.

image-20230215161827393

01-model-conv1d-2level

image-20230215164012799

Two mapping styles are defined: output stationary (OS) and weight stationary (WS).
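As a rough sketch of how the two dataflows are written as Timeloop mappings (same target/type/factors syntax as the constraint files quoted later), assuming the small R=3, P=16 problem and a MainMemory + Buffer hierarchy; level names and factor values are illustrative. If the shorthand used in this exercise is read as "MainMemory factors / Buffer factors", these correspond to 1 16 / 3 1 (OS) and 3 1 / 1 16 (WS).

Output stationary: the R reduction runs at the Buffer, so each output stays put while it is accumulated.

mapping:
  - target: Buffer
    type: temporal
    factors: R=3 P=1
  - target: MainMemory
    type: temporal
    factors: R=1 P=16

Weight stationary: the P loop runs at the Buffer, so each weight is reused across all outputs before the next one is fetched.

mapping:
  - target: Buffer
    type: temporal
    factors: R=1 P=16
  - target: MainMemory
    type: temporal
    factors: R=3 P=1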

image-20230215163758778

After changing it to 3 1/1 1920 as in step 4, the WS mapping reports an error (presumably a buffer-capacity violation, since the inner level would now have to hold all 1920 outputs):

image-20230215170843695

Changing it to (2) 3 64/1 30 and comparing against the OS mapping (1) 1 1920/3 1, the energy goes up. The reason: in the original, small 3*16 problem, all operands and results fit in the buffer. Once the problem grows, (1) can still finish every output entirely inside the inner buffer, and because it is OS only one value has to be fetched per MAC; (2), on the other hand, is not fully weight stationary: after each tile of 30 outputs it fetches the weights all over again, which adds 63*3 = 189 extra weight reads from MainMemory into the Buffer.
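A quick check of that count, assuming mapping (2) walks through 64 outer tiles of 30 outputs each and has to re-read the 3 weights for every tile after the first:

(64 - 1) tiles x 3 weights per tile = 189 extra MainMemory-to-Buffer weight reads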

image-20230215171153214

02-model-conv1d+output channels-2level

Discusses the tiling problem (the "tail" case).

image-20230215175930634

The difference at the Buffer level is mainly that the Input is read in one extra time.

image-20230225144923476

The difference at MainMemory is likewise mainly one extra round of Input reads.

image-20230225145210890

03-model-conv1d+oc-3level

image-20230215181203973

Discusses bypassing. Here, bypass effectively defines how the levels are connected: a datatype can selectively skip a storage level. In this example it can be understood as the RF being wired directly to MainMemory.

image-20230225174724380

Bypass at the RegisterFile level: although the Output data passes through the GlobalBuffer on its way from Main Memory to the Register File, it is not stored there. Compared with no bypass, this saves the energy of accessing the GlobalBuffer, but the network-transfer energy is unchanged.

Bypass at the GLB level: the Weight and Input read counts rise sharply, but the Output reads are eliminated.
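For reference, a bypass is declared with the same directive syntax as the Eyeriss constraints quoted further below. The sketch here (level names and datatype choice are assumptions matching the description above) tells Timeloop that Outputs move between MainMemory and the RegisterFile without being held in the GlobalBuffer:

mapping:
  - target: GlobalBuffer
    type: bypass
    bypass: [ Outputs ]            # Outputs pass through without being stored here
    keep: [ Weights, Inputs ]      # Weights and Inputs are still buffered in the GLB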

image-20230225174059773

04-model-conv1d+oc+ic-3levelspatial

image-20230215181715931

Problem dimensions: [C, K, R, P]

Timeloop automatically detects and exploits multicast opportunities.

KP: the Output is partitioned spatially, and the Input data is automatically multicast (from the GLB to the RFs); see the spatial-split sketch below.

CP: the Input is partitioned spatially, and the Output data is automatically multicast (from the GLB to the RFs); when the Output is stored back (from the RFs to the GLB) it can first be reduced (accumulated), the reverse of multicast. Looking at the output file, this actually performs worse: after partitioning, the Output data still has to be written back while the Input does not, so this strategy is less energy-efficient.
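A spatial split is requested with type: spatial, using the same directive syntax as before. As a sketch of the KP-style partitioning described above (factor values are assumptions), spreading the output dimensions K and P across the parallel RF instances below the GLB, so that Input data shared across the split gets multicast automatically:

mapping:
  - target: GlobalBuffer
    type: spatial
    factors: K=2 P=2 C=1 R=1       # split the output dimensions across the RF array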

05-mapper-conv1d+oc-3level

Is 2^(2*3) = 64 the size of the map space?

The mapper section changes to what is shown in the figure below, and an extra mapspace_constraints section appears.

image-20230225220028110

06-mapper-convlayer-eyeriss

mapper:
  optimization-metrics: [ delay, energy ]
  live-status: False
  num-threads: 8
  timeout: 15000
  victory-condition: 500
  algorithm: random-pruned
  max-permutations-per-if-visit: 16

Constraints imposed by the hardware architecture and the dataflow:

architecture_constraints:
  targets:
    # certain buffer only stores certain datatypes
    - target: psum_spad
      type: bypass
      bypass: [Inputs, Weights]
      keep: [Outputs]
    - target: weights_spad
      type: bypass
      bypass: [Inputs, Outputs]
      keep: [Weights]
    - target: ifmap_spad
      type: bypass
      bypass: [Weights, Outputs]
      keep: [Inputs]
    - target: DummyBuffer
      type: bypass
      bypass: [Inputs, Outputs, Weights]
    - target: shared_glb
      type: bypass
      bypass: [Weights]
      keep: [Inputs, Outputs]
    - target: DummyBuffer
      type: spatial
      split: 4
      permutation: NPQR SCM
      factors: N=1 P=1 Q=1 R=1 S=0
    # only allow fanout of M, Q out from glb
    - target: shared_glb
      type: spatial
      split: 7
      permutation: NCPRSQM
      factors: N=1 C=1 P=1 R=1 S=1
    # one ofmap position but of different output channels
    - target: psum_spad
      type: temporal
      permutation: NCPQRS M
      factors: N=1 C=1 R=1 S=1 P=1 Q=1
    # row stationary -> 1 row at a time
    - target: weights_spad
      type: temporal
      permutation: NMPQS CR
      factors: N=1 M=1 P=1 Q=1 S=1 R=0
    - target: ifmap_spad
      type: temporal
      permutation: NMCPQRS
      factors: N=1 M=1 C=1 P=1 Q=1 R=1 S=1
    # enforce the hardware limit of the bypassing everything
    - target: DummyBuffer
      type: temporal
      factors: N=1 M=1 C=1 P=1 Q=1 R=1 S=1

The constraints below are not imposed by the hardware architecture or the dataflow; they simply prune the search space to speed up the search.

mapspace_constraints:
  targets:
    # intuitive optimization to reduce map space size
    # the factors of these are 1 anyways, so the order does not really matter
    - target: DummyBuffer
      type: temporal
      permutation: NMCPQRS
    # intuitive optimization for row stationary
    # -> process a row/col of the output before going to the next one
    - target: shared_glb
      type: temporal
      permutation: QRSC PNM
      factors: Q=1 R=1 S=1 P=0
    # intuitive optimization to reduce map space size
    - target: DRAM
      type: temporal
      permutation: RSP CMNQ
      factors: R=1 S=1 P=1

Concepts not yet understood

timeloop-model.stats.txt

Temporal reductions
per-instance
per-cluster
Fanout
Multicast factor
Average number of hops : 0.50
Spatial Reduction Energy