TimeLoop: Principles and Usage

Purpose: map a neural network onto the hardware and find the dataflow with the best data reuse.
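Timeloop takes three inputs: an architecture description, a workload ("problem") description, and either a fixed mapping (timeloop-model) or mapping constraints for the mapper to search under (timeloop-mapper). As a concrete anchor, here is a minimal sketch of a 1D-convolution problem description in the style the exercises use; the dimension names, projections, and sizes are illustrative assumptions, not a copy of any exercise file.

problem:
  shape:
    name: Conv1D
    dimensions: [ R, P ]                # R: filter width, P: output width
    data-spaces:
      - name: Weights
        projection: [ [ [R] ] ]
      - name: Inputs
        projection: [ [ [R], [P] ] ]    # input index = r + p (stride 1)
      - name: Outputs
        projection: [ [ [P] ] ]
        read-write: True                # partial sums are read back and updated
  instance:
    R: 3
    P: 16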

Principles

image-20230316120500047

How to Use

00-model-conv1d-1level

image-20230215163226554

From the log file timeloop-model.accelergy.log we can see that, after Timeloop generates a mapping, it automatically invokes Accelergy, which evaluates the input architecture with plug-ins such as CACTI.
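For reference, the architecture description handed to Accelergy looks roughly like the sketch below; the component names, classes, and attribute values are assumptions for a small design, not the exercise's actual file. Accelergy maps each component class to an energy estimator (the CACTI plug-in for SRAM-style buffers, for example) and returns per-access energies that Timeloop then multiplies by its computed access counts.

architecture:
  version: 0.3
  subtree:
    - name: system
      local:
        - name: MainMemory            # backing store
          class: DRAM
          attributes:
            width: 64
            word-bits: 8
        - name: Buffer                # on-chip scratchpad, modelled as an SRAM
          class: SRAM
          attributes:
            depth: 64
            width: 8
            word-bits: 8
        - name: MACC                  # multiply-accumulate unit
          class: intmac
          attributes:
            datawidth: 8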

image-20230215162913799

The Output read and write counts differ; presumably this is because the first MAC for each output does not need to read a partial sum and only writes the result back.

image-20230215161827393

01-model-conv1d-2level

image-20230215164012799

Two mapping styles are defined: output stationary (OS) and weight stationary (WS).
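As a rough sketch of how the two dataflows are written as Timeloop mappings (same target/type/factors syntax as the constraint files quoted later), assuming the small R=3, P=16 problem and a MainMemory + Buffer hierarchy; level names and factor values are illustrative. If the shorthand used in this exercise is read as "MainMemory factors / Buffer factors", these correspond to 1 16 / 3 1 (OS) and 3 1 / 1 16 (WS).

Output stationary: the R reduction runs at the Buffer, so each output stays put while it is accumulated.

mapping:
  - target: Buffer
    type: temporal
    factors: R=3 P=1
  - target: MainMemory
    type: temporal
    factors: R=1 P=16

Weight stationary: the P loop runs at the Buffer, so each weight is reused across all outputs before the next one is fetched.

mapping:
  - target: Buffer
    type: temporal
    factors: R=1 P=16
  - target: MainMemory
    type: temporal
    factors: R=3 P=1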

image-20230215163758778

After changing it to 3 1/1 1920 as in step 4, the WS mapping reports an error (presumably a buffer-capacity violation, since the inner level would now have to hold all 1920 outputs):

image-20230215170843695

Changing it to (2) 3 64/1 30 and comparing against the OS mapping (1) 1 1920/3 1, the energy goes up. The reason: in the original, small 3*16 problem, all operands and results fit in the buffer. Once the problem grows, (1) can still finish every output entirely inside the inner buffer, and because it is OS only one value has to be fetched per MAC; (2), on the other hand, is not fully weight stationary: after each tile of 30 outputs it fetches the weights all over again, which adds 63*3 = 189 extra weight reads from MainMemory into the Buffer.
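A quick check of that count, assuming mapping (2) walks through 64 outer tiles of 30 outputs each and has to re-read the 3 weights for every tile after the first:

(64 - 1) tiles x 3 weights per tile = 189 extra MainMemory-to-Buffer weight reads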

image-20230215171153214

02-model-conv1d+output channels-2level

Discusses the tiling problem (the "tail" case).

image-20230215175930634

The difference at the Buffer level is mainly that the Input is read in one extra time.

image-20230225144923476

The difference at MainMemory is likewise mainly one extra round of Input reads.

image-20230225145210890

03-model-conv1d+oc-3level

image-20230215181203973

Discusses bypassing. Here, bypass effectively defines how the levels are connected: a datatype can selectively skip a storage level. In this example it can be understood as the RF being wired directly to MainMemory.

image-20230225174724380

Bypass at the RegisterFile level: although the Output data passes through the GlobalBuffer on its way from Main Memory to the Register File, it is not stored there. Compared with no bypass, this saves the energy of accessing the GlobalBuffer, but the network-transfer energy is unchanged.

Bypass at the GLB level: the Weight and Input read counts rise sharply, but the Output reads are eliminated.
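For reference, a bypass is declared with the same directive syntax as the Eyeriss constraints quoted further below. The sketch here (level names and datatype choice are assumptions matching the description above) tells Timeloop that Outputs move between MainMemory and the RegisterFile without being held in the GlobalBuffer:

mapping:
  - target: GlobalBuffer
    type: bypass
    bypass: [ Outputs ]            # Outputs pass through without being stored here
    keep: [ Weights, Inputs ]      # Weights and Inputs are still buffered in the GLB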

image-20230225174059773

04-model-conv1d+oc+ic-3levelspatial

image-20230215181715931

Problem dimensions: [C, K, R, P]

Timeloop automatically detects and exploits multicast opportunities.

KP: the Output is partitioned spatially, and the Input data is automatically multicast (from the GLB to the RFs); see the spatial-split sketch below.

CP: the Input is partitioned spatially, and the Output data is automatically multicast (from the GLB to the RFs); when the Output is stored back (from the RFs to the GLB) it can first be reduced (accumulated), the reverse of multicast. Looking at the output file, this actually performs worse: after partitioning, the Output data still has to be written back while the Input does not, so this strategy is less energy-efficient.
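A spatial split is requested with type: spatial, using the same directive syntax as before. As a sketch of the KP-style partitioning described above (factor values are assumptions), spreading the output dimensions K and P across the parallel RF instances below the GLB, so that Input data shared across the split gets multicast automatically:

mapping:
  - target: GlobalBuffer
    type: spatial
    factors: K=2 P=2 C=1 R=1       # split the output dimensions across the RF array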

05-mapper-conv1d+oc-3level

Is 2^(2*3) = 64 the size of the map space?

The mapper section changes to what is shown in the figure below, and an extra mapspace_constraints section appears.

image-20230225220028110

06-mapper-convlayer-eyeriss

mapper:
  optimization-metrics: [ delay, energy ]
  live-status: False
  num-threads: 8
  timeout: 15000
  victory-condition: 500
  algorithm: random-pruned
  max-permutations-per-if-visit: 16

Constraints imposed by the hardware architecture and the dataflow:

architecture_constraints:
  targets:
    # certain buffer only stores certain datatypes
    - target: psum_spad
      type: bypass
      bypass: [Inputs, Weights]
      keep: [Outputs]
    - target: weights_spad
      type: bypass
      bypass: [Inputs, Outputs]
      keep: [Weights]
    - target: ifmap_spad
      type: bypass
      bypass: [Weights, Outputs]
      keep: [Inputs]
    - target: DummyBuffer
      type: bypass
      bypass: [Inputs, Outputs, Weights]
    - target: shared_glb
      type: bypass
      bypass: [Weights]
      keep: [Inputs, Outputs]
    - target: DummyBuffer
      type: spatial
      split: 4
      permutation: NPQR SCM
      factors: N=1 P=1 Q=1 R=1 S=0
    # only allow fanout of M, Q out from glb
    - target: shared_glb
      type: spatial
      split: 7
      permutation: NCPRSQM
      factors: N=1 C=1 P=1 R=1 S=1
    # one ofmap position but of different output channels
    - target: psum_spad
      type: temporal
      permutation: NCPQRS M
      factors: N=1 C=1 R=1 S=1 P=1 Q=1
    # row stationary -> 1 row at a time
    - target: weights_spad
      type: temporal
      permutation: NMPQS CR
      factors: N=1 M=1 P=1 Q=1 S=1 R=0
    - target: ifmap_spad
      type: temporal
      permutation: NMCPQRS
      factors: N=1 M=1 C=1 P=1 Q=1 R=1 S=1
    # enforce the hardware limit of the bypassing everything
    - target: DummyBuffer
      type: temporal
      factors: N=1 M=1 C=1 P=1 Q=1 R=1 S=1

The constraints below are not imposed by the hardware architecture or the dataflow; they simply prune the search space to speed up the search.

mapspace_constraints:
  targets:
    # intuitive optimization to reduce map space size
    # the factors of these are 1 anyways, so the order does not really matter
    - target: DummyBuffer
      type: temporal
      permutation: NMCPQRS
    # intuitive optimization for row stationary
    # -> process a row/col of the output before going to the next one
    - target: shared_glb
      type: temporal
      permutation: QRSC PNM
      factors: Q=1 R=1 S=1 P=0
    # intuitive optimization to reduce map space size
    - target: DRAM
      type: temporal
      permutation: RSP CMNQ
      factors: R=1 S=1 P=1

Concepts not yet understood

timeloop-model.stats.txt

Temporal reductions
per-instance
per-cluster
Fanout
Multicast factor
Average number of hops : 0.50
Spatial Reduction Energy