Accelergy: Principles and Code

JuneWen

2023-03-14

Graduation project

DNN accelerators, simulators

Energy estimation formula (dynamic power):

\[Power = \alpha C V_{DD}^2 f\]
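As a rough illustration of the formula, dividing dynamic power by the clock frequency gives the energy switched per cycle, E = αCV_DD². The sketch below uses made-up values for α, C, V_DD and f; the function names are illustrative only and are not part of Accelergy.

```python
def dynamic_power(alpha, capacitance_f, vdd_v, freq_hz):
    """Dynamic power in watts: P = alpha * C * V_DD^2 * f."""
    return alpha * capacitance_f * vdd_v ** 2 * freq_hz


def energy_per_cycle(alpha, capacitance_f, vdd_v):
    """Energy in joules switched in one clock cycle: E = P / f = alpha * C * V_DD^2."""
    return alpha * capacitance_f * vdd_v ** 2


# Hypothetical values: activity factor 0.5, 1 pF switched capacitance, 1.0 V, 1 GHz
power_w = dynamic_power(0.5, 1e-12, 1.0, 1e9)
energy_j = energy_per_cycle(0.5, 1e-12, 1.0)
print(f"power = {power_w * 1e3:.3f} mW, energy per cycle = {energy_j * 1e12:.3f} pJ")
```

Note that the per-cycle energy does not depend on f, which is why per-action energy is a convenient quantity to tabulate.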

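The main causes of differences in per-action energy fall into four categories:

(1) Action properties

(2) Data properties

(3) Clock gating

(4) Design-specific optimizations

Low-level energy is analyzed mainly through two plug-ins: CACTI and Aladdin.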

CACTI

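Energy estimation specifically for SRAM, DRAM, and caches.

CACTI is a tool for estimating the access time, energy consumption, and area of caches and memories. It is based on a parameterized model that computes the performance and power of a cache or memory for different design choices and technology nodes. CACTI describes the organization of a cache or memory hierarchically, including arrays, subarrays, row decoders, column decoders, bitline drivers, and sense amplifiers. It also accounts for circuit-level factors such as wire capacitance, resistance, and transistor sizing.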

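The CACTI energy estimator works by building an energy model that computes the energy each component consumes in each operating mode (read, write, or idle). The model has two parts: static power and dynamic power. Static power is caused mainly by transistor leakage current and does not depend on the operating mode; dynamic power is caused mainly by switching activity and does depend on the operating mode. CACTI uses some simplifying assumptions to estimate the switching activity factor and the capacitance being charged and discharged. Finally, it sums the energy of all components to obtain the total energy of the cache or memory.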
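The summation described above can be sketched as follows. This is only a toy model under stated assumptions: the `Component` class, the component names, and all numbers are hypothetical and do not come from CACTI itself.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    dynamic_pj_per_access: float  # hypothetical dynamic energy per access, in pJ
    leakage_mw: float             # hypothetical static (leakage) power, in mW


def total_energy_pj(components, n_accesses, runtime_ns):
    """Sum dynamic and static energy over all components, in pJ."""
    total = 0.0
    for c in components:
        dynamic = c.dynamic_pj_per_access * n_accesses  # depends on activity
        static = c.leakage_mw * runtime_ns              # mW * ns = pJ, activity independent
        total += dynamic + static
    return total


# Made-up component values, for illustration only
parts = [
    Component("row_decoder", 1.0, 0.1),
    Component("bitlines", 3.0, 0.2),
    Component("sense_amplifiers", 1.5, 0.05),
]
total = total_energy_pj(parts, n_accesses=100, runtime_ns=1000)
print(f"total energy = {total:.1f} pJ")
```

With zero accesses only the leakage term remains, which corresponds to the idle case.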


# Call CACTI to obtain the estimated area and energy (SRAM and cache)
exec_list = [cacti_exec_path, '-infile', cfg_file_name]
subprocess.call(exec_list, stdout=temp_output)

DRAM

def DRAM_estimate_energy(self, interface):
        action_name = interface['action_name']
        width = interface['attributes']['width']
        energy = 0
        if 'read' in action_name or 'write' in action_name:
            tech = interface['attributes']['type']
            # Public data
            if tech == 'LPDDR4':
                energy = 8 * width
            # Malladi et al., ISCA'12
            elif tech == 'LPDDR':
                energy = 40 * width
            elif tech == 'DDR3':
                energy = 70 * width
            # Chatterjee et al., MICRO'17
            elif tech == 'GDDR5':
                energy = 14 * width
            elif tech == 'HBM2':
                energy = 3.9 * width
            else:
                energy = 0
        return energy
    
def DRAM_estimate_area(self, interface):
        # DRAM area is zero: it takes up no on-chip area
        return 0
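The if/elif chain in DRAM_estimate_energy amounts to a per-bit energy lookup multiplied by the access width. A table-driven sketch with the same constants is shown below; the names `DRAM_ENERGY_PER_BIT` and `dram_access_energy` are hypothetical, and the unit is presumably pJ (as in the SRAM code), although the excerpt does not state it.

```python
# Per-bit access energy by DRAM type, copied from the constants in the listing above
DRAM_ENERGY_PER_BIT = {
    "LPDDR4": 8,
    "LPDDR": 40,
    "DDR3": 70,
    "GDDR5": 14,
    "HBM2": 3.9,
}


def dram_access_energy(action_name, dram_type, width):
    """Energy of one DRAM action; only read/write actions consume energy."""
    if "read" not in action_name and "write" not in action_name:
        return 0
    # Unknown DRAM types fall back to 0, as in the original else branch
    return DRAM_ENERGY_PER_BIT.get(dram_type, 0) * width


print(dram_access_energy("read", "HBM2", 64))
print(dram_access_energy("idle", "HBM2", 64))
```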

SRAM

# Cacti only estimates energy for SRAM size larger than 64B (512b)

# Cacti only supports technology nodes between 22 nm and 180 nm

        if address_delta == 0 and data_delta == 0:
            interpreted_entry_key = ('idle', tech_node, size_in_bytes, wordsize_in_bytes, n_rw_ports, desired_n_banks)
            energy = self.records[interpreted_entry_key]
        else:
            # rough estimate: address decoding takes 30%, memory_cell_access_energy takes 70%
            idle_energy = self.records[('idle', tech_node, size_in_bytes, wordsize_in_bytes, n_rw_ports, desired_n_banks)]
            address_decoding_energy = (self.records[desired_entry_key] - idle_energy) * 0.3 * address_delta / desired_n_banks
            memory_cell_access_energy = (self.records[desired_entry_key] - idle_energy) * 0.7 * data_delta
            energy = address_decoding_energy + memory_cell_access_energy + idle_energy
        return energy  # output energy is pJ
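To see how the 30%/70% split behaves, the fragment above can be rewritten as a standalone function. This is a sketch only: `sram_access_energy` and its plain numeric arguments are hypothetical stand-ins for the `self.records` lookups, and the example energies are made up.

```python
def sram_access_energy(full_access_pj, idle_pj, address_delta, data_delta, n_banks):
    """Scale a full-access energy by how much the address and data actually change."""
    if address_delta == 0 and data_delta == 0:
        return idle_pj
    active = full_access_pj - idle_pj  # energy above the idle baseline
    address_decoding = active * 0.3 * address_delta / n_banks
    memory_cell_access = active * 0.7 * data_delta
    return address_decoding + memory_cell_access + idle_pj


# Hypothetical numbers: 10 pJ for a full access, 1 pJ when idle, single bank
full = sram_access_energy(10.0, 1.0, address_delta=1, data_delta=1, n_banks=1)
same_address = sram_access_energy(10.0, 1.0, address_delta=0, data_delta=1, n_banks=1)
idle = sram_access_energy(10.0, 1.0, address_delta=0, data_delta=0, n_banks=1)
print(full, same_address, idle)
```

With both deltas equal to 1 and a single bank the result reduces to the full-access energy, and with both equal to 0 it reduces to the idle energy.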

A cache is SRAM underneath; the CACTI input parameters for the two compare as follows:

Aladdin

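datapath (including loop-iteration parallelism, pipelining, array partitioning, and clock frequency; Aladdin) + memory (cache hierarchy; DRAMSim2)

Aladdin is a tool for estimating the performance, energy consumption, and area of accelerators. It is based on a high-level model that computes the performance and power of an accelerator for different algorithms and architectures. Aladdin describes the behavior of the accelerator using a dynamic trace, which covers memory accesses, compute operations, control flow, and so on. It also accounts for circuit-level factors such as clock frequency, voltage, and transistor sizing.

The Aladdin energy estimator works by building an energy model that computes the energy each component consumes in each operating mode (read, write, or idle). The model has two parts: static power and dynamic power. Static power is caused mainly by transistor leakage current and does not depend on the operating mode; dynamic power is caused mainly by switching activity and does depend on the operating mode. Aladdin uses some simplifying assumptions to estimate the switching activity factor and the capacitance being charged and discharged. Finally, it sums the energy of all components to obtain the total energy of the accelerator.

Its internal data are tables for 40 nm and 45 nm devices; energy and area for other bit widths and depths are obtained by scaling.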

def oneD_quadratic_interpolation(desired_x, known):
    """
    utility function that performs 1D quadratic interpolation with a known energy value
    :param desired_x: integer value of the desired attribute/argument
    :param known: list of dictionary [{x: <value>, y: <energy>}]

    :return energy value with desired attribute/argument

    """
    # assume E = ax^2 + c where x is a hardware attribute
    ordered_list = []
    if known[1]['x'] < known[0]['x']:
        ordered_list.append(known[1])
        ordered_list.append(known[0])
    else:
        ordered_list = known

    slope = (known[1]['y'] - known[0]['y']) / (known[1]['x']**2 - known[0]['x']**2)
    desired_energy = slope * (desired_x**2 - ordered_list[0]['x']**2) + ordered_list[0]['y']
    return desired_energy

def intmultiplier_estimate_energy(self, interface):
    this_dir, this_filename = os.path.split(__file__)
    nbit = interface['attributes']['datawidth']
    action_name = interface['action_name']
    if action_name == 'mult_gated':
        interface['action_name'] = 'idle'  # reflect gated multiplier energy
    csv_nbit = 32
    csv_file_path = os.path.join(this_dir, 'data/multiplier.csv')
    energy = AladdinTable.query_csv_using_latency(interface, csv_file_path)

    if not nbit == csv_nbit:
        energy = oneD_quadratic_interpolation(nbit, [{'x': 0, 'y': 0}, {'x': csv_nbit, 'y': energy}])
    if action_name == 'mult_reused':
        energy = 0.85 * energy
    return energy
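As a worked example of the scaling in the two listings above: with the anchor points (0, 0) and (32, E32), the quadratic model E = ax² + c makes a 16-bit multiplier cost one quarter of the 32-bit value. The helper `quadratic_scale` below is a compact, hypothetical restatement of the same formula, and the 4.0 pJ table value is made up.

```python
def quadratic_scale(desired_x, x0, y0, x1, y1):
    """Evaluate E = a*x^2 + c fitted through the two points (x0, y0) and (x1, y1)."""
    a = (y1 - y0) / (x1 ** 2 - x0 ** 2)
    return a * (desired_x ** 2 - x0 ** 2) + y0


energy_32bit = 4.0  # hypothetical table value for a 32-bit multiply, in pJ
energy_16bit = quadratic_scale(16, 0, 0.0, 32, energy_32bit)
energy_16bit_reused = 0.85 * energy_16bit  # same 0.85 factor as the 'mult_reused' action

print(energy_16bit, energy_16bit_reused)
```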

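Environment setup

1. Install Anaconda, with python>=3.8.

2. Run pip install . in the accelergy folder; the package is installed automatically.

3. accelergy turns out to have been installed into /home/june/.local/bin, so we add the following line to /home/june/.bashrc:

export PATH="$PATH:/home/june/.local/bin"

Then run source .bashrc to apply the configuration.

4. accelergy is now available as an executable command.

cd examples/hierarchy/input
accelergy -o ../output/ *.yaml components/*.yaml -v 1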
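A quick way to confirm that the PATH change took effect is to look the command up from Python. This is a minimal sketch using only the standard library; the helper name `find_accelergy` is hypothetical.

```python
import shutil


def find_accelergy():
    """Return the full path of the accelergy executable, or None if it is not on PATH."""
    return shutil.which("accelergy")


path = find_accelergy()
if path is None:
    print("accelergy not found; check that ~/.local/bin is on PATH")
else:
    print(f"accelergy found at {path}")
```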

accelergy

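Mapping exploration with Timeloop: Timeloop uses a concise, unified representation of the key architecture and implementation attributes of DNN accelerators to describe a broad space of hardware architectures. With the help of an accurate energy estimator, Timeloop produces accurate processing speed and energy efficiency characterizations for any given workload through a mapper, which finds the best way to schedule operations and stage data on the specified architecture.
Energy estimation with Accelergy: Accelergy serves as the energy estimator, providing flexible energy estimation to support Timeloop's energy characterization. Accelergy accepts specifications of arbitrary accelerator architecture designs. These designs are composed of user-defined, design-specific high-level compound components and user-defined low-level primitive components, which can be characterized by third-party energy estimation plug-ins to reflect the technology-dependent characteristics of the design.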

Estimators

estimation_plug_ins/accelergy-aladdin-plug-in/aladdin.estimator.yaml

estimation_plug_ins/accelergy-cacti-plug-in/cacti.estimator.yaml

estimation_plug_ins/accelergy-table-based-plug-ins/table.estimator.yaml

architecture - eyeriss - v0.2

1 smartbuffer_SRAM:
eyeriss_like.weights_glb
eyeriss_like.shared_glb

2 XY_NoC:
eyeriss_like.weights_NoC
eyeriss_like.ifmap_NoC
eyeriss_like.psum_write_NoC
eyeriss_like.psum_read_NoC

3 smartbuffer_RF:
eyeriss_like.PE[0..167].weights_spad
eyeriss_like.PE[0..167].ifmap_spad
eyeriss_like.PE[0..167].psum_spad

4 intmac:
eyeriss_like.PE[0..167].mac

ation does not include PE[139], PE[125], PE[111], PE[97], PE[83], PE[69], PE[55], PE[41], PE[27], PE[13], that is, the last column of the 12×14 array.
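The index pattern can be checked with a few lines of Python. The sketch assumes the 168 PEs are numbered row by row across a 12×14 array (row-major order), which the text does not state explicitly; `column_indices` is a hypothetical helper.

```python
N_ROWS, N_COLS = 12, 14  # array shape from the text; row-major numbering is an assumption


def column_indices(col, n_rows=N_ROWS, n_cols=N_COLS):
    """Flat PE indices of one column when PEs are numbered row by row."""
    return [row * n_cols + col for row in range(n_rows)]


last_column = column_indices(N_COLS - 1)
listed = [139, 125, 111, 97, 83, 69, 55, 41, 27, 13]

print(last_column)
print(all(i in last_column for i in listed))
```

Under this numbering the last column also contains PE[153] and PE[167], which are not in the list above.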

architecture:
  version: 0.3
  subtree:
    - name: eyeriss_like
      attributes:
        technology: 40nm
      local:
        - name: weights_glb
          class: smartbuffer_SRAM
          attributes:
            memory_width: 64
            memory_depth: 1024
            n_banks: 2
        - name: shared_glb
          class: smartbuffer_SRAM
          attributes:
            memory_width: 64
            n_banks: 25
            bank_depth: 512
            memory_depth: bank_depth * n_banks
            n_buffets: 2
            update_fifo_depth: 2 
        - name: ifmap_NoC
          class: XY_NoC
          attributes:
            datawidth: 16 # input images are 16-bit, low precision
            col_id_width: 5 
        - name: weights_NoC
          class: XY_NoC
          attributes:
            datawidth: 64 # weights are 64-bit, high precision
        - name: psum_write_NoC
          class: XY_NoC
          attributes:
            datawidth: 64 # partial sums are 64-bit, high precision
        - name: psum_read_NoC
          class: XY_NoC
          attributes:
            datawidth: 64
            Y_X_wire_avg_length: 4mm
      subtree:
      - name: PE[0..167]
        attributes:
          memory_width: 16
        local:
          - name: ifmap_spad # holds inputs
            class: smartbuffer_RF
            attributes:
              memory_depth: 12 # ???
              buffet_manager_depth: 0
          - name: weights_spad # holds weights
            class: smartbuffer_SRAM # SRAM???
            attributes:
              memory_depth: 224 # ??? 14*16
              buffet_manager_depth: 0
          - name: psum_spad # holds partial sums
            class: smartbuffer_RF
            attributes:
              memory_depth: 24 # ???
              buffet_manager_depth: 24 # seems to be just scoreboard_depth
              update_fifo_depth: 2 
          - name: mac
            class: intmac
            attributes:
              datawidth: 16
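The storage sizes implied by this description follow from width × depth. The sketch below simply redoes that arithmetic with the numbers from the YAML; `capacity_bytes` is a hypothetical helper, and it assumes the scratchpads inherit `memory_width: 16` from the enclosing PE.

```python
def capacity_bytes(memory_width_bits, memory_depth):
    """Capacity of a memory with the given word width (bits) and depth (entries)."""
    return memory_width_bits * memory_depth // 8


weights_glb = capacity_bytes(64, 1024)     # memory_width 64, memory_depth 1024
shared_glb = capacity_bytes(64, 512 * 25)  # memory_depth = bank_depth * n_banks

# Per-PE scratchpads, assuming they inherit memory_width: 16 from the PE
ifmap_spad = capacity_bytes(16, 12)
weights_spad = capacity_bytes(16, 224)
psum_spad = capacity_bytes(16, 24)

n_pes = 167 - 0 + 1  # PE[0..167]

print(weights_glb, shared_glb)
print(ifmap_spad, weights_spad, psum_spad, n_pes)
```

Here shared_glb comes out at 102400 bytes (100 KiB), and the 168 PEs match the 12×14 array.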