
Diary/2019-4-17

ASPLOS, day 5

Day 3 of the main conference.

Machine Learning I

PUMA: A Programmable Ultra-efficient Memristor-based Accelerator for Machine Learning Inference
  • memristive crossbar
    • 2-6 bits per cell vs 1 bit for CMOS (SRAM) = up to 6x
    • cell area is 4F^2 vs 120F^2 for CMOS (SRAM) = 30x
    • Analog MVM 1.34pJ/op
  • ref. RENO DAC15, PRIME ISCA16
  • Domain-specific ISA
    • large register address space to support memristive crossbars
    • vector width keeps instruction memory low in spatial architecture
  • Hybrid core
    • hybrid memristive and CMOS
  • compiler optimization
    • graph partitioning
    • MVM instructions have high latency
  • inference energy: reduced compared with Skylake and Pascal.
  • PUMA compiler https://github.com/illinois-impact/puma-compiler
  • PUMA simulator https://github.com/Aayush-Ankit/dpe_emulate
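
A minimal sketch of how a wide integer weight could be spread over multi-bit cells and recombined after the analog MVM. Bit slicing is a common technique for this; the notes above do not say whether PUMA does exactly this, and the names `slice_weight` and `crossbar_mvm`, the 8-bit weight width, and the 2 bits per cell are assumptions for illustration. Inputs drive the rows, and each column sum stands in for the accumulated column current.

```python
def slice_weight(w, bits_per_cell, n_slices):
    """Split a non-negative integer weight into base-2^bits_per_cell digits, LSB first."""
    mask = (1 << bits_per_cell) - 1
    return [(w >> (bits_per_cell * s)) & mask for s in range(n_slices)]


def crossbar_mvm(matrix, vector, bits_per_cell=2, weight_bits=8):
    """Compute result[c] = sum_r vector[r] * matrix[r][c] one weight slice at a time."""
    n_slices = weight_bits // bits_per_cell
    rows, cols = len(matrix), len(matrix[0])
    # Hypothetical layout: slice s of every weight lives in its own crossbar.
    sliced = [[slice_weight(matrix[r][c], bits_per_cell, n_slices)
               for c in range(cols)] for r in range(rows)]
    result = [0] * cols
    for s in range(n_slices):
        for c in range(cols):
            # Models the analog column current: sum of input times cell value.
            column_sum = sum(vector[r] * sliced[r][c][s] for r in range(rows))
            # Digital shift-and-add puts the slice back at its bit position.
            result[c] += column_sum << (bits_per_cell * s)
    return result


weights = [[3, 200], [17, 5]]   # 8-bit weights, rows are inputs
inputs = [2, 1]
outputs = crossbar_mvm(weights, inputs)
```

With 2 bits per cell an 8-bit weight needs 4 cells, and at 6 bits per cell it would need 2, which is where the density advantage over 1-bit cells comes from.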

FPSA: A Full System Stack Solution for Reconfigurable ReRAM-based NN Accelerator Architecture
  • issues
    • In ReRAM-based systems the DACs and ADCs are large (although in the logical view the ReRAM looks like the large part).
    • communication bound
    • reliability
    • flexibility
      • ReRAM-based VMM(fast), Digital-based others(relatively slow)
  • ref. Bridge the gap between neural networks and ..., ASPLOS'18
  • FPSA; ReRAM-based processing element
    • reduce digital circuit, spiking schema
    • fully parallel
  • routing = island-style, like FPGA (ref. mrFPGA)
  • system stack: neural synthesizer -> spatial-to-temporal mapper -> place & route

Bit-Tactical: A Software/Hardware Approach to Exploiting Value and Bit Sparsity in Neural Networks
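  • Goal: feed sparse operand data to the MAC units without wasted work
  • → let data be pushed on demand into whichever compute units are idle
  • They seem to control the data supply by putting MUXes on the datapath into the compute units, I think.
  • Looks useful if it can be built well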

Machine Learning II


TANGRAM: Optimized Coarse-Grained Dataflow for Scalable NN Accelerators
  • scaling NN perf.
    • use more PEs & more on-chip buffers
      • monolithic engine <- low resource utilization, long array busses, far from SRAM
      • -> tiled architecture - mostly local data transfers, easy to scale up/down
      • - <- dataflow scheduling ?
  • inter-layer parallel
    • buffer sharing dataflow - tiles share data → partition and distribute the data first, then exchange it between tiles later
  • inter-layer pipeline
    • pipeline multiple layers; pros: saves DRAM B/W; cons: uses resources less efficiently (long delay, large SRAM)
    • -> fine-grained data forwarding
      • forward each subset of data to the next layer as soon as ready
      • require matched access patterns between adjacent layers
      • pipeline scheduling is done with a dataflow tool

Packing Sparse Convolutional Neural Networks for Efficient Systolic Array Implementations: Column Combining Under Joint Optimization
  • About converting a sparse matrix into a dense one
  • zero weights in systolic arrays are wasteful
    • -> column combining: 9 tiles become 3 tiles.
      • only the product with the retained weight is selected and computed.
  • ref. Full-stack Optimization for Accelerating CNNs with FPGA Validation, ICS 2019 ???
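
A minimal sketch of the column-combining idea as I understood it from the talk. It only merges columns that never have a nonzero weight in the same row, which is a simplifying assumption so that the packed result matches the dense one exactly. The function names and the `max_group` limit are hypothetical. Each packed cell keeps the retained weight together with its original column index, which plays the role of the input selector.

```python
def combine_columns(W, max_group=3):
    """Greedily group columns of W that have no nonzero in the same row."""
    n_rows, n_cols = len(W), len(W[0])
    groups = []
    for c in range(n_cols):
        for g in groups:
            conflict = any(
                W[r][c] != 0 and any(W[r][k] != 0 for k in g)
                for r in range(n_rows)
            )
            if len(g) < max_group and not conflict:
                g.append(c)
                break
        else:
            groups.append([c])  # no compatible group found, start a new one
    packed = []
    for r in range(n_rows):
        row = []
        for g in groups:
            nonzero = [(W[r][k], k) for k in g if W[r][k] != 0]
            # At most one nonzero per group and row; empty cells store weight 0.
            row.append(nonzero[0] if nonzero else (0, g[0]))
        packed.append(row)
    return groups, packed


def packed_matvec(packed, x):
    """Each cell multiplies its weight by the input chosen via the stored index."""
    return [sum(w * x[k] for w, k in row) for row in packed]


W = [[1, 0, 0, 2],
     [0, 3, 0, 0],
     [0, 0, 4, 0]]
x = [1, 2, 3, 4]
groups, packed = combine_columns(W)
y = packed_matvec(packed, x)
```

Here four sparse columns shrink to two packed columns, and `y` equals the dense product of `W` and `x`.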

Split-CNN: Splitting Window-based Operations in Convolutional Neural Networks for Memory System Optimization
  • DL faces a memory problem; HBM memory is expensive
    • Accelerator(eg. GPU): 16GB/32GB/..., Host: 512GB/1TB/...
  • opportunities enabled by NV-LINK
    • ref. vDNN(Rhu, MICRO 49)
  • memory profile of training DNN
  • Split-CNN
    • accuracy drops slowly as we split deeper and into more patches
    • the split pattern is varied from batch to batch
  • HMMS is a static memory planner that determines the timing of memory allocation, deallocation, prefetching and offloading

To address the memory bottleneck during training, they propose Split-CNN, which splits the data, and HMMS, a system that manages memory and prefetching. Evaluated on an IBM Power System S822LC.

Storage

LightStore: Software-defined Network-attached Key-value Drives

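Proposes LightStore, a network-attached KVS built from an embedded-class processor and several TB of NAND flash. The FTL is implemented in hardware. Compared with RocksDB on a Xeon server, its random-set speed surpasses the Xeon server, it scales with the number of nodes, and it saves power.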

  • one ssd per network port, KV interface,
  • optimization
    • system optimization
      • memcpy, thread
  • LSM-tree spec. opt
    • decoupled keys from KV pairs, bloom filter
  • FTL in HW
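
As a reminder of what the bloom filter buys in an LSM-tree, here is a generic sketch, not LightStore's implementation. The class name, the sizes, and the use of SHA-256 as the hash are assumptions. A negative answer is definite, so the lookup can skip reading that level from flash; a positive answer may be a false positive.

```python
import hashlib


class BloomFilter:
    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, key: bytes):
        # Derive n_hashes bit positions by salting one hash function.
        for i in range(self.n_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, key: bytes):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: bytes) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))


level_filter = BloomFilter()
level_filter.add(b"user:42")
```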

SOML Read: Rethinking the read operation granularity of 3D NAND SSDs

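3D NAND raised the density, so an SSD of the same capacity has fewer chips; chip-level parallelism drops and reads get slower. So they reworked the SW and HW so that partial-page reads can be packed into a single read command.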

  • fewer number of NAND chips -> lower multi-chip parallelism
  • ← single-operation-multiple-location
    • combine partial-page reads into one READ command

FlatFlash: Exploiting the Byte-Accessibility of SSDs within A Unified Memory-Storage Hierarchy

To make an SSD (PCIe-attached flash storage) byte-accessible in the same way as DRAM,
they implemented an SSD->DRAM promotion mechanism.

  • FlatFlash, byte addressable interface
    • avoid paging
    • reduce i/o traffic
    • reduces dram latency
  • dram in ssd + pcie mmio + opencapi
  • ref. FlashMap, ISCA'15 - unifying the memory and storage <- FlatFlash is 1.6x faster.
  • promotion to DRAM is slow -> want to run it in the background -> consistency problem

Quantum Computing

A Case for Variability-Aware Policies for NISQ-Era Quantum Computers
  • ref. the problem of optimizing qubit SWAPs
  • not all qubits are created equal
    • exploit variation in error rates to improve reliability
      • assign more operations on reliable qubits/link
      • <- rather than minimizing the SWAP count
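
A small sketch of why the most reliable route can differ from the one with the fewest hops. It runs Dijkstra over -log(success probability), so the shortest path maximizes the product of link success rates. The function name, the link success rates, and the assumption that link errors are independent are all mine, not from the paper.

```python
import heapq
import math


def most_reliable_path(links, src, dst):
    """links: {(a, b): success_probability} for undirected links.
    Assumes dst is reachable from src."""
    adj = {}
    for (a, b), p in links.items():
        cost = -math.log(p)  # adding costs equals multiplying probabilities
        adj.setdefault(a, []).append((b, cost))
        adj.setdefault(b, []).append((a, cost))
    dist = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, cost in adj.get(u, []):
            nd = d + cost
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    path.reverse()
    return path, math.exp(-dist[dst])


# Hypothetical device: a 2-hop route over weak links, a 3-hop route over good ones.
links = {(0, 1): 0.80, (1, 2): 0.80,
         (0, 3): 0.95, (3, 4): 0.95, (4, 2): 0.95}
path, success = most_reliable_path(links, 0, 2)
```

The 2-hop route succeeds with probability 0.64, while the 3-hop route reaches about 0.857, so counting hops alone picks the worse route.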

Tackling the Qubit Mapping Problem for NISQ-Era Quantum Devices
  • qubit connection limitation
  • mapping with SWAP
    • heuristic - Zulehner et al., DATE'18, Siraichi et al., CGO'18
  • reduce search complexity
    • swap-based search
      • Prev.: mapping-based search, high complexity - O(exp(N))
      • Proposed: search a SWAP sequence - only consider high-priority qubits - O(N^2.5)
    • reverse traversal for init. mapping
      • Prev.: random initial mapping
      • Proposed: Inspired by the reversibility
    • control the parallelism
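
To make "mapping with SWAP" concrete, here is a naive baseline sketch, not the proposed algorithm: for each two-qubit gate it moves the first qubit along a BFS shortest path on the coupling graph until it sits next to the second one. The function names and the 4-qubit line topology are assumptions.

```python
from collections import deque


def shortest_path(coupling, src, dst):
    """BFS shortest path between physical qubits on the coupling graph."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            break
        for v in coupling[u]:
            if v not in prev:
                prev[v] = u
                queue.append(v)
    path = []
    node = dst
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]


def route(gates, coupling, mapping):
    """gates: list of (logical_a, logical_b); mapping: logical -> physical."""
    mapping = dict(mapping)
    ops = []
    for a, b in gates:
        path = shortest_path(coupling, mapping[a], mapping[b])
        # Swap along the path, stopping one hop before b's physical qubit.
        for u, v in zip(path[:-2], path[1:-1]):
            ops.append(("SWAP", u, v))
            inverse = {p: l for l, p in mapping.items()}
            lu, lv = inverse.get(u), inverse.get(v)
            if lu is not None:
                mapping[lu] = v
            if lv is not None:
                mapping[lv] = u
        ops.append(("CX", mapping[a], mapping[b]))
    return ops, mapping


line = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}   # physical qubits 0-1-2-3
initial = {0: 0, 1: 1, 2: 2, 3: 3}              # logical -> physical
ops, final = route([(0, 3)], line, initial)
```

The initial mapping decides how many SWAPs this baseline needs, which is why the choice of initial mapping matters so much.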

Noise-Adaptive Compiler Mappings for Noisy Intermediate-Scale Quantum Computers
  • There is a gap between quantum algorithms and real machines
  • NISQ Resource constraints
    • Low qubits: 5-72
    • high gate error rates: 1-10%
    • Qubits hold state for 100us
  • cur.
    • compile once per input: more optimization opportunities
    • reduce program execution time to avoid decoherence
    • communication/SWAP optimization
    • Used in IBM, Rigetti, Google compilers
    • -> NISQ systems have ~10x spatial and temporal noise variation!
  • proposed: noise-adaptive compilation
  • noise variation impacts success rate
  • #1: choose a good initial mapping
  • #2: coherence-aware scheduling
    • influences mapping: choose qubits with good coherence time
  • #3: reduce SWAPs, use low-error rate routes
  • -> implement as a constrained optimization
  • Scaffold Program -> LLVM IR ScaffCC -> Optimization using z3 SMT Solver* -> OpenQASM
  • * the noise data is fed in at this step

https://github.com/prakashmurali/TriQ

Optimized Compilation of Aggregated Instructions for Realistic Quantum Computers

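There is a large gap between logical quantum operations and the physical operations. For efficient physical control, rather than 1- and 2-qubit operations, they aggregate operations into units that act on up to 10 qubits at once. I think that is the idea?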

  • layered approach to quantum compilation
  • GRAPE - GRadient Ascent Pulse Engineering
  • how to maximally utilize optimal control? - physical gate decomposition, physical gate optimization