Shaizeen Aga - AMD Research

Academic Research Projects

Smart Memory Defenses for Memory Bus Side Channel

Problem context: Ensuring privacy of code and data while executing software on a computer physically owned and maintained by an untrusted party is an open challenge.
Problem statement: Low-overhead hardware design which provides defenses against memory bus side-channel remains elusive. Current solutions employ Oblivious RAM (ORAM) to ensure address confidentiality and incur severe overheads (memory bandwidth~100X, performance~10X). Addressing memory bus side-channel completely also requires addressing data integrity, freshness and timing channel which only further add to these overheads.
My solution: I implemented a low-overhead hardware design, InvisiMem, in which we employ 3D stacked memory with a logic layer wherein we implement cryptographic primitives to provide efficient defenses against memory bus side-channel. Under InvisiMem, the secure processor can send encrypted addresses over the untrusted memory bus without having to rely on ORAM. Additional measures are required for address confidentiality which we identify and implement efficiently. We also propose an efficient freshness solution (without Merkle trees as in prior works) by establishing secure communication channel between processor and memory and using process isolation checks. Finally, we also employ constant heart-beat packets for mitigating memory bus timing channel.
Result impact: InvisiMem incurs about 14.21% performance and no memory bandwidth overhead as compared to an order of magnitude performance and two orders of magnitude memory bandwidth overhead that prior works incur. These ultra-low overheads will I believe pave the way for production systems to solve memory bus side channel efficiently.

In-place Computation in Caches

Problem context: We continue to generate nearly 2.5 quintillion bytes per day and as such, the amounts of data that needs analyzing will only increase with time.
Problem statement: Conventional architectures are not equipped to tackle this data deluge. They expend disproportionately large fraction of time and energy in moving data over cache hierarchy, and in instruction processing, as compared to the actual computation. Further, applications which tackle massive data tend to have high degree of data parallelism which the narrow vector units in conventional processors fail to exploit.
My solution: In this work, I proposed the Compute Cache architecture which transforms caches into active compute units capable of performing in-place computation. This transformation unlocks massive data-parallel compute capabilities (~100X wrt SIMD processor), as a cache is comprised of many smaller sub-arrays each of which can compute in parallel. This also reduces data movement energy over the cache hierarchy as we can perform computation in cache without moving it towards the processor. Realizing Compute Caches brings to forth several challenges like efficient data placement, orchestration of concurrent computation in caches, ensuring soft error reliability and more which I address efficiently.
Result impact: My study indicates that Compute Cache enabled operations can deliver significant throughput (54X) and total energy savings (14X). For a suite of data-centric applications, Compute Caches deliver performance improvement of 1.9X and energy savings of 2.4X while being limited by Amdahl’s law. Future studies to include a richer set of operations that can be performed in-place in cache will, I believe, help accelerate larger fractions of applications and close the gap between potential and realized improvements.
This work was awarded best demo at Center for Future Architectures Research (CFAR) Annual Workshop 2016 which showcased nearly 50 projects in computer architecture related topics from several leading institutions. Also, this work won the 1st place at University of Michigan CSE Graduate Students Honors Competition ; a yearly competition which recognizes research of broad interest and exceptional quality.

Exposing Optimistic Concurrency to Cilk Scheduler

This project was part of my Internship with High Performance Computing group at Pacific Northwest National Labs, Richland, WA. My mentor here was Sriram Krishnamoorthy.
Problem context: Recursive parallel programming models such as Cilk strive to make it easy for programmers to express parallel programs by enabling a simple divide-and-conquer programming model.
Problem statement: Recursive work partitioning can impose additional constraints on concurrency than is implied by the true dependencies in a program thus affecting performance.
My solution: I address Cilk's inefficiency using a speculation-based approach to alleviate the concurrency constraints imposed by such recursive parallel programs. We augment the Cilk runtime infrastructure to support speculative execution and implement a predictor to accurately learn and identify increasing degrees of speculation opportunities to relax extraneous concurrency constraints.
Result impact: I demonstrated that speculative relaxation of concurrency constraints can deliver considerable performance gains (1.6X on 30 cores) over baseline Cilk without compromising the easy of programming that Cilk affords.

zFENCE: Data-less Coherence for Efficient Fences

Problem context: With ubiquity of multi-core systems, a crucial debate is picking the memory model for such systems. Memory model decides the assumptions a programmer can make and hence the ease of programmability.
Problem statement: Sequential consistency (SC) is arguably the most intuitive memory model but manufacturers instead chose to support relaxed models which allow more performance optimizations. Supporting these relaxed models necessitates a fence instruction which is used to implement memory accesses to synchronization variables. The cost of these fence instructions remains prohibitively expensive. Efficient fences will not only help improve the performance of today’s concurrent algorithms, but could also pave the way for the adoption of SC model.
My solution: In this work, I observed that a significant fraction of fence overhead is caused by stores that are waiting for data from memory. To exploit this observation, I proposed the zFence architecture for efficiently implementing a fence by introducing the capability to grant coherence permission for a store much earlier than servicing its data from memory. Using this efficient fence instruction we can enable a low-overhead SC design.
Result impact: Using zFence, I demonstrated that sequential consistent multi-core system is possible for a mere overhead of 2.93%.

Stronger Memory Models for GPUs

Problem statement: Picking an intuitive memory model is an important question for GPUs for they are now widely used to write general purpose parallel programs.
Problem statement: Inspite of its simplicity, manufacturers do not implement Sequential consistency (SC) and instead chose to support relaxed models which allow more performance optimizations.
Solution: In this work, I collaborated with fellow graduate student Abhayendra Singh to investigate memory model implications for GPU’s wherein we propose a GPU-specific non-speculative SC design that takes advantage of high spatial locality and temporally private data in GPU applications to bring down SC overhead.
Result impact: Our GPU-specific SC design shows that SC can also be enabled for GPUs for a minimal overhead.

Other Projects

ReactiveSC: Speculatively Relaxing Memory Consistency Model Contraints using Dynamic Classification of Cache Blocks

This project was part of my Parallel Computer Architecture (EECS 570) course project at University of Michigan, Ann Arbor. Using dynamic classification of cache blocks, we relaxed memory consistency model constraints to improve performance of Sequentially Consistent hardware.
This project earned Top Grade in Winter 2012 class of EECS 570!
Project Report
Poster

Eka: P6 Microarchitecture based Core

This project was part of my Computer Architecture (EECS 470) course project at University of Michigan, Ann Arbor.
I implemented the memory interface of the core (load queue, store queue, post retirement store buffer) and host of other components like Reorder Buffer, Instruction Buffer. I also designed and implemented an Adaptive Instruction Prefetcher which gained us significant performance benefits.
This processor design earnedcTop Grade in Fall 2011 class of EECS 470!
Project Report

Undergraduate Research Project at NVIDIA Graphics Pvt. Ltd

I worked here on NVIDIA’s parallel computing platform CUDA and ported a True motion estimation algorithm on the CUDA platform. The challenges involved here were: understanding true motion estimation and CUDA architecture and doing a literature survey to pick an algorithm which could be efficiently ported onto the CUDA platform.

Conference Papers

SeqPoint: Identifying Representative Iterations of Sequence-based Neural Networks

Suchita Pati, Shaizeen Aga, Matt Sinclair and Nuwan Jayasena.
To appear in IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), April 2020. [link]

Co-ML: A Case for Collaborative ML Acceleration using Near-data Processing

Shaizeen Aga, Nuwan Jayasena and Mike Ignatowski.
In 5th International Symposium on Memory Systems (MEMSYS), October 2019. [link]

InvisiPage: Oblivious Demand Paging for Enclaves

Shaizeen Aga and Satish Narayanasamy.
In 46th International Symposium on Computer Architecture (ISCA), June 2019. [link]

MOCA: Memory Object Classification and Allocation in Heterogeneous Memory Systems

Aditya Narayan, Tiansheng Zhang, Shaizeen Aga, Satish Narayanasamy and Ayse Kivilcim Coskun.
In 32nd IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2018. [link]

InvisiMem: Smart Memory Defenses for Memory Bus Side Channel

Shaizeen Aga and Satish Narayanasamy.
In the 44th International Symposium on Computer Architecture (ISCA), June 2017. [link]

Compute Caches

Shaizeen Aga, Supreet Jeloka, Arun Subramaniyan, Satish Narayanasamy, David Blaauw, and Reetuparna Das.
In 23rd IEEE Symposium on High Performance Computer Architecture (HPCA), February 2017. [link]

Efficiently Enforcing Strong Memory Ordering in GPUs

Abhayendra Singh, Shaizeen Aga, Satish Narayanasamy.
In the 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), December 2015. [link]

CilkSpec: optimistic concurrency for Cilk

Shaizeen Aga, Sriram Krishnamoorthy, Satish Narayanasamy.
In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Austin, TX, November 2015. [link]

zFence: Data-less Coherence for Efficient Fences

Shaizeen Aga, Abhayendra Singh, Satish Narayanasamy.
In 29th International Conference on Supercomputing (ICS), June 2015 [link]

Patents

Ordering constraint management within coherent memory systems

Shaizeen Aga, Abhayendra Singh, Satish Narayanasamy.
US Patent 9367461, 2014 [link]

Method for exploiting parallelism in task-based systems using an iteration space splitter

Behnam Robatmili, Shaizeen Aga, Dario Suarez Gracia, Arun Raman, Aravind Natarajan, Gheorghe Calin Cascaval, Pablo Montesinos Ortego, Han Zhao.
US Patent 9501328, 2016 [link]

Trusted computing system with enhanced memory

Satish Narayanasamy, Shaizeen Aga.
US Patent Pending [link]

¡Hola! I'm Shaizeen Aga.