I am a Computer Architect passionate about the future of "intelligent" computing hardware. The current manifestations of such processors (although not exactly "intelligent" like the brain) are Machine/Deep Learning Accelerators. However, I believe brain-inspired neuromorphic processors (an early manifestation of a silicon neocortex) are right around the corner. I received my PhD from ECE at Carnegie Mellon University, advised by Prof. John Paul Shen (NCAL). As part of my research, I also worked closely with Prof. Jim Smith, Emeritus at UW-Madison. My main research focus is Neuromorphic Computer Architecture, wherein I am exploring new brain-inspired paradigms of computing. Currently, I am building on Prof. Smith's work on Temporal Neural Networks and Space-Time Algebra to design microarchitectures for implementing energy-efficient sensory processing units using standard CMOS technology. I also enjoy working on projects broadly related to Computer Architecture and Machine Learning.
During my PhD, I collaborated closely with MediaTek, where I completed a continuous two-year internship. At MediaTek, I was fortunate to work on bleeding-edge Deep Learning Accelerator architecture, compiler stack, and performance modeling. At CMU, I helped co-create and proliferate two graduate courses: Modern Computer Architecture and Design (18-740) and Neuromorphic Computer Architecture and Processor Design (18-743). I have 5+ years of teaching experience over 11 semesters across the two courses (Head TA for 10 semesters), delivering lectures, developing lab assignments, leading multiple teams of TAs and mentoring 35+ teams of graduate students on research projects.
I speak four languages fluently (English, Malayalam, Hindi, Marathi), have finished basic courses in Sanskrit (a long time ago) and German, and am a beginner in Spanish and French. In my free time, I like to go on short as well as long drives around my home state of California. I also enjoy playing chess, badminton and cricket. I actively collected coins in high school, and I currently possess a foreign currency collection of coins (72 countries) and notes (15 countries).
Feel free to connect with me on LinkedIn.
Key Courses: Foundations of Computer Systems (18-600), Machine Learning (10-601), Deep Learning (11-785), Hardware Architectures for Machine Learning (18-663), Neural Computation (15-686), Systems and Toolchains for AI engineers (18-813), Modern Computer Architecture and Design (18-740 | TA), Neuromorphic Computer Architecture (18-743 | TA)
Key Courses: Microprocessors, Advanced Computer Architecture, VLSI Design, Physics of Transistors, Operating Systems, Computer/Network Security, Statistics, Calculus, Linear Algebra, Quantum Physics
Key Courses: Computer Vision and Image Processing, Integrated Analog Design, Embedded Hardware System Design, Fuzzy/Neural Systems
Working on next-gen architecture for ML accelerator chip targeting wearables.
Worked on architectural simulator and ISA finetuning for in-house AI accelerator within production mobile SoCs. Developed efficient microarchitecture designs for components within next-generation AI accelerator targeting future mobile SoCs. Implemented the designs in Verilog RTL, performed functional verification and further assisted with UVM verification.
Explored the usability and robustness of in-house AI software ecosystem, NeuroPilot, and contributed to its documentation. Further developed AI applications using NeuroPilot for edge inferencing on Dimensity SoC.
Modeled a multi-layered IC stack in Ansys HFSS and determined the IC stack layers responsible for significant EM signal leakage. Further simulated a theoretical EM Side Channel Analysis using MATLAB and successfully extracted the correct key byte using correlation analysis.
Part of the team building a Smart Home Solar Power System with wireless load control and data monitoring. Created a Wireless Central hub and 6 Mini hubs using PIC MCUs, and RF/Wi-Fi Modules (an IoT system). Managed transmission of control signals from server to appliances and power dissipation data back to server.
Collaborated with three other graduate students to develop a CNN-based solution for the Facial Emotion Recognition (FER) problem, with the goal of efficient edge inferencing. The idea was to take a small baseline CNN and augment it with an appropriate attention mechanism to focus on relevant facial features. The proposed solution achieved 83% and 63.5% accuracy on the CK+ and FER2013 datasets respectively (the latter among the top 10 in the ICML 2013 FER Challenge), while being 8x faster and 3x more power-efficient than the state-of-the-art VGG-19 on the Snapdragon 855 mobile platform.
The goal was to propose a differentiable Neural Architecture Search (NAS) approach inspired by FBNets, to generate effective neural network (NN) architectures that are heavily optimized for a given target device. The key idea was to extend the loss function to include an energy constraint along with the typical loss function and a latency constraint as used by FBNets. After extensive experimentation and loss function tuning in PyTorch, the new loss function successfully generated NN architectures that balanced accuracy, latency and energy consumption for a Raspberry Pi (used as an example target device). The trained child architectures provided up to 2.5x speedup and 3.8x reduction in energy with tolerable accuracy loss (4-5%). This work (with equal contribution from all three authors) is available for perusal on arXiv.
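As an illustration of the loss-function idea above, here is a minimal sketch in plain Python (not the project's PyTorch code). It assumes the FBNet-style form, where cross-entropy is scaled by a power of the log of expected latency, and simply multiplies in an analogous energy term. The exact form of the energy term, the function names, the cost tables and the coefficient values are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one layer's architecture logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_cost(arch_logits, cost_table):
    """Expected cost of the sampled network.

    arch_logits[l][i] is the logit of candidate op i in layer l, and
    cost_table[l][i] is that op's measured cost (latency or energy).
    """
    total = 0.0
    for logits, costs in zip(arch_logits, cost_table):
        probs = softmax(logits)
        total += sum(p * c for p, c in zip(probs, costs))
    return total

def hardware_aware_loss(ce, arch_logits, latency_table, energy_table,
                        alpha=0.2, beta=0.6, gamma=0.2, delta=0.6):
    """Cross-entropy scaled by latency and energy penalties (assumed form).

    alpha, beta, gamma and delta are placeholder values, not tuned ones.
    """
    lat = expected_cost(arch_logits, latency_table)
    en = expected_cost(arch_logits, energy_table)
    if lat <= 1.0 or en <= 1.0:
        # log(x) must be positive for the fractional power to stay real,
        # so pick units in which the total costs exceed 1.
        raise ValueError("expected costs must be greater than 1")
    latency_term = alpha * math.log(lat) ** beta
    energy_term = gamma * math.log(en) ** delta
    return ce * latency_term * energy_term
```

Because both penalty terms depend on the architecture logits only through a softmax-weighted sum, the same expression stays differentiable when written with tensors, which is what lets the search trade accuracy against latency and energy by gradient descent.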
The goal was to design a hybrid microarchitecture with energy efficiency similar to in-order (InO) processors while providing close to out-of-order (OoO) performance. The proposed architecture used only InO structures without any expensive dynamic scheduling hardware. It consisted of a free-flow front-end consisting of functional units, and in-order queues at the back-end for exposing Memory Level Parallelism. It also implemented certain dynamic optimizations at the renaming stage of the pipeline, namely Move Elimination, Memory Bypassing and Constant Folding. These three optimizations collapse the corresponding instruction dependencies, reducing the total cycle count for execution. When simulated in Snipersim, this hybrid architecture displayed very high energy efficiency (150% improvement over OoO and 80% over InO), while maintaining performance close to that of the OoO processor (lagging by only 3.5%).
Implemented an AES decryption engine on Xilinx Zynq-7000 FPGA using VHDL, which takes an encrypted image and displays the decrypted one on an LCD monitor. It consisted of two custom coprocessors, one for executing the AES decryption algorithm and another for implementing a TFT controller peripheral to display graphics on the LCD, with AXI-Stream (AXIS) communication between Processing System (PS) and Programmable Logic (PL). The AES decryption algorithm was also executed in C to validate the algorithm implementation and successfully displayed the decrypted image on the monitor.
Developed hardware description model for a processor consisting of a free-flowing in-order front-end coupled with a shrunken OoO backend, using Verilog for a simplified 16-bit RISC instruction set with 15 instructions. Tomasulo dynamic scheduling along with register renaming and a hybrid Reservation Station structure/Re-Order Buffer were implemented. Performed cycle-level simulations in Altera Quartus Prime to validate the model.
As part of the Computer Systems course, I developed a dynamic memory allocator for C programs and optimized it for space utilization, improving its throughput by almost 1.5x. I also designed an interactive command-line interpreter using appropriate signal handlers for running user programs, and a multi-cache simulator in C with MSI cache coherence protocol.
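To make the MSI protocol mentioned above concrete, here is a minimal sketch in Python that tracks the state of a single cache line across several private caches. It is not the course simulator (which was written in C); the class name, method names and return strings are hypothetical, and the model ignores timing, replacement and multiple lines.

```python
class MSILine:
    """MSI state of one cache line in each of num_caches private caches.

    States: "M" (Modified), "S" (Shared), "I" (Invalid).
    """

    def __init__(self, num_caches):
        self.states = ["I"] * num_caches
        self.writebacks = 0  # times a Modified copy was flushed to memory

    def read(self, cache_id):
        """Processor read from cache_id; returns "hit" or "miss"."""
        if self.states[cache_id] != "I":
            return "hit"  # S or M can service the read locally
        # Read miss: a bus read forces any Modified owner to flush
        # its data and downgrade to Shared.
        for other, state in enumerate(self.states):
            if other != cache_id and state == "M":
                self.states[other] = "S"
                self.writebacks += 1
        self.states[cache_id] = "S"
        return "miss"

    def write(self, cache_id):
        """Processor write from cache_id; returns "hit", "upgrade" or "miss"."""
        if self.states[cache_id] == "M":
            return "hit"  # already the exclusive owner
        result = "upgrade" if self.states[cache_id] == "S" else "miss"
        # Gaining exclusive ownership invalidates every other copy.
        for other, state in enumerate(self.states):
            if other != cache_id:
                if state == "M":
                    self.writebacks += 1
                self.states[other] = "I"
        self.states[cache_id] = "M"
        return result
```

For example, with two caches, a read by cache 0 followed by a read by cache 1 leaves both copies Shared; a subsequent write by cache 0 upgrades it to Modified and invalidates the copy in cache 1.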
Designed non-pipelined as well as pipelined versions of a multi-cycle 16-bit RISC processor consisting of 15 instructions in Verilog HDL and simulated it in Modelsim-Altera Simulator. The designs were demonstrated successfully by implementing on a DE0-Nano Development Board. As part of another project, I also designed the data path and controller of a CISC microcode based processor consisting of 19 instructions.
The goal was to design a suspended beam MEMS metal switch with better sensitivity than conventional sensors. The proposed structure consisted of a metal beam with an insulator and air gap beneath it. When the voltage across the beam and the ground electrode is varied, the beam deflects and ultimately collapses at the pull-in voltage. Its sensitivity was optimized using appropriate dimensions and materials through extensive C-V experiments using MEMS+ software, to achieve a subthreshold swing below 60 mV/decade.
Devised a system using Pt-51 (8052 architecture) Board, a GSM Modem (SIM800C) and an IR LED to control the temperature of an AC using GSM-based text messages. AT commands were sent via UART interface to initialize the GSM Module as well as to retrieve temperature information from the text message received by it. This temperature information was transmitted to the AC via the IR LED with the help of NEC Protocol. Decoding of AT commands and NEC encoding of temperature information were performed by the 8052 microcontroller.
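The NEC encoding step can be sketched as follows. This is an illustrative Python model of the standard NEC frame layout (9 ms leading burst, 4.5 ms space, then address, inverted address, command and inverted command, each sent LSB first), not the 8052 firmware. Using the temperature value directly as the command byte, and the address 0x00 in the usage note, are assumptions; a real air conditioner defines its own codes.

```python
def nec_frame_bits(address, command):
    """Return the 32 data bits of a standard NEC frame, in transmit order."""
    if not (0 <= address <= 0xFF and 0 <= command <= 0xFF):
        raise ValueError("address and command must each fit in 8 bits")
    payload = [address, address ^ 0xFF, command, command ^ 0xFF]
    # Each byte is transmitted least-significant bit first.
    return [(byte >> i) & 1 for byte in payload for i in range(8)]

def nec_timings(address, command):
    """Return (burst_us, space_us) pairs describing the IR LED on/off times."""
    timings = [(9000.0, 4500.0)]  # leading burst and space
    for bit in nec_frame_bits(address, command):
        # Both bit values start with a 562.5 us burst; the space that
        # follows is 1687.5 us for a 1 and 562.5 us for a 0.
        timings.append((562.5, 1687.5 if bit else 562.5))
    timings.append((562.5, 0.0))  # final burst marks the end of the frame
    return timings
```

Calling `nec_timings(0x00, 24)` would describe a frame carrying a hypothetical "set 24 degrees" command. Because every byte is followed by its inverse, each frame contains exactly 16 ones and 16 zeros and therefore always has the same duration.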
Designed a prototype of a Local Positioning System (LPS) using three ultrasound transmitters, a receiver, two Pt-51 Boards and two Xbee Modules, which could display the position of the receiver on an LCD. Used the Xbee Modules to enable synchronized sequential sending of ultrasonic pulses from the transmitters. Implemented the trilateration method at the receiver side to calculate the receiver’s position.
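The trilateration step can be sketched in a few lines of Python. This is an illustrative 2D version, not the microcontroller code: it assumes the three transmitter positions are known and that each distance has already been derived from the measured pulse travel time. Subtracting the first circle equation from the other two removes the quadratic terms and leaves a 2x2 linear system, solved here with Cramer's rule. The function name and argument layout are hypothetical.

```python
def trilaterate(anchors, distances):
    """Estimate (x, y) from three anchor positions and distances to them."""
    (x1, y1), (x2, y2), (x3, y3) = anchors
    r1, r2, r3 = distances

    # Linear system A * [x, y]^T = b obtained by subtracting circle 1
    # from circles 2 and 3.
    a11, a12 = 2 * (x2 - x1), 2 * (y2 - y1)
    a21, a22 = 2 * (x3 - x1), 2 * (y3 - y1)
    b1 = r1**2 - r2**2 + x2**2 - x1**2 + y2**2 - y1**2
    b2 = r1**2 - r3**2 + x3**2 - x1**2 + y3**2 - y1**2

    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        raise ValueError("anchors are collinear; position is not unique")

    x = (b1 * a22 - a12 * b2) / det
    y = (a11 * b2 - a21 * b1) / det
    return x, y
```

With noisy distance measurements the three circles generally do not meet in a single point, so this closed form returns the solution of the linearized system rather than an exact intersection; a least-squares fit over more transmitters would be the usual refinement.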
As a Head TA since the course's inception, I helped develop course material, delivered lectures and co-ordinated work among five TAs over four offerings. I was the primary developer of the hardware as well as software framework used in the class.
As a Head TA since the course's inception, I helped develop course material, delivered lectures and co-ordinated work among six TAs over three offerings. I served as the primary student liaison with Qualcomm and MediaTek, and helped establish industry collaboration for lab assignments exploring CPU, GPU and NPU cores inside Qualcomm/MediaTek's state-of-the-art mobile SoCs.
As an undergraduate TA, I mainly helped with grading and proctoring of quizzes and exams.