
[DGIST Series] How the Quest for AI Led to Next-Generation Memory & Computing Processors

March 23, 2023

Neurometabolomics, brain engineering, microlasers, biophysics, and semiconductor convergence technology are some of the research areas of Daegu Gyeongbuk Institute of Science and Technology (DGIST) in South Korea. In particular, the institute has heavily invested in semiconductor R&D, with its campus home to a fabrication facility.

With such vast knowledge and experience of semiconductors, the DGIST professors are the ideal authors of our new seven-part series which covers subjects such as the basic modules of semiconductors and semiconductor application cases, as well as other advanced technologies such as memory and interface circuits.

In the first episode of the series, Professor Jong-Hyeok Yoon from the Department of Electrical Engineering and Computer Science will explain how computing processes have evolved to power current AI technologies. As AI requires an enormous amount of computation power along with high speed and efficiency, this article will reveal how development in this field started with CPUs and GPUs and has progressed to digital accelerators that have transformed the function of memory.

 

The Evolution of AI

AI has been in the spotlight recently following the release of the AI chatbot ChatGPT, which has captured the public’s imagination with its ability to answer sophisticated questions. However, 20 years ago, AI was generally only of practical use on large-scale servers for tasks such as online searches using natural language, while AI on edge devices1 such as personal computers was still at a very low level. People who remember Rocky, the Office Assistant dog that appeared in Microsoft Office programs more than two decades ago, will probably agree. The dog was generally seen as an inconvenience: it could only provide information from its programmed index, would sometimes respond with irrelevant information, and was difficult to turn off.

The progress made in AI technology over the years can be seen by simply opening a smartphone camera app. Nowadays, smartphones feature AI chips with an NPU2, a processor that imitates a neural network, so that related photos can be pulled up by merely entering a search term, while it is also possible to detect and edit objects in photos without relying on a server.

So, why were we unable to develop such advanced AI technologies in the past? The 2016 Go match between legendary South Korean player Lee Sedol and Google DeepMind’s AlphaGo program, which attracted over 200 million global viewers, was clearly not the first time people thought about the application of AI. The fundamental principles of AI were actually proposed in the 1940s, while its practicality was demonstrated from the 1970s to the early 2000s by renowned computer scientist Geoffrey Hinton and his research group, who developed the Restricted Boltzmann Machine (RBM)3 and the backpropagation4 algorithm. Although the theory of AI has long been established, its application has only recently begun to develop due to limitations in the computational capabilities needed to implement AI and in the hardware resources required to store data such as the weights5 and results of neural networks.

1Edge device: Devices that are distinguished from existing smart devices in that their data processing takes place at the edge of the network rather than in the center of the network—or the cloud.
2Neural Processing Unit (NPU): A processor optimized to drive machine learning. Unlike GPUs that require the software to create and train artificial neural networks, an NPU is characterized by its implementation of artificial neural networks on a unit of a hardware chip.
3Restricted Boltzmann Machine (RBM): A generative, stochastic artificial neural network that can learn a probability distribution over its set of inputs.
4Backpropagation algorithm: A statistical technique used in machine learning for multilayer neural networks that calculates the error—the difference between the predicted value and the actual value—to accurately adjust the weight.
5Weight: A value that determines the importance given to each input when computing a result, such as a weighted average or the output of a neural network layer.

AI Hardware That Provides Optimal Digital Computation: CPUs, GPUs, and ASICs

Focusing on computational capabilities, the neural networks used to implement AI require large-scale matrix-vector multiplication (MVM)6. The purpose of AI hardware therefore lies in how efficiently and quickly it can compute these large-scale MVMs. As a computer’s main brain, a central processing unit (CPU) can run a wide range of programs, but it cannot support fast and effective MVM due to its limited parallel computation capability. To put it into perspective, if a CPU capable of processing 10 tasks at once is asked to perform 100 simple MVM computations, it must repeat its computations 10 times.
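The two ideas above can be sketched in a few lines of Python. This is an illustrative toy, not production code: the first part shows MVM as row-by-row dot products, and the second reproduces the article’s example of a 10-way-parallel CPU needing 10 rounds for 100 computations.

```python
import math

# A minimal sketch of matrix-vector multiplication (MVM), the core
# operation of a neural-network layer: each output element is the dot
# product of one row of the weight matrix with the input vector.
def mvm(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

W = [[1, 2], [3, 4], [5, 6]]   # 3x2 weight matrix
x = [10, 1]                    # input vector
print(mvm(W, x))               # [12, 34, 56]

# The CPU example from the text: 100 independent computations on
# hardware that can process 10 at a time take ceil(100 / 10) rounds.
rounds = math.ceil(100 / 10)
print(rounds)                  # 10
```

Every output element here is independent of the others, which is exactly why MVM rewards hardware with many parallel computing units.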

Given the increasing demand for parallel computing power, the graphics processing unit (GPU) was considered a turning point in AI development. GPUs implement a large number of parallel computing7 units to process the graphic data required for multimedia tasks such as gaming and video. Through general-purpose computing on GPU (GPGPU), in which the GPU performs computation for tasks typically handled by the CPU, it became possible to use AI in practice. However, although a GPU can perform computations quickly thanks to its immense parallelism, it cannot operate efficiently. As previously mentioned, GPUs have many parallel computing units for graphical data processing, but they lack dedicated MVM computation units (remember that GPGPU stands for general-purpose computing on GPU). Moreover, as GPUs are intended for large-scale parallel processing, they consume a vast amount of power and therefore cannot support AI applications that require ultra-low power, such as edge AI8.

Thus, experts in academia and industry have been developing computing accelerators based on ASIC9 design to provide fast and energy-efficient computation. The earliest ASIC chips for AI were primarily digital accelerators. A digital accelerator comprises many computing units specialized for large-scale MVM and provides scalability depending on the AI network structure.

6Matrix-Vector Multiplication (MVM): An operation between a matrix and a vector that produces a new vector.
7Parallel computing: A type of computation in which vast quantities of information are processed simultaneously by breaking up the tasks and processing them all at once. It is the opposite of the serial processing method where a task needs to be completed before moving onto the next task.
8Edge AI: A method of AI computation that happens at the edge of a given network, usually on the device where the data is created instead of in a centralized cloud computing facility or offsite data center.
9Application Specific Integrated Circuit (ASIC): In contrast to a general integrated circuit, it is a non-memory semiconductor chip designed for a specific product.

So, can we say that digital accelerators are the ultimate solution to securing hardware resources for AI computation? Even if the efficiency and speed of the computing units have increased with ASIC-based digital accelerators, that does not mean the efficiency of the entire system improves accordingly. To judge this, it is necessary to understand how much energy the entire operation takes.

In the Von Neumann architecture10, which is the most common computer structure, the computational devices read and process data from the memory and send the results back to the memory. Thus, the efficiency of MVM operation in the Von Neumann architecture is affected by two factors: the energy required to deliver the inputs and weights to the computing unit, and the energy dissipated in multiplication. As inputs are fed directly to the computing unit, their energy consumption is negligible. In the case of weights, however, transferring the data from the external DRAM to the computing unit takes about 500 times more energy than the computation itself. During a DARPA workshop in 2020, Professor Philip Wong of Stanford University stated that energy consumption from memory limits overall computational efficiency. Although there have been various efforts to increase this efficiency, such as reducing the computational energy, most of the energy was in fact being used to read and write the weights in memory. Consequently, to improve the computational efficiency of the entire system, the number of read and write operations should be reduced.
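The impact of that roughly 500-fold gap can be shown with a back-of-envelope calculation. The per-operation energy unit below is arbitrary and chosen for illustration; only the 500x ratio comes from the article.

```python
# If fetching one weight from external DRAM costs ~500x the energy of
# the multiply that uses it, data movement dominates the energy budget.
COMPUTE_ENERGY = 1.0        # energy per multiply (arbitrary unit)
DRAM_FETCH_ENERGY = 500.0   # ~500x the compute energy (from the text)

n_ops = 1_000_000           # assume one weight fetched per multiply
total = n_ops * (COMPUTE_ENERGY + DRAM_FETCH_ENERGY)
memory_share = (n_ops * DRAM_FETCH_ENERGY) / total
print(f"{memory_share:.1%}")  # 99.8% of the energy goes to memory access
```

Under this assumption, even halving the computational energy barely changes the total, which is why reducing memory reads and writes is the lever that matters.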

10Von Neumann architecture: A program-embedded computer structure typically featuring three levels consisting of the main memory unit, a central processing unit, and an input/output unit. Most computers today follow this basic structure, but its bottleneck limits the ability to design high-speed computers.

▲Figure 1. Energy consumption from computation and memory access (left) and the change from the Von Neumann architecture to the PIM computational structure (center and right)

 

IMC and NMC: Two Structures of PIM-based Accelerators for Computational Efficiency

The computing accelerator based on Processing In-Memory (PIM)11 was developed from these observations, breaking away from the Von Neumann architecture. As PIM is a memory with computing capability, the input is fed directly to the memory where the weights reside. This structure allows the memory to compute on its own and output the result. PIM can generally be categorized into In-Memory Computing (IMC)12 and Near-Memory Computing (NMC)13. The difference lies in whether the computation takes place within the memory circuit itself or within the memory module. IMC modifies the memory cell itself so that it can perform computations as an ASIC would, while NMC integrates high-density memory such as HBM with an ASIC specialized for MVM within a memory module, a semiconductor substrate including memory chips. Note that when academia and the industry use the term “PIM,” they generally mean IMC and NMC, respectively.

As the NMC structure still needs to read weights from the DRAM, one might assume it would suffer a disadvantage in the computational efficiency of the entire system. In the Von Neumann structure, however, the connection between the CPU and the memory runs across multiple PCBs14, including the mainboard, memory module, and connectors. In contrast, NMC connects the memory with the computational ASIC within a single package through System-in-Package (SiP)15 or 3D IC technology, greatly reducing the energy and delay caused by reading and writing memory. IMC goes a step further than NMC, dramatically reducing energy consumption and delay by performing operations within the memory itself.

▲Figure 2. Examples of In-Memory Computing (left) and Near-Memory Computing (right)

 

11Processing In-Memory (PIM): A next-generation technology that provides a solution for data congestion issues for AI and big data by adding computational functions to semiconductor memory.
12In-Memory Computing (IMC): A technique of running computer calculations entirely in computer memory. It enables extremely fast processing that helps businesses increase performance and analyze massive volumes of data in real time at very high speeds.
13Near-Memory Computing (NMC): A technique that incorporates memory and logic in an advanced IC package, placing the processing tasks next to the memory rather than inside it.
14Printed Circuit Board (PCB): A board made up of electronic circuits with components soldered on its surface. These boards are found in most electronic devices.
15System-in-Package (SiP): A type of package that combines multiple devices into a single package to implement a system.

So, why does NMC still exist if IMC seems to be more efficient? The reasons include variability in computation size, computation and memory density, and bandwidth. NMC can easily compose a PIM structure by utilizing existing semiconductor memories, a field in which South Korea excels, while placing computational ASICs, which can support various MVM sizes, adjacent to them. IMC, on the other hand, requires modifications to existing memory circuits to facilitate computation, trading density for high computational efficiency. This also leads to a loss of storage capacity for weights and decreased throughput.

The Importance of Memory Capacity and the Development of Accelerators

In addition to computational efficiency, memory capacity for weights is also one of the important performance metrics in PIM. ChatGPT, the AI chatbot developed by OpenAI based on the GPT-3.5 language model, is made up of more than 175 billion weights. As each weight uses 16-bit floating point (FP16), it requires about 350 GB of storage for the weights alone. Since it is not possible to keep 350 GB of weights loaded for computation at once, the NMC’s computational ASIC or the IMC circuit must be able to hold as many weights as possible to reduce the number of weight updates. This results in a higher percentage of total operations being computational and less energy being spent on data transfers. Taking this into consideration, an NMC-based PIM system utilizing highly integrated HBM seems to be the more viable approach.
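The 350 GB figure follows directly from the weight count and precision; a quick check using the article’s numbers confirms it.

```python
# 175 billion weights at FP16 (16 bits = 2 bytes each) require about
# 350 GB just to hold the weights, before activations or any overhead.
n_weights = 175_000_000_000
bytes_per_weight = 2                      # FP16 = 16 bits = 2 bytes
total_bytes = n_weights * bytes_per_weight
print(total_bytes / 1e9, "GB")            # 350.0 GB
```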

Besides large-scale AI systems, how does this relate to edge AI? There are many applications for edge AI where all the weights can be put on a single chip. As edge AI is often battery-powered and requires ultra-low power operation, the energy consumption from the data transfer between memory and computing units is prohibitive. Therefore, it is necessary for the edge devices to implement edge AI with all weights preloaded on computationally energy-efficient circuits such as IMC. For this reason, in addition to the computational efficiency of IMC-based PIM systems, the amount of pre-loaded weight capacity plays an important role in advancing edge AI.

While the industry focuses its research and development on NMC-based PIM systems, academia is also striving toward advanced AI. Research is being conducted on the design of volatile memory-based PIM accelerators with SRAM16, eDRAM17, and DRAM, as well as next-generation non-volatile memory-based PIM accelerators with RRAM18, PCRAM19, and MRAM20. Among volatile memories, SRAM has been actively researched due to the accessibility of the CMOS process. While methods such as current-based operation and resistance ratios are also used, charge-sharing and capacitive-coupling methods, which exploit the low process deviation of capacitors21, are the main research streams for SRAM-based PIM accelerators.

16Static Random-Access Memory (SRAM): A type of random-access memory (RAM) that retains data bits in its memory as long as power is being supplied. The term “static” differentiates SRAM from DRAM: SRAM holds its data without refreshing while connected to a power source, whereas data in DRAM quickly decays and must be periodically refreshed.
17Embedded DRAM (eDRAM): A dynamic random-access memory (DRAM) integrated on the same die or multi-chip module (MCM) of an ASIC or microprocessor.
18Resistive Random-Access Memory (RRAM): A type of non-volatile random-access memory (RAM) that works by changing the resistance across a dielectric solid-state material.
19Phase-Change RAM (PCRAM): A type of semiconductor memory which utilizes the phase changes of certain materials to store data. PCM possesses qualities of both flash memories and DRAMs. Like flash memories, PCM is non-volatile, meaning that it does not lose information even when the power is cut. Like DRAMs, PCM processes data quickly and is power efficient.
20Magnetoresistive Random-Access Memory (MRAM): A type of non-volatile semiconductor memory which utilizes magnetoresistance to store data. Like flash memories, MRAM does not lose information even when the power is cut, and like DRAMs it processes data quickly and is power efficient.
21Capacitor: A device that stores electric charge and serves as the data storage element in a semiconductor memory. It can be thought of as a data storage room.

▲Figure 3. Types of operations in SRAM-based PIM accelerators

 

Non-volatile memories have a higher density and computational efficiency than volatile memories, and they are more suitable for ultra-low-power edge devices as they do not require a continuous power source to maintain weights. Among non-volatile memories, MRAM has a low on/off ratio—the ratio of the resistance values used to express 1 and 0—which is disadvantageous for multi-bit encoding. As a result, PIM accelerators utilizing RRAM or PCRAM, which have high on/off ratios, have instead been widely studied. However, because the technology is still immature, with large device-to-device variation in resistance values, non-volatile memories require additional peripheral circuits such as write-verify. For this reason, non-volatile PIM accelerators still lag behind volatile memory-based PIM accelerators such as SRAM in terms of density per total area and computational efficiency. Yet, as there is much room for advancement in device technology, many R&D projects focusing on this subject are underway in South Korea.

Making AI More Practical

In the past, AI development was hampered by the gap between established theory and the hardware needed to bring it to life. Over time, the development of processors and digital accelerators such as CPUs, GPUs, and Google’s Tensor Processing Units (TPUs) has led to dramatic improvements in the amount of computation that is possible. Furthermore, following the arrival of PIM accelerators, memory has expanded beyond its basic role of storing data to performing computations, essentially playing the part of the computer’s brain. Despite these advancements, research on PIM accelerators still has a long way to go compared to the markets for digital accelerators like TPUs and NPUs. There also needs to be a better understanding of circuit characteristics such as computational resolution, storage capacity, latency, and power consumption. To overcome these challenges, researchers should continue studying how to improve the performance of hardware that supports AI algorithms, as well as how to optimize AI neural network algorithms to better suit the characteristics of PIM accelerators. This implies that the combination of circuits and algorithms will become an important pillar of future PIM accelerator research.

 

<Other articles from this series>

[DGIST Series] How Broadband Interface Circuits Are Evolving for Optimal Data Transfer

[DGIST Series] The Role of Semiconductor Technologies in Future Robotics

[DGIST Series] The Technologies Handling the Growing Data Demands in Healthcare

[DGIST Series] Silicon Photonics: Revolutionizing Data Transfers by Unleashing the Power of Light

[DGIST Series] AI-Powered Micro/Nanorobots to Revolutionize Medical Field

[DGIST Series] Sensor Interfaces and ADC Circuits: Bridging the Physical and Digital Worlds