The bottleneck of computing architecture and the direction of breakthrough
With the sharp increase in demand for AI computing, the existing computing architecture faces challenges such as the power wall, the performance wall, the memory wall and the slowing of Moore’s Law, so innovation in computing architecture is urgent. The solution path has two main directions: innovating the computing architecture itself and breaking the storage wall.
Innovation in computing architecture has always been a focus of debate. Emerging architectures such as GPU, FPGA, ASIC, brain-inspired chips and even 3D SoC each try to break through bottlenecks in five hardware features: adaptability, performance, efficiency, programmability and scalability. No single architecture is the best in all five.
Rather than asking which architecture is best in the abstract, ask which one suits the AI business scenario, the data types and the budget: a good architecture is one that makes medical AI solutions run fast.
The rising capital cost, time cost and complexity of updating the computing architecture have prompted academia and industry to turn to research on “how to break the storage wall”. There are many solutions, including:
1. High-bandwidth data communication
High-speed SerDes: point-to-point serial communication that raises transmission speed;
Optical interconnection: replaces electrical interconnection; optical signals do not induce currents in or interfere with one another, which allows high speed and high density;
2.5D/3D stacking technology: stack dies like building blocks to increase the number of transistors per unit chip area without changing the existing process, and place more memory devices around the processor.
2. Data, calculation and access
Increase the number of cache levels: caches are inserted between the processor and main memory. Generally, a larger cache improves effective access speed, but at a higher cost.
High-density on-chip memory: eDRAM (embedded dynamic random access memory) and PCM (phase change memory, which switches between crystalline and amorphous states).
3. In-memory operation
Near-data computing: perform computation and processing closer to where the data resides.
Integration of storage and computing: off-chip approaches, namely high-bandwidth memory (HBM) and high-bandwidth storage (3D-Xtacking, in which memory cells and peripheral circuits are processed independently on different wafers), and on-chip approaches (embedding the algorithm in the memory chip).
The von Neumann architecture is the classic computer architecture and is still the mainstream architecture of computers and processor chips. In it, the computing/processing unit and the memory are two completely separate units: the computing/processing unit reads data from memory according to instructions, performs the computation, and writes the result back to memory.
The main improvement of in-memory operation is to embed computation in the memory, so that the memory becomes a tool for both storage and calculation. It can complete an operation while storing or reading data, which reduces the cost of data access during calculation. The calculation is converted into a weighted sum, the weights are stored in the memory units, and the memory units themselves gain computing ability.
Another direction of AI operation
The other direction is low-power, always-on IoT devices, such as smart homes, wearable devices, mobile terminals and perceptual computing, together with the low-power edge computing devices required by smart cities.
Current computer systems adopt the von Neumann structure. Because the CPU processes data outside the DRAM chip, frequently used data is kept in the caches (L1, L2 and L3), which are fast and low-power and let the CPU reach maximum performance. However, in applications that process a large amount of data, the vast majority of data is read from memory, because the data to be processed is much larger than the cache capacity.
In this case, the bandwidth of the data channel between the CPU and memory becomes the bottleneck that limits performance, and moving data between the CPU and memory also consumes a great deal of energy. Breaking through this bottleneck requires more channel bandwidth between the CPU and memory, but once the number of CPU pins has reached its limit, further bandwidth gains face insurmountable technical difficulties. In modern computer architecture, data storage and data computation are separated, so this “data wall” problem is inevitable. Suppose a multiplication in the processor consumes about 1 unit of energy; fetching the data from DRAM to the processor then takes about 650 times the energy of the computation itself. In other words, reducing data movement brings a huge improvement in performance and power consumption.
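The 1-unit versus 650-unit comparison above can be turned into a rough energy estimate. The sketch below is only illustrative: the 650x ratio is the figure quoted in the text, while the function name, the workload size and the assumption that every multiplication fetches exactly one operand from DRAM are hypothetical.

```python
# Illustrative energy model. The 650x ratio is the figure quoted in the text;
# the function name and workload sizes are hypothetical examples.
MULTIPLY_ENERGY = 1.0       # one multiplication, in arbitrary energy units
DRAM_FETCH_ENERGY = 650.0   # fetching one operand from DRAM, same units


def workload_energy(num_multiplies: int, dram_fetches: int) -> float:
    """Total energy of a workload, in units of one multiplication."""
    return num_multiplies * MULTIPLY_ENERGY + dram_fetches * DRAM_FETCH_ENERGY


# 1,000 multiplications, each pulling one operand from DRAM.
baseline = workload_energy(1000, 1000)
# Same arithmetic, but 90% of the DRAM fetches are avoided.
reduced = workload_energy(1000, 100)

print(baseline)                      # 651000.0
print(reduced)                       # 66000.0
print(round(baseline / reduced, 2))  # 9.86
```

Under these assumptions the arithmetic itself accounts for well under 1% of the baseline energy, which is why cutting data movement matters far more than speeding up the multiplier.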
A deep neural network (DNN) is a kind of machine learning (ML) model. Well-known examples are the convolutional neural network (CNN) for computer vision (CV) and the recurrent neural network (RNN) for natural language processing (NLP), and recently popular recommendation models (RM) and other new applications also tend to use DNNs. The main operation of an RNN is matrix-vector multiplication. Because this operation has low data reuse, it needs more memory accesses, more data moves through the memory channel, and the performance bottleneck becomes more obvious.
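The low data reuse of matrix-vector multiplication can be made concrete by counting how many multiply-accumulate operations are performed for each weight read from memory. This is a simplified sketch: the function name is hypothetical, and it assumes each weight is fetched from memory exactly once and then reused for every input vector in a batch.

```python
def macs_per_weight(rows: int, cols: int, batch: int) -> float:
    """Multiply-accumulates performed per weight read from memory.

    Assumes (for illustration) that each weight is fetched exactly once
    and reused for every input vector in the batch.
    """
    weights_read = rows * cols      # every weight is fetched once
    macs = rows * cols * batch      # one MAC per weight per input vector
    return macs / weights_read


# Matrix-vector multiplication: a single input vector, so no reuse.
print(macs_per_weight(1024, 1024, 1))    # 1.0
# Matrix-matrix multiplication with 64 input vectors: 64x reuse.
print(macs_per_weight(1024, 1024, 64))   # 64.0
```

With only one operation per fetched weight, the matrix-vector case is limited by memory bandwidth rather than by arithmetic throughput.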
To address this, many have proposed using PIM (Processing In Memory) technology to rebuild DRAM memory. By definition, PIM performs operations and calculations inside the memory. The expected effect is to minimize data movement and improve performance by operating on data in memory instead of moving it to the CPU. From the late 1990s to the early 2000s, academia actively studied this concept, but the technical difficulty of combining the DRAM process with logic, and the high cost of implementing a CPU in memory using the DRAM process, greatly weakened PIM’s competitiveness, and it was not commercialized. Today’s performance demands have put its commercialization back on the agenda.
To understand PIM, we first need to know what AI computes. The figure below shows a fully connected (FC) layer in a neural network. A single output neuron Y1 is linked to the input nodes X1, X2, X3 and X4, and the weights on the corresponding synapses are w11, w12, w13 and w14. To process this layer, AI multiplies each input node by its weight, sums the products, and then applies an activation function such as ReLU. In the more complex case with several inputs (X1… Xn) and several outputs (Y1… Yn), each input is multiplied by the weight for the corresponding output and the products are summed separately for each output, which is the matrix multiply-add operation of mathematics.
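The fully connected layer described above can be sketched in a few lines. The weight and input values below are hypothetical, chosen only to make the arithmetic easy to follow; the first weight row plays the role of w11 to w14.

```python
def relu(value: float) -> float:
    """ReLU activation: negative sums are clipped to zero."""
    return max(0.0, value)


def fully_connected(inputs, weights):
    """Compute one output per weight row: relu(sum of weight * input)."""
    return [
        relu(sum(w * x for w, x in zip(row, inputs)))
        for row in weights
    ]


# Hypothetical values for X1..X4 and for the weights of two output neurons.
x = [1.0, 2.0, 3.0, 4.0]
w = [
    [0.5, -1.0, 0.25, 0.5],   # w11, w12, w13, w14 for Y1
    [-1.0, 0.0, 0.0, 0.0],    # weights for a second output, Y2
]

y = fully_connected(x, w)
print(y)   # [1.25, 0.0]
```

Y1 is 0.5 - 2.0 + 0.75 + 2.0 = 1.25, while the weighted sum for Y2 is -1.0 and is clipped to 0.0 by ReLU.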
Similarly, in Figure 5, if all the circuits for these calculations are designed into the storage unit, data does not need to be carried and transmitted at all: the calculations are completed in the storage unit and only the results are reported to the CPU. This not only reduces power consumption significantly but also makes it possible to handle more complex operations. SK Hynix is currently developing PIM DRAM based on this technology. For memory-bound applications such as RNNs, executing the application with computing circuits inside DRAM is expected to improve performance and power consumption significantly. As the amount of data the CPU must process keeps growing, PIM is expected to become the most powerful solution for improving computer performance.
Advantages and disadvantages of memory operation
(1) Off-chip storage (integrated storage and calculation based on digital chip and memory)
① High-bandwidth memory (HBM):
For GPUs, 3D DRAM is connected to the GPU by metal wires to raise communication speed (900 GB/s), but power consumption and cost are high.
For other chips, replacing HBM (3D DRAM) with SRAM reduces energy consumption and improves read/write speed, but is costly. In this case, a large amount of SRAM can be matched with a large number of MPUs, CPUs and other processors to improve operating efficiency.
② Expanding memory with new memory types:
Use a new memory layout to expand the memory around the processor, for example with magnetic memory (MRAM), which reduces cost, raises storage density and keeps data after power loss. The process needs only 3-4 additional mask layers and effectively raises performance to about 10 TOPS/W (10 trillion operations per second per watt).
(2) On-chip storage
On-chip storage embeds the multiply-accumulate (MAC) operations on the algorithm’s weights in the memory chip, so that the memory unit itself has a computing function and strong parallel computing capability. In addition, neural networks tolerate errors in computing accuracy well (the number of memory bits can be adjusted to the application). Therefore, even if mixing digital and analog signals in in-memory computing introduces errors, the resulting performance and energy efficiency remain appropriate for the application. This is why in-memory computing is so widely combined with artificial intelligence, especially deep learning.
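The tolerance for reduced precision can be illustrated by running one multiply-accumulate at 8-bit precision and comparing it with the floating-point result. This is a minimal sketch, assuming a simple symmetric quantization scheme (scale by the largest absolute value, round to integers in [-127, 127]); the scheme, the function names and the values are illustrative assumptions, not a description of any particular chip.

```python
def quantize(values, levels=127):
    """Symmetric quantization to integers in [-levels, levels].

    Returns the integer codes and the scale that maps them back to floats.
    This scheme is an illustrative assumption, not a specific chip's method.
    """
    scale = max(abs(v) for v in values) / levels
    return [round(v / scale) for v in values], scale


def dot(a, b):
    """Plain multiply-accumulate over two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))


weights = [0.8, -0.3, 0.5, 0.1]   # hypothetical weights
inputs = [1.0, 2.5, 3.0, 4.0]     # hypothetical activations

exact = dot(weights, inputs)

q_weights, w_scale = quantize(weights)
q_inputs, x_scale = quantize(inputs)
# Integer MAC, then a single rescale back to the real-valued range.
approx = dot(q_weights, q_inputs) * w_scale * x_scale

print(exact, approx, abs(exact - approx))
```

For these values the 8-bit result differs from the floating-point one by well under 1%, the kind of error a neural network can usually absorb.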
① Phase change storage PCM
Phase change memory stores data by varying the heating time to switch a chalcogenide between its crystalline and amorphous states, which differ hugely in conductivity. The phase change time is 100-1000 ns and the endurance is about 10^8 erase cycles. More and more new materials are now emerging.
② Resistive memory (RRAM) / memristor
A memristor is a nonlinear resistor with a memory function. Its resistance changes with the current flowing through it. When the current stops, even after power is removed, the resistance value is retained, and it does not return to its original state until a reverse current passes. The resistance value can therefore be changed by controlling the current, and data can be stored by defining the high resistance value as “1” and the low resistance value as “0”. Memristors are usually used to build high-density non-volatile resistive memory (RRAM).
A memristor network, like the neural network of a biological brain, can handle many tasks at the same time. Most importantly, it does not need to move data repeatedly. It can process a large number of signals in parallel, which makes it especially suitable for machine learning systems. The programming time is about 10-1000 ns, and the endurance is 10^6-10^12 programming cycles.
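One common way to see how a memristor network computes in parallel is to model it as an idealized crossbar: each cell's conductance stores a weight, input voltages are applied to the rows, and each column wire sums the resulting currents (Ohm's law per cell, Kirchhoff's current law per column). The crossbar arrangement, the function name and the values below are assumptions for illustration, and the model ignores wire resistance, device nonlinearity and noise.

```python
def crossbar_currents(conductances, voltages):
    """Column currents of an idealized memristor crossbar.

    conductances[r][c] is the conductance of the cell at row r, column c
    (microsiemens); voltages[r] is the voltage applied to row r (volts).
    Each column current (microamperes) is the sum of G * V over its cells.
    """
    num_rows = len(voltages)
    num_cols = len(conductances[0])
    return [
        sum(conductances[r][c] * voltages[r] for r in range(num_rows))
        for c in range(num_cols)
    ]


# Hypothetical 3x2 array: three input rows, two output columns.
g = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
]
v = [0.5, 0.25, 1.0]

print(crossbar_currents(g, v))   # [6.25, 8.0]
```

All column currents appear at once, so the whole weighted sum is produced in one step without moving the stored weights.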
③ Floating gate device
Floating gate devices use mature technology, with a programming time of 10-1000 ns and an endurance of about 10^5 programming cycles. They offer large storage arrays, high precision, high density, high efficiency and low cost, and are suitable for deep learning and artificial intelligence.
3. Chip optimization strategy
Inference applications of integrated storage and computing chips on terminal devices require lower cost and lower power consumption, and have low requirements for accuracy and versatility.
Training applications of integrated storage and computing chips in the cloud require generality, speed and accuracy. Given the currently low accuracy of such chips, they are therefore better suited to embedded front-end applications.
4. The challenge of integrated storage and computing chip
(1) The existing floating gate device storage is not suitable for calculation and needs to be optimized and improved.
(2) New memory types, which may be more suitable for storage and computing integration, are challenging floating gate devices.
(3) At present, integrated storage and computing operates at 8-bit precision; the precision needs to be raised where conditions allow, for example to 10-bit in NorFlash.
(4) Making integrated storage and computing chips compatible with development environments, architectures and existing technology will take time and market acceptance.
(5) Performance still needs to be matched to concrete application scenarios.
5. The future of integrated storage and computing
(1) Low-precision but accurate multiply-accumulate operations can improve terminal efficiency and reduce chip cost. At present, NorFlash can be used in 40 nm/55 nm processes. NorFlash limits the applications to some extent, but this limit can be broken through in the future by developing better-optimized devices and processes.
(2) Investors in integrated storage and computing chips include Softbank, Intel, Microsoft, Bosch, Amazon and even the United States government. China’s integrated storage and computing technology, as well as new memory technologies such as the Tsinghua memristor, will receive the next round of investment.
(3) The first generation of integrated storage and computing chips all target voice applications, and will enter security and other market segments in the future, but ...
(4) Companies in integrated storage and computing follow one of two business models: selling IP, or making AI chips with integrated storage and computing. A pure IP business will have a very hard time, so the future lies in making chips, although competition there is also intense.
(5) At present, the limit efficiency of integrated storage and computing is >300 TOPS/W (8-bit). Current industry products are still far from it, at 5-50 TOPS/W, so there is great room for improvement.
(6) Driven by Moore’s law, floating gate devices will move to more advanced processes, for example from 40 nm to 14 nm, with greatly improved performance. New memory types will move from 28 nm to 5 nm processes to improve process performance.
(7) Advances in memory technology and structural optimization will improve the performance of integrated storage and computing by 2X or even 10X.
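The efficiency figures in item (5) are easier to compare when converted to energy per operation. Since 1 TOPS/W is 10^12 operations per joule, a chip at T TOPS/W spends 1/T picojoules per operation. The helper names below are hypothetical; the 300, 50 and 5 TOPS/W values are the ones quoted in the list.

```python
def picojoules_per_op(tops_per_watt: float) -> float:
    """Energy per operation in picojoules for a given TOPS/W rating.

    1 TOPS/W = 1e12 operations per joule, so one operation costs
    1 / tops_per_watt picojoules.
    """
    return 1.0 / tops_per_watt


def headroom(limit_tops_per_watt: float, current_tops_per_watt: float) -> float:
    """How many times more efficient the limit is than the current value."""
    return limit_tops_per_watt / current_tops_per_watt


for rating in (5, 50, 300):
    print(rating, "TOPS/W ->", round(picojoules_per_op(rating), 4), "pJ/op")

print(headroom(300, 50))   # 6.0
print(headroom(300, 5))    # 60.0
```

By this measure, the quoted 5-50 TOPS/W range sits between 6 and 60 times below the quoted limit.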
Compiled from: The prospect of Processing In Memory (PIM) in memory systems for AI applications (EEtimes)
About Huiwei Intelligence
Huiwei Intelligent Medical Technology Co., Ltd. was established in June 2019, specializing in the R&D, production and sales of intelligent medical products. Our core members are from the world’s top scientific research institutions and the world’s top 500 enterprises. Huiwei Intelligent, driven by its own core technology in the fields of “artificial intelligence” and “edge computing”, is committed to providing medical products and services of “high standard and good experience” to medical institutions around the world, and helping doctors improve their diagnosis and treatment level and efficiency to the greatest extent.