Describe in words "a woman walking down the road with a red umbrella," and the system produces a beautiful street photo;
Upload a picture of an airplane taking off to find it a suitable soundtrack, and an audio clip of roaring engines plays back;
Type in "pattering rain," and a scene of dense rain falling on the old houses south of the Yangtze River appears before your eyes.
These are multimodal AI applications that have already been realized, spanning text, image and voice. Even in their early forms, they show a charm that is more intelligent, more natural and more diverse than single-modal AI. Their prospects have attracted wide attention, yet for a long time multimodal AI did not develop quickly.
Now, things are changing.
During Huawei Connect 2021, the Institute of Scientific and Technical Information of China, AITISA (the New Generation AI Industry Technology Innovation Strategic Alliance) and Pengcheng Laboratory jointly released the "White Paper 2.0 on the Development of AI Computing Centers – From AI Computing Centers to AI Computing Networks," which explicitly states that "big computing power + big data" enables big models (the diversified capabilities of multimodality are generally better realized by big models; in other words, multimodality is typified by big models). At the same meeting, the Institute of Automation of the Chinese Academy of Sciences released Zidong Taichu, the world's first three-modal large model, which undoubtedly ushered multimodal development into a new stage of real-world deployment.
The multimodal large model and the AI computing network are reinforcing each other, becoming each other's best companion.
Driven by multiple factors, multimodal large models have become the general trend
As AI technology and its industrial development continue to mature, the trend toward multimodal large models is very clear, mainly reflected in three aspects:
First, the evolution of AI's own capabilities demands it.
In single-modal fields – for example cross-language translation, which belongs to NLP – machines can be said to have already surpassed human beings and delivered important technical and industrial value. To go further, multimodality naturally becomes the new direction for AI breakthroughs in technology and industry. At the same time, single modality itself faces the bottleneck of the "knowledge iceberg," and further intelligence requires the support of large models. For example, given the colloquial Chinese sentence "Lao Wang goes to eat the canteen," it is hard for a text-only AI to tell "eating at the canteen" apart from literally "eating the canteen," whereas a scene image or video makes the meaning easy to explain and associate.
Second, there is the requirement on the supply side of data.
Data is the foundation of AI development – the "food" of AI. In global markets, including China, the rise of the Internet has made AI training datasets larger and larger, giving AI a rapid supply of energy.
At present, however, Internet audio and video data is growing at high speed and already accounts for more than 80% of the total, while any single data type, such as text, accounts for only a small share. As a result, the abundant voice, image and video data is not being fully utilized and learned from. The value of this data can be mined more deeply and widely in a multimodal way; in turn, feeding AI large volumes of data with diverse attributes will push it beyond single modality toward multimodal models.
Finally, industrial demand is pulling in the same direction.
As AI gradually lands in practice, industrial demand is also going deeper. More scenario applications need the support of multimodal large models, such as cross-modal retrieval, intelligent question answering, literary and artistic creation, video dubbing and video summarization.
It can be said that the more deeply image, text and voice are integrated at the technical level, the more obvious an application's value becomes in its scenario, and the more AI applications can truly say goodbye to the often-criticized "chicken ribs" feeling (a Chinese idiom for something of too little value to enjoy but too costly to discard).
The trinity of computing power, framework and technical accumulation is accelerating the landing of multimodal large models
There are three main reasons why a multimodal large model like Zidong Taichu could be realized:
1. The AI computing network has become an important driving factor for multimodal large models
An important feature of multimodal large models is that the scale of training parameters grows exponentially.
In the past, "feeding" a model single-modal, single-type data was enough for it to acquire knowledge and iterate. Relatively speaking, the model itself did not need many parameters – much like a primary school student practicing addition, subtraction, multiplication and division only needs to understand basic arithmetic rules.
When different modalities are added, a general algorithm that can recognize images, text and voice must understand not only the various data of each single modality but also the extremely complex relationships between them. The model's parameters therefore expand – just as a university science and engineering student must integrate multiple disciplines to carry out complex calculations.
At this point, computing power obviously becomes the most basic support. Only computing power at a very large scale can sustain the training of large models and give multimodal applications better results.
Therefore, on top of the powerful cluster computing power provided by local AI computing centers, the emergence of the AI computing network has further solved the computing-power demands of multimodal large models and become an important driving factor.
In fact, large-model computation often exhibits peaks and valleys: demand for computing power is huge during training runs but the hardware sits idle between them. The AI computing network can sense, distribute and dispatch AI computing power nationwide, dynamically allocating it according to each center's available resources and each region's demand. Supply and demand therefore match not only in "quantity" but also in "rhythm."
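The peak-and-valley matching idea can be illustrated with a toy greedy allocator. All names and capacity figures below are hypothetical; a real computing network would also weigh network distance, data locality and pricing, among many other factors:

```python
# Toy illustration of dispatching jobs across AI computing centers:
# send each job to the center with the most idle capacity ("valley").
# Center names and PFLOPS numbers are made up for the example.

def schedule(jobs, centers):
    """Greedily assign each job (largest first) to the center with the
    most free capacity. Returns {job_name: center_name}."""
    assignments = {}
    free = dict(centers)  # center name -> idle compute (PFLOPS)
    for job, demand in sorted(jobs.items(), key=lambda kv: -kv[1]):
        best = max(free, key=free.get)  # deepest "valley" right now
        if free[best] < demand:
            raise RuntimeError(f"no center can host {job}")
        free[best] -= demand
        assignments[job] = best
    return assignments

centers = {"Pengcheng": 100, "Wuhan": 60, "Xian": 40}
jobs = {"large-model-pretrain": 80, "cv-finetune": 30, "nlp-eval": 10}
print(schedule(jobs, centers))
# → {'large-model-pretrain': 'Pengcheng', 'cv-finetune': 'Wuhan', 'nlp-eval': 'Xian'}
```

Sorting jobs largest-first ensures the biggest training run claims the deepest valley before smaller jobs fragment the remaining capacity.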
“Future technology” AI computing network comes out: the best “companion” of multimodality?
On the other hand, the technical development of multimodal large models and their industrial applications will in turn promote the development of the AI computing network, which itself drives industrial clusters across regions – a case of "making the best use of everything" amid continuous technological progress. The two clearly reinforce each other.
2. The features of the MindSpore framework accelerate development
Because the model parameters are so large, computing power alone is not enough. The AI framework on which multimodal large-model development relies must also be able to carry and utilize that computing power and support huge parameter counts. In this regard, some mainstream development frameworks at home and abroad supported only simple data parallelism in the past, which cannot meet the needs of large models.
The multimodal large model Zidong Taichu, released at Huawei Connect 2021, was trained on the MindSpore framework. MindSpore is the industry's first framework to support fully automatic parallelism, and Pengcheng Pangu, the world's first large Chinese pretrained language model, was also created with it.
The main technical advantage of pairing the MindSpore framework with multimodal large models is that, during training, the model can be automatically partitioned across different devices, efficiently using a large cluster of computing devices to complete parallel training. This is equivalent to establishing an effective central command system that distributes computing tasks simultaneously, so that even the largest training tasks are accelerated in an orderly manner rather than blocked.
This is achieved through MindSpore's distinctive multi-dimensional automatic parallelism: data parallelism, model parallelism, pipeline parallelism, heterogeneous parallelism, recomputation, efficient memory reuse and topology-aware scheduling combine to reduce communication time and minimize overall iteration time. In short, a series of technological innovations make parallelism more scalable and efficient, without the semi-automatic or even manual parallel implementation that other AI frameworks require for large models.
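To make the idea of model parallelism concrete, here is a minimal sketch in plain Python: the output dimension of one linear layer is sharded column-wise across two simulated "devices," each holding only its slice of the weights. This is an illustration of the general technique only, not MindSpore's actual auto-parallel implementation:

```python
# Minimal model-parallelism sketch: shard a linear layer's weight matrix
# column-wise so each "device" computes only its slice of the output.

def linear(x, weights):
    """y[j] = sum_i x[i] * weights[i][j] (no bias, for brevity)."""
    return [sum(xi * row[j] for xi, row in zip(x, weights))
            for j in range(len(weights[0]))]

def shard_columns(weights, parts):
    """Split the weight matrix column-wise, one shard per device."""
    step = len(weights[0]) // parts
    return [[row[p * step:(p + 1) * step] for row in weights]
            for p in range(parts)]

x = [1.0, 2.0, 3.0]
W = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12]]

full = linear(x, W)  # single-device reference result

# "Device 0" and "device 1" compute their output slices independently;
# an all-gather (here: list concatenation) reassembles the activation.
shard0, shard1 = shard_columns(W, 2)
parallel = linear(x, shard0) + linear(x, shard1)

assert parallel == full  # sharded result matches the unsharded one
print(parallel)
# → [38.0, 44.0, 50.0, 56.0]
```

In a real framework the shards live on separate accelerators and the concatenation is a collective communication operation; automatic parallelism means the framework, not the developer, decides how to cut the model this way.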
In its latest version, 1.5, the MindSpore framework also adds a variety of parallel tuning capabilities to support efficient training of models with hundreds of billions to trillions of parameters on large clusters.
3. An existing base of experience with multimodal models
There is no doubt that multimodal capability must be built on single-modal capability. The developer of Zidong Taichu, the Institute of Automation of the Chinese Academy of Sciences, is an important ecosystem partner of Ascend AI. Before releasing Zidong Taichu, the institute had already developed industry-leading models for image, voice and text.
On this basis, the Institute of Automation of the Chinese Academy of Sciences and Ascend AI had also built up "preliminary preparation" capabilities – including globally leading performance in image-text cross-modal understanding and generation, and in video understanding and description – which became important supports for Zidong Taichu.
Finally, we can see that Zidong Taichu, the world's first three-modal large model, emerged as the times demanded, taking multimodality from the common two modalities into the three-modality era. It can not only achieve cross-modal understanding (tasks such as image recognition and voice recognition) but also complete cross-modal generation (tasks such as generating images from text, text from images, and images and video from voice).
Two modalities and three modalities may seem to differ only in quantity, but technically the leap is comparable to moving from a two-dimensional world into a three-dimensional one, requiring substantial technical accumulation and innovation. Once three modalities are realized, AI interaction becomes far more natural than with two, bringing AI a step closer to strong AI.
Multimodal large models are also accelerating the empowerment of industry. Under an open-source, open-access premise, Zidong Taichu, supported by Ascend AI, is entering application scenarios such as intelligent driving, industrial quality inspection, film and television creation, and intelligent healthcare, with cooperative customers including well-known enterprises such as SAIC Group and Weiqiao Pioneering Group. A picture of multimodal large models empowering thousands of industries is unfolding.
From the development of multimodal large models, we can see that in the future, with breakthrough projects in basic software and hardware such as the AI computing network and the MindSpore framework, China's AI can achieve comprehensive leadership from basic technology to industrial application, and build real competitive barriers through innovation in technology and business models.