Technology for the fusion of vision and natural language (Vision and Language)

Real-world deployment of a technology that achieves high-level understanding and judgment by linking image data with language-based knowledge

Technology that guides decision-making by integrating background knowledge not available through visual information alone

By integrating image recognition technology, which it has refined as one of its strengths, with natural language processing technology, Konica Minolta has achieved AI processing that draws on background knowledge and expertise in addition to conventional vision data.
For example, image recognition technology recognizes the positions of vehicles and persons from the data captured in an image. The degree of danger, however, depends on the positional relationship between those vehicles and persons. While humans can instantly understand the meaning of the situation in front of them based on past experience, computer vision can only recognize objective information (facts). The value of this technology is that it integrates prior knowledge such as “if the distance between the vehicle and the person is within 2 meters and the vehicle is moving, the degree of danger is very high,” thereby enabling the same judgment a human would make.
Main tasks include Q&A about the content of videos (VQA: Visual Question Answering) and automatic caption generation from videos.
Existing text data, such as manuals, know-how tips, and dictionaries, can be used as the data source for natural language processing. Konica Minolta is focusing development on applying this technology in fields where it has domain knowledge, such as safety assurance at manufacturing sites.
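The prior-knowledge rule quoted above (within 2 meters, vehicle moving) can be written down as a small check applied to detection results. The following is only a minimal sketch in Python, not Konica Minolta's implementation: the `Detection` type, the ground-plane coordinates in meters, the speed estimate, and the `moving_threshold_mps` value are all assumptions made for illustration, and the source defines only the "very high" case.

```python
import math
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical detector output: label plus ground-plane position in meters."""
    label: str
    x_m: float
    y_m: float
    speed_mps: float = 0.0  # assumed to come from tracking across frames


def danger_level(vehicle: Detection, person: Detection,
                 distance_threshold_m: float = 2.0,
                 moving_threshold_mps: float = 0.1) -> str:
    """Apply the rule: within 2 m AND vehicle moving -> danger is very high."""
    distance = math.hypot(vehicle.x_m - person.x_m, vehicle.y_m - person.y_m)
    is_moving = vehicle.speed_mps > moving_threshold_mps  # assumed threshold
    if distance <= distance_threshold_m and is_moving:
        return "very high"
    # The source text does not define any other danger levels.
    return "not flagged"


forklift = Detection("vehicle", 0.0, 0.0, speed_mps=1.5)
worker = Detection("person", 1.2, 0.9)
print(danger_level(forklift, worker))
```

In this sketch the image recognition stage supplies only facts (labels, positions, speed), while the judgment comes from a rule that could be derived from text sources such as a safety manual.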

Technology Overview

When integrating image recognition and natural language processing technologies, image features and language features, which have different dimensionalities, must be handled properly. By weighting and tuning them according to the use case, both kinds of features can be handled and analyzed in the same way. For the analysis, an advanced neural network architecture is used to extract image features from large and diverse datasets while simultaneously learning their relationship to language features. The language features are obtained with a large language model (LLM) that has learned linguistic structure and meaning from a vast amount of text data.
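As a rough illustration of handling features of different dimensionality in the same way, the sketch below projects an image feature vector and a language feature vector into a shared space and combines them with a use-case-dependent weight. It is a generic weighted-fusion example under stated assumptions, not the architecture described above: the projection matrices `w_img` and `w_txt` and the weight `alpha` are hypothetical and would be learned or tuned in a real system.

```python
import math


def project(features, weights):
    """Linear projection; weights is a list of rows with shape (out_dim, in_dim)."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def l2_normalize(vec):
    """Scale a vector to unit length so both modalities are comparable."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def fuse(image_feat, text_feat, w_img, w_txt, alpha=0.5):
    """Weighted sum of the two modalities after projection to a shared space."""
    img = l2_normalize(project(image_feat, w_img))
    txt = l2_normalize(project(text_feat, w_txt))
    return [alpha * i + (1.0 - alpha) * t for i, t in zip(img, txt)]


# Toy inputs: a 4-dimensional image feature and a 3-dimensional language feature.
image_feat = [3.0, 4.0, 0.0, 0.0]
text_feat = [1.0, 0.0, 0.0]
w_img = [[1, 0, 0, 0], [0, 1, 0, 0]]  # hypothetical 4 -> 2 projection
w_txt = [[0, 0, 1], [1, 0, 0]]        # hypothetical 3 -> 2 projection

fused = fuse(image_feat, text_feat, w_img, w_txt, alpha=0.5)
print(fused)
```

Raising `alpha` gives more influence to the image side and lowering it favors the language side, which is one simple way to read "weighting and tuning according to the use case."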
In image recognition, Konica Minolta is working on many projects related to human behavior and has strengths in the relevant domains. The company also has know-how in the tuning process for efficiently introducing knowledge of target domains. If you are interested in using this technology, feel free to contact us.

Example of task: Q&A function to give answers about the content of images based on knowledge (VQA)

Answers are output in response to questions about images and their content. Answers can also be given to questions that require knowledge other than vision data contained in images.

Example of task: automatic caption generation for videos

Text describing the image content is generated. The captions are produced by combining object detection processing with a natural language processing model.
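The hand-off from object detection to text generation can be sketched as follows. This is a deliberately simplified example: a fixed sentence template stands in for the natural language processing model, and the function name and the input list of detected labels are assumptions made for illustration only.

```python
from collections import Counter


def caption_from_detections(labels):
    """Turn detected object labels into a one-sentence description.

    A template is used here as a stand-in for a language model.
    """
    counts = Counter(labels)
    if not counts:
        return "No objects were detected."
    # Naive pluralization by appending "s"; adequate for this toy example only.
    parts = [f"{n} {label}" + ("s" if n > 1 else "")
             for label, n in sorted(counts.items())]
    if len(parts) == 1:
        listing = parts[0]
    elif len(parts) == 2:
        listing = " and ".join(parts)
    else:
        listing = ", ".join(parts[:-1]) + ", and " + parts[-1]
    return f"The scene shows {listing}."


print(caption_from_detections(["person", "vehicle", "person"]))
```

A real system would pass the detection results, or image features derived from them, to a trained language model instead of a template, so that the caption can also reflect relationships and context.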
