Viola–Jones object detection framework

Viola–Jones object detection framework

The Viola–Jones object detection framework is a machine learning object detection framework proposed in 2001 by Paul Viola and Michael Jones. It was motivated primarily by the problem of face detection, although it can be adapted to the detection of other object classes. In short, it consists of a sequence of classifiers. Each classifier is a single perceptron with several binary masks (Haar features). To detect faces in an image, a sliding window is computed over the image. For each image, the classifiers are applied. If at any point, a classifier outputs "no face detected", then the window is considered to contain no face. Otherwise, if all classifiers output "face detected", then the window is considered to contain a face. The algorithm is efficient for its time, able to detect faces in 384 by 288 pixel images at 15 frames per second on a conventional 700 MHz Intel Pentium III. It is also robust, achieving high precision and recall. While it has lower accuracy than more modern methods such as convolutional neural network, its efficiency and compact size (only around 50k parameters, compared to millions of parameters for typical CNN like DeepFace) means it is still used in cases with limited computational power. For example, in the original paper, they reported that this face detector could run on the Compaq iPAQ at 2 fps (this device has a low power StrongARM without floating point hardware). == Problem description == Face detection is a binary classification problem combined with a localization problem: given a picture, decide whether it contains faces, and construct bounding boxes for the faces. To make the task more manageable, the Viola–Jones algorithm only detects full view (no occlusion), frontal (no head-turning), upright (no rotation), well-lit, full-sized (occupying most of the frame) faces in fixed-resolution images. The restrictions are not as severe as they appear, as one can normalize the picture to bring it closer to the requirements for Viola-Jones. any image can be scaled to a fixed resolution for a general picture with a face of unknown size and orientation, one can perform blob detection to discover potential faces, then scale and rotate them into the upright, full-sized position. the brightness of the image can be corrected by white balancing. the bounding boxes can be found by sliding a window across the entire picture, and marking down every window that contains a face. This would generally detect the same face multiple times, for which duplication removal methods, such as non-maximal suppression, can be used. The "frontal" requirement is non-negotiable, as there is no simple transformation on the image that can turn a face from a side view to a frontal view. However, one can train multiple Viola-Jones classifiers, one for each angle: one for frontal view, one for 3/4 view, one for profile view, a few more for the angles in-between them. Then one can at run time execute all these classifiers in parallel to detect faces at different view angles. The "full-view" requirement is also non-negotiable, and cannot be simply dealt with by training more Viola-Jones classifiers, since there are too many possible ways to occlude a face. == Components of the framework == A full presentation of the algorithm is in. Consider an image I ( x , y ) {\displaystyle I(x,y)} of fixed resolution ( M , N ) {\displaystyle (M,N)} . Our task is to make a binary decision: whether it is a photo of a standardized face (frontal, well-lit, etc) or not. Viola–Jones is essentially a boosted feature learning algorithm, trained by running a modified AdaBoost algorithm on Haar feature classifiers to find a sequence of classifiers f 1 , f 2 , . . . , f k {\displaystyle f_{1},f_{2},...,f_{k}} . Haar feature classifiers are crude, but allows very fast computation, and the modified AdaBoost constructs a strong classifier out of many weak ones. At run time, a given image I {\displaystyle I} is tested on f 1 ( I ) , f 2 ( I ) , . . . f k ( I ) {\displaystyle f_{1}(I),f_{2}(I),...f_{k}(I)} sequentially. If at any point, f i ( I ) = 0 {\displaystyle f_{i}(I)=0} , the algorithm immediately returns "no face detected". If all classifiers return 1, then the algorithm returns "face detected". For this reason, the Viola-Jones classifier is also called "Haar cascade classifier". === Haar feature classifiers === Consider a perceptron f w , b {\displaystyle f_{w,b}} defined by two variables w ( x , y ) , b {\displaystyle w(x,y),b} . It takes in an image I ( x , y ) {\displaystyle I(x,y)} of fixed resolution, and returns f w , b ( I ) = { 1 , if ∑ x , y w ( x , y ) I ( x , y ) + b > 0 0 , else {\displaystyle f_{w,b}(I)={\begin{cases}1,\quad {\text{if }}\sum _{x,y}w(x,y)I(x,y)+b>0\\0,\quad {\text{else}}\end{cases}}} A Haar feature classifier is a perceptron f w , b {\displaystyle f_{w,b}} with a very special kind of w {\displaystyle w} that makes it extremely cheap to calculate. Namely, if we write out the matrix w ( x , y ) {\displaystyle w(x,y)} , we find that it takes only three possible values { + 1 , − 1 , 0 } {\displaystyle \{+1,-1,0\}} , and if we color the matrix with white on + 1 {\displaystyle +1} , black on − 1 {\displaystyle -1} , and transparent on 0 {\displaystyle 0} , the matrix is in one of the 5 possible patterns shown on the right. Each pattern must also be symmetric to x-reflection and y-reflection (ignoring the color change), so for example, for the horizontal white-black feature, the two rectangles must be of the same width. For the vertical white-black-white feature, the white rectangles must be of the same height, but there is no restriction on the black rectangle's height. ==== Rationale for Haar features ==== The Haar features used in the Viola-Jones algorithm are a subset of the more general Haar basis functions, which have been used previously in the realm of image-based object detection. While crude compared to alternatives such as steerable filters, Haar features are sufficiently complex to match features of typical human faces. For example: The eye region is darker than the upper-cheeks. The nose bridge region is brighter than the eyes. Composition of properties forming matchable facial features: Location and size: eyes, mouth, bridge of nose Value: oriented gradients of pixel intensities Further, the design of Haar features allows for efficient computation of f w , b ( I ) {\displaystyle f_{w,b}(I)} using only constant number of additions and subtractions, regardless of the size of the rectangular features, using the summed-area table. === Learning and using a Viola–Jones classifier === Choose a resolution ( M , N ) {\displaystyle (M,N)} for the images to be classified. In the original paper, they recommended ( M , N ) = ( 24 , 24 ) {\displaystyle (M,N)=(24,24)} . ==== Learning ==== Collect a training set, with some containing faces, and others not containing faces. Perform a certain modified AdaBoost training on the set of all Haar feature classifiers of dimension ( M , N ) {\displaystyle (M,N)} , until a desired level of precision and recall is reached. The modified AdaBoost algorithm would output a sequence of Haar feature classifiers f 1 , f 2 , . . . , f k {\displaystyle f_{1},f_{2},...,f_{k}} . The details of the modified AdaBoost algorithm is detailed below. ==== Using ==== To use a Viola-Jones classifier with f 1 , f 2 , . . . , f k {\displaystyle f_{1},f_{2},...,f_{k}} on an image I {\displaystyle I} , compute f 1 ( I ) , f 2 ( I ) , . . . f k ( I ) {\displaystyle f_{1}(I),f_{2}(I),...f_{k}(I)} sequentially. If at any point, f i ( I ) = 0 {\displaystyle f_{i}(I)=0} , the algorithm immediately returns "no face detected". If all classifiers return 1, then the algorithm returns "face detected". === Learning algorithm === The speed with which features may be evaluated does not adequately compensate for their number, however. For example, in a standard 24x24 pixel sub-window, there are a total of M = 162336 possible features, and it would be prohibitively expensive to evaluate them all when testing an image. Thus, the object detection framework employs a variant of the learning algorithm AdaBoost to both select the best features and to train classifiers that use them. This algorithm constructs a "strong" classifier as a linear combination of weighted simple “weak” classifiers. h ( x ) = sgn ⁡ ( ∑ j = 1 M α j h j ( x ) ) {\displaystyle h(\mathbf {x} )=\operatorname {sgn} \left(\sum _{j=1}^{M}\alpha _{j}h_{j}(\mathbf {x} )\right)} Each weak classifier is a threshold function based on the feature f j {\displaystyle f_{j}} . h j ( x ) = { − s j if f j < θ j s j otherwise {\displaystyle h_{j}(\mathbf {x} )={\begin{cases}-s_{j}&{\text{if }}f_{j}<\theta _{j}\\s_{j}&{\text{otherwise}}\end{cases}}} The threshold value θ j {\displaystyle \theta _{j}} and the polarity s j ∈ ± 1 {\displaystyle s_{j}\in \pm 1} are determined in the training, as well as the coefficients α j {\displaystyle \alpha _{j}} . Here a simplified version of the lea

CMU Pronouncing Dictionary

The CMU Pronouncing Dictionary (also known as CMUdict) is an open-source pronouncing dictionary originally created by the Speech Group at Carnegie Mellon University (CMU) for use in speech recognition research. CMUdict provides a mapping orthographic/phonetic for English words in their North American pronunciations. It is commonly used to generate representations for speech recognition (ASR), e.g. the CMU Sphinx system, and speech synthesis (TTS), e.g. the Festival system. CMUdict can be used as a training corpus for building statistical grapheme-to-phoneme (g2p) models that will generate pronunciations for words not yet included in the dictionary. The most recent release is 0.7b; it contains over 134,000 entries. An interactive lookup version is available. == Database format == The database is distributed as a plain text file with one entry to a line in the format "WORD " with a two-space separator between the parts. If multiple pronunciations are available for a word, variants are identified using numbered versions (e.g. WORD(1)). The pronunciation is encoded using a modified form of the ARPABET system, with the addition of stress marks on vowels of levels 0, 1, and 2. A line-initial ;;; token indicates a comment. A derived format, directly suitable for speech recognition engines is also available as part of the distribution; this format collapses stress distinctions (typically not used in ASR). The following is a table of phonemes used by CMU Pronouncing Dictionary. == History == == Applications == The Unifon converter is based on the CMU Pronouncing Dictionary. The Natural Language Toolkit contains an interface to the CMU Pronouncing Dictionary. The Carnegie Mellon Logios tool incorporates the CMU Pronouncing Dictionary. PronunDict, a pronunciation dictionary of American English, uses the CMU Pronouncing Dictionary as its data source. Pronunciation is transcribed in IPA symbols. This dictionary also supports searching by pronunciation. Some singing voice synthesizer software like CeVIO Creative Studio and Synthesizer V uses modified version of CMU Pronouncing Dictionary for synthesizing English singing voices. Transcriber, a tool for the full text phonetic transcription, uses the CMU Pronouncing Dictionary 15.ai, a real-time text-to-speech tool using artificial intelligence, uses the CMU Pronouncing Dictionary

Intelligent database

Until the 1980s, databases were viewed as computer systems that stored record-oriented and business data such as manufacturing inventories, bank records, and sales transactions. A database system was not expected to merge numeric data with text, images, or multimedia information, nor was it expected to automatically notice patterns in the data it stored. In the late 1980s the concept of an intelligent database was put forward as a system that manages information (rather than data) in a way that appears natural to users and which goes beyond simple record keeping. The term was introduced in 1989 by the book Intelligent Databases by Kamran Parsaye, Mark Chignell, Setrag Khoshafian and Harry Wong. The concept postulated three levels of intelligence for such systems: high level tools, the user interface and the database engine. The high level tools manage data quality and automatically discover relevant patterns in the data with a process called data mining. This layer often relies on the use of artificial intelligence techniques. The user interface uses hypermedia in a form that uniformly manages text, images and numeric data. The intelligent database engine supports the other two layers, often merging relational database techniques with object orientation. In the twenty-first century, intelligent databases have now become widespread, e.g. hospital databases can now call up patient histories consisting of charts, text and x-ray images just with a few mouse clicks, and many corporate databases include decision support tools based on sales pattern analysis.

Document classification

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" (or "intellectually") or algorithmically. The intellectual classification of documents has mostly been the province of library science, while the algorithmic classification of documents is mainly in information science and computer science. The problems are overlapping, however, and there is therefore interdisciplinary research on document classification. The documents to be classified may be texts, images, music, etc. Each kind of document possesses its special classification problems. When not otherwise specified, text classification is implied. Documents may be classified according to their subjects or according to other attributes (such as document type, author, printing year etc.). In the rest of this article only subject classification is considered. There are two main philosophies of subject classification of documents: the content-based approach and the request-based approach. == "Content-based" versus "request-based" classification == Content-based classification is classification in which the weight given to particular subjects in a document determines the class to which the document is assigned. It is, for example, a common rule for classification in libraries, that at least 20% of the content of a book should be about the class to which the book is assigned. In automatic classification it could be the number of times given words appears in a document. Request-oriented classification (or -indexing) is classification in which the anticipated request from users is influencing how documents are being classified. The classifier asks themself: “Under which descriptors should this entity be found?” and “think of all the possible queries and decide for which ones the entity at hand is relevant” (Soergel, 1985, p. 230). Request-oriented classification may be classification that is targeted towards a particular audience or user group. For example, a library or a database for feminist studies may classify/index documents differently when compared to a historical library. It is probably better, however, to understand request-oriented classification as policy-based classification: The classification is done according to some ideals and reflects the purpose of the library or database doing the classification. In this way it is not necessarily a kind of classification or indexing based on user studies. Only if empirical data about use or users are applied should request-oriented classification be regarded as a user-based approach. == Classification versus indexing == Sometimes a distinction is made between assigning documents to classes ("classification") versus assigning subjects to documents ("subject indexing") but as Frederick Wilfrid Lancaster has argued, this distinction is not fruitful. "These terminological distinctions,” he writes, “are quite meaningless and only serve to cause confusion” (Lancaster, 2003, p. 21). The view that this distinction is purely superficial is also supported by the fact that a classification system may be transformed into a thesaurus and vice versa (cf., Aitchison, 1986, 2004; Broughton, 2008; Riesthuis & Bliedung, 1991). Therefore, assigning a subject term to a document in an index is equivalent to assigning that document to the class of documents indexed by that term (all documents indexed or classified as X belong to the same class of documents). == Automatic document classification (ADC) == Automatic document classification tasks can be divided into three sorts: supervised document classification where some external mechanism (such as human feedback) provides information on the correct classification for documents, unsupervised document classification (also known as document clustering), where the classification must be done entirely without reference to external information, and semi-supervised document classification, where parts of the documents are labeled by the external mechanism. There are several software products under various license models available. === Techniques === Automatic document classification techniques include: Artificial neural network Concept Mining Decision trees such as ID3 or C4.5 Expectation maximization (EM) Instantaneously trained neural networks Latent semantic indexing Multiple-instance learning Naive Bayes classifier Natural language processing approaches Rough set-based classifier Soft set-based classifier Support vector machines (SVM) K-nearest neighbour algorithms tf–idf == Applications == Classification techniques have been applied to spam filtering, a process which tries to discern E-mail spam messages from legitimate emails email routing, sending an email sent to a general address to a specific address or mailbox depending on topic language identification, automatically determining the language of a text genre classification, automatically determining the genre of a text readability assessment, automatically determining the degree of readability of a text, either to find suitable materials for different age groups or reader types or as part of a larger text simplification system sentiment analysis, determining the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document. health-related classification using social media in public health surveillance article triage, selecting articles that are relevant for manual literature curation, for example as is being done as the first step to generate manually curated annotation databases in biology

Sparkles emoji

The Sparkles emoji (U+2728 ✨ SPARKLES) is an emoji that has one large star surrounded by smaller stars. Originating from Japan to represent sparkles used in anime and manga, the sparkles are often used as emphasis in text by surrounding words or phrases with it. It is the third most-used emoji in the world on Twitter as of 2021. Since the early 2020s it has been used by major software companies to represent artificial intelligence, marketing the technology as "like magic". == Development == According to Emojipedia, the Sparkles emoji was first used by Japanese mobile operators SoftBank, Docomo and au in the late 1990s. The emoji was added to Unicode 6.0 in 2010 and Emoji 1.0 in 2015. On some platforms the Sparkles emoji has been multicoloured whilst on other platforms it has been one colour. Twitter and Microsoft's Sparkles have changed from being multicoloured to being a single colour. Samsung's version of the emoji previously had a night sky in the background. == Usage == === Interpersonal communication === The Sparkles emoji was originally meant to represent the usage of sparkles in Japanese anime and manga, where the sparkles are used to represent beauty, happiness or awe. The emoji has several meanings and depends upon context. Starting in the late 2010s, the emoji started being used to surround words or phrases to be used as emphasis, an example from the book Because Internet being "I would simply ✨pass away✨". It can also be used as sarcasm, irony or as a way to mock people. Without emoji this could be represented with tildes or asterisks, for example, "~tildes~" or "~asterisk plus tilde~" or "~~true sparkle exuberance~~". The sparkles emoji can be used to represent stars in text, be used to represent cleanliness or can be used to mean "orgasm" whilst sexting. In September 2021 the Sparkles emoji overtook the Pleading Face (🥺) emoji to become the third most-used emoji in the world according to Emojipedia, with approximately 1 per cent of all tweets containing the Sparkles emoji. === Artificial intelligence === In the early 2020s, the Sparkles emoji started being used as an icon to represent artificial intelligence (AI). Companies who use the emoji this way include Google, OpenAI, Samsung, Microsoft, Adobe, Spotify and Zoom. As of August 2024, seven of the top 10 software companies by market capitalisation use the Sparkles emojis with AI. OpenAI has different versions of the Sparkles for different versions of the models that ChatGPT uses. One explanation is that Sparkles is being used by these companies as a way to market AI as "magic". Marketing technology as "magic" has been used before AI, particularly by Apple. Another explanation given by designers and marketers choosing to use Sparkles to signify AI is simply that other platforms are doing it, making it familiar to users. Around 2024, some of these companies started removing two of the smaller stars from the emoji in their AI services and have kept the one large star, an example being Google's Gemini chatbot. In early 2024, the Nielsen Norman Group provided test subjects with the star in isolation and found that people did not associate the symbol with AI, but instead mostly with "optimisation" or "favourite or save an item".

Multimodal representation learning

Multimodal representation learning is a subfield of representation learning focused on integrating and interpreting information from different modalities, such as text, images, audio, or video, by projecting them into a shared latent space. This allows for semantically similar content across modalities to be mapped to nearby points within that space, facilitating a unified understanding of diverse data types. By automatically learning meaningful features from each modality and capturing their inter-modal relationships, multimodal representation learning enables a unified representation that enhances performance in cross-media analysis tasks such as video classification, event detection, and sentiment analysis. It also supports cross-modal retrieval and translation, including image captioning, video description, and text-to-image synthesis. == Motivation == The primary motivations for multimodal representation learning arise from the inherent nature of real-world data and the limitations of unimodal approaches. Since multimodal data offers complementary and supplementary information about an object or event from different perspectives, it is more informative than relying on a single modality. A key motivation is to narrow the heterogeneity gap that exists between different modalities by projecting their features into a shared semantic subspace. This allows semantically similar content across modalities to be represented by similar vectors, facilitating the understanding of relationships and correlations between them. Multimodal representation learning aims to leverage the unique information provided by each modality to achieve a more comprehensive and accurate understanding of concepts. These unified representations are crucial for improving performance in various cross-media analysis tasks such as video classification, event detection, and sentiment analysis. They also enable cross-modal retrieval, allowing users to search and retrieve content across different modalities. Additionally, it facilitates cross-modal translation, where information can be converted from one modality to another, as seen in applications like image captioning and text-to-image synthesis. The abundance of ubiquitous multimodal data in real-world applications, including understudied areas like healthcare, finance, and human-computer interaction (HCI), further motivates the development of effective multimodal representation learning techniques. == Approaches and methods == === Canonical-correlation analysis based methods === Canonical-correlation analysis (CCA) was first introduced in 1936 by Harold Hotelling and is a fundamental approach for multimodal learning. CCA aims to find linear relationships between two sets of variables. Given two data matrices X ∈ R n × p {\displaystyle X\in \mathbb {R} ^{n\times p}} and Y ∈ R n × q {\displaystyle Y\in \mathbb {R} ^{n\times q}} representing different modalities, CCA finds projection vectors w x ∈ R p {\displaystyle w_{x}\in \mathbb {R} ^{p}} and w y ∈ R q {\displaystyle w_{y}\in \mathbb {R} ^{q}} that maximizes the correlation between the projected variables: ρ = max w x , w y w x ⊤ Σ x y w y w x ⊤ Σ x x w x w y ⊤ Σ y y w y {\displaystyle \rho =\max _{w_{x},w_{y}}{\frac {w_{x}^{\top }\Sigma _{xy}w_{y}}{{\sqrt {w_{x}^{\top }\Sigma _{xx}w_{x}}}{\sqrt {w_{y}^{\top }\Sigma _{yy}w_{y}}}}}} such that Σ x x {\displaystyle \Sigma _{xx}} and Σ y y {\displaystyle \Sigma _{yy}} are the within-modality covariance matrices, and Σ x y {\displaystyle \Sigma _{xy}} is the between-modality covariance matrix. However, standard CCA is limited by its linearity, which led to the development of nonlinear extensions, such as kernel CCA and deep CCA. ==== Kernel CCA ==== Kernel canonical correlation analysis (KCCA) extends traditional CCA to capture nonlinear relationships between modalities by implicitly mapping the data into high dimensional feature spaces using kernel functions. Given kernel functions K x {\displaystyle K_{x}} and K y {\displaystyle K_{y}} with corresponding Gram matrices K x ∈ R n × n {\displaystyle K_{x}\in \mathbb {R} ^{n\times n}} and K y ∈ R n × n {\displaystyle K_{y}\in \mathbb {R} ^{n\times n}} , KCCA seeks coefficients α {\displaystyle \alpha } and β {\displaystyle \beta } that maximize: ρ = max α , β α ⊤ K x K y β α ⊤ K x 2 α β ⊤ K y 2 β {\displaystyle \rho =\max _{\alpha ,\beta }{\frac {\alpha ^{\top }K_{x}Ky\beta }{{\sqrt {\alpha ^{\top }K_{x}^{2}\alpha }}{\sqrt {\beta ^{\top }K_{y}^{2}\beta }}}}} To prevent overfitting, regularization terms are typically added, resulting in: ρ = max α , β α T K x K y β α T ( K x 2 + λ x K x ) α β T ( K y 2 + λ y K y ) β {\displaystyle \rho =\max _{\alpha ,\beta }{\frac {\alpha ^{T}K_{x}K_{y}\beta }{{\sqrt {\alpha ^{T}\left(K_{x}^{2}+\lambda _{x}K_{x}\right)\alpha }}{\sqrt {\;\beta ^{T}\left(K_{y}^{2}+\lambda _{y}K_{y}\right)\beta }}}}} where λ x {\displaystyle \lambda _{x}} and λ y {\displaystyle \lambda _{y}} are regularization parameters. KCCA has proven effective for tasks such as cross-modal retrieval and semantic analysis, though it faces computational challenges with large datasets due to its O ( n 2 ) {\displaystyle O(n^{2})} memory requirement for sorting kernel matrices. KCCA was proposed independently by several researchers. ==== Deep CCA ==== Deep canonical correlation analysis (DCCA), introduced in 2013, employs neural networks to learn nonlinear transformations for maximizing the correlation between modalities. DCCA uses separate neural networks f x {\displaystyle f_{x}} and f y {\displaystyle f_{y}} for each modality to transform the original data before applying CCA: max W x , W y , θ x , θ y corr ⁡ ( f x ( X ; θ x ) , f y ( Y ; θ y ) ) {\displaystyle \max _{W_{x},W_{y},\theta _{x},\theta _{y}}\operatorname {corr} \left(f_{x}(X;\theta _{x}),f_{y}(Y;\theta _{y})\right)} where θ x {\displaystyle \theta _{x}} and θ y {\displaystyle \theta _{y}} represent the parameters of the neural networks, and W x {\displaystyle W_{x}} and W y {\displaystyle W_{y}} are the CCA projection matrices. The correlation objective is computed as: corr ⁡ ( H x , H y ) = tr ⁡ ( T − 1 / 2 H x T H y S − 1 / 2 ) {\displaystyle \operatorname {corr} (H_{x},H_{y})=\operatorname {tr} \left(T^{-1/2}H_{x}^{T}H_{y}S^{-1/2}\right)} where H x = f x ( X ) {\displaystyle H_{x}=f_{x}(X)} and H y = f y ( Y ) {\displaystyle H_{y}=f_{y}(Y)} are the network outputs, T = H x T H x + r x I {\displaystyle T=H_{x}^{T}H_{x}+r_{x}I} , S = H y T H y + r y I {\displaystyle S=H_{y}^{T}H_{y}+r_{y}I} and r x , r y {\displaystyle r_{x},r_{y}} are the regularization parameters. DCCA overcomes the limitations of linear CCA and kernel CCA by learning complex nonlinear relationships while maintaining computational efficiency for large datasets through mini-batch optimization. === Graph-based methods === Graph-based approaches for multimodal representation learning leverage graph structure to model relationships between entities across different modalities. These methods typically represent each modality as a graph and then learn embedding that preserve cross-modal similarities, enabling more effective joint representation of heterogeneous data. One such method is cross-modal graph neural networks (CMGNNs) that extend traditional graph neural networks (GNNs) to handle data from multiple modalities by constructing graphs that capture both intra-modal and inter-modal relationships. These networks model interactions across modalities by representing them as nodes and their relationships as edges. Other graph-based methods include Probabilistic Graphical Models (PGMs) such as deep belief networks (DBN) and deep Boltzmann machines (DBM). These models can learn a joint representation across modalities, for instance, a multimodal DBN achieves this by adding a shared restricted Boltzmann Machine (RBM) hidden layer on top of modality-specific DBNs. Additionally, the structure of data in some domains like Human-Computer Interaction (HCI), such as the view hierarchy of app screens, can potentially be modeled using graph-like structures. The field of graph representation learning is also relevant, with ongoing progress in developing evaluation benchmarks. === Diffusion maps === Another set of methods relevant to multimodal representation learning are based on diffusion maps and their extensions to handle multiple modalities. ==== Multi-view diffusion maps ==== Multi-view diffusion maps address the challenge of achieving multi-view dimensionality reduction by effectively utilizing the availability of multiple views to extract a coherent low-dimensional representation of the data. The core idea is to exploit both the intrinsic relations within each view and the mutual relations between the different views, defining a cross-view model where a random walk process implicitly hops between objects in different views. A multi-view kernel matrix is constructed by combining these relations, defining a cross-view diffusion process and associ

Human-in-the-loop

Human-in-the-loop (HITL) is used in multiple contexts. It can be defined as a model requiring human interaction. HITL is associated with modeling and simulation (M&S) in the live, virtual, and constructive taxonomy. HITL, along with the related human-on-the-loop, are also used in relation to lethal autonomous weapons. Further, HITL is used in the context of machine learning.It is also used in conversational AI to manage complex interactions that require human empathy. == Machine learning == In machine learning, HITL is used in the sense of humans aiding the computer in making the correct decisions in building a model. HITL improves machine learning over random sampling by selecting the most critical data needed to refine the model. == Simulation == In simulation, HITL models may conform to human factors requirements as in the case of a mockup. In this type of simulation, a human is always part of the simulation and consequently influences the outcome in such a way that is difficult if not impossible to reproduce exactly. HITL also readily allows for the identification of problems and requirements that may not be easily identified by other means of simulation. HITL is often referred to as an interactive simulation, which is a special kind of physical simulation in which physical simulations include human operators, such as in a flight or a driving simulator. === Benefits === Human-in-the-loop allows the user to change the outcome of an event or process. The immersion effectively contributes to a positive transfer of acquired skills into the real world. This can be demonstrated by trainees utilizing flight simulators in preparation to become pilots. HITL also allows for the acquisition of knowledge regarding how a new process may affect a particular event. Utilizing HITL allows participants to interact with realistic models and attempt to perform as they would in an actual scenario. HITL simulations bring to the surface issues that would not otherwise be apparent until after a new process has been deployed. A real-world example of HITL simulation as an evaluation tool is its usage by the Federal Aviation Administration (FAA) to allow air traffic controllers to test new automation procedures by directing the activities of simulated air traffic while monitoring the effect of the newly implemented procedures. As with most processes, there is always the possibility of human error, which can only be reproduced using HITL simulation. Although much can be done to automate systems, humans typically still need to take the information provided by a system to determine the next course of action based on their judgment and experience. Intelligent systems can only go so far in certain circumstances to automate a process; only humans in the simulation can accurately judge the final design. Tabletop simulation may be useful in the very early stages of project development for the purpose of collecting data to set broad parameters, but the important decisions require human-in-the-loop simulation. HITL reflects scenarios where human input remains essential despite advances in automation. === Within the virtual simulation taxonomy === Virtual simulations inject HITL in a central role by exercising motor control skills (e.g. flying an airplane), decision making skills (e.g. committing fire control resources to action), or communication skills (e.g. as members of a C4I team). === Examples === Flight simulators Driving simulators Marine simulators Video games Supply chain management simulators Digital puppetry === Misconceptions === Although human-in-the-loop simulation can include a computer simulation in the form of a synthetic environment, computer simulation is not necessarily a form of human-in-the-loop simulation, and is often considered as human-out-of-the loop simulation. In this particular case, a computer model’s behavior is modified according to a set of initial parameters. The results of the model differ from the results stemming from a true human-in-the-loop simulation because the results can easily be replicated time and time again, by simply providing identical parameters. == Weapons == === Taxonomy === Three classifications of the degree of human control of autonomous weapon systems were laid out by Bonnie Docherty in a 2012 Human Rights Watch report. human-in-the-loop: a human must instigate the action of the weapon (in other words not fully autonomous) human-on-the-loop: a human may abort an action human-out-of-the-loop: no human action is involved === Positive human action === In discussions of autonomous weapons and nuclear command and control, the phrase positive human action has been used alongside "human-in-the-loop" to emphasize that a human operator must affirmatively authorize the use of force. Descriptions of the United States Navy's Aegis Combat System have used the phrase in characterizing a requirement for affirmative human action to initiate live firing. A survey of autonomous weapons systems described the Aegis "Auto SM" mode as one in which "the system fully develops the engagement process however engagement requires positive human action". The phrase entered United States federal law in the National Defense Authorization Act for Fiscal Year 2025, which stipulates that artificial intelligence systems not compromise "the principle of requiring positive human actions in execution of decisions by the President with respect to the employment of nuclear weapons".