Multimodal representation learning is a subfield of representation learning focused on integrating and interpreting information from different modalities, such as text, images, audio, or video, by projecting them into a shared latent space. This allows for semantically similar content across modalities to be mapped to nearby points within that space, facilitating a unified understanding of diverse data types. By automatically learning meaningful features from each modality and capturing their inter-modal relationships, multimodal representation learning enables a unified representation that enhances performance in cross-media analysis tasks such as video classification, event detection, and sentiment analysis. It also supports cross-modal retrieval and translation, including image captioning, video description, and text-to-image synthesis. == Motivation == The primary motivations for multimodal representation learning arise from the inherent nature of real-world data and the limitations of unimodal approaches. Since multimodal data offers complementary and supplementary information about an object or event from different perspectives, it is more informative than relying on a single modality. A key motivation is to narrow the heterogeneity gap that exists between different modalities by projecting their features into a shared semantic subspace. This allows semantically similar content across modalities to be represented by similar vectors, facilitating the understanding of relationships and correlations between them. Multimodal representation learning aims to leverage the unique information provided by each modality to achieve a more comprehensive and accurate understanding of concepts. These unified representations are crucial for improving performance in various cross-media analysis tasks such as video classification, event detection, and sentiment analysis. They also enable cross-modal retrieval, allowing users to search and retrieve content across different modalities. Additionally, it facilitates cross-modal translation, where information can be converted from one modality to another, as seen in applications like image captioning and text-to-image synthesis. The abundance of ubiquitous multimodal data in real-world applications, including understudied areas like healthcare, finance, and human-computer interaction (HCI), further motivates the development of effective multimodal representation learning techniques. == Approaches and methods == === Canonical-correlation analysis based methods === Canonical-correlation analysis (CCA) was first introduced in 1936 by Harold Hotelling and is a fundamental approach for multimodal learning. CCA aims to find linear relationships between two sets of variables. Given two data matrices X ∈ R n × p {\displaystyle X\in \mathbb {R} ^{n\times p}} and Y ∈ R n × q {\displaystyle Y\in \mathbb {R} ^{n\times q}} representing different modalities, CCA finds projection vectors w x ∈ R p {\displaystyle w_{x}\in \mathbb {R} ^{p}} and w y ∈ R q {\displaystyle w_{y}\in \mathbb {R} ^{q}} that maximizes the correlation between the projected variables: ρ = max w x , w y w x ⊤ Σ x y w y w x ⊤ Σ x x w x w y ⊤ Σ y y w y {\displaystyle \rho =\max _{w_{x},w_{y}}{\frac {w_{x}^{\top }\Sigma _{xy}w_{y}}{{\sqrt {w_{x}^{\top }\Sigma _{xx}w_{x}}}{\sqrt {w_{y}^{\top }\Sigma _{yy}w_{y}}}}}} such that Σ x x {\displaystyle \Sigma _{xx}} and Σ y y {\displaystyle \Sigma _{yy}} are the within-modality covariance matrices, and Σ x y {\displaystyle \Sigma _{xy}} is the between-modality covariance matrix. However, standard CCA is limited by its linearity, which led to the development of nonlinear extensions, such as kernel CCA and deep CCA. ==== Kernel CCA ==== Kernel canonical correlation analysis (KCCA) extends traditional CCA to capture nonlinear relationships between modalities by implicitly mapping the data into high dimensional feature spaces using kernel functions. Given kernel functions K x {\displaystyle K_{x}} and K y {\displaystyle K_{y}} with corresponding Gram matrices K x ∈ R n × n {\displaystyle K_{x}\in \mathbb {R} ^{n\times n}} and K y ∈ R n × n {\displaystyle K_{y}\in \mathbb {R} ^{n\times n}} , KCCA seeks coefficients α {\displaystyle \alpha } and β {\displaystyle \beta } that maximize: ρ = max α , β α ⊤ K x K y β α ⊤ K x 2 α β ⊤ K y 2 β {\displaystyle \rho =\max _{\alpha ,\beta }{\frac {\alpha ^{\top }K_{x}Ky\beta }{{\sqrt {\alpha ^{\top }K_{x}^{2}\alpha }}{\sqrt {\beta ^{\top }K_{y}^{2}\beta }}}}} To prevent overfitting, regularization terms are typically added, resulting in: ρ = max α , β α T K x K y β α T ( K x 2 + λ x K x ) α β T ( K y 2 + λ y K y ) β {\displaystyle \rho =\max _{\alpha ,\beta }{\frac {\alpha ^{T}K_{x}K_{y}\beta }{{\sqrt {\alpha ^{T}\left(K_{x}^{2}+\lambda _{x}K_{x}\right)\alpha }}{\sqrt {\;\beta ^{T}\left(K_{y}^{2}+\lambda _{y}K_{y}\right)\beta }}}}} where λ x {\displaystyle \lambda _{x}} and λ y {\displaystyle \lambda _{y}} are regularization parameters. KCCA has proven effective for tasks such as cross-modal retrieval and semantic analysis, though it faces computational challenges with large datasets due to its O ( n 2 ) {\displaystyle O(n^{2})} memory requirement for sorting kernel matrices. KCCA was proposed independently by several researchers. ==== Deep CCA ==== Deep canonical correlation analysis (DCCA), introduced in 2013, employs neural networks to learn nonlinear transformations for maximizing the correlation between modalities. DCCA uses separate neural networks f x {\displaystyle f_{x}} and f y {\displaystyle f_{y}} for each modality to transform the original data before applying CCA: max W x , W y , θ x , θ y corr ( f x ( X ; θ x ) , f y ( Y ; θ y ) ) {\displaystyle \max _{W_{x},W_{y},\theta _{x},\theta _{y}}\operatorname {corr} \left(f_{x}(X;\theta _{x}),f_{y}(Y;\theta _{y})\right)} where θ x {\displaystyle \theta _{x}} and θ y {\displaystyle \theta _{y}} represent the parameters of the neural networks, and W x {\displaystyle W_{x}} and W y {\displaystyle W_{y}} are the CCA projection matrices. The correlation objective is computed as: corr ( H x , H y ) = tr ( T − 1 / 2 H x T H y S − 1 / 2 ) {\displaystyle \operatorname {corr} (H_{x},H_{y})=\operatorname {tr} \left(T^{-1/2}H_{x}^{T}H_{y}S^{-1/2}\right)} where H x = f x ( X ) {\displaystyle H_{x}=f_{x}(X)} and H y = f y ( Y ) {\displaystyle H_{y}=f_{y}(Y)} are the network outputs, T = H x T H x + r x I {\displaystyle T=H_{x}^{T}H_{x}+r_{x}I} , S = H y T H y + r y I {\displaystyle S=H_{y}^{T}H_{y}+r_{y}I} and r x , r y {\displaystyle r_{x},r_{y}} are the regularization parameters. DCCA overcomes the limitations of linear CCA and kernel CCA by learning complex nonlinear relationships while maintaining computational efficiency for large datasets through mini-batch optimization. === Graph-based methods === Graph-based approaches for multimodal representation learning leverage graph structure to model relationships between entities across different modalities. These methods typically represent each modality as a graph and then learn embedding that preserve cross-modal similarities, enabling more effective joint representation of heterogeneous data. One such method is cross-modal graph neural networks (CMGNNs) that extend traditional graph neural networks (GNNs) to handle data from multiple modalities by constructing graphs that capture both intra-modal and inter-modal relationships. These networks model interactions across modalities by representing them as nodes and their relationships as edges. Other graph-based methods include Probabilistic Graphical Models (PGMs) such as deep belief networks (DBN) and deep Boltzmann machines (DBM). These models can learn a joint representation across modalities, for instance, a multimodal DBN achieves this by adding a shared restricted Boltzmann Machine (RBM) hidden layer on top of modality-specific DBNs. Additionally, the structure of data in some domains like Human-Computer Interaction (HCI), such as the view hierarchy of app screens, can potentially be modeled using graph-like structures. The field of graph representation learning is also relevant, with ongoing progress in developing evaluation benchmarks. === Diffusion maps === Another set of methods relevant to multimodal representation learning are based on diffusion maps and their extensions to handle multiple modalities. ==== Multi-view diffusion maps ==== Multi-view diffusion maps address the challenge of achieving multi-view dimensionality reduction by effectively utilizing the availability of multiple views to extract a coherent low-dimensional representation of the data. The core idea is to exploit both the intrinsic relations within each view and the mutual relations between the different views, defining a cross-view model where a random walk process implicitly hops between objects in different views. A multi-view kernel matrix is constructed by combining these relations, defining a cross-view diffusion process and associ
Software engineering professionalism
Software engineering professionalism is a movement to make software engineering a profession, with aspects such as degree and certification programs, professional associations, professional ethics, and government licensing. The field is a licensed discipline in Texas in the United States (Texas Board of Professional Engineers, since 2013), Engineers Australia(Course Accreditation since 2001, not Licensing), and many provinces in Canada. == History == In 1993 the IEEE and ACM began a joint effort called JCESEP, which evolved into SWECC in 1998 to explore making software engineering into a profession. The ACM pulled out of SWECC in May 1999, objecting to its support for the Texas professionalization efforts, of having state licenses for software engineers. ACM determined that the state of knowledge and practice in software engineering was too immature to warrant licensing, and that licensing would give false assurances of competence even if the body of knowledge were mature. The IEEE continued to support making software engineering a branch of traditional engineering. In Canada the Canadian Information Processing Society established the Information Systems Professional certification process. Also, by the late 1990s (1999 in British Columbia) the discipline of software engineering as a professional engineering discipline was officially created. This has caused some disputes between the provincial engineering associations and companies who call their developers software engineers, even though these developers have not been licensed by any engineering association. In 1999, the Panel of Software Engineering was formed as part of the settlement between Engineering Canada and the Memorial University of Newfoundland over the school's use of the term "software engineering" in the name of a computer science program. Concerns were raised over the inappropriate use of the name "software engineering" to describe non-engineering programs could lead to student and public confusion, and ultimately threaten public safety. The Panel issued recommendations to create a Software Engineering Accreditation Board, but the task force created to carry out the recommendations was unable to get the various stakeholders to agree to concrete proposals, resulting in separate accreditation boards. == Ethics == Software engineering ethics is a large field. In some ways it began as an unrealistic attempt to define bugs as unethical. More recently it has been defined as the application of both computer science and engineering philosophy, principles, and practices to the design and development of software systems. Due to this engineering focus and the increased use of software in mission critical and human critical systems, where failure can result in large losses of capital but more importantly lives such as the Therac-25 system, many ethical codes have been developed by a number of societies, associations and organizations. These entities, such as the ACM, IEEE, EGBC and Institute for Certification of Computing Professionals (ICCP) have formal codes of ethics. Adherence to the code of ethics is required as a condition of membership or certification. According to the ICCP, violation of the code can result in revocation of the certificate. Also, all engineering societies require conformance to their ethical codes; violation of the code results in the revocation of the license to practice engineering in the society's jurisdiction. These codes of ethics usually have much in common. They typically relate the need to act consistently with the client's interest, employer's interest, and most importantly the public's interest. They also outline the need to act with professionalism and to promote an ethical approach to the profession. A Software Engineering Code of Ethics has been approved by the ACM and the IEEE-CS as the standard for teaching and practicing software engineering. === Examples of codes of conduct === The following are examples of codes of conduct for Professional Engineers. These 2 have been chosen because both jurisdictions have a designation for Professional Software Engineers. Engineers and Geoscientists of British Columbia (EGBC): All members in the association's code of Ethics must ensure that the government, the public can rely on BC's professional engineers and Geoscientists to act at all times with fairness, courtesy and good faith to their employers, employee and customers, and to uphold the truth, honesty and trustworthiness, and to safe guard human life and the environment. This is just one of the many ways in which BC's Professional Engineers and Professional Geoscientists maintain their competitive edge in today's global marketplace. Association of Professional Engineers and Geoscientists of Alberta (APEGA): Different with British Columbia, the Alberta Government granted self governance to engineers, Geoscientists and geophysicists. All members in the APEGA have to accept legal and ethical responsibility for the work and to hold the interest of the public and society. The APEGA is a standards guideline of professional practice to uphold the protection of public interest for engineering, Geoscientists and geophysics in Alberta. === Opinions on ethics === Bill Joy argued that "better software" can only enable its privileged end users, make reality more power-pointy as opposed to more humane, and ultimately run away with itself so that "the future doesn't need us." He openly questioned the goals of software engineering in this respect, asking why it isn't trying to be more ethical rather than more efficient. In his book Code and Other Laws of Cyberspace, Lawrence Lessig argues that computer code can regulate conduct in much the same way as the legal code. Lessig and Joy urge people to think about the consequences of the software being developed, not only in a functional way, but also in how it affects the public and society as a whole. Overall, due to the youth of software engineering, many of the ethical codes and values have been borrowed from other fields, such as mechanical and civil engineering. However, there are many ethical questions that even these, much older, disciplines have not encountered. Questions about the ethical impact of internet applications, which have a global reach, have never been encountered until recently and other ethical questions are still to be encountered. This means the ethical codes for software engineering are a work in progress, that will change and update as more questions arise. == Independent licensing and certification exams == Since 2002, the IEEE Computer Society offered the Certified Software Development Professional (CSDP) certification exam (in 2015 this was replaced by several similar certifications). A group of experts from industry and academia developed the exam and maintained it. Donald Bagert, and at a later period Stephen Tockey headed the certification committee. Contents of the exam centered around the SWEBOK (Software Engineering Body of Knowledge) guide, with an additional emphasis on Professional Practices and Software Engineering Economics knowledge areas (KAs). The motivation was to produce a structure at an international level for software engineering's knowledge areas. == Criticism of licensing == Professional licensing has been criticized for many reasons. The field of software engineering is too immature Licensing would give false assurances of competence even if the body of knowledge were mature Software engineers would have to study years of calculus, physics, and chemistry to pass the exams, which is irrelevant to most software practitioners. Many (most?) computer science majors don't earn degrees in engineering schools, so they are probably unqualified to pass engineering exams. == Licensing by country == === United States === The Bureau of Labor Statistics (BLS) classifies computer software engineers as a subcategory of "computer specialists", along with occupations such as computer scientist, Programmer, Database administrator and Network administrator. The BLS classifies all other engineering disciplines, including computer hardware engineers, as engineers. Many states prohibit unlicensed persons from calling themselves an Engineer, or from indicating branches or specialties not covered licensing acts. In many states, the title Engineer is reserved for individuals with a Professional Engineering license indicating that they have shown minimum level of competency through accredited engineering education, qualified engineering experience, and engineering board's examinations. In April 2013 the National Council of Examiners for Engineering and Surveying (NCEES) began offering a Professional Engineer (PE) exam for Software Engineering. The exam was developed in association with the IEEE Computer Society. NCEES ended the exam in April 2019 due to lack of participation. The American National Society of Professional Engineers provides a model law and lobbies legislatures to adopt occ
Pooling layer
In neural networks, a pooling layer is a kind of network layer that downsamples and aggregates information that is dispersed among many vectors into fewer vectors. It has several uses. It removes redundant information, thus reducing the amount of computation and memory required, which makes the model more robust to small variations in the input; and it increases the receptive field of neurons in later layers in the network. == Convolutional neural network pooling == Pooling is most commonly used in convolutional neural networks (CNN). Below is a description of pooling in 2-dimensional CNNs. The generalization to n-dimensions is immediate. As notation, we consider a tensor x ∈ R H × W × C {\displaystyle x\in \mathbb {R} ^{H\times W\times C}} , where H {\displaystyle H} is height, W {\displaystyle W} is width, and C {\displaystyle C} is the number of channels. A pooling layer outputs a tensor y ∈ R H ′ × W ′ × C ′ {\displaystyle y\in \mathbb {R} ^{H'\times W'\times C'}} . We define two variables f , s {\displaystyle f,s} called "filter size" (aka "kernel size") and "stride". Sometimes, it is necessary to use a different filter size and stride for horizontal and vertical directions. In such cases, we define 4 variables: f H , f W , s H , s W {\displaystyle f_{H},f_{W},s_{H},s_{W}} . The receptive field of an entry in the output tensor, y {\displaystyle y} , are all the entries in x {\displaystyle x} that can affect that entry. === Max pooling === Max Pooling (MaxPool) is commonly used in CNNs to reduce the spatial dimensions of feature maps. Define M a x P o o l ( x | f , s ) 0 , 0 , 0 = max ( x 0 : f − 1 , 0 : f − 1 , 0 ) {\displaystyle \mathrm {MaxPool} (x|f,s)_{0,0,0}=\max(x_{0:f-1,0:f-1,0})} where 0 : f − 1 {\displaystyle 0:f-1} means the range 0 , 1 , … , f − 1 {\displaystyle 0,1,\dots ,f-1} . Note that we need to avoid the off-by-one error. The next input is M a x P o o l ( x | f , s ) 1 , 0 , 0 = max ( x s : s + f − 1 , 0 : f − 1 , 0 ) {\displaystyle \mathrm {MaxPool} (x|f,s)_{1,0,0}=\max(x_{s:s+f-1,0:f-1,0})} and so on. The receptive field of y i , j , c {\displaystyle y_{i,j,c}} is x i s + f − 1 , j s + f − 1 , c {\displaystyle x_{is+f-1,js+f-1,c}} , so in general, M a x P o o l ( x | f , s ) i , j , c = m a x ( x i s : i s + f − 1 , j s : j s + f − 1 , c ) {\displaystyle \mathrm {MaxPool} (x|f,s)_{i,j,c}=\mathrm {max} (x_{is:is+f-1,js:js+f-1,c})} If the horizontal and vertical filter size and strides differ, then in general, M a x P o o l ( x | f , s ) i , j , c = m a x ( x i s H : i s H + f H − 1 , j s W : j s W + f W − 1 , c ) {\displaystyle \mathrm {MaxPool} (x|f,s)_{i,j,c}=\mathrm {max} (x_{is_{H}:is_{H}+f_{H}-1,js_{W}:js_{W}+f_{W}-1,c})} More succinctly, we can write y k = max ( { x k ′ | k ′ in the receptive field of k } ) {\displaystyle y_{k}=\max(\{x_{k'}|k'{\text{ in the receptive field of }}k\})} . If H {\displaystyle H} is not expressible as k s + f {\displaystyle ks+f} where k {\displaystyle k} is an integer, then for computing the entries of the output tensor on the boundaries, max pooling would attempt to take as inputs variables off the tensor. In this case, how those non-existent variables are handled depends on the padding conditions, illustrated on the right. Global Max Pooling (GMP) is a specific kind of max pooling where the output tensor has shape R C {\displaystyle \mathbb {R} ^{C}} and the receptive field of y c {\displaystyle y_{c}} is all of x 0 : H , 0 : W , c {\displaystyle x_{0:H,0:W,c}} . That is, it takes the maximum over each entire channel. It is often used just before the final fully connected layers in a CNN classification head. === Average pooling === Average pooling (AvgPool) is similarly defined A v g P o o l ( x | f , s ) i , j , c = a v e r a g e ( x i s : i s + f − 1 , j s : j s + f − 1 , c ) = 1 f 2 ∑ k ∈ i s : i s + f − 1 ∑ l ∈ j s : j s + f − 1 x k , l , c {\displaystyle \mathrm {AvgPool} (x|f,s)_{i,j,c}=\mathrm {average} (x_{is:is+f-1,js:js+f-1,c})={\frac {1}{f^{2}}}\sum _{k\in is:is+f-1}\sum _{l\in js:js+f-1}x_{k,l,c}} Global Average Pooling (GAP) is defined similarly to GMP. It was first proposed in Network-in-Network. Similarly to GMP, it is often used just before the final fully connected layers in a CNN classification head. === Interpolations === There are some interpolations of max pooling and average pooling. Mixed Pooling is a linear sum of max pooling and average pooling. That is, M i x e d P o o l ( x | f , s , w ) = w M a x P o o l ( x | f , s ) + ( 1 − w ) A v g P o o l ( x | f , s ) {\displaystyle \mathrm {MixedPool} (x|f,s,w)=w\mathrm {MaxPool} (x|f,s)+(1-w)\mathrm {AvgPool} (x|f,s)} where w ∈ [ 0 , 1 ] {\displaystyle w\in [0,1]} is either a hyperparameter, a learnable parameter, or randomly sampled anew every time. Lp Pooling is similar to average pooling, but uses Lp norm average instead of average: y k = ( 1 N ∑ k ′ in the receptive field of k | x k ′ | p ) 1 / p {\displaystyle y_{k}=\left({\frac {1}{N}}\sum _{k'{\text{ in the receptive field of }}k}|x_{k'}|^{p}\right)^{1/p}} where N {\displaystyle N} is the size of receptive field, and p ≥ 1 {\displaystyle p\geq 1} is a hyperparameter. If all activations are non-negative, then average pooling is the case of p = 1 {\displaystyle p=1} , and max pooling is the case of p → ∞ {\displaystyle p\to \infty } . Square-root pooling is the case of p = 2 {\displaystyle p=2} . Stochastic pooling samples a random activation x k ′ {\displaystyle x_{k'}} from the receptive field with probability x k ′ ∑ k ″ x k ″ {\displaystyle {\frac {x_{k'}}{\sum _{k''}x_{k''}}}} . It is the same as average pooling in expectation. Softmax pooling is like max pooling, but uses softmax, i.e. ∑ k ′ e β x k ′ x k ′ ∑ k ″ e β x k ″ {\displaystyle {\frac {\sum _{k'}e^{\beta x_{k'}}x_{k'}}{\sum _{k''}e^{\beta x_{k''}}}}} where β > 0 {\displaystyle \beta >0} . Average pooling is the case of β ↓ 0 {\displaystyle \beta \downarrow 0} , and max pooling is the case of β ↑ ∞ {\displaystyle \beta \uparrow \infty } Local Importance-based Pooling generalizes softmax pooling by ∑ k ′ e g ( x k ′ ) x k ′ ∑ k ″ e g ( x k ″ ) {\displaystyle {\frac {\sum _{k'}e^{g(x_{k'})}x_{k'}}{\sum _{k''}e^{g(x_{k''})}}}} where g {\displaystyle g} is a learnable function. === Other poolings === Spatial pyramidal pooling applies max pooling (or any other form of pooling) in a pyramid structure. That is, it applies global max pooling, then applies max pooling to the image divided into 4 equal parts, then 16, etc. The results are then concatenated. It is a hierarchical form of global pooling, and similar to global pooling, it is often used just before a classification head. Region of Interest Pooling (also known as RoI pooling) is a variant of max pooling used in R-CNNs for object detection. It is designed to take an arbitrarily-sized input matrix, and output a fixed-sized output matrix. Covariance pooling computes the covariance matrix of the vectors { x k , l , 0 : C − 1 } k ∈ i s : i s + f − 1 , l ∈ j s : j s + f − 1 {\displaystyle \{x_{k,l,0:C-1}\}_{k\in is:is+f-1,l\in js:js+f-1}} which is then flattened to a C 2 {\displaystyle C^{2}} -dimensional vector y i , j , 0 : C 2 − 1 {\displaystyle y_{i,j,0:C^{2}-1}} . Global covariance pooling is used similarly to global max pooling. As average pooling computes the average, which is a first-degree statistic, and covariance is a second-degree statistic, covariance pooling is also called "second-order pooling". It can be generalized to higher-order poolings. Blur Pooling means applying a blurring method before downsampling. For example, the Rect-2 blur pooling means taking an average pooling at f = 2 , s = 1 {\displaystyle f=2,s=1} , then taking every second pixel (identity with s = 2 {\displaystyle s=2} ). == Vision Transformer pooling == In Vision Transformers (ViT), there are the following common kinds of poolings. BERT-like pooling uses a dummy [CLS] token, "classification". For classification, the output at [CLS] is the classification token, which is then processed by a LayerNorm-feedforward-softmax module into a probability distribution, which is the network's prediction of class probability distribution. This is the one used by the original ViT and Masked Autoencoder. Global average pooling (GAP) does not use the dummy token, but simply takes the average of all output tokens as the classification token. It was mentioned in the original ViT as being equally good. Multihead attention pooling (MAP) applies a multi headed attention block to pooling. Specifically, it takes as input a list of vectors x 1 , x 2 , … , x n {\displaystyle x_{1},x_{2},\dots ,x_{n}} , which might be thought of as the output vectors of a layer of a ViT. It then applies a feedforward layer F F N {\displaystyle \mathrm {FFN} } on each vector, resulting in a matrix V = [ F F N ( v 1 ) , … , F F N ( v n ) ] {\displaystyle V=[\mathrm {FFN} (v_{1}),\dots ,\mathrm {FFN} (v_{n})]} . This is then sent to a multi-headed attention, resulting in M u l t i h e a d e d A
PropBank
PropBank is a corpus that is annotated with verbal propositions and their arguments—a "proposition bank". Although "PropBank" refers to a specific corpus produced by Martha Palmer et al., the term propbank is also coming to be used as a common noun referring to any corpus that has been annotated with propositions and their arguments. The PropBank project has played a role in research in natural language processing, and has been used in semantic role labelling. == Comparison == PropBank differs from FrameNet, the resource to which it is most frequently compared, in several ways. PropBank is a verb-oriented resource, while FrameNet is centered on the more abstract notion of frames, which generalizes descriptions across similar verbs (e.g. "describe" and "characterize") as well as nouns and other words (e.g. "description"). PropBank does not annotate events or states of affairs described using nouns. PropBank commits to annotating all verbs in a corpus, whereas the FrameNet project chooses sets of example sentences from a large corpus and only in a few cases has annotated longer continuous stretches of text. PropBank-style annotations often remain close to the syntactic level, while FrameNet-style annotations are sometimes more semantically motivated. From the start, PropBank was developed with the idea of serving as training data for machine learning-based semantic role labeling systems in mind. It requires that all arguments to a verb be syntactic constituents and different senses of a word are only distinguished if the differences bear on the arguments. Due to such differences, semantic role labeling with respect to PropBank is often a somewhat easier task than producing FrameNet-style annotations.
GPT-5
GPT-5 is a multimodal large language model developed by OpenAI and the fifth in its series of generative pre-trained transformer (GPT) foundation models. Preceded in the series by GPT-4, it was launched on August 7, 2025. It is publicly accessible to users of the chatbot products ChatGPT and Microsoft Copilot as well as to developers through the OpenAI API. == Background == On April 14, 2023, Sam Altman, the chief executive officer of OpenAI, spoke at an event at the Massachusetts Institute of Technology and said that the company was not training GPT-5 at that time. He stated that OpenAI was "prioritizing GPT-4 development" and that "we are not and won't for some time" release GPT-5. On July 18, OpenAI filed for a "GPT-5" trademark in the United States. On November 13, Altman confirmed to the Financial Times that the company was working to develop GPT-5. According to The Information, "[f]or much of the second half of 2024, OpenAI was developing a model known internally as Orion and intended to become GPT-5", "[b]ut the Orion effort failed to produce a better model, and the company instead released it as GPT-4.5 in February [2025]." By late July 2025, OpenAI was widely anticipated as planning to release GPT-5 in early August. On July 30, The Verge reported that "Microsoft is getting ready for GPT-5" as "sources familiar with Microsoft's AI plans" told an editor that the company was testing a new mode for its Copilot chatbot that would offer a model that "thinks deeply or quickly based on the task". On August 5, in the leadup to the release of GPT-5, OpenAI released GPT-OSS, a set of two open-weight models that have reasoning capabilities. GPT-5 was then unveiled during a livestream event on August 7. == Capabilities == At the time of its release, GPT-5 had state-of-the-art performance on benchmarks that test mathematics, programming, finance, and multimodal understanding. According to OpenAI, improvements over its predecessor models include faster response times, better coding and writing skills, more accurate answers to health questions, and lower levels of hallucination. Also, compared to previous models, GPT-5 aims to give safe, high-level responses to potentially harmful queries rather than outright declining them, an approach that OpenAI refers to as "safe completions", aiming to result "in GPT-5 being able to refuse more unsafe questions, while offering fewer rejections to users seeking harmless information." In addition, GPT-5 was trained to give more critical, "less effusively agreeable" answers compared to its predecessor models. Days before the launch of GPT-5, two early testers of the model stated that they were "impressed" by its ability to code and to solve mathematical and scientific problems. They suggested that the model shows great improvement from GPT-4, but not as large of a gain as from GPT-3 to GPT-4. A day prior to the release of GPT-5, during a press briefing, Sam Altman, the chief executive officer of OpenAI, called GPT-5 "a significant step along the path to AGI", referring to artificial general intelligence, the hypothetical level of intelligence that OpenAI defines as the ability to perform any economically valuable task that a human can. According to Altman, GPT-5 is "significantly better" than its predecessors, offering "PhD-level" abilities across a wide range of tasks. The exact energy consumption of GPT-5 use has not been disclosed by OpenAI. Researchers at the University of Rhode Island estimated that a medium-length response consumes slightly over 18 watt-hours, equivalent to using an incandescent bulb for 18 minutes. === Architecture === GPT-5 is a system that contains a fast, high-throughput model, a deeper reasoning model, and a real-time router that decides which model to use based on conversation type, complexity, tool needs, and explicit user intent. Altman had previously criticized the manual model picker for being overly complex, suggesting a need for unification. GPT-5 also includes agentic functionality through which it can set up its own desktop and can use its browser to search autonomously for sources that relate to its task. The GPT-5 system card defines two fast, high-throughput models – gpt-5-main and gpt-5-main-mini – and two thinking models – gpt-5-thinking and gpt-5-thinking-mini. In the OpenAI API, developers can access the thinking model, its mini version, and gpt-5-thinking-nano, an even smaller and faster nano version of the thinking model. The version of GPT-5 that is accessible via the API has adjustable reasoning effort (low, medium, high, or minimal) and verbosity (low, medium, or high). Additionally, ChatGPT provides access to gpt-5-thinking with a setting that makes use of parallel test-time compute, referred to as gpt-5-thinking-pro. == Limitations == === Safety === Neuraltrust, a security research company, claimed to have successfully compromised GPT-5 within its first day of testing the model. According to its report, it enabled GPT-5 to generate detailed instructions for manufacturing explosive devices. SPLX, another company, conducted similar tests and came to similar conclusions about GPT-5's security. Their assessments suggest that GPT-5 has significant security gaps, potentially rendering it as being unsafe for use in a corporate environment. == Training == According to AIMultiple, GPT-5 is natively multimodal, meaning that it was trained from scratch on multiple modalities (like text and images) at once without relying on already-trained language or vision models. Its training process involved three stages: unsupervised pretraining, supervised fine-tuning, and reinforcement learning from human feedback. Pretraining used a large-scale multilingual dataset of books, articles, web pages, academic papers, and licensed sources. GPT-5's visual and text capabilities were described as having been developed alongside each other throughout training, unlike with GPT-4. == Use == GPT-5 is used in ChatGPT. Although GPT-5 is free for all ChatGPT users, Plus users get higher use limits while Pro users get unlimited access to GPT-5 as well as limited access to GPT-5 Pro. Standard limits for lower-tier users on responses per hour still apply. Additionally, with the introduction of GPT-5, ChatGPT's "Advanced Voice Mode" was replaced by "ChatGPT Voice", which is supposed to enable more natural-sounding conversations. OpenAI stated that "Standard Voice Mode retires on September 9, 2025, unifying all users on ChatGPT Voice". On November 24, 2025, the feature of shopping research was added to ChatGPT, claimed to be a mini model post-trained on gpt-5-thinking-mini. GPT-5 is also available in Microsoft Copilot, and Microsoft stated that it will incorporate GPT-5 into a wide variety of its products. According to 9to5Mac, Apple Inc. is planning to integrate the model into the Apple Intelligence feature in its iOS 26, iPadOS 26, and macOS Tahoe operating systems. It is also accessible via the OpenAI API. A number of American companies were reported as having received access to GPT-5 ahead of its launch. OpenAI stated that the private health insurance company Oscar Health was checking applications from its policyholders with the model. In addition, Uber was using GPT-5 for its customer support system; GitLab, Windsurf, and Cursor were using the model for software development; and the Spanish bank BBVA was using it for financial analysis. Other companies that OpenAI listed as having used GPT-5 pre-release include Amgen, Lowe's, and Notion. == Reception == === Critical reviews === Grace Huckins in MIT Technology Review found that, "[w]hereas o1 was a major technological advancement, GPT-5 is, above all else, a refined product." In response to claims that Sam Altman, the chief executive officer of OpenAI, had made about the model, she stated that "GPT-5 will furnish a more pleasant and seamless user experience. That's not nothing, but it falls far short of the transformative AI future that Altman has spent much of the past year hyping." In response to Altman's claim that GPT-5 is "a significant step along the path" to artificial general intelligence, she noted: "[M]aybe he's right—but if so, it's a very small step." In The Information, Stephanie Palazzolo praised GPT-5's coding capabilities. According to Matteo Wong in The Atlantic, GPT-5 "is intuitive, fast, and efficient; adapts to human preferences and intentions; and is easy to personalize." He stated: "At this stage of the AI boom, when every major chatbot is legitimately helpful in numerous ways, benchmarks, science, and rigor feel almost insignificant. What matters is how the chatbot feels [...]". John Herrman from the New York magazine wrote: "Casual users who encounter GPT-5 through ChatGPT aren't likely to feel like they're using a completely different product [...] while people who use it for software development or in a corporate context are more likely to notice a major change." Mashable's Christian de Looper found that "GPT-5
Argumentation framework
In artificial intelligence and related fields, an argumentation framework is a way to deal with contentious information and draw conclusions from it using formalized arguments. In an abstract argumentation framework, entry-level information is a set of abstract arguments that, for instance, represent data or a proposition. Conflicts between arguments are represented by a binary relation on the set of arguments. In concrete terms, an argumentation framework is represented with a directed graph such that the nodes are the arguments, and the arrows represent the attack relation. There exist some extensions of the Dung's framework, like the logic-based argumentation frameworks or the value-based argumentation frameworks. == Abstract argumentation frameworks == === Formal framework === Abstract argumentation frameworks, also called argumentation frameworks à la Dung, are defined formally as a pair: A set of abstract elements called arguments, denoted A {\displaystyle A} A binary relation on A {\displaystyle A} , called attack relation, denoted R {\displaystyle R} For instance, the argumentation system S = ⟨ A , R ⟩ {\displaystyle S=\langle A,R\rangle } with A = { a , b , c , d } {\displaystyle A=\{a,b,c,d\}} and R = { ( a , b ) , ( b , c ) , ( d , c ) } {\displaystyle R=\{(a,b),(b,c),(d,c)\}} contains four arguments ( a , b , c {\displaystyle a,b,c} and d {\displaystyle d} ) and three attacks ( a {\displaystyle a} attacks b {\displaystyle b} , b {\displaystyle b} attacks c {\displaystyle c} and d {\displaystyle d} attacks c {\displaystyle c} ). Dung defines some notions : an argument a ∈ A {\displaystyle a\in A} is acceptable with respect to E ⊆ A {\displaystyle E\subseteq A} if and only if E {\displaystyle E} defends a {\displaystyle a} , that is ∀ b ∈ A {\displaystyle \forall b\in A} such that ( b , a ) ∈ R , ∃ c ∈ E {\displaystyle (b,a)\in R,\exists c\in E} such that ( c , b ) ∈ R {\displaystyle (c,b)\in R} , a set of arguments E {\displaystyle E} is conflict-free if there is no attack between its arguments, formally : ∀ a , b ∈ E , ( a , b ) ∉ R {\displaystyle \forall a,b\in E,(a,b)\not \in R} , a set of arguments E {\displaystyle E} is admissible if and only if it is conflict-free and all its arguments are acceptable with respect to E {\displaystyle E} . === Different semantics of acceptance === ==== Extensions ==== To decide if an argument can be accepted or not, or if several arguments can be accepted together, Dung defines several semantics of acceptance that allows, given an argumentation system, sets of arguments (called extensions) to be computed. For instance, given S = ⟨ A , R ⟩ {\displaystyle S=\langle A,R\rangle } , E {\displaystyle E} is a complete extension of S {\displaystyle S} only if it is an admissible set and every acceptable argument with respect to E {\displaystyle E} belongs to E {\displaystyle E} , E {\displaystyle E} is a preferred extension of S {\displaystyle S} only if it is a maximal element (with respect to the set-theoretical inclusion) among the admissible sets with respect to S {\displaystyle S} , E {\displaystyle E} is a stable extension of S {\displaystyle S} only if it is a conflict-free set that attacks every argument that does not belong in E {\displaystyle E} (formally, ∀ a ∈ A ∖ E , ∃ b ∈ E {\displaystyle \forall a\in A\backslash E,\exists b\in E} such that ( b , a ) ∈ R {\displaystyle (b,a)\in R} , E {\displaystyle E} is the (unique) grounded extension of S {\displaystyle S} only if it is the smallest element (with respect to set inclusion) among the complete extensions of S {\displaystyle S} . There exists some inclusions between the sets of extensions built with these semantics : Every stable extension is preferred, Every preferred extension is complete, The grounded extension is complete, If the system is well-founded (there exists no infinite sequence a 0 , a 1 , … , a n , … {\displaystyle a_{0},a_{1},\dots ,a_{n},\dots } such that ∀ i > 0 , ( a i + 1 , a i ) ∈ R {\displaystyle \forall i>0,(a_{i+1},a_{i})\in R} ), all these semantics coincide—only one extension is grounded, stable, preferred, and complete. Some other semantics have been defined. One introduce the notation E x t σ ( S ) {\displaystyle Ext_{\sigma }(S)} to note the set of σ {\displaystyle \sigma } -extensions of the system S {\displaystyle S} . In the case of the system S {\displaystyle S} in the figure above, E x t σ ( S ) = { { a , d } } {\displaystyle Ext_{\sigma }(S)=\{\{a,d\}\}} for every Dung's semantic—the system is well-founded. That explains why the semantics coincide, and the accepted arguments are: a {\displaystyle a} and d {\displaystyle d} . ==== Labellings ==== Labellings are a more expressive way than extensions to express the acceptance of the arguments. Concretely, a labelling is a mapping that associates every argument with a label in (the argument is accepted), out (the argument is rejected), or undec (the argument is undefined—not accepted or refused). One can also note a labelling as a set of pairs ( a r g u m e n t , l a b e l ) {\displaystyle ({\mathit {argument}},{\mathit {label}})} . Such a mapping does not make sense without additional constraint. The notion of reinstatement labelling guarantees the sense of the mapping. L {\displaystyle L} is a reinstatement labelling on the system S = ⟨ A , R ⟩ {\displaystyle S=\langle A,R\rangle } if and only if : ∀ a ∈ A , L ( a ) = i n {\displaystyle \forall a\in A,L(a)={\mathit {in}}} if and only if ∀ b ∈ A {\displaystyle \forall b\in A} such that ( b , a ) ∈ R , L ( b ) = o u t {\displaystyle (b,a)\in R,L(b)={\mathit {out}}} ∀ a ∈ A , L ( a ) = o u t {\displaystyle \forall a\in A,L(a)={\mathit {out}}} if and only if ∃ b ∈ A {\displaystyle \exists b\in A} such that ( b , a ) ∈ R {\displaystyle (b,a)\in R} and L ( b ) = i n {\displaystyle L(b)={\mathit {in}}} ∀ a ∈ A , L ( a ) = u n d e c {\displaystyle \forall a\in A,L(a)={\mathit {undec}}} if and only if L ( a ) ≠ i n {\displaystyle L(a)\neq {\mathit {in}}} and L ( a ) ≠ o u t {\displaystyle L(a)\neq {\mathit {out}}} One can convert every extension into a reinstatement labelling: the arguments of the extension are in, those attacked by an argument of the extension are out, and the others are undec. Conversely, one can build an extension from a reinstatement labelling just by keeping the arguments in. Indeed, Caminada proved that the reinstatement labellings and the complete extensions can be mapped in a bijective way. Moreover, the other Datung's semantics can be associated to some particular sets of reinstatement labellings. Reinstatement labellings distinguish arguments not accepted because they are attacked by accepted arguments from undefined arguments—that is, those that are not defended cannot defend themselves. An argument is undec if it is attacked by at least another undec. If it is attacked only by arguments out, it must be in, and if it is attacked some argument in, then it is out. The unique reinstatement labelling that corresponds to the system S {\displaystyle S} above is L = { ( a , i n ) , ( b , o u t ) , ( c , o u t ) , ( d , i n ) } {\displaystyle L=\{(a,{\mathit {in}}),(b,{\mathit {out}}),(c,{\mathit {out}}),(d,{\mathit {in}})\}} . === Inference from an argumentation system === In the general case when several extensions are computed for a given semantic σ {\displaystyle \sigma } , the agent that reasons from the system can use several mechanisms to infer information: Credulous inference: the agent accepts an argument if it belongs to at least one of the σ {\displaystyle \sigma } -extensions—in which case, the agent risks accepting some arguments that are not acceptable together ( a {\displaystyle a} attacks b {\displaystyle b} , and a {\displaystyle a} and b {\displaystyle b} each belongs to an extension) Skeptical inference: the agent accepts an argument only if it belongs to every σ {\displaystyle \sigma } -extension. In this case, the agent risks deducing too little information (if the intersection of the extensions is empty or has a very small cardinal). For these two methods to infer information, one can identify the set of accepted arguments, respectively C r σ ( S ) {\displaystyle Cr_{\sigma }(S)} the set of the arguments credulously accepted under the semantic σ {\displaystyle \sigma } , and S c σ ( S ) {\displaystyle Sc_{\sigma }(S)} the set of arguments accepted skeptically under the semantic σ {\displaystyle \sigma } (the σ {\displaystyle \sigma } can be missed if there is no possible ambiguity about the semantic). Of course, when there is only one extension (for instance, when the system is well-founded), this problem is very simple: the agent accepts arguments of the unique extension and rejects others. The same reasoning can be done with labellings that correspond to the chosen semantic : an argument can be accepted if it is in for each labelling and refused if it is out for each labelling, the others being in an undecided state (the status of the arguments can remind the
Integreat
Integreat (former project name: Refguide+) is an open source mobile app that provides local information and services tailored to refugees and migrants coming to Germany. The content is maintained by local organizations, such as local governments or integration officers, and made available in locally relevant languages. It was developed by Tür an Tür - Digitalfabrik gGmbH (formerly Tür an Tür - Digital Factory gGmbH) in Augsburg together with a team of researchers and students from the Technical University of Munich. == History == In 1997, the Augsburg association "Tür an Tür", which has been working for refugees since 1992, published the brochure "First Steps", which answers local everyday questions. Since addresses and contact persons change quickly, some information is already outdated after a few weeks. Students of business informatics at the Technical University of Munich therefore developed the app Integreat within eight months together with the association and the social department of the city of Augsburg. The app was then also used by other cities and districts within months. As of February 3, 2022, information is available at 72 locations, including Munich, Dortmund, Nuremberg and Augsburg. == Mode of action == Refugees need information on areas such as registration, contact persons, health care, education, family, work and everyday life. Integreat seeks to provide refugees with this information by allowing them to select their geographic location and receive locally relevant information. This information is available offline once the app is opened so it can be used without an internet connection. In addition, the content is translated into the native languages of refugees and migrants to facilitate access. The content is licensed with a CC BY 4.0 license to facilitate collaboration and translation between content creators and dissemination of the content. Integreat is now being used for a broader migrant audience and says it can also support professionals, volunteers, and counseling centers. == Comparable mobile apps == Other mobile apps that are likewise intended to provide initial orientation for refugees include the app Ankommen, a joint project of the Federal Office for Migration and Refugees, the Goethe-Institut, the Federal Employment Agency and the Bavarian Broadcasting Corporation, which is intended as a companion for the first few weeks in Germany, and the Welcome App, a company-sponsored non-profit initiative for information about Germany and asylum procedures with a regional focus, and a book by the Konrad Adenauer Foundation (KAS) and Verlag Herder with a corresponding app Deutschland - Erste Informationen für Flüchtlinge (Germany - First Information for Refugees) as a companion for Arabic-speaking refugees in Germany.