IJCAI Award for Research Excellence

IJCAI Award for Research Excellence

The IJCAI Award for Research Excellence is a biannual award before given at the IJCAI conference to researcher in artificial intelligence as a recognition of excellence of their career. Beginning in 2016, the conference is held annually and so is the award. == Laureates == The recipients of this award have been: John McCarthy (1985) Allen Newell (1989) Marvin Minsky (1991) Raymond Reiter (1993) Herbert A. Simon (1995) Aravind Joshi (1997) Judea Pearl (1999) Donald Michie (2001) Nils Nilsson (2003) Geoffrey E. Hinton (2005) Alan Bundy (2007) Victor R. Lesser (2009) Robert Kowalski (2011) Hector Levesque (2013) Barbara Grosz (2015) for her pioneering research in Natural Language Processing and in theories and applications of Multiagent Collaboration. Michael I. Jordan (2016) for his groundbreaking and impactful research in both the theory and application of statistical machine learning. Andrew Barto (2017) for his pioneering work in the theory of reinforcement learning. Jitendra Malik (2018) Yoav Shoham (2019) Eugene Freuder (2020) Richard S. Sutton (2021) Stuart J. Russell (2022) Sarit Kraus (2023) for her pioneering work of the study of interactions among self-interested agents, creating the field of automated negotiation, and developing methods for coalition formation and teamwork, both as formal models and real-world implementations. == Winners of also Turing Award == John McCarthy (1971) Allen Newell (1975) Marvin Minsky (1969) Herbert A. Simon (1975) Judea Pearl (2011) Geoffrey Hinton (2018) Andrew Barto (2024) Richard S. Sutton (2024)

Multimodal representation learning

Multimodal representation learning is a subfield of representation learning focused on integrating and interpreting information from different modalities, such as text, images, audio, or video, by projecting them into a shared latent space. This allows for semantically similar content across modalities to be mapped to nearby points within that space, facilitating a unified understanding of diverse data types. By automatically learning meaningful features from each modality and capturing their inter-modal relationships, multimodal representation learning enables a unified representation that enhances performance in cross-media analysis tasks such as video classification, event detection, and sentiment analysis. It also supports cross-modal retrieval and translation, including image captioning, video description, and text-to-image synthesis. == Motivation == The primary motivations for multimodal representation learning arise from the inherent nature of real-world data and the limitations of unimodal approaches. Since multimodal data offers complementary and supplementary information about an object or event from different perspectives, it is more informative than relying on a single modality. A key motivation is to narrow the heterogeneity gap that exists between different modalities by projecting their features into a shared semantic subspace. This allows semantically similar content across modalities to be represented by similar vectors, facilitating the understanding of relationships and correlations between them. Multimodal representation learning aims to leverage the unique information provided by each modality to achieve a more comprehensive and accurate understanding of concepts. These unified representations are crucial for improving performance in various cross-media analysis tasks such as video classification, event detection, and sentiment analysis. They also enable cross-modal retrieval, allowing users to search and retrieve content across different modalities. Additionally, it facilitates cross-modal translation, where information can be converted from one modality to another, as seen in applications like image captioning and text-to-image synthesis. The abundance of ubiquitous multimodal data in real-world applications, including understudied areas like healthcare, finance, and human-computer interaction (HCI), further motivates the development of effective multimodal representation learning techniques. == Approaches and methods == === Canonical-correlation analysis based methods === Canonical-correlation analysis (CCA) was first introduced in 1936 by Harold Hotelling and is a fundamental approach for multimodal learning. CCA aims to find linear relationships between two sets of variables. Given two data matrices X ∈ R n × p {\displaystyle X\in \mathbb {R} ^{n\times p}} and Y ∈ R n × q {\displaystyle Y\in \mathbb {R} ^{n\times q}} representing different modalities, CCA finds projection vectors w x ∈ R p {\displaystyle w_{x}\in \mathbb {R} ^{p}} and w y ∈ R q {\displaystyle w_{y}\in \mathbb {R} ^{q}} that maximizes the correlation between the projected variables: ρ = max w x , w y w x ⊤ Σ x y w y w x ⊤ Σ x x w x w y ⊤ Σ y y w y {\displaystyle \rho =\max _{w_{x},w_{y}}{\frac {w_{x}^{\top }\Sigma _{xy}w_{y}}{{\sqrt {w_{x}^{\top }\Sigma _{xx}w_{x}}}{\sqrt {w_{y}^{\top }\Sigma _{yy}w_{y}}}}}} such that Σ x x {\displaystyle \Sigma _{xx}} and Σ y y {\displaystyle \Sigma _{yy}} are the within-modality covariance matrices, and Σ x y {\displaystyle \Sigma _{xy}} is the between-modality covariance matrix. However, standard CCA is limited by its linearity, which led to the development of nonlinear extensions, such as kernel CCA and deep CCA. ==== Kernel CCA ==== Kernel canonical correlation analysis (KCCA) extends traditional CCA to capture nonlinear relationships between modalities by implicitly mapping the data into high dimensional feature spaces using kernel functions. Given kernel functions K x {\displaystyle K_{x}} and K y {\displaystyle K_{y}} with corresponding Gram matrices K x ∈ R n × n {\displaystyle K_{x}\in \mathbb {R} ^{n\times n}} and K y ∈ R n × n {\displaystyle K_{y}\in \mathbb {R} ^{n\times n}} , KCCA seeks coefficients α {\displaystyle \alpha } and β {\displaystyle \beta } that maximize: ρ = max α , β α ⊤ K x K y β α ⊤ K x 2 α β ⊤ K y 2 β {\displaystyle \rho =\max _{\alpha ,\beta }{\frac {\alpha ^{\top }K_{x}Ky\beta }{{\sqrt {\alpha ^{\top }K_{x}^{2}\alpha }}{\sqrt {\beta ^{\top }K_{y}^{2}\beta }}}}} To prevent overfitting, regularization terms are typically added, resulting in: ρ = max α , β α T K x K y β α T ( K x 2 + λ x K x ) α β T ( K y 2 + λ y K y ) β {\displaystyle \rho =\max _{\alpha ,\beta }{\frac {\alpha ^{T}K_{x}K_{y}\beta }{{\sqrt {\alpha ^{T}\left(K_{x}^{2}+\lambda _{x}K_{x}\right)\alpha }}{\sqrt {\;\beta ^{T}\left(K_{y}^{2}+\lambda _{y}K_{y}\right)\beta }}}}} where λ x {\displaystyle \lambda _{x}} and λ y {\displaystyle \lambda _{y}} are regularization parameters. KCCA has proven effective for tasks such as cross-modal retrieval and semantic analysis, though it faces computational challenges with large datasets due to its O ( n 2 ) {\displaystyle O(n^{2})} memory requirement for sorting kernel matrices. KCCA was proposed independently by several researchers. ==== Deep CCA ==== Deep canonical correlation analysis (DCCA), introduced in 2013, employs neural networks to learn nonlinear transformations for maximizing the correlation between modalities. DCCA uses separate neural networks f x {\displaystyle f_{x}} and f y {\displaystyle f_{y}} for each modality to transform the original data before applying CCA: max W x , W y , θ x , θ y corr ⁡ ( f x ( X ; θ x ) , f y ( Y ; θ y ) ) {\displaystyle \max _{W_{x},W_{y},\theta _{x},\theta _{y}}\operatorname {corr} \left(f_{x}(X;\theta _{x}),f_{y}(Y;\theta _{y})\right)} where θ x {\displaystyle \theta _{x}} and θ y {\displaystyle \theta _{y}} represent the parameters of the neural networks, and W x {\displaystyle W_{x}} and W y {\displaystyle W_{y}} are the CCA projection matrices. The correlation objective is computed as: corr ⁡ ( H x , H y ) = tr ⁡ ( T − 1 / 2 H x T H y S − 1 / 2 ) {\displaystyle \operatorname {corr} (H_{x},H_{y})=\operatorname {tr} \left(T^{-1/2}H_{x}^{T}H_{y}S^{-1/2}\right)} where H x = f x ( X ) {\displaystyle H_{x}=f_{x}(X)} and H y = f y ( Y ) {\displaystyle H_{y}=f_{y}(Y)} are the network outputs, T = H x T H x + r x I {\displaystyle T=H_{x}^{T}H_{x}+r_{x}I} , S = H y T H y + r y I {\displaystyle S=H_{y}^{T}H_{y}+r_{y}I} and r x , r y {\displaystyle r_{x},r_{y}} are the regularization parameters. DCCA overcomes the limitations of linear CCA and kernel CCA by learning complex nonlinear relationships while maintaining computational efficiency for large datasets through mini-batch optimization. === Graph-based methods === Graph-based approaches for multimodal representation learning leverage graph structure to model relationships between entities across different modalities. These methods typically represent each modality as a graph and then learn embedding that preserve cross-modal similarities, enabling more effective joint representation of heterogeneous data. One such method is cross-modal graph neural networks (CMGNNs) that extend traditional graph neural networks (GNNs) to handle data from multiple modalities by constructing graphs that capture both intra-modal and inter-modal relationships. These networks model interactions across modalities by representing them as nodes and their relationships as edges. Other graph-based methods include Probabilistic Graphical Models (PGMs) such as deep belief networks (DBN) and deep Boltzmann machines (DBM). These models can learn a joint representation across modalities, for instance, a multimodal DBN achieves this by adding a shared restricted Boltzmann Machine (RBM) hidden layer on top of modality-specific DBNs. Additionally, the structure of data in some domains like Human-Computer Interaction (HCI), such as the view hierarchy of app screens, can potentially be modeled using graph-like structures. The field of graph representation learning is also relevant, with ongoing progress in developing evaluation benchmarks. === Diffusion maps === Another set of methods relevant to multimodal representation learning are based on diffusion maps and their extensions to handle multiple modalities. ==== Multi-view diffusion maps ==== Multi-view diffusion maps address the challenge of achieving multi-view dimensionality reduction by effectively utilizing the availability of multiple views to extract a coherent low-dimensional representation of the data. The core idea is to exploit both the intrinsic relations within each view and the mutual relations between the different views, defining a cross-view model where a random walk process implicitly hops between objects in different views. A multi-view kernel matrix is constructed by combining these relations, defining a cross-view diffusion process and associ

Pedagogical agent

A pedagogical agent is a concept borrowed from computer science and artificial intelligence and applied to education, usually as part of an intelligent tutoring system (ITS). It is a simulated human-like interface between the learner and the content, in an educational environment. A pedagogical agent is designed to model the type of interactions between a student and another person. Mabanza and de Wet define it as "a character enacted by a computer that interacts with the user in a socially engaging manner". A pedagogical agent can be assigned different roles in the learning environment, such as tutor or co-learner, depending on the desired purpose of the agent. "A tutor agent plays the role of a teacher, while a co-learner agent plays the role of a learning companion". == History == The history of Pedagogical Agents is closely aligned with the history of computer animation. As computer animation progressed, it was adopted by educators to enhance computerized learning by including a lifelike interface between the program and the learner. The first versions of a pedagogical agent were more cartoon than person, like Microsoft's Clippy which helped users of Microsoft Office load and use the program's features in 1997. However, with developments in computer animation, pedagogical agents can now look lifelike. By 2006 there was a call to develop modular, reusable agents to decrease the time and expertise required to create a pedagogical agent. There was also a call in 2009 to enact agent standards. The standardization and re-usability of pedagogical agents is less of an issue since the decrease in cost and widespread availability of animation tools. Individualized pedagogical agents can be found across disciplines including medicine, math, law, language learning, automotive, and armed forces. They are used in applications directed to every age, from preschool to adult. == Learning theories related to pedagogical agent design == === Distributed cognition theory === Distributed cognition theory is the method in which cognition progresses in the context of collaboration with others. Pedagogical agents can be designed to assist the cognitive transfer to the learner, operating as artifacts or partners with collaborative role in learning. To support the performance of an action by the user, the pedagogical agent can act as a cognitive tool as long as the agent is equipped with the knowledge that the user lacks. The interactions between the user and the pedagogical agent can facilitate a social relationship. The pedagogical agent may fulfill the role of a working partner. === Socio-cultural learning theory === Socio-cultural learning theory is how the user develops when they are involved in learning activities in which there is interaction with other agents. A pedagogical agent can: intervene when the user requests, provide support for tasks that the user cannot address, and potentially extend the learners cognitive reach. Interaction with the pedagogical agent may elicit a variety of emotions from the learner. The learner may become excited, confused, frustrated, and/or discouraged. These emotions affect the learners' motivation. === Extraneous Cognitive Load === Extraneous cognitive load is the extra effort being exerted by an individual's working memory due to the way information is being presented. A pedagogical agent can increase the user's cognitive load by distracting them and becoming the focus of their attention, causing split attention between the instructional material and the agent. Agents can reduce the perceived cognitive load by providing narration and personalization that can also promote a user's interest and motivation. While research on the reduction of cognitive load from pedagogical agents is minimal, more studies have shown that agents do not increase it. == Effectiveness == It has been suggested by researchers that pedagogical agents may take on different roles in the learning environment. Examples of these roles are: supplanting, scaffolding, coaching, testing, or demonstrating or modelling a procedure. A pedagogical agent as a tutor has not been demonstrated to add any benefit to an educational strategy in equivalent lessons with and without a pedagogical agent. According to Richard Mayer, there is some support in research for pedagogical agent increasing learning, but only as a presenter of social cues. A co-learner pedagogical agent is believed to increase the student's self-efficacy. By pointing out important features of instructional content, a pedagogical agent can fulfill the signaling function, which research on multimedia learning has shown to enhance learning. Research has demonstrated that human-human interaction may not be completely replaced by pedagogical agents, but learners may prefer the agents to non-agent multimedia systems. This finding is supported by social agency theory. Much like the varying effectiveness of the pedagogical agent roles in the learning environment, agents that take into account the user's affect have had mixed results. Research has shown pedagogical agents that make use of the users’ affect have been found to increase user knowledge retention, motivation, and perceived self-efficacy. However, with such a broad range of modalities in affective expressions, it is often difficult to utilize them. Additionally, having agents detect a user's affective state with precision remains challenging, as displays of affect are different across individuals. == Design == === Attractiveness === The appearance of a pedagogical agent can be manipulated to meet the learning requirements. The attractiveness of a pedagogical agent can enhance student's learning when the users were the opposite gender of the pedagogical agent. Male students prefer a sexy appearance of a female pedagogical agents and dislike the sexy appearance of male agents. Female students were not attracted by the sexy appearance of either male or female pedagogical agents. === Affective Response === Pedagogical agents have reached a point where they can convey and elicit emotion, but also reason about and respond to it. These agents are often designed to elicit and respond to affective actions from users through various modalities such as speech, facial expressions, and body gestures. They respond to the affective state of the given user, and make use of these modalities using a wide array of sensors incorporated into the design of the agent. Specifically in education and training applications, pedagogical agents are often designed to increasingly recognize when users or learners exhibit frustration, boredom, confusion, and states of flow. The added recognition in these agents is a step toward making them more emotionally intelligent, comforting and motivating the users as they interact. === Digital Representation === The design of a pedagogical agent often begins with its digital representation, whether it will be 2D or 3D and static or animated. Several studies have developed pedagogical agents that were both static and animated, then evaluated the relative benefits. Similar to other design considerations, the improved learning from static or animated agents remains questionable. One study showed that the appearance of an agent portrayed using a static image can impact a user's recall, based on the visual appearance. Other research found results that suggest static agent images improve learning outcomes. However, several other studies found user's learned more when the pedagogical agent was animated rather than static. Recently a meta-analysis of such research found a negligible improvement in learning via pedagogical agents, suggesting more work needs to be done in the area to support any claims.

Labeled data

Labeled data is a group of samples that have been tagged with one or more labels. Labeling typically takes a set of unlabeled data and augments each piece of it with informative tags called judgments. For example, a data label might indicate whether a photo contains a horse or a cow, which words were uttered in an audio recording, what type of action is being performed in a video, what the topic of a news article is, what the overall sentiment of a tweet is, or whether a dot in an X-ray is a tumor. Labels can be obtained by having humans make judgments about a given piece of unlabeled data. Labeled data is significantly more expensive to obtain than the raw unlabeled data. The quality of labeled data directly influences the performance of supervised machine learning models in operation, as these models learn from the provided labels. == Crowdsourced labeled data == In 2006, Fei-Fei Li, the co-director of the Stanford Human-Centered AI Institute, initiated research to improve the artificial intelligence models and algorithms for image recognition by significantly enlarging the training data. The researchers downloaded millions of images from the World Wide Web and a team of undergraduates started to apply labels for objects to each image. In 2007, Li outsourced the data labeling work on Amazon Mechanical Turk, an online marketplace for digital piece work. The 3.2 million images that were labeled by more than 49,000 workers formed the basis for ImageNet, one of the largest hand-labeled database for outline of object recognition. == Automated data labelling == After obtaining a labeled dataset, machine learning models can be applied to the data so that new unlabeled data can be presented to the model and a likely label can be guessed or predicted for that piece of unlabeled data. == Challenges == === Data-driven bias === Algorithmic decision-making is subject to programmer-driven bias as well as data-driven bias. Training data that relies on bias labeled data will result in prejudices and omissions in a predictive model, despite the machine learning algorithm being legitimate. The labeled data used to train a specific machine learning algorithm needs to be a statistically representative sample to not bias the results. For example, in facial recognition systems underrepresented groups are subsequently often misclassified if the labeled data available to train has not been representative of the population,. In 2018, a study by Joy Buolamwini and Timnit Gebru demonstrated that two facial analysis datasets that have been used to train facial recognition algorithms, IJB-A and Adience, are composed of 79.6% and 86.2% lighter skinned humans respectively. === Human error and inconsistency === Human annotators are prone to errors and biases when labeling data. This can lead to inconsistent labels and affect the quality of the data set. The inconsistency can affect the machine learning model's ability to generalize well. === Domain expertise === Certain fields, such as legal document analysis or medical imaging, require annotators with specialized domain knowledge. Without the expertise, the annotations or labeled data may be inaccurate, negatively impacting the machine learning model's performance in a real-world scenario.

TensorFlow Hub

TensorFlow Hub (also styled TF Hub) is an open-source machine learning library and online repository that provides TensorFlow model components, called modules. It is maintained by Google as part of the TensorFlow ecosystem and allows developers to discover, publish, and reuse pretrained models for tasks such as computer vision, natural language processing, and transfer learning. == Overview == TensorFlow Hub provides a central platform where developers and researchers can access pre-trained models and integrate them directly into TensorFlow workflows. Each module encapsulates a computation graph and its trained weights, with standardized input and output signatures. Modules can be loaded using the hub.load() function or through Keras integration via hub.KerasLayer, enabling users to perform transfer learning or feature extraction. == History == TensorFlow Hub was announced by Google in March 2018, with the first public version released shortly after. Its introduction coincided with the growing adoption of transfer learning techniques and the need for standardized model packaging. Over time, the hub expanded to include models such as the BERT family, MobileNet, EfficientNet, and the Universal Sentence Encoder. In 2020, research on “Regret selection in TensorFlow Hub” explored the problem of identifying optimal models for downstream tasks given a large repository of alternatives. == Applications == TensorFlow Hub hosts a variety of models across machine learning domains: Natural language processing: BERT, ALBERT language model, and Universal Sentence Encoder. Computer vision: ResNet, Inception (deep learning), MobileNet, EfficientNet. Speech and audio: spectrogram feature extractors and automatic speech recognition models. Multilingual embeddings: cross-lingual and sentence-level representations for machine translation and semantic similarity. Modules are widely used in education, academic research, and industry for prototyping and production deployment.

Production (computer science)

In computer science, a production or production rule is a rewrite rule that replaces some symbols with other symbols. A finite set of productions P {\displaystyle P} is the main component in the specification of a formal grammar (specifically a generative grammar). In such grammars, a set of productions is a special case of relation on the set of strings V ∗ {\displaystyle V^{}} (where ∗ {\displaystyle {}^{}} is the Kleene star operator) over a finite set of symbols V {\displaystyle V} called a vocabulary that defines which non-empty strings can be substituted with others. The set of productions is thus a special kind subset P ⊂ V ∗ × V ∗ {\displaystyle P\subset V^{}\times V^{}} and productions are then written in the form u → v {\displaystyle u\to v} to mean that ( u , v ) ∈ P {\displaystyle (u,v)\in P} (not to be confused with → {\displaystyle \to } being used as function notation, since there may be multiple rules for the same u {\displaystyle u} ). Given two subsets A , B ⊂ V ∗ {\displaystyle A,B\subset V^{}} , productions can be restricted to satisfy P ⊂ A × B {\displaystyle P\subset A\times B} , in which case productions are said "to be of the form A → B {\displaystyle A\to B} . Different choices and constructions of A , B {\displaystyle A,B} lead to different types of grammars. In general, any production of the form u → ϵ , {\displaystyle u\to \epsilon ,} where ϵ {\displaystyle \epsilon } is the empty string (sometimes also denoted λ {\displaystyle \lambda } ), is called an erasing rule, while productions that would produce strings out of nowhere, namely of the form ϵ → v , {\displaystyle \epsilon \to v,} are never allowed. In order to allow the production rules to create meaningful sentences, the vocabulary is partitioned into (disjoint) sets Σ {\displaystyle \Sigma } and N {\displaystyle N} providing two different roles: Σ {\displaystyle \Sigma } denotes the terminal symbols known as an alphabet containing the symbols allowed in a sentence; N {\displaystyle N} denotes nonterminal symbols, containing a distinguished start symbol S ∈ N {\displaystyle S\in N} , that are needed together with the production rules to define how to build the sentences. In the most general case of an unrestricted grammar, a production u → v {\displaystyle u\to v} , is allowed to map arbitrary strings u {\displaystyle u} and v {\displaystyle v} in V {\displaystyle V} (terminals and nonterminals), as long as u {\displaystyle u} is not empty. So unrestricted grammars have productions of the form V ∗ ∖ { ϵ } → V ∗ {\displaystyle V^{}\setminus \{\epsilon \}\to V^{}} or if we want to disallow changing finished sentences V ∗ N V ∗ = ( V ∗ ∖ Σ ∗ ) → V ∗ {\displaystyle V^{}NV^{}=(V^{}\setminus \Sigma ^{})\to V^{}} , where V ∗ N V ∗ {\displaystyle V^{}NV^{}} indicates concatenation and forces a non-terminal symbol to always be present on the left-hand side of the productions, and ∖ {\displaystyle \setminus } denotes set minus or set difference. If we do not allow the start symbol to occur in v {\displaystyle v} (the word on the right side), we have to replace V ∗ {\displaystyle V^{}} with ( V ∖ { S } ) ∗ {\displaystyle (V\setminus \{S\})^{}} on the right-hand side. The other types of formal grammar in the Chomsky hierarchy impose additional restrictions on what constitutes a production. Notably in a context-free grammar, the left-hand side of a production must be a single nonterminal symbol. So productions are of the form: N → V ∗ {\displaystyle N\to V^{}} == Grammar generation == To generate a string in the language, one begins with a string consisting of only a single start symbol, and then successively applies the rules (any number of times, in any order) to rewrite this string. This stops when a string containing only terminals is obtained. The language consists of all the strings that can be generated in this manner. Any particular sequence of legal choices taken during this rewriting process yields one particular string in the language. If there are multiple different ways of generating this single string, then the grammar is said to be ambiguous. For example, assume the alphabet consists of a {\displaystyle a} and b {\displaystyle b} , with the start symbol S {\displaystyle S} , and we have the following rules: 1. S → a S b {\displaystyle S\rightarrow aSb} 2. S → b a {\displaystyle S\rightarrow ba} then we start with S {\displaystyle S} , and can choose a rule to apply to it. If we choose rule 1, we replace S {\displaystyle S} with a S b {\displaystyle aSb} and obtain the string a S b {\displaystyle aSb} . If we choose rule 1 again, we replace S {\displaystyle S} with a S b {\displaystyle aSb} and obtain the string a a S b b {\displaystyle aaSbb} . This process is repeated until we only have symbols from the alphabet (i.e., a {\displaystyle a} and b {\displaystyle b} ). If we now choose rule 2, we replace S {\displaystyle S} with b a {\displaystyle ba} and obtain the string a a b a b b {\displaystyle aababb} , and are done. We can write this series of choices more briefly, using symbols: S ⇒ a S b ⇒ a a S b b ⇒ a a b a b b {\displaystyle S\Rightarrow aSb\Rightarrow aaSbb\Rightarrow aababb} . The language of the grammar is the set of all the strings that can be generated using this process: { b a , a b a b , a a b a b b , a a a b a b b b , … } {\displaystyle \{ba,abab,aababb,aaababbb,\dotsc \}} .

Intelligent agent

In artificial intelligence, an intelligent agent is an entity that perceives its environment, takes actions autonomously to achieve goals, and may improve its performance through machine learning or by acquiring knowledge. AI textbooks define artificial intelligence as the "study and design of intelligent agents," emphasizing that goal-directed behavior is central to intelligence. A specialized subset of intelligent agents, agentic AI (also known as an AI agent or simply agent), expands this concept by proactively pursuing goals, making decisions, and taking actions over extended periods. Intelligent agents can range from simple to highly complex. A basic thermostat or control system is considered an intelligent agent, as is a human being, or any other system that meets the same criteria—such as a firm, a state, or a biome. Intelligent agents operate based on an objective function, which encapsulates their goals. They are designed to create and execute plans that maximize the expected value of this function upon completion. For example, a reinforcement learning agent has a reward function, which allows programmers to shape its desired behavior. Similarly, an evolutionary algorithm's behavior is guided by a fitness function. Intelligent agents in artificial intelligence are closely related to agents in economics, and versions of the intelligent agent paradigm are studied in cognitive science, ethics, and the philosophy of practical reason, as well as in many interdisciplinary socio-cognitive modeling and computer social simulations. Intelligent agents are often described schematically as abstract functional systems similar to computer programs . To distinguish theoretical models from real-world implementations, abstract descriptions of intelligent agents are called abstract intelligent agents. Intelligent agents are also closely related to software agents—autonomous computer programs that carry out tasks on behalf of users. They are also referred to using a term borrowed from economics: a "rational agent". == Intelligent agents as the foundation of AI == The concept of intelligent agents provides a foundational lens through which to define and understand artificial intelligence. For instance, the influential textbook Artificial Intelligence: A Modern Approach (Russell & Norvig) describes: Agent: Anything that perceives its environment (using sensors) and acts upon it (using actuators). E.g., a robot with cameras and wheels, or a software program that reads data and makes recommendations. Rational Agent: An agent that strives to achieve the best possible outcome based on its knowledge and past experiences. "Best" is defined by a performance measure – a way of evaluating how well the agent is doing. Artificial Intelligence (as a field): The study and creation of these rational agents. Other researchers and definitions build upon this foundation. Padgham & Winikoff emphasize that intelligent agents should react to changes in their environment in a timely way, proactively pursue goals, and be flexible and robust (able to handle unexpected situations). Some also suggest that ideal agents should be "rational" in the economic sense (making optimal choices) and capable of complex reasoning, like having beliefs, desires, and intentions (BDI model). Kaplan and Haenlein offer a similar definition, focusing on a system's ability to understand external data, learn from that data, and use what is learned to achieve goals through flexible adaptation. Defining AI in terms of intelligent agents offers several key advantages: Avoids Philosophical Debates: It sidesteps arguments about whether AI is "truly" intelligent or conscious, like those raised by the Turing test or Searle's Chinese Room. It focuses on behavior and goal achievement, not on replicating human thought. Objective Testing: It provides a clear, scientific way to evaluate AI systems. Researchers can compare different approaches by measuring how well they maximize a specific "goal function" (or objective function). This allows for direct comparison and combination of techniques. Interdisciplinary Communication: It creates a common language for AI researchers to collaborate with other fields like mathematical optimization and economics, which also use concepts like "goals" and "rational agents." == Objective function == An objective function (or goal function) specifies the goals of an intelligent agent. An agent is deemed more intelligent if it consistently selects actions that yield outcomes better aligned with its objective function. In effect, the objective function serves as a measure of success. The objective function may be: Simple: For example, in a game of Go, the objective function might assign a value of 1 for a win and 0 for a loss. Complex: It might require the agent to evaluate and learn from past actions, adapting its behavior based on patterns that have proven effective. The objective function encapsulates all of the goals the agent is designed to achieve. For rational agents, it also incorporates the trade-offs between potentially conflicting goals. For instance, a self-driving car's objective function might balance factors such as safety, speed, and passenger comfort. Different terms are used to describe this concept, depending on the context. These include: Utility function: Often used in economics and decision theory, representing the desirability of a state. Objective function: A general term used in optimization. Loss function: Typically used in machine learning, where the goal is to minimize the loss (error). Reward Function: Used in reinforcement learning. Fitness Function: Used in evolutionary systems. Goals, and therefore the objective function, can be: Explicitly defined: Programmed directly into the agent. Induced: Learned or evolved over time. In reinforcement learning, a "reward function" provides feedback, encouraging desired behaviors and discouraging undesirable ones. The agent learns to maximize its cumulative reward. In evolutionary systems, a "fitness function" determines which agents are more likely to reproduce. This is analogous to natural selection, where organisms evolve to maximize their chances of survival and reproduction. Some AI systems, such as nearest-neighbor, reason by analogy rather than being explicitly goal-driven. However, even these systems can have goals implicitly defined within their training data. Such systems can still be benchmarked by framing the non-goal system as one whose "goal" is to accomplish its narrow classification task. Systems not traditionally considered agents, like knowledge-representation systems, are sometimes included in the paradigm by framing them as agents with a goal of, for example, answering questions accurately. Here, the concept of an "action" is extended to encompass the "act" of providing an answer. As a further extension, mimicry-driven systems can be framed as agents optimizing a "goal function" based on how closely the agent mimics the desired behavior. In generative adversarial networks (GANs) of the 2010s, an "encoder"/"generator" component attempts to mimic and improvise human text composition. The generator tries to maximize a function representing how well it can fool an antagonistic "predictor"/"discriminator" component. While symbolic AI systems often use an explicit goal function, the paradigm also applies to neural networks and evolutionary computing. Reinforcement learning can generate intelligent agents that appear to act in ways intended to maximize a "reward function". Sometimes, instead of setting the reward function directly equal to the desired benchmark evaluation function, machine learning programmers use reward shaping to initially give the machine rewards for incremental progress. Yann LeCun stated in 2018, "Most of the learning algorithms that people have come up with essentially consist of minimizing some objective function." AlphaZero chess had a simple objective function: +1 point for each win, and -1 point for each loss. A self-driving car's objective function would be more complex. Evolutionary computing can evolve intelligent agents that appear to act in ways intended to maximize a "fitness function" influencing how many descendants each agent is allowed to leave. The mathematical formalism of AIXI was proposed as a maximally intelligent agent in this paradigm. However, AIXI is uncomputable. In the real world, an intelligent agent is constrained by finite time and hardware resources, and scientists compete to produce algorithms that achieve progressively higher scores on benchmark tests with existing hardware. == Agent function == An intelligent agent's behavior can be described mathematically by an agent function. This function determines what the agent does based on what it has seen. A percept refers to the agent's sensory inputs at a single point in time. For example, a self-driving car's percepts might include camera images, lidar data, GPS coordinates, and speed r