Owain Evans

Owain Evans

Owain Rhys Evans is a British artificial intelligence researcher who works on AI alignment and machine learning safety. He founded Truthful AI, a research group based in Berkeley, California, and is an affiliate of the Center for Human Compatible AI (CHAI) at the University of California, Berkeley. His research addresses AI truthfulness, emergent behaviors in large language models, and the alignment of AI systems with human values. == Education == Evans earned a Bachelor of Arts in philosophy and mathematics from Columbia University in 2008 and a PhD in philosophy from the Massachusetts Institute of Technology in 2015. His doctoral research focused on Bayesian computational models of human preferences and decision-making. == Career == After completing his doctorate, Evans held positions at the Future of Humanity Institute (FHI) at the University of Oxford, first as a postdoctoral research fellow and later as a research scientist. While at FHI, he co-authored a survey of machine learning researchers on timelines for human-level AI, published in the Journal of Artificial Intelligence Research. The survey was reported on by Newsweek, New Scientist, the BBC, and The Economist. He was also among the co-authors of a 2018 report on the potential for misuse of AI technologies, published by researchers at Oxford, Cambridge, and other institutions. Since 2022, Evans has been based in Berkeley, where he founded Truthful AI, a non-profit research group that studies AI truthfulness, deception, and emergent behaviors in large language models. == Research == Evans's early work examined challenges in inverse reinforcement learning when human behavior is irrational or biased, proposing methods for AI systems to infer preferences from imperfect human demonstrations. He co-developed TruthfulQA (2021), a benchmark that tests whether language models give truthful answers rather than repeating common misconceptions. Initial evaluations found that larger models were not more truthful, suggesting that scaling alone does not improve factual accuracy. The benchmark has since been used by AI developers to evaluate large language models. He also co-authored a paper proposing design and governance strategies for building AI systems that do not deceive or hallucinate. In 2023, Evans and collaborators described the "reversal curse", showing that language models trained on a fact in one direction (e.g. "A is B") often cannot answer the corresponding reverse query ("B is A"). His group also developed a benchmark for evaluating situational awareness in language models. In 2025, Evans and colleagues published a study in Nature on what they termed "emergent misalignment": fine-tuning a language model on a narrow task (writing insecure code) caused it to produce unrelated harmful outputs without explicit instruction to do so. Later that year, Evans and collaborators (including researchers at Anthropic) reported that hidden behavioral traits can transfer between language models through training data, even when those traits are not explicitly present in the data, a phenomenon they called "subliminal learning". == Public engagement == In November 2025, Evans delivered the Hinton Lectures, a keynote lecture series on AI safety co-founded by Geoffrey Hinton and the Global Risk Institute.

Meta-Labeling

Meta-labeling, also known as corrective AI, is a machine learning (ML) technique utilized in quantitative finance to enhance the performance of investment and trading strategies, developed in 2017 by Marcos López de Prado at Guggenheim Partners and Cornell University. The core idea is to separate the decision of trade direction (side) from the decision of trade sizing, addressing the inefficiencies of simultaneously learning both side and size predictions. The side decision involves forecasting market movements (long, short, neutral), while the size decision focuses on risk management and profitability. It serves as a secondary decision-making layer that evaluates the signals generated by a primary predictive model. By assessing the confidence and likely profitability of those signals, meta-labeling allows investors and algorithms to dynamically size positions and suppress false positives. == Motivation == Meta-labeling is designed to improve precision without sacrificing recall. As noted by López de Prado, attempting to model both the direction and the magnitude of a trade using a single algorithm can result in poor generalization. By separating these tasks, meta-labeling enables greater flexibility and robustness: Enhances control over capital allocation. Reduces overfitting by limiting model complexity. Allows the use of interpretability tools and tailored thresholds to manage risk. Enables dynamic trade suppression in unfavorable regimes. == Applications == Meta-labeling has been applied in a variety of financial ML contexts, including: Algorithmic trading: Filtering and sizing trades to reduce false positives. Portfolio optimization: Scaling exposure across multiple signals with differing confidence levels. Risk management: Dynamically disabling strategies in adverse market conditions. Model validation: Interpreting when and why a model may be underperforming due to regime shifts. == General architecture == Meta-labeling decouples two core components of systematic trading strategies: directional prediction and position sizing. The process involves training a primary model to generate trade signals (e.g., buy, sell, or hold) and then training a secondary model to determine whether each signal is likely to lead to a profitable trade. The second model outputs a probability that is interpreted as the confidence in the forecast, which can be used to adjust the position size or to filter out unreliable trades. Meta-labeling is typically implemented as a three-stage process: Primary model (M1): Predicts the direction or label of a financial outcome using features such as market prices, returns, or volatility indicators. A typical output is directional, e.g., Y ∈ {−1,0,1}, representing short, neutral, or long positions. Secondary model (M2): A binary classifier trained to predict whether the primary model's prediction will be profitable. The target variable is a binary meta-label F ∈ { 0 , 1 } {\displaystyle F\in \{0,1\}} . Inputs can include features used in the primary model, performance diagnostics, or market regime data. Position sizing algorithm (M3): Translates the output probability of the secondary model into a position size. Higher confidence scores result in larger allocations, while lower confidence leads to reduced or zero exposure. === Stage 1: Forecasting side === Primary model architecture Figure 1 Figure 1 presents the architecture of a primary model. It focuses on forecasting the side of the trade. Following the example, this model (M1) takes in input data – such as open-high-low-close data and determines the side of the position to take: a negative number is a short position, and positive number is a long position, the range is set between −1 and 1 (the closer it is to −1 or 1, the stronger the models conviction is). When training the model, the labels are −1 and 1, based on the direction of forward returns for some predefined investment horizon. The researcher may decide to apply a recall check (τ: "Tau") by setting a minimum threshold that the initial output needs to be to qualify of a short or long position (if the threshold is not met, no side forecast is predicted, leading to closing of any open positions), this leads to the primary model output which is one of three possible side forecasts: −1, 0, or 1. The primary model also generates evaluation data which can be used by the secondary model, to improve performance of size forecasts. Some examples of evaluation data include rolling accuracy, F1, recall, precision, and AUC scores. === Stage 2: Filtering out false positives === General meta-labeling architecture Figure 2 Next comes the phase of filtering out false positives, by applying a secondary machine learning model (M2), which is a binary classifier trained to determine if the trade will be profitable or not. The model takes as input four general groupings of data: General input data which is predictive of a false positive. For example the last 30 days rolling volatility of the underlying asset. Evaluation data. Market state and regime data, one may find that macro economic data or clustering the market into regimes may help as specific trading strategies are known to perform better in particular regimes. Example: momentum based strategies perform best in periods with low volatility and strong directional moves. Primary models initial input which is a value between −1 and 1. This highlights the strength of the primary models conviction. The output of the model is a value between −1 and 1 (if using a Tanh function) which will indicate the strength of the conviction that a short or long position is profitable, or it could simply be between 0 and 1 (using a sigmoid function) if one only wanted to know if it made money or not. This output allows filtering out trades that are likely to lead to losses. One could stop at this point or use the outputs of the secondary model as inputs to a position sizing algorithm (M3) which could further enhance strategy performance metrics by translating the output probability of the secondary model into a position size. Higher confidence scores result in larger allocations, while lower confidence leads to reduced or zero exposure. === Stage 3: Optimizing position sizes === ==== Position sizing methods (M3) ==== Various algorithms have been proposed for transforming predicted probabilities into trade sizes: All-or-nothing: Allocate 100% of capital if the probability exceeds a predefined threshold (e.g., 0.5); otherwise, do not trade. Model confidence: Use the probability score directly as the fraction of capital allocated. Linear scaling: Rescale the model's probabilities using min-max normalization based on the training data. Normal CDF (NCDF): Use a normal cumulative distribution function applied to a z-statistic derived from the predicted probability. Empirical CDF (ECDF): Rank probabilities based on their percentile in the training data to ensure relative allocation. Sigmoid Optimal Position Sizing (SOPS): Applies a smooth non-linear sigmoid transformation optimized to maximize risk-adjusted returns (Sharpe ratio). ==== Model calibration ==== Each machine learning algorithm used in meta-labeling tends to produce outputs with different characteristic distributions; for example, some are approximately normally distributed, whereas others exhibit a pronounced U-shape, concentrating probabilities near the extremes. Due to these varying distributions, simply summing the outputs of different models can inadvertently lead to uneven weighting of signals, biasing trade decisions. To address this, model calibration techniques are essential to adjust the predicted probabilities towards frequentist probabilities, ensuring that model outputs reflect true likelihoods more accurately. Two common calibration techniques are: Platt scaling (Sigmoid scaling): Suitable for correcting S-shaped calibration plots typically produced by models such as support vector machines (SVMs). Isotonic regression: Fits a non-decreasing step function to probabilities and is effective particularly with larger datasets, though it can sometimes lead to overfitting. Transforming predictions to frequentist probabilities is crucial as it provides probabilistic outputs that are directly interpretable as the actual likelihood of an event occurring. Such calibration significantly enhances the effectiveness of fixed position sizing methods, reducing maximum drawdowns and increasing risk-adjusted returns. However, calibration has less impact on position sizing methods that directly estimate parameters from the training data, such as ECDF and SOPS, suggesting that calibration is a critical step mainly for fixed methods that rely heavily on raw model outputs. =

Exploration–exploitation dilemma

The exploration–exploitation dilemma, also known as the explore–exploit tradeoff, is a fundamental concept in decision-making that arises in many domains. It is depicted as the balancing act between two opposing strategies. Exploitation involves choosing the best option based on current knowledge of the system (which may be incomplete or misleading), while exploration involves trying out new options that may lead to better outcomes in the future at the expense of an exploitation opportunity. Finding the optimal balance between these two strategies is a crucial challenge in many decision-making problems whose goal is to maximize long-term benefits. == Application in machine learning == In the context of machine learning, the exploration–exploitation tradeoff is fundamental in reinforcement learning (RL), a type of machine learning that involves training agents to make decisions based on feedback from the environment. Crucially, this feedback may be incomplete or delayed. The agent must decide whether to exploit the current best-known policy or explore new policies to improve its performance. === Multi-armed bandit methods === The multi-armed bandit (MAB) problem was a classic example of the tradeoff, and many methods were developed for it, such as epsilon-greedy, Thompson sampling, and the upper confidence bound (UCB). See the page on MAB for details. In more complex RL situations than the MAB problem, the agent can treat each choice as a MAB, where the payoff is the expected future reward. For example, if the agent performs an epsilon-greedy method, then the agent will often "pull the best lever" by picking the action that had the best predicted expected reward (exploit). However, it would pick a random action with probability epsilon (explore). Monte Carlo tree search, for example, uses a variant of the UCB method. === Exploration problems === There are some problems that make exploration difficult. Sparse reward. If rewards occur only once a long while, then the agent might not persist in exploring. Furthermore, if the space of actions is large, then the sparse reward would mean the agent would not be guided by the reward to find a good direction for deeper exploration. A standard example is Montezuma's Revenge. Deceptive reward. If some early actions give immediate small reward, but other actions give later large reward, then the agent might be lured away from exploring the other actions. Noisy TV problem. If certain observations are irreducibly noisy (such as a television showing random images), then the agent might be trapped exploring those observations (watching the television). === Exploration reward === This section based on. The exploration reward (also called exploration bonus) methods convert the exploration-exploitation dilemma into a balance of exploitations. That is, instead of trying to get the agent to balance exploration and exploitation, exploration is simply treated as another form of exploitation, and the agent simply attempts to maximize the sum of rewards from exploration and exploitation. The exploration reward can be treated as a form of intrinsic reward. We write these as r t i , r t e {\displaystyle r_{t}^{i},r_{t}^{e}} , meaning the intrinsic and extrinsic rewards at time step t {\displaystyle t} . However, exploration reward is different from exploitation in two regards: The reward of exploitation is not freely chosen, but given by the environment, but the reward of exploration may be picked freely. Indeed, there are many different ways to design r t i {\displaystyle r_{t}^{i}} described below. The reward of exploitation is usually stationary (i.e. the same action in the same state gives the same reward), but the reward of exploration is non-stationary (i.e. the same action in the same state should give less and less reward). Count-based exploration uses N n ( s ) {\displaystyle N_{n}(s)} , the number of visits to a state s {\displaystyle s} during the time-steps 1 : n {\displaystyle 1:n} , to calculate the exploration reward. This is only possible in small and discrete state space. Density-based exploration extends count-based exploration by using a density model ρ n ( s ) {\displaystyle \rho _{n}(s)} . The idea is that, if a state has been visited, then nearby states are also partly-visited. In maximum entropy exploration, the entropy of the agent's policy π {\displaystyle \pi } is included as a term in the intrinsic reward. That is, r t i = − ∑ a π ( a | s t ) ln ⁡ π ( a | s t ) + ⋯ {\displaystyle r_{t}^{i}=-\sum _{a}\pi (a|s_{t})\ln \pi (a|s_{t})+\cdots } . === Prediction-based === This section based on. The forward dynamics model is a function for predicting the next state based on the current state and the current action: f : ( s t , a t ) ↦ s t + 1 {\displaystyle f:(s_{t},a_{t})\mapsto s_{t+1}} . The forward dynamics model is trained as the agent plays. The model becomes better at predicting state transition for state-action pairs that had been done many times. A forward dynamics model can define an exploration reward by r t i = ‖ f ( s t , a t ) − s t + 1 ‖ 2 2 {\displaystyle r_{t}^{i}=\|f(s_{t},a_{t})-s_{t+1}\|_{2}^{2}} . That is, the reward is the squared-error of the prediction compared to reality. This rewards the agent to perform state-action pairs that had not been done many times. This is however susceptible to the noisy TV problem. Dynamics model can be run in latent space. That is, r t i = ‖ f ( s t , a t ) − ϕ ( s t + 1 ) ‖ 2 2 {\displaystyle r_{t}^{i}=\|f(s_{t},a_{t})-\phi (s_{t+1})\|_{2}^{2}} for some featurizer ϕ {\displaystyle \phi } . The featurizer can be the identity function (i.e. ϕ ( x ) = x {\displaystyle \phi (x)=x} ), randomly generated, the encoder-half of a variational autoencoder, etc. A good featurizer improves forward dynamics exploration. The Intrinsic Curiosity Module (ICM) method trains simultaneously a forward dynamics model and a featurizer. The featurizer is trained by an inverse dynamics model, which is a function for predicting the current action based on the features of the current and the next state: g : ( ϕ ( s t ) , ϕ ( s t + 1 ) ) ↦ a t {\displaystyle g:(\phi (s_{t}),\phi (s_{t+1}))\mapsto a_{t}} . By optimizing the inverse dynamics, both the inverse dynamics model and the featurizer are improved. Then, the improved featurizer improves the forward dynamics model, which improves the exploration of the agent. Random Network Distillation (RND) method attempts to solve this problem by teacher–student distillation. Instead of a forward dynamics model, it has two models f , f ′ {\displaystyle f,f'} . The f ′ {\displaystyle f'} teacher model is fixed, and the f {\displaystyle f} student model is trained to minimize ‖ f ( s ) − f ′ ( s ) ‖ 2 2 {\displaystyle \|f(s)-f'(s)\|_{2}^{2}} on states s {\displaystyle s} . As a state is visited more and more, the student network becomes better at predicting the teacher. Meanwhile, the prediction error is also an exploration reward for the agent, and so the agent learns to perform actions that result in higher prediction error. Thus, we have a student network attempting to minimize the prediction error, while the agent attempting to maximize it, resulting in exploration. The states are normalized by subtracting a running average and dividing a running variance, which is necessary since the teacher model is frozen. The rewards are normalized by dividing with a running variance. Exploration by disagreement trains an ensemble of forward dynamics models, each on a random subset of all ( s t , a t , s t + 1 ) {\displaystyle (s_{t},a_{t},s_{t+1})} tuples. The exploration reward is the variance of the models' predictions. === Noise === For neural network–based agents, the NoisyNet method changes some of its neural network modules by noisy versions. That is, some network parameters are random variables from a probability distribution. The parameters of the distribution are themselves learnable. For example, in a linear layer y = W x + b {\displaystyle y=Wx+b} , both W , b {\displaystyle W,b} are sampled from Gaussian distributions N ( μ W , Σ W ) , N ( μ b , Σ b ) {\displaystyle {\mathcal {N}}(\mu _{W},\Sigma _{W}),{\mathcal {N}}(\mu _{b},\Sigma _{b})} at every step, and the parameters μ W , Σ W , μ b , Σ b {\displaystyle \mu _{W},\Sigma _{W},\mu _{b},\Sigma _{b}} are learned via the reparameterization trick.

Halite AI Programming Competition

Halite is an open-source computer programming contest developed by the hedge fund/tech firm Two Sigma in partnership with a team at Cornell Tech. Programmers can see the game environment and learn everything they need to know about the game. Participants are asked to build bots in whichever language they choose to compete on a two-dimensional virtual battle field. == History == Benjamin Spector and Michael Truell created the first Halite competition in 2016, before partnering with Two Sigma later that year. === Halite I === Halite I asked participants to conquer territory on a grid. It launched in November 2016 and ended in February 2017. Halite I attracted about 1,500 players. === Halite II === Halite II was similar to Halite I, but with a space-war theme. It ran from October 2017 until January 2018. The second installment of the competition attracted about 6,000 individual players from more than 100 countries. Among the participants were professors, physicists and NASA engineers, as well as high school and university students. === Halite III === Halite III launched in mid-October 2018. It ran from October 2018 to January 2019, with an ocean themed playing field. Players were asked to collect and manage Halite, an energy resource. By the end of the competition, Halite III included more than 4000 players and 460 organizations. === Halite IV === Halite IV was hosted by Kaggle, and launched in mid-June 2020.

.ai

.ai is the Internet country code top-level domain (ccTLD) for Anguilla, a British Overseas Territory in the Caribbean. It is administered by the government of Anguilla. It is a popular domain hack with companies and projects related to the artificial intelligence industry (AI). Google's ad targeting treats .ai as a generic top-level domain (gTLD) because "users and website owners frequently see [the domain] as being more generic than country-targeted." In 2021, Google Search analyst Gary Illyes announced that ".ai" had been added to Google’s list of generic country-code top-level domains, meaning that Google would no longer infer Anguilla-specific targeting from the ccTLD. Identity Digital began managing the domain as of January 2025. == Second and third level registrations == Registrations within off.ai, com.ai, net.ai, and org.ai are available worldwide without restriction. From 15 September 2009, second level registrations within .ai are available to everyone worldwide. == Registration == The minimum registration term allowed for .ai domains is 2 through 10 years for registration and renewal, and a 2-year renewal for domain transfer. Identity Digital is the authority in charge of managing this extension. Registrations began on 16 February 1995. The limits on the number of characters used for the domain name are, at a minimum, from 1 to 3, depending on the registrar, and always at most 63 characters. The character set supported for .ai domain names includes A–Z, a–z, 0–9, and hyphen. As of November 2022, .ai domains cannot accommodate IDN characters. There are no requirements for registering a domain, including local and foreign residents. A .ai domain can be suspended or revoked, if the domain is involved in illegal activity such as violating trademarks or copyrights. Usage must not violate the laws of Anguilla. Anguilla uses the UDRP. Filing a UDRP challenge requires using one of the ICANN Approved Dispute Resolution Service Providers. If the domain is with an ICANN accredited registrar, they should work with the arbitrator. Usually this means either doing nothing or transferring a domain. .ai domains are transferable to any desired registrars as the registration of domain is done maintaining EPP. There used to be a whois.ai-based platform of expired domains in which those could be procured and auctioned every ten days through a standard online process. The last auctions of such kind closed there in December 2024; the platform had been scheduled for shutdown on 30 June 2025, but remained online in the months following that date. == Valuation == Domains cost depends on the registrar, with yearly fees ranging from US$140 (the base fee, as established by Anguilla) to $200. As of July 2025, the highest-valued .ai domain is an undisclosed one sold on 8 November 2023, on Escrow.com, for US$1,500,000—months after an initial $300,000 sale to the same buyer. Among the publicly disclosed ones, the most valued, fin.ai, was sold for $1,000,000 in March 2025. On 16 December 2017, the .ai registry started supporting the Extensible Provisioning Protocol (EPP) and migrated all of its domains onto an EPP system. Consequently, many registrars are allowed to sell .ai domains. Since that date, the .ai ccTLD has also been popular with artificial intelligence companies and organisations. Though such trends are primarily seen among new AI based companies or startups, many established AI and Tech companies preferred not to opt for .ai domains. For example, DeepMind has its domain retained at .com; Meta has redirected its facebook.ai domain to ai.meta.com. == Impact on Anguilla's economy == The registration fees earned from the .ai domains go to the treasury of the Government of Anguilla. As per a 2018 New York Times report, the total revenue generated out of selling .ai domains was $2.9 million. In 2023, Anguilla's government made about US$32 million from fees collected for registering .ai domains; that amounted to over 10% of gross domestic product for the territory. "In the years before the real breakthrough of AI, revenue from .ai domains made up less than 1% of our state income, by 2025 it will be around 47%," explained Jose Vanterpool, Minister of Infrastructure and Communications (MICUHITES), in an interview with BBC. The high 90% renewal rate of .ai domains and the 2025 renewal wave of domains registered in 2023 are driving another surge in state revenues, according to Domaintechnik.

Deep tomographic reconstruction

Deep Tomographic Reconstruction is a set of methods for using deep learning methods to perform tomographic reconstruction of medical and industrial images. It uses artificial intelligence and machine learning, especially deep artificial neural networks or deep learning, to overcome challenges such as measurement noise, data sparsity, image artifacts, and computational inefficiency. This approach has been applied across various imaging modalities, including CT, MRI, PET, SPECT, ultrasound, and optical imaging == Historical background == Traditional tomographic reconstruction relies on analytic methods such as filtered back-projection, or iterative methods which incrementally compute inverse transformations from measurement data (e.g., Radon or Fourier transform data). However, these approaches are not sufficient for certain imaging techniques such as low-dose CT and fast MRI, or scenarios involving metal artifacts and patient motion. == Use in imaging modalities == === Computed tomography (CT) === In CT, deep learning models can be particularly effective in reducing radiation exposure while maintaining image quality. Deep neural networks can also be able to reconstruct images of fair quality from sparsely sampled data without sacrificing diagnostic performance. Deep learning-based generative AI models can reduce CT metal artifacts. === Magnetic resonance imaging (MRI) === In magnetic resonance imaging (MRI), deep learning can lead to reduced MRI motion artifacts, and increased acquisition speed, referred to as fast MRI. Despite suffering from disadvantages such as lower signal-to-noise ratio (SNR), deep learning can enhance image quality in low field MRI, making these systems clinically viable. === Positron emission tomography (PET) and single-photon emission CT (SPECT) === For PET imaging, deep learning models can provide substantial improvements in low-dose imaging and motion artifact correction. Also, deep learning can help SPECT for generation of attenuation background. A notable technique for PET denoising involves integrating MR data through multimodal networks, which use anatomical information from MRI to enhance PET image quality. === Ultrasound imaging === Deep learning can enhance ultrasound imaging by reducing speckle noise and motion blur. For ultrasound beamforming, deep neural networks can allow superior image quality with limited data at high speed. === Optical imaging and microscopy === Diffuse optical tomography, optical coherence tomography and microscopy can be improved by deep neural networks beyond traditional methods. Furthermore, deep learning can also enhance Photoacoustic imaging (see Deep learning in photoacoustic imaging), addressing challenges like high noise, low contrast, and limited resolution. Deep learning has also been applied to label-free live-cell imaging, where convolutional neural networks predict fluorescence labels from transmitted light images, a technique known as in silico labeling. This method can enable high-throughput, non-invasive cell analysis and phenotyping without the need for traditional fluorescent dyes.

Kolmogorov–Arnold Networks

Kolmogorov–Arnold Networks (KANs) are a type of artificial neural network architecture inspired by the Kolmogorov–Arnold representation theorem, also known as the superposition theorem. Unlike traditional multilayer perceptrons (MLPs), which rely on fixed activation functions and linear weights, KANs replace each weight with a learnable univariate function, often represented using splines. == History == KANs (Kolmogorov–Arnold Networks) were proposed by Liu et al. (2024) as a generalization of the Kolmogorov–Arnold representation theorem (KART), aiming to outperform MLPs in small-scale AI and scientific tasks. Before KANs, numerous studies explored KART's connections to neural networks or used it as a basis for designing new network architectures. In the 1980s and 1990s, early research applied KART to neural network design. Kůrková et al. (1992), Hecht-Nielsen (1987), and Nees (1994) established theoretical foundations for multilayer networks based on KART. Igelnik et al. (2003) introduced the Kolmogorov Spline Network using cubic splines to model complex functions. Sprecher (1996, 1997) introduced numerical methods for building network layers, while Nakamura et al. (1993) created activation functions with guaranteed approximation accuracy. These works linked KART's theoretical potential with practical neural network implementation. KART has also been used in other computational and theoretical fields. Coppejans (2004) developed nonparametric regression estimators using B-splines, Bryant (2008) applied it to high-dimensional image tasks, Liu (2015) investigated theoretical applications in optimal transport and image encryption, and more recently, Polar and Poluektov (2021) used Urysohn operators for efficient KART construction, while Fakhoury et al. (2022) introduced ExSpliNet, integrating KART with probabilistic trees and multivariate B-splines for improved function approximation. == Architecture == KANs are based on the Kolmogorov–Arnold representation theorem, which was linked to the 13th Hilbert problem. Given x = ( x 1 , x 2 , … , x n ) {\displaystyle x=(x_{1},x_{2},\dots ,x_{n})} consisting of n variables, a multivariate continuous function f ( x ) {\displaystyle f(x)} can be represented as: f ( x ) = f ( x 1 , … , x n ) = ∑ q = 1 2 n + 1 Φ q ( ∑ p = 1 n φ q , p ( x p ) ) {\displaystyle f(x)=f(x_{1},\dots ,x_{n})=\sum _{q=1}^{2n+1}\Phi _{q}\left(\sum _{p=1}^{n}\varphi _{q,p}(x_{p})\right)} (1) This formulation contains two nested summations: an outer and an inner sum. The outer sum ∑ q = 1 2 n + 1 {\displaystyle \sum _{q=1}^{2n+1}} aggregates 2 n + 1 {\displaystyle 2n+1} terms, each involving a function Φ q : R → R {\displaystyle \Phi _{q}:\mathbb {R} \to \mathbb {R} } . The inner sum ∑ p = 1 n {\displaystyle \sum _{p=1}^{n}} computes n terms for each q, where each term φ q , p : [ 0 , 1 ] → R {\displaystyle \varphi _{q,p}:[0,1]\to \mathbb {R} } is a continuous function of the single variable x p {\displaystyle x_{p}} . The inner continuous functions φ q , p {\displaystyle \varphi _{q,p}} are universal, independent of f {\displaystyle f} , while the outer functions Φ q {\displaystyle \Phi _{q}} depend on the specific function f {\displaystyle f} being represented. The representation (1) holds for all multivariate functions f {\displaystyle f} as proved in . If f {\displaystyle f} is continuous, then the outer functions Φ q {\displaystyle \Phi _{q}} are continuous; if f {\displaystyle f} is discontinuous, then the corresponding Φ q {\displaystyle \Phi _{q}} are generally discontinuous, while the inner functions φ q , p {\displaystyle \varphi _{q,p}} remain the same universal functions. Liu et al. proposed the name KAN. A general KAN network consisting of L layers takes x to generate the output as: K A N ( x ) = ( Φ L − 1 ∘ Φ L − 2 ∘ ⋯ ∘ Φ 1 ∘ Φ 0 ) x {\displaystyle \mathrm {KAN} (x)=(\Phi ^{L-1}\circ \Phi ^{L-2}\circ \cdots \circ \Phi ^{1}\circ \Phi ^{0})x} (3) Here, Φ l {\displaystyle \Phi ^{l}} is the function matrix of the l-th KAN layer or a set of pre-activations. Let i denote the neuron of the l-th layer and j the neuron of the (l+1)-th layer. The activation function φ j , i l {\displaystyle \varphi _{j,i}^{l}} connects (l, i) to (l+1, j): φ j , i l , l = 0 , … , L − 1 , i = 1 , … , n l , j = 1 , … , n l + 1 {\displaystyle \varphi _{j,i}^{l},\quad l=0,\dots ,L-1,\;i=1,\dots ,n_{l},\;j=1,\dots ,n_{l+1}} (4) where nl is the number of nodes of the l-th layer. Thus, the function matrix Φ l {\displaystyle \Phi ^{l}} can be represented as an n l + 1 × n l {\displaystyle n_{l+1}\times n_{l}} matrix of activations: x l + 1 = ( φ 1 , 1 l ( ⋅ ) φ 1 , 2 l ( ⋅ ) ⋯ φ 1 , n l l ( ⋅ ) φ 2 , 1 l ( ⋅ ) φ 2 , 2 l ( ⋅ ) ⋯ φ 2 , n l l ( ⋅ ) ⋮ ⋮ ⋱ ⋮ φ n l + 1 , 1 l ( ⋅ ) φ n l + 1 , 2 l ( ⋅ ) ⋯ φ n l + 1 , n l l ( ⋅ ) ) x l {\displaystyle x^{l+1}={\begin{pmatrix}\varphi _{1,1}^{l}(\cdot )&\varphi _{1,2}^{l}(\cdot )&\cdots &\varphi _{1,n_{l}}^{l}(\cdot )\\\varphi _{2,1}^{l}(\cdot )&\varphi _{2,2}^{l}(\cdot )&\cdots &\varphi _{2,n_{l}}^{l}(\cdot )\\\vdots &\vdots &\ddots &\vdots \\\varphi _{n_{l+1},1}^{l}(\cdot )&\varphi _{n_{l+1},2}^{l}(\cdot )&\cdots &\varphi _{n_{l+1},n_{l}}^{l}(\cdot )\end{pmatrix}}x^{l}} == Implementations == To make the KAN layers optimizable, the inner function is formed by the combination of spline and basic functions as the formula: φ ( x ) = w b b ( x ) + w s spline ( x ) {\displaystyle \varphi (x)=w_{b}\,b(x)+w_{s}\,{\text{spline}}(x)} where b ( x ) {\displaystyle b(x)} is the basic function, usually defined as s i l u ( x ) = x / ( 1 + e x ) {\displaystyle silu(x)=x/(1+e^{x})} and w b {\displaystyle w_{b}} is the base weight matrix. Also, w s {\displaystyle w_{s}} is the spline weight matrix and spline ( x ) {\displaystyle {\text{spline}}(x)} is the spline function. The spline function can be a sum of B-splines. spline ( x ) = ∑ i c i B i ( x ) {\displaystyle {\text{spline}}(x)=\sum _{i}c_{i}B_{i}(x)} Many studies suggested to use other polynomial and curve functions instead of B-spline to create new KAN variants. == Functions used == The choice of functional basis strongly influences the performance of KANs. Common function families include: B-splines: Provide locality, smoothness, and interpretability; they are the most widely used in current implementations. RBFs (include Gaussian RBFs): Capture localized features in data and are effective in approximating functions with non-linear or clustered structures. Chebyshev polynomials: Offer efficient approximation with minimized error in the maximum norm, making them useful for stable function representation. Rational function: Useful for approximating functions with singularities or sharp variations, as they can model asymptotic behavior better than polynomials. Fourier series: Capture periodic patterns effectively and are particularly useful in domains such as physics-informed machine learning. Wavelet functions (DoG, Mexican hat, Morlet, and Shannon): Used for feature extraction as they can capture both high-frequency and low-frequency data components. Piecewise linear functions: Provide efficient approximation for multivariate functions in KANs. == Usage == In some modern neural architectures like convolutional neural networks (CNNs), recurrent neural networks (RNNs), and Transformers, KANs are typically used as drop-in substitutes for MLP layers. Despite KANs' general-purpose design, researchers have created and used them for a number of tasks: Scientific machine learning (SciML): Function fitting, partial differential equations (PDEs) and physical/mathematical laws. Continual learning: KANs better preserve previously learned information during incremental updates, avoiding catastrophic forgetting due to the locality of spline adjustments. Graph neural networks: Extensions such as Kolmogorov–Arnold Graph Neural Networks (KA-GNNs) integrate KAN modules into message-passing architectures, showing improvements in molecular property prediction tasks. Sensor data processing: Kolmogorov–Arnold Networks (KANs) have recently been applied to sensor data processing due to their ability to model complex nonlinear relationships with relatively few parameters and improved interpretability compared to conventional multilayer perceptrons. Applications include industrial soft sensors, biomedical signal analysis, remote sensing, and environmental monitoring systems. == Drawbacks == KANs can be computationally intensive and require a large number of parameters due to their use of polynomial functions to capture data.