Kunihiko Fukushima

Kunihiko Fukushima

Kunihiko Fukushima (Japanese: 福島 邦彦, born 16 March 1936) is a Japanese computer scientist, most noted for his work on artificial neural networks and deep learning. He is currently working part-time as a senior research scientist at the Fuzzy Logic Systems Institute in Fukuoka, Japan. == Notable scientific achievements == In 1980, Fukushima published the neocognitron, the original deep convolutional neural network (CNN) architecture. Fukushima proposed several supervised and unsupervised learning algorithms to train the parameters of a deep neocognitron such that it could learn internal representations of incoming data. Today, however, the CNN architecture is usually trained through backpropagation. This approach is now heavily used in computer vision. In 1969 Fukushima introduced the ReLU (Rectifier Linear Unit) activation function in the context of visual feature extraction in hierarchical neural networks, which he called "analog threshold element". (Though the ReLU was first used by Alston Householder in 1941 as a mathematical abstraction of biological neural networks.) As of 2017 it is the most popular activation function for deep neural networks. == Education and career == In 1958, Fukushima received his Bachelor of Engineering in electronics from Kyoto University. He became a senior research scientist at the NHK Science & Technology Research Laboratories. In 1989, he joined the faculty of Osaka University. In 1999, he joined the faculty of the University of Electro-Communications. In 2001, he joined the faculty of Tokyo University of Technology. From 2006 to 2010, he was a visiting professor at Kansai University. Fukushima acted as founding president of the Japanese Neural Network Society (JNNS). He also was a founding member on the board of governors of the International Neural Network Society (INNS), and president of the Asia-Pacific Neural Network Assembly (APNNA). He was one of the board of governors of the International Neural Network Society (INNS) in 1989-1990 and 1993-2005. == Awards == In 2020, Fukushima received the Bower Award and Prize for Achievement in Science. In 2022, Fukushima became a laureate of the Asian Scientist 100 by the Asian Scientist. He also received the IEICE Achievement Award and Excellent Paper Awards, the IEEE Neural Networks Pioneer Award, the APNNA Outstanding Achievement Award, the JNNS Excellent Paper Award and the INNS Helmholtz Award.

N-jet

An N-jet is the set of (partial) derivatives of a function f ( x ) {\displaystyle f(x)} up to order N. Specifically, in the area of computer vision, the N-jet is usually computed from a scale space representation L {\displaystyle L} of the input image f ( x , y ) {\displaystyle f(x,y)} , and the partial derivatives of L {\displaystyle L} are used as a basis for expressing various types of visual modules. For example, algorithms for tasks such as feature detection, feature classification, stereo matching, tracking and object recognition can be expressed in terms of N-jets computed at one or several scales in scale space.

Distribution learning theory

The distributional learning theory or learning of probability distribution is a framework in computational learning theory. It has been proposed from Michael Kearns, Yishay Mansour, Dana Ron, Ronitt Rubinfeld, Robert Schapire and Linda Sellie in 1994 and it was inspired from the PAC-framework introduced by Leslie Valiant. In this framework the input is a number of samples drawn from a distribution that belongs to a specific class of distributions. The goal is to find an efficient algorithm that, based on these samples, determines with high probability the distribution from which the samples have been drawn. Because of its generality, this framework has been used in a large variety of different fields like machine learning, approximation algorithms, applied probability and statistics. This article explains the basic definitions, tools and results in this framework from the theory of computation point of view. == Definitions == Let X {\displaystyle \textstyle X} be the support of the distributions of interest. As in the original work of Kearns et al. if X {\displaystyle \textstyle X} is finite it can be assumed without loss of generality that X = { 0 , 1 } n {\displaystyle \textstyle X=\{0,1\}^{n}} where n {\displaystyle \textstyle n} is the number of bits that have to be used in order to represent any y ∈ X {\displaystyle \textstyle y\in X} . We focus in probability distributions over X {\displaystyle \textstyle X} . There are two possible representations of a probability distribution D {\displaystyle \textstyle D} over X {\displaystyle \textstyle X} . probability distribution function (or evaluator) an evaluator E D {\displaystyle \textstyle E_{D}} for D {\displaystyle \textstyle D} takes as input any y ∈ X {\displaystyle \textstyle y\in X} and outputs a real number E D [ y ] {\displaystyle \textstyle E_{D}[y]} which denotes the probability that of y {\displaystyle \textstyle y} according to D {\displaystyle \textstyle D} , i.e. E D [ y ] = Pr [ Y = y ] {\displaystyle \textstyle E_{D}[y]=\Pr[Y=y]} if Y ∼ D {\displaystyle \textstyle Y\sim D} . generator a generator G D {\displaystyle \textstyle G_{D}} for D {\displaystyle \textstyle D} takes as input a string of truly random bits y {\displaystyle \textstyle y} and outputs G D [ y ] ∈ X {\displaystyle \textstyle G_{D}[y]\in X} according to the distribution D {\displaystyle \textstyle D} . Generator can be interpreted as a routine that simulates sampling from the distribution D {\displaystyle \textstyle D} given a sequence of fair coin tosses. A distribution D {\displaystyle \textstyle D} is called to have a polynomial generator (respectively evaluator) if its generator (respectively evaluator) exists and can be computed in polynomial time. Let C X {\displaystyle \textstyle C_{X}} a class of distribution over X, that is C X {\displaystyle \textstyle C_{X}} is a set such that every D ∈ C X {\displaystyle \textstyle D\in C_{X}} is a probability distribution with support X {\displaystyle \textstyle X} . The C X {\displaystyle \textstyle C_{X}} can also be written as C {\displaystyle \textstyle C} for simplicity. In order to evaluate learnability, it is necessary to have a way to measure how well an approximated distribution D ′ {\displaystyle \textstyle D'} fits the sampled distribution D {\displaystyle \textstyle D} . There are several ways to measure the divergence between two distributions. Three common possibilities are Kullback–Leibler divergence Total variation distance of probability measures Kolmogorov distance Total variation and Kolmogorov distance are true metrics, while KL divergence is not (it lacks symmetry). These measures are ordered by convergence strength: closeness in KL divergence implies closeness in total variation (via Pinsker's inequality), which in turn implies closeness in Kolmogorov distance. Therefore, a learnability result proven under KL divergence automatically holds under the weaker measures, but not vice versa. Since certain measures may be more appropriate in specific applications, we will use d ( D , D ′ ) {\displaystyle \textstyle d(D,D')} to denote a selected divergence between the distribution D {\displaystyle \textstyle D} and the distribution D ′ {\displaystyle \textstyle D'} . The basic input that we use in order to learn a distribution is a number of samples drawn by this distribution. For the computational point of view the assumption is that such a sample is given in a constant amount of time. So it's like having access to an oracle G E N ( D ) {\displaystyle \textstyle GEN(D)} that returns a sample from the distribution D {\displaystyle \textstyle D} . Sometimes the interest is, apart from measuring the time complexity, to measure the number of samples that have to be used in order to learn a specific distribution D {\displaystyle \textstyle D} in class of distributions C {\displaystyle \textstyle C} . This quantity is called sample complexity of the learning algorithm. In order for the problem of distribution learning to be more clear consider the problem of supervised learning as defined in. In this framework of statistical learning theory a training set S = { ( x 1 , y 1 ) , … , ( x n , y n ) } {\displaystyle \textstyle S=\{(x_{1},y_{1}),\dots ,(x_{n},y_{n})\}} and the goal is to find a target function f : X → Y {\displaystyle \textstyle f:X\rightarrow Y} that minimizes some loss function, e.g. the square loss function. More formally f = arg ⁡ min g ∫ V ( y , g ( x ) ) d ρ ( x , y ) {\displaystyle f=\arg \min _{g}\int V(y,g(x))d\rho (x,y)} , where V ( ⋅ , ⋅ ) {\displaystyle V(\cdot ,\cdot )} is the loss function, e.g. V ( y , z ) = ( y − z ) 2 {\displaystyle V(y,z)=(y-z)^{2}} and ρ ( x , y ) {\displaystyle \rho (x,y)} the probability distribution according to which the elements of the training set are sampled. If the conditional probability distribution ρ x ( y ) {\displaystyle \rho _{x}(y)} is known then the target function has the closed form f ( x ) = ∫ y y d ρ x ( y ) {\displaystyle f(x)=\int _{y}yd\rho _{x}(y)} . So the set S {\displaystyle S} is a set of samples from the probability distribution ρ ( x , y ) {\displaystyle \rho (x,y)} . Now the goal of distributional learning theory if to find ρ {\displaystyle \rho } given S {\displaystyle S} which can be used to find the target function f {\displaystyle f} . Definition of learnability A class of distributions C {\displaystyle \textstyle C} is called efficiently learnable if for every ϵ > 0 {\displaystyle \textstyle \epsilon >0} and 0 < δ ≤ 1 {\displaystyle \textstyle 0<\delta \leq 1} given access to G E N ( D ) {\displaystyle \textstyle GEN(D)} for an unknown distribution D ∈ C {\displaystyle \textstyle D\in C} , there exists a polynomial time algorithm A {\displaystyle \textstyle A} , called learning algorithm of C {\displaystyle \textstyle C} , that outputs a generator or an evaluator of a distribution D ′ {\displaystyle \textstyle D'} such that Pr [ d ( D , D ′ ) ≤ ϵ ] ≥ 1 − δ {\displaystyle \Pr[d(D,D')\leq \epsilon ]\geq 1-\delta } If we know that D ′ ∈ C {\displaystyle \textstyle D'\in C} then A {\displaystyle \textstyle A} is called proper learning algorithm, otherwise is called improper learning algorithm. In some settings the class of distributions C {\displaystyle \textstyle C} is a class with well known distributions which can be described by a set of parameters. For instance C {\displaystyle \textstyle C} could be the class of all the Gaussian distributions N ( μ , σ 2 ) {\displaystyle \textstyle N(\mu ,\sigma ^{2})} . In this case the algorithm A {\displaystyle \textstyle A} should be able to estimate the parameters μ , σ {\displaystyle \textstyle \mu ,\sigma } . In this case A {\displaystyle \textstyle A} is called parameter learning algorithm. Obviously the parameter learning for simple distributions is a very well studied field that is called statistical estimation and there is a very long bibliography on different estimators for different kinds of simple known distributions. But distributions learning theory deals with learning class of distributions that have more complicated description. == First results == In their seminal work, Kearns et al. deal with the case where A {\displaystyle \textstyle A} is described in term of a finite polynomial sized circuit and they proved the following for some specific classes of distribution. O R {\displaystyle \textstyle OR} gate distributions for this kind of distributions there is no polynomial-sized evaluator, unless # P ⊆ P / poly {\displaystyle \textstyle \#P\subseteq P/{\text{poly}}} . On the other hand, this class is efficiently learnable with generator. Parity gate distributions this class is efficiently learnable with both generator and evaluator. Mixtures of Hamming Balls this class is efficiently learnable with both generator and evaluator. Probabilistic Finite Automata this class is not efficiently learnable with evaluator under the Noisy Parity Assumption which is an impossibility assumption in the PAC learning fram

Diffusion model

In machine learning, diffusion models, also known as diffusion-based generative models or score-based generative models, are a class of latent variable generative models. A diffusion model consists of two major components: the forward diffusion process, and the reverse sampling process. The goal of diffusion models is to learn a diffusion process for a given dataset, such that the process can generate new elements that are distributed similarly as the original dataset. A diffusion model models data as generated by a diffusion process, whereby a new datum performs a random walk with drift through the space of all possible data. A trained diffusion model can be sampled in many ways, with different efficiency and quality. There are various equivalent formalisms, including Markov chains, denoising diffusion probabilistic models, noise conditioned score networks, and stochastic differential equations. They are typically trained using variational inference. The model responsible for denoising is typically called its "backbone". The backbone may be of any kind, but they are typically U-nets or transformers. As of 2024, diffusion models are mainly used for computer vision tasks, including image denoising, inpainting, super-resolution, image generation, and video generation. These typically involve training a neural network to sequentially denoise images blurred with Gaussian noise. The model is trained to reverse the process of adding noise to an image. After training to convergence, it can be used for image generation by starting with an image composed of random noise, and applying the network iteratively to denoise the image. Diffusion-based image generators have seen widespread commercial interest, such as Stable Diffusion and DALL-E. These models typically combine diffusion models with other models, such as text-encoders and cross-attention modules to allow text-conditioned generation. Other than computer vision, diffusion models have also found applications in natural language processing such as text generation and summarization, sound generation, and reinforcement learning. == Denoising diffusion model == === Non-equilibrium thermodynamics === Diffusion models were introduced in 2015 as a method to train a model that can sample from a highly complex probability distribution. They used techniques from non-equilibrium thermodynamics, especially diffusion. Consider, for example, how one might model the distribution of all naturally occurring photos. Each image is a point in the space of all images, and the distribution of naturally occurring photos is a "cloud" in space, which, by repeatedly adding noise to the images, diffuses out to the rest of the image space, until the cloud becomes all but indistinguishable from a Gaussian distribution N ( 0 , I ) {\displaystyle {\mathcal {N}}(0,I)} . A model that can approximately undo the diffusion can then be used to sample from the original distribution. This is studied in "non-equilibrium" thermodynamics, as the starting distribution is not in equilibrium, unlike the final distribution. The equilibrium distribution is the Gaussian distribution N ( 0 , I ) {\displaystyle {\mathcal {N}}(0,I)} , with pdf ρ ( x ) ∝ e − 1 2 ‖ x ‖ 2 {\displaystyle \rho (x)\propto e^{-{\frac {1}{2}}\|x\|^{2}}} . This is just the Maxwell–Boltzmann distribution of particles in a potential well V ( x ) = 1 2 ‖ x ‖ 2 {\displaystyle V(x)={\frac {1}{2}}\|x\|^{2}} at temperature 1. The initial distribution, being very much out of equilibrium, would diffuse towards the equilibrium distribution, making biased random steps that are a sum of pure randomness (like a Brownian walker) and gradient descent down the potential well. The randomness is necessary: if the particles were to undergo only gradient descent, then they will all fall to the origin, collapsing the distribution. === Denoising Diffusion Probabilistic Model (DDPM) === The 2020 paper proposed the Denoising Diffusion Probabilistic Model (DDPM), which improves upon the previous method by variational inference. ==== Forward diffusion ==== To present the model, some notation is required. β 1 , . . . , β T ∈ ( 0 , 1 ) {\displaystyle \beta _{1},...,\beta _{T}\in (0,1)} are fixed constants. α t := 1 − β t {\displaystyle \alpha _{t}:=1-\beta _{t}} α ¯ t := α 1 ⋯ α t {\displaystyle {\bar {\alpha }}_{t}:=\alpha _{1}\cdots \alpha _{t}} σ t := 1 − α ¯ t {\displaystyle \sigma _{t}:={\sqrt {1-{\bar {\alpha }}_{t}}}} σ ~ t := σ t − 1 σ t β t {\displaystyle {\tilde {\sigma }}_{t}:={\frac {\sigma _{t-1}}{\sigma _{t}}}{\sqrt {\beta _{t}}}} μ ~ t ( x t , x 0 ) := α t ( 1 − α ¯ t − 1 ) x t + α ¯ t − 1 ( 1 − α t ) x 0 σ t 2 {\displaystyle {\tilde {\mu }}_{t}(x_{t},x_{0}):={\frac {{\sqrt {\alpha _{t}}}(1-{\bar {\alpha }}_{t-1})x_{t}+{\sqrt {{\bar {\alpha }}_{t-1}}}(1-\alpha _{t})x_{0}}{\sigma _{t}^{2}}}} N ( μ , Σ ) {\displaystyle {\mathcal {N}}(\mu ,\Sigma )} is the normal distribution with mean μ {\displaystyle \mu } and variance Σ {\displaystyle \Sigma } , and N ( x | μ , Σ ) {\displaystyle {\mathcal {N}}(x|\mu ,\Sigma )} is the probability density at x {\displaystyle x} . A vertical bar denotes conditioning. A forward diffusion process starts at some starting point x 0 ∼ q {\displaystyle x_{0}\sim q} , where q {\displaystyle q} is the probability distribution to be learned, then repeatedly adds noise to it by x t = 1 − β t x t − 1 + β t z t {\displaystyle x_{t}={\sqrt {1-\beta _{t}}}x_{t-1}+{\sqrt {\beta _{t}}}z_{t}} where z 1 , . . . , z T {\displaystyle z_{1},...,z_{T}} are IID (Independent and identically distributed random variables) samples from N ( 0 , I ) {\displaystyle {\mathcal {N}}(0,I)} . The coefficients 1 − β t {\displaystyle {\sqrt {1-\beta _{t}}}} and β t {\displaystyle {\sqrt {\beta _{t}}}} ensure that Var ( X t ) = I {\displaystyle {\mbox{Var}}(X_{t})=I} assuming that Var ( X 0 ) = I {\displaystyle {\mbox{Var}}(X_{0})=I} . The values of β t {\displaystyle \beta _{t}} are chosen such that for any starting distribution of x 0 {\displaystyle x_{0}} , if it has finite second moment, then lim t → ∞ x t | x 0 {\displaystyle \lim _{t\to \infty }x_{t}|x_{0}} converges to N ( 0 , I ) {\displaystyle {\mathcal {N}}(0,I)} . The entire diffusion process then satisfies q ( x 0 : T ) = q ( x 0 ) q ( x 1 | x 0 ) ⋯ q ( x T | x T − 1 ) = q ( x 0 ) N ( x 1 | α 1 x 0 , β 1 I ) ⋯ N ( x T | α T x T − 1 , β T I ) {\displaystyle q(x_{0:T})=q(x_{0})q(x_{1}|x_{0})\cdots q(x_{T}|x_{T-1})=q(x_{0}){\mathcal {N}}(x_{1}|{\sqrt {\alpha _{1}}}x_{0},\beta _{1}I)\cdots {\mathcal {N}}(x_{T}|{\sqrt {\alpha _{T}}}x_{T-1},\beta _{T}I)} or ln ⁡ q ( x 0 : T ) = ln ⁡ q ( x 0 ) − ∑ t = 1 T 1 2 β t ‖ x t − 1 − β t x t − 1 ‖ 2 + C {\displaystyle \ln q(x_{0:T})=\ln q(x_{0})-\sum _{t=1}^{T}{\frac {1}{2\beta _{t}}}\|x_{t}-{\sqrt {1-\beta _{t}}}x_{t-1}\|^{2}+C} where C {\displaystyle C} is a normalization constant and often omitted. In particular, we note that x 1 : T | x 0 {\displaystyle x_{1:T}|x_{0}} is a Gaussian process, which affords us considerable freedom in reparameterization. For example, by standard manipulation with Gaussian process, x t | x 0 ∼ N ( α ¯ t x 0 , σ t 2 I ) {\displaystyle x_{t}|x_{0}\sim N\left({\sqrt {{\bar {\alpha }}_{t}}}x_{0},\sigma _{t}^{2}I\right)} x t − 1 | x t , x 0 ∼ N ( μ ~ t ( x t , x 0 ) , σ ~ t 2 I ) {\displaystyle x_{t-1}|x_{t},x_{0}\sim {\mathcal {N}}({\tilde {\mu }}_{t}(x_{t},x_{0}),{\tilde {\sigma }}_{t}^{2}I)} In particular, notice that for large t {\displaystyle t} , the variable x t | x 0 ∼ N ( α ¯ t x 0 , σ t 2 I ) {\displaystyle x_{t}|x_{0}\sim N\left({\sqrt {{\bar {\alpha }}_{t}}}x_{0},\sigma _{t}^{2}I\right)} converges to N ( 0 , I ) {\displaystyle {\mathcal {N}}(0,I)} . That is, after a long enough diffusion process, we end up with some x T {\displaystyle x_{T}} that is very close to N ( 0 , I ) {\displaystyle {\mathcal {N}}(0,I)} , with all traces of the original x 0 ∼ q {\displaystyle x_{0}\sim q} gone. For example, since x t | x 0 ∼ N ( α ¯ t x 0 , σ t 2 I ) {\displaystyle x_{t}|x_{0}\sim N\left({\sqrt {{\bar {\alpha }}_{t}}}x_{0},\sigma _{t}^{2}I\right)} we can sample x t | x 0 {\displaystyle x_{t}|x_{0}} directly "in one step", instead of going through all the intermediate steps x 1 , x 2 , . . . , x t − 1 {\displaystyle x_{1},x_{2},...,x_{t-1}} . ==== Backward diffusion ==== The key idea of DDPM is to use a neural network parametrized by θ {\displaystyle \theta } . The network takes in two arguments x t , t {\displaystyle x_{t},t} , and outputs a vector μ θ ( x t , t ) {\displaystyle \mu _{\theta }(x_{t},t)} and a matrix Σ θ ( x t , t ) {\displaystyle \Sigma _{\theta }(x_{t},t)} , such that each step in the forward diffusion process can be approximately undone by x t − 1 ∼ N ( μ θ ( x t , t ) , Σ θ ( x t , t ) ) {\displaystyle x_{t-1}\sim {\mathcal {N}}(\mu _{\theta }(x_{t},t),\Sigma _{\theta }(x_{t},t))} . This then gives us a backward diffusion process p θ {\displaystyle p_{\theta }} defined by p θ ( x T ) = N ( x T | 0 , I ) {\displaystyle p_{\theta }(x

Homogeneity blockmodeling

In mathematics applied to analysis of social structures, homogeneity blockmodeling is an approach in blockmodeling, which is best suited for a preliminary or main approach to valued networks, when a prior knowledge about these networks is not available. This is because homogeneity blockmodeling emphasizes the similarity of link (tie) strengths within the blocks over the pattern of links. In this approach, tie (link) values (or statistical data computed on them) are assumed to be equal (homogenous) within blocks. This approach to the generalized blockmodeling of valued networks was first proposed by Aleš Žiberna in 2007 with the basic idea, "that the inconsistency of an empirical block with its ideal block can be measured by within block variability of appropriate values". The newly–formed ideal blocks, which are appropriate for blockmodeling of valued networks, are then presented together with the definitions of their block inconsistencies. Similar approach to the homogeneity blockmodeling, dealing with direct approach for structural equivalence, was previously suggested by Stephen P. Borgatti and Martin G. Everett (1992).

Imix video cube

The Imix (also known as ImMix) Video Cube is one of the first computer non-linear editing systems that was a full broadcast quality online video finishing machine. After its release in 1994, Imix released a more advanced version, the Imix Turbo Cube, which boasted 4 channels of real time layered visual effects. It was a hardware computer system controlled by an Apple Macintosh computer.

Modes of variation

In statistics, modes of variation are a continuously indexed set of vectors or functions that are centered at a mean and are used to depict the variation in a population or sample. Typically, variation patterns in the data can be decomposed in descending order of eigenvalues with the directions represented by the corresponding eigenvectors or eigenfunctions. Modes of variation provide a visualization of this decomposition and an efficient description of variation around the mean. Both in principal component analysis (PCA) and in functional principal component analysis (FPCA), modes of variation play an important role in visualizing and describing the variation in the data contributed by each eigencomponent. In real-world applications, the eigencomponents and associated modes of variation aid to interpret complex data, especially in exploratory data analysis (EDA). == Formulation == Modes of variation are a natural extension of PCA and FPCA. === Modes of variation in PCA === If a random vector X = ( X 1 , X 2 , ⋯ , X p ) T {\displaystyle \mathbf {X} =(X_{1},X_{2},\cdots ,X_{p})^{T}} has the mean vector μ p {\displaystyle {\boldsymbol {\mu }}_{p}} , and the covariance matrix Σ p × p {\displaystyle \mathbf {\Sigma } _{p\times p}} with eigenvalues λ 1 ≥ λ 2 ≥ ⋯ ≥ λ p ≥ 0 {\displaystyle \lambda _{1}\geq \lambda _{2}\geq \cdots \geq \lambda _{p}\geq 0} and corresponding orthonormal eigenvectors e 1 , e 2 , ⋯ , e p {\displaystyle \mathbf {e} _{1},\mathbf {e} _{2},\cdots ,\mathbf {e} _{p}} , by eigendecomposition of a real symmetric matrix, the covariance matrix Σ {\displaystyle \mathbf {\Sigma } } can be decomposed as Σ = Q Λ Q T , {\displaystyle \mathbf {\Sigma } =\mathbf {Q} \mathbf {\Lambda } \mathbf {Q} ^{T},} where Q {\displaystyle \mathbf {Q} } is an orthogonal matrix whose columns are the eigenvectors of Σ {\displaystyle \mathbf {\Sigma } } , and Λ {\displaystyle \mathbf {\Lambda } } is a diagonal matrix whose entries are the eigenvalues of Σ {\displaystyle \mathbf {\Sigma } } . By the Karhunen–Loève expansion for random vectors, one can express the centered random vector in the eigenbasis X − μ = ∑ k = 1 p ξ k e k , {\displaystyle \mathbf {X} -{\boldsymbol {\mu }}=\sum _{k=1}^{p}\xi _{k}\mathbf {e} _{k},} where ξ k = e k T ( X − μ ) {\displaystyle \xi _{k}=\mathbf {e} _{k}^{T}(\mathbf {X} -{\boldsymbol {\mu }})} is the principal component associated with the k {\displaystyle k} -th eigenvector e k {\displaystyle \mathbf {e} _{k}} , with the properties E ⁡ ( ξ k ) = 0 , Var ⁡ ( ξ k ) = λ k , {\displaystyle \operatorname {E} (\xi _{k})=0,\operatorname {Var} (\xi _{k})=\lambda _{k},} and E ⁡ ( ξ k ξ l ) = 0 for l ≠ k . {\displaystyle \operatorname {E} (\xi _{k}\xi _{l})=0\ {\text{for}}\ l\neq k.} Then the k {\displaystyle k} -th mode of variation of X {\displaystyle \mathbf {X} } is the set of vectors, indexed by α {\displaystyle \alpha } , m k , α = μ ± α λ k e k , α ∈ [ − A , A ] , {\displaystyle \mathbf {m} _{k,\alpha }={\boldsymbol {\mu }}\pm \alpha {\sqrt {\lambda _{k}}}\mathbf {e} _{k},\alpha \in [-A,A],} where A {\displaystyle A} is typically selected as 2 or 3 {\displaystyle 2\ {\text{or}}\ 3} . === Modes of variation in FPCA === For a square-integrable random function X ( t ) , t ∈ T ⊂ R p {\displaystyle X(t),t\in {\mathcal {T}}\subset R^{p}} , where typically p = 1 {\displaystyle p=1} and T {\displaystyle {\mathcal {T}}} is an interval, denote the mean function by μ ( t ) = E ⁡ ( X ( t ) ) {\displaystyle \mu (t)=\operatorname {E} (X(t))} , and the covariance function by G ( s , t ) = Cov ⁡ ( X ( s ) , X ( t ) ) = ∑ k = 1 ∞ λ k φ k ( s ) φ k ( t ) , {\displaystyle G(s,t)=\operatorname {Cov} (X(s),X(t))=\sum _{k=1}^{\infty }\lambda _{k}\varphi _{k}(s)\varphi _{k}(t),} where λ 1 ≥ λ 2 ≥ ⋯ ≥ 0 {\displaystyle \lambda _{1}\geq \lambda _{2}\geq \cdots \geq 0} are the eigenvalues and { φ 1 , φ 2 , ⋯ } {\displaystyle \{\varphi _{1},\varphi _{2},\cdots \}} are the orthonormal eigenfunctions of the linear Hilbert–Schmidt operator G : L 2 ( T ) → L 2 ( T ) , G ( f ) = ∫ T G ( s , t ) f ( s ) d s . {\displaystyle G:L^{2}({\mathcal {T}})\rightarrow L^{2}({\mathcal {T}}),\,G(f)=\int _{\mathcal {T}}G(s,t)f(s)ds.} By the Karhunen–Loève theorem, one can express the centered function in the eigenbasis, X ( t ) − μ ( t ) = ∑ k = 1 ∞ ξ k φ k ( t ) , {\displaystyle X(t)-\mu (t)=\sum _{k=1}^{\infty }\xi _{k}\varphi _{k}(t),} where ξ k = ∫ T ( X ( t ) − μ ( t ) ) φ k ( t ) d t {\displaystyle \xi _{k}=\int _{\mathcal {T}}(X(t)-\mu (t))\varphi _{k}(t)dt} is the k {\displaystyle k} -th principal component with the properties E ⁡ ( ξ k ) = 0 , Var ⁡ ( ξ k ) = λ k , {\displaystyle \operatorname {E} (\xi _{k})=0,\operatorname {Var} (\xi _{k})=\lambda _{k},} and E ⁡ ( ξ k ξ l ) = 0 for l ≠ k . {\displaystyle \operatorname {E} (\xi _{k}\xi _{l})=0{\text{ for }}l\neq k.} Then the k {\displaystyle k} -th mode of variation of X ( t ) {\displaystyle X(t)} is the set of functions, indexed by α {\displaystyle \alpha } , m k , α ( t ) = μ ( t ) ± α λ k φ k ( t ) , t ∈ T , α ∈ [ − A , A ] {\displaystyle m_{k,\alpha }(t)=\mu (t)\pm \alpha {\sqrt {\lambda _{k}}}\varphi _{k}(t),\ t\in {\mathcal {T}},\ \alpha \in [-A,A]} that are viewed simultaneously over the range of α {\displaystyle \alpha } , usually for A = 2 or 3 {\displaystyle A=2\ {\text{or}}\ 3} . == Estimation == The formulation above is derived from properties of the population. Estimation is needed in real-world applications. The key idea is to estimate mean and covariance. === Modes of variation in PCA === Suppose the data x 1 , x 2 , ⋯ , x n {\displaystyle \mathbf {x} _{1},\mathbf {x} _{2},\cdots ,\mathbf {x} _{n}} represent n {\displaystyle n} independent drawings from some p {\displaystyle p} -dimensional population X {\displaystyle \mathbf {X} } with mean vector μ {\displaystyle {\boldsymbol {\mu }}} and covariance matrix Σ {\displaystyle \mathbf {\Sigma } } . These data yield the sample mean vector x ¯ {\displaystyle {\overline {\mathbf {x} }}} , and the sample covariance matrix S {\displaystyle \mathbf {S} } with eigenvalue-eigenvector pairs ( λ ^ 1 , e ^ 1 ) , ( λ ^ 2 , e ^ 2 ) , ⋯ , ( λ ^ p , e ^ p ) {\displaystyle ({\hat {\lambda }}_{1},{\hat {\mathbf {e} }}_{1}),({\hat {\lambda }}_{2},{\hat {\mathbf {e} }}_{2}),\cdots ,({\hat {\lambda }}_{p},{\hat {\mathbf {e} }}_{p})} . Then the k {\displaystyle k} -th mode of variation of X {\displaystyle \mathbf {X} } can be estimated by m ^ k , α = x ¯ ± α λ ^ k e ^ k , α ∈ [ − A , A ] . {\displaystyle {\hat {\mathbf {m} }}_{k,\alpha }={\overline {\mathbf {x} }}\pm \alpha {\sqrt {{\hat {\lambda }}_{k}}}{\hat {\mathbf {e} }}_{k},\alpha \in [-A,A].} === Modes of variation in FPCA === Consider n {\displaystyle n} realizations X 1 ( t ) , X 2 ( t ) , ⋯ , X n ( t ) {\displaystyle X_{1}(t),X_{2}(t),\cdots ,X_{n}(t)} of a square-integrable random function X ( t ) , t ∈ T {\displaystyle X(t),t\in {\mathcal {T}}} with the mean function μ ( t ) = E ⁡ ( X ( t ) ) {\displaystyle \mu (t)=\operatorname {E} (X(t))} and the covariance function G ( s , t ) = Cov ⁡ ( X ( s ) , X ( t ) ) {\displaystyle G(s,t)=\operatorname {Cov} (X(s),X(t))} . Functional principal component analysis provides methods for the estimation of μ ( t ) {\displaystyle \mu (t)} and G ( s , t ) {\displaystyle G(s,t)} in detail, often involving point wise estimate and interpolation. Substituting estimates for the unknown quantities, the k {\displaystyle k} -th mode of variation of X ( t ) {\displaystyle X(t)} can be estimated by m ^ k , α ( t ) = μ ^ ( t ) ± α λ ^ k φ ^ k ( t ) , t ∈ T , α ∈ [ − A , A ] . {\displaystyle {\hat {m}}_{k,\alpha }(t)={\hat {\mu }}(t)\pm \alpha {\sqrt {{\hat {\lambda }}_{k}}}{\hat {\varphi }}_{k}(t),t\in {\mathcal {T}},\alpha \in [-A,A].} == Applications == Modes of variation are useful to visualize and describe the variation patterns in the data sorted by the eigenvalues. In real-world applications, modes of variation associated with eigencomponents allow to interpret complex data, such as the evolution of function traits and other infinite-dimensional data. To illustrate how modes of variation work in practice, two examples are shown in the graphs to the right, which display the first two modes of variation. The solid curve represents the sample mean function. The dashed, dot-dashed, and dotted curves correspond to modes of variation with α = ± 1 , ± 2 , {\displaystyle \alpha =\pm 1,\pm 2,} and ± 3 {\displaystyle \pm 3} , respectively. The first graph displays the first two modes of variation of female mortality data from 41 countries in 2003. The object of interest is log hazard function between ages 0 and 100 years. The first mode of variation suggests that the variation of female mortality is smaller for ages around 0 or 100, and larger for ages around 25. An appropriate and intuitive interpretation is that mortality around 25 is driven by accidental death, while around 0 or 100, mortality is related to congenital disease or natural death. Compared to female mortality