AI Face Live

AI Face Live — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • WaveNet

    WaveNet

    WaveNet is a deep neural network for generating raw audio. It was created by researchers at London-based AI firm DeepMind. The technique, outlined in a paper in September 2016, is able to generate relatively realistic-sounding human-like voices by directly modelling waveforms using a neural network method trained with recordings of real speech. Tests with US English and Mandarin reportedly showed that the system outperforms Google's best existing text-to-speech (TTS) systems, although as of 2016 its text-to-speech synthesis still was less convincing than actual human speech. WaveNet's ability to generate raw waveforms means that it can model any kind of audio, including music. == History == Generating speech from text is an increasingly common task thanks to the popularity of software such as Apple's Siri, Microsoft's Cortana, Amazon Alexa and the Google Assistant. Most such systems use a variation of a technique that involves concatenated sound fragments together to form recognisable sounds and words. The most common of these is called concatenative TTS. It consists of large library of speech fragments, recorded from a single speaker that are then concatenated to produce complete words and sounds. The result sounds unnatural, with an odd cadence and tone. The reliance on a recorded library also makes it difficult to modify or change the voice. Another technique, known as parametric TTS, uses mathematical models to recreate sounds that are then assembled into words and sentences. The information required to generate the sounds is stored in the parameters of the model. The characteristics of the output speech are controlled via the inputs to the model, while the speech is typically created using a voice synthesiser known as a vocoder. This can also result in unnatural sounding audio. == Design and ongoing research == === Background === WaveNet is a type of feedforward neural network known as a deep convolutional neural network (CNN). In WaveNet, the CNN takes a raw signal as an input and synthesises an output one sample at a time. It does so by sampling from a softmax (i.e. categorical) distribution of a signal value that is encoded using μ-law companding transformation and quantized to 256 possible values. === Initial concept and results === According to the original September 2016 DeepMind research paper WaveNet: A Generative Model for Raw Audio, the network was fed real waveforms of speech in English and Mandarin. As these pass through the network, it learns a set of rules to describe how the audio waveform evolves over time. The trained network can then be used to create new speech-like waveforms at 16,000 samples per second. These waveforms include realistic breaths and lip smacks – but do not conform to any language. WaveNet is able to accurately model different voices, with the accent and tone of the input correlating with the output. For example, if it is trained with German, it produces German speech. The capability also means that if the WaveNet is fed other inputs – such as music – its output will be musical. At the time of its release, DeepMind showed that WaveNet could produce waveforms that sound like classical music. === Content (voice) swapping === According to the June 2018 paper Disentangled Sequential Autoencoder, DeepMind has successfully used WaveNet for audio and voice "content swapping": the network can swap the voice on an audio recording for another, pre-existing voice while maintaining the text and other features from the original recording. "We also experiment on audio sequence data. Our disentangled representation allows us to convert speaker identities into each other while conditioning on the content of the speech." (p. 5) "For audio, this allows us to convert a male speaker into a female speaker and vice versa [...]." (p. 1) According to the paper, a two-digit minimum amount of hours (c. 50 hours) of pre-existing speech recordings of both source and target voice are required to be fed into WaveNet for the program to learn their individual features before it is able to perform the conversion from one voice to another at a satisfying quality. The authors stress that "[a]n advantage of the model is that it separates dynamical from static features [...]." (p. 8), i. e. WaveNet is capable of distinguishing between the spoken text and modes of delivery (modulation, speed, pitch, mood, etc.) to maintain during the conversion from one voice to another on the one hand, and the basic features of both source and target voices that it is required to swap on the other. The January 2019 follow-up paper Unsupervised speech representation learning using WaveNet autoencoders details a method to successfully enhance the proper automatic recognition and discrimination between dynamical and static features for "content swapping", notably including swapping voices on existing audio recordings, in order to make it more reliable. Another follow-up paper, Sample Efficient Adaptive Text-to-Speech, dated September 2018 (latest revision January 2019), states that DeepMind has successfully reduced the minimum amount of real-life recordings required to sample an existing voice via WaveNet to "merely a few minutes of audio data" while maintaining high-quality results. Its ability to clone voices has raised ethical concerns about WaveNet's ability to mimic the voices of living and dead persons. According to a 2016 BBC article, companies working on similar voice-cloning technologies (such as Adobe Voco) intend to insert watermarking inaudible to humans to prevent counterfeiting, while maintaining that voice cloning satisfying, for instance, the needs of entertainment-industry purposes would be of a far lower complexity and use different methods than required to fool forensic evidencing methods and electronic ID devices, so that natural voices and voices cloned for entertainment-industry purposes could still be easily told apart by technological analysis. == Applications == At the time of its release, DeepMind said that WaveNet required too much computational processing power to be used in real world applications. As of October 2017, Google announced a 1,000-fold performance improvement along with better voice quality. WaveNet was then used to generate Google Assistant voices for US English and Japanese across all Google platforms. In November 2017, DeepMind researchers released a research paper detailing a proposed method of "generating high-fidelity speech samples at more than 20 times faster than real-time", called "Probability Density Distillation". At the annual I/O developer conference in May 2018, it was announced that new Google Assistant voices were available and made possible by WaveNet; WaveNet greatly reduced the number of audio recordings that were required to create a voice model by modeling the raw audio of the voice actor samples.

    Read more →
  • Feature selection

    Feature selection

    In machine learning, feature selection is the process of selecting a subset of relevant features (variables, predictors) for use in model construction. Feature selection techniques are used for several reasons: simplification of models to make them easier to interpret, shorter training times, to avoid the curse of dimensionality, improve the compatibility of the data with a certain learning model class, to encode inherent symmetries present in the input space. The central premise when using feature selection is that data sometimes contains features that are redundant or irrelevant, and can thus be removed without incurring much loss of information. Redundancy and irrelevance are two distinct notions, since one relevant feature may be redundant in the presence of another relevant feature with which it is strongly correlated. Feature extraction creates new features from functions of the original features, whereas feature selection finds a subset of the features. Feature selection techniques are often used in domains where there are many features and comparatively few samples (data points). == Introduction == A feature selection algorithm can be seen as the combination of a search technique for proposing new feature subsets, along with an evaluation measure which scores the different feature subsets. The simplest algorithm is to test each possible subset of features finding the one which minimizes the error rate. This is an exhaustive search of the space, and is computationally intractable for all but the smallest of feature sets. The choice of evaluation metric heavily influences the algorithm, and it is these evaluation metrics which distinguish between the three main categories of feature selection algorithms: wrappers, filters and embedded methods. Wrapper methods use a predictive model to score feature subsets. Each new subset is used to train a model, which is tested on a hold-out set. Counting the number of mistakes made on that hold-out set (the error rate of the model) gives the score for that subset. As wrapper methods train a new model for each subset, they are very computationally intensive, but usually provide the best performing feature set for that particular type of model or typical problem. Filter methods use a proxy measure instead of the error rate to score a feature subset. This measure is chosen to be fast to compute, while still capturing the usefulness of the feature set. Common measures include the mutual information, the pointwise mutual information, Pearson product-moment correlation coefficient, Relief-based algorithms, and inter/intra class distance or the scores of significance tests for each class/feature combinations. Filters are usually less computationally intensive than wrappers, but they produce a feature set which is not tuned to a specific type of predictive model. This lack of tuning means a feature set from a filter is more general than the set from a wrapper, usually giving lower prediction performance than a wrapper. However the feature set doesn't contain the assumptions of a prediction model, and so is more useful for exposing the relationships between the features. Many filters provide a feature ranking rather than an explicit best feature subset, and the cut off point in the ranking is chosen via cross-validation. Filter methods have also been used as a preprocessing step for wrapper methods, allowing a wrapper to be used on larger problems. One other popular approach is the Recursive Feature Elimination algorithm, commonly used with Support Vector Machines to repeatedly construct a model and remove features with low weights. Embedded methods are a catch-all group of techniques which perform feature selection as part of the model construction process. The exemplar of this approach is the LASSO method for constructing a linear model, which penalizes the regression coefficients with an L1 penalty, shrinking many of them to zero. Any features which have non-zero regression coefficients are 'selected' by the LASSO algorithm. Improvements to the LASSO include Bolasso which bootstraps samples; Elastic net regularization, which combines the L1 penalty of LASSO with the L2 penalty of ridge regression; and FeaLect which scores all the features based on combinatorial analysis of regression coefficients. AEFS further extends LASSO to nonlinear scenario with autoencoders. These approaches tend to be between filters and wrappers in terms of computational complexity. In traditional regression analysis, the most popular form of feature selection is stepwise regression, which is a wrapper technique. It is a greedy algorithm that adds the best feature (or deletes the worst feature) at each round. The main control issue is deciding when to stop the algorithm. In machine learning, this is typically done by cross-validation. In statistics, some criteria are optimized. This leads to the inherent problem of nesting. More robust methods have been explored, such as branch and bound and piecewise linear network. == Subset selection == Subset selection evaluates a subset of features as a group for suitability. Subset selection algorithms can be broken up into wrappers, filters, and embedded methods. Wrappers use a search algorithm to search through the space of possible features and evaluate each subset by running a model on the subset. Wrappers can be computationally expensive and have a risk of over fitting to the model. Filters are similar to wrappers in the search approach, but instead of evaluating against a model, a simpler filter is evaluated. Embedded techniques are embedded in, and specific to, a model. Many popular search approaches use greedy hill climbing, which iteratively evaluates a candidate subset of features, then modifies the subset and evaluates if the new subset is an improvement over the old. Evaluation of the subsets requires a scoring metric that grades a subset of features. Exhaustive search is generally impractical, so at some implementor (or operator) defined stopping point, the subset of features with the highest score discovered up to that point is selected as the satisfactory feature subset. The stopping criterion varies by algorithm; possible criteria include: a subset score exceeds a threshold, a program's maximum allowed run time has been surpassed, etc. Alternative search-based techniques are based on targeted projection pursuit which finds low-dimensional projections of the data that score highly: the features that have the largest projections in the lower-dimensional space are then selected. Search approaches include: Exhaustive Best first Simulated annealing Genetic algorithm Greedy forward selection Greedy backward elimination Particle swarm optimization Targeted projection pursuit Scatter search Variable neighborhood search Two popular filter metrics for classification problems are correlation and mutual information, although neither are true metrics or 'distance measures' in the mathematical sense, since they fail to obey the triangle inequality and thus do not compute any actual 'distance' – they should rather be regarded as 'scores'. These scores are computed between a candidate feature (or set of features) and the desired output category. There are, however, true metrics that are a simple function of the mutual information; see here. Other available filter metrics include: Class separability Error probability Inter-class distance Probabilistic distance Entropy Consistency-based feature selection Correlation-based feature selection == Optimality criteria == The choice of optimality criteria is difficult as there are multiple objectives in a feature selection task. Many common criteria incorporate a measure of accuracy, penalised by the number of features selected. Examples include Akaike information criterion (AIC) and Mallows's Cp, which have a penalty of 2 for each added feature. AIC is based on information theory, and is effectively derived via the maximum entropy principle. Other criteria are Bayesian information criterion (BIC), which uses a penalty of log ⁡ n {\displaystyle {\sqrt {\log {n}}}} for each added feature, minimum description length (MDL) which asymptotically uses log ⁡ n {\displaystyle {\sqrt {\log {n}}}} , Bonferroni / RIC which use 2 log ⁡ p {\displaystyle {\sqrt {2\log {p}}}} , maximum dependency feature selection, and a variety of new criteria that are motivated by false discovery rate (FDR), which use something close to 2 log ⁡ p q {\displaystyle {\sqrt {2\log {\frac {p}{q}}}}} . A maximum entropy rate criterion may also be used to select the most relevant subset of features. == Structure learning == Filter feature selection is a specific case of a more general paradigm called structure learning. Feature selection finds the relevant feature set for a specific target variable whereas structure learning finds the relationships between all the variables, usually by expressing these relationships as a graph. The most common structure learning algorithms

    Read more →
  • Bayesian hierarchical modeling

    Bayesian hierarchical modeling

    Bayesian hierarchical modelling is a statistical model written in multiple levels (hierarchical form) that estimates the posterior distribution of model parameters using the Bayesian method. The sub-models combine to form the hierarchical model, and Bayes' theorem is used to integrate them with the observed data and account for all the uncertainty that is present. This integration enables calculation of updated posterior over the (hyper)parameters, effectively updating prior beliefs in light of the observed data. Frequentist statistics may yield conclusions seemingly incompatible with those offered by Bayesian statistics due to the Bayesian treatment of the parameters as random variables and its use of subjective information in establishing assumptions on these parameters. As the approaches answer different questions the formal results are not technically contradictory but the two approaches disagree over which answer is relevant to particular applications. Bayesians argue that relevant information regarding decision-making and updating beliefs cannot be ignored and that hierarchical modeling has the potential to overrule classical methods in applications where respondents give multiple observational data. Moreover, the model has proven to be robust, with the posterior distribution less sensitive to the more flexible hierarchical priors. Hierarchical modeling, as its name implies, retains nested data structure, and is used when information is available at several different levels of observational units. For example, in epidemiological modeling to describe infection trajectories for multiple countries, observational units are countries, and each country has its own time-based profile of daily infected cases. In decline curve analysis to describe oil or gas production decline curve for multiple wells, observational units are oil or gas wells in a reservoir region, and each well has each own time-based profile of oil or gas production rates (usually, barrels per month). Hierarchical modeling is used to devise computation based strategies for multiparameter problems. == Philosophy == Statistical methods and models commonly involve multiple parameters that can be regarded as related or connected in such a way that the problem implies a dependence of the joint probability model for these parameters. Individual degrees of belief, expressed in the form of probabilities, come with uncertainty. Amidst this is the change of the degrees of belief over time. As was stated by Professor José M. Bernardo and Professor Adrian F. Smith, "The actuality of the learning process consists in the evolution of individual and subjective beliefs about the reality." These subjective probabilities are more directly involved in the mind rather than the physical probabilities. Hence, it is with this need of updating beliefs that Bayesians have formulated an alternative statistical model which takes into account the prior occurrence of a particular event. == Bayes' theorem == The assumed occurrence of a real-world event will typically modify preferences between certain options. This is done by modifying the degrees of belief attached, by an individual, to the events defining the options. Suppose in a study of the effectiveness of cardiac treatments, with the patients in hospital j having survival probability θ j {\displaystyle \theta _{j}} , the survival probability will be updated with the occurrence of y, the event in which a controversial serum is created which, as believed by some, increases survival in cardiac patients. In order to make updated probability statements about θ j {\displaystyle \theta _{j}} , given the occurrence of event y, we must begin with a model providing a joint probability distribution for θ j {\displaystyle \theta _{j}} and y. This can be written as a product of the two distributions that are often referred to as the prior distribution P ( θ ) {\displaystyle P(\theta )} and the sampling distribution P ( y ∣ θ ) {\displaystyle P(y\mid \theta )} respectively: P ( θ , y ) = P ( θ ) P ( y ∣ θ ) {\displaystyle P(\theta ,y)=P(\theta )P(y\mid \theta )} Using the basic property of conditional probability, the posterior distribution will yield: P ( θ ∣ y ) = P ( θ , y ) P ( y ) = P ( y ∣ θ ) P ( θ ) P ( y ) {\displaystyle P(\theta \mid y)={\frac {P(\theta ,y)}{P(y)}}={\frac {P(y\mid \theta )P(\theta )}{P(y)}}} This equation, showing the relationship between the conditional probability and the individual events, is known as Bayes' theorem. This simple expression encapsulates the technical core of Bayesian inference which aims to deconstruct the probability, P ( θ ∣ y ) {\displaystyle P(\theta \mid y)} , relative to solvable subsets of its supportive evidence. == Exchangeability == The usual starting point of a statistical analysis is the assumption that the n values y 1 , y 2 , … , y n {\displaystyle y_{1},y_{2},\ldots ,y_{n}} are exchangeable. If no information – other than data y – is available to distinguish any of the θ j {\displaystyle \theta _{j}} 's from any others, and no ordering or grouping of the parameters can be made, one must assume symmetry of prior distribution parameters. This symmetry is represented probabilistically by exchangeability. Generally, it is useful and appropriate to model data from an exchangeable distribution as independently and identically distributed, given some unknown parameter vector θ {\displaystyle \theta } , with distribution P ( θ ) {\displaystyle P(\theta )} . === Finite exchangeability === For a fixed number n, the set y 1 , y 2 , … , y n {\displaystyle y_{1},y_{2},\ldots ,y_{n}} is exchangeable if the joint probability P ( y 1 , y 2 , … , y n ) {\displaystyle P(y_{1},y_{2},\ldots ,y_{n})} is invariant under permutations of the indices. That is, for every permutation π {\displaystyle \pi } or ( π 1 , π 2 , … , π n ) {\displaystyle (\pi _{1},\pi _{2},\ldots ,\pi _{n})} of (1, 2, …, n), P ( y 1 , y 2 , … , y n ) = P ( y π 1 , y π 2 , … , y π n ) . {\displaystyle P(y_{1},y_{2},\ldots ,y_{n})=P(y_{\pi _{1}},y_{\pi _{2}},\ldots ,y_{\pi _{n}}).} The following is an exchangeable, but not independent and identical (iid), example: Consider an urn with a red ball and a blue ball inside, with probability 1 2 {\displaystyle {\frac {1}{2}}} of drawing either. Balls are drawn without replacement, i.e. after one ball is drawn from the n {\displaystyle n} balls, there will be n − 1 {\displaystyle n-1} remaining balls left for the next draw. Let Y i = { 1 , if the i th ball is red , 0 , otherwise . {\displaystyle {\text{Let }}Y_{i}={\begin{cases}1,&{\text{if the }}i{\text{th ball is red}},\\0,&{\text{otherwise}}.\end{cases}}} The probability of selecting a red ball in the first draw and a blue ball in the second draw is equal to the probability of selecting a blue ball on the first draw and a red on the second, both of which are 1/2: P ( y 1 = 1 , y 2 = 0 ) = P ( y 1 = 0 , y 2 = 1 ) = 1 2 {\displaystyle P(y_{1}=1,y_{2}=0)=P(y_{1}=0,y_{2}=1)={\frac {1}{2}}} . This makes y 1 {\displaystyle y_{1}} and y 2 {\displaystyle y_{2}} exchangeable. But the probability of selecting a red ball on the second draw given that the red ball has already been selected in the first is 0. This is not equal to the probability that the red ball is selected in the second draw, which is 1/2: P ( y 2 = 1 ∣ y 1 = 1 ) = 0 ≠ P ( y 2 = 1 ) = 1 2 {\displaystyle P(y_{2}=1\mid y_{1}=1)=0\neq P(y_{2}=1)={\frac {1}{2}}} . Thus, y 1 {\displaystyle y_{1}} and y 2 {\displaystyle y_{2}} are not independent. If x 1 , … , x n {\displaystyle x_{1},\ldots ,x_{n}} are independent and identically distributed, then they are exchangeable, but the converse is not necessarily true. === Infinite exchangeability === Infinite exchangeability is the property that every finite subset of an infinite sequence y 1 {\displaystyle y_{1}} , y 2 , … {\displaystyle y_{2},\ldots } is exchangeable. For any n, the sequence y 1 , y 2 , … , y n {\displaystyle y_{1},y_{2},\ldots ,y_{n}} is exchangeable. == Hierarchical models == === Components === Bayesian hierarchical modeling makes use of two important concepts in deriving the posterior distribution, namely: Hyperparameters: parameters of the prior distribution Hyperpriors: distributions of Hyperparameters Suppose a random variable Y follows a normal distribution with parameter θ {\displaystyle \theta } as the mean and 1 as the variance, that is Y ∣ θ ∼ N ( θ , 1 ) {\displaystyle Y\mid \theta \sim N(\theta ,1)} . The tilde relation ∼ {\displaystyle \sim } can be read as "has the distribution of" or "is distributed as". Suppose also that the parameter θ {\displaystyle \theta } has a distribution given by a normal distribution with mean μ {\displaystyle \mu } and variance 1, i.e. θ ∣ μ ∼ N ( μ , 1 ) {\displaystyle \theta \mid \mu \sim N(\mu ,1)} . Furthermore, μ {\displaystyle \mu } follows another distribution given, for example, by the standard normal distribution, N ( 0 , 1 ) {\displaystyle {\text{N}}(0,1)} . The parameter μ {\dis

    Read more →
  • Principal component analysis

    Principal component analysis

    Principal component analysis (PCA) is a linear dimensionality reduction technique with applications in exploratory data analysis, visualization and data preprocessing. The data are linearly transformed onto a new coordinate system such that the directions (principal components) capturing the largest variation in the data can be easily identified. The principal components of a collection of points in a real coordinate space are a sequence of p {\displaystyle p} unit vectors, where the i {\displaystyle i} -th vector is the direction of a line that best fits the data while being orthogonal to the first i − 1 {\displaystyle i-1} vectors. Here, a best-fitting line is defined as one that minimizes the average squared perpendicular distance from the points to the line. These directions (i.e., principal components) constitute an orthonormal basis in which different individual dimensions of the data are linearly uncorrelated. Many studies use the first two principal components in order to plot the data in two dimensions and to visually identify clusters of closely related data points. Principal component analysis has applications in many fields such as population genetics, microbiome studies, and atmospheric science. == Overview == When performing PCA, the first principal component of a set of p {\displaystyle p} variables is the derived variable formed as a linear combination of the original variables that explains the most variance. The second principal component explains the most variance in what is left once the effect of the first component is removed, and we may proceed through p {\displaystyle p} iterations until all the variance is explained. PCA is most commonly used when many of the variables are highly correlated with each other and it is desirable to reduce their number to an independent set. The first principal component can equivalently be defined as a direction that maximizes the variance of the projected data. The i {\displaystyle i} -th principal component can be taken as a direction orthogonal to the first i − 1 {\displaystyle i-1} principal components that maximizes the variance of the projected data. For either objective, it can be shown that the principal components are eigenvectors of the data's covariance matrix. Thus, the principal components are often computed by eigendecomposition of the data covariance matrix or singular value decomposition of the data matrix. PCA is the simplest of the true eigenvector-based multivariate analyses and is closely related to factor analysis. Factor analysis typically incorporates more domain-specific assumptions about the underlying structure and solves eigenvectors of a slightly different matrix. PCA is also related to canonical correlation analysis (CCA). CCA defines coordinate systems that optimally describe the cross-covariance between two datasets while PCA defines a new orthogonal coordinate system that optimally describes variance in a single dataset. Robust and L1-norm-based variants of standard PCA have also been proposed. == History == PCA was invented in 1901 by Karl Pearson, as an analogue of the principal axis theorem in mechanics; it was later independently developed and named by Harold Hotelling in the 1930s. Depending on the field of application, it is also named the discrete Karhunen–Loève transform (KLT) in signal processing, the Hotelling transform in multivariate quality control, proper orthogonal decomposition (POD) in mechanical engineering, singular value decomposition (SVD) of X (invented in the last quarter of the 19th century), eigenvalue decomposition (EVD) of XTX in linear algebra, factor analysis (for a discussion of the differences between PCA and factor analysis see Ch. 7 of Jolliffe's Principal Component Analysis), Eckart–Young theorem (Harman, 1960), or empirical orthogonal functions (EOF) in meteorological science (Lorenz, 1956), empirical eigenfunction decomposition (Sirovich, 1987), quasiharmonic modes (Brooks et al., 1988), spectral decomposition in noise and vibration, and empirical modal analysis in structural dynamics. == Intuition == PCA can be thought of as fitting a p-dimensional ellipsoid to the data, where each axis of the ellipsoid represents a principal component. If some axis of the ellipsoid is small, then the variance along that axis is also small. To find the axes of the ellipsoid, we must first center the values of each variable in the dataset on 0 by subtracting the mean of the variable's observed values from each of those values. These transformed values are used instead of the original observed values for each of the variables. Then, we compute the covariance matrix of the data and calculate the eigenvalues and corresponding eigenvectors of this covariance matrix. Then we must normalize each of the orthogonal eigenvectors to turn them into unit vectors. Once this is done, each of the mutually-orthogonal unit eigenvectors can be interpreted as an axis of the ellipsoid fitted to the data. This choice of basis will transform the covariance matrix into a diagonalized form, in which the diagonal elements represent the variance of each axis. The proportion of the variance that each eigenvector represents can be calculated by dividing the eigenvalue corresponding to that eigenvector by the sum of all eigenvalues. Biplots and scree plots (degree of explained variance) are used to interpret findings of the PCA. == Details == PCA is defined as an orthogonal linear transformation on a real inner product space that transforms the data to a new coordinate system such that the greatest variance by some scalar projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. Consider an n × p {\displaystyle n\times p} data matrix, X, with column-wise zero empirical mean (the sample mean of each column has been shifted to zero), where each of the n rows represents a different repetition of the experiment, and each of the p columns gives a particular kind of feature (say, the results from a particular sensor). Mathematically, the transformation is defined by a set of size l {\displaystyle l} (where l {\displaystyle l} is usually selected to be strictly less than p {\displaystyle p} to reduce dimensionality) of p {\displaystyle p} -dimensional vectors of weights or coefficients w ( k ) = ( w 1 , … , w p ) ( k ) {\displaystyle \mathbf {w} _{(k)}=(w_{1},\dots ,w_{p})_{(k)}} that map each row vector x ( i ) = ( x 1 , … , x p ) ( i ) {\displaystyle \mathbf {x} _{(i)}=(x_{1},\dots ,x_{p})_{(i)}} of X to a new vector of principal component scores t ( i ) = ( t 1 , … , t l ) ( i ) {\displaystyle \mathbf {t} _{(i)}=(t_{1},\dots ,t_{l})_{(i)}} , given by t k ( i ) = x ( i ) ⋅ w ( k ) f o r i = 1 , … , n k = 1 , … , l {\displaystyle {t_{k}}_{(i)}=\mathbf {x} _{(i)}\cdot \mathbf {w} _{(k)}\qquad \mathrm {for} \qquad i=1,\dots ,n\qquad k=1,\dots ,l} in such a way that the individual variables t 1 , … , t l {\displaystyle t_{1},\dots ,t_{l}} of t considered over the data set successively inherit the maximum possible variance from X, with each coefficient vector w constrained to be a unit vector. The above may equivalently be written in matrix form as T = X W {\displaystyle \mathbf {T} =\mathbf {X} \mathbf {W} } where T i k = t k ( i ) {\displaystyle {\mathbf {T} }_{ik}={t_{k}}_{(i)}} , X i j = x j ( i ) {\displaystyle {\mathbf {X} }_{ij}={x_{j}}_{(i)}} , and W j k = w j ( k ) {\displaystyle {\mathbf {W} }_{jk}={w_{j}}_{(k)}} . === First component === In order to maximize variance, the first weight vector w(1) thus has to satisfy w ( 1 ) = arg ⁡ max ‖ w ‖ = 1 { ∑ i ( t 1 ) ( i ) 2 } = arg ⁡ max ‖ w ‖ = 1 { ∑ i ( x ( i ) ⋅ w ) 2 } {\displaystyle \mathbf {w} _{(1)}=\arg \max _{\Vert \mathbf {w} \Vert =1}\,\left\{\sum _{i}(t_{1})_{(i)}^{2}\right\}=\arg \max _{\Vert \mathbf {w} \Vert =1}\,\left\{\sum _{i}\left(\mathbf {x} _{(i)}\cdot \mathbf {w} \right)^{2}\right\}} Equivalently, writing this in matrix form gives w ( 1 ) = arg ⁡ max ‖ w ‖ = 1 { ‖ X w ‖ 2 } = arg ⁡ max ‖ w ‖ = 1 { w T X T X w } {\displaystyle \mathbf {w} _{(1)}=\arg \max _{\left\|\mathbf {w} \right\|=1}\left\{\left\|\mathbf {Xw} \right\|^{2}\right\}=\arg \max _{\left\|\mathbf {w} \right\|=1}\left\{\mathbf {w} ^{\mathsf {T}}\mathbf {X} ^{\mathsf {T}}\mathbf {Xw} \right\}} Since w(1) has been defined to be a unit vector, it equivalently also satisfies w ( 1 ) = arg ⁡ max { w T X T X w w T w } {\displaystyle \mathbf {w} _{(1)}=\arg \max \left\{{\frac {\mathbf {w} ^{\mathsf {T}}\mathbf {X} ^{\mathsf {T}}\mathbf {Xw} }{\mathbf {w} ^{\mathsf {T}}\mathbf {w} }}\right\}} The quantity to be maximised can be recognised as a Rayleigh quotient. A standard result for a positive semidefinite matrix such as XTX is that the quotient's maximum possible value is the largest eigenvalue of the matrix, which occurs when w is the corresponding eigenvector. With w(1) found, the first principal component of a data vector

    Read more →
  • Anthrobotics

    Anthrobotics

    Anthrobotics is the science of developing and studying robots that are either entirely or in some way human-like. The term anthrobotics was originally coined by Mark Rosheim in a paper entitled "Design of An Omnidirectional Arm" presented at the IEEE International Conference on Robotics and Automation, May 13–18, 1990, pp. 2162–2167. Rosheim says he derived the term from "...Anthropomorphic and Robotics to distinguish the new generation of dexterous robots from its simple industrial robot forebears." The word gained wider recognition as a result of its use in the title of Rosheim's subsequent book Robot Evolution: The Development of Anthrobotics, which focussed on facsimiles of human physical and psychological skills and attributes. However, a wider definition of the term anthrobotics has been proposed, in which the meaning is derived from anthropology rather than anthropomorphic. This usage includes robots that respond to input in a human-like fashion, rather than simply mimicking human actions, thus theoretically being able to respond more flexibly or to adapt to unforeseen circumstances. This expanded definition also encompasses robots that are situated in social environments with the ability to respond to those environments appropriately, such as insect robots, robotic pets, and the like. Anthrobotics is now taught at some universities, encouraging students not only to design and build robots for environments beyond current industrial applications, but also to speculate on the future of robotics that are embedded in the world at large, as mobile phones and computers are today. In 2016 philosopher Luis de Miranda created the Anthrobotics Cluster at the University of Edinburgh "a platform of cross-disciplinary research that seeks to investigate some of the biggest questions that will need to be answered" on the relationship between humans, robots and intelligent systems and "a think tank on the social spread of robotics, and also how automation is part of the definition of what humans have always been". to explore the symbiotic relationship between humans and automated protocols.

    Read more →
  • Ordinal regression

    Ordinal regression

    In statistics, ordinal regression, also called ordinal classification, is a type of regression analysis used for predicting an ordinal variable, i.e. a variable whose value exists on an arbitrary scale where only the relative ordering between different values is significant. It can be considered an intermediate problem between regression and classification. Examples of ordinal regression are ordered logit and ordered probit. Ordinal regression turns up often in the social sciences, for example in the modeling of human levels of preference (on a scale from, say, 1–5 for "very poor" through "excellent"), as well as in information retrieval. In machine learning, ordinal regression may also be called ranking learning. == Linear models for ordinal regression == Ordinal regression can be performed using a generalized linear model (GLM) that fits both a coefficient vector and a set of thresholds to a dataset. Suppose one has a set of observations, represented by length-p vectors x1 through xn, with associated responses y1 through yn, where each yi is an ordinal variable on a scale 1, ..., K. For simplicity, and without loss of generality, we assume y is a non-decreasing vector, that is, yi ≤ {\displaystyle \leq } yi+1. To this data, one fits a length-p coefficient vector w and a set of thresholds θ1, ..., θK−1 with the property that θ1 < θ2 < ... < θK−1. This set of thresholds divides the real number line into K disjoint segments, corresponding to the K response levels. The model can now be formulated as Pr ( y ≤ i ∣ x ) = σ ( θ i − w ⋅ x ) {\displaystyle \Pr(y\leq i\mid \mathbf {x} )=\sigma (\theta _{i}-\mathbf {w} \cdot \mathbf {x} )} or, the cumulative probability of the response y being at most i is given by a function σ (the inverse link function) applied to a linear function of x. Several choices exist for σ; the logistic function σ ( θ i − w ⋅ x ) = 1 1 + e − ( θ i − w ⋅ x ) {\displaystyle \sigma (\theta _{i}-\mathbf {w} \cdot \mathbf {x} )={\frac {1}{1+e^{-(\theta _{i}-\mathbf {w} \cdot \mathbf {x} )}}}} gives the ordered logit model, while using the CDF of the standard normal distribution gives the ordered probit model. A third option is to use an exponential function σ ( θ i − w ⋅ x ) = 1 − exp ⁡ ( − exp ⁡ ( θ i − w ⋅ x ) ) {\displaystyle \sigma (\theta _{i}-\mathbf {w} \cdot \mathbf {x} )=1-\exp(-\exp(\theta _{i}-\mathbf {w} \cdot \mathbf {x} ))} which gives the proportional hazards model. === Latent variable model === The probit version of the above model can be justified by assuming the existence of a real-valued latent variable (unobserved quantity) y, determined by y ∗ = w ⋅ x + ε {\displaystyle y^{}=\mathbf {w} \cdot \mathbf {x} +\varepsilon } where ε is normally distributed with zero mean and unit variance, conditioned on x. The response variable y results from an "incomplete measurement" of y, where one only determines the interval into which y falls: y = { 1 if y ∗ ≤ θ 1 , 2 if θ 1 < y ∗ ≤ θ 2 , 3 if θ 2 < y ∗ ≤ θ 3 ⋮ K if θ K − 1 < y ∗ . {\displaystyle y={\begin{cases}1&{\text{if}}~~y^{}\leq \theta _{1},\\2&{\text{if}}~~\theta _{1} Read more →

  • Sammon mapping

    Sammon mapping

    Sammon mapping or Sammon projection is an algorithm that maps a high-dimensional space to a space of lower dimensionality (see multidimensional scaling) by trying to preserve the structure of inter-point distances in high-dimensional space in the lower-dimension projection. It is particularly suited for use in exploratory data analysis. The method was proposed by John W. Sammon in 1969. It is considered a non-linear approach as the mapping cannot be represented as a linear combination of the original variables as possible in techniques such as principal component analysis, which also makes it more difficult to use for classification applications. Denote the distance between ith and jth objects in the original space by d i j ∗ {\displaystyle \scriptstyle d_{ij}^{}} , and the distance between their projections by d i j {\displaystyle \scriptstyle d_{ij}^{}} . Sammon's mapping aims to minimize the following error function, which is often referred to as Sammon's stress or Sammon's error: E = 1 ∑ i < j d i j ∗ ∑ i < j ( d i j ∗ − d i j ) 2 d i j ∗ . {\displaystyle E={\frac {1}{\sum \limits _{i Read more →

  • Universal approximation theorem

    Universal approximation theorem

    In the field of machine learning, the universal approximation theorems (UATs) state that neural networks with a certain structure can, in principle, approximate any continuous function to any desired degree of accuracy. These theorems provide a mathematical justification for using neural networks, assuring researchers that a sufficiently large or deep network can model the complex, non-linear relationships often found in real-world data. The best-known version of the theorem applies to feedforward networks with a single hidden layer. It states that if the layer's activation function is non-polynomial (which is true for common choices like the sigmoid function or ReLU), then the network can act as a "universal approximator." Universality is achieved by increasing the number of neurons in the hidden layer, making the network "wider." Other versions of the theorem show that universality can also be achieved by keeping the network's width fixed but increasing its number of layers, making it "deeper." These are existence theorems. They guarantee that a network with the right structure exists, but they do not provide a method for finding the network's parameters (training it), nor do they specify exactly how large the network must be for a given function. Finding a suitable network remains a practical challenge that is typically addressed with optimization algorithms like backpropagation. == Setup == Artificial neural networks are combinations of multiple simple mathematical functions that implement more complicated functions from (typically) real-valued vectors to real-valued vectors. The spaces of multivariate functions that can be implemented by a network are determined by the structure of the network, the set of simple functions, and its multiplicative parameters. A great deal of theoretical work has gone into characterizing these function spaces. Most universal approximation theorems are in one of two classes. The first quantifies the approximation capabilities of neural networks with an arbitrary number of artificial neurons ("arbitrary width" case) and the second focuses on the case with an arbitrary number of hidden layers, each containing a limited number of artificial neurons ("arbitrary depth" case). In addition to these two classes, there are also universal approximation theorems for neural networks with bounded number of hidden layers and a limited number of neurons in each layer ("bounded depth and bounded width" case). == History == === Arbitrary width === The first results concerned the arbitrary width case. Ken-ichi Funahashi (May 1989) showed that Rumelhart–Hinton–Williams type backpropagation networks possess universal approximation capability with a class of sigmoidal activation functions, extending the result to multi-output mappings as well. Kurt Hornik, Maxwell Stinchcombe, and Halbert White (July 1989) showed that multilayer feed-forward networks with as few as one hidden layer are universal approximators, provided that the activation function satisfies certain conditions. George Cybenko (December 1989) independently established a related result for sigmoid activation functions using functional-analytic methods. Hornik also showed in 1991 that it is not the specific choice of the activation function but rather the multilayer feed-forward architecture itself that gives neural networks the potential of being universal approximators. Moshe Leshno et al in 1993 and later Allan Pinkus in 1999 showed that the universal approximation property is equivalent to having a nonpolynomial activation function. === Arbitrary depth === The arbitrary depth case was also studied by a number of authors such as Gustaf Gripenberg in 2003, Dmitry Yarotsky, Zhou Lu et al in 2017, Boris Hanin and Mark Sellke in 2018 who focused on neural networks with ReLU activation function. In 2020, Patrick Kidger and Terry Lyons extended those results to neural networks with general activation functions such, e.g. tanh or GeLU. One special case of arbitrary depth is that each composition component comes from a finite set of mappings. In 2024, Cai constructed a finite set of mappings, named a vocabulary, such that any continuous function can be approximated by compositing a sequence from the vocabulary. This is similar to the concept of compositionality in linguistics, which is the idea that a finite vocabulary of basic elements can be combined via grammar to express an infinite range of meanings. === Bounded depth and bounded width === The bounded depth and bounded width case was first studied by Maiorov and Pinkus in 1999. They showed that there exists an analytic sigmoidal activation function such that two hidden layer neural networks with bounded number of units in hidden layers are universal approximators. In 2018, Guliyev and Ismailov constructed a smooth sigmoidal activation function providing universal approximation property for two hidden layer feedforward neural networks with fewer units in hidden layers. In 2018, they also constructed single hidden layer networks with bounded width that are still universal approximators for univariate functions. However, this does not apply for multivariable functions. In 2022, Shen et al. obtained precise quantitative information on the depth and width required to approximate a target function by deep and wide ReLU neural networks. === Quantitative bounds === The question of minimal possible width for universality was first studied in 2021, Park et al obtained the minimum width required for the universal approximation of Lp functions using feed-forward neural networks with ReLU as activation functions. Similar results that can be directly applied to residual neural networks were also obtained in the same year by Paulo Tabuada and Bahman Gharesifard using control-theoretic arguments. In 2023, Cai obtained the optimal minimum width bound for the universal approximation. For the arbitrary depth case, Leonie Papon and Anastasis Kratsios derived explicit depth estimates depending on the regularity of the target function and of the activation function. === Kolmogorov network === The Kolmogorov–Arnold representation theorem is similar in spirit. Indeed, certain neural network families can directly apply the Kolmogorov–Arnold theorem to yield a universal approximation theorem. Robert Hecht-Nielsen showed that a three-layer neural network can approximate any continuous multivariate function. This was extended to the discontinuous case by Vugar Ismailov. In 2024, Ziming Liu and co-authors showed a practical application. === Reservoir computing and quantum reservoir computing === In reservoir computing a sparse recurrent neural network with fixed weights equipped of fading memory and echo state property is followed by a trainable output layer. Its universality has been demonstrated separately for what concerns networks of rate neurons and spiking neurons, respectively. In 2024, the framework has been generalized and extended to quantum reservoirs where the reservoir is based on qubits defined over Hilbert spaces. === Variants === Variants include discontinuous activation functions, noncompact domains, certifiable networks, random neural networks, and alternative network architectures and topologies. The universal approximation property of width-bounded networks has been studied as a dual of classical universal approximation results on depth-bounded networks. For input dimension d x {\displaystyle d_{x}} and output dimension d y {\displaystyle d_{y}} the minimum width required for the universal approximation of the Lp functions is exactly m a x { d x + 1 , d y } {\displaystyle max\{d_{x}+1,d_{y}\}} (for a ReLU network). More generally this also holds if both ReLU and a threshold activation function are used. Universal function approximation on graphs (or rather on graph isomorphism classes) by popular graph convolutional neural networks (GCNs or GNNs) can be made as discriminative as the Weisfeiler–Leman graph isomorphism test. In 2020, a universal approximation theorem result was established by Brüel-Gabrielsson, showing that graph representation with certain injective properties is sufficient for universal function approximation on bounded graphs and restricted universal function approximation on unbounded graphs, with an accompanying O ( | V | ⋅ | E | ) {\displaystyle {\mathcal {O}}(\left|V\right|\cdot \left|E\right|)} -runtime method that performed at state of the art on a collection of benchmarks (where V {\displaystyle V} and E {\displaystyle E} are the sets of nodes and edges of the graph respectively). There are also a variety of results between non-Euclidean spaces and other commonly used architectures and, more generally, algorithmically generated sets of functions, such as the convolutional neural network (CNN) architecture, radial basis functions, or neural networks with specific properties. == Arbitrary-width case == A universal approximation theorem formally states that a family of neural network funct

    Read more →
  • Apache OpenNLP

    Apache OpenNLP

    The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. It supports the most common NLP tasks, such as language detection, tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing and coreference resolution. These tasks are usually required to build more advanced text processing services.

    Read more →
  • Variational message passing

    Variational message passing

    Variational message passing (VMP) is an approximate inference technique for continuous- or discrete-valued Bayesian networks, with conjugate-exponential parents, developed by John Winn. VMP was developed as a means of generalizing the approximate variational methods used by such techniques as latent Dirichlet allocation, and works by updating an approximate distribution at each node through messages in the node's Markov blanket. == Likelihood lower bound == Given some set of hidden variables H {\displaystyle H} and observed variables V {\displaystyle V} , the goal of approximate inference is to maximize a lower-bound on the probability that a graphical model is in the configuration V {\displaystyle V} . Over some probability distribution Q {\displaystyle Q} (to be defined later), ln ⁡ P ( V ) = ∑ H Q ( H ) ln ⁡ P ( H , V ) P ( H | V ) = ∑ H Q ( H ) [ ln ⁡ P ( H , V ) Q ( H ) − ln ⁡ P ( H | V ) Q ( H ) ] {\displaystyle \ln P(V)=\sum _{H}Q(H)\ln {\frac {P(H,V)}{P(H|V)}}=\sum _{H}Q(H){\Bigg [}\ln {\frac {P(H,V)}{Q(H)}}-\ln {\frac {P(H|V)}{Q(H)}}{\Bigg ]}} . So, if we define our lower bound to be L ( Q ) = ∑ H Q ( H ) ln ⁡ P ( H , V ) Q ( H ) {\displaystyle L(Q)=\sum _{H}Q(H)\ln {\frac {P(H,V)}{Q(H)}}} , then the likelihood is simply this bound plus the relative entropy between P {\displaystyle P} and Q {\displaystyle Q} . Because the relative entropy is non-negative, the function L {\displaystyle L} defined above is indeed a lower bound of the log likelihood of our observation V {\displaystyle V} . The distribution Q {\displaystyle Q} will have a simpler character than that of P {\displaystyle P} because marginalizing over P {\displaystyle P} is intractable for all but the simplest of graphical models. In particular, VMP uses a factorized distribution Q ( H ) = ∏ i Q i ( H i ) , {\displaystyle Q(H)=\prod _{i}Q_{i}(H_{i}),} where H i {\displaystyle H_{i}} is a disjoint part of the graphical model. == Determining the update rule == The likelihood estimate needs to be as large as possible; because it's a lower bound, getting closer log ⁡ P {\displaystyle \log P} improves the approximation of the log likelihood. By substituting in the factorized version of Q {\displaystyle Q} , L ( Q ) {\displaystyle L(Q)} , parameterized over the hidden nodes H i {\displaystyle H_{i}} as above, is simply the negative relative entropy between Q j {\displaystyle Q_{j}} and Q j ∗ {\displaystyle Q_{j}^{}} plus other terms independent of Q j {\displaystyle Q_{j}} if Q j ∗ {\displaystyle Q_{j}^{}} is defined as Q j ∗ ( H j ) = 1 Z e E − j { ln ⁡ P ( H , V ) } {\displaystyle Q_{j}^{}(H_{j})={\frac {1}{Z}}e^{\mathbb {E} _{-j}\{\ln P(H,V)\}}} , where E − j { ln ⁡ P ( H , V ) } {\displaystyle \mathbb {E} _{-j}\{\ln P(H,V)\}} is the expectation over all distributions Q i {\displaystyle Q_{i}} except Q j {\displaystyle Q_{j}} . Thus, if we set Q j {\displaystyle Q_{j}} to be Q j ∗ {\displaystyle Q_{j}^{}} , the bound L {\displaystyle L} is maximized. == Messages in variational message passing == Parents send their children the expectation of their sufficient statistic while children send their parents their natural parameter, which also requires messages to be sent from the co-parents of the node. == Relationship to exponential families == Because all nodes in VMP come from exponential families and all parents of nodes are conjugate to their children nodes, the expectation of the sufficient statistic can be computed from the normalization factor. == VMP algorithm == The algorithm begins by computing the expected value of the sufficient statistics for that vector. Then, until the likelihood converges to a stable value (this is usually accomplished by setting a small threshold value and running the algorithm until it increases by less than that threshold value), do the following at each node: Get all messages from parents. Get all messages from children (this might require the children to get messages from the co-parents). Compute the expected value of the nodes sufficient statistics. == Constraints == Because every child must be conjugate to its parent, this has limited the types of distributions that can be used in the model. For example, the parents of a Gaussian distribution must be a Gaussian distribution (corresponding to the Mean) and a gamma distribution (corresponding to the precision, or one over σ {\displaystyle \sigma } in more common parameterizations). Discrete variables can have Dirichlet parents, and Poisson and exponential nodes must have gamma parents. More recently, VMP has been extended to handle models that violate this conditional conjugacy constraint. == Literature == John Winn; Christopher M. Bishop (2005). "Variational Message Passing" (PDF). Journal of Machine Learning Research. 6: 661–694. ISSN 1533-7928. Wikidata Q139488859. Beal, M.J. (2003). Variational Algorithms for Approximate Bayesian Inference (PDF) (PhD). Gatsby Computational Neuroscience Unit, University College London. Archived from the original (PDF) on 2005-04-28. Retrieved 2007-02-15.

    Read more →
  • Linear genetic programming

    Linear genetic programming

    "Linear genetic programming" is unrelated to "linear programming". Linear genetic programming (LGP) is a particular method of genetic programming wherein computer programs in a population are represented as a sequence of register-based instructions from an imperative programming language or machine language. The adjective "linear" stems from the fact that each LGP program is a sequence of instructions and the sequence of instructions is normally executed sequentially. Like in other programs, the data flow in LGP can be modeled as a graph that will visualize the potential multiple usage of register contents and the existence of structurally noneffective code (introns) which are two main differences of this genetic representation from the more common tree-based genetic programming (TGP) variant. Like other Genetic Programming methods, Linear genetic programming requires the input of data to run the program population on. Then, the output of the program (its behaviour) is judged against some target behaviour, using a fitness function. However, LGP is generally more efficient than tree genetic programming due to its two main differences mentioned above: Intermediate results (stored in registers) can be reused and a simple intron removal algorithm exists that can be executed to remove all non-effective code prior to programs being run on the intended data. These two differences often result in compact solutions and substantial computational savings compared to the highly constrained data flow in trees and the common method of executing all tree nodes in TGP. Furthermore, LGP naturally has multiple outputs by defining multiple output registers and easily cooperates with control flow operations. Linear genetic programming has been applied in many domains, including system modeling and system control with considerable success. Linear genetic programming should not be confused with linear tree programs in tree genetic programming, program composed of a variable number of unary functions and a single terminal. Note that linear tree GP differs from bit string genetic algorithms since a population may contain programs of different lengths and there may be more than two types of functions or more than two types of terminals. == Examples of LGP programs == Because LGP programs are basically represented by a linear sequence of instructions, they are simpler to read and to operate on than their tree-based counterparts. For example, a simple program written to solve a Boolean function problem with 3 inputs (in R1, R2, R3) and one output (in R0), could read like this: R1, R2, R3 have to be declared as input (read-only) registers, while R0 and R4 are declared as calculation (read-write) registers. This program is very simple, having just 5 instructions. But mutation and crossover operators could work to increase the length of the program, as well as the content of each of its instructions. Note that one instruction is non-effective or an intron (marked), since it does not impact the output register R0. Recognition of those instructions is the basis for the intron removal algorithm which is used analyze code prior to execution. Technically, this happens by copying an individual and then run the intron removal once. The copy with removed introns is then executed as many times as dictated by the number of training cases. Notably, the original individual is left intact, so as to continue participating in the evolutionary process. It is only the copy that is executed that is compressed by removing these "structural" introns. Another simple program, this one written in the LGP language Slash/A looks like a series of instructions separated by a slash: By representing such code in bytecode format, i.e. as an array of bytes each representing a different instruction, one can make mutation operations simply by changing an element of such an array.

    Read more →
  • Vanishing gradient problem

    Vanishing gradient problem

    In machine learning, the vanishing gradient problem is the problem of greatly diverging gradient magnitudes between earlier and later layers encountered when training neural networks with backpropagation. In such methods, neural network weights are updated proportional to their partial derivative of the loss function. As the number of forward propagation steps in a network increases, for instance due to greater network depth, the gradients of earlier weights are calculated with increasingly many multiplications. These multiplications shrink the gradient magnitude. Consequently, the gradients of earlier weights will be exponentially smaller than the gradients of later weights. This difference in gradient magnitude might introduce instability in the training process, slow it, or halt it entirely. For instance, consider the hyperbolic tangent activation function. The gradients of this function are in range [0,1]. The product of repeated multiplication with such gradients decreases exponentially. The inverse problem, when weight gradients at earlier layers get exponentially larger, is called the exploding gradient problem. Backpropagation allowed researchers to train supervised deep artificial neural networks from scratch, initially with little success. Hochreiter's diplom thesis of 1991 formally identified the reason for this failure in the "vanishing gradient problem", which not only affects many-layered feedforward networks, but also recurrent networks. The latter are trained by unfolding them into very deep feedforward networks, where a new layer is created for each time-step of an input sequence processed by the network (the combination of unfolding and backpropagation is termed backpropagation through time). == Prototypical models == This section is based on the paper On the difficulty of training Recurrent Neural Networks by Pascanu, Mikolov, and Bengio. === Recurrent network model === A generic recurrent network has hidden states h 1 , h 2 , … {\displaystyle h_{1},h_{2},\dots } , inputs u 1 , u 2 , … {\displaystyle u_{1},u_{2},\dots } , and outputs x 1 , x 2 , … {\displaystyle x_{1},x_{2},\dots } . Let it be parameterized by θ {\displaystyle \theta } , so that the system evolves as ( h t , x t ) = F ( h t − 1 , u t , θ ) {\displaystyle (h_{t},x_{t})=F(h_{t-1},u_{t},\theta )} Often, the output x t {\displaystyle x_{t}} is a function of h t {\displaystyle h_{t}} , as some x t = G ( h t ) {\displaystyle x_{t}=G(h_{t})} . The vanishing gradient problem already presents itself clearly when x t = h t {\displaystyle x_{t}=h_{t}} , so we simplify our notation to the special case with: x t = F ( x t − 1 , u t , θ ) {\displaystyle x_{t}=F(x_{t-1},u_{t},\theta )} Now, take its differential: d x t = ∇ θ F ( x t − 1 , u t , θ ) d θ + ∇ x F ( x t − 1 , u t , θ ) d x t − 1 = ∇ θ F ( x t − 1 , u t , θ ) d θ + ∇ x F ( x t − 1 , u t , θ ) [ ∇ θ F ( x t − 2 , u t − 1 , θ ) d θ + ∇ x F ( x t − 2 , u t − 1 , θ ) d x t − 2 ] ⋮ = [ ∇ θ F ( x t − 1 , u t , θ ) + ∇ x F ( x t − 1 , u t , θ ) ∇ θ F ( x t − 2 , u t − 1 , θ ) + ⋯ ] d θ {\displaystyle {\begin{aligned}dx_{t}&=\nabla _{\theta }F(x_{t-1},u_{t},\theta )d\theta +\nabla _{x}F(x_{t-1},u_{t},\theta )dx_{t-1}\\&=\nabla _{\theta }F(x_{t-1},u_{t},\theta )d\theta +\nabla _{x}F(x_{t-1},u_{t},\theta )\left[\nabla _{\theta }F(x_{t-2},u_{t-1},\theta )d\theta +\nabla _{x}F(x_{t-2},u_{t-1},\theta )dx_{t-2}\right]\\&\;\;\vdots \\&=\left[\nabla _{\theta }F(x_{t-1},u_{t},\theta )+\nabla _{x}F(x_{t-1},u_{t},\theta )\nabla _{\theta }F(x_{t-2},u_{t-1},\theta )+\cdots \right]d\theta \end{aligned}}} Training the network requires us to define a loss function to be minimized. Let it be L ( x T , u 1 , … , u T ) {\displaystyle L(x_{T},u_{1},\dots ,u_{T})} , then minimizing it by gradient descent gives Δ θ = − η ⋅ [ ∇ x L ( x T ) ( ∇ θ F ( x t − 1 , u t , θ ) + ∇ x F ( x t − 1 , u t , θ ) ∇ θ F ( x t − 2 , u t − 1 , θ ) + ⋯ ) ] T {\displaystyle \Delta \theta =-\eta \cdot \left[\nabla _{x}L(x_{T})\left(\nabla _{\theta }F(x_{t-1},u_{t},\theta )+\nabla _{x}F(x_{t-1},u_{t},\theta )\nabla _{\theta }F(x_{t-2},u_{t-1},\theta )+\cdots \right)\right]^{T}} where η {\displaystyle \eta } is the learning rate. The vanishing/exploding gradient problem appears because there are repeated multiplications, of the form ∇ x F ( x t − 1 , u t , θ ) ∇ x F ( x t − 2 , u t − 1 , θ ) ∇ x F ( x t − 3 , u t − 2 , θ ) ⋯ {\displaystyle \nabla _{x}F(x_{t-1},u_{t},\theta )\nabla _{x}F(x_{t-2},u_{t-1},\theta )\nabla _{x}F(x_{t-3},u_{t-2},\theta )\cdots } ==== Example: recurrent network with sigmoid activation ==== For a concrete example, consider a typical recurrent network defined by x t = F ( x t − 1 , u t , θ ) = W rec σ ( x t − 1 ) + W in u t + b {\displaystyle x_{t}=F(x_{t-1},u_{t},\theta )=W_{\text{rec}}\sigma (x_{t-1})+W_{\text{in}}u_{t}+b} where θ = ( W rec , W in ) {\displaystyle \theta =(W_{\text{rec}},W_{\text{in}})} is the network parameter, σ {\displaystyle \sigma } is the sigmoid activation function, applied to each vector coordinate separately, and b {\displaystyle b} is the bias vector. Then, ∇ x F ( x t − 1 , u t , θ ) = W rec diag ⁡ ( σ ′ ( x t − 1 ) ) {\displaystyle \nabla _{x}F(x_{t-1},u_{t},\theta )=W_{\text{rec}}\operatorname {diag} (\sigma '(x_{t-1}))} , and so ∇ x F ( x t − 1 , u t , θ ) ∇ x F ( x t − 2 , u t − 1 , θ ) ⋯ ∇ x F ( x t − k , u t − k + 1 , θ ) = W rec diag ⁡ ( σ ′ ( x t − 1 ) ) W rec diag ⁡ ( σ ′ ( x t − 2 ) ) ⋯ W rec diag ⁡ ( σ ′ ( x t − k ) ) {\displaystyle {\begin{aligned}&\nabla _{x}F(x_{t-1},u_{t},\theta )\nabla _{x}F(x_{t-2},u_{t-1},\theta )\cdots \nabla _{x}F(x_{t-k},u_{t-k+1},\theta )\\&=W_{\text{rec}}\operatorname {diag} (\sigma '(x_{t-1}))W_{\text{rec}}\operatorname {diag} (\sigma '(x_{t-2}))\cdots W_{\text{rec}}\operatorname {diag} (\sigma '(x_{t-k}))\end{aligned}}} Since | σ ′ | ≤ 1 {\displaystyle \left|\sigma '\right|\leq 1} , the operator norm of the above multiplication is bounded above by ‖ W rec ‖ k {\displaystyle \left\|W_{\text{rec}}\right\|^{k}} . So if the spectral radius of W rec {\displaystyle W_{\text{rec}}} is γ < 1 {\displaystyle \gamma <1} , then at large k {\displaystyle k} , the above multiplication has operator norm bounded above by γ k → 0 {\displaystyle \gamma ^{k}\to 0} . This is the prototypical vanishing gradient problem. The effect of a vanishing gradient is that the network cannot learn long-range effects. Recall Equation (loss differential): ∇ θ L = ∇ x L ( x T , u 1 , … , u T ) [ ∇ θ F ( x t − 1 , u t , θ ) + ∇ x F ( x t − 1 , u t , θ ) ∇ θ F ( x t − 2 , u t − 1 , θ ) + ⋯ ] {\displaystyle \nabla _{\theta }L=\nabla _{x}L(x_{T},u_{1},\dots ,u_{T})\left[\nabla _{\theta }F(x_{t-1},u_{t},\theta )+\nabla _{x}F(x_{t-1},u_{t},\theta )\nabla _{\theta }F(x_{t-2},u_{t-1},\theta )+\cdots \right]} The components of ∇ θ F ( x , u , θ ) {\displaystyle \nabla _{\theta }F(x,u,\theta )} are just components of σ ( x ) {\displaystyle \sigma (x)} and u {\displaystyle u} , so if u t , u t − 1 , … {\displaystyle u_{t},u_{t-1},\dots } are bounded, then ‖ ∇ θ F ( x t − k − 1 , u t − k , θ ) ‖ {\displaystyle \left\|\nabla _{\theta }F(x_{t-k-1},u_{t-k},\theta )\right\|} is also bounded by some M > 0 {\displaystyle M>0} , and so the terms in ∇ θ L {\displaystyle \nabla _{\theta }L} decay as M γ k {\displaystyle M\gamma ^{k}} . This means that, effectively, ∇ θ L {\displaystyle \nabla _{\theta }L} is affected only by the first O ( γ − 1 ) {\displaystyle O(\gamma ^{-1})} terms in the sum. If γ ≥ 1 {\displaystyle \gamma \geq 1} , the above analysis does not quite work. For the prototypical exploding gradient problem, the next model is clearer. === Dynamical systems model === Following (Doya, 1993), consider this one-neuron recurrent network with sigmoid activation: x t + 1 = ( 1 − ε ) x t + ε σ ( w x t + b ) + ε w ′ u t {\displaystyle x_{t+1}=(1-\varepsilon )x_{t}+\varepsilon \sigma (wx_{t}+b)+\varepsilon w'u_{t}} At the small ε {\displaystyle \varepsilon } limit, the dynamics of the network becomes d x d t = − x ( t ) + σ ( w x ( t ) + b ) + w ′ u ( t ) {\displaystyle {\frac {dx}{dt}}=-x(t)+\sigma (wx(t)+b)+w'u(t)} Consider first the autonomous case, with u = 0 {\displaystyle u=0} . Set w = 5.0 {\displaystyle w=5.0} , and vary b {\displaystyle b} in [ − 3 , − 2 ] {\displaystyle [-3,-2]} . As b {\displaystyle b} decreases, the system has 1 stable point, then has 2 stable points and 1 unstable point, and finally has 1 stable point again. Explicitly, the stable points are ( x , b ) = ( x , ln ⁡ ( x 1 − x ) − 5 x ) {\displaystyle (x,b)=\left(x,\ln \left({\frac {x}{1-x}}\right)-5x\right)} . Now consider Δ x ( T ) Δ x ( 0 ) {\displaystyle {\frac {\Delta x(T)}{\Delta x(0)}}} and Δ x ( T ) Δ b {\displaystyle {\frac {\Delta x(T)}{\Delta b}}} , where T {\displaystyle T} is large enough that the system has settled into one of the stable points. If ( x ( 0 ) , b ) {\displaystyle (x(0),b)} puts the system very close to an unstable point, then a tiny variation in x ( 0 ) {\displaystyle x(0)} or b {\displaystyle b} wo

    Read more →
  • Learning curve (machine learning)

    Learning curve (machine learning)

    In machine learning (ML), a learning curve (or training curve) is a graphical representation that shows how a model's performance on a training set (and usually a validation set) changes with the number of training iterations (epochs) or the amount of training data. Typically, the number of training epochs or training set size is plotted on the x-axis, and the value of the loss function (and possibly some other metric such as the cross-validation score) on the y-axis. Synonyms include error curve, experience curve, improvement curve and generalization curve. More abstractly, learning curves plot the difference between learning effort and predictive performance, where "learning effort" usually means the number of training samples, and "predictive performance" means accuracy on testing samples. Learning curves have many useful purposes in ML, including: choosing model parameters during design, adjusting optimization to improve convergence, and diagnosing problems such as overfitting (or underfitting). Learning curves can also be tools for determining how much a model benefits from adding more training data, and whether the model suffers more from a variance error or a bias error. If both the validation score and the training score converge to a certain value, then the model will no longer significantly benefit from more training data. == Formal definition == When creating a function to approximate the distribution of some data, it is necessary to define a loss function L ( f θ ( X ) , Y ) {\displaystyle L(f_{\theta }(X),Y)} to measure how good the model output is (e.g., accuracy for classification tasks or mean squared error for regression). We then define an optimization process which finds model parameters θ {\displaystyle \theta } such that L ( f θ ( X ) , Y ) {\displaystyle L(f_{\theta }(X),Y)} is minimized, referred to as θ ∗ {\displaystyle \theta ^{}} . === Training curve for amount of data === If the training data is { x 1 , x 2 , … , x n } , { y 1 , y 2 , … y n } {\displaystyle \{x_{1},x_{2},\dots ,x_{n}\},\{y_{1},y_{2},\dots y_{n}\}} and the validation data is { x 1 ′ , x 2 ′ , … x m ′ } , { y 1 ′ , y 2 ′ , … y m ′ } {\displaystyle \{x_{1}',x_{2}',\dots x_{m}'\},\{y_{1}',y_{2}',\dots y_{m}'\}} , a learning curve is the plot of the two curves i ↦ L ( f θ ∗ ( X i , Y i ) ( X i ) , Y i ) {\displaystyle i\mapsto L(f_{\theta ^{}(X_{i},Y_{i})}(X_{i}),Y_{i})} i ↦ L ( f θ ∗ ( X i , Y i ) ( X i ′ ) , Y i ′ ) {\displaystyle i\mapsto L(f_{\theta ^{}(X_{i},Y_{i})}(X_{i}'),Y_{i}')} where X i = { x 1 , x 2 , … x i } {\displaystyle X_{i}=\{x_{1},x_{2},\dots x_{i}\}} === Training curve for number of iterations === Many optimization algorithms are iterative, repeating the same step (such as backpropagation) until the process converges to an optimal value. Gradient descent is one such algorithm. If θ i ∗ {\displaystyle \theta _{i}^{}} is the approximation of the optimal θ {\displaystyle \theta } after i {\displaystyle i} steps, a learning curve is the plot of i ↦ L ( f θ i ∗ ( X , Y ) ( X ) , Y ) {\displaystyle i\mapsto L(f_{\theta _{i}^{}(X,Y)}(X),Y)} i ↦ L ( f θ i ∗ ( X , Y ) ( X ′ ) , Y ′ ) {\displaystyle i\mapsto L(f_{\theta _{i}^{}(X,Y)}(X'),Y')}

    Read more →
  • Multiple correspondence analysis

    Multiple correspondence analysis

    In statistics, multiple correspondence analysis (MCA) is a data analysis technique for nominal categorical data, used to detect and represent underlying structures in a data set. It does this by representing data as points in a low-dimensional Euclidean space. The procedure thus appears to be the counterpart of principal component analysis for categorical data. MCA can be viewed as an extension of simple correspondence analysis (CA) in that it is applicable to a large set of categorical variables. == As an extension of correspondence analysis == MCA is performed by applying the CA algorithm to either an indicator matrix (also called complete disjunctive table – CDT) or a Burt table formed from these variables. An indicator matrix is an individuals × variables matrix, where the rows represent individuals and the columns are dummy variables representing categories of the variables. Analyzing the indicator matrix allows the direct representation of individuals as points in geometric space. The Burt table is the symmetric matrix of all two-way cross-tabulations between the categorical variables, and has an analogy to the covariance matrix of continuous variables. Analyzing the Burt table is a more natural generalization of simple correspondence analysis, and individuals or the means of groups of individuals can be added as supplementary points to the graphical display. In the indicator matrix approach, associations between variables are uncovered by calculating the chi-square distance between different categories of the variables and between the individuals (or respondents). These associations are then represented graphically as "maps", which eases the interpretation of the structures in the data. Oppositions between rows and columns are then maximized, in order to uncover the underlying dimensions best able to describe the central oppositions in the data. As in factor analysis or principal component analysis, the first axis is the most important dimension, the second axis the second most important, and so on, in terms of the amount of variance accounted for. The number of axes to be retained for analysis is determined by calculating modified eigenvalues. == Details == Since MCA is adapted to draw statistical conclusions from categorical variables (such as multiple choice questions), the first thing one needs to do is to transform quantitative data (such as age, size, weight, day time, etc) into categories (using for instance statistical quantiles). When the dataset is completely represented as categorical variables, one is able to build the corresponding so-called complete disjunctive table. We denote this table X {\displaystyle X} . If I {\displaystyle I} persons answered a survey with J {\displaystyle J} multiple choices questions with 4 answers each, X {\displaystyle X} will have I {\displaystyle I} rows and 4 J {\displaystyle 4J} columns. More theoretically, assume X {\displaystyle X} is the completely disjunctive table of I {\displaystyle I} observations of K {\displaystyle K} categorical variables. Assume also that the k {\displaystyle k} -th variable have J k {\displaystyle J_{k}} different levels (categories) and set J = ∑ k = 1 K J k {\displaystyle J=\sum _{k=1}^{K}J_{k}} . The table X {\displaystyle X} is then a I × J {\displaystyle I\times J} matrix with all coefficient being 0 {\displaystyle 0} or 1 {\displaystyle 1} . Set the sum of all entries of X {\displaystyle X} to be N {\displaystyle N} and introduce Z = X / N {\displaystyle Z=X/N} . In an MCA, there are also two special vectors: first r {\displaystyle r} , that contains the sums along the rows of Z {\displaystyle Z} , and c {\displaystyle c} , that contains the sums along the columns of Z {\displaystyle Z} . Note D r = diag ( r ) {\displaystyle D_{r}={\text{diag}}(r)} and D c = diag ( c ) {\displaystyle D_{c}={\text{diag}}(c)} , the diagonal matrices containing r {\displaystyle r} and c {\displaystyle c} respectively as diagonal. With these notations, computing an MCA consists essentially in the singular value decomposition of the matrix: M = D r − 1 / 2 ( Z − r c T ) D c − 1 / 2 {\displaystyle M=D_{r}^{-1/2}(Z-rc^{T})D_{c}^{-1/2}} The decomposition of M {\displaystyle M} gives you P {\displaystyle P} , Δ {\displaystyle \Delta } and Q {\displaystyle Q} such that M = P Δ Q T {\displaystyle M=P\Delta Q^{T}} with P, Q two unitary matrices and Δ {\displaystyle \Delta } is the generalized diagonal matrix of the singular values (with the same shape as Z {\displaystyle Z} ). The positive coefficients of Δ 2 {\displaystyle \Delta ^{2}} are the eigenvalues of Z {\displaystyle Z} . The interest of MCA comes from the way observations (rows) and variables (columns) in Z {\displaystyle Z} can be decomposed. This decomposition is called a factor decomposition. The coordinates of the observations in the factor space are given by F = D r − 1 / 2 P Δ {\displaystyle F=D_{r}^{-1/2}P\Delta } The i {\displaystyle i} -th rows of F {\displaystyle F} represent the i {\displaystyle i} -th observation in the factor space. And similarly, the coordinates of the variables (in the same factor space as observations!) are given by G = D c − 1 / 2 Q Δ {\displaystyle G=D_{c}^{-1/2}Q\Delta } == Recent works and extensions == In recent years, several students of Jean-Paul Benzécri have refined MCA and incorporated it into a more general framework of data analysis known as geometric data analysis. This involves the development of direct connections between simple correspondence analysis, principal component analysis and MCA with a form of cluster analysis known as Euclidean classification. Two extensions have great practical use. It is possible to include, as active elements in the MCA, several quantitative variables. This extension is called factor analysis of mixed data (see below). Very often, in questionnaires, the questions are structured in several issues. In the statistical analysis it is necessary to take into account this structure. This is the aim of multiple factor analysis which balances the different issues (i.e. the different groups of variables) within a global analysis and provides, beyond the classical results of factorial analysis (mainly graphics of individuals and of categories), several results (indicators and graphics) specific of the group structure. == Application fields == In the social sciences, MCA is arguably best known for its application by Pierre Bourdieu, notably in his books La Distinction, Homo Academicus and The State Nobility. Bourdieu argued that there was an internal link between his vision of the social as spatial and relational --– captured by the notion of field, and the geometric properties of MCA. Sociologists following Bourdieu's work most often opt for the analysis of the indicator matrix, rather than the Burt table, largely because of the central importance accorded to the analysis of the 'cloud of individuals'. == Multiple correspondence analysis and principal component analysis == MCA can also be viewed as a PCA applied to the complete disjunctive table. To do this, the CDT must be transformed as follows. Let y i k {\displaystyle y_{ik}} denote the general term of the CDT. y i k {\displaystyle y_{ik}} is equal to 1 if individual i {\displaystyle i} possesses the category k {\displaystyle k} and 0 if not. Let denote p k {\displaystyle p_{k}} , the proportion of individuals possessing the category k {\displaystyle k} . The transformed CDT (TCDT) has as general term: x i k = y i k / p k − 1 {\displaystyle x_{ik}=y_{ik}/p_{k}-1} The unstandardized PCA applied to TCDT, the column k {\displaystyle k} having the weight p k {\displaystyle p_{k}} , leads to the results of MCA. This equivalence is fully explained in a book by Jérôme Pagès. It plays an important theoretical role because it opens the way to the simultaneous treatment of quantitative and qualitative variables. Two methods simultaneously analyze these two types of variables: factor analysis of mixed data and, when the active variables are partitioned in several groups: multiple factor analysis. This equivalence does not mean that MCA is a particular case of PCA as it is not a particular case of CA. It only means that these methods are closely linked to one another, as they belong to the same family: the factorial methods. == Software == There are numerous software of data analysis that include MCA, such as STATA and SPSS. The R package FactoMineR also features MCA. This software is related to a book describing the basic methods for performing MCA . There is also a Python package for [1] which works with numpy array matrices; the package has not been implemented yet for Spark dataframes.

    Read more →
  • Optical character recognition

    Optical character recognition

    Optical character recognition (OCR) or optical character reader is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo (for example the text on signs and billboards in a landscape photo) or from subtitle text superimposed on an image (for example: from a television broadcast). Widely used as a form of data entry from printed paper data records – whether passport documents, invoices, bank statements, computerized receipts, business cards, mail, printed data, or any suitable documentation – it is a common method of digitizing printed texts so that they can be electronically edited, searched, stored more compactly, displayed online, and used in machine processes such as cognitive computing, machine translation, (extracted) text-to-speech, key data and text mining. OCR is a field of research in pattern recognition, artificial intelligence and computer vision. Early versions needed to be trained with images of each character, and worked on one font at a time. Advanced systems capable of producing a high degree of accuracy for most fonts are now common, and with support for a variety of image file format inputs. Some systems are capable of reproducing formatted output that closely approximates the original page including images, columns, and other non-textual components. == History == Early optical character recognition may be traced to technologies involving telegraphy and creating reading devices for the blind. In 1914, Emanuel Goldberg developed a machine that read characters and converted them into standard telegraph code. Concurrently, Edmund Fournier d'Albe developed the Optophone, a handheld scanner that when moved across a printed page, produced tones that corresponded to specific letters or characters. In the late 1920s and into the 1930s, Emanuel Goldberg developed what he called a "Statistical Machine" for searching microfilm archives using an optical code recognition system. In 1931, he was granted US Patent number 1,838,389 for the invention. The patent was acquired by IBM. === Visually impaired users === In 1974, Ray Kurzweil started the company Kurzweil Computer Products, Inc. and continued development of omni-font OCR, which could recognize text printed in virtually any font. (Kurzweil is often credited with inventing omni-font OCR, but it was in use by companies, including CompuScan, in the late 1960s and 1970s.) Kurzweil used the technology to create a reading machine for blind people to have a computer read text to them out loud. The device included a CCD-type flatbed scanner and a text-to-speech synthesizer. On January 13, 1976, the finished product was unveiled during a widely reported news conference headed by Kurzweil and the leaders of the National Federation of the Blind. In 1978, Kurzweil Computer Products began selling a commercial version of the optical character recognition computer program. LexisNexis was one of the first customers, and bought the program to upload legal paper and news documents onto its nascent online databases. Two years later, Kurzweil sold his company to Xerox, which eventually spun it off as Scansoft, which merged with Nuance Communications. In the 2000s, OCR was made available online as a service (WebOCR), in a cloud computing environment, and in mobile applications like real-time translation of foreign-language signs on a smartphone. With the advent of smartphones and smartglasses, OCR can be used in internet connected mobile device applications that extract text captured using the device's camera. These devices that do not have built-in OCR functionality will typically use an OCR API to extract the text from the image file captured by the device. The OCR API returns the extracted text, along with information about the location of the detected text in the original image back to the device app for further processing (such as text-to-speech) or display. Various commercial and open source OCR systems are available for most common writing systems, including Latin, Cyrillic, Arabic, Hebrew, Indic, Bengali (Bangla), Devanagari, Tamil, Chinese, Japanese, and Korean characters. == Applications == OCR engines have been developed into software applications specializing in various subjects such as receipts, invoices, checks, and legal billing documents. The software can be used for: Entering data for business documents, e.g. checks, passports, invoices, bank statements and receipts Automatic number-plate recognition Passport recognition and information extraction in airports Automatically extracting key information from insurance documents Traffic-sign recognition Extracting business card information into a contact list Creating textual versions of printed documents, e.g. book scanning for Project Gutenberg Making electronic images of printed documents searchable, e.g. Google Books Converting handwriting in real-time to control a computer (pen computing) Defeating or testing the robustness of CAPTCHA anti-bot systems, though these are specifically designed to prevent OCR. Assistive technology for blind and visually impaired users Writing instructions for vehicles by identifying CAD images in a database that are appropriate to the vehicle design as it changes in real time Making scanned documents searchable by converting them to PDFs == Types == Optical character recognition (OCR) – targets typewritten text, one glyph or character at a time. Optical word recognition – targets typewritten text, one word at a time (for languages that use a space as a word divider). Usually just called "OCR". Intelligent character recognition (ICR) – also targets handwritten printscript or cursive text one glyph or character at a time, usually involving machine learning. Intelligent word recognition (IWR) – also targets handwritten printscript or cursive text, one word at a time. This is especially useful for languages where glyphs are not separated in cursive script. OCR is generally an offline process, which analyses a static document. There are cloud based services which provide an online OCR API service. Handwriting movement analysis can be used as input to handwriting recognition. Instead of merely using the shapes of glyphs and words, this technique is able to capture motion, such as the order in which segments are drawn, the direction, and the pattern of putting the pen down and lifting it. This additional information can make the process more accurate. This technology is also known as "online character recognition", "dynamic character recognition", "real-time character recognition", and "intelligent character recognition". == Techniques == === Pre-processing === OCR software often pre-processes images to improve the chances of successful recognition. Techniques include: De-skewing – if the document was not aligned properly when scanned, it may need to be tilted a few degrees clockwise or counterclockwise in order to make lines of text perfectly horizontal or vertical. Despeckling – removal of positive and negative spots, smoothing edges Binarization – conversion of an image from color or greyscale to black-and-white (called a binary image because there are two colors). The task is performed as a simple way of separating the text (or any other desired image component) from the background. The task of binarization is necessary since most commercial recognition algorithms work only on binary images, as it is simpler to do so. In addition, the effectiveness of binarization influences to a significant extent the quality of character recognition, and careful decisions are made in the choice of the binarization employed for a given input image type; since the quality of the method used to obtain the binary result depends on the type of image (scanned document, scene text image, degraded historical document, etc.). Line removal – Cleaning up non-glyph boxes and lines Layout analysis or zoning – Identification of columns, paragraphs, captions, etc. as distinct blocks. Especially important in multi-column layouts and tables. Line and word detection – Establishment of a baseline for word and character shapes, separating words as necessary. Script recognition – In multilingual documents, the script may change at the level of the words and hence, identification of the script is necessary, before the right OCR can be invoked to handle the specific script. Character isolation or segmentation – For per-character OCR, multiple characters that are connected due to image artifacts must be separated; single characters that are broken into multiple pieces due to artifacts must be connected. Normalization of aspect ratio and scale Segmentation of fixed-pitch fonts is accomplished relatively simply by aligning the image to a uniform grid based on where vertical grid lines will least often intersect black areas. For proportional fonts, more sophisticated techniques are needed because whitespace bet

    Read more →