AI Email News

AI Email News — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Community cloud

    Community cloud

    A community cloud in computing is a collaborative effort in which infrastructure is shared between several organizations from a specific community with common concerns (security, compliance, jurisdiction, etc.), whether managed internally or by a third party and hosted internally or externally. This is controlled and used by a group of organizations that have shared interests. The costs are spread over fewer users than a public cloud (but more than a private cloud), so only some of the cost savings potential of cloud computing are realized. The community cloud is provisioned for use by a group of consumers from different organizations who share the same concerns (e.g., application, security, policy, and efficiency demands).

    Read more →
  • Principal component analysis

    Principal component analysis

    Principal component analysis (PCA) is a linear dimensionality reduction technique with applications in exploratory data analysis, visualization and data preprocessing. The data are linearly transformed onto a new coordinate system such that the directions (principal components) capturing the largest variation in the data can be easily identified. The principal components of a collection of points in a real coordinate space are a sequence of p {\displaystyle p} unit vectors, where the i {\displaystyle i} -th vector is the direction of a line that best fits the data while being orthogonal to the first i − 1 {\displaystyle i-1} vectors. Here, a best-fitting line is defined as one that minimizes the average squared perpendicular distance from the points to the line. These directions (i.e., principal components) constitute an orthonormal basis in which different individual dimensions of the data are linearly uncorrelated. Many studies use the first two principal components in order to plot the data in two dimensions and to visually identify clusters of closely related data points. Principal component analysis has applications in many fields such as population genetics, microbiome studies, and atmospheric science. == Overview == When performing PCA, the first principal component of a set of p {\displaystyle p} variables is the derived variable formed as a linear combination of the original variables that explains the most variance. The second principal component explains the most variance in what is left once the effect of the first component is removed, and we may proceed through p {\displaystyle p} iterations until all the variance is explained. PCA is most commonly used when many of the variables are highly correlated with each other and it is desirable to reduce their number to an independent set. The first principal component can equivalently be defined as a direction that maximizes the variance of the projected data. The i {\displaystyle i} -th principal component can be taken as a direction orthogonal to the first i − 1 {\displaystyle i-1} principal components that maximizes the variance of the projected data. For either objective, it can be shown that the principal components are eigenvectors of the data's covariance matrix. Thus, the principal components are often computed by eigendecomposition of the data covariance matrix or singular value decomposition of the data matrix. PCA is the simplest of the true eigenvector-based multivariate analyses and is closely related to factor analysis. Factor analysis typically incorporates more domain-specific assumptions about the underlying structure and solves eigenvectors of a slightly different matrix. PCA is also related to canonical correlation analysis (CCA). CCA defines coordinate systems that optimally describe the cross-covariance between two datasets while PCA defines a new orthogonal coordinate system that optimally describes variance in a single dataset. Robust and L1-norm-based variants of standard PCA have also been proposed. == History == PCA was invented in 1901 by Karl Pearson, as an analogue of the principal axis theorem in mechanics; it was later independently developed and named by Harold Hotelling in the 1930s. Depending on the field of application, it is also named the discrete Karhunen–Loève transform (KLT) in signal processing, the Hotelling transform in multivariate quality control, proper orthogonal decomposition (POD) in mechanical engineering, singular value decomposition (SVD) of X (invented in the last quarter of the 19th century), eigenvalue decomposition (EVD) of XTX in linear algebra, factor analysis (for a discussion of the differences between PCA and factor analysis see Ch. 7 of Jolliffe's Principal Component Analysis), Eckart–Young theorem (Harman, 1960), or empirical orthogonal functions (EOF) in meteorological science (Lorenz, 1956), empirical eigenfunction decomposition (Sirovich, 1987), quasiharmonic modes (Brooks et al., 1988), spectral decomposition in noise and vibration, and empirical modal analysis in structural dynamics. == Intuition == PCA can be thought of as fitting a p-dimensional ellipsoid to the data, where each axis of the ellipsoid represents a principal component. If some axis of the ellipsoid is small, then the variance along that axis is also small. To find the axes of the ellipsoid, we must first center the values of each variable in the dataset on 0 by subtracting the mean of the variable's observed values from each of those values. These transformed values are used instead of the original observed values for each of the variables. Then, we compute the covariance matrix of the data and calculate the eigenvalues and corresponding eigenvectors of this covariance matrix. Then we must normalize each of the orthogonal eigenvectors to turn them into unit vectors. Once this is done, each of the mutually-orthogonal unit eigenvectors can be interpreted as an axis of the ellipsoid fitted to the data. This choice of basis will transform the covariance matrix into a diagonalized form, in which the diagonal elements represent the variance of each axis. The proportion of the variance that each eigenvector represents can be calculated by dividing the eigenvalue corresponding to that eigenvector by the sum of all eigenvalues. Biplots and scree plots (degree of explained variance) are used to interpret findings of the PCA. == Details == PCA is defined as an orthogonal linear transformation on a real inner product space that transforms the data to a new coordinate system such that the greatest variance by some scalar projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. Consider an n × p {\displaystyle n\times p} data matrix, X, with column-wise zero empirical mean (the sample mean of each column has been shifted to zero), where each of the n rows represents a different repetition of the experiment, and each of the p columns gives a particular kind of feature (say, the results from a particular sensor). Mathematically, the transformation is defined by a set of size l {\displaystyle l} (where l {\displaystyle l} is usually selected to be strictly less than p {\displaystyle p} to reduce dimensionality) of p {\displaystyle p} -dimensional vectors of weights or coefficients w ( k ) = ( w 1 , … , w p ) ( k ) {\displaystyle \mathbf {w} _{(k)}=(w_{1},\dots ,w_{p})_{(k)}} that map each row vector x ( i ) = ( x 1 , … , x p ) ( i ) {\displaystyle \mathbf {x} _{(i)}=(x_{1},\dots ,x_{p})_{(i)}} of X to a new vector of principal component scores t ( i ) = ( t 1 , … , t l ) ( i ) {\displaystyle \mathbf {t} _{(i)}=(t_{1},\dots ,t_{l})_{(i)}} , given by t k ( i ) = x ( i ) ⋅ w ( k ) f o r i = 1 , … , n k = 1 , … , l {\displaystyle {t_{k}}_{(i)}=\mathbf {x} _{(i)}\cdot \mathbf {w} _{(k)}\qquad \mathrm {for} \qquad i=1,\dots ,n\qquad k=1,\dots ,l} in such a way that the individual variables t 1 , … , t l {\displaystyle t_{1},\dots ,t_{l}} of t considered over the data set successively inherit the maximum possible variance from X, with each coefficient vector w constrained to be a unit vector. The above may equivalently be written in matrix form as T = X W {\displaystyle \mathbf {T} =\mathbf {X} \mathbf {W} } where T i k = t k ( i ) {\displaystyle {\mathbf {T} }_{ik}={t_{k}}_{(i)}} , X i j = x j ( i ) {\displaystyle {\mathbf {X} }_{ij}={x_{j}}_{(i)}} , and W j k = w j ( k ) {\displaystyle {\mathbf {W} }_{jk}={w_{j}}_{(k)}} . === First component === In order to maximize variance, the first weight vector w(1) thus has to satisfy w ( 1 ) = arg ⁡ max ‖ w ‖ = 1 { ∑ i ( t 1 ) ( i ) 2 } = arg ⁡ max ‖ w ‖ = 1 { ∑ i ( x ( i ) ⋅ w ) 2 } {\displaystyle \mathbf {w} _{(1)}=\arg \max _{\Vert \mathbf {w} \Vert =1}\,\left\{\sum _{i}(t_{1})_{(i)}^{2}\right\}=\arg \max _{\Vert \mathbf {w} \Vert =1}\,\left\{\sum _{i}\left(\mathbf {x} _{(i)}\cdot \mathbf {w} \right)^{2}\right\}} Equivalently, writing this in matrix form gives w ( 1 ) = arg ⁡ max ‖ w ‖ = 1 { ‖ X w ‖ 2 } = arg ⁡ max ‖ w ‖ = 1 { w T X T X w } {\displaystyle \mathbf {w} _{(1)}=\arg \max _{\left\|\mathbf {w} \right\|=1}\left\{\left\|\mathbf {Xw} \right\|^{2}\right\}=\arg \max _{\left\|\mathbf {w} \right\|=1}\left\{\mathbf {w} ^{\mathsf {T}}\mathbf {X} ^{\mathsf {T}}\mathbf {Xw} \right\}} Since w(1) has been defined to be a unit vector, it equivalently also satisfies w ( 1 ) = arg ⁡ max { w T X T X w w T w } {\displaystyle \mathbf {w} _{(1)}=\arg \max \left\{{\frac {\mathbf {w} ^{\mathsf {T}}\mathbf {X} ^{\mathsf {T}}\mathbf {Xw} }{\mathbf {w} ^{\mathsf {T}}\mathbf {w} }}\right\}} The quantity to be maximised can be recognised as a Rayleigh quotient. A standard result for a positive semidefinite matrix such as XTX is that the quotient's maximum possible value is the largest eigenvalue of the matrix, which occurs when w is the corresponding eigenvector. With w(1) found, the first principal component of a data vector

    Read more →
  • Rectified linear unit

    Rectified linear unit

    In the context of artificial neural networks, the rectifier or ReLU (rectified linear unit) activation function is an activation function defined as the non-negative part of its argument, i.e., the ramp function: ReLU ⁡ ( x ) = x + = max ( 0 , x ) = x + | x | 2 = { x if x > 0 , 0 x ≤ 0 {\displaystyle \operatorname {ReLU} (x)=x^{+}=\max(0,x)={\frac {x+|x|}{2}}={\begin{cases}x&{\text{if }}x>0,\\0&x\leq 0\end{cases}}} where x {\displaystyle x} is the input to a neuron. This is analogous to half-wave rectification in electrical engineering. ReLU is one of the most popular activation functions for artificial neural networks, and finds application in computer vision and speech recognition using deep neural nets and computational neuroscience. == History == The ReLU was first used by Alston Householder in 1941 as a mathematical abstraction of biological neural networks. Kunihiko Fukushima in 1969 used ReLU in the context of visual feature extraction in hierarchical neural networks. In 1998, Gregory Woodbury demonstrated that the rectified linear function could account for a broad range of emergent properties in the visual cortex. His work showed that a single unified model could drive the joint development of refined retinotopic maps, ocular dominance columns, and orientation selectivity. By utilizing the rectifier's "cutoff" property, Woodbury achieved a close quantitative fit to biological data, matching the spatial periodicities and topographic refinement patterns observed in macaque and cat cortical maps. Furthermore, he extended this framework to adult plasticity, accurately replicating the spatial and temporal dynamics of lesion-induced cortical reorganization. This research established that the rectified linear response was a necessary mechanism for the stable self-organisation and maintenance of complex, multi-feature neural maps. In 2000, Hahnloser et al. argued that ReLU approximates the biological relationship between neural firing rates and input current, in addition to enabling recurrent neural network dynamics to stabilise under weaker criteria. Prior to 2010, most activation functions used were the logistic sigmoid (which is inspired by probability theory; see logistic regression) and its more numerically efficient counterpart, the hyperbolic tangent. Around 2010, the use of ReLU became common again. Jarrett et al. (2009) noted that rectification by either absolute or ReLU (which they called "positive part") was critical for object recognition in convolutional neural networks (CNNs), specifically because it allows average pooling without neighboring filter outputs cancelling each other out. They hypothesized that the use of sigmoid or tanh was responsible for poor performance in previous CNNs. Nair and Hinton (2010) made a theoretical argument that the softplus activation function should be used, in that the softplus function numerically approximates the sum of an exponential number of linear models that share parameters. They then proposed ReLU as a good approximation to it. Specifically, they began by considering a single binary neuron in a Boltzmann machine that takes x {\displaystyle x} as input, and produces 1 as output with probability σ ( x ) = 1 1 + e − x {\displaystyle \sigma (x)={\frac {1}{1+e^{-x}}}} . They then considered extending its range of output by making infinitely many copies of it X 1 , X 2 , X 3 , … {\displaystyle X_{1},X_{2},X_{3},\dots } , that all take the same input, offset by an amount 0.5 , 1.5 , 2.5 , … {\displaystyle 0.5,1.5,2.5,\dots } , then their outputs are added together as ∑ i = 1 ∞ X i {\displaystyle \sum _{i=1}^{\infty }X_{i}} . They then demonstrated that ∑ i = 1 ∞ X i {\displaystyle \sum _{i=1}^{\infty }X_{i}} is approximately equal to N ( log ⁡ ( 1 + e x ) , σ ( x ) ) {\displaystyle {\mathcal {N}}(\log(1+e^{x}),\sigma (x))} , which is also approximately equal to ReLU ⁡ ( N ( x , σ ( x ) ) ) {\displaystyle \operatorname {ReLU} ({\mathcal {N}}(x,\sigma (x)))} , where N {\displaystyle {\mathcal {N}}} stands for the gaussian distribution. They also argued for another reason for using ReLU: that it allows "intensity equivariance" in image recognition. That is, multiplying input image by a constant k {\displaystyle k} multiplies the output also. In contrast, this is false for other activation functions like sigmoid or tanh. They found that ReLU activation allowed good empirical performance in restricted Boltzmann machines. Glorot et al (2011) argued that ReLU has the following advantages over sigmoid or tanh: ReLU is more similar to biological neurons' responses in their main operating regime. ReLU avoids vanishing gradients. ReLU is cheaper to compute. ReLU creates sparse representation naturally, because many hidden units output exactly zero for a given input. They also found empirically that deep networks trained with ReLU can achieve strong performance without unsupervised pre-training, especially on large, purely supervised tasks. In 2017, the rectified linear function became a central component of the transformer architecture introduced in the Vaswani et al paper "Attention Is All You Need". Within every transformer layer, ReLU is utilized in the position-wise feed-forward networks (FFN), defined by Equation 2 of their paper: FFN ⁡ ( x ) = max ( 0 , x W 1 + b 1 ) W 2 + b 2 {\displaystyle \operatorname {FFN} (x)=\max(0,xW_{1}+b_{1})W_{2}+b_{2}} This equation is foundational to the model's capacity; while the attention mechanism determines the relationships between tokens, the ReLU-based FFN performs the majority of the numerical computation and houses the bulk of the model's parameters. The efficiency and scalability of this rectified framework triggered a global technological revolution, enabling the development of Large Language Models that have had a profound economic impact. The industrial response to this architecture—including the massive expansion of AI-specific hardware and the birth of the generative AI sector—has positioned the Transformer as a cornerstone of 21st-century infrastructure. During the post 2017 period of rapid AI advancement, the rectified linear unit function has been key to achieving increased model performance and scaling due to the fact that it zeros out responses that are immaterial for a given stimuli, preventing them from accumulating in massive scale models. It is the complete silencing of the parts of the model found to be stimuli-irrelevant during learning that allows for scaling. As the stimuli-irrelevant proportion of the model becomes more massive, these highly numerous connections within the model would inevitably accumulate during scaling no matter how small each individual response is. Therefore, the rectified linear unit function, with its absolute zeroing property, enabled the scaling to hundred billion parameter models and beyond. Early Transformer scaling giants like GPT-3 (2020) and Falcon-180B (2023) relied on the rectified linear unit function explicitly, while successors such as GPT-4 (2023) and Llama 3 (2024) utilized smoother variants like GELU or SwiGLU. These variants were used to improve training stability while fundamentally preserving the rectified principle of zeroing low responses. At the centre of modern artificial intelligence ReLU and its variants maintain absolute zero response across the bulk of the model at any one time, while maintaining approximately linear reponses for stimuli-relevant connections enabling high performance on each specific cognitive task. This feature of activation sparsity has been critical for massive scaling and performance gains of AI models right up to the present day. == Advantages == Advantages of ReLU include: Sparse activation: for example, in a randomly initialized network, only about 50% of hidden units are activated (i.e. have a non-zero output). Better gradient propagation: fewer vanishing gradient problems compared to sigmoidal activation functions that saturate in both directions. Efficiency: only requires comparison and addition. Scale-invariant (homogeneous, or "intensity equivariance"): max ( 0 , a x ) = a max ( 0 , x ) for a ≥ 0 {\displaystyle \max(0,ax)=a\max(0,x){\text{ for }}a\geq 0} . == Potential problems == Possible downsides can include: Non-differentiability at zero (however, it is differentiable anywhere else, and the value of the derivative at zero can be chosen to be 0 or 1 arbitrarily). Not zero-centered: ReLU outputs are always non-negative. This can make it harder for the network to learn during backpropagation, because gradient updates tend to push weights in one direction (positive or negative). Batch normalization can help address this. ReLU is unbounded. Redundancy of the parametrization: Because ReLU is scale-invariant, the network computes the exact same function by scaling the weights and biases in front of a ReLU activation by k {\displaystyle k} , and the weights after by 1 / k {\displaystyle 1/k} . Dying ReLU: ReLU neurons can sometimes be pushed into states

    Read more →
  • Dynamic time warping

    Dynamic time warping

    In time series analysis, dynamic time warping (DTW) is an algorithm for measuring similarity between two temporal sequences, which may vary in speed. For instance, similarities in walking could be detected using DTW, even if one person was walking faster than the other, or if there were accelerations and decelerations during the course of an observation. DTW has been applied to temporal sequences of video, audio, and graphics data — indeed, any data that can be turned into a one-dimensional sequence can be analyzed with DTW. A well-known application has been automatic speech recognition, to cope with different speaking speeds. Other applications include speaker recognition and online signature recognition. It can also be used in partial shape matching applications. In general, DTW is a method that calculates an optimal match between two given sequences (e.g. time series) with certain restriction and rules: Every index from the first sequence must be matched with one or more indices from the other sequence, and vice versa The first index from the first sequence must be matched with the first index from the other sequence (but it does not have to be its only match) The last index from the first sequence must be matched with the last index from the other sequence (but it does not have to be its only match) The mapping of the indices from the first sequence to indices from the other sequence must be monotonically increasing, and vice versa, i.e. if j > i {\displaystyle j>i} are indices from the first sequence, then there must not be two indices l > k {\displaystyle l>k} in the other sequence, such that index i {\displaystyle i} is matched with index l {\displaystyle l} and index j {\displaystyle j} is matched with index k {\displaystyle k} , and vice versa We can plot each match between the sequences 1 : M {\displaystyle 1:M} and 1 : N {\displaystyle 1:N} as a path in a M × N {\displaystyle M\times N} matrix from ( 1 , 1 ) {\displaystyle (1,1)} to ( M , N ) {\displaystyle (M,N)} , such that each step is one of ( 0 , 1 ) , ( 1 , 0 ) , ( 1 , 1 ) {\displaystyle (0,1),(1,0),(1,1)} . In this formulation, we see that the number of possible matches is the Delannoy number. The optimal match is denoted by the match that satisfies all the restrictions and the rules and that has the minimal cost, where the cost is computed as the sum of absolute differences, for each matched pair of indices, between their values. The sequences are "warped" non-linearly in the time dimension to determine a measure of their similarity independent of certain non-linear variations in the time dimension. This sequence alignment method is often used in time series classification. Although DTW measures a distance-like quantity between two given sequences, it doesn't guarantee the triangle inequality to hold. In addition to a similarity measure between the two sequences (a so called "warping path" is produced), by warping according to this path the two signals may be aligned in time. The signal with an original set of points X(original), Y(original) is transformed to X(warped), Y(warped). This finds applications in genetic sequence and audio synchronisation. In a related technique sequences of varying speed may be averaged using this technique see the average sequence section. This is conceptually very similar to the Needleman–Wunsch algorithm. == Implementation == This example illustrates the implementation of the dynamic time warping algorithm when the two sequences s and t are strings of discrete symbols. For two symbols x and y, d ( x , y ) {\displaystyle d(x,y)} is a distance between the symbols, e.g., d ( x , y ) = | x − y | {\displaystyle d(x,y)=|x-y|} . int DTWDistance(s: array [1..n], t: array [1..m]) { DTW := array [0..n, 0..m] for i := 0 to n for j := 0 to m DTW[i, j] := infinity DTW[0, 0] := 0 for i := 1 to n for j := 1 to m cost := d(s[i], t[j]) DTW[i, j] := cost + minimum(DTW[i-1, j ], // insertion DTW[i , j-1], // deletion DTW[i-1, j-1]) // match return DTW[n, m] } where DTW[i, j] is the distance between s[1:i] and t[1:j] with the best alignment. We sometimes want to add a locality constraint. That is, we require that if s[i] is matched with t[j], then | i − j | {\displaystyle |i-j|} is no larger than w, a window parameter. We can easily modify the above algorithm to add a locality constraint (differences marked). However, the above given modification works only if | n − m | {\displaystyle |n-m|} is no larger than w, i.e. the end point is within the window length from diagonal. In order to make the algorithm work, the window parameter w must be adapted so that | n − m | ≤ w {\displaystyle |n-m|\leq w} (see the line marked with () in the code). int DTWDistance(s: array [1..n], t: array [1..m], w: int) { DTW := array [0..n, 0..m] w := max(w, abs(n-m)) // adapt window size () for i := 0 to n for j:= 0 to m DTW[i, j] := infinity DTW[0, 0] := 0 for i := 1 to n for j := max(1, i-w) to min(m, i+w) DTW[i, j] := 0 for i := 1 to n for j := max(1, i-w) to min(m, i+w) cost := d(s[i], t[j]) DTW[i, j] := cost + minimum(DTW[i-1, j ], // insertion DTW[i , j-1], // deletion DTW[i-1, j-1]) // match return DTW[n, m] } == Warping properties == The DTW algorithm produces a discrete matching between existing elements of one series to another. In other words, it does not allow time-scaling of segments within the sequence. Other methods allow continuous warping. For example, Correlation Optimized Warping (COW) divides the sequence into uniform segments that are scaled in time using linear interpolation, to produce the best matching warping. The segment scaling causes potential creation of new elements, by time-scaling segments either down or up, and thus produces a more sensitive warping than DTW's discrete matching of raw elements. == Complexity == The time complexity of the DTW algorithm is O ( N M ) {\displaystyle O(NM)} , where N {\displaystyle N} and M {\displaystyle M} are the lengths of the two input sequences. The 50 years old quadratic time bound was broken in 2016: an algorithm due to Gold and Sharir enables computing DTW in O ( N 2 / log ⁡ log ⁡ N ) {\displaystyle O({N^{2}}/\log \log N)} time and space for two input sequences of length N {\displaystyle N} . This algorithm can also be adapted to sequences of different lengths. Despite this improvement, it was shown that a strongly subquadratic running time of the form O ( N 2 − ϵ ) {\displaystyle O(N^{2-\epsilon })} for some ϵ > 0 {\displaystyle \epsilon >0} cannot exist unless the Strong exponential time hypothesis fails. While the dynamic programming algorithm for DTW requires O ( N M ) {\displaystyle O(NM)} space in a naive implementation, the space consumption can be reduced to O ( min ( N , M ) ) {\displaystyle O(\min(N,M))} using Hirschberg's algorithm. == Fast computation == Fast techniques for computing DTW include PrunedDTW, SparseDTW, FastDTW, and the MultiscaleDTW. A common task, retrieval of similar time series, can be accelerated by using lower bounds such as LB_Keogh, LB_Improved, or LB_Petitjean. However, the Early Abandon and Pruned DTW algorithm reduces the degree of acceleration that lower bounding provides and sometimes renders it ineffective. In a survey, Wang et al. reported slightly better results with the LB_Improved lower bound than the LB_Keogh bound, and found that other techniques were inefficient. Subsequent to this survey, the LB_Enhanced bound was developed that is always tighter than LB_Keogh while also being more efficient to compute. LB_Petitjean is the tightest known lower bound that can be computed in linear time. == Average sequence == Averaging for dynamic time warping is the problem of finding an average sequence for a set of sequences. NLAAF is an exact method to average two sequences using DTW. For more than two sequences, the problem is related to that of multiple alignment and requires heuristics. DBA is currently a reference method to average a set of sequences consistently with DTW. COMASA efficiently randomizes the search for the average sequence, using DBA as a local optimization process. == Supervised learning == A nearest-neighbour classifier can achieve state-of-the-art performance when using dynamic time warping as a distance measure. == Amerced Dynamic Time Warping == Amerced Dynamic Time Warping (ADTW) is a variant of DTW designed to better control DTW's permissiveness in the alignments that it allows. The windows that classical DTW uses to constrain alignments introduce a step function. Any warping of the path is allowed within the window and none beyond it. In contrast, ADTW employs an additive penalty that is incurred each time that the path is warped. Any amount of warping is allowed, but each warping action incurs a direct penalty. ADTW significantly outperforms DTW with windowing when applied as a nearest neighbor classifier on a set of benchmark time series classification tasks. == Alternative approaches == In functional data analysis, time series are regarde

    Read more →
  • Elements of AI

    Elements of AI

    Elements of AI is a massive open online course (MOOC) teaching the basics of artificial intelligence. The course, originally launched in 2018, is designed and organized by the University of Helsinki and learning technology company MinnaLearn. The course includes modules on machine learning, neural networks, the philosophy of artificial intelligence, and using artificial intelligence to solve problems. It consists of two parts: Introduction to AI and its sequel, Building AI, that was released in late 2020. In November 2019, the course was named one of four winners of MIT’s Inclusive Innovation Challenge. University of Helsinki's computer science department is known as the alma mater of Linus Torvalds, a Finnish-American software engineer who is the creator of the Linux kernel, which is the kernel for Linux operating systems. == EU’s AI pledge == The government of Finland has pledged to offer the course for all EU citizens by the end of 2021, as the course is made available in all the official EU languages. The initiative was launched as part of Finland's Presidency of the Council of the European Union in 2019, with the European Commission providing translations of the course materials. In 2017, Finland launched an AI strategy to stay competitive in the field of AI amid growing competition between China and the United States. With the support of private companies and the government, Finland's now-realized goal was to get 1 percent of its citizens to participate in Elements of AI. Other governments have also given their support to the course. For instance, Germany's Federal Minister for Economic Affairs and Energy Peter Altmeier has encouraged citizens to take part in the course to help Germany gain a competitive advantage in AI. Sweden's Minister for Energy and Minister for Digital Development Anders Ygeman has said that Sweden aims to teach 1 percent of its population the basics of AI like Finland has. == Participants == Elements of AI had enrolled more than 1 million students from more than 110 countries by May 2023. A quarter of the course's participants are aged 45 and over, and some 40 percent are women. Among Nordic participants, the share of women is nearly 60 percent. In September 2022, the course was available in Finnish, Swedish, Estonian, English, German, Latvian, Norwegian, French, Belgian, Czech, Greek, Slovakian, Slovenian, Latvian, Lithuanian, Portuguese, Spanish, Irish, Icelandic, Maltese, Croatian, Romanian, Italian, Dutch, Polish, and Danish.

    Read more →
  • Geographical cluster

    Geographical cluster

    A geographical cluster is a localized anomaly, usually an excess of something given the distribution or variation of something else. Often it is considered as an incidence rate that is unusual in that there is more of some variable than might be expected. Examples would include: a local excess disease rate, a crime hot spot, areas of high unemployment, accident blackspots, unusually high positive residuals from a model, high concentrations of flora or fauna, physical features or events like earthquake epicenters etc... Identifying these extreme regions may be useful in that there could be implicit geographical associations with other variables that can be identified and would be of interest. Pattern detection via the identification of such geographical clusters is a very simple and generic form of geographical analysis that has many applications in many different contexts. The emphasis is on localized clustering or patterning because this may well contain the most useful information. A geographical cluster is different from a high concentration as it is generally second order, involving the factoring in of the distribution of something else. == Geographical cluster detection == Identifying geographical clusters can be an important stage in a geographical analysis. Mapping the locations of unusual concentrations may help identify causes of these. Some techniques include the Geographical Analysis Machine and Besag and Newell's cluster detection method.

    Read more →
  • Word2vec

    Word2vec

    Word2vec is a technique in natural language processing for obtaining vector representations of words. These vectors capture information about the meaning of the word based on the surrounding words. The word2vec algorithm estimates these representations by modeling text in a large corpus. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. Word2vec was developed by Tomáš Mikolov, Kai Chen, Greg Corrado, Ilya Sutskever and Jeff Dean at Google, and published in 2013. Word2vec represents a word as a high-dimension vector of numbers which capture relationships between words. In particular, words which appear in similar contexts are mapped to vectors which are nearby as measured by cosine similarity. This indicates the level of semantic similarity between the words, so for example the vectors for walk and ran are nearby, as are those for "but" and "however", and "Berlin" and "Germany". == Approach == Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a mapping of the set of words to a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a vector in the space. Word2vec can use either of two model architectures to produce these distributed representations of words: continuous bag of words (CBOW) or continuously sliding skip-gram. In both architectures, word2vec considers both individual words and a sliding context window as it iterates over the corpus. The CBOW can be viewed as a 'fill in the blank' task, where the word embedding represents the way the word influences the relative probabilities of other words in the context window. Words which are semantically similar should influence these probabilities in similar ways, because semantically similar words should be used in similar contexts. The order of context words does not influence prediction (bag of words assumption). In the continuous skip-gram architecture, the model uses the current word to predict the surrounding window of context words. The skip-gram architecture weighs nearby context words more heavily than more distant context words. According to the authors' note, CBOW is faster while skip-gram does a better job for infrequent words. After the model is trained, the learned word embeddings are positioned in the vector space such that words that share common contexts in the corpus — that is, words that are semantically and syntactically similar — are located close to one another in the space. More dissimilar words are located farther from one another in the space. == Mathematical details == This section is based on expositions. A corpus is a sequence of words. Both CBOW and skip-gram are methods to learn one vector per word appearing in the corpus. Let V {\displaystyle V} ("vocabulary") be the set of all words appearing in the corpus C {\displaystyle C} . Our goal is to learn one vector v w ∈ R d {\displaystyle v_{w}\in \mathbb {R} ^{d}} for each word w ∈ V {\displaystyle w\in V} . The idea of skip-gram is that the vector of a word should be close to the vector of each of its neighbors. The idea of CBOW is that the vector-sum of a word's neighbors should be close to the vector of the word. === Continuous bag-of-words (CBOW) === The idea of CBOW is to represent each word with a vector, such that it is possible to predict a word using the sum of the vectors of its neighbors. Specifically, for each word w i {\displaystyle w_{i}} in the corpus, the one-hot encoding of the word is used as the input to the neural network. The output of the neural network is a probability distribution over the dictionary, representing a prediction of individual words in the neighborhood of w i {\displaystyle w_{i}} . The objective of training is to maximize ∑ i ln ⁡ Pr ( w i ∣ w i + j : j ∈ N ) {\displaystyle \sum _{i}\ln \Pr(w_{i}\mid w_{i+j}\colon j\in N)} where N {\displaystyle N} is a set of (non-zero) indices representing the relative locations of nearby words considered to be in w i {\displaystyle w_{i}} 's neighborhood. For example, if we want each word in the corpus to be predicted by every other word in a small span of 4 words. The set of relative indexes of neighbor words will be: N = { − 2 , − 1 , + 1 , + 2 } {\displaystyle N=\{-2,-1,+1,+2\}} , and the objective is to maximize ∑ i ln ⁡ Pr ( w i ∣ w i − 2 , w i − 1 , w i + 1 , w i + 2 ) {\displaystyle \sum _{i}\ln \Pr(w_{i}\mid w_{i-2},w_{i-1},w_{i+1},w_{i+2})} . In standard bag-of-words, a word's context is represented by a word-count (aka a word histogram) of its neighboring words. For example, the "sat" in "the cat sat on the mat" is represented as {"the": 2, "cat": 1, "on": 1}. Note that the last word "mat" is not used to represent "sat", because it is outside the neighborhood N = { − 2 , − 1 , + 1 , + 2 } {\displaystyle N=\{-2,-1,+1,+2\}} . In continuous bag-of-words, the histogram is multiplied by a matrix V {\displaystyle V} to obtain a continuous representation of the word's context. The matrix V {\displaystyle V} is also called a dictionary. Its columns are the word vectors. It has D {\displaystyle D} columns, where D {\displaystyle D} is the size of the dictionary. Let d {\displaystyle d} be the length of each word vector. We have V ∈ R d × D {\displaystyle V\in \mathbb {R} ^{d\times D}} . For example, multiplying the word histogram {"the": 2, "cat": 1, "on": 1} with V {\displaystyle V} , we obtain 2 v the + v cat + v on {\displaystyle 2v_{\text{the}}+v_{\text{cat}}+v_{\text{on}}} . This is then multiplied with another matrix V ′ {\displaystyle V'} of shape R D × d {\displaystyle \mathbb {R} ^{D\times d}} . Each row of it is a word vector v ′ {\displaystyle v'} . This results in a vector of length D {\displaystyle D} , one entry per dictionary entry. Then, apply the softmax to obtain a probability distribution over the dictionary. This system can be visualized as a neural network, similar in spirit to an autoencoder, of architecture linear-linear-softmax, as depicted in the diagram. The system is trained by gradient descent to minimize the cross-entropy loss. In full formula, the cross-entropy loss is: − ∑ i ln ⁡ e v w i ′ ⋅ ( ∑ j ∈ N v w j + i ) ∑ w ′ e v w ′ ′ ⋅ ( ∑ j ∈ N v w j + i ) {\displaystyle -\sum _{i}\ln {\frac {e^{v_{w_{i}}'\cdot (\sum _{j\in N}v_{w_{j+i}})}}{\sum _{w'}e^{v_{w'}'\cdot (\sum _{j\in N}v_{w_{j+i}})}}}} where the outer summation ∑ i {\displaystyle \sum _{i}} is over the words in a corpus, the quantity ∑ j ∈ N v w j + i {\displaystyle \sum _{j\in N}v_{w_{j+i}}} is the sum of a word's neighbors' vectors, etc. Once such a system is trained, we have two trained matrices V , V ′ {\displaystyle V,V'} . Either the column vectors of V {\displaystyle V} or the row vectors of V ′ {\displaystyle V'} can serve as the dictionary. For example, the word "sat" can be represented as either the "sat"-th column of V {\displaystyle V} or the "sat"-th row of V ′ {\displaystyle V'} . It is also possible to simply define V ′ = V ⊤ {\displaystyle V'=V^{\top }} , in which case there would no longer be a choice. === Skip-gram === The idea of skip-gram is to represent each word with a vector, such that it is possible to predict the vectors of its neighbors using the vector of a word. The architecture is still linear-linear-softmax, the same as CBOW, but the input and the output are switched. Specifically, for each word w i {\displaystyle w_{i}} in the corpus, the one-hot encoding of the word is used as the input to the neural network. The output of the neural network is a probability distribution over the dictionary, representing a prediction of individual words in the neighborhood of w i {\displaystyle w_{i}} . The objective of training is to maximize ∑ i ∑ j ∈ N ln ⁡ Pr ( w j + i ∣ w i ) {\displaystyle \sum _{i}\sum _{j\in N}\ln \Pr(w_{j+i}\mid w_{i})} . In full formula, the loss function is − ∑ i ∑ j ∈ N ln ⁡ e v w j + i ′ ⋅ v w i ∑ w ′ e v w ′ ′ ⋅ v w i {\displaystyle -\sum _{i}\sum _{j\in N}\ln {\frac {e^{v_{w_{j+i}}'\cdot v_{w_{i}}}}{\sum _{w'}e^{v_{w'}'\cdot v_{w_{i}}}}}} Same as CBOW, once such a system is trained, we have two trained matrices V , V ′ {\displaystyle V,V'} . Either the column vectors of V {\displaystyle V} or the row vectors of V ′ {\displaystyle V'} can serve as the dictionary. It is also possible to simply define V ′ = V ⊤ {\displaystyle V'=V^{\top }} , in which case there would no longer be a choice. Essentially, skip-gram and CBOW are exactly the same in architecture. They only differ in the objective function during training. == History == During the 1980s, there were some early attempts at using neural networks to represent words and concepts as vectors. In 2010, Tomáš Mikolov (then at Brno University of Technology) with co-authors applied a simple recurrent neural network with a single hidden

    Read more →
  • Density-based clustering validation

    Density-based clustering validation

    Density-Based Clustering Validation (DBCV) is a metric designed to assess the quality of clustering solutions, particularly for density-based clustering algorithms like DBSCAN, Mean shift, and OPTICS. This metric is particularly suited for identifying concave and nested clusters, where traditional metrics such as the Silhouette coefficient, Davies–Bouldin index, or Calinski–Harabasz index often struggle to provide meaningful evaluations. Unlike traditional validation measures, which often rely on compact and well-separated clusters, DBCV index evaluates how well clusters are defined in terms of local density variations and structural coherence. This metric was introduced in 2014 by David Moulavi and colleagues in their work. It utilizes density connectivity principles to quantify clustering structures, making it especially effective at detecting arbitrarily shaped clusters in concave datasets, where traditional metrics may be less reliable. The DBCV index has been employed for clustering analysis in bioinformatics, ecology, techno-economy, and health informatics , as well as in numerous other fields. == Definition == DBCV index evaluates clustering structures by analyzing the relationships between data points within and across clusters. Given a dataset X = x 1 , x 2 , . . . , x n {\displaystyle X={x_{1},x_{2},...,x_{n}}} , a density-based algorithm partitions it into K clusters C 1 , C 2 , . . . , C K {\displaystyle {C_{1},C_{2},...,C_{K}}} . Each point x i {\displaystyle x_{i}} belongs to a specific cluster, denoted as C c l u s t e r ( x i ) {\displaystyle C_{cluster(x_{i})}} A key concept in DBCV index is the notion of density-connected paths. Two points within the same cluster are considered density-connected if there exists a sequence of intermediate points linking them, where each consecutive pair meets a predefined density criterion. The density-based distance between two points is determined by identifying the optimal path that minimizes the maximum local reachability distance along its trajectory. DBCV index extends the Silhouette coefficient by redefining cluster cohesion and separation using density-based distances: Within-cluster density distance measures how closely a point is related to other members of its cluster: a i = 1 | C c l u s t e r ( x i ) | − 1 ∑ x j ∈ C c l u s t e r ( x i ) , y ≠ x d d e n s i t y ( x j , x i ) {\displaystyle a_{i}={\frac {1}{|C_{cluster(x_{i})}|-1}}\sum _{x_{j}\in C_{cluster(x_{i})},y\neq x}d_{density}(x_{j},x_{i})} Nearest-cluster density distance quantifies how far a point is from the closest external cluster: b i = min C ≠ C cluster ( x i ) C ∈ { C 1 , … , C K } ( 1 | C | ∑ x j ∈ C d density ( x i , x j ) ) . {\displaystyle b_{i}=\min _{C\neq C_{{\text{cluster}}(x_{i})} \atop C\in \{C_{1},\dots ,C_{K}\}}\left({\frac {1}{|C|}}\sum _{x_{j}\in C}d_{\text{density}}(x_{i},x_{j})\right).} Using these measures, the DBCV index is computed as: D B C V = 1 n ∑ i = 1 n b i − a i max ( a i , b i ) {\displaystyle DBCV={\frac {1}{n}}\sum _{i=1}^{n}{\frac {b_{i}-a_{i}}{\max(a_{i},b_{i})}}} == Explanation == DBCV index values range between −1 and +1: +1: Strongly cohesive and well-separated clusters. 0: Ambiguous clustering structure. −1: Poorly formed clusters or incorrect assignments. By leveraging density-based distances instead of traditional Euclidean measures, DBCV index provides a more robust evaluation of clustering performance in datasets with irregular or non-spherical distributions.

    Read more →
  • Stereo cameras

    Stereo cameras

    The stereo cameras approach is a method of distilling a noisy video signal into a coherent data set that a computer can begin to process into actionable symbolic objects, or abstractions. Stereo cameras is one of many approaches used in the broader fields of computer vision and machine vision. == Calculation == In this approach, two cameras with a known physical relationship (i.e. a common field of view the cameras can see, and how far apart their focal points sit in physical space) are correlated via software. By finding mappings of common pixel values, and calculating how far apart these common areas reside in pixel space, a rough depth map can be created. This is very similar to how the human brain uses stereoscopic information from the eyes to gain depth cue information, i.e. how far apart any given object in the scene is from the viewer. The camera attributes must be known, focal length and distance apart etc., and a calibration done. Once this is completed, the systems can be used to sense the distances of objects by triangulation. Finding the same singular physical point in the two left and right images is known as the correspondence problem. Correctly locating the point gives the computer the capability to calculate the distance that the robot or camera is from the object. On the BH2 Lunar Rover the cameras use five steps: a bayer array filter, photometric consistency dense matching algorithm, a Laplace of Gaussian (LoG) edge detection algorithm, a stereo matching algorithm and finally uniqueness constraint. == Uses == This type of stereoscopic image processing technique is used in applications such as 3D reconstruction, robotic control and sensing, crowd dynamics monitoring and off-planet terrestrial rovers; for example, in mobile robot navigation, tracking, gesture recognition, targeting, 3D surface visualization, immersive and interactive gaming. Although the Xbox Kinect sensor is also able to create a depth map of an image, it uses an infrared camera for this purpose, and does not use the dual-camera technique. Other approaches to stereoscopic sensing include time of flight sensors and ultrasound.

    Read more →
  • Multispectral pattern recognition

    Multispectral pattern recognition

    Multispectral remote sensing is the collection and analysis of reflected, emitted, or back-scattered energy from an object or an area of interest in multiple bands of regions of the electromagnetic spectrum (Jensen, 2005). Subcategories of multispectral remote sensing include hyperspectral, in which hundreds of bands are collected and analyzed, and ultraspectral remote sensing where many hundreds of bands are used (Logicon, 1997). The main purpose of multispectral imaging is the potential to classify the image using multispectral classification. This is a much faster method of image analysis than is possible by human interpretation. == Multispectral remote sensing systems == Remote sensing systems gather data via instruments typically carried on satellites in orbit around the Earth. The remote sensing scanner detects the energy that radiates from the object or area of interest. This energy is recorded as an analog electrical signal and converted into a digital value though an A-to-D conversion. There are several multispectral remote sensing systems that can be categorized in the following way: === Multispectral imaging using discrete detectors and scanning mirrors === Landsat Multispectral Scanner (MSS) Landsat Thematic Mapper (TM) NOAA Geostationary Operational Environmental Satellite (GOES) NOAA Advanced Very High Resolution Radiometer (AVHRR) NASA and ORBIMAGE, Inc., Sea-viewing Wide field-of-view Sensor (SeaWiFS) Daedalus, Inc., Aircraft Multispectral Scanner (AMS) NASA Airborne Terrestrial Applications Sensor (ATLAS) === Multispectral imaging using linear arrays === SPOT 1, 2, and 3 High Resolution Visible (HRV) sensors and Spot 4 and 5 High Resolution Visible Infrared (HRVIR) and vegetation sensor Indian Remote Sensing System (IRS) Linear Imaging Self-scanning Sensor (LISS) Space Imaging, Inc. (IKONOS) Digital Globe, Inc. (QuickBird) ORBIMAGE, Inc. (OrbView-3) ImageSat International, Inc. (EROS A1) NASA Terra Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) NASA Terra Multiangle Imaging Spectroradiometer (MISR) === Imaging spectrometry using linear and area arrays === NASA Jet Propulsion Laboratory Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) Compact Airborne Spectrographic Imager 3 (CASI 3) NASA Terra Moderate Resolution Imaging Spectrometer (MODIS) NASA Earth Observer (EO-1) Advanced Land Imager (ALI), Hyperion, and LEISA Atmospheric Corrector (LAC) === Satellite analog and digital photographic systems === Russian SPIN-2 TK-350, and KVR-1000 NASA Space Shuttle and International Space Station Imagery == Multispectral classification methods == A variety of methods can be used for the multispectral classification of images: Algorithms based on parametric and nonparametric statistics that use ratio-and interval-scaled data and nonmetric methods that can also incorporate nominal scale data (Duda et al., 2001), Supervised or unsupervised classification logic, Hard or soft (fuzzy) set classification logic to create hard or fuzzy thematic output products, Per-pixel or object-oriented classification logic, and Hybrid approaches == Supervised classification == In this classification method, the identity and location of some of the land-cover types are obtained beforehand from a combination of fieldwork, interpretation of aerial photography, map analysis, and personal experience. The analyst would locate sites that have similar characteristics to the known land-cover types. These areas are known as training sites because the known characteristics of these sites are used to train the classification algorithm for eventual land-cover mapping of the remainder of the image. Multivariate statistical parameters (means, standard deviations, covariance matrices, correlation matrices, etc.) are calculated for each training site. All pixels inside and outside of the training sites are evaluated and allocated to the class with the more similar characteristics. === Classification scheme === The first step in the supervised classification method is to identify the land-cover and land-use classes to be used. Land-cover refers to the type of material present on the site (e.g. water, crops, forest, wet land, asphalt, and concrete). Land-use refers to the modifications made by people to the land cover (e.g. agriculture, commerce, settlement). All classes should be selected and defined carefully to properly classify remotely sensed data into the correct land-use and/or land-cover information. To achieve this purpose, it is necessary to use a classification system that contains taxonomically correct definitions of classes. If a hard classification is desired, the following classes should be used: Mutually exclusive: there is not any taxonomic overlap of any classes (i.e., rain forest and evergreen forest are distinct classes). Exhaustive: all land-covers in the area have been included. Hierarchical: sub-level classes (e.g., single-family residential, multiple-family residential) are created, allowing that these classes can be included in a higher category (e.g., residential). Some examples of hard classification schemes are: American Planning Association Land-Based Classification System United States Geological Survey Land-use/Land-cover Classification System for Use with Remote Sensor Data U.S. Department of the Interior Fish and Wildlife Service U.S. National Vegetation and Classification System International Geosphere-Biosphere Program IGBP Land Cover Classification System === Training sites === Once the classification scheme is adopted, the image analyst may select training sites in the image that are representative of the land-cover or land-use of interest. If the environment where the data was collected is relatively homogeneous, the training data can be used. If different conditions are found in the site, it would not be possible to extend the remote sensing training data to the site. To solve this problem, a geographical stratification should be done during the preliminary stages of the project. All differences should be recorded (e.g. soil type, water turbidity, crop species, etc.). These differences should be recorded on the imagery and the selection training sites made based on the geographical stratification of this data. The final classification map would be a composite of the individual stratum classifications. After the data are organized in different training sites, a measurement vector is created. This vector would contain the brightness values for each pixel in each band in each training class. The mean, standard deviation, variance-covariance matrix, and correlation matrix are calculated from the measurement vectors. Once the statistics from each training site are determined, the most effective bands for each class should be selected. The objective of this discrimination is to eliminate the bands that can provide redundant information. Graphical and statistical methods can be used to achieve this objective. Some of the graphic methods are: Bar graph spectral plots Cospectral mean vector plots Feature space plots Cospectral parallelepiped or ellipse plots === Classification algorithm === The last step in supervised classification is selecting an appropriate algorithm. The choice of a specific algorithm depends on the input data and the desired output. Parametric algorithms are based on the fact that the data is normally distributed. If the data is not normally distributed, nonparametric algorithms should be used. The more common nonparametric algorithms are: One-dimensional density slicing Parallelipiped Minimum distance Nearest-neighbor Expert system analysis Convolutional neural network == Unsupervised classification == Unsupervised classification (also known as clustering) is a method of partitioning remote sensor image data in multispectral feature space and extracting land-cover information. Unsupervised classification require less input information from the analyst compared to supervised classification because clustering does not require training data. This process consists in a series of numerical operations to search for the spectral properties of pixels. From this process, a map with m spectral classes is obtained. Using the map, the analyst tries to assign or transform the spectral classes into thematic information of interest (i.e. forest, agriculture, urban). This process may not be easy because some spectral clusters represent mixed classes of surface materials and may not be useful. The analyst has to understand the spectral characteristics of the terrain to be able to label clusters as a specific information class. There are hundreds of clustering algorithms. Two of the most conceptually simple algorithms are the chain method and the ISODATA method. === Chain method === The algorithm used in this method operates in a two-pass mode (it passes through the multispectral dataset two times. In the first pass, the program reads through the dataset and sequentially builds clusters (groups of p

    Read more →
  • Sliced inverse regression

    Sliced inverse regression

    Sliced inverse regression (SIR) is a tool for dimensionality reduction in the field of multivariate statistics. In statistics, regression analysis is a method of studying the relationship between a response variable y and its input variable x _ {\displaystyle {\underline {x}}} , which is a p-dimensional vector. There are several approaches in the category of regression. For example, parametric methods include multiple linear regression, and non-parametric methods include local smoothing. As the number of observations needed to use local smoothing methods scales exponentially with high-dimensional data (as p grows), reducing the number of dimensions can make the operation computable. Dimensionality reduction aims to achieve this by showing only the most important dimension of the data. SIR uses the inverse regression curve, E ( x _ | y ) {\displaystyle E({\underline {x}}\,|\,y)} , to perform a weighted principal component analysis. == Model == Given a response variable Y {\displaystyle \,Y} and a (random) vector X ∈ R p {\displaystyle X\in \mathbb {R} ^{p}} of explanatory variables, SIR is based on the model Y = f ( β 1 ⊤ X , … , β k ⊤ X , ε ) ( 1 ) {\displaystyle Y=f(\beta _{1}^{\top }X,\ldots ,\beta _{k}^{\top }X,\varepsilon )\quad \quad \quad \quad \quad (1)} where β 1 , … , β k {\displaystyle \beta _{1},\ldots ,\beta _{k}} are unknown projection vectors, k {\displaystyle \,k} is an unknown number smaller than p {\displaystyle \,p} , f {\displaystyle \;f} is an unknown function on R k + 1 {\displaystyle \mathbb {R} ^{k+1}} as it only depends on k {\displaystyle \,k} arguments, and ε {\displaystyle \varepsilon } is a random variable representing error with E [ ε | X ] = 0 {\displaystyle E[\varepsilon |X]=0} and a finite variance of σ 2 {\displaystyle \sigma ^{2}} . The model describes an ideal solution, where Y {\displaystyle \,Y} depends on X ∈ R p {\displaystyle X\in \mathbb {R} ^{p}} only through a k {\displaystyle \,k} dimensional subspace; i.e., one can reduce the dimension of the explanatory variables from p {\displaystyle \,p} to a smaller number k {\displaystyle \,k} without losing any information. An equivalent version of ( 1 ) {\displaystyle \,(1)} is: the conditional distribution of Y {\displaystyle \,Y} given X {\displaystyle \,X} depends on X {\displaystyle \,X} only through the k {\displaystyle \,k} dimensional random vector ( β 1 ⊤ X , … , β k ⊤ X ) {\displaystyle (\beta _{1}^{\top }X,\ldots ,\beta _{k}^{\top }X)} . It is assumed that this reduced vector is as informative as the original X {\displaystyle \,X} in explaining Y {\displaystyle \,Y} . The unknown β i ′ s {\displaystyle \,\beta _{i}'s} are called the effective dimension reducing directions (EDR-directions). The space that is spanned by these vectors is denoted by the effective dimension reducing space (EDR-space). == Relevant linear algebra background == Given a _ 1 , … , a _ r ∈ R n {\displaystyle {\underline {a}}_{1},\ldots ,{\underline {a}}_{r}\in \mathbb {R} ^{n}} , then V := L ( a _ 1 , … , a _ r ) {\displaystyle V:=L({\underline {a}}_{1},\ldots ,{\underline {a}}_{r})} , the set of all linear combinations of these vectors is called a linear subspace and is therefore a vector space. The equation says that vectors a _ 1 , … , a _ r {\displaystyle {\underline {a}}_{1},\ldots ,{\underline {a}}_{r}} span V {\displaystyle \,V} , but the vectors that span space V {\displaystyle \,V} are not unique. The dimension of V ( ∈ R n ) {\displaystyle \,V(\in \mathbb {R} ^{n})} is equal to the maximum number of linearly independent vectors in V {\displaystyle \,V} . A set of n {\displaystyle \,n} linear independent vectors of R n {\displaystyle \mathbb {R} ^{n}} makes up a basis of R n {\displaystyle \mathbb {R} ^{n}} . The dimension of a vector space is unique, but the basis itself is not. Several bases can span the same space. Dependent vectors can still span a space, but the linear combinations of the latter are only suitable to a set of vectors lying on a straight line. == Inverse regression == Computing the inverse regression curve (IR) means instead of looking for E [ Y | X = x ] {\displaystyle \,E[Y|X=x]} , which is a curve in R p {\displaystyle \mathbb {R} ^{p}} it is actually E [ X | Y = y ] {\displaystyle \,E[X|Y=y]} , which is also a curve in R p {\displaystyle \mathbb {R} ^{p}} , but consisting of p {\displaystyle \,p} one-dimensional regressions. The center of the inverse regression curve is located at E [ E [ X | Y ] ] = E [ X ] {\displaystyle \,E[E[X|Y]]=E[X]} . Therefore, the centered inverse regression curve is E [ X | Y = y ] − E [ X ] {\displaystyle \,E[X|Y=y]-E[X]} which is a p {\displaystyle \,p} dimensional curve in R p {\displaystyle \mathbb {R} ^{p}} . == Inverse regression versus dimension reduction == The centered inverse regression curve lies on a k {\displaystyle \,k} -dimensional subspace spanned by Σ x x β i ′ s {\displaystyle \,\Sigma _{xx}\beta _{i}\,'s} . This is a connection between the model and inverse regression. Given this condition and ( 1 ) {\displaystyle \,(1)} , the centered inverse regression curve E [ X | Y = y ] − E [ X ] {\displaystyle \,E[X|Y=y]-E[X]} is contained in the linear subspace spanned by Σ x x β k ( k = 1 , … , K ) {\displaystyle \,\Sigma _{xx}\beta _{k}(k=1,\ldots ,K)} , where Σ x x = C o v ( X ) {\displaystyle \,\Sigma _{xx}=Cov(X)} . == Estimation of the EDR-directions == After having had a look at all the theoretical properties, the aim now is to estimate the EDR-directions. For that purpose, weighted principal component analyses are needed. If the sample means m ^ h ′ s {\displaystyle \,{\hat {m}}_{h}\,'s} , X {\displaystyle \,X} would have been standardized to Z = Σ x x − 1 / 2 { X − E ( X ) } {\displaystyle \,Z=\Sigma _{xx}^{-1/2}\{X-E(X)\}} . Corresponding to the theorem above, the IR-curve m 1 ( y ) = E [ Z | Y = y ] {\displaystyle \,m_{1}(y)=E[Z|Y=y]} lies in the space spanned by ( η 1 , … , η k ) {\displaystyle \,(\eta _{1},\ldots ,\eta _{k})} , where η i = Σ x x 1 / 2 β i {\displaystyle \,\eta _{i}=\Sigma _{xx}^{1/2}\beta _{i}} . As a consequence, the covariance matrix c o v [ E [ Z | Y ] ] {\displaystyle \,cov[E[Z|Y]]} is degenerate in any direction orthogonal to the η i ′ s {\displaystyle \,\eta _{i}\,'s} . Therefore, the eigenvectors η k ( k = 1 , … , K ) {\displaystyle \,\eta _{k}(k=1,\ldots ,K)} associated with the largest K {\displaystyle \,K} eigenvalues are the standardized EDR-directions. == Algorithm == === SIR algorithm === The algorithm from Li, K-C. (1991) to estimate the EDR-directions via SIR is as follows. 1. Let Σ x x {\displaystyle \,\Sigma _{xx}} be the covariance matrix of X {\displaystyle \,X} . Standardize X {\displaystyle \,X} to Z = Σ x x − 1 / 2 { X − E ( X ) } {\displaystyle \,Z=\Sigma _{xx}^{-1/2}\{X-E(X)\}} ( 1 ) {\displaystyle \,(1)} can also be rewritten as Y = f ( η 1 ⊤ Z , … , η k ⊤ Z , ε ) {\displaystyle Y=f(\eta _{1}^{\top }Z,\ldots ,\eta _{k}^{\top }Z,\varepsilon )} where η k = β k Σ x x 1 / 2 ∀ k {\displaystyle \,\eta _{k}=\beta _{k}\Sigma _{xx}^{1/2}\quad \forall \;k} .) 2. Divide the range of y i {\displaystyle \,y_{i}} into S {\displaystyle \,S} non-overlapping slices H s ( s = 1 , … , S ) . n s {\displaystyle \,H_{s}(s=1,\ldots ,S).\;n_{s}} is the number of observations within each slice and I H s {\displaystyle \,I_{H_{s}}} is the indicator function for the slice: n s = ∑ i = 1 n I H s ( y i ) {\displaystyle n_{s}=\sum _{i=1}^{n}I_{H_{s}}(y_{i})} 3. Compute the mean of z i {\displaystyle \,z_{i}} over all slices, which is a crude estimate m ^ 1 {\displaystyle \,{\hat {m}}_{1}} of the inverse regression curve m 1 {\displaystyle \,m_{1}} : z ¯ s = n s − 1 ∑ i = 1 n z i I H s ( y i ) {\displaystyle \,{\bar {z}}_{s}=n_{s}^{-1}\sum _{i=1}^{n}z_{i}I_{H_{s}}(y_{i})} 4. Calculate the estimate for C o v { m 1 ( y ) } {\displaystyle \,Cov\{m_{1}(y)\}} : V ^ = n − 1 ∑ i = 1 S n s z ¯ s z ¯ s ⊤ {\displaystyle \,{\hat {V}}=n^{-1}\sum _{i=1}^{S}n_{s}{\bar {z}}_{s}{\bar {z}}_{s}^{\top }} 5. Identify the eigenvalues λ ^ i {\displaystyle \,{\hat {\lambda }}_{i}} and the eigenvectors η ^ i {\displaystyle \,{\hat {\eta }}_{i}} of V ^ {\displaystyle \,{\hat {V}}} , which are the standardized EDR-directions. 6. Transform the standardized EDR-directions back to the original scale. The estimates for the EDR-directions are given by: β ^ i = Σ ^ x x − 1 / 2 η ^ i {\displaystyle \,{\hat {\beta }}_{i}={\hat {\Sigma }}_{xx}^{-1/2}{\hat {\eta }}_{i}} (which are not necessarily orthogonal.)

    Read more →
  • Naive Bayes classifier

    Naive Bayes classifier

    In statistics, naive (sometimes simple or idiot's) Bayes classifiers are a family of "probabilistic classifiers" which assume that the features are conditionally independent, given the target class. In other words, a naive Bayes model assumes the information about the class provided by each variable is unrelated to the information from the others, with no information shared between the predictors. The highly unrealistic nature of this assumption, called the naive independence assumption, is what gives the classifier its name. These classifiers are some of the simplest Bayesian network models. Naive Bayes classifiers generally perform worse than more advanced models like logistic regressions, especially at quantifying uncertainty (with naive Bayes models often producing wildly overconfident probabilities). However, they are highly scalable, requiring only one parameter for each feature or predictor in a learning problem. Maximum-likelihood training can be done by evaluating a closed-form expression (simply by counting observations in each group), rather than the expensive iterative approximation algorithms required by most other models. Despite the use of Bayes' theorem in the classifier's decision rule, naive Bayes is not (necessarily) a Bayesian method, and naive Bayes models can be fit to data using either Bayesian or frequentist methods. == Introduction == Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. There is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable. For example, a fruit may be considered to be an apple if it is red, round, and about 10 cm in diameter. A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of any possible correlations between the color, roundness, and diameter features. In many practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood; in other words, one can work with the naive Bayes model without accepting Bayesian probability or using any Bayesian methods. Despite their naive design and apparently oversimplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, an analysis of the Bayesian classification problem showed that there are sound theoretical reasons for the apparently implausible efficacy of naive Bayes classifiers. Still, a comprehensive comparison with other classification algorithms in 2006 showed that Bayes classification is outperformed by other approaches, such as boosted trees or random forests. An advantage of naive Bayes is that it only requires a small amount of training data to estimate the parameters necessary for classification. == Probabilistic model == Abstractly, naive Bayes is a conditional probability model: it assigns probabilities p ( C k ∣ x 1 , … , x n ) {\displaystyle p(C_{k}\mid x_{1},\ldots ,x_{n})} for each of the K possible outcomes or classes C k {\displaystyle C_{k}} given a problem instance to be classified, represented by a vector x = ( x 1 , … , x n ) {\displaystyle \mathbf {x} =(x_{1},\ldots ,x_{n})} encoding some n features (independent variables). The problem with the above formulation is that if the number of features n is large or if a feature can take on a large number of values, then basing such a model on probability tables is infeasible. The model must therefore be reformulated to make it more tractable. Using Bayes' theorem, the conditional probability can be decomposed as: p ( C k ∣ x ) = p ( C k ) p ( x ∣ C k ) p ( x ) {\displaystyle p(C_{k}\mid \mathbf {x} )={\frac {p(C_{k})\ p(\mathbf {x} \mid C_{k})}{p(\mathbf {x} )}}\,} In plain English, using Bayesian probability terminology, the above equation can be written as posterior = prior × likelihood evidence {\displaystyle {\text{posterior}}={\frac {{\text{prior}}\times {\text{likelihood}}}{\text{evidence}}}\,} In practice, there is interest only in the numerator of that fraction, because the denominator does not depend on C {\displaystyle C} and the values of the features x i {\displaystyle x_{i}} are given, so that the denominator is effectively constant. The numerator is equivalent to the joint probability model p ( C k , x 1 , … , x n ) {\displaystyle p(C_{k},x_{1},\ldots ,x_{n})\,} which can be rewritten as follows, using the chain rule for repeated applications of the definition of conditional probability: p ( C k , x 1 , … , x n ) = p ( x 1 , … , x n , C k ) = p ( x 1 ∣ x 2 , … , x n , C k ) p ( x 2 , … , x n , C k ) = p ( x 1 ∣ x 2 , … , x n , C k ) p ( x 2 ∣ x 3 , … , x n , C k ) p ( x 3 , … , x n , C k ) = ⋯ = p ( x 1 ∣ x 2 , … , x n , C k ) p ( x 2 ∣ x 3 , … , x n , C k ) ⋯ p ( x n − 1 ∣ x n , C k ) p ( x n ∣ C k ) p ( C k ) {\displaystyle {\begin{aligned}p(C_{k},x_{1},\ldots ,x_{n})&=p(x_{1},\ldots ,x_{n},C_{k})\\&=p(x_{1}\mid x_{2},\ldots ,x_{n},C_{k})\ p(x_{2},\ldots ,x_{n},C_{k})\\&=p(x_{1}\mid x_{2},\ldots ,x_{n},C_{k})\ p(x_{2}\mid x_{3},\ldots ,x_{n},C_{k})\ p(x_{3},\ldots ,x_{n},C_{k})\\&=\cdots \\&=p(x_{1}\mid x_{2},\ldots ,x_{n},C_{k})\ p(x_{2}\mid x_{3},\ldots ,x_{n},C_{k})\cdots p(x_{n-1}\mid x_{n},C_{k})\ p(x_{n}\mid C_{k})\ p(C_{k})\\\end{aligned}}} Now the "naive" conditional independence assumptions come into play: assume that all features in x {\displaystyle \mathbf {x} } are mutually independent, conditional on the category C k {\displaystyle C_{k}} . Under this assumption, p ( x i ∣ x i + 1 , … , x n , C k ) = p ( x i ∣ C k ) . {\displaystyle p(x_{i}\mid x_{i+1},\ldots ,x_{n},C_{k})=p(x_{i}\mid C_{k})\,.} Thus, the joint model can be expressed as p ( C k ∣ x 1 , … , x n ) ∝ p ( C k , x 1 , … , x n ) = p ( C k ) p ( x 1 ∣ C k ) p ( x 2 ∣ C k ) p ( x 3 ∣ C k ) ⋯ = p ( C k ) ∏ i = 1 n p ( x i ∣ C k ) , {\displaystyle {\begin{aligned}p(C_{k}\mid x_{1},\ldots ,x_{n})\varpropto \ &p(C_{k},x_{1},\ldots ,x_{n})\\&=p(C_{k})\ p(x_{1}\mid C_{k})\ p(x_{2}\mid C_{k})\ p(x_{3}\mid C_{k})\ \cdots \\&=p(C_{k})\prod _{i=1}^{n}p(x_{i}\mid C_{k})\,,\end{aligned}}} where ∝ {\displaystyle \varpropto } denotes proportionality since the denominator p ( x ) {\displaystyle p(\mathbf {x} )} is omitted. This means that under the above independence assumptions, the conditional distribution over the class variable C {\displaystyle C} is: p ( C k ∣ x 1 , … , x n ) = 1 Z p ( C k ) ∏ i = 1 n p ( x i ∣ C k ) {\displaystyle p(C_{k}\mid x_{1},\ldots ,x_{n})={\frac {1}{Z}}\ p(C_{k})\prod _{i=1}^{n}p(x_{i}\mid C_{k})} where the evidence Z = p ( x ) = ∑ k p ( C k ) p ( x ∣ C k ) {\displaystyle Z=p(\mathbf {x} )=\sum _{k}p(C_{k})\ p(\mathbf {x} \mid C_{k})} is a scaling factor dependent only on x 1 , … , x n {\displaystyle x_{1},\ldots ,x_{n}} , that is, a constant if the values of the feature variables are known. Often, it is only necessary to discriminate between classes. In that case, the scaling factor is irrelevant, and it is sufficient to calculate the log-probability up to a factor: ln ⁡ p ( C k ∣ x 1 , … , x n ) = ln ⁡ p ( C k ) + ∑ i = 1 n ln ⁡ p ( x i ∣ C k ) − ln ⁡ Z ⏟ irrelevant {\displaystyle \ln p(C_{k}\mid x_{1},\ldots ,x_{n})=\ln p(C_{k})+\sum _{i=1}^{n}\ln p(x_{i}\mid C_{k})\underbrace {-\ln Z} _{\text{irrelevant}}} The scaling factor is irrelevant, since discrimination subtracts it away: ln ⁡ p ( C k ∣ x 1 , … , x n ) p ( C l ∣ x 1 , … , x n ) = ( ln ⁡ p ( C k ) + ∑ i = 1 n ln ⁡ p ( x i ∣ C k ) ) − ( ln ⁡ p ( C l ) + ∑ i = 1 n ln ⁡ p ( x i ∣ C l ) ) {\displaystyle \ln {\frac {p(C_{k}\mid x_{1},\ldots ,x_{n})}{p(C_{l}\mid x_{1},\ldots ,x_{n})}}=\left(\ln p(C_{k})+\sum _{i=1}^{n}\ln p(x_{i}\mid C_{k})\right)-\left(\ln p(C_{l})+\sum _{i=1}^{n}\ln p(x_{i}\mid C_{l})\right)} There are two benefits of using log-probability. One is that it allows an interpretation in information theory, where log-probabilities are units of information in nats. Another is that it avoids arithmetic underflow. === Constructing a classifier from the probability model === The discussion so far has derived the independent feature model, that is, the naive Bayes probability model. The naive Bayes classifier combines this model with a decision rule. One common rule is to pick the hypothesis that is most probable so as to minimize the probability of misclassification; this is known as the maximum a posteriori or MAP decision rule. The corresponding classifier, a Bayes classifier, is the function that assigns a class label y ^ = C k {\displaystyle {\hat {y}}=C_{k}} for some k as follows: y ^ = argmax k ∈ { 1 , … , K } p ( C k ) ∏ i = 1 n p ( x i ∣ C k ) . {\displaystyle {\hat {y}}={\underset {k\in \{1,\ldots ,K\}}{\operatorname {argmax} }}\ p(C_{k})\displays

    Read more →
  • Radioplayer

    Radioplayer

    Radioplayer is a radio technology platform, owned by UK radio broadcasters and operated under licence in some other countries. It operates an internet radio web tuner, a set of mobile phone apps, an in-car adaptor, and a growing range of integrations with other connected devices and platforms. Radioplayer is operated by UK Radioplayer Ltd which is a not-for-profit organisation owned by UK radio broadcasters. Initial shareholders were the BBC, Global Radio, GMG Radio, Absolute Radio and RadioCentre. After consolidation in the radio market, current shareholders are the BBC, Global Radio, Bauer Media Group and RadioCentre. == History == Launched in the UK on 31 March 2011, Radioplayer set out to offer a simple and accessible way to listen to radio via the internet. It contained 157 stations at launch. Initially working internally at the BBC for Tim Davie, then Director of BBC Audio & Music, Michael Hill led the project since March 2009; he was made Managing Director of UK Radioplayer Ltd on 28 July 2010. At launch, Radioplayer was a simple and straightforward Flash-based radio player, linked-to by radio stations on their own website. The player included searching and bookmarking across all of UK radio station content. On 5 October 2012, Radioplayer launched a mobile app on iOS phones with an Android version following shortly afterwards. The apps are unavailable for download outside the United Kingdom. This was followed by a tablet app on 25 September 2013. The apps also support Android Wear, Android Auto, Smart Device Link, Apple Watch and Apple CarPlay. They are also compatible with Chromecast and Airplay. In September 2016, Radioplayer announced it had been chosen by Amazon to integrate with their new voice-controlled 'Echo' device, ahead of its UK launch. In July 2017, Radioplayer integrated with the Sonos and Bose multi-room speaker platforms. UK Radioplayer currently contains around 500 UK stations, from Ofcom-licensed broadcasters. Online-only 'sister-stations' can also be added, but only by broadcasters with Ofcom licences which have been on the platform for over a year. == Radioplayer Car == Radioplayer Car was announced in September 2014 as a hybrid radio receiver that switches between FM, DAB and streaming to find the strongest signal. Speaking in Oslo in June 2015, Michael Hill said that he hoped to launch the product in the UK and Norway during the summer of 2015. In February 2017, Radioplayer Car was launched. It was marketed as the world’s first voice-controlled hybrid radio adaptor for car stereos. A small box, fitted behind the dashboard, links to the auxiliary input on an existing car radio. It connects wirelessly via Bluetooth to the driver’s smartphone by an app. The adaptor enabled drivers to listen to their own smartphone music collections using Bluetooth, take hands-free calls, listen to inbound text messages and receive instant audio travel news, customised by GPS to their location and direction of travel. The hardware was manufactured under licence by car audio interfaces supplier Connects2, and Hyde Park Corner was promoted as the preferred installer of the audio equipment. There were several spin-off benefits of the Radioplayer Car project, including the creation of the hybrid radio metadata API for cars, known as the 'WRAPI' (Worldwide Radioplayer API). == International == Through a separate company called Radioplayer Worldwide, Radioplayer technology is licensed to a number of different territories.

    Read more →
  • Grammatical evolution

    Grammatical evolution

    Grammatical evolution (GE) is a genetic programming (GP) technique (or approach) from evolutionary computation pioneered by Conor Ryan, JJ Collins and Michael O'Neill in 1998 at the BDS Group in the University of Limerick. As in any other GP approach, the objective is to find an executable program, program fragment, or function, which will achieve a good fitness value for a given objective function. In most published work on GP, a LISP-style tree-structured expression is directly manipulated, whereas GE applies genetic operators to an integer string, subsequently mapped to a program (or similar) through the use of a grammar, which is typically expressed in Backus–Naur form. One of the benefits of GE is that this mapping simplifies the application of search to different programming languages and other structures. == Problem addressed == In type-free, conventional Koza-style GP, the function set must meet the requirement of closure: all functions must be capable of accepting as their arguments the output of all other functions in the function set. Usually, this is implemented by dealing with a single data-type such as double-precision floating point. While modern Genetic Programming frameworks support typing, such type-systems have limitations that Grammatical Evolution does not suffer from. == GE's solution == GE offers a solution to the single-type limitation by evolving solutions according to a user-specified grammar (usually a grammar in Backus-Naur form). Therefore, the search space can be restricted, and domain knowledge of the problem can be incorporated. The inspiration for this approach comes from a desire to separate the "genotype" from the "phenotype": in GP, the objects the search algorithm operates on and what the fitness evaluation function interprets are one and the same. In contrast, GE's "genotypes" are ordered lists of integers which code for selecting rules from the provided context-free grammar. The phenotype, however, is the same as in Koza-style GP: a tree-like structure that is evaluated recursively. This model is more in line with how genetics work in nature, where there is a separation between an organism's genotype and the final expression of phenotype in proteins, etc. Separating genotype and phenotype allows a modular approach. In particular, the search portion of the GE paradigm needn't be carried out by any one particular algorithm or method. Observe that the objects GE performs search on are the same as those used in genetic algorithms. This means, in principle, that any existing genetic algorithm package, such as the popular GAlib, can be used to carry out the search, and a developer implementing a GE system need only worry about carrying out the mapping from list of integers to program tree. It is also in principle possible to perform the search using some other method, such as particle swarm optimization (see the remark below); the modular nature of GE creates many opportunities for hybrids as the problem of interest to be solved dictates. Brabazon and O'Neill have successfully applied GE to predicting corporate bankruptcy, forecasting stock indices, bond credit ratings, and other financial applications. GE has also been used with a classic predator-prey model to explore the impact of parameters such as predator efficiency, niche number, and random mutations on ecological stability. It is possible to structure a GE grammar that for a given function/terminal set is equivalent to genetic programming. == Criticism == Despite its successes, GE has been the subject of some criticism. One issue is that as a result of its mapping operation, GE's genetic operators do not achieve high locality which is a highly regarded property of genetic operators in evolutionary algorithms. == Variants == Although GE was originally described in terms of using an Evolutionary Algorithm, specifically, a Genetic Algorithm, other variants exist. For example, GE researchers have experimented with using particle swarm optimization to carry out the searching instead of genetic algorithms with results comparable to that of normal GE; this is referred to as a "grammatical swarm"; using only the basic PSO model it has been found that PSO is probably equally capable of carrying out the search process in GE as simple genetic algorithms are. (Although PSO is normally a floating-point search paradigm, it can be discretized, e.g., by simply rounding each vector to the nearest integer, for use with GE.) Yet another possible variation that has been experimented with in the literature is attempting to encode semantic information in the grammar in order to further bias the search process. Other work showed that, with biased grammars that leverage domain knowledge, even random search can be used to drive GE. == Related work == GE was originally a combination of the linear representation as used by the Genetic Algorithm for Developing Software (GADS) and Backus Naur Form grammars, which were originally used in tree-based GP by Wong and Leung in 1995 and Whigham in 1996. Other related work noted in the original GE paper was that of Frederic Gruau, who used a conceptually similar "embryonic" approach, as well as that of Keller and Banzhaf, which similarly used linear genomes. == Implementations == There are several implementations of GE. These include the following.

    Read more →
  • Nearest neighbor search

    Nearest neighbor search

    Nearest neighbor search (NNS), as a form of proximity search, is the optimization problem of finding the point in a given set that is closest (or most similar) to a given point. Closeness is typically expressed in terms of a dissimilarity function: the less similar the objects, the larger the function values. Formally, the nearest neighbor (NN) search problem is defined as follows: given a set S of points in a space M and a query point q ∈ M {\displaystyle q\in M} , find the closest point in S to q. Donald Knuth in volume 3 of The Art of Computer Programming (1973) called it the post-office problem, referring to an application of assigning to a residence the nearest post office. A direct generalization of this problem is a k-NN search, where we need to find the k closest points. Most commonly M is a metric space and dissimilarity is expressed as a distance metric, which is symmetric and satisfies the triangle inequality. Even more common, M is taken to be the d-dimensional vector space where dissimilarity is measured using the Euclidean distance, Manhattan distance or other distance metric. However, the dissimilarity function can be arbitrary. One example is asymmetric Bregman divergence, for which the triangle inequality does not hold. == Applications == The nearest neighbor search problem arises in numerous fields of application, including: Pattern recognition – in particular for optical character recognition Statistical classification – see k-nearest neighbor algorithm Computer vision – for point cloud registration Computational geometry – see Closest pair of points problem Cryptanalysis – for lattice problem Databases – e.g. content-based image retrieval Coding theory – see maximum likelihood decoding Semantic search Vector databases, where nearest-neighbor lookup over embeddings is used to retrieve semantically similar records Retrieval-augmented generation systems, where nearest-neighbor retrieval over embeddings is used to fetch candidate passages or documents before generation Data compression – see MPEG-2 standard Robotic sensing Recommendation systems, e.g. see Collaborative filtering Internet marketing – see contextual advertising and behavioral targeting DNA sequencing Spell checking – suggesting correct spelling Plagiarism detection Similarity scores for predicting career paths of professional athletes. Cluster analysis – assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense, usually based on Euclidean distance Chemical similarity Sampling-based motion planning == Methods == Various solutions to the NNS problem have been proposed. The quality and usefulness of the algorithms are determined by the time complexity of queries as well as the space complexity of any search data structures that must be maintained. The informal observation usually referred to as the curse of dimensionality states that there is no general-purpose exact solution for NNS in high-dimensional Euclidean space using polynomial preprocessing and polylogarithmic search time. === Exact methods === ==== Linear search ==== The simplest solution to the NNS problem is to compute the distance from the query point to every other point in the database, keeping track of the "best so far". This algorithm, sometimes referred to as the naive approach, has a running time of O(dN), where N is the cardinality of S and d is the dimensionality of S. There are no search data structures to maintain, so the linear search has no space complexity beyond the storage of the database. Naive search can, on average, outperform space partitioning approaches on higher dimensional spaces. The absolute distance is not required for distance comparison, only the relative distance. In geometric coordinate systems the distance calculation can be sped up considerably by omitting the square root calculation from the distance calculation between two coordinates. The distance comparison will still yield identical results. ==== Space partitioning ==== Since the 1970s, the branch and bound methodology has been applied to the problem. In the case of Euclidean space, this approach encompasses spatial index or spatial access methods. Several space-partitioning methods have been developed for solving the NNS problem. Perhaps the simplest is the k-d tree, which iteratively bisects the search space into two regions containing half of the points of the parent region. Queries are performed via traversal of the tree from the root to a leaf by evaluating the query point at each split. Depending on the distance specified in the query, neighboring branches that might contain hits may also need to be evaluated. For constant dimension query time, average complexity is O(log N) in the case of randomly distributed points, worst case complexity is O(kN^(1-1/k)) Alternatively the R-tree data structure was designed to support nearest neighbor search in dynamic context, as it has efficient algorithms for insertions and deletions such as the R tree. R-trees can yield nearest neighbors not only for Euclidean distance, but can also be used with other distances. In the case of general metric space, the branch-and-bound approach is known as the metric tree approach. Particular examples include vp-tree and BK-tree methods. Using a set of points taken from a 3-dimensional space and put into a BSP tree, and given a query point taken from the same space, a possible solution to the problem of finding the nearest point-cloud point to the query point is given in the following description of an algorithm. (Strictly speaking, no such point may exist, because it may not be unique. But in practice, usually we only care about finding any one of the subset of all point-cloud points that exist at the shortest distance to a given query point.) The idea is, for each branching of the tree, guess that the closest point in the cloud resides in the half-space containing the query point. This may not be the case, but it is a good heuristic. After having recursively gone through all the trouble of solving the problem for the guessed half-space, now compare the distance returned by this result with the shortest distance from the query point to the partitioning plane. This latter distance is that between the query point and the closest possible point that could exist in the half-space not searched. If this distance is greater than that returned in the earlier result, then clearly there is no need to search the other half-space. If there is such a need, then you must go through the trouble of solving the problem for the other half space, and then compare its result to the former result, and then return the proper result. The performance of this algorithm is nearer to logarithmic time than linear time when the query point is near the cloud, because as the distance between the query point and the closest point-cloud point nears zero, the algorithm needs only perform a look-up using the query point as a key to get the correct result. === Approximation methods === An approximate nearest neighbor search algorithm is allowed to return points whose distance from the query is at most c {\displaystyle c} times the distance from the query to its nearest points. The appeal of this approach is that, in many cases, an approximate nearest neighbor is almost as good as the exact one. In particular, if the distance measure accurately captures the notion of user quality, then small differences in the distance should not matter. ==== Greedy search in proximity neighborhood graphs ==== Proximity graph methods (such as navigable small world graphs and HNSW) are considered the current state-of-the-art for the approximate nearest neighbors search. The methods are based on greedy traversing in proximity neighborhood graphs G ( V , E ) {\displaystyle G(V,E)} in which every point x i ∈ S {\displaystyle x_{i}\in S} is uniquely associated with vertex v i ∈ V {\displaystyle v_{i}\in V} . The search for the nearest neighbors to a query q in the set S takes the form of searching for the vertex in the graph G ( V , E ) {\displaystyle G(V,E)} . The basic algorithm – greedy search – works as follows: search starts from an enter-point vertex v i ∈ V {\displaystyle v_{i}\in V} by computing the distances from the query q to each vertex of its neighborhood { v j : ( v i , v j ) ∈ E } {\displaystyle \{v_{j}:(v_{i},v_{j})\in E\}} , and then finds a vertex with the minimal distance value. If the distance value between the query and the selected vertex is smaller than the one between the query and the current element, then the algorithm moves to the selected vertex, and it becomes new enter-point. The algorithm stops when it reaches a local minimum: a vertex whose neighborhood does not contain a vertex that is closer to the query than the vertex itself. The idea of proximity neighborhood graphs was exploited in multiple publications, including the seminal paper by Arya and Mount, in the VoroNet syst

    Read more →