AI Headshot Improver

AI Headshot Improver — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Alexis Spectral Data

    Alexis Spectral Data

    Alexis Spectral Data is a software developed for colour matching processes that calculates from available spectral data the colour numbers used by computers to display colours on screen. It displays the colour for each spectral reflectance curve and records the calculated trichromatic values and colour numbers along with the spectral curves. This eliminates the need to scan the samples separately with a truecolour Scanner while creating the database. The spectral data can be introduced manually as a series of reflectance values at wavelengths measured in different standard illuminants with an arbitrary but fixed increment that must be kept for each spectral curve throughout the creation of the whole database. Therefore, older UV-VIS Spectrophotometers that can't be interfaced with computers can also be used for creating the database needed for colour matching. Alexis Spectral Data determines the whiteness degree in a less time-consuming method, which permits storage and easier handling of the obtained data. Alexis Spectral Data can export the trichromatic values, calculated from the spectral curves, to Alexis Analyser, software that handles only trichromatic data. The earliest information about the development of this software comes from a paper published by a student at the University Politehnica Bucharest in 1993. The software runs on Windows based computers but not on other operating systems.

    Read more →
  • Influence diagram

    Influence diagram

    An influence diagram (ID) (also called a relevance diagram, decision diagram or a decision network) is a compact graphical and mathematical representation of a decision situation. It is a generalization of a Bayesian network, in which not only probabilistic inference problems but also decision making problems (following the maximum expected utility criterion) can be modeled and solved. ID was first developed in the mid-1970s by decision analysts with an intuitive semantic that is easy to understand. It is now adopted widely and becoming an alternative to the decision tree which typically suffers from exponential growth in number of branches with each variable modeled. ID is directly applicable in team decision analysis, since it allows incomplete sharing of information among team members to be modeled and solved explicitly. Extensions of ID also find their use in game theory as an alternative representation of the game tree. == Semantics == An ID is a directed acyclic graph with three types (plus one subtype) of node and three types of arc (or arrow) between nodes. Nodes: Decision node (corresponding to each decision to be made) is drawn as a rectangle. Uncertainty node (corresponding to each uncertainty to be modeled) is drawn as an oval. Deterministic node (corresponding to special kind of uncertainty that its outcome is deterministically known whenever the outcome of some other uncertainties are also known) is drawn as a double oval. Value node (corresponding to each component of additively separable Von Neumann-Morgenstern utility function) is drawn as an octagon (or diamond). Arcs: Functional arcs (ending in value node) indicate that one of the components of additively separable utility function is a function of all the nodes at their tails. Conditional arcs (ending in uncertainty node) indicate that the uncertainty at their heads is probabilistically conditioned on all the nodes at their tails. Conditional arcs (ending in deterministic node) indicate that the uncertainty at their heads is deterministically conditioned on all the nodes at their tails. Informational arcs (ending in decision node) indicate that the decision at their heads is made with the outcome of all the nodes at their tails known beforehand. Given a properly structured ID: Decision nodes and incoming information arcs collectively state the alternatives (what can be done when the outcome of certain decisions and/or uncertainties are known beforehand) Uncertainty/deterministic nodes and incoming conditional arcs collectively model the information (what are known and their probabilistic/deterministic relationships) Value nodes and incoming functional arcs collectively quantify the preference (how things are preferred over one another). Alternative, information, and preference are termed decision basis in decision analysis, they represent three required components of any valid decision situation. Formally, the semantic of influence diagram is based on sequential construction of nodes and arcs, which implies a specification of all conditional independencies in the diagram. The specification is defined by the d {\displaystyle d} -separation criterion of Bayesian network. According to this semantic, every node is probabilistically independent on its non-successor nodes given the outcome of its immediate predecessor nodes. Likewise, a missing arc between non-value node X {\displaystyle X} and non-value node Y {\displaystyle Y} implies that there exists a set of non-value nodes Z {\displaystyle Z} , e.g., the parents of Y {\displaystyle Y} , that renders Y {\displaystyle Y} independent of X {\displaystyle X} given the outcome of the nodes in Z {\displaystyle Z} . == Example == Consider the simple influence diagram representing a situation where a decision-maker is planning their vacation. There is 1 decision node (Vacation Activity), 2 uncertainty nodes (Weather Condition, Weather Forecast), and 1 value node (Satisfaction). There are 2 functional arcs (ending in Satisfaction), 1 conditional arc (ending in Weather Forecast), and 1 informational arc (ending in Vacation Activity). Functional arcs ending in Satisfaction indicate that Satisfaction is a utility function of Weather Condition and Vacation Activity. In other words, their satisfaction can be quantified if they know what the weather is like and what their choice of activity is. (Note that they do not value Weather Forecast directly) Conditional arc ending in Weather Forecast indicates their belief that Weather Forecast and Weather Condition can be dependent. Informational arc ending in Vacation Activity indicates that they will only know Weather Forecast, not Weather Condition, when making their choice. In other words, actual weather will be known after they make their choice, and only forecast is what they can count on at this stage. It also follows semantically, for example, that Vacation Activity is independent on (irrelevant to) Weather Condition given Weather Forecast is known. == Applicability to value of information == The above example highlights the power of the influence diagram in representing an extremely important concept in decision analysis known as the value of information. Consider the following three scenarios; Scenario 1: The decision-maker could make their Vacation Activity decision while knowing what Weather Condition will be like. This corresponds to adding extra informational arc from Weather Condition to Vacation Activity in the above influence diagram. Scenario 2: The original influence diagram as shown above. Scenario 3: The decision-maker makes their decision without even knowing the Weather Forecast. This corresponds to removing informational arc from Weather Forecast to Vacation Activity in the above influence diagram. Scenario 1 is the best possible scenario for this decision situation since there is no longer any uncertainty on what they care about (Weather Condition) when making their decision. Scenario 3, however, is the worst possible scenario for this decision situation since they need to make their decision without any hint (Weather Forecast) on what they care about (Weather Condition) will turn out to be. The decision-maker is usually better off (definitely no worse off, on average) to move from scenario 3 to scenario 2 through the acquisition of new information. The most they should be willing to pay for such move is called the value of information on Weather Forecast, which is essentially the value of imperfect information on Weather Condition. The applicability of this simple ID and the value of information concept is tremendous, especially in medical decision making when most decisions have to be made with imperfect information about their patients, diseases, etc. == Related concepts == Influence diagrams are hierarchical and can be defined either in terms of their structure or in greater detail in terms of the functional and numerical relation between diagram elements. An ID that is consistently defined at all levels—structure, function, and number—is a well-defined mathematical representation and is referred to as a well-formed influence diagram (WFID). WFIDs can be evaluated using reversal and removal operations to yield answers to a large class of probabilistic, inferential, and decision questions. More recent techniques have been developed by artificial intelligence researchers concerning Bayesian network inference (belief propagation). An influence diagram having only uncertainty nodes (i.e., a Bayesian network) is also called a relevance diagram. An arc connecting node A to B implies not only that "A is relevant to B", but also that "B is relevant to A" (i.e., relevance is a symmetric relationship).

    Read more →
  • Radial basis function network

    Radial basis function network

    In the field of mathematical modeling, a radial basis function network is an artificial neural network that uses radial basis functions as activation functions. The output of the network is a linear combination of radial basis functions of the inputs and neuron parameters. Radial basis function networks have many uses, including function approximation, time series prediction, classification, and system control. They were first formulated in a 1988 paper by Broomhead and Lowe, both researchers at the Royal Signals and Radar Establishment. == Network architecture == Radial basis function (RBF) networks typically have three layers: an input layer, a hidden layer with a non-linear RBF activation function and a linear output layer. The input can be modeled as a vector of real numbers x ∈ R n {\displaystyle \mathbf {x} \in \mathbb {R} ^{n}} . The output of the network is then a scalar function of the input vector, φ : R n → R {\displaystyle \varphi :\mathbb {R} ^{n}\to \mathbb {R} } , and is given by φ ( x ) = ∑ i = 1 N a i ρ ( | | x − c i | | ) {\displaystyle \varphi (\mathbf {x} )=\sum _{i=1}^{N}a_{i}\rho (||\mathbf {x} -\mathbf {c} _{i}||)} where N {\displaystyle N} is the number of neurons in the hidden layer, c i {\displaystyle \mathbf {c} _{i}} is the center vector for neuron i {\displaystyle i} , and a i {\displaystyle a_{i}} is the weight of neuron i {\displaystyle i} in the linear output neuron. Functions that depend only on the distance from a center vector are radially symmetric about that vector, hence the name radial basis function. In the basic form, all inputs are connected to each hidden neuron. The norm is typically taken to be the Euclidean distance (although the Mahalanobis distance appears to perform better with pattern recognition) and the radial basis function is commonly taken to be Gaussian ρ ( ‖ x − c i ‖ ) = exp ⁡ [ − β i ‖ x − c i ‖ 2 ] {\displaystyle \rho {\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}=\exp \left[-\beta _{i}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert ^{2}\right]} . The Gaussian basis functions are local to the center vector in the sense that lim | | x | | → ∞ ρ ( ‖ x − c i ‖ ) = 0 {\displaystyle \lim _{||x||\to \infty }\rho (\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert )=0} i.e. changing parameters of one neuron has only a small effect for input values that are far away from the center of that neuron. Given certain mild conditions on the shape of the activation function, RBF networks are universal approximators on a compact subset of R n {\displaystyle \mathbb {R} ^{n}} . This means that an RBF network with enough hidden neurons can approximate any continuous function on a closed, bounded set with arbitrary precision. The parameters a i {\displaystyle a_{i}} , c i {\displaystyle \mathbf {c} _{i}} , and β i {\displaystyle \beta _{i}} are determined in a manner that optimizes the fit between φ {\displaystyle \varphi } and the data. === Normalization === ==== Normalized architecture ==== In addition to the above unnormalized architecture, RBF networks can be normalized. In this case the mapping is φ ( x ) = d e f ∑ i = 1 N a i ρ ( ‖ x − c i ‖ ) ∑ i = 1 N ρ ( ‖ x − c i ‖ ) = ∑ i = 1 N a i u ( ‖ x − c i ‖ ) {\displaystyle \varphi (\mathbf {x} )\ {\stackrel {\mathrm {def} }{=}}\ {\frac {\sum _{i=1}^{N}a_{i}\rho {\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}}{\sum _{i=1}^{N}\rho {\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}}}=\sum _{i=1}^{N}a_{i}u{\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}} where u ( ‖ x − c i ‖ ) = d e f ρ ( ‖ x − c i ‖ ) ∑ j = 1 N ρ ( ‖ x − c j ‖ ) {\displaystyle u{\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}\ {\stackrel {\mathrm {def} }{=}}\ {\frac {\rho {\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}}{\sum _{j=1}^{N}\rho {\big (}\left\Vert \mathbf {x} -\mathbf {c} _{j}\right\Vert {\big )}}}} is known as a normalized radial basis function. ==== Theoretical motivation for normalization ==== There is theoretical justification for this architecture in the case of stochastic data flow. Assume a stochastic kernel approximation for the joint probability density P ( x ∧ y ) = 1 N ∑ i = 1 N ρ ( ‖ x − c i ‖ ) σ ( | y − e i | ) {\displaystyle P\left(\mathbf {x} \land y\right)={1 \over N}\sum _{i=1}^{N}\,\rho {\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}\,\sigma {\big (}\left\vert y-e_{i}\right\vert {\big )}} where the weights c i {\displaystyle \mathbf {c} _{i}} and e i {\displaystyle e_{i}} are exemplars from the data and we require the kernels to be normalized ∫ ρ ( ‖ x − c i ‖ ) d n x = 1 {\displaystyle \int \rho {\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}\,d^{n}\mathbf {x} =1} and ∫ σ ( | y − e i | ) d y = 1 {\displaystyle \int \sigma {\big (}\left\vert y-e_{i}\right\vert {\big )}\,dy=1} . The probability densities in the input and output spaces are P ( x ) = ∫ P ( x ∧ y ) d y = 1 N ∑ i = 1 N ρ ( ‖ x − c i ‖ ) {\displaystyle P\left(\mathbf {x} \right)=\int P\left(\mathbf {x} \land y\right)\,dy={1 \over N}\sum _{i=1}^{N}\,\rho {\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}} and The expectation of y given an input x {\displaystyle \mathbf {x} } is φ ( x ) = d e f E ( y ∣ x ) = ∫ y P ( y ∣ x ) d y {\displaystyle \varphi \left(\mathbf {x} \right)\ {\stackrel {\mathrm {def} }{=}}\ E\left(y\mid \mathbf {x} \right)=\int y\,P\left(y\mid \mathbf {x} \right)dy} where P ( y ∣ x ) {\displaystyle P\left(y\mid \mathbf {x} \right)} is the conditional probability of y given x {\displaystyle \mathbf {x} } . The conditional probability is related to the joint probability through Bayes' theorem P ( y ∣ x ) = P ( x ∧ y ) P ( x ) {\displaystyle P\left(y\mid \mathbf {x} \right)={\frac {P\left(\mathbf {x} \land y\right)}{P\left(\mathbf {x} \right)}}} which yields φ ( x ) = ∫ y P ( x ∧ y ) P ( x ) d y {\displaystyle \varphi \left(\mathbf {x} \right)=\int y\,{\frac {P\left(\mathbf {x} \land y\right)}{P\left(\mathbf {x} \right)}}\,dy} . This becomes φ ( x ) = ∑ i = 1 N e i ρ ( ‖ x − c i ‖ ) ∑ i = 1 N ρ ( ‖ x − c i ‖ ) = ∑ i = 1 N e i u ( ‖ x − c i ‖ ) {\displaystyle \varphi \left(\mathbf {x} \right)={\frac {\sum _{i=1}^{N}e_{i}\rho {\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}}{\sum _{i=1}^{N}\rho {\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}}}=\sum _{i=1}^{N}e_{i}u{\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}} when the integrations are performed. === Local linear models === It is sometimes convenient to expand the architecture to include local linear models. In that case the architectures become, to first order, φ ( x ) = ∑ i = 1 N ( a i + b i ⋅ ( x − c i ) ) ρ ( ‖ x − c i ‖ ) {\displaystyle \varphi \left(\mathbf {x} \right)=\sum _{i=1}^{N}\left(a_{i}+\mathbf {b} _{i}\cdot \left(\mathbf {x} -\mathbf {c} _{i}\right)\right)\rho {\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}} and φ ( x ) = ∑ i = 1 N ( a i + b i ⋅ ( x − c i ) ) u ( ‖ x − c i ‖ ) {\displaystyle \varphi \left(\mathbf {x} \right)=\sum _{i=1}^{N}\left(a_{i}+\mathbf {b} _{i}\cdot \left(\mathbf {x} -\mathbf {c} _{i}\right)\right)u{\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )}} in the unnormalized and normalized cases, respectively. Here b i {\displaystyle \mathbf {b} _{i}} are weights to be determined. Higher order linear terms are also possible. This result can be written φ ( x ) = ∑ i = 1 2 N ∑ j = 1 n e i j v i j ( x − c i ) {\displaystyle \varphi \left(\mathbf {x} \right)=\sum _{i=1}^{2N}\sum _{j=1}^{n}e_{ij}v_{ij}{\big (}\mathbf {x} -\mathbf {c} _{i}{\big )}} where e i j = { a i , if i ∈ [ 1 , N ] b i j , if i ∈ [ N + 1 , 2 N ] {\displaystyle e_{ij}={\begin{cases}a_{i},&{\mbox{if }}i\in [1,N]\\b_{ij},&{\mbox{if }}i\in [N+1,2N]\end{cases}}} and v i j ( x − c i ) = d e f { δ i j ρ ( ‖ x − c i ‖ ) , if i ∈ [ 1 , N ] ( x i j − c i j ) ρ ( ‖ x − c i ‖ ) , if i ∈ [ N + 1 , 2 N ] {\displaystyle v_{ij}{\big (}\mathbf {x} -\mathbf {c} _{i}{\big )}\ {\stackrel {\mathrm {def} }{=}}\ {\begin{cases}\delta _{ij}\rho {\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )},&{\mbox{if }}i\in [1,N]\\\left(x_{ij}-c_{ij}\right)\rho {\big (}\left\Vert \mathbf {x} -\mathbf {c} _{i}\right\Vert {\big )},&{\mbox{if }}i\in [N+1,2N]\end{cases}}} in the unnormalized case and in the normalized case. Here δ i j {\displaystyle \delta _{ij}} is a Kronecker delta function defined as δ i j = { 1 , if i = j 0 , if i ≠ j {\displaystyle \delta _{ij}={\begin{cases}1,&{\mbox{if }}i=j\\0,&{\mbox{if }}i\neq j\end{cases}}} . == Training == RBF networks are typically trained from pairs of input and target values x ( t ) , y ( t ) {\displaystyle \mathbf {x} (t),y(t)} , t = 1 , … , T {\displaystyle t=1,\dots ,T} by a two-step algorithm. In the first step, the center vectors c i {\displaystyle \mathbf {c} _{i}} of the RBF functions in the hidden layer

    Read more →
  • Principal component analysis

    Principal component analysis

    Principal component analysis (PCA) is a linear dimensionality reduction technique with applications in exploratory data analysis, visualization and data preprocessing. The data are linearly transformed onto a new coordinate system such that the directions (principal components) capturing the largest variation in the data can be easily identified. The principal components of a collection of points in a real coordinate space are a sequence of p {\displaystyle p} unit vectors, where the i {\displaystyle i} -th vector is the direction of a line that best fits the data while being orthogonal to the first i − 1 {\displaystyle i-1} vectors. Here, a best-fitting line is defined as one that minimizes the average squared perpendicular distance from the points to the line. These directions (i.e., principal components) constitute an orthonormal basis in which different individual dimensions of the data are linearly uncorrelated. Many studies use the first two principal components in order to plot the data in two dimensions and to visually identify clusters of closely related data points. Principal component analysis has applications in many fields such as population genetics, microbiome studies, and atmospheric science. == Overview == When performing PCA, the first principal component of a set of p {\displaystyle p} variables is the derived variable formed as a linear combination of the original variables that explains the most variance. The second principal component explains the most variance in what is left once the effect of the first component is removed, and we may proceed through p {\displaystyle p} iterations until all the variance is explained. PCA is most commonly used when many of the variables are highly correlated with each other and it is desirable to reduce their number to an independent set. The first principal component can equivalently be defined as a direction that maximizes the variance of the projected data. The i {\displaystyle i} -th principal component can be taken as a direction orthogonal to the first i − 1 {\displaystyle i-1} principal components that maximizes the variance of the projected data. For either objective, it can be shown that the principal components are eigenvectors of the data's covariance matrix. Thus, the principal components are often computed by eigendecomposition of the data covariance matrix or singular value decomposition of the data matrix. PCA is the simplest of the true eigenvector-based multivariate analyses and is closely related to factor analysis. Factor analysis typically incorporates more domain-specific assumptions about the underlying structure and solves eigenvectors of a slightly different matrix. PCA is also related to canonical correlation analysis (CCA). CCA defines coordinate systems that optimally describe the cross-covariance between two datasets while PCA defines a new orthogonal coordinate system that optimally describes variance in a single dataset. Robust and L1-norm-based variants of standard PCA have also been proposed. == History == PCA was invented in 1901 by Karl Pearson, as an analogue of the principal axis theorem in mechanics; it was later independently developed and named by Harold Hotelling in the 1930s. Depending on the field of application, it is also named the discrete Karhunen–Loève transform (KLT) in signal processing, the Hotelling transform in multivariate quality control, proper orthogonal decomposition (POD) in mechanical engineering, singular value decomposition (SVD) of X (invented in the last quarter of the 19th century), eigenvalue decomposition (EVD) of XTX in linear algebra, factor analysis (for a discussion of the differences between PCA and factor analysis see Ch. 7 of Jolliffe's Principal Component Analysis), Eckart–Young theorem (Harman, 1960), or empirical orthogonal functions (EOF) in meteorological science (Lorenz, 1956), empirical eigenfunction decomposition (Sirovich, 1987), quasiharmonic modes (Brooks et al., 1988), spectral decomposition in noise and vibration, and empirical modal analysis in structural dynamics. == Intuition == PCA can be thought of as fitting a p-dimensional ellipsoid to the data, where each axis of the ellipsoid represents a principal component. If some axis of the ellipsoid is small, then the variance along that axis is also small. To find the axes of the ellipsoid, we must first center the values of each variable in the dataset on 0 by subtracting the mean of the variable's observed values from each of those values. These transformed values are used instead of the original observed values for each of the variables. Then, we compute the covariance matrix of the data and calculate the eigenvalues and corresponding eigenvectors of this covariance matrix. Then we must normalize each of the orthogonal eigenvectors to turn them into unit vectors. Once this is done, each of the mutually-orthogonal unit eigenvectors can be interpreted as an axis of the ellipsoid fitted to the data. This choice of basis will transform the covariance matrix into a diagonalized form, in which the diagonal elements represent the variance of each axis. The proportion of the variance that each eigenvector represents can be calculated by dividing the eigenvalue corresponding to that eigenvector by the sum of all eigenvalues. Biplots and scree plots (degree of explained variance) are used to interpret findings of the PCA. == Details == PCA is defined as an orthogonal linear transformation on a real inner product space that transforms the data to a new coordinate system such that the greatest variance by some scalar projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. Consider an n × p {\displaystyle n\times p} data matrix, X, with column-wise zero empirical mean (the sample mean of each column has been shifted to zero), where each of the n rows represents a different repetition of the experiment, and each of the p columns gives a particular kind of feature (say, the results from a particular sensor). Mathematically, the transformation is defined by a set of size l {\displaystyle l} (where l {\displaystyle l} is usually selected to be strictly less than p {\displaystyle p} to reduce dimensionality) of p {\displaystyle p} -dimensional vectors of weights or coefficients w ( k ) = ( w 1 , … , w p ) ( k ) {\displaystyle \mathbf {w} _{(k)}=(w_{1},\dots ,w_{p})_{(k)}} that map each row vector x ( i ) = ( x 1 , … , x p ) ( i ) {\displaystyle \mathbf {x} _{(i)}=(x_{1},\dots ,x_{p})_{(i)}} of X to a new vector of principal component scores t ( i ) = ( t 1 , … , t l ) ( i ) {\displaystyle \mathbf {t} _{(i)}=(t_{1},\dots ,t_{l})_{(i)}} , given by t k ( i ) = x ( i ) ⋅ w ( k ) f o r i = 1 , … , n k = 1 , … , l {\displaystyle {t_{k}}_{(i)}=\mathbf {x} _{(i)}\cdot \mathbf {w} _{(k)}\qquad \mathrm {for} \qquad i=1,\dots ,n\qquad k=1,\dots ,l} in such a way that the individual variables t 1 , … , t l {\displaystyle t_{1},\dots ,t_{l}} of t considered over the data set successively inherit the maximum possible variance from X, with each coefficient vector w constrained to be a unit vector. The above may equivalently be written in matrix form as T = X W {\displaystyle \mathbf {T} =\mathbf {X} \mathbf {W} } where T i k = t k ( i ) {\displaystyle {\mathbf {T} }_{ik}={t_{k}}_{(i)}} , X i j = x j ( i ) {\displaystyle {\mathbf {X} }_{ij}={x_{j}}_{(i)}} , and W j k = w j ( k ) {\displaystyle {\mathbf {W} }_{jk}={w_{j}}_{(k)}} . === First component === In order to maximize variance, the first weight vector w(1) thus has to satisfy w ( 1 ) = arg ⁡ max ‖ w ‖ = 1 { ∑ i ( t 1 ) ( i ) 2 } = arg ⁡ max ‖ w ‖ = 1 { ∑ i ( x ( i ) ⋅ w ) 2 } {\displaystyle \mathbf {w} _{(1)}=\arg \max _{\Vert \mathbf {w} \Vert =1}\,\left\{\sum _{i}(t_{1})_{(i)}^{2}\right\}=\arg \max _{\Vert \mathbf {w} \Vert =1}\,\left\{\sum _{i}\left(\mathbf {x} _{(i)}\cdot \mathbf {w} \right)^{2}\right\}} Equivalently, writing this in matrix form gives w ( 1 ) = arg ⁡ max ‖ w ‖ = 1 { ‖ X w ‖ 2 } = arg ⁡ max ‖ w ‖ = 1 { w T X T X w } {\displaystyle \mathbf {w} _{(1)}=\arg \max _{\left\|\mathbf {w} \right\|=1}\left\{\left\|\mathbf {Xw} \right\|^{2}\right\}=\arg \max _{\left\|\mathbf {w} \right\|=1}\left\{\mathbf {w} ^{\mathsf {T}}\mathbf {X} ^{\mathsf {T}}\mathbf {Xw} \right\}} Since w(1) has been defined to be a unit vector, it equivalently also satisfies w ( 1 ) = arg ⁡ max { w T X T X w w T w } {\displaystyle \mathbf {w} _{(1)}=\arg \max \left\{{\frac {\mathbf {w} ^{\mathsf {T}}\mathbf {X} ^{\mathsf {T}}\mathbf {Xw} }{\mathbf {w} ^{\mathsf {T}}\mathbf {w} }}\right\}} The quantity to be maximised can be recognised as a Rayleigh quotient. A standard result for a positive semidefinite matrix such as XTX is that the quotient's maximum possible value is the largest eigenvalue of the matrix, which occurs when w is the corresponding eigenvector. With w(1) found, the first principal component of a data vector

    Read more →
  • Inferential theory of learning

    Inferential theory of learning

    Inferential Theory of Learning (ITL) is an area of machine learning which describes inferential processes performed by learning agents. ITL has been continuously developed by Ryszard S. Michalski, starting in the 1980s. The first known publication of ITL was in 1983. In the ITL learning process is viewed as a search (inference) through hypotheses space guided by a specific goal. The results of learning need to be stored. Stored information will later be used by the learner for future inferences. Inferences are split into multiple categories including conclusive, deduction, and induction. In order for an inference to be considered complete it was required that all categories must be taken into account. This is how the ITL varies from other machine learning theories like Computational Learning Theory and Statistical Learning Theory; which both use singular forms of inference. == Usage == The most relevant published usage of ITL was in scientific journal published in 2012 and used ITL as a way to describe how agent-based learning works. According to the journal "The Inferential Theory of Learning (ITL) provides an elegant way of describing learning processes by agents".

    Read more →
  • Silhouette (clustering)

    Silhouette (clustering)

    Silhouette is a method of interpretation and validation of consistency within clusters of data. The technique provides a succinct graphical representation of how well each object has been classified. It was proposed by Belgian statistician Peter Rousseeuw in 1987. The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette value ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters. A clustering with an average silhouette width of over 0.7 is considered to be "strong", a value over 0.5 "reasonable", and over 0.25 "weak". However, with an increasing dimensionality of the data, it becomes difficult to achieve such high values because of the curse of dimensionality, as the distances become more similar. The silhouette score is specialized for measuring cluster quality when the clusters are convex-shaped, and may not perform well if the data clusters have irregular shapes or are of varying sizes. The silhouette value can be calculated with any distance metric, such as Euclidean distance or Manhattan distance. == Definition == Assume the data have been clustered via any technique, such as k-medoids or k-means, into k {\displaystyle k} clusters. For data point i ∈ C i {\displaystyle i\in C_{i}} (data point i {\displaystyle i} in the cluster C i {\displaystyle C_{i}} ), calculate a ( i ) {\displaystyle a(i)} , the average distance that i {\displaystyle i} is from all other points in that cluster: a ( i ) = 1 | C i | − 1 ∑ j ∈ C i , i ≠ j d ( i , j ) {\displaystyle a(i)={\frac {1}{|C_{i}|-1}}\sum _{j\in C_{i},i\neq j}d(i,j)} where | C i | {\displaystyle |C_{i}|} is the number of points belonging to cluster C i {\displaystyle C_{i}} , and d ( i , j ) {\displaystyle d(i,j)} is the distance between data points i {\displaystyle i} and j {\displaystyle j} in the cluster C i {\displaystyle C_{i}} (we divide by | C i | − 1 {\displaystyle |C_{i}|-1} because the distance d ( i , i ) {\displaystyle d(i,i)} is not included in the sum). a ( i ) {\displaystyle a(i)} can be interpreted as a measure of how well i {\displaystyle i} is assigned to its cluster (the smaller the value, the better the assignment). We then define the mean dissimilarity of point i {\displaystyle i} to some cluster C j {\displaystyle C_{j}} as the mean of the distance from i {\displaystyle i} to all points in C j {\displaystyle C_{j}} (where C j ≠ C i {\displaystyle C_{j}\neq C_{i}} ). For each data point i ∈ C i {\displaystyle i\in C_{i}} , we now define b ( i ) {\displaystyle b(i)} as the average distance between i {\displaystyle i} and the points in the closest cluster (hence: "min") that i {\displaystyle i} does not belong to: b ( i ) = min j ≠ i 1 | C j | ∑ l ∈ C j d ( i , l ) {\displaystyle b(i)=\min _{j\neq i}{\frac {1}{|C_{j}|}}\sum _{l\in C_{j}}d(i,l)} The cluster with the smallest mean dissimilarity is said to be the "neighboring cluster" of i {\displaystyle i} because it is the next best fit cluster for point i {\displaystyle i} . We now define a silhouette (value) of one data point i {\displaystyle i} s ( i ) = b ( i ) − a ( i ) max { a ( i ) , b ( i ) } {\displaystyle s(i)={\frac {b(i)-a(i)}{\max\{a(i),b(i)\}}}} , if | C i | > 1 {\displaystyle |C_{i}|>1} and s ( i ) = 0 {\displaystyle s(i)=0} , if | C i | = 1 {\displaystyle |C_{i}|=1} , which can also be written as s ( i ) = { 1 − a ( i ) b ( i ) , if a ( i ) < b ( i ) 0 , if a ( i ) = b ( i ) b ( i ) a ( i ) − 1 , if a ( i ) > b ( i ) {\displaystyle s(i)={\begin{cases}1-{\frac {a(i)}{b(i)}},&{\mbox{ if }}a(i)b(i)\\\end{cases}}} From the above definition, s ( i ) {\displaystyle s(i)} is bounded to the interval [ − 1 , 1 ] {\displaystyle [-1,1]} , i.e. − 1 ≤ s ( i ) ≤ 1. {\displaystyle -1\leq s(i)\leq 1.} Note that a ( i ) {\displaystyle a(i)} is not clearly defined for clusters with size = 1, in which case we set s ( i ) = 0 {\displaystyle s(i)=0} . This choice is arbitrary, but neutral in the sense that it is at the midpoint of the bounds, -1 and 1. For s ( i ) {\displaystyle s(i)} to be close to 1 we require a ( i ) ≪ b ( i ) {\displaystyle a(i)\ll b(i)} . As a ( i ) {\displaystyle a(i)} is a measure of how dissimilar i {\displaystyle i} is to its own cluster, a small value means it is well matched. Furthermore, a large b ( i ) {\displaystyle b(i)} implies that i {\displaystyle i} is badly matched to its neighbouring cluster. Thus an s ( i ) {\displaystyle s(i)} close to 1 means that the data is appropriately clustered. If s ( i ) {\displaystyle s(i)} is close to -1, then by the same logic we see that i {\displaystyle i} would be more appropriate if it was clustered in its neighbouring cluster. An s ( i ) {\displaystyle s(i)} near zero means that the datum is on the border of two natural clusters. The mean s ( i ) {\displaystyle s(i)} over all points of a cluster is a measure of how tightly grouped all the points in the cluster are. Thus the mean s ( i ) {\displaystyle s(i)} over all data of the entire dataset is a measure of how appropriately the data have been clustered. If there are too many or too few clusters, as may occur when a poor choice of k {\displaystyle k} is used in the clustering algorithm (e.g., k-means), some of the clusters will typically display much narrower silhouettes than the rest. Thus silhouette plots and means may be used to determine the natural number of clusters within a dataset. One can also increase the likelihood of the silhouette being maximized at the correct number of clusters by re-scaling the data using feature weights that are cluster specific. Kaufman et al. introduced the term silhouette coefficient for the maximum value of the mean s ( i ) {\displaystyle s(i)} over all data of the entire dataset, i.e., S C = max k s ~ ( k ) , {\displaystyle SC=\max _{k}{\tilde {s}}\left(k\right),} where s ~ ( k ) {\displaystyle {\tilde {s}}\left(k\right)} represents the mean s ( i ) {\displaystyle s(i)} over all data of the entire dataset for a specific number of clusters k {\displaystyle k} . The silhouette coefficient describes the best possible clustering possible for a given number of clusters, as measured by the highest average silhouette score for all points in the dataset. == Simplified and medoid silhouette == Computing the silhouette coefficient needs all O ( N 2 ) {\displaystyle {\mathcal {O}}(N^{2})} pairwise distances, making this evaluation much more costly than clustering with k-means. For a clustering with centers μ C I {\displaystyle \mu _{C_{I}}} for each cluster C I {\displaystyle C_{I}} , we can use the following simplified Silhouette for each point i ∈ C I {\displaystyle i\in C_{I}} instead, which can be computed using only O ( N k ) {\displaystyle {\mathcal {O}}(Nk)} distances: a ′ ( i ) = d ( i , μ C I ) {\displaystyle a'(i)=d(i,\mu _{C_{I}})} and b ′ ( i ) = min C J ≠ C I d ( i , μ C J ) {\displaystyle b'(i)=\min _{C_{J}\neq C_{I}}d(i,\mu _{C_{J}})} , which has the additional benefit that a ′ ( i ) {\displaystyle a'(i)} is always defined, then define accordingly the simplified silhouette and simplified silhouette coefficient s ′ ( i ) = b ′ ( i ) − a ′ ( i ) max { a ′ ( i ) , b ′ ( i ) } {\displaystyle s'(i)={\frac {b'(i)-a'(i)}{\max\{a'(i),b'(i)\}}}} S C ′ = max k 1 N ∑ i s ′ ( i ) {\displaystyle SC'=\max _{k}{\frac {1}{N}}\sum _{i}s'\left(i\right)} . If the cluster centers are medoids (as in k-medoids clustering) instead of arithmetic means (as in k-means clustering), this is also called the medoid-based silhouette or medoid silhouette. If every object is assigned to the nearest medoid (as in k-medoids clustering), we know that a ′ ( i ) ≤ b ′ ( i ) {\displaystyle a'(i)\leq b'(i)} , and hence s ′ ( i ) = b ′ ( i ) − a ′ ( i ) b ′ ( i ) = 1 − a ′ ( i ) b ′ ( i ) {\displaystyle s'(i)={\frac {b'(i)-a'(i)}{b'(i)}}=1-{\frac {a'(i)}{b'(i)}}} . == Silhouette clustering == Instead of using the average silhouette to evaluate a clustering obtained from, e.g., k-medoids or k-means, we can try to directly find a solution that maximizes the Silhouette. We do not have a closed form solution to maximize this, but it will usually be best to assign points to the nearest cluster as done by these methods. Van der Laan et al. proposed to adapt the standard algorithm for k-medoids, PAM, for this purpose and call this algorithm PAMSIL: Choose initial medoids by using PAM Compute the average silhouette of this initial solution For each pair of a medoid m and a non-medoid x swap m and x compute the average silhouette of the resulting solution remember the best swap un-swap m and x for the next iteration Perform the best swap and return to

    Read more →
  • Random forest

    Random forest

    Random forests or random decision forests is an ensemble learning method for classification, regression and other tasks that works by creating a multitude of decision trees during training. For classification tasks, the output of the random forest is the class selected by most trees. For regression tasks, the output is the average of the predictions of the trees. Random forests correct for decision trees' habit of overfitting to their training set. The first algorithm for random decision forests was created in 1995 by Tin Kam Ho using the random subspace method, which, in Ho's formulation, is a way to implement the "stochastic discrimination" approach to classification proposed by Eugene Kleinberg. An extension of the algorithm was developed by Leo Breiman and Adele Cutler, who registered "Random Forests" as a trademark in 2006 (as of 2019, owned by Minitab, Inc.). The extension combines Breiman's "bagging" idea and random selection of features, introduced first by Ho and later independently by Amit and Geman in order to construct a collection of decision trees with controlled variance. == History == The general method of random decision forests was first proposed by Salzberg and Heath in 1993, with a method that used a randomized decision tree algorithm to create multiple trees and then combine them using majority voting. This idea was developed further by Ho in 1995. Ho established that forests of trees splitting with oblique hyperplanes can gain accuracy as they grow without suffering from overtraining, as long as the forests are randomly restricted to be sensitive to only selected feature dimensions. A subsequent work along the same lines concluded that other splitting methods behave similarly, as long as they are randomly forced to be insensitive to some feature dimensions. This observation that a more complex classifier (a larger forest) gets more accurate nearly monotonically is in sharp contrast to the common belief that the complexity of a classifier can only grow to a certain level of accuracy before being hurt by overfitting. The explanation of the forest method's resistance to overtraining can be found in Kleinberg's theory of stochastic discrimination. The early development of Breiman's notion of random forests was influenced by the work of Amit and Geman who introduced the idea of searching over a random subset of the available decisions when splitting a node, in the context of growing a single tree. The idea of random subspace selection from Ho was also influential in the design of random forests. This method grows a forest of trees, and introduces variation among the trees by projecting the training data into a randomly chosen subspace before fitting each tree or each node. Finally, the idea of randomized node optimization, where the decision at each node is selected by a randomized procedure, rather than a deterministic optimization was first introduced by Thomas G. Dietterich. The proper introduction of random forests was made in a paper by Leo Breiman, that has become one of the world's most cited papers. This paper describes a method of building a forest of uncorrelated trees using a CART like procedure, combined with randomized node optimization and bagging. In addition, this paper combines several ingredients, some previously known and some novel, which form the basis of the modern practice of random forests, in particular: Using out-of-bag error as an estimate of the generalization error. Measuring variable importance through permutation. The report also offers the first theoretical result for random forests in the form of a bound on the generalization error which depends on the strength of the trees in the forest and their correlation. == Algorithm == === Preliminaries: decision tree learning === Decision trees are a popular method for various machine learning tasks. Tree learning is almost "an off-the-shelf procedure for data mining", say Hastie et al., "because it is invariant under scaling and various other transformations of feature values, is robust to inclusion of irrelevant features, and produces inspectable models. However, they are seldom accurate". In particular, trees that are grown very deep tend to learn highly irregular patterns: they overfit their training sets, i.e. have low bias, but very high variance. Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance. This comes at the expense of a small increase in the bias and some loss of interpretability, but generally greatly boosts the performance in the final model. === Bagging === The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners. Given a training set X = x1, ..., xn with responses Y = y1, ..., yn, bagging repeatedly (B times) selects a random sample with replacement of the training set and fits trees to these samples: After training, predictions for unseen samples x' can be made by averaging the predictions from all the individual regression trees on x': f ^ = 1 B ∑ b = 1 B f b ( x ′ ) {\displaystyle {\hat {f}}={\frac {1}{B}}\sum _{b=1}^{B}f_{b}(x')} or by taking the plurality vote in the case of classification trees. This bootstrapping procedure leads to better model performance because it decreases the variance of the model, without increasing the bias. This means that while the predictions of a single tree are highly sensitive to noise in its training set, the average of many trees is not, as long as the trees are not correlated. Simply training many trees on a single training set would give strongly correlated trees (or even the same tree many times, if the training algorithm is deterministic); bootstrap sampling is a way of de-correlating the trees by showing them different training sets. Additionally, an estimate of the uncertainty of the prediction can be made as the standard deviation of the predictions from all the individual regression trees on x′: σ = ∑ b = 1 B ( f b ( x ′ ) − f ^ ) 2 B − 1 . {\displaystyle \sigma ={\sqrt {\frac {\sum _{b=1}^{B}(f_{b}(x')-{\hat {f}})^{2}}{B-1}}}.} The number B of samples (equivalently, of trees) is a free parameter. Typically, a few hundred to several thousand trees are used, depending on the size and nature of the training set. B can be optimized using cross-validation, or by observing the out-of-bag error: the mean prediction error on each training sample xi, using only the trees that did not have xi in their bootstrap sample. The training and test error tend to level off after some number of trees have been fit. === From bagging to random forests === The above procedure describes the original bagging algorithm for trees. Random forests also include another type of bagging scheme: they use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features. This process is sometimes called "feature bagging". The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors for the response variable (target output), these features will be selected in many of the B trees, causing them to become correlated. An analysis of how bagging and random subspace projection contribute to accuracy gains under different conditions is given by Ho. Typically, for a classification problem with p {\displaystyle p} features, p {\displaystyle {\sqrt {p}}} (rounded down) features are used in each split. For regression problems the inventors recommend p / 3 {\displaystyle p/3} (rounded down) with a minimum node size of 5 as the default. In practice, the best values for these parameters should be tuned on a case-to-case basis for every problem. === ExtraTrees === Adding one further step of randomization yields extremely randomized trees, or ExtraTrees. As with ordinary random forests, they are an ensemble of individual trees, but there are two main differences: (1) each tree is trained using the whole learning sample (rather than a bootstrap sample), and (2) the top-down splitting is randomized: for each feature under consideration, a number of random cut-points are selected, instead of computing the locally optimal cut-point (based on, e.g., information gain or the Gini impurity). The values are chosen from a uniform distribution within the feature's empirical range (in the tree's training set). Then, of all the randomly chosen splits, the split that yields the highest score is chosen to split the node. Similar to ordinary random forests, the number of randomly selected features to be considered at each node can be specified. Default values for this parameter are p {\displaystyle {\sqrt {p}}} for classification and p {\displaystyle p} for regression, where p {\displaystyle p} is the number of features in the model. === Random forests for high-dimensional data === The basic random forest procedure may

    Read more →
  • Apache Giraph

    Apache Giraph

    Apache Giraph is an Apache project to perform graph processing on big data. Giraph utilizes Apache Hadoop's MapReduce implementation to process graphs. Facebook used Giraph with some performance improvements to analyze one trillion edges using 200 machines in 4 minutes. Giraph is based on a paper published by Google about its own graph processing system called Pregel. It can be compared to other Big Graph processing libraries such as Cassovary. As of September 2023, it is no longer actively developed.

    Read more →
  • Event condition action

    Event condition action

    Event condition action (ECA) is a short-cut for referring to the structure of active rules in event-driven architecture and active database systems. Such a rule traditionally consisted of three parts: The event part specifies the signal that triggers the invocation of the rule The condition part is a logical test that, if satisfied or evaluates to true, causes the action to be carried out The action part consists of updates or invocations on the local data This structure was used by the early research in active databases which started to use the term ECA. Current state of the art ECA rule engines use many variations on rule structure. Also other features not considered by the early research is introduced, such as strategies for event selection into the event part. In a memory-based rule engine, the condition could be some tests on local data and actions could be updates to object attributes. In a database system, the condition could simply be a query to the database, with the result set (if not null) being passed to the action part for changes to the database. In either case, actions could also be calls to external programs or remote procedures. Note that for database usage, updates to the database are regarded as internal events. As a consequence, the execution of the action part of an active rule can match the event part of the same or another active rule, thus triggering it. The equivalent in a memory-based rule engine would be to invoke an external method that caused an external event to trigger another ECA rule. ECA rules can also be used in rule engines that use variants of the Rete algorithm for rule processing. == ECA rule engines == Rulecore Concurrent Rules Apart Database Detect Invocation Rules ConceptBase ECArules

    Read more →
  • Alternating decision tree

    Alternating decision tree

    An alternating decision tree (ADTree) is a machine learning method for classification. It generalizes decision trees and has connections to boosting. An ADTree consists of an alternation of decision nodes, which specify a predicate condition, and prediction nodes, which contain a single number. An instance is classified by an ADTree by following all paths for which all decision nodes are true, and summing any prediction nodes that are traversed. == History == ADTrees were introduced by Yoav Freund and Llew Mason. However, the algorithm as presented had several typographical errors. Clarifications and optimizations were later presented by Bernhard Pfahringer, Geoffrey Holmes and Richard Kirkby. Implementations are available in Weka and JBoost. == Motivation == Original boosting algorithms typically used either decision stumps or decision trees as weak hypotheses. As an example, boosting decision stumps creates a set of T {\displaystyle T} weighted decision stumps (where T {\displaystyle T} is the number of boosting iterations), which then vote on the final classification according to their weights. Individual decision stumps are weighted according to their ability to classify the data. Boosting a simple learner results in an unstructured set of T {\displaystyle T} hypotheses, making it difficult to infer correlations between attributes. Alternating decision trees introduce structure to the set of hypotheses by requiring that they build off a hypothesis that was produced in an earlier iteration. The resulting set of hypotheses can be visualized in a tree based on the relationship between a hypothesis and its "parent." Another important feature of boosted algorithms is that the data is given a different distribution at each iteration. Instances that are misclassified are given a larger weight while accurately classified instances are given reduced weight. == Alternating decision tree structure == An alternating decision tree consists of decision nodes and prediction nodes. Decision nodes specify a predicate condition. Prediction nodes contain a single number. ADTrees always have prediction nodes as both root and leaves. An instance is classified by an ADTree by following all paths for which all decision nodes are true and summing any prediction nodes that are traversed. This is different from binary classification trees such as CART (Classification and regression tree) or C4.5 in which an instance follows only one path through the tree. === Example === The following tree was constructed using JBoost on the spambase dataset (available from the UCI Machine Learning Repository). In this example, spam is coded as 1 and regular email is coded as −1. The following table contains part of the information for a single instance. The instance is scored by summing all of the prediction nodes through which it passes. In the case of the instance above, the score is calculated as The final score of 0.657 is positive, so the instance is classified as spam. The magnitude of the value is a measure of confidence in the prediction. The original authors list three potential levels of interpretation for the set of attributes identified by an ADTree: Individual nodes can be evaluated for their own predictive ability. Sets of nodes on the same path may be interpreted as having a joint effect The tree can be interpreted as a whole. Care must be taken when interpreting individual nodes as the scores reflect a re weighting of the data in each iteration. == Description of the algorithm == The inputs to the alternating decision tree algorithm are: A set of inputs ( x 1 , y 1 ) , … , ( x m , y m ) {\displaystyle (x_{1},y_{1}),\ldots ,(x_{m},y_{m})} where x i {\displaystyle x_{i}} is a vector of attributes and y i {\displaystyle y_{i}} is either -1 or 1. Inputs are also called instances. A set of weights w i {\displaystyle w_{i}} corresponding to each instance. The fundamental element of the ADTree algorithm is the rule. A single rule consists of a precondition, a condition, and two scores. A condition is a predicate of the form "attribute value." A precondition is simply a logical conjunction of conditions. Evaluation of a rule involves a pair of nested if statements: 1 if (precondition) 2 if (condition) 3 return score_one 4 else 5 return score_two 6 end if 7 else 8 return 0 9 end if Several auxiliary functions are also required by the algorithm: W + ( c ) {\displaystyle W_{+}(c)} returns the sum of the weights of all positively labeled examples that satisfy predicate c {\displaystyle c} W − ( c ) {\displaystyle W_{-}(c)} returns the sum of the weights of all negatively labeled examples that satisfy predicate c {\displaystyle c} W ( c ) = W + ( c ) + W − ( c ) {\displaystyle W(c)=W_{+}(c)+W_{-}(c)} returns the sum of the weights of all examples that satisfy predicate c {\displaystyle c} The algorithm is as follows: 1 function ad_tree 2 input Set of m training instances 3 4 wi = 1/m for all i 5 a = 1 2 ln W + ( t r u e ) W − ( t r u e ) {\displaystyle a={\frac {1}{2}}{\textrm {ln}}{\frac {W_{+}(true)}{W_{-}(true)}}} 6 R0 = a rule with scores a and 0, precondition "true" and condition "true." 7 P = { t r u e } {\displaystyle {\mathcal {P}}=\{true\}} 8 C = {\displaystyle {\mathcal {C}}=} the set of all possible conditions 9 for j = 1 … T {\displaystyle j=1\dots T} 10 p ∈ P , c ∈ C {\displaystyle p\in {\mathcal {P}},c\in {\mathcal {C}}} get values that minimize z = 2 ( W + ( p ∧ c ) W − ( p ∧ c ) + W + ( p ∧ ¬ c ) W − ( p ∧ ¬ c ) ) + W ( ¬ p ) {\displaystyle z=2\left({\sqrt {W_{+}(p\wedge c)W_{-}(p\wedge c)}}+{\sqrt {W_{+}(p\wedge \neg c)W_{-}(p\wedge \neg c)}}\right)+W(\neg p)} 11 P + = p ∧ c + p ∧ ¬ c {\displaystyle {\mathcal {P}}+=p\wedge c+p\wedge \neg c} 12 a 1 = 1 2 ln W + ( p ∧ c ) + 1 W − ( p ∧ c ) + 1 {\displaystyle a_{1}={\frac {1}{2}}{\textrm {ln}}{\frac {W_{+}(p\wedge c)+1}{W_{-}(p\wedge c)+1}}} 13 a 2 = 1 2 ln W + ( p ∧ ¬ c ) + 1 W − ( p ∧ ¬ c ) + 1 {\displaystyle a_{2}={\frac {1}{2}}{\textrm {ln}}{\frac {W_{+}(p\wedge \neg c)+1}{W_{-}(p\wedge \neg c)+1}}} 14 Rj = new rule with precondition p, condition c, and weights a1 and a2 15 w i = w i e − y i R j ( x i ) {\displaystyle w_{i}=w_{i}e^{-y_{i}R_{j}(x_{i})}} 16 end for 17 return set of Rj The set P {\displaystyle {\mathcal {P}}} grows by two preconditions in each iteration, and it is possible to derive the tree structure of a set of rules by making note of the precondition that is used in each successive rule. == Empirical results == Figure 6 in the original paper demonstrates that ADTrees are typically as robust as boosted decision trees and boosted decision stumps. Typically, equivalent accuracy can be achieved with a much simpler tree structure than recursive partitioning algorithms.

    Read more →
  • Tensor product network

    Tensor product network

    A tensor product network, in artificial neural networks, is a network that exploits the properties of tensors to model associative concepts such as variable assignment. Orthonormal vectors are chosen to model the ideas (such as variable names and target assignments), and the tensor product of these vectors construct a network whose mathematical properties allow the user to easily extract the association from it.

    Read more →
  • C4.5 algorithm

    C4.5 algorithm

    C4.5 is an algorithm used to generate a decision tree developed by Ross Quinlan. C4.5 is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and for this reason, C4.5 is often referred to as a statistical classifier. In 2011, authors of the Weka machine learning software described the C4.5 algorithm as "a landmark decision tree program that is probably the machine learning workhorse most widely used in practice to date". It became quite popular after ranking #1 in the Top 10 Algorithms in Data Mining pre-eminent paper published by Springer LNCS in 2008. == Algorithm == C4.5 builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy. The training data is a set S = s 1 , s 2 , . . . {\displaystyle S={s_{1},s_{2},...}} of already classified samples. Each sample s i {\displaystyle s_{i}} consists of a p-dimensional vector ( x 1 , i , x 2 , i , . . . , x p , i ) {\displaystyle (x_{1,i},x_{2,i},...,x_{p,i})} , where the x j {\displaystyle x_{j}} represent attribute values or features of the sample, as well as the class in which s i {\displaystyle s_{i}} falls. At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. The splitting criterion is the normalized information gain (difference in entropy). The attribute with the highest normalized information gain is chosen to make the decision. The C4.5 algorithm then recurses on the partitioned sublists. This algorithm has a few base cases. All the samples in the list belong to the same class. When this happens, it simply creates a leaf node for the decision tree saying to choose that class. None of the features provide any information gain. In this case, C4.5 creates a decision node higher up the tree using the expected value of the class. Instance of previously unseen class encountered. Again, C4.5 creates a decision node higher up the tree using the expected value. === Pseudocode === In pseudocode, the general algorithm for building decision trees is: Check for the above base cases. For each attribute a, find the normalized information gain ratio from splitting on a. Let a_best be the attribute with the highest normalized information gain. Create a decision node that splits on a_best. Recurse on the sublists obtained by splitting on a_best, and add those nodes as children of node. == Improvements from ID3 algorithm == C4.5 made a number of improvements to ID3. Some of these are: Handling both continuous and discrete attributes: In order to handle continuous attributes, C4.5 creates a threshold and then splits the list into those whose attribute value is above the threshold and those that are less than or equal to it. Handling training data with missing attribute values: C4.5 allows attribute values to be marked as missing. Missing attribute values are simply not used in gain and entropy calculations. Handling attributes with differing costs. Pruning trees after creation: C4.5 goes back through the tree once it's been created and attempts to remove branches that do not help by replacing them with leaf nodes. == Improvements in C5.0/See5 algorithm == Quinlan went on to create C5.0 and See5 (C5.0 for Unix/Linux, See5 for Windows) which he markets commercially. C5.0 offers a number of improvements on C4.5. Some of these are: Speed - C5.0 is significantly faster than C4.5 (several orders of magnitude) Memory usage - C5.0 is more memory efficient than C4.5 Smaller decision trees - C5.0 gets similar results to C4.5 with considerably smaller decision trees. Support for boosting - Boosting improves the trees and gives them more accuracy. Weighting - C5.0 allows you to weight different cases and misclassification types. Winnowing - a C5.0 option automatically winnows the attributes to remove those that may be unhelpful. Source for a single-threaded Linux version of C5.0 is available under the GNU General Public License (GPL).

    Read more →
  • Graphics

    Graphics

    Graphics (from Ancient Greek γραφικός (graphikós) 'pertaining to drawing, painting, writing, etc.') are visual images or designs on some surface, such as a wall, canvas, screen, paper, or stone, to inform, illustrate, or entertain. In contemporary usage, it includes a pictorial representation of data, as in design and manufacture, in typesetting and the graphic arts, and in educational and recreational software. Images that are generated by a computer are called computer graphics. Examples are photographs, drawings, line art, mathematical graphs, line graphs, charts, diagrams, typography, numbers, symbols, geometric designs, maps, engineering drawings, or other images. Graphics often combine text, illustration, and color. Graphic design may consist of the deliberate selection, creation, or arrangement of typography alone, as in a brochure, flyer, poster, web site, or book without any other element. The objective can be clarity or effective communication, association with other cultural elements, or merely the creation of a distinctive style. Graphics can be functional or artistic. The latter can be a recorded version, such as a photograph, or an interpretation by a scientist to highlight essential features, or an artist, in which case the distinction with imaginary graphics may become blurred. It can also be used for architecture. == History == The earliest graphics known to anthropologists studying prehistoric periods are cave paintings and markings on boulders, bone, ivory, and antlers, which were created during the Upper Palaeolithic period from 40,000 to 10,000 B.C. or earlier. Many of these were found to record astronomical, seasonal, and chronological details. Some of the earliest graphics and drawings are known to the modern world, from almost 6,000 years ago, are that of engraved stone tablets and ceramic cylinder seals, marking the beginning of the historical periods and the keeping of records for accounting and inventory purposes. Records from Egypt predate these and papyrus was used by the Egyptians as a material on which to plan the building of pyramids; they also used slabs of limestone and wood. From 600 to 250 BC, the Greeks played a major role in geometry. They used graphics to represent their mathematical theories such as the Circle Theorem and the Pythagorean theorem. In art, "graphics" is often used to distinguish work in a monotone and made up of lines, as opposed to painting. === Drawing === Drawing generally involves making marks on a surface by applying pressure from a tool or moving a tool across a surface. In which a tool is always used as if there were no tools it would be art. Graphical drawing is an instrumental guided drawing. === Printmaking === Woodblock printing, including images is first seen in China after paper was invented (about A.D. 105). In the West, the main techniques have been woodcut, engraving and etching, but there are many others. ==== Etching ==== Etching is an intaglio method of printmaking in which the image is incised into the surface of a metal plate using an acid. The acid eats the metal, leaving behind roughened areas, or, if the surface exposed to the acid is very thin, burning a line into the plate. The use of the process in printmaking is believed to have been invented by Daniel Hopfer (c. 1470–1536) of Augsburg, Germany, who decorated armour in this way. Etching is also used in the manufacturing of printed circuit boards and semiconductor devices. === Line art === Line art is a rather non-specific term sometimes used for any image that consists of distinct straight and curved lines placed against a (usually plain) background, without gradations in shade (darkness) or hue (color) to represent two-dimensional or three-dimensional objects. Line art is usually monochromatic, although lines may be of different colors. === Illustration === An illustration is a visual representation such as a drawing, painting, photograph or other work of art that stresses the subject more than form. The aim of an illustration is to elucidate or decorate a story, poem or piece of textual information (such as a newspaper article), traditionally by providing a visual representation of something described in the text. The editorial cartoon, also known as a political cartoon, is an illustration containing a political or social message. Illustrations can be used to display a wide range of subject matter and serve a variety of functions, such as: giving faces to characters in a story displaying a number of examples of an item described in an academic textbook (e.g. A Typology) visualizing step-wise sets of instructions in a technical manual communicating subtle thematic tone in a narrative linking brands to the ideas of human expression, individuality, and creativity making a reader laugh or smile for fun (to make laugh) funny === Graphs === A graph or chart is a graphic that represents tabular or numeric data. Charts are often used to make it easier to understand large quantities of data and the relationships between different parts of the data. === Diagrams === A diagram is a simplified and structured visual representation of concepts, ideas, constructions, relations, statistical data, etc., used to visualize and clarify the topic. === Symbols === A symbol, in its basic sense, is a representation of a concept or quantity; i.e., an idea, object, concept, quality, etc. In more psychological and philosophical terms, all concepts are symbolic in nature, and representations for these concepts are simply token artifacts that are allegorical to (but do not directly codify) a symbolic meaning, or symbolism. === Maps === A map is a simplified depiction of a space, a navigational aid which highlights relations between objects within that space. Usually, a map is a two-dimensional, geometrically accurate representation of a three-dimensional space. One of the first 'modern' maps was made by Waldseemüller. === Photography === One difference between photography and other forms of graphics is that a photographer, in principle, just records a single moment in reality, with seemingly no interpretation. However, a photographer can choose the field of view and angle, and may also use other techniques, such as various lenses to choose the view or filters to change the colors. In recent times, digital photography has opened the way to an infinite number of fast, but strong, manipulations. Even in the early days of photography, there was controversy over photographs of enacted scenes that were presented as 'real life' (especially in war photography, where it can be very difficult to record the original events). Shifting the viewer's eyes ever so slightly with simple pinpricks in the negative could have a dramatic effect. The choice of the field of view can have a strong effect, effectively 'censoring out' other parts of the scene, accomplished by cropping them out or simply not including them in the photograph. This even touches on the philosophical question of what reality is. The human brain processes information based on previous experience, making us see what we want to see or what we were taught to see. Photography does the same, although the photographer interprets the scene for their viewer. === Engineering drawings === An engineering drawing is a type of drawing and is technical in nature, used to fully and clearly define requirements for engineered items. It is usually created in accordance with standardized conventions for layout, nomenclature, interpretation, appearance (such as typefaces and line styles), size, etc. === Computer graphics === There are two types of computer graphics: raster graphics, where each pixel is separately defined (as in a digital photograph), and vector graphics, where mathematical formulas are used to draw lines and shapes, which are then interpreted at the viewer's end to produce the graphic. Using vectors results in infinitely sharp graphics and often smaller files, but, when complex, like vectors take time to render and may have larger file sizes than a raster equivalent. In 1950, the first computer-driven display was attached to MIT's Whirlwind I computer to generate simple pictures. This was followed by MIT's TX-0 and TX-2, interactive computing which increased interest in computer graphics during the late 1950s. In 1962, Ivan Sutherland invented Sketchpad, an innovative program that influenced alternative forms of interaction with computers. In the mid-1960s, large computer graphics research projects were begun at MIT, General Motors, Bell Labs, and Lockheed Corporation. Douglas T. Ross of MIT developed an advanced compiler language for graphics programming. S.A.Coons, also at MIT, and J. C. Ferguson at Boeing, began work in sculptured surfaces. GM developed their DAC-1 system, and other companies, such as Douglas, Lockheed, and McDonnell, also made significant developments. In 1968, ray tracing was first described by Arthur Appel of the IBM Research Center, Yorktown Heights, N

    Read more →
  • Mutation (evolutionary algorithm)

    Mutation (evolutionary algorithm)

    Mutation is a genetic operator used to maintain genetic diversity of the chromosomes of a population of an evolutionary algorithm (EA), including genetic algorithms in particular. It is analogous to biological mutation. The classic example of a mutation operator of a binary coded genetic algorithm (GA) involves a probability that an arbitrary bit in a genetic sequence will be flipped from its original state. A common method of implementing the mutation operator involves generating a random variable for each bit in a sequence. This random variable tells whether or not a particular bit will be flipped. This mutation procedure, based on the biological point mutation, is called single point mutation. Other types of mutation operators are commonly used for representations other than binary, such as floating-point encodings or representations for combinatorial problems. The purpose of mutation in EAs is to introduce diversity into the sampled population. Mutation operators are used in an attempt to avoid local minima by preventing the population of chromosomes from becoming too similar to each other, thus slowing or even stopping convergence to the global optimum. This reasoning also leads most EAs to avoid only taking the fittest of the population in generating the next generation, but rather selecting a random (or semi-random) set with a weighting toward those that are fitter. The following requirements apply to all mutation operators used in an EA: every point in the search space must be reachable by one or more mutations. there must be no preference for parts or directions in the search space (no drift). small mutations should be more probable than large ones. For different genome types, different mutation types are suitable. Some mutations are Gaussian, Uniform, Zigzag, Scramble, Insertion, Inversion, Swap, and so on. An overview and more operators than those presented below can be found in the introductory book by Eiben and Smith or in. == Bit string mutation == The mutation of bit strings ensue through bit flips at random positions. Example: The probability of a mutation of a bit is 1 l {\displaystyle {\frac {1}{l}}} , where l {\displaystyle l} is the length of the binary vector. Thus, a mutation rate of 1 {\displaystyle 1} per mutation and individual selected for mutation is reached. == Mutation of real numbers == Many EAs, such as the evolution strategy or the real-coded genetic algorithms, work with real numbers instead of bit strings. This is due to the good experiences that have been made with this type of coding. The value of a real-valued gene can either be changed or redetermined. A mutation that implements the latter should only ever be used in conjunction with the value-changing mutations and then only with comparatively low probability, as it can lead to large changes. In practical applications, the respective value range of the decision variables to be changed of the optimisation problem to be solved is usually limited. Accordingly, the values of the associated genes are each restricted to an interval [ x min , x max ] {\displaystyle [x_{\min },x_{\max }]} . Mutations may or may not take these restrictions into account. In the latter case, suitable post-treatment is then required as described below. === Mutation without consideration of restrictions === A real number x {\displaystyle x} can be mutated using normal distribution N ( 0 , σ ) {\displaystyle {\mathcal {N}}(0,\sigma )} by adding the generated random value to the old value of the gene, resulting in the mutated value x ′ {\displaystyle x'} : x ′ = x + N ( 0 , σ ) {\displaystyle x'=x+{\mathcal {N}}(0,\sigma )} In the case of genes with a restricted range of values, it is a good idea to choose the step size of the mutation σ {\displaystyle \sigma } so that it reasonably fits the range [ x min , x max ] {\displaystyle [x_{\min },x_{\max }]} of the gene to be changed, e.g.: σ = x max − x min 6 {\displaystyle \sigma ={\frac {x_{\text{max}}-x_{\text{min}}}{6}}} The step size can also be adjusted to the smaller permissible change range depending on the current value. In any case, however, it is likely that the new value x ′ {\displaystyle x'} of the gene will be outside the permissible range of values. Such a case must be considered a lethal mutation, since the obvious repair by using the respective violated limit as the new value of the gene would lead to a drift. This is because the limit value would then be selected with the entire probability of the values beyond the limit of the value range. The evolution strategy works with real numbers and mutation based on normal distribution. The step sizes are part of the chromosome and are subject to evolution together with the actual decision variables. === Mutation with consideration of restrictions === One possible form of changing the value of a gene while taking its value range [ x min , x max ] {\displaystyle [x_{\min },x_{\max }]} into account is the mutation relative parameter change of the evolutionary algorithm GLEAM (General Learning Evolutionary Algorithm and Method), in which, as with the mutation presented earlier, small changes are more likely than large ones. First, an equally distributed decision is made as to whether the current value x {\displaystyle x} should be increased or decreased and then the corresponding total change interval is determined. Without loss of generality, an increase is assumed for the explanation and the total change interval is then [ x , x max ] {\displaystyle [x,x_{\max }]} . It is divided into k {\displaystyle k} sub-areas of equal size with the width δ {\displaystyle \delta } , from which k {\displaystyle k} sub-change intervals of different size are formed: i {\displaystyle i} -th sub-change interval: [ x , x + δ ⋅ i ] {\displaystyle [x,x+\delta \cdot i]} with δ = ( x max − x ) k {\displaystyle \delta ={\frac {(x_{\text{max}}-x)}{k}}} and i = 1 , … , k {\displaystyle i=1,\dots ,k} Subsequently, one of the k {\displaystyle k} sub-change intervals is selected in equal distribution and a random number, also equally distributed, is drawn from it as the new value x ′ {\displaystyle x'} of the gene. The resulting summed probabilities of the sub-change intervals result in the probability distribution of the k {\displaystyle k} sub-areas shown in the adjacent figure for the exemplary case of k = 10 {\displaystyle k=10} . This is not a normal distribution as before, but this distribution also clearly favours small changes over larger ones. This mutation for larger values of k {\displaystyle k} , such as 10, is less well suited for tasks where the optimum lies on one of the value range boundaries. This can be remedied by significantly reducing k {\displaystyle k} when a gene value approaches its limits very closely. === Common properties === For both mutation operators for real-valued numbers, the probability of an increase and decrease is independent of the current value and is 50% in each case. In addition, small changes are considerably more likely than large ones. For mixed-integer optimization problems, rounding is usually used. == Mutation of permutations == Mutations of permutations are specially designed for genomes that are themselves permutations of a set. These are often used to solve combinatorial tasks. In the two mutations presented, parts of the genome are rotated or inverted. === Rotation to the right === The presentation of the procedure is illustrated by an example on the right: === Inversion === The presentation of the procedure is illustrated by an example on the right: === Variants with preference for smaller changes === The requirement raised at the beginning for mutations, according to which small changes should be more probable than large ones, is only inadequately fulfilled by the two permutation mutations presented, since the lengths of the partial lists and the number of shift positions are determined in an equally distributed manner. However, the longer the partial list and the shift, the greater the change in gene order. This can be remedied by the following modifications. The end index j {\displaystyle j} of the partial lists is determined as the distance d {\displaystyle d} to the start index i {\displaystyle i} : j = ( i + d ) mod | P 0 | {\displaystyle j=(i+d){\bmod {\left|P_{0}\right|}}} where d {\displaystyle d} is determined randomly according to one of the two procedures for the mutation of real numbers from the interval [ 0 , | P 0 | − 1 ] {\displaystyle \left[0,\left|P_{0}\right|-1\right]} and rounded. For the rotation, k {\displaystyle k} is determined similarly to the distance d {\displaystyle d} , but the value 0 {\displaystyle 0} is forbidden. For the inversion, note that i ≠ j {\displaystyle i\neq j} must hold, so for d {\displaystyle d} the value 0 {\displaystyle 0} must be excluded.

    Read more →
  • GraphLab

    GraphLab

    Turi is a graph-based, high performance, distributed computation framework written in C++. The GraphLab project was started by Prof. Carlos Guestrin of Carnegie Mellon University in 2009. It is an open source project that uses the Apache License. While GraphLab was originally developed for machine learning tasks, it has also been developed for other data-mining tasks. == Motivation == As the amounts of collected data and computing power grow (multicore, GPUs, clusters, clouds), modern datasets no longer fit into one computing node. Efficient distributed parallel algorithms for handling large-scale data are required. The GraphLab framework is a parallel programming abstraction targeted for sparse iterative graph algorithms. GraphLab provides a programming interface, allowing deployment of distributed machine learning algorithms. The main design considerations behind the design of GraphLab are: Sparse data with local dependencies Iterative algorithms Potentially asynchronous execution == GraphLab toolkits == On top of GraphLab, several implemented libraries of algorithms: Topic modeling - contains applications like LDA, which can be used to cluster documents and extract topical representations. Graph analytics - contains applications like pagerank and triangle counting, which can be applied to general graphs to estimate community structure. Clustering - contains standard data clustering tools such as Kmeans Collaborative filtering - contains a collection of applications used to make predictions about users interests and factorize large matrices. Graphical models - contains tools for making joint predictions about collections of related random variables. Computer vision - contains a collection of tools for reasoning about images. == Turi == Turi (formerly called Dato and before that GraphLab Inc.) is a company that was founded by Prof. Carlos Guestrin from University of Washington in May 2013 to continue development support of the GraphLab open source project. Dato Inc. raised a $6.75M Series A from Madrona Venture Group and New Enterprise Associates (NEA). They raised a $18.5M Series B from Vulcan Capital and Opus Capital, with participation from Madrona and NEA. On August 5, 2016, Turi was acquired by Apple Inc. for $200,000,000.

    Read more →