AI Detector That Colleges Use

AI Detector That Colleges Use — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Amália (LLM)

    Amália (LLM)

    Amália is a Portuguese large language model (LLM) announced in November 2024 by the Portuguese Prime-Minister Luís Montenegro. Its final version is expected to be launched in 2026. It is being developed by Center for Responsible AI (Centro para a AI Responsável) and by the research centers of NOVA School of Science and Technology and Instituto Superior Técnico. == History == In 2024 it was announced that the Portuguese Agency for Administrative Modernization (Agência para a Modernização Administrativa) transpose this LLM to Portuguese Public Administration. According to Paulo Dimas (CEO of the Center for Responsible AI) the three fundamental points of this LLM project are the linguistic variant (European Portuguese), cultural representation and data protection. In April 2025 it was announced that Amália had entered beta phase with an improved version being expected to be launched in September 2025. The beta version released in September is available only to the Public Administration, but the website launched in October reiterates the final version will be an open model.

    Read more →
  • Multiclass classification

    Multiclass classification

    In machine learning and statistical classification, multiclass classification or multinomial classification is the problem of classifying instances into one of three or more classes (classifying instances into one of two classes is called binary classification). For example, deciding on whether an image is showing a banana, peach, orange, or an apple is a multiclass classification problem, with four possible classes (banana, peach, orange, apple), while deciding on whether an image contains an apple or not is a binary classification problem (with the two possible classes being: apple, no apple). While many classification algorithms (e.g., decision trees, k-NN, neural networks and multinomial logistic regression) naturally permit the use of more than two classes, some are by nature binary algorithms (e.g., classical binary support vector machine) and require decomposition strategies such as one-vs-all, one-vs-one, or ECOC to solve multiclass problems. Multiclass classification should not be confused with multi-label classification, where multiple labels are to be predicted for each instance (e.g., predicting that an image contains both an apple and an orange, in the previous example). == Better-than-random multiclass models == From the confusion matrix of a multiclass model, we can determine whether a model does better than chance. Let K ≥ 3 {\displaystyle K\geq 3} be the number of classes, O {\displaystyle {\mathcal {O}}} a set of observations, y ^ : O → { 1 , . . . , K } {\displaystyle {\hat {y}}:{\mathcal {O}}\to \{1,...,K\}} a model of the target variable y : O → { 1 , . . . , K } {\displaystyle y:{\mathcal {O}}\to \{1,...,K\}} and n i , j {\displaystyle n_{i,j}} be the number of observations in the set { y = i } ∩ { y ^ = j } {\displaystyle \{y=i\}\cap \{{\hat {y}}=j\}} . We note n i . = ∑ j n i , j {\displaystyle n_{i.}=\sum _{j}n_{i,j}} , n . j = ∑ i n i , j {\displaystyle n_{.j}=\sum _{i}n_{i,j}} , n = ∑ j n . j = ∑ i n i . {\displaystyle n=\sum _{j}n_{.j}=\sum _{i}n_{i.}} , λ i = n i . n {\displaystyle \lambda _{i}={\frac {n_{i.}}{n}}} and μ j = n . j n {\displaystyle \mu _{j}={\frac {n_{.j}}{n}}} . It is assumed that the confusion matrix ( n i , j ) i , j {\displaystyle (n_{i,j})_{i,j}} contains at least one non-zero entry in each row, that is λ i > 0 {\displaystyle \lambda _{i}>0} for any i {\displaystyle i} . Finally we call "normalized confusion matrix" the matrix of conditional probabilities ( P ( y ^ = j ∣ y = i ) ) i , j = ( n i , j n i . ) i , j {\displaystyle (\mathbb {P} ({\hat {y}}=j\mid y=i))_{i,j}=\left({\frac {n_{i,j}}{n_{i.}}}\right)_{i,j}} . === Intuitive explanation === The lift is a way of measuring the deviation from independence of two events A {\displaystyle A} and B {\displaystyle B} : L i f t ( A , B ) = P ( A ∩ B ) P ( A ) P ( B ) = P ( A ∣ B ) P ( A ) = P ( B ∣ A ) P ( B ) {\displaystyle \mathrm {Lift} (A,B)={\frac {\mathbb {P} (A\cap B)}{\mathbb {P} (A)\mathbb {P} (B)}}={\frac {\mathbb {P} (A\mid B)}{\mathbb {P} (A)}}={\frac {\mathbb {P} (B\mid A)}{\mathbb {P} (B)}}} We have L i f t ( A , B ) > 1 {\displaystyle \mathrm {Lift} (A,B)>1} if and only if events A {\displaystyle A} and B {\displaystyle B} occur simultaneously with a greater probability than if they were independent. In other words, if one of the two events occurs, the probability of observing the other event increases. A first condition to satisfy is to have L i f t ( y = i , y ^ = i ) ≥ 1 {\displaystyle \mathrm {Lift} (y=i,{\hat {y}}=i)\geq 1} for any i {\displaystyle i} . And the quality of a model (better or worse than chance) does not change if we over- or undersample the dataset, that is if we multiply each row R i {\displaystyle R_{i}} of the confusion matrix by a constant c i {\displaystyle c_{i}} . Thus the second condition is that the necessary and sufficient conditions for doing better than chance need only depend on the normalized confusion matrix. The condition on lifts can be reformulated with One versus Rest binary models : for any i {\displaystyle i} , we define the binary target variable y i {\displaystyle y_{i}} which is the indicator of event { y = i } {\displaystyle \{y=i\}} , and the binary model y ^ i {\displaystyle {\hat {y}}_{i}} of y i {\displaystyle y_{i}} which is the indicator of event { y ^ = i } {\displaystyle \{{\hat {y}}=i\}} . Each of the y ^ i {\displaystyle {\hat {y}}_{i}} models is a "One versus Rest" model. L i f t ( y = i , y ^ = i ) {\displaystyle \mathrm {Lift} (y=i,{\hat {y}}=i)} only depends on the events { y = i } {\displaystyle \{y=i\}} and { y ^ = i } {\displaystyle \{{\hat {y}}=i\}} , so merging or not merging the other classes doesn't change its value. We therefore have L i f t ( y = i , y ^ = i ) = L i f t ( y i = 1 , y ^ i = 1 ) {\displaystyle \mathrm {Lift} (y=i,{\hat {y}}=i)=\mathrm {Lift} (y_{i}=1,{\hat {y}}_{i}=1)} and the first condition is that all binary One versus Rest models are better than chance. ==== Example ==== If K = 2 {\displaystyle K=2} and 2 is the class of interest , the normalized confusion matrix is ( s p e c i f i c i t y 1 − s p e c i f i c i t y 1 − s e n s i t i v i t y s e n s i t i v i t y ) {\displaystyle {\begin{pmatrix}\mathrm {specificity} &1-\mathrm {specificity} \\1-\mathrm {sensitivity} &\mathrm {sensitivity} \end{pmatrix}}} and we have L i f t ( y = 1 , y ^ = 1 ) − 1 = P ( y = y ^ = 1 ) λ 1 μ 1 − 1 = n 1 , 1 n n 1. n .1 − 1 {\displaystyle \mathrm {Lift} (y=1,{\hat {y}}=1)-1={\frac {\mathbb {P} (y={\hat {y}}=1)}{\lambda _{1}\mu _{1}}}-1={\frac {n_{1,1}n}{n_{1.}n_{.1}}}-1} = n 1 , 1 ( n 1 , 1 + n 1 , 2 + n 2 , 1 + n 2 , 2 ) − ( n 1 , 1 + n 1 , 2 ) ( n 1 , 1 + n 2 , 1 ) n 1. n .1 = n 1 , 1 n 2 , 2 − n 1 , 2 n 2 , 1 n 1. n .1 {\displaystyle ={\frac {n_{1,1}(n_{1,1}+n_{1,2}+n_{2,1}+n_{2,2})-(n_{1,1}+n_{1,2})(n_{1,1}+n_{2,1})}{n_{1.}n_{.1}}}={\frac {n_{1,1}n_{2,2}-n_{1,2}n_{2,1}}{n_{1.}n_{.1}}}} . Thus L i f t ( y = 1 , y ^ = 1 ) ≥ 1 ⟺ n 1 , 1 n 2 , 2 − n 1 , 2 n 2 , 1 ≥ 0 {\displaystyle \mathrm {Lift} (y=1,{\hat {y}}=1)\geq 1\iff n_{1,1}n_{2,2}-n_{1,2}n_{2,1}\geq 0} . Similarly, by swapping the roles of 1 and 2, we find that L i f t ( y = 2 , y ^ = 2 ) ≥ 1 ⟺ n 1 , 1 n 2 , 2 − n 1 , 2 n 2 , 1 ≥ 0 {\displaystyle \mathrm {Lift} (y=2,{\hat {y}}=2)\geq 1\iff n_{1,1}n_{2,2}-n_{1,2}n_{2,1}\geq 0} . Dividing by n 1. n 2. {\displaystyle n_{1.}n_{2.}} we find that the necessary and sufficient condition on the normalized confusion matrix is s e n s i t i v i t y s p e c i f i c i t y − ( 1 − s e n s i t i v i t y ) ( 1 − s p e c i f i c i t y ) ≥ 0 ⟺ s e n s i t i v i t y + s p e c i f i c i t y − 1 ≥ 0 ⟺ J ≥ 0 {\displaystyle \mathrm {sensitivity} \ \mathrm {specificity} -(1-\mathrm {sensitivity} )(1-\mathrm {specificity} )\geq 0\iff \mathrm {sensitivity} +\mathrm {specificity} -1\geq 0\iff J\geq 0} . This brings us back to the classical binary condition: Youden's J must be positive (or zero for random models). === Random models === A random model is a model that is independent of the target variable. This property is easily reformulated with the confusion matrix. This proposition shows that the model y ^ {\displaystyle {\hat {y}}} of y {\displaystyle y} is uninformative if and only if there are two families of numbers ( α i ) i {\displaystyle (\alpha _{i})_{i}} and ( β j ) j {\displaystyle (\beta _{j})_{j}} such that P ( { y = i } ∩ { y ^ = j } ) = α i β j {\displaystyle \mathbb {P} (\{y=i\}\cap \{{\hat {y}}=j\})=\alpha _{i}\beta _{j}} for any i {\displaystyle i} and j {\displaystyle j} . === Multiclass likelihood ratios and diagnostic odds ratios === We define generalized likelihood ratios calculated from the normalized confusion matrix: for any i {\displaystyle i} and j ≠ i {\displaystyle j\not =i} , let L R i , j = P ( y ^ = j ∣ y = j ) P ( y ^ = j ∣ y = i ) {\displaystyle \mathrm {LR} _{i,j}={\frac {\mathbb {P} ({\hat {y}}=j\mid y=j)}{\mathbb {P} ({\hat {y}}=j\mid y=i)}}} . When K = 2 {\displaystyle K=2} , if 2 is the class of interest,, we find the classical likelihood ratios L R 1 , 2 = L R + {\displaystyle \mathrm {LR} _{1,2}=\mathrm {LR} _{+}} and L R 2 , 1 = 1 L R − {\displaystyle \mathrm {LR} _{2,1}={\frac {1}{\mathrm {LR} _{-}}}} . Multiclass diagnostic odds ratios can also be defined using the formula D O R i , j = D O R j , i = L R i , j L R j , i = n i , i n j , j n i , j n j , i = P ( y ^ = j ∣ y = j ) / P ( y ^ = i ∣ y = j ) P ( y ^ = j ∣ y = i ) / P ( y ^ = i ∣ y = i ) {\displaystyle \mathrm {DOR} _{i,j}=\mathrm {DOR} _{j,i}=\mathrm {LR} _{i,j}\mathrm {LR} _{j,i}={\frac {n_{i,i}n_{j,j}}{n_{i,j}n_{j,i}}}={\frac {\mathbb {P} ({\hat {y}}=j\mid y=j)/\mathbb {P} ({\hat {y}}=i\mid y=j)}{\mathbb {P} ({\hat {y}}=j\mid y=i)/\mathbb {P} ({\hat {y}}=i\mid y=i)}}} We saw above that a better-than-chance model (or a random model) must verify L i f t ( y = i , y ^ = i ) ≥ 1 {\displaystyle \mathrm {Lift} (y=i,{\hat {y}}=i)\geq 1} for any i {\displaystyle i} and λ i {\displaystyle \lambda _{i}} . According to the previous corollary, likelihood ratios are thus greater

    Read more →
  • Polynomial kernel

    Polynomial kernel

    In machine learning, the polynomial kernel is a kernel function commonly used with support vector machines (SVMs) and other kernelized models, that represents the similarity of vectors (training samples) in a feature space over polynomials of the original variables, allowing learning of non-linear models. Intuitively, the polynomial kernel looks not only at the given features of input samples to determine their similarity, but also combinations of these. In the context of regression analysis, such combinations are known as interaction features. The (implicit) feature space of a polynomial kernel is equivalent to that of polynomial regression, but without the combinatorial blowup in the number of parameters to be learned. When the input features are binary-valued (booleans), then the features correspond to logical conjunctions of input features. == Definition == For degree-d polynomials, the polynomial kernel is defined as K ( x , y ) = ( x T y + c ) d {\displaystyle K(\mathbf {x} ,\mathbf {y} )=(\mathbf {x} ^{\mathsf {T}}\mathbf {y} +c)^{d}} where x and y are vectors of size n in the input space, i.e. vectors of features computed from training or test samples and c ≥ 0 is a free parameter trading off the influence of higher-order versus lower-order terms in the polynomial. When c = 0, the kernel is called homogeneous. (A further generalized polykernel divides xTy by a user-specified scalar parameter a.) As a kernel, K corresponds to an inner product in a feature space based on some mapping φ: K ( x , y ) = ⟨ φ ( x ) , φ ( y ) ⟩ {\displaystyle K(\mathbf {x} ,\mathbf {y} )=\langle \varphi (\mathbf {x} ),\varphi (\mathbf {y} )\rangle } The nature of φ can be seen from an example. Let d = 2, so we get the special case of the quadratic kernel. After using the multinomial theorem (twice—the outermost application is the binomial theorem) and regrouping, K ( x , y ) = ( ∑ i = 1 n x i y i + c ) 2 = ∑ i = 1 n ( x i 2 ) ( y i 2 ) + ∑ i = 2 n ∑ j = 1 i − 1 ( 2 x i x j ) ( 2 y i y j ) + ∑ i = 1 n ( 2 c x i ) ( 2 c y i ) + c 2 {\displaystyle K(\mathbf {x} ,\mathbf {y} )=\left(\sum _{i=1}^{n}x_{i}y_{i}+c\right)^{2}=\sum _{i=1}^{n}\left(x_{i}^{2}\right)\left(y_{i}^{2}\right)+\sum _{i=2}^{n}\sum _{j=1}^{i-1}\left({\sqrt {2}}x_{i}x_{j}\right)\left({\sqrt {2}}y_{i}y_{j}\right)+\sum _{i=1}^{n}\left({\sqrt {2c}}x_{i}\right)\left({\sqrt {2c}}y_{i}\right)+c^{2}} From this it follows that the feature map is given by: φ ( x ) = ( x n 2 , … , x 1 2 , 2 x n x n − 1 , … , 2 x n x 1 , 2 x n − 1 x n − 2 , … , 2 x n − 1 x 1 , … , 2 x 2 x 1 , 2 c x n , … , 2 c x 1 , c ) {\displaystyle \varphi (x)=\left(x_{n}^{2},\ldots ,x_{1}^{2},{\sqrt {2}}x_{n}x_{n-1},\ldots ,{\sqrt {2}}x_{n}x_{1},{\sqrt {2}}x_{n-1}x_{n-2},\ldots ,{\sqrt {2}}x_{n-1}x_{1},\ldots ,{\sqrt {2}}x_{2}x_{1},{\sqrt {2c}}x_{n},\ldots ,{\sqrt {2c}}x_{1},c\right)} generalizing for ( x T y + c ) d {\displaystyle \left(\mathbf {x} ^{T}\mathbf {y} +c\right)^{d}} , where x ∈ R n {\displaystyle \mathbf {x} \in \mathbb {R} ^{n}} , y ∈ R n {\displaystyle \mathbf {y} \in \mathbb {R} ^{n}} and applying the multinomial theorem: ( x T y + c ) d = ∑ j 1 + j 2 + ⋯ + j n + 1 = d d ! j 1 ! ⋯ j n ! j n + 1 ! x 1 j 1 ⋯ x n j n c j n + 1 d ! j 1 ! ⋯ j n ! j n + 1 ! y 1 j 1 ⋯ y n j n c j n + 1 = φ ( x ) T φ ( y ) {\displaystyle {\begin{alignedat}{2}\left(\mathbf {x} ^{T}\mathbf {y} +c\right)^{d}&=\sum _{j_{1}+j_{2}+\dots +j_{n+1}=d}{\frac {\sqrt {d!}}{\sqrt {j_{1}!\cdots j_{n}!j_{n+1}!}}}x_{1}^{j_{1}}\cdots x_{n}^{j_{n}}{\sqrt {c}}^{j_{n+1}}{\frac {\sqrt {d!}}{\sqrt {j_{1}!\cdots j_{n}!j_{n+1}!}}}y_{1}^{j_{1}}\cdots y_{n}^{j_{n}}{\sqrt {c}}^{j_{n+1}}\\&=\varphi (\mathbf {x} )^{T}\varphi (\mathbf {y} )\end{alignedat}}} The last summation has l d = ( n + d d ) {\displaystyle l_{d}={\tbinom {n+d}{d}}} elements, so that: φ ( x ) = ( a 1 , … , a l , … , a l d ) {\displaystyle \varphi (\mathbf {x} )=\left(a_{1},\dots ,a_{l},\dots ,a_{l_{d}}\right)} where l = ( j 1 , j 2 , . . . , j n , j n + 1 ) {\displaystyle l=(j_{1},j_{2},...,j_{n},j_{n+1})} and a l = d ! j 1 ! ⋯ j n ! j n + 1 ! x 1 j 1 ⋯ x n j n c j n + 1 | j 1 + j 2 + ⋯ + j n + j n + 1 = d {\displaystyle a_{l}={\frac {\sqrt {d!}}{\sqrt {j_{1}!\cdots j_{n}!j_{n+1}!}}}x_{1}^{j_{1}}\cdots x_{n}^{j_{n}}{\sqrt {c}}^{j_{n+1}}\quad |\quad j_{1}+j_{2}+\dots +j_{n}+j_{n+1}=d} == Practical use == Although the RBF kernel is more popular in SVM classification than the polynomial kernel, the latter is quite popular in natural language processing (NLP). The most common degree is d = 2 (quadratic), since larger degrees tend to overfit on NLP problems. Various ways of computing the polynomial kernel (both exact and approximate) have been devised as alternatives to the usual non-linear SVM training algorithms, including: full expansion of the kernel prior to training/testing with a linear SVM, i.e. full computation of the mapping φ as in polynomial regression; basket mining (using a variant of the apriori algorithm) for the most commonly occurring feature conjunctions in a training set to produce an approximate expansion; inverted indexing of support vectors. One problem with the polynomial kernel is that it may suffer from numerical instability: when xTy + c < 1, K(x, y) = (xTy + c)d tends to zero with increasing d, whereas when xTy + c > 1, K(x, y) tends to infinity.

    Read more →
  • Vapnik–Chervonenkis dimension

    Vapnik–Chervonenkis dimension

    In Vapnik–Chervonenkis theory, the Vapnik–Chervonenkis (VC) dimension is a measure of the size (capacity, complexity, expressive power, richness, or flexibility) of a class of sets. The notion can be extended to classes of binary functions. It is defined as the cardinality of the largest set of points that the function class can shatter—that is, for which all possible binary labelings can be realized by some function in the class. It was originally defined by Vladimir Vapnik and Alexey Chervonenkis. Informally, the capacity of a classification model is related to how complicated it can be. For example, consider the thresholding of a high-degree polynomial: if the polynomial evaluates above zero, that point is classified as positive, otherwise as negative. A high-degree polynomial can be wiggly, so that it can fit a given set of training points well. Such a polynomial has a high capacity. A much simpler alternative is to threshold a linear function. This function may not fit the training set well, because it has a low capacity. This notion of capacity is made rigorous below. == Definitions == === VC dimension of a set-family === Let C = { C } C ∈ C {\displaystyle {\mathcal {C}}=\{C\}_{C\in {\mathcal {C}}}} be a family of sets (also called set family, collection of sets or set of sets) and X {\displaystyle X} a set. Their intersection is defined as the following set family: C ∩ X := { C ∩ X ∣ C ∈ C } . {\displaystyle {\mathcal {C}}\cap X:=\{C\cap X\mid C\in {\mathcal {C}}\}.} Here typically X {\displaystyle X} and each C ∈ C {\displaystyle C\in {\mathcal {C}}} are subsets of a big "universe" of possibilities U {\displaystyle U} where intersection takes place. We say that a set X {\displaystyle X} is shattered by C {\displaystyle {\mathcal {C}}} if P ( X ) = C ∩ X {\displaystyle {\mathcal {P}}(X)={\mathcal {C}}\cap X} i.e. the set of intersections contains (hence is equal to) all the subsets of X {\displaystyle X} . For finite sets X {\displaystyle X} this is equivalent to | C ∩ X | = 2 | X | . {\displaystyle |{\mathcal {C}}\cap X|=2^{|X|}.} The VC dimension D {\displaystyle D} of C {\displaystyle {\mathcal {C}}} is the cardinality of the largest set that is shattered by C {\displaystyle {\mathcal {C}}} . If arbitrarily large sets can be shattered, the VC dimension of C {\displaystyle {\mathcal {C}}} is ∞ {\displaystyle \infty } . === VC dimension of a classification model === A binary classification model f {\displaystyle f} with some parameter vector θ {\displaystyle \theta } is said to shatter a set of generally positioned data points ( x 1 , x 2 , … , x n ) {\displaystyle (x_{1},x_{2},\ldots ,x_{n})} if, for every assignment of labels to those points, there exists a θ {\displaystyle \theta } such that the model f {\displaystyle f} makes no errors when evaluating that set of data points. The VC dimension of a model f {\displaystyle f} is the maximum number of points that can be arranged so that f {\displaystyle f} shatters them. More formally, it is the maximum cardinal D {\displaystyle D} such that there exists a generally positioned data point set of cardinality D {\displaystyle D} that can be shattered by f {\displaystyle f} . == Examples == f {\displaystyle f} is a constant classifier (with no parameters); Its VC dimension is 0 since it cannot shatter even a single point. In general, the VC dimension of a finite classification model, which can return at most 2 d {\displaystyle 2^{d}} different classifiers, is at most d {\displaystyle d} (this is an upper bound on the VC dimension; the Sauer–Shelah lemma gives a lower bound on the dimension). f {\displaystyle f} is a single-parametric threshold classifier on real numbers; i.e., for a certain threshold θ {\displaystyle \theta } , the classifier f θ {\displaystyle f_{\theta }} returns 1 if the input number is larger than θ {\displaystyle \theta } and 0 otherwise. The VC dimension of f {\displaystyle f} is 1 because: (a) It can shatter a single point. For every point x {\displaystyle x} , a classifier f θ {\displaystyle f_{\theta }} labels it as 0 if θ > x {\displaystyle \theta >x} and labels it as 1 if θ < x {\displaystyle \theta x + 2 {\displaystyle \theta >x+2} , as (1,0) if θ ∈ [ x − 4 , x − 2 ) {\displaystyle \theta \in [x-4,x-2)} , as (1,1) if θ ∈ [ x − 2 , x ] {\displaystyle \theta \in [x-2,x]} , and as (0,1) if θ ∈ ( x , x + 2 ] {\displaystyle \theta \in (x,x+2]} . (b) It cannot shatter any set of three points. For every set of three numbers, if the smallest and the largest are labeled 1, then the middle one must also be labeled 1, so not all labelings are possible. f {\displaystyle f} is a straight line as a classification model on points in a two-dimensional plane (this is the model used by a perceptron). The line should separate positive data points from negative data points. There exist sets of 3 points that can indeed be shattered using this model (any 3 points that are not collinear can be shattered). However, no set of 4 points can be shattered: by Radon's theorem, any four points can be partitioned into two subsets with intersecting convex hulls, so it is not possible to separate one of these two subsets from the other. Thus, the VC dimension of this particular classifier is 3. It is important to remember that while one can choose any arrangement of points, the arrangement of those points cannot change when attempting to shatter for some label assignment. Note, only 3 of the 23 = 8 possible label assignments are shown for the three points. f {\displaystyle f} is a single-parametric sine classifier, i.e., for a certain parameter θ {\displaystyle \theta } , the classifier f θ {\displaystyle f_{\theta }} returns 1 if the input number x {\displaystyle x} has sin ⁡ ( θ x ) > 0 {\displaystyle \sin(\theta x)>0} and 0 otherwise. The VC dimension of f {\displaystyle f} is infinite, since it can shatter any finite subset of the set { 2 − m ∣ m ∈ N } {\displaystyle \{2^{-m}\mid m\in \mathbb {N} \}} . == Uses == === In statistical learning theory === The VC dimension can predict a probabilistic upper bound on the test error of a classification model. Vapnik proved that the probability of the test error (i.e., risk with 0–1 loss function) distancing from an upper bound (on data that is drawn i.i.d. from the same distribution as the training set) is given by: Pr ( test error ⩽ training error + 1 N [ D ( log ⁡ ( 2 N D ) + 1 ) − log ⁡ ( η 4 ) ] ) = 1 − η , {\displaystyle \Pr \left({\text{test error}}\leqslant {\text{training error}}+{\sqrt {{\frac {1}{N}}\left[D\left(\log \left({\tfrac {2N}{D}}\right)+1\right)-\log \left({\tfrac {\eta }{4}}\right)\right]}}\,\right)=1-\eta ,} where D {\displaystyle D} is the VC dimension of the classification model, 0 < η ⩽ 1 {\displaystyle 0<\eta \leqslant 1} , and N {\displaystyle N} is the size of the training set (restriction: this formula is valid when D ≪ N {\displaystyle D\ll N} . When D {\displaystyle D} is larger, the test-error may be much higher than the training-error. This is due to overfitting). The VC dimension also appears in sample-complexity bounds. A space of binary functions with VC dimension D {\displaystyle D} can be learned with: N = Θ ( D + ln ⁡ 1 δ ε 2 ) {\displaystyle N=\Theta \left({\frac {D+\ln {1 \over \delta }}{\varepsilon ^{2}}}\right)} samples, where ε {\displaystyle \varepsilon } is the learning error and δ {\displaystyle \delta } is the failure probability. Thus, the sample-complexity is a linear function of the VC dimension of the hypothesis space. === In computational geometry === The VC dimension is one of the critical parameters in the size of ε-nets, which determines the complexity of approximation algorithms based on them; range sets without finite VC dimension may not have finite ε-nets at all. == Bounds == The VC dimension of the dual set-family of C {\displaystyle {\mathcal {C}}} is strictly less than 2 vc ⁡ ( C ) + 1 {\displaystyle 2^{\operatorname {vc} ({\mathcal {C}})+1}} , and this is best possible. The VC dimension of a finite set-family C {\displaystyle {\mathcal {C}}} is at most log 2 ⁡ | C | {\displaystyle \log _{2}|{\mathcal {C}}|} . This is because | C ∩ X | ≤ | X | {\displaystyle |{\mathcal {C}}\cap X|\leq |X|} by definition. Given a set-fa

    Read more →
  • Language resource

    Language resource

    In linguistics and language technology, a language resource is a "[composition] of linguistic material used in the construction, improvement and/or evaluation of language processing applications, (...) in language and language-mediated research studies and applications." According to Bird & Simons (2003), this includes data, i.e. "any information that documents or describes a language, such as a published monograph, a computer data file, or even a shoebox full of handwritten index cards. The information could range in content from unanalyzed sound recordings to fully transcribed and annotated texts to a complete descriptive grammar", tools, i.e., "computational resources that facilitate creating, viewing, querying, or otherwise using language data", and advice, i.e., "any information about what data sources are reliable, what tools are appropriate in a given situation, what practices to follow when creating new data". The latter aspect is usually referred to as "best practices" or "(community) standards". In a narrower sense, language resource is specifically applied to resources that are available in digital form, and then, "encompassing (a) data sets (textual, multimodal/multimedia and lexical data, grammars, language models, etc.) in machine readable form, and (b) tools/technologies/services used for their processing and management". == Typology == As of May 2020, no widely used standard typology of language resources has been established (current proposals include the LREMap, METASHARE, and, for data, the LLOD classification). Important classes of language resources include data lexical resources, e.g., machine-readable dictionaries, linguistic corpora, i.e., digital collections of natural language data, linguistic data bases such as the Cross-Linguistic Linked Data collection, tools linguistic annotations and tools for creating such annotations in a manual or semiautomated fashion (e.g., tools for annotating interlinear glossed text such as Toolbox and FLEx, or other language documentation tools), applications for search and retrieval over such data (corpus management systems), for automated annotation (part-of-speech tagging, syntactic parsing, semantic parsing, etc.), metadata and vocabularies vocabularies, repositories of linguistic terminology and language metadata, e.g., MetaShare (for language resource metadata), the ISO 12620 data category registry (for linguistic features, data structures and annotations within a language resource), or the Glottolog database (identifiers for language varieties and bibliographical database). == Language resource publication, dissemination and creation == A major concern of the language resource community has been to develop infrastructures and platforms to present, discuss and disseminate language resources. Selected contributions in this regard include: a series of International Conferences on Language Resources and Evaluation (LREC), the European Language Resources Association (ELRA, EU-based), and the Linguistic Data Consortium (LDC, US-based), which represent commercial hosting and dissemination platforms for language resources, the Open Languages Archives Community (OLAC), which provides and aggregates language resource metadata, the Language Resources and Evaluation Journal (LREJ), the European Language Grid is a European platform for language technologies (eg services), data and resources. As for the development of standards and best practices for language resources, these are subject of several community groups and standardization efforts, including ISO Technical Committee 37: Terminology and other language and content resources (ISO/TC 37), developing standards for all aspects of language resources, W3C Community Group Best Practices for Multilingual Linked Open Data (BPMLOD), working on best practice recommendations for publishing language resources as Linked Data or in RDF, W3C Community Group Linked Data for Language Technology (LD4LT), working on linguistic annotations on the web and language resource metadata, W3C Community Group Ontology-Lexica (OntoLex), working on lexical resources, the Open Linguistics working group of the Open Knowledge Foundation, working on conventions for publishing and linking open language resources, developing the Linguistic Linked Open Data cloud, the Text Encoding Initiative (TEI), working on XML-based specifications for language resources and digitally edited text.

    Read more →
  • Softmax function

    Softmax function

    The softmax function, also known as softargmax or normalized exponential function, converts a tuple of K real numbers into a probability distribution over K possible outcomes. It is a generalization of the logistic function to multiple dimensions, and is used in multinomial logistic regression. The softmax function is often used as the last activation function of a neural network to normalize the output of a network to a probability distribution over predicted output classes. == Definition == The softmax function takes as input a tuple z of K real numbers, and normalizes it into a probability distribution consisting of K probabilities proportional to the exponentials of the input numbers. That is, prior to applying softmax, some tuple components could be negative, or greater than one; and might not sum to 1; but after applying softmax, each component will be in the interval ( 0 , 1 ) {\displaystyle (0,1)} , and the components will add up to 1, so that they can be interpreted as probabilities. Furthermore, the larger input components will correspond to larger probabilities. Formally, the standard (unit) softmax function σ : R K → ( 0 , 1 ) K {\displaystyle \sigma :\mathbb {R} ^{K}\to (0,1)^{K}} , where ⁠ K > 1 {\displaystyle K>1} ⁠, takes a tuple z = ( z 1 , … , z K ) ∈ R K {\displaystyle \mathbf {z} =(z_{1},\dotsc ,z_{K})\in \mathbb {R} ^{K}} and computes each component of vector σ ( z ) ∈ ( 0 , 1 ) K {\displaystyle \sigma (\mathbf {z} )\in (0,1)^{K}} with σ ( z ) i = e z i ∑ j = 1 K e z j . {\displaystyle \sigma (\mathbf {z} )_{i}={\frac {e^{z_{i}}}{\sum _{j=1}^{K}e^{z_{j}}}}\,.} In words, the softmax applies the standard exponential function to each element z i {\displaystyle z_{i}} of the input tuple z {\displaystyle \mathbf {z} } (consisting of K {\displaystyle K} real numbers), and normalizes these values by dividing by the sum of all these exponentials. The normalization ensures that the sum of the components of the output vector σ ( z ) {\displaystyle \sigma (\mathbf {z} )} is 1. The term "softmax" derives from the amplifying effects of the exponential on any maxima in the input tuple. For example, the standard softmax of ( 1 , 2 , 8 ) {\displaystyle (1,2,8)} is approximately ( 0.001 , 0.002 , 0.997 ) {\displaystyle (0.001,0.002,0.997)} , which amounts to assigning almost all of the total unit weight in the result to the position of the tuple's maximal element (of 8). In general, instead of e a different base b > 0 can be used. As above, if b > 1 then larger input components will result in larger output probabilities, and increasing the value of b will create probability distributions that are more concentrated around the positions of the largest input values. Conversely, if 0 < b < 1 then smaller input components will result in larger output probabilities, and decreasing the value of b will create probability distributions that are more concentrated around the positions of the smallest input values. Writing b = e β {\displaystyle b=e^{\beta }} or b = e − β {\displaystyle b=e^{-\beta }} (for real β) yields the expressions: σ ( z ) i = e β z i ∑ j = 1 K e β z j or σ ( z ) i = e − β z i ∑ j = 1 K e − β z j for i = 1 , … , K . {\displaystyle \sigma (\mathbf {z} )_{i}={\frac {e^{\beta z_{i}}}{\sum _{j=1}^{K}e^{\beta z_{j}}}}{\text{ or }}\sigma (\mathbf {z} )_{i}={\frac {e^{-\beta z_{i}}}{\sum _{j=1}^{K}e^{-\beta z_{j}}}}{\text{ for }}i=1,\dotsc ,K.} A value proportional to the reciprocal of β is sometimes referred to as the temperature: β = 1 / k T {\textstyle \beta =1/kT} , where k is typically 1 or the Boltzmann constant and T is the temperature. A higher temperature results in a more uniform output distribution (i.e. with higher entropy; it is "more random"), while a lower temperature results in a sharper output distribution, with one value dominating. In some fields, the base is fixed, corresponding to a fixed scale, while in others the parameter β (or T) is varied. The softmax function is a multiple-variable generalization of the logistic function. == Interpretations == === Smooth arg max === The Softmax function is a smooth approximation to the arg max function: the function whose value is the index of a tuple's largest element. The name "softmax" may be misleading. Softmax is not a smooth maximum (that is, a smooth approximation to the maximum function). The term "softmax" is also used for the closely related LogSumExp function, which is a smooth maximum. For this reason, some prefer the more accurate term "softargmax", though the term "softmax" is conventional in machine learning. This section uses the term "softargmax" for clarity. Formally, instead of considering the arg max as a function with categorical output 1 , … , n {\displaystyle 1,\dots ,n} (corresponding to the index), consider the arg max function with one-hot representation of the output (assuming there is a unique maximum arg): a r g m a x ⁡ ( z 1 , … , z n ) = ( y 1 , … , y n ) = ( 0 , … , 0 , 1 , 0 , … , 0 ) , {\displaystyle \operatorname {arg\,max} (z_{1},\,\dots ,\,z_{n})=(y_{1},\,\dots ,\,y_{n})=(0,\,\dots ,\,0,\,1,\,0,\,\dots ,\,0),} where the output coordinate y i = 1 {\displaystyle y_{i}=1} if and only if i {\displaystyle i} is the arg max of ( z 1 , … , z n ) {\displaystyle (z_{1},\dots ,z_{n})} , meaning z i {\displaystyle z_{i}} is the unique maximum value of ( z 1 , … , z n ) {\displaystyle (z_{1},\,\dots ,\,z_{n})} . For example, in this encoding a r g m a x ⁡ ( 1 , 5 , 10 ) = ( 0 , 0 , 1 ) , {\displaystyle \operatorname {arg\,max} (1,5,10)=(0,0,1),} since the third argument is the maximum. This can be generalized to multiple arg max values (multiple equal z i {\displaystyle z_{i}} being the maximum) by dividing the 1 between all max args; formally 1/k where k is the number of arguments assuming the maximum. For example, a r g m a x ⁡ ( 1 , 5 , 5 ) = ( 0 , 1 / 2 , 1 / 2 ) , {\displaystyle \operatorname {arg\,max} (1,\,5,\,5)=(0,\,1/2,\,1/2),} since the second and third argument are both the maximum. In case all arguments are equal, this is simply a r g m a x ⁡ ( z , … , z ) = ( 1 / n , … , 1 / n ) . {\displaystyle \operatorname {arg\,max} (z,\dots ,z)=(1/n,\dots ,1/n).} Points z with multiple arg max values are singular points (or singularities, and form the singular set) – these are the points where arg max is discontinuous (with a jump discontinuity) – while points with a single arg max are known as non-singular or regular points. With the last expression given in the introduction, softargmax is now a smooth approximation of arg max: as ⁠ β → ∞ {\displaystyle \beta \to \infty } ⁠, softargmax converges to arg max. There are various notions of convergence of a function; softargmax converges to arg max pointwise, meaning for each fixed input z as ⁠ β → ∞ {\displaystyle \beta \to \infty } ⁠, σ β ( z ) → a r g m a x ⁡ ( z ) . {\displaystyle \sigma _{\beta }(\mathbf {z} )\to \operatorname {arg\,max} (\mathbf {z} ).} However, softargmax does not converge uniformly to arg max, meaning intuitively that different points converge at different rates, and may converge arbitrarily slowly. In fact, softargmax is continuous, but arg max is not continuous at the singular set where two coordinates are equal, while the uniform limit of continuous functions is continuous. The reason it fails to converge uniformly is that for inputs where two coordinates are almost equal (and one is the maximum), the arg max is the index of one or the other, so a small change in input yields a large change in output. For example, σ β ( 1 , 1.0001 ) → ( 0 , 1 ) , {\displaystyle \sigma _{\beta }(1,\,1.0001)\to (0,1),} but σ β ( 1 , 0.9999 ) → ( 1 , 0 ) , {\displaystyle \sigma _{\beta }(1,\,0.9999)\to (1,\,0),} and σ β ( 1 , 1 ) = 1 / 2 {\displaystyle \sigma _{\beta }(1,\,1)=1/2} for all inputs: the closer the points are to the singular set ( x , x ) {\displaystyle (x,x)} , the slower they converge. However, softargmax does converge compactly on the non-singular set. Conversely, as ⁠ β → − ∞ {\displaystyle \beta \to -\infty } ⁠, softargmax converges to arg min in the same way, where here the singular set is points with two arg min values. In the language of tropical analysis, the softmax is a deformation or "quantization" of arg max and arg min, corresponding to using the log semiring instead of the max-plus semiring (respectively min-plus semiring), and recovering the arg max or arg min by taking the limit is called "tropicalization" or "dequantization". It is also the case that, for any fixed β, if one input ⁠ z i {\displaystyle z_{i}} ⁠ is much larger than the others relative to the temperature, T = 1 / β {\displaystyle T=1/\beta } , the output is approximately the arg max. For example, a difference of 10 is large relative to a temperature of 1: σ ( 0 , 10 ) := σ 1 ( 0 , 10 ) = ( 1 / ( 1 + e 10 ) , e 10 / ( 1 + e 10 ) ) ≈ ( 0.00005 , 0.99995 ) {\displaystyle \sigma (0,\,10):=\sigma _{1}(0,\,10)=\left(1/\left(1+e^{10}\right),\,e^{10}/\left(1+e^{10}\right)\right)\approx (0.00005

    Read more →
  • UIMA

    UIMA

    UIMA ( yoo-EE-mə), short for Unstructured Information Management Architecture, is an OASIS standard for content analytics, originally developed at IBM. It provides a component software architecture for the development, discovery, composition, and deployment of multi-modal analytics for the analysis of unstructured information and integration with search technologies. == Structure == The UIMA architecture can be thought of in four dimensions: It specifies component interfaces in an analytics pipeline. It describes a set of design patterns. It suggests two data representations: an in-memory representation of annotations for high-performance analytics and an XML representation of annotations for integration with remote web services. It suggests development roles allowing tools to be used by users with diverse skills. == Implementations and uses == Apache UIMA, a reference implementation of UIMA, is maintained by the Apache Software Foundation. UIMA is used in a number of software projects: IBM Research's Watson uses UIMA for analyzing unstructured data. The Clinical Text Analysis and Knowledge Extraction System (Apache cTAKES) is a UIMA-based system for information extraction from medical records. DKPro Core is a collection of reusable UIMA components for general-purpose natural language processing.

    Read more →
  • Quantum neural network

    Quantum neural network

    Quantum neural networks are computational neural network models which are based on the principles of quantum mechanics. The first ideas on quantum neural computation were published independently in 1995 by Subhash Kak and Ron Chrisley, engaging with the theory of quantum mind, which posits that quantum effects play a role in cognitive function. However, typical research in quantum neural networks involves combining classical artificial neural network models (which are widely used in machine learning for the important task of pattern recognition) with the advantages of quantum information in order to develop more efficient algorithms. One important motivation for these investigations is the difficulty to train classical neural networks, especially in big data applications. The hope is that features of quantum computing such as quantum parallelism or the effects of interference and entanglement can be used as resources. Since the technological implementation of a quantum computer is still in a premature stage, such quantum neural network models are mostly theoretical proposals that await their full implementation in physical experiments. Most Quantum neural networks are developed as feed-forward networks. Similar to their classical counterparts, this structure intakes input from one layer of qubits, and passes that input onto another layer of qubits. This layer of qubits evaluates this information and passes on the output to the next layer. Eventually the path leads to the final layer of qubits. The layers do not have to be of the same width, meaning they don't have to have the same number of qubits as the layer before or after it. This structure is trained on which path to take similar to classical artificial neural networks. This is discussed in a lower section. Quantum neural networks refer to three different categories: Quantum computer with classical data, classical computer with quantum data, and quantum computer with quantum data. == Examples == Quantum neural network research is still in its infancy, and a conglomeration of proposals and ideas of varying scope and mathematical rigor have been put forward. Most of them are based on the idea of replacing classical binary or McCulloch-Pitts neurons with a qubit (which can be called a "quron"), resulting in neural units that can be in a superposition of the state 'firing' and 'resting'. === Quantum perceptrons === A lot of proposals attempt to find a quantum equivalent for the perceptron unit from which neural nets are constructed. A problem is that nonlinear activation functions do not immediately correspond to the mathematical structure of quantum theory, since a quantum evolution is described by linear operations and leads to probabilistic observation. Ideas to imitate the perceptron activation function with a quantum mechanical formalism reach from special measurements to postulating non-linear quantum operators (a mathematical framework that is disputed). A direct implementation of the activation function using the circuit-based model of quantum computation has recently been proposed by Schuld, Sinayskiy and Petruccione based on the quantum phase estimation algorithm. === Quantum networks === At a larger scale, researchers have attempted to generalize neural networks to the quantum setting. One way of constructing a quantum neuron is to first generalise classical neurons and then generalising them further to make unitary gates. Interactions between neurons can be controlled quantumly, with unitary gates, or classically, via measurement of the network states. This high-level theoretical technique can be applied broadly, by taking different types of networks and different implementations of quantum neurons, such as photonically implemented neurons and quantum reservoir processor (quantum version of reservoir computing). Most learning algorithms follow the classical model of training an artificial neural network to learn the input-output function of a given training set and use classical feedback loops to update parameters of the quantum system until they converge to an optimal configuration. Learning as a parameter optimisation problem has also been approached by adiabatic models of quantum computing. Quantum neural networks can be applied to algorithmic design: given qubits with tunable mutual interactions, one can attempt to learn interactions following the classical backpropagation rule from a training set of desired input-output relations, taken to be the desired output algorithm's behavior. The quantum network thus 'learns' an algorithm. === Quantum associative memory === The first quantum associative memory algorithm was introduced by Dan Ventura and Tony Martinez in 1999. The authors do not attempt to translate the structure of artificial neural network models into quantum theory, but propose an algorithm for a circuit-based quantum computer that simulates associative memory. The memory states (in Hopfield neural networks saved in the weights of the neural connections) are written into a superposition, and a Grover-like quantum search algorithm retrieves the memory state closest to a given input. As such, this is not a fully content-addressable memory, since only incomplete patterns can be retrieved. The first truly content-addressable quantum memory, which can retrieve patterns also from corrupted inputs, was proposed by Carlo A. Trugenberger. Both memories can store an exponential (in terms of n qubits) number of patterns but can be used only once due to the no-cloning theorem and their destruction upon measurement. Trugenberger, however, has shown that his probabilistic model of quantum associative memory can be efficiently implemented and re-used multiples times for any polynomial number of stored patterns, a large advantage with respect to classical associative memories. === Classical neural networks inspired by quantum theory === A substantial amount of interest has been given to a "quantum-inspired" model that uses ideas from quantum theory to implement a neural network based on fuzzy logic. == Training == Quantum Neural Networks can be theoretically trained similarly to training classical/artificial neural networks. A key difference lies in communication between the layers of a neural networks. For classical neural networks, at the end of a given operation, the current perceptron copies its output to the next layer of perceptron(s) in the network. However, in a quantum neural network, where each perceptron is a qubit, this would violate the no-cloning theorem. A proposed generalized solution to this is to replace the classical fan-out method with an arbitrary unitary that spreads out, but does not copy, the output of one qubit to the next layer of qubits. Using this fan-out Unitary ( U f {\displaystyle U_{f}} ) with a dummy state qubit in a known state (Ex. | 0 ⟩ {\displaystyle |0\rangle } in the computational basis), also known as an Ancilla bit, the information from the qubit can be transferred to the next layer of qubits. This process adheres to the quantum operation requirement of reversibility. Using this quantum feed-forward network, deep neural networks can be executed and trained efficiently. A deep neural network is essentially a network with many hidden-layers, as seen in the sample model neural network above. Since the Quantum neural network being discussed uses fan-out Unitary operators, and each operator only acts on its respective input, only two layers are used at any given time. In other words, no Unitary operator is acting on the entire network at any given time, meaning the number of qubits required for a given step depends on the number of inputs in a given layer. Since Quantum Computers are notorious for their ability to run multiple iterations in a short period of time, the efficiency of a quantum neural network is solely dependent on the number of qubits in any given layer, and not on the depth of the network. === Cost functions === To determine the effectiveness of a neural network, a cost function is used, which essentially measures the proximity of the network's output to the expected or desired output. In a Classical Neural Network, the weights ( w {\displaystyle w} ) and biases ( b {\displaystyle b} ) at each step determine the outcome of the cost function C ( w , b ) {\displaystyle C(w,b)} . When training a Classical Neural network, the weights and biases are adjusted after each iteration, and given equation 1 below, where y ( x ) {\displaystyle y(x)} is the desired output and a out ( x ) {\displaystyle a^{\text{out}}(x)} is the actual output, the cost function is optimized when C ( w , b ) {\displaystyle C(w,b)} = 0. For a quantum neural network, the cost function is determined by measuring the fidelity of the outcome state ( ρ out {\displaystyle \rho ^{\text{out}}} ) with the desired outcome state ( ϕ out {\displaystyle \phi ^{\text{out}}} ), seen in Equation 2 below. In this case, the Unitary operators are adjusted after each it

    Read more →
  • Comparison of color models in computer graphics

    Comparison of color models in computer graphics

    This article provides introductory information about the RGB, HSV, and HSL color models from a computer graphics (web pages, images) perspective. An introduction to colors is also provided to support the main discussion. == Basics of color == === Primary colors and hue === First, "color" refers to the human brain's subjective interpretation of combinations of a narrow band of wavelengths of light. For this reason, the definition of "color" is not based on a strict set of physical phenomena. Therefore, even basic concepts like "primary colors" are not clearly defined. For example, traditional "Painter's Colors" use red, blue, and yellow as the primary colors, "Printer's Colors" use cyan, yellow, and magenta, and "Light Colors" use red, green, and blue. "Light colors", more formally known as additive colors, are formed by combining red, green, and blue light. This article refers to additive colors and refers to red, green, and blue as the primary colors. Hue is a term describing a pure color, that is, a color not modified by tinting or shading (see below). In additive colors, hues are formed by combining two primary colors. When two primary colors are combined in equal intensities, the result is a "secondary color". === Color wheel === A color wheel is a tool that provides a visual representation of the relationships between all possible hues. The primary colors are arranged around a circle at equal (120 degree) intervals. (Warning: Color wheels frequently depict "Painter's Colors" primary colors, which leads to a different set of hues than additive colors.) The illustration shows a simple color wheel based on the additive colors. Note that the position (top, right) of the starting color, typically red, is arbitrary, as is the order of green and blue (clockwise, counter-clockwise). The illustration also shows the secondary colors, yellow, cyan, and magenta, located halfway between (60 degrees) the primary colors. == Complementary color == The complement of a hue is the hue that is opposite it (180 degrees) on the color wheel. Using additive colors, mixing a hue and its complement in equal amounts produces white. === Tints and shades === The following discussion uses an illustration involving three projectors pointing to the same spot on a screen. Each projector is capable of generating one hue. The "intensities" of each projector are "matched" and can be equally adjusted from zero to full. (Note: "Intensity" is used here in the same sense as the RGB color model. The subject of matching, or "gamma correction", is beyond the level of this article.) A shade is produced by "dimming" a maximum chroma color. Painters refer to this as "adding black". In our illustration, one projector is set to full intensity, a second is set to some intensity between zero and full, and third is set to zero. "Dimming" is accomplished by decreasing each projector's intensity setting to the same fraction of its start setting. In the shade example, with any fully shaded hue, that all three projectors are set to zero intensity, resulting in black. A tint is produced by "lightening" a maximum chroma color. Painters refer to this as "adding white". In our illustration, one projector is set to full intensity, a second is set to some intensity between zero and full, and third is set to zero. "Lightening" is accomplished by increasing each projector's intensity setting by the same fraction from its start setting to full. In the tinting example, note that the third projector is now contributing. When the hue is fully lightened, all three projectors are each at full intensity, and the result is white. Note an attribute of the total intensity in the additive model. If full intensity for one projector is 1, then a primary color has a combined intensity of 1. A secondary color has a total intensity of 2. White has a total intensity of 3. Tinting, or "adding white", increases the total intensity of the hue. While this is simply a fact, the HSL model will take this fact into account in its design. === Tones === Tone is a general term, typically used by painters, to refer to the effects of reducing the "colorfulness" of a maximum chroma color; painters refer to it as "adding gray". Note that gray is not a color or even a single concept but refers to all the range of values between black and white where all three primary colors are equally represented. The general term is provided as more specific terms have conflicting definitions in different color models. Thus, shading takes a hue toward black, tinting takes a hue towards white, and tones cover the range between. == Choosing a color model == No one color model is necessarily "better" than another. Typically, the choice of a color model is dictated by external factors, such as a graphics tool or the need to specify colors according to the CSS2 or CSS3 standard. The following discussion only describes how the models function, centered on the concepts of hue, shade, tint, and tone. === RGB === The RGB model's approach to colors is important because: It directly reflects the physical properties of "Truecolor" displays As of 2011, most graphic cards define pixel values in terms of the colors red, green, and blue. The typical range of intensity values for each color, 0–255, is based on taking a binary number with 32 bits and breaking it up into four bytes of 8 bits each. 8 bits can hold a value from 0 to 255. The fourth byte is used to specify the "alpha", or the opacity, of the color. Opacity comes into play when layers with different colors are stacked. If the color in the top layer is less than fully opaque (alpha < 255), the color from underlying layers "shows through". In the RGB model, hues are represented by specifying one color as full intensity (255), a second color with a variable intensity, and the third color with no intensity (0). The following provides some examples using red as the full-intensity and green as the partial-intensity colors; blue is always zero: Shades are created by multiplying the intensity of each primary color by 1 minus the shade factor, in the range 0 to 1. A shade factor of 0 does nothing to the hue, a shade factor of 1 produces black: new intensity = current intensity (1 – shade factor) The following provides examples using orange: Tints are created by modifying each primary color as follows: the intensity is increased so that the difference between the intensity and full intensity (255) is decreased by the tint factor, in the range 0 to 1. A tint factor of 0 does nothing, a tint factor of 1 produces white: new intensity = current intensity + (255 – current intensity) tint factor The following provides examples using orange: Tones are created by applying both a shade and a tint. The order in which the two operations are performed does not matter, with the following restriction: when a tint operation is performed on a shade, the intensity of the dominant color becomes the "full intensity"; that is, the intensity value of the dominant color must be used in place of 255. The following provides examples using orange: === HSV === The HSV, or HSB, model describes colors in terms of hue, saturation, and value (brightness). Note that the range of values for each attribute is arbitrarily defined by various tools or standards. Be sure to determine the value ranges before attempting to interpret a value. Hue corresponds directly to the concept of hue in the Color Basics section. The advantages of using hue are The angular relationship between tones around the color circle is easily identified Shades, tints, and tones can be generated easily without affecting the hue Saturation corresponds directly to the concept of tint in the Color Basics section, except that full saturation produces no tint, while zero saturation produces white, a shade of gray, or black. Value corresponds directly to the concept of intensity in the Color Basics section. Pure colors are produced by specifying a hue with full saturation and value Shades are produced by specifying a hue with full saturation and less than full value Tints are produced by specifying a hue with less than full saturation and full value Tones are produced by specifying a hue and both less than full saturation and value White is produced by specifying zero saturation and full value, regardless of hue Black is produced by specifying zero value, regardless of hue or saturation Shades of gray are produced by specifying zero saturation and between zero and full value The advantage of HSV is that each of its attributes corresponds directly to the basic color concepts, which makes it conceptually simple. The perceived disadvantage of HSV is that the saturation attribute corresponds to tinting, so desaturated colors have increasing total intensity. For this reason, the CSS3 standard plans to support RGB and HSL but not HSV. === HSL === The HSL model describes colors in terms of hue, saturation, and lightness (also called luminance). (Note: the definition of sa

    Read more →
  • Accumulated local effects

    Accumulated local effects

    Accumulated local effects (ALE) is a machine learning interpretability method. == Concepts == ALE uses a conditional feature distribution as an input and generates augmented data, creating more realistic data than a marginal distribution. It ignores far out-of-distribution (outlier) values. Unlike partial dependence plots and marginal plots, ALE is not defeated in the presence of correlated predictors. It analyzes differences in predictions instead of averaging them by calculating the average of the differences in model predictions over the augmented data, instead of the average of the predictions themselves. == Example == Given a model that predicts house prices based on its distance from city center and size of the building area, ALE compares the differences of predictions of houses of different sizes. The result separates the impact of the size from otherwise correlated features. == Limitations == Defining evaluation windows is subjective. High correlations between features can defeat the technique. ALE requires more and more uniformly distributed observations than PDP so that the conditional distribution can be reliably determined. The technique may produce inadequate results if the data is highly sparse, which is more common with high-dimensional data (curse of dimensionality).

    Read more →
  • Randomized weighted majority algorithm

    Randomized weighted majority algorithm

    The randomized weighted majority algorithm is an algorithm in machine learning theory for aggregating expert predictions to a series of decision problems. It is a simple and effective method based on weighted voting which improves on the mistake bound of the deterministic weighted majority algorithm. In fact, in the limit, its prediction rate can be arbitrarily close to that of the best-predicting expert. == Example == Imagine that every morning before the stock market opens, we get a prediction from each of our "experts" about whether the stock market will go up or down. Our goal is to somehow combine this set of predictions into a single prediction that we then use to make a buy or sell decision for the day. The principal challenge is that we do not know which experts will give better or worse predictions. The RWMA gives us a way to do this combination such that our prediction record will be nearly as good as that of the single expert which, in hindsight, gave the most accurate predictions. == Motivation == In machine learning, the weighted majority algorithm (WMA) is a deterministic meta-learning algorithm for aggregating expert predictions. In pseudocode, the WMA is as follows: initialize all experts to weight 1 for each round: add each expert's weight to the option they predicted predict the option with the largest weighted sum multiply the weights of all experts who predicted wrongly by 1 2 {\displaystyle {\frac {1}{2}}} Suppose there are n {\displaystyle n} experts and the best expert makes m {\displaystyle m} mistakes. Then, the weighted majority algorithm (WMA) makes at most 2.4 ( log 2 ⁡ n + m ) {\displaystyle 2.4(\log _{2}n+m)} mistakes. This bound is highly problematic in the case of highly error-prone experts. Suppose, for example, the best expert makes a mistake 20% of the time; that is, in N = 100 {\displaystyle N=100} rounds using n = 10 {\displaystyle n=10} experts, the best expert makes m = 20 {\displaystyle m=20} mistakes. Then, the weighted majority algorithm only guarantees an upper bound of 2.4 ( log 2 ⁡ 10 + 20 ) ≈ 56 {\displaystyle 2.4(\log _{2}10+20)\approx 56} mistakes. As this is a known limitation of the weighted majority algorithm, various strategies have been explored in order to improve the dependence on m {\displaystyle m} . In particular, we can do better by introducing randomization. Drawing inspiration from the Multiplicative Weights Update Method algorithm, we will probabilistically make predictions based on how the experts have performed in the past. Similarly to the WMA, every time an expert makes a wrong prediction, we will decrement their weight. Mirroring the MWUM, we will then use the weights to make a probability distribution over the actions and draw our action from this distribution (instead of deterministically picking the majority vote as the WMA does). == Randomized weighted majority algorithm (RWMA) == The randomized weighted majority algorithm is an attempt to improve the dependence of the mistake bound of the WMA on m {\displaystyle m} . Instead of predicting based on majority vote, the weights, are used as probabilities for choosing the experts in each round and are updated over time (hence the name randomized weighted majority). Precisely, if w i {\displaystyle w_{i}} is the weight of expert i {\displaystyle i} , let W = ∑ i w i {\displaystyle W=\sum _{i}w_{i}} . We will follow expert i {\displaystyle i} with probability w i W {\displaystyle {\frac {w_{i}}{W}}} . This results in the following algorithm: initialize all experts to weight 1. for each round: add all experts' weights together to obtain the total weight W {\displaystyle W} choose expert i {\displaystyle i} randomly with probability w i W {\displaystyle {\frac {w_{i}}{W}}} predict as the chosen expert predicts multiply the weights of all experts who predicted wrongly by β {\displaystyle \beta } The goal is to bound the worst-case expected number of mistakes, assuming that the adversary has to select one of the answers as correct before we make our coin toss. This is a reasonable assumption in, for instance, the stock market example provided above: the variance of a stock price should not depend on the opinions of experts that influence private buy or sell decisions, so we can treat the price change as if it was decided before the experts gave their recommendations for the day. The randomized algorithm is better in the worst case than the deterministic algorithm (weighted majority algorithm): in the latter, the worst case was when the weights were split 50/50. But in the randomized version, since the weights are used as probabilities, there would still be a 50/50 chance of getting it right. In addition, generalizing to multiplying the weights of the incorrect experts by β < 1 {\displaystyle \beta <1} instead of strictly 1 2 {\displaystyle {\frac {1}{2}}} allows us to trade off between dependence on m {\displaystyle m} and log 2 ⁡ n {\displaystyle \log _{2}n} . This trade-off will be quantified in the analysis section. == Analysis == Let W t {\displaystyle W_{t}} denote the total weight of all experts at round t {\displaystyle t} . Also let F t {\displaystyle F_{t}} denote the fraction of weight placed on experts which predict the wrong answer at round t {\displaystyle t} . Finally, let N {\displaystyle N} be the total number of rounds in the process. By definition, F t {\displaystyle F_{t}} is the probability that the algorithm makes a mistake on round t {\displaystyle t} . It follows from the linearity of expectation that if M {\displaystyle M} denotes the total number of mistakes made during the entire process, E [ M ] = ∑ t = 1 N F t {\displaystyle E[M]=\sum _{t=1}^{N}F_{t}} . After round t {\displaystyle t} , the total weight is decreased by ( 1 − β ) F t W t {\displaystyle \ (1-\beta )F_{t}W_{t}} , since all weights corresponding to a wrong answer are multiplied by β < 1 {\displaystyle \ \beta <1} . It then follows that W t + 1 = W t ( 1 − ( 1 − β ) F t ) {\displaystyle W_{t+1}=W_{t}(1-(1-\beta )F_{t})} . By telescoping, since W 1 = n {\displaystyle W_{1}=n} , it follows that the total weight after the process concludes is On the other hand, suppose that m {\displaystyle \ m} is the number of mistakes made by the best-performing expert. At the end, this expert has weight β m {\displaystyle \ \beta ^{m}} . It follows, then, that the total weight is at least this much; in other words, W ≥ β m {\displaystyle \ W\geq \beta ^{m}} . This inequality and the above result imply Taking the natural logarithm of both sides yields Now, the Taylor series of the natural logarithm is In particular, it follows that ln ⁡ ( 1 − ( 1 − β ) F t ) < − ( 1 − β ) F t {\displaystyle \ \ln(1-(1-\beta )F_{t})<-(1-\beta )F_{t}} . Thus, Recalling that E [ M ] = ∑ t = 1 N F t {\displaystyle E[M]=\sum _{t=1}^{N}F_{t}} and rearranging, it follows that Now, as β → 1 {\displaystyle \beta \to 1} from below, the first constant tends to 1 {\displaystyle 1} ; however, the second constant tends to + ∞ {\displaystyle +\infty } . To quantify this tradeoff, define ε = 1 − β {\displaystyle \varepsilon =1-\beta } to be the penalty associated with getting a prediction wrong. Then, again applying the Taylor series of the natural logarithm, It then follows that the mistake bound, for small ε {\displaystyle \varepsilon } , can be written in the form ( 1 + ϵ 2 + O ( ε 2 ) ) m + ϵ − 1 ln ⁡ ( n ) {\displaystyle \ \left(1+{\frac {\epsilon }{2}}+O(\varepsilon ^{2})\right)m+\epsilon ^{-1}\ln(n)} . In English, the less that we penalize experts for their mistakes, the more that additional experts will lead to initial mistakes but the closer we get to capturing the predictive accuracy of the best expert as time goes on. In particular, given a sufficiently low value of ε {\displaystyle \varepsilon } and enough rounds, the randomized weighted majority algorithm can get arbitrarily close to the correct prediction rate of the best expert. In particular, as long as m {\displaystyle m} is sufficiently large compared to ln ⁡ ( n ) {\displaystyle \ln(n)} (so that their ratio is sufficiently small), we can assign we can obtain an upper bound on the number of mistakes equal to This implies that the "regret bound" on the algorithm (that is, how much worse it performs than the best expert) is sublinear, at O ( m ln ⁡ ( n ) ) {\displaystyle O({\sqrt {m\ln(n)}})} . == Revisiting the motivation == Recall that the motivation for the randomized weighted majority algorithm was given by an example where the best expert makes a mistake 20% of the time. Precisely, in N = 100 {\displaystyle N=100} rounds, with n = 10 {\displaystyle n=10} experts, where the best expert makes m = 20 {\displaystyle m=20} mistakes, the deterministic weighted majority algorithm only guarantees an upper bound of 2.4 ( log 2 ⁡ 10 + 20 ) ≈ 56 {\displaystyle 2.4(\log _{2}10+20)\approx 56} . By the analysis above, it follows that minimizing the number of worst-case expected mistakes is equivalent to minimizing the fun

    Read more →
  • Genotypic and phenotypic repair

    Genotypic and phenotypic repair

    Genotypic and phenotypic repair are optional components of an evolutionary algorithm (EA). An EA reproduces essential elements of biological evolution as a computer algorithm in order to solve demanding optimization or planning tasks, at least approximately. A candidate solution is represented by a - usually linear - data structure that plays the role of an individual's chromosome. New solution candidates are generated by mutation and crossover operators following the example of biology. These offspring may be defective, which is corrected or compensated for by genotypic or phenotypic repair. == Description == Genotypic repair, also known as genetic repair, is the removal or correction of impermissible entries in the chromosome that violate restrictions. In phenotypic repair, the corrections are only made in the genotype-phenotype mapping and the chromosome remains unchanged. Michalewicz wrote about the importance of restrictions in real-world applications: "In general, constraints are an integral part of the formulation of any problem". Restriction violations are application-specific and therefore it depends on the current problem whether and which type of repair is useful. They can usually also be treated by a correspondingly extended evaluation and it depends on the problem which measures are possible and which is the most suitable. If a phenotypic repair is feasible, then it is usually the most efficient compared to the other measures. A survey on repair methods used as constraint handling techniques can be found in. Violations of the range limits of genes should be avoided as far as possible by the formulation of the genome. If this is not possible or if restrictions within the search space defined by the genome are involved, their violations are usually handled by the evaluation. This can be done, for example, by penalty functions that lower the fitness. Repair is often also required for combinatorial tasks. The application of a 1- or n-point crossover operator can, for example, lead to genes being missing in one of the child genomes that are present in duplicate in the other. In this case, a suitable genotypic repair measure is to move the surplus genes to the other genome in a positional manner. The use of the aforementioned operators in combinatorial tasks has also proven to be useful in combination with crossover types specially developed for permutations, at least for certain problems. Particularly in combinatorial problems, it has been observed that genotypic repair can promote premature convergence to a suboptimum, but can also significantly accelerate a successful search. Studies on various tasks have shown that this is application-dependent. An effective measure to avoid premature convergence is generally the use of structured populations instead of the usual panmictic ones. Sequence restrictions play a role in many scheduling tasks, for example when it comes to planning workflows. If, for example, it is specified that step A must be carried out before step B and the gene of step B is located before the gene of A in the chromosome, then there is an impermissible gene sequence. This is because the scheduling operation of step B requires the planned end of step A for correct scheduling, but this is not yet scheduled at the time gene B is processed. The problem can be solved in two ways: The scheduling operation of step B is postponed until the gene from step A has been processed. The genome remains unchanged and the repair only influences the genotype-phenotype mapping. Since only the phenotype is changed, this is referred to as phenotypic repair. If, on the other hand, the gene of step B is moved behind the gene of step A, this is a genotypic repair. The same applies to the alternative shift of gene A in front of gene B. In this case, genotypic repair has the disadvantage that it prevents a meaningful restructuring of the gene sequence in the chromosome if this requires several intermediate steps (mutations) that at least partially violate restrictions.

    Read more →
  • Digital Darkroom

    Digital Darkroom

    Digital Darkroom was a graphics program for editing gray-scale photos, published by Silicon Beach Software for the Macintosh in 1987. It was programmed by Ed Bomke and Don Cone. Digital Darkroom was the first Macintosh program to incorporate a plug-in architecture. Silicon Beach and Ed Bomke are credited with having coined the term "plug-in". Another innovation of Digital Darkroom was the Magic Wand tool, which also appeared later in Photoshop. When Silicon Beach Software was acquired by Aldus Corporation, Digital Darkroom continued to be published by the Aldus Consumer Division, but was never updated to include color. The trademark "Digital Darkroom" was acquired by MicroFrontier in 1997 and used for a completely new image-editing program that does work with color. The software was acquired by Digimage Arts in 2002 and was sold for both Windows and Mac systems.

    Read more →
  • Linear discriminant analysis

    Linear discriminant analysis

    Linear discriminant analysis (LDA), normal discriminant analysis (NDA), canonical variates analysis (CVA), or discriminant function analysis is a generalization of Fisher's linear discriminant, a method used in statistics and other fields, to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification. LDA is closely related to analysis of variance (ANOVA) and regression analysis, which also attempt to express one dependent variable as a linear combination of other features or measurements. However, ANOVA uses categorical independent variables and a continuous dependent variable, whereas discriminant analysis has continuous independent variables and a categorical dependent variable (i.e. the class label). Logistic regression and probit regression are more similar to LDA than ANOVA is, as they also explain a categorical variable by the values of continuous independent variables. These other methods are preferable in applications where it is not reasonable to assume that the independent variables have a normal distribution, which is a fundamental assumption of the LDA method. LDA is also closely related to principal component analysis (PCA) and factor analysis in that they both look for linear combinations of variables which best explain the data. LDA explicitly attempts to model the difference between the classes of data. PCA, in contrast, does not take into account any difference in class, and factor analysis builds the feature combinations based on similarities rather than differences. Discriminant analysis is also different from factor analysis in that it is not an interdependence technique: a distinction between independent variables and dependent variables (also called criterion variables) must be made. LDA works when the measurements made on independent variables for each observation are continuous quantities. When dealing with categorical independent variables, the equivalent technique is discriminant correspondence analysis. Discriminant analysis is used when groups are known a priori (unlike in cluster analysis). Each case must have a score on one or more quantitative predictor measures, and a score on a group measure. In simple terms, discriminant function analysis is classification - the act of distributing things into groups, classes or categories of the same type. == History == The original dichotomous discriminant analysis was developed by Sir Ronald Fisher in 1936. It is different from an ANOVA or MANOVA, which is used to predict one (ANOVA) or multiple (MANOVA) continuous dependent variables by one or more independent categorical variables. Discriminant function analysis is useful in determining whether a set of variables is effective in predicting category membership. == LDA for two classes == Consider a set of observations x → {\displaystyle {\vec {x}}} (also called features, attributes, variables or measurements) for each sample of an object or event with known class y {\displaystyle y} . This set of samples is called the training set in a supervised learning context. The classification problem is then to find a good predictor for the class y {\displaystyle y} of any sample of the same distribution (not necessarily from the training set) given only an observation x → {\displaystyle {\vec {x}}} . LDA approaches the problem by assuming that the conditional probability density functions p ( x → | y = 0 ) {\displaystyle p({\vec {x}}|y=0)} and p ( x → | y = 1 ) {\displaystyle p({\vec {x}}|y=1)} are both the normal distribution with mean and covariance parameters ( μ → 0 , Σ 0 ) {\displaystyle \left({\vec {\mu }}_{0},\Sigma _{0}\right)} and ( μ → 1 , Σ 1 ) {\displaystyle \left({\vec {\mu }}_{1},\Sigma _{1}\right)} , respectively. Under this assumption, the Bayes-optimal solution is to predict points as being from the second class if the log of the likelihood ratios is bigger than some threshold T, so that: 1 2 ( x → − μ → 0 ) T Σ 0 − 1 ( x → − μ → 0 ) + 1 2 ln ⁡ | Σ 0 | − 1 2 ( x → − μ → 1 ) T Σ 1 − 1 ( x → − μ → 1 ) − 1 2 ln ⁡ | Σ 1 | > T {\displaystyle {\frac {1}{2}}({\vec {x}}-{\vec {\mu }}_{0})^{\mathrm {T} }\Sigma _{0}^{-1}({\vec {x}}-{\vec {\mu }}_{0})+{\frac {1}{2}}\ln |\Sigma _{0}|-{\frac {1}{2}}({\vec {x}}-{\vec {\mu }}_{1})^{\mathrm {T} }\Sigma _{1}^{-1}({\vec {x}}-{\vec {\mu }}_{1})-{\frac {1}{2}}\ln |\Sigma _{1}|\ >\ T} Without any further assumptions, the resulting classifier is referred to as quadratic discriminant analysis (QDA). LDA instead makes the additional simplifying homoscedasticity assumption (i.e. that the class covariances are identical, so Σ 0 = Σ 1 = Σ {\displaystyle \Sigma _{0}=\Sigma _{1}=\Sigma } ) and that the covariances have full rank. In this case, several terms cancel: x → T Σ 0 − 1 x → = x → T Σ 1 − 1 x → {\displaystyle {\vec {x}}^{\mathrm {T} }\Sigma _{0}^{-1}{\vec {x}}={\vec {x}}^{\mathrm {T} }\Sigma _{1}^{-1}{\vec {x}}} x → T Σ i − 1 μ → i = μ → i T Σ i − 1 x → {\displaystyle {\vec {x}}^{\mathrm {T} }{\Sigma _{i}}^{-1}{\vec {\mu }}_{i}={{\vec {\mu }}_{i}}^{\mathrm {T} }{\Sigma _{i}}^{-1}{\vec {x}}} because both sides are scalar and transpose to each other ( Σ i {\displaystyle \Sigma _{i}} is Hermitian) and the above decision criterion becomes a threshold on the dot product w → T x → > c {\displaystyle {\vec {w}}^{\mathrm {T} }{\vec {x}}>c} for some threshold constant c, where w → = Σ − 1 ( μ → 1 − μ → 0 ) {\displaystyle {\vec {w}}=\Sigma ^{-1}({\vec {\mu }}_{1}-{\vec {\mu }}_{0})} c = 1 2 w → T ( μ → 1 + μ → 0 ) {\displaystyle c={\frac {1}{2}}\,{\vec {w}}^{\mathrm {T} }({\vec {\mu }}_{1}+{\vec {\mu }}_{0})} This means that the criterion of an input x → {\displaystyle {\vec {x}}} being in a class y {\displaystyle y} is purely a function of this linear combination of the known observations. It is often useful to see this conclusion in geometrical terms: the criterion of an input x → {\displaystyle {\vec {x}}} being in a class y {\displaystyle y} is purely a function of projection of multidimensional-space point x → {\displaystyle {\vec {x}}} onto vector w → {\displaystyle {\vec {w}}} (thus, we only consider its direction). In other words, the observation belongs to y {\displaystyle y} if corresponding x → {\displaystyle {\vec {x}}} is located on a certain side of a hyperplane perpendicular to w → {\displaystyle {\vec {w}}} . The location of the plane is defined by the threshold c {\displaystyle c} . == Assumptions == The assumptions of discriminant analysis are the same as those for MANOVA. The analysis is quite sensitive to outliers and the size of the smallest group must be larger than the number of predictor variables. Multivariate normality: Independent variables are normal for each level of the grouping variable. Homogeneity of variance/covariance (homoscedasticity): Variances among group variables are the same across levels of predictors. Can be tested with Box's M statistic. It has been suggested, however, that linear discriminant analysis be used when covariances are equal, and that quadratic discriminant analysis may be used when covariances are not equal. Independence: Participants are assumed to be randomly sampled, and a participant's score on one variable is assumed to be independent of scores on that variable for all other participants. It has been suggested that discriminant analysis is relatively robust to slight violations of these assumptions, and it has also been shown that discriminant analysis may still be reliable when using dichotomous variables (where multivariate normality is often violated). == Discriminant functions == Discriminant analysis works by creating one or more linear combinations of predictors, creating a new latent variable for each function. These functions are called discriminant functions. The number of functions possible is either N g − 1 {\displaystyle N_{g}-1} where N g {\displaystyle N_{g}} = number of groups, or p {\displaystyle p} (the number of predictors), whichever is smaller. The first function created maximizes the differences between groups on that function. The second function maximizes differences on that function, but also must not be correlated with the previous function. This continues with subsequent functions with the requirement that the new function not be correlated with any of the previous functions. Given group j {\displaystyle j} , with R j {\displaystyle \mathbb {R} _{j}} sets of sample space, there is a discriminant rule such that if x ∈ R j {\displaystyle x\in \mathbb {R} _{j}} , then x ∈ j {\displaystyle x\in j} . Discriminant analysis then, finds “good” regions of R j {\displaystyle \mathbb {R} _{j}} to minimize classification error, therefore leading to a high percent correct classified in the classification table. Each function is given a discriminant score to determine how well it predicts group placement. Structure Corr

    Read more →
  • Farthest-first traversal

    Farthest-first traversal

    In computational geometry, the farthest-first traversal of a compact metric space is a sequence of points in the space, where the first point is selected arbitrarily and each successive point is as far as possible from the set of previously-selected points. The same concept can also be applied to a finite set of geometric points, by restricting the selected points to belong to the set or equivalently by considering the finite metric space generated by these points. For a finite metric space or finite set of geometric points, the resulting sequence forms a permutation of the points, also known as the greedy permutation. Every prefix of a farthest-first traversal provides a set of points that is widely spaced and close to all remaining points. More precisely, no other set of equally many points can be spaced more than twice as widely, and no other set of equally many points can be less than half as far to its farthest remaining point. In part because of these properties, farthest-point traversals have many applications, including the approximation of the traveling salesman problem and the metric k-center problem. They may be constructed in polynomial time, or (for low-dimensional Euclidean spaces) approximated in near-linear time. == Definition and properties == A farthest-first traversal is a sequence of points in a compact metric space, with each point appearing at most once. If the space is finite, each point appears exactly once, and the traversal is a permutation of all of the points in the space. The first point of the sequence may be any point in the space. Each point p after the first must have the maximum possible distance to the set of points earlier than p in the sequence, where the distance from a point to a set is defined as the minimum of the pairwise distances to points in the set. A given space may have many different farthest-first traversals, depending both on the choice of the first point in the sequence (which may be any point in the space) and on ties for the maximum distance among later choices. Farthest-point traversals may be characterized by the following properties. Fix a number k, and consider the prefix formed by the first k points of the farthest-first traversal of any metric space. Let r be the distance between the final point of the prefix and the other points in the prefix. Then this subset has the following two properties: All pairs of the selected points are at distance at least r from each other, and All points of the metric space are at distance at most r from the subset. Conversely any sequence having these properties, for all choices of k, must be a farthest-first traversal. These are the two defining properties of a Delone set, so each prefix of the farthest-first traversal forms a Delone set. == Applications == Rosenkrantz, Stearns & Lewis (1977) used the farthest-first traversal to define the farthest-insertion heuristic for the travelling salesman problem. This heuristic finds approximate solutions to the travelling salesman problem by building up a tour on a subset of points, adding one point at a time to the tour in the ordering given by a farthest-first traversal. To add each point to the tour, one edge of the previous tour is broken and replaced by a pair of edges through the added point, in the cheapest possible way. Although Rosenkrantz et al. prove only a logarithmic approximation ratio for this method, they show that in practice it often works better than other insertion methods with better provable approximation ratios. Later, the same sequence of points was popularized by Gonzalez (1985), who used it as part of greedy approximation algorithms for two problems in clustering, in which the goal is to partition a set of points into k clusters. One of the two problems that Gonzalez solve in this way seeks to minimize the maximum diameter of a cluster, while the other, known as the metric k-center problem, seeks to minimize the maximum radius, the distance from a chosen central point of a cluster to the farthest point from it in the same cluster. For instance, the k-center problem can be used to model the placement of fire stations within a city, in order to ensure that every address within the city can be reached quickly by a fire truck. For both clustering problems, Gonzalez chooses a set of k cluster centers by selecting the first k points of a farthest-first traversal, and then creates clusters by assigning each input point to the nearest cluster center. If r is the distance from the set of k selected centers to the next point at position k + 1 in the traversal, then with this clustering every point is within distance r of its center and every cluster has diameter at most 2r. However, the subset of k centers together with the next point are all at distance at least r from each other, and any k-clustering would put some two of these points into a single cluster, with one of them at distance at least r/2 from its center and with diameter at least r. Thus, Gonzalez's heuristic gives an approximation ratio of 2 for both clustering problems. Gonzalez's heuristic was independently rediscovered for the metric k-center problem by Dyer & Frieze (1985), who applied it more generally to weighted k-center problems. Another paper on the k-center problem from the same time, Hochbaum & Shmoys (1985), achieves the same approximation ratio of 2, but its techniques are different. Nevertheless, Gonzalez's heuristic, and the name "farthest-first traversal", are often incorrectly attributed to Hochbaum and Shmoys. For both the min-max diameter clustering problem and the metric k-center problem, these approximations are optimal: the existence of a polynomial-time heuristic with any constant approximation ratio less than 2 would imply that P = NP. As well as for clustering, the farthest-first traversal can also be used in another type of facility location problem, the max-min facility dispersion problem, in which the goal is to choose the locations of k different facilities so that they are as far apart from each other as possible. More precisely, the goal in this problem is to choose k points from a given metric space or a given set of candidate points, in such a way as to maximize the minimum pairwise distance between the selected points. Again, this can be approximated by choosing the first k points of a farthest-first traversal. If r denotes the distance of the kth point from all previous points, then every point of the metric space or the candidate set is within distance r of the first k − 1 points. By the pigeonhole principle, some two points of the optimal solution (whatever it is) must both be within distance r of the same point among these first k − 1 chosen points, and (by the triangle inequality) within distance 2r of each other. Therefore, the heuristic solution given by the farthest-first traversal is within a factor of two of optimal. Other applications of the farthest-first traversal include color quantization (clustering the colors in an image to a smaller set of representative colors), progressive scanning of images (choosing an order to display the pixels of an image so that prefixes of the ordering produce good lower-resolution versions of the whole image rather than filling in the image from top to bottom), point selection in the probabilistic roadmap method for motion planning, simplification of point clouds, generating masks for halftone images, hierarchical clustering, finding the similarities between polygon meshes of similar surfaces, choosing diverse and high-value observation targets for underwater robot exploration, fault detection in sensor networks, modeling phylogenetic diversity, matching vehicles in a heterogenous fleet to customer delivery requests, uniform distribution of geodetic observatories on the Earth's surface or of other types of sensor network, generation of virtual point lights in the instant radiosity computer graphics rendering method, and geometric range searching data structures. == Algorithms == === Greedy exact algorithm === The farthest-first traversal of a finite point set may be computed by a greedy algorithm that maintains the distance of each point from the previously selected points, performing the following steps: Initialize the sequence of selected points to the empty sequence, and the distances of each point to the selected points to infinity. While not all points have been selected, repeat the following steps: Scan the list of not-yet-selected points to find a point p that has the maximum distance from the selected points. Remove p from the not-yet-selected points and add it to the end of the sequence of selected points. For each remaining not-yet-selected point q, replace the distance stored for q by the minimum of its old value and the distance from p to q. For a set of n points, this algorithm takes O(n2) steps and O(n2) distance computations. === Approximations === A faster approximation algorithm, given by Har-Peled & Mendel (2006), applie

    Read more →