Shy Girl

Shy Girl

Shy Girl is a horror novel initially self-published in February 2025 by Mia Ballard. Publishing rights for the book were acquired by Hachette Book Group, which released the book in the United Kingdom in November 2025 and planned to publish it in the United States in 2026. Its US release was cancelled and its UK release was discontinued after it faced accusations of being created with generative AI. Ballard denied having personally used AI in the book's writing, claiming that a freelance editor had introduced AI-generated changes. She also stated that she would take legal action against the editor. == Premise == The novel follows Gia, a depressed woman with obsessive–compulsive disorder, who encounters a mysterious man named Nathan while looking for a sugar daddy to ease her financial troubles. Nathan offers to erase all of Gia's debts in exchange for her agreeing to live as his pet. Living like an animal convinces her that she is becoming an animal, making her behave like one. == Publication and cancellation == Shy Girl was first self-published online by Mia Ballard in February 2025. Marketing material described the book as a "buzzy BookTok sensation" and "bloody and unforgiving". The self-published edition of the book was highly successful and had over 4,900 ratings on Goodreads and an average score of 3.52 stars. In an interview, Ballard described her writing style as lyrical, feverish, and introspective, and stated she was more interested in "what it feels like to live inside a body" than in plot-driven storylines. Publishing rights were acquired by Hachette Book Group and it was published by its Wildfire imprint in the United Kingdom in November 2025. By March 2026, the book had sold 1,800 copies in the United Kingdom. A US release was planned for 2026 by the imprint Orbit Books. After the British publication, critics and readers began to make claims that the book appeared to have been written by generative AI. A January 2026 post on Reddit claimed that the book had many of the hallmarks of having been written with a large language model, and stated that it was "repulsive" that the book was accepted by Hachette. A two-and-a-half-hour video essay covering the book, titled "i'm pretty sure this book is ai slop", received 1.2 million views on YouTube by March 2026. In response, Hachette Book Group announced in March 2026 that it would cancel the book's US publication and discontinue its UK publication. It told The Wall Street Journal that it had made "a lengthy investigation" before deciding to cancel the book. Ballard told The New York Times that she had not used AI when writing the book, but that AI-generated elements were added by a freelance editor without her knowledge. She also stated that she could not elaborate on her claim because she was pursuing legal action against the editor. Writer Andrea Bartz opined that the situation "raises many concerns about trust, authenticity and publishing's readiness for a new, A.I.-assisted world", but that "readers made it abundantly clear they want books by humans, not machines".

Pixelmator

Pixelmator is a series of graphics editors developed by Apple for macOS, iOS, and iPadOS. Pixelmator apps leverage Apple-specific technologies such as CoreML and Metal. Pixelmator uses a proprietary format across their apps (.PXD), but supports editing a variety of file types including Photoshop, RAW, and WebP. == History == Pixelmator Team was founded in 2007 by Lithuanian brothers Saulius and Aidas Dailidė, and released Pixelmator (now Pixelmator Classic) 1.0 in September of the same year. The company resided in Vilnius, Lithuania. In November 2024, Pixelmator Team agreed to be acquired by Apple for an unknown monetary amount, which was completed on 11 February 2025, the company was later folded into Apple with its products coming under them fully. == Pixelmator Classic == Pixelmator Classic was the original version of Pixelmator released for Mac on 25 September 2007. It uses a palette-style interface with floating toolbars compared to Pixelmator Pro's single-window interface. It is no longer being updated and has been delisted from the Mac App Store. == Pixelmator iOS == Pixelmator for iOS launched on 23 October 2014 as an iPad-exclusive app with touch-optimized versions of Pixelmator's desktop features. In May 2015, Pixelmator for iOS 2.0 was released with support for the iPhone. Apple no longer updates Pixelmator for iOS as of 13 January 2026, shortly before the release of Pixelmator Pro for iPad. == Pixelmator Pro == Pixelmator Pro is an image, video, and vector editing software for macOS that launched on 29 November 2017. It was a paid upgrade for Pixelmator Classic users, featuring a redesigned interface, a graphics pipeline rewritten using Metal, Apple silicon support and a greater focus on ML/AI editing features. On 28 January 2026, Apple announced Apple Creator Studio, a subscription bundle for their professional software that contains Pixelmator Pro. They also brought Pixelmator Pro to iPad, shortly after discontinuing Pixelmator iOS. == Photomator == Photomator (formerly Pixelmator Photo) is a photo-oriented editing app which launched on iPad in 2019, on iOS in 2021, and macOS in 2022. After launching the macOS version, the app moved from a one-time purchase to a subscription; however, a lifetime license can still be purchased for $99. Photomator differentiates itself from other Pixelmator apps with features such as batch editing of full photoshoots and AI-powered color correction. Edits in Photomator are made on a single layer and are non-destructive.

Principal component analysis

Principal component analysis (PCA) is a linear dimensionality reduction technique with applications in exploratory data analysis, visualization and data preprocessing. The data are linearly transformed onto a new coordinate system such that the directions (principal components) capturing the largest variation in the data can be easily identified. The principal components of a collection of points in a real coordinate space are a sequence of p {\displaystyle p} unit vectors, where the i {\displaystyle i} -th vector is the direction of a line that best fits the data while being orthogonal to the first i − 1 {\displaystyle i-1} vectors. Here, a best-fitting line is defined as one that minimizes the average squared perpendicular distance from the points to the line. These directions (i.e., principal components) constitute an orthonormal basis in which different individual dimensions of the data are linearly uncorrelated. Many studies use the first two principal components in order to plot the data in two dimensions and to visually identify clusters of closely related data points. Principal component analysis has applications in many fields such as population genetics, microbiome studies, and atmospheric science. == Overview == When performing PCA, the first principal component of a set of p {\displaystyle p} variables is the derived variable formed as a linear combination of the original variables that explains the most variance. The second principal component explains the most variance in what is left once the effect of the first component is removed, and we may proceed through p {\displaystyle p} iterations until all the variance is explained. PCA is most commonly used when many of the variables are highly correlated with each other and it is desirable to reduce their number to an independent set. The first principal component can equivalently be defined as a direction that maximizes the variance of the projected data. The i {\displaystyle i} -th principal component can be taken as a direction orthogonal to the first i − 1 {\displaystyle i-1} principal components that maximizes the variance of the projected data. For either objective, it can be shown that the principal components are eigenvectors of the data's covariance matrix. Thus, the principal components are often computed by eigendecomposition of the data covariance matrix or singular value decomposition of the data matrix. PCA is the simplest of the true eigenvector-based multivariate analyses and is closely related to factor analysis. Factor analysis typically incorporates more domain-specific assumptions about the underlying structure and solves eigenvectors of a slightly different matrix. PCA is also related to canonical correlation analysis (CCA). CCA defines coordinate systems that optimally describe the cross-covariance between two datasets while PCA defines a new orthogonal coordinate system that optimally describes variance in a single dataset. Robust and L1-norm-based variants of standard PCA have also been proposed. == History == PCA was invented in 1901 by Karl Pearson, as an analogue of the principal axis theorem in mechanics; it was later independently developed and named by Harold Hotelling in the 1930s. Depending on the field of application, it is also named the discrete Karhunen–Loève transform (KLT) in signal processing, the Hotelling transform in multivariate quality control, proper orthogonal decomposition (POD) in mechanical engineering, singular value decomposition (SVD) of X (invented in the last quarter of the 19th century), eigenvalue decomposition (EVD) of XTX in linear algebra, factor analysis (for a discussion of the differences between PCA and factor analysis see Ch. 7 of Jolliffe's Principal Component Analysis), Eckart–Young theorem (Harman, 1960), or empirical orthogonal functions (EOF) in meteorological science (Lorenz, 1956), empirical eigenfunction decomposition (Sirovich, 1987), quasiharmonic modes (Brooks et al., 1988), spectral decomposition in noise and vibration, and empirical modal analysis in structural dynamics. == Intuition == PCA can be thought of as fitting a p-dimensional ellipsoid to the data, where each axis of the ellipsoid represents a principal component. If some axis of the ellipsoid is small, then the variance along that axis is also small. To find the axes of the ellipsoid, we must first center the values of each variable in the dataset on 0 by subtracting the mean of the variable's observed values from each of those values. These transformed values are used instead of the original observed values for each of the variables. Then, we compute the covariance matrix of the data and calculate the eigenvalues and corresponding eigenvectors of this covariance matrix. Then we must normalize each of the orthogonal eigenvectors to turn them into unit vectors. Once this is done, each of the mutually-orthogonal unit eigenvectors can be interpreted as an axis of the ellipsoid fitted to the data. This choice of basis will transform the covariance matrix into a diagonalized form, in which the diagonal elements represent the variance of each axis. The proportion of the variance that each eigenvector represents can be calculated by dividing the eigenvalue corresponding to that eigenvector by the sum of all eigenvalues. Biplots and scree plots (degree of explained variance) are used to interpret findings of the PCA. == Details == PCA is defined as an orthogonal linear transformation on a real inner product space that transforms the data to a new coordinate system such that the greatest variance by some scalar projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. Consider an n × p {\displaystyle n\times p} data matrix, X, with column-wise zero empirical mean (the sample mean of each column has been shifted to zero), where each of the n rows represents a different repetition of the experiment, and each of the p columns gives a particular kind of feature (say, the results from a particular sensor). Mathematically, the transformation is defined by a set of size l {\displaystyle l} (where l {\displaystyle l} is usually selected to be strictly less than p {\displaystyle p} to reduce dimensionality) of p {\displaystyle p} -dimensional vectors of weights or coefficients w ( k ) = ( w 1 , … , w p ) ( k ) {\displaystyle \mathbf {w} _{(k)}=(w_{1},\dots ,w_{p})_{(k)}} that map each row vector x ( i ) = ( x 1 , … , x p ) ( i ) {\displaystyle \mathbf {x} _{(i)}=(x_{1},\dots ,x_{p})_{(i)}} of X to a new vector of principal component scores t ( i ) = ( t 1 , … , t l ) ( i ) {\displaystyle \mathbf {t} _{(i)}=(t_{1},\dots ,t_{l})_{(i)}} , given by t k ( i ) = x ( i ) ⋅ w ( k ) f o r i = 1 , … , n k = 1 , … , l {\displaystyle {t_{k}}_{(i)}=\mathbf {x} _{(i)}\cdot \mathbf {w} _{(k)}\qquad \mathrm {for} \qquad i=1,\dots ,n\qquad k=1,\dots ,l} in such a way that the individual variables t 1 , … , t l {\displaystyle t_{1},\dots ,t_{l}} of t considered over the data set successively inherit the maximum possible variance from X, with each coefficient vector w constrained to be a unit vector. The above may equivalently be written in matrix form as T = X W {\displaystyle \mathbf {T} =\mathbf {X} \mathbf {W} } where T i k = t k ( i ) {\displaystyle {\mathbf {T} }_{ik}={t_{k}}_{(i)}} , X i j = x j ( i ) {\displaystyle {\mathbf {X} }_{ij}={x_{j}}_{(i)}} , and W j k = w j ( k ) {\displaystyle {\mathbf {W} }_{jk}={w_{j}}_{(k)}} . === First component === In order to maximize variance, the first weight vector w(1) thus has to satisfy w ( 1 ) = arg ⁡ max ‖ w ‖ = 1 { ∑ i ( t 1 ) ( i ) 2 } = arg ⁡ max ‖ w ‖ = 1 { ∑ i ( x ( i ) ⋅ w ) 2 } {\displaystyle \mathbf {w} _{(1)}=\arg \max _{\Vert \mathbf {w} \Vert =1}\,\left\{\sum _{i}(t_{1})_{(i)}^{2}\right\}=\arg \max _{\Vert \mathbf {w} \Vert =1}\,\left\{\sum _{i}\left(\mathbf {x} _{(i)}\cdot \mathbf {w} \right)^{2}\right\}} Equivalently, writing this in matrix form gives w ( 1 ) = arg ⁡ max ‖ w ‖ = 1 { ‖ X w ‖ 2 } = arg ⁡ max ‖ w ‖ = 1 { w T X T X w } {\displaystyle \mathbf {w} _{(1)}=\arg \max _{\left\|\mathbf {w} \right\|=1}\left\{\left\|\mathbf {Xw} \right\|^{2}\right\}=\arg \max _{\left\|\mathbf {w} \right\|=1}\left\{\mathbf {w} ^{\mathsf {T}}\mathbf {X} ^{\mathsf {T}}\mathbf {Xw} \right\}} Since w(1) has been defined to be a unit vector, it equivalently also satisfies w ( 1 ) = arg ⁡ max { w T X T X w w T w } {\displaystyle \mathbf {w} _{(1)}=\arg \max \left\{{\frac {\mathbf {w} ^{\mathsf {T}}\mathbf {X} ^{\mathsf {T}}\mathbf {Xw} }{\mathbf {w} ^{\mathsf {T}}\mathbf {w} }}\right\}} The quantity to be maximised can be recognised as a Rayleigh quotient. A standard result for a positive semidefinite matrix such as XTX is that the quotient's maximum possible value is the largest eigenvalue of the matrix, which occurs when w is the corresponding eigenvector. With w(1) found, the first principal component of a data vector

Variational autoencoder

In machine learning, a variational autoencoder (VAE) is an artificial neural network architecture introduced by Diederik P. Kingma and Max Welling in 2013. It is part of the families of probabilistic graphical models and variational Bayesian methods. In addition to being seen as an autoencoder neural network architecture, variational autoencoders can also be studied within the mathematical formulation of variational Bayesian methods, connecting a neural encoder network to its decoder through a probabilistic latent space (for example, as a multivariate Gaussian distribution) that corresponds to the parameters of a variational distribution. Thus, the encoder maps each point (such as an image) from a large complex dataset into a distribution within the latent space, rather than to a single point in that space. The decoder has the opposite function, which is to map from the latent space to the input space, again according to a distribution (although in practice, noise is rarely added during the decoding stage). By mapping a point to a distribution instead of a single point, the network can avoid overfitting the training data. Both networks are typically trained together with the usage of the reparameterization trick, although the variance of the noise model can be learned separately. Although this type of model was initially designed for unsupervised learning, its effectiveness has been proven for semi-supervised learning and supervised learning. == Overview of architecture and operation == A variational autoencoder is a generative model with a prior and noise distribution respectively. Usually such models are trained using the expectation-maximization meta-algorithm (e.g. probabilistic PCA, (spike & slab) sparse coding). Such a scheme optimizes a lower bound of the data likelihood, which is usually computationally intractable, and in doing so requires the discovery of q-distributions, or variational posteriors. These q-distributions are normally parameterized for each individual data point in a separate optimization process. However, variational autoencoders use a neural network as an amortized approach to jointly optimize across data points. In that way, the same parameters are reused for multiple data points, which can result in massive memory savings. The first neural network takes as input the data points themselves, and outputs parameters for the variational distribution. As it maps from a known input space to the low-dimensional latent space, it is called the encoder. The decoder is the second neural network of this model. It is a function that maps from the latent space to the input space, e.g. as the means of the noise distribution. It is possible to use another neural network that maps to the variance, however this can be omitted for simplicity. In such a case, the variance can be optimized with gradient descent. To optimize this model, one needs to know two terms: the "reconstruction error", and the Kullback–Leibler divergence (KL-D). Both terms are derived from the free energy expression of the probabilistic model, and therefore differ depending on the noise distribution and the assumed prior of the data, here referred to as p-distribution. For example, a standard VAE task such as IMAGENET is typically assumed to have a gaussianly distributed noise; however, tasks such as binarized MNIST require a Bernoulli noise. The KL-D from the free energy expression maximizes the probability mass of the q-distribution that overlaps with the p-distribution, which unfortunately can result in mode-seeking behaviour. The "reconstruction" term is the remainder of the free energy expression, and requires a sampling approximation to compute its expectation value. More recent approaches replace Kullback–Leibler divergence (KL-D) with various statistical distances, see "Statistical distance VAE variants" below. == Formulation == From the point of view of probabilistic modeling, one wants to maximize the likelihood of the data x {\displaystyle x} by their chosen parameterized probability distribution p θ ( x ) = p ( x | θ ) {\displaystyle p_{\theta }(x)=p(x|\theta )} . This distribution is usually chosen to be a Gaussian N ( x | μ , σ ) {\displaystyle N(x|\mu ,\sigma )} which is parameterized by μ {\displaystyle \mu } and σ {\displaystyle \sigma } respectively, and as a member of the exponential family it is easy to work with as a noise distribution. Simple distributions are easy enough to maximize, however distributions where a prior is assumed over the latents z {\displaystyle z} results in intractable integrals. Let us find p θ ( x ) {\displaystyle p_{\theta }(x)} via marginalizing over z {\displaystyle z} . p θ ( x ) = ∫ z p θ ( x , z ) d z , {\displaystyle p_{\theta }(x)=\int _{z}p_{\theta }({x,z})\,dz,} where p θ ( x , z ) {\displaystyle p_{\theta }({x,z})} represents the joint distribution under p θ {\displaystyle p_{\theta }} of the observable data x {\displaystyle x} and its latent representation or encoding z {\displaystyle z} . According to the chain rule, the equation can be rewritten as p θ ( x ) = ∫ z p θ ( x | z ) p θ ( z ) d z {\displaystyle p_{\theta }(x)=\int _{z}p_{\theta }({x|z})p_{\theta }(z)\,dz} In the vanilla variational autoencoder, z {\displaystyle z} is usually taken to be a finite-dimensional vector of real numbers, and p θ ( x | z ) {\displaystyle p_{\theta }({x|z})} to be a Gaussian distribution. Then p θ ( x ) {\displaystyle p_{\theta }(x)} is a mixture of Gaussian distributions. It is now possible to define the set of the relationships between the input data and its latent representation as Prior p θ ( z ) {\displaystyle p_{\theta }(z)} Likelihood p θ ( x | z ) {\displaystyle p_{\theta }(x|z)} Posterior p θ ( z | x ) {\displaystyle p_{\theta }(z|x)} Unfortunately, the computation of p θ ( z | x ) {\displaystyle p_{\theta }(z|x)} is expensive and in most cases intractable. To speed up the calculus to make it feasible, it is necessary to introduce a further function to approximate the posterior distribution as q ϕ ( z | x ) ≈ p θ ( z | x ) {\displaystyle q_{\phi }({z|x})\approx p_{\theta }({z|x})} with ϕ {\displaystyle \phi } defined as the set of real values that parametrize q {\displaystyle q} . This is sometimes called amortized inference, since by "investing" in finding a good q ϕ {\displaystyle q_{\phi }} , one can later infer z {\displaystyle z} from x {\displaystyle x} quickly without doing any integrals. In this way, the problem is to find a good probabilistic autoencoder, in which the conditional likelihood distribution p θ ( x | z ) {\displaystyle p_{\theta }(x|z)} is computed by the probabilistic decoder, and the approximated posterior distribution q ϕ ( z | x ) {\displaystyle q_{\phi }(z|x)} is computed by the probabilistic encoder. Parametrize the encoder as E ϕ {\displaystyle E_{\phi }} , and the decoder as D θ {\displaystyle D_{\theta }} . == Evidence lower bound (ELBO) == Like many deep learning approaches that use gradient-based optimization, VAEs require a differentiable loss function to update the network weights through backpropagation. For variational autoencoders, the idea is to jointly optimize the generative model parameters θ {\displaystyle \theta } to reduce the reconstruction error between the input and the output, and ϕ {\displaystyle \phi } to make q ϕ ( z | x ) {\displaystyle q_{\phi }({z|x})} as close as possible to p θ ( z | x ) {\displaystyle p_{\theta }(z|x)} . As reconstruction loss, mean squared error and cross entropy are often used. The Kullback–Leibler divergence D K L ( q ϕ ( z | x ) ∥ p θ ( z | x ) ) {\displaystyle D_{KL}(q_{\phi }({z|x})\parallel p_{\theta }({z|x}))} can be used as a loss function to squeeze q ϕ ( z | x ) {\displaystyle q_{\phi }({z|x})} under p θ ( z | x ) {\displaystyle p_{\theta }(z|x)} . This divergence loss expands to D K L ( q ϕ ( z | x ) ∥ p θ ( z | x ) ) = E z ∼ q ϕ ( ⋅ | x ) [ ln ⁡ q ϕ ( z | x ) p θ ( z | x ) ] = E z ∼ q ϕ ( ⋅ | x ) [ ln ⁡ q ϕ ( z | x ) p θ ( x ) p θ ( x , z ) ] = ln ⁡ p θ ( x ) + E z ∼ q ϕ ( ⋅ | x ) [ ln ⁡ q ϕ ( z | x ) p θ ( x , z ) ] . {\displaystyle {\begin{aligned}D_{KL}(q_{\phi }({z|x})\parallel p_{\theta }({z|x}))&=\mathbb {E} _{z\sim q_{\phi }(\cdot |x)}\left[\ln {\frac {q_{\phi }(z|x)}{p_{\theta }(z|x)}}\right]\\&=\mathbb {E} _{z\sim q_{\phi }(\cdot |x)}\left[\ln {\frac {q_{\phi }({z|x})p_{\theta }(x)}{p_{\theta }(x,z)}}\right]\\&=\ln p_{\theta }(x)+\mathbb {E} _{z\sim q_{\phi }(\cdot |x)}\left[\ln {\frac {q_{\phi }({z|x})}{p_{\theta }(x,z)}}\right].\end{aligned}}} Now, define the evidence lower bound (ELBO): L θ , ϕ ( x ) := E z ∼ q ϕ ( ⋅ | x ) [ ln ⁡ p θ ( x , z ) q ϕ ( z | x ) ] = ln ⁡ p θ ( x ) − D K L ( q ϕ ( ⋅ | x ) ∥ p θ ( ⋅ | x ) ) {\displaystyle L_{\theta ,\phi }(x):=\mathbb {E} _{z\sim q_{\phi }(\cdot |x)}\left[\ln {\frac {p_{\theta }(x,z)}{q_{\phi }({z|x})}}\right]=\ln p_{\theta }(x)-D_{KL}(q_{\phi }({\cdot |x})\parallel p_{\theta }({\cdot |x}))} Maximizing the ELBO θ ∗ , ϕ ∗ = argmax θ , ϕ L θ , ϕ ( x ) {\dis

Cultural algorithm

Cultural algorithms (CA) are a branch of evolutionary computation where there is a knowledge component that is called the belief space in addition to the population component. In this sense, cultural algorithms can be seen as an extension to a conventional genetic algorithm. Cultural algorithms were introduced by Reynolds (see references). == Belief space == The belief space of a cultural algorithm is divided into distinct categories. These categories represent different domains of knowledge that the population has of the search space. The belief space is updated after each iteration by the best individuals of the population. The best individuals can be selected using a fitness function that assesses the performance of each individual in population much like in genetic algorithms. === List of belief space categories === Normative knowledge A collection of desirable value ranges for the individuals in the population component e.g. acceptable behavior for the agents in population. Domain specific knowledge Information about the domain of the cultural algorithm problem is applied to. Situational knowledge Specific examples of important events - e.g. successful/unsuccessful solutions Temporal knowledge History of the search space - e.g. the temporal patterns of the search process Spatial knowledge Information about the topography of the search space == Population == The population component of the cultural algorithm is approximately the same as that of the genetic algorithm. == Communication protocol == Cultural algorithms require an interface between the population and belief space. The best individuals of the population can update the belief space via the update function. Also, the knowledge categories of the belief space can affect the population component via the influence function. The influence function can affect population by altering the genome or the actions of the individuals. == Pseudocode for cultural algorithms == Initialize population space (choose initial population) Initialize belief space (e.g. set domain specific knowledge and normative value-ranges) Repeat until termination condition is met Perform actions of the individuals in population space Evaluate each individual by using the fitness function Select the parents to reproduce a new generation of offspring Let the belief space alter the genome of the offspring by using the influence function Update the belief space by using the accept function (this is done by letting the best individuals to affect the belief space) == Applications == Various optimization problems Social simulation Real-parameter optimization

Dabbler

Dabbler is natural media drawing software for beginners. It was initially developed by Fractal Design Corporation. It is a simplified version of Fractal Design Painter, and included multimedia tutorials and a fullscreen interface. Dabbler was released as "Art Dabbler" after the MetaCreations merger, and rights were eventually transferred to Corel. Dabbler operating systems are Mac OS and Microsoft Windows.

Linear classifier

In machine learning, a linear classifier makes a classification decision for each object based on a linear combination of its features. A simpler definition is to say that a linear classifier is one whose decision boundaries are linear. Such classifiers work well for practical problems such as document classification, and more generally for problems with many variables (features), reaching accuracy levels comparable to non-linear classifiers while taking less time to train and use. == Definition == If the input feature vector to the classifier is a real vector x → {\displaystyle {\vec {x}}} , then the output score is y = f ( w → ⋅ x → ) = f ( ∑ j w j x j ) , {\displaystyle y=f({\vec {w}}\cdot {\vec {x}})=f\left(\sum _{j}w_{j}x_{j}\right),} where w → {\displaystyle {\vec {w}}} is a real vector of weights and f is a function that converts the dot product of the two vectors into the desired output. (In other words, w → {\displaystyle {\vec {w}}} is a one-form or linear functional mapping x → {\displaystyle {\vec {x}}} onto R.) The weight vector w → {\displaystyle {\vec {w}}} is learned from a set of labeled training samples. Often f is a threshold function, which maps all values of w → ⋅ x → {\displaystyle {\vec {w}}\cdot {\vec {x}}} above a certain threshold to the first class and all other values to the second class; e.g., f ( x ) = { 1 if w T ⋅ x > θ , 0 otherwise {\displaystyle f(\mathbf {x} )={\begin{cases}1&{\text{if }}\ \mathbf {w} ^{T}\cdot \mathbf {x} >\theta ,\\0&{\text{otherwise}}\end{cases}}} The superscript T indicates the transpose and θ {\displaystyle \theta } is a scalar threshold. A more complex f might give the probability that an item belongs to a certain class. For a two-class classification problem, one can visualize the operation of a linear classifier as splitting a high-dimensional input space with a hyperplane: all points on one side of the hyperplane are classified as "yes", while the others are classified as "no". A linear classifier is often used in situations where the speed of classification is an issue, since it is often the fastest classifier, especially when x → {\displaystyle {\vec {x}}} is sparse. Also, linear classifiers often work very well when the number of dimensions in x → {\displaystyle {\vec {x}}} is large, as in document classification, where each element in x → {\displaystyle {\vec {x}}} is typically the number of occurrences of a word in a document (see document-term matrix). In such cases, the classifier should be well-regularized. == Generative models vs. discriminative models == There are two broad classes of methods for determining the parameters of a linear classifier w → {\displaystyle {\vec {w}}} . They can be generative and discriminative models. Methods of the former model joint probability distribution, whereas methods of the latter model conditional density functions P ( c l a s s | x → ) {\displaystyle P({\rm {class}}|{\vec {x}})} . Examples of such algorithms include: Linear Discriminant Analysis (LDA)—assumes Gaussian conditional density models Naive Bayes classifier with multinomial or multivariate Bernoulli event models. The second set of methods includes discriminative models, which attempt to maximize the quality of the output on a training set. Additional terms in the training cost function can easily perform regularization of the final model. Examples of discriminative training of linear classifiers include: Logistic regression—maximum likelihood estimation of w → {\displaystyle {\vec {w}}} assuming that the observed training set was generated by a binomial model that depends on the output of the classifier. Perceptron—an algorithm that attempts to fix all errors encountered in the training set Fisher's Linear Discriminant Analysis—an algorithm (different than "LDA") that maximizes the ratio of between-class scatter to within-class scatter, without any other assumptions. It is in essence a method of dimensionality reduction for binary classification. Support vector machine—an algorithm that maximizes the margin between the decision hyperplane and the examples in the training set. Note: Despite its name, LDA does not belong to the class of discriminative models in this taxonomy. However, its name makes sense when we compare LDA to the other main linear dimensionality reduction algorithm: principal components analysis (PCA). LDA is a supervised learning algorithm that utilizes the labels of the data, while PCA is an unsupervised learning algorithm that ignores the labels. To summarize, the name is a historical artifact. Discriminative training often yields higher accuracy than modeling the conditional density functions. However, handling missing data is often easier with conditional density models. All of the linear classifier algorithms listed above can be converted into non-linear algorithms operating on a different input space φ ( x → ) {\displaystyle \varphi ({\vec {x}})} , using the kernel trick. === Discriminative training === Discriminative training of linear classifiers usually proceeds in a supervised way, by means of an optimization algorithm that is given a training set with desired outputs and a loss function that measures the discrepancy between the classifier's outputs and the desired outputs. Thus, the learning algorithm solves an optimization problem of the form arg ⁡ min w R ( w ) + C ∑ i = 1 N L ( y i , w T x i ) {\displaystyle {\underset {\mathbf {w} }{\arg \min }}\;R(\mathbf {w} )+C\sum _{i=1}^{N}L(y_{i},\mathbf {w} ^{\mathsf {T}}\mathbf {x} _{i})} where w is a vector of classifier parameters, L(yi, wTxi) is a loss function that measures the discrepancy between the classifier's prediction and the true output yi for the i'th training example, R(w) is a regularization function that prevents the parameters from getting too large (causing overfitting), and C is a scalar constant (set by the user of the learning algorithm) that controls the balance between the regularization and the loss function. Popular loss functions include the hinge loss (for linear SVMs) and the log loss (for linear logistic regression). If the regularization function R is convex, then the above is a convex problem. Many algorithms exist for solving such problems; popular ones for linear classification include (stochastic) gradient descent, L-BFGS, coordinate descent and Newton methods.