AI Code Generator Zzz

AI Code Generator Zzz — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Smoothing

    Smoothing

    In statistics and image processing, to smooth a data set is to create an approximating function that attempts to capture important patterns in the data while leaving out noise or other fine-scale structures and rapid phenomena. In smoothing, the data points of a signal are modified so that individual points higher than the adjacent points (presumably because of noise) are reduced, and points lower than the adjacent points are increased, leading to a smoother signal. Reducing noise by smoothing may aid data analysis in two notable ways: it can help uncover more meaningful information from the underlying data, such as trends, and it can provide analyses that are both flexible and robust. Many different algorithms are used in smoothing, most commonly binning, kernels, and local weighted regression. == Compared to curve fitting == Smoothing may be distinguished from the related and partially overlapping concept of curve fitting in the following ways: (1) curve fitting often involves the use of an explicit function form for the result, whereas the immediate results from smoothing are the "smoothed" values, with no later use made of a functional form if there is one; (2) the aim of smoothing is to give a general idea of relatively slow changes of value with little attention paid to the close matching of data values, while curve fitting concentrates on achieving as close a match as possible; (3) smoothing methods often have an associated tuning parameter that controls the extent of smoothing, whereas curve fitting will adjust any number of parameters of the function to obtain the "best" fit. == Linear smoothers == When the smoothed values can be written as a linear transformation of the observed values, the smoothing operation is known as a linear smoother; the matrix representing the transformation is known as a smoother matrix or hat matrix. The operation of applying such a matrix transformation is called convolution.
Thus the matrix is also called a convolution matrix or a convolution kernel. In the case of a simple series of data points (rather than a multi-dimensional image), the convolution kernel is a one-dimensional vector. == Algorithms == One of the most common algorithms is the "moving average", often used to try to capture important trends in repeated statistical surveys. In image processing and computer vision, smoothing ideas are used in scale space representations. The simplest smoothing algorithm is the "rectangular" or "unweighted sliding-average smooth". This method replaces each point in the signal with the average of "m" adjacent points, where "m" is a positive integer called the "smooth width". Usually m is an odd number. The triangular smooth is like the rectangular smooth except that it implements a weighted smoothing function. Some specific smoothing and filter types, with their respective uses, pros and cons are:
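As an illustration of the rectangular and triangular smooths described above, here is a minimal Python sketch that treats both as a convolution with a one-dimensional kernel. The function names (`smooth`, `rectangular_kernel`, `triangular_kernel`) and the choice to leave edge points unchanged are assumptions made for this example, not something the text prescribes.

```python
def smooth(signal, kernel):
    """Convolve a 1-D signal with a symmetric, odd-length kernel.

    Edge points, where the window would run past the data, are left
    unchanged (one of several possible edge conventions).
    """
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd")
    half = len(kernel) // 2
    total = sum(kernel)
    out = list(signal)
    for i in range(half, len(signal) - half):
        window = signal[i - half:i + half + 1]
        out[i] = sum(w * x for w, x in zip(kernel, window)) / total
    return out


def rectangular_kernel(m):
    """m equal weights: the unweighted sliding-average smooth."""
    return [1] * m


def triangular_kernel(m):
    """Weights rising to a peak at the centre, e.g. m=5 gives [1, 2, 3, 2, 1]."""
    half = m // 2
    return [half + 1 - abs(k - half) for k in range(m)]


# A made-up signal with a single noise spike.
noisy = [0, 0, 9, 0, 0, 0, 0]
rect = smooth(noisy, rectangular_kernel(3))  # spike spread evenly over 3 points
tri = smooth(noisy, triangular_kernel(5))    # spike spread with centre weighting
```

Dividing by the sum of the kernel keeps the overall level of the signal unchanged, so only the smooth width m and the kernel shape act as tuning parameters.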

  • Randomized weighted majority algorithm

    Randomized weighted majority algorithm

    The randomized weighted majority algorithm is an algorithm in machine learning theory for aggregating expert predictions to a series of decision problems. It is a simple and effective method based on weighted voting which improves on the mistake bound of the deterministic weighted majority algorithm. In fact, in the limit, its prediction rate can be arbitrarily close to that of the best-predicting expert. == Example == Imagine that every morning before the stock market opens, we get a prediction from each of our "experts" about whether the stock market will go up or down. Our goal is to somehow combine this set of predictions into a single prediction that we then use to make a buy or sell decision for the day. The principal challenge is that we do not know which experts will give better or worse predictions. The RWMA gives us a way to do this combination such that our prediction record will be nearly as good as that of the single expert which, in hindsight, gave the most accurate predictions. == Motivation == In machine learning, the weighted majority algorithm (WMA) is a deterministic meta-learning algorithm for aggregating expert predictions. In pseudocode, the WMA is as follows:
        initialize all experts to weight 1
        for each round:
            add each expert's weight to the option they predicted
            predict the option with the largest weighted sum
            multiply the weights of all experts who predicted wrongly by $\frac{1}{2}$
    Suppose there are $n$ experts and the best expert makes $m$ mistakes. Then, the weighted majority algorithm (WMA) makes at most $2.4(\log_2 n + m)$ mistakes. This bound is highly problematic in the case of highly error-prone experts. Suppose, for example, the best expert makes a mistake 20% of the time; that is, in $N = 100$ rounds using $n = 10$ experts, the best expert makes $m = 20$ mistakes.
Then, the weighted majority algorithm only guarantees an upper bound of $2.4(\log_2 10 + 20) \approx 56$ mistakes. As this is a known limitation of the weighted majority algorithm, various strategies have been explored in order to improve the dependence on $m$. In particular, we can do better by introducing randomization. Drawing inspiration from the multiplicative weights update method (MWUM), we will probabilistically make predictions based on how the experts have performed in the past. Similarly to the WMA, every time an expert makes a wrong prediction, we will decrease their weight. Mirroring the MWUM, we will then use the weights to make a probability distribution over the actions and draw our action from this distribution (instead of deterministically picking the majority vote as the WMA does). == Randomized weighted majority algorithm (RWMA) == The randomized weighted majority algorithm is an attempt to improve the dependence of the mistake bound of the WMA on $m$. Instead of predicting based on majority vote, the weights are used as probabilities for choosing the experts in each round and are updated over time (hence the name randomized weighted majority). Precisely, if $w_i$ is the weight of expert $i$, let $W = \sum_i w_i$. We will follow expert $i$ with probability $\frac{w_i}{W}$. This results in the following algorithm:
    initialize all experts to weight 1
    for each round:
        add all experts' weights together to obtain the total weight $W$
        choose expert $i$ randomly with probability $\frac{w_i}{W}$
        predict as the chosen expert predicts
        multiply the weights of all experts who predicted wrongly by $\beta$
The goal is to bound the worst-case expected number of mistakes, assuming that the adversary has to select one of the answers as correct before we make our coin toss. This is a reasonable assumption in, for instance, the stock market example provided above: the variance of a stock price should not depend on the opinions of experts that influence private buy or sell decisions, so we can treat the price change as if it was decided before the experts gave their recommendations for the day. The randomized algorithm is better in the worst case than the deterministic algorithm (weighted majority algorithm): in the latter, the worst case was when the weights were split 50/50. But in the randomized version, since the weights are used as probabilities, there would still be a 50/50 chance of getting it right. In addition, generalizing to multiplying the weights of the incorrect experts by $\beta < 1$ instead of strictly $\frac{1}{2}$ allows us to trade off between dependence on $m$ and $\log_2 n$. This trade-off will be quantified in the analysis section. == Analysis == Let $W_t$ denote the total weight of all experts at round $t$. Also let $F_t$ denote the fraction of weight placed on experts which predict the wrong answer at round $t$. Finally, let $N$ be the total number of rounds in the process. By definition, $F_t$ is the probability that the algorithm makes a mistake on round $t$.
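To make the algorithm and the quantities $W_t$ and $F_t$ concrete, the following is a minimal Python sketch rather than a reference implementation. The function name `rwma`, the synthetic experts, and the seeds are assumptions made for illustration. Alongside the mistakes actually made, it accumulates the sum of the $F_t$, which is the expected number of mistakes, and compares it with the bound $\frac{m\ln(1/\beta) + \ln n}{1-\beta}$ that the standard analysis of this algorithm gives.

```python
import math
import random


def rwma(expert_predictions, outcomes, beta=0.5, seed=0):
    """Run the randomized weighted majority algorithm.

    expert_predictions[t][i] is expert i's prediction in round t and
    outcomes[t] is the correct answer. Returns the mistakes actually made,
    the expected number of mistakes (the sum of F_t), and the final weights.
    """
    rng = random.Random(seed)
    n = len(expert_predictions[0])
    weights = [1.0] * n
    mistakes = 0
    expected_mistakes = 0.0
    for preds, truth in zip(expert_predictions, outcomes):
        total = sum(weights)  # W_t
        wrong = [p != truth for p in preds]
        # F_t: fraction of the weight on experts that are wrong this round.
        expected_mistakes += sum(w for w, bad in zip(weights, wrong) if bad) / total
        # Follow expert i with probability w_i / W_t.
        chosen = rng.choices(range(n), weights=weights)[0]
        mistakes += wrong[chosen]
        # Multiply the weight of every wrong expert by beta.
        weights = [w * beta if bad else w for w, bad in zip(weights, wrong)]
    return mistakes, expected_mistakes, weights


# Synthetic data (an assumption for this sketch): 10 experts, 100 rounds.
data_rng = random.Random(1)
N, n = 100, 10
outcomes = [data_rng.randint(0, 1) for _ in range(N)]
predictions = []
for t, truth in enumerate(outcomes):
    row = [truth if t % 5 else 1 - truth]                  # expert 0 errs every fifth round
    row += [data_rng.randint(0, 1) for _ in range(n - 1)]  # the others guess at random
    predictions.append(row)

beta = 0.5
made, expected, final_weights = rwma(predictions, outcomes, beta)
# m: mistakes of the best expert in hindsight.
m = min(sum(p[i] != y for p, y in zip(predictions, outcomes)) for i in range(n))
bound = (m * math.log(1 / beta) + math.log(n)) / (1 - beta)
```

The value `expected` does not depend on the random draws, only on the sequence of predictions and outcomes, which is why it can be compared directly with `bound`; the realised count `made` varies with the seed.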
It follows from the linearity of expectation that if $M$ denotes the total number of mistakes made during the entire process, $E[M] = \sum_{t=1}^{N} F_t$. After round $t$, the total weight is decreased by $(1-\beta) F_t W_t$, since all weights corresponding to a wrong answer are multiplied by $\beta < 1$. It then follows that $W_{t+1} = W_t (1 - (1-\beta) F_t)$. By telescoping, since $W_1 = n$, it follows that the total weight after the process concludes is $W = W_{N+1} = n \prod_{t=1}^{N} (1 - (1-\beta) F_t)$. On the other hand, suppose that $m$ is the number of mistakes made by the best-performing expert. At the end, this expert has weight $\beta^m$. It follows, then, that the total weight is at least this much; in other words, $W \geq \beta^m$. This inequality and the above result imply $n \prod_{t=1}^{N} (1 - (1-\beta) F_t) \geq \beta^m$. Taking the natural logarithm of both sides yields $\ln(n) + \sum_{t=1}^{N} \ln(1 - (1-\beta) F_t) \geq m \ln(\beta)$. Now, the Taylor series of the natural logarithm is $\ln(1-x) = -x - \frac{x^2}{2} - \frac{x^3}{3} - \cdots$. In particular, it follows that $\ln(1 - (1-\beta) F_t) < -(1-\beta) F_t$. Thus, $\ln(n) - (1-\beta) \sum_{t=1}^{N} F_t \geq m \ln(\beta)$. Recalling that $E[M] = \sum_{t=1}^{N} F_t$ and rearranging, it follows that $E[M] \leq \frac{\ln(1/\beta)}{1-\beta}\, m + \frac{1}{1-\beta} \ln(n)$. Now, as $\beta \to 1$ from below, the first constant tends to $1$; however, the second constant tends to $+\infty$. To quantify this tradeoff, define $\varepsilon = 1 - \beta$ to be the penalty associated with getting a prediction wrong.
Then, again applying the Taylor series of the natural logarithm, $\frac{\ln(1/\beta)}{1-\beta} = \frac{-\ln(1-\varepsilon)}{\varepsilon} = 1 + \frac{\varepsilon}{2} + O(\varepsilon^2)$. It then follows that the mistake bound, for small $\varepsilon$, can be written in the form $\left(1 + \frac{\varepsilon}{2} + O(\varepsilon^2)\right) m + \varepsilon^{-1} \ln(n)$. In English, the less that we penalize experts for their mistakes, the more that additional experts will lead to initial mistakes but the closer we get to capturing the predictive accuracy of the best expert as time goes on. In particular, given a sufficiently low value of $\varepsilon$ and enough rounds, the randomized weighted majority algorithm can get arbitrarily close to the correct prediction rate of the best expert. In particular, as long as $m$ is sufficiently large compared to $\ln(n)$ (so that their ratio is sufficiently small), we can assign $\varepsilon = \sqrt{\ln(n)/m}$ and obtain an upper bound on the number of mistakes equal to $m + \frac{3}{2}\sqrt{m \ln(n)} + O(\ln(n))$. This implies that the "regret bound" on the algorithm (that is, how much worse it performs than the best expert) is sublinear, at $O(\sqrt{m \ln(n)})$. == Revisiting the motivation == Recall that the motivation for the randomized weighted majority algorithm was given by an example where the best expert makes a mistake 20% of the time. Precisely, in $N = 100$ rounds, with $n = 10$ experts, where the best expert makes $m = 20$ mistakes, the deterministic weighted majority algorithm only guarantees an upper bound of $2.4(\log_2 10 + 20) \approx 56$. By the analysis above, it follows that minimizing the number of worst-case expected mistakes is equivalent to minimizing the fun

  • Nonlinear dimensionality reduction

    Nonlinear dimensionality reduction

    Nonlinear dimensionality reduction (NLDR), also known as manifold learning, is any of various related techniques that aim to project high-dimensional data, potentially existing across non-linear manifolds which cannot be adequately captured by linear decomposition methods, onto lower-dimensional latent manifolds, with the goal of either visualizing the data in the low-dimensional space, or learning the mapping (either from the high-dimensional space to the low-dimensional embedding or vice versa) itself. The techniques described below can be understood as generalizations of linear decomposition methods used for dimensionality reduction, such as singular value decomposition and principal component analysis. == Applications of NLDR == High dimensional data can be hard for machines to work with, requiring significant time and space for analysis. It also presents a challenge for humans, since it's hard to visualize or understand data in more than three dimensions. Reducing the dimensionality of a data set, while keeping its essential features relatively intact, can make algorithms more efficient and allow analysts to visualize trends and patterns. The reduced-dimensional representations of data are often referred to as "intrinsic variables". This description implies that these are the values from which the data was produced. For example, consider a dataset that contains images of a letter 'A', which has been scaled and rotated by varying amounts. Each image has 32×32 pixels. Each image can be represented as a vector of 1024 pixel values. Each row is a sample on a two-dimensional manifold in 1024-dimensional space (a Hamming space). The intrinsic dimensionality is two, because two variables (rotation and scale) were varied in order to produce the data. Information about the shape or look of a letter 'A' is not part of the intrinsic variables because it is the same in every instance. 
Nonlinear dimensionality reduction will discard the correlated information (the letter 'A') and recover only the varying information (rotation and scale). By comparison, if principal component analysis, which is a linear dimensionality reduction algorithm, is used to reduce this same dataset into two dimensions, the resulting values are not so well organized. This demonstrates that the high-dimensional vectors (each representing a letter 'A') that sample this manifold vary in a non-linear manner. It should be apparent, therefore, that NLDR has several applications in the field of computer-vision. For example, consider a robot that uses a camera to navigate in a closed static environment. The images obtained by that camera can be considered to be samples on a manifold in high-dimensional space, and the intrinsic variables of that manifold will represent the robot's position and orientation. Invariant manifolds are of general interest for model order reduction in dynamical systems. In particular, if there is an attracting invariant manifold in the phase space, nearby trajectories will converge onto it and stay on it indefinitely, rendering it a candidate for dimensionality reduction of the dynamical system. While such manifolds are not guaranteed to exist in general, the theory of spectral submanifolds (SSM) gives conditions for the existence of unique attracting invariant objects in a broad class of dynamical systems. Active research in NLDR seeks to unfold the observation manifolds associated with dynamical systems to develop modeling techniques. Some of the more prominent nonlinear dimensionality reduction techniques are listed below. == Important concepts == === Sammon's mapping === Sammon's mapping is one of the first and most popular NLDR techniques. 
=== Self-organizing map === The self-organizing map (SOM, also called Kohonen map) and its probabilistic variant generative topographic mapping (GTM) use a point representation in the embedded space to form a latent variable model based on a non-linear mapping from the embedded space to the high-dimensional space. These techniques are related to work on density networks, which also are based around the same probabilistic model. === Kernel principal component analysis === Perhaps the most widely used algorithm for dimensional reduction is kernel PCA. PCA begins by computing the covariance matrix of the $m \times n$ matrix $\mathbf{X}$: $C = \frac{1}{m} \sum_{i=1}^{m} \mathbf{x}_i \mathbf{x}_i^{\mathsf{T}}$. It then projects the data onto the first $k$ eigenvectors of that matrix. By comparison, KPCA begins by computing the covariance matrix of the data after being transformed into a higher-dimensional space: $C = \frac{1}{m} \sum_{i=1}^{m} \Phi(\mathbf{x}_i) \Phi(\mathbf{x}_i)^{\mathsf{T}}$. It then projects the transformed data onto the first $k$ eigenvectors of that matrix, just like PCA. It uses the kernel trick to factor away much of the computation, such that the entire process can be performed without actually computing $\Phi(\mathbf{x})$. Of course $\Phi$ must be chosen such that it has a known corresponding kernel. Unfortunately, it is not trivial to find a good kernel for a given problem, so KPCA does not yield good results with some problems when using standard kernels. For example, it is known to perform poorly with these kernels on the Swiss roll manifold. However, one can view certain other methods that perform well in such settings (e.g., Laplacian Eigenmaps, LLE) as special cases of kernel PCA by constructing a data-dependent kernel matrix.
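The kernel PCA procedure described above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not a production implementation: the helper names are hypothetical, the RBF kernel and its `gamma` value are arbitrary choices, and the eigenvectors are found by simple power iteration with deflation instead of a numerical library. The point it illustrates is that only kernel values $k(\mathbf{x}_i, \mathbf{x}_j)$ are ever computed, never $\Phi(\mathbf{x})$ itself.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel; gamma is a tuning choice, not prescribed."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def linear_kernel(x, y):
    """Plain dot product; with this kernel, kernel PCA reduces to PCA."""
    return sum(a * b for a, b in zip(x, y))


def centered_kernel_matrix(points, kernel):
    """Kernel matrix of the data after centring it in feature space."""
    m = len(points)
    K = [[kernel(p, q) for q in points] for p in points]
    row_mean = [sum(row) / m for row in K]
    total_mean = sum(row_mean) / m
    return [[K[i][j] - row_mean[i] - row_mean[j] + total_mean
             for j in range(m)] for i in range(m)]


def top_eigenpair(A, iterations=500):
    """Dominant eigenpair of a symmetric PSD matrix by power iteration."""
    m = len(A)
    v = [1.0 + 0.1 * i for i in range(m)]  # arbitrary non-constant start
    for _ in range(iterations):
        w = [sum(A[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    Av = [sum(A[i][j] * v[j] for j in range(m)) for i in range(m)]
    return sum(a * b for a, b in zip(Av, v)), v


def kernel_pca(points, kernel, k):
    """Project the training points onto the first k kernel principal components."""
    A = centered_kernel_matrix(points, kernel)
    m = len(points)
    columns = []
    for _ in range(k):
        lam, v = top_eigenpair(A)
        # The projection of training point i is sqrt(lambda) * v_i.
        columns.append([math.sqrt(max(lam, 0.0)) * x for x in v])
        # Deflate so the next pass finds the next eigenvector.
        A = [[A[i][j] - lam * v[i] * v[j] for j in range(m)] for i in range(m)]
    return [[col[i] for col in columns] for i in range(m)]


line = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]  # made-up data
linear_proj = kernel_pca(line, linear_kernel, k=1)
rbf_proj = kernel_pca(line, rbf_kernel, k=2)
```

With the linear kernel the result coincides, up to sign, with ordinary PCA on the centred data, which is a convenient sanity check before switching to a nonlinear kernel.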
KPCA has an internal model, so it can be used to map points onto its embedding that were not available at training time. === Principal curves and manifolds === Principal curves and manifolds give the natural geometric framework for nonlinear dimensionality reduction and extend the geometric interpretation of PCA by explicitly constructing an embedded manifold, and by encoding using standard geometric projection onto the manifold. This approach was originally proposed by Trevor Hastie in his 1984 thesis, which he formally introduced in 1989. This idea has been explored further by many authors. How to define the "simplicity" of the manifold is problem-dependent, however, it is commonly measured by the intrinsic dimensionality and/or the smoothness of the manifold. Usually, the principal manifold is defined as a solution to an optimization problem. The objective function includes a quality of data approximation and some penalty terms for the bending of the manifold. The popular initial approximations are generated by linear PCA and Kohonen's SOM. === Laplacian eigenmaps === Laplacian eigenmaps uses spectral techniques to perform dimensionality reduction. This technique relies on the basic assumption that the data lies in a low-dimensional manifold in a high-dimensional space. This algorithm cannot embed out-of-sample points, but techniques based on Reproducing kernel Hilbert space regularization exist for adding this capability. Such techniques can be applied to other nonlinear dimensionality reduction algorithms as well. Traditional techniques like principal component analysis do not consider the intrinsic geometry of the data. Laplacian eigenmaps builds a graph from neighborhood information of the data set. Each data point serves as a node on the graph and connectivity between nodes is governed by the proximity of neighboring points (using e.g. the k-nearest neighbor algorithm). 
The graph thus generated can be considered as a discrete approximation of the low-dimensional manifold in the high-dimensional space. Minimization of a cost function based on the graph ensures that points close to each other on the manifold are mapped close to each other in the low-dimensional space, preserving local distances. The eigenfunctions of the Laplace–Beltrami operator on the manifold serve as the embedding dimensions, since under mild conditions this operator has a countable spectrum that is a basis for square integrable functions on the manifold (compare to Fourier series on the unit circle manifold). Attempts to place Laplacian eigenmaps on solid theoretical ground have met with some success, as under certain nonrestrictive assumptions, the graph Laplacian matrix has been shown to converge to the Laplace–Beltrami operator as the number of points goes to infinity. === Isomap === Isomap is a combination of the Floyd–Warshall algorithm with classic Multidimensional Scaling (MDS). Classic MDS takes a matrix of pair-wise distances between all points and computes a position for each point. Isomap assumes that the pair-wise distances are only known between neighboring points, and uses the Floyd–Warshall algorithm to compute the pair-wise distances between all other points. This effectively estimates the full matrix of pair-wise geodesic distances between all of the points. Isomap th

  • Markov model

    Markov model

    In probability theory, a Markov model is a stochastic model used to model pseudo-randomly changing systems. It is assumed that future states depend only on the current state, not on the events that occurred before it (that is, it assumes the Markov property). Generally, this assumption enables reasoning and computation with the model that would otherwise be intractable. For this reason, in the fields of predictive modelling and probabilistic forecasting, it is desirable for a given model to exhibit the Markov property. == Introduction == Andrey Andreyevich Markov (14 June 1856 – 20 July 1922) was a Russian mathematician best known for his work on stochastic processes. A primary subject of his research later became known as the Markov chain. There are four common Markov models used in different situations, depending on whether every sequential state is observable or not, and whether the system is to be adjusted on the basis of observations made: == Markov chain == The simplest Markov model is the Markov chain. It models the state of a system with a random variable that changes through time. In this context, the Markov property indicates that the distribution for this variable depends only on the distribution of a previous state. An example use of a Markov chain is Markov chain Monte Carlo, which uses the Markov property to prove that a particular method for performing a random walk will sample from the joint distribution. == Hidden Markov model == A hidden Markov model is a Markov chain for which the state is only partially observable or noisily observable. In other words, observations are related to the state of the system, but they are typically insufficient to precisely determine the state. Several well-known algorithms for hidden Markov models exist. 
For example, given a sequence of observations, the Viterbi algorithm will compute the most-likely corresponding sequence of states, the forward algorithm will compute the probability of the sequence of observations, and the Baum–Welch algorithm will estimate the starting probabilities, the transition function, and the observation function of a hidden Markov model. One common use is for speech recognition, where the observed data is the speech audio waveform and the hidden state is the spoken text. In this example, the Viterbi algorithm finds the most likely sequence of spoken words given the speech audio. == Markov decision process == A Markov decision process is a Markov chain in which state transitions depend on the current state and an action vector that is applied to the system. Typically, a Markov decision process is used to compute a policy of actions that will maximize some utility with respect to expected rewards. == Partially observable Markov decision process == A partially observable Markov decision process (POMDP) is a Markov decision process in which the state of the system is only partially observed. POMDPs are known to be NP complete, but recent approximation techniques have made them useful for a variety of applications, such as controlling simple agents or robots. == Markov random field == A Markov random field, or Markov network, may be considered to be a generalization of a Markov chain in multiple dimensions. In a Markov chain, state depends only on the previous state in time, whereas in a Markov random field, each state depends on its neighbors in any of multiple directions. A Markov random field may be visualized as a field or graph of random variables, where the distribution of each random variable depends on the neighboring variables with which it is connected. 
More specifically, the joint distribution for any random variable in the graph can be computed as the product of the "clique potentials" of all the cliques in the graph that contain that random variable. Modeling a problem as a Markov random field is useful because it implies that the joint distributions at each vertex in the graph may be computed in this manner. == Hierarchical Markov models == Hierarchical Markov models can be applied to categorize human behavior at various levels of abstraction. For example, a series of simple observations, such as a person's location in a room, can be interpreted to determine more complex information, such as what task or activity the person is performing. Two kinds of hierarchical Markov models are the hierarchical hidden Markov model and the abstract hidden Markov model. Both have been used for behavior recognition, and certain conditional independence properties between different levels of abstraction in the model allow for faster learning and inference. == Tolerant Markov model == A tolerant Markov model (TMM) is a probabilistic-algorithmic Markov chain model. It assigns the probabilities according to a conditioning context that considers the last symbol, from the sequence to occur, as the most probable instead of the true occurring symbol. A TMM can model three different natures: substitutions, additions or deletions. Successful applications have been efficiently implemented in DNA sequence compression. == Markov-chain forecasting models == Markov chains have been used as a forecasting method for several topics, for example price trends, wind power and solar irradiance. The Markov-chain forecasting models utilize a variety of different settings, from discretizing the time series to hidden Markov models combined with wavelets and the Markov-chain mixture distribution model (MCM).
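As a small illustration of the simplest setting mentioned above, discretizing a time series and fitting a Markov chain to it, here is a hedged Python sketch. The price series is made up, the two-state "up"/"down" discretization is an assumption chosen for brevity, and the helper names are hypothetical.

```python
from collections import Counter, defaultdict


def discretize(series):
    """Label each change in a numeric series as 'up' or 'down'."""
    return ["up" if b > a else "down" for a, b in zip(series, series[1:])]


def fit_transition_matrix(states):
    """Estimate P(next state | current state) from one observed sequence."""
    counts = defaultdict(Counter)
    for current, nxt in zip(states, states[1:]):
        counts[current][nxt] += 1
    return {s: {t: c / sum(row.values()) for t, c in row.items()}
            for s, row in counts.items()}


def step(distribution, transitions):
    """Push a distribution over states one step forward in time."""
    out = defaultdict(float)
    for state, p in distribution.items():
        # A state never seen as a starting point contributes nothing here.
        for nxt, q in transitions.get(state, {}).items():
            out[nxt] += p * q
    return dict(out)


prices = [10, 11, 12, 11, 12, 13, 12, 11, 12, 13]  # made-up series
moves = discretize(prices)
P = fit_transition_matrix(moves)
tomorrow = step({moves[-1]: 1.0}, P)  # forecast from the last observed state
two_days = step(tomorrow, P)          # apply the chain again for a longer horizon
```

Because of the Markov property, the forecast uses only the most recent state (`moves[-1]`); the earlier history matters only through the estimated transition probabilities.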

  • Just This Once

    Just This Once

    Just This Once is a 1993 romance novel written in the style of Jacqueline Susann by a Macintosh IIcx computer named "Hal" in collaboration with its programmer, Scott French. French reportedly spent $40,000 and 8 years developing an artificial intelligence program to analyze Susann's works and attempt to create a novel that Susann might have written. A legal dispute between the estate of Jacqueline Susann and the publisher resulted in a settlement to split the profits, and the book was referenced in several legal journal articles about copyright laws. The book had two small print runs totaling 35,000 copies, receiving mixed reviews. == Creation == The novel's creation spanned the fields of artificial intelligence, expert systems, and natural language processing. Scott French first scanned and analyzed portions of two books by Jacqueline Susann, Valley of the Dolls and Once Is Not Enough, to determine constituents of Susann's writing style, which French stated was the most difficult task. This analysis extracted several hundred components including frequency and type of sexual acts and sentence structure. "Once you're there, the writer's style emerges, part of her actual personality comes out, and the computer can be programmed to make a story." French also created several thousand rules to govern tone, plotting, scenes, and characters. The text generated by Hal, the computer, was intended to mimic what Susann might have written, although the output required significant editing. French credits Hal's work with "almost 100% of the plot, 100% of the theme and style." French estimates that he wrote 10% of the prose, the computer Hal wrote about 25% of the prose, and the remaining two-thirds was more of a collaboration between the two. 
A typical scenario to write a scene would involve Hal asking questions that French would answer (for example, Hal might ask about the "cattiness factor" involved in a meeting between two key female characters, and French would reply with a range of 1 to 10), and the computer would then generate a few sentences to which French would make minor edits. The process would repeat for the next few sentences until the scene was written. == Legal issues == Jacqueline Susann's publisher was skeptical of the legality of Just This Once, although French doubted that an author's thought processes could be copyrighted. Susann's estate reportedly threatened to sue Scott French but the parties settled out of court; the settlement involved splitting profits between the parties but the terms of the settlement were not disclosed. The publication of Just This Once raised questions in the legal profession concerning how copyright law applies to computer-generated works derived from an analysis of other copyrighted works, and whether the generation of such works infringes on copyright. The publications on this topic suggested that the copyright laws of the time were ill-equipped to deal with computer-generated creative works. == Reception == The book's publisher Steven Shragis of Carol Group said of the novel, "I'm not going to say this is a great literary work, but it's every bit as good as anything out in this field, and better than an awful lot." The novel received some positive early reviews. In USA Today, novelist Thomas Gifford compared Just This Once to another novel in the same genre, American Star by Jackie Collins. Gifford concluded: "If you do like this stuff, you'd be much, much better off with the one written by the computer." The Dead Jackie Susann Quarterly declared that Susann "would be proud. Lots of money, sleaze, disease, death, oral sex, tragedy and the good girl gone bad." Other reviews were mixed. 
Publishers Weekly wrote, "If the books of Jacqueline Susann and Harold Robbins seem formulaic, this debut novel of sin and success in Las Vegas outdoes them all. And that, in a way, is the point.... All novelty rests in the conceit of computer authorship, not in the story itself." Library Journal stated "French invested eight years and $50,000 in a scheme to use artificial intelligence to fulfill his authentic, if dubious, desire to generate a trashy novel a la Jacqueline Susann. Shallow, beautiful-people characters are flatly conceived and randomly accessed in a formulaic plot ... a sexy, boring morality tale. Of possible interest to computer buffs for its use of Expert Systems and the virtual promise of more worthy possibilities; others should read Susann." Kirkus Reviews wrote: "The deal here is that author French is not the author, he's just the midwife, having allegedly programmed his computer to write about our times just the way Susann would... almost perfectly capturing glamorous Jackie's turgid but E-Z reading prose style and ultrareliable mix of sex, glitz, dope 'n' despair.... One wonders, though, if French's tale spinning PC will do as well on the talkshows as Jackie did. The computer weenies have been trying to tell us for years, garbage in-garbage out."

  • Silhouette (clustering)

    Silhouette (clustering)

    Silhouette is a method of interpretation and validation of consistency within clusters of data. The technique provides a succinct graphical representation of how well each object has been classified. It was proposed by Belgian statistician Peter Rousseeuw in 1987. The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette value ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters. A clustering with an average silhouette width of over 0.7 is considered to be "strong", a value over 0.5 "reasonable", and over 0.25 "weak". However, with an increasing dimensionality of the data, it becomes difficult to achieve such high values because of the curse of dimensionality, as the distances become more similar. The silhouette score is specialized for measuring cluster quality when the clusters are convex-shaped, and may not perform well if the data clusters have irregular shapes or are of varying sizes. The silhouette value can be calculated with any distance metric, such as Euclidean distance or Manhattan distance. == Definition == Assume the data have been clustered via any technique, such as k-medoids or k-means, into $k$ clusters.
For data point $i \in C_i$ (data point $i$ in the cluster $C_i$), calculate $a(i)$, the average distance that $i$ is from all other points in that cluster:

$$a(i) = \frac{1}{|C_i| - 1} \sum_{j \in C_i,\, i \neq j} d(i, j)$$

where $|C_i|$ is the number of points belonging to cluster $C_i$, and $d(i, j)$ is the distance between data points $i$ and $j$ in the cluster $C_i$ (we divide by $|C_i| - 1$ because the distance $d(i, i)$ is not included in the sum). $a(i)$ can be interpreted as a measure of how well $i$ is assigned to its cluster (the smaller the value, the better the assignment).

We then define the mean dissimilarity of point $i$ to some cluster $C_j$ as the mean of the distance from $i$ to all points in $C_j$ (where $C_j \neq C_i$). For each data point $i \in C_i$, we now define $b(i)$ as the average distance between $i$ and the points in the closest cluster (hence: "min") that $i$ does not belong to:

$$b(i) = \min_{j \neq i} \frac{1}{|C_j|} \sum_{l \in C_j} d(i, l)$$

The cluster with the smallest mean dissimilarity is said to be the "neighboring cluster" of $i$ because it is the next best fit cluster for point $i$.
We now define a silhouette (value) of one data point $i$:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}, \quad \text{if } |C_i| > 1$$

and

$$s(i) = 0, \quad \text{if } |C_i| = 1,$$

which can also be written as

$$s(i) = \begin{cases} 1 - \dfrac{a(i)}{b(i)}, & \text{if } a(i) < b(i) \\ 0, & \text{if } a(i) = b(i) \\ \dfrac{b(i)}{a(i)} - 1, & \text{if } a(i) > b(i) \end{cases}$$

From the above definition, $s(i)$ is bounded to the interval $[-1, 1]$, i.e. $-1 \leq s(i) \leq 1$. Note that $a(i)$ is not clearly defined for clusters with size = 1, in which case we set $s(i) = 0$. This choice is arbitrary, but neutral in the sense that it is at the midpoint of the bounds, −1 and 1.

For $s(i)$ to be close to 1 we require $a(i) \ll b(i)$. As $a(i)$ is a measure of how dissimilar $i$ is to its own cluster, a small value means it is well matched. Furthermore, a large $b(i)$ implies that $i$ is badly matched to its neighbouring cluster. Thus an $s(i)$ close to 1 means that the data is appropriately clustered. If $s(i)$ is close to −1, then by the same logic we see that $i$ would be more appropriate if it was clustered in its neighbouring cluster. An $s(i)$ near zero means that the datum is on the border of two natural clusters.

The mean $s(i)$ over all points of a cluster is a measure of how tightly grouped all the points in the cluster are.
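The definitions of a(i), b(i) and s(i) above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not a reference implementation: the function name `silhouette_values`, the toy data and the choice of Euclidean distance are assumptions made for the example, and the code assumes at least two clusters and no cluster made up entirely of identical points.

```python
import numpy as np

def silhouette_values(X, labels):
    """Return s(i) for every row of X (Euclidean distance assumed)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # All pairwise distances d(i, j), shape (N, N).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    clusters = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # singleton cluster: s(i) = 0 by convention
        # a(i): d(i, i) = 0, so dividing the full row sum by |C_i| - 1
        # gives the mean distance to the *other* points of the cluster.
        a = dist[i, own].sum() / (n_own - 1)
        # b(i): smallest mean distance to the points of any other cluster.
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# Hypothetical toy data: two well-separated pairs of points.
X = [[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]]
labels = [0, 0, 1, 1]
s = silhouette_values(X, labels)
print(s, s.mean())
```

Because the full distance matrix is built up front, the sketch needs memory quadratic in the number of points, so it is only suitable for small data sets.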
Thus the mean $s(i)$ over all data of the entire dataset is a measure of how appropriately the data have been clustered. If there are too many or too few clusters, as may occur when a poor choice of $k$ is used in the clustering algorithm (e.g., k-means), some of the clusters will typically display much narrower silhouettes than the rest. Thus silhouette plots and means may be used to determine the natural number of clusters within a dataset. One can also increase the likelihood of the silhouette being maximized at the correct number of clusters by re-scaling the data using feature weights that are cluster specific.

Kaufman et al. introduced the term silhouette coefficient for the maximum value of the mean $s(i)$ over all data of the entire dataset, i.e.,

$$SC = \max_k \tilde{s}(k),$$

where $\tilde{s}(k)$ represents the mean $s(i)$ over all data of the entire dataset for a specific number of clusters $k$. The silhouette coefficient describes the best possible clustering for a given number of clusters, as measured by the highest average silhouette score for all points in the dataset.

== Simplified and medoid silhouette ==

Computing the silhouette coefficient needs all $\mathcal{O}(N^2)$ pairwise distances, making this evaluation much more costly than clustering with k-means.
For a clustering with centers $\mu_{C_I}$ for each cluster $C_I$, we can use the following simplified Silhouette for each point $i \in C_I$ instead, which can be computed using only $\mathcal{O}(Nk)$ distances:

$$a'(i) = d(i, \mu_{C_I}) \quad \text{and} \quad b'(i) = \min_{C_J \neq C_I} d(i, \mu_{C_J}),$$

which has the additional benefit that $a'(i)$ is always defined. The simplified silhouette and simplified silhouette coefficient are then defined accordingly:

$$s'(i) = \frac{b'(i) - a'(i)}{\max\{a'(i), b'(i)\}}$$

$$SC' = \max_k \frac{1}{N} \sum_i s'(i).$$

If the cluster centers are medoids (as in k-medoids clustering) instead of arithmetic means (as in k-means clustering), this is also called the medoid-based silhouette or medoid silhouette. If every object is assigned to the nearest medoid (as in k-medoids clustering), we know that $a'(i) \leq b'(i)$, and hence

$$s'(i) = \frac{b'(i) - a'(i)}{b'(i)} = 1 - \frac{a'(i)}{b'(i)}.$$

== Silhouette clustering ==

Instead of using the average silhouette to evaluate a clustering obtained from, e.g., k-medoids or k-means, we can try to directly find a solution that maximizes the Silhouette. We do not have a closed form solution to maximize this, but it will usually be best to assign points to the nearest cluster as done by these methods. Van der Laan et al. proposed to adapt the standard algorithm for k-medoids, PAM, for this purpose and call this algorithm PAMSIL:

1. Choose initial medoids by using PAM
2. Compute the average silhouette of this initial solution
3. For each pair of a medoid m and a non-medoid x
   • swap m and x
   • compute the average silhouette of the resulting solution
   • remember the best swap
   • un-swap m and x for the next iteration
4. Perform the best swap and return to

    Read more →
  • Rprop

    Rprop

    Rprop, short for resilient backpropagation, is a learning heuristic for supervised learning in feedforward artificial neural networks. This is a first-order optimization algorithm. This algorithm was created by Martin Riedmiller and Heinrich Braun in 1992. Similarly to the Manhattan update rule, Rprop takes into account only the sign of the partial derivative over all patterns (not the magnitude), and acts independently on each "weight". For each weight, if there was a sign change of the partial derivative of the total error function compared to the last iteration, the update value for that weight is multiplied by a factor η−, where η− < 1. If the last iteration produced the same sign, the update value is multiplied by a factor of η+, where η+ > 1. The update values are calculated for each weight in the above manner, and finally each weight is changed by its own update value, in the opposite direction of that weight's partial derivative, so as to minimise the total error function. η+ is empirically set to 1.2 and η− to 0.5.

    Rprop can result in very large weight increments or decrements if the gradients are large, which is a problem when using mini-batches as opposed to full batches. RMSprop addresses this problem by keeping the moving average of the squared gradients for each weight and dividing the gradient by the square root of the mean square. RPROP is a batch update algorithm. Next to the cascade correlation algorithm and the Levenberg–Marquardt algorithm, Rprop is one of the fastest weight update mechanisms.

    == Variations ==

    Martin Riedmiller developed three algorithms, all named RPROP. Igel and Hüsken assigned names to them and added a new variant:

      • RPROP+ is defined in "A Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm".
      • RPROP− is defined in "Advanced Supervised Learning in Multi-layer Perceptrons – From Backpropagation to Adaptive Learning Algorithms". Backtracking is removed from RPROP+.
      • iRPROP− is defined in "Rprop – Description and Implementation Details" and was reinvented by Igel and Hüsken. This variant is very popular and the simplest.
      • iRPROP+ is defined in "Improving the Rprop Learning Algorithm" and is very robust and typically faster than the other three variants.
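The sign-based update described above can be sketched as follows. This is a minimal illustration in the spirit of the iRPROP− variant, not a reproduction of any of the cited papers: the function name `rprop_step`, the step-size bounds `step_min` and `step_max`, the initial step of 0.1 and the quadratic test function are all assumptions made for the example. Only η+ = 1.2 and η− = 0.5 come from the text.

```python
import numpy as np

def rprop_step(w, grad, prev_grad, step,
               eta_plus=1.2, eta_minus=0.5,
               step_min=1e-6, step_max=50.0):
    """One full-batch update. Returns (new_w, grad_to_store, new_step)."""
    product = grad * prev_grad
    # Same sign as in the last iteration: grow that weight's update value.
    step = np.where(product > 0, np.minimum(step * eta_plus, step_max), step)
    # Sign change: shrink that weight's update value.
    step = np.where(product < 0, np.maximum(step * eta_minus, step_min), step)
    # After a sign change, store a zero gradient so the weight is not moved
    # now and its update value is not adapted again on the next call.
    grad = np.where(product < 0, 0.0, grad)
    # Each weight moves against the sign of its own partial derivative;
    # the magnitude of the derivative is ignored.
    w = w - np.sign(grad) * step
    return w, grad, step

# Hypothetical objective: f(w) = sum((w - target)^2), gradient 2 (w - target).
target = np.array([3.0, -2.0])
w = np.zeros(2)
prev_grad = np.zeros(2)
step = np.full(2, 0.1)
for _ in range(100):
    grad = 2.0 * (w - target)
    w, prev_grad, step = rprop_step(w, grad, prev_grad, step)
print(w)
```

Because only the sign of the gradient is used, the gradient passed in should be computed over the full batch, in line with the remark above that RPROP is a batch update algorithm.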

    Read more →
  • Softmax function

    Softmax function

    The softmax function, also known as softargmax or normalized exponential function, converts a tuple of K real numbers into a probability distribution over K possible outcomes. It is a generalization of the logistic function to multiple dimensions, and is used in multinomial logistic regression. The softmax function is often used as the last activation function of a neural network to normalize the output of a network to a probability distribution over predicted output classes.

    == Definition ==

    The softmax function takes as input a tuple z of K real numbers, and normalizes it into a probability distribution consisting of K probabilities proportional to the exponentials of the input numbers. That is, prior to applying softmax, some tuple components could be negative, or greater than one; and might not sum to 1; but after applying softmax, each component will be in the interval $(0, 1)$, and the components will add up to 1, so that they can be interpreted as probabilities. Furthermore, the larger input components will correspond to larger probabilities.

    Formally, the standard (unit) softmax function $\sigma : \mathbb{R}^K \to (0, 1)^K$, where $K > 1$, takes a tuple $\mathbf{z} = (z_1, \dotsc, z_K) \in \mathbb{R}^K$ and computes each component of vector $\sigma(\mathbf{z}) \in (0, 1)^K$ with

    $$\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}.$$

    In words, the softmax applies the standard exponential function to each element $z_i$ of the input tuple $\mathbf{z}$ (consisting of $K$ real numbers), and normalizes these values by dividing by the sum of all these exponentials.
The normalization ensures that the sum of the components of the output vector $\sigma(\mathbf{z})$ is 1. The term "softmax" derives from the amplifying effects of the exponential on any maxima in the input tuple. For example, the standard softmax of $(1, 2, 8)$ is approximately $(0.001, 0.002, 0.997)$, which amounts to assigning almost all of the total unit weight in the result to the position of the tuple's maximal element (of 8).

In general, instead of e a different base b > 0 can be used. As above, if b > 1 then larger input components will result in larger output probabilities, and increasing the value of b will create probability distributions that are more concentrated around the positions of the largest input values. Conversely, if 0 < b < 1 then smaller input components will result in larger output probabilities, and decreasing the value of b will create probability distributions that are more concentrated around the positions of the smallest input values. Writing $b = e^{\beta}$ or $b = e^{-\beta}$ (for real β) yields the expressions:

$$\sigma(\mathbf{z})_i = \frac{e^{\beta z_i}}{\sum_{j=1}^{K} e^{\beta z_j}} \quad \text{or} \quad \sigma(\mathbf{z})_i = \frac{e^{-\beta z_i}}{\sum_{j=1}^{K} e^{-\beta z_j}} \quad \text{for } i = 1, \dotsc, K.$$

A value proportional to the reciprocal of β is sometimes referred to as the temperature: $\beta = 1/kT$, where k is typically 1 or the Boltzmann constant and T is the temperature. A higher temperature results in a more uniform output distribution (i.e. with higher entropy; it is "more random"), while a lower temperature results in a sharper output distribution, with one value dominating.
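The definition and the effect of β can be checked numerically with a short sketch. The function name `softmax` and its `beta` parameter are choices made for this example. Subtracting the maximum before exponentiating is a common numerical-stability trick that is not part of the definition above; it leaves the result unchanged because the common factor cancels in the ratio.

```python
import numpy as np

def softmax(z, beta=1.0):
    """Softmax of z with inverse temperature beta (beta = 1 is the standard form)."""
    scaled = beta * np.asarray(z, dtype=float)
    # Shift by the maximum to avoid overflow in exp; the ratio is unaffected.
    e = np.exp(scaled - scaled.max())
    return e / e.sum()

z = [1.0, 2.0, 8.0]
p = softmax(z)                  # standard softmax
p_hot = softmax(z, beta=0.1)    # small beta = high temperature: flatter
p_cold = softmax(z, beta=5.0)   # large beta = low temperature: sharper
print(np.round(p, 3), np.round(p_hot, 3), np.round(p_cold, 3))
```

With `beta=1.0` the rounded output reproduces the $(1, 2, 8)$ example from the text, and the largest probability grows as `beta` increases.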
In some fields, the base is fixed, corresponding to a fixed scale, while in others the parameter β (or T) is varied. The softmax function is a multiple-variable generalization of the logistic function.

== Interpretations ==

=== Smooth arg max ===

The softmax function is a smooth approximation to the arg max function: the function whose value is the index of a tuple's largest element. The name "softmax" may be misleading. Softmax is not a smooth maximum (that is, a smooth approximation to the maximum function). The term "softmax" is also used for the closely related LogSumExp function, which is a smooth maximum. For this reason, some prefer the more accurate term "softargmax", though the term "softmax" is conventional in machine learning. This section uses the term "softargmax" for clarity.

Formally, instead of considering the arg max as a function with categorical output $1, \dots, n$ (corresponding to the index), consider the arg max function with one-hot representation of the output (assuming there is a unique maximum arg):

$$\operatorname{arg\,max}(z_1, \dots, z_n) = (y_1, \dots, y_n) = (0, \dots, 0, 1, 0, \dots, 0),$$

where the output coordinate $y_i = 1$ if and only if $i$ is the arg max of $(z_1, \dots, z_n)$, meaning $z_i$ is the unique maximum value of $(z_1, \dots, z_n)$. For example, in this encoding $\operatorname{arg\,max}(1, 5, 10) = (0, 0, 1)$, since the third argument is the maximum.

This can be generalized to multiple arg max values (multiple equal $z_i$ being the maximum) by dividing the 1 between all max args; formally 1/k where k is the number of arguments assuming the maximum.
For example, $\operatorname{arg\,max}(1, 5, 5) = (0, 1/2, 1/2)$, since the second and third argument are both the maximum. In case all arguments are equal, this is simply $\operatorname{arg\,max}(z, \dots, z) = (1/n, \dots, 1/n)$. Points z with multiple arg max values are singular points (or singularities, and form the singular set) – these are the points where arg max is discontinuous (with a jump discontinuity) – while points with a single arg max are known as non-singular or regular points.

With the last expression given in the introduction, softargmax is now a smooth approximation of arg max: as $\beta \to \infty$, softargmax converges to arg max. There are various notions of convergence of a function; softargmax converges to arg max pointwise, meaning for each fixed input z as $\beta \to \infty$,

$$\sigma_\beta(\mathbf{z}) \to \operatorname{arg\,max}(\mathbf{z}).$$

However, softargmax does not converge uniformly to arg max, meaning intuitively that different points converge at different rates, and may converge arbitrarily slowly. In fact, softargmax is continuous, but arg max is not continuous at the singular set where two coordinates are equal, while the uniform limit of continuous functions is continuous. The reason it fails to converge uniformly is that for inputs where two coordinates are almost equal (and one is the maximum), the arg max is the index of one or the other, so a small change in input yields a large change in output.
For example, $\sigma_\beta(1, 1.0001) \to (0, 1)$, but $\sigma_\beta(1, 0.9999) \to (1, 0)$, and $\sigma_\beta(1, 1) = (1/2, 1/2)$ for all $\beta$: the closer the points are to the singular set $(x, x)$, the slower they converge. However, softargmax does converge compactly on the non-singular set.

Conversely, as $\beta \to -\infty$, softargmax converges to arg min in the same way, where here the singular set is points with two arg min values. In the language of tropical analysis, the softmax is a deformation or "quantization" of arg max and arg min, corresponding to using the log semiring instead of the max-plus semiring (respectively min-plus semiring), and recovering the arg max or arg min by taking the limit is called "tropicalization" or "dequantization".

It is also the case that, for any fixed β, if one input $z_i$ is much larger than the others relative to the temperature, $T = 1/\beta$, the output is approximately the arg max. For example, a difference of 10 is large relative to a temperature of 1:

$$\sigma(0, 10) := \sigma_1(0, 10) = \left( \frac{1}{1 + e^{10}},\ \frac{e^{10}}{1 + e^{10}} \right) \approx (0.00005, 0.99995)$$

    Read more →
  • Information retrieval

    Information retrieval

    Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an information need. The information need can be specified in the form of a search query. In the case of document retrieval, queries can be based on full-text or other content-based indexing. Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of texts, images, or sounds. Cross-modal retrieval implies retrieval across modalities. Automated information retrieval systems are used to reduce what has been called information overload. An IR system is a software system that provides access to books, journals, and other documents, as well as storing and managing those documents. Web search engines are the most visible IR applications. == Overview == An information retrieval process begins when a user enters a query into the system. Queries are formal statements of information needs, for example search strings in web search engines. In information retrieval, a query does not uniquely identify a single object in the collection. Instead, several objects may match the query, perhaps with different degrees of relevance. An object is an entity that is represented by information in a content collection or database. User queries are matched against the database information. However, as opposed to classical SQL queries of a database, in information retrieval the results returned may or may not match the query, so results are typically ranked. This ranking of results is a key difference of information retrieval searching compared to database searching. Depending on the application the data objects may be, for example, text documents, images, audio, mind maps or videos. 
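The contrast between exact-match database queries and ranked retrieval can be made concrete with a toy example. The sketch below scores each document against a query with a simple TF-IDF-style weight and sorts by score. It is a swapped-in illustration, not the scoring function of any particular IR system: the function name `rank`, the weighting formula and the three sample documents are all assumptions made for the example.

```python
import math
from collections import Counter

def rank(query, docs):
    """Return (score, doc_index) pairs, best match first."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for idx, toks in enumerate(tokenized):
        tf = Counter(toks)
        # Sum a term-frequency x rarity weight over the query terms present.
        score = sum(tf[t] * math.log(1 + n / df[t])
                    for t in query.lower().split() if t in tf)
        scores.append((score, idx))
    return sorted(scores, key=lambda pair: pair[0], reverse=True)

# Hypothetical three-document collection.
docs = [
    "ranking of results in web search engines",
    "classical sql queries of a database",
    "search engines rank documents by relevance",
]
ranked = rank("search engines ranking", docs)
for score, idx in ranked:
    print(f"{score:.3f}  {docs[idx]}")
```

Every document receives a score, including those that match only part of the query or none of it, which is why the output is an ordering rather than a yes/no match set.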
Often the documents themselves are not kept or stored directly in the IR system, but are instead represented in the system by document surrogates or metadata. Most IR systems compute a numeric score on how well each object in the database matches the query, and rank the objects according to this value. The top ranking objects are then shown to the user. The process may then be iterated if the user wishes to refine the query.

== History ==

"there is ... a machine called the Univac ... whereby letters and figures are coded as a pattern of magnetic spots on a long steel tape. By this means the text of a document, preceded by its subject code symbol, can be recorded ... the machine ... automatically selects and types out those references which have been coded in any desired way at a rate of 120 words a minute"

The idea of using computers to search for relevant pieces of information was popularized in the article As We May Think by Vannevar Bush in 1945. It would appear that Bush was inspired by patents for a 'statistical machine' – filed by Emanuel Goldberg in the 1920s and 1930s – that searched for documents stored on film. The first description of a computer searching for information was given by Holmstrom in 1948, detailing an early mention of the Univac computer. Automated information retrieval systems were introduced in the 1950s: one even featured in the 1957 romantic comedy Desk Set. In the 1960s, the first large information retrieval research group was formed by Gerard Salton at Cornell. By the 1970s several different retrieval techniques had been shown to perform well on small text corpora such as the Cranfield collection (several thousand documents). Large-scale retrieval systems, such as the Lockheed Dialog system, came into use early in the 1970s. In 1992, the US Department of Defense, along with the National Institute of Standards and Technology (NIST), cosponsored the Text Retrieval Conference (TREC) as part of the TIPSTER text program.
The aim was to support the information retrieval community by supplying the infrastructure needed to evaluate text retrieval methodologies on a very large text collection. This catalyzed research on methods that scale to huge corpora. The introduction of web search engines has boosted the need for very large scale retrieval systems even further. By the late 1990s, the rise of the World Wide Web fundamentally transformed information retrieval. While early search engines such as AltaVista (1995) and Yahoo! (1994) offered keyword-based retrieval, they were limited in scale and ranking refinement. The breakthrough came in 1998 with the founding of Google, which introduced the PageRank algorithm, using the web's hyperlink structure to assess page importance and improve relevance ranking. During the 2000s, web search systems evolved rapidly with the integration of machine learning techniques. These systems began to incorporate user behavior data (e.g., click-through logs), query reformulation, and content-based signals to improve search accuracy and personalization. In 2009, Microsoft launched Bing, introducing features that would later incorporate semantic web technologies through the development of its Satori knowledge base. Academic analyses have highlighted Bing's semantic capabilities, including structured data use and entity recognition, as part of a broader industry shift toward improving search relevance and understanding user intent through natural language processing. A major leap occurred in 2018, when Google deployed BERT (Bidirectional Encoder Representations from Transformers) to better understand the contextual meaning of queries and documents. This marked one of the first times deep neural language models were used at scale in real-world retrieval systems. BERT's bidirectional training enabled a more refined comprehension of word relationships in context, improving the handling of natural language queries.
Because of its success, transformer-based models gained traction in academic research and commercial search applications. Simultaneously, the research community began exploring neural ranking models that outperformed traditional lexical-based methods. Long-standing benchmarks such as the Text REtrieval Conference (TREC), initiated in 1992, and more recent evaluation frameworks such as Microsoft MARCO (MAchine Reading COmprehension) (2019) became central to training and evaluating retrieval systems across multiple tasks and domains. MS MARCO has also been adopted in the TREC Deep Learning Tracks, where it serves as a core dataset for evaluating advances in neural ranking models within a standardized benchmarking environment. As deep learning became integral to information retrieval systems, researchers began to categorize neural approaches into three broad classes: sparse, dense, and hybrid models. Sparse models, including traditional term-based methods and learned variants like SPLADE, rely on interpretable representations and inverted indexes to enable efficient exact term matching with added semantic signals. Dense models, such as dual-encoder architectures like ColBERT, use continuous vector embeddings to support semantic similarity beyond keyword overlap. Hybrid models aim to combine the advantages of both, balancing the lexical (token) precision of sparse methods with the semantic depth of dense models. This way of categorizing models balances scalability, relevance, and efficiency in retrieval systems. As IR systems increasingly rely on deep learning, concerns around bias, fairness, and explainability have also come into the picture. Research is now focused not just on relevance and efficiency, but on transparency, accountability, and user trust in retrieval algorithms.
== Applications ==

Areas where information retrieval techniques are employed include (the entries are in alphabetical order within each category):

=== General applications ===

Digital libraries; Information filtering; Recommender systems; Media search; Blog search; Image retrieval; 3D retrieval; Music retrieval; News search; Speech retrieval; Video retrieval; Search engines; Site search; Desktop search; Enterprise search; Federated search; Mobile search; Social search; Web search.

=== Domain-specific applications ===

Expert search finding; Genomic information retrieval; Geographic information retrieval; Information retrieval for chemical structures; Information retrieval in software engineering; Legal information retrieval; Vertical search.

=== Other retrieval methods ===

Methods/Techniques in which information retrieval techniques are employed include: Cross-modal retrieval; Adversarial information retrieval; Automatic summarization; Multi-document summarization; Compound term processing; Cross-lingual retrieval; Document classification; Spam filtering; Question answering.

== Model types ==

In order to effectively retrieve relevant documents by IR strategies, the documents are typically transformed into a suitable representation. Each retrieval strategy incorporates a specific model for its document representation purposes. The picture on the right illustrates the relationship of som

    Read more →
  • Homogeneity blockmodeling

    Homogeneity blockmodeling

    In mathematics applied to analysis of social structures, homogeneity blockmodeling is an approach in blockmodeling that is best suited as a preliminary or main approach to valued networks when prior knowledge about these networks is not available. This is because homogeneity blockmodeling emphasizes the similarity of link (tie) strengths within the blocks over the pattern of links. In this approach, tie (link) values (or statistical data computed on them) are assumed to be equal (homogeneous) within blocks. This approach to the generalized blockmodeling of valued networks was first proposed by Aleš Žiberna in 2007 with the basic idea "that the inconsistency of an empirical block with its ideal block can be measured by within block variability of appropriate values". The newly formed ideal blocks, which are appropriate for blockmodeling of valued networks, are then presented together with the definitions of their block inconsistencies. A similar approach to homogeneity blockmodeling, dealing with a direct approach for structural equivalence, was previously suggested by Stephen P. Borgatti and Martin G. Everett (1992).

    Read more →
  • Alternating decision tree

    Alternating decision tree

    An alternating decision tree (ADTree) is a machine learning method for classification. It generalizes decision trees and has connections to boosting. An ADTree consists of an alternation of decision nodes, which specify a predicate condition, and prediction nodes, which contain a single number. An instance is classified by an ADTree by following all paths for which all decision nodes are true, and summing any prediction nodes that are traversed.

    == History ==

    ADTrees were introduced by Yoav Freund and Llew Mason. However, the algorithm as presented had several typographical errors. Clarifications and optimizations were later presented by Bernhard Pfahringer, Geoffrey Holmes and Richard Kirkby. Implementations are available in Weka and JBoost.

    == Motivation ==

    Original boosting algorithms typically used either decision stumps or decision trees as weak hypotheses. As an example, boosting decision stumps creates a set of $T$ weighted decision stumps (where $T$ is the number of boosting iterations), which then vote on the final classification according to their weights. Individual decision stumps are weighted according to their ability to classify the data. Boosting a simple learner results in an unstructured set of $T$ hypotheses, making it difficult to infer correlations between attributes. Alternating decision trees introduce structure to the set of hypotheses by requiring that they build off a hypothesis that was produced in an earlier iteration. The resulting set of hypotheses can be visualized in a tree based on the relationship between a hypothesis and its "parent." Another important feature of boosted algorithms is that the data is given a different distribution at each iteration. Instances that are misclassified are given a larger weight while accurately classified instances are given reduced weight.
== Alternating decision tree structure ==

An alternating decision tree consists of decision nodes and prediction nodes. Decision nodes specify a predicate condition. Prediction nodes contain a single number. ADTrees always have prediction nodes as both root and leaves. An instance is classified by an ADTree by following all paths for which all decision nodes are true and summing any prediction nodes that are traversed. This is different from binary classification trees such as CART (Classification and regression tree) or C4.5, in which an instance follows only one path through the tree.

=== Example ===

The following tree was constructed using JBoost on the spambase dataset (available from the UCI Machine Learning Repository). In this example, spam is coded as 1 and regular email is coded as −1. The following table contains part of the information for a single instance. The instance is scored by summing all of the prediction nodes through which it passes. In the case of the instance above, the calculated final score of 0.657 is positive, so the instance is classified as spam. The magnitude of the value is a measure of confidence in the prediction.

The original authors list three potential levels of interpretation for the set of attributes identified by an ADTree:

    • Individual nodes can be evaluated for their own predictive ability.
    • Sets of nodes on the same path may be interpreted as having a joint effect.
    • The tree can be interpreted as a whole.

Care must be taken when interpreting individual nodes, as the scores reflect a reweighting of the data in each iteration.

== Description of the algorithm ==

The inputs to the alternating decision tree algorithm are:

    • A set of inputs $(x_1, y_1), \ldots, (x_m, y_m)$ where $x_i$ is a vector of attributes and $y_i$ is either −1 or 1. Inputs are also called instances.
    • A set of weights $w_i$ corresponding to each instance.

The fundamental element of the ADTree algorithm is the rule. A single rule consists of a precondition, a condition, and two scores. A condition is a predicate of the form "attribute <comparison> value." A precondition is simply a logical conjunction of conditions. Evaluation of a rule involves a pair of nested if statements:

    if (precondition)
        if (condition)
            return score_one
        else
            return score_two
        end if
    else
        return 0
    end if

Several auxiliary functions are also required by the algorithm:

    • $W_{+}(c)$ returns the sum of the weights of all positively labeled examples that satisfy predicate $c$.
    • $W_{-}(c)$ returns the sum of the weights of all negatively labeled examples that satisfy predicate $c$.
    • $W(c) = W_{+}(c) + W_{-}(c)$ returns the sum of the weights of all examples that satisfy predicate $c$.

The algorithm is as follows:

    function ad_tree
    input: set of m training instances

    $w_i = 1/m$ for all $i$
    $a = \frac{1}{2} \ln \frac{W_{+}(true)}{W_{-}(true)}$
    $R_0$ = a rule with scores $a$ and 0, precondition "true" and condition "true"
    $\mathcal{P} = \{true\}$
    $\mathcal{C}$ = the set of all possible conditions
    for $j = 1 \dots T$
        get the values $p \in \mathcal{P}$, $c \in \mathcal{C}$ that minimize
            $z = 2\left(\sqrt{W_{+}(p \wedge c) W_{-}(p \wedge c)} + \sqrt{W_{+}(p \wedge \neg c) W_{-}(p \wedge \neg c)}\right) + W(\neg p)$
        $\mathcal{P} \mathrel{+}= p \wedge c + p \wedge \neg c$
        $a_1 = \frac{1}{2} \ln \frac{W_{+}(p \wedge c) + 1}{W_{-}(p \wedge c) + 1}$
        $a_2 = \frac{1}{2} \ln \frac{W_{+}(p \wedge \neg c) + 1}{W_{-}(p \wedge \neg c) + 1}$
        $R_j$ = new rule with precondition $p$, condition $c$, and scores $a_1$ and $a_2$
        $w_i = w_i e^{-y_i R_j(x_i)}$
    end for
    return set of $R_j$

The set $\mathcal{P}$ grows by two preconditions in each iteration, and it is possible to derive the tree structure of a set of rules by making note of the precondition that is used in each successive rule.

== Empirical results ==

Figure 6 in the original paper demonstrates that ADTrees are typically as robust as boosted decision trees and boosted decision stumps. Typically, equivalent accuracy can be achieved with a much simpler tree structure than recursive partitioning algorithms.
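The rule evaluation and boosting loop from the algorithm description can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the JBoost or Weka implementation: the names `Rule`, `weight_sum`, `conj`, `train_adtree` and `score` are hypothetical, instances are assumed to be dictionaries of numeric attributes, and the candidate conditions are supplied by the caller rather than enumerated from the data.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence

Instance = Dict[str, float]
Predicate = Callable[[Instance], bool]


@dataclass
class Rule:
    """One ADTree rule: a precondition, a condition, and two scores."""
    precondition: Predicate
    condition: Predicate
    score_one: float  # returned when precondition and condition both hold
    score_two: float  # returned when the precondition holds but the condition fails

    def __call__(self, x: Instance) -> float:
        if not self.precondition(x):
            return 0.0
        return self.score_one if self.condition(x) else self.score_two


def always_true(x: Instance) -> bool:
    return True


def conj(p: Predicate, c: Predicate, want: bool) -> Predicate:
    """Build the predicate p AND c (want=True) or p AND NOT c (want=False)."""
    return lambda x: p(x) and (c(x) == want)


def weight_sum(pred: Predicate, xs, ys, ws, label: Optional[int] = None) -> float:
    """W(pred), or W_+(pred) / W_-(pred) when label is +1 / -1."""
    return sum(w for x, y, w in zip(xs, ys, ws)
               if pred(x) and (label is None or y == label))


def train_adtree(xs: Sequence[Instance], ys: Sequence[int],
                 conditions: Sequence[Predicate], rounds: int) -> List[Rule]:
    m = len(xs)
    ws = [1.0 / m] * m

    def wsum(pred: Predicate, label: Optional[int] = None) -> float:
        return weight_sum(pred, xs, ys, ws, label)  # sees the current weights

    # Root rule R_0; assumes both classes occur in the training data.
    a = 0.5 * math.log(wsum(always_true, 1) / wsum(always_true, -1))
    rules = [Rule(always_true, always_true, a, 0.0)]
    preconditions: List[Predicate] = [always_true]

    def z_value(pair) -> float:
        p, c = pair
        yes, no = conj(p, c, True), conj(p, c, False)
        return (2.0 * (math.sqrt(wsum(yes, 1) * wsum(yes, -1))
                       + math.sqrt(wsum(no, 1) * wsum(no, -1)))
                + wsum(lambda x: not p(x)))

    for _ in range(rounds):
        # Pick the (precondition, condition) pair that minimizes z.
        p, c = min(((p, c) for p in preconditions for c in conditions), key=z_value)
        yes, no = conj(p, c, True), conj(p, c, False)
        preconditions += [yes, no]
        a1 = 0.5 * math.log((wsum(yes, 1) + 1.0) / (wsum(yes, -1) + 1.0))
        a2 = 0.5 * math.log((wsum(no, 1) + 1.0) / (wsum(no, -1) + 1.0))
        rule = Rule(p, c, a1, a2)
        rules.append(rule)
        # Reweight: misclassified instances gain weight, correct ones lose it.
        ws = [w * math.exp(-y * rule(x)) for x, y, w in zip(xs, ys, ws)]
    return rules


def score(rules: Sequence[Rule], x: Instance) -> float:
    """Sum the contributions of every rule; the sign gives the predicted class."""
    return sum(rule(x) for rule in rules)
```

Calling `score(rules, x)` on a new instance returns the summed prediction values; a positive result corresponds to class 1 and a negative result to class −1, with the magnitude read as confidence. The exhaustive search over every precondition and condition pair is kept deliberately naive for readability.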

    Read more →
  • Stochastic variance reduction

    Stochastic variance reduction

    (Stochastic) variance reduction is an algorithmic approach to minimizing functions that can be decomposed into finite sums. By exploiting the finite sum structure, variance reduction techniques are able to achieve convergence rates that are impossible to achieve with methods that treat the objective as an infinite sum, as in the classical stochastic approximation setting. Variance reduction approaches are widely used for training machine learning models such as logistic regression and support vector machines, as these problems have finite-sum structure and uniform conditioning that make them ideal candidates for variance reduction.

== Finite sum objectives ==

A function $f$ is considered to have finite sum structure if it can be decomposed into a summation or average:

$$f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x),$$

where the function value and derivative of each $f_i$ can be queried independently. Although variance reduction methods can be applied for any positive $n$ and any $f_i$ structure, their favorable theoretical and practical properties arise when $n$ is large compared to the condition number of each $f_i$, and when the $f_i$ have similar (but not necessarily identical) Lipschitz smoothness and strong convexity constants.

The finite sum structure should be contrasted with the stochastic approximation setting, which deals with functions of the form $f(\theta) = \operatorname{E}_{\xi}[F(\theta, \xi)]$, the expected value of a function depending on a random variable $\xi$. Any finite sum problem can be optimized using a stochastic approximation algorithm by using $F(\cdot, \xi) = f_{\xi}$.
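As a concrete illustration of a finite sum objective, the sketch below uses least-squares components $f_i(x) = \tfrac{1}{2}(a_i^{T}x - b_i)^2$. That choice of $f_i$, and the helper names `component_grad`, `full_grad` and `stochastic_grad`, are assumptions made for the example and are not taken from the text. It shows each component being queried independently, and a uniformly sampled component gradient standing in for the full gradient.

```python
import random
from typing import List, Sequence, Tuple

Vector = List[float]
Sample = Tuple[Vector, float]  # (a_i, b_i) for f_i(x) = 0.5 * (a_i . x - b_i)^2


def residual(x: Vector, sample: Sample) -> float:
    a, b = sample
    return sum(aj * xj for aj, xj in zip(a, x)) - b


def component_value(x: Vector, sample: Sample) -> float:
    """Value of a single term f_i(x)."""
    r = residual(x, sample)
    return 0.5 * r * r


def component_grad(x: Vector, sample: Sample) -> Vector:
    """Gradient of a single term f_i at x."""
    r = residual(x, sample)
    return [r * aj for aj in sample[0]]


def full_value(x: Vector, data: Sequence[Sample]) -> float:
    """f(x) = (1/n) * sum_i f_i(x)."""
    return sum(component_value(x, s) for s in data) / len(data)


def full_grad(x: Vector, data: Sequence[Sample]) -> Vector:
    """Gradient of f: the average of the n component gradients."""
    n = len(data)
    grad = [0.0] * len(x)
    for s in data:
        for j, gj in enumerate(component_grad(x, s)):
            grad[j] += gj / n
    return grad


def stochastic_grad(x: Vector, data: Sequence[Sample], rng: random.Random) -> Vector:
    """Gradient of one uniformly sampled f_i: an unbiased estimate of full_grad."""
    return component_grad(x, rng.choice(list(data)))
```

Evaluating `full_grad` costs $n$ component gradients, while `stochastic_grad` costs one; this cost gap is what variance reduction methods exploit.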
== Rapid Convergence ==

Stochastic variance reduced methods without acceleration are able to find a minimum of $f$ within accuracy $\epsilon > 0$, i.e. $f(x) - f(x_{*}) \leq \epsilon$, in a number of steps of the order:

$$O\left(\left(\frac{L}{\mu} + n\right)\log\left(\frac{1}{\epsilon}\right)\right).$$

The number of steps depends only logarithmically on the level of accuracy required, in contrast to the stochastic approximation framework, where the number of steps $O\bigl(L/(\mu\epsilon)\bigr)$ required grows in proportion to $1/\epsilon$. Stochastic variance reduction methods converge almost as fast as the gradient descent method's $O\bigl((L/\mu)\log(1/\epsilon)\bigr)$ rate, despite using only a stochastic gradient, whose cost per step is $1/n$ that of a gradient descent step.

Accelerated methods in the stochastic variance reduction framework achieve even faster convergence rates, requiring only

$$O\left(\left(\sqrt{\frac{nL}{\mu}} + n\right)\log\left(\frac{1}{\epsilon}\right)\right)$$

steps to reach $\epsilon$ accuracy, potentially $\sqrt{n}$ faster than non-accelerated methods. Lower complexity bounds for the finite sum class establish that this rate is the fastest possible for smooth strongly convex problems.

== Approaches ==

Variance reduction approaches fall within four main categories: table averaging methods, full-gradient snapshot methods, recursive estimator methods (e.g., SARAH), and dual methods. Each category contains methods designed for dealing with convex, non-smooth, and non-convex problems, each differing in hyper-parameter settings and other algorithmic details.
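To get a feel for the convergence rates quoted in the Rapid Convergence section, the sketch below plugs numbers into the expressions inside the $O(\cdot)$ bounds. It is only an order-of-magnitude illustration: the hidden constants are dropped, `kappa` stands for $L/\mu$, the function names are hypothetical, and counting gradient descent as $n$ component gradients per step is an accounting assumption made here for comparison.

```python
import math


def vr_steps(kappa: float, n: int, eps: float) -> float:
    """(L/mu + n) * log(1/eps): non-accelerated variance reduction."""
    return (kappa + n) * math.log(1.0 / eps)


def accelerated_vr_steps(kappa: float, n: int, eps: float) -> float:
    """(sqrt(n * L/mu) + n) * log(1/eps): accelerated variance reduction."""
    return (math.sqrt(n * kappa) + n) * math.log(1.0 / eps)


def sgd_steps(kappa: float, eps: float) -> float:
    """L / (mu * eps): stochastic approximation."""
    return kappa / eps


def gd_component_gradients(kappa: float, n: int, eps: float) -> float:
    """(L/mu) * log(1/eps) full-gradient steps, each costing n component gradients."""
    return n * kappa * math.log(1.0 / eps)


if __name__ == "__main__":
    kappa, n, eps = 1e6, 10_000, 1e-6
    for name, value in [
        ("accelerated VR", accelerated_vr_steps(kappa, n, eps)),
        ("VR", vr_steps(kappa, n, eps)),
        ("gradient descent", gd_component_gradients(kappa, n, eps)),
        ("stochastic approximation", sgd_steps(kappa, eps)),
    ]:
        print(f"{name:>26}: {value:.3e}")
```

With an ill-conditioned problem such as the one in the demo (condition number larger than $n$), the accelerated expression is the smallest; when $L/\mu$ is smaller than $n$, the $n$ term dominates both variance reduction expressions and acceleration offers little.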
=== SAGA ===

In the SAGA method, the prototypical table averaging approach, a table of size $n$ is maintained that contains the last gradient witnessed for each $f_i$ term, which we denote $g_i$. At each step, an index $i$ is sampled, and a new gradient $\nabla f_i(x_k)$ is computed. The iterate $x_k$ is updated with:

$$x_{k+1} = x_k - \gamma\left[\nabla f_i(x_k) - g_i + \frac{1}{n}\sum_{j=1}^{n} g_j\right],$$

and afterwards table entry $i$ is updated with $g_i = \nabla f_i(x_k)$.

SAGA is among the most popular of the variance reduction methods due to its simplicity, easily adaptable theory, and excellent performance. It is the successor of the SAG method, improving on its flexibility and performance.

=== SVRG ===

The stochastic variance reduced gradient method (SVRG), the prototypical snapshot method, uses a similar update, except that instead of the average of a table it uses a full gradient that is reevaluated at a snapshot point $\tilde{x}$ at regular intervals of $m \geq n$ iterations. The update becomes:

$$x_{k+1} = x_k - \gamma\left[\nabla f_i(x_k) - \nabla f_i(\tilde{x}) + \nabla f(\tilde{x})\right].$$

This approach requires two stochastic gradient evaluations per step, one to compute $\nabla f_i(x_k)$ and one to compute $\nabla f_i(\tilde{x})$, whereas table averaging approaches need only one. Despite the higher computational cost, SVRG is popular as its simple convergence theory is highly adaptable to new optimization settings.
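The SAGA and SVRG updates above can be sketched in a few lines of Python. This is an illustrative sketch rather than a reference implementation: the function names, the `grad_i(i, x)` callback convention, and the choice to initialize the SAGA table with gradients at the starting point are all assumptions made for the example.

```python
import random
from typing import Callable, List

Vector = List[float]
GradFn = Callable[[int, Vector], Vector]  # grad_i(i, x): gradient of f_i at x


def mean_gradient(grads: List[Vector]) -> Vector:
    n = len(grads)
    return [sum(g[j] for g in grads) / n for j in range(len(grads[0]))]


def saga(grad_i: GradFn, n: int, x0: Vector, step: float,
         iters: int, seed: int = 0) -> Vector:
    rng = random.Random(seed)
    x = list(x0)
    # Assumption: the table starts with every component gradient at x0.
    table = [grad_i(i, x) for i in range(n)]
    avg = mean_gradient(table)
    for _ in range(iters):
        i = rng.randrange(n)
        g_new, g_old = grad_i(i, x), table[i]
        # x_{k+1} = x_k - step * (grad f_i(x_k) - g_i + table average)
        x = [xj - step * (gn - go + aj)
             for xj, gn, go, aj in zip(x, g_new, g_old, avg)]
        # Only afterwards replace table entry i and refresh the running average.
        avg = [aj + (gn - go) / n for aj, gn, go in zip(avg, g_new, g_old)]
        table[i] = g_new
    return x


def svrg(grad_i: GradFn, n: int, x0: Vector, step: float,
         epochs: int, m: int, seed: int = 0) -> Vector:
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(epochs):
        snapshot = list(x)
        full = mean_gradient([grad_i(i, snapshot) for i in range(n)])
        for _ in range(m):  # m inner iterations between snapshots
            i = rng.randrange(n)
            g_x, g_snap = grad_i(i, x), grad_i(i, snapshot)
            # x_{k+1} = x_k - step * (grad f_i(x_k) - grad f_i(snapshot) + full gradient)
            x = [xj - step * (gx - gs + fj)
                 for xj, gx, gs, fj in zip(x, g_x, g_snap, full)]
    return x
```

For example, with components $f_i(x) = \tfrac{1}{2}(x - c_i)^2$ the callback is `lambda i, x: [x[0] - c[i]]`, and both routines should approach the mean of the $c_i$. The two-evaluations-per-step cost of SVRG is visible in the `g_x` and `g_snap` calls, and the $n$-entry storage of SAGA in `table`.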
SVRG also has lower storage requirements than tabular averaging approaches, which makes it applicable in many settings where tabular methods cannot be used.

=== SARAH ===

The SARAH (stochastic recursive gradient) method maintains a recursive estimator of the gradient rather than storing a table of past gradients (as in SAGA) or computing periodic full-gradient snapshots (as in SVRG). At the start of an inner loop, a full gradient is computed at a reference point $\tilde{x}$: $v_0 = \nabla f(\tilde{x})$. For inner iterations, with a sampled index $i_k$, the gradient estimator and iterate are updated by:

$$v_k = \nabla f_{i_k}(x_k) - \nabla f_{i_k}(x_{k-1}) + v_{k-1}, \qquad x_{k+1} = x_k - \gamma v_k.$$

This recursion requires two component-gradient evaluations per step, $\nabla f_{i_k}(x_k)$ and $\nabla f_{i_k}(x_{k-1})$, but does not need to store per-sample gradients, resulting in lower memory cost than table-averaging methods. SARAH admits linear convergence for strongly convex functions and has been extended to more general nonconvex and composite problems.

=== SDCA ===

Exploiting the dual representation of the objective leads to another variance reduction approach that is particularly suited to finite sums where each term has a structure that makes computing the convex conjugate $f_i^{*}$ or its proximal operator tractable.
The standard SDCA method considers finite sums that have additional structure compared to the generic finite sum setting:

$$f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x^{T}v_i) + \frac{\lambda}{2}\|x\|^{2},$$

where each $f_i$ is one-dimensional and each $v_i$ is a data point associated with $f_i$. SDCA solves the dual problem:

$$\max_{\alpha \in \mathbb{R}^{n}} -\frac{1}{n}\sum_{i=1}^{n} f_i^{*}(-\alpha_i) - \frac{\lambda}{2}\left\|\frac{1}{\lambda n}\sum_{i=1}^{n}\alpha_i v_i\right\|^{2},$$

by a stochastic coordinate ascent procedure, where at each step the objective is optimized with respect to a randomly chosen coordinate $\alpha_i$, leaving all other coordinates the same. An approximate primal solution $x$ can be recovered from the $\alpha$ values:

$$x = \frac{1}{\lambda n}\sum_{i=1}^{n}\alpha_i v_i.$$

This method obtains similar theoretical rates of convergence to other stochastic variance reduced methods, while avoiding the need to specify a step-size parameter. It is fast in practice when $\lambda$ is large, but significantly slower than the other approaches when $\lambda$ is small.

== Accelerated approaches ==

Accelerated variance reduction methods are built upon the standard methods above. The earliest approaches make use of proximal operators t

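Returning to the SARAH recursion described earlier in this excerpt, the inner and outer loops can be sketched as below. The function name `sarah`, the `grad_i(i, x)` callback convention, and the restart policy (taking the last iterate of each inner loop as the next reference point) are assumptions made for this illustration.

```python
import random
from typing import Callable, List

Vector = List[float]
GradFn = Callable[[int, Vector], Vector]  # grad_i(i, x): gradient of f_i at x


def sarah(grad_i: GradFn, n: int, x0: Vector, step: float,
          outer_loops: int, inner_steps: int, seed: int = 0) -> Vector:
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(outer_loops):
        # v_0: full gradient at the reference point (the current iterate).
        grads = [grad_i(i, x) for i in range(n)]
        v = [sum(g[j] for g in grads) / n for j in range(len(x))]
        x_prev, x = x, [xj - step * vj for xj, vj in zip(x, v)]
        for _ in range(inner_steps):
            i = rng.randrange(n)
            g_now, g_prev = grad_i(i, x), grad_i(i, x_prev)
            # v_k = grad f_i(x_k) - grad f_i(x_{k-1}) + v_{k-1}
            v = [gn - gp + vj for gn, gp, vj in zip(g_now, g_prev, v)]
            # x_{k+1} = x_k - step * v_k
            x_prev, x = x, [xj - step * vj for xj, vj in zip(x, v)]
    return x
```

Only the previous iterate and the running estimator `v` are kept between steps, which reflects the low memory cost noted for SARAH relative to table-averaging methods.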
    Read more →
  • Colors!

    Colors!

    Colors! is a series of digital painting applications for handheld game consoles and mobile devices. It was originally created as a homebrew application for the Nintendo DS (as Colors!) and has since been legitimately distributed on PlayStation Vita, iOS, and Android; the project eventually evolved into an officially licensed application for Nintendo 3DS (as Colors! 3D) and Nintendo Switch (as Colors Live).

== History ==

=== Colors! ===

Colors! was originally released in June 2007 as a simple homebrew painting application for the Nintendo DS. It was developed by Jens Andersson, a programmer and designer on sabbatical from the games industry who wanted to experiment with the potential of the new handheld platform. Shortly after, Rafał Piasek created an online gallery where users could upload paintings made with the program. Colors! quickly became one of the best-known homebrew applications on the Nintendo DS, and in September 2008, it was also released for the iPhone and iPod Touch. As of August 2010, it had been downloaded almost half a million times. It was voted the most popular homebrew application on the Nintendo DS by readers of the R4 for DS blog. Development of the Colors! DS homebrew officially ended in December 2010, although the official gallery still accepted submissions from DS users until 2020, when Colors! Gallery was discontinued.

=== Colors! 3D ===

Colors! 3D is a successor to the application Colors! for the Nintendo 3DS. It was released as an officially licensed application for the Nintendo eShop in North America on April 5, 2012, and in the PAL region on April 19, 2012. It was later released in Japan on August 21, 2013, published by Arc System Works. Colors! 3D allows users to draw on five layers, each on its own stereoscopic 3D plane. Drawing is done on the bottom screen, while the top screen displays the painting in 3D.
While drawing, players can use the various controls on the Nintendo 3DS to change layers, zoom and pan, and alter the pressure of their brush. Pressing the L button allows users to access a menu to change brush type, size, and opacity, modify the layers, use the camera to provide references, and more. When users finish a painting, they can export it to the SD card for viewing in the Nintendo 3DS Camera application. Users can also upload their finished creations to an online gallery, which can be viewed on the 3DS or the official website. Gallery features include hashtags and the ability to follow artists and post comments. Each painting also has a replay feature that allows viewers to see how it was drawn. The application also features local multiplayer, allowing several people to work cooperatively on a painting. In April 2024, the developers of Colors! 3D collaborated with the Pretendo Network project to officially add support for the application, meaning Colors! 3D will continue to operate as normal when using Pretendo Network.

==== Reception ====

IGN gave the application a score of 9.0 and an Editor's Choice award, praising its simple interface and tutorials. Destructoid gave the app a 9.0, calling it "a simple and incredibly fun tool with an amazing community of artists proudly displaying their beautiful and funny 3D images." Nintendo Life gave the app a 9/10, stating, "Though lacking in any structured play, Colors! 3D’s robust free drawing system and unique ability to let anyone create their own three-dimensional artwork more than make up for this."

=== Colors Live ===

A Nintendo Switch successor called Colors Live (stylised as Colors L!ve) was released in 2020 after being funded via a Kickstarter campaign. It expanded upon the features of previous installments by adding new brushes, increasing the maximum number of layers to ten, and introducing blend modes. A new game mode called Colors Quest was also included.
A pressure-sensitive pen called the Colors SonarPen was developed in collaboration with GreenBulb to facilitate drawing on the Nintendo Switch, and comes pre-bundled with physical copies of the game.

==== Colors Quest ====

This new mode acts as a story-driven adventure wherein players are given a daily drawing challenge with a specific theme and certain stipulations that must be fulfilled. Once the drawing is complete, players must anonymously score other players' submissions. These scores are then aggregated to produce a personal ranking that measures the improvement in the player's art skills over time.

    Read more →
  • Clustering illusion

    Clustering illusion

    The clustering illusion is the tendency to erroneously consider the inevitable "streaks" or "clusters" arising in small samples from random distributions to be non-random. The illusion is caused by a human tendency to underpredict the amount of variability likely to appear in a small sample of random or pseudorandom data. Thomas Gilovich, an early author on the subject, argued that the effect occurs for different types of random dispersions. Some might perceive patterns in stock market price fluctuations over time, or clusters in two-dimensional data such as the locations of impact of World War II V-1 flying bombs on maps of London. Although Londoners developed specific theories about the pattern of impacts within London, a statistical analysis by R. D. Clarke originally published in 1946 showed that the impacts of V-2 rockets on London were a close fit to a random distribution.

== Similar biases ==

Using this cognitive bias in causal reasoning may result in the Texas sharpshooter fallacy, in which differences in data are ignored and similarities are overemphasized. More general forms of erroneous pattern recognition are pareidolia and apophenia. Related biases are the illusion of control, which the clustering illusion could contribute to, and insensitivity to sample size, in which people don't expect greater variation in smaller samples. A different cognitive bias involving misunderstanding of chance streams is the gambler's fallacy.

== Possible causes ==

Daniel Kahneman and Amos Tversky explained this kind of misprediction as being caused by the representativeness heuristic (which itself they also first proposed).
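A short simulation can make the point about streaks in small random samples concrete. The sketch below is an illustration only: the fair-coin model, the sample size of 20 flips, and the function names are assumptions chosen for the example, not part of any study mentioned above. It measures the longest run of identical outcomes in repeated short sequences of independent flips.

```python
import random
from typing import Iterable, List


def longest_run(seq: Iterable) -> int:
    """Length of the longest run of equal consecutive items (0 if empty)."""
    best = current = 0
    prev = object()  # sentinel that equals nothing in the sequence
    for item in seq:
        current = current + 1 if item == prev else 1
        best = max(best, current)
        prev = item
    return best


def simulate_longest_runs(n_flips: int, n_trials: int, seed: int = 0) -> List[int]:
    """Longest streak in each of n_trials sequences of n_flips fair coin flips."""
    rng = random.Random(seed)
    return [
        longest_run([rng.random() < 0.5 for _ in range(n_flips)])
        for _ in range(n_trials)
    ]


if __name__ == "__main__":
    runs = simulate_longest_runs(n_flips=20, n_trials=1000)
    print("average longest streak in 20 flips:", sum(runs) / len(runs))
    print("share of sequences with a streak of 4 or more:",
          sum(r >= 4 for r in runs) / len(runs))
```

Running the script shows how often streaks of four or more appear even though every flip is independent, which is the kind of variability that intuition tends to underpredict.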

    Read more →
  • Artificial development

    Artificial development

    Artificial development, also known as artificial embryogeny or machine intelligence or computational development, is an area of computer science and engineering concerned with computational models motivated by genotype–phenotype mappings in biological systems. Artificial development is often considered a sub-field of evolutionary computation, although the principles of artificial development have also been used within stand-alone computational models. Within evolutionary computation, the need for artificial development techniques was motivated by the perceived lack of scalability and evolvability of direct solution encodings (Tufte, 2008). Artificial development entails indirect solution encoding. Rather than describing a solution directly, an indirect encoding describes (either explicitly or implicitly) the process by which a solution is constructed. Often, but not always, these indirect encodings are based upon biological principles of development such as morphogen gradients, cell division and cellular differentiation (e.g. Doursat 2008), gene regulatory networks (e.g. Guo et al., 2009), degeneracy (Whitacre et al., 2010), grammatical evolution (de Salabert et al., 2006), or analogous computational processes such as re-writing, iteration, and time. The influences of interaction with the environment, spatiality and physical constraints on differentiated multi-cellular development have been investigated more recently (e.g. Knabe et al. 2008). Artificial development approaches have been applied to a number of computational and design problems, including electronic circuit design (Miller and Banzhaf 2003), robotic controllers (e.g. Taylor 2004), and the design of physical structures (e.g. Hornby 2004).

    Read more →