AI Content Humanizer

AI Content Humanizer — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Swizzling (computer graphics)

    Swizzling (computer graphics)

    In computer graphics, swizzles are a class of operations that transform vectors by rearranging components. Swizzles can also project from a vector of one dimensionality to a vector of another dimensionality, such as taking a three-dimensional vector and creating a two-dimensional or five-dimensional vector using components from the original vector. For example, if A = {1,2,3,4}, where the components are x, y, z, and w respectively, one could compute B = A.wwxy, whereupon B would equal {4,4,1,2}. Additionally, one could create a two-dimensional vector with A.wx or a five-dimensional vector with A.xyzwx. Combining vectors and swizzling can be employed in various ways. This is common in GPGPU applications. In terms of linear algebra, this is equivalent to multiplying by a matrix whose rows are standard basis vectors. If A = ( 1 , 2 , 3 , 4 ) T {\displaystyle A=(1,2,3,4)^{T}} , then swizzling A {\displaystyle A} as above looks like A . w w x y = [ 0 0 0 1 0 0 0 1 1 0 0 0 0 1 0 0 ] [ 1 2 3 4 ] = [ 4 4 1 2 ] . {\displaystyle A.\!wwxy={\begin{bmatrix}0&0&0&1\\0&0&0&1\\1&0&0&0\\0&1&0&0\end{bmatrix}}{\begin{bmatrix}1\\2\\3\\4\end{bmatrix}}={\begin{bmatrix}4\\4\\1\\2\end{bmatrix}}.}

    Read more →
  • Multi-surface method

    Multi-surface method

    The multi-surface method (MSM) is a form of decision making using the concept of piecewise-linear separability of datasets to categorize data. == Introduction == Two datasets are linearly separable if their convex hulls do not intersect. The method may be formulated as a feedforward neural network with weights that are trained via linear programming. Comparisons between neural networks trained with the MSM versus backpropagation show MSM is better able to classify data. The decision problem associated linear program for the MSM is NP-complete. == Mathematical formulation == Given two finite disjoint point sets A , B ∈ R n {\displaystyle {\mathcal {A,B}}\in \mathbb {R} ^{n}} , find a discriminant, f : R n → R {\displaystyle f:\mathbb {R} ^{n}\to \mathbb {R} } such that f ( A ) > 0 , f ( B ) ≤ 0 {\displaystyle f({\mathcal {A}})>0,f({\mathcal {B}})\leq 0} . If the intersection of convex hulls of the two sets is the empty set, then it is possible to use a single linear program to obtain a linear discriminant of the form, f ( x ) = c x + γ {\displaystyle f(x)=cx+\gamma } . Usually, in real applications, the sets' convex hulls do intersect, and a (often non-convex) piecewise-linear discriminant can be used, through the use of several linear programs.

    Read more →
  • One-class classification

    One-class classification

    In machine learning, one-class classification (OCC), also known as unary classification or class-modelling, is an approach to the training of binary classifiers in which only examples of one of the two classes are used. Examples include the monitoring of helicopter gearboxes, motor failure prediction, or assessing the operational status of a nuclear plant as 'normal': In such scenarios, there are few, if any, examples of the catastrophic system states – rare outliers – that comprise the second class. Alternatively, the class that is being focused on may cover a small, coherent subset of the data and the training may rely on an information bottleneck approach. In practice, counter-examples from the second class may be used in later rounds of training to further refine the algorithm. == Overview == The term one-class classification (OCC) was coined by Moya & Hush (1996) and many applications can be found in scientific literature, for example outlier detection, anomaly detection, novelty detection. A feature of OCC is that it uses only sample points from the assigned class, so that a representative sampling is not strictly required for non-target classes. == Introduction == SVM based one-class classification (OCC) relies on identifying the smallest hypersphere (with radius r, and center c) consisting of all the data points. This method is called Support Vector Data Description (SVDD). Formally, the problem can be defined in the following constrained optimization form, min r , c r 2 subject to, | | Φ ( x i ) − c | | 2 ≤ r 2 ∀ i = 1 , 2 , . . . , n {\displaystyle \min _{r,c}r^{2}{\text{ subject to, }}||\Phi (x_{i})-c||^{2}\leq r^{2}\;\;\forall i=1,2,...,n} However, the above formulation is highly restrictive, and is sensitive to the presence of outliers. Therefore, a flexible formulation, that allow for the presence of outliers is formulated as shown below, min r , c , ζ r 2 + 1 ν n ∑ i = 1 n ζ i {\displaystyle \min _{r,c,\zeta }r^{2}+{\frac {1}{\nu n}}\sum _{i=1}^{n}\zeta _{i}} subject to, | | Φ ( x i ) − c | | 2 ≤ r 2 + ζ i ∀ i = 1 , 2 , . . . , n {\displaystyle {\text{subject to, }}||\Phi (x_{i})-c||^{2}\leq r^{2}+\zeta _{i}\;\;\forall i=1,2,...,n} From the Karush–Kuhn–Tucker conditions for optimality, we get c = ∑ i = 1 n α i Φ ( x i ) , {\displaystyle c=\sum _{i=1}^{n}\alpha _{i}\Phi (x_{i}),} where the α i {\displaystyle \alpha _{i}} 's are the solution to the following optimization problem: max α ∑ i = 1 n α i κ ( x i , x i ) − ∑ i , j = 1 n α i α j κ ( x i , x j ) {\displaystyle \max _{\alpha }\sum _{i=1}^{n}\alpha _{i}\kappa (x_{i},x_{i})-\sum _{i,j=1}^{n}\alpha _{i}\alpha _{j}\kappa (x_{i},x_{j})} subject to, ∑ i = 1 n α i = 1 and 0 ≤ α i ≤ 1 ν n for all i = 1 , 2 , . . . , n . {\displaystyle \sum _{i=1}^{n}\alpha _{i}=1{\text{ and }}0\leq \alpha _{i}\leq {\frac {1}{\nu n}}{\text{for all }}i=1,2,...,n.} The introduction of kernel function provide additional flexibility to the One-class SVM (OSVM) algorithm. === PU (Positive Unlabeled) learning === A similar problem is PU learning, in which a binary classifier is constructed by semi-supervised learning from only positive and unlabeled sample points. In PU learning, two sets of examples are assumed to be available for training: the positive set P {\displaystyle P} and a mixed set U {\displaystyle U} , which is assumed to contain both positive and negative samples, but without these being labeled as such. This contrasts with other forms of semisupervised learning, where it is assumed that a labeled set containing examples of both classes is available in addition to unlabeled samples. A variety of techniques exist to adapt supervised classifiers to the PU learning setting, including variants of the EM algorithm. PU learning has been successfully applied to text, time series, bioinformatics tasks, and remote sensing data. == Approaches == Several approaches have been proposed to solve one-class classification (OCC). The approaches can be distinguished into three main categories, density estimation, boundary methods, and reconstruction methods. === Density estimation methods === Density estimation methods rely on estimating the density of the data points, and set the threshold. These methods rely on assuming distributions, such as Gaussian, or a Poisson distribution. Following which discordancy tests can be used to test the new objects. These methods are robust to scale variance. Gaussian model is one of the simplest methods to create one-class classifiers. Due to Central Limit Theorem (CLT), these methods work best when large number of samples are present, and they are perturbed by small independent error values. The probability distribution for a d-dimensional object is given by: p N ( z ; μ ; Σ ) = 1 ( 2 π ) d 2 | Σ | 1 2 exp ⁡ { − 1 2 ( z − μ ) T Σ − 1 ( z − μ ) } {\displaystyle p_{\mathcal {N}}(z;\mu ;\Sigma )={\frac {1}{(2\pi )^{\frac {d}{2}}|\Sigma |^{\frac {1}{2}}}}\exp \left\{-{\frac {1}{2}}(z-\mu )^{T}\Sigma ^{-1}(z-\mu )\right\}} Where, μ {\displaystyle \mu } is the mean and Σ {\displaystyle \Sigma } is the covariance matrix. Computing the inverse of covariance matrix ( Σ − 1 {\displaystyle \Sigma ^{-1}} ) is the costliest operation, and in the cases where the data is not scaled properly, or data has singular directions pseudo-inverse Σ + {\displaystyle \Sigma ^{+}} is used to approximate the inverse, and is calculated as Σ T ( Σ Σ T ) − 1 {\displaystyle \Sigma ^{T}(\Sigma \Sigma ^{T})^{-1}} . === Boundary methods === Boundary methods focus on setting boundaries around a few set of points, called target points. These methods attempt to optimize the volume. Boundary methods rely on distances, and hence are not robust to scale variance. K-centers method, NN-d, and SVDD are some of the key examples. K-centers In K-center algorithm, k {\displaystyle k} small balls with equal radius are placed to minimize the maximum distance of all minimum distances between training objects and the centers. Formally, the following error is minimized, ε k − c e n t e r = max i ( min k | | x i − μ k | | 2 ) {\displaystyle \varepsilon _{k-center}=\max _{i}(\min _{k}||x_{i}-\mu _{k}||^{2})} The algorithm uses forward search method with random initialization, where the radius is determined by the maximum distance of the object, any given ball should capture. After the centers are determined, for any given test object z {\displaystyle z} the distance can be calculated as, d k − c e n t r ( z ) = min k | | z − μ k | | 2 {\displaystyle d_{k-centr}(z)=\min _{k}||z-\mu _{k}||^{2}} === Reconstruction methods === Reconstruction methods use prior knowledge and generating process to build a generating model that best fits the data. New objects can be described in terms of a state of the generating model. Some examples of reconstruction methods for OCC are, k-means clustering, learning vector quantization, self-organizing maps, etc. == Applications == === Document classification === The basic Support Vector Machine (SVM) paradigm is trained using both positive and negative examples, however studies have shown there are many valid reasons for using only positive examples. When the SVM algorithm is modified to only use positive examples, the process is considered one-class classification. One situation where this type of classification might prove useful to the SVM paradigm is in trying to identify a web browser's sites of interest based only off of the user's browsing history. === Biomedical studies === One-class classification can be particularly useful in biomedical studies where often data from other classes can be difficult or impossible to obtain. In studying biomedical data it can be difficult and/or expensive to obtain the set of labeled data from the second class that would be necessary to perform a two-class classification. A study from The Scientific World Journal found that the typicality approach is the most useful in analysing biomedical data because it can be applied to any type of dataset (continuous, discrete, or nominal). The typicality approach is based on the clustering of data by examining data and placing it into new or existing clusters. To apply typicality to one-class classification for biomedical studies, each new observation, y 0 {\displaystyle y_{0}} , is compared to the target class, C {\displaystyle C} , and identified as an outlier or a member of the target class. === Unsupervised Concept Drift Detection === One-class classification has similarities with unsupervised concept drift detection, where both aim to identify whether the unseen data share similar characteristics to the initial data. A concept is referred to as the fixed probability distribution which data is drawn from. In unsupervised concept drift detection, the goal is to detect if the data distribution changes without utilizing class labels. In one-class classification, the flow of data is not important. Unseen data is classified as typical or outlier depending on its characteristics, whether it is from the initi

    Read more →
  • Generalized blockmodeling of valued networks

    Generalized blockmodeling of valued networks

    Generalized blockmodeling of valued networks is an approach of the generalized blockmodeling, dealing with valued networks (e.g., non-binary). While the generalized blockmodeling signifies a "formal and integrated approach for the study of the underlying functional anatomies of virtually any set of relational data", it is in principle used for binary networks. This is evident from the set of ideal blocks, which are used to interpret blockmodels, that are binary, based on the characteristic link patterns. Because of this, such templates are "not readily comparable with valued empirical blocks". To allow generalized blockmodeling of valued directional (one-mode) networks (e.g. allowing the direct comparisons of empirical valued blocks with ideal binary blocks), a non–parametric approach is used. With this, "an optional parameter determines the prominence of valued ties as a minimum percentile deviation between observed and expected flows". Such two–sided application of parameter then introduces "the possibility of non–determined ties, i.e. valued relations that are deemed neither prominent (1) nor non–prominent (0)." Resulted occurrences of links then motivate the modification of the calculation of inconsistencies between empirical and ideal blocks. At the same time, such links also give a possibility to measure the interpretational certainty, which is specific to each ideal block. Such maximum two–sided deviation threshold, holding the aggregate uncertainty score at zero or near–zero levels, is then proposed as "a measure of interpretational certainty for valued blockmodels, in effect transforming the optional parameter into an outgoing state". Problem with blockmodeling is the standard set of ideal block, as they are all specified using binary link (tie) patters; this results in "a non–trivial exercise to match and count inconsistencies between such ideal binary ties and empirical valued ties". One approach to solve this is by using dichotomization to transform the network into a binary version. The other two approaches were first proposed by Aleš Žiberna in 2007 by introducing valued (generalized) blockmodeling and also homogeneity blockmodeling. The basic idea of the latter is "that the inconsistency of an empirical block with its ideal block can be measured by within block variability of appropriate values". The newly–formed ideal blocks, which are appropriate for blockmodeling of valued networks, are then presented together with the definitions of their block inconsistencies. Two other approaches were later suggested by Carl Nordlund in 2019: deviational approach and correlation-based generalized approach. Both Nordlund's approaches are based on the idea, that valued networks can be compared with the ideal block without values. With this approach, more information is retained for analysis, which also means, that there are fewer partitions having identical values of the criterion function. This means, that the generalized blockmodeling of valued networks measures the inconsistencies more precisely. Usually, only one optimal partition is found in this approach, especially when it is used by homogeneity blockmodeling. Contrary, while using binary blockmodeling on the same sample, usually more than one optimal partition had occurred on several occasions.

    Read more →
  • Averbis

    Averbis

    Averbis has a focus on healthcare, pharma, automotive and intellectual property analytics. Averbis is involved in various research projects of the German Federal Ministry of Economics and Energy and the European Union such as DebugIT, EUCases, Mantra and SEMCARE. In addition to these projects, Averbis was also involved in the following projects: Greenpilot is a virtual library, which provides technical information in the fields of nutrition, environment and agriculture. Medpilot is a virtual library, which provides information about medicine and related sciences. In 2013, Averbis has been nominated for the German Founder Prize 2013. Averbis GmbH provides text analytics and text mining software to transform unstructured text into actionable information. It was founded in 2007 by IT experts after years of relevant scientific experience in the field of text mining and multilingual information retrieval. Averbis works in the field of terminology management, natural language processing, machine learning and semantic search. Its text mining software is embedded into the text mining framework UIMA.

    Read more →
  • Vladimir Batagelj

    Vladimir Batagelj

    Vladimir Batagelj (born June 14, 1948 in Idrija, Yugoslavia) is a Slovenian mathematician and an emeritus professor of mathematics at the University of Ljubljana. He is known for his work in discrete mathematics and combinatorial optimization, particularly analysis of social networks and other large networks (blockmodeling). == Education and career == Vladimir Batagelj completed his Ph.D. at the University of Ljubljana in 1986 under the direction of Tomaž Pisanski. He stayed at the University of Ljubljana as a professor until his retirement, where he was a professor of sociology and statistics, while also being a chair of the Department of Sociology of the Faculty of Social Sciences. As visiting professor, he was taught at the University of Pittsburgh (1990-91) and at the University of Konstanz (2002). He was also a member of editorial boards of two journals: Informatica and Journal of Social Structure. His work has been cited over 11000 times. His book Exploratory Social Network Analysis with Pajek on blockmodeling, coauthored with Wouter de Nooy and Andrej Mrvar, is Batagelj's most cited work and has over 3300 citations. The book was translated into Chinese and Japanese. The revised and expanded third edition has been published by Cambridge University Press. In 1975, 11 years before completing his PhD, Batagelj published a solo paper in Communications of the ACM. Batagelj authored more than 20 textbooks in Slovenian, covering topics like TeX, combinatorics and discrete mathematics. He has also written extensively in the Slovenian popular science journal Presek. Batagelj has advised 9 Ph.D. students. == Pajek == Batagelj is particularly known for his work on Pajek, a freely available software for analysis and visualization of large networks. He began work on Pajek in 1996 with Andrej Mrvar, who was then his PhD student. == Awards and honors == First prizes for contributions (with Andrej Mrvar) to Graph Drawing Contests in years: 1995, 1996, 1997, 1998, 1999, 2000 and 2005 / Graph Drawing Hall of Fame. In 2007 the book Generalized blockmodeling was awarded the Harrison White Outstanding Book Award by the Mathematical Sociology Section of American Sociological Association In 2007 he was awarded (together with Anuška Ferligoj) the Simmel Award by INSNA. In 2013, Vladimir Batagelj and Andrej Mrvar received the INSNA's William D. Richards Software award for their work on Pajek. == Selected bibliography == Vladimir Batagelj, Social Network Analysis, Large-Scale [1]. in R.A. Meyers, ed., Encyclopedia of Complexity and Systems Science, Springer 2009: 8245–8265. Vladimir Batagelj, Complex Networks, Visualization of [2]. in R.A. Meyers, ed., Encyclopedia of Complexity and Systems Science, Springer 2009: 1253–1268. Wouter de Nooy, Andrej Mrvar, Vladimir Batagelj, Mark Granovetter (Series Editor), Exploratory Social Network Analysis with Pajek (Structural Analysis in the Social Sciences), Cambridge University Press 2005 (ISBN 0-521-60262-9). ESNA in Japanese, TDU, 2010. Patrick Doreian, Vladimir Batagelj, Anuška Ferligoj, Mark Granovetter (Series Editor), Generalized Blockmodeling (Structural Analysis in the Social Sciences), Cambridge University Press 2004 (ISBN 0-521-84085-6)

    Read more →
  • Population model (evolutionary algorithm)

    Population model (evolutionary algorithm)

    The population model of an evolutionary algorithm (EA) describes the structural properties of its population to which its members are subject. A population is the set of all proposed solutions of an EA considered in one iteration, which are also called individuals according to the biological role model. The individuals of a population can generate further individuals as offspring with the help of the genetic operators of the procedure. The simplest and widely used population model in EAs is the global or panmictic model, which corresponds to an unstructured population. It allows each individual to choose any other individual of the population as a partner for the production of offspring by crossover, whereby the details of the selection are irrelevant as long as the fitness of the individuals plays a significant role. Due to global mate selection, the genetic information of even slightly better individuals can prevail in a population after a few generations (iteration of an EA), provided that no better other offspring have emerged in this phase. If the solution found in this way is not the optimum sought, that is called premature convergence. This effect can be observed more often in panmictic populations. In nature global mating pools are rarely found. What prevails is a certain and limited isolation due to spatial distance. The resulting local neighbourhoods initially evolve independently and mutants have a higher chance of persisting over several generations. As a result, genotypic diversity in the gene pool is preserved longer than in a panmictic population. It is therefore obvious to divide the previously global population by substructures. Two basic models were introduced for this purpose, the island models, which are based on a division of the population into fixed subpopulations that exchange individuals from time to time, and the neighbourhood models, which assign individuals to overlapping neighbourhoods, also known as cellular genetic or evolutionary algorithms (cGA or cEA). The associated division of the population also suggests a corresponding parallelization of the procedure. For this reason, the topic of population models is also frequently discussed in the literature in connection with the parallelization of EAs. == Island models == In the island model, also called the migration model or coarse grained model, evolution takes place in strictly divided subpopulations. These can be organised panmictically, but do not have to be. From time to time an exchange of individuals takes place, which is called migration. The time between an exchange is called an epoch and its end can be triggered by various criteria: E.g. after a given time or given number of completed generations, or after the occurrence of stagnation. Stagnation can be detected, for example, by the fact that no fitness improvement has occurred in the island for a given number of generations. Island models introduce a variety of new strategy parameters: Number of subpopulations Size of the subpopulations Neighbourhood relations between islands: they determine which islands are considered neighbouring and can thus exchange individuals, see picture of a simple unidirectional ring (black arrows) and its extension by additional bidirectional neighbourhood relations (additional green arrows) Criteria for the termination of an epoch, synchronous or asynchronous migration Migration rate: number or proportion of individuals involved in migration. Migrant selection: There are many alternatives for this. E.g. the best individuals can replace the worst or randomly selected ones. Depending on the migration rate, this can affect one or more individuals at a time. With these parameters, the selection pressure can be influenced to a considerable extent. For example, it increases with the interconnectedness of the islands and decreases with the number of subpopulations or the epoch length. == Neighbourhood models or cellular evolutionary algorithms == The neighbourhood model, also called diffusion model or fine grained model, defines a topological neighbouhood relation between the individuals of a population that is independent of their phenotypic properties. The fundamental idea of this model is to provide the EA population with a special structure defined as a connected graph, in which each vertex is an individual that communicates with its nearest neighbours. Particularly, individuals are conceptually set in a toroidal mesh, and are only allowed to recombine with close individuals. This leads to a kind of locality known as isolation by distance. The set of potential mates of an individual is called its neighbourhood or deme. The adjacent figure illustrates that by showing two slightly overlapping neighbourhoods of two individuals marked yellow, through which genetic information can spread between the two demes. It is known that in this kind of algorithm, similar individuals tend to cluster and create niches that are independent of the deme boundaries and, in particular, can be larger than a deme. There is no clear borderline between adjacent groups, and close niches could be easily colonized by competitive ones and maybe merge solution contents during this process. Simultaneously, farther niches can be affected more slowly. EAs with this type of population are also well known as cellular EAs (cEA) or cellular genetic algorithms (cGA). A commonly used structure for arranging the individuals of a population is a 2D toroidal grid, although the number of dimensions can be easily extended (to 3D) or reduced (to 1D, e.g. a ring, see the figure on the right). The neighbourhood of a particular individual in the grid is defined in terms of the Manhattan distance from it to others in the population. In the basic algorithm, all the neighbourhoods have the same size and identical shapes. The two most commonly used neighbourhoods for two-dimensional cEAs are L5 and C9, see the figure on the left. Here, L stands for Linear while C stands for Compact. Each deme represents a panmictic subpopulation within which mate selection and the acceptance of offspring takes place by replacing the parent. The rules for the acceptance of offspring are local in nature and based on the neighbourhood: for example, it can be specified that the best offspring must be better than the parent being replaced or, less strictly, only better than the worst individual in the deme. The first rule is elitist and creates a higher selective pressure than the second non-elitist rule. In elitist EAs, the best individual of a population always survives. In this respect, they deviate from the biological model. The overlap of the neighbourhoods causes a mostly slow spread of genetic information across the neighbourhood boundaries, hence the name diffusion model. A better offspring now needs more generations than in panmixy to spread in the population. This promotes the emergence of local niches and their local evolution, thus preserving genotypic diversity over a longer period of time. The result is a better and dynamic balance between breadth and depth search adapted to the search space during a run. Depth search takes place in the niches and breadth search in the niche boundaries and through the evolution of the different niches of the whole population. For the same neighbourhood size, the spread of genetic information is larger for elongated figures like L9 than for a block like C9, and again significantly larger than for a ring. This means that ring neighbourhoods are well suited for achieving high quality results, even if this requires comparatively long run times. On the other hand, if one is primarily interested in fast and good, but possibly suboptimal results, 2D topologies are more suitable. == Comparison == When applying both population models to genetic algorithms, evolutionary strategy and other EAs, the splitting of a total population into subpopulations usually reduces the risk of premature convergence and leads to better results overall more reliably and faster than would be expected with panmictic EAs. Island models have the disadvantage compared to neighbourhood models that they introduce a large number of new strategy parameters. Despite the existing studies on this topic in the literature, a certain risk of unfavourable settings remains for the user. With neighbourhood models, on the other hand, only the size of the neighbourhood has to be specified and, in the case of the two-dimensional model, the choice of the neighbourhood figure is added. == Parallelism == Since both population models imply population partitioning, they are well suited as a basis for parallelizing an EA. This applies even more to cellular EAs, since they rely only on locally available information about the members of their respective demes. Thus, in the extreme case, an independent execution thread can be assigned to each individual, so that the entire cEA can run on a parallel hardware platform. The island model also supports p

    Read more →
  • Teacher forcing

    Teacher forcing

    Teacher forcing is an algorithm for training the weights of recurrent neural networks (RNNs). It involves feeding observed sequence values (i.e. ground-truth samples) back into the RNN after each step, thus forcing the RNN to stay close to the ground-truth sequence. The term "teacher forcing" can be motivated by comparing the RNN to a human student taking a multi-part exam where the answer to each part (for example a mathematical calculation) depends on the answer to the preceding part. In this analogy, rather than grading every answer in the end, with the risk that the student fails every single part even though they only made a mistake in the first one, a teacher records the score for each individual part and then tells the student the correct answer, to be used in the next part. The use of an external teacher signal is in contrast to real-time recurrent learning (RTRL). Teacher signals are known from oscillator networks. The promise is, that teacher forcing helps to reduce the training time. The term "teacher forcing" was introduced in 1989 by Ronald J. Williams and David Zipser, who reported that the technique was already being "frequently used in dynamical supervised learning tasks" around that time. A NeurIPS 2016 paper introduced the related method of "professor forcing".

    Read more →
  • Open Data Center Alliance

    Open Data Center Alliance

    opendatacenteralliance.org appears to have been closed down. The Open Data Center Alliance is an independent organization created in Oct. 2010 with the assistance of Intel to coordinate the development of standards for cloud computing. Approximately 100 companies, which account for more than $50bn of IT spending, have joined the Alliance, including BMW, Royal Dutch Shell and Marriott Hotels. "The Alliance's Cloud 2015 vision is aimed at creating a federated cloud where common standards will be laid down for those in the hardware and software arena." == Usage Model Roadmap == The organization sees a growing need for solutions developed in an open, industry-standard and multivendor fashion, and has thus created a usage model roadmap featuring 19 prioritized usage models. The usage models provide detailed requirements for data center and cloud solutions, and will include detailed technical documentation discussing the requirements for technology deployments. To further its roadmap development, the steering committee established five initial technical workgroups in the areas of infrastructure, management, regulation & ecosystem, security and services. The organization delivered a 0.50 usage model roadmap to Open Data Center Alliance technical workgroups in Oct. 2010, and delivered a full 1.0 roadmap for public use in June 2011. == Membership == The steering committee consists of BMW, Capgemini, China Life, China Unicom Group, Deutsche Bank, JPMorgan Chase, Lockheed Martin, Marriott International, Inc., National Australia Bank, Royal Dutch Shell, Terremark and UBS. Other members include AT&T, CERN, eBay, Logica, Motorola Mobility Inc. and Nokia. "The demands on the IT organisations are coming at such an alarming rate that there are many, many different solutions being developed today that maybe don't work with each other. We need one voice, one road map, so that companies are able to say to manufacturers here is a clear vision of what they should be developing their product to do." says Marvin Wheeler, of Terremark, chairman of the Alliance. "While it's unclear how successful this alliance will be, it is at least shedding the spotlight on cloud interoperability, a big emerging issue," said Larry Dignan of ZDNet.

    Read more →
  • Farthest-first traversal

    Farthest-first traversal

    In computational geometry, the farthest-first traversal of a compact metric space is a sequence of points in the space, where the first point is selected arbitrarily and each successive point is as far as possible from the set of previously-selected points. The same concept can also be applied to a finite set of geometric points, by restricting the selected points to belong to the set or equivalently by considering the finite metric space generated by these points. For a finite metric space or finite set of geometric points, the resulting sequence forms a permutation of the points, also known as the greedy permutation. Every prefix of a farthest-first traversal provides a set of points that is widely spaced and close to all remaining points. More precisely, no other set of equally many points can be spaced more than twice as widely, and no other set of equally many points can be less than half as far to its farthest remaining point. In part because of these properties, farthest-point traversals have many applications, including the approximation of the traveling salesman problem and the metric k-center problem. They may be constructed in polynomial time, or (for low-dimensional Euclidean spaces) approximated in near-linear time. == Definition and properties == A farthest-first traversal is a sequence of points in a compact metric space, with each point appearing at most once. If the space is finite, each point appears exactly once, and the traversal is a permutation of all of the points in the space. The first point of the sequence may be any point in the space. Each point p after the first must have the maximum possible distance to the set of points earlier than p in the sequence, where the distance from a point to a set is defined as the minimum of the pairwise distances to points in the set. A given space may have many different farthest-first traversals, depending both on the choice of the first point in the sequence (which may be any point in the space) and on ties for the maximum distance among later choices. Farthest-point traversals may be characterized by the following properties. Fix a number k, and consider the prefix formed by the first k points of the farthest-first traversal of any metric space. Let r be the distance between the final point of the prefix and the other points in the prefix. Then this subset has the following two properties: All pairs of the selected points are at distance at least r from each other, and All points of the metric space are at distance at most r from the subset. Conversely any sequence having these properties, for all choices of k, must be a farthest-first traversal. These are the two defining properties of a Delone set, so each prefix of the farthest-first traversal forms a Delone set. == Applications == Rosenkrantz, Stearns & Lewis (1977) used the farthest-first traversal to define the farthest-insertion heuristic for the travelling salesman problem. This heuristic finds approximate solutions to the travelling salesman problem by building up a tour on a subset of points, adding one point at a time to the tour in the ordering given by a farthest-first traversal. To add each point to the tour, one edge of the previous tour is broken and replaced by a pair of edges through the added point, in the cheapest possible way. Although Rosenkrantz et al. prove only a logarithmic approximation ratio for this method, they show that in practice it often works better than other insertion methods with better provable approximation ratios. Later, the same sequence of points was popularized by Gonzalez (1985), who used it as part of greedy approximation algorithms for two problems in clustering, in which the goal is to partition a set of points into k clusters. One of the two problems that Gonzalez solve in this way seeks to minimize the maximum diameter of a cluster, while the other, known as the metric k-center problem, seeks to minimize the maximum radius, the distance from a chosen central point of a cluster to the farthest point from it in the same cluster. For instance, the k-center problem can be used to model the placement of fire stations within a city, in order to ensure that every address within the city can be reached quickly by a fire truck. For both clustering problems, Gonzalez chooses a set of k cluster centers by selecting the first k points of a farthest-first traversal, and then creates clusters by assigning each input point to the nearest cluster center. If r is the distance from the set of k selected centers to the next point at position k + 1 in the traversal, then with this clustering every point is within distance r of its center and every cluster has diameter at most 2r. However, the subset of k centers together with the next point are all at distance at least r from each other, and any k-clustering would put some two of these points into a single cluster, with one of them at distance at least r/2 from its center and with diameter at least r. Thus, Gonzalez's heuristic gives an approximation ratio of 2 for both clustering problems. Gonzalez's heuristic was independently rediscovered for the metric k-center problem by Dyer & Frieze (1985), who applied it more generally to weighted k-center problems. Another paper on the k-center problem from the same time, Hochbaum & Shmoys (1985), achieves the same approximation ratio of 2, but its techniques are different. Nevertheless, Gonzalez's heuristic, and the name "farthest-first traversal", are often incorrectly attributed to Hochbaum and Shmoys. For both the min-max diameter clustering problem and the metric k-center problem, these approximations are optimal: the existence of a polynomial-time heuristic with any constant approximation ratio less than 2 would imply that P = NP. As well as for clustering, the farthest-first traversal can also be used in another type of facility location problem, the max-min facility dispersion problem, in which the goal is to choose the locations of k different facilities so that they are as far apart from each other as possible. More precisely, the goal in this problem is to choose k points from a given metric space or a given set of candidate points, in such a way as to maximize the minimum pairwise distance between the selected points. Again, this can be approximated by choosing the first k points of a farthest-first traversal. If r denotes the distance of the kth point from all previous points, then every point of the metric space or the candidate set is within distance r of the first k − 1 points. By the pigeonhole principle, some two points of the optimal solution (whatever it is) must both be within distance r of the same point among these first k − 1 chosen points, and (by the triangle inequality) within distance 2r of each other. Therefore, the heuristic solution given by the farthest-first traversal is within a factor of two of optimal. Other applications of the farthest-first traversal include color quantization (clustering the colors in an image to a smaller set of representative colors), progressive scanning of images (choosing an order to display the pixels of an image so that prefixes of the ordering produce good lower-resolution versions of the whole image rather than filling in the image from top to bottom), point selection in the probabilistic roadmap method for motion planning, simplification of point clouds, generating masks for halftone images, hierarchical clustering, finding the similarities between polygon meshes of similar surfaces, choosing diverse and high-value observation targets for underwater robot exploration, fault detection in sensor networks, modeling phylogenetic diversity, matching vehicles in a heterogenous fleet to customer delivery requests, uniform distribution of geodetic observatories on the Earth's surface or of other types of sensor network, generation of virtual point lights in the instant radiosity computer graphics rendering method, and geometric range searching data structures. == Algorithms == === Greedy exact algorithm === The farthest-first traversal of a finite point set may be computed by a greedy algorithm that maintains the distance of each point from the previously selected points, performing the following steps: Initialize the sequence of selected points to the empty sequence, and the distances of each point to the selected points to infinity. While not all points have been selected, repeat the following steps: Scan the list of not-yet-selected points to find a point p that has the maximum distance from the selected points. Remove p from the not-yet-selected points and add it to the end of the sequence of selected points. For each remaining not-yet-selected point q, replace the distance stored for q by the minimum of its old value and the distance from p to q. For a set of n points, this algorithm takes O(n2) steps and O(n2) distance computations. === Approximations === A faster approximation algorithm, given by Har-Peled & Mendel (2006), applie

    Read more →
  • Dispersive flies optimisation

    Dispersive flies optimisation

    Dispersive flies optimisation (DFO) is a bare-bones swarm intelligence algorithm which is inspired by the swarming behaviour of flies hovering over food sources. DFO is a simple optimiser which works by iteratively trying to improve a candidate solution with regard to a numerical measure that is calculated by a fitness function. Each member of the population, a fly or an agent, holds a candidate solution whose suitability can be evaluated by their fitness value. Optimisation problems are often formulated as either minimisation or maximisation problems. DFO was introduced with the intention of analysing a simplified swarm intelligence algorithm with the fewest tunable parameters and components. In the first work on DFO, this algorithm was compared against a few other existing swarm intelligence techniques using error, efficiency and diversity measures. It is shown that despite the simplicity of the algorithm, which only uses agents’ position vectors at time t to generate the position vectors for time t + 1, it exhibits a competitive performance. Since its inception, DFO has been used in a variety of applications including medical imaging and image analysis as well as data mining and machine learning. == Algorithm == DFO bears many similarities with other existing continuous, population-based optimisers (e.g. particle swarm optimization and differential evolution). In that, the swarming behaviour of the individuals consists of two tightly connected mechanisms, one is the formation of the swarm and the other is its breaking or weakening. DFO works by facilitating the information exchange between the members of the population (the swarming flies). Each fly x {\displaystyle \mathbf {x} } represents a position in a d-dimensional search space: x = ( x 1 , x 2 , … , x d ) {\displaystyle \mathbf {x} =(x_{1},x_{2},\ldots ,x_{d})} , and the fitness of each fly is calculated by the fitness function f ( x ) {\displaystyle f(\mathbf {x} )} , which takes into account the flies' d dimensions: f ( x ) = f ( x 1 , x 2 , … , x d ) {\displaystyle f(\mathbf {x} )=f(x_{1},x_{2},\ldots ,x_{d})} . The pseudocode below represents one iteration of the algorithm: for i = 1 : N flies x i . fitness = f ( x i ) {\displaystyle \mathbf {x_{i}} .{\text{fitness}}=f(\mathbf {x} _{i})} end for i x s {\displaystyle \mathbf {x} _{s}} = arg min [ f ( x i ) ] , i ∈ { 1 , … , N } {\textstyle [f(\mathbf {x} _{i})],\;i\in \{1,\ldots ,N\}} for i = 1 : N and i ≠ s {\displaystyle i\neq s} for d = 1 : D dimensions if U ( 0 , 1 ) < Δ {\displaystyle U(0,1)<\Delta } x i d t + 1 = U ( x min , d , x max , d ) {\displaystyle x_{id}^{t+1}=U(x_{\min ,d},x_{\max ,d})} else x i d t + 1 = x i n d t + U ( 0 , 1 ) ( x s d t − x i d t ) {\displaystyle x_{id}^{t+1}=x_{i_{nd}}^{t}+U(0,1)(x_{sd}^{t}-x_{id}^{t})} end if end for d end for i In the algorithm above, x i d t + 1 {\displaystyle x_{id}^{t+1}} represents fly i {\displaystyle i} at dimension d {\displaystyle d} and time t + 1 {\displaystyle t+1} ; x i n d t {\displaystyle x_{i_{nd}}^{t}} presents x i {\displaystyle x_{i}} 's best neighbouring fly in ring topology (left or right, using flies indexes), at dimension d {\displaystyle d} and time t {\displaystyle t} ; and x s d t {\displaystyle x_{sd}^{t}} is the swarm's best fly. Using this update equation, the swarm's population update depends on each fly's best neighbour (which is used as the focus μ {\displaystyle \mu } , and the difference between the current fly and the best in swarm represents the spread of movement, σ {\displaystyle \sigma } ). Other than the population size N {\displaystyle N} , the only tunable parameter is the disturbance threshold Δ {\displaystyle \Delta } , which controls the dimension-wise restart in each fly vector. This mechanism is proposed to control the diversity of the swarm. Other notable minimalist swarm algorithm is Bare bones particle swarms (BB-PSO), which is based on particle swarm optimisation, along with bare bones differential evolution (BBDE) which is a hybrid of the bare bones particle swarm optimiser and differential evolution, aiming to reduce the number of parameters. Alhakbani in her PhD thesis covers many aspects of the algorithms including several DFO applications in feature selection as well as parameter tuning. == Applications == Some of the recent applications of DFO are listed below: Optimising support vector machine kernel to classify imbalanced data Quantifying symmetrical complexity in computational aesthetics Analysing computational autopoiesis and computational creativity Identifying calcifications in medical images Building non-identical organic structures for game's space development Deep Neuroevolution: Training Deep Neural Networks for False Alarm Detection in Intensive Care Units Identification of animation key points from 2D-medialness maps

    Read more →
  • Local tangent space alignment

    Local tangent space alignment

    Local tangent space alignment (LTSA) is a method for manifold learning, which can efficiently learn a nonlinear embedding into low-dimensional coordinates from high-dimensional data, and can also reconstruct high-dimensional coordinates from embedding coordinates. It is based on the intuition that when a manifold is correctly unfolded, all of the tangent hyperplanes to the manifold will become aligned. It begins by computing the k-nearest neighbors of every point. It computes the tangent space at every point by computing the d-first principal components in each local neighborhood. It then optimizes to find an embedding that aligns the tangent spaces, but it ignores the label information conveyed by data samples, and thus can not be used for classification directly.

    Read more →
  • Robot learning

    Robot learning

    Robot learning is a research field at the intersection of machine learning and robotics. It studies techniques allowing a robot to acquire novel skills or adapt to its environment through learning algorithms. The embodiment of the robot, situated in a physical embedding, provides at the same time specific difficulties (e.g. high-dimensionality, real time constraints for collecting data and learning) and opportunities for guiding the learning process (e.g. sensorimotor synergies, motor primitives). Example of skills that are targeted by learning algorithms include sensorimotor skills such as locomotion, grasping, active object categorization, as well as interactive skills such as joint manipulation of an object with a human peer, and linguistic skills such as the grounded and situated meaning of human language. Learning can happen either through autonomous self-exploration or through guidance from a human teacher, like for example in robot learning by imitation. Robot learning can be closely related to adaptive control, reinforcement learning as well as developmental robotics which considers the problem of autonomous lifelong acquisition of repertoires of skills. While machine learning is frequently used by computer vision algorithms employed in the context of robotics, these applications are usually not referred to as "robot learning". == Imitation learning == Many research groups are developing techniques where robots learn by imitating. This includes various techniques for learning from demonstration (sometimes also referred to as "programming by demonstration") and observational learning. == Sharing learned skills and knowledge == In Tellex's "Million Object Challenge", the goal is robots that learn how to spot and handle simple items and upload their data to the cloud to allow other robots to analyze and use the information. RoboBrain is a knowledge engine for robots which can be freely accessed by any device wishing to carry out a task. The database gathers new information about tasks as robots perform them, by searching the Internet, interpreting natural language text, images, and videos, object recognition as well as interaction. The project is led by Ashutosh Saxena at Stanford University. RoboEarth is a project that has been described as a "World Wide Web for robots" − it is a network and database repository where robots can share information and learn from each other and a cloud for outsourcing heavy computation tasks. The project brings together researchers from five major universities in Germany, the Netherlands and Spain and is backed by the European Union. Google Research, DeepMind, and Google X have decided to allow their robots share their experiences. == Vision-language-action model == Research groups and companies are developing vision-language-action models, foundation models that allow robotic control through the combination of vision and language. Google DeepMind, Figure AI and Hugging Face are actively working on that.

    Read more →
  • Rectified linear unit

    Rectified linear unit

    In the context of artificial neural networks, the rectifier or ReLU (rectified linear unit) activation function is an activation function defined as the non-negative part of its argument, i.e., the ramp function: ReLU ⁡ ( x ) = x + = max ( 0 , x ) = x + | x | 2 = { x if x > 0 , 0 x ≤ 0 {\displaystyle \operatorname {ReLU} (x)=x^{+}=\max(0,x)={\frac {x+|x|}{2}}={\begin{cases}x&{\text{if }}x>0,\\0&x\leq 0\end{cases}}} where x {\displaystyle x} is the input to a neuron. This is analogous to half-wave rectification in electrical engineering. ReLU is one of the most popular activation functions for artificial neural networks, and finds application in computer vision and speech recognition using deep neural nets and computational neuroscience. == History == The ReLU was first used by Alston Householder in 1941 as a mathematical abstraction of biological neural networks. Kunihiko Fukushima in 1969 used ReLU in the context of visual feature extraction in hierarchical neural networks. In 1998, Gregory Woodbury demonstrated that the rectified linear function could account for a broad range of emergent properties in the visual cortex. His work showed that a single unified model could drive the joint development of refined retinotopic maps, ocular dominance columns, and orientation selectivity. By utilizing the rectifier's "cutoff" property, Woodbury achieved a close quantitative fit to biological data, matching the spatial periodicities and topographic refinement patterns observed in macaque and cat cortical maps. Furthermore, he extended this framework to adult plasticity, accurately replicating the spatial and temporal dynamics of lesion-induced cortical reorganization. This research established that the rectified linear response was a necessary mechanism for the stable self-organisation and maintenance of complex, multi-feature neural maps. In 2000, Hahnloser et al. argued that ReLU approximates the biological relationship between neural firing rates and input current, in addition to enabling recurrent neural network dynamics to stabilise under weaker criteria. Prior to 2010, most activation functions used were the logistic sigmoid (which is inspired by probability theory; see logistic regression) and its more numerically efficient counterpart, the hyperbolic tangent. Around 2010, the use of ReLU became common again. Jarrett et al. (2009) noted that rectification by either absolute or ReLU (which they called "positive part") was critical for object recognition in convolutional neural networks (CNNs), specifically because it allows average pooling without neighboring filter outputs cancelling each other out. They hypothesized that the use of sigmoid or tanh was responsible for poor performance in previous CNNs. Nair and Hinton (2010) made a theoretical argument that the softplus activation function should be used, in that the softplus function numerically approximates the sum of an exponential number of linear models that share parameters. They then proposed ReLU as a good approximation to it. Specifically, they began by considering a single binary neuron in a Boltzmann machine that takes x {\displaystyle x} as input, and produces 1 as output with probability σ ( x ) = 1 1 + e − x {\displaystyle \sigma (x)={\frac {1}{1+e^{-x}}}} . They then considered extending its range of output by making infinitely many copies of it X 1 , X 2 , X 3 , … {\displaystyle X_{1},X_{2},X_{3},\dots } , that all take the same input, offset by an amount 0.5 , 1.5 , 2.5 , … {\displaystyle 0.5,1.5,2.5,\dots } , then their outputs are added together as ∑ i = 1 ∞ X i {\displaystyle \sum _{i=1}^{\infty }X_{i}} . They then demonstrated that ∑ i = 1 ∞ X i {\displaystyle \sum _{i=1}^{\infty }X_{i}} is approximately equal to N ( log ⁡ ( 1 + e x ) , σ ( x ) ) {\displaystyle {\mathcal {N}}(\log(1+e^{x}),\sigma (x))} , which is also approximately equal to ReLU ⁡ ( N ( x , σ ( x ) ) ) {\displaystyle \operatorname {ReLU} ({\mathcal {N}}(x,\sigma (x)))} , where N {\displaystyle {\mathcal {N}}} stands for the gaussian distribution. They also argued for another reason for using ReLU: that it allows "intensity equivariance" in image recognition. That is, multiplying input image by a constant k {\displaystyle k} multiplies the output also. In contrast, this is false for other activation functions like sigmoid or tanh. They found that ReLU activation allowed good empirical performance in restricted Boltzmann machines. Glorot et al (2011) argued that ReLU has the following advantages over sigmoid or tanh: ReLU is more similar to biological neurons' responses in their main operating regime. ReLU avoids vanishing gradients. ReLU is cheaper to compute. ReLU creates sparse representation naturally, because many hidden units output exactly zero for a given input. They also found empirically that deep networks trained with ReLU can achieve strong performance without unsupervised pre-training, especially on large, purely supervised tasks. In 2017, the rectified linear function became a central component of the transformer architecture introduced in the Vaswani et al paper "Attention Is All You Need". Within every transformer layer, ReLU is utilized in the position-wise feed-forward networks (FFN), defined by Equation 2 of their paper: FFN ⁡ ( x ) = max ( 0 , x W 1 + b 1 ) W 2 + b 2 {\displaystyle \operatorname {FFN} (x)=\max(0,xW_{1}+b_{1})W_{2}+b_{2}} This equation is foundational to the model's capacity; while the attention mechanism determines the relationships between tokens, the ReLU-based FFN performs the majority of the numerical computation and houses the bulk of the model's parameters. The efficiency and scalability of this rectified framework triggered a global technological revolution, enabling the development of Large Language Models that have had a profound economic impact. The industrial response to this architecture—including the massive expansion of AI-specific hardware and the birth of the generative AI sector—has positioned the Transformer as a cornerstone of 21st-century infrastructure. During the post 2017 period of rapid AI advancement, the rectified linear unit function has been key to achieving increased model performance and scaling due to the fact that it zeros out responses that are immaterial for a given stimuli, preventing them from accumulating in massive scale models. It is the complete silencing of the parts of the model found to be stimuli-irrelevant during learning that allows for scaling. As the stimuli-irrelevant proportion of the model becomes more massive, these highly numerous connections within the model would inevitably accumulate during scaling no matter how small each individual response is. Therefore, the rectified linear unit function, with its absolute zeroing property, enabled the scaling to hundred billion parameter models and beyond. Early Transformer scaling giants like GPT-3 (2020) and Falcon-180B (2023) relied on the rectified linear unit function explicitly, while successors such as GPT-4 (2023) and Llama 3 (2024) utilized smoother variants like GELU or SwiGLU. These variants were used to improve training stability while fundamentally preserving the rectified principle of zeroing low responses. At the centre of modern artificial intelligence ReLU and its variants maintain absolute zero response across the bulk of the model at any one time, while maintaining approximately linear reponses for stimuli-relevant connections enabling high performance on each specific cognitive task. This feature of activation sparsity has been critical for massive scaling and performance gains of AI models right up to the present day. == Advantages == Advantages of ReLU include: Sparse activation: for example, in a randomly initialized network, only about 50% of hidden units are activated (i.e. have a non-zero output). Better gradient propagation: fewer vanishing gradient problems compared to sigmoidal activation functions that saturate in both directions. Efficiency: only requires comparison and addition. Scale-invariant (homogeneous, or "intensity equivariance"): max ( 0 , a x ) = a max ( 0 , x ) for a ≥ 0 {\displaystyle \max(0,ax)=a\max(0,x){\text{ for }}a\geq 0} . == Potential problems == Possible downsides can include: Non-differentiability at zero (however, it is differentiable anywhere else, and the value of the derivative at zero can be chosen to be 0 or 1 arbitrarily). Not zero-centered: ReLU outputs are always non-negative. This can make it harder for the network to learn during backpropagation, because gradient updates tend to push weights in one direction (positive or negative). Batch normalization can help address this. ReLU is unbounded. Redundancy of the parametrization: Because ReLU is scale-invariant, the network computes the exact same function by scaling the weights and biases in front of a ReLU activation by k {\displaystyle k} , and the weights after by 1 / k {\displaystyle 1/k} . Dying ReLU: ReLU neurons can sometimes be pushed into states

    Read more →
  • Relief (feature selection)

    Relief (feature selection)

    Relief is an algorithm developed by Kenji Kira and Larry Rendell in 1992 that takes a filter-method approach to feature selection that is notably sensitive to feature interactions. It was originally designed for application to binary classification problems with discrete or numerical features. Relief calculates a feature score for each feature which can then be applied to rank and select top scoring features for feature selection. Alternatively, these scores may be applied as feature weights to guide downstream modeling. Relief feature scoring is based on the identification of feature value differences between nearest neighbor instance pairs. If a feature value difference is observed in a neighboring instance pair with the same class (a 'hit'), the feature score decreases. Alternatively, if a feature value difference is observed in a neighboring instance pair with different class values (a 'miss'), the feature score increases. The original Relief algorithm has since inspired a family of Relief-based feature selection algorithms (RBAs), including the ReliefF algorithm. Beyond the original Relief algorithm, RBAs have been adapted to (1) perform more reliably in noisy problems, (2) generalize to multi-class problems (3) generalize to numerical outcome (i.e. regression) problems, and (4) to make them robust to incomplete (i.e. missing) data. To date, the development of RBA variants and extensions has focused on four areas; (1) improving performance of the 'core' Relief algorithm, i.e. examining strategies for neighbor selection and instance weighting, (2) improving scalability of the 'core' Relief algorithm to larger feature spaces through iterative approaches, (3) methods for flexibly adapting Relief to different data types, and (4) improving Relief run efficiency. Their strengths are that they are not dependent on heuristics, they run in low-order polynomial time, and they are noise-tolerant and robust to feature interactions, as well as being applicable for binary or continuous data; however, it does not discriminate between redundant features, and low numbers of training instances fool the algorithm. == Relief Algorithm == Take a data set with n instances of p features, belonging to two known classes. Within the data set, each feature should be scaled to the interval [0 1] (binary data should remain as 0 and 1). The algorithm will be repeated m times. Start with a p-long weight vector (W) of zeros. At each iteration, take the feature vector (X) belonging to one random instance, and the feature vectors of the instance closest to X (by Euclidean distance) from each class. The closest same-class instance is called 'near-hit', and the closest different-class instance is called 'near-miss'. Update the weight vector such that W i = W i − ( x i − n e a r H i t i ) 2 + ( x i − n e a r M i s s i ) 2 , {\displaystyle W_{i}=W_{i}-(x_{i}-\mathrm {nearHit} _{i})^{2}+(x_{i}-\mathrm {nearMiss} _{i})^{2},} where i {\displaystyle i} indexes the components and runs from 1 to p. Thus the weight of any given feature decreases if it differs from that feature in nearby instances of the same class more than nearby instances of the other class, and increases in the reverse case. After m iterations, divide each element of the weight vector by m. This becomes the relevance vector. Features are selected if their relevance is greater than a threshold τ. Kira and Rendell's experiments showed a clear contrast between relevant and irrelevant features, allowing τ to be determined by inspection. However, it can also be determined by Chebyshev's inequality for a given confidence level (α) that a τ of 1/sqrt(αm) is good enough to make the probability of a Type I error less than α, although it is stated that τ can be much smaller than that. Relief was also described as generalizable to multinomial classification by decomposition into a number of binary problems. == ReliefF Algorithm == Kononenko et al. propose a number of updates to Relief. Firstly, they find the near-hit and near-miss instances using the Manhattan (L1) norm rather than the Euclidean (L2) norm, although the rationale is not specified. Furthermore, they found taking the absolute differences between xi and near-hiti, and xi and near-missi to be sufficient when updating the weight vector (rather than the square of those differences). === Reliable probability estimation === Rather than repeating the algorithm m times, implement it exhaustively (i.e. n times, once for each instance) for relatively small n (up to one thousand). Furthermore, rather than finding the single nearest hit and single nearest miss, which may cause redundant and noisy attributes to affect the selection of the nearest neighbors, ReliefF searches for k nearest hits and misses and averages their contribution to the weights of each feature. k can be tuned for any individual problem. === Incomplete data === In ReliefF, the contribution of missing values to the feature weight is determined using the conditional probability that two values should be the same or different, approximated with relative frequencies from the data set. This can be calculated if one or both features are missing. === Multi-class problems === Rather than use Kira and Rendell's proposed decomposition of a multinomial classification into a number of binomial problems, ReliefF searches for k near misses from each different class and averages their contributions for updating W, weighted with the prior probability of each class. == Other Relief-based Algorithm Extensions/Derivatives == The following RBAs are arranged chronologically from oldest to most recent. They include methods for improving (1) the core Relief algorithm concept, (2) iterative approaches for scalability, (3) adaptations to different data types, (4) strategies for computational efficiency, or (5) some combination of these goals. For more on RBAs see these book chapters or this most recent review paper. === RRELIEFF === Robnik-Šikonja and Kononenko propose further updates to ReliefF, making it appropriate for regression. === Relieved-F === Introduced deterministic neighbor selection approach and a new approach for incomplete data handling. === Iterative Relief === Implemented method to address bias against non-monotonic features. Introduced the first iterative Relief approach. For the first time, neighbors were uniquely determined by a radius threshold and instances were weighted by their distance from the target instance. === I-RELIEF === Introduced sigmoidal weighting based on distance from target instance. All instance pairs (not just a defined subset of neighbors) contributed to score updates. Proposed an on-line learning variant of Relief. Extended the iterative Relief concept. Introduced local-learning updates between iterations for improved convergence. === TuRF (a.k.a. Tuned ReliefF) === Specifically sought to address noise in large feature spaces through the recursive elimination of features and the iterative application of ReliefF. === Evaporative Cooling ReliefF === Similarly seeking to address noise in large feature spaces. Utilized an iterative `evaporative' removal of lowest quality features using ReliefF scores in association with mutual information. === EReliefF (a.k.a. Extended ReliefF) === Addressing issues related to incomplete and multi-class data. === VLSReliefF (a.k.a. Very Large Scale ReliefF) === Dramatically improves the efficiency of detecting 2-way feature interactions in very large feature spaces by scoring random feature subsets rather than the entire feature space. === ReliefMSS === Introduced calculation of feature weights relative to average feature 'diff' between instance pairs. === SURF === SURF identifies nearest neighbors (both hits and misses) based on a distance threshold from the target instance defined by the average distance between all pairs of instances in the training data. Results suggest improved power to detect 2-way epistatic interactions over ReliefF. === SURF (a.k.a. SURFStar) === SURF extends the SURF algorithm to not only utilized 'near' neighbors in scoring updates, but 'far' instances as well, but employing inverted scoring updates for 'far instance pairs. Results suggest improved power to detect 2-way epistatic interactions over SURF, but an inability to detect simple main effects (i.e. univariate associations). === SWRF === SWRF extends the SURF algorithm adopting sigmoid weighting to take distance from the threshold into account. Also introduced a modular framework for further developing RBAs called MoRF. === MultiSURF (a.k.a. MultiSURFStar) === MultiSURF extends the SURF algorithm adapting the near/far neighborhood boundaries based on the average and standard deviation of distances from the target instance to all others. MultiSURF uses the standard deviation to define a dead-band zone where 'middle-distance' instances do not contribute to scoring. Evidence suggests MultiSURF performs best in detecting pure 2-way feature interactions. === Reli

    Read more →