AI Chatbot Robot

AI Chatbot Robot — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Decision list

    Decision lists are a representation for Boolean functions that can be easily learned from examples. Single-term decision lists are more expressive than disjunctions and conjunctions; however, 1-term decision lists are less expressive than the general disjunctive normal form and the conjunctive normal form. The language specified by a k-length decision list includes as a subset the language specified by a k-depth decision tree. Learning decision lists can be used for attribute-efficient learning, a type of machine learning. == Definition == A decision list (DL) of length r is of the form: if f1 then output b1 else if f2 then output b2 ... else if fr then output br, where fi is the ith formula and bi is the ith Boolean for i ∈ {1, ..., r}. The last if-then-else is the default case, which means formula fr is always equal to true. A k-DL is a decision list where all of the formulas have at most k terms. Sometimes "decision list" is used to refer to a 1-DL, where all of the formulas are either a variable or its negation.
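
    The if/else-if chain in the definition can be sketched in a few lines of Python. This is only an illustration: the variable names a and b, the three rules, and the function name evaluate_decision_list are hypothetical and not part of the definition. Each rule pairs a formula with the Boolean to output, and the last formula is always true, which makes it the default case.

```python
from typing import Callable, Dict, List, Tuple

Example = Dict[str, bool]
Rule = Tuple[Callable[[Example], bool], bool]  # (formula f_i, output b_i)


def evaluate_decision_list(rules: List[Rule], x: Example) -> bool:
    """Return the output b_i of the first rule whose formula f_i holds for x."""
    for formula, output in rules:
        if formula(x):
            return output
    # A well-formed decision list ends with an always-true formula.
    raise ValueError("the last formula must always be true (default case)")


# Hypothetical 1-DL: every formula is a single variable or its negation.
rules: List[Rule] = [
    (lambda x: x["a"], True),       # if a then output True
    (lambda x: not x["b"], False),  # else if (not b) then output False
    (lambda x: True, True),         # default case: formula is always true
]

print(evaluate_decision_list(rules, {"a": False, "b": False}))
```

    With a and b both false, the first formula fails and the second one fires, so the call prints False. Rule order matters: swapping the first two rules would change the function being represented.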

  • List of text mining software

    Text mining computer programs are available from many commercial and open source companies and sources.
    == Commercial ==
    Angoss – Angoss Text Analytics provides entity and theme extraction, topic categorization, sentiment analysis and document summarization capabilities via the embedded
    AUTINDEX – a commercial text mining software package based on sophisticated linguistics by IAI (Institute for Applied Information Sciences), Saarbrücken.
    DigitalMR – social media listening and text+image analytics tool for market research.
    FICO Score – leading provider of analytics.
    General Sentiment – Social Intelligence platform that uses natural language processing to discover affinities between the fans of brands and the fans of traditional television shows in social media. Standalone text analytics to capture a social knowledge base on billions of topics stored to 2004.
    IBM LanguageWare – the IBM suite for text analytics (tools and Runtime).
    IBM SPSS – provider of Modeler Premium (previously called IBM SPSS Modeler and IBM SPSS Text Analytics), which contains advanced NLP-based text analysis capabilities (multi-lingual sentiment, event and fact extraction) that can be used in conjunction with Predictive Modeling. Text Analytics for Surveys provides the ability to categorize survey responses using NLP-based capabilities for further analysis or reporting.
    Inxight – provider of text analytics, search, and unstructured visualization technologies. (Inxight was bought by Business Objects, which was bought by SAP AG in 2008.)
    Language Computer Corporation – text extraction and analysis tools, available in multiple languages.
    Lexalytics – provider of a text analytics engine used in Social Media Monitoring, Voice of Customer, Survey Analysis, and other applications. Salience Engine. The software provides the unique capability of merging the output of unstructured, text-based analysis with structured data to provide additional predictive variables for improved predictive models and association analysis.
    Linguamatics – provider of natural language processing (NLP) based enterprise text mining and text analytics software, I2E, for high-value knowledge discovery and decision support.
    Mathematica – provides built-in tools for text alignment, pattern matching, clustering and semantic analysis. See Wolfram Language, the programming language of Mathematica.
    MATLAB – offers Text Analytics Toolbox for importing text data and converting it to numeric form for use in machine and deep learning, sentiment analysis and classification tasks.
    Medallia – offers one system of record for survey, social, text, written and online feedback.
    NetMiner – software for network analysis and text mining. Supports social media and bibliographic data collection, NLP for English and Chinese, sentiment analysis, word co-occurrence network (text network analysis) and visualization.
    NetOwl – suite of multilingual text and entity analytics products, including entity extraction, link and event extraction, sentiment analysis, geotagging, name translation, name matching, and identity resolution, among others.
    PolyAnalyst – text analytics environment.
    PoolParty Semantic Suite – graph-based text mining platform.
    RapidMiner with its Text Processing Extension – data and text mining software.
    SAS – SAS Text Miner and Teragram; commercial text analytics, natural language processing, and taxonomy software used for Information Management.
    Sketch Engine – a corpus manager and analysis software that supports creating text corpora from uploaded texts or the Web, including part-of-speech tagging and lemmatization or detecting a particular website.
    Sysomos – provider of a social media analytics software platform, including text analytics and sentiment analysis on online consumer conversations.
    WordStat – content analysis and text mining add-on module of QDA Miner for analyzing large amounts of text data.
    == Open source ==
    Carrot2 – text and search results clustering framework.
    GATE – General Architecture for Text Engineering, an open-source toolbox for natural language processing and language engineering.
    Gensim – large-scale topic modelling and extraction of semantic information from unstructured text (Python).
    KH Coder – for Quantitative Content Analysis or Text Mining.
    The KNIME Text Processing extension.
    Natural Language Toolkit (NLTK) – a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for the Python programming language.
    OpenNLP – natural language processing.
    Orange with its text mining add-on.
    The PLOS Text Mining Collection.
    The programming language R provides a framework for text mining applications in the package tm. The Natural Language Processing task view contains tm and other text mining library packages.
    spaCy – open-source Natural Language Processing library for Python.
    Stanbol – an open source text mining engine targeted at semantic content management.
    Voyant Tools – a web-based text analysis environment, created as a scholarly project.

  • Statistical classification

    When classification is performed by a computer, statistical methods are normally used to develop the algorithm. Often, the individual observations are analyzed into a set of quantifiable properties, known variously as explanatory variables or features. These properties may variously be categorical (e.g. "A", "B", "AB" or "O", for blood type), ordinal (e.g. "large", "medium" or "small"), integer-valued (e.g. the number of occurrences of a particular word in an email) or real-valued (e.g. a measurement of blood pressure). Other classifiers work by comparing observations to previous observations by means of a similarity or distance function. An algorithm that implements classification, especially in a concrete implementation, is known as a classifier. The term "classifier" sometimes also refers to the mathematical function, implemented by a classification algorithm, that maps input data to a category. Terminology across fields is quite varied. In statistics, where classification is often done with logistic regression or a similar procedure, the properties of observations are termed explanatory variables (or independent variables, regressors, etc.), and the categories to be predicted are known as outcomes, which are considered to be possible values of the dependent variable. In machine learning, the observations are often known as instances, the explanatory variables are termed features (grouped into a feature vector), and the possible categories to be predicted are classes. Other fields may use different terminology: e.g. in community ecology, the term "classification" normally refers to cluster analysis. == Relation to other problems == Classification and clustering are examples of the more general problem of pattern recognition, which is the assignment of some sort of output value to a given input value. 
Other examples are regression, which assigns a real-valued output to each input; sequence labeling, which assigns a class to each member of a sequence of values (for example, part of speech tagging, which assigns a part of speech to each word in an input sentence); parsing, which assigns a parse tree to an input sentence, describing the syntactic structure of the sentence; etc. A common subclass of classification is probabilistic classification. Algorithms of this nature use statistical inference to find the best class for a given instance. Unlike other algorithms, which simply output a "best" class, probabilistic algorithms output a probability of the instance being a member of each of the possible classes. The best class is normally then selected as the one with the highest probability. However, such an algorithm has numerous advantages over non-probabilistic classifiers: It can output a confidence value associated with its choice (in general, a classifier that can do this is known as a confidence-weighted classifier). Correspondingly, it can abstain when its confidence of choosing any particular output is too low. Because of the probabilities which are generated, probabilistic classifiers can be more effectively incorporated into larger machine-learning tasks, in a way that partially or completely avoids the problem of error propagation. == Frequentist procedures == Early work on statistical classification was undertaken by Fisher, in the context of two-group problems, leading to Fisher's linear discriminant function as the rule for assigning a group to a new observation. This early work assumed that data-values within each of the two groups had a multivariate normal distribution. The extension of this same context to more than two groups has also been considered with a restriction imposed that the classification rule should be linear. 
Later work for the multivariate normal distribution allowed the classifier to be nonlinear: several classification rules can be derived based on different adjustments of the Mahalanobis distance, with a new observation being assigned to the group whose centre has the lowest adjusted distance from the observation. == Bayesian procedures == Unlike frequentist procedures, Bayesian classification procedures provide a natural way of taking into account any available information about the relative sizes of the different groups within the overall population. Bayesian procedures tend to be computationally expensive and, in the days before Markov chain Monte Carlo computations were developed, approximations for Bayesian clustering rules were devised. Some Bayesian procedures involve the calculation of group-membership probabilities: these provide a more informative outcome than a simple attribution of a single group-label to each new observation. == Binary and multiclass classification == Classification can be thought of as two separate problems – binary classification and multiclass classification. In binary classification, a better understood task, only two classes are involved, whereas multiclass classification involves assigning an object to one of several classes. Since many classification methods have been developed specifically for binary classification, multiclass classification often requires the combined use of multiple binary classifiers. == Feature vectors == Most algorithms describe an individual instance whose category is to be predicted using a feature vector of individual, measurable properties of the instance. Each property is termed a feature, also known in statistics as an explanatory variable (or independent variable, although features may or may not be statistically independent). Features may variously be binary (e.g. "on" or "off"); categorical (e.g. "A", "B", "AB" or "O", for blood type); ordinal (e.g. "large", "medium" or "small"); integer-valued (e.g. the number of occurrences of a particular word in an email); or real-valued (e.g. a measurement of blood pressure). If the instance is an image, the feature values might correspond to the pixels of an image; if the instance is a piece of text, the feature values might be occurrence frequencies of different words. Some algorithms work only in terms of discrete data and require that real-valued or integer-valued data be discretized into groups (e.g. less than 5, between 5 and 10, or greater than 10).
== Linear classifiers == A large number of algorithms for classification can be phrased in terms of a linear function that assigns a score to each possible category k by combining the feature vector of an instance with a vector of weights, using a dot product. The predicted category is the one with the highest score. This type of score function is known as a linear predictor function and has the following general form: $\operatorname{score}(\mathbf{X}_i, k) = \boldsymbol{\beta}_k \cdot \mathbf{X}_i,$ where Xi is the feature vector for instance i, βk is the vector of weights corresponding to category k, and score(Xi, k) is the score associated with assigning instance i to category k. In discrete choice theory, where instances represent people and categories represent choices, the score is considered the utility associated with person i choosing category k. Algorithms with this basic setup are known as linear classifiers. What distinguishes them is the procedure for determining (training) the optimal weights/coefficients and the way that the score is interpreted.
Examples of such algorithms include:
    Logistic regression – Statistical model for a binary dependent variable
    Multinomial logistic regression – Regression for more than two discrete outcomes
    Probit regression – Statistical regression where the dependent variable can take only two values
    The perceptron algorithm
    Support vector machine – Set of methods for supervised statistical learning
    Linear discriminant analysis – Method used in statistics, pattern recognition, and other fields
== Algorithms == Since no single form of classification is appropriate for all data sets, a large toolkit of classification algorithms has been developed. The most commonly used include:
    Artificial neural networks – Computational model used in machine learning
    Boosting (machine learning) – Ensemble learning method
    Random forest – Tree-based ensemble machine learning methods
    Genetic programming – Evolving computer programs with techniques analogous to natural genetic processes
    Gene expression programming – Evolutionary algorithm
    Multi expression programming
    Linear genetic programming
    Kernel estimation – Concept in statistics
    k-nearest neighbor – Non-parametric classification method
    Learning vector quantization
    Linear classifier – Statistical classification in machine learning
    Fisher's linear discriminant – Method used in statistics, pattern recognition, and other fields
    Logistic r
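
    The linear predictor function described under "Linear classifiers" can be sketched as follows. This is a minimal illustration, not any particular algorithm from the list: the weight values, the feature vector, and the function names score and predict_category are made up, and no training procedure is shown, since training is exactly what distinguishes the listed algorithms from one another.

```python
from typing import List, Sequence


def score(weights_k: Sequence[float], x: Sequence[float]) -> float:
    """Linear predictor: dot product of one category's weights with x."""
    return sum(w * v for w, v in zip(weights_k, x))


def predict_category(weights: List[Sequence[float]], x: Sequence[float]) -> int:
    """Return the index k of the category with the highest score."""
    scores = [score(w_k, x) for w_k in weights]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical weights: three categories (rows), two features (columns).
beta = [
    [1.0, 0.0],
    [0.0, 1.0],
    [-1.0, -1.0],
]
x_i = [0.5, 2.0]

print([score(b, x_i) for b in beta])  # one score per category
print(predict_category(beta, x_i))    # index of the highest-scoring category
```

    For these made-up numbers the scores are 0.5, 2.0 and -2.5, so category index 1 is predicted.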

  • Multinomial logistic regression

    In statistics, multinomial logistic regression is a classification method that generalizes logistic regression to multiclass problems, i.e. with more than two possible discrete outcomes. That is, it is a model that is used to predict the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables (which may be real-valued, binary-valued, categorical-valued, etc.). Multinomial logistic regression is known by a variety of other names, including polytomous LR, multiclass LR, softmax regression, multinomial logit (mlogit), the maximum entropy (MaxEnt) classifier, and the conditional maximum entropy model. == Background == Multinomial logistic regression is used when the dependent variable in question is nominal (equivalently categorical, meaning that it falls into any one of a set of categories that cannot be ordered in any meaningful way) and for which there are more than two categories. Some examples would be: Which major will a college student choose, given their grades, stated likes and dislikes, etc.? Which blood type does a person have, given the results of various diagnostic tests? In a hands-free mobile phone dialing application, which person's name was spoken, given various properties of the speech signal? Which candidate will a person vote for, given particular demographic characteristics? Which country will a firm locate an office in, given the characteristics of the firm and of the various candidate countries? These are all statistical classification problems. They all have in common a dependent variable to be predicted that comes from one of a limited set of items that cannot be meaningfully ordered, as well as a set of independent variables (also known as features, explanators, etc.), which are used to predict the dependent variable. 
Multinomial logistic regression is a particular solution to classification problems that use a linear combination of the observed features and some problem-specific parameters to estimate the probability of each particular value of the dependent variable. The best values of the parameters for a given problem are usually determined from some training data (e.g. some people for whom both the diagnostic test results and blood types are known, or some examples of known words being spoken). == Assumptions == The multinomial logistic model assumes that data are case-specific; that is, each independent variable has a single value for each case. As with other types of regression, there is no need for the independent variables to be statistically independent from each other (unlike, for example, in a naive Bayes classifier); however, collinearity is assumed to be relatively low, as it becomes difficult to differentiate between the impact of several variables if this is not the case. If the multinomial logit is used to model choices, it relies on the assumption of independence of irrelevant alternatives (IIA), which is not always desirable. This assumption states that the odds of preferring one class over another do not depend on the presence or absence of other "irrelevant" alternatives. For example, the relative probabilities of taking a car or bus to work do not change if a bicycle is added as an additional possibility. This allows the choice of K alternatives to be modeled as a set of K − 1 independent binary choices, in which one alternative is chosen as a "pivot" and the other K − 1 compared against it, one at a time. The IIA hypothesis is a core hypothesis in rational choice theory; however numerous studies in psychology show that individuals often violate this assumption when making choices. An example of a problem case arises if choices include a car and a blue bus. Suppose the odds ratio between the two is 1 : 1. 
Now if the option of a red bus is introduced, a person may be indifferent between a red and a blue bus, and hence may exhibit a car : blue bus : red bus odds ratio of 1 : 0.5 : 0.5, thus maintaining a 1 : 1 ratio of car : any bus while adopting a changed car : blue bus ratio of 1 : 0.5. Here the red bus option was not in fact irrelevant, because a red bus was a perfect substitute for a blue bus. If the multinomial logit is used to model choices, it may in some situations impose too much constraint on the relative preferences between the different alternatives. It is especially important to take into account if the analysis aims to predict how choices would change if one alternative were to disappear (for instance if one political candidate withdraws from a three candidate race). Other models like the nested logit or the multinomial probit may be used in such cases as they allow for violation of the IIA. == Model == === Introduction === There are multiple equivalent ways to describe the mathematical model underlying multinomial logistic regression. This can make it difficult to compare different treatments of the subject in different texts. The article on logistic regression presents a number of equivalent formulations of simple logistic regression, and many of these have analogues in the multinomial logit model. 
The idea behind all of them, as in many other statistical classification techniques, is to construct a linear predictor function that constructs a score from a set of weights that are linearly combined with the explanatory variables (features) of a given observation using a dot product: $\operatorname{score}(\mathbf{X}_i, k) = \boldsymbol{\beta}_k \cdot \mathbf{X}_i,$ where Xi is the vector of explanatory variables describing observation i, βk is a vector of weights (or regression coefficients) corresponding to outcome k, and score(Xi, k) is the score associated with assigning observation i to category k. In discrete choice theory, where observations represent people and outcomes represent choices, the score is considered the utility associated with person i choosing outcome k. The predicted outcome is the one with the highest score. The difference between the multinomial logit model and numerous other methods, models, algorithms, etc. with the same basic setup (the perceptron algorithm, support vector machines, linear discriminant analysis, etc.) is the procedure for determining (training) the optimal weights/coefficients and the way that the score is interpreted. In particular, in the multinomial logit model, the score can directly be converted to a probability value, indicating the probability of observation i choosing outcome k given the measured characteristics of the observation. This provides a principled way of incorporating the prediction of a particular multinomial logit model into a larger procedure that may involve multiple such predictions, each with a possibility of error. Without such means of combining predictions, errors tend to multiply. For example, imagine a large predictive model that is broken down into a series of submodels where the prediction of a given submodel is used as the input of another submodel, and that prediction is in turn used as the input into a third submodel, etc. 
If each submodel has 90% accuracy in its predictions, and there are five submodels in series, then the overall model has only 0.9^5 ≈ 59% accuracy. If each submodel has 80% accuracy, then overall accuracy drops to 0.8^5 ≈ 33%. This issue is known as error propagation and is a serious problem in real-world predictive models, which are usually composed of numerous parts. Predicting probabilities of each possible outcome, rather than simply making a single optimal prediction, is one means of alleviating this issue. === Setup === The basic setup is the same as in logistic regression, the only difference being that the dependent variables are categorical rather than binary, i.e. there are K possible outcomes rather than just two. The following description is somewhat shortened; for more details, consult the logistic regression article. ==== Data points ==== Specifically, it is assumed that we have a series of N observed data points. Each data point i (ranging from 1 to N) consists of a set of M explanatory variables x1,i, ..., xM,i (also known as independent variables, predictor variables, features, etc.), and an associated categorical outcome Yi (also known as dependent variable, response variable), which can take on one of K possible values. These possible values represent logically separate categories (e.g. different political parties, blood types, etc.), and are often described mathematically by arbitrarily assigning each a number from 1 to K. The explanatory variables and outcome represent observed properties of the data points, and are often thought of as originating in the observations of N "experiments", although an "experiment" may consist of nothing more than gathering data. The goal of multinomial logistic regression is to construct a model that explains the relationship between the explanatory variables and the outcome, so tha
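
    The excerpt above says that the score can be converted directly to a probability, and lists "softmax regression" among the model's names. A minimal sketch of that conversion, assuming the standard softmax function, is given below. The score values are invented for illustration, and the function name softmax is an ordinary label rather than a reference to a specific library.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Convert K real-valued scores into K probabilities that sum to 1."""
    m = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical utilities for two alternatives: car and blue bus.
two_way = softmax([1.0, 1.0])

# Add a red bus with the same utility as the blue bus.
three_way = softmax([1.0, 1.0, 1.0])

print(two_way)                      # car : blue bus
print(three_way)                    # car : blue bus : red bus
print(three_way[0] / three_way[1])  # car : blue bus odds after the addition
```

    Under softmax, the odds between two alternatives depend only on the difference of their own scores, so the car : blue bus odds stay at 1 : 1 after the red bus is added and the model assigns one third to each option. This is the independence of irrelevant alternatives behaviour discussed under "Assumptions", and it differs from the 1 : 0.5 : 0.5 pattern described there for a person who is indifferent between the two buses.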

  • Adobe PhotoDeluxe

    PhotoDeluxe was a consumer-oriented image editing software line published by Adobe Systems from 1996 until July 8, 2002. At that time it was replaced by Adobe's newly launched consumer-oriented image editing software Photoshop Elements. Adobe no longer provides technical support for the PhotoDeluxe software line. PhotoDeluxe had a range of image processing capabilities for the home photographer and image handler. These included removing red-eye, cropping, and adjusting brightness, contrast, and sharpness. It also included software to extract pictures from an image scanner. Its functionality included the ability to dynamically resize photos and export them in a wide range of formats. It also had a range of printing options, including printing multiple copies of an image on the same page. It was often bundled free with Epson scanners or as free software with new computers. == Features == Despite critical concerns regarding the quality of the setup, PhotoDeluxe supports layering, blurs, sharpening, cloning, gradient fills, color and background switches, color variations, resizing options, and many other features. Another drawback of PhotoDeluxe was that it was designed for Mac computers, so working on a Windows PC was a problem for those who were unable to customize their preferences. == Versions == === Adobe PhotoDeluxe 1.0 === The first version was released in 1996 for Windows and Macintosh computers. In one year, it sold over one million copies. === Adobe PhotoDeluxe 2.0 === The new version was released in 1997 and added features such as a Clone Tool, red-eye removal, and sample templates for making posters, cards, and calendars. It also had new special effect features. === Adobe PhotoDeluxe 3.0 === The third version was released in 1998. The new features included customizable clipart settings, the ability to import photos on the web, enhanced repair activities following Guided Activities, and Adobe Connectables to add new activities. 
=== Adobe PhotoDeluxe Home Edition (4.0) === Version 4.0 was created by the makers of Photoshop. It had advanced abilities such as tools to add animation, voice, and music to a picture. It also had features to restore photos to their original position. == History == Adobe PhotoDeluxe 1.0 was released in 1996 for Macintosh computers, initially retailing for an MSRP of $49. The software did quite well, reportedly selling over a million copies by February of the next year, primarily due to bundles with companies like Apple and Hewlett-Packard. PhotoDeluxe was primarily advertised to consumers as a way to do basic photo manipulation, such as cropping and rotating images, or creating simple cards and calendars. PhotoDeluxe 2.0 was released in 1997, and was the last version of PhotoDeluxe that Adobe made that worked on Macs. PhotoDeluxe 2.0 became the "number one selling consumer photo-editing software product in the world." PhotoDeluxe 3.0 was released in 1998, where it was rebranded as "3.0 Home Edition", as Adobe released PhotoDeluxe Business Edition later that year for a higher price. PhotoDeluxe Home Edition, unofficially called PhotoDeluxe 4.0, was released in 1999 and was the last version of PhotoDeluxe to be released. Adobe officially cancelled PhotoDeluxe on July 8, 2002, citing the presence of Photoshop and Photoshop Elements, with support being officially cancelled in mid-2003. No version of PhotoDeluxe is compatible with Windows 10, rendering the program obsolete. == Pricing == All home versions of PhotoDeluxe retailed for an MSRP of $49. PhotoDeluxe 2.0 and onwards allowed users to upgrade from a previous version of PhotoDeluxe or a competing piece of graphics software for $39. Additionally PhotoDeluxe Business Edition allowed a similar deal, allowing users to upgrade from other versions of PhotoDeluxe or a competing software for $59, instead of its normal price of $99. 
Adobe also offered a bundle allowing users of 1.0 or 2.0 to get 3.0 and Business Edition for $79.

  • Optical neural network

    An optical neural network is a physical implementation of an artificial neural network with optical components. Early optical neural networks used a photorefractive volume hologram to interconnect arrays of input neurons to arrays of outputs, with synaptic weights in proportion to the multiplexed hologram's strength. Volume holograms were further multiplexed using spectral hole burning to add one dimension of wavelength to space, achieving four-dimensional interconnects of two-dimensional arrays of neural inputs and outputs. This work led to extensive research on alternative methods that use the strength of the optical interconnect to implement neuronal communications. Artificial neural networks that have been implemented as optical neural networks include the Hopfield neural network and the Kohonen self-organizing map with liquid crystal spatial light modulators. Optical neural networks can also be based on the principles of neuromorphic engineering, creating neuromorphic photonic systems. Typically, these systems encode information in the networks using spikes, mimicking the functionality of spiking neural networks in optical and photonic hardware. Photonic devices that have demonstrated neuromorphic functionalities include (among others) vertical-cavity surface-emitting lasers, integrated photonic modulators, optoelectronic systems based on superconducting Josephson junctions, and systems based on resonant tunnelling diodes. == Electrochemical vs. optical neural networks == Biological neural networks function on an electrochemical basis, while optical neural networks use electromagnetic waves. Optical interfaces to biological neural networks can be created with optogenetics, but such an interface is not the same as an optical neural network. Biological neural networks have many different mechanisms for dynamically changing the state of the neurons, including short-term and long-term synaptic plasticity. 
Synaptic plasticity is among the electrophysiological phenomena used to control the efficiency of synaptic transmission: long-term plasticity for learning and memory, and short-term plasticity for brief transient changes in synaptic transmission efficiency. Implementing this with optical components is difficult, and ideally requires advanced photonic materials. Properties that might be desirable in photonic materials for optical neural networks include the ability to change their efficiency of transmitting light based on the intensity of incoming light. == Rising Era of Optical Neural Networks == With the increasing significance of computer vision in various domains, the computational cost of these tasks has increased, making it more important to develop new approaches to accelerating the processing. Optical computing has emerged as a potential alternative to GPU acceleration for modern neural networks, particularly considering the looming obsolescence of Moore's Law. Consequently, optical neural networks have garnered increased attention in the research community. Presently, two primary methods of optical neural computing are under research: silicon photonics-based and free-space optics. Each approach has its benefits and drawbacks; while silicon photonics may offer superior speed, it lacks the massive parallelism that free-space optics can deliver. Given the substantial parallelism capabilities of free-space optics, researchers have focused on taking advantage of it. One implementation, proposed by Lin et al., involves the training and fabrication of phase masks for a handwritten digit classifier. Light passing through the stacked 3D-printed phase masks is read by a photodetector array of ten detectors, each representing one digit class (0 to 9). Although this network can achieve terahertz-range classification, it lacks flexibility, as the phase masks are fabricated for a specific task and cannot be retrained. 
An alternative method for classification in free-space optics, introduced by Chang et al., employs a 4F system that is based on the convolution theorem to perform convolution operations. This system uses two lenses to execute the Fourier transforms of the convolution operation, enabling passive conversion into the Fourier domain without power consumption or latency. However, the convolution operation kernels in this implementation are also fabricated phase masks, limiting the device's functionality to specific convolutional layers of the network only. In contrast, Li et al. proposed a technique involving kernel tiling to use the parallelism of the 4F system while using a Digital Micromirror Device (DMD) instead of a phase mask. This approach allows users to upload various kernels into the 4F system and execute the entire network's inference on a single device. Unfortunately, modern neural networks are not designed for 4F systems, as they were primarily developed during the CPU/GPU era: they tend to use feature maps with a low resolution and a high number of channels. == Other Implementations == In 2007 there was one model of Optical Neural Network: the Programmable Optical Array/Analogic Computer (POAC). It was implemented in 2000 and reported as being based on a modified Joint Fourier Transform Correlator (JTC) with bacteriorhodopsin (BR) as a holographic optical memory. Full parallelism, large array size and the speed of light are three promises offered by POAC to implement an optical CNN. These have been investigated in recent years, along with their practical limitations and considerations, yielding the design of the first portable POAC version. The practical details – hardware (optical setups) and software (optical templates) – were published. 
However, POAC is a general-purpose, programmable array computer that has a wide range of applications, including image processing, pattern recognition, target tracking, real-time video processing, document security, and optical switching. == Progress in the 2020s == Taichi from Tsinghua University in Beijing is a hybrid ONN that combines the power efficiency and parallelism of optical diffraction and the configurability of optical interference. Taichi offers 13.96 million parameters. Taichi avoids the high error rates that afflict deep (multi-layer) networks by combining clusters of fewer-layer diffractive units with arrays of interferometers for reconfigurable computation. Its encoding protocol divides large network models into sub-models that can be distributed across multiple chiplets in parallel. Taichi achieved 91.89% accuracy in tests with the Omniglot database. It was also used to generate music in the style of Bach and images in the styles of Van Gogh and Munch. The developers claimed an energy efficiency of up to 160 trillion operations per second per watt and an area efficiency of 880 trillion multiply-accumulate operations per square millimetre, or 10³ times more energy efficient than the NVIDIA H100, and 10² times more energy efficient and 10 times more area efficient than previous ONNs. A time dimension has recently been introduced into diffractive neural networks by femtosecond (fs) laser lithography of perovskite hydration. The temporal behaviour of the neuron can be modulated by the fs laser at the nanoscale, enabling a programmable holographic neural network with temporal evolution functionality, i.e., the functionality can change with time under the hydration stimuli. An in-memory temporal inference functionality was demonstrated to mimic the function evolution of the human brain, i.e., the functionality can change over time from simple digit image classification to more complicated digit and clothing product image classification. 
This was the first time a time dimension was introduced into an optical neural network, laying a foundation for future brain-like photonic chip development.
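
The 4F approach described above rests on the convolution theorem: a convolution in the spatial domain equals a pointwise product in the Fourier domain. The following is a minimal numerical sketch of that identity, not a model of any optical hardware. It assumes one-dimensional signals and circular convolution, and the function names (`dft`, `circular_convolve`, `fourier_convolve`) and the sample values are hypothetical, chosen only for illustration.

```python
import cmath


def dft(signal, inverse=False):
    """Naive O(n^2) discrete Fourier transform (or its inverse)."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(
            signal[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
            for t in range(n)
        )
        out.append(total / n if inverse else total)
    return out


def circular_convolve(a, b):
    """Direct circular convolution of two equal-length sequences."""
    n = len(a)
    return [sum(a[j] * b[(i - j) % n] for j in range(n)) for i in range(n)]


def fourier_convolve(a, b):
    """Same convolution via transform, pointwise product, inverse transform."""
    product = [x * y for x, y in zip(dft(a), dft(b))]
    return [value.real for value in dft(product, inverse=True)]


# Illustrative data (assumed): one image row and one kernel of equal length.
image_row = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, 0.25, 0.0, 0.25]

direct = circular_convolve(image_row, kernel)
via_fourier = fourier_convolve(image_row, kernel)
```

Both routes give the same result up to floating-point error. In the optical setting the two transforms are performed by lenses rather than computed, which is where the passive, latency-free conversion mentioned above comes from.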

    Read more →
  • Markov model

    Markov model

    In probability theory, a Markov model is a stochastic model used to model pseudo-randomly changing systems. It is assumed that future states depend only on the current state, not on the events that occurred before it (that is, it assumes the Markov property). Generally, this assumption enables reasoning and computation with the model that would otherwise be intractable. For this reason, in the fields of predictive modelling and probabilistic forecasting, it is desirable for a given model to exhibit the Markov property. == Introduction == Andrey Andreyevich Markov (14 June 1856 – 20 July 1922) was a Russian mathematician best known for his work on stochastic processes. A primary subject of his research later became known as the Markov chain. There are four common Markov models used in different situations, depending on whether every sequential state is observable or not, and whether the system is to be adjusted on the basis of observations made: == Markov chain == The simplest Markov model is the Markov chain. It models the state of a system with a random variable that changes through time. In this context, the Markov property indicates that the distribution for this variable depends only on the distribution of a previous state. An example use of a Markov chain is Markov chain Monte Carlo, which uses the Markov property to prove that a particular method for performing a random walk will sample from the joint distribution. == Hidden Markov model == A hidden Markov model is a Markov chain for which the state is only partially observable or noisily observable. In other words, observations are related to the state of the system, but they are typically insufficient to precisely determine the state. Several well-known algorithms for hidden Markov models exist. 
For example, given a sequence of observations, the Viterbi algorithm will compute the most-likely corresponding sequence of states, the forward algorithm will compute the probability of the sequence of observations, and the Baum–Welch algorithm will estimate the starting probabilities, the transition function, and the observation function of a hidden Markov model. One common use is for speech recognition, where the observed data is the speech audio waveform and the hidden state is the spoken text. In this example, the Viterbi algorithm finds the most likely sequence of spoken words given the speech audio. == Markov decision process == A Markov decision process is a Markov chain in which state transitions depend on the current state and an action vector that is applied to the system. Typically, a Markov decision process is used to compute a policy of actions that will maximize some utility with respect to expected rewards. == Partially observable Markov decision process == A partially observable Markov decision process (POMDP) is a Markov decision process in which the state of the system is only partially observed. POMDPs are known to be NP complete, but recent approximation techniques have made them useful for a variety of applications, such as controlling simple agents or robots. == Markov random field == A Markov random field, or Markov network, may be considered to be a generalization of a Markov chain in multiple dimensions. In a Markov chain, state depends only on the previous state in time, whereas in a Markov random field, each state depends on its neighbors in any of multiple directions. A Markov random field may be visualized as a field or graph of random variables, where the distribution of each random variable depends on the neighboring variables with which it is connected. 
More specifically, the joint distribution for any random variable in the graph can be computed as the product of the "clique potentials" of all the cliques in the graph that contain that random variable. Modeling a problem as a Markov random field is useful because it implies that the joint distributions at each vertex in the graph may be computed in this manner. == Hierarchical Markov models == Hierarchical Markov models can be applied to categorize human behavior at various levels of abstraction. For example, a series of simple observations, such as a person's location in a room, can be interpreted to determine more complex information, such as what task or activity the person is performing. Two kinds of hierarchical Markov models are the hierarchical hidden Markov model and the abstract hidden Markov model. Both have been used for behavior recognition, and certain conditional independence properties between different levels of abstraction in the model allow for faster learning and inference. == Tolerant Markov model == A tolerant Markov model (TMM) is a probabilistic-algorithmic Markov chain model. It assigns the probabilities according to a conditioning context that considers the last symbol, from the sequence to occur, as the most probable instead of the true occurring symbol. A TMM can model three different natures: substitutions, additions or deletions. Successful applications have been efficiently implemented in DNA sequence compression. == Markov-chain forecasting models == Markov chains have been used as forecasting methods for several topics, for example price trends, wind power and solar irradiance. The Markov-chain forecasting models utilize a variety of different settings, from discretizing the time series to hidden Markov models combined with wavelets and the Markov-chain mixture distribution model (MCM).
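
To make the hidden Markov model section above more concrete, here is a minimal sketch of the Viterbi algorithm, which finds the most likely sequence of hidden states for a sequence of observations. The states, observations and probabilities are an assumed toy example (a two-state health model), and the function signature is illustrative rather than taken from any particular library.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most likely state path, probability of that path)."""
    # best[t][s] = (probability of the best path ending in s at time t,
    #               predecessor state on that path)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prob, prev = max((best[-1][p][0] * trans_p[p][s], p) for p in states)
            column[s] = (prob * emit_p[s][obs], prev)
        best.append(column)

    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, best[-1][last][0]


# Toy model (all numbers are assumptions for illustration).
states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {
    "Healthy": {"Healthy": 0.7, "Fever": 0.3},
    "Fever": {"Healthy": 0.4, "Fever": 0.6},
}
emit_p = {
    "Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
    "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
}

path, probability = viterbi(["normal", "cold", "dizzy"], states, start_p, trans_p, emit_p)
# path is ["Healthy", "Healthy", "Fever"], with probability 0.01512
```

For long sequences a practical implementation would work with log-probabilities to avoid numerical underflow; the plain products are kept here for readability.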

    Read more →
  • Language identification in the limit

    Language identification in the limit

    Language identification in the limit is a formal model for inductive inference of formal languages, mainly by computers (see machine learning and induction of regular languages). It was introduced by E. Mark Gold in a technical report and a journal article with the same title. In this model, a teacher provides to a learner some presentation (i.e. a sequence of strings) of some formal language. The learning is seen as an infinite process. Each time the learner reads an element of the presentation, it should provide a representation (e.g. a formal grammar) for the language. Gold defines that a learner can identify in the limit a class of languages if, given any presentation of any language in the class, the learner will produce only a finite number of wrong representations, and then stick with the correct representation. However, the learner need not be able to announce its correctness; and the teacher might present a counterexample to any representation arbitrarily long after. Gold defined two types of presentations: Text (positive information): an enumeration of all strings the language consists of. Complete presentation (positive and negative information): an enumeration of all possible strings, each with a label indicating if the string belongs to the language or not. == Learnability == This model is an early attempt to formally capture the notion of learnability. Gold's journal article introduces for contrast the stronger models Finite identification (where the learner has to announce correctness after a finite number of steps), and Fixed-time identification (where correctness has to be reached after an apriori-specified number of steps). A weaker formal model of learnability is the Probably approximately correct learning (PAC) model, introduced by Leslie Valiant in 1984. == Examples == It is instructive to look at concrete examples (in the tables) of learning sessions the definition of identification in the limit speaks about. 
A fictitious session to learn a regular language L over the alphabet {a,b} from text presentation: In each step, the teacher gives a string belonging to L, and the learner answers a guess for L, encoded as a regular expression. In step 3, the learner's guess is not consistent with the strings seen so far; in step 4, the teacher gives a string repeatedly. After step 6, the learner sticks to the regular expression (ab+ba). If this happens to be a description of the language L the teacher has in mind, it is said that the learner has learned that language. If a computer program for the learner's role existed that could successfully learn each regular language, that class of languages would be identifiable in the limit. Gold has shown that this is not the case. A particular learning algorithm always guessing L to be just the union of all strings seen so far: If L is a finite language, the learner will eventually guess it correctly, however, without being able to tell when. Although the guess didn't change during steps 3 to 6, the learner couldn't be sure to be correct. Gold has shown that the class of finite languages is identifiable in the limit; however, this class is neither finitely nor fixed-time identifiable. Learning from complete presentation by telling: In each step, the teacher gives a string and tells whether it belongs to L (green) or not (red, struck-out). Each possible string is eventually classified in this way by the teacher. Learning from complete presentation by request: The learner gives a query string, the teacher tells whether it belongs to L (yes) or not (no); the learner then gives a guess for L, followed by the next query string. 
In this example, the learner happens to query in each step just the same string as given by the teacher in example 3. In general, Gold has shown that each language class identifiable in the request-presentation setting is also identifiable in the telling-presentation setting, since the learner, instead of querying a string, just needs to wait until it is eventually given by the teacher. == Gold's theorem == More formally, a language $L$ is a nonempty set, and its elements are called sentences. A language family is a set of languages. A language-learning environment $E$ for a language $L$ is a stream of sentences from $L$, such that each sentence in $L$ appears at least once. A language learner is a function $f$ that sends a list of sentences to a language. This is interpreted as saying that, after seeing sentences $a_1, a_2, \dots, a_n$ in that order, the language learner guesses that the language that produces the sentences should be $f(a_1, \dots, a_n)$. Note that the learner is not obliged to be correct; it could very well guess a language that does not even contain $a_1, \dots, a_n$. A language learner $f$ learns a language $L$ in environment $E = (a_1, a_2, \dots)$ if the learner always guesses $L$ after seeing enough examples from the environment. A language learner $f$ learns a language $L$ if it learns $L$ in any environment $E$ for $L$. A language family is learnable if there exists a language learner that can learn all languages in the family. Notes: In the context of Gold's theorem, sentences need only be distinguishable. 
They need not be anything in particular, such as finite strings (as usual in formal linguistics). Learnability is not a concept for individual languages. Any individual language $L$ could be learned by a trivial learner that always guesses $L$. Learnability is not a concept for individual learners. A language family is learnable if, and only if, there exists some learner that can learn the family. It does not matter how well the learner performs for learning languages outside the family. Gold's theorem is easily bypassed if negative examples are allowed. In particular, the language family $\{L_1, L_2, \dots, L_\infty\}$ can be learned by a learner that always guesses $L_\infty$ until it receives the first negative example $\neg a_n$, where $a_n \in L_{n+1} \setminus L_n$, at which point it always guesses $L_n$. == Learnability characterization == Dana Angluin gave the characterizations of learnability from text (positive information) in a 1980 paper. If a learner is required to be effective, then an indexed class of recursive languages is learnable in the limit if there is an effective procedure that uniformly enumerates tell-tales for each language in the class (Condition 1). It is not hard to see that if an ideal learner (i.e., an arbitrary function) is allowed, then an indexed class of languages is learnable in the limit if each language in the class has a tell-tale (Condition 2). == Language classes learnable in the limit == The table shows which language classes are identifiable in the limit in which learning model. On the right-hand side, each language class is a superclass of all lower classes. Each learning model (i.e. type of presentation) can identify in the limit all classes below it. 
In particular, the class of finite languages is identifiable in the limit by text presentation (cf. Example 2 above), while the class of regular languages is not. Pattern languages, introduced by Dana Angluin in another 1980 paper, are also identifiable by normal text presentation; they are omitted in the table, since they are above the singleton and below the primitive recursive language class, but incomparable to the classes in between. == Sufficient conditions for learnability == Condition 1 in Angluin's paper is not always easy to verify. Therefore, various sufficient conditions for the learnability of a language class have been proposed. See also Induction of regular languages for learnable subclasses of regular languages. === Finite thickness === A class of languages has finite thickness if every non-empty set of strings is contained in at most finitely many languages of the class. This is exactly Condition 3 in Angluin's paper. Angluin showed that if a class of recursive languages has finite thickness, then it is learnable in the limit. A class with finite thickness certainly satisfies the MEF-condition and the MFF-condition; in other words, finite thickness implies M-finite thickness. === Finite elasticity === A class of languages is said to have finite elasticity if for every infinite sequence of strings $s_0, s_1, \dots$ and every infinite sequence of languages in the class $L_1, L_2, \dots$, there exists a finite number n such

    Read more →
  • Apertus (LLM)

    Apertus (LLM)

    Apertus is a public large language model, developed by the Swiss AI Initiative (a collaboration between EPFL, ETH Zurich, and the Swiss National Supercomputing Centre). It was released on September 2, 2025, under the free and open-source Apache 2.0 license. Designed initially for business and research use cases around the world, Apertus was trained on over 1800 languages, comes in 8 billion and 70 billion parameter versions, and is available on Hugging Face for download. The model was developed with the aim of adhering to European copyright law, and is one of the first examples of AI as a public good in the vein of AI sovereignty. It is also the first large model to comply with the European Union's Artificial Intelligence Act. At its launch, the model creators emphasized multilinguality, transparency, and auditability as priorities, in contrast to commercial frontier models. While international reception was largely positive, the first iteration was significantly behind the capabilities of frontier models and needs adaptation for many use cases, with chatbots being a secondary rather than a primary use case. As of late 2025, it was considered the largest and most capable fully open model. The capability of future models will depend in part on how much more funding can be secured.

    Read more →
  • Mating pool

    Mating pool

    Mating pool is a concept used in evolutionary algorithms and means a population of parents for the next population. The mating pool is formed by candidate solutions that the selection operators deem to have the highest fitness in the current population. Solutions that are included in the mating pool are referred to as parents. Individual solutions can be repeatedly included in the mating pool, with individuals of higher fitness values having a higher chance of being included multiple times. Crossover operators are then applied to the parents, resulting in recombination of genes recognized as superior. Lastly, random changes in the genes are introduced through mutation operators, increasing the genetic variation in the gene pool. Those two operators improve the chance of creating new, superior solutions. A new generation of solutions is thereby created, the children, who will constitute the next population. Depending on the selection method, the total number of parents in the mating pool can differ from the size of the initial population, resulting in a new population that is smaller. To continue the algorithm with an equally sized population, random individuals from the old population can be chosen and added to the new population. At this point, the fitness value of the new solutions is evaluated. If the termination conditions are fulfilled, the process comes to an end. Otherwise, the steps are repeated. The repetition of the steps results in candidate solutions that evolve towards the optimal solution over time. The genes will become increasingly uniform towards the optimal gene, a process called convergence. If 95% of the population share the same version of a gene, the gene has converged. When all the individual fitness values have reached the value of the best individual, i.e. all the genes have converged, population convergence is achieved. == Mating pool creation == Several methods can be applied to create a mating pool. 
All of these processes involve the selective breeding of a particular number of individuals within a population. There are multiple criteria that can be employed to determine which individuals make it into the mating pool and which are left behind. The selection methods can be split into three general types: fitness proportionate selection, ordinal based selection and threshold based selection. === Fitness proportionate selection === In the case of fitness proportionate selection, random individuals are selected to enter the pool. However, the ones with a higher level of fitness are more likely to be picked and therefore have a greater chance of passing on their features to the next generation. One of the techniques used in this type of parental selection is roulette wheel selection. This approach divides a hypothetical circular wheel into slots, each sized in proportion to the fitness value of one potential candidate. Afterwards, the wheel is rotated and a fixed point determines which individual gets picked. The greater the fitness value of an individual, the higher the probability of being chosen as a parent by the random spin of the wheel. Alternatively, stochastic universal sampling can be implemented. This selection method is also based on the rotation of a spinning wheel. However, in this case there is more than one fixed point and as a result all of the mating pool members will be selected simultaneously. === Ordinal based selection === The ordinal based selection methods include tournament and ranking selection. Tournament selection involves the random selection of individuals of a population and the subsequent comparison of their fitness levels. The winners of these “tournaments” are the ones with the highest values and will be put into the mating pool as parents. In ranking selection all the individuals are sorted based on their fitness values. Then, the selection of the parents is made according to the rank of the candidates. 
Every individual has a chance of being chosen, but higher ranked ones are favored. === Threshold based selection === The last type of selection method is referred to as the threshold based method. This includes the truncation selection method, which sorts individuals based on their phenotypic values on a specific trait and later selects the proportion of them that are within a certain threshold as parents.
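
The three families of selection methods described above can be sketched in a few lines each. This is a minimal illustration under stated assumptions: fitness values are non-negative numbers, higher is better, and the function names, parameters and sample population are hypothetical. A random number generator is passed in explicitly so that runs can be reproduced.

```python
import random


def roulette_wheel_selection(population, fitness, pool_size, rng):
    """Fitness proportionate: slot width is proportional to fitness."""
    total = sum(fitness)
    pool = []
    for _ in range(pool_size):
        spin = rng.uniform(0, total)
        cumulative = 0.0
        for individual, score in zip(population, fitness):
            cumulative += score
            if spin <= cumulative:
                pool.append(individual)
                break
        else:
            # Guard against floating-point rounding at the end of the wheel.
            pool.append(population[-1])
    return pool


def tournament_selection(population, fitness, pool_size, tournament_size, rng):
    """Ordinal: the fittest of a random group wins each tournament."""
    pool = []
    for _ in range(pool_size):
        contenders = rng.sample(range(len(population)), tournament_size)
        winner = max(contenders, key=lambda i: fitness[i])
        pool.append(population[winner])
    return pool


def truncation_selection(population, fitness, fraction):
    """Threshold: keep the top fraction of individuals ranked by fitness."""
    ranked = sorted(zip(population, fitness), key=lambda pair: pair[1], reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return [individual for individual, _ in ranked[:keep]]


# Illustrative population (assumed values).
population = ["a", "b", "c", "d"]
fitness = [1, 4, 2, 3]
rng = random.Random(0)

roulette_pool = roulette_wheel_selection(population, fitness, 4, rng)
tournament_pool = tournament_selection(population, fitness, 4, 4, rng)
truncated_pool = truncation_selection(population, fitness, 0.5)
```

Roulette wheel and tournament selection can place the same individual in the pool more than once, which matches the repeated inclusion described earlier; truncation selection cannot.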

    Read more →
  • Nearest centroid classifier

    Nearest centroid classifier

    In machine learning, a nearest centroid classifier or nearest prototype classifier is a classification model that assigns to observations the label of the class of training samples whose mean (centroid) is closest to the observation. When applied to text classification using word vectors containing tf-idf weights to represent documents, the nearest centroid classifier is known as the Rocchio classifier because of its similarity to the Rocchio algorithm for relevance feedback. An extended version of the nearest centroid classifier has found applications in the medical domain, specifically classification of tumors. == Algorithm == === Training === Given labeled training samples $\{(\vec{x}_1, y_1), \dots, (\vec{x}_n, y_n)\}$ with class labels $y_i \in \mathbf{Y}$, compute the per-class centroids $\vec{\mu}_\ell = \frac{1}{|C_\ell|} \sum_{i \in C_\ell} \vec{x}_i$, where $C_\ell$ is the set of indices of samples belonging to class $\ell \in \mathbf{Y}$. === Prediction === The class assigned to an observation $\vec{x}$ is $\hat{y} = \arg\min_{\ell \in \mathbf{Y}} \|\vec{\mu}_\ell - \vec{x}\|$.
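
The training and prediction steps above translate almost directly into code. The following is a minimal sketch that assumes Euclidean distance and samples given as equal-length tuples of numbers; the function names and the sample data are hypothetical.

```python
import math
from collections import defaultdict


def fit_centroids(samples, labels):
    """Training: compute the mean vector (centroid) of each class."""
    sums = {}
    counts = defaultdict(int)
    for x, y in zip(samples, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in total] for y, total in sums.items()}


def predict(centroids, x):
    """Prediction: return the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: math.dist(centroids[label], x))


# Illustrative training data (assumed values).
samples = [(0, 0), (0, 2), (4, 4), (6, 4)]
labels = ["a", "a", "b", "b"]

centroids = fit_centroids(samples, labels)
# centroids == {"a": [0.0, 1.0], "b": [5.0, 4.0]}
near_a = predict(centroids, (1, 1))
near_b = predict(centroids, (5, 5))
```

Training costs one pass over the data and prediction costs one distance computation per class, which is the main practical appeal of the method.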

    Read more →
  • Computational learning theory

    Computational learning theory

    In computer science, computational learning theory (or just learning theory) is a subfield of artificial intelligence devoted to studying the design and analysis of machine learning algorithms. == Overview == Theoretical results in machine learning often focus on a type of inductive learning known as supervised learning. In supervised learning, an algorithm is provided with labeled samples. For instance, the samples might be descriptions of mushrooms, with labels indicating whether they are edible or not. The algorithm uses these labeled samples to create a classifier. This classifier assigns labels to new samples, including those it has not previously encountered. The goal of the supervised learning algorithm is to optimize performance metrics, such as minimizing errors on new samples. In addition to performance bounds, computational learning theory studies the time complexity and feasibility of learning. In computational learning theory, a computation is considered feasible if it can be done in polynomial time. There are two kinds of time complexity results: Positive results – Showing that a certain class of functions is learnable in polynomial time. Negative results – Showing that certain classes cannot be learned in polynomial time. Negative results often rely on commonly believed but as yet unproven assumptions, such as: Computational complexity – P ≠ NP (the P versus NP problem); Cryptographic – One-way functions exist. There are several different approaches to computational learning theory based on making different assumptions about the inference principles used to generalise from limited data. These include different definitions of probability (see frequency probability, Bayesian probability) and different assumptions on the generation of samples. 
The different approaches include: Exact learning, proposed by Dana Angluin; Probably approximately correct learning (PAC learning), proposed by Leslie Valiant; VC theory, proposed by Vladimir Vapnik and Alexey Chervonenkis; Inductive inference as developed by Ray Solomonoff; Algorithmic learning theory, from the work of E. Mark Gold; Online machine learning, from the work of Nick Littlestone. While its primary goal is to understand learning abstractly, computational learning theory has led to the development of practical algorithms. For example, PAC theory inspired boosting, VC theory led to support vector machines, and Bayesian inference led to belief networks.

    Read more →
  • Wink Bingo

    Wink Bingo

    Wink Bingo is an online bingo website launched in 2008. It is part of Broadway Gaming Ireland DF Limited and is based and licensed in Ireland. == History == Wink Bingo launched in 2008 and under chief executive Eitan Boyd it grew to 60,000 active players within two years. It had an estimated £1.3 million profit in the first 11 months of trading, and by 2009 it had estimated annual revenue of £15 million. In 2009 Wink Bingo was purchased by 888 Holdings Plc, which operates a number of entertainment brands including 888casino, 888poker and 888sport. The initial up front fee was reported in the London Evening Standard to be £11 million, rising as high as £59.7 million depending on performance-based earn out arrangements. The acquisition included Daub Ltd’s other online bingo businesses Posh Bingo and Bingo Fabulous. In 2011, the sellers agreed to amend the terms and accept two subsequent payments in addition to the initial cost, of £9.2 million in May and £6.1 million in August. In 2011 Wink Bingo sponsored ITV2's The Only Way Is Essex, and other notable advertising campaigns have included sponsorship of Harry Hill's TV Burp. In 2014, Wink Bingo rebranded with an updated slogan 'Wink if you're in!', with an aim of creating a 'sunny, calm and inclusive' online destination, and an accompanying TV commercial featuring the Ottawan song D.I.S.C.O. re-recorded as B.I.N.G.O.. Wink also launched a new digital magazine, 'Winkly', and 'Winkipedia, a bingo encyclopedia'. Wink Bingo is available on desktop and as a mobile app. Wink launched Wink Slots in 2016 as a companion site to Wink Bingo. The Advertising Standards Authority has ruled on Wink Bingo's advertisements on a number of occasions. In August 2008, Wink ran a television ad which showed a midwife celebrating while at work at a hospital maternity unit. The ASA banned the ad, concluding that it condoned gambling in the workplace and suggested that it took priority over professional commitments. 
In June 2013, the Gambling Reform & Society Perception Group (GRASP) challenged the use of semi-naked "athletic" men together with the claim "Go on ... you know you want to" on an outdoor ad, suggesting it linked gambling to seduction and enhanced attractiveness. The complaint was not upheld. The site underwent another rebrand and pop art inspired redesign in April 2018, taking on a new tone of voice and a new slogan, "You’ve Earned It". An online shop was added, where players can redeem reward points for free play or vouchers for online high street retailers. In 2021 Wink Bingo was purchased by Saphalata Holdings, a company that forms part of the Broadway Gaming group. === Cancer Research UK campaign === In 2015 Wink Bingo began an open-ended partnership with the Peter Andre Fund to raise money for Cancer Research UK. Peter Andre also met with players who were selected in a raffle. == Awards ==

    Read more →
  • Kernel principal component analysis

    Kernel principal component analysis

    In the field of multivariate statistics, kernel principal component analysis (kernel PCA) is an extension of principal component analysis (PCA) using techniques of kernel methods. Using a kernel, the originally linear operations of PCA are performed in a reproducing kernel Hilbert space. == Background: Linear PCA == Recall that conventional PCA operates on zero-centered data; that is, $\frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i = \mathbf{0}$, where $\mathbf{x}_i$ is one of the $N$ multivariate observations. It operates by diagonalizing the covariance matrix, $C = \frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i \mathbf{x}_i^\top$; in other words, it gives an eigendecomposition of the covariance matrix: $\lambda \mathbf{v} = C\mathbf{v}$, which can be rewritten as $\lambda \mathbf{x}_i^\top \mathbf{v} = \mathbf{x}_i^\top C \mathbf{v}$ for $i = 1, \ldots, N$. (See also: Covariance matrix as a linear operator) == Introduction of the Kernel to PCA == To understand the utility of kernel PCA, particularly for clustering, observe that, while N points cannot, in general, be linearly separated in $d < N$

    Read more →

  • C4.5 algorithm

    C4.5 algorithm

    C4.5 is an algorithm used to generate a decision tree, developed by Ross Quinlan. C4.5 is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and for this reason, C4.5 is often referred to as a statistical classifier. In 2011, the authors of the Weka machine learning software described the C4.5 algorithm as "a landmark decision tree program that is probably the machine learning workhorse most widely used in practice to date". It became quite popular after ranking #1 in the pre-eminent paper "Top 10 Algorithms in Data Mining", published by Springer in 2008.

    == Algorithm ==

    C4.5 builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy. The training data is a set $S = \{s_1, s_2, \ldots\}$ of already classified samples. Each sample $s_i$ consists of a p-dimensional vector $(x_{1,i}, x_{2,i}, \ldots, x_{p,i})$, where the $x_j$ represent attribute values or features of the sample, as well as the class in which $s_i$ falls.

    At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. The splitting criterion is the normalized information gain (difference in entropy). The attribute with the highest normalized information gain is chosen to make the decision. The C4.5 algorithm then recurses on the partitioned sublists.

    This algorithm has a few base cases:

    - All the samples in the list belong to the same class. When this happens, it simply creates a leaf node for the decision tree saying to choose that class.
    - None of the features provide any information gain. In this case, C4.5 creates a decision node higher up the tree using the expected value of the class.
    - An instance of a previously unseen class is encountered. Again, C4.5 creates a decision node higher up the tree using the expected value.

    === Pseudocode ===

    In pseudocode, the general algorithm for building decision trees is:

    1. Check for the above base cases.
    2. For each attribute a, find the normalized information gain ratio from splitting on a.
    3. Let a_best be the attribute with the highest normalized information gain.
    4. Create a decision node that splits on a_best.
    5. Recurse on the sublists obtained by splitting on a_best, and add those nodes as children of node.

    == Improvements from ID3 algorithm ==

    C4.5 made a number of improvements to ID3. Some of these are:

    - Handling both continuous and discrete attributes: to handle continuous attributes, C4.5 creates a threshold and then splits the list into those samples whose attribute value is above the threshold and those whose value is less than or equal to it.
    - Handling training data with missing attribute values: C4.5 allows attribute values to be marked as missing. Missing attribute values are simply not used in gain and entropy calculations.
    - Handling attributes with differing costs.
    - Pruning trees after creation: C4.5 goes back through the tree once it has been created and attempts to remove branches that do not help by replacing them with leaf nodes.

    == Improvements in C5.0/See5 algorithm ==

    Quinlan went on to create C5.0 and See5 (C5.0 for Unix/Linux, See5 for Windows), which he markets commercially. C5.0 offers a number of improvements on C4.5. Some of these are:

    - Speed: C5.0 is significantly faster than C4.5 (several orders of magnitude).
    - Memory usage: C5.0 is more memory efficient than C4.5.
    - Smaller decision trees: C5.0 gets similar results to C4.5 with considerably smaller decision trees.
    - Support for boosting: boosting improves the trees and gives them more accuracy.
    - Weighting: C5.0 allows you to weight different cases and misclassification types.
    - Winnowing: a C5.0 option automatically winnows the attributes to remove those that may be unhelpful.

    Source for a single-threaded Linux version of C5.0 is available under the GNU General Public License (GPL).
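
The C4.5 pseudocode in this excerpt can be sketched in Python as follows. This is a minimal illustration, not Quinlan's implementation: it handles discrete attributes only, skips continuous thresholds, missing values, costs, and pruning, and falls back to a majority-class leaf when no attribute gives any gain (a simplification of the "expected value" rule described in the excerpt). The function names, the nested-dict tree format, and the toy data are all assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gain_ratio(rows, labels, attr):
    """Information gain from splitting on attr, divided by the split information."""
    total = len(rows)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = -sum(len(g) / total * math.log2(len(g) / total)
                      for g in groups.values())
    return gain / split_info if split_info > 0 else 0.0

def best_attribute(rows, labels, attrs):
    """Pick a_best: the attribute with the highest gain ratio."""
    return max(attrs, key=lambda a: gain_ratio(rows, labels, a))

def build_tree(rows, labels, attrs):
    # Base case: all samples share one class, so emit a leaf for that class.
    if len(set(labels)) == 1:
        return labels[0]
    # Base case (simplified): no attribute helps, so emit the majority class.
    if not attrs or all(gain_ratio(rows, labels, a) <= 1e-12 for a in attrs):
        return Counter(labels).most_common(1)[0][0]
    a_best = best_attribute(rows, labels, attrs)
    remaining = [a for a in attrs if a != a_best]
    children = {}
    # Recurse on the sublists obtained by splitting on a_best.
    for value in set(row[a_best] for row in rows):
        pairs = [(r, l) for r, l in zip(rows, labels) if r[a_best] == value]
        children[value] = build_tree([r for r, _ in pairs],
                                     [l for _, l in pairs],
                                     remaining)
    return {"split_on": a_best, "children": children}

# Hypothetical toy data: "outlook" fully determines the class, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["stay", "stay", "play", "play"]
tree = build_tree(rows, labels, ["outlook", "windy"])
```

On this toy set, splitting on "outlook" yields two pure subsets, so it has the highest gain ratio and becomes the root, with one leaf per outlook value. Dividing the gain by the split information is what distinguishes the gain ratio from the plain information gain.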

    Read more →