AI For Business Escp

  • Image registration

    Image registration is the process of transforming different sets of data into one coordinate system. Data may be multiple photographs, data from different sensors, times, depths, or viewpoints. It is used in computer vision, medical imaging, military automatic target recognition, and compiling and analyzing images and data from satellites. Registration is necessary in order to be able to compare or integrate the data obtained from these different measurements. == Algorithm classification == === Intensity-based vs feature-based === Image registration or image alignment algorithms can be classified into intensity-based and feature-based. One of the images is referred to as the target, fixed or sensed image and the others are referred to as the moving or source images. Image registration involves spatially transforming the source/moving image(s) to align with the target image. The reference frame in the target image is stationary, while the other datasets are transformed to match to the target. Intensity-based methods compare intensity patterns in images via correlation metrics, while feature-based methods find correspondence between image features such as points, lines, and contours. Intensity-based methods register entire images or sub-images. If sub-images are registered, centers of corresponding sub images are treated as corresponding feature points. Feature-based methods establish a correspondence between a number of especially distinct points in images. Knowing the correspondence between a number of points in images, a geometrical transformation is then determined to map the target image to the reference images, thereby establishing point-by-point correspondence between the reference and target images. Methods combining intensity-based and feature-based information have also been developed. 
=== Transformation models === Image registration algorithms can also be classified according to the transformation models they use to relate the target image space to the reference image space. The first broad category of transformation models includes affine transformations, which include rotation, scaling, translation and shearing. Affine transformations are global in nature and therefore cannot model local geometric differences between images. The second category of transformations allows 'elastic' or 'nonrigid' transformations. These transformations are capable of locally warping the target image to align with the reference image. Nonrigid transformations include radial basis functions (thin-plate or surface splines, multiquadrics, and compactly-supported transformations), physical continuum models (viscous fluids), and large deformation models (diffeomorphisms). Transformations are commonly described by a parametrization, where the model dictates the number of parameters. For instance, the translation of a full image can be described by a single translation vector. These models are called parametric models. Non-parametric models, on the other hand, do not follow any parametrization, allowing each image element to be displaced arbitrarily. A number of programs implement both estimation and application of a warp field; such functionality is part of the SPM and AIR programs. === Transformations of coordinates via the law of function composition rather than addition === Alternatively, many advanced methods for spatial normalization build on structure-preserving transformations, namely homeomorphisms and diffeomorphisms, since they carry smooth submanifolds smoothly during transformation. In the modern field of Computational Anatomy, diffeomorphisms are generated from flows because diffeomorphisms are not additive: they form a group, but under the law of function composition rather than addition.
For this reason, flows, which generalize the ideas of additive groups, allow for generating large deformations that preserve topology, providing one-to-one and onto transformations. Computational methods for generating such transformations are often called LDDMM (large deformation diffeomorphic metric mapping); they provide flows of diffeomorphisms as the main computational tool for connecting coordinate systems corresponding to the geodesic flows of Computational Anatomy. A number of programs generate diffeomorphic transformations of coordinates via diffeomorphic mapping, including MRI Studio and MRI Cloud.org. === Spatial vs frequency domain methods === Spatial methods operate in the image domain, matching intensity patterns or features in images. Some of the feature matching algorithms are outgrowths of traditional techniques for performing manual image registration, in which an operator chooses corresponding control points (CP) in images. When the number of control points exceeds the minimum required to define the appropriate transformation model, iterative algorithms like RANSAC can be used to robustly estimate the parameters of a particular transformation type (e.g. affine) for registration of the images. Frequency-domain methods find the transformation parameters for registration of the images while working in the transform domain. Such methods work for simple transformations, such as translation, rotation, and scaling. Applying the phase correlation method to a pair of images produces a third image which contains a single peak. The location of this peak corresponds to the relative translation between the images. Unlike many spatial-domain algorithms, the phase correlation method is resilient to noise, occlusions, and other defects typical of medical or satellite images. Additionally, phase correlation uses the fast Fourier transform to compute the cross-correlation between the two images, generally resulting in large performance gains.
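The phase correlation idea can be illustrated with a minimal sketch. It assumes small square images stored as lists of lists, a purely cyclic (wrap-around) translation, and a naive discrete Fourier transform instead of a fast Fourier transform so that it needs only the standard library. The helper names `dft2`, `phase_correlation` and `cyclic_shift` are hypothetical and chosen for this illustration only.

```python
import cmath
import random


def dft2(img, inverse=False):
    """Naive 2-D DFT of a small square image; O(n^4), for illustration only."""
    n = len(img)
    sign = 1 if inverse else -1
    scale = 1.0 / (n * n) if inverse else 1.0
    out = [[0j] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0j
            for y in range(n):
                for x in range(n):
                    angle = sign * 2.0 * cmath.pi * (u * y + v * x) / n
                    total += img[y][x] * cmath.exp(1j * angle)
            out[u][v] = total * scale
    return out


def phase_correlation(fixed, moving):
    """Estimate the cyclic shift (dy, dx) that maps `fixed` onto `moving`."""
    n = len(fixed)
    f_hat = dft2(fixed)
    m_hat = dft2(moving)
    cross = [[0j] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            c = m_hat[u][v] * f_hat[u][v].conjugate()
            mag = abs(c)
            # Keep only the phase; frequencies with no energy contribute nothing.
            cross[u][v] = c / mag if mag > 1e-12 else 0j
    surface = dft2(cross, inverse=True)
    # The correlation surface peaks at the relative translation.
    return max(((y, x) for y in range(n) for x in range(n)),
               key=lambda p: surface[p[0]][p[1]].real)


def cyclic_shift(img, dy, dx):
    """Shift an image by (dy, dx) with wrap-around (an assumption of this sketch)."""
    n = len(img)
    return [[img[(y - dy) % n][(x - dx) % n] for x in range(n)] for y in range(n)]


rng = random.Random(0)
fixed = [[rng.random() for _ in range(8)] for _ in range(8)]
moving = cyclic_shift(fixed, 2, 3)
estimated_shift = phase_correlation(fixed, moving)
print(estimated_shift)
```

For this synthetic pair the estimate should recover the shift of 2 rows and 3 columns. Real images are not cyclic, so practical implementations typically window the images and use an FFT library.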
The method can be extended to determine rotation and scaling differences between two images by first converting the images to log-polar coordinates. Due to properties of the Fourier transform, the rotation and scaling parameters can be determined in a manner invariant to translation. === Single- vs multi-modality methods === Another classification can be made between single-modality and multi-modality methods. Single-modality methods tend to register images in the same modality acquired by the same scanner/sensor type, while multi-modality registration methods tend to register images acquired by different scanner/sensor types. Multi-modality registration methods are often used in medical imaging as images of a subject are frequently obtained from different scanners. Examples include registration of brain CT/MRI images or whole body PET/CT images for tumor localization, registration of contrast-enhanced CT images against non-contrast-enhanced CT images for segmentation of specific parts of the anatomy, and registration of ultrasound and CT images for prostate localization in radiotherapy. === Automatic vs interactive methods === Registration methods may be classified based on the level of automation they provide. Manual, interactive, semi-automatic, and automatic methods have been developed. Manual methods provide tools to align the images manually. Interactive methods reduce user bias by performing certain key operations automatically while still relying on the user to guide the registration. Semi-automatic methods perform more of the registration steps automatically but depend on the user to verify the correctness of a registration. Automatic methods do not allow any user interaction and perform all registration steps automatically. === Similarity measures for image registration === Image similarities are broadly used in medical imaging. An image similarity measure quantifies the degree of similarity between intensity patterns in two images.
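As a rough illustration of what such a measure computes, the sketch below implements two common same-modality measures, the sum of squared intensity differences and a normalized cross-correlation. Images are assumed to be equally sized lists of lists of numbers. The function names `ssd` and `ncc` are hypothetical, and the zero-mean normalization used here is one common variant rather than the only definition.

```python
import math


def ssd(a, b):
    """Sum of squared intensity differences; 0 means the images are identical."""
    return sum((p - q) ** 2
               for row_a, row_b in zip(a, b)
               for p, q in zip(row_a, row_b))


def ncc(a, b):
    """Zero-mean normalized cross-correlation in [-1, 1]."""
    xs = [p for row in a for p in row]
    ys = [q for row in b for q in row]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((p - mean_x) * (q - mean_y) for p, q in zip(xs, ys))
    den = math.sqrt(sum((p - mean_x) ** 2 for p in xs) *
                    sum((q - mean_y) ** 2 for q in ys))
    # A constant image has no pattern to correlate; report 0 in that case.
    return num / den if den else 0.0


image = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
# Same intensity pattern with a different gain and offset.
brighter = [[2 * p + 10 for p in row] for row in image]

print(ssd(image, image), ssd(image, brighter), ncc(image, brighter))
```

The example shows why the choice of measure matters: the squared-difference score penalizes the change in gain and offset, while the normalized correlation treats the two images as a perfect match.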
The choice of an image similarity measure depends on the modality of the images to be registered. Common examples of image similarity measures include cross-correlation, mutual information, sum of squared intensity differences, and ratio image uniformity. Mutual information and normalized mutual information are the most popular image similarity measures for registration of multimodality images. Cross-correlation, sum of squared intensity differences and ratio image uniformity are commonly used for registration of images in the same modality. Many new cost functions based on matching methods via large deformations have emerged in the field of Computational Anatomy, including measure matching (for point sets or landmarks without correspondence), curve matching, and surface matching via mathematical currents and varifolds. == Uncertainty == There is a level of uncertainty associated with registering images that have any spatio-temporal differences. A confident registration with a measure of uncertainty is critical for many change detection applications such as medical diagnostics. In remote sensing applications where a digital image pixel may represent several kilometers of spatial distance (such as NASA's LANDSAT imagery), an uncertain image registration can mean that a solution could b

  • Inductive logic programming

    Inductive logic programming (ILP) is a subfield of symbolic artificial intelligence which uses logic programming as a uniform representation for examples, background knowledge and hypotheses. The term "inductive" here refers to philosophical (i.e. suggesting a theory to explain observed facts) rather than mathematical (i.e. proving a property for all members of a well-ordered set) induction. Given an encoding of the known background knowledge and a set of examples represented as a logical database of facts, an ILP system will derive a hypothesised logic program which entails all the positive and none of the negative examples. Schema: positive examples + negative examples + background knowledge ⇒ hypothesis. Bioinformatics and drug design have been highlighted as a principal application area of inductive logic programming techniques. == History == Building on earlier work on Inductive inference, Gordon Plotkin was the first to formalise induction in a clausal setting around 1970, adopting an approach of generalising from examples. In 1981, Ehud Shapiro introduced several ideas that would shape the field in his new approach of model inference, an algorithm employing refinement and backtracing to search for a complete axiomatisation of given examples. His first implementation was the Model Inference System in 1981: a Prolog program that inductively inferred Horn clause logic programs from positive and negative examples. The term Inductive Logic Programming was first introduced in a paper by Stephen Muggleton in 1990, defined as the intersection of machine learning and logic programming. Muggleton and Wray Buntine introduced predicate invention and inverse resolution in 1988. Several inductive logic programming systems that proved influential appeared in the early 1990s. FOIL, introduced by Ross Quinlan in 1990 was based on upgrading propositional learning algorithms AQ and ID3. 
Golem, introduced by Muggleton and Feng in 1990, went back to a restricted form of Plotkin's least generalisation algorithm. The Progol system, introduced by Muggleton in 1995, first implemented inverse entailment, and inspired many later systems. Aleph, a descendant of Progol introduced by Ashwin Srinivasan in 2001, is still one of the most widely used systems as of 2022. At around the same time, the first practical applications emerged, particularly in bioinformatics, where by 2000 inductive logic programming had been successfully applied to drug design, carcinogenicity and mutagenicity prediction, and elucidation of the structure and function of proteins. Unlike the focus on automatic programming inherent in the early work, these fields used inductive logic programming techniques from a viewpoint of relational data mining. The success of those initial applications and the lack of progress in recovering larger traditional logic programs shaped the focus of the field. Recently, classical tasks from automated programming have moved back into focus, as the introduction of meta-interpretative learning makes predicate invention and learning recursive programs more feasible. This technique was pioneered with the Metagol system introduced by Muggleton, Dianhuan Lin, Niels Pahlavi and Alireza Tamaddoni-Nezhad in 2014. This allows ILP systems to work with fewer examples, and brought successes in learning string transformation programs, answer set grammars and general algorithms. == Setting == Inductive logic programming has adopted several different learning settings, the most common of which are learning from entailment and learning from interpretations. In both cases, the input is provided in the form of background knowledge B, a logical theory (commonly in the form of clauses used in logic programming), as well as positive and negative examples, denoted $E^{+}$ and $E^{-}$ respectively.
The output is given as a hypothesis H, itself a logical theory that typically consists of one or more clauses. The two settings differ in the format of examples presented. === Learning from entailment === As of 2022, learning from entailment is by far the most popular setting for inductive logic programming. In this setting, the positive and negative examples are given as finite sets $E^{+}$ and $E^{-}$ of positive and negated ground literals, respectively. A correct hypothesis H is a set of clauses satisfying the following requirements, where the turnstile symbol $\models$ stands for logical entailment: Completeness: $B \cup H \models E^{+}$; Consistency: $B \cup H \cup E^{-} \not\models \textit{false}$. Completeness requires any generated hypothesis H to explain all positive examples $E^{+}$, and consistency forbids generation of any hypothesis H that is inconsistent with the negative examples $E^{-}$, both given the background knowledge B. In Muggleton's setting of concept learning, "completeness" is referred to as "sufficiency", and "consistency" as "strong consistency". Two further conditions are added: "Necessity", which postulates that B does not entail $E^{+}$, does not impose a restriction on H, but forbids any generation of a hypothesis as long as the positive facts are explainable without it. "Weak consistency", which states that no contradiction can be derived from $B \land H$, forbids generation of any hypothesis H that contradicts the background knowledge B. Weak consistency is implied by strong consistency; if no negative examples are given, both requirements coincide. Weak consistency is particularly important in the case of noisy data, where completeness and strong consistency cannot be guaranteed.
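The completeness and consistency conditions can be checked mechanically in a simple special case. The sketch below assumes that the background knowledge consists of ground facts, that the hypothesis is a set of function-free definite clauses, and that negative examples are given as the atoms that must not be entailed. Under those assumptions, entailment of a ground atom reduces to membership in the least model computed by forward chaining. All names (`least_model`, `is_correct`, the family facts) are hypothetical and not taken from any ILP system.

```python
def is_var(term):
    """Prolog convention: variables start with an upper-case letter."""
    return term[:1].isupper()


def match(atom, fact, subst):
    """Extend `subst` so that `atom` equals the ground `fact`, or return None."""
    if atom[0] != fact[0] or len(atom) != len(fact):
        return None
    subst = dict(subst)
    for term, value in zip(atom[1:], fact[1:]):
        if is_var(term):
            if subst.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return subst


def least_model(facts, rules):
    """Forward-chain clauses (head, body) until no new ground atom appears."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            substs = [{}]
            for atom in body:
                extended = []
                for s in substs:
                    for fact in known:
                        s2 = match(atom, fact, s)
                        if s2 is not None:
                            extended.append(s2)
                substs = extended
            for s in substs:
                derived = (head[0],) + tuple(s.get(t, t) for t in head[1:])
                if derived not in known:
                    known.add(derived)
                    changed = True
    return known


def is_correct(background, hypothesis, positives, negatives):
    """Return (complete, consistent) for the given hypothesis."""
    model = least_model(background, hypothesis)
    complete = all(e in model for e in positives)
    consistent = not any(e in model for e in negatives)
    return complete, consistent


background = {("parent", "ann", "bob"), ("parent", "bob", "cat"), ("parent", "bob", "dan")}
positives = {("grandparent", "ann", "cat"), ("grandparent", "ann", "dan")}
negatives = {("grandparent", "ann", "bob"), ("grandparent", "bob", "cat")}

# grandparent(X, Z) :- parent(X, Y), parent(Y, Z).
good = [(("grandparent", "X", "Z"), [("parent", "X", "Y"), ("parent", "Y", "Z")])]
# grandparent(X, Y) :- parent(X, Y).  Covers negatives and misses positives.
too_general = [(("grandparent", "X", "Y"), [("parent", "X", "Y")])]

print(is_correct(background, good, positives, negatives))
print(is_correct(background, too_general, positives, negatives))
```

The first hypothesis is both complete and consistent; the second fails both tests, since it derives the negative examples and none of the positive ones.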
=== Learning from interpretations === In learning from interpretations, the positive and negative examples are given as a set of complete or partial Herbrand structures, each of which is itself a finite set of ground literals. Such a structure e is said to be a model of the set of clauses $B \cup H$ if for any substitution $\theta$ and any clause $\mathrm{head} \leftarrow \mathrm{body}$ in $B \cup H$ such that $\mathrm{body}\theta \subseteq e$, $\mathrm{head}\theta \subseteq e$ also holds. The goal is then to output a hypothesis that is complete, meaning every positive example is a model of $B \cup H$, and consistent, meaning that no negative example is a model of $B \cup H$. == Approaches to ILP == An inductive logic programming system is a program that takes as an input logic theories $B, E^{+}, E^{-}$ and outputs a correct hypothesis H with respect to theories $B, E^{+}, E^{-}$. A system is complete if and only if for any input logic theories $B, E^{+}, E^{-}$ any correct hypothesis H with respect to these input theories can be found with its hypothesis search procedure. Inductive logic programming systems can be roughly divided into two classes, search-based and meta-interpretative systems. Search-based systems exploit that the space of possible clauses forms a complete lattice under the subsumption relation, where one clause $C_{1}$ subsumes another clause $C_{2}$ if there is a substitution $\theta$ such that $C_{1}\theta$, the result of applying $\theta$ to $C_{1}$, is a subset of $C_{2}$. This lattice can be traversed either bottom-up or top-down.
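The subsumption relation can be tested with a small backtracking search, sketched here under simplifying assumptions: clauses are lists of function-free literals of the form `(sign, predicate, arg, ...)`, variables start with an upper-case letter, and the terms of the second clause are treated as fixed symbols. The names `match_literal` and `subsumes` are hypothetical.

```python
def is_var(term):
    """Prolog convention: variables start with an upper-case letter."""
    return term[:1].isupper()


def match_literal(literal, target, subst):
    """Extend `subst` so that `literal` under it equals `target`, or return None."""
    if literal[:2] != target[:2] or len(literal) != len(target):
        return None  # different sign, predicate symbol, or arity
    subst = dict(subst)
    for term, value in zip(literal[2:], target[2:]):
        if is_var(term):
            if subst.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return subst


def subsumes(c1, c2, subst=None):
    """True if some substitution maps every literal of c1 onto a literal of c2."""
    subst = {} if subst is None else subst
    if not c1:
        return True
    first, rest = c1[0], c1[1:]
    for candidate in c2:
        theta = match_literal(first, candidate, subst)
        # Backtrack over every literal of c2 that `first` can be mapped onto.
        if theta is not None and subsumes(rest, c2, theta):
            return True
    return False


# p(X) <- q(X, Y)
general = [("+", "p", "X"), ("-", "q", "X", "Y")]
# p(a) <- q(a, b), r(b)
specific = [("+", "p", "a"), ("-", "q", "a", "b"), ("-", "r", "b")]

print(subsumes(general, specific))   # substitution {X/a, Y/b} works
print(subsumes(specific, general))   # no substitution can turn a into X
```

The search is exponential in the worst case, which is consistent with subsumption testing being a known bottleneck; this sketch makes no attempt at the optimizations real systems use.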
=== Bottom-up search === Bottom-up methods to search the subsumption lattice have been investigated since Plotkin's first work on formalising induction in clausal logic in 1970. Techniques used include least general generalisation, based on anti-unification, and inverse resolution, based on inverting the resolution inference rule. ==== Least general generalisation ==== A least general generalisation algorithm takes as input two clauses $C_{1}$ and $C_{2}$ and outputs the least general generalisation of $C_{1}$ and $C_{2}$, that is, a clause $C$ that subsumes $C_{1}$ and $C_{2}$, and that is subsumed by every other clause that subsumes $C_{1}$ and $C_{2}$. The least general generalisation can be computed by first computing all selections from $C_{1}$ and $C_{2}$, which are pairs of literals $(L, M) \in (C_{1} \times C_{2})$ sharing the same predicate symbol and negated/unnegated status. Then, the least general generalisation is obtained as the disjunction of the least general generalisations of the indi

  • Evolutionary multimodal optimization

    In applied mathematics, multimodal optimization deals with optimization tasks that involve finding all or most of the multiple (at least locally optimal) solutions of a problem, as opposed to a single best solution. Evolutionary multimodal optimization is a branch of evolutionary computation, which is closely related to machine learning. Wong provides a short survey, while the chapter by Shir and the book by Preuss cover the topic in more detail. == Motivation == Knowledge of multiple solutions to an optimization task is especially helpful in engineering, where physical or cost constraints may mean that the best result is not always realizable. In such a scenario, if multiple solutions (locally and/or globally optimal) are known, the implementation can be quickly switched to another solution and still obtain the best possible system performance. Multiple solutions can also be analyzed to discover hidden properties (or relationships) of the underlying optimization problem, which makes them important for obtaining domain knowledge. In addition, algorithms for multimodal optimization usually not only locate multiple optima in a single run, but also preserve population diversity, which gives them global optimization ability on multimodal functions. Moreover, techniques for multimodal optimization are often borrowed as diversity maintenance techniques for other problems. == Background == Classical optimization techniques would need multiple restart points and multiple runs in the hope that a different solution is discovered in each run, with no guarantee of success. Evolutionary algorithms (EAs), due to their population-based approach, provide a natural advantage over classical optimization techniques.
They maintain a population of possible solutions, which are processed every generation, and if the multiple solutions can be preserved over all these generations, then at termination of the algorithm we will have multiple good solutions, rather than only the best solution. Note that this is against the natural tendency of classical optimization techniques, which will always converge to the best solution, or a sub-optimal solution (in a rugged, “badly behaving” function). Finding and maintaining multiple solutions is where the challenge of using EAs for multimodal optimization lies. Niching is a generic term for techniques that find and preserve multiple stable niches, or favorable parts of the solution space possibly around multiple solutions, so as to prevent convergence to a single solution. The field of evolutionary algorithms encompasses genetic algorithms (GAs), evolution strategy (ES), differential evolution (DE), particle swarm optimization (PSO), and other methods. Attempts have been made to solve multimodal optimization in all these realms, and most, if not all, of the various methods implement niching in some form or other. == Multimodal optimization using genetic algorithms/evolution strategies == De Jong's crowding method, Goldberg's sharing function approach, Petrowski's clearing method, restricted mating, and maintaining multiple subpopulations are some of the popular approaches that have been proposed by the community. The first two methods are especially well studied; however, they do not perform explicit separation into solutions belonging to different basins of attraction. The application of multimodal optimization within ES was not explicit for many years, and has been explored only recently. A niching framework utilizing derandomized ES was introduced by Shir, proposing the CMA-ES as a niching optimizer for the first time.
The underpinning of that framework was the selection of a peak individual per subpopulation in each generation, followed by its sampling to produce the consecutive dispersion of search-points. The biological analogy of this machinery is an alpha-male winning all the imposed competitions and thereafter dominating its ecological niche, which then obtains all the sexual resources therein to generate its offspring. Recently, an evolutionary multiobjective optimization (EMO) approach was proposed, in which a suitable second objective is added to the originally single-objective multimodal optimization problem, so that the multiple solutions form a weak Pareto-optimal front. Hence, the multimodal optimization problem can be solved for its multiple solutions using an EMO algorithm. Improving upon their work, the same authors have made their algorithm self-adaptive, thus eliminating the need for pre-specifying the parameters. An approach that does not use any radius for separating the population into subpopulations (or species), but employs the space topology instead, has also been proposed.
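To make the niching idea concrete, here is a minimal sketch of fitness sharing in the spirit of the sharing function approach mentioned above. It assumes a one-dimensional search space, absolute difference as the distance, and the commonly used power-law sharing function with a niche radius `sigma` and shape parameter `alpha`. These choices and the function names are assumptions made for illustration, not a description of any specific published implementation.

```python
def sharing(distance, sigma, alpha=1.0):
    """Sharing function: 1 at distance 0, falling to 0 at the niche radius sigma."""
    if distance < sigma:
        return 1.0 - (distance / sigma) ** alpha
    return 0.0


def shared_fitness(population, fitness, sigma, alpha=1.0):
    """Divide each raw fitness by its niche count (the crowding around that point)."""
    result = []
    for x in population:
        # The individual itself contributes sharing(0) = 1, so the count is >= 1.
        niche_count = sum(sharing(abs(x - y), sigma, alpha) for y in population)
        result.append(fitness(x) / niche_count)
    return result


# Three individuals crowd one optimum; a single individual sits on another.
population = [0.0, 0.0, 0.0, 5.0]
scores = shared_fitness(population, fitness=lambda x: 10.0, sigma=1.0)
print(scores)
```

Although every individual has the same raw fitness, the lone individual keeps its full score while the crowded ones split theirs three ways, so selection pressure shifts toward the sparsely populated peak and both optima can survive.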

  • Joseph Nechvatal

    Joseph Nechvatal (born January 15, 1951) is an American post-conceptual digital artist and art theoretician who creates computer-assisted paintings and computer animations, often using custom computer viruses. == Life and work == Joseph Nechvatal was born in Chicago. He studied fine art and philosophy at Southern Illinois University Carbondale, Cornell University, and Columbia University. He earned a Doctor of Philosophy in Philosophy of Art and Technology at the Planetary Collegium at University of Wales, Newport and has taught art theory and art history at the School of Visual Arts. He has had many solo exhibitions and is one of five artists that art historian Patrick Frank examines in his 2024 book Art of the 1980s: As If the Digital Mattered. His work in the late 1970s and early 1980s chiefly consisted of postminimal gray palimpsest-like drawings that were often photo-mechanically enlarged. Beginning in 1979 he became associated with the artist group Colab, organized the Public Arts International/Free Speech series, and helped establish the non-profit group ABC No Rio. In 1983 he co-founded the avant-garde electronic art music audio project Tellus Audio Cassette Magazine. In 1984, Nechvatal began work on an opera called XS: The Opera Opus (1984-6) with the no wave musical composer Rhys Chatham. He began using computers and robotics to make post-conceptual paintings in 1986 and later, in his signature work, began to employ self-created computer viruses. From 1991 to 1993, he was artist-in-residence at the Louis Pasteur Atelier in Arbois, France and at the Saline Royale/Ledoux Foundation's computer lab. There he worked on The Computer Virus Project, his first artistic experiment with computer viruses and computer virus animation. He exhibited computer-robotic paintings at Documenta 8 in 1987.
In 2002 he extended his experimentation into viral artificial life through a collaboration with the programmer Stephane Sikora of music2eye in a work called the Computer Virus Project II. Nechvatal has also created a noise music work called viral symphOny, a collaborative sound symphony created by using his computer virus software at the Institute for Electronic Arts at Alfred University. In 2021 Pentiments released Nechvatal's retrospective audio cassette called Selected Sound Works (1981-2021) and in 2022 his The Viral Tempest, a double vinyl LP of new audio work. In 2025, he joined the roster of artists/musicians at Table of the Elements with two CD/book releases: Selected Sound Works (1981-2021) and The Marriage of Orlando and Artaud, Even. From 1999 to 2013, Nechvatal taught art theories of immersive virtual reality and the viractual at the School of Visual Arts in New York City (SVA). A book of his collected essays entitled Towards an Immersive Intelligence: Essays on the Work of Art in the Age of Computer Technology and Virtual Reality (1993–2006) was published by Edgewise Press in 2009. Also in 2009, his virtual reality art theory and art history book Immersive Ideals / Critical Distances was published. In 2011, his book Immersion Into Noise was published by Open Humanities Press in conjunction with the University of Michigan Library's Scholarly Publishing Office. Nechvatal has also published three books with Punctum Books: Minóy (noise music, ed., 2014), Destroyer of Naivetés (poetry, 2015), and Styling Sagaciousness (poetry, 2022). In 2023 his art theory cybersex farce novella venus©~Ñ~vibrator, even was published by Orbis Tertius Press. The Joseph Nechvatal archive is housed at The Fales Library Downtown Collection at the NYU Special Collections Library in New York City. === Viractualism === Viractualism is an art theory concept developed by Nechvatal in 1999 from Ph.D. research he conducted at the Planetary Collegium at University of Wales, Newport.
There he developed his concept of the viractual, which strives to create an interface between the actual and the virtual.

  • Structured sparsity regularization

    Structured sparsity regularization is a class of methods, and an area of research in statistical learning theory, that extend and generalize sparsity regularization learning methods. Both sparsity and structured sparsity regularization methods seek to exploit the assumption that the output variable $Y$ (i.e., response, or dependent variable) to be learned can be described by a reduced number of variables in the input space $X$ (i.e., the domain, space of features or explanatory variables). Sparsity regularization methods focus on selecting the input variables that best describe the output. Structured sparsity regularization methods generalize and extend sparsity regularization methods by allowing for optimal selection over structures like groups or networks of input variables in $X$. Common motivations for the use of structured sparsity methods are model interpretability, high-dimensional learning (where the dimensionality of $X$ may be higher than the number of observations $n$), and reduction of computational complexity. Moreover, structured sparsity methods allow prior assumptions on the structure of the input variables to be incorporated, such as overlapping groups, non-overlapping groups, and acyclic graphs. Examples of uses of structured sparsity methods include face recognition, magnetic resonance image (MRI) processing, socio-linguistic analysis in natural language processing, and analysis of genetic expression in breast cancer.
== Definition and related concepts == === Sparsity regularization === Consider the linear kernel regularized empirical risk minimization problem with a loss function $V(y_{i}, f(x))$ and the $\ell_{0}$ "norm" as the regularization penalty: $\min_{w \in \mathbb{R}^{d}} \frac{1}{n} \sum_{i=1}^{n} V(y_{i}, \langle w, x_{i} \rangle) + \lambda \|w\|_{0},$ where $x, w \in \mathbb{R}^{d}$, and $\|w\|_{0}$ denotes the $\ell_{0}$ "norm", defined as the number of nonzero entries of the vector $w$. $f(x) = \langle w, x \rangle$ is said to be sparse if $\|w\|_{0} = s < d$ […] $w_{j} > 0$. However, as in this case groups may overlap, we take the intersection of the complements of those groups that are not set to zero. This intersection-of-complements selection criterion implies the modeling choice that we allow some coefficients within a particular group $g$ to be set to zero, while others within the same group $g$ may remain positive. In other words, coefficients within a group may differ depending on the several group memberships that each variable within the group may have. ==== Union of groups: latent group Lasso ==== A different approach is to consider the union of groups for variable selection. This approach captures the modeling situation where variables can be selected as long as they belong to at least one group with positive coefficients. This modeling perspective implies that we want to preserve group structure.
The formulation of the union of groups approach is also referred to as latent group Lasso, and requires modifying the group $\ell_{2}$ norm considered above and introducing the following regularizer: $R(w) = \inf \left\{ \sum_{g} \|w_{g}\|_{g} : w = \sum_{g=1}^{G} \bar{w}_{g} \right\}$, where $w \in \mathbb{R}^{d}$, $w_{g} \in G_{g}$ is the vector of coefficients of group $g$, and $\bar{w}_{g} \in \mathbb{R}^{d}$ is a vector with coefficients $w_{g}^{j}$ for all variables $j$

  • Shattered set

    A class of sets is said to shatter another set if it is possible to "pick out" any subset of that set using intersection. The concept of shattered sets plays an important role in Vapnik–Chervonenkis theory, also known as VC-theory. Shattering and VC-theory are used in the study of empirical processes as well as in statistical computational learning theory. == Definition == Suppose A is a set and C is a class of sets. The class C shatters the set A if for each subset a of A, there is some element c of C such that $a = c \cap A$. Equivalently, C shatters A when the collection of their intersections is equal to A's power set: P(A) = { c ∩ A | c ∈ C }. We employ the letter C to refer to a "class" or "collection" of sets, as in a Vapnik–Chervonenkis class (VC-class). The set A is often assumed to be finite because, in empirical processes, we are interested in the shattering of finite sets of data points. == Example == We will show that the class of all discs in the plane (two-dimensional space) does not shatter every set of four points on the unit circle, yet the class of all convex sets in the plane does shatter every finite set of points on the unit circle. Let A be a set of four points on the unit circle and let C be the class of all discs. To test whether C shatters A, we attempt to draw a disc around every subset of points in A. First, we draw a disc around each isolated point. Next, we try to draw a disc around every pair of points. This turns out to be doable for adjacent points, but impossible for points on opposite sides of the circle. Any attempt to include those points on the opposite side will necessarily include other points not in that pair. Hence, a pair of opposite points cannot be isolated out of A using intersections with class C, and so C does not shatter A. Because there is some subset which cannot be isolated by any disc in C, we conclude that A is not shattered by C.
And, with a bit of thought, we can prove that no set of four points is shattered by this C. However, if we redefine C to be the class of all elliptical discs, we find that we can still isolate all the subsets from above, as well as the points that were formerly problematic. Thus, this specific set of 4 points is shattered by the class of elliptical discs. With a bit of thought, we could generalize that any finite set of points on a unit circle could be shattered by the class of all convex sets (visualize connecting the dots). == Shatter coefficient == To quantify the richness of a collection C of sets, we use the concept of shattering coefficients (also known as the growth function). For a collection C of sets $s \subset \Omega$, with $\Omega$ being any space, often a sample space, we define the nth shattering coefficient of C as $S_C(n) = \max_{\forall x_1, x_2, \dots, x_n \in \Omega} \operatorname{card}\{\, \{x_1, x_2, \dots, x_n\} \cap s,\ s \in C \,\}$ where $\operatorname{card}$ denotes the cardinality of the set and $x_1, x_2, \dots, x_n \in \Omega$ is any set of n points. $S_C(n)$ is the largest number of subsets of any set A of n points that can be formed by intersecting A with the sets in collection C. For example, if set A contains 3 points, its power set, $P(A)$, contains $2^3 = 8$ elements. If C shatters A, $S_C(3)$ would be 8 and $S_C(2)$ would be $2^2 = 4$. However, if one of those sets in $P(A)$ cannot be obtained through intersections with sets in C, then $S_C(3)$ would only be 7.
If none of those sets can be obtained (which can only happen when C is empty, since every intersection with A is itself a subset of A), $S_C(3)$ would be 0. Additionally, if $S_C(2) = 3$, for example, then every 2-point subset of A has some subset that cannot be obtained from intersections with C. It follows from this that $S_C(3)$ would also be less than 8 (i.e. C would not shatter A) because we have already located a "missing" set in the smaller power set of 2-point sets. This example illustrates some properties of $S_C(n)$: (1) $S_C(n) \leq 2^n$ for all n because $\{ s \cap A \mid s \in C \} \subseteq P(A)$ for any $A \subseteq \Omega$. (2) If $S_C(n) = 2^n$, that means there is a set of cardinality n which can be shattered by C. (3) If $S_C(N) < 2^N$ for some $N > 1$ then $S_C(n) < 2^n$ for all $n \geq N$. The third property means that if C cannot shatter any set of cardinality N then it cannot shatter sets of larger cardinalities. == Vapnik–Chervonenkis class == If A cannot be shattered by C, there will be a smallest value of n that makes $S_C(n)$ less than $2^n$ because as n gets larger, there are more sets that could be missed. Alternatively, there is also a largest value of n for which $S_C(n)$ is still $2^n$, because as n gets smaller, there are fewer sets that could be omitted. The extreme of this is $S_C(0)$ (the shattering coefficient of the empty set), which must always be $2^0 = 1$.
These statements lend themselves to defining the VC dimension of a class C as $VC(C) = \min_n \{ n : S_C(n) < 2^n \}$ or, alternatively, as $VC_0(C) = \max_n \{ n : S_C(n) = 2^n \}$. Note that $VC(C) = VC_0(C) + 1$. The VC dimension is usually defined as $VC_0$, the largest cardinality of a set of points that C can still shatter (i.e. the largest n such that $S_C(n) = 2^n$). Alternatively, if for any n there is a set of cardinality n which can be shattered by C, then $S_C(n) = 2^n$ for all n and the VC dimension of this class C is infinite. A class with finite VC dimension is called a Vapnik–Chervonenkis class or VC class. A class C is uniformly Glivenko–Cantelli if and only if it is a VC class.
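
As a rough illustration of the definitions above, the following Python sketch brute-forces shattering, the shattering coefficient, and $VC_0$ for a small finite class. The universe (five points labelled 0 to 4), the class of integer intervals, and all function names are assumptions chosen for this example and do not come from the source; the search is exponential and only suitable for tiny cases.

```python
from itertools import combinations

def traces(points, sets):
    """Distinct intersections of `points` with each set in the class."""
    pts = frozenset(points)
    return {pts & frozenset(s) for s in sets}

def shatters(points, sets):
    """True if every subset of `points` appears as an intersection."""
    return len(traces(points, sets)) == 2 ** len(set(points))

def shatter_coefficient(n, universe, sets):
    """S_C(n): the largest number of traces over all n-point subsets."""
    return max(len(traces(A, sets)) for A in combinations(universe, n))

def vc_dimension(universe, sets):
    """Largest n with S_C(n) == 2**n, restricted to this finite universe."""
    d = 0
    for n in range(1, len(universe) + 1):
        if shatter_coefficient(n, universe, sets) == 2 ** n:
            d = n
        else:
            break  # by the third property, larger n cannot be shattered
    return d

# Hypothetical example: points 0..4 on a line, C = all integer intervals [i, j].
universe = list(range(5))
intervals = [set(range(i, j + 1)) for i in range(5) for j in range(i, 5)]

s2 = shatter_coefficient(2, universe, intervals)
s3 = shatter_coefficient(3, universe, intervals)
vc = vc_dimension(universe, intervals)
```

For intervals, a pair of points is shattered but no triple is, because the two outer points cannot be picked out without the middle one. The sketch therefore gives $S_C(2) = 4$, $S_C(3) = 7$, and a VC dimension of 2 on this universe.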

    Read more →
  • Self-play

    Self-play

    Self-play is a technique for improving the performance of reinforcement learning agents. Intuitively, agents learn to improve their performance by playing "against themselves". == Definition and motivation == In multi-agent reinforcement learning experiments, researchers try to optimize the performance of a learning agent on a given task, in cooperation or competition with one or more agents. These agents learn by trial-and-error, and researchers may choose to have the learning algorithm play the role of two or more of the different agents. When successfully executed, this technique has a double advantage: (1) it provides a straightforward way to determine the actions of the other agents, resulting in a meaningful challenge; (2) it increases the amount of experience that can be used to improve the policy, by a factor of two or more, since the viewpoints of each of the different agents can be used for learning. Czarnecki et al. argue that most of the games that people play for fun are "Games of Skill", meaning games whose space of all possible strategies looks like a spinning top. In more detail, we can partition the space of strategies into sets $L_1, L_2, \dots, L_n$, such that any $i < j$, $\pi_i \in L_i$, $\pi_j \in L_j$ …

    Read more →

  • Autoencoder

    Autoencoder

    An autoencoder is a type of artificial neural network used to learn efficient codings of unlabeled data (unsupervised learning). An autoencoder learns two functions: an encoding function that transforms the input data, and a decoding function that recreates the input data from the encoded representation. The autoencoder learns an efficient representation (encoding) for a set of data, typically for dimensionality reduction, to generate lower-dimensional embeddings for subsequent use by other machine learning algorithms. Variants exist which aim to make the learned representations assume useful properties. Examples are regularized autoencoders (sparse, denoising and contractive autoencoders), which are effective in learning representations for subsequent classification tasks, and variational autoencoders, which can be used as generative models. Autoencoders are applied to many problems, including facial recognition, feature detection, anomaly detection, and learning the meaning of words. In terms of data synthesis, autoencoders can also be used to randomly generate new data that is similar to the input (training) data. == Mathematical principles == === Definition === An autoencoder is defined by the following components: Two sets: the space of encoded messages $\mathcal{Z}$; the space of decoded messages $\mathcal{X}$. Typically $\mathcal{X}$ and $\mathcal{Z}$ are Euclidean spaces, that is, $\mathcal{X} = \mathbb{R}^m$, $\mathcal{Z} = \mathbb{R}^n$ with $m > n$.
Two parametrized families of functions: the encoder family $E_\phi : \mathcal{X} \rightarrow \mathcal{Z}$, parametrized by $\phi$; the decoder family $D_\theta : \mathcal{Z} \rightarrow \mathcal{X}$, parametrized by $\theta$. For any $x \in \mathcal{X}$, we usually write $z = E_\phi(x)$, and refer to it as the code, the latent variable, latent representation, latent vector, etc. Conversely, for any $z \in \mathcal{Z}$, we usually write $x' = D_\theta(z)$, and refer to it as the (decoded) message. Usually, both the encoder and the decoder are defined as multilayer perceptrons (MLPs). For example, a one-layer-MLP encoder $E_\phi$ is: $E_\phi(x) = \sigma(Wx + b)$ where $\sigma$ is an element-wise activation function, $W$ is a "weight" matrix, and $b$ is a "bias" vector. === Training an autoencoder === An autoencoder, by itself, is simply a tuple of two functions. To judge its quality, we need a task. A task is defined by a reference probability distribution $\mu_{ref}$ over $\mathcal{X}$, and a "reconstruction quality" function $d : \mathcal{X} \times \mathcal{X} \to [0, \infty]$, such that $d(x, x')$ measures how much $x'$ differs from $x$.
With those, we can define the loss function for the autoencoder as $L(\theta, \phi) := \mathbb{E}_{x \sim \mu_{ref}}[d(x, D_\theta(E_\phi(x)))]$. The optimal autoencoder for the given task $(\mu_{ref}, d)$ is then $\arg\min_{\theta, \phi} L(\theta, \phi)$. The search for the optimal autoencoder can be accomplished by any mathematical optimization technique, but usually by gradient descent. This search process is referred to as "training the autoencoder". In most situations, the reference distribution is just the empirical distribution given by a dataset $\{x_1, \dots, x_N\} \subset \mathcal{X}$, so that $\mu_{ref} = \frac{1}{N} \sum_{i=1}^{N} \delta_{x_i}$ where $\delta_{x_i}$ is the Dirac measure, the quality function is just $L^2$ loss: $d(x, x') = \|x - x'\|_2^2$, and $\|\cdot\|_2$ is the Euclidean norm. Then the problem of searching for the optimal autoencoder is just a least-squares optimization: $\min_{\theta, \phi} L(\theta, \phi)$, where $L(\theta, \phi) = \frac{1}{N} \sum_{i=1}^{N} \|x_i - D_\theta(E_\phi(x_i))\|_2^2$. === Interpretation === An autoencoder has two main parts: an encoder that maps the message to a code, and a decoder that reconstructs the message from the code. An optimal autoencoder would perform as close to perfect reconstruction as possible, with "close to perfect" defined by the reconstruction quality function $d$. The simplest way to perform the copying task perfectly would be to duplicate the signal.
To suppress this behavior, the code space $\mathcal{Z}$ usually has fewer dimensions than the message space $\mathcal{X}$. Such an autoencoder is called undercomplete. It can be interpreted as compressing the message, or reducing its dimensionality. At the limit of an ideal undercomplete autoencoder, every possible code $z$ in the code space is used to encode a message $x$ that really appears in the distribution $\mu_{ref}$, and the decoder is also perfect: $D_\theta(E_\phi(x)) = x$. This ideal autoencoder can then be used to generate messages indistinguishable from real messages, by feeding its decoder an arbitrary code $z$ and obtaining $D_\theta(z)$, which is a message that really appears in the distribution $\mu_{ref}$. If the code space $\mathcal{Z}$ has dimension larger than (overcomplete) or equal to that of the message space $\mathcal{X}$, or the hidden units are given enough capacity, an autoencoder can learn the identity function and become useless. However, experimental results found that overcomplete autoencoders might still learn useful features. In the ideal setting, the code dimension and the model capacity could be set on the basis of the complexity of the data distribution to be modeled. A standard way to do so is to add modifications to the basic autoencoder, to be detailed below. == Variations == === Variational autoencoder (VAE) === Variational autoencoders (VAEs) belong to the family of variational Bayesian methods. Despite the architectural similarities with basic autoencoders, VAEs are designed with different goals and have a different mathematical formulation. The latent space is, in this case, composed of a mixture of distributions instead of fixed vectors.
Given an input dataset $x$ characterized by an unknown probability function $P(x)$ and a multivariate latent encoding vector $z$, the objective is to model the data as a distribution $p_\theta(x)$, with $\theta$ defined as the set of the network parameters so that $p_\theta(x) = \int_z p_\theta(x, z)\,dz$. === Sparse autoencoder (SAE) === Inspired by the sparse coding hypothesis in neuroscience, sparse autoencoders (SAE) are variants of autoencoders, such that the codes $E_\phi(x)$ for messages tend to be sparse codes, that is, $E_\phi(x)$ is close to zero in most entries. Sparse autoencoders may include more (rather than fewer) hidden units than inputs, but only a small number of the hidden units are allowed to be active at the same time. Encouraging sparsity improves performance on classification tasks. There are two main ways to enforce sparsity. One way is to simply clamp all but the highest-k activations of the latent code to zero. This is the k-sparse autoencoder. The k-sparse autoencoder inserts the following "k-sparse function" in the latent layer of a standard autoencoder: $f_k(x_1, \dots, x_n) = (x_1 b_1, \dots, x_n b_n)$ where $b_i = 1$ if $|x_i|$ ranks in the top k, and 0 otherwise. Backpropagating through $f_k$ is simple: set gradient to 0 for $b_i = 0$ entries, and keep gradient for $b_i = 1$ entries. This is essentially a generalized ReLU function. The other way is a relaxed version of the k-
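
To make the pieces above concrete, here is a minimal NumPy sketch of a one-layer encoder $E_\phi(x) = \sigma(Wx + b)$, the k-sparse function $f_k$, and the least-squares reconstruction loss. The choice of tanh for $\sigma$, the linear decoder with parameters `V` and `c`, and all function names are assumptions made for illustration; no training loop is shown.

```python
import numpy as np

def k_sparse(z, k):
    """f_k: keep the k entries of z with largest |z_i|, zero the rest."""
    mask = np.zeros_like(z)
    top = np.argsort(-np.abs(z))[:k]   # indices of the k largest magnitudes
    mask[top] = 1.0
    return z * mask

def encode(x, W, b):
    """One-layer encoder E(x) = sigma(W x + b), with sigma = tanh (assumed)."""
    return np.tanh(W @ x + b)

def decode(z, V, c):
    """Linear decoder D(z) = V z + c (an assumption for this sketch)."""
    return V @ z + c

def reconstruction_loss(X, W, b, V, c, k=None):
    """Mean squared L2 reconstruction error over the rows of X."""
    total = 0.0
    for x in X:
        z = encode(x, W, b)
        if k is not None:
            z = k_sparse(z, k)         # optional k-sparse latent layer
        total += np.sum((x - decode(z, V, c)) ** 2)
    return total / len(X)

# Hypothetical sizes: messages in R^4, codes in R^2 (undercomplete, m > n).
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))
W, b = rng.normal(size=(2, 4)), np.zeros(2)
V, c = rng.normal(size=(4, 2)), np.zeros(4)

loss_dense = reconstruction_loss(X, W, b, V, c)
loss_sparse = reconstruction_loss(X, W, b, V, c, k=1)
sparse_code = k_sparse(np.array([0.5, -3.0, 1.0, 2.0]), 2)
```

Here `sparse_code` keeps only the entries -3.0 and 2.0, the two largest in magnitude. Training would minimize `reconstruction_loss` over `W`, `b`, `V`, and `c`, typically by gradient descent.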

    Read more →
  • GoodRx

    GoodRx

    GoodRx Holdings, Inc. is an American healthcare company that operates a telemedicine platform and free-to-use website and mobile app that track prescription drug prices in the United States and provide drug coupons for discounts on medications. GoodRx compares prescription drug prices at more than 75,000 pharmacies in the United States. The platform allows users to consult a doctor online and obtain a prescription for certain types of medications. == History == === Financial performance === GoodRx was founded in Santa Monica, California, in 2011. GoodRx experienced substantial growth in net income in 2017 ($9 million), 2018 ($44 million), and 2019 ($66 million), but recorded a loss of $293.6 million in 2020 due to IPO-related expenses. In September 2020, GoodRx went public on the Nasdaq under the ticker symbol GDRX. The company priced its initial public offering at $33 per share, above the expected range of $24 to $28, raising more than $1.1 billion at an initial valuation of approximately $12.7 billion. In the first half of 2020, the company reported revenues of $257 million and net income of $55 million. GoodRx generated $745.4 million in revenue for the full year 2021, a 35.36% increase over 2020. During the first half of 2021, the company’s share price declined by 10.7%. The decline was attributed to increased competition in online pharmacy services and slower user growth. For 2022, GoodRx reported full-year revenue of $766.6 million, with adjusted EBITDA reaching $213.5 million, exceeding guidance in the fourth quarter. GoodRx reported that 41% of prescriptions filled using its coupons were newly adherent, meaning they would not have been filled without the service. GoodRx reported full-year 2023 revenue of $750.3 million, a decrease of 2.1% from 2022. However, its fourth-quarter revenue increased by 7% year-over-year. GoodRx achieved an Adjusted EBITDA of $217.4 million for the year and an Adjusted EBITDA Margin of 28.6%.
In 2024, GoodRx achieved 6% revenue growth, with $792.3 million for the full year, and turned a net loss into a positive net income of $16.4 million. The company also reported a 32.8% increase in full-year Adjusted EBITDA. In Q2 2025, GoodRx reported revenue of $203.1 million, a 1.2% increase from the previous year, and net income of $12.8 million, a 92% increase, which resulted in a 6.3% net income margin. Prescription transaction revenue declined by 3% due to a decrease in monthly active consumers, but this was offset by 32% growth in its Pharma Manufacturer Solutions business. GoodRx also saw a 7% decrease in subscription revenue. === Mergers and acquisitions === In 2019, GoodRx acquired HeyDoctor, a telemedicine company, to integrate virtual healthcare services into the platform. In 2021, GoodRx acquired HealthiNation, a health video content producer, which helped provide consumers with health information and offered pharmaceutical manufacturers new ways to reach relevant audiences. In April 2022, GoodRx acquired VitaCare Prescription Services from TherapeuticsMD to strengthen its pharma manufacturer solutions business. === Partnerships === In 2017, the company announced partnerships with major pharmaceutical companies to negotiate lower prescription drug costs. GoodRx has relationships with major pharmacy chains, including Walgreens, Walmart, CVS Caremark, and Publix, that allow customers to use GoodRx discounts and Gold benefits. GoodRx began its partnership with CVS Caremark in July 2023 to automatically apply coupons to insured CVS customers purchasing generic prescriptions at certain locations. In April 2024, GoodRx added Publix to its network, allowing GoodRx Gold members to use their cards at Publix Pharmacies. GoodRx partners with pharmacy benefit managers (PBMs) such as Caremark, Express Scripts, and MedImpact to apply its savings directly to eligible insurance plans and members.
GoodRx partners with companies such as Affirm, Benefitfocus, and DoorDash to integrate services that offer members discounts and financial flexibility for prescriptions. GoodRx also partners with organizations such as the American Academy of Family Physicians Foundation to support broader access to care. In October 2022, GoodRx launched Provider Mode, which allows healthcare providers to use the app to compare costs of drugs for patients based on different payment methods and drug alternatives. In 2025, GoodRx partnered with Novo Nordisk to offer discounted cash-pay access to semaglutide products such as Ozempic and Wegovy through its platform and participating pharmacies. == Products and services == GoodRx started its telemedicine service, GoodRx Care, in September 2019. It allows people to consult a licensed provider online for common issues and obtain prescriptions even if they do not have insurance. The company also runs condition-specific subscription plans that bundle online doctor visits, FDA-approved medications, and home delivery into one monthly payment. For weight management, GoodRx offers prescriptions for GLP-1 drugs such as semaglutide through its telemedicine platform. This offering was boosted when the oral version of Wegovy became widely available in the US in early 2026. GoodRx works with drug makers such as Novo Nordisk to make some medications (including semaglutide options) more affordable for people paying cash. The telemedicine business grew after GoodRx acquired HeyDoctor in 2019 and integrated its virtual care tools into the main platform. == Key people == The Santa Monica-based startup was founded in September 2011 by Trevor Bezdek and former Facebook executives Doug Hirsch and Scott Marlette. Marlette was one of the first 20 employees at Facebook and built Facebook's photo application. In 2005, Hirsch was the Vice President of Product at Facebook, working closely with Mark Zuckerberg.
Bezdek and Hirsch served as co-chief executive officers until April 2023, when they stepped down from those roles and technology executive Scott Wagner was appointed interim chief executive officer. Bezdek became chair of the board, while Hirsch took on the role of chief mission officer. In December 2024, GoodRx announced that healthcare executive Wendy Barnes would become president and chief executive officer effective January 1, 2025. As of 2025, Barnes serves as the company’s CEO, while Trevor Bezdek and Scott Wagner serve as co-chairs of the board, and Doug Hirsch remains involved as a co-founder and senior executive. == Controversy == On February 25, 2020, Consumer Reports published an article stating that GoodRx shared user data—specifically, pseudonymized advertising ID numbers that companies use to track the behavior of web users across websites, the names of the drugs that users browsed, and the pharmacies where users sought to fill prescriptions—with Google, Facebook, and around twenty other Internet-based companies. A few days later, GoodRx released a statement saying that it had made changes to prevent user search data on medical conditions and pharmaceuticals from being shared with Facebook. In March 2020, GoodRx stopped sending data about user prescriptions to Facebook. On February 1, 2023, the Federal Trade Commission fined GoodRx US$1.5 million for violations of the Breach Notification Rule and the Federal Trade Commission Act for allegedly failing to obtain specific, informed, and unambiguous consent from users before disclosing health-related information to Facebook and Google. In November 2024, independent pharmacies filed at least three class action lawsuits against GoodRx and major pharmacy benefit managers. The cases, brought by independent pharmacies in California, Michigan, Pennsylvania, and Rhode Island, allege that GoodRx and the PBMs collaborated to suppress reimbursements for generic prescription drugs. 
The suits also allege that the agreements, which used GoodRx’s software, violated the Sherman Antitrust Act, and claim the practices amount to price fixing that harms small pharmacies while benefiting PBMs and their affiliates. GoodRx settled both the 2023 FTC action and the 2025 class action lawsuit without admitting wrongdoing.

    Read more →
  • Ensemble learning

    Ensemble learning

    In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. Unlike a statistical ensemble in statistical mechanics, which is usually infinite, a machine learning ensemble consists of only a concrete finite set of alternative models, but typically allows for much more flexible structure to exist among those alternatives. == Overview == Supervised learning algorithms search through a hypothesis space to find a suitable hypothesis that will make good predictions for a particular problem. Even if this space contains hypotheses that are very well-suited for a particular problem, it may be very difficult to find a good one. Ensembles combine multiple hypotheses to form one which should be theoretically better. Ensemble learning trains two or more machine learning algorithms on a specific classification or regression task. The algorithms within the ensemble model are generally referred to as "base models", "base learners", or "weak learners" in the literature. These base models can be constructed using a single modelling algorithm, or several different algorithms. The idea is to train a diverse set of weak models on the same modelling task, such that the outputs of each weak learner have poor predictive ability (i.e., high bias), and among all weak learners, the outcome and error values exhibit high variance. Fundamentally, an ensemble learning model trains at least two high-bias (weak) and high-variance (diverse) models to be combined into a better-performing model. The set of weak models, which would not produce satisfactory predictive results individually, is combined or averaged to produce a single, high-performing, accurate, and low-variance model to fit the task as required.
Ensemble learning typically refers to bagging (bootstrap aggregating), boosting or stacking/blending techniques to induce high variance among the base models. Bagging creates diversity by generating random samples from the training observations and fitting the same model to each different sample — also known as homogeneous parallel ensembles. Boosting follows an iterative process by sequentially training each base model on the up-weighted errors of the previous base model, producing an additive model to reduce the final model errors — also known as sequential ensemble learning. Stacking or blending consists of different base models, each trained independently (i.e. diverse/high variance) to be combined into the ensemble model — producing a heterogeneous parallel ensemble. Common applications of ensemble learning include random forests (an extension of bagging), Boosted Tree models, and Gradient Boosted Tree Models. Models in applications of stacking are generally more task-specific — such as combining clustering techniques with other parametric and/or non-parametric techniques. Evaluating the prediction of an ensemble typically requires more computation than evaluating the prediction of a single model. In one sense, ensemble learning may be thought of as a way to compensate for poor learning algorithms by performing a lot of extra computation. On the other hand, the alternative is to do a lot more learning with one non-ensemble model. An ensemble may be more efficient at improving overall accuracy for the same increase in compute, storage, or communication resources by using that increase on two or more methods, than would have been improved by increasing resource use for a single method. Fast algorithms such as decision trees are commonly used in ensemble methods (e.g., random forests), although slower algorithms can benefit from ensemble techniques as well. 
By analogy, ensemble techniques have been used also in unsupervised learning scenarios, for example in consensus clustering or in anomaly detection. == Ensemble theory == Empirically, ensembles tend to yield better results when there is a significant diversity among the models. Many ensemble methods, therefore, seek to promote diversity among the models they combine. Although perhaps non-intuitive, more random algorithms (like random decision trees) can be used to produce a stronger ensemble than very deliberate algorithms (like entropy-reducing decision trees). Using a variety of strong learning algorithms, however, has been shown to be more effective than using techniques that attempt to dumb-down the models in order to promote diversity. It is possible to increase diversity in the training stage of the model using correlation for regression tasks or using information measures such as cross entropy for classification tasks. Theoretically, one can justify the diversity concept because the lower bound of the error rate of an ensemble system can be decomposed into accuracy, diversity, and the other term. === The geometric framework === Ensemble learning, including both regression and classification tasks, can be explained using a geometric framework. Within this framework, the output of each individual classifier or regressor for the entire dataset can be viewed as a point in a multi-dimensional space. Additionally, the target result is also represented as a point in this space, referred to as the "ideal point." The Euclidean distance is used as the metric to measure both the performance of a single classifier or regressor (the distance between its point and the ideal point) and the dissimilarity between two classifiers or regressors (the distance between their respective points). This perspective transforms ensemble learning into a deterministic problem. 
For example, within this geometric framework, it can be proved that averaging the outputs (scores) of all base classifiers or regressors leads to results equal to or better than the average of all the individual models. It can also be proved that if the optimal weighting scheme is used, then a weighted averaging approach performs at least as well as the best of the individual classifiers or regressors that make up the ensemble. == Ensemble size == While the number of component classifiers of an ensemble has a great impact on the accuracy of prediction, there is a limited number of studies addressing this problem. Determining the ensemble size a priori, together with the volume and velocity of big data streams, makes this even more crucial for online ensemble classifiers. Statistical tests have mostly been used to determine the proper number of components. More recently, a theoretical framework suggested that there is an ideal number of component classifiers for an ensemble such that having more or fewer than this number of classifiers would deteriorate the accuracy. It is called "the law of diminishing returns in ensemble construction." This theoretical framework shows that using the same number of independent component classifiers as class labels gives the highest accuracy. == Common types of ensembles == === Bayes optimal classifier === The Bayes optimal classifier is a classification technique. It is an ensemble of all the hypotheses in the hypothesis space. On average, no other ensemble can outperform it. The Naive Bayes classifier is a version of this that assumes that the data is conditionally independent given the class, which makes the computation more feasible. Each hypothesis is given a vote proportional to the likelihood that the training dataset would be sampled from a system if that hypothesis were true. To facilitate training data of finite size, the vote of each hypothesis is also multiplied by the prior probability of that hypothesis.
The Bayes optimal classifier can be expressed with the following equation: $y = \underset{c_j \in C}{\mathrm{argmax}} \sum_{h_i \in H} P(c_j | h_i) P(T | h_i) P(h_i)$ where $y$ is the predicted class, $C$ is the set of all possible classes, $H$ is the hypothesis space, $P$ refers to a probability, and $T$ is the training data. As an ensemble, the Bayes optimal classifier represents a hypothesis that is not necessarily in $H$. The hypothesis represented by the Bayes optimal classifier, however, is the optimal hypothesis in ensemble space (the space of all possible ensembles consisting only of hypotheses in $H$). This formula can be restated using Bayes' theorem, which says that the posterior is proportional to the likelihood times the prior: $P(h_i | T) \propto P(T | h_i) P(h_i)$ hence, $y = \underset{c_j \in C}{\mathrm{argmax}} \sum_{h_i \in H} P(c_j | h_i) P(h_i | T)$ === Bootstrap aggregating (bagging) === Bootstrap aggregation (bagging) involves training an ensemble on bootstrapped data sets. A bootstrapped set is cr
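
The Bayes optimal classifier vote given earlier in this excerpt can be sketched in a few lines of Python. The three-hypothesis space, its priors and likelihoods, the class labels, and the function names are all invented for illustration; a real hypothesis space is usually far too large to enumerate this way.

```python
def bayes_optimal_class(classes, hypotheses):
    """Return argmax over c of sum_h P(c | h) * P(T | h) * P(h)."""
    def vote(c):
        return sum(h["p_class"][c] * h["likelihood"] * h["prior"]
                   for h in hypotheses)
    return max(classes, key=vote)

def map_hypothesis(hypotheses):
    """Single hypothesis with the largest (unnormalized) posterior."""
    return max(hypotheses, key=lambda h: h["likelihood"] * h["prior"])

# Hypothetical hypothesis space: equal priors, made-up likelihoods P(T | h).
classes = ["pos", "neg"]
hypotheses = [
    {"prior": 1 / 3, "likelihood": 0.4, "p_class": {"pos": 1.0, "neg": 0.0}},
    {"prior": 1 / 3, "likelihood": 0.3, "p_class": {"pos": 0.0, "neg": 1.0}},
    {"prior": 1 / 3, "likelihood": 0.3, "p_class": {"pos": 0.0, "neg": 1.0}},
]

best_single = map_hypothesis(hypotheses)
single_prediction = max(best_single["p_class"], key=best_single["p_class"].get)
ensemble_prediction = bayes_optimal_class(classes, hypotheses)
```

In this toy case the single most probable hypothesis predicts "pos", while the weighted vote over all hypotheses predicts "neg". This shows how the ensemble can represent a hypothesis that is not itself in $H$.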

    Read more →
  • Mathematics of neural networks in machine learning

    Mathematics of neural networks in machine learning

    An artificial neural network (ANN) or neural network combines biological principles with advanced statistics to solve problems in domains such as pattern recognition and game-play. ANNs adopt the basic model of neuron analogues connected to each other in a variety of ways. == Structure == === Neuron === A neuron with label $j$ receiving an input $p_j(t)$ from predecessor neurons consists of the following components: an activation $a_j(t)$, the neuron's state, depending on a discrete time parameter; an optional threshold $\theta_j$, which stays fixed unless changed by learning; an activation function $f$ that computes the new activation at a given time $t+1$ from $a_j(t)$, $\theta_j$ and the net input $p_j(t)$, giving rise to the relation $a_j(t+1) = f(a_j(t), p_j(t), \theta_j)$; and an output function $f_{\text{out}}$ computing the output from the activation, $o_j(t) = f_{\text{out}}(a_j(t))$. Often the output function is simply the identity function. An input neuron has no predecessor but serves as input interface for the whole network. Similarly an output neuron has no successor and thus serves as output interface of the whole network. === Propagation function === The propagation function computes the input $p_j(t)$ to the neuron $j$ from the outputs $o_i(t)$ and typically has the form $p_j(t) = \sum_i o_i(t) w_{ij}$. 
=== Bias === A bias term can be added, changing the form to the following: $p_j(t) = \sum_i o_i(t) w_{ij} + w_{0j}$, where $w_{0j}$ is a bias. == Neural networks as functions == Neural network models can be viewed as defining a function that takes an input (observation) and produces an output (decision) $f : X \rightarrow Y$ or a distribution over $X$ or both $X$ and $Y$. Sometimes models are intimately associated with a particular learning rule. A common use of the phrase "ANN model" is really the definition of a class of such functions (where members of the class are obtained by varying parameters, connection weights, or specifics of the architecture such as the number of neurons, number of layers or their connectivity). Mathematically, a neuron's network function $f(x)$ is defined as a composition of other functions $g_i(x)$, that can further be decomposed into other functions. This can be conveniently represented as a network structure, with arrows depicting the dependencies between functions. A widely used type of composition is the nonlinear weighted sum, where $f(x) = K\left(\sum_i w_i g_i(x)\right)$, where $K$ (commonly referred to as the activation function) is some predefined function, such as the hyperbolic tangent, sigmoid function, softmax function, or rectifier function. The important characteristic of the activation function is that it provides a smooth transition as input values change, i.e. a small change in input produces a small change in output. 
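As a minimal sketch of the propagation function with bias and of the nonlinear weighted sum, the following uses plain Python lists. The function names `net_input`, `neuron_output` and `sigmoid` are illustrative choices, not taken from the text, and the hyperbolic tangent is assumed as the default activation $K$.

```python
import math


def net_input(outputs, weights, bias=0.0):
    """Propagation function with bias: p_j = sum_i o_i * w_ij + w_0j."""
    return sum(o * w for o, w in zip(outputs, weights)) + bias


def sigmoid(z):
    """Logistic sigmoid, one of the activation functions named in the text."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(outputs, weights, bias=0.0, activation=math.tanh):
    """Nonlinear weighted sum: K applied to the weighted sum of the inputs."""
    return activation(net_input(outputs, weights, bias))
```

For example, `neuron_output([1.0, 2.0], [0.5, -0.25], bias=0.1, activation=sigmoid)` applies the sigmoid to a net input of 0.1. Because both activations are smooth, nudging any input slightly changes the output only slightly.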
The following refers to a collection of functions $g_i$ as a vector $g = (g_1, g_2, \ldots, g_n)$. This figure depicts such a decomposition of $f$, with dependencies between variables indicated by arrows. These can be interpreted in two ways. The first view is the functional view: the input $x$ is transformed into a 3-dimensional vector $h$, which is then transformed into a 2-dimensional vector $g$, which is finally transformed into $f$. This view is most commonly encountered in the context of optimization. The second view is the probabilistic view: the random variable $F = f(G)$ depends upon the random variable $G = g(H)$, which depends upon $H = h(X)$, which depends upon the random variable $X$. This view is most commonly encountered in the context of graphical models. The two views are largely equivalent. In either case, for this particular architecture, the components of individual layers are independent of each other (e.g., the components of $g$ are independent of each other given their input $h$). This naturally enables a degree of parallelism in the implementation. Networks such as the previous one are commonly called feedforward, because their graph is a directed acyclic graph. Networks with cycles are commonly called recurrent. Such networks are commonly depicted in the manner shown at the top of the figure, where $f$ is shown as dependent upon itself. However, an implied temporal dependence is not shown. 
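The functional view of the feedforward decomposition can be sketched as three composed layers. Everything concrete here is an assumption for illustration: a scalar input $x$, tanh activations, and the hypothetical weight tables `W_H`, `W_G` and `W_F`; only the 3-dimensional and 2-dimensional layer sizes come from the text.

```python
import math


def layer(inputs, weight_rows, activation=math.tanh):
    """Apply one layer: each weight row yields one output component.

    Every component depends only on the layer's input, not on its sibling
    components, which is why a layer can be evaluated in parallel.
    """
    return [
        activation(sum(w * v for w, v in zip(row, inputs)))
        for row in weight_rows
    ]


# Hypothetical weights chosen only to fix the shapes.
W_H = [[0.5], [-1.0], [2.0]]                  # x (1 value)  -> h (3 values)
W_G = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]     # h (3 values) -> g (2 values)
W_F = [[1.0, -1.0]]                           # g (2 values) -> f (1 value)


def f(x):
    """Compose the layers: x -> h -> g -> f, with no cycles."""
    h = layer([x], W_H)
    g = layer(h, W_G)
    return layer(g, W_F)[0]
```

Since each call only consumes the output of the previous layer, the dependency graph is acyclic, which is what makes this a feedforward network rather than a recurrent one.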
== Backpropagation == Backpropagation training algorithms fall into three categories: steepest descent (with variable learning rate and momentum, resilient backpropagation); quasi-Newton (Broyden–Fletcher–Goldfarb–Shanno, one step secant); Levenberg–Marquardt and conjugate gradient (Fletcher–Reeves update, Polak–Ribière update, Powell–Beale restart, scaled conjugate gradient). === Algorithm === Let $N$ be a network with $e$ connections, $m$ inputs and $n$ outputs. Below, $x_1, x_2, \dots$ denote vectors in $\mathbb{R}^m$, $y_1, y_2, \dots$ vectors in $\mathbb{R}^n$, and $w_0, w_1, w_2, \ldots$ vectors in $\mathbb{R}^e$. These are called inputs, outputs and weights, respectively. The network corresponds to a function $y = f_N(w, x)$ which, given a weight $w$, maps an input $x$ to an output $y$. In supervised learning, a sequence of training examples $(x_1, y_1), \dots, (x_p, y_p)$ produces a sequence of weights $w_0, w_1, \dots, w_p$ starting from some initial weight $w_0$, usually chosen at random. These weights are computed in turn: first compute $w_i$ using only $(x_i, y_i, w_{i-1})$ for $i = 1, \dots, p$. The output of the algorithm is then $w_p$, giving a new function $x \mapsto f_N(w_p, x)$. The computation is the same in each step, hence only the case $i = 1$ is described. 
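The sequential scheme, in which each weight $w_i$ is obtained from $(x_i, y_i, w_{i-1})$ alone, can be sketched as a simple loop. This is a simplification under stated assumptions: `f_N` is a hypothetical stand-in network with a single linear output, the error is the squared difference, and each step runs a fixed number of plain gradient-descent iterations with an arbitrarily chosen learning rate.

```python
def f_N(w, x):
    """Hypothetical stand-in network: one linear output, y = w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def step(w_prev, x, y, rate=0.1, iters=50):
    """Compute w_i from (x_i, y_i, w_{i-1}) by gradient descent on squared error."""
    w = list(w_prev)
    for _ in range(iters):
        err = f_N(w, x) - y
        # The derivative of err**2 with respect to w_k is 2 * err * x_k.
        w = [wk - rate * 2.0 * err * xk for wk, xk in zip(w, x)]
    return w


def train(examples, w0):
    """Return the whole sequence [w_0, w_1, ..., w_p]."""
    weights = [list(w0)]
    for x, y in examples:
        # Only the current example and the previous weight are used.
        weights.append(step(weights[-1], x, y))
    return weights
```

Calling `train([([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0)], [0.0, 0.0])` returns three weight vectors; the last one is the output of the procedure. With this step size the iteration is only stable for inputs of small norm, which is one reason practical training uses tuned or adaptive learning rates.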
$w_1$ is calculated from $(x_1, y_1, w_0)$ by considering a variable weight $w$ and applying gradient descent to the function $w \mapsto E(f_N(w, x_1), y_1)$ to find a local minimum, starting at $w = w_0$. This makes $w_1$ the minimizing weight found by gradient descent. == Learning pseudocode == To implement the algorithm above, explicit formulas are required for the gradient of the function $w \mapsto E(f_N(w, x), y)$ where the function is $E(y, y') = |y - y'|^2$. The learning algorithm can be divided into two phases: propagation and weight update. === Propagation === Propagation involves the following steps: propagation forward through the network to generate the output value(s); calculation of the cost (error term); propagation of the output activations back through the network using the training pattern target to generate the deltas (the difference between the targeted and actual output values) of all output and hidden neurons. === Weight update === For each weight: multiply the weight's output delta and input activation to find the gradient of the weight; subtract the ratio (percentage) of the weight's gradient from the weight. The learning rate is the ratio (percentage) that influences the speed and quality of learning. The greater the ratio, the faster the neuron trains, but the lower the ratio, the more accurat

    Read more →
  • Probably approximately correct learning

    Probably approximately correct learning

    In computational learning theory, probably approximately correct (PAC) learning is a framework for mathematical analysis of machine learning. It was proposed in 1984 by Leslie Valiant. In this framework, the learner receives samples and must select a generalization function (called the hypothesis) from a certain class of possible functions. The goal is that, with high probability (the "probably" part), the selected function will have low generalization error (the "approximately correct" part). The learner must be able to learn the concept given any arbitrary approximation ratio, probability of success, or distribution of the samples. The model was later extended to treat noise (misclassified samples). An important innovation of the PAC framework is the introduction of computational complexity theory concepts to machine learning. In particular, the learner is expected to find efficient functions (time and space requirements bounded to a polynomial of the example size), and the learner itself must implement an efficient procedure (requiring an example count bounded to a polynomial of the concept size, modified by the approximation and likelihood bounds). == Definitions and terminology == In order to give the definition for something that is PAC-learnable, we first have to introduce some terminology. For the following definitions, two examples will be used. The first is the problem of character recognition given an array of $n$ bits encoding a binary-valued image. The other example is the problem of finding an interval that will correctly classify points within the interval as positive and the points outside of the range as negative. Let $X$ be a set called the instance space or the encoding of all the samples. In the character recognition problem, the instance space is $X = \{0,1\}^n$. 
In the interval problem the instance space, $X$, is the set of all bounded intervals in $\mathbb{R}$, where $\mathbb{R}$ denotes the set of all real numbers. A concept is a subset $c \subset X$. One concept is the set of all patterns of bits in $X = \{0,1\}^n$ that encode a picture of the letter "P". An example concept from the second example is the set of open intervals, $\{(a,b) \mid 0 \leq a \leq \pi/2, \pi \leq b \leq \sqrt{13}\}$, each of which contains only the positive points. A concept class $C$ is a collection of concepts over $X$. This could be the set of all subsets of the array of bits that are skeletonized 4-connected (width of the font is 1). Let $\operatorname{EX}(c, D)$ be a procedure that draws an example, $x$, using a probability distribution $D$ and gives the correct label $c(x)$, that is 1 if $x \in c$ and 0 otherwise. Now, given $0 < \epsilon, \delta < 1$, assume there is an algorithm $A$ and a polynomial $p$ in $1/\epsilon, 1/\delta$ (and other relevant parameters of the class $C$) such that, given a sample of size $p$ drawn according to $\operatorname{EX}(c, D)$, then, with probability of at least $1 - \delta$, $A$ outputs a hypothesis $h \in C$ that has an average error less than or equal to $\epsilon$ on $X$ with the same distribution $D$. 
Further, if the above statement for algorithm $A$ is true for every concept $c \in C$ and for every distribution $D$ over $X$, and for all $0 < \epsilon, \delta < 1$, then $C$ is (efficiently) PAC learnable (or distribution-free PAC learnable). We can also say that $A$ is a PAC learning algorithm for $C$. == Equivalence == Under some regularity conditions these conditions are equivalent: The concept class $C$ is PAC learnable. The VC dimension of $C$ is finite. $C$ is a uniformly Glivenko-Cantelli class. $C$ is compressible in the sense of Littlestone and Warmuth.
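The definition of a PAC learning algorithm can be illustrated with a simplified version of the interval problem. The setup below is an assumption made for the sketch and is not taken from the text: instances are points drawn uniformly from $[0, 1]$, the hidden concept is a closed interval, and the learner returns the tightest interval around the positive examples. For this learner, the textbook sample bound $m \geq (2/\epsilon)\ln(2/\delta)$ is used; all function names are hypothetical.

```python
import math
import random


def sample_size(eps, delta):
    """Sample count m >= (2 / eps) * ln(2 / delta) for the tightest-fit learner."""
    return math.ceil((2.0 / eps) * math.log(2.0 / delta))


def learn_interval(sample):
    """Return the tightest interval containing all positive points, or None."""
    positives = [x for x, label in sample if label == 1]
    if not positives:
        return None
    return (min(positives), max(positives))


def true_error(hyp, target):
    """Probability mass, under the uniform distribution, where hyp and target disagree."""
    a, b = target
    if hyp is None:
        return b - a  # every positive point is misclassified
    lo, hi = hyp
    return (lo - a) + (b - hi)  # hyp always lies inside the target


def failure_rate(target, eps, delta, trials, rng):
    """Fraction of independent runs whose hypothesis has error above eps."""
    m = sample_size(eps, delta)
    failures = 0
    for _ in range(trials):
        xs = [rng.random() for _ in range(m)]
        sample = [(x, 1 if target[0] <= x <= target[1] else 0) for x in xs]
        if true_error(learn_interval(sample), target) > eps:
            failures += 1
    return failures / trials
```

Running `failure_rate((0.3, 0.7), 0.1, 0.05, 200, random.Random(0))` estimates how often the learner misses the accuracy target; the definition requires this to happen with probability at most $\delta$. The sample size depends only on $\epsilon$ and $\delta$, not on the target interval.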

    Read more →
  • VideoPoet

    VideoPoet

    VideoPoet is a large language model developed by Google Research in 2023 for video generation. It can be asked to animate still images. The model accepts text, images, and videos as inputs, with a plan to add support for generating content in any format from any input. VideoPoet was publicly announced on December 19, 2023. It uses an autoregressive language model.

    Read more →
  • Andrej Mrvar

    Andrej Mrvar

    Andrej Mrvar is a Slovenian computer scientist and a professor at the University of Ljubljana's Faculty of Social Sciences. He is known for his work in network analysis, graph drawing, decision making, virtual reality, timing and data processing of sports competitions. == Education and career == He is well known for his work on Pajek, a free software for analysis and visualization of large networks. Mrvar began work on Pajek in 1996 with Vladimir Batagelj. His book Exploratory Social Network Analysis with Pajek, coauthored with Wouter de Nooy and Vladimir Batagelj, is his most cited work. It was published by Cambridge University Press in three editions (first 2005, second 2011, and third 2018). The book was translated into Japanese (2009) and Chinese (first edition 2012, second 2014). With Anuška Ferligoj, he was a founding co-editor-in-chief of the Metodološki zvezki - Advances in Methodology and Statistics journal. == Awards and honors == Vidmar Award (Faculty of Electrical and Computer Engineering, University of Ljubljana): 1988, 1990 First prizes for contributions (with Vladimir Batagelj) to Graph Drawing Contests in years: 1995, 1996, 1997, 1998, 1999, 2000 and 2005 / Graph Drawing Hall of Fame. Award of University of Ljubljana for contributions in education and research (Svečana listina Univerze v Ljubljani za pomembne dosežke na področju vzgojnoizobraževalnega in znanstvenoraziskovalega dela): 2001 The INSNA's William D. Richards Software award for work on Pajek (with Vladimir Batagelj): 2013 Award of Faculty of Social Sciences, University of Ljubljana for scientific excellence (Priznanje za znanstveno odličnost): 2013 == Selected publications == Wouter de Nooy, Andrej Mrvar, Vladimir Batagelj, Mark Granovetter (Series Editor), Exploratory Social Network Analysis with Pajek (Structural Analysis in the Social Sciences), Cambridge University Press (First Edition: 2005, Second Edition: 2011, Third Edition: 2018 ). Japanese Translation (2010). 
Chinese Translation (First Edition: 2012, Second Edition: 2014) Andrej Mrvar and Vladimir Batagelj, Analysis and visualization of large networks with program package Pajek. Complex Adaptive Systems Modeling, 4:6. SpringerOpen, 2016 Vladimir Batagelj and Andrej Mrvar, Some Analyses of Erdős Collaboration Graph, Social Networks, 22, 173–186, 2000 Vladimir Batagelj and Andrej Mrvar, A Subquadratic Triad Census Algorithm for Large Sparse Networks with Small Maximum Degree. Social Networks, 23, 237–243, 2001 Patrick Doreian and Andrej Mrvar, A Partitioning Approach to Structural Balance, Social Networks, 18, 149–168, 1996 Patrick Doreian and Andrej Mrvar, Partitioning Signed Social Networks, Social Networks, 31, 1–11, 2009 Andrej Mrvar and Patrick Doreian, Partitioning Signed Two-Mode Networks, Journal of Mathematical Sociology, 33, 196–221, 2009 Patrick Doreian and Andrej Mrvar, The international reach of the Koch brothers network. In: Antonyuk, A. and Basov, N. (Eds.): Networks in the Global World V. NetGloW 2020. Lecture Notes in Networks and Systems, 181, 225–235. Springer, 2021 Patrick Doreian and Andrej Mrvar, Delineating Changes in the Fundamental Structure of Signed Networks, Frontiers in Physics, 294, 1–11, 2021 Patrick Doreian and Andrej Mrvar, Hubs and Authorities in the Koch Brothers Network. Social Networks, 64, 148–157, 2021 Patrick Doreian and Andrej Mrvar, Public issues, policy proposals, social movements, and the interests of the Koch Brothers network of allies, Quality and Quantity, 56, 305–322, 2022 Douglas R. White, Vladimir Batagelj, Andrej Mrvar, Analyzing Large Kinship and Marriage Networks with Pgraph and Pajek. Social Science Computer Review, 17, 245–274, 1999 Ion Georgiou, Ronald Concer, Andrej Mrvar, A Systemic Approach to Sociometric Group Research: Advancing The Work of Leslie Day Zeleny, 1939–1947, Social Networks, 63, 174–200, 2020

    Read more →
  • Constructing skill trees

    Constructing skill trees

    Constructing skill trees (CST) is a hierarchical reinforcement learning algorithm which can build skill trees from a set of sample solution trajectories obtained from demonstration. CST uses an incremental MAP (maximum a posteriori) change point detection algorithm to segment each demonstration trajectory into skills and integrate the results into a skill tree. CST was introduced by George Konidaris, Scott Kuindersma, Andrew Barto and Roderic Grupen in 2010. == Algorithm == CST consists of three main parts: change point detection, alignment and merging. The main focus of CST is online change-point detection. The change-point detection algorithm is used to segment data into skills and uses the sum of discounted reward $R_t$ as the target regression variable. Each skill is assigned an appropriate abstraction. A particle filter is used to control the computational complexity of CST. The change point detection algorithm is implemented as follows. The data for times $t \in T$ and models $Q$ with prior $p(q \in Q)$ are given. The algorithm is assumed to be able to fit a segment from time $j+1$ to $t$ using model $q$ with the fit probability $P(j,t,q)$. A linear regression model with Gaussian noise is used to compute $P(j,t,q)$. The Gaussian noise prior has mean zero, and variance which follows $\mathrm{InverseGamma}\left(\frac{v}{2}, \frac{u}{2}\right)$. The prior for each weight follows $\mathrm{Normal}(0, \sigma^2 \delta)$. The fit probability $P(j,t,q)$ is computed by the following equation. 
$P(j,t,q) = \frac{\pi^{-\frac{n}{2}}}{\delta^m} \left|(A+D)^{-1}\right|^{\frac{1}{2}} \frac{u^{\frac{v}{2}}}{(y+u)^{\frac{u+v}{2}}} \frac{\Gamma\left(\frac{n+v}{2}\right)}{\Gamma\left(\frac{v}{2}\right)}$ Then, CST computes the probability of the changepoint at time $j$ with model $q$, $P_t(j,q)$, and $P_j^{\text{MAP}}$ using a Viterbi algorithm. $P_t(j,q) = (1 - G(t-j-1)) P(j,t,q) p(q) P_j^{\text{MAP}}$ $P_j^{\text{MAP}} = \max_{i,q} \frac{P_j(i,q) g(j-i)}{1 - G(j-i-1)}, \forall j < t$

    Read more →