AI Detector And Fixer

AI Detector And Fixer — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Similarity learning

    Similarity learning

    Similarity learning is an area of supervised machine learning in artificial intelligence. It is closely related to regression and classification, but the goal is to learn a similarity function that measures how similar or related two objects are. It has applications in ranking, in recommendation systems, visual identity tracking, face verification, and speaker verification. == Learning setup == There are four common setups for similarity and metric distance learning. Regression similarity learning In this setup, pairs of objects are given ( x i 1 , x i 2 ) {\displaystyle (x_{i}^{1},x_{i}^{2})} together with a measure of their similarity y i ∈ R {\displaystyle y_{i}\in R} . The goal is to learn a function that approximates f ( x i 1 , x i 2 ) ∼ y i {\displaystyle f(x_{i}^{1},x_{i}^{2})\sim y_{i}} for every new labeled triplet example ( x i 1 , x i 2 , y i ) {\displaystyle (x_{i}^{1},x_{i}^{2},y_{i})} . This is typically achieved by minimizing a regularized loss min W ∑ i l o s s ( w ; x i 1 , x i 2 , y i ) + r e g ( w ) {\displaystyle \min _{W}\sum _{i}loss(w;x_{i}^{1},x_{i}^{2},y_{i})+reg(w)} . Classification similarity learning Given are pairs of similar objects ( x i , x i + ) {\displaystyle (x_{i},x_{i}^{+})} and non similar objects ( x i , x i − ) {\displaystyle (x_{i},x_{i}^{-})} . An equivalent formulation is that every pair ( x i 1 , x i 2 ) {\displaystyle (x_{i}^{1},x_{i}^{2})} is given together with a binary label y i ∈ { 0 , 1 } {\displaystyle y_{i}\in \{0,1\}} that determines if the two objects are similar or not. The goal is again to learn a classifier that can decide if a new pair of objects is similar or not. Ranking similarity learning Given are triplets of objects ( x i , x i + , x i − ) {\displaystyle (x_{i},x_{i}^{+},x_{i}^{-})} whose relative similarity obey a predefined order: x i {\displaystyle x_{i}} is known to be more similar to x i + {\displaystyle x_{i}^{+}} than to x i − {\displaystyle x_{i}^{-}} . The goal is to learn a function f {\displaystyle f} such that for any new triplet of objects ( x , x + , x − ) {\displaystyle (x,x^{+},x^{-})} , it obeys f ( x , x + ) > f ( x , x − ) {\displaystyle f(x,x^{+})>f(x,x^{-})} (contrastive learning). This setup assumes a weaker form of supervision than in regression, because instead of providing an exact measure of similarity, one only has to provide the relative order of similarity. For this reason, ranking-based similarity learning is easier to apply in real large-scale applications. Locality sensitive hashing (LSH) Hashes input items so that similar items map to the same "buckets" in memory with high probability (the number of buckets being much smaller than the universe of possible input items). It is often applied in nearest neighbor search on large-scale high-dimensional data, e.g., image databases, document collections, time-series databases, and genome databases. A common approach for learning similarity is to model the similarity function as a bilinear form. For example, in the case of ranking similarity learning, one aims to learn a matrix W that parametrizes the similarity function f W ( x , z ) = x T W z {\displaystyle f_{W}(x,z)=x^{T}Wz} . When data is abundant, a common approach is to learn a siamese network – a deep network model with parameter sharing. == Metric learning == Similarity learning is closely related to distance metric learning. Metric learning is the task of learning a distance function over objects. A metric or distance function has to obey four axioms: non-negativity, identity of indiscernibles, symmetry and subadditivity (or the triangle inequality). In practice, metric learning algorithms ignore the condition of identity of indiscernibles and learn a pseudo-metric. When the objects x i {\displaystyle x_{i}} are vectors in R d {\displaystyle R^{d}} , then any matrix W {\displaystyle W} in the symmetric positive semi-definite cone S + d {\displaystyle S_{+}^{d}} defines a distance pseudo-metric of the space of x through the form D W ( x 1 , x 2 ) 2 = ( x 1 − x 2 ) ⊤ W ( x 1 − x 2 ) {\displaystyle D_{W}(x_{1},x_{2})^{2}=(x_{1}-x_{2})^{\top }W(x_{1}-x_{2})} . When W {\displaystyle W} is a symmetric positive definite matrix, D W {\displaystyle D_{W}} is a metric. Moreover, as any symmetric positive semi-definite matrix W ∈ S + d {\displaystyle W\in S_{+}^{d}} can be decomposed as W = L ⊤ L {\displaystyle W=L^{\top }L} where L ∈ R e × d {\displaystyle L\in R^{e\times d}} and e ≥ r a n k ( W ) {\displaystyle e\geq rank(W)} , the distance function D W {\displaystyle D_{W}} can be rewritten equivalently D W ( x 1 , x 2 ) 2 = ( x 1 − x 2 ) ⊤ L ⊤ L ( x 1 − x 2 ) = ‖ L ( x 1 − x 2 ) ‖ 2 2 {\displaystyle D_{W}(x_{1},x_{2})^{2}=(x_{1}-x_{2})^{\top }L^{\top }L(x_{1}-x_{2})=\|L(x_{1}-x_{2})\|_{2}^{2}} . The distance D W ( x 1 , x 2 ) 2 = ‖ x 1 ′ − x 2 ′ ‖ 2 2 {\displaystyle D_{W}(x_{1},x_{2})^{2}=\|x_{1}'-x_{2}'\|_{2}^{2}} corresponds to the Euclidean distance between the transformed feature vectors x 1 ′ = L x 1 {\displaystyle x_{1}'=Lx_{1}} and x 2 ′ = L x 2 {\displaystyle x_{2}'=Lx_{2}} . Many formulations for metric learning have been proposed. Some well-known approaches for metric learning include learning from relative comparisons, which is based on the triplet loss, large margin nearest neighbor, and information theoretic metric learning (ITML). In statistics, the covariance matrix of the data is sometimes used to define a distance metric called Mahalanobis distance. == Applications == Similarity learning is used in information retrieval for learning to rank, in face verification or face identification, and in recommendation systems. Also, many machine learning approaches rely on some metric. This includes unsupervised learning such as clustering, which groups together close or similar objects. It also includes supervised approaches like K-nearest neighbor algorithm which rely on labels of nearby objects to decide on the label of a new object. Metric learning has been proposed as a preprocessing step for many of these approaches. == Scalability == Metric and similarity learning scale quadratically with the dimension of the input space, as can easily see when the learned metric has a bilinear form f W ( x , z ) = x T W z {\displaystyle f_{W}(x,z)=x^{T}Wz} . Scaling to higher dimensions can be achieved by enforcing a sparseness structure over the matrix model, as done with HDSL, and with COMET. == Software == metric-learn is a free software Python library which offers efficient implementations of several supervised and weakly-supervised similarity and metric learning algorithms. The API of metric-learn is compatible with scikit-learn. OpenMetricLearning is a Python framework to train and validate the models producing high-quality embeddings. == Further information == For further information on this topic, see the surveys on metric and similarity learning by Bellet et al. and Kulis.

    Read more →
  • FERET (facial recognition technology)

    FERET (facial recognition technology)

    The Facial Recognition Technology (FERET) program was a government-sponsored project that aimed to create a large, automatic face-recognition system for intelligence, security, and law enforcement purposes. The program began in 1993 under the combined leadership of Dr. Harry Wechsler at George Mason University (GMU) and Dr. Jonathon Phillips at the Army Research Laboratory (ARL) in Adelphi, Maryland and resulted in the development of the Facial Recognition Technology (FERET) database. The goal of the FERET program was to advance the field of face recognition technology by establishing a common database of facial imagery for researchers to use and setting a performance baseline for face-recognition algorithms. Potential areas where this face-recognition technology could be used include: Automated searching of mug books using surveillance photos Controlling access to restricted facilities or equipment Checking the credentials of personnel for background and security clearances Monitoring airports, border crossings, and secure manufacturing facilities for particular individuals Finding and logging multiple appearances of individuals over time in surveillance videos Verifying identities at ATM machines Searching photo ID records for fraud detection The FERET database has been used by more than 460 research groups and is currently managed by the National Institute of Standards and Technology (NIST). By 2017, the FERET database has been used to train artificial intelligence programs and computer vision algorithms to identify and sort faces. == History == The origin of facial recognition technology is largely attributed to Woodrow Wilson Bledsoe and his work in the 1960s, when he developed a system to identify faces from a database of thousands of photographs. The FERET program first began as a way to unify a large body of face-recognition technology research under a standard database. Before the program's inception, most researchers created their own facial imagery database that was attuned to their own specific area of study. These personal databases were small and usually consisted of images from less than 50 individuals. The only notable exceptions were the following: Alex Pentland’s database of around 7500 facial images at the Massachusetts Institute of Technology (MIT) Joseph Wilder's database of around 250 individuals at Rutgers University Christoph von der Malsburg’s database of around 100 facial images at the University of Southern California (USC) The lack of a common database made it difficult to compare the results of face recognition studies in the scientific literature because each report involved different assumptions, scoring methods, and images. Most of the papers that were published did not use images from a common database nor follow a standard testing protocol. As a result, researchers were unable to make informed comparisons between the performances of different face-recognition algorithms. In September 1993, the FERET program was spearheaded by Dr. Harry Wechsler and Dr. Jonathon Phillips under the sponsorship of the U.S. Department of Defense Counterdrug Technology Development Program through DARPA with ARL serving as technical agent. === Phase I === The first facial images for the FERET database were collected from August 1993 to December 1994, a time period known as Phase I. The pictures were initially taken with a 35-mm camera at both GMU and ARL facilities, and the same physical setup was used in each photography session to keep the images consistent. For each individual, the pictures were taken in sets, including two frontal views, a right and left profile, a right and left quarter profile, a right and left half profile, and sometimes at five extra locations. Therefore, a set of images consisted of 5 to 11 images per person. At the end of Phase I, the FERET database had collected 673 sets of images, resulting in over 5000 total images. At the end of Phase I, five organizations were given the opportunity to test their face-recognition algorithm on the newly created FERET database in order to compare how they performed against each other. There five principal investigators were: MIT, led by Alex Pentland Rutgers University, led by Joseph Wilder The Analytic Science Company (TASC), led by Gale Gordon The University of Illinois at Chicago (UIC) and the University of Illinois at Urbana-Champaign, led by Lewis Sadler and Thomas Huang USC, led by Christoph von der Malsburg During this evaluation, three different automatic tests were given to the principal investigators without human intervention: The large gallery test, which served to baseline how algorithms performed against a database when it has not been properly tuned. The false-alarm test, which tested how well the algorithm monitored an airport for suspected terrorists. The rotation test, which measured how well the algorithm performed when the images of an individual in the gallery had different poses compared to those in the probe set. For most of the test trials, the algorithms developed by USC and MIT managed to outperform the other three algorithms for the Phase I evaluation. === Phase II === Phase II began after Phase I, and during this time, the FERET database acquired more sets of facial images. By the start of the Phase II evaluation in March 1995, the database contained 1109 sets of images for a total of 8525 images of 884 individuals. During the second evaluation, the same algorithms from the Phase I evaluation were given a single test. However, the database now contained significantly more duplicate images (463, compared to the previous 60), making the test more challenging. === Phase III === Afterwards, the FERET program entered Phase III where another 456 sets of facial images were added to the database. The Phase III evaluation, which took place in September 1996, aimed to not only gauge the progress of the algorithms since the Phase I assessment but also identify the strengths and weaknesses of each algorithm and determine future objectives for research. By the end of 1996, the FERET database had accumulated a total of 14,126 facial images pertaining to 1199 different individuals as well as 365 duplicate sets of images. As a result of the FERET program, researchers were able to establish a common baseline for comparing different face-recognition algorithms and create a large standard database of facial images that is open for research. In 2003, DARPA released a high-resolution, 24-bit color version of the images in the FERET database (existing reference).

    Read more →
  • Calibration (statistics)

    Calibration (statistics)

    There are two main uses of the term calibration in statistics that denote special types of statistical inference problems. Calibration can mean a reverse process to regression, where instead of a future dependent variable being predicted from known explanatory variables, a known observation of the dependent variables is used to predict a corresponding explanatory variable; procedures in statistical classification to determine class membership probabilities which assess the uncertainty of a given new observation belonging to each of the already established classes. In addition, calibration is used in statistics with the usual general meaning of calibration. For example, model calibration can be also used to refer to Bayesian inference about the value of a model's parameters, given some data set, or more generally to any type of fitting of a statistical model. As Philip Dawid puts it, "a forecaster is well calibrated if, for example, of those events to which he assigns a probability 30 percent, the long-run proportion that actually occurs turns out to be 30 percent." == In classification == Calibration in classification means transforming classifier scores into class membership probabilities. An overview of calibration methods for two-class and multi-class classification tasks is given by Gebel (2009). A classifier might separate the classes well, but be poorly calibrated, meaning that the estimated class probabilities are far from the true class probabilities. In this case, a calibration step may help improve the estimated probabilities. A variety of metrics exist that are aimed to measure the extent to which a classifier produces well-calibrated probabilities. Foundational work includes the Expected Calibration Error (ECE). Into the 2020s, variants include the Adaptive Calibration Error (ACE) and the Test-based Calibration Error (TCE), which address limitations of the ECE metric that may arise when classifier scores concentrate on narrow subset of the [0,1] range. A 2020s advancement in calibration assessment is the introduction of the Estimated Calibration Index (ECI). The ECI extends the concepts of the Expected Calibration Error (ECE) to provide a more nuanced measure of a model's calibration, particularly addressing overconfidence and underconfidence tendencies. Originally formulated for binary settings, the ECI has been adapted for multiclass settings, offering both local and global insights into model calibration. This framework aims to overcome some of the theoretical and interpretative limitations of existing calibration metrics. Through a series of experiments, Famiglini et al. demonstrate the framework's effectiveness in delivering a more accurate understanding of model calibration levels and discuss strategies for mitigating biases in calibration assessment. An online tool has been proposed to compute both ECE and ECI. The following univariate calibration methods exist for transforming classifier scores into class membership probabilities in the two-class case: Assignment value approach, see Garczarek (2002) Bayes approach, see Bennett (2002) Isotonic regression, see Zadrozny and Elkan (2002) Platt scaling (a form of logistic regression), see Lewis and Gale (1994) and Platt (1999) Bayesian Binning into Quantiles (BBQ) calibration, see Naeini, Cooper, Hauskrecht (2015) Beta calibration, see Kull, Filho, Flach (2017) === In probability prediction and forecasting === In prediction and forecasting, a Brier score is sometimes used to assess prediction accuracy of a set of predictions, specifically that the magnitude of the assigned probabilities track the relative frequency of the observed outcomes. Philip E. Tetlock employs the term "calibration" in this sense in his 2015 book Superforecasting. This differs from accuracy and precision. For example, as expressed by Daniel Kahneman, "if you give all events that happen a probability of .6 and all the events that don't happen a probability of .4, your discrimination is perfect but your calibration is miserable". In meteorology, in particular, as concerns weather forecasting, a related mode of assessment is known as forecast skill. == In regression == The calibration problem in regression is the use of known data on the observed relationship between a dependent variable and an independent variable to make estimates of other values of the independent variable from new observations of the dependent variable. This can be known as "inverse regression"; there is also sliced inverse regression. The following multivariate calibration methods exist for transforming classifier scores into class membership probabilities in the case with classes count greater than two: Reduction to binary tasks and subsequent pairwise coupling, see Hastie and Tibshirani (1998) Dirichlet calibration, see Gebel (2009) === Example === One example is that of dating objects, using observable evidence such as tree rings for dendrochronology or carbon-14 for radiometric dating. The observation is caused by the age of the object being dated, rather than the reverse, and the aim is to use the method for estimating dates based on new observations. The problem is whether the model used for relating known ages with observations should aim to minimise the error in the observation, or minimise the error in the date. The two approaches will produce different results, and the difference will increase if the model is then used for extrapolation at some distance from the known results.

    Read more →
  • Homogeneity blockmodeling

    Homogeneity blockmodeling

    In mathematics applied to analysis of social structures, homogeneity blockmodeling is an approach in blockmodeling, which is best suited for a preliminary or main approach to valued networks, when a prior knowledge about these networks is not available. This is because homogeneity blockmodeling emphasizes the similarity of link (tie) strengths within the blocks over the pattern of links. In this approach, tie (link) values (or statistical data computed on them) are assumed to be equal (homogenous) within blocks. This approach to the generalized blockmodeling of valued networks was first proposed by Aleš Žiberna in 2007 with the basic idea, "that the inconsistency of an empirical block with its ideal block can be measured by within block variability of appropriate values". The newly–formed ideal blocks, which are appropriate for blockmodeling of valued networks, are then presented together with the definitions of their block inconsistencies. Similar approach to the homogeneity blockmodeling, dealing with direct approach for structural equivalence, was previously suggested by Stephen P. Borgatti and Martin G. Everett (1992).

    Read more →
  • Google Clips

    Google Clips

    Google Clips is a discontinued miniature clip-on camera device developed by Google. == History == It was announced on October 4, 2017 and went on sale on January 27, 2018. Google Clips automatically captured video clips (without audio) at moments its machine learning algorithms determined to be interesting or relevant. An indicator flashed when the camera was looking for scenes to capture. Google Clips' artificial intelligence (AI) could learn the faces of people to take photographs with certain people, and could automatically set lighting and framing. It had 16 GB of storage built-in storage and could record clips for up to 3 hours. This camera was originally priced at US$249 in the United States. It was withdrawn from sale on October 15, 2019, but supported until the end of December 2021. == Reception == The Independent wrote that Google Clips is "an impressive little device, but one that also has the potential to feel very creepy." According to The Verge's generally negative review, "it didn't capture anything special" over two weeks of testing.

    Read more →
  • Differential evolution

    Differential evolution

    Differential evolution (DE) is an evolutionary algorithm to optimize a problem by iteratively trying to improve a candidate solution with regard to a given measure of quality. Such methods are commonly known as metaheuristics as they make few or no assumptions about the optimized problem and can search very large spaces of candidate solutions. However, metaheuristics such as DE do not guarantee an optimal solution is ever found. DE is used for multidimensional real-valued functions but does not use the gradient of the problem being optimized, which means DE does not require the optimization problem to be differentiable, as is required by classic optimization methods such as gradient descent and quasi-newton methods. DE can therefore also be used on optimization problems that are not even continuous, are noisy, change over time, etc. DE optimizes a problem by maintaining a population of candidate solutions and creating new candidate solutions by combining existing ones according to its simple formulae, and then keeping whichever candidate solution has the best score or fitness on the optimization problem at hand. In this way, the optimization problem is treated as a black box that merely provides a measure of quality given a candidate solution and the gradient is therefore not needed. == History == Storn and Price introduced Differential Evolution in 1995. Books have been published on theoretical and practical aspects of using DE in parallel computing, multiobjective optimization, constrained optimization, and the books also contain surveys of application areas. Surveys on the multi-faceted research aspects of DE can be found in journal articles. == Algorithm == A basic variant of the DE algorithm works by having a population of candidate solutions (called agents). These agents are moved around in the search-space by using simple mathematical formulae to combine the positions of existing agents from the population. If the new position of an agent is an improvement then it is accepted and forms part of the population, otherwise the new position is simply discarded. The process is repeated and by doing so it is hoped, but not guaranteed, that a satisfactory solution will eventually be discovered. Formally, let f : R n → R {\displaystyle f:\mathbb {R} ^{n}\to \mathbb {R} } be the fitness function which must be minimized (note that maximization can be performed by considering the function h := − f {\displaystyle h:=-f} instead). The function takes a candidate solution as argument in the form of a vector of real numbers. It produces a real number as output which indicates the fitness of the given candidate solution. The gradient of f {\displaystyle f} is not known. The goal is to find a solution m {\displaystyle \mathbf {m} } for which f ( m ) ≤ f ( p ) {\displaystyle f(\mathbf {m} )\leq f(\mathbf {p} )} for all p {\displaystyle \mathbf {p} } in the search-space, which means that m {\displaystyle \mathbf {m} } is the global minimum. Let x ∈ R n {\displaystyle \mathbf {x} \in \mathbb {R} ^{n}} designate a candidate solution (agent) in the population. The basic DE algorithm can then be described as follows: Choose the parameters NP ≥ 4 {\displaystyle {\text{NP}}\geq 4} , CR ∈ [ 0 , 1 ] {\displaystyle {\text{CR}}\in [0,1]} , and F ∈ [ 0 , 2 ] {\displaystyle F\in [0,2]} . NP : NP {\displaystyle {\text{NP}}} is the population size, i.e. the number of candidate agents or "parents". CR : The parameter CR ∈ [ 0 , 1 ] {\displaystyle {\text{CR}}\in [0,1]} is called the crossover probability. F : The parameter F ∈ [ 0 , 2 ] {\displaystyle F\in [0,2]} is called the differential weight. Typical settings are N P = 10 n {\displaystyle NP=10n} , C R = 0.9 {\displaystyle CR=0.9} and F = 0.8 {\displaystyle F=0.8} . Optimization performance may be greatly impacted by these choices; see below. Initialize all agents x {\displaystyle \mathbf {x} } with random positions in the search-space. Until a termination criterion is met (e.g. number of iterations performed, or adequate fitness reached), repeat the following: For each agent x {\displaystyle \mathbf {x} } in the population do: Pick three agents a , b {\displaystyle \mathbf {a} ,\mathbf {b} } , and c {\displaystyle \mathbf {c} } from the population at random, they must be distinct from each other as well as from agent x {\displaystyle \mathbf {x} } . ( a {\displaystyle \mathbf {a} } is called the "base" vector.) Pick a random index R ∈ { 1 , … , n } {\displaystyle R\in \{1,\ldots ,n\}} where n {\displaystyle n} is the dimensionality of the problem being optimized. Compute the agent's potentially new position y = [ y 1 , … , y n ] {\displaystyle \mathbf {y} =[y_{1},\ldots ,y_{n}]} as follows: For each i ∈ { 1 , … , n } {\displaystyle i\in \{1,\ldots ,n\}} , pick a uniformly distributed random number r i ∼ U ( 0 , 1 ) {\displaystyle r_{i}\sim U(0,1)} If r i < C R {\displaystyle r_{i} Read more →

  • Logic learning machine

    Logic learning machine

    Logic learning machine (LLM) is a machine learning method based on the generation of intelligible rules. LLM is an efficient implementation of the Switching Neural Network (SNN) paradigm, developed by Marco Muselli, Senior Researcher at the Italian National Research Council CNR-IEIIT in Genoa. LLM has been employed in many different sectors, including the field of medicine (orthopedic patient classification, DNA micro-array analysis and Clinical Decision Support Systems), financial services and supply chain management. == History == The Switching Neural Network approach was developed in the 1990s to overcome the drawbacks of the most commonly used machine learning methods. In particular, black box methods, such as multilayer perceptron and support vector machine, had good accuracy but could not provide deep insight into the studied phenomenon. On the other hand, decision trees were able to describe the phenomenon but often lacked accuracy. Switching Neural Networks made use of Boolean algebra to build sets of intelligible rules able to obtain very good performance. In 2014, an efficient version of Switching Neural Network was developed and implemented in the Rulex suite with the name Logic Learning Machine. Also, an LLM version devoted to regression problems was developed. == General == Like other machine learning methods, LLM uses data to build a model able to perform a good forecast about future behaviors. LLM starts from a table including a target variable (output) and some inputs and generates a set of rules that return the output value y {\displaystyle y} corresponding to a given configuration of inputs. A rule is written in the form: if premise then consequence where consequence contains the output value whereas premise includes one or more conditions on the inputs. According to the input type, conditions can have different forms: for categorical variables the input value must be in a given subset: x 1 ∈ { A , B , C , . . . } {\displaystyle x_{1}\in \{A,B,C,...\}} . for ordered variables the condition is written as an inequality or an interval: x 2 ≤ α {\displaystyle x_{2}\leq \alpha } or β ≤ x 3 ≤ γ {\displaystyle \beta \leq x_{3}\leq \gamma } A possible rule is therefore in the form if x 1 ∈ { A , B , C , . . . } {\displaystyle x_{1}\in \{A,B,C,...\}} AND x 2 ≤ α {\displaystyle x_{2}\leq \alpha } AND β ≤ x 3 ≤ γ {\displaystyle \beta \leq x_{3}\leq \gamma } then y = y ¯ {\displaystyle y={\bar {y}}} == Types == According to the output type, different versions of the Logic Learning Machine have been developed: Logic Learning Machine for classification, when the output is a categorical variable, which can assume values in a finite set Logic Learning Machine for regression, when the output is an integer or real number.

    Read more →
  • Concept class

    Concept class

    In computational learning theory in mathematics, a concept over a domain X is a total Boolean function over X. A concept class is a class of concepts. Concept classes are a subject of computational learning theory. Concept class terminology frequently appears in model theory associated with probably approximately correct (PAC) learning. In this setting, if one takes a set Y as a set of (classifier output) labels, and X is a set of examples, the map c : X → Y {\displaystyle c:X\to Y} , i.e. from examples to classifier labels (where Y = { 0 , 1 } {\displaystyle Y=\{0,1\}} and where c is a subset of X), c is then said to be a concept. A concept class C {\displaystyle C} is then a collection of such concepts. Given a class of concepts C, a subclass D is reachable if there exists a sample s such that D contains exactly those concepts in C that are extensions to s. Not every subclass is reachable. == Background == A sample s {\displaystyle s} is a partial function from X {\displaystyle X} to { 0 , 1 } {\displaystyle \{0,1\}} . Identifying a concept with its characteristic function mapping X {\displaystyle X} to { 0 , 1 } {\displaystyle \{0,1\}} , it is a special case of a sample. Two samples are consistent if they agree on the intersection of their domains. A sample s ′ {\displaystyle s'} extends another sample s {\displaystyle s} if the two are consistent and the domain of s {\displaystyle s} is contained in the domain of s ′ {\displaystyle s'} . == Examples == Suppose that C = S + ( X ) {\displaystyle C=S^{+}(X)} . Then: the subclass { { x } } {\displaystyle \{\{x\}\}} is reachable with the sample s = { ( x , 1 ) } {\displaystyle s=\{(x,1)\}} ; the subclass S + ( Y ) {\displaystyle S^{+}(Y)} for Y ⊆ X {\displaystyle Y\subseteq X} are reachable with a sample that maps the elements of X − Y {\displaystyle X-Y} to zero; the subclass S ( X ) {\displaystyle S(X)} , which consists of the singleton sets, is not reachable. == Applications == Let C {\displaystyle C} be some concept class. For any concept c ∈ C {\displaystyle c\in C} , we call this concept 1 / d {\displaystyle 1/d} -good for a positive integer d {\displaystyle d} if, for all x ∈ X {\displaystyle x\in X} , at least 1 / d {\displaystyle 1/d} of the concepts in C {\displaystyle C} agree with c {\displaystyle c} on the classification of x {\displaystyle x} . The fingerprint dimension F D ( C ) {\displaystyle FD(C)} of the entire concept class C {\displaystyle C} is the least positive integer d {\displaystyle d} such that every reachable subclass C ′ ⊆ C {\displaystyle C'\subseteq C} contains a concept that is 1 / d {\displaystyle 1/d} -good for it. This quantity can be used to bound the minimum number of equivalence queries needed to learn a class of concepts according to the following inequality: F D ( C ) − 1 ≤ # E Q ( C ) ≤ ⌈ F D ( C ) ln ⁡ ( | C | ) ⌉ {\textstyle FD(C)-1\leq \#EQ(C)\leq \lceil FD(C)\ln(|C|)\rceil } .

    Read more →
  • IgHome

    IgHome

    igHome is a customizable start page introduced in 2012 as an alternative to iGoogle, the personal web portal launched by Google in May 2005. Just like iGoogle, igHome offers users the possibility to build a start page containing a central search box and a number of gadgets. igHome mimics the user interface of iGoogle. Registered igHome users can create multiple tabs and import RSS feeds.

    Read more →
  • Probit model

    Probit model

    In statistics, a probit model is a type of regression where the dependent variable can take only two values, for example married or not married. The word is a portmanteau, coming from probability + unit. The purpose of the model is to estimate the probability that an observation with particular characteristics will fall into a specific one of the categories; moreover, classifying observations based on their predicted probabilities is a type of binary classification model. A probit model is a popular specification for a binary response model. As such it treats the same set of problems as does logistic regression using similar techniques. When viewed in the generalized linear model framework, the probit model employs a probit link function. It is most often estimated using the maximum likelihood procedure, such an estimation being called a probit regression. == Conceptual framework == Suppose a response variable Y is binary, that is it can have only two possible outcomes which we will denote as 1 and 0. For example, Y may represent presence/absence of a certain condition, success/failure of some device, answer yes/no on a survey, etc. We also have a vector of regressors X, which are assumed to influence the outcome Y. Specifically, we assume that the model takes the form P ( Y = 1 ∣ X ) = Φ ( X T β ) , {\displaystyle P(Y=1\mid X)=\Phi (X^{\operatorname {T} }\beta ),} where P is the probability and Φ {\displaystyle \Phi } is the cumulative distribution function (CDF) of the standard normal distribution. The parameters β are typically estimated by maximum likelihood. It is possible to motivate the probit model as a latent variable model. Suppose there exists an auxiliary random variable Y ∗ = X T β + ε , {\displaystyle Y^{\ast }=X^{T}\beta +\varepsilon ,} where ε ~ N(0, 1). Then Y can be viewed as an indicator for whether this latent variable is positive: Y = { 1 Y ∗ > 0 0 otherwise } = { 1 X T β + ε > 0 0 otherwise } {\displaystyle Y=\left.{\begin{cases}1&Y^{}>0\\0&{\text{otherwise}}\end{cases}}\right\}=\left.{\begin{cases}1&X^{\operatorname {T} }\beta +\varepsilon >0\\0&{\text{otherwise}}\end{cases}}\right\}} The use of the standard normal distribution causes no loss of generality compared with the use of a normal distribution with an arbitrary mean and standard deviation, because adding a fixed amount to the mean can be compensated by subtracting the same amount from the intercept, and multiplying the standard deviation by a fixed amount can be compensated by multiplying the weights by the same amount. To see that the two models are equivalent, note that P ( Y = 1 ∣ X ) = P ( Y ∗ > 0 ) = P ( X T β + ε > 0 ) = P ( ε > − X T β ) = P ( ε < X T β ) by symmetry of the normal distribution = Φ ( X T β ) {\displaystyle {\begin{aligned}P(Y=1\mid X)&=P(Y^{\ast }>0)\\&=P(X^{\operatorname {T} }\beta +\varepsilon >0)\\&=P(\varepsilon >-X^{\operatorname {T} }\beta )\\&=P(\varepsilon 0 {\displaystyle t,\lim _{n\rightarrow \infty }n_{t}/n=c_{t}>0} . Denote p ^ t = r t / n t {\displaystyle {\hat {p}}_{t}=r_{t}/n_{t}} σ ^ t 2 = 1 n t p ^ t ( 1 − p ^ t ) φ 2 ( Φ − 1 ( p ^ t ) ) {\displaystyle {\hat {\sigma }}_{t}^{2}={\frac {1}{n_{t}}}{\frac {{\hat {p}}_{t}(1-{\hat {p}}_{t})}{\varphi ^{2}{\big (}\Phi ^{-1}({\hat {p}}_{t}){\big )}}}} Then Berkson's minimum chi-square estimator is a generalized least squares estimator in a regression of Φ − 1 ( p ^ t ) {\displaystyle \Phi ^{-1}({\hat {p}}_{t})} on x ( t ) {\displaystyle x_{(t)}} with weights σ ^ t − 2 {\displaystyle {\hat {\sigma }}_{t}^{-2}} : β ^ = ( ∑ t = 1 T σ ^ t − 2 x ( t ) x ( t ) T ) − 1 ∑ t = 1 T σ ^ t − 2 x ( t ) Φ − 1 ( p ^ t ) {\displaystyle {\hat {\beta }}={\Bigg (}\sum _{t=1}^{T}{\hat {\sigma }}_{t}^{-2}x_{(t)}x_{(t)}^{\operatorname {T} }{\Bigg )}^{-1}\sum _{t=1}^{T}{\hat {\sigma }}_{t}^{-2}x_{(t)}\Phi ^{-1}({\hat {p}}_{t})} It can be shown that this estimator is consistent (as n→∞ and T fixed), asymptotically normal and efficient. Its advantage is the presence of a closed-form formula for the estimator. However, it is only meaningful to carry out this analysis when individual observations are not available, only their aggregated counts r t {\displaystyle r_{t}} , n t {\disp

    Read more →
  • Kernel method

    Kernel method

    In machine learning, kernel machines are a class of algorithms for pattern analysis, whose best known member is the support-vector machine (SVM). These methods involve using linear classifiers to solve nonlinear problems. The general task of pattern analysis is to find and study general types of relations (for example clusters, rankings, principal components, correlations, classifications) in datasets. For many algorithms that solve these tasks, the data in raw representation have to be explicitly transformed into feature vector representations via a user-specified feature map: in contrast, kernel methods require only a user-specified kernel, i.e., a similarity function over all pairs of data points computed using inner products. The feature map in kernel machines is infinite dimensional but only requires a finite dimensional matrix from user-input according to the representer theorem. Kernel machines are slow to compute for datasets larger than a couple of thousand examples without parallel processing. Kernel methods owe their name to the use of kernel functions, which enable them to operate in a high-dimensional, implicit feature space without ever computing the coordinates of the data in that space, but rather by simply computing the inner products between the images of all pairs of data in the feature space. This operation is often computationally cheaper than the explicit computation of the coordinates. This approach is called the "kernel trick". Kernel functions have been introduced for sequence data, graphs, text, images, as well as vectors. Algorithms capable of operating with kernels include the kernel perceptron, support-vector machines (SVM), Gaussian processes, principal components analysis (PCA), canonical correlation analysis, ridge regression, spectral clustering, linear adaptive filters and many others. Most kernel algorithms are based on convex optimization or eigenproblems and are statistically well-founded. Typically, their statistical properties are analyzed using statistical learning theory (for example, using Rademacher complexity). == Motivation and informal explanation == Kernel methods can be thought of as instance-based learners: rather than learning some fixed set of parameters corresponding to the features of their inputs, they instead "remember" the i {\displaystyle i} -th training example ( x i , y i ) {\displaystyle (\mathbf {x} _{i},y_{i})} and learn for it a corresponding weight w i {\displaystyle w_{i}} . Prediction for unlabeled inputs, i.e., those not in the training set, are treated by the application of a similarity function k {\displaystyle k} , called a kernel, between the unlabeled input x ′ {\displaystyle \mathbf {x'} } and each of the training inputs x i {\displaystyle \mathbf {x} _{i}} . For instance, a kernelized binary classifier typically computes a weighted sum of similarities y ^ = sgn ⁡ ∑ i = 1 n w i y i k ( x i , x ′ ) , {\displaystyle {\hat {y}}=\operatorname {sgn} \sum _{i=1}^{n}w_{i}y_{i}k(\mathbf {x} _{i},\mathbf {x'} ),} where y ^ ∈ { − 1 , + 1 } {\displaystyle {\hat {y}}\in \{-1,+1\}} is the kernelized binary classifier's predicted label for the unlabeled input x ′ {\displaystyle \mathbf {x'} } whose hidden true label y {\displaystyle y} is of interest; k : X × X → R {\displaystyle k\colon {\mathcal {X}}\times {\mathcal {X}}\to \mathbb {R} } is the kernel function that measures similarity between any pair of inputs x , x ′ ∈ X {\displaystyle \mathbf {x} ,\mathbf {x'} \in {\mathcal {X}}} ; the sum ranges over the n labeled examples { ( x i , y i ) } i = 1 n {\displaystyle \{(\mathbf {x} _{i},y_{i})\}_{i=1}^{n}} in the classifier's training set, with y i ∈ { − 1 , + 1 } {\displaystyle y_{i}\in \{-1,+1\}} ; the w i ∈ R {\displaystyle w_{i}\in \mathbb {R} } are the weights for the training examples, as determined by the learning algorithm; the sign function sgn {\displaystyle \operatorname {sgn} } determines whether the predicted classification y ^ {\displaystyle {\hat {y}}} comes out positive or negative. Kernel classifiers were described as early as the 1960s, with the invention of the kernel perceptron. They rose to great prominence with the popularity of the support-vector machine (SVM) in the 1990s, when the SVM was found to be competitive with neural networks on tasks such as handwriting recognition. == Mathematics: the kernel trick == The kernel trick avoids the explicit mapping that is needed to get linear learning algorithms to learn a nonlinear function or decision boundary. For all x {\displaystyle \mathbf {x} } and x ′ {\displaystyle \mathbf {x'} } in the input space X {\displaystyle {\mathcal {X}}} , certain functions k ( x , x ′ ) {\displaystyle k(\mathbf {x} ,\mathbf {x'} )} can be expressed as an inner product in another space V {\displaystyle {\mathcal {V}}} . The function k : X × X → R {\displaystyle k\colon {\mathcal {X}}\times {\mathcal {X}}\to \mathbb {R} } is often referred to as a kernel or a kernel function. The word "kernel" is used in mathematics to denote a weighting function for a weighted sum or integral. Certain problems in machine learning have more structure than an arbitrary weighting function k {\displaystyle k} . The computation is made much simpler if the kernel can be written in the form of a "feature map" φ : X → V {\displaystyle \varphi \colon {\mathcal {X}}\to {\mathcal {V}}} which satisfies k ( x , x ′ ) = ⟨ φ ( x ) , φ ( x ′ ) ⟩ V . {\displaystyle k(\mathbf {x} ,\mathbf {x'} )=\langle \varphi (\mathbf {x} ),\varphi (\mathbf {x'} )\rangle _{\mathcal {V}}.} The key restriction is that ⟨ ⋅ , ⋅ ⟩ V {\displaystyle \langle \cdot ,\cdot \rangle _{\mathcal {V}}} must be a proper inner product. On the other hand, an explicit representation for φ {\displaystyle \varphi } is not necessary, as long as V {\displaystyle {\mathcal {V}}} is an inner product space. The alternative follows from Mercer's theorem: an implicitly defined function φ {\displaystyle \varphi } exists whenever the space X {\displaystyle {\mathcal {X}}} can be equipped with a suitable measure ensuring the function k {\displaystyle k} satisfies Mercer's condition. Mercer's theorem is similar to a generalization of the result from linear algebra that associates an inner product to any positive-definite matrix. In fact, Mercer's condition can be reduced to this simpler case. If we choose as our measure the counting measure μ ( T ) = | T | {\displaystyle \mu (T)=|T|} for all T ⊂ X {\displaystyle T\subset X} , which counts the number of points inside the set T {\displaystyle T} , then the integral in Mercer's theorem reduces to a summation ∑ i = 1 n ∑ j = 1 n k ( x i , x j ) c i c j ≥ 0. {\displaystyle \sum _{i=1}^{n}\sum _{j=1}^{n}k(\mathbf {x} _{i},\mathbf {x} _{j})c_{i}c_{j}\geq 0.} If this summation holds for all finite sequences of points ( x 1 , … , x n ) {\displaystyle (\mathbf {x} _{1},\dotsc ,\mathbf {x} _{n})} in X {\displaystyle {\mathcal {X}}} and all choices of n {\displaystyle n} real-valued coefficients ( c 1 , … , c n ) {\displaystyle (c_{1},\dots ,c_{n})} (cf. positive definite kernel), then the function k {\displaystyle k} satisfies Mercer's condition. Some algorithms that depend on arbitrary relationships in the native space X {\displaystyle {\mathcal {X}}} would, in fact, have a linear interpretation in a different setting: the range space of φ {\displaystyle \varphi } . The linear interpretation gives us insight about the algorithm. Furthermore, there is often no need to compute φ {\displaystyle \varphi } directly during computation, as is the case with support-vector machines. Some cite this running time shortcut as the primary benefit. Researchers also use it to justify the meanings and properties of existing algorithms. Theoretically, a Gram matrix K ∈ R n × n {\displaystyle \mathbf {K} \in \mathbb {R} ^{n\times n}} with respect to { x 1 , … , x n } {\displaystyle \{\mathbf {x} _{1},\dotsc ,\mathbf {x} _{n}\}} (sometimes also called a "kernel matrix"), where K i j = k ( x i , x j ) {\displaystyle K_{ij}=k(\mathbf {x} _{i},\mathbf {x} _{j})} , must be positive semi-definite (PSD). Empirically, for machine learning heuristics, choices of a function k {\displaystyle k} that do not satisfy Mercer's condition may still perform reasonably if k {\displaystyle k} at least approximates the intuitive idea of similarity. Regardless of whether k {\displaystyle k} is a Mercer kernel, k {\displaystyle k} may still be referred to as a "kernel". If the kernel function k {\displaystyle k} is also a covariance function as used in Gaussian processes, then the Gram matrix K {\displaystyle \mathbf {K} } can also be called a covariance matrix. == Applications == Application areas of kernel methods are diverse and include geostatistics, kriging, inverse distance weighting, 3D reconstruction, bioinformatics, cheminformatics, information extraction and handwriting recognition. == Popular kernels == Fisher kernel Graph kernels Kernel smoother Polynomial kernel Radial basis function kern

    Read more →
  • Logic learning machine

    Logic learning machine

    Logic learning machine (LLM) is a machine learning method based on the generation of intelligible rules. LLM is an efficient implementation of the Switching Neural Network (SNN) paradigm, developed by Marco Muselli, Senior Researcher at the Italian National Research Council CNR-IEIIT in Genoa. LLM has been employed in many different sectors, including the field of medicine (orthopedic patient classification, DNA micro-array analysis and Clinical Decision Support Systems), financial services and supply chain management. == History == The Switching Neural Network approach was developed in the 1990s to overcome the drawbacks of the most commonly used machine learning methods. In particular, black box methods, such as multilayer perceptron and support vector machine, had good accuracy but could not provide deep insight into the studied phenomenon. On the other hand, decision trees were able to describe the phenomenon but often lacked accuracy. Switching Neural Networks made use of Boolean algebra to build sets of intelligible rules able to obtain very good performance. In 2014, an efficient version of Switching Neural Network was developed and implemented in the Rulex suite with the name Logic Learning Machine. Also, an LLM version devoted to regression problems was developed. == General == Like other machine learning methods, LLM uses data to build a model able to perform a good forecast about future behaviors. LLM starts from a table including a target variable (output) and some inputs and generates a set of rules that return the output value y {\displaystyle y} corresponding to a given configuration of inputs. A rule is written in the form: if premise then consequence where consequence contains the output value whereas premise includes one or more conditions on the inputs. According to the input type, conditions can have different forms: for categorical variables the input value must be in a given subset: x 1 ∈ { A , B , C , . . . } {\displaystyle x_{1}\in \{A,B,C,...\}} . for ordered variables the condition is written as an inequality or an interval: x 2 ≤ α {\displaystyle x_{2}\leq \alpha } or β ≤ x 3 ≤ γ {\displaystyle \beta \leq x_{3}\leq \gamma } A possible rule is therefore in the form if x 1 ∈ { A , B , C , . . . } {\displaystyle x_{1}\in \{A,B,C,...\}} AND x 2 ≤ α {\displaystyle x_{2}\leq \alpha } AND β ≤ x 3 ≤ γ {\displaystyle \beta \leq x_{3}\leq \gamma } then y = y ¯ {\displaystyle y={\bar {y}}} == Types == According to the output type, different versions of the Logic Learning Machine have been developed: Logic Learning Machine for classification, when the output is a categorical variable, which can assume values in a finite set Logic Learning Machine for regression, when the output is an integer or real number.

    Read more →
  • Coghead

    Coghead

    Coghead was a web application company based out of Redwood City, California. The company offered a web-based service for building and hosting custom online database applications. Applications were built around custom data collections and were typically designed to facilitate management of, and collaboration on, business data. Examples of Coghead's "gallery" applications include project management, simple Customer relationship management, bug tracking and extreme programming. Coghead's service was available through a limited-access beta program before "going live" for free trial accounts in April, 2007. Coghead launched a paid subscription plans in June, 2007. On February 19, 2009, Coghead announced that its intellectual property assets (its 'service') had been acquired by SAP AG (NYSE:SAP). == Product == Coghead's product was a fully hosted environment for building, accessing, and maintaining applications and the associated business data. Like other so-called "Web 2.0" companies, Coghead built its product around the idea of "software as a service". The product was intended to allow users to design a range of applications from scratch using only a drag and drop, WYSIWYG user interface, with very limited scripting or coding (if any) required. Coghead also offered its paid subscribers the ability to develop and publish "Coglets," web forms that allowed site visitors to view data in, or submit data into, the host's Coghead database. On February 19, 2009, Coghead announced that SAP AG had acquired the Coghead service through an asset purchase. The SAP asset purchase closed in the 1st Quarter 2009. Immediately upon closing the asset purchase, the public-facing service was taken off-line by SAP as they prepared to integrate the Coghead code with other SAP assets. This forced many of Coghead's customers to find alternative solutions.

    Read more →
  • Correlation clustering

    Correlation clustering

    Clustering is the problem of partitioning data points into groups based on similarity or dissimilarity. Correlation clustering is a clustering framework in which a set of objects is partitioned into clusters based on pairwise similarity and dissimilarity information, without requiring the number of clusters to be specified in advance. == Description of the problem == In machine learning, correlation clustering (also known as cluster editing) considers settings in which pairwise similarity or dissimilarity relationships between objects are known. A standard formulation models the input as an unweighted complete graph G = ( V , E ) {\displaystyle G=(V,E)} , where each edge is labeled either + {\displaystyle +} or − {\displaystyle -} (that is, the graph is a signed graph), indicating whether the corresponding endpoints are similar or dissimilar. The goal is to find a clustering (that is, a partition of V {\displaystyle V} ) that either maximizes the number of agreements—the sum of positive edges whose endpoints lie in the same cluster and negative edges whose endpoints lie in different clusters—or minimizes the number of disagreements—the sum of positive edges whose endpoints are separated and negative edges whose endpoints lie in the same cluster. Unlike other clustering methods such as k-means, correlation clustering does not require choosing the number of clusters k {\displaystyle k} in advance. It is not always possible to find a clustering with zero disagreements. For example, consider a triangle graph containing two positive edges and one negative edge. In this case, every clustering incurs at least one disagreement. Such configurations are referred to in the literature as bad triangles. From a computational perspective, optimizing the correlation clustering objective is challenging. The (decision version of the) problem is NP-complete. A large body of subsequent work has developed approximation algorithms for correlation clustering under various assumptions, including complete or general graphs and unweighted or weighted graphs, for both minimization and maximization objectives. This problem is considered one of the fundamental combinatorial optimization problems, and many algorithmic techniques have been developed to address it. The problem has also been studied extensively across multiple disciplines. A comprehensive literature review of early correlation clustering research is provided by Wahid and Hassini. == Formal Definitions == Let G = ( V , E ) {\displaystyle G=(V,E)} be a graph with nodes V {\displaystyle V} and edges E {\displaystyle E} . A clustering of G {\displaystyle G} is a partition of its node set Π = { π 1 , … , π k } {\displaystyle \Pi =\{\pi _{1},\dots ,\pi _{k}\}} with V = π 1 ∪ ⋯ ∪ π k {\displaystyle V=\pi _{1}\cup \dots \cup \pi _{k}} and π i ∩ π j = ∅ {\displaystyle \pi _{i}\cap \pi _{j}=\emptyset } for i ≠ j {\displaystyle i\neq j} . For a given clustering Π {\displaystyle \Pi } , let δ ( Π ) = { { u , v } ∈ E ∣ { u , v } ⊈ π ∀ π ∈ Π } {\displaystyle \delta (\Pi )=\{\{u,v\}\in E\mid \{u,v\}\not \subseteq \pi \;\forall \pi \in \Pi \}} denote the subset of edges of G {\displaystyle G} whose endpoints are in different subsets of the clustering Π {\displaystyle \Pi } . Now, let w : E → R ≥ 0 {\displaystyle w\colon E\to \mathbb {R} _{\geq 0}} be a function that assigns a non-negative weight to each edge of the graph and let E = E + ∪ E − {\displaystyle E=E^{+}\cup E^{-}} be a partition of the edges into attractive ( E + {\displaystyle E^{+}} ) and repulsive ( E − {\displaystyle E^{-}} ) edges; that is, the edges are signed. The minimum disagreement correlation clustering problem is the following optimization problem: minimize Π ∑ e ∈ E + ∩ δ ( Π ) w e + ∑ e ∈ E − ∖ δ ( Π ) w e . {\displaystyle {\begin{aligned}&{\underset {\Pi }{\operatorname {minimize} }}&&\sum _{e\in E^{+}\cap \delta (\Pi )}w_{e}+\sum _{e\in E^{-}\setminus \delta (\Pi )}w_{e}\;.\end{aligned}}} Here, the set E + ∩ δ ( Π ) {\displaystyle E^{+}\cap \delta (\Pi )} contains the attractive edges whose endpoints are in different components with respect to the clustering Π {\displaystyle \Pi } and the set E − ∖ δ ( Π ) {\displaystyle E^{-}\setminus \delta (\Pi )} contains the repulsive edges whose endpoints are in the same component with respect to the clustering Π {\displaystyle \Pi } . Together these two sets contain all edges that disagree with the clustering Π {\displaystyle \Pi } . Similarly to the minimum disagreement correlation clustering problem, the maximum agreement correlation clustering problem is defined as maximize Π ∑ e ∈ E + ∖ δ ( Π ) w e + ∑ e ∈ E − ∩ δ ( Π ) w e . {\displaystyle {\begin{aligned}&{\underset {\Pi }{\operatorname {maximize} }}&&\sum _{e\in E^{+}\setminus \delta (\Pi )}w_{e}+\sum _{e\in E^{-}\cap \delta (\Pi )}w_{e}\;.\end{aligned}}} Here, the set E + ∖ δ ( Π ) {\displaystyle E^{+}\setminus \delta (\Pi )} contains the attractive edges whose endpoints are in the same component with respect to the clustering Π {\displaystyle \Pi } and the set E − ∩ δ ( Π ) {\displaystyle E^{-}\cap \delta (\Pi )} contains the repulsive edges whose endpoints are in different components with respect to the clustering Π {\displaystyle \Pi } . Together these two sets contain all edges that agree with the clustering Π {\displaystyle \Pi } . Instead of formulating the correlation clustering problem in terms of non-negative edge weights and a partition of the edges into attractive and repulsive edges the problem is also formulated in terms of positive and negative edge costs without partitioning the set of edges explicitly. For given weights w : E → R ≥ 0 {\displaystyle w\colon E\to \mathbb {R} _{\geq 0}} and a given partition E = E + ∪ E − {\displaystyle E=E^{+}\cup E^{-}} of the edges into attractive and repulsive edges, the edge costs can be defined by c e = { w e if e ∈ E + − w e if e ∈ E − {\displaystyle {\begin{aligned}c_{e}={\begin{cases}\;\;w_{e}&{\text{if }}e\in E^{+}\\-w_{e}&{\text{if }}e\in E^{-}\end{cases}}\end{aligned}}} for all e ∈ E {\displaystyle e\in E} . An edge whose endpoints are in different clusters is said to be cut. The set δ ( Π ) {\displaystyle \delta (\Pi )} of all edges that are cut is often called a multicut of G {\displaystyle G} . The minimum cost multicut problem is the problem of finding a clustering Π {\displaystyle \Pi } of G {\displaystyle G} such that the sum of the costs of the edges whose endpoints are in different clusters is minimal: minimize Π ∑ e ∈ δ ( Π ) c e . {\displaystyle {\begin{aligned}&{\underset {\Pi }{\operatorname {minimize} }}&&\sum _{e\in \delta (\Pi )}c_{e}\;.\end{aligned}}} Similar to the minimum cost multicut problem, coalition structure generation in weighted graph games is the problem of finding a clustering such that the sum of the costs of the edges that are not cut is maximal: maximize Π ∑ e ∈ E ∖ δ ( Π ) c e . {\displaystyle {\begin{aligned}&{\underset {\Pi }{\operatorname {maximize} }}&&\sum _{e\in E\setminus \delta (\Pi )}c_{e}\;.\end{aligned}}} This formulation is also known as the clique partitioning problem. It can be shown that all four problems that are formulated above are equivalent. This means that a clustering that is optimal with respect to any of the four objectives is optimal for all of the four objectives. == Algorithms == If the graph admits a clustering with zero disagreements, then deleting all negative edges and computing the connected components of the remaining graph yields an optimal clustering. A necessary and sufficient condition for the existence of such a clustering was given by Davis: no cycle in the graph may contain exactly one negative edge. Bansal et al. discuss the NP-completeness proof and also present both a constant factor approximation algorithm and polynomial-time approximation scheme to find the clusters in this setting. Ailon et al. propose a randomized 3-approximation algorithm for the same problem. CC-Pivot(G=(V,E+,E−)) Pick random pivot i ∈ V Set C = { i } {\displaystyle C=\{i\}} , V'=Ø For all j ∈ V, j ≠ i; If (i,j) ∈ E+ then Add j to C Else (If (i,j) ∈ E−) Add j to V' Let G' be the subgraph induced by V' Return clustering C,CC-Pivot(G') The authors show that the above algorithm is a 3-approximation algorithm for correlation clustering. The best polynomial-time approximation algorithm known at the moment for this problem achieves a ~2.06 approximation by rounding a linear program, as shown by Chawla, Makarychev, Schramm, and Yaroslavtsev. Karpinski and Schudy proved existence of a polynomial time approximation scheme (PTAS) for that problem on complete graphs and fixed number of clusters. == Optimal number of clusters == In 2011, it was shown by Bagon and Galun that the optimization of the correlation clustering functional is closely related to well known discrete optimization methods. In their work they proposed a probabilistic analysis of the underlying implicit model that allows the correlation clustering functional to estimate the

    Read more →
  • Latent and observable variables

    Latent and observable variables

    In statistics, latent variables (from Latin: present participle of lateo 'lie hidden') are variables that can only be inferred indirectly through a mathematical model from other observable variables that can be directly observed or measured. Such latent variable models are used in many disciplines, including engineering, medicine, ecology, physics, machine learning/artificial intelligence, natural language processing, bioinformatics, chemometrics, demography, economics, management, political science, psychology and the social sciences. Latent variables may correspond to aspects of physical reality. These could in principle be measured, but may not be for practical reasons. Among the earliest expressions of this idea is Francis Bacon's polemic the Novum Organum, itself a challenge to the more traditional logic expressed in Aristotle's Organon: But the latent process of which we speak, is far from being obvious to men’s minds, beset as they now are. For we mean not the measures, symptoms, or degrees of any process which can be exhibited in the bodies themselves, but simply a continued process, which, for the most part, escapes the observation of the senses. In this situation, the term hidden variables is commonly used, reflecting the fact that the variables are meaningful, but not observable. Other latent variables correspond to abstract concepts, like categories, behavioral or mental states, or data structures. The terms hypothetical variables or hypothetical constructs may be used in these situations. The use of latent variables can serve to reduce the dimensionality of data. Many observable variables can be aggregated in a model to represent an underlying concept, making it easier to understand the data. In this sense, they serve a function similar to that of scientific theories. At the same time, latent variables link observable "sub-symbolic" data in the real world to symbolic data in the modeled world. == Examples == === Psychology === Latent variables, as created by factor analytic methods, generally represent "shared" variance, or the degree to which variables "move" together. Variables that have no correlation cannot result in a latent construct based on the common factor model. The "Big Five personality traits" have been inferred using factor analysis. extraversion spatial ability wisdom: “Two of the more predominant means of assessing wisdom include wisdom-related performance and latent variable measures.” Spearman's g, or the general intelligence factor in psychometrics === Economics === Examples of latent variables from the field of economics include quality of life, business confidence, morale, happiness and conservatism: these are all variables which cannot be measured directly. However, by linking these latent variables to other, observable variables, the values of the latent variables can be inferred from measurements of the observable variables. Quality of life is a latent variable which cannot be measured directly, so observable variables are used to infer quality of life. Observable variables to measure quality of life include wealth, employment, environment, physical and mental health, education, recreation and leisure time, and social belonging. === Medicine === Latent-variable methodology is used in many branches of medicine. A class of problems that naturally lend themselves to latent variables approaches are longitudinal studies where the time scale (e.g. age of participant or time since study baseline) is not synchronized with the trait being studied. For such studies, an unobserved time scale that is synchronized with the trait being studied can be modeled as a transformation of the observed time scale using latent variables. Examples of this include disease progression modeling and modeling of growth (see box). == Inferring latent variables == There exists a range of different model classes and methodology that make use of latent variables and allow inference in the presence of latent variables. Models include: linear mixed-effects models and nonlinear mixed-effects models Hidden Markov models Factor analysis Item response theory Analysis and inference methods include: Principal component analysis Instrumented principal component analysis Partial least squares regression Latent semantic analysis and probabilistic latent semantic analysis EM algorithms Metropolis–Hastings algorithm === Bayesian algorithms and methods === Bayesian statistics is often used for inferring latent variables. Latent Dirichlet allocation The Chinese restaurant process is often used to provide a prior distribution over assignments of objects to latent categories. The Indian buffet process is often used to provide a prior distribution over assignments of latent binary features to objects.

    Read more →