Point-set registration

Point-set registration

In computer vision, pattern recognition, and robotics, point-set registration, also known as point-cloud registration or scan matching, is the process of finding a spatial transformation (e.g., scaling, rotation and translation) that aligns two point clouds. The purpose of finding such a transformation includes merging multiple data sets into a globally consistent model (or coordinate frame), and mapping a new measurement to a known data set to identify features or to estimate its pose. Raw 3D point cloud data are typically obtained from Lidars and RGB-D cameras. 3D point clouds can also be generated from computer vision algorithms such as triangulation, bundle adjustment, and more recently, monocular image depth estimation using deep learning. For 2D point set registration used in image processing and feature-based image registration, a point set may be 2D pixel coordinates obtained by feature extraction from an image, for example corner detection. Point cloud registration has extensive applications in autonomous driving, motion estimation and 3D reconstruction, object detection and pose estimation, robotic manipulation, simultaneous localization and mapping (SLAM), panorama stitching, virtual and augmented reality, and medical imaging. As a special case, registration of two point sets that only differ by a 3D rotation (i.e., there is no scaling and translation), is called the Wahba Problem and also related to the orthogonal procrustes problem. == Formulation == The problem may be summarized as follows: Let { M , S } {\displaystyle \lbrace {\mathcal {M}},{\mathcal {S}}\rbrace } be two finite size point sets in a finite-dimensional real vector space R d {\displaystyle \mathbb {R} ^{d}} , which contain M {\displaystyle M} and N {\displaystyle N} points respectively (e.g., d = 3 {\displaystyle d=3} recovers the typical case of when M {\displaystyle {\mathcal {M}}} and S {\displaystyle {\mathcal {S}}} are 3D point sets). The problem is to find a transformation to be applied to the moving "model" point set M {\displaystyle {\mathcal {M}}} such that the difference (typically defined in the sense of point-wise Euclidean distance) between M {\displaystyle {\mathcal {M}}} and the static "scene" set S {\displaystyle {\mathcal {S}}} is minimized. In other words, a mapping from R d {\displaystyle \mathbb {R} ^{d}} to R d {\displaystyle \mathbb {R} ^{d}} is desired which yields the best alignment between the transformed "model" set and the "scene" set. The mapping may consist of a rigid or non-rigid transformation. The transformation model may be written as T {\displaystyle T} , using which the transformed, registered model point set is: The output of a point set registration algorithm is therefore the optimal transformation T ⋆ {\displaystyle T^{\star }} such that M {\displaystyle {\mathcal {M}}} is best aligned to S {\displaystyle {\mathcal {S}}} , according to some defined notion of distance function dist ⁡ ( ⋅ , ⋅ ) {\displaystyle \operatorname {dist} (\cdot ,\cdot )} : where T {\displaystyle {\mathcal {T}}} is used to denote the set of all possible transformations that the optimization tries to search for. The most popular choice of the distance function is to take the square of the Euclidean distance for every pair of points: where ‖ ⋅ ‖ 2 {\displaystyle \|\cdot \|_{2}} denotes the vector 2-norm, s m {\displaystyle s_{m}} is the corresponding point in set S {\displaystyle {\mathcal {S}}} that attains the shortest distance to a given point m {\displaystyle m} in set M {\displaystyle {\mathcal {M}}} after transformation. Minimizing such a function in rigid registration is equivalent to solving a least squares problem. == Types of algorithms == When the correspondences (i.e., s m ↔ m {\displaystyle s_{m}\leftrightarrow m} ) are given before the optimization, for example, using feature matching techniques, then the optimization only needs to estimate the transformation. This type of registration is called correspondence-based registration. On the other hand, if the correspondences are unknown, then the optimization is required to jointly find out the correspondences and transformation together. This type of registration is called simultaneous pose and correspondence registration. === Rigid registration === Given two point sets, rigid registration yields a rigid transformation which maps one point set to the other. A rigid transformation is defined as a transformation that does not change the distance between any two points. Typically such a transformation consists of translation and rotation. In rare cases, the point set may also be mirrored. In robotics and computer vision, rigid registration has the most applications. === Non-rigid registration === Given two point sets, non-rigid registration yields a non-rigid transformation which maps one point set to the other. Non-rigid transformations include affine transformations such as scaling and shear mapping. However, in the context of point set registration, non-rigid registration typically involves nonlinear transformation. If the eigenmodes of variation of the point set are known, the nonlinear transformation may be parametrized by the eigenvalues. A nonlinear transformation may also be parametrized as a thin plate spline. === Other types === Some approaches to point set registration use algorithms that solve the more general graph matching problem. However, the computational complexity of such methods tend to be high and they are limited to rigid registrations. In this article, we will only consider algorithms for rigid registration, where the transformation is assumed to contain 3D rotations and translations (possibly also including a uniform scaling). The PCL (Point Cloud Library) is an open-source framework for n-dimensional point cloud and 3D geometry processing. It includes several point registration algorithms. == Correspondence-based registration == Correspondence-based methods assume the putative correspondences m ↔ s m {\displaystyle m\leftrightarrow s_{m}} are given for every point m ∈ M {\displaystyle m\in {\mathcal {M}}} . Therefore, we arrive at a setting where both point sets M {\displaystyle {\mathcal {M}}} and S {\displaystyle {\mathcal {S}}} have N {\displaystyle N} points and the correspondences m i ↔ s i , i = 1 , … , N {\displaystyle m_{i}\leftrightarrow s_{i},i=1,\dots ,N} are given. === Outlier-free registration === In the simplest case, one can assume that all the correspondences are correct, meaning that the points m i , s i ∈ R 3 {\displaystyle m_{i},s_{i}\in \mathbb {R} ^{3}} are generated as follows:where l > 0 {\displaystyle l>0} is a uniform scaling factor (in many cases l = 1 {\displaystyle l=1} is assumed), R ∈ SO ( 3 ) {\displaystyle R\in {\text{SO}}(3)} is a proper 3D rotation matrix ( SO ( d ) {\displaystyle {\text{SO}}(d)} is the special orthogonal group of degree d {\displaystyle d} ), t ∈ R 3 {\displaystyle t\in \mathbb {R} ^{3}} is a 3D translation vector and ϵ i ∈ R 3 {\displaystyle \epsilon _{i}\in \mathbb {R} ^{3}} models the unknown additive noise (e.g., Gaussian noise). Specifically, if the noise ϵ i {\displaystyle \epsilon _{i}} is assumed to follow a zero-mean isotropic Gaussian distribution with standard deviation σ i {\displaystyle \sigma _{i}} , i.e., ϵ i ∼ N ( 0 , σ i 2 I 3 ) {\displaystyle \epsilon _{i}\sim {\mathcal {N}}(0,\sigma _{i}^{2}I_{3})} , then the following optimization can be shown to yield the maximum likelihood estimate for the unknown scale, rotation and translation:Note that when the scaling factor is 1 and the translation vector is zero, then the optimization recovers the formulation of the Wahba problem. Despite the non-convexity of the optimization (cb.2) due to non-convexity of the set SO ( 3 ) {\displaystyle {\text{SO}}(3)} , seminal work by Berthold K.P. Horn showed that (cb.2) actually admits a closed-form solution, by decoupling the estimation of scale, rotation and translation. Similar results were discovered by Arun et al. In addition, in order to find a unique transformation ( l , R , t ) {\displaystyle (l,R,t)} , at least N = 3 {\displaystyle N=3} non-collinear points in each point set are required. More recently, Briales and Gonzalez-Jimenez have developed a semidefinite relaxation using Lagrangian duality, for the case where the model set M {\displaystyle {\mathcal {M}}} contains different 3D primitives such as points, lines and planes (which is the case when the model M {\displaystyle {\mathcal {M}}} is a 3D mesh). Interestingly, the semidefinite relaxation is empirically tight, i.e., a certifiably globally optimal solution can be extracted from the solution of the semidefinite relaxation. === Robust registration === The least squares formulation (cb.2) is known to perform arbitrarily badly in the presence of outliers. An outlier correspondence is a pair of measurements s i ↔ m i {\displaystyle s_{i}\leftrightarrow m_{i}} that departs from the generative model (cb.1). In this case, one can consider a differen

Negobot

Negobot also referred to as Lolita or Lolita chatbot is a chatterbot that was introduced to the public in 2013, designed by researchers from the University of Deusto and Optenet to catch online pedophiles. It is a conversational agent that utilizes natural language processing (NLP), information retrieval (IR) and Automatic Learning. Because the bot poses as a young female in order to entice and track potential predators, it became known in media as the "virtual Lolita", in reference to Vladimir Nabokov's novel. == Background == In 2013, the University of Deusto researchers published a paper on their work with Negobot and disclosed the text online. In their abstract, the researchers addressed the issue that an increasing number of children are using the internet and that these young users are more susceptible to existing internet risks. Their main objective was to create a chatterbot with the ability to trap online predators that posed a threat to children. They intended to deploy the bot into sites frequented by predators such as social networks and chatrooms. The university researchers used information provided by anti-pedophilia activist organization Perverted-Justice, including examples of online encounters and conversations with sexual predators, to supplement the program's artificial intelligence system. == Features == === Programmed persona === The chatterbot takes the guise of a naive and vulnerable 14-year-old girl. The bot's programmers used methods of artificial intelligence and natural language processing to create a conversational agent fluent in typical teenage slang, misspellings, and knowledge of pop culture. Through these linguistic features, the bot is able to mimic the conversational style of young teenagers. It also features split personalities and seven different patterns of conversation. Negobot's primary creator, Dr. Carlos Laorden, expressed the significance of the bot's distinguishable style of communication, stating that normally, "chatbots tend to be very predictable. Their behavior and interest in a conversation are flat, which is a problem when attempting to detect untrustworthy targets like paedophiles." What makes Negobot different is its game theory feature, which makes it able to "maintain a much more realistic conversation." Apart from being able to imitate a stereotypical teenager, the program is also able to translate messages into different languages. === Game theory === Negobot's designers programmed it with the ability to treat conversations with potential predators as if it were a game, the objective being to collect as much information on the suspect as possible that could provide evidence of pedophilic characteristics and motives. The use of game theory shapes the decisions the bot makes and the overall direction of the conversation. The bot initiates its undercover operations by entering a chat as a passive participant, waiting to be chatted by a user. Once a user elicits conversation, the bot will frame the conversation in such a way that keeps the target engaged, extracting personal information and discouraging it from leaving the chat. The information is then recorded to be potentially sent to the police. If the target seems to lose interest, the bot attempts to make it feel guilty by expressing sentiments of loneliness and emotional need through strategic, formulated responses, ultimately prolonging interaction. In addition, the bot may provide fake information about itself in attempt to lure the target into physical meetings. === Limitations === Despite being able to carry out a realistic conversation, Negobot is still unable to detect linguistic subtleties in the messages of others, including sarcasm. == Controversy == John Carr, a specialist in online child safety, expressed his concern to BBC over the legality of this undercover investigation. He claimed that using the bot on unsuspecting internet users could be considered a form of entrapment or harassment. The type of information that Negobot collects from potential online predators, he said, is unlikely to be upheld in court. Furthermore, he warned that relying on only software without any real-world policing risks enticing individuals to do or say things that they would not have if real-world policing were a factor.

Markov model

In probability theory, a Markov model is a stochastic model used to model pseudo-randomly changing systems. It is assumed that future states depend only on the current state, not on the events that occurred before it (that is, it assumes the Markov property). Generally, this assumption enables reasoning and computation with the model that would otherwise be intractable. For this reason, in the fields of predictive modelling and probabilistic forecasting, it is desirable for a given model to exhibit the Markov property. == Introduction == Andrey Andreyevich Markov (14 June 1856 – 20 July 1922) was a Russian mathematician best known for his work on stochastic processes. A primary subject of his research later became known as the Markov chain. There are four common Markov models used in different situations, depending on whether every sequential state is observable or not, and whether the system is to be adjusted on the basis of observations made: == Markov chain == The simplest Markov model is the Markov chain. It models the state of a system with a random variable that changes through time. In this context, the Markov property indicates that the distribution for this variable depends only on the distribution of a previous state. An example use of a Markov chain is Markov chain Monte Carlo, which uses the Markov property to prove that a particular method for performing a random walk will sample from the joint distribution. == Hidden Markov model == A hidden Markov model is a Markov chain for which the state is only partially observable or noisily observable. In other words, observations are related to the state of the system, but they are typically insufficient to precisely determine the state. Several well-known algorithms for hidden Markov models exist. For example, given a sequence of observations, the Viterbi algorithm will compute the most-likely corresponding sequence of states, the forward algorithm will compute the probability of the sequence of observations, and the Baum–Welch algorithm will estimate the starting probabilities, the transition function, and the observation function of a hidden Markov model. One common use is for speech recognition, where the observed data is the speech audio waveform and the hidden state is the spoken text. In this example, the Viterbi algorithm finds the most likely sequence of spoken words given the speech audio. == Markov decision process == A Markov decision process is a Markov chain in which state transitions depend on the current state and an action vector that is applied to the system. Typically, a Markov decision process is used to compute a policy of actions that will maximize some utility with respect to expected rewards. == Partially observable Markov decision process == A partially observable Markov decision process (POMDP) is a Markov decision process in which the state of the system is only partially observed. POMDPs are known to be NP complete, but recent approximation techniques have made them useful for a variety of applications, such as controlling simple agents or robots. == Markov random field == A Markov random field, or Markov network, may be considered to be a generalization of a Markov chain in multiple dimensions. In a Markov chain, state depends only on the previous state in time, whereas in a Markov random field, each state depends on its neighbors in any of multiple directions. A Markov random field may be visualized as a field or graph of random variables, where the distribution of each random variable depends on the neighboring variables with which it is connected. More specifically, the joint distribution for any random variable in the graph can be computed as the product of the "clique potentials" of all the cliques in the graph that contain that random variable. Modeling a problem as a Markov random field is useful because it implies that the joint distributions at each vertex in the graph may be computed in this manner. == Hierarchical Markov models == Hierarchical Markov models can be applied to categorize human behavior at various levels of abstraction. For example, a series of simple observations, such as a person's location in a room, can be interpreted to determine more complex information, such as in what task or activity the person is performing. Two kinds of Hierarchical Markov Models are the Hierarchical hidden Markov model and the Abstract Hidden Markov Model. Both have been used for behavior recognition and certain conditional independence properties between different levels of abstraction in the model allow for faster learning and inference. == Tolerant Markov model == A Tolerant Markov model (TMM) is a probabilistic-algorithmic Markov chain model. It assigns the probabilities according to a conditioning context that considers the last symbol, from the sequence to occur, as the most probable instead of the true occurring symbol. A TMM can model three different natures: substitutions, additions or deletions. Successful applications have been efficiently implemented in DNA sequences compression. == Markov-chain forecasting models == Markov-chains have been used as a forecasting methods for several topics, for example price trends, wind power and solar irradiance. The Markov-chain forecasting models utilize a variety of different settings, from discretizing the time-series to hidden Markov-models combined with wavelets and the Markov-chain mixture distribution model (MCM).

Occam learning

In computational learning theory, Occam learning is a model of algorithmic learning where the objective of the learner is to output a succinct representation of received training data. This is closely related to probably approximately correct (PAC) learning, where the learner is evaluated on its predictive power of a test set. Occam learnability implies PAC learning, and for a wide variety of concept classes, the converse is also true: PAC learnability implies Occam learnability. == Introduction == Occam Learning is named after Occam's razor, which is a principle stating that, given all other things being equal, a shorter explanation for observed data should be favored over a lengthier explanation. The theory of Occam learning is a formal and mathematical justification for this principle. It was first shown by Blumer, et al. that Occam learning implies PAC learning, which is the standard model of learning in computational learning theory. In other words, parsimony (of the output hypothesis) implies predictive power. == Definition of Occam learning == The succinctness of a concept c {\displaystyle c} in concept class C {\displaystyle {\mathcal {C}}} can be expressed by the length s i z e ( c ) {\displaystyle size(c)} of the shortest bit string that can represent c {\displaystyle c} in C {\displaystyle {\mathcal {C}}} . Occam learning connects the succinctness of a learning algorithm's output to its predictive power on unseen data. Let C {\displaystyle {\mathcal {C}}} and H {\displaystyle {\mathcal {H}}} be concept classes containing target concepts and hypotheses respectively. Then, for constants α ≥ 0 {\displaystyle \alpha \geq 0} and 0 ≤ β < 1 {\displaystyle 0\leq \beta <1} , a learning algorithm L {\displaystyle L} is an ( α , β ) {\displaystyle (\alpha ,\beta )} -Occam algorithm for C {\displaystyle {\mathcal {C}}} using H {\displaystyle {\mathcal {H}}} iff, given a set S = { x 1 , … , x m } {\displaystyle S=\{x_{1},\dots ,x_{m}\}} of m {\displaystyle m} samples labeled according to a concept c ∈ C {\displaystyle c\in {\mathcal {C}}} , L {\displaystyle L} outputs a hypothesis h ∈ H {\displaystyle h\in {\mathcal {H}}} such that h {\displaystyle h} is consistent with c {\displaystyle c} on S {\displaystyle S} (that is, h ( x ) = c ( x ) , ∀ x ∈ S {\displaystyle h(x)=c(x),\forall x\in S} ), and s i z e ( h ) ≤ ( n ⋅ s i z e ( c ) ) α m β {\displaystyle size(h)\leq (n\cdot size(c))^{\alpha }m^{\beta }} where n {\displaystyle n} is the maximum length of any sample x ∈ S {\displaystyle x\in S} . An Occam algorithm is called efficient if it runs in time polynomial in n {\displaystyle n} , m {\displaystyle m} , and s i z e ( c ) . {\displaystyle size(c).} We say a concept class C {\displaystyle {\mathcal {C}}} is Occam learnable with respect to a hypothesis class H {\displaystyle {\mathcal {H}}} if there exists an efficient Occam algorithm for C {\displaystyle {\mathcal {C}}} using H . {\displaystyle {\mathcal {H}}.} == The relation between Occam and PAC learning == Occam learnability implies PAC learnability, as the following theorem of Blumer, et al. shows: === Theorem (Occam learning implies PAC learning) === Let L {\displaystyle L} be an efficient ( α , β ) {\displaystyle (\alpha ,\beta )} -Occam algorithm for C {\displaystyle {\mathcal {C}}} using H {\displaystyle {\mathcal {H}}} . Then there exists a constant a > 0 {\displaystyle a>0} such that for any 0 < ϵ , δ < 1 {\displaystyle 0<\epsilon ,\delta <1} , for any distribution D {\displaystyle {\mathcal {D}}} , given m ≥ a ( 1 ϵ log ⁡ 1 δ + ( ( n ⋅ s i z e ( c ) ) α ϵ ) 1 1 − β ) {\displaystyle m\geq a\left({\frac {1}{\epsilon }}\log {\frac {1}{\delta }}+\left({\frac {(n\cdot size(c))^{\alpha }}{\epsilon }}\right)^{\frac {1}{1-\beta }}\right)} samples drawn from D {\displaystyle {\mathcal {D}}} and labelled according to a concept c ∈ C {\displaystyle c\in {\mathcal {C}}} of length n {\displaystyle n} bits each, the algorithm L {\displaystyle L} will output a hypothesis h ∈ H {\displaystyle h\in {\mathcal {H}}} such that e r r o r ( h ) ≤ ϵ {\displaystyle error(h)\leq \epsilon } with probability at least 1 − δ {\displaystyle 1-\delta } .Here, e r r o r ( h ) {\displaystyle error(h)} is with respect to the concept c {\displaystyle c} and distribution D {\displaystyle {\mathcal {D}}} . This implies that the algorithm L {\displaystyle L} is also a PAC learner for the concept class C {\displaystyle {\mathcal {C}}} using hypothesis class H {\displaystyle {\mathcal {H}}} . A slightly more general formulation is as follows: === Theorem (Occam learning implies PAC learning, cardinality version) === Let 0 < ϵ , δ < 1 {\displaystyle 0<\epsilon ,\delta <1} . Let L {\displaystyle L} be an algorithm such that, given m {\displaystyle m} samples drawn from a fixed but unknown distribution D {\displaystyle {\mathcal {D}}} and labeled according to a concept c ∈ C {\displaystyle c\in {\mathcal {C}}} of length n {\displaystyle n} bits each, outputs a hypothesis h ∈ H n , m {\displaystyle h\in {\mathcal {H}}_{n,m}} that is consistent with the labeled samples. Then, there exists a constant b {\displaystyle b} such that if log ⁡ | H n , m | ≤ b ϵ m − log ⁡ 1 δ {\displaystyle \log |{\mathcal {H}}_{n,m}|\leq b\epsilon m-\log {\frac {1}{\delta }}} , then L {\displaystyle L} is guaranteed to output a hypothesis h ∈ H n , m {\displaystyle h\in {\mathcal {H}}_{n,m}} such that e r r o r ( h ) ≤ ϵ {\displaystyle error(h)\leq \epsilon } with probability at least 1 − δ {\displaystyle 1-\delta } . While the above theorems show that Occam learning is sufficient for PAC learning, it doesn't say anything about necessity. Board and Pitt show that, for a wide variety of concept classes, Occam learning is in fact necessary for PAC learning. They proved that for any concept class that is polynomially closed under exception lists, PAC learnability implies the existence of an Occam algorithm for that concept class. Concept classes that are polynomially closed under exception lists include Boolean formulas, circuits, deterministic finite automata, decision-lists, decision-trees, and other geometrically defined concept classes. A concept class C {\displaystyle {\mathcal {C}}} is polynomially closed under exception lists if there exists a polynomial-time algorithm A {\displaystyle A} such that, when given the representation of a concept c ∈ C {\displaystyle c\in {\mathcal {C}}} and a finite list E {\displaystyle E} of exceptions, outputs a representation of a concept c ′ ∈ C {\displaystyle c'\in {\mathcal {C}}} such that the concepts c {\displaystyle c} and c ′ {\displaystyle c'} agree except on the set E {\displaystyle E} . == Proof that Occam learning implies PAC learning == We first prove the Cardinality version. Call a hypothesis h ∈ H {\displaystyle h\in {\mathcal {H}}} bad if e r r o r ( h ) ≥ ϵ {\displaystyle error(h)\geq \epsilon } , where again e r r o r ( h ) {\displaystyle error(h)} is with respect to the true concept c {\displaystyle c} and the underlying distribution D {\displaystyle {\mathcal {D}}} . The probability that a set of samples S {\displaystyle S} is consistent with h {\displaystyle h} is at most ( 1 − ϵ ) m {\displaystyle (1-\epsilon )^{m}} , by the independence of the samples. By the union bound, the probability that there exists a bad hypothesis in H n , m {\displaystyle {\mathcal {H}}_{n,m}} is at most | H n , m | ( 1 − ϵ ) m {\displaystyle |{\mathcal {H}}_{n,m}|(1-\epsilon )^{m}} , which is less than δ {\displaystyle \delta } if log ⁡ | H n , m | ≤ O ( ϵ m ) − log ⁡ 1 δ {\displaystyle \log |{\mathcal {H}}_{n,m}|\leq O(\epsilon m)-\log {\frac {1}{\delta }}} . This concludes the proof of the second theorem above. Using the second theorem, we can prove the first theorem. Since we have a ( α , β ) {\displaystyle (\alpha ,\beta )} -Occam algorithm, this means that any hypothesis output by L {\displaystyle L} can be represented by at most ( n ⋅ s i z e ( c ) ) α m β {\displaystyle (n\cdot size(c))^{\alpha }m^{\beta }} bits, and thus log ⁡ | H n , m | ≤ ( n ⋅ s i z e ( c ) ) α m β {\displaystyle \log |{\mathcal {H}}_{n,m}|\leq (n\cdot size(c))^{\alpha }m^{\beta }} . This is less than O ( ϵ m ) − log ⁡ 1 δ {\displaystyle O(\epsilon m)-\log {\frac {1}{\delta }}} if we set m ≥ a ( 1 ϵ log ⁡ 1 δ + ( ( n ⋅ s i z e ( c ) ) α ) ϵ ) 1 1 − β ) {\displaystyle m\geq a\left({\frac {1}{\epsilon }}\log {\frac {1}{\delta }}+\left({\frac {(n\cdot size(c))^{\alpha })}{\epsilon }}\right)^{\frac {1}{1-\beta }}\right)} for some constant a > 0 {\displaystyle a>0} . Thus, by the Cardinality version Theorem, L {\displaystyle L} will output a consistent hypothesis h {\displaystyle h} with probability at least 1 − δ {\displaystyle 1-\delta } . This concludes the proof of the first theorem above. == Improving sample complexity for common problems == Though Occam and PAC learnability are equivalent, the Occam framework can be used to produce tighter bounds on the sample complexity of classical problems including conjunctions, co

Calibration (statistics)

There are two main uses of the term calibration in statistics that denote special types of statistical inference problems. Calibration can mean a reverse process to regression, where instead of a future dependent variable being predicted from known explanatory variables, a known observation of the dependent variables is used to predict a corresponding explanatory variable; procedures in statistical classification to determine class membership probabilities which assess the uncertainty of a given new observation belonging to each of the already established classes. In addition, calibration is used in statistics with the usual general meaning of calibration. For example, model calibration can be also used to refer to Bayesian inference about the value of a model's parameters, given some data set, or more generally to any type of fitting of a statistical model. As Philip Dawid puts it, "a forecaster is well calibrated if, for example, of those events to which he assigns a probability 30 percent, the long-run proportion that actually occurs turns out to be 30 percent." == In classification == Calibration in classification means transforming classifier scores into class membership probabilities. An overview of calibration methods for two-class and multi-class classification tasks is given by Gebel (2009). A classifier might separate the classes well, but be poorly calibrated, meaning that the estimated class probabilities are far from the true class probabilities. In this case, a calibration step may help improve the estimated probabilities. A variety of metrics exist that are aimed to measure the extent to which a classifier produces well-calibrated probabilities. Foundational work includes the Expected Calibration Error (ECE). Into the 2020s, variants include the Adaptive Calibration Error (ACE) and the Test-based Calibration Error (TCE), which address limitations of the ECE metric that may arise when classifier scores concentrate on narrow subset of the [0,1] range. A 2020s advancement in calibration assessment is the introduction of the Estimated Calibration Index (ECI). The ECI extends the concepts of the Expected Calibration Error (ECE) to provide a more nuanced measure of a model's calibration, particularly addressing overconfidence and underconfidence tendencies. Originally formulated for binary settings, the ECI has been adapted for multiclass settings, offering both local and global insights into model calibration. This framework aims to overcome some of the theoretical and interpretative limitations of existing calibration metrics. Through a series of experiments, Famiglini et al. demonstrate the framework's effectiveness in delivering a more accurate understanding of model calibration levels and discuss strategies for mitigating biases in calibration assessment. An online tool has been proposed to compute both ECE and ECI. The following univariate calibration methods exist for transforming classifier scores into class membership probabilities in the two-class case: Assignment value approach, see Garczarek (2002) Bayes approach, see Bennett (2002) Isotonic regression, see Zadrozny and Elkan (2002) Platt scaling (a form of logistic regression), see Lewis and Gale (1994) and Platt (1999) Bayesian Binning into Quantiles (BBQ) calibration, see Naeini, Cooper, Hauskrecht (2015) Beta calibration, see Kull, Filho, Flach (2017) === In probability prediction and forecasting === In prediction and forecasting, a Brier score is sometimes used to assess prediction accuracy of a set of predictions, specifically that the magnitude of the assigned probabilities track the relative frequency of the observed outcomes. Philip E. Tetlock employs the term "calibration" in this sense in his 2015 book Superforecasting. This differs from accuracy and precision. For example, as expressed by Daniel Kahneman, "if you give all events that happen a probability of .6 and all the events that don't happen a probability of .4, your discrimination is perfect but your calibration is miserable". In meteorology, in particular, as concerns weather forecasting, a related mode of assessment is known as forecast skill. == In regression == The calibration problem in regression is the use of known data on the observed relationship between a dependent variable and an independent variable to make estimates of other values of the independent variable from new observations of the dependent variable. This can be known as "inverse regression"; there is also sliced inverse regression. The following multivariate calibration methods exist for transforming classifier scores into class membership probabilities in the case with classes count greater than two: Reduction to binary tasks and subsequent pairwise coupling, see Hastie and Tibshirani (1998) Dirichlet calibration, see Gebel (2009) === Example === One example is that of dating objects, using observable evidence such as tree rings for dendrochronology or carbon-14 for radiometric dating. The observation is caused by the age of the object being dated, rather than the reverse, and the aim is to use the method for estimating dates based on new observations. The problem is whether the model used for relating known ages with observations should aim to minimise the error in the observation, or minimise the error in the date. The two approaches will produce different results, and the difference will increase if the model is then used for extrapolation at some distance from the known results.

Princh

Princh is a Danish software company, which is headquartered in Aarhus, Denmark. Founded in 2015, Princh develops cloud printing and electronic payment products. The company is headquartered in the city of Aarhus. While utilizing a smartphone or web app, users can locate a nearby printer to their current location, get directions to access said printer, and/or authorize a print and pay for the print job in question. The product is available as a native mobile apps for Android and iOS, as well as on web and desktop products for businesses and libraries. The app connects a network of printer owners and users around the world. Princh supports an array of printable files. == History == The company was founded in 2015. The company is currently based in the southern part of Aarhus. The Princh printing service was officially launched on June 23, 2015. Currently, Princh is available as a service in a multitude of locations such as print shops, libraries, hotels, or universities. Princh is a popular printing and payment product among libraries and can among other places be found in Denmark, Sweden, Norway, Germany, United Kingdom, United States, and Canada. == How it works == With the Princh app, users will be able to locate their nearest printer. Once the user is at the printer, the user chooses the document to be printed out and shares it with the Princh app. The user then selects the desired nearby printer entering the printer ID number or scanning the QR-code located on top of the printer, pays electronically and the print job is processed by the printer. Printer owners get access to a personal control panel where they can set printing prices and monitor all Princh activity for their business. == Notes and references ==

Information gain (decision tree)

In the context of decision trees in information theory and machine learning, information gain refers to the conditional expected value of the Kullback–Leibler divergence of the univariate probability distribution of one variable from the conditional distribution of this variable given the other one. (In broader contexts, information gain can also be used as a synonym for either Kullback–Leibler divergence or mutual information, but the focus of this article is on the more narrow meaning below.) Explicitly, the information gain of a random variable X {\displaystyle X} obtained from an observation of a random variable A {\displaystyle A} taking value a {\displaystyle a} is defined as: I G ( X , a ) = D KL ( P X ∣ a ∥ P X ) {\displaystyle {\mathit {IG}}(X,a)=D_{\text{KL}}{\bigl (}P_{X\mid a}\parallel P_{X}{\bigr )}} In other words, it is the Kullback–Leibler divergence of P X ( x ) {\displaystyle P_{X}(x)} (the prior distribution for X {\displaystyle X} ) from P X ∣ a ( x ) {\displaystyle P_{X\mid a}(x)} (the posterior distribution for X {\displaystyle X} given A = a {\displaystyle A=a} ). The expected value of the information gain is the mutual information I ( X ; A ) {\displaystyle I(X;A)} : E A ⁡ [ I G ( X , A ) ] = I ( X ; A ) {\displaystyle \operatorname {E} _{A}[{\mathit {IG}}(X,A)]=I(X;A)} i.e. the reduction in the entropy of X {\displaystyle X} achieved by learning the state of the random variable A {\displaystyle A} . In machine learning, this concept can be used to define a preferred sequence of attributes to investigate to most rapidly narrow down the state of X. Such a sequence (which depends on the outcome of the investigation of previous attributes at each stage) is called a decision tree, and when applied in the area of machine learning is known as decision tree learning. Usually an attribute with high mutual information should be preferred to other attributes. == General definition == In general terms, the expected information gain is the reduction in information entropy Η from a prior state to a state that takes some information as given: I G ( T , a ) = H ( T ) − H ( T | a ) , {\displaystyle IG(T,a)=\mathrm {H} {(T)}-\mathrm {H} {(T|a)},} where H ( T | a ) {\displaystyle \mathrm {H} {(T|a)}} is the conditional entropy of T {\displaystyle T} given the value of attribute a {\displaystyle a} . This is intuitively plausible when interpreting entropy Η as a measure of uncertainty of a random variable T {\displaystyle T} : by learning (or assuming) a {\displaystyle a} about T {\displaystyle T} , our uncertainty about T {\displaystyle T} is reduced (i.e. I G ( T , a ) {\displaystyle IG(T,a)} is positive), unless of course T {\displaystyle T} is independent of a {\displaystyle a} , in which case H ( T | a ) = H ( T ) {\displaystyle \mathrm {H} (T|a)=\mathrm {H} (T)} , meaning I G ( T , a ) = 0 {\displaystyle IG(T,a)=0} . == Formal definition == Let T denote a set of training examples, each of the form ( x , y ) = ( x 1 , x 2 , x 3 , . . . , x k , y ) {\displaystyle ({\textbf {x}},y)=(x_{1},x_{2},x_{3},...,x_{k},y)} where x a ∈ v a l s ( a ) {\displaystyle x_{a}\in \mathrm {vals} (a)} is the value of the a th {\displaystyle a^{\text{th}}} attribute or feature of example x {\displaystyle {\textbf {x}}} and y is the corresponding class label. The information gain for an attribute a is defined in terms of Shannon entropy H ( − ) {\displaystyle \mathrm {H} (-)} as follows. For a value v taken by attribute a, let S a ( v ) = { x ∈ T | x a = v } {\displaystyle S_{a}{(v)}=\{{\textbf {x}}\in T|x_{a}=v\}} be defined as the set of training inputs of T for which attribute a is equal to v. Then the information gain of T for attribute a is the difference between the a priori Shannon entropy H ( T ) {\displaystyle \mathrm {H} (T)} of the training set and the conditional entropy H ( T | a ) {\displaystyle \mathrm {H} {(T|a)}} . H ( T | a ) = ∑ v ∈ v a l s ( a ) | S a ( v ) | | T | ⋅ H ( S a ( v ) ) . {\displaystyle \mathrm {H} (T|a)=\sum _{v\in \mathrm {vals} (a)}{{\frac {|S_{a}{(v)}|}{|T|}}\cdot \mathrm {H} \left(S_{a}{\left(v\right)}\right)}.} I G ( T , a ) = H ( T ) − H ( T | a ) {\displaystyle IG(T,a)=\mathrm {H} (T)-\mathrm {H} (T|a)} The mutual information is equal to the total entropy for an attribute if for each of the attribute values a unique classification can be made for the result attribute. In this case, the relative entropies subtracted from the total entropy are 0. In particular, the values v ∈ v a l s ( a ) {\displaystyle v\in vals(a)} defines a partition of the training set data T into mutually exclusive and all-inclusive subsets, inducing a categorical probability distribution P a ( v ) {\textstyle P_{a}{(v)}} on the values v ∈ v a l s ( a ) {\textstyle v\in vals(a)} of attribute a. The distribution is given P a ( v ) := | S a ( v ) | | T | {\textstyle P_{a}{(v)}:={\frac {|S_{a}{(v)}|}{|T|}}} . In this representation, the information gain of T given a can be defined as the difference between the unconditional Shannon entropy of T and the expected entropy of T conditioned on a, where the expectation value is taken with respect to the induced distribution on the values of a. I G ( T , a ) = H ( T ) − ∑ v ∈ v a l s ( a ) P a ( v ) H ( S a ( v ) ) = H ( T ) − E P a [ H ( S a ( v ) ) ] = H ( T ) − H ( T | a ) . {\displaystyle {\begin{alignedat}{2}IG(T,a)&=\mathrm {H} (T)-\sum _{v\in \mathrm {vals} (a)}{P_{a}{(v)}\mathrm {H} \left(S_{a}{(v)}\right)}\\&=\mathrm {H} (T)-\mathbb {E} _{P_{a}}{\left[\mathrm {H} {(S_{a}{(v)})}\right]}\\&=\mathrm {H} (T)-\mathrm {H} {(T|a)}.\end{alignedat}}} == Example == In engineering applications, information is analogous to signal, and entropy is analogous to noise. It determines how a decision tree chooses to split data. The leftmost figure below is very impure and has high entropy corresponding to higher disorder and lower information value. As we go to the right, the entropy decreases, and the information value increases. Now, it is clear that information gain is the measure of how much information a feature provides about a class. Let's visualize information gain in a decision tree as shown in the right: The node t is the parent node, and the sub-nodes tL and tR are child nodes. In this case, the parent node t has a collection of cancer and non-cancer samples denoted as C and NC respectively. We can use information gain to determine how good the splitting of nodes is in a decision tree. In terms of entropy, information gain is defined as: To understand this idea, let's start by an example in which we create a simple dataset and want to see if gene mutations could be related to patients with cancer. Given four different gene mutations, as well as seven samples, the training set for a decision can be created as follows: In this dataset, a 1 means the sample has the mutation (True), while a 0 means the sample does not (False). A sample with C denotes that it has been confirmed to be cancerous, while NC means it is non-cancerous. Using this data, a decision tree can be created with information gain used to determine the candidate splits for each node. For the next step, the entropy at parent node t of the above simple decision tree is computed as:H(t) = −[pC,t log2(pC,t) + pNC,t log2(pNC,t)] where, probability of selecting a class ‘C’ sample at node t, pC,t = n(t, C) / n(t), probability of selecting a class ‘NC’ sample at node t, pNC,t = n(t, NC) / n(t), n(t), n(t, C), and n(t, NC) are the number of total samples, ‘C’ samples and ‘NC’ samples at node t respectively.Using this with the example training set, the process for finding information gain beginning with H ( t ) {\displaystyle \mathrm {H} {(t)}} for Mutation 1 is as follows: pC, t = 4/7 pNC, t = 3/7 H ( t ) {\displaystyle \mathrm {H} {(t)}} = −(4/7 × log2(4/7) + 3/7 × log2(3/7)) = 0.985 Note: H ( t ) {\displaystyle \mathrm {H} {(t)}} will be the same for all mutations at the root. The relatively high value of entropy H ( t ) = 0.985 {\displaystyle \mathrm {H} {(t)}=0.985} (1 is the optimal value) suggests that the root node is highly impure and the constituents of the input at the root node would look like the leftmost figure in the above Entropy Diagram. However, such a set of data is good for learning the attributes of the mutations used to split the node. At a certain node, when the homogeneity of the constituents of the input occurs (as shown in the rightmost figure in the above Entropy Diagram), the dataset would no longer be good for learning. Moving on, the entropy at left and right child nodes of the above decision tree is computed using the formulae:H(tL) = −[pC,L log2(pC,L) + pNC,L log2(pNC,L)]H(tR) = −[pC,R log2(pC,R) + pNC,R log2(pNC,R)]where, probability of selecting a class ‘C’ sample at the left child node, pC,L = n(tL, C) / n(tL), probability of selecting a class ‘NC’ sample at the left child node, pNC,L = n(tL, NC) / n(tL), probability of selecting a class ‘C’ sample at the right child node, pC,R = n(tR, C) / n(tR), prob