AI Analytics Certification


  • Focus recovery based on the linear canonical transform


    For digital image processing, focus recovery from a defocused image is an ill-posed problem because defocus removes the high-frequency components. Most focus recovery methods are based on depth estimation theory. The linear canonical transform (LCT) provides a scalable kernel that fits many well-known optical effects. Approximating an imaging system with LCTs and inverting that system theoretically permits recovery of a defocused image.

    == Depth of field and perceptual focus ==

    In photography, depth of field (DOF) is the range of distances over which objects appear acceptably sharp. It is usually used to stress an object and de-emphasize the background (and/or the foreground). The most important quantity related to DOF is the lens aperture: decreasing the aperture diameter increases the DOF but lowers resolution, and vice versa.

    == The Huygens–Fresnel principle and DOF ==

    The Huygens–Fresnel principle describes the diffraction of a wave propagating between two fields. It belongs to Fourier optics rather than geometric optics. The amount of diffraction depends on two parameters: the size of the aperture and the distance between the fields.

    Consider a source field and a destination field, field 1 and field 0, respectively. $P_1(x_1, y_1)$ is a position in the source field and $P_0(x_0, y_0)$ is a position in the destination field. The Huygens–Fresnel principle gives the diffraction formula relating the two fields $\mathbf{U}(x_0, y_0)$ and $\mathbf{U}(x_1, y_1)$ as follows:

    $$\mathbf{U}(x_0,y_0) = \frac{1}{j\lambda} \int\!\!\int \mathbf{U}(x_1,y_1) \frac{e^{jkr_{01}}}{r_{01}} \cos\theta \, dx_1 \, dy_1$$

    where $\theta$ denotes the angle between $r_{01}$ and $z$. Replacing $\cos\theta$ by $\frac{z}{r_{01}}$ and $r_{01}$ by $[(x_0-x_1)^2+(y_0-y_1)^2+z^2]^{1/2}$, we get

    $$\mathbf{U}(x_0,y_0) = \frac{1}{j\lambda z} \int\!\!\int \mathbf{U}(x_1,y_1) \frac{\exp\left(jkz\left[1+\left(\frac{x_0-x_1}{z}\right)^2+\left(\frac{y_0-y_1}{z}\right)^2\right]^{1/2}\right)}{1+\left(\frac{x_0-x_1}{z}\right)^2+\left(\frac{y_0-y_1}{z}\right)^2} \, dx_1 \, dy_1$$

    A larger distance $z$ or a smaller aperture $(x_1, y_1)$ causes greater diffraction, yet a larger DOF can lead to a more effectively focused wave distribution. This seems to be a conflict. The two notions can be distinguished as follows:

    Diffraction: In a real imaging environment, the depths of objects relative to the aperture are usually not large enough to cause serious diffraction. However, a sufficiently large object depth can truly blur the image.

    Effective focus: A small aperture gives a small blurring radius but passes little wave information, so details are lost compared with a large aperture.

    In conclusion, diffraction explains a micro behavior whereas DOF shows a macro behavior. Both are related to aperture size.

    == Linear canonical transform ==

    As the word "canonical" suggests, the linear canonical transform (LCT) is a scalable transform that connects to many important kernels such as the Fresnel transform, the Fraunhofer transform and the fractional Fourier transform. It is controlled by four parameters, a, b, c, d (3 degrees of freedom, since $ad - bc = 1$).

    The definition:

    $$L_M(f(u)) = \int L_M(u,u') f(u') \, du'$$

    where

    $$L_M(u,u') = \begin{cases} \sqrt{\frac{1}{b}}\, e^{-j\pi/4} \exp\left[ j\pi\left( \frac{d}{b}u^2 - 2\frac{1}{b}uu' + \frac{a}{b}u'^2 \right) \right], & \text{if } b \neq 0 \\ \sqrt{d}\, e^{\frac{j}{2}cdu^2}\, \delta(u'-du), & \text{if } b = 0 \end{cases}$$

    Consider a general imaging system with object distance $z_0$, a thin lens of focal length $f$ and imaging distance $z_1$. Propagation through free space acts approximately as a chirp convolution, that is, the diffraction formula, while propagation through the thin lens acts as a chirp multiplication. All parameters are simplified by the paraxial approximation for free-space propagation, and the aperture size is not considered. From the properties of the LCT, the four parameters for this optical system are:

    $$\begin{bmatrix} 1-\frac{z_1}{f} & \lambda z_0 - \frac{\lambda z_0 z_1}{f} + \lambda z_1 \\ -\frac{1}{\lambda f} & 1-\frac{z_0}{f} \end{bmatrix}$$

    Once the values of $z_1$, $z_0$ and $f$ are known, the LCT can simulate any optical system of this form.
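As a rough check of the parameter matrix for the imaging system, the sketch below composes the three stages (free space, thin lens, free space) as 2x2 matrices and compares the product with the closed form. The individual stage matrices and all function names are assumptions chosen for illustration; they are the usual paraxial forms in the wavelength-scaled convention that the closed-form matrix appears to use, not something the excerpt states.

```python
import numpy as np

def free_space(z, wavelength):
    """Assumed LCT matrix for paraxial free-space propagation over distance z."""
    return np.array([[1.0, wavelength * z],
                     [0.0, 1.0]])

def thin_lens(f, wavelength):
    """Assumed LCT matrix for a thin lens of focal length f (chirp multiplication)."""
    return np.array([[1.0, 0.0],
                     [-1.0 / (wavelength * f), 1.0]])

def imaging_system(z0, f, z1, wavelength):
    """Compose the stages; the stage applied first stands rightmost in the product."""
    return free_space(z1, wavelength) @ thin_lens(f, wavelength) @ free_space(z0, wavelength)

def closed_form(z0, f, z1, wavelength):
    """The four parameters a, b, c, d as given in the matrix above."""
    lam = wavelength
    return np.array([
        [1.0 - z1 / f, lam * z0 - lam * z0 * z1 / f + lam * z1],
        [-1.0 / (lam * f), 1.0 - z0 / f],
    ])
```

For any choice of distances the product has determinant 1, which is why only three of the four parameters are free. When the thin-lens equation 1/z0 + 1/z1 = 1/f holds, the entry b vanishes, so the system falls into the b = 0 branch of the kernel definition (a scaled, chirp-multiplied copy of the input), which corresponds to the in-focus case.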

  • Population model (evolutionary algorithm)


    The population model of an evolutionary algorithm (EA) describes the structural properties of its population, to which its members are subject. A population is the set of all proposed solutions of an EA considered in one iteration; these solutions are also called individuals, following the biological role model. The individuals of a population can generate further individuals as offspring with the help of the genetic operators of the procedure.

    The simplest and most widely used population model in EAs is the global or panmictic model, which corresponds to an unstructured population. It allows each individual to choose any other individual of the population as a partner for producing offspring by crossover; the details of the selection are irrelevant as long as the fitness of the individuals plays a significant role. Due to global mate selection, the genetic information of even slightly better individuals can prevail in a population after a few generations (iterations of an EA), provided that no better offspring have emerged in this phase. If the solution found in this way is not the optimum sought, this is called premature convergence. This effect is observed more often in panmictic populations.

    In nature, global mating pools are rarely found. What prevails is a certain, limited isolation due to spatial distance. The resulting local neighbourhoods initially evolve independently, and mutants have a higher chance of persisting over several generations. As a result, genotypic diversity in the gene pool is preserved longer than in a panmictic population.

    It is therefore natural to divide the previously global population into substructures. Two basic models were introduced for this purpose: the island models, which divide the population into fixed subpopulations that exchange individuals from time to time, and the neighbourhood models, which assign individuals to overlapping neighbourhoods and are also known as cellular genetic or evolutionary algorithms (cGA or cEA). The associated division of the population also suggests a corresponding parallelization of the procedure. For this reason, population models are also frequently discussed in the literature in connection with the parallelization of EAs.

    == Island models ==

    In the island model, also called the migration model or coarse-grained model, evolution takes place in strictly divided subpopulations. These can be organised panmictically, but do not have to be. From time to time an exchange of individuals takes place, which is called migration. The time between two exchanges is called an epoch, and its end can be triggered by various criteria, e.g. after a given time or a given number of completed generations, or after the occurrence of stagnation. Stagnation can be detected, for example, by the fact that no fitness improvement has occurred in the island for a given number of generations. Island models introduce a variety of new strategy parameters:

    * Number of subpopulations
    * Size of the subpopulations
    * Neighbourhood relations between islands: they determine which islands are considered neighbouring and can thus exchange individuals, see picture of a simple unidirectional ring (black arrows) and its extension by additional bidirectional neighbourhood relations (additional green arrows)
    * Criteria for the termination of an epoch, synchronous or asynchronous migration
    * Migration rate: number or proportion of individuals involved in migration
    * Migrant selection: there are many alternatives for this, e.g. the best individuals can replace the worst or randomly selected ones. Depending on the migration rate, this can affect one or more individuals at a time.

    With these parameters, the selection pressure can be influenced to a considerable extent. For example, it increases with the interconnectedness of the islands and decreases with the number of subpopulations or the epoch length.

    == Neighbourhood models or cellular evolutionary algorithms ==

    The neighbourhood model, also called the diffusion model or fine-grained model, defines a topological neighbourhood relation between the individuals of a population that is independent of their phenotypic properties. The fundamental idea of this model is to provide the EA population with a special structure defined as a connected graph, in which each vertex is an individual that communicates with its nearest neighbours. In particular, individuals are conceptually arranged in a toroidal mesh and are only allowed to recombine with nearby individuals. This leads to a kind of locality known as isolation by distance. The set of potential mates of an individual is called its neighbourhood or deme. The adjacent figure illustrates this by showing two slightly overlapping neighbourhoods of two individuals marked yellow, through which genetic information can spread between the two demes. It is known that in this kind of algorithm, similar individuals tend to cluster and create niches that are independent of the deme boundaries and, in particular, can be larger than a deme. There is no clear borderline between adjacent groups, and close niches can easily be colonized by competitive ones, possibly merging solution contents during this process. At the same time, more distant niches are affected more slowly. EAs with this type of population are also well known as cellular EAs (cEA) or cellular genetic algorithms (cGA).

    A commonly used structure for arranging the individuals of a population is a 2D toroidal grid, although the number of dimensions can easily be extended (to 3D) or reduced (to 1D, e.g. a ring, see the figure on the right). The neighbourhood of a particular individual in the grid is defined in terms of the Manhattan distance from it to others in the population. In the basic algorithm, all neighbourhoods have the same size and identical shapes. The two most commonly used neighbourhoods for two-dimensional cEAs are L5 and C9, see the figure on the left. Here, L stands for Linear while C stands for Compact.

    Each deme represents a panmictic subpopulation within which mate selection and the acceptance of offspring (by replacing the parent) take place. The rules for the acceptance of offspring are local in nature and based on the neighbourhood: for example, it can be specified that the best offspring must be better than the parent being replaced or, less strictly, only better than the worst individual in the deme. The first rule is elitist and creates a higher selective pressure than the second, non-elitist rule. In elitist EAs, the best individual of a population always survives. In this respect, they deviate from the biological model.

    The overlap of the neighbourhoods causes a mostly slow spread of genetic information across the neighbourhood boundaries, hence the name diffusion model. A better offspring now needs more generations than in panmixy to spread through the population. This promotes the emergence of local niches and their local evolution, thus preserving genotypic diversity over a longer period of time. The result is a better and dynamic balance between breadth and depth search, adapted to the search space during a run. Depth search takes place in the niches, and breadth search at the niche boundaries and through the evolution of the different niches of the whole population.

    For the same neighbourhood size, the spread of genetic information is larger for elongated figures like L9 than for a block like C9, and again significantly larger than for a ring. This means that ring neighbourhoods are well suited for achieving high-quality results, even if this requires comparatively long run times. On the other hand, if one is primarily interested in fast and good, but possibly suboptimal results, 2D topologies are more suitable.

    == Comparison ==

    When both population models are applied to genetic algorithms, evolution strategies and other EAs, splitting a total population into subpopulations usually reduces the risk of premature convergence and leads to better results overall, more reliably and faster than would be expected with panmictic EAs.

    Island models have the disadvantage compared to neighbourhood models that they introduce a large number of new strategy parameters. Despite the existing studies on this topic in the literature, a certain risk of unfavourable settings remains for the user. With neighbourhood models, on the other hand, only the size of the neighbourhood has to be specified and, in the case of the two-dimensional model, the choice of the neighbourhood figure is added.

    == Parallelism ==

    Since both population models imply population partitioning, they are well suited as a basis for parallelizing an EA. This applies even more to cellular EAs, since they rely only on locally available information about the members of their respective demes. Thus, in the extreme case, an independent execution thread can be assigned to each individual, so that the entire cEA can run on a parallel hardware platform. The island model also supports p
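The two population structures described in this article can be illustrated with a small sketch. It assumes the commonly used shapes for L5 (the cell plus its four neighbours at Manhattan distance 1) and C9 (the 3x3 block around the cell) on a toroidal grid, and one possible migration policy for a unidirectional ring of islands (the best individuals of one island replace the worst of the next). The function names and the policy are illustrative choices, not taken from the article.

```python
def l5(row, col, rows, cols):
    """L5 deme: the cell itself plus the cells at Manhattan distance 1, with wrap-around."""
    offsets = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]
    return [((row + dr) % rows, (col + dc) % cols) for dr, dc in offsets]

def c9(row, col, rows, cols):
    """C9 deme: the compact 3x3 block centred on the cell, with wrap-around."""
    return [((row + dr) % rows, (col + dc) % cols)
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)]

def migrate_ring(islands, fitness, rate=1):
    """Unidirectional ring migration (assumed policy: best replace worst).

    Copies of the `rate` fittest individuals of island i-1 replace the
    `rate` least fit individuals of island i; island sizes stay constant.
    """
    emigrants = [sorted(island, key=fitness, reverse=True)[:rate] for island in islands]
    result = []
    for i, island in enumerate(islands):
        incoming = emigrants[(i - 1) % len(islands)]
        survivors = sorted(island, key=fitness, reverse=True)[:len(island) - len(incoming)]
        result.append(survivors + incoming)
    return result
```

Because the grid wraps around, a cell in a corner such as (0, 0) still has a full deme, and the demes of adjacent cells overlap, which is the channel through which genetic information diffuses. In the ring sketch, the migration rate and the choice of migrants are exactly the kind of strategy parameters listed for island models.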

  • Evolutionary algorithm


    Evolutionary algorithms (EA) reproduce essential elements of biological evolution in a computer algorithm in order to solve "difficult" problems, at least approximately, for which no exact or satisfactory solution methods are known. They are population-based, bio-inspired metaheuristics and belong to evolutionary computation, which itself is part of the field of computational intelligence. The mechanisms of biological evolution that an EA mainly imitates are reproduction, mutation, recombination and selection. Candidate solutions to the optimization problem play the role of individuals in a population, and the fitness function determines the quality of the solutions (see also loss function). Evolution of the population then takes place through the repeated application of the above operators.

    Evolutionary algorithms often perform well at approximating solutions to all types of problems because they ideally do not make any assumption about the underlying fitness landscape. Techniques from evolutionary algorithms applied to the modeling of biological evolution are generally limited to explorations of microevolution (microevolutionary processes) and planning models based upon cellular processes. In most real applications of EAs, computational complexity is a prohibiting factor. This computational complexity is due to fitness function evaluation, and fitness approximation is one way to overcome this difficulty. However, seemingly simple EAs can often solve complex problems; therefore, there may be no direct link between algorithm complexity and problem complexity.

    == Generic definition ==

    The following is an example of a generic evolutionary algorithm:

    1. Randomly generate the initial population of individuals, the first generation.
    2. Evaluate the fitness of each individual in the population.
    3. Check whether the goal is reached and the algorithm can be terminated.
    4. Select individuals as parents, preferably of higher fitness.
    5. Produce offspring with optional crossover (mimicking reproduction).
    6. Apply mutation operations on the offspring.
    7. Select individuals, preferably of lower fitness, for replacement with new individuals (mimicking natural selection).
    8. Return to 2.

    == Types ==

    Similar techniques differ in genetic representation and other implementation details, and in the nature of the particular applied problem.

    * Genetic algorithm – This is the most popular type of EA. One seeks the solution of a problem in the form of strings of numbers (traditionally binary, although the best representations are usually those that reflect something about the problem being solved), by applying operators such as recombination and mutation (sometimes one, sometimes both). This type of EA is often used in optimization problems.
    * Genetic programming – Here the solutions are in the form of computer programs, and their fitness is determined by their ability to solve a computational problem. There are many variants of genetic programming: Cartesian genetic programming, gene expression programming, grammatical evolution, linear genetic programming and multi expression programming.
    * Evolutionary programming – Similar to evolution strategy, but with a deterministic selection of all parents.
    * Evolution strategy (ES) – Works with vectors of real numbers as representations of solutions, and typically uses self-adaptive mutation rates. The method is mainly used for numerical optimization, although there are also variants for combinatorial tasks. Variants include CMA-ES and the natural evolution strategy.
    * Differential evolution – Based on vector differences and is therefore primarily suited for numerical optimization problems.
    * Coevolutionary algorithm – Similar to genetic algorithms and evolution strategies, but the created solutions are compared on the basis of their outcomes from interactions with other solutions. Solutions can either compete or cooperate during the search process. Coevolutionary algorithms are often used in scenarios where the fitness landscape is dynamic, complex, or involves competitive interactions.
    * Neuroevolution – Similar to genetic programming, but the genomes represent artificial neural networks by describing structure and connection weights. The genome encoding can be direct or indirect.
    * Learning classifier system – Here the solution is a set of classifiers (rules or conditions). A Michigan-LCS evolves at the level of individual classifiers whereas a Pittsburgh-LCS uses populations of classifier sets. Initially, classifiers were only binary, but they now include real, neural net, or S-expression types. Fitness is typically determined with either a strength-based or accuracy-based reinforcement learning or supervised learning approach.
    * Quality–Diversity algorithms – QD algorithms simultaneously aim for high-quality and diverse solutions. Unlike traditional optimization algorithms that focus solely on finding the best solution to a problem, QD algorithms explore a wide variety of solutions across a problem space and keep those that are not just high performing, but also diverse and unique.

    == Theoretical background ==

    The following theoretical principles apply to all or almost all EAs.

    === No free lunch theorem ===

    The no free lunch theorem of optimization states that all optimization strategies are equally effective when the set of all optimization problems is considered. Under the same condition, no evolutionary algorithm is fundamentally better than another. One algorithm can only be better than another if the set of all problems is restricted, which is exactly what is inevitably done in practice. Therefore, to improve an EA, it must exploit problem knowledge in some form (e.g. by choosing a certain mutation strength or a problem-adapted coding). Thus, if two EAs are compared, this constraint is implied. In addition, an EA can use problem-specific knowledge by, for example, not randomly generating the entire start population, but creating some individuals through heuristics or other procedures. Another possibility to tailor an EA to a given problem domain is to involve suitable heuristics, local search procedures or other problem-related procedures in the process of generating the offspring. This form of extension of an EA is also known as a memetic algorithm. Both extensions play a major role in practical applications, as they can speed up the search process and make it more robust.

    === Convergence ===

    For EAs in which, in addition to the offspring, at least the best individual of the parent generation is used to form the subsequent generation (so-called elitist EAs), there is a general proof of convergence under the condition that an optimum exists. Without loss of generality, a maximum search is assumed for the proof:

    From the property of elitist offspring acceptance and the existence of the optimum it follows that per generation $k$ an improvement of the fitness $F$ of the respective best individual $x'$ will occur with a probability $P > 0$. Thus:

    $$F(x'_1) \leq F(x'_2) \leq F(x'_3) \leq \cdots \leq F(x'_k) \leq \cdots$$

    I.e., the fitness values represent a monotonically non-decreasing sequence, which is bounded due to the existence of the optimum. From this follows the convergence of the sequence towards the optimum.

    Since the proof makes no statement about the speed of convergence, it is of little help in practical applications of EAs. But it does justify the recommendation to use elitist EAs. However, when using the usual panmictic population model, elitist EAs tend to converge prematurely more than non-elitist ones. In a panmictic population model, mate selection (see step 4 of the generic definition) is such that every individual in the entire population is eligible as a mate. In non-panmictic populations, selection is suitably restricted, so that the dispersal speed of better individuals is reduced compared to panmictic ones. Thus, the general risk of premature convergence of elitist EAs can be significantly reduced by suitable population models that restrict mate selection.

    === Virtual alphabets ===

    With the theory of virtual alphabets, David E. Goldberg showed in 1990 that by using a representation with real numbers, an EA that uses classical recombination operators (e.g. uniform or n-point crossover) cannot reach certain areas of the search space, in contrast to a coding with binary numbers. This results in the recommendation for EAs with real representation to use arithmetic operators for recombination (e.g. arithmetic mean or intermediate recombination). With suitable operators, real-valued representations are more effective than binary ones, contrary to earlier opinion.

    == Comparison to other concepts ==

    === Biological processes ===

    A possible limitation of many evolutionary algorithms is their lack of a clear genotype–phenotype distinction. In nature, the fertilized egg cell undergoes a complex process known as embryogenesis to become a mature p
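The generic loop from the "Generic definition" section, combined with the elitism discussed under "Convergence", can be sketched as follows. Every concrete choice here is an assumption made for illustration: real-valued genomes, a panmictic population, binary tournament selection, intermediate (arithmetic mean) recombination, Gaussian mutation, and a fixed generation budget as the termination criterion. The name `evolve` and its parameters are hypothetical.

```python
import random

def evolve(fitness, dim=3, pop_size=20, generations=50, sigma=0.3, seed=0):
    """Minimal elitist EA that maximizes `fitness` over real-valued vectors."""
    rng = random.Random(seed)
    # Step 1: random initial population.
    population = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(pop_size)]
    best_history = []
    # Step 3: termination is a fixed generation budget in this sketch.
    for _ in range(generations):
        # Step 2: evaluate and rank the current population.
        ranked = sorted(population, key=fitness, reverse=True)
        best_history.append(fitness(ranked[0]))
        offspring = []
        while len(offspring) < pop_size - 1:
            # Step 4: binary tournament, so fitter individuals are preferred as parents.
            p1 = max(rng.sample(population, 2), key=fitness)
            p2 = max(rng.sample(population, 2), key=fitness)
            # Step 5: intermediate recombination (arithmetic mean of the parents).
            child = [(a + b) / 2 for a, b in zip(p1, p2)]
            # Step 6: Gaussian mutation of every gene.
            child = [g + rng.gauss(0, sigma) for g in child]
            offspring.append(child)
        # Step 7: offspring replace the old population, but the best parent survives (elitism).
        population = [ranked[0]] + offspring
        # Step 8: the loop returns to the evaluation step.
    return max(population, key=fitness), best_history
```

Calling `evolve(lambda x: -sum(v * v for v in x))` searches for the maximum of a negated sphere function. Because the best parent is always copied into the next generation, `best_history` is a non-decreasing sequence, which mirrors the inequality chain used in the convergence argument; it says nothing about how quickly the optimum is approached.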

  • Generalized blockmodeling


    In generalized blockmodeling, the blockmodeling is done by "the translation of an equivalence type into a set of permitted block types", which differs from conventional blockmodeling, which uses the indirect approach. It is a special instance of the direct blockmodeling approach. Generalized blockmodeling was introduced in 1994 by Patrick Doreian, Vladimir Batagelj and Anuška Ferligoj.

    == Definition ==

    The generalized blockmodeling approach is a direct one, "where the optimal partition(s) is (are) identified based on minimal values of a compatible criterion function defined by the difference between empirical blocks and corresponding ideal blocks". At the same time, a much broader set of block types is introduced (while in conventional blockmodeling only certain types are used). Conventional blockmodeling is inductive because neither the clusters nor the location of block types is specified, while in generalized blockmodeling the blockmodel is specified in more detail than just the permission of certain block types (e.g., prespecification). Further, it is possible to define departures from the permitted (ideal) block type using the criterion function.

    Using a local optimization procedure, an initial clustering (with a specified number of clusters) is first created at random. Which clusterings count as neighbours of the current one is determined by two transformations: 1) a vertex is moved from one cluster to another, or 2) a pair of vertices is interchanged between two different clusters. These transformation steps are repeated many times, until only the best-fitting partitions (with the minimized value of the criterion function) are kept as blockmodels for further exploration of the network.

    Different types of generalized blockmodeling are: generalized binary blockmodeling, generalized valued blockmodeling and generalized homogeneity blockmodeling.
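The local optimization procedure can be sketched for the simplest case. The sketch assumes a binary network, structural equivalence (only null and complete blocks are permitted), and a criterion that counts, per block, the smaller of its deviations from the two ideal blocks while ignoring the diagonal. These simplifications, the balanced random start, and all function names are illustrative assumptions, not the procedure of any particular software package.

```python
import itertools
import random

def criterion(adj, labels, k):
    """Count inconsistencies between empirical blocks and the nearest ideal block."""
    n = len(adj)
    ones = [[0] * k for _ in range(k)]
    cells = [[0] * k for _ in range(k)]
    for i in range(n):
        for j in range(n):
            if i != j:  # the diagonal is ignored in this sketch
                ones[labels[i]][labels[j]] += adj[i][j]
                cells[labels[i]][labels[j]] += 1
    # Deviation from a null block = number of 1s; from a complete block = number of 0s.
    return sum(min(ones[a][b], cells[a][b] - ones[a][b])
               for a in range(k) for b in range(k))

def neighbours(labels, k):
    """Clusterings reachable by moving one vertex or swapping two vertices."""
    n = len(labels)
    for v in range(n):
        for c in range(k):
            if c != labels[v]:
                moved = list(labels)
                moved[v] = c
                if len(set(moved)) == k:  # keep every cluster non-empty
                    yield moved
    for u, v in itertools.combinations(range(n), 2):
        if labels[u] != labels[v]:
            swapped = list(labels)
            swapped[u], swapped[v] = swapped[v], swapped[u]
            yield swapped

def local_search(adj, k, restarts=20, seed=0):
    """Repeated first-improvement descent from random starting clusterings."""
    rng = random.Random(seed)
    n = len(adj)
    best, best_err = None, float("inf")
    for _ in range(restarts):
        labels = [i % k for i in range(n)]  # balanced sizes, then random assignment
        rng.shuffle(labels)
        err = criterion(adj, labels, k)
        improved = True
        while improved:
            improved = False
            for cand in neighbours(labels, k):
                cand_err = criterion(adj, cand, k)
                if cand_err < err:
                    labels, err, improved = cand, cand_err, True
                    break
        if err < best_err:
            best, best_err = labels, err
    return best, best_err
```

The criterion value doubles as the built-in measure of fit: a value of 0 means the partition reproduces the ideal blocks exactly. Recomputing the full criterion for every candidate keeps the sketch short but is expensive, which hints at why the approach is mostly applied to small networks.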
    == Benefits ==

    According to Patrick Doreian, the benefits of generalized blockmodeling are as follows:

    * the use of an explicit criterion function, compatible with a given type of equivalence, results in a built-in measure of fit, which is integral to the establishment of the blockmodels (in conventional blockmodeling, there are no compelling and coherent measures of fit);
    * partitions based on generalized blockmodeling regularly outperform, and never perform less well than, partitions based on the conventional approach;
    * with generalized blockmodeling it is possible to specify new types of blockmodels;
    * this potentially unlimited set of new block types also permits the inclusion of substantively driven blockmodels;
    * in generalized blockmodeling, the specification of the block types and the location of some of them in the blockmodel is possible;
    * the researcher can specify which (pairs of) vertices must (or must not) be clustered together;
    * this approach also allows the imposition of penalties, resulting in the identification of empirical null blocks without inconsistencies with a corresponding ideal null block.

    == Problems ==

    According to Doreian, the problems of generalized blockmodeling are as follows:

    * unknown sensitivity to particular data features,
    * examination of boundary problems,
    * it is computationally burdensome, which constrains the practical network size (generalized blockmodeling is thus primarily used to analyse smaller networks (below 100 units)),
    * identifying structure from incomplete network information,
    * most of generalized blockmodeling is based on binary networks, but there is also development in the field of valued networks,
    * the criterion function is minimized for a specified blockmodel, which results in issues of evaluating statistically, based on the structural data alone,
    * problems regarding three-dimensional network data,
    * problems regarding the evolution of fundamental network structure.

    == Book ==

    The book with the same title, Generalized blockmodeling, written by Patrick Doreian, Vladimir Batagelj and Anuška Ferligoj, was awarded the Harrison White Outstanding Book Award in 2007 by the Mathematical Sociology Section of the American Sociological Association.

  • TalkBack


    TalkBack is an accessibility service for the Android operating system that helps blind and visually impaired users interact with their devices. It uses spoken words, vibration and other audible feedback to tell the user what is happening on the screen, allowing the user to better interact with their device. The service is pre-installed on many Android devices, and it became part of the Android Accessibility Suite in 2017. According to the Google Play Store, the Android Accessibility Suite has been downloaded over five billion times, including devices that have the suite preinstalled.

    == Open-source ==

    Google releases the source code of TalkBack to GitHub with some releases of the accessibility service, with the latest of these changes being from May 6, 2021. The source for these versions of Google TalkBack has been released under the Apache License version 2.0.

    == Release history ==

  • Multiple kernel learning


    Multiple kernel learning refers to a set of machine learning methods that use a predefined set of kernels and learn an optimal linear or non-linear combination of kernels as part of the algorithm. Reasons to use multiple kernel learning include a) the ability to select an optimal kernel and parameters from a larger set of kernels, reducing bias due to kernel selection while allowing for more automated machine learning methods, and b) combining data from different sources (e.g. sound and images from a video) that have different notions of similarity and thus require different kernels. Instead of creating a new kernel, multiple kernel algorithms can be used to combine kernels already established for each individual data source.

    Multiple kernel learning approaches have been used in many applications, such as event recognition in video, object recognition in images, and biomedical data fusion.

    == Algorithms ==

    Multiple kernel learning algorithms have been developed for supervised, semi-supervised, as well as unsupervised learning. Most work has been done on the supervised learning case with linear combinations of kernels; however, many other algorithms have been developed. The basic idea behind multiple kernel learning algorithms is to add an extra parameter to the minimization problem of the learning algorithm. As an example, consider the case of supervised learning of a linear combination of a set of $n$ kernels $K$. We introduce a new kernel $K' = \sum_{i=1}^{n} \beta_i K_i$, where $\beta$ is a vector of coefficients for each kernel. Because the kernels are additive (due to properties of reproducing kernel Hilbert spaces), this new function is still a kernel.
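The combined kernel $K' = \sum_i \beta_i K_i$ can be illustrated numerically. The sketch below builds two base kernel matrices (a linear and a Gaussian RBF kernel, both chosen only as examples), forms their weighted sum, and leaves it to the caller to inspect the eigenvalues. It assumes nonnegative weights, which is the setting in which a sum of positive semidefinite matrices is guaranteed to remain positive semidefinite; the function names are hypothetical.

```python
import numpy as np

def linear_kernel(X):
    """Gram matrix of the linear kernel k(x, y) = <x, y>."""
    return X @ X.T

def rbf_kernel(X, gamma=0.5):
    """Gram matrix of the Gaussian kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))  # clip tiny negative round-off

def combine_kernels(kernels, beta):
    """Weighted sum K' = sum_i beta_i * K_i of precomputed Gram matrices."""
    combined = np.zeros_like(kernels[0], dtype=float)
    for weight, K in zip(beta, kernels):
        combined += weight * K
    return combined
```

With `X` an array of samples, `combine_kernels([linear_kernel(X), rbf_kernel(X)], [0.3, 0.7])` yields a symmetric matrix whose eigenvalues (for example from `np.linalg.eigvalsh`) are nonnegative up to round-off, so it can be passed to any learner that accepts a precomputed kernel.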
For a set of data X {\displaystyle X} with labels Y {\displaystyle Y} , the minimization problem can then be written as min β , c E ( Y , K ′ c ) + R ( K , c ) {\displaystyle \min _{\beta ,c}\mathrm {E} (Y,K'c)+R(K,c)} where E {\displaystyle \mathrm {E} } is an error function and R {\displaystyle R} is a regularization term. E {\displaystyle \mathrm {E} } is typically the square loss function (Tikhonov regularization) or the hinge loss function (for SVM algorithms), and R {\displaystyle R} is usually an ℓ n {\displaystyle \ell _{n}} norm or some combination of the norms (i.e. elastic net regularization). This optimization problem can then be solved by standard optimization methods. Adaptations of existing techniques such as the Sequential Minimal Optimization have also been developed for multiple kernel SVM-based methods. === Supervised learning === For supervised learning, there are many other algorithms that use different methods to learn the form of the kernel. The following categorization has been proposed by Gonen and Alpaydın (2011) ==== Fixed rules approaches ==== Fixed rules approaches such as the linear combination algorithm described above use rules to set the combination of the kernels. These do not require parameterization and use rules like summation and multiplication to combine the kernels. The weighting is learned in the algorithm. Other examples of fixed rules include pairwise kernels, which are of the form k ( ( x 1 i , x 1 j ) , ( x 2 i , x 2 j ) ) = k ( x 1 i , x 2 i ) k ( x 1 j , x 2 j ) + k ( x 1 i , x 2 j ) k ( x 1 j , x 2 i ) {\displaystyle k((x_{1i},x_{1j}),(x_{2i},x_{2j}))=k(x_{1i},x_{2i})k(x_{1j},x_{2j})+k(x_{1i},x_{2j})k(x_{1j},x_{2i})} . These pairwise approaches have been used in predicting protein-protein interactions. ==== Heuristic approaches ==== These algorithms use a combination function that is parameterized. 
The parameters are generally defined for each individual kernel based on single-kernel performance or some computation from the kernel matrix. Examples of these include the kernel from Tanabe et al. (2008). Letting $\pi_m$ be the accuracy obtained using only $K_m$, and letting $\delta$ be a threshold less than the minimum of the single-kernel accuracies, we can define

$$\beta_m = \frac{\pi_m - \delta}{\sum_{h=1}^{n} (\pi_h - \delta)}$$

Other approaches use a definition of kernel similarity, such as

$$A(K_1, K_2) = \frac{\langle K_1, K_2 \rangle}{\sqrt{\langle K_1, K_1 \rangle \langle K_2, K_2 \rangle}}$$

Using this measure, Qiu and Lane (2009) used the following heuristic to define

$$\beta_m = \frac{A(K_m, YY^T)}{\sum_{h=1}^{n} A(K_h, YY^T)}$$

==== Optimization approaches ====

These approaches solve an optimization problem to determine parameters for the kernel combination function. This has been done with similarity measures and structural risk minimization approaches. For similarity measures such as the one defined above, the problem can be formulated as follows:

$$\max_{\beta,\ \operatorname{tr}(K'_{tra}) = 1,\ K' \geq 0} A(K'_{tra}, YY^T),$$

where $K'_{tra}$ is the kernel of the training set. Structural risk minimization approaches that have been used include linear approaches, such as that used by Lanckriet et al. (2002). We can define the implausibility of a kernel $\omega(K)$ to be the value of the objective function after solving a canonical SVM problem.
We can then solve the following minimization problem:

$$\min_{\operatorname{tr}(K'_{tra}) = c} \omega(K'_{tra})$$

where $c$ is a positive constant. Many other variations exist on the same idea, with different methods of refining and solving the problem, e.g. with nonnegative weights for individual kernels and using non-linear combinations of kernels.

==== Bayesian approaches ====

Bayesian approaches put priors on the kernel parameters and learn the parameter values from the priors and the base algorithm. For example, the decision function can be written as

$$f(x) = \sum_{i=0}^{n} \alpha_i \sum_{m=1}^{p} \eta_m K_m(x_i^m, x^m)$$

$\eta$ can be modeled with a Dirichlet prior and $\alpha$ can be modeled with a zero-mean Gaussian and an inverse gamma variance prior. This model is then optimized using a customized multinomial probit approach with a Gibbs sampler. These methods have been used successfully in applications such as protein fold recognition and protein homology problems.

==== Boosting approaches ====

Boosting approaches add new kernels iteratively until some stopping criterion that is a function of performance is reached. An example of this is the MARK model developed by Bennett et al. (2002):

$$f(x) = \sum_{i=1}^{N} \sum_{m=1}^{P} \alpha_i^m K_m(x_i^m, x^m) + b$$

The parameters $\alpha_i^m$ and $b$ are learned by gradient descent on a coordinate basis. In this way, each iteration of the descent algorithm identifies the best kernel column to choose and adds it to the combined kernel. The model is then rerun to generate the optimal weights $\alpha_i$ and $b$.
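The two heuristic weighting rules from the heuristic approaches section (accuracy above a threshold, and kernel alignment with the label matrix $YY^T$) can be sketched as follows. The function names and the toy inputs are hypothetical, and the inner product between kernel matrices is assumed here to be the Frobenius inner product.

```python
import math

def accuracy_weights(accuracies, delta):
    """beta_m = (pi_m - delta) / sum_h (pi_h - delta)."""
    if delta >= min(accuracies):
        raise ValueError("delta must be below every single-kernel accuracy")
    shifted = [p - delta for p in accuracies]
    total = sum(shifted)
    return [s / total for s in shifted]

def frobenius_inner(A, B):
    """<A, B> taken as the sum of element-wise products (an assumption of this sketch)."""
    return sum(a * b for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))

def alignment(K1, K2):
    """A(K1, K2) = <K1, K2> / sqrt(<K1, K1> <K2, K2>)."""
    return frobenius_inner(K1, K2) / math.sqrt(
        frobenius_inner(K1, K1) * frobenius_inner(K2, K2))

def alignment_weights(kernels, y):
    """beta_m = A(K_m, Y Y^T) / sum_h A(K_h, Y Y^T) for labels y in {-1, +1}."""
    target = [[yi * yj for yj in y] for yi in y]
    scores = [alignment(K, target) for K in kernels]
    total = sum(scores)
    return [s / total for s in scores]

# Toy inputs: two single-kernel accuracies and two 2x2 kernel matrices.
acc_beta = accuracy_weights([0.9, 0.7], delta=0.5)
labels = [1, -1]
K_ideal = [[1.0, -1.0], [-1.0, 1.0]]   # equals Y Y^T, so its alignment is 1
K_identity = [[1.0, 0.0], [0.0, 1.0]]
align_beta = alignment_weights([K_ideal, K_identity], labels)
```

The kernel that matches the label structure exactly receives the larger alignment weight. Note that this normalization can misbehave if some alignments are negative, which the sketch does not guard against.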
=== Semisupervised learning ===

Semisupervised learning approaches to multiple kernel learning are similar to other extensions of supervised learning approaches. An inductive procedure has been developed that uses a log-likelihood empirical loss and group LASSO regularization with conditional expectation consensus on unlabeled data for image categorization. We can define the problem as follows. Let $L = \{(x_i, y_i)\}$ be the labeled data, and let $U = \{x_i\}$ be the set of unlabeled data. Then, we can write the decision function as follows.

$$f(x) = \alpha_0 + \sum_{i=1}^{|L|} \alpha_i K_i(x)$$

The problem can be written as

$$\min_f L(f) + \lambda R(f) + \gamma \Theta(f)$$

where $L$ is the loss function (weighted negative log-likelihood in this case), $R$ is the regularization term (Group LASSO in this case), and $\Theta$ is the conditional expectation consensus (CEC) penalty on unlabeled data. The CEC penalty is defined as follows. Let the marginal kernel density for all the data be

$$g_m^\pi(x) = \langle \phi_m^\pi, \psi_m(x) \rangle$$

where $\psi_m(x) = [K_m(x_1, x), \ldots, K_m(x_L, x)]^T$ (the kernel distance between the labe

    Read more →
  • Language identification in the limit

    Language identification in the limit

    Language identification in the limit is a formal model for inductive inference of formal languages, mainly by computers (see machine learning and induction of regular languages). It was introduced by E. Mark Gold in a technical report and a journal article with the same title. In this model, a teacher provides to a learner some presentation (i.e. a sequence of strings) of some formal language. The learning is seen as an infinite process. Each time the learner reads an element of the presentation, it should provide a representation (e.g. a formal grammar) for the language. Gold defines that a learner can identify in the limit a class of languages if, given any presentation of any language in the class, the learner will produce only a finite number of wrong representations, and then stick with the correct representation. However, the learner need not be able to announce its correctness; and the teacher might present a counterexample to any representation arbitrarily long after. Gold defined two types of presentations:

* Text (positive information): an enumeration of all strings the language consists of.
* Complete presentation (positive and negative information): an enumeration of all possible strings, each with a label indicating if the string belongs to the language or not.

== Learnability ==

This model is an early attempt to formally capture the notion of learnability. Gold's journal article introduces for contrast the stronger models Finite identification (where the learner has to announce correctness after a finite number of steps), and Fixed-time identification (where correctness has to be reached after an a priori specified number of steps). A weaker formal model of learnability is the Probably approximately correct learning (PAC) model, introduced by Leslie Valiant in 1984.

== Examples ==

It is instructive to look at concrete examples (in the tables) of the learning sessions that the definition of identification in the limit speaks about.
1. A fictitious session to learn a regular language $L$ over the alphabet $\{a, b\}$ from text presentation: In each step, the teacher gives a string belonging to $L$, and the learner answers a guess for $L$, encoded as a regular expression. In step 3, the learner's guess is not consistent with the strings seen so far; in step 4, the teacher gives a string repeatedly. After step 6, the learner sticks to the regular expression (ab+ba). If this happens to be a description of the language $L$ the teacher has in mind, it is said that the learner has learned that language. If a computer program for the learner's role existed that was able to successfully learn each regular language, that class of languages would be identifiable in the limit. Gold has shown that this is not the case.
2. A particular learning algorithm always guessing $L$ to be just the union of all strings seen so far: If $L$ is a finite language, the learner will eventually guess it correctly, however, without being able to tell when. Although the guess didn't change during steps 3 to 6, the learner couldn't be sure to be correct. Gold has shown that the class of finite languages is identifiable in the limit; however, this class is neither finitely nor fixed-time identifiable.
3. Learning from complete presentation by telling: In each step, the teacher gives a string and tells whether it belongs to $L$ (green) or not (red, struck-out). Each possible string is eventually classified in this way by the teacher.
4. Learning from complete presentation by request: The learner gives a query string, the teacher tells whether it belongs to $L$ (yes) or not (no); the learner then gives a guess for $L$, followed by the next query string. In this example, the learner happens to query in each step just the same string as given by the teacher in example 3. In general, Gold has shown that each language class identifiable in the request-presentation setting is also identifiable in the telling-presentation setting, since the learner, instead of querying a string, just needs to wait until it is eventually given by the teacher.

== Gold's theorem ==

More formally,

* A language $L$ is a nonempty set, and its elements are called sentences.
* A language family is a set of languages.
* A language-learning environment $E$ for a language $L$ is a stream of sentences from $L$, such that each sentence in $L$ appears at least once.
* A language learner is a function $f$ that sends a list of sentences to a language. This is interpreted as saying that, after seeing sentences $a_1, a_2, \ldots, a_n$ in that order, the language learner guesses that the language that produces the sentences should be $f(a_1, \ldots, a_n)$. Note that the learner is not obliged to be correct; it could very well guess a language that does not even contain $a_1, \ldots, a_n$.
* A language learner $f$ learns a language $L$ in environment $E = (a_1, a_2, \ldots)$ if the learner always guesses $L$ after seeing enough examples from the environment.
* A language learner $f$ learns a language $L$ if it learns $L$ in any environment $E$ for $L$.
* A language family is learnable if there exists a language learner that can learn all languages in the family.

Notes:

* In the context of Gold's theorem, sentences need only be distinguishable. They need not be anything in particular, such as finite strings (as usual in formal linguistics).
* Learnability is not a concept for individual languages. Any individual language $L$ could be learned by a trivial learner that always guesses $L$.
* Learnability is not a concept for individual learners. A language family is learnable if, and only if, there exists some learner that can learn the family. It does not matter how well the learner performs for learning languages outside the family.

Gold's theorem is easily bypassed if negative examples are allowed. In particular, the language family $\{L_1, L_2, \ldots, L_\infty\}$ can be learned by a learner that always guesses $L_\infty$ until it receives the first negative example $\neg a_n$, where $a_n \in L_{n+1} \setminus L_n$, at which point it always guesses $L_n$.

== Learnability characterization ==

Dana Angluin gave characterizations of learnability from text (positive information) in a 1980 paper. If a learner is required to be effective, then an indexed class of recursive languages is learnable in the limit if there is an effective procedure that uniformly enumerates tell-tales for each language in the class (Condition 1). It is not hard to see that if an ideal learner (i.e., an arbitrary function) is allowed, then an indexed class of languages is learnable in the limit if each language in the class has a tell-tale (Condition 2).

== Language classes learnable in the limit ==

The table shows which language classes are identifiable in the limit in which learning model. On the right-hand side, each language class is a superclass of all lower classes. Each learning model (i.e. type of presentation) can identify in the limit all classes below it.
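The learner from the second example, which always guesses the set of all strings seen so far, can be sketched as follows. The target language and the presentation are made-up illustrations, and a real text presentation is infinite, so the finite list here only stands in for a prefix of one.

```python
def union_learner(presentation):
    """Yield the learner's guess after each string of a text presentation.

    The guess is always the finite language consisting of every string
    seen so far, as in the second example.
    """
    seen = set()
    for sentence in presentation:
        seen.add(sentence)
        yield frozenset(seen)

# Hypothetical finite target language and a prefix of one of its text presentations.
target = frozenset({"ab", "ba"})
text_prefix = ["ab", "ba", "ab", "ab", "ba", "ab"]

guesses = list(union_learner(text_prefix))

# Number of times the learner changed its mind along this prefix.
mind_changes = sum(1 for a, b in zip(guesses, guesses[1:]) if a != b)
```

After the second string the guess equals the target and never changes again, but nothing in the learner's own state tells it that no new string will arrive later. This is why the class of finite languages is identifiable in the limit without being finitely identifiable.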
In particular, the class of finite languages is identifiable in the limit by text presentation (cf. Example 2 above), while the class of regular languages is not. Pattern Languages, introduced by Dana Angluin in another 1980 paper, are also identifiable by normal text presentation; they are omitted in the table, since they are above the singleton and below the primitive recursive language class, but incomparable to the classes in between.

== Sufficient conditions for learnability ==

Condition 1 in Angluin's paper is not always easy to verify. Therefore, researchers have proposed various sufficient conditions for the learnability of a language class. See also Induction of regular languages for learnable subclasses of regular languages.

=== Finite thickness ===

A class of languages has finite thickness if every non-empty set of strings is contained in at most finitely many languages of the class. This is exactly Condition 3 in Angluin's paper. Angluin showed that if a class of recursive languages has finite thickness, then it is learnable in the limit. A class with finite thickness certainly satisfies the MEF-condition and the MFF-condition; in other words, finite thickness implies M-finite thickness.

=== Finite elasticity ===

A class of languages is said to have finite elasticity if for every infinite sequence of strings $s_0, s_1, \ldots$ and every infinite sequence of languages in the class $L_1, L_2, \ldots$, there exists a finite number n such

    Read more →
  • Variational message passing

    Variational message passing

    Variational message passing (VMP) is an approximate inference technique for continuous- or discrete-valued Bayesian networks, with conjugate-exponential parents, developed by John Winn. VMP was developed as a means of generalizing the approximate variational methods used by such techniques as latent Dirichlet allocation, and works by updating an approximate distribution at each node through messages in the node's Markov blanket.

== Likelihood lower bound ==

Given some set of hidden variables $H$ and observed variables $V$, the goal of approximate inference is to maximize a lower bound on the probability that a graphical model is in the configuration $V$. Over some probability distribution $Q$ (to be defined later),

$$\ln P(V) = \sum_H Q(H) \ln \frac{P(H, V)}{P(H \mid V)} = \sum_H Q(H) \left[ \ln \frac{P(H, V)}{Q(H)} - \ln \frac{P(H \mid V)}{Q(H)} \right].$$

So, if we define our lower bound to be

$$L(Q) = \sum_H Q(H) \ln \frac{P(H, V)}{Q(H)},$$

then the log likelihood is simply this bound plus the relative entropy between $Q$ and the posterior $P(H \mid V)$. Because the relative entropy is non-negative, the function $L$ defined above is indeed a lower bound of the log likelihood of our observation $V$. The distribution $Q$ will have a simpler character than that of $P$ because marginalizing over $P$ is intractable for all but the simplest of graphical models. In particular, VMP uses a factorized distribution

$$Q(H) = \prod_i Q_i(H_i),$$

where $H_i$ is a disjoint part of the graphical model.
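The decomposition above can be checked numerically on a tiny discrete model. This is only a sketch: the joint probabilities and the distribution $Q$ below are arbitrary illustrative numbers, with a single hidden variable taking three values and the observation $V$ held fixed.

```python
import math

def lower_bound(q, joint):
    """L(Q) = sum_H Q(H) ln(P(H, V) / Q(H)) for a fixed observation V."""
    return sum(qh * math.log(ph / qh) for qh, ph in zip(q, joint) if qh > 0)

def relative_entropy_to_posterior(q, joint):
    """sum_H Q(H) ln(Q(H) / P(H | V)), with P(H | V) = P(H, V) / P(V)."""
    evidence = sum(joint)
    return sum(qh * math.log(qh / (ph / evidence))
               for qh, ph in zip(q, joint) if qh > 0)

# Illustrative values of P(H = h, V = v) for h in {0, 1, 2}; they sum to P(V) = 0.40.
joint = [0.10, 0.25, 0.05]
q = [0.2, 0.5, 0.3]

log_evidence = math.log(sum(joint))
bound = lower_bound(q, joint)
gap = relative_entropy_to_posterior(q, joint)

# When Q equals the exact posterior, the gap vanishes and the bound is tight.
posterior = [p / sum(joint) for p in joint]
tight_bound = lower_bound(posterior, joint)
```

Here `bound + gap` reproduces `log_evidence` up to rounding, and `gap` is non-negative, so `bound` never exceeds the log likelihood.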
== Determining the update rule ==

The likelihood estimate needs to be as large as possible; because it is a lower bound, getting closer to $\log P$ improves the approximation of the log likelihood. By substituting in the factorized version of $Q$, $L(Q)$, parameterized over the hidden nodes $H_i$ as above, is simply the negative relative entropy between $Q_j$ and $Q_j^*$ plus other terms independent of $Q_j$ if $Q_j^*$ is defined as

$$Q_j^*(H_j) = \frac{1}{Z} e^{\mathbb{E}_{-j}\{\ln P(H, V)\}},$$

where $\mathbb{E}_{-j}\{\ln P(H, V)\}$ is the expectation over all distributions $Q_i$ except $Q_j$. Thus, if we set $Q_j$ to be $Q_j^*$, the bound $L$ is maximized.

== Messages in variational message passing ==

Parents send their children the expectation of their sufficient statistic while children send their parents their natural parameter, which also requires messages to be sent from the co-parents of the node.

== Relationship to exponential families ==

Because all nodes in VMP come from exponential families and all parents of nodes are conjugate to their children nodes, the expectation of the sufficient statistic can be computed from the normalization factor.

== VMP algorithm ==

The algorithm begins by computing the expected value of the sufficient statistics for that vector. Then, until the likelihood converges to a stable value (this is usually accomplished by setting a small threshold value and running the algorithm until it increases by less than that threshold value), do the following at each node:

1. Get all messages from parents.
2. Get all messages from children (this might require the children to get messages from the co-parents).
3. Compute the expected value of the node's sufficient statistics.

== Constraints ==

Because every child must be conjugate to its parent, this has limited the types of distributions that can be used in the model. For example, the parents of a Gaussian distribution must be a Gaussian distribution (corresponding to the mean) and a gamma distribution (corresponding to the precision, or one over $\sigma^2$ in more common parameterizations). Discrete variables can have Dirichlet parents, and Poisson and exponential nodes must have gamma parents. More recently, VMP has been extended to handle models that violate this conditional conjugacy constraint.

== Literature ==

* John Winn; Christopher M. Bishop (2005). "Variational Message Passing" (PDF). Journal of Machine Learning Research. 6: 661–694. ISSN 1533-7928.
* Beal, M.J. (2003). Variational Algorithms for Approximate Bayesian Inference (PDF) (PhD). Gatsby Computational Neuroscience Unit, University College London. Archived from the original (PDF) on 2005-04-28. Retrieved 2007-02-15.

    Read more →
  • Huawei Member Center

    Huawei Member Center

    Huawei Member Center is a benefits app which runs using Huawei Mobile Services. Originally launched in China, Huawei Member Center is now being developed primarily around devices such as the P40 Pro and the Nova 7.

== Membership Levels ==

The Huawei Member Center provides rewards in two primary ways: 1) device-specific promotions and 2) frequent use of Huawei products and apps, with points used to redeem additional benefits. In China, Huawei members are already classified into three levels, the highest being “elite”. Membership level determines the level of perks received, from priority access to the service hotline to new device events and proprietary early-access opportunities. Huawei ran a number of member events in 2019 called "Huawei Member Day" to promote the Member Center, including providing tips for the Mate 30 Pro and offering a 50 GB cloud storage upgrade to users.

== HMC in China ==

Huawei Member Center has seen significant adoption in China and the East. Rewards for use of the app have ranged from free book coupons and discounted travel to exclusive gifts of new devices, such as the Huawei Enjoy Z.

    Read more →
  • Log-linear model

    Log-linear model

    A log-linear model is a mathematical model that takes the form of a function whose logarithm equals a linear combination of the parameters of the model, which makes it possible to apply (possibly multivariate) linear regression. That is, it has the general form

$$\exp\left(c + \sum_i w_i f_i(X)\right),$$

in which the $f_i(X)$ are quantities that are functions of the variable $X$, in general a vector of values, while $c$ and the $w_i$ stand for the model parameters. The term may specifically be used for:

* A log-linear plot or graph, which is a type of semi-log plot.
* Poisson regression for contingency tables, a type of generalized linear model.

The specific applications of log-linear models are where the output quantity lies in the range 0 to ∞, for values of the independent variables $X$, or more immediately, the transformed quantities $f_i(X)$ in the range −∞ to +∞. This may be contrasted to logistic models, similar to the logistic function, for which the output quantity lies in the range 0 to 1. Thus the contexts where these models are useful or realistic often depend on the range of the values being modelled.
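The remark that taking logarithms makes linear regression applicable can be illustrated with a minimal sketch. It assumes a single feature $f(X) = X$, noise-free synthetic data, and hypothetical helper names; the parameter values 0.5 and 0.3 are arbitrary.

```python
import math

def log_linear(c, weights, features, x):
    """Evaluate exp(c + sum_i w_i * f_i(x))."""
    return math.exp(c + sum(w * f(x) for w, f in zip(weights, features)))

def fit_single_feature(xs, ys, feature):
    """Ordinary least squares of ln(y) on feature(x); returns (c, w)."""
    fs = [feature(x) for x in xs]
    logs = [math.log(y) for y in ys]   # requires strictly positive outputs
    n = len(xs)
    f_mean = sum(fs) / n
    log_mean = sum(logs) / n
    w = (sum((f - f_mean) * (l - log_mean) for f, l in zip(fs, logs))
         / sum((f - f_mean) ** 2 for f in fs))
    c = log_mean - w * f_mean
    return c, w

def identity(x):
    return x

# Synthetic data generated from the model itself with c = 0.5 and w = 0.3.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [log_linear(0.5, [0.3], [identity], x) for x in xs]

c_hat, w_hat = fit_single_feature(xs, ys, identity)
```

Because the data are noise-free, the fit recovers the generating parameters up to rounding. With noisy data, least squares on the log scale and fitting on the original scale generally give different estimates.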

    Read more →
  • Tanagra (machine learning)

    Tanagra (machine learning)

    Tanagra is a free suite of machine learning software for research and academic purposes developed by Ricco Rakotomalala at the Lumière University Lyon 2, France. Tanagra supports several standard data mining tasks such as visualization, descriptive statistics, instance selection, feature selection, feature construction, regression, factor analysis, clustering, classification and association rule learning. Tanagra is an academic project. It is widely used in French-speaking universities. Tanagra is frequently used in real studies and in software comparison papers.

== History ==

The development of Tanagra was started in June 2003. The first version was distributed in December 2003. Tanagra is the successor of Sipina, another free data mining tool which is intended only for supervised learning tasks (classification), especially the interactive and visual construction of decision trees. Sipina is still available online and is maintained. Tanagra is an "open source project" as every researcher can access the source code and add their own algorithms, as long as they agree and conform to the software distribution license. The main purpose of the Tanagra project is to give researchers and students user-friendly data mining software that conforms to the present norms of software development in this domain (especially in the design of its GUI and the way to use it) and allows the analysis of either real or synthetic data. From 2006, Ricco Rakotomalala made an important documentation effort. A large number of tutorials are published on a dedicated website. They describe the statistical and machine learning methods and their implementation with Tanagra on real case studies. The use of other free data mining tools on the same problems is also widely described. The comparison of the tools enables readers to understand the possible differences in the presentation of results.

== Description ==

Tanagra works similarly to current data mining tools. The user can visually design a data mining process in a diagram. Each node is a statistical or machine learning technique, and the connection between two nodes represents the data transfer. But unlike the majority of tools which are based on the workflow paradigm, Tanagra is very simplified. The treatments are represented in a tree diagram. The results are displayed in an HTML format. This makes it easy to export the outputs in order to visualize the results in a browser. It is also possible to copy the result tables to a spreadsheet. Tanagra makes a good compromise between statistical approaches (e.g. parametric and nonparametric statistical tests), multivariate analysis methods (e.g. factor analysis, correspondence analysis, cluster analysis, regression) and machine learning techniques (e.g. neural network, support vector machine, decision trees, random forest).

    Read more →
  • Hinge loss

    Hinge loss

    In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs). For an intended output $t = \pm 1$ and a classifier score $y$, the hinge loss of the prediction $y$ is defined as

$$\ell(y) = \max(0, 1 - t \cdot y)$$

Note that $y$ should be the "raw" output of the classifier's decision function, not the predicted class label. For instance, in linear SVMs, $y = \mathbf{w} \cdot \mathbf{x} + b$, where $(\mathbf{w}, b)$ are the parameters of the hyperplane and $\mathbf{x}$ is the input variable(s). When $t$ and $y$ have the same sign (meaning $y$ predicts the right class) and $|y| \geq 1$, the hinge loss $\ell(y) = 0$. When they have opposite signs, $\ell(y)$ increases linearly with $|y|$, and similarly if $|y| < 1$, even if it has the same sign (correct prediction, but not by enough margin). The hinge loss is not a proper scoring rule.

== Extensions ==

While binary SVMs are commonly extended to multiclass classification in a one-vs.-all or one-vs.-one fashion, it is also possible to extend the hinge loss itself for such an end. Several different variations of multiclass hinge loss have been proposed. For example, Crammer and Singer defined it for a linear classifier as

$$\ell(y) = \max\left(0, 1 + \max_{y \neq t} \mathbf{w}_y \mathbf{x} - \mathbf{w}_t \mathbf{x}\right),$$

where $t$ is the target label, and $\mathbf{w}_t$ and $\mathbf{w}_y$ are the model parameters.
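The binary definition and the Crammer and Singer variant can be sketched in a few lines. This is a minimal illustration: the function names are hypothetical, and the multiclass version takes the per-class scores $\mathbf{w}_y \mathbf{x}$ as already computed rather than computing them from weights.

```python
def hinge_loss(t, y):
    """Binary hinge loss max(0, 1 - t*y) for a label t in {-1, +1} and raw score y."""
    return max(0.0, 1.0 - t * y)

def crammer_singer_loss(scores, target):
    """Multiclass hinge loss max(0, 1 + max_{y != t} s_y - s_t).

    scores[k] stands for the raw score w_k . x of class k (precomputed here),
    and target is the index of the correct class.
    """
    best_other = max(s for label, s in enumerate(scores) if label != target)
    return max(0.0, 1.0 + best_other - scores[target])

# Correct side with enough margin: no loss.
confident = hinge_loss(+1, 2.0)
# Correct side but inside the margin: small loss.
inside_margin = hinge_loss(+1, 0.5)
# Wrong side: loss grows linearly with the size of the score.
wrong_side = hinge_loss(-1, 0.5)

# Target class 0 beats the runner-up (class 2) by only 0.5, less than the margin of 1.
multiclass = crammer_singer_loss([2.0, 0.5, 1.5], target=0)
```

The values chosen here are exactly representable in floating point, so the results can be compared directly: 0.0, 0.5, 1.5 and 0.5 respectively.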
Weston and Watkins provided a similar definition, but with a sum rather than a max:

$$\ell(y) = \sum_{y \neq t} \max(0, 1 + \mathbf{w}_y \mathbf{x} - \mathbf{w}_t \mathbf{x}).$$

In structured prediction, the hinge loss can be further extended to structured output spaces. Structured SVMs with margin rescaling use the following variant, where $\mathbf{w}$ denotes the SVM's parameters, $\mathbf{y}$ the SVM's predictions, $\phi$ the joint feature function, and $\Delta$ the Hamming loss:

$$\begin{aligned}\ell(\mathbf{y}) &= \max(0, \Delta(\mathbf{y}, \mathbf{t}) + \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{y}) \rangle - \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{t}) \rangle) \\ &= \max\left(0, \max_{y \in \mathcal{Y}} \left( \Delta(\mathbf{y}, \mathbf{t}) + \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{y}) \rangle \right) - \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{t}) \rangle \right)\end{aligned}$$

== Optimization ==

The hinge loss is a convex function, so many of the usual convex optimizers used in machine learning can work with it. It is not differentiable, but has a subgradient with respect to model parameters $\mathbf{w}$ of a linear SVM with score function $y = \mathbf{w} \cdot \mathbf{x}$ that is given by

$$\frac{\partial \ell}{\partial w_i} = \begin{cases} -t \cdot x_i & \text{if } t \cdot y < 1, \\ 0 & \text{otherwise}. \end{cases}$$

However, since the derivative of the hinge loss at $ty = 1$ is undefined, smoothed versions may be preferred for optimization, such as Rennie and Srebro's

$$\ell(y) = \begin{cases} \frac{1}{2} - ty & \text{if } ty \leq 0, \\ \frac{1}{2}(1 - ty)^2 & \text{if } 0 < ty < 1, \\ 0 & \text{if } 1 \leq ty \end{cases}$$

    Read more →

  • Radar geo-warping

    Radar geo-warping

    Radar geo-warping is the adjustment of geo-referenced radar images and video data to be consistent with a geographical projection. This image warping avoids any restrictions when displaying it together with video from multiple radar sources or with other geographical data including scanned maps and satellite images which may be provided in a particular projection. There are many areas where geo-warping has unique benefits:

* Single radar video signal displayed together with maps of different geographical projections, e.g. Mercator, UTM, stereographic.
* Multiple radar video signals displayed simultaneously: having the computing power to do so on one computer, and adapting the projection of all radar signals to allow the geographically correct display and accurate superimposition of those videos.
* Slant range correction: a modern 3D radar system can measure the height of a target and hence it is possible to correct the radar video by the real corrected range of the target. Slant range correction also makes it possible to compensate for the radar tower height, e.g. for maritime surveillance radars.

== Introduction ==

Radar video presents the echoes of electromagnetic waves that a radar system has emitted and afterwards received as reflections. These echoes are typically presented on a computer screen with a color-coding scheme depicting the reflection strength. Two problems have to be solved during such a visualization process. The first problem arises from the fact that typically the radar antenna turns around its position and measures the reflection echo distances from its position in one direction. This effectively means that the radar video data are present in polar coordinates. In older systems the polar-oriented picture was displayed on so-called plan position indicators (PPI). The PPI-scope uses a radial sweep pivoting about the center of the presentation. This results in a map-like picture of the area covered by the radar beam. A long-persistence screen is used so that the display remains visible until the sweep passes again. Bearing to the target is indicated by the target's angular position in relation to an imaginary line extending vertically from the sweep origin to the top of the scope. The top of the scope is either true north (when the indicator is operated in the true bearing mode) or ship's heading (when the indicator is operated in the relative bearing mode). For visualization on a modern computer screen the polar coordinates have to be converted into Cartesian coordinates. This process, called radar scan conversion, is presented in more detail in the next section.

The second problem arises from the fact that a radar system is placed in the real world and measures real-world echo positions. These echoes have to be displayed together with other real-world data like object positions, vector maps and satellite images in a consistent way. All this information refers to the curved earth surface but is displayed on a flat computer display. Building a link from real-world earth positions to display pixels is commonly called geographical referencing, or geo-referencing for short. Part of the geo-referencing process is to map the 3D earth surface onto a 2D display. This process of geographical projection can be performed in many ways, but different data sources have their own 'natural' projection. For example, Cartesian radar video data from a radar source on the earth surface are geo-referenced by a so-called radar projection. When using this radar projection the Cartesian radar video pixels can be displayed directly on a computer screen (only being linearly transformed according to the current position on the screen and e.g. the current zoom level). A problem now arises if, for example, a satellite map is also to be shown together with the radar video data.
The 'natural' geographical projection of a satellite image would be a satellite projection, which depends on the satellite orbit, position and further parameters. Now either the satellite image has to be reprojected to a radar projection or the radar video has to use the satellite projection. This geographical re-projection, in which each image pixel has to be transformed from one projection into another, is also called geographical warping or Geo Warping. This article describes in further detail the Geo Warping of radar video images in real time. It will also show that radar video Geo Warping is done most efficiently when it is integrated with the radar scan conversion process. == Radar-scan conversion == This section describes the principles of the radar-scan conversion (RSC) process. The radar supplies its measured data in polar coordinates (ρ, θ) directly from the rotating antenna, where ρ is the target/echo distance and θ the target angle in polar world coordinates. These data are measured, digitized and stored in a polar-coordinate buffer called the polar store or polar pixmap. The main RSC task is to convert these data to Cartesian (x, y) display coordinates, creating the necessary display pixels. The RSC process is influenced by the current zoom, shift and rotation settings, which define which part of the 'world' is visible in the display image. As detailed later, the RSC process also takes the currently used geographical projection into account when the radar video images are Geo Warped. The OpenGL RSC is implemented using a reverse scan conversion approach, which calculates for every image pixel the most appropriate radar amplitude value in the polar store. This approach generates an image free of the artifacts known from forward spoke-fill algorithms. 
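The reverse approach can be illustrated with a minimal Python sketch. It is not the OpenGL implementation described here: the function name `reverse_scan_convert`, the list-of-lists polar store layout (azimuth bin first, bin 0 starting at north and running clockwise) and the nearest-neighbour lookup are all assumptions chosen for brevity, and zoom, shift and rotation are left out.

```python
import math

def reverse_scan_convert(polar, max_range, size):
    """Build a size x size Cartesian image from a polar store.

    polar[a][r] is the amplitude in azimuth bin a and range bin r
    (assumed layout: bin 0 starts at north, bins run clockwise).
    The radar sits at the image centre; the image spans +/- max_range.
    """
    n_az, n_rng = len(polar), len(polar[0])
    half = (size - 1) / 2.0
    scale = max_range / half                  # world units per pixel
    image = [[0.0] * size for _ in range(size)]
    for row in range(size):
        for col in range(size):
            x = (col - half) * scale          # east of the radar
            y = (half - row) * scale          # north of the radar (row 0 is the top)
            rho = math.hypot(x, y)
            if rho > max_range:
                continue                      # pixel lies outside radar coverage
            theta = math.atan2(x, y) % (2.0 * math.pi)   # bearing from north
            a = int(theta / (2.0 * math.pi) * n_az) % n_az
            r = min(int(rho / max_range * n_rng), n_rng - 1)
            image[row][col] = polar[a][r]     # nearest stored sample
    return image
```

Because the loop runs over display pixels rather than over radar spokes, every pixel inside the coverage circle receives exactly one value, which is why this direction of conversion leaves no unfilled gaps between spokes.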
By applying bilinear filtering between adjacent pixels in the polar store during the conversion process, the OpenGL RSC achieves a high-quality radar display image at every zoom level, with smooth rendering of the radar echoes. == Radar projection == This section illustrates how radar video data are geo-referenced and displayed on a computer screen. The radar sensor is positioned at a height h above the ground on the earth surface. It measures the direct distance d to the target (and not, e.g., the distance to the target measured along the earth surface). This distance is then used in the display plane after adjustment to the current display zoom level by the radar scan converter (RSC). It remains to clarify how the radar video data are geo-referenced. Geo-referencing means that a geographical real-world object (e.g. a lighthouse) located at the same real-world position as a radar target must also appear at the same position in the display plane. This is achieved by calculating the distance from the radar sensor to the respective real-world object and using that distance in the display plane. The position of the real-world object is typically given in geographical coordinates (latitude, longitude and height above the earth surface). In other words, using a radar projection with geographical data amounts to simulating a radar measurement of the real-world objects and using the resulting range and azimuth in the display plane. The second picture to the right shows an example radar projection with the center of projection (COP) at latitude 50.0° and longitude 0.0°, which is also the radar position. The dashed lines are the equal-latitude and equal-longitude lines on top of the background map. The solid lines show equal range and equal azimuth with respect to the radar position. 
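The idea of simulating a radar measurement for a geographical object can be sketched in Python as follows. This is only an illustration under simplifying assumptions that are not taken from the article: a spherical earth with a mean radius of 6,371 km, the azimuth taken as the initial great-circle bearing, and the hypothetical helper names `to_ecef`, `simulate_measurement` and `radar_projection`.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumed mean radius of a spherical earth

def to_ecef(lat_deg, lon_deg, height_m):
    """Geographic position -> earth-centred Cartesian coordinates (sphere)."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    r = EARTH_RADIUS_M + height_m
    return (r * math.cos(lat) * math.cos(lon),
            r * math.cos(lat) * math.sin(lon),
            r * math.sin(lat))

def simulate_measurement(radar, obj):
    """Return (range_m, azimuth_rad) of obj as seen from the radar.

    radar and obj are (lat_deg, lon_deg, height_m) tuples. The range is the
    straight-line distance d; the azimuth is measured clockwise from north.
    """
    d = math.dist(to_ecef(*radar), to_ecef(*obj))
    lat1, lon1 = math.radians(radar[0]), math.radians(radar[1])
    lat2, lon2 = math.radians(obj[0]), math.radians(obj[1])
    dlon = lon2 - lon1
    az = math.atan2(math.sin(dlon) * math.cos(lat2),
                    math.cos(lat1) * math.sin(lat2)
                    - math.sin(lat1) * math.cos(lat2) * math.cos(dlon))
    return d, az % (2.0 * math.pi)

def radar_projection(radar, obj):
    """Display-plane position (east, north) in metres, radar at the origin."""
    d, az = simulate_measurement(radar, obj)
    return d * math.sin(az), d * math.cos(az)

# Radar at the centre of projection used in the example (50.0 N, 0.0 E).
radar = (50.0, 0.0, 0.0)
lighthouse = (51.0, 0.0, 0.0)   # hypothetical object one degree further north
rng, az = simulate_measurement(radar, lighthouse)
x, y = radar_projection(radar, lighthouse)
```

With these inputs the object lands straight "up" from the radar in the display plane at a range of roughly 111 km; a production system would use an ellipsoidal earth model rather than a sphere.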
It is a feature of the radar projection that equal-range lines are circles and equal-azimuth lines are straight lines. This is necessary to display radar video consistently with other map data when using a radar projection, for which the projection center has to be the radar position. == Geo Warping process == This section explains the actual geo warping or re-projection process when applied to radar video in real time. Assume we want to display radar video on top of a satellite image. As an example we use the CIB projection, which is used to display satellite data in CIB (Controlled Image Base) format. The figure "Geo Warping Radar to CIB Projection" shows, as a dashed line, the maximal range circle for a range of 111 km or 60 nautical miles using the radar projection. Such a range is typical for long-range coastal surveillance radars. As stated in the last section, this is a perfect circle on the computer screen as well. The solid-line ellipse shows the same range circle in the CIB projection. Typically the errors occurring without Geo Warping are smallest near the radar position, provided that the projection center (COP) coincides with the radar position, as it does in our example. Otherwise the error distribution depends both on the projection used and on the projection parameters. Thus, in our case the errors are most significant near the maximum radar range. The CIB projection error corrected in the east–west direction is 2.6 km at half the radar range and 5.3 km at the full radar range of 111 km. An error of 5.3 km is

    Read more →
  • Yooreeka

    Yooreeka

    Yooreeka is a library for data mining, machine learning, soft computing, and mathematical analysis. The project started with the code of the book "Algorithms of the Intelligent Web". Although the term "Web" prevailed in the title, in essence the algorithms are valuable in any software application. It covers all major algorithms and provides many examples. Yooreeka 2.x is licensed under the Apache License rather than the somewhat more restrictive LGPL (which was the license of v1.x). The library is written entirely in Java. == Algorithms == The following algorithms are covered. Clustering: hierarchical, both agglomerative (e.g. MST single link; ROCK) and divisive; partitional (e.g. k-means). Classification: Bayesian; decision trees; neural networks; rule-based (via Drools). Recommendations: collaborative filtering; content-based. Search: PageRank; DocRank; personalization.

    Read more →
  • Bayesian hierarchical modeling

    Bayesian hierarchical modeling

    A Bayesian hierarchical model is a statistical model written in multiple levels (hierarchical form) that estimates the posterior distribution of the model parameters using the Bayesian method. The sub-models combine to form the hierarchical model, and Bayes' theorem is used to integrate them with the observed data and account for all the uncertainty that is present. This integration enables calculation of an updated posterior over the (hyper)parameters, effectively updating prior beliefs in light of the observed data. Frequentist statistics may yield conclusions seemingly incompatible with those offered by Bayesian statistics, because the Bayesian approach treats the parameters as random variables and uses subjective information in establishing assumptions on these parameters. As the approaches answer different questions, the formal results are not technically contradictory, but the two approaches disagree over which answer is relevant to particular applications. Bayesians argue that relevant information regarding decision-making and updating beliefs cannot be ignored and that hierarchical modeling has the potential to overrule classical methods in applications where respondents give multiple observational data. Moreover, the model has proven to be robust, with the posterior distribution less sensitive to the more flexible hierarchical priors. Hierarchical modeling, as its name implies, retains nested data structure, and is used when information is available at several different levels of observational units. For example, in epidemiological modeling to describe infection trajectories for multiple countries, the observational units are countries, and each country has its own time-based profile of daily infected cases. 
In decline curve analysis to describe the oil or gas production decline curve for multiple wells, the observational units are oil or gas wells in a reservoir region, and each well has its own time-based profile of oil or gas production rates (usually, barrels per month). Hierarchical modeling is used to devise computation-based strategies for multiparameter problems. == Philosophy == Statistical methods and models commonly involve multiple parameters that can be regarded as related or connected in such a way that the problem implies a dependence of the joint probability model for these parameters. Individual degrees of belief, expressed in the form of probabilities, come with uncertainty. Amidst this is the change of the degrees of belief over time. As was stated by Professor José M. Bernardo and Professor Adrian F. Smith, "The actuality of the learning process consists in the evolution of individual and subjective beliefs about the reality." These subjective probabilities are more directly involved in the mind rather than the physical probabilities. Hence, it is with this need of updating beliefs that Bayesians have formulated an alternative statistical model which takes into account the prior occurrence of a particular event. == Bayes' theorem == The assumed occurrence of a real-world event will typically modify preferences between certain options. This is done by modifying the degrees of belief attached, by an individual, to the events defining the options. Suppose that, in a study of the effectiveness of cardiac treatments, the patients in hospital j have survival probability $\theta_j$. The survival probability will be updated with the occurrence of y, the event in which a controversial serum is created which, as believed by some, increases survival in cardiac patients. 
In order to make updated probability statements about $\theta_j$, given the occurrence of event y, we must begin with a model providing a joint probability distribution for $\theta_j$ and y. This can be written as a product of the two distributions that are often referred to as the prior distribution $P(\theta)$ and the sampling distribution $P(y\mid\theta)$ respectively: $P(\theta,y)=P(\theta)\,P(y\mid\theta)$. Using the basic property of conditional probability, the posterior distribution will yield: $P(\theta\mid y)=\frac{P(\theta,y)}{P(y)}=\frac{P(y\mid\theta)\,P(\theta)}{P(y)}$. This equation, showing the relationship between the conditional probability and the individual events, is known as Bayes' theorem. This simple expression encapsulates the technical core of Bayesian inference, which aims to deconstruct the probability $P(\theta\mid y)$ relative to solvable subsets of its supportive evidence. == Exchangeability == The usual starting point of a statistical analysis is the assumption that the n values $y_1,y_2,\ldots,y_n$ are exchangeable. If no information, other than the data y, is available to distinguish any of the $\theta_j$'s from any others, and no ordering or grouping of the parameters can be made, one must assume symmetry of prior distribution parameters. This symmetry is represented probabilistically by exchangeability. Generally, it is useful and appropriate to model data from an exchangeable distribution as independently and identically distributed, given some unknown parameter vector $\theta$, with distribution $P(\theta)$. 
=== Finite exchangeability === For a fixed number n, the set $y_1,y_2,\ldots,y_n$ is exchangeable if the joint probability $P(y_1,y_2,\ldots,y_n)$ is invariant under permutations of the indices. That is, for every permutation $\pi$ or $(\pi_1,\pi_2,\ldots,\pi_n)$ of (1, 2, …, n), $P(y_1,y_2,\ldots,y_n)=P(y_{\pi_1},y_{\pi_2},\ldots,y_{\pi_n})$. The following example is exchangeable but not independent and identically distributed (iid): Consider an urn with a red ball and a blue ball inside, with probability $\frac{1}{2}$ of drawing either. Balls are drawn without replacement, i.e. after one ball is drawn from the $n$ balls, there will be $n-1$ remaining balls left for the next draw. Let $Y_i=\begin{cases}1,&\text{if the }i\text{th ball is red},\\0,&\text{otherwise}.\end{cases}$ The probability of selecting a red ball in the first draw and a blue ball in the second draw is equal to the probability of selecting a blue ball on the first draw and a red on the second, both of which are 1/2: $P(y_1=1,y_2=0)=P(y_1=0,y_2=1)=\frac{1}{2}$. This makes $y_1$ and $y_2$ exchangeable. But the probability of selecting a red ball on the second draw given that the red ball has already been selected in the first is 0. This is not equal to the probability that the red ball is selected in the second draw, which is 1/2: $P(y_2=1\mid y_1=1)=0\neq P(y_2=1)=\frac{1}{2}$. 
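The urn example can be checked by brute-force enumeration. The following Python sketch is only an illustration; the variable names are hypothetical and exact fractions are used so that the comparisons are not affected by rounding.

```python
from fractions import Fraction
from itertools import permutations

balls = ("red", "blue")

# Every order in which the two balls can be drawn is equally likely.
orders = list(permutations(balls))
p_order = Fraction(1, len(orders))

# Joint distribution of (y1, y2), where y_i = 1 if the i-th ball drawn is red.
joint = {}
for order in orders:
    y = tuple(int(ball == "red") for ball in order)
    joint[y] = joint.get(y, Fraction(0)) + p_order

p_10 = joint.get((1, 0), Fraction(0))   # red first, blue second
p_01 = joint.get((0, 1), Fraction(0))   # blue first, red second

# Marginal and conditional probabilities for the independence check.
p_y1_red = sum(p for y, p in joint.items() if y[0] == 1)
p_y2_red = sum(p for y, p in joint.items() if y[1] == 1)
p_y2_red_given_y1_red = joint.get((1, 1), Fraction(0)) / p_y1_red

exchangeable = p_10 == p_01
independent = p_y2_red_given_y1_red == p_y2_red
```

Here `exchangeable` evaluates to true because swapping the indices leaves the joint probability unchanged, while `independent` evaluates to false because conditioning on the first draw changes the probability for the second.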
Thus, $y_1$ and $y_2$ are not independent. If $x_1,\ldots,x_n$ are independent and identically distributed, then they are exchangeable, but the converse is not necessarily true. === Infinite exchangeability === Infinite exchangeability is the property that every finite subset of an infinite sequence $y_1,y_2,\ldots$ is exchangeable. For any n, the sequence $y_1,y_2,\ldots,y_n$ is exchangeable. == Hierarchical models == === Components === Bayesian hierarchical modeling makes use of two important concepts in deriving the posterior distribution, namely: hyperparameters, the parameters of the prior distribution; and hyperpriors, the distributions of hyperparameters. Suppose a random variable Y follows a normal distribution with parameter $\theta$ as the mean and 1 as the variance, that is $Y\mid\theta\sim N(\theta,1)$. The tilde relation $\sim$ can be read as "has the distribution of" or "is distributed as". Suppose also that the parameter $\theta$ has a distribution given by a normal distribution with mean $\mu$ and variance 1, i.e. $\theta\mid\mu\sim N(\mu,1)$. Furthermore, $\mu$ follows another distribution given, for example, by the standard normal distribution, $N(0,1)$. The parameter $\mu$

    Read more →