AI For Students With Adhd

AI For Students With Adhd — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • PropBank

    PropBank

    PropBank is a corpus that is annotated with verbal propositions and their arguments—a "proposition bank". Although "PropBank" refers to a specific corpus produced by Martha Palmer et al., the term propbank is also coming to be used as a common noun referring to any corpus that has been annotated with propositions and their arguments. The PropBank project has played a role in research in natural language processing, and has been used in semantic role labelling. == Comparison == PropBank differs from FrameNet, the resource to which it is most frequently compared, in several ways. PropBank is a verb-oriented resource, while FrameNet is centered on the more abstract notion of frames, which generalizes descriptions across similar verbs (e.g. "describe" and "characterize") as well as nouns and other words (e.g. "description"). PropBank does not annotate events or states of affairs described using nouns. PropBank commits to annotating all verbs in a corpus, whereas the FrameNet project chooses sets of example sentences from a large corpus and only in a few cases has annotated longer continuous stretches of text. PropBank-style annotations often remain close to the syntactic level, while FrameNet-style annotations are sometimes more semantically motivated. From the start, PropBank was developed with the idea of serving as training data for machine learning-based semantic role labeling systems in mind. It requires that all arguments to a verb be syntactic constituents and different senses of a word are only distinguished if the differences bear on the arguments. Due to such differences, semantic role labeling with respect to PropBank is often a somewhat easier task than producing FrameNet-style annotations.

    Read more →
  • Multinomial logistic regression

    Multinomial logistic regression

    In statistics, multinomial logistic regression is a classification method that generalizes logistic regression to multiclass problems, i.e. with more than two possible discrete outcomes. That is, it is a model that is used to predict the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables (which may be real-valued, binary-valued, categorical-valued, etc.). Multinomial logistic regression is known by a variety of other names, including polytomous LR, multiclass LR, softmax regression, multinomial logit (mlogit), the maximum entropy (MaxEnt) classifier, and the conditional maximum entropy model. == Background == Multinomial logistic regression is used when the dependent variable in question is nominal (equivalently categorical, meaning that it falls into any one of a set of categories that cannot be ordered in any meaningful way) and for which there are more than two categories. Some examples would be: Which major will a college student choose, given their grades, stated likes and dislikes, etc.? Which blood type does a person have, given the results of various diagnostic tests? In a hands-free mobile phone dialing application, which person's name was spoken, given various properties of the speech signal? Which candidate will a person vote for, given particular demographic characteristics? Which country will a firm locate an office in, given the characteristics of the firm and of the various candidate countries? These are all statistical classification problems. They all have in common a dependent variable to be predicted that comes from one of a limited set of items that cannot be meaningfully ordered, as well as a set of independent variables (also known as features, explanators, etc.), which are used to predict the dependent variable. Multinomial logistic regression is a particular solution to classification problems that use a linear combination of the observed features and some problem-specific parameters to estimate the probability of each particular value of the dependent variable. The best values of the parameters for a given problem are usually determined from some training data (e.g. some people for whom both the diagnostic test results and blood types are known, or some examples of known words being spoken). == Assumptions == The multinomial logistic model assumes that data are case-specific; that is, each independent variable has a single value for each case. As with other types of regression, there is no need for the independent variables to be statistically independent from each other (unlike, for example, in a naive Bayes classifier); however, collinearity is assumed to be relatively low, as it becomes difficult to differentiate between the impact of several variables if this is not the case. If the multinomial logit is used to model choices, it relies on the assumption of independence of irrelevant alternatives (IIA), which is not always desirable. This assumption states that the odds of preferring one class over another do not depend on the presence or absence of other "irrelevant" alternatives. For example, the relative probabilities of taking a car or bus to work do not change if a bicycle is added as an additional possibility. This allows the choice of K alternatives to be modeled as a set of K − 1 independent binary choices, in which one alternative is chosen as a "pivot" and the other K − 1 compared against it, one at a time. The IIA hypothesis is a core hypothesis in rational choice theory; however numerous studies in psychology show that individuals often violate this assumption when making choices. An example of a problem case arises if choices include a car and a blue bus. Suppose the odds ratio between the two is 1 : 1. Now if the option of a red bus is introduced, a person may be indifferent between a red and a blue bus, and hence may exhibit a car : blue bus : red bus odds ratio of 1 : 0.5 : 0.5, thus maintaining a 1 : 1 ratio of car : any bus while adopting a changed car : blue bus ratio of 1 : 0.5. Here the red bus option was not in fact irrelevant, because a red bus was a perfect substitute for a blue bus. If the multinomial logit is used to model choices, it may in some situations impose too much constraint on the relative preferences between the different alternatives. It is especially important to take into account if the analysis aims to predict how choices would change if one alternative were to disappear (for instance if one political candidate withdraws from a three candidate race). Other models like the nested logit or the multinomial probit may be used in such cases as they allow for violation of the IIA. == Model == === Introduction === There are multiple equivalent ways to describe the mathematical model underlying multinomial logistic regression. This can make it difficult to compare different treatments of the subject in different texts. The article on logistic regression presents a number of equivalent formulations of simple logistic regression, and many of these have analogues in the multinomial logit model. The idea behind all of them, as in many other statistical classification techniques, is to construct a linear predictor function that constructs a score from a set of weights that are linearly combined with the explanatory variables (features) of a given observation using a dot product: score ⁡ ( X i , k ) = β k ⋅ X i , {\displaystyle \operatorname {score} (\mathbf {X} _{i},k)={\boldsymbol {\beta }}_{k}\cdot \mathbf {X} _{i},} where Xi is the vector of explanatory variables describing observation i, βk is a vector of weights (or regression coefficients) corresponding to outcome k, and score(Xi, k) is the score associated with assigning observation i to category k. In discrete choice theory, where observations represent people and outcomes represent choices, the score is considered the utility associated with person i choosing outcome k. The predicted outcome is the one with the highest score. The difference between the multinomial logit model and numerous other methods, models, algorithms, etc. with the same basic setup (the perceptron algorithm, support vector machines, linear discriminant analysis, etc.) is the procedure for determining (training) the optimal weights/coefficients and the way that the score is interpreted. In particular, in the multinomial logit model, the score can directly be converted to a probability value, indicating the probability of observation i choosing outcome k given the measured characteristics of the observation. This provides a principled way of incorporating the prediction of a particular multinomial logit model into a larger procedure that may involve multiple such predictions, each with a possibility of error. Without such means of combining predictions, errors tend to multiply. For example, imagine a large predictive model that is broken down into a series of submodels where the prediction of a given submodel is used as the input of another submodel, and that prediction is in turn used as the input into a third submodel, etc. If each submodel has 90% accuracy in its predictions, and there are five submodels in series, then the overall model has only 0.95 = 59% accuracy. If each submodel has 80% accuracy, then overall accuracy drops to 0.85 = 33% accuracy. This issue is known as error propagation and is a serious problem in real-world predictive models, which are usually composed of numerous parts. Predicting probabilities of each possible outcome, rather than simply making a single optimal prediction, is one means of alleviating this issue. === Setup === The basic setup is the same as in logistic regression, the only difference being that the dependent variables are categorical rather than binary, i.e. there are K possible outcomes rather than just two. The following description is somewhat shortened; for more details, consult the logistic regression article. ==== Data points ==== Specifically, it is assumed that we have a series of N observed data points. Each data point i (ranging from 1 to N) consists of a set of M explanatory variables x1,i ... xM,i (also known as independent variables, predictor variables, features, etc.), and an associated categorical outcome Yi (also known as dependent variable, response variable), which can take on one of K possible values. These possible values represent logically separate categories (e.g. different political parties, blood types, etc.), and are often described mathematically by arbitrarily assigning each a number from 1 to K. The explanatory variables and outcome represent observed properties of the data points, and are often thought of as originating in the observations of N "experiments" — although an "experiment" may consist of nothing more than gathering data. The goal of multinomial logistic regression is to construct a model that explains the relationship between the explanatory variables and the outcome, so tha

    Read more →
  • L-1 Identity Solutions

    L-1 Identity Solutions

    L-1 Identity Solutions, Inc. was an American biometric technology company headquartered in Stamford, Connecticut, specializing in identity management products and services including facial recognition systems, fingerprint readers, and secure credentialing solutions for governments and commercial enterprises. The company's shares traded on the New York Stock Exchange under the ticker symbol "ID." == History == L-1 Identity Solutions was formed on August 29, 2006, from a merger of Viisage Technology, Inc. and Identix Incorporated. Prior to the Safran acquisition, L-1 divested its Intelligence Services Group (ISG) comprising SpecTal LLC, Advanced Concepts Inc., and McClendon LLC to BAE Systems, Inc. for approximately $297 million. The transaction, initially announced in September 2010, closed on February 15, 2011, with more than 1,000 ISG employees joining BAE Systems' Intelligence & Security sector. It specializes in selling face recognition systems, electronic passports, such as Fly Clear, and other biometric technology to governments such as the United States and Saudi Arabia. It also licenses technology to other companies internationally, including China. On July 26, 2011, Safran (NYSE Euronext Paris: SAF) acquired L-1 Identity Solutions, Inc. for a total cash amount of USD 1.09 billion. L-1 was part of Morpho's MorphoTrust department which rebranded to Idemia in 2017. Bioscrypt is a biometrics research, development and manufacturing company purchased by L-1 Identity Solutions. It provides fingerprint IP readers for physical access control systems, Facial recognition system readers for contactless access control authentication and OEM fingerprint modules for embedded applications. According to IMS Research, Bioscrypt has been the world market leader in biometric access control for enterprises (since 2006) with a worldwide market share of over 13%. In 2011, Bioscrypt was sold to Safran Morpho.

    Read more →
  • Win–stay, lose–switch

    Win–stay, lose–switch

    In psychology, game theory, statistics, and machine learning, win–stay, lose–switch (also win–stay, lose–shift or Pavlov, named after Ivan Pavlov) is a heuristic learning strategy used to model learning in decision situations. It was first invented as an improvement over randomization in bandit problems. It was later applied to the prisoner's dilemma in order to model the evolution of altruism. In most versions, it starts either with a cooperate, then proceeds as always, or starts with a "probe" of cooperate-defect-cooperate to determine the other player's strategy. A mutual cooperation is regarded as a win. The learning rule bases its decision only on the outcome of the previous play. Outcomes are divided into successes (wins) and failures (losses). If the play on the previous round resulted in a success, then the agent plays the same strategy on the next round. Alternatively, if the play resulted in a failure the agent switches to another action. A large-scale empirical study of players of the game rock, paper, scissors shows that a variation of this strategy is adopted by real-world players of the game, instead of the Nash equilibrium strategy of choosing entirely at random between the three options.

    Read more →
  • Free boundary condition

    Free boundary condition

    In image processing, the free boundary condition is the convention used when applying a convolution kernel to a digital image in which pixel locations that lie outside the image boundaries are interpreted as having a value of zero.[1] The question of what value to assign out-of-bounds pixels may arise, for instance, when applying a 3×3 kernel to the corner pixel in an image.

    Read more →
  • Genetic programming

    Genetic programming

    Genetic programming (GP) is an evolutionary algorithm, an artificial intelligence technique mimicking natural evolution, which operates on a population of programs. It applies the genetic operators selection according to a predefined fitness measure, mutation and crossover. The crossover operation involves swapping specified parts of selected pairs (parents) to produce new and different offspring that become part of the new generation of programs. Some programs not selected for reproduction are copied from the current generation to the new generation. Mutation involves substitution of some random part of a program with some other random part of a program. Then the selection and other operations are recursively applied to the new generation of programs. Typically, members of each new generation are on average more fit than the members of the previous generation, and the best-of-generation program is often better than the best-of-generation programs from previous generations. Termination of the evolution usually occurs when some individual program reaches a predefined proficiency or fitness level. It may and often does happen that a particular run of the algorithm results in premature convergence to some local maximum that is not a globally optimal or even good solution. Multiple runs (dozens to hundreds) are usually necessary to produce a very good result. It may also be necessary to have a large starting population size and variability of the individuals to avoid pathologies. == History == The first record of the proposal to evolve programs is probably that of Alan Turing in 1950 in "Computing Machinery and Intelligence". There was a gap of 25 years before the publication of John Holland's 'Adaptation in Natural and Artificial Systems' laid out the theoretical and empirical foundations of the science. In 1981, Richard Forsyth demonstrated the successful evolution of small programs, represented as trees, to perform classification of crime scene evidence for the UK Home Office. Although the idea of evolving programs, initially in the computer language Lisp, was current amongst John Holland's students, it was not until they organised the first Genetic Algorithms (GA) conference in Pittsburgh that Nichael Cramer published evolved programs in two specially designed languages, which included the first statement of modern "tree-based" genetic programming (that is, procedural languages organized in tree-based structures and operated on by suitably defined GA-operators). In 1988, John Koza (also a PhD student of John Holland) patented his invention of a GA for program evolution. This was followed by publication in the International Joint Conference on Artificial Intelligence IJCAI-89. Koza followed this with 205 publications on "genetic programming", a term coined by David Goldberg, also a PhD student of John Holland. However, it is the series of 4 books by Koza, starting in 1992 with accompanying videos, that really established GP. Subsequently, there was an enormous expansion of the number of publications with the Genetic Programming Bibliography, surpassing 10,000 entries. In 2010, Koza listed 77 results where genetic programming was human competitive. The departure of GP from the rigid, fixed-length representations typical of early GA models was not entirely without precedent. Early work on variable-length representations laid the groundwork. One notable example is messy genetic algorithms, which introduced irregular, variable-length chromosomes to address building block disruption and positional bias in standard GAs. Another precursor was robot trajectory programming, where genome representations encoded program instructions for robotic movements—structures inherently variable in length. Even earlier, unfixed-length representations were proposed in a doctoral dissertation by Cavicchio, who explored adaptive search using simulated evolution. His work provided foundational ideas for flexible program structures. In 1996, Koza started the annual Genetic Programming conference, which was followed in 1998 by the annual EuroGP conference, and the first book in a GP series edited by Koza. 1998 also saw the first GP textbook. GP continued to flourish, leading to the first specialist GP journal and three years later (2003) the annual Genetic Programming Theory and Practice (GPTP) workshop was established by Rick Riolo. Genetic programming papers continue to be published at a diversity of conferences and associated journals. Today there are nineteen GP books including several for students. === Foundational work in GP === Early work that set the stage for current genetic programming research topics and applications is diverse, and includes software synthesis and repair, predictive modeling, data mining, financial modeling, soft sensors, design, and image processing. Applications in some areas, such as design, often make use of intermediate representations, such as Fred Gruau's cellular encoding. Industrial uptake has been significant in several areas including finance, the chemical industry, bioinformatics and the steel industry. == Methods == === Program representation === GP evolves computer programs, traditionally represented in memory as tree structures. Trees can be easily evaluated in a recursive manner. Every internal node has an operator function and every terminal node has an operand, making mathematical expressions easy to evolve and evaluate. Thus traditionally GP favors the use of programming languages that naturally embody tree structures (for example, Lisp; other functional programming languages are also suitable). Non-tree representations have been suggested and successfully implemented, such as linear genetic programming, which perhaps suits the more traditional imperative languages. The commercial GP software Discipulus uses automatic induction of binary machine code ("AIM") to achieve better performance. μGP uses directed multigraphs to generate programs that fully exploit the syntax of a given assembly language. Multi expression programming uses three-address code for encoding solutions. Other program representations on which significant research and development have been conducted include programs for stack-based virtual machines, and sequences of integers that are mapped to arbitrary programming languages via grammars. Cartesian genetic programming is another form of GP, which uses a graph representation instead of the usual tree based representation to encode computer programs. Most representations have structurally noneffective code (introns). Such non-coding genes may seem to be useless because they have no effect on the performance of any one individual. However, they alter the probabilities of generating different offspring under the variation operators, and thus alter the individual's variational properties. Experiments seem to show faster convergence when using program representations that allow such non-coding genes, compared to program representations that do not have any non-coding genes. Instantiations may have both trees with introns and those without; the latter are called canonical trees. Special canonical crossover operators are introduced that maintain the canonical structure of parents in their children. === Initialisation === The methods for creation of the initial population include: Grow creates the individuals sequentially. Every GP tree is created starting from the root, creating functional nodes with children as well as terminal nodes up to a certain depth. Full is similar to the Grow. The difference is that all brunches in a tree are of same predetermined depth. Ramped half-and-half creates a population consisting of m d − 1 {\displaystyle md-1} parts and a maximum depth of m d {\displaystyle md} for its trees. The first part has a maximum depth of 2, second of 3 and so on up to the m d − 1 {\displaystyle md-1} -th part with maximum depth m d {\displaystyle md} . Half of every part is created by Grow, while the other part is created by Full. === Selection === Selection is a process whereby certain individuals are selected from the current generation that would serve as parents for the next generation. The individuals are selected probabilistically such that the better performing individuals have a higher chance of getting selected. The most commonly used selection method in GP is tournament selection, although other methods such as fitness proportionate selection, lexicase selection, and others have been demonstrated to perform better for many GP problems. Elitism, which involves seeding the next generation with the best individual (or best n individuals) from the current generation, is a technique sometimes employed to avoid regression. === Crossover === In genetic programming two fit individuals are chosen from the population to be parents for one or two children. In tree genetic programming, these parents are represented as inverted lisp like trees, with their root nodes at the top. In subtree cro

    Read more →
  • Proximal policy optimization

    Proximal policy optimization

    Proximal policy optimization (PPO) is a reinforcement learning (RL) algorithm for training an intelligent agent. Specifically, it is a policy gradient method, often used for deep RL when the policy network is very large. == History == The predecessor to PPO, Trust Region Policy Optimization (TRPO), was published in 2015. It addressed the instability issue of another algorithm, the Deep Q-Network (DQN), by using the trust region method to limit the KL divergence between the old and new policies. However, TRPO uses the Hessian matrix (a matrix of second derivatives) to enforce the trust region, but the Hessian is inefficient for large-scale problems. PPO was published in 2017. It was essentially an approximation of TRPO that does not require computing the Hessian. The KL divergence constraint was approximated by simply clipping the policy gradient. Since 2018, PPO was the default RL algorithm at OpenAI. PPO has been applied to many areas, such as controlling a robotic arm, beating professional players at Dota 2 (OpenAI Five), and playing Atari games. == TRPO == TRPO, the predecessor of PPO, is an on-policy algorithm. It can be used for environments with either discrete or continuous action spaces. The pseudocode is as follows: Input: initial policy parameters θ 0 {\textstyle \theta _{0}} , initial value function parameters ϕ 0 {\textstyle \phi _{0}} Hyperparameters: KL-divergence limit δ {\textstyle \delta } , backtracking coefficient α {\textstyle \alpha } , maximum number of backtracking steps K {\textstyle K} for k = 0 , 1 , 2 , … {\textstyle k=0,1,2,\ldots } do Collect set of trajectories D k = { τ i } {\textstyle {\mathcal {D}}_{k}=\left\{\tau _{i}\right\}} by running policy π k = π ( θ k ) {\textstyle \pi _{k}=\pi \left(\theta _{k}\right)} in the environment. Compute rewards-to-go R ^ t {\textstyle {\hat {R}}_{t}} . Compute advantage estimates, A ^ t {\textstyle {\hat {A}}_{t}} (using any method of advantage estimation) based on the current value function V ϕ k {\textstyle V_{\phi _{k}}} . Estimate policy gradient as g ^ k = 1 | D k | ∑ τ ∈ D k ∑ t = 0 T ∇ θ log ⁡ π θ ( a t ∣ s t ) | θ k A ^ t {\displaystyle {\hat {g}}_{k}=\left.{\frac {1}{\left|{\mathcal {D}}_{k}\right|}}\sum _{\tau \in {\mathcal {D}}_{k}}\sum _{t=0}^{T}\nabla _{\theta }\log \pi _{\theta }\left(a_{t}\mid s_{t}\right)\right|_{\theta _{k}}{\hat {A}}_{t}} Use the conjugate gradient algorithm to compute x ^ k ≈ H ^ k − 1 g ^ k {\displaystyle {\hat {x}}_{k}\approx {\hat {H}}_{k}^{-1}{\hat {g}}_{k}} where H ^ k {\textstyle {\hat {H}}_{k}} is the Hessian of the sample average KL-divergence. Update the policy by backtracking line search with θ k + 1 = θ k + α j 2 δ x ^ k T H ^ k x ^ k x ^ k {\displaystyle \theta _{k+1}=\theta _{k}+\alpha ^{j}{\sqrt {\frac {2\delta }{{\hat {x}}_{k}^{T}{\hat {H}}_{k}{\hat {x}}_{k}}}}{\hat {x}}_{k}} where j ∈ { 0 , 1 , 2 , … K } {\textstyle j\in \{0,1,2,\ldots K\}} is the smallest value which improves the sample loss and satisfies the sample KL-divergence constraint. Fit value function by regression on mean-squared error: ϕ k + 1 = arg ⁡ min ϕ 1 | D k | T ∑ τ ∈ D k ∑ t = 0 T ( V ϕ ( s t ) − R ^ t ) 2 {\displaystyle \phi _{k+1}=\arg \min _{\phi }{\frac {1}{\left|{\mathcal {D}}_{k}\right|T}}\sum _{\tau \in {\mathcal {D}}_{k}}\sum _{t=0}^{T}\left(V_{\phi }\left(s_{t}\right)-{\hat {R}}_{t}\right)^{2}} typically via some gradient descent algorithm. == PPO == The pseudocode is as follows: Input: initial policy parameters θ 0 {\textstyle \theta _{0}} , initial value function parameters ϕ 0 {\textstyle \phi _{0}} for k = 0 , 1 , 2 , … {\textstyle k=0,1,2,\ldots } do Collect set of trajectories D k = { τ i } {\textstyle {\mathcal {D}}_{k}=\left\{\tau _{i}\right\}} by running policy π k = π ( θ k ) {\textstyle \pi _{k}=\pi \left(\theta _{k}\right)} in the environment. Compute rewards-to-go R ^ t {\textstyle {\hat {R}}_{t}} . Compute advantage estimates, A ^ t {\textstyle {\hat {A}}_{t}} (using any method of advantage estimation) based on the current value function V ϕ k {\textstyle V_{\phi _{k}}} . Update the policy by maximizing the PPO-Clip objective: θ k + 1 = arg ⁡ max θ 1 | D k | T ∑ τ ∈ D k ∑ t = 0 T min ( π θ ( a t ∣ s t ) π θ k ( a t ∣ s t ) A π θ k ( s t , a t ) , g ( ϵ , A π θ k ( s t , a t ) ) ) {\displaystyle \theta _{k+1}=\arg \max _{\theta }{\frac {1}{\left|{\mathcal {D}}_{k}\right|T}}\sum _{\tau \in {\mathcal {D}}_{k}}\sum _{t=0}^{T}\min \left({\frac {\pi _{\theta }\left(a_{t}\mid s_{t}\right)}{\pi _{\theta _{k}}\left(a_{t}\mid s_{t}\right)}}A^{\pi _{\theta _{k}}}\left(s_{t},a_{t}\right),\quad g\left(\epsilon ,A^{\pi _{\theta _{k}}}\left(s_{t},a_{t}\right)\right)\right)} typically via stochastic gradient ascent with Adam. Fit value function by regression on mean-squared error: ϕ k + 1 = arg ⁡ min ϕ 1 | D k | T ∑ τ ∈ D k ∑ t = 0 T ( V ϕ ( s t ) − R ^ t ) 2 {\displaystyle \phi _{k+1}=\arg \min _{\phi }{\frac {1}{\left|{\mathcal {D}}_{k}\right|T}}\sum _{\tau \in {\mathcal {D}}_{k}}\sum _{t=0}^{T}\left(V_{\phi }\left(s_{t}\right)-{\hat {R}}_{t}\right)^{2}} typically via some gradient descent algorithm. Like all policy gradient methods, PPO is used for training an RL agent whose actions are determined by a differentiable policy function by gradient ascent. Intuitively, a policy gradient method takes small policy update steps, so the agent can reach higher and higher rewards in expectation. Policy gradient methods may be unstable: A step size that is too big may direct the policy in a suboptimal direction, thus having little possibility of recovery; a step size that is too small lowers the overall efficiency. To solve the instability, PPO implements a clip function that constrains the policy update of an agent from being too large, so that larger step sizes may be used without negatively affecting the gradient ascent process. === Basic concepts === To begin the PPO training process, the agent is set in an environment to perform actions based on its current input. In the early phase of training, the agent can freely explore solutions and keep track of the result. Later, with a certain amount of transition samples and policy updates, the agent will select an action to take by randomly sampling from the probability distribution P ( A | S ) {\displaystyle P(A|S)} generated by the policy network. The actions that are most likely to be beneficial will have the highest probability of being selected from the random sample. After an agent arrives at a different scenario (a new state) by acting, it is rewarded with a positive reward or a negative reward. The objective of an agent is to maximize the cumulative reward signal across sequences of states, known as episodes. === Policy gradient laws: the advantage function === The advantage function (denoted as A {\displaystyle A} ) is central to PPO, as it tries to answer the question of whether a specific action of the agent is better or worse than some other possible action in a given state. By definition, the advantage function is an estimate of the relative value for a selected action. If the output of this function is positive, it means that the action in question is better than the average return, so the possibilities of selecting that specific action will increase. The opposite is true for a negative advantage output. The advantage function can be defined as A = Q − V {\displaystyle A=Q-V} , where Q {\displaystyle Q} is the discounted sum of rewards (the total weighted reward for the completion of an episode) and V {\displaystyle V} is the baseline estimate. Since the advantage function is calculated after the completion of an episode, the program records the outcome of the episode. Therefore, calculating advantage is essentially an unsupervised learning problem. The baseline estimate comes from the value function that outputs the expected discounted sum of an episode starting from the current state. In the PPO algorithm, the baseline estimate will be noisy (with some variance), as it also uses a neural network, like the policy function itself. With Q {\displaystyle Q} and V {\displaystyle V} computed, the advantage function is calculated by subtracting the baseline estimate from the actual discounted return. If A > 0 {\displaystyle A>0} , the actual return of the action is better than the expected return from experience; if A < 0 {\displaystyle A<0} , the actual return is worse. === Ratio function === In PPO, the ratio function ( r t {\displaystyle r_{t}} ) calculates the probability of selecting action a {\displaystyle a} in state s {\displaystyle s} given the current policy network, divided by the previous probability under the old policy. In other words: If r t ( θ ) > 1 {\displaystyle r_{t}(\theta )>1} , where θ {\displaystyle \theta } are the policy network parameters, then selecting action a {\displaystyle a} in state s {\displaystyle s} is more likely based on the current policy than the previous policy. If 0 ≤ r t ( θ ) < 1 {\displaystyle 0\leq r_{t}(\theta )<1} , then selecting actio

    Read more →
  • Plate notation

    Plate notation

    In Bayesian inference, plate notation is a method of representing variables that repeat in a graphical model. Instead of drawing each repeated variable individually, a plate or rectangle is used to group variables into a subgraph that repeat together, and a number is drawn on the plate to represent the number of repetitions of the subgraph in the plate. The assumptions are that the subgraph is duplicated that many times, the variables in the subgraph are indexed by the repetition number, and any links that cross a plate boundary are replicated once for each subgraph repetition. == Example == In this example, we consider Latent Dirichlet allocation, a Bayesian network that models how documents in a corpus are topically related. There are two variables not in any plate; α is the parameter of the uniform Dirichlet prior on the per-document topic distributions, and β is the parameter of the uniform Dirichlet prior on the per-topic word distribution. The outermost plate represents all the variables related to a specific document, including θ i {\displaystyle \theta _{i}} , the topic distribution for document i. The M in the corner of the plate indicates that the variables inside are repeated M times, once for each document. The inner plate represents the variables associated with each of the N i {\displaystyle N_{i}} words in document i: z i j {\displaystyle z_{ij}} is the topic distribution for the jth word in document i, and w i j {\displaystyle w_{ij}} is the actual word used. The N in the corner represents the repetition of the variables in the inner plate N j {\displaystyle N_{j}} times, once for each word in document i. The circle representing the individual words is shaded, indicating that each w i j {\displaystyle w_{ij}} is observable, and the other circles are empty, indicating that the other variables are latent variables. The directed edges between variables indicate dependencies between the variables: for example, each w i j {\displaystyle w_{ij}} depends on z i j {\displaystyle z_{ij}} and β. == Extensions == A number of extensions have been created by various authors to express more information than simply the conditional relationships. However, few of these have become standard. Perhaps the most commonly used extension is to use rectangles in place of circles to indicate non-random variables—either parameters to be computed, hyperparameters given a fixed value (or computed through empirical Bayes), or variables whose values are computed deterministically from a random variable. The diagram on the right shows a few more non-standard conventions used in some articles in Wikipedia (e.g. variational Bayes): Variables that are actually random vectors are indicated by putting the vector size in brackets in the middle of the node. Variables that are actually random matrices are similarly indicated by putting the matrix size in brackets in the middle of the node, with commas separating row size from column size. Categorical variables are indicated by placing their size (without a bracket) in the middle of the node. Categorical variables that act as "switches", and which pick one or more other random variables to condition on from a large set of such variables (e.g. mixture components), are indicated with a special type of arrow containing a squiggly line and ending in a T junction. Boldface is consistently used for vector or matrix nodes (but not categorical nodes). == Software implementation == Plate notation has been implemented in various TeX/LaTeX drawing packages, but also as part of graphical user interfaces to Bayesian statistics programs such as BUGS and BayesiaLab and PyMC.

    Read more →
  • CloudSim

    CloudSim

    CloudSim is a framework for modeling and simulation of cloud computing infrastructures and services. Originally built primarily at the Cloud Computing and Distributed Systems (CLOUDS) Laboratory, the University of Melbourne, Australia, CloudSim has become one of the most popular open source cloud simulators in the research and academia. CloudSim is completely written in Java. The latest version of CloudSim is CloudSim v6.0.0-beta on GitHub. Cloudsim is suitable for implementing simulations scenarios based on Infrastructure as a service as well as with latest version Platform as a service, so get started here == CloudSim extensions == Initially developed as a stand-alone cloud simulator, CloudSim has further been extended by independent researchers. GPUCloudSim is an enhanced CloudSim tool for modeling GPU-based cloud infrastructures and data centers. It offers simulations for multi-GPU setups, customizable GPU policies, GPU remoting, etc. It also examines performance impacts and interactions within virtualized GPU environments. CloudSim Plus is a totally re-engineered CloudSim fork providing general-purpose cloud computing simulation and exclusive features such as: multi-cloud simulations, vertical and horizontal VM scaling, host fault injection and recovery, joint power- and network-aware simulations and more. Though CloudSim itself does not have a graphical user interface, extensions such as CloudReports offer a GUI for CloudSim simulations. CloudSimEx extends CloudSim by adding MapReduce simulation capabilities and parallel simulations. Cloud2Sim extends CloudSim to execute on multiple distributed servers, by leveraging Hazelcast distributed execution framework. RECAP DES extends the CloudSim Plus framework to model synchronous hierarchical architectures (such as ElasticSearch). ThermoSim extends CloudSim toolkit by incorporating thermal characteristics, and uses Deep learning-based temperature predictor for cloud nodes.

    Read more →
  • Diffusion map

    Diffusion map

    Diffusion maps is a dimensionality reduction or feature extraction algorithm introduced by Coifman and Lafon which computes a family of embeddings of a data set into Euclidean space (often low-dimensional) whose coordinates can be computed from the eigenvectors and eigenvalues of a diffusion operator on the data. The Euclidean distance between points in the embedded space is equal to the "diffusion distance" between probability distributions centered at those points. Different from linear dimensionality reduction methods such as principal component analysis (PCA), diffusion maps are part of the family of nonlinear dimensionality reduction methods which focus on discovering the underlying manifold that the data has been sampled from. By integrating local similarities at different scales, diffusion maps give a global description of the data-set. Compared with other methods, the diffusion map algorithm is robust to noise perturbation and computationally inexpensive. == Definition of diffusion maps == Following and , diffusion maps can be defined in four steps. === Connectivity === Diffusion maps exploit the relationship between heat diffusion and random walk Markov chain. The basic observation is that if we take a random walk on the data, walking to a nearby data-point is more likely than walking to another that is far away. Let ( X , A , μ ) {\displaystyle (X,{\mathcal {A}},\mu )} be a measure space, where X {\displaystyle X} is the data set and μ {\displaystyle \mu } represents the distribution of the points on X {\displaystyle X} . Based on this, the connectivity k {\displaystyle k} between two data points, x {\displaystyle x} and y {\displaystyle y} , can be defined as the probability of walking from x {\displaystyle x} to y {\displaystyle y} in one step of the random walk. Usually, this probability is specified in terms of a kernel function of the two points: k : X × X → R {\displaystyle k:X\times X\rightarrow \mathbb {R} } . For example, the popular Gaussian kernel: k ( x , y ) = exp ⁡ ( − | | x − y | | 2 ϵ ) {\displaystyle k(x,y)=\exp \left(-{\frac {||x-y||^{2}}{\epsilon }}\right)} More generally, the kernel function has the following properties k ( x , y ) = k ( y , x ) {\displaystyle k(x,y)=k(y,x)} ( k {\displaystyle k} is symmetric) k ( x , y ) ≥ 0 ∀ x , y {\displaystyle k(x,y)\geq 0\,\,\forall x,y} ( k {\displaystyle k} is positivity preserving). The kernel constitutes the prior definition of the local geometry of the data-set. Since a given kernel will capture a specific feature of the data set, its choice should be guided by the application that one has in mind. This is a major difference with methods such as principal component analysis, where correlations between all data points are taken into account at once. Given ( X , k ) {\displaystyle (X,k)} , we can then construct a reversible discrete-time Markov chain on X {\displaystyle X} (a process known as the normalized graph Laplacian construction): d ( x ) = ∫ X k ( x , y ) d μ ( y ) {\displaystyle d(x)=\int _{X}k(x,y)d\mu (y)} and define: p ( x , y ) = k ( x , y ) d ( x ) {\displaystyle p(x,y)={\frac {k(x,y)}{d(x)}}} Although the new normalized kernel does not inherit the symmetric property, it does inherit the positivity-preserving property and gains a conservation property: ∫ X p ( x , y ) d μ ( y ) = 1 {\displaystyle \int _{X}p(x,y)d\mu (y)=1} === Diffusion process === From p ( x , y ) {\displaystyle p(x,y)} we can construct a transition matrix of a Markov chain ( M {\displaystyle M} ) on X {\displaystyle X} . In other words, p ( x , y ) {\displaystyle p(x,y)} represents the one-step transition probability from x {\displaystyle x} to y {\displaystyle y} , and M t {\displaystyle M^{t}} gives the t-step transition matrix. We define the diffusion matrix L {\displaystyle L} (it is also a version of graph Laplacian matrix) L i , j = k ( x i , x j ) {\displaystyle L_{i,j}=k(x_{i},x_{j})\,} We then define the new kernel L i , j ( α ) = k ( α ) ( x i , x j ) = L i , j ( d ( x i ) d ( x j ) ) α {\displaystyle L_{i,j}^{(\alpha )}=k^{(\alpha )}(x_{i},x_{j})={\frac {L_{i,j}}{(d(x_{i})d(x_{j}))^{\alpha }}}\,} or equivalently, L ( α ) = D − α L D − α {\displaystyle L^{(\alpha )}=D^{-\alpha }LD^{-\alpha }\,} where D is a diagonal matrix and D i , i = ∑ j L i , j . {\displaystyle D_{i,i}=\sum _{j}L_{i,j}.} We apply the graph Laplacian normalization to this new kernel: M = ( D ( α ) ) − 1 L ( α ) , {\displaystyle M=({D}^{(\alpha )})^{-1}L^{(\alpha )},\,} where D ( α ) {\displaystyle D^{(\alpha )}} is a diagonal matrix and D i , i ( α ) = ∑ j L i , j ( α ) . {\displaystyle {D}_{i,i}^{(\alpha )}=\sum _{j}L_{i,j}^{(\alpha )}.} p ( x j , t | x i ) = M i , j t {\displaystyle p(x_{j},t|x_{i})=M_{i,j}^{t}\,} One of the main ideas of the diffusion framework is that running the chain forward in time (taking larger and larger powers of M {\displaystyle M} ) reveals the geometric structure of X {\displaystyle X} at larger and larger scales (the diffusion process). Specifically, the notion of a cluster in the data set is quantified as a region in which the probability of escaping this region is low (within a certain time t). Therefore, t not only serves as a time parameter, but it also has the dual role of scale parameter. The eigendecomposition of the matrix M t {\displaystyle M^{t}} yields M i , j t = ∑ l λ l t ψ l ( x i ) ϕ l ( x j ) {\displaystyle M_{i,j}^{t}=\sum _{l}\lambda _{l}^{t}\psi _{l}(x_{i})\phi _{l}(x_{j})\,} where { λ l } {\displaystyle \{\lambda _{l}\}} is the sequence of eigenvalues of M {\displaystyle M} and { ψ l } {\displaystyle \{\psi _{l}\}} and { ϕ l } {\displaystyle \{\phi _{l}\}} are the biorthogonal left and right eigenvectors respectively. Due to the spectrum decay of the eigenvalues, only a few terms are necessary to achieve a given relative accuracy in this sum. ==== Parameter α and the diffusion operator ==== The reason to introduce the normalization step involving α {\displaystyle \alpha } is to tune the influence of the data point density on the infinitesimal transition of the diffusion. In some applications, the sampling of the data is generally not related to the geometry of the manifold we are interested in describing. In this case, we can set α = 1 {\displaystyle \alpha =1} and the diffusion operator approximates the Laplace–Beltrami operator. We then recover the Riemannian geometry of the data set regardless of the distribution of the points. To describe the long-term behavior of the point distribution of a system of stochastic differential equations, we can use α = 0.5 {\displaystyle \alpha =0.5} and the resulting Markov chain approximates the Fokker–Planck diffusion. With α = 0 {\displaystyle \alpha =0} , it reduces to the classical graph Laplacian normalization. === Diffusion distance === The diffusion distance at time t {\displaystyle t} between two points can be measured as the similarity of two points in the observation space with the connectivity between them. It is given by D t ( x i , x j ) 2 = ∑ y ( p ( y , t | x i ) − p ( y , t | x j ) ) 2 ϕ 0 ( y ) {\displaystyle D_{t}(x_{i},x_{j})^{2}=\sum _{y}{\frac {(p(y,t|x_{i})-p(y,t|x_{j}))^{2}}{\phi _{0}(y)}}} where ϕ 0 ( y ) {\displaystyle \phi _{0}(y)} is the stationary distribution of the Markov chain, given by the first left eigenvector of M {\displaystyle M} . Explicitly: ϕ 0 ( y ) = d ( y ) ∑ z ∈ X d ( z ) {\displaystyle \phi _{0}(y)={\frac {d(y)}{\sum _{z\in X}d(z)}}} Intuitively, D t ( x i , x j ) {\displaystyle D_{t}(x_{i},x_{j})} is small if there is a large number of short paths connecting x i {\displaystyle x_{i}} and x j {\displaystyle x_{j}} . There are several interesting features associated with the diffusion distance, based on our previous discussion that t {\displaystyle t} also serves as a scale parameter: Points are closer at a given scale (as specified by D t ( x i , x j ) {\displaystyle D_{t}(x_{i},x_{j})} ) if they are highly connected in the graph, therefore emphasizing the concept of a cluster. This distance is robust to noise, since the distance between two points depends on all possible paths of length t {\displaystyle t} between the points. From a machine learning point of view, the distance takes into account all evidences linking x i {\displaystyle x_{i}} to x j {\displaystyle x_{j}} , allowing us to conclude that this distance is appropriate for the design of inference algorithms based on the majority of preponderance. === Diffusion process and low-dimensional embedding === The diffusion distance can be calculated using the eigenvectors by D t ( x i , x j ) 2 = ∑ l λ l 2 t ( ψ l ( x i ) − ψ l ( x j ) ) 2 {\displaystyle D_{t}(x_{i},x_{j})^{2}=\sum _{l}\lambda _{l}^{2t}(\psi _{l}(x_{i})-\psi _{l}(x_{j}))^{2}\,} So the eigenvectors can be used as a new set of coordinates for the data. The diffusion map is defined as: Ψ t ( x ) = ( λ 1 t ψ 1 ( x ) , λ 2 t ψ 2 ( x ) , … , λ k t ψ k ( x ) ) {\displaystyle \Psi _{t}(x)=(\lambda _{1}^{t}\psi _{1}(x),\lambda _{2}^{t}\psi _{2}(x),\ld

    Read more →
  • ID3 algorithm

    ID3 algorithm

    In decision tree learning, ID3 (Iterative Dichotomiser 3) is a greedy algorithm invented by Ross Quinlan used to generate a decision tree from a dataset. ID3 is the precursor to the C4.5 algorithm. The 3 in the name is meant to signify that this was Quinlan's third attempt at a model based on entropy-based splitting, and the term dichotimser is a misnomer as it implies a binary split, but the ID3 algorithm can split on multi-valued attributes. == Algorithm == The ID3 algorithm begins with the original set S {\displaystyle S} as the root node. On each iteration of the algorithm, it iterates through every unused attribute of the set S {\displaystyle S} and calculates the entropy H ( S ) {\displaystyle \mathrm {H} {(S)}} or the information gain I G ( S ) {\displaystyle IG(S)} of that attribute. It then selects the attribute which has the smallest entropy (or largest information gain) value. The set S {\displaystyle S} is then split or partitioned by the selected attribute to produce subsets of the data. (For example, a node can be split into child nodes based upon the subsets of the population whose ages are less than 50, between 50 and 100, and greater than 100.) The algorithm continues to recurse on each subset, considering only attributes never selected before. Recursion on a subset may stop in one of these cases: every element in the subset belongs to the same class; in which case the node is turned into a leaf node and labelled with the class of the examples. there are no more attributes to be selected, but the examples still do not belong to the same class. In this case, the node is made a leaf node and labelled with the most common class of the examples in the subset. there are no examples in the subset, which happens when no example in the parent set was found to match a specific value of the selected attribute. An example could be the absence of a person among the population with age over 100 years. Then a leaf node is created and labelled with the most common class of the examples in the parent node's set. Throughout the algorithm, the decision tree is constructed with each non-terminal node (internal node) representing the selected attribute on which the data was split, and terminal nodes (leaf nodes) representing the class label of the final subset of this branch. === Summary === Calculate the entropy of every attribute a {\displaystyle a} of the data set S {\displaystyle S} . Partition ("split") the set S {\displaystyle S} into subsets using the attribute for which the resulting entropy after splitting is minimized; or, equivalently, information gain is maximum. Make a decision tree node containing that attribute. Recurse on subsets using the remaining attributes. === Properties === ID3 does not guarantee an optimal solution. It can converge upon local optima. It uses a greedy strategy by selecting the locally best attribute to split the dataset on each iteration. The algorithm's optimality can be improved by using backtracking during the search for the optimal decision tree at the cost of possibly taking longer. ID3 can overfit the training data. To avoid overfitting, smaller decision trees should be preferred over larger ones. This algorithm usually produces small trees, but it does not always produce the smallest possible decision tree. ID3 is harder to use on continuous data than on factored data (factored data has a discrete number of possible values, thus reducing the possible branch points). If the values of any given attribute are continuous, then there are many more places to split the data on this attribute, and searching for the best value to split by can be time-consuming. === Usage === The ID3 algorithm is used by training on a data set S {\displaystyle S} to produce a decision tree which is stored in memory. At runtime, this decision tree is used to classify new test cases (feature vectors) by traversing the decision tree using the features of the datum to arrive at a leaf node. == The ID3 metrics == === Entropy === Entropy H ( S ) {\displaystyle \mathrm {H} {(S)}} is a measure of the amount of uncertainty in the (data) set S {\displaystyle S} (i.e. entropy characterizes the (data) set S {\displaystyle S} ). H ( S ) = ∑ x ∈ X − p ( x ) log 2 ⁡ p ( x ) {\displaystyle \mathrm {H} {(S)}=\sum _{x\in X}{-p(x)\log _{2}p(x)}} Where, S {\displaystyle S} – The current dataset for which entropy is being calculated This changes at each step of the ID3 algorithm, either to a subset of the previous set in the case of splitting on an attribute or to a "sibling" partition of the parent in case the recursion terminated previously. X {\displaystyle X} – The set of classes in S {\displaystyle S} p ( x ) {\displaystyle p(x)} – The proportion of the number of elements in class x {\displaystyle x} to the number of elements in set S {\displaystyle S} When H ( S ) = 0 {\displaystyle \mathrm {H} {(S)}=0} , the set S {\displaystyle S} is perfectly classified (i.e. all elements in S {\displaystyle S} are of the same class). In ID3, entropy is calculated for each remaining attribute. The attribute with the smallest entropy is used to split the set S {\displaystyle S} on this iteration. Entropy in information theory measures how much information is expected to be gained upon measuring a random variable; as such, it can also be used to quantify the amount to which the distribution of the quantity's values is unknown. A constant quantity has zero entropy, as its distribution is perfectly known. In contrast, a uniformly distributed random variable (discretely or continuously uniform) maximizes entropy. Therefore, the greater the entropy at a node, the less information is known about the classification of data at this stage of the tree; and therefore, the greater the potential to improve the classification here. As such, ID3 is a greedy heuristic performing a best-first search for locally optimal entropy values. Its accuracy can be improved by preprocessing the data. === Information gain === Information gain I G ( A ) {\displaystyle IG(A)} is the measure of the difference in entropy from before to after the set S {\displaystyle S} is split on an attribute A {\displaystyle A} . In other words, how much uncertainty in S {\displaystyle S} was reduced after splitting set S {\displaystyle S} on attribute A {\displaystyle A} . I G ( S , A ) = H ( S ) − ∑ t ∈ T p ( t ) H ( t ) = H ( S ) − H ( S | A ) . {\displaystyle IG(S,A)=\mathrm {H} {(S)}-\sum _{t\in T}p(t)\mathrm {H} {(t)}=\mathrm {H} {(S)}-\mathrm {H} {(S|A)}.} Where, H ( S ) {\displaystyle \mathrm {H} (S)} – Entropy of set S {\displaystyle S} T {\displaystyle T} – The subsets created from splitting set S {\displaystyle S} by attribute A {\displaystyle A} such that S = ⋃ t ∈ T t {\displaystyle S=\bigcup _{t\in T}t} p ( t ) {\displaystyle p(t)} – The proportion of the number of elements in t {\displaystyle t} to the number of elements in set S {\displaystyle S} H ( t ) {\displaystyle \mathrm {H} (t)} – Entropy of subset t {\displaystyle t} In ID3, information gain can be calculated (instead of entropy) for each remaining attribute. The attribute with the largest information gain is used to split the set S {\displaystyle S} on this iteration.

    Read more →
  • Junction tree algorithm

    Junction tree algorithm

    The junction tree algorithm (also known as 'Clique Tree') is a method used in machine learning to extract marginalization in general graphs. In essence, it entails performing belief propagation on a modified graph called a junction tree. The graph is called a tree because it branches into different sections of data; nodes of variables are the branches. The basic premise is to eliminate cycles by clustering them into single nodes. Multiple extensive classes of queries can be compiled at the same time into larger structures of data. There are different algorithms to meet specific needs and for what needs to be calculated. Inference algorithms gather new developments in the data and calculate it based on the new information provided. == Junction tree algorithm == === Hugin algorithm === If the graph is directed then moralize it to make it un-directed. Introduce the evidence. Triangulate the graph to make it chordal. Construct a junction tree from the triangulated graph (we will call the vertices of the junction tree "supernodes"). Propagate the probabilities along the junction tree (via belief propagation) Note that this last step is inefficient for graphs of large treewidth. Computing the messages to pass between supernodes involves doing exact marginalization over the variables in both supernodes. Performing this algorithm for a graph with treewidth k will thus have at least one computation which takes time exponential in k. It is a message passing algorithm. The Hugin algorithm takes fewer computations to find a solution compared to Shafer-Shenoy. === Shafer-Shenoy algorithm === Computed recursively Multiple recursions of the Shafer-Shenoy algorithm results in Hugin algorithm Found by the message passing equation Separator potentials are not stored The Shafer-Shenoy algorithm is the sum product of a junction tree. It is used because it runs programs and queries more efficiently than the Hugin algorithm. The algorithm makes calculations for conditionals for belief functions possible. Joint distributions are needed to make local computations happen. === Underlying theory === The first step concerns only Bayesian networks, and is a procedure to turn a directed graph into an undirected one. We do this because it allows for the universal applicability of the algorithm, regardless of direction. The second step is setting variables to their observed value. This is usually needed when we want to calculate conditional probabilities, so we fix the value of the random variables we condition on. Those variables are also said to be clamped to their particular value. The third step is to ensure that graphs are made chordal if they aren't already chordal. This is the first essential step of the algorithm. It makes use of the following theorem: Theorem: For an undirected graph, G, the following properties are equivalent: Graph G is triangulated. The clique graph of G has a junction tree. There is an elimination ordering for G that does not lead to any added edges. Thus, by triangulating a graph, we make sure that the corresponding junction tree exists. A usual way to do this, is to decide an elimination order for its nodes, and then run the Variable elimination algorithm. The variable elimination algorithm states that the algorithm must be run each time there is a different query. This will result to adding more edges to the initial graph, in such a way that the output will be a chordal graph. All chordal graphs have a junction tree. The next step is to construct the junction tree. To do so, we use the graph from the previous step, and form its corresponding clique graph. Now the next theorem gives us a way to find a junction tree: Theorem: Given a triangulated graph, weight the edges of the clique graph by their cardinality, |A∩B|, of the intersection of the adjacent cliques A and B. Then any maximum-weight spanning tree of the clique graph is a junction tree. So, to construct a junction tree we just have to extract a maximum weight spanning tree out of the clique graph. This can be efficiently done by, for example, modifying Kruskal's algorithm. The last step is to apply belief propagation to the obtained junction tree. Usage: A junction tree graph is used to visualize the probabilities of the problem. The tree can become a binary tree to form the actual building of the tree. A specific use could be found in auto encoders, which combine the graph and a passing network on a large scale automatically. === Inference Algorithms === Loopy belief propagation: A different method of interpreting complex graphs. The loopy belief propagation is used when an approximate solution is needed instead of the exact solution. It is an approximate inference. Cutset conditioning: Used with smaller sets of variables. Cutset conditioning allows for simpler graphs that are easier to read but are not exact.

    Read more →
  • Visual servoing

    Visual servoing

    Visual servoing, also known as vision-based robot control and abbreviated VS, is a technique which uses feedback information extracted from a vision sensor (visual feedback) to control the motion of a robot. One of the earliest papers that talks about visual servoing was from the SRI International Labs in 1979. == Visual servoing taxonomy == There are two fundamental configurations of the robot end-effector (hand) and the camera: Eye-in-hand, or end-point open-loop control, where the camera is attached to the moving hand and observing the relative position of the target. Eye-to-hand, or end-point closed-loop control, where the camera is fixed in the world and observing the target and the motion of the hand. Visual Servoing control techniques are broadly classified into the following types: Image-based (IBVS) Position/pose-based (PBVS) Hybrid approach IBVS was proposed by Weiss and Sanderson. The control law is based on the error between current and desired features on the image plane, and does not involve any estimate of the pose of the target. The features may be the coordinates of visual features, lines or moments of regions. IBVS has difficulties with motions very large rotations, which has come to be called camera retreat. PBVS is a model-based technique (with a single camera). This is because the pose of the object of interest is estimated with respect to the camera and then a command is issued to the robot controller, which in turn controls the robot. In this case the image features are extracted as well, but are additionally used to estimate 3D information (pose of the object in Cartesian space), hence it is servoing in 3D. Hybrid approaches use some combination of the 2D and 3D servoing. There have been a few different approaches to hybrid servoing 2-1/2-D Servoing Motion partition-based Partitioned DOF Based == Survey == The following description of the prior work is divided into 3 parts Survey of existing visual servoing methods. Various features used and their impacts on visual servoing. Error and stability analysis of visual servoing schemes. === Survey of existing visual servoing methods === Visual servo systems, also called servoing, have been around since the early 1980s , although the term visual servo itself was only coined in 1987. Visual Servoing is, in essence, a method for robot control where the sensor used is a camera (visual sensor). Servoing consists primarily of two techniques, one involves using information from the image to directly control the degrees of freedom (DOF) of the robot, thus referred to as Image Based Visual Servoing (IBVS). While the other involves the geometric interpretation of the information extracted from the camera, such as estimating the pose of the target and parameters of the camera (assuming some basic model of the target is known). Other servoing classifications exist based on the variations in each component of a servoing system , e.g. the location of the camera, the two kinds are eye-in-hand and hand–eye configurations. Based on the control loop, the two kinds are end-point-open-loop and end-point-closed-loop. Based on whether the control is applied to the joints (or DOF) directly or as a position command to a robot controller the two types are direct servoing and dynamic look-and-move. Being one of the earliest works the authors proposed a hierarchical visual servo scheme applied to image-based servoing. The technique relies on the assumption that a good set of features can be extracted from the object of interest (e.g. edges, corners and centroids) and used as a partial model along with global models of the scene and robot. The control strategy is applied to a simulation of a two and three DOF robot arm. Feddema et al. introduced the idea of generating task trajectory with respect to the feature velocity. This is to ensure that the sensors are not rendered ineffective (stopping the feedback) for any the robot motions. The authors assume that the objects are known a priori (e.g. CAD model) and all the features can be extracted from the object. The work by Espiau et al. discusses some of the basic questions in visual servoing. The discussions concentrate on modeling of the interaction matrix, camera, visual features (points, lines, etc..). In an adaptive servoing system was proposed with a look-and-move servoing architecture. The method used optical flow along with SSD to provide a confidence metric and a stochastic controller with Kalman filtering for the control scheme. The system assumes (in the examples) that the plane of the camera and the plane of the features are parallel., discusses an approach of velocity control using the Jacobian relationship s˙ = Jv˙ . In addition the author uses Kalman filtering, assuming that the extracted position of the target have inherent errors (sensor errors). A model of the target velocity is developed and used as a feed-forward input in the control loop. Also, mentions the importance of looking into kinematic discrepancy, dynamic effects, repeatability, settling time oscillations and lag in response. Corke poses a set of very critical questions on visual servoing and tries to elaborate on their implications. The paper primarily focuses the dynamics of visual servoing. The author tries to address problems like lag and stability, while also talking about feed-forward paths in the control loop. The paper also, tries to seek justification for trajectory generation, methodology of axis control and development of performance metrics. Chaumette in provides good insight into the two major problems with IBVS. One, servoing to a local minima and second, reaching a Jacobian singularity. The author show that image points alone do not make good features due to the occurrence of singularities. The paper continues, by discussing the possible additional checks to prevent singularities namely, condition numbers of J_s and Jˆ+_s, to check the null space of ˆ J_s and J^T_s . One main point that the author highlights is the relation between local minima and unrealizable image feature motions. Over the years many hybrid techniques have been developed. These involve computing partial/complete pose from Epipolar Geometry using multiple views or multiple cameras. The values are obtained by direct estimation or through a learning or a statistical scheme. While others have used a switching approach that changes between image-based and position-based on a Lyapnov function. The early hybrid techniques that used a combination of image-based and pose-based (2D and 3D information) approaches for servoing required either a full or partial model of the object in order to extract the pose information and used a variety of techniques to extract the motion information from the image. used an affine motion model from the image motion in addition to a rough polyhedral CAD model to extract the object pose with respect to the camera to be able to servo onto the object (on the lines of PBVS). 2-1/2-D visual servoing developed by Malis et al. is a well known technique that breaks down the information required for servoing into an organized fashion which decouples rotations and translations. The papers assume that the desired pose is known a priori. The rotational information is obtained from partial pose estimation, a homography, (essentially 3D information) giving an axis of rotation and the angle (by computing the eigenvalues and eigenvectors of the homography). The translational information is obtained from the image directly by tracking a set of feature points. The only conditions being that the feature points being tracked never leave the field of view and that a depth estimate be predetermined by some off-line technique. 2-1/2-D servoing has been shown to be more stable than the techniques that preceded it. Another interesting observation with this formulation is that the authors claim that the visual Jacobian will have no singularities during the motions. The hybrid technique developed by Corke and Hutchinson, popularly called portioned approach partitions the visual (or image) Jacobian into motions (both rotations and translations) relating X and Y axes and motions related to the Z axis. outlines the technique, to break out columns of the visual Jacobian that correspond to the Z axis translation and rotation (namely, the third and sixth columns). The partitioned approach is shown to handle the Chaumette Conundrum discussed in. This technique requires a good depth estimate in order to function properly. outlines a hybrid approach where the servoing task is split into two, namely main and secondary. The main task is keep the features of interest within the field of view. While the secondary task is to mark a fixation point and use it as a reference to bring the camera to the desired pose. The technique does need a depth estimate from an off-line procedure. The paper discusses two examples for which depth estimates are obtained from robot odometry and by assuming that all

    Read more →
  • Yooreeka

    Yooreeka

    Yooreeka is a library for data mining, machine learning, soft computing, and mathematical analysis. The project started with the code of the book "Algorithms of the Intelligent Web". Although the term "Web" prevailed in the title, in essence, the algorithms are valuable in any software application. It covers all major algorithms and provides many examples. Yooreeka 2.x is licensed under the Apache License rather than the somewhat more restrictive LGPL (which was the license of v1.x). The library is written 100% in the Java language. == Algorithms == The following algorithms are covered: Clustering Hierarchical—Agglomerative (e.g. MST single link; ROCK) and Divisive Partitional (e.g. k-means) Classification Bayesian Decision trees Neural Networks Rule based (via Drools) Recommendations Collaborative filtering Content based Search PageRank DocRank Personalization

    Read more →
  • Random indexing

    Random indexing

    Random indexing is a dimensionality reduction method and computational framework for distributional semantics, based on the insight that very-high-dimensional vector space model implementations are impractical, that models need not grow in dimensionality when new items (e.g. new terminology) are encountered, and that a high-dimensional model can be projected into a space of lower dimensionality without compromising L2 distance metrics if the resulting dimensions are chosen appropriately. This is the original point of the random projection approach to dimension reduction first formulated as the Johnson–Lindenstrauss lemma, and locality-sensitive hashing has some of the same starting points. Random indexing, as used in representation of language, originates from the work of Pentti Kanerva on sparse distributed memory, and can be described as an incremental formulation of a random projection. It can be also verified that random indexing is a random projection technique for the construction of Euclidean spaces—i.e. L2 normed vector spaces. In Euclidean spaces, random projections are elucidated using the Johnson–Lindenstrauss lemma. The TopSig technique extends the random indexing model to produce bit vectors for comparison with the Hamming distance similarity function. It is used for improving the performance of information retrieval and document clustering. In a similar line of research, Random Manhattan Integer Indexing (RMII) is proposed for improving the performance of the methods that employ the Manhattan distance between text units. Many random indexing methods primarily generate similarity from co-occurrence of items in a corpus. Reflexive Random Indexing (RRI) generates similarity from co-occurrence and from shared occurrence with other items.

    Read more →