Caspio, Inc. is an American software company providing a low-code platform for building cloud-based business applications. Founded in 2000 by Frank Zamani, the company is headquartered in Sunnyvale, California, with operations in Poland, the Philippines, and Spain. Caspio’s platform allows organizations to create online database applications and workflow tools without extensive coding. == History == Caspio was founded by Frank Zamani in 2000. The company initially focused on simplifying custom cloud applications and reducing development time and cost as compared to traditional software development. Caspio released the first version of its platform, Caspio Bridge, in 2001. In 2014, Caspio released a HIPAA-Compliant Edition of its low-code application development platform. Caspio also released an EU General Data Protection Regulation (GDPR) Compliance Edition of its low-code application development platform in 2016. Caspio's second European Software Development Center opened in Kraków, Poland in 2017. In 2019, Forrester Research listed Caspio and three other platforms in its highest of four ranked tiers of twelve low-code platforms for business developers based on rankings of offerings and strategy at that time. Caspio also opened data centers in Montreal, Canada and India in 2020.
Agent verification
Agent verification is activity to gain assurances that purposeful artificial constructs act in accordance with their specifications. While primitive forms of inorganic agents have been used in manufacturing for centuries, the study of artificial agents did not begin until the mid 20th century. Foundational work on such agents was closely bound with the emergence of artificial intelligence as an academic discipline. Early agents deployed for industrial control systems and in computing were often controlled by quite simple logic however, not involving artificial intelligence as such. When deployed as part of a multi-agent system, even such simple agents could require special agent orientated testing methods, as their collective behaviour was challenging to verify with traditional testing techniques. Difficulties in providing assurances that agents will not behave in dangerous ways became more prevalent after the introduction of LLM agents, especially after the rapid acceleration of their deployment in 2025. The verification of agent behaviour can be conducted by formal or informal methods. Informal verification requires less mathematical skill. But when agents are part of systems where errors have significant risks — such as danger to human life, environmental damage or major financial loss — formal verification is preferred. Both regulators and system designers themselves like formal verification as it provides a high degree of mathematical certainty. It is not however always possible to formally test all aspects of an agent based system's behaviour, especially where newer LLM based agents are concerned, due in part to their high degree of autonomy. Accordingly, agent verification for low impact deployments might be carried out only with informal methods, while for high impact deployments, it may be performed with a mix of formal and informal techniques. 
== Terminology == In academia, the term agent verification is often defined to mean activity concerned with gaining assurance that the agent behaves in accordance with its specification, by processes such as testing or simulation. 'Verification' is typically contrasted with 'validation', the latter meaning activity concerned with checking that the specification itself meets user or real-world needs. Such definitions are not universally adhered to, however: in some workplaces and documents, the words 'verification' and 'validation' are used synonymously. Efforts to gain confidence in agents have intensified sharply since 2025 due to the rapid rollout of LLM agents, and different terms are sometimes used in the commercial sector. Here the term 'agent verification' can be used in the same sense as it is in academia, but sometimes the same activity is covered by more ambiguous and wider-ranging terms such as 'Agent governance', 'Agent observability' or 'AI agent policing'. == History == === Classical agents === The theoretical underpinnings for artificial (inorganic) agents emerged in the mid 20th century, with the establishment of cybernetics and artificial intelligence. Oliver Selfridge's 1958 paper Pandemonium - A Paradigm for Learning was an important early theoretical contribution in establishing agent-oriented architecture. Practical implementations of agents for real-world applications began to become widespread in the 1990s, after the introduction of the belief–desire–intention software model (BDI) and agent-oriented programming. Pure digital agents were deployed in computer infrastructure for purposes such as monitoring, while agents connected to real-world sensors and actuators were increasingly used in industrial control systems. 
While the concept of artificial agents was interwoven with early artificial intelligence studies right from the start, early agents lacked general purpose reasoning capabilities, often only having simple if-then logic. Even a device as simple as a thermostat, which has a sensor and a means of acting, can be considered a proto-agent in this sense. Verifying the behaviour of a simple single-agent system is not generally especially difficult, but it can be a different matter when several simple agents coexist in the same system. Craig Reynolds's work on boids showed that relatively complex, "intelligent" behaviour can emerge from a number of such simple agents working together in a multi-agent system (MAS). By the 1990s, even the behaviour of a single-agent system could sometimes be quite complex; in accordance with the belief–desire–intention software model, agents could have beliefs that might evolve over time. Agents were increasingly introduced that were controlled by quite large decision tree models, which had new vulnerabilities to adversarial attack. It was becoming increasingly apparent that traditional software verification methods had limitations for testing such agents, or even for the more primitive type of agents when they were deployed as part of a MAS. It was the use of agents for industrial control systems, sometimes associated with robotics, that lent urgency to the practice of agent verification. Informal testing might be acceptable for digital agents used, say, to monitor whether each of an organisation's computers is properly licensed. But with an increasing potential for faulty agents to result in a failure that might cause a large fire to break out at a chemical manufacturing plant, a botched medical operation, or even a crashed aircraft, the need to develop reliable means of verifying the behaviour of such agents was considered urgent. The Foundation for Intelligent Physical Agents was established in 1996. 
From the late 1990s, a growing number of industry and university based scientists began working on the problem, with researchers publishing papers on the verification of both single and multi-agent systems. Much of this work showed how formal verification techniques like model checking could be used to gain a high level of assurance that agent-based systems would conform with their specification. A 2018 systematic review covering 231 studies found that model checking was the most common technique for agent verification, with theorem proving the second most commonly used formal verification method. In the first two decades of the 21st century, agents run by AI became more common, with Siri and Alexa being well known examples. But such agents still lacked general reasoning capabilities and did not pose new pressing problems for agent verification. === General purpose reasoning agents === The advent of LLMs created huge potential for further use of artificial agents, as agents based on them could have general purpose cognitive abilities. Agents run by LLMs (and occasionally non-LLM foundation models) have similar vulnerability to adversarial attack as those run by decision tree models. The wider scope of actions for LLM agents has created new challenges for their verification, over and above those present for classical agents. For example, the LLM's neural network endows it with infinite domains, a particular challenge for traditional formal verification techniques. Academics began to study the problems involved in verifying LLM agents from 2018. Deployment of such agents began to accelerate in late 2023 after OpenAI's "function-calling" API was made available, and especially after Anthropic's late 2024 introduction of Model Context Protocol (MCP), a standardised way for LLM agents to gain contextual awareness, and to act on the world by calling various external tools. 
The rapid rollout of LLM agents following MCP's release has seen the task of agent verification receive increased attention within academia, and also from the private sector. In 2024 and 2025 several startups focusing on LLM agent verification have been founded in both Europe and the US to meet growing demand. == Approaches == === Formal verification === Formal verification involves proving the correctness of some or all aspects of a system using mathematical methods. Such methods can range from manual formal proof, to verification assisted with automated theorem provers like Isabelle. For agent verification, model checking is by far the most frequently used formal verification method; for pre-LLM models it was often complemented with techniques using computation tree logic. Another common method is theorem proving. Formal verification provides a higher degree of confidence than informal methods, but it is not always used, even when it is possible. Sometimes a person or organisation developing software agents won't have the necessary skills, or may not see it as worth the effort if the agent(s) will not have the ability to cause much harm even if they malfunction. When agents are deployed in systems where errors could have serious consequences, the ability of formal verification methods to provide mathematical certainty tends to be strongly preferred by both regulators and designers themselves. But even for high impact systems, formal verificatio
Genetic programming
Genetic programming (GP) is an evolutionary algorithm, an artificial intelligence technique mimicking natural evolution, which operates on a population of programs. It applies the genetic operators selection according to a predefined fitness measure, mutation and crossover. The crossover operation involves swapping specified parts of selected pairs (parents) to produce new and different offspring that become part of the new generation of programs. Some programs not selected for reproduction are copied from the current generation to the new generation. Mutation involves substitution of some random part of a program with some other random part of a program. Then the selection and other operations are recursively applied to the new generation of programs. Typically, members of each new generation are on average more fit than the members of the previous generation, and the best-of-generation program is often better than the best-of-generation programs from previous generations. Termination of the evolution usually occurs when some individual program reaches a predefined proficiency or fitness level. It may and often does happen that a particular run of the algorithm results in premature convergence to some local maximum that is not a globally optimal or even good solution. Multiple runs (dozens to hundreds) are usually necessary to produce a very good result. It may also be necessary to have a large starting population size and variability of the individuals to avoid pathologies. == History == The first record of the proposal to evolve programs is probably that of Alan Turing in 1950 in "Computing Machinery and Intelligence". There was a gap of 25 years before the publication of John Holland's 'Adaptation in Natural and Artificial Systems' laid out the theoretical and empirical foundations of the science. 
In 1981, Richard Forsyth demonstrated the successful evolution of small programs, represented as trees, to perform classification of crime scene evidence for the UK Home Office. Although the idea of evolving programs, initially in the computer language Lisp, was current amongst John Holland's students, it was not until they organised the first Genetic Algorithms (GA) conference in Pittsburgh that Nichael Cramer published evolved programs in two specially designed languages, which included the first statement of modern "tree-based" genetic programming (that is, procedural languages organized in tree-based structures and operated on by suitably defined GA-operators). In 1988, John Koza (also a PhD student of John Holland) patented his invention of a GA for program evolution. This was followed by publication in the International Joint Conference on Artificial Intelligence IJCAI-89. Koza followed this with 205 publications on "genetic programming", a term coined by David Goldberg, also a PhD student of John Holland. However, it is the series of 4 books by Koza, starting in 1992 with accompanying videos, that really established GP. Subsequently, there was an enormous expansion of the number of publications with the Genetic Programming Bibliography, surpassing 10,000 entries. In 2010, Koza listed 77 results where genetic programming was human competitive. The departure of GP from the rigid, fixed-length representations typical of early GA models was not entirely without precedent. Early work on variable-length representations laid the groundwork. One notable example is messy genetic algorithms, which introduced irregular, variable-length chromosomes to address building block disruption and positional bias in standard GAs. Another precursor was robot trajectory programming, where genome representations encoded program instructions for robotic movements—structures inherently variable in length. 
Even earlier, unfixed-length representations were proposed in a doctoral dissertation by Cavicchio, who explored adaptive search using simulated evolution. His work provided foundational ideas for flexible program structures. In 1996, Koza started the annual Genetic Programming conference, which was followed in 1998 by the annual EuroGP conference, and the first book in a GP series edited by Koza. 1998 also saw the first GP textbook. GP continued to flourish, leading to the first specialist GP journal and three years later (2003) the annual Genetic Programming Theory and Practice (GPTP) workshop was established by Rick Riolo. Genetic programming papers continue to be published at a diversity of conferences and associated journals. Today there are nineteen GP books including several for students. === Foundational work in GP === Early work that set the stage for current genetic programming research topics and applications is diverse, and includes software synthesis and repair, predictive modeling, data mining, financial modeling, soft sensors, design, and image processing. Applications in some areas, such as design, often make use of intermediate representations, such as Fred Gruau's cellular encoding. Industrial uptake has been significant in several areas including finance, the chemical industry, bioinformatics and the steel industry. == Methods == === Program representation === GP evolves computer programs, traditionally represented in memory as tree structures. Trees can be easily evaluated in a recursive manner. Every internal node has an operator function and every terminal node has an operand, making mathematical expressions easy to evolve and evaluate. Thus traditionally GP favors the use of programming languages that naturally embody tree structures (for example, Lisp; other functional programming languages are also suitable). 
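The recursive evaluation of a tree-shaped program described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the nested-tuple encoding, the `FUNCTIONS` table and the `evaluate` helper are hypothetical names chosen for this example, not part of any particular GP system.

```python
import operator

# Hypothetical function set: symbol -> (callable, arity).
FUNCTIONS = {
    "+": (operator.add, 2),
    "-": (operator.sub, 2),
    "*": (operator.mul, 2),
}


def evaluate(node, env):
    """Recursively evaluate an expression tree.

    Internal nodes are tuples (symbol, child, child, ...); terminal nodes
    are either numeric constants or variable names looked up in env.
    """
    if isinstance(node, tuple):
        symbol, *children = node
        func, arity = FUNCTIONS[symbol]
        if len(children) != arity:
            raise ValueError(f"{symbol!r} expects {arity} children")
        return func(*(evaluate(child, env) for child in children))
    if isinstance(node, str):
        return env[node]  # variable terminal
    return node  # constant terminal


def depth(node):
    """Depth of a tree, counting a lone terminal as depth 1."""
    if isinstance(node, tuple):
        return 1 + max(depth(child) for child in node[1:])
    return 1


# The expression (x * x) + (3 - y) as a tree.
tree = ("+", ("*", "x", "x"), ("-", 3, "y"))
value = evaluate(tree, {"x": 2, "y": 1})
```

Because every subtree is itself a valid program, variation operators can cut and recombine subtrees without breaking the evaluator, which is one reason tree encodings are convenient for GP.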
Non-tree representations have been suggested and successfully implemented, such as linear genetic programming, which perhaps suits the more traditional imperative languages. The commercial GP software Discipulus uses automatic induction of binary machine code ("AIM") to achieve better performance. μGP uses directed multigraphs to generate programs that fully exploit the syntax of a given assembly language. Multi expression programming uses three-address code for encoding solutions. Other program representations on which significant research and development have been conducted include programs for stack-based virtual machines, and sequences of integers that are mapped to arbitrary programming languages via grammars. Cartesian genetic programming is another form of GP, which uses a graph representation instead of the usual tree based representation to encode computer programs. Most representations have structurally noneffective code (introns). Such non-coding genes may seem to be useless because they have no effect on the performance of any one individual. However, they alter the probabilities of generating different offspring under the variation operators, and thus alter the individual's variational properties. Experiments seem to show faster convergence when using program representations that allow such non-coding genes, compared to program representations that do not have any non-coding genes. Instantiations may have both trees with introns and those without; the latter are called canonical trees. Special canonical crossover operators are introduced that maintain the canonical structure of parents in their children. === Initialisation === The methods for creation of the initial population include: Grow creates the individuals sequentially. Every GP tree is created starting from the root, creating functional nodes with children as well as terminal nodes up to a certain depth. Full is similar to the Grow. 
The difference is that all branches in a tree are of the same predetermined depth. Ramped half-and-half creates a population consisting of $md-1$ parts and a maximum depth of $md$ for its trees. The first part has a maximum depth of 2, the second of 3 and so on up to the $(md-1)$-th part with maximum depth $md$. Half of every part is created by Grow, while the other half is created by Full. === Selection === Selection is a process whereby certain individuals are selected from the current generation to serve as parents for the next generation. The individuals are selected probabilistically such that the better performing individuals have a higher chance of getting selected. The most commonly used selection method in GP is tournament selection, although other methods such as fitness proportionate selection, lexicase selection, and others have been demonstrated to perform better for many GP problems. Elitism, which involves seeding the next generation with the best individual (or best n individuals) from the current generation, is a technique sometimes employed to avoid regression. === Crossover === In genetic programming two fit individuals are chosen from the population to be parents for one or two children. In tree genetic programming, these parents are represented as inverted Lisp-like trees, with their root nodes at the top. In subtree cro
Margin classifier
In machine learning (ML), a margin classifier is a type of classification model which is able to give an associated distance from the decision boundary for each data sample. For instance, if a linear classifier is used, the distance (typically Euclidean, though others may be used) of a sample from the separating hyperplane is the margin of that sample. The notion of margins is important in several ML classification algorithms, as it can be used to bound the generalization error of these classifiers. These bounds are frequently shown using the VC dimension. The generalization error bound in boosting algorithms and support vector machines is particularly prominent. == Margin for boosting algorithms == The margin for an iterative boosting algorithm given a dataset with two classes can be defined as follows: the classifier is given a sample pair $(x,y)$, where $x\in X$ is a point in the domain space $X$ and $y\in Y=\{-1,+1\}$ is the sample's label. The algorithm then selects a classifier $h_{j}\in C$ at each iteration $j$, where $C$ is a space of possible classifiers that predict real values. This hypothesis is then weighted by $\alpha_{j}\in \mathbb{R}$ as selected by the boosting algorithm. At iteration $t$, the margin of a sample $x$ can thus be defined as $$\frac{y\sum_{j}^{t}\alpha_{j}h_{j}(x)}{\sum|\alpha_{j}|}.$$ By this definition, the margin is positive if the sample is labeled correctly, or negative if the sample is labeled incorrectly. This definition may be modified and is not the only way to define the margin for boosting algorithms. However, there are reasons why this definition may be appealing. == Examples of margin-based algorithms == Many classifiers can give an associated margin for each sample. 
However, only some classifiers utilize information of the margin while learning from a dataset. Many boosting algorithms rely on the notion of a margin to assign weight to samples. If a convex loss is utilized (as in AdaBoost or LogitBoost, for instance) then a sample with a higher margin will receive less (or equal) weight than a sample with a lower margin. This leads the boosting algorithm to focus weight on low-margin samples. In non-convex algorithms (e.g., BrownBoost), the margin still dictates the weighting of a sample, though the weighting is non-monotone with respect to the margin. == Generalization error bounds == One theoretical motivation behind margin classifiers is that their generalization error may be bound by the algorithm parameters and a margin term. An example of such a bound is for the AdaBoost algorithm. Let $S$ be a set of $m$ data points, sampled independently at random from a distribution $D$. Assume the VC-dimension of the underlying base classifier is $d$ and $m\geq d\geq 1$. Then, with probability $1-\delta$, we have the bound: $$P_{D}\left(\frac{y\sum_{j}^{t}\alpha_{j}h_{j}(x)}{\sum|\alpha_{j}|}\leq 0\right)\leq P_{S}\left(\frac{y\sum_{j}^{t}\alpha_{j}h_{j}(x)}{\sum|\alpha_{j}|}\leq \theta\right)+O\left(\frac{1}{\sqrt{m}}\sqrt{d\log^{2}(m/d)/\theta^{2}+\log(1/\delta)}\right)$$ for all $\theta>0$.
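As a minimal sketch of the quantities used above, the following Python computes the normalised boosting margin of a sample and the empirical term $P_{S}(\text{margin}\leq\theta)$ from the bound. The weights, the three threshold "stumps" and the tiny sample set are made-up illustrative values, and the function names are hypothetical.

```python
def boosting_margin(x, y, alphas, hypotheses):
    """Normalised margin  y * sum_j alpha_j h_j(x) / sum_j |alpha_j|."""
    total = sum(abs(a) for a in alphas)
    if total == 0:
        raise ValueError("at least one weight must be non-zero")
    score = sum(a * h(x) for a, h in zip(alphas, hypotheses))
    return y * score / total


def empirical_margin_fraction(samples, alphas, hypotheses, theta):
    """Fraction of samples whose margin is at most theta, i.e. P_S(margin <= theta)."""
    hits = sum(
        1 for x, y in samples
        if boosting_margin(x, y, alphas, hypotheses) <= theta
    )
    return hits / len(samples)


# Illustrative ensemble: three threshold stumps on a scalar input.
hypotheses = [
    lambda x: 1.0 if x > 0 else -1.0,
    lambda x: 1.0 if x > 2 else -1.0,
    lambda x: 1.0 if x > -1 else -1.0,
]
alphas = [0.5, 0.25, 0.25]

# (input, label) pairs; the third one is misclassified by the ensemble.
samples = [(1, +1), (3, +1), (1, -1), (-2, -1)]
margins = [boosting_margin(x, y, alphas, hypotheses) for x, y in samples]
```

A correctly labelled sample gets a positive margin and a mislabelled one a negative margin, so `empirical_margin_fraction(..., theta=0)` is simply the training error of the weighted vote.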
Minimum Population Search
In evolutionary computation, Minimum Population Search (MPS) is a computational method that optimizes a problem by iteratively trying to improve a set of candidate solutions with regard to a given measure of quality. It solves a problem by evolving a small population of candidate solutions by means of relatively simple arithmetical operations. MPS is a metaheuristic as it makes few or no assumptions about the problem being optimized and can search very large spaces of candidate solutions. For problems where finding the precise global optimum is less important than finding an acceptable local optimum in a fixed amount of time, using a metaheuristic such as MPS may be preferable to alternatives such as brute-force search or gradient descent. MPS is used for multidimensional real-valued functions but does not use the gradient of the problem being optimized, which means MPS does not require the optimization problem to be differentiable, as is required by classic optimization methods such as gradient descent and quasi-Newton methods. MPS can therefore also be used on optimization problems that are not even continuous, are noisy, change over time, etc. == Background == In a similar way to differential evolution, MPS uses difference vectors between the members of the population in order to generate new solutions. It attempts to provide an efficient use of function evaluations by maintaining a small population size. If the population size $n$ is smaller than the dimensionality of the search space, then the solutions generated through difference vectors will be constrained to an $(n-1)$-dimensional hyperplane. A smaller population size will lead to a more restricted subspace. With a population size equal to the dimensionality of the problem $(n=d)$, the “line/hyperplane points” in MPS will be generated within a $(d-1)$-dimensional hyperplane. 
Taking a step orthogonal to this hyperplane will allow the search process to cover all the dimensions of the search space. Population size is a fundamental parameter in the performance of population-based heuristics. Larger populations promote exploration, but they also allow fewer generations, and this can reduce the chance of convergence. Searching with a small population can increase the chances of convergence and the efficient use of function evaluations, but it can also induce the risk of premature convergence. If the risk of premature convergence can be avoided, then a population-based heuristic could benefit from the efficiency and faster convergence rate of a smaller population. To avoid premature convergence, it is important to have a diversified population. By including techniques for explicitly increasing diversity and exploration, it is possible to have smaller populations with less risk of premature convergence. === Thresheld Convergence === Thresheld Convergence (TC) is a diversification technique which attempts to separate the processes of exploration and exploitation. TC uses a “threshold” function to establish a minimum search step, and managing this step makes it possible to influence the transition from exploration to exploitation, convergence is thus “held” back until the last stages of the search process. The goal of a controlled transition is to avoid an early concentration of the population around a few search regions and avoid the loss of diversity which can cause premature convergence. Thresheld Convergence has been successfully applied to several population-based metaheuristics such as Particle Swarm Optimization, Differential evolution, Evolution strategies, Simulated annealing and Estimation of Distribution Algorithms. The ideal case for Thresheld Convergence is to have one sample solution from each attraction basin, and for each sample solution to have the same relative fitness with respect to its local optimum. 
Enforcing a minimum step aims to achieve this ideal case. In MPS, Thresheld Convergence is specifically used to preserve diversity and avoid premature convergence by establishing a minimum search step. By disallowing new solutions which are too close to members of the current population, TC forces a strong exploration during the early stages of the search while preserving the diversity of the (small) population. == Algorithm == A basic variant of the MPS algorithm works by having a population of size equal to the dimension of the problem. New solutions are generated by exploring the hyperplane defined by the current solutions (by means of difference vectors) and performing an additional orthogonal step in order to avoid getting caught in this hyperplane. The step sizes are controlled by the Thresheld Convergence technique, which gradually reduces step sizes as the search process advances. An outline for the algorithm is given below:
* Generate the initial population. Allowing these solutions to lie near the bounds of the search space generally gives good results: $s_{k}=(rs_{1}\cdot bound_{1}/2,\ rs_{2}\cdot bound_{2}/2,\ \ldots,\ rs_{n}\cdot bound_{n}/2)$, where $s_{k}$ is the $k$-th population member, $rs_{i}$ are random numbers which can be −1 or 1, and the $bound_{i}$ are the lower and upper bounds on each dimension.
* While a stop condition is not reached:
** Update the threshold convergence values ($min\_step$ and $max\_step$).
** Calculate the centroid of the current population ($x_{c}$).
** For each member of the population ($x_{i}$), generate a new offspring as follows:
*** Uniformly generate a scaling factor ($F_{i}$) between $-max\_step$ and $max\_step$.
*** Generate a vector ($x_{o}$) orthogonal to the difference vector between $x_{i}$ and $x_{c}$.
*** Calculate a scaling factor for the orthogonal vector: $min\_orth=\sqrt{\max(min\_step^{2}-F_{i}^{2},0)}$, $max\_orth=\sqrt{\max(max\_step^{2}-F_{i}^{2},0)}$, $orth\_step=\mathrm{uniform}(min\_orth,max\_orth)$.
*** Generate the new solution by adding the difference and the orthogonal vectors to the original solution: $new\_solution=x_{i}+F_{i}\cdot(x_{i}-x_{c})+orth\_step\cdot x_{o}$.
** Pick the best members between the old population and the new one by discarding the least fit members.
* Return the single best solution or the best population found as the final result.
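The outline above can be sketched in plain Python as follows. Several details are assumptions made for this sketch rather than something the outline specifies: the search space is taken to be symmetric, $[-bound, bound]$ in every dimension; the difference vector and the orthogonal vector are normalised to unit length so that $F_{i}$ and $orth\_step$ act as step lengths; the threshold schedule (`alpha`, `gamma`, and `max_step = 2 * min_step`) is one possible decaying rule; and offspring are clipped to the bounds. All function and parameter names are hypothetical.

```python
import math
import random


def sphere(x):
    """Simple test objective (to be minimised)."""
    return sum(v * v for v in x)


def unit(v):
    """Return v scaled to unit length, or None if v is (almost) zero."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v] if norm > 1e-12 else None


def random_unit(n, rng):
    """Random unit vector in n dimensions."""
    while True:
        u = unit([rng.gauss(0, 1) for _ in range(n)])
        if u is not None:
            return u


def orthogonal_unit(d, rng):
    """Random unit vector orthogonal to the unit vector d (one Gram-Schmidt step)."""
    while True:
        r = [rng.gauss(0, 1) for _ in d]
        dot = sum(a * b for a, b in zip(r, d))
        u = unit([a - dot * b for a, b in zip(r, d)])
        if u is not None:
            return u


def mps(f, bound, n, generations=200, alpha=0.3, gamma=3.0, seed=0):
    """Minimise f over [-bound, bound]^n with a population of size n."""
    if n < 2:
        raise ValueError("an orthogonal step needs at least 2 dimensions")
    rng = random.Random(seed)
    diagonal = 2 * bound * math.sqrt(n)  # length of the search-space diagonal
    # Initial population: every coordinate is +bound/2 or -bound/2.
    pop = [[rng.choice((-1, 1)) * bound / 2 for _ in range(n)] for _ in range(n)]
    for g in range(generations):
        # Assumed threshold schedule: steps shrink as the search advances.
        min_step = alpha * diagonal * ((generations - g) / generations) ** gamma
        max_step = 2 * min_step
        centroid = [sum(col) / n for col in zip(*pop)]
        offspring = []
        for x in pop:
            # Unit difference vector; fall back to a random direction if x == centroid.
            d = unit([a - c for a, c in zip(x, centroid)]) or random_unit(n, rng)
            F = rng.uniform(-max_step, max_step)
            xo = orthogonal_unit(d, rng)
            min_orth = math.sqrt(max(min_step ** 2 - F ** 2, 0))
            max_orth = math.sqrt(max(max_step ** 2 - F ** 2, 0))
            orth_step = rng.uniform(min_orth, max_orth)
            child = [xi + F * di + orth_step * oi for xi, di, oi in zip(x, d, xo)]
            offspring.append([min(max(c, -bound), bound) for c in child])
        # Keep the n fittest of parents and offspring.
        pop = sorted(pop + offspring, key=f)[:n]
    return pop[0]


best = mps(sphere, bound=5.0, n=3, seed=1)
```

With unit vectors, each offspring lies at a distance between `min_step` and `max_step` from its parent (before clipping), which is how the minimum search step of Thresheld Convergence is enforced in this sketch.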
Open information extraction
In natural language processing, open information extraction (OIE) is the task of generating a structured, machine-readable representation of the information in text, usually in the form of triples or n-ary propositions. == Overview == A proposition can be understood as a truth-bearer, a textual expression of a potential fact (e.g., "Dante wrote the Divine Comedy"), represented in a structure amenable to computers [e.g., ("Dante", "wrote", "Divine Comedy")]. An OIE extraction normally consists of a relation and a set of arguments. For instance, ("Dante", "passed away in", "Ravenna") is a proposition formed by the relation "passed away in" and the arguments "Dante" and "Ravenna". The first argument is usually referred to as the subject while the second is considered to be the object. The extraction is said to be a textual representation of a potential fact because its elements are not linked to a knowledge base. Furthermore, the factual nature of the proposition has not yet been established. In the above example, transforming the extraction into a fully fledged fact would first require linking, if possible, the relation and the arguments to a knowledge base. Second, the truth of the extraction would need to be determined. In computer science, transforming OIE extractions into ontological facts is known as relation extraction. In fact, OIE can be seen as the first step to a wide range of deeper text understanding tasks such as relation extraction, knowledge-base construction, question answering, and semantic role labeling. The extracted propositions can also be directly used for end-user applications such as structured search (e.g., retrieve all propositions with "Dante" as subject). OIE was first introduced by TextRunner, developed at the University of Washington Turing Center headed by Oren Etzioni. Other methods introduced later, such as Reverb, OLLIE, ClausIE or CSD, helped to shape the OIE task by characterizing some of its aspects. 
At a high level, all of these approaches make use of a set of patterns to generate the extractions. Depending on the particular approach, these patterns are either hand-crafted or learned. == OIE systems and contributions == Reverb suggested the necessity to produce meaningful relations to more accurately capture the information in the input text. For instance, given the sentence "Faust made a pact with the devil", it would be erroneous to just produce the extraction ("Faust", "made", "a pact") since it would not be adequately informative. A more precise extraction would be ("Faust", "made a pact with", "the devil"). Reverb also argued against the generation of overspecific relations. OLLIE stressed two important aspects for OIE. First, it pointed to the lack of factuality of the propositions. For instance, in a sentence like "If John studies hard, he will pass the exam", it would be inaccurate to consider ("John", "will pass", "the exam") as a fact. Additionally, the authors indicated that an OIE system should be able to extract non-verb-mediated relations, which account for a significant portion of the information expressed in natural language text. For instance, in the sentence "Obama, the former US president, was born in Hawaii", an OIE system should be able to recognize a proposition ("Obama", "is", "former US president"). ClausIE introduced the connection between grammatical clauses, propositions, and OIE extractions. The authors stated that as each grammatical clause expresses a proposition, each verb-mediated proposition can be identified by solely recognizing the set of clauses expressed in each sentence. This implies that to correctly recognize the set of propositions in an input sentence, it is necessary to understand its grammatical structure. The authors studied the case of the English language, which only admits seven clause types, meaning that the identification of each proposition only requires defining seven grammatical patterns. 
The finding also established a separation between the recognition of the propositions and their materialization. In a first step, the proposition can be identified without any consideration of its final form, in a domain-independent and unsupervised way, mostly based on linguistic principles. In a second step, the information can be represented according to the requirements of the underlying application, without conditioning the identification phase. Consider the sentence "Albert Einstein was born in Ulm and died in Princeton". The first step will recognize the two propositions ("Albert Einstein", "was born", "in Ulm") and ("Albert Einstein", "died", "in Princeton"). Once the information has been correctly identified, the propositions can take the particular form required by the underlying application [e.g., ("Albert Einstein", "was born in", "Ulm") and ("Albert Einstein", "died in", "Princeton")].
CSD introduced the idea of minimality in OIE. It considers that computers can make better use of the extractions if they are expressed in a compact way. This is especially important in sentences with subordinate clauses. In these cases, CSD suggests the generation of nested extractions. For example, consider the sentence "The Embassy said that 6,700 Americans were in Pakistan". CSD generates two extractions: [i] ("6,700 Americans", "were", "in Pakistan") and [ii] ("The Embassy", "said", "that [i]"). This is usually known as reification.
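The triples and nested extractions discussed above can be illustrated with a small data structure. The following is a minimal sketch, not the representation used by any of the systems named here: the `Proposition` class and the `by_subject` helper are hypothetical names introduced for illustration, and the nested extraction stores a reference to [i] directly instead of the string "that [i]".

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Proposition:
    """An OIE extraction: a relation with a subject and an object argument.

    An argument may itself be a Proposition, which models a nested extraction.
    """
    subject: Union[str, "Proposition"]
    relation: str
    obj: Union[str, "Proposition"]


def by_subject(props, subject):
    """Structured search: return the propositions whose subject matches."""
    return [p for p in props if p.subject == subject]


# Extractions quoted in the text.
inner = Proposition("6,700 Americans", "were", "in Pakistan")   # [i]
outer = Proposition("The Embassy", "said", inner)               # [ii] refers to [i]

store = [
    Proposition("Dante", "wrote", "Divine Comedy"),
    Proposition("Dante", "passed away in", "Ravenna"),
    inner,
    outer,
]

# Retrieve all propositions with "Dante" as subject.
dante = by_subject(store, "Dante")
```

Keeping the inner proposition as an object rather than flattening it into text is one way to preserve minimality: each extraction stays compact, and the embedded claim remains separately searchable.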
Mathematics of neural networks in machine learning
An artificial neural network (ANN) or neural network combines biological principles with advanced statistics to solve problems in domains such as pattern recognition and game-play. ANNs adopt the basic model of neuron analogues connected to each other in a variety of ways.
== Structure ==
=== Neuron ===
A neuron with label $j$ receiving an input $p_j(t)$ from predecessor neurons consists of the following components: an activation $a_j(t)$, the neuron's state, depending on a discrete time parameter; an optional threshold $\theta_j$, which stays fixed unless changed by learning; an activation function $f$ that computes the new activation at a given time $t+1$ from $a_j(t)$, $\theta_j$ and the net input $p_j(t)$, giving rise to the relation $a_j(t+1) = f(a_j(t), p_j(t), \theta_j)$; and an output function $f_{\text{out}}$ computing the output from the activation, $o_j(t) = f_{\text{out}}(a_j(t))$. Often the output function is simply the identity function. An input neuron has no predecessor but serves as input interface for the whole network. Similarly, an output neuron has no successor and thus serves as output interface of the whole network.
=== Propagation function ===
The propagation function computes the input $p_j(t)$ to the neuron $j$ from the outputs $o_i(t)$ and typically has the form $p_j(t) = \sum_i o_i(t) w_{ij}$.
=== Bias ===
A bias term can be added, changing the form to the following: $p_j(t) = \sum_i o_i(t) w_{ij} + w_{0j}$, where $w_{0j}$ is a bias.
== Neural networks as functions ==
Neural network models can be viewed as defining a function that takes an input (observation) and produces an output (decision) $f : X \rightarrow Y$, or a distribution over $X$ or both $X$ and $Y$. Sometimes models are intimately associated with a particular learning rule. A common use of the phrase "ANN model" is really the definition of a class of such functions (where members of the class are obtained by varying parameters, connection weights, or specifics of the architecture such as the number of neurons, number of layers or their connectivity). Mathematically, a neuron's network function $f(x)$ is defined as a composition of other functions $g_i(x)$, that can further be decomposed into other functions. This can be conveniently represented as a network structure, with arrows depicting the dependencies between functions. A widely used type of composition is the nonlinear weighted sum, where $f(x) = K\left(\sum_i w_i g_i(x)\right)$, where $K$ (commonly referred to as the activation function) is some predefined function, such as the hyperbolic tangent, sigmoid function, softmax function, or rectifier function. The important characteristic of the activation function is that it provides a smooth transition as input values change, i.e. a small change in input produces a small change in output.
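The propagation function with bias and the nonlinear weighted sum can be written out directly. The sketch below assumes the hyperbolic tangent as the activation function $K$ and uses plain Python lists; the function names `propagate` and `neuron_output` and the numeric values are illustrative choices, not part of the definitions above.

```python
import math


def propagate(outputs, weights, bias=0.0):
    """Net input p_j = sum_i o_i * w_ij + w_0j (bias defaults to zero)."""
    return sum(o * w for o, w in zip(outputs, weights)) + bias


def neuron_output(outputs, weights, bias=0.0, activation=math.tanh):
    """Nonlinear weighted sum: apply the activation K to the net input."""
    return activation(propagate(outputs, weights, bias))


# Illustrative values: two predecessor outputs, their weights, and a bias.
o = [1.0, 2.0]
w = [0.5, -0.25]
p = propagate(o, w, bias=0.25)      # 0.5 - 0.5 + 0.25 = 0.25
a = neuron_output(o, w, bias=0.25)  # tanh(0.25)
```

Passing the activation as a parameter reflects the fact that $K$ is "some predefined function"; a sigmoid or rectifier could be substituted without changing the propagation step.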
The following refers to a collection of functions $g_i$ as a vector $g = (g_1, g_2, \ldots, g_n)$. This figure depicts such a decomposition of $f$, with dependencies between variables indicated by arrows. These can be interpreted in two ways. The first view is the functional view: the input $x$ is transformed into a 3-dimensional vector $h$, which is then transformed into a 2-dimensional vector $g$, which is finally transformed into $f$. This view is most commonly encountered in the context of optimization. The second view is the probabilistic view: the random variable $F = f(G)$ depends upon the random variable $G = g(H)$, which depends upon $H = h(X)$, which depends upon the random variable $X$. This view is most commonly encountered in the context of graphical models. The two views are largely equivalent. In either case, for this particular architecture, the components of individual layers are independent of each other (e.g., the components of $g$ are independent of each other given their input $h$). This naturally enables a degree of parallelism in the implementation. Networks such as the previous one are commonly called feedforward, because their graph is a directed acyclic graph. Networks with cycles are commonly called recurrent. Such networks are commonly depicted in the manner shown at the top of the figure, where $f$ is shown as dependent upon itself. However, an implied temporal dependence is not shown.
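The functional view of the feedforward architecture described above can be sketched as a chain of layer functions. This is only an illustration under stated assumptions: the input is taken to be 2-dimensional, every layer uses the hyperbolic tangent, and the weight values in `W_h`, `W_g` and `W_f` are arbitrary numbers chosen for the example.

```python
import math


def layer(weights, inputs, activation=math.tanh):
    """Apply one layer; each row of `weights` produces one output component.

    The components are computed independently of each other given the input,
    which is what makes layer-wise parallelism possible.
    """
    return [activation(sum(w * v for w, v in zip(row, inputs))) for row in weights]


# Hypothetical weights, chosen only for illustration.
W_h = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]  # x (2-dim, assumed) -> h (3-dim)
W_g = [[0.2, -0.1, 0.4], [0.7, 0.0, -0.3]]    # h (3-dim) -> g (2-dim)
W_f = [[1.0, -1.0]]                           # g (2-dim) -> f (scalar)


def network(x):
    """Functional view: x is transformed into h, then g, then f."""
    h = layer(W_h, x)
    g = layer(W_g, h)
    (f,) = layer(W_f, g)
    return f


y = network([1.0, 2.0])
```

Because the call graph runs strictly from `x` to `f` with no cycles, this is a feedforward network in the sense used above; a recurrent network would feed an output back into an earlier layer at the next time step.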
== Backpropagation ==
Backpropagation training algorithms fall into three categories: steepest descent (with variable learning rate and momentum, resilient backpropagation); quasi-Newton (Broyden–Fletcher–Goldfarb–Shanno, one step secant); Levenberg–Marquardt and conjugate gradient (Fletcher–Reeves update, Polak–Ribière update, Powell–Beale restart, scaled conjugate gradient).
=== Algorithm ===
Let $N$ be a network with $e$ connections, $m$ inputs and $n$ outputs. Below, $x_1, x_2, \dots$ denote vectors in $\mathbb{R}^m$, $y_1, y_2, \dots$ vectors in $\mathbb{R}^n$, and $w_0, w_1, w_2, \ldots$ vectors in $\mathbb{R}^e$. These are called inputs, outputs and weights, respectively. The network corresponds to a function $y = f_N(w, x)$ which, given a weight $w$, maps an input $x$ to an output $y$. In supervised learning, a sequence of training examples $(x_1, y_1), \dots, (x_p, y_p)$ produces a sequence of weights $w_0, w_1, \dots, w_p$ starting from some initial weight $w_0$, usually chosen at random. These weights are computed in turn: first compute $w_i$ using only $(x_i, y_i, w_{i-1})$ for $i = 1, \dots, p$. The output of the algorithm is then $w_p$, giving a new function $x \mapsto f_N(w_p, x)$. The computation is the same in each step, hence only the case $i = 1$ is described.
$w_1$ is calculated from $(x_1, y_1, w_0)$ by considering a variable weight $w$ and applying gradient descent to the function $w \mapsto E(f_N(w, x_1), y_1)$ to find a local minimum, starting at $w = w_0$. This makes $w_1$ the minimizing weight found by gradient descent.
== Learning pseudocode ==
To implement the algorithm above, explicit formulas are required for the gradient of the function $w \mapsto E(f_N(w, x), y)$ where the function is $E(y, y') = |y - y'|^2$. The learning algorithm can be divided into two phases: propagation and weight update.
=== Propagation ===
Propagation involves the following steps: (1) propagation forward through the network to generate the output value(s); (2) calculation of the cost (error term); (3) propagation of the output activations back through the network using the training pattern target to generate the deltas (the difference between the targeted and actual output values) of all output and hidden neurons.
=== Weight update ===
For each weight: (1) multiply the weight's output delta and input activation to find the gradient of the weight; (2) subtract the ratio (percentage) of the weight's gradient from the weight. The learning rate is the ratio (percentage) that influences the speed and quality of learning. The greater the ratio, the faster the neuron trains, but the lower the ratio, the more accurat