AI Email Write Up

AI Email Write Up — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Data remanence

    Data remanence

    Data remanence is the residual representation of digital data that remains even after attempts have been made to remove or erase the data. This residue may result from data being left intact by a nominal file deletion operation, by reformatting of storage media that does not remove data previously written to the media, or through physical properties of the storage media that allow previously written data to be recovered. Data remanence may make inadvertent disclosure of sensitive information possible should the storage media be released into an uncontrolled environment (e.g., thrown in refuse containers or lost). Various techniques have been developed to counter data remanence. These techniques are classified as clearing, purging/sanitizing, or destruction. Specific methods include overwriting, degaussing, encryption, and media destruction. Effective application of countermeasures can be complicated by several factors, including media that are inaccessible, media that cannot effectively be erased, advanced storage systems that maintain histories of data throughout the data's life cycle, and persistence of data in memory that is typically considered volatile. Several standards exist for the secure removal of data and the elimination of data remanence. == Causes == Many operating systems, file managers, and other software provide a facility where a file is not immediately deleted when the user requests that action. Instead, the file is moved to a holding area (i.e. the "trash"), making it easy for the user to undo a mistake. Similarly, many software products automatically create backup copies of files that are being edited, to allow the user to restore the original version, or to recover from a possible crash (autosave feature). Even when an explicit deleted file retention facility is not provided or when the user does not use it, operating systems do not actually remove the contents of a file when it is deleted unless they are aware that explicit erasure commands are required, like on a solid-state drive. (In such cases, the operating system will issue the Serial ATA TRIM command or the SCSI UNMAP command to let the drive know to no longer maintain the deleted data.) Instead, they simply remove the file's entry from the file system directory because this requires less work and is therefore faster, and the contents of the file—the actual data—remain on the storage medium. The data will remain there until the operating system reuses the space for new data. In some systems, enough filesystem metadata are also left behind to enable easy undeletion by commonly available utility software. Even when undelete has become impossible, the data, until it has been overwritten, can be read by software that reads disk sectors directly. Computer forensics often employs such software. Likewise, reformatting, repartitioning, or reimaging a system is unlikely to write to every area of the disk, though all will cause the disk to appear empty or, in the case of reimaging, empty except for the files present in the image, to most software. Finally, even when the storage media is overwritten, physical properties of the media may permit recovery of the previous contents. In most cases however, this recovery is not possible by just reading from the storage device in the usual way, but requires using laboratory techniques such as disassembling the device and directly accessing/reading from its components. § Complications below gives further explanations for causes of data remanence. == Countermeasures == There are three levels commonly recognized for eliminating remnant data: === Clearing === Clearing is the removal of sensitive data from storage devices in such a way that there is assurance that the data may not be reconstructed using normal system functions or software file/data recovery utilities. The data may still be recoverable, but not without special laboratory techniques. Clearing is typically an administrative protection against accidental disclosure within an organization. For example, before a hard drive is re-used within an organization, its contents may be cleared to prevent their accidental disclosure to the next user. === Purging === Purging or sanitizing is the physical rewrite of sensitive data from a system or storage device done with the specific intent of rendering the data unrecoverable at a later time. Purging, proportional to the sensitivity of the data, is generally done before releasing media beyond control, such as before discarding old media, or moving media to a computer with different security requirements. === Destruction === The storage media is made unusable for conventional equipment. Effectiveness of destroying the media varies by medium and method. Depending on recording density of the media, and/or the destruction technique, this may leave data recoverable by laboratory methods. Conversely, destruction using appropriate techniques is the most secure method of preventing retrieval. == Specific methods == === Overwriting === A common method used to counter data remanence is to overwrite the storage media with new data. This is often called wiping or shredding a disk or file, by analogy to common methods of destroying print media, although the mechanism bears no similarity to these. Because such a method can often be implemented in software alone, and may be able to selectively target only part of the media, it is a popular, low-cost option for some applications. Overwriting is generally an acceptable method of clearing, as long as the media is writable and not damaged. The simplest overwrite technique writes the same data everywhere—often just a pattern of all zeros. At a minimum, this will prevent the data from being retrieved simply by reading from the media again using standard system functions. The UEFI in modern machines may offer an ATA class disk erase function as well. The ATA-6 standard governs secure erases specifications. Bitlocker is whole disk encryption and illegible without the key. Writing a fresh GPT allows a new file system to be established. Blocks will set empty but LBA read is illegible. New data will be unaffected and work fine. In an attempt to counter more advanced data recovery techniques, specific overwrite patterns and multiple passes have often been prescribed. These may be generic patterns intended to eradicate any trace signatures; an example is the seven-pass pattern 0xF6, 0x00, 0xFF, , 0x00, 0xFF, , sometimes erroneously attributed to US standard DOD 5220.22-M. One challenge with overwriting is that some areas of the disk may be inaccessible, due to media degradation or other errors. Software overwrite may also be problematic in high-security environments, which require stronger controls on data commingling than can be provided by the software in use. The use of advanced storage technologies may also make file-based overwrite ineffective (see the related discussion below under § Complications). There are specialized machines and software that are capable of doing overwriting. The software can sometimes be a standalone operating system specifically designed for data destruction. There are also machines specifically designed to wipe hard drives to the department of defense specifications DOD 5220.22-M. Writing zero to each block on hard disks and SSDs has the advantage of affording the firmware to deploy spare blocks when bad blocks are identified. Bitlocker has the advantage that data is illegible without the key. Seatools and other tools can erase disks with zero which is typical to revive old consumer class disks but they can wipe server disks albeit slowly. Modern 28TB and larger disks have an enormous number of LBA48 blocks. 40TB and 60TB disks will take proportionately longer times to wipe. ==== Feasibility of recovering overwritten data ==== Peter Gutmann investigated data recovery from nominally overwritten media in the mid-1990s. He suggested magnetic force microscopy may be able to recover such data, and developed specific patterns, for specific drive technologies, designed to counter such. These patterns have come to be known as the Gutmann method. Gutmann's belief in the possibility of data recovery is based on many questionable assumptions and factual errors that indicate a low level of understanding of how hard drives work. Daniel Feenberg, an economist at the private National Bureau of Economic Research, claims that the chances of overwritten data being recovered from a modern hard drive amount to "urban legend". He also points to the "18+1⁄2-minute gap" Rose Mary Woods created on a tape of Richard Nixon discussing the Watergate break-in. Erased information in the gap has not been recovered, and Feenberg claims doing so would be an easy task compared to recovery of a modern high density digital signal. As of November 2007, the United States Department of Defense considers overwriting acceptable for clearing magnetic media within the same security area/

    Read more →
  • Mountain car problem

    Mountain car problem

    Mountain Car, a standard testing domain in Reinforcement learning, is a problem in which an under-powered car must drive up a steep hill. Since gravity is stronger than the car's engine, even at full throttle, the car cannot simply accelerate up the steep slope. The car is situated in a valley and must learn to leverage potential energy by driving up the opposite hill before the car is able to make it to the goal at the top of the rightmost hill. The domain has been used as a test bed in various reinforcement learning papers. == Introduction == The mountain car problem, although fairly simple, is commonly applied because it requires a reinforcement learning agent to learn on two continuous variables: position and velocity. For any given state (position and velocity) of the car, the agent is given the possibility of driving left, driving right, or not using the engine at all. In the standard version of the problem, the agent receives a negative reward at every time step when the goal is not reached; the agent has no information about the goal until an initial success. == History == The mountain car problem appeared first in Andrew Moore's PhD thesis (1990). It was later more strictly defined in Singh and Sutton's reinforcement learning paper with eligibility traces. The problem became more widely studied when Sutton and Barto added it to their book Reinforcement Learning: An Introduction (1998). Throughout the years many versions of the problem have been used, such as those which modify the reward function, termination condition, and the start state. == Techniques used to solve mountain car == Q-learning and similar techniques for mapping discrete states to discrete actions need to be extended to be able to deal with the continuous state space of the problem. Approaches often fall into one of two categories, state space discretization or function approximation. === Discretization === In this approach, two continuous state variables are pushed into discrete states by bucketing each continuous variable into multiple discrete states. This approach works with properly tuned parameters but a disadvantage is information gathered from one state is not used to evaluate another state. Tile coding can be used to improve discretization and involves continuous variables mapping into sets of buckets offset from one another. Each step of training has a wider impact on the value function approximation because when the offset grids are summed, the information is diffused. === Function approximation === Function approximation is another way to solve the mountain car. By choosing a set of basis functions beforehand, or by generating them as the car drives, the agent can approximate the value function at each state. Unlike the step-wise version of the value function created with discretization, function approximation can more cleanly estimate the true smooth function of the mountain car domain. === Eligibility traces === One aspect of the problem involves the delay of actual reward. The agent is not able to learn about the goal until a successful completion. Given a naive approach for each trial the car can only backup the reward of the goal slightly. This is a problem for naive discretization because each discrete state will only be backed up once, taking a larger number of episodes to learn the problem. This problem can be alleviated via the mechanism of eligibility traces, which will automatically backup the reward given to states before, dramatically increasing the speed of learning. Eligibility traces can be viewed as a bridge from temporal difference learning methods to Monte Carlo methods. == Technical details == The mountain car problem has undergone many iterations. This section focuses on the standard well-defined version from Sutton (2008). === State variables === Two-dimensional continuous state space. V e l o c i t y = ( − 0.07 , 0.07 ) {\displaystyle Velocity=(-0.07,0.07)} P o s i t i o n = ( − 1.2 , 0.6 ) {\displaystyle Position=(-1.2,0.6)} === Actions === One-dimensional discrete action space. m o t o r = ( l e f t , n e u t r a l , r i g h t ) {\displaystyle motor=(left,neutral,right)} === Reward === For every time step: r e w a r d = − 1 {\displaystyle reward=-1} === Update function === For every time step: A c t i o n = [ − 1 , 0 , 1 ] {\displaystyle Action=[-1,0,1]} V e l o c i t y = V e l o c i t y + ( A c t i o n ) ∗ 0.001 + cos ⁡ ( 3 ∗ P o s i t i o n ) ∗ ( − 0.0025 ) {\displaystyle Velocity=Velocity+(Action)0.001+\cos(3Position)(-0.0025)} P o s i t i o n = P o s i t i o n + V e l o c i t y {\displaystyle Position=Position+Velocity} === Starting condition === Optionally, many implementations include randomness in both parameters to show better generalized learning. P o s i t i o n = − 0.5 {\displaystyle Position=-0.5} V e l o c i t y = 0.0 {\displaystyle Velocity=0.0} === Termination condition === End the simulation when: P o s i t i o n ≥ 0.6 {\displaystyle Position\geq 0.6} == Variations == There are many versions of the mountain car which deviate in different ways from the standard model. Variables that vary include but are not limited to changing the constants (gravity and steepness) of the problem so specific tuning for specific policies become irrelevant and altering the reward function to affect the agent's ability to learn in a different manner. An example is changing the reward to be equal to the distance from the goal, or changing the reward to zero everywhere and one at the goal. Additionally, a 3D mountain car can be used, with a 4D continuous state space.

    Read more →
  • Solomonoff's theory of inductive inference

    Solomonoff's theory of inductive inference

    Solomonoff's theory of inductive inference proves that, under its common sense assumptions (axioms), the best possible scientific model is the shortest algorithm that generates the empirical data under consideration. In addition to the choice of data, other assumptions are that, to avoid the post-hoc fallacy, the programming language must be chosen prior to the data and that the environment being observed is generated by an unknown algorithm. This is also called a theory of induction. Due to its basis in the dynamical (state-space model) character of Algorithmic Information Theory, it encompasses statistical as well as dynamical information criteria for model selection. It was introduced by Ray Solomonoff, based on probability theory and theoretical computer science. In essence, Solomonoff's induction derives the posterior probability of any computable theory, given a sequence of observed data. This posterior probability is derived from Bayes' rule and some universal prior, that is, a prior that assigns a positive probability to any computable theory. Solomonoff proved that this induction is incomputable (or more precisely, lower semi-computable), but noted that "this incomputability is of a very benign kind", and that it "in no way inhibits its use for practical prediction" (as it can be approximated from below more accurately with more computational resources). It is only "incomputable" in the benign sense that no scientific consensus is able to prove that the best current scientific theory is the best of all possible theories. However, Solomonoff's theory does provide an objective criterion for deciding among the current scientific theories explaining a given set of observations. Solomonoff's induction naturally formalizes Occam's razor by assigning larger prior credences to theories that require a shorter algorithmic description. == Origin == === Philosophical === The theory is based in philosophical foundations, and was founded by Ray Solomonoff around 1960. It is a mathematically formalized combination of Occam's razor and the Principle of Multiple Explanations. All computable theories which perfectly describe previous observations are used to calculate the probability of the next observation, with more weight put on the shorter computable theories. Marcus Hutter's universal artificial intelligence builds upon this to calculate the expected value of an action. === Principle === Solomonoff's induction has been argued to be the computational formalization of pure Bayesianism. To understand, recall that Bayesianism derives the posterior probability P [ T | D ] {\displaystyle \mathbb {P} [T|D]} of a theory T {\displaystyle T} given data D {\displaystyle D} by applying Bayes rule, which yields P [ T | D ] = P [ D | T ] P [ T ] P [ D | T ] P [ T ] + ∑ A ≠ T P [ D | A ] P [ A ] {\displaystyle \mathbb {P} [T|D]={\frac {\mathbb {P} [D|T]\mathbb {P} [T]}{\mathbb {P} [D|T]\mathbb {P} [T]+\sum _{A\neq T}\mathbb {P} [D|A]\mathbb {P} [A]}}} where theories A {\displaystyle A} are alternatives to theory T {\displaystyle T} . For this equation to make sense, the quantities P [ D | T ] {\displaystyle \mathbb {P} [D|T]} and P [ D | A ] {\displaystyle \mathbb {P} [D|A]} must be well-defined for all theories T {\displaystyle T} and A {\displaystyle A} . In other words, any theory must define a probability distribution over observable data D {\displaystyle D} . Solomonoff's induction essentially boils down to demanding that all such probability distributions be computable. Interestingly, the set of computable probability distributions is a subset of the set of all programs, which is countable. Similarly, the sets of observable data considered by Solomonoff were finite. Without loss of generality, we can thus consider that any observable data is a finite bit string. As a result, Solomonoff's induction can be defined by only invoking discrete probability distributions. Solomonoff's induction then allows to make probabilistic predictions of future data F {\displaystyle F} , by simply obeying the laws of probability. Namely, we have P [ F | D ] = E T [ P [ F | T , D ] ] = ∑ T P [ F | T , D ] P [ T | D ] {\displaystyle \mathbb {P} [F|D]=\mathbb {E} _{T}[\mathbb {P} [F|T,D]]=\sum _{T}\mathbb {P} [F|T,D]\mathbb {P} [T|D]} . This quantity can be interpreted as the average predictions P [ F | T , D ] {\displaystyle \mathbb {P} [F|T,D]} of all theories T {\displaystyle T} given past data D {\displaystyle D} , weighted by their posterior credences P [ T | D ] {\displaystyle \mathbb {P} [T|D]} . === Mathematical === The proof of the "razor" is based on the known mathematical properties of a probability distribution over a countable set. These properties are relevant because the infinite set of all programs is a denumerable set. The sum S of the probabilities of all programs must be exactly equal to one (as per the definition of probability) thus the probabilities must roughly decrease as we enumerate the infinite set of all programs, otherwise S will be strictly greater than one. To be more precise, for every ϵ {\displaystyle \epsilon } > 0, there is some length l such that the probability of all programs longer than l is at most ϵ {\displaystyle \epsilon } . This does not, however, preclude very long programs from having very high probability. Fundamental ingredients of the theory are the concepts of algorithmic probability and Kolmogorov complexity. The universal prior probability of any prefix p of a computable sequence x is the sum of the probabilities of all programs (for a universal computer) that compute something starting with p. Given some p and any computable but unknown probability distribution from which x is sampled, the universal prior and Bayes' theorem can be used to predict the yet unseen parts of x in optimal fashion. == Mathematical guarantees == === Solomonoff's completeness === The remarkable property of Solomonoff's induction is its completeness. In essence, the completeness theorem guarantees that the expected cumulative errors made by the predictions based on Solomonoff's induction are upper-bounded by the Kolmogorov complexity of the (stochastic) data generating process. The errors can be measured using the Kullback–Leibler divergence or the square of the difference between the induction's prediction and the probability assigned by the (stochastic) data generating process. === Solomonoff's uncomputability === Unfortunately, Solomonoff also proved that Solomonoff's induction is uncomputable. In fact, he showed that computability and completeness are mutually exclusive: any complete theory must be uncomputable. The proof of this is derived from a game between the induction and the environment. Essentially, any computable induction can be tricked by a computable environment, by choosing the computable environment that negates the computable induction's prediction. This fact can be regarded as an instance of the no free lunch theorem. == Modern applications == === Artificial intelligence === Though Solomonoff's inductive inference is not computable, several AIXI-derived algorithms approximate it in order to make it run on a modern computer. The more computing power they are given, the closer their predictions are to the predictions of inductive inference (their mathematical limit is Solomonoff's inductive inference). Another direction of inductive inference is based on E. Mark Gold's model of learning in the limit from 1967 and has developed since then more and more models of learning. The general scenario is the following: Given a class S of computable functions, is there a learner (that is, recursive functional) which for any input of the form (f(0),f(1),...,f(n)) outputs a hypothesis (an index e with respect to a previously agreed on acceptable numbering of all computable functions; the indexed function may be required consistent with the given values of f). A learner M learns a function f if almost all its hypotheses are the same index e, which generates the function f; M learns S if M learns every f in S. Basic results are that all recursively enumerable classes of functions are learnable while the class REC of all computable functions is not learnable. Many related models have been considered and also the learning of classes of recursively enumerable sets from positive data is a topic studied from Gold's pioneering paper in 1967 onwards. A far reaching extension of the Gold’s approach is developed by Schmidhuber's theory of generalized Kolmogorov complexities, which are kinds of super-recursive algorithms.

    Read more →
  • Pattern theory

    Pattern theory

    Pattern theory, formulated by Ulf Grenander, is a mathematical formalism to describe knowledge of the world as patterns. It differs from other approaches to artificial intelligence in that it does not begin by prescribing algorithms and machinery to recognize and classify patterns; rather, it prescribes a vocabulary to articulate and recast the pattern concepts in precise language. Broad in its mathematical coverage, Pattern Theory spans algebra and statistics, as well as local topological and global entropic properties. In addition to the new algebraic vocabulary, its statistical approach is novel in its aim to: Identify the hidden variables of a data set using real world data rather than artificial stimuli, which was previously commonplace. Formulate prior distributions for hidden variables and models for the observed variables that form the vertices of a Gibbs-like graph. Study the randomness and variability of these graphs. Create the basic classes of stochastic models applied by listing the deformations of the patterns. Synthesize (sample) from the models, not just analyze signals with them. The Brown University Pattern Theory Group was formed in 1972 by Ulf Grenander. Many mathematicians are currently working in this group, noteworthy among them being the Fields Medalist David Mumford. Mumford regards Grenander as his "guru" in Pattern Theory.

    Read more →
  • TalkBack

    TalkBack

    TalkBack is an accessibility service for the Android operating system that helps blind and visually impaired users to interact with their devices. It uses spoken words, vibration and other audible feedback to allow the user to know what is happening on the screen allowing the user to better interact with their device. The service is pre-installed on many Android devices, and it became part of the Android Accessibility Suite in 2017. According to the Google Play Store, the Android Accessibility Suite has been downloaded over five billion times, including devices that have the suite preinstalled. == Open-source == Google releases the source code of TalkBack with some releases of the accessibility service to GitHub, with the latest of these changes being from May 6, 2021. The source for these versions of Google TalkBack have been released under the Apache License version 2.0. == Release history ==

    Read more →
  • Spreading activation

    Spreading activation

    Spreading activation is a method for searching associative networks, biological and artificial neural networks, or semantic networks. The search process is initiated by labeling a set of source nodes (e.g. concepts in a semantic network) with weights or "activation" and then iteratively propagating or "spreading" that activation out to other nodes linked to the source nodes. Most often these "weights" are real values that decay as activation propagates through the network. When the weights are discrete this process is often referred to as marker passing. Activation may originate from alternate paths, identified by distinct markers, and terminate when two alternate paths reach the same node. However brain studies show that several different brain areas play an important role in semantic processing. Spreading activation in semantic networks as a model were invented in cognitive psychology to model the fan out effect. Spreading activation can also be applied in information retrieval, by means of a network of nodes representing documents and terms contained in those documents. == Cognitive psychology == As it relates to cognitive psychology, spreading activation is the theory of how the brain iterates through a network of associated ideas to retrieve specific information. The spreading activation theory presents the array of concepts within our memory as cognitive units, each consisting of a node and its associated elements or characteristics, all connected together by edges. A spreading activation network can be represented schematically, in a sort of web diagram with shorter lines between two nodes meaning the ideas are more closely related and will typically be associated more quickly to the original concept. In memory psychology, the spreading activation model holds that people organize their knowledge of the world based on their personal experiences, which in turn form the network of ideas that is the person's knowledge of the world. When a word (the target) is preceded by an associated word (the prime) in word recognition tasks, participants seem to perform better in the amount of time that it takes them to respond. For instance, subjects respond faster to the word "doctor" when it is preceded by "nurse" than when it is preceded by an unrelated word like "carrot". This semantic priming effect with words that are close in meaning within the cognitive network has been seen in a wide range of tasks given by experimenters, ranging from sentence verification to lexical decision and naming. As another example, if the original concept is "red" and the concept "vehicles" is primed, they are much more likely to say "fire engine" instead of something unrelated to vehicles, such as "cherries". If instead "fruits" was primed, they would likely name "cherries" and continue on from there. The activation of pathways in the network has everything to do with how closely linked two concepts are by meaning, as well as how a subject is primed. == Algorithm == A directed graph is populated by Nodes[ 1...N ] each having an associated activation value A [ i ] which is a real number in the range [0.0 ... 1.0]. A Link[ i, j ] connects source node[ i ] with target node[ j ]. Each edge has an associated weight W [ i, j ] usually a real number in the range [0.0 ... 1.0]. Parameters: Firing threshold F, a real number in the range [0.0 ... 1.0] Decay factor D, a real number in the range [0.0 ... 1.0] Steps: Initialize the graph setting all activation values A [ i ] to zero. Set one or more origin nodes to an initial activation value greater than the firing threshold F. A typical initial value is 1.0. For each unfired node [ i ] in the graph having an activation value A [ i ] greater than the node firing threshold F: For each Link [ i, j ] connecting the source node [ i ] with target node [ j ], adjust A [ j ] = A [ j ] + (A [ i ] W [ i, j ] D) where D is the decay factor. If a target node receives an adjustment to its activation value so that it would exceed 1.0, then set its new activation value to 1.0. Likewise maintain 0.0 as a lower bound on the target node's activation value should it receive an adjustment to below 0.0. Once a node has fired it may not fire again, although variations of the basic algorithm permit repeated firings and loops through the graph. Nodes receiving a new activation value that exceeds the firing threshold F are marked for firing on the next spreading activation cycle. If activation originates from more than one node, a variation of the algorithm permits marker passing to distinguish the paths by which activation is spread over the graph The procedure terminates when either there are no more nodes to fire or in the case of marker passing from multiple origins, when a node is reached from more than one path. Variations of the algorithm that permit repeated node firings and activation loops in the graph, terminate after a steady activation state, with respect to some delta, is reached, or when a maximum number of iterations is exceeded. == Examples ==

    Read more →
  • Symbolic artificial intelligence

    Symbolic artificial intelligence

    In artificial intelligence, symbolic artificial intelligence (also known as classical artificial intelligence or logic-based artificial intelligence) is the term for the collection of all methods in artificial intelligence research that are based on high-level symbolic (human-readable) representations of problems, logic, and search. Symbolic AI used tools such as logic programming, production rules, semantic nets and frames, and it developed applications such as knowledge-based systems (in particular, expert systems), symbolic mathematics, automated theorem provers, ontologies, the semantic web, and automated planning and scheduling systems. The Symbolic AI paradigm led to important ideas in search, symbolic programming languages, agents, multi-agent systems, the semantic web, and the strengths and limitations of formal knowledge and reasoning systems. Symbolic AI was the dominant paradigm of AI research from the mid-1950s until the mid-1990s. Researchers in the 1960s and the 1970s were convinced that symbolic approaches would eventually succeed in creating a machine with artificial general intelligence and considered this the ultimate goal of their field. An early boom, with early successes such as the Logic Theorist and Samuel's Checkers Playing Program, led to unrealistic expectations and promises and was followed by the first AI Winter as funding dried up. A second boom (1969–1986) occurred with the rise of expert systems, their promise of capturing corporate expertise, and an enthusiastic corporate embrace. That boom, and some early successes, e.g., with XCON at DEC, was followed again by later disappointment. Problems with difficulties in knowledge acquisition, maintaining large knowledge bases, and brittleness in handling out-of-domain problems arose. Another, second, AI Winter (1988–2011) followed. Subsequently, AI researchers focused on addressing underlying problems in handling uncertainty and in knowledge acquisition. Uncertainty was addressed with formal methods such as hidden Markov models, Bayesian reasoning, and statistical relational learning. Symbolic machine learning addressed the knowledge acquisition problem with contributions including Version Space, Valiant's PAC learning, Quinlan's ID3 decision-tree learning, case-based learning, and inductive logic programming to learn relations. Neural networks, a subsymbolic approach, had been pursued from early days and reemerged strongly in 2012. Early examples are Rosenblatt's perceptron learning work, the backpropagation work of Rumelhart, Hinton and Williams, and work in convolutional neural networks by LeCun et al. in 1989. However, neural networks were not viewed as successful until about 2012: "Until Big Data became commonplace, the general consensus in the Al community was that the so-called neural-network approach was hopeless. Systems just didn't work that well, compared to other methods. ... A revolution came in 2012, when a number of people, including a team of researchers working with Hinton, worked out a way to use the power of GPUs to enormously increase the power of neural networks." Over the next several years, deep learning had spectacular success in handling vision, speech recognition, speech synthesis, image generation, and machine translation, though symbolic approaches continue to be useful in a few domains such as computer algebra systems and proof assistants. == History == A short history of symbolic AI to the present day follows below. Time periods and titles are drawn from Henry Kautz's 2020 AAAI Robert S. Engelmore Memorial Lecture and the longer Wikipedia article on the History of AI, with dates and titles differing slightly for increased clarity. === The first AI summer: irrational exuberance, 1948–1966 === Success at early attempts in AI occurred in three main areas: artificial neural networks, knowledge representation, and heuristic search, contributing to high expectations. This section summarizes Kautz's reprise of early AI history. ==== Approaches inspired by human or animal cognition or behavior ==== Cybernetic approaches attempted to replicate the feedback loops between animals and their environments. A robotic turtle, with sensors, motors for driving and steering, and seven vacuum tubes for control, based on a preprogrammed neural net, was built as early as 1948. This work can be seen as an early precursor to later work in neural networks, reinforcement learning, and situated robotics. An important early symbolic AI program was the Logic theorist, written by Allen Newell, Herbert Simon and Cliff Shaw in 1955–56, as it was able to prove 38 elementary theorems from Whitehead and Russell's Principia Mathematica. Newell, Simon, and Shaw later generalized this work to create a domain-independent problem solver, GPS (General Problem Solver). GPS solved problems represented with formal operators via state-space search using means-ends analysis. During the 1960s, symbolic approaches achieved great success at simulating intelligent behavior in structured environments such as game-playing, symbolic mathematics, and theorem-proving. AI research was concentrated in four institutions in the 1960s: Carnegie Mellon University, Stanford, MIT and (later) University of Edinburgh. Each one developed its own style of research. Earlier approaches based on cybernetics or artificial neural networks were abandoned or pushed into the background. Herbert Simon and Allen Newell studied human problem-solving skills and attempted to formalize them, and their work laid the foundations of the field of artificial intelligence, as well as cognitive science, operations research and management science. Their research team used the results of psychological experiments to develop programs that simulated the techniques that people used to solve problems. This tradition, centered at Carnegie Mellon University would eventually culminate in the development of the Soar architecture in the middle 1980s. ==== Heuristic search ==== In addition to the highly specialized domain-specific kinds of knowledge that we will see later used in expert systems, early symbolic AI researchers discovered another more general application of knowledge. These were called heuristics, rules of thumb that guide a search in promising directions: "How can non-enumerative search be practical when the underlying problem is exponentially hard? The approach advocated by Simon and Newell is to employ heuristics: fast algorithms that may fail on some inputs or output suboptimal solutions." Another important advance was to find a way to apply these heuristics that guarantees a solution will be found, if there is one, not withstanding the occasional fallibility of heuristics: "The A algorithm provided a general frame for complete and optimal heuristically guided search. A is used as a subroutine within practically every AI algorithm today but is still no magic bullet; its guarantee of completeness is bought at the cost of worst-case exponential time. ==== Early work on knowledge representation and reasoning ==== Early work covered both applications of formal reasoning emphasizing first-order logic, along with attempts to handle common-sense reasoning in a less formal manner. ===== Modeling formal reasoning with logic: the "neats" ===== Unlike Simon and Newell, John McCarthy felt that machines did not need to simulate the exact mechanisms of human thought, but could instead try to find the essence of abstract reasoning and problem-solving with logic, regardless of whether people used the same algorithms. His laboratory at Stanford (SAIL) focused on using formal logic to solve a wide variety of problems, including knowledge representation, planning and learning. Logic was also the focus of the work at the University of Edinburgh and elsewhere in Europe which led to the development of the programming language Prolog and the science of logic programming. ===== Modeling implicit common-sense knowledge with frames and scripts: the "scruffies" ===== Researchers at MIT (such as Marvin Minsky and Seymour Papert) found that solving difficult problems in vision and natural language processing required ad hoc solutions—they argued that no simple and general principle (like logic) would capture all the aspects of intelligent behavior. Roger Schank described their "anti-logic" approaches as "scruffy" (as opposed to the "neat" paradigms at CMU and Stanford). Commonsense knowledge bases (such as Doug Lenat's Cyc) are an example of "scruffy" AI, since they must be built by hand, one complicated concept at a time. === The first AI winter: crushed dreams, 1967–1977 === The first AI winter was a shock: During the first AI summer, many people thought that machine intelligence could be achieved in just a few years. The Defense Advance Research Projects Agency (DARPA) launched programs to support AI research to use AI to solve problems of national security; in particular, to automate the translation of Russian to English for inte

    Read more →
  • Eager learning

    Eager learning

    In artificial intelligence, eager learning is a learning method in which the system tries to construct a general, input-independent target function during training of the system, as opposed to lazy learning, where generalization beyond the training data is delayed until a query is made to the system. The main advantage gained in employing an eager learning method, such as an artificial neural network, is that the target function will be approximated globally during training, thus requiring much less space than using a lazy learning system. Eager learning systems also deal much better with noise in the training data. Eager learning is an example of offline learning, in which post-training queries to the system have no effect on the system itself, and thus the same query to the system will always produce the same result. The main disadvantage with eager learning is that it is generally unable to provide good local approximations in the target function.

    Read more →
  • Similarity learning

    Similarity learning

    Similarity learning is an area of supervised machine learning in artificial intelligence. It is closely related to regression and classification, but the goal is to learn a similarity function that measures how similar or related two objects are. It has applications in ranking, in recommendation systems, visual identity tracking, face verification, and speaker verification. == Learning setup == There are four common setups for similarity and metric distance learning. Regression similarity learning In this setup, pairs of objects are given ( x i 1 , x i 2 ) {\displaystyle (x_{i}^{1},x_{i}^{2})} together with a measure of their similarity y i ∈ R {\displaystyle y_{i}\in R} . The goal is to learn a function that approximates f ( x i 1 , x i 2 ) ∼ y i {\displaystyle f(x_{i}^{1},x_{i}^{2})\sim y_{i}} for every new labeled triplet example ( x i 1 , x i 2 , y i ) {\displaystyle (x_{i}^{1},x_{i}^{2},y_{i})} . This is typically achieved by minimizing a regularized loss min W ∑ i l o s s ( w ; x i 1 , x i 2 , y i ) + r e g ( w ) {\displaystyle \min _{W}\sum _{i}loss(w;x_{i}^{1},x_{i}^{2},y_{i})+reg(w)} . Classification similarity learning Given are pairs of similar objects ( x i , x i + ) {\displaystyle (x_{i},x_{i}^{+})} and non similar objects ( x i , x i − ) {\displaystyle (x_{i},x_{i}^{-})} . An equivalent formulation is that every pair ( x i 1 , x i 2 ) {\displaystyle (x_{i}^{1},x_{i}^{2})} is given together with a binary label y i ∈ { 0 , 1 } {\displaystyle y_{i}\in \{0,1\}} that determines if the two objects are similar or not. The goal is again to learn a classifier that can decide if a new pair of objects is similar or not. Ranking similarity learning Given are triplets of objects ( x i , x i + , x i − ) {\displaystyle (x_{i},x_{i}^{+},x_{i}^{-})} whose relative similarity obey a predefined order: x i {\displaystyle x_{i}} is known to be more similar to x i + {\displaystyle x_{i}^{+}} than to x i − {\displaystyle x_{i}^{-}} . The goal is to learn a function f {\displaystyle f} such that for any new triplet of objects ( x , x + , x − ) {\displaystyle (x,x^{+},x^{-})} , it obeys f ( x , x + ) > f ( x , x − ) {\displaystyle f(x,x^{+})>f(x,x^{-})} (contrastive learning). This setup assumes a weaker form of supervision than in regression, because instead of providing an exact measure of similarity, one only has to provide the relative order of similarity. For this reason, ranking-based similarity learning is easier to apply in real large-scale applications. Locality sensitive hashing (LSH) Hashes input items so that similar items map to the same "buckets" in memory with high probability (the number of buckets being much smaller than the universe of possible input items). It is often applied in nearest neighbor search on large-scale high-dimensional data, e.g., image databases, document collections, time-series databases, and genome databases. A common approach for learning similarity is to model the similarity function as a bilinear form. For example, in the case of ranking similarity learning, one aims to learn a matrix W that parametrizes the similarity function f W ( x , z ) = x T W z {\displaystyle f_{W}(x,z)=x^{T}Wz} . When data is abundant, a common approach is to learn a siamese network – a deep network model with parameter sharing. == Metric learning == Similarity learning is closely related to distance metric learning. Metric learning is the task of learning a distance function over objects. A metric or distance function has to obey four axioms: non-negativity, identity of indiscernibles, symmetry and subadditivity (or the triangle inequality). In practice, metric learning algorithms ignore the condition of identity of indiscernibles and learn a pseudo-metric. When the objects x i {\displaystyle x_{i}} are vectors in R d {\displaystyle R^{d}} , then any matrix W {\displaystyle W} in the symmetric positive semi-definite cone S + d {\displaystyle S_{+}^{d}} defines a distance pseudo-metric of the space of x through the form D W ( x 1 , x 2 ) 2 = ( x 1 − x 2 ) ⊤ W ( x 1 − x 2 ) {\displaystyle D_{W}(x_{1},x_{2})^{2}=(x_{1}-x_{2})^{\top }W(x_{1}-x_{2})} . When W {\displaystyle W} is a symmetric positive definite matrix, D W {\displaystyle D_{W}} is a metric. Moreover, as any symmetric positive semi-definite matrix W ∈ S + d {\displaystyle W\in S_{+}^{d}} can be decomposed as W = L ⊤ L {\displaystyle W=L^{\top }L} where L ∈ R e × d {\displaystyle L\in R^{e\times d}} and e ≥ r a n k ( W ) {\displaystyle e\geq rank(W)} , the distance function D W {\displaystyle D_{W}} can be rewritten equivalently D W ( x 1 , x 2 ) 2 = ( x 1 − x 2 ) ⊤ L ⊤ L ( x 1 − x 2 ) = ‖ L ( x 1 − x 2 ) ‖ 2 2 {\displaystyle D_{W}(x_{1},x_{2})^{2}=(x_{1}-x_{2})^{\top }L^{\top }L(x_{1}-x_{2})=\|L(x_{1}-x_{2})\|_{2}^{2}} . The distance D W ( x 1 , x 2 ) 2 = ‖ x 1 ′ − x 2 ′ ‖ 2 2 {\displaystyle D_{W}(x_{1},x_{2})^{2}=\|x_{1}'-x_{2}'\|_{2}^{2}} corresponds to the Euclidean distance between the transformed feature vectors x 1 ′ = L x 1 {\displaystyle x_{1}'=Lx_{1}} and x 2 ′ = L x 2 {\displaystyle x_{2}'=Lx_{2}} . Many formulations for metric learning have been proposed. Some well-known approaches for metric learning include learning from relative comparisons, which is based on the triplet loss, large margin nearest neighbor, and information theoretic metric learning (ITML). In statistics, the covariance matrix of the data is sometimes used to define a distance metric called Mahalanobis distance. == Applications == Similarity learning is used in information retrieval for learning to rank, in face verification or face identification, and in recommendation systems. Also, many machine learning approaches rely on some metric. This includes unsupervised learning such as clustering, which groups together close or similar objects. It also includes supervised approaches like K-nearest neighbor algorithm which rely on labels of nearby objects to decide on the label of a new object. Metric learning has been proposed as a preprocessing step for many of these approaches. == Scalability == Metric and similarity learning scale quadratically with the dimension of the input space, as can easily see when the learned metric has a bilinear form f W ( x , z ) = x T W z {\displaystyle f_{W}(x,z)=x^{T}Wz} . Scaling to higher dimensions can be achieved by enforcing a sparseness structure over the matrix model, as done with HDSL, and with COMET. == Software == metric-learn is a free software Python library which offers efficient implementations of several supervised and weakly-supervised similarity and metric learning algorithms. The API of metric-learn is compatible with scikit-learn. OpenMetricLearning is a Python framework to train and validate the models producing high-quality embeddings. == Further information == For further information on this topic, see the surveys on metric and similarity learning by Bellet et al. and Kulis.

    Read more →
  • Discovery system (artificial intelligence)

    Discovery system (artificial intelligence)

    A discovery system is an artificial intelligence system that attempts to discover new scientific concepts or laws. The aim of discovery systems is to automate scientific data analysis and the scientific discovery process. Ideally, an artificial intelligence system should be able to search systematically through the space of all possible hypotheses and yield the hypothesis - or set of equally likely hypotheses - that best describes the complex patterns in data. During the era known as the second AI summer (approximately 1978–1987), various systems akin to the era's dominant expert systems were developed to tackle the problem of extracting scientific hypotheses from data, with or without interacting with a human scientist. These systems included Autoclass, Automated Mathematician, Eurisko, which aimed at general-purpose hypothesis discovery, and more specific systems such as Dalton, which uncovers molecular properties from data. The dream of building systems that discover scientific hypotheses was pushed to the background with the second AI winter and the subsequent resurgence of subsymbolic methods such as neural networks. Subsymbolic methods emphasize prediction over explanation, and yield models which works well but are difficult or impossible to explain which has earned them the name black box AI. A black-box model cannot be considered a scientific hypothesis, and this development has even led some researchers to suggest that the traditional aim of science - to uncover hypotheses and theories about the structure of reality - is obsolete. Other researchers disagree and argue that subsymbolic methods are useful in many cases, just not for generating scientific theories. == Discovery systems from the 1970s and 1980s == Autoclass was a Bayesian Classification System written in 1986 Automated Mathematician was one of the earliest successful discovery systems. It was written in 1977 and worked by generating a modifying small Lisp programs Eurisko was a Sequel to Automated Mathematician written in 1984 Dalton is a still maintained program capable of calculating various molecular properties initially launched in 1983 and available in open source since 2017 Glauber is a scientific discovery method written in the context of computational philosophy of science launched in 1983 == Modern discovery systems (2009–present) == After a couple of decades with little interest in discovery systems, the interest in using AI to uncover natural laws and scientific explanations was renewed by the work of Michael Schmidt, then a PhD student in Computational Biology at Cornell University. Schmidt and his advisor, Hod Lipson, invented Eureqa, which they described as a symbolic regression approach to "distilling free-form natural laws from experimental data". This work effectively demonstrated that symbolic regression was a promising way forward for AI-driven scientific discovery. Since 2009, symbolic regression has matured further, and today, various commercial and open source systems are actively used in scientific research. Notable examples include Eureqa, now a part of DataRobot AI Cloud Platform, AI Feynman, and QLattice.

    Read more →
  • Quantum machine learning

    Quantum machine learning

    Quantum machine learning (QML) is the study of quantum algorithms for machine learning. It often refers to quantum algorithms for machine learning tasks which analyze classical data, sometimes called quantum-enhanced machine learning. QML algorithms use qubits and quantum operations to try to improve the space and time complexity of classical machine learning algorithms. Hybrid QML methods involve both classical and quantum processing, where computationally difficult subroutines are outsourced to a quantum device. These routines can be more complex in nature and executed faster on a quantum computer. Furthermore, quantum algorithms can be used to analyze quantum states instead of classical data. The term "quantum machine learning" is sometimes used to refer classical machine learning methods applied to data generated from quantum experiments (i.e. machine learning of quantum systems), such as learning the phase transitions of a quantum system or creating new quantum experiments. QML also extends to a branch of research that explores methodological and structural similarities between certain physical systems and learning systems, in particular neural networks. For example, some mathematical and numerical techniques from quantum physics are applicable to classical deep learning and vice versa. Furthermore, researchers investigate more abstract notions of learning theory with respect to quantum information, sometimes referred to as "quantum learning theory". == Machine learning with quantum computers == Quantum-enhanced machine learning refers to quantum algorithms that solve tasks in machine learning, thereby improving and often expediting classical machine learning techniques. Such algorithms typically require one to encode the given classical data set into a quantum computer to make it accessible for quantum information processing. Subsequently, quantum information processing routines are applied and the result of the quantum computation is read out by measuring the quantum system. For example, the outcome of the measurement of a qubit reveals the result of a binary classification task. While many proposals of QML algorithms are still purely theoretical and require a full-scale universal quantum computer to be tested, others have been implemented on small-scale or special purpose quantum devices. === Quantum associative memories and quantum pattern recognition === Early work on quantum associative memories has been done by Dan Ventura and Tony Martinez and by Carlo A. Trugenberger in the late 1990s and early 2000s. Associative (or content-addressable) memories are able to recognize stored content on the basis of a similarity measure, while random access memories are accessed by the address of stored information and not its content. As such they must be able to retrieve both incomplete and corrupted patterns, the essential machine learning task of pattern recognition. Typical classical associative memories store p patterns in the O ( n 2 ) {\displaystyle O(n^{2})} interactions (synapses) of a real, symmetric energy matrix over a network of n artificial neurons. The encoding is such that the desired patterns are local minima of the energy functional and retrieval is done by minimizing the total energy, starting from an initial configuration. Unfortunately, classical associative memories are severely limited by the phenomenon of cross-talk. When too many patterns are stored, spurious memories appear which quickly proliferate, so that the energy landscape becomes disordered and no retrieval is anymore possible. The number of storable patterns is typically limited by a linear function of the number of neurons, p ≤ O ( n ) {\displaystyle p\leq O(n)} . Quantum associative memories (in their simplest realization) store patterns in a unitary matrix U acting on the Hilbert space of n qubits. Retrieval is realized by the unitary evolution of a fixed initial state to a quantum superposition of the desired patterns with probability distribution peaked on the most similar pattern to an input. By its very quantum nature, the retrieval process is thus probabilistic. Because quantum associative memories are free from cross-talk, however, spurious memories are never generated. Correspondingly, they have a superior capacity than classical ones. The number of parameters in the unitary matrix U is O ( p n ) {\displaystyle O(pn)} . One can thus have efficient, spurious-memory-free quantum associative memories for any polynomial number of patterns. If the matrix U is encoded as a unique operator (as opposed as to a sequence of gates as in the circuit model), e.g. by an optical interferometer, the retrieval becomes efficient even for an exponential number of patterns. === Linear algebra simulation with quantum amplitudes === A number of quantum algorithms for machine learning are based on the idea of amplitude encoding, that is, to associate the amplitudes of a quantum state with the inputs and outputs of computations. Since a state of n {\displaystyle n} qubits is described by 2 n {\displaystyle 2^{n}} complex amplitudes, this information encoding can allow for an exponentially compact representation. Intuitively, this corresponds to associating a discrete probability distribution over binary random variables with a classical vector. The goal of algorithms based on amplitude encoding is to formulate quantum algorithms whose resources grow polynomially in the number of qubits n {\displaystyle n} , which amounts to a logarithmic time complexity in the number of amplitudes and thereby the dimension of the input. Many QML algorithms in this category are based on variations of the quantum algorithm for linear systems of equations (colloquially called HHL, after the paper's authors) which, under specific conditions, performs a matrix inversion using an amount of physical resources growing only logarithmically in the dimensions of the matrix. One of these conditions is that a Hamiltonian which entry-wise corresponds to the matrix can be simulated efficiently, which is known to be possible if the matrix is sparse or low rank. For reference, any known classical algorithm for matrix inversion requires a number of operations that grows more than quadratically in the dimension of the matrix (e.g. O ( n 2.373 ) {\displaystyle O{\mathord {\left(n^{2.373}\right)}}} ), but they are not restricted to sparse matrices. Quantum matrix inversion can be applied to machine learning methods in which the training reduces to solving a linear system of equations, for example in least-squares linear regression, the least-squares version of support vector machines, and Gaussian processes. A crucial bottleneck of methods that simulate linear algebra computations with the amplitudes of quantum states is state preparation, which often requires one to initialise a quantum system in a state whose amplitudes reflect the features of the entire dataset. Although efficient methods for state preparation are known for specific cases, this step easily hides the complexity of the task. === Variational quantum algorithms (VQAs) === In a variational quantum algorithm, a classical computer optimizes the parameters used to prepare a quantum state, while a quantum computer is used to do the actual state preparation and measurement. VQAs are considered promising candidates for noisy intermediate-scale quantum computers. Variational quantum circuits (or parameterized quantum circuits) are a popular class of VQAs where the parameters are those used in a fixed quantum circuit. Researchers have studied VQCs to solve optimization problems and find the ground state energy of complex quantum systems, which were difficult to solve using a classical computer. === Quantum binary classifier === Pattern reorganization is one of the important tasks of machine learning, binary classification is one of the tools or algorithms to find patterns. Binary classification is used in supervised learning and in unsupervised learning. In QML, classical bits are converted to qubits and they are mapped to Hilbert space; complex value data are used in a quantum binary classifier to use the advantage of Hilbert space. By exploiting the quantum mechanic properties such as superposition, entanglement, interference the quantum binary classifier produces the accurate result in short period of time. === Quantum machine learning algorithms based on Grover search === Another approach to improving classical machine learning with quantum information processing uses amplitude amplification methods based on Grover's search algorithm, which has been shown to solve unstructured search problems with a quadratic speedup compared to classical algorithms. These quantum routines can be employed for learning algorithms that translate into an unstructured search task, as can be done, for instance, in the case of the k-medians and the k-nearest neighbors algorithms. Other applications include quadratic speedups in the training of perceptrons. An e

    Read more →
  • Spike-and-slab regression

    Spike-and-slab regression

    Spike-and-slab regression is a type of Bayesian linear regression in which a particular hierarchical prior distribution for the regression coefficients is chosen such that only a subset of the possible regressors is retained. The technique is particularly useful when the number of possible predictors is larger than the number of observations. The idea of the spike-and-slab model was originally proposed by Mitchell & Beauchamp (1988). The approach was further significantly developed by Madigan & Raftery (1994) and George & McCulloch (1997). A recent and important contribution to this literature is Ishwaran & Rao (2005). == Model description == Suppose we have P possible predictors in some model. Vector γ has a length equal to P and consists of zeros and ones. This vector indicates whether a particular variable is included in the regression or not. If no specific prior information on initial inclusion probabilities of particular variables is available, a Bernoulli prior distribution is a common default choice. Conditional on a predictor being in the regression, we identify a prior distribution for the model coefficient, which corresponds to that variable (β). A common choice on that step is to use a normal prior with a mean equal to zero and a large variance calculated based on ( X T X ) − 1 {\displaystyle (X^{T}X)^{-1}} (where X {\displaystyle X} is a design matrix of explanatory variables of the model). A draw of γ from its prior distribution is a list of the variables included in the regression. Conditional on this set of selected variables, we take a draw from the prior distribution of the regression coefficients (if γi = 1 then βi ≠ 0 and if γi = 0 then βi = 0). βγ denotes the subset of β for which γi = 1. In the next step, we calculate a posterior probability for both inclusion and coefficients by applying a standard statistical procedure. All steps of the described algorithm are repeated thousands of times using the Markov chain Monte Carlo (MCMC) technique. As a result, we obtain a posterior distribution of γ (variable inclusion in the model), β (regression coefficient values) and the corresponding prediction of y. The model got its name (spike-and-slab) due to the shape of the two prior distributions. The "spike" is the probability of a particular coefficient in the model to be zero. The "slab" is the prior distribution for the regression coefficient values. An advantage of Bayesian variable selection techniques is that they are able to make use of prior knowledge about the model. In the absence of such knowledge, some reasonable default values can be used; to quote Scott and Varian (2013): "For the analyst who prefers simplicity at the cost of some reasonable assumptions, useful prior information can be reduced to an expected model size, an expected R2, and a sample size ν determining the weight given to the guess at R2." Some researchers suggest the following default values: R2 = 0.5, ν = 0.01, and π = 0.5 (parameter of a prior Bernoulli distribution).

    Read more →
  • Fatsecret

    Fatsecret

    Fatsecret, commonly styled as fatsecret, is a mobile application, website and API that helps people achieve their weight loss goals and find accurate nutrition information. It also offers a weight loss clinic with coaching and medically supported programs. The platform powers global health apps. == History == Fatsecret was founded in 2006 in Melbourne, Australia by Lenny Moses and Rodney Moses. As of 2019, Lenny serves as the company's CEO. The company is known for its calorie counting and meal tracking app, and by April 2016, the company claimed to have 45 million users of its services. In August 2018, a premium version of its app was released. Since August 2009, the company has operated the Fatsecret Platform API, which allows access to its global food and nutrition database. Fatsecret reportedly had 900,000 downloads of its app in January 2020. In an analysis of several Health & Fitness app subcategories for the United States in January 2021, Fatsecret was reported to have the highest 30 day user retention rate of top Calorie Counter + Meal Planner for Weight Loss apps.

    Read more →
  • Empirical risk minimization

    Empirical risk minimization

    In statistical learning theory, the principle of empirical risk minimization defines a family of learning algorithms based on evaluating performance over a known and fixed dataset. The core idea is based on an application of the law of large numbers; more specifically, we cannot know exactly how well a predictive algorithm will work in practice (i.e. the "true risk") because we do not know the true distribution of the data, but we can instead estimate and optimize the performance of the algorithm on a known set of training data. The performance over the known set of training data is referred to as the "empirical risk". == Background == The following situation is a general setting of many supervised learning problems. There are two spaces of objects X {\displaystyle X} and Y {\displaystyle Y} and we would like to learn a function h : X → Y {\displaystyle \ h:X\to Y} (often called hypothesis) which outputs an object y ∈ Y {\displaystyle y\in Y} , given x ∈ X {\displaystyle x\in X} . To do so, there is a training set of n {\displaystyle n} examples ( x 1 , y 1 ) , … , ( x n , y n ) {\displaystyle \ (x_{1},y_{1}),\ldots ,(x_{n},y_{n})} where x i ∈ X {\displaystyle x_{i}\in X} is an input and y i ∈ Y {\displaystyle y_{i}\in Y} is the corresponding response that is desired from h ( x i ) {\displaystyle h(x_{i})} . To put it more formally, assuming that there is a joint probability distribution P ( x , y ) {\displaystyle P(x,y)} over X {\displaystyle X} and Y {\displaystyle Y} , and that the training set consists of n {\displaystyle n} instances ( x 1 , y 1 ) , … , ( x n , y n ) {\displaystyle \ (x_{1},y_{1}),\ldots ,(x_{n},y_{n})} drawn i.i.d. from P ( x , y ) {\displaystyle P(x,y)} . The assumption of a joint probability distribution allows for the modelling of uncertainty in predictions (e.g. from noise in data) because y {\displaystyle y} is not a deterministic function of x {\displaystyle x} , but rather a random variable with conditional distribution P ( y | x ) {\displaystyle P(y|x)} for a fixed x {\displaystyle x} . It is also assumed that there is a non-negative real-valued loss function L ( y ^ , y ) {\displaystyle L({\hat {y}},y)} which measures how different the prediction y ^ {\displaystyle {\hat {y}}} of a hypothesis is from the true outcome y {\displaystyle y} . For classification tasks, these loss functions can be scoring rules. The risk associated with hypothesis h ( x ) {\displaystyle h(x)} is then defined as the expectation of the loss function: R ( h ) = E [ L ( h ( x ) , y ) ] = ∫ L ( h ( x ) , y ) d P ( x , y ) . {\displaystyle R(h)=\mathbf {E} [L(h(x),y)]=\int L(h(x),y)\,dP(x,y).} A loss function commonly used in theory is the 0-1 loss function: L ( y ^ , y ) = { 1 if y ^ ≠ y 0 if y ^ = y {\displaystyle L({\hat {y}},y)={\begin{cases}1&{\mbox{ if }}\quad {\hat {y}}\neq y\\0&{\mbox{ if }}\quad {\hat {y}}=y\end{cases}}} . The ultimate goal of a learning algorithm is to find a hypothesis h ∗ {\displaystyle h^{}} among a fixed class of functions H {\displaystyle {\mathcal {H}}} for which the risk R ( h ) {\displaystyle R(h)} is minimal: h ∗ = a r g m i n h ∈ H R ( h ) . {\displaystyle h^{}={\underset {h\in {\mathcal {H}}}{\operatorname {arg\,min} }}\,{R(h)}.} For classification problems, the Bayes classifier is defined to be the classifier minimizing the risk defined with the 0–1 loss function. == Formal definition == In general, the risk R ( h ) {\displaystyle R(h)} cannot be computed because the distribution P ( x , y ) {\displaystyle P(x,y)} is unknown to the learning algorithm. However, given a sample of iid training data points, we can compute an estimate, called the empirical risk, by computing the average of the loss function over the training set; more formally, computing the expectation with respect to the empirical measure: R emp ( h ) = 1 n ∑ i = 1 n L ( h ( x i ) , y i ) . {\displaystyle \!R_{\text{emp}}(h)={\frac {1}{n}}\sum _{i=1}^{n}L(h(x_{i}),y_{i}).} The empirical risk minimization principle states that the learning algorithm should choose a hypothesis h ^ {\displaystyle {\hat {h}}} which minimizes the empirical risk over the hypothesis class H {\displaystyle {\mathcal {H}}} : h ^ = a r g m i n h ∈ H R emp ( h ) . {\displaystyle {\hat {h}}={\underset {h\in {\mathcal {H}}}{\operatorname {arg\,min} }}\,R_{\text{emp}}(h).} Thus, the learning algorithm defined by the empirical risk minimization principle consists in solving the above optimization problem. == Properties == Guarantees for the performance of empirical risk minimization depend strongly on the function class selected as well as the distributional assumptions made. In general, distribution-free methods are too coarse, and do not lead to practical bounds. However, they are still useful in deriving asymptotic properties of learning algorithms, such as consistency. In particular, distribution-free bounds on the performance of empirical risk minimization given a fixed function class can be derived using bounds on the VC complexity of the function class. For simplicity, considering the case of binary classification tasks, it is possible to bound the probability of the selected classifier, ϕ n {\displaystyle \phi _{n}} being much worse than the best possible classifier ϕ ∗ {\displaystyle \phi ^{}} . Consider the risk L {\displaystyle L} defined over the hypothesis class C {\displaystyle {\mathcal {C}}} with growth function S ( C , n ) {\displaystyle {\mathcal {S}}({\mathcal {C}},n)} given a dataset of size n {\displaystyle n} . Then, for every ϵ > 0 {\displaystyle \epsilon >0} : P ( L ( ϕ n ) − L ( ϕ ∗ ) > ϵ ) ≤ 8 S ( C , n ) exp ⁡ { − n ϵ 2 / 32 } {\displaystyle \mathbb {P} \left(L(\phi _{n})-L(\phi ^{})>\epsilon \right)\leq {\mathcal {8}}S({\mathcal {C}},n)\exp\{-n\epsilon ^{2}/32\}} Similar results hold for regression tasks. These results are often based on uniform laws of large numbers, which control the deviation of the empirical risk from the true risk, uniformly over the hypothesis class. === Impossibility results === It is also possible to show lower bounds on algorithm performance if no distributional assumptions are made. This is sometimes referred to as the No free lunch theorem. Even though a specific learning algorithm may provide the asymptotically optimal performance for any distribution, the finite sample performance is always poor for at least one data distribution. This means that no classifier can improve on the error for a given sample size for all distributions. Specifically, let ϵ > 0 {\displaystyle \epsilon >0} and consider a sample size n {\displaystyle n} and classification rule ϕ n {\displaystyle \phi _{n}} , there exists a distribution of ( X , Y ) {\displaystyle (X,Y)} with risk L ∗ = 0 {\displaystyle L^{}=0} (meaning that perfect prediction is possible) such that: E L n ≥ 1 / 2 − ϵ . {\displaystyle \mathbb {E} L_{n}\geq 1/2-\epsilon .} It is further possible to show that the convergence rate of a learning algorithm is poor for some distributions. Specifically, given a sequence of decreasing positive numbers a i {\displaystyle a_{i}} converging to zero, it is possible to find a distribution such that: E L n ≥ a i {\displaystyle \mathbb {E} L_{n}\geq a_{i}} for all n {\displaystyle n} . This result shows that universally good classification rules do not exist, in the sense that the rule must be low quality for at least one distribution. === Computational complexity === Empirical risk minimization for a classification problem with a 0-1 loss function is known to be an NP-hard problem even for a relatively simple class of functions such as linear classifiers. Nevertheless, it can be solved efficiently when the minimal empirical risk is zero, i.e., data is linearly separable. In practice, machine learning algorithms cope with this issue either by employing a convex approximation to the 0–1 loss function (like hinge loss for SVM), which is easier to optimize, or by imposing assumptions on the distribution P ( x , y ) {\displaystyle P(x,y)} (and thus stop being agnostic learning algorithms to which the above result applies). In the case of convexification, Zhang's lemma majors the excess risk of the original problem using the excess risk of the convexified problem. Minimizing the latter using convex optimization also allow to control the former. == Tilted empirical risk minimization == Tilted empirical risk minimization is a machine learning technique used to modify standard loss functions like squared error, by introducing a tilt parameter. This parameter dynamically adjusts the weight of data points during training, allowing the algorithm to focus on specific regions or characteristics of the data distribution. Tilted empirical risk minimization is particularly useful in scenarios with imbalanced data or when there is a need to emphasize errors in certain parts of the prediction space.

    Read more →
  • Google Clips

    Google Clips

    Google Clips is a discontinued miniature clip-on camera device developed by Google. == History == It was announced on October 4, 2017 and went on sale on January 27, 2018. Google Clips automatically captured video clips (without audio) at moments its machine learning algorithms determined to be interesting or relevant. An indicator flashed when the camera was looking for scenes to capture. Google Clips' artificial intelligence (AI) could learn the faces of people to take photographs with certain people, and could automatically set lighting and framing. It had 16 GB of storage built-in storage and could record clips for up to 3 hours. This camera was originally priced at US$249 in the United States. It was withdrawn from sale on October 15, 2019, but supported until the end of December 2021. == Reception == The Independent wrote that Google Clips is "an impressive little device, but one that also has the potential to feel very creepy." According to The Verge's generally negative review, "it didn't capture anything special" over two weeks of testing.

    Read more →