An automated storage and retrieval system (ASRS or AS/RS) consists of a variety of computer-controlled systems for automatically placing and retrieving loads from defined storage locations. Automated storage and retrieval systems (AS/RS) are typically used in applications where: There is a very high volume of loads being moved into and out of storage Storage density is important because of space constraints No value is added in this process (no processing, only storage and transport) Accuracy is critical because of potential expensive damages to the load An AS/RS can be used with standard loads as well as nonstandard loads, meaning that each standard load can fit in a uniformly-sized volume; for example, the film canisters in the image of the Defense Visual Information Center are each stored as part of the contents of the uniformly sized metal boxes, which are shown in the image. Standard loads simplify the handling of a request of an item. In addition, audits of the accuracy of the inventory of contents can be restricted to the contents of an individual metal box, rather than undergoing a top-to-bottom search of the entire facility, for a single item. They can also be used in self storage places. == Overview == AS/RS systems are designed for automated storage and retrieval of parts and items in manufacturing, distribution, retail, wholesale and institutions. They first originated in the 1960s, initially focusing on heavy pallet loads but with the evolution of the technology the handled loads have become smaller. The systems operate under computerized control, maintaining an inventory of stored items. Retrieval of items is accomplished by specifying the item type and quantity to be retrieved. The computer determines where in the storage area the item can be retrieved from and schedules the retrieval. It directs the proper automated storage and retrieval machine (SRM) to the location where the item is stored and directs the machine to deposit the item at a location where it is to be picked up. A system of conveyors and or automated guided vehicles is sometimes part of the AS/RS system. These take loads into and out of the storage area and move them to the manufacturing floor or loading docks. To store items, the pallet or tray is placed at an input station for the system, the information for inventory is entered into a computer terminal and the AS/RS system moves the load to the storage area, determines a suitable location for the item, and stores the load. As items are stored into or retrieved from the racks, the computer updates its inventory accordingly. The benefits of an AS/RS system include reduced labor for transporting items into and out of inventory, reduced inventory levels, more accurate tracking of inventory, and space savings. Items are often stored more densely than in systems where items are stored and retrieved manually. Within the storage, items can be placed on trays or hang from bars, which are attached to chains/drives in order to move up and down. The equipment required for an AS/RS include a storage & retrieval machine (SRM) that is used for rapid storage and retrieval of material. SRMs are used to move loads vertically or horizontally, and can also move laterally to place objects in the correct storage location. The trend towards Just In Time production often requires sub-pallet level availability of production inputs, and AS/RS is a much faster way of organizing the storage of smaller items next to production lines. The Material Handling Institute of America (MHIA), the non-profit trade association for the material handling world, and its members have categorised AS/RS into two primary segments: Fixed Aisle and Carousels/Vertical Lift Modules (VLMs). Both sets of technologies provide automated storage and retrieval for parts and items, but use different technologies. Each technology has its unique set of benefits and disadvantages. Fixed Aisle systems are characteristically larger systems whereas carousels and Vertical Lift Modules are used individually or grouped, but in small to medium-sized applications. A fixed-aisle AS/R machine (stacker crane) is one of two main designs: single-masted or double masted. Most are supported on a track and ceiling guided at the top by guide rails or channels to ensure accurate vertical alignment, although some are suspended from the ceiling. The 'shuttles' that make up the system travel between fixed storage shelves to deposit or retrieve a requested load (ranging from a single book in a library system to a several ton pallet of goods in a warehouse system). The entire unit moves horizontally within an aisle, while the shuttles are able to elevate up to the necessary height to reach the load, and can extend and retract to store or retrieve loads that are several positions deep in the shelving. A semi-automated system can be achieved by utilizing only specialized shuttles within an existing rack system. Another AS/RS technology is known as shuttle technology. In this technology the horizontal movement is made by independent shuttles each operating on one level of the rack while a lift at a fixed position within the rack is responsible for the vertical movement. By using two separate machines for these two axes the shuttle technology is able to provide higher throughput rates than stacker cranes. Storage and Retrieval Machines pick up or drop off loads to the rest of the supporting transportation system at specific stations, where inbound and outbound loads are precisely positioned for proper handling. In addition, there are several types of Automated Storage & Retrieval Systems (AS/RS) devices called Unit-load AS/RS, Mini-load AS/RS, Mid-Load AS/RS, Vertical Lift Modules (VLMs), Horizontal Carousels and Vertical Carousels. These systems are used either as stand-alone units or in integrated workstations called pods or systems. These units are usually integrated with various types of pick to light systems and use either a microprocessor controller for basic usage or inventory management software. These systems are ideal for increasing space utilization up to 90%, productivity levels by 90%, accuracy to 99.9%+ levels and throughput up to 750 lines per hour/per operator or more depending on the configuration of the system. == Horizontal carousels == Robotic Inserter/Extractor devices can be used for horizontal carousels. The robotic device is positioned in the front or rear of up to three horizontal carousels tiered high. The robot grabs the tote required in the order and often replenishes at the same time to speed up throughput. The tote(s) are then delivered to a conveyor, which routes it to a work station for picking or replenishing. Up to eight transactions per minute per unit can be done. Totes or containers up to 36" x 36" x 36" can be used in a system. On a simplistic level, horizontal carousels are also often used as "rotating shelving". With simple "fetch" command, items are brought to the operator and otherwise wasted space is eliminated. AS/RS Applications: Most applications of AS/RS technology have been associated with warehousing and distribution operations. An AS/RS can also be used to store raw materials and work in process in manufacturing. Three application areas can be distinguished for AS/RS: (1) Unit load storage and handling, (2) Order picking, and (3) Work in process storage. Unit load storage and retrieval applications are represented by unit load AS/RS and deep-lane storage systems. These kinds of applications are commonly found in warehousing for finishing goods in a distribution center, rarely in manufacturing. Deep-lane systems are used in the food industry. As described above, order picking involves retrieving materials in less than full unit load quantities. Minilpass, man-on board, and items retrieval systems are used for this second application area. Work in process storage is a more recent application of automated storage technology. While it is desirable to minimize the amount of work in process, WIP is unavoidable and must be effectively managed. Automated storage systems, either automated storage/retrieval systems or carousel systems, represent an efficient way to store materials between processing steps, particularly in batch and job shop production. In high production, work in process is often carried between operations by conveyor system, which this serve both storage and transport functions. === Inventory Category-specific AS/RS === Each inventory category—raw materials, work-in-process, and finished goods—requires its own specialized Automated Storage and Retrieval System (AS/RS). Particularly for work-in-process (WIP) inventories, due to variations in manufacturing processes, the AS/RS systems are significantly different in design and function, tailored specifically to match unique handling, storage, and retrieval requirements === Installed applications === Installed applications of this technology can be wide-ranging. In some librarie
Neuro-symbolic AI
Neuro-symbolic AI is a subfield of artificial intelligence that integrates neural methods (e.g., neural networks and deep learning) with symbolic methods (e.g., formal logic, knowledge representation, and automated reasoning). The goal is to combine the strengths of both approaches, resulting in AI systems that can be trained from raw data and demonstrate robustness against outliers or errors in the base data, while preserving explainability, explicit use of expert knowledge, and explicit cognitive reasoning. As argued by Leslie Valiant and others, the effective construction of rich computational cognitive models demands the combination of symbolic reasoning and efficient machine learning. Gary Marcus argued, "We cannot construct rich cognitive models in an adequate, automated way without the triumvirate of hybrid architecture, rich prior knowledge, and sophisticated techniques for reasoning." Further, "To build a robust, knowledge-driven approach to AI we must have the machinery of symbol manipulation in our toolkit. Too much of useful knowledge is abstract to make do without tools that represent and manipulate abstraction, and to date, the only known machinery that can manipulate such abstract knowledge reliably is the apparatus of symbol manipulation." Angelo Dalli, Henry Kautz, Francesca Rossi, and Bart Selman also argued for such a synthesis. Their arguments attempt to address the two kinds of thinking, as discussed in Daniel Kahneman's book Thinking, Fast and Slow. It describes cognition as encompassing two components: System 1 is fast, reflexive, intuitive, and unconscious. System 2 is slower, step-by-step, and explicit. System 1 is used for pattern recognition. System 2 handles planning, deduction, and deliberative thinking. In this view, deep learning best handles the first kind of cognition, while symbolic reasoning best handles the second kind. Both are necessary for the development of a robust and reliable AI system capable of learning, reasoning, and interacting with humans to accept advice and answer questions. Since the 1990s, dual-process models with explicit references to the two contrasting systems have been the focus of research in both the fields of AI and cognitive science by numerous researchers. In 2025, the adoption of neurosymbolic AI, an approach that integrates neural networks with symbolic reasoning, increased in response to the need to address hallucination issues in large language models. For example, Amazon implemented Neurosymbolic AI in its Vulcan warehouse robots and Rufus shopping assistant to enhance accuracy and decision-making. == Approaches == Approaches for integration are diverse. Henry Kautz's taxonomy of neuro-symbolic architectures follows, along with some examples: Symbolic Neural symbolic is the current approach of many neural models in natural language processing, where words or subword tokens are the ultimate input and output of large language models. Examples include BERT, RoBERTa, and GPT-3. Symbolic[Neural] is exemplified by AlphaGo, where symbolic techniques are used to invoke neural techniques. In this case, the symbolic approach is Monte Carlo tree search and the neural techniques learn how to evaluate game positions. Neural | Symbolic uses a neural architecture to interpret perceptual data as symbols and relationships that are reasoned about symbolically. Neural-Concept Learner is an example. Neural: Symbolic → Neural relies on symbolic reasoning to generate or label training data that is subsequently learned by a deep learning model, e.g., to train a neural model for symbolic computation by using a Macsyma-like symbolic mathematics system to create or label examples. NeuralSymbolic uses a neural net that is generated from symbolic rules. An example is the Neural Theorem Prover, which constructs a neural network from an AND-OR proof tree generated from knowledge base rules and terms. Logic Tensor Networks also fall into this category. Neural[Symbolic] according to Kautz, this approach embeds true symbolic reasoning inside a neural network. These are tightly-coupled neural-symbolic systems, in which the logical inference rules are internal to the neural network. This way, the neural network internally computes the inference from the premises and learns to reason based on logical inference systems. Early work on connectionist modal and temporal logics by Garcez, Lamb, and Gabbay is aligned with this approach. These categories are not exhaustive, as they do not consider multi-agent systems. In 2005, Bader and Hitzler presented a more fine-grained categorization that took into account, e.g., whether the use of symbols included logic and, if so, whether the logic was propositional or first-order logic. The 2005 categorization and Kautz's taxonomy above are compared and contrasted in a 2021 article. Sepp Hochreiter argued that Graph Neural Networks "...are the predominant models of neural-symbolic computing" since "[t]hey describe the properties of molecules, simulate social networks, or predict future states in physical and engineering applications with particle-particle interactions." == Artificial general intelligence == Gary Marcus argues that "...hybrid architectures that combine learning and symbol manipulation are necessary for robust intelligence, but not sufficient", and that there are ...four cognitive prerequisites for building robust artificial intelligence: hybrid architectures that combine large-scale learning with the representational and computational powers of symbol manipulation, large-scale knowledge bases—likely leveraging innate frameworks—that incorporate symbolic knowledge along with other forms of knowledge, reasoning mechanisms capable of leveraging those knowledge bases in tractable ways, and rich cognitive models that work together with those mechanisms and knowledge bases. This echoes earlier calls for hybrid models as early as the 1990s. == History == Garcez and Lamb described research in this area as ongoing, at least since the 1990s. During that period, the terms symbolic and sub-symbolic AI were popular. A series of workshops on neuro-symbolic AI has been held annually since 2005 Neuro-Symbolic Artificial Intelligence. In the early 1990s, an initial set of workshops on this topic were organized. == Research == Key research questions remain, such as: What is the best way to integrate neural and symbolic architectures? How should symbolic structures be represented within neural networks and extracted from them? How should common-sense knowledge be learned and reasoned about? How can abstract knowledge that is hard to encode logically be handled? == Implementations == Implementations of neuro-symbolic approaches include: AllegroGraph: an integrated Knowledge Graph based platform for neuro-symbolic application development. Scallop: a language based on Datalog that supports differentiable logical and relational reasoning. Scallop can be integrated in Python and with a PyTorch learning module. Logic Tensor Networks: encode logical formulas as neural networks and simultaneously learn term encodings, term weights, and formula weights. DeepProbLog: combines neural networks with the probabilistic reasoning of ProbLog. Abductive Learning: integrates machine learning and logical reasoning in a balanced-loop via abductive reasoning, enabling them to work together in a mutually beneficial way. SymbolicAI: a compositional differentiable programming library.
Distributional Soft Actor Critic
Distributional Soft Actor Critic (DSAC) is a suite of model-free off-policy reinforcement learning algorithms, tailored for learning decision-making or control policies in complex systems with continuous action spaces. Distinct from traditional methods that focus solely on expected returns, DSAC algorithms are designed to learn a Gaussian distribution over stochastic returns, called value distribution. This focus on Gaussian value distribution learning notably diminishes value overestimations, which in turn boosts policy performance. Additionally, the value distribution learned by DSAC can also be used for risk-aware policy learning. From a technical standpoint, DSAC is essentially a distributional adaptation of the well-established soft actor-critic (SAC) method. To date, the DSAC family comprises two iterations: the original DSAC-v1 and its successor, DSAC-T (also known as DSAC-v2), with the latter demonstrating superior capabilities over the Soft Actor-Critic (SAC) in Mujoco benchmark tasks. The source code for DSAC-T can be found at the following URL: Jingliang-Duan/DSAC-T. Both iterations have been integrated into an advanced, Pytorch-powered reinforcement learning toolkit named GOPS: GOPS (General Optimal control Problem Solver).
Expectation–maximization algorithm
In statistics, an expectation–maximization (EM) algorithm is an iterative method to find (local) maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter-estimates are then used to determine the distribution of the latent variables in the next E step. It can be used, for example, to estimate a mixture of gaussians, or to solve the multiple linear regression problem. == History == The EM algorithm was explained and given its name in a classic 1977 paper by Arthur Dempster, Nan Laird, and Donald Rubin. They pointed out that the method had been "proposed many times in special circumstances" by earlier authors. One of the earliest is the gene-counting method for estimating allele frequencies by Cedric Smith. Another was proposed by H.O. Hartley in 1958, and Hartley and Hocking in 1977, from which many of the ideas in the Dempster–Laird–Rubin paper originated. Another one by S.K Ng, Thriyambakam Krishnan and G.J McLachlan in 1977. Hartley's ideas can be broadened to any grouped discrete distribution. A very detailed treatment of the EM method for exponential families was published by Rolf Sundberg in his thesis and several papers, following his collaboration with Per Martin-Löf and Anders Martin-Löf. The Dempster–Laird–Rubin paper in 1977 generalized the method and sketched a convergence analysis for a wider class of problems. The Dempster–Laird–Rubin paper established the EM method as an important tool of statistical analysis. See also Meng and van Dyk (1997). The convergence analysis of the Dempster–Laird–Rubin algorithm was flawed and a correct convergence analysis was published by C. F. Jeff Wu in 1983. Wu's proof established the EM method's convergence also outside of the exponential family, as claimed by Dempster–Laird–Rubin. == Introduction == The EM algorithm is used to find (local) maximum likelihood parameters of a statistical model in cases where the equations cannot be solved directly. Typically these models involve latent variables in addition to unknown parameters and known data observations. That is, either missing values exist among the data, or the model can be formulated more simply by assuming the existence of further unobserved data points. For example, a mixture model can be described more simply by assuming that each observed data point has a corresponding unobserved data point, or latent variable, specifying the mixture component to which each data point belongs. Finding a maximum likelihood solution typically requires taking the derivatives of the likelihood function with respect to all the unknown values, the parameters and the latent variables, and simultaneously solving the resulting equations. In statistical models with latent variables, this is usually impossible. Instead, the result is typically a set of interlocking equations in which the solution to the parameters requires the values of the latent variables and vice versa, but substituting one set of equations into the other produces an unsolvable equation. The EM algorithm proceeds from the observation that there is a way to solve these two sets of equations numerically. One can simply pick arbitrary values for one of the two sets of unknowns, use them to estimate the second set, then use these new values to find a better estimate of the first set, and then keep alternating between the two until the resulting values both converge to fixed points. It's not obvious that this will work, but it can be proven in this context. Additionally, it can be proven that the derivative of the likelihood is (arbitrarily close to) zero at that point, which in turn means that the point is either a local maximum or a saddle point. In general, multiple maxima may occur, with no guarantee that the global maximum will be found. Some likelihoods also have singularities in them, i.e., nonsensical maxima. For example, one of the solutions that may be found by EM in a mixture model involves setting one of the components to have zero variance and the mean parameter for the same component to be equal to one of the data points. == Description == === The symbols === Given the statistical model which generates a set X {\displaystyle \mathbf {X} } of observed data, a set of unobserved latent data or missing values Z {\displaystyle \mathbf {Z} } , and a vector of unknown parameters θ {\displaystyle {\boldsymbol {\theta }}} , along with a likelihood function L ( θ ; X , Z ) = p ( X , Z ∣ θ ) {\displaystyle L({\boldsymbol {\theta }};\mathbf {X} ,\mathbf {Z} )=p(\mathbf {X} ,\mathbf {Z} \mid {\boldsymbol {\theta }})} , the maximum likelihood estimate (MLE) of the unknown parameters is determined by maximizing the marginal likelihood of the observed data L ( θ ; X ) = p ( X ∣ θ ) = ∫ p ( X , Z ∣ θ ) d Z = ∫ p ( X ∣ Z , θ ) p ( Z ∣ θ ) d Z {\displaystyle {\begin{aligned}L({\boldsymbol {\theta }};\mathbf {X} )=p(\mathbf {X} \mid {\boldsymbol {\theta }})&=\int p(\mathbf {X} ,\mathbf {Z} \mid {\boldsymbol {\theta }})\,d\mathbf {Z} \\&=\int p(\mathbf {X} \mid \mathbf {Z} ,{\boldsymbol {\theta }})p(\mathbf {Z} \mid {\boldsymbol {\theta }})\,d\mathbf {Z} \end{aligned}}} However, this quantity is often intractable since Z {\displaystyle \mathbf {Z} } is unobserved and the distribution of Z {\displaystyle \mathbf {Z} } is unknown before attaining θ {\displaystyle {\boldsymbol {\theta }}} . === The EM algorithm === The EM algorithm seeks to find the maximum likelihood estimate of the marginal likelihood by iteratively applying these two steps: More succinctly, we can write it as one equation: θ ( t + 1 ) = arg max θ E Z ∼ p ( ⋅ | X , θ ( t ) ) [ log p ( X , Z | θ ) ] {\displaystyle {\boldsymbol {\theta }}^{(t+1)}=\mathop {\arg \max } _{\boldsymbol {\theta }}\operatorname {E} _{\mathbf {Z} \sim p(\cdot |\mathbf {X} ,{\boldsymbol {\theta }}^{(t)})}\left[\log p(\mathbf {X} ,\mathbf {Z} |{\boldsymbol {\theta }})\right]\,} === Interpretation of the variables === The typical models to which EM is applied use Z {\displaystyle \mathbf {Z} } as a latent variable indicating membership in one of a set of groups: The observed data points X {\displaystyle \mathbf {X} } may be discrete (taking values in a finite or countably infinite set) or continuous (taking values in an uncountably infinite set). Associated with each data point may be a vector of observations. The missing values (aka latent variables) Z {\displaystyle \mathbf {Z} } are discrete, drawn from a fixed number of values, and with one latent variable per observed unit. The parameters are continuous, and are of two kinds: Parameters that are associated with all data points, and those associated with a specific value of a latent variable (i.e., associated with all data points whose corresponding latent variable has that value). However, it is possible to apply EM to other sorts of models. The motivation is as follows. If the value of the parameters θ {\displaystyle {\boldsymbol {\theta }}} is known, usually the value of the latent variables Z {\displaystyle \mathbf {Z} } can be found by maximizing the log-likelihood over all possible values of Z {\displaystyle \mathbf {Z} } , either simply by iterating over Z {\displaystyle \mathbf {Z} } or through an algorithm such as the Viterbi algorithm for hidden Markov models. Conversely, if we know the value of the latent variables Z {\displaystyle \mathbf {Z} } , we can find an estimate of the parameters θ {\displaystyle {\boldsymbol {\theta }}} fairly easily, typically by simply grouping the observed data points according to the value of the associated latent variable and averaging the values, or some function of the values, of the points in each group. This suggests an iterative algorithm, in the case where both θ {\displaystyle {\boldsymbol {\theta }}} and Z {\displaystyle \mathbf {Z} } are unknown: First, initialize the parameters θ {\displaystyle {\boldsymbol {\theta }}} to some random values. Compute the probability of each possible value of Z {\displaystyle \mathbf {Z} } , given θ {\displaystyle {\boldsymbol {\theta }}} . Then, use the just-computed values of Z {\displaystyle \mathbf {Z} } to compute a better estimate for the parameters θ {\displaystyle {\boldsymbol {\theta }}} . Iterate steps 2 and 3 until convergence. The algorithm as just described monotonically approaches a local minimum of the cost function. == Properties == Although an EM iteration does increase the observed data (i.e., marginal) likelihood function, no guarantee exists that the sequence converges to a maximum likelihood estimator. For multimodal distributions, this means that an EM algorithm may co
Yooreeka
Yooreeka is a library for data mining, machine learning, soft computing, and mathematical analysis. The project started with the code of the book "Algorithms of the Intelligent Web". Although the term "Web" prevailed in the title, in essence, the algorithms are valuable in any software application. It covers all major algorithms and provides many examples. Yooreeka 2.x is licensed under the Apache License rather than the somewhat more restrictive LGPL (which was the license of v1.x). The library is written 100% in the Java language. == Algorithms == The following algorithms are covered: Clustering Hierarchical—Agglomerative (e.g. MST single link; ROCK) and Divisive Partitional (e.g. k-means) Classification Bayesian Decision trees Neural Networks Rule based (via Drools) Recommendations Collaborative filtering Content based Search PageRank DocRank Personalization
Security and Privacy in Computer Systems
Security and Privacy in Computer Systems is a paper by Willis Ware that was first presented to the public at the 1967 Spring Joint Computer Conference. == Significance == Ware's presentation was the first public conference session about information security and privacy in respect of computer systems, especially networked or remotely-accessed ones. The IEEE Annals of the History of Computing said that Ware's 1967 Spring Joint Computer Conference session, together with 1970's Ware report, marked the start of the field of computer security.
Large margin nearest neighbor
Large margin nearest neighbor (LMNN) classification is a statistical machine learning algorithm for metric learning. It learns a pseudometric designed for k-nearest neighbor classification. The algorithm is based on semidefinite programming, a sub-class of convex optimization. The goal of supervised learning (more specifically classification) is to learn a decision rule that can categorize data instances into pre-defined classes. The k-nearest neighbor rule assumes a training data set of labeled instances (i.e. the classes are known). It classifies a new data instance with the class obtained from the majority vote of the k closest (labeled) training instances. Closeness is measured with a pre-defined metric. Large margin nearest neighbors is an algorithm that learns this global (pseudo-)metric in a supervised fashion to improve the classification accuracy of the k-nearest neighbor rule. == Setup == The main intuition behind LMNN is to learn a pseudometric under which all data instances in the training set are surrounded by at least k instances that share the same class label. If this is achieved, the leave-one-out error (a special case of cross validation) is minimized. Let the training data consist of a data set D = { ( x → 1 , y 1 ) , … , ( x → n , y n ) } ⊂ R d × C {\displaystyle D=\{({\vec {x}}_{1},y_{1}),\dots ,({\vec {x}}_{n},y_{n})\}\subset R^{d}\times C} , where the set of possible class categories is C = { 1 , … , c } {\displaystyle C=\{1,\dots ,c\}} . The algorithm learns a pseudometric of the type d ( x → i , x → j ) = ( x → i − x → j ) ⊤ M ( x → i − x → j ) {\displaystyle d({\vec {x}}_{i},{\vec {x}}_{j})=({\vec {x}}_{i}-{\vec {x}}_{j})^{\top }\mathbf {M} ({\vec {x}}_{i}-{\vec {x}}_{j})} . For d ( ⋅ , ⋅ ) {\displaystyle d(\cdot ,\cdot )} to be well defined, the matrix M {\displaystyle \mathbf {M} } needs to be positive semi-definite. The Euclidean metric is a special case, where M {\displaystyle \mathbf {M} } is the identity matrix. This generalization is often (falsely) referred to as Mahalanobis metric. Figure 1 illustrates the effect of the metric under varying M {\displaystyle \mathbf {M} } . The two circles show the set of points with equal distance to the center x → i {\displaystyle {\vec {x}}_{i}} . In the Euclidean case this set is a circle, whereas under the modified (Mahalanobis) metric it becomes an ellipsoid. The algorithm distinguishes between two types of special data points: target neighbors and impostors. === Target neighbors === Target neighbors are selected before learning. Each instance x → i {\displaystyle {\vec {x}}_{i}} has exactly k {\displaystyle k} different target neighbors within D {\displaystyle D} , which all share the same class label y i {\displaystyle y_{i}} . The target neighbors are the data points that should become nearest neighbors under the learned metric. Let us denote the set of target neighbors for a data point x → i {\displaystyle {\vec {x}}_{i}} as N i {\displaystyle N_{i}} . === Impostors === An impostor of a data point x → i {\displaystyle {\vec {x}}_{i}} is another data point x → j {\displaystyle {\vec {x}}_{j}} with a different class label (i.e. y i ≠ y j {\displaystyle y_{i}\neq y_{j}} ) which is one of the nearest neighbors of x → i {\displaystyle {\vec {x}}_{i}} . During learning the algorithm tries to minimize the number of impostors for all data instances in the training set. == Algorithm == Large margin nearest neighbors optimizes the matrix M {\displaystyle \mathbf {M} } with the help of semidefinite programming. The objective is twofold: For every data point x → i {\displaystyle {\vec {x}}_{i}} , the target neighbors should be close and the impostors should be far away. Figure 1 shows the effect of such an optimization on an illustrative example. The learned metric causes the input vector x → i {\displaystyle {\vec {x}}_{i}} to be surrounded by training instances of the same class. If it was a test point, it would be classified correctly under the k = 3 {\displaystyle k=3} nearest neighbor rule. The first optimization goal is achieved by minimizing the average distance between instances and their target neighbors ∑ i , j ∈ N i d ( x → i , x → j ) {\displaystyle \sum _{i,j\in N_{i}}d({\vec {x}}_{i},{\vec {x}}_{j})} . The second goal is achieved by penalizing distances to impostors x → l {\displaystyle {\vec {x}}_{l}} that are less than one unit further away than target neighbors x → j {\displaystyle {\vec {x}}_{j}} (and therefore pushing them out of the local neighborhood of x → i {\displaystyle {\vec {x}}_{i}} ). The resulting value to be minimized can be stated as: ∑ i , j ∈ N i , l , y l ≠ y i [ d ( x → i , x → j ) + 1 − d ( x → i , x → l ) ] + {\displaystyle \sum _{i,j\in N_{i},l,y_{l}\neq y_{i}}[d({\vec {x}}_{i},{\vec {x}}_{j})+1-d({\vec {x}}_{i},{\vec {x}}_{l})]_{+}} With a hinge loss function [ ⋅ ] + = max ( ⋅ , 0 ) {\textstyle [\cdot ]_{+}=\max(\cdot ,0)} , which ensures that impostor proximity is not penalized when outside the margin. The margin of exactly one unit fixes the scale of the matrix M {\displaystyle M} . Any alternative choice c > 0 {\displaystyle c>0} would result in a rescaling of M {\displaystyle M} by a factor of 1 / c {\displaystyle 1/c} . The final optimization problem becomes: min M ∑ i , j ∈ N i d ( x → i , x → j ) + λ ∑ i , j , l ξ i j l {\displaystyle \min _{\mathbf {M} }\sum _{i,j\in N_{i}}d({\vec {x}}_{i},{\vec {x}}_{j})+\lambda \sum _{i,j,l}\xi _{ijl}} ∀ i , j ∈ N i , l , y l ≠ y i {\displaystyle \forall _{i,j\in N_{i},l,y_{l}\neq y_{i}}} d ( x → i , x → j ) + 1 − d ( x → i , x → l ) ≤ ξ i j l {\displaystyle d({\vec {x}}_{i},{\vec {x}}_{j})+1-d({\vec {x}}_{i},{\vec {x}}_{l})\leq \xi _{ijl}} ξ i j l ≥ 0 {\displaystyle \xi _{ijl}\geq 0} M ⪰ 0 {\displaystyle \mathbf {M} \succeq 0} The hyperparameter λ > 0 {\textstyle \lambda >0} is some positive constant (typically set through cross-validation). Here the variables ξ i j l {\displaystyle \xi _{ijl}} (together with two types of constraints) replace the term in the cost function. They play a role similar to slack variables to absorb the extent of violations of the impostor constraints. The last constraint ensures that M {\displaystyle \mathbf {M} } is positive semi-definite. The optimization problem is an instance of semidefinite programming (SDP). Although SDPs tend to suffer from high computational complexity, this particular SDP instance can be solved very efficiently due to the underlying geometric properties of the problem. In particular, most impostor constraints are naturally satisfied and do not need to be enforced during runtime (i.e. the set of variables ξ i j l {\displaystyle \xi _{ijl}} is sparse). A particularly well suited solver technique is the working set method, which keeps a small set of constraints that are actively enforced and monitors the remaining (likely satisfied) constraints only occasionally to ensure correctness. == Extensions and efficient solvers == LMNN was extended to multiple local metrics in the 2008 paper. This extension significantly improves the classification error, but involves a more expensive optimization problem. In their 2009 publication in the Journal of Machine Learning Research, Weinberger and Saul derive an efficient solver for the semi-definite program. It can learn a metric for the MNIST handwritten digit data set in several hours, involving billions of pairwise constraints. An open source Matlab implementation is freely available at the authors web page. Kumal et al. extended the algorithm to incorporate local invariances to multivariate polynomial transformations and improved regularization.