AI For Psychology Students

AI For Psychology Students — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Hierarchical navigable small world

    Hierarchical navigable small world

    Hierarchical navigable small world (HNSW) is an algorithm for approximate nearest neighbor search. It is used to find items that are similar to a query item in a large collection, without comparing the query with every item one by one. The algorithm is commonly used for searching vector data. In these systems, an item such as a document, image, song, or user profile is represented by a list of numbers called a vector. Items with similar vectors are treated as similar according to the model that produced the vectors. HNSW provides a way to search these vectors quickly, especially in large datasets. HNSW stores vectors in a graph. Each vector is a node, and links connect it to some nearby vectors. The graph has several layers: upper layers contain fewer nodes and act like a rough map, while the bottom layer contains all nodes and gives a more detailed view. A search starts in an upper layer, follows links toward nodes that are closer to the query, and then repeats the process in lower layers until it finds a set of likely nearest neighbors. == Background == The nearest neighbor search problem asks which items in a dataset are closest to a query item. A direct search can compare the query with every item in the dataset, but this becomes slow when the dataset is large. Exact search methods based on spatial trees, such as the k-d tree and R-tree, can also become less effective for high-dimensional data, a problem often associated with the curse of dimensionality. Approximate nearest neighbor methods trade some exactness for speed or lower resource use. Instead of always guaranteeing the exact closest item, they try to return close items quickly. Other approximate methods include locality-sensitive hashing and product quantization. HNSW builds on research into small-world networks and navigable graphs. In a small-world graph, most nodes can be reached from other nodes through a short chain of links. In a navigable graph, a search procedure can use local information to move toward a target. Jon Kleinberg's work on navigation in small-world networks is an important example of this research area. Later work studied ways to add links that make graphs easier to navigate greedily. The HNSW algorithm extends earlier navigable small world methods for similarity search by adding a hierarchy of graph layers. This hierarchy helps the algorithm find a good region of the graph before doing a more detailed search in the bottom layer. == Algorithm == HNSW is based on a proximity graph. In this graph, nearby vectors are connected by edges. The algorithm uses these edges to move through the dataset, rather than scanning every vector. The graph is hierarchical. Every vector appears in the bottom layer. Some vectors are also placed in higher layers, with fewer vectors appearing as the layers go upward. The upper layers allow long-range movement across the dataset, while the lower layers allow a more detailed search near promising candidates. A typical search proceeds as follows: The search begins from an entry point in the highest layer. At each step, the algorithm looks at neighboring nodes and moves to a neighbor that is closer to the query. When it cannot find a closer neighbor in that layer, it moves down to the next layer. In the bottom layer, it explores a wider set of candidate nodes and returns the nearest candidates found. This search strategy is often described as greedy navigation. The algorithm repeatedly chooses locally better nodes, using the graph structure to approach the query point. == Construction and parameters == The HNSW graph is built incrementally. When a new vector is inserted, the algorithm assigns it a maximum layer, searches for nearby existing nodes, and connects the new node to selected neighbors in each layer where it appears. Implementations usually expose parameters that control the trade-off between speed, accuracy, memory use, and construction time. A higher number of graph connections can improve recall but requires more memory. A larger search candidate list can improve accuracy but makes queries slower. A larger construction candidate list can improve the quality of the graph but makes index building slower. Because HNSW is approximate, its results are not always identical to a full exact search. Its practical performance depends on the dataset, distance measure, implementation, and parameter settings. Benchmarking studies have found HNSW-based libraries to be strong performers among approximate nearest neighbor methods, although worst-case performance can differ from performance on common benchmark datasets. == Use in vector search systems == HNSW is used as an index in systems that store and search high-dimensional vectors. These systems include vector databases, search engines, and database extensions. Typical uses include semantic search, recommender systems, image similarity search, and retrieval-augmented generation. Several software projects implement or support HNSW. Libraries include hnswlib, which is associated with the original HNSW authors, and FAISS. Database and search systems that document HNSW support include Apache Lucene, Chroma, ClickHouse, DuckDB, MariaDB, Milvus, pgvector, Qdrant, and Redis.

    Read more →
  • CN2 algorithm

    CN2 algorithm

    The CN2 induction algorithm is a learning algorithm for rule induction. It is designed to work even when the training data is imperfect. It is based on ideas from the AQ algorithm and the ID3 algorithm. As a consequence it creates a rule set like that created by AQ but is able to handle noisy data like ID3. == Description of algorithm == The algorithm must be given a set of examples, TrainingSet, which have already been classified in order to generate a list of classification rules. A set of conditions, SimpleConditionSet, which can be applied, alone or in combination, to any set of examples is predefined to be used for the classification. routine CN2(TrainingSet) let the ClassificationRuleList be empty repeat let the BestConditionExpression be Find_BestConditionExpression(TrainingSet) if the BestConditionExpression is not nil then let the TrainingSubset be the examples covered by the BestConditionExpression remove from the TrainingSet the examples in the TrainingSubset let the MostCommonClass be the most common class of examples in the TrainingSubset append to the ClassificationRuleList the rule 'if ' the BestConditionExpression ' then the class is ' the MostCommonClass until the TrainingSet is empty or the BestConditionExpression is nil return the ClassificationRuleList routine Find_BestConditionExpression(TrainingSet) let the ConditionalExpressionSet be empty let the BestConditionExpression be nil repeat let the TrialConditionalExpressionSet be the set of conditional expressions, {x and y where x belongs to the ConditionalExpressionSet and y belongs to the SimpleConditionSet}. remove all formulae in the TrialConditionalExpressionSet that are either in the ConditionalExpressionSet (i.e., the unspecialized ones) or null (e.g., big = y and big = n) for every expression, F, in the TrialConditionalExpressionSet if F is statistically significant and F is better than the BestConditionExpression by user-defined criteria when tested on the TrainingSet then replace the current value of the BestConditionExpression by F while the number of expressions in the TrialConditionalExpressionSet > user-defined maximum remove the worst expression from the TrialConditionalExpressionSet let the ConditionalExpressionSet be the TrialConditionalExpressionSet until the ConditionalExpressionSet is empty return the BestConditionExpression

    Read more →
  • LogitBoost

    LogitBoost

    In machine learning and computational learning theory, LogitBoost is a boosting algorithm formulated by Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The original paper casts the AdaBoost algorithm into a statistical framework. Specifically, if one considers AdaBoost as a generalized additive model and then applies the cost function of logistic regression, one can derive the LogitBoost algorithm. == Minimizing the LogitBoost cost function == LogitBoost can be seen as a convex optimization. Specifically, given that we seek an additive model of the form f = ∑ t α t h t {\displaystyle f=\sum _{t}\alpha _{t}h_{t}} the LogitBoost algorithm minimizes the logistic loss: ∑ i log ⁡ ( 1 + e − y i f ( x i ) ) {\displaystyle \sum _{i}\log \left(1+e^{-y_{i}f(x_{i})}\right)}

    Read more →
  • Nearest neighbor search

    Nearest neighbor search

    Nearest neighbor search (NNS), as a form of proximity search, is the optimization problem of finding the point in a given set that is closest (or most similar) to a given point. Closeness is typically expressed in terms of a dissimilarity function: the less similar the objects, the larger the function values. Formally, the nearest neighbor (NN) search problem is defined as follows: given a set S of points in a space M and a query point q ∈ M {\displaystyle q\in M} , find the closest point in S to q. Donald Knuth in volume 3 of The Art of Computer Programming (1973) called it the post-office problem, referring to an application of assigning to a residence the nearest post office. A direct generalization of this problem is a k-NN search, where we need to find the k closest points. Most commonly M is a metric space and dissimilarity is expressed as a distance metric, which is symmetric and satisfies the triangle inequality. Even more common, M is taken to be the d-dimensional vector space where dissimilarity is measured using the Euclidean distance, Manhattan distance or other distance metric. However, the dissimilarity function can be arbitrary. One example is asymmetric Bregman divergence, for which the triangle inequality does not hold. == Applications == The nearest neighbor search problem arises in numerous fields of application, including: Pattern recognition – in particular for optical character recognition Statistical classification – see k-nearest neighbor algorithm Computer vision – for point cloud registration Computational geometry – see Closest pair of points problem Cryptanalysis – for lattice problem Databases – e.g. content-based image retrieval Coding theory – see maximum likelihood decoding Semantic search Vector databases, where nearest-neighbor lookup over embeddings is used to retrieve semantically similar records Retrieval-augmented generation systems, where nearest-neighbor retrieval over embeddings is used to fetch candidate passages or documents before generation Data compression – see MPEG-2 standard Robotic sensing Recommendation systems, e.g. see Collaborative filtering Internet marketing – see contextual advertising and behavioral targeting DNA sequencing Spell checking – suggesting correct spelling Plagiarism detection Similarity scores for predicting career paths of professional athletes. Cluster analysis – assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense, usually based on Euclidean distance Chemical similarity Sampling-based motion planning == Methods == Various solutions to the NNS problem have been proposed. The quality and usefulness of the algorithms are determined by the time complexity of queries as well as the space complexity of any search data structures that must be maintained. The informal observation usually referred to as the curse of dimensionality states that there is no general-purpose exact solution for NNS in high-dimensional Euclidean space using polynomial preprocessing and polylogarithmic search time. === Exact methods === ==== Linear search ==== The simplest solution to the NNS problem is to compute the distance from the query point to every other point in the database, keeping track of the "best so far". This algorithm, sometimes referred to as the naive approach, has a running time of O(dN), where N is the cardinality of S and d is the dimensionality of S. There are no search data structures to maintain, so the linear search has no space complexity beyond the storage of the database. Naive search can, on average, outperform space partitioning approaches on higher dimensional spaces. The absolute distance is not required for distance comparison, only the relative distance. In geometric coordinate systems the distance calculation can be sped up considerably by omitting the square root calculation from the distance calculation between two coordinates. The distance comparison will still yield identical results. ==== Space partitioning ==== Since the 1970s, the branch and bound methodology has been applied to the problem. In the case of Euclidean space, this approach encompasses spatial index or spatial access methods. Several space-partitioning methods have been developed for solving the NNS problem. Perhaps the simplest is the k-d tree, which iteratively bisects the search space into two regions containing half of the points of the parent region. Queries are performed via traversal of the tree from the root to a leaf by evaluating the query point at each split. Depending on the distance specified in the query, neighboring branches that might contain hits may also need to be evaluated. For constant dimension query time, average complexity is O(log N) in the case of randomly distributed points, worst case complexity is O(kN^(1-1/k)) Alternatively the R-tree data structure was designed to support nearest neighbor search in dynamic context, as it has efficient algorithms for insertions and deletions such as the R tree. R-trees can yield nearest neighbors not only for Euclidean distance, but can also be used with other distances. In the case of general metric space, the branch-and-bound approach is known as the metric tree approach. Particular examples include vp-tree and BK-tree methods. Using a set of points taken from a 3-dimensional space and put into a BSP tree, and given a query point taken from the same space, a possible solution to the problem of finding the nearest point-cloud point to the query point is given in the following description of an algorithm. (Strictly speaking, no such point may exist, because it may not be unique. But in practice, usually we only care about finding any one of the subset of all point-cloud points that exist at the shortest distance to a given query point.) The idea is, for each branching of the tree, guess that the closest point in the cloud resides in the half-space containing the query point. This may not be the case, but it is a good heuristic. After having recursively gone through all the trouble of solving the problem for the guessed half-space, now compare the distance returned by this result with the shortest distance from the query point to the partitioning plane. This latter distance is that between the query point and the closest possible point that could exist in the half-space not searched. If this distance is greater than that returned in the earlier result, then clearly there is no need to search the other half-space. If there is such a need, then you must go through the trouble of solving the problem for the other half space, and then compare its result to the former result, and then return the proper result. The performance of this algorithm is nearer to logarithmic time than linear time when the query point is near the cloud, because as the distance between the query point and the closest point-cloud point nears zero, the algorithm needs only perform a look-up using the query point as a key to get the correct result. === Approximation methods === An approximate nearest neighbor search algorithm is allowed to return points whose distance from the query is at most c {\displaystyle c} times the distance from the query to its nearest points. The appeal of this approach is that, in many cases, an approximate nearest neighbor is almost as good as the exact one. In particular, if the distance measure accurately captures the notion of user quality, then small differences in the distance should not matter. ==== Greedy search in proximity neighborhood graphs ==== Proximity graph methods (such as navigable small world graphs and HNSW) are considered the current state-of-the-art for the approximate nearest neighbors search. The methods are based on greedy traversing in proximity neighborhood graphs G ( V , E ) {\displaystyle G(V,E)} in which every point x i ∈ S {\displaystyle x_{i}\in S} is uniquely associated with vertex v i ∈ V {\displaystyle v_{i}\in V} . The search for the nearest neighbors to a query q in the set S takes the form of searching for the vertex in the graph G ( V , E ) {\displaystyle G(V,E)} . The basic algorithm – greedy search – works as follows: search starts from an enter-point vertex v i ∈ V {\displaystyle v_{i}\in V} by computing the distances from the query q to each vertex of its neighborhood { v j : ( v i , v j ) ∈ E } {\displaystyle \{v_{j}:(v_{i},v_{j})\in E\}} , and then finds a vertex with the minimal distance value. If the distance value between the query and the selected vertex is smaller than the one between the query and the current element, then the algorithm moves to the selected vertex, and it becomes new enter-point. The algorithm stops when it reaches a local minimum: a vertex whose neighborhood does not contain a vertex that is closer to the query than the vertex itself. The idea of proximity neighborhood graphs was exploited in multiple publications, including the seminal paper by Arya and Mount, in the VoroNet syst

    Read more →
  • Coghead

    Coghead

    Coghead was a web application company based out of Redwood City, California. The company offered a web-based service for building and hosting custom online database applications. Applications were built around custom data collections and were typically designed to facilitate management of, and collaboration on, business data. Examples of Coghead's "gallery" applications include project management, simple Customer relationship management, bug tracking and extreme programming. Coghead's service was available through a limited-access beta program before "going live" for free trial accounts in April, 2007. Coghead launched a paid subscription plans in June, 2007. On February 19, 2009, Coghead announced that its intellectual property assets (its 'service') had been acquired by SAP AG (NYSE:SAP). == Product == Coghead's product was a fully hosted environment for building, accessing, and maintaining applications and the associated business data. Like other so-called "Web 2.0" companies, Coghead built its product around the idea of "software as a service". The product was intended to allow users to design a range of applications from scratch using only a drag and drop, WYSIWYG user interface, with very limited scripting or coding (if any) required. Coghead also offered its paid subscribers the ability to develop and publish "Coglets," web forms that allowed site visitors to view data in, or submit data into, the host's Coghead database. On February 19, 2009, Coghead announced that SAP AG had acquired the Coghead service through an asset purchase. The SAP asset purchase closed in the 1st Quarter 2009. Immediately upon closing the asset purchase, the public-facing service was taken off-line by SAP as they prepared to integrate the Coghead code with other SAP assets. This forced many of Coghead's customers to find alternative solutions.

    Read more →
  • Time-aware long short-term memory

    Time-aware long short-term memory

    Time-aware LSTM (T-LSTM) is a long short-term memory (LSTM) unit capable of handling irregular time intervals in longitudinal patient records. T-LSTM was developed by researchers from Michigan State University, IBM Research, and Cornell University and was first presented in the Knowledge Discovery and Data Mining (KDD) conference. Experiments using real and synthetic data proved that T-LSTM auto-encoder outperformed widely used frameworks including LSTM and MF1-LSTM auto-encoders.

    Read more →
  • Non-negative matrix factorization

    Non-negative matrix factorization

    Non-negative matrix factorization (NMF or NNMF), also non-negative matrix approximation is a group of algorithms in multivariate analysis and linear algebra where a matrix V is factorized into (usually) two matrices W and H, with the property that all three matrices have no negative elements. This non-negativity makes the resulting matrices easier to inspect. Also, in applications such as processing of audio spectrograms or muscular activity, non-negativity is inherent to the data being considered. Since the problem is not exactly solvable in general, it is commonly approximated numerically. NMF finds applications in such fields as astronomy, computer vision, document clustering, missing data imputation, chemometrics, audio signal processing, recommender systems, and bioinformatics. == History == In chemometrics non-negative matrix factorization has a long history under the name "self modeling curve resolution". In this framework the vectors in the right matrix are continuous curves rather than discrete vectors. Also early work on non-negative matrix factorizations was performed by a Finnish group of researchers in the 1990s under the name positive matrix factorization. It became more widely known as non-negative matrix factorization after Lee and Seung investigated the properties of the algorithm and published some simple and useful algorithms for two types of factorizations. == Background == Let matrix V be the product of the matrices W and H, V = W H . {\displaystyle \mathbf {V} =\mathbf {W} \mathbf {H} \,.} Matrix multiplication can be implemented as computing the column vectors of V as linear combinations of the column vectors in W using coefficients supplied by columns of H. That is, each column of V can be computed as follows: v i = W h i , {\displaystyle \mathbf {v} _{i}=\mathbf {W} \mathbf {h} _{i}\,,} where vi is the i-th column vector of the product matrix V and hi is the i-th column vector of the matrix H. When multiplying matrices, the dimensions of the factor matrices may be significantly lower than those of the product matrix and it is this property that forms the basis of NMF. NMF generates factors with significantly reduced dimensions compared to the original matrix. For example, if V is an m × n matrix, W is an m × p matrix, and H is a p × n matrix then p can be significantly less than both m and n. Here is an example based on a text-mining application: Let the input matrix (the matrix to be factored) be V with 10000 rows and 500 columns where words are in rows and documents are in columns. That is, we have 500 documents indexed by 10000 words. It follows that a column vector v in V represents a document. Assume we ask the algorithm to find 10 features in order to generate a features matrix W with 10000 rows and 10 columns and a coefficients matrix H with 10 rows and 500 columns. The product of W and H is a matrix with 10000 rows and 500 columns, the same shape as the input matrix V and, if the factorization worked, it is a reasonable approximation to the input matrix V. From the treatment of matrix multiplication above it follows that each column in the product matrix WH is a linear combination of the 10 column vectors in the features matrix W with coefficients supplied by the coefficients matrix H. This last point is the basis of NMF because we can consider each original document in our example as being built from a small set of hidden features. NMF generates these features. It is useful to think of each feature (column vector) in the features matrix W as a document archetype comprising a set of words where each word's cell value defines the word's rank in the feature: The higher a word's cell value the higher the word's rank in the feature. A column in the coefficients matrix H represents an original document with a cell value defining the document's rank for a feature. We can now reconstruct a document (column vector) from our input matrix by a linear combination of our features (column vectors in W) where each feature is weighted by the feature's cell value from the document's column in H. == Clustering property == NMF has an inherent clustering property, i.e., it automatically clusters the columns of input data V = ( v 1 , … , v n ) {\displaystyle \mathbf {V} =(v_{1},\dots ,v_{n})} . More specifically, the approximation of V {\displaystyle \mathbf {V} } by V ≃ W H {\displaystyle \mathbf {V} \simeq \mathbf {W} \mathbf {H} } is achieved by finding W {\displaystyle W} and H {\displaystyle H} that minimize the error function (using the Frobenius norm) ‖ V − W H ‖ F , {\displaystyle \left\|V-WH\right\|_{F},} subject to W ≥ 0 , H ≥ 0. {\displaystyle W\geq 0,H\geq 0.} , If we furthermore impose an orthogonality constraint on H {\displaystyle \mathbf {H} } , i.e. H H T = I {\displaystyle \mathbf {H} \mathbf {H} ^{T}=I} , then the above minimization is mathematically equivalent to the minimization of K-means clustering. Furthermore, the computed H {\displaystyle H} gives the cluster membership, i.e., if H k j > H i j {\displaystyle \mathbf {H} _{kj}>\mathbf {H} _{ij}} for all i ≠ k, this suggests that the input data v j {\displaystyle v_{j}} belongs to k {\displaystyle k} -th cluster. The computed W {\displaystyle W} gives the cluster centroids, i.e., the k {\displaystyle k} -th column gives the cluster centroid of k {\displaystyle k} -th cluster. This centroid's representation can be significantly enhanced by convex NMF. When the orthogonality constraint H H T = I {\displaystyle \mathbf {H} \mathbf {H} ^{T}=I} is not explicitly imposed, the orthogonality holds to a large extent, and the clustering property holds too. When the error function to be used is Kullback–Leibler divergence, NMF is identical to the probabilistic latent semantic analysis (PLSA), a popular document clustering method. == Types == === Approximate non-negative matrix factorization === Usually the number of columns of W and the number of rows of H in NMF are selected so the product WH will become an approximation to V. The full decomposition of V then amounts to the two non-negative matrices W and H as well as a residual U, such that: V = WH + U. The elements of the residual matrix can either be negative or positive. When W and H are smaller than V they become easier to store and manipulate. Another reason for factorizing V into smaller matrices W and H, is that if one's goal is to approximately represent the elements of V by significantly less data, then one has to infer some latent structure in the data. === Convex non-negative matrix factorization === In standard NMF, matrix factor W ∈ R+m × k, i.e., W can be anything in that space. Convex NMF restricts the columns of W to convex combinations of the input data vectors ( v 1 , … , v n ) {\displaystyle (v_{1},\dots ,v_{n})} . This greatly improves the quality of data representation of W. Furthermore, the resulting matrix factor H becomes more sparse and orthogonal. === Nonnegative rank factorization === In case the nonnegative rank of V is equal to its actual rank, V = WH is called a nonnegative rank factorization (NRF). The problem of finding the NRF of V, if it exists, is known to be NP-hard. === Different cost functions and regularizations === There are different types of non-negative matrix factorizations. The different types arise from using different cost functions for measuring the divergence between V and WH and possibly by regularization of the W and/or H matrices. Two simple divergence functions studied by Lee and Seung are the squared error (or Frobenius norm) and an extension of the Kullback–Leibler divergence to positive matrices (the original Kullback–Leibler divergence is defined on probability distributions). Each divergence leads to a different NMF algorithm, usually minimizing the divergence using iterative update rules. The factorization problem in the squared error version of NMF may be stated as: Given a matrix V {\displaystyle \mathbf {V} } find nonnegative matrices W and H that minimize the function F ( W , H ) = ‖ V − W H ‖ F 2 {\displaystyle F(\mathbf {W} ,\mathbf {H} )=\left\|\mathbf {V} -\mathbf {WH} \right\|_{F}^{2}} Another type of NMF for images is based on the total variation norm. When L1 regularization (akin to Lasso) is added to NMF with the mean squared error cost function, the resulting problem may be called non-negative sparse coding due to the similarity to the sparse coding problem, although it may also still be referred to as NMF. === Online NMF === Many standard NMF algorithms analyze all the data together; i.e., the whole matrix is available from the start. This may be unsatisfactory in applications where there are too many data to fit into memory or where the data are provided in streaming fashion. One such use is for collaborative filtering in recommendation systems, where there may be many users and many items to recommend, and it would be inefficient to recalculate everything when one user or one item is added to the system. The cost function for o

    Read more →
  • Modern Hopfield network

    Modern Hopfield network

    Modern Hopfield networks (also known as Dense Associative Memories) are generalizations of the classical Hopfield networks that break the linear scaling relationship between the number of input features and the number of stored memories. This is achieved by introducing stronger non-linearities (either in the energy function or neurons’ activation functions) leading to super-linear (even an exponential) memory storage capacity as a function of the number of feature neurons. The network still requires a sufficient number of hidden neurons. The key theoretical idea behind the modern Hopfield networks is to use an energy function and an update rule that is more sharply peaked around the stored memories in the space of neuron’s configurations compared to the classical Hopfield network. == Classical Hopfield networks == Hopfield networks are recurrent neural networks with dynamical trajectories converging to fixed point attractor states and described by an energy function. The state of each model neuron i {\textstyle i} is defined by a time-dependent variable V i {\displaystyle V_{i}} , which can be chosen to be either discrete or continuous. A complete model describes the mathematics of how the future state of activity of each neuron depends on the known present or previous activity of all the neurons. In the original Hopfield model of associative memory, the variables were binary, and the dynamics were described by a one-at-a-time update of the state of the neurons. An energy function quadratic in the V i {\displaystyle V_{i}} was defined, and the dynamics consisted of changing the activity of each single neuron i {\displaystyle i} only if doing so would lower the total energy of the system. This same idea was extended to the case of V i {\displaystyle V_{i}} being a continuous variable representing the output of neuron i {\displaystyle i} , and V i {\displaystyle V_{i}} being a monotonic function of an input current. The dynamics became expressed as a set of first-order differential equations for which the "energy" of the system always decreased. The energy in the continuous case has one term which is quadratic in the V i {\displaystyle V_{i}} (as in the binary model), and a second term which depends on the gain function (neuron's activation function). While having many desirable properties of associative memory, both of these classical systems suffer from a small memory storage capacity, which scales linearly with the number of input features. == Discrete variables == A simple example of the Modern Hopfield network can be written in terms of binary variables V i {\displaystyle V_{i}} that represent the active V i = + 1 {\displaystyle V_{i}=+1} and inactive V i = − 1 {\displaystyle V_{i}=-1} state of the model neuron i {\displaystyle i} . E = − ∑ μ = 1 N mem F ( ∑ i = 1 N f ξ μ i V i ) {\displaystyle E=-\sum \limits _{\mu =1}^{N_{\text{mem}}}F{\Big (}\sum \limits _{i=1}^{N_{f}}\xi _{\mu i}V_{i}{\Big )}} In this formula the weights ξ μ i {\textstyle \xi _{\mu i}} represent the matrix of memory vectors (index μ = 1... N mem {\displaystyle \mu =1...N_{\text{mem}}} enumerates different memories, and index i = 1... N f {\displaystyle i=1...N_{f}} enumerates the content of each memory corresponding to the i {\displaystyle i} -th feature neuron), and the function F ( x ) {\displaystyle F(x)} is a rapidly growing non-linear function. The update rule for individual neurons (in the asynchronous case) can be written in the following form V i ( t + 1 ) = sign ⁡ [ ∑ μ = 1 N mem ( F ( ξ μ i + ∑ j ≠ i ξ μ j V j ( t ) ) − F ( − ξ μ i + ∑ j ≠ i ξ μ j V j ( t ) ) ) ] {\displaystyle V_{i}^{(t+1)}=\operatorname {sign} {\bigg [}\sum \limits _{\mu =1}^{N_{\text{mem}}}{\bigg (}F{\Big (}\xi _{\mu i}+\sum \limits _{j\neq i}\xi _{\mu j}V_{j}^{(t)}{\Big )}-F{\Big (}-\xi _{\mu i}+\sum \limits _{j\neq i}\xi _{\mu j}V_{j}^{(t)}{\Big )}{\bigg )}{\bigg ]}} which states that in order to calculate the updated state of the i {\textstyle i} -th neuron the network compares two energies: the energy of the network with the i {\displaystyle i} -th neuron in the ON state and the energy of the network with the i {\displaystyle i} -th neuron in the OFF state, given the states of the remaining neuron. The updated state of the i {\displaystyle i} -th neuron selects the state that has the lowest of the two energies. In the limiting case when the non-linear energy function is quadratic F ( x ) = x 2 {\displaystyle F(x)=x^{2}} these equations reduce to the familiar energy function and the update rule for the classical binary Hopfield network. The memory storage capacity of these networks can be calculated for random binary patterns. For the power energy function F ( x ) = x n {\displaystyle F(x)=x^{n}} the maximal number of memories that can be stored and retrieved from this network without errors is given by N mem max ≈ 1 2 ( 2 n − 3 ) ! ! N f n − 1 ln ⁡ ( N f ) {\displaystyle N_{\text{mem}}^{\max }\approx {\frac {1}{2(2n-3)!!}}{\frac {N_{f}^{n-1}}{\ln(N_{f})}}} For an exponential energy function F ( x ) = e x {\textstyle F(x)=e^{x}} the memory storage capacity is exponential in the number of feature neurons N mem max ≈ 2 N f / 2 {\displaystyle N_{\text{mem}}^{\max }\approx 2^{N_{f}/2}} == Continuous variables == Modern Hopfield networks or Dense Associative Memories can be best understood in continuous variables and continuous time. Consider the network architecture, shown in Fig.1, and the equations for the neurons' state evolutionwhere the currents of the feature neurons are denoted by x i {\textstyle x_{i}} , and the currents of the memory neurons are denoted by h μ {\displaystyle h_{\mu }} ( h {\displaystyle h} stands for hidden neurons). There are no synaptic connections among the feature neurons or the memory neurons. A matrix ξ μ i {\displaystyle \xi _{\mu i}} denotes the strength of synapses from a feature neuron i {\displaystyle i} to the memory neuron μ {\displaystyle \mu } . The synapses are assumed to be symmetric, so that the same value characterizes a different physical synapse from the memory neuron μ {\displaystyle \mu } to the feature neuron i {\displaystyle i} . The outputs of the memory neurons and the feature neurons are denoted by f μ {\displaystyle f_{\mu }} and g i {\displaystyle g_{i}} , which are non-linear functions of the corresponding currents. In general these outputs can depend on the currents of all the neurons in that layer so that f μ = f ( { h μ } ) {\displaystyle f_{\mu }=f(\{h_{\mu }\})} and g i = g ( { x i } ) {\textstyle g_{i}=g(\{x_{i}\})} . It is convenient to define these activation function as derivatives of the Lagrangian functions for the two groups of neuronsThis way the specific form of the equations for neuron's states is completely defined once the Lagrangian functions are specified. Finally, the time constants for the two groups of neurons are denoted by τ f {\displaystyle \tau _{f}} and τ h {\displaystyle \tau _{h}} , I i {\displaystyle I_{i}} is the input current to the network that can be driven by the presented data. General systems of non-linear differential equations can have many complicated behaviors that can depend on the choice of the non-linearities and the initial conditions. For Hopfield networks, however, this is not the case - the dynamical trajectories always converge to a fixed point attractor state. This property is achieved because these equations are specifically engineered so that they have an underlying energy function The terms grouped into square brackets represent a Legendre transform of the Lagrangian function with respect to the states of the neurons. If the Hessian matrices of the Lagrangian functions are positive semi-definite, the energy function is guaranteed to decrease on the dynamical trajectory This property makes it possible to prove that the system of dynamical equations describing temporal evolution of neurons' activities will eventually reach a fixed point attractor state. In certain situations one can assume that the dynamics of hidden neurons equilibrates at a much faster time scale compared to the feature neurons, τ h ≪ τ f {\textstyle \tau _{h}\ll \tau _{f}} . In this case the steady state solution of the second equation in the system (1) can be used to express the currents of the hidden units through the outputs of the feature neurons. This makes it possible to reduce the general theory (1) to an effective theory for feature neurons only. The resulting effective update rules and the energies for various common choices of the Lagrangian functions are shown in Fig.2. In the case of log-sum-exponential Lagrangian function the update rule (if applied once) for the states of the feature neurons is the attention mechanism commonly used in many modern AI systems (see Ref. for the derivation of this result from the continuous time formulation). == Relationship to classical Hopfield network with continuous variables == Classical formulation of continuous Hopfield networks can be understood as a

    Read more →
  • Stanza Living

    Stanza Living

    Stanza Living is the common brand name for Dtwelve Spaces Private Limited. It provides fully-managed shared living accommodations to students and young professionals. Founded by Anindya Dutta and Sandeep Dalmia, the company is present across 23 cities including Delhi, NCR, Bangalore, Visakhapatnam, Hyderabad, Chennai, Coimbatore, Indore, Pune, Baroda, Vijayawada, and Dehradun, Kota in India, with a capacity of 70,000 beds. Stanza Living is a technology-enabled housing concept which provides fully-furnished residences with amenities like meals, internet, laundry services, housekeeping, security and community engagement programmes. The company has an asset-light business model under which it engages in long-term lease agreements with property owners/developers, who convert their assets into shared living residences as per company guidelines. These assets are subsequently operated by Stanza Living. == Industry background == A report by Cushman & Wakefield (C&W) titled 'Exploring the Student Housing Universe in India City Insights', estimates that there were over 9.08 million migrant student enrolments in India's higher educational institutions (HEIs) for the year 2018-19 who need quality accommodation facilities. According to the report, Delhi-NCR, Mumbai, and Pune are the three biggest markets for student housing in the country, and these cities require an additional 4.75 lakh beds from organized co-living operators to meet the current demand. == History == Stanza Living provides tech-enabled, fully managed community living facilities for students and working professionals. The company was launched as a student housing business in Delhi NCR with a capacity of 100 beds, and grew to 14 cities by 2019. By early 2020, the company began catering to working professionals as well. The company has a combined inventory of 70,000 beds under management for both students and working professionals. Stanza Living is currently valued at $300 million. It has raised a capital of about $70 million from leading global investors like Falcon Edge Capital, Sequoia Capital, Matrix Partners and Accel Partners. November 2017 – Seed funding, September 2018 – Series A, March 2019 – Debt financing, July 2019 – Series C round, December 2019 - Debt financing. The company has invested in building technology products for business efficiency and consumer experience, like the Stanza Resident App and Stanza Real Estate App. Stanza Living has close to 1,500 employees across India. It is recognized among Top Real Estate Tech Startups of 2020 across the globe by research and analysis company Tracxn. The company has been shortlisted among Top 25 Start-ups of India in 2019 by LinkedIn == Founders == Stanza Living was co-founded by Anindya Dutta and Sandeep Dalmia. Sandeep Dalmia is an alumnus of Delhi College of Engineering and IIM Ahmedabad. Prior to Stanza, he was a Principal at Boston Consulting Group, working across India, US and South East Asia markets. Anindya Dutta was previously a Real Estate investor with Oaktree Capital and prior to that, he worked at Goldman Sachs in London. He is an alumnus of IIT Kharagpur and IIM Ahmedabad.

    Read more →
  • Neural Networks (journal)

    Neural Networks (journal)

    Neural Networks is a monthly peer-reviewed scientific journal and an official journal of the International Neural Network Society, European Neural Network Society, and Japanese Neural Network Society. == History == The journal was established in 1988 and is published by Elsevier. It covers all aspects of research on artificial neural networks. The founding editor-in-chief was Stephen Grossberg (Boston University). The current editors-in-chief are DeLiang Wang (Ohio State University) and Taro Toyoizumi (RIKEN Center for Brain Science). == Abstracting and indexing == The journal is abstracted and indexed in Scopus and the Science Citation Index Expanded. According to the Journal Citation Reports, the journal has a 2022 impact factor of 7.8.

    Read more →
  • TabPFN

    TabPFN

    TabPFN (Tabular Prior-data Fitted Network) is a machine learning model for tabular datasets proposed in 2022. It uses a transformer architecture. It is intended for supervised classification and regression analysis on tabular datasets, particularly focusing on small- to medium-sized datasets. The latest version, TabPFN-3, was released in May 2026 and supports datasets with up to one million rows and 200 features. == History == TabPFN was first introduced in a 2022 pre-print and presented at ICLR 2023. TabPFN v2 was published in 2025 in Nature by Hollmann and co-authors. The source code is published on GitHub under a modified Apache License and on PyPi. Writing for ICLR blogs, McCarter states that the model has attracted attention due to its performance on small dataset benchmarks. TabPFN v2.5 was released on November 6, 2025. TabPFN-3 was released on May 12, 2026. Prior Labs, founded in 2024, aims to commercialize TabPFN. As of April 2026, the open-source TabPFN repository had more than 6,000 stars on GitHub. == Overview and pre-training == TabPFN supports classification, regression and generative tasks. It leverages "Prior-Data Fitted Networks" models to model tabular data. By using a transformer pre-trained on synthetic tabular datasets, TabPFN avoids benchmark contamination and costs of curating real-world data. TabPFN v2 was pre-trained on approximately 130 million such datasets. Synthetic datasets are generated using causal models or Bayesian neural networks; this can include simulating missing values, imbalanced data, and noise. Random inputs are passed through these models to generate outputs, with a bias towards simpler causal structures. During pre-training, TabPFN predicts the masked target values of new data points given training data points and their known targets, effectively learning a generic learning algorithm that is executed by running a neural network forward pass. The new dataset is then processed in a single forward pass without retraining. The model's transformer encoder processes features and labels by alternating attention across rows and columns. TabPFN v2 handles numerical and categorical features, missing values, and supports tasks like regression and synthetic data generation, while TabPFN-2.5 scales this approach to datasets with up to 50,000 rows and 2,000 features. TabPFN-3 introduced a redesigned architecture with row-compression, an attention-based many-class decoder, native missing-value handling, and inference optimizations such as row chunking and a reduced key-value cache, with benchmark-validated regimes of up to 1 million rows with 200 features, 100,000 rows with 2,000 features, or 1,000 rows with 20,000 features. Since TabPFN is pre-trained, in contrast to other deep learning methods, it does not require costly hyperparameter optimization. == Research == TabPFN is the subject of on-going research. Applications for TabPFN have been investigated for domains such as chemoproteomics, insurance risk classification, and metagenomics. In clinical research, TabPFN was used in a study on the early detection of pancreatic cancer from blood samples, where it was combined with metabolomic data and reported high diagnostic performance. == Applications == TabPFN has been used in industrial and biomedical contexts. Hitachi Ltd. has been reported to use the model for predictive maintenance in rail networks, with its use described as helping to identify track issues earlier and reduce manual inspections. In the biomedical domain, Oxford Cancer Analytics has used TabPFN in the analysis of proteomic data in lung disease research. A 2025 ML Contests report noted that the winners of DrivenData's PREPARE challenge used TabPFN to generate features for gradient-boosted decision tree models. == Limitations == TabPFN has been criticized for its "one large neural network is all you need" approach to modeling problems. Further, its performance is limited in high-dimensional and large-scale datasets. == Scaling Mode == In late November 2025, Prior Labs introduced ‘‘Scaling Mode’’, an operating mode for TabPFN designed to remove the fixed upper bound on training set size. Earlier versions of TabPFN had been optimized and validated primarily for datasets of up to 100,000 rows, whereas Scaling Mode was reported to extend support to substantially larger datasets, with benchmarked experiments on datasets containing up to 10 million rows. According to Prior Labs, Scaling Mode preserves the existing TabPFN architecture, including its alternating row-attention and column-attention design, as well as the same feature-count limits as prior releases. Inference remains based on a single forward pass rather than dataset-specific gradient-descent training, while scalability is described as being constrained primarily by available compute and memory resources. Prior Labs reported benchmark results on an internal collection of datasets ranging from 1 million to 10 million rows across industry and scientific applications. In these benchmarks, Scaling Mode was compared with CatBoost, XGBoost, LightGBM, and TabPFN 2.5 using 50,000-row subsampling. The company stated that predictive performance improved monotonically with increasing training set size and that no diminishing returns from scaling were observed within the tested range. Prior Labs also announced the release through company and executive social media channels. TabPFN-3 later incorporated related scaling improvements, including row chunking and a reduced key-value cache, into the model architecture and inference pipeline.

    Read more →
  • Plate notation

    Plate notation

    In Bayesian inference, plate notation is a method of representing variables that repeat in a graphical model. Instead of drawing each repeated variable individually, a plate or rectangle is used to group variables into a subgraph that repeat together, and a number is drawn on the plate to represent the number of repetitions of the subgraph in the plate. The assumptions are that the subgraph is duplicated that many times, the variables in the subgraph are indexed by the repetition number, and any links that cross a plate boundary are replicated once for each subgraph repetition. == Example == In this example, we consider Latent Dirichlet allocation, a Bayesian network that models how documents in a corpus are topically related. There are two variables not in any plate; α is the parameter of the uniform Dirichlet prior on the per-document topic distributions, and β is the parameter of the uniform Dirichlet prior on the per-topic word distribution. The outermost plate represents all the variables related to a specific document, including θ i {\displaystyle \theta _{i}} , the topic distribution for document i. The M in the corner of the plate indicates that the variables inside are repeated M times, once for each document. The inner plate represents the variables associated with each of the N i {\displaystyle N_{i}} words in document i: z i j {\displaystyle z_{ij}} is the topic distribution for the jth word in document i, and w i j {\displaystyle w_{ij}} is the actual word used. The N in the corner represents the repetition of the variables in the inner plate N j {\displaystyle N_{j}} times, once for each word in document i. The circle representing the individual words is shaded, indicating that each w i j {\displaystyle w_{ij}} is observable, and the other circles are empty, indicating that the other variables are latent variables. The directed edges between variables indicate dependencies between the variables: for example, each w i j {\displaystyle w_{ij}} depends on z i j {\displaystyle z_{ij}} and β. == Extensions == A number of extensions have been created by various authors to express more information than simply the conditional relationships. However, few of these have become standard. Perhaps the most commonly used extension is to use rectangles in place of circles to indicate non-random variables—either parameters to be computed, hyperparameters given a fixed value (or computed through empirical Bayes), or variables whose values are computed deterministically from a random variable. The diagram on the right shows a few more non-standard conventions used in some articles in Wikipedia (e.g. variational Bayes): Variables that are actually random vectors are indicated by putting the vector size in brackets in the middle of the node. Variables that are actually random matrices are similarly indicated by putting the matrix size in brackets in the middle of the node, with commas separating row size from column size. Categorical variables are indicated by placing their size (without a bracket) in the middle of the node. Categorical variables that act as "switches", and which pick one or more other random variables to condition on from a large set of such variables (e.g. mixture components), are indicated with a special type of arrow containing a squiggly line and ending in a T junction. Boldface is consistently used for vector or matrix nodes (but not categorical nodes). == Software implementation == Plate notation has been implemented in various TeX/LaTeX drawing packages, but also as part of graphical user interfaces to Bayesian statistics programs such as BUGS and BayesiaLab and PyMC.

    Read more →
  • Automatic summarization

    Automatic summarization

    Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content. Artificial intelligence (AI) algorithms are commonly developed and employed to achieve this, specialized for different types of data. Text summarization is usually implemented by natural language processing methods, designed to locate the most informative sentences in a given document. On the other hand, visual content can be summarized using computer vision algorithms. Image summarization is the subject of ongoing research; existing approaches typically attempt to display the most representative images from a given image collection, or generate a video that only includes the most important content from the entire collection. Video summarization algorithms identify and extract from the original video content the most important frames (key-frames), and/or the most important video segments (key-shots), normally in a temporally ordered fashion. Video summaries simply retain a carefully selected subset of the original video frames and, therefore, are not identical to the output of video synopsis algorithms, where new video frames are being synthesized based on the original video content. == Commercial products == In 2022 Google Docs released an automatic summarization feature. == Approaches == There are two general approaches to automatic summarization: extraction and abstraction. === Extraction-based summarization === Here, content is extracted from the original data, but the extracted content is not modified in any way. Examples of extracted content include key-phrases that can be used to "tag" or index a text document, or key sentences (including headings) that collectively comprise an abstract, and representative images or video segments, as stated above. For text, extraction is analogous to the process of skimming, where the summary (if available), headings and subheadings, figures, the first and last paragraphs of a section, and optionally the first and last sentences in a paragraph are read before one chooses to read the entire document in detail. Other examples of extraction that include key sequences of text in terms of clinical relevance (including patient/problem, intervention, and outcome). === Abstractive-based summarization === Abstractive summarization methods generate new text that did not exist in the original text. This has been applied mainly for text. Abstractive methods build an internal semantic representation of the original content (often called a language model), and then use this representation to create a summary that is closer to what a human might express. Abstraction may transform the extracted content by paraphrasing sections of the source document, to condense a text more strongly than extraction. Such transformation, however, is computationally much more challenging than extraction, involving both natural language processing and often a deep understanding of the domain of the original text in cases where the original document relates to a special field of knowledge. "Paraphrasing" is even more difficult to apply to images and videos, which is why most summarization systems are extractive. === Aided summarization === Approaches aimed at higher summarization quality rely on combined software and human effort. In Machine Aided Human Summarization, extractive techniques highlight candidate passages for inclusion (to which the human adds or removes text). In Human Aided Machine Summarization, a human post-processes software output, in the same way that one edits the output of automatic translation by Google Translate. == Applications and systems for summarization == There are broadly two types of extractive summarization tasks depending on what the summarization program focuses on. The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.). The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query. Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs. An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document. Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of articles on the same topic). This problem is called multi-document summarization. A related application is summarizing news articles. Imagine a system, which automatically pulls together news articles on a given topic (from the web), and concisely represents the latest news as a summary. Image collection summarization is another application example of automatic summarization. It consists in selecting a representative set of images from a larger set of images. A summary in this context is useful to show the most representative images of results in an image collection exploration system. Video summarization is a related domain, where the system automatically creates a trailer of a long video. This also has applications in consumer or personal videos, where one might want to skip the boring or repetitive actions. Similarly, in surveillance videos, one would want to extract important and suspicious activity, while ignoring all the boring and redundant frames captured. At a very high level, summarization algorithms try to find subsets of objects (like set of sentences, or a set of images), which cover information of the entire set. This is also called the core-set. These algorithms model notions like diversity, coverage, information and representativeness of the summary. Query based summarization techniques, additionally model for relevance of the summary with the query. Some techniques and algorithms which naturally model summarization problems are TextRank and PageRank, Submodular set function, Determinantal point process, maximal marginal relevance (MMR) etc. === Keyphrase extraction === The task is the following. You are given a piece of text, such as a journal article, and you must produce a list of keywords or key[phrase]s that capture the primary topics discussed in the text. In the case of research articles, many authors provide manually assigned keywords, but most text lacks pre-existing keyphrases. For example, news articles rarely have keyphrases attached, but it would be useful to be able to automatically do so for a number of applications discussed below. Consider the example text from a news article: "The Army Corps of Engineers, rushing to meet President Bush's promise to protect New Orleans by the start of the 2006 hurricane season, installed defective flood-control pumps last year despite warnings from its own expert that the equipment would fail during a storm, according to documents obtained by The Associated Press". A keyphrase extractor might select "Army Corps of Engineers", "President Bush", "New Orleans", and "defective flood-control pumps" as keyphrases. These are pulled directly from the text. In contrast, an abstractive keyphrase system would somehow internalize the content and generate keyphrases that do not appear in the text, but more closely resemble what a human might produce, such as "political negligence" or "inadequate protection from floods". Abstraction requires a deep understanding of the text, which makes it difficult for a computer system. Keyphrases have many applications. They can enable document browsing by providing a short summary, improve information retrieval (if documents have keyphrases assigned, a user could search by keyphrase to produce more reliable hits than a full-text search), and be employed in generating index entries for a large text corpus. Depending on the different literature and the definition of key terms, words or phrases, keyword extraction is a highly related theme. ==== Supervised learning approaches ==== Beginning with the work of Turney, many researchers have approached keyphrase extraction as a supervised machine learning problem. Given a document, we construct an example for each unigram, bigram, and trigram found in the text (though other text units are also possible, as discussed below). We then compute various features describing each example (e.g., does the phrase begin with an upper-case letter?). We assume there are known keyphrases available for a set of training documents. Using the known keyphrases, we can assign positive or negative labels to the examples. Then we learn a classifier that can discriminate between positive and negative examples as a function of the features. Some classifiers make a binary classification for a test example, while others assign a probability of being a keyphrase. For ins

    Read more →
  • Training, validation, and test data sets

    Training, validation, and test data sets

    In machine learning, a common task is the study and construction of algorithms that can learn from and make predictions on data. Such algorithms function by making data-driven predictions or decisions, through building a mathematical model from input data. These input data used to build the model are usually divided into multiple data sets. In particular, three data sets are commonly used in different stages of the creation of the model: training, validation, and testing sets. The model is initially fit on a training data set, which is a set of examples used to fit the parameters (e.g. weights of connections between neurons in artificial neural networks) of the model. The model (e.g. a naive Bayes classifier) is trained on the training data set using a supervised learning method, for example using optimization methods such as gradient descent or stochastic gradient descent. In practice, the training data set often consists of pairs of an input vector (or scalar) and the corresponding output vector (or scalar), where the answer key is commonly denoted as the target (or label). The current model is run with the training data set and produces a result, which is then compared with the target, for each input vector in the training data set. Based on the result of the comparison and the specific learning algorithm being used, the parameters of the model are adjusted. The model fitting can include both variable selection and parameter estimation. Successively, the fitted model is used to predict the responses for the observations in a second data set called the validation data set. The validation data set provides an unbiased evaluation of a model fit on the training data set while tuning the model's hyperparameters (e.g. the number of hidden units—layers and layer widths—in a neural network). Validation data sets can be used for regularization by early stopping (stopping training when the error on the validation data set increases, as this is a sign of over-fitting to the training data set). This simple procedure is complicated in practice by the fact that the validation data set's error may fluctuate during training, producing multiple local minima. This complication has led to the creation of many ad-hoc rules for deciding when over-fitting has truly begun. Finally, the test data set is a data set used to provide an unbiased evaluation of a model fit on the training data set. When the data in the test data set has never been used (for example in cross-validation), the test data set is called a holdout data set. The term "validation set" is sometimes used instead of "test set" in some literature (e.g., if the original data set was partitioned into only two subsets, the test set might be referred to as the validation set). Deciding the sizes and strategies for data set division in training, test and validation sets is very dependent on the problem and data available. == Training data set == A training data set is a data set of examples used during the learning process and is used to fit the parameters (e.g., weights) of, for example, a classifier. For classification tasks, a supervised learning algorithm looks at the training data set to determine, or learn, the optimal combinations of variables that will generate a good predictive model. The goal is to produce a trained (fitted) model that generalizes well to new, unknown data. The fitted model is evaluated using “new” examples from the held-out data sets (validation and test data sets) to estimate the model’s accuracy in classifying new data. To reduce the risk of issues such as over-fitting, the examples in the validation and test data sets should not be used to train the model. Most approaches that search through training data for empirical relationships tend to overfit the data, meaning that they can identify and exploit apparent relationships in the training data that do not hold in general. When a training set is continuously expanded with new data, then this is incremental learning. == Validation data set == A validation data set is a data set of examples used to tune the hyperparameters (i.e. the architecture) of a model. It is sometimes also called the development set or the "dev set". An example of a hyperparameter for artificial neural networks includes the number of hidden units in each layer. It, as well as the testing set (as mentioned below), should follow the same probability distribution as the training data set. In order to avoid overfitting, when any classification parameter needs to be adjusted, it is necessary to have a validation data set in addition to the training and test data sets. For example, if the most suitable classifier for the problem is sought, the training data set is used to train the different candidate classifiers, the validation data set is used to compare their performances and decide which one to take and, finally, the test data set is used to obtain the performance characteristics such as accuracy, sensitivity, specificity, F-measure, and so on. The validation data set functions as a hybrid: it is training data used for testing, but neither as part of the low-level training nor as part of the final testing. The basic process of using a validation data set for model selection (as part of training data set, validation data set, and test data set) is: Since our goal is to find the network having the best performance on new data, the simplest approach to the comparison of different networks is to evaluate the error function using data which is independent of that used for training. Various networks are trained by minimization of an appropriate error function defined with respect to a training data set. The performance of the networks is then compared by evaluating the error function using an independent validation set, and the network having the smallest error with respect to the validation set is selected. This approach is called the hold out method. Since this procedure can itself lead to some overfitting to the validation set, the performance of the selected network should be confirmed by measuring its performance on a third independent set of data called a test set. An application of this process is in early stopping, where the candidate models are successive iterations of the same network, and training stops when the error on the validation set grows, choosing the previous model (the one with minimum error). == Test data set == A test data set is a data set that is independent of the training data set, but that follows the same probability distribution as the training data set. A test set is therefore a set of examples used only to assess the performance (i.e. generalization) of a specified classifier on unseen data. To do this, the model is used to predict classifications of examples in the test set. Those predictions are compared to the examples' true classifications to assess the model's accuracy. If a model fit to the training and validation data set also fits the test data set well, minimal overfitting has taken place (see figure below). A better fitting of the training or validation data sets as opposed to the test data set usually points to overfitting. In the scenario where a data set has a low number of samples, it is usually partitioned into a training set and a validation data set, where the model is trained on the training set and refined using the validation set to improve accuracy, but this approach will lead to overfitting. The holdout method can also be employed, where the test set is used at the end, after training on the training set. Other techniques, such as cross-validation and bootstrapping, are used on small data sets. The bootstrap method generates numerous simulated data sets of the same size by randomly sampling with replacement from the original data, allowing the random data points to serve as test sets for evaluating model performance. Cross-validation splits the data set into multiple folds, with a single sub-fold used as test data; the model is trained on the remaining folds, and all folds are cross-validated (with results averaged and models consolidated) to estimate final model performance. Note that some sources advise against using a single split, as it can lead to overfitting as well as biased model performance estimates. For this reason, data sets are split into three partitions: training, validation and test data sets. The standard machine learning practice is to train on the training set and tune hyperparameters using the validation set, where the validation process selects the model with the lowest validation loss, which is then tested on the test data set (normally held out) to assess the final model. The holdout method for the test set reduces computation by avoiding using the test set after each epoch. The test data set should never be used for validating the training model or fine-tuning hyperparameters, as it provides an accurate and honest evaluation of the model's final performance on unseen dat

    Read more →
  • K-nearest neighbors algorithm

    K-nearest neighbors algorithm

    In statistics, the k-nearest neighbors algorithm (k-NN) is a non-parametric supervised learning method. It was first developed by Evelyn Fix and Joseph Hodges in 1951, and later expanded by Thomas Cover. In classification, a new example is assigned a label based on the labels of its k nearest training examples; in regression, the prediction is computed from the values of those neighbors. Most often, it is used for classification, as a k-NN classifier, the output of which is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor. The k-NN algorithm can also be generalized for regression. In k-NN regression, also known as nearest neighbor smoothing, the output is the property value for the object. This value is the average of the values of k nearest neighbors. If k = 1, then the output is simply assigned to the value of that single nearest neighbor, also known as nearest neighbor interpolation. For both classification and regression, a useful technique can be to assign weights to the contributions of the neighbors, so that nearer neighbors contribute more to the average than distant ones. For example, a common weighting scheme consists of giving each neighbor a weight of 1/d, where d is the distance to the neighbor. The input consists of the k closest training examples in a data set. The neighbors are taken from a set of objects for which the class (for k-NN classification) or the object property value (for k-NN regression) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required. A peculiarity (sometimes even a disadvantage) of the k-NN algorithm is its sensitivity to the local structure of the data. In k-NN classification the function is only approximated locally and all computation is deferred until function evaluation. Since this algorithm relies on distance, if the features represent different physical units or come in vastly different scales, then feature-wise normalizing of the training data can greatly improve its accuracy. == Statistical setting == Suppose we have pairs ( X 1 , Y 1 ) , ( X 2 , Y 2 ) , … , ( X n , Y n ) {\displaystyle (X_{1},Y_{1}),(X_{2},Y_{2}),\dots ,(X_{n},Y_{n})} taking values in R d × { 1 , 2 } {\displaystyle \mathbb {R} ^{d}\times \{1,2\}} , where Y is the class label of X, so that X | Y = r ∼ P r {\displaystyle X|Y=r\sim P_{r}} for r = 1 , 2 {\displaystyle r=1,2} (and probability distributions P r {\displaystyle P_{r}} ). Given some norm ‖ ⋅ ‖ {\displaystyle \|\cdot \|} on R d {\displaystyle \mathbb {R} ^{d}} and a point x ∈ R d {\displaystyle x\in \mathbb {R} ^{d}} , let ( X ( 1 ) , Y ( 1 ) ) , … , ( X ( n ) , Y ( n ) ) {\displaystyle (X_{(1)},Y_{(1)}),\dots ,(X_{(n)},Y_{(n)})} be a reordering of the training data such that ‖ X ( 1 ) − x ‖ ≤ ⋯ ≤ ‖ X ( n ) − x ‖ {\displaystyle \|X_{(1)}-x\|\leq \dots \leq \|X_{(n)}-x\|} . == Algorithm == The training examples are vectors in a multidimensional feature space, each with a class label. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples. In the classification phase, k is a user-defined constant, and an unlabeled vector (a query or test point) is classified by assigning the label which is most frequent among the k training samples nearest to that query point. A commonly used distance metric for continuous variables is Euclidean distance. For discrete variables, such as for text classification, another metric can be used, such as the overlap metric (or Hamming distance). In the context of gene expression microarray data, for example, k-NN has been employed with correlation coefficients, such as Pearson and Spearman, as a metric. Often, the classification accuracy of k-NN can be improved significantly if the distance metric is learned with specialized algorithms such as large margin nearest neighbor or neighborhood components analysis. A drawback of the basic "majority voting" classification occurs when the class distribution is skewed. That is, examples of a more frequent class tend to dominate the prediction of the new example, because they tend to be common among the k nearest neighbors due to their large number. One way to overcome this problem is to weight the classification, taking into account the distance from the test point to each of its k nearest neighbors. The class (or value, in regression problems) of each of the k nearest points is multiplied by a weight proportional to the inverse of the distance from that point to the test point. Another way to overcome skew is by abstraction in data representation. For example, in a self-organizing map (SOM), each node is a representative (a center) of a cluster of similar points, regardless of their density in the original training data. k-NN can then be applied to the SOM. == Parameter selection == The best choice of k depends upon the data; generally, larger values of k reduces effect of the noise on the classification, but make boundaries between classes less distinct. A good k can be selected by various heuristic techniques (see hyperparameter optimization). The special case where the class is predicted to be the class of the closest training sample (i.e. when k = 1) is called the nearest neighbor algorithm. The accuracy of the k-NN algorithm can be severely degraded by the presence of noisy or irrelevant features, or if the feature scales are not consistent with their importance. Much research effort has been put into selecting or scaling features to improve classification. A particularly popular approach is the use of evolutionary algorithms to optimize feature scaling. Another popular approach is to scale features by the mutual information of the training data with the training classes. In binary (two class) classification problems, it is helpful to choose k to be an odd number as this avoids tied votes. One popular way of choosing the empirically optimal k in this setting is via bootstrap method. == The 1-nearest neighbor classifier == The most intuitive nearest neighbour type classifier is the one nearest neighbour classifier that assigns a point x to the class of its closest neighbour in the feature space, that is C n 1 n n ( x ) = Y ( 1 ) {\displaystyle C_{n}^{1nn}(x)=Y_{(1)}} . As the size of training data set approaches infinity, the one nearest neighbour classifier guarantees an error rate of no worse than twice the Bayes error rate (the minimum achievable error rate given the distribution of the data). == The weighted nearest neighbour classifier == The k-nearest neighbour classifier can be viewed as assigning the k nearest neighbours a weight 1 / k {\displaystyle 1/k} and all others 0 weight. This can be generalised to weighted nearest neighbour classifiers. That is, where the ith nearest neighbour is assigned a weight w n i {\displaystyle w_{ni}} , with ∑ i = 1 n w n i = 1 {\textstyle \sum _{i=1}^{n}w_{ni}=1} . An analogous result on the strong consistency of weighted nearest neighbour classifiers also holds. Let C n w n n {\displaystyle C_{n}^{wnn}} denote the weighted nearest classifier with weights { w n i } i = 1 n {\displaystyle \{w_{ni}\}_{i=1}^{n}} . Subject to regularity conditions, which in asymptotic theory are conditional variables which require assumptions to differentiate among parameters with some criteria. On the class distributions the excess risk has the following asymptotic expansion R R ( C n w n n ) − R R ( C Bayes ) = ( B 1 s n 2 + B 2 t n 2 ) { 1 + o ( 1 ) } , {\displaystyle {\mathcal {R}}_{\mathcal {R}}(C_{n}^{wnn})-{\mathcal {R}}_{\mathcal {R}}(C^{\text{Bayes}})=\left(B_{1}s_{n}^{2}+B_{2}t_{n}^{2}\right)\{1+o(1)\},} for constants B 1 {\displaystyle B_{1}} and B 2 {\displaystyle B_{2}} where s n 2 = ∑ i = 1 n w n i 2 {\displaystyle s_{n}^{2}=\sum _{i=1}^{n}w_{ni}^{2}} and t n = n − 2 / d ∑ i = 1 n w n i { i 1 + 2 / d − ( i − 1 ) 1 + 2 / d } {\displaystyle t_{n}=n^{-2/d}\sum _{i=1}^{n}w_{ni}\left\{i^{1+2/d}-(i-1)^{1+2/d}\right\}} . The optimal weighting scheme { w n i ∗ } i = 1 n {\displaystyle \{w_{ni}^{}\}_{i=1}^{n}} , that balances the two terms in the display above, is given as follows: set k ∗ = ⌊ B n 4 d + 4 ⌋ {\displaystyle k^{}=\lfloor Bn^{\frac {4}{d+4}}\rfloor } , w n i ∗ = 1 k ∗ [ 1 + d 2 − d 2 k ∗ 2 / d { i 1 + 2 / d − ( i − 1 ) 1 + 2 / d } ] {\displaystyle w_{ni}^{}={\frac {1}{k^{}}}\left[1+{\frac {d}{2}}-{\frac {d}{2{k^{}}^{2/d}}}\{i^{1+2/d}-(i-1)^{1+2/d}\}\right]} for i = 1 , 2 , … , k ∗ {\displaystyle i=1,2,\dots ,k^{}} and w n i ∗ = 0 {\displaystyle w_{ni}^{}=0} for i = k ∗ + 1 , … , n {\displaystyle i=k^{}+1,\dots ,n} . With optimal weights the dominant term in the asymptotic expansion of the excess risk is O ( n − 4 d + 4 ) {\displaystyle {\mathcal {O}}(n^{-{\frac {4}{d+4}}})}

    Read more →