Nearest neighbor search

Nearest neighbor search

Nearest neighbor search (NNS), as a form of proximity search, is the optimization problem of finding the point in a given set that is closest (or most similar) to a given point. Closeness is typically expressed in terms of a dissimilarity function: the less similar the objects, the larger the function values. Formally, the nearest neighbor (NN) search problem is defined as follows: given a set S of points in a space M and a query point q ∈ M {\displaystyle q\in M} , find the closest point in S to q. Donald Knuth in volume 3 of The Art of Computer Programming (1973) called it the post-office problem, referring to an application of assigning to a residence the nearest post office. A direct generalization of this problem is a k-NN search, where we need to find the k closest points. Most commonly M is a metric space and dissimilarity is expressed as a distance metric, which is symmetric and satisfies the triangle inequality. Even more common, M is taken to be the d-dimensional vector space where dissimilarity is measured using the Euclidean distance, Manhattan distance or other distance metric. However, the dissimilarity function can be arbitrary. One example is asymmetric Bregman divergence, for which the triangle inequality does not hold. == Applications == The nearest neighbor search problem arises in numerous fields of application, including: Pattern recognition – in particular for optical character recognition Statistical classification – see k-nearest neighbor algorithm Computer vision – for point cloud registration Computational geometry – see Closest pair of points problem Cryptanalysis – for lattice problem Databases – e.g. content-based image retrieval Coding theory – see maximum likelihood decoding Semantic search Vector databases, where nearest-neighbor lookup over embeddings is used to retrieve semantically similar records Retrieval-augmented generation systems, where nearest-neighbor retrieval over embeddings is used to fetch candidate passages or documents before generation Data compression – see MPEG-2 standard Robotic sensing Recommendation systems, e.g. see Collaborative filtering Internet marketing – see contextual advertising and behavioral targeting DNA sequencing Spell checking – suggesting correct spelling Plagiarism detection Similarity scores for predicting career paths of professional athletes. Cluster analysis – assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense, usually based on Euclidean distance Chemical similarity Sampling-based motion planning == Methods == Various solutions to the NNS problem have been proposed. The quality and usefulness of the algorithms are determined by the time complexity of queries as well as the space complexity of any search data structures that must be maintained. The informal observation usually referred to as the curse of dimensionality states that there is no general-purpose exact solution for NNS in high-dimensional Euclidean space using polynomial preprocessing and polylogarithmic search time. === Exact methods === ==== Linear search ==== The simplest solution to the NNS problem is to compute the distance from the query point to every other point in the database, keeping track of the "best so far". This algorithm, sometimes referred to as the naive approach, has a running time of O(dN), where N is the cardinality of S and d is the dimensionality of S. There are no search data structures to maintain, so the linear search has no space complexity beyond the storage of the database. Naive search can, on average, outperform space partitioning approaches on higher dimensional spaces. The absolute distance is not required for distance comparison, only the relative distance. In geometric coordinate systems the distance calculation can be sped up considerably by omitting the square root calculation from the distance calculation between two coordinates. The distance comparison will still yield identical results. ==== Space partitioning ==== Since the 1970s, the branch and bound methodology has been applied to the problem. In the case of Euclidean space, this approach encompasses spatial index or spatial access methods. Several space-partitioning methods have been developed for solving the NNS problem. Perhaps the simplest is the k-d tree, which iteratively bisects the search space into two regions containing half of the points of the parent region. Queries are performed via traversal of the tree from the root to a leaf by evaluating the query point at each split. Depending on the distance specified in the query, neighboring branches that might contain hits may also need to be evaluated. For constant dimension query time, average complexity is O(log N) in the case of randomly distributed points, worst case complexity is O(kN^(1-1/k)) Alternatively the R-tree data structure was designed to support nearest neighbor search in dynamic context, as it has efficient algorithms for insertions and deletions such as the R tree. R-trees can yield nearest neighbors not only for Euclidean distance, but can also be used with other distances. In the case of general metric space, the branch-and-bound approach is known as the metric tree approach. Particular examples include vp-tree and BK-tree methods. Using a set of points taken from a 3-dimensional space and put into a BSP tree, and given a query point taken from the same space, a possible solution to the problem of finding the nearest point-cloud point to the query point is given in the following description of an algorithm. (Strictly speaking, no such point may exist, because it may not be unique. But in practice, usually we only care about finding any one of the subset of all point-cloud points that exist at the shortest distance to a given query point.) The idea is, for each branching of the tree, guess that the closest point in the cloud resides in the half-space containing the query point. This may not be the case, but it is a good heuristic. After having recursively gone through all the trouble of solving the problem for the guessed half-space, now compare the distance returned by this result with the shortest distance from the query point to the partitioning plane. This latter distance is that between the query point and the closest possible point that could exist in the half-space not searched. If this distance is greater than that returned in the earlier result, then clearly there is no need to search the other half-space. If there is such a need, then you must go through the trouble of solving the problem for the other half space, and then compare its result to the former result, and then return the proper result. The performance of this algorithm is nearer to logarithmic time than linear time when the query point is near the cloud, because as the distance between the query point and the closest point-cloud point nears zero, the algorithm needs only perform a look-up using the query point as a key to get the correct result. === Approximation methods === An approximate nearest neighbor search algorithm is allowed to return points whose distance from the query is at most c {\displaystyle c} times the distance from the query to its nearest points. The appeal of this approach is that, in many cases, an approximate nearest neighbor is almost as good as the exact one. In particular, if the distance measure accurately captures the notion of user quality, then small differences in the distance should not matter. ==== Greedy search in proximity neighborhood graphs ==== Proximity graph methods (such as navigable small world graphs and HNSW) are considered the current state-of-the-art for the approximate nearest neighbors search. The methods are based on greedy traversing in proximity neighborhood graphs G ( V , E ) {\displaystyle G(V,E)} in which every point x i ∈ S {\displaystyle x_{i}\in S} is uniquely associated with vertex v i ∈ V {\displaystyle v_{i}\in V} . The search for the nearest neighbors to a query q in the set S takes the form of searching for the vertex in the graph G ( V , E ) {\displaystyle G(V,E)} . The basic algorithm – greedy search – works as follows: search starts from an enter-point vertex v i ∈ V {\displaystyle v_{i}\in V} by computing the distances from the query q to each vertex of its neighborhood { v j : ( v i , v j ) ∈ E } {\displaystyle \{v_{j}:(v_{i},v_{j})\in E\}} , and then finds a vertex with the minimal distance value. If the distance value between the query and the selected vertex is smaller than the one between the query and the current element, then the algorithm moves to the selected vertex, and it becomes new enter-point. The algorithm stops when it reaches a local minimum: a vertex whose neighborhood does not contain a vertex that is closer to the query than the vertex itself. The idea of proximity neighborhood graphs was exploited in multiple publications, including the seminal paper by Arya and Mount, in the VoroNet syst

Clapper (service)

Clapper is an American short-form video-hosting service headquartered in Dallas, Texas. It was founded in 2020 by Edison Chen as an alternative for TikTok for mature audiences. The app is functionally similar to TikTok and includes tipping and e-commerce features. Following an influx of far-right content in early 2021, Clapper strengthened its moderation practices. It achieved 2 million monthly active users by 2023, and the number of downloads increased after a U.S. bill that would potentially ban TikTok in the country was signed in 2024. == History == With its offices in Dallas, Texas, Clapper was founded in July 2020 by Chinese-American entrepreneur Edison Chen. Chen considered that most online platforms, such as TikTok, were being targeted to young generations, such as Generation Z. He then concepted Clapper as a service with short-form content for mature audiences among Generation X and millennials, while not intending to compete directly with TikTok. Clapper averaged fewer than ten thousand daily active users during 2020, reaching 500 thousand downloads in the next year. Initially without paying for external advertising, the company raised about $3 million during a 2021 seed funding round. In 2023, the app reportedly reached about 300 to 400 thousand daily active users and 2 million monthly active users. The average user was between the ages of 35 and 55. Following the April 2024 signing of the Protecting Americans from Foreign Adversary Controlled Applications Act, which would potentially enact a ban on TikTok in the U.S. in January 2025, Clapper averaged 200 thousand weekly downloads. In 2025, before the day scheduled for the ban (January 19), TikTok users migrated to other apps. As a result, Clapper received 1.4 million new downloads in a week preceding the date. It was listed as the third most-downloaded free app on Apple's App Store on January 14, behind Xiaohongshu and Lemon8, and the term "TikTok refugee" became a trending term. == Features == Clapper presents similarities with TikTok in its layout, including "Following" and "For You" tabs with videos up to three minutes long that can be liked, commented on or shared. A "Clapback" feature allows users to create responses to videos from others. Users can create livestreams and chat rooms in the app. Users can tip Clapper creators through its Clapper Fam monetization feature, in place of in-app advertisements. The Clapper Shop allows for e-commerce between users. The service had distributed $10 million to its users in total by 2023, according to Clapper CEO Chen. == Content == Clapper includes a policy requiring users to be at least 17 years of age, although Clapper CEO Chen described that "there is no adult content" on the platform. Lindsay Dodgson of Business Insider described the content as generally outdated and "reminiscent of 'getting owned' compilations of the earlier internet." The Washington Post's Tatum Hunter characterized Clapper as including sexual or engagement baiting content more prevalently than TikTok. === Moderation === Clapper's team, which had fifteen employees in early 2021, initially stated it would not moderate content as strictly as TikTok and would mostly rely on user reports. Following that year's January 6 United States Capitol attack, far-right conservative videos promoting QAnon and anti-vaccine conspiracy theories appeared on Clapper's "For You" page to a substantial degree for weeks. The videos were made in protest against decisions by platforms, particularly TikTok, to ban such content. Clapper's team stated in January 10 that its rules prohibiting incitements to violence would be strictly enforced. By February, videos and accounts promoting the conspiracy theories had been removed, and QAnon-related content was banned permanently. Clapper's team hired more content auditors and implemented moderation by artificial intelligence for further community guideline violations.

Count sketch

Count sketch is a type of dimensionality reduction that is particularly efficient in statistics, machine learning and algorithms. It was invented by Moses Charikar, Kevin Chen and Martin Farach-Colton in an effort to speed up the AMS Sketch by Alon, Matias and Szegedy for approximating the frequency moments of streams (these calculations require counting of the number of occurrences for the distinct elements of the stream). The sketch is nearly identical to the Feature hashing algorithm by John Moody, but differs in its use of hash functions with low dependence, which makes it more practical. In order to still have a high probability of success, the median trick is used to aggregate multiple count sketches, rather than the mean. These properties allow use for explicit kernel methods, bilinear pooling in neural networks and is a cornerstone in many numerical linear algebra algorithms. == Intuitive explanation == The inventors of this data structure offer the following iterative explanation of its operation: at the simplest level, the output of a single hash function s mapping stream elements q into {+1, -1} is feeding a single up/down counter C. After a single pass over the data, the frequency n ( q ) {\displaystyle n(q)} of a stream element q can be approximated, although extremely poorly, by the expected value E [ C ⋅ s ( q ) ] {\displaystyle {\mathbf {E}}[C\cdot s(q)]} ; a straightforward way to improve the variance of the previous estimate is to use an array of different hash functions s i {\displaystyle s_{i}} , each connected to its own counter C i {\displaystyle C_{i}} . For each i, the E [ C i ⋅ s i ( q ) ] = n ( q ) {\displaystyle {\mathbf {E}}[C_{i}\cdot s_{i}(q)]=n(q)} still holds, so averaging across the i range will tighten the approximation; the previous construct still has a major deficiency: if a lower-frequency-but-still-important output element a exhibits a hash collision with a high-frequency element even for one of the s i {\displaystyle s_{i}} hashes, n ( a ) {\displaystyle n(a)} estimate can be significantly affected. Avoiding this requires reducing the frequency of collision counter updates between any two distinct elements. This is achieved by replacing each C i {\displaystyle C_{i}} in the previous construct with an array of m counters (making the counter set into a two-dimensional matrix C i , j {\displaystyle C_{i,j}} ), with index j of a particular counter to be incremented/decremented selected via another set of hash functions h i {\displaystyle h_{i}} that map element q into the range {1..m}. Since E [ C i , h i ( q ) ⋅ s i ( q ) ] = n ( q ) {\displaystyle {\mathbf {E}}[C_{i,h_{i}(q)}\cdot s_{i}(q)]=n(q)} , averaging across all values of i will work. == Mathematical definition == 1. For constants w {\displaystyle w} and t {\displaystyle t} (to be defined later) independently choose d = 2 t + 1 {\displaystyle d=2t+1} random hash functions h 1 , … , h d {\displaystyle h_{1},\dots ,h_{d}} and s 1 , … , s d {\displaystyle s_{1},\dots ,s_{d}} such that h i : [ n ] → [ w ] {\displaystyle h_{i}:[n]\to [w]} and s i : [ n ] → { ± 1 } {\displaystyle s_{i}:[n]\to \{\pm 1\}} . It is necessary that the hash families from which h i {\displaystyle h_{i}} and s i {\displaystyle s_{i}} are chosen be pairwise independent. 2. For each item q i {\displaystyle q_{i}} in the stream, add s j ( q i ) {\displaystyle s_{j}(q_{i})} to the h j ( q i ) {\displaystyle h_{j}(q_{i})} th bucket of the j {\displaystyle j} th hash. At the end of this process, one has w d {\displaystyle wd} sums ( C i j ) {\displaystyle (C_{ij})} where C i , j = ∑ h i ( k ) = j s i ( k ) . {\displaystyle C_{i,j}=\sum _{h_{i}(k)=j}s_{i}(k).} To estimate the count of q {\displaystyle q} s one computes the following value: r q = median i = 1 d s i ( q ) ⋅ C i , h i ( q ) . {\displaystyle r_{q}={\text{median}}_{i=1}^{d}\,s_{i}(q)\cdot C_{i,h_{i}(q)}.} The values s i ( q ) ⋅ C i , h i ( q ) {\displaystyle s_{i}(q)\cdot C_{i,h_{i}(q)}} are unbiased estimates of how many times q {\displaystyle q} has appeared in the stream. The estimate r q {\displaystyle r_{q}} has variance O ( m i n { m 1 2 / w 2 , m 2 2 / w } ) {\displaystyle O(\mathrm {min} \{m_{1}^{2}/w^{2},m_{2}^{2}/w\})} , where m 1 {\displaystyle m_{1}} is the length of the stream and m 2 2 {\displaystyle m_{2}^{2}} is ∑ q ( ∑ i [ q i = q ] ) 2 {\displaystyle \sum _{q}(\sum _{i}[q_{i}=q])^{2}} . Furthermore, r q {\displaystyle r_{q}} is guaranteed to never be more than 2 m 2 / w {\displaystyle 2m_{2}/{\sqrt {w}}} off from the true value, with probability 1 − e − O ( t ) {\displaystyle 1-e^{-O(t)}} . === Vector formulation === Alternatively Count-Sketch can be seen as a linear mapping with a non-linear reconstruction function. Let M ( i ∈ [ d ] ) ∈ { − 1 , 0 , 1 } w × n {\displaystyle M^{(i\in [d])}\in \{-1,0,1\}^{w\times n}} , be a collection of d = 2 t + 1 {\displaystyle d=2t+1} matrices, defined by M h i ( j ) , j ( i ) = s i ( j ) {\displaystyle M_{h_{i}(j),j}^{(i)}=s_{i}(j)} for j ∈ [ w ] {\displaystyle j\in [w]} and 0 everywhere else. Then a vector v ∈ R n {\displaystyle v\in \mathbb {R} ^{n}} is sketched by C ( i ) = M ( i ) v ∈ R w {\displaystyle C^{(i)}=M^{(i)}v\in \mathbb {R} ^{w}} . To reconstruct v {\displaystyle v} we take v j ∗ = median i C j ( i ) s i ( j ) {\displaystyle v_{j}^{}={\text{median}}_{i}C_{j}^{(i)}s_{i}(j)} . This gives the same guarantees as stated above, if we take m 1 = ‖ v ‖ 1 {\displaystyle m_{1}=\|v\|_{1}} and m 2 = ‖ v ‖ 2 {\displaystyle m_{2}=\|v\|_{2}} . == Relation to Tensor sketch == The count sketch projection of the outer product of two vectors is equivalent to the convolution of two component count sketches. The count sketch computes a vector convolution C ( 1 ) x ∗ C ( 2 ) x T {\displaystyle C^{(1)}x\ast C^{(2)}x^{T}} , where C ( 1 ) {\displaystyle C^{(1)}} and C ( 2 ) {\displaystyle C^{(2)}} are independent count sketch matrices. Pham and Pagh show that this equals C ( x ⊗ x T ) {\displaystyle C(x\otimes x^{T})} – a count sketch C {\displaystyle C} of the outer product of vectors, where ⊗ {\displaystyle \otimes } denotes Kronecker product. The fast Fourier transform can be used to do fast convolution of count sketches. By using the face-splitting product such structures can be computed much faster than normal matrices.

Memtransistor

The memtransistor (a blend word from Memory Transfer Resistor) is an experimental multi-terminal passive electronic component that might be used in the construction of artificial neural networks. It is a combination of the memristor and transistor technology. This technology is different from the 1T-1R approach since the devices are merged into one single entity. Multiple memristors can be embedded with a single transistor, enabling it to more accurately model a neuron with its multiple synaptic connections. A neural network produced from these would provide hardware-based artificial intelligence with a good foundation. == Applications == These types of devices would allow for a synapse model that could realise a learning rule, by which the synaptic efficacy is altered by voltages applied to the terminals of the device. An example of such a learning rule is spike-timing-dependant-plasticty by which the weight of the synapse, in this case the conductivity, could be modulated based on the timing of pre and post synaptic spikes arriving at each terminal. The advantage of this approach over two terminal memristive devices is that read and write protocols have the possibility to occur simultaneously and distinctly.

Santa Fe Trail problem

The Santa Fe Trail problem is a genetic programming exercise in which artificial ants search for food pellets according to a programmed set of instructions. The layout of food pellets in the Santa Fe Trail problem has become a standard for comparing different genetic programming algorithms and solutions. One method for programming and testing algorithms on the Santa Fe Trail problem is by using the NetLogo application. There is at least one case of a student creating a Lego robotic ant to solve the problem.

Teknomo–Fernandez algorithm

The Teknomo–Fernandez algorithm (TF algorithm), is an efficient algorithm for generating the background image of a given video sequence. By assuming that the background image is shown in the majority of the video, the algorithm is able to generate a good background image of a video in O ( R ) {\displaystyle O(R)} -time using only a small number of binary operations and Boolean bit operations, which require a small amount of memory and has built-in operators found in many programming languages such as C, C++, and Java. == History == People tracking from videos usually involves some form of background subtraction to segment foreground from background. Once foreground images are extracted, then desired algorithms (such as those for motion tracking, object tracking, and facial recognition) may be executed using these images. However, background subtraction requires that the background image is already available and unfortunately, this is not always the case. Traditionally, the background image is searched for manually or automatically from the video images when there are no objects. More recently, automatic background generation through object detection, medial filtering, medoid filtering, approximated median filtering, linear predictive filter, non-parametric model, Kalman filter, and adaptive smoothening have been suggested; however, most of these methods have high computational complexity and are resource-intensive. The Teknomo–Fernandez algorithm is also an automatic background generation algorithm. Its advantage, however, is its computational speed of only O ( R ) {\displaystyle O(R)} -time, depending on the resolution R {\displaystyle R} of an image and its accuracy gained within a manageable number of frames. Only at least three frames from a video is needed to produce the background image assuming that for every pixel position, the background occurs in the majority of the videos. Furthermore, it can be performed for both grayscale and colored videos. == Assumptions == The camera is stationary. The light of the environment changes only slowly relative to the motions of the people in the scene. The number of people does not occupy the scene for most of the time at the same place. Generally, however, the algorithm will certainly work whenever the following single important assumption holds: For each pixel position, the majority of the pixel values in the entire video contain the pixel value of the actual background image (at that position).As long as each part of the background is shown in the majority of the video, the entire background image needs not to appear in any of its frames. The algorithm is expected to work accurately. == Background image generation == === Equations === For three frames of image sequence x 1 {\displaystyle x_{1}} , x 2 {\displaystyle x_{2}} , and x 3 {\displaystyle x_{3}} , the background image B {\displaystyle B} is obtained using B = x 3 ( x 1 ⊕ x 2 ) + x 1 x 2 {\displaystyle B=x_{3}(x_{1}\oplus x_{2})+x_{1}x_{2}} where ⊕ {\displaystyle \oplus } denotes the exclusive disjunctive bit operator. The Boolean mode function S {\displaystyle S} of the table occurs when the number of 1 entries is larger than half of the number of images such that S = { 1 , if ∑ i = 1 n x i ≥ ⌈ n 2 + 1 ⌉ , and n ≥ 3 0 , otherwise {\displaystyle S={\begin{cases}1,&{\text{if }}\sum _{i=1}^{n}x_{i}\geq \left\lceil {\frac {n}{2}}+1\right\rceil ,{\text{ and }}n\geq 3\\0,&{\text{otherwise}}\end{cases}}} For three images, the background image B {\displaystyle B} can be taken as the value x ¯ 1 x 2 x 3 + x 1 x ¯ 2 x 3 + x 1 x 2 x ¯ 3 + x 1 x 2 x 3 {\displaystyle {\bar {x}}_{1}x_{2}x_{3}+x_{1}{\bar {x}}_{2}x_{3}+x_{1}x_{2}{\bar {x}}_{3}+x_{1}x_{2}x_{3}} === Background generation algorithm === At the first level, three frames are selected at random from the image sequence to produce a background image by combining them using the first equation. This yields a better background image at the second level. The procedure is repeated until desired level L {\displaystyle L} . == Theoretical accuracy == At level ℓ {\displaystyle \ell } , the probability p ℓ {\displaystyle p_{\ell }} that the modal bit predicted is the actual modal bit is represented by the equation p ℓ = ( p ℓ − 1 ) 3 + 3 ( p ℓ − 1 ) 2 ( 1 − p ℓ − 1 ) {\displaystyle p_{\ell }=(p_{\ell -1})^{3}+3(p_{\ell -1})^{2}(1-p_{\ell -1})} . The table below gives the computed probability values across several levels using some specific initial probabilities. It can be observed that even if the modal bit at the considered position is at a low 60% of the frames, the probability of accurate modal bit determination is already more than 99% at 6 levels. == Space complexity == The space requirement of the Teknomo–Fernandez algorithm is given by the function O ( R F + R 3 L ) {\displaystyle O(RF+R3^{L})} , depending on the resolution R {\displaystyle R} of the image, the number F {\displaystyle F} of frames in the video, and the desired number L {\displaystyle L} of levels. However, the fact that L {\displaystyle L} will probably not exceed 6 reduces the space complexity to O ( R F ) {\displaystyle O(RF)} . == Time complexity == The entire algorithm runs in O ( R ) {\displaystyle O(R)} -time, only depending on the resolution of the image. Computing the modal bit for each bit can be done in O ( 1 ) {\displaystyle O(1)} -time while the computation of the resulting image from the three given images can be done in O ( R ) {\displaystyle O(R)} -time. The number of the images to be processed in L {\displaystyle L} levels is O ( 3 L ) {\displaystyle O(3^{L})} . However, since L ≤ 6 {\displaystyle L\leq 6} , then this is actually O ( 1 ) {\displaystyle O(1)} , thus the algorithm runs in O ( R ) {\displaystyle O(R)} . == Variants == A variant of the Teknomo–Fernandez algorithm that incorporates the Monte-Carlo method named CRF has been developed. Two different configurations of CRF were implemented: CRF9,2 and CRF81,1. Experiments on some colored video sequences showed that the CRF configurations outperform the TF algorithm in terms of accuracy. However, the TF algorithm remains more efficient in terms of processing time. == Applications == Object detection Face detection Face recognition Pedestrian detection Video surveillance Motion capture Human-computer interaction Content-based video coding Traffic monitoring Real-time gesture recognition

Local tangent space alignment

Local tangent space alignment (LTSA) is a method for manifold learning, which can efficiently learn a nonlinear embedding into low-dimensional coordinates from high-dimensional data, and can also reconstruct high-dimensional coordinates from embedding coordinates. It is based on the intuition that when a manifold is correctly unfolded, all of the tangent hyperplanes to the manifold will become aligned. It begins by computing the k-nearest neighbors of every point. It computes the tangent space at every point by computing the d-first principal components in each local neighborhood. It then optimizes to find an embedding that aligns the tangent spaces, but it ignores the label information conveyed by data samples, and thus can not be used for classification directly.