AI Assistant Gemini

AI Assistant Gemini — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Normal distributions transform

    Normal distributions transform

    The normal distributions transform (NDT) is a point cloud registration algorithm introduced by Peter Biber and Wolfgang Straßer in 2003, while working at University of Tübingen. The algorithm registers two point clouds by first associating a piecewise normal distribution to the first point cloud, that gives the probability of sampling a point belonging to the cloud at a given spatial coordinate, and then finding a transform that maps the second point cloud to the first by maximising the likelihood of the second point cloud on such distribution as a function of the transform parameters. Originally introduced for 2D point cloud map matching in simultaneous localization and mapping (SLAM) and relative position tracking, the algorithm was extended to 3D point clouds and has wide applications in computer vision and robotics. NDT is very fast and accurate, making it suitable for application to large scale data, but it is also sensitive to initialisation, requiring a sufficiently accurate initial guess, and for this reason it is typically used in a coarse-to-fine alignment strategy. == Formulation == The NDT function associated to a point cloud is constructed by partitioning the space in regular cells. For each cell, it is possible to define the mean q = 1 n ∑ i x i {\displaystyle \textstyle \mathbf {q} ={\frac {1}{n}}\sum _{i}\mathbf {x_{i}} } and covariance S = 1 n ∑ i ( x i − q ) ( x i − q ) ⊤ {\displaystyle \textstyle \mathbf {S} ={\frac {1}{n}}\sum _{i}\left(\mathbf {x} _{i}-\mathbf {q} \right)\left(\mathbf {x} _{i}-\mathbf {q} \right)^{\top }} of the n {\displaystyle n} points of the cloud x 1 , … , x n {\displaystyle \mathbf {x} _{1},\dots ,\mathbf {x} _{n}} that fall within the cell. The probability density of sampling a point at a given spatial location x {\displaystyle \mathbf {x} } within the cell is then given by the normal distribution e − 1 2 ( x − q ) ⊤ S − 1 ( x − q ) {\displaystyle e^{-{\frac {1}{2}}\left(\mathbf {x} -\mathbf {q} \right)^{\top }\mathbf {S} ^{-1}\left(\mathbf {x} -\mathbf {q} \right)}} . Two point clouds can be mapped by a Euclidean transformation f {\displaystyle f} with rotation matrix R {\displaystyle \mathbf {R} } and translation vector t {\displaystyle \mathbf {t} } f R , t ( x ) = R x + t {\displaystyle f_{\mathbf {R} ,\mathbf {t} }(\mathbf {x} )=\mathbf {R} \mathbf {x} +\mathbf {t} } that maps from the second cloud to the first, parametrised by the rotation angles and translation components. The algorithm registers the two point clouds by optimising the parameters of the transformation that maps the second cloud to the first, with respect to a loss function based on the NDT of the first point cloud, solving the following problem arg ⁡ min R , t { − ∑ i NDT ⁡ ( f R , t ( x i ) ) } {\displaystyle \arg \min _{\mathbf {R} ,\mathbf {t} }\left\{-\sum _{i}\operatorname {NDT} \left(f_{\mathbf {R} ,\mathbf {t} }\left(\mathbf {x_{i}} \right)\right)\right\}} where the loss function represents the negated likelihood, obtained by applying the transformation to all points in the second cloud and summing the value of the NDT at each transformed point f R , t ( x ) {\displaystyle f_{\mathbf {R} ,\mathbf {t} }(\mathbf {x} )} . The loss is piecewise continuous and differentiable, and can be optimised with gradient-based methods (in the original formulation, the authors use Newton's method). In order to reduce the effect of cell discretisation, a technique consists of partitioning the space into multiple overlapping grids, shifted by half cell size along the spatial directions, and computing the likelihood at a given location as the sum of the NDTs induced by each grid.

    Read more →
  • Non-negative matrix factorization

    Non-negative matrix factorization

    Non-negative matrix factorization (NMF or NNMF), also non-negative matrix approximation is a group of algorithms in multivariate analysis and linear algebra where a matrix V is factorized into (usually) two matrices W and H, with the property that all three matrices have no negative elements. This non-negativity makes the resulting matrices easier to inspect. Also, in applications such as processing of audio spectrograms or muscular activity, non-negativity is inherent to the data being considered. Since the problem is not exactly solvable in general, it is commonly approximated numerically. NMF finds applications in such fields as astronomy, computer vision, document clustering, missing data imputation, chemometrics, audio signal processing, recommender systems, and bioinformatics. == History == In chemometrics non-negative matrix factorization has a long history under the name "self modeling curve resolution". In this framework the vectors in the right matrix are continuous curves rather than discrete vectors. Also early work on non-negative matrix factorizations was performed by a Finnish group of researchers in the 1990s under the name positive matrix factorization. It became more widely known as non-negative matrix factorization after Lee and Seung investigated the properties of the algorithm and published some simple and useful algorithms for two types of factorizations. == Background == Let matrix V be the product of the matrices W and H, V = W H . {\displaystyle \mathbf {V} =\mathbf {W} \mathbf {H} \,.} Matrix multiplication can be implemented as computing the column vectors of V as linear combinations of the column vectors in W using coefficients supplied by columns of H. That is, each column of V can be computed as follows: v i = W h i , {\displaystyle \mathbf {v} _{i}=\mathbf {W} \mathbf {h} _{i}\,,} where vi is the i-th column vector of the product matrix V and hi is the i-th column vector of the matrix H. When multiplying matrices, the dimensions of the factor matrices may be significantly lower than those of the product matrix and it is this property that forms the basis of NMF. NMF generates factors with significantly reduced dimensions compared to the original matrix. For example, if V is an m × n matrix, W is an m × p matrix, and H is a p × n matrix then p can be significantly less than both m and n. Here is an example based on a text-mining application: Let the input matrix (the matrix to be factored) be V with 10000 rows and 500 columns where words are in rows and documents are in columns. That is, we have 500 documents indexed by 10000 words. It follows that a column vector v in V represents a document. Assume we ask the algorithm to find 10 features in order to generate a features matrix W with 10000 rows and 10 columns and a coefficients matrix H with 10 rows and 500 columns. The product of W and H is a matrix with 10000 rows and 500 columns, the same shape as the input matrix V and, if the factorization worked, it is a reasonable approximation to the input matrix V. From the treatment of matrix multiplication above it follows that each column in the product matrix WH is a linear combination of the 10 column vectors in the features matrix W with coefficients supplied by the coefficients matrix H. This last point is the basis of NMF because we can consider each original document in our example as being built from a small set of hidden features. NMF generates these features. It is useful to think of each feature (column vector) in the features matrix W as a document archetype comprising a set of words where each word's cell value defines the word's rank in the feature: The higher a word's cell value the higher the word's rank in the feature. A column in the coefficients matrix H represents an original document with a cell value defining the document's rank for a feature. We can now reconstruct a document (column vector) from our input matrix by a linear combination of our features (column vectors in W) where each feature is weighted by the feature's cell value from the document's column in H. == Clustering property == NMF has an inherent clustering property, i.e., it automatically clusters the columns of input data V = ( v 1 , … , v n ) {\displaystyle \mathbf {V} =(v_{1},\dots ,v_{n})} . More specifically, the approximation of V {\displaystyle \mathbf {V} } by V ≃ W H {\displaystyle \mathbf {V} \simeq \mathbf {W} \mathbf {H} } is achieved by finding W {\displaystyle W} and H {\displaystyle H} that minimize the error function (using the Frobenius norm) ‖ V − W H ‖ F , {\displaystyle \left\|V-WH\right\|_{F},} subject to W ≥ 0 , H ≥ 0. {\displaystyle W\geq 0,H\geq 0.} , If we furthermore impose an orthogonality constraint on H {\displaystyle \mathbf {H} } , i.e. H H T = I {\displaystyle \mathbf {H} \mathbf {H} ^{T}=I} , then the above minimization is mathematically equivalent to the minimization of K-means clustering. Furthermore, the computed H {\displaystyle H} gives the cluster membership, i.e., if H k j > H i j {\displaystyle \mathbf {H} _{kj}>\mathbf {H} _{ij}} for all i ≠ k, this suggests that the input data v j {\displaystyle v_{j}} belongs to k {\displaystyle k} -th cluster. The computed W {\displaystyle W} gives the cluster centroids, i.e., the k {\displaystyle k} -th column gives the cluster centroid of k {\displaystyle k} -th cluster. This centroid's representation can be significantly enhanced by convex NMF. When the orthogonality constraint H H T = I {\displaystyle \mathbf {H} \mathbf {H} ^{T}=I} is not explicitly imposed, the orthogonality holds to a large extent, and the clustering property holds too. When the error function to be used is Kullback–Leibler divergence, NMF is identical to the probabilistic latent semantic analysis (PLSA), a popular document clustering method. == Types == === Approximate non-negative matrix factorization === Usually the number of columns of W and the number of rows of H in NMF are selected so the product WH will become an approximation to V. The full decomposition of V then amounts to the two non-negative matrices W and H as well as a residual U, such that: V = WH + U. The elements of the residual matrix can either be negative or positive. When W and H are smaller than V they become easier to store and manipulate. Another reason for factorizing V into smaller matrices W and H, is that if one's goal is to approximately represent the elements of V by significantly less data, then one has to infer some latent structure in the data. === Convex non-negative matrix factorization === In standard NMF, matrix factor W ∈ R+m × k, i.e., W can be anything in that space. Convex NMF restricts the columns of W to convex combinations of the input data vectors ( v 1 , … , v n ) {\displaystyle (v_{1},\dots ,v_{n})} . This greatly improves the quality of data representation of W. Furthermore, the resulting matrix factor H becomes more sparse and orthogonal. === Nonnegative rank factorization === In case the nonnegative rank of V is equal to its actual rank, V = WH is called a nonnegative rank factorization (NRF). The problem of finding the NRF of V, if it exists, is known to be NP-hard. === Different cost functions and regularizations === There are different types of non-negative matrix factorizations. The different types arise from using different cost functions for measuring the divergence between V and WH and possibly by regularization of the W and/or H matrices. Two simple divergence functions studied by Lee and Seung are the squared error (or Frobenius norm) and an extension of the Kullback–Leibler divergence to positive matrices (the original Kullback–Leibler divergence is defined on probability distributions). Each divergence leads to a different NMF algorithm, usually minimizing the divergence using iterative update rules. The factorization problem in the squared error version of NMF may be stated as: Given a matrix V {\displaystyle \mathbf {V} } find nonnegative matrices W and H that minimize the function F ( W , H ) = ‖ V − W H ‖ F 2 {\displaystyle F(\mathbf {W} ,\mathbf {H} )=\left\|\mathbf {V} -\mathbf {WH} \right\|_{F}^{2}} Another type of NMF for images is based on the total variation norm. When L1 regularization (akin to Lasso) is added to NMF with the mean squared error cost function, the resulting problem may be called non-negative sparse coding due to the similarity to the sparse coding problem, although it may also still be referred to as NMF. === Online NMF === Many standard NMF algorithms analyze all the data together; i.e., the whole matrix is available from the start. This may be unsatisfactory in applications where there are too many data to fit into memory or where the data are provided in streaming fashion. One such use is for collaborative filtering in recommendation systems, where there may be many users and many items to recommend, and it would be inefficient to recalculate everything when one user or one item is added to the system. The cost function for o

    Read more →
  • Stress majorization

    Stress majorization

    Stress majorization is an optimization strategy used in multidimensional scaling (MDS) where, for a set of n {\displaystyle n} m {\displaystyle m} -dimensional data items, a configuration X {\displaystyle X} of n {\displaystyle n} points in r {\displaystyle r} ( ≪ m ) {\displaystyle (\ll m)} -dimensional space is sought that minimizes the so-called stress function σ ( X ) {\displaystyle \sigma (X)} . Usually r {\displaystyle r} is 2 {\displaystyle 2} or 3 {\displaystyle 3} , i.e. the ( n × r ) {\displaystyle (n\times r)} matrix X {\displaystyle X} lists points in 2 − {\displaystyle 2-} or 3 − {\displaystyle 3-} dimensional Euclidean space so that the result may be visualised (i.e. an MDS plot). The function σ {\displaystyle \sigma } is a cost or loss function that measures the squared differences between ideal ( m {\displaystyle m} -dimensional) distances and actual distances in r-dimensional space. It is defined as: σ ( X ) = ∑ i < j ≤ n w i j ( d i j ( X ) − δ i j ) 2 {\displaystyle \sigma (X)=\sum _{i Read more →

  • Structured kNN

    Structured kNN

    Structured k-nearest neighbours (SkNN) is a machine learning algorithm that generalizes k-nearest neighbors (k-NN). k-NN supports binary classification, multiclass classification, and regression, whereas SkNN allows training of a classifier for general structured output. For instance, a data sample might be a natural language sentence, and the output could be an annotated parse tree. Training a classifier consists of showing many instances of ground truth sample-output pairs. After training, the SkNN model is able to predict the corresponding output for new, unseen sample instances; that is, given a natural language sentence, the classifier can produce the most likely parse tree. == Training == As a training set, SkNN accepts sequences of elements with class labels. The type of element does not matter; the only requirement is a defined metric function that gives a distance between each pair of elements of a set. SkNN is based on idea of creating a graph, with each node representing a class label. There is an edge between a pair of nodes if there is a sequence of two elements in the training set with corresponding classes. The first step of SkNN training is the construction of such a graph from training sequences. There are two special nodes in the graph corresponding to sentence beginnings and ends: if a sequence starts with class C, the edge between node START and node C should be created. Like regular k-NN, the second part of SkNN training consists of storing the elements of a training sequence in a certain way. Each element of the training sequences is stored in the node related to the class of the previous element in the sequence. Every first element is stored in the START node. == Inference == Labelling input sequences by SkNN consists of finding the sequence of transitions in the graph, starting from node START. Each transition corresponds to a single element of the input sequence. As a result, the label of each element is determined as the target node label of the transition. The cost of the path is defined as the sum of all transitions, with the cost of transition from node A to node B being the distance from the current input sequence element to the nearest element of class B, stored in node A. Determining an optimal path may be performed using a modified Viterbi algorithm (where the sum of the distances is minimized, unlike the original algorithm which maximizes the product of probabilities).

    Read more →
  • Trigger list

    Trigger list

    Trigger list in its most general meaning refers to a list whose items are used to initiate ("trigger") certain actions. == United States: Private financial information == In the United States, when a person applies for a mortgage loan, the lender makes a credit inquiry about the potential borrower from the national credit bureaus, Equifax, Experian and TransUnion. Unless the borrower is opted out, the credit bureaus put the applicants onto a "trigger list" of "leads" about persons who are interested in new loans. These lists are sold to numerous lenders all over the United States, and soon after the application the applicant starts receiving offers from all parts of the country. The trigger lists contain a significant amount of personal financial information. Among the buyers of trigger lists are "lead generators" which resell filtered information to borrowers, e.g., of people who live in a certain area and have a certain credit score. While the Federal Trade Commission considers the market of "trigger lists" to be a legal business, many people and organizations (such as the National Association of Mortgage Brokers) consider this a serious breach of privacy and lobby for putting this practice under regulatory controls. As of now, American consumers may opt-out from "trigger lists" by calling 1-888-5-OPTOUT (1-888-567-8688). == Nuclear non-proliferation == The Zangger Committee and the Nuclear Suppliers Group maintain lists of items that may contribute to nuclear proliferation; The nuclear non-proliferation treaty forbids its members to export such items to non-treaty members. these items are said to trigger the countries' responsibilities under the NPT, hence the name.

    Read more →
  • Jackknife variance estimates for random forest

    Jackknife variance estimates for random forest

    In statistics, jackknife variance estimates for random forest are a way to estimate the variance in random forest models, in order to eliminate the bootstrap effects. == Jackknife variance estimates == The sampling variance of bagged learners is: V ( x ) = V a r [ θ ^ ∞ ( x ) ] {\displaystyle V(x)=Var[{\hat {\theta }}^{\infty }(x)]} Jackknife estimates can be considered to eliminate the bootstrap effects. The jackknife variance estimator is defined as: V ^ j = n − 1 n ∑ i = 1 n ( θ ^ ( − i ) − θ ¯ ) 2 {\displaystyle {\hat {V}}_{j}={\frac {n-1}{n}}\sum _{i=1}^{n}({\hat {\theta }}_{(-i)}-{\overline {\theta }})^{2}} In some classification problems, when random forest is used to fit models, jackknife estimated variance is defined as: V ^ j = n − 1 n ∑ i = 1 n ( t ¯ ( − i ) ⋆ ( x ) − t ¯ ⋆ ( x ) ) 2 {\displaystyle {\hat {V}}_{j}={\frac {n-1}{n}}\sum _{i=1}^{n}({\overline {t}}_{(-i)}^{\star }(x)-{\overline {t}}^{\star }(x))^{2}} Here, t ⋆ {\displaystyle t^{\star }} denotes a decision tree after training, t ( − i ) ⋆ {\displaystyle t_{(-i)}^{\star }} denotes the result based on samples without i t h {\displaystyle ith} observation. == Examples == E-mail spam problem is a common classification problem, in this problem, 57 features are used to classify spam e-mail and non-spam e-mail. Applying IJ-U variance formula to evaluate the accuracy of models with m=15,19 and 57. The results shows in paper( Confidence Intervals for Random Forests: The jackknife and the Infinitesimal Jackknife ) that m = 57 random forest appears to be quite unstable, while predictions made by m=5 random forest appear to be quite stable, this results is corresponding to the evaluation made by error percentage, in which the accuracy of model with m=5 is high and m=57 is low. Here, accuracy is measured by error rate, which is defined as: E r r o r R a t e = 1 N ∑ i = 1 N ∑ j = 1 M y i j , {\displaystyle ErrorRate={\frac {1}{N}}\sum _{i=1}^{N}\sum _{j=1}^{M}y_{ij},} Here N is also the number of samples, M is the number of classes, y i j {\displaystyle y_{ij}} is the indicator function which equals 1 when i t h {\displaystyle ith} observation is in class j, equals 0 when in other classes. No probability is considered here. There is another method which is similar to error rate to measure accuracy: l o g l o s s = 1 N ∑ i = 1 N ∑ j = 1 M y i j l o g ( p i j ) {\displaystyle logloss={\frac {1}{N}}\sum _{i=1}^{N}\sum _{j=1}^{M}y_{ij}log(p_{ij})} Here N is the number of samples, M is the number of classes, y i j {\displaystyle y_{ij}} is the indicator function which equals 1 when i t h {\displaystyle ith} observation is in class j, equals 0 when in other classes. p i j {\displaystyle p_{ij}} is the predicted probability of i t h {\displaystyle ith} observation in class j {\displaystyle j} .This method is used in Kaggle These two methods are very similar. == Modification for bias == When using Monte Carlo MSEs for estimating V I J ∞ {\displaystyle V_{IJ}^{\infty }} and V J ∞ {\displaystyle V_{J}^{\infty }} , a problem about the Monte Carlo bias should be considered, especially when n is large, the bias is getting large: E [ V ^ I J B ] − V ^ I J ∞ ≈ n ∑ b = 1 B ( t b ⋆ − t ¯ ⋆ ) 2 B {\displaystyle E[{\hat {V}}_{IJ}^{B}]-{\hat {V}}_{IJ}^{\infty }\approx {\frac {n\sum _{b=1}^{B}(t_{b}^{\star }-{\bar {t}}^{\star })^{2}}{B}}} To eliminate this influence, bias-corrected modifications are suggested: V ^ I J − U B = V ^ I J B − n ∑ b = 1 B ( t b ⋆ − t ¯ ⋆ ) 2 B {\displaystyle {\hat {V}}_{IJ-U}^{B}={\hat {V}}_{IJ}^{B}-{\frac {n\sum _{b=1}^{B}(t_{b}^{\star }-{\bar {t}}^{\star })^{2}}{B}}} V ^ J − U B = V ^ J B − ( e − 1 ) n ∑ b = 1 B ( t b ⋆ − t ¯ ⋆ ) 2 B {\displaystyle {\hat {V}}_{J-U}^{B}={\hat {V}}_{J}^{B}-(e-1){\frac {n\sum _{b=1}^{B}(t_{b}^{\star }-{\bar {t}}^{\star })^{2}}{B}}}

    Read more →
  • Latent space

    Latent space

    A latent space, also known as a latent feature space or embedding space, is an embedding of a set of items within a manifold in which items resembling each other are positioned closer to one another. Position within the latent space can be viewed as being defined by a set of latent variables that emerge from the resemblances between the objects. In most cases, the dimensionality of the latent space is chosen to be lower than the dimensionality of the feature space from which the data points are drawn, making the construction of a latent space an example of dimensionality reduction, which can also be viewed as a form of data compression. Latent spaces are usually fit via machine learning, and they can then be used as feature spaces in machine learning models, including classifiers and other supervised predictors. The interpretation of latent spaces in machine learning models is an ongoing area of research, but achieving clear interpretations remains challenging. The black-box nature of these models often makes the latent space unintuitive, while its high-dimensional, complex, and nonlinear characteristics further complicate the task of understanding it. Analysis of the latent space geometry of diffusion models reveals a fractal structure of phase transitions in the latent space, characterized by abrupt changes in the Fisher information metric. Some visualization techniques have been developed to connect the latent space to the visual world, but there is often not a direct connection between the latent space interpretation and the model itself. Such techniques include t-distributed stochastic neighbor embedding (t-SNE), where the latent space is mapped to two dimensions for visualization. Latent space distances lack physical units, so the interpretation of these distances may depend on the application. == Embedding models == Several embedding models have been developed to perform this transformation to create latent space embeddings given a set of data items and a similarity function. These models learn the embeddings by leveraging statistical techniques and machine learning algorithms. Here are some commonly used embedding models: Word2Vec: Word2Vec is a popular embedding model used in natural language processing (NLP). It learns word embeddings by training a neural network on a large corpus of text. Word2Vec captures semantic and syntactic relationships between words, allowing for meaningful computations like word analogies. GloVe: GloVe (Global Vectors for Word Representation) is another widely used embedding model for NLP. It combines global statistical information from a corpus with local context information to learn word embeddings. GloVe embeddings are known for capturing both semantic and relational similarities between words. Siamese Networks: Siamese networks are a type of neural network architecture commonly used for similarity-based embedding. They consist of two identical subnetworks that process two input samples and produce their respective embeddings. Siamese networks are often used for tasks like image similarity, recommendation systems, and face recognition. Variational Autoencoders (VAEs): VAEs are generative models that simultaneously learn to encode and decode data. The latent space in VAEs acts as an embedding space. By training VAEs on high-dimensional data, such as images or audio, the model learns to encode the data into a compact latent representation. VAEs are known for their ability to generate new data samples from the learned latent space. == Multimodality == Multimodality refers to the integration and analysis of multiple modes or types of data within a single model or framework. Embedding multimodal data involves capturing relationships and interactions between different data types, such as images, text, audio, and structured data. Multimodal embedding models aim to learn joint representations that fuse information from multiple modalities, allowing for cross-modal analysis and tasks. These models enable applications like image captioning, visual question answering, and multimodal sentiment analysis. To embed multimodal data, specialized architectures such as deep multimodal networks or multimodal transformers are employed. These architectures combine different types of neural network modules to process and integrate information from various modalities. The resulting embeddings capture the complex relationships between different data types, facilitating multimodal analysis and understanding. == Applications == Embedding latent space and multimodal embedding models have found numerous applications across various domains: Information retrieval: Embedding techniques enable efficient similarity search and recommendation systems by representing data points in a compact space. Natural language processing: Word embeddings have revolutionized NLP tasks like sentiment analysis, machine translation, and document classification. Computer vision: Image and video embeddings enable tasks like object recognition, image retrieval, and video summarization. Recommendation systems: Embeddings help capture user preferences and item characteristics, enabling personalized recommendations. Healthcare: Embedding techniques have been applied to electronic health records, medical imaging, and genomic data for disease prediction, diagnosis, and treatment. Social systems: Embedding techniques can be used to learn latent representations of social systems such as internal migration systems, academic citation networks, and world trade networks.

    Read more →
  • Error tolerance (PAC learning)

    Error tolerance (PAC learning)

    In PAC learning, error tolerance refers to the ability of an algorithm to learn when the examples received have been corrupted in some way. In fact, this is a very common and important issue since in many applications it is not possible to access noise-free data. Noise can interfere with the learning process at different levels: the algorithm may receive data that have been occasionally mislabeled, or the inputs may have some false information, or the classification of the examples may have been maliciously adulterated. == Notation and the Valiant learning model == In the following, let X {\displaystyle X} be our n {\displaystyle n} -dimensional input space. Let H {\displaystyle {\mathcal {H}}} be a class of functions that we wish to use in order to learn a { 0 , 1 } {\displaystyle \{0,1\}} -valued target function f {\displaystyle f} defined over X {\displaystyle X} . Let D {\displaystyle {\mathcal {D}}} be the distribution of the inputs over X {\displaystyle X} . The goal of a learning algorithm A {\displaystyle {\mathcal {A}}} is to choose the best function h ∈ H {\displaystyle h\in {\mathcal {H}}} such that it minimizes e r r o r ( h ) = P x ∼ D ( h ( x ) ≠ f ( x ) ) {\displaystyle error(h)=P_{x\sim {\mathcal {D}}}(h(x)\neq f(x))} . Let us suppose we have a function s i z e ( f ) {\displaystyle size(f)} that can measure the complexity of f {\displaystyle f} . Let Oracle ( x ) {\displaystyle {\text{Oracle}}(x)} be an oracle that, whenever called, returns an example x {\displaystyle x} and its correct label f ( x ) {\displaystyle f(x)} . When no noise corrupts the data, we can define learning in the Valiant setting: Definition: We say that f {\displaystyle f} is efficiently learnable using H {\displaystyle {\mathcal {H}}} in the Valiant setting if there exists a learning algorithm A {\displaystyle {\mathcal {A}}} that has access to Oracle ( x ) {\displaystyle {\text{Oracle}}(x)} and a polynomial p ( ⋅ , ⋅ , ⋅ , ⋅ ) {\displaystyle p(\cdot ,\cdot ,\cdot ,\cdot )} such that for any 0 < ε ≤ 1 {\displaystyle 0<\varepsilon \leq 1} and 0 < δ ≤ 1 {\displaystyle 0<\delta \leq 1} it outputs, in a number of calls to the oracle bounded by p ( 1 ε , 1 δ , n , size ( f ) ) {\displaystyle p\left({\frac {1}{\varepsilon }},{\frac {1}{\delta }},n,{\text{size}}(f)\right)} , a function h ∈ H {\displaystyle h\in {\mathcal {H}}} that satisfies with probability at least 1 − δ {\displaystyle 1-\delta } the condition error ( h ) ≤ ε {\displaystyle {\text{error}}(h)\leq \varepsilon } . In the following we will define learnability of f {\displaystyle f} when data have suffered some modification. == Classification noise == In the classification noise model a noise rate 0 ≤ η < 1 2 {\displaystyle 0\leq \eta <{\frac {1}{2}}} is introduced. Then, instead of Oracle ( x ) {\displaystyle {\text{Oracle}}(x)} that returns always the correct label of example x {\displaystyle x} , algorithm A {\displaystyle {\mathcal {A}}} can only call a faulty oracle Oracle ( x , η ) {\displaystyle {\text{Oracle}}(x,\eta )} that will flip the label of x {\displaystyle x} with probability η {\displaystyle \eta } . As in the Valiant case, the goal of a learning algorithm A {\displaystyle {\mathcal {A}}} is to choose the best function h ∈ H {\displaystyle h\in {\mathcal {H}}} such that it minimizes e r r o r ( h ) = P x ∼ D ( h ( x ) ≠ f ( x ) ) {\displaystyle error(h)=P_{x\sim {\mathcal {D}}}(h(x)\neq f(x))} . In applications it is difficult to have access to the real value of η {\displaystyle \eta } , but we assume we have access to its upperbound η B {\displaystyle \eta _{B}} . Note that if we allow the noise rate to be 1 / 2 {\displaystyle 1/2} , then learning becomes impossible in any amount of computation time, because every label conveys no information about the target function. Definition: We say that f {\displaystyle f} is efficiently learnable using H {\displaystyle {\mathcal {H}}} in the classification noise model if there exists a learning algorithm A {\displaystyle {\mathcal {A}}} that has access to Oracle ( x , η ) {\displaystyle {\text{Oracle}}(x,\eta )} and a polynomial p ( ⋅ , ⋅ , ⋅ , ⋅ ) {\displaystyle p(\cdot ,\cdot ,\cdot ,\cdot )} such that for any 0 ≤ η ≤ 1 2 {\displaystyle 0\leq \eta \leq {\frac {1}{2}}} , 0 ≤ ε ≤ 1 {\displaystyle 0\leq \varepsilon \leq 1} and 0 ≤ δ ≤ 1 {\displaystyle 0\leq \delta \leq 1} it outputs, in a number of calls to the oracle bounded by p ( 1 1 − 2 η B , 1 ε , 1 δ , n , s i z e ( f ) ) {\displaystyle p\left({\frac {1}{1-2\eta _{B}}},{\frac {1}{\varepsilon }},{\frac {1}{\delta }},n,size(f)\right)} , a function h ∈ H {\displaystyle h\in {\mathcal {H}}} that satisfies with probability at least 1 − δ {\displaystyle 1-\delta } the condition e r r o r ( h ) ≤ ε {\displaystyle error(h)\leq \varepsilon } . == Statistical query learning == Statistical Query Learning is a kind of active learning problem in which the learning algorithm A {\displaystyle {\mathcal {A}}} can decide if to request information about the likelihood P f ( x ) {\displaystyle P_{f(x)}} that a function f {\displaystyle f} correctly labels example x {\displaystyle x} , and receives an answer accurate within a tolerance α {\displaystyle \alpha } . Formally, whenever the learning algorithm A {\displaystyle {\mathcal {A}}} calls the oracle Oracle ( x , α ) {\displaystyle {\text{Oracle}}(x,\alpha )} , it receives as feedback probability Q f ( x ) {\displaystyle Q_{f(x)}} , such that Q f ( x ) − α ≤ P f ( x ) ≤ Q f ( x ) + α {\displaystyle Q_{f(x)}-\alpha \leq P_{f(x)}\leq Q_{f(x)}+\alpha } . Definition: We say that f {\displaystyle f} is efficiently learnable using H {\displaystyle {\mathcal {H}}} in the statistical query learning model if there exists a learning algorithm A {\displaystyle {\mathcal {A}}} that has access to Oracle ( x , α ) {\displaystyle {\text{Oracle}}(x,\alpha )} and polynomials p ( ⋅ , ⋅ , ⋅ ) {\displaystyle p(\cdot ,\cdot ,\cdot )} , q ( ⋅ , ⋅ , ⋅ ) {\displaystyle q(\cdot ,\cdot ,\cdot )} , and r ( ⋅ , ⋅ , ⋅ ) {\displaystyle r(\cdot ,\cdot ,\cdot )} such that for any 0 < ε ≤ 1 {\displaystyle 0<\varepsilon \leq 1} the following hold: Oracle ( x , α ) {\displaystyle {\text{Oracle}}(x,\alpha )} can evaluate P f ( x ) {\displaystyle P_{f(x)}} in time q ( 1 ε , n , s i z e ( f ) ) {\displaystyle q\left({\frac {1}{\varepsilon }},n,size(f)\right)} ; 1 α {\displaystyle {\frac {1}{\alpha }}} is bounded by r ( 1 ε , n , s i z e ( f ) ) {\displaystyle r\left({\frac {1}{\varepsilon }},n,size(f)\right)} A {\displaystyle {\mathcal {A}}} outputs a model h {\displaystyle h} such that e r r ( h ) < ε {\displaystyle err(h)<\varepsilon } , in a number of calls to the oracle bounded by p ( 1 ε , n , s i z e ( f ) ) {\displaystyle p\left({\frac {1}{\varepsilon }},n,size(f)\right)} . Note that the confidence parameter δ {\displaystyle \delta } does not appear in the definition of learning. This is because the main purpose of δ {\displaystyle \delta } is to allow the learning algorithm a small probability of failure due to an unrepresentative sample. Since now Oracle ( x , α ) {\displaystyle {\text{Oracle}}(x,\alpha )} always guarantees to meet the approximation criterion Q f ( x ) − α ≤ P f ( x ) ≤ Q f ( x ) + α {\displaystyle Q_{f(x)}-\alpha \leq P_{f(x)}\leq Q_{f(x)}+\alpha } , the failure probability is no longer needed. The statistical query model is strictly weaker than the PAC model: any efficiently SQ-learnable class is efficiently PAC learnable in the presence of classification noise, but there exist efficient PAC-learnable problems such as parity that are not efficiently SQ-learnable. == Malicious classification == In the malicious classification model an adversary generates errors to foil the learning algorithm. This setting describes situations of error burst, which may occur when for a limited time transmission equipment malfunctions repeatedly. Formally, algorithm A {\displaystyle {\mathcal {A}}} calls an oracle Oracle ( x , β ) {\displaystyle {\text{Oracle}}(x,\beta )} that returns a correctly labeled example x {\displaystyle x} drawn, as usual, from distribution D {\displaystyle {\mathcal {D}}} over the input space with probability 1 − β {\displaystyle 1-\beta } , but it returns with probability β {\displaystyle \beta } an example drawn from a distribution that is not related to D {\displaystyle {\mathcal {D}}} . Moreover, this maliciously chosen example may strategically selected by an adversary who has knowledge of f {\displaystyle f} , β {\displaystyle \beta } , D {\displaystyle {\mathcal {D}}} , or the current progress of the learning algorithm. Definition: Given a bound β B < 1 2 {\displaystyle \beta _{B}<{\frac {1}{2}}} for 0 ≤ β < 1 2 {\displaystyle 0\leq \beta <{\frac {1}{2}}} , we say that f {\displaystyle f} is efficiently learnable using H {\displaystyle {\mathcal {H}}} in the malicious classification model, if there exist a learning algorithm A {\displaystyle {\mathcal {A}}} that has access to Oracle ( x , β ) {\displaystyle {\text{Oracle}}(x,\beta )}

    Read more →
  • Coupled pattern learner

    Coupled pattern learner

    Coupled Pattern Learner (CPL) is a machine learning algorithm which couples the semi-supervised learning of categories and relations to forestall the problem of semantic drift associated with boot-strap learning methods. == Coupled Pattern Learner == Semi-supervised learning approaches using a small number of labeled examples with many unlabeled examples are usually unreliable as they produce an internally consistent, but incorrect set of extractions. CPL solves this problem by simultaneously learning classifiers for many different categories and relations in the presence of an ontology defining constraints that couple the training of these classifiers. It was introduced by Andrew Carlson, Justin Betteridge, Estevam R. Hruschka Jr. and Tom M. Mitchell in 2009. == CPL overview == CPL is an approach to semi-supervised learning that yields more accurate results by coupling the training of many information extractors. Basic idea behind CPL is that semi-supervised training of a single type of extractor such as ‘coach’ is much more difficult than simultaneously training many extractors that cover a variety of inter-related entity and relation types. Using prior knowledge about the relationships between these different entities and relations CPL makes unlabeled data as a useful constraint during training. For e.g., ‘coach(x)’ implies ‘person(x)’ and ‘not sport(x)’. == CPL description == === Coupling of predicates === CPL primarily relies on the notion of coupling the learning of multiple functions so as to constrain the semi-supervised learning problem. CPL constrains the learned function in two ways. Sharing among same-arity predicates according to logical relations Relation argument type-checking === Sharing among same-arity predicates === Each predicate P in the ontology has a list of other same-arity predicates with which P is mutually exclusive. If A is mutually exclusive with predicate B, A’s positive instances and patterns become negative instances and negative patterns for B. For example, if ‘city’, having an instance ‘Boston’ and a pattern ‘mayor of arg1’, is mutually exclusive with ‘scientist’, then ‘Boston’ and ‘mayor of arg1’ will become a negative instance and a negative pattern respectively for ‘scientist.’ Further, Some categories are declared to be a subset of another category. For e.g., ‘athlete’ is a subset of ‘person’. === Relation argument type-checking === This is a type checking information used to couple the learning of relations and categories. For example, the arguments of the ‘ceoOf’ relation are declared to be of the categories ‘person’ and ‘company’. CPL does not promote a pair of noun phrases as an instance of a relation unless the two noun phrases are classified as belonging to the correct argument types. === Algorithm description === Following is a quick summary of the CPL algorithm. Input: An ontology O, and a text corpus C Output: Trusted instances/patterns for each predicate for i=1,2,...,∞ do foreach predicate p in O do EXTRACT candidate instances/contextual patterns using recently promoted patterns/instances; FILTER candidates that violate coupling; RANK candidate instances/patterns; PROMOTE top candidates; end end ==== Inputs ==== A large corpus of Part-Of-Speech tagged sentences and an initial ontology with predefined categories, relations, mutually exclusive relationships between same-arity predicates, subset relationships between some categories, seed instances for all predicates, and seed patterns for the categories. ==== Candidate extraction ==== CPL finds new candidate instances by using newly promoted patterns to extract the noun phrases that co-occur with those patterns in the text corpus. CPL extracts, Category Instances Category Patterns Relation Instances Relation Patterns ==== Candidate filtering ==== Candidate instances and patterns are filtered to maintain high precision, and to avoid extremely specific patterns. An instance is only considered for assessment if it co-occurs with at least two promoted patterns in the text corpus, and if its co-occurrence count with all promoted patterns is at least three times greater than its co-occurrence count with negative patterns. ==== Candidate ranking ==== CPL ranks candidate instances using the number of promoted patterns that they co-occur with so that candidates that occur with more patterns are ranked higher. Patterns are ranked using an estimate of the precision of each pattern. ==== Candidate promotion ==== CPL ranks the candidates according to their assessment scores and promotes at most 100 instances and 5 patterns for each predicate. Instances and patterns are only promoted if they co-occur with at least two promoted patterns or instances, respectively. == Meta-Bootstrap Learner == Meta-Bootstrap Learner (MBL) was also proposed by the authors of CPL. Meta-Bootstrap learner couples the training of multiple extraction techniques with a multi-view constraint, which requires the extractors to agree. It makes addition of coupling constraints on top of existing extraction algorithms, while treating them as black boxes, feasible. MBL assumes that the errors made by different extraction techniques are independent. Following is a quick summary of MBL. Input: An ontology O, a set of extractors ε Output: Trusted instances for each predicate for i=1,2,...,∞ do foreach predicate p in O do foreach extractor e in ε do Extract new candidates for p using e with recently promoted instances; end FILTER candidates that violate mutual-exclusion or type-checking constraints; PROMOTE candidates that were extracted by all extractors; end end Subordinate algorithms used with MBL do not promote any instance on their own, they report the evidence about each candidate to MBL and MBL is responsible for promoting instances. == Applications == In their paper authors have presented results showing the potential of CPL to contribute new facts to existing repository of semantic knowledge, Freebase

    Read more →
  • Proximal policy optimization

    Proximal policy optimization

    Proximal policy optimization (PPO) is a reinforcement learning (RL) algorithm for training an intelligent agent. Specifically, it is a policy gradient method, often used for deep RL when the policy network is very large. == History == The predecessor to PPO, Trust Region Policy Optimization (TRPO), was published in 2015. It addressed the instability issue of another algorithm, the Deep Q-Network (DQN), by using the trust region method to limit the KL divergence between the old and new policies. However, TRPO uses the Hessian matrix (a matrix of second derivatives) to enforce the trust region, but the Hessian is inefficient for large-scale problems. PPO was published in 2017. It was essentially an approximation of TRPO that does not require computing the Hessian. The KL divergence constraint was approximated by simply clipping the policy gradient. Since 2018, PPO was the default RL algorithm at OpenAI. PPO has been applied to many areas, such as controlling a robotic arm, beating professional players at Dota 2 (OpenAI Five), and playing Atari games. == TRPO == TRPO, the predecessor of PPO, is an on-policy algorithm. It can be used for environments with either discrete or continuous action spaces. The pseudocode is as follows: Input: initial policy parameters θ 0 {\textstyle \theta _{0}} , initial value function parameters ϕ 0 {\textstyle \phi _{0}} Hyperparameters: KL-divergence limit δ {\textstyle \delta } , backtracking coefficient α {\textstyle \alpha } , maximum number of backtracking steps K {\textstyle K} for k = 0 , 1 , 2 , … {\textstyle k=0,1,2,\ldots } do Collect set of trajectories D k = { τ i } {\textstyle {\mathcal {D}}_{k}=\left\{\tau _{i}\right\}} by running policy π k = π ( θ k ) {\textstyle \pi _{k}=\pi \left(\theta _{k}\right)} in the environment. Compute rewards-to-go R ^ t {\textstyle {\hat {R}}_{t}} . Compute advantage estimates, A ^ t {\textstyle {\hat {A}}_{t}} (using any method of advantage estimation) based on the current value function V ϕ k {\textstyle V_{\phi _{k}}} . Estimate policy gradient as g ^ k = 1 | D k | ∑ τ ∈ D k ∑ t = 0 T ∇ θ log ⁡ π θ ( a t ∣ s t ) | θ k A ^ t {\displaystyle {\hat {g}}_{k}=\left.{\frac {1}{\left|{\mathcal {D}}_{k}\right|}}\sum _{\tau \in {\mathcal {D}}_{k}}\sum _{t=0}^{T}\nabla _{\theta }\log \pi _{\theta }\left(a_{t}\mid s_{t}\right)\right|_{\theta _{k}}{\hat {A}}_{t}} Use the conjugate gradient algorithm to compute x ^ k ≈ H ^ k − 1 g ^ k {\displaystyle {\hat {x}}_{k}\approx {\hat {H}}_{k}^{-1}{\hat {g}}_{k}} where H ^ k {\textstyle {\hat {H}}_{k}} is the Hessian of the sample average KL-divergence. Update the policy by backtracking line search with θ k + 1 = θ k + α j 2 δ x ^ k T H ^ k x ^ k x ^ k {\displaystyle \theta _{k+1}=\theta _{k}+\alpha ^{j}{\sqrt {\frac {2\delta }{{\hat {x}}_{k}^{T}{\hat {H}}_{k}{\hat {x}}_{k}}}}{\hat {x}}_{k}} where j ∈ { 0 , 1 , 2 , … K } {\textstyle j\in \{0,1,2,\ldots K\}} is the smallest value which improves the sample loss and satisfies the sample KL-divergence constraint. Fit value function by regression on mean-squared error: ϕ k + 1 = arg ⁡ min ϕ 1 | D k | T ∑ τ ∈ D k ∑ t = 0 T ( V ϕ ( s t ) − R ^ t ) 2 {\displaystyle \phi _{k+1}=\arg \min _{\phi }{\frac {1}{\left|{\mathcal {D}}_{k}\right|T}}\sum _{\tau \in {\mathcal {D}}_{k}}\sum _{t=0}^{T}\left(V_{\phi }\left(s_{t}\right)-{\hat {R}}_{t}\right)^{2}} typically via some gradient descent algorithm. == PPO == The pseudocode is as follows: Input: initial policy parameters θ 0 {\textstyle \theta _{0}} , initial value function parameters ϕ 0 {\textstyle \phi _{0}} for k = 0 , 1 , 2 , … {\textstyle k=0,1,2,\ldots } do Collect set of trajectories D k = { τ i } {\textstyle {\mathcal {D}}_{k}=\left\{\tau _{i}\right\}} by running policy π k = π ( θ k ) {\textstyle \pi _{k}=\pi \left(\theta _{k}\right)} in the environment. Compute rewards-to-go R ^ t {\textstyle {\hat {R}}_{t}} . Compute advantage estimates, A ^ t {\textstyle {\hat {A}}_{t}} (using any method of advantage estimation) based on the current value function V ϕ k {\textstyle V_{\phi _{k}}} . Update the policy by maximizing the PPO-Clip objective: θ k + 1 = arg ⁡ max θ 1 | D k | T ∑ τ ∈ D k ∑ t = 0 T min ( π θ ( a t ∣ s t ) π θ k ( a t ∣ s t ) A π θ k ( s t , a t ) , g ( ϵ , A π θ k ( s t , a t ) ) ) {\displaystyle \theta _{k+1}=\arg \max _{\theta }{\frac {1}{\left|{\mathcal {D}}_{k}\right|T}}\sum _{\tau \in {\mathcal {D}}_{k}}\sum _{t=0}^{T}\min \left({\frac {\pi _{\theta }\left(a_{t}\mid s_{t}\right)}{\pi _{\theta _{k}}\left(a_{t}\mid s_{t}\right)}}A^{\pi _{\theta _{k}}}\left(s_{t},a_{t}\right),\quad g\left(\epsilon ,A^{\pi _{\theta _{k}}}\left(s_{t},a_{t}\right)\right)\right)} typically via stochastic gradient ascent with Adam. Fit value function by regression on mean-squared error: ϕ k + 1 = arg ⁡ min ϕ 1 | D k | T ∑ τ ∈ D k ∑ t = 0 T ( V ϕ ( s t ) − R ^ t ) 2 {\displaystyle \phi _{k+1}=\arg \min _{\phi }{\frac {1}{\left|{\mathcal {D}}_{k}\right|T}}\sum _{\tau \in {\mathcal {D}}_{k}}\sum _{t=0}^{T}\left(V_{\phi }\left(s_{t}\right)-{\hat {R}}_{t}\right)^{2}} typically via some gradient descent algorithm. Like all policy gradient methods, PPO is used for training an RL agent whose actions are determined by a differentiable policy function by gradient ascent. Intuitively, a policy gradient method takes small policy update steps, so the agent can reach higher and higher rewards in expectation. Policy gradient methods may be unstable: A step size that is too big may direct the policy in a suboptimal direction, thus having little possibility of recovery; a step size that is too small lowers the overall efficiency. To solve the instability, PPO implements a clip function that constrains the policy update of an agent from being too large, so that larger step sizes may be used without negatively affecting the gradient ascent process. === Basic concepts === To begin the PPO training process, the agent is set in an environment to perform actions based on its current input. In the early phase of training, the agent can freely explore solutions and keep track of the result. Later, with a certain amount of transition samples and policy updates, the agent will select an action to take by randomly sampling from the probability distribution P ( A | S ) {\displaystyle P(A|S)} generated by the policy network. The actions that are most likely to be beneficial will have the highest probability of being selected from the random sample. After an agent arrives at a different scenario (a new state) by acting, it is rewarded with a positive reward or a negative reward. The objective of an agent is to maximize the cumulative reward signal across sequences of states, known as episodes. === Policy gradient laws: the advantage function === The advantage function (denoted as A {\displaystyle A} ) is central to PPO, as it tries to answer the question of whether a specific action of the agent is better or worse than some other possible action in a given state. By definition, the advantage function is an estimate of the relative value for a selected action. If the output of this function is positive, it means that the action in question is better than the average return, so the possibilities of selecting that specific action will increase. The opposite is true for a negative advantage output. The advantage function can be defined as A = Q − V {\displaystyle A=Q-V} , where Q {\displaystyle Q} is the discounted sum of rewards (the total weighted reward for the completion of an episode) and V {\displaystyle V} is the baseline estimate. Since the advantage function is calculated after the completion of an episode, the program records the outcome of the episode. Therefore, calculating advantage is essentially an unsupervised learning problem. The baseline estimate comes from the value function that outputs the expected discounted sum of an episode starting from the current state. In the PPO algorithm, the baseline estimate will be noisy (with some variance), as it also uses a neural network, like the policy function itself. With Q {\displaystyle Q} and V {\displaystyle V} computed, the advantage function is calculated by subtracting the baseline estimate from the actual discounted return. If A > 0 {\displaystyle A>0} , the actual return of the action is better than the expected return from experience; if A < 0 {\displaystyle A<0} , the actual return is worse. === Ratio function === In PPO, the ratio function ( r t {\displaystyle r_{t}} ) calculates the probability of selecting action a {\displaystyle a} in state s {\displaystyle s} given the current policy network, divided by the previous probability under the old policy. In other words: If r t ( θ ) > 1 {\displaystyle r_{t}(\theta )>1} , where θ {\displaystyle \theta } are the policy network parameters, then selecting action a {\displaystyle a} in state s {\displaystyle s} is more likely based on the current policy than the previous policy. If 0 ≤ r t ( θ ) < 1 {\displaystyle 0\leq r_{t}(\theta )<1} , then selecting actio

    Read more →
  • Triplet loss

    Triplet loss

    Triplet loss is a machine learning loss function widely used in one-shot learning, a setting where models are trained to generalize effectively from limited examples. It was conceived by Google researchers for their prominent FaceNet algorithm for face detection. Triplet loss is designed to support metric learning. Namely, to assist training models to learn an embedding (mapping to a feature space) where similar data points are closer together and dissimilar ones are farther apart, enabling robust discrimination across varied conditions. In the context of face detection, data points correspond to images. == Definition == The loss function is defined using triplets of training points of the form ( A , P , N ) {\displaystyle (A,P,N)} . In each triplet, A {\displaystyle A} (called an "anchor point") denotes a reference point of a particular identity, P {\displaystyle P} (called a "positive point") denotes another point of the same identity in point A {\displaystyle A} , and N {\displaystyle N} (called a "negative point") denotes a point of an identity different from the identity in point A {\displaystyle A} and P {\displaystyle P} . Let x {\displaystyle x} be some point and let f ( x ) {\displaystyle f(x)} be the embedding of x {\displaystyle x} in the finite-dimensional Euclidean space. It shall be assumed that the L2-norm of f ( x ) {\displaystyle f(x)} is unity (the L2 norm of a vector X {\displaystyle X} in a finite dimensional Euclidean space is denoted by ‖ X ‖ {\displaystyle \Vert X\Vert } .) We assemble m {\displaystyle m} triplets of points from the training dataset. The goal of training here is to ensure that, after learning, the following condition (called the "triplet constraint") is satisfied by all triplets ( A ( i ) , P ( i ) , N ( i ) ) {\displaystyle (A^{(i)},P^{(i)},N^{(i)})} in the training data set: ‖ f ( A ( i ) ) − f ( P ( i ) ) ‖ 2 2 + α < ‖ f ( A ( i ) ) − f ( N ( i ) ) ‖ 2 2 {\displaystyle \Vert f(A^{(i)})-f(P^{(i)})\Vert _{2}^{2}+\alpha <\Vert f(A^{(i)})-f(N^{(i)})\Vert _{2}^{2}} The variable α {\displaystyle \alpha } is a hyperparameter called the margin, and its value must be set manually. In the FaceNet system, its value was set as 0.2. Thus, the full form of the function to be minimized is the following: L = ∑ i = 1 m max ( ‖ f ( A ( i ) ) − f ( P ( i ) ) ‖ 2 2 − ‖ f ( A ( i ) ) − f ( N ( i ) ) ‖ 2 2 + α , 0 ) {\displaystyle L=\sum _{i=1}^{m}\max {\Big (}\Vert f(A^{(i)})-f(P^{(i)})\Vert _{2}^{2}-\Vert f(A^{(i)})-f(N^{(i)})\Vert _{2}^{2}+\alpha ,0{\Big )}} == Intuition == A baseline for understanding the effectiveness of triplet loss is the contrastive loss, which operates on pairs of samples (rather than triplets). Training with the contrastive loss pulls embeddings of similar pairs closer together, and pushes dissimilar pairs apart. Its pairwise approach is greedy, as it considers each pair in isolation. Triplet loss innovates by considering relative distances. Its goal is that the embedding of an anchor (query) point be closer to positive points than to negative points (also accounting for the margin). It does not try to further optimize the distances once this requirement is met. This is approximated by simultaneously considering two pairs (anchor-positive and anchor-negative), rather than each pair in isolation. == Triplet "mining" == One crucial implementation detail when training with triplet loss is triplet "mining", which focuses on the smart selection of triplets for optimization. This process adds an additional layer of complexity compared to contrastive loss. A naive approach to preparing training data for the triplet loss involves randomly selecting triplets from the dataset. In general, the set of valid triplets of the form ( A ( i ) , P ( i ) , N ( i ) ) {\displaystyle (A^{(i)},P^{(i)},N^{(i)})} is very large. To speed-up training convergence, it is essential to focus on challenging triplets. In the FaceNet paper, several options were explored, eventually arriving at the following. For each anchor-positive pair, the algorithm considers only semi-hard negatives. These are negatives that violate the triplet requirement (i.e, are "hard"), but lie farther from the anchor than the positive (not too hard). Restated, for each A ( i ) {\displaystyle A^{(i)}} and P ( i ) {\displaystyle P^{(i)}} , they seek N ( i ) {\displaystyle N^{(i)}} such that: The rationale for this design choice is heuristic. It may appear puzzling that the mining process neglects "very hard" negatives (i.e., closer to the anchor than the positive). Experiments conducted by the FaceNet designers found that this often leads to a convergence to degenerate local minima. Triplet mining is performed at each training step, from within the sample points contained in the training batch (this is known as online mining), after embeddings were computed for all points in the batch. While ideally the entire dataset could be used, this is impractical in general. To support a large search space for triplets, the FaceNet authors used very large batches (1800 samples). Batches are constructed by selecting a large number of same-category sample points (40), and randomly selected negatives for them. == Extensions == Triplet loss has been extended to simultaneously maintain a series of distance orders by optimizing a continuous relevance degree with a chain (i.e., ladder) of distance inequalities. This leads to the Ladder Loss, which has been demonstrated to offer performance enhancements of visual-semantic embedding in learning to rank tasks. In Natural Language Processing, triplet loss is one of the loss functions considered for BERT fine-tuning in the SBERT architecture. Other extensions involve specifying multiple negatives (multiple negatives ranking loss).

    Read more →
  • Synaptic transistor

    Synaptic transistor

    A synaptic transistor is an electrical device that can learn in ways similar to a neural synapse. It optimizes its own properties for the functions it has carried out in the past. The device mimics the behavior of the property of neurons called spike-timing-dependent plasticity, or STDP. == Structure == Its structure is similar to that of a field effect transistor, where an ionic liquid takes the place of the gate insulating layer between the gate electrode and the conducting channel. That channel is composed of samarium nickelate (SmNiO3, or SNO) rather than the field effect transistor's doped silicon. == Function == A synaptic transistor has a traditional immediate response whose amount of current that passes between the source and drain contacts varies with voltage applied to the gate electrode. It also produces a much slower learned response such that the conductivity of the SNO layer varies in response to the transistor's STDP history, essentially by shuttling oxygen ions between the SNO and the ionic liquid. The analog of strengthening a synapse is to increase the SNO's conductivity, which essentially increases gain. Similarly, weakening a synapse is analogous to decreasing the SNO's conductivity, lowering the gain. The input and output of the synaptic transistor are continuous analog values, rather than digital on-off signals. While the physical structure of the device has the potential to learn from history, it contains no way to bias the transistor to control the memory effect. An external supervisory circuit converts the time delay between input and output into a voltage applied to the ionic liquid that either drives ions into the SNO or removes them. A network of such devices can learn particular responses to "sensory inputs", with those responses being learned through experience rather than explicitly programmed.

    Read more →
  • Mobile DevOps

    Mobile DevOps

    Mobile DevOps is a set of practices that applies the principles of DevOps specifically to the development of mobile applications. Traditional DevOps focuses on streamlining the software development process in general, but mobile development has its own unique challenges that require a tailored approach. Mobile DevOps is not simply as a branch of DevOps specific to mobile app development, instead an extension and reinterpretation of the DevOps philosophy due to very specific requirements of the mobile world. == Rationale == Traditional DevOps approach has been formed around 2007-2008, close to the dates when iOS and Android mobile operating systems were released to the public. The traditional DevOps approach primarily evolved to meet the changing needs of the software development world with the paradigm shift towards continuous and rapid development and deployment (such as in web development, where interpreted languages are more prevalent than compiled languages). While traditional DevOps embraced agility and flexibility, mobile operating system providers steered towards a walled-garden approach with compiled apps with tight controls over how they can be distributed and installed on a mobile device. This difference in the mobile development mindset compared to what the traditional DevOps approach is advocating, is augmented further with the mobile applications to be deployed on a high number of varying devices and operating systems. Eventually, the concept of Mobile DevOps took off as a trend around 2014-2015, in line with the fast growth of the number of applications in mobile app stores. As individuals and corporations alike are developing and publishing more and more mobile applications, the need for efficiency and shorter release cycles increased, which is addressed by the continuous feedback and continuous development approach within the concept of DevOps, while requiring a significant level of adaptation and extension of the traditional DevOps practices. == Mindset shift from traditional DevOps to mobile DevOps == Mobile DevOps has a unique set of challenges and constraints, which solidifies the fact that it needs to be approached as a separate discipline. These challenges can be outlined as follows: Platform-specific requirements and tight controls of mobile operating system providers, where for instance a macOS device is mandatory for iOS application development and release. The walled-garden approach of distributing mobile apps, specifically applying to iOS applications, which comes with app review and app release delays that would not be needed in web development, for instance. Code signing requirements that come with the walled-garden approach, which introduce additional processes in the mobile application build pipeline along with new security concerns. An entire deployment cycle is re-run even in the slightest code change due to how applications are compiled and delivered to the users. The final product is to be deployed to a wide variety of mobile devices worldwide, which requires extensive testing and user feedback. Monitoring mobile applications require additional tools and approaches to be able to get data from an application running on a mobile device while respecting user privacy. Frequent operating system updates by mobile platforms can require rapid adaptation of apps, introducing further complexity to the development and maintenance cycles. == Benefits of mobile DevOps == Mobile DevOps is not an abstract concept and offers a range of benefits that can help improve the efficiency and effectiveness of the mobile app development process. These benefits can even be quantified by collecting the data within the mobile application development lifecycle. The benefits can be categorized into the following areas: Faster Release Cycles: By automating tasks and streamlining the development process, mobile DevOps enables teams to deliver new features and updates more frequently. Improved Quality: Automated testing and continuous monitoring help to identify and fix bugs earlier in the development cycle, leading to higher quality apps. Optimized Resource Utilization: Mobile DevOps promotes optimized resource utilization by automating tasks and streamlining workflows. Furthermore, mobile DevOps practices like containerization can help to create more efficient and scalable development environments. Increased Agility: Mobile DevOps allows teams to be more responsive to changes in the market and user feedback. == List of Dedicated Mobile DevOps Platforms == Even though it is possible to run a mobile DevOps cycle with most of the CI/CD platforms, they may require significant effort compared to non-mobile CI/CD (e.g. you need to bring your own infrastructure or it may require "reinventing the wheel" for commonly used platforms like Jenkins). To overcome the mobile-specific challenges specified, there are certain platforms that are dedicated to the lifecycle of mobile applications. These platforms exclusively focus on DevOps processes for mobile app development and are also referred as mobile CI/CD platforms. Appcircle (Multiplatform | Cloud-based & On-premise) Visual Studio App Center (Multiplatform | Cloud-based) Xcode Cloud (Apple platforms only | Cloud-based)

    Read more →
  • Sliced inverse regression

    Sliced inverse regression

    Sliced inverse regression (SIR) is a tool for dimensionality reduction in the field of multivariate statistics. In statistics, regression analysis is a method of studying the relationship between a response variable y and its input variable x _ {\displaystyle {\underline {x}}} , which is a p-dimensional vector. There are several approaches in the category of regression. For example, parametric methods include multiple linear regression, and non-parametric methods include local smoothing. As the number of observations needed to use local smoothing methods scales exponentially with high-dimensional data (as p grows), reducing the number of dimensions can make the operation computable. Dimensionality reduction aims to achieve this by showing only the most important dimension of the data. SIR uses the inverse regression curve, E ( x _ | y ) {\displaystyle E({\underline {x}}\,|\,y)} , to perform a weighted principal component analysis. == Model == Given a response variable Y {\displaystyle \,Y} and a (random) vector X ∈ R p {\displaystyle X\in \mathbb {R} ^{p}} of explanatory variables, SIR is based on the model Y = f ( β 1 ⊤ X , … , β k ⊤ X , ε ) ( 1 ) {\displaystyle Y=f(\beta _{1}^{\top }X,\ldots ,\beta _{k}^{\top }X,\varepsilon )\quad \quad \quad \quad \quad (1)} where β 1 , … , β k {\displaystyle \beta _{1},\ldots ,\beta _{k}} are unknown projection vectors, k {\displaystyle \,k} is an unknown number smaller than p {\displaystyle \,p} , f {\displaystyle \;f} is an unknown function on R k + 1 {\displaystyle \mathbb {R} ^{k+1}} as it only depends on k {\displaystyle \,k} arguments, and ε {\displaystyle \varepsilon } is a random variable representing error with E [ ε | X ] = 0 {\displaystyle E[\varepsilon |X]=0} and a finite variance of σ 2 {\displaystyle \sigma ^{2}} . The model describes an ideal solution, where Y {\displaystyle \,Y} depends on X ∈ R p {\displaystyle X\in \mathbb {R} ^{p}} only through a k {\displaystyle \,k} dimensional subspace; i.e., one can reduce the dimension of the explanatory variables from p {\displaystyle \,p} to a smaller number k {\displaystyle \,k} without losing any information. An equivalent version of ( 1 ) {\displaystyle \,(1)} is: the conditional distribution of Y {\displaystyle \,Y} given X {\displaystyle \,X} depends on X {\displaystyle \,X} only through the k {\displaystyle \,k} dimensional random vector ( β 1 ⊤ X , … , β k ⊤ X ) {\displaystyle (\beta _{1}^{\top }X,\ldots ,\beta _{k}^{\top }X)} . It is assumed that this reduced vector is as informative as the original X {\displaystyle \,X} in explaining Y {\displaystyle \,Y} . The unknown β i ′ s {\displaystyle \,\beta _{i}'s} are called the effective dimension reducing directions (EDR-directions). The space that is spanned by these vectors is denoted by the effective dimension reducing space (EDR-space). == Relevant linear algebra background == Given a _ 1 , … , a _ r ∈ R n {\displaystyle {\underline {a}}_{1},\ldots ,{\underline {a}}_{r}\in \mathbb {R} ^{n}} , then V := L ( a _ 1 , … , a _ r ) {\displaystyle V:=L({\underline {a}}_{1},\ldots ,{\underline {a}}_{r})} , the set of all linear combinations of these vectors is called a linear subspace and is therefore a vector space. The equation says that vectors a _ 1 , … , a _ r {\displaystyle {\underline {a}}_{1},\ldots ,{\underline {a}}_{r}} span V {\displaystyle \,V} , but the vectors that span space V {\displaystyle \,V} are not unique. The dimension of V ( ∈ R n ) {\displaystyle \,V(\in \mathbb {R} ^{n})} is equal to the maximum number of linearly independent vectors in V {\displaystyle \,V} . A set of n {\displaystyle \,n} linear independent vectors of R n {\displaystyle \mathbb {R} ^{n}} makes up a basis of R n {\displaystyle \mathbb {R} ^{n}} . The dimension of a vector space is unique, but the basis itself is not. Several bases can span the same space. Dependent vectors can still span a space, but the linear combinations of the latter are only suitable to a set of vectors lying on a straight line. == Inverse regression == Computing the inverse regression curve (IR) means instead of looking for E [ Y | X = x ] {\displaystyle \,E[Y|X=x]} , which is a curve in R p {\displaystyle \mathbb {R} ^{p}} it is actually E [ X | Y = y ] {\displaystyle \,E[X|Y=y]} , which is also a curve in R p {\displaystyle \mathbb {R} ^{p}} , but consisting of p {\displaystyle \,p} one-dimensional regressions. The center of the inverse regression curve is located at E [ E [ X | Y ] ] = E [ X ] {\displaystyle \,E[E[X|Y]]=E[X]} . Therefore, the centered inverse regression curve is E [ X | Y = y ] − E [ X ] {\displaystyle \,E[X|Y=y]-E[X]} which is a p {\displaystyle \,p} dimensional curve in R p {\displaystyle \mathbb {R} ^{p}} . == Inverse regression versus dimension reduction == The centered inverse regression curve lies on a k {\displaystyle \,k} -dimensional subspace spanned by Σ x x β i ′ s {\displaystyle \,\Sigma _{xx}\beta _{i}\,'s} . This is a connection between the model and inverse regression. Given this condition and ( 1 ) {\displaystyle \,(1)} , the centered inverse regression curve E [ X | Y = y ] − E [ X ] {\displaystyle \,E[X|Y=y]-E[X]} is contained in the linear subspace spanned by Σ x x β k ( k = 1 , … , K ) {\displaystyle \,\Sigma _{xx}\beta _{k}(k=1,\ldots ,K)} , where Σ x x = C o v ( X ) {\displaystyle \,\Sigma _{xx}=Cov(X)} . == Estimation of the EDR-directions == After having had a look at all the theoretical properties, the aim now is to estimate the EDR-directions. For that purpose, weighted principal component analyses are needed. If the sample means m ^ h ′ s {\displaystyle \,{\hat {m}}_{h}\,'s} , X {\displaystyle \,X} would have been standardized to Z = Σ x x − 1 / 2 { X − E ( X ) } {\displaystyle \,Z=\Sigma _{xx}^{-1/2}\{X-E(X)\}} . Corresponding to the theorem above, the IR-curve m 1 ( y ) = E [ Z | Y = y ] {\displaystyle \,m_{1}(y)=E[Z|Y=y]} lies in the space spanned by ( η 1 , … , η k ) {\displaystyle \,(\eta _{1},\ldots ,\eta _{k})} , where η i = Σ x x 1 / 2 β i {\displaystyle \,\eta _{i}=\Sigma _{xx}^{1/2}\beta _{i}} . As a consequence, the covariance matrix c o v [ E [ Z | Y ] ] {\displaystyle \,cov[E[Z|Y]]} is degenerate in any direction orthogonal to the η i ′ s {\displaystyle \,\eta _{i}\,'s} . Therefore, the eigenvectors η k ( k = 1 , … , K ) {\displaystyle \,\eta _{k}(k=1,\ldots ,K)} associated with the largest K {\displaystyle \,K} eigenvalues are the standardized EDR-directions. == Algorithm == === SIR algorithm === The algorithm from Li, K-C. (1991) to estimate the EDR-directions via SIR is as follows. 1. Let Σ x x {\displaystyle \,\Sigma _{xx}} be the covariance matrix of X {\displaystyle \,X} . Standardize X {\displaystyle \,X} to Z = Σ x x − 1 / 2 { X − E ( X ) } {\displaystyle \,Z=\Sigma _{xx}^{-1/2}\{X-E(X)\}} ( 1 ) {\displaystyle \,(1)} can also be rewritten as Y = f ( η 1 ⊤ Z , … , η k ⊤ Z , ε ) {\displaystyle Y=f(\eta _{1}^{\top }Z,\ldots ,\eta _{k}^{\top }Z,\varepsilon )} where η k = β k Σ x x 1 / 2 ∀ k {\displaystyle \,\eta _{k}=\beta _{k}\Sigma _{xx}^{1/2}\quad \forall \;k} .) 2. Divide the range of y i {\displaystyle \,y_{i}} into S {\displaystyle \,S} non-overlapping slices H s ( s = 1 , … , S ) . n s {\displaystyle \,H_{s}(s=1,\ldots ,S).\;n_{s}} is the number of observations within each slice and I H s {\displaystyle \,I_{H_{s}}} is the indicator function for the slice: n s = ∑ i = 1 n I H s ( y i ) {\displaystyle n_{s}=\sum _{i=1}^{n}I_{H_{s}}(y_{i})} 3. Compute the mean of z i {\displaystyle \,z_{i}} over all slices, which is a crude estimate m ^ 1 {\displaystyle \,{\hat {m}}_{1}} of the inverse regression curve m 1 {\displaystyle \,m_{1}} : z ¯ s = n s − 1 ∑ i = 1 n z i I H s ( y i ) {\displaystyle \,{\bar {z}}_{s}=n_{s}^{-1}\sum _{i=1}^{n}z_{i}I_{H_{s}}(y_{i})} 4. Calculate the estimate for C o v { m 1 ( y ) } {\displaystyle \,Cov\{m_{1}(y)\}} : V ^ = n − 1 ∑ i = 1 S n s z ¯ s z ¯ s ⊤ {\displaystyle \,{\hat {V}}=n^{-1}\sum _{i=1}^{S}n_{s}{\bar {z}}_{s}{\bar {z}}_{s}^{\top }} 5. Identify the eigenvalues λ ^ i {\displaystyle \,{\hat {\lambda }}_{i}} and the eigenvectors η ^ i {\displaystyle \,{\hat {\eta }}_{i}} of V ^ {\displaystyle \,{\hat {V}}} , which are the standardized EDR-directions. 6. Transform the standardized EDR-directions back to the original scale. The estimates for the EDR-directions are given by: β ^ i = Σ ^ x x − 1 / 2 η ^ i {\displaystyle \,{\hat {\beta }}_{i}={\hat {\Sigma }}_{xx}^{-1/2}{\hat {\eta }}_{i}} (which are not necessarily orthogonal.)

    Read more →
  • Junction tree algorithm

    Junction tree algorithm

    The junction tree algorithm (also known as 'Clique Tree') is a method used in machine learning to extract marginalization in general graphs. In essence, it entails performing belief propagation on a modified graph called a junction tree. The graph is called a tree because it branches into different sections of data; nodes of variables are the branches. The basic premise is to eliminate cycles by clustering them into single nodes. Multiple extensive classes of queries can be compiled at the same time into larger structures of data. There are different algorithms to meet specific needs and for what needs to be calculated. Inference algorithms gather new developments in the data and calculate it based on the new information provided. == Junction tree algorithm == === Hugin algorithm === If the graph is directed then moralize it to make it un-directed. Introduce the evidence. Triangulate the graph to make it chordal. Construct a junction tree from the triangulated graph (we will call the vertices of the junction tree "supernodes"). Propagate the probabilities along the junction tree (via belief propagation) Note that this last step is inefficient for graphs of large treewidth. Computing the messages to pass between supernodes involves doing exact marginalization over the variables in both supernodes. Performing this algorithm for a graph with treewidth k will thus have at least one computation which takes time exponential in k. It is a message passing algorithm. The Hugin algorithm takes fewer computations to find a solution compared to Shafer-Shenoy. === Shafer-Shenoy algorithm === Computed recursively Multiple recursions of the Shafer-Shenoy algorithm results in Hugin algorithm Found by the message passing equation Separator potentials are not stored The Shafer-Shenoy algorithm is the sum product of a junction tree. It is used because it runs programs and queries more efficiently than the Hugin algorithm. The algorithm makes calculations for conditionals for belief functions possible. Joint distributions are needed to make local computations happen. === Underlying theory === The first step concerns only Bayesian networks, and is a procedure to turn a directed graph into an undirected one. We do this because it allows for the universal applicability of the algorithm, regardless of direction. The second step is setting variables to their observed value. This is usually needed when we want to calculate conditional probabilities, so we fix the value of the random variables we condition on. Those variables are also said to be clamped to their particular value. The third step is to ensure that graphs are made chordal if they aren't already chordal. This is the first essential step of the algorithm. It makes use of the following theorem: Theorem: For an undirected graph, G, the following properties are equivalent: Graph G is triangulated. The clique graph of G has a junction tree. There is an elimination ordering for G that does not lead to any added edges. Thus, by triangulating a graph, we make sure that the corresponding junction tree exists. A usual way to do this, is to decide an elimination order for its nodes, and then run the Variable elimination algorithm. The variable elimination algorithm states that the algorithm must be run each time there is a different query. This will result to adding more edges to the initial graph, in such a way that the output will be a chordal graph. All chordal graphs have a junction tree. The next step is to construct the junction tree. To do so, we use the graph from the previous step, and form its corresponding clique graph. Now the next theorem gives us a way to find a junction tree: Theorem: Given a triangulated graph, weight the edges of the clique graph by their cardinality, |A∩B|, of the intersection of the adjacent cliques A and B. Then any maximum-weight spanning tree of the clique graph is a junction tree. So, to construct a junction tree we just have to extract a maximum weight spanning tree out of the clique graph. This can be efficiently done by, for example, modifying Kruskal's algorithm. The last step is to apply belief propagation to the obtained junction tree. Usage: A junction tree graph is used to visualize the probabilities of the problem. The tree can become a binary tree to form the actual building of the tree. A specific use could be found in auto encoders, which combine the graph and a passing network on a large scale automatically. === Inference Algorithms === Loopy belief propagation: A different method of interpreting complex graphs. The loopy belief propagation is used when an approximate solution is needed instead of the exact solution. It is an approximate inference. Cutset conditioning: Used with smaller sets of variables. Cutset conditioning allows for simpler graphs that are easier to read but are not exact.

    Read more →